Multi-Head Egress is a mechanism that allows sharded data to be retrieved directly from cluster nodes, bypassing the overhead of funneling the data from the cluster nodes through the head node. This greatly increases the speed of retrieval by parallelizing the output of data and spreading the network traffic across multiple nodes.
Operationally, the egress mechanism is a lookup of a given shard key value in a given table or view, filtering out any records that don't match an optional expression.
Language | Multi-head Egress Mechanism |
---|---|
C++ | X |
C# | X |
Java | RecordRetriever |
Javascript | X |
Node.js | X |
Python | RecordRetriever |
REST | X |
SQL | X |
In order for the cluster nodes to send data directly to an egress client, the configuration on each node needs to be updated to allow the incoming HTTP requests for that data. The /opt/gpudb/core/etc/gpudb.conf file needs to have the following property set for multi-head egress to work properly:
# Enable worker HTTP servers...
enable_worker_http_servers = true
Both the HA and HTTPD configurations will be taken into account, as well as any public URLs (which override all other settings) defined in /opt/gpudb/core/etc/gpudb.conf, when returning multi-head URLs for the worker nodes.

With the exception of the Python background multi-head process, the multi-head egress object requires a list of worker nodes to use to distribute the data requests, with one entry in the list for each node/process. This list can be auto-populated simply by using a GPUdb connection object, which can retrieve the list of available cluster nodes from the database itself. Below is a code snippet (in Java) showing an automatically populated worker list:
GPUdb gpudb = new GPUdb("http://localhost:9191");
WorkerList workers = new WorkerList(gpudb);
Note that in some cases, workers may be configured to use more than one IP address, not all of which may be accessible to the client; the worker list constructor uses the first IP returned by the server for each worker. In cases where workers may use more than one IP address and public URLs are not configured, a regular expression Pattern or prefix String can be used to match the correct worker IP addresses:
// Match 172.X.Y.Z addresses
WorkerList workers = new WorkerList(gpudb, Pattern.compile("172\\..*"));
// or
WorkerList workers = new WorkerList(gpudb, "172.");
There are several factors to consider when using multi-head egress:

- The shard key columns and any columns referenced in the expression must all be indexed. There is an implicit index for a primary key, so an explicit column index is not required when the primary key is the shard key, or when the primary key is a superset of the shard key and the expression references all of the primary key columns that are not part of the shard key.

All of the functionality for multi-head egress is encapsulated in the RecordRetriever object. See API Support for a chart listing the supported APIs.
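For reference, below is a minimal Java sketch tying the pieces together. It assumes a stocks table sharded on an indexed Symbol column (matching the Python example that follows) and uses the RecordRetriever constructor and getByKey method as they appear in recent versions of the Java API; treat the exact signatures as illustrative:

import java.util.Arrays;

import com.gpudb.GPUdb;
import com.gpudb.GenericRecord;
import com.gpudb.RecordRetriever;
import com.gpudb.Type;
import com.gpudb.WorkerList;
import com.gpudb.protocol.GetRecordsResponse;

GPUdb gpudb = new GPUdb("http://localhost:9191");

// Auto-populate the worker list from the database
WorkerList workers = new WorkerList(gpudb);

// Derive the record type of the (assumed) stocks table
Type type = Type.fromTable(gpudb, "stocks");

// Route key lookups directly to the worker nodes
RecordRetriever<GenericRecord> retriever =
        new RecordRetriever<>(gpudb, "stocks", type, workers);

// Look up all records whose shard key (Symbol) is 'SPGI';
// the second argument is the optional filter expression (none here)
GetRecordsResponse<GenericRecord> response =
        retriever.getByKey(Arrays.<Object>asList("SPGI"), null);

for (GenericRecord record : response.getData())
    System.out.println(record);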
The following is a Python API code block that demonstrates the use of the background RecordRetriever for retrieving data.
import gpudb

h_db = gpudb.GPUdb(encoding = 'BINARY', host = '127.0.0.1', port = '9191')
table = gpudb.GPUdbTable(None, 'stocks', db = h_db, use_multihead_io = True)

# Ensure a column index exists on the sharded column
if 'Symbol' not in table.show_table()['additional_info'][0]['attribute_indexes']:
    table.alter_table(action = 'create_index', value = 'Symbol')

# Get stock records for S&P
table.get_records_by_key(key_values = {'Symbol':'SPGI'})
In the Python API, the GPUdbTable object has a built-in use_multihead_io parameter, which allows the GPUdbTable to handle all RecordRetriever interactions with the associated table in the background:
egress_table = gpudb.GPUdbTable(_type=egress_columns, name="egress_table", db=h_db, use_multihead_io=True)
If the egress client application depends on a conflicting version of Apache Avro, the API's Avro classes can be relocated at build time using the Maven Shade Plugin's relocation tag:
<pattern>org.apache.avro</pattern>
<shadedPattern>org.shaded.apache.avro</shadedPattern>
</relocation>
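For context, the relocation tag belongs inside the Maven Shade Plugin's configuration; below is a minimal sketch of the enclosing pom.xml fragment (the plugin version shown is illustrative):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.4.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.avro</pattern>
            <shadedPattern>org.shaded.apache.avro</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>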