Multi-Head Ingest is a mechanism that allows sharded data to be ingested directly into cluster nodes, bypassing the overhead of pushing the data through the head node to the cluster nodes.
Operationally, the ingest mechanism calculates the target shard of each record to insert, and sends batches of would-be co-located records to their respective target nodes.
To fully utilize multi-head ingest, the ingestion process should use a multi-node parallel processing framework, such as MapReduce, Storm, or Spark. As data is divided up and flows through the processing nodes, a Kinetica bulk ingest object can be instantiated in each node to push data to the proper Kinetica cluster node. This greatly increases the speed of ingestion by both parallelizing the ingestion process and spreading the network traffic across multiple nodes.
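As an illustration of this pattern, below is a minimal sketch (in Java, using Spark's foreachPartition) of creating one bulk ingest object per processing task. It is not taken from the Kinetica examples; the records RDD, the MyType record class, and the kineticaUrl, tableName, and batchSize values are hypothetical placeholders assumed to be defined elsewhere, much as in the BulkInserter example later in this section:
// Each parallel task (here, a Spark partition) creates its own connection,
// worker list, and bulk inserter, so its share of the data is pushed directly
// from the processing node to the appropriate Kinetica cluster nodes
records.foreachPartition(partition ->
{
    GPUdb gpudb = new GPUdb(kineticaUrl);
    Type type = Type.fromTable(gpudb, tableName);
    BulkInserter<MyType> bulkInserter =
            new BulkInserter<MyType>(gpudb, tableName, type, batchSize, null, new WorkerList(gpudb));

    // Queue each record in this partition; batches are sent to the target
    // nodes automatically whenever the batchSize limit is reached
    while (partition.hasNext())
        bulkInserter.insert(partition.next());

    // Send any records still queued below the batch size threshold
    bulkInserter.flush();
});
Creating the connection and bulk inserter inside foreachPartition, rather than on the Spark driver, keeps those objects local to each task and avoids having to serialize them into the closure.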
| Language | Multi-head Ingest Mechanism |
|---|---|
| C++ | GPUdbIngestor |
| C# | KineticaIngestor |
| Java | BulkInserter |
| JavaScript | X |
| Node.js | X |
| Python | GPUdbIngestor |
| REST | X |
| SQL | X |
In order for the cluster nodes to receive data directly from an ingest client, the configuration on each node needs to be updated to allow the incoming HTTP requests that carry the data. The /opt/gpudb/core/etc/gpudb.conf file needs to have the following property set for multi-head ingest to work properly:
# Enable worker HTTP servers...
enable_worker_http_servers = true
Both the HA and HTTPD configurations will be taken into account, as well as any public URLs (which override all other settings) defined in /opt/gpudb/core/etc/gpudb.conf, when returning multi-head URLs for the worker nodes. With the exception of the Python background multi-head process, the multi-head ingest object requires a list of worker nodes across which to distribute the insert requests, with one entry in the list for each node/process. This list can be auto-populated simply by using a GPUdb connection object, which can retrieve the list of available cluster nodes from the database itself. Below is a code snippet (in Java) showing an automatically populated worker list:
// Connect to the database via the head node
GPUdb gpudb = new GPUdb("http://localhost:9191");

// Retrieve the list of available worker nodes from the database
WorkerList workers = new WorkerList(gpudb);
Note that in some cases, workers may be configured to use more than one IP address, not all of which may be accessible to the client; the worker list constructor uses the first IP returned by the server for each worker. In cases where workers may use more than one IP address and public URLs are not configured, a regular expression Pattern or prefix String can be used to match the correct worker IP addresses:
// Match 172.X.Y.Z addresses
WorkerList workers = new WorkerList(gpudb, Pattern.compile("172\\..*"));

// or match by prefix
WorkerList workers = new WorkerList(gpudb, "172.");
There are several factors to consider when using multi-head ingest:
All of the functionality for multi-head ingestion is encapsulated in the bulk ingest object. See the API Support table above for the API-specific object to use for bulk ingestion.
The following is a Java API code block that demonstrates the use of the BulkInserter for ingesting data.
// Create a bulk inserter for batch data ingestion
BulkInserter<MyType> bulkInserter = new BulkInserter<MyType>(gpudb, tableName, type, batchSize, null, new WorkerList(gpudb));

// Generate data to be inserted into the table, automatically inserting
// records whenever the batchSize limit is reached
for (int i = 0; i < numRecords; i++)
{
    MyType record = new MyType();
    record.put( 0, (i + 0.1) );                          // col1
    record.put( 1, ("string " + String.valueOf( i )) );  // col2
    record.put( 2, "Group 1" );                          // group_id

    bulkInserter.insert( record );
}

// To ensure all records are inserted, flush the bulk inserter object
bulkInserter.flush();
In the Python API, the GPUdbTable object has a built-in use_multihead_io parameter, which allows the GPUdbTable to handle all Ingestor interactions with the associated table in the background:
ingest_table = gpudb.GPUdbTable(_type=ingest_columns, name="ingest_table", db=h_db, use_multihead_io=True)
If the application's own Apache Avro dependency conflicts with the version used by the Kinetica API, the Avro packages can be shaded within the application's build using the Maven Shade Plugin relocation tag:
<relocation>
    <pattern>org.apache.avro</pattern>
    <shadedPattern>org.shaded.apache.avro</shadedPattern>
</relocation>