Ingestion

Overview

Because Kinetica was designed to work with big data sets, it provides a mechanism for very fast data ingestion, called "multihead ingest". While most Kinetica operations go through the head node of the Kinetica cluster, the BulkInserter object lets you send data directly to multiple worker nodes and processes.

Architecture of an Ingest Process

To fully utilize multihead ingest, the ingestion process should use a multinode parallel processing framework, such as MapReduce, Storm, or Spark. As data is divided up and flows through the processing nodes, the BulkInserter object instantiated in each node process can push data to any available Kinetica worker node. This greatly increases the speed of ingestion by both parallelizing the ingestion process and spreading the network traffic across multiple nodes.
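The effect of dividing the data among processing nodes and spreading it across workers can be illustrated with a small simulation. This is only a sketch using the Java standard library, not the Kinetica API: the class MultiheadSketch, the method ingest, and the round-robin choice of worker are all hypothetical stand-ins, and real worker selection is handled inside BulkInserter.

```java
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicIntegerArray;

class MultiheadSketch {
    // Simulates processNodes parallel tasks, each sending its share of the
    // records to one of workerCount workers. Returns records received per worker.
    public static int[] ingest(int recordCount, int processNodes, int workerCount) {
        AtomicIntegerArray received = new AtomicIntegerArray(workerCount);
        ExecutorService pool = Executors.newFixedThreadPool(processNodes);
        for (int node = 0; node < processNodes; node++) {
            final int nodeId = node;
            pool.submit(() -> {
                // This node handles records nodeId, nodeId + processNodes, ...
                for (int r = nodeId; r < recordCount; r += processNodes) {
                    // Assumption: round-robin by record number stands in for
                    // whatever worker selection the real client performs.
                    received.incrementAndGet(r % workerCount);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("simulated ingest did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        int[] counts = new int[workerCount];
        for (int w = 0; w < workerCount; w++) {
            counts[w] = received.get(w);
        }
        return counts;
    }

    public static void main(String[] args) {
        // 1000 records, 4 processing nodes, 3 workers.
        System.out.println(Arrays.toString(ingest(1000, 4, 3))); // prints [334, 333, 333]
    }
}
```

No single worker receives all of the records, and no single processing node sends all of them, which is the property multihead ingest relies on.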

All of this functionality is encapsulated in the BulkInserter object. To use multihead ingest, create a WorkerList for your BulkInserter, with one worker for each node/process. The list can be auto-populated simply by passing a GPUdb object, as in the following snippet:

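// Auto-populate the worker list from the cluster that gpudb is connected to.
BulkInserter.WorkerList workers = new BulkInserter.WorkerList(gpudb);
// T is your record class; null is passed for the fifth argument.
BulkInserter<T> bulkInserter = new BulkInserter<T>(gpudb, tableName, type, bulkThreshold, null, workers);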

Cautions

  1. If your Kinetica servers have multiple IP addresses on each server (for example, internal and external IP addresses), use either the BulkInserter.WorkerList(GPUdb gpudb, Pattern ipRegex) or BulkInserter.WorkerList(GPUdb gpudb, String ipPrefix) constructor, specifying the IP addresses to use, rather than BulkInserter.WorkerList(gpudb). Otherwise, BulkInserter might try to communicate with Kinetica through IP addresses it cannot reach.

  2. We have found a conflict between the version of Avro used by Kinetica and the one used by MapReduce. This conflict can be overcome by using the Maven Shade plug-in with the relocation tag:

    <relocation>
      <pattern>org.apache.avro</pattern>
      <shadedPattern>org.shaded.apache.avro</shadedPattern>
    </relocation>
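For the first caution above, it can help to check which addresses a candidate ipRegex would select before passing it to the WorkerList constructor. The following is a minimal sketch using only java.util.regex. The class WorkerIpFilter, the method selectReachable, and the sample addresses are hypothetical, and the sketch assumes the pattern must match the whole address, which may differ from how the library applies it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class WorkerIpFilter {
    // Returns the addresses that fully match ipRegex, preserving input order.
    public static List<String> selectReachable(List<String> addresses, Pattern ipRegex) {
        List<String> selected = new ArrayList<>();
        for (String address : addresses) {
            if (ipRegex.matcher(address).matches()) {
                selected.add(address);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Hypothetical cluster: each server has an internal and an external IP.
        List<String> advertised = Arrays.asList(
                "10.0.0.11", "203.0.113.11", "10.0.0.12", "203.0.113.12");
        // Keep only the internal 10.0.0.x addresses.
        Pattern internalOnly = Pattern.compile("10\\.0\\.0\\..*");
        System.out.println(selectReachable(advertised, internalOnly)); // prints [10.0.0.11, 10.0.0.12]
        // The same pattern would then be passed to the constructor named above:
        //   new BulkInserter.WorkerList(gpudb, internalOnly);
    }
}
```

If the reachable addresses share a common leading portion, the String ipPrefix constructor may be the simpler choice.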