Kinetica ingests data via four interfaces:
- Kinetica Workbench UI: Upload files directly from your computer into the Kinetica File System (KiFS) using a drag-and-drop interface, then import the data using a simple wizard. Or, use SQL blocks in a workbook to import data using SQL.
- KiSQL: Upload files from a client machine into Kinetica's KiFS and then load them using SQL. KiSQL is a Java-based SQL console application that sends commands over JDBC.
- JDBC: Write custom ingestion applications with Kinetica's JDBC Driver, or provide connectivity to third-party applications like NiFi.
- Native API: Write custom ingestion applications with one of Kinetica's native APIs, like Java, Python, or C++.
Supported Data Sources
A data source is an object that references a location on a remote system and acts as a proxy for accessing external data. Once a data source is created, you can access the files stored within it.
Data sources may be created for the following providers:
- S3: Amazon S3 Bucket
- Azure: Microsoft Azure Blob Storage
- HDFS: Apache Hadoop Distributed File System
- Kafka: Apache Kafka streaming feed
Files stored in the Kinetica File System (KiFS) can be directly referenced in your LOAD INTO SQL statement and do not require a data source object to be created.
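For example, a file already uploaded to KiFS can be referenced directly by its KiFS path. The sketch below assumes placeholder table and directory names, and the exact FORMAT keyword should be verified against the LOAD INTO documentation for your version:

```sql
-- Load a CSV file that already resides in KiFS; no data source object is needed.
LOAD DATA INTO example_table
FROM FILE PATHS 'kifs://data/example.csv'
FORMAT TEXT;
```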
If the file is not stored in a compatible provider, then you will need to move it to a supported provider or location in order to ingest it.
Creating a Data Source
A credential is an authentication object for a resource or repository that is external to the database. It gives a data source a way to pass authentication details to a provider whenever it makes a request. A credential is assigned to a data source at creation or alteration time.
Credentials may consist of:
- The authentication type being used
- A username and password
- Authentication IDs of various kinds
Create Data Source
When creating a data source, you will need to specify the provider (e.g., S3, Azure Blob Storage, HDFS, or Kafka), and some providers may require the region as well. You can also specify a credential to authenticate with the provider.
When ingesting from a public bucket, use ANONYMOUS = true in the WITH OPTIONS() clause, which bypasses the need to supply an access key and secret.
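As a sketch, creating a credential and an S3 data source might look like the following. All names, keys, and bucket values are placeholders, and the exact option spellings should be checked against the CREATE CREDENTIAL and CREATE DATA SOURCE documentation for your version:

```sql
-- Create a credential holding the authentication details for the provider.
CREATE OR REPLACE CREDENTIAL s3_cred
TYPE = 'aws_access_key'
IDENTITY = '<access key id>'
SECRET = '<secret access key>';

-- Create a data source that uses the credential to reach an S3 bucket.
CREATE OR REPLACE DATA SOURCE s3_ds
LOCATION = 'S3'
WITH OPTIONS (
    CREDENTIAL = 's3_cred',
    BUCKET NAME = 'example-bucket',
    REGION = 'us-east-1'
);

-- For a public bucket, skip the credential and connect anonymously.
CREATE OR REPLACE DATA SOURCE s3_public_ds
LOCATION = 'S3'
WITH OPTIONS (
    ANONYMOUS = true,
    BUCKET NAME = 'example-public-bucket',
    REGION = 'us-east-1'
);
```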
Supported File Types
Kinetica supports the following file formats:
- Text Delimited: All delimited text files, including CSV, TSV, etc.
- Parquet: Apache Parquet is a popular open-source, column-oriented data storage format used with the Hadoop ecosystem.
- Shapefile: Shapefile is a popular format for storing geospatial data.
- GeoJSON: Stores geographic features along with non-spatial attributes in a JSON format.
For formats like JSON and GeoJSON, Kinetica stringifies nested objects and stores them in an unrestricted string column. If you require data to be flattened, you will need to pre-flatten your data before ingesting it into Kinetica. This must be done using a third-party tool or script.
If your data is not in a compatible format, then you need to convert the file into a supported format using a third-party tool.
Loading via Workbench UI
The Import interface of Kinetica's Workbench imports data from a variety of sources using a simple wizard workflow. In addition to importing data, it generates the SQL equivalent that can be copied and pasted into a workbook.
To access the Import page and begin loading data, click the Import link in the main navigation bar. Then, follow the workflow to import data from one of Kinetica's supported data providers.
Loading via SQL
The purpose of an external table is to incorporate read-only data, governed by an external system, in your queries. An external table comes in two flavors: materialized and logical. A materialized external table loads data from the provider into Kinetica and stores it indefinitely. The data in a materialized external table is refreshed upon startup, and can also be refreshed manually. A logical external table, on the other hand, loads data from the provider only for the life of the query, and then flushes it from Kinetica when the query has completed.
Each type of external table has benefits and drawbacks that should be weighed for your particular use case. Logical external tables allow you to only ingest data when needed, but the data must be loaded for each query and the loading time may be prohibitive. Materialized external tables give you immediate access to the data, but require space on your system to store the data.
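The two flavors might be declared as follows. This is a hedged sketch: the table, path, and data source names are placeholders, and the option controlling materialized versus logical behavior should be verified against the CREATE EXTERNAL TABLE documentation for your version:

```sql
-- Materialized: data is loaded into Kinetica and stored,
-- refreshed on startup or on demand.
CREATE EXTERNAL TABLE sales_materialized
FILE PATHS 'data/sales.parquet'
FORMAT PARQUET
WITH OPTIONS (
    DATA SOURCE = 's3_ds',
    EXTERNAL_TABLE_TYPE = 'materialized'
);

-- Logical: data is pulled from the provider for the life of each
-- query, then flushed.
CREATE EXTERNAL TABLE sales_logical
FILE PATHS 'data/sales.parquet'
FORMAT PARQUET
WITH OPTIONS (
    DATA SOURCE = 's3_ds',
    EXTERNAL_TABLE_TYPE = 'logical'
);
```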
The LOAD INTO command is a general-purpose data loading utility that is capable of moving data into Kinetica from the supported data providers. It uses a subset of the parameters available to CREATE EXTERNAL TABLE, including the SUBSCRIBE option, which allows you to continually ingest data from a stream or using a change-data-capture mechanism that detects and loads new data added to a file. See the LOAD INTO section for the full list of options.
When ingesting from Kafka, you must use the LOAD INTO command. An external table cannot be created with a Kafka data source.
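For example, a streaming load from a Kafka-backed data source might use the SUBSCRIBE option. This is a sketch with placeholder table and data source names; consult the LOAD INTO documentation for the full Kafka option set:

```sql
-- Continuously ingest records from a Kafka data source into a table.
LOAD DATA INTO events
FORMAT JSON
WITH OPTIONS (
    DATA SOURCE = 'kafka_ds',
    SUBSCRIBE = true
);
```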
Kinetica supports several modes of ingestion for both the LOAD INTO and the EXTERNAL TABLE capabilities:
- One-Time Ingest: Perform a one-time load from a data provider.
- Streaming Ingestion: Continually ingest streaming data from a Kafka topic or use change-data-capture to detect and load new data in a file.
When loading data, Kinetica provides these helpful features:
- Select Columns: Specify columns you want to keep, removing those you do not wish to ingest. See the FIELDS MAPPED BY clause on the SQL Support page for more details.
- Type Inferencing: If your table does not already exist, the data types will be automatically inferred upon ingestion.
Type inferencing has two modes: speed and accuracy, which can be selected by passing the option 'type_inference_mode' = 'accuracy' in the WITH OPTIONS() clause. Speed mode, the default, tells Kinetica to inspect only a portion of the file when inferring types, and may over-estimate column sizes. This is sufficient for most use cases. Use accuracy mode if you require types to be highly accurate, but anticipate longer loading times as a result.
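For instance, a load that opts into accuracy mode might look like this (a sketch; the table, path, and data source names are placeholders):

```sql
-- Infer column types from the full file rather than a sample.
LOAD DATA INTO products
FROM FILE PATHS 'data/products.csv'
FORMAT TEXT
WITH OPTIONS (
    DATA SOURCE = 's3_ds',
    TYPE_INFERENCE_MODE = 'accuracy'
);
```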
See this page for a larger list of SQL examples using the LOAD INTO syntax with various options and authentication methods.
Loading with KiSQL
KiSQL is a Java-based application capable of uploading files from a client machine into Kinetica's KiFS. It is a great option if you don't have direct access to the Kinetica Workbench, or if you want to script file loads on a set schedule from a client.
First, download KiSQL from this GitHub repository. If you're on a Windows machine, you do not need Java pre-installed. On macOS or Linux machines, you will need Java 8 installed to run KiSQL.
Next, start KiSQL. You can find the connection details on the landing page of Workbench.
Create a directory into which to load your file(s).
Note that you will need system administrator privileges to run this command.
Upload your file(s) to KiFS.
Finally, load your file into a table.
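Put together, the KiSQL steps above might look like the following sketch. The directory, file, and table names are placeholders, and the UPLOAD FILE statement in particular should be verified against your version's KiFS documentation:

```sql
-- Create a KiFS directory (requires system administrator privileges).
CREATE DIRECTORY 'data';

-- Upload a local file from the client machine into the KiFS directory.
UPLOAD FILE 'products.csv' INTO 'data';

-- Load the uploaded file into a table.
LOAD DATA INTO products
FROM FILE PATHS 'kifs://data/products.csv'
FORMAT TEXT;
```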
Custom Ingest Applications
In some circumstances, you may want to write a custom application to ingest data into Kinetica using its native APIs or JDBC Driver. Kinetica offers several endpoints for uploading files, loading data, or inserting records; their usage is outside the scope of this article.
Kinetica's native APIs are capable of distributing records to the various nodes of your cluster at the time of ingest. This provides a more direct pathway that avoids the need for records to first be sent to Kinetica's head node for evaluation. When writing a custom application, refer to the Multi-head Ingest page for information on how to achieve the best ingestion performance.
3rd Party Tools
If the data you wish to load is not in a supported format, is stored in an incompatible provider, or requires ETL before loading, you may use Kinetica's JDBC Driver in conjunction with an ETL tool to load data into Kinetica. In many cases, this process is as easy as dropping in Kinetica's JDBC Driver and defining a few configuration parameters. Note, however, that this depends on the tool you have chosen and the way its JDBC connectivity was implemented.
For more information about how to use Kinetica's JDBC Driver, see the JDBC driver documentation.
Users can adjust the batch size used to insert records with the SQL hint KI_HINT_BATCH_SIZE(n), which defaults to 200,000. For tables with very large columns (e.g., extremely complex WKT records), adjusting the batch size down can improve performance. For very simple tables with short columns, a higher batch size can improve throughput. These are rough guidelines, and your mileage may vary depending on your use case.
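As an illustration, the hint is embedded as a comment in the statement. This is a sketch with placeholder table names and an arbitrary batch size:

```sql
-- Lower the insert batch size for a table with very wide geometry columns.
INSERT INTO target_table /* KI_HINT_BATCH_SIZE(10000) */
SELECT * FROM source_table;
```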