Loading Data

Kinetica supports loading of data via SQL from a variety of file types and through several mechanisms. There are two primary paths:

Loading server-accessible data sets via SQL (or the /execute/sql endpoint call) that either:
- can be reached by Kinetica via an external data source object
- have been uploaded to Kinetica via SQL or API
Loading client-side files via ODBC/JDBC

LOAD INTO

Kinetica can load either internal or external data into a table, using a LOAD INTO statement.

LOAD INTO Syntax
1
2
3
4
5
6
7
8
9
<COPY | LOAD> [DATA] INTO [<schema name>.]<table name>
<
   FROM REMOTE QUERY '<source data query>'
   |
   FROM FILE PATHS <file paths>
      [FORMAT <[DELIMITED] TEXT [(<delimited text options>)] | AVRO | JSON | PARQUET | SHAPEFILE>]
>
[WITH OPTIONS (<load option name> = '<load option value>'[,...])]
[<table property clause>]

Note

For contextualized examples, see Examples. For copy/paste examples, see Loading Data. For an overview of loading data into Kinetica, see Data Loading Concepts.

The data can be sourced through either of the following:

local files uploaded into KiFS
a data source configured to allow access to:
- remote files on Azure, GCS, HDFS, or S3
- remote messages on a Kafka topic
- a remote database queryable via JDBC (or CData JDBC)

A LOAD INTO that uses a data source can perform a one-time load upon creation and optionally subscribe for updates on an interval, depending on the data source provider:

Provider	Description	One-Time Load	Subscription
Azure	Microsoft blob storage	Yes	Yes
CData	CData Software source-specific JDBC driver See driver list for the supported drivers	Yes	Yes
Confluent	Confluent Kafka streaming feed (JSON, GeoJSON, or Avro)		Yes
GCS	Google Cloud Storage	Yes	Yes
HDFS	Apache Hadoop Distributed File System	Yes
JDBC	Java DataBase Connectivity; requires user-supplied driver	Yes	Yes
Kafka	Apache Kafka streaming feed (JSON, GeoJSON, or Avro)		Yes
S3	Amazon S3 Bucket	Yes	Yes

See Manage Subscription for pausing, resuming, & canceling subscriptions on the target table.

The use of LOAD INTO with ring resiliency has additional considerations.

Parameters Description

DATA Optional keyword for compatibility

<schema name> Name of the schema containing the table into which data will be loaded

<table name> Name of the table into which data will be loaded

FROM REMOTE QUERY

Source data specification clause, where <source data query> is a SQL query selecting the data which will be loaded.

Note

This clause is mutually exclusive with the FROM FILE PATHS clause, and is only applicable to CData & JDBC data sources.

The query should meet the following criteria:

Any column expression used is given a column alias.
The first column is not a WKT or unlimited length VARCHAR type.
The columns and expressions queried should match the order, number, & type of the columns in the target table. If the target table doesn't exist, it will be created using the names and data types of the queried columns in the order in which they appear in the query.

Any query resulting in more than 10,000 records will be distributed and loaded in parallel (unless directed otherwise) using the following rule sequence:

If REMOTE_QUERY_NO_SPLIT is TRUE, the query will not be distributed.
If a valid REMOTE_QUERY_PARTITION_COLUMN is specified, the query will be distributed by partitioning on the given column's values
If a valid REMOTE_QUERY_ORDER_BY is specified, the query will be distributed by ordering the data accordingly and then partitioning into sequential blocks from the first record
If a non-null numeric/date/time column exists, the query will be distributed by partitioning on the first such column's values
The query will be distributed by sorting the data on the first column and then partitioning into sequential blocks from the first record

Type inferencing is limited by the available JDBC types. To take advantage of Kinetica-specific types and properties, create the target table manually before loading data.

FROM FILE PATHS

Source file specification clause, where <file paths> is a comma-separated list of single-quoted file paths from which data will be loaded; all files specified are presumed to have the same format and data types.

Note

This clause is mutually exclusive with the FROM REMOTE QUERY clause, and is not applicable to CData & JDBC data sources.

The form of a file path is dependent on the source referenced:

Data Source: If a data source is specified in the load options, these file paths must resolve to accessible files at that data source location. A "path prefix" can be specified instead, which will cause all files whose path begins with the given prefix to be included.
For example, a "path prefix" of /data/ge for <file paths> would match all of the following:
- /data/geo.csv
- /data/geo/flights.csv
- /data/geo/2021/airline.csv
If using an HDFS data source, the "path prefix" must be the name of an HDFS directory.
KiFS: The path must resolve to an accessible file path within KiFS. A "path prefix" can be specified instead, which will cause all files whose path begins with the given prefix to be included.
For example, a "path prefix" of kifs://data/ge would match all of the following files under the KiFS data directory:
- kifs://data/geo.csv
- kifs://data/geo/flights.csv
- kifs://data/geo/2021/airline.csv

FORMAT

Optional indicator of source file type, for file-based data sources; will be inferred from the file extension if not given.

Supported formats include:

Keyword	Description
`[DELIMITED] TEXT`	Any text-based, delimited field data file (CSV, PSV, TSV, etc.); a comma-delimited list of options can be given to specify the way in which the data file(s) should be parsed, including the delimiter used, whether headers are present, etc. Records spanning multiple lines are not supported. See Delimited Text Options for the complete list of `<delimited text options>`.
`AVRO`	Apache Avro data file
`JSON`	Either a JSON or GeoJSON data file See JSON/GeoJSON Limitations for the supported data types.
`PARQUET`	Apache Parquet data file See Parquet Limitations for the supported data types.
`SHAPEFILE`	ArcGIS shapefile

WITH OPTIONS

Optional indicator that a comma-delimited list of connection & global option/value assignments will follow.

See Load Options for the complete list of options.

<table property clause> Optional comma-separated list of table properties, from a subset of those available, to the table to load into, if it is created by this call.

Delimited Text Options

The following options can be specified when loading data from delimited text files. When reading from multiple files, options specific to the source file will be applied to each file being read.

Option Description

COMMENT = '<string>'

Treat lines in the source file(s) that begin with string as comments and skip.

The default comment marker is #.

DELIMITER = '<char>'

Use char as the source file field delimiter.

The default delimiter is a comma, unless a source file has one of these extensions:

.psv - will cause | to be the delimiter
.tsv - will cause the tab character to be the delimiter

See Delimited Text Option Characters for allowed characters.

ESCAPE = '<char>'

Use char as the source file data escape character. The escape character preceding any other character, in the source data, will be converted into that other character, except in the following special cases:

Source Data String	Representation when Loaded into the Database
`<char>a`	ASCII bell
`<char>b`	ASCII backspace
`<char>f`	ASCII form feed
`<char>n`	ASCII line feed
`<char>r`	ASCII carriage return
`<char>t`	ASCII horizontal tab
`<char>v`	ASCII vertical tab

For instance, if the escape character is \, a \t encountered in the data will be converted to a tab character when stored in the database.

The escape character can be used to escape the quoting character, and will be treated as an escape character whether it is within a quoted field value or not.

There is no default escape character.

HEADER DELIMITER = '<char>'

Use char as the source file header field name/property delimiter, when the source file header contains both names and properties. This is largely specific to the Kinetica export to delimited text feature, which will, within each field's header, contain the field name and any associated properties, delimited by the pipe | character.

An example Kinetica header in a CSV file:

1
id|int|data,category|string|data|char16,name|string|data|char32

The default is the | (pipe) character. See Delimited Text Option Characters for allowed characters.

Note

The DELIMITER character will still be used to separate field name/property sets from each other in the header row

INCLUDES HEADER = <TRUE|FALSE>

Declare that the source file(s) will or will not have a header.

The default is TRUE.

NULL = '<string>'

Treat string as the indicator of a null source field value.

The default is the empty string.

QUOTE = '<char>'

Use char as the source file data quoting character, for enclosing field values. Usually used to wrap field values that contain embedded delimiter characters, though any field may be enclosed in quote characters (for clarity, for instance). The quote character must appear as the first and last character of a field value in order to be interpreted as quoting the value. Within a quoted value, embedded quote characters may be escaped by preceding them with another quote character or the escape character specified by ESCAPE, if given.

The default is the " (double-quote) character. See Delimited Text Option Characters for allowed characters.

Delimited Text Option Characters

For DELIMITER, HEADER DELIMITER, ESCAPE, & QUOTE, any single character can be used, or any one of the following escaped characters:

Escaped Char	Corresponding Source File Character
`''`	Single quote
`\a`	ASCII bell
`\b`	ASCII backspace
`\f`	ASCII form feed
`\t`	ASCII horizontal tab
`\v`	ASCII vertical tab

For instance, if two single quotes ('') are specified for a QUOTE character, the parser will interpret single quotes in the source file as quoting characters; specifying \t for DELIMITER will cause the parser to interpret ASCII horizontal tab characters in the source file as delimiter characters.

Load Options

The following options can be specified to modify the way data is loaded (or not loaded) into the target table.

Option Description

BAD RECORD TABLE

Name of the table containing records that failed to be loaded into the target table. This bad record table will include the following columns:

Column Name	Source Data Format Codes
`line_number`	Number of the line in the input file containing the failed record
`char_number`	Position of character within a failed record that is assessed as the beginning of the portion of the record that failed to process
`filename`	Name of file that contained the failed record
`line_rejected`	Text of the record that failed to process
`error_msg`	Error message associated with the record processing failure

Note

This option is not applicable for an ON ERROR mode of ABORT. In that mode, processing stops at the first error and that error is returned to the user.

BATCH SIZE

Use an ingest batch size of the given number of records.

The default batch size is 50,000.

COLUMN FORMATS

Use the given type-specific formatting for the given column when parsing source data being loaded into that column. This should be a map of column names to format specifications, where each format specification is map of column type to data format, all formatted as a JSON string.

Supported column types include:

Column Type Source Data Format Codes

date

Apply the given date format to the given column.

Common date format codes follow. For the complete list, see Date/Time Conversion Codes.

Code	Description
`YYYY`	4-digit year
`MM`	2-digit month, where January is `01`
`DD`	2-digit day of the month, where the 1st of each month is `01`

time

Apply the given time format to the given column.

Common time format codes follow. For the complete list, see Date/Time Conversion Codes.

Code	Description
`HH24`	24-based hour, where 12:00 AM is `00` and 7:00 PM is `19`
`MI`	2-digit minute of the hour
`SS`	2-digit second of the minute
`MS`	milliseconds

datetime Apply the given date/time format to the given column.

For example, to load dates of the format 2010.10.30 into date column d and times of the 24-hour format 18:36:54.789 into time column t:

{
    "d": {"date": "YYYY.MM.DD"},
    "t": {"time": "HH24:MI:SS.MS"}
}

Note

This option is not available for data sources configured for JDBC.

DATA SOURCE Load data from the given data source. Data source connect privilege is required when loading from a data source.

DEFAULT COLUMN FORMATS

Use the given formats for source data being loaded into target table columns with the corresponding column types. This should be a map of target column type to source format for data being loaded into columns of that type, formatted as a JSON string.

Supported column properties and source data formats are the same as those listed in the description of the COLUMN FORMATS option.

For example, to make the default format for loading source data dates like 2010.10.30 and 24-hour times like 18:36:54.789:

{
    "date": "YYYY.MM.DD",
    "time": "HH24:MI:SS.MS",
    "datetime": "YYYY.MM.DD HH24:MI:SS.MS"
}

Note

This option is not available for data sources configured for JDBC.

FIELDS IGNORED BY

Choose a comma-separated list of fields from the source file(s) to ignore, loading only those fields that are not in the identified list in the order they appear in the file. Fields can be identified by either POSITION or NAME. If ignoring by NAME, the specified names must match the source file field names exactly.

Identifying by Name:

FIELDS IGNORED BY NAME(Category, Description)

Identifying by Position:
```
FIELDS IGNORED BY POSITION(3, 4)
```

Note

When ignoring source data file fields, the set of fields that are not ignored must align, in type & number in their order in the source file, with the target table columns into which the data will be loaded.
Ignoring fields by POSITION is only supported for delimited text files.

FIELDS MAPPED BY

Choose a comma-separated list of fields from the source file(s) to load, in the specified order, identifying fields by either POSITION or NAME. If mapping by NAME, the specified names must match the source file field names exactly.

Identifying by Name:
```
FIELDS MAPPED BY NAME(ID, Name, Stock)
```
Identifying by Position:
```
FIELDS MAPPED BY POSITION(1, 2, 5)
```

Note

When mapping source data file fields and the target table exists, the set of fields that are identified must align, in type & number in the specified order, with the target table columns into which data will be loaded.
Mapping fields by POSITION is only supported for delimited text files.

FLATTEN_COLUMNS

Specify the policy for handling nested columns within JSON data.

The default is FALSE.

Value	Description
`TRUE`	Break up nested columns into multiple columns.
`FALSE`	Treat nested columns as JSON columns instead of flattening.

IGNORE_EXISTING_PK

Specify the error suppression policy for inserting duplicate primary key values into a table with a primary key. If the specified table does not have a primary key or the UPDATE_ON_EXISTING_PK option is used, then this options has no effect.

The default is FALSE.

Value	Description
`TRUE`	Suppress errors when inserted records and existing records' PKs match.
`FALSE`	Return errors when inserted records and existing records' PKs match.

INGESTION MODE

Whether to do a full ingest of the data or perform a dry run or type inference instead.

The default mode is FULL.

Value	Description
`DRY RUN`	No data will be inserted, but the file will be read with the applied `ON ERROR` mode and the number of valid records that would normally be inserted is returned.
`FULL`	Data is fully ingested according to the active `ON ERROR` mode.
`TYPE INFERENCE`	Infer the type of the source data and return, without ingesting any data. The inferred type is returned in the response, as the output of a `SHOW TABLE` command.

JDBC_FETCH_SIZE

Retrieve this many records at a time from the remote database. Lowering this number will help tables with large record sizes fit into available memory during ingest.

The default is 50,000.

Note

This option is only available for data sources configured for JDBC.

JDBC_SESSION_INIT_STATEMENT

Run the single given statement before the initial load is performed. Also run it before each subsequent reload, if SUBSCRIBE is TRUE.

For example, to set the time zone to UTC before running each load, use:

JDBC_SESSION_INIT_STATEMENT = 'SET TIME ZONE ''UTC'''

Note

This option is only available for data sources configured for JDBC.

KAFKA_CONSUMERS_PER_RANK

Number of Kafka consumer threads per rank; valid range is 1 through 6.

The default is 1.

KAFKA_GROUP_ID Consumer group ID to use for data sources configured for Kafka. If not specified, one will be randomly generated.

KAFKA_OPTIMISTIC_INGEST

Enable optimistic ingestion where Kafka topic offsets and table data are committed independently to achieve parallelism.

The default is FALSE.

KAFKA_OFFSET_RESET_POLICY

Whether to start a Kafka subscription at the earliest or latest offset, for data sources configured for Kafka. The default is earliest.

Value	Description
`earliest`	Start streaming from the oldest message in the queue.
`latest`	Start streaming from the newest message in the queue.

KAFKA_SUBSCRIPTION_CANCEL_AFTER Optional number of minutes after which a Kafka subscription will be cancelled, for data sources configured for Kafka.

KAFKA_TYPE_INFERENCE_FETCH_TIMEOUT

Maximum time to collect Kafka messages before type inferencing on the set of them.

The default is generally 30 seconds.

MAX_RECORDS_TO_LOAD Limit the number of records to load in this request. If the number is larger than BATCH SIZE, the number of records loaded will be limited to the next whole number of BATCH SIZE (per working thread).

NUM_SPLITS_PER_RANK

The number of remote query partitions to assign each Kinetica worker process. The queries assigned to a worker process will be executed by the tasks allotted to the process.

To decrease memory pressure, increase the number of splits per rank.

The default is 8 splits per rank.

Note

This option is only available for data sources configured for JDBC.

NUM_TASKS_PER_RANK

The number of tasks to use on each Kinetica worker process to process remote queries. The tasks assigned to a worker process will execute any remote query partitions assigned to it.

To decrease memory pressure, decrease the number of tasks per rank.

The default is 8 tasks per rank.

Note

This option is only available for data sources configured for JDBC.

ON ERROR

When an error is encountered loading a record, handle it using either of the following modes. The default mode is ABORT.

Mode	Description
`SKIP`	If an error is encountered parsing a source record, skip the record.
`ABORT`	If an error is encountered parsing a source record, stop the data load process. Primary key collisions are considered abortable errors in this mode.

POLL_INTERVAL

Interval, in seconds, at which a data source is polled for updates. The number of seconds must be passed as a single-quoted string.

The default interval is 60 seconds. This option is only applicable when SUBSCRIBE is TRUE.

Note

This option is not available for data sources configured for Kafka.

PRIMARY KEY

Optional primary key specification clause, where the value is a parenthesized comma-separated list of columns to use as the primary key for the table:

PRIMARY KEY = ('id', 'dept_id')

REMOTE_QUERY_INCREASING_COLUMN

For a JDBC query change data capture loading scheme, the remote query column that will be used to determine whether a record is new and should be loaded or not. This column should have an ever-increasing value and be of an integral or date/timestamp type. Often, this column will be a sequence-based ID or create/modify timestamp.

This option is only applicable when SUBSCRIBE is TRUE.

Note

This option is only available for data sources configured for JDBC.

REMOTE_QUERY_NO_SPLIT

Whether to not distribute the retrieval of remote data and issue queries for blocks of data at time in parallel.

The default is FALSE.

Note

This option is only available for data sources configured for JDBC

Value	Description
`TRUE`	Issue the remote data retrieval as a single query.
`FALSE`	Distribute and parallelize the remote data retrieval in queries for blocks of data at a time.

REMOTE_QUERY_ORDER_BY

Ordering expression to use in partitioning remote data for retrieval. The remote data will be ordered according to this expression and then retrieved in sequential blocks from the first record. This is potentially less performant than using REMOTE_QUERY_PARTITION_COLUMN.

If REMOTE_QUERY_NO_SPLIT is TRUE, a valid REMOTE_QUERY_PARTITION_COLUMN is specified, or the column given is invalid, this option is ignored.

Note

This option is only available for data sources configured for JDBC

REMOTE_QUERY_PARTITION_COLUMN

Column to use to partition remote data for retrieval. The column must be numeric and should be relatively evenly distributed so that queries using values of this column to partition data will retrieve relatively consistently-sized result sets.

If REMOTE_QUERY_NO_SPLIT is TRUE or the column given is invalid, this option is ignored.

Note

This option is only available for data sources configured for JDBC

SCHEMA_REGISTRY_SCHEMA_NAME

Name of the Avro schema in the schema registry to use when reading Avro records.

Note

This option is not available for data sources configured for Kafka.

SHARD KEY

Optional shard key specification clause, where the value is a parenthesized comma-separated list of columns to use as the shard key for the table:

SHARD KEY = ('id')

SKIP_LINES Skip this number of lines when loading from a source file.

SUBSCRIBE

Whether to subscribe to the data source specified in the DATA SOURCE option.

The default is FALSE.

Note

This option is not available for data sources configured for HDFS.

Value	Description
`TRUE`	Stream data from the specified data source.
`FALSE`	Load all data once from the specified data source.

TRUNCATE_STRINGS

Specify the string truncation policy for inserting text into VARCHAR columns that are not large enough to hold the entire text value.

The default is FALSE.

Value	Description
`TRUE`	Truncate any inserted string value at the maximum size for its column.
`FALSE`	Reject any record with a string value that is too long for its column.

TRUNCATE_TABLE

Specify whether to truncate the target table before beginning the load or subscription.

The default is FALSE (don't truncate the target table).

TYPE_INFERENCE_MODE

When making a type inference of the data values in order to define column types for the target table, use one of the following modes.

The default mode is SPEED.

Value	Description
`ACCURACY`	Scan all available data to arrive at column types that are the narrowest possible that can still hold all the data.
`SPEED`	Pick the widest possible column types from the minimum data scanned in order to quickly arrive at column types that should fit all data values.

UPDATE_ON_EXISTING_PK

Specify the record collision policy for inserting into a table with a primary key. If the specified table does not have a primary key, then this options has no effect.

The default is FALSE.

Value	Description
`TRUE`	Update existing records with records being inserted, when PKs match.
`FALSE`	Discard records being inserted when existing records' PKs match.

Table Property Clause

A comma-separated list of options for creating the target table, if it doesn't exist and if the source data file has Kinetica headers, can be specified.

Kinetica headers contain column type information, which are used in creating the target table if it doesn't exist. An example Kinetica header:

id|int|data,category|string|data|char16,name|string|data|char32,description|string|data|char128|nullable,stock|int|data|nullable

LOAD INTO Table Property Clause
1
2
3
4
5
6
USING TABLE PROPERTIES
(
    <table property> = <value>,
    ...
    <table property> = <value>
)

Property	Description
`CHUNK COLUMN MEMORY`	Size of the blocks of memory holding the data, when loaded; specified as the maximum number of bytes any one column should hold. Note The size of dictionary-encoded columns is estimated.
`CHUNK MEMORY`	Size of the blocks of memory holding the data, when loaded; specified as the maximum total number of bytes all columns should hold. Note The size of dictionary-encoded columns is estimated.
`CHUNK SIZE`	Size of the blocks of memory holding the data, when loaded; specified as the maximum number of records each block of memory should hold
`TTL`	The time-to-live (TTL) for the table; if not set, the table will not expire

Examples

File Paths

Single File

To load a CSV file of product data named products.csv in the KiFS directory data, into a table named product:

LOAD INTO (Single File) Example
1
2
LOAD DATA INTO example.product
FROM FILE PATHS 'kifs://data/products.csv'

Multiple Files

When loading files from multiple paths, specify a comma-delimited list of those file paths. To load one file named products.csv and one file named products.kh.csv, both under the KiFS directory data, into a table named product:

LOAD INTO (Multiple Files) Example
1
2
LOAD DATA INTO example.product
FROM FILE PATHS 'kifs://data/products.csv', 'kifs://data/products.kh.csv'

File Formats

Single File - No Header

To load a headerless CSV file of product data named products.nh.csv in the KiFS directory data, into a table named product:

LOAD INTO (No Header) Example
1
2
3
LOAD DATA INTO example.product
FROM FILE PATHS 'kifs://data/products.nh.csv'
FORMAT TEXT (INCLUDES HEADER = FALSE)

Delimited Text Options

To load data from a product data file with the following options:

data file named products.ssv located in the KiFS directory data
the file has a header
the field delimiter is a semicolon
data requiring quotes will have single quotes as the quoting character
the escape character is the backtick
the data file represents null values as <null>
data file comments are on lines starting with --
when parse errors are encountered, the record will be skipped

LOAD INTO with Delimited Text Options Example
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
LOAD DATA INTO example.product
FROM FILE PATHS 'kifs://data/products.ssv'
FORMAT TEXT
(
    COMMENT = '--',
    DELIMITER = ';',
    ESCAPE = '`',
    INCLUDES HEADER = TRUE,
    NULL = '<null>',
    QUOTE = ''''
)
WITH OPTIONS (ON ERROR = SKIP)

Kinetica Headers

A CSV file with Kinetica headers can be used to specify the column types of the target table when creating it if it doesn't exist.

To load such a file of product data with the following conditions:

data file named products.kh.csv located in the KiFS directory data
data file contains a Kinetica header
target table named product_create_on_load
target table will be created if it doesn't exist

LOAD INTO (Kinetica Header) Example
1
2
LOAD DATA INTO example.product_create_on_load
FROM FILE PATHS 'kifs://data/products.kh.csv'

Parquet File

To load a Parquet file of employee data, employee.parquet, in the KiFS directory data, into a table named example.employee_2000:

LOAD INTO (Parquet) Example
1
2
3
LOAD DATA INTO example.employee_2000
FROM FILE PATHS 'kifs://data/employee.parquet'
FORMAT PARQUET

Kafka

To subscribe to a Kafka data feed with the following conditions:

Kafka remote source is represented by the kafka_ds data source
data will be continuously streamed from the Kafka source
target table named orders

LOAD INTO (Kafka) Example
1
2
3
LOAD DATA INTO example.orders
FORMAT JSON
WITH OPTIONS (DATA SOURCE = 'kafka_ds', SUBSCRIBE = TRUE)

Table Options

If the target table doesn't exist, the LOAD INTO command will create it using the given table options and data type inferencing for the column types.

See Kinetica Headers for an example of loading a file with Kinetica headers to define the column types of the created table.

Primary/Shard Keys

A CSV file can be used to create a target table with primary & shard keys if it doesn't exist.

To load such a file of employee data with the following conditions:

data file named employee.csv located in the KiFS directory data
data file does not contain a Kinetica header
target table named example.employee
target table will be created if it doesn't exist
target table will have a composite primary key on the id & dept_id columns and a shard key on id column, if it is created by this load command

LOAD INTO (Primary/Shard Key) Example
1
2
3
4
5
6
7
LOAD INTO example.employee
FROM FILE PATHS 'kifs://data/employee.csv'
WITH OPTIONS
(
    PRIMARY KEY = (id, dept_id),
    SHARD KEY = (id)
)

Important

The data must have no nulls in any columns designated as primary key columns.

Creation Options

Table creation options can be used to define characteristics of a table when creating it if it doesn't exist.

To load such a file of product data with the following conditions:

data file named products.kh.csv located in the KiFS directory data
data file does not contain a Kinetica header
target table named product_create_options
target table will be created if it doesn't exist
target table will have a chunk size of 1,000,000 records, if it is created by this load command
target table will have a TTL of 5 minutes, and will be dropped if unused for any 5-minute span, if it is created by this load command

LOAD INTO (Create Table) Example
1
2
3
LOAD DATA INTO example.product_create_options
FROM FILE PATHS 'kifs://data/products.csv'
USING TABLE PROPERTIES (CHUNK SIZE = 1000000, TTL = 5)

Column Options

Column/Field Names

To load specific fields from a product data file, with the following conditions:

data file named products.title-case.csv located in the KiFS directory data
data file contains a header with fields named ID, Category, Name, Description, & Stock
target table named product_name_stock
target columns named id, name, & stock
only load fields ID, Name, & Stock into columns id, name, & stock
skip records that fail to load successfully

LOAD INTO (Column/Field Names) Example
1
2
3
LOAD DATA INTO example.product_name_stock
FROM FILE PATHS 'kifs://data/products.title-case.csv'
WITH OPTIONS (ON ERROR = SKIP, FIELDS MAPPED BY NAME(ID, Name, Stock))

Note

When specifying source data file field names, the set of source data file fields selected must align, in type & number, with the target table columns into which data will be loaded.

Date/Time Column Types

To apply a specific date format to the hire_date field:

LOAD INTO (Date/Time Column Types) Example
1
2
3
4
5
6
7
8
9
LOAD INTO example.employee
FROM FILE PATHS 'kifs://data/employee.csv'
WITH OPTIONS
(
    COLUMN FORMATS = '
    {
        "hire_date": {"date": "YYYY-MM-DD"}
    }'
)

Dry Run

To perform a dry run of a load of a CSV file, products.csv, in the KiFS directory data, into a table named example.product:

LOAD INTO (Dry Run) Example
1
2
3
LOAD DATA INTO example.product
FROM FILE PATHS 'kifs://data/products.csv'
WITH OPTIONS (INGESTION MODE = DRY RUN)

Note

The dry run will return the number of records that would have been loaded, but not actually insert any records into the target table.

Data Sources

File-Based

To load a CSV file, products.csv, from the data source example.product_ds, into a table named example.product:

LOAD INTO (Data Source File) Example
1
2
3
LOAD DATA INTO example.product
FROM FILE PATHS 'products.csv'
WITH OPTIONS (DATA SOURCE = 'example.product_ds')

Query-Based

To load the results of a remote query of employees in department 2 from the JDBC data source example.jdbc_ds, into a local table named example.employee_dept2:

LOAD INTO (Data Source Query) Example
1
2
3
LOAD INTO example.employee_dept2
FROM REMOTE QUERY 'SELECT * FROM example.employee WHERE dept_id = 2'
WITH OPTIONS (DATA SOURCE = 'example.jdbc_ds')

Change Data Capture

File-Based

To load a set of order data in a change data capture scheme with the following conditions:

data pulled through a data source, example.order_ds
data files contained with an orders directory
initially, all files in the directory will be loaded; subsequently, only those files that have been updated since the last check will be reloaded
files will be polled for updates every 5 minutes
target table named example.orders
target table will be created if it doesn't exist

LOAD INTO (File Change Data Capture) Example
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
-- Load files in the orders directory with timestamps newer than the time of the
--   previous poll interval; poll every 5 minutes
LOAD DATA INTO example.orders
FROM FILE PATHS 'orders/'
WITH OPTIONS
(
    DATA SOURCE = 'example.order_ds',
    SUBSCRIBE = TRUE,
    POLL_INTERVAL = 300
)

Query-Based

To load the data from a remote query of orders in a change data capture scheme with the following conditions:

data pulled through a data source, example.jdbc_ds
data contained with an example.orders table, where only orders for product with ID 42 will be loaded into the target table
initially, all orders will be loaded; subsequently, only those orders with an order_id column value higher than the highest one on the previous poll cycle will be loaded
remote table will be polled for updates every 60 seconds
target table named example.order_product42
target table will be created if it doesn't exist

LOAD INTO (Query Change Data Capture) Example
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
-- Load new orders for product 42 continuously into a table
--   order_id is an ever-increasing sequence allotted to each new order
LOAD INTO example.order_product42
FROM REMOTE QUERY 'SELECT * FROM example.orders WHERE product_id = 42'
WITH OPTIONS
(
    DATA SOURCE = 'example.jdbc_ds',
    SUBSCRIBE = TRUE,
    REMOTE_QUERY_INCREASING_COLUMN = 'order_id'
)

INSERT INTO ... SELECT ... FROM FILE

Kinetica can load data from text-based data files that are local to the client into existing tables using a modified INSERT INTO ... SELECT statement.

Note

This command is only available through KiSQL or database clients configured with the Kinetica ODBC/JDBC driver.

When reading data from a file, Kinetica assumes the following format, able to be overridden using command options:

If there is a header row, it is either:

a simple list of comma-delimited column names:
```
id,category,name,description,stock
    
```

the Kinetica standard format:

id|int|data,category|string|data|char16,name|string|data|char32,description|string|data|char128|nullable,stock|int|data|nullable
    

The data fields are comma-delimited
Strings are optionally enclosed in double-quotes; double-quotes must be used when the data contains commas; two consecutive double-quotes are used as an escape code for double-quoted string data containing double-quotes; e.g.:
"This string contains a "" quote mark and a "","" double-quoted comma."
This would be loaded into a single string column in Kinetica as:
This string contains a " quote mark and a "," double-quoted comma.

INSERT INTO ... SELECT ... FROM FILE Syntax
1
2
3
4
INSERT INTO [<schema name>.]<table name> [( <column list> )]
SELECT <field list>
FROM FILE."<file name | file name expression>"
[ <option list> ]

Parameters	Description
`<schema name>`	Name of the schema containing the table into which data will be inserted
`<table name>`	Name of the table into which data will be inserted
`<column list>`	Optional list of target table columns into which data will be inserted; if not specified, all target table columns will be assumed. Whether specified or implied, this list needs to match the fields specified in `<field list>` in number, order, & type.
`<field list>`	List of the source data file field names from which data will be extracted; use `` to extract all columns (this wildcard must be used when a source file has no header which* defines field names). This list needs to match all the target table columns (or those specified in the `<column list>` option) in number, order, & type.
`<file name \|` `file name expression>`	Name of the source data file or set of source data files from which records will be extracted; wildcards (`*`) can be used to specify a group of files
`<option list>`	Optional list of whitespace-separated processing options

Options

The following options can be specified when loading data from files. When reading from multiple files (using wildcards when specifying the file names), options specific to the source file will be applied to each file being read.

Option Description

BATCH_SIZE=<n>

Use an ingest batch size of n records.

The default batch size is 10,000.

COMMENT='<string>'

Treat lines in the source file that begin with string as comments and skip.

The default comment marker is #.

DELIMITER='<char>'

Use char as the source file field delimiter.

The default delimiter is ,, unless the source file has one of these extensions:

.psv - will cause | to be the delimiter
.tsv - will cause the tab character to be the delimiter

See Option Characters for allowed characters.

ESCAPE='<char>'

Source Data String	Representation when Loaded into the Database
`<char>a`	ASCII bell
`<char>b`	ASCII backspace
`<char>f`	ASCII form feed
`<char>n`	ASCII line feed
`<char>r`	ASCII carriage return
`<char>t`	ASCII horizontal tab
`<char>v`	ASCII vertical tab

For instance, if the escape character is \, a \t encountered in the data will be converted to a tab character when stored in the database.

The escape character can be used to escape the quoting character, and will be treated as an escape character whether it is within a quoted field value or not.

There is no default escape character. See Option Characters for allowed characters.

HEADER=<Y|N> Declare that the source file has or does not have a header. If not given, that determination will be intuited.

INITIAL_CLEAR

Truncate the target table before loading data.

The default is to not truncate the target table.

LIMIT=<n>

Limit the total number of records inserted into the target table to n. If reading from multiple source files, this count is assessed across all files being read. For example, if n is 15 and the three files being read have 10 records each, all of the records from the 1st file and the first 5 records from the 2nd file will be loaded. The remaining records in the 2nd file and all of the records from the 3rd file will be skipped.

The default is no limit.

NULL='<string>'

Treat string as the indicator of a null source field value.

The default is \N.

ON_ERROR=<mode>

When an error is encountered loading a record, handle it using the given <mode>; one of the following:

Mode	Description
`PERMISSIVE`	If an error is encountered parsing a source record, attempt to insert as much of the valid fields from the record as possible.
`SKIP`	If an error is encountered parsing a source record, skip the record.
`ABORT`	If an error is encountered parsing a source record, stop the data load process. Primary key collisions are considered abortable errors in this mode.

The default mode is SKIP.

QUOTE='<char>'

The default is the " (double-quote) character. See Option Characters for allowed characters.

SKIP=<n>

Skip the first n source file data lines read, not including header and comment lines. If reading from multiple source files, this count is assessed across all files being read. For example, if n is 15 and the first two files read have 10 records each, the first record to be loaded (not skipped) will be the 6th record of the 2nd file.

The default is to skip no records.

Option Characters

For DELIMITER, ESCAPE, & QUOTE, any single character can be used, or any one of the following escaped characters:

Escaped Char	Corresponding Source File Character
`''`	Single quote
`\\`	Backslash
`\a`	ASCII bell
`\b`	ASCII backspace
`\f`	ASCII form feed
`\t`	ASCII horizontal tab
`\v`	ASCII vertical tab

Examples

File Paths

Single File

To load a CSV file of product data located at ../data/products.csv into a table named product:

INSERT INTO ... SELECT ... FROM FILE (Single File) Example
1
2
3
INSERT INTO example.product
SELECT *
FROM FILE."../data/products.csv"

Multiple Files

To load all files whose names begin with products and end in a csv extension, located under ../data, into a table named product:

INSERT INTO ... SELECT ... FROM FILE (Multiple File) Example
1
2
3
INSERT INTO example.product
SELECT *
FROM FILE."../data/products*csv"

Options

Column/Field Names

To load specific fields from a product data file located at ../data/products.csv into specific target table columns:

INSERT INTO ... SELECT ... FROM FILE (Column/Field Names) Example
1
2
3
INSERT INTO example.product (id, stock, category, name)
SELECT ID, Stock, Category, Name
FROM FILE."../data/products.csv"

Note

When specifying source data file field names, it is not required that target table column names also be specified, or vice versa. The only requirement is that the set of source data file fields selected align with the target table columns into which data will be inserted.

Load Options

To load data from a product data file with the following options:

file located at ../data/products.csv
only records 4 through 10 will be loaded
any error will cause the insert to fail
the target table will be truncated before loading
records will be inserted into the table in groups of 2

INSERT INTO ... SELECT ... FROM FILE with Load Options Example
1
2
3
4
5
6
7
8
INSERT INTO example.product
SELECT *
FROM FILE."../data/products.csv"
BATCH_SIZE=2
INITIAL_CLEAR
LIMIT=7
ON_ERROR=abort
SKIP=3

To load data from a product data file with the following options:

file located at ../data/products.ssv
the file has a header
the field delimiter is a semicolon
data requiring quotes will have single quotes as the quoting character
the escape character is the backtick
the data file represents null values as <null>
data file comments are on lines starting with --

INSERT INTO ... SELECT ... FROM FILE with Parse Options Example
1
2
3
4
5
6
7
8
9
INSERT INTO example.product
SELECT *
FROM FILE."../data/products.ssv"
COMMENT='--'
DELIMITER=';'
ESCAPE='`'
HEADER=Y
NULL='<null>'
QUOTE='\''

Considerations

Multi-Head

If Multi-Head Ingest has been enabled on the database server, file loading through ODBC/JDBC will automatically leverage it, when applicable.

Blank Lines

A blank line in a source data file will be treated as a valid record and will be inserted into the target table as a record with all null fields. If blank lines should not be considered as data, they should be prefixed with the comment marker in order to be skipped by the loading process.

GAdmin

While GAdmin does not support this command, it does support importing data from client-local files directly.

/execute/sql endpoint

While SQL file ingestion is available to ODBC/JDBC clients; it is not supported within the /execute/sql endpoint, either as a REST call directly or as an API call (like the Python API's execute_sql() function).