External Tables

An external table is a database object whose source data is located in one or more files external to the database. The source data can be located in any of the following locations:

  • on shared storage, mounted on each database node in the cluster
  • in KiFS
  • on a remote system, accessible via a data source

External tables are created via the /create/table/external native API call. For details on interfacing with external tables from SQL, see CREATE EXTERNAL TABLE. Using external tables in an environment with ring resiliency enabled has additional considerations.

Types

There are two types of external table, distinguished by the scheme each uses to pull data from the external source:

  • Materialized external tables pull data from external sources and cache that data in a persisted table within the database. Data is refreshed on demand and, configurably, on database startup. This mode ensures a much quicker response time, at the cost of the data being as current as the last refresh.
  • Logical external tables pull data from external sources upon servicing each query against the external table. This mode ensures queries on the external table will always return the most current source data, though there will be a performance penalty for reparsing & reloading the data from source files upon each query.
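
As a sketch, the choice between the two types could be expressed as an option on the create call. The option name `external_table_type` and its value strings are assumptions for illustration, not confirmed by this page:

```python
# Hypothetical options selecting the external table type at creation time.
# The option name 'external_table_type' and its values are assumptions.
materialized_options = {
    'external_table_type': 'materialized'  # cached copy; refreshed on demand
}
logical_options = {
    'external_table_type': 'logical'       # re-reads source files on every query
}

# Hypothetical call shape, mirroring the examples later on this page:
# h_db.create_table_external(
#     table_name = 'example.ext_orders',
#     filepaths = 'kifs://data/orders.csv',
#     options = logical_options
# )
```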

Data File Formats

There are several source data file formats supported for external tables:

  • Parquet - Apache Parquet data files
  • Text - delimited text files (CSV, PSV, TSV, etc.)--the parser is highly configurable and can support a wide variety of delimited text schemes
  • JSON - both standard JSON & GeoJSON files are supported
  • Shapefile - ArcGIS shapefiles

Regardless of the format selected, one or more source data fields can be used in the creation of the external table. Date/time fields can have their source formats specified.
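
For example, a pipe-delimited text source with a configurable header, a subset of source fields, and a date format might be described with options like the following. Aside from `file_type` (which appears in the Parquet example below), every option name and format string here is an assumption for illustration:

```python
# Hypothetical text-parser options for a delimited text source.
# The option names text_delimiter, text_has_header, columns_to_load, and
# column_formats -- and the date format-string syntax -- are assumptions.
text_options = {
    'file_type': 'delimited_text',
    'text_delimiter': '|',                   # PSV file
    'text_has_header': 'true',
    'columns_to_load': 'id,name,hire_date',  # use a subset of source fields
    # per-column date/time source format (syntax assumed for illustration)
    'column_formats': '{"hire_date": {"date": "YYYY-MM-DD"}}'
}
```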

Table Features

An external table can be assigned many of the features of standard tables, such as a primary key, as shown in the examples below.

External table names and column names must adhere to the supported naming criteria, and the name resolution follows that of tables.

Data Sources

If an external table is to use a data source, then data source connect privilege is required for the following actions:

  • Creating the external table
  • Refreshing the external table, if it is a materialized external table
  • Querying the external table, if it is a logical external table

An external table can also subscribe to a data source configured for Azure or S3, and can have that subscription paused, resumed, or cancelled.
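
Subscription control could be sketched as a set of alter-table actions against the external table. The action names below are assumptions for illustration, not confirmed by this page:

```python
# Hypothetical subscription-control actions for an external table that
# subscribes to an Azure or S3 data source. The action names are assumptions.
subscription_actions = ['pause_subscription', 'resume_subscription', 'cancel_subscription']

# Hypothetical call shape:
# for action in subscription_actions:
#     h_db.alter_table(table_name='example.ext_product', action=action, value='')
```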

Ingestion Mode

Materialized external tables have three ingest modes available:

  • Perform a type inference (if necessary) and ingest all data
  • Perform a type inference and return the intuited table definition
  • Perform an ingest dry run, counting the number of valid records
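
The three behaviors above might map onto a single ingestion-mode option, sketched below. The option name `ingestion_mode` and its value strings are assumptions for illustration:

```python
# Hypothetical ingestion-mode options for a materialized external table.
# The option name 'ingestion_mode' and its values are assumptions.
ingest_full  = {'ingestion_mode': 'full'}                 # infer types if needed, load all data
ingest_infer = {'ingestion_mode': 'type_inference_only'}  # return the intuited table definition
ingest_dry   = {'ingestion_mode': 'dry_run'}              # count valid records; load nothing
```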

Loading Mode

Depending on the location of the source data files, one of the following schemes for distributing the extraction and loading of data can be assigned:

  • The head node can load all the data; all data must be available to the head node.
  • Each node loads all files that are available to it. This option works best when each node loads files from its own file system, to maximize performance. It is required, however, that each node has visibility to a set of files unique to it--no file can be visible to more than one node.
  • The cluster coordinates loading all the data across all the nodes. All files must be available to all nodes.
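
The three schemes above could be selected through a single loading-mode option, sketched here. The option name `loading_mode` and its value strings are assumptions for illustration:

```python
# Hypothetical loading-mode options matching the three schemes above.
# The option name 'loading_mode' and its values are assumptions.
head_load   = {'loading_mode': 'head'}                # head node loads everything
local_load  = {'loading_mode': 'distributed_local'}   # each node loads its own unique files
shared_load = {'loading_mode': 'distributed_shared'}  # all nodes coordinate over shared files
```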

Refresh on Start

Materialized external tables can be directed to refresh their data when the database starts up. Depending on the amount of data and the transfer, parse, & load time, it may be beneficial to load all data at startup or to delay the refresh until a later time. If the refresh is deferred, the table will continue to serve the data it held before startup.
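
Startup behavior is selected via the refresh_method option, which appears with the value 'on_start' in the data source example later on this page; the 'manual' value shown here as the deferred alternative is an assumption:

```python
# refresh_method 'on_start' is grounded in the example on this page;
# 'manual' as the deferred (refresh-on-demand) alternative is an assumption.
refresh_on_start = {'refresh_method': 'on_start'}
deferred_refresh = {'refresh_method': 'manual'}
```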

Error Mode

An error mode can be assigned to an external table, instructing it on how to handle source data field errors in parsing & loading:

  • Abort - stop the load process when an error is encountered
  • Skip - skip the current record when an error is encountered
  • Permissive - attempt to insert as much as possible of the current record when an error is encountered in parsing a field value, by inserting null into the column corresponding to that field
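
The three modes above might be expressed through a single error-handling option, sketched below. The option name `error_handling` and the exact value strings are assumptions for illustration:

```python
# Hypothetical error-mode options for an external table.
# The option name 'error_handling' and its values are assumptions.
abort_opts      = {'error_handling': 'abort'}       # stop the load on the first error
skip_opts       = {'error_handling': 'skip'}        # drop records with bad fields
permissive_opts = {'error_handling': 'permissive'}  # null out unparseable fields, keep the record
```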

Creating an External Table

To create an external table with the following features, using a local file as the source of data:

  • External table named ext_employee in the example schema
  • External source is a Parquet file located on the head node at data/employee.parquet, relative to the configured external files directory
  • External table has a primary key on the id column
  • Data is not refreshed on database startup
h_db.create_table_external(
    table_name = 'example.ext_employee',
    filepaths = 'data/employee.parquet',
    options = {
        'file_type': 'parquet',
        'primary_keys': 'id'
    }
)

To create an external table with the following features, using KiFS as the source of data:

  • External table named ext_product in the example schema
  • External source is in a KiFS directory named data
  • Source is a file named products.csv
  • Data is not refreshed on database startup
h_db.create_table_external(
    table_name = 'example.ext_product',
    filepaths = 'kifs://data/products.csv'
)

To create an external table with the following features, using a data source as the source of data:

  • External table named ext_product in the example schema
  • External source is a data source named product_ds
  • Source is a file named products.csv
  • Data is refreshed on database startup
h_db.create_table_external(
    table_name = 'example.ext_product',
    filepaths = 'products.csv',
    options = {
        'datasource_name': 'product_ds',
        'refresh_method': 'on_start'
    }
)