External Tables

An external table is a database object whose source data is located in one or more files external to the database. The source data can be located in any of the following locations:

  • on shared storage, mounted on each database node in the cluster
  • in KiFS
  • on a remote system, accessible via a data source

External tables are created via the /create/table/external native API call. For details on interfacing with external tables from SQL, see CREATE EXTERNAL TABLE. Using external tables in an environment with ring resiliency enabled has additional considerations.

Types

There are two types of external table, distinguished by the scheme each uses to pull data from the external source:

  • Materialized external tables pull data from external sources and cache that data in a persisted table within the database. Data is refreshed on demand and, configurably, on database startup. This mode ensures a much quicker response time, at the cost of the data being as current as the last refresh.
  • Logical external tables pull data from external sources upon servicing each query against the external table. This mode ensures queries on the external table will always return the most current source data, though there will be a performance penalty for reparsing & reloading the data from source files upon each query.
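
As a sketch, the choice between the two types could be expressed as an option on the create call. The option name `external_table_type` and its value strings are assumptions for illustration, not confirmed by this page:

```python
# Hypothetical options selecting the external table type at creation time.
# The option name 'external_table_type' and its values are assumptions.
materialized_options = {
    'external_table_type': 'materialized'  # cached copy; refreshed on demand
}
logical_options = {
    'external_table_type': 'logical'       # re-reads source files on every query
}

# Hypothetical call shape, mirroring the examples later on this page:
# h_db.create_table_external(
#     table_name = 'example.ext_orders',
#     filepaths = 'kifs://data/orders.csv',
#     options = logical_options
# )
```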

Data File Formats

There are several source data file formats supported for external tables:

  • Parquet - Apache Parquet data files
  • Text - delimited text files (CSV, PSV, TSV, etc.)--the parser is highly configurable and can support a wide variety of delimited text schemes
  • JSON - both standard JSON & GeoJSON files are supported
  • Shapefile - ArcGIS shapefiles

Regardless of the format selected, one or more source data fields can be used in the creation of the external table. Date/time fields can have their source formats specified.
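
For example, a pipe-delimited text source with a configurable header, a subset of source fields, and a date format might be described with options like the following. Aside from `file_type` (which appears in the Parquet example below), every option name and format string here is an assumption for illustration:

```python
# Hypothetical text-parser options for a delimited text source.
# The option names text_delimiter, text_has_header, columns_to_load, and
# column_formats -- and the date format-string syntax -- are assumptions.
text_options = {
    'file_type': 'delimited_text',
    'text_delimiter': '|',                   # PSV file
    'text_has_header': 'true',
    'columns_to_load': 'id,name,hire_date',  # use a subset of source fields
    # per-column date/time source format (syntax assumed for illustration)
    'column_formats': '{"hire_date": {"date": "YYYY-MM-DD"}}'
}
```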

Table Features

An external table can be assigned many of the features of standard tables, such as a primary key, as shown in the examples below.

External table names and column names must adhere to the supported naming criteria, and the name resolution follows that of tables.

Data Sources

If an external table is to use a data source, then data source connect privilege is required for the following actions:

  • Creating the external table
  • Refreshing the external table, if it is a materialized external table
  • Querying the external table, if it is a logical external table

An external table can also subscribe to a data source configured for Azure or S3, and can have that subscription paused, resumed, or cancelled.
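
Subscription control could be sketched as a set of alter-table actions against the external table. The action names below are assumptions for illustration, not confirmed by this page:

```python
# Hypothetical subscription-control actions for an external table that
# subscribes to an Azure or S3 data source. The action names are assumptions.
subscription_actions = ['pause_subscription', 'resume_subscription', 'cancel_subscription']

# Hypothetical call shape:
# for action in subscription_actions:
#     h_db.alter_table(table_name='example.ext_product', action=action, value='')
```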

Ingestion Mode

Materialized external tables have three ingest modes available:

  • Perform a type inference (if necessary) and ingest all data
  • Perform a type inference and return the intuited table definition
  • Perform an ingest dry run, counting the number of valid records
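
The three behaviors above might map onto a single ingestion-mode option, sketched below. The option name `ingestion_mode` and its value strings are assumptions for illustration:

```python
# Hypothetical ingestion-mode options for a materialized external table.
# The option name 'ingestion_mode' and its values are assumptions.
ingest_full  = {'ingestion_mode': 'full'}                 # infer types if needed, load all data
ingest_infer = {'ingestion_mode': 'type_inference_only'}  # return the intuited table definition
ingest_dry   = {'ingestion_mode': 'dry_run'}              # count valid records; load nothing
```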

Loading Mode

Depending on the location of the source data files, one of the following schemes for distributing the extraction and loading of data can be assigned:

  • The head node can load all the data; all data must be available to the head node.
  • Each node loads all files that are available to it. This option works best when each node loads files from its own file system, to maximize performance. It is required, however, that each node has visibility to a set of files unique to it--no file can be visible to more than one node.
  • The cluster coordinates loading all the data across all the nodes. All files must be available to all nodes.
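
The three schemes above could be selected through a single loading-mode option, sketched here. The option name `loading_mode` and its value strings are assumptions for illustration:

```python
# Hypothetical loading-mode options matching the three schemes above.
# The option name 'loading_mode' and its values are assumptions.
head_load   = {'loading_mode': 'head'}                # head node loads everything
local_load  = {'loading_mode': 'distributed_local'}   # each node loads its own unique files
shared_load = {'loading_mode': 'distributed_shared'}  # all nodes coordinate over shared files
```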

Refresh on Start

Materialized external tables can be directed to refresh their data when the database starts up. Depending on the amount of data and the transfer, parse, & load time, it may be beneficial to load all data at startup or to delay the refresh until a later time. If the refresh is deferred, the table will continue to serve the data it held before startup.
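
Startup behavior is selected via the refresh_method option, which appears with the value 'on_start' in the data source example later on this page; the 'manual' value shown here as the deferred alternative is an assumption:

```python
# refresh_method 'on_start' is grounded in the example on this page;
# 'manual' as the deferred (refresh-on-demand) alternative is an assumption.
refresh_on_start = {'refresh_method': 'on_start'}
deferred_refresh = {'refresh_method': 'manual'}
```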

Error Mode

An error mode can be assigned to an external table, instructing it on how to handle source data field errors in parsing & loading:

  • Abort - stop the load process when an error is encountered
  • Skip - skip the current record when an error is encountered
  • Permissive - attempt to insert as much as possible of the current record when an error is encountered in parsing a field value, by inserting null into the column corresponding to that field
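
The three modes above might be expressed through a single error-handling option, sketched below. The option name `error_handling` and the exact value strings are assumptions for illustration:

```python
# Hypothetical error-mode options for an external table.
# The option name 'error_handling' and its values are assumptions.
abort_opts      = {'error_handling': 'abort'}       # stop the load on the first error
skip_opts       = {'error_handling': 'skip'}        # drop records with bad fields
permissive_opts = {'error_handling': 'permissive'}  # null out unparseable fields, keep the record
```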

Creating an External Table

To create an external table with the following features, using a local file as the source of data:

  • External table named ext_employee in the example schema
  • External source is a Parquet file located on the head node at data/employee.parquet, relative to the configured external files directory
  • External table has a primary key on the id column
  • Data is not refreshed on database startup
h_db.create_table_external(
    table_name = 'example.ext_employee',
    filepaths = 'data/employee.parquet',
    options = {
        'file_type': 'parquet',
        'primary_keys': 'id'
    }
)

To create an external table with the following features, using KiFS as the source of data:

  • External table named ext_product in the example schema
  • External source is in a KiFS directory named data
  • Source is a file named products.csv
  • Data is not refreshed on database startup
h_db.create_table_external(
    table_name = 'example.ext_product',
    filepaths = 'kifs://data/products.csv'
)

To create an external table with the following features, using a data source as the source of data:

  • External table named ext_product in the example schema
  • External source is a data source named product_ds
  • Source is a file named products.csv
  • Data is refreshed on database startup
h_db.create_table_external(
    table_name = 'example.ext_product',
    filepaths = 'products.csv',
    options = {
        'datasource_name': 'product_ds',
        'refresh_method': 'on_start'
    }
)