- in KiFS
- on a remote system, accessible via a data source
Types
There are two types of external table, distinguished by the scheme each uses to pull data from the external source:- Materialized external tables pull data from external sources and cache that data in a persisted table within the database. Data is refreshed on demand and, configurably, on database startup. This mode ensures a much quicker response time, at the cost of the data being as current as the last refresh.
- Logical external tables pull data from external sources upon servicing each query against the external table. This mode ensures queries on the external table will always return the most current source data, though there will be a performance penalty for reparsing & reloading the data from source files upon each query.
Data File Formats
There are several source data file formats supported for external tables:- Parquet - Apache Parquet data files—see Parquet Limitations for the supported data types
- Text - delimited text files (CSV, PSV, TSV, etc.)—the parser is highly configurable and can support a wide variety of delimited text schemes; however, records spanning multiple lines are not supported
- JSON - both standard JSON & GeoJSON files are supported—see JSON/GeoJSON Limitations for the supported data types
- Shapefile - ArcGIS shapefiles
Table Features
An external table can be assigned many of the features of standard tables, some of which include: While an external table can have a primary key defined, there are two limitations to consider when configuring one this way:- A primary key collision between an incoming record and one already in the external table will result in the incoming record being rejected—there are no primary key record updates when using external tables
- A primary key collision between two records within the incoming data set will result in one of the two records being chosen non-deterministically for insert into the external table; assuming there is no collision between that record and one already in the external table.
Data Sources
If an external table is to use a data source, then data source connect privilege is required for the following actions:- Creating the external table
- Refreshing the external table, if it is materialized external table
- Querying the external table, if it is a logical external table
| Provider | Description | One-Time Load | Subscription |
|---|---|---|---|
| Azure | Microsoft blob storage | Yes | Yes |
| GCS | Google Cloud Storage | Yes | Yes |
| HDFS | Apache Hadoop Distributed File System | Yes | |
| JDBC | Java DataBase Connectivity; using a user-supplied driver or one of the drivers on the list supported list | Yes | Yes |
| S3 | Amazon S3 Bucket | Yes | Yes |
Ingestion Mode
Materialized external tables have three ingest modes available:- Perform a type inference (if necessary) and ingest all data
- Perform a type inference and return the intuited table definition
- Perform an ingest dry run, counting the number of valid records
Refresh on Start
Materialized external tables can be directed to refresh their data when the database starts up. Depending on the amount of data and the transfer, parse, & load time, it may be beneficial to load all data at startup, or delay the refresh until a later time. If the data is not refreshed, it will be the same as it was before startup.Error Mode
An error mode can be assigned to an external table, instructing it on how to handle source data field errors in parsing & loading:- Abort - stop the load process when an error is encountered
- Skip - skip the current record when an error is encountered
Creating an External Table
To create an external table with the following features, using KiFS as the source of data:- External table named
ext_productin theexampleschema - External source is in a KiFS directory named
data - Source is a file named
products.csv - Data is not refreshed on database startup
- External table named
ext_employeein theexampleschema - External source is in a KiFS directory named
data - Source is a Parquet file named
employee.parquet - External table has a primary key on the
idcolumn - Data is not refreshed on database startup
- External table named
ext_productin theexampleschema - External source is a data source named
product_dsin theexampleschema - Source is a file named
products.csv - Data is refreshed on database startup
- External table named
ext_employee_dept2in theexampleschema - External source is a JDBC data source named
jdbc_dsin theexampleschema - Source data is a remote query of employees in department 2 from that
database’s
example.ext_employeetable - Data is refreshed on database startup