/create/table/external

URL: http://GPUDB_IP_ADDRESS:GPUDB_PORT/create/table/external

Creates a new external table, which is a local database object whose source data resides outside the database. The source data can be located either on the cluster, where it is directly accessible to the database, or remotely, where it is accessed via a pre-defined external data source.

The external table can have its structure defined explicitly, via input parameter create_table_options, which contains many of the options from /create/table; or defined implicitly, inferred from the source data.
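As a minimal sketch, a request to this endpoint could be assembled as a JSON body whose fields mirror the input parameters documented below. All concrete names here (schema, table, data source, file paths) are hypothetical placeholders, not values from this reference:

```python
import json

# Hypothetical /create/table/external request body; field names follow the
# input parameters documented below.
request_body = {
    "table_name": "example_schema.ext_orders",  # hypothetical target table
    "filepaths": ["orders_2023_*.csv"],         # wildcard within the file name only
    "modify_columns": {},                       # not implemented yet; leave empty
    "create_table_options": {},                 # empty: infer structure from the data
    "options": {
        "datasource_name": "orders_ds",         # hypothetical external data source
        "file_type": "delimited_text",
        "error_handling": "permissive",
    },
}

payload = json.dumps(request_body)
print(payload)
```

The resulting payload would be POSTed to http://GPUDB_IP_ADDRESS:GPUDB_PORT/create/table/external; a client API, where available, typically handles this encoding for you.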

Input Parameter Description

table_name (string): Name of the table to be created, in [schema_name.]table_name format, using standard name resolution rules and meeting table naming criteria.
filepaths (array of strings): A list of file paths from which data will be sourced; wildcards (*) can be used to specify a group of files. If an external data source is specified in datasource_name, these file paths must resolve to accessible files at that data source location. Also, wildcards will only work when used within the file name, not the path. If no data source is specified, the files are assumed to be local to the database and must all be accessible to the gpudb user, residing on the path (or relative to the path) specified by the external files directory in the Kinetica configuration file.
modify_columns (map of string to maps of string to strings): Not implemented yet. The default value is an empty map ( {} ).
create_table_options (map of string to strings)

Options from /create/table, allowing the structure of the table to be defined independently of the data source. The default value is an empty map ( {} ).

Supported parameters (keys):

type_id: ID of a currently registered type. The default value is ''.
no_error_if_exists

If true, prevents an error from occurring if the table already exists and is of the given type. If a table with the same name but a different type exists, it is still an error. The default value is false. The supported values are:

  • true
  • false
is_replicated

Affects the distribution scheme for the table's data. If true and the given table has no explicit shard key defined, the table will be replicated. If false, the table will be sharded according to the shard key specified in the given type_id, or randomly sharded, if no shard key is specified. Note that a type containing a shard key cannot be used to create a replicated table. The default value is false. The supported values are:

  • true
  • false
foreign_keys: Semicolon-separated list of foreign keys, of the format '(source_column_name [, ...]) references target_table_name(primary_key_column_name [, ...]) [as foreign_key_name]'.
foreign_shard_key: Foreign shard key of the format 'source_column references shard_by_column from target_table(primary_key_column)'.
partition_type

Partitioning scheme to use.

Supported values:

RANGE: Use range partitioning.
INTERVAL: Use interval partitioning.
LIST: Use list partitioning.
HASH: Use hash partitioning.

partition_keys: Comma-separated list of partition keys, which are the columns or column expressions by which records will be assigned to partitions defined by partition_definitions.
partition_definitions: Comma-separated list of partition definitions, whose format depends on the choice of partition_type. See range partitioning, interval partitioning, list partitioning, or hash partitioning for example formats.
is_automatic_partition

If true, a new partition will be created for values which don't fall into an existing partition. Currently only supported for list partitions. The default value is false. The supported values are:

  • true
  • false
ttl: Sets the TTL of the table specified in input parameter table_name.
chunk_size: Indicates the number of records per chunk to be used for this table.
is_result_table

Indicates whether the table is a memory-only table. A result table cannot contain columns with store_only or text_search data-handling properties, nor any non-charN string columns, and it will not be retained if the server is restarted. The default value is false. The supported values are:

  • true
  • false
strategy_definition: The tier strategy for the table and its columns. See tier strategy usage for format and tier strategy examples for examples.
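A create_table_options map combining several of the keys above might look like the following sketch. The foreign-key text follows the documented format; the table and column names are hypothetical placeholders:

```python
# Hypothetical create_table_options map; keys are documented above, while the
# table/column names and chunk size are illustrative placeholders.
create_table_options = {
    "no_error_if_exists": "true",
    "is_replicated": "false",
    "foreign_keys": "(customer_id) references customer(id) as fk_customer",
    "chunk_size": "8000000",
}

# Every key and value must be a string, per the map-of-string-to-strings type.
assert all(isinstance(k, str) and isinstance(v, str)
           for k, v in create_table_options.items())
```

Note that boolean and numeric settings are still passed as strings, since the parameter is typed as a map of string to strings.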
options (map of string to strings)

Optional parameters. The default value is an empty map ( {} ).

Supported parameters (keys):

bad_record_table_name: Optional name of a table to which rejected records are written. The bad-record table has the following columns: line_number (long), line_rejected (string), error_message (string).
bad_record_table_limit: A positive integer indicating the maximum number of records that can be written to the bad-record table. The default value is 10000.
batch_size: Internal tuning parameter; the number of records per batch when inserting data.
column_formats: For each target column specified, applies the column-property-bound format to the source data loaded into that column. Each column format will contain a mapping of one or more of its column properties to an appropriate format for each property. Currently supported column properties include date, time, & datetime. The parameter value must be formatted as a JSON string of maps of column names to maps of column properties to their corresponding column formats, e.g., '{ "order_date" : { "date" : "%Y.%m.%d" }, "order_time" : { "time" : "%H:%M:%S" } }'. See default_column_formats for valid format syntax.
columns_to_load: Specifies a comma-delimited list of columns from the source data to load. If more than one file is being loaded, this list applies to all files. Column numbers can be specified discretely or as a range. For example, a value of '5,7,1..3' will insert values from the fifth column in the source data into the first column in the target table, from the seventh column in the source data into the second column in the target table, and from the first through third columns in the source data into the third through fifth columns in the target table. If the source data contains a header, column names matching the file header names may be provided instead of column numbers. If the target table doesn't exist, the table will be created with the columns in this order. If the target table does exist with columns in a different order than the source data, this list can be used to match the order of the target table. For example, a value of 'C, B, A' will create a three-column table with column C, followed by column B, followed by column A; or will insert those fields in that order into a table created with columns in that order. If the target table exists, the column names must match the source data field names for a name-mapping to be successful.
columns_to_skip: Specifies a comma-delimited list of columns from the source data to skip. Mutually exclusive with columns_to_load.
datasource_name: Name of an existing external data source from which the data file(s) specified in input parameter filepaths will be loaded.
default_column_formats: Specifies the default format to be applied to source data loaded into columns with the corresponding column property. Currently supported column properties include date, time, & datetime. This default column-property-bound format can be overridden by specifying a column property & format for a given target column in column_formats. For each specified annotation, the format will apply to all columns with that annotation unless a custom column_formats for that annotation is specified. The parameter value must be formatted as a JSON string that is a map of column properties to their respective column formats, e.g., '{ "date" : "%Y.%m.%d", "time" : "%H:%M:%S" }'. Column formats are specified as a string of control characters and plain text. The supported control characters are 'Y', 'm', 'd', 'H', 'M', and 'S', which follow the Linux 'strptime()' specification, as well as 's', which specifies seconds and fractional seconds (though the fractional component will be truncated past milliseconds). Formats for the 'date' annotation must include the 'Y', 'm', and 'd' control characters. Formats for the 'time' annotation must include the 'H', 'M', and either 'S' or 's' (but not both) control characters. Formats for the 'datetime' annotation meet both the 'date' and 'time' control character requirements. For example, '{"datetime" : "%m/%d/%Y %H:%M:%S" }' would be used to interpret text as "05/04/2000 12:12:11".
error_handling

Specifies how errors should be handled upon insertion. The default value is permissive.

Supported values:

permissive: Records with missing columns are populated with nulls if possible; otherwise, malformed records are skipped.
ignore_bad_records: Malformed records are skipped.
abort: Current insertion is stopped and the entire operation is aborted when an error is encountered. Primary key collisions are considered abortable errors in this mode.
external_table_type

Specifies whether the external table holds a local copy of the external data. The default value is materialized.

Supported values:

materialized: Loads a copy of the external data into the database, refreshed on demand.
logical: External data will not be loaded into the database; the data will be retrieved from the source upon servicing each query against the external table.
file_type

Specifies the type of the external data file(s) used as the source of data for this table. The default value is delimited_text.

Supported values:

delimited_text: Delimited text file format; e.g., CSV, TSV, PSV, etc.
parquet: Apache Parquet file format.
json: JSON file format.
ingestion_mode

For materialized external tables, whether to do a full load, dry run, or perform a type inference on the source data. The default value is full.

Supported values:

full: Run a type inference on the source data (if needed) and ingest.
dry_run: Does not load data, but walks through the source data and determines the number of valid records, taking into account the current mode of error_handling.
type_inference_only: Infer the type of the source data and return, without creating the table and ingesting data. The inferred type is returned in the response.
loading_mode

Scheme for distributing the extraction and loading of data from the source data file(s). The default value is head.

Supported values:

head: The head node loads all data. All files must be available to the head node.
distributed_shared: The head node coordinates loading data by worker processes across all nodes from shared files available to all workers. NOTE: Instead of existing on a shared source, the files can be duplicated on a source local to each host to improve performance, though the files must appear as the same data set from the perspective of all hosts performing the load.
distributed_local: A single worker process on each node loads all files that are available to it. This option works best when each worker loads files from its own file system, to maximize performance. In order to avoid data duplication, either each worker performing the load needs to have visibility to a set of files unique to it (no file is visible to more than one node) or the target table needs to have a primary key (which will allow the worker to automatically deduplicate data). NOTE: If the table's columns aren't defined, table structure will be determined by the head node. If the head node has no files local to it, it will be unable to determine the structure and the request will fail. This mode should not be used in conjunction with a data source, as data sources are seen by all worker processes as shared resources with no 'local' component. If the head node is configured to have no worker processes, no data strictly accessible to the head node will be loaded.

primary_keys: Optional comma-separated list of column names to set as primary keys, when not specified in the type. The default value is ''.
shard_keys: Optional comma-separated list of column names to set as shard keys, when not specified in the type. The default value is ''.
refresh_method

Method by which the table can be refreshed from its source data. The default value is manual.

Supported values:

manual: Refresh only occurs when manually requested by invoking the refresh action of /alter/table on this table.
on_start: Refresh the table on database startup and when manually requested by invoking the refresh action of /alter/table on this table.

text_comment_string: Specifies the character string that should be interpreted as a comment line prefix in the source data. All lines in the data starting with the provided string are ignored. For delimited_text file_type only. The default value is '#'.
text_delimiter: Specifies the character delimiting field values in the source data and field names in the header (if present). For delimited_text file_type only. The default value is ','.
text_escape_character: Specifies the character that is used to escape other characters in the source data. An 'a', 'b', 'f', 'n', 'r', 't', or 'v' preceded by an escape character will be interpreted as the ASCII bell, backspace, form feed, line feed, carriage return, horizontal tab, & vertical tab, respectively. For example, the escape character followed by an 'n' will be interpreted as a newline within a field value. The escape character can also be used to escape the quoting character, and will be treated as an escape character whether it is within a quoted field value or not. For delimited_text file_type only.
text_has_header

Indicates whether the source data contains a header row. For delimited_text file_type only. The default value is true. The supported values are:

  • true
  • false
text_header_property_delimiter: Specifies the delimiter for column properties in the header row (if present). Cannot be set to the same value as text_delimiter. For delimited_text file_type only. The default value is '|'.
text_null_string: Specifies the character string that should be interpreted as a null value in the source data. For delimited_text file_type only. The default value is ''.
text_quote_character: Specifies the character that should be interpreted as a field value quoting character in the source data. The character must appear at the beginning and end of a field value to take effect. Delimiters within quoted fields are treated as literals and not delimiters. Within a quoted field, two consecutive quote characters will be interpreted as a single literal quote character, effectively escaping it. To not have a quote character, specify an empty string. For delimited_text file_type only. The default value is '"'.
num_tasks_per_rank: Optional number of tasks per rank for reading files. The default value is the value of external_file_reader_num_tasks.
type_inference_mode

Optimizes type inference for either accuracy or speed. The default value is accuracy.

Supported values:

accuracy: Scans all data to get exactly-typed and sized columns for all data present.
speed: Picks the widest possible column types so that all values will fit with minimum data scanned.
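To illustrate the delimited-text keys above, the sketch below builds a hypothetical options map and locally checks the documented default_column_formats example against Python's strptime, which uses the same control characters:

```python
import json
from datetime import datetime

# Hypothetical options map; the column numbers and formats are illustrative.
options = {
    "file_type": "delimited_text",
    "text_delimiter": ",",
    "text_has_header": "true",
    "columns_to_load": "5,7,1..3",
    "default_column_formats": json.dumps({"datetime": "%m/%d/%Y %H:%M:%S"}),
    "error_handling": "ignore_bad_records",
}

# Since the control characters follow strptime(), a format can be
# sanity-checked locally before running a load:
fmt = json.loads(options["default_column_formats"])["datetime"]
parsed = datetime.strptime("05/04/2000 12:12:11", fmt)
print(parsed)  # 2000-05-04 12:12:11
```

This mirrors the example from the default_column_formats description: the '%m/%d/%Y %H:%M:%S' format interprets "05/04/2000 12:12:11" as May 4, 2000 at 12:12:11.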

Output Parameter Description

The GPUdb server embeds the endpoint response inside a standard response structure which contains status information and the actual response to the query. Here is a description of the various fields of the wrapper:

status (String): 'OK' or 'ERROR'.
message (String): Empty if success, or an error message.
data_type (String): 'create_table_external_request', or 'none' in case of an error.
data (String): Empty string.
data_str (JSON or String)

This embedded JSON represents the result of the /create/table/external endpoint:

table_name (string): Value of input parameter table_name.
type_id (string): ID of the currently registered table structure type for this external table.
type_definition (string): A JSON string describing the columns of the created external table.
type_label (string): The user-defined description associated with the table's structure.
type_properties (map of string to arrays of strings): A mapping of each external table column name to an array of column properties associated with that column.
count_inserted (long): Number of records inserted into the external table.
count_skipped (long): Number of records skipped, when not running in abort error-handling mode.
count_updated (long): [Not yet implemented] Number of records updated within the external table.
info (map of string to strings): Additional information.
files (array of strings)

Empty string in case of an error.
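Putting the wrapper together, a client might unwrap a response as follows. The sample values are fabricated placeholders shaped per the field descriptions above, not real server output:

```python
import json

# Fabricated sample wrapper response, shaped per the output description above.
raw = json.dumps({
    "status": "OK",
    "message": "",
    "data_type": "create_table_external_request",
    "data": "",
    "data_str": json.dumps({
        "table_name": "example_schema.ext_orders",  # hypothetical table
        "type_id": "12345",
        "count_inserted": 1000,
        "count_skipped": 2,
        "count_updated": 0,
        "info": {},
        "files": ["orders_2023_01.csv"],
    }),
})

wrapper = json.loads(raw)
if wrapper["status"] != "OK":
    raise RuntimeError(wrapper["message"])
result = json.loads(wrapper["data_str"])  # embedded endpoint result
print(result["count_inserted"], result["count_skipped"])
```

The key step is that data_str is itself a JSON string, so it must be decoded a second time after the outer wrapper is parsed.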