public class InsertRecordsFromFilesRequest extends Object implements org.apache.avro.generic.IndexedRecord
A set of parameters for GPUdb.insertRecordsFromFiles.
Reads from one or more files and inserts the data into a new or existing table. The source data can be located either in KiFS; on the cluster, accessible to the database; or remotely, accessible via a pre-defined external data source.
For delimited text files, there are two loading schemes: positional and name-based. The name-based loading scheme is enabled when the file has a header present and TEXT_HAS_HEADER is set to TRUE. In this scheme, the source file(s) field names must match the target table's column names exactly; however, the source file can have more fields than the target table has columns. If ERROR_HANDLING is set to PERMISSIVE, the source file can have fewer fields than the target table has columns. If the name-based loading scheme is being used, names matching the file header's names may be provided to COLUMNS_TO_LOAD instead of numbers, but ranges are not supported.
Note: Due to data being loaded in parallel, there is no insertion order guaranteed. For tables with primary keys, in the case of a primary key collision, this means it is indeterminate which record will be inserted first and remain, while the rest of the colliding key records are discarded.
Returns once all files are processed.
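For orientation, here is a minimal sketch of issuing this request from the Java client. It assumes a reachable Kinetica instance and that the request and response classes live under com.gpudb.protocol; the endpoint URL, table name, and file path are placeholders, and the response accessor shown is an assumption about the corresponding response class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.gpudb.GPUdb;
import com.gpudb.GPUdbException;
import com.gpudb.protocol.InsertRecordsFromFilesRequest;
import com.gpudb.protocol.InsertRecordsFromFilesResponse;

public class LoadFromFileSketch {
    public static void main(String[] args) throws GPUdbException {
        // Placeholder endpoint; substitute a real Kinetica URL.
        GPUdb gpudb = new GPUdb("http://localhost:9191");

        // One or more source files; kifs:// paths refer to files stored in KiFS.
        List<String> filepaths = new ArrayList<>();
        filepaths.add("kifs://data/orders.csv");

        // modifyColumns is not implemented yet, so pass an empty map.
        Map<String, Map<String, String>> modifyColumns = new HashMap<>();

        // Let the table be created from the inferred type if it does not exist.
        Map<String, String> createTableOptions = new HashMap<>();

        // Treat the first line of the delimited file as a header row.
        Map<String, String> options = new HashMap<>();
        options.put(InsertRecordsFromFilesRequest.Options.TEXT_HAS_HEADER,
                    InsertRecordsFromFilesRequest.Options.TRUE);

        InsertRecordsFromFilesRequest request = new InsertRecordsFromFilesRequest(
                "ki_home.orders", filepaths, modifyColumns, createTableOptions, options);

        InsertRecordsFromFilesResponse response = gpudb.insertRecordsFromFiles(request);
        System.out.println("Records inserted: " + response.getCountInserted());
    }
}
```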
Modifier and Type | Class and Description |
---|---|
static class | InsertRecordsFromFilesRequest.CreateTableOptions: A set of string constants for the InsertRecordsFromFilesRequest parameter createTableOptions. |
static class | InsertRecordsFromFilesRequest.Options: A set of string constants for the InsertRecordsFromFilesRequest parameter options. |

Constructor and Description |
---|
InsertRecordsFromFilesRequest(): Constructs an InsertRecordsFromFilesRequest object with default parameters. |
InsertRecordsFromFilesRequest(String tableName, List<String> filepaths, Map<String,Map<String,String>> modifyColumns, Map<String,String> createTableOptions, Map<String,String> options): Constructs an InsertRecordsFromFilesRequest object with the specified parameters. |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object obj) |
Object | get(int index): This method supports the Avro framework and is not intended to be called directly by the user. |
static org.apache.avro.Schema | getClassSchema(): This method supports the Avro framework and is not intended to be called directly by the user. |
Map<String,String> | getCreateTableOptions(): Options from GPUdb.createTable, allowing the structure of the table to be defined independently of the data source, when creating the target table. |
List<String> | getFilepaths(): A list of file paths from which data will be sourced. |
Map<String,Map<String,String>> | getModifyColumns(): Not implemented yet. |
Map<String,String> | getOptions(): Optional parameters. |
org.apache.avro.Schema | getSchema(): This method supports the Avro framework and is not intended to be called directly by the user. |
String | getTableName(): Name of the table into which the data will be inserted, in [schema_name.]table_name format, using standard name resolution rules. |
int | hashCode() |
void | put(int index, Object value): This method supports the Avro framework and is not intended to be called directly by the user. |
InsertRecordsFromFilesRequest | setCreateTableOptions(Map<String,String> createTableOptions): Options from GPUdb.createTable, allowing the structure of the table to be defined independently of the data source, when creating the target table. |
InsertRecordsFromFilesRequest | setFilepaths(List<String> filepaths): A list of file paths from which data will be sourced. |
InsertRecordsFromFilesRequest | setModifyColumns(Map<String,Map<String,String>> modifyColumns): Not implemented yet. |
InsertRecordsFromFilesRequest | setOptions(Map<String,String> options): Optional parameters. |
InsertRecordsFromFilesRequest | setTableName(String tableName): Name of the table into which the data will be inserted, in [schema_name.]table_name format, using standard name resolution rules. |
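Since each setter returns this, a request can also be assembled in a fluent, builder-like style. The brief sketch below is equivalent to passing the same values to the five-argument constructor; the table name and file path are placeholders.

```java
import java.util.Arrays;
import java.util.Collections;

import com.gpudb.protocol.InsertRecordsFromFilesRequest;

public class FluentRequestSketch {
    // Chain the setters to build the same request as the constructor would.
    static InsertRecordsFromFilesRequest build() {
        return new InsertRecordsFromFilesRequest()
                .setTableName("ki_home.orders")
                .setFilepaths(Arrays.asList("kifs://data/orders.csv"))
                .setOptions(Collections.singletonMap(
                        InsertRecordsFromFilesRequest.Options.ERROR_HANDLING,
                        InsertRecordsFromFilesRequest.Options.PERMISSIVE));
    }
}
```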
public InsertRecordsFromFilesRequest()
public InsertRecordsFromFilesRequest(String tableName, List<String> filepaths, Map<String,Map<String,String>> modifyColumns, Map<String,String> createTableOptions, Map<String,String> options)
tableName
- Name of the table into which the data will be
inserted, in [schema_name.]table_name format, using
standard name resolution rules. If the table
does not exist, the table will be created using either
an existing TYPE_ID
or the type inferred from the file, and the new table
name will have to meet standard table naming criteria.
filepaths
- A list of file paths from which data will be sourced;
For paths in KiFS, use the uri prefix of kifs://
followed by the path to a file or directory. File
matching by prefix is supported, e.g. kifs://dir/file
would match dir/file_1 and dir/file_2. When prefix
matching is used, the path must start with a full,
valid KiFS directory name. If an external data source
is specified in DATASOURCE_NAME
, these file paths must resolve to
accessible files at that data source location. Prefix
matching is supported. If the data source is hdfs,
prefixes must be aligned with directories, i.e.
partial file names will not match. If no data source
is specified, the files are assumed to be local to the
database and must all be accessible to the gpudb user,
residing on the path (or relative to the path)
specified by the external files directory in the
Kinetica configuration file. Wildcards (*)
can be used to specify a group of files. Prefix
matching is supported, the prefixes must be aligned
with directories. If the first path ends in .tsv, the
text delimiter will be defaulted to a tab character.
If the first path ends in .psv, the text delimiter
will be defaulted to a pipe character (|).
modifyColumns
- Not implemented yet. The default value is an empty Map.
createTableOptions
- Options from GPUdb.createTable
, allowing the structure of
the table to be defined independently of the
data source, when creating the target table.
TYPE_ID
: ID of a currently
registered type.
NO_ERROR_IF_EXISTS
: If TRUE
,
prevents an error from occurring if
the table already exists and is of
the given type. If a table with the
same name but a different type
exists, it is still an error.
Supported values:
The default value is FALSE
.
IS_REPLICATED
: Affects the distribution scheme
for the table's data. If TRUE
and the
given table has no explicit shard key defined,
the table will be replicated. If
FALSE
, the table will be sharded according
to the shard key specified in the
given TYPE_ID
,
or randomly sharded,
if no shard key is specified. Note
that a type containing a shard key
cannot be used to create a replicated
table.
Supported values:
The default value is FALSE
.
FOREIGN_KEYS
: Semicolon-separated
list of foreign keys, of
the format '(source_column_name [,
...]) references
target_table_name(primary_key_column_name
[, ...]) [as foreign_key_name]'.
FOREIGN_SHARD_KEY
: Foreign shard key
of the format 'source_column
references shard_by_column from
target_table(primary_key_column)'.
PARTITION_TYPE
: Partitioning scheme
to use.
Supported values:
RANGE
: Use range
partitioning.
INTERVAL
: Use interval
partitioning.
LIST
: Use list
partitioning.
HASH
: Use hash
partitioning.
SERIES
: Use series
partitioning.
PARTITION_KEYS
: Comma-separated list
of partition keys, which are the
columns or column expressions by
which records will be assigned to
partitions defined by PARTITION_DEFINITIONS
.
PARTITION_DEFINITIONS
:
Comma-separated list of partition
definitions, whose format depends on
the choice of PARTITION_TYPE
. See range partitioning,
interval
partitioning, list partitioning,
hash partitioning,
or series partitioning
for example formats.
IS_AUTOMATIC_PARTITION
: If TRUE
, a new
partition will be created for values
which don't fall into an existing
partition. Currently, only supported
for list partitions.
Supported values:
The default value is FALSE
.
TTL
:
Sets the TTL of the table
specified in tableName
.
CHUNK_SIZE
: Indicates the number of
records per chunk to be used for this
table.
CHUNK_COLUMN_MAX_MEMORY
: Indicates
the target maximum data size for each
column in a chunk to be used for this
table.
CHUNK_MAX_MEMORY
: Indicates the
target maximum data size for all
columns in a chunk to be used for
this table.
IS_RESULT_TABLE
: Indicates whether
the table is a memory-only table.
A result table cannot contain columns
with text_search data-handling, and
it will not be retained if the server
is restarted.
Supported values:
The default value is FALSE
.
STRATEGY_DEFINITION
: The tier strategy for
the table and its columns.
The default value is an empty Map.
options
- Optional parameters.
BAD_RECORD_TABLE_NAME
: Name of a table to which
records that were rejected are written. The
bad-record-table has the following columns:
line_number (long), line_rejected (string),
error_message (string). When ERROR_HANDLING
is ABORT
, bad records table is not
populated.
BAD_RECORD_TABLE_LIMIT
: A positive integer
indicating the maximum number of records that
can be written to the bad-record-table. The
default value is '10000'.
BAD_RECORD_TABLE_LIMIT_PER_INPUT
: For
subscriptions, a positive integer indicating the
maximum number of records that can be written to
the bad-record-table per file/payload. Default
value will be BAD_RECORD_TABLE_LIMIT
and total size of the
table per rank is limited to BAD_RECORD_TABLE_LIMIT
.
BATCH_SIZE
: Number of
records to insert per batch when inserting data.
The default value is '50000'.
COLUMN_FORMATS
:
For each target column specified, applies the
column-property-bound format to the source data
loaded into that column. Each column format
will contain a mapping of one or more of its
column properties to an appropriate format for
each property. Currently supported column
properties include date, time, & datetime. The
parameter value must be formatted as a JSON
string of maps of column names to maps of column
properties to their corresponding column
formats, e.g., '{ "order_date" : { "date" :
"%Y.%m.%d" }, "order_time" : { "time" :
"%H:%M:%S" } }'. See DEFAULT_COLUMN_FORMATS
for valid format syntax.
COLUMNS_TO_LOAD
:
Specifies a comma-delimited list of columns from
the source data to load. If more than one file
is being loaded, this list applies to all files.
Column numbers can be specified discretely or as
a range. For example, a value of '5,7,1..3'
will insert values from the fifth column in the
source data into the first column in the target
table, from the seventh column in the source
data into the second column in the target table,
and from the first through third columns in the
source data into the third through fifth columns
in the target table. If the source data
contains a header, column names matching the
file header names may be provided instead of
column numbers. If the target table doesn't
exist, the table will be created with the
columns in this order. If the target table does
exist with columns in a different order than the
source data, this list can be used to match the
order of the target table. For example, a value
of 'C, B, A' will create a three column table
with column C, followed by column B, followed by
column A; or will insert those fields in that
order into a table created with columns in that
order. If the target table exists, the column
names must match the source data field names for
a name-mapping to be successful. Mutually
exclusive with COLUMNS_TO_SKIP
.
COLUMNS_TO_SKIP
:
Specifies a comma-delimited list of columns from
the source data to skip. Mutually exclusive
with COLUMNS_TO_LOAD
.
COMPRESSION_TYPE
: Source data compression type.
Supported values:
NONE
: No
compression.
AUTO
: Auto detect
compression type
GZIP
: gzip file
compression.
BZIP2
: bzip2 file
compression.
The default value is AUTO.
DATASOURCE_NAME
:
Name of an existing external data source from
which data file(s) specified in filepaths
will be loaded
DEFAULT_COLUMN_FORMATS
: Specifies the default
format to be applied to source data loaded into
columns with the corresponding column property.
Currently supported column properties include
date, time, & datetime. This default
column-property-bound format can be overridden
by specifying a column property & format for a
given target column in COLUMN_FORMATS
. For each
specified annotation, the format will apply to
all columns with that annotation unless a custom
COLUMN_FORMATS
for that annotation is specified. The parameter
value must be formatted as a JSON string that is
a map of column properties to their respective
column formats, e.g., '{ "date" : "%Y.%m.%d",
"time" : "%H:%M:%S" }'. Column formats are
specified as a string of control characters and
plain text. The supported control characters are
'Y', 'm', 'd', 'H', 'M', 'S', and 's', which
follow the Linux 'strptime()' specification, as
well as 's', which specifies seconds and
fractional seconds (though the fractional
component will be truncated past milliseconds).
Formats for the 'date' annotation must include
the 'Y', 'm', and 'd' control characters.
Formats for the 'time' annotation must include
the 'H', 'M', and either 'S' or 's' (but not
both) control characters. Formats for the
'datetime' annotation meet both the 'date' and
'time' control character requirements. For
example, '{"datetime" : "%m/%d/%Y %H:%M:%S" }'
would be used to interpret text as "05/04/2000
12:12:11"
ERROR_HANDLING
:
Specifies how errors should be handled upon
insertion.
Supported values:
PERMISSIVE
:
Records with missing columns are
populated with nulls if possible;
otherwise, the malformed records are
skipped.
IGNORE_BAD_RECORDS
: Malformed records
are skipped.
ABORT
: Stops
current insertion and aborts entire
operation when an error is encountered.
Primary key collisions are considered
abortable errors in this mode.
The default value is ABORT.
FILE_TYPE
: Specifies
the type of the file(s) whose records will be
inserted.
Supported values:
AVRO
: Avro file
format
DELIMITED_TEXT
: Delimited text file
format; e.g., CSV, TSV, PSV, etc.
GDB
: Esri/GDB file
format
JSON
: Json file
format
PARQUET
: Apache
Parquet file format
SHAPEFILE
:
ShapeFile file format
The default value is DELIMITED_TEXT.
FLATTEN_COLUMNS
:
Specifies how to handle nested columns.
Supported values:
TRUE
: Break up
nested columns to multiple columns
FALSE
: Treat
nested columns as json columns instead
of flattening
The default value is FALSE.
GDAL_CONFIGURATION_OPTIONS
: Comma separated
list of gdal conf options for the specific
request: key=value
IGNORE_EXISTING_PK
: Specifies the record
collision error-suppression policy for inserting
into a table with a primary key, only used when
not in upsert mode (upsert mode is disabled when
UPDATE_ON_EXISTING_PK
is FALSE
). If set to TRUE
,
any record being inserted that is rejected for
having primary key values that match those of an
existing table record will be ignored with no
error generated. If FALSE
, the rejection of any record for having
primary key values matching an existing record
will result in an error being reported, as
determined by ERROR_HANDLING
. If the specified table does
not have a primary key or if upsert mode is in
effect (UPDATE_ON_EXISTING_PK
is TRUE
), then this option has no effect.
Supported values:
TRUE
: Ignore new
records whose primary key values collide
with those of existing records
FALSE
: Treat as
errors any new records whose primary key
values collide with those of existing
records
The default value is FALSE.
INGESTION_MODE
:
Whether to do a full load, dry run, or perform a
type inference on the source data.
Supported values:
FULL
: Run a type
inference on the source data (if needed)
and ingest
DRY_RUN
: Does
not load data, but walks through the
source data and determines the number of
valid records, taking into account the
current mode of ERROR_HANDLING
.
TYPE_INFERENCE_ONLY
: Infer the type of
the source data and return, without
ingesting any data. The inferred type
is returned in the response.
The default value is FULL.
KAFKA_CONSUMERS_PER_RANK
: Number of Kafka
consumer threads per rank (valid range 1-6). The
default value is '1'.
KAFKA_GROUP_ID
:
The group id to be used when consuming data from
a Kafka topic (valid only for Kafka datasource
subscriptions).
KAFKA_OFFSET_RESET_POLICY
: Policy to determine
whether the Kafka data consumption starts either
at earliest offset or latest offset.
Supported values:
The default value is EARLIEST
.
KAFKA_OPTIMISTIC_INGEST
: Enable optimistic
ingestion where Kafka topic offsets and table
data are committed independently to achieve
parallelism.
Supported values:
The default value is FALSE
.
KAFKA_SUBSCRIPTION_CANCEL_AFTER
: Sets the Kafka
subscription lifespan (in minutes). Expired
subscription will be cancelled automatically.
KAFKA_TYPE_INFERENCE_FETCH_TIMEOUT
: Maximum
time to collect Kafka messages before type
inferencing on the set of them.
LAYER
: Geo files layer(s)
name(s): comma separated.
LOADING_MODE
:
Scheme for distributing the extraction and
loading of data from the source data file(s).
This option applies only when loading files that
are local to the database.
Supported values:
HEAD
: The head node
loads all data. All files must be
available to the head node.
DISTRIBUTED_SHARED
: The head node
coordinates loading data by worker
processes across all nodes from shared
files available to all workers. NOTE:
Instead of existing on a shared source,
the files can be duplicated on a source
local to each host to improve
performance, though the files must
appear as the same data set from the
perspective of all hosts performing the
load.
DISTRIBUTED_LOCAL
: A single worker
process on each node loads all files
that are available to it. This option
works best when each worker loads files
from its own file system, to maximize
performance. In order to avoid data
duplication, either each worker
performing the load needs to have
visibility to a set of files unique to
it (no file is visible to more than one
node) or the target table needs to have
a primary key (which will allow the
worker to automatically deduplicate
data). NOTE: If the target table
doesn't exist, the table structure will
be determined by the head node. If the
head node has no files local to it, it
will be unable to determine the
structure and the request will fail. If
the head node is configured to have no
worker processes, no data strictly
accessible to the head node will be
loaded.
The default value is HEAD.
LOCAL_TIME_OFFSET
: Apply an offset to Avro
local timestamp columns.
MAX_RECORDS_TO_LOAD
: Limit the number of
records to load in this request: if this number
is larger than BATCH_SIZE
, then the number of records loaded
will be limited to the next whole number of
BATCH_SIZE
(per
working thread).
NUM_TASKS_PER_RANK
: Number of tasks for reading
file per rank. Default will be system
configuration parameter,
external_file_reader_num_tasks.
POLL_INTERVAL
: If
TRUE
, the number of seconds
between attempts to load external files into the
table. If zero, polling will be continuous as
long as data is found. If no data is found, the
interval will steadily increase to a maximum of
60 seconds. The default value is '0'.
PRIMARY_KEYS
: Comma
separated list of column names to set as primary
keys, when not specified in the type.
SCHEMA_REGISTRY_SCHEMA_NAME
: Name of the Avro
schema in the schema registry to use when
reading Avro records.
SHARD_KEYS
: Comma
separated list of column names to set as shard
keys, when not specified in the type.
SKIP_LINES
: Skip
number of lines from the beginning of the file.
START_OFFSETS
:
Starting offsets by partition to fetch from
kafka. A comma separated list of
partition:offset pairs.
SUBSCRIBE
:
Continuously poll the data source to check for
new data and load it into the table.
Supported values:
The default value is FALSE
.
TABLE_INSERT_MODE
: Insertion scheme to use when
inserting records from multiple shapefiles.
Supported values:
SINGLE
: Insert
all records into a single table.
TABLE_PER_FILE
: Insert records from
each file into a new table corresponding
to that file.
The default value is SINGLE.
TEXT_COMMENT_STRING
: Specifies the character
string that should be interpreted as a comment
line prefix in the source data. All lines in
the data starting with the provided string are
ignored. For DELIMITED_TEXT
FILE_TYPE
only. The default value is '#'.
TEXT_DELIMITER
:
Specifies the character delimiting field values
in the source data and field names in the header
(if present). For DELIMITED_TEXT
FILE_TYPE
only. The default value is ','.
TEXT_ESCAPE_CHARACTER
: Specifies the character
that is used to escape other characters in the
source data. An 'a', 'b', 'f', 'n', 'r', 't',
or 'v' preceded by an escape character will be
interpreted as the ASCII bell, backspace, form
feed, line feed, carriage return, horizontal
tab, & vertical tab, respectively. For example,
the escape character followed by an 'n' will be
interpreted as a newline within a field value.
The escape character can also be used to escape
the quoting character, and will be treated as an
escape character whether it is within a quoted
field value or not. For DELIMITED_TEXT
FILE_TYPE
only.
TEXT_HAS_HEADER
:
Indicates whether the source data contains a
header row. For DELIMITED_TEXT
FILE_TYPE
only.
Supported values:
The default value is TRUE
.
TEXT_HEADER_PROPERTY_DELIMITER
: Specifies the
delimiter for column properties in the
header row (if present). Cannot be set to same
value as TEXT_DELIMITER
. For DELIMITED_TEXT
FILE_TYPE
only. The default
value is '|'.
TEXT_NULL_STRING
: Specifies the character
string that should be interpreted as a null
value in the source data. For DELIMITED_TEXT
FILE_TYPE
only. The default
value is '\N'.
TEXT_QUOTE_CHARACTER
: Specifies the character
that should be interpreted as a field value
quoting character in the source data. The
character must appear at beginning and end of
field value to take effect. Delimiters within
quoted fields are treated as literals and not
delimiters. Within a quoted field, two
consecutive quote characters will be interpreted
as a single literal quote character, effectively
escaping it. To not have a quote character,
specify an empty string. For DELIMITED_TEXT
FILE_TYPE
only. The default
value is '"'.
TEXT_SEARCH_COLUMNS
: Add 'text_search' property
to internally inferred string columns. Comma-separated
list of column names or '*' for all columns. To add
'text_search' property only to string columns greater
than or equal to a minimum size, also set the
TEXT_SEARCH_MIN_COLUMN_LENGTH option.
TEXT_SEARCH_MIN_COLUMN_LENGTH
: Set the minimum
column size for strings to apply the
'text_search' property to. Used only when TEXT_SEARCH_COLUMNS
has a value.
TRUNCATE_STRINGS
: If set to TRUE
, truncate string values that are longer
than the column's type size.
Supported values:
The default value is FALSE
.
TRUNCATE_TABLE
:
If set to TRUE
, truncates
the table specified by tableName
prior
to loading the file(s).
Supported values:
The default value is FALSE
.
TYPE_INFERENCE_MODE
: Optimize type inferencing
for either speed or accuracy.
Supported values:
ACCURACY
: Scans
data to get exactly-typed & sized
columns for all data scanned.
SPEED
: Scans data
and picks the widest possible column
types so that 'all' values will fit with
minimum data scanned
The default value is ACCURACY.
UPDATE_ON_EXISTING_PK
: Specifies the record
collision policy for inserting into a table with
a primary key. If set to TRUE
, any existing table record
with primary key values that match those of a
record being inserted will be replaced by that
new record (the new data will be 'upserted'). If
set to FALSE
, any existing
table record with primary key values that match
those of a record being inserted will remain
unchanged, while the new record will be rejected
and the error handled as determined by IGNORE_EXISTING_PK
&
ERROR_HANDLING
.
If the specified table does not have a primary
key, then this option has no effect.
Supported values:
TRUE
: Upsert new
records when primary keys match existing
records
FALSE
: Reject new
records when primary keys match existing
records
The default value is FALSE.
The default value is an empty Map.
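To make the two option maps above concrete, the sketch below combines a couple of create-table options with load options that route malformed records to a bad-record table. The constant names come from the nested CreateTableOptions and Options classes documented here; the import path, table names, and limit value are placeholders/assumptions.

```java
import java.util.HashMap;
import java.util.Map;

import com.gpudb.protocol.InsertRecordsFromFilesRequest.CreateTableOptions;
import com.gpudb.protocol.InsertRecordsFromFilesRequest.Options;

public class OptionMapsSketch {
    // Create-table options: tolerate an existing table of the same type and
    // replicate the new table instead of sharding it.
    static Map<String, String> createTableOptions() {
        Map<String, String> createTableOptions = new HashMap<>();
        createTableOptions.put(CreateTableOptions.NO_ERROR_IF_EXISTS, CreateTableOptions.TRUE);
        createTableOptions.put(CreateTableOptions.IS_REPLICATED, CreateTableOptions.TRUE);
        return createTableOptions;
    }

    // Load options: skip malformed records and write them to a bad-record table.
    static Map<String, String> loadOptions() {
        Map<String, String> options = new HashMap<>();
        options.put(Options.ERROR_HANDLING, Options.IGNORE_BAD_RECORDS);
        options.put(Options.BAD_RECORD_TABLE_NAME, "ki_home.orders_bad_records");
        options.put(Options.BAD_RECORD_TABLE_LIMIT, "1000");
        return options;
    }
}
```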
public static org.apache.avro.Schema getClassSchema()
This method supports the Avro framework and is not intended to be called directly by the user.
public String getTableName()
Name of the table into which the data will be inserted, in [schema_name.]table_name format, using standard name resolution rules. If the table does not exist, the table will be created using either an existing TYPE_ID or the type inferred from the file, and the new table name will have to meet standard table naming criteria. Returns the current value of tableName.

public InsertRecordsFromFilesRequest setTableName(String tableName)
Name of the table into which the data will be inserted, in [schema_name.]table_name format, using standard name resolution rules. If the table does not exist, the table will be created using either an existing TYPE_ID or the type inferred from the file, and the new table name will have to meet standard table naming criteria.
tableName - The new value for tableName. Returns this to mimic the builder pattern.

public List<String> getFilepaths()
A list of file paths from which data will be sourced.
For paths in KiFS, use the uri prefix of kifs:// followed by the path to a file or directory. File matching by prefix is supported, e.g. kifs://dir/file would match dir/file_1 and dir/file_2. When prefix matching is used, the path must start with a full, valid KiFS directory name.
If an external data source is specified in DATASOURCE_NAME
, these file paths must resolve
to accessible files at that data source location. Prefix matching is
supported. If the data source is hdfs, prefixes must be aligned with
directories, i.e. partial file names will not match.
If no data source is specified, the files are assumed to be local to the database and must all be accessible to the gpudb user, residing on the path (or relative to the path) specified by the external files directory in the Kinetica configuration file. Wildcards (*) can be used to specify a group of files. Prefix matching is supported, the prefixes must be aligned with directories.
If the first path ends in .tsv, the text delimiter will be defaulted to a tab character. If the first path ends in .psv, the text delimiter will be defaulted to a pipe character (|).
Returns the current value of filepaths.

public InsertRecordsFromFilesRequest setFilepaths(List<String> filepaths)
A list of file paths from which data will be sourced.
For paths in KiFS, use the uri prefix of kifs:// followed by the path to a file or directory. File matching by prefix is supported, e.g. kifs://dir/file would match dir/file_1 and dir/file_2. When prefix matching is used, the path must start with a full, valid KiFS directory name.
If an external data source is specified in DATASOURCE_NAME
, these file paths must resolve
to accessible files at that data source location. Prefix matching is
supported. If the data source is hdfs, prefixes must be aligned with
directories, i.e. partial file names will not match.
If no data source is specified, the files are assumed to be local to the database and must all be accessible to the gpudb user, residing on the path (or relative to the path) specified by the external files directory in the Kinetica configuration file. Wildcards (*) can be used to specify a group of files. Prefix matching is supported, the prefixes must be aligned with directories.
If the first path ends in .tsv, the text delimiter will be defaulted to a tab character. If the first path ends in .psv, the text delimiter will be defaulted to a pipe character (|).
filepaths - The new value for filepaths. Returns this to mimic the builder pattern.
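The sketch below illustrates the path forms described above as arguments to setFilepaths(); every path and directory name is a placeholder.

```java
import java.util.Arrays;

import com.gpudb.protocol.InsertRecordsFromFilesRequest;

public class FilepathSketch {
    static void applyPaths(InsertRecordsFromFilesRequest request) {
        // A single KiFS file.
        request.setFilepaths(Arrays.asList("kifs://data/orders.csv"));

        // A KiFS prefix match: would pick up data/orders_2024_01.csv,
        // data/orders_2024_02.csv, and so on.
        request.setFilepaths(Arrays.asList("kifs://data/orders_2024"));

        // A wildcard over files local to the database, relative to the configured
        // external files directory; a first path ending in .psv defaults the
        // text delimiter to '|'.
        request.setFilepaths(Arrays.asList("daily/*.psv"));
    }
}
```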
public Map<String,Map<String,String>> getModifyColumns()
Not implemented yet. The default value is an empty Map. Returns the current value of modifyColumns.

public InsertRecordsFromFilesRequest setModifyColumns(Map<String,Map<String,String>> modifyColumns)
Not implemented yet. The default value is an empty Map.
modifyColumns - The new value for modifyColumns. Returns this to mimic the builder pattern.

public Map<String,String> getCreateTableOptions()
Options from
GPUdb.createTable
, allowing the structure of the table to be defined
independently of the data source, when creating the target table.
TYPE_ID
: ID of a currently
registered type.
NO_ERROR_IF_EXISTS
: If TRUE
,
prevents an error from occurring if the table already exists and
is of the given type. If a table with the same name but a
different type exists, it is still an error.
Supported values:
The default value is FALSE
.
IS_REPLICATED
: Affects
the distribution scheme for the table's data. If
TRUE
and the given table has no
explicit shard key defined, the table will be replicated. If FALSE
, the table will be sharded according to the shard key specified
in the given TYPE_ID
, or randomly sharded, if no shard key is
specified. Note that a type containing a shard key cannot be
used to create a replicated table.
Supported values:
The default value is FALSE
.
FOREIGN_KEYS
:
Semicolon-separated list of foreign keys, of the format
'(source_column_name [, ...]) references
target_table_name(primary_key_column_name [, ...]) [as
foreign_key_name]'.
FOREIGN_SHARD_KEY
:
Foreign shard key of the format 'source_column references
shard_by_column from target_table(primary_key_column)'.
PARTITION_TYPE
: Partitioning scheme to use.
Supported values:
RANGE
: Use range partitioning.
INTERVAL
: Use interval partitioning.
LIST
: Use list partitioning.
HASH
: Use hash partitioning.
SERIES
: Use series partitioning.
PARTITION_KEYS
:
Comma-separated list of partition keys, which are the columns or
column expressions by which records will be assigned to
partitions defined by PARTITION_DEFINITIONS
.
PARTITION_DEFINITIONS
: Comma-separated list of partition
definitions, whose format depends on the choice of PARTITION_TYPE
. See range partitioning, interval partitioning, list partitioning, hash partitioning, or series partitioning for example formats.
IS_AUTOMATIC_PARTITION
: If TRUE
, a new partition will be created for values which don't
fall into an existing partition. Currently, only supported for
list partitions.
Supported values:
The default value is FALSE
.
TTL
: Sets the TTL of
the table specified in tableName
.
CHUNK_SIZE
: Indicates the
number of records per chunk to be used for this table.
CHUNK_COLUMN_MAX_MEMORY
: Indicates the target maximum data size
for each column in a chunk to be used for this table.
CHUNK_MAX_MEMORY
:
Indicates the target maximum data size for all columns in a
chunk to be used for this table.
IS_RESULT_TABLE
:
Indicates whether the table is a memory-only table. A result table cannot
contain columns with text_search data-handling, and it will not be retained if
the server is restarted.
Supported values:
The default value is FALSE
.
STRATEGY_DEFINITION
: The tier strategy for the table and its columns.
The default value is an empty Map. Returns the current value of createTableOptions.
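As one concrete way to fill this map, the sketch below sets a foreign key (using the format quoted above), a TTL, and a chunk size for the created table; the table and column names and the numeric values are placeholders, and the import path is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

import com.gpudb.protocol.InsertRecordsFromFilesRequest.CreateTableOptions;

public class CreateTableOptionsSketch {
    static Map<String, String> build() {
        Map<String, String> createTableOptions = new HashMap<>();

        // Foreign key in the documented
        // '(source_column) references target_table(pk_column) [as name]' format.
        createTableOptions.put(CreateTableOptions.FOREIGN_KEYS,
                "(customer_id) references ki_home.customers(id) as fk_orders_customer");

        // TTL and per-chunk record count for the created table (placeholder values).
        createTableOptions.put(CreateTableOptions.TTL, "120");
        createTableOptions.put(CreateTableOptions.CHUNK_SIZE, "1000000");

        return createTableOptions;
    }
}
```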
public InsertRecordsFromFilesRequest setCreateTableOptions(Map<String,String> createTableOptions)
Options from
GPUdb.createTable
, allowing the structure of the table to be defined
independently of the data source, when creating the target table.
TYPE_ID
: ID of a currently
registered type.
NO_ERROR_IF_EXISTS
: If TRUE
,
prevents an error from occurring if the table already exists and
is of the given type. If a table with the same name but a
different type exists, it is still an error.
Supported values:
The default value is FALSE
.
IS_REPLICATED
: Affects
the distribution scheme for the table's data. If
TRUE
and the given table has no
explicit shard key defined, the table will be replicated. If FALSE
, the table will be sharded according to the shard key specified
in the given TYPE_ID
, or randomly sharded, if no shard key is
specified. Note that a type containing a shard key cannot be
used to create a replicated table.
Supported values:
The default value is FALSE
.
FOREIGN_KEYS
:
Semicolon-separated list of foreign keys, of the format
'(source_column_name [, ...]) references
target_table_name(primary_key_column_name [, ...]) [as
foreign_key_name]'.
FOREIGN_SHARD_KEY
:
Foreign shard key of the format 'source_column references
shard_by_column from target_table(primary_key_column)'.
PARTITION_TYPE
: Partitioning scheme to use.
Supported values:
RANGE
: Use range partitioning.
INTERVAL
: Use interval partitioning.
LIST
: Use list partitioning.
HASH
: Use hash partitioning.
SERIES
: Use series partitioning.
PARTITION_KEYS
:
Comma-separated list of partition keys, which are the columns or
column expressions by which records will be assigned to
partitions defined by PARTITION_DEFINITIONS
.
PARTITION_DEFINITIONS
: Comma-separated list of partition
definitions, whose format depends on the choice of PARTITION_TYPE
. See range partitioning, interval partitioning, list partitioning, hash partitioning, or series partitioning for example formats.
IS_AUTOMATIC_PARTITION
: If TRUE
, a new partition will be created for values which don't
fall into an existing partition. Currently, only supported for
list partitions.
Supported values:
The default value is FALSE
.
TTL
: Sets the TTL of
the table specified in tableName
.
CHUNK_SIZE
: Indicates the
number of records per chunk to be used for this table.
CHUNK_COLUMN_MAX_MEMORY
: Indicates the target maximum data size
for each column in a chunk to be used for this table.
CHUNK_MAX_MEMORY
:
Indicates the target maximum data size for all columns in a
chunk to be used for this table.
IS_RESULT_TABLE
:
Indicates whether the table is a memory-only table. A result table cannot
contain columns with text_search data-handling, and it will not be retained if
the server is restarted.
Supported values:
The default value is FALSE
.
STRATEGY_DEFINITION
: The tier strategy for the table and its columns.
The default value is an empty Map.
createTableOptions - The new value for createTableOptions. Returns this to mimic the builder pattern.

public Map<String,String> getOptions()
Optional parameters.
BAD_RECORD_TABLE_NAME
:
Name of a table to which records that were rejected are written.
The bad-record-table has the following columns: line_number
(long), line_rejected (string), error_message (string). When
ERROR_HANDLING
is ABORT
, bad records table is not populated.
BAD_RECORD_TABLE_LIMIT
: A
positive integer indicating the maximum number of records that
can be written to the bad-record-table. The default value is
'10000'.
BAD_RECORD_TABLE_LIMIT_PER_INPUT
: For subscriptions, a positive
integer indicating the maximum number of records that can be
written to the bad-record-table per file/payload. Default value
will be BAD_RECORD_TABLE_LIMIT
and total size of the table per rank is
limited to BAD_RECORD_TABLE_LIMIT
.
BATCH_SIZE
: Number of records to
insert per batch when inserting data. The default value is
'50000'.
COLUMN_FORMATS
: For each target
column specified, applies the column-property-bound format to
the source data loaded into that column. Each column format
will contain a mapping of one or more of its column properties
to an appropriate format for each property. Currently supported
column properties include date, time, & datetime. The parameter
value must be formatted as a JSON string of maps of column names
to maps of column properties to their corresponding column
formats, e.g., '{ "order_date" : { "date" : "%Y.%m.%d" },
"order_time" : { "time" : "%H:%M:%S" } }'. See DEFAULT_COLUMN_FORMATS
for valid
format syntax.
COLUMNS_TO_LOAD
: Specifies a
comma-delimited list of columns from the source data to load.
If more than one file is being loaded, this list applies to all
files. Column numbers can be specified discretely or as a
range. For example, a value of '5,7,1..3' will insert values
from the fifth column in the source data into the first column
in the target table, from the seventh column in the source data
into the second column in the target table, and from the first
through third columns in the source data into the third through
fifth columns in the target table. If the source data contains
a header, column names matching the file header names may be
provided instead of column numbers. If the target table doesn't
exist, the table will be created with the columns in this order.
If the target table does exist with columns in a different order
than the source data, this list can be used to match the order
of the target table. For example, a value of 'C, B, A' will
create a three column table with column C, followed by column B,
followed by column A; or will insert those fields in that order
into a table created with columns in that order. If the target
table exists, the column names must match the source data field
names for a name-mapping to be successful. Mutually exclusive
with COLUMNS_TO_SKIP
.
COLUMNS_TO_SKIP
: Specifies a
comma-delimited list of columns from the source data to skip.
Mutually exclusive with COLUMNS_TO_LOAD
.
COMPRESSION_TYPE
: Source data
compression type.
Supported values:
NONE
: No compression.
AUTO
: Auto detect compression type
GZIP
: gzip file compression.
BZIP2
: bzip2 file compression.
AUTO
.
DATASOURCE_NAME
: Name of an
existing external data source from which data file(s) specified
in filepaths
will be loaded
DEFAULT_COLUMN_FORMATS
:
Specifies the default format to be applied to source data loaded
into columns with the corresponding column property. Currently
supported column properties include date, time, & datetime.
This default column-property-bound format can be overridden by
specifying a column property & format for a given target column
in COLUMN_FORMATS
. For each
specified annotation, the format will apply to all columns with
that annotation unless a custom COLUMN_FORMATS
for that annotation is specified. The parameter
value must be formatted as a JSON string that is a map of column
properties to their respective column formats, e.g., '{ "date" :
"%Y.%m.%d", "time" : "%H:%M:%S" }'. Column formats are
specified as a string of control characters and plain text. The
supported control characters are 'Y', 'm', 'd', 'H', 'M', 'S',
and 's', which follow the Linux 'strptime()' specification, as
well as 's', which specifies seconds and fractional seconds
(though the fractional component will be truncated past
milliseconds). Formats for the 'date' annotation must include
the 'Y', 'm', and 'd' control characters. Formats for the 'time'
annotation must include the 'H', 'M', and either 'S' or 's' (but
not both) control characters. Formats for the 'datetime'
annotation meet both the 'date' and 'time' control character
requirements. For example, '{"datetime" : "%m/%d/%Y %H:%M:%S" }'
would be used to interpret text as "05/04/2000 12:12:11"
ERROR_HANDLING
: Specifies how
errors should be handled upon insertion.
Supported values:
PERMISSIVE
: Records with
missing columns are populated with nulls if possible;
otherwise, the malformed records are skipped.
IGNORE_BAD_RECORDS
:
Malformed records are skipped.
ABORT
: Stops current insertion and
aborts entire operation when an error is encountered.
Primary key collisions are considered abortable errors
in this mode.
ABORT
.
FILE_TYPE
: Specifies the type of the
file(s) whose records will be inserted.
Supported values:
AVRO
: Avro file format
DELIMITED_TEXT
: Delimited
text file format; e.g., CSV, TSV, PSV, etc.
GDB
: Esri/GDB file format
JSON
: Json file format
PARQUET
: Apache Parquet file
format
SHAPEFILE
: ShapeFile file
format
DELIMITED_TEXT
.
FLATTEN_COLUMNS
: Specifies how
to handle nested columns.
Supported values:
TRUE
: Break up nested columns to
multiple columns
FALSE
: Treat nested columns as
json columns instead of flattening
FALSE
.
GDAL_CONFIGURATION_OPTIONS
: Comma separated list of gdal conf
options for the specific request: key=value
IGNORE_EXISTING_PK
: Specifies
the record collision error-suppression policy for inserting into
a table with a primary key, only used when not in upsert mode
(upsert mode is disabled when UPDATE_ON_EXISTING_PK
is FALSE
). If set to TRUE
, any
record being inserted that is rejected for having primary key
values that match those of an existing table record will be
ignored with no error generated. If FALSE
, the rejection of any record for having primary key
values matching an existing record will result in an error being
reported, as determined by ERROR_HANDLING
. If the specified table does not have a primary
key or if upsert mode is in effect (UPDATE_ON_EXISTING_PK
is TRUE
), then this option has no effect.
Supported values:
TRUE
: Ignore new records whose
primary key values collide with those of existing
records
FALSE
: Treat as errors any new
records whose primary key values collide with those of
existing records
FALSE
.
INGESTION_MODE
: Whether to do a
full load, dry run, or perform a type inference on the source
data.
Supported values:
FULL
: Run a type inference on the
source data (if needed) and ingest
DRY_RUN
: Does not load data, but
walks through the source data and determines the number
of valid records, taking into account the current mode
of ERROR_HANDLING
.
TYPE_INFERENCE_ONLY
:
Infer the type of the source data and return, without
ingesting any data. The inferred type is returned in
the response.
FULL
.
KAFKA_CONSUMERS_PER_RANK
: Number of Kafka consumer threads per
rank (valid range 1-6). The default value is '1'.
KAFKA_GROUP_ID
: The group id to
be used when consuming data from a Kafka topic (valid only for
Kafka datasource subscriptions).
KAFKA_OFFSET_RESET_POLICY
: Policy to determine whether the
Kafka data consumption starts either at earliest offset or
latest offset.
Supported values:
The default value is EARLIEST
.
KAFKA_OPTIMISTIC_INGEST
:
Enable optimistic ingestion where Kafka topic offsets and table
data are committed independently to achieve parallelism.
Supported values:
The default value is FALSE
.
KAFKA_SUBSCRIPTION_CANCEL_AFTER
: Sets the Kafka subscription
lifespan (in minutes). Expired subscription will be cancelled
automatically.
KAFKA_TYPE_INFERENCE_FETCH_TIMEOUT
: Maximum time to collect
Kafka messages before type inferencing on the set of them.
LAYER
: Geo files layer(s) name(s): comma
separated.
LOADING_MODE
: Scheme for
distributing the extraction and loading of data from the source
data file(s). This option applies only when loading files that
are local to the database.
Supported values:
HEAD
: The head node loads all data.
All files must be available to the head node.
DISTRIBUTED_SHARED
:
The head node coordinates loading data by worker
processes across all nodes from shared files available
to all workers. NOTE: Instead of existing on a shared
source, the files can be duplicated on a source local to
each host to improve performance, though the files must
appear as the same data set from the perspective of all
hosts performing the load.
DISTRIBUTED_LOCAL
: A
single worker process on each node loads all files that
are available to it. This option works best when each
worker loads files from its own file system, to maximize
performance. In order to avoid data duplication, either
each worker performing the load needs to have visibility
to a set of files unique to it (no file is visible to
more than one node) or the target table needs to have a
primary key (which will allow the worker to
automatically deduplicate data). NOTE: If the target
table doesn't exist, the table structure will be
determined by the head node. If the head node has no
files local to it, it will be unable to determine the
structure and the request will fail. If the head node
is configured to have no worker processes, no data
strictly accessible to the head node will be loaded.
HEAD
.
LOCAL_TIME_OFFSET
: Apply an
offset to Avro local timestamp columns.
MAX_RECORDS_TO_LOAD
: Limit
the number of records to load in this request: if this number is
larger than BATCH_SIZE
, then the
number of records loaded will be limited to the next whole
number of BATCH_SIZE
(per working
thread).
NUM_TASKS_PER_RANK
: Number of
tasks for reading file per rank. Default will be system
configuration parameter, external_file_reader_num_tasks.
POLL_INTERVAL
: If TRUE
, the number of seconds between attempts to
load external files into the table. If zero, polling will be
continuous as long as data is found. If no data is found, the
interval will steadily increase to a maximum of 60 seconds. The
default value is '0'.
PRIMARY_KEYS
: Comma separated list
of column names to set as primary keys, when not specified in
the type.
SCHEMA_REGISTRY_SCHEMA_NAME
: Name of the Avro schema in the
schema registry to use when reading Avro records.
SHARD_KEYS
: Comma separated list of
column names to set as shard keys, when not specified in the
type.
SKIP_LINES
: Skip number of lines from
the beginning of the file.
START_OFFSETS
: Starting offsets by
partition to fetch from kafka. A comma separated list of
partition:offset pairs.
SUBSCRIBE
: Continuously poll the data
source to check for new data and load it into the table.
Supported values:
The default value is FALSE
.
TABLE_INSERT_MODE
: Insertion
scheme to use when inserting records from multiple shapefiles.
Supported values:
SINGLE
: Insert all records into a
single table.
TABLE_PER_FILE
: Insert
records from each file into a new table corresponding to
that file.
SINGLE
.
TEXT_COMMENT_STRING
:
Specifies the character string that should be interpreted as a
comment line prefix in the source data. All lines in the data
starting with the provided string are ignored. For DELIMITED_TEXT
FILE_TYPE
only. The default value is '#'.
TEXT_DELIMITER
: Specifies the
character delimiting field values in the source data and field
names in the header (if present). For DELIMITED_TEXT
FILE_TYPE
only. The default value is ','.
TEXT_ESCAPE_CHARACTER
:
Specifies the character that is used to escape other characters
in the source data. An 'a', 'b', 'f', 'n', 'r', 't', or 'v'
preceded by an escape character will be interpreted as the ASCII
bell, backspace, form feed, line feed, carriage return,
horizontal tab, & vertical tab, respectively. For example, the
escape character followed by an 'n' will be interpreted as a
newline within a field value. The escape character can also be
used to escape the quoting character, and will be treated as an
escape character whether it is within a quoted field value or
not. For DELIMITED_TEXT
FILE_TYPE
only.
TEXT_HAS_HEADER
: Indicates
whether the source data contains a header row. For DELIMITED_TEXT
FILE_TYPE
only.
Supported values:
The default value is TRUE
.
TEXT_HEADER_PROPERTY_DELIMITER
: Specifies the delimiter for column properties in the header row (if
present). Cannot be set to same value as TEXT_DELIMITER
. For DELIMITED_TEXT
FILE_TYPE
only. The default value is '|'.
TEXT_NULL_STRING
: Specifies the
character string that should be interpreted as a null value in
the source data. For DELIMITED_TEXT
FILE_TYPE
only. The
default value is '\N'.
TEXT_QUOTE_CHARACTER
:
Specifies the character that should be interpreted as a field
value quoting character in the source data. The character must
appear at beginning and end of field value to take effect.
Delimiters within quoted fields are treated as literals and not
delimiters. Within a quoted field, two consecutive quote
characters will be interpreted as a single literal quote
character, effectively escaping it. To not have a quote
character, specify an empty string. For DELIMITED_TEXT
FILE_TYPE
only. The default value is '"'.
TEXT_SEARCH_COLUMNS
: Add
'text_search' property to internally inferred string columns.
Comma-separated list of column names or '*' for all columns. To
add 'text_search' property only to string columns greater than
or equal to a minimum size, also set the TEXT_SEARCH_MIN_COLUMN_LENGTH option.
TEXT_SEARCH_MIN_COLUMN_LENGTH
: Set the minimum column size for
strings to apply the 'text_search' property to. Used only when
TEXT_SEARCH_COLUMNS
has a
value.
TRUNCATE_STRINGS
: If set to
TRUE
, truncate string values that are
longer than the column's type size.
Supported values:
The default value is FALSE
.
TRUNCATE_TABLE
: If set to TRUE
, truncates the table specified by tableName
prior to loading the file(s).
Supported values:
The default value is FALSE
.
TYPE_INFERENCE_MODE
:
Optimize type inferencing for either speed or accuracy.
Supported values:
ACCURACY
: Scans data to get
exactly-typed & sized columns for all data scanned.
SPEED
: Scans data and picks the
widest possible column types so that 'all' values will
fit with minimum data scanned
ACCURACY
.
UPDATE_ON_EXISTING_PK
:
Specifies the record collision policy for inserting into a table
with a primary key. If set to TRUE
, any existing table record with primary key values that
match those of a record being inserted will be replaced by that
new record (the new data will be 'upserted'). If set to FALSE
, any existing table record with primary key
values that match those of a record being inserted will remain
unchanged, while the new record will be rejected and the error
handled as determined by IGNORE_EXISTING_PK
& ERROR_HANDLING
. If the specified table does not have a primary
key, then this option has no effect.
Supported values:
TRUE
: Upsert new records when
primary keys match existing records
FALSE
: Reject new records when
primary keys match existing records
FALSE
.
The default value is an empty Map. Returns the current value of options.
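A sketch of typical delimited-text settings passed through this options map; the column names match a hypothetical source header and, like the import path, are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

import com.gpudb.protocol.InsertRecordsFromFilesRequest.Options;

public class DelimitedTextOptionsSketch {
    static Map<String, String> build() {
        Map<String, String> options = new HashMap<>();

        // Tab-delimited source with a header row; empty fields marked as "NULL".
        options.put(Options.FILE_TYPE, Options.DELIMITED_TEXT);
        options.put(Options.TEXT_DELIMITER, "\t");
        options.put(Options.TEXT_HAS_HEADER, Options.TRUE);
        options.put(Options.TEXT_NULL_STRING, "NULL");

        // Name-based loading: load only these header columns and skip the rest.
        options.put(Options.COLUMNS_TO_LOAD, "order_id,order_date,total");

        return options;
    }
}
```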
public InsertRecordsFromFilesRequest setOptions(Map<String,String> options)
Optional parameters.
BAD_RECORD_TABLE_NAME
:
Name of a table to which records that were rejected are written.
The bad-record-table has the following columns: line_number
(long), line_rejected (string), error_message (string). When
ERROR_HANDLING
is ABORT
, bad records table is not populated.
BAD_RECORD_TABLE_LIMIT
: A
positive integer indicating the maximum number of records that
can be written to the bad-record-table. The default value is
'10000'.
BAD_RECORD_TABLE_LIMIT_PER_INPUT
: For subscriptions, a positive
integer indicating the maximum number of records that can be
written to the bad-record-table per file/payload. Default value
will be BAD_RECORD_TABLE_LIMIT
and total size of the table per rank is
limited to BAD_RECORD_TABLE_LIMIT
.
BATCH_SIZE
: Number of records to
insert per batch when inserting data. The default value is
'50000'.
COLUMN_FORMATS
: For each target
column specified, applies the column-property-bound format to
the source data loaded into that column. Each column format
will contain a mapping of one or more of its column properties
to an appropriate format for each property. Currently supported
column properties include date, time, & datetime. The parameter
value must be formatted as a JSON string of maps of column names
to maps of column properties to their corresponding column
formats, e.g., '{ "order_date" : { "date" : "%Y.%m.%d" },
"order_time" : { "time" : "%H:%M:%S" } }'. See DEFAULT_COLUMN_FORMATS
for valid
format syntax.
COLUMNS_TO_LOAD
: Specifies a
comma-delimited list of columns from the source data to load.
If more than one file is being loaded, this list applies to all
files. Column numbers can be specified discretely or as a
range. For example, a value of '5,7,1..3' will insert values
from the fifth column in the source data into the first column
in the target table, from the seventh column in the source data
into the second column in the target table, and from the first
through third columns in the source data into the third through
fifth columns in the target table. If the source data contains
a header, column names matching the file header names may be
provided instead of column numbers. If the target table doesn't
exist, the table will be created with the columns in this order.
If the target table does exist with columns in a different order
than the source data, this list can be used to match the order
of the target table. For example, a value of 'C, B, A' will
create a three column table with column C, followed by column B,
followed by column A; or will insert those fields in that order
into a table created with columns in that order. If the target
table exists, the column names must match the source data field
names for a name-mapping to be successful. Mutually exclusive
with COLUMNS_TO_SKIP
.
COLUMNS_TO_SKIP
: Specifies a
comma-delimited list of columns from the source data to skip.
Mutually exclusive with COLUMNS_TO_LOAD
.
COMPRESSION_TYPE
: Source data
compression type.
Supported values:
NONE
: No compression.
AUTO
: Auto detect compression type
GZIP
: gzip file compression.
BZIP2
: bzip2 file compression.
AUTO
.
DATASOURCE_NAME
: Name of an
existing external data source from which data file(s) specified
in filepaths
will be loaded
DEFAULT_COLUMN_FORMATS
:
Specifies the default format to be applied to source data loaded
into columns with the corresponding column property. Currently
supported column properties include date, time, & datetime.
This default column-property-bound format can be overridden by
specifying a column property & format for a given target column
in COLUMN_FORMATS
. For each
specified annotation, the format will apply to all columns with
that annotation unless a custom COLUMN_FORMATS
for that annotation is specified. The parameter
value must be formatted as a JSON string that is a map of column
properties to their respective column formats, e.g., '{ "date" :
"%Y.%m.%d", "time" : "%H:%M:%S" }'. Column formats are
specified as a string of control characters and plain text. The
supported control characters are 'Y', 'm', 'd', 'H', 'M', 'S',
and 's', which follow the Linux 'strptime()' specification, as
well as 's', which specifies seconds and fractional seconds
(though the fractional component will be truncated past
milliseconds). Formats for the 'date' annotation must include
the 'Y', 'm', and 'd' control characters. Formats for the 'time'
annotation must include the 'H', 'M', and either 'S' or 's' (but
not both) control characters. Formats for the 'datetime'
annotation meet both the 'date' and 'time' control character
requirements. For example, '{"datetime" : "%m/%d/%Y %H:%M:%S" }'
would be used to interpret text as "05/04/2000 12:12:11"
ERROR_HANDLING
: Specifies how
errors should be handled upon insertion.
Supported values:
PERMISSIVE
: Records with
missing columns are populated with nulls if possible;
otherwise, the malformed records are skipped.
IGNORE_BAD_RECORDS
:
Malformed records are skipped.
ABORT
: Stops current insertion and
aborts entire operation when an error is encountered.
Primary key collisions are considered abortable errors
in this mode.
ABORT
.
FILE_TYPE
: Specifies the type of the
file(s) whose records will be inserted.
Supported values:
AVRO
: Avro file format
DELIMITED_TEXT
: Delimited
text file format; e.g., CSV, TSV, PSV, etc.
GDB
: Esri/GDB file format
JSON
: Json file format
PARQUET
: Apache Parquet file
format
SHAPEFILE
: ShapeFile file
format
DELIMITED_TEXT
.
FLATTEN_COLUMNS
: Specifies how
to handle nested columns.
Supported values:
TRUE
: Break up nested columns to
multiple columns
FALSE
: Treat nested columns as
json columns instead of flattening
FALSE
.
GDAL_CONFIGURATION_OPTIONS
: Comma separated list of gdal conf
options for the specific request: key=value
IGNORE_EXISTING_PK
: Specifies
the record collision error-suppression policy for inserting into
a table with a primary key, only used when not in upsert mode
(upsert mode is disabled when UPDATE_ON_EXISTING_PK
is FALSE
). If set to TRUE
, any
record being inserted that is rejected for having primary key
values that match those of an existing table record will be
ignored with no error generated. If FALSE
, the rejection of any record for having primary key
values matching an existing record will result in an error being
reported, as determined by ERROR_HANDLING
. If the specified table does not have a primary
key or if upsert mode is in effect (UPDATE_ON_EXISTING_PK
is TRUE
), then this option has no effect.
Supported values:
TRUE
: Ignore new records whose
primary key values collide with those of existing
records
FALSE
: Treat as errors any new
records whose primary key values collide with those of
existing records
FALSE
.
INGESTION_MODE
: Whether to do a
full load, dry run, or perform a type inference on the source
data.
Supported values:
FULL
: Run a type inference on the
source data (if needed) and ingest
DRY_RUN
: Does not load data, but
walks through the source data and determines the number
of valid records, taking into account the current mode
of ERROR_HANDLING
.
TYPE_INFERENCE_ONLY
:
Infer the type of the source data and return, without
ingesting any data. The inferred type is returned in
the response.
FULL
.
KAFKA_CONSUMERS_PER_RANK
: Number of Kafka consumer threads per
rank (valid range 1-6). The default value is '1'.
KAFKA_GROUP_ID
: The group id to
be used when consuming data from a Kafka topic (valid only for
Kafka datasource subscriptions).
KAFKA_OFFSET_RESET_POLICY
: Policy to determine whether the
Kafka data consumption starts either at earliest offset or
latest offset.
Supported values:
The default value is EARLIEST
.
KAFKA_OPTIMISTIC_INGEST
:
Enable optimistic ingestion where Kafka topic offsets and table
data are committed independently to achieve parallelism.
Supported values:
The default value is FALSE
.
KAFKA_SUBSCRIPTION_CANCEL_AFTER
: Sets the Kafka subscription
lifespan (in minutes). Expired subscription will be cancelled
automatically.
KAFKA_TYPE_INFERENCE_FETCH_TIMEOUT
: Maximum time to collect
Kafka messages before type inferencing on the set of them.
LAYER
: Geo files layer(s) name(s): comma
separated.
LOADING_MODE
: Scheme for
distributing the extraction and loading of data from the source
data file(s). This option applies only when loading files that
are local to the database.
Supported values:
HEAD
: The head node loads all data.
All files must be available to the head node.
DISTRIBUTED_SHARED
:
The head node coordinates loading data by worker
processes across all nodes from shared files available
to all workers. NOTE: Instead of existing on a shared
source, the files can be duplicated on a source local to
each host to improve performance, though the files must
appear as the same data set from the perspective of all
hosts performing the load.
DISTRIBUTED_LOCAL
: A
single worker process on each node loads all files that
are available to it. This option works best when each
worker loads files from its own file system, to maximize
performance. In order to avoid data duplication, either
each worker performing the load needs to have visibility
to a set of files unique to it (no file is visible to
more than one node) or the target table needs to have a
primary key (which will allow the worker to
automatically deduplicate data). NOTE: If the target
table doesn't exist, the table structure will be
determined by the head node. If the head node has no
files local to it, it will be unable to determine the
structure and the request will fail. If the head node
is configured to have no worker processes, no data
strictly accessible to the head node will be loaded.
HEAD
.
LOCAL_TIME_OFFSET
: Apply an
offset to Avro local timestamp columns.
MAX_RECORDS_TO_LOAD
: Limit
the number of records to load in this request: if this number is
larger than BATCH_SIZE
, then the
number of records loaded will be limited to the next whole
number of BATCH_SIZE
(per working
thread).
NUM_TASKS_PER_RANK
: Number of
tasks for reading file per rank. Default will be system
configuration parameter, external_file_reader_num_tasks.
POLL_INTERVAL
: If TRUE
, the number of seconds between attempts to
load external files into the table. If zero, polling will be
continuous as long as data is found. If no data is found, the
interval will steadily increase to a maximum of 60 seconds. The
default value is '0'.
PRIMARY_KEYS
: Comma separated list
of column names to set as primary keys, when not specified in
the type.
SCHEMA_REGISTRY_SCHEMA_NAME
: Name of the Avro schema in the
schema registry to use when reading Avro records.
SHARD_KEYS
: Comma separated list of
column names to set as shard keys, when not specified in the
type.
SKIP_LINES
: Skip number of lines from
the beginning of the file.
START_OFFSETS
: Starting offsets by
partition to fetch from kafka. A comma separated list of
partition:offset pairs.
SUBSCRIBE
: Continuously poll the data
source to check for new data and load it into the table.
Supported values:
The default value is FALSE
.
TABLE_INSERT_MODE
: Insertion
scheme to use when inserting records from multiple shapefiles.
Supported values:
SINGLE
: Insert all records into a
single table.
TABLE_PER_FILE
: Insert
records from each file into a new table corresponding to
that file.
SINGLE
.
TEXT_COMMENT_STRING
:
Specifies the character string that should be interpreted as a
comment line prefix in the source data. All lines in the data
starting with the provided string are ignored. For DELIMITED_TEXT
FILE_TYPE
only. The default value is '#'.
TEXT_DELIMITER
: Specifies the
character delimiting field values in the source data and field
names in the header (if present). For DELIMITED_TEXT
FILE_TYPE
only. The default value is ','.
TEXT_ESCAPE_CHARACTER
:
Specifies the character that is used to escape other characters
in the source data. An 'a', 'b', 'f', 'n', 'r', 't', or 'v'
preceded by an escape character will be interpreted as the ASCII
bell, backspace, form feed, line feed, carriage return,
horizontal tab, & vertical tab, respectively. For example, the
escape character followed by an 'n' will be interpreted as a
newline within a field value. The escape character can also be
used to escape the quoting character, and will be treated as an
escape character whether it is within a quoted field value or
not. For DELIMITED_TEXT
FILE_TYPE
only.
TEXT_HAS_HEADER
: Indicates
whether the source data contains a header row. For DELIMITED_TEXT
FILE_TYPE
only.
Supported values:
The default value is TRUE
.
TEXT_HEADER_PROPERTY_DELIMITER
: Specifies the delimiter for column properties in the header row (if
present). Cannot be set to same value as TEXT_DELIMITER
. For DELIMITED_TEXT
FILE_TYPE
only. The default value is '|'.
TEXT_NULL_STRING
: Specifies the
character string that should be interpreted as a null value in
the source data. For DELIMITED_TEXT
FILE_TYPE
only. The
default value is '\N'.
TEXT_QUOTE_CHARACTER
:
Specifies the character that should be interpreted as a field
value quoting character in the source data. The character must
appear at beginning and end of field value to take effect.
Delimiters within quoted fields are treated as literals and not
delimiters. Within a quoted field, two consecutive quote
characters will be interpreted as a single literal quote
character, effectively escaping it. To not have a quote
character, specify an empty string. For DELIMITED_TEXT
FILE_TYPE
only. The default value is '"'.
TEXT_SEARCH_COLUMNS
: Add
'text_search' property to internally inferred string columns.
Comma-separated list of column names or '*' for all columns. To
add 'text_search' property only to string columns greater than
or equal to a minimum size, also set the TEXT_SEARCH_MIN_COLUMN_LENGTH option.
TEXT_SEARCH_MIN_COLUMN_LENGTH
: Set the minimum column size for
strings to apply the 'text_search' property to. Used only when
TEXT_SEARCH_COLUMNS
has a
value.
TRUNCATE_STRINGS
: If set to
TRUE
, truncate string values that are
longer than the column's type size.
Supported values:
The default value is FALSE
.
TRUNCATE_TABLE
: If set to TRUE
, truncates the table specified by tableName
prior to loading the file(s).
Supported values:
The default value is FALSE
.
TYPE_INFERENCE_MODE
:
Optimize type inferencing for either speed or accuracy.
Supported values:
ACCURACY
: Scans data to get
exactly-typed & sized columns for all data scanned.
SPEED
: Scans data and picks the
widest possible column types so that 'all' values will
fit with minimum data scanned
ACCURACY
.
UPDATE_ON_EXISTING_PK
:
Specifies the record collision policy for inserting into a table
with a primary key. If set to TRUE
, any existing table record with primary key values that
match those of a record being inserted will be replaced by that
new record (the new data will be 'upserted'). If set to FALSE
, any existing table record with primary key
values that match those of a record being inserted will remain
unchanged, while the new record will be rejected and the error
handled as determined by IGNORE_EXISTING_PK
& ERROR_HANDLING
. If the specified table does not have a primary
key, then this option has no effect.
Supported values:
TRUE
: Upsert new records when
primary keys match existing records
FALSE
: Reject new records when
primary keys match existing records
FALSE
.
The default value is an empty Map.
options - The new value for options. Returns this to mimic the builder pattern.
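For loading from a pre-defined external data source and keeping the target table in sync, the options might be set roughly as below; the data source name, polling interval, and import path are placeholders/assumptions.

```java
import java.util.HashMap;
import java.util.Map;

import com.gpudb.protocol.InsertRecordsFromFilesRequest.Options;

public class DataSourceSubscribeSketch {
    static Map<String, String> build() {
        Map<String, String> options = new HashMap<>();

        // Resolve the file paths against a pre-defined external data source
        // (placeholder name) and keep polling it for new files every 30 seconds.
        options.put(Options.DATASOURCE_NAME, "s3_landing_zone");
        options.put(Options.SUBSCRIBE, Options.TRUE);
        options.put(Options.POLL_INTERVAL, "30");

        // Abort the entire load if any record fails to parse.
        options.put(Options.ERROR_HANDLING, Options.ABORT);

        return options;
    }
}
```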
public org.apache.avro.Schema getSchema()
Specified by: getSchema in interface org.apache.avro.generic.GenericContainer
public Object get(int index)
Specified by: get in interface org.apache.avro.generic.IndexedRecord
index - the position of the field to get
Throws: IndexOutOfBoundsException
public void put(int index, Object value)
Specified by: put in interface org.apache.avro.generic.IndexedRecord
index - the position of the field to set
value - the value to set
Throws: IndexOutOfBoundsException