Tables
The concept of tables is at the heart of Kinetica interactions. A table
is a data container associated with a specific type schema
(set of columns & types), much like tables in other database platforms. When
querying a table, the result set will also have a single type schema
associated with it.
Tables may exist either as top-level objects or as members of
collections.
Each table is identified by a name, which must meet the following criteria:
- Between 1 and 200 characters long
- Alphanumeric, including spaces and these symbols:
_
-
(
)
{
}
[
]
.
:
- First character is alphanumeric or an underscore
- Unique system-wide--cannot have the same name as another table,
view, or collection, regardless of whether it is top-level or
a member of a collection
Column names must meet the following criteria:
- Between 1 and 200 characters long
- Alphanumeric, including these symbols:
{
, }
, _
, [
, ]
,
.
, :
- First character is alphanumeric or an underscore
Using the /create/table endpoint, you can create a blank table (after
you've created a type schema to support the table) that can later hold records.
For example, in Java,:
gpudb.createTable(
myTableName,
myTypeId,
null
)
Distribution
Table data can be distributed across the Kinetica cluster
using one of two methods: sharding & replication.
Sharding
Sharding is the distribution of table data by hashing a particular value for
each record, and by that hash, determining on which Kinetica cluster node the
record will reside.
The benefit of sharding is that it allows distributed queries to be run
against a given data set, with each node responsible for managing its portion of
the overall data set, while only storing each record once across the cluster.
The parallelism inherent in the distributed queries allows for the query
performance to scale linearly with the addition of each cluster node.
A limitation of sharding is that two sharded data sets can only be
joined together if they are sharded in the same way, so that
their corresponding records will all reside on the same nodes. Given that, in a
typical database model, each pair of related data sets is associated by a
different key, a single query may only be able to join together at most two
sharded data sets along their relation.
Since sharding maximizes data storage & query efficiency, it is recommended
that the largest table in the system (e.g. the fact table in a data warehouse)
be sharded. The second largest table against which that table is joined
should also be sharded along the join relation (e.g. the column on each side
of the foreign key relationship).
For example, if the largest joined tables in a system were customer
&
order
, and there was a foreign key relationship between the customer_id
column of the order
table & the id
column of the customer
table,
the order
table should be sharded on customer_id
and the
customer
table should be sharded on id
.
Specifying a shard key requires that you create a type with at least one column
that has the shard_key
property when calling the /create/type
endpoint. See Shard Keys for details.
Replication
Replication is the distribution of table data by locating its
entire data set on every Kinetica cluster node simultaneously, instead of
being distributed across them.
The benefit of replication is that it allows data sets to be
joined together when those data sets are not
sharded on the columns being associated.
Normally, joining two data sets together requires them being joined on their
shard keys so that the joining of the two data sets can
occur locally on each processor shard. Since replicated data sets exist on
all shards, they appear local to any other data set and can be joined on each
shard as if they were local. The result sets from each shard are then
accumulated into one and returned as the final result set.
Since replicated data sets exist in their entirety on all processors, it is
recommended that they be relatively small. For example, a table containing
one gigabyte of data, when replicated across a 10-node cluster will occupy 10
gigabytes of cluster memory overall.
A table is specified as replicated at creation time, by calling the
/create/table endpoint with the is_replicated
option set to
true
.
For example, in Python,:
gpudb.create_table(
table_name = set_id,
type_id = type_id,
options = {'is_replicated': 'true'}
)
Primary Keys
Primary key is a designation that can be applied to a single
table column or set of columns (in the case of a
composite primary key) to ensure the uniqueness of the data contained in the
keyed column(s).
The uniqueness constraint is enforced upon insert in two ways:
- If a record to be inserted has key values that match those of an already
existing record in the target table, the new record’s values
will either:
- Overwrite the existing record’s values, if the
update_on_existing_pk
option is set to true
- Be ignored (the new record effectively discarded), if the
update_on_existing_pk
option is set to false
or not set
- If two or more records within a given batch insert have the same key values,
with respect to the primary key of the target table, the entire batch of
records to insert is rejected
By default, a table has no primary key. One must be explicitly designated
in the creation of the type schema associated with the
table. Only one primary key can exist per table.
The primary key for a table not created as replicated
becomes its implicit shard key, used for distributing its
records across processors. Replicated tables, by definition, are not
sharded and will necessarily have no shard key, implicit or otherwise.
This implicit shard key for non-replicated tables can only be overridden
when the primary key is a composite primary key and one or more of the
composite primary key columns is explicitly designated as the shard key.
The primary key designation also applies a
primary key index to the primary key columns.
Lastly, the primary key designation enables the column to serve as the target
of a foreign key.
Assigning a primary key to a table column requires two steps. The first is
to create a type schema, using /create/type, marking the
primary key field(s) with the primary_key
property. For example, to
create a product type schema with a primary key on product_id
, in
Python:
response = gpudb.create_type(
type_definition = """{
"type": "record",
"name": "product_type",
"fields": [
{"name":"product_id","type":"int"},
{"name":"product_name","type":"string"}
]
}""",
label = 'product_type',
properties = {'product_id': ['primary_key','int16']}
)
Note that the type_definition
is a JSON string defining the type schema.
The second step is to create a table with that type schema, using
/create/table, providing it the type_id
that is returned from the
/create/type call. Continuing from the previous example:
gpudb.create_table(
table_name = 'product',
type_id = response['type_id']
)
Shard Keys
Shard key is a designation that can be applied to a single
table column or set of columns (in the case of a
composite shard key) whose values are used in distributing records across
processors. This distribution allows for processing of queries against a
sharded table to be performed in parallel.
By default, a hash is computed for each record in a table and serves as the
key by which the associated record is distributed to its corresponding
processor, or shard. A shard key can be explicitly designated by assigning
the SHARD_KEY
property to one or more columns of a table.
Note
Store-only columns cannot be used as all
or part of a shard key.
A shard key is implicitly designated when a primary key
is assigned to any table not created as replicated; by
default, all columns involved in the primary key are used as the shard key.
If a primary key exists on a table, one or more of the columns composing the
primary key can be designated as the shard key; columns not part of the
primary key may not be given this designation. Designating a shard key does
not implicitly create a corresponding primary key for a table.
Only one shard key can exist per table.
Assigning a shard key to a table column requires two steps. The first is
to create a type schema, using /create/type, marking the
shard key field(s) with the primary_key
or shard_key
property (either
will work, as a primary key is a table's shard key, by default). For
example, to create a product type schema with a shard key on sku
, in
Python:
response = gpudb.create_type(
type_definition = """{
"type": "record",
"name": "product_type",
"fields": [
{"name":"product_id","type":"int"},
{"name":"sku","type":"int"},
{"name":"product_name","type":"string"}
]
}""",
label = 'product_type',
properties = {
'product_id': ['int16'],
'sku': ['shard_key']
}
)
Note that the type_definition
is a JSON string defining the type schema.
The second step is to create a table with that type schema, using
/create/table, providing it the type_id
that is returned from the
/create/type call. Continuing from the previous example:
gpudb.create_table(
table_name = 'product',
type_id = response['type_id']
)
Note also that sharding applies only to non-replicated tables, and the
default /create/table distribution scheme implied in the example
above is the non-replicated one. If an attempt is made to create a table
as replicated from a type that specifies a shard key that is not the same
as the primary key, the request will fail.
Foreign Keys
A foreign key is a tuning option for joining
tables. It acts as a relational index from a source table to
corresponding records in a target table. As new records are inserted into the
source table, the index is updated with references to the target records
associated with them. If a foreign key is not used, the
join relationships are established during query time.
In order for the linkage to be made successfully, it must connect the
shard key column of a source table with either the
similarly sharded primary key column of a
target table or the primary key column of a replicated
target table. The sharding requirement exists to ensure the related records
of each table being joined are located on the same processing node. The
replication alternative alleviates the sharding requirement, as all of the
records from the replicated target table will exist on all processor nodes,
and thus, appear to be local to all records from the source table involved in
the relation.
Foreign keys link single source columns to single primary key target
columns; multi-column foreign keys are not supported.
Multiple foreign keys can exist on a source table. Multiple foreign keys
may also exist on target tables; though, since foreign key targets must be
primary keys, the foreign keys would all have to target the same
primary key column.
Foreign keys do not provide referential integrity checks.
Foreign keys are designated at the time of table creation, using the
/create/table endpoint. Example (in Python):
gpudb.create_table(
table_name = my_FK_table,
type_id = my_type_id,
options = {
"foreign_key":"fk_column1 references my_PK_table(pk_column1)"
}
)