Kinetica supports the SQL concept of joining data sets, connecting related records between two or more tables.
The supported join types are:
INNER
- matching rows between two tablesLEFT
- matching rows between two tables, and rows in the left-hand table
with no matching rows in the right-hand tableRIGHT
- matching rows between two tables, and rows in the right-hand table
with no matching rows in the left-hand tableFULL OUTER
matching rows between two tables, and rows in both tables with
no matching rows in the otherTwo types of joins are available, depending on need:
A native join isolates the join operation and creates a native join view, whose purpose is to connect the tables involved in the join. The native join consists of three operations:
The /create/jointable endpoint is used to request a native join.
Once the join view has been created, adding or deleting data from the joined view base tables does not change the data queried from the join view, though updates will be reflected. The join view may be used as a source table in most native API calls.
Joins can be performed on any number and combination of tables and views but not on collections.
These limitations and others are discussed in further detail in the Limitations & Cautions section.
Any column expressions used in a join will be evaluated on querying the join view. More detail about the impact of using column expressions can be found under Memory Implications.
A SQL join is performed automatically when the database receives any SQL request containing a JOIN clause. It will generate a native join view, which will either be used as the base table for subsequent operations contained in the original query or as the source of records in the final result set. In either case, the join view is transient and will be removed, by default, after the query that caused it to be generated is complete.
SQL joins can be executed through either the /execute/sql endpoint or ODBC/JDBC.
There are two types of join distribution:
Tables being joined together must either be replicated, or be sharded on the columns being used to join the tables to avoid extensive interprocessor communication. While a join may include any number of replicated tables and sharded tables, all sharded tables must be joined by equating their respective shard key columns.
For instance, given a table A
and a table B
, both sharded on their
respective id
columns, the only node-local join possible is between
A.id
and B.id
. No other columns from either table can be used to
connect the two tables.
When a join view is created, it will contain the results of all join and filter clauses specified. It is recommended that all tables specified in the join view creation be joined within the specified expression. Any replicated tables not joined in the specified expression will be merged into the join view as cross-products. The /filter endpoint can be used subsequently to connect the unmerged tables.
Distributed joins allow the connection of data sets, regardless of their distribution scheme. In particular, this scheme makes the following possible:
A.id > A.id
) between two tables that are not both
replicatedImportant
Distributed joins are only available via the /execute/sql endpoint or SQL using ODBC/JDBC. The /create/jointable endpoint only processes local joins.
Distributed joins will reshard or replicate data, depending on four criteria:
INNER
LEFT
(RIGHT
)FULL
The following chart shows the resulting distribution action for equality-based joins, given the distribution schemes of the left-hand & right-hand tables. In parentheses are the join types that will result in a distribution operation.
Note
Cells marked with (All) indicate that a distribution operation will
occur for all join types, INNER
, LEFT
, RIGHT
, & FULL
.
Also, cells marked LEFT
can be used to infer the distribution for
RIGHT
joins, by simply reversing the left & right tables in the query.
Join Criteria Composition | Right Table | |||
Shard Key | Non-Shard Key | Replicated | ||
Left Table | Shard Key | <Node-local join> | Reshard Right (All) | Reshard Right (FULL) |
Non-Shard Key | Reshard Left (All) | Reshard Both (All) | Replicate Left (All) | |
Replicated | Reshard Left (LEFT,FULL) | Replicate Right (LEFT,FULL) | <Node-local join> |
Some example scenarios:
A
is joined on its shard key x
to a
replicated table B
on its column y
, and the join type is
FULL OUTER
, the right-side table B
will be sharded on its column
y
to match the sharding of the left-side table.A
is joined on columns w
& x
, which are not
its shard key, to a sharded table B
on its shard key columns y
&
z
, and the join is of any type, the left-side table A
will be
resharded on its columns w
& x
to match the sharding of the
right-side table.The following chart shows the resulting distribution action for inequality-based joins, given the distribution schemes of the left-hand & right-hand tables. In parentheses are the join types that will result in a distribution operation.
Note
Cells marked LEFT
can be used to infer the distribution for
RIGHT
joins, by simply reversing the left & right tables in the query.
Join Criteria Composition | Right Table | |||
Shard Key | Non-Shard Key | Replicated | ||
Left Table | Shard Key | Replicate Right (INNER,LEFT) Replicate Both (FULL) |
Replicate Right (INNER,LEFT) Replicate Both (FULL) |
Replicate Left (FULL) |
Non-Shard Key | Replicate Right (INNER,LEFT) Replicate Both (FULL) |
Replicate Right (INNER,LEFT) Replicate Both (FULL) |
Replicate Left (FULL) | |
Replicated | Replicate Right (LEFT,FULL) | Replicate Right (LEFT,FULL) | <Node-local join> |
Some example scenarios:
A
is joined on its shard key x
to a
replicated table B
on its column y
, and the join type is
FULL OUTER
, the left-side table A
will be replicated.A
is joined on columns w
& x
, which are not
its shard key, to a sharded table B
on its shard key columns y
&
z
, redistribution of data will occur based on join type:INNER
& LEFT
joins, the right-side table B
will be
replicatedFULL OUTER
joins, both tables will be replicatedTo create a join view, the /create/jointable endpoint requires four parameters:
TableA as A
, TableB as B
, ..., TableZ
)TableA.id
, B.id as B_id
, ...,
TableY.total + TableZ.total as YZ_total
)TableA.b_id = B.id and ... A.z_id = TableZ.id
)Given a table clause of TableA as A
, the a_id
column of that table can
be accessed via any of the following means:
a_id
)TableA.a_id
)A.a_id
)Note
All sharded tables must be associated through the given expression.
In Python, given tables customer
, order
and lineitem
, a
join view can be created via:
h_db.create_join_table(
join_table_name = 'customer_order_item',
table_names = ['customer as c','orders as o','lineitem as l'],
column_names = ['c_name','o_orderstatus','l_partkey','l_extendedprice','l_quantity','l_orderkey'],
expressions = ['c_custkey = o_custkey','o_orderkey = l_orderkey']
)
The view produced is an equi-join of the member tables, containing the specified columns from all three tables, accessible by their simple column names, given no name collisions among the columns selected. The result would match what would be produced by the SQL:
SELECT
c_name,
o_orderstatus,
l_partkey,
l_extendedprice,
l_quantity
FROM
customer c,
orders o,
lineitem l
WHERE
c_custkey = o_custkey AND
o_orderkey = l_orderkey
or:
SELECT
c_name,
o_orderstatus,
l_partkey,
l_extendedprice,
l_quantity
FROM
customer c
INNER JOIN
orders o ON c_custkey = o_custkey
INNER JOIN
lineitem l ON o_orderkey = l_orderkey
Since join views are read-only, no inserts, updates, or deletes can be issued against them. Additionally, the following data processing endpoints cannot be applied to a join view.
The examples in this section will be in Python. For easier use of the API, a handle to the join view created in the Creating a Join View section can be established:
join_view = gpudb.GPUdbTable(name = 'customer_order_item', db = h_db)
A Python example filter on the join view created in the Creating a Join View section for customer orders containing multiples of the same product is:
join_view.filter(
view_name = 'customer_order_item_multiple',
expression = 'l_quantity > 1'
)
When executed against a join view, the filter endpoint produces a view that is also a join view, having the same endpoint restrictions mentioned above. A chain of such filters can be used to create more and more restrictive views of the original data set. The filter endpoint can also be used to merge any tables not merged during the creation of the original join view or during subsequent filtering of the view.
To call the group-by endpoint, in Python:
records = join_view.aggregate_group_by(
column_names = [
'c_name',
'SUM(l_quantity) as total_items',
'SUM(l_quantity * l_extendedprice) as total_spent'
],
offset = 0,
options = {'sort_by':'key'}
)['data']
To call the statistics endpoint, in Python:
revenue = join_view.aggregate_statistics(
column_name = 'l_quantity * l_extendedprice',
stats = 'sum'
)['stats']['sum']
To call the endpoint to retrieve data from specific join view columns, in Python:
records = join_view.get_records_by_column(
column_names = ['c_name','o_orderstatus','l_partkey','l_extendedprice','l_quantity'],
offset = 0,
limit = gpudb.GPUdb.END_OF_SET
)
This example shows how the TPC-H benchmark's Query 5 SQL statement can be implemented in the Kinetica Python interface. The SQL for Query 5 is:
SELECT
n_name,
SUM(l_extendedprice * (1 - l_discount)) AS revenue
FROM
customer,
orders,
lineitem,
supplier,
nation,
region
WHERE
c_custkey = o_custkey AND
l_orderkey = o_orderkey AND
l_suppkey = s_suppkey AND
c_nationkey = s_nationkey AND
s_nationkey = n_nationkey AND
n_regionkey = r_regionkey AND
r_name = 'ASIA' AND
o_orderdate >= '1994-01-01' AND
o_orderdate < DATE('1994-01-01') + INTERVAL '1' YEAR
GROUP BY
n_name
ORDER BY
revenue DESC
To implement the above statement, a join view needs to be created with all of
the tables listed in the FROM
clause, and an expression, based on the
WHERE
clause, which binds the tables together and applies the necessary
filters:
h_db.create_join_table(
join_table_name = 'tpch_query_5_join_filter',
table_names = ['customer','orders','lineitem','supplier','nation','region'],
column_names = ['n_name','l_extendedprice * (1 - l_discount) as effective_price'],
expressions = [
'c_custkey = o_custkey',
'l_orderkey = o_orderkey',
'l_suppkey = s_suppkey',
'c_nationkey = s_nationkey',
's_nationkey = n_nationkey',
'n_regionkey = r_regionkey',
'r_name = \'ASIA\'',
'o_orderdate >= \'1994-01-01\'',
"o_orderdate < TIMESTAMPADD(YEAR, 1, DATE('1994-01-01'))"
]
)
Neither the table names nor the column names of the join view need to be
aliased, as all column names are unique across all tables in this example.
However, the derived column expression still requires an alias. Note also that
columns used in the expressions
list do not need to appear in the
column_names
list.
This join view could also have been done in two steps, a join and a filter:
h_db.create_join_table(
join_table_name = 'tpch_query_5_join',
table_names = ['customer','orders','lineitem','supplier','nation','region'],
column_names = ['n_name','l_extendedprice * (1 - l_discount) as effective_price','r_name','o_orderdate','l_orderkey'],
expressions = [
'c_custkey = o_custkey',
'l_orderkey = o_orderkey',
'l_suppkey = s_suppkey',
'c_nationkey = s_nationkey',
's_nationkey = n_nationkey',
'n_regionkey = r_regionkey'
]
)
The join view can then have the WHERE
clause filters applied in a
separate step:
join_view.filter(
view_name = 'tpch_query_5_join_filter',
expression = 'r_name = \'ASIA\' and ' \
"o_orderdate >= '1994-01-01' and " \
"o_orderdate < TIMESTAMPADD(YEAR, 1, DATE('1994-01-01'))"
)
Note that the columns used in the filter's expression
must appear in the
join's column_names
list, so that they are available for subsequent filter
calls to use.
Regardless of whether the join/filter was done in one step or two, the results
need to be grouped and summed to match the TPC-H query. The following call is
given the directive to sort_by
the value
, which will sort on the
aggregation column:
records = join_view.aggregate_group_by(
column_names = [
'n_name',
'SUM(effective_price) as revenue'
],
offset = 0,
options = {"sort_by":"value", "sort_order":"descending"}
)['data']
The results can be displayed by calling:
for record in zip(records["n_name"], records["revenue"]):
print "{:<13s} {:7.2f}".format(record[0], record[1])
In order to improve the performance of a join operation, the columns used to make the relation should be indexed.
When a join is performed along a foreign key relation, it is indexed automatically as a function of the foreign key itself--no additional setup is required.
When a join is performed along a relation where no foreign key exists, the indexing must be done as a separate configuration step. Both columns, on either side of the join relation, should have a column index applied. An error may result if a join is attempted across a relation where only one side of the relation is indexed.
When a composite equi-join is performed, only one index will be used. For
example: you have two tables, a
and b
, that each have the same columns,
x
and y
. You also have an attribute index on each column. If you use the
expression a.x == b.x and a.y == b.y
, only the first expression,
a.x == b.x
, will use the index.
Joins have a variable memory footprint depending on the type of join. Kinetica will attempt to create the most efficient join as possible. The amount of memory a join requires depends on the type of join and the source data sets being joined together.
Any column expression used in the column list for a join will be evaluated when the join is queried, requiring no persistent memory to hold the calculated values. While the memory footprint is small, the performance impact could be large, as the expression will need to be re-evaluated each time the join is queried.
Join views are affected by data modifications on underlying tables in different ways, based on the type of data modification.
Overall, data manipulation within the tables supporting a join view will not cause the number of records returned by the join view to change, though updates to records already part of the join view record set will be reflected in the data values retrieved from the join view.