Version:

Joins

Kinetica supports the SQL concept of joining data sets. It does so through the /create/jointable endpoint, which creates an intermediary join view, whose purpose is to connect the tables involved in the join before any query or aggregation operations are applied. In addition to any equi-join clauses specified in the join view creation expression, filtering may also be used to further refine the data set.

Tables being joined together must either be replicated, or be sharded on the columns being used to join the tables to avoid extensive interprocessor communication. While a join may include any number of replicated tables and sharded tables, all sharded tables must be joined on their respective shard key columns. Distributed joins, or joins that connect sharded tables on columns other than their shard keys, are not supported at this time.

For instance, given a table A and a table B, both sharded on their respective id columns, the only join possible is between A.id and B.id. No other columns from either table can be used to connect the two tables.

When a join view is created, it will contain the results of all join and filter clauses specified. Any replicated tables not joined in the specified expression will be merged into the join view as cross-products. Since this is a generally undesirable outcome, it is recommended that all tables specified in the join view creation be joined within the specified expression. Any tables not merged into the overall join by the expression can be subsequently merged by using the /filter endpoint to connect the unmerged tables.

Once the join view has been created, adding or deleting data from the joined view base tables does not change the data queried from the join view, though updates will be reflected. The join view may be used as a source table in most API calls.

Joins can be performed on any number and combination of tables and views but not on collections.

These limitations and others are discussed in further detail in the Limitations & Cautions section.

Any column expressions used in a join will be evaluated on querying the join view. More detail about the impact of using column expressions can be found under Memory Implications.

Refreshing a Join View

There are three modes that govern the way in which a join view can be refreshed:

Manual Mode

In manual mode, the join view will only be made fully current by performing a manual refresh of the view. If an incremental refresh is applied, only inserts into the base table will be reflected in the updated content of the join view. If inserts are made into tables other than the base table, a full refresh will be performed. This will result in a rebuilding of all the view's data, accounting for any updates & deletes.

On Query Mode

In on query mode, the join view will be made current when a query is made against it and records have been inserted into any of the data sets that compose the view. If records have only been added to the base table, an incremental refresh will be performed; otherwise, a full refresh will be performed. If no data is added between the issuance of two queries against the join view, no rebuilding of the view will be performed.

On Insert Mode

In on insert mode, the join view will be kept current by automatically adding to it each record inserted into the base table. Outside of base table inserts, this mode functions the same as on query mode.

The primary use case for this mode is for a view against a set of tables where the data manipulation operations consist almost entirely of base table inserts and the queries against the join view need to be fast and current, with respect to those inserts. An example would be an OLAP system consisting of mostly inserts into a fact table.

Manual Refresh

A manual refresh can be performed on any join view, regardless of operating mode, to update the view's data. If a join view is insert-triggered, it may be necessary to refresh the data periodically to properly reflect updates & deletes that have taken place since the join view was created or since the last refresh. There are two types of refresh operations that can be performed: full & incremental.

A full refresh will cause a rebuilding of the entire join view, irrespective of what data manipulation operations have taken place. This operation takes the same time as creating the original view, but is the one that is guaranteed to properly reflect updates & deletes made to the underlying tables.

An incremental refresh is insert-based. If one is requested on a join view in which the only new data are records inserted into the base table, only those records will be added to the join view. If inserts have been made into tables other than the base table, a full refresh will be performed.

Note that it is generally unnecessary to issue an incremental refresh against a join view in on insert mode or on query mode, as both of these use the incremental refresh automatically to stay current.

Creating a Join View

To create a join view, the /create/jointable endpoint requires four parameters:

  1. the name of the join view to create
  2. the list of member tables to be joined with optional aliases (TableA as A, TableB as B, ..., TableZ)
  3. the list of columns and/or column expressions selected from the joined tables with optional aliases (TableA.id, B.id as B_id, ..., TableY.total + TableZ.total as YZ_total)
  4. the SQL-style expression by which the tables can be joined and, optionally, filtered (TableA.b_id = B.id and ... A.z_id = TableZ.id)

Given a table clause of TableA as A, the a_id column of that table can be accessed via any of the following means:

  • unique name, assuming the name is unique across the tables being joined (a_id)
  • name with table prefix (TableA.a_id)
  • name with table alias (A.a_id)

Note

All sharded tables must be associated through the given expression.

Examples

In Python, given tables customer, order and lineitem, a join view can be created via:

retobj = gpudb.create_join_table(
             join_table_name = 'customer_order_item',
             table_names = ['customer as c','order as o','lineitem as l'],
             column_names = ['c.name','o.orderstatus','l.partkey','l.price','l.quantity'],
             expressions = ['c.custkey = o.custkey','o.orderkey = l.orderkey']
         )

The view produced is an equi-join of the member tables, containing the specified columns from all three tables, accessible by their simple column names, given no name collisions among the columns selected. The result would match what would be produced by the SQL:

SELECT
    c.name,
    o.orderstatus,
    l.partkey,
    l.price,
    l.quantity
FROM
    customer c,
    order o,
    lineitem l
WHERE
    c.customerkey = o.customerkey AND
    o.orderkey = l.orderkey

or:

SELECT
    c.name,
    o.orderstatus,
    l.partkey,
    l.price,
    l.quantity
FROM
    customer c
INNER JOIN
    order o ON c.customerkey = o.customerkey
INNER JOIN
    lineitem l ON o.orderkey = l.orderkey

Operating on a Join View

The list of endpoints currently implemented to operate against join view is as follows:

Examples

A Python example filter on the join view created in the Creating a Join View section for customer orders containing multiples of the same product is:

retobj = gpudb.filter(
             table_name = 'customer_order_item',
             view_name = 'customer_order_item_multiple',
             expression = 'quantity > 1'
         )

When executed against a join view, the filter endpoint produces a view that is also a join view, having the same endpoint restrictions mentioned above. A chain of such filters can be used to create more and more restrictive views of the original data set. The filter endpoint can also be used to merge any tables not merged during the creation of the original join view or during subsequent filtering of the view.

To call the group-by endpoint, in Python:

retobj = gpudb.aggregate_group_by(
             table_name = 'customer_order_item',
             column_names = [
                 'name',
                 'sum(quantity) as total_items',
                 'sum(quantity * price) as total_spent'
             ],
             offset = 0
         )

To call the statistics endpoint, in Python:

retobj = gpudb.aggregate_statistics(
             table_name = 'customer_order_item',
             column_name = 'quantity * price',
             stats = 'sum'
         )

To call the endoint to retrieve data from specific join view columns, in Python:

retobj = gpudb.get_records_by_column(
             table_name = 'customer_order_item',
             column_names = ['name','orderstatus','partkey'],
             offset = 0,
             limit = 50
         )

SQL Conversion Example

This example shows how the TPC-H benchmark's Query 5 SQL statement can be implemented in the Kinetica Python interface. The SQL for Query 5 is:

SELECT
    n_name,
    SUM(l_extendedprice * (1 - l_discount)) AS revenue
FROM
    customer,
    orders,
    lineitem,
    supplier,
    nation,
    region
WHERE
    c_custkey = o_custkey AND
    l_orderkey = o_orderkey AND
    l_suppkey = s_suppkey AND
    c_nationkey = s_nationkey AND
    s_nationkey = n_nationkey AND
    n_regionkey = r_regionkey AND
    r_name = 'ASIA' AND
    o_orderdate >= 757382400 AND
    o_orderdate < 788918400
GROUP BY
    n_name
ORDER BY
    revenue DESC;

To implement the above statement, a join view needs to be created with all of the tables listed in the FROM clause, and an expression, based on the WHERE clause, which binds the tables together and applies the necessary filters:

retobj = gpudb.create_join_table(
             join_table_name = 'tpch_query_5_join_filter',
             table_names = ['customer','orders','lineitem','supplier','nation','region'],
             column_names = ['l_extendedprice * (1 - l_discount) as effective_price','n_name'],
             expressions = [
                 'c_custkey = o_custkey',
                 'l_orderkey = o_orderkey',
                 'l_suppkey = s_suppkey',
                 'c_nationkey = s_nationkey',
                 's_nationkey = n_nationkey',
                 'n_regionkey = r_regionkey',
                 'r_name = \'ASIA\'',
                 'o_orderdate >= 757382400',
                 'o_orderdate < 788918400'
             ]
         )

Neither the table names nor the column names of the join view need to be aliased, as all column names are unique across all tables in this example. However, the derived column expression still requires an alias. Note also that columns used in the expressions list do not need to appear in the column_names list.

This join view could also have been done in two steps, a join and a filter:

retobj = gpudb.create_join_table(
             join_table_name = 'tpch_query_5_join',
             table_names = ['customer','orders','lineitem','supplier','nation','region'],
             column_names = ['o_orderdate','l_extendedprice * (1 - l_discount) as effective_price','n_name','r_name'],
             expressions = [
                 'c_custkey = o_custkey',
                 'l_orderkey = o_orderkey',
                 'l_suppkey = s_suppkey',
                 'c_nationkey = s_nationkey',
                 's_nationkey = n_nationkey',
                 'n_regionkey = r_regionkey'
             ]
         )

The join view can then have the WHERE clause filters applied in a separate step:

retobj = gpudb.filter(
             table_name = 'tpch_query_5_join',
             view_name = 'tpch_query_5_join_filter',
             expression = 'r_name = \'ASIA\' and ' \
                          'o_orderdate >= 757382400 and ' \
                          'o_orderdate < 788918400'
         )

Note that the columns used in the filter's expression must appear in the join's column_names list, so that they are available for subsequent filter calls to use.

Regardless of whether the join/filter was done in one step or two, the results need to be grouped and summed to match the TPC-H query. The following call is given the directive to sort_by the value, which will sort on the aggregation column:

retobj = gpudb.aggregate_group_by(
             table_name = 'tpch_query_5_join_filter',
             column_names = [
                 'n_name',
                 'sum(effective_price)'
             ],
             offset = 0,
             options = {'sort_by':'value', 'sort_order':'descending'}
         )

The results can be displayed by calling:

gpudb.parse_dynamic_response(retobj, do_print=True)

Performance Optimization

In order to improve the performance of a join operation, the columns used to make the relation should be indexed.

When a join is performed along a foreign key relation, it is indexed automatically as a function of the foreign key itself--no additional setup is required.

When a join is performed along a relation where no foreign key exists, the indexing must be done as a separate configuration step. Both columns, on either side of the join relation, should have a column index applied. An error may result if a join is attempted across a relation where only one side of the relation is indexed.

When a composite equi-join is performed, only one index will be used. For example: you have two tables, a and b, that each have the same columns, x and y. You also have an attribute index on each column. If you use the expression a.x == b.x and a.y == b.y, only the first expression, a.x == b.x, will use the index.

Limitations & Cautions

  • Joins must merge together simple data sets, and can therefore use both tables and views as sources, but not collections
  • Joins cannot be used to render geospatial tracks; read more about geospatial objects here
  • Tables being joined together must either be replicated or be sharded on the columns being used to join the tables
  • While a join may include any number of replicated tables and sharded tables, all sharded tables must be joined on their respective shard key columns
  • Distributed joins, or joins that connect sharded tables on columns other than their shard keys, are not supported
  • A replicated table cannot be left (outer) joined to a sharded table, but a sharded table can be left (outer) joined to a replicated table
  • A join view has no permanence by default because it is not protected and will expire after the default TTL setting
  • A join view with a refresh mode of on query or on insert will be created with a TTL set to not expire
  • When a data set is cleared (dropped) all the join views created from the data set are also automatically cleared
  • Records cannot be added to a join view, nor can records be edited or deleted

Memory Implications

Joins have a variable memory footprint depending on the type of join. Kinetica will attempt to create the most efficient join as possible. The amount of memory a join requires depends on the type of join and the source data sets being joined together.

  • A primary key join occupies the least amount of memory in Kinetica. This type is created from a join between two data sets with a primary key - foreign key relationship, allowing for an index on the lookup between the two data sets.
  • A bitmask join occupies the second least amount of memory in Kinetica. This type results from a join between two data sets with no filtering of either table.
  • An equijoin typically occupies the third least amount of memory in Kinetica. This type results from a join between two data sets containing an equality operator, returning only the records with equivalent values for the specified columns.
  • A predicate join can be the most expensive or one of the cheapest for your memory; it entirely depends on how sparse the join is. A predicate join comprises a list of indices, or matched pairs of values, from the source data sets. This type results from a join between two data sets that are filtered prior to joining them.

Any column expression used in the column list for a join will be evaluated when the join is queried, requiring no persistent memory to hold the calculated values. While the memory footprint is small, the performance impact could be large, as the expression will need to be re-evaluated each time the join is queried.

Operations on Underlying Tables

Join views are affected by data modifications on underlying tables in different ways, based on the type of data modification.

  • inserts - if a record is added to an underlying table, it will not appear in queries of the join view, whether or not the applied filter would include it
  • updates - if a record that already contributes to a join view record set is modified in its underlying table, it will appear with the modified values when the join view is re-queried; if a record that does not contribute to a join view record set is modified in its underlying table in a way that would make it part of the join view record set, it will not appear when the join view is re-queried
  • deletes - if a record that already contributes to a join view record set is deleted from its underlying table, it will still appear when the join view is re-queried

Overall, data manipulation within the tables supporting a join view will not cause the number of records returned by the join view to change, though updates to records already part of the join view record set will be reflected in the data values retrieved from the join view.

Cartesian Products

Though the sharded tables in a join view must be joined in the join view creation expression, the replicated tables do not have to be. In this case, a Cartesian Product will result for every unmerged replicated table.

For example, given table A with integer columns x and y and three records, and table B with a string column r and an integer column s and two records, as follows:

Table A:

x y
1 10
2 20
3 30

Table B:

r s
'x' 100
'y' 200

Then the join view of A and B, prefixing the result columns with a_ and b_, respectively, and with no expression given is a 4-column view with 6 records, as follows:

a_x a_y b_r b_s
1 10 'x' 100
1 10 'y' 200
2 20 'x' 100
2 20 'y' 200
3 30 'x' 100
3 30 'y' 200

It should be noted that while the view produced is the cross-product of the input tables, the actual cross-product is never realized in memory. However, when doing an aggregate operation (such as group-by or histogram) on the join view, every object in the cross-product will be visited. Filter operations that transform input join views into smaller result join views should be done prior to any aggregation queries, reducing the number of records that must be visited.

Note

In order for a join to be allowed to contain a Cartesian Product, both the server and the client call need to be configured properly. See the Configuration Reference section for server configuration details and /create/jointable endpoint documentation for API details. If not configured properly, a Cartesian Product may result in an error.