> ## Documentation Index
> Fetch the complete documentation index at: https://docs.kinetica.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Joins

<a id="joining" />

<a id="joined" />

<a id="join" />

<a id="joins" />

*Kinetica* supports the SQL concept of joining data sets, via *join views*,
connecting related records between two or more tables.

<Info>
  For SQL views, see [CREATE VIEW](/content/sql/ddl#sql-create-view).
</Info>

A *join view* name must adhere to the standard
[naming criteria](/content/concepts/tables#table-naming-criteria).  Each *join view*
exists within a [schema](/content/concepts/schemas) and follows the standard
[name resolution rules](/content/concepts/tables#table-name-resolution) for *tables*.

The supported *join types* are:

* `INNER` - matching rows between two tables
* `LEFT` - matching rows between two tables, and rows in the left-hand table
  with no matching rows in the right-hand table
* `RIGHT` - matching rows between two tables, and rows in the right-hand table
  with no matching rows in the left-hand table
* `FULL OUTER` matching rows between two tables, and rows in both tables with
  no matching rows in the other
* *Cross* - all rows in one table matched against all rows in the other; this is
  the *join* that is executed when no *join* conditions are specified--there is
  no keyword for specifying this type

## Join Execution Types

Two types of *joins* are available, depending on need:

* [Native Joins](/content/concepts/joins#join-native)
* [SQL Joins](/content/concepts/joins#join-sql)

<a id="join-native" />

### Native Joins

A *native join* isolates the *join* operation and creates a native *join view*,
whose purpose is to connect the *tables* involved in the *join*.  The
*native join* consists of three operations:

1. Joining of records between tables
2. Filtering of joined records
3. Selection of columns to include in the resulting *join view*

The [/create/jointable](/content/api/rest/create_jointable_rest) endpoint is used to request a
*native join*.

Once the *join view* has been created, adding or deleting data from the
*joined view* base tables does not change the data queried from the *join view*,
though updates will be reflected.  The *join view* may be used as a source table
in most native API calls.

*Joins* can be performed on any number and combination of [tables](/content/concepts/tables#table)
and [views](/content/concepts/views#view).

These limitations and others are discussed in further detail in the
[Limitations & Cautions](/content/concepts/joins#join-limits) section.

Any column expressions used in a *join* will be evaluated on querying the
*join view*.  More detail about the impact of using column expressions can be
found under [Memory Implications](/content/concepts/joins#join-memory-implications).

<a id="join-sql" />

### SQL Joins

A [SQL join](/content/sql/query#sql-join) is performed automatically when the database
receives any SQL request containing a *JOIN* clause.  It will generate a native
*join view*, which will either be used as the base table for subsequent
operations contained in the original query or as the source of records in the
final result set.  In either case, the *join view* is transient and will be
removed, by default, after the query that caused it to be generated is complete.

*SQL joins* can be executed through either the [/execute/sql](/content/api/rest/execute_sql_rest)
endpoint or [ODBC/JDBC](/content/connectors/sql_guide).

## Join Distribution Types

There are two types of join distribution:

1. [Node-Local Joins](/content/concepts/joins#join-local) - each node executes its portion of the
   overall join operation; faster, but join distribution criteria must be met
2. [Distributed Joins](/content/concepts/joins#join-distributed) - data is redistributed
   across the nodes so that *node-local joins* can be performed; slower and
   requiring more memory, but with no distribution restrictions

<a id="join-local" />

### Node-Local Joins

*Tables* being *joined* together must either be [replicated](/content/concepts/tables#replication),
or be [sharded](/content/concepts/tables#sharding) on the columns being used to *join* the
*tables* to avoid extensive interprocessor communication.  While a *join* may
include any number of *replicated tables* and *sharded tables*, all
*sharded tables* must be joined by equating their respective
[shard key](/content/concepts/tables#shard-key) columns.

For instance, given a *table* `A` and a *table* `B`, both *sharded* on their
respective `id` columns, the only *node-local join* possible is between
`A.id` and `B.id`.  No other columns from either *table* can be used to
connect the two *tables*.

When a *join view* is created, it will contain the results of all *join* and
filter clauses specified.  It is recommended that all *tables* specified in the
*join view* creation be *joined* within the specified expression.  Any
*replicated tables* not *joined* in the specified expression will be merged into
the *join view* as cross-products.  The [/filter](/content/api/rest/filter_rest) endpoint can be used
subsequently to connect the unmerged *tables*.

<a id="join-distributed" />

### Distributed Joins

*Distributed joins* allow the connection of data sets, regardless of their
[distribution scheme](/content/concepts/tables#distribution).  These joins are less
performant due to interprocessor communication overhead and require more memory
& disk space to process.  This scheme makes the following possible:

* *Inner join* between [sharded](/content/concepts/tables#sharding) tables on columns other than
  their [shard keys](/content/concepts/tables#shard-key)
* *Left outer join* from a [replicated](/content/concepts/tables#replicated) table to a *sharded*
  table
* *Full outer join* between two tables that are not both *replicated*
* *Non-equi join* (`A.id > A.id`) between two tables that are not both
  *replicated*

<Note>
  *Distributed joins* are only available via the
  [/execute/sql](/content/api/rest/execute_sql_rest) endpoint or [SQL](/content/sql)
  using ODBC/JDBC.  The [/create/jointable](/content/api/rest/create_jointable_rest) endpoint
  only processes *local joins*.
</Note>

*Distributed joins* will reshard or replicate data, depending on four criteria:

1. Whether the join condition is an equality-based or inequality-based

2. Whether the left-hand table is:

   * Replicated
   * Being joined on its *shard key*
   * Neither replicated nor being joined on its *shard key*

3. Whether the right-hand table is:

   * Replicated
   * Being joined on its *shard key*
   * Neither replicated nor being joined on its *shard key*

4. The type of join:

   * `INNER`
   * `LEFT` (`RIGHT`)
   * `FULL`

#### Equality-Based Join Rules

The following chart shows the resulting distribution action for equality-based
joins, given the distribution schemes of the left-hand & right-hand tables.  In
parentheses are the join types that will result in a distribution operation.

<Info>
  Cells marked with *(All)* indicate that a distribution operation will
  occur for *all* join types, `INNER`, `LEFT`, `RIGHT`, & `FULL`.
  Also, cells marked `LEFT` can be used to infer the distribution for
  `RIGHT` joins, by simply reversing the left & right tables in the query.
</Info>

| Left Table    | Right Table   | Join Distribution Operation   |
| ------------- | ------------- | ----------------------------- |
| Shard Key     | Shard Key     | *\<Node-local join>*          |
| Shard Key     | Non-Shard Key | Reshard Right *(All)*         |
| Shard Key     | Replicated    | Reshard Right *(FULL)*        |
| Non-Shard Key | Shard Key     | Reshard Left *(All)*          |
| Non-Shard Key | Non-Shard Key | Reshard Both *(All)*          |
| Non-Shard Key | Replicated    | Replicate Left *(All)*        |
| Replicated    | Shard Key     | Reshard Left *(LEFT,FULL)*    |
| Replicated    | Non-Shard Key | Replicate Right *(LEFT,FULL)* |
| Replicated    | Replicated    | *\<Node-local join>*          |

Some example scenarios:

* If a *sharded* table `A` is joined on its *shard key* `x` to a
  *replicated* table `B` on its column `y`, and the join type is
  `FULL OUTER`, the right-side table `B` will be *sharded* on its column
  `y` to match the *sharding* of the left-side table.

* If a *sharded* table `A` is joined on columns `w` & `x`, which are not
  its *shard key*, to a *sharded* table `B` on its *shard key* columns `y` &
  `z`, and the join is of any type, the left-side table `A` will be
  resharded on its columns `w` & `x` to match the *sharding* of the
  right-side table.

* If a *replicated* table is joined to a *replicated* table, and the join is of
  any type, it will be a *node-local join* and not require redistribution of any
  data.

#### Inequality-Based Join Rules

The following chart shows the resulting distribution action for inequality-based
joins, given the distribution schemes of the left-hand & right-hand tables.  In
parentheses are the join types that will result in a distribution operation.

<Info>
  Cells marked `LEFT` can be used to infer the distribution for
  `RIGHT` joins, by simply reversing the left & right tables in the query.
</Info>

| Left Table    | Right Table   | Join Distribution Operation                                          |
| ------------- | ------------- | -------------------------------------------------------------------- |
| Shard Key     | Shard Key     | Replicate Right *(INNER,LEFT)* <br /> <br /> Replicate Both *(FULL)* |
| Shard Key     | Non-Shard Key | Replicate Right *(INNER,LEFT)* <br /> <br /> Replicate Both *(FULL)* |
| Shard Key     | Replicated    | Replicate Left *(FULL)*                                              |
| Non-Shard Key | Shard Key     | Replicate Right *(INNER,LEFT)* <br /> <br /> Replicate Both *(FULL)* |
| Non-Shard Key | Non-Shard Key | Replicate Right *(INNER,LEFT)* <br /> <br /> Replicate Both *(FULL)* |
| Non-Shard Key | Replicated    | Replicate Left *(FULL)*                                              |
| Replicated    | Shard Key     | Replicate Right *(LEFT,FULL)*                                        |
| Replicated    | Non-Shard Key | Replicate Right *(LEFT,FULL)*                                        |
| Replicated    | Replicated    | *\<Node-local join>*                                                 |

Some example scenarios:

* If a *sharded* table `A` is joined on its *shard key* `x` to a
  *replicated* table `B` on its column `y`, and the join type is
  `FULL OUTER`, the left-side table `A` will be *replicated*.

* If a *sharded* table `A` is joined on columns `w` & `x`, which are not
  its *shard key*, to a *sharded* table `B` on its *shard key* columns `y` &
  `z`, redistribution of data will occur based on join type:

  * For `INNER` & `LEFT` joins, the right-side table `B` will be
    *replicated*
  * For `FULL OUTER` joins, both tables will be *replicated*

* If a *replicated* table is joined to a *replicated* table, and the join is of
  any type, it will be a *node-local join* and not require redistribution of any
  data.

<a id="join-create" />

## Creating a Join View

To create a *join view*, the [/create/jointable](/content/api/rest/create_jointable_rest) endpoint requires four
parameters:

1. the name of the *join view* to create
2. the list of member *tables* to be *joined* with optional aliases
   (`TableA as A`, `TableB as B`, ..., `TableZ`)
3. the list of columns and/or column expressions selected from the *joined*
   tables with optional aliases (`TableA.id`, `B.id as B_id`, ...,
   `TableY.total + TableZ.total as YZ_total`)
4. the SQL-style expression by which the *tables* can be *joined* and,
   optionally, filtered (`TableA.b_id = B.id and ... A.z_id = TableZ.id`)

Given a *table* clause of `TableA as A`, the `a_id` column of that table can
be accessed via any of the following means:

* unique name, assuming the name is unique across the *tables* being *joined*
  (`a_id`)
* name with *table* prefix (`TableA.a_id`)
* name with *table* alias (`A.a_id`)

<Info>
  All *sharded tables* must be associated through the given expression.
</Info>

### Examples

#### Inner Join

In *Python*, given *tables* `customer`, `order` and `lineitem`, a
*join view* can be created via:

```python theme={null}
kinetica.create_join_table(
    join_table_name = "example.customer_order_item",
    table_names = ["example.customer as c","example.orders as o","example.lineitem as l"],
    column_names = ["c_name","o_orderstatus","l_partkey","l_extendedprice","l_quantity","l_orderkey"],
    expressions = ["c_custkey = o_custkey","o_orderkey = l_orderkey"]
)
```

The *view* produced is an equi-join of the member *tables*, containing the
specified columns from all three *tables*, accessible by their simple column
names, given no name collisions among the columns selected.  The result would
match what would be produced by the SQL:

```sql theme={null}
SELECT
    c_name,
    o_orderstatus,
    l_partkey,
    l_extendedprice,
    l_quantity
FROM
    example.customer c,
    example.orders o,
    example.lineitem l
WHERE
    c_custkey = o_custkey AND
    o_orderkey = l_orderkey
```

or:

```sql theme={null}
SELECT
    c_name,
    o_orderstatus,
    l_partkey,
    l_extendedprice,
    l_quantity
FROM
    example.customer c
INNER JOIN
    example.orders o ON c_custkey = o_custkey
INNER JOIN
    example.lineitem l ON o_orderkey = l_orderkey
```

#### Outer Join

To perform a *left outer join* of orders with or without line items, given
*tables* `order` & `lineitem`, a *join view* can be created via:

```python theme={null}
kinetica.create_join_table(
    join_table_name = "example.order_lineitem",
    table_names = ["example.orders as o","example.lineitem as l"],
    column_names = ["o_orderkey","l_partkey","l_extendedprice","l_quantity"],
    expressions = ["left join o, l on (o_orderkey = l_orderkey)"]
)
```

The *view* produced would match what would be produced by the SQL:

```sql theme={null}
SELECT
    o_orderkey,
    l_partkey,
    l_extendedprice,
    l_quantity
FROM
    example.orders o
LEFT JOIN
    example.lineitem l
ON
    o_orderkey = l_orderkey
```

## Operating on a Join View

Since *join views* are read-only, no inserts, updates, or deletes can be issued
against them.  Additionally, the following data processing endpoints cannot be
applied to a *join view*.

* [/aggregate/convexhull](/content/api/rest/aggregate_convexhull_rest)
* [/aggregate/kmeans](/content/api/rest/aggregate_kmeans_rest)
* [/aggregate/minmax/geometry](/content/api/rest/aggregate_minmax_geometry_rest)
* [/filter/bygeometry](/content/api/rest/filter_bygeometry_rest)
* [/filter/byrange](/content/api/rest/filter_byrange_rest)
* [/filter/bystring](/content/api/rest/filter_bystring_rest)
* [/filter/byvalue](/content/api/rest/filter_byvalue_rest)

### Examples

The examples in this section will be in *Python*.  For easier use of the API, a
handle to the *join view* created in the [Creating a Join View](/content/concepts/joins#join-create) section can be
established:

```python theme={null}
join_view = gpudb.GPUdbTable(name = "example.customer_order_item", db = kinetica)
```

A *Python* example filter on the *join view* created in the
[Creating a Join View](/content/concepts/joins#join-create) section for customer orders containing
multiples of the same product is:

```python theme={null}
join_view.filter(
    view_name = "example.customer_order_item_multiple",
    expression = "l_quantity > 1"
)
```

When executed against a *join view*, the filter endpoint produces a *view* that
is also a *join view*, having the same endpoint restrictions mentioned above. A
chain of such filters can be used to create more and more restrictive *views* of
the original data set.  The filter endpoint can also be used to merge any
*tables* not merged during the creation of the original *join view* or during
subsequent filtering of the *view*.

To call the group-by endpoint, in *Python*:

```python theme={null}
records = join_view.aggregate_group_by(
    column_names = [
        "c_name",
        "SUM(l_quantity) as total_items",
        "SUM(l_quantity * l_extendedprice) as total_spent"
    ],
    offset = 0,
    options = {"sort_by":"key"}
)["data"]
```

To call the statistics endpoint, in *Python*:

```python theme={null}
revenue = join_view.aggregate_statistics(
    column_name = "l_quantity * l_extendedprice",
    stats = "sum"
)["stats"]["sum"]
```

To call the endpoint to retrieve data from specific *join view* columns, in
*Python*:

```python theme={null}
records = join_view.get_records_by_column(
    column_names = ["c_name","o_orderstatus","l_partkey","l_extendedprice","l_quantity"],
    offset = 0,
    limit = gpudb.GPUdb.END_OF_SET
)
```

## SQL Conversion Example

This example shows how the *TPC-H* benchmark's *Query 5* SQL statement can be
implemented in the *Kinetica* *Python* interface.  The SQL for *Query 5* is:

```sql theme={null}
SELECT
    n_name,
    SUM(l_extendedprice * (1 - l_discount)) AS revenue
FROM
    example.customer,
    example.orders,
    example.lineitem,
    example.supplier,
    example.nation,
    example.region
WHERE
    c_custkey = o_custkey AND
    l_orderkey = o_orderkey AND
    l_suppkey = s_suppkey AND
    c_nationkey = s_nationkey AND
    s_nationkey = n_nationkey AND
    n_regionkey = r_regionkey AND
    r_name = 'ASIA' AND
    o_orderdate >= '1994-01-01' AND
    o_orderdate < DATE('1994-01-01') + INTERVAL '1' YEAR
GROUP BY
    n_name
ORDER BY
    revenue DESC
```

To implement the above statement, a *join view* needs to be created with all of
the *tables* listed in the `FROM` clause, and an expression, based on the
`WHERE` clause, which binds the *tables* together and applies the necessary
filters:

```python theme={null}
kinetica.create_join_table(
    join_table_name = "example.tpch_query_5_join_filter",
    table_names = [
        "example.customer",
        "example.orders",
        "example.lineitem",
        "example.supplier",
        "example.nation",
        "example.region"
    ],
    column_names = ["n_name","l_extendedprice * (1 - l_discount) as effective_price"],
    expressions = [
        "c_custkey = o_custkey",
        "l_orderkey = o_orderkey",
        "l_suppkey = s_suppkey",
        "c_nationkey = s_nationkey",
        "s_nationkey = n_nationkey",
        "n_regionkey = r_regionkey",
        "r_name = 'ASIA'",
        "o_orderdate >= '1994-01-01'",
        "o_orderdate < TIMESTAMPADD(YEAR, 1, DATE('1994-01-01'))"
    ]
)
```

Neither the table names nor the column names of the *join view* need to be
aliased, as all column names are unique across all tables in this example.
However, the derived column expression still requires an alias.  Note also that
columns used in the `expressions` list do not need to appear in the
`column_names` list.

This *join view* could also have been done in two steps, a join and a filter:

```python theme={null}
kinetica.create_join_table(
    join_table_name = "example.tpch_query_5_join",
    table_names = [
        "example.customer",
        "example.orders",
        "example.lineitem",
        "example.supplier",
        "example.nation",
        "example.region"
    ],
    column_names = [
        "n_name",
        "l_extendedprice * (1 - l_discount) as effective_price",
        "r_name",
        "o_orderdate",
        "l_orderkey"
    ],
    expressions = [
        "c_custkey = o_custkey",
        "l_orderkey = o_orderkey",
        "l_suppkey = s_suppkey",
        "c_nationkey = s_nationkey",
        "s_nationkey = n_nationkey",
        "n_regionkey = r_regionkey"
    ]
)
```

The *join view* can then have the `WHERE` clause filters applied in a
separate step:

```python theme={null}
join_view.filter(
     view_name = "example.tpch_query_5_join_filter",
     expression = "r_name = 'ASIA' and " \
                  "o_orderdate >= '1994-01-01' and " \
                  "o_orderdate < TIMESTAMPADD(YEAR, 1, DATE('1994-01-01'))"
)
```

Note that the columns used in the filter's `expression` must appear in the
join's `column_names` list, so that they are available for subsequent filter
calls to use.

Regardless of whether the join/filter was done in one step or two, the results
need to be grouped and summed to match the *TPC-H* query.  The following call is
given the directive to `sort_by` the `value`, which will sort on the
aggregation column:

```python theme={null}
records = join_view.aggregate_group_by(
    column_names = [
        "n_name",
        "SUM(effective_price) as revenue"
    ],
    offset = 0,
    options = {"sort_by":"value", "sort_order":"descending"}
)["data"]
```

The results can be displayed by calling:

```python theme={null}
for record in zip(records["n_name"], records["revenue"]):
    print("{:<13s} {:7.2f}".format(record[0], float(record[1])))
```

## Performance Optimization

In order to improve the performance of a [join](/content/concepts/joins#join) operation, the
columns used to make the relation should be indexed.

When a *join* is performed along a [foreign key](/content/concepts/tables#foreign-key) relation, it
is indexed automatically as a function of the *foreign key* itself--no
additional setup is required.

When a *join* is performed along a relation where no *foreign key* exists, the
indexing must be done as a separate configuration step.  Both columns, on either
side of the *join* relation, should have a [column index](/content/concepts/indexes#column-index)
applied.  An error may result if a *join* is attempted across a relation where
only one side of the relation is indexed.

When a composite *equi-join* is performed, only one index will be used. For
example: you have two tables, `a` and `b`, that each have the same columns,
`x` and `y`. You also have an attribute index on each column. If you use the
expression `a.x == b.x and a.y == b.y`, only the first expression,
`a.x == b.x`, will use the index.

<a id="join-limits" />

## Limitations & Cautions

* *Joins* cannot be used to render geospatial *tracks*; read more about
  geospatial objects [here](/content/location_intelligence/geo_objects).
* While a *node-local join* may include any number of *replicated tables* and
  *sharded* *tables*, all *sharded tables* must be joined by equating their
  respective [shard key](/content/concepts/tables#shard-key) columns.
* A *sharded table* can be left (outer) *joined* to a *replicated table* via
  *node-local join*, but a *replicated table* can only be left (outer) *joined*
  to a *sharded table* via *distributed join*.
* When a data set is cleared (dropped) all the *join views* created from the
  data set are also automatically cleared.
* Records cannot be added to a *join view*, nor can records be updated or
  deleted.
* A *join view* is transient, by default, and will expire after the default
  [TTL](/content/concepts/ttl) setting.
* A *join view* is not persisted, and will not survive a database restart.

<a id="join-memory-implications" />

## Memory Implications

*Joins* have a variable memory footprint depending on the type of *join*.
*Kinetica* will attempt to create the most efficient *join* as possible.
The amount of memory a *join* requires depends on the type of *join* and the
source data sets being *joined* together.

* A *primary key join* occupies the least amount of memory in *Kinetica*. This
  type is created from a *join* between two data sets with a *primary key* -
  *foreign key* relationship, allowing for an index on the lookup between the
  two data sets.
* A *bitmask join* occupies the second least amount of memory in *Kinetica*.
  This type results from a *join* between two data sets with no filtering of
  either table.
* An *equijoin* typically occupies the third least amount of memory in
  *Kinetica*. This type results from a *join* between two data sets containing
  an equality operator, returning only the records with equivalent values
  for the specified columns.
* A *predicate join* can be the most expensive or one of the cheapest
  for your memory; it entirely depends on how sparse the *join* is. A
  *predicate join* comprises a list of indices, or matched pairs of values,
  from the source data sets. This type results from a *join* between two data
  sets that are filtered prior to *joining* them.

Any column expression used in the column list for a *join* will be evaluated
when the *join* is queried, requiring no persistent memory to hold the
calculated values.  While the memory footprint is small, the performance
impact could be large, as the expression will need to be re-evaluated each time
the *join* is queried.

## Operations on Underlying Tables

*Join views* are affected by data modifications on underlying tables in
different ways, based on the type of data modification.

* *inserts* - if a record is added to an underlying *table*, it will not appear
  in queries of the *join view*, whether or not the applied filter would include
  it
* *updates* - if a record that already contributes to a *join view* record set
  is modified in its underlying *table*, it will appear with the modified values
  when the *join view* is re-queried; if a record that does not contribute to a
  *join view* record set is modified in its underlying *table* in a way that
  would make it part of the *join view* record set, it will not appear when the
  *join view* is re-queried
* *deletes* - if a record that already contributes to a *join view* record set
  is deleted from its underlying *table*, it will still appear when the
  *join view* is re-queried

Overall, data manipulation within the *tables* supporting a *join view* will not
cause the number of records returned by the *join view* to change, though
updates to records already part of the *join view* record set will be reflected
in the data values retrieved from the *join view*.
