Intersect

An intersect is a representation of all rows that appear in both of a pair of specified data sets (tables or views).

Note

An intersect is somewhat analogous to creating a table from a SQL INTERSECT of two tables. See CREATE TABLE ... AS and INTERSECT for details.

An intersect is performed via the /create/union endpoint, using the intersect or intersect_all mode:

Intersect -- all unique rows that exist in both specified data sets
Intersect All -- all rows (including duplicates) that exist in both specified data sets
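
The difference between the two modes can be illustrated in plain Python, outside of Kinetica. This is only a sketch: it assumes standard SQL multiset semantics for Intersect All (a row that appears m times in one data set and n times in the other is kept min(m, n) times), and the sample rows and helper names are hypothetical. Kinetica's exact duplicate handling should be confirmed against its own reference documentation.

```python
from collections import Counter

def intersect(left, right):
    """Unique rows present in both inputs (set semantics)."""
    return sorted(set(left) & set(right))

def intersect_all(left, right):
    """Rows present in both inputs, keeping duplicates.

    Assumption: standard SQL multiset semantics, i.e. each row is kept
    min(count in left, count in right) times.
    """
    common = Counter(left) & Counter(right)
    return sorted(common.elements())

# Hypothetical sample rows: (food_name, price)
lunch = [("soup", 4.0), ("soup", 4.0), ("salad", 6.0)]
dinner = [("soup", 4.0), ("soup", 4.0), ("steak", 20.0)]

print(intersect(lunch, dinner))      # one copy of the shared row
print(intersect_all(lunch, dinner))  # duplicates of the shared row are kept
```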

Note

Set union and set subtraction are also available, and their descriptions and limitations can be found on Union and Except, respectively.

You can only perform an intersect on two data sets, and the corresponding columns of the two must have similar data types. Kinetica will cast compatible data types as depicted here.

Performing an intersect creates a separate memory-only table containing the results. Intersect results can be persisted (like tables) using the persist option.

An intersect result table name must adhere to the standard naming criteria. Each intersect result exists within a schema and follows the standard name resolution rules for tables.

If the source data sets are replicated, the result of the intersect will also be replicated. If the source data sets are sharded, the resulting memory-only table will also be sharded; conversely, if a non-sharded data set is included, the resulting memory-only table will be non-sharded.

Limitations on using intersect are discussed in further detail in the Limitations section.

Performing an Intersect

To perform an intersect of two data sets, the /create/union endpoint requires five parameters:

the name of the memory-only table to be created
the list of the two member data sets to be used in the intersect operation; the result will contain all of the elements from the first data set that are also in the second one
the list of columns from each of the given data sets to be used in the intersect operation
the list of column names to be output to the resulting memory-only table
the intersect mode specified in the options input parameter
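
The five parameters above can be gathered into one place before the call is made. The following is a minimal sketch, not part of the Kinetica API: build_intersect_request is a hypothetical helper, the keyword names mirror those used in the Python example on this page, and passing the persist option as the string "true" is an assumption.

```python
def build_intersect_request(result_table, source_tables, input_columns,
                            output_columns, mode="intersect", persist=False):
    """Assemble keyword arguments for a create_union call (hypothetical helper)."""
    options = {"mode": mode}
    if persist:
        # Assumption: the persist option is passed as the string "true"
        options["persist"] = "true"
    return {
        "table_name": result_table,                  # memory-only table to create
        "table_names": list(source_tables),          # the two member data sets
        "input_column_names": [list(cols) for cols in input_columns],
        "output_column_names": list(output_columns),
        "options": options,                          # carries the intersect mode
    }

request = build_intersect_request(
    "example.lunch_and_dinner_menu",
    ["example.lunch_menu", "example.dinner_menu"],
    [["food_name", "category", "price"], ["food_name", "category", "price"]],
    ["lunch_and_dinner_food_name", "category", "price"],
    mode="intersect_all",
    persist=True,
)

# With a live connection object named kinetica (not created in this sketch):
# kinetica.create_union(**request)
```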

Example

An intersect between the lunch_menu table and the dinner_menu table would look like:

SQL
CREATE OR REPLACE TABLE example.lunch_and_dinner_menu AS
SELECT
    food_name AS lunch_and_dinner_food_name,
    category,
    price
FROM
    example.lunch_menu
INTERSECT
SELECT
    food_name,
    category,
    price
FROM
    example.dinner_menu
Python
kinetica.create_union(
    table_name = "example.lunch_and_dinner_menu",
    table_names = ["example.lunch_menu", "example.dinner_menu"],
    input_column_names = [
        ["food_name", "category", "price"],
        ["food_name", "category", "price"]
    ],
    output_column_names = ["lunch_and_dinner_food_name", "category", "price"],
    options = {"mode": "intersect"}
)

The results from the above call would contain only the menu items (excluding duplicates) found in the extracted columns from both lunch_menu and dinner_menu.

Note

Because the example includes price, and all selected columns must match between the two data sets for an item to be included, a lunch item that is priced differently as a dinner item would not appear in the result set.
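
The effect of including price can be sketched in plain Python with made-up rows. The menu contents below are hypothetical and do not come from the actual lunch_menu or dinner_menu tables; the point is only that a row must match on every selected column to survive the intersect.

```python
# Hypothetical rows: (food_name, category, price)
lunch_menu = [
    ("burger", "entree", 9.50),
    ("fries", "side", 3.00),
    ("lemonade", "drink", 2.50),
]
dinner_menu = [
    ("burger", "entree", 12.00),  # same item, but priced higher at dinner
    ("fries", "side", 3.00),
    ("lemonade", "drink", 2.50),
]

# Intersect on all three columns: the burger rows differ in price, so they drop out
on_both = sorted(set(lunch_menu) & set(dinner_menu))

# Intersect on food_name alone: the burger is kept
names_only = sorted({row[0] for row in lunch_menu} & {row[0] for row in dinner_menu})

print(on_both)
print(names_only)
```

Leaving price out of the selected columns is one way to treat differently priced items as the same item.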

Retrieving Intersect Data

To retrieve records from the intersect results:

SQL
SELECT lunch_and_dinner_food_name, category, price
FROM example.lunch_and_dinner_menu
ORDER BY lunch_and_dinner_food_name
Python
gpudb.GPUdbTable(name = "example.lunch_and_dinner_menu", db = kinetica).get_records_by_column(
    ["lunch_and_dinner_food_name", "category", "price"],
    options = {"order_by": "lunch_and_dinner_food_name"},
    print_data = True
)

Limitations

Performing an intersect between two data sets results in an entirely new data set, so be mindful of the memory usage implications.
Either all data sets must be replicated or none of them may be; e.g., you cannot intersect a replicated data set with a non-replicated one.
If attempting to intersect sharded data sets, all data sets have to be sharded similarly (if all data is not on the same processing node, the intersect can't be calculated properly).
The result of an intersect operation does not get updated if source data set(s) are updated.
The input_column_names parameter vector size needs to match the number of data sets listed; e.g., if you want to intersect a data set with itself, the data set will need to be listed twice in the table_names parameter.
The input_column_names parameter vectors need to be listed in the same order as their source data sets; e.g., if two data sets are listed in the table_names parameter, the first data set's columns should be listed first in the input_column_names parameter, etc.
The result of an intersect is transient, by default, and will expire after the default TTL setting.
The result of an intersect is not persisted, by default, and will not survive a database restart; specifying a persist option of true will make the table permanent and not expire.
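
Some of the structural limitations above can be checked on the client before the request is sent. This is a minimal sketch with a hypothetical helper, check_intersect_inputs, which is not part of the Kinetica API. The requirement that both data sets contribute the same number of columns is an assumption inferred from the column-matching rule, and replication, sharding, and data type compatibility are not checked here.

```python
def check_intersect_inputs(table_names, input_column_names):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(table_names) != 2:
        problems.append("an intersect takes exactly two data sets")
    if len(input_column_names) != len(table_names):
        problems.append("input_column_names needs one column list per data set")
    elif len({len(cols) for cols in input_column_names}) > 1:
        # Assumption: both data sets must contribute the same number of columns
        problems.append("each data set must contribute the same number of columns")
    return problems

# Intersecting a data set with itself: it is listed twice, with two column lists
self_intersect = check_intersect_inputs(
    ["example.lunch_menu", "example.lunch_menu"],
    [["food_name", "price"], ["food_name", "price"]],
)

# Only one column list supplied for two data sets
missing_columns = check_intersect_inputs(
    ["example.lunch_menu", "example.dinner_menu"],
    [["food_name", "price"]],
)

print(self_intersect)
print(missing_columns)
```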