Version:

Intersect

An intersect is a representation of all unique rows that appear in both of a pair of specified data sets (tables or views). Performing an intersect creates a separate view containing the results. An intersect is performed via the /create/union endpoint using the intersect mode.

You can only intersect two data sets, but the two data sets can be any combination of tables or views as long as the columns across the data sets being used have similar data types; note that you cannot intersect collections. Kinetica will cast compatible data types and create a new type schema. The list of compatible data types can be found here.

Note that if the source tables or views are replicated, the results of the intersect will also be replicated. If the included tables or views are sharded, the resulting view from the intersect will also be sharded; this also means that if a non-sharded table or view is included, the resulting view will also not be sharded.

Limitations on using intersect are discussed in further detail in the Limitations section.

Performing an Intersect

To intersect two data sets, the /create/union endpoint requires five parameters:

  1. the name of the view to be created
  2. the list of the two member data sets to be used in the intersect operation
  3. the list of columns from each of the given data sets to be used in the intersect operation
  4. the list of column names to be output to the resulting view
  5. the intersect mode specified in the options input parameter

Example

In Python, an intersect between given tables lunch_menu and dinner_menu would look like:

gpudb.create_union(
  table_name = "lunch_intersect_dinner",
  table_names = ["lunch", "dinner"],
  input_column_names = [
    ["food_name", "category", "price"],
    ["food_name", "category", "price"]
  ],
  output_column_names = ["food_name", "category", "price"],
  options = {"mode":"intersect"}
)

The results from the above call would contain only the menu items (excluding duplicates) found in the extracted columns from both lunch_menu and dinner_menu. The result would match what would be produced by the SQL

SELECT
  food_name,
  category,
  price
FROM
  lunch_menu

INTERSECT

SELECT
  food_name,
  category,
  price
FROM
  dinner_menu

Note

Because the example includes price, if you had two of the same items that were priced differently for lunch and dinner, the item would appear twice in the resulting intersect view because the rows would not be exact duplicates of each other.

Operating on an Intersect

Examples

A Python example filter on the results produced in the Performing an Intersect section for sandwiches available in both menus:

gpudb.filter(
  table_name = "lunch_intersect_dinner",
  view_name = "sandwiches",
  expression = "category = sandwich"
)

When executed against an intersect, the /filter endpoint produces a view. A chain of these filters could be used to create more and more restrictive views from the original intersect operation.

In Python, to aggregate the average, max, and minimum price across the two menus:

gpudb.aggregate_statistics(
  table_name = "lunch_intersect_dinner",
  column_name = "price",
  stats = "mean, min, max"
)

To retrieve records from the intersect results in Python:

gpudb.get_records(
    table_name = "lunch_intersect_dinner",
    offset = 0,
    limit = 25,
    encoding = "json"
)

Limitations

  • Performing an intersect between two data sets results in an entirely new data set, so be mindful of the memory usage implications
  • All data sets have to be replicated or not replicated, e.g., you cannot intersect replicated and non-replicated data sets
  • If attempting to intersect sharded data sets, all data sets have to be sharded similarly (if all data is not on the same processing node, the intersect can't be calculated properly)
  • The result of an intersect operation does not get updated if source data set(s) are updated
  • The result of an intersect operation has no permanence by default because it is not protected and has a default TTL of 20 minutes
  • The result of an intersect operation is not persisted
  • The result of an intersect operation is technically a view, so rows cannot be added, deleted, or edited; however, if the view is persisted, by specifying the persist option upon creation, the view will become a table and no longer have this restriction
  • The input_column_name parameter vector size needs to match the number of data sets listed, i.e. if you want to intersect a data set to itself, the data set will need to be listed twice in the table_names parameter
  • The input_column_name parameter vectors need to be listed in the same order as their source data sets, e.g., if two data sets are listed in the table_names parameter, the first data set's columns should be listed first in the input_column_name parameter, etc.
  • An intersect cannot be performed using a data set with non-charN string fields