Version:

Union

Kinetica supports the concept of set union. A union is performed using the /create/union endpoint and represents a set or subset of data from one or more data sets (tables or views); you can also union a data set to itself. Performing a union creates a separate view containing the results. Results will vary depending on the type of union; the different unions available are as follows:

  • Union All -- retains all rows from the specified data sets
  • Union -- retains all unique rows from the specified data sets (synonymous with Union Distinct)
  • Union Distinct -- retains all unique rows from the specified data sets (synonymous with Union)
  • Merge Views -- merge two or more views (or views of views) of the same base data set into a new view. The resulting view would match the results of a SQL OR operation, e.g., if filter 1 creates a view using the expression x = 10 and filter 2 creates a view using the expression x <= 10, then the merge views operation creates a new view using the expression x = 10 OR x <= 10

Note

Set intersection and set substraction are also available, but their descriptions and limitations can be found on Intersect and Except, respectively.

You can union any number or combination of tables or views as long as the columns across the data sets being used have similar data types; note that you cannot union collections. Kinetica will cast compatible data types and create a new type schema. The compatible data types are as follows:

  • int -- compatible with:
    • long (converts to long)
    • float (converts to float)
    • double (converts to double)
    • decimal (converts to decimal)
  • int16 -- compatible with int16 only
  • int8 -- compatible with int8 only
  • long -- compatible with:
    • int (converts to long)
    • float (converts to float)
    • double (converts to double)
    • decimal (converts to decimal)
  • timestamp -- compatible with:
    • date (converts to timestamp)
  • float-- compatible with:
    • int (converts to float)
    • long (converts to float)
    • double (converts to double)
    • decimal (converts to float)
  • double -- compatible with:
    • int (converts to double)
    • long (converts to double)
    • float (converts to double)
    • decimal (converts to double)
  • charN -- compatible with other charN only; converts to the largest charN specified, e.g., a union between a char256 and char64 column converts to a char256 column
  • ipv4 compatible with ipv4 only
  • date -- compatible with:
    • timestamp (converts to timestamp)
  • decimal -- compatible with:
    • int (converts to decimal)
    • long (converts to decimal)
    • float (converts to float)
    • double (converts to double)
  • time -- compatible with time only

Note that if the source data sets are replicated, the results of the union will also be replicated. If the included data sets are sharded, the resulting view from the union will also be sharded; this also means that if a non-sharded data set is included, the resulting view will also be non-sharded.

Limitations on using union are discussed in further detail in the Limitations and Cautions section.

Performing a Union

To union data sets, the /create/union endpoint requires five parameters:

  1. the name of the view to be created
  2. the list of member data sets to be used in the union operation
  3. the list of columns from each of the given data sets to be used in the union operation
  4. the list of column names to be output to the resulting view
  5. the type of union specified in the options input parameter
    • union_all (the default option)
    • union or union_distinct
    • merge_views -- Note that if this option is selected, the input_column_names AND output_column_names parameters are ignored

Note

If you do not specify a union type, union_all will be used.

Examples

In Python, a union all between tables lunch_menu and dinner_menu would look like:

gpudb.create_union(
  table_name = "lunch_union_dinner",
  table_names = ["lunch_menu", "dinner_menu"],
  input_column_names = [
    ["food_name", "category", "price"],
    ["food_name", "category", "price"]
  ],
  output_column_names = ["food_name", "category", "price"]
)

The results from the above call would contain all menu items (including duplicates) in the extracted columns from lunch_menu and dinner_menu. The result would match what would be produced by the SQL

SELECT
  food_name,
  category,
  price
FROM
  lunch_menu

UNION ALL

SELECT
  food_name,
  category,
  price
FROM
  dinner_menu

A union (or union distinct) using the same tables can be performed via:

gpudb.create_union(
  table_name = "lunch_union_dinner",
  table_names = ["lunch_menu", "dinner_menu"],
  input_column_names = [
    ["food_name", "category", "price"],
    ["food_name", "category", "price"]
  ],
  output_column_names = ["food_name", "category", "price"],
  options = {"mode":"union_distinct"}
)

The result of the union distinct call above would contain all menu items (excluding duplicates) in the extracted columns from lunch_menu and dinner_menu. The result would match what would be produced by the SQL

SELECT
  food_name,
  category,
  price
FROM
  lunch_menu

UNION

SELECT
  food_name,
  category,
  price
FROM
  dinner_menu

A merge view can be performed via:

gpudb.create_union(
  table_name = "sandwiches_merged_view",
  table_names = ["lunch_sandwiches", "dinner_sandwiches"],
  options = {"mode":"merge_views"}
)

Operating on a Union

Examples

A Python example filter on the results produced in the Performing a Union section for menu items that are soups is:

gpudb.filter(
    table_name = "lunch_union_dinner",
    view_name = "menu_soups",
    expression = "category = soup"
)

When executed against a union, the /filter endpoint produces a view. A chain of these filters could be used to create more and more restrictive views from the original union operation.

In Python, to aggregate menu items by category in the results produced earlier:

gpudb.aggregate_group_by(
  table_name = "lunch_union_dinner",
  column_names = [
    "soup"
  ],
  offset = 0,
  limit = 25,
  encoding = "json"
)

To retrieve records from the union results in Python:

gpudb.get_records(
    table_name = "lunch_union_dinner",
    offset = 0,
    limit = 50,
    encoding = "json"
)

Limitations and Cautions

There are several limitations to creating and using unions:

  • Performing a union between two data sets results in an entirely new data set, so be mindful of the memory usage implications
  • All data sets have to be replicated or not replicated, e.g., you cannot union replicated and non-replicated data sets
  • If attempting to union sharded data sets, all data sets have to be sharded similarly (if all data is not on the same processing node, the union can't be calculated properly)
  • The result of a union operation does not get updated if source data set(s) are updated
  • The result of a union operation has no permanence by default because it is not protected and has a default TTL of 20 minutes
  • The result of a union operation is not persisted
  • The result of a union operation is technically a view, so rows cannot be added, deleted, or edited; however, if the view is persisted, by specifying the persist option upon creation, the view will become a table and no longer have this restriction
  • The input_column_name parameter vector size needs to match the number of data sets listed, i.e. if you want to union a data set to itself, the data set will need to be listed twice in the table_names parameter
  • The input_column_name parameter vectors need to be listed in the same order as their source data sets, e.g., if three data sets are listed in the table_names parameter, the first data set's columns should be listed first in the input_column_name parameter, etc.
  • A union cannot be performed using a data set with non-charN string fields
  • A merge views operation must consist of views created using any filter endpoint from the same base data set. This means views of views of the same base data set are allowed, but other types of views are not (e.g., projection, union, join, etc.)