Union

Kinetica supports the concept of set union, which is a representation of all rows that appear across a group of specified data sets (tables or views). See UNION for the SQL version of this operation.

A union is performed using the /create/union endpoint, using one of the following modes:

Union -- retains all unique rows from the specified data sets
Union Distinct -- alias for Union
Union All -- retains all rows from the specified data sets
Merge Views -- merge two or more filtered views (or filtered views of filtered views) of the same base data set into a new view. The resulting view would match the results of a SQL OR operation, e.g., if filter 1 creates a filtered view using the expression x = 10 and filter 2 creates a filtered view using the expression x <= 10, then the merge views operation creates a new view using the expression x = 10 OR x <= 10

Note

Set intersection and set subtraction are also available, but their descriptions and limitations can be found on Intersect and Except, respectively.

You can union any number or combination of data sets as long as the columns across the data sets being used have similar data types. Kinetica will cast compatible data types as follows:

Type	Compatible With	Result Type
int8	int8	int8
int16	int16	int16
int	int	int
	long	long
	float	float
	double	double
	decimal	decimal
long	int	long
	long	long
	float	float
	double	double
	decimal	decimal
float	int	float
	long
	float
	double	double
	decimal	float
double	int	double
	long
	float
	double
	decimal
decimal	int	decimal
	long	decimal
	float	float
	double	double
	decimal	decimal
date	date	date
	datetime	datetime
	timestamp	timestamp
time	time	time
datetime	date	datetime
datetime	datetime	datetime
timestamp	date	timestamp
timestamp	timestamp	timestamp
charN	charN	charN (whichever N is larger)
wkt	wkt	wkt
ipv4	ipv4	ipv4

Performing a union creates a separate memory-only table containing the results. Unions can be persisted (like tables) using the persist option.

A union name must adhere to the standard naming criteria. Each union exists within a schema and follows the standard name resolution rules for tables.

Note that if the source data sets are replicated, the results of the union will also be replicated. If the included data sets are sharded, the resulting memory-only table from the union will also be sharded; this also means that if a non-sharded data set is included, the resulting memory-only table will also be non-sharded.

Limitations on using union are discussed in further detail in the Limitations and Cautions section.

Performing a Union

To perform a union of data sets, the /create/union endpoint requires five parameters:

the name of the memory-only table to be created
the list of member data sets to be used in the union operation; the result will contain all of the elements from the first data set and all of the elements from the second one
the list of columns from each of the given data sets to be used in the union operation
the list of column names to be output to the resulting memory-only table
the union mode specified in the options input parameter
- union_all (the default option)
- union or union_distinct
- merge_views -- Note that if this option is selected, the input_column_names AND output_column_names parameters are ignored

Note

If you do not specify a union mode, union_all will be used.

Examples

In Python, a union all between the lunch_menu table and the dinner_menu table would look like:

1
2
3
4
5
6
7
8
9
kinetica.create_union(
    table_name = "example.lunch_or_dinner_menu_all",
    table_names = ["example.lunch_menu", "example.dinner_menu"],
    input_column_names = [
        ["food_name", "category", "price"],
        ["food_name", "category", "price"]
    ],
    output_column_names = ["lunch_or_dinner_food_name", "category", "price"]
)

The results from the above call would contain all menu items (including duplicates) in the extracted columns from lunch_menu and dinner_menu.

A union (or union distinct) using the same tables can be performed via:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
kinetica.create_union(
    table_name = "example.lunch_or_dinner_menu_unique",
    table_names = ["example.lunch_menu", "example.dinner_menu"],
    input_column_names = [
        ["food_name", "category", "price"],
        ["food_name", "category", "price"]
    ],
    output_column_names = ["lunch_or_dinner_food_name", "category", "price"],
    options = {"mode": "union"}
)

The result of the union distinct call above would contain all menu items (excluding duplicates) in the extracted columns from lunch_menu and dinner_menu.

If there were a single menu table with two filtered views on it, one filtering for lunch sandwiches and one filtering for dinner sandwiches, a merge views of the two could be performed to arrive at a filtered view of all sandwiches via:

1
2
3
4
5
kinetica.create_union(
    table_name = "example.lunch_merge_dinner_sandwiches",
    table_names = ["example.lunch_sandwiches", "example.dinner_sandwiches"],
    options = {"mode": "merge_views"}
)

Operating on a Union

Examples

The union created in the Performing a Union section can be filtered for lunch or dinner food items that are in the sandwich category:

1
2
3
4
5
kinetica.filter(
    table_name = "example.lunch_or_dinner_menu_unique",
    view_name = "example.lunch_or_dinner_sandwiches_unique",
    expression = "category = 'sandwich'"
)

When executed against a union, the /filter endpoint produces a filtered view. A chain of these filters could be used to create more and more restrictive views from the original union operation.

To retrieve records from the union results in Python:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
product_table = gpudb.GPUdbTable(
    _type = None,
    name = "example.lunch_or_dinner_menu_unique",
    db = kinetica
)
product_table.get_records_by_column(
    [
        "lunch_or_dinner_food_name",
        "category",
        "price"
    ],
    options = {"order_by": "lunch_or_dinner_food_name"},
    print_data = True
)

Limitations and Cautions

There are several limitations to creating and using unions:

Performing a union between two data sets results in an entirely new data set, so be mindful of the memory usage implications.
All data sets have to be replicated or not replicated, e.g., you cannot union replicated and non-replicated data sets.
If attempting to perform a distinct union on sharded data sets, all data sets have to be sharded similarly (if all data is not on the same processing node, the distinct union can't be calculated properly).
The result of a union operation does not get updated if source data set(s) are updated.
The input_column_name parameter vector size needs to match the number of data sets listed, i.e. if you want to union a data set to itself, the data set will need to be listed twice in the table_names parameter.
The input_column_name parameter vectors need to be listed in the same order as their source data sets, e.g., if three data sets are listed in the table_names parameter, the first data set's columns should be listed first in the input_column_name parameter, etc.
A merge views operation must consist of filtered views from the same base data set. This means filtered views of filtered views of the same base data set are allowed, but join views are not.
The result of a union is transient, by default, and will expire after the default TTL setting.
The result of a union is not persisted, by default, and will not survive a database restart; specifying a persist option of true will make the table permanent and not expire.