Kinetica supports the concept of set union. A union is performed using the
/create/union endpoint and represents a set or subset of
data from one or more data sets (tables or
views); you can also union a data set to itself. Performing a
union creates a separate in-memory table containing
the results. Unions are in-memory tables by default, but can be persisted
(like a table) using the persist
option. Results will vary
depending on the union mode; the different union modes available are as
follows:
OR
operation, e.g., if filter 1
creates a filtered view using the expression x = 10
and filter 2 creates
a filtered view using the expression x <= 10
, then the merge views
operation creates a new view using the expression x = 10 OR x <= 10
Note
Set intersection and set substraction are also available, but their descriptions and limitations can be found on Intersect and Except, respectively.
You can union any number or combination of tables or views as long as the columns across the data sets being used have similar data types; note that you cannot union collections. Kinetica will cast compatible data types and create a new type schema. The compatible data types are as follows:
int
-- compatible with:long
(converts to long
)float
(converts to float
)double
(converts to double
)decimal
(converts to decimal
)int16
-- compatible with int16
onlyint8
-- compatible with int8
onlylong
-- compatible with:int
(converts to long
)float
(converts to float
)double
(converts to double
)decimal
(converts to decimal
)timestamp
-- compatible with:date
(converts to timestamp
)float
-- compatible with:int
(converts to float
)long
(converts to float
)double
(converts to double
)decimal
(converts to float
)double
-- compatible with:int
(converts to double
)long
(converts to double
)float
(converts to double
)decimal
(converts to double
)charN
-- compatible with other charN
only; converts to the largest
charN
specified, e.g., a union between a char256
and char64
column converts to a char256
columnipv4
compatible with ipv4
onlydecimal
-- compatible with:int
(converts to decimal
)long
(converts to decimal
)float
(converts to float
)double
(converts to double
)date
-- compatible with:timestamp
(converts to timestamp
)datetime
(converts to datetime
)time
-- compatible with time
onlydatetime
-- compatible with:date
(converts to datetime
)wkt
-- compatible with wkt
onlyNote that if the source data sets are replicated, the results of the union will also be replicated. If the included data sets are sharded, the resulting in-memory table from the union will also be sharded; this also means that if a non-sharded data set is included, the resulting in-memory table will also be non-sharded.
Limitations on using union are discussed in further detail in the Limitations and Cautions section.
To union data sets, the /create/union endpoint requires five parameters:
options
input parameterunion_all
(the default option)union
or union_distinct
merge_views
-- Note that if this option is selected, the
input_column_names
AND output_column_names
parameters are ignoredNote
If you do not specify a union mode, union_all
will be used.
In Python, a union all between tables lunch_menu
and
dinner_menu
would look like:
gpudb.create_union(
table_name = "lunch_union_dinner",
table_names = ["lunch_menu", "dinner_menu"],
input_column_names = [
["food_name", "category", "price"],
["food_name", "category", "price"]
],
output_column_names = ["food_name", "category", "price"]
)
The results from the above call would contain all menu items (including
duplicates) in the extracted columns from lunch_menu
and dinner_menu
.
The result would match what would be produced by the SQL:
SELECT
food_name,
category,
price
FROM
lunch_menu
UNION ALL
SELECT
food_name,
category,
price
FROM
dinner_menu
A union (or union distinct) using the same tables can be performed via:
gpudb.create_union(
table_name = "lunch_union_dinner",
table_names = ["lunch_menu", "dinner_menu"],
input_column_names = [
["food_name", "category", "price"],
["food_name", "category", "price"]
],
output_column_names = ["food_name", "category", "price"],
options = {"mode":"union_distinct"}
)
The result of the union distinct call above would contain all menu items
(excluding duplicates) in the extracted columns from lunch_menu
and
dinner_menu
. The result would match what would be produced by the SQL:
SELECT
food_name,
category,
price
FROM
lunch_menu
UNION
SELECT
food_name,
category,
price
FROM
dinner_menu
A merge view can be performed via:
gpudb.create_union(
table_name = "sandwiches_merged_view",
table_names = ["lunch_sandwiches", "dinner_sandwiches"],
options = {"mode":"merge_views"}
)
A Python example filter on the results produced in the Performing a Union section for menu items that are soups is:
gpudb.filter(
table_name = "lunch_union_dinner",
view_name = "menu_soups",
expression = "category = soup"
)
When executed against a union, the /filter endpoint produces a filtered view. A chain of these filters could be used to create more and more restrictive views from the original union operation.
In Python, to aggregate menu items by category in the results produced earlier:
gpudb.aggregate_group_by(
table_name = "lunch_union_dinner",
column_names = [
"soup"
],
offset = 0,
limit = 25,
encoding = "json"
)
To retrieve records from the union results in Python:
gpudb.get_records(
table_name = "lunch_union_dinner",
offset = 0,
limit = 50,
encoding = "json"
)
There are several limitations to creating and using unions:
input_column_name
parameter vector size needs to match the number of
data sets listed, i.e. if you want to union a data set to itself, the
data set will need to be listed twice in the table_names
parameterinput_column_name
parameter vectors need to be listed in the same
order as their source data sets, e.g., if three data sets are listed in the
table_names
parameter, the first data set's columns should be listed first
in the input_column_name
parameter, etc.