Aggregation¶

Execution
- Pivot
- Unpivot
In-Memory Tables and Persistence
- Sharding and Replication
Limitations
- General
- Aggregate In-Memory Tables

Kinetica is able to aggregate data via the native APIs and SQL using various aggregate expressions. The native API endpoints related to aggregation:

To read more about aggregation in SQL, see Aggregation.

Execution¶

In SQL, to find the minimum, maximum, & average trip distances, as well as the average passenger count for each vendor per year from the taxi data set (weeding out data with errant trip distances):

SELECT
    vendor_id,
    YEAR(pickup_datetime) AS Year,
    MAX(trip_distance) max_trip,
    MIN(trip_distance) min_trip,
    AVG(trip_distance) avg_trip,
    AVG(passenger_count) avg_passenger_count
FROM nyctaxi
WHERE
    trip_distance > 0 AND
    trip_distance < 100
GROUP BY vendor_id, 2
ORDER BY vendor_id, Year

In Python, create a histogram based on passenger counts from taxi data:

resp = h_db.aggregate_histogram(table_name="nyctaxi", column_name="passenger_count", start=1, end=6, interval=1)
print "Passenger count groups by size:"
for histo_group in zip([1, 2, 3, 4, 5, 6], resp['counts']):
    print "\t{}s: {:4.0f}".format(*histo_group)

In Python, unpivot a course_sales table and then create an in-memory table of the results:

gpudb.aggregate_unpivot(
    table_name="course_sales_by_year",
    column_names=["*"]
    variable_column_name="total_amount_sold",
    pivoted_columns=["Advanced Economics", "Programming 101", "American History: World War II"],
    options={"result_table":"course_sales_by_course"}
)

In Java, find aggregate statistics for the trip_distance column from taxi data:

Map<String,Double> stats = gpudb.aggregateStatistics("nyctaxi", "trip_distance", "count,min,max,mean,percentile(50,200)", null).getStats();
System.out.println("Statistics of values in the trip_distance column:");
System.out.printf
(
    "\tCount: %.0f%n\tMin:  %4.2f%n\tMean: %4.2f%n\tMax:  %4.2f%n%n",
    stats.get("count"), stats.get("min"), stats.get("mean"),
    stats.get("max")
);

In the REST API, find the unique Sector values from the stocks table using /aggregate/unique:

{
    "table_name":"stocks",
    "column_name":"Sector",
    "offset":0,
    "limit":-9999,
    "encoding":"json",
    "options":{}
}

Pivot¶

A pivot operation does not have its own endpoint and is instead achieved through the /aggregate/groupby endpoint options. Details can be found on Pivot.

Unpivot¶

As mentioned above, unpivot operations have their own endpoint. Details on unpivot operations can be found on Unpivot.

In-Memory Tables and Persistence¶

The following aggregate endpoints have the ability to create and persist in-memory tables from query results:

To create an in-memory table from any of the above endpoint operations, specify the name of the table in the options map:

"result_table":"my_table_name"

To persist the in-memory table, specify the result_table_persist option as well:

"result_table":"my_table_name",
"result_table_persist":"true"

Sharding and Replication¶

Aggregate in-memory tables can vary in how they are sharded or replicated depending on the endpoint they are created from.

/aggregate/groupby: The aggregate groupby memory-only table will be sharded if the entire shard key is included in the column_names parameter and all result records are selected (offset is 0 and limit is -9999). The aggregate groupby memory-only table can be replicated if the result_table_force_replicated option is set to true, regardless if the source table or view is sharded or not.
/aggregate/unique: An aggregate unique in-memory table is replicated by default, but it can be sharded if the shard key is included in the column_names parameter. If the result_table_force_replicated option is set to true, the aggregate unique in-memory table will be replicated regardless if the source table or view is sharded or not.
/aggregate/unpivot: The aggregate unpivot in-memory table will be sharded if the shard key is not part of the pivoted_columns parameter. The aggregate unpivot in-memory table will be replicated if the source table or view is replicated.

Limitations¶

General¶

Store-only columns cannot be grouped on, nor can aggregation functions be applied to them
Non-charN string column types can be grouped on, but can only have COUNT & COUNT_DISTINCT aggregation functions applied to them
In the native APIs, an alias must be used for the column name if using column or aggregate functions in the relevant selected columns parameter, e.g., count(column_name) as count_col_records

Aggregate In-Memory Tables¶

Creating an in-memory table results in an entirely new data set, so be mindful of the memory usage implications
An aggregate in-memory table cannot be created with a non-charN string column type
If an aggregate in-memory table is created and its source data set's rows are updated (or removed), the in-memory table will not be updated to reflect the changes in the source data set
An aggregate in-memory table cannot be created from a collection
An aggregate in-memory table has no permanence by default because it is not protected and will expire after the default TTL setting

Table Of Contents