Version:

Aggregation

Kinetica is able to aggregate data via the native APIs and SQL using various aggregate expressions. The native API endpoints related to aggregation:

To read more about aggregation in SQL, see Aggregation.

Execution

In SQL, to find the minimum, maximum, & average trip distances, as well as the average passenger count for each vendor per year from the taxi data set (weeding out data with errant trip distances):

SELECT
    vendor_id,
    YEAR(pickup_datetime) AS Year,
    MAX(trip_distance) max_trip,
    MIN(trip_distance) min_trip,
    AVG(trip_distance) avg_trip,
    AVG(passenger_count) avg_passenger_count
FROM nyctaxi
WHERE
    trip_distance > 0 AND
    trip_distance < 100
GROUP BY vendor_id, 2
ORDER BY vendor_id, Year

In Python, create a histogram based on passenger counts from taxi data:

resp = h_db.aggregate_histogram(table_name="nyctaxi", column_name="passenger_count", start=1, end=6, interval=1)
print "Passenger count groups by size:"
for histo_group in zip([1, 2, 3, 4, 5, 6], resp['counts']):
    print "\t{}s: {:4.0f}".format(*histo_group)

In Python, unpivot a course_sales table and then create an in-memory table of the results:

gpudb.aggregate_unpivot(
    table_name="course_sales_by_year",
    column_names=["*"]
    variable_column_name="total_amount_sold",
    pivoted_columns=["Advanced Economics", "Programming 101", "American History: World War II"],
    options={"result_table":"course_sales_by_course"}
)

In Java, find aggregate statistics for the trip_distance column from taxi data:

Map<String,Double> stats = gpudb.aggregateStatistics("nyctaxi", "trip_distance", "count,min,max,mean,percentile(50,200)", null).getStats();
System.out.println("Statistics of values in the trip_distance column:");
System.out.printf
(
    "\tCount: %.0f%n\tMin:  %4.2f%n\tMean: %4.2f%n\tMax:  %4.2f%n%n",
    stats.get("count"), stats.get("min"), stats.get("mean"),
    stats.get("max")
);

In the REST API, find the unique Sector values from the stocks table using /aggregate/unique:

{
    "table_name":"stocks",
    "column_name":"Sector",
    "offset":0,
    "limit":-9999,
    "encoding":"json",
    "options":{}
}

Pivot

A pivot operation does not have its own endpoint and is instead achieved through the /aggregate/groupby endpoint options. Details can be found on Pivot.

Unpivot

As mentioned above, unpivot operations have their own endpoint. Details on unpivot operations can be found on Unpivot.

In-Memory Tables and Persistence

The following aggregate endpoints have the ability to create and persist in-memory tables from query results:

To create an in-memory table from any of the above endpoint operations, specify the name of the table in the options map:

"result_table":"my_table_name"

To persist the in-memory table, specify the result_table_persist option as well:

"result_table":"my_table_name",
"result_table_persist":"true"

Sharding and Replication

Aggregate in-memory tables can vary in how they are sharded or replicated depending on the endpoint they are created from.

  • /aggregate/groupby: The aggregate groupby memory-only table will be sharded if the entire shard key is included in the column_names parameter and all result records are selected (offset is 0 and limit is -9999). The aggregate groupby memory-only table can be replicated if the result_table_force_replicated option is set to true, regardless if the source table or view is sharded or not.
  • /aggregate/unique: An aggregate unique in-memory table is replicated by default, but it can be sharded if the shard key is included in the column_names parameter. If the result_table_force_replicated option is set to true, the aggregate unique in-memory table will be replicated regardless if the source table or view is sharded or not.
  • /aggregate/unpivot: The aggregate unpivot in-memory table will be sharded if the shard key is not part of the pivoted_columns parameter. The aggregate unpivot in-memory table will be replicated if the source table or view is replicated.

Limitations

General

  • Store-only columns cannot be grouped on, nor can aggregation functions be applied to them
  • Non-charN string column types can be grouped on, but can only have COUNT & COUNT_DISTINCT aggregation functions applied to them
  • In the native APIs, an alias must be used for the column name if using column or aggregate functions in the relevant selected columns parameter, e.g., count(column_name) as count_col_records

Aggregate In-Memory Tables

  • Creating an in-memory table results in an entirely new data set, so be mindful of the memory usage implications
  • An aggregate in-memory table cannot be created with a non-charN string column type
  • If an aggregate in-memory table is created and its source data set's rows are updated (or removed), the in-memory table will not be updated to reflect the changes in the source data set
  • An aggregate in-memory table cannot be created from a collection
  • An aggregate in-memory table has no permanence by default because it is not protected and will expire after the default TTL setting