Aggregation

Kinetica is able to aggregate data via the native APIs and SQL using various aggregate expressions. The native API endpoints related to aggregation:

To read more about aggregation in SQL, see Aggregation.

Execution

In SQL, to find the minimum, maximum, & average trip distances, as well as the average passenger count for each vendor per year from the taxi data set (weeding out data with errant trip distances):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
SELECT
    vendor_id AS Vendor_ID,
    YEAR(pickup_datetime) AS Year,
    MAX(trip_distance) AS Max_Trip,
    MIN(trip_distance) AS Min_Trip,
    ROUND(AVG(trip_distance),2) AS Avg_Trip,
    INT(AVG(passenger_count)) AS Avg_Passenger_Count
FROM demo.nyctaxi
WHERE
    trip_distance > 0 AND
    trip_distance < 100
GROUP BY vendor_id, 2
ORDER BY Vendor_ID, Year

In Python, create a histogram based on passenger counts from taxi data:

1
2
3
4
resp = h_db.aggregate_histogram(table_name="nyctaxi", column_name="passenger_count", start=1, end=6, interval=1)
print "Passenger count groups by size:"
for histo_group in zip([1, 2, 3, 4, 5, 6], resp['counts']):
    print "\t{}s: {:4.0f}".format(*histo_group)

In Python, unpivot a course_sales table and then create a memory-only table of the results:

1
2
3
4
5
6
7
gpudb.aggregate_unpivot(
    table_name="course_sales_by_year",
    column_names=["*"]
    variable_column_name="total_amount_sold",
    pivoted_columns=["Advanced Economics", "Programming 101", "American History: World War II"],
    options={"result_table":"course_sales_by_course"}
)

In Java, find aggregate statistics for the trip_distance column from taxi data:

1
2
3
4
5
6
7
8
Map<String,Double> stats = gpudb.aggregateStatistics("nyctaxi", "trip_distance", "count,min,max,mean,percentile(50,200)", null).getStats();
System.out.println("Statistics of values in the trip_distance column:");
System.out.printf
(
    "\tCount: %.0f%n\tMin:  %4.2f%n\tMean: %4.2f%n\tMax:  %4.2f%n%n",
    stats.get("count"), stats.get("min"), stats.get("mean"),
    stats.get("max")
);

In the REST API, find the unique Sector values from the stocks table using /aggregate/unique:

1
2
3
4
5
6
7
8
{
    "table_name":"stocks",
    "column_name":"Sector",
    "offset":0,
    "limit":-9999,
    "encoding":"json",
    "options":{}
}

Pivot

A pivot operation does not have its own endpoint and is instead achieved through the /aggregate/groupby endpoint options. Details can be found on Pivot.

Unpivot

As mentioned above, unpivot operations have their own endpoint. Details on unpivot operations can be found on Unpivot.

Memory-Only Tables and Persistence

The following aggregate endpoints have the ability to create and persist memory-only tables from query results:

To create a memory-only table from any of the above endpoint operations, specify the name of the table in the options map:

1
"result_table":"my_table_name"

To persist the memory-only table, specify the result_table_persist option as well:

1
2
"result_table":"my_table_name",
"result_table_persist":"true"

A memory-only table name must adhere to the standard naming criteria. Each memory-only table exists within a schema and follows the standard name resolution rules for tables.

Sharding and Replication

Aggregate memory-only tables can vary in how they are sharded or replicated depending on the endpoint they are created from.

  • /aggregate/groupby: The aggregate groupby memory-only table will be sharded if the entire shard key is included in the column_names parameter and all result records are selected (offset is 0 and limit is -9999). The aggregate groupby memory-only table can be replicated if the result_table_force_replicated option is set to true, regardless if the source table or view is sharded or not.
  • /aggregate/unique: An aggregate unique memory-only table is replicated by default, but it can be sharded if the shard key is included in the column_names parameter. If the result_table_force_replicated option is set to true, the aggregate unique memory-only table will be replicated regardless if the source table or view is sharded or not.
  • /aggregate/unpivot: The aggregate unpivot memory-only table will be sharded if the shard key is not part of the pivoted_columns parameter. The aggregate unpivot memory-only table will be replicated if the source table or view is replicated.

Limitations

General

  • Store-only columns cannot be grouped on, nor can aggregation functions be applied to them.
  • Non-charN string column types can be grouped on, but can only have COUNT & COUNT_DISTINCT aggregation functions applied to them.
  • In the native APIs, an alias must be used for the column name if using column or aggregate functions in the relevant selected columns parameter, e.g., count(column_name) as count_col_records.

Aggregate Memory-Only Tables

  • Creating a memory-only table results in an entirely new data set, so be mindful of the memory usage implications.
  • An aggregate memory-only table cannot be created with a non-charN string column type.
  • If an aggregate memory-only table is created and its source data set's rows are updated (or removed), the memory-only table will not be updated to reflect the changes in the source data set.
  • An aggregate memory-only table is transient, by default, and will expire after the default TTL setting.
  • An aggregate memory-only table is not persisted, by default, and will not survive a database restart; specifying a result_table_persist option of true will make the table permanent and not expire.