Note

This documentation is for a prior release of Kinetica. For the latest documentation, click here.

Aggregation

Kinetica is able to aggregate data via the native APIs and SQL using various aggregate expressions. The native API endpoints related to aggregation:

To read more about aggregation in SQL, see Aggregation.

Execution

In SQL, to find the minimum, maximum, & average trip distances, as well as the average passenger count for each vendor per year from the taxi data set (weeding out data with errant trip distances):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
SELECT
    vendor_id AS Vendor_ID,
    YEAR(pickup_datetime) AS Year,
    MAX(trip_distance) AS Max_Trip,
    MIN(trip_distance) AS Min_Trip,
    ROUND(AVG(trip_distance),2) AS Avg_Trip,
    INT(AVG(passenger_count)) AS Avg_Passenger_Count
FROM demo.nyctaxi
WHERE
    trip_distance > 0 AND
    trip_distance < 100
GROUP BY vendor_id, 2
ORDER BY Vendor_ID, Year

In Python, create a histogram based on passenger counts from taxi data:

1
2
3
4
resp = h_db.aggregate_histogram(table_name="nyctaxi", column_name="passenger_count", start=1, end=6, interval=1)
print "Passenger count groups by size:"
for histo_group in zip([1, 2, 3, 4, 5, 6], resp['counts']):
    print "\t{}s: {:4.0f}".format(*histo_group)

In Python, unpivot a course_sales table and then create a memory-only table of the results:

1
2
3
4
5
6
7
gpudb.aggregate_unpivot(
    table_name="course_sales_by_year",
    column_names=["*"]
    variable_column_name="total_amount_sold",
    pivoted_columns=["Advanced Economics", "Programming 101", "American History: World War II"],
    options={"result_table":"course_sales_by_course"}
)

In Java, find aggregate statistics for the trip_distance column from taxi data:

1
2
3
4
5
6
7
8
Map<String,Double> stats = gpudb.aggregateStatistics("nyctaxi", "trip_distance", "count,min,max,mean,percentile(50,200)", null).getStats();
System.out.println("Statistics of values in the trip_distance column:");
System.out.printf
(
    "\tCount: %.0f%n\tMin:  %4.2f%n\tMean: %4.2f%n\tMax:  %4.2f%n%n",
    stats.get("count"), stats.get("min"), stats.get("mean"),
    stats.get("max")
);

In the REST API, find the unique Sector values from the stocks table using /aggregate/unique:

1
2
3
4
5
6
7
8
{
    "table_name":"stocks",
    "column_name":"Sector",
    "offset":0,
    "limit":-9999,
    "encoding":"json",
    "options":{}
}

Pivot

A pivot operation does not have its own endpoint and is instead achieved through the /aggregate/groupby endpoint options. Details can be found on Pivot.

Unpivot

As mentioned above, unpivot operations have their own endpoint. Details on unpivot operations can be found on Unpivot.

Memory-Only Tables and Persistence

The following aggregate endpoints have the ability to create and persist memory-only tables from query results:

To create a memory-only table from any of the above endpoint operations, specify the name of the table in the options map:

1
"result_table":"my_table_name"

To persist the memory-only table, specify the result_table_persist option as well:

1
2
"result_table":"my_table_name",
"result_table_persist":"true"

A memory-only table name must adhere to the standard naming criteria. Each memory-only table exists within a schema and follows the standard name resolution rules for tables.

Sharding and Replication

Aggregate memory-only tables can vary in how they are sharded or replicated depending on the endpoint they are created from.

  • /aggregate/groupby: The aggregate groupby memory-only table will be sharded if the entire shard key is included in the column_names parameter and all result records are selected (offset is 0 and limit is -9999). The aggregate groupby memory-only table can be replicated if the result_table_force_replicated option is set to true, regardless if the source table or view is sharded or not.
  • /aggregate/unique: An aggregate unique memory-only table is replicated by default, but it can be sharded if the shard key is included in the column_names parameter. If the result_table_force_replicated option is set to true, the aggregate unique memory-only table will be replicated regardless if the source table or view is sharded or not.
  • /aggregate/unpivot: The aggregate unpivot memory-only table will be sharded if the shard key is not part of the pivoted_columns parameter. The aggregate unpivot memory-only table will be replicated if the source table or view is replicated.

Limitations

General

  • Store-only columns cannot be grouped on, nor can aggregation functions be applied to them.
  • Non-charN string column types can be grouped on, but can only have COUNT & COUNT_DISTINCT aggregation functions applied to them.
  • In the native APIs, an alias must be used for the column name if using column or aggregate functions in the relevant selected columns parameter, e.g., count(column_name) as count_col_records.

Aggregate Memory-Only Tables

  • Creating a memory-only table results in an entirely new data set, so be mindful of the memory usage implications.
  • An aggregate memory-only table cannot be created with a non-charN string column type.
  • If an aggregate memory-only table is created and its source data set's rows are updated (or removed), the memory-only table will not be updated to reflect the changes in the source data set.
  • An aggregate memory-only table is transient, by default, and will expire after the default TTL setting.
  • An aggregate memory-only table is not persisted, by default, and will not survive a database restart; specifying a result_table_persist option of true will make the table permanent and not expire.