/aggregate/groupby

URL: http://<db.host>:<db.port>/aggregate/groupby

Calculates unique combinations (groups) of values for the given columns in a given table or view and computes aggregates on each unique combination. This is somewhat analogous to an SQL-style SELECT...GROUP BY.

For aggregation details and examples, see Aggregation. For limitations, see Aggregation Limitations.

Any column(s) can be grouped on, and all column types except unrestricted-length strings may be used for computing applicable aggregates; columns marked as store-only are unable to be used in grouping or aggregation.

The results can be paged via the input parameters offset and limit. For example, to get the 10 groups with the largest counts, the inputs would be: limit=10, options={"sort_order":"descending", "sort_by":"value"}.
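As a rough illustration of paging, the sketch below walks through results page by page using offset, limit, and the has_more_records output parameter. The fetch_page callable is hypothetical: it stands in for whatever code sends the request and decodes the response, and is assumed to return the decoded records plus the has_more_records flag. A fake in-memory version is used here so the loop can be run without a server.

```python
def iter_groups(fetch_page, page_size=10):
    """Yield every group, requesting one page at a time.

    fetch_page(offset, limit) is a hypothetical callable that returns
    (records, has_more_records) for the requested window.
    """
    offset = 0
    while True:
        records, has_more = fetch_page(offset, page_size)
        yield from records
        # Stop when the server reports no further records, or when a page
        # comes back empty (guards against looping forever).
        if not has_more or not records:
            break
        # Advance by the number actually returned, since the server may
        # return fewer records than the requested limit.
        offset += len(records)


def fake_fetch(offset, limit):
    """Stand-in for a real request: serves slices of 25 made-up groups."""
    data = [{"x": i, "count": 100 - i} for i in range(25)]
    page = data[offset:offset + limit]
    return page, offset + len(page) < len(data)


groups = list(iter_groups(fake_fetch, page_size=10))
```

Advancing by the number of records received, rather than by the requested limit, keeps the loop correct if the server caps a page below the requested size.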

Input parameter options can be used to customize the behavior of this call, e.g., filtering or sorting the results.

To group by columns 'x' and 'y' and compute the number of objects within each group, use: column_names=['x','y','count(*)'].

To also compute the sum of 'z' over each group, use: column_names=['x','y','count(*)','sum(z)'].
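The two column_names examples above could be packaged into a request body along these lines. This is a minimal sketch: the helper build_groupby_request and the table name example.points are hypothetical, the field names are taken from the input parameter table in this document, and it is assumed the endpoint accepts them as a single JSON object.

```python
import json


def build_groupby_request(table_name, column_names, offset=0, limit=-9999,
                          encoding="json", options=None):
    """Assemble the documented input parameters into a plain dict."""
    return {
        "table_name": table_name,
        "column_names": list(column_names),
        "offset": offset,          # documented default: 0
        "limit": limit,            # documented default: -9999 (END_OF_SET)
        "encoding": encoding,      # "binary" or "json"
        "options": dict(options or {}),
    }


# Group by x and y, counting records and summing z within each group.
# "example.points" is a made-up table name used only for illustration.
request = build_groupby_request(
    "example.points", ["x", "y", "count(*)", "sum(z)"]
)
body = json.dumps(request)
```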

Available aggregation functions are: count(*), sum, min, max, avg, mean, stddev, stddev_pop, stddev_samp, var, var_pop, var_samp, arg_min, arg_max and count_distinct.

Available grouping functions are Rollup, Cube, and Grouping Sets.

This service also provides support for Pivot operations.

Filtering on aggregates is supported via expressions using aggregation functions supplied to having.
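To make the difference between the two filters concrete, the sketch below sets both the expression option (applied to the table before grouping) and the having option (applied to the aggregated results). The helper with_filters, the table name, and the two filter strings are illustrative assumptions; the exact expression syntax accepted by the server is not specified in this document.

```python
def with_filters(options, expression=None, having=None):
    """Return a copy of options with the pre- and post-aggregation filters set."""
    merged = dict(options)
    if expression is not None:
        merged["expression"] = expression  # filters rows before grouping
    if having is not None:
        merged["having"] = having          # filters groups after aggregation
    return merged


request = {
    "table_name": "example.points",  # hypothetical table name
    "column_names": ["x", "y", "count(*)", "sum(z)"],
    "offset": 0,
    "limit": -9999,
    "encoding": "json",
    # Assumed syntax: keep rows with z > 0, then keep groups whose sum exceeds 100.
    "options": with_filters({}, expression="z > 0", having="sum(z) > 100"),
}
```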

The response is returned as a dynamic schema. For details see: dynamic schemas documentation.

If a result_table name is specified in the input parameter options, the results are stored in a new table with that name, and no results are returned in the response. Both the table name and resulting column names must adhere to standard naming conventions; column/aggregation expressions will need to be aliased. If the source table's shard key is used as the grouping column(s) and all result records are selected (input parameter offset is 0 and input parameter limit is -9999), the result table will be sharded; in all other cases it will be replicated. Sorting will function properly only if the result table is replicated or if there is only one processing node, and should not be relied upon in other cases. The result_table option is not available when any of the values of input parameter column_names is an unrestricted-length string.
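The sharding rule in the paragraph above can be restated as a small predicate, shown next to a request that uses aliased columns with result_table. This is only a sketch of the documented rule, not a server call: the function expected_distribution, the table names, and the aliases are hypothetical, and reading "used as the grouping column(s)" as an exact match with the shard key is an interpretation.

```python
END_OF_SET = -9999


def expected_distribution(group_columns, source_shard_key,
                          offset=0, limit=END_OF_SET):
    """Restate the documented rule for how the result table is distributed."""
    groups_on_shard_key = (
        len(source_shard_key) > 0
        and set(group_columns) == set(source_shard_key)
    )
    selects_everything = offset == 0 and limit == END_OF_SET
    if groups_on_shard_key and selects_everything:
        return "sharded"
    return "replicated"


request = {
    "table_name": "example.points",  # hypothetical source table
    # Every grouping column and aggregate carries an alias, as required
    # when storing results in a table.
    "column_names": ["x as x_key", "count(*) as n", "sum(z) as sum_z"],
    "offset": 0,
    "limit": END_OF_SET,
    "encoding": "binary",
    "options": {"result_table": "example.points_by_x"},  # hypothetical name
}
```

Under this reading, expected_distribution(["x"], ["x"]) gives "sharded", while adding a limit of 10 or grouping on columns other than the shard key gives "replicated".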

Input Parameter Description

Name

Type

Description

table_name

string

Name of an existing table or view on which the operation will be performed, in [schema_name.]table_name format, using standard name resolution rules.

column_names

array of strings

List of one or more column names, expressions, and aggregate expressions.

offset

long

A non-negative integer indicating the number of initial results to skip (this can be useful for paging through the results). The default value is 0. The minimum allowed value is 0. The maximum allowed value is MAX_INT.

limit

long

A positive integer indicating the maximum number of results to be returned, or END_OF_SET (-9999) to indicate that the maximum number of results allowed by the server should be returned. The number of records returned will never exceed the server's own limit, defined by the max_get_records_size parameter in the server configuration. Use output parameter has_more_records to see if more records exist in the result to be fetched, and input parameter offset & input parameter limit to request subsequent pages of results. The default value is -9999.

encoding

string

Specifies the encoding for returned records. The default value is binary.

Supported Values	Description
binary	Indicates that the returned records should be binary encoded.
json	Indicates that the returned records should be json encoded.

options

map of string to strings

Optional parameters. The default value is an empty map ( {} ).

Supported Parameters (keys)

Parameter Description

create_temp_table

If true, a unique temporary table name will be generated in the sys_temp schema and used in place of result_table. If result_table_persist is false (or unspecified), then this is always allowed even if the caller does not have permission to create tables. The generated name is returned in qualified_result_table_name. The default value is false. The supported values are:

true
false

collection_name

[DEPRECATED--please specify the containing schema as part of result_table and use /create/schema to create the schema if non-existent] Name of a schema which is to contain the table specified in result_table. If the schema provided is non-existent, it will be automatically created.

expression

Filter expression to apply to the table prior to computing the aggregate group by.

pipelined_expression_evaluation

Evaluate the group-by during the last JoinedSet filter plan step. The default value is false. The supported values are:

true
false

having

Filter expression to apply to the aggregated results.

sort_order

[DEPRECATED--use order_by instead] String indicating how the returned values should be sorted - ascending or descending. The default value is ascending.

Supported Values	Description
ascending	Indicates that the returned values should be sorted in ascending order.
descending	Indicates that the returned values should be sorted in descending order.

sort_by

[DEPRECATED--use order_by instead] String determining how the results are sorted. The default value is value.

Supported Values	Description
key	Indicates that the returned values should be sorted by key, which corresponds to the grouping columns. If you have multiple grouping columns (and are sorting by key), it will first sort the first grouping column, then the second grouping column, etc.
value	Indicates that the returned values should be sorted by value, which corresponds to the aggregates. If you have multiple aggregates (and are sorting by value), it will first sort by the first aggregate, then the second aggregate, etc.

order_by

Comma-separated list of the columns to be sorted by as well as the sort direction, e.g., 'timestamp asc, x desc'. The default value is ''.

strategy_definition

The tier strategy for the table and its columns.

compression_codec

The default compression codec for the result table's columns.

result_table

The name of a table used to store the results, in [schema_name.]table_name format, using standard name resolution rules and meeting table naming criteria. Column names (group-by and aggregate fields) need to be given aliases, e.g., ["FChar256 as fchar256", "sum(FDouble) as sfd"]. If present, no results are returned in the response. This option is not available if one of the grouping attributes is an unrestricted string (i.e., not charN) type.

result_table_persist

If true, then the result table specified in result_table will be persisted and will not expire unless a ttl is specified. If false, then the result table will be an in-memory table and will expire unless a ttl is specified otherwise. The default value is false. The supported values are:

true
false

result_table_force_replicated

Force the result table to be replicated (ignores any sharding). Must be used in combination with the result_table option. The default value is false. The supported values are:

true
false

result_table_generate_pk

If true, then set a primary key for the result table. Must be used in combination with the result_table option. The default value is false. The supported values are:

true
false

result_table_generate_soft_pk

If true, then set a soft primary key for the result table. Must be used in combination with the result_table option. The default value is false. The supported values are:

true
false

ttl

Sets the TTL of the table specified in result_table.

chunk_size

Indicates the number of records per chunk to be used for the result table. Must be used in combination with the result_table option.

chunk_column_max_memory

Indicates the target maximum data size for each column in a chunk to be used for the result table. Must be used in combination with the result_table option.

chunk_max_memory

Indicates the target maximum data size for all columns in a chunk to be used for the result table. Must be used in combination with the result_table option.

create_indexes

Comma-separated list of columns on which to create indexes on the result table. Must be used in combination with the result_table option.

view_id

ID of view of which the result table will be a member. The default value is ''.

pivot

The pivot column.

pivot_values

The value list provided will become the column headers in the output. These should be values from the column specified in pivot.

grouping_sets

Customize the grouping attribute sets to compute the aggregates. These sets can include ROLLUP or CUBE operators. The attribute sets should be enclosed in parentheses and can include composite attributes. All attributes specified in the grouping sets must be present in the group-by attributes.

rollup

This option is used to specify the multilevel aggregates.

cube

This option is used to specify the multidimensional aggregates.

shard_key

Comma-separated list of the columns to be sharded on, e.g., 'column1, column2'. The columns specified must be present in input parameter column_names. If any alias is given for any column name, the alias must be used, rather than the original column name. The default value is ''.
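Since sort_order and sort_by are deprecated in favor of order_by, a caller may want to build the order_by string from structured input. The sketch below produces the comma-separated format shown in the order_by description above ('timestamp asc, x desc'). The helper build_order_by is hypothetical, and it assumes asc and desc are the only direction keywords in use.

```python
def build_order_by(sort_keys):
    """Join (column, direction) pairs into an order_by option string."""
    parts = []
    for column, direction in sort_keys:
        direction = direction.lower()
        if direction not in ("asc", "desc"):
            raise ValueError(f"unsupported sort direction: {direction!r}")
        parts.append(f"{column} {direction}")
    return ", ".join(parts)


# Reproduces the example string from the order_by description.
options = {"order_by": build_order_by([("timestamp", "asc"), ("x", "desc")])}
```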

Output Parameter Description

The GPUdb server embeds the endpoint response inside a standard response structure which contains status information and the actual response to the query. Here is a description of the various fields of the wrapper:

Name

Type

Description

status

String

'OK' or 'ERROR'

message

String

Empty if success or an error message

data_type

String

'aggregate_group_by_response' or 'none' in case of an error

data

String

Empty string

data_str

JSON or String

This embedded JSON represents the result of the /aggregate/groupby endpoint:

Name

Type

Description

response_schema_str

string

Avro schema of output parameter binary_encoded_response or output parameter json_encoded_response.

binary_encoded_response

bytes

Avro binary encoded response.

json_encoded_response

string

Avro JSON encoded response.

total_number_of_records

long

Total/Filtered number of records. This may be an over-estimate if a limit was applied and there are additional records (i.e., when output parameter has_more_records is true).

has_more_records

boolean

Indicates that there were too many records to return in one response, so only a partial set was returned.

info

map of string to strings

Additional information. The default value is an empty map ( {} ).

Possible Parameters (keys)	Parameter Description
qualified_result_table_name	The fully qualified name of the table (i.e. including the schema) used to store the results.

Empty string in case of an error.
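A caller handling the wrapper described above might unwrap it roughly as follows. This is a sketch under assumptions: unwrap_response is a hypothetical helper, data_str is assumed to arrive either as a JSON-encoded string or as an already-decoded object, and the sample wrapper is made up for illustration (its encoded response fields are left empty rather than guessed).

```python
import json


def unwrap_response(wrapper):
    """Check the wrapper status and return the embedded endpoint response."""
    if wrapper.get("status") != "OK":
        # On error, message carries the error text and data_str is empty.
        raise RuntimeError(wrapper.get("message", ""))
    data_str = wrapper["data_str"]
    # Decode only if the embedded response is still a JSON string.
    return json.loads(data_str) if isinstance(data_str, str) else data_str


# Made-up wrapper for illustration only; not captured from a real server.
sample = {
    "status": "OK",
    "message": "",
    "data_type": "aggregate_group_by_response",
    "data": "",
    "data_str": json.dumps({
        "response_schema_str": "",
        "binary_encoded_response": "",
        "json_encoded_response": "",
        "total_number_of_records": 3,
        "has_more_records": False,
        "info": {},
    }),
}

result = unwrap_response(sample)
```

Raising on a non-OK status keeps error handling in one place, so downstream code can assume the fields of the embedded response are present.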