/aggregate/statistics

URL: https://<aws.fqdn>/<aws.cluster.name>/gpudb-0/aggregate/statistics

Calculates the requested statistics of the given column(s) in a given table.

The available statistics are: count (number of total objects), mean, stdv (standard deviation), variance, skew, kurtosis, sum, min, max, weighted_average, cardinality (unique count), estimated_cardinality, percentile, and percentile_rank.

Estimated cardinality is calculated using the HyperLogLog approximation technique.

Percentiles and percentile ranks are approximate and are calculated using the t-digest algorithm. Each must include its argument: the desired percentile for percentile, or the column value to rank for percentile_rank. To compute multiple percentiles or percentile ranks, specify each one separately, e.g. 'percentile(75.0),percentile(99.0),percentile_rank(1234.56),percentile_rank(-5)'.

A second, comma-separated value can be added to the percentile statistic to set the percentile resolution, e.g., a 50th percentile with a resolution of 200 would be 'percentile(50,200)'.
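The percentile syntax described above can be assembled programmatically. The following is a minimal sketch; the helper names percentile and percentile_rank are hypothetical conveniences introduced here, not part of any client library.

```python
def percentile(p, resolution=None):
    """Format a percentile spec, optionally with a resolution (hypothetical helper)."""
    if resolution is None:
        return f"percentile({p})"
    return f"percentile({p},{resolution})"


def percentile_rank(value):
    """Format a percentile_rank spec for a column value (hypothetical helper)."""
    return f"percentile_rank({value})"


# Each percentile or percentile rank is its own comma-separated entry.
stats = ",".join([
    "count",
    percentile(75.0),
    percentile(99.0),
    percentile(50, 200),       # 50th percentile with a resolution of 200
    percentile_rank(1234.56),
    percentile_rank(-5),
])
print(stats)
```

The resulting string is what would be passed as the stats input parameter.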

The weighted average statistic requires a weight column to be specified in weight_column_name. The weighted average is then defined as the sum of the products of each column_name value and its corresponding weight_column_name value, divided by the sum of the weight_column_name values.
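The weighted average definition above can be checked locally with a few lines of Python. This is only an illustration of the formula on made-up in-memory values; how the server handles nulls or a zero total weight is not stated here, so the zero-weight check below is an assumption of this sketch.

```python
def weighted_average(values, weights):
    """Sum of value*weight products divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        # Assumption of this sketch: refuse to divide by zero.
        raise ValueError("sum of weights is zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Made-up sample data: the column values and their weight column.
values = [10.0, 20.0, 30.0]
weights = [1.0, 1.0, 2.0]
result = weighted_average(values, weights)
print(result)  # (10*1 + 20*1 + 30*2) / (1 + 1 + 2) = 22.5
```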

Additional columns can be used in the calculation of statistics via additional_column_names. Values in these columns will be included in the overall aggregate calculation; individual aggregates will not be calculated per additional column. For instance, requesting the count and mean of input parameter column_name x and additional_column_names y and z, where x holds the numbers 1-10, y holds 11-20, and z holds 21-30, would return the total number of x, y, and z values (30), and the single average value across all x, y, and z values (15.5).
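The pooled behavior in the example above can be reproduced locally. The sketch below simply concatenates the three columns and computes one count and one mean over the combined values; it mirrors the arithmetic of the example, not the server's implementation.

```python
# Column contents from the example: x holds 1-10, y holds 11-20, z holds 21-30.
x = list(range(1, 11))
y = list(range(11, 21))
z = list(range(21, 31))

# All columns are pooled into a single set of values before aggregating.
pooled = x + y + z
count = len(pooled)
mean = sum(pooled) / count

print(count)  # 30
print(mean)   # 15.5
```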

The response includes a list of key/value pairs of each statistic requested and its corresponding value.

Input Parameter Description

Name	Type	Description
table_name	string	Name of the table on which the statistics operation will be performed, in [schema_name.]table_name format, using standard name resolution rules.
column_name	string	Name of the primary column for which the statistics are to be calculated.
stats	string	Comma-separated list of the statistics to calculate, e.g. "sum,mean".

Supported Values	Description
count	Number of objects (independent of the given column(s)).
mean	Arithmetic mean (average), equivalent to sum/count.
stdv	Sample standard deviation (denominator is count-1).
variance	Unbiased sample variance (denominator is count-1).
skew	Skewness (third standardized moment).
kurtosis	Kurtosis (fourth standardized moment).
sum	Sum of all values in the column(s).
min	Minimum value of the column(s).
max	Maximum value of the column(s).
weighted_average	Weighted arithmetic mean (using the option weight_column_name as the weighting column).
cardinality	Number of unique values in the column(s).
estimated_cardinality	Estimate (via hyperloglog technique) of the number of unique values in the column(s).
percentile	Estimate (via t-digest) of the given percentile of the column(s) (percentile(50.0) will be an approximation of the median). Add a second, comma-separated value to calculate percentile resolution, e.g., 'percentile(75,150)'.
percentile_rank	Estimate (via t-digest) of the percentile rank of the given value in the column(s) (if the given value is the median of the column(s), percentile_rank(<median>) will return approximately 50.0).

options	map of string to strings	Optional parameters. The default value is an empty map ({}).

Supported Parameters (keys)	Parameter Description
additional_column_names	A comma-separated list of column names over which statistics can be accumulated along with the primary column. All listed columns and the input parameter column_name must be of the same type. The list must not include the column specified in input parameter column_name, and no column can be listed twice.
weight_column_name	Name of column used as weighting attribute for the weighted average statistic.
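As an illustration of how the input parameters fit together, the sketch below assembles them into a JSON document. It assumes the request body is a JSON object keyed by the parameter names above; the function name build_statistics_request, the table name, and the column names are hypothetical, and sending the request (URL, authentication, HTTP client) is deliberately left out.

```python
import json


def build_statistics_request(table_name, column_name, stats,
                             additional_column_names=None,
                             weight_column_name=None):
    """Assemble the documented input parameters into a dict (hypothetical helper)."""
    options = {}
    if additional_column_names:
        # Documented constraints: the primary column must not be repeated,
        # and no additional column may be listed twice.
        if column_name in additional_column_names:
            raise ValueError("column_name must not appear in additional_column_names")
        if len(set(additional_column_names)) != len(additional_column_names):
            raise ValueError("additional_column_names must not contain duplicates")
        options["additional_column_names"] = ",".join(additional_column_names)
    if "weighted_average" in stats and weight_column_name is None:
        raise ValueError("weighted_average requires weight_column_name")
    if weight_column_name is not None:
        options["weight_column_name"] = weight_column_name
    return {
        "table_name": table_name,
        "column_name": column_name,
        "stats": ",".join(stats),
        "options": options,
    }


# Hypothetical table and column names, for illustration only.
request = build_statistics_request(
    "example_schema.orders",
    "price",
    ["count", "mean", "weighted_average"],
    additional_column_names=["tax"],
    weight_column_name="quantity",
)
print(json.dumps(request, indent=2))
```

Validating the additional_column_names constraints on the client side is a design choice of this sketch; the server remains the authority on what it accepts.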

Output Parameter Description

The GPUdb server embeds the endpoint response inside a standard response structure which contains status information and the actual response to the query. Here is a description of the various fields of the wrapper:

Name	Type	Description
status	String	'OK' or 'ERROR'
message	String	Empty on success; otherwise an error message
data_type	String	'aggregate_statistics_response', or 'none' in case of an error
data	String	Empty string
data_str	JSON or String	This embedded JSON represents the result of the /aggregate/statistics endpoint:

Name	Type	Description
stats	map of string to doubles	(statistic name, double value) pairs of the requested statistics, including the total count by default.
info	map of string to strings	Additional information.

Empty string in case of an error.
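A client typically has to unwrap this structure before using the statistics. The sketch below shows one way to do that, assuming the wrapper has already been decoded into a Python dict; the function name extract_stats is hypothetical and the sample response is hand-written for illustration, not captured from a server.

```python
import json


def extract_stats(wrapper):
    """Return the stats map from a decoded response wrapper (hypothetical helper)."""
    if wrapper.get("status") != "OK":
        # On error, message carries the error text and data_str is empty.
        raise RuntimeError(wrapper.get("message") or "unknown error")
    data_str = wrapper["data_str"]
    # data_str is documented as "JSON or String": decode it only if it is
    # still a JSON-encoded string.
    payload = json.loads(data_str) if isinstance(data_str, str) else data_str
    return payload["stats"]


# Hand-written sample wrapper, for illustration only.
sample = {
    "status": "OK",
    "message": "",
    "data_type": "aggregate_statistics_response",
    "data": "",
    "data_str": json.dumps({"stats": {"count": 30.0, "mean": 15.5}, "info": {}}),
}

stats = extract_stats(sample)
print(stats["count"], stats["mean"])
```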