Version:

Window

Kinetica supports the SQL concept of windows. It can do so through the /create/projection and /get/records/bycolumn endpoints. The window function is specified as a column name in the column_name parameter.

A window projection can be created from any table or view. If the source data set is replicated, the resulting window projection will also be replicated. The resulting window projection will be sharded in two cases:

  • If the source data set is sharded and the shard key is included in the projected column list; columns in a multi-column shard key must appear in the projected column list in the same relative order they appear in the shard key, though they do not need to be listed consecutively
  • If the projection call has a shard key specified

Function Details

A window in Kinetica is constructed like so:

<function type> OVER (<window specification>) AS <alias>

Function Type

A window function either be of aggregate or ranking type.

Aggregate

Function Description
AVG(expr) Calculates the average of the given expression expr over the specified window frame
COUNT(expr) Calculates the count of the given expression expr over the specified window frame
MAX(expr) Calculates the maximum value of the given expression expr over the specified window frame
MEAN(expr) Synonymous with AVG(). Calculates the average of the given expression expr over the specified window frame
MIN(expr) Calculates the minimum value of the given expression expr over the specified window frame
SUM(expr) Calculates the sum of the given expression expr over the specified window frame

Ranking

Function Description
CUME_DIST()

The relative position of the current row within the cumulative distribution of the selected partition, expressed as a percentage from 0 (exclusive) to 1 (inclusive). The formula for this calculation is as follows:

(count of <= records within partition) / (partition row count)

This function is shorthand for using the COUNT() function in separate partition statements to arrive at the same result:

DOUBLE
(
    COUNT(*) OVER
    (
        PARTITION BY <partition_column>
        ORDER BY <sort_column>
    )
) /
(
    COUNT(*) OVER
    (
        PARTITION BY <partition_column>
        ORDER BY <sort_column>
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    )
)
DENSE_RANK() Number of the current row within the selected partition except rows with identical values evaluate to different ranks. Starts at 1
FIRST_VALUE(<column>) The value found in the first row within a frame of the given expression
LAG(<column>[, <num>]) The value of the row before the given expression's value. Provide an additional comma-separated value to specify which row to select, e.g., LAG(vendor_id, 3) would list the value in the vendor_id column from three rows prior to the current row
LAST_VALUE(<column>) The value found in the last row within a frame of the given expression
LEAD(<column>[, <num>]) The value of the row after the given expression's value. Provide an additional comma-separated value to specify which row to select, e.g., LEAD(vendor_id, 3) would list the value in the vendor_id column from three rows after the current row
NTILE(<num of groups>) The group number of the row after partitioning the rows into num of groups groups. For example, NTILE(4) will partition data by quartiles and return the associated group number, 1 to 4.
PERCENT_RANK()

The rank of the current row within the selected partition, expressed as a percentage from 0 to 1, inclusive. The formula for this calculation is as follows:

(rank within partition - 1) / (partition row count - 1)

This function is shorthand for using the RANK() & COUNT() functions in separate partition statements to arrive at the same result:

DOUBLE
(
    RANK() OVER
    (
        PARTITION BY <partition_column>
        ORDER BY <sort_column>
    ) - 1
) /
(
    COUNT(*) OVER
    (
        PARTITION BY <partition_column>
        ORDER BY <sort_column>
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) - 1
)
RANK() Number of the current row within the selected partition. However, rows with identical values evaluate to the same rank. Starts at 1
ROW_NUMBER() Number of the current row within the selected partition. Starts at 1

Window Specification

A window specification within a window function is constructed like so:

PARTITION BY <column> [ORDER BY <column> [<frame clause>]]

The window specification defines partition, frames and sort specifications. A few things to note:

  • PARTITION BY clauses can contain several comma-separated keys that define the partitions, e.g., PARTITION BY vendor_id, payment_type
  • PARTITION BY null (or other constant value) will create a window spanning the entire data set, effectively windowing without a partition
  • ORDER BY clauses, while optional, are useful in sorting your individual partitions. You can order by as many values as necessary as long as the values are separated by commas, e.g., ORDER BY vendor_id, payment_type. You can specify the sort order using ASC and DESC keywords (ASC is the default)
  • Frame clauses require an ORDER BY clause
  • A ranking type function does not accept frame clauses

Frame Clauses

A frame clause within a window specification is constructed like so:

ROWS BETWEEN <frame start> AND <frame end>

Frames are organized by rows. ROWS defines a set of rows relative to a given row within the window to include in functions applied to the window for that row. Frame clauses require an ORDER BY clause to be present, which will determine the ordering within the partition, and thus ultimately decide which rows fall within the frame for a particular row.

A frame has a start and end specification:

  • Frame start keywords:
    • UNBOUNDED PRECEDING: the first row of the partition
    • <number> PRECEDING: n rows before the current row
    • CURRENT ROW: the current row
    • <number> FOLLOWING: n rows after the current row
  • Frame end keywords:
    • UNBOUNDED FOLLOWING: the last row of the partition
    • <number> FOLLOWING: n rows after the current row
    • CURRENT ROW: the current row
    • <number> PRECEDING: n rows before the current row

Important

The frame start must begin before the end of the frame end.

Frames that extend outside of the available data set will be narrowed to operate only over the available data. For example, a frame defined as 10 PRECEDING will, for the 5th record in the data set, be narrowed to consider just the first four records as the beginning of the frame in any calculations.

Important

If you use the <number> FOLLOWING or <number> PRECEDING keywords as both the frame start and end keywords, the first <number> used must be less than or equal to the second <number>

Example frame:

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

Examples

The following call to /get/records/bycolumn would calculate a rolling sum of total amounts collected per vendor for the first two hours after midnight on January 1st, 2015:

{
  "table_name":"nyctaxi",
  "column_names":[
    "vendor_id",
    "STRING(pickup_datetime) AS pickup_datetime",
    "passenger_count",
    "total_amount",
    "SUM(total_amount) OVER (PARTITION BY vendor_id ORDER BY pickup_datetime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS growing_sum"
  ],
  "offset":0,
  "limit":-9999,
  "encoding":"json",
  "options":{
    "expression": "pickup_datetime >= DATE('2015-01-01') AND pickup_datetime < DATETIME('2015-01-01 02:00:00')",
    "order_by": "vendor_id, pickup_datetime"
  }
}

The following call to /create/projection would calculate a 1-before and 3-after moving average on the trip distance for single passengers per vendor:

{
  "table_name": "nyctaxi",
  "projection_name": "nyctaxi_single_pickups_local_avg_dist",
  "column_names": [
    "vendor_id",
    "pickup_datetime",
    "trip_distance",
    "AVG(trip_distance) OVER (PARTITION BY vendor_id ORDER BY pickup_datetime ROWS BETWEEN 1 PRECEDING AND 3 FOLLOWING) AS local_avg_dist"
  ],
  "options": {
    "expression": "passenger_count = 1",
    "order_by": "vendor_id, pickup_datetime"
  }
}

The following call to /create/projection would rank, by vendor, the total amounts collected from two-passenger trips on May 11th, 2015:

{
  "table_name":"nyctaxi",
  "projection_name":"ranked_fares_by_vendor_2_passengers_20150511",
  "column_names":[
    "vendor_id",
    "pickup_datetime",
    "dropoff_datetime",
    "total_amount",
    "RANK() OVER (PARTITION BY vendor_id ORDER BY total_amount) AS ranked_total",
    "PERCENT_RANK() OVER (PARTITION BY vendor_id ORDER BY total_amount) AS percent_ranked_total"
  ],
  "options":{
    "expression": "passenger_count = 2 AND pickup_datetime >= DATE('2015-05-11') AND pickup_datetime <= DATE('2015-05-12')"
  }
}

The following call to /create/projection would compare each single passenger trip's tip amount to the lowest, average, & highest tip amounts for the vendor on November 25th, 2010:

{
  "table_name":"nyctaxi",
  "projection_name":"hi_lo_tip_compare_by_vendor_1_passengers_20101125",
  "column_names":[
    "vendor_id",
    "trip_distance",
    "tip_amount AS current_tip",
    "FIRST_VALUE(tip_amount) OVER (PARTITION BY vendor_id ORDER BY tip_amount) AS lowest_tip",
    "AVG(tip_amount) OVER (PARTITION BY vendor_id ORDER BY tip_amount ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS average_tip",
    "FIRST_VALUE(tip_amount) OVER (PARTITION BY vendor_id ORDER BY tip_amount DESC) AS highest_tip"
  ],
  "options":{
    "expression": "passenger_count = 1 AND pickup_datetime >= DATE('2010-11-25') AND pickup_datetime < DATE('2010-11-26') AND tip_amount > 0 AND trip_distance > 0"
  }
}

The following call to /create/projection would calculate the quartiles for total amounts collected per vendor, the result of which could then be aggregated via /aggregate/groupby and used to compare the overall average total collected vs. the average total within the interquartile range:

{
  "table_name":"nyctaxi",
  "projection_name":"vendor_total_quartile",
  "column_names":[
    "vendor_id",
    "total_amount",
    "NTILE(4) OVER (PARTITION BY vendor_id ORDER BY total_amount) AS quartile"
  ],
  "options":{}
}
{
  "table_name":"vendor_total_quartile",
  "column_names":[
    "vendor_id",
    "AVG(total_amount) AS average_total_amount",
    "AVG(IF((quartile IN (2,3)), total_amount, null)) AS average_interq_range_total_amount"
  ],
  "offset": 0,
  "limit": -9999,
  "encoding": "json",
  "options":{}
}

Limitations

  • Empty window specifications are currently not supported; all window specifications must include a PARTITION BY clause. However, partitioning by a constant (e.g., PARTITION BY null) will result in a window spanning the entire data set. Note: this will force all windowed data to a single node for processing and could consume a great deal of memory on that node.
  • A ranking type function does not accept frame clauses
  • Any ranking function other than ROW_NUMBER() requires an ORDER BY clause
  • Window functions executed via /create/projection will have the corresponding projection limitations