Kinetica supports the SQL concept of windows. It can do so through the
/create/projection and
/get/records/bycolumn endpoints. The window
function is specified as a column name in the column_name
parameter.
A window projection can be created from any table or view. If the source data set is replicated, the resulting window projection will also be replicated. The resulting window projection will be sharded in two cases:
A window in Kinetica is constructed like so:
<window function> OVER (<window specification>) AS <alias>
A window function either be of aggregate or ranking type.
Function | Description |
---|---|
AVG(expr) |
Calculates the average of the given expression expr over the specified window frame |
COUNT(expr) |
Calculates the count of the given expression expr over the specified window frame |
MAX(expr) |
Calculates the maximum value of the given expression expr over the specified window frame |
MEAN(expr) |
Synonymous with AVG() . Calculates the average of the given expression expr over the specified
window frame |
MIN(expr) |
Calculates the minimum value of the given expression expr over the specified window frame |
SUM(expr) |
Calculates the sum of the given expression expr over the specified window frame |
Function | Description |
---|---|
CUME_DIST() |
The relative position of the current row within the cumulative distribution of the selected
partition, expressed as a percentage from
This function is shorthand for using the DOUBLE
(
COUNT(*) OVER
(
PARTITION BY <partition_column>
ORDER BY <sort_column>
)
) /
(
COUNT(*) OVER
(
PARTITION BY <partition_column>
ORDER BY <sort_column>
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
)
|
DENSE_RANK() |
Number of the current row within the selected partition except rows with identical values evaluate to different ranks. Starts at 1 |
FIRST_VALUE(<column>) |
The value found in the first row within a frame of the given expression |
LAG(<column>[, <num>]) |
The value of the row before the given expression's value. Provide an additional comma-separated
value to specify which row to select, e.g., LAG(vendor_id, 3) would list the value in the
vendor_id column from three rows prior to the current row |
LAST_VALUE(<column>) |
The value found in the last row within a frame of the given expression |
LEAD(<column>[, <num>]) |
The value of the row after the given expression's value. Provide an additional comma-separated
value to specify which row to select, e.g., LEAD(vendor_id, 3) would list the value in the
vendor_id column from three rows after the current row |
NTILE(<num of groups>) |
The group number of the row after partitioning the rows into num of groups groups. For
example, NTILE(4) will partition data by quartiles and return the associated group number,
1 to 4 . |
PERCENT_RANK() |
The rank of the current row within the selected partition, expressed as a percentage from
This function is shorthand for using the DOUBLE
(
RANK() OVER
(
PARTITION BY <partition_column>
ORDER BY <sort_column>
) - 1
) /
(
COUNT(*) OVER
(
PARTITION BY <partition_column>
ORDER BY <sort_column>
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) - 1
)
|
RANK() |
Number of the current row within the selected partition. However, rows with identical values evaluate to the same rank. Starts at 1 |
ROW_NUMBER() |
Number of the current row within the selected partition. Starts at 1 |
A window specification is constructed like so:
PARTITION BY <column> [ORDER BY <ordering expression list> [<frame clause>]]
The window specification defines partition, frames and sort specifications. A few things to note:
PARTITION BY
clauses can contain several comma-separated keys that define
the partitions, e.g., PARTITION BY vendor_id, payment_type
.PARTITION BY null
(or other constant value) will create a window
spanning the entire data set, effectively windowing without a partition.RANGE
frames, the ORDER BY
can have a
single column, which the frame will use for bounding. For ROWS
frames,
any number of columns can be included in the ORDER BY
; e.g.,
ORDER BY vendor_id, payment_type
. Sort order can be specified using
ASC
(ascending, the default) and DESC
(descending) keywords. The
default null ordering for an ascending sort is NULLS FIRST
; the default
for a descending sort is NULLS LAST
.FIRST_VALUE()
& LAST_VALUE()
require
an ORDER BY
clause.A frame clause within a window specification is constructed like so:
<RANGE | ROWS> BETWEEN <frame start> AND <frame end>
Frames are defined by either ranges of values (RANGE
) or number of rows
(ROWS
):
RANGE
specifies that containment within a frame should be determined by the
value in the column specified in the ORDER BY
clause. This has several
implications:
ORDER BY
column) are
always included in a range frame; for any given row, an aggregate function
applied to its frame will have the same result for all of its peer rows.BETWEEN 100 PRECEDING AND CURRENT ROW
means a frame will contain
records with ORDER BY
column values greater than or equal to the current
record's column value minus 100 and less than or equal to the current record's
column value.ORDER BY
clause can specify only one column, as this column will be
used to bound the frame by its values.ROWS
specifies that containment within a frame should be determined by the
ordering of the rows, as specified in the ORDER BY
clause. This also has
several implications:
BETWEEN 100 PRECEDING AND CURRENT ROW
means a frame will contain 100
records before the current row as well as the current row.ORDER BY
clause can contain multiple comma-separated columns.A frame has a start and end specification:
UNBOUNDED PRECEDING
: the first row of the partition<number> PRECEDING
: either n rows before the current row (for
rows-based frames), or n values less than the current row's value (for
range-based frames)CURRENT ROW
: either the current row (for rows-based frames), or the
current row and its peer rows (for range-based frames)<number> FOLLOWING
: either n rows after the current row (for
rows-based frames), or n values greater than the current row's value (for
range-based frames)UNBOUNDED FOLLOWING
: the last row of the partition<number> FOLLOWING
: either n rows after the current row (for
rows-based frames), or n values greater than the current row's value (for
range-based frames)CURRENT ROW
: either the current row (for rows-based frames), or the
current row and its peer rows (for range-based frames)<number> PRECEDING
: either n rows before the current row (for
rows-based frames), or n values less than the current row's value (for
range-based frames)Important
The frame start must begin before the end of the frame end.
Frames that extend outside of the available data set will be narrowed to
operate only over the available data. For example, a frame defined as
10 PRECEDING
will, for the 5th record in the data set, be narrowed to
consider just the first four records as the beginning of the frame in any
calculations.
Important
If you use the <number> FOLLOWING
or
<number> PRECEDING
keywords as both the frame start and end
keywords, the first <number>
used must be less than or equal
to the second <number>
Example frame:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
The following call to /get/records/bycolumn via Python would calculate a rolling sum of total amounts collected per vendor for the first two hours after midnight on January 1st, 2015, as well as the number of trips that occurred within 5 minutes of each trip:
records = h_db.get_records_by_column_and_decode(
table_name = "nyctaxi",
column_names = [
"vendor_id",
"STRING(pickup_datetime) AS pickup_datetime",
"passenger_count",
"total_amount",
"SUM(total_amount) OVER" \
" (" \
" PARTITION BY vendor_id" \
" ORDER BY pickup_datetime" \
" ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW" \
" ) AS growing_sum",
"COUNT(*) OVER" \
" (" \
" PARTITION BY vendor_id" \
" ORDER BY LONG(pickup_datetime)" \
" RANGE BETWEEN 300000 PRECEDING AND 300000 FOLLOWING" \
" ) AS trip_demand"
],
offset = 0,
limit = gpudb.GPUdb.END_OF_SET,
options = {
"expression": "pickup_datetime >= DATE('2015-01-01') AND pickup_datetime < DATETIME('2015-01-01 02:00:00')",
"order_by": "vendor_id, pickup_datetime"
}
)["records"]
The following call to /create/projection via Python would calculate a 5-before and 10-after moving average on the trip distance for 4-passenger trips, on a given day, per vendor:
h_db.create_projection(
table_name = "nyctaxi",
projection_name = "nyctaxi_local_avg_dist_by_vendor_4_passengers_20150101",
column_names = [
"vendor_id",
"pickup_datetime",
"trip_distance",
"AVG(trip_distance) OVER" \
" (" \
" PARTITION BY vendor_id " \
" ORDER BY pickup_datetime " \
" ROWS BETWEEN 5 PRECEDING AND 10 FOLLOWING " \
" ) AS local_avg_dist"
],
options = {
"expression": "passenger_count = 4 AND " \
"pickup_datetime >= DATE('2015-01-01') AND " \
"pickup_datetime < DATE('2015-01-02')"
}
)
The following call to /create/projection via Python would rank, by vendor, the total amounts collected from 3-passenger trips on a given day:
h_db.create_projection(
table_name = "nyctaxi",
projection_name = "nyctaxi_ranked_fares_by_vendor_3_passengers_20150111",
column_names = [
"vendor_id",
"pickup_datetime",
"dropoff_datetime",
"total_amount",
"RANK() OVER " \
" (PARTITION BY vendor_id ORDER BY total_amount) AS ranked_total",
"PERCENT_RANK() OVER " \
" (PARTITION BY vendor_id ORDER BY total_amount) * 100 AS percent_ranked_total"
],
options = {
"expression": "passenger_count = 3 AND " \
"pickup_datetime >= DATE('2015-01-11') AND " \
"pickup_datetime < DATETIME('2015-01-12')"
}
)
The following call to /create/projection via Python would compare each 5-passenger trip's tip amount to the lowest, average, & highest tip amounts for the vendor on a given day:
h_db.create_projection(
table_name = "nyctaxi",
projection_name = "nyctaxi_hi_lo_tip_compare_by_vendor_5_passengers_20150417",
column_names = [
"vendor_id",
"pickup_datetime",
"tip_amount",
"tip_amount - " \
" FIRST_VALUE(tip_amount) OVER" \
" (PARTITION BY vendor_id ORDER BY tip_amount NULLS LAST) AS vs_lowest_tip",
"tip_amount - " \
" AVG(tip_amount) OVER" \
" (" \
" PARTITION BY vendor_id" \
" ORDER BY tip_amount" \
" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING" \
" ) AS vs_average_tip",
"tip_amount - " \
" FIRST_VALUE(tip_amount) OVER" \
" (PARTITION BY vendor_id ORDER BY tip_amount DESC) AS vs_highest_tip"
],
options = {
"expression": "passenger_count = 5 AND " \
"pickup_datetime >= DATE('2015-04-17') AND " \
"pickup_datetime < DATE('2015-04-18') AND " \
"tip_amount > 0 AND " \
"trip_distance > 0"
}
)
The following call to /create/projection via Python would calculate the quartiles for total amounts collected per vendor, the result of which could then be aggregated by calling /aggregate/groupby via Python and used to compare the overall average total collected vs. the average total within the interquartile range:
h_db.create_projection(
table_name = "nyctaxi",
projection_name = "nyctaxi_revenue_quartile_by_vendor",
column_names = [
"vendor_id",
"total_amount",
"NTILE(4) OVER (PARTITION BY vendor_id ORDER BY total_amount) AS quartile"
],
options = {
}
)
nyctaxi_rqbv = gpudb.GPUdbTable(name = 'nyctaxi_revenue_quartile_by_vendor', db = h_db)
records = nyctaxi_rqbv.aggregate_group_by(
column_names = [
"vendor_id",
"AVG(total_amount) AS average_total_amount",
"AVG(IF((quartile IN (2,3)), total_amount, null)) AS average_interq_range_total_amount"
],
offset = 0,
limit = -9999,
options = {
"sort_by": "key"
}
)["data"]
Empty window specifications are currently not supported; all window
specifications must include a PARTITION BY
clause. However, partitioning
by a constant (e.g., PARTITION BY null
) will result in a window spanning
the entire data set.
Note
This will force all windowed data to a single node for processing and could consume a great deal of memory on that node.
A ranking type function does not accept frame clauses.
Any ranking function other than FIRST_VALUE()
& LAST VALUE
does not
accept a frame clause.
Any ranking function other than FIRST_VALUE()
& LAST VALUE
requires an
ORDER BY
clause.
Window functions executed via /create/projection will have the corresponding projection limitations.