Version:

Projections

A projection is a concept unique to Kinetica. A projection is created through the /create/projection endpoint and represents a set of columns from and/or column expressions applied to a source data set (table or view). While projections can simply represent a whole-cloth cross-section of an existing table, filtering clauses can be applied to before creating the column to create an entirely new data set. Expressions that can be used to define columns or filters can be found here.

A projection has several advantages:

  • Because a projection is a copy of your data, querying the projection will be faster than querying a join. You could create a projection of a join to increase the speed of queries against the data set. This also means derived columns and join lookups are precalculated.
  • A projection can utilize the materialize_on_gpu option to cache the projection on the GPU.
  • A /create/projection call can calculate derived columns natively
  • The /create/projection endpoint is the only endpoint that can calculate a moving average
  • A projection can be sharded on entirely different columns than its source table or view; a projection can also be sharded even if its source table or view is unsharded or replicated

Projections can be created from any table or view and will create a new type schema using the columns and expressions specified in the /create/projection column list from the source schema. Note that if the source data set is replicated the projection will also be replicated. If the source data set is sharded, the projection can also be sharded but only if all the sharded columns are listed in the /create/projection call and are specified in the same order as they appear in the source data set. The sharded columns do not have to be listed continguously.

For example, a source table employees has four columns (in order), employee_name, salary, employee_id, and has_benefits, and two of those columns, salary and employee_id, are sharded. To create a sharded projection using the employees table, the sharded columns will need to be listed in the same order as the source table, e.g., column_names = ["salary", "employee_name", "employee_id"].

A projection of an existing table or view (sharded, replicated, or otherwise) can be sharded on entirely different columns using the shard_key option on the /create/projection endpoint, regardless of the column(s) the table/view was previously sharded on (if any). See Shard Keys for more information on shard keys and sharding restrictions.

A projection is also a means to perform a moving average window function, which is similar to an aggregate function except instead of grouping rows into a single output row, the rows will still exist separately. A moving average computes the average over the previous n records and the subsequent n records for each record in the given column. Any records that were not able to contain a complete window are set to null. For more information on using projections to perform a moving average window function, see Examples.

The /create/projection endpoint operates within each processing node individually, so all sorting and moving average calculations are only completed with the data that is in each processing node. This means that sorting and/or moving average calculations should be used on a replicated data set or if the data is sharded such that all the records that need to be used for calculations are on the same processing node.

Several limitations are discussed in further detail in the Limitations and Cautions section.

Creating a Projection

To create a projection, the /create/projection endpoint requires three parameters:

  1. the name of the data set to create the projection from
  2. the name of the projection to create
  3. the list of columns and/or column expressions selected from the source data set with optional aliases (TableA.id, TableB.id as b_id, length(column_name) as column_name_length, ...)

Examples

In Python, given source table employees, a projection can be created via:

response = gpudb.create_projection(
              table_name = "sales",
              projection_name = "sales_id_revenue_region"
              column_names = ["id", "revenue", "region"]
           )

To create a projection in Python that calculates the moving average on the salary column using the previous record (1) and the next three records (3):

response = gpudb.create_projection(
              table_name = "sales",
              projection_name = "sales_moving_avg_revenue_id",
              column_names = ["moving_average(revenue,1,3)", "id"],
              options = {"order_by":"revenue asc"}
)

Operating on a Projection

All endpoints currently implemented are able to operate against a projection.

Examples

A Python example filter on the projection created in the Creating a Projection section for sales in the Northeast region:

response = gpudb.filter(
            table_name = "sales_id_revenue_region",
            view_name = "northeast_sales",
            expression = "region = Northeast"
)

When filtering against a projection, the /filter endpoint produces a view that is also a projection, which means you cannot add, edit, or delete rows. A chain of these filters can be used to create more and more restrictive projections of the original data set.

In Python, to aggregate average, min, and max revenue by region in the projection created earlier:

response = gpudb.aggregate_group_by(
            table_name = "sales_id_revenue_region",
            column_names = [
              "region",
              "avg(revenue) as avg_revenue",
              "min(revenue) as min_revenue",
              "max(revenue) as max_revenue"
            ],
            offset = 0,
            limit = 50,
            encoding = "binary"
)

Limitations and Cautions

There are several limitations to creating and using projections:

  • Performing a projection results in an entirely new data set, so be mindful of the memory usage implications
  • A projection cannot be created using a non-charN string column type
  • A projection cannot be created using a store-only column type
  • The result of a projection operation is technically a view, so rows cannot be added, deleted, or edited; however, if the view is persisted, by specifying the persist option upon creation, the view will become a table and no longer have this restriction
  • If a projection is created and its source data set's rows are updated (or removed), the projection will not be updated to reflect the changes in the source data set
  • A projection cannot be created from a collection
  • If using column functions to define projection column names in the column_names parameter, an alias must be used for the column name, e.g., length(column_name) as column_name_length. For more details on column naming restrictions, see Tables