Version:

Projections

A projection is a concept unique to Kinetica. A projection is created through the /create/projection endpoint and represents a set of columns from and/or column expressions applied to a source data set (table or view). Projections are in-memory tables by default, but can be persisted (like a table) using the persist option. While projections can simply represent a whole-cloth cross-section of an existing table, filtering clauses can be applied before creating the column to create an entirely new data set. Expressions that can be used to define columns or filters can be found here.

A projection has several advantages:

  • Because a projection is a copy of your data, querying the projection will be faster than querying a join. You could create a projection of a join to increase the speed of queries against the data set. This also means derived columns and join lookups are precalculated.
  • A projection can utilize the materialize_on_gpu option to cache the projection on the GPU.
  • A /create/projection call can calculate derived columns natively
  • A projection can be sharded on entirely different columns than its source table or view; a projection can also be sharded even if its source table or view is randomly sharded or replicated
  • A projection can also be replicated even if its source table is sharded

Projections can be created from any table or view and will create a new type schema using the columns and expressions specified in the /create/projection column list from the source schema. Note that if the source data set is replicated the projection will also be replicated, by default. If the source data set is sharded, the projection will also be sharded by default, but only if the entire shard key is included in the /create/projection call and is specified in the same order as it appears in the source data set. The sharded columns do not have to be listed continguously.

For example, a source table employees has four columns (in order), employee_name, salary, employee_id, and has_benefits, and two of those columns, salary and employee_id, comprise the shard key. To create a sharded projection using the employees table, the shard key will need to be listed in the same order as the source table, e.g., column_names = ["salary", "employee_name", "employee_id"].

A projection of an existing table or view (sharded, replicated, or otherwise) can be sharded on entirely different columns using the shard_key option on the /create/projection endpoint, regardless of the column(s) the table/view was previously sharded on (if any). See Shard Keys for more information on shard keys and sharding restrictions.

A projection is also a means to perform window functions. For examples of using window functions, see Examples.

Several limitations are discussed in further detail in the Limitations and Cautions section.

Creating a Projection

To create a projection, the /create/projection endpoint requires three parameters:

  1. the name of the data set to create the projection from
  2. the name of the projection to create
  3. the list of columns and/or column expressions selected from the source data set with optional aliases (TableA.id, TableB.id as b_id, length(column_name) as column_name_length, ...)

Examples

In Python, given source table nyctaxi, a projection of the pickup data for single passengers can be created via:

    h_db.create_projection(
        table_name = "nyctaxi",
        projection_name = "nyctaxi_single_pickups",
        column_names = [
            "vendor_id",
            "pickup_latitude",
            "pickup_longitude",
            "pickup_datetime",
            "trip_distance",
            "total_amount"
        ],
        options = {
            "expression": "passenger_count = 1"
        }
    )

To create a projection in Python that calculates a 1-before and 3-after moving average on the trip distance for single passengers per vendor, using the projection created above as the source:

    h_db.create_projection(
        table_name = "nyctaxi_single_pickups",
        projection_name = "nyctaxi_single_pickups_local_avg_dist",
        column_names = [
            "vendor_id",
            "pickup_datetime",
            "trip_distance",
            "AVG(trip_distance) " \
            "OVER (" \
            "   PARTITION BY vendor_id " \
            "   ORDER BY pickup_datetime " \
            "   ROWS BETWEEN 1 PRECEDING AND 3 FOLLOWING " \
            ") AS local_avg_dist"
        ],
        options = {
            "order_by": "vendor_id, pickup_datetime"
        }
    )

Lastly, to create a projection in Python that calculates the running total revenue per vendor for single passenger trips, using the first projection created above as the source:

    h_db.create_projection(
        table_name = "nyctaxi_single_pickups",
        projection_name = "nyctaxi_single_pickups_running_total_revenue",
        column_names = [
            "vendor_id",
            "pickup_datetime",
            "SUM(total_amount) " \
            "OVER (" \
            "   PARTITION BY vendor_id " \
            "   ORDER BY pickup_datetime " \
            ") AS running_total_revenue"
        ],
        options = {
            "order_by": "vendor_id, pickup_datetime"
        }
    )

Operating on a Projection

All endpoints currently implemented are able to operate against a projection.

Examples

For a Python example filter using the first projection created in the Creating a Projection section, the following would show pickup data for single passengers that traveled more than 10 miles:

    h_db.filter(
        table_name = "nyctaxi_single_pickups",
        view_name = "nyctaxi_single_pickups_over_10_miles",
        expression = "trip_distance > 10"
    )

When filtering against a projection, the /filter endpoint produces a view, which means you cannot add, edit, or delete rows. A chain of these filters can be used to create more and more restrictive projections of the original data set.

In Python, to aggregate per vendor statistics for single passengers, using the same first projection as the source, the following call can be made:

    h_db.aggregate_group_by(
        table_name = "nyctaxi_single_pickups",
        column_names = [
          "vendor_id",
          "MIN(total_amount) AS min_revenue",
          "AVG(total_amount) AS avg_revenue",
          "MAX(total_amount) AS max_revenue"
        ],
        offset = 0,
        limit = -9999,
        options = {"result_table": "nyctaxi_single_pickups_statistics"}
    )

Limitations and Cautions

There are several limitations to creating and using projections:

  • Performing a projection results in an entirely new data set, so be mindful of the memory usage implications
  • A projection cannot be created using a non-charN string column type
  • A projection cannot be created using a store-only column type
  • If a projection is created and its source data set's rows are updated (or removed), the projection will not be updated to reflect the changes in the source data set
  • A projection cannot be created from a collection
  • If using column functions to define projection column names in the column_names parameter, an alias must be used for the column name, e.g., length(column_name) as column_name_length. For more details on column naming restrictions, see Tables
  • A projection has no permanence by default because it is not protected and will expire after the default TTL setting; specifying a persist option of true will make the projection permanent and not expire