Quick Start Guide

Get started with Kinetica Workbench

Before You Start

Before using this guide, you will need access to an instance of Kinetica in AWS. If you do not already have one, you can provision an instance from the AWS Marketplace by following the instructions in the provisioning documentation.

Initiate Streaming Data Ingest

First, we need to start a stream of synthetic taxi pickups and drop-offs from a Kafka queue to show Kinetica's ability to build materialized views on top of dynamic data.

Create Kafka Data Source
1
2
3
CREATE or REPLACE DATA SOURCE taxi_streaming_ds
LOCATION = 'KAFKA://quickstart.kinetica.com:9092'
WITH OPTIONS (KAFKA_TOPIC_NAME = 'nyctaxi')
Start Ingesting Data
1
2
3
4
5
6
LOAD DATA INTO taxi_data_streaming
FORMAT JSON
WITH OPTIONS (
    DATA SOURCE = 'taxi_streaming_ds',
    SUBSCRIBE = 'TRUE'
)

Import Data

For this guide, we'll use several files with geospatial data in them. The nyct2010.csv file provides geospatial boundaries for neighborhoods in New York. The taxi_data_historical.parquet file provides a list of taxi pickups and dropoffs in New York.

First, let's create the data source. In this case, it is an S3 bucket containing the data.

Create S3 Data Source
1
2
3
4
5
6
7
CREATE OR REPLACE DATA SOURCE quickstart
LOCATION = 'S3'
WITH OPTIONS (
    ANONYMOUS = 'true',
    BUCKET NAME = 'quickstartpublic',
    REGION = 'us-east-1'
)

Next, let's load the census tract data from S3. This will ingest all of the neighborhood boundaries into a table.

Load Boundary Data from S3
1
2
3
4
5
6
LOAD DATA INTO nyct2010 
FROM FILE PATHS 'nyct2010.csv'
FORMAT TEXT 
WITH OPTIONS (
    DATA SOURCE = 'quickstart'
)

Finally, let's load the historical taxi data from S3. Here we leverage Kinetica's built-in type inferencing, but we will need to correct the vendor_id and payment_type columns to ensure that they use a VARCHAR instead of a string so that we can use those columns in the analytics and Reveal dashboard later.

Load Taxi Data from S3
1
2
3
4
5
6
LOAD DATA INTO taxi_data_historical
FROM FILE PATHS 'taxi_data.parquet'
FORMAT PARQUET
WITH OPTIONS (
    DATA SOURCE = 'quickstart'
)
Correct Vendor ID Column Type
1
2
ALTER TABLE taxi_data_historical
ALTER COLUMN vendor_id VARCHAR(4)
Correct Payment Type Column Type
1
2
ALTER TABLE taxi_data_historical
ALTER COLUMN payment_type VARCHAR(16)

Query Streaming and Historical Data with SQL

Compare today's shortest, longest, average, and total trips against our historical data. For this analytic, we create a materialized view that is refreshed on every query. It blends the latest streaming data with our historical data to produce a combined report with a single query.

Today vs. Historical Comparison
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
CREATE OR REPLACE MATERIALIZED VIEW M1
REFRESH ON QUERY AS
SELECT
    Today_Vendor_Name as Vendor_Name,
    Today_Shortest_Trip,
    Today_Average_Trip,
    Today_Longest_Trip,
    Today_Total_Trips,
    Hist_Shortest_Trip,
    Hist_Average_Trip,
    Hist_Longest_Trip,
    Hist_Total_Trips
FROM
(
    SELECT
        NVL(vendor_id, '<ALL VENDORS>') AS Today_Vendor_Name,
        DECIMAL(MIN(trip_distance)) AS Today_Shortest_Trip,
        DECIMAL(AVG(trip_distance)) AS Today_Average_Trip,
        DECIMAL(MAX(trip_distance)) AS Today_Longest_Trip,
        COUNT(*) AS Today_Total_Trips
    FROM taxi_data_streaming
    GROUP BY vendor_id
) t1
INNER JOIN
(
    SELECT
        NVL(vendor_id, '<ALL VENDORS>') AS Hist_Vendor_Name,
        DECIMAL(MIN(trip_distance)) AS Hist_Shortest_Trip,
        DECIMAL(AVG(trip_distance)) AS Hist_Average_Trip,
        DECIMAL(MAX(trip_distance)) AS Hist_Longest_Trip,
        COUNT(*) AS Hist_Total_Trips
    FROM taxi_data_historical
    GROUP BY vendor_id
) t2
ON t2.Hist_Vendor_Name = t1.Today_Vendor_Name

Streaming vs. Historical Results

img/streaming-vs-historical.png

Note: your results may vary.

Query Data with SQL

What are the shortest, average, and longest trip lengths for each vendor? In this analytic, we use the MIN, MAX, AVG, and COUNT functions to find these statistics for each vendor.

Trip Statistics
1
2
3
4
5
6
7
8
SELECT
    NVL(vendor_id, '<ALL VENDORS>') AS Vendor_Name,
    DECIMAL(MIN(trip_distance)) AS Shortest_Trip,
    DECIMAL(AVG(trip_distance)) AS Average_Trip,
    DECIMAL(MAX(trip_distance)) AS Longest_Trip,
    COUNT(*) AS Hist_Total_Trips
FROM taxi_data_historical
GROUP BY vendor_id
Results
1
2
3
4
5
6
7
8
9
+---------------+-----------------+----------------+----------------+--------------------+
| Vendor_Name   |   Shortest_Trip |   Average_Trip |   Longest_Trip |   Hist_Total_Trips |
+---------------+-----------------+----------------+----------------+--------------------+
| VTS           |          0.0100 |         2.8620 |        53.7500 |             116597 |
| NYC           |          0.0100 |         3.0071 |        55.8700 |             145395 |
| CMT           |          0.1000 |         2.7427 |        50.0000 |              82452 |
| YCAB          |          0.1000 |         2.8762 |        50.9000 |             130815 |
| DDS           |          0.1000 |         2.9158 |        46.0000 |               3584 |
+---------------+-----------------+----------------+----------------+--------------------+

Average Trip Length by Vendor

img/average-trip-length.png

Location Analytics

Q1: What were the 10 most frequent destination neighborhoods for JFK pickups, and on average, what did it cost to go there?

Joining the NTA table and the taxi data using the STXY_Intersects function, you can group by neighborhood boundary to determine the total number of trips to the neighborhood and average fare for a trip from JFK to that neighborhood.

Frequent Destination Neighborhoods from JFK
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
SELECT TOP 10
    n_dropoff.NTAName AS "Neighborhood",
    COUNT(*) AS "Total_Trips",
    DECIMAL(AVG(fare_amount)) AS "Average_Fare"
FROM
    taxi_data_historical t
    JOIN nyct2010 n_pickup
        ON STXY_Intersects(t.pickup_longitude, t.pickup_latitude, n_pickup.geom) = 1
        AND n_pickup.NTAName = 'Airport'
    JOIN nyct2010 n_dropoff
        ON STXY_Intersects(t.dropoff_longitude, t.dropoff_latitude, n_dropoff.geom) = 1
WHERE pickup_datetime < '2019-01-01'
GROUP BY 1
ORDER BY 2 DESC;
Results
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
+----------------------------------------------+---------------+----------------+
| Neighborhood                                 |   Total_Trips |   Average_Fare |
+----------------------------------------------+---------------+----------------+
| Midtown-Midtown South                        |          3219 |        39.2489 |
| Turtle Bay-East Midtown                      |          1250 |        36.2404 |
| Hudson Yards-Chelsea-Flatiron-Union Square   |           757 |        39.4635 |
| Murray Hill-Kips Bay                         |           675 |        37.3777 |
| Battery Park City-Lower Manhattan            |           629 |        43.5058 |
| Upper West Side                              |           584 |        37.2917 |
| Clinton                                      |           580 |        41.7162 |
| West Village                                 |           493 |        40.4975 |
| Upper East Side-Carnegie Hill                |           459 |        35.0270 |
| SoHo-TriBeCa-Civic Center-Little Italy       |           455 |        41.9072 |
+----------------------------------------------+---------------+----------------+

Total Trips by Neighborhood

img/total-trips.png

Q2: For neighborhoods with more than 500 pickups, how many had more than half of their pickups at night and by what taxi vendor?

You can use neighborhood boundary data and pickup time to calculate in which NTA vendors were picking up passengers at night and display the results in a pivot table containing columns of percentages for all vendors and each individual vendor.

Majority of Night Pickups by Vendor
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
SELECT
    NTANAme AS "Neighborhood",
    all_npp AS "All",
    cmt_npp AS "CMT", nyc_npp AS "NYC", vts_npp AS "VTS", ycab_npp AS "YCAB"
FROM
(
    SELECT
        NTAName,
        IF (GROUPING(vendor_id) = 1, CAST ('ALL' as VARCHAR(4)), vendor_id) AS vendor_name,
        COUNT(*) AS total_pickups,
        DECIMAL(SUM(IF(HOUR(pickup_datetime) BETWEEN 5 AND 19, 0, 1)))
            / COUNT(*) * 100 AS night_pickup_percentage
    FROM
        taxi_data_historical t
        JOIN nyct2010 n
            ON STXY_Intersects(t.pickup_longitude, t.pickup_latitude, n.geom) = 1
    WHERE pickup_datetime < '2019-01-01'
    GROUP BY
        NTAName,
        ROLLUP(vendor_id)
)
PIVOT
(
    MAX(total_pickups) AS tp,
    MAX(night_pickup_percentage) AS npp
    FOR vendor_name IN ('ALL', 'CMT', 'NYC', 'VTS', 'YCAB')
)
WHERE all_tp > 500 AND all_npp > 50
ORDER BY all_npp DESC;
Results
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
+-------------------------+-----------+-----------+-----------+-----------+-----------+
| Neighborhood            |       All |       CMT |       NYC |       VTS |      YCAB |
+-------------------------+-----------+-----------+-----------+-----------+-----------+
| East Williamsburg       |   83.1400 |   84.2100 |   81.2500 |   85.2600 |   83.2300 |
| North Side-South Side   |   78.6400 |   80.7500 |   78.7300 |   82.0000 |   74.6600 |
| Park Slope-Gowanus      |   66.7700 |   72.0000 |   67.6800 |   71.2700 |   58.9600 |
| Chinatown               |   64.2100 |   59.1700 |   66.9800 |   64.1300 |   64.7400 |
| Fort Greene             |   60.6200 |   56.2500 |   63.2500 |   62.8000 |   58.1200 |
| East Village            |   56.2800 |   57.6000 |   56.1700 |   55.1900 |   56.7700 |
| Lower East Side         |   52.6500 |   53.7100 |   54.1300 |   50.3300 |   52.9400 |
+-------------------------+-----------+-----------+-----------+-----------+-----------+

Q3: How many pickups per hour were there at JFK International Airport?

Using the STXY_Intersects function to determine which pickup points were located in the JFK International Airport neighborhood boundary, it's simple to calculate the number of pickups per hour there were at JFK for all cab types.

Pickups per Hour from JFK
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
SELECT
    RPAD(LPAD(CHAR2(HOUR(t.pickup_datetime)), 2, '0'), 5, ':00') AS "Pickup_Hour",
    COUNT(*) AS "Total_Pickups"
FROM
    taxi_data_historical t
    JOIN nyct2010 n
    ON STXY_Intersects(t.pickup_longitude, t.pickup_latitude, n.geom) = 1
WHERE
    NTAName = 'Airport' AND
    pickup_datetime < '2019-01-01'
GROUP BY
    HOUR(t.pickup_datetime)
ORDER BY
    1;
Results
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
+---------------+-----------------+
| Pickup_Hour   |   Total_Pickups |
+---------------+-----------------+
| 00:00:00      |             662 |
| 01:00:00      |             270 |
| 02:00:00      |              91 |
| 03:00:00      |              40 |
| 04:00:00      |              24 |
| 05:00:00      |             180 |
| 06:00:00      |             344 |
| 07:00:00      |             472 |
| 08:00:00      |             617 |
| 09:00:00      |             652 |
| 10:00:00      |             746 |
| 11:00:00      |             816 |
| 12:00:00      |             832 |
| 13:00:00      |             905 |
| 14:00:00      |            1040 |
| 15:00:00      |            1156 |
| 16:00:00      |            1181 |
| 17:00:00      |            1249 |
| 18:00:00      |            1110 |
| 19:00:00      |            1134 |
| 20:00:00      |            1172 |
| 21:00:00      |            1284 |
| 22:00:00      |            1199 |
| 23:00:00      |            1102 |
+---------------+-----------------+

Pickups by Hour

img/pickups-by-hour.png

Visualization

Kinetica's main visualization tools are Reveal and the Web Mapping Service (WMS). For our purposes, let's use the WMS feature.

Kinetica's web map service renders geospatial images that are overlaid on map tools like Mapbox, OpenLayers, and Esri. Simply point your basemap provider to our /wms endpoint with your desired rendering parameters, and Kinetica will generate geospatial imagery on demand. Every pan and zoom will render a new image of your data, so you can visualize streaming updates in real time.

Q4: Given my storefront's location, how can I calculate the number of drop offs that occurred within 150 meters?

Say you own a small business or restaurant in Chelsea that relies on nearby foot traffic for new customers. You can easily visualize the density of taxi drop offs nearby with the help of the GEODIST function.

Number of Dropoffs Near Storefront
1
2
3
4
CREATE OR REPLACE TABLE store_front_dropoffs AS
SELECT *
FROM taxi_data_historical
WHERE GEODIST(-74.00378, 40.743193, dropoff_longitude, dropoff_latitude) < 150;

We used this query to create a table so the results can be seen on a map. To output the dropoff points to a map, we'll use Kinetica's WMS functionality. Let's validate this query.

In Workbench, click the Data tab of the Explorer pane. Under Tables & Views, click the arrow next to ki_home to expand the listing of tables contained within it. Click on store_front_dropoffs and then click on WMS Preview.

This will bring up the Map Configuration pop-up for the map. Move the Blur Radius slider to the right, until it is at 10, and then click Update at the bottom right to view the map.

The output is a heatmap showing the dropoff points within 150 meters of your storefront.

Click Close to close the map pop-up window.

img/storefronts.png

Load a Reveal Dashboard

Load the "NYC Taxi" Reveal dashboard from Kinetica's GitHub account.

LOAD DASHBOARD
1
2
LOAD DASHBOARD "NYC Taxi"
FROM FILE PATH 'https://github.com/kineticadb/examples/raw/master/quickstart/nyctaxi.db';

Log in to Reveal to see the dashboard:

https://<aws.fqdn>/reveal
img/reveal_taxi.gif

Pause Your Kafka Data Stream

Pause Kafka Stream
1
2
ALTER TABLE taxi_data_streaming
PAUSE SUBSCRIPTION taxi_streaming_ds;

Next Steps

Congratulations, you have successfully completed the Quick Start Guide! But, you have only just begun your journey with Kinetica. To learn more about in depth usage, check out these articles: