Version:

SQL Developer Manual

The following guide provides step-by-step instructions to get started interacting with Kinetica using SQL. This guide demonstrates some of the most common and useful SQL commands, but a detailed look into the full SQL support is available under SQL Support.

Prerequisites

The ODBC connector needs to be enabled for Kinetica to accept ODBC/JDBC connections. For a description of how to configure various clients for issuing SQL commands, see Connecting.

Creating a Collection

Use the CREATE SCHEMA statement to create a collection (or schema). The collection taxi_info will be created to house the three taxi data tables.

CREATE SCHEMA taxi_info;

Creating a Type and Table

In SQL, a type schema and table are created simultaneously. Tables can be created as replicated using the REPLICATED specification. The OR REPLACE specification can be added to ensure the table is replaced if it already exists in the database.

For this tutorial, three tables will be used:

  • A base table containing a sample of NYC taxi cab ride data
  • A lookup table containing sample cab vendor information
  • A 1-to-1 relation table containing payment information for a subset of the cab-ride transactions

Note

Some of the columns being created are specified as DICT, which will dictionary encode those columns. Read more about dictionary encoding on Dictionary Encoding.

CREATE OR REPLACE REPLICATED TABLE taxi_info.vendor
(
  vendor_id VARCHAR(4) NOT NULL,
  vendor_name VARCHAR(64) NOT NULL,
  phone VARCHAR(10),
  email VARCHAR(64),
  hq_street VARCHAR(64) NOT NULL,
  hq_city VARCHAR(8, DICT) NOT NULL,
  hq_state VARCHAR(2, DICT) NOT NULL,
  hq_zip INT NOT NULL,
  num_emps INT NOT NULL,
  num_cabs INT NOT NULL,
  PRIMARY KEY (vendor_id)
);

CREATE OR REPLACE TABLE taxi_info.payment
(
  payment_id LONG NOT NULL,
  payment_type VARCHAR(16),
  credit_type VARCHAR(16, DICT),
  payment_timestamp TYPE_TIMESTAMP,
  fare_amount DECIMAL(7,2),
  surcharge DECIMAL(7,2),
  mta_tax DECIMAL(5,2),
  tip_amount DECIMAL(7,2),
  tolls_amount DECIMAL(7,2),
  total_amount DECIMAL(7,2),
  PRIMARY KEY (payment_id)
);

CREATE OR REPLACE TABLE taxi_info.taxi_trip_data
(
  transaction_id LONG NOT NULL,
  payment_id LONG(SHARD_KEY) NOT NULL,
  vendor_id VARCHAR(4) NOT NULL,
  pickup_datetime TYPE_TIMESTAMP,
  dropoff_datetime TYPE_TIMESTAMP,
  passenger_count DECIMAL(2),
  trip_distance FLOAT,
  pickup_longitude FLOAT,
  pickup_latitude FLOAT,
  dropoff_longitude FLOAT,
  dropoff_latitude FLOAT,
  PRIMARY KEY (transaction_id, payment_id)
);

Inserting and Retrieving Data

The INSERT INTO ... VALUES clause is used to insert records into a table. At a minimum, all of the NOT NULL columns in the table need to be listed in the INSERT INTO clause. Each record in the VALUES clause must match the number and order of values to insert with the specified column names. If each record in the VALUES clause specifies values for all columns in their natural table ordering (the order in which they appeared in the original CREATE TABLE statement) no column names need to be specified.

A SELECT ... FROM statement will retrieve the records from the given table. Use * to select all columns in the table.

/* Use explicit column name syntax, when the ordering of values for the set of
 * records doesn't match the natural ordering of all of the columns defined
 * within the vendor table; here, num_emps and num_cabs are reversed */

INSERT INTO vendor (vendor_id, vendor_name, phone, email, hq_street, hq_city, hq_state, hq_zip, num_cabs, num_emps)
VALUES
  ('VTS','Vine Taxi Service','9998880001','admin@vtstaxi.com','26 Summit St.','Flushing','NY',11354,450,400),
  ('YCAB','Yes Cab','7895444321',NULL,'97 Edgemont St.','Brooklyn','NY',11223,445,425),
  ('NYC','New York City Cabs',NULL,'support@nyc-taxis.com','9669 East Bayport St.','Bronx','NY',10453,505,500),
  ('DDS','Dependable Driver Service',NULL,NULL,'8554 North Homestead St.','Bronx','NY',10472,200,124),
  ('CMT','Crazy Manhattan Taxi','9778896500','admin@crazymanhattantaxi.com','950 4th Road Suite 78','Brooklyn','NY',11210,500,468),
  ('TNY','Taxi New York',NULL,NULL,'725 Squaw Creek St.','Bronx','NY',10458,315,305),
  ('NYMT','New York Metro Taxi',NULL,NULL,'4 East Jennings St.','Brooklyn','NY',11228,166,150),
  ('5BTC','Five Boroughs Taxi Co.','4566541278','mgmt@5btc.com','9128 Lantern Street','Brooklyn','NY',11229,193,175);

/* Verify the records were inserted successfully */

SELECT *
FROM vendor;

/* Use shorthand syntax, where each record of values matches the natural
 * ordering of all of the columns defined within the vendor table */

INSERT INTO payment
VALUES
  (136,'Cash',NULL,'2015-04-11 01:42:01',4,0.5,0.5,1,0,6.3),
  (148,'Cash',NULL,'2015-04-27 08:49:41',9.5,0,0.5,1,0,11.3),
  (114,'Cash',NULL,'2015-04-05 18:47:53',5.5,0,0.5,1.89,0,8.19),
  (180,'Cash',NULL,'2015-04-13 22:57:03',6.5,0.5,0.5,1,0,8.8),
  (109,'Cash',NULL,'2015-04-13 18:08:33',22.5,0.5,0.5,4.75,0,28.55),
  (132,'Cash',NULL,'2015-04-19 19:46:19',6.5,0.5,0.5,1.55,0,9.35),
  (134,'Cash',NULL,'2015-04-19 19:44:28',33.5,0.5,0.5,0,0,34.8),
  (176,'Cash',NULL,'2015-04-07 10:52:42',9,0.5,0.5,2.06,0,12.36),
  (100,'Cash',NULL,NULL,9,0,0.5,2.9,0,12.7),
  (193,'Cash',NULL,NULL,3.5,1,0.5,1.59,0,6.89),
  (140,'Credit','Visa',NULL,28,0,0.5,0,0,28.8),
  (161,'Credit','Visa',NULL,7,0,0.5,0,0,7.8),
  (199,'Credit','Visa',NULL,6,1,0.5,1,0,8.5),
  (159,'Credit','Visa','2015-04-10 14:01:27',7,0,0.5,0,0,7.8),
  (156,'Credit','MasterCard','2015-04-10 13:32:33',12.5,0.5,0.5,0,0,13.8),
  (198,'Credit','MasterCard','2015-04-19 19:43:56',9,0,0.5,0,0,9.8),
  (107,'Credit','MasterCard','2015-04-11 01:56:17',5,0.5,0.5,0,0,6.3),
  (166,'Credit','American Express','2015-04-12 03:18:43',17.5,0,0.5,0,0,18.3),
  (187,'Credit','American Express','2015-04-10 12:49:41',14,0,0.5,0,0,14.8),
  (125,'Credit','Discover','2015-04-24 10:01:13',8.5,0.5,0.5,0,0,9.8),
  (119,NULL,NULL,'2015-04-30 22:04:31',9.5,0,0.5,0,0,10.3),
  (150,NULL,NULL,'2015-04-30 22:20:47',7.5,0,0.5,0,0,8.3),
  (170,'No Charge',NULL,'2015-04-30 22:05:02',28.6,0,0.5,0,0,28.6),
  (123,'No Charge',NULL,'2015-04-27 12:10:49',20,0.5,0.5,0,0,21.3),
  (181,NULL,NULL,'2015-04-27 11:51:01',6.5,0.5,0.5,0,0,7.8),
  (189,'No Charge',NULL,NULL,6.5,0,0.5,0,0,7);

/* Verify the records were inserted successfully */

SELECT *
FROM payment;

Updating and Deleting Data

The UPDATE ... SET statement is used to update matching key values for all records in a table. A WHERE clause can be used to specify the records to update.

/* Update the e-mail, number of employees, and number of cabs of the DDS
vendor */

UPDATE vendor
SET
  email = 'management@ddstaxico.com',
  num_emps = 156,
  num_cabs = 213
WHERE vendor_id = 'DDS';

The DELETE FROM ... statement is used to delete records from a table. A WHERE clause can be used to specify the records to delete.

/* Delete payment 189 */

DELETE
FROM payment
WHERE payment_id = 189;

CSV Data Ingest

CSV data ingest is possible using an INSERT INTO ... FROM statement, but a table must be created first for the data to be inserted into. Much like a typical INSERT INTO ... VALUES statement, where the VALUES clause is replaced with a SELECT statement. In the SELECT statement, you can insert records into as many columns from the base table as desired (all NOT NULL columns must be included). Use a SELECT * to select all columns.

Important

The file in the FROM FILE."..." clause needs to be owned by gpudb (use the chown gpudb <file-name> command), have the correct permissions enabled, and should be specified with an absolute path.

INSERT INTO taxi_trip_data
(
  transaction_id,
  payment_id,
  vendor_id,
  pickup_datetime,
  dropoff_datetime,
  passenger_count,
  trip_distance,
  pickup_longitude,
  pickup_latitude,
  dropoff_longitude,
  dropoff_latitude
)
SELECT *
FROM FILE."/tmp/data/taxi_trip_data.csv";

Alter Table

Some properties can be altered or added after table creation, including indexes, dictionary encoding, and compression. Use the ALTER TABLE ... statement to specify the table and property you want to alter. ADD INDEX ... will add a column index, while SET COLUMN ... COMPRESSION ... adds the specified compression type to the desired column.

/* Add column indexes to vendor_id in both the vendor and taxi_trip_data
 * tables */

ALTER TABLE vendor
ADD INDEX (vendor_id);

ALTER TABLE taxi_trip_data
ADD INDEX (vendor_id);

/* Apply the snappy compression algorithm to the pickup and dropoff datetime
 * columns */

ALTER TABLE taxi_trip_data
SET COLUMN pickup_datetime COMPRESSION snappy;

ALTER TABLE taxi_trip_data
SET COLUMN dropoff_datetime COMPRESSION snappy;

Filters

A simple filter can be performed using a SELECT ... WHERE statement.

-- Selects all payments with no corresponding payment type

SELECT *
FROM payment
WHERE payment_type IS NULL;

You can reduce the amount of records returned by using a SELECT TOP n clause. This will select the first n records returned by the query. The CASE ... WHEN statement is a logical function that allows you to categorize records based on given filters.

/* Select the top 50 records where the pickup is between April 1st and
April 15th, then order them from longest trip to shortest trip.  A trip
description will also be returned, designating any trip of over six miles as a
"long trip", between three and six miles as a "medium trip", and three or
shorter as a "short trip". */

SELECT TOP 50
  vendor_id,
  pickup_datetime,
  dropoff_datetime,
  passenger_count,
  CASE
    WHEN trip_distance > 6 THEN 'long trip'
    WHEN trip_distance > 3 THEN 'medium trip'
    ELSE 'short trip'
  END AS trip_description
FROM taxi_trip_data
WHERE pickup_datetime BETWEEN '2015-04-01 00:00:00.000' AND '2015-04-16 00:00:00.000'
ORDER BY trip_distance DESC;

Aggregate functions are just a small offering of the available SQL functions. Aggregate functions are paired with a GROUP BY clause that groups the returned records by values in a given column. The HAVING clause filters records after they've been aggregated.

/* Select the longest, shortest, and average trip distance & passenger count for
 * each vendor whose average passenger count is higher than 1.4 */

SELECT
  vendor_id,
  MAX(trip_distance) max_trip,
  MIN(trip_distance) min_trip,
  AVG(trip_distance) avg_trip,
  AVG(passenger_count) avg_passenger_count
FROM taxi_trip_data
GROUP BY vendor_id
HAVING AVG(passenger_count) > 1.4;

Subqueries

Subqueries allow for nested queries within the clauses of a SQL statement. The below example is a simple SELECT ... WHERE statement that includes subqueries in each of its three clauses.

/* Show how tips compare between cash and credit card payments:  retrieve unique
 * paid-by-cash fare & tip combinations and calculate the difference between the
 * cash tip percentage and the average credit tip percentage, across all cash
 * fares that were at least as much as the lowest credit card fare */

SELECT
  fare_amount,
  tip_amount,
  (tip_amount / fare_amount) * 100 -
  (
    SELECT AVG(tip_amount / fare_amount) * 100 as avg_tip_pct
    FROM payment
    WHERE payment_type = 'Credit'
  ) as tip_delta
FROM
(
  SELECT DISTINCT
    fare_amount,
    tip_amount
  FROM payment
  WHERE payment_type = 'Cash'
) credit_fare_tip
WHERE fare_amount >=
(
  SELECT MIN(fare_amount)
  FROM payment
  WHERE payment_type = 'Credit'
);

Common Table Expression / With

The WITH statement can be used to give a subquery an alias for use in a larger query. You can reuse the aliased query as if it were another result set, but only in the query immediately following the WITH statement. The "parameters" to the WITH statement are the aliases that will be given to the result columns returned by the SELECT contained within the WITH; the number of "parameters" and columns in the SELECT clause should match.

/* Retrieve the set of cash payments that fall within the timestamp range of
 * recorded credit payments */

WITH credit_pay_ts_min_max (min_pay_ts, max_pay_ts) AS
(
  SELECT
    MIN(payment_timestamp),
    MAX(payment_timestamp)
  FROM payment
  WHERE payment_type = 'Credit'
)
SELECT
  payment_id,
  payment_timestamp,
  total_amount
FROM payment
WHERE
  payment_type = 'Cash' AND
  payment_timestamp BETWEEN
    (SELECT min_pay_ts FROM credit_pay_ts_min_max) AND
    (SELECT max_pay_ts FROM credit_pay_ts_min_max)
ORDER BY payment_timestamp;

Joins

Joins allow you to link multiple tables together, along their relations, retrieving associated information from any or all of them. Tables can only be joined if they're sharded similarly or replicated.

An inner join returns only records that have matching values in both tables.

/* Retrieve cab ride transactions and the full name of the associated vendor
 * for rides having more than three passengers between April 1st & 16th, 2015 */

SELECT
  transaction_id,
  passenger_count,
  pickup_datetime,
  dropoff_datetime,
  vendor_name
FROM taxi_trip_data t
INNER JOIN vendor v ON t.vendor_id = v.vendor_id
WHERE
  passenger_count > 3 AND
  pickup_datetime BETWEEN '2015-04-01' AND '2015-04-16'
ORDER BY dropoff_datetime;

A left join returns all of the records an inner join does, but additionally, for each record in the table on the left side of the join that has no match along the relation to a record in the table on the right side of the join, a corresponding record will be returned with "left-side" columns populated with the "left-side" record data and the "right-side" columns populated with nulls. Using the KI_HINT_EXPLAIN_JOINS hint will direct the ODBC Server to output how the given join will be processed internally. This output is typically logged to a file located in /opt/gpudb/connectors/odbcserver/logs.

/* Retrieve cab ride transactions and the full name of the associated vendor (if
 * available--blank if vendor name is unknown) for transactions with associated
 * payment data, sorting by increasing values of transaction ID.  Also, output
 * the join execution plan to the ODBC logs. */

SELECT /* KI_HINT_EXPLAIN_JOINS */
  transaction_id,
  pickup_datetime,
  trip_distance,
  t.vendor_id,
  vendor_name
FROM taxi_trip_data t
LEFT JOIN vendor v ON t.vendor_id = v.vendor_id
WHERE payment_id != 0
ORDER BY transaction_id;

Note

Full outer joins require both tables to be replicated. Set merges that perform deduplication of records, like UNION DISTINCT, INTERSECT, and EXCEPT also need to use replicated tables to ensure the correct results, so a replicated version of the taxi (taxi_trip_data_replicated) table is created at this point in the tutorial.

CREATE OR REPLACE REPLICATED TABLE taxi_info.taxi_trip_data_replicated (
  transaction_id LONG NOT NULL,
  payment_id LONG NOT NULL,
  vendor_id VARCHAR(4) NOT NULL,
  pickup_datetime TYPE_TIMESTAMP,
  dropoff_datetime TYPE_TIMESTAMP,
  passenger_count DECIMAL(2),
  trip_distance FLOAT,
  pickup_longitude FLOAT,
  pickup_latitude FLOAT,
  dropoff_longitude FLOAT,
  dropoff_latitude FLOAT
);

INSERT INTO taxi_trip_data_replicated
SELECT *
FROM taxi_trip_data;

A full outer join returns all of the records a left join does, but additionally, for each record in the table on the right side of the join that has no match along the relation to a record in the table on the left side of the join, a corresponding record will be returned with "right-side" columns populated with the "right-side" record data and the "left-side" columns populated with nulls.

/* Retrieve the vendor IDs of known vendors with no recorded cab ride
 * transactions, as well as the vendor ID and number of transactions for unknown
 * vendors with recorded cab ride transactions; also output the join plan */

SELECT /* KI_HINT_EXPLAIN_JOINS */
  v.vendor_id vend_table_vendors,
  t.vendor_id taxi_table_vendors,
  count(*) as total_records
FROM taxi_trip_data_replicated t
FULL OUTER JOIN vendor v ON v.vendor_id = t.vendor_id
WHERE
  v.vendor_id IS NULL OR
  t.vendor_id IS NULL
GROUP BY
  v.vendor_id,
  t.vendor_id;

Create Table As

You can also create a table directly from a query using CREATE TABLE ... AS. These can also be created as temporary tables by applying the TEMP specification. Much like regular table creation, you can also specify OR REPLACE to replace the table if it already exists.

/* Create an in-memory only table containing all payments by credit card */

CREATE OR REPLACE TEMP TABLE credit_payment AS (
  SELECT *
  FROM payment
  WHERE payment_type = 'Credit'
);

/* Verify the table was created successfully */

SELECT *
FROM credit_payment;
/* Create a persisted table with cab ride transactions whose trip started
 * during lunch hours. */

CREATE OR REPLACE TABLE lunch_time_rides AS (
  SELECT
    HOUR(pickup_datetime) hour_of_day,
    vendor_id,
    passenger_count,
    trip_distance
  FROM taxi_trip_data
  WHERE HOUR(pickup_datetime) BETWEEN '11' AND '14'
);

/* Verify the table was created successfully */

SELECT *
FROM lunch_time_rides;

Union, Intersect, and Except

UNION can be used to combine homogeneous data sets into one larger data set. UNION & UNION DISTINCT will both combine data sets but only retain the records that are unique across the chosen columns, removing all duplicates. UNION ALL will combine data sets, retaining all records from the source data sets.

/* Calculate the average number of passengers, as well as the shortest, average,
 * and longest trips for all trips in each of the two time periods--from April
 * 1st through the 15th, 2015 and from April 16th through the 23rd, 2015--and
 * return those two sets of statistics in a single result set. */

SELECT
  '2015-04-01 - 2015-04-15' pickup_window_range,
  AVG(passenger_count) avg_pass_count,
  AVG(trip_distance) avg_trip,
  MIN(trip_distance) min_trip,
  MAX(trip_distance) max_trip
FROM taxi_trip_data
WHERE pickup_datetime BETWEEN '2015-04-01' AND '2015-04-15 23:59:59.999'
UNION ALL
SELECT
  '2015-04-16 - 2015-04-23',
  AVG(passenger_count),
  AVG(trip_distance),
  MIN(trip_distance),
  MAX(trip_distance)
FROM taxi_trip_data
WHERE pickup_datetime BETWEEN '2015-04-16' AND '2015-04-23 23:59:59.999';

INTERSECT will combine data sets but only include the records found in both data sets, removing duplicate result records.

/* Retrieve locations (as lat/lon pairs) that were both pick-up and drop-off
 * points */

SELECT
  pickup_latitude,
  pickup_longitude
FROM taxi_trip_data_replicated
WHERE
  pickup_latitude <> 0 AND
  pickup_longitude <> 0
INTERSECT
SELECT
  dropoff_latitude,
  dropoff_longitude
FROM taxi_trip_data_replicated;

EXCEPT will return records that appear in the first data set but not the second data set. Note that the data sets on each side of the EXCEPT will have duplicates removed first, and then the set subtraction will be processed.

/* Show vendors that operate before noon, but not after noon:  retrieve the
 * unique list of IDs of vendors who provided cab rides between midnight and
 * noon, and remove from that list the IDs of any vendors who provided cab rides
 * between noon and midnight */

SELECT vendor_id
FROM taxi_trip_data_replicated
WHERE HOUR(pickup_datetime) BETWEEN 0 AND 11
EXCEPT
SELECT vendor_id
FROM taxi_trip_data_replicated
WHERE HOUR(pickup_datetime) BETWEEN 12 AND 23;

Truncate

Use TRUNCATE to remove all records from a table without deleting the table.

/* Remove all records from the given tables. */

TRUNCATE TABLE taxi_info.vendor;

TRUNCATE TABLE taxi_info.payment;

TRUNCATE TABLE taxi_info.taxi_trip_data;

Drop Table

Use DROP TABLE to drop the table from the default collection.

/* Drop the given tables */

DROP TABLE vendor;

DROP TABLE payment;

DROP TABLE taxi_trip_data;

DROP TABLE credit_payment;

DROP TABLE lunch_time_rides;

DROP TABLE taxi_trip_data_replicated;

Drop Schema

Use DROP SCHEMA to drop the taxi_info collection.

DROP SCHEMA taxi_info;

Complete Sample

Included below is a complete example containing all the above queries, the data files, and output.

The SQL script makes reference to a data file with a placeholder file path. This path should be updated to point to a valid path on the ODBC Server host where the file will be located:

SELECT *
FROM FILE."/tmp/data/taxi_trip_data.csv";

The script can be run via KiSQL, with the output written to sql_tutorial.out:

$ /opt/gpudb/kitools/kisql/kisql -host localhost -echoSql true < sql_tutorial.kisql > sql_tutorial.out