Version:

Python Developer Manual

The following guide provides step-by-step instructions to get started writing Python applications using Kinetica. This guide demonstrates only a small set of the available API. A detailed description of the complete interface is available under Python API Reference.

Prerequisites

File Ingest

The following .csv file must be ingested into the database to run the sample file:

API Download and Installation

Requirements

  • Python 2.7 (or greater)
  • pip

Note

The locations of python and pip should be placed in the PATH environment variable. If they are not, you'll need to use the full path to the python and pip executables in the instructions below. Also, administrative access will most likely be required when installing the Python packages.

Download and Install

Using Git
  1. In the desired directory, run the following but be sure to replace kinetica-version with the correct branch name of the installed Kinetica version, e.g., release/v6.2.0

    git clone -b [kinetica-version] --single-branch https://github.com/kineticadb/kinetica-api-python.git
    
  2. Change directory into the newly downloaded repository

  3. In the root directory of the unzipped repository, install the Kinetica API and Avro packages:

    pip install .
    pip install avro
    
  4. Test the installation:

    cd example
    python example.py
    
  5. Update the repo as desired:

    git pull
    
Via PyPI
  1. In the desired directory, run the following:

    pip install gpudb
    

Connecting to the Database

To interact with Kinetica, you must first instantiate an object of the GPUdb class while providing the connection URL, including the host and port of the database server.

h_db = gpudb.GPUdb(
    encoding="BINARY",
    host="127.0.0.1",
    port="9191"
)

If the server is configured to authenticate users, the username and password will also need to be provided at the time of creating the object:

h_db = gpudb.GPUdb(
  encoding="BINARY",
  host="127.0.0.1",
  port="9191",
  username="user1",
  password="mysecret"
)

Creating a Type

Before a table can be created and any data can be loaded into the database, a type needs to be defined. The type is defined as a list of lists of strings. Each column comprises at least two values: a column name (always the first value) and a primitive type (always the second value). Any subsequent values are column properties. You can also use the GPUdbRecordColumn and GPUdbColumnProperty objects in addition to literal string values to help define a column. The following _ColumnType constants are available:

  • _ColumnType.INT
  • _ColumnType.LONG
  • _ColumnType.FLOAT
  • _ColumnType.DOUBLE
  • _ColumnType.STRING
  • _ColumnType.BYTES
# Create a type from a list of lists. Each list below is an individual column. Each column comprises at
# least two values: a column name (always the first value) and a primitive type (always the second value). Any
# subsequent values are column properties. The order of the columns defines the order in which values must
# be inserted into the table, e.g., a "vendor_name" value cannot be inserted before a "vendor_id" value
vendor_columns = [
    # column types and properties can be listed as strings
    ["vendor_id", "string", "char4", "primary_key"],
    ["vendor_name", "string", "char64"],
    ["phone", "string", "char16", "nullable"],
    ["email", "string", "char64", "nullable"],
    ["hq_street", "string", "char64"],
    # column types and properties can also be listed using the GPUdbRecordColumn and GPUdbColumnProperty objects
    ["hq_city", gpudb.GPUdbRecordColumn._ColumnType.STRING, gpudb.GPUdbColumnProperty.CHAR8,
     gpudb.GPUdbColumnProperty.DICT],
    ["hq_state", gpudb.GPUdbRecordColumn._ColumnType.STRING, gpudb.GPUdbColumnProperty.CHAR8,
     gpudb.GPUdbColumnProperty.DICT],
    ["hq_zip", gpudb.GPUdbRecordColumn._ColumnType.INT],
    ["num_emps", gpudb.GPUdbRecordColumn._ColumnType.INT],
    ["num_cabs", gpudb.GPUdbRecordColumn._ColumnType.INT]
]

Important

The order of the columns defines the order in which values must be inserted into the table, e.g., a vendor_name value cannot be inserted before a vendor_id value

Creating a Table

After you've created a type list and stored it in a variable, you can create a table using the GPUdbTable interface. For this tutorial, three tables will be used:

  • A base table containing a sample of NYC taxi cab ride data
  • A lookup table containing sample cab vendor information (created in the example below)
  • A 1-to-1 relation table containing payment information for a subset of the cab-ride transactions
# Create the table from the type and place it in a collection
try:
    table_vendor_obj = gpudb.GPUdbTable(
        _type=vendor_columns,
        name=table_vendor,
        options={"collection_name": collection,
                 "is_replicated": "true"},
        db=h_db
    )
    print "Vendor table successfully created"
except gpudb.GPUdbException as e:
    print "Vendor table creation failure: {}".format(str(e))
print

To create a table object for a table that already exists, pass in a _type of None:

my_table_object = gpudb.GPUdbTable(_type=None, name=my_existing_table, db=h_db)

To pass in table creation options, you can either specify a map (like in the Vendor table) or you can use the GPUdbTableOptions object. There are two ways to use the GPUdbTableOptions object:

# Instantiate the GPUdbTableOptions object using a variable, then chain options together using the variable
options = gpudb.GPUdbTableOptions()
payment_options = options.collection_name("collection_name").is_replicated("true")

# Use the object directly
options = gpudb.GPUdbTableOptions.default().collection_name("collection_name").is_replicated("true")

Important

At this point in the tutorial, a Payment type and table are created. A GPUdbTable object is also created for the taxi_trip_data table you already ingested. Some of the columns being created are specified as DICT, which will dictionary encode those columns. Read more about dictionary encoding here

Inserting Data

You can insert single records or multiple records using the insert_records method. As mentioned previously, the number and order of values must match the column declarations found in the type creation step. In Python, you can pass a value of None for null values, but the column must be declared nullable when creating the type schema.

Note

The following methods are not the only way to insert records, but they are the simplest.

Key/Value Record

In this example, a dictionary object will be used to map key/value pairs to table column values.

vendor_datum = collections.OrderedDict()
vendor_datum["vendor_id"] = "VTS"
vendor_datum["vendor_name"] = "Vine Taxi Service"
vendor_datum["phone"] = "9998880001"
vendor_datum["email"] = "admin@vtstaxi.com"
vendor_datum["hq_street"] = "26 Summit St."
vendor_datum["hq_city"] = "Flushing"
vendor_datum["hq_state"] = "NY"
vendor_datum["hq_zip"] = 11354
vendor_datum["num_cabs"] = 450
vendor_datum["num_emps"] = 400

# Insert the record into the table
table_vendor_obj.insert_records(vendor_datum)
print "Number of single records inserted into the Vendor table: {}".format(table_vendor_obj.size())

In-line Record

In these examples, records will be defined in-line and added to a list.

vendor_records = [
    ["YCAB", "Yes Cab", "7895444321", None, "97 Edgemont St.", "Brooklyn", "NY", 11223, 445, 425],
    ["NYC", "New York City Cabs", None, "support@nyc-taxis.com", "9669 East Bayport St.", "Bronx", "NY", 10453, 505,
     500],
    ["DDS", "Dependable Driver Service", None, None, "8554 North Homestead St.", "Bronx", "NY", 10472, 200, 124],
    ["CMT", "Crazy Manhattan Taxi", "9778896500", "admin@crazymanhattantaxi.com", "950 4th Road Suite 78",
     "Brooklyn", "NY", 11210, 500, 468],
    ["TNY", "Taxi New York", None, None, "725 Squaw Creek St.", "Bronx", "NY", 10458, 315, 305],
    ["NYMT", "New York Metro Taxi", None, None, "4 East Jennings St.", "Brooklyn", "NY", 11228, 166, 150],
    ["5BTC", "Five Boroughs Taxi Co.", "4566541278", "mgmt@5btc.com", "9128 Lantern Street", "Brooklyn", "NY",
     11229, 193, 175]
]

# Insert the records into the Vendor table
table_vendor_obj.insert_records(vendor_records)
print "Number of batch records inserted into the Vendor table: {}".format(table_vendor_obj.size() - 1)

# Create another list of in-line records
payment_records = [
    [189, "No Charge", None, None, 6.5, 0, 0.6, 0, 0, 7],
    [136, "Cash", None, 1428716521000, 4, 0.5, 0.5, 1, 0, 6.3],
    [148, "Cash", None, 1430124581000, 9.5, 0, 0.5, 1, 0, 11.3],
    [114, "Cash", None, 1428259673000, 5.5, 0, 0.5, 1.89, 0, 8.19],
    [180, "Cash", None, 1428965823000, 6.5, 0.5, 0.5, 1, 0, 8.8],
    [109, "Cash", None, 1428948513000, 22.5, 0.5, 0.5, 4.75, 0, 28.55],
    [132, "Cash", None, 1429472779000, 6.5, 0.5, 0.5, 1.55, 0, 9.35],
    [134, "Cash", None, 1429472668000, 33.5, 0.5, 0.5, 0, 0, 34.8],
    [176, "Cash", None, 1428403962000, 9, 0.5, 0.5, 2.06, 0, 12.36],
    [100, "Cash", None, None, 9, 0, 0.5, 2.9, 0, 12.7],
    [193, "Cash", None, None, 3.5, 1, 0.5, 1.59, 0, 6.89],
    [140, "Credit", "Visa", None, 28, 0, 0.5, 0, 0, 28.8],
    [161, "Credit", "Visa", None, 7, 0, 0.5, 0, 0, 7.8],
    [199, "Credit", "Visa", None, 6, 1, 0.5, 1, 0, 8.5],
    [159, "Credit", "Visa", 1428674487000, 7, 0, 0.5, 0, 0, 7.8],
    [156, "Credit", "MasterCard", 1428672753000, 12.5, 0.5, 0.5, 0, 0, 13.8],
    [198, "Credit", "MasterCard", 1429472636000, 9, 0, 0.5, 0, 0, 9.8],
    [107, "Credit", "MasterCard", 1428717377000, 5, 0.5, 0.5, 0, 0, 6.3],
    [166, "Credit", "American Express", 1428808723000, 17.5, 0, 0.5, 0, 0, 18.3],
    [187, "Credit", "American Express", 1428670181000, 14, 0, 0.5, 0, 0, 14.8],
    [125, "Credit", "Discover", 1429869673000, 8.5, 0.5, 0.5, 0, 0, 9.8],
    [119, None, None, 1430431471000, 9.5, 0, 0.5, 0, 0, 10.3],
    [150, None, None, 1430432447000, 7.5, 0, 0.5, 0, 0, 8.3],
    [170, "No Charge", None, 1430431502000, 28.6, 0, 0.5, 0, 0, 28.6],
    [123, "No Charge", None, 1430136649000, 20, 0.5, 0.5, 0, 0, 21.3],
    [181, None, None, 1430135461000, 6.5, 0.5, 0.5, 0, 0, 7.8],
    [189, "No Charge", None, None, 6.5, 0, 0.6, 0, 0, 7]
]

# Insert the records into the Payment table
for record in payment_records:
    table_payment_obj.insert_records(record)
print "Number of batch records inserted into the Payment table: {}".format(table_payment_obj.size())

Retrieving Data

Once the table is populated with data, the data can be retrieved using binary encoding or JSON encoding. Binary encoding is more efficient than JSON encoding. The example below uses the GPUdbTable interface and binary encoding:

# Retrieve no more than 10 records from the Payment table using the GPUdbTable interface
print "{:<10s} {:<12s} {:<11s} {:<17s} {:<11s} {:<9s} {:<7s} {:<10s} {:<12s} {:<12s}".format(
    "Payment ID",
    "Payment Type",
    "Credit Type",
    "Payment Timestamp",
    "Fare Amount",
    "Surcharge",
    "MTA Tax",
    "Tip Amount",
    "Tolls Amount",
    "Total Amount"
)
print "{:=<10s} {:=<12s} {:=<11s} {:=<17s} {:=<11s} {:=<9s} {:=<7s} {:=<10s} {:=<12s} {:=<12s}".format(
    "", "", "", "", "", "", "", "", "", ""
)
for record in table_payment_obj.get_records(
        offset=0,
        limit=10,
        encoding="binary",
        options={"sort_by": "payment_type"}
):
    print "{payment_id:<10d} {payment_type:<12s} {credit_type:<11s} {payment_timestamp:<17d} {fare_amount:11.2f} " \
          "{surcharge:9.2f} {mta_tax:7.2f} {tip_amount:10.2f} {tolls_amount:12.2f} {total_amount:12.2f}".format(
            payment_id=record["payment_id"],
            payment_type=record["payment_type"],
            credit_type=record["credit_type"],
            payment_timestamp=record["payment_timestamp"],
            fare_amount=record["fare_amount"],
            surcharge=record["surcharge"],
            mta_tax=record["mta_tax"],
            tip_amount=record["tip_amount"],
            tolls_amount=record["tolls_amount"],
            total_amount=record["total_amount"]
            )
print

This example demonstrates JSON encoding record retrieval using the GPUdb interface:

# Retrieve all records from the Vendor table using the GPUdb interface
print "{:<9s} {:<26s} {:<11s} {:<29s} {:<24s} {:<9s} {:<9s} {:<11s} {:<11s} {:<6s}".format(
    "Vendor ID",
    "Vendor Name",
    "Phone",
    "Email",
    "HQ Street",
    "HQ City",
    "HQ State",
    "HQ Zip Code",
    "# Employees",
    "# Cabs"
)
print "{:=<9s} {:=<26s} {:=<11s} {:=<29s} {:=<24s} {:=<9s} {:=<9s} {:=<11s} {:=<11s} {:=<6s}".format(
    "", "", "", "", "", "", "", "", "", ""
)
vendor_records_gpudb = h_db.get_records(
    table_name=table_vendor,
    offset=0,
    limit=gpudb.GPUdb.END_OF_SET,
    encoding="json"
)["records_json"]
for record in vendor_records_gpudb:
    print "{vendor_id:<9s} {vendor_name:<26s} {phone:<11} {email:<29s} {hq_street:<24s} {hq_city:<9s} " \
          "{hq_state:<9s} {hq_zip:<11d} {num_emps:<11d} {num_cabs:<6d}".format(**json.loads(record))

For large tables, the data can be easily be retrieved in smaller blocks by using the offset and limit parameters. The returned response also contains the schema (or data type) of the results.

Updating and Removing Records

Using any GPUdbTable object, you can update records via the update_records method.

# Update the email, number of employees, and number of cabs of the DDS vendor
table_vendor_obj.update_records(
    expressions=["vendor_id = 'DDS'"],
    new_values_maps={"email": "management@ddstaxico.com",
                     "num_emps": "156",
                     "num_cabs": "213"}
)

# Print the updated table
print "Updated DDS vendor information:"
print "{:<9s} {:<25s} {:<10s} {:<28s} {:<24} {:<8s} {:<11s} {:<11s} {:<6s}".format(
    "Vendor ID",
    "Vendor Name",
    "Phone",
    "Email",
    "HQ Street",
    "HQ City",
    "HQ Zip Code",
    "# Employees",
    "# Cabs"
)
print "{:=<9s} {:=<25s} {:=<10s} {:=<28s} {:=<24s} {:=<8s} {:=<11s} {:=<11s} {:=<6s}".format(
    "", "", "", "", "", "", "", "", "", ""
)
for vendor_record in table_vendor_obj.get_records(
        offset=0,
        limit=gpudb.GPUdb.END_OF_SET,
        options={"expression": "vendor_id = 'DDS'"}
):
    print "{vendor_id:<9s} {vendor_name:<25s} {phone:<10s} {email:<28s} {hq_street:<24s} {hq_city:<8s} " \
          "{hq_zip:<11d} {num_emps:<11d} {num_cabs:<6d}".format(
            vendor_id=vendor_record["vendor_id"],
            vendor_name=vendor_record["vendor_name"],
            phone=vendor_record["phone"],
            email=vendor_record["email"],
            hq_street=vendor_record["hq_street"],
            hq_city=vendor_record["hq_city"],
            hq_zip=vendor_record["hq_zip"],
            num_emps=vendor_record["num_emps"],
            num_cabs=vendor_record["num_cabs"]
            )
print

You can delete records using the delete_records method.

# Delete payment 189
pre_delete = table_payment_obj.size()
print "Records in the payment table (before delete): {}".format(pre_delete)
delete_expr = ["payment_id = 189"]
print "Deleting record where " + delete_expr[0]
table_payment_obj.delete_records(expressions=delete_expr)
post_delete = table_payment_obj.size()
print "Records in the payment table (after delete): {}".format(post_delete)

Alter Table

Some properties can be altered or added after table creation, including indexes, dictionary encoding, and compression. Using the alter_table method, you can create indexes on columns using the create_index action paired with a column name.

# Add column indexes to vendor_id in both the vendor and taxi_trip_data tables
vendor_index_resp = table_vendor_obj.alter_table(action="create_index", value="vendor_id")
print "Index added to vendor table: {} \n".format(vendor_index_resp)

taxi_index_resp = table_taxi_obj.alter_table(action="create_index", value="vendor_id")
print "Index added to taxi_trip_data table: {} \n".format(taxi_index_resp)

Applying column compression works similarly: using the alter_table method but with a set_column_compression action paired with a column name and compression type option.

# Apply snappy compression to the pickup and dropoff datetime columns
snappy_compression = {"compression_type": "snappy"}
pickup_compress_resp = table_taxi_obj.alter_table(action="set_column_compression", value="pickup_datetime",
                                              options=snappy_compression)
print "Compression added to 'pickup_datetime' column: {} \n".format(pickup_compress_resp)

dropoff_compress_resp = table_taxi_obj.alter_table(action="set_column_compression", value="dropoff_datetime",
                                               options=snappy_compression)
print "Compression added to 'dropoff_datetime' column: {}".format(dropoff_compress_resp)

Filters

Filters are an easy way to filter down a larger table into more concise views using expressions.

Filter Example 1

# Select all payments with no corresponding payment type; allow Kinetica to assign a random name to the view
collection_option = {"collection_name": collection}
f1_resp = table_payment_obj.filter(
    expression="IS_NULL(payment_type)",
    options=collection_option
).size()
print "Number of null payments: {}".format(f1_resp)

Filter Example 2

# Using GPUdbTable query chaining, filter null payment type records with a fare amount greater than 8
h_db.clear_table(view_example1)
f2_resp = table_payment_obj.filter(
    view_name=view_example1,
    expression="IS_NULL(payment_type)",
    options=collection_option
).filter(
    view_name=view_example2,
    expression="fare_amount > 8",
    options=collection_option
).size()
print "Number of null payments with a fare amount greater than $8.00 (with query chaining): {}".format(f2_resp)

Filter Example 3

# Filter by list where vendor ID is either NYC or YCAB
f3_resp = table_taxi_obj.filter_by_list(
    view_name=view_example3,
    column_values_map={"vendor_id": ["NYC", "YCAB"]}
).size()
print "Number of records where vendor_id is either NYC or YCAB: {}".format(f3_resp)

Filter Example 4

# Filter by range trip with passenger count between 1 and 3
f4_resp = table_taxi_obj.filter_by_range(
    view_name=view_example4,
    column_name="passenger_count",
    lower_bound=1,
    upper_bound=3
).size()
print "Number of trips with a passenger count between 1 and 3: {}".format(f4_resp)

Aggregates

Kinetica supports various aggregate and group-by queries, which group and aggregate your data to return counts and useful statistics.

Aggregate Example 1

# Aggregate count, min, mean, and max on the trip distance
a1_resp = table_taxi_obj.aggregate_statistics(
    column_name="trip_distance",
    stats="count,min,max,mean"
)
print "Statistics of values in the trip_distance column:"
print "\tCount: {count:.0f}\n\tMin: {min:4.2f}\n\tMean: {mean:4.2f}\n\tMax: {max:4.2f}\n".format(**a1_resp["stats"])

Aggregate Example 2

# Find unique taxi vendor IDs
a2_resp = table_taxi_obj.aggregate_unique(
    column_name="vendor_id",
    offset=0,
    limit=gpudb.GPUdb.END_OF_SET,
    encoding="json"
)["data"]["vendor_id"]
print "Unique vendor IDs in the taxi trip table:"
for vendor in a2_resp:
    print "\t* {}".format(vendor)
print

Aggregate Example 3

# Find number of trips per vendor; use binary decoding
a3_resp = table_taxi_obj.aggregate_group_by(
    column_names=["vendor_id",
                  "count(vendor_id)"],
    offset=0,
    limit=gpudb.GPUdb.END_OF_SET
)["data"]
print "Number of trips per vendor:"
for vendor in zip(
        a3_resp["vendor_id"],
        a3_resp["count(vendor_id)"]
):
    print "\t{:<10s}{:4d}".format(vendor[0] + ":", vendor[1])
print

Aggregate Example 4

# Create a histogram for the different groups of passenger counts
a4_resp = table_taxi_obj.aggregate_histogram(
    column_name="passenger_count",
    start=1,
    end=6,
    interval=1
)["counts"]
print "Passenger count groups by size:"
for histo_group in zip(
        [1, 2, 3, 4, 5, 6],
        a4_resp
):
    print "\t{}s: {:4.0f}".format(*histo_group)

Joins

Joins allow you to link multiple tables together, along their relations, retrieving associated information from any or all of them. Tables can only be joined if they're sharded similarly or replicated. Note that the create_join_table method is static and is not called with a corresponding GPUdbTable object.

An inner join returns only records that have matching values in both tables.

# Retrieve cab ride transactions and the full name of the associated vendor for rides having more than three
# passengers between April 1st & 16th, 2015
gpudb.GPUdbTable.create_join_table(
    join_table_name=join_table_inner,
    table_names=["taxi_trip_data as t",
                 "payment as p"],
    column_names=["payment_id",
                  "payment_type",
                  "total_amount",
                  "passenger_count",
                  "vendor_id",
                  "trip_distance"],
    expressions=["t.payment_id = p.payment_id",
                 "passenger_count > 3"],
    options={"collection_name": collection},
    db=h_db
)
join_table_inner_obj = gpudb.GPUdbTable(
    _type=None,
    name=join_table_inner,
    db=h_db
)
j1_resp = join_table_inner_obj.get_records_by_column(
    column_names=["payment_id",
                  "payment_type",
                  "total_amount",
                  "passenger_count",
                  "vendor_id",
                  "trip_distance"],
    offset=0,
    limit=gpudb.GPUdb.END_OF_SET
)
print "Inner join table:"
print "Payment ID   Payment Type  Total Amount  Passenger Count  Vendor ID  Trip Distance "
print "============ ============= ============= ================ ========== =============="
for record in zip(
        j1_resp["payment_id"],
        j1_resp["payment_type"],
        j1_resp["total_amount"],
        j1_resp["passenger_count"],
        j1_resp["vendor_id"],
        j1_resp["trip_distance"]
):
    print "{:<12d} {:<13} {:<13f} {:<16d} {:<10s} {:<14f}".format(
        record[0],
        record[1],
        record[2],
        record[3],
        record[4],
        record[5]
    )
print

A left join returns all of the records an inner join does, but additionally, for each record in the table on the left side of the join that has no match along the relation to a record in the table on the right side of the join, a corresponding record will be returned with "left-side" columns populated with the "left-side" record data and the "right-side" columns populated with nulls.

# Retrieve cab ride transactions and the full name of the associated vendor (if available--blank if vendor name is
# unknown) for transactions with associated payment data, sorting by increasing values of transaction ID.
gpudb.GPUdbTable.create_join_table(
    join_table_name=join_table_left,
    table_names=["taxi_trip_data as t",
                 "vendor as v"],
    column_names=["transaction_id",
                  "pickup_datetime",
                  "trip_distance",
                  "t.vendor_id",
                  "vendor_name"],
    expressions=["left join t, v on (t.vendor_id = v.vendor_id)",
                 "payment_id <> 0"],
    options={"collection_name": collection},
    db=h_db
)
join_table_left_obj = gpudb.GPUdbTable(
    _type=None,
    name=join_table_left,
    db=h_db
)
j2_resp = join_table_left_obj.get_records_by_column(
    column_names=["transaction_id",
                  "pickup_datetime",
                  "trip_distance",
                  "vendor_id",
                  "vendor_name"],
    offset=0,
    limit=gpudb.GPUdb.END_OF_SET
)
print "Left join table:"
print "Transaction ID Pickup (in secs since Epoch) Trip Distance Vendor ID Vendor Name       "
print "============== ============================ ============= ========= =================="
for record in zip(
        j2_resp["transaction_id"],
        j2_resp["pickup_datetime"],
        j2_resp["trip_distance"],
        j2_resp["vendor_id"],
        j2_resp["vendor_name"]
):
    print "{:<14d} {:<28d} {:<13.2f} {:<9s} {:18s}".format(
        record[0],
        record[1],
        record[2],
        record[3],
        record[4]
    )
print

Note

Full outer joins require both tables to be replicated. Set merges that perform deduplication of records, like Union Distinct, Intersect, and Except also need to use replicated tables to ensure the correct results, so a replicated version of the taxi (taxi_trip_data_replicated) table and a corresponding table object are created at this point in the tutorial.

gpudb.GPUdbTable.merge_records(
    table_name=table_taxi_replicated, source_table_names=[table_taxi],
    field_maps=[{
        "transaction_id": "transaction_id",
        "payment_id": "payment_id",
        "vendor_id": "vendor_id",
        "pickup_datetime": "pickup_datetime",
        "dropoff_datetime": "dropoff_datetime",
        "passenger_count": "passenger_count",
        "trip_distance": "trip_distance",
        "pickup_longitude": "pickup_longitude",
        "pickup_latitude": "pickup_latitude",
        "dropoff_longitude": "dropoff_longitude",
        "dropoff_latitude": "dropoff_latitude"
    }],
    options={"collection_name": collection,
             "is_replicated": "true"},
    db=h_db
)
table_taxi_replicated_obj = gpudb.GPUdbTable(
    _type=None,
    name=table_taxi_replicated,
    db=h_db
)

A full outer join returns all of the records a left join does, but additionally, for each record in the table on the right side of the join that has no match along the relation to a record in the table on the left side of the join, a corresponding record will be returned with "right-side" columns populated with the "right-side" record data and the "left-side" columns populated with nulls.

# Retrieve the vendor IDs of known vendors with no recorded cab ride transactions, as well as the vendor ID and
# number of transactions for unknown vendors with recorded cab ride transactions
gpudb.GPUdbTable.create_join_table(
    join_table_name=join_table_outer,
    table_names=["taxi_trip_data_replicated as t",
                 "vendor as v"],
    column_names=["t.vendor_id as vendor_id",
                  "v.vendor_id as vendor_id_1"],
    expressions=["full_outer join t,v on ((v.vendor_id = t.vendor_id))"],
    options={"collection_name": collection},
    db=h_db
)
join_table_outer_obj = gpudb.GPUdbTable(
    _type=None,
    name=join_table_outer,
    db=h_db
)

# Aggregate the join table results by vendor ID and count the amount of records
j3_resp = join_table_outer_obj.aggregate_group_by(
    column_names=["vendor_id_1 as vend_table_vendors",
                  "vendor_id as taxi_table_vendors",
                  "count(*) as total_records"], offset=0,
    limit=gpudb.GPUdb.END_OF_SET,
    options={"expression": "(is_null(vendor_id_1) OR is_null(vendor_id))"}
)["data"]
print "Outer join table:"
print "Vend. Table Vendors Taxi Table Vendors Total Records"
print "=================== ================== ============="
for record in zip(
        j3_resp["vend_table_vendors"],
        j3_resp["taxi_table_vendors"],
        j3_resp["total_records"]
):
    print "{:<19s} {:<18s} {:<13d}".format(
        record[0],
        record[1],
        record[2]
    )
print

Projections

You can create projections using the create_projection method.

# Create a projection containing all payments by credit card
table_payment_obj.create_projection(
    projection_name=projection_example1,
    column_names=["payment_id",
                  "payment_type",
                  "credit_type",
                  "payment_timestamp",
                  "fare_amount",
                  "surcharge",
                  "mta_tax",
                  "tip_amount",
                  "tolls_amount",
                  "total_amount"],
    options={"collection_name": collection}
)
projection_example1_obj = gpudb.GPUdbTable(
    _type=None,
    name=projection_example1,
    db=h_db
)
p1_resp = projection_example1_obj.get_records_by_column(
    column_names=["payment_id",
                  "payment_type",
                  "credit_type",
                  "payment_timestamp",
                  "fare_amount",
                  "surcharge",
                  "mta_tax",
                  "tip_amount",
                  "tolls_amount",
                  "total_amount"],
    offset=0,
    limit=gpudb.GPUdb.END_OF_SET
)
print "Projection of only credit payment types:"
print "Payment ID Payment Type Credit Type      Timestamp     Fare  Surcharge MTA Tax Tip   Tolls Total"
print "========== ============ ================ ============= ===== ========= ======= ===== ===== ====="
for record in zip(
        p1_resp["payment_id"],
        p1_resp["payment_type"],
        p1_resp["credit_type"],
        p1_resp["payment_timestamp"],
        p1_resp["fare_amount"],
        p1_resp["surcharge"],
        p1_resp["mta_tax"],
        p1_resp["tip_amount"],
        p1_resp["tolls_amount"],
        p1_resp["total_amount"]
):
    print "{:<10} {:<12} {:<16} {:<13} {:^5.2f} {:^9.2f} {:^7.2f} {:^5.2f} {:^5.2f} {:^5.2f}".format(
        record[0],
        record[1],
        record[2],
        record[3],
        record[4],
        record[5],
        record[6],
        record[7],
        record[8],
        record[9]
    )
print

To persist a projection:

# Create a persisted table with cab ride transactions greater than 5 miles whose trip started during lunch hours
table_taxi_obj.create_projection(
    projection_name=projection_example2,
    column_names=["hour(pickup_datetime) as hour_of_day",
                  "vendor_id",
                  "passenger_count",
                  "trip_distance"],
    options={"expression": "(hour(pickup_datetime) >=11) AND (hour(pickup_datetime) <= 14) AND (trip_distance > 5)",
             "collection_name": collection,
             "persist": "true"}
)
projection_example2_obj = gpudb.GPUdbTable(
    _type=None,
    name=projection_example2,
    db=h_db
)
p2_resp = projection_example2_obj.get_records_by_column(
    column_names=["hour_of_day",
                  "vendor_id",
                  "passenger_count",
                  "trip_distance"],
    offset=0,
    limit=gpudb.GPUdb.END_OF_SET
)
print "Projection of long trips taken during lunch hours:"
print "Hour of Day Vendor ID Passenger Count Trip Distance"
print "=========== ========= =============== ============="
for record in zip(
        p2_resp["hour_of_day"],
        p2_resp["vendor_id"],
        p2_resp["passenger_count"],
        p2_resp["trip_distance"]
):
    print "{:<11} {:<9} {:<15} {:<13.2f}".format(
        record[0],
        record[1],
        record[2],
        record[3]
    )

Union, Intersect, and Except

Union can be used to combine homogeneous data sets into one larger data set. Union & Union Distinct will both combine data sets but only retain the records that are unique across the chosen columns, removing all duplicates. Union All will combine data sets, retaining all records from the source data sets.

# Calculate the average number of passengers, as well as the shortest, average, and longest trips for all trips in
# each of the two time periods--from April 1st through the 15th, 2015 and from April 16th through the 23rd, 2015--
# and return those two sets of statistics in a single result set.
table_taxi_obj.aggregate_group_by(
    column_names=["avg(passenger_count) as avg_pass_count",
                  "avg(trip_distance) as avg_trip_dist",
                  "min(trip_distance) as min_trip_dist",
                  "max(trip_distance) as max_trip_dist"],
    offset=0,
    limit=gpudb.GPUdb.END_OF_SET,
    options={"expression": "((pickup_datetime >= '2015-04-01') AND (pickup_datetime <= '2015-04-15 23:59:59.999'))",
             "result_table": agg_grpby_union_all_src1,
             "collection_name": collection}
)
table_taxi_obj.aggregate_group_by(
    column_names=["avg(passenger_count) as avg_pass_count",
                  "avg(trip_distance) as avg_trip_dist",
                  "min(trip_distance) as min_trip_dist",
                  "max(trip_distance) as max_trip_dist"],
    offset=0,
    limit=gpudb.GPUdb.END_OF_SET,
    options={"expression": "((pickup_datetime >= '2015-04-16') AND (pickup_datetime <= '2015-04-23 23:59:59.999'))",
             "result_table": agg_grpby_union_all_src2,
             "collection_name": collection}
)
gpudb.GPUdbTable.create_union(
    table_name=union_all_table,
    table_names=[agg_grpby_union_all_src1,
                 agg_grpby_union_all_src2],
    input_column_names=[
        ["'2015-04-01 - 2014-04-15'", "avg_pass_count", "avg_trip_dist", "min_trip_dist", "max_trip_dist"],
        ["'2015-04-16 - 2015-04-23'", "avg_pass_count", "avg_trip_dist", "min_trip_dist", "max_trip_dist"]
    ],
    output_column_names=["pickup_window_range",
                         "avg_pass_count",
                         "avg_trip",
                         "min_trip",
                         "max_trip"],
    options={"mode": "union_all",
             "collection_name": collection},
    db=h_db
)
union_all_table_obj = gpudb.GPUdbTable(
    _type=None,
    name=union_all_table,
    db=h_db
)
u1_resp = union_all_table_obj.get_records_by_column(
    column_names=["pickup_window_range",
                  "avg_pass_count",
                  "avg_trip",
                  "min_trip",
                  "max_trip"],
    offset=0,
    limit=gpudb.GPUdb.END_OF_SET
)
print "Union All table:"
print "Pickup Window Range     Avg. Pass. Count Avg. Trip     Min. Trip      Max Trip     "
print "======================= ================ ============= ============== ============="
for record in zip(
        u1_resp["pickup_window_range"],
        u1_resp["avg_pass_count"],
        u1_resp["avg_trip"],
        u1_resp["min_trip"],
        u1_resp["max_trip"]
):
    print "{:<23} {:<16} {:<13} {:<14} {:<13}".format(
        record[0],
        record[1],
        record[2],
        record[3],
        record[4]
    )
print

Intersect will combine data sets but only include the records found in both data sets, removing duplicate result records.

# Retrieve locations (as lat/lon pairs) that were both pick-up and drop-off points
table_taxi_replicated_obj.create_projection(
    projection_name=projection_intersect_src,
    column_names=["pickup_latitude",
                  "pickup_longitude"],
    options={"expression": "((pickup_latitude > 0) AND (pickup_longitude <> 0))",
             "collection_name": collection})
gpudb.GPUdbTable.create_union(
    table_name=union_intersect_table,
    table_names=[projection_intersect_src,
                 table_taxi_replicated],
    input_column_names=[["pickup_latitude", "pickup_longitude"],
                        ["dropoff_latitude", "dropoff_longitude"]],
    output_column_names=["pickup_latitude", "pickup_longitude"],
    options={"mode": "intersect",
             "collection_name": collection},
    db=h_db
)
union_intersect_table_obj = gpudb.GPUdbTable(
    _type=None,
    name=union_intersect_table,
    db=h_db
)
u2_resp = union_intersect_table_obj.get_records_by_column(
    column_names=["pickup_latitude", "pickup_longitude"],
    offset=0,
    limit=gpudb.GPUdb.END_OF_SET)
print "Intersect table:"
print "Pickup Latitude Pickup Longitude"
print "=============== ================"
for record in zip(u2_resp["pickup_latitude"], u2_resp["pickup_longitude"]):
    print "{:<15} {:<16}".format(
        record[0],
        record[1]
    )
print

Except will return records that appear in the first data set but not the second data set. Note that the data sets on each side of the Except will have duplicates removed first, and then the set subtraction will be processed.

# Show vendors that operate before noon, but not after noon: retrieve the unique list of IDs of vendors who
# provided cab rides between midnight and noon, and remove from that list the IDs of any vendors who provided cab
# rides between noon and midnight
table_taxi_replicated_obj.create_projection(
    projection_name=projection_except_src1,
    column_names=["vendor_id"],
    options={"expression": "((HOUR(pickup_datetime) >= 0) AND (HOUR(pickup_datetime) <= 11))",
             "collection_name": collection}
)
table_taxi_replicated_obj.create_projection(
    projection_name=projection_except_src2,
    column_names=["vendor_id"],
    options={"expression": "((HOUR(pickup_datetime) >= 12) AND (HOUR(pickup_datetime) <= 23))",
             "collection_name": collection}
)
gpudb.GPUdbTable.create_union(
    table_name=union_except_table,
    table_names=[projection_except_src1,
                 projection_except_src2],
    input_column_names=[["vendor_id"],
                        ["vendor_id"]],
    output_column_names=["vendor_id"],
    options={"mode": "except",
             "collection_name": collection},
    db=h_db
)
union_except_table_obj = gpudb.GPUdbTable(
    _type=None,
    name=union_except_table,
    db=h_db
)
u3_resp = union_except_table_obj.get_records_by_column(
    column_names=["vendor_id"],
    offset=0,
    limit=gpudb.GPUdb.END_OF_SET
)
print "Except table:"
print "Vendor ID"
print "========="
for record in zip(u3_resp["vendor_id"]):
    print "{:<9}".format(record[0])

Deleting Records

You can delete records from a table using filter expressions. This method allows you to specify multiple filter expressions--note that each expression is used to delete records independently from the others (i.e., a record only needs to meet any one expression's criteria to be deleted from the table).

delete_resp = table_payment_obj.filter(expression="payment_type = 'Cash'").size()
print "Number of records that meet deletion criteria before deleting: {}".format(delete_resp)

table_payment_obj.delete_records(expressions=["payment_type = 'Cash'"])

delete_resp = table_payment_obj.filter(expression="payment_type = 'Cash'").size()
print "Number of records that meet deletion criteria after deleting: {}".format(delete_resp)

Clearing Tables/Collections

You can remove tables and collections using the GPUdb interface method clear_table.

h_db.clear_table(table_name=table_taxi)
h_db.clear_table(table_name=table_taxi_replicated)
h_db.clear_table(table_name=collection)

Complete Sample

Included below is a complete sample program containing all the above queries:

"""This script walks through how to use the Python API.

Covered here: importing GPUdb, instantiating Kinetica, creating types, creating tables, inserting records,
retrieving records, updating records, altering tables, filtering records, aggregating/grouping records, joining tables,
projections, unioning tables, and deleting records/tables.

"""

import collections
import json
import gpudb


def gpudb_example():
    print
    print "TUTORIAL OUTPUT"
    print "==============="

    # All tables/views used in examples below
    agg_grpby_union_all_src1 = "agg_passcount_tripdist_btw_apr1_apr15"
    agg_grpby_union_all_src2 = "agg_passcount_tripdist_btw_apr16_apr23"

    collection = "taxi_info"

    join_table_inner = "pay_info_rides_gt_3_pass"
    join_table_left = "all_vendor_transactions"
    join_table_outer = "vendors_w_no_transactions"

    projection_example1 = "credit_payment"
    projection_example2 = "long_lunch_time_rides"
    projection_intersect_src = "pickup_lat_lon"
    projection_except_src1 = "vendors_operating_before_noon"
    projection_except_src2 = "vendors_operating_after_noon"

    table_taxi = "taxi_trip_data"
    table_taxi_replicated = "taxi_trip_data_replicated"
    table_vendor = "vendor"
    table_payment = "payment"

    union_all_table = "passcount_tripdist_stats_apr"
    union_intersect_table = "shared_pickup_dropoff_points"
    union_except_table = "vendors_operating_btw_midnight_noon"

    view_example1 = "null_payments"
    view_example2 = "null_payments_gt_8"
    view_example3 = "nyc_ycab_vendors"
    view_example4 = "passenger_count_btw_1_3"

    # Establish connection with a locally-running instance of Kinetica using binary encoding to save memory
    h_db = gpudb.GPUdb(
        encoding="BINARY",
        host="127.0.0.1",
        port="9191"
    )

    print
    print "CREATING TYPES & TABLES"
    print "-----------------------"
    print

    print "Vendor Table"
    print "************"
    # Create a type from a list of lists. Each list below is an individual column. Each column comprises at
    # least two values: a column name (always the first value) and a primitive type (always the second value). Any
    # subsequent values are column properties. The order of the columns defines the order in which values must
    # be inserted into the table, e.g., a "vendor_name" value cannot be inserted before a "vendor_id" value
    vendor_columns = [
        # column types and properties can be listed as strings
        ["vendor_id", "string", "char4", "primary_key"],
        ["vendor_name", "string", "char64"],
        ["phone", "string", "char16", "nullable"],
        ["email", "string", "char64", "nullable"],
        ["hq_street", "string", "char64"],
        # column types and properties can also be listed using the GPUdbRecordColumn and GPUdbColumnProperty objects
        ["hq_city", gpudb.GPUdbRecordColumn._ColumnType.STRING, gpudb.GPUdbColumnProperty.CHAR8,
         gpudb.GPUdbColumnProperty.DICT],
        ["hq_state", gpudb.GPUdbRecordColumn._ColumnType.STRING, gpudb.GPUdbColumnProperty.CHAR8,
         gpudb.GPUdbColumnProperty.DICT],
        ["hq_zip", gpudb.GPUdbRecordColumn._ColumnType.INT],
        ["num_emps", gpudb.GPUdbRecordColumn._ColumnType.INT],
        ["num_cabs", gpudb.GPUdbRecordColumn._ColumnType.INT]
    ]

    # Clear any existing table with the same name (otherwise we won"t be able to create the table)
    if h_db.has_table(table_name=table_vendor)["table_exists"]:
        h_db.clear_table(table_name=table_vendor)

    # Create the table from the type and place it in a collection
    try:
        table_vendor_obj = gpudb.GPUdbTable(
            _type=vendor_columns,
            name=table_vendor,
            options={"collection_name": collection,
                     "is_replicated": "true"},
            db=h_db
        )
        print "Vendor table successfully created"
    except gpudb.GPUdbException as e:
        print "Vendor table creation failure: {}".format(str(e))
    print

    print "Payment Table"
    print "*************"
    payment_columns = [
        ["payment_id", "long", "primary_key"],
        ["payment_type", "string", "char16", "nullable"],
        ["credit_type", "string", "char16", "nullable", "dict"],
        ["payment_timestamp", "long", "timestamp", "nullable"],
        ["fare_amount", "double", "nullable"],
        ["surcharge", "double", "nullable"],
        ["mta_tax", "double", "nullable"],
        ["tip_amount", "double", "nullable"],
        ["tolls_amount", "double", "nullable"],
        ["total_amount", "double", "nullable"]
    ]

    # Clear any existing table with the same name (otherwise we won't be able to create the table)
    if h_db.has_table(table_name=table_payment)["table_exists"]:
        h_db.clear_table(table_name=table_payment)

    # Create an "options" object containing the table creation options
    options = gpudb.GPUdbTableOptions.default().collection_name("collection_name").is_replicated("true")

    # Create the table from the type and place it in a collection
    try:
        table_payment_obj = gpudb.GPUdbTable(
            payment_columns,
            table_payment,
            options,
            # options={"collection_name": collection},
            h_db
        )
        print "Payment table successfully created"
    except gpudb.GPUdbException as e:
        print "Payment table creation failure: {}".format(str(e))
    print

    print "Taxi Table"
    print "**********"
    # Create a table object for the pre-existing taxi table
    try:
        table_taxi_obj = gpudb.GPUdbTable(
            _type=None,
            name=table_taxi,
            db=h_db
        )
        print "Taxi table object successfully created"
    except gpudb.GPUdbException as e:
        print "Taxi table object creation failure: {}".format(str(e))

    print "\n"
    print "INSERTING DATA"
    print "--------------"
    print

    # Insert single record example
    # Create ordered dictionary for keys & values of record
    vendor_datum = collections.OrderedDict()
    vendor_datum["vendor_id"] = "VTS"
    vendor_datum["vendor_name"] = "Vine Taxi Service"
    vendor_datum["phone"] = "9998880001"
    vendor_datum["email"] = "admin@vtstaxi.com"
    vendor_datum["hq_street"] = "26 Summit St."
    vendor_datum["hq_city"] = "Flushing"
    vendor_datum["hq_state"] = "NY"
    vendor_datum["hq_zip"] = 11354
    vendor_datum["num_cabs"] = 450
    vendor_datum["num_emps"] = 400

    # Insert the record into the table
    table_vendor_obj.insert_records(vendor_datum)
    print "Number of single records inserted into the Vendor table: {}".format(table_vendor_obj.size())

    # Insert multiple records examples
    # Create a list of in-line records. The order of the values must match the column order in the type
    vendor_records = [
        ["YCAB", "Yes Cab", "7895444321", None, "97 Edgemont St.", "Brooklyn", "NY", 11223, 445, 425],
        ["NYC", "New York City Cabs", None, "support@nyc-taxis.com", "9669 East Bayport St.", "Bronx", "NY", 10453, 505,
         500],
        ["DDS", "Dependable Driver Service", None, None, "8554 North Homestead St.", "Bronx", "NY", 10472, 200, 124],
        ["CMT", "Crazy Manhattan Taxi", "9778896500", "admin@crazymanhattantaxi.com", "950 4th Road Suite 78",
         "Brooklyn", "NY", 11210, 500, 468],
        ["TNY", "Taxi New York", None, None, "725 Squaw Creek St.", "Bronx", "NY", 10458, 315, 305],
        ["NYMT", "New York Metro Taxi", None, None, "4 East Jennings St.", "Brooklyn", "NY", 11228, 166, 150],
        ["5BTC", "Five Boroughs Taxi Co.", "4566541278", "mgmt@5btc.com", "9128 Lantern Street", "Brooklyn", "NY",
         11229, 193, 175]
    ]

    # Insert the records into the Vendor table
    table_vendor_obj.insert_records(vendor_records)
    print "Number of batch records inserted into the Vendor table: {}".format(table_vendor_obj.size() - 1)

    # Create another list of in-line records
    payment_records = [
        [189, "No Charge", None, None, 6.5, 0, 0.6, 0, 0, 7],
        [136, "Cash", None, 1428716521000, 4, 0.5, 0.5, 1, 0, 6.3],
        [148, "Cash", None, 1430124581000, 9.5, 0, 0.5, 1, 0, 11.3],
        [114, "Cash", None, 1428259673000, 5.5, 0, 0.5, 1.89, 0, 8.19],
        [180, "Cash", None, 1428965823000, 6.5, 0.5, 0.5, 1, 0, 8.8],
        [109, "Cash", None, 1428948513000, 22.5, 0.5, 0.5, 4.75, 0, 28.55],
        [132, "Cash", None, 1429472779000, 6.5, 0.5, 0.5, 1.55, 0, 9.35],
        [134, "Cash", None, 1429472668000, 33.5, 0.5, 0.5, 0, 0, 34.8],
        [176, "Cash", None, 1428403962000, 9, 0.5, 0.5, 2.06, 0, 12.36],
        [100, "Cash", None, None, 9, 0, 0.5, 2.9, 0, 12.7],
        [193, "Cash", None, None, 3.5, 1, 0.5, 1.59, 0, 6.89],
        [140, "Credit", "Visa", None, 28, 0, 0.5, 0, 0, 28.8],
        [161, "Credit", "Visa", None, 7, 0, 0.5, 0, 0, 7.8],
        [199, "Credit", "Visa", None, 6, 1, 0.5, 1, 0, 8.5],
        [159, "Credit", "Visa", 1428674487000, 7, 0, 0.5, 0, 0, 7.8],
        [156, "Credit", "MasterCard", 1428672753000, 12.5, 0.5, 0.5, 0, 0, 13.8],
        [198, "Credit", "MasterCard", 1429472636000, 9, 0, 0.5, 0, 0, 9.8],
        [107, "Credit", "MasterCard", 1428717377000, 5, 0.5, 0.5, 0, 0, 6.3],
        [166, "Credit", "American Express", 1428808723000, 17.5, 0, 0.5, 0, 0, 18.3],
        [187, "Credit", "American Express", 1428670181000, 14, 0, 0.5, 0, 0, 14.8],
        [125, "Credit", "Discover", 1429869673000, 8.5, 0.5, 0.5, 0, 0, 9.8],
        [119, None, None, 1430431471000, 9.5, 0, 0.5, 0, 0, 10.3],
        [150, None, None, 1430432447000, 7.5, 0, 0.5, 0, 0, 8.3],
        [170, "No Charge", None, 1430431502000, 28.6, 0, 0.5, 0, 0, 28.6],
        [123, "No Charge", None, 1430136649000, 20, 0.5, 0.5, 0, 0, 21.3],
        [181, None, None, 1430135461000, 6.5, 0.5, 0.5, 0, 0, 7.8],
        [189, "No Charge", None, None, 6.5, 0, 0.6, 0, 0, 7]
    ]

    # Insert the records into the Payment table
    for record in payment_records:
        table_payment_obj.insert_records(record)
    print "Number of batch records inserted into the Payment table: {}".format(table_payment_obj.size())

    print "\n"
    print "RETRIEVING DATA"
    print "---------------"
    print

    # Retrieve no more than 10 records from the Payment table using the GPUdbTable interface
    print "{:<10s} {:<12s} {:<11s} {:<17s} {:<11s} {:<9s} {:<7s} {:<10s} {:<12s} {:<12s}".format(
        "Payment ID",
        "Payment Type",
        "Credit Type",
        "Payment Timestamp",
        "Fare Amount",
        "Surcharge",
        "MTA Tax",
        "Tip Amount",
        "Tolls Amount",
        "Total Amount"
    )
    print "{:=<10s} {:=<12s} {:=<11s} {:=<17s} {:=<11s} {:=<9s} {:=<7s} {:=<10s} {:=<12s} {:=<12s}".format(
        "", "", "", "", "", "", "", "", "", ""
    )
    for record in table_payment_obj.get_records(
            offset=0,
            limit=10,
            encoding="binary",
            options={"sort_by": "payment_type"}
    ):
        print "{payment_id:<10d} {payment_type:<12s} {credit_type:<11s} {payment_timestamp:<17d} {fare_amount:11.2f} " \
              "{surcharge:9.2f} {mta_tax:7.2f} {tip_amount:10.2f} {tolls_amount:12.2f} {total_amount:12.2f}".format(
                payment_id=record["payment_id"],
                payment_type=record["payment_type"],
                credit_type=record["credit_type"],
                payment_timestamp=record["payment_timestamp"],
                fare_amount=record["fare_amount"],
                surcharge=record["surcharge"],
                mta_tax=record["mta_tax"],
                tip_amount=record["tip_amount"],
                tolls_amount=record["tolls_amount"],
                total_amount=record["total_amount"]
                )
    print

    # Retrieve all records from the Vendor table using the GPUdb interface
    print "{:<9s} {:<26s} {:<11s} {:<29s} {:<24s} {:<9s} {:<9s} {:<11s} {:<11s} {:<6s}".format(
        "Vendor ID",
        "Vendor Name",
        "Phone",
        "Email",
        "HQ Street",
        "HQ City",
        "HQ State",
        "HQ Zip Code",
        "# Employees",
        "# Cabs"
    )
    print "{:=<9s} {:=<26s} {:=<11s} {:=<29s} {:=<24s} {:=<9s} {:=<9s} {:=<11s} {:=<11s} {:=<6s}".format(
        "", "", "", "", "", "", "", "", "", ""
    )
    vendor_records_gpudb = h_db.get_records(
        table_name=table_vendor,
        offset=0,
        limit=gpudb.GPUdb.END_OF_SET,
        encoding="json"
    )["records_json"]
    for record in vendor_records_gpudb:
        print "{vendor_id:<9s} {vendor_name:<26s} {phone:<11} {email:<29s} {hq_street:<24s} {hq_city:<9s} " \
              "{hq_state:<9s} {hq_zip:<11d} {num_emps:<11d} {num_cabs:<6d}".format(**json.loads(record))

    print "\n"
    print "UPDATING & REMOVING RECORDS"
    print "---------------------------"
    print

    # Update the email, number of employees, and number of cabs of the DDS vendor
    table_vendor_obj.update_records(
        expressions=["vendor_id = 'DDS'"],
        new_values_maps={"email": "management@ddstaxico.com",
                         "num_emps": "156",
                         "num_cabs": "213"}
    )

    # Print the updated table
    print "Updated DDS vendor information:"
    print "{:<9s} {:<25s} {:<10s} {:<28s} {:<24} {:<8s} {:<11s} {:<11s} {:<6s}".format(
        "Vendor ID",
        "Vendor Name",
        "Phone",
        "Email",
        "HQ Street",
        "HQ City",
        "HQ Zip Code",
        "# Employees",
        "# Cabs"
    )
    print "{:=<9s} {:=<25s} {:=<10s} {:=<28s} {:=<24s} {:=<8s} {:=<11s} {:=<11s} {:=<6s}".format(
        "", "", "", "", "", "", "", "", "", ""
    )
    for vendor_record in table_vendor_obj.get_records(
            offset=0,
            limit=gpudb.GPUdb.END_OF_SET,
            options={"expression": "vendor_id = 'DDS'"}
    ):
        print "{vendor_id:<9s} {vendor_name:<25s} {phone:<10s} {email:<28s} {hq_street:<24s} {hq_city:<8s} " \
              "{hq_zip:<11d} {num_emps:<11d} {num_cabs:<6d}".format(
                vendor_id=vendor_record["vendor_id"],
                vendor_name=vendor_record["vendor_name"],
                phone=vendor_record["phone"],
                email=vendor_record["email"],
                hq_street=vendor_record["hq_street"],
                hq_city=vendor_record["hq_city"],
                hq_zip=vendor_record["hq_zip"],
                num_emps=vendor_record["num_emps"],
                num_cabs=vendor_record["num_cabs"]
                )
    print

    # Delete payment 189
    pre_delete = table_payment_obj.size()
    print "Records in the payment table (before delete): {}".format(pre_delete)
    delete_expr = ["payment_id = 189"]
    print "Deleting record where " + delete_expr[0]
    table_payment_obj.delete_records(expressions=delete_expr)
    post_delete = table_payment_obj.size()
    print "Records in the payment table (after delete): {}".format(post_delete)

    print "\n"
    print "ALTER TABLE"
    print "-----------"
    print

    # Add column indexes to vendor_id in both the vendor and taxi_trip_data tables
    vendor_index_resp = table_vendor_obj.alter_table(action="create_index", value="vendor_id")
    print "Index added to vendor table: {} \n".format(vendor_index_resp)

    taxi_index_resp = table_taxi_obj.alter_table(action="create_index", value="vendor_id")
    print "Index added to taxi_trip_data table: {} \n".format(taxi_index_resp)

    # Apply snappy compression to the pickup and dropoff datetime columns
    snappy_compression = {"compression_type": "snappy"}
    pickup_compress_resp = table_taxi_obj.alter_table(action="set_column_compression", value="pickup_datetime",
                                                  options=snappy_compression)
    print "Compression added to 'pickup_datetime' column: {} \n".format(pickup_compress_resp)

    dropoff_compress_resp = table_taxi_obj.alter_table(action="set_column_compression", value="dropoff_datetime",
                                                   options=snappy_compression)
    print "Compression added to 'dropoff_datetime' column: {}".format(dropoff_compress_resp)

    print "\n"
    print "FILTERING"
    print "---------"
    print

    # Clear any existing views with the same name (otherwise we won't be able to create the views)
    if h_db.has_table(table_name=view_example3)["table_exists"]:
        h_db.clear_table(table_name=view_example3)
    if h_db.has_table(table_name=view_example4)["table_exists"]:
        h_db.clear_table(table_name=view_example4)

    # Filter Example 1
    # Select all payments with no corresponding payment type; allow Kinetica to assign a random name to the view
    collection_option = {"collection_name": collection}
    f1_resp = table_payment_obj.filter(
        expression="IS_NULL(payment_type)",
        options=collection_option
    ).size()
    print "Number of null payments: {}".format(f1_resp)

    # Filter Example 2
    # Using GPUdbTable query chaining, filter null payment type records with a fare amount greater than 8
    h_db.clear_table(view_example1)
    f2_resp = table_payment_obj.filter(
        view_name=view_example1,
        expression="IS_NULL(payment_type)",
        options=collection_option
    ).filter(
        view_name=view_example2,
        expression="fare_amount > 8",
        options=collection_option
    ).size()
    print "Number of null payments with a fare amount greater than $8.00 (with query chaining): {}".format(f2_resp)

    # Filter Example 3
    # Filter by list where vendor ID is either NYC or YCAB
    f3_resp = table_taxi_obj.filter_by_list(
        view_name=view_example3,
        column_values_map={"vendor_id": ["NYC", "YCAB"]}
    ).size()
    print "Number of records where vendor_id is either NYC or YCAB: {}".format(f3_resp)

    # Filter Example 4
    # Filter by range trip with passenger count between 1 and 3
    f4_resp = table_taxi_obj.filter_by_range(
        view_name=view_example4,
        column_name="passenger_count",
        lower_bound=1,
        upper_bound=3
    ).size()
    print "Number of trips with a passenger count between 1 and 3: {}".format(f4_resp)

    print "\n"
    print "AGGREGATING, GROUPING, AND HISTOGRAMS"
    print "-------------------------------------"
    print

    # Aggregate Example 1
    # Aggregate count, min, mean, and max on the trip distance
    a1_resp = table_taxi_obj.aggregate_statistics(
        column_name="trip_distance",
        stats="count,min,max,mean"
    )
    print "Statistics of values in the trip_distance column:"
    print "\tCount: {count:.0f}\n\tMin: {min:4.2f}\n\tMean: {mean:4.2f}\n\tMax: {max:4.2f}\n".format(**a1_resp["stats"])

    # Aggregate Example 2
    # Find unique taxi vendor IDs
    a2_resp = table_taxi_obj.aggregate_unique(
        column_name="vendor_id",
        offset=0,
        limit=gpudb.GPUdb.END_OF_SET,
        encoding="json"
    )["data"]["vendor_id"]
    print "Unique vendor IDs in the taxi trip table:"
    for vendor in a2_resp:
        print "\t* {}".format(vendor)
    print

    # Aggregate Example 3
    # Find number of trips per vendor; use binary decoding
    a3_resp = table_taxi_obj.aggregate_group_by(
        column_names=["vendor_id",
                      "count(vendor_id)"],
        offset=0,
        limit=gpudb.GPUdb.END_OF_SET
    )["data"]
    print "Number of trips per vendor:"
    for vendor in zip(
            a3_resp["vendor_id"],
            a3_resp["count(vendor_id)"]
    ):
        print "\t{:<10s}{:4d}".format(vendor[0] + ":", vendor[1])
    print

    # Aggregate Example 4
    # Create a histogram for the different groups of passenger counts
    a4_resp = table_taxi_obj.aggregate_histogram(
        column_name="passenger_count",
        start=1,
        end=6,
        interval=1
    )["counts"]
    print "Passenger count groups by size:"
    for histo_group in zip(
            [1, 2, 3, 4, 5, 6],
            a4_resp
    ):
        print "\t{}s: {:4.0f}".format(*histo_group)

    print "\n"
    print "JOINS"
    print "-----"
    print

    # Join Example 1 (Inner Join)
    # Retrieve cab ride transactions and the full name of the associated vendor for rides having more than three
    # passengers between April 1st & 16th, 2015
    gpudb.GPUdbTable.create_join_table(
        join_table_name=join_table_inner,
        table_names=["taxi_trip_data as t",
                     "payment as p"],
        column_names=["payment_id",
                      "payment_type",
                      "total_amount",
                      "passenger_count",
                      "vendor_id",
                      "trip_distance"],
        expressions=["t.payment_id = p.payment_id",
                     "passenger_count > 3"],
        options={"collection_name": collection},
        db=h_db
    )
    join_table_inner_obj = gpudb.GPUdbTable(
        _type=None,
        name=join_table_inner,
        db=h_db
    )
    j1_resp = join_table_inner_obj.get_records_by_column(
        column_names=["payment_id",
                      "payment_type",
                      "total_amount",
                      "passenger_count",
                      "vendor_id",
                      "trip_distance"],
        offset=0,
        limit=gpudb.GPUdb.END_OF_SET
    )
    print "Inner join table:"
    print "Payment ID   Payment Type  Total Amount  Passenger Count  Vendor ID  Trip Distance "
    print "============ ============= ============= ================ ========== =============="
    for record in zip(
            j1_resp["payment_id"],
            j1_resp["payment_type"],
            j1_resp["total_amount"],
            j1_resp["passenger_count"],
            j1_resp["vendor_id"],
            j1_resp["trip_distance"]
    ):
        print "{:<12d} {:<13} {:<13f} {:<16d} {:<10s} {:<14f}".format(
            record[0],
            record[1],
            record[2],
            record[3],
            record[4],
            record[5]
        )
    print

    # Join example 2 (Left Join)
    # Retrieve cab ride transactions and the full name of the associated vendor (if available--blank if vendor name is
    # unknown) for transactions with associated payment data, sorting by increasing values of transaction ID.
    gpudb.GPUdbTable.create_join_table(
        join_table_name=join_table_left,
        table_names=["taxi_trip_data as t",
                     "vendor as v"],
        column_names=["transaction_id",
                      "pickup_datetime",
                      "trip_distance",
                      "t.vendor_id",
                      "vendor_name"],
        expressions=["left join t, v on (t.vendor_id = v.vendor_id)",
                     "payment_id <> 0"],
        options={"collection_name": collection},
        db=h_db
    )
    join_table_left_obj = gpudb.GPUdbTable(
        _type=None,
        name=join_table_left,
        db=h_db
    )
    j2_resp = join_table_left_obj.get_records_by_column(
        column_names=["transaction_id",
                      "pickup_datetime",
                      "trip_distance",
                      "vendor_id",
                      "vendor_name"],
        offset=0,
        limit=gpudb.GPUdb.END_OF_SET
    )
    print "Left join table:"
    print "Transaction ID Pickup (in secs since Epoch) Trip Distance Vendor ID Vendor Name       "
    print "============== ============================ ============= ========= =================="
    for record in zip(
            j2_resp["transaction_id"],
            j2_resp["pickup_datetime"],
            j2_resp["trip_distance"],
            j2_resp["vendor_id"],
            j2_resp["vendor_name"]
    ):
        print "{:<14d} {:<28d} {:<13.2f} {:<9s} {:18s}".format(
            record[0],
            record[1],
            record[2],
            record[3],
            record[4]
        )
    print

    # Full outer joins require both tables to be replicated. Set merges like Union Distinct, Intersect, and Except
    # need to use replicated tables to ensure the correct results. Copy the records from the taxi trip table to a
    # replicated taxi trip table using /merge/records
    gpudb.GPUdbTable.merge_records(
        table_name=table_taxi_replicated, source_table_names=[table_taxi],
        field_maps=[{
            "transaction_id": "transaction_id",
            "payment_id": "payment_id",
            "vendor_id": "vendor_id",
            "pickup_datetime": "pickup_datetime",
            "dropoff_datetime": "dropoff_datetime",
            "passenger_count": "passenger_count",
            "trip_distance": "trip_distance",
            "pickup_longitude": "pickup_longitude",
            "pickup_latitude": "pickup_latitude",
            "dropoff_longitude": "dropoff_longitude",
            "dropoff_latitude": "dropoff_latitude"
        }],
        options={"collection_name": collection,
                 "is_replicated": "true"},
        db=h_db
    )
    table_taxi_replicated_obj = gpudb.GPUdbTable(
        _type=None,
        name=table_taxi_replicated,
        db=h_db
    )

    # Join Example 3 (Full Outer Join)
    # Retrieve the vendor IDs of known vendors with no recorded cab ride transactions, as well as the vendor ID and
    # number of transactions for unknown vendors with recorded cab ride transactions
    gpudb.GPUdbTable.create_join_table(
        join_table_name=join_table_outer,
        table_names=["taxi_trip_data_replicated as t",
                     "vendor as v"],
        column_names=["t.vendor_id as vendor_id",
                      "v.vendor_id as vendor_id_1"],
        expressions=["full_outer join t,v on ((v.vendor_id = t.vendor_id))"],
        options={"collection_name": collection},
        db=h_db
    )
    join_table_outer_obj = gpudb.GPUdbTable(
        _type=None,
        name=join_table_outer,
        db=h_db
    )

    # Aggregate the join table results by vendor ID and count the amount of records
    j3_resp = join_table_outer_obj.aggregate_group_by(
        column_names=["vendor_id_1 as vend_table_vendors",
                      "vendor_id as taxi_table_vendors",
                      "count(*) as total_records"], offset=0,
        limit=gpudb.GPUdb.END_OF_SET,
        options={"expression": "(is_null(vendor_id_1) OR is_null(vendor_id))"}
    )["data"]
    print "Outer join table:"
    print "Vend. Table Vendors Taxi Table Vendors Total Records"
    print "=================== ================== ============="
    for record in zip(
            j3_resp["vend_table_vendors"],
            j3_resp["taxi_table_vendors"],
            j3_resp["total_records"]
    ):
        print "{:<19s} {:<18s} {:<13d}".format(
            record[0],
            record[1],
            record[2]
        )
    print

    print "\n"
    print "PROJECTIONS"
    print "-----------"
    print

    # Clear any existing projections with the same name (otherwise we won't be able to create the projections)
    if h_db.has_table(table_name=projection_example1)["table_exists"]:
        h_db.clear_table(projection_example1)
    if h_db.has_table(table_name=projection_example2)["table_exists"]:
        h_db.clear_table(projection_example2)

    # Projection Example 1
    # Create a projection containing all payments by credit card
    table_payment_obj.create_projection(
        projection_name=projection_example1,
        column_names=["payment_id",
                      "payment_type",
                      "credit_type",
                      "payment_timestamp",
                      "fare_amount",
                      "surcharge",
                      "mta_tax",
                      "tip_amount",
                      "tolls_amount",
                      "total_amount"],
        options={"collection_name": collection}
    )
    projection_example1_obj = gpudb.GPUdbTable(
        _type=None,
        name=projection_example1,
        db=h_db
    )
    p1_resp = projection_example1_obj.get_records_by_column(
        column_names=["payment_id",
                      "payment_type",
                      "credit_type",
                      "payment_timestamp",
                      "fare_amount",
                      "surcharge",
                      "mta_tax",
                      "tip_amount",
                      "tolls_amount",
                      "total_amount"],
        offset=0,
        limit=gpudb.GPUdb.END_OF_SET
    )
    print "Projection of only credit payment types:"
    print "Payment ID Payment Type Credit Type      Timestamp     Fare  Surcharge MTA Tax Tip   Tolls Total"
    print "========== ============ ================ ============= ===== ========= ======= ===== ===== ====="
    for record in zip(
            p1_resp["payment_id"],
            p1_resp["payment_type"],
            p1_resp["credit_type"],
            p1_resp["payment_timestamp"],
            p1_resp["fare_amount"],
            p1_resp["surcharge"],
            p1_resp["mta_tax"],
            p1_resp["tip_amount"],
            p1_resp["tolls_amount"],
            p1_resp["total_amount"]
    ):
        print "{:<10} {:<12} {:<16} {:<13} {:^5.2f} {:^9.2f} {:^7.2f} {:^5.2f} {:^5.2f} {:^5.2f}".format(
            record[0],
            record[1],
            record[2],
            record[3],
            record[4],
            record[5],
            record[6],
            record[7],
            record[8],
            record[9]
        )
    print

    # Projection Example 2
    # Create a persisted table with cab ride transactions greater than 5 miles whose trip started during lunch hours
    table_taxi_obj.create_projection(
        projection_name=projection_example2,
        column_names=["hour(pickup_datetime) as hour_of_day",
                      "vendor_id",
                      "passenger_count",
                      "trip_distance"],
        options={"expression": "(hour(pickup_datetime) >=11) AND (hour(pickup_datetime) <= 14) AND (trip_distance > 5)",
                 "collection_name": collection,
                 "persist": "true"}
    )
    projection_example2_obj = gpudb.GPUdbTable(
        _type=None,
        name=projection_example2,
        db=h_db
    )
    p2_resp = projection_example2_obj.get_records_by_column(
        column_names=["hour_of_day",
                      "vendor_id",
                      "passenger_count",
                      "trip_distance"],
        offset=0,
        limit=gpudb.GPUdb.END_OF_SET
    )
    print "Projection of long trips taken during lunch hours:"
    print "Hour of Day Vendor ID Passenger Count Trip Distance"
    print "=========== ========= =============== ============="
    for record in zip(
            p2_resp["hour_of_day"],
            p2_resp["vendor_id"],
            p2_resp["passenger_count"],
            p2_resp["trip_distance"]
    ):
        print "{:<11} {:<9} {:<15} {:<13.2f}".format(
            record[0],
            record[1],
            record[2],
            record[3]
        )

    print "\n"
    print "UNIONS"
    print "------"
    print

    # Clear any existing tables with the same name (otherwise we won't be able to create the tables)
    if h_db.has_table(table_name=agg_grpby_union_all_src1)["table_exists"]:
        h_db.clear_table(agg_grpby_union_all_src1)
    if h_db.has_table(table_name=agg_grpby_union_all_src2)["table_exists"]:
        h_db.clear_table(agg_grpby_union_all_src2)
    if h_db.has_table(table_name=projection_intersect_src)["table_exists"]:
        h_db.clear_table(projection_intersect_src)
    if h_db.has_table(table_name=projection_except_src1)["table_exists"]:
        h_db.clear_table(projection_except_src1)
    if h_db.has_table(table_name=projection_except_src2)["table_exists"]:
        h_db.clear_table(projection_except_src2)
    if h_db.has_table(table_name=union_all_table)["table_exists"]:
        h_db.clear_table(union_all_table)
    if h_db.has_table(table_name=union_except_table)["table_exists"]:
        h_db.clear_table(union_except_table)
    if h_db.has_table(table_name=union_intersect_table)["table_exists"]:
        h_db.clear_table(union_intersect_table)

    # Union Example 1 (Union All)
    # Calculate the average number of passengers, as well as the shortest, average, and longest trips for all trips in
    # each of the two time periods--from April 1st through the 15th, 2015 and from April 16th through the 23rd, 2015--
    # and return those two sets of statistics in a single result set.
    table_taxi_obj.aggregate_group_by(
        column_names=["avg(passenger_count) as avg_pass_count",
                      "avg(trip_distance) as avg_trip_dist",
                      "min(trip_distance) as min_trip_dist",
                      "max(trip_distance) as max_trip_dist"],
        offset=0,
        limit=gpudb.GPUdb.END_OF_SET,
        options={"expression": "((pickup_datetime >= '2015-04-01') AND (pickup_datetime <= '2015-04-15 23:59:59.999'))",
                 "result_table": agg_grpby_union_all_src1,
                 "collection_name": collection}
    )
    table_taxi_obj.aggregate_group_by(
        column_names=["avg(passenger_count) as avg_pass_count",
                      "avg(trip_distance) as avg_trip_dist",
                      "min(trip_distance) as min_trip_dist",
                      "max(trip_distance) as max_trip_dist"],
        offset=0,
        limit=gpudb.GPUdb.END_OF_SET,
        options={"expression": "((pickup_datetime >= '2015-04-16') AND (pickup_datetime <= '2015-04-23 23:59:59.999'))",
                 "result_table": agg_grpby_union_all_src2,
                 "collection_name": collection}
    )
    gpudb.GPUdbTable.create_union(
        table_name=union_all_table,
        table_names=[agg_grpby_union_all_src1,
                     agg_grpby_union_all_src2],
        input_column_names=[
            ["'2015-04-01 - 2014-04-15'", "avg_pass_count", "avg_trip_dist", "min_trip_dist", "max_trip_dist"],
            ["'2015-04-16 - 2015-04-23'", "avg_pass_count", "avg_trip_dist", "min_trip_dist", "max_trip_dist"]
        ],
        output_column_names=["pickup_window_range",
                             "avg_pass_count",
                             "avg_trip",
                             "min_trip",
                             "max_trip"],
        options={"mode": "union_all",
                 "collection_name": collection},
        db=h_db
    )
    union_all_table_obj = gpudb.GPUdbTable(
        _type=None,
        name=union_all_table,
        db=h_db
    )
    u1_resp = union_all_table_obj.get_records_by_column(
        column_names=["pickup_window_range",
                      "avg_pass_count",
                      "avg_trip",
                      "min_trip",
                      "max_trip"],
        offset=0,
        limit=gpudb.GPUdb.END_OF_SET
    )
    print "Union All table:"
    print "Pickup Window Range     Avg. Pass. Count Avg. Trip     Min. Trip      Max Trip     "
    print "======================= ================ ============= ============== ============="
    for record in zip(
            u1_resp["pickup_window_range"],
            u1_resp["avg_pass_count"],
            u1_resp["avg_trip"],
            u1_resp["min_trip"],
            u1_resp["max_trip"]
    ):
        print "{:<23} {:<16} {:<13} {:<14} {:<13}".format(
            record[0],
            record[1],
            record[2],
            record[3],
            record[4]
        )
    print

    # Union Example 2 (Intersect)
    # Retrieve locations (as lat/lon pairs) that were both pick-up and drop-off points
    table_taxi_replicated_obj.create_projection(
        projection_name=projection_intersect_src,
        column_names=["pickup_latitude",
                      "pickup_longitude"],
        options={"expression": "((pickup_latitude > 0) AND (pickup_longitude <> 0))",
                 "collection_name": collection})
    gpudb.GPUdbTable.create_union(
        table_name=union_intersect_table,
        table_names=[projection_intersect_src,
                     table_taxi_replicated],
        input_column_names=[["pickup_latitude", "pickup_longitude"],
                            ["dropoff_latitude", "dropoff_longitude"]],
        output_column_names=["pickup_latitude", "pickup_longitude"],
        options={"mode": "intersect",
                 "collection_name": collection},
        db=h_db
    )
    union_intersect_table_obj = gpudb.GPUdbTable(
        _type=None,
        name=union_intersect_table,
        db=h_db
    )
    u2_resp = union_intersect_table_obj.get_records_by_column(
        column_names=["pickup_latitude", "pickup_longitude"],
        offset=0,
        limit=gpudb.GPUdb.END_OF_SET)
    print "Intersect table:"
    print "Pickup Latitude Pickup Longitude"
    print "=============== ================"
    for record in zip(u2_resp["pickup_latitude"], u2_resp["pickup_longitude"]):
        print "{:<15} {:<16}".format(
            record[0],
            record[1]
        )
    print

    # Union Example 3 (Except)
    # Show vendors that operate before noon, but not after noon: retrieve the unique list of IDs of vendors who
    # provided cab rides between midnight and noon, and remove from that list the IDs of any vendors who provided cab
    # rides between noon and midnight
    table_taxi_replicated_obj.create_projection(
        projection_name=projection_except_src1,
        column_names=["vendor_id"],
        options={"expression": "((HOUR(pickup_datetime) >= 0) AND (HOUR(pickup_datetime) <= 11))",
                 "collection_name": collection}
    )
    table_taxi_replicated_obj.create_projection(
        projection_name=projection_except_src2,
        column_names=["vendor_id"],
        options={"expression": "((HOUR(pickup_datetime) >= 12) AND (HOUR(pickup_datetime) <= 23))",
                 "collection_name": collection}
    )
    gpudb.GPUdbTable.create_union(
        table_name=union_except_table,
        table_names=[projection_except_src1,
                     projection_except_src2],
        input_column_names=[["vendor_id"],
                            ["vendor_id"]],
        output_column_names=["vendor_id"],
        options={"mode": "except",
                 "collection_name": collection},
        db=h_db
    )
    union_except_table_obj = gpudb.GPUdbTable(
        _type=None,
        name=union_except_table,
        db=h_db
    )
    u3_resp = union_except_table_obj.get_records_by_column(
        column_names=["vendor_id"],
        offset=0,
        limit=gpudb.GPUdb.END_OF_SET
    )
    print "Except table:"
    print "Vendor ID"
    print "========="
    for record in zip(u3_resp["vendor_id"]):
        print "{:<9}".format(record[0])

    print "\n"
    print "DELETING DATA"
    print "-------------"
    print

    delete_resp = table_payment_obj.filter(expression="payment_type = 'Cash'").size()
    print "Number of records that meet deletion criteria before deleting: {}".format(delete_resp)

    table_payment_obj.delete_records(expressions=["payment_type = 'Cash'"])

    delete_resp = table_payment_obj.filter(expression="payment_type = 'Cash'").size()
    print "Number of records that meet deletion criteria after deleting: {}".format(delete_resp)

    print "\n"
    print "CLEARING TABLES AND COLLECTIONS"
    print "-------------------------------"
    print

    h_db.clear_table(table_name=table_taxi)
    h_db.clear_table(table_name=table_taxi_replicated)
    h_db.clear_table(table_name=collection)

    print "Deleted tables: OK"
    print

# end gpudb_example()


if __name__ == "__main__":
    gpudb_example()

Sample Output

After running the above sample script, the following is output:

TUTORIAL OUTPUT
===============

CREATING TYPES & TABLES
-----------------------

Vendor Table
************
Vendor table successfully created

Payment Table
*************
Payment table successfully created

Taxi Table
**********
Taxi table object successfully created


INSERTING DATA
--------------

Number of single records inserted into the Vendor table: 1
Number of batch records inserted into the Vendor table: 7
Number of batch records inserted into the Payment table: 26


RETRIEVING DATA
---------------

Payment ID Payment Type Credit Type Payment Timestamp Fare Amount Surcharge MTA Tax Tip Amount Tolls Amount Total Amount
========== ============ =========== ================= =========== ========= ======= ========== ============ ============
150        None         None        1430432447000            7.50      0.00    0.50       0.00         0.00         8.30
181        None         None        1430135461000            6.50      0.50    0.50       0.00         0.00         7.80
119        None         None        1430431471000            9.50      0.00    0.50       0.00         0.00        10.30
176        Cash         None        1428403962000            9.00      0.50    0.50       2.06         0.00        12.36
109        Cash         None        1428948513000           22.50      0.50    0.50       4.75         0.00        28.55
180        Cash         None        1428965823000            6.50      0.50    0.50       1.00         0.00         8.80
134        Cash         None        1429472668000           33.50      0.50    0.50       0.00         0.00        34.80
148        Cash         None        1430124581000            9.50      0.00    0.50       1.00         0.00        11.30
114        Cash         None        1428259673000            5.50      0.00    0.50       1.89         0.00         8.19
132        Cash         None        1429472779000            6.50      0.50    0.50       1.55         0.00         9.35

Vendor ID Vendor Name                Phone       Email                         HQ Street                HQ City   HQ State  HQ Zip Code # Employees # Cabs
========= ========================== =========== ============================= ======================== ========= ========= =========== =========== ======
VTS       Vine Taxi Service          9998880001  admin@vtstaxi.com             26 Summit St.            Flushing  NY        11354       400         450
YCAB      Yes Cab                    7895444321  None                          97 Edgemont St.          Brooklyn  NY        11223       445         425
NYC       New York City Cabs         None        support@nyc-taxis.com         9669 East Bayport St.    Bronx     NY        10453       505         500
DDS       Dependable Driver Service  None        None                          8554 North Homestead St. Bronx     NY        10472       200         124
CMT       Crazy Manhattan Taxi       9778896500  admin@crazymanhattantaxi.com  950 4th Road Suite 78    Brooklyn  NY        11210       500         468
TNY       Taxi New York              None        None                          725 Squaw Creek St.      Bronx     NY        10458       315         305
NYMT      New York Metro Taxi        None        None                          4 East Jennings St.      Brooklyn  NY        11228       166         150
5BTC      Five Boroughs Taxi Co.     4566541278  mgmt@5btc.com                 9128 Lantern Street      Brooklyn  NY        11229       193         175


UPDATING & REMOVING RECORDS
---------------------------

Updated DDS vendor information:
Vendor ID Vendor Name               Phone      Email                        HQ Street                HQ City  HQ Zip Code # Employees # Cabs
========= ========================= ========== ============================ ======================== ======== =========== =========== ======
DDS       Dependable Driver Service None       management@ddstaxico.com     8554 North Homestead St. Bronx    10472       156         213

Records in the payment table (before delete): 26
Deleting record where payment_id = 189
Records in the payment table (after delete): 25


ALTER TABLE
-----------

Index added to vendor table: {u'type_id': u'', u'value': u'vendor_id', u'type_definition': u'', u'table_name': u'vendor', u'action': u'create_index', 'status_info': {u'status': u'OK', u'data_type': u'alter_table_response_avro', u'message': u'', 'response_time': 0.00292}, u'label': u'', u'properties': {}}

Index added to taxi_trip_data table: {u'type_id': u'', u'value': u'vendor_id', u'type_definition': u'', u'table_name': u'taxi_trip_data', u'action': u'create_index', 'status_info': {u'status': u'OK', u'data_type': u'alter_table_response_avro', u'message': u'', 'response_time': 0.00302}, u'label': u'', u'properties': {}}

Compression added to 'pickup_datetime' column: {u'type_id': u'', u'value': u'pickup_datetime', u'type_definition': u'', u'table_name': u'taxi_trip_data', u'action': u'set_column_compression', 'status_info': {u'status': u'OK', u'data_type': u'alter_table_response_avro', u'message': u'', 'response_time': 0.00302}, u'label': u'', u'properties': {}}

Compression added to 'dropoff_datetime' column: {u'type_id': u'', u'value': u'dropoff_datetime', u'type_definition': u'', u'table_name': u'taxi_trip_data', u'action': u'set_column_compression', 'status_info': {u'status': u'OK', u'data_type': u'alter_table_response_avro', u'message': u'', 'response_time': 0.0051}, u'label': u'', u'properties': {}}


FILTERING
---------

Number of null payments: 3
Number of null payments with a fare amount greater than $8.00 (with query chaining): 1
Number of records where vendor_id is either NYC or YCAB: 498
Number of trips with a passenger count between 1 and 3: 972


AGGREGATING, GROUPING, AND HISTOGRAMS
-------------------------------------

Statistics of values in the trip_distance column:
        Count: 1081
        Min: 0.00
        Mean: 2.86
        Max: 21.88

Unique vendor IDs in the taxi trip table:
        * CMT
        * DDS
        * LYFT
        * NYC
        * UBER
        * VTS
        * YCAB

Number of trips per vendor:
        LYFT:        3
        UBER:        3
        DDS:       106
        VTS:       229
        YCAB:      236
        CMT:       242
        NYC:       262

Passenger count groups by size:
        1s:  775
        2s:  145
        3s:   52
        4s:   22
        5s:   87


JOINS
-----

Inner join table:
Payment ID   Payment Type  Total Amount  Passenger Count  Vendor ID  Trip Distance
============ ============= ============= ================ ========== ==============
176          Cash          12.360000     5                NYC        1.220000
148          Cash          11.300000     6                NYC        1.170000
136          Cash          6.300000      5                NYC        2.310000

Left join table:
Transaction ID Pickup (in secs since Epoch) Trip Distance Vendor ID Vendor Name
============== ============================ ============= ========= ==================
169677687      1430431112000                6.50          YCAB      Yes Cab
109329823      1430431115000                1.20          YCAB      Yes Cab
132706957      1429035716000                1.02          NYC       New York City Cabs
107008976      1428880198000                1.66          NYC       New York City Cabs
164839979      1429471893000                1.02          NYC       New York City Cabs
192545327      1428403054000                1.22          NYC       New York City Cabs
195905599      1428716058000                12.03         NYC       New York City Cabs
129891539      1428948421000                0.51          NYC       New York City Cabs
149211614      1427981386000                1.50          YCAB      Yes Cab
104506140      1428674077000                1.40          YCAB      Yes Cab
181667689      1430135005000                3.03          UBER      None
186246812      1429413166000                6.10          LYFT      None
161650444      1428965054000                3.09          NYC       New York City Cabs
170101359      1428922783000                0.99          NYC       New York City Cabs
191386113      1428669361000                1.30          YCAB      Yes Cab
174771463      1429471891000                2.01          NYC       New York City Cabs
199751611      1428808559000                0.50          YCAB      Yes Cab
132760909      1430124124000                1.17          NYC       New York City Cabs
104118718      1428716058000                1.53          UBER      None
163836875      1428520490000                0.18          LYFT      None
155194734      1430135005000                1.08          NYC       New York City Cabs
111998195      1428259167000                2.10          YCAB      Yes Cab
112901023      1429799817000                2.37          NYC       New York City Cabs
174019433      1429471890000                2.56          NYC       New York City Cabs
190120416      1429413163000                1.49          LYFT      None
139294057      1428813585000                4.30          YCAB      Yes Cab
106041671      1429869346000                0.79          NYC       New York City Cabs
177106037      1430431112000                1.20          YCAB      Yes Cab
178732586      1430135005000                8.39          NYC       New York City Cabs
160856304      1428716062000                2.31          NYC       New York City Cabs
128530951      1428672482000                1.00          YCAB      Yes Cab
197243309      1430135005000                2.00          UBER      None

Outer join table:
Vend. Table Vendors Taxi Table Vendors Total Records
=================== ================== =============
5BTC                None               1
NYMT                None               1
TNY                 None               1
None                LYFT               3
None                UBER               3



PROJECTIONS
-----------

Projection of only credit payment types:
Payment ID Payment Type Credit Type      Timestamp     Fare  Surcharge MTA Tax Tip   Tolls Total
========== ============ ================ ============= ===== ========= ======= ===== ===== =====
176        Cash         None             1428403962000 9.00    0.50     0.50   2.06  0.00  12.36
198        Credit       MasterCard       1429472636000 9.00    0.00     0.50   0.00  0.00  9.80
150        None         None             1430432447000 7.50    0.00     0.50   0.00  0.00  8.30
170        No Charge    None             1430431502000 28.60   0.00     0.50   0.00  0.00  28.60
109        Cash         None             1428948513000 22.50   0.50     0.50   4.75  0.00  28.55
140        Credit       Visa             None          28.00   0.00     0.50   0.00  0.00  28.80
159        Credit       Visa             1428674487000 7.00    0.00     0.50   0.00  0.00  7.80
107        Credit       MasterCard       1428717377000 5.00    0.50     0.50   0.00  0.00  6.30
180        Cash         None             1428965823000 6.50    0.50     0.50   1.00  0.00  8.80
199        Credit       Visa             None          6.00    1.00     0.50   1.00  0.00  8.50
134        Cash         None             1429472668000 33.50   0.50     0.50   0.00  0.00  34.80
161        Credit       Visa             None          7.00    0.00     0.50   0.00  0.00  7.80
187        Credit       American Express 1428670181000 14.00   0.00     0.50   0.00  0.00  14.80
148        Cash         None             1430124581000 9.50    0.00     0.50   1.00  0.00  11.30
166        Credit       American Express 1428808723000 17.50   0.00     0.50   0.00  0.00  18.30
114        Cash         None             1428259673000 5.50    0.00     0.50   1.89  0.00  8.19
132        Cash         None             1429472779000 6.50    0.50     0.50   1.55  0.00  9.35
193        Cash         None             None          3.50    1.00     0.50   1.59  0.00  6.89
181        None         None             1430135461000 6.50    0.50     0.50   0.00  0.00  7.80
136        Cash         None             1428716521000 4.00    0.50     0.50   1.00  0.00  6.30
100        Cash         None             None          9.00    0.00     0.50   2.90  0.00  12.70
156        Credit       MasterCard       1428672753000 12.50   0.50     0.50   0.00  0.00  13.80
125        Credit       Discover         1429869673000 8.50    0.50     0.50   0.00  0.00  9.80
119        None         None             1430431471000 9.50    0.00     0.50   0.00  0.00  10.30
123        No Charge    None             1430136649000 20.00   0.50     0.50   0.00  0.00  21.30

Projection of long trips taken during lunch hours:
Hour of Day Vendor ID Passenger Count Trip Distance
=========== ========= =============== =============
12          NYC       1               5.07
14          NYC       6               9.02
13          NYC       1               5.38
14          YCAB      1               8.60
14          YCAB      2               17.00
12          NYC       3               12.01
13          NYC       1               21.88
11          NYC       5               7.26
13          YCAB      1               20.70
13          YCAB      1               7.10
14          YCAB      1               18.00
14          NYC       5               16.92
12          NYC       2               5.70
14          NYC       1               9.72
14          NYC       3               5.96
11          YCAB      1               6.00
13          DDS       1               12.50
12          VTS       1               9.65
13          VTS       3               12.56
11          VTS       2               11.43
11          VTS       1               6.33
14          CMT       1               9.90
14          CMT       1               8.20
12          CMT       1               9.80
11          CMT       1               16.70
12          CMT       1               9.50
11          NYC       1               8.39


UNIONS
------

Union All table:
Pickup Window Range     Avg. Pass. Count Avg. Trip     Min. Trip      Max Trip
======================= ================ ============= ============== =============
2015-04-01 - 2014-04-15 1.58052434457    3.05310860405 0.0            21.8799991608
2015-04-16 - 2015-04-23 1.75187969925    3.13285715217 0.379999995232 19.3999996185

Intersect table:
Pickup Latitude Pickup Longitude
=============== ================
40.7513122559   -73.9750518799
40.7210159302   -74.0053863525
41.3661384583   -73.1373901367
40.7258033752   -73.7250137329
40.7733917236   -73.9810638428
40.7639274597   -73.9019927979
40.7608184814   -73.9800567627
40.7504119873   -73.9906692505
40.758190155    -73.9373474121
40.7640647888   -73.9618606567
40.7149124146   -74.0074920654
40.7145347595   -73.942199707
40.6573753357   -73.7936782837
40.6492347717   -74.2553405762

Except table:
Vendor ID
=========
UBER


DELETING DATA
-------------

Number of records that meet deletion criteria before deleting: 10
Number of records that meet deletion criteria after deleting: 0


CLEARING TABLES AND COLLECTIONS
-------------------------------

Deleted tables: OK