Version:

Python Developer Manual

The following guide provides step-by-step instructions to get started writing Python applications using Kinetica. This guide demonstrates only a small set of the available API. A detailed description of the complete interface is available under Python API Reference.

Prerequisites

API Installation

The native Kinetica Python API is accessible through the following means:

  • For development on the Kinetica server:
  • For development not on the Kinetica server:

Kinetica RPM

In default Kinetica installations, the native Python API is located in the /opt/gpudb/api/python directory. The /opt/gpudb/bin/gpudb_python wrapper script is provided, which sets the execution environment appropriately.

Test the installation:

/opt/gpudb/bin/gpudb_python /opt/gpudb/api/python/examples/example.py

Important

When developing on the Kinetica server, use /opt/gpudb/bin/gpudb_python to run Python programs and /opt/gpudb/bin/gpudb_pip to install dependent libraries.

Git

Python 2.7 (or greater) is necessary for downloading the API from Github:

  1. In the desired directory, run the following but be sure to replace <kinetica-version> with the name of the installed Kinetica version, e.g., v6.2.0:

    git clone -b release/<kinetica-version> --single-branch https://github.com/kineticadb/kinetica-api-python.git
    
  2. Change directory into the newly downloaded repository:

    cd kinetica-api-python
    
  3. In the root directory of the unzipped repository, install the Kinetica API:

    sudo python setup.py install
    
  4. Test the installation:

    python examples/example.py
    

PyPI

The Python package manager, pip, is required to install the API from PyPI.

  1. Install the API:

    pip install gpudb --upgrade
    
  2. Test the installation:

    python -c "import gpudb;print('Import Successful')"
    

    If Import Successful is displayed, the API has been installed as is ready for use.

Data File

The tutorial script makes reference to a data file in the current directory. This path can be updated to point to a valid path on the host where the file will be located, or the script can be run with the data file in the current directory.

taxi_data = csv.reader(open('taxi_trip_data.csv'))

Connecting to the Database

To interact with Kinetica, you must first instantiate an object of the GPUdb class while providing the connection URL, including the host and port of the database server.

h_db = gpudb.GPUdb(
    encoding="BINARY",
    host="127.0.0.1",
    port="9191"
)

If the server is configured to authenticate users, the username and password will also need to be provided at the time of creating the object:

h_db = gpudb.GPUdb(
    encoding="BINARY",
    host="127.0.0.1",
    port="9191",
    username="user1",
    password="mysecret"
)

Creating a Type

Before a table can be created and any data can be loaded into the database, a type needs to be defined. The type is defined as a list of lists of strings. Each column comprises at least two values: a column name (always the first value) and a primitive type (always the second value). Any subsequent values are column properties. You can also use the GPUdbRecordColumn and GPUdbColumnProperty objects in addition to literal string values to help define a column. The following _ColumnType constants are available:

  • _ColumnType.INT
  • _ColumnType.LONG
  • _ColumnType.FLOAT
  • _ColumnType.DOUBLE
  • _ColumnType.STRING
  • _ColumnType.BYTES
# Create a type from a list of lists. Each list below is an individual
# column. Each column comprises at least two values: a column name (always
# the first value) and a primitive type (always the second value). Any
# subsequent values are column properties. The order of the columns defines
# the order in which values must be inserted into the table, e.g., a
# "vendor_name" value cannot be inserted before a "vendor_id" value
vendor_columns = [
    # column types and properties can be listed as strings
    ["vendor_id", "string", "char4", "primary_key"],
    ["vendor_name", "string", "char64"],
    ["phone", "string", "char16", "nullable"],
    ["email", "string", "char64", "nullable"],
    ["hq_street", "string", "char64"],
    # column types and properties can also be listed using the
    # GPUdbRecordColumn and GPUdbColumnProperty objects
    [
        "hq_city",
        gpudb.GPUdbRecordColumn._ColumnType.STRING,
        gpudb.GPUdbColumnProperty.CHAR8,
        gpudb.GPUdbColumnProperty.DICT
    ],
    [
        "hq_state",
        gpudb.GPUdbRecordColumn._ColumnType.STRING,
        gpudb.GPUdbColumnProperty.CHAR8,
        gpudb.GPUdbColumnProperty.DICT
    ],
    ["hq_zip", gpudb.GPUdbRecordColumn._ColumnType.INT],
    ["num_emps", gpudb.GPUdbRecordColumn._ColumnType.INT],
    ["num_cabs", gpudb.GPUdbRecordColumn._ColumnType.INT]
]

Important

The order of the columns defines the order in which values must be inserted into the table, e.g., a vendor_name value cannot be inserted before a vendor_id value

Creating a Table

After you've created a type list and stored it in a variable, you can create a table using the GPUdbTable interface. For this tutorial, three tables will be used:

  • A base table containing a sample of NYC taxi cab ride data
  • A lookup table containing sample cab vendor information (created in the example below)
  • A 1-to-1 relation table containing payment information for a subset of the cab-ride transactions
# Create the table from the type and place it in a collection
try:
    table_vendor_obj = gpudb.GPUdbTable(
        _type=vendor_columns,
        name=table_vendor,
        options={
            "collection_name": collection,
            "is_replicated": "true"
        },
        db=h_db
    )
    print "Vendor table successfully created"
except gpudb.GPUdbException as e:
    print "Vendor table creation failure: {}".format(str(e))

Important

At this point in the tutorial, a Payment and Taxi Trip Data type and table are created.

To create a table object for a table that already exists, pass in a _type of None:

gpudb.GPUdbTable.merge_records(
    table_name=table_taxi_replicated, source_table_names=[table_taxi],
    field_maps=[{
        "transaction_id": "transaction_id",
        "payment_id": "payment_id",
        "vendor_id": "vendor_id",
        "pickup_datetime": "pickup_datetime",
        "dropoff_datetime": "dropoff_datetime",
        "passenger_count": "passenger_count",
        "trip_distance": "trip_distance",
        "pickup_longitude": "pickup_longitude",
        "pickup_latitude": "pickup_latitude",
        "dropoff_longitude": "dropoff_longitude",
        "dropoff_latitude": "dropoff_latitude"
    }],
    options={
        "collection_name": collection,
        "is_replicated": "true"
    },
    db=h_db
)

To pass in table creation options, you can either specify a map (like in the Vendor table) or you can use the GPUdbTableOptions object, as follows:

# Create an "options" object containing the table creation options
collection_option_object = gpudb.GPUdbTableOptions.default().collection_name(collection)

Inserting Data

You can insert single records or multiple records using the insert_records method. As mentioned previously, the number and order of values must match the column declarations found in the type creation step. In Python, you can pass a value of None for null values, but the column must be declared nullable when creating the type schema.

Note

The following methods are not the only way to insert records, but they are the simplest.

Key/Value Record

In this example, a dictionary object will be used to map key/value pairs to table column values.

payment_datum = collections.OrderedDict()
payment_datum["payment_id"] = 189
payment_datum["payment_type"] = "No Charge"
payment_datum["credit_type"] = None
payment_datum["payment_timestamp"] = None
payment_datum["fare_amount"] = 6.5
payment_datum["surcharge"] = 0
payment_datum["mta_tax"] = 0.6
payment_datum["tip_amount"] = 0
payment_datum["tolls_amount"] = 0
payment_datum["total_amount"] = 7.1

# Insert the record into the table
table_payment_obj.insert_records(payment_datum)
print "Number of records inserted into the Payment table: {}".format(
    table_payment_obj.size()
)

In-line Record

In these examples, records will be defined in-line and added to a list.

# Create a list of in-line records. The order of the values must match the
# column order in the type
vendor_records = [
    ["VTS", "Vine Taxi Service", "9998880001", "admin@vtstaxi.com",
     "26 Summit St.", "Flushing", "NY", 11354, 450, 400],
    ["YCAB", "Yes Cab", "7895444321", None, "97 Edgemont St.", "Brooklyn",
     "NY", 11223, 445, 425],
    ["NYC", "New York City Cabs", None, "support@nyc-taxis.com",
     "9669 East Bayport St.", "Bronx", "NY", 10453, 505, 500],
    ["DDS", "Dependable Driver Service", None, None,
        "8554 North Homestead St.", "Bronx", "NY", 10472, 200, 124],
    ["CMT", "Crazy Manhattan Taxi", "9778896500",
     "admin@crazymanhattantaxi.com", "950 4th Road Suite 78", "Brooklyn",
     "NY", 11210, 500, 468],
    ["TNY", "Taxi New York", None, None, "725 Squaw Creek St.", "Bronx",
     "NY", 10458, 315, 305],
    ["NYMT", "New York Metro Taxi", None, None, "4 East Jennings St.",
     "Brooklyn", "NY", 11228, 166, 150],
    ["5BTC", "Five Boroughs Taxi Co.", "4566541278", "mgmt@5btc.com",
     "9128 Lantern Street", "Brooklyn", "NY", 11229, 193, 175]
]

# Insert the records into the Vendor table
table_vendor_obj.insert_records(vendor_records)

print "Number of records inserted into the Vendor table: {}".format(
    table_vendor_obj.size()
)
# Create another list of in-line records
payment_records = [
    [136, "Cash", None, 1428716521000, 4, 0.5, 0.5, 1, 0, 6.3],
    [148, "Cash", None, 1430124581000, 9.5, 0, 0.5, 1, 0, 11.3],
    [114, "Cash", None, 1428259673000, 5.5, 0, 0.5, 1.89, 0, 8.19],
    [180, "Cash", None, 1428965823000, 6.5, 0.5, 0.5, 1, 0, 8.8],
    [109, "Cash", None, 1428948513000, 22.5, 0.5, 0.5, 4.75, 0, 28.55],
    [132, "Cash", None, 1429472779000, 6.5, 0.5, 0.5, 1.55, 0, 9.35],
    [134, "Cash", None, 1429472668000, 33.5, 0.5, 0.5, 0, 0, 34.8],
    [176, "Cash", None, 1428403962000, 9, 0.5, 0.5, 2.06, 0, 12.36],
    [100, "Cash", None, None, 9, 0, 0.5, 2.9, 0, 12.7],
    [193, "Cash", None, None, 3.5, 1, 0.5, 1.59, 0, 6.89],
    [140, "Credit", "Visa", None, 28, 0, 0.5, 0, 0, 28.8],
    [161, "Credit", "Visa", None, 7, 0, 0.5, 0, 0, 7.8],
    [199, "Credit", "Visa", None, 6, 1, 0.5, 1, 0, 8.5],
    [159, "Credit", "Visa", 1428674487000, 7, 0, 0.5, 0, 0, 7.8],
    [156, "Credit", "MasterCard", 1428672753000, 12.5, 0.5, 0.5, 0, 0, 13.8],
    [198, "Credit", "MasterCard", 1429472636000, 9, 0, 0.5, 0, 0, 9.8],
    [107, "Credit", "MasterCard", 1428717377000, 5, 0.5, 0.5, 0, 0, 6.3],
    [166, "Credit", "American Express", 1428808723000, 17.5, 0, 0.5, 0, 0, 18.3],
    [187, "Credit", "American Express", 1428670181000, 14, 0, 0.5, 0, 0, 14.8],
    [125, "Credit", "Discover", 1429869673000, 8.5, 0.5, 0.5, 0, 0, 9.8],
    [119, None, None, 1430431471000, 9.5, 0, 0.5, 0, 0, 10.3],
    [150, None, None, 1430432447000, 7.5, 0, 0.5, 0, 0, 8.3],
    [170, "No Charge", None, 1430431502000, 28.6, 0, 0.5, 0, 0, 28.6],
    [123, "No Charge", None, 1430136649000, 20, 0.5, 0.5, 0, 0, 21.3],
    [181, None, None, 1430135461000, 6.5, 0.5, 0.5, 0, 0, 7.8]
]

# Insert the records into the Payment table
for record in payment_records:
    table_payment_obj.insert_records(record)

print "Number of records inserted into the Payment table: {}".format(
    table_payment_obj.size() - payment_table_size_after_first_insert
)

CSV

This example requires the csv Python module but allows for importing a large amount of records with ease. After setting a CSV reader, you can loop over all values in a .csv file, append the values to lists of a list, then insert the list.

taxi_data = csv.reader(open('taxi_trip_data.csv'))
taxi_data.next()
taxi_records = []
for record in taxi_data:
    record_data = []
    record_data.append(long(record[0]))
    record_data.append(long(record[1]))
    record_data.append(record[2])
    record_data.append(long(record[3]))
    record_data.append(long(record[4]))
    record_data.append(int(record[5]))
    record_data.append(float(record[6]))
    record_data.append(float(record[7]))
    record_data.append(float(record[8]))
    record_data.append(float(record[9]))
    record_data.append(float(record[10]))
    taxi_records.append(record_data)
table_taxi_obj.insert_records(taxi_records)

Retrieving Data

Once the table is populated with data, the data can be retrieved using binary encoding or JSON encoding. Binary encoding is more efficient than JSON encoding. The example below uses the GPUdbTable interface and binary encoding:

# Retrieve no more than 10 records from the Payment table using the
# GPUdbTable interface
print "{:<10s} {:<12s} {:<11s} {:<17s} {:<11s} {:<9s} {:<7s} {:<10s} " \
    "{:<12s} {:<12s}".format(
        "Payment ID",
        "Payment Type",
        "Credit Type",
        "Payment Timestamp",
        "Fare Amount",
        "Surcharge",
        "MTA Tax",
        "Tip Amount",
        "Tolls Amount",
        "Total Amount"
)
print "{:=<10s} {:=<12s} {:=<11s} {:=<17s} {:=<11s} {:=<9s} {:=<7s} {:=<10s} " \
    "{:=<12s} {:=<12s}".format(
        "", "", "", "", "", "", "", "", "", ""
)
for record in table_payment_obj.get_records(
    offset=0,
    limit=10,
    encoding="binary",
    options={"sort_by": "payment_id"}
):
    print "{payment_id:<10d} {payment_type:<12s} {credit_type:<11s} " \
        "{payment_timestamp:<17} {fare_amount:11.2f} {surcharge:9.2f} " \
        "{mta_tax:7.2f} {tip_amount:10.2f} {tolls_amount:12.2f} " \
        "{total_amount:12.2f}".format(
            payment_id=record["payment_id"],
            payment_type=record["payment_type"],
            credit_type=record["credit_type"],
            payment_timestamp=record["payment_timestamp"],
            fare_amount=record["fare_amount"],
            surcharge=record["surcharge"],
            mta_tax=record["mta_tax"],
            tip_amount=record["tip_amount"],
            tolls_amount=record["tolls_amount"],
            total_amount=record["total_amount"]
    )

This example demonstrates JSON encoding record retrieval using the GPUdb interface:

# Retrieve all records from the Vendor table using the GPUdb interface
print "{:<9s} {:<26s} {:<11s} {:<29s} {:<24s} {:<9s} {:<9s} {:<11s} {:<11s} {:<6s}".format(
        "Vendor ID",
        "Vendor Name",
        "Phone",
        "Email",
        "HQ Street",
        "HQ City",
        "HQ State",
        "HQ Zip Code",
        "# Employees",
        "# Cabs"
)
print "{:=<9s} {:=<26s} {:=<11s} {:=<29s} {:=<24s} {:=<9s} {:=<9s} " \
    "{:=<11s} {:=<11s} {:=<6s}".format(
        "", "", "", "", "", "", "", "", "", ""
)
vendor_records_gpudb = h_db.get_records(
    table_name=table_vendor,
    offset=0,
    limit=gpudb.GPUdb.END_OF_SET,
    encoding="json",
    options={"sort_by": "vendor_id"}
)["records_json"]
for record in vendor_records_gpudb:
    print "{vendor_id:<9s} {vendor_name:<26s} {phone:<11} {email:<29s} " \
        "{hq_street:<24s} {hq_city:<9s} {hq_state:<9s} {hq_zip:<11d} " \
        "{num_emps:11d} {num_cabs:6d}".format(
            **json.loads(record)
    )

For large tables, the data can be easily be retrieved in smaller blocks by using the offset and limit parameters. The returned response also contains the schema (or data type) of the results.

Updating and Removing Records

Using any GPUdbTable object, you can update records via the update_records method.

# Update the e-mail of, and add two employees and one cab to, the DDS vendor
table_vendor_obj.update_records(
    expressions=["vendor_id = 'DDS'"],
    new_values_maps={
        "email": "'management@ddstaxico.com'",
        "num_emps": "num_emps + 2",
        "num_cabs": "num_cabs + 1"
    },
    options={"use_expressions_in_new_values_maps":"true"}
)

You can delete records using the delete_records method.

# Delete payment 189
pre_delete = table_payment_obj.size()
print "Records in the payment table (before delete): {}".format(pre_delete)
delete_expr = ["payment_id = 189"]
print "Deleting record where " + delete_expr[0]
table_payment_obj.delete_records(expressions=delete_expr)
post_delete = table_payment_obj.size()
print "Records in the payment table (after delete): {}".format(post_delete)

Alter Table

Some properties can be altered or added after table creation, including indexes, dictionary encoding, and compression.

Indexes

Using the alter_table method, you can create indexes on columns using the create_index action paired with a column name.

# Add column indexes on:
#   - payment table, fare_amount (for query-chaining filter example)
#   - taxi table, passenger_count (for filter-by-range example) */
table_payment_obj.alter_table(action="create_index", value="fare_amount")
table_taxi_obj.alter_table(action="create_index", value="passenger_count")

Compression

Applying column compression works similarly: using the alter_table method but with a set_column_compression action paired with a column name and compression type option.

# Apply the snappy compression algorithm to the pickup and dropoff datetime columns
snappy_compression = {"compression_type": "snappy"}
table_taxi_obj.alter_table(
    action="set_column_compression",
    value="pickup_datetime",
    options=snappy_compression
)

dropoff_compress_resp = table_taxi_obj.alter_table(
    action="set_column_compression",
    value="dropoff_datetime",
    options=snappy_compression
)

Important

Column compression is applied at a fixed interval, so be sure to verify later that the compression has been added. Column usage should decrease by roughly 23% (~1989 bytes)

Dictionary Encoding

Applying dictionary encoding via alter_table involves adding a new property to a column.

# Apply dictionary encoding to the vendor_id column
table_taxi_obj.alter_table(
    action="change_column",
    value="vendor_id",
    options={"column_properties": "char4,dict"}
)

Important

To add a new property, all existing column properties must be listed along with any new property(ies)

Filters

Filters are an easy way to reduce larger table into more concise views using expressions.

Filter Example 1

# Select all payments with no corresponding payment type; allow Kinetica
# to assign a random name to the view
collection_option = {"collection_name": collection}
f1_count = table_payment_obj.filter(
    expression="IS_NULL(payment_type)",
    options=collection_option
).size()

print "Number of null payments: {}".format(f1_count)

Filter Example 2

# Using GPUdbTable query chaining, filter null payment type records with a
# fare amount greater than 8
f2_count = table_payment_obj.filter(
    view_name=view_example1,
    expression="IS_NULL(payment_type)",
    options=collection_option
).filter(
    view_name=view_example2,
    expression="fare_amount > 8",
    options=collection_option
).size()

print "Number of null payments with a fare amount greater than $8.00 " \
    "(with query chaining): {}".format(f2_count)

Filter Example 3

# Filter by list where vendor ID is either NYC or YCAB
f3_count = table_taxi_obj.filter_by_list(
    view_name=view_example3,
    column_values_map={"vendor_id": ["NYC", "YCAB"]}
).size()

print "Number of records where vendor_id is either NYC or YCAB: {}".format(f3_count)

Filter Example 4

# Filter by range trip with passenger count between 1 and 3
f4_count = table_taxi_obj.filter_by_range(
    view_name=view_example4,
    column_name="passenger_count",
    lower_bound=1,
    upper_bound=3
).size()

Aggregates

Kinetica supports various aggregate and group-by queries, which group and aggregate your data to return counts and useful statistics.

Aggregate Example 1

# Aggregate count, min, mean, and max on the trip distance
a1_resp = table_taxi_obj.aggregate_statistics(
    column_name="trip_distance",
    stats="count,min,max,mean"
)

print "Statistics of values in the trip_distance column:"
print "\tCount: {count:.0f}\n\tMin: {min:4.2f}\n\tMean: {mean:4.2f}\n\t" \
    "Max: {max:4.2f}\n".format(**a1_resp["stats"])

Aggregate Example 2

# Find unique taxi vendor IDs
a2_resp = table_taxi_obj.aggregate_unique(
    column_name="vendor_id",
    offset=0,
    limit=gpudb.GPUdb.END_OF_SET,
    encoding="json"
)["data"]["vendor_id"]

print "Unique vendor IDs in the Taxi table:"
for vendor in a2_resp:
    print "\t* {}".format(vendor)

Aggregate Example 3

# Find number of trips per vendor
a3_resp = table_taxi_obj.aggregate_group_by(
    column_names=["vendor_id", "count(vendor_id)"],
    offset=0,
    limit=gpudb.GPUdb.END_OF_SET,
    options={"sort_by": "key"}
)["data"]

print "Number of trips per vendor:"
for vendor in zip(a3_resp["vendor_id"], a3_resp["count(vendor_id)"]):
    print "\t{:<10s}{:4d}".format(vendor[0] + ":", vendor[1])

Aggregate Example 4

# Create a histogram for the different groups of passenger counts
a4_resp = table_taxi_obj.aggregate_histogram(
    column_name="passenger_count",
    start=1,
    end=6,
    interval=1
)["counts"]

print "Passenger count groups by size:"
print "{:<10s} {:<11s}".format("Passengers", "Total Trips")
print "{:=<10s} {:=<11s}".format("", "")

for histo_group in zip([1, 2, 3, 4, '>5'], a4_resp):
    print "{:>10} {:11.0f}".format(*histo_group)

Joins

Joins allow you to link multiple tables together, along their relations, retrieving associated information from any or all of them. Tables can only be joined if they're sharded similarly or replicated. Note that the create_join_table method is static and is not called with a corresponding GPUdbTable object.

An inner join returns only records that have matching values in both tables.

# Retrieve payment information for rides having more than three passengers
gpudb.GPUdbTable.create_join_table(
    join_table_name=join_table_inner,
    table_names=[
        "taxi_trip_data as t",
        "payment as p"
    ],
    column_names=[
        "t.payment_id",
        "payment_type",
        "total_amount",
        "passenger_count",
        "vendor_id",
        "trip_distance"
    ],
    expressions=[
        "t.payment_id = p.payment_id",
        "passenger_count > 3"
    ],
    options=collection_option,
    db=h_db
)

A left join returns all of the records an inner join does, but additionally, for each record in the table on the left side of the join that has no match along the relation to a record in the table on the right side of the join, a corresponding record will be returned with "left-side" columns populated with the "left-side" record data and the "right-side" columns populated with nulls.

# Retrieve cab ride transactions and the full name of the associated vendor
# (if available--blank if vendor name is unknown) for transactions with
# associated payment data, sorting by increasing values of transaction ID.
gpudb.GPUdbTable.create_join_table(
    join_table_name=join_table_left,
    table_names=[
        "taxi_trip_data as t",
        "vendor as v"
    ],
    column_names=[
        "transaction_id",
        "pickup_datetime",
        "trip_distance",
        "t.vendor_id",
        "vendor_name"
    ],
    expressions=[
        "left join t, v on (t.vendor_id = v.vendor_id)",
        "payment_id <> 0"
    ],
    options=collection_option,
    db=h_db
)

Note

Full outer joins require both tables to be replicated or joined on their shard keys. Set merges that perform deduplication of records, like Union Distinct, Intersect, and Except also need to use replicated tables to ensure the correct results, so a replicated version of the taxi (taxi_trip_data_replicated) table and a corresponding table object are created at this point in the tutorial.

gpudb.GPUdbTable.merge_records(
    table_name=table_taxi_replicated, source_table_names=[table_taxi],
    field_maps=[{
        "transaction_id": "transaction_id",
        "payment_id": "payment_id",
        "vendor_id": "vendor_id",
        "pickup_datetime": "pickup_datetime",
        "dropoff_datetime": "dropoff_datetime",
        "passenger_count": "passenger_count",
        "trip_distance": "trip_distance",
        "pickup_longitude": "pickup_longitude",
        "pickup_latitude": "pickup_latitude",
        "dropoff_longitude": "dropoff_longitude",
        "dropoff_latitude": "dropoff_latitude"
    }],
    options={
        "collection_name": collection,
        "is_replicated": "true"
    },
    db=h_db
)

A full outer join returns all of the records a left join does, but additionally, for each record in the table on the right side of the join that has no match along the relation to a record in the table on the left side of the join, a corresponding record will be returned with "right-side" columns populated with the "right-side" record data and the "left-side" columns populated with nulls.

# Retrieve the vendor IDs of known vendors with no recorded cab ride
# transactions, as well as the vendor ID and number of transactions for
# unknown vendors with recorded cab ride transactions
gpudb.GPUdbTable.create_join_table(
    join_table_name=join_table_outer,
    table_names=[
        "taxi_trip_data_replicated as t",
        "vendor as v"
    ],
    column_names=[
        "t.vendor_id as vendor_id",
        "v.vendor_id as vendor_id_1"
    ],
    expressions=["full_outer join t,v on ((v.vendor_id = t.vendor_id))"],
    options=collection_option,
    db=h_db
)

Projections

You can create projections using the create_projection method.

# Create a projection containing all payments by credit card
table_payment_obj.create_projection(
    projection_name=projection_example1,
    column_names=[
        "payment_id",
        "payment_type",
        "credit_type",
        "payment_timestamp",
        "fare_amount",
        "surcharge",
        "mta_tax",
        "tip_amount",
        "tolls_amount",
        "total_amount"
    ],
    options={
        "collection_name": collection,
        "expression": "payment_type = 'Credit'"
    }
)

To persist a projection:

# Create a persisted table with cab ride transactions greater than 5 miles
# whose trip started during lunch hours
table_taxi_obj.create_projection(
    projection_name=projection_example2,
    column_names=[
        "hour(pickup_datetime) as hour_of_day",
        "vendor_id",
        "passenger_count",
        "trip_distance"
    ],
    options={
        "expression":
            "(hour(pickup_datetime) >= 11) AND "
            "(hour(pickup_datetime) <= 14) AND "
            "(trip_distance > 5)",
        "collection_name": collection,
        "persist": "true"
    }
)

Union, Intersect, and Except

Union can be used to combine homogeneous data sets into one larger data set. Union & Union Distinct will both combine data sets but only retain the records that are unique across the chosen columns, removing all duplicates. Union All will combine data sets, retaining all records from the source data sets.

# Calculate the average number of passengers, as well as the shortest,
# average, and longest trips for all trips in
# each of the two time periods--from April 1st through the 15th, 2015 and
# from April 16th through the 23rd, 2015--and return those two sets of
# statistics in a single result set.
table_taxi_obj.aggregate_group_by(
    column_names=[
        "avg(passenger_count) as avg_pass_count",
        "avg(trip_distance) as avg_trip_dist",
        "min(trip_distance) as min_trip_dist",
        "max(trip_distance) as max_trip_dist"
    ],
    offset=0,
    limit=gpudb.GPUdb.END_OF_SET,
    options={
        "expression":
            "(pickup_datetime >= '2015-04-01') AND "
            "(pickup_datetime <= '2015-04-15 23:59:59.999')",
        "result_table": agg_grpby_union_all_src1,
        "collection_name": collection
    }
)
table_taxi_obj.aggregate_group_by(
    column_names=[
        "avg(passenger_count) as avg_pass_count",
        "avg(trip_distance) as avg_trip_dist",
        "min(trip_distance) as min_trip_dist",
        "max(trip_distance) as max_trip_dist"
    ],
    offset=0,
    limit=gpudb.GPUdb.END_OF_SET,
    options={
        "expression":
            "(pickup_datetime >= '2015-04-16') AND "
            "(pickup_datetime <= '2015-04-23 23:59:59.999')",
        "result_table": agg_grpby_union_all_src2,
        "collection_name": collection
    }
)
gpudb.GPUdbTable.create_union(
    table_name=union_all_table,
    table_names=[
        agg_grpby_union_all_src1,
        agg_grpby_union_all_src2
    ],
    input_column_names=[
        ["'2015-04-01 - 2014-04-15'", "avg_pass_count",
            "avg_trip_dist", "min_trip_dist", "max_trip_dist"],
        ["'2015-04-16 - 2015-04-23'", "avg_pass_count",
            "avg_trip_dist", "min_trip_dist", "max_trip_dist"]
    ],
    output_column_names=[
        "pickup_window_range",
        "avg_pass_count",
        "avg_trip",
        "min_trip",
        "max_trip"
    ],
    options={"mode": "union_all", "collection_name": collection},
    db=h_db
)

Intersect will combine data sets but only include the records found in both data sets, removing duplicate result records.

# Retrieve locations (as lat/lon pairs) that were both pick-up and
# drop-off points
gpudb.GPUdbTable.create_union(
    table_name=union_intersect_table,
    table_names=[
        table_taxi_replicated,
        table_taxi_replicated
    ],
    input_column_names=[
        ["pickup_latitude", "pickup_longitude"],
        ["dropoff_latitude", "dropoff_longitude"]
    ],
    output_column_names=["latitude", "longitude"],
    options={"mode": "intersect", "collection_name": collection},
    db=h_db
)

Except will return records that appear in the first data set but not the second data set. Note that the data sets on each side of the Except will have duplicates removed first, and then the set subtraction will be processed.

# Show vendors that operate before noon, but not after noon: retrieve the
# unique list of IDs of vendors who provided cab rides between midnight
# and noon, and remove from that list the IDs of any vendors who provided
# cab rides between noon and midnight
table_taxi_replicated_obj.create_projection(
    projection_name=projection_except_src1,
    column_names=["vendor_id"],
    options={
        "expression":
            "(HOUR(pickup_datetime) >= 0) AND "
            "(HOUR(pickup_datetime) <= 11)",
        "collection_name": collection
    }
)
table_taxi_replicated_obj.create_projection(
    projection_name=projection_except_src2,
    column_names=["vendor_id"],
    options={
        "expression":
            "(HOUR(pickup_datetime) >= 12) AND "
            "(HOUR(pickup_datetime) <= 23)",
        "collection_name": collection
    }
)
gpudb.GPUdbTable.create_union(
    table_name=union_except_table,
    table_names=[
        projection_except_src1,
        projection_except_src2
    ],
    input_column_names=[
        ["vendor_id"],
        ["vendor_id"]
    ],
    output_column_names=["vendor_id"],
    options={"mode": "except", "collection_name": collection},
    db=h_db
)

Deleting Records

You can delete records from a table using filter expressions. This method allows you to specify multiple filter expressions--note that each expression is used to delete records independently from the others (i.e., a record only needs to meet any one expression's criteria to be deleted from the table).

table_payment_obj.delete_records(expressions=["payment_type = 'Cash'"])

Download & Run

Included below is a complete example containing all the above requests, the data file, and output.

To run the complete sample, ensure the python_tutorial.py and taxi_trip_data.csv are in the same directory; then switch to that directory and run:

python python_tutorial.py