The following guide provides step-by-step instructions for getting started writing Python applications using Kinetica. This guide demonstrates only a small subset of the available API; a detailed description of the complete interface is available under Python API Reference.
The following .csv file must be ingested into the database to run the sample program:
Requirements
Note
The locations of python and pip should be placed in the PATH environment variable. If they are not, you'll need to use the full path to the python and pip executables in the instructions below. Also, administrative access will most likely be required when installing the Python packages.
In the desired directory, run the following, being sure to replace kinetica-version with the correct branch name of the installed Kinetica version, e.g., release/v6.2.0:
git clone -b [kinetica-version] --single-branch https://github.com/kineticadb/kinetica-api-python.git
Change directory into the newly downloaded repository
In the root directory of the cloned repository, install the Kinetica API and Avro packages:
pip install .
pip install avro
Test the installation:
cd example
python example.py
Update the repo as desired:
git pull
In the desired directory, run the following:
pip install gpudb
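To quickly confirm the package installed correctly, you can import it from a Python prompt; a minimal check, assuming the same Python 2 environment used for the examples below:
# Verify that the gpudb package is importable
import gpudb
print "gpudb package imported successfully"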
To interact with Kinetica, you must first instantiate an object of the GPUdb
class while providing the connection URL, including the host and port of the
database server.
h_db = gpudb.GPUdb(
encoding="BINARY",
host="127.0.0.1",
port="9191"
)
If the server is configured to authenticate users, the username and password will also need to be provided at the time of creating the object:
h_db = gpudb.GPUdb(
encoding="BINARY",
host="127.0.0.1",
port="9191",
username="user1",
password="mysecret"
)
Before a table can be created and any data can be loaded into the database, a type needs to be defined. The type is defined as a list of lists of strings. Each column comprises at least two values: a column name (always the first value) and a primitive type (always the second value). Any subsequent values are column properties. You can also use the GPUdbRecordColumn and GPUdbColumnProperty objects in addition to literal string values to help define a column. The following _ColumnType constants are available:
_ColumnType.INT
_ColumnType.LONG
_ColumnType.FLOAT
_ColumnType.DOUBLE
_ColumnType.STRING
_ColumnType.BYTES
# Create a type from a list of lists. Each list below is an individual column. Each column comprises at
# least two values: a column name (always the first value) and a primitive type (always the second value). Any
# subsequent values are column properties. The order of the columns defines the order in which values must
# be inserted into the table, e.g., a "vendor_name" value cannot be inserted before a "vendor_id" value
vendor_columns = [
# column types and properties can be listed as strings
["vendor_id", "string", "char4", "primary_key"],
["vendor_name", "string", "char64"],
["phone", "string", "char16", "nullable"],
["email", "string", "char64", "nullable"],
["hq_street", "string", "char64"],
# column types and properties can also be listed using the GPUdbRecordColumn and GPUdbColumnProperty objects
["hq_city", gpudb.GPUdbRecordColumn._ColumnType.STRING, gpudb.GPUdbColumnProperty.CHAR8,
gpudb.GPUdbColumnProperty.DICT],
["hq_state", gpudb.GPUdbRecordColumn._ColumnType.STRING, gpudb.GPUdbColumnProperty.CHAR8,
gpudb.GPUdbColumnProperty.DICT],
["hq_zip", gpudb.GPUdbRecordColumn._ColumnType.INT],
["num_emps", gpudb.GPUdbRecordColumn._ColumnType.INT],
["num_cabs", gpudb.GPUdbRecordColumn._ColumnType.INT]
]
Important
The order of the columns defines the order in which values must be inserted into the table, e.g., a vendor_name value cannot be inserted before a vendor_id value.
After you've created a type list and stored it in a variable, you can create a table using the GPUdbTable interface. For this tutorial, three tables will be used:
# Create the table from the type and place it in a collection
try:
    table_vendor_obj = gpudb.GPUdbTable(
        _type=vendor_columns,
        name=table_vendor,
        options={"collection_name": collection,
                 "is_replicated": "true"},
        db=h_db
    )
    print "Vendor table successfully created"
except gpudb.GPUdbException as e:
    print "Vendor table creation failure: {}".format(str(e))
print
To create a table object for a table that already exists, pass in a _type of None:
my_table_object = gpudb.GPUdbTable(_type=None, name=my_existing_table, db=h_db)
To pass in table creation options, you can either specify a map (as with the Vendor table) or use the GPUdbTableOptions object. There are two ways to use the GPUdbTableOptions object:
# Instantiate the GPUdbTableOptions object using a variable, then chain options together using the variable
options = gpudb.GPUdbTableOptions()
payment_options = options.collection_name("collection_name").is_replicated("true")
# Use the object directly
options = gpudb.GPUdbTableOptions.default().collection_name("collection_name").is_replicated("true")
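The resulting options object can then be passed directly to the GPUdbTable constructor, as the complete sample program below does when creating the Payment table (the payment_columns type list and table_payment name are defined there):
# Create the table using the previously built options object
table_payment_obj = gpudb.GPUdbTable(
    payment_columns,
    table_payment,
    options,
    h_db
)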
Important
At this point in the tutorial, a Payment type and table are created. A GPUdbTable object is also created for the taxi_trip_data table you already ingested. Some of the columns being created are specified as DICT, which will dictionary-encode those columns. Read more about dictionary encoding here.
You can insert single records or multiple records using the insert_records method. As mentioned previously, the number and order of values must match the column declarations found in the type creation step. In Python, you can pass a value of None for null values, but the column must be declared nullable when creating the type schema.
Note
The following methods are not the only way to insert records, but they are the simplest.
In this example, a dictionary object will be used to map key/value pairs to table column values.
vendor_datum = collections.OrderedDict()
vendor_datum["vendor_id"] = "VTS"
vendor_datum["vendor_name"] = "Vine Taxi Service"
vendor_datum["phone"] = "9998880001"
vendor_datum["email"] = "admin@vtstaxi.com"
vendor_datum["hq_street"] = "26 Summit St."
vendor_datum["hq_city"] = "Flushing"
vendor_datum["hq_state"] = "NY"
vendor_datum["hq_zip"] = 11354
vendor_datum["num_cabs"] = 450
vendor_datum["num_emps"] = 400
# Insert the record into the table
table_vendor_obj.insert_records(vendor_datum)
print "Number of single records inserted into the Vendor table: {}".format(table_vendor_obj.size())
In these examples, records will be defined in-line and added to a list.
vendor_records = [
["YCAB", "Yes Cab", "7895444321", None, "97 Edgemont St.", "Brooklyn", "NY", 11223, 445, 425],
["NYC", "New York City Cabs", None, "support@nyc-taxis.com", "9669 East Bayport St.", "Bronx", "NY", 10453, 505,
500],
["DDS", "Dependable Driver Service", None, None, "8554 North Homestead St.", "Bronx", "NY", 10472, 200, 124],
["CMT", "Crazy Manhattan Taxi", "9778896500", "admin@crazymanhattantaxi.com", "950 4th Road Suite 78",
"Brooklyn", "NY", 11210, 500, 468],
["TNY", "Taxi New York", None, None, "725 Squaw Creek St.", "Bronx", "NY", 10458, 315, 305],
["NYMT", "New York Metro Taxi", None, None, "4 East Jennings St.", "Brooklyn", "NY", 11228, 166, 150],
["5BTC", "Five Boroughs Taxi Co.", "4566541278", "mgmt@5btc.com", "9128 Lantern Street", "Brooklyn", "NY",
11229, 193, 175]
]
# Insert the records into the Vendor table
table_vendor_obj.insert_records(vendor_records)
print "Number of batch records inserted into the Vendor table: {}".format(table_vendor_obj.size() - 1)
# Create another list of in-line records
payment_records = [
[189, "No Charge", None, None, 6.5, 0, 0.6, 0, 0, 7],
[136, "Cash", None, 1428716521000, 4, 0.5, 0.5, 1, 0, 6.3],
[148, "Cash", None, 1430124581000, 9.5, 0, 0.5, 1, 0, 11.3],
[114, "Cash", None, 1428259673000, 5.5, 0, 0.5, 1.89, 0, 8.19],
[180, "Cash", None, 1428965823000, 6.5, 0.5, 0.5, 1, 0, 8.8],
[109, "Cash", None, 1428948513000, 22.5, 0.5, 0.5, 4.75, 0, 28.55],
[132, "Cash", None, 1429472779000, 6.5, 0.5, 0.5, 1.55, 0, 9.35],
[134, "Cash", None, 1429472668000, 33.5, 0.5, 0.5, 0, 0, 34.8],
[176, "Cash", None, 1428403962000, 9, 0.5, 0.5, 2.06, 0, 12.36],
[100, "Cash", None, None, 9, 0, 0.5, 2.9, 0, 12.7],
[193, "Cash", None, None, 3.5, 1, 0.5, 1.59, 0, 6.89],
[140, "Credit", "Visa", None, 28, 0, 0.5, 0, 0, 28.8],
[161, "Credit", "Visa", None, 7, 0, 0.5, 0, 0, 7.8],
[199, "Credit", "Visa", None, 6, 1, 0.5, 1, 0, 8.5],
[159, "Credit", "Visa", 1428674487000, 7, 0, 0.5, 0, 0, 7.8],
[156, "Credit", "MasterCard", 1428672753000, 12.5, 0.5, 0.5, 0, 0, 13.8],
[198, "Credit", "MasterCard", 1429472636000, 9, 0, 0.5, 0, 0, 9.8],
[107, "Credit", "MasterCard", 1428717377000, 5, 0.5, 0.5, 0, 0, 6.3],
[166, "Credit", "American Express", 1428808723000, 17.5, 0, 0.5, 0, 0, 18.3],
[187, "Credit", "American Express", 1428670181000, 14, 0, 0.5, 0, 0, 14.8],
[125, "Credit", "Discover", 1429869673000, 8.5, 0.5, 0.5, 0, 0, 9.8],
[119, None, None, 1430431471000, 9.5, 0, 0.5, 0, 0, 10.3],
[150, None, None, 1430432447000, 7.5, 0, 0.5, 0, 0, 8.3],
[170, "No Charge", None, 1430431502000, 28.6, 0, 0.5, 0, 0, 28.6],
[123, "No Charge", None, 1430136649000, 20, 0.5, 0.5, 0, 0, 21.3],
[181, None, None, 1430135461000, 6.5, 0.5, 0.5, 0, 0, 7.8],
[189, "No Charge", None, None, 6.5, 0, 0.6, 0, 0, 7]
]
# Insert the records into the Payment table
for record in payment_records:
    table_payment_obj.insert_records(record)
print "Number of batch records inserted into the Payment table: {}".format(table_payment_obj.size())
Once the table is populated with data, the data can be
retrieved using binary encoding or JSON encoding. Binary encoding is more
efficient than JSON encoding. The example below uses the GPUdbTable
interface and binary encoding:
# Retrieve no more than 10 records from the Payment table using the GPUdbTable interface
print "{:<10s} {:<12s} {:<11s} {:<17s} {:<11s} {:<9s} {:<7s} {:<10s} {:<12s} {:<12s}".format(
"Payment ID",
"Payment Type",
"Credit Type",
"Payment Timestamp",
"Fare Amount",
"Surcharge",
"MTA Tax",
"Tip Amount",
"Tolls Amount",
"Total Amount"
)
print "{:=<10s} {:=<12s} {:=<11s} {:=<17s} {:=<11s} {:=<9s} {:=<7s} {:=<10s} {:=<12s} {:=<12s}".format(
"", "", "", "", "", "", "", "", "", ""
)
for record in table_payment_obj.get_records(
        offset=0,
        limit=10,
        encoding="binary",
        options={"sort_by": "payment_type"}
):
    print "{payment_id:<10d} {payment_type:<12s} {credit_type:<11s} {payment_timestamp:<17d} {fare_amount:11.2f} " \
          "{surcharge:9.2f} {mta_tax:7.2f} {tip_amount:10.2f} {tolls_amount:12.2f} {total_amount:12.2f}".format(
        payment_id=record["payment_id"],
        payment_type=record["payment_type"],
        credit_type=record["credit_type"],
        payment_timestamp=record["payment_timestamp"],
        fare_amount=record["fare_amount"],
        surcharge=record["surcharge"],
        mta_tax=record["mta_tax"],
        tip_amount=record["tip_amount"],
        tolls_amount=record["tolls_amount"],
        total_amount=record["total_amount"]
    )
print
This example demonstrates record retrieval using JSON encoding and the GPUdb interface:
# Retrieve all records from the Vendor table using the GPUdb interface
print "{:<9s} {:<26s} {:<11s} {:<29s} {:<24s} {:<9s} {:<9s} {:<11s} {:<11s} {:<6s}".format(
"Vendor ID",
"Vendor Name",
"Phone",
"Email",
"HQ Street",
"HQ City",
"HQ State",
"HQ Zip Code",
"# Employees",
"# Cabs"
)
print "{:=<9s} {:=<26s} {:=<11s} {:=<29s} {:=<24s} {:=<9s} {:=<9s} {:=<11s} {:=<11s} {:=<6s}".format(
"", "", "", "", "", "", "", "", "", ""
)
vendor_records_gpudb = h_db.get_records(
table_name=table_vendor,
offset=0,
limit=gpudb.GPUdb.END_OF_SET,
encoding="json"
)["records_json"]
for record in vendor_records_gpudb:
print "{vendor_id:<9s} {vendor_name:<26s} {phone:<11} {email:<29s} {hq_street:<24s} {hq_city:<9s} " \
"{hq_state:<9s} {hq_zip:<11d} {num_emps:<11d} {num_cabs:<6d}".format(**json.loads(record))
For large tables, the data can easily be retrieved in smaller blocks by using the offset and limit parameters. The returned response also contains the schema (or data type) of the results.
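For example, here is a minimal sketch of paging through the Payment table in blocks of 100 records; the block size is arbitrary, and it assumes get_records returns an empty list once the offset passes the end of the table:
# Page through the Payment table in fixed-size blocks using offset and limit
block_size = 100  # illustrative block size
offset = 0
while True:
    records = table_payment_obj.get_records(
        offset=offset,
        limit=block_size,
        encoding="binary"
    )
    if not records:
        break
    for record in records:
        print "{} {}".format(record["payment_id"], record["payment_type"])
    offset += block_size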
Using any GPUdbTable object, you can update records via the update_records method.
# Update the email, number of employees, and number of cabs of the DDS vendor
table_vendor_obj.update_records(
expressions=["vendor_id = 'DDS'"],
new_values_maps={"email": "management@ddstaxico.com",
"num_emps": "156",
"num_cabs": "213"}
)
# Print the updated table
print "Updated DDS vendor information:"
print "{:<9s} {:<25s} {:<10s} {:<28s} {:<24} {:<8s} {:<11s} {:<11s} {:<6s}".format(
"Vendor ID",
"Vendor Name",
"Phone",
"Email",
"HQ Street",
"HQ City",
"HQ Zip Code",
"# Employees",
"# Cabs"
)
print "{:=<9s} {:=<25s} {:=<10s} {:=<28s} {:=<24s} {:=<8s} {:=<11s} {:=<11s} {:=<6s}".format(
"", "", "", "", "", "", "", "", "", ""
)
for vendor_record in table_vendor_obj.get_records(
        offset=0,
        limit=gpudb.GPUdb.END_OF_SET,
        options={"expression": "vendor_id = 'DDS'"}
):
    print "{vendor_id:<9s} {vendor_name:<25s} {phone:<10s} {email:<28s} {hq_street:<24s} {hq_city:<8s} " \
          "{hq_zip:<11d} {num_emps:<11d} {num_cabs:<6d}".format(
        vendor_id=vendor_record["vendor_id"],
        vendor_name=vendor_record["vendor_name"],
        phone=vendor_record["phone"],
        email=vendor_record["email"],
        hq_street=vendor_record["hq_street"],
        hq_city=vendor_record["hq_city"],
        hq_zip=vendor_record["hq_zip"],
        num_emps=vendor_record["num_emps"],
        num_cabs=vendor_record["num_cabs"]
    )
print
You can delete records using the delete_records method.
# Delete payment 189
pre_delete = table_payment_obj.size()
print "Records in the payment table (before delete): {}".format(pre_delete)
delete_expr = ["payment_id = 189"]
print "Deleting record where " + delete_expr[0]
table_payment_obj.delete_records(expressions=delete_expr)
post_delete = table_payment_obj.size()
print "Records in the payment table (after delete): {}".format(post_delete)
Some properties can be altered or added after table creation, including indexes, dictionary encoding, and compression. Using the alter_table method, you can create indexes on columns using the create_index action paired with a column name.
# Add column indexes to vendor_id in both the vendor and taxi_trip_data tables
vendor_index_resp = table_vendor_obj.alter_table(action="create_index", value="vendor_id")
print "Index added to vendor table: {} \n".format(vendor_index_resp)
taxi_index_resp = table_taxi_obj.alter_table(action="create_index", value="vendor_id")
print "Index added to taxi_trip_data table: {} \n".format(taxi_index_resp)
Applying column compression works similarly: use the alter_table method with a set_column_compression action paired with a column name and a compression type option.
# Apply snappy compression to the pickup and dropoff datetime columns
snappy_compression = {"compression_type": "snappy"}
pickup_compress_resp = table_taxi_obj.alter_table(action="set_column_compression", value="pickup_datetime",
options=snappy_compression)
print "Compression added to 'pickup_datetime' column: {} \n".format(pickup_compress_resp)
dropoff_compress_resp = table_taxi_obj.alter_table(action="set_column_compression", value="dropoff_datetime",
options=snappy_compression)
print "Compression added to 'dropoff_datetime' column: {}".format(dropoff_compress_resp)
Filters provide an easy way to pare a larger table down into more concise views using expressions.
# Select all payments with no corresponding payment type; allow Kinetica to assign a random name to the view
collection_option = {"collection_name": collection}
f1_resp = table_payment_obj.filter(
expression="IS_NULL(payment_type)",
options=collection_option
).size()
print "Number of null payments: {}".format(f1_resp)
# Using GPUdbTable query chaining, filter null payment type records with a fare amount greater than 8
h_db.clear_table(view_example1)
f2_resp = table_payment_obj.filter(
view_name=view_example1,
expression="IS_NULL(payment_type)",
options=collection_option
).filter(
view_name=view_example2,
expression="fare_amount > 8",
options=collection_option
).size()
print "Number of null payments with a fare amount greater than $8.00 (with query chaining): {}".format(f2_resp)
# Filter by list where vendor ID is either NYC or YCAB
f3_resp = table_taxi_obj.filter_by_list(
view_name=view_example3,
column_values_map={"vendor_id": ["NYC", "YCAB"]}
).size()
print "Number of records where vendor_id is either NYC or YCAB: {}".format(f3_resp)
# Filter by range trip with passenger count between 1 and 3
f4_resp = table_taxi_obj.filter_by_range(
view_name=view_example4,
column_name="passenger_count",
lower_bound=1,
upper_bound=3
).size()
print "Number of trips with a passenger count between 1 and 3: {}".format(f4_resp)
Kinetica supports various aggregate and group-by queries, which group and aggregate your data to return counts and useful statistics.
# Aggregate count, min, mean, and max on the trip distance
a1_resp = table_taxi_obj.aggregate_statistics(
column_name="trip_distance",
stats="count,min,max,mean"
)
print "Statistics of values in the trip_distance column:"
print "\tCount: {count:.0f}\n\tMin: {min:4.2f}\n\tMean: {mean:4.2f}\n\tMax: {max:4.2f}\n".format(**a1_resp["stats"])
# Find unique taxi vendor IDs
a2_resp = table_taxi_obj.aggregate_unique(
column_name="vendor_id",
offset=0,
limit=gpudb.GPUdb.END_OF_SET,
encoding="json"
)["data"]["vendor_id"]
print "Unique vendor IDs in the taxi trip table:"
for vendor in a2_resp:
print "\t* {}".format(vendor)
print
# Find number of trips per vendor; use binary decoding
a3_resp = table_taxi_obj.aggregate_group_by(
column_names=["vendor_id",
"count(vendor_id)"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET
)["data"]
print "Number of trips per vendor:"
for vendor in zip(
        a3_resp["vendor_id"],
        a3_resp["count(vendor_id)"]
):
    print "\t{:<10s}{:4d}".format(vendor[0] + ":", vendor[1])
print
# Create a histogram for the different groups of passenger counts
a4_resp = table_taxi_obj.aggregate_histogram(
column_name="passenger_count",
start=1,
end=6,
interval=1
)["counts"]
print "Passenger count groups by size:"
for histo_group in zip(
        [1, 2, 3, 4, 5, 6],
        a4_resp
):
    print "\t{}s: {:4.0f}".format(*histo_group)
Joins allow you to link multiple tables together, along their relations, retrieving associated information from any or all of them. Tables can only be joined if they're sharded similarly or replicated. Note that the create_join_table method is static and is not called with a corresponding GPUdbTable object.
An inner join returns only records that have matching values in both tables.
# Retrieve cab ride transactions and the full name of the associated vendor for rides having more than three
# passengers between April 1st & 16th, 2015
gpudb.GPUdbTable.create_join_table(
join_table_name=join_table_inner,
table_names=["taxi_trip_data as t",
"payment as p"],
column_names=["payment_id",
"payment_type",
"total_amount",
"passenger_count",
"vendor_id",
"trip_distance"],
expressions=["t.payment_id = p.payment_id",
"passenger_count > 3"],
options={"collection_name": collection},
db=h_db
)
join_table_inner_obj = gpudb.GPUdbTable(
_type=None,
name=join_table_inner,
db=h_db
)
j1_resp = join_table_inner_obj.get_records_by_column(
column_names=["payment_id",
"payment_type",
"total_amount",
"passenger_count",
"vendor_id",
"trip_distance"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET
)
print "Inner join table:"
print "Payment ID Payment Type Total Amount Passenger Count Vendor ID Trip Distance "
print "============ ============= ============= ================ ========== =============="
for record in zip(
        j1_resp["payment_id"],
        j1_resp["payment_type"],
        j1_resp["total_amount"],
        j1_resp["passenger_count"],
        j1_resp["vendor_id"],
        j1_resp["trip_distance"]
):
    print "{:<12d} {:<13} {:<13f} {:<16d} {:<10s} {:<14f}".format(
        record[0],
        record[1],
        record[2],
        record[3],
        record[4],
        record[5]
    )
print
A left join returns all of the records an inner join does, but additionally, for each record in the table on the left side of the join that has no match along the relation to a record in the table on the right side of the join, a corresponding record will be returned with "left-side" columns populated with the "left-side" record data and the "right-side" columns populated with nulls.
# Retrieve cab ride transactions and the full name of the associated vendor (if available--blank if vendor name is
# unknown) for transactions with associated payment data, sorting by increasing values of transaction ID.
gpudb.GPUdbTable.create_join_table(
join_table_name=join_table_left,
table_names=["taxi_trip_data as t",
"vendor as v"],
column_names=["transaction_id",
"pickup_datetime",
"trip_distance",
"t.vendor_id",
"vendor_name"],
expressions=["left join t, v on (t.vendor_id = v.vendor_id)",
"payment_id <> 0"],
options={"collection_name": collection},
db=h_db
)
join_table_left_obj = gpudb.GPUdbTable(
_type=None,
name=join_table_left,
db=h_db
)
j2_resp = join_table_left_obj.get_records_by_column(
column_names=["transaction_id",
"pickup_datetime",
"trip_distance",
"vendor_id",
"vendor_name"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET
)
print "Left join table:"
print "Transaction ID Pickup (in secs since Epoch) Trip Distance Vendor ID Vendor Name "
print "============== ============================ ============= ========= =================="
for record in zip(
        j2_resp["transaction_id"],
        j2_resp["pickup_datetime"],
        j2_resp["trip_distance"],
        j2_resp["vendor_id"],
        j2_resp["vendor_name"]
):
    print "{:<14d} {:<28d} {:<13.2f} {:<9s} {:18s}".format(
        record[0],
        record[1],
        record[2],
        record[3],
        record[4]
    )
print
Note
Full outer joins require both tables to be replicated. Set merges that perform deduplication of records, like Union Distinct, Intersect, and Except, also need to use replicated tables to ensure correct results, so a replicated version of the taxi table (taxi_trip_data_replicated) and a corresponding table object are created at this point in the tutorial.
gpudb.GPUdbTable.merge_records(
table_name=table_taxi_replicated, source_table_names=[table_taxi],
field_maps=[{
"transaction_id": "transaction_id",
"payment_id": "payment_id",
"vendor_id": "vendor_id",
"pickup_datetime": "pickup_datetime",
"dropoff_datetime": "dropoff_datetime",
"passenger_count": "passenger_count",
"trip_distance": "trip_distance",
"pickup_longitude": "pickup_longitude",
"pickup_latitude": "pickup_latitude",
"dropoff_longitude": "dropoff_longitude",
"dropoff_latitude": "dropoff_latitude"
}],
options={"collection_name": collection,
"is_replicated": "true"},
db=h_db
)
table_taxi_replicated_obj = gpudb.GPUdbTable(
_type=None,
name=table_taxi_replicated,
db=h_db
)
A full outer join returns all of the records a left join does, but additionally, for each record in the table on the right side of the join that has no match along the relation to a record in the table on the left side of the join, a corresponding record will be returned with "right-side" columns populated with the "right-side" record data and the "left-side" columns populated with nulls.
# Retrieve the vendor IDs of known vendors with no recorded cab ride transactions, as well as the vendor ID and
# number of transactions for unknown vendors with recorded cab ride transactions
gpudb.GPUdbTable.create_join_table(
join_table_name=join_table_outer,
table_names=["taxi_trip_data_replicated as t",
"vendor as v"],
column_names=["t.vendor_id as vendor_id",
"v.vendor_id as vendor_id_1"],
expressions=["full_outer join t,v on ((v.vendor_id = t.vendor_id))"],
options={"collection_name": collection},
db=h_db
)
join_table_outer_obj = gpudb.GPUdbTable(
_type=None,
name=join_table_outer,
db=h_db
)
# Aggregate the join table results by vendor ID and count the amount of records
j3_resp = join_table_outer_obj.aggregate_group_by(
column_names=["vendor_id_1 as vend_table_vendors",
"vendor_id as taxi_table_vendors",
"count(*) as total_records"], offset=0,
limit=gpudb.GPUdb.END_OF_SET,
options={"expression": "(is_null(vendor_id_1) OR is_null(vendor_id))"}
)["data"]
print "Outer join table:"
print "Vend. Table Vendors Taxi Table Vendors Total Records"
print "=================== ================== ============="
for record in zip(
        j3_resp["vend_table_vendors"],
        j3_resp["taxi_table_vendors"],
        j3_resp["total_records"]
):
    print "{:<19s} {:<18s} {:<13d}".format(
        record[0],
        record[1],
        record[2]
    )
print
You can create projections using the create_projection method.
# Create a projection containing all payments by credit card
table_payment_obj.create_projection(
projection_name=projection_example1,
column_names=["payment_id",
"payment_type",
"credit_type",
"payment_timestamp",
"fare_amount",
"surcharge",
"mta_tax",
"tip_amount",
"tolls_amount",
"total_amount"],
options={"collection_name": collection}
)
projection_example1_obj = gpudb.GPUdbTable(
_type=None,
name=projection_example1,
db=h_db
)
p1_resp = projection_example1_obj.get_records_by_column(
column_names=["payment_id",
"payment_type",
"credit_type",
"payment_timestamp",
"fare_amount",
"surcharge",
"mta_tax",
"tip_amount",
"tolls_amount",
"total_amount"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET
)
print "Projection of only credit payment types:"
print "Payment ID Payment Type Credit Type Timestamp Fare Surcharge MTA Tax Tip Tolls Total"
print "========== ============ ================ ============= ===== ========= ======= ===== ===== ====="
for record in zip(
        p1_resp["payment_id"],
        p1_resp["payment_type"],
        p1_resp["credit_type"],
        p1_resp["payment_timestamp"],
        p1_resp["fare_amount"],
        p1_resp["surcharge"],
        p1_resp["mta_tax"],
        p1_resp["tip_amount"],
        p1_resp["tolls_amount"],
        p1_resp["total_amount"]
):
    print "{:<10} {:<12} {:<16} {:<13} {:^5.2f} {:^9.2f} {:^7.2f} {:^5.2f} {:^5.2f} {:^5.2f}".format(
        record[0],
        record[1],
        record[2],
        record[3],
        record[4],
        record[5],
        record[6],
        record[7],
        record[8],
        record[9]
    )
print
To persist a projection:
# Create a persisted table with cab ride transactions greater than 5 miles whose trip started during lunch hours
table_taxi_obj.create_projection(
projection_name=projection_example2,
column_names=["hour(pickup_datetime) as hour_of_day",
"vendor_id",
"passenger_count",
"trip_distance"],
options={"expression": "(hour(pickup_datetime) >=11) AND (hour(pickup_datetime) <= 14) AND (trip_distance > 5)",
"collection_name": collection,
"persist": "true"}
)
projection_example2_obj = gpudb.GPUdbTable(
_type=None,
name=projection_example2,
db=h_db
)
p2_resp = projection_example2_obj.get_records_by_column(
column_names=["hour_of_day",
"vendor_id",
"passenger_count",
"trip_distance"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET
)
print "Projection of long trips taken during lunch hours:"
print "Hour of Day Vendor ID Passenger Count Trip Distance"
print "=========== ========= =============== ============="
for record in zip(
        p2_resp["hour_of_day"],
        p2_resp["vendor_id"],
        p2_resp["passenger_count"],
        p2_resp["trip_distance"]
):
    print "{:<11} {:<9} {:<15} {:<13.2f}".format(
        record[0],
        record[1],
        record[2],
        record[3]
    )
Union can be used to combine homogeneous data sets into one larger data set. Union & Union Distinct will both combine data sets but only retain the records that are unique across the chosen columns, removing all duplicates. Union All will combine data sets, retaining all records from the source data sets.
# Calculate the average number of passengers, as well as the shortest, average, and longest trips for all trips in
# each of the two time periods--from April 1st through the 15th, 2015 and from April 16th through the 23rd, 2015--
# and return those two sets of statistics in a single result set.
table_taxi_obj.aggregate_group_by(
column_names=["avg(passenger_count) as avg_pass_count",
"avg(trip_distance) as avg_trip_dist",
"min(trip_distance) as min_trip_dist",
"max(trip_distance) as max_trip_dist"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET,
options={"expression": "((pickup_datetime >= '2015-04-01') AND (pickup_datetime <= '2015-04-15 23:59:59.999'))",
"result_table": agg_grpby_union_all_src1,
"collection_name": collection}
)
table_taxi_obj.aggregate_group_by(
column_names=["avg(passenger_count) as avg_pass_count",
"avg(trip_distance) as avg_trip_dist",
"min(trip_distance) as min_trip_dist",
"max(trip_distance) as max_trip_dist"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET,
options={"expression": "((pickup_datetime >= '2015-04-16') AND (pickup_datetime <= '2015-04-23 23:59:59.999'))",
"result_table": agg_grpby_union_all_src2,
"collection_name": collection}
)
gpudb.GPUdbTable.create_union(
table_name=union_all_table,
table_names=[agg_grpby_union_all_src1,
agg_grpby_union_all_src2],
input_column_names=[
["'2015-04-01 - 2014-04-15'", "avg_pass_count", "avg_trip_dist", "min_trip_dist", "max_trip_dist"],
["'2015-04-16 - 2015-04-23'", "avg_pass_count", "avg_trip_dist", "min_trip_dist", "max_trip_dist"]
],
output_column_names=["pickup_window_range",
"avg_pass_count",
"avg_trip",
"min_trip",
"max_trip"],
options={"mode": "union_all",
"collection_name": collection},
db=h_db
)
union_all_table_obj = gpudb.GPUdbTable(
_type=None,
name=union_all_table,
db=h_db
)
u1_resp = union_all_table_obj.get_records_by_column(
column_names=["pickup_window_range",
"avg_pass_count",
"avg_trip",
"min_trip",
"max_trip"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET
)
print "Union All table:"
print "Pickup Window Range Avg. Pass. Count Avg. Trip Min. Trip Max Trip "
print "======================= ================ ============= ============== ============="
for record in zip(
        u1_resp["pickup_window_range"],
        u1_resp["avg_pass_count"],
        u1_resp["avg_trip"],
        u1_resp["min_trip"],
        u1_resp["max_trip"]
):
    print "{:<23} {:<16} {:<13} {:<14} {:<13}".format(
        record[0],
        record[1],
        record[2],
        record[3],
        record[4]
    )
print
Intersect will combine data sets but only include the records found in both data sets, removing duplicate result records.
# Retrieve locations (as lat/lon pairs) that were both pick-up and drop-off points
table_taxi_replicated_obj.create_projection(
projection_name=projection_intersect_src,
column_names=["pickup_latitude",
"pickup_longitude"],
options={"expression": "((pickup_latitude > 0) AND (pickup_longitude <> 0))",
"collection_name": collection})
gpudb.GPUdbTable.create_union(
table_name=union_intersect_table,
table_names=[projection_intersect_src,
table_taxi_replicated],
input_column_names=[["pickup_latitude", "pickup_longitude"],
["dropoff_latitude", "dropoff_longitude"]],
output_column_names=["pickup_latitude", "pickup_longitude"],
options={"mode": "intersect",
"collection_name": collection},
db=h_db
)
union_intersect_table_obj = gpudb.GPUdbTable(
_type=None,
name=union_intersect_table,
db=h_db
)
u2_resp = union_intersect_table_obj.get_records_by_column(
column_names=["pickup_latitude", "pickup_longitude"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET)
print "Intersect table:"
print "Pickup Latitude Pickup Longitude"
print "=============== ================"
for record in zip(u2_resp["pickup_latitude"], u2_resp["pickup_longitude"]):
    print "{:<15} {:<16}".format(
        record[0],
        record[1]
    )
print
Except will return records that appear in the first data set but not the second data set. Note that the data sets on each side of the Except will have duplicates removed first, and then the set subtraction will be processed.
# Show vendors that operate before noon, but not after noon: retrieve the unique list of IDs of vendors who
# provided cab rides between midnight and noon, and remove from that list the IDs of any vendors who provided cab
# rides between noon and midnight
table_taxi_replicated_obj.create_projection(
projection_name=projection_except_src1,
column_names=["vendor_id"],
options={"expression": "((HOUR(pickup_datetime) >= 0) AND (HOUR(pickup_datetime) <= 11))",
"collection_name": collection}
)
table_taxi_replicated_obj.create_projection(
projection_name=projection_except_src2,
column_names=["vendor_id"],
options={"expression": "((HOUR(pickup_datetime) >= 12) AND (HOUR(pickup_datetime) <= 23))",
"collection_name": collection}
)
gpudb.GPUdbTable.create_union(
table_name=union_except_table,
table_names=[projection_except_src1,
projection_except_src2],
input_column_names=[["vendor_id"],
["vendor_id"]],
output_column_names=["vendor_id"],
options={"mode": "except",
"collection_name": collection},
db=h_db
)
union_except_table_obj = gpudb.GPUdbTable(
_type=None,
name=union_except_table,
db=h_db
)
u3_resp = union_except_table_obj.get_records_by_column(
column_names=["vendor_id"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET
)
print "Except table:"
print "Vendor ID"
print "========="
for record in zip(u3_resp["vendor_id"]):
print "{:<9}".format(record[0])
You can delete records from a table using filter expressions. This method allows you to specify multiple filter expressions--note that each expression is used to delete records independently from the others (i.e., a record only needs to meet any one expression's criteria to be deleted from the table).
delete_resp = table_payment_obj.filter(expression="payment_type = 'Cash'").size()
print "Number of records that meet deletion criteria before deleting: {}".format(delete_resp)
table_payment_obj.delete_records(expressions=["payment_type = 'Cash'"])
delete_resp = table_payment_obj.filter(expression="payment_type = 'Cash'").size()
print "Number of records that meet deletion criteria after deleting: {}".format(delete_resp)
You can remove tables and collections using the GPUdb interface method clear_table.
h_db.clear_table(table_name=table_taxi)
h_db.clear_table(table_name=table_taxi_replicated)
h_db.clear_table(table_name=collection)
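If a table may or may not exist, the clear_table call can be guarded with has_table, as the complete sample program below does; a minimal sketch, using the table_payment name defined in that program:
# Only clear the Payment table if it exists, avoiding an error for a missing table
if h_db.has_table(table_name=table_payment)["table_exists"]:
    h_db.clear_table(table_name=table_payment)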
Included below is a complete sample program containing all the above queries:
"""This script walks through how to use the Python API.
Covered here: importing GPUdb, instantiating Kinetica, creating types, creating tables, inserting records,
retrieving records, updating records, altering tables, filtering records, aggregating/grouping records, joining tables,
projections, unioning tables, and deleting records/tables.
"""
import collections
import json
import gpudb
def gpudb_example():
print
print "TUTORIAL OUTPUT"
print "==============="
# All tables/views used in examples below
agg_grpby_union_all_src1 = "agg_passcount_tripdist_btw_apr1_apr15"
agg_grpby_union_all_src2 = "agg_passcount_tripdist_btw_apr16_apr23"
collection = "taxi_info"
join_table_inner = "pay_info_rides_gt_3_pass"
join_table_left = "all_vendor_transactions"
join_table_outer = "vendors_w_no_transactions"
projection_example1 = "credit_payment"
projection_example2 = "long_lunch_time_rides"
projection_intersect_src = "pickup_lat_lon"
projection_except_src1 = "vendors_operating_before_noon"
projection_except_src2 = "vendors_operating_after_noon"
table_taxi = "taxi_trip_data"
table_taxi_replicated = "taxi_trip_data_replicated"
table_vendor = "vendor"
table_payment = "payment"
union_all_table = "passcount_tripdist_stats_apr"
union_intersect_table = "shared_pickup_dropoff_points"
union_except_table = "vendors_operating_btw_midnight_noon"
view_example1 = "null_payments"
view_example2 = "null_payments_gt_8"
view_example3 = "nyc_ycab_vendors"
view_example4 = "passenger_count_btw_1_3"
# Establish connection with a locally-running instance of Kinetica using binary encoding to save memory
h_db = gpudb.GPUdb(
encoding="BINARY",
host="127.0.0.1",
port="9191"
)
print
print "CREATING TYPES & TABLES"
print "-----------------------"
print
print "Vendor Table"
print "************"
# Create a type from a list of lists. Each list below is an individual column. Each column comprises at
# least two values: a column name (always the first value) and a primitive type (always the second value). Any
# subsequent values are column properties. The order of the columns defines the order in which values must
# be inserted into the table, e.g., a "vendor_name" value cannot be inserted before a "vendor_id" value
vendor_columns = [
# column types and properties can be listed as strings
["vendor_id", "string", "char4", "primary_key"],
["vendor_name", "string", "char64"],
["phone", "string", "char16", "nullable"],
["email", "string", "char64", "nullable"],
["hq_street", "string", "char64"],
# column types and properties can also be listed using the GPUdbRecordColumn and GPUdbColumnProperty objects
["hq_city", gpudb.GPUdbRecordColumn._ColumnType.STRING, gpudb.GPUdbColumnProperty.CHAR8,
gpudb.GPUdbColumnProperty.DICT],
["hq_state", gpudb.GPUdbRecordColumn._ColumnType.STRING, gpudb.GPUdbColumnProperty.CHAR8,
gpudb.GPUdbColumnProperty.DICT],
["hq_zip", gpudb.GPUdbRecordColumn._ColumnType.INT],
["num_emps", gpudb.GPUdbRecordColumn._ColumnType.INT],
["num_cabs", gpudb.GPUdbRecordColumn._ColumnType.INT]
]
# Clear any existing table with the same name (otherwise we won't be able to create the table)
if h_db.has_table(table_name=table_vendor)["table_exists"]:
h_db.clear_table(table_name=table_vendor)
# Create the table from the type and place it in a collection
try:
table_vendor_obj = gpudb.GPUdbTable(
_type=vendor_columns,
name=table_vendor,
options={"collection_name": collection,
"is_replicated": "true"},
db=h_db
)
print "Vendor table successfully created"
except gpudb.GPUdbException as e:
print "Vendor table creation failure: {}".format(str(e))
print
print "Payment Table"
print "*************"
payment_columns = [
["payment_id", "long", "primary_key"],
["payment_type", "string", "char16", "nullable"],
["credit_type", "string", "char16", "nullable", "dict"],
["payment_timestamp", "long", "timestamp", "nullable"],
["fare_amount", "double", "nullable"],
["surcharge", "double", "nullable"],
["mta_tax", "double", "nullable"],
["tip_amount", "double", "nullable"],
["tolls_amount", "double", "nullable"],
["total_amount", "double", "nullable"]
]
# Clear any existing table with the same name (otherwise we won't be able to create the table)
if h_db.has_table(table_name=table_payment)["table_exists"]:
h_db.clear_table(table_name=table_payment)
# Create an "options" object containing the table creation options
options = gpudb.GPUdbTableOptions.default().collection_name(collection).is_replicated("true")
# Create the table from the type and place it in a collection
try:
table_payment_obj = gpudb.GPUdbTable(
payment_columns,
table_payment,
options,
# options={"collection_name": collection},
h_db
)
print "Payment table successfully created"
except gpudb.GPUdbException as e:
print "Payment table creation failure: {}".format(str(e))
print
print "Taxi Table"
print "**********"
# Create a table object for the pre-existing taxi table
try:
table_taxi_obj = gpudb.GPUdbTable(
_type=None,
name=table_taxi,
db=h_db
)
print "Taxi table object successfully created"
except gpudb.GPUdbException as e:
print "Taxi table object creation failure: {}".format(str(e))
print "\n"
print "INSERTING DATA"
print "--------------"
print
# Insert single record example
# Create ordered dictionary for keys & values of record
vendor_datum = collections.OrderedDict()
vendor_datum["vendor_id"] = "VTS"
vendor_datum["vendor_name"] = "Vine Taxi Service"
vendor_datum["phone"] = "9998880001"
vendor_datum["email"] = "admin@vtstaxi.com"
vendor_datum["hq_street"] = "26 Summit St."
vendor_datum["hq_city"] = "Flushing"
vendor_datum["hq_state"] = "NY"
vendor_datum["hq_zip"] = 11354
vendor_datum["num_cabs"] = 450
vendor_datum["num_emps"] = 400
# Insert the record into the table
table_vendor_obj.insert_records(vendor_datum)
print "Number of single records inserted into the Vendor table: {}".format(table_vendor_obj.size())
# Insert multiple records examples
# Create a list of in-line records. The order of the values must match the column order in the type
vendor_records = [
["YCAB", "Yes Cab", "7895444321", None, "97 Edgemont St.", "Brooklyn", "NY", 11223, 445, 425],
["NYC", "New York City Cabs", None, "support@nyc-taxis.com", "9669 East Bayport St.", "Bronx", "NY", 10453, 505,
500],
["DDS", "Dependable Driver Service", None, None, "8554 North Homestead St.", "Bronx", "NY", 10472, 200, 124],
["CMT", "Crazy Manhattan Taxi", "9778896500", "admin@crazymanhattantaxi.com", "950 4th Road Suite 78",
"Brooklyn", "NY", 11210, 500, 468],
["TNY", "Taxi New York", None, None, "725 Squaw Creek St.", "Bronx", "NY", 10458, 315, 305],
["NYMT", "New York Metro Taxi", None, None, "4 East Jennings St.", "Brooklyn", "NY", 11228, 166, 150],
["5BTC", "Five Boroughs Taxi Co.", "4566541278", "mgmt@5btc.com", "9128 Lantern Street", "Brooklyn", "NY",
11229, 193, 175]
]
# Insert the records into the Vendor table
table_vendor_obj.insert_records(vendor_records)
print "Number of batch records inserted into the Vendor table: {}".format(table_vendor_obj.size() - 1)
# Create another list of in-line records
payment_records = [
[189, "No Charge", None, None, 6.5, 0, 0.6, 0, 0, 7],
[136, "Cash", None, 1428716521000, 4, 0.5, 0.5, 1, 0, 6.3],
[148, "Cash", None, 1430124581000, 9.5, 0, 0.5, 1, 0, 11.3],
[114, "Cash", None, 1428259673000, 5.5, 0, 0.5, 1.89, 0, 8.19],
[180, "Cash", None, 1428965823000, 6.5, 0.5, 0.5, 1, 0, 8.8],
[109, "Cash", None, 1428948513000, 22.5, 0.5, 0.5, 4.75, 0, 28.55],
[132, "Cash", None, 1429472779000, 6.5, 0.5, 0.5, 1.55, 0, 9.35],
[134, "Cash", None, 1429472668000, 33.5, 0.5, 0.5, 0, 0, 34.8],
[176, "Cash", None, 1428403962000, 9, 0.5, 0.5, 2.06, 0, 12.36],
[100, "Cash", None, None, 9, 0, 0.5, 2.9, 0, 12.7],
[193, "Cash", None, None, 3.5, 1, 0.5, 1.59, 0, 6.89],
[140, "Credit", "Visa", None, 28, 0, 0.5, 0, 0, 28.8],
[161, "Credit", "Visa", None, 7, 0, 0.5, 0, 0, 7.8],
[199, "Credit", "Visa", None, 6, 1, 0.5, 1, 0, 8.5],
[159, "Credit", "Visa", 1428674487000, 7, 0, 0.5, 0, 0, 7.8],
[156, "Credit", "MasterCard", 1428672753000, 12.5, 0.5, 0.5, 0, 0, 13.8],
[198, "Credit", "MasterCard", 1429472636000, 9, 0, 0.5, 0, 0, 9.8],
[107, "Credit", "MasterCard", 1428717377000, 5, 0.5, 0.5, 0, 0, 6.3],
[166, "Credit", "American Express", 1428808723000, 17.5, 0, 0.5, 0, 0, 18.3],
[187, "Credit", "American Express", 1428670181000, 14, 0, 0.5, 0, 0, 14.8],
[125, "Credit", "Discover", 1429869673000, 8.5, 0.5, 0.5, 0, 0, 9.8],
[119, None, None, 1430431471000, 9.5, 0, 0.5, 0, 0, 10.3],
[150, None, None, 1430432447000, 7.5, 0, 0.5, 0, 0, 8.3],
[170, "No Charge", None, 1430431502000, 28.6, 0, 0.5, 0, 0, 28.6],
[123, "No Charge", None, 1430136649000, 20, 0.5, 0.5, 0, 0, 21.3],
[181, None, None, 1430135461000, 6.5, 0.5, 0.5, 0, 0, 7.8],
[189, "No Charge", None, None, 6.5, 0, 0.6, 0, 0, 7]
]
# Insert the records into the Payment table
for record in payment_records:
table_payment_obj.insert_records(record)
print "Number of batch records inserted into the Payment table: {}".format(table_payment_obj.size())
print "\n"
print "RETRIEVING DATA"
print "---------------"
print
# Retrieve no more than 10 records from the Payment table using the GPUdbTable interface
print "{:<10s} {:<12s} {:<11s} {:<17s} {:<11s} {:<9s} {:<7s} {:<10s} {:<12s} {:<12s}".format(
"Payment ID",
"Payment Type",
"Credit Type",
"Payment Timestamp",
"Fare Amount",
"Surcharge",
"MTA Tax",
"Tip Amount",
"Tolls Amount",
"Total Amount"
)
print "{:=<10s} {:=<12s} {:=<11s} {:=<17s} {:=<11s} {:=<9s} {:=<7s} {:=<10s} {:=<12s} {:=<12s}".format(
"", "", "", "", "", "", "", "", "", ""
)
for record in table_payment_obj.get_records(
offset=0,
limit=10,
encoding="binary",
options={"sort_by": "payment_type"}
):
print "{payment_id:<10d} {payment_type:<12s} {credit_type:<11s} {payment_timestamp:<17d} {fare_amount:11.2f} " \
"{surcharge:9.2f} {mta_tax:7.2f} {tip_amount:10.2f} {tolls_amount:12.2f} {total_amount:12.2f}".format(
payment_id=record["payment_id"],
payment_type=record["payment_type"],
credit_type=record["credit_type"],
payment_timestamp=record["payment_timestamp"],
fare_amount=record["fare_amount"],
surcharge=record["surcharge"],
mta_tax=record["mta_tax"],
tip_amount=record["tip_amount"],
tolls_amount=record["tolls_amount"],
total_amount=record["total_amount"]
)
print
# Retrieve all records from the Vendor table using the GPUdb interface
print "{:<9s} {:<26s} {:<11s} {:<29s} {:<24s} {:<9s} {:<9s} {:<11s} {:<11s} {:<6s}".format(
"Vendor ID",
"Vendor Name",
"Phone",
"Email",
"HQ Street",
"HQ City",
"HQ State",
"HQ Zip Code",
"# Employees",
"# Cabs"
)
print "{:=<9s} {:=<26s} {:=<11s} {:=<29s} {:=<24s} {:=<9s} {:=<9s} {:=<11s} {:=<11s} {:=<6s}".format(
"", "", "", "", "", "", "", "", "", ""
)
vendor_records_gpudb = h_db.get_records(
table_name=table_vendor,
offset=0,
limit=gpudb.GPUdb.END_OF_SET,
encoding="json"
)["records_json"]
for record in vendor_records_gpudb:
print "{vendor_id:<9s} {vendor_name:<26s} {phone:<11} {email:<29s} {hq_street:<24s} {hq_city:<9s} " \
"{hq_state:<9s} {hq_zip:<11d} {num_emps:<11d} {num_cabs:<6d}".format(**json.loads(record))
print "\n"
print "UPDATING & REMOVING RECORDS"
print "---------------------------"
print
# Update the email, number of employees, and number of cabs of the DDS vendor
table_vendor_obj.update_records(
expressions=["vendor_id = 'DDS'"],
new_values_maps={"email": "management@ddstaxico.com",
"num_emps": "156",
"num_cabs": "213"}
)
# Print the updated table
print "Updated DDS vendor information:"
print "{:<9s} {:<25s} {:<10s} {:<28s} {:<24} {:<8s} {:<11s} {:<11s} {:<6s}".format(
"Vendor ID",
"Vendor Name",
"Phone",
"Email",
"HQ Street",
"HQ City",
"HQ Zip Code",
"# Employees",
"# Cabs"
)
print "{:=<9s} {:=<25s} {:=<10s} {:=<28s} {:=<24s} {:=<8s} {:=<11s} {:=<11s} {:=<6s}".format(
"", "", "", "", "", "", "", "", "", ""
)
for vendor_record in table_vendor_obj.get_records(
offset=0,
limit=gpudb.GPUdb.END_OF_SET,
options={"expression": "vendor_id = 'DDS'"}
):
print "{vendor_id:<9s} {vendor_name:<25s} {phone:<10s} {email:<28s} {hq_street:<24s} {hq_city:<8s} " \
"{hq_zip:<11d} {num_emps:<11d} {num_cabs:<6d}".format(
vendor_id=vendor_record["vendor_id"],
vendor_name=vendor_record["vendor_name"],
phone=vendor_record["phone"],
email=vendor_record["email"],
hq_street=vendor_record["hq_street"],
hq_city=vendor_record["hq_city"],
hq_zip=vendor_record["hq_zip"],
num_emps=vendor_record["num_emps"],
num_cabs=vendor_record["num_cabs"]
)
print
# Delete payment 189
pre_delete = table_payment_obj.size()
print "Records in the payment table (before delete): {}".format(pre_delete)
delete_expr = ["payment_id = 189"]
print "Deleting record where " + delete_expr[0]
table_payment_obj.delete_records(expressions=delete_expr)
post_delete = table_payment_obj.size()
print "Records in the payment table (after delete): {}".format(post_delete)
print "\n"
print "ALTER TABLE"
print "-----------"
print
# Add column indexes to vendor_id in both the vendor and taxi_trip_data tables
vendor_index_resp = table_vendor_obj.alter_table(action="create_index", value="vendor_id")
print "Index added to vendor table: {} \n".format(vendor_index_resp)
taxi_index_resp = table_taxi_obj.alter_table(action="create_index", value="vendor_id")
print "Index added to taxi_trip_data table: {} \n".format(taxi_index_resp)
# Apply snappy compression to the pickup and dropoff datetime columns
snappy_compression = {"compression_type": "snappy"}
pickup_compress_resp = table_taxi_obj.alter_table(action="set_column_compression", value="pickup_datetime",
options=snappy_compression)
print "Compression added to 'pickup_datetime' column: {} \n".format(pickup_compress_resp)
dropoff_compress_resp = table_taxi_obj.alter_table(action="set_column_compression", value="dropoff_datetime",
options=snappy_compression)
print "Compression added to 'dropoff_datetime' column: {}".format(dropoff_compress_resp)
print "\n"
print "FILTERING"
print "---------"
print
# Clear any existing views with the same name (otherwise we won't be able to create the views)
if h_db.has_table(table_name=view_example3)["table_exists"]:
h_db.clear_table(table_name=view_example3)
if h_db.has_table(table_name=view_example4)["table_exists"]:
h_db.clear_table(table_name=view_example4)
# Filter Example 1
# Select all payments with no corresponding payment type; allow Kinetica to assign a random name to the view
collection_option = {"collection_name": collection}
f1_resp = table_payment_obj.filter(
expression="IS_NULL(payment_type)",
options=collection_option
).size()
print "Number of null payments: {}".format(f1_resp)
# Filter Example 2
# Using GPUdbTable query chaining, filter null payment type records with a fare amount greater than 8
h_db.clear_table(view_example1)
f2_resp = table_payment_obj.filter(
view_name=view_example1,
expression="IS_NULL(payment_type)",
options=collection_option
).filter(
view_name=view_example2,
expression="fare_amount > 8",
options=collection_option
).size()
print "Number of null payments with a fare amount greater than $8.00 (with query chaining): {}".format(f2_resp)
# Filter Example 3
# Filter by list where vendor ID is either NYC or YCAB
f3_resp = table_taxi_obj.filter_by_list(
view_name=view_example3,
column_values_map={"vendor_id": ["NYC", "YCAB"]}
).size()
print "Number of records where vendor_id is either NYC or YCAB: {}".format(f3_resp)
# Filter Example 4
# Filter by range trip with passenger count between 1 and 3
f4_resp = table_taxi_obj.filter_by_range(
view_name=view_example4,
column_name="passenger_count",
lower_bound=1,
upper_bound=3
).size()
print "Number of trips with a passenger count between 1 and 3: {}".format(f4_resp)
print "\n"
print "AGGREGATING, GROUPING, AND HISTOGRAMS"
print "-------------------------------------"
print
# Aggregate Example 1
# Aggregate count, min, mean, and max on the trip distance
a1_resp = table_taxi_obj.aggregate_statistics(
column_name="trip_distance",
stats="count,min,max,mean"
)
print "Statistics of values in the trip_distance column:"
print "\tCount: {count:.0f}\n\tMin: {min:4.2f}\n\tMean: {mean:4.2f}\n\tMax: {max:4.2f}\n".format(**a1_resp["stats"])
# Aggregate Example 2
# Find unique taxi vendor IDs
a2_resp = table_taxi_obj.aggregate_unique(
column_name="vendor_id",
offset=0,
limit=gpudb.GPUdb.END_OF_SET,
encoding="json"
)["data"]["vendor_id"]
print "Unique vendor IDs in the taxi trip table:"
for vendor in a2_resp:
print "\t* {}".format(vendor)
print
# Aggregate Example 3
# Find number of trips per vendor; use binary decoding
a3_resp = table_taxi_obj.aggregate_group_by(
column_names=["vendor_id",
"count(vendor_id)"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET
)["data"]
print "Number of trips per vendor:"
for vendor in zip(
a3_resp["vendor_id"],
a3_resp["count(vendor_id)"]
):
print "\t{:<10s}{:4d}".format(vendor[0] + ":", vendor[1])
print
# Aggregate Example 4
# Create a histogram for the different groups of passenger counts
a4_resp = table_taxi_obj.aggregate_histogram(
column_name="passenger_count",
start=1,
end=6,
interval=1
)["counts"]
print "Passenger count groups by size:"
for histo_group in zip(
[1, 2, 3, 4, 5, 6],
a4_resp
):
print "\t{}s: {:4.0f}".format(*histo_group)
print "\n"
print "JOINS"
print "-----"
print
# Join Example 1 (Inner Join)
# Retrieve cab ride transactions and the full name of the associated vendor for rides having more than three
# passengers between April 1st & 16th, 2015
gpudb.GPUdbTable.create_join_table(
join_table_name=join_table_inner,
table_names=["taxi_trip_data as t",
"payment as p"],
column_names=["payment_id",
"payment_type",
"total_amount",
"passenger_count",
"vendor_id",
"trip_distance"],
expressions=["t.payment_id = p.payment_id",
"passenger_count > 3"],
options={"collection_name": collection},
db=h_db
)
join_table_inner_obj = gpudb.GPUdbTable(
_type=None,
name=join_table_inner,
db=h_db
)
j1_resp = join_table_inner_obj.get_records_by_column(
column_names=["payment_id",
"payment_type",
"total_amount",
"passenger_count",
"vendor_id",
"trip_distance"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET
)
print "Inner join table:"
print "Payment ID Payment Type Total Amount Passenger Count Vendor ID Trip Distance "
print "============ ============= ============= ================ ========== =============="
for record in zip(
j1_resp["payment_id"],
j1_resp["payment_type"],
j1_resp["total_amount"],
j1_resp["passenger_count"],
j1_resp["vendor_id"],
j1_resp["trip_distance"]
):
print "{:<12d} {:<13} {:<13f} {:<16d} {:<10s} {:<14f}".format(
record[0],
record[1],
record[2],
record[3],
record[4],
record[5]
)
print
# Join example 2 (Left Join)
# Retrieve cab ride transactions and the full name of the associated vendor (if available--blank if vendor name is
# unknown) for transactions with associated payment data, sorting by increasing values of transaction ID.
gpudb.GPUdbTable.create_join_table(
join_table_name=join_table_left,
table_names=["taxi_trip_data as t",
"vendor as v"],
column_names=["transaction_id",
"pickup_datetime",
"trip_distance",
"t.vendor_id",
"vendor_name"],
expressions=["left join t, v on (t.vendor_id = v.vendor_id)",
"payment_id <> 0"],
options={"collection_name": collection},
db=h_db
)
join_table_left_obj = gpudb.GPUdbTable(
_type=None,
name=join_table_left,
db=h_db
)
j2_resp = join_table_left_obj.get_records_by_column(
column_names=["transaction_id",
"pickup_datetime",
"trip_distance",
"vendor_id",
"vendor_name"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET
)
print "Left join table:"
print "Transaction ID Pickup (in secs since Epoch) Trip Distance Vendor ID Vendor Name "
print "============== ============================ ============= ========= =================="
for record in zip(
j2_resp["transaction_id"],
j2_resp["pickup_datetime"],
j2_resp["trip_distance"],
j2_resp["vendor_id"],
j2_resp["vendor_name"]
):
print "{:<14d} {:<28d} {:<13.2f} {:<9s} {:18s}".format(
record[0],
record[1],
record[2],
record[3],
record[4]
)
print
# Full outer joins require both tables to be replicated. Set merges like Union Distinct, Intersect, and Except
# need to use replicated tables to ensure the correct results. Copy the records from the taxi trip table to a
# replicated taxi trip table using /merge/records
gpudb.GPUdbTable.merge_records(
table_name=table_taxi_replicated, source_table_names=[table_taxi],
field_maps=[{
"transaction_id": "transaction_id",
"payment_id": "payment_id",
"vendor_id": "vendor_id",
"pickup_datetime": "pickup_datetime",
"dropoff_datetime": "dropoff_datetime",
"passenger_count": "passenger_count",
"trip_distance": "trip_distance",
"pickup_longitude": "pickup_longitude",
"pickup_latitude": "pickup_latitude",
"dropoff_longitude": "dropoff_longitude",
"dropoff_latitude": "dropoff_latitude"
}],
options={"collection_name": collection,
"is_replicated": "true"},
db=h_db
)
table_taxi_replicated_obj = gpudb.GPUdbTable(
_type=None,
name=table_taxi_replicated,
db=h_db
)
# Join Example 3 (Full Outer Join)
# Retrieve the vendor IDs of known vendors with no recorded cab ride transactions, as well as the vendor ID and
# number of transactions for unknown vendors with recorded cab ride transactions
gpudb.GPUdbTable.create_join_table(
join_table_name=join_table_outer,
table_names=["taxi_trip_data_replicated as t",
"vendor as v"],
column_names=["t.vendor_id as vendor_id",
"v.vendor_id as vendor_id_1"],
expressions=["full_outer join t,v on ((v.vendor_id = t.vendor_id))"],
options={"collection_name": collection},
db=h_db
)
join_table_outer_obj = gpudb.GPUdbTable(
_type=None,
name=join_table_outer,
db=h_db
)
# Aggregate the join table results by vendor ID and count the amount of records
j3_resp = join_table_outer_obj.aggregate_group_by(
column_names=["vendor_id_1 as vend_table_vendors",
"vendor_id as taxi_table_vendors",
"count(*) as total_records"], offset=0,
limit=gpudb.GPUdb.END_OF_SET,
options={"expression": "(is_null(vendor_id_1) OR is_null(vendor_id))"}
)["data"]
print "Outer join table:"
print "Vend. Table Vendors Taxi Table Vendors Total Records"
print "=================== ================== ============="
for record in zip(
j3_resp["vend_table_vendors"],
j3_resp["taxi_table_vendors"],
j3_resp["total_records"]
):
print "{:<19s} {:<18s} {:<13d}".format(
record[0],
record[1],
record[2]
)
print
print "\n"
print "PROJECTIONS"
print "-----------"
print
# Clear any existing projections with the same name (otherwise we won't be able to create the projections)
if h_db.has_table(table_name=projection_example1)["table_exists"]:
h_db.clear_table(projection_example1)
if h_db.has_table(table_name=projection_example2)["table_exists"]:
h_db.clear_table(projection_example2)
# Projection Example 1
# Create a projection containing all payments by credit card
table_payment_obj.create_projection(
projection_name=projection_example1,
column_names=["payment_id",
"payment_type",
"credit_type",
"payment_timestamp",
"fare_amount",
"surcharge",
"mta_tax",
"tip_amount",
"tolls_amount",
"total_amount"],
options={"collection_name": collection}
)
projection_example1_obj = gpudb.GPUdbTable(
_type=None,
name=projection_example1,
db=h_db
)
p1_resp = projection_example1_obj.get_records_by_column(
column_names=["payment_id",
"payment_type",
"credit_type",
"payment_timestamp",
"fare_amount",
"surcharge",
"mta_tax",
"tip_amount",
"tolls_amount",
"total_amount"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET
)
print "Projection of only credit payment types:"
print "Payment ID Payment Type Credit Type Timestamp Fare Surcharge MTA Tax Tip Tolls Total"
print "========== ============ ================ ============= ===== ========= ======= ===== ===== ====="
for record in zip(
p1_resp["payment_id"],
p1_resp["payment_type"],
p1_resp["credit_type"],
p1_resp["payment_timestamp"],
p1_resp["fare_amount"],
p1_resp["surcharge"],
p1_resp["mta_tax"],
p1_resp["tip_amount"],
p1_resp["tolls_amount"],
p1_resp["total_amount"]
):
print "{:<10} {:<12} {:<16} {:<13} {:^5.2f} {:^9.2f} {:^7.2f} {:^5.2f} {:^5.2f} {:^5.2f}".format(
record[0],
record[1],
record[2],
record[3],
record[4],
record[5],
record[6],
record[7],
record[8],
record[9]
)
print
# Projection Example 2
# Create a persisted projection of cab ride trips longer than 5 miles that started during lunch hours
table_taxi_obj.create_projection(
projection_name=projection_example2,
column_names=["hour(pickup_datetime) as hour_of_day",
"vendor_id",
"passenger_count",
"trip_distance"],
options={"expression": "(hour(pickup_datetime) >=11) AND (hour(pickup_datetime) <= 14) AND (trip_distance > 5)",
"collection_name": collection,
"persist": "true"}
)
projection_example2_obj = gpudb.GPUdbTable(
_type=None,
name=projection_example2,
db=h_db
)
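# Retrieve the lunch-hour, long-distance trips from the persisted projection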
p2_resp = projection_example2_obj.get_records_by_column(
column_names=["hour_of_day",
"vendor_id",
"passenger_count",
"trip_distance"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET
)
print "Projection of long trips taken during lunch hours:"
print "Hour of Day Vendor ID Passenger Count Trip Distance"
print "=========== ========= =============== ============="
for record in zip(
p2_resp["hour_of_day"],
p2_resp["vendor_id"],
p2_resp["passenger_count"],
p2_resp["trip_distance"]
):
print "{:<11} {:<9} {:<15} {:<13.2f}".format(
record[0],
record[1],
record[2],
record[3]
)
print "\n"
print "UNIONS"
print "------"
print
# Clear any existing tables with the same name (otherwise we won't be able to create the tables)
if h_db.has_table(table_name=agg_grpby_union_all_src1)["table_exists"]:
h_db.clear_table(agg_grpby_union_all_src1)
if h_db.has_table(table_name=agg_grpby_union_all_src2)["table_exists"]:
h_db.clear_table(agg_grpby_union_all_src2)
if h_db.has_table(table_name=projection_intersect_src)["table_exists"]:
h_db.clear_table(projection_intersect_src)
if h_db.has_table(table_name=projection_except_src1)["table_exists"]:
h_db.clear_table(projection_except_src1)
if h_db.has_table(table_name=projection_except_src2)["table_exists"]:
h_db.clear_table(projection_except_src2)
if h_db.has_table(table_name=union_all_table)["table_exists"]:
h_db.clear_table(union_all_table)
if h_db.has_table(table_name=union_except_table)["table_exists"]:
h_db.clear_table(union_except_table)
if h_db.has_table(table_name=union_intersect_table)["table_exists"]:
h_db.clear_table(union_intersect_table)
# Union Example 1 (Union All)
# Calculate the average number of passengers, as well as the shortest, average, and longest trip distances, for all
# trips in each of two time periods (April 1-15, 2015 and April 16-23, 2015), and return both sets of statistics
# in a single result set
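# Compute table-wide statistics for the April 1-15 window; the result_table option stores the output in a new table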
table_taxi_obj.aggregate_group_by(
column_names=["avg(passenger_count) as avg_pass_count",
"avg(trip_distance) as avg_trip_dist",
"min(trip_distance) as min_trip_dist",
"max(trip_distance) as max_trip_dist"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET,
options={"expression": "((pickup_datetime >= '2015-04-01') AND (pickup_datetime <= '2015-04-15 23:59:59.999'))",
"result_table": agg_grpby_union_all_src1,
"collection_name": collection}
)
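# Compute the same statistics for the April 16-23 window, also stored in a result table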
table_taxi_obj.aggregate_group_by(
column_names=["avg(passenger_count) as avg_pass_count",
"avg(trip_distance) as avg_trip_dist",
"min(trip_distance) as min_trip_dist",
"max(trip_distance) as max_trip_dist"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET,
options={"expression": "((pickup_datetime >= '2015-04-16') AND (pickup_datetime <= '2015-04-23 23:59:59.999'))",
"result_table": agg_grpby_union_all_src2,
"collection_name": collection}
)
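# Union the two result tables; the quoted date-range literals in input_column_names become the pickup_window_range label for each source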
gpudb.GPUdbTable.create_union(
table_name=union_all_table,
table_names=[agg_grpby_union_all_src1,
agg_grpby_union_all_src2],
input_column_names=[
["'2015-04-01 - 2014-04-15'", "avg_pass_count", "avg_trip_dist", "min_trip_dist", "max_trip_dist"],
["'2015-04-16 - 2015-04-23'", "avg_pass_count", "avg_trip_dist", "min_trip_dist", "max_trip_dist"]
],
output_column_names=["pickup_window_range",
"avg_pass_count",
"avg_trip",
"min_trip",
"max_trip"],
options={"mode": "union_all",
"collection_name": collection},
db=h_db
)
union_all_table_obj = gpudb.GPUdbTable(
_type=None,
name=union_all_table,
db=h_db
)
u1_resp = union_all_table_obj.get_records_by_column(
column_names=["pickup_window_range",
"avg_pass_count",
"avg_trip",
"min_trip",
"max_trip"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET
)
print "Union All table:"
print "Pickup Window Range Avg. Pass. Count Avg. Trip Min. Trip Max Trip "
print "======================= ================ ============= ============== ============="
for record in zip(
u1_resp["pickup_window_range"],
u1_resp["avg_pass_count"],
u1_resp["avg_trip"],
u1_resp["min_trip"],
u1_resp["max_trip"]
):
print "{:<23} {:<16} {:<13} {:<14} {:<13}".format(
record[0],
record[1],
record[2],
record[3],
record[4]
)
print
# Union Example 2 (Intersect)
# Retrieve locations (as lat/lon pairs) that were both pick-up and drop-off points
table_taxi_replicated_obj.create_projection(
projection_name=projection_intersect_src,
column_names=["pickup_latitude",
"pickup_longitude"],
options={"expression": "((pickup_latitude > 0) AND (pickup_longitude <> 0))",
"collection_name": collection})
gpudb.GPUdbTable.create_union(
table_name=union_intersect_table,
table_names=[projection_intersect_src,
table_taxi_replicated],
input_column_names=[["pickup_latitude", "pickup_longitude"],
["dropoff_latitude", "dropoff_longitude"]],
output_column_names=["pickup_latitude", "pickup_longitude"],
options={"mode": "intersect",
"collection_name": collection},
db=h_db
)
union_intersect_table_obj = gpudb.GPUdbTable(
_type=None,
name=union_intersect_table,
db=h_db
)
u2_resp = union_intersect_table_obj.get_records_by_column(
column_names=["pickup_latitude", "pickup_longitude"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET)
print "Intersect table:"
print "Pickup Latitude Pickup Longitude"
print "=============== ================"
for record in zip(u2_resp["pickup_latitude"], u2_resp["pickup_longitude"]):
print "{:<15} {:<16}".format(
record[0],
record[1]
)
print
# Union Example 3 (Except)
# Show vendors that operate before noon, but not after noon: retrieve the unique list of IDs of vendors who
# provided cab rides between midnight and noon, and remove from that list the IDs of any vendors who provided cab
# rides between noon and midnight
table_taxi_replicated_obj.create_projection(
projection_name=projection_except_src1,
column_names=["vendor_id"],
options={"expression": "((HOUR(pickup_datetime) >= 0) AND (HOUR(pickup_datetime) <= 11))",
"collection_name": collection}
)
table_taxi_replicated_obj.create_projection(
projection_name=projection_except_src2,
column_names=["vendor_id"],
options={"expression": "((HOUR(pickup_datetime) >= 12) AND (HOUR(pickup_datetime) <= 23))",
"collection_name": collection}
)
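# Take the set difference: vendor IDs found in the morning projection but not in the afternoon projection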
gpudb.GPUdbTable.create_union(
table_name=union_except_table,
table_names=[projection_except_src1,
projection_except_src2],
input_column_names=[["vendor_id"],
["vendor_id"]],
output_column_names=["vendor_id"],
options={"mode": "except",
"collection_name": collection},
db=h_db
)
union_except_table_obj = gpudb.GPUdbTable(
_type=None,
name=union_except_table,
db=h_db
)
u3_resp = union_except_table_obj.get_records_by_column(
column_names=["vendor_id"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET
)
print "Except table:"
print "Vendor ID"
print "========="
for record in zip(u3_resp["vendor_id"]):
print "{:<9}".format(record[0])
print "\n"
print "DELETING DATA"
print "-------------"
print
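# Count the records matching the deletion criteria, delete them, then re-count to confirm they were removed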
pre_delete_count = table_payment_obj.filter(expression="payment_type = 'Cash'").size()
print "Number of records that meet deletion criteria before deleting: {}".format(pre_delete_count)
table_payment_obj.delete_records(expressions=["payment_type = 'Cash'"])
post_delete_count = table_payment_obj.filter(expression="payment_type = 'Cash'").size()
print "Number of records that meet deletion criteria after deleting: {}".format(post_delete_count)
print "\n"
print "CLEARING TABLES AND COLLECTIONS"
print "-------------------------------"
print
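# Drop the taxi trip tables and the tutorial collection created by this example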
h_db.clear_table(table_name=table_taxi)
h_db.clear_table(table_name=table_taxi_replicated)
h_db.clear_table(table_name=collection)
print "Deleted tables: OK"
print
# end gpudb_example()
if __name__ == "__main__":
gpudb_example()
Running the sample script above produces the following output:
TUTORIAL OUTPUT
===============
CREATING TYPES & TABLES
-----------------------
Vendor Table
************
Vendor table successfully created
Payment Table
*************
Payment table successfully created
Taxi Table
**********
Taxi table object successfully created
INSERTING DATA
--------------
Number of single records inserted into the Vendor table: 1
Number of batch records inserted into the Vendor table: 7
Number of batch records inserted into the Payment table: 26
RETRIEVING DATA
---------------
Payment ID Payment Type Credit Type Payment Timestamp Fare Amount Surcharge MTA Tax Tip Amount Tolls Amount Total Amount
========== ============ =========== ================= =========== ========= ======= ========== ============ ============
150 None None 1430432447000 7.50 0.00 0.50 0.00 0.00 8.30
181 None None 1430135461000 6.50 0.50 0.50 0.00 0.00 7.80
119 None None 1430431471000 9.50 0.00 0.50 0.00 0.00 10.30
176 Cash None 1428403962000 9.00 0.50 0.50 2.06 0.00 12.36
109 Cash None 1428948513000 22.50 0.50 0.50 4.75 0.00 28.55
180 Cash None 1428965823000 6.50 0.50 0.50 1.00 0.00 8.80
134 Cash None 1429472668000 33.50 0.50 0.50 0.00 0.00 34.80
148 Cash None 1430124581000 9.50 0.00 0.50 1.00 0.00 11.30
114 Cash None 1428259673000 5.50 0.00 0.50 1.89 0.00 8.19
132 Cash None 1429472779000 6.50 0.50 0.50 1.55 0.00 9.35
Vendor ID Vendor Name Phone Email HQ Street HQ City HQ State HQ Zip Code # Employees # Cabs
========= ========================== =========== ============================= ======================== ========= ========= =========== =========== ======
VTS Vine Taxi Service 9998880001 admin@vtstaxi.com 26 Summit St. Flushing NY 11354 400 450
YCAB Yes Cab 7895444321 None 97 Edgemont St. Brooklyn NY 11223 445 425
NYC New York City Cabs None support@nyc-taxis.com 9669 East Bayport St. Bronx NY 10453 505 500
DDS Dependable Driver Service None None 8554 North Homestead St. Bronx NY 10472 200 124
CMT Crazy Manhattan Taxi 9778896500 admin@crazymanhattantaxi.com 950 4th Road Suite 78 Brooklyn NY 11210 500 468
TNY Taxi New York None None 725 Squaw Creek St. Bronx NY 10458 315 305
NYMT New York Metro Taxi None None 4 East Jennings St. Brooklyn NY 11228 166 150
5BTC Five Boroughs Taxi Co. 4566541278 mgmt@5btc.com 9128 Lantern Street Brooklyn NY 11229 193 175
UPDATING & REMOVING RECORDS
---------------------------
Updated DDS vendor information:
Vendor ID Vendor Name Phone Email HQ Street HQ City HQ Zip Code # Employees # Cabs
========= ========================= ========== ============================ ======================== ======== =========== =========== ======
DDS Dependable Driver Service None management@ddstaxico.com 8554 North Homestead St. Bronx 10472 156 213
Records in the payment table (before delete): 26
Deleting record where payment_id = 189
Records in the payment table (after delete): 25
ALTER TABLE
-----------
Index added to vendor table: {u'type_id': u'', u'value': u'vendor_id', u'type_definition': u'', u'table_name': u'vendor', u'action': u'create_index', 'status_info': {u'status': u'OK', u'data_type': u'alter_table_response_avro', u'message': u'', 'response_time': 0.00292}, u'label': u'', u'properties': {}}
Index added to taxi_trip_data table: {u'type_id': u'', u'value': u'vendor_id', u'type_definition': u'', u'table_name': u'taxi_trip_data', u'action': u'create_index', 'status_info': {u'status': u'OK', u'data_type': u'alter_table_response_avro', u'message': u'', 'response_time': 0.00302}, u'label': u'', u'properties': {}}
Compression added to 'pickup_datetime' column: {u'type_id': u'', u'value': u'pickup_datetime', u'type_definition': u'', u'table_name': u'taxi_trip_data', u'action': u'set_column_compression', 'status_info': {u'status': u'OK', u'data_type': u'alter_table_response_avro', u'message': u'', 'response_time': 0.00302}, u'label': u'', u'properties': {}}
Compression added to 'dropoff_datetime' column: {u'type_id': u'', u'value': u'dropoff_datetime', u'type_definition': u'', u'table_name': u'taxi_trip_data', u'action': u'set_column_compression', 'status_info': {u'status': u'OK', u'data_type': u'alter_table_response_avro', u'message': u'', 'response_time': 0.0051}, u'label': u'', u'properties': {}}
FILTERING
---------
Number of null payments: 3
Number of null payments with a fare amount greater than $8.00 (with query chaining): 1
Number of records where vendor_id is either NYC or YCAB: 498
Number of trips with a passenger count between 1 and 3: 972
AGGREGATING, GROUPING, AND HISTOGRAMS
-------------------------------------
Statistics of values in the trip_distance column:
Count: 1081
Min: 0.00
Mean: 2.86
Max: 21.88
Unique vendor IDs in the taxi trip table:
* CMT
* DDS
* LYFT
* NYC
* UBER
* VTS
* YCAB
Number of trips per vendor:
LYFT: 3
UBER: 3
DDS: 106
VTS: 229
YCAB: 236
CMT: 242
NYC: 262
Passenger count groups by size:
1s: 775
2s: 145
3s: 52
4s: 22
5s: 87
JOINS
-----
Inner join table:
Payment ID Payment Type Total Amount Passenger Count Vendor ID Trip Distance
============ ============= ============= ================ ========== ==============
176 Cash 12.360000 5 NYC 1.220000
148 Cash 11.300000 6 NYC 1.170000
136 Cash 6.300000 5 NYC 2.310000
Left join table:
Transaction ID Pickup (in secs since Epoch) Trip Distance Vendor ID Vendor Name
============== ============================ ============= ========= ==================
169677687 1430431112000 6.50 YCAB Yes Cab
109329823 1430431115000 1.20 YCAB Yes Cab
132706957 1429035716000 1.02 NYC New York City Cabs
107008976 1428880198000 1.66 NYC New York City Cabs
164839979 1429471893000 1.02 NYC New York City Cabs
192545327 1428403054000 1.22 NYC New York City Cabs
195905599 1428716058000 12.03 NYC New York City Cabs
129891539 1428948421000 0.51 NYC New York City Cabs
149211614 1427981386000 1.50 YCAB Yes Cab
104506140 1428674077000 1.40 YCAB Yes Cab
181667689 1430135005000 3.03 UBER None
186246812 1429413166000 6.10 LYFT None
161650444 1428965054000 3.09 NYC New York City Cabs
170101359 1428922783000 0.99 NYC New York City Cabs
191386113 1428669361000 1.30 YCAB Yes Cab
174771463 1429471891000 2.01 NYC New York City Cabs
199751611 1428808559000 0.50 YCAB Yes Cab
132760909 1430124124000 1.17 NYC New York City Cabs
104118718 1428716058000 1.53 UBER None
163836875 1428520490000 0.18 LYFT None
155194734 1430135005000 1.08 NYC New York City Cabs
111998195 1428259167000 2.10 YCAB Yes Cab
112901023 1429799817000 2.37 NYC New York City Cabs
174019433 1429471890000 2.56 NYC New York City Cabs
190120416 1429413163000 1.49 LYFT None
139294057 1428813585000 4.30 YCAB Yes Cab
106041671 1429869346000 0.79 NYC New York City Cabs
177106037 1430431112000 1.20 YCAB Yes Cab
178732586 1430135005000 8.39 NYC New York City Cabs
160856304 1428716062000 2.31 NYC New York City Cabs
128530951 1428672482000 1.00 YCAB Yes Cab
197243309 1430135005000 2.00 UBER None
Outer join table:
Vend. Table Vendors Taxi Table Vendors Total Records
=================== ================== =============
5BTC None 1
NYMT None 1
TNY None 1
None LYFT 3
None UBER 3
PROJECTIONS
-----------
Projection of payment details:
Payment ID Payment Type Credit Type Timestamp Fare Surcharge MTA Tax Tip Tolls Total
========== ============ ================ ============= ===== ========= ======= ===== ===== =====
176 Cash None 1428403962000 9.00 0.50 0.50 2.06 0.00 12.36
198 Credit MasterCard 1429472636000 9.00 0.00 0.50 0.00 0.00 9.80
150 None None 1430432447000 7.50 0.00 0.50 0.00 0.00 8.30
170 No Charge None 1430431502000 28.60 0.00 0.50 0.00 0.00 28.60
109 Cash None 1428948513000 22.50 0.50 0.50 4.75 0.00 28.55
140 Credit Visa None 28.00 0.00 0.50 0.00 0.00 28.80
159 Credit Visa 1428674487000 7.00 0.00 0.50 0.00 0.00 7.80
107 Credit MasterCard 1428717377000 5.00 0.50 0.50 0.00 0.00 6.30
180 Cash None 1428965823000 6.50 0.50 0.50 1.00 0.00 8.80
199 Credit Visa None 6.00 1.00 0.50 1.00 0.00 8.50
134 Cash None 1429472668000 33.50 0.50 0.50 0.00 0.00 34.80
161 Credit Visa None 7.00 0.00 0.50 0.00 0.00 7.80
187 Credit American Express 1428670181000 14.00 0.00 0.50 0.00 0.00 14.80
148 Cash None 1430124581000 9.50 0.00 0.50 1.00 0.00 11.30
166 Credit American Express 1428808723000 17.50 0.00 0.50 0.00 0.00 18.30
114 Cash None 1428259673000 5.50 0.00 0.50 1.89 0.00 8.19
132 Cash None 1429472779000 6.50 0.50 0.50 1.55 0.00 9.35
193 Cash None None 3.50 1.00 0.50 1.59 0.00 6.89
181 None None 1430135461000 6.50 0.50 0.50 0.00 0.00 7.80
136 Cash None 1428716521000 4.00 0.50 0.50 1.00 0.00 6.30
100 Cash None None 9.00 0.00 0.50 2.90 0.00 12.70
156 Credit MasterCard 1428672753000 12.50 0.50 0.50 0.00 0.00 13.80
125 Credit Discover 1429869673000 8.50 0.50 0.50 0.00 0.00 9.80
119 None None 1430431471000 9.50 0.00 0.50 0.00 0.00 10.30
123 No Charge None 1430136649000 20.00 0.50 0.50 0.00 0.00 21.30
Projection of long trips taken during lunch hours:
Hour of Day Vendor ID Passenger Count Trip Distance
=========== ========= =============== =============
12 NYC 1 5.07
14 NYC 6 9.02
13 NYC 1 5.38
14 YCAB 1 8.60
14 YCAB 2 17.00
12 NYC 3 12.01
13 NYC 1 21.88
11 NYC 5 7.26
13 YCAB 1 20.70
13 YCAB 1 7.10
14 YCAB 1 18.00
14 NYC 5 16.92
12 NYC 2 5.70
14 NYC 1 9.72
14 NYC 3 5.96
11 YCAB 1 6.00
13 DDS 1 12.50
12 VTS 1 9.65
13 VTS 3 12.56
11 VTS 2 11.43
11 VTS 1 6.33
14 CMT 1 9.90
14 CMT 1 8.20
12 CMT 1 9.80
11 CMT 1 16.70
12 CMT 1 9.50
11 NYC 1 8.39
UNIONS
------
Union All table:
Pickup Window Range Avg. Pass. Count Avg. Trip Min. Trip Max Trip
======================= ================ ============= ============== =============
2015-04-01 - 2015-04-15 1.58052434457 3.05310860405 0.0 21.8799991608
2015-04-16 - 2015-04-23 1.75187969925 3.13285715217 0.379999995232 19.3999996185
Intersect table:
Pickup Latitude Pickup Longitude
=============== ================
40.7513122559 -73.9750518799
40.7210159302 -74.0053863525
41.3661384583 -73.1373901367
40.7258033752 -73.7250137329
40.7733917236 -73.9810638428
40.7639274597 -73.9019927979
40.7608184814 -73.9800567627
40.7504119873 -73.9906692505
40.758190155 -73.9373474121
40.7640647888 -73.9618606567
40.7149124146 -74.0074920654
40.7145347595 -73.942199707
40.6573753357 -73.7936782837
40.6492347717 -74.2553405762
Except table:
Vendor ID
=========
UBER
DELETING DATA
-------------
Number of records that meet deletion criteria before deleting: 10
Number of records that meet deletion criteria after deleting: 0
CLEARING TABLES AND COLLECTIONS
-------------------------------
Deleted tables: OK