The following guide provides step-by-step instructions to get started writing Python applications using Kinetica. This guide demonstrates only a small subset of the available API. A detailed description of the complete interface is available under Python API Reference.
The native Kinetica Python API is accessible through the following means:
In default Kinetica installations, the native Python API is located in the /opt/gpudb/api/python directory. The /opt/gpudb/bin/gpudb_python wrapper script is provided, which sets the execution environment appropriately.
Test the installation:
/opt/gpudb/bin/gpudb_python /opt/gpudb/api/python/examples/example.py
Important
When developing on the Kinetica server, use /opt/gpudb/bin/gpudb_python to run Python programs and /opt/gpudb/bin/gpudb_pip to install dependent libraries.
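For example, a dependent library could be installed with the bundled pip like so (the package name below is purely illustrative):
/opt/gpudb/bin/gpudb_pip install python-dateutil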
Python 2.7 (or greater) is necessary for downloading the API from GitHub:
In the desired directory, run the following, replacing <kinetica-version> with the name of the installed Kinetica version, e.g., v6.2.0:
git clone -b release/<kinetica-version> --single-branch https://github.com/kineticadb/kinetica-api-python.git
Change directory into the newly downloaded repository:
cd kinetica-api-python
In the root directory of the cloned repository, install the Kinetica API:
sudo python setup.py install
Test the installation:
python examples/example.py
The Python package manager, pip, is required to install the API from PyPI.
Install the API:
pip install gpudb --upgrade
Test the installation:
python -c "import gpudb;print('Import Successful')"
If Import Successful is displayed, the API has been installed and is ready for use.
The tutorial script makes reference to a data file in the current directory. This path can be updated to point to a valid path on the host where the file will be located, or the script can be run with the data file in the current directory.
# Modules used by the tutorial script
import collections
import csv
import json

import gpudb

taxi_data = csv.reader(open('taxi_trip_data.csv'))
To interact with Kinetica, you must first instantiate an object of the GPUdb
class while providing the connection URL, including the host and port of the
database server.
h_db = gpudb.GPUdb(
encoding="BINARY",
host="127.0.0.1",
port="9191"
)
If the server is configured to authenticate users, the username and password will also need to be provided when creating the object:
h_db = gpudb.GPUdb(
encoding="BINARY",
host="127.0.0.1",
port="9191",
username="user1",
password="mysecret"
)
Before a table can be created and any data can be loaded into the database, a type needs to be defined. The type is defined as a list of lists of strings. Each column comprises at least two values: a column name (always the first value) and a primitive type (always the second value). Any subsequent values are column properties. You can also use the GPUdbRecordColumn and GPUdbColumnProperty objects in addition to literal string values to help define a column. The following _ColumnType constants are available:
_ColumnType.INT
_ColumnType.LONG
_ColumnType.FLOAT
_ColumnType.DOUBLE
_ColumnType.STRING
_ColumnType.BYTES
# Create a type from a list of lists. Each list below is an individual
# column. Each column comprises at least two values: a column name (always
# the first value) and a primitive type (always the second value). Any
# subsequent values are column properties. The order of the columns defines
# the order in which values must be inserted into the table, e.g., a
# "vendor_name" value cannot be inserted before a "vendor_id" value
vendor_columns = [
# column types and properties can be listed as strings
["vendor_id", "string", "char4", "primary_key"],
["vendor_name", "string", "char64"],
["phone", "string", "char16", "nullable"],
["email", "string", "char64", "nullable"],
["hq_street", "string", "char64"],
# column types and properties can also be listed using the
# GPUdbRecordColumn and GPUdbColumnProperty objects
[
"hq_city",
gpudb.GPUdbRecordColumn._ColumnType.STRING,
gpudb.GPUdbColumnProperty.CHAR8,
gpudb.GPUdbColumnProperty.DICT
],
[
"hq_state",
gpudb.GPUdbRecordColumn._ColumnType.STRING,
gpudb.GPUdbColumnProperty.CHAR8,
gpudb.GPUdbColumnProperty.DICT
],
["hq_zip", gpudb.GPUdbRecordColumn._ColumnType.INT],
["num_emps", gpudb.GPUdbRecordColumn._ColumnType.INT],
["num_cabs", gpudb.GPUdbRecordColumn._ColumnType.INT]
]
Important
The order of the columns defines the order in which values must be inserted into the table, e.g., a vendor_name value cannot be inserted before a vendor_id value.
After you've created a type list and stored it in a variable, you can create a table using the GPUdbTable interface. For this tutorial, three tables will be used:
# Create the table from the type and place it in a collection
try:
    table_vendor_obj = gpudb.GPUdbTable(
        _type=vendor_columns,
        name=table_vendor,
        options={
            "collection_name": collection,
            "is_replicated": "true"
        },
        db=h_db
    )
    print "Vendor table successfully created"
except gpudb.GPUdbException as e:
    print "Vendor table creation failure: {}".format(str(e))
Important
At this point in the tutorial, a Payment and a Taxi Trip Data type and table are created.
To create a table object for a table that already exists, pass in a _type of None.
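A minimal sketch, reusing the Vendor table created above:
# Create a table object for the existing Vendor table; with a _type of
# None, the column types are looked up from the database
table_vendor_obj = gpudb.GPUdbTable(_type=None, name=table_vendor, db=h_db)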
To pass in table creation options, you can either specify a map (as with the Vendor table above) or use the GPUdbTableOptions object, as follows:
# Create an "options" object containing the table creation options
collection_option_object = gpudb.GPUdbTableOptions.default().collection_name(collection)
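The options object can then be passed as the options argument at table creation. A minimal sketch, assuming the payment_columns type list and table_payment name used by the tutorial's Payment table:
# Create the Payment table, passing the options object in place of a map
table_payment_obj = gpudb.GPUdbTable(
    _type=payment_columns,
    name=table_payment,
    options=collection_option_object,
    db=h_db
)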
You can insert single records or multiple records using the insert_records method. As mentioned previously, the number and order of values must match the column declarations found in the type creation step. In Python, you can pass a value of None for null values, but the column must be declared nullable when creating the type schema.
Note
The following methods are not the only way to insert records, but they are the simplest.
In this example, a dictionary object will be used to map key/value pairs to table column values.
payment_datum = collections.OrderedDict()
payment_datum["payment_id"] = 189
payment_datum["payment_type"] = "No Charge"
payment_datum["credit_type"] = None
payment_datum["payment_timestamp"] = None
payment_datum["fare_amount"] = 6.5
payment_datum["surcharge"] = 0
payment_datum["mta_tax"] = 0.6
payment_datum["tip_amount"] = 0
payment_datum["tolls_amount"] = 0
payment_datum["total_amount"] = 7.1
# Insert the record into the table
table_payment_obj.insert_records(payment_datum)
print "Number of records inserted into the Payment table: {}".format(
    table_payment_obj.size()
)
# Save the current table size, used later to report the size of the
# next batch insert
payment_table_size_after_first_insert = table_payment_obj.size()
In these examples, records will be defined in-line and added to a list.
# Create a list of in-line records. The order of the values must match the
# column order in the type
vendor_records = [
["VTS", "Vine Taxi Service", "9998880001", "admin@vtstaxi.com",
"26 Summit St.", "Flushing", "NY", 11354, 450, 400],
["YCAB", "Yes Cab", "7895444321", None, "97 Edgemont St.", "Brooklyn",
"NY", 11223, 445, 425],
["NYC", "New York City Cabs", None, "support@nyc-taxis.com",
"9669 East Bayport St.", "Bronx", "NY", 10453, 505, 500],
["DDS", "Dependable Driver Service", None, None,
"8554 North Homestead St.", "Bronx", "NY", 10472, 200, 124],
["CMT", "Crazy Manhattan Taxi", "9778896500",
"admin@crazymanhattantaxi.com", "950 4th Road Suite 78", "Brooklyn",
"NY", 11210, 500, 468],
["TNY", "Taxi New York", None, None, "725 Squaw Creek St.", "Bronx",
"NY", 10458, 315, 305],
["NYMT", "New York Metro Taxi", None, None, "4 East Jennings St.",
"Brooklyn", "NY", 11228, 166, 150],
["5BTC", "Five Boroughs Taxi Co.", "4566541278", "mgmt@5btc.com",
"9128 Lantern Street", "Brooklyn", "NY", 11229, 193, 175]
]
# Insert the records into the Vendor table
table_vendor_obj.insert_records(vendor_records)
print "Number of records inserted into the Vendor table: {}".format(
table_vendor_obj.size()
)
# Create another list of in-line records
payment_records = [
[136, "Cash", None, 1428716521000, 4, 0.5, 0.5, 1, 0, 6.3],
[148, "Cash", None, 1430124581000, 9.5, 0, 0.5, 1, 0, 11.3],
[114, "Cash", None, 1428259673000, 5.5, 0, 0.5, 1.89, 0, 8.19],
[180, "Cash", None, 1428965823000, 6.5, 0.5, 0.5, 1, 0, 8.8],
[109, "Cash", None, 1428948513000, 22.5, 0.5, 0.5, 4.75, 0, 28.55],
[132, "Cash", None, 1429472779000, 6.5, 0.5, 0.5, 1.55, 0, 9.35],
[134, "Cash", None, 1429472668000, 33.5, 0.5, 0.5, 0, 0, 34.8],
[176, "Cash", None, 1428403962000, 9, 0.5, 0.5, 2.06, 0, 12.36],
[100, "Cash", None, None, 9, 0, 0.5, 2.9, 0, 12.7],
[193, "Cash", None, None, 3.5, 1, 0.5, 1.59, 0, 6.89],
[140, "Credit", "Visa", None, 28, 0, 0.5, 0, 0, 28.8],
[161, "Credit", "Visa", None, 7, 0, 0.5, 0, 0, 7.8],
[199, "Credit", "Visa", None, 6, 1, 0.5, 1, 0, 8.5],
[159, "Credit", "Visa", 1428674487000, 7, 0, 0.5, 0, 0, 7.8],
[156, "Credit", "MasterCard", 1428672753000, 12.5, 0.5, 0.5, 0, 0, 13.8],
[198, "Credit", "MasterCard", 1429472636000, 9, 0, 0.5, 0, 0, 9.8],
[107, "Credit", "MasterCard", 1428717377000, 5, 0.5, 0.5, 0, 0, 6.3],
[166, "Credit", "American Express", 1428808723000, 17.5, 0, 0.5, 0, 0, 18.3],
[187, "Credit", "American Express", 1428670181000, 14, 0, 0.5, 0, 0, 14.8],
[125, "Credit", "Discover", 1429869673000, 8.5, 0.5, 0.5, 0, 0, 9.8],
[119, None, None, 1430431471000, 9.5, 0, 0.5, 0, 0, 10.3],
[150, None, None, 1430432447000, 7.5, 0, 0.5, 0, 0, 8.3],
[170, "No Charge", None, 1430431502000, 28.6, 0, 0.5, 0, 0, 28.6],
[123, "No Charge", None, 1430136649000, 20, 0.5, 0.5, 0, 0, 21.3],
[181, None, None, 1430135461000, 6.5, 0.5, 0.5, 0, 0, 7.8]
]
# Insert the records into the Payment table
for record in payment_records:
table_payment_obj.insert_records(record)
print "Number of records inserted into the Payment table: {}".format(
table_payment_obj.size() - payment_table_size_after_first_insert
)
This example requires the csv Python module but allows for importing a large number of records with ease. After setting up a CSV reader, you can loop over all values in a .csv file, appending each record's values to a list of lists, then insert the list.
taxi_data = csv.reader(open('taxi_trip_data.csv'))
# Skip the header row
taxi_data.next()
taxi_records = []
for record in taxi_data:
    record_data = []
    record_data.append(long(record[0]))
    record_data.append(long(record[1]))
    record_data.append(record[2])
    record_data.append(long(record[3]))
    record_data.append(long(record[4]))
    record_data.append(int(record[5]))
    record_data.append(float(record[6]))
    record_data.append(float(record[7]))
    record_data.append(float(record[8]))
    record_data.append(float(record[9]))
    record_data.append(float(record[10]))
    taxi_records.append(record_data)
table_taxi_obj.insert_records(taxi_records)
Once the table is populated with data, the data can be retrieved using binary encoding or JSON encoding. Binary encoding is more efficient than JSON encoding. The example below uses the GPUdbTable interface and binary encoding:
# Retrieve no more than 10 records from the Payment table using the
# GPUdbTable interface
print "{:<10s} {:<12s} {:<11s} {:<17s} {:<11s} {:<9s} {:<7s} {:<10s} " \
"{:<12s} {:<12s}".format(
"Payment ID",
"Payment Type",
"Credit Type",
"Payment Timestamp",
"Fare Amount",
"Surcharge",
"MTA Tax",
"Tip Amount",
"Tolls Amount",
"Total Amount"
)
print "{:=<10s} {:=<12s} {:=<11s} {:=<17s} {:=<11s} {:=<9s} {:=<7s} {:=<10s} " \
"{:=<12s} {:=<12s}".format(
"", "", "", "", "", "", "", "", "", ""
)
for record in table_payment_obj.get_records(
    offset=0,
    limit=10,
    encoding="binary",
    options={"sort_by": "payment_id"}
):
    print "{payment_id:<10d} {payment_type:<12s} {credit_type:<11s} " \
          "{payment_timestamp:<17} {fare_amount:11.2f} {surcharge:9.2f} " \
          "{mta_tax:7.2f} {tip_amount:10.2f} {tolls_amount:12.2f} " \
          "{total_amount:12.2f}".format(
              payment_id=record["payment_id"],
              payment_type=record["payment_type"],
              credit_type=record["credit_type"],
              payment_timestamp=record["payment_timestamp"],
              fare_amount=record["fare_amount"],
              surcharge=record["surcharge"],
              mta_tax=record["mta_tax"],
              tip_amount=record["tip_amount"],
              tolls_amount=record["tolls_amount"],
              total_amount=record["total_amount"]
          )
This example demonstrates JSON-encoded record retrieval using the GPUdb interface:
# Retrieve all records from the Vendor table using the GPUdb interface
print "{:<9s} {:<26s} {:<11s} {:<29s} {:<24s} {:<9s} {:<9s} {:<11s} {:<11s} {:<6s}".format(
"Vendor ID",
"Vendor Name",
"Phone",
"Email",
"HQ Street",
"HQ City",
"HQ State",
"HQ Zip Code",
"# Employees",
"# Cabs"
)
print "{:=<9s} {:=<26s} {:=<11s} {:=<29s} {:=<24s} {:=<9s} {:=<9s} " \
"{:=<11s} {:=<11s} {:=<6s}".format(
"", "", "", "", "", "", "", "", "", ""
)
vendor_records_gpudb = h_db.get_records(
table_name=table_vendor,
offset=0,
limit=gpudb.GPUdb.END_OF_SET,
encoding="json",
options={"sort_by": "vendor_id"}
)["records_json"]
for record in vendor_records_gpudb:
    print "{vendor_id:<9s} {vendor_name:<26s} {phone:<11} {email:<29s} " \
          "{hq_street:<24s} {hq_city:<9s} {hq_state:<9s} {hq_zip:<11d} " \
          "{num_emps:11d} {num_cabs:6d}".format(
              **json.loads(record)
          )
For large tables, the data can easily be retrieved in smaller blocks by using the offset and limit parameters. The returned response also contains the schema (or data type) of the results.
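A minimal sketch of block-wise retrieval against the Taxi table (the block size here is arbitrary):
# Page through the Taxi table in blocks of 1,000 records
offset = 0
block_size = 1000
while True:
    records = table_taxi_obj.get_records(offset=offset, limit=block_size)
    if not records:
        break
    # ... process this block of records here ...
    offset += block_size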
Using any GPUdbTable object, you can update records via the update_records method.
# Update the e-mail of, and add two employees and one cab to, the DDS vendor
table_vendor_obj.update_records(
expressions=["vendor_id = 'DDS'"],
new_values_maps={
"email": "'management@ddstaxico.com'",
"num_emps": "num_emps + 2",
"num_cabs": "num_cabs + 1"
},
options={"use_expressions_in_new_values_maps":"true"}
)
You can delete records using the delete_records method.
# Delete payment 189
pre_delete = table_payment_obj.size()
print "Records in the payment table (before delete): {}".format(pre_delete)
delete_expr = ["payment_id = 189"]
print "Deleting record where " + delete_expr[0]
table_payment_obj.delete_records(expressions=delete_expr)
post_delete = table_payment_obj.size()
print "Records in the payment table (after delete): {}".format(post_delete)
Some properties can be altered or added after table creation, including indexes, dictionary encoding, and compression.
Using the alter_table method, you can create indexes on columns using the create_index action paired with a column name.
# Add column indexes on:
# - payment table, fare_amount (for query-chaining filter example)
# - taxi table, passenger_count (for filter-by-range example)
table_payment_obj.alter_table(action="create_index", value="fare_amount")
table_taxi_obj.alter_table(action="create_index", value="passenger_count")
Applying column compression works similarly: use the alter_table method with a set_column_compression action paired with a column name and a compression type option.
# Apply the snappy compression algorithm to the pickup and dropoff datetime columns
snappy_compression = {"compression_type": "snappy"}
table_taxi_obj.alter_table(
action="set_column_compression",
value="pickup_datetime",
options=snappy_compression
)
dropoff_compress_resp = table_taxi_obj.alter_table(
action="set_column_compression",
value="dropoff_datetime",
options=snappy_compression
)
Important
Column compression is applied at a fixed interval, so be sure to verify later that the compression has been added. Column usage should decrease by roughly 23% (~1989 bytes).
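One way to verify is to check the table's size via the show_table endpoint. A minimal sketch, assuming show_table's get_sizes option and a total_full_size response field (consult the Python API Reference for the exact response fields):
# Check the overall size of the Taxi table after compression is applied
show_resp = h_db.show_table(table_name=table_taxi, options={"get_sizes": "true"})
print "Taxi table full size (bytes): {}".format(show_resp["total_full_size"])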
Applying dictionary encoding via alter_table involves adding a new property to a column.
# Apply dictionary encoding to the vendor_id column
table_taxi_obj.alter_table(
action="change_column",
value="vendor_id",
options={"column_properties": "char4,dict"}
)
Important
To add a new property, all existing column properties must be listed along with any new properties.
Filters are an easy way to reduce larger tables into more concise views using expressions.
# Select all payments with no corresponding payment type; allow Kinetica
# to assign a random name to the view
collection_option = {"collection_name": collection}
f1_count = table_payment_obj.filter(
expression="IS_NULL(payment_type)",
options=collection_option
).size()
print "Number of null payments: {}".format(f1_count)
# Using GPUdbTable query chaining, filter null payment type records with a
# fare amount greater than 8
f2_count = table_payment_obj.filter(
view_name=view_example1,
expression="IS_NULL(payment_type)",
options=collection_option
).filter(
view_name=view_example2,
expression="fare_amount > 8",
options=collection_option
).size()
print "Number of null payments with a fare amount greater than $8.00 " \
"(with query chaining): {}".format(f2_count)
# Filter by list where vendor ID is either NYC or YCAB
f3_count = table_taxi_obj.filter_by_list(
view_name=view_example3,
column_values_map={"vendor_id": ["NYC", "YCAB"]}
).size()
print "Number of records where vendor_id is either NYC or YCAB: {}".format(f3_count)
# Filter by range: trips with a passenger count between 1 and 3
f4_count = table_taxi_obj.filter_by_range(
    view_name=view_example4,
    column_name="passenger_count",
    lower_bound=1,
    upper_bound=3
).size()
print "Number of trips with a passenger count between 1 and 3: {}".format(f4_count)
Kinetica supports various aggregate and group-by queries, which group and aggregate your data to return counts and useful statistics.
# Aggregate count, min, mean, and max on the trip distance
a1_resp = table_taxi_obj.aggregate_statistics(
column_name="trip_distance",
stats="count,min,max,mean"
)
print "Statistics of values in the trip_distance column:"
print "\tCount: {count:.0f}\n\tMin: {min:4.2f}\n\tMean: {mean:4.2f}\n\t" \
"Max: {max:4.2f}\n".format(**a1_resp["stats"])
# Find unique taxi vendor IDs
a2_resp = table_taxi_obj.aggregate_unique(
column_name="vendor_id",
offset=0,
limit=gpudb.GPUdb.END_OF_SET,
encoding="json"
)["data"]["vendor_id"]
print "Unique vendor IDs in the Taxi table:"
for vendor in a2_resp:
    print "\t* {}".format(vendor)
# Find number of trips per vendor
a3_resp = table_taxi_obj.aggregate_group_by(
column_names=["vendor_id", "count(vendor_id)"],
offset=0,
limit=gpudb.GPUdb.END_OF_SET,
options={"sort_by": "key"}
)["data"]
print "Number of trips per vendor:"
for vendor in zip(a3_resp["vendor_id"], a3_resp["count(vendor_id)"]):
    print "\t{:<10s}{:4d}".format(vendor[0] + ":", vendor[1])
# Create a histogram for the different groups of passenger counts
a4_resp = table_taxi_obj.aggregate_histogram(
column_name="passenger_count",
start=1,
end=6,
interval=1
)["counts"]
print "Passenger count groups by size:"
print "{:<10s} {:<11s}".format("Passengers", "Total Trips")
print "{:=<10s} {:=<11s}".format("", "")
for histo_group in zip([1, 2, 3, 4, '>5'], a4_resp):
    print "{:>10} {:11.0f}".format(*histo_group)
Joins allow you to link multiple tables together, along their relations, retrieving associated information from any or all of them. Tables can only be joined if they're sharded similarly or replicated.
Note that the create_join_table method is static and is not called on a corresponding GPUdbTable object.
An inner join returns only records that have matching values in both tables.
# Retrieve payment information for rides having more than three passengers
gpudb.GPUdbTable.create_join_table(
join_table_name=join_table_inner,
table_names=[
"taxi_trip_data as t",
"payment as p"
],
column_names=[
"t.payment_id",
"payment_type",
"total_amount",
"passenger_count",
"vendor_id",
"trip_distance"
],
expressions=[
"t.payment_id = p.payment_id",
"passenger_count > 3"
],
options=collection_option,
db=h_db
)
A left join returns all of the records an inner join does. Additionally, for each record in the table on the left side of the join that has no match along the relation to a record in the table on the right side, a corresponding record will be returned with the "left-side" columns populated with the "left-side" record data and the "right-side" columns populated with nulls.
# Retrieve cab ride transactions and the full name of the associated vendor
# (if available--blank if vendor name is unknown) for transactions with
# associated payment data, sorting by increasing values of transaction ID.
gpudb.GPUdbTable.create_join_table(
join_table_name=join_table_left,
table_names=[
"taxi_trip_data as t",
"vendor as v"
],
column_names=[
"transaction_id",
"pickup_datetime",
"trip_distance",
"t.vendor_id",
"vendor_name"
],
expressions=[
"left join t, v on (t.vendor_id = v.vendor_id)",
"payment_id <> 0"
],
options=collection_option,
db=h_db
)
Note
Full outer joins require both tables to be replicated or joined on their shard keys. Set merges that perform deduplication of records, like Union Distinct, Intersect, and Except, also need to use replicated tables to ensure correct results, so a replicated version of the taxi table (taxi_trip_data_replicated) and a corresponding table object are created at this point in the tutorial.
gpudb.GPUdbTable.merge_records(
table_name=table_taxi_replicated, source_table_names=[table_taxi],
field_maps=[{
"transaction_id": "transaction_id",
"payment_id": "payment_id",
"vendor_id": "vendor_id",
"pickup_datetime": "pickup_datetime",
"dropoff_datetime": "dropoff_datetime",
"passenger_count": "passenger_count",
"trip_distance": "trip_distance",
"pickup_longitude": "pickup_longitude",
"pickup_latitude": "pickup_latitude",
"dropoff_longitude": "dropoff_longitude",
"dropoff_latitude": "dropoff_latitude"
}],
options={
"collection_name": collection,
"is_replicated": "true"
},
db=h_db
)
A full outer join returns all of the records a left join does. Additionally, for each record in the table on the right side of the join that has no match along the relation to a record in the table on the left side, a corresponding record will be returned with the "right-side" columns populated with the "right-side" record data and the "left-side" columns populated with nulls.
# Retrieve the vendor IDs of known vendors with no recorded cab ride
# transactions, as well as the vendor ID and number of transactions for
# unknown vendors with recorded cab ride transactions
gpudb.GPUdbTable.create_join_table(
join_table_name=join_table_outer,
table_names=[
"taxi_trip_data_replicated as t",
"vendor as v"
],
column_names=[
"t.vendor_id as vendor_id",
"v.vendor_id as vendor_id_1"
],
expressions=["full_outer join t,v on ((v.vendor_id = t.vendor_id))"],
options=collection_option,
db=h_db
)
You can create projections using the create_projection method.
# Create a projection containing all payments by credit card
table_payment_obj.create_projection(
projection_name=projection_example1,
column_names=[
"payment_id",
"payment_type",
"credit_type",
"payment_timestamp",
"fare_amount",
"surcharge",
"mta_tax",
"tip_amount",
"tolls_amount",
"total_amount"
],
options={
"collection_name": collection,
"expression": "payment_type = 'Credit'"
}
)
To persist a projection:
# Create a persisted table with cab ride transactions greater than 5 miles
# whose trip started during lunch hours
table_taxi_obj.create_projection(
projection_name=projection_example2,
column_names=[
"hour(pickup_datetime) as hour_of_day",
"vendor_id",
"passenger_count",
"trip_distance"
],
options={
"expression":
"(hour(pickup_datetime) >= 11) AND "
"(hour(pickup_datetime) <= 14) AND "
"(trip_distance > 5)",
"collection_name": collection,
"persist": "true"
}
)
Union can be used to combine homogeneous data sets into one larger data set. Union and Union Distinct both combine data sets but retain only the records that are unique across the chosen columns, removing all duplicates. Union All combines data sets, retaining all records from the source data sets.
# Calculate the average number of passengers, as well as the shortest,
# average, and longest trips for all trips in
# each of the two time periods--from April 1st through the 15th, 2015 and
# from April 16th through the 23rd, 2015--and return those two sets of
# statistics in a single result set.
table_taxi_obj.aggregate_group_by(
column_names=[
"avg(passenger_count) as avg_pass_count",
"avg(trip_distance) as avg_trip_dist",
"min(trip_distance) as min_trip_dist",
"max(trip_distance) as max_trip_dist"
],
offset=0,
limit=gpudb.GPUdb.END_OF_SET,
options={
"expression":
"(pickup_datetime >= '2015-04-01') AND "
"(pickup_datetime <= '2015-04-15 23:59:59.999')",
"result_table": agg_grpby_union_all_src1,
"collection_name": collection
}
)
table_taxi_obj.aggregate_group_by(
column_names=[
"avg(passenger_count) as avg_pass_count",
"avg(trip_distance) as avg_trip_dist",
"min(trip_distance) as min_trip_dist",
"max(trip_distance) as max_trip_dist"
],
offset=0,
limit=gpudb.GPUdb.END_OF_SET,
options={
"expression":
"(pickup_datetime >= '2015-04-16') AND "
"(pickup_datetime <= '2015-04-23 23:59:59.999')",
"result_table": agg_grpby_union_all_src2,
"collection_name": collection
}
)
gpudb.GPUdbTable.create_union(
table_name=union_all_table,
table_names=[
agg_grpby_union_all_src1,
agg_grpby_union_all_src2
],
input_column_names=[
["'2015-04-01 - 2014-04-15'", "avg_pass_count",
"avg_trip_dist", "min_trip_dist", "max_trip_dist"],
["'2015-04-16 - 2015-04-23'", "avg_pass_count",
"avg_trip_dist", "min_trip_dist", "max_trip_dist"]
],
output_column_names=[
"pickup_window_range",
"avg_pass_count",
"avg_trip",
"min_trip",
"max_trip"
],
options={"mode": "union_all", "collection_name": collection},
db=h_db
)
Intersect will combine data sets but only include the records found in both data sets, removing duplicate result records.
# Retrieve locations (as lat/lon pairs) that were both pick-up and
# drop-off points
gpudb.GPUdbTable.create_union(
table_name=union_intersect_table,
table_names=[
table_taxi_replicated,
table_taxi_replicated
],
input_column_names=[
["pickup_latitude", "pickup_longitude"],
["dropoff_latitude", "dropoff_longitude"]
],
output_column_names=["latitude", "longitude"],
options={"mode": "intersect", "collection_name": collection},
db=h_db
)
Except will return records that appear in the first data set but not the second data set. Note that the data sets on each side of the Except will have duplicates removed first, and then the set subtraction will be processed.
# Show vendors that operate before noon, but not after noon: retrieve the
# unique list of IDs of vendors who provided cab rides between midnight
# and noon, and remove from that list the IDs of any vendors who provided
# cab rides between noon and midnight
table_taxi_replicated_obj.create_projection(
projection_name=projection_except_src1,
column_names=["vendor_id"],
options={
"expression":
"(HOUR(pickup_datetime) >= 0) AND "
"(HOUR(pickup_datetime) <= 11)",
"collection_name": collection
}
)
table_taxi_replicated_obj.create_projection(
projection_name=projection_except_src2,
column_names=["vendor_id"],
options={
"expression":
"(HOUR(pickup_datetime) >= 12) AND "
"(HOUR(pickup_datetime) <= 23)",
"collection_name": collection
}
)
gpudb.GPUdbTable.create_union(
table_name=union_except_table,
table_names=[
projection_except_src1,
projection_except_src2
],
input_column_names=[
["vendor_id"],
["vendor_id"]
],
output_column_names=["vendor_id"],
options={"mode": "except", "collection_name": collection},
db=h_db
)
You can delete records from a table using filter expressions. This method allows you to specify multiple filter expressions--note that each expression is used to delete records independently from the others (i.e., a record only needs to meet any one expression's criteria to be deleted from the table).
table_payment_obj.delete_records(expressions=["payment_type = 'Cash'"])
Included below is a complete example containing all of the above requests, the data file, and output.
To run the complete sample, ensure that python_tutorial.py and taxi_trip_data.csv are in the same directory; then switch to that directory and run:
python python_tutorial.py