Python Developer Guide
Step-by-step instructions on writing Python applications with Kinetica
Step-by-step instructions on writing Python applications with Kinetica
The following guide provides step-by-step instructions to get started writing Python applications using Kinetica. This guide demonstrates only a small set of the available API. A detailed description of the complete interface is available under Python API Reference.
The prerequisites for running the match graph example are listed below:
The native Kinetica Python API is accessible through the following means:
The Python package manager, pip, is required to install the API from PyPI.
Install the API:
|
|
Test the installation:
|
|
If Import Successful is displayed, the API has been installed as is ready for use.
In the desired directory, run the following, but be sure to replace <kinetica-version> with the name of the installed Kinetica version, e.g., v7.2:
|
|
Change directory into the newly downloaded repository:
|
|
In the root directory of the unzipped repository, install the Kinetica API:
|
|
Test the installation (Python 2.7 (or greater) is necessary for running the API example):
|
|
The tutorial script references the taxi_trip_data.csv
data file,
mentioned in the Prerequisites, in the current local directory, by default.
This directory can specified as a parameter when running the script.
To interact with Kinetica, you must first instantiate an object of the GPUdb class while providing the connection URL, username, & password to the database server. For more details on connecting to the database, see Connecting via API.
|
|
All tables & views must be created within a schema. So, the first step will be to create a schema to contain all of the tables & views created within this tutorial.
|
|
Before a table can be created and any data can be loaded into it, a type needs to be defined. The type is the set of column definitions for the table, structured as a list of lists, where each contained list corresponds to a single column definition. Each column comprises at least two values: a column name (always the first value) and a base type (always the second value). Any subsequent values are column properties. You can also use the GPUdbColumnProperty object in addition to literal string values to help define a column.
For this tutorial, three types will be used to create three corresponding tables:
The definition of the type for the vendor table follows:
|
|
Note
While the type is just defined here, it will be created implicitly when first used to create a table, shown below.
After you've defined a type list, you can create a table using the GPUdbTable interface, passing in the type, the name of the table, any table options, and a handle to the database connection.
|
|
To pass in table creation options, you can either specify a Python map, as above, or you can use the GPUdbTableOptions object instead:
|
|
Important
At this point in the tutorial, a payment and taxi_trip_data type and table are also created.
To acquire a GPUdbTable object for a table that already exists, pass in a _type of None:
|
|
You can insert single records or multiple records using the insert_records method. As mentioned previously, the number and order of values must match the column declarations found in the type creation step. In Python, you can pass a value of None for null values, but the column must be declared nullable when creating the type schema.
Note
The following methods are not the only way to insert records, but they are the simplest. There is a convenience class called GPUdbIngestor and an automatic multi-head capability available through GPUdbTable, which facilitate inserting records into a table in batches; see Multi-Head Ingest for details.
In this example, a dictionary object will be used to map key/value pairs to table column values.
|
|
In this example, records will be defined in-line and added to a list. They can either be inserted as a single list of lists, as is done here, or as individual lists, one per record.
|
|
CSV and other data files can be loaded into Kinetica via KiFS using two lines of code:
|
|
Note
The KiFS directory must already exist before uploading a file into it. The create_directory method can be used to create this directory.
Once the table is populated with data, the data can be retrieved using binary encoding or JSON encoding. Binary encoding is more efficient than JSON encoding. The example below uses the GPUdbTable.get_records method and binary encoding:
|
|
This example demonstrates JSON encoding record retrieval, using the GPUdb.get_records method:
|
|
For large tables, the data can be easily be retrieved in smaller blocks by using the offset and limit parameters. The returned response also contains the type (column definitions) of the results.
Using any GPUdbTable object, you can update records via the update_records method.
|
|
You can delete records using the delete_records method. This method allows you to specify multiple filter expressions--note that each expression is used to delete records independently from the others (i.e., a record only needs to meet any one expression's criteria to be deleted from the table). Set delete_all_records to True in the options map to delete all records in the table.
|
|
Some properties can be altered or added after table creation, using alter_table, including indexes and dictionary encoding.
Using the alter_table method, you can create indexes on columns using the create_index action paired with a column name.
|
|
Applying dictionary encoding via alter_table involves adding a new property to a column.
|
|
Important
To add a new property, all existing column properties must be listed along with any new properties.
Filters are an easy way to reduce larger tables into more concise views using expressions.
|
|
|
|
|
|
|
|
Kinetica supports various aggregate and group-by queries, which group and aggregate your data to return counts and useful statistics.
|
|
|
|
|
|
|
|
Joins allow you to link multiple tables together, along their relations, retrieving associated information from any or all of them. Tables can only be joined if they're sharded similarly or replicated. All join types can be performed using the create_join_table method. This method is static and is not called with a corresponding GPUdbTable object.
An inner join returns only records that have matching values in both tables.
|
|
A left join returns all of the records an inner join does, but additionally, for each record in the table on the left side of the join that has no match along the relation to a record in the table on the right side of the join, a corresponding record will be returned with "left-side" columns populated with the "left-side" record data and the "right-side" columns populated with nulls. Note the usage of left join in the given expression.
|
|
Note
Full outer joins require both tables to be replicated or joined on their shard keys. Set merges that perform deduplication of records, like Union Distinct, Intersect, and Except also need to use replicated tables to ensure the correct results, so a replicated version of the taxi (taxi_trip_data_replicated) table is created at this point in the tutorial using merge_records.
|
|
A full outer join returns all of the records a left join does, but additionally, for each record in the table on the right side of the join that has no match along the relation to a record in the table on the left side of the join, a corresponding record will be returned with "right-side" columns populated with the "right-side" record data and the "left-side" columns populated with nulls.
|
|
You can create projections using the create_projection method.
|
|
To persist a projection:
|
|
Union can be used to combine homogeneous data sets into one larger data set. When calling the create_union method, a mode of union or union_distinct will both combine data sets, but only retain the records that are unique across the chosen columns, removing all duplicates. Using union_all will combine data sets, retaining all records from the source data sets.
|
|
Calling the create_union method with a mode of intersect will perform a set intersection, which will combine data sets but only include the records found in both data sets, removing duplicate result records. Using intersect_all will retain intersecting duplicates from both sets.
|
|
Calling the create_union method with a mode of except will perform a set except (subtraction), which will return records that appear in the first data set but not the second data set. Note that the data sets on each side of the except will have duplicates removed first, and then the set subtraction will be processed. Using except_all will retain duplicates from the first set.
|
|
Included below is a complete example containing all the above requests, the data file, and output.
To run the complete sample, ensure that:
python_tutorial.py
script is in the current directorytaxi_trip_data.csv
file is in the current directory or use
the data_dir parameter to specify the local directory containing itThen, run the following:
|
|
Note
As this script creates a schema and several database objects within it, system admin permission is required to run it.