Prerequisites
The prerequisites for running the match graph example are listed below:- Python API
- Tutorial script
- Taxi trip data CSV file
Python API Installation
When using this include, precede it with the following heading:Python API Installation
Depending on the target operating system, a Python virtual environment may need to be installed first: The native Kinetica Python API is accessible through the following means:Python Virtual Environment
A Python virtual environment is necessary to install in an operating environment where Python is externally managed.-
Install a Python virtual environment:
-
Activate the Python virtual environment:
PyPI
-
Install the API:
-
Test the installation:
If Import Successful is displayed, the API has been installed as is ready for use.
Git
-
In the desired directory, run the following, but be sure to replace
<kinetica-version>with the name of the installed Kinetica version, e.g.,v7.2: -
Change directory into the newly downloaded repository:
-
In the root directory of the unzipped repository, install the Kinetica API:
-
Test the installation (Python3 is necessary for running the API example):
Data File
The tutorial script references the taxi_trip_data.csv data file, mentioned in the Prerequisites, in the current local directory, by default. This directory can specified as a parameter when running the script.Connecting to the Database
To interact with Kinetica, you must first instantiate an object of the GPUdb class while providing the connection URL, username, & password to the database server. For more details on connecting to the database, see Connecting via API.Connect to the Database
Creating a Schema
All tables & views must be created within a schema. So, the first step will be to create a schema to contain all of the tables & views created within this tutorial.Create Tutorial Schema
Creating a Type
Before a table can be created and any data can be loaded into it, a type needs to be defined. The type is the set of column definitions for the table, structured as a list of lists, where each contained list corresponds to a single column definition. Each column comprises at least two values: a column name (always the first value) and a base type (always the second value). Any subsequent values are column properties. You can also use the GPUdbColumnProperty object in addition to literal string values to help define a column. For this tutorial, three types will be used to create three corresponding tables:- A base table containing a sample of NYC taxi cab ride data
- A lookup table containing sample cab vendor information (created in the example below)
- A 1-to-1 relation table containing payment information for a subset of the cab-ride transactions
vendor table follows:
Define a Type
While the type is just defined here, it will be created implicitly
when first used to create a table, shown below.
Creating a Table
After you’ve defined a type list, you can create a table using the GPUdbTable interface, passing in the type, the name of the table, any table options, and a handle to the database connection.Create a Table
Use GPUdbTableOptions Instead of Raw Map for Options
At this point in the tutorial, a
payment and
taxi_trip_data type and table are also created._type of None:
Get a Handle to an Existing Table
Inserting Data
You can insert single records or multiple records using the insert_records method. As mentioned previously, the number and order of values must match the column declarations found in the type creation step. In Python, you can pass a value ofNone for null values, but the column must
be declared nullable when creating the type schema.
The following methods are not the only way to insert records, but they
are the simplest. There is a convenience class called
GPUdbIngestor and an
automatic multi-head capability available through
GPUdbTable, which facilitate
inserting records into a table in batches; see
Distributed Ingest for details.
Key/Value Record
In this example, a dictionary object will be used to map key/value pairs to table column values.Insert by Dictionary
In-Line Record
In this example, records will be defined in-line and added to a list. They can either be inserted as a single list of lists, as is done here, or as individual lists, one per record.Insert by List
CSV
CSV and other data files can be loaded into Kinetica via KiFS using two lines of code:- Upload the local file to KiFS using upload_files, passing the KiFS path to upload to and the bytes of the file to upload.
- Insert the data from the uploaded file in KiFS using insert_records_from_files, passing the name of the table to load data into and the KiFS path to the source file.
Insert from File
The KiFS directory must already exist before uploading a file into
it. The
create_directory
method can be used to create this directory.
Retrieving Data
Once the table is populated with data, the data can be retrieved using binary encoding or JSON encoding. Binary encoding is more efficient than JSON encoding. The example below uses the GPUdbTable.get_records method and binary encoding:Retrieve Records (Binary Encoding via GPUdbTable class)
Retrieve Records (JSON Encoding via GPUdb class)
offset and limit parameters. The returned response also
contains the type (column definitions) of the results.
Updating Records
Using any GPUdbTable object, you can update records via the update_records method.Update Records
Deleting Records
You can delete records using the delete_records method. This method allows you to specify multiple filter expressions—note that each expression is used to delete records independently from the others (i.e., a record only needs to meet any one expression’s criteria to be deleted from the table). Setdelete_all_records to True in the options map
to delete all records in the table.
Delete Records
Alter Table
Some properties can be altered or added after table creation, using alter_table, including indexes and dictionary encoding.Indexes
Using the alter_table method, you can create indexes on columns using thecreate_index action
paired with a column name.
Index Columns
Dictionary Encoding
Applying dictionary encoding via alter_table involves adding a new property to a column.Dictionary Encode Column
To add a new property, all existing column properties must be
listed along with any new properties.
Filters
Filters are an easy way to reduce larger tables into more concise views using expressions.Count Value-Filtered Records
Count Using Filter Chaining; Name the Backing Views
Count List-Filtered Records
Count Range-Filtered Records
Aggregates
Kinetica supports various aggregate and group-by queries, which group and aggregate your data to return counts and useful statistics.Calculate Count/Min/Mean/Max
Retrieve Unique Values
Report Number of Records per Group
Generate Histogram
Joins
Joins allow you to link multiple tables together, along their relations, retrieving associated information from any or all of them. Tables can only be joined if they’re sharded similarly or replicated. All join types can be performed using the create_join_table method. This method is static and is not called with a corresponding GPUdbTable object. An inner join returns only records that have matching values in both tables.Inner Join
left join in the given expression.
Left Join
Full outer joins require both tables to be replicated or joined on
their shard keys. Set merges that perform deduplication of records, like
Union Distinct,
Intersect, and
Except also need to use replicated tables to
ensure the correct results, so a replicated version of the taxi
(
taxi_trip_data_replicated) table is created at this point in the
tutorial using
create_projection.Create Replicated Table for Full Outer Join
Full Outer Join
Projections
You can create projections using the create_projection method.Create a Temporary Projection
Create a Permanent Projection
Union, Intersect, & Except
Union can be used to combine homogeneous data sets into one larger data set. When calling the create_union method, amode of union or union_distinct will both combine data
sets, but only retain the records that are unique across the chosen columns,
removing all duplicates. Using union_all will combine data sets, retaining
all records from the source data sets.
Set Union Retaining Duplicates
mode of intersect will perform a set
intersection, which will combine data sets
but only include the records found in both data sets, removing duplicate result
records. Using intersect_all will retain intersecting duplicates from both
sets.
Set Intersection
mode of except will perform a set
except (subtraction), which will return records
that appear in the first data set but not the second data set. Note that the
data sets on each side of the except will have duplicates removed first, and
then the set subtraction will be processed. Using except_all will retain
duplicates from the first set.
Set Exception (Subtraction)
Download & Run
Included below is a complete example containing all the above requests, the data file, and output. To run the complete sample, ensure that:- the python_tutorial.py script is in the current directory
- the taxi_trip_data.csv file is in the current directory or use
the
data_dirparameter to specify the local directory containing it
Run Example
As this script creates a schema and several database objects within
it, system admin permission is
required to run it.