References
- Python UDF Reference — detailed description of the entire UDF API
- Running UDFs — detailed description on running Python UDFs
- Example UDFs — example UDFs written in Python
Prerequisites
The general prerequisites for using UDFs in Kinetica can be found under UDF Prerequisites. This example cannot run on Mac OSX.
- Python 3
- Miniconda
Visit the Conda website to download
the Miniconda installer for Python 3.
UDF API Download
This example requires local access to the Python UDF API repository. In the desired directory, run the following, replacing <kinetica-version> with the name of the installed Kinetica version, e.g., v7.2:
Relevant Scripts
There are four files associated with the pandas UDF example, all of which can be found in the Python UDF API repo.
- A database setup script (test_environment.py) that is called from the initialization script
- An initialization script (setup_db.py) that creates the output table
- A script (register_execute_UDF.py) to register and execute the UDF
- A UDF (df_to_output_UDF.py) that creates a pandas dataframe and inserts it into the output table
The Conda environment for this example is defined by the conda_env_py3.yml file found in the
Python UDF API repository.
- In the directory where you cloned the API, change into the root folder of the Python UDF API repository:
- Create the Conda environment, replacing <environment name> with the desired name:
  It may take a few minutes to create the environment.
- Verify the environment was created properly:
- Activate the new environment:
- Install PyGDF:
- Install the Kinetica Python API:
- Add the Python UDF API repo’s root directory to the PYTHONPATH:
- Edit the util/test_environment.py script with the correct database URL, user, and password for your Kinetica instance:
UDF Deployment
- Change directory into the UDF Pandas directory:
- Run the UDF initialization script:
- Run the script that registers and executes the UDF:
- Verify the results in GAdmin, either on the logging page or in the unittest_df_output output table on the table view page.
Execution Detail
This example details using a distributed UDF to create and ingest a pandas dataframe into Kinetica. The df_to_output_UDF proc creates the dataframe
and inserts it into the output table, unittest_df_output.
The dataframe has a shape of (3, 3) and is inserted into the
output table n times, where n is equal to the number of
processing nodes available in each processing node container registered in
Kinetica.
The output table contains 3 columns:
- id: an integer column
- value_long: a long column
- value_float: a float column
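As a rough illustration of the shape and columns described above, the sketch below builds a 3 x 3 pandas dataframe with the same column names. The values are placeholders, not the ones used by the example UDF; the dtype mapping (int32, int64, float32) is an assumption about how the integer, long, and float columns would be represented; and expected_row_count is a hypothetical helper that only restates the "inserted n times" arithmetic.

```python
import pandas as pd

# Column names come from the output table description; the values are
# placeholders chosen for illustration only.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "value_long": [10, 20, 30],
    "value_float": [0.5, 1.5, 2.5],
}).astype({
    # Assumed mapping of the table's int / long / float columns to pandas dtypes.
    "id": "int32",
    "value_long": "int64",
    "value_float": "float32",
})


def expected_row_count(frame, n_instances):
    """Rows in the output table if each of n UDF instances inserts the frame once."""
    return frame.shape[0] * n_instances


print(df.shape)                    # shape of the dataframe
print(expected_row_count(df, 4))   # rows expected with four UDF instances
```

With four UDF instances, for example, the three-row dataframe would account for twelve rows in the output table.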
Database Setup
The setup script, setup_db.py, which creates the output table for the UDF,
imports the test_environment.py script to access its methods:
The setup script calls two methods from test_environment.py: one to create the
schema used to contain the example tables and one to create the output table,
with the output table’s name and type passed in:
The methods in test_environment.py require a connection to Kinetica.
This is done by instantiating an object of the GPUdb class with a
provided connection URL. See Connecting via API for details on the URL
format and how to look it up.
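A minimal sketch of such a connection is shown below. It assumes the Kinetica Python API (the gpudb package) is installed and that its GPUdb constructor accepts host, username, and password keyword arguments; check this against the installed API version. The helper names connection_kwargs and connect are hypothetical and do not come from test_environment.py.

```python
def connection_kwargs(url, username, password):
    """Collect the keyword arguments passed to the GPUdb constructor."""
    return {"host": url, "username": username, "password": password}


def connect(url, username, password):
    """Open a connection; needs the gpudb package and a running Kinetica instance."""
    # Imported lazily so connection_kwargs() can be used without the package.
    import gpudb

    return gpudb.GPUdb(**connection_kwargs(url, username, password))
```

A call such as connect("http://localhost:9191", "admin", "secret") would then return the database handle; the URL and credentials here are placeholders.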
The create_schema() method creates the schema that will contain the table
used in the example:
The create_test_output_table() method creates the type and table for the
output table; if the table already exists, it is removed first:
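The drop-then-create pattern described above can be sketched as follows. This is not the code from test_environment.py: recreate_output_table and OUTPUT_COLUMNS are hypothetical names, the type label is made up, and db is assumed to be a GPUdb handle exposing has_table, clear_table, create_type, and create_table with the keyword arguments used here.

```python
import json

# Column names and types taken from the output table description.
OUTPUT_COLUMNS = [("id", "int"), ("value_long", "long"), ("value_float", "float")]


def recreate_output_table(db, table_name, columns):
    """Drop table_name if it exists, then create its type and the table."""
    # Remove a leftover table from a previous run.
    if db.has_table(table_name=table_name)["table_exists"]:
        db.clear_table(table_name=table_name)

    # Kinetica type definitions are Avro-style record schemas in JSON form.
    type_definition = json.dumps({
        "type": "record",
        "name": "type_name",
        "fields": [{"name": name, "type": kind} for name, kind in columns],
    })
    response = db.create_type(type_definition=type_definition,
                              label=table_name + "_type")  # hypothetical label
    type_id = response["type_id"]

    db.create_table(table_name=table_name, type_id=type_id)
    return type_id
```

Checking for the table before clearing it keeps the setup script safe to rerun.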
UDF (df_to_output_UDF.py)
First, packages are imported to access the Kinetica Python UDF API and pandas:
Next, the UDF accesses the ProcData() class:
A handle to the output table (unittest_df_output) is then retrieved. Its size
is expanded to match the shape of the dataframe; this will allocate enough
memory to copy all records in the dataframe to the output table. Then the
dataframe is assigned to the output table:
UDF Registration (register_execute_UDF.py)
To interact with Kinetica, an object of the GPUdb class is instantiated
while providing the connection URL of the database server.
To upload the df_to_output_UDF.py and kinetica_proc.py files to
Kinetica, they will first need to be read in as bytes and added to a file data
map:
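Reading the files as bytes into a map can be sketched with the standard library alone. The helper name build_file_map is hypothetical, and keying the map by each file's base name is an assumption made for illustration.

```python
import os


def build_file_map(paths):
    """Return a dict mapping each file's base name to its contents as bytes."""
    file_map = {}
    for path in paths:
        # Binary mode keeps the contents as raw bytes.
        with open(path, "rb") as handle:
            file_map[os.path.basename(path)] = handle.read()
    return file_map
```

For this example, the helper would be called with the paths to df_to_output_UDF.py and kinetica_proc.py.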
The Pandas_df_output
proc is then created in Kinetica and the files are associated with it:
The proc requires the proper
command and args to be executed. In
this case, the assembled command line would be:
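The exact command and args are set in register_execute_UDF.py. As a hedged sketch of how the two combine into a command line, and of how they might be passed when the proc is created, consider the helpers below. Both function names are hypothetical, db is assumed to be a GPUdb handle whose create_proc method accepts these keyword arguments, and the "distributed" execution mode follows from the description of this example as a distributed UDF.

```python
import shlex


def assembled_command_line(command, args):
    """Join a command and its argument list into one shell-style string."""
    return shlex.join([command, *args])


def register_proc(db, proc_name, file_map, command, args):
    """Create the proc and associate the uploaded files with it."""
    return db.create_proc(
        proc_name=proc_name,
        execution_mode="distributed",  # assumed from the example's description
        files=file_map,
        command=command,
        args=args,
        options={},
    )
```

Assuming, purely for illustration, a command of "python" and args of ["df_to_output_UDF.py"], assembled_command_line would yield "python df_to_output_UDF.py".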