References
- Python UDF Reference — detailed description of the entire UDF API
- Running UDFs — detailed description on running Python UDFs
- Example UDFs — example UDFs written in Python
Prerequisites
The general prerequisites for using UDFs in Kinetica can be found on the User-Defined Function Implementation page.Program Files
There are three files associated with the Python UDF tutorial. All the files can be found in the Python Tutorial Git Repo, which is cloned in the API Download and Installation section.- A UDF management program, udf_tc_py_manager.py, written using the Python API, which creates the input & output tables, and creates the UDF and executes it.
- A UDF, udf_tc_py_proc.py, written using the Python UDF API, which contains a table copying example.
- A CSV input file, rank_tom.csv, used to identify which processing nodes should copy data.
API Download and Installation
The Python UDF tutorial requires local access to the Python UDF API & tutorial repositories and the Python API. The native Python API is also used to run the UDF simulator (details found in Development).-
In the desired directory, run the following to download the Kinetica Python
UDF tutorial repository:
-
In the same directory, run the following to download the Kinetica Python UDF
API repository:
-
In the same directory, run the following to download the Kinetica Python API
repository:
-
Install the
pandasPython library: -
Change directory into the newly downloaded native Python API repository:
-
In the root directory of the repository, install the Kinetica API:
-
Change directory into the UDF tutorial root:
-
Add the Python UDF API directory to the
PYTHONPATH:
Development
The steps below outline using the UDF Simulator, included with the Python API. The UDF Simulator simulates the mechanics of execute_proc() without actually calling it in the database; this is useful for developing UDFs piece-by-piece and test incrementally, avoiding memory ramifications for the database.-
Ensure that the Python UDF API directory is in the
PYTHONPATH. -
Change directory into the newly downloaded Python UDF tutorial repository:
-
Run the UDF manager script with the
initoption, specifying the database URL and a username & password: -
In the native Python API directory, run the UDF Simulator in
executemode with the following options to simulate running the UDF:Where:-i- schema-qualified UDF input table-o- schema-qualified UDF output table-K- Kinetica URL-U- Kinetica username-P- Kinetica password
-
Copy & execute the
exportcommand output by the previous command; this will prepare the execution environment for simulating the UDF:Theexportcommand shown above is an example of what theudfsim.pyscript will output—it should not be copied to the terminal in which this example is being run. Make sure to copy & execute the actual command output byudfsim.pyin the previous step. -
Run the UDF:
-
Run the UDF Simulator in
outputmode to output the results to Kinetica (use the dry run flag-dto avoid writing to Kinetica). Theresultsmap will be returned (even if there’s nothing in it) as well as the number of records that were (or will be in the case of a dry run) added to the given output table:For instance:This should output the following: -
Clean the control files output by the UDF Simulator:
The
cleancommand is only necessary if data was output to Kinetica; otherwise, the UDF Simulator can be re-run as many times as desired without having to clean the output files and enter another export command.
Deployment
The UDF can be created and executed using the UDF functions: create_proc() and execute_proc() (respectively).-
Run the UDF manager script with the
initoption to reset the example tables: -
Run the UDF manager script with the
execoption to run the example: -
Verify the results, using a SQL client (KiSQL), Kinetica Workbench, or other:
-
The
udf_tc_py_in_tabletable is created in the user’s default schema (ki_home, unless a different one was assigned during account creation) -
A matching
udf_tc_py_out_tabletable is created in the same schema -
The
udf_tc_py_in_tablecontains 10,000 records of random data -
The
udf_tc_py_out_tablecontains the correct amount of copied data fromudf_tc_py_in_table. On single-node installations, as is the case with Developer Edition, all data should be copied. This is because single-node instances have a default configuration of 2 worker ranks with one TOM each, and therank_tom.csvconfiguration file contains a reference to rank 1/TOM 0 and rank 2/TOM 0, effectively naming both data TOMs to copy data from. In larger cluster configurations, only a fraction of the data in the input table will be stored on those two TOMs; so, the output table will contain that same fraction of the input table’s data. The database logs should also show the portion of the data being copied:
-
The
UDF Detail
As mentioned previously, this section details a simple distributed UDF that copies data from one table to another. While the table copy UDF can run against multiple tables, the example run will use a single table,udf_tc_py_in_table, as input and a similar table,
udf_tc_py_out_table, for output.
The input table will contain one int16 column (id) and two float
columns (x and y). The id column will be an ordered integer field,
with the first row containing 1, the second row containing 2, etc. Both
float columns will contain 10,000 pairs of randomly-generated numbers:
id) and two float
columns (a and b). No data is inserted:
tom_num column values refer to processing nodes that contain the many
shards of data inside the database. The rank_num column values refer to
processing node containers that hold the processing nodes for the database. For
example, the given CSV file determines that the data from udf_tc_py_in_table
on processing node container 1, processing node 0 and processing node
container 2, processing node 0 will be copied to udf_tc_py_out_table
on those same nodes.
Once the UDF is executed, a UDF instance (OS process) is spun up for each
processing node to execute the UDF code against its assigned processing node’s
data. Each UDF process then determines if its corresponding processing node
container/processing node pair matches one of the pairs of values in the CSV
file. If there is a match, the UDF process will loop through the given input
tables and copy the data contained in that processing node from the input tables
to the output tables. If there isn’t a match, no data will be copied by that
process.
Initialization (udf_tc_py_manager.py init)
The init option invokes theinit() function in the
udf_tc_py_manager.py script. This function will create the input table for
the UDF to copy data from and the output table to copy data to. Sample data will
also be generated and inserted into the input table.
To interact with Kinetica, you must first instantiate an object of the
GPUdb class while providing the connection URL and username & password to
use for logging in. This database object is later passed to the init() and
exec() methods:
UDF (udf_tc_py_proc.py)
Theudf_tc_py_proc.py script is the UDF itself. It does the work of copying
the input table data to the output table, based on the ranks & TOMs specified in
the given CSV file.
First, instantiate a handle to the ProcData() class:
input_data and output_data
objects (respectively), set the output tables’ size to the input tables’ size.
This will allocate enough memory to copy all input records to the output
table:
complete() to tell Kinetica the UDF is finished.
Execution (udf_tc_py_manager.py exec)
The exec option invokes theexec() function in the
udf_tc_py_manager.py script. This function will read files in as bytes,
create a UDF, and upload the files to the database. The function will then
execute the UDF.
To upload the udf_tc_py_proc.py and rank_tom.csv files to Kinetica,
they will first need to be read in as bytes and added to a file data map:
udf_tc_py_proc
UDF can be created in Kinetica and the files can be associated with it: