The UDF simulator reproduces the mechanics of the /execute/proc call without
requiring the UDF to be created in the database. The simulator reads data out of
tables, writes it to files in the correct format, and provides the environment
variable that needs to be set for the UDF API. After the UDF code has run, the
simulator can optionally read any output that was written and write it back into
Kinetica. The simulator does not run the UDF code itself; it only manages the
environment, so the developer can run the UDF code from a debugger, Jupyter
notebook, etc., as long as the environment variable is set correctly.
The UDF simulator is invoked via a Python script packaged with the native
Python API, at: examples/udfsim.py.
Though the script is distributed with the Python API, you can also use it
when developing a UDF in C++.
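Because the simulator only prepares the environment, a UDF started from a debugger or notebook without the variable set can fail in confusing ways. The following is a minimal sketch of a guard that could be called at the top of a Python UDF script during development. It assumes the variable is named KINETICA_PCF, as in the export command the simulator prints; `require_control_path` is a hypothetical helper, not part of the Kinetica API.

```python
import os


def require_control_path(env=None, var_name="KINETICA_PCF"):
    """Return the path stored in the simulator's environment variable.

    Assumption: the variable is named KINETICA_PCF, matching the export
    command printed by 'udfsim.py execute'. Hypothetical helper, not part
    of the Kinetica UDF API.
    """
    env = os.environ if env is None else env
    path = env.get(var_name)
    if not path:
        raise RuntimeError(
            f"{var_name} is not set; run 'udfsim.py execute' and then the "
            "export command it prints before running the UDF"
        )
    if not os.path.exists(path):
        # The simulator's files may have been removed, e.g. by 'clean'.
        raise RuntimeError(f"{var_name} points to a missing path: {path}")
    return path
```

Failing early with an explicit message is a design choice for local debugging only; a UDF deployed in the database would not need this check.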
Modes
| Mode | Description |
|---|---|
| execute <execute parameters> | Simulates UDF execution |
| output <output parameters> | Processes UDF output |
| clean | Cleans up files generated from execute or output. Takes no parameters |
Execute Parameters and Flags
| Parameter/Flag | Type Restriction | Description |
|---|---|---|
| -f </path/to/file> --path </path/to/file> | N/A | User-specified directory in which to create the control files. Default directory is /opt/gpudb/api/python/ |
| -p <param_name value> --param <param_name value> | N/A | UDF execution parameter(s) (see /execute/proc for more information) |
| -d --distributed | N/A | Enable distributed UDF simulation. This is the default execution mode |
| -i <table_name [column_name, ...]> --input <table_name [column_name, ...]> | Distributed only | Input table and optional column list |
| -o <table_name> --output <table_name> | Distributed only | Output table |
| -n --nondistributed | N/A | Enable non-distributed UDF simulation |
| -K <url> --url <url> | N/A | Kinetica URL; default is http://localhost:9191 |
| -U <username> --username <username> | N/A | Kinetica username for authentication |
| -P <password> --password <password> | N/A | Kinetica password for authentication |
| -h --help | N/A | Prints the help menu |
Output Parameters and Flags
| Parameter/Flag | Description |
|---|---|
| -d --dry-run | Display output only; no output written to Kinetica |
| -K <url> --url <url> | Kinetica URL; default is http://localhost:9191 |
| -U <username> --username <username> | Kinetica username for authentication |
| -P <password> --password <password> | Kinetica password for authentication |
| -h --help | Prints the help menu |
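When the simulator is driven from a test script rather than typed by hand, the flags can be assembled programmatically. The sketch below builds an argument list for execute mode using only flags listed in the tables above. `build_execute_argv` and its parameter names are hypothetical; the -p flag and the optional column list for -i are left out because their exact argument syntax is not spelled out here.

```python
import shlex


def build_execute_argv(script, control_dir=None, input_table=None,
                       output_table=None, url=None, username=None):
    """Build an argv list for the simulator's execute mode.

    Hypothetical helper: only the flag names are taken from the
    parameter table; everything else is illustrative.
    """
    argv = ["python", script, "execute"]
    if control_dir:
        argv += ["-f", control_dir]   # directory for the control files
    if input_table or output_table:
        argv.append("-d")             # -i and -o apply to distributed mode only
    if input_table:
        argv += ["-i", input_table]
    if output_table:
        argv += ["-o", output_table]
    if url:
        argv += ["-K", url]           # simulator default is http://localhost:9191
    if username:
        argv += ["-U", username]
    return argv


argv = build_execute_argv(
    "/opt/gpudb/python/api/examples/udfsim.py",
    control_dir="/tmp/data/udf-sim-test/",
    input_table="udf_tc_in_table",
    output_table="udf_tc_out_table",
)
print(shlex.join(argv))
```

With these arguments the printed command matches the execute invocation shown in the Examples section. Returning a list rather than a string lets the result be passed directly to `subprocess.run` without shell quoting concerns.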
Usage
1. Run the simulator with the execute argument and any parameters. Once
   finished, it prints an export command that sets the environment variable
   needed for the UDF API:
       python udfsim.py execute <execute parameters>
2. Run the printed export command from the command line:
       export KINETICA_PCF=/opt/gpudb/api/kinetica-api-python/gpudb/kinetica-udf-sim-icf-Wvbdqi
3. Execute the UDF. For example, to execute a Python UDF script:
       python udf_cublas_proc.py
   You can execute the UDF by any method (Jupyter notebook, debugger, etc.)
   as long as the environment variable from step 2 has been set. The UDF can
   be executed multiple times without rerunning steps 1 and 2 as long as it
   hasn’t output any data. For iterative testing, it may be desirable to
   comment out any data output code and use print statements instead.
4. Optionally, run the simulator with the output argument and any parameters
   to write data from the UDF into the database:
       python udfsim.py output <output parameters>
   This mode requires the environment variable from step 2 to be set.
5. Optionally, run the simulator with the clean argument to clean up the
   files generated by the execute and output modes:
       python udfsim.py clean
   The files can also be deleted manually if desired. This mode requires the
   environment variable from step 2 to be set.
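The steps above can be chained from a single Python script, which is convenient for repeated test runs. The following is a sketch under stated assumptions: it assumes the simulator prints the export line to standard output in the form `export KINETICA_PCF=<path>`, and that the simulator and the UDF can both be run with the current interpreter. `parse_export_line` and `run_simulation` are hypothetical helpers, not part of the Kinetica API.

```python
import os
import subprocess
import sys


def parse_export_line(simulator_stdout, var_name="KINETICA_PCF"):
    """Extract the value from a line of the form 'export VAR=value'."""
    prefix = f"export {var_name}="
    for line in simulator_stdout.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):]
    raise ValueError(f"no '{prefix}...' line found in simulator output")


def run_simulation(udfsim, udf_script, execute_args, output_args=(), clean=True):
    """Run execute, the UDF, output, and optionally clean, in that order.

    Assumption: the export command is written to stdout by execute mode.
    """
    # Step 1: prepare the input files and capture the printed export command.
    result = subprocess.run(
        [sys.executable, udfsim, "execute", *execute_args],
        check=True, capture_output=True, text=True,
    )
    # Step 2: set the variable for the child processes instead of exporting it.
    env = dict(os.environ, KINETICA_PCF=parse_export_line(result.stdout))
    # Step 3: run the UDF code itself.
    subprocess.run([sys.executable, udf_script], check=True, env=env)
    # Step 4: process whatever the UDF wrote.
    subprocess.run([sys.executable, udfsim, "output", *output_args],
                   check=True, env=env)
    # Step 5: remove the generated files.
    if clean:
        subprocess.run([sys.executable, udfsim, "clean"], check=True, env=env)
```

Passing the variable through `env` keeps it scoped to the child processes, so a stale value is not left behind in the calling shell. A call such as `run_simulation("examples/udfsim.py", "udf_tc.py", ["-i", "udf_tc_in_table", "-o", "udf_tc_out_table"], output_args=["-d"])` would perform a dry run of the output step.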
Examples
Running the UDF simulator for the Python table copy UDF with the following
parameters/flags:
- in distributed mode
- using the udf_tc_in_table table as input and the udf_tc_out_table table as
  output (both tables created using the Python table copy manager script)
- placing all control files in /tmp/data/udf-sim-test/
- performing a dry run in output mode
$ python /opt/gpudb/python/api/examples/udfsim.py execute -f /tmp/data/udf-sim-test/ -d -i udf_tc_in_table -o udf_tc_out_table
export KINETICA_PCF=/tmp/data/udf-sim-test/kinetica-udf-sim-icf-k2cncp
$ export KINETICA_PCF=/tmp/data/udf-sim-test/kinetica-udf-sim-icf-k2cncp
$ python /opt/gpudb/udf/api/python/udf_tc.py
$ python examples/udfsim.py output -d
No results
Output:
udf_tc_out_table: 10000 records
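A dry run is a convenient point to check record counts before anything is written to Kinetica. The sketch below parses the summary lines from the session above into a dictionary. It assumes each table is reported on its own line as `<table_name>: <N> records`, which is inferred from this one example rather than from a documented format; `parse_record_counts` is a hypothetical helper.

```python
import re

# Assumed line format, inferred from the example session: "<table>: <N> records"
_RECORDS_LINE = re.compile(r"^(?P<table>\S+):\s+(?P<count>\d+)\s+records$")


def parse_record_counts(dry_run_stdout):
    """Map each reported output table to its record count."""
    counts = {}
    for line in dry_run_stdout.splitlines():
        match = _RECORDS_LINE.match(line.strip())
        if match:
            counts[match.group("table")] = int(match.group("count"))
    return counts


sample = """No results
Output:
udf_tc_out_table: 10000 records"""
print(parse_record_counts(sample))  # {'udf_tc_out_table': 10000}
```

Lines that do not match the assumed pattern, such as "No results" and "Output:", are ignored rather than treated as errors.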
Limitations
The UDF simulator has some limitations:
- When simulating a distributed UDF, all the data from the input table goes to
  one place and the UDF isn’t run in parallel.
- Input data is written to actual files, not memory maps, so reading it from
  within the UDF may be slower; meaningful I/O performance testing is
  therefore not possible.