The UDF simulator reproduces the mechanics of the /execute/proc call without requiring the UDF to be created in the database. It reads data out of tables, writes it to files in the correct format, and prints the environment variable that must be set for the UDF API. After the UDF code has run, the simulator can optionally read any output that was written and write it back into Kinetica. The simulator does not run the UDF code itself; it only manages the environment, so the developer can run the UDF code from a debugger, Jupyter notebook, etc., as long as the environment variable is set correctly.
Important
The UDF simulator is invoked via a Python script packaged with the native Python API, at: examples/udfsim.py. Though the script is distributed with the Python API, you can also use it when developing a UDF in C++.
Modes
Mode | Description |
---|---|
execute <execute parameters> | Simulates UDF execution |
output <output parameters> | Processes UDF output |
clean | Cleans up files generated from execute or output. Takes no parameters |
Execute Parameters and Flags
Parameter/Flag | Type Restriction | Description |
---|---|---|
-f </path/to/file> --path </path/to/file> | N/A | User-specified directory in which to create the control files. Default directory is /opt/gpudb/api/python/ |
-p <param_name value> --param <param_name value> | N/A | UDF execution parameter(s) (see /execute/proc for more information) |
-d --distributed | N/A | Enable distributed UDF simulation. This is the default execution mode |
-i <table_name [column_name, ...]> --input <table_name [column_name, ...]> | Distributed only | Input table and optional column list |
-o <table_name> --output <table_name> | Distributed only | Output table |
-n --nondistributed | N/A | Enable non-distributed UDF simulation |
-K <url> --url <url> | N/A | Kinetica URL; default is http://localhost:9191 |
-U <username> --username <username> | N/A | Kinetica username for authentication |
-P <password> --password <password> | N/A | Kinetica password for authentication |
-h --help | N/A | Prints the help menu |
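When the simulator is launched from a script or test harness, the flags in the table above can be assembled programmatically. The following is a minimal sketch and not part of udfsim.py: the helper name `build_execute_args` and its keyword arguments are hypothetical. It also assumes that each `-p` takes the parameter name and value as two separate arguments, and that `-i` takes the table name followed by any column names as separate arguments; the table suggests this but does not confirm it.

```python
import shlex


def build_execute_args(script="examples/udfsim.py", path=None, params=None,
                       inputs=None, outputs=None, distributed=True,
                       url=None, username=None, password=None):
    """Assemble an argument list for `udfsim.py execute` (hypothetical helper).

    inputs maps table name -> list of column names (empty list = all columns).
    outputs is a list of output table names.
    """
    args = ["python", script, "execute"]
    if path:
        args += ["-f", path]
    for name, value in (params or {}).items():
        # Assumption: name and value are passed as two separate arguments.
        args += ["-p", name, str(value)]
    if distributed:
        args.append("-d")
        for table, columns in (inputs or {}).items():
            # Assumption: column names follow the table name as separate arguments.
            args += ["-i", table, *columns]
        for table in outputs or []:
            args += ["-o", table]
    else:
        # The table marks -i and -o as "Distributed only".
        if inputs or outputs:
            raise ValueError("-i and -o are restricted to distributed mode")
        args.append("-n")
    if url:
        args += ["-K", url]
    if username:
        args += ["-U", username]
    if password:
        args += ["-P", password]
    return args


if __name__ == "__main__":
    argv = build_execute_args(path="/tmp/data/udf-sim-test/",
                              inputs={"udf_tc_in_table": []},
                              outputs=["udf_tc_out_table"])
    # Print the command as a single shell-quoted line.
    print(shlex.join(argv))
```

Returning a list rather than a single string keeps the arguments safe to pass directly to `subprocess.run` without shell quoting concerns.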
Output Parameters and Flags
Parameter/Flag | Description |
---|---|
-d --dry-run | Display output only; no output written to Kinetica |
-K <url> --url <url> | Kinetica URL; default is http://localhost:9191 |
-U <username> --username <username> | Kinetica username for authentication |
-P <password> --password <password> | Kinetica password for authentication |
-h --help | Prints the help menu |
Usage
Run the simulator with the execute argument and any parameters. Once finished, it prints an export command that sets the environment variable needed for the UDF API:
python udfsim.py execute <execute parameters>
Run the printed export command via command line:
export KINETICA_PCF=/opt/gpudb/api/kinetica-api-python/gpudb/kinetica-udf-sim-icf-Wvbdqi
Execute the UDF. For example, executing a Python UDF script:
python udf_cublas_proc.py
Tip
You can execute the UDF by any method (Jupyter notebook, debugger, etc.) as long as the KINETICA_PCF environment variable from the printed export command has been set. The UDF can be executed multiple times without rerunning the execute and export steps, as long as it hasn't output any data. For iterative testing, it may be desirable to comment out any data output code and use print statements instead.
Optionally, run the simulator with the output argument and any parameters to output data from the UDF into the database:
python udfsim.py output <output parameters>
Note
This mode requires the KINETICA_PCF environment variable from the printed export command to be set.
Optionally, run the simulator with the clean argument to remove the files generated by the execute and output modes. The files can also be deleted manually if desired:
python udfsim.py clean
Note
This mode requires the KINETICA_PCF environment variable from the printed export command to be set.
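The steps above can be chained from a single Python script, which is convenient when the UDF is being debugged repeatedly. This is a rough sketch under stated assumptions, not an official driver: the function names `parse_pcf_path` and `simulate` are hypothetical, and it assumes the execute mode prints a line of the form `export KINETICA_PCF=<path>` on standard output. The UDF script is run in the current process with `runpy`, so breakpoints set in a debugger still apply.

```python
import os
import re
import runpy
import subprocess
import sys

# Assumption: execute mode prints "export KINETICA_PCF=<path>" on its own line.
EXPORT_RE = re.compile(r"^export\s+KINETICA_PCF=(\S+)\s*$", re.MULTILINE)


def parse_pcf_path(simulator_stdout):
    """Return the KINETICA_PCF value from the simulator's printed export line."""
    match = EXPORT_RE.search(simulator_stdout)
    if match is None:
        raise ValueError("no 'export KINETICA_PCF=...' line found")
    return match.group(1)


def simulate(udfsim, udf_script, execute_args, output_args=(), clean=True):
    """Run execute, the UDF script, output, and optionally clean (hypothetical helper)."""
    # Execute mode: write the control files and capture the export line.
    done = subprocess.run([sys.executable, udfsim, "execute", *execute_args],
                          check=True, capture_output=True, text=True)
    # Equivalent of running the printed export command in a shell.
    os.environ["KINETICA_PCF"] = parse_pcf_path(done.stdout)
    # Run the UDF script in this process, as if invoked with "python udf_script".
    runpy.run_path(udf_script, run_name="__main__")
    # Output mode: child processes inherit KINETICA_PCF from os.environ.
    subprocess.run([sys.executable, udfsim, "output", *output_args], check=True)
    if clean:
        subprocess.run([sys.executable, udfsim, "clean"], check=True)
```

A call such as `simulate("examples/udfsim.py", "udf_tc.py", ["-d", "-i", "udf_tc_in_table", "-o", "udf_tc_out_table"], output_args=["-d"])` would mirror the manual sequence, with the output mode performing a dry run.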
Examples
Running the UDF simulator for the Python table copy UDF with the following parameters/flags:
- in distributed mode
- using the udf_tc_in_table table as input and the udf_tc_out_table table as output (both tables created using the Python table copy manager script)
- placing all control files in /tmp/data/udf-sim-test/
- performing a dry run on the output mode
$ python /opt/gpudb/python/api/examples/udfsim.py execute -f /tmp/data/udf-sim-test/ -d -i udf_tc_in_table -o udf_tc_out_table
export KINETICA_PCF=/tmp/data/udf-sim-test/kinetica-udf-sim-icf-k2cncp
$ export KINETICA_PCF=/tmp/data/udf-sim-test/kinetica-udf-sim-icf-k2cncp
$ python /opt/gpudb/udf/api/python/udf_tc.py
$ python examples/udfsim.py output -d
No results
Output:
udf_tc_out_table: 10000 records
Limitations
The UDF simulator has some limitations:
- When simulating a distributed UDF, all the data from the input table goes to one place, and the UDF isn't run in parallel
- Input data is written to actual files, not memory maps, so reading it from within the UDF may be slower than in the database; I/O performance testing is therefore not possible