The following is a complete example, using the Java UDF API, of a non-CUDA UDF that takes a list of input tables and corresponding output tables (must be the same number) and, for each record of each input table, sums the squares of input table columns and saves the result to the corresponding output table column; i.e.
in.a^2 + in.b^2 + ... + in.n^2 -> out.a
This setup assumes the UDF is being developed on the Kinetica host (or head node host, if a multi-node Kinetica cluster), the Python database API is available at /opt/gpudb/api/python, and the Java UDF API is available at /opt/gpudb/udf/api/java (the Java UDF API compiled jar, kinetica-proc-api-1.0-jar-with-dependencies.jar, should already be available).
This example will contain the following Python scripts and Java files:
udf_sos_java_init.py: creates the input & output tables and loads test data
UdfSosJavaProc.java: the UDF itself
udf_sos_java_exec.py: creates & executes the UDF
Note
All commands should be run as the gpudb user.
After copying these three files to a gpudb-accessible directory on the Kinetica head node, the example can be run as follows:
$ /opt/gpudb/bin/gpudb_python udf_sos_java_init.py
$ javac -cp /opt/gpudb/udf/api/java/proc-api/kinetica-proc-api-1.0-jar-with-dependencies.jar UdfSosJavaProc.java
$ jar -cvf UdfSosJavaProc.jar UdfSosJavaProc.class
$ /opt/gpudb/bin/gpudb_python udf_sos_java_exec.py
The results of the run can be checked via the Kinetica Administration Application (GAdmin). There should exist two tables, udf_sos_in_table & udf_sos_out_table, each holding 10,000 records; the former containing pairs of numbers and the latter containing the sums of squares of those numbers. Each table will carry an id column, which can be used to associate input values with output sums.
To verify the existence of the tables, in GAdmin, click Data > Tables. Both tables should appear in the listing, each with 10,000 records.
To verify the calculations, click Query > KiSQL. Enter the following query into the SQL Statement box:
SELECT
    sos_in.id,
    STRING(x1) || '^2 + ' || STRING(x2) || '^2 = ' || STRING(y) AS "Equation"
FROM
    udf_sos_in_table sos_in,
    udf_sos_out_table sos_out
WHERE
    sos_in.id = sos_out.id
ORDER BY
    id;
The Query Result box should show each of the 10,000 calculations made.
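The same check can be sketched outside the database. The following is a minimal, hypothetical Python helper (not part of the example scripts) that assumes the records of both tables have already been fetched as plain tuples; it joins them on id, as the query does, and reports any id whose stored sum does not match x1^2 + x2^2. The name check_sums, the sample rows, and the tolerance are assumptions made for illustration; a tolerance is used because the value columns are floats.

```python
import math

def check_sums(input_rows, output_rows, rel_tol=1e-5):
    """Return the ids whose output y is missing or differs from x1^2 + x2^2."""
    # Index the output sums by id, mirroring the join on sos_in.id = sos_out.id
    y_by_id = {rec_id: y for rec_id, y in output_rows}
    mismatched = []
    for rec_id, x1, x2 in sorted(input_rows):
        expected = x1 * x1 + x2 * x2
        y = y_by_id.get(rec_id)
        if y is None or not math.isclose(y, expected, rel_tol=rel_tol):
            mismatched.append(rec_id)
    return mismatched

# Hypothetical sample rows standing in for records fetched from the two tables
sample_in = [(1, 3.0, 4.0), (2, 1.5, 2.0)]
sample_out = [(1, 25.0), (2, 6.25)]
print(check_sums(sample_in, sample_out))  # an empty list means every sum matched
```

An empty result indicates that every input record has a matching, correct output record.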
While the example UDF itself can run against multiple tables, the example run will use a single table, udf_sos_in_table, as input and a matching table, udf_sos_out_table, for output.
The input table will contain two float columns and be populated with 10,000 pairs of randomly-generated numbers. The output table will contain one float column that will hold the sums calculated by the UDF. Both tables will also contain an int column that is the calculation identifier, allowing the input data to be matched up with the output data after the UDF has run.
Note
The UDF will assume the first column of the input table, as defined in the original table creation process, is the identifier field. All of the remaining columns after the first will be used in the sum-of-squares calculation.
The UDF will calculate the sum of the squares of each of the 10,000 pairs of numbers and insert into the output table the corresponding 10,000 sums.
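As an illustration of the calculation only, the logic can be sketched in plain Python. This is not the Java UDF and does not use the UDF API; tables are modeled here as lists of tuples, and the function name sum_of_squares is an assumption. It follows the conventions described above: the same number of input and output tables, the first column treated as the identifier, and all remaining columns squared and summed.

```python
def sum_of_squares(input_tables, output_tables):
    """For each input/output pair, append (id, sum of squared remaining columns)."""
    if len(input_tables) != len(output_tables):
        raise ValueError("need the same number of input and output tables")
    for in_rows, out_rows in zip(input_tables, output_tables):
        for record in in_rows:
            # First column is the identifier; the rest feed the calculation
            rec_id, values = record[0], record[1:]
            out_rows.append((rec_id, sum(v * v for v in values)))

in_table = [(1, 3.0, 4.0), (2, 1.0, 2.0)]
out_table = []
sum_of_squares([in_table], [out_table])
print(out_table)  # [(1, 25.0), (2, 5.0)]
```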
This initialization script creates the input & output tables and populates the input data using the standard Kinetica Python API, all outside of the UDF execution framework.
Several aspects of the initialization process are noteworthy:
# Connect to the database using binary encoding
h_db = GPUdb(encoding = 'BINARY', host = KINETICA_HOST, port = KINETICA_PORT)
# Input table: int16 primary-key id plus two float value columns
columns = []
columns.append(GPUdbRecordColumn("id", GPUdbRecordColumn._ColumnType.INT, [GPUdbColumnProperty.PRIMARY_KEY, GPUdbColumnProperty.INT16]))
columns.append(GPUdbRecordColumn("x1", GPUdbRecordColumn._ColumnType.FLOAT))
columns.append(GPUdbRecordColumn("x2", GPUdbRecordColumn._ColumnType.FLOAT))
input_table = GPUdbTable(columns, INPUT_TABLE, db = h_db)
# Output table: matching id plus one float column for the sum of squares
columns = []
columns.append(GPUdbRecordColumn("id", GPUdbRecordColumn._ColumnType.INT, [GPUdbColumnProperty.PRIMARY_KEY, GPUdbColumnProperty.INT16]))
columns.append(GPUdbRecordColumn("y", GPUdbRecordColumn._ColumnType.FLOAT))
GPUdbTable(columns, OUTPUT_TABLE, db = h_db)
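The excerpt above does not show how the test data is produced. A minimal sketch of one way to generate it, using only the Python standard library, might look like the following. The helper name make_test_records, the value range, and the seed parameter are assumptions for illustration and are not taken from udf_sos_java_init.py. Identifiers run from 0 to 9,999, which fits the int16 id column.

```python
import random

def make_test_records(count=10000, max_value=100.0, seed=None):
    """Build [id, x1, x2] rows with sequential ids and random float pairs."""
    rng = random.Random(seed)  # a seed makes the data reproducible
    return [
        [rec_id, rng.uniform(0.0, max_value), rng.uniform(0.0, max_value)]
        for rec_id in range(count)
    ]

records = make_test_records(seed=42)
print(len(records), records[0][0], records[-1][0])
```

The resulting list would then be handed to the input table's insert call in the Python API (for instance, input_table.insert_records(records), assuming that method is available in the installed API version).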
This is the UDF itself. It uses the Kinetica Java UDF API to compute the sums of squares of input table columns and output those sums to an output table. It runs within the UDF execution framework and, as such, is not called directly; instead, it is registered and launched by udf_sos_java_exec.py.
Noteworthy in the UDF are the following:
Use of ProcData() to access the database:
ProcData procData = ProcData.get();
Sizing of the output table to match the record count of the input table:
long recordCount = inputTable.getSize();
outputTable.setSize(recordCount);
A call to complete() to mark the process as finished and ready for clean-up:
procData.complete();
The execution script uses the standard Kinetica Python API to register the UDF in the database and then execute it.
The registration step associates a name with the UDF execution code contained in UdfSosJavaProc.jar, gives the command (java) and arguments (the class path and the name of the proc class) used to run it, and specifies that it will run in distributed mode.
response = h_db.create_proc(proc_name, 'distributed', files, 'java', ['-cp', class_path, class_name], {})
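How files, class_path, and class_name are built is not shown in this excerpt. The sketch below is one plausible way to assemble them, under these assumptions: files maps each file name to that file's binary contents, the uploaded jar can be referenced by its bare name on the class path, and the API jar sits at the path used in the javac command earlier. The helper build_proc_args is hypothetical, and placeholder bytes stand in for the real jar so the sketch runs without the file present.

```python
def build_proc_args(jar_name, jar_bytes, class_name, api_jar_path):
    """Return the (files, args) pair for registering a Java UDF."""
    # File name -> binary contents; in real use, read the jar with open(path, "rb")
    files = {jar_name: jar_bytes}
    # Linux-style class path: the UDF jar followed by the UDF API jar
    class_path = ":".join([jar_name, api_jar_path])
    args = ["-cp", class_path, class_name]
    return files, args

files, args = build_proc_args(
    "UdfSosJavaProc.jar",
    b"placeholder jar bytes",  # stand-in for the compiled jar's contents
    "UdfSosJavaProc",
    "/opt/gpudb/udf/api/java/proc-api/kinetica-proc-api-1.0-jar-with-dependencies.jar",
)
print(args)
```

With these values, args corresponds to the ['-cp', class_path, class_name] list passed to create_proc.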
The execution step invokes the UDF by name, passing in the input & output table names against which the UDF will execute.
response = h_db.execute_proc(proc_name, {}, {}, [INPUT_TABLE], {}, [OUTPUT_TABLE], {})