Example UDF (Non-CUDA) - Sum of Squares

The following is a complete example, using the Python API, of a non-CUDA UDF that takes a list of input tables and an equal number of corresponding output tables. For each record of each input table, it sums the squares of the input table's columns and saves the result to the corresponding output table column; i.e.:

in.a^2 + in.b^2 + ... + in.n^2 -> out.a

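This example consists of the following Python scripts:

  • udf_sos_py_init.py: creates the schema and tables and populates the input data
  • udf_sos_py_proc.py: the UDF itself
  • udf_sos_py_exec.py: registers and executes the UDF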

After copying these three scripts to a local directory, the example can be run as follows, specifying the database URL, username, & password to the Python scripts:

Run Example
$ python udf_sos_py_init.py --url <kinetica-url> --username <kinetica-user> --password <kinetica-pass>
$ python udf_sos_py_exec.py --url <kinetica-url> --username <kinetica-user> --password <kinetica-pass>
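
Both scripts accept the same three connection options. How they parse them is not shown here; a minimal sketch, assuming standard `argparse` handling (the `parse_connection_args` helper name, the help strings, and the sample values are illustrative, not taken from the scripts):

```python
import argparse


def parse_connection_args(argv=None):
    """Parse the connection options shown in the commands above."""
    parser = argparse.ArgumentParser(description="Sum-of-squares UDF example")
    parser.add_argument("--url", required=True, help="Kinetica URL")
    parser.add_argument("--username", required=True, help="database user")
    parser.add_argument("--password", required=True, help="database password")
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Placeholder values standing in for real connection details
    args = parse_connection_args(
        ["--url", "http://localhost:9191", "--username", "admin", "--password", "secret"]
    )
    print(args.url, args.username)
```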

The results of the run can be checked via Kinetica SQL (KiSQL). There should exist two tables within the example_udf_python schema, udf_sos_in_table & udf_sos_out_table, each holding 10,000 records; the former containing pairs of numbers and the latter containing the sums of squares of those numbers. Each table will carry an id, which can be used to associate input values to output sums.

To verify the existence of the tables, run the following queries. Each query should return a count of 10,000.

Count Records in Input Table
SELECT COUNT(*)
FROM example_udf_python.udf_sos_in_table
Count Records in Results Table
SELECT COUNT(*)
FROM example_udf_python.udf_sos_out_table

To verify the calculations, run the following query, which should output each pair of input values and the corresponding sum of squares.

Show Sum of Squares Calculations
SELECT
   sos_in.id,
   STRING(x1) || '^2 + ' || STRING(x2) || '^2 = ' || STRING(y) AS "Equation"
FROM
   example_udf_python.udf_sos_in_table sos_in,
   example_udf_python.udf_sos_out_table sos_out
WHERE
   sos_in.id = sos_out.id
ORDER BY
   id
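
The same check can be scripted once both tables have been fetched. A minimal sketch over in-memory rows, assuming each input row is an `(id, x1, x2)` tuple and each output row an `(id, y)` tuple. The `verify_sums` helper and the tolerance are illustrative choices; the tolerance allows for rounding in case the float columns are stored in single precision:

```python
import math


def verify_sums(in_rows, out_rows, rel_tol=1e-5):
    """Return the ids whose stored sum does not match x1^2 + x2^2."""
    sums_by_id = {row_id: y for row_id, y in out_rows}
    mismatched = []
    for row_id, x1, x2 in in_rows:
        expected = x1 * x1 + x2 * x2
        actual = sums_by_id.get(row_id)
        # A missing output row counts as a mismatch
        if actual is None or not math.isclose(actual, expected, rel_tol=rel_tol):
            mismatched.append(row_id)
    return mismatched


# Hand-made sample rows, not data from the example run
in_rows = [(1, 3.0, 4.0), (2, 1.5, 2.0)]
out_rows = [(1, 25.0), (2, 6.25)]
print(verify_sums(in_rows, out_rows))  # prints []
```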

Execution Detail

While the example UDF itself can run against multiple tables, the example run will use a single schema-qualified table, example_udf_python.udf_sos_in_table, as input and a matching schema-qualified table, example_udf_python.udf_sos_out_table, for output.

The input table will contain two float columns and be populated with 10,000 pairs of randomly-generated numbers. The output table will contain one float column that will hold the sums calculated by the UDF. Both tables will also contain an int column that is the calculation identifier, allowing the input data to be matched up with the output data after the UDF has run.

Note

The UDF will assume the first column of the input table, as defined in the original table creation process, is the identifier field. All of the remaining columns after the first will be used in the sum-of-squares calculation.

The UDF will calculate the sum of the squares of each of the 10,000 pairs of numbers and insert into the output table the corresponding 10,000 sums.
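
The per-record calculation described above can be sketched in plain Python, independent of the UDF API. The sketch assumes each record is a list whose first element is the identifier; `sum_of_squares_record` is an illustrative name, not a function from the example scripts:

```python
def sum_of_squares_record(record):
    """Map [id, v1, v2, ...] to [id, v1^2 + v2^2 + ...]."""
    record_id, values = record[0], record[1:]
    return [record_id, sum(v * v for v in values)]


# Two sample records shaped like the input table: id, x1, x2
records = [[0, 3.0, 4.0], [1, 0.5, 1.5]]
results = [sum_of_squares_record(r) for r in records]
print(results)  # [[0, 25.0], [1, 2.5]]
```

Because every column after the first is included, the same function handles input tables with more than two value columns.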

udf_sos_py_init.py

This initialization script creates the schema and the input & output tables, and populates the input table, using the standard Kinetica Python API, all outside of the UDF execution framework.

Several aspects of the initialization process are noteworthy:

  • The external database connection, which indicates use of the standard Kinetica Python API; the UDF itself will not have this, as it runs within the database:

    Connect to the Database
    
    kinetica = gpudb.GPUdb(host=[args.url], username=args.username, password=args.password)
    
  • Schema, input, and output table creation:

    Create Schema
    
    SCHEMA = 'example_udf_python'
    kinetica.create_schema(SCHEMA, options=OPTION_NO_CREATE_ERROR)
    
    Create Input Table
    
    input_table = gpudb.GPUdbTable(
        _type = [
            ["id", "int", gpudb.GPUdbColumnProperty.INT16, gpudb.GPUdbColumnProperty.PRIMARY_KEY],
            ["x1", "float"],
            ["x2", "float"]
        ],
        name = INPUT_TABLE,
        db = kinetica
    )
    
    Create Results Table
    
    gpudb.GPUdbTable(
        _type = [
            ["id", "int", gpudb.GPUdbColumnProperty.INT16, gpudb.GPUdbColumnProperty.PRIMARY_KEY],
            ["y", "float"]
        ],
        name = OUTPUT_TABLE,
        db = kinetica
    )
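
The population step is not shown in the excerpts above. A minimal sketch of generating the 10,000 input rows, assuming a uniform value range of 0 to 10 (the source does not state the range) and using an illustrative `generate_input_records` helper:

```python
import random

MAX_RECORDS = 10000


def generate_input_records(count, seed=None):
    """Build [id, x1, x2] rows with random float values."""
    rng = random.Random(seed)
    return [
        [record_id, rng.uniform(0, 10), rng.uniform(0, 10)]
        for record_id in range(count)
    ]


records = generate_input_records(MAX_RECORDS, seed=42)
print(len(records), records[0][0], records[-1][0])  # 10000 0 9999

# Assumption: with the gpudb package and the input_table object created above,
# the rows could then be loaded with input_table.insert_records(records)
```

Identifiers 0 through 9,999 fit within the INT16 range of the id column.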
    

udf_sos_py_proc.py

This is the UDF itself. It uses the Kinetica Python UDF API to compute the sums of squares of input table columns and output those sums to an output table. It runs within the UDF execution framework and, as such, is not called directly; instead, it is registered and launched by udf_sos_py_exec.py.

Noteworthy in the UDF are the following:

  • The initial call to ProcData() to access the database:

    Begin UDF
    
    proc_data = ProcData()
    
  • The size of the output table must be specified before writing to it:

    Size Results Table to Match the Input Table
    
    out_table.size = in_table.size
    
  • The final call to complete() to mark the process as finished and ready for clean-up:

    End UDF
    
    proc_data.complete()
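
Between those calls, the UDF reads input columns and writes output columns. The flow can be pictured with in-memory stand-ins. The `StubTable` class below is hypothetical and only imitates the size-before-write behavior described above; it is not part of the Kinetica UDF API:

```python
class StubTable:
    """Hypothetical stand-in for a UDF table: a list of fixed-length columns."""

    def __init__(self, column_count):
        self.columns = [[] for _ in range(column_count)]

    @property
    def size(self):
        return len(self.columns[0])

    @size.setter
    def size(self, record_count):
        # Allocate every column up front, so writes by index are valid
        self.columns = [[0] * record_count for _ in self.columns]


def run_sum_of_squares(in_table, out_table):
    out_table.size = in_table.size  # size the output before writing to it
    in_ids, in_values = in_table.columns[0], in_table.columns[1:]
    out_ids, out_sums = out_table.columns
    for i in range(in_table.size):
        out_ids[i] = in_ids[i]
        out_sums[i] = sum(column[i] ** 2 for column in in_values)


in_table = StubTable(3)
in_table.columns = [[1, 2], [3.0, 1.0], [4.0, 2.0]]  # id, x1, x2
out_table = StubTable(2)  # id, y
run_sum_of_squares(in_table, out_table)
print(out_table.columns)  # [[1, 2], [25.0, 5.0]]
```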
    

udf_sos_py_exec.py

The execution script uses the standard Kinetica Python API to register the UDF in the database and then execute it.

The registration step associates a name with the UDF execution code contained in udf_sos_py_proc.py, specifies the command (python) and arguments (the name of the proc script) used to run it, and declares that it will run in distributed mode.

Create UDF
response = kinetica.create_proc(proc_name, 'distributed', files, 'python', [file_name], {})
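
The `files` argument is not shown being built. A minimal sketch, assuming it is a dict that maps each file name to that file's binary contents. The `load_proc_files` helper is an illustrative name, and the demonstration writes a throwaway stand-in file rather than reading the real proc script:

```python
import os
import tempfile


def load_proc_files(file_paths):
    """Map each file's base name to its raw bytes."""
    files = {}
    for path in file_paths:
        with open(path, "rb") as handle:
            files[os.path.basename(path)] = handle.read()
    return files


# Demonstrate with a temporary stand-in for udf_sos_py_proc.py
with tempfile.TemporaryDirectory() as tmp_dir:
    proc_path = os.path.join(tmp_dir, "udf_sos_py_proc.py")
    with open(proc_path, "wb") as handle:
        handle.write(b"# UDF code would go here\n")
    files = load_proc_files([proc_path])

print(list(files))  # ['udf_sos_py_proc.py']
```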

The execution step invokes the UDF by name, passing in the input & output table names against which the UDF will execute.

Execute UDF
response = kinetica.execute_proc(proc_name, {}, {}, [INPUT_TABLE], {}, [OUTPUT_TABLE], {})