The UDF C++ API consists of two files, Proc.hpp & Proc.cpp. These need to be
included in the make process, and the header file, Proc.hpp, needs to be
included in the UDF source code. There are no external dependencies beyond the
C++ standard library.
To take advantage of GPU processing within a UDF, the CUDA Toolkit must be downloaded & installed from the NVIDIA Developer Zone.
A UDF must first make a call to kinetica::ProcData::get(). This will parse the
primary control file and set up all the necessary structures. It will return a
kinetica::ProcData*, which is used to access everything else. All configuration
information is cached, so repeated calls to kinetica::ProcData::get() will not
reload any configuration files.
Once the UDF has been initialized, the following calls are possible:
Call | Description |
---|---|
procData->getRequestInfo() | Returns a map of basic information about the /execute/proc request, map values being accessed using procData->getRequestInfo().at(<map_key>); the full set of map keys is listed below, under Request Info Keys |
procData->getParams() | Returns a map of string-valued parameters that were passed into the UDF |
procData->getBinParams() | Returns a map of binary-valued parameters that were passed into the UDF |
procData->getInputData() | Returns an InputDataSet object for accessing input table data that was passed into the UDF |
Call | Description |
---|---|
procData->getResults() | Returns a map that can be populated with string-valued results to be returned from the UDF |
procData->getBinResults() | Returns a map that can be populated with binary-valued results to be returned from the UDF |
procData->getOutputData() | Returns an OutputDataSet object for writing output table data that will be written to the database |
The UDF must finish with a call to procData->complete(). This writes out some
final control information to indicate that the UDF completed successfully.
NOTE: If this call is not made, the database will assume that the UDF didn't
finish and will return an error to the caller.
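Putting these pieces together, a minimal UDF might look like the following
sketch; the parameter name and result key are hypothetical, and std::map-style
access to the parameter and result maps is an assumption:

```cpp
#include "Proc.hpp"

#include <string>

int main(int argc, char* argv[])
{
    // Parse the control file and set up all structures
    // (configuration is cached, so repeated calls are cheap)
    kinetica::ProcData* procData = kinetica::ProcData::get();

    // Read a hypothetical string-valued parameter named "factor"
    std::string factor = procData->getParams().at("factor");

    // ... process input data and populate output data here ...

    // Return a string-valued result under a hypothetical key
    procData->getResults()["status"] = "done";

    // Signal successful completion to the database
    procData->complete();

    return 0;
}
```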
The InputDataSet & OutputDataSet objects contain InputTable & OutputTable
objects, which, in turn, contain InputColumn & OutputColumn objects holding the
actual data sets. Tables and columns can be accessed by index or by name. For
example, given a customer table at InputDataSet index 5 and a name column at
that InputTable's index 1, either of the following calls will retrieve the
column values associated with customer.name:
procData->getInputData()["customer"]["name"]
procData->getInputData()[5][1]
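When a column is read repeatedly, the table and column lookups can be hoisted
out of the processing loop. A brief sketch, reusing the customer/name example;
the getSize() record-count accessor is an assumption:

```cpp
// Resolve the input table and column once, then reuse the references
const auto& customerTable = procData->getInputData()["customer"];
const auto& nameColumn = customerTable["name"];

for (std::size_t i = 0; i < nameColumn.getSize(); ++i)  // getSize() assumed
{
    // ... read values from nameColumn here ...
}
```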
Unlike the other Kinetica APIs, the UDF C++ API does not process data using
records or schemas; instead, it operates on columns of data. The raw column
values returned map closely to the data types used in the tables being
accessed:
Column Category | Column Type | UDF Value Type |
---|---|---|
Numeric | int | int32_t |
 | int8 | int8_t |
 | int16 | int16_t |
 | long | int64_t |
 | float | float |
 | double | double |
 | decimal | int64_t |
String | string | char* |
 | char[N] | char* |
 | ipv4 | uint32_t |
Date/Time | date | kinetica::Date |
 | time | kinetica::Time |
 | timestamp | int64_t |
Binary | bytes | uint8_t* |
While char[N] column values are arrays of N chars, there is a template class,
kinetica::CharN<N>, that makes accessing these easier (it performs automatic
conversions to and from string and provides accessors that make the chars
appear to be in the expected order). To access column values, use:
column.getValue<T>(n)
For example, to retrieve the value of the record at index 10 as a double:
column.getValue<double>(10)
NOTE: No type-checking is performed, so the correct type for the column must be used in this call.
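As a sketch, typed reads against hypothetical columns might look like this; the
column bindings are assumptions, and the CharN-to-string conversion relies on
the automatic conversions described above:

```cpp
// The template type must match the column's UDF value type exactly
int32_t id     = idColumn.getValue<int32_t>(0);      // int column
double  amount = amountColumn.getValue<double>(0);   // double column

// char[4] columns are easier to handle through the CharN helper
kinetica::CharN<4> code = codeColumn.getValue<kinetica::CharN<4> >(0);
std::string codeStr = (std::string)code;  // string conversion assumed
```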
Both string & bytes columns are variable-length and are not able to be accessed directly by index. To access string & bytes columns, use these two methods, respectively:
column.getVarString(n)
column.getVarBytes(n)
To determine if a particular value of a nullable column is null, use:
column.isNull(n)
Calling getValue, getVarString, or getVarBytes on a null value causes undefined
behavior, so always check first. Calling isNull on a non-nullable column will
always return false.
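A null-safe read over a nullable string column might follow this pattern; the
column name, the getSize() accessor, and the std::string return type are
assumptions:

```cpp
for (std::size_t i = 0; i < emailColumn.getSize(); ++i)  // getSize() assumed
{
    if (emailColumn.isNull(i))
        continue;  // reading a null value is undefined behavior

    std::string email = emailColumn.getVarString(i);
    // ... process email ...
}
```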
To output data to a table, the size of the table must be set in order to
allocate enough space in all of the columns to hold the correct number of
values. To do this, call:
table.setSize(size)
At that point, there are two approaches to loading in values: appending a
series of values to each table column, and setting table column values by
index. The following methods are called on each OutputColumn of the
OutputTable, as shown in the sketch after this list:
appendValue(value)
appendVarString(value)
appendVarBytes(value)
appendNull()
setValue(n, value)
setNull(n)
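As a sketch, writing records to a hypothetical output table with an int column
and a nullable string column; the table and column names are assumptions:

```cpp
const std::size_t recordCount = 100;

auto& resultTable = procData->getOutputData()["result"];  // hypothetical table
resultTable.setSize(recordCount);  // allocate space in every column

auto& idColumn   = resultTable["id"];
auto& nameColumn = resultTable["name"];  // nullable string column

for (std::size_t i = 0; i < recordCount; ++i)
{
    // Append-style loading: one value per column per record
    idColumn.appendValue((int32_t)i);

    if (i % 10 == 0)
        nameColumn.appendNull();
    else
        nameColumn.appendVarString("name-" + std::to_string(i));
}
```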
For debugging purposes, there is also a column.toString(n) method that converts
a specified column value to a string, regardless of type. Any output from the
UDF to stdout is written to the /opt/gpudb/core/logs/gpudb.log file.
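For example, a quick debugging dump of a column value, which would land in
gpudb.log as noted above:

```cpp
#include <iostream>

// Print the value at index 0 regardless of the column's type
std::cout << "first value: " << column.toString(0) << std::endl;
```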
A variety of details about the executing UDF can be extracted from the request information map made available to each running proc. The full list of keys follows.
Map Key | Description |
---|---|
proc_name | The name of the proc (UDF) being executed |
run_id | The run ID of the proc being executed; this is also displayed in Gadmin on the UDF page in the Status section as a link you can click on to get more detailed information; note that although this is an integer, it should not be relied upon as such, as its format may change in the future |
rank_number | The rank number on which the current proc instance is executing; for distributed procs [1..n], for non-distributed procs 0 |
tom_number | The TOM number within the rank on which the current proc instance is executing; for distributed procs [0..n-1], where n is the number of TOMs per rank; for non-distributed procs it is not provided, since these do not run on a TOM |
<option_name> | Any options passed in the options map in the /execute/proc request will also be in the request info map |
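A sketch of reading a few of these keys; parsing rank_number as an integer is
an assumption about its current format:

```cpp
#include <cstdlib>
#include <string>

const auto& info = procData->getRequestInfo();

const std::string& procName = info.at("proc_name");
int rankNumber = std::atoi(info.at("rank_number").c_str());
```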
Data is passed into procs in segments. Each segment consists of the entirety of the data on a single TOM and is processed by the proc instance executing on that TOM. Thus, there is a 1-to-1 mapping of data segment and executing proc instance, though this relationship may change in the future.
Running the same proc multiple times should result in the same set of segments, assuming the same environment and system state across runs.
Map Key | Description |
---|---|
data_segment_id | A unique identifier for the segment of the currently executing proc instance; all of the data segment IDs for a given proc execution are displayed in Gadmin when you click on the run ID; note that although it is possible to identify rank and TOM numbers from this ID, it should not be relied upon, as its format may change in the future |
data_segment_count | The total cluster-wide count of data segments for distributed procs; for non-distributed procs 1 |
data_segment_number | The number of the current data segment or executing proc instance [0..data_segment_count-1] |
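These keys let each proc instance identify its position among all instances,
e.g., to assign one-time work to a single instance. A sketch, with integer
parsing again assumed:

```cpp
const auto& info = procData->getRequestInfo();

int segNumber = std::atoi(info.at("data_segment_number").c_str());
int segCount  = std::atoi(info.at("data_segment_count").c_str());

// e.g., have only the first instance perform a one-time task
if (segNumber == 0)
{
    // ... one-time, cluster-wide work ...
}
```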
The following keys can be used to connect back to Kinetica using the regular
API endpoint calls. Use them with caution in distributed procs, particularly in
large clusters, to avoid overwhelming the head node. Also note that multi-head
ingest may not work from a proc in some cases without overriding the worker
URLs to use internal IP addresses.
Map Key | Description |
---|---|
head_url | The URL to connect to |
username | The username to connect as (corresponds to the user that called /execute/proc) |
password | The password for the username |
Note: username and password are not the actual login credentials of the user;
they are randomly generated temporary credentials which, for security reasons,
should still not be printed or output to logs, etc., as they are live
credentials for a period of time.
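A sketch of retrieving these values for handing off to the regular Kinetica
client API (client construction is not shown here, and the values must never be
printed or logged):

```cpp
const auto& info = procData->getRequestInfo();

// Temporary, live credentials; never print or log these
const std::string& headUrl  = info.at("head_url");
const std::string& username = info.at("username");
const std::string& password = info.at("password");

// ... pass headUrl/username/password to the Kinetica client API ...
```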