This page covers everything needed to begin writing UDFs using the C++ UDF API. For more information on executing C++ UDFs, see Running C++ UDFs.
To begin writing C++ UDFs, access to the Kinetica C++ UDF API is required. In default Kinetica installations, the C++ UDF API is located in the /opt/gpudb/udf/api/cpp directory.
If developing UDFs without a local Kinetica installation, the API can be downloaded from the C++ UDF API repo on GitHub. After downloading, see the README.md in the newly created UDF API directory for further setup instructions.
The UDF C++ API consists of two files: Proc.hpp & Proc.cpp. These need to be included in the make process, and the header file, Proc.hpp, needs to be included in the UDF source code. There are no external dependencies beyond the C++ standard library.
To take advantage of GPU processing within a UDF, the CUDA Toolkit must be downloaded & installed from the NVIDIA Developer Zone.
A UDF must get a handle to ProcData using kinetica::ProcData::get(). This will parse the primary control file and set up all the necessary structures. It will return a kinetica::ProcData* instance, which is used to access everything else. All configuration information is cached, so repeated calls to kinetica::ProcData::get() will not reload any configuration files.
Important
When you get a handle to ProcData, the handle is actually to the data given to that instance (OS process) of the UDF on the Kinetica host; therefore, there will be a ProcData handle for every instance of the UDF on your Kinetica host.
Unlike the other Kinetica APIs, the UDF C++ API does not process data using records or schemas, operating in terms of columns of data instead. The raw column values returned closely map to the data types used in the tables being accessed:
Column Category | Column Type | UDF C++ Type |
---|---|---|
Numeric | int | int32_t |
Numeric | int8 | int8_t |
Numeric | int16 | int16_t |
Numeric | long | int64_t |
Numeric | float | float |
Numeric | double | double |
Numeric | decimal | int64_t |
String | string | char* |
String | char[N] | char* |
String | ipv4 | uint32_t |
Date/Time | date | kinetica::Date |
Date/Time | datetime | kinetica::DateTime |
Date/Time | time | kinetica::Time |
Date/Time | timestamp | int64_t |
Binary | bytes | uint8_t* |
While CharN column values are arrays of N chars, there is a template class, kinetica::CharN<N>, that makes accessing these easier (it converts to and from strings automatically and provides accessors that make the chars appear to be in the expected order). To access column values, use:
column.getValue<T>(n)
For example, to retrieve the value at index 10 as a double:
column.getValue<double>(10)
Note
No type-checking is performed, so the correct type for the column must be used in this call.
Both string & bytes columns are variable-length and are not able to be accessed directly by index. To access string & bytes columns, use these two methods, respectively:
column.getVarString(n)
column.getVarBytes(n)
To determine if a particular value of a nullable column is null, use:
column.isNull(n)
Calling getValue, getVarString, or getVarBytes on a null value causes undefined behavior, so always check first. Calling isNull on a non-nullable column will always return false.
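The null check described above can be sketched as a small loop. This is only an illustration: FakeDoubleColumn is a hypothetical stand-in that mimics the getValue<T>(n) and isNull(n) call shapes, not the Kinetica column class, and the row count is passed in explicitly because how it is obtained is outside this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an input column; it is NOT the Kinetica class.
// It only mimics the getValue<T>(n) / isNull(n) call shapes described above.
struct FakeDoubleColumn {
    std::vector<double> values;
    std::vector<bool> nulls;

    std::size_t size() const { return values.size(); }

    bool isNull(std::size_t n) const {
        assert(n < nulls.size());
        return nulls[n];
    }

    template <typename T>
    T getValue(std::size_t n) const {
        assert(n < values.size());
        return static_cast<T>(values[n]);
    }
};

// Sums the non-null values, checking isNull(n) before every getValue<T>(n)
// so that a null slot is never read.
template <typename Column>
double sumNonNull(const Column& column, std::size_t count) {
    double total = 0.0;
    for (std::size_t n = 0; n < count; ++n) {
        if (column.isNull(n)) {
            continue;  // skip nulls instead of reading them
        }
        total += column.template getValue<double>(n);
    }
    return total;
}
```

Because sumNonNull is a template, the same loop should apply to any column-like type exposing these two calls, provided the requested type matches the column's actual type.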
For debugging purposes, there is also a column.toString(n) method that converts a specified column value to a string, regardless of type.
Call | Type | Description |
---|---|---|
procData->getInputData() | Object | Returns an InputDataSet object for accessing input table data that was passed into the UDF |
procData->getRequestInfo() | Map of strings to strings | Returns a map of basic information about the /execute/proc request; map values are accessed using procData->getRequestInfo().at(<map_key>). The full set of map keys is listed below, under Request Info Keys. |
procData->getParams() | Map of strings to strings | Returns a map of string-valued parameters that were passed into the UDF |
procData->getBinParams() | Map of strings to bytes | Returns a map of binary-valued parameters that were passed into the UDF |
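Since parameters arrive as strings, a UDF usually has to convert them itself. The following is a minimal sketch of reading an integer parameter with a fallback. It assumes the parameter map behaves like a std::map<std::string, std::string>; the helper name intParamOr and the batch_size key in the usage note are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Returns the integer value stored under `key`, or `fallback` when the key
// is absent or its text is not a complete, in-range integer.
// ASSUMPTION: the parameter map is a std::map<std::string, std::string>.
int intParamOr(const std::map<std::string, std::string>& params,
               const std::string& key, int fallback) {
    const auto it = params.find(key);
    if (it == params.end()) {
        return fallback;
    }
    try {
        std::size_t used = 0;
        const int value = std::stoi(it->second, &used);
        // Reject trailing garbage such as "12x".
        return used == it->second.size() ? value : fallback;
    } catch (const std::exception&) {
        // std::stoi throws on non-numeric or out-of-range text.
        return fallback;
    }
}
```

Under that assumption, a call such as intParamOr(procData->getParams(), "batch_size", 1000) would read the parameter without throwing when the caller omitted it.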
The InputDataSet object returned from procData->getInputData() contains InputTable objects, which in turn contain InputColumn objects holding the actual data. Tables and columns can be accessed by index or by name. For example, given a customer table at InputDataSet index 5 and a name column at that InputTable's index 1, either of the following calls will retrieve the column values associated with customer.name:
procData->getInputData().getTable("customer").getColumn("name")
procData->getInputData().getTable(5).getColumn(1)
The request info keys are returned from calling procData->getRequestInfo(). These keys include a variety of details about the executing UDF from the request information map made available to each running UDF.
Map Key | Description |
---|---|
proc_name | The name of the UDF being executed |
run_id | The run ID of the UDF being executed; this is also displayed in GAdmin on the UDF page in the Status section as a link you can click on to get more detailed information; note that although this is an integer, it should not be relied upon as such, as its format may change in the future |
rank_number | The processing node container number on which the current UDF instance is executing; for distributed UDFs, [1..n]; for non-distributed UDFs, 0 |
tom_number | The processing node number within the processing node container on which the current UDF instance is executing; for distributed UDFs, [0..n-1], where n is the number of processing nodes per processing node container; for non-distributed UDFs, it is not provided, since these do not run on a processing node |
kifs_mount_point | The current mount point for the Kinetica File System (KiFS) |
<option_name> | Any options passed in the options map in the /execute/proc request will also be in the request info map |
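Because tom_number is not provided for non-distributed UDFs, looking it up with at() would throw in that case. A minimal sketch of a lookup that tolerates missing keys follows; it assumes the request info map behaves like a std::map<std::string, std::string>, and describeInstance is a hypothetical helper name.

```cpp
#include <cassert>
#include <map>
#include <string>

// Builds a short description of where this UDF instance is running, using
// find() so that an absent tom_number does not throw.
// ASSUMPTION: the request info map is a std::map<std::string, std::string>.
std::string describeInstance(const std::map<std::string, std::string>& info) {
    const auto rank = info.find("rank_number");
    const auto tom = info.find("tom_number");
    if (rank == info.end()) {
        return "unknown";
    }
    if (tom == info.end()) {
        // Per the table above, tom_number is absent for non-distributed UDFs.
        return "rank " + rank->second + " (non-distributed)";
    }
    return "rank " + rank->second + ", tom " + tom->second;
}
```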
When executing UDFs that utilize CUDA, additional request information is returned.
Map Key | Description |
---|---|
cuda_devices | The number of CUDA devices currently in use |
cuda_free | The amount of CUDA memory available |
Data is passed into UDFs in segments. Each segment consists of the entirety of the data on a single TOM and is processed by the UDF instance executing on that TOM. Thus, there is a 1-to-1 mapping of data segment and executing UDF instance, though this relationship may change in the future.
Running the same UDF multiple times should result in the same set of segments, assuming the same environment and system state across runs.
Map Key | Description |
---|---|
data_segment_id | A unique identifier for the segment of the currently executing UDF instance; all of the data segment IDs for a given UDF execution are displayed in GAdmin when you click on the run ID; note that although it is possible to identify rank and TOM numbers from this ID, it should not be relied upon, as its format may change in the future |
data_segment_count | The total cluster-wide count of data segments for distributed UDFs; for non-distributed UDFs, 1 |
data_segment_number | The number of the current data segment or executing UDF instance, [0..data_segment_count-1] |
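One possible use of data_segment_number and data_segment_count, offered here purely as an illustration rather than anything the API prescribes, is to divide a list of external work items evenly among the running instances. The sketch below assumes the request info map behaves like a std::map<std::string, std::string>; ItemRange, sliceForSegment, and sliceFromRequestInfo are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Half-open range [begin, end) of work item indices.
struct ItemRange {
    std::size_t begin;
    std::size_t end;
};

// Splits totalItems as evenly as possible across segmentCount segments and
// returns the slice owned by segmentNumber. The first (totalItems %
// segmentCount) segments each receive one extra item.
ItemRange sliceForSegment(std::size_t totalItems, std::size_t segmentNumber,
                          std::size_t segmentCount) {
    assert(segmentCount > 0 && segmentNumber < segmentCount);
    const std::size_t base = totalItems / segmentCount;
    const std::size_t extra = totalItems % segmentCount;
    const std::size_t begin =
        segmentNumber * base + std::min(segmentNumber, extra);
    const std::size_t size = base + (segmentNumber < extra ? 1 : 0);
    return {begin, begin + size};
}

// Reads the two segment keys from the request info map and delegates.
// ASSUMPTION: the map is a std::map<std::string, std::string> and both
// keys hold non-negative decimal integers.
ItemRange sliceFromRequestInfo(const std::map<std::string, std::string>& info,
                               std::size_t totalItems) {
    return sliceForSegment(totalItems,
                           std::stoul(info.at("data_segment_number")),
                           std::stoul(info.at("data_segment_count")));
}
```

With 10 items and 3 segments, this scheme assigns 4, 3, and 3 items to segments 0, 1, and 2, and every item is covered exactly once.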
These can be used to connect back to Kinetica using the regular API endpoint calls. Use with caution in distributed UDFs, particularly in large clusters, to avoid overwhelming the head node. Also note, multi-head ingest may not work from a UDF in some cases without overriding the worker URLs to use internal IP addresses.
Map Key | Description |
---|---|
head_url | The URL to connect to |
username | Randomly generated temporary username used to execute the UDF |
password | Randomly generated temporary password used to execute the UDF |
Important
Although username and password are randomly generated temporary credentials, for security reasons they should still not be printed or output to logs, etc., as they are live credentials for a period of time.
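When dumping the request info map for debugging, one way to honor this warning is to mask the two credential keys before anything is written. The following is a minimal sketch, assuming the map behaves like a std::map<std::string, std::string>; describeRequestInfo and the "<redacted>" placeholder are hypothetical choices.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Renders the request info map as key=value lines, masking the temporary
// credentials so they never reach a log.
// ASSUMPTION: the map is a std::map<std::string, std::string>.
std::string describeRequestInfo(
    const std::map<std::string, std::string>& info) {
    std::ostringstream out;
    for (const auto& entry : info) {
        const bool secret =
            entry.first == "username" || entry.first == "password";
        out << entry.first << '='
            << (secret ? std::string("<redacted>") : entry.second) << '\n';
    }
    return out.str();
}
```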
To output data to a table, the size of the table must be set in order to allocate enough space in all of the columns to hold the correct number of values. To do this, call:
table.setSize()
Call | Description |
---|---|
procData->getResults() | Returns a map that can be populated with string-valued results to be returned from the UDF |
procData->getBinResults() | Returns a map that can be populated with binary-valued results to be returned from the UDF |
procData->getOutputData() | Returns an OutputDataSet object for writing output table data that will be written to the database |
The OutputDataSet object returned from procData->getOutputData() contains OutputTable objects, which in turn contain OutputColumn objects holding the actual data. Tables and columns are accessed the same way as in InputDataSet:
procData->getOutputData().getTable("customer").getColumn("name")
procData->getOutputData().getTable(5).getColumn(1)
There are two approaches to loading values into Kinetica: appending values to a column in order, or setting values by index. The following methods are called on each OutputColumn of the OutputTable:
appendValue(value)
appendVarString(value)
appendVarBytes(value)
appendNull()
setValue(n, value)
setNull(n, value)
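The difference between index-based and append-based writes can be sketched with a stand-in type. FakeOutputColumn below is hypothetical and is NOT the Kinetica OutputColumn: it only mimics the setValue(n, value) and appendValue(value) call shapes, and it places sizing on the column for brevity, whereas the text above sets the size on the table. Whether the real append methods require the size to be set first is not covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an output column; NOT the Kinetica class.
struct FakeOutputColumn {
    std::vector<double> values;

    // Stand-in for sizing: allocates `count` zero-initialized slots.
    void setSize(std::size_t count) { values.assign(count, 0.0); }

    // Index-based write; the slot must already exist.
    void setValue(std::size_t n, double value) {
        assert(n < values.size());
        values[n] = value;
    }

    // Append-based write; grows the column by one value.
    void appendValue(double value) { values.push_back(value); }
};

// Approach 1: size first, then set each value by index.
void fillByIndex(FakeOutputColumn& column, const std::vector<double>& source) {
    column.setSize(source.size());
    for (std::size_t n = 0; n < source.size(); ++n) {
        column.setValue(n, source[n]);
    }
}

// Approach 2: append values one at a time, in order.
void fillByAppend(FakeOutputColumn& column, const std::vector<double>& source) {
    for (double value : source) {
        column.appendValue(value);
    }
}
```

In this stand-in both approaches leave the column with identical contents; index-based writes allow values to be filled in any order, while appends must happen in row order.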
The UDF status can be set to a string value to help convey status information during UDF execution, e.g., procData->setStatus("25% complete"). The /show/proc/status endpoint will return any status messages set for each data segment -- one data segment per processing node if in distributed mode, or one data segment total in non-distributed mode. These messages are subject to the following scenarios:
If the user-provided UDF code is executing and has set a status message, /show/proc/status will return the last message that was set
If the user-provided UDF code finishes executing successfully, the status message is cleared
Note
The UDF may not show as "complete" yet since any data written by the UDF (in distributed mode) still has to be inserted into the database, but the status set by the UDF code isn't relevant to this process
If the UDF is killed while executing user-provided UDF code, /show/proc/status will return the last message that was set
If the user-provided UDF code errors out, /show/proc/status will return the error message and the last status message that was set in parentheses
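A status string like the "25% complete" example can be built from a simple progress counter. The helper below is a hypothetical sketch that only formats the message; the resulting string would then be handed to the status call described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Formats a progress message such as "25% complete" from a count of
// processed items. The percentage is rounded down and capped at 100; a
// total of zero is reported as 0% to avoid dividing by zero.
std::string progressMessage(std::size_t done, std::size_t total) {
    if (total == 0) {
        return "0% complete";
    }
    const std::size_t percent =
        std::min<std::size_t>(100, done * 100 / total);
    return std::to_string(percent) + "% complete";
}
```

Updating the status only every few thousand rows, rather than on every row, is one way to keep the overhead of such reporting low.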
The UDF must finish with a call to procData->complete(). This writes out some final control information to indicate that the UDF completed successfully.
Note
If this call is not made, the database will assume that the UDF didn’t finish and will return an error to the caller.
Any output from the UDF is written to two places:
The /opt/gpudb/core/logs/gpudb_proc.log file local to the processing node container
Logging output location for UDFs is currently not configurable.