Java UDF API¶

This section covers what is needed to begin writing UDFs using the Java UDF API. For more information on executing Java UDFs, see Running Java UDFs.

Dependencies
Initializing
Column Types
Reading Input Data
- Accessing Input Values
- Request Info Keys
Writing Output Data
- Setting Output Values
Status Reporting
Complete
Logging

Dependencies¶

To begin writing Java UDFs, access to the Kinetica Java UDF API is required. In default Kinetica installations, the Java UDF API is located in the /opt/gpudb/udf/api/java directory.

If developing UDFs without a local Kinetica installation, the API can be downloaded from the Java UDF API repo on GitHub. After downloading, see the README.md in the UDF API directory created for further setup instructions.

The UDF Java API consists of three files: ProcData.java, MemoryMappedFile.cpp, and MemoryMappedFile.java. No external dependencies are required.

Initializing¶

A UDF must first get a handle to ProcData using ProcData.get(). This call parses the primary control file and sets up all the necessary structures. It returns a ProcData instance, which is used to access everything else. All configuration information is cached, so repeated calls to ProcData.get() will not reload any configuration files.

Important

When you get a handle to ProcData, the handle is actually to the data given to that instance (OS process) of the UDF on the Kinetica host; therefore, there will be a separate ProcData handle for every instance of the UDF on your Kinetica host.

Column Types¶

Unlike the other Kinetica APIs, the UDF Java API does not process data using records or schemas, operating in terms of columns of data instead. The raw column values returned closely map to the data types used in the tables being accessed:

Column Category	Column Type	UDF Java Type
Numeric	int	Integer
	int8	Byte
	int16	Short
	long	Long
	float	Float
	double	Double
	decimal	BigDecimal
String	string	String
	char[N]	String
	ipv4	Inet4Address
Date/Time	date	Calendar
	datetime	Calendar
	time	Calendar
	timestamp	Long
Binary	bytes	byte[]

Both string & bytes columns are variable-length and are not able to be accessed directly by index. To access string & bytes columns, use these two methods, respectively:

column.getVarString(n)
column.getVarBytes(n)

To get values from any other column type, use the following methods:

// int
column.getInt(n)

// int8
column.getByte(n)

// int16
column.getShort(n)

// long, timestamp
column.getLong(n)

// float
column.getFloat(n)

// double
column.getDouble(n)

// decimal
column.getBigDecimal(n)

// char[N]
column.getChar(n)

// ipv4
column.getInet4Address(n)

// date, datetime, time
column.getCalendar(n)

To determine if a particular value of a nullable column is null, use:

column.isNull(n)

The getter methods above return null when the requested column value is null (for all data types).

For debugging purposes, there is also a column.toString(n) method that converts a specified column value to a string, regardless of type.
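
The null-handling rules above can be sketched as a loop over a single column. The `LongColumn` interface below is a hypothetical stand-in that declares only the `getLong(n)` and `isNull(n)` methods named in this section, so that the sketch compiles without the Kinetica UDF API on the classpath. The `int` index type, the boxed `Long` return type, and the `rowCount` parameter are assumptions rather than confirmed signatures.

```java
import java.util.Arrays;
import java.util.List;

class NullAwareSum {
    /** Sums the first rowCount values of the column, skipping null values. */
    static long sumNonNull(LongColumn column, int rowCount) {
        long total = 0;
        for (int n = 0; n < rowCount; n++) {
            if (column.isNull(n)) {
                continue; // check for null first instead of unboxing a null Long
            }
            total += column.getLong(n);
        }
        return total;
    }

    /** In-memory fake column, used only to exercise the sketch. */
    static LongColumn fakeColumn(List<Long> values) {
        return new LongColumn() {
            @Override public Long getLong(int n) { return values.get(n); }
            @Override public boolean isNull(int n) { return values.get(n) == null; }
        };
    }

    public static void main(String[] args) {
        LongColumn column = fakeColumn(Arrays.asList(10L, null, 32L));
        System.out.println(sumNonNull(column, 3)); // prints 42
    }
}

/** Hypothetical stand-in for a UDF column of type long; not the real API class. */
interface LongColumn {
    Long getLong(int n);    // assumed to return null for a null value
    boolean isNull(int n);
}
```

In a real UDF the same loop would presumably run against a column obtained from the input data set rather than the fake shown here.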

Reading Input Data¶

Call	Type	Description
`procData.getInputData()`	Object	Returns an `InputDataSet` object for accessing input table data that was passed into the UDF
`procData.getRequestInfo()`	Map of strings to strings	Returns a map of basic information about the `executeProc()` request; map values are accessed using `procData.getRequestInfo().get(<map_key>)`. The full set of map keys is listed below, under Request Info Keys.
`procData.getParams()`	Map of strings to strings	Returns a map of string-valued parameters that were passed into the UDF
`procData.getBinParams()`	Map of strings to bytes	Returns a map of binary-valued parameters that were passed into the UDF

Accessing Input Values¶

The InputDataSet object returned from procData.getInputData() contains InputTable objects, each of which in turn contains InputColumn objects holding the actual data. Tables and columns can be accessed by index or by name. For example, given a customer table at InputDataSet index 5 and a name column at that InputTable's index 1, either of the following calls will retrieve the column values associated with customer.name:

procData.getInputData().getTable("customer").getColumn("name")
procData.getInputData().getTable(5).getColumn(1)

Request Info Keys¶

The request info keys are returned from calling procData.getRequestInfo(). This map is made available to each running UDF and includes a variety of details about the executing UDF.

General Information¶

Map Key	Description
`proc_name`	The name of the UDF being executed
`run_id`	The run ID of the UDF being executed; this is also displayed in GAdmin on the UDF page in the Status section as a link you can click on to get more detailed information; note that although this is an integer, it should not be relied upon as such, as its format may change in the future
`rank_number`	The processing node container number on which the current UDF instance is executing; for distributed UDFs, [1..n]. For non-distributed UDFs, 0
`tom_number`	The processing node number within the processing node container on which the current UDF instance is executing; for distributed UDFs, [0..n-1], where n is the number of processing nodes per processing node container. For non-distributed UDFs it is not provided, since these do not run on a processing node
`kifs_mount_point`	The current mount point for the Kinetica File System (KiFS)
`<option_name>`	Any options passed in the `options` map in the /execute/proc request will also be in the request info map
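
As a rough sketch of reading these keys, the helper below works on a plain `Map<String, String>` of the kind the request info map is described as being. The class and method names are hypothetical, and the sample values in `main` are made up for illustration. It keeps `run_id` as an opaque string, as advised above, and treats a missing `tom_number` as the non-distributed case.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

class RequestInfoReader {
    /** Returns the rank number; per the table above, 0 means a non-distributed UDF. */
    static int rankNumber(Map<String, String> requestInfo) {
        return Integer.parseInt(requestInfo.get("rank_number"));
    }

    /** Returns the TOM number, or empty when the key is absent (non-distributed UDF). */
    static OptionalInt tomNumber(Map<String, String> requestInfo) {
        String value = requestInfo.get("tom_number");
        return value == null ? OptionalInt.empty() : OptionalInt.of(Integer.parseInt(value));
    }

    /** Builds a short label for diagnostics; run_id is never parsed as a number. */
    static String describe(Map<String, String> requestInfo) {
        OptionalInt tom = tomNumber(requestInfo);
        String tomLabel = tom.isPresent() ? String.valueOf(tom.getAsInt()) : "n/a";
        return requestInfo.get("proc_name") + " run " + requestInfo.get("run_id")
                + " rank " + rankNumber(requestInfo) + " tom " + tomLabel;
    }

    public static void main(String[] args) {
        // Sample values are invented for this sketch.
        Map<String, String> info = new HashMap<>();
        info.put("proc_name", "my_udf");
        info.put("run_id", "17");
        info.put("rank_number", "1");
        info.put("tom_number", "0");
        System.out.println(describe(info)); // prints: my_udf run 17 rank 1 tom 0
    }
}
```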

CUDA Information¶

When executing UDFs that utilize CUDA, additional request information is returned.

Map Key	Description
`cuda_devices`	The number of CUDA devices currently in use
`cuda_free`	The amount of CUDA memory available

Data Segment Information¶

Data is passed into UDFs in segments. Each segment consists of the entirety of the data on a single TOM and is processed by the UDF instance executing on that TOM. Thus, there is a 1-to-1 mapping of data segment and executing UDF instance, though this relationship may change in the future.

Running the same UDF multiple times should result in the same set of segments, assuming the same environment and system state across runs.

Map Key	Description
`data_segment_id`	A unique identifier for the segment of the currently executing UDF instance; all of the data segment IDs for a given UDF execution are displayed in GAdmin when you click on the run ID; note that although it is possible to identify rank and TOM numbers from this ID, it should not be relied upon, as its format may change in the future
`data_segment_count`	The total cluster-wide count of data segments for distributed UDFs; for non-distributed UDFs, 1
`data_segment_number`	The number of the current data segment or executing UDF instance [0..data_segment_count-1]
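
One possible use of `data_segment_number` and `data_segment_count` is to split a list of external work items across UDF instances so that each item is handled exactly once. The round-robin scheme and the `SegmentWork` class below are illustrative assumptions, not something this API prescribes; the sketch only relies on the two keys being parseable as integers with the number in the range [0..data_segment_count-1].

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SegmentWork {
    /** Returns the item indices in [0, itemCount) assigned round-robin to this data segment. */
    static List<Integer> itemsForSegment(Map<String, String> requestInfo, int itemCount) {
        int count = Integer.parseInt(requestInfo.get("data_segment_count"));
        int number = Integer.parseInt(requestInfo.get("data_segment_number"));
        if (count < 1 || number < 0 || number >= count) {
            throw new IllegalArgumentException(
                    "segment " + number + " is out of range for count " + count);
        }
        List<Integer> assigned = new ArrayList<>();
        for (int i = number; i < itemCount; i += count) {
            assigned.add(i); // every count-th item, starting at this segment's number
        }
        return assigned;
    }

    public static void main(String[] args) {
        // Sample values are invented for this sketch.
        Map<String, String> info = new HashMap<>();
        info.put("data_segment_count", "4");
        info.put("data_segment_number", "1");
        System.out.println(itemsForSegment(info, 10)); // prints [1, 5, 9]
    }
}
```

For a non-distributed UDF, where the count is 1 and the number is 0, the same method assigns every item to the single instance.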

Kinetica API Connection Parameters¶

These can be used to connect back to Kinetica using the regular API endpoint calls. Use with caution in distributed UDFs, particularly in large clusters, to avoid overwhelming the head node. Also note, multi-head ingest may not work from a UDF in some cases without overriding the worker URLs to use internal IP addresses.

Map Key	Description
`head_url`	The URL to connect to
`username`	Randomly generated temporary username used to execute the UDF
`password`	Randomly generated temporary password used to execute the UDF

Important

Although username and password are randomly generated temporary credentials, for security reasons they should still not be printed or output to logs, etc., as they are live credentials for a period of time.
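
A minimal sketch of one way to follow this advice when dumping the request info map for debugging: copy the map and mask the two credential keys before printing. The `RequestInfoRedactor` class, the `<redacted>` placeholder, and the sample URL are hypothetical; only the `username`, `password`, and `head_url` key names come from the table above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class RequestInfoRedactor {
    private static final String[] SECRET_KEYS = {"username", "password"};

    /** Returns a sorted copy of the map with credential values masked; the input is not modified. */
    static Map<String, String> redacted(Map<String, String> requestInfo) {
        Map<String, String> copy = new TreeMap<>(requestInfo); // sorted for stable output
        for (String key : SECRET_KEYS) {
            if (copy.containsKey(key)) {
                copy.put(key, "<redacted>");
            }
        }
        return copy;
    }

    public static void main(String[] args) {
        // Sample values are invented for this sketch.
        Map<String, String> info = new HashMap<>();
        info.put("head_url", "http://example.test:9191");
        info.put("username", "tmp_user");
        info.put("password", "tmp_pass");
        System.out.println(redacted(info));
        // prints {head_url=http://example.test:9191, password=<redacted>, username=<redacted>}
    }
}
```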

Writing Output Data¶

To output data to a table, the size of the table must be set in order to allocate enough space in all of the columns to hold the correct number of values. To do this, call:

table.setSize()

Table column values can then be assigned by index and/or type to each OutputColumn of each OutputTable in the OutputDataSet. See Setting Output Values for more information.

The following calls are available to assist with writing data to Kinetica:

Call	Type	Description
`procData.getOutputData()`	Object	Returns an `OutputDataSet` object for writing output table data that will be written to the database
`procData.getResults()`	Map of strings to strings	Returns a map that can be populated with string-valued results to be returned from the UDF
`procData.getBinResults()`	Map of strings to bytes	Returns a map that can be populated with binary-valued results to be returned from the UDF

Setting Output Values¶

The OutputDataSet object returned from procData.getOutputData() contains OutputTable objects, each of which in turn contains OutputColumn objects holding the actual data. Tables and columns are accessed the same way as in InputDataSet:

procData.getOutputData().getTable("customer").getColumn("name")
procData.getOutputData().getTable(5).getColumn(1)

There are two approaches to loading values into Kinetica:

appending a series of values to each table column
setting table column values by index

The following methods can be called on each OutputColumn of the OutputTable:

Appending (required for variable-length data)
- appendVarBytes(value)
- appendChar(value)
- appendCalendar(value)
- appendBigDecimal(value)
- appendDouble(value)
- appendFloat(value)
- appendInt(value)
- appendByte(value)
- appendShort(value)
- appendInet4Address(value)
- appendLong(value)
- appendVarString(value)
Setting by Index (non-variable length data only)
- setInt(n, value)
- setBigDecimal(n, value)
- setByte(n, value)
- setCalendar(n, value)
- setChar(n, value)
- setDouble(n, value)
- setFloat(n, value)
- setInet4Address(n, value)
- setLong(n, value)
- setShort(n, value)
- setNull(n, value)

Status Reporting¶

The procData.setStatus() method can be called with a string value to help convey status information during UDF execution, e.g., procData.setStatus("25% complete"). The showProcStatus() endpoint will return any status messages set for each data segment -- one data segment per processing node if in distributed mode, or one data segment total in non-distributed mode. These messages are subject to the following scenarios:

If the user-provided UDF code is executing and has set a status message, showProcStatus() will return the last message that was set
If the user-provided UDF code finishes executing successfully, the status message is cleared
If the UDF is killed while executing user-provided UDF code, showProcStatus() will return the last message that was set
If the user-provided UDF code errors out, showProcStatus() will return the error message and the last status message that was set in parentheses

Note

After the user-provided code finishes successfully and the status message is cleared, the UDF may not show as "complete" yet, since any data written by the UDF (in distributed mode) still has to be inserted into the database; the status set by the UDF code isn't relevant to this process.
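
A small sketch of how a progress string such as "25% complete" might be built before it is handed to the status call described above. The `ProgressStatus` class and its parameters are hypothetical; the sketch uses only standard Java and does not call the UDF API itself.

```java
class ProgressStatus {
    /**
     * Formats a progress message from a processed count and a total.
     * Assumes counts well below Long.MAX_VALUE / 100, so the multiplication cannot overflow.
     */
    static String message(long processed, long total) {
        if (total <= 0) {
            return "0% complete"; // nothing to measure against
        }
        long clamped = Math.max(0, Math.min(processed, total));
        long percent = clamped * 100 / total; // integer division rounds down
        return percent + "% complete";
    }

    public static void main(String[] args) {
        System.out.println(message(250, 1000)); // prints 25% complete
    }
}
```

Setting the status only when the percentage changes, rather than on every record, is one way to keep the reporting overhead small.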

Complete¶

The UDF must finish with a call to procData.complete(). This writes out some final control information to indicate that the UDF completed successfully.

Note

If this call is not made, the database will assume that the UDF didn’t finish and will return an error to the caller.

Logging¶

Any output from the UDF is written to two places:

The system log file on the head node
The /opt/gpudb/core/logs/gpudb_proc.log file local to the processing node container

Logging output location for UDFs is currently not configurable.
