# Python UDF API

<a id="udf-python-writing-label" />

This page covers what is needed to begin writing UDFs with the Python UDF API.
For more information on running Python UDFs, see
[Running Python UDFs](/content/udf/python/running).

## Dependencies

To begin writing Python UDFs, access to the Kinetica Python UDF API is
required.
In default Kinetica installations, the Python UDF API is located in the
`/opt/gpudb/udf/api/python` directory.

The API can be downloaded from the
[Python UDF API repo on GitHub](https://github.com/kineticadb/kinetica-udf-api-python.git).
After downloading, see the <Badge color="gray">README.md</Badge> in the UDF API directory created
for further setup instructions.

Instructions for installing & configuring the API can be found under
[UDF API Installation](/content/udf/python/writing#udf-api-install) below.

The Python UDF API consists of one file, <Badge color="blue-destructive">kinetica\_proc.py</Badge>.
This needs to be included in the UDF source code and added to the
`PYTHONPATH`.  There are no external dependencies beyond the
*Python standard library*.

To take advantage of GPU processing within a UDF, the *CUDA Toolkit* must be
downloaded & installed from the
[Nvidia Developer Zone](http://docs.nvidia.com/cuda/index.html).

Also, any Python packages the UDF may need to use should be installed on the
development platform and be made available within the Kinetica cluster's UDF
environment using [function environments](/content/udf/python/writing#udf-python-func-env).

<a id="udf-api-install" />

### UDF API Installation

The Kinetica Python UDF API is accessible via *GitHub*.

1. In the desired directory, run the following to download the Kinetica Python
   UDF API repository:

   ```
   git clone -b release/v7.2 --single-branch https://github.com/kineticadb/kinetica-udf-api-python.git
   ```
2. Add the Python UDF API directory to the `PYTHONPATH`:

   ```
   export PYTHONPATH=$PYTHONPATH:$(cd kinetica-udf-api-python;pwd)
   ```

<a id="udf-python-func-env" />

### Python UDF Function Environments

Python UDF *function environments* provide a means for deploying Python packages
needed by UDFs to the Kinetica cluster.

Each UDF executes in a default *function environment* running *Python 3.10* with
a set of Python packages [pre-installed](/content/udf/python/writing#udf-python-libs).  Any
user-created *function environment* will include that default set as well as
any packages added by the user.

*Function environments* can be managed in SQL, the native APIs, and in
[Workbench](/content/admin/workbench/ui/explorer/data#wb-explorer-data-udf-env).

#### Installing Function Environment Libraries

To install *Python* packages within the *Kinetica* UDF environment, create a
*function environment* and install the packages within it.

<CodeGroup>
  ```sql SQL theme={null}
  CREATE FUNCTION ENVIRONMENT udfe_st;

  ALTER FUNCTION ENVIRONMENT udfe_st
  INSTALL PACKAGE 'sentence-transformers'
  ```

  ```python Python theme={null}
  PROC_ENV_NAME = 'udfe_st'

  kinetica.create_environment(environment_name = PROC_ENV_NAME)

  kinetica.alter_environment(
      environment_name = PROC_ENV_NAME,
      action = "install_package",
      value = "sentence-transformers"
  )
  ```
</CodeGroup>

Afterwards, when creating the UDF, assign the environment to the UDF.

<CodeGroup>
  ```sql SQL theme={null}
  CREATE FUNCTION udf_st
  MODE = 'distributed'
  RUN_COMMAND = 'python'
  RUN_COMMAND_ARGS = 'udf/udf_st.py'
  FILE PATHS 'kifs://udf/udf_st.py'
  WITH OPTIONS (SET_ENVIRONMENT = 'udfe_st')
  ```

  ```python Python theme={null}
  response = kinetica.create_proc(
      proc_name = PROC_NAME,
      execution_mode = "distributed",
      files = file_map,
      command = "python",
      args = [PROC_FILE_NAME],
      options = {"set_environment": PROC_ENV_NAME}
  )
  ```
</CodeGroup>

#### Uninstalling Function Environment Libraries

To uninstall *Python* packages from a given *function environment*:

<CodeGroup>
  ```sql SQL theme={null}
  ALTER FUNCTION ENVIRONMENT udfe_st
  UNINSTALL PACKAGE 'sentence-transformers'
  ```

  ```python Python theme={null}
  PROC_ENV_NAME = 'udfe_st'

  kinetica.alter_environment(
      environment_name = PROC_ENV_NAME,
      action = "uninstall_package",
      value = "sentence-transformers"
  )
  ```
</CodeGroup>

To delete a *function environment*:

<CodeGroup>
  ```sql SQL theme={null}
  DROP FUNCTION ENVIRONMENT udfe_st
  ```

  ```python Python theme={null}
  kinetica.drop_environment(environment_name = PROC_ENV_NAME, options = {"no_error_if_not_exists": "true"})
  ```
</CodeGroup>

<Info>
  Be sure to delete all UDFs that use a *function environment* before
  deleting it.
</Info>

<a id="udf-python-libs" />

### Pre-installed Libraries on the Cluster

Each cluster node comes pre-installed with the latest Kinetica APIs:

* `gpudb` - Kinetica Python API
* `kinetica-proc` - Kinetica Python UDF API

The following 3rd-party libraries are pre-installed on Kinetica cluster nodes,
for use by UDFs:

| Library             | Version   |
| ------------------- | --------- |
| bcrypt              | 3.2.0     |
| certifi             | 2025.4.26 |
| cffi                | 1.17.1    |
| charset\_normalizer | 3.4.3     |
| Cython              | 3.1.2     |
| future              | 1.0.0     |
| idna                | 3.10      |
| meson               | 1.8.2     |
| numpy               | 2.2.6     |
| pandas              | 2.3.0     |
| pycparser           | 2.22      |
| python-dateutil     | 2.9.0     |
| python-snappy       | 0.6.1     |
| pytz                | 2025.2    |
| pyzmq               | 26.4.0    |
| requests            | 2.32.5    |
| scikit\_build\_core | 0.11.4    |
| six                 | 1.17.0    |
| urllib3             | 2.5.0     |

## Initializing

A UDF must first import the `ProcData` class and get a handle to it.  This
parses the primary control file and sets up all the necessary structures,
returning a `ProcData` instance, which is used to access data.  All
configuration information is cached, so repeated calls to get a handle to
`ProcData` will not reload any configuration files.

<Note>
  When you get a handle to `ProcData`, the handle is actually to
  the data given to that instance (OS process) of the UDF on the Kinetica host;
  therefore, there will be a `ProcData` handle for every instance of the UDF
  on your Kinetica host.
</Note>
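
A minimal sketch of obtaining the handle is shown here. It assumes that
`kinetica_proc.py` is on the `PYTHONPATH` and that the handle is created by
calling `ProcData()` with no arguments. The helper names `udf_api_available`
and `get_proc_data` are hypothetical; the import is deferred so that the module
can also be loaded on a development machine where the UDF API is not installed.

```python
import importlib.util


def udf_api_available() -> bool:
    """Return True if the kinetica_proc module can be found on the import path."""
    return importlib.util.find_spec("kinetica_proc") is not None


def get_proc_data():
    """Return a ProcData handle; meant to be called from inside a running UDF."""
    # Assumption: the handle is obtained by instantiating ProcData with no
    # arguments.  The import is deferred until the function is called.
    from kinetica_proc import ProcData
    return ProcData()
```

Inside a UDF, the result of `get_proc_data()` would play the role of the
`proc_data` object used throughout the rest of this page.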

## Column Types

Unlike the other Kinetica APIs, the Python UDF API does not process data using
records or schemas, operating in terms of columns of data instead.  The raw
column values returned closely map to the data types used in the tables being
accessed:

### Numeric

| Kinetica Type | Python UDF Type   |
| ------------- | ----------------- |
| `boolean`     | `bool`            |
| `int8`        | `int`             |
| `int16`       | `int`             |
| `int`         | `int`             |
| `long`        | `int`             |
| `float`       | `float`           |
| `double`      | `float`           |
| `decimal`     | `decimal.Decimal` |

### String

| Kinetica Type | Python UDF Type |
| ------------- | --------------- |
| `string`      | `str`           |
| `char[N]`     | `str`           |
| `ipv4`        | `int`           |
| `uuid`        | `uuid.UUID`     |
| `wkt`         | `str`           |

### Date/Time

| Kinetica Type | Python UDF Type     |
| ------------- | ------------------- |
| `date`        | `datetime.date`     |
| `datetime`    | `datetime.datetime` |
| `time`        | `datetime.time`     |
| `timestamp`   | `int`               |

### Binary

| Kinetica Type | Python UDF Type |
| ------------- | --------------- |
| `bytes`       | `bytes`         |
| `wkb`         | `str`           |

Column data values can be accessed through array indices:

```
column[i]
```

For example, to retrieve the value for the 10th record:

```
column[9]
```
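
Some of the types above arrive as plain integers, so a UDF may want to convert
them before use. The sketch below assumes that `timestamp` values are
milliseconds since the Unix epoch and that `ipv4` values are integers with the
most significant octet first; neither detail is stated on this page, so verify
both against real data. The helper names are hypothetical, and plain lists
stand in for columns.

```python
import ipaddress
from datetime import datetime, timezone


def ipv4_to_text(raw: int) -> str:
    """Render an integer IPv4 value in dotted-quad form."""
    # Assumption: the integer holds the address with the first octet in the
    # most significant byte.
    return str(ipaddress.IPv4Address(raw))


def timestamp_to_datetime(raw_ms: int) -> datetime:
    """Convert an integer timestamp to a timezone-aware UTC datetime."""
    # Assumption: the integer is milliseconds since the Unix epoch.
    return datetime.fromtimestamp(raw_ms / 1000, tz=timezone.utc)


# Plain lists stand in for column objects; values are read by index.
ip_column = [3232235777, 167772161]
ts_column = [0, 86_400_000]

print(ipv4_to_text(ip_column[0]))
print(timestamp_to_datetime(ts_column[1]).isoformat())
```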

## Reading Input Data

The request information for the UDF, the parameters passed into the UDF, and
the input data can be accessed using the following calls:

| Call                     | Type                      | Description                                                                                                                                                                                                                  |
| ------------------------ | ------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `proc_data.input_data`   | Object                    | Returns an `InputDataSet` object for accessing input table data that was passed into the UDF                                                                                                                                 |
| `proc_data.request_info` | Map of strings to strings | Returns a map of basic information about the [execute\_proc()](/content/api/python/source/gpudb#gpudb.GPUdb.execute_proc) request, map values being accessed using: <br /> <br /> `proc_data.request_info[<map_key>]` <br /> |
| `proc_data.params`       | Map of strings to strings | Returns a map of string-valued parameters that were passed into the UDF                                                                                                                                                      |
| `proc_data.bin_params`   | Map of strings to bytes   | Returns a map of binary-valued parameters that were passed into the UDF                                                                                                                                                      |
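
Because `proc_data.params` holds string values only, numeric settings need to
be parsed by the UDF. The sketch below shows one way to do that with a default
value. The helper `get_int_param` and the keys `batch_size` and `max_rows` are
hypothetical, and a plain dictionary stands in for `proc_data.params`.

```python
def get_int_param(params, key, default):
    """Read an integer setting from a string-to-string map, with a default."""
    if key not in params:
        return default
    raw = params[key]
    if raw.strip() == "":
        return default
    return int(raw)


# A plain dict stands in for proc_data.params.
params = {"batch_size": "500"}

batch_size = get_int_param(params, "batch_size", 100)  # parsed from the map
max_rows = get_int_param(params, "max_rows", 1000)     # not passed, default used
print(batch_size, max_rows)
```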

### Accessing Input Values

The `InputDataSet` object returned from `proc_data.input_data` contains
`InputTable` objects, each of which contains `InputColumn` objects holding the
actual data. Tables and columns can be accessed by index or by name. For
example, given a `customer` table at `InputDataSet` index `5` and a
`name` column at that `InputTable`'s index `1`, either of the following
calls will retrieve the column values associated with `customer.name`:

```
proc_data.input_data["customer"]["name"]
proc_data.input_data[5][1]
```
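
As an illustration of reading values once a column has been located, the
following sketch sums a numeric column by index. It assumes that input tables
expose their record count through a `size` attribute, which this page only
shows for output tables. `column_total`, `StandInTable`, and the `balance`
column are hypothetical names used so the example can run without a cluster.

```python
def column_total(table, column_name):
    """Sum a numeric column by walking its values by index."""
    column = table[column_name]
    total = 0
    # Assumption: the table reports its record count as `size`.
    for i in range(table.size):
        total += column[i]
    return total


class StandInTable(dict):
    """Hypothetical stand-in for an input table: column name -> list of values."""

    @property
    def size(self):
        # Record count is the length of the first column (0 if no columns).
        return len(next(iter(self.values()), []))


customer = StandInTable({"name": ["Ann", "Bo"], "balance": [10.5, 4.5]})
print(column_total(customer, "balance"))
```

Inside a UDF, the first argument would be a table taken from
`proc_data.input_data`, for example `proc_data.input_data["customer"]`.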

### Request Info Keys

The request information map returned by `proc_data.request_info` is made
available to each running *UDF*.  Its keys provide a variety of details about
the executing *UDF*.

#### General Information

| Map Key         | Description                                                                                                                                                                                                                                                                                                             |
| --------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `proc_name`     | The name of the *UDF* being executed.                                                                                                                                                                                                                                                                                   |
| `run_id`        | The run ID of the *UDF* being executed. This is also displayed in *GAdmin* on the **UDF** page in the **Status** section as a link you can click on to get more detailed information; note that although this is an integer, it should not be relied upon as such, as its format may change in the future.              |
| `rank_number`   | The processing node container number on which the current UDF instance is executing.  For distributed UDFs, *\[1..n]*; for non-distributed UDFs, *0*.                                                                                                                                                                   |
| `tom_number`    | The processing node number within the processing node container on which the current UDF instance is executing. For distributed UDFs, *\[0..n-1]*, where *n* is the number of processing nodes per processing node container. For non-distributed UDFs it is not provided, since these do not run on a processing node. |
| `<option_name>` | Any options passed in the `options` map in the [/execute/proc](/content/api/rest/execute_proc_rest) request will also be in the request info map.                                                                                                                                                                       |
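
Since `tom_number` is not provided for non-distributed UDFs, code that reads it
should check for the key first. The sketch below builds a short label for the
running instance; `describe_instance` is a hypothetical helper, and plain
dictionaries with made-up values stand in for `proc_data.request_info`.

```python
def describe_instance(request_info):
    """Build a short label for the running UDF instance."""
    name = request_info["proc_name"]
    rank = request_info["rank_number"]
    # tom_number is absent for non-distributed UDFs, so check before reading.
    if "tom_number" not in request_info:
        return f"{name} (rank {rank}, non-distributed)"
    return f"{name} (rank {rank}, TOM {request_info['tom_number']})"


# Plain dicts stand in for proc_data.request_info; all values are strings.
distributed = {"proc_name": "udf_st", "run_id": "42",
               "rank_number": "1", "tom_number": "0"}
non_distributed = {"proc_name": "udf_st", "run_id": "43", "rank_number": "0"}

print(describe_instance(distributed))
print(describe_instance(non_distributed))
```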

#### CUDA Information

When executing UDFs that utilize CUDA, additional request information is
returned.

| Map Key        | Description                                  |
| -------------- | -------------------------------------------- |
| `cuda_devices` | The number of CUDA devices currently in use. |
| `cuda_free`    | The amount of CUDA memory available.         |

#### Data Segment Information

Data is passed into *UDFs* in *segments*.  Each *segment* consists of the
entirety of the data on a single *TOM* and is processed by the *UDF* instance
executing on that *TOM*.  Thus, there is a 1-to-1 mapping of *data segment* and
executing *UDF* instance, though this relationship may change in the future.

Running the same *UDF* multiple times should result in the same set of
*segments*, assuming the same environment and system state across runs.

| Map Key               | Description                                                                                                                                                                                                                                                                                                                                                       |
| --------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `data_segment_id`     | A unique identifier for the *segment* of the currently executing *UDF* instance. All of the *data segment IDs* for a given *UDF* execution are displayed in *GAdmin* when you click on the **run ID**; note that although it is possible to identify *rank* and *TOM* numbers from this ID, it should not be relied upon, as its format may change in the future. |
| `data_segment_count`  | The total cluster-wide count of *data segments* for distributed *UDFs*; for non-distributed *UDFs*, *1*.                                                                                                                                                                                                                                                          |
| `data_segment_number` | The number of the current *data segment* or executing *UDF* instance *\[0..data\_segment\_count-1]*.                                                                                                                                                                                                                                                              |
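
One possible use of these keys is dividing work that is not tied to table data,
such as a list of external files, among the running instances. The sketch below
gives each instance every *n*-th item, where *n* is `data_segment_count`. The
helper `my_share` and the file names are hypothetical, and a plain dictionary
stands in for `proc_data.request_info`.

```python
def my_share(items, request_info):
    """Return the subset of items that this UDF instance should handle."""
    # Request info values are strings, so convert before doing arithmetic.
    number = int(request_info["data_segment_number"])
    count = int(request_info["data_segment_count"])
    # Instance k takes items k, k + count, k + 2 * count, and so on.
    return items[number::count]


files = ["a.csv", "b.csv", "c.csv", "d.csv", "e.csv"]

# Simulate the three instances of a UDF running over three data segments.
for segment in range(3):
    info = {"data_segment_number": str(segment), "data_segment_count": "3"}
    print(segment, my_share(files, info))
```

Each item is assigned to exactly one instance, and a non-distributed UDF
(segment `0` of `1`) receives the whole list.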

#### Kinetica API Connection Parameters

These can be used to connect back to *Kinetica* using the regular API endpoint
calls.  Use with caution in distributed *UDFs*, particularly in large clusters,
to avoid overwhelming the head node.  Also note, multi-head ingest may not work
from a *UDF* in some cases without overriding the worker URLs to use internal IP
addresses.

| Map Key    | Description                                                      |
| ---------- | ---------------------------------------------------------------- |
| `head_url` | The URL to connect to.                                           |
| `username` | Randomly generated temporary username used to execute the *UDF*. |
| `password` | Randomly generated temporary password used to execute the *UDF*. |

<Note>
  Since `username` and `password` are randomly-generated
  temporary credentials, for security reasons, they should not be
  printed or output to logs.
</Note>
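
One way to follow this advice when dumping the request information for
debugging is to mask the credentials first. The sketch below assumes the map
behaves like a regular Python dictionary; `redacted` is a hypothetical helper,
and the URL and credentials are made-up values.

```python
SENSITIVE_KEYS = {"username", "password"}


def redacted(request_info):
    """Return a copy of the request info that is safe to print or log."""
    return {
        key: "***" if key in SENSITIVE_KEYS else request_info[key]
        for key in request_info
    }


# A plain dict stands in for proc_data.request_info.
info = {
    "proc_name": "udf_st",
    "head_url": "http://127.0.0.1:9191",
    "username": "tmp_user",
    "password": "tmp_pass",
}

print(redacted(info))
```

The original map is left unchanged, so the real values remain available for
connecting back to the database.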

## Writing Output Data

To output data to a table, the size of the table must be set in order to
allocate enough space in all of the columns to hold the correct number of
values.  To do this, call:

```
table.size = <total number of output records>
```

Table column values can then be assigned to each `OutputColumn` of each
`OutputTable` in the `OutputDataSet`.

The following calls are available to assist with writing data to *Kinetica*:

| Call                    | Type                      | Description                                                                                            |
| ----------------------- | ------------------------- | ------------------------------------------------------------------------------------------------------ |
| `proc_data.output_data` | Object                    | Returns an `OutputDataSet` object for writing output *table* data that will be written to the database |
| `proc_data.results`     | Map of strings to strings | Returns a map that can be populated with string-valued results to be returned from the UDF             |
| `proc_data.bin_results` | Map of strings to bytes   | Returns a map that can be populated with binary-valued results to be returned from the UDF             |
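
The two result maps differ only in their value types, so non-string values need
converting before they are stored. The sketch below stores a count as a string
and a small summary as UTF-8 encoded JSON bytes. `report_results` and the keys
`row_count` and `summary_json` are hypothetical, and a `SimpleNamespace` with
two dictionaries stands in for `proc_data`.

```python
import json
from types import SimpleNamespace


def report_results(proc_data, row_count, summary):
    """Populate the string and binary result maps of a UDF."""
    # Values in `results` must be strings.
    proc_data.results["row_count"] = str(row_count)
    # Values in `bin_results` must be bytes; JSON is one possible encoding.
    proc_data.bin_results["summary_json"] = json.dumps(summary).encode("utf-8")


# Stand-in for proc_data, for trying the helper outside of a cluster.
stand_in = SimpleNamespace(results={}, bin_results={})
report_results(stand_in, 3, {"min": 1, "max": 9})

print(stand_in.results)
print(stand_in.bin_results)
```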

### Setting Output Values

The `OutputDataSet` object returned from `proc_data.output_data` contains
`OutputTable` objects, each of which contains `OutputColumn` objects holding
the actual data. Tables and columns are accessed the same way as in
`InputDataSet`, by name, index, or a combination of the two:

```
proc_data.output_data["customer"]["name"]
proc_data.output_data[5][1]
```

To assign fixed-width output values:

```
proc_data.output_data["customer"][1][2] = 12.34
```

To assign variable-width output values:

```
proc_data.output_data[5]["name"][4].append("Joe")
```
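
Putting the sizing and assignment steps together for a fixed-width column, the
sketch below copies a numeric column to an output table while doubling each
value. It assumes that input tables expose a `size` attribute and covers
fixed-width assignment only. `write_doubled`, `StandInTable`, and the `price`
column are hypothetical; the stand-in class only mimics the allocate-then-assign
behavior described above so the example can run locally.

```python
def write_doubled(in_table, out_table, column_name):
    """Copy a numeric column to the output table, doubling each value."""
    # Allocate space first; values are assigned by index afterwards.
    out_table.size = in_table.size
    in_column = in_table[column_name]
    out_column = out_table[column_name]
    for i in range(in_table.size):
        out_column[i] = in_column[i] * 2


class StandInTable:
    """Hypothetical stand-in for input and output tables."""

    def __init__(self, columns):
        self._columns = {name: list(values) for name, values in columns.items()}
        self._size = len(next(iter(self._columns.values()), []))

    @property
    def size(self):
        return self._size

    @size.setter
    def size(self, value):
        # Mimic allocation: every column gets `value` empty slots.
        self._size = value
        for name in self._columns:
            self._columns[name] = [None] * value

    def __getitem__(self, name):
        return self._columns[name]


source = StandInTable({"price": [2.5, 3.5]})
target = StandInTable({"price": []})
write_doubled(source, target, "price")
print(target["price"])
```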

<a id="status-reporting" />

## Status Reporting

The `proc_data.status` property can be set to a string value to help
convey status information during UDF execution, e.g.,
`proc_data.status="25% complete"`. The [show\_proc\_status()](/content/api/python/source/gpudb#gpudb.GPUdb.show_proc_status) function will
return any status messages set for each data segment -- one data segment per
processing node if in distributed mode, or one data segment total in
non-distributed mode. These messages are subject to the following scenarios:

* If the user-provided UDF code is executing and has set a status message,
  [show\_proc\_status()](/content/api/python/source/gpudb#gpudb.GPUdb.show_proc_status) will return the last message that was set.

* If the user-provided UDF code finishes executing successfully, the status
  message is cleared.

  <Info>
    The UDF may not show as "complete" yet since any data written by the
    UDF (in distributed mode) still has to be inserted into the
    database, but the status set by the UDF code isn't relevant to this
    process.
  </Info>

* If the UDF is killed while executing user-provided UDF code,
  [show\_proc\_status()](/content/api/python/source/gpudb#gpudb.GPUdb.show_proc_status) will return the last message that was set.

* If the user-provided UDF code errors out, [show\_proc\_status()](/content/api/python/source/gpudb#gpudb.GPUdb.show_proc_status) will return
  the error message and the last status message that was set in parentheses.
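
As a sketch of how a long-running loop might report progress, the following
helper updates the status after every batch of items and once more at the end.
`process_with_status` and its `every` parameter are hypothetical, and a
`SimpleNamespace` stands in for `proc_data`.

```python
from types import SimpleNamespace


def process_with_status(proc_data, items, handle, every=100):
    """Apply `handle` to each item, updating the status message periodically."""
    total = len(items)
    for done, item in enumerate(items, start=1):
        handle(item)
        # Report after each batch and again once the last item is processed.
        if done % every == 0 or done == total:
            proc_data.status = f"{done * 100 // total}% complete"


# Stand-in for proc_data; only the `status` attribute is needed here.
stand_in = SimpleNamespace(status="")
seen = []
process_with_status(stand_in, list(range(250)), seen.append, every=100)
print(stand_in.status)
```

Updating the status only once per batch keeps the reporting overhead small
compared to setting it for every record.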

## Complete

The UDF must finish with a call to `proc_data.complete()`.  This writes out
some final control information to indicate that the UDF completed successfully.

<Info>
  If this call is not made, the database will assume that the UDF
  didn’t finish and will return an error to the caller.
</Info>
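
The overall shape of a UDF can be sketched as a body followed by the completion
call, so that the call is skipped when the body raises an exception. This
assumes `complete` is a method taking no arguments. `run_udf` is a hypothetical
wrapper, and `SimpleNamespace` objects stand in for the handle.

```python
from types import SimpleNamespace


def run_udf(proc_data, body):
    """Run the UDF body, then signal completion only if it succeeded."""
    body(proc_data)
    # Reached only when body() did not raise; if it raised, completion is
    # never signaled and the database reports an error to the caller.
    proc_data.complete()


# Stand-ins that record whether complete() was called.
ok_calls = []
ok = SimpleNamespace(complete=lambda: ok_calls.append("complete"))
run_udf(ok, lambda pd: None)

failed_calls = []
broken = SimpleNamespace(complete=lambda: failed_calls.append("complete"))
try:
    run_udf(broken, lambda pd: 1 / 0)
except ZeroDivisionError:
    pass

print(ok_calls, failed_calls)
```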

## Logging

Any output from the UDF is written to two places:

* The [system log file](/content/install/kagent_install#logging-ref) on the head node, which can
  be viewed in [GAdmin](/content/admin/gadmin) on the
  [logging page](/content/admin/gadmin/cluster#cluster-logging).
* The <Badge color="gray">/opt/gpudb/core/logs/gpudb.log</Badge> file local to the processing
  node container.
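
Assuming that "output" here means what the UDF process writes to standard
output, the standard `logging` module can be pointed at `sys.stdout` to get
consistently formatted messages. The sketch below is one way to set that up;
`make_logger` and the logger name `udf` are hypothetical.

```python
import logging
import sys


def make_logger(name="udf", stream=None):
    """Return a logger that writes formatted messages to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Add the handler only once, even if make_logger is called repeatedly.
    if not logger.handlers:
        target = stream if stream is not None else sys.stdout
        handler = logging.StreamHandler(target)
        handler.setFormatter(
            logging.Formatter("%(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


log = make_logger()
log.info("processed %d records", 3)
```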
