This documentation is for a prior release of Kinetica. For the latest documentation, click
here.
SQL-Managed Python UDF: Semantic Search
Learn how to install dependent packages using Python UDF function environments
The following is a SQL-managed UDF written in the Python UDF API that performs a
semantic search for a given sentence on a baseline set of sentences, ranking the
baseline sentences on their similarity to the given one.
This will demonstrate the UDF being registered as a standard UDF and as a table
function (UDTF). While it is only necessary to use one or the other, both are
demonstrated here.
Which UDF type to use depends on the use case:
UDF - Can be executed once and have results repeatedly queried when needed
UDTF - Must be executed every time results are needed, but can be used within
a query--execution & results in one step
Download & Run
This example will contain the following files (click to download):
udf_st.kisql
- UDF management script, in
SQL, which creates the input & output tables, uploads & loads the baseline
data, and creates and executes the UDF as both a UDF & UDTF
udf_st.py
- the UDF, written using
the Python UDF API, which contains the semantic search example
sentences.txt
- the data file containing
the set of sentences that will be used as the baseline for the semantic search
After copying these scripts to a local directory, the example can be run as
follows, using the KiSQL SQL client, specifying the
database URL, username, & password (when prompted):
The script should output the sentences with the top 5 closest match scores to
the given sentence for both the UDF & UDTF executions.
Semantic Search Results
1
2
3
4
5
6
7
8
9
+------------+---------------------------------------------------------------------------------------------------------------------------------+
| score | sentence |
+------------+---------------------------------------------------------------------------------------------------------------------------------+
| 0.470727 | I am always drawn back to places where I have lived, the houses and their neighborhoods. |
| 0.336594 | In the late summer of that year we lived in a house in a village that looked across the river and the plain to the mountains. |
| 0.28631 | I write this sitting in the kitchen sink. |
| 0.285421 | In the beginning, sometimes I left messages in the street. |
| 0.25593 | If you want to find Cherry-Tree Lane all you have to do is ask the Policeman at the cross-roads. |
+------------+---------------------------------------------------------------------------------------------------------------------------------+
UDF Detail
The example UDF uses a single table, udf_st_in, as input and a corresponding
table, udf_st_out, for output.
The input table will contain a string column and be populated with a set of
sentences. The output table will contain a float column for each sentence's
search score. Both tables will also contain an int column that is the
sentence identifier, allowing the input sentences to be matched up with the
output scores after the UDF has run.
The UDF will compare a given sentence to each of the sentences in the input
table baseline data set and insert the distance scores into the output table.
The management script will then output the sentences & scores from both the UDF
& UDTF executions.
SQL Management Script
The SQL management script runs a sequence of commands to setup and execute the
UDF:
Creates input & output tables
Creates a KiFS directory, uploads data file & UDF to
it, and loads the data into the input table
Creates a Python UDF
function environment for the
UDF to run in and installs the sentence-transformers package into it
Registers both the UDF and the table function (UDTF) variant of it, making
both available for subsequent execution
Executes the UDF and queries for the results, and then executes the UDTF,
displaying its results
Create Tables
Create UDF input table, into which set of baseline sentences will be loaded:
output table to write semantic search score results to
sentence to compare to the baseline set of sentences
Run the UDF:
Run UDF
1
2
3
4
5
6
EXECUTEFUNCTIONudf_st(INPUT_TABLE_NAMES=>INPUT_TABLE(udf_st_in),OUTPUT_TABLE_NAMES=>OUTPUT_TABLES('udf_st_out'),PARAMS=>KV_PAIRS(sentence='I live in a neighborhood in a small town.'))
Display the sentences and resulting semantic search scores by joining the input
and output tables together:
Run the UDTF, joining its results to the input table, allowing the display of
sentences & search scores in one SQL statement:
Run UDTF & Display Results
1
2
3
4
5
6
7
8
9
10
11
12
13
SELECTTOP5score,sentenceFROMudf_st_ini,TABLE(UDTF_ST(INPUT_TABLE_NAMES=>INPUT_TABLE(udf_st_in),PARAMS=>KV_PAIRS(sentence='I live in a neighborhood in a small town.')))oWHEREi.id=o.idORDERBYscoreDESC
Python UDF Script
When executed, each Python UDF process performs several operations on the
segment of data available to it:
Initialize the UDF's dependencies and data access handlers
Perform the semantic search of the given sentence vs. the baseline sentences
Write the search scores to the output table and mark the process complete
Initialize UDF
Import the ProcData class from the kinetica_proc module to use the API;
also, load dependent packages and configure the environment for their use prior
to loading:
Calculate the embeddings for the baseline sentences in the input table, using
the SentenceTransformer library's all-MiniLM-L6-v2 model; the sentences
in the sentence column are stored in the input table map of column name to
column values:
Calculate semantic search scores between the given sentence and baseline
sentence embeddings using the cosine similarity function:
Perform Semantic Search Scoring
1
scores=util.cos_sim(sentence_emb,emb)[0]
Write Output Data
Set the size of the output table segment appropriately before writing to it;
there should be one output score for each input sentence, so the output table
should be the same size as the input table:
Size Output Table
1
out_table.size = in_table.size
Note
Resizing the output table during or after record-writing will decrease
performance.
Acquire references to the input column being read from and the output columns
being written to: