Vector Search

Overview

Kinetica's vector search capability is enabled through its use as a vector store database. Once a table has been created with a vector-typed column and a set of embeddings has been loaded into it, a variety of K-nearest neighbor searches can be performed on the data set.

Details on the vector type and its usage are found below. More complete walkthroughs of this functionality are available in Jupyter notebook form.

Vector Type

The vector data type has been added to facilitate managing embeddings and issuing vector search queries. The vector type is effectively an array of float values and can be used as shown in the following examples.

Create Table

A vector column can optionally be configured to normalize the vector data inserted into it, giving each vector a magnitude (L2 norm) of 1. This can improve the performance of some vector operations with minimal overhead.
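
For reference, the normalization itself simply divides each value by the vector's L2 norm (magnitude). A minimal, Kinetica-independent Python sketch of what a NORMALIZE column does to an inserted vector (the vector value is one of the sample embeddings used later in this section):

import math

# One of the sample embeddings inserted later in this section
v = [-0.23764, 0.43119, -0.72154]

# L2 norm (magnitude) of the vector
magnitude = math.sqrt(sum(x * x for x in v))

# Normalized copy: same direction, magnitude of 1
v_normalized = [x / magnitude for x in v]

print(magnitude, v_normalized)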

Vector Column without Normalization

SQL
CREATE OR REPLACE TABLE example.vs
(
    name VARCHAR(16) NOT NULL,
    embedding VECTOR(3) NOT NULL
)
Python (native API)
table_name = 'example.vs'
columns = [
    ['name', 'string', 'char16'],
    ['embedding', 'bytes', 'vector(3)']
]

vs = gpudb.GPUdbTable(columns, table_name, db = kinetica)
Python (from DataFrame)
table_name = 'example.vs'
vs = gpudb.GPUdbTable.from_df(df, kinetica, table_name, clear_table = True)

Vector Column with Normalization

SQL
CREATE OR REPLACE TABLE example.vs_norm
(
    name VARCHAR(16) NOT NULL,
    embedding VECTOR(3, NORMALIZE) NOT NULL
)
Python (native API)
table_name = 'example.vs_norm'
columns = [
    ['name', 'string', 'char16'],
    ['embedding', 'bytes', 'vector(3)', 'normalize']
]

vs = gpudb.GPUdbTable(columns, table_name, db = kinetica)
Python (from DataFrame)
table_name = 'example.vs_norm'

vs = gpudb.GPUdbTable.from_df(
        df, kinetica, table_name, clear_table = True,
        column_types = { 'embedding': [ 'vector(3)', 'normalize' ] }
)
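
The df referenced in the from_df examples above (and in the insert_df example under Insert Data below) is assumed to be a pandas DataFrame whose embedding column holds lists of floats matching the declared vector dimension; a minimal sketch of how such a DataFrame might be built:

import pandas as pd

# Hypothetical construction of the DataFrame used by from_df/insert_df;
# each embedding is a 3-value list matching the vector(3) column
df = pd.DataFrame({
    'name': ['fun', 'play', 'food', 'money', 'work'],
    'embedding': [
        [-0.23764, 0.43119, -0.72154],
        [-0.73571, 0.19937, -0.89408],
        [0.47222, -0.44545, -0.51833],
        [0.59784, -0.057026, 0.97746],
        [0.51359, 0.19695, -0.51944]
    ]
})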

Insert Data

SQL
INSERT INTO example.vs
VALUES
    ('fun','[-0.23764,0.43119,-0.72154]'),
    ('play','[-0.73571,0.19937,-0.89408]'),
    ('food','[0.47222,-0.44545,-0.51833]'),
    ('money','[0.59784,-0.057026,0.97746]'),
    ('work','[0.51359,0.19695,-0.51944]')
Python (as strings)
embs = [
    ['fun', '[-0.23764,0.43119,-0.72154]'],
    ['play', '[-0.73571,0.19937,-0.89408]'],
    ['food', '[0.47222,-0.44545,-0.51833]'],
    ['money', '[0.59784,-0.057026,0.97746]'],
    ['work', '[0.51359,0.19695,-0.51944]']
]

vs = gpudb.GPUdbTable(name = 'example.vs', db = kinetica)

vs.insert_records(embs)
Python (as native types)
embs = [
    ['fun', [-0.23764,0.43119,-0.72154]],
    ['play', [-0.73571,0.19937,-0.89408]],
    ['food', [0.47222,-0.44545,-0.51833]],
    ['money', [0.59784,-0.057026,0.97746]],
    ['work', [0.51359,0.19695,-0.51944]]
]

vs = gpudb.GPUdbTable(name = 'example.vs', db = kinetica)

vs.insert_records(embs)
Python (as DataFrame)
vs = gpudb.GPUdbTable(name = 'example.vs', db = kinetica)

vs.insert_df(df, batch_size = 10000)

Retrieve Data

SQL
SELECT name, embedding
FROM example.vs
Python (w/o native type conversions)
import struct

vs = gpudb.GPUdbTable(name = 'example.vs', db = kinetica)

records = vs.get_records()

dim = 3

print('%-16s %s' % ('Name', 'Embedding'))
for record in records:
    # Vector values are returned as packed bytes; unpack into a list of floats
    print('%-16s %s' % (record[0], list(struct.unpack('%df' % dim, record[1]))))
Python (w/ native type conversions)
vs = gpudb.GPUdbTable(name = 'example.vs', db = kinetica, convert_special_types_on_retrieval = True)

records = vs.get_records(encoding = 'json')

print('%-16s %s' % ('Name', 'Embedding'))
for record in records:
    print('%-16s %s' % (record["name"], record["embedding"]))
Python (as DataFrame)
vs = gpudb.GPUdbTable(name = 'example.vs', db = kinetica)

df = vs.to_df()

print(df)

Vector Indexes

There are two types of indexes available to improve the performance of vector searches: the CAGRA index and the HNSW index, described below.


CAGRA Vector Index

The performance of some vector searches can be improved with the application of a CAGRA index, which must be manually refreshed to account for updates to the data in the corresponding table.

In SQL, this index can be applied during table creation via CREATE TABLE, or added afterwards, as shown below:

SQL
ALTER TABLE example.employee
ADD CAGRA INDEX (profile)
Python
retobj = kinetica.alter_table(
    table_name = "example.employee",
    action = "create_index",
    value = "profile",
    options = {
            "index_type": "cagra"
    }
)

HNSW Vector Index

The performance of some vector searches can be improved with the application of an HNSW index, which is automatically updated as the data in the corresponding table changes.

In SQL, this index can be applied during table creation via CREATE TABLE, or added afterwards, as shown below:

SQL
ALTER TABLE example.employee
ADD HNSW INDEX (profile)
Python
retobj = kinetica.alter_table(
    table_name = "example.employee",
    action = "create_index",
    value = "profile",
    options = {
            "index_type": "hnsw"
    }
)

Vector Functions & Operators

Vector Column Functions

Function         Description
L1_NORM(v)       Calculates the sum of the absolute values of the given vector's values
L2_NORM(v)       Calculates the square root of the sum of squares of the given vector's values
LINF_NORM(v)     Returns the maximum of the given vector's values
LP_NORM(v, p)    Calculates the Lp-space norm of the given vector in the space p
NTH(v, n)        Returns the given vector's value at 0-based index n
SIZE(v)          Returns the given vector's number of values
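
These functions can be applied as column expressions. As a hypothetical usage sketch, assuming the example.vs table and kinetica connection from the examples above (the magnitude, dims, and first_value aliases are illustrative):

vs = gpudb.GPUdbTable(name = 'example.vs', db = kinetica)

# Column functions used as expressions; the aliases are illustrative
records = vs.get_records_by_column(
        ["name",
         "L2_NORM(embedding) as magnitude",
         "SIZE(embedding) as dims",
         "NTH(embedding, 0) as first_value"]
)

print('%-16s %-12s %-6s %s' % ('Name', 'Magnitude', 'Dims', 'First'))
for record in zip(records['name'], records['magnitude'], records['dims'], records['first_value']):
    print('%-16s %-12.5f %-6d %f' % record)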

Vector Search Functions

A number of K-nearest neighbor functions have been implemented to support vector searches. For examples, see Vector Function Examples.

Function                      Description
COSINE_DISTANCE(v1, v2)       1 minus the cosine similarity (equality of angle) of the given vectors
DOT_PRODUCT(v1, v2)           Calculates the sum of products of the given vectors' values
EUCLIDEAN_DISTANCE(v1, v2)    Alias for L2_DISTANCE
L1_DISTANCE(v1, v2)           Calculates the L1-space (taxicab) distance between the given vectors
L2_DISTANCE(v1, v2)           Calculates the L2-space (Euclidean) distance between the given vectors
L2_SQUAREDDISTANCE(v1, v2)    Calculates the sum of squares of distances between the given vectors' values
L2_DISTSQ(v1, v2)             Alias for L2_SQUAREDDISTANCE
LINF_DISTANCE(v1, v2)         Calculates the maximum of the distances between pairs of values in the given vectors
LP_DISTANCE(v1, v2, p)        Calculates the Lp-space distance between the given vectors in the space p
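
Independent of Kinetica, the definitions above amount to the following; a plain-Python sketch using two of the sample embeddings from earlier (Kinetica evaluates these server-side via the listed functions):

import math

# Two of the sample embeddings from the Insert Data section
v1 = [-0.23764, 0.43119, -0.72154]
v2 = [-0.73571, 0.19937, -0.89408]

# DOT_PRODUCT: sum of products of the vectors' values
dot_product = sum(a * b for a, b in zip(v1, v2))

# L2_DISTANCE: Euclidean distance between the vectors
l2_distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

# COSINE_DISTANCE: 1 minus the cosine similarity of the vectors
cosine_distance = 1 - dot_product / (
    math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
)

print(dot_product, l2_distance, cosine_distance)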

Vector Search Operators

These operators can be used as shorthand to apply vector functions to individual vector column values. For examples, see Vector Operator Examples.

Note

These operators are only available in SQL or in the native API via /execute/sql.

Operator     Equivalent Function
v1 <-> v2    L2_DISTANCE(v1, v2)
v1 <=> v2    COSINE_DISTANCE(v1, v2)
v1 <#> v2    DOT_PRODUCT(v1, v2)
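
As a sketch of the two operators not shown in the examples below, and assuming the same example.vs table and kinetica connection as earlier, the cosine-distance and dot-product operators can be used just like <->:

# Assumes the example.vs table and kinetica connection defined earlier
sql = """
    SELECT TOP 5
        name,
        embedding <=> VECTOR('[-0.23764,0.43119,-0.72154]', 3) AS cos_dist,
        embedding <#> VECTOR('[-0.23764,0.43119,-0.72154]', 3) AS dot_prod
    FROM example.vs
    ORDER BY cos_dist
"""

with gpudb.GPUdbSqlIterator(kinetica, sql) as records:
    print('%-16s %-12s %s' % ('Name', 'CosDist', 'DotProd'))
    for record in records:
        print('%-16s %-12f %f' % (record[0], record[1], record[2]))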

Vector Search Examples

Vector searches can be performed using either the named functions or, for select functions, their corresponding shorthand operators.

Vector Operator Examples

SQL Literal
SELECT TOP 5 name, embedding <-> VECTOR('[-0.23764,0.43119,-0.72154]', 3) as distance 
FROM example.vs
ORDER BY distance
SQL Lookup
SELECT TOP 5 name, embedding <-> search_embedding as distance 
FROM
    example.vs,
    (SELECT embedding AS search_embedding FROM example.vs WHERE name = 'fun')
WHERE name <> 'fun'
ORDER BY distance
Python Literal
sql = """
    SELECT TOP 5 name, embedding <-> VECTOR('[-0.23764,0.43119,-0.72154]', 3) as distance 
    FROM example.vs
    ORDER BY distance
"""

with gpudb.GPUdbSqlIterator(kinetica, sql) as records:
    
    print('%-16s %s' % ('Name', 'Distance'))
    for record in records:
        print('%-16s %f' % (record[0], record[1]))
Python Lookup w/ SQL
sql = """
    SELECT TOP 5 name, embedding <-> search_embedding as distance 
    FROM
        example.vs,
        (SELECT embedding AS search_embedding FROM example.vs WHERE name = 'fun')
    WHERE name <> 'fun'
    ORDER BY distance
"""

with gpudb.GPUdbSqlIterator(kinetica, sql) as records:
    
    print('%-16s %s' % ('Name', 'Distance'))
    for record in records:
        print('%-16s %f' % (record[0], record[1]))
Python Literal to DataFrame
sql = """
    SELECT TOP 5 name, embedding <-> VECTOR('[-0.23764,0.43119,-0.72154]', 3) as distance 
    FROM example.vs
    ORDER BY distance
"""

df = kinetica.to_df(sql)

print(df)
Python Lookup w/ SQL to DataFrame
sql = """
    SELECT TOP 5 name, embedding <-> search_embedding as distance 
    FROM
        example.vs,
        (SELECT embedding AS search_embedding FROM example.vs WHERE name = 'fun')
    WHERE name <> 'fun'
    ORDER BY distance
"""

df = kinetica.to_df(sql)

print(df)

Vector Function Examples

SQL Literal
SELECT TOP 5 name, L2_DISTANCE(embedding,'[-0.23764,0.43119,-0.72154]') as distance 
FROM example.vs
ORDER BY distance
SQL Lookup
SELECT TOP 5 name, L2_DISTANCE(embedding, search_embedding) as distance 
FROM
    example.vs,
    (SELECT embedding AS search_embedding FROM example.vs WHERE name = 'fun')
WHERE name <> 'fun'
ORDER BY distance
Python Literal
vs = gpudb.GPUdbTable(name = 'example.vs', db = kinetica)

records = vs.get_records_by_column(
        ["name", "L2_DISTANCE(embedding,'[-0.23764,0.43119,-0.72154]') as distance"],
        limit = 5,
        options = {"sort_by": "distance"}
)

print('%-16s %s' % ('Name', 'Distance'))
for record in zip(records['name'], records['distance']):
    print('%-16s %f' % (record[0], record[1]))
Python Lookup w/ SQL
sql = """
    SELECT TOP 5 name, L2_DISTANCE(embedding, search_embedding) as distance 
    FROM
        example.vs,
        (SELECT embedding AS search_embedding FROM example.vs WHERE name = 'fun')
    WHERE name <> 'fun'
    ORDER BY distance
"""

with gpudb.GPUdbSqlIterator(kinetica, sql) as records:
    
    print('%-16s %s' % ('Name', 'Distance'))
    for record in records:
        print('%-16s %f' % (record[0], record[1]))
Python Literal to DataFrame
sql = """
    SELECT TOP 5 name, L2_DISTANCE(embedding,'[-0.23764,0.43119,-0.72154]') as distance
    FROM example.vs
    ORDER BY distance
"""

df = kinetica.to_df(sql)

print(df)
Python Lookup w/ SQL to DataFrame
sql = """
    SELECT TOP 5 name, L2_DISTANCE(embedding, search_embedding) as distance 
    FROM
        example.vs,
        (SELECT embedding AS search_embedding FROM example.vs WHERE name = 'fun')
    WHERE name <> 'fun'
    ORDER BY distance
"""

df = kinetica.to_df(sql)

print(df)

Embedding Models

Kinetica supports three different embedding models:

  • SQLGPT: the default model, based on the OpenAI model at sqlgpt.io, with a maximum of 8191 tokens and a maximum returned vector size of 1536

  • OpenAI: either of these models can be used:

    • text-embedding-3-small
    • text-embedding-3-large

    See OpenAI documentation for details.

  • Nvidia: a NIM microservice, like embed-qa-4, can be used, with a maximum of 512 tokens and a fixed returned vector size of 1024; an input type of query or passage must be specified for the model to generate the appropriate response text

    See Nvidia Preview for a demo.


CREATE MODEL

Creates a new embedding model reference.

CREATE MODEL Syntax
CREATE [OR REPLACE] [REMOTE] MODEL <model name>
WITH OPTIONS
(
    [CREDENTIAL = '[<credential_schema_name>.]<credential name>',]
    REMOTE_MODEL_NAME = '<remote model name>',
    REMOTE_MODEL_LOCATION = '<remote model location>'
)
Parameter                Description
OR REPLACE               Any existing model with the same name will be dropped before creating this one
REMOTE                   Optional keyword for compatibility
<model name>             Name of the model to create; must adhere to the supported naming criteria
WITH OPTIONS             Indicator that a comma-delimited list of model option/value assignments will follow
CREDENTIAL               Credential object to use to authenticate to the model service
REMOTE_MODEL_NAME        Name of the model hosted by the embedding service; e.g., NV-Embed-QA
REMOTE_MODEL_LOCATION    URL of the model host service

For example, to create a remote model, first create an API key credential:

OpenAI API Credential Example
CREATE CREDENTIAL example.openai_api_cred
TYPE = 'openai_api_key',
SECRET = 'sk-abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
Nvidia API Credential Example
CREATE CREDENTIAL example.nvidia_api_cred
TYPE = 'nvidia_api_key',
SECRET = 'nvapi-abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'

Then, create the model:

OpenAI Model Example
CREATE MODEL openai_remote_model
WITH OPTIONS
(
    CREDENTIAL = 'example.openai_api_cred',
    REMOTE_MODEL_NAME = 'text-embedding-3-small',
    REMOTE_MODEL_LOCATION = 'https://api.openai.com/v1/embeddings'
)
Nvidia Remote Model Example
CREATE MODEL nvidia_remote_model
WITH OPTIONS
(
    CREDENTIAL = 'example.nvidia_api_cred',
    REMOTE_MODEL_NAME = 'NV-Embed-QA',
    REMOTE_MODEL_LOCATION = 'https://ai.api.nvidia.com/v1/retrieval/nvidia/embeddings'
)
Nvidia Local (No Key) Model Example
CREATE MODEL nvidia_local_model
WITH OPTIONS
(
    REMOTE_MODEL_NAME = 'NV-Embed-QA',
    REMOTE_MODEL_LOCATION = 'http://localhost:8180/v1/embeddings'
)

Optionally, set the model as the default one:

OpenAI Default Model Example
ALTER SYSTEM SET PROPERTIES ('ai_api_embeddings_model' = 'openai_remote_model')
Nvidia Default Remote Model Example
ALTER SYSTEM SET PROPERTIES ('ai_api_embeddings_model' = 'nvidia_remote_model')
Nvidia Default Local (No Key) Model Example
ALTER SYSTEM SET PROPERTIES ('ai_api_embeddings_model' = 'nvidia_local_model')

ALTER MODEL

Alters the configuration of an existing embedding model reference. Any of the model configuration parameters can be modified individually.

ALTER MODEL Syntax
ALTER MODEL <model name>
SET
<
    CREDENTIAL = '[<credential_schema_name>.]<credential name>'
    |
    REMOTE_MODEL_NAME = '<remote model name>'
    |
    REMOTE_MODEL_LOCATION = '<remote model location>'
>
Parameter                Description
<model name>             Name of the model to modify; must adhere to the supported naming criteria
CREDENTIAL               Credential object to use to authenticate to the model service
REMOTE_MODEL_NAME        Name of the model hosted by the embedding service; e.g., NV-Embed-QA
REMOTE_MODEL_LOCATION    URL of the model host service

For example, to change the name of the remote model referenced:

ALTER MODEL Example
ALTER MODEL openai_remote_model
SET REMOTE_MODEL_NAME = 'text-embedding-3-small'

GENERATE_EMBEDDINGS

Generates embeddings for the specified input data using the specified embedding model reference.

GENERATE_EMBEDDINGS Syntax
SELECT *
FROM TABLE [/* KI_HINT_SAVE_UDF_STATS */]
(
    GENERATE_EMBEDDINGS
    (
        [MODEL_NAME => '<model name>',]
        EMBEDDING_TABLE => INPUT_TABLE(<table name | query>),
        EMBEDDING_INPUT_COLUMNS => '<comma-delimited input column list>',
        [EMBEDDING_OUTPUT_COLUMNS => '<comma-delimited output column list>',]
        [DIMENSIONS => <dimensions>,]
        [PARAMS => KV_PAIRS('input_type' = <'query' | 'passage'>)]
    )
)
KI_HINT_SAVE_UDF_STATS
    SQL hint to log the embedding generation results of the UDF responsible for the processing (rag_udf_embed), so the logs can be viewed from the Jobs tab in Workbench.

MODEL_NAME
    Name of the model to use in generating embeddings, adhering to the supported naming criteria; if not specified, the default model, sqlgpt, will be used.

    To set the default to another model and avoid having to specify this parameter:

    Set Default Embeddings Model Example
    ALTER SYSTEM SET PROPERTIES ('ai_api_embeddings_model' = 'nvidia_remote_model')

EMBEDDING_TABLE
    The query or name of the table to use for input data to the embedding generation process.

EMBEDDING_INPUT_COLUMNS
    Names of the columns in the given EMBEDDING_TABLE for which embeddings will be generated.

EMBEDDING_OUTPUT_COLUMNS
    Names of the columns to return the generated embeddings as; if none are specified, _embedding will be appended to each input column name to construct the name of the corresponding output column.

DIMENSIONS
    Size of the vector returned by the embedding generation process; required for the default sqlgpt model.

PARAMS
    Optional indicator that a list of embedding option/value assignments will follow, passed as a comma-delimited list of key/value pairs to the KV_PAIRS function.

    input_type
        For Nvidia models, the type of input on which embeddings will be generated:

        passage    Input is the data set that will be searched.
        query      Input is the query that will be used for the search.

For example, to generate embeddings with various models:

Generate Embeddings (Default) Example
SELECT * FROM TABLE /* KI_HINT_SAVE_UDF_STATS */
(
    GENERATE_EMBEDDINGS
    (
        EMBEDDING_TABLE => INPUT_TABLE(SELECT * FROM example.fine_food_reviews WHERE Score > 3),
        EMBEDDING_INPUT_COLUMNS => 'Summary,Text',
        DIMENSIONS => 256
    )
)
Generate Embeddings (OpenAI) Example
SELECT * FROM TABLE /* KI_HINT_SAVE_UDF_STATS */
(
    GENERATE_EMBEDDINGS
    (
        MODEL_NAME => 'openai_remote_model',
        EMBEDDING_TABLE => INPUT_TABLE(example.fine_food_reviews),
        EMBEDDING_INPUT_COLUMNS => 'Summary,Text',
        EMBEDDING_OUTPUT_COLUMNS => 'Summary_emb,Text_emb',
        DIMENSIONS => 1536
    )
)
Generate Source Embeddings (Nvidia) Example
SELECT * FROM TABLE /* KI_HINT_SAVE_UDF_STATS */
(
    GENERATE_EMBEDDINGS
    (
        MODEL_NAME => 'nvidia_remote_model',
        EMBEDDING_TABLE => INPUT_TABLE(SELECT * FROM example.fine_food_reviews WHERE Score > 3),
        EMBEDDING_INPUT_COLUMNS => 'Summary,Text',
        PARAMS => KV_PAIRS('input_type' = 'passage')
    )
)
Generate Query Embeddings (Nvidia) Example
SELECT * FROM TABLE /* KI_HINT_SAVE_UDF_STATS */
(
    GENERATE_EMBEDDINGS
    (
        MODEL_NAME => 'nvidia_remote_model',
        EMBEDDING_TABLE => INPUT_TABLE(SELECT 'healthy food' AS search_query),
        EMBEDDING_INPUT_COLUMNS => 'search_query',
        EMBEDDING_OUTPUT_COLUMNS => 'search_query_emb',
        PARAMS => KV_PAIRS('input_type' = 'query')
    )
)

Vector Search with Embedding Models

Using embedding models, vector searches can be performed dynamically on a given data set.

For example, to search for healthy food references within a data set of product reviews:

Vector Search with Embeddings (OpenAI) Example
SELECT /* KI_HINT_SAVE_UDF_STATS */
    COSINE_DISTANCE(p.Text_emb, q.query_emb) as dist,
    p.ProductId,
    p.UserId,
    p.Summary,
    p.Text
FROM
    TABLE
    (
        GENERATE_EMBEDDINGS
        (
            MODEL_NAME => 'openai_remote_model',
            EMBEDDING_TABLE => INPUT_TABLE(SELECT * FROM example.fine_food_reviews WHERE Score > 3),
            EMBEDDING_INPUT_COLUMNS => 'Summary,Text',
            EMBEDDING_OUTPUT_COLUMNS => 'Summary_emb,Text_emb',
            DIMENSIONS => 1536
        )
    ) p,
    TABLE
    (
        GENERATE_EMBEDDINGS
        (
            MODEL_NAME => 'openai_remote_model',
            EMBEDDING_TABLE => INPUT_TABLE(SELECT 'healthy food' AS query),
            EMBEDDING_INPUT_COLUMNS => 'query',
            EMBEDDING_OUTPUT_COLUMNS => 'query_emb',
            DIMENSIONS => 1536
        )
    ) q
ORDER BY 1
Vector Search with Embeddings (Nvidia) Example
SELECT /* KI_HINT_SAVE_UDF_STATS */
    COSINE_DISTANCE(p.Text_emb, q.query_emb) as dist,
    p.ProductId,
    p.UserId,
    p.Summary,
    p.Text
FROM
    TABLE
    (
        GENERATE_EMBEDDINGS
        (
            MODEL_NAME => 'nvidia_remote_model',
            EMBEDDING_TABLE => INPUT_TABLE(SELECT * FROM example.fine_food_reviews WHERE Score > 3),
            EMBEDDING_INPUT_COLUMNS => 'Summary,Text',
            EMBEDDING_OUTPUT_COLUMNS => 'Summary_emb,Text_emb',
            PARAMS => KV_PAIRS('input_type' = 'passage')
        )
    ) p,
    TABLE
    (
        GENERATE_EMBEDDINGS
        (
            MODEL_NAME => 'nvidia_remote_model',
            EMBEDDING_TABLE => INPUT_TABLE(SELECT 'healthy food' AS query),
            EMBEDDING_INPUT_COLUMNS => 'query',
            EMBEDDING_OUTPUT_COLUMNS => 'query_emb',
            PARAMS => KV_PAIRS('input_type' = 'query')
        )
    ) q
ORDER BY 1