SQL-GPT Concepts

SQL-GPT is a Kinetica generative AI capability that produces SQL based on questions asked in natural language.

A SQL-GPT interface is integrated into the top of each Worksheet of every Workbook in Kinetica. The feature provides the option to customize the SQL-GPT context for each Workbook independently.

The functionality is also accessible using SQL syntax integrated into the core database engine, so it can be leveraged by any Kinetica-powered application that can make SQL requests.

Overview

Transformer-based LLMs have an ability to learn new tasks with minimal training, and they have extensive knowledge to apply when performing generative tasks, such as generating SQL. Configuring SQL-GPT to answer questions about a proprietary dataset requires minimal configuration to get started.

Enabling SQL-GPT in a Kinetica environment consists of two steps:

  1. Creating a SQL-GPT context: Create a SQL-GPT context via the Workbench UI, or using the CREATE CONTEXT SQL syntax directly. The context defines tables, rules, annotation, and training samples that the LLM needs to understand your data.

  2. Generating SQL: Enter natural language into the Workbench UI and click Generate SQL, or use the GENERATE SQL syntax. The response will contain the generated SQL.

    The EXECUTE_AI_QUERY function can be used to generate SQL and execute it in one step.

Generating SQL

Assuming you have a SQL-GPT context you can use the GENERATE SQL syntax to make natural language queries.

GENERATE SQL Syntax
1
2
GENERATE SQL FOR '<question>'
WITH OPTIONS (context_name = '<context object>')

For example if you have a context object stock_trades_context, you can use it to generate SQL for the following question:

GENERATE SQL Example
1
2
3
4
5
6
GENERATE SQL FOR
'
    For each trade of the Apple stock find a corresponding
    quote that is less than 5 seconds after the trade occurred?
'
WITH OPTIONS (context_name = 'stock_trades_context')

SQL-GPT Context

Context Overview

A SQL-GPT context is a Kinetica database object that specifies information about the tables constituting a particular data set, as well as rules and sample queries for referencing them. The LLM uses this context to generate Kinetica-compliant SQL that references objects in that data set.

A context is required to run GENERATE SQL. You can define multiple context objects and have one for each use case. Each context is contained within a database schema and has relevant access controls.

The context object is used to populate the LLM context to facilitate in-context training, as opposed to fine tuning. The LLM context is a limited storage area in the model which can be compared to a person's short term memory. An advantage of in-context training is that a single model can be used to inference any number of schemas without modification, but it has the limitation that the training data must fit within the LLM context and this limits how many tables and columns can be considered at any one time.

At a high level the context contains:

  • Context Name: This is the name of the database object that stores the SQL-GPT context and is referenced when generating a query.
  • Context Tables: The schemas of the tables included in the context are passed to the LLM. Only these tables will be considered when generating SQL.
  • Table and Column Comments: The LLM relies entirely on the context to understand what data is contained in the tables. If the columns and table names do not describe the data well, then comments are needed.
  • Rules: These are natural language descriptions explaining how to generate the SQL. For example, a rule could specify a geospatial function to be used in certain cases or how to join a table.
  • Samples: These are question/answer pairs that provide examples the LLM can generalize when generating SQL for user-supplied questions.

Context Syntax

The CREATE CONTEXT command creates the table & query context for SQL-GPT to understand the tables available for generating SQL and how to use them to answer questions.

CREATE CONTEXT Syntax
1
2
3
4
5
CREATE [OR REPLACE] [TEMP] CONTEXT [<schema name>.]<context name>
[<table definition clause>[,...],]
[<sample definition clause>,]
[<rules definition clause>]
[WITH OPTIONS (ttl = '<ttl value>'[,...])]

A context contains a set of clauses where each entry is one of the following:

The context must include a table reference in one of these clauses in order for it to be properly considered during SQL generation. While all clauses are strictly optional, a context must include one of them to be considered valid.

A context can be made to be non-persistent (not survive a database restart) with the TEMP option, and be assigned a removal time-to-live via ttl. Specifying OR REPLACE will remove any existing context of the same name before creating the new one.

For example:

CREATE CONTEXT Example
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
CREATE OR REPLACE CONTEXT nyctaxi_ctx
(
    TABLE = taxi_data,
    COMMENT =
    '
        This table contains historical New York taxi trip data.
        Each row contains the trip starting (pickup) and ending
        (drop off) latitudes and longitudes, the total fare,
        and other details about the trip.
    ',
    COMMENTS = 
    (
        vendor_id = 'The ID for the taxi company providing the ride.',
        pickup_datetime = 'The time at which a trip started.',
        dropoff_datetime = 'The time at which a trip ended',
        trip_distance = 'The total distance in kilometers for a particular trip'
        
    ),
    RULES = 
    (
        'Do not use the surcharge column of the taxi_data table',
        'Payment types include cash, credit, dispute, no charge, and unknown'
    )
),
(
    TABLE = nyct2020,
    COMMENT = 'This table contains Neighborhood Tabulation Area (NTA) geoms and names for NYC neighborhoods',
    COMMENTS = 
    (
        gid = 'The primary key that identifies each row in this table',
        geom = 'The WKT that provides the spatial outline for the neighborhood',
        NTAName = 'The name of the NTA neighborhood.'
    )
    RULES = 
    (
        'Use the nyct2020 table to calculate aggregate statistics for neighborhoods',
        'Use NTAName instead of spatial coordinates to identify a neighborhood'
    )
),
(
    SAMPLES =
    (
        'How many trips did each taxi vendor''s cabs make?' =
        '
            SELECT vendor_id, COUNT(*)
            FROM taxi_data
            GROUP BY vendor_id
        '
    )
),
(
    RULES =
    (
        '
            Use the ST_CONTAINS() function to join the geom column of the nyct2020 table
            with the pickup_longitude and pickup_latitude columns of the taxi_data table
            or the dropoff_longitude and dropoff_latitude columns of the taxi_data table
        ',
        'Use functions like HOUR(expr) DAY(expr) etc to extract a date/time expression''s components'
    )
)

Table Definition Clause

The table definition clause provides information about a single table that includes a table comment, column comments, and rules that apply to that table. All parameters are optional except for the table name. The table definition clause itself is also optional.

Important

Kinetica will pass the all columns of the table to the LLM regardless of whether you provide comments for them. If the table has many columns, this could consume unnecessary space in the LLM context. In this situation, it is advised that you create a view with descriptive column names containing only the columns needed for generating SQL.

Table Definition Clause Syntax
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
(
    TABLE = <table name>
    [COMMENT = '<table comment>']
    [COMMENTS = (
        <column name> = '<column comment>',
        ...
        <column name> = '<column comment>'
    )]
    [RULES = (
        '<table rule>',
        ...
        '<table rule>'
    )]
)

The following example table definition clause for a table taxi_data contains a table comment, four column comments, and two rules for SQL-GPT to consider when using the table.

Table Definition Clause Example
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
(
    TABLE = taxi_data,
    COMMENT =
    '
        This table contains historical New York taxi trip data.
        Each row contains the trip starting (pickup) and ending
        (drop off) latitudes and longitudes, the total fare,
        and other details about the trip.
    ',
    COMMENTS = 
    (
        vendor_id = 'The ID for the taxi company providing the ride.',
        pickup_datetime = 'The time at which a trip started.',
        dropoff_datetime = 'The time at which a trip ended',
        trip_distance = 'The total distance in kilometers for a particular trip'
        
    ),
    RULES = 
    (
        'Do not use the surcharge column of the taxi_data table',
        'Payment types include cash, credit, dispute, no charge, and unknown'
    )
)

Sample Definition Clause

A sample definition contains a set of question & answer pairs used to train SQL-GPT for generating SQL. SQL-GPT has previously been trained on thousands of generic samples, so you should not need more than a few (if any) samples to customize it for your use case. The sample definition clause is optional.

Tip

Samples take up space in the SQL-GPT context, and you are limited in the number you can provide. It is advised to use no more than 10 samples per context object.

Important

A sample definition should reference tables that are included in table definitions within either its own context or a separate context that is passed along with it into a GENERATE SQL or EXECUTE_AI_QUERY command. This ensures that SQL-GPT has a proper definition (and understanding) of the tables and columns referenced in each sample query.

Sample Definition Clause Syntax
1
2
3
4
5
6
7
8
(
    SAMPLES =
    (
        '<question>' = '<SQL answer>',
        ...
        '<question>' = '<SQL answer>'
    )
)

The following example sample definition clause contains a single natural-language question and the desired query used to answer that question.

Sample Definition Clause Example
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
(
    SAMPLES =
    (
        'How many trips did each taxi vendor''s cabs make?' =
        '
            SELECT vendor_id, COUNT(*)
            FROM taxi_data
            GROUP BY vendor_id
        '
    )
)

Note

String literals will need to have their single quotes escaped with pairs of single quotes, as per standard SQL escaping rules.

Rule Definition Clause

A rule definition contains a set of global rules that SQL-GPT should use in employing the various tables named in the context to generate SQL. By convention, rules that apply to a specific table belong in the RULES section of the table definition, while rules that apply across tables or independently from any tables belong here. The rule definition clause is optional.

Important

A rule definition should reference tables that are included in table definitions within either its own context or a separate context that is passed along with it into a GENERATE SQL or EXECUTE_AI_QUERY command. This ensures that SQL-GPT has a proper definition (and understanding) of any tables and columns referenced in each rule.

Rule Definition Clause Syntax
1
2
3
4
5
6
7
8
(
    RULES =
    (
        '<rule>',
        ...
        '<rule>'
    )
)

The following example rule definition clause contains one rule for handling joins between the taxi_data & nyct2020 tables and another rule for dealing with date/time values.

Rules Definition Clause Example
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
(
    RULES =
    (
        '
            Use the ST_CONTAINS() function to join the geom column of the nyct2020 table
            with the pickup_longitude and pickup_latitude columns of the taxi_data table
            or the dropoff_longitude and dropoff_latitude columns of the taxi_data table
        ',
        'Use functions like HOUR(expr) DAY(expr) etc to extract a date/time expression''s components'
    )
)

Note

String literals will need to have their single quotes escaped with pairs of single quotes, as per standard SQL escaping rules.

Context Best Practices

LLM inferencing behavior is complex, and there is no way to accurately predict how it will respond to changes in the context. Building a context to generate the SQL statements you expect may require some best practices in addition to trial and error.

Managing Context Size

Increasing the size of the context is a trade-off. The SQLGPT.io service and Kinetica SQLAssist LLM currently support thousands of tokens, where a token is a representation of a group of 3 to 4 characters, on average. Because of limited space, it is important to eliminate unnecessary information from the context, while adding what is necessary.

  • Limit the number of tables included in the context. Adding more tables increases context size and gives the model more tables to choose from, which increases the probability of error. Consider dividing up the inferencing cases into contexts of no more than 5 tables. Tables that can be joined together can be replaced in the context by a single view that connects them along their relational key columns.
  • Limit the number of columns in the context. For each table included in a context, all of its columns are automatically sent to the LLM, and the space occupied by the column definitions can quickly add up. If a table has many columns that are not relevant to the inferencing, then create a view with only the columns needed for inferencing.

Reducing Ambiguity

The model does not know anything about your tables or data other than what is provided in the context. It can be helpful to ask yourself: "If a programmer/analyst who is not familiar with the schema was given these comments and samples would they be able to write correct SQL?"

In many situations the programmer/analyst might say: "This does not make sense," or "I need more information". The model can't ask these questions, because it is programmed to choose the response with the highest probability of being correct, regardless of the input provided. For this reason, it is important to consider what it needs to know without being ambiguous.

  • Use descriptive column names when possible. Many tables have short and unintuitive column names. The model is good at interpreting the meaning of columns based on their names, so descriptive names should be used whenever possible. Additionally, provide comments for columns that are most likely to be used in queries. This will make it more likely that the LLM will use these correctly during inferencing.
  • Make sure samples are of high quality. Providing a sample with an ambiguous question or incorrect answer can confuse the model and degrade performance. It can adapt to ambiguous questions while inferencing, but the samples should be clear and correct. If the model answers a question incorrectly when it is provided with a similar sample, then check the samples for ambiguities, errors, or inconsistencies.
  • Provide samples that can help build more complex queries. An LLM is capable of generating really complex queries. However, it helps to provide simpler samples that give the LLM the building blocks for constructing the complex patterns that you want to see in its responses.
  • Make sure your samples reference tables that are in your environment. An LLM will likely reference any tables that are provided as part of the context in its output, regardless of whether a given table was used in a sample or provided as a table definition. It is important that a sample definition references tables that are included in the overall context supplied to the LLM, as this ensures the LLM has a proper definition (and understanding) of the tables and columns referenced in the query.

Managing SQL Generation

The rules and samples in the context can be used to suggest to the model how to generate the SQL. This can include features like: SQL functions to use, columns to return, or strategy to use to solve the problem. It is important to understand that these are suggestions, and there is not a guarantee that they will be used. Sometimes trial and error with different suggestions can produce the intended result.

  • Provide at least one sample per table.

    For each table, it can be beneficial to give the model at least one example that queries the table in isolation.

    In some cases, you may add a table for the purpose of it being joined with another table, as you would not ask a question that would use this table in isolation. It is still useful to include an isolated SQL example on the join column so the model is familiar with the column.

    For example, if you have a geospatial table with shapes of cities that you intend only to join with another table containing positions of landmarks, then you could add a sample for "What is the boundary geometry of the Washington DC?" with the generated SQL returning the WKT column containing the city that should be joined.

  • Add rules or samples to show how tables should be joined.

    Questions that require the LLM to determine that a response needs a join as well as what tables to use for that join are much more challenging than single-table queries. If the use case requires joins, it is advised to provide samples or rules to show how the tables should be joined.

  • Encapsulate complex and commonly used SQL into views.

    If you have problems getting the LLM to generate a complex join or aggregation, then consider encapsulating the table(s) involved in a logical view for the context. Unlike materialized views, logical views have essentially no memory footprint in the database.

LLM Options

For the SQL-GPT LLM, you can use one of the following two options:

  • Kinetica's SQLGPT.io service - (default) uses external LLM services
  • Kinetica SQLAssist LLM - runs in your local environment for added security, ensuring that your metadata and sample queries are not shared with any external services; also provides deterministic control over the model so that query results will not change unexpectedly

Contact Kinetica support for details on accessing Kinetica SQLAssist LLM.