expression.
API Support
| Language | Distributed Key Lookup Mechanism |
|---|---|
| C++ | RecordRetriever |
| C# | RecordRetriever |
| Java | RecordRetriever |
| Javascript | X |
| Node.js | X |
| Python | RecordRetriever |
| REST | X |
| SQL | Using KI_HINT_KEY_LOOKUP hint in a query; see SQL below for details |
Configuration
Distributed operations are enabled through configuration:- Distributed operations needs to be enabled on the server (default)
- Client-accessible URLs need to be set for each node in the cluster
Server-Side
In order for the cluster nodes to transmit data directly to a key lookup client, the configuration on each node needs to be set to allow the incoming HTTP requests for that data. The/opt/gpudb/core/etc/gpudb.conf file
needs to have the following property set for distributed key lookup to work
properly:
gpudb.conf:
public_url is not defined, each node can be connected to on any of its
available interfaces, taking HA and
HTTPD configurations, as well as any
general network security restrictions into account.
Client-Side
The list of URLs for connecting to each worker node is automatically created when using the following:- Java API default
RecordRetriever - Python API
GPUdbTableconfigured for background distributed operations
public_url was configured on each server node,
workers may have more than one IP address, not all of which may be accessible to
the client. The API worker list constructor uses the first IP in the list
returned by the server for each worker, by default. To override this behavior,
a regular expression Pattern or prefix String can be used to match the
correct worker IP addresses:
Considerations
There are several factors to consider when using distributed key lookup:- Only primary key/shard key value lookups are allowed for sharded tables; arbitrary queries are not supported.
- Lookup values cannot be expression functions.
- The shard key and any columns used in the
expressionmust all be indexed. - There is an implicit index for a primary key, but queries that make use of this index for the lookup cannot have any other columns in the expression.
- If column indexes exist on all columns being filtered, additional indexed columns can be added to the lookup expression using either equality or inequality relations.
SQL
SQL has wide support for distributed key lookups. The following configurations will result in successful lookups:- Specifying only the primary key columns/values of any table with a primary key
- Specifying only the shard key columns/values of a sharded table with attribute indexes on each of the shard key columns
- Specifying any indexed columns of a replicated table
As noted under Considerations, if all
key lookup columns have column indexes,
additional indexed columns can be added to the filter. See
below for an example.
KI_HINT_KEY_LOOKUP hint is used to request a fast key lookup. If the
table configuration and query given aren’t a match for the key lookup, the query
will be executed as any other query would—without the fast key lookup. A
warning will be returned that the key lookup was not possible, but the query
will still be executed and results returned.
Syntax
The following simple query syntax is supported for key lookups:SQL Key Lookup Syntax
Examples
To look up a particular record in a sharded table with an explicit shard key on columnid:
Sharded Table with Shard Key Lookup Example
stock,
has an attribute index.
Sharded Table with Shard Key Lookup and Additional Filter Example
id & category:
Replicated Table with Indexed Columns Lookup Example
Python Distributed Key Lookup
In the Python API, theGPUdbTable constructor
has a use_multihead_io parameter, which allows a GPUdbTable object to
handle all RecordRetriever interactions with the associated
table in the background. The following is
a Python API example that demonstrates the use of the background
RecordRetriever for retrieving data.
This example relies on tables mirroring the
stocks data set, which can be imported into
Kinetica via GAdmin.
Key Lookup
Ensure thestocks_mh table, which has a one-column shard key on
Symbol, is indexed on that column:
stocks_mh table, passing the
use_multihead_io option to enable background distributed key lookups:
SPGI:
Key Lookup with Expression
Ensure thestocks_mh table, which has a one-column shard key on
Symbol, is indexed on that column:
Date column, which will be used in the example filter:
stocks_mh table, passing the
use_multihead_io option to enable background distributed key lookups:
SPGI and filtering out records
prior to 2017:
Multi-Column Key Lookup
Ensure thestocks_mh_mc table, which has a two-column shard
key on Symbol & Industry, is indexed on those columns:
stocks_mh_mc table, passing the
use_multihead_io option to enable background distributed key lookups:
Industry &
Symbol columns for the values Financial Exchanges & Data & SPGI,
respectively:
Multi-Column Key Lookup with Expression
Ensure thestocks_mh_mc table, which has a two-column shard
key on Symbol & Industry, is indexed on those columns:
Date column, which will be used in the example filter:
stocks_mh_mc table, passing the
use_multihead_io option to enable background distributed key lookups:
Industry & Symbol columns for the values
Financial Exchanges & Data & SPGI, respectively, and filtering out
records prior to 2017:
Result
The results of a distributed key lookup call can be output fromresponse as follows:
| Lookup Type | Time (Without DKL) | Time (With DKL) | Record Count |
|---|---|---|---|
| Single-Column Key Only | 0.0043 | 0.0020 | 222 |
| Single-Column Key + Expression | 0.0031 | 0.0009 | 53 |
| Multi-Column Key | 0.0046 | 0.0024 | 222 |
| Multi-Column Key + Expression | 0.0031 | 0.0010 | 53 |
Cautions
If using the Java API and MapReduce, there is a conflict between the version of Avro used by Kinetica and the one used by MapReduce. This conflict can be overcome by using the Maven shade plug-in with therelocation tag: