A data source is a reference object for a data set that is external to the database. It consists of the location & connection information for that external source, but doesn’t hold the names of any specific data sets/files within that source. A data source can use a credential object to store remote authentication information. A data source name must adhere to the standard naming criteria. Each data source exists within a schema and follows the standard name resolution rules for tables.
The following data source providers are supported:
  • Azure (Microsoft Azure Blob Storage)
  • GCS (Google Cloud Storage)
  • HDFS (Apache Hadoop Distributed File System)
  • JDBC (Java Database Connectivity, using a user-supplied driver or one of the drivers on the supported list)
  • Kafka (streaming feed)
    • Apache
    • Confluent
  • S3 (Amazon S3 Bucket)
The following default hosts are used for Azure, GCS, & S3, but can be overridden in the location parameter:
  • Azure: <service_account_name>.blob.core.windows.net
  • GCS: storage.googleapis.com
  • S3: <region>.amazonaws.com
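The default hosts above can be sketched as a small lookup. This is a minimal illustration, not part of the product API: the names DEFAULT_HOSTS, default_host, and location_for are hypothetical, and the sketch assumes that a bare provider name in the location (for example 's3') leaves the default host in effect, while 'provider://host' overrides it.

```python
# Default host templates, copied from the list above.
DEFAULT_HOSTS = {
    "azure": "{account}.blob.core.windows.net",
    "gcs": "storage.googleapis.com",
    "s3": "{region}.amazonaws.com",
}


def default_host(provider, account=None, region=None):
    """Return the default host for a provider (hypothetical helper)."""
    template = DEFAULT_HOSTS[provider.lower()]
    if "{account}" in template and not account:
        raise ValueError("Azure needs a storage account name")
    if "{region}" in template and not region:
        raise ValueError("S3 needs a region")
    return template.format(account=account, region=region)


def location_for(provider, host=None):
    """Build a location string; a bare provider keeps the default host."""
    provider = provider.lower()
    if provider not in DEFAULT_HOSTS:
        raise ValueError(f"no default host listed for provider: {provider}")
    if host is None:
        return provider
    return f"{provider}://{host}"
```

For example, location_for("s3") leaves the default host in place, while location_for("s3", "storage.example.internal") points the data source at a different, here made-up, endpoint.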
Data sources perform no function by themselves, but act as proxies for accessing external data when referenced in certain database operations. The following can make use of data sources: Individual files within a data source need to be identified when the data source is referenced within these calls.
By default, a data source is validated upon creation, and creation fails if an authorized connection cannot be established.

Managing Data Sources

A data source can be managed using the following API endpoint calls. For managing data sources in SQL, see CREATE DATA SOURCE.
API Call                        Description
/create/datasource              Creates a data source, given a location and connection information
/alter/datasource               Modifies the properties of a data source, validating the new connection
/drop/datasource                Removes the data source reference from the database; will not modify the external source data
/show/datasource                Outputs the data source properties; passwords are redacted
/grant/permission/datasource    Grants the permission for a user to connect to a data source
/revoke/permission/datasource   Revokes the permission for a user to connect to a data source
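The Python snippets in this document call the /create/datasource endpoint as create_datasource. Assuming the other endpoints follow the same naming pattern (this is an assumption, not something stated here; check it against the API reference), the corresponding method names can be derived as in the sketch below. The helper method_name is hypothetical.

```python
# Endpoints from the table above.
ENDPOINTS = [
    "/create/datasource",
    "/alter/datasource",
    "/drop/datasource",
    "/show/datasource",
    "/grant/permission/datasource",
    "/revoke/permission/datasource",
]


def method_name(endpoint):
    """Map an endpoint path to a Python-style method name.

    Assumption: path separators become underscores, as with
    /create/datasource and create_datasource.
    """
    return endpoint.strip("/").replace("/", "_")


for endpoint in ENDPOINTS:
    print(f"{endpoint} -> {method_name(endpoint)}")
```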

Creating a Data Source

To create a data source, kin_ds, that connects to an Amazon S3 bucket, kinetica-ds, in the US East (N. Virginia) region:
CREATE DATA SOURCE kin_ds
LOCATION = 'S3'
USER = '<aws access id>'
PASSWORD = '<aws access key>'
WITH OPTIONS
(
    BUCKET NAME = 'kinetica-ds',
    REGION = 'us-east-1'
)
For Amazon S3 connections, the user_name & password parameters (USER & PASSWORD in SQL) refer to the AWS Access ID & Key, respectively.
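The same data source could be expressed through the Python API by filling in the S3 template shown under Provider-Specific Syntax. The sketch below only assembles the arguments; the helper s3_datasource_args is hypothetical, and the final call is left as a comment because it needs a connected database handle (named kinetica in the templates in this document).

```python
def s3_datasource_args(name, bucket, region, access_id, access_key):
    """Assemble create_datasource arguments for an S3 bucket.

    Hypothetical helper; the keys mirror the S3 syntax template.
    """
    return {
        "name": name,
        "location": "s3",  # bare provider name: default S3 host
        "user_name": access_id,  # AWS Access ID
        "password": access_key,  # AWS Access Key
        "options": {
            "s3_bucket_name": bucket,
            "s3_region": region,
        },
    }


args = s3_datasource_args(
    name="kin_ds",
    bucket="kinetica-ds",
    region="us-east-1",
    access_id="<aws access id>",
    access_key="<aws access key>",
)

# With a connected handle, the call would then be:
# kinetica.create_datasource(**args)
```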

Provider-Specific Syntax

Several authentication schemes across multiple providers are supported.

Azure

kinetica.create_datasource(
    name = '[<data source schema name>.]<data source name>',
    location = 'azure[://<host>]',
    user_name = '',
    password = '',
    options = {
        'credential': '[<credential schema name>.]<credential name>',
        'azure_container_name': '<azure container name>'
    }
)

GCS

kinetica.create_datasource(
    name = '[<data source schema name>.]<data source name>',
    location = 'gcs[://<host>]',
    user_name = '',
    password = '',
    options = {
        'credential': '[<credential schema name>.]<credential name>',
        ['gcs_project_id': '<gcs project id>',]
        'gcs_bucket_name': '<gcs bucket name>'
    }
)

HDFS

kinetica.create_datasource(
    name = '[<data source schema name>.]<data source name>',
    location = 'hdfs://<host>:<port>',
    user_name = '',
    password = '',
    options = {
        'credential': '[<credential schema name>.]<credential name>'
    }
)

JDBC

kinetica.create_datasource(
    name = '[<data source schema name>.]<data source name>',
    location = '<jdbc url>',
    user_name = '',
    password = '',
    options = {
        'credential': '[<credential schema name>.]<credential name>',
        'jdbc_driver_class_name': '<jdbc driver class full path>',
        'jdbc_driver_jar_path': 'kifs://<jdbc driver jar path>'
    }
)

Kafka (Apache)

The location can be a comma-delimited list of Kafka URLs for high availability; only one of them is streamed from at any given time.
kinetica.create_datasource(
    name = '[<data source schema name>.]<data source name>',
    location = 'kafka://<host>:<port>',
    user_name = '',
    password = '',
    options = {
        'credential': '[<credential schema name>.]<credential name>',
        'kafka_topic_name': '<kafka topic name>'
    }
)
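A comma-delimited location for several brokers can be assembled as sketched below. The helper kafka_location is hypothetical, and the sketch assumes each entry in the list carries its own scheme prefix ('kafka://' here, or 'confluent://' for Confluent); the text only says the location is a comma-delimited list of Kafka URLs, so verify the exact form before relying on it.

```python
def kafka_location(brokers, scheme="kafka"):
    """Join (host, port) pairs into a comma-delimited location string.

    Assumption: every URL in the list repeats the scheme prefix.
    """
    if scheme not in ("kafka", "confluent"):
        raise ValueError(f"unexpected scheme: {scheme}")
    if not brokers:
        raise ValueError("at least one broker is required")
    urls = [f"{scheme}://{host}:{int(port)}" for host, port in brokers]
    return ",".join(urls)


# Two made-up brokers for illustration.
location = kafka_location([("broker1.example.com", 9092),
                           ("broker2.example.com", 9092)])
print(location)
```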

Kafka (Confluent)

The location can be a comma-delimited list of Kafka URLs for high availability; only one of them is streamed from at any given time.
kinetica.create_datasource(
    name = '[<data source schema name>.]<data source name>',
    location = 'confluent://<host>:<port>',
    user_name = '',
    password = '',
    options = {
        'credential': '[<credential schema name>.]<credential name>',
        'kafka_topic_name': '<kafka topic name>'
    }
)

S3

kinetica.create_datasource(
    name = '[<data source schema name>.]<data source name>',
    location = 's3[://<host>]',
    user_name = '',
    password = '',
    options = {
        'credential': '[<credential schema name>.]<credential name>',
        's3_bucket_name': '<aws s3 bucket name>',
        's3_region': '<aws s3 region>'
    }
)

Limitations

  • Azure anonymous data sources are only supported when both the container and the contained objects allow anonymous access.
  • HDFS systems with wire encryption are not supported.
  • Kafka data sources require an associated credential object for authentication.
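Of the limitations above, only the Kafka credential requirement can be checked locally before a request is sent; the Azure and HDFS limitations depend on how the remote system is configured. A minimal pre-flight check, using a hypothetical helper check_limitations and assuming the location and options follow the syntax templates in this document, might look like this:

```python
def check_limitations(location, options):
    """Return a list of problems that can be detected locally.

    Hypothetical helper: covers only the Kafka credential rule.
    """
    problems = []
    # The scheme is whatever precedes '://' in the (first) URL.
    scheme = location.split("://", 1)[0].lower()
    if scheme in ("kafka", "confluent") and not options.get("credential"):
        problems.append(
            "Kafka data sources require an associated credential object"
        )
    return problems


# A Kafka data source without a credential is flagged.
print(check_limitations("kafka://broker.example.com:9092",
                        {"kafka_topic_name": "orders"}))
```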