The following guide provides step by step instructions to get started using R with Kinetica. The RKinetica connector provides access to the Kinetica database through a REST interface. The RKinetica package can be built locally from source or a prebuilt package release can be downloaded from the release page. Kinetica strongly recommends that release builds be used instead of building the package locally.
RKinetica interfaces with Kinetica similarly to other R-database interactions: using the DBI and dplyr packages. Interacting with RKinetica is easiest using the RStudio interface. RStudio and RKinetica require that R be installed locally.
Source code for the connector can be found at:
Installation & Configuration
Prerequisites
RKinetica depends on the system libraries methods and stats & several other R packages from the comprehensive R archive network (CRAN):
All dependencies should be installed prior to installing RKinetica. To install the dependencies in RStudio or the R Console:
install.packages(c("DBI", "RJSONIO", "httr", "bit64", "hms", "methods", "purrr", "stats", "dplyr"))
Prebuilt Package Installation
Download the latest package tar.gz file from the release page.
Install the RKinetica package in RStudio or R console:
install.packages("/path/to/RKinetica_<version>.tar.gz", repos = NULL, type = "source")
Manually Built Package Installation
It is strongly recommended that a release build be used instead of building the package locally, but the RKinetica package can be built as follows:
Clone the latest RKinetica repository version:
git clone https://github.com/kineticadb/RKinetica.git -b release/v7.1 --single-branch
In the same directory, use the R CLI to build the package:
R CMD build RKinetica
This sequence produces a tar.gz file, which, once installed, is made available to R. The tar.gz file is created in the same directory the build command was issued.
Verify the tar.gz file was created before installing the RKinetica package:
ls RKinetica*
Install the RKinetica package via the R CLI:
R CMD install RKinetica_<version>.tar.gz
Usage
Before using RKinetica, the package must be loaded:
library(RKinetica)
Create a Kinetica connection object using the dbConnect() method, passing in Kinetica URL, user, and password parameters:
con <- dbConnect(RKinetica::Kinetica(), url = "https://<aws.fqdn>/<aws.cluster.name>/gpudb-0", username = "<user>", password = "<password>")
If using RStudio, you can use the rstudioapi package to instead prompt for username and password:
con <- dbConnect(RKinetica::Kinetica(), url = "http://<kinetica-host>:9191", username = rstudioapi::askForPassword("Database username?"), password = rstudioapi::askForPassword("Database password?"))
Strings vs. Factors
When RKinetica reads a character list into an R dataframe, it can be converted into a factor. This option is controlled by the stringsAsFactors environment property that's read into the as.data.frame() parameter:
stringsAsFactors = default.stringsAsFactors()
To set this environment property to TRUE or FALSE explicitly, use the following syntax at the beginning of your R script or once per session:
options(stringsAsFactors = FALSE)
Schema Support
Kinetica schemas, tables, and views must meet standard naming criteria and follow rules for name resolution, when referenced. The KineticaConnection object has a read-only attribute default_schema that stores the user's default schema name:
con@default_schema [1] "<default schema name>"
Important
The user's default schema cannot be set or managed by the user.
All table management functions available to KineticaConnection (dbCreateTable(), dbAppendTable(), dbReadTable(), dbWriteTable(), dbExistsTable(), and dbRemoveTable()) have a name argument that supports passing a character value, such as <schema name>.<table name> (or simply <table name> if utilizing the user's default schema), or a KineticaId object modeled after the DBI::Id class that encapsulates a named vector with schema and table parameter values defined separately:
id <- KineticaId(schema = "<schema name>", table = "<table name>") show(id) [1] <KineticaId> schema = <schema name>, table = <table name>
A KineticaId object can be created by submitting a named character value that will be parsed into a schema/table pair:
id <- KineticaId("<schema name>.<table name>") show(id) [1] <KineticaId> schema = <schema name>, table = <table name>
Additional Connection Configuration
Row Limits
If you expect a result set from your queries to exceed 10,000 rows, set the row_limit parameter value accordingly:
con <- dbConnect(RKinetica::Kinetica(), url = "https://<aws.fqdn>/<aws.cluster.name>/gpudb-0", username = "<user>", password = "<password>", row_limit=1000000L)
Important
A parameter cannot be added to an existing KineticaConnection object. Instead, a new KineticaConnection object must be created to properly initialize any functionality enabled by additional parameters.
Examples
The following examples assume that this KineticaConnection object has already been established, which you can then use as a regular DBI connection:
con <- dbConnect(RKinetica::Kinetica(), url = "https://<aws.fqdn>/<aws.cluster.name>/gpudb-0", username = "<user>", password = "<password>")
Some of the most common commands:
Print connection info:
dbGetInfo(con)
Get the current user's default schema:
con@default_schema
List top-level schemas the user has access to, the tables contained within those schemas, and basic information about the schemas and tables:
dbListObjects(con)
Get a list of tables within the ki_home schema:
dbListTables(con, "ki_home")
Write a table to the user's default schema with 3 columns:
dbWriteTable(con, "tableA", data.frame(a = 1L, b = 2L, c = 3.0), row.names = NULL)
Check if the table exists in the user's default schema:
dbExistsTable(con, "tableA")
Check if the table exists in the example schema using the KineticaId object:
id <- KineticaId("example.tableA") dbExistsTable(con, id)
List tableA (located in the user's default schema) fields:
dbListFields(con, "tableA")
Add records to tableA (located in the user's default schema):
dbAppendTable(con, "tableA", data.frame(a = 2L:3L, b = 3L:4L, c = 4.0:5.0), row.names = NULL)
Read tableA (located in the user's default schema) into variable:
rows <- dbReadTable(con, "tableA") print(rows)
Drop a table in the user's default schema if it exists:
dbRemoveTable(con, "tableA")
Disconnect:
dbDisconnect(con)
Large tables, views, or query results can have pagination parameters passed into the dbSendQuery or dbSendStatement methods. Assuming that the connection was established with a row limit of 1,000,000, i.e. row_limit = 1000000L, the following example query extracts all records from the acquisition table in the example schema sorted by l_id. The data is retrieved in batches of one million records and the offset is increased by 1 million each batch. This loop continues until the dbSendQuery() resultset is returned empty:
sql_query <- "SELECT l_id, product_type, term, score FROM example.acquisition ORDER BY l_id" offset <- 0L repeat { result <- dbSendQuery(con, sql_query, limit = 1000000L, offset = offset) # work with current data.frame provided in result@data if (nrow(result@data) > 0) { # increase offset to get next page of data offset <- offset + 1000000L } else { # exit pagination loop when data.frame is empty break } }
Additional code examples are available in the examples subdirectory of the RKinetica repository.
Documentation
Detailed RKinetica information including method and function descriptions, usage examples, and arguments can be found in the RKinetica PDF.