Machine Learning Concepts

Kinetica provides a machine learning (ML) capability for simplifying and accelerating data science in a scalable fashion. With ML, users can ingest data, train models, make inferences (answers/output from models), and even audit models. ML leverages Kubernetes to deploy, train, and test models.

Note

To work with models in SQL, see Machine Learning (ML). For statistical analysis functions that don't require a model, see ML Functions.

Concepts

The ML workflow is defined by five key concepts:

Registries

Docker container registries house the ML models themselves and abstract their implementation details. Any model used by Kinetica must exist in a Docker container registry.

Data

Data comprises ingests, datasets, and feature sets.

  • Ingests represent an ingest tool (e.g., Kafka) that pulls data from a given source (Kinetica, PostgreSQL, S3, etc.) and puts it in a new table inside Kinetica. Data can be pulled in batches or via continuous streaming.
  • Datasets represent a set (or sub-set) of column data from a source table in Kinetica. One or more columns can be filtered to create a dataset.
  • Feature sets represent a group of features, which are datasets transformed inline with functions or relationally using Materialized Views.
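The relationship between a source table, a dataset, and a feature set can be illustrated with a minimal sketch in plain Python. This is not Kinetica's API: the in-memory `source_table`, the helper names `make_dataset` and `make_feature_set`, and the column names are all hypothetical, chosen only to show a dataset as a filtered column subset and a feature set as a transformation of that dataset.

```python
# Hypothetical in-memory stand-in for a source table; not Kinetica's API.
source_table = [
    {"id": 1, "price": 10.0, "qty": 3, "region": "east"},
    {"id": 2, "price": 4.5, "qty": 10, "region": "west"},
    {"id": 3, "price": 7.25, "qty": 0, "region": "east"},
]


def make_dataset(rows, columns, predicate=lambda row: True):
    """Keep only the chosen columns from the rows that pass the filter."""
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]


def make_feature_set(dataset, features):
    """Apply each named transformation function to every dataset row."""
    return [{name: fn(row) for name, fn in features.items()} for row in dataset]


# Dataset: a subset of columns, filtered on one of them.
dataset = make_dataset(source_table, ["price", "qty"], lambda r: r["qty"] > 0)

# Feature set: the dataset transformed inline with a function.
feature_set = make_feature_set(dataset, {"revenue": lambda r: r["price"] * r["qty"]})
print(feature_set)
```

In an actual deployment the source table, the filter, and the transformation would live inside Kinetica rather than in application code; the sketch only mirrors the conceptual layering.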

Models

A model can be a function, a statistical model, a regression, a data model, or a similar construct that is deployed to enable inferencing. Kinetica can deploy any number of replicas of a model, allowing for scalability and better resource management. Currently, only Blackbox models are supported. A Blackbox model is one whose implementation details are abstracted and housed in a Docker container, so input and output are its only available interface. Blackbox models also don't require a training dataset.
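As a rough illustration of the Blackbox idea, a model can be thought of as a function whose only contract is its input and output. The sketch below is an assumption for illustration, not Kinetica's model interface: the function name `blackbox_predict`, the field names, and the threshold are hypothetical. It applies a fixed rule, so no training dataset is involved.

```python
from typing import Any, Dict


def blackbox_predict(in_map: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical Blackbox model: callers see only the input and output maps.

    The rule is fixed in code, so no training dataset is needed.
    """
    total = in_map["price"] * in_map["qty"]
    # The threshold of 100.0 is an arbitrary value chosen for this example.
    return {"total": total, "is_large_order": total >= 100.0}


# The caller supplies input and reads output; the rule itself stays hidden.
print(blackbox_predict({"price": 25.0, "qty": 4}))
```

Because the caller depends only on the input and output maps, the body of the function could be swapped for any other implementation without changing how it is invoked.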

Deployments

A deployment represents a model that has been deployed. A deployed model can have inference tests run manually, automatically, or in batches, depending on the type of deployment. Currently there are three types of deployments:

  • On-Demand -- inference tests are run as necessary against user-provided input, and results are returned for that input
  • Continuous -- inference tests are run automatically against records being streamed into an input table; inference results are inserted into an output table
  • Batch -- inference tests are run against a batch of data in an existing table all at once
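The difference between the three deployment types lies in where the input comes from and where the results go. The following sketch shows this in plain Python under stated assumptions: the names `double_model`, `run_on_demand`, `run_continuous`, and `run_batch` are hypothetical, tables are modeled as lists of dictionaries, and a Python iterator stands in for records being streamed into an input table. It does not use or represent Kinetica's API.

```python
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]
Model = Callable[[Record], Record]


def double_model(record: Record) -> Record:
    """Hypothetical stand-in model: doubles the 'x' field."""
    return {"y": record["x"] * 2}


def run_on_demand(model: Model, record: Record) -> Record:
    """On-Demand: one user-provided input, one result returned to the caller."""
    return model(record)


def run_continuous(model: Model, stream: Iterable[Record],
                   output_table: List[Record]) -> None:
    """Continuous: infer on each record as it arrives; insert into an output table."""
    for record in stream:
        output_table.append(model(record))


def run_batch(model: Model, table: List[Record]) -> List[Record]:
    """Batch: infer on every record of an existing table in one pass."""
    return [model(record) for record in table]


print(run_on_demand(double_model, {"x": 2}))

output_table: List[Record] = []
run_continuous(double_model, iter([{"x": 1}, {"x": 3}]), output_table)
print(output_table)

print(run_batch(double_model, [{"x": 5}, {"x": 6}]))
```

The same model function serves all three modes; only the surrounding driver changes, which matches the idea that the deployment type, not the model, determines how inference is triggered.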

Audits

An audit is a review of a model deployment that helps ensure its training, testing, and inferencing have not been tampered with. An audit enables drilling into specific inferences from a deployment and filtering those inferences by input parameter, process status, and more.