Note

This documentation is for a prior release of Kinetica.

/aggregate/kmeans

URL: http://<db.host>:<db.port>/aggregate/kmeans

This endpoint runs the k-means algorithm, a heuristic that attempts to perform k-means clustering. An ideal k-means clustering algorithm selects k points such that the sum of the mean squared distances of each member of the set to the nearest of the k points is minimized. The k-means algorithm, however, does not necessarily produce such an ideal clustering. It begins with a randomly selected set of k points, refines the locations of the points iteratively, and settles into a local minimum. Various parameters and options are provided to control the heuristic search.
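The iterative refinement described above can be sketched in plain Python. This is only an illustration of the general technique (Lloyd-style k-means with random restarts), not Kinetica's GPU implementation. The function name, the `seed` argument, the use of the largest centroid shift as the stopping test, and the use of the summed squared distance as the score are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, tolerance=1e-4, max_iters=10, num_tries=1, seed=0):
    """Illustrative Lloyd-style k-means; not Kinetica's implementation."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_tries):
        # Start from k randomly selected input points.
        means = [list(p) for p in rng.sample(points, k)]
        for _ in range(max_iters):
            # Assignment step: group each point with its nearest mean.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: math.dist(p, means[i]))
                clusters[nearest].append(p)
            # Update step: move each mean to the centroid of its cluster
            # (an empty cluster keeps its previous mean).
            new_means = [
                [sum(col) / len(c) for col in zip(*c)] if c else m
                for c, m in zip(clusters, means)
            ]
            # Assumed stopping test: largest movement of any mean.
            shift = max(math.dist(a, b) for a, b in zip(means, new_means))
            means = new_means
            if shift < tolerance:
                break
        # Score this try by the summed squared distance to the nearest mean.
        cost = sum(min(math.dist(p, m) for m in means) ** 2 for p in points)
        if best is None or cost < best[0]:
            best = (cost, means)
    return best[1]


# Two well-separated groups of three points each (made-up data).
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = sorted(kmeans(data, k=2, num_tries=5))
print(centers)
```

Running several tries and keeping the lowest-cost result reduces the chance of settling into a poor local minimum from a single random start.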

NOTE: The Kinetica instance being accessed must be running a CUDA (GPU-based) build to service this request.

Input Parameter Description

Name

Type

Description

table_name

string

Name of the table on which the operation will be performed. Must be an existing table, in [schema_name.]table_name format, using standard name resolution rules.

column_names

array of strings

List of column names on which the operation will be performed. If n columns are provided, then each of the k result points will have n dimensions corresponding to the n columns.

k

int

The number of mean points to be determined by the algorithm.

tolerance

double

Stop iterating when the distances between successive points are less than the given tolerance.

options

map of string to strings

Optional parameters. The default value is an empty map ( {} ).

Supported Parameters (keys)	Parameter Description
whiten	When set to 1, each of the columns is first normalized by its standard deviation. The default is not to whiten.
max_iters	Number of times to try to hit the tolerance limit before giving up. The default is 10.
num_tries	Number of times to run the k-means algorithm, each with a different randomly selected set of starting points; this helps avoid local minima. The default is 1.
create_temp_table	If true, a unique temporary table name will be generated in the sys_temp schema and used in place of result_table. If result_table_persist is false (or unspecified), then this is always allowed even if the caller does not have permission to create tables. The generated name is returned in qualified_result_table_name. The default value is false. The supported values are: true, false.
result_table	The name of a table used to store the results, in [schema_name.]table_name format, using standard name resolution rules and meeting table naming criteria. If this option is specified, the results are not returned in the response.
result_table_persist	If true, then the result table specified in result_table will be persisted and will not expire unless a ttl is specified. If false, then the result table will be an in-memory table and will expire unless a ttl is specified otherwise. The default value is false. The supported values are: true, false.
ttl	Sets the TTL of the table specified in result_table.
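As a rough illustration of how the input parameters fit together, the sketch below assembles them into a JSON request body. It assumes the endpoint accepts a JSON object keyed by the parameter names listed above and that the cluster count is sent under the key `k`. The helper name `build_kmeans_request` and the table and column names are hypothetical. No request is actually sent.

```python
import json


def build_kmeans_request(table_name, column_names, k, tolerance, options=None):
    """Assemble the /aggregate/kmeans input parameters as a JSON string."""
    # options is a map of string to strings, so every value is stringified;
    # booleans become the lowercase "true"/"false" the options expect.
    string_options = {
        key: str(value).lower() if isinstance(value, bool) else str(value)
        for key, value in (options or {}).items()
    }
    payload = {
        "table_name": table_name,
        "column_names": list(column_names),
        "k": int(k),  # assumed key name for the number of mean points
        "tolerance": float(tolerance),
        "options": string_options,
    }
    return json.dumps(payload)


# Hypothetical table and columns, used only for illustration.
body = build_kmeans_request(
    "example_schema.trips",
    ["pickup_x", "pickup_y"],
    k=5,
    tolerance=0.001,
    options={"whiten": 1, "max_iters": 20, "num_tries": 3},
)
print(body)
```

Converting option values to strings in one place avoids sending a numeric `max_iters` or a Python-style `True` where the options map expects strings.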

Output Parameter Description

The GPUdb server embeds the endpoint response inside a standard response structure which contains status information and the actual response to the query. Here is a description of the various fields of the wrapper:

Name

Type

Description

status

String

'OK' or 'ERROR'

message

String

Empty if success or an error message

data_type

String

'aggregate_k_means_response' or 'none' in case of an error

data

String

Empty string

data_str

JSON or String

This embedded JSON represents the result of the /aggregate/kmeans endpoint:

Name

Type

Description

means

array of arrays of doubles

The k-mean values found.

counts

array of longs

The number of elements in the cluster closest to the corresponding k-means values.

rms_dists

array of doubles

The root mean squared distance of the elements in the cluster for each of the k-means values.

count

long

The total count across all the clusters; this will be the size of the input table.

rms_dist

double

The sum of all the rms_dists; this is the value the k-means algorithm is attempting to minimize.

tolerance

double

The distance between the last two iterations of the algorithm before it quit.

num_iters

int

The number of iterations the algorithm executed before it quit.

info

map of string to strings

Additional information. The default value is an empty map ( {} ).

Possible Parameters (keys)	Parameter Description
qualified_result_table_name	The fully qualified name of the result table (i.e. including the schema) used to store the results.

Empty string in case of an error.
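A minimal sketch of unwrapping the response, assuming the wrapper arrives as JSON text with the fields described above. The helper name `parse_kmeans_response` is hypothetical, and the sample response is made up purely to exercise the parser; it is not real server output.

```python
import json


def parse_kmeans_response(raw):
    """Unwrap the standard response and return the embedded k-means result."""
    wrapper = json.loads(raw)
    if wrapper["status"] != "OK":
        # On error, message carries the error text and data_str is empty.
        raise RuntimeError(wrapper["message"])
    data_str = wrapper["data_str"]
    # data_str is documented as "JSON or String", so accept both forms.
    return json.loads(data_str) if isinstance(data_str, str) else data_str


# Made-up response for illustration only; the values are not real output.
sample = json.dumps({
    "status": "OK",
    "message": "",
    "data_type": "aggregate_k_means_response",
    "data": "",
    "data_str": json.dumps({
        "means": [[0.0, 0.5], [10.0, 10.5]],
        "counts": [2, 2],
        "rms_dists": [0.5, 0.5],
        "count": 4,
        "rms_dist": 1.0,
        "tolerance": 0.0,
        "num_iters": 2,
        "info": {},
    }),
})
result = parse_kmeans_response(sample)
# Pair each mean point with the number of elements in its cluster.
for mean, n in zip(result["means"], result["counts"]):
    print(mean, n)
```

Checking `status` before touching `data_str` keeps the error path simple, since `data_str` is an empty string when the request fails.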