GPUdb.aggregate_k_means( table_name = None, column_names = None, k = None, tolerance = None, options = {} )
This endpoint runs the k-means algorithm, a heuristic that attempts to perform k-means clustering. An ideal k-means clustering algorithm selects k points such that the sum of the mean squared distances of each member of the set to the nearest of the k points is minimized. The k-means algorithm, however, does not necessarily produce such an ideal clustering. It begins with a randomly selected set of k points, then iteratively refines their locations until it settles at a local minimum. Various parameters and options are provided to control the heuristic search.
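The iterative refinement described above can be illustrated with a minimal, self-contained sketch. This is a textbook Lloyd-style loop written for illustration only; it is not the endpoint's implementation. The function name `k_means`, the `max_iters` and `seed` arguments, and the choice of "largest movement of any mean" as the stopping measure are all assumptions.

```python
import math
import random


def k_means(points, k, tolerance, max_iters=100, seed=None):
    """Illustrative Lloyd-style k-means; NOT the GPUdb implementation."""
    rng = random.Random(seed)
    # Begin with k randomly selected input points as the initial means.
    means = [list(p) for p in rng.sample(points, k)]
    shift = float("inf")
    num_iters = 0
    while num_iters < max_iters and shift >= tolerance:
        # Assignment step: group each point with its nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, means[i]))
            clusters[nearest].append(p)
        # Update step: move each mean to the centroid of its cluster.
        new_means = []
        for mean, members in zip(means, clusters):
            if members:
                new_means.append([sum(d) / len(members) for d in zip(*members)])
            else:
                new_means.append(mean)  # an empty cluster keeps its mean
        # Assumed stopping measure: largest movement of any mean.
        shift = max(math.dist(a, b) for a, b in zip(means, new_means))
        means = new_means
        num_iters += 1
    return means, shift, num_iters


# Two well-separated pairs of 2-D points (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
means, shift, num_iters = k_means(points, k=2, tolerance=1e-6, seed=0)
print(sorted(means), shift, num_iters)
```

Because the starting points are random, different seeds can settle at different local minima on less cleanly separated data, which is why the result is a heuristic rather than a guaranteed optimum.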
Input Parameter Description
Name | Type | Description |
---|---|---|
table_name | str | Name of the table on which the operation will be performed. Must be an existing table or collection. |
column_names | list of str | List of column names on which the operation will be performed. If n columns are provided, each of the k result points will have n dimensions corresponding to the n columns. |
k | int | The number of mean points to be determined by the algorithm. |
tolerance | float | Stop iterating when the distance between successive points is less than the given tolerance. |
options | dict of str | Optional parameters. Default value is an empty dict ( {} ). |
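As a hedged illustration of how the input parameters fit together, the sketch below wraps the call in a small helper. The helper name `run_k_means` is hypothetical, `db` is assumed to be an already-connected GPUdb handle, and the table and column names in the usage comment are invented for the example.

```python
def run_k_means(db, table_name, column_names, k, tolerance):
    """Hypothetical helper: forward the documented inputs to the endpoint.

    `db` is assumed to be a connected GPUdb handle; this function only
    passes the parameters through using the keyword names listed above.
    """
    return db.aggregate_k_means(
        table_name=table_name,
        column_names=column_names,  # n columns give n-dimensional means
        k=k,                        # number of mean points to find
        tolerance=tolerance,        # stop once successive points move less than this
        options={},                 # documented default: empty dict
    )


# Usage (hypothetical names; requires a connected handle named `db`):
# response = run_k_means(db, "taxi_trips", ["pickup_x", "pickup_y"], k=4, tolerance=0.001)
```

With two columns and k = 4, the request asks for four mean points, each with two dimensions.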
Output Parameter Description
Name | Type | Description |
---|---|---|
means | list of lists of floats | The k-mean values found. |
counts | list of longs | The number of elements in the cluster closest to the corresponding k-means values. |
rms_dists | list of floats | The root mean squared distance of the elements in the cluster for each of the k-means values. |
count | long | The total number of elements across all the clusters; this equals the size of the input table. |
rms_dist | float | The sum of all the rms_dists - the value the k-means algorithm is attempting to minimize. |
tolerance | float | The distance between the points of the last two iterations of the algorithm before it quit. |
num_iters | int | The number of iterations the algorithm executed before it quit. |
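The per-cluster outputs are parallel lists, so they can be zipped together for reporting. The sketch below assumes the response is available as a dict keyed by the output parameter names above; the helper names and the sample values are invented for illustration and are not real endpoint output.

```python
import math


def summarize_clusters(response):
    """Pair each mean with its count and RMS distance, largest cluster first."""
    rows = [
        {"mean": mean, "count": count, "rms_dist": rms}
        for mean, count, rms in zip(
            response["means"], response["counts"], response["rms_dists"]
        )
    ]
    rows.sort(key=lambda row: row["count"], reverse=True)
    return rows


def totals_match(response, rel_tol=1e-6):
    """Check the totals against the per-cluster values, per the table above."""
    return sum(response["counts"]) == response["count"] and math.isclose(
        sum(response["rms_dists"]), response["rms_dist"], rel_tol=rel_tol
    )


# Made-up response shaped like the output table (assumed to be a dict).
example = {
    "means": [[0.0, 0.5], [10.0, 10.5]],
    "counts": [3, 5],
    "rms_dists": [0.5, 0.75],
    "count": 8,
    "rms_dist": 1.25,
    "tolerance": 0.0,
    "num_iters": 2,
}

for row in summarize_clusters(example):
    print(row)
print(totals_match(example))
```

The relative tolerance in `totals_match` is an assumption that allows for floating-point rounding when the per-cluster distances are summed.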