
kmeansCluster(nClusters,seed,values)

Definition

Comparable kmeansCluster(Integer nClusters, Long seed, Object ... values)

Description

Clusters records using the k-means clustering algorithm. Takes an Integer nClusters specifying the number of clusters and a Long seed specifying the random seed for initializing the model parameters, followed by one or more expressions to cluster on. The seed argument may be set to null if no specific seed is desired. Input expressions must evaluate to numbers. Returns a column of integers labeling each record according to its learned cluster.
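To make the roles of nClusters, seed, and values concrete, here is a minimal Python sketch of k-means (Lloyd's algorithm). It is an illustration only, not the implementation behind kmeansCluster. The function name kmeans_labels, the max_iter parameter, the choice to initialize centroids from randomly picked input rows, and the handling of a null seed (falling back to Python's default entropy source) are all assumptions made for this sketch.

```python
import random


def kmeans_labels(rows, n_clusters, seed=None, max_iter=100):
    """Return one cluster label (0 .. n_clusters-1) per row.

    rows is a list of equal-length tuples of numbers. Hypothetical helper:
    it mirrors the argument order of kmeansCluster but is not its implementation.
    """
    # A fixed seed makes the initialization, and so the labels, repeatable.
    # seed=None lets Python choose its own seed (an assumption of this sketch).
    rng = random.Random(seed)

    # Assumed initialization: pick n_clusters distinct rows as starting centroids.
    centroids = [list(r) for r in rng.sample(rows, n_clusters)]
    labels = [None] * len(rows)

    for _ in range(max_iter):
        # Assignment step: each row goes to the centroid with the smallest
        # squared Euclidean distance.
        new_labels = [
            min(
                range(n_clusters),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(row, centroids[k])),
            )
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels

        # Update step: move each centroid to the mean of its assigned rows.
        for k in range(n_clusters):
            members = [row for row, lab in zip(rows, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]

    return labels


if __name__ == "__main__":
    # Two numeric columns per record, as in a call with two value expressions.
    data = [(4583.0, 128.0), (3000.0, 66.0), (2583.0, 120.0),
            (6000.0, 141.0), (2333.0, 95.0)]
    print(kmeans_labels(data, 4, seed=1000))
```

The label numbers themselves carry no meaning beyond group membership: a different seed can produce the same grouping under different numbers, or a different grouping altogether. Note also that this sketch does not rescale columns, so a column with a much larger range dominates the distance; whether kmeansCluster normalizes its inputs is not stated here.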

Parameter Definition

Name       Type       Description
nClusters  Integer    the number of clusters
seed       Long       the seed for initializing the model parameters; can be null
values     Object...  the numeric columns (expressions) to cluster on

Example 1

CREATE TABLE input(clientGroup String, clientId int, loanAmount double, applicantIncome double);
INSERT INTO input VALUES ("Client Group 1", 0, 4583, 128),("Client Group 1", 1, 3000, 66),("Client Group 1", 2, 2583, 120),("Client Group 1", 3, 6000, 141),("Client Group 1", 4, 2333, 95);
INSERT INTO input VALUES ("Client Group 2", 0, 2571, 28),("Client Group 2", 1, 6000, 16),("Client Group 2", 2, 1583, 1200),("Client Group 2", 3, 4500, 1410),("Client Group 2", 4, 1133, 5);
CREATE TABLE result AS PREPARE *, kmeansCluster(4, 1000L, loanAmount, applicantIncome) as cluster FROM input PARTITION BY clientGroup;
Table input = SELECT * FROM input;
Table result = SELECT * FROM result;

// input = 
// +--------------------------------------------------+
// |                      input                       |
// +--------------+--------+----------+---------------+
// |clientGroup   |clientId|loanAmount|applicantIncome|
// |String        |Integer |Double    |Double         |
// +--------------+--------+----------+---------------+
// |Client Group 1|0       |4583.0    |128.0          |
// |Client Group 1|1       |3000.0    |66.0           |
// |Client Group 1|2       |2583.0    |120.0          |
// |Client Group 1|3       |6000.0    |141.0          |
// |Client Group 1|4       |2333.0    |95.0           |
// |Client Group 2|0       |2571.0    |28.0           |
// |Client Group 2|1       |6000.0    |16.0           |
// |Client Group 2|2       |1583.0    |1200.0         |
// |Client Group 2|3       |4500.0    |1410.0         |
// |Client Group 2|4       |1133.0    |5.0            |
// +--------------+--------+----------+---------------+
// 
// result = 
// +----------------------------------------------------------+
// |                          result                          |
// +--------------+--------+----------+---------------+-------+
// |clientGroup   |clientId|loanAmount|applicantIncome|cluster|
// |String        |Integer |Double    |Double         |Integer|
// +--------------+--------+----------+---------------+-------+
// |Client Group 1|0       |4583.0    |128.0          |1      |
// |Client Group 1|1       |3000.0    |66.0           |2      |
// |Client Group 1|2       |2583.0    |120.0          |0      |
// |Client Group 1|3       |6000.0    |141.0          |1      |
// |Client Group 1|4       |2333.0    |95.0           |3      |
// |Client Group 2|0       |2571.0    |28.0           |1      |
// |Client Group 2|1       |6000.0    |16.0           |2      |
// |Client Group 2|2       |1583.0    |1200.0         |0      |
// |Client Group 2|3       |4500.0    |1410.0         |2      |
// |Client Group 2|4       |1133.0    |5.0            |3      |
// +--------------+--------+----------+---------------+-------+
//