Prateek Tuli

K-Means Clustering Algorithm through Apache Spark

How do you go about classifying data when there are no labels associated with it?

Consider a Social Network Analysis problem. Let’s look at a group of people on a social network and gather data about-

  1. Who are the people exchanging messages back and forth?
  2. Which people post regularly in certain kinds of groups?

Now, coming up with an analysis that groups certain kinds of people together without labeling or naming them as any group is the kind of Machine Learning problem termed Unsupervised Learning. K-Means is by far the most popular and widely used algorithm in Unsupervised Learning.


How does it work?

Let me demonstrate with an example given by Andrew Ng in his Machine Learning Algorithm tutorials. Consider these unlabeled data points-


Step 1-

The first step is to randomly initialize the cluster centroids. In this example, I am taking the number of centroids as 2, but this number can vary depending on how many unlabeled groups you want to divide your data into.

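To make this concrete, here is a minimal plain-Java sketch of the initialization step for one-dimensional data. The data values and k = 2 are illustrative assumptions, not part of Spark; a common way to seed K-Means is to pick random data points as the initial centroids.

import java.util.Arrays;
import java.util.Random;

public class InitializeCentroids {
    public static void main(String[] args) {
        // Illustrative one-dimensional data points (e.g., students' marks).
        double[] data = {54, 86, 61, 84, 74, 77, 29, 72};
        int k = 2; // number of cluster centroids

        // Seed each centroid at a randomly chosen data point.
        Random random = new Random();
        double[] centroids = new double[k];
        for (int j = 0; j < k; j++) {
            centroids[j] = data[random.nextInt(data.length)];
        }
        System.out.println("Initial centroids: " + Arrays.toString(centroids));
    }
}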

Step 2-

The second step is the Cluster Assignment step. We go through each and every data point and, depending on whether it is closer to the Red or the Blue cluster centroid, color it Red or Blue.

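In code, the assignment step boils down to finding, for each point, the index of its nearest centroid. Here is a minimal sketch for one-dimensional data; the data and centroid values are assumed for illustration.

public class ClusterAssignment {
    // Returns the index of the centroid nearest to point x,
    // using the squared distance (the square root is unnecessary for comparison).
    static int nearestCentroid(double x, double[] centroids) {
        int nearest = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int j = 0; j < centroids.length; j++) {
            double distance = (x - centroids[j]) * (x - centroids[j]);
            if (distance < bestDistance) {
                bestDistance = distance;
                nearest = j;
            }
        }
        return nearest;
    }

    public static void main(String[] args) {
        double[] data = {54, 86, 61, 84, 74, 77, 29, 72};
        double[] centroids = {40.0, 75.0}; // e.g., the Red and Blue centroids
        for (double x : data) {
            System.out.println(x + " -> cluster " + nearestCentroid(x, centroids));
        }
    }
}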

Step 3-

The third step is the Move Centroid step. Each cluster centroid is moved to the mean of the data points of its color: the Blue centroid moves to the mean of all the Blue data points, and similarly for Red.

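The move step is just a per-cluster average. Continuing the same one-dimensional sketch, with an assignments array assumed to come from the previous step:

import java.util.Arrays;

public class MoveCentroids {
    // Moves each centroid to the mean of the points assigned to it.
    static double[] move(double[] data, int[] assignments, int k) {
        double[] sums = new double[k];
        int[] counts = new int[k];
        for (int i = 0; i < data.length; i++) {
            sums[assignments[i]] += data[i];
            counts[assignments[i]]++;
        }
        double[] centroids = new double[k];
        for (int j = 0; j < k; j++) {
            // An empty cluster has no mean; a real implementation
            // would typically re-seed it with a random data point.
            centroids[j] = counts[j] > 0 ? sums[j] / counts[j] : Double.NaN;
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] data = {54, 86, 61, 84, 74, 77, 29, 72};
        int[] assignments = {0, 1, 0, 1, 1, 1, 0, 1}; // from the assignment step
        System.out.println(Arrays.toString(move(data, assignments, 2)));
    }
}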

Steps two and three are repeated until the cluster centroids no longer move, as in the sketch below.

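Putting the two steps together for k = 2, the whole loop looks roughly like this. This is a self-contained sketch for one-dimensional data with assumed starting centroids; Spark’s MLlib, used below, handles all of this internally.

public class KMeansLoop {
    public static void main(String[] args) {
        double[] data = {54, 86, 61, 84, 74, 77, 29, 72};
        double[] centroids = {40.0, 75.0}; // assumed initial centroids
        int[] assignments = new int[data.length];

        boolean moved = true;
        while (moved) {
            // Cluster Assignment step: attach each point to its nearest centroid.
            for (int i = 0; i < data.length; i++) {
                double d0 = (data[i] - centroids[0]) * (data[i] - centroids[0]);
                double d1 = (data[i] - centroids[1]) * (data[i] - centroids[1]);
                assignments[i] = d0 <= d1 ? 0 : 1;
            }
            // Move Centroid step: recompute each centroid as the mean of its points.
            moved = false;
            for (int j = 0; j < centroids.length; j++) {
                double sum = 0;
                int count = 0;
                for (int i = 0; i < data.length; i++) {
                    if (assignments[i] == j) { sum += data[i]; count++; }
                }
                double updated = count > 0 ? sum / count : centroids[j];
                if (Math.abs(updated - centroids[j]) > 1e-9) {
                    moved = true;
                }
                centroids[j] = updated;
            }
        }
        System.out.println("Final centroids: " + centroids[0] + ", " + centroids[1]);
    }
}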

What is the Cost Function?

In this case, the Cost Function is the total sum of the squared distances from every point to its corresponding cluster centroid.
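Written out in the notation of Andrew Ng’s lectures, with x^(i) the i-th data point, c^(i) the index of the cluster it is assigned to, and mu_{c^(i)} the corresponding centroid, the cost is:

J\left(c^{(1)}, \ldots, c^{(m)}, \mu_{1}, \ldots, \mu_{K}\right) = \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^{2}

The plain sum shown here is what Spark reports as the Within Set Sum of Squared Errors; Ng’s lectures additionally divide by the number of points m, which does not change the minimizer.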


The objective of K-Means is to minimize the Cost Function.

A Proof of Concept was done to demonstrate the accuracy of this algorithm by supplying a training data set to the MLlib library of Apache Spark. Please refer to this for setting up Apache Spark on your machine. Marks of first-semester students were collected from an Engineering Institute.
 

Java program to supply the students’ marks as training data.

package com.apache.spark.kmeans;

import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.SparkConf;

public class App
{
    public static void main( String[] args )
    {
        SparkConf conf = new SparkConf().setAppName("K-means Example");
        JavaSparkContext context = new JavaSparkContext(conf);

        // Load the marks file; each line holds one student's marks.
        String path = "/home/prateek/Desktop/NSIT_marks.txt";
        JavaRDD<String> data = context.textFile(path);

        // Parse each line into a dense vector of doubles for MLlib.
        JavaRDD<Vector> parsedData = data.map(
          new Function<String, Vector>() {
            public Vector call(String s) {
              String[] sarray = s.split(" ");
              double[] values = new double[sarray.length];
              for (int i = 0; i < sarray.length; i++) {
                  values[i] = Double.parseDouble(sarray[i]);
              }
              return Vectors.dense(values);
            }
          }
        );
        parsedData.cache();

        // Cluster the data into two groups, matching the two centroids
        // shown in the output below.
        int numberOfClusters = 2;
        int numberOfIterations = 20;
        KMeansModel clusters = KMeans.train(parsedData.rdd(), numberOfClusters, numberOfIterations);

        // Evaluate the Cost Function for the trained model.
        double WSSSE = clusters.computeCost(parsedData.rdd());
        System.out.println("Within Set Sum of Squared Errors = " + WSSSE);

        System.out.println("Cluster centers:");
        for (Vector center : clusters.clusterCenters()) {
          System.out.println(" " + center);
        }

        context.stop();
    }
}

 

Input Command

root@prateek-HP-250-G1-Notebook-PC:/home/prateek# bin/spark-submit --class "com.apache.spark.kmeans.App" --master local[4] /home/prateek/Documents/work/caarmo/Apache-SparkWorkspace/kmeans/target/kmeans.jar

Main Class – com.apache.spark.kmeans.App

Executable Jar to run on Apache Spark – /home/prateek/Documents/work/caarmo/Apache-SparkWorkspace/kmeans/target/kmeans.jar

Sample Input

54
86
61
84
74
77
29
72
71
50
40
52
52
55
84
71
81
27
77
40
30
31
69
45
40
89
68
43
80
52
72
69
60
69

 

Output

Within Set Sum of Squared Errors = 2521.96491228072
Cluster centers:
 [74.42105263157895]
 [42.666666666666664]

 
Within Set Sum of Squared Errors is the value of the Cost Function for the given training data.

Cluster centers lists the two cluster centroids derived from the data.
 
References: Andrew Ng’s Machine Learning lectures (Coursera), https://class.coursera.org/ml-005/lecture