How do you go about classifying data when there are no labels associated with it?
Consider a Social Network Analysis problem. Let's look at a group of people on a social network and gather data about-
- Which people are exchanging messages back and forth?
- Which people post regularly in certain kinds of groups?
Now, analyzing this data and grouping certain kinds of people together without labeling or naming the groups in advance is a Machine Learning problem termed Unsupervised Learning. K-Means is by far the most popular and widely used algorithm in Unsupervised Learning.
How does it work?
Let me demonstrate with an example given by Andrew Ng in his Machine Learning course lectures. Consider these unlabeled data points-
Step 1-
The first step is to randomly initialize the cluster centroids. In this example, I am taking the number of centroids as 2, but this number depends on how many groups you want to divide your data into.
Step 2-
The second step is the Cluster Assignment step. We go through every data point and, depending on whether it is closer to the Red or the Blue cluster centroid, color it Red or Blue.
Step 3-
The third step is the Move Centroid step. Each cluster centroid is moved to the mean of the data points currently assigned to it: the Blue centroid moves to the mean of all the Blue data points, and similarly for Red.
Steps two and three are repeated until the cluster centroids no longer move (a small code sketch of this loop is given below).
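To make the loop concrete, here is a minimal, self-contained sketch of the cluster-assignment and move-centroid steps for one-dimensional data with two centroids. This is not the Spark program used later in the post; the data values and starting centroids are made up purely for illustration.

import java.util.Arrays;

public class KMeansSketch {
    public static void main(String[] args) {
        double[] points = {54, 86, 61, 84, 74, 29, 40, 52, 81, 27}; // made-up 1-D data
        double[] centroids = {30.0, 80.0};                          // Step 1: randomly initialized centroids
        int[] assignment = new int[points.length];

        for (int iter = 0; iter < 20; iter++) {
            // Step 2: cluster assignment - each point goes to its nearest centroid
            for (int i = 0; i < points.length; i++) {
                assignment[i] =
                    Math.abs(points[i] - centroids[0]) <= Math.abs(points[i] - centroids[1]) ? 0 : 1;
            }

            // Step 3: move centroid - each centroid becomes the mean of its assigned points
            double[] sum = new double[2];
            int[] count = new int[2];
            for (int i = 0; i < points.length; i++) {
                sum[assignment[i]] += points[i];
                count[assignment[i]]++;
            }
            boolean moved = false;
            for (int k = 0; k < 2; k++) {
                if (count[k] == 0) continue;          // leave an empty centroid where it is
                double updated = sum[k] / count[k];
                if (updated != centroids[k]) moved = true;
                centroids[k] = updated;
            }
            if (!moved) break;                        // centroids stopped moving: converged
        }
        System.out.println("Centroids: " + Arrays.toString(centroids));
    }
}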
What is the Cost Function?
In this case, the cost function is the sum of the squared distances from every point to its corresponding cluster centroid.
The objective of K-Means is to minimize this cost function.
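Written out in the notation of Andrew Ng's lectures, where $x^{(i)}$ is the $i$-th training example, $c^{(i)}$ is the index of the cluster it is currently assigned to, and $\mu_{c^{(i)}}$ is the centroid of that cluster, the quantity being minimized over the assignments and the centroids is

$$J\left(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K\right) = \frac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)} - \mu_{c^{(i)}}\right\rVert^2$$

Ng's lectures average over the $m$ examples as above; Spark's computeCost, used later in the post, reports the plain sum of squared distances instead, which differs only by the constant factor $1/m$ and so has the same minimizer.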
A Proof of Concept was done to demonstrate this algorithm by supplying a training data set to the MLlib library of Apache Spark. Please refer to this for setting up Apache Spark on your machine. Marks of first-semester students were collected from an Engineering Institute.
Java program that supplies the students' marks as training data.
package com.apache.spark.kmeans;

import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.SparkConf;

public class App {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("K-means Example");
        JavaSparkContext context = new JavaSparkContext(conf);

        // Load the marks file; each line holds the marks for one data point
        String path = "/home/prateek/Desktop/NSIT_marks.txt";
        JavaRDD<String> data = context.textFile(path);

        // Parse each line into a dense vector of doubles
        JavaRDD<Vector> parsedData = data.map(
            new Function<String, Vector>() {
                public Vector call(String s) {
                    String[] sarray = s.split(" ");
                    double[] values = new double[sarray.length];
                    for (int i = 0; i < sarray.length; i++) {
                        values[i] = Double.parseDouble(sarray[i]);
                    }
                    return Vectors.dense(values);
                }
            }
        );
        parsedData.cache();

        // Train the K-Means model
        int numberOfClusters = 4;
        int numberOfIterations = 20;
        KMeansModel clusters = KMeans.train(parsedData.rdd(), numberOfClusters, numberOfIterations);

        // Evaluate the clustering by computing the Within Set Sum of Squared Errors
        double WSSSE = clusters.computeCost(parsedData.rdd());
        System.out.println("Within Set Sum of Squared Errors = " + WSSSE);
        System.out.println("Cluster centers:");
        for (Vector center : clusters.clusterCenters()) {
            System.out.println(" " + center);
        }

        context.stop();
    }
}
Input Command
root@prateek-HP-250-G1-Notebook-PC:/home/prateek# bin/spark-submit --class "com.apache.spark.kmeans.App" --master local[4] /home/prateek/Documents/work/caarmo/Apache-SparkWorkspace/kmeans/target/kmeans.jar
Main Class – com.apache.spark.kmeans.App
Executable Jar to run on Apache Spark – /home/prateek/Documents/work/caarmo/Apache-SparkWorkspace/kmeans/target/kmeans.jar
Sample Input
54
86
61
84
74
77
29
72
71
50
40
52
52
55
84
71
81
27
77
40
30
31
69
45
40
89
68
43
80
52
72
69
60
69
Output
Within Set Sum of Squared Errors = 2521.96491228072
Cluster centers:
[74.42105263157895]
[42.666666666666664]
Within Set Sum of Squared Errors gives the value of the cost function for the given training data.
Cluster centers lists the two cluster centroids derived from the data.
References: https://class.coursera.org/ml-005/lecture