Motivation
There are many different kinds of machine learning algorithms, such as classification, regression, and ranking algorithms. They learn from historical data and build models that predict future outcomes. Which model performs better can be determined with different evaluation metrics.
One such metric is accuracy, which can be defined as
Accuracy = Number of correctly predicted test cases / Total number of test cases
An accuracy of 0.8 means 80% of the test cases are predicted correctly.
So, by evaluating each model with this metric, we can figure out which one performs better.
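As a quick sketch, accuracy can be computed directly from actual and predicted labels; the two lists below are made-up values, not output from any real model.

```python
# A minimal sketch: computing accuracy from actual vs. predicted labels.
# The two lists are made-up example data.
actual    = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 1, 1, 0]

correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)
print(f"Accuracy = {correct}/{len(actual)} = {accuracy:.2f}")  # 7/10 = 0.70
```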
This metric works well if we are interested in the Positive data points and the data set itself contains mostly positive data points. If, however, we are interested in the rare Negative data points in such a data set, accuracy tells us very little: a model can achieve high accuracy while misclassifying most of the negatives we care about.
That is why methods like Precision and Recall are used to evaluate classification models.
Classification Evaluation Metrics
For classification models, each test data point falls into one of four possibilities, depending on its actual label and the model's prediction:
- True Positive (TP) – Actually True, Predicted True
- True Negative (TN) – Actually False, Predicted False
- False Positive (FP) – Actually False, Predicted True
- False Negative (FN) – Actually True, Predicted False
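These four counts can be tallied directly from the actual and predicted labels. A minimal sketch, where the label lists are again hypothetical:

```python
# Tallying TP, TN, FP, FN from actual vs. predicted labels.
# 1 = Positive, 0 = Negative; the label lists are made-up data.
actual    = [1, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # True Positives
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # True Negatives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # False Positives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # False Negatives
print(tp, tn, fp, fn)  # 3 3 1 1
```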
Binary Classification Evaluation
Binary classification has only two outcomes for each data point: True (Positive) or False (Negative). Precision and recall can be defined as follows:
Precision (Positive Predictive Value) = TP / (TP + FP)
Recall (True Positive Rate) = TP / P = TP / (TP + FN)
Example
Suppose a binary classification model predicts whether an email is spam or not. It predicts 15 messages as spam from a data set containing 25 spam and 12 non-spam messages. If 10 of the model's spam predictions are actually spam, then
Precision = 10 / (10 + 5 ) = 2/3
Recall = 10 / (10 + (25-10)) = 10/25 = 2/5
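The same numbers can be checked in a few lines of code, using the counts implied by the example above:

```python
# Checking the spam example: 15 messages predicted as spam, 10 of them actually spam,
# and the data set contains 25 spam messages in total.
tp = 10          # predicted spam and actually spam
fp = 15 - 10     # predicted spam but actually not spam
fn = 25 - 10     # actual spam messages the model missed

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
print(precision, recall)  # 0.666... and 0.4
```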
Interpretation of Precision & Recall
Precision is the percentage of selected items that are correct (Positive). It is the ratio of correct positive (TP) predictions over total positive predictions (TP + FP).
P = TP / (TP + FP)
Recall is the percentage of correct (Positive) items that are selected. It is the ratio of correct positive (TP) predictions over the total number of actual positives (TP + FN).
R = TP / (TP + FN)
In the above example
P = 2/3 ~ 66.7 %
R = 2/5 ~ 40 %
It means that at 40% recall we have 66.7% precision (correctness). These two measures are usually in tension: optimizing one tends to come at the cost of the other. If we want high precision, we typically have to accept lower recall, because the easiest way to raise precision is to select only the small set of data points the model is most confident about.
One more observation: if we build a trivial classifier that predicts every data point as Positive, then FN = 0 and we get 100% recall. Suppose our (spam, not-spam) classification task labels spam as Positive and not-spam as Negative, and the data set contains mostly non-spam messages. Because everything is predicted as spam, the false positives (FP) will be very high and precision will be very low.
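Here is a sketch of that trivial "predict everything as spam" classifier on a made-up, mostly non-spam data set:

```python
# A classifier that labels every message as spam (Positive) on a mostly non-spam set.
# The data set is made up: 2 spam (1) and 8 non-spam (0) messages.
actual    = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [1] * len(actual)               # everything predicted as spam

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

print(tp / (tp + fn))  # recall = 1.0, since FN = 0
print(tp / (tp + fp))  # precision = 0.2, dragged down by the many false positives
```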
Now we know that we cannot rely on only one of the measures above; we need a metric that combines the two. Candidates are the arithmetic mean, the geometric mean, and the harmonic mean.
The harmonic mean gives a balance between precision and recall. A weighted harmonic mean can also be introduced to give more weight to the measure we are more interested in.
A weighted harmonic mean can be defined as

F = 1 / ( α × (1/P) + (1 − α) × (1/R) )

where α is the weight. If we use α = 1/2, i.e. equal weight for both P and R, it becomes the simple harmonic mean of P and R.

With β² = (1 − α) / α, the weighted harmonic mean formula can be written as

Fβ = (1 + β²) × P × R / (β² × P + R)

When β = 1, we get F1 = 2 × P × R / (P + R), which is the simple harmonic mean of P and R.
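As a sketch, the weighted F-measure can be computed from the precision and recall of the spam example above; the β values below are just illustrative choices:

```python
# F-beta from precision and recall; beta > 1 weights recall more, beta < 1 weights precision more.
def f_beta(p, r, beta=1.0):
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p, r = 2/3, 2/5  # precision and recall from the spam example above

print(f_beta(p, r, beta=1.0))  # F1   = 0.5, the simple harmonic mean of P and R
print(f_beta(p, r, beta=2.0))  # F2   ~ 0.43, leans toward recall
print(f_beta(p, r, beta=0.5))  # F0.5 ~ 0.59, leans toward precision
```

For real workloads, libraries such as scikit-learn provide equivalent functions (f1_score, fbeta_score) with the same semantics.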
References and Sources
https://class.coursera.org/nlp/lecture/142
https://en.wikipedia.org/wiki/Precision_and_recall
https://en.wikipedia.org/wiki/F1_score
http://spark.apache.org/docs/latest/mllib-evaluation-metrics.html