There are various metrics to evaluate a classification model: Accuracy, Precision, Recall, F1-score, and the AUC-ROC score. However, newcomers to Machine Learning often find it confusing to decide which performance metric to use when evaluating a model on an imbalanced dataset in a classification setting.
In this post, I will explain how to answer the above question in different cases.
A confusion matrix is a table (rows and columns) used to describe the performance of a classification model in terms of TP, TN, FP, and FN, as follows:
Let’s suppose we have a cancer dataset in which we are supposed to predict, based on some medical reports, who is going to suffer from cancer in the near future. Then TP, TN, FP, and FN can be defined as:
TP (True Positive): the model predicts cancer and the person does develop cancer.
TN (True Negative): the model predicts no cancer and the person does not develop cancer.
FP (False Positive): the model predicts cancer but the person does not develop cancer.
FN (False Negative): the model predicts no cancer but the person does develop cancer.
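As a quick illustration (a minimal sketch with made-up labels, not the actual cancer data), sklearn’s confusion_matrix lays these four counts out as a 2x2 table:
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical labels: 1 = will suffer from cancer, 0 = will not
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# rows = actual class (0, 1), columns = predicted class (0, 1)
print(confusion_matrix(y_true, y_pred))        # [[3 1]
                                               #  [1 3]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f'TP={tp}, TN={tn}, FP={fp}, FN={fn}')   # TP=3, TN=3, FP=1, FN=1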
The confusion matrix alone often becomes hard to interpret when we have a multi-class classification problem. Now, let’s get to know the performance metrics that are built on top of it: Accuracy, Precision, Recall, F1-score, and the AUC-ROC score.
For a better understanding of the above topics, I will use a binary classification problem (1 => positive class, 0 => negative class) with two imbalanced scenarios. I deliberately created these imbalanced datasets so we can get a firm grasp of the concepts.
In the first scenario the positive class is the overwhelming majority: 10,000 positive samples against only 100 negatives. Assume that our trained classifier labels every sample as positive, so all the actual negative samples end up as False Positives (FP).
import numpy as np
import pandas as pd

# 10,000 positive samples and 100 negatives
Y = np.hstack((np.ones((10000,)), np.zeros((100,))))
# every predicted score lies in [0.5, 0.9), so every sample will be predicted as positive
Y_score = np.random.uniform(0.5, 0.9, 10100)
df_imb = pd.DataFrame(data=np.array((Y, Y_score)).T, columns=['y', 'proba'])
df_imb = df_imb.sample(10100)
# threshold the scores at 0.5 to get hard class predictions
df_imb['y_pred'] = [0 if y_score < 0.5 else 1 for y_score in df_imb.proba]
and then
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(df_imb.y, df_imb.y_pred)
conf_mat
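Because every predicted score is at least 0.5, every sample is predicted as positive, so the matrix comes out as [[0, 100], [0, 10000]]. In sklearn’s layout (rows = actual class 0/1, columns = predicted class 0/1) that means TN = 0, FP = 100, FN = 0, and TP = 10,000.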
Key observation: we get lots of False Positives, since all 100 actual negative samples are predicted as positive.
Accuracy: the fraction of correct predictions out of all predictions.
Accuracy score: (TP+TN)/(TP+TN+FP+FN) = (10000+0)/(10000+0+100+0) = 0.9900990099009901
Precision: out of all the samples the model predicted as positive, how many actually belong to the positive class.
Precision: TP/(TP+FP) = 10000/(10000+100) = 0.9900990099009901
Recall: out of all actual positive samples, how many are predicted as positive by the model. Recall is also called the True Positive Rate (TPR), Sensitivity, or probability of detection.
Recall: TP/(TP+FN) = 10000/(10000+0) = 1.0
F1-score: It returns the Harmonic Mean of Precision and Recall.
F1-score = 2 * (precision * recall)/(precision + recall) = 2 * (0.990099 * 1.0)/(0.990099 + 1.0) = 0.9950248756218906
True Positive Rate (TPR) = Recall
False Positive Rate (FPR): out of all actual negative samples, how many are predicted as positive by the model. Its range is 0 to 1 (the lower, the better).
FPR = FP/(FP+TN) = 100/(100+0) = 1.0
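Continuing with the same df_imb from above, here is a minimal sketch using standard sklearn.metrics calls that reproduces these numbers; FPR is derived from the confusion matrix because sklearn does not expose it as a single scoring function:
from sklearn import metrics

tn, fp, fn, tp = metrics.confusion_matrix(df_imb.y, df_imb.y_pred).ravel()
print('Accuracy :', metrics.accuracy_score(df_imb.y, df_imb.y_pred))   # 0.9900990099009901
print('Precision:', metrics.precision_score(df_imb.y, df_imb.y_pred))  # 0.9900990099009901
print('Recall   :', metrics.recall_score(df_imb.y, df_imb.y_pred))     # 1.0
print('F1-score :', metrics.f1_score(df_imb.y, df_imb.y_pred))         # 0.9950248756218906
print('FPR      :', fp / (fp + tn))                                    # 1.0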
A receiver operating characteristic curve, or ROC curve, is plotted with the true positive rate (TPR / Recall) on the y-axis against the false positive rate (FPR) on the x-axis at various threshold settings. The ROC curve is used to measure how well the classifier can separate the positive class from the negative class.
How is the ROC curve drawn? We take each probability score (for example, the output of LogisticRegression.predict_proba) as a threshold, compute the confusion matrix at that threshold, and measure the TPR and FPR. Repeating this for every threshold and plotting the resulting (FPR, TPR) pairs traces out the curve, as the sketch below shows.
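A minimal sketch of this threshold sweep on the df_imb data from above; the manual loop mirrors the idea, and sklearn’s roc_curve is shown for comparison (it is also far faster):
import numpy as np
from sklearn import metrics

# manual sweep: use every distinct predicted score as a threshold
tpr_list, fpr_list = [], []
for thr in np.sort(df_imb.proba.unique()):
    y_pred_thr = (df_imb.proba >= thr).astype(int)
    tn, fp, fn, tp = metrics.confusion_matrix(df_imb.y, y_pred_thr, labels=[0, 1]).ravel()
    tpr_list.append(tp / (tp + fn))
    fpr_list.append(fp / (fp + tn))

# library equivalent: returns FPR, TPR and the thresholds it used
fpr, tpr, thresholds = metrics.roc_curve(df_imb.y, df_imb.proba)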
The ROC curve can be extended to a multi-class classification problem using the one-vs-all approach.
The diagonal line represents a random model that predicts either 1 or 0 randomly. The area under the diagonal line is 0.5
For example, imagine a set of examples arranged from left to right in ascending order of their logistic regression scores, with positive and negative examples interleaved. AUC represents the probability that a randomly chosen positive example is positioned to the right of (i.e., scored higher than) a randomly chosen negative example. AUC provides an aggregate measure of performance across all possible classification thresholds. AUC ranges in value from 0 to 1: a model whose predictions are 100% wrong has an AUC of 0.0, and one whose predictions are 100% correct has an AUC of 1.0.
AUC score: 0.4580425
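This probabilistic interpretation can be checked directly with a small sketch that samples random positive/negative pairs and compares their scores. On the df_imb data the scores carry no class information, so both numbers should hover around 0.5; the exact 0.4580425 above depends on the particular random draw:
import numpy as np
from sklearn import metrics

pos_scores = df_imb.loc[df_imb.y == 1, 'proba'].values
neg_scores = df_imb.loc[df_imb.y == 0, 'proba'].values

# estimate P(score of a random positive > score of a random negative)
rng = np.random.default_rng(0)
p = rng.choice(pos_scores, 100000)
n = rng.choice(neg_scores, 100000)
auc_estimate = np.mean(p > n) + 0.5 * np.mean(p == n)  # ties count as half

print(auc_estimate)                                    # ~0.5
print(metrics.roc_auc_score(df_imb.y, df_imb.proba))   # ~0.5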
Putting the numbers for this first scenario side by side:
Accuracy score: 0.9900990099009901
FPR: 1.0
Precision: 0.9900990099009901
Recall: 1.0
F1-score: 0.9950248756218906
AUC score: 0.4580425
A. Metrics that don’t help to measure your model:
Precision and Recall basically deal with the positive class. When the dataset inherently contains a large majority of positive samples, as it does here, they stay high regardless of what happens to the negative class, so they are not good metrics for measuring model performance, as the quick calculation below shows.
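To see why, plug this scenario’s counts into the formulas: a classifier that blindly labels every sample as positive gets Recall = 10000/10000 = 1.0 and Precision = 10000/(10000 + 100) ≈ 0.99, so both look excellent even though the model has learned nothing about the negative class.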
B. Metrics that help to measure your model: here the FPR (1.0) and the AUC score (≈ 0.46) are the ones that expose the problem; they make it clear that the model cannot separate the negatives at all, despite the impressive Accuracy, Precision, Recall, and F1-score.
Here we just do the opposite of the previous situation: now the positive class is the minority, with 100 positive samples against 10,000 negatives.
import numpy as np
import pandas as pd

# the opposite of the previous scenario: 100 positives vs 10,000 negatives
Y = np.hstack((np.ones((100,)), np.zeros((10000,))))
# the score distributions below are an assumption: roughly half of the positives
# and only ~2% of the negatives end up above the 0.5 threshold
Y_score = np.hstack((np.random.uniform(0.4, 0.6, 100),
                     np.random.uniform(0.1, 0.51, 10000)))
df_imb = pd.DataFrame(data=np.array((Y, Y_score)).T, columns=['y', 'proba'])
df_imb = df_imb.sample(10100)
def pred(X):
    # threshold the raw scores at 0.5, as a sigmoid-output classifier would
    N = len(X)
    predict = []
    for i in range(N):
        if X[i] >= 0.5:  # sigmoid(w,x,b) returns 1/(1+exp(-(dot(x,w)+b)))
            predict.append(1)
        else:
            predict.append(0)
    return np.array(predict)
from sklearn import metrics

print(f'Accuracy score: {metrics.accuracy_score(Y, pred(Y_score))}')
print(f'F1-score: {metrics.f1_score(Y, pred(Y_score))}')
print(f'ROC AUC score: {metrics.roc_auc_score(Y, Y_score)}')
print(f'Precision: {metrics.precision_score(Y, pred(Y_score))}')
print(f'Recall: {metrics.recall_score(Y, pred(Y_score))}')
metrics.confusion_matrix(Y, pred(Y_score))
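The FPR reported below is not printed by the snippet above; assuming the same Y and pred, it can be derived from the confusion matrix like this:
tn, fp, fn, tp = metrics.confusion_matrix(Y, pred(Y_score)).ravel()
print(f'FPR: {fp / (fp + tn)}')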
Accuracy score: 0.9722772277227723
FPR: 0.0232
Precision: 0.18309859154929578
Recall (TPR): 0.52
F1-score: 0.27083333333333337
ROC AUC score: 0.9276659999999999
A. Metrics that don’t help to measure your model:
The AUC score doesn’t capture the true picture when the dataset has a negative majority class and our focus is the minority positive class: here it sits at a comfortable 0.93 even though the model finds only about half of the actual positives.
B. Metrics that help to measure your model: Precision, Recall, and the F1-score are what expose the weakness here. Precision ≈ 0.18, Recall ≈ 0.52, and F1 ≈ 0.27 make it clear that the model handles the minority positive class poorly, even though Accuracy and the AUC score look good.
Note: what is “positive” and what is “negative” is purely a semantic choice for your situation. You can simply flip the labels, decide your focus class based on the given business problem, and then pick the right performance metrics as discussed in this post. A small sketch of such a flip follows.
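For instance, a minimal sketch reusing the Y and Y_score arrays from the second scenario above (pos_label is a standard parameter of sklearn’s precision/recall scorers) shows how the same calls can be pointed at the other class:
from sklearn import metrics

# treat class 0 as the class of interest instead of class 1
print(metrics.precision_score(Y, pred(Y_score), pos_label=0))
print(metrics.recall_score(Y, pred(Y_score), pos_label=0))

# equivalently, flip the labels themselves and keep the default pos_label=1
print(metrics.recall_score(1 - Y, 1 - pred(Y_score)))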
Source:
https://towardsdatascience.com/demystifying-roc-and-precision-recall-curves-d30f3fad2cbf