In the traditional classification problem formulation, classes are mutually exclusive: each training example can belong to exactly one class, and classification errors arise from classes overlapping in the feature space. Often, however, we encounter tasks where a data point can belong to multiple classes at once. In such cases we move from the traditional formulation to a multi-label classification framework, where each label is treated as a Bernoulli random variable representing a separate binary classification task.
Definitions:
The difference between multi-class classification & multi-label classification is that in multi-class problems the classes are mutually exclusive, whereas for multi-label problems each label represents a different classification task, but the tasks are somehow related.
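As a quick illustration (the label names here are hypothetical), a multi-class target assigns exactly one class to each example, whereas a multi-label target is a binary indicator vector in which several labels can be active at once:

# Multi-class: exactly one class per example
y_multiclass = ["cat", "dog", "cat"]

# Multi-label: one binary indicator per label, and several can be 1 at once
# columns: [outdoor, people, animal]  (hypothetical label names)
y_multilabel = [
    [1, 0, 1],   # outdoor + animal
    [0, 1, 0],   # people only
    [1, 1, 1],   # all three labels apply
]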
From an implementation standpoint –
There are numerous blogs out there with details on how to train a multi-label classifier using popular frameworks (scikit-learn, Keras, TensorFlow, PyTorch, etc.). However, evaluating the performance of any machine learning algorithm is an equally critical piece of the puzzle.
In this blog post, we will focus on the different evaluation metrics that can be used to assess the performance of a multi-label classifier. These metrics can be broadly grouped into two categories — example-based metrics and label-based metrics.
Example-based metrics compare the true and the predicted label sets of each individual data point and then average the resulting score over all examples in the dataset.
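All the snippets below operate on two binary indicator matrices, y_true and y_pred, of shape (n_samples, n_labels). The small running example here is an assumption on my part, chosen so that it reproduces the scikit-learn outputs quoted at the end of the post (scikit-learn may emit a zero-division warning for the sample with no predicted labels):

import numpy as np

# Assumed running example: 4 samples, 3 labels.
# Rows are samples, columns are labels; a 1 means the label applies.
y_true = np.array([[0, 1, 0],
                   [0, 1, 1],
                   [1, 0, 1],
                   [0, 0, 1]])

y_pred = np.array([[0, 1, 1],
                   [0, 1, 1],
                   [0, 1, 0],
                   [0, 0, 0]])

The first example-based metric, implemented below, is the Exact Match Ratio (EMR): the fraction of samples whose entire label vector is predicted correctly.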
def emr(y_true, y_pred):
    n = len(y_true)
    row_indicators = np.all(y_true == y_pred, axis=1)  # axis=1 checks for equality along rows
    exact_match_count = np.sum(row_indicators)
    return exact_match_count / n
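On this running example only one row matches exactly, so the EMR agrees with the sklearn.metrics.accuracy_score value shown at the end of the post:

emr_value = emr(y_true, y_pred)
print(f"Exact Match Ratio: {emr_value}")  # 0.25 for the assumed example data

The closely related 0/1 loss, implemented next, is simply the fraction of samples that are not an exact match.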
def one_zero_loss(y_true, y_pred):
    n = len(y_true)
    row_indicators = np.logical_not(np.all(y_true == y_pred, axis=1))  # axis=1 checks for equality along rows
    not_equal_count = np.sum(row_indicators)
    return not_equal_count / n
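Being the complement of the exact match ratio, the 0/1 loss on the assumed example data is 1 - 0.25:

one_zero_loss_value = one_zero_loss(y_true, y_pred)
print(f"0/1 Loss: {one_zero_loss_value}")  # 0.75 for the assumed example data

Hamming loss, computed next, instead works at the level of individual label positions: it is the fraction of label entries that are predicted incorrectly, which is exactly an XOR between the true and predicted indicator matrices.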
def hamming_loss(y_true, y_pred):
    """
    XOR TT for reference -
    A B Output
    0 0 0
    0 1 1
    1 0 1
    1 1 0
    """
    hl_num = np.sum(np.logical_xor(y_true, y_pred))
    hl_den = np.prod(y_true.shape)
    return hl_num / hl_den
hl_value = hamming_loss(y_true, y_pred)
print(f"Hamming Loss: {hl_value}")
def example_based_accuracy(y_true, y_pred):
    # compute true positives using the logical AND operator
    numerator = np.sum(np.logical_and(y_true, y_pred), axis=1)
    # compute true positives + false positives + false negatives using the logical OR operator
    # (note: a row with no true and no predicted labels would give a zero denominator here)
    denominator = np.sum(np.logical_or(y_true, y_pred), axis=1)
    instance_accuracy = numerator / denominator
    avg_accuracy = np.mean(instance_accuracy)
    return avg_accuracy
ex_based_accuracy = example_based_accuracy(y_true, y_pred)
print(f"Example Based Accuracy: {ex_based_accuracy}")
def example_based_precision(y_true, y_pred):
    """
    precision = TP / (TP + FP)
    """
    # compute true positives per example
    precision_num = np.sum(np.logical_and(y_true, y_pred), axis=1)
    # total number of predicted labels per example
    precision_den = np.sum(y_pred, axis=1)
    # precision per example; an example with no predicted labels is scored 0
    # to avoid a division by zero
    instance_precision = np.divide(precision_num, precision_den,
                                   out=np.zeros(len(precision_num), dtype=float),
                                   where=precision_den != 0)
    # precision averaged over all training examples
    avg_precision = np.mean(instance_precision)
    return avg_precision
ex_based_precision = example_based_precision(y_true, y_pred)
print(f"Example Based Precision: {ex_based_precision}")
Precision, recall, and the F1 score can also be computed in an example-based fashion: the metric is evaluated on each instance's label set and the scores are then averaged over all instances. This corresponds to scikit-learn's average='samples' option used later in the post.
Precision
It is the proportion of correctly predicted labels to the total number of predicted labels, averaged over all instances.
def Precision(y_true, y_pred):
    temp = 0
    for i in range(y_true.shape[0]):
        # skip examples with no predicted labels to avoid a division by zero;
        # they contribute a precision of 0 to the average
        if sum(y_pred[i]) == 0:
            continue
        temp += sum(np.logical_and(y_true[i], y_pred[i])) / sum(y_pred[i])
    return temp / y_true.shape[0]

print(Precision(y_true, y_pred))
#0.375
def Recall(y_true, y_pred):
    temp = 0
    for i in range(y_true.shape[0]):
        # skip examples with no true labels to avoid a division by zero
        if sum(y_true[i]) == 0:
            continue
        temp += sum(np.logical_and(y_true[i], y_pred[i])) / sum(y_true[i])
    return temp / y_true.shape[0]

print(Recall(y_true, y_pred))
#0.5
def F1Measure(y_true, y_pred):
    temp = 0
    for i in range(y_true.shape[0]):
        if (sum(y_true[i]) == 0) and (sum(y_pred[i]) == 0):
            continue
        temp += (2 * sum(np.logical_and(y_true[i], y_pred[i]))) / (sum(y_true[i]) + sum(y_pred[i]))
    return temp / y_true.shape[0]

print(F1Measure(y_true, y_pred))
#0.41666666666666663
As opposed to example-based metrics, label-based metrics evaluate each label separately and the results are then averaged over all labels. As a consequence, any metric that can be used for binary classification can be used as a label-based metric. These metrics can be computed on the individual class labels and then averaged over all classes; this is termed macro averaging. Alternatively, we can compute them globally over all instances and all class labels; this is termed micro averaging.
def label_based_macro_accuracy(y_true, y_pred):
    # axis=0 computes true positives along columns, i.e. per label
    l_acc_num = np.sum(np.logical_and(y_true, y_pred), axis=0)
    # axis=0 computes true positives + false positives + false negatives along columns, i.e. per label
    l_acc_den = np.sum(np.logical_or(y_true, y_pred), axis=0)
    # compute mean accuracy across labels
    return np.mean(l_acc_num / l_acc_den)
lb_macro_acc_val = label_based_macro_accuracy(y_true, y_pred)
print(f"Label Based Macro Accuracy: {lb_macro_acc_val}")
def label_based_macro_precision(y_true, y_pred):
    # axis=0 computes true positives along columns, i.e. per label
    l_prec_num = np.sum(np.logical_and(y_true, y_pred), axis=0)
    # axis=0 computes true positives + false positives along columns, i.e. per label
    # (note: a label that is never predicted gives a zero denominator here)
    l_prec_den = np.sum(y_pred, axis=0)
    # compute precision per class/label
    l_prec_per_class = l_prec_num / l_prec_den
    # macro precision = average of precision across labels
    l_prec = np.mean(l_prec_per_class)
    return l_prec
lb_macro_precision_val = label_based_macro_precision(y_true, y_pred)
print(f"Label Based Precision: {lb_macro_precision_val}")
def label_based_macro_recall(y_true, y_pred):
    # compute true positives along axis=0, i.e. per label
    l_recall_num = np.sum(np.logical_and(y_true, y_pred), axis=0)
    # compute true positives + false negatives along axis=0, i.e. per label
    l_recall_den = np.sum(y_true, axis=0)
    # compute recall per class/label
    l_recall_per_class = l_recall_num / l_recall_den
    # compute macro-averaged recall, i.e. recall averaged across labels
    l_recall = np.mean(l_recall_per_class)
    return l_recall
lb_macro_recall_val = label_based_macro_recall(y_true, y_pred)
print(f"Label Based Recall: {lb_macro_recall_val}")
def label_based_micro_accuracy(y_true, y_pred):
    # sum of all true positives across all examples and labels
    l_acc_num = np.sum(np.logical_and(y_true, y_pred))
    # sum of all tp + fp + fn across all examples and labels
    l_acc_den = np.sum(np.logical_or(y_true, y_pred))
    # compute micro-averaged accuracy
    return l_acc_num / l_acc_den
lb_micro_acc_val = label_based_micro_accuracy(y_true, y_pred)
print(f"Label Based Micro Accuracy: {lb_micro_acc_val}")
def label_based_micro_precision(y_true, y_pred):
    # compute sum of true positives (tp) across training examples
    # and labels
    l_prec_num = np.sum(np.logical_and(y_true, y_pred))
    # compute the sum of tp + fp across training examples and labels
    l_prec_den = np.sum(y_pred)
    # compute micro-averaged precision
    return l_prec_num / l_prec_den
lb_micro_prec_val = label_based_micro_precision(y_true, y_pred)
print(f"Label Based Micro Precision: {lb_micro_prec_val}")
# Function for computing label-based micro-averaged recall
# for a multi-label classification problem.
def label_based_micro_recall(y_true, y_pred):
    # compute sum of true positives across training examples and labels
    l_recall_num = np.sum(np.logical_and(y_true, y_pred))
    # compute sum of tp + fn across training examples and labels
    l_recall_den = np.sum(y_true)
    # compute micro-averaged recall
    return l_recall_num / l_recall_den
lb_micro_recall_val = label_based_micro_recall(y_true, y_pred)
print(f"Label Based Micro Recall: {lb_micro_recall_val}")
Another useful example-based measure is the alpha-evaluation score, which penalizes missed labels (false negatives) with a weight beta and spurious labels (false positives) with a weight gamma, and raises the resulting score to the power alpha:
def alpha_evaluation_score(y_true, y_pred):
    alpha = 1
    beta = 0.25
    gamma = 1
    # compute true positives across training examples and labels
    tp = np.sum(np.logical_and(y_true, y_pred))
    # compute false negatives (missed labels) across training examples and labels
    fn = np.sum(np.logical_and(y_true, np.logical_not(y_pred)))
    # compute false positives across training examples and labels
    fp = np.sum(np.logical_and(np.logical_not(y_true), y_pred))
    # compute the alpha evaluation score (the small constant avoids a division by zero)
    alpha_score = (1 - ((beta * fn + gamma * fp) / (tp + fn + fp + 0.00001))) ** alpha
    return alpha_score
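As with the other metrics, the score can be evaluated directly on the running example (the quoted value assumes the example data introduced earlier, together with the alpha, beta, and gamma values hard-coded above):

alpha_score_val = alpha_evaluation_score(y_true, y_pred)
print(f"Alpha Evaluation Score: {alpha_score_val}")  # ~0.656 for the assumed example data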
One can also use scikit-learn's built-in functions to compute accuracy, Hamming loss, and the other metrics:
import sklearn.metrics
print('Exact Match Ratio: {0}'.format(sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)))
#Exact Match Ratio: 0.25
print('Hamming loss: {0}'.format(sklearn.metrics.hamming_loss(y_true, y_pred)))
#Hamming loss: 0.4166666666666667
#"samples" applies only to multilabel problems. It does not calculate a per-class measure, instead calculating the metric over the true and predicted classes
#for each sample in the evaluation data, and returning their (sample_weight-weighted) average.
print('Precision: {0}'.format(sklearn.metrics.precision_score(y_true=y_true, y_pred=y_pred, average='samples')))
#Precision: 0.375
print('Recall: {0}'.format(sklearn.metrics.recall_score(y_true=y_true, y_pred=y_pred, average='samples')))
#Recall: 0.5
print('F1 Measure: {0}'.format(sklearn.metrics.f1_score(y_true=y_true, y_pred=y_pred, average='samples')))
#F1 Measure: 0.41666666666666663
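scikit-learn can also produce the label-based macro and micro averages discussed above through the same average argument; a brief sketch (zero_division=0 makes a label that is never predicted contribute a precision of 0 instead of raising a warning):

print('Micro Precision: {0}'.format(sklearn.metrics.precision_score(y_true=y_true, y_pred=y_pred, average='micro')))
print('Micro Recall: {0}'.format(sklearn.metrics.recall_score(y_true=y_true, y_pred=y_pred, average='micro')))
print('Macro Precision: {0}'.format(sklearn.metrics.precision_score(y_true=y_true, y_pred=y_pred, average='macro', zero_division=0)))
print('Macro Recall: {0}'.format(sklearn.metrics.recall_score(y_true=y_true, y_pred=y_pred, average='macro')))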
A complete Jupyter notebook covering these metrics in scikit-learn is linked in the references below.
Training a multi-label classifier seems trivial with the help of high-level libraries. However, evaluating its performance is a whole different ball game. Apart from the evaluation metrics above, computing and visualizing the confusion matrix for a multi-label classification problem seems like another fun challenge.
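As a starting point for that, scikit-learn provides multilabel_confusion_matrix, which returns one 2x2 confusion matrix per label; a minimal sketch using the assumed example data from earlier:

from sklearn.metrics import multilabel_confusion_matrix

# one 2x2 matrix per label, each laid out as [[tn, fp], [fn, tp]]
mcm = multilabel_confusion_matrix(y_true, y_pred)
for label_idx, cm in enumerate(mcm):
    print(f"Label {label_idx} confusion matrix:\n{cm}")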
References:
https://www.kaggle.com/code/kmkarakaya/multi-label-model-evaluation/notebook
https://mmuratarat.github.io/2020-01-25/multilabel_classification_metrics