Lesson 27 - Classification Metrics

The following topics are discussed in this notebook:

  • Confusion matrices.
  • Precision and recall.

Classification Metrics

Recall that a classification model's accuracy with respect to a particular dataset is the proportion of observations in that dataset for which the model generates correct predictions. Accuracy is a commonly used metric for measuring the performance of a classification model. However, it can give misleading results, especially when working with datasets in which the classes are very imbalanced. When a large majority of the observations belong to a single class, a "naive model" that always predicts the majority class will earn a very high accuracy score while being completely useless.
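To make this concrete, here is a minimal sketch (using hypothetical spam/ham labels that are not part of this lesson's data) of a majority-class model scoring roughly 95% accuracy while never identifying a single minority-class observation.

import numpy as np

rng = np.random.default_rng(0)
y = rng.choice(['spam', 'ham'], size=1000, p=[0.05, 0.95])  # about 95% of labels are 'ham'
naive_pred = np.full_like(y, 'ham')                         # always predict the majority class

# Accuracy is roughly 0.95 even though no 'spam' observation is ever detected.
print('Naive model accuracy:', np.mean(naive_pred == y))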

In this lesson, we will introduce two new classification metrics, known as precision and recall. These metrics give us a sense of how well a classification model performs on specific classes, rather than on the dataset as a whole.

We will start by importing a few tools that we will need in this lesson. Take note of the functions classification_report and confusion_matrix that we import from the sklearn.metrics module. We will discuss these functions later in the lesson.

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

Generate Data

To illustrate the concepts discussed in this notebook, we will generate some synthetic data. Suppose that we have a classification problem in which observations are grouped into one of three classes, labeled 'A', 'B', and 'C'. In the cell below, we randomly generate two arrays y_true and y_pred, each of which contains 1000 elements. Assume that y_true contains the observed classes for observations in a test set that has been set aside, while y_pred contains the predicted classes for the test set, as generated by some classification model.

In [2]:
np.random.seed(1)
n = 1000
y_true = np.random.choice(['A', 'B', 'C'],size=n, p=[0.2, 0.5, 0.3])
ix = np.random.choice(range(n), replace=False, size=400)
y_pred = y_true.copy()
y_pred[ix] = np.random.choice(['A', 'B', 'C'], size=400, p=[0.2, 0.5, 0.3])

print('First 10 elements of y_true:', y_true[:10])
print('First 10 elements of y_pred:', y_pred[:10])
First 10 elements of y_true: ['B' 'C' 'A' 'B' 'A' 'A' 'A' 'B' 'B' 'B']
First 10 elements of y_pred: ['B' 'C' 'A' 'B' 'B' 'C' 'A' 'B' 'B' 'C']
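As a quick sanity check on the class balance, you could tally the labels in y_true; the counts should match the support column of the classification report shown at the end of this lesson.

# Count how many test observations fall into each class.
labels, counts = np.unique(y_true, return_counts=True)
print(dict(zip(labels, counts)))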

In the cell below, we use y_pred and y_true to calculate the model's accuracy on this test set.

In [3]:
accuracy = np.mean(y_pred == y_true)
print('Test Accuracy:', accuracy)
Test Accuracy: 0.739
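Equivalently, the accuracy_score function from sklearn.metrics computes the same quantity; the sketch below should print the same value.

# Same computation using scikit-learn's built-in accuracy metric.
from sklearn.metrics import accuracy_score
print('Test Accuracy:', accuracy_score(y_true, y_pred))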

Confusion Matrix

A confusion matrix is a table that gives an idea of how a classification model performs on individual classes, as well as the types of mistakes the model is prone to making. A confusion matrix has one row and one column for each class. The rows are associated with the actual observed classes, while the columns are associated with the predicted classes. Each cell contains the number of observations that actually belong to the class associated with the cell's row and that were predicted to belong to the class associated with the cell's column.

We can use the confusion_matrix function from sklearn.metrics to automatically generate a confusion matrix based on the observed and predicted label arrays.

In [4]:
cm = confusion_matrix(y_true, y_pred)
print(cm)
[[123  48  32]
 [ 42 394  64]
 [ 21  54 222]]

The confusion matrix is returned as a 2D array. To make this information easier to read, we will convert it to a DataFrame and add names for the rows and columns.

In [5]:
pd.DataFrame(cm, index=['Actual A', 'Actual B', 'Actual C'], 
            columns=['Pred A', 'Pred B', 'Pred C'], )
Out[5]:
Pred A Pred B Pred C
Actual A 123 48 32
Actual B 42 394 64
Actual C 21 54 222

Reading the first row of the confusion matrix, we see that of the 203 Class A observations found in the test set, the model predicted that 123 were in Class A, 48 were in Class B, and 32 were in Class C. The other rows can be read in a similar way.
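As an optional alternative to labeling the array by hand, a pandas cross-tabulation of the two label arrays should produce an equivalent table.

# Build the same table directly with a pandas cross-tabulation.
pd.crosstab(pd.Series(y_true, name='Actual'), pd.Series(y_pred, name='Predicted'))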

True Positive, False Positive, True Negative, False Negative

Before defining precision and recall, we need to introduce the concepts of true positives, false positives, true negatives, and false negatives.

Assume that we have a classification problem with a response variable Y, and that we are particularly interested in a model's performance with respect to one of the classes of Y. We will refer to that specific class as the Positive Class and to all other classes as Negative Classes. For notational convenience, we will refer to the positive class as "Class P".

  • An observation is a True Positive for Class P if the model predicts that the observation is positive, and the observed class is actually positive.

  • An observation is a False Positive for Class P if the model predicts that the observation is positive, but the observed class is actually negative.

  • An observation is a True Negative for Class P if the model predicts that the observation is negative, and the observed class is actually negative.

  • An observation is a False Negative for Class P if the model predicts that the observation is negative, but the observed class is actually positive.

For a particular class, let TP, FP, TN, and FN refer to the number of true positives, false positives, true negatives, and false negatives, respectively.

In the cell below, we calculate TP, FP, TN, and FN for each of the three classes in our example, and then display the results as a DataFrame.

In [6]:
ClassA = [np.sum((y_pred == 'A') & (y_true == 'A')), np.sum((y_pred == 'A') & (y_true != 'A')), 
          np.sum((y_pred != 'A') & (y_true != 'A')), np.sum((y_pred != 'A') & (y_true == 'A')), ]

ClassB = [np.sum((y_pred == 'B') & (y_true == 'B')), np.sum((y_pred == 'B') & (y_true != 'B')), 
          np.sum((y_pred != 'B') & (y_true != 'B')), np.sum((y_pred != 'B') & (y_true == 'B')), ]

ClassC = [np.sum((y_pred == 'C') & (y_true == 'C')), np.sum((y_pred == 'C') & (y_true != 'C')), 
          np.sum((y_pred != 'C') & (y_true != 'C')), np.sum((y_pred != 'C') & (y_true == 'C')), ]

pd.DataFrame([ClassA, ClassB, ClassC], 
            index=['Class A', 'Class B', 'Class C'], 
            columns=['TP', 'FP', 'TN', 'FN'])
Out[6]:
TP FP TN FN
Class A 123 63 734 80
Class B 394 102 398 106
Class C 222 96 607 75
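The three blocks in the cell above are nearly identical, so the same table can also be built by looping over the class labels. The sketch below is simply a more compact rewrite of that cell.

# Compact rewrite of the cell above: loop over the labels instead of
# repeating the same four expressions for each class.
rows = {}
for label in ['A', 'B', 'C']:
    tp = np.sum((y_pred == label) & (y_true == label))
    fp = np.sum((y_pred == label) & (y_true != label))
    tn = np.sum((y_pred != label) & (y_true != label))
    fn = np.sum((y_pred != label) & (y_true == label))
    rows['Class ' + label] = [tp, fp, tn, fn]

pd.DataFrame.from_dict(rows, orient='index', columns=['TP', 'FP', 'TN', 'FN'])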

Precision and Recall

We define a model's precision and recall with respect to a particular class in a dataset as follows:

  • Precision $\large = \frac{TP}{TP ~+~ FP} ~=~ \frac{TP}{\textrm{Number of Positive Predictions} }$

  • Recall $\large =\frac{TP}{TP ~+~ FN} ~=~ \frac{TP}{\textrm{Number of Actual Positive Observations} }$

Precision can be thought of as an estimate of the probability that a positive prediction will actually be correct. Recall can be interpreted as an estimate of the probability that a positive observation will be correctly identified by the model.

In the cell below, we will use y_true and y_pred to calculate our model's precision and recall for Class A.

In [7]:
classA_precision = np.sum((y_pred == 'A') & (y_true == 'A') ) / np.sum(y_pred == 'A')
classA_recall = np.sum((y_pred == 'A') & (y_true == 'A') ) / np.sum(y_true == 'A')

print('Class A Precision:', round(classA_precision,4))
print('Class A Recall:   ', round(classA_recall,4))
Class A Precision: 0.6613
Class A Recall:    0.6059

The precision score suggests that when our model predicts that an observation is in Class A, it will be correct about 66.13% of the time. The recall score suggests that our model will correctly identify about 60.59% of all observations that are actually in Class A.
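Precision and recall can also be read directly off the confusion matrix: each diagonal entry is a TP count, each column sum is the number of positive predictions for a class, and each row sum is the number of actual positives. The sketch below reuses the cm array from above to compute both metrics for all three classes at once; the Class A values should match those just computed.

# Precision and recall for every class, computed from the confusion matrix cm.
tp = np.diag(cm)                  # diagonal entries are the TP counts
precision = tp / cm.sum(axis=0)   # TP / (TP + FP): divide by column sums
recall = tp / cm.sum(axis=1)      # TP / (TP + FN): divide by row sums

pd.DataFrame({'precision': precision.round(4), 'recall': recall.round(4)},
             index=['A', 'B', 'C'])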

Classification Report

The classification_report function from sklearn.metrics can be used to automatically calculate the precision and recall scores for each class. We will display these results for our example below.

In [8]:
print(classification_report(y_true, y_pred, digits=4))
              precision    recall  f1-score   support

           A     0.6613    0.6059    0.6324       203
           B     0.7944    0.7880    0.7912       500
           C     0.6981    0.7475    0.7220       297

   micro avg     0.7390    0.7390    0.7390      1000
   macro avg     0.7179    0.7138    0.7152      1000
weighted avg     0.7388    0.7390    0.7384      1000
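In this report, the support column gives the number of actual observations in each class, the macro average is the unweighted mean of the per-class scores, and the weighted average weights each class by its support; the micro average pools all predictions and, for a single-label multiclass problem like this one, equals the overall accuracy. If you only need the per-class numbers as arrays rather than a formatted report, the precision_score and recall_score functions from sklearn.metrics accept average=None; the sketch below should reproduce the precision and recall columns above.

# Per-class precision and recall as arrays, rather than a formatted text report.
from sklearn.metrics import precision_score, recall_score

labels = ['A', 'B', 'C']
print('Precision by class:', precision_score(y_true, y_pred, labels=labels, average=None))
print('Recall by class:   ', recall_score(y_true, y_pred, labels=labels, average=None))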