Recall that a classification model's accuracy with respect to a particular dataset is equal to the proportion of observations in the dataset for which the model generates correct predictions. Accuracy is a commonly used metric for measuring the performance of a classification model. However, it can sometimes give misleading results, especially when working with datasets in which the classes are very imbalanced. When a large majority of the observations in a dataset belong to a single class, a "naive model" that always predicts the majority class will achieve a very high accuracy score, but will be completely useless.
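The short sketch below illustrates this pitfall. The labels and class proportions here are made up purely for illustration: a classifier that ignores its inputs and always predicts the majority class scores roughly 95% accuracy while never identifying a single minority-class observation.
import numpy as np

# Hypothetical, highly imbalanced labels: roughly 95% of observations are in class 'B'.
np.random.seed(0)
labels = np.random.choice(['A', 'B'], size=1000, p=[0.05, 0.95])

# A "naive model" that always predicts the majority class.
naive_pred = np.full(1000, 'B')

# Accuracy is high even though the model never correctly identifies a single 'A'.
print('Naive accuracy:', np.mean(naive_pred == labels))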
In this lesson, we will introduce two new classification metrics, known as precision and recall. These metrics can be used to get a sense of how well a classification model performs on specific classes, rather than on the dataset as a whole.
We will start by importing a few tools that we will need in this lesson. Take note of the functions classification_report and confusion_matrix that we import from the sklearn.metrics module. We will discuss these functions later in the lesson.
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
To illustrate the concepts discussed in this notebook, we will generate some synthetic data. Suppose that we have a classification problem in which observations are grouped into one of three classes, labeled 'A', 'B', and 'C'. In the cell below, we randomly generate two arrays, y_true and y_pred, each of which contains 1000 elements. Assume that y_true contains the observed classes for observations in a test set that has been set aside, while y_pred contains the predicted classes for the test set, as generated by some classification model.
np.random.seed(1)
n = 1000

# Randomly generate the "observed" class labels for the test set.
y_true = np.random.choice(['A', 'B', 'C'], size=n, p=[0.2, 0.5, 0.3])

# Select 400 observations and re-draw their labels to simulate imperfect predictions.
ix = np.random.choice(range(n), replace=False, size=400)
y_pred = y_true.copy()
y_pred[ix] = np.random.choice(['A', 'B', 'C'], size=400, p=[0.2, 0.5, 0.3])
print('First 10 elements of y_true:', y_true[:10])
print('First 10 elements of y_pred:', y_pred[:10])
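Before computing any metrics, it can be helpful to check how the test observations are distributed across the three classes. The line below is a small optional check; value_counts simply tallies how many observations carry each label.
# Optional check: number of test observations in each class.
print(pd.Series(y_true).value_counts())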
In the cell below, we use y_pred and y_true to calculate the model's accuracy on this test set.
accuracy = np.mean(y_pred == y_true)
print('Test Accuracy:', accuracy)
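The same value can also be obtained from scikit-learn's accuracy_score helper. The sketch below is simply a cross-check and is not required for the rest of the lesson.
from sklearn.metrics import accuracy_score

# Should match the accuracy computed by hand above.
print('Test Accuracy:', accuracy_score(y_true, y_pred))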
A confusion matrix is a table that can be used to get an idea as to how a classification model performs on individual classes, as well as the types of mistakes that the model is prone to making. A confusion matrix is structured so that there is one row and one column for each class. The rows are associated with the actual observed values of each class, while the columns are associated with the predicted values of each class. Each cell in the table contains a count: the number of observations that are actually in the class associated with the cell's row and that were predicted to be a member of the class associated with the cell's column.
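A confusion matrix can also be tabulated by hand. The pd.crosstab call below is one minimal way to do this, and its output should agree with the matrix generated later in the lesson.
# Cross-tabulate actual labels (rows) against predicted labels (columns).
pd.crosstab(pd.Series(y_true, name='Actual'),
            pd.Series(y_pred, name='Predicted'))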
We can use the confusion_matrix function from sklearn.metrics to automatically generate a confusion matrix based on the observed and predicted label arrays.
cm = confusion_matrix(y_true, y_pred)
print(cm)
The confusion matrix is returned as a 2D array. To make this information easier to read, we will convert it to a DataFrame and add names for the rows and columns.
pd.DataFrame(cm, index=['Actual A', 'Actual B', 'Actual C'],
columns=['Pred A', 'Pred B', 'Pred C'], )
Reading the first row of the confusion matrix, we see that of the 203 Class A observations found in the test set, the model predicted that 123 were in Class A, 48 were in Class B, and 32 were in Class C. We can read the other rows in a similar way.
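A couple of quick sanity checks, sketched below, can help confirm this reading of the matrix: the row sums recover the number of observations actually in each class, the column sums recover the number of predictions made for each class, and the diagonal entries (the correct predictions) divided by the total recover the accuracy computed earlier.
# Row sums = actual class counts, column sums = predicted class counts.
print('Actual counts per class:   ', cm.sum(axis=1))
print('Predicted counts per class:', cm.sum(axis=0))

# Correct predictions sit on the diagonal, so trace / total = accuracy.
print('Accuracy from the matrix:  ', cm.trace() / cm.sum())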
Before defining precision and recall, we need to introduce the concepts of true positives, false positives, true negatives, and false negatives.
Assume that we have a classification problem with a response variable Y, and that we are particularly interested in a model's performance with respect to one of the classes of Y. We will refer to that specific class as the Positive Class, and to all other classes as Negative Classes. For notational convenience, we will refer to the positive class as "Class P".
An observation is a True Positive for Class P if the model predicts that the observation is positive, and the observed class is actually positive.
An observation is a False Positive for Class P if the model predicts that the observation is positive, but the observed class is actually negative.
An observation is a True Negative for Class P if the model predicts that the observation is negative, and the observed class is actually negative.
An observation is a False Negative for Class P if the model predicts that the observation is negative, but the observed class is actually positive.
For a particular class, let TP, FP, TN, and FN refer to the number of true positives, false positives, true negatives, and false negatives, respectively.
In the cell below, we calculate TP, FP, TN, and FN for each of the three classes in our example, and then display the results as a DataFrame.
# For each class, count [TP, FP, TN, FN] by comparing predicted labels to observed labels.
ClassA = [np.sum((y_pred == 'A') & (y_true == 'A')), np.sum((y_pred == 'A') & (y_true != 'A')),
          np.sum((y_pred != 'A') & (y_true != 'A')), np.sum((y_pred != 'A') & (y_true == 'A'))]
ClassB = [np.sum((y_pred == 'B') & (y_true == 'B')), np.sum((y_pred == 'B') & (y_true != 'B')),
          np.sum((y_pred != 'B') & (y_true != 'B')), np.sum((y_pred != 'B') & (y_true == 'B'))]
ClassC = [np.sum((y_pred == 'C') & (y_true == 'C')), np.sum((y_pred == 'C') & (y_true != 'C')),
          np.sum((y_pred != 'C') & (y_true != 'C')), np.sum((y_pred != 'C') & (y_true == 'C'))]

pd.DataFrame([ClassA, ClassB, ClassC],
             index=['Class A', 'Class B', 'Class C'],
             columns=['TP', 'FP', 'TN', 'FN'])
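As an optional check on these counts, note that for each class TP + FN should equal the number of observations actually in that class, and TP + FP + TN + FN should equal the total number of test observations. The short sketch below verifies this.
# For each class: TP + FN = actual class count, TP + FP + TN + FN = n.
for label, row in zip(['A', 'B', 'C'], [ClassA, ClassB, ClassC]):
    tp, fp, tn, fn = row
    print(label, '-- actual count:', tp + fn, ' total:', tp + fp + tn + fn)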
We define a model's precision and recall with respect to a particular class in a dataset as follows:
Precision $\large = \frac{TP}{TP ~+~ FP} ~=~ \frac{TP}{\textrm{Number of Positive Predictions} }$
Recall $\large =\frac{TP}{TP ~+~ FN} ~=~ \frac{TP}{\textrm{Number of Actual Positive Observations} }$
Precision can be thought of as an estimate of the probability that a positive prediction will actually be correct. Recall can be interpreted as an estimate of the probability that a positive observation will be correctly identified by the model.
In the cell below, we will use y_true and y_pred to calculate our model's precision and recall for Class A.
classA_precision = np.sum((y_pred == 'A') & (y_true == 'A') ) / np.sum(y_pred == 'A')
classA_recall = np.sum((y_pred == 'A') & (y_true == 'A') ) / np.sum(y_true == 'A')
print('Class A Precision:', round(classA_precision,4))
print('Class A Recall: ', round(classA_recall,4))
The precision score suggests that when our model predicts that an observation is in Class A, it will be correct about 66.13% of the time. The recall score suggests that our model will correctly identify about 60.59% of all observations that are actually in Class A.
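The same calculation can be repeated for Classes B and C. One compact way to do this, sketched below using the confusion matrix computed earlier, is to divide the diagonal of the matrix by the column sums (precision) and by the row sums (recall). This relies on the fact that sklearn orders the rows and columns of the matrix alphabetically, as A, B, C.
# Diagonal entries are the true positives for each class.
tp = cm.diagonal()

precision = tp / cm.sum(axis=0)   # TP / number of positive predictions
recall = tp / cm.sum(axis=1)      # TP / number of actual positive observations

pd.DataFrame({'precision': precision, 'recall': recall}, index=['A', 'B', 'C'])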
The classification_report function from sklearn.metrics can be used to automatically calculate the precision and recall scores for each class. We will display these results for our example below.
print(classification_report(y_true, y_pred, digits=4))
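If only the precision or recall values are needed as arrays, for example to store them programmatically, scikit-learn also provides precision_score and recall_score. The sketch below shows one way they might be called for this example.
from sklearn.metrics import precision_score, recall_score

# average=None returns one score per class, in the order given by labels.
print('Precision:', precision_score(y_true, y_pred, average=None, labels=['A', 'B', 'C']))
print('Recall:   ', recall_score(y_true, y_pred, average=None, labels=['A', 'B', 'C']))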