Lesson 22 - Cross Validation

Additional Resources

  • Hands-On Machine Learning, pages 83 - 84
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

from sklearn.datasets import make_classification

K-Fold Cross Validation

K-Fold cross validation is a validation technique in which the training data is split into K (roughly) evenly sized portions, called folds. K versions of a model are constructed using the same hyperparameters. Each model is trained on K-1 folds and validated on the remaining fold, so that each fold serves as the validation set for exactly one model. The overall validation performance of the model is reported as the average of the K scores.

Common values for K are 3, 5, and 10. When K = n, the number of training observations, each fold contains a single observation and we refer to the technique as "leave-one-out cross-validation".

We will see several ways of implementing cross-validation in this lesson.
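
A minimal sketch of the procedure described above, assuming contiguous folds (the function k_fold_scores and the make_model factory are hypothetical names, not part of scikit-learn; we will use scikit-learn's built-in tools shortly):

import numpy as np

def k_fold_scores(make_model, X, y, k):
    n = len(y)
    fold_sizes = np.full(k, n // k)      # (roughly) evenly sized folds
    fold_sizes[:n % k] += 1              # distribute any remainder
    scores = []
    start = 0
    for size in fold_sizes:
        stop = start + size
        mask = np.ones(n, dtype=bool)
        mask[start:stop] = False         # hold out one fold for validation
        model = make_model()             # fresh model, same hyperparameters
        model.fit(X[mask], y[mask])      # train on the remaining K-1 folds
        scores.append(model.score(X[~mask], y[~mask]))
        start = stop
    return np.mean(scores)               # average of the K validation scores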

Generate Data

In [2]:
np.random.seed(1)
X, y = make_classification(n_samples=250, n_features=6, n_informative=6, n_redundant=0, n_classes=7, class_sep=2)
multipliers = [0.01, 100, 0.1, 10, 1, 5]
for i in range(0,6):
    X[:,i] = multipliers[i] * X[:,i]

np.set_printoptions(suppress=True, precision=2)
print('Distribution of Features:')
print('Min: ', np.min(X, axis=0))
print('Max: ', np.max(X, axis=0))
print('Mean:', np.mean(X, axis=0))
print('SDev:', np.std(X, axis=0))
Distribution of Features:
Min:  [  -0.05 -544.61   -0.54  -45.62   -5.04  -32.31]
Max:  [  0.05 486.02   0.47  57.36   5.63  26.1 ]
Mean: [  0.01 -67.25  -0.02   3.49   0.68  -0.33]
SDev: [  0.02 223.13   0.23  23.84   2.44  12.86]
In [3]:
np.set_printoptions(suppress=True, precision=4)

Naive Implementation of 5-Fold Cross-Validation

In [4]:
tr_acc = []
va_acc = []

for k in range(0,5):
    mask = np.ones(250).astype('bool')
    mask[50*k : 50*(k+1)] = False
    
    temp_mod = LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=1000)
    temp_mod.fit(X[mask,:], y[mask])
    
    tr_acc.append(temp_mod.score(X[mask,:], y[mask]))
    va_acc.append(temp_mod.score(X[~mask,:], y[~mask]))
    
print('Training Scores:  ', tr_acc)
print('Validation Scores:', va_acc)

print('Avg Valid Score:  ', np.mean(va_acc))
Training Scores:   [0.585, 0.63, 0.565, 0.59, 0.575]
Validation Scores: [0.52, 0.42, 0.56, 0.54, 0.62]
Avg Valid Score:   0.532

Note that the validation scores vary considerably across the different folds. If we were to perform traditional validation with a single holdout set, we would come away with very different impressions of the model's performance if the validation set happened to be the 2nd fold as opposed to the 5th fold.
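
One way to quantify this spread, using the lists computed above:

print('Std of Validation Scores:  ', np.std(va_acc))
print('Range of Validation Scores:', np.max(va_acc) - np.min(va_acc))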

Using cross_val_score

Scikit-Learn comes with several tools for performing cross validation. The most basic of these is cross_val_score.

In [5]:
from sklearn.model_selection import cross_val_score
In [6]:
mod_01 = LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=1000)

cv_results = cross_val_score(mod_01, X, y, cv=5, scoring='accuracy')

print(cv_results)
print(np.mean(cv_results))
[0.4906 0.48   0.5306 0.6122 0.4898]
0.5206438197920679
In [7]:
models = [
    LogisticRegression(C=1, solver='lbfgs', multi_class='ovr', max_iter=1000), 
    LogisticRegression(C=10, solver='lbfgs', multi_class='ovr', max_iter=1000), 
    KNeighborsClassifier(n_neighbors=5),
    KNeighborsClassifier(n_neighbors=10),
    DecisionTreeClassifier(max_depth=2, random_state=1), 
    DecisionTreeClassifier(max_depth=4, random_state=1),
    SVC(kernel='rbf', C=1, gamma=0.1),
    SVC(kernel='rbf', C=1, gamma=1)
]

cv_scores = []
for mod in models:
    cv_results = cross_val_score(mod, X, y, cv=5, scoring='accuracy')
    cv_scores.append(np.mean(cv_results))
    
print(np.array(cv_scores))
[0.5206 0.5491 0.3592 0.3322 0.4194 0.5909 0.1761 0.1479]
In [8]:
idx = np.argmax(cv_scores)
print(models[idx])
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1,
            splitter='best')

Using KFold for Cross-Validation

If we need more flexibility when performing cross-validation, we can use the KFold class from sklearn. This class creates the folds for us but does not fit any models. This is useful when, for instance, we need to preprocess our data: steps like scaling should be fit on the training folds only and then applied to the validation fold, so that no information leaks from the validation fold into training.

In [9]:
from sklearn.model_selection import KFold
In [10]:
five_fold = KFold(n_splits=5, shuffle=True, random_state=1)
split = five_fold.split(X, y)

for train_index, val_index in split:
    print(val_index)
[  0   4  11  18  19  27  29  31  33  34  35  38  39  44  51  58  62  67
  73  78  91  93  95 102 110 116 119 120 127 160 161 180 182 184 186 188
 205 208 216 224 225 226 229 230 234 236 239 242 246 249]
[  5  14  16  17  21  28  40  42  47  48  53  55  56  59  69  84  85  88
  89  90  99 106 107 112 117 118 122 130 132 138 148 150 159 168 171 172
 176 177 179 181 185 187 189 197 199 207 222 228 232 240]
[  2   6  12  13  36  45  54  66  70  75  81  82  94  97  98 103 105 108
 114 123 124 145 147 152 154 158 163 164 167 169 170 173 190 193 194 201
 202 210 217 218 220 221 227 231 233 238 241 243 244 247]
[  3   9  10  15  23  24  26  32  41  43  46  49  52  64  65  74  76  77
  80  83  87  92 100 104 109 111 113 125 126 135 136 143 149 151 153 155
 162 165 166 174 183 191 195 196 198 206 211 213 214 248]
[  1   7   8  20  22  25  30  37  50  57  60  61  63  68  71  72  79  86
  96 101 115 121 128 129 131 133 134 137 139 140 141 142 144 146 156 157
 175 178 192 200 203 204 209 212 215 219 223 235 237 245]
In [11]:
from sklearn.preprocessing import StandardScaler
In [12]:
five_fold = KFold(n_splits=5, shuffle=True, random_state=1)

cv_scores = []
for train_index, val_index in five_fold.split(X, y):
    X_train, y_train = X[train_index, :], y[train_index]
    X_val, y_val = X[val_index, :], y[val_index]
    
    scaler = StandardScaler()
    Xs_train = scaler.fit_transform(X_train)
    Xs_val = scaler.transform(X_val)
    
    fold_scores = []
    for mod in models:
        mod.fit(Xs_train, y_train)
        fold_scores.append(mod.score(Xs_val, y_val))
    
    cv_scores.append(fold_scores)

cv_scores = np.array(cv_scores)
cv_means = np.mean(cv_scores, axis=0)

print(cv_scores)
print()
print(cv_means)         
[[0.54 0.52 0.96 0.96 0.5  0.72 0.94 0.96]
 [0.7  0.68 0.92 0.88 0.36 0.6  0.86 0.94]
 [0.48 0.52 0.88 0.82 0.36 0.58 0.88 0.9 ]
 [0.6  0.62 0.92 0.88 0.48 0.7  0.9  0.92]
 [0.62 0.62 0.88 0.88 0.42 0.72 0.84 0.92]]

[0.588 0.592 0.912 0.884 0.424 0.664 0.884 0.928]
In [13]:
idx = np.argmax(cv_means)
print(models[idx])
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

After identifying the preferred model using cross validation, you should retrain that model on the entire training set.

In [14]:
svm_mod = SVC(kernel='rbf', C=1, gamma=1)
svm_mod.fit(X, y)
print('Training Accuracy:', svm_mod.score(X,y))
Training Accuracy: 1.0
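
As an aside, the per-fold scaling we performed manually above can also be handled by bundling the scaler and the model into a scikit-learn Pipeline, which cross_val_score will then refit separately within each fold. A minimal sketch with the SVM selected above:

from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1, gamma=1))
pipe_scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(np.mean(pipe_scores))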

Using StratifiedKFold for Cross Validation

The StratifiedKFold class from sklearn behaves similarly to KFold, except that it attempts to create folds in which the distribution of the classes within each individual fold is similar to that of the full training set. We verify this behavior after the results below.

In [15]:
from sklearn.model_selection import StratifiedKFold
In [16]:
strat_five_fold = StratifiedKFold(n_splits=5)

strat_cv_scores = []
for train_index, val_index in strat_five_fold.split(X, y):
    X_train, y_train = X[train_index, :], y[train_index]
    X_val, y_val = X[val_index, :], y[val_index]
    
    scaler = StandardScaler()
    Xs_train = scaler.fit_transform(X_train)
    Xs_val = scaler.transform(X_val)
    
    fold_scores = []
    for mod in models:
        mod.fit(Xs_train, y_train)
        fold_scores.append(mod.score(Xs_val, y_val))
    
    strat_cv_scores.append(fold_scores)

strat_cv_scores = np.array(strat_cv_scores)
strat_cv_means = np.mean(strat_cv_scores, axis=0)

print(strat_cv_scores)
print()
print(strat_cv_means)
[[0.5094 0.5094 0.9245 0.9057 0.4717 0.6604 0.8868 0.9434]
 [0.56   0.58   0.92   0.9    0.36   0.6    0.9    0.96  ]
 [0.5306 0.5306 0.9184 0.8367 0.4286 0.6327 0.8776 0.898 ]
 [0.6327 0.6122 0.9388 0.9388 0.3673 0.5918 0.898  0.9592]
 [0.5714 0.6122 0.898  0.8367 0.4694 0.4694 0.8776 0.8571]]

[0.5608 0.5689 0.9199 0.8836 0.4194 0.5909 0.888  0.9235]
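
To confirm that the folds are in fact stratified, we can count the observations from each class in every validation fold (a quick check using np.bincount; with 250 samples and 7 classes, each class should appear roughly 7 times in each fold of 50):

strat = StratifiedKFold(n_splits=5)
for train_index, val_index in strat.split(X, y):
    print(np.bincount(y[val_index], minlength=7))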

Comparison of KFold and StratifiedKFold Results

In [17]:
print('--Mean of CV Scores--')
print('Ordinary:  ', cv_means)
print('Stratified:', strat_cv_means)
print()

print('--StdDev of CV Scores--')
print('Ordinary:  ', np.std(cv_scores, axis=0))
print('Stratified:', np.std(strat_cv_scores, axis=0))
--Mean of CV Scores--
Ordinary:   [0.588 0.592 0.912 0.884 0.424 0.664 0.884 0.928]
Stratified: [0.5608 0.5689 0.9199 0.8836 0.4194 0.5909 0.888  0.9235]

--StdDev of CV Scores--
Ordinary:   [0.0744 0.0627 0.0299 0.0445 0.0585 0.0612 0.0344 0.0204]
Stratified: [0.042  0.0421 0.0131 0.0405 0.0481 0.0655 0.0096 0.0401]

Leave-One-Out Cross Validation

We can perform leave-one-out cross validation by using KFold with the number of splits equal to the number of observations. Note that this fits each model n times, so it can be slow for large datasets.

In [18]:
loo = KFold(n_splits=250, shuffle=False)

loo_scores = []
for train_index, val_index in loo.split(X, y):
    X_train, y_train = X[train_index, :], y[train_index]
    X_val, y_val = X[val_index, :], y[val_index]
    
    scaler = StandardScaler()
    Xs_train = scaler.fit_transform(X_train)
    Xs_val = scaler.transform(X_val)
    
    fold_scores = []
    for mod in models:
        mod.fit(Xs_train, y_train)
        fold_scores.append(mod.score(Xs_val, y_val))
    
    loo_scores.append(fold_scores)

loo_scores = np.array(loo_scores)
loo_means = np.mean(loo_scores, axis=0)

print(loo_means)    
[0.572 0.584 0.936 0.9   0.456 0.588 0.896 0.932]

Scikit-learn also has a LeaveOneOut class for performing leave-one-out cross validation.

In [19]:
from sklearn.model_selection import LeaveOneOut
In [20]:
loo = LeaveOneOut()

loo_scores = []
for train_index, val_index in loo.split(X, y):
    X_train, y_train = X[train_index, :], y[train_index]
    X_val, y_val = X[val_index, :], y[val_index]
    
    scaler = StandardScaler()
    Xs_train = scaler.fit_transform(X_train)
    Xs_val = scaler.transform(X_val)
    
    fold_scores = []
    for mod in models:
        mod.fit(Xs_train, y_train)
        fold_scores.append(mod.score(Xs_val, y_val))
    
    loo_scores.append(fold_scores)

loo_scores = np.array(loo_scores)
loo_means = np.mean(loo_scores, axis=0)

print(loo_means)   
[0.572 0.584 0.936 0.9   0.456 0.588 0.896 0.932]
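
Splitter objects such as LeaveOneOut can also be passed directly as the cv argument of cross_val_score. A sketch that mirrors the loop above for a single model, again using a Pipeline so that the scaler is refit within each split:

from sklearn.pipeline import make_pipeline

loo_pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1, gamma=1))
loo_cv_scores = cross_val_score(loo_pipe, X, y, cv=LeaveOneOut(), scoring='accuracy')
print(np.mean(loo_cv_scores))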