Lesson 07 - Feature Scaling

The following topics are discussed in this notebook:

  • Using Scikit-Learn to scale numerical features.
In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

For some supervised learning models, it is important for all of the numerical features to be on roughly the same scale. As it turns out, this does not apply to ordinary linear regression, but it does apply to certain regularized variants of linear regression that we will consider soon.

In this lesson, we will learn about two types of feature scaling (normalization and standardization), and will see how to use tools from Scikit-Learn to perform these scaling methods. We will illustrate these concepts using a subset of the Boston Housing dataset.

Load the Dataset

In the next few cells, we will load the dataset and select a subset of the features to work with.

In [2]:
df = pd.read_csv('data/BostonHousingV2.txt', sep='\t')
df = df.iloc[:, [5, 6, 11, 12, 16]]   # keep cmedv, crim, rm, age, ptratio
df.head(n=10)
Out[2]:
   cmedv     crim     rm    age  ptratio
0   24.0  0.00632  6.575   65.2     15.3
1   21.6  0.02731  6.421   78.9     17.8
2   34.7  0.02729  7.185   61.1     17.8
3   33.4  0.03237  6.998   45.8     18.7
4   36.2  0.06905  7.147   54.2     18.7
5   28.7  0.02985  6.430   58.7     18.7
6   22.9  0.08829  6.012   66.6     15.2
7   22.1  0.14455  6.172   96.1     15.2
8   16.5  0.21124  5.631  100.0     15.2
9   18.9  0.17004  6.004   85.9     15.2
In [3]:
X = df.iloc[:, 1:].values   # features: crim, rm, age, ptratio
y = df.iloc[:, 0].values    # target: cmedv

print(X.shape)
print(y.shape)
(506, 4)
(506,)
In [4]:
# Hold out 40% of the data, keeping 60% for training.
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.4, random_state=1)
# Split the holdout data evenly into validation (20%) and test (20%) sets.
X_val, X_test, y_val, y_test = train_test_split(X_holdout, y_holdout, test_size=0.5, random_state=1)

Explore Scale of Features in Training Set

In [5]:
np.set_printoptions(suppress=True)   # print floats without scientific notation

X_min = X_train.min(axis=0)
X_max = X_train.max(axis=0)
X_avg = X_train.mean(axis=0)
X_std = X_train.std(axis=0)

# Rows: per-feature minimum, maximum, mean, and standard deviation
print(np.vstack([X_min, X_max, X_avg, X_std]))
[[  0.00632      3.863        6.          12.6       ]
 [ 73.5341       8.398      100.          22.        ]
 [  3.7912705    6.2409967   68.92937294  18.38910891]
 [  8.8976109    0.67164858  28.48253177   2.17961911]]
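
The statistics above show why scaling might matter: crim and age span much wider ranges than rm and ptratio. As an optional alternative view (not part of the original notebook), a similar summary could be produced with pandas, assuming the X_train array and the column names from the cells above.

# Optional: summary statistics via pandas, using the column names from df.
# Note: describe() reports the sample standard deviation (ddof=1), so its
# std row will differ slightly from NumPy's default (ddof=0) used above.
print(pd.DataFrame(X_train, columns=['crim', 'rm', 'age', 'ptratio']).describe())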

Normalization

Let $X$ refer to a feature in a dataset. Let $x_{min}$ be the smallest value of $X$ for observations in the training set, and let $x_{max}$ be the largest value of $X$ for observations in the training set.

If $x_0$ is a value of $X$ for an observation (from any set), its normalized value is given by: $\large z_0 = \frac{x_0 - x_{min}}{x_{max} - x_{min}}$

When using normalization to scale features, each feature in the scaled training set will have values ranging between 0 and 1.
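
Before turning to Scikit-Learn, note that this formula can be applied directly with NumPy broadcasting. The sketch below is illustrative only, assuming the X_train array created above; it should agree with the MinMaxScaler result in the next cell.

# Manual normalization of the training features using the formula above.
col_min = X_train.min(axis=0)   # per-feature minimum (training set)
col_max = X_train.max(axis=0)   # per-feature maximum (training set)
Xn_manual = (X_train - col_min) / (col_max - col_min)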

We can perform normalization using the MinMaxScaler class from sklearn.preprocessing.

In [6]:
from sklearn.preprocessing import MinMaxScaler

n_scaler = MinMaxScaler()

Xn_train = n_scaler.fit_transform(X_train)

#n_scaler.fit(X_train)
#Xn_train = n_scaler.transform(X_train)

print(np.vstack([
    Xn_train.min(axis=0),
    Xn_train.max(axis=0),
    Xn_train.mean(axis=0),
    Xn_train.std(axis=0)
]))
[[ 0.          0.          0.          0.        ]
 [ 1.          1.          1.          1.        ]
 [ 0.05147647  0.52436531  0.66946141  0.61586265]
 [ 0.12101019  0.14810333  0.30300566  0.23187437]]

We can use the scaler that was fit to the training set to also scale the features in the validation and test sets. Because the minimum and maximum used in the formula come from the training set only, scaled values in these sets can fall slightly outside the interval [0, 1], as the output below shows.

In [7]:
Xn_val = n_scaler.transform(X_val)
Xn_test = n_scaler.transform(X_test)

print(np.vstack([
    Xn_val.min(axis=0),
    Xn_val.max(axis=0),
    Xn_val.mean(axis=0),
    Xn_val.std(axis=0)
]), '\n')

print(np.vstack([
    Xn_test.min(axis=0),
    Xn_test.max(axis=0),
    Xn_test.mean(axis=0),
    Xn_test.std(axis=0)
]))
[[ 0.00006311 -0.06659316  0.04042553  0.04255319]
 [ 1.21001722  1.06747519  1.          1.        ]
 [ 0.04882438  0.56006637  0.65115863  0.65809985]
 [ 0.13324011  0.15779041  0.29707514  0.20492739]] 

[[ 0.00018129  0.06063947 -0.03297872  0.        ]
 [ 0.51212725  1.08423374  1.          0.91489362]
 [ 0.0421103   0.53674903  0.66887776  0.6090947 ]
 [ 0.08130617  0.16755498  0.2891138   0.24445259]]

Standardization

Let $X$ refer to a feature in a dataset. Let $\bar x$ be the mean of $X$ values for observations in the training set, and let $s_X$ be the standard deviation of $X$ for observations in the training set.

If $x_0$ is a value of $X$ for an observation (from any set), its standardized value is given by: $\large z_0 = \frac{x_0 - \bar x}{s_X}$

When using standardization to scale features, each feature in the scaled training set will have a mean of 0 and a standard deviation of 1.
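
As with normalization, the formula can be written directly with NumPy; this sketch is illustrative only, assuming the X_train array from above. It should match the StandardScaler result in the next cell, since StandardScaler, like NumPy's default std, uses the biased (divide-by-n) standard deviation.

# Manual standardization of the training features using the formula above.
col_mean = X_train.mean(axis=0)   # per-feature mean (training set)
col_std = X_train.std(axis=0)     # per-feature standard deviation (training set)
Xs_manual = (X_train - col_mean) / col_std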

We can perform standardization using the StandardScaler class from sklearn.preprocessing.

In [8]:
from sklearn.preprocessing import StandardScaler

s_scaler = StandardScaler()

Xs_train = s_scaler.fit_transform(X_train)

#s_scaler.fit(X_train)
#Xs_train = s_scaler.transform(X_train)

print(np.vstack([
    Xs_train.min(axis=0),
    Xs_train.max(axis=0),
    Xs_train.mean(axis=0),
    Xs_train.std(axis=0)
])) 
[[-0.42538953 -3.54053709 -2.20940236 -2.6560186 ]
 [ 7.8383771   3.2115058   1.09086605  1.65666151]
 [ 0.          0.          0.          0.        ]
 [ 1.          1.          1.          1.        ]]

We can use the scaler that was fit to the training set to also scale the features in the validation and test sets. Because the mean and standard deviation come from the training set only, the scaled validation and test features will have means and standard deviations that are close to, but not exactly, 0 and 1, as the output below shows.

In [9]:
Xs_val = s_scaler.transform(X_val)
Xs_test = s_scaler.transform(X_test)

print(np.vstack([
    Xs_val.min(axis=0),
    Xs_val.max(axis=0),
    Xs_val.mean(axis=0),
    Xs_val.std(axis=0)
]), '\n')

print(np.vstack([
    Xs_test.min(axis=0),
    Xs_test.max(axis=0),
    Xs_test.mean(axis=0),
    Xs_test.std(axis=0)
]))
[[-0.42486804 -3.99017699 -2.07598726 -2.4725003 ]
 [ 9.5739104   3.66710119  1.09086605  1.65666151]
 [-0.02191623  0.24105506 -0.06040411  0.18215554]
 [ 1.10106518  1.06540759  0.98042769  0.88378628]] 

[[-0.42389137 -3.13109678 -2.318241   -2.6560186 ]
 [ 3.80671058  3.78025561  1.09086605  1.2896249 ]
 [-0.07739987  0.0836154  -0.0019262  -0.029188  ]
 [ 0.67189525  1.13133838  0.95415315  1.05424582]]

Comparison of Regression Models Trained on Scaled and Unscaled Data

Although scaling is critical for the effectiveness of certain learning algorithms, basic linear regression is not such an algorithm. To illustrate this fact, we will train three models: one on the unscaled data, one on the normalized data, and one on the standardized data. We will show that feature scaling does not affect model performance for linear regression.

Unscaled Data

In [10]:
mod1 = LinearRegression()
mod1.fit(X_train, y_train)

print('Training r-Squared:   ', mod1.score(X_train, y_train))
print('Validation r-Squared: ', mod1.score(X_val, y_val))
Training r-Squared:    0.6209574849008046
Validation r-Squared:  0.5048967029572833

Normalized Data

In [11]:
mod2 = LinearRegression()
mod2.fit(Xn_train, y_train)

print('Training r-Squared:   ', mod2.score(Xn_train, y_train))
print('Validation r-Squared: ', mod2.score(Xn_val, y_val))
Training r-Squared:    0.6209574849008046
Validation r-Squared:  0.5048967029572842

Standardized Data

In [12]:
mod3 = LinearRegression()
mod3.fit(Xs_train, y_train)

print('Training r-Squared:   ', mod3.score(Xs_train, y_train))
print('Validation r-Squared: ', mod3.score(Xs_val, y_val))
Training r-Squared:    0.6209574849008046
Validation r-Squared:  0.5048967029572844
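
The R-squared values are identical up to floating-point round-off. As an additional, unexecuted check (not part of the original notebook), one could verify that the three fitted models also produce essentially the same predictions; each comparison below should print True, assuming the models and scaled arrays defined above.

# Compare validation-set predictions from the three models.
# Any differences come only from floating-point arithmetic.
print(np.allclose(mod1.predict(X_val), mod2.predict(Xn_val)))
print(np.allclose(mod1.predict(X_val), mod3.predict(Xs_val)))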