import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
For some supervised learning models, it is important for all of the numerical features to be on roughly the same scale. As it turns out, this does not apply to linear regression, but it does apply to certain variants of linear regression that we will consider soon.
In this lesson, we will learn about two types of feature scaling (normalization and standardization), and will see how to use tools from Scikit-Learn to perform these scaling methods. We will illustrate these concepts using a subset of the Boston Housing dataset.
In the next few cells, we will load the dataset and will select a subset of the features to work with.
df = pd.read_csv('data/BostonHousingV2.txt', sep='\t')
df = df.iloc[:, [5, 6, 11, 12, 16]]
df.head(n=10)
X = df.iloc[:, 1:].values  # features: every retained column after the first
y = df.iloc[:, 0].values   # target: the first retained column
print(X.shape)
print(y.shape)
# Hold out 40% of the data, then split the holdout in half,
# giving a 60% / 20% / 20% train / validation / test split.
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_holdout, y_holdout, test_size=0.5, random_state=1)
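As a quick check, we can print the shapes of the three feature arrays; roughly 60%, 20%, and 20% of the observations should land in the training, validation, and test sets, respectively.
# Confirm the sizes of the three splits
print(X_train.shape, X_val.shape, X_test.shape)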
np.set_printoptions(suppress=True)  # print floats in plain (non-scientific) notation

# Column-wise summary statistics for the (unscaled) training features
X_min = X_train.min(axis=0)
X_max = X_train.max(axis=0)
X_avg = X_train.mean(axis=0)
X_std = X_train.std(axis=0)
print(np.vstack([X_min, X_max, X_avg, X_std]))  # rows: min, max, mean, std
Let $X$ refer to a feature in a dataset. Let $x_{min}$ be the smallest value of $X$ for observations in the training set, and let $x_{max}$ be the largest value of $X$ for observations in the training set.
If $x_0$ is a value of $X$ for an observation (from any set), its normalized value is given by: $\large z_0 = \frac{x_0 - x_{min}}{x_{max} - x_{min}}$
When using normalization to scale features, each feature in the scaled training set will have values ranging between 0 and 1.
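Before turning to Scikit-Learn, we can sketch the formula directly in NumPy using the X_min and X_max arrays computed above. In this hand-rolled version (the name Xn_manual is used only for illustration), every column of the result should range from 0 to 1.
# Min-max normalization applied by hand, using the training-set
# minimums and maximums computed earlier.
Xn_manual = (X_train - X_min) / (X_max - X_min)
print(Xn_manual.min(axis=0))  # should be all zeros
print(Xn_manual.max(axis=0))  # should be all ones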
We can perform normalization using the MinMaxScaler class from sklearn.preprocessing.
from sklearn.preprocessing import MinMaxScaler
n_scaler = MinMaxScaler()
Xn_train = n_scaler.fit_transform(X_train)  # fit to the training data, then transform it
# Equivalent two-step version:
# n_scaler.fit(X_train)
# Xn_train = n_scaler.transform(X_train)
# Summary statistics for the normalized training features (rows: min, max, mean, std)
print(np.vstack([
    Xn_train.min(axis=0),
    Xn_train.max(axis=0),
    Xn_train.mean(axis=0),
    Xn_train.std(axis=0)
]))
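A fitted MinMaxScaler can also undo the transformation with its inverse_transform method. As a quick sanity check, recovering the original training features should succeed up to floating-point error.
# Sanity check: inverse_transform should recover the original features.
print(np.allclose(n_scaler.inverse_transform(Xn_train), X_train))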
We can use the scaler that was fit to the training set to also scale the features in the validation and test sets. Note that since the scaling parameters come from the training set, values in the scaled validation and test sets are not guaranteed to lie within [0, 1].
Xn_val = n_scaler.transform(X_val)
Xn_test = n_scaler.transform(X_test)
# Statistics for the normalized validation set (rows: min, max, mean, std)
print(np.vstack([
    Xn_val.min(axis=0),
    Xn_val.max(axis=0),
    Xn_val.mean(axis=0),
    Xn_val.std(axis=0)
]), '\n')

# Statistics for the normalized test set (rows: min, max, mean, std)
print(np.vstack([
    Xn_test.min(axis=0),
    Xn_test.max(axis=0),
    Xn_test.mean(axis=0),
    Xn_test.std(axis=0)
]))
Let $X$ refer to a feature in a dataset. Let $\bar x$ be the mean of $X$ values for observations in the training set, and let $s_X$ be the standard deviation of $X$ for observations in the training set.
If $x_0$ is a value of $X$ for an observation (from any set), its standardized value is given by: $\large z_0 = \frac{x_0 - \bar x}{s_X}$
When using standardization to scale features, each feature in the scaled training set will have a mean of 0 and a standard deviation of 1.
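As with normalization, we can apply the standardization formula by hand using the X_avg and X_std arrays computed above (Xs_manual is an illustrative name); each column of the result should have mean 0 and standard deviation 1.
# Standardization applied by hand, using the training-set
# means and standard deviations computed earlier.
Xs_manual = (X_train - X_avg) / X_std
print(Xs_manual.mean(axis=0))  # should be (approximately) all zeros
print(Xs_manual.std(axis=0))   # should be all ones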
We can perform standardization using the StandardScaler class from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler
s_scaler = StandardScaler()
Xs_train = s_scaler.fit_transform(X_train)  # fit to the training data, then transform it
# Equivalent two-step version:
# s_scaler.fit(X_train)
# Xs_train = s_scaler.transform(X_train)
# Summary statistics for the standardized training features (rows: min, max, mean, std)
print(np.vstack([
    Xs_train.min(axis=0),
    Xs_train.max(axis=0),
    Xs_train.mean(axis=0),
    Xs_train.std(axis=0)
]))
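The fitted scaler stores the training-set statistics it uses in its mean_ and scale_ attributes; these should match the X_avg and X_std arrays we computed earlier.
# Training-set statistics stored by the fitted scaler
print(s_scaler.mean_)   # should match X_avg
print(s_scaler.scale_)  # should match X_std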
We can use the scaler that was fit to the training set to also scale the features in the validation and test sets. Since the means and standard deviations come from the training set, the columns of the scaled validation and test sets will have means close to, but not exactly, 0 and standard deviations close to, but not exactly, 1.
Xs_val = s_scaler.transform(X_val)
Xs_test = s_scaler.transform(X_test)
# Statistics for the standardized validation set (rows: min, max, mean, std)
print(np.vstack([
    Xs_val.min(axis=0),
    Xs_val.max(axis=0),
    Xs_val.mean(axis=0),
    Xs_val.std(axis=0)
]), '\n')

# Statistics for the standardized test set (rows: min, max, mean, std)
print(np.vstack([
    Xs_test.min(axis=0),
    Xs_test.max(axis=0),
    Xs_test.mean(axis=0),
    Xs_test.std(axis=0)
]))
Although scaling is critical for the effectiveness of certain learning algorithms, basic linear regression is not one of them. To illustrate this fact, we will train three models: one on the unscaled data, one on the normalized data, and one on the standardized data. We will see that feature scaling does not affect model performance for linear regression.
# Model 1: trained on the unscaled features
mod1 = LinearRegression()
mod1.fit(X_train, y_train)
print('Training r-Squared: ', mod1.score(X_train, y_train))
print('Validation r-Squared: ', mod1.score(X_val, y_val))
# Model 2: trained on the normalized features
mod2 = LinearRegression()
mod2.fit(Xn_train, y_train)
print('Training r-Squared: ', mod2.score(Xn_train, y_train))
print('Validation r-Squared: ', mod2.score(Xn_val, y_val))
# Model 3: trained on the standardized features
mod3 = LinearRegression()
mod3.fit(Xs_train, y_train)
print('Training r-Squared: ', mod3.score(Xs_train, y_train))
print('Validation r-Squared: ', mod3.score(Xs_val, y_val))
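As a final check, the three models should produce identical predictions up to floating-point error: normalization and standardization are invertible linear transformations of the features, and linear regression absorbs any such transformation into its coefficients and intercept.
# All three models should agree on every validation-set prediction.
print(np.allclose(mod1.predict(X_val), mod2.predict(Xn_val)))
print(np.allclose(mod1.predict(X_val), mod3.predict(Xs_val)))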