Lesson 08 - L1 and L2 Regularization

The following topics are discussed in this notebook:

  • Using L1 and L2 regularizaion to combat overfitting.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Regularization

Regularization refers to a set of techniques used in supevised learning to attempt to reduce to likelihood of model overfitting. In this lession, we will consider L1 and L2 regularization, which penalize a model for being overly complex.

Generate Data

The following cell generates a dataset to be used in a regression task. The dataset contains one feature, and displays a nonlinear relationship.

In [2]:
np.random.seed(1)
n = 40
x = np.random.uniform(-4, 4, n)
X = x.reshape(-1,1)
y =  0.3 + 0.05 * x + 0.001 * x**7 + np.random.normal(0, 2, n)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=1)
print('X_train shape:', X_train.shape)
print('X_test shape: ', X_val.shape)

plt.close()
plt.scatter(X_train.reshape(-1,), y_train)
plt.scatter(X_val.reshape(-1,), y_val, c='r')
plt.show()
X_train shape: (24, 1)
X_test shape:  (16, 1)

Polynomial Models of Various Degrees

We will fit poynomial models of degrees 1, 3, 5, 7, 9, and 11 to this dataset.

In [3]:
from sklearn.preprocessing import PolynomialFeatures
In [4]:
degree = [1, 3, 5, 7, 9, 11]
x_curve = np.linspace(-4, 4, 100)
tr_score = []
va_score = []

plt.figure(figsize=[12,8])
for i in range(len(degree)):
    
    poly = PolynomialFeatures(degree[i])
    Xp_train = poly.fit_transform(X_train)
    Xp_val = poly.fit_transform(X_val)
    
    mod = LinearRegression()
    mod.fit(Xp_train, y_train)
    
    tr_score.append(mod.score(Xp_train, y_train))
    va_score.append(mod.score(Xp_val, y_val))

    xp_curve = poly.fit_transform(x_curve.reshape(-1,1))
    y_curve = mod.predict(xp_curve)
    plt.subplot(2,3,i+1)
    plt.scatter(X_train.reshape(-1,), y_train)
    plt.scatter(X_val.reshape(-1,), y_val, c='r')
    plt.ylim([-20,20])
    plt.plot(x_curve, y_curve, c='darkgreen')

plt.show()

for i in range(len(degree)):
    print('-- Degree', degree[i], '--')
    print('Training r2:  ', tr_score[i])
    print('Validation r2:', va_score[i], '\n')
-- Degree 1 --
Training r2:   0.6382751443295913
Validation r2: 0.5334020784605253 

-- Degree 3 --
Training r2:   0.8228042419821454
Validation r2: 0.8224250413792193 

-- Degree 5 --
Training r2:   0.8508252764216662
Validation r2: 0.8788413864576503 

-- Degree 7 --
Training r2:   0.8728242410836946
Validation r2: 0.12019555086909085 

-- Degree 9 --
Training r2:   0.8899490344773457
Validation r2: -0.10629843367512493 

-- Degree 11 --
Training r2:   0.9108177814282059
Validation r2: -101.55739174561239 

L2 Regularization (Ridge Regression)

L2 regularization is performed by adding an additional term to the loss function. This new term applies a penalty to the loss for having larger coefficients in the model. As such, it can be thought of as penalizing model complexity.

The loss function used in L2 regularizaion is:

$$\Large Loss = SSE + \alpha \sum_{i=1}^n \hat\beta_{i} ^2$$

Note that the penalty term does not apply to the intercept (bias) term in the model. The constant $\alpha$ is a model hyperparameter that we are free to choose.

In is important to scale features prior to applying L2 regularization.

In [5]:
from sklearn.preprocessing import StandardScaler
In [6]:
poly = PolynomialFeatures(11)

Xp_train = poly.fit_transform(X_train)
Xp_val = poly.transform(X_val)

scaler = StandardScaler()

scaler.fit(Xp_train)
Xps_train = scaler.transform(Xp_train)
Xps_val = scaler.transform(Xp_val)

print(Xps_train.shape)
print(Xps_val.shape)
(24, 12)
(16, 12)

Create a Ridge Model

We can use tools from sklearn to implement an L2-regularized regression (or Ridge) model.

When training a ridge model, we must select a value for $\alpha$. We will arbitrarily set that to 0.01 in our example.

In [7]:
from sklearn.linear_model import Ridge
In [8]:
L2_mod = Ridge(alpha=0.01)
L2_mod.fit(Xps_train, y_train)

print('Training r2:  ', L2_mod.score(Xps_train, y_train))
print('Validation r2:', L2_mod.score(Xps_val, y_val))
Training r2:   0.8762765562857949
Validation r2: 0.7289104622467528

The plot below shows the fit resulting from our Ridge model.

In [9]:
plt.figure(figsize=[6,4])
x_curve = np.linspace(-4, 4, 100)
xp_curve = poly.fit_transform(x_curve.reshape(-1,1))
xps_curve = scaler.transform(xp_curve)
y_curve = L2_mod.predict(xps_curve)
plt.scatter(X_train.reshape(-1,), y_train)
plt.scatter(X_val.reshape(-1,), y_val, c='r')
plt.ylim([-20,20])
plt.plot(x_curve, y_curve, c='darkgreen')
plt.show()

Selecting Alpha

We got decent (but not great) results for our choice of $\alpha$. Perhaps we could have done better with a different choice for this hyperparameter. We will write a loop to consider several values of $\alpha$. We will compare the resulting models based on their validation scores.

In [10]:
tr_score = []
va_score = []

exp_list = np.linspace(-3,10, 100)

for k in exp_list:
    temp_mod = Ridge(alpha=10**k)
    temp_mod.fit(Xps_train, y_train)

    tr_score.append(temp_mod.score(Xps_train, y_train))
    va_score.append(temp_mod.score(Xps_val, y_val))
    
plt.figure(figsize=([6,4]))
plt.plot(exp_list, tr_score, label='Training Score')
plt.plot(exp_list, va_score, label='Validation Score')
plt.legend()
plt.show()
In [11]:
idx = np.argmax(va_score)
print(exp_list[idx])
-0.8989898989898988
In [12]:
L2_mod_2 = Ridge(alpha=10**(-0.9))
L2_mod_2.fit(Xps_train, y_train)

print('Training r2:  ', L2_mod_2.score(Xps_train, y_train))
print('Validation r2:', L2_mod_2.score(Xps_val, y_val))
Training r2:   0.8622909515669346
Validation r2: 0.9078894466135733
In [13]:
plt.figure(figsize=[6,4])
x_curve = np.linspace(-4, 4, 100)
xp_curve = poly.fit_transform(x_curve.reshape(-1,1))
xps_curve = scaler.transform(xp_curve)
y_curve = L2_mod_2.predict(xps_curve)
plt.scatter(X_train.reshape(-1,), y_train)
plt.scatter(X_val.reshape(-1,), y_val, c='r')
plt.ylim([-20,20])
plt.plot(x_curve, y_curve, c='darkgreen')
plt.show()

Visualizing Effect on Coefficients

Since L2 regularization penalizes models for having large coefficients, selecting larger values of $\alpha$ will generally result in models with smaller coefficients. The plot below visualizes the effects on the coefficient estimates as we increase the value of alpha.

In [14]:
coef_list = []

exp_list = np.linspace(-3,0, 100)

for k in exp_list:
    temp_mod = Ridge(alpha=10**k)
    temp_mod.fit(Xps_train, y_train)

    coef_list.append(temp_mod.coef_)
    
coef_array = np.array(coef_list).transpose()
    
plt.figure(figsize=([12,8]))

for i in range(coef_array.shape[0]):
    
    plt.plot(exp_list, coef_array[i, :], label='Beta' + str(i+1))
    
plt.legend()
plt.title('Effect of Alpha on Model Coefficients')
plt.xlabel('log(alpha)')
plt.ylabel('Coefficient Estimate')
plt.show()

L1 Regularization (Lasso)

L1 regularization is similar to L2 regression conceptually, but instead of squaring the model coefficients in the penalty term, we take their absolute value.

The loss function used in L2 regularizaion is:

$$\Large Loss = SSE + \alpha \sum_{i=1}^n \left|\hat\beta_{i} \right|$$

As before, penalty term does not apply to the intercept (bias) term in the model.

We can use tools from sklearn to implement an L1-regularized regression (or Lasso) model.

As with ridge regression, when training a lasso model, we must select a value for $\alpha$. We will arbitrarily set that to 1.5 in our example.

In [15]:
from sklearn.linear_model import Lasso
In [16]:
L1_mod = Lasso(alpha=1.5)
L1_mod.fit(Xps_train, y_train)

print('Training r2:  ', L1_mod.score(Xps_train, y_train))
print('Validation r2:', L1_mod.score(Xps_val, y_val))
Training r2:   0.6659542441714752
Validation r2: 0.6489201024959461
In [17]:
plt.figure(figsize=[6,4])
x_curve = np.linspace(-4, 4, 100)
xp_curve = poly.fit_transform(x_curve.reshape(-1,1))
xps_curve = scaler.transform(xp_curve)
y_curve = L1_mod.predict(xps_curve)
plt.scatter(X_train.reshape(-1,), y_train)
plt.scatter(X_val.reshape(-1,), y_val, c='r')
plt.ylim([-20,20])
plt.plot(x_curve, y_curve, c='darkgreen')
plt.show()

Selecting Alpha

We will write a loop to consider several values of $\alpha$, comparing the resulting models based on their validation scores.

In [18]:
tr_score = []
va_score = []

exp_list = np.linspace(-3,1, 100)

for k in exp_list:
    temp_mod = Lasso(alpha=10**k, max_iter=1e8)
    temp_mod.fit(Xps_train, y_train)

    tr_score.append(temp_mod.score(Xps_train, y_train))
    va_score.append(temp_mod.score(Xps_val, y_val))
    
plt.figure(figsize=([6,4]))
plt.plot(exp_list, tr_score, label='Training Score')
plt.plot(exp_list, va_score, label='Validation Score')
plt.legend()
plt.show()
In [19]:
idx = np.argmax(va_score)
print(exp_list[idx])
-2.111111111111111
In [20]:
L1_mod_2 = Lasso(alpha=10**(-2.1))
L1_mod_2.fit(Xps_train, y_train)

print('Training r2:  ', L1_mod_2.score(Xps_train, y_train))
print('Validation r2:', L1_mod_2.score(Xps_val, y_val))
Training r2:   0.8606870840036821
Validation r2: 0.9074638907936589
In [21]:
plt.figure(figsize=[6,4])
x_curve = np.linspace(-4, 4, 100)
xp_curve = poly.fit_transform(x_curve.reshape(-1,1))
xps_curve = scaler.transform(xp_curve)
y_curve = L1_mod_2.predict(xps_curve)
plt.scatter(X_train.reshape(-1,), y_train)
plt.scatter(X_val.reshape(-1,), y_val, c='r')
plt.ylim([-20,20])
plt.plot(x_curve, y_curve, c='darkgreen')
plt.show()

Visualizing Effect on Coefficients

In L1 regularization, like L2 regularization, selecting larger values of $\alpha$ applies a larger penalty for having large coefficients, and thus pushes coefficients toward 0. L1 regularization is somewhat more agressive in this, however. In L1 regularization, some model coefficients will actually be set to zero for sufficiently large values of $\alpha$, effectively removing the related features from the model. As a result, L1 regularization can be thought of as performing a type of automated variable selection.

In [22]:
coef_list = []

exp_list = np.linspace(-3,-1, 100)

for k in exp_list:
    temp_mod = Lasso(alpha=10**k, max_iter=1e8)
    temp_mod.fit(Xps_train, y_train)

    coef_list.append(temp_mod.coef_)
    
coef_array = np.array(coef_list).transpose()
    
plt.figure(figsize=([12,8]))

for i in range(coef_array.shape[0]):
    
    plt.plot(exp_list, coef_array[i, :], label='Beta' + str(i+1))
    
plt.legend()
plt.title('Effect of Alpha on Model Coefficients')
plt.xlabel('log(alpha)')
plt.ylabel('Coefficient Estimate')
plt.show()