Recall that in a regression task, we wish to create a model capable of estimating the value of a continuous, real-valued label (or response variable) $Y$. The model will use values of one or more features (or predictor variables) $X^{(1)}, X^{(2)}, ..., X^{(m)}$ as inputs.
There are many different types of algorithms that you might consider creating for a given regression task. Some will work better on certain datasets than others. In this lesson, we will discuss the most basic type of regression algorithm, linear regression.
A linear regression model uses the features $X^{(1)}, X^{(2)}, ..., X^{(m)}$ to generate predictions for the real-valued label $Y$ according to the following linear equation:
$$\large \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X^{(1)} + \hat{\beta}_2 X^{(2)} + ... + \hat{\beta}_m X^{(m)}$$

The model parameters $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, ..., \hat{\beta}_m$ are calculated by a learning algorithm to generate the model that provides the best fit for the given data. The learning algorithm measures the quality of the fit using the sum of squared errors loss function.
The sum of squared errors (SSE) loss is a common loss function used in regression problems. It provides a way of scoring the performance of a regression model on a particular dataset using the size of the errors in the estimates or predictions generated by that model on that dataset. We will now explain how this loss function is calculated.
Let $y_1, y_2, y_3, ..., y_n$ represent several observed values for a real-valued label $Y$. Assume that $\hat y_1,\hat y_2, \hat y_3, ..., \hat y_n$ are corresponding estimates generated by some regression model. The sum of squared errors (SSE) loss for the model, as calculated on this data, is given by:

$$\large SSE = \left(y_1 - \hat y_1\right)^2 + \left(y_2 - \hat y_2\right)^2 + ... + \left(y_n - \hat y_n\right)^2$$

The formula for SSE can also be written more compactly as:

$$\large SSE = \sum_{i=1}^{n} \left(y_i - \hat y_i\right)^2$$
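To make the definition concrete, the cell below is a minimal sketch of how SSE could be computed with NumPy. The arrays y_obs and y_hat are hypothetical values invented for illustration; they are not part of any dataset used later in this lesson.

import numpy as np

# Hypothetical observed labels and corresponding model estimates (made-up values)
y_obs = np.array([3.0, 5.5, 4.2, 7.1])
y_hat = np.array([2.8, 5.9, 4.0, 6.5])

# SSE is the sum of the squared differences between observed and estimated values
sse = np.sum((y_obs - y_hat) ** 2)
print('SSE:', sse)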
The goal of the learning algorithm in a linear regression problem is to find the values of $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, ..., \hat{\beta}_m$ that result in a model with the smallest SSE.
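For linear regression, this minimization has a well-known closed-form solution, so the optimal parameters can be computed directly rather than searched for iteratively. The cell below is a small sketch of that idea using NumPy's least-squares solver; the arrays x_demo and y_demo are made-up values for illustration only.

import numpy as np

# Made-up feature and label values for illustration
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Prepend a column of ones so the first fitted parameter plays the role of the intercept
A = np.column_stack([np.ones_like(x_demo), x_demo])

# np.linalg.lstsq returns the parameter values that minimize the SSE
beta_hat, _, _, _ = np.linalg.lstsq(A, y_demo, rcond=None)
print('Intercept:', beta_hat[0])
print('Slope:    ', beta_hat[1])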
Linear regression models are created in Scikit-Learn as instances of the `LinearRegression` class, which is found in the `sklearn.linear_model` module. We will import that now, along with some other packages that we will need in this lesson.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
We will now look at some examples of creating linear regression models in Scikit-Learn.
As our first example, we will consider a regression problem with only one feature, $X$. A linear regression model with only one feature is called a simple linear regression model.
In the cell below, we use NumPy to randomly generate a synthetic dataset consisting of a single predictor and a continuous label. Notice that the data below was generated by selecting points on a line with an intercept of 5 and a slope of 1.4, and then adding some randomly generated, normally distributed "noise" to the y-values.
np.random.seed(3)
n = 200
x = np.random.normal(loc=5, scale=1.8, size=n)
y = 5 + 1.4 * x + np.random.normal(loc=0, scale=1.5, size=n)
# The features need to be stored in a 2D array or DataFrame
X = x.reshape(n,1)
print('Shape of X:', X.shape)
print('Shape of y:', y.shape)
We display a scatter plot of our synthetic dataset below.
plt.figure(figsize=[6,4])
plt.scatter(x, y, alpha=0.6, color='plum', edgecolor='k')
plt.xlim([0,10.5])
plt.ylim([0,25])
plt.xlabel('x')
plt.ylabel('y')
plt.show()
We will now split the dataset into training and test sets, using an 80/20 split. We will not create a validation set in this instance, as we will not be comparing different models in this example.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=1)
The figure below displays scatter plots of the training and test sets.
plt.figure(figsize=[12,4])
plt.subplot(1,2,1)
plt.scatter(X_train, y_train, alpha=0.6, color='cornflowerblue', edgecolor='k')
plt.xlim([0,10.5])
plt.ylim([0,25])
plt.xlabel('x')
plt.ylabel('y')
plt.title('Training Set')
plt.subplot(1,2,2)
plt.scatter(X_test, y_test, alpha=0.6, color='salmon', edgecolor='k')
plt.xlim([0,10.5])
plt.ylim([0,25])
plt.xlabel('x')
plt.ylabel('y')
plt.title('Test Set')
plt.show()
We will now use the `LinearRegression` class from Scikit-Learn to create the regression model. The trained model object will contain two new attributes: `intercept_` and `coef_`. The `intercept_` attribute contains the optimal value of $\hat\beta_0$ in the fitted model. The `coef_` attribute contains a list of values of the other $\hat\beta_i$ parameters. In this case, the list will contain only $\hat\beta_1$.
model_1 = LinearRegression()
model_1.fit(X_train, y_train)
# We print the model parameters.
print('Intercept:', model_1.intercept_)
print('Slope: ', model_1.coef_)
From the output above, we see that the equation of our fitted linear regression model is:
$$\large\hat Y = 4.9954 + 1.4013 \cdot X $$

Notice that the values for the intercept and slope in our fitted model are very close to the associated values for the line that we used to generate the data.
Each Scikit-Learn model object comes equipped with a `score()` method that can be used to measure the model's performance. For classification models, `score()` returns the model's accuracy on the provided dataset. For regression models, `score()` returns the model's r-squared score, as calculated on the provided dataset. The r-squared score is typically a number between 0 and 1, and can be interpreted as the proportion of the variance of the response variable $Y$ that has been explained by the model.

We will now use our linear regression model's `score()` method to calculate its r-squared score on the training and test sets.
train_r2 = model_1.score(X_train, y_train)
test_r2 = model_1.score(X_test, y_test)
print('Training r-Squared:', round(train_r2,4))
print('Testing r-Squared: ', round(test_r2,4))
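As a sanity check, the r-squared value returned by score() can be reproduced by hand as $1 - SSE/SST$, where $SST$ is the total sum of squares of the label around its mean. The cell below sketches that calculation on the training set using objects already defined above; the two values should agree up to rounding.

# Manually recompute r-squared on the training set as 1 - SSE/SST
y_train_pred = model_1.predict(X_train)

sse = np.sum((y_train - y_train_pred) ** 2)      # sum of squared errors
sst = np.sum((y_train - y_train.mean()) ** 2)    # total sum of squares

print('Manual r-Squared: ', round(1 - sse / sst, 4))
print('score() r-Squared:', round(model_1.score(X_train, y_train), 4))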
We see that the model performs a bit better on the training data than on the test data. It is often the case that models will achieve better performance on the dataset on which they were trained than on new, previously unseen data.
We will plot the dataset, along with the line that represents the regression model.
b = model_1.intercept_
m = model_1.coef_[0]
plt.figure(figsize=[12,4])
plt.subplot(1,2,1)
plt.scatter(X_train, y_train, alpha=0.6, color='cornflowerblue', edgecolor='k')
plt.plot([0,10],[m*0 + b, m*10 + b], c='r')
plt.xlim([0,10])
plt.ylim([0,25])
plt.xlabel('x')
plt.ylabel('y')
plt.title('Training Set')
plt.subplot(1,2,2)
plt.scatter(X_test, y_test, alpha=0.6, color='salmon', edgecolor='k')
plt.plot([0,10],[m*0 + b, m*10 + b], c='r')
plt.xlim([0,10])
plt.ylim([0,25])
plt.xlabel('x')
plt.ylabel('y')
plt.title('Test Set')
plt.show()
Every Scikit-Learn model comes equipped with a `predict()` method that can be used to generate predictions. We will use this method to find the $y$ values that our model predicts for observations with $X=2$, $X=5$, and $X=8$. Note that the `predict()` method expects the new feature values to be provided as a two-dimensional array, with each row referring to a single observation.
Xnew = np.array([2, 5, 8]).reshape(3,1)
print(model_1.predict(Xnew))
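These predictions can also be checked by hand by plugging each new x value into the fitted line $\hat y = \hat\beta_0 + \hat\beta_1 x$. The cell below is a quick sketch of that check using the intercept b and slope m extracted earlier.

# Recompute the predictions directly from the fitted intercept and slope
for x_val in [2, 5, 8]:
    print('x =', x_val, '-> predicted y:', b + m * x_val)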
We will now consider a regression problem with four features. A linear regression model in which there is more than one feature is called a multiple linear regression model. We will see that very little in our workflow changes when we have more than one feature to work with. Once again, we will work with synthetic, randomly generated data.
np.random.seed(1)
n = 1000
x1 = np.round(np.random.normal(16, 3, n),2)
x2 = np.round(np.random.normal(50, 10, n),2)
x3 = np.round(np.random.normal(24, 6, n),2)
x4 = np.round(np.random.normal(150, 15, n),2)
y = np.round(37 + 3.2 * x1 + 1.5 * x2 - 2.1 * x3 + 0.3 * x4 + np.random.normal(0, 9, n),2)
df = pd.DataFrame({'X1':x1, 'X2':x2, 'X3':x3, 'X4':x4, 'Y':y})
df.head()
Since our dataset is stored in a DataFrame, we will extract the feature array `X` and the label array `y`.
X = df.iloc[:,:4].values
y = df.iloc[:,4].values
print('Shape of X:', X.shape)
print('Shape of y:', y.shape)
We will split the dataset into training and test sets, using an 80/20 split. We will not create a validation set in this instance, as we will not be comparing different models in this example.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape)
print(X_test.shape)
We can use the `LinearRegression` class from Scikit-Learn to create multiple regression models. The syntax is exactly the same as that used for our simple linear regression model.
model_2 = LinearRegression()
model_2.fit(X_train, y_train)
print('Model intercept:', model_2.intercept_)
print('Model coefficients:', model_2.coef_)
From the output above, we see that the equation of our fitted linear regression model is:
$$\large\hat Y = 39.3111 + 3.1491 \cdot X^{(1)} + 1.5060 \cdot X^{(2)} - 2.1135 \cdot X^{(3)} + 0.2940 \cdot X^{(4)} $$

In the cell below, we calculate the model's training and testing r-squared scores.
train_r2 = model_2.score(X_train, y_train)
test_r2 = model_2.score(X_test, y_test)
print('Training r-Squared:', round(train_r2,4))
print('Testing r-Squared: ', round(test_r2,4))
We will now use our model to generate estimates $\hat y$ for new observations with the following feature values:
Xnew = np.array([[20, 40, 22, 150], [12, 50, 18, 120], [15, 30, 24, 140]])
print(model_2.predict(Xnew))
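As with the simple model, these predictions can be reproduced by hand as the intercept plus the matrix-vector product of the new feature values with the coefficient array. The cell below sketches that check using the attributes of model_2.

# Reproduce the predictions manually: intercept plus the dot product of each row of Xnew with the coefficients
manual_preds = model_2.intercept_ + Xnew @ model_2.coef_
print(manual_preds)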
In this example, we will be working with the Boston Housing dataset. This dataset contains data for 506 census tracts of Boston from the 1970 census.
The dataset contains the following 19 pieces of information for each census tract:
* `town` - name of town
* `tract` - census tract
* `lon` - longitude of census tract
* `lat` - latitude of census tract
* `medv` - median value of owner-occupied homes in USD 1000's
* `cmedv` - corrected median value of owner-occupied homes in USD 1000's
* `crim` - per capita crime rate by town
* `zn` - proportion of residential land zoned for lots over 25,000 sq.ft
* `indus` - proportion of non-retail business acres per town
* `chas` - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* `nox` - nitric oxides concentration (parts per 10 million)
* `rm` - average number of rooms per dwelling
* `age` - proportion of owner-occupied units built prior to 1940
* `dis` - weighted distances to five Boston employment centres
* `rad` - index of accessibility to radial highways
* `tax` - full-value property-tax rate per USD 10,000
* `ptratio` - pupil-teacher ratio by town
* `b` - 1000(B - 0.63)^2 where B is the proportion of blacks by town
* `lstat` - percentage of lower status of the population

Our goal will be to create a regression model for the purpose of estimating values of `cmedv`, using the columns that come after `cmedv` as features.
We will start by importing the dataset from a text file, and then viewing the first 10 rows.
boston = pd.read_csv('data/BostonHousing.txt', sep='\t')
boston.head(n=10)
Let's check the dimensions of the DataFrame.
print(boston.shape)
As mentioned above, we will use `cmedv` as the label for our model, and will use the last 13 columns as features. We will now prepare our feature and label arrays.
X = boston.iloc[:,6:].values
y = boston.iloc[:,5].values
print(X.shape)
print(y.shape)
In order to measure how well our model generalizes to new data, we will split our data into training and test sets, using an 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape)
print(X_test.shape)
We will now use the `LinearRegression` class to create our model. We will then display the parameters in the fitted model.
model_3 = LinearRegression()
model_3.fit(X_train, y_train)
print(model_3.intercept_)
pd.Series(model_3.coef_, index=boston.columns[6:])
In the cell below, we use the `score()` method of our trained model to calculate its r-squared scores on the training and test sets.
train_r2 = model_3.score(X_train, y_train)
test_r2 = model_3.score(X_test, y_test)
print("Training r-Squared:", round(train_r2,4))
print("Testing r-Squared: ", round(test_r2,4))