import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
The goal in a supervised learning task is to produce a model that we can use to predict the value of a label (or response) $y$ given values for a set of features (or predictors), $x^{(1)}, x^{(2)}, ..., x^{(p)}$. This model is typically represented by a predict() function.
When we select a learning algorithm, we are effectively selecting a class of models that we will be considering for our particular task. This set of allowed models is called our hypothesis space. Our learning algorithm will search through the hypothesis space to find the model that performs best on our training data. To determine which model from the hypothesis space is the "best", we need to provide the learning algorithm with a method of scoring different models. Such a scoring method is called an objective function. An objective function takes a model and a dataset as input, and produces a numerical score as output. If an objective function is defined in such a way that lower scores are better, then we call it a loss function. The goal of a learning algorithm is to minimize its loss on the training data.
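To make this concrete, here is a minimal sketch of what a loss function might look like in code, using the sum of squared errors as the score. The function and its arguments are hypothetical, written for illustration only; they are not part of any library.
# Hypothetical loss function: scores a candidate model against a dataset.
# Here the score is the sum of squared errors, so lower is better.
def loss(model, X, y):
    y_hat = model.predict(X)        # the model's predictions
    return np.sum((y - y_hat)**2)   # total squared prediction error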
We will illustrate these concepts with the linear regression learning algorithm.
In a simple linear regression task, we have a single feature, x, from which we wish to predict a continuous, real-valued label y.
%matplotlib inline
%run -i snippets/snippet04.py
Scikit-Learn is a library that contains implementations of many machine learning algorithms, as well as useful tools for evaluating models, processing data, and generating synthetic data sets. We will use this package extensively in this class.
In this lesson, we will illustrate how to use Scikit-Learn to construct linear regression models. We begin by recreating the dataset used in the demo above.
# Generate Data
np.random.seed(164)
x = np.random.uniform(low=0, high=10, size=12)
y = 5 + 1.4 * x + np.random.normal(loc=0, scale=2.5, size=12)
# The features need to be stored in a 2D array or DataFrame
X = x.reshape(12,1)
print(X.shape)
print(y.shape)
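As an aside (not part of the original demo), passing -1 to reshape lets NumPy infer the number of rows, which avoids hard-coding the sample count; the name X_alt below is a throwaway used only for comparison.
# Equivalent reshape; NumPy infers the row count from the data.
X_alt = x.reshape(-1, 1)
print(X_alt.shape)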
We will now use the LinearRegression class from Scikit-Learn to create the SLR model.
from sklearn.linear_model import LinearRegression
# We create an instance of the LinearRegression() class.
# Then we call its fit() method.
mod = LinearRegression()
mod.fit(X,y)
# We print the model parameters.
print('Intercept:', mod.intercept_)
print('Slope:', mod.coef_)
We will plot the dataset, along with the line that represents the model.
b = mod.intercept_
m = mod.coef_[0]
plt.close()
plt.scatter(x,y)
plt.plot([0,10],[m*0 + b, m*10 + b], c='r')
plt.xlim([0,10])
plt.ylim([0,25])
plt.xlabel('x')
plt.ylabel('y')
plt.show()
Every Scikit-Learn model comes equipped with a predict method that can be used to generate predictions. We will use this method to find the y values predicted by our model for x=2, x=5, and x=8.
Xnew = np.array([2, 5, 8])
print(Xnew.shape)
The predict function requires a 2D array as input, so we will reshape Xnew.
Xnew = Xnew.reshape((3,1))
print(Xnew.shape)
print(mod.predict(Xnew))
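Since an SLR model is just a line, we can sanity-check these predictions by applying the fitted equation directly. This check is our own illustration of how predict relates to the model parameters, not a step from the original lesson.
# The predictions above should equal intercept + slope * x for each x.
print(mod.intercept_ + mod.coef_[0] * Xnew.flatten())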
We know that the parameter values were selected in order to minimize the SSE for the given training data. Let's calculate the SSE for the model we have generated. This will require us to use the predict method built into our model.
y_hat = mod.predict(X)
residuals = y - y_hat
SSE = np.sum(residuals**2)
print('Training SSE:', SSE)
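For reference, the same quantity can be recovered from Scikit-Learn's mean_squared_error helper, since the MSE is just the SSE divided by the number of observations. This cross-check is our own addition, not part of the original lesson.
from sklearn.metrics import mean_squared_error
# MSE = SSE / n, so multiplying by n recovers the SSE computed above.
print('Training SSE:', mean_squared_error(y, y_hat) * len(y))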
When working with linear regression, the optimal parameter values are determined by minimizing the objective function SSE. However, a different value is typically used to assess the quality of a regression model. This alternate scoring method is called the r-squared value. It is defined as follows:
$ \large SST = \sum (y_i - \bar y ) ^2 $
$ \large SSE = \sum \hat{e}_i^2 = \sum (y_i - \hat {y}_i ) ^2 $
$ \large r^2 = 1 - \frac{SSE}{SST}$
Since SST is a constant for the supplied training data, minimizing SSE is equivalent to maximizing $r^2$. The score supplied by $r^2$ has two advantages over SSE: its value does not depend on the scale of the label, and on the training data it always falls between 0 and 1, with values closer to 1 indicating a better fit.
We will now directly compute the $r^2$ value for our model.
SST = np.sum((y - np.mean(y))**2)
r2 = 1 - SSE / SST
print(r2)
Scikit-Learn LinearRegression objects do not have an r_squared attribute, but they do contain a method called score that can be used to calculate $r^2$.
print(mod.score(X,y))
In a multiple linear regression task, we have several features, $X = [x^{(1)}, x^{(2)}, ..., x^{(p)}]$, from which we wish to predict a single continuous, real-valued label y.
We will generate a synthetic dataset to illustrate how multiple regression works.
np.random.seed(1)
n = 200
x1 = np.random.uniform(0, 10, n)
x2 = np.random.uniform(0, 10, n)
y = 7 + 1.3 * x1 + 2.5 * x2 + np.random.normal(0, 3, n)
df = pd.DataFrame({'x1':x1, 'x2':x2, 'y':y})
df.head()
We can use the LinearRegression class from Scikit-Learn to create multiple regression models.
X = np.hstack([x1.reshape(n,1), x2.reshape(n,1)])
print(X.shape)
mlr = LinearRegression()
mlr.fit(X,y)
print('\nModel intercept:', mlr.intercept_)
print('\nModel coefficients:', mlr.coef_)
print('\nr-Squared:', mlr.score(X,y))
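As with the SLR model, we can use the fitted model's predict method to generate predictions for new observations. The two points below are made up solely to illustrate the required 2D input format.
# Each row holds one observation: [x1, x2].
Xnew = np.array([[2, 3],
                 [5, 5]])
print(mlr.predict(Xnew))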
In this example, we will be working with the "Boston Housing" dataset. This dataset contains data for 506 census tracts of Boston from the 1970 census.
The dataset contains the following 19 pieces of information for each census tract:
- town - name of town
- tract - census tract
- lon - longitude of census tract
- lat - latitude of census tract
- medv - median value of owner-occupied homes in USD 1000's
- cmedv - corrected median value of owner-occupied homes in USD 1000's
- crim - per capita crime rate by town
- zn - proportion of residential land zoned for lots over 25,000 sq.ft
- indus - proportion of non-retail business acres per town
- chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- nox - nitric oxides concentration (parts per 10 million)
- rm - average number of rooms per dwelling
- age - proportion of owner-occupied units built prior to 1940
- dis - weighted distances to five Boston employment centres
- rad - index of accessibility to radial highways
- tax - full-value property-tax rate per USD 10,000
- ptratio - pupil-teacher ratio by town
- b - 1000(B - 0.63)^2 where B is the proportion of blacks by town
- lstat - percentage of lower status of the population

We will start by importing the dataset from a text file, and then viewing the first 10 rows.
df = pd.read_csv('data/BostonHousingV2.txt', sep='\t')
df.head(n=10)
Let's check the dimensions of the DataFrame.
print(df.shape)
We will use this dataset for a regression task in which we will attempt to predict the label cmedv using the last 13 columns as features. We will now prepare our feature and label arrays.
X = df.iloc[:,6:].values
y = df.iloc[:,5].values
print(X.shape)
print(y.shape)
In order to measure how well our model generalizes to new data, we will create a train/test split of our data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape)
print(X_test.shape)
We will now use the LinearRegression class to create our model. We will use the score method of our trained model to calculate the r-Squared value on our training set, as well as on our test set.
model = LinearRegression()
model.fit(X_train, y_train)
print("Training r-Squared:", model.score(X_train, y_train))
print("Testing r-Squared: ", model.score(X_test, y_test))
The model object has attributes containing the optimal parameter values that define our model.
np.set_printoptions(precision=6, suppress=True)
print(model.intercept_)
print(model.coef_)
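Since the coefficients are reported in the same order as the feature columns, one convenient way to read them is to pair each with its column name. This assumes, as above, that the features occupy columns 6 onward of df.
# Pair each coefficient with the name of its feature column.
for name, coef in zip(df.columns[6:], model.coef_):
    print(f'{name:>10}: {coef:.6f}')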
It is often undesirable to use all features that are potentially available to you in a supervised learning task. A few reasons why you might want to consider using a subset of the available features are:
- Models with fewer features are simpler to interpret and explain.
- Unnecessary features can lead to overfitting, hurting performance on new data.
- Every additional feature adds costs for data collection, storage, and processing.
Let's train a second MLR model on this data, but this time using only the features rm, ptratio, and lstat.
df.iloc[:, [11,16,18]].head()
Xs = df.iloc[:,[11,16,18]].values
print(Xs.shape)
Xs_train, Xs_test, y_train, y_test = train_test_split(Xs, y, test_size=0.2, random_state=1)
model_s = LinearRegression()
model_s.fit(Xs_train, y_train)
print("Training r-Squared:", model_s.score(Xs_train, y_train))
print("Testing r-Squared: ", model_s.score(Xs_test, y_test))
print(model_s.intercept_)
print(model_s.coef_)