import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
In this lesson, we will be working with a dataset containing salary information for 397 university professors. The dataset contains the following 6 columns:

rank
- A categorical variable with levels: AssocProf, AsstProf, Prof.

discipline
- A categorical variable with levels: A ("theoretical" departments) or B ("applied" departments).

yrs.since.phd
- Years since PhD.

yrs.service
- Years of service.

sex
- A categorical variable with levels: Female, Male.

salary
- Nine-month salary, in dollars.

Our goal will be to create a regression model with salary
as the response variable.
salaries = pd.read_csv('data/Salaries.txt', sep='\t')
salaries.head(20)
We will create separate arrays to hold the numerical and categorical variables. The numerical features are ready to be used in a model, but the categorical features will require some additional preprocessing.
X_num = salaries.iloc[:,[2,3]].values
X_cat = salaries.iloc[:,[0,1,4]].values
y = salaries.iloc[:,-1]
print(X_num.shape)
print(X_cat.shape)
print(y.shape)
Before we turn our attention to encoding the categorical variables, we will fit a regression model using only the numerical features. We will compare the performance of this model against the performance of the model we will build later using the categorical variables.
X_train, X_holdout, y_train, y_holdout = train_test_split(X_num, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_holdout, y_holdout, test_size=0.5, random_state=1)
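The two splits above carve the data into 60% training, 20% validation, and 20% testing: the first call holds out 40%, and the second splits that holdout in half. A quick sanity check on a toy array of 100 rows (illustration only, not the lesson data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 dummy rows stand in for the dataset.
data = np.arange(100)

# First split: 60% train, 40% holdout.
train, holdout = train_test_split(data, test_size=0.4, random_state=1)

# Second split: the 40% holdout becomes 20% validation, 20% test.
val, test = train_test_split(holdout, test_size=0.5, random_state=1)

print(len(train), len(val), len(test))  # 60 20 20
```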
mod1 = LinearRegression()
mod1.fit(X_train, y_train)
print('Training r-Squared: ', mod1.score(X_train, y_train))
print('Validation r-Squared:', mod1.score(X_val, y_val))
We will use the OneHotEncoder
class from Scikit-Learn to encode our categorical features. In a one-hot encoding scheme, a new "dummy" variable will be created for each category of each categorical variable. For a given observation, the value of the dummy variable associated with a specific category will be equal to 1 if the observation belongs to that category, and will be equal to 0 otherwise.
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn versions before 1.2
X_enc = enc.fit_transform(X_cat)
In the next cell, we will create a DataFrame for the purpose of displaying the original categorical feature matrix side-by-side with the encoded feature matrix. We will also add names for the columns to help illustrate how the encoding was performed.
temp_df = pd.DataFrame(np.hstack((X_cat,X_enc)))
temp_df.columns = ['Rank', 'Disc', 'Sex', 'Rank_Assoc',
'Rank_Asst', 'Rank_Prof', 'Disc_A',
'Disc_B', 'Sex_F', 'Sex_M']
temp_df.head(20)
If a categorical variable has K categories, then we need only K-1 dummy variables to completely encode the possible values of the variable. In some models, including linear regression, you should not use more than K-1 dummy variables when creating your model.
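The redundancy is easy to demonstrate: with all K dummy columns, every row sums to 1, which duplicates the intercept column that LinearRegression fits implicitly. Dropping one column removes the redundancy. A toy illustration on made-up data:

```python
import numpy as np

# A toy categorical column with K = 3 levels.
cats = np.array(['AsstProf', 'AssocProf', 'Prof', 'Prof', 'AsstProf'])
levels = ['AssocProf', 'AsstProf', 'Prof']

# Full one-hot encoding: one dummy column per level.
full = np.column_stack([(cats == lvl).astype(float) for lvl in levels])

# Each row has exactly one 1, so the K columns always sum to 1 --
# the same as the model's intercept column ("dummy variable trap").
print(full.sum(axis=1))  # every entry is 1.0

# Dropping one column (the "base" level) leaves K - 1 columns.
reduced = full[:, 1:]
print(reduced.shape)  # (5, 2)
```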
With that in mind, we will remove one column from the encoded matrix for each categorical variable. It does not actually matter which columns we remove, but the ones we do remove will in a sense be considered to be the "base" level for the associated categorical variable.
For our purposes, we will remove the following columns from X_enc:

- rank = AsstProf
- discipline = A
- sex = Female

X_enc = X_enc[:, [0,2,4,6]]
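As an aside, pandas can produce a reduced encoding in one step with get_dummies and drop_first=True, though it always drops the alphabetically first level of each variable rather than letting you choose the base. A sketch on made-up rows:

```python
import pandas as pd

# Made-up rows with the same categorical columns as the salary data.
demo = pd.DataFrame({'rank':       ['Prof', 'AsstProf', 'AssocProf'],
                     'discipline': ['B', 'A', 'B'],
                     'sex':        ['Male', 'Female', 'Male']})

# drop_first=True drops one dummy per variable (AssocProf, A, Female here).
dummies = pd.get_dummies(demo, drop_first=True)
print(dummies.columns.tolist())
# ['rank_AsstProf', 'rank_Prof', 'discipline_B', 'sex_Male']
```

Scikit-Learn's OneHotEncoder offers the same behavior through its drop='first' parameter.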
We will now combine the numerical feature array with the encoded categorical feature array.
X = np.hstack([X_num, X_enc])
print(X.shape)
We will now create a new train/validation/test split, and we will then fit another model using all of the features.
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_holdout, y_holdout, test_size=0.5, random_state=1)
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)
mod2 = LinearRegression()
mod2.fit(X_train, y_train)
print('Training r-Squared: ', mod2.score(X_train, y_train))
print('Validation r-Squared:', mod2.score(X_val, y_val))
print('Testing r-Squared: ', mod2.score(X_test, y_test))
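When interpreting a model like mod2, each dummy coefficient estimates the shift in the response relative to the dropped base level, holding the other features fixed. A self-contained toy sketch of pairing coefficients with feature names (synthetic data and hypothetical names, not the salary model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: two numeric features and one 0/1 dummy column.
rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.normal(size=200),
                         rng.normal(size=200),
                         rng.integers(0, 2, size=200)])

# True relationship: slope 3 on the first feature, shift -2 for the dummy.
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 2] + rng.normal(scale=0.1, size=200)

mod = LinearRegression().fit(X_toy, y_toy)

# Pair each coefficient with its (hypothetical) feature name; the dummy's
# coefficient is the estimated shift relative to the base level.
for name, coef in zip(['num1', 'num2', 'dummy'], mod.coef_):
    print(f'{name}: {coef:.3f}')
```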