import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
In this lesson, we will be working with a dataset containing salary information for 397 university professors. The dataset contains the following 6 columns:

rank
- A categorical variable with levels: AssocProf, AsstProf, Prof.

discipline
- A categorical variable with levels: A ("theoretical" departments) or B ("applied" departments).

yrs.since.phd
- Years since PhD.

yrs.service
- Years of service.

sex
- A categorical variable with levels: Female, Male.

salary
- Nine-month salary, in dollars.

Our goal will be to create a regression model with salary
as the response variable.
salaries = pd.read_csv('data/Salaries.txt', sep='\t')
salaries.head(20)
We will create separate arrays to hold the numerical and categorical variables. The numerical features are ready to be used in a model, but the categorical features will require some additional preprocessing.
X_num = salaries.iloc[:,[2,3]].values
X_cat = salaries.iloc[:,[0,1,4]].values
y = salaries.iloc[:,-1]
print(X_num.shape)
print(X_cat.shape)
print(y.shape)
Before we turn our attention to encoding the categorical variables, we will fit a regression model using only the numerical features. We will compare the performance of this model against the performance of the model we will build later using the categorical variables.
X_train, X_holdout, y_train, y_holdout = train_test_split(X_num, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_holdout, y_holdout, test_size=0.5, random_state=1)
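The two splits above carve the data into 60% training, 20% validation, and 20% testing: the first call holds out 40%, and the second splits that holdout in half. A quick sanity check on a toy array of 100 rows (illustration only, not the lesson data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 dummy rows stand in for the dataset.
data = np.arange(100)

# First split: 60% train, 40% holdout.
train, holdout = train_test_split(data, test_size=0.4, random_state=1)

# Second split: the 40% holdout becomes 20% validation, 20% test.
val, test = train_test_split(holdout, test_size=0.5, random_state=1)

print(len(train), len(val), len(test))  # 60 20 20
```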
mod1 = LinearRegression()
mod1.fit(X_train, y_train)
print('Training r-Squared: ', mod1.score(X_train, y_train))
print('Validation r-Squared:', mod1.score(X_val, y_val))
We will use the OneHotEncoder
class from Scikit-Learn to encode our categorical features. In a one-hot encoding scheme, a new "dummy" variable will be created for each category of each categorical variable. For a given observation, the value of the dummy variable associated with a specific category will be equal to 1 if the observation belongs to that category, and will be equal to 0 otherwise.
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn versions before 1.2
X_enc = enc.fit_transform(X_cat)
In the next cell, we will create a DataFrame for the purpose of displaying the original categorical feature matrix side-by-side with the encoded feature matrix. We will also add names for the columns to help illustrate how the encoding was performed.
temp_df = pd.DataFrame(np.hstack((X_cat,X_enc)))
temp_df.columns = ['Rank', 'Disc', 'Sex', 'Rank_Assoc',
'Rank_Asst', 'Rank_Prof', 'Disc_A',
'Disc_B', 'Sex_F', 'Sex_M']
temp_df.head(20)
If a categorical variable has K categories, then we need only K-1 dummy variables to completely encode the possible values of the variable. In some models, including linear regression, you should not use more than K-1 dummy variables when creating your model.
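The redundancy is easy to demonstrate: with all K dummy columns, every row sums to 1, which duplicates the intercept column that LinearRegression fits implicitly. Dropping one column removes the redundancy. A toy illustration on made-up data:

```python
import numpy as np

# A toy categorical column with K = 3 levels.
cats = np.array(['AsstProf', 'AssocProf', 'Prof', 'Prof', 'AsstProf'])
levels = ['AssocProf', 'AsstProf', 'Prof']

# Full one-hot encoding: one dummy column per level.
full = np.column_stack([(cats == lvl).astype(float) for lvl in levels])

# Each row has exactly one 1, so the K columns always sum to 1 --
# the same as the model's intercept column ("dummy variable trap").
print(full.sum(axis=1))  # every entry is 1.0

# Dropping one column (the "base" level) leaves K - 1 columns.
reduced = full[:, 1:]
print(reduced.shape)  # (5, 2)
```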
With that in mind, we will remove one column from the encoded matrix for each categorical variable. It does not actually matter which columns we remove, but the ones we do remove will in a sense be considered to be the "base" level for the associated categorical variable.
For our purposes, we will remove the following columns from X_enc:

- rank = AsstProf
- discipline = A
- sex = Female

X_enc = X_enc[:, [0,2,4,6]]
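As an aside, pandas can produce a reduced encoding in one step with get_dummies and drop_first=True, though it always drops the alphabetically first level of each variable rather than letting you choose the base. A sketch on made-up rows:

```python
import pandas as pd

# Made-up rows with the same categorical columns as the salary data.
demo = pd.DataFrame({'rank':       ['Prof', 'AsstProf', 'AssocProf'],
                     'discipline': ['B', 'A', 'B'],
                     'sex':        ['Male', 'Female', 'Male']})

# drop_first=True drops one dummy per variable (AssocProf, A, Female here).
dummies = pd.get_dummies(demo, drop_first=True)
print(dummies.columns.tolist())
# ['rank_AsstProf', 'rank_Prof', 'discipline_B', 'sex_Male']
```

Scikit-Learn's OneHotEncoder offers the same behavior through its drop='first' parameter.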
We will now combine the numerical feature array with the encoded categorical feature array.
X = np.hstack([X_num, X_enc])
print(X.shape)
We will now create a new train/validation/test split, and we will then fit another model using all of the features.
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_holdout, y_holdout, test_size=0.5, random_state=1)
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)
mod2 = LinearRegression()
mod2.fit(X_train, y_train)
print('Training r-Squared: ', mod2.score(X_train, y_train))
print('Validation r-Squared:', mod2.score(X_val, y_val))
print('Testing r-Squared: ', mod2.score(X_test, y_test))
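When interpreting a model like mod2, each dummy coefficient estimates the shift in the response relative to the dropped base level, holding the other features fixed. A self-contained toy sketch of pairing coefficients with feature names (synthetic data and hypothetical names, not the salary model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: two numeric features and one 0/1 dummy column.
rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.normal(size=200),
                         rng.normal(size=200),
                         rng.integers(0, 2, size=200)])

# True relationship: slope 3 on the first feature, shift -2 for the dummy.
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 2] + rng.normal(scale=0.1, size=200)

mod = LinearRegression().fit(X_toy, y_toy)

# Pair each coefficient with its (hypothetical) feature name; the dummy's
# coefficient is the estimated shift relative to the base level.
for name, coef in zip(['num1', 'num2', 'dummy'], mod.coef_):
    print(f'{name}: {coef:.3f}')
```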