Lesson 06 - Encoding Categorical Variables

The following topics are discussed in this notebook:

  • Using Scikit-Learn to one-hot encode categorical variables.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Load the Data

In this lesson, we will be working with a dataset containing salary information for 397 university professors. The dataset has the following six columns:

  • rank - A categorical variable with levels: AssocProf, AsstProf, Prof
  • discipline - A categorical variable with levels: A (“theoretical” departments) or B (“applied” departments).
  • yrs.since.phd - Years since PhD.
  • yrs.service - Years of service.
  • sex - A categorical variable with levels: Female, Male
  • salary - Nine-month salary, in dollars.

Our goal will be to create a regression model with salary as the response variable.

In [2]:
salaries = pd.read_csv('data/Salaries.txt', sep='\t')
salaries.head(20)
Out[2]:
rank discipline yrs.since.phd yrs.service sex salary
0 Prof B 19 18 Male 139750
1 Prof B 20 16 Male 173200
2 AsstProf B 4 3 Male 79750
3 Prof B 45 39 Male 115000
4 Prof B 40 41 Male 141500
5 AssocProf B 6 6 Male 97000
6 Prof B 30 23 Male 175000
7 Prof B 45 45 Male 147765
8 Prof B 21 20 Male 119250
9 Prof B 18 18 Female 129000
10 AssocProf B 12 8 Male 119800
11 AsstProf B 7 2 Male 79800
12 AsstProf B 1 1 Male 77700
13 AsstProf B 2 0 Male 78000
14 Prof B 20 18 Male 104800
15 Prof B 12 3 Male 117150
16 Prof B 19 20 Male 101000
17 Prof A 38 34 Male 103450
18 Prof A 37 23 Male 124750
19 Prof A 39 36 Female 137000

Separate Categorical and Numerical Features

We will create separate arrays to hold the numerical and categorical features. The numerical features are ready to be used in a model, but the categorical features will require some additional preprocessing.

In [3]:
X_num = salaries.iloc[:,[2,3]].values
X_cat = salaries.iloc[:,[0,1,4]].values
y = salaries.iloc[:,-1]

print(X_num.shape)
print(X_cat.shape)
print(y.shape)
(397, 2)
(397, 3)
(397,)
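
As an aside, selecting columns by name rather than by position is often more robust if the column order ever changes. A minimal sketch of the same split using this dataset's column names:

num_cols = ['yrs.since.phd', 'yrs.service']
cat_cols = ['rank', 'discipline', 'sex']

X_num = salaries[num_cols].values   # numerical features
X_cat = salaries[cat_cols].values   # categorical features
y = salaries['salary'].values       # response variable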

Create a Model Using Only Numerical Features

Before we turn our attention to encoding the categorical variables, we will fit a regression model using only the numerical features. We will compare the performance of this model against the performance of the model we will build later using the categorical variables.

In [4]:
X_train, X_holdout, y_train, y_holdout = train_test_split(X_num, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_holdout, y_holdout, test_size=0.5, random_state=1)
In [5]:
mod1 = LinearRegression()
mod1.fit(X_train, y_train)

print('Training r-Squared:  ', mod1.score(X_train, y_train))
print('Validation r-Squared:', mod1.score(X_val, y_val))
Training r-Squared:   0.21869777719465355
Validation r-Squared: 0.22996728480331952

One-Hot Encode the Categorical Features

We will use the OneHotEncoder class from Scikit-Learn to encode our categorical features. In a one-hot encoding scheme, a new "dummy" variable is created for each category of each categorical variable. For a given observation, the dummy variable associated with a specific category is equal to 1 if the observation belongs to that category and 0 otherwise.

In [6]:
from sklearn.preprocessing import OneHotEncoder
In [7]:
# Note: in scikit-learn 1.2 and later, the 'sparse' argument is named 'sparse_output'.
enc = OneHotEncoder(sparse=False)
X_enc = enc.fit_transform(X_cat)
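
To see which column of the encoded matrix corresponds to which category, we can inspect the categories_ attribute of the fitted encoder. A minimal sketch:

# Categories learned for each of the three features, in column order.
print(enc.categories_)
# Expected output (alphabetical within each feature):
# [array(['AssocProf', 'AsstProf', 'Prof'], ...), array(['A', 'B'], ...), array(['Female', 'Male'], ...)]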

In the next cell, we will create a DataFrame to display the original categorical feature matrix side-by-side with the encoded feature matrix. We will also add column names to help illustrate how the encoding was performed.

In [8]:
temp_df = pd.DataFrame(np.hstack((X_cat,X_enc)))
temp_df.columns = ['Rank', 'Disc', 'Sex', 'Rank_Assoc', 
                   'Rank_Asst', 'Rank_Prof', 'Disc_A', 
                   'Disc_B', 'Sex_F', 'Sex_M']
temp_df.head(20)
Out[8]:
Rank Disc Sex Rank_Assoc Rank_Asst Rank_Prof Disc_A Disc_B Sex_F Sex_M
0 Prof B Male 0 0 1 0 1 0 1
1 Prof B Male 0 0 1 0 1 0 1
2 AsstProf B Male 0 1 0 0 1 0 1
3 Prof B Male 0 0 1 0 1 0 1
4 Prof B Male 0 0 1 0 1 0 1
5 AssocProf B Male 1 0 0 0 1 0 1
6 Prof B Male 0 0 1 0 1 0 1
7 Prof B Male 0 0 1 0 1 0 1
8 Prof B Male 0 0 1 0 1 0 1
9 Prof B Female 0 0 1 0 1 1 0
10 AssocProf B Male 1 0 0 0 1 0 1
11 AsstProf B Male 0 1 0 0 1 0 1
12 AsstProf B Male 0 1 0 0 1 0 1
13 AsstProf B Male 0 1 0 0 1 0 1
14 Prof B Male 0 0 1 0 1 0 1
15 Prof B Male 0 0 1 0 1 0 1
16 Prof B Male 0 0 1 0 1 0 1
17 Prof A Male 0 0 1 1 0 0 1
18 Prof A Male 0 0 1 1 0 0 1
19 Prof A Female 0 0 1 1 0 1 0
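
As an aside, pandas offers a quick alternative for exploration: pd.get_dummies builds similar dummy columns directly from a DataFrame, generating its own column names. A minimal sketch:

# One-hot encode the three categorical columns with pandas.
dummies = pd.get_dummies(salaries[['rank', 'discipline', 'sex']])
print(dummies.columns.tolist())
# e.g. ['rank_AssocProf', 'rank_AsstProf', 'rank_Prof', 'discipline_A', 'discipline_B', 'sex_Female', 'sex_Male']

One advantage of the OneHotEncoder approach used in this lesson is that the fitted encoder can later be applied to new data with its transform method.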

Removing Redundant Columns from the Encoded Matrix

If a categorical variable has K categories, then we need only K-1 dummy variables to completely encode it. In models that include an intercept term, such as linear regression, using all K dummy variables creates perfect multicollinearity (the K columns always sum to 1), so you should use no more than K-1 of them.

With that in mind, we will remove one column from the encoded matrix for each categorical variable. It does not matter much which columns we remove; the category corresponding to each removed column simply becomes the "base" (reference) level for that variable, and the remaining dummy variables are interpreted relative to it.

For our purposes, we will remove the following columns from X_enc:

  • Column 1, which corresponds to rank = AsstProf.
  • Column 3, which corresponds to discipline = A.
  • Column 5, which corresponds to sex = Female.
In [9]:
X_enc = X_enc[:, [0,2,4,6]]
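
As an aside, OneHotEncoder can drop one column per feature automatically through its drop parameter (available in scikit-learn 0.21 and later). A minimal sketch; note that drop='first' drops the first category of each feature (here AssocProf, A, and Female), so the base level for rank differs slightly from the one chosen above:

# Let the encoder drop the first category of each feature automatically.
enc_drop = OneHotEncoder(sparse=False, drop='first')   # use sparse_output=False in scikit-learn 1.2+
X_enc_alt = enc_drop.fit_transform(X_cat)
print(X_enc_alt.shape)    # (397, 4)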

Combine the Numerical Features with the Encoded Categorical Features

We will now combine the numerical feature array with the encoded categorical feature array.

In [10]:
X = np.hstack([X_num, X_enc])
print(X.shape)
(397, 6)

Create a Model Using All Features

We will now create a new train/validation/test split, and then fit another model using all of the features. Because we use the same random_state values as before, the same observations are assigned to the training, validation, and test sets, which makes the comparison with mod1 fair.

In [11]:
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_holdout, y_holdout, test_size=0.5, random_state=1)

print(X_train.shape)
print(X_val.shape)
print(X_test.shape)
(238, 6)
(79, 6)
(80, 6)
In [12]:
mod2 = LinearRegression()
mod2.fit(X_train, y_train)

print('Training r-Squared:  ', mod2.score(X_train, y_train))
print('Validation r-Squared:', mod2.score(X_val, y_val))
Training r-Squared:   0.4687085845174794
Validation r-Squared: 0.42017544751745983
In [13]:
print('Testing r-Squared:   ', mod2.score(X_test, y_test))
Testing r-Squared:    0.40102892029460957
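
To interpret the effect of each dummy variable, we can pair the model's coefficients with the feature names. A minimal sketch, assuming the column order constructed above (two numerical features followed by the four retained dummy columns):

# Pair each coefficient with its feature name.
feature_names = ['yrs.since.phd', 'yrs.service', 'Rank_Assoc', 'Rank_Prof', 'Disc_B', 'Sex_M']
for name, coef in zip(feature_names, mod2.coef_):
    print(f'{name:15s} {coef:12.2f}')

Each dummy coefficient estimates the difference in expected salary relative to the corresponding base level (AsstProf, discipline A, or Female), holding the other features fixed.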