Lesson 28 - One-Hot Encoding

The following topics are discussed in this notebook:

  • Performing one-hot encoding on categorical features.

Categorical Features

A variable is categorical, or qualitative if it takes on values from a finite set of categories or classes. Such values are often recorded as strings, but might also be encoded using numerical labels. Such numerical labels do not typically have any numerical or quantitative significance.The possible values that a categorical variable can assume are referred to as its levels.

One-Hot Encoding

The Scikit-Learn implementations of supervised learning models require all of the feature information to be provided in a numerical format. If we wish to use a categorical variable as a feature in a Scikit-Learn model, we can do so, but we must first find a way to numerically encode the variable. The most common technique for numerically encoding qualitative information is to use one-hot encoding.

To perform one-hot encoding on a categorical variable, we must introduce new variables, referred to as dummy variables. The encoding will have one dummy variable for each level found within the categorical variable. The values found within a dummy variable will be either 0 or 1. A particular dummy variable is equal to one for observations with the associated level, and is zero otherwise.

For example, assume that we have a categorical variable Z with four levels: a, b, c, and d. A one-hot encoding for Z will create four new variables: Za, Zb, Zc, and Zd. The value of these dummy variables for each possible level of Z is shown in the table below.

Z - Za Zb Zc Zd
a - 1 0 0 0
b - 0 1 0 0
c - 0 0 1 0
d - 0 0 0 1

We could perform one-hot encoding manually using tools provided by numpy and pandas. However, sklearn provides a class OneHotEncoder for performing one-hot encoding. This class can be imported as follows:

In [1]:
from sklearn.preprocessing import OneHotEncoder

We will now consider a simple example to demonstrate how to perform one-hot encoding.

Example: One-Hot Encoding

Assume that we have a data set of customers for an online retailer, and we wish to build a model to help us make marketing decisions. Suppose the data set contains three categorical features:

  • sex - The sex of the customer. This has levels "F" and "M".
  • emp - The customer's employment status. This has levels "FT" (full time), "PT" (part time), and "U" (unemployed).
  • reg - The region in which the customer lives. This is encoded with levels 1 (Americas), 2 (Asia), and 3 (Europe).

A DataFrame containing 6 observations from the DataSet has been created below.

In [2]:
import pandas as pd
In [3]:
X_cat = pd.DataFrame({
    'sex':['M', 'F', 'F', 'M', 'M', 'F'],
    'emp':['PT', 'FT', 'U', 'FT', 'PT', 'PT'],
    'reg':[1, 3, 3, 2, 1, 2]
})

X_cat
Out[3]:
sex emp reg
0 M PT 1
1 F FT 3
2 F U 3
3 M FT 2
4 M PT 1
5 F PT 2

Notice that sex has two levels, while emp and reg each have three levels. If we apply one-hot encoding to this DataFrame, the encoding will consist of 8 columns.

In the cell below, we will use OneHotEncoder to perform the encoding. This is a three-step process:

  1. Create an instance of the encoder object.
  2. Fit the encoder to the data using the fit() method. This does not actually perform the encoding. It simply learns the levels for each categorical variable.
  3. We then perform the encoding using the transform() method of the encoder object.
In [4]:
encoder = OneHotEncoder(sparse=False)
encoder.fit(X_cat)
X_enc = encoder.transform(X_cat)

print(X_enc)
[[0. 1. 0. 1. 0. 1. 0. 0.]
 [1. 0. 1. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 1. 0. 0. 1.]
 [0. 1. 1. 0. 0. 0. 1. 0.]
 [0. 1. 0. 1. 0. 1. 0. 0.]
 [1. 0. 0. 1. 0. 0. 1. 0.]]

The output of the encoder is an array. Since the columns are not labeled, it can be difficult to tell what the new columns refer to. The columns of the original DataFrame are encoded from left to right, and the encoded columns for a particular variable are arranged in alphabetical order by level.

It follows that the first two columns of the encoded DataFrame encode the sex variable. The first column refers to F, and the second to M. The next three columns are associated with emp, and are arranged in the order FT, PT, and U. The final three columns encode the reg variable, and refer to levels 1, 2, and 3, in that order.

We will make this clearer by converting the encoded array into a data frame and adding column names.

In [5]:
X_enc_df = pd.DataFrame(X_enc.astype('int'))
X_enc_df.columns = ['sex_F', 'sex_M', 'emp_FT', 'emp_PT', 
                  'emp_U', 'reg_1', 'reg_2', 'reg_3']
X_enc_df
Out[5]:
sex_F sex_M emp_FT emp_PT emp_U reg_1 reg_2 reg_3
0 0 1 0 1 0 1 0 0
1 1 0 1 0 0 0 0 1
2 1 0 0 0 1 0 0 1
3 0 1 1 0 0 0 1 0
4 0 1 0 1 0 1 0 0
5 1 0 0 1 0 0 1 0

To help illustrate how the encoding was performed, we will now join the original DataFrame with the encoded DataFrame.

In [6]:
sep = pd.DataFrame({'---':['---']*6})
pd.concat([X_cat, sep, X_enc_df], axis=1)
Out[6]:
sex emp reg --- sex_F sex_M emp_FT emp_PT emp_U reg_1 reg_2 reg_3
0 M PT 1 --- 0 1 0 1 0 1 0 0
1 F FT 3 --- 1 0 1 0 0 0 0 1
2 F U 3 --- 1 0 0 0 1 0 0 1
3 M FT 2 --- 0 1 1 0 0 0 1 0
4 M PT 1 --- 0 1 0 1 0 1 0 0
5 F PT 2 --- 1 0 0 1 0 0 1 0