A variable is categorical, or qualitative, if it takes on values from a finite set of categories or classes. Such values are often recorded as strings, but they might also be encoded using numerical labels, which typically have no numerical or quantitative significance. The possible values that a categorical variable can assume are referred to as its levels.
The Scikit-Learn implementations of supervised learning models require all of the feature information to be provided in a numerical format. If we wish to use a categorical variable as a feature in a Scikit-Learn model, we can do so, but we must first find a way to numerically encode the variable. The most common technique for numerically encoding qualitative information is to use one-hot encoding.
To perform one-hot encoding on a categorical variable, we must introduce new variables, referred to as dummy variables. The encoding will have one dummy variable for each level found within the categorical variable. The values found within a dummy variable will be either 0 or 1. A particular dummy variable is equal to one for observations with the associated level, and is zero otherwise.
For example, assume that we have a categorical variable Z with four levels: a, b, c, and d. A one-hot encoding for Z will create four new variables: Za, Zb, Zc, and Zd. The value of these dummy variables for each possible level of Z is shown in the table below.
| Z | Za | Zb | Zc | Zd |
|---|---|---|---|---|
| a | 1 | 0 | 0 | 0 |
| b | 0 | 1 | 0 | 0 |
| c | 0 | 0 | 1 | 0 |
| d | 0 | 0 | 0 | 1 |
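The table above can be reproduced by hand with pandas. The following is a minimal sketch, assuming a small made-up sample of Z (the values in `Z` and the name `levels` are illustrative and not part of the original example); each dummy column is built by comparing Z against one level.

```python
import pandas as pd

# Hypothetical sample of the variable Z: one observation per level, plus a repeat.
Z = pd.Series(['a', 'b', 'c', 'd', 'b'], name='Z')

# The levels of Z, listed in the order we want the dummy columns to appear.
levels = ['a', 'b', 'c', 'd']

# One dummy column per level: 1 where Z equals that level, 0 otherwise.
dummies = pd.DataFrame({f'Z{lev}': (Z == lev).astype(int) for lev in levels})

# Show the original values next to their encoding.
print(pd.concat([Z, dummies], axis=1))
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.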
We could perform one-hot encoding manually using tools provided by numpy and pandas. However, sklearn provides a class, `OneHotEncoder`, for performing one-hot encoding. This class can be imported as follows:
from sklearn.preprocessing import OneHotEncoder
We will now consider a simple example to demonstrate how to perform one-hot encoding.
Assume that we have a data set of customers for an online retailer, and we wish to build a model to help us make marketing decisions. Suppose the data set contains three categorical features:
- `sex` - The sex of the customer. This has levels "F" and "M".
- `emp` - The customer's employment status. This has levels "FT" (full time), "PT" (part time), and "U" (unemployed).
- `reg` - The region in which the customer lives. This is encoded with levels 1 (Americas), 2 (Asia), and 3 (Europe).

A DataFrame containing 6 observations from the data set has been created below.
import pandas as pd
X_cat = pd.DataFrame({
'sex':['M', 'F', 'F', 'M', 'M', 'F'],
'emp':['PT', 'FT', 'U', 'FT', 'PT', 'PT'],
'reg':[1, 3, 3, 2, 1, 2]
})
X_cat
Notice that `sex` has two levels, while `emp` and `reg` each have three levels. If we apply one-hot encoding to this DataFrame, the encoding will therefore consist of 2 + 3 + 3 = 8 columns.
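The column count can be checked directly from the data. This is a small sketch that repeats the `X_cat` definition so that it runs on its own; the names `levels_per_feature` and `n_encoded_columns` are introduced here only for illustration.

```python
import pandas as pd

# Same six observations as in the example DataFrame.
X_cat = pd.DataFrame({
    'sex': ['M', 'F', 'F', 'M', 'M', 'F'],
    'emp': ['PT', 'FT', 'U', 'FT', 'PT', 'PT'],
    'reg': [1, 3, 3, 2, 1, 2]
})

# Number of distinct levels observed in each categorical feature.
levels_per_feature = X_cat.nunique()
print(levels_per_feature)

# A one-hot encoding needs one column per level, summed over all features.
n_encoded_columns = int(levels_per_feature.sum())
print(n_encoded_columns)
```

Note that this counts only the levels that actually appear in the data; a level that never occurs in the sample would not receive a column.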
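In the cell below, we will use `OneHotEncoder` to perform the encoding. This is a three-step process:

1. Create an instance of the `OneHotEncoder` class.
2. Fit the encoder to the data using the `fit()` method. This does not actually perform the encoding. It simply learns the levels for each categorical variable.
3. Perform the encoding using the `transform()` method of the encoder object.

# sparse_output=False requests a dense array (this parameter was named sparse before scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)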
encoder.fit(X_cat)
X_enc = encoder.transform(X_cat)
print(X_enc)
The output of the encoder is a NumPy array. Since the columns are not labeled, it can be difficult to tell what the new columns refer to. The columns of the original DataFrame are encoded from left to right, and the encoded columns for a particular variable are arranged in sorted order by level (alphabetical for string levels, ascending for numerical levels).
It follows that the first two columns of the encoded array encode the `sex` variable. The first column refers to `F`, and the second to `M`. The next three columns are associated with `emp`, and are arranged in the order `FT`, `PT`, and `U`. The final three columns encode the `reg` variable, and refer to levels `1`, `2`, and `3`, in that order.
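The ordering described above does not have to be inferred by eye. As a hedged sketch, a fitted encoder stores the levels it learned in its `categories_` attribute, with one array per input column in the original column order. The block below rebuilds `X_cat` and a fresh encoder so that it is self-contained.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same six observations as in the example DataFrame.
X_cat = pd.DataFrame({
    'sex': ['M', 'F', 'F', 'M', 'M', 'F'],
    'emp': ['PT', 'FT', 'U', 'FT', 'PT', 'PT'],
    'reg': [1, 3, 3, 2, 1, 2]
})

# Fitting only learns the levels; no encoding happens yet.
encoder = OneHotEncoder()
encoder.fit(X_cat)

# categories_ holds one array of sorted levels per input column.
for name, levels in zip(X_cat.columns, encoder.categories_):
    print(name, list(levels))
```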
We will make this clearer by converting the encoded array into a data frame and adding column names.
X_enc_df = pd.DataFrame(X_enc.astype('int'))
X_enc_df.columns = ['sex_F', 'sex_M', 'emp_FT', 'emp_PT',
'emp_U', 'reg_1', 'reg_2', 'reg_3']
X_enc_df
To help illustrate how the encoding was performed, we will now join the original DataFrame with the encoded DataFrame.
sep = pd.DataFrame({'---':['---']*6})
pd.concat([X_cat, sep, X_enc_df], axis=1)