Lesson 13 - Decision Tree Classifier

The following topics are discussed in this notebook:

  • Overview of decision tree classifiers.
  • Implementing decision trees using Scikit-Learn.
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from ClassificationPlotter import plot_regions
import ipywidgets as widgets

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.datasets import make_circles, make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

Introduction to Decision Trees

Decision tree algorithms apply a divide-and-conquer strategy to split the feature space into small rectangular regions. A single label value is then assigned to each of the regions for the purposes of making predictions.

Decision trees can be used for either classification or regression tasks. In this lecture, we will focus on using decision trees for classification.

Example 1

In [2]:
X1 = np.array([[1,1],[1,2],[1,3],[2,1],[2,2],[2,3],[3,1],[3,2],[3,3],[4,1],[4,2],[4,3]])
y1 = np.array([0,0,0,  0,0,2,  1,2,2,  2,1,1])

tree_mod_01 = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=2)
tree_mod_01.fit(X1,y1)
plot_regions(tree_mod_01, X1, y1, fig_size=[8,6])

One advantage of decision trees is that they generate easy to understand rules-based heuristics for making predictions. The rules generated by a decision tree are often represented in a flowchart, such as the following one.

In [3]:
#export_graphviz(tree_mod_01, out_file='images/tree_01.dot', filled=True, rounded=True)

Tree01

Example 2

In [4]:
X2, y2 = make_circles(n_samples=100, noise=0.25, random_state=9662, factor=0.4)

plt.close()
plt.rcParams["figure.figsize"] = [8,6]
plt.scatter(X2[y2==0,0],X2[y2==0,1],c='b', edgecolor='k')
plt.scatter(X2[y2==1,0],X2[y2==1,1],c='r', edgecolor='k')
plt.show()
In [5]:
def knn_example_2(max_depth):
    tree_mod_02 = DecisionTreeClassifier(criterion='gini', max_depth=max_depth, random_state=1)
    tree_mod_02.fit(X2,y2)
    plot_regions(tree_mod_02, X2, y2)
    
    print('Training Accuracy:', tree_mod_02.score(X2,y2))

_ = widgets.interact(knn_example_2,
                     max_depth=widgets.IntSlider(min=1,max=15,step=1,value=1,continuous_update=False))

Tree02

Example 3

In [6]:
X3, y3 = make_blobs(n_samples=300, centers=4, random_state=2997, n_features=2, cluster_std=2)

plt.close()
plt.scatter(X3[y3==0,0],X3[y3==0,1],c='purple', edgecolor='k')
plt.scatter(X3[y3==1,0],X3[y3==1,1],c='blue', edgecolor='k')
plt.scatter(X3[y3==2,0],X3[y3==2,1],c='yellow', edgecolor='k')
plt.scatter(X3[y3==3,0],X3[y3==3,1],c='red', edgecolor='k')
plt.show()
In [7]:
def knn_example_3(max_depth):
    tree_mod_03 = DecisionTreeClassifier(criterion='gini', max_depth=max_depth, random_state=1)
    tree_mod_03.fit(X3,y3)
    plot_regions(tree_mod_03, X3, y3)

_ = widgets.interact(knn_example_3,
                     max_depth=widgets.IntSlider(min=1,max=15,step=1,value=1, continuous_update=False))

Tree03

Example 4: Iris Dataset

In [8]:
iris = pd.read_table(filepath_or_buffer='Data/iris_mod.txt', sep='\t')
iris.head(n=10)
Out[8]:
sepal_length sepal_width petal_length petal_width species
0 6.3 3.2 5.0 2.0 virginica
1 5.3 3.8 1.9 0.4 setosa
2 7.5 2.9 5.8 1.5 virginica
3 6.5 3.0 4.8 1.6 versicolor
4 6.8 3.1 4.9 1.5 versicolor
5 6.1 2.3 4.4 1.3 versicolor
6 4.9 3.5 1.6 0.4 setosa
7 6.3 3.1 5.7 1.7 virginica
8 4.9 3.5 1.5 0.2 setosa
9 5.5 3.9 1.3 0.4 setosa
In [9]:
import seaborn as sns
plt.close()
g = sns.pairplot(iris, hue="species")
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
In [10]:
X_iris = iris.iloc[:,:4]
y_iris = iris.iloc[:,4]
In [11]:
X_train, X_holdout, y_train, y_holdout = train_test_split(X_iris, y_iris, test_size=0.2, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_iris, y_iris, test_size=0.5, random_state=1)

print(y_train.shape)
print(y_val.shape)
print(y_test.shape)
(480,)
(300,)
(300,)
In [12]:
tr_acc = []
va_acc = []

rng = range(1,11)

for d in rng:
    temp_mod = DecisionTreeClassifier(max_depth=d, criterion='gini', random_state=1)
    temp_mod.fit(X_train, y_train)
    tr_acc.append(temp_mod.score(X_train, y_train))
    va_acc.append(temp_mod.score(X_val, y_val))

plt.figure(figsize=([9, 6]))
plt.plot(rng, tr_acc, label='Training Accuracy')
plt.plot(rng, va_acc, label='Validation Accuracy')
plt.xlabel('Maximum Depth')
plt.ylabel('Accuracy')
plt.xticks(rng)
plt.legend()
plt.show()
In [13]:
iris_tree = DecisionTreeClassifier(max_depth = 7, criterion='gini', random_state=1)
iris_tree.fit(X_train, y_train)

print(iris_tree.score(X_test, y_test))
0.97

Iris_Tree