Lesson 27 - K-Means Clustering

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Load the Data

In [2]:
df = pd.read_csv('data/kmeans_01.csv')
X = df.values
print(X.shape)
(306, 2)
In [3]:
plt.scatter(X[:,0], X[:,1], edgecolors='k')
plt.show()

K-Means in Scikit-Learn

In [4]:
from sklearn.cluster import KMeans
In [5]:
kmeans_3 = KMeans(n_clusters=3)
kmeans_3.fit(X)

print(kmeans_3.labels_)

plt.scatter(X[:,0], X[:,1], edgecolors='k', c=kmeans_3.labels_)
plt.show()
[2 2 2 2 2 2 2 2 2 0 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 0 2
 2 2 2 2 0 0 0 1 0 1 1 0 0 0 1 1 0 0 1 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0
 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 2 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 2 2 2 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 2 0 2 0 0
 0 2 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 0 2 2 2 2 2 2 2 2 2]
In [6]:
print(kmeans_3.cluster_centers_)
[[35.22978521 31.4962348 ]
 [20.72753478 33.84308141]
 [46.76025904 28.00104715]]
In [7]:
pred = kmeans_3.predict([[40,35]])
print(pred)
[0]
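
The prediction above can be checked by hand: predict returns the index of the center closest to the new point. The sketch below assumes Euclidean distance and copies the centers from the cluster_centers_ output printed earlier. Cluster numbering can differ between runs, so the index applies only to the run shown here.

```python
import numpy as np

# Centers copied from the cluster_centers_ output printed above.
centers = np.array([
    [35.22978521, 31.4962348],
    [20.72753478, 33.84308141],
    [46.76025904, 28.00104715],
])
new_point = np.array([40, 35])

# Euclidean distance from the new point to each center.
dists = np.linalg.norm(centers - new_point, axis=1)

# The predicted cluster is the index of the smallest distance.
nearest = int(np.argmin(dists))
print(nearest)  # 0, matching the output of predict above
```
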
In [8]:
kmeans_5 = KMeans(n_clusters=5)
kmeans_5.fit(X)

print(kmeans_5.labels_)

plt.scatter(X[:,0], X[:,1], edgecolors='k', c=kmeans_5.labels_)
plt.show()
[4 4 2 2 2 2 2 2 2 4 2 2 4 2 2 4 2 4 2 2 2 2 4 2 2 4 2 4 0 2 2 2 4 2 2 4 2
 2 2 4 2 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 0 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 4 4 0 0 0 0 4 0 4 0 0
 0 0 0 0 0 0 0 0 0 0 4 4 4 0 4 0 4 0 0 4 0 0 0 0 0 0 0 0 0 4 0 0 4 0 0 0 0
 0 4 0 0 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 3 3 3 3 3 3 2 4 2 4 4 2 2 4 2 2 4 4 2 4 4 2 4 2 2 4 2 4 4 2 2 2 4
 2 4 4 4 2 2 2 2 2 4 4 4 4 2 2 2 4 2 2 2 4 2 4 4 2 2 2 2 4 2 2 4 4 4 2 2 2
 4 2 4 2 2 2 4 2 4 2]

The K-Means Algorithm

In [9]:
df = pd.read_csv('data/kmeans_02.csv')
pts = df.values
print(pts.shape)

plt.scatter(pts[:,0], pts[:,1], edgecolors='k')
plt.show()
(66, 2)

Overview of Algorithm

The cell below loads the following functions:

  • distance(P,Q)
  • assign(pts, ctrs)
  • newCenters(pts, clAssign, K)
  • icDist(pts, ctrs, clAssign)
  • kMeans(pts, K)
In [10]:
%run -i "Snippets/snippet20.py"

The next cell initializes a few variables used in this example.

In [11]:
%run -i "Snippets/snippet21.py"

Repeatedly executing the cell below illustrates the K-Means algorithm step by step.

In [12]:
%run -i "Snippets/snippet22.py"
Run 1: New Centers

The Algorithm in Detail

In [13]:
pts = X
plt.scatter(X[:,0], X[:,1], edgecolors='k')
plt.show()
In [14]:
K = 3

Randomly Assign Centers

In [15]:
sel = np.random.choice(X.shape[0], K, replace=False)
centers = X[sel,:]

plt.scatter(X[:,0], X[:,1], edgecolors='k')
plt.scatter(centers[:,0], centers[:,1], edgecolors='k', c='r')
plt.show()

Assign Clusters

In [16]:
clusters = assign(X, centers)

plt.scatter(X[:,0], X[:,1], edgecolors='k', c=clusters)
plt.scatter(centers[:,0], centers[:,1], edgecolors='k', c='r')
plt.show()
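
The source of assign is not shown in this lesson, so the following is only a sketch of what the assignment step might look like. It assumes Euclidean distance; assign_sketch and the toy arrays are hypothetical names introduced for illustration, not the contents of snippet20.py.

```python
import numpy as np

def assign_sketch(pts, ctrs):
    """Hypothetical stand-in for assign: index of the nearest center per point."""
    # Broadcasting gives an array of shape (n_points, n_centers, n_features).
    diffs = pts[:, np.newaxis, :] - ctrs[np.newaxis, :, :]
    # Euclidean distance from every point to every center.
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # For each point, pick the center with the smallest distance.
    return dists.argmin(axis=1)

# Toy data: two points near the origin, two near (10, 10).
toy_pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
toy_ctrs = np.array([[0.0, 0.0], [10.0, 10.0]])

toy_labels = assign_sketch(toy_pts, toy_ctrs)
print(toy_labels)  # [0 0 1 1]
```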

Calculate Intra-Cluster Distance

In [17]:
curDist = icDist(X, centers, clusters)

print("New Distance is " + str(curDist))
New Distance is 2336.3485958191745
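
The exact formula used by icDist is not shown here. One plausible reading, sketched below, sums the Euclidean distance from each point to its assigned center; the real snippet may use squared distances or another variant. The name ic_dist_sketch and the toy arrays are hypothetical.

```python
import numpy as np

def ic_dist_sketch(pts, ctrs, cl_assign):
    """Hypothetical stand-in for icDist: total point-to-assigned-center distance."""
    # ctrs[cl_assign] lines up each point with the center of its cluster.
    per_point = np.linalg.norm(pts - ctrs[cl_assign], axis=1)
    return float(per_point.sum())

# Toy data: every point lies exactly 1 unit from its assigned center.
toy_pts = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]])
toy_ctrs = np.array([[1.0, 0.0], [10.0, 11.0]])
toy_assign = np.array([0, 0, 1, 1])

toy_total = ic_dist_sketch(toy_pts, toy_ctrs, toy_assign)
print(toy_total)  # 4.0
```

A smaller total means the points sit closer to their centers, which is why this value is tracked from one iteration to the next.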

Find New Centers

In [18]:
centers = newCenters(pts, clusters, K)

plt.scatter(X[:,0], X[:,1], edgecolors='k', c=clusters)
plt.scatter(centers[:,0], centers[:,1], edgecolors='k', c='r')
plt.show()
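
As a rough illustration of this update step, the sketch below moves each center to the mean of the points currently assigned to it. new_centers_sketch is a hypothetical name, and the actual newCenters in snippet20.py may differ, for example in how it treats a cluster with no points.

```python
import numpy as np

def new_centers_sketch(pts, cl_assign, K):
    """Hypothetical stand-in for newCenters: mean of each cluster's points."""
    # Assumes every cluster has at least one point; an empty cluster
    # would produce NaN here and needs separate handling.
    return np.array([pts[cl_assign == k].mean(axis=0) for k in range(K)])

toy_pts = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]])
toy_assign = np.array([0, 0, 1, 1])

toy_centers = new_centers_sketch(toy_pts, toy_assign, 2)
print(toy_centers)  # rows: [1, 0] and [10, 11]
```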

Repeat

In [19]:
clusters = assign(X, centers)
curDist = icDist(X, centers, clusters)
print("New Distance is " + str(curDist))

centers = newCenters(pts, clusters, K)

plt.scatter(X[:,0], X[:,1], edgecolors='k', c=clusters)
plt.scatter(centers[:,0], centers[:,1], edgecolors='k', c='r')
plt.show()
New Distance is 1789.3842957740867
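
Repeating the cell above by hand eventually stops changing the result. The sketch below wraps the assign and update steps in a loop that ends once the cluster assignments stop changing. It is a minimal illustration under stated assumptions (Euclidean distance, random data points as starting centers), not the kMeans function from snippet20.py; kmeans_sketch, max_iter, and seed are hypothetical names.

```python
import numpy as np

def kmeans_sketch(pts, K, max_iter=100, seed=0):
    """Hypothetical K-means loop: assign, update, repeat until stable."""
    rng = np.random.default_rng(seed)
    # Start from K distinct data points chosen at random.
    centers = pts[rng.choice(pts.shape[0], K, replace=False)].astype(float)
    clusters = None
    for _ in range(max_iter):
        # Assignment step: nearest center for every point.
        dists = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        new_clusters = dists.argmin(axis=1)
        # Stop once no point changes cluster.
        if clusters is not None and np.array_equal(new_clusters, clusters):
            break
        clusters = new_clusters
        # Update step: move each non-empty cluster's center to its mean.
        for k in range(K):
            members = pts[clusters == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return centers, clusters

# Toy data: two tight pairs of points that are far apart.
toy_pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
toy_centers, toy_clusters = kmeans_sketch(toy_pts, K=2)
print(toy_centers)
print(toy_clusters)
```

On this toy data the two pairs end up in separate clusters whichever starting points are drawn, although which pair receives label 0 depends on the seed.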