import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
Principal component analysis is an unsupervised dimensionality reduction technique. It allows us to reduce a dataset with a large number of features (dimensions) to one with a smaller number of features, hopefully without losing too much information in the process.
One method of performing dimensionality reduction in a supervised learning task is to simply remove from the dataset features that are deemed to not have much influence on the response variable. This method of dimensionality reduction is referred to as feature selection. The lasso algorithm performs a type of automatic feature selection.
The type of dimensionality reduction performed in PCA, on the other hand, is referred to as feature extraction. In feature extraction methods, the features are transformed into new features that are, in some way, functions of the old features. The goal of PCA is to extract new features that capture most of the information contained in the original dataset, but with fewer features.
To illustrate the idea behind PCA, we will consider an example.
from sklearn.decomposition import PCA
# Generate and plot synthetic dataset
np.random.seed(1)
n = 50
v0 = np.random.normal(0,1.2,[n,1])
v1 = np.random.normal(0,0.6,[n,1])
X = np.hstack([v0 + v1, 0.8*v0 - v1])
mu = np.mean(X, axis=0)
plt.figure(figsize=[6,6])
plt.scatter(X[:,0], X[:,1], c='grey')
plt.scatter(mu[0], mu[1], c='darkorange', edgecolor='k', s=120)
plt.xlim([-4,4])
plt.ylim([-4,4])
plt.show()
We will use the PCA class from sklearn to perform principal component analysis. The code below creates a new, transformed feature array called Z. The shape of Z is the same as that of the original feature array X.
# Find transformed coordinates and principal components
pca = PCA(n_components=2)
Z = pca.fit_transform(X)
print('Original Feature Values:')
print(X[:5,:])
print()
print('Transformed Feature Values:')
print(Z[:5,:])
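As a quick check, the transformed feature array has the same shape as the original.
# One row per sample, one column per principal component.
print(X.shape)
print(Z.shape)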
The first step of PCA is to create a set of mutually orthogonal (perpendicular) vectors called principal components. Roughly speaking, the first principal component is selected to point in the direction of the greatest variance in the dataset. Each subsequent principal component is selected so that it is perpendicular to all previous components, and points in the direction of greatest variance among all such vectors. We will illustrate this idea more clearly in a moment.
The pca object returned by the PCA class above has an attribute components_ that contains the principal components.
pc = pca.components_
print(pc)
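The principal components returned by sklearn are unit-length and mutually orthogonal, so the matrix of their pairwise dot products should be (approximately) the identity matrix. A quick check:
# Each row of pc is one principal component; pc @ pc.T should be the identity.
print(np.round(np.dot(pc, pc.T), 10))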
To get a sense of what the principal components convey, we will plot them as vectors emanating from the mean point of our dataset.
%run -i Snippets/snippet17.py
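The plotting code in snippet17.py is not shown here, but a minimal sketch of this kind of plot might look like the following (the arrows are scaled by the standard deviation along each component purely so that they are easy to see; this is an illustrative choice, not necessarily what the snippet does).
# Sketch: draw each principal component as an arrow starting at the mean.
plt.figure(figsize=[6,6])
plt.scatter(X[:,0], X[:,1], c='grey')
plt.scatter(mu[0], mu[1], c='darkorange', edgecolor='k', s=120)
for sd, component in zip(np.sqrt(pca.explained_variance_), pc):
    plt.arrow(mu[0], mu[1], 2*sd*component[0], 2*sd*component[1],
              head_width=0.15, color='steelblue')
plt.xlim([-4,4])
plt.ylim([-4,4])
plt.show()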
To help us understand how the principal components are used to obtain the transformed feature values from the original feature values, we will consider the following plots of both data sets.
%run -i Snippets/snippet18.py
print('Original Coordinates:')
print(X[[2,11],:])
print()
print('Transformed Coordinates:')
print(Z[[2,11],:])
print('X coords of first sample:', X[0,:])
print('Z coords of first sample:', Z[0,:])
# Reconstruct the first sample: start at the mean, then move along each
# principal component by the corresponding transformed coordinate.
print(mu + Z[0,0] * pc[0,:] + Z[0,1] * pc[1,:])
print(X[0,:])
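More generally, the transformed coordinates can be obtained by centering the data at the mean and projecting onto the principal components. The following sketch should reproduce the output of fit_transform for this dataset, up to floating point error.
# Each row of pc is one component, so (X - mu) @ pc.T gives every sample's
# coordinates with respect to the principal component directions.
Z_manual = np.dot(X - mu, pc.T)
print(np.allclose(Z_manual, Z))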
The pca object contains an attribute called explained_variance_ratio_ that, roughly speaking, reports the proportion of the information in the original dataset that is explained by the feature associated with each principal component.
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))
We can see that the first feature in the transformed dataset contains, by itself, roughly 83.6% of the information contained in the original dataset. The second feature explains the remaining 16.4% of the information from the original dataset.
We will get some additional insight concerning the explained variance ratio in the next example.
%matplotlib notebook
%run -i Snippets/snippet19.py
# Find transformed coordinates and principal components
pca = PCA(n_components=3)
Z = pca.fit_transform(X)
pc = pca.components_
print(pc)
Here is the first point in our original set.
print(X[0,:])
Here is the transformed version of the same point.
print(Z[0,:])
The coordinates of Z represent how far from the mean we move in the direction of each of the principal components.
delta0 = Z[0,0] * pc[0]
delta1 = Z[0,1] * pc[1]
delta2 = Z[0,2] * pc[2]
print(delta0)
print(delta1)
print(delta2)
new_point = mu + delta0 + delta1 + delta2
print(new_point)
print(X[0,:])
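The same reconstruction can be performed for all samples at once with the inverse_transform method of the pca object.
# inverse_transform maps transformed coordinates back to the original feature
# space (pca.mean_ + Z @ pc), so the first row should match X[0,:].
print(pca.inverse_transform(Z)[0,:])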
%matplotlib inline
plt.figure(figsize=[12,3])
plt.subplot(131)
plt.scatter(Z[:,0],Z[:,1], edgecolor='k')
plt.xlim([-3,3])
plt.ylim([-1.5,1.5])
plt.title('z0 and z1')
plt.subplot(132)
plt.scatter(Z[:,0],Z[:,2], edgecolor='k')
plt.xlim([-3,3])
plt.ylim([-1.5,1.5])
plt.title('z0 and z2')
plt.subplot(133)
plt.scatter(Z[:,1],Z[:,2], edgecolor='k')
plt.xlim([-3,3])
plt.ylim([-1.5,1.5])
plt.title('z1 and z2')
plt.show()
It appears that the first two transformed variables, z0 and z1 (which correspond to pc0 and pc1), capture most of the information contained in the original dataset.
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))
To see where these explained variance ratios come from, let's calculate the sample variance of the original data. We will define this to be the sum of the squared distances from each point to the mean, mu, divided by n - 1.
# Sum of squared distances from each point to the mean, divided by n - 1 = 49
sq_distances = np.sum( (X - mu)**2, axis=1)
var = np.sum(sq_distances) / 49
print(var)
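Since the squared distance from a point to the mean is the sum of its squared deviations in each coordinate, this total variance should equal the sum of the sample variances of the individual features.
# Sum of the per-feature sample variances gives the same total variance.
print(np.sum(np.var(X, axis=0, ddof=1)))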
We will now calculate the variance of each of the three transformed variables, z0, z1, and z2.
# ddof=1 gives the sample variance (dividing by n - 1 rather than n)
var_z0 = np.var(Z[:,0], ddof=1)
var_z1 = np.var(Z[:,1], ddof=1)
var_z2 = np.var(Z[:,2], ddof=1)
print(var_z0)
print(var_z1)
print(var_z2)
Note that the variance in the original dataset is equal to the sum of the variances of the transformed features.
print(var)
print(var_z0 + var_z1 + var_z2)
Dividing the variance of each of the transformed features by the variance of the original dataset produces the explained variance ratios.
print(var_z0 / var)
print(var_z1 / var)
print(var_z2 / var)
print()
print(pca.explained_variance_ratio_)
The pca object also has an attribute, explained_variance_, that stores the actual variances of the transformed variables (rather than just the ratios).
print(pca.explained_variance_)
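Since all three principal components are retained in this example, these variances sum to the total variance of the dataset, and dividing by their sum reproduces the explained variance ratios.
# With all components kept, the ratios are each variance divided by their sum.
print(pca.explained_variance_ / np.sum(pca.explained_variance_))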
We will now consider an example of using PCA as a preprocessing step in a supervised learning task. We begin by generating a synthetic classification dataset with 200 features, 20 of which are informative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
np.random.seed(1)
X, y = make_classification(n_samples=1000, n_classes=3, n_features=200, n_informative=20,
n_redundant=0, class_sep=1.6)
# Mix the features together with a random 200x200 linear transformation
M = np.random.uniform(-1,1,[200,200])
X = np.dot(X,M)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape)
print(X_val.shape)
First, we train a logistic regression model on the full 200-feature dataset to serve as a baseline.
model = LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=2000)
model.fit(X_train, y_train)
print('Training Accuracy: ', model.score(X_train, y_train))
print('Validation Accuracy:', model.score(X_val, y_val))
Next, we use PCA to reduce the dataset to 15 transformed features and train a logistic regression model on the reduced data.
pca = PCA(n_components=15)
pca.fit(X_train)
Z_train = pca.transform(X_train)
Z_val = pca.transform(X_val)
print(np.cumsum(pca.explained_variance_ratio_))
print(Z_train.shape)
print(Z_val.shape)
pca_model = LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=2000)
pca_model.fit(Z_train, y_train)
print('Training Accuracy: ', pca_model.score(Z_train, y_train))
print('Validation Accuracy:', pca_model.score(Z_val, y_val))
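As a variation, PCA also accepts a value of n_components between 0 and 1, in which case it keeps however many components are needed to explain at least that proportion of the variance in the training data. A minimal sketch:
# Keep enough components to explain at least 95% of the training variance.
pca_95 = PCA(n_components=0.95)
Z95_train = pca_95.fit_transform(X_train)
print(pca_95.n_components_)
print(Z95_train.shape)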