Lesson 26 - Interpreting Principle Components¶

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

Example 1: Credit Scores¶

A data frame with 10000 observations on the following 12 variables.

ID Identification
Income Income in $10,000's
Limit Credit limit
Rating Credit rating
Cards Number of credit cards
Age Age in years
Education Number of years of education
Gender A factor with levels Male and Female
Student A factor with levels No and Yes indicating whether the individual was a student
Married A factor with levels No and Yes indicating whether the individual was married
Ethnicity A factor with levels African American, Asian, and Caucasian indicating the individual's ethnicity
Balance Average credit card balance in $.

df_credit = pd.read_table(filepath_or_buffer='Data/credit.csv', sep=',')
df_credit.head(n=10)

X_credit = df_credit.iloc[:, [1,2,3,4,5,6,11]]
print(X_credit.shape)

(400, 7)

scaler = StandardScaler()
scaler.fit(X_credit.astype('float'))
X_credit_scaled = scaler.transform(X_credit.astype('float'))
pd.DataFrame(X_credit_scaled).describe()

pca_credit = PCA(n_components=7)
Z_credit = pca_credit.fit_transform(X_credit_scaled)

pc_credit = np.round(pca_credit.components_, 2)

pc_credit_df = pd.DataFrame(pc_credit, columns=X_credit.columns).transpose()
pc_credit_df

plt.figure(figsize = [8,6])
sns.heatmap(pc_credit_df, cmap = 'RdBu')
plt.show()

print(np.round(pca_credit.explained_variance_ratio_,2))

[0.49 0.15 0.15 0.14 0.07 0.01 0.  ]

Example 2: U.S. News and World Report's College Data¶

Statistics for a large number of US Colleges from the 1995 issue of US News and World Report. Contains 777 observations on the following 18 variables.

Private A factor with levels No and Yes indicating private or public university
Apps Number of applications received
Accept Number of applications accepted
Enroll Number of new students enrolled
Top10perc Pct. new students from top 10% of H.S. class
Top25perc Pct. new students from top 25% of H.S. class
F.Undergrad Number of fulltime undergraduates
P.Undergrad Number of parttime undergraduates
Outstate Out-of-state tuition
Room.Board Room and board costs
Books Estimated book costs
Personal Estimated personal spending
PhD Pct. of faculty with Ph.D.'s
Terminal Pct. of faculty with terminal degree
S.F.Ratio Student/faculty ratio
perc.alumni Pct. alumni who donate
Expend Instructional expenditure per student
Grad.Rate Graduation rate

df_colleges = pd.read_table(filepath_or_buffer='Data/college.csv', sep=',')
df_colleges.head(n=10)

X_colleges = df_colleges.iloc[:, 1:]
print(X_colleges.shape)

(777, 17)

scaler = StandardScaler()
scaler.fit(X_colleges.astype('float'))
X_colleges_scaled = scaler.transform(X_colleges.astype('float'))
pd.DataFrame(X_colleges_scaled).describe()

pca_colleges = PCA(n_components=17)
Z_colleges = pca_colleges.fit_transform(X_colleges_scaled)

pc_colleges = np.round(pca_colleges.components_,2)

pc_colleges_df = pd.DataFrame(pc_colleges, columns=X_colleges.columns).transpose()
pc_colleges_df

#df = pd.DataFrame(np.random.random((5,5)), columns=["a","b","c","d","e"])

plt.figure(figsize = [10,8])
sns.heatmap(pc_colleges_df, cmap = 'RdBu')
plt.show()

print(np.round(pca_colleges.explained_variance_ratio_,2))

[0.32 0.26 0.07 0.06 0.05 0.05 0.04 0.03 0.03 0.02 0.02 0.01 0.01 0.01
 0.01 0.   0.  ]

#print(df_colleges.loc['University of Missouri at Rolla',:])
print(df_colleges.loc['Harvard University',:])

Private          Yes
Apps           13865
Accept          2165
Enroll          1606
Top10perc         90
Top25perc        100
F.Undergrad     6862
P.Undergrad      320
Outstate       18485
Room.Board      6410
Books            500
Personal        1920
PhD               97
Terminal          97
S.F.Ratio        9.9
perc.alumni       52
Expend         37219
Grad.Rate        100
Name: Harvard University, dtype: object

df_colleges.index.get_loc('Harvard University')

250

print(Z_colleges[250,:])

[ 7.69419904 -1.24192041  0.79279617  0.79258425 -0.97045821 -1.83658638
  0.41618569 -0.44601985  0.31523839  1.32735077  2.22479445  1.52048224
  0.33024333  0.04131785  0.37979928  0.81611647  0.78531243]

Example 3: Rand McNally Places Rated Data¶

The data is taken from the Places Rated Almanac, by Richard Boyer and David Savageau, copyrighted and published by Rand McNally. The nine rating criteria used by Places Rated Almanac are:

Climate & Terrain
Housing
Health Care & Environment
Crime
Transportation
Education
The Arts
Recreation
Economics

For all but two of the above criteria, the higher the score, the better. For Housing and Crime, the lower the score the better.

df_places = pd.read_table(filepath_or_buffer='Data/places.txt', sep='\t')
df_places.set_index('City', inplace=True)
df_places.index.name = None
df_places.head(n=10)

X_places = df_places.iloc[:,[0,1,2,3,4,5,6,7,8]]
print(X_places.shape)

(329, 9)

scaler = StandardScaler()
scaler.fit(X_places.astype('float'))
X_places_scaled = scaler.transform(X_places.astype('float'))
pd.DataFrame(X_places_scaled).describe()

pca_places = PCA(n_components=9)
Z_places = pca_places.fit_transform(X_places_scaled)

pc_places = np.round(pca_places.components_,3)

pc_places_df = pd.DataFrame(pc_places, columns=X_places.columns).transpose()
pc_places_df

plt.figure(figsize = [8,6])
sns.heatmap(pc_places_df, cmap = 'RdBu')
plt.show()

print(np.round(pca_places.explained_variance_ratio_,2))

[0.38 0.13 0.13 0.1  0.08 0.07 0.05 0.04 0.01]

df_places.index.get_loc('St.-Louis,MO-IL')

261

print(Z_places[261,:])

[ 3.01215474 -1.60752687 -1.14447639 -0.62931814 -0.31031828 -1.03565276
  0.13875249 -0.27484785 -0.15912399]

	0	1	2	3	4	5	6
count	4.000000e+02	4.000000e+02	4.000000e+02	4.000000e+02	4.000000e+02	4.000000e+02	4.000000e+02
mean	1.729172e-16	-1.584843e-16	9.159340e-18	6.161738e-17	1.830480e-16	2.142730e-16	1.915135e-17
std	1.001252e+00	1.001252e+00	1.001252e+00	1.001252e+00	1.001252e+00	1.001252e+00	1.001252e+00
min	-9.904743e-01	-1.683330e+00	-1.695069e+00	-1.429291e+00	-1.896161e+00	-2.707207e+00	-1.132477e+00
25%	-6.878268e-01	-7.146973e-01	-6.968846e-01	-6.991298e-01	-8.078311e-01	-7.849299e-01	-9.827546e-01
50%	-3.438443e-01	-4.906061e-02	-7.079503e-02	3.103187e-02	1.929972e-02	1.762088e-01	-1.317882e-01
75%	3.480625e-01	4.932738e-01	5.326453e-01	7.611935e-01	8.319194e-01	8.169679e-01	7.469449e-01
max	4.017453e+00	3.980980e+00	4.057837e+00	4.412002e+00	2.457159e+00	2.098486e+00	3.220900e+00

	0	1	2	3	4	5	6
Income	0.45	-0.03	-0.15	-0.13	0.74	0.46	-0.00
Limit	0.54	-0.05	0.04	-0.01	-0.03	-0.46	-0.70
Rating	0.54	-0.01	0.04	0.02	-0.03	-0.46	0.71
Cards	0.03	0.74	0.08	0.65	0.16	-0.02	-0.03
Age	0.08	0.31	-0.90	-0.18	-0.26	0.01	-0.00
Education	-0.02	-0.60	-0.36	0.71	0.05	-0.01	0.00
Balance	0.47	-0.02	0.19	0.14	-0.60	0.61	-0.00

	Private	Apps	Accept	Enroll	Top10perc	Top25perc	F.Undergrad	P.Undergrad	Outstate	Room.Board	Books	Personal	PhD	Terminal	S.F.Ratio	perc.alumni	Expend	Grad.Rate
Abilene Christian University	Yes	1660	1232	721	23	52	2885	537	7440	3300	450	2200	70	78	18.1	12	7041	60
Adelphi University	Yes	2186	1924	512	16	29	2683	1227	12280	6450	750	1500	29	30	12.2	16	10527	56
Adrian College	Yes	1428	1097	336	22	50	1036	99	11250	3750	400	1165	53	66	12.9	30	8735	54
Agnes Scott College	Yes	417	349	137	60	89	510	63	12960	5450	450	875	92	97	7.7	37	19016	59
Alaska Pacific University	Yes	193	146	55	16	44	249	869	7560	4120	800	1500	76	72	11.9	2	10922	15
Albertson College	Yes	587	479	158	38	62	678	41	13500	3335	500	675	67	73	9.4	11	9727	55
Albertus Magnus College	Yes	353	340	103	17	45	416	230	13290	5720	500	1500	90	93	11.5	26	8861	63
Albion College	Yes	1899	1720	489	37	68	1594	32	13868	4826	450	850	89	100	13.7	37	11487	73
Albright College	Yes	1038	839	227	30	63	973	306	15595	4400	300	500	79	84	11.3	23	11644	80
Alderson-Broaddus College	Yes	582	498	172	21	44	799	78	10468	3380	660	1800	40	41	11.5	15	8991	52

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
count	7.770000e+02	7.770000e+02	7.770000e+02	7.770000e+02	7.770000e+02	7.770000e+02	7.770000e+02	7.770000e+02	7.770000e+02	7.770000e+02	7.770000e+02	7.770000e+02	7.770000e+02	7.770000e+02	7.770000e+02	7.770000e+02	7.770000e+02
mean	6.355797e-17	6.774575e-17	-5.249269e-17	-2.753232e-17	-1.546739e-16	-1.661405e-16	-3.029180e-17	6.515595e-17	3.570717e-16	-2.192583e-16	4.765243e-17	5.954768e-17	-4.481615e-16	-2.057556e-17	-6.022638e-17	1.213101e-16	3.886495e-16
std	1.000644e+00	1.000644e+00	1.000644e+00	1.000644e+00	1.000644e+00	1.000644e+00	1.000644e+00	1.000644e+00	1.000644e+00	1.000644e+00	1.000644e+00	1.000644e+00	1.000644e+00	1.000644e+00	1.000644e+00	1.000644e+00	1.000644e+00
min	-7.551337e-01	-7.947645e-01	-8.022728e-01	-1.506526e+00	-2.364419e+00	-7.346169e-01	-5.615022e-01	-2.014878e+00	-2.351778e+00	-2.747779e+00	-1.611860e+00	-3.962596e+00	-3.785982e+00	-2.929799e+00	-1.836580e+00	-1.240641e+00	-3.230876e+00
25%	-5.754408e-01	-5.775805e-01	-5.793514e-01	-7.123803e-01	-7.476067e-01	-5.586426e-01	-4.997191e-01	-7.762035e-01	-6.939170e-01	-4.810994e-01	-7.251203e-01	-6.532948e-01	-5.915023e-01	-6.546598e-01	-7.868237e-01	-5.574826e-01	-7.260193e-01
50%	-3.732540e-01	-3.710108e-01	-3.725836e-01	-2.585828e-01	-9.077663e-02	-4.111378e-01	-3.301442e-01	-1.120949e-01	-1.437297e-01	-2.992802e-01	-2.078552e-01	1.433889e-01	1.561419e-01	-1.237939e-01	-1.408197e-01	-2.458933e-01	-2.698956e-02
75%	1.609122e-01	1.654173e-01	1.314128e-01	4.221134e-01	6.671042e-01	6.294077e-02	7.341765e-02	6.179271e-01	6.318245e-01	3.067838e-01	5.310950e-01	7.562224e-01	8.358184e-01	6.093067e-01	6.666852e-01	2.241735e-01	7.302926e-01
max	1.165867e+01	9.924816e+00	6.043678e+00	3.882319e+00	2.233391e+00	5.764674e+00	1.378992e+01	2.800531e+00	3.436593e+00	1.085230e+01	8.068387e+00	1.859323e+00	1.379560e+00	6.499390e+00	3.331452e+00	8.924721e+00	3.060392e+00

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
Apps	0.25	0.33	-0.06	0.28	0.01	-0.02	-0.04	-0.10	-0.09	0.05	0.04	0.02	0.60	0.08	0.13	0.46	0.36
Accept	0.21	0.37	-0.10	0.27	0.06	0.01	-0.01	-0.06	-0.18	0.04	-0.06	-0.15	0.29	0.03	-0.15	-0.52	-0.54
Enroll	0.18	0.40	-0.08	0.16	-0.06	-0.04	-0.03	0.06	-0.13	0.03	-0.07	0.01	-0.44	-0.09	0.03	-0.40	0.61
Top10perc	0.35	-0.08	0.04	-0.05	-0.40	-0.05	-0.16	-0.12	0.34	0.06	-0.01	0.04	0.00	-0.11	0.70	-0.15	-0.14
Top25perc	0.34	-0.04	-0.02	-0.11	-0.43	0.03	-0.12	-0.10	0.40	0.01	-0.27	-0.09	0.02	0.15	-0.62	0.05	0.08
F.Undergrad	0.15	0.42	-0.06	0.10	-0.04	-0.04	-0.03	0.08	-0.06	0.02	-0.08	0.06	-0.52	-0.06	0.01	0.56	-0.41
P.Undergrad	0.03	0.32	0.14	-0.16	0.30	-0.19	0.06	0.57	0.56	-0.22	0.10	-0.06	0.13	0.02	0.02	-0.05	0.01
Outstate	0.29	-0.25	0.05	0.13	0.22	-0.03	0.11	0.01	-0.00	0.19	0.14	-0.82	-0.14	-0.03	0.04	0.10	0.05
Room.Board	0.25	-0.14	0.15	0.18	0.56	0.16	0.21	-0.22	0.28	0.30	-0.36	0.35	-0.07	-0.06	0.00	-0.03	0.00
Books	0.06	0.06	0.68	0.09	-0.13	0.64	-0.15	0.21	-0.13	-0.08	0.03	-0.03	0.01	-0.07	-0.01	0.00	0.00
Personal	-0.04	0.22	0.50	-0.23	-0.22	-0.33	0.63	-0.23	-0.09	0.14	-0.02	-0.04	0.04	0.03	-0.00	-0.01	-0.00
PhD	0.32	0.06	-0.13	-0.53	0.14	0.09	-0.00	-0.08	-0.19	-0.12	0.04	0.02	0.13	-0.69	-0.11	0.03	0.01
Terminal	0.32	0.05	-0.07	-0.52	0.20	0.15	-0.03	-0.01	-0.25	-0.09	-0.06	0.02	-0.06	0.67	0.16	-0.03	0.01
S.F.Ratio	-0.18	0.25	-0.29	-0.16	-0.08	0.49	0.22	-0.08	0.27	0.47	0.45	-0.01	-0.02	0.04	-0.02	-0.02	-0.00
perc.alumni	0.21	-0.25	-0.15	0.02	-0.22	-0.05	0.24	0.68	-0.26	0.42	-0.13	0.18	0.10	-0.03	-0.01	0.00	-0.02
Expend	0.32	-0.13	0.23	0.08	0.08	-0.30	-0.23	-0.05	-0.05	0.13	0.69	0.33	-0.09	0.07	-0.23	-0.04	-0.04
Grad.Rate	0.25	-0.17	-0.21	0.27	-0.11	0.22	0.56	-0.01	0.04	-0.59	0.22	0.12	-0.07	0.04	-0.00	-0.01	-0.01

	ID	Income	Limit	Rating	Cards	Age	Education	Gender	Student	Married	Ethnicity	Balance
1	1	14.891	3606	283	2	34	11	Male	No	Yes	Caucasian	333
2	2	106.025	6645	483	3	82	15	Female	Yes	Yes	Asian	903
3	3	104.593	7075	514	4	71	11	Male	No	No	Asian	580
4	4	148.924	9504	681	3	36	11	Female	No	No	Asian	964
5	5	55.882	4897	357	2	68	16	Male	No	Yes	Caucasian	331
6	6	80.180	8047	569	4	77	10	Male	No	No	Caucasian	1151
7	7	20.996	3388	259	2	37	12	Female	No	No	African American	203
8	8	71.408	7114	512	2	87	9	Male	No	No	Asian	872
9	9	15.125	3300	266	5	66	13	Female	No	No	Caucasian	279
10	10	71.061	6819	491	3	41	19	Female	Yes	Yes	African American	1350

	Climate	HousingCost	HlthCare	Crime	Transp	Educ	Arts	Recreat	Econ	CaseNum	Long	Lat	Pop	StNum
Abilene,TX	521	6200	237	923	4031	2757	996	1405	7633	1	-99.6890	32.5590	110932	44
Akron,OH	575	8138	1656	886	4883	2438	5564	2632	4350	2	-81.5180	41.0850	660328	36
Albany,GA	468	7339	618	970	2531	2560	237	859	5250	3	-84.1580	31.5750	112402	11
Albany-Schenectady-Troy,NY	476	7908	1431	610	6883	3399	4655	1617	5864	4	-73.7983	42.7327	835880	35
Albuquerque,NM	659	8393	1853	1483	6558	3026	4496	2612	5727	5	-106.6500	35.0830	419700	33
Alexandria,LA	520	5819	640	727	2444	2972	334	1018	5254	6	-92.4530	31.3020	135282	19
Allentown,Bethlehem,PA-NJ	559	8288	621	514	2881	3144	2333	1117	5097	7	-75.4405	40.6155	635481	39
Alton,Granite-City,IL	537	6487	965	706	4975	2945	1487	1280	5795	8	-90.1615	38.7940	268229	15
Altoona,PA	561	6191	432	399	4246	2778	256	1210	4230	9	-78.3950	40.5150	136621	39
Amarillo,TX	609	6546	669	1073	4902	2852	1235	1109	6241	10	-101.8490	35.3830	173699	44

	0	1	2	3	4	5	6	7	8
count	3.290000e+02	3.290000e+02	3.290000e+02	3.290000e+02	3.290000e+02	3.290000e+02	3.290000e+02	3.290000e+02	3.290000e+02
mean	1.737887e-16	3.306204e-16	8.773799e-18	1.936985e-16	-2.817739e-16	2.368926e-16	-1.012361e-18	-6.749076e-18	-3.799730e-16
std	1.001523e+00	1.001523e+00	1.001523e+00	1.001523e+00	1.001523e+00	1.001523e+00	1.001523e+00	1.001523e+00	1.001523e+00
min	-3.595724e+00	-1.338391e+00	-1.141054e+00	-1.831280e+00	-2.115349e+00	-3.477583e+00	-6.685513e-01	-1.916493e+00	-2.290655e+00
25%	-4.869037e-01	-6.661639e-01	-6.018499e-01	-7.124141e-01	-7.378210e-01	-6.115656e-01	-5.119245e-01	-6.569779e-01	-6.310978e-01
50%	2.708800e-02	-1.971584e-01	-3.522185e-01	-3.941189e-02	-8.977541e-02	-6.521139e-02	-2.761214e-01	-2.181310e-01	-1.305525e-01
75%	4.415974e-01	2.806647e-01	2.588791e-01	5.466609e-01	6.866370e-01	6.153898e-01	1.495323e-01	4.091473e-01	5.426901e-01
max	3.077877e+00	6.421405e+00	6.654436e+00	4.309865e+00	3.046931e+00	3.016226e+00	1.156236e+01	3.662069e+00	4.113924e+00

	0	1	2	3	4	5	6	7	8
Climate	0.206	0.218	0.690	0.137	-0.369	-0.375	-0.085	-0.362	0.001
HousingCost	0.357	0.251	0.208	0.512	0.233	0.142	-0.231	0.614	0.014
HlthCare	0.460	-0.299	0.007	0.015	-0.103	0.374	0.014	-0.186	-0.716
Crime	0.281	0.355	-0.185	-0.539	-0.524	-0.081	0.019	0.430	-0.059
Transp	0.351	-0.180	-0.146	-0.303	0.404	-0.468	-0.583	-0.094	0.004
Educ	0.275	-0.483	-0.230	0.335	-0.209	-0.502	0.426	0.189	0.111
Arts	0.463	-0.195	0.026	-0.101	-0.105	0.462	-0.022	-0.204	0.686
Recreat	0.328	0.384	0.051	-0.190	0.530	-0.090	0.628	-0.151	-0.026
Econ	0.135	0.471	-0.607	0.422	-0.160	-0.033	-0.150	-0.405	0.000