require(car)
require(MASS)
require(openintro)
When we have only two predictors, we can detect the presence of collinearity fairly reliably by using a pairs plot or a correlation matrix. Detecting multicollinearity in a larger set of predictors is not so easy, however.
Assume that we have five predictors, \(X_1, \ldots, X_5\), that satisfy the following properties:
For any \(i, j < 5\) such that \(i \neq j\), we have that \(\mathrm{corr}(X_i, X_j) = 0\).
\(X_5\) is very nearly an average of the other four predictors.
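The data frame df of simulated predictors used below is not constructed in these notes; here is a minimal sketch of how such data could be generated (the seed, the sample size n = 100, and the noise level in X5 are all assumptions, so exact output will differ):
set.seed(1)   # assumed seed; the original is not shown
n <- 100      # assumed sample size
df <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n), X4 = rnorm(n))
# Make X5 approximately equal to the average of the first four predictors
df$X5 <- rowMeans(df[, 1:4]) + rnorm(n, 0, 0.2)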
Let’s consider a pairs plot of our predictors.
pairs(df)
Let’s now look at a correlation matrix for our predictors.
round(cor(df), 2)
     X1   X2  X3   X4   X5
X1 1.00 0.00 0.0 0.00 0.51
X2 0.00 1.00 0.0 0.00 0.46
X3 0.00 0.00 1.0 0.00 0.50
X4 0.00 0.00 0.0 1.00 0.48
X5 0.51 0.46 0.5 0.48 1.00
We will now generate \(Y\) values according to the model: \[Y = 3 + 2\cdot X_1 + 2\cdot X_2 + 2\cdot X_3 + 2\cdot X_4 + 2\cdot X_5 + e, \hspace{1em} e \sim N(0,3)\]
We will then fit a regression model to the generated data and examine its summary output.
set.seed(19)
e <- rnorm(n, 0, 3)   # error term with standard deviation 3
df2 <- df             # copy the predictors into a new data frame
df2$Y <- 3 + 2*df$X1 + 2*df$X2 + 2*df$X3 + 2*df$X4 + 2*df$X5 + e
mod <- lm(Y ~ X1 + X2 + X3 + X4 + X5, df2)
summary(mod)
Notice that the standard error for the parameter estimate \(\hat\beta_5\) is considerably larger than those of the other parameter estimates.
Consider a linear regression model with a response \(Y\) and several predictors \(X_1, X_2, ..., X_p\).
The residual standard error of the model is given by \(s = \sqrt{\frac{SSE}{n - p - 1}}\).
For each \(j = 1,2,...,p\), let \(s_j = sd(X_j)\).
For each \(j\), let \(R^2_j\) denote the \(R^2\) value for the model obtained by regressing \(X_j\) against the other \(p-1\) predictors.
The standard errors of the coefficient estimates \(\hat\beta_1,...,\hat\beta_p\) are given by the formula \(se(\hat\beta_j) = \frac{s}{s_j \sqrt{n-1}}\sqrt\frac{1}{1 - R^2_j}\).
For each \(j = 1, 2, ..., p\), define the variance inflation factor \(VIF_j\) as follows: \(VIF_j = \frac{1}{1 - R^2_j}\).
Then \(se(\hat\beta_j) = \frac{s}{s_j \sqrt{n-1}}\sqrt{VIF_j}\).
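As a quick sanity check, we can evaluate this formula directly for \(\hat\beta_5\) in our simulated example (a sketch reusing the objects defined above; the result should match the standard error reported by summary(mod)):
s    <- summary(mod)$sigma   # residual standard error
s5   <- sd(df2$X5)           # sample standard deviation of X5
R2_5 <- summary(lm(X5 ~ X1 + X2 + X3 + X4, df2))$r.squared
s / (s5 * sqrt(n - 1)) * sqrt(1 / (1 - R2_5))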
When the value of \(R^2_j\) is large, then so too will be the values of \(VIF_j\) and \(se(\hat\beta_j)\). A common rule of thumb is that \(VIF_j\) values greater than 5 should be avoided. Note that if \(VIF_j = 5\), then \(R^2_j = 0.8\).
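To make the definition concrete, here is a sketch of computing a VIF by hand (the helper vif_by_hand is ours for illustration, not part of any package):
# Regress one predictor on the others and apply VIF_j = 1 / (1 - R^2_j)
vif_by_hand <- function(data, predictor) {
  others <- setdiff(names(data), c(predictor, "Y"))
  fit <- lm(reformulate(others, response = predictor), data = data)
  1 / (1 - summary(fit)$r.squared)
}
vif_by_hand(df, "X5")   # should match the X5 entry of vif(mod) below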
We can calculate the variance inflation factor for each of our predictors using the vif() function from the car package.
vif(mod)
X1 X2 X3 X4 X5
4.620756 3.712993 4.413106 4.043983 13.790838
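Since \(VIF_j = \frac{1}{1 - R^2_j}\), we can invert the reported values to recover each \(R^2_j\); for instance, \(VIF_5 \approx 13.79\) corresponds to \(R^2_5 \approx 0.93\).
round(1 - 1 / vif(mod), 2)   # recovers R^2_j for each predictor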
We will remove the predictor \(X_5\) and then run the regression again.
mod2 <- lm(Y ~ X1 + X2 + X3 + X4, df2)
summary(mod2)
Call:
lm(formula = Y ~ X1 + X2 + X3 + X4, data = df2)
Residuals:
Min 1Q Median 3Q Max
-6.4985 -2.1787 0.1954 1.6846 8.6907
Coefficients:
In previous lectures, we considered the bdims data set. The code chunk below loads the data set and prints the names of its variables.
data(bdims)
names(bdims)
[1] "bia.di" "bii.di" "bit.di" "che.de" "che.di" "elb.di" "wri.di" "kne.di" "ank.di" "sho.gi" "che.gi" "wai.gi" "nav.gi"
[14] "hip.gi" "thi.gi" "bic.gi" "for.gi" "kne.gi" "cal.gi" "ank.gi" "wri.gi" "age" "wgt" "hgt" "sex"
Our goal in working with this data set was to create a model for wgt using some combination of the other 24 variables as predictors. We previously applied best subset selection (with BIC as the final selection criterion) to obtain a 12-variable model; a sketch of the selection step appears below, followed by the chosen model.
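The selection itself was covered in an earlier lecture; a minimal sketch, assuming the leaps package was used, might look like this:
require(leaps)
# Exhaustive search over subsets of the 24 candidate predictors
subsets <- regsubsets(wgt ~ ., data = bdims, nvmax = 24)
which.min(summary(subsets)$bic)   # size of the BIC-optimal model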
bdim_mod <- lm(wgt ~ che.de + kne.di + sho.gi + che.gi + wai.gi + hip.gi + thi.gi + for.gi + cal.gi + age + hgt + sex, bdims)
summary(bdim_mod)
Call:
lm(formula = wgt ~ che.de + kne.di + sho.gi + che.gi + wai.gi +
hip.gi + thi.gi + for.gi + cal.gi + age + hgt + sex, data = bdims)
Residuals:
Min 1Q Median 3Q Max
-7.3146 -1.3279 0.0409 1.3412 8.7983
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -122.26113 2.52608 -48.400 < 2e-16 ***
che.de 0.26584 0.06879 3.864 0.000126 ***
kne.di 0.64053 0.11727 5.462 7.47e-08 ***
sho.gi 0.08655 0.02825 3.063 0.002308 **
che.gi 0.16188 0.03385 4.782 2.30e-06 ***
wai.gi 0.38580 0.02499 15.440 < 2e-16 ***
hip.gi 0.23328 0.03843 6.070 2.55e-09 ***
thi.gi 0.25782 0.04873 5.290 1.84e-07 ***
for.gi 0.59434 0.09648 6.160 1.51e-09 ***
cal.gi 0.40568 0.05797 6.998 8.49e-12 ***
age -0.05331 0.01181 -4.515 7.93e-06 ***
hgt 0.32247 0.01553 20.769 < 2e-16 ***
sex -1.57950 0.48321 -3.269 0.001155 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.114 on 494 degrees of freedom
Multiple R-squared: 0.9755, Adjusted R-squared: 0.9749
F-statistic: 1639 on 12 and 494 DF,  p-value: < 2.2e-16
Let’s check the variance inflation factors for parameter estimates in this model.
vif(bdim_mod)
che.de kne.di sho.gi che.gi wai.gi hip.gi thi.gi for.gi cal.gi age hgt sex
3.391332 2.827420 9.726734 13.047310 8.572089 7.463359 5.348271 8.443665 3.085075 1.457092 2.415180 6.617277
We see that \(VIF_3\), \(VIF_4\), \(VIF_5\), and \(VIF_8\) are all above a threshold of 8 (a looser cutoff than the rule of thumb of 5 mentioned above). Let’s start by removing the predictor whose parameter estimate has the largest \(VIF\). This is che.gi.
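The largest VIF can also be identified programmatically:
which.max(vif(bdim_mod))   # points to che.gi, whose VIF is about 13.0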
bdim_mod_2 <- lm(wgt ~ che.de + kne.di + sho.gi + wai.gi + hip.gi + thi.gi + for.gi + cal.gi + age + hgt + sex, bdims)
vif(bdim_mod_2)
che.de kne.di sho.gi wai.gi hip.gi thi.gi for.gi cal.gi age hgt sex
3.165528 2.821696 6.930982 7.406451 7.457177 5.340541 7.489848 2.993096 1.457012 2.412913 6.579834
Removing che.gi from the model reduced all of the variance inflation factors. Although some of them are still high, they are all now below 8. Let’s look at a summary of this model.
summary(bdim_mod_2)
Call:
lm(formula = wgt ~ che.de + kne.di + sho.gi + wai.gi + hip.gi +
thi.gi + for.gi + cal.gi + age + hgt + sex, data = bdims)
Residuals:
Min 1Q Median 3Q Max
-6.8449 -1.3155 0.0831 1.2208 9.0228
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -121.36420 2.57414 -47.148 < 2e-16 ***
che.de 0.35071 0.06792 5.164 3.51e-07 ***
kne.di 0.61530 0.11971 5.140 3.96e-07 ***
sho.gi 0.15897 0.02437 6.523 1.70e-10 ***
wai.gi 0.42986 0.02373 18.112 < 2e-16 ***
hip.gi 0.22799 0.03926 5.808 1.13e-08 ***
thi.gi 0.26668 0.04976 5.359 1.29e-07 ***
for.gi 0.74940 0.09285 8.071 5.34e-15 ***
cal.gi 0.35782 0.05835 6.133 1.77e-09 ***
age -0.05289 0.01206 -4.384 1.43e-05 ***
hgt 0.32020 0.01586 20.191 < 2e-16 ***
sex -1.75330 0.49237 -3.561 0.000405 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.16 on 495 degrees of freedom
Multiple R-squared: 0.9744, Adjusted R-squared: 0.9738
F-statistic: 1711 on 11 and 495 DF, p-value: < 2.2e-16
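This remove-refit-recheck cycle can be automated. Here is a generic sketch (the function prune_by_vif is ours for illustration, not part of the car package) that repeatedly drops the highest-VIF predictor until every VIF falls below a chosen cutoff:
# Drop the predictor with the largest VIF and refit, stopping once all
# VIFs are below the cutoff (8 here, matching the threshold used above).
# Assumes at least two predictors remain at every step.
prune_by_vif <- function(fit, cutoff = 8) {
  while (max(vif(fit)) > cutoff) {
    worst <- names(which.max(vif(fit)))
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}
summary(prune_by_vif(bdim_mod))   # reproduces bdim_mod_2 in a single call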