require(leaps)
package 㤼㸱leaps㤼㸲 was built under R version 3.4.4
require(openintro)
require(ggplot2)
package 㤼㸱ggplot2㤼㸲 was built under R version 3.4.3
require(GGally)
package 㤼㸱GGally㤼㸲 was built under R version 3.4.4
We will be working with the bdims
dataset, which is included in the openintro
package. A description of the dataset can be found here: [https://www.openintro.org/stat/data/bdims.php].
data(bdims)
summary(bdims)
bia.di bii.di bit.di che.de che.di elb.di wri.di
Min. :32.40 Min. :18.70 Min. :24.70 Min. :14.30 Min. :22.20 Min. : 9.90 Min. : 8.10
1st Qu.:36.20 1st Qu.:26.50 1st Qu.:30.60 1st Qu.:17.30 1st Qu.:25.65 1st Qu.:12.40 1st Qu.: 9.80
Median :38.70 Median :28.00 Median :32.00 Median :19.00 Median :27.80 Median :13.30 Median :10.50
Mean :38.81 Mean :27.83 Mean :31.98 Mean :19.23 Mean :27.97 Mean :13.39 Mean :10.54
3rd Qu.:41.15 3rd Qu.:29.25 3rd Qu.:33.35 3rd Qu.:20.90 3rd Qu.:29.95 3rd Qu.:14.40 3rd Qu.:11.20
Max. :47.40 Max. :34.70 Max. :38.00 Max. :27.50 Max. :35.60 Max. :16.70 Max. :13.30
kne.di ank.di sho.gi che.gi wai.gi nav.gi hip.gi
Min. :15.70 Min. : 9.90 Min. : 85.90 Min. : 72.60 Min. : 57.90 Min. : 64.00 Min. : 78.80
1st Qu.:17.90 1st Qu.:13.00 1st Qu.: 99.45 1st Qu.: 85.30 1st Qu.: 68.00 1st Qu.: 78.85 1st Qu.: 92.00
Median :18.70 Median :13.80 Median :108.20 Median : 91.60 Median : 75.80 Median : 84.60 Median : 96.00
Mean :18.81 Mean :13.86 Mean :108.20 Mean : 93.33 Mean : 76.98 Mean : 85.65 Mean : 96.68
3rd Qu.:19.60 3rd Qu.:14.80 3rd Qu.:116.55 3rd Qu.:101.15 3rd Qu.: 84.50 3rd Qu.: 91.60 3rd Qu.:101.00
Max. :24.30 Max. :17.20 Max. :134.80 Max. :118.70 Max. :113.20 Max. :121.10 Max. :128.30
thi.gi bic.gi for.gi kne.gi cal.gi ank.gi wri.gi
Min. :46.30 Min. :22.40 Min. :19.60 Min. :29.00 Min. :28.40 Min. :16.40 Min. :13.0
1st Qu.:53.70 1st Qu.:27.60 1st Qu.:23.60 1st Qu.:34.40 1st Qu.:34.10 1st Qu.:21.00 1st Qu.:15.0
Median :56.30
We will use the function regsubsets()
from the leaps
package to perform best subset selection.
subset <- regsubsets(wgt ~ ., bdims, nvmax = 24)
summary(subset)
Subset selection object
Call: regsubsets.formula(wgt ~ ., bdims, nvmax = 24)
24 Variables (and intercept)
Forced in Forced out
bia.di FALSE FALSE
bii.di FALSE FALSE
bit.di FALSE FALSE
che.de FALSE FALSE
che.di FALSE FALSE
elb.di FALSE FALSE
wri.di FALSE FALSE
kne.di FALSE FALSE
ank.di FALSE FALSE
sho.gi FALSE FALSE
che.gi FALSE FALSE
wai.gi FALSE FALSE
nav.gi FALSE FALSE
hip.gi FALSE FALSE
thi.gi FALSE FALSE
bic.gi FALSE FALSE
for.gi FALSE FALSE
kne.gi FALSE FALSE
cal.gi FALSE FALSE
ank.gi FALSE FALSE
wri.gi FALSE FALSE
age FALSE FALSE
hgt FALSE FALSE
sex FALSE FALSE
1 subsets of each size up to 24
Selection Algorithm: exhaustive
bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi che.gi wai.gi nav.gi hip.gi thi.gi bic.gi
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*" " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*" " " " " "*" " "
4 ( 1 ) " " " " " " " " " " " " " " " " "*" "*" " " "*" "*" " "
10 ( 1 ) " " " " " " "*" " " " " " " "*" " " " " "*" "*" " " "*" "*" " "
11 ( 1 ) " " " " " " "*" " " " " " " "*" " " " " "*" "*" " " "*" "*" " "
12 ( 1 ) " " " " " " "*" " " " " " " "*" " " "*" "*" "*" " " "*" "*" " "
13 ( 1 ) " " " " " " "*" " " " " " " "*" " " "*" "*" "*" " " "*" "*" " "
14 ( 1 ) " " "*" " " "*" " " " " " " "*" " " "*" "*" "*" " " "*" "*" " "
15 ( 1 ) " " "*" " " "*" "*" " " " " "*" " " "*" "*" "*" " " "*" "*" " "
16 ( 1 ) " " "*" " " "*" "*" "*" " " "*" "*" "*"
21 ( 1 ) " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*"
22 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*"
23 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
24 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
for.gi kne.gi cal.gi ank.gi wri.gi age hgt sex
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " "*" " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " " "*" " "
4 ( 1 ) "*" " " " " " " " " " " "*" " "
5 ( 1 ) " " " " "*" " " " " " " "*" " "
6 ( 1 ) " " " " "*" " " " " " " "*" " "
7 ( 1 ) "*" " " "*" " " " " " " "*" " "
8 ( 1 ) "*" " " "*" " " " " " " "*" " "
9 ( 1 ) "*" " " "*" " " " " "*" "*" " "
10 ( 1 ) "*" " " "*" " " " " "*" "*" " "
11 ( 1 ) "*" "*" "*" " " " " "*" "*" " "
12 ( 1 ) "*" " " "*" " " " " "*" "*" "*"
13 ( 1 ) "*" "*" "*" " " " " "*" "*" "*"
14 ( 1 ) "*" "*" "*" " " " " "*" "*" "*"
15 ( 1 ) "*" "*" "*" " " " " "*" "*" "*"
16 ( 1 ) "*" "*" "*" " " " " "*" "*" "*"
17 ( 1 ) "*" "*" "*" " " " " "*" "*" "*"
18 ( 1 ) "*" "*" "*" " " " " "*" "*" "*"
19 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
20 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
21 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
22 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
23 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
24 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
When we apply summary()
function to the output of regsubsets()
, some of the calculated information is not displayed. Let’s see what else is contained in the summary.
reg_summary <- summary(subset)
names(reg_summary)
[1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
Let’s generate plots of the metrics contained in the summary output. We will plot the metrics against the number of predictors in our model. For any given number of predictors, only the model with the best value for a certain metric will be displayed.
par(mfrow=c(2,2))
# r-Squared
plot(reg_summary$rsq, xlab="Number of Predictors", ylab="r-Squared", type="l")
# Adjusted r-Squared
plot(reg_summary$adjr2, xlab="Number of Predictors", ylab="Adjusted r-Squared", type="l")
k <- which.max(reg_summary$adjr2)
points(k, reg_summary$adjr2[k], col="red", cex=2, pch=20)
lines(c(k,k), c(0,reg_summary$adjr2[k]), col="red", lty="dashed")
# C_p
plot(reg_summary$cp, xlab="Number of Predictors", ylab="Cp", type="l")
k <- which.min(reg_summary$cp)
points(k, reg_summary$cp[k], col="red", cex=2, pch=20)
lines(c(k,k), c(0,reg_summary$cp[k]), col="red", lty="dashed")
# BIC
plot(reg_summary$bic, xlab="Number of Predictors", ylab="BIC", type="l")
k <- which.min(reg_summary$bic)
points(k, reg_summary$bic[k], col="red", cex=2, pch=20)
lines(c(k,k), c(0,reg_summary$bic[k]), col="red", lty="dashed")
par(mfrow=c(2,2))
par(mfrow=c(2,2))
plot(subset, scale="r2")
plot(subset, scale="adjr2")
plot(subset, scale="Cp")
plot(subset, scale="bic")
par(mfrow=c(1,1))
mod <- lm(wgt ~ che.de + kne.di + sho.gi + che.gi + wai.gi + hip.gi + thi.gi + for.gi + cal.gi + age + hgt + sex, bdims)
summary(mod)
Call:
lm(formula = wgt ~ che.de + kne.di + sho.gi + che.gi + wai.gi +
hip.gi + thi.gi + for.gi + cal.gi + age + hgt + sex, data = bdims)
Residuals:
Min 1Q Median 3Q Max
-7.3146 -1.3279 0.0409 1.3412 8.7983
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -122.26113 2.52608 -48.400 < 2e-16 ***
che.de 0.26584 0.06879 3.864 0.000126 ***
kne.di 0.64053 0.11727 5.462 7.47e-08 ***
sho.gi 0.08655 0.02825 3.063 0.002308 **
che.gi 0.16188 0.03385 4.782 2.30e-06 ***
wai.gi 0.38580 0.02499 15.440 < 2e-16 ***
hip.gi 0.23328 0.03843 6.070 2.55e-09 ***
thi.gi 0.25782 0.04873 5.290 1.84e-07 ***
for.gi 0.59434 0.09648 6.160 1.51e-09 ***
cal.gi 0.40568 0.05797 6.998 8.49e-12 ***
age -0.05331 0.01181 -4.515 7.93e-06 ***
hgt 0.32247 0.01553 20.769 < 2e-16 ***
sex -1.57950 0.48321 -3.269 0.001155 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.114 on 494 degrees of freedom
Multiple R-squared: 0.9755, Adjusted R-squared: 0.9749
F-statistic: 1639 on 12 and 494 DF, p-value: < 2.2e-16
plot(mod$residuals)
abline(h=0, col="red")
shapiro.test(mod$residuals)
Shapiro-Wilk normality test
data: mod$residuals
W = 0.9898, p-value = 0.001366
res <- mod$residuals
res[res > 6]
124 141 159 192 440
8.503349 8.798344 6.604020 6.201988 6.365948
res2 <- res[res <= 6]
par(mfrow = c(1,2))
hist(res2)
qqnorm(res2)
qqline(res2)
par(mfrow = c(1,1))
shapiro.test(res2)
Shapiro-Wilk normality test
data: res2
W = 0.99736, p-value = 0.6084
nd <- data.frame(bia.di = 36.5, bii.di = 32, bit.di = 32, che.de = 15, che.di = 30.5,
elb.di = 14, wri.di = 10, kne.di = 20, ank.di = 12, sho.gi = 109,
che.gi = 89, wai.gi = 71.5, nav.gi = 75, hip.gi = 91, thi.gi = 51.5,
bic.gi = 30, for.gi = 26.5, kne.gi = 37, cal.gi = 37, ank.gi = 22,
wri.gi = 16.5, age = 39, hgt = 173, sex = 1)
predict(mod, newdata = nd, interval="prediction", level = 0.95) * 2.20462
fit lwr upr
1 139.6792 130.3717 148.9867
tempdf <- bdims[c('wgt', 'che.de', 'kne.di', 'sho.gi', 'che.gi', 'wai.gi', 'hip.gi', 'thi.gi', 'for.gi', 'cal.gi', 'age', 'hgt', 'sex')]
cor(tempdf)
wgt che.de kne.di sho.gi che.gi wai.gi hip.gi thi.gi for.gi cal.gi age
wgt 1.0000000 0.8007315 0.7660485 0.8788342 0.8989595 0.9039908 0.7629691 0.55856265 0.8695531 0.7692826 0.20726524
che.de 0.8007315 1.0000000 0.5502889 0.7376115 0.8065033 0.8037549 0.5563131 0.35765409 0.7175490 0.5535016 0.31540494
kne.di 0.7660485 0.5502889 1.0000000 0.6818019 0.6522224 0.6239675 0.5795936 0.43152755 0.7206519 0.6860935 0.17233593
sho.gi 0.8788342 0.7376115 0.6818019 1.0000000 0.9271923 0.8234546 0.5336717 0.32342724 0.8949838 0.6270538 0.17683840
che.gi 0.8989595 0.8065033 0.6522224 0.9271923 1.0000000 0.8837994 0.5834991 0.36305077 0.8875909 0.6088643 0.24896674
wai.gi 0.9039908 0.8037549 0.6239675 0.8234546 0.8837994 1.0000000 0.6923506 0.42108488 0.7807924 0.6313445 0.36913120
hip.gi 0.7629691 0.5563131 0.5795936 0.5336717 0.5834991 0.6923506 1.0000000 0.82894106 0.5143585 0.6745805 0.22708011
thi.gi 0.5585626 0.3576541 0.4315276 0.3234272 0.3630508 0.4210849 0.8289411 1.00000000 0.3452848 0.6288901 -0.01985545
for.gi 0.8695531 0.7175490 0.7206519 0.8949838 0.8875909 0.7807924 0.5143585 0.34528479 1.0000000 0.6701918 0.14691546
cal.gi 0.7692826 0.5535016 0.6860935 0.6270538 0.6088643 0.6313445 0.6745805 0.62889011 0.6701918 1.0000000 0.10605240
age 0.2072652 0.3154049 0.1723359 0.1768384 0.2489667 0.3691312 0.2270801 -0.01985545 0.1469155 0.1060524 1.00000000
hgt 0.7173011 0.5529111 0.5880951 0.6657353 0.6187309 0.5529605 0.3385840 0.11630969 0.6550178 0.4526826 0.06788349
sex 0.6577258 0.6128861 0.5439263 0.7811421 0.7449280 0.6692025 0.1580583 -0.07828143 0.7918965 0.3866644 0.15094462
hgt sex
wgt 0.71730108 0.65772575
che.de 0.55291112 0.61288606
kne.di 0.58809509 0.54392633
sho.gi 0.66573530 0.78114212
che.gi 0.61873095 0.74492797
wai.gi 0.55296046 0.66920250
hip.gi 0.33858399 0.15805828
thi.gi 0.11630969 -0.07828143
for.gi 0.65501784 0.79189648
cal.gi 0.45268265 0.38666437
age 0.06788349 0.15094462
hgt 1.00000000 0.68466211
sex 0.68466211 1.00000000
ggcorr(tempdf, palette = "RdBu", label = TRUE)