In multiple linear regression, we model the response variable as a linear combination of several predictors, plus a single error term. Assume that we have a single response \(Y\), and \(p\) predictors \(X^{(1)}, X^{(2)}, ..., X^{(p)}\). The hypothetical form for a multiple regression model would be as follows:
\[Y = \beta_0 + \beta_1 X^{(1)} + \beta_2 X^{(2)} + ... + \beta_p X^{(p)} + e\]
We make the same assumptions for the error term \(e\) as we did for simple linear regression. The fitted model will have the form:
\[\hat Y = \hat\beta_0 + \hat\beta_1 X^{(1)} + \hat\beta_2 X^{(2)} + ... + \hat\beta_p X^{(p)}\] This fitted model will be obtained by finding the parameter estimates that minimize the sum of squared errors, as was the case in simple linear regression.
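Concretely, the least-squares estimates are the values \(\hat\beta_0, \hat\beta_1, ..., \hat\beta_p\) that minimize:
\[SSE = \sum_{i=1}^n \left(Y_i - \hat Y_i\right)^2 = \sum_{i=1}^n \left(Y_i - \hat\beta_0 - \hat\beta_1 X_i^{(1)} - ... - \hat\beta_p X_i^{(p)}\right)^2\]
where \(X_i^{(j)}\) denotes the value of the \(j\)th predictor for the \(i\)th observation.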
In this lesson, we will be exploring the “New York City Restaurant” dataset, which is stored in the file nyc.txt. This tab-separated text file contains information for 168 Italian restaurants in New York City. The dataset contains six variables for each restaurant:
Variable | Description |
---|---|
Price | The average price, in USD, for a meal at the restaurant (including one drink and a tip). |
Food | The average customer rating of the food, on a scale from 1 to 30. |
Decor | The average customer rating of the decor, on a scale from 1 to 30. |
Service | The average customer rating of the service, on a scale from 1 to 30. |
Wait | The average wait time, in minutes, on a Saturday evening. |
East | A binary variable. Equals 1 if the restaurant is east of 5th Avenue, and 0 otherwise. |
df <- read.table("nyc.txt", sep="\t", header = TRUE)
summary(df)
Price Food Decor Service Wait East
Min. :19.0 Min. :16.0 Min. : 6.00 Min. :14.0 Min. : 0.00 Min. :0.000
1st Qu.:36.0 1st Qu.:19.0 1st Qu.:16.00 1st Qu.:18.0 1st Qu.:16.75 1st Qu.:0.000
Median :43.0 Median :20.5 Median :18.00 Median :20.0 Median :23.00 Median :1.000
Mean :42.7 Mean :20.6 Mean :17.69 Mean :19.4 Mean :22.92 Mean :0.631
3rd Qu.:50.0 3rd Qu.:22.0 3rd Qu.:19.00 3rd Qu.:21.0 3rd Qu.:29.00 3rd Qu.:1.000
Max. :65.0 Max. :25.0 Max. :25.00 Max. :24.0 Max. :46.00 Max. :1.000
The variable East is a qualitative (categorical) variable that we will not be using in this lesson.
We will create a pairs plot of the 5 variables that we are interested in.
pairs(df[1:5])
Our goal is to create a regression model with price as the response and some combination of the other four variables as predictors. We will begin with the “full” model that uses all four available predictors. Our fitted model will have the following form:
\[\hat {Price} = \hat\beta_0 + \hat\beta_1\cdot Food + \hat\beta_2\cdot Decor + \hat\beta_3\cdot Service + \hat\beta_4\cdot Wait\]
mod1 <- lm(Price ~ Food + Decor + Service + Wait, df)
summary(mod1)
Call:
lm(formula = Price ~ Food + Decor + Service + Wait, data = df)
Residuals:
Min 1Q Median 3Q Max
-14.5265 -3.8743 -0.1174 3.6356 19.0440
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -25.81378 4.82789 -5.347 2.98e-07 ***
Food 1.57176 0.37249 4.220 4.05e-05 ***
Decor 1.85427 0.21715 8.539 9.09e-15 ***
Service 0.08922 0.39637 0.225 0.822
Wait 0.07006 0.05370 1.305 0.194
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.791 on 163 degrees of freedom
Multiple R-squared: 0.6209, Adjusted R-squared: 0.6116
F-statistic: 66.75 on 4 and 163 DF, p-value: < 2.2e-16
We will briefly explain the various parts of the model summary.
The coefficient estimates \(\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3, \hat\beta_4\) are shown in the first column of the section labeled “Coefficients”. We can use these to see that the fitted model is given by:
\[\hat {Price} = -25.814 + 1.5718\cdot Food + 1.8543\cdot Decor + 0.08922\cdot Service + 0.07006\cdot Wait\]
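If we want to use these estimates programmatically rather than reading them off of the summary, we can extract them from the fitted model object:
coef(mod1)            # named vector of all five coefficient estimates
coef(mod1)["Decor"]   # extract a single estimate by name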
Notice that some of the coefficients in this model are very small. That is not a huge concern in and of itself. What we want to know, however, is whether any of the coefficients are so small that we cannot say, with a reasonably high degree of certainty, that they are non-zero. We can check that using t-tests on each of the estimates.
R automatically performs a t-test of the form \(H_0: \beta_i = 0\), \(H_A: \beta_i \neq 0\) for each of the coefficients in the model. Let’s display the model summary again to see these results.
summary(mod1)
Call:
lm(formula = Price ~ Food + Decor + Service + Wait, data = df)
Residuals:
Min 1Q Median 3Q Max
-14.5265 -3.8743 -0.1174 3.6356 19.0440
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -25.81378 4.82789 -5.347 2.98e-07 ***
Food 1.57176 0.37249 4.220 4.05e-05 ***
Decor 1.85427 0.21715 8.539 9.09e-15 ***
Service 0.08922 0.39637 0.225 0.822
Wait 0.07006 0.05370 1.305 0.194
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.791 on 163 degrees of freedom
Multiple R-squared: 0.6209, Adjusted R-squared: 0.6116
F-statistic: 66.75 on 4 and 163 DF, p-value: < 2.2e-16
Notice that the results of the t-tests for the intercept, Food, and Decor are statistically significant, while the results of the tests for Service and Wait are NOT statistically significant. We can be quite confident that the true values of \(\beta_0\), \(\beta_1\), and \(\beta_2\) are non-zero. We cannot say with a large degree of certainty that the true values of the parameters \(\beta_3\) and \(\beta_4\) are non-zero.
If we are not confident that \(\beta_3\) and \(\beta_4\) are non-zero, then we are not confident that Service and Wait have an effect on the price. We should consider removing these from the model. We will do this in a moment.
Recall that \(\sigma^2 = \mathrm{Var}[e]\). For simple linear regression, the quantity \(s^2 = \frac{SSE}{n - 2}\) provides an unbiased estimate of \(\sigma^2\). For multiple regression, our unbiased estimate is given by:
\[s^2 = \frac{SSE}{n-p-1}\] The square root of this quantity, \(s\), is called the residual standard error. It approximates the standard deviation of our error term.
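We can verify this estimate directly from the residuals of our model. A quick check, using the \(n = 168\) observations and \(p = 4\) predictors from mod1:
n <- nrow(df)                 # 168 observations
p <- 4                        # number of predictors in mod1
SSE <- sum(resid(mod1)^2)     # sum of squared residuals
s <- sqrt(SSE / (n - p - 1))  # residual standard error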
For our model, we see that \(s = 5.791\). Let’s compare that with the standard deviation of our response variable:
sd(df$Price)
[1] 9.292814
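The residual standard error of our model (5.791) is considerably smaller than the standard deviation of Price (9.293). In other words, our typical prediction error is much smaller when we use the model than when we simply predict the mean price for every restaurant.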
Note that the summary output for the model states two R-squared values: “Multiple R-squared” and “Adjusted R-squared”. The first value, multiple r-squared, is the standard version of \(r^2\) that we are used to. In particular:
\[r^2 = 1 - \frac{SSE}{SST} \approx 1 - \frac{s^2}{s^2_Y}\]
Adjusted r-squared is a modification of the \(r^2\) value that applies a small penalty for having additional (potentially unnecessary) predictors in your model. It is defined as follows:
\[r_{adj}^2 = 1 - \frac{s^2}{s_Y^2} = 1 - \frac{\frac{1}{n-p-1}SSE}{\frac{1}{n-1}SST} = 1 - \frac{(n-1) SSE}{(n-p-1)SST} = r^2 - (1-r^2) \frac{p}{n-p-1}\]
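Both values can be computed by hand from the sums of squares. A sketch, reusing the quantities n, p, and SSE defined above:
SST <- sum((df$Price - mean(df$Price))^2)  # total sum of squares
1 - SSE / SST                              # multiple r-squared: 0.6209
1 - (SSE / (n - p - 1)) / (SST / (n - 1))  # adjusted r-squared: 0.6116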
When considering several possible models, each with a different number of predictors, we prefer to use the adjusted r-squared to compare the models (as opposed to the standard r-squared value). This will help us to avoid creating models that contain unnecessary predictors.
The model summary outputs the results of an F-test. This is a test of the following hypotheses:
\(H_0 : \beta_1 = \beta_2 = ... = \beta_p = 0\)
\(H_A :\) There is at least one \(\beta_i \neq 0\)
summary(mod1)
Call:
lm(formula = Price ~ Food + Decor + Service + Wait, data = df)
Residuals:
Min 1Q Median 3Q Max
-14.5265 -3.8743 -0.1174 3.6356 19.0440
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -25.81378 4.82789 -5.347 2.98e-07 ***
Food 1.57176 0.37249 4.220 4.05e-05 ***
Decor 1.85427 0.21715 8.539 9.09e-15 ***
Service 0.08922 0.39637 0.225 0.822
Wait 0.07006 0.05370 1.305 0.194
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.791 on 163 degrees of freedom
Multiple R-squared: 0.6209, Adjusted R-squared: 0.6116
F-statistic: 66.75 on 4 and 163 DF, p-value: < 2.2e-16
Since the p-value for the F-test is so small, we can be confident that at least some of the coefficients in our model are non-zero.
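The F-statistic itself can be reproduced from the sums of squares as well. A sketch, again reusing n, p, SSE, and SST from above:
F_stat <- ((SST - SSE) / p) / (SSE / (n - p - 1))         # matches the reported 66.75
pf(F_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)  # p-value for the F-test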
Since the t-tests for the coefficients of Service and Wait were not statistically significant, we will remove those from the model.
mod2 <- lm(Price ~ Food + Decor, df)
summary(mod2)
Call:
lm(formula = Price ~ Food + Decor, data = df)
Residuals:
Min 1Q Median 3Q Max
-14.945 -3.766 -0.153 3.701 18.757
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -24.5002 4.7230 -5.187 6.19e-07 ***
Food 1.6461 0.2615 6.294 2.68e-09 ***
Decor 1.8820 0.1919 9.810 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.788 on 165 degrees of freedom
Multiple R-squared: 0.6167, Adjusted R-squared: 0.6121
F-statistic: 132.7 on 2 and 165 DF, p-value: < 2.2e-16
Notice that the multiple r-squared value has decreased, but the adjusted r-squared value has increased.
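We can also compare the two nested models directly with a partial F-test, which tests whether the coefficients of the dropped predictors, Service and Wait, are simultaneously zero:
anova(mod2, mod1)  # partial F-test: reduced model vs. full model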
We will conclude by analyzing the residuals for our reduced model.
plot(mod2$residuals ~ mod2$fitted.values, pch=21, col="black", bg="salmon",
xlab="Fitted Value", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)
res <- mod2$residuals
# Display a histogram and a normal QQ plot of the residuals side by side
par(mfrow=c(1,2))
hist(res, col='orchid', breaks=10)
qqnorm(res)
qqline(res)
par(mfrow=c(1,1))
shapiro.test(res)
Shapiro-Wilk normality test
data: res
W = 0.98647, p-value = 0.1044
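Since the p-value of the Shapiro-Wilk test is greater than 0.05, we do not have significant evidence against the assumption that the error terms are normally distributed. Finally, we will use our reduced model to generate 99% prediction intervals for the price at three hypothetical restaurants with specified Food and Decor ratings: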
nd <- data.frame(Food = c(21, 24, 18), Decor = c(25, 13, 16))
predict(mod2, newdata = nd, interval = "prediction", level = 0.99)
fit lwr upr
1 57.11936 41.58650 72.65222
2 39.47363 23.81404 55.13321
3 35.24278 20.03783 50.44773
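For example, the first row tells us that a restaurant with a food rating of 21 and a decor rating of 25 has a predicted average meal price of about $57.12, and we are 99% confident that the price at such a restaurant would fall between $41.59 and $72.65.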