Overview

Up to this point, we have made no assumptions about the distribution of the error term \(e\). In fact, no such assumptions are needed in order to find the parameter estimates in the OLS regression line. However, if we would like to quantify how certain we are about predictions made by our model, then we will need to make some assumptions about the distribution of \(e\).

Linear Regression Assumptions (Gauss-Markov Assumptions)

When working with a linear regression model, we will make the following assumptions:

  1. The relationship between X and Y is determined by a linear model of the form \(Y = \beta_0 + \beta_1 X + e\).

  2. Separate observations of \(e\) are independent of one another.

  3. The error term \(e\) is normally distributed with a mean of 0 and standard deviation \(\sigma\).

  4. The error term \(e\) is independent of \(X\). In particular, the variance \(\mathrm{Var}(e) = \sigma^2\) does not depend on \(X\).

The condition described in item four above requires that the variance of \(e\) is constant over all values of \(X\). This condition is referred to as homoskedasticity. If the variance of \(e\) varies over \(X\), then we say that the error is heteroskedastic.

Residual Plots

Before using our model for any tasks that depend on the Gauss-Markov assumptions, we should check our model to get a sense of whether these assumptions are plausible. We don’t have access to the original error terms associated with each of our observations, but we do have access to our residuals, which can be used to approximate the distribution of \(e\).

One useful tool for conducting residual analysis is a residual plot. This is simply a plot of the residuals against the \(X\) values of our observations.

In the code chunk below, we simulate a dataset and then create a regression model based on this dataset.

set.seed(1)
x <- runif(100, 0, 20)
y <- 6 + 1.4 * x + rnorm(100,0,2)
mod <- lm(y ~ x)
summary(mod)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6996 -1.1244 -0.1741  1.0485  5.0332 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.64135    0.41162   13.71   <2e-16 ***
x            1.43123    0.03535   40.49   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.882 on 98 degrees of freedom
Multiple R-squared:  0.9436,    Adjusted R-squared:  0.943 
F-statistic:  1640 on

The plot below displays a scatter plot of \(Y\) against \(X\), as well as a plot of the OLS regression line.

plot(y ~ x, pch=21, bg="cyan", col="black", cex=1.25)
abline(mod$coefficients, col="maroon", lwd=2)

The following residual plot is formed by plotting the residuals against the associated \(X\) values.

res <- mod$residuals
plot(res ~ x, pch=21, bg="cyan", col="black", cex=1.25)
abline(h=0, col="maroon", lwd=2)
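As a quick check on what the residuals are, the sketch below recomputes them by hand for the simulated dataset above. It assumes the same seed and simulation settings as the earlier code chunk; the names `res_manual` and `sigma_hat` are introduced here only for illustration.

```r
# Re-create the simulated data and model from the text.
set.seed(1)
x <- runif(100, 0, 20)
y <- 6 + 1.4 * x + rnorm(100, 0, 2)
mod <- lm(y ~ x)

# A residual is the observed value minus the fitted value.
res_manual <- unname(y - fitted(mod))
all.equal(res_manual, unname(residuals(mod)))

# With an intercept in the model, OLS residuals average to (numerically) zero.
mean(res_manual)

# Estimate of sigma: residual sum of squares over n - 2, then a square root.
# This should agree with the residual standard error reported by summary(mod).
sigma_hat <- sqrt(sum(res_manual^2) / (length(y) - 2))
sigma_hat
```

Since the errors were simulated with a standard deviation of 2, we would expect `sigma_hat` to land reasonably close to 2.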

Violations of the Gauss-Markov Assumptions

To help you identify situations in which the Gauss-Markov assumptions appear to be valid, we will now demonstrate several examples in which these assumptions are violated.

Example 1

There is a noticeable “trend” in the residuals for this dataset. This results in assumptions 1, 2, and 4 being violated.
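The code that generated this example is not shown. As a rough sketch, a dataset with a similar residual trend can be simulated by assuming a curved (here quadratic) relationship and then fitting a straight line to it. The names `x1`, `y1`, and `mod1_demo`, as well as the specific coefficients, are assumptions made for illustration.

```r
# Hypothetical data: the true relationship is quadratic, not linear.
set.seed(1)
x1 <- runif(100, 0, 20)
y1 <- 6 + 0.15 * (x1 - 10)^2 + rnorm(100, 0, 2)

# Fit a straight line anyway.
mod1_demo <- lm(y1 ~ x1)
res1 <- residuals(mod1_demo)

# The residual plot should show a curved pattern around the zero line.
plot(res1 ~ x1, pch=21, bg="cyan", col="black", cex=1.25)
abline(h=0, col="maroon", lwd=2)
```

In a plot like this, the residuals would tend to be positive at both ends of the range of \(X\) and negative in the middle, which is the kind of trend described above.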

Example 2

Notice that the residuals tend to be larger in magnitude for large values of \(X\), causing the observations to “fan out” away from the regression line. In other words, the errors are heteroskedastic, violating assumption 4.
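The code for this example is also not shown. A minimal sketch of heteroskedastic data, assuming that the standard deviation of the error grows in proportion to \(X\), might look as follows. The names `x2`, `y2`, and `mod2_demo` and the factor 0.3 are illustrative assumptions.

```r
# Hypothetical data: the error standard deviation is 0.3 * x, so it grows with x.
set.seed(1)
x2 <- runif(100, 0, 20)
y2 <- 6 + 1.4 * x2 + rnorm(100, 0, 0.3 * x2)

mod2_demo <- lm(y2 ~ x2)
res2 <- residuals(mod2_demo)

# The residual plot should fan out as x increases.
plot(res2 ~ x2, pch=21, bg="cyan", col="black", cex=1.25)
abline(h=0, col="maroon", lwd=2)

# Compare the spread of the residuals for small and large values of x.
sd_low <- sd(res2[x2 < 10])
sd_high <- sd(res2[x2 >= 10])
c(low = sd_low, high = sd_high)
```

Comparing the spread of the residuals over different ranges of \(X\) gives a rough numerical counterpart to the visual fan shape.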

Example 3

In this example, we see that the data does seem to exhibit a linear relationship. But there are intervals in which the residuals are mostly positive, or mostly negative. This indicates that the errors are related to \(X\), and that separate observations of \(e\) are not independent of each other. This violates assumptions 2 and 4.
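The data for this example is not shown either. One way to obtain residuals that stay positive or negative over whole intervals is to assume that each error carries over part of the previous error. The sketch below does this with a simple first-order autoregressive error; the names `x3`, `e3`, `y3`, and `mod3_demo` and the carry-over factor 0.9 are assumptions made for illustration.

```r
# Hypothetical data: observations are ordered by x, and each error
# is 0.9 times the previous error plus fresh normal noise.
set.seed(1)
n <- 100
x3 <- seq(0, 20, length.out = n)
e3 <- numeric(n)
e3[1] <- rnorm(1)
for (i in 2:n) {
  e3[i] <- 0.9 * e3[i - 1] + rnorm(1)
}
y3 <- 6 + 1.4 * x3 + e3

mod3_demo <- lm(y3 ~ x3)
res3 <- residuals(mod3_demo)

# The residual plot should show long runs above and below the zero line.
plot(res3 ~ x3, pch=21, bg="cyan", col="black", cex=1.25)
abline(h=0, col="maroon", lwd=2)

# Correlation between each residual and the one before it.
lag1_cor <- cor(res3[-1], res3[-n])
lag1_cor
```

For independent errors, we would expect the correlation between neighboring residuals to be close to zero, so a clearly positive value is a warning sign.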

Example 4

hist(mod4$residuals, col="orchid", main="Histogram of Residuals", xlab="Residuals")

In this example, it appears that the residuals are not normally distributed, violating assumption 3.
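The model `mod4` used above is not defined in the code shown. A dataset with non-normal residuals can be sketched by assuming a skewed error distribution, for instance a shifted exponential. The names `x4`, `e4`, `y4`, and `mod4_demo` below are hypothetical and are not the data behind the histogram above.

```r
# Hypothetical data: errors are exponential with mean 2, shifted to have mean 0.
# They are right-skewed rather than normal.
set.seed(1)
x4 <- runif(100, 0, 20)
e4 <- rexp(100, rate = 0.5) - 2
y4 <- 6 + 1.4 * x4 + e4

mod4_demo <- lm(y4 ~ x4)
res4 <- residuals(mod4_demo)

# The histogram of residuals should have a long right tail.
hist(res4, col="orchid", main="Histogram of Residuals", xlab="Residuals")

# A positive third central moment is a rough numerical sign of right skew.
third_moment <- mean((res4 - mean(res4))^3)
third_moment
```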

Q-Q Plots

A useful tool for testing the normality assumption is provided by the Q-Q Plot (or Quantile-Quantile Plot). The Q-Q plot is constructed by plotting the empirical quantiles of a sample against the theoretical quantiles from a distribution that we might suspect the sample was drawn from. This will be explained in more detail in class.
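As a rough sketch of this construction, the code below builds a normal Q-Q plot by hand. The sample `s` and the names `emp`, `theo`, `slope`, and `intercept` are assumptions made for illustration. The reference line is drawn through the first and third quartiles, which is how `qqline` chooses its line by default.

```r
# Hypothetical sample to be checked against a normal distribution.
set.seed(1)
s <- rnorm(50, 10, 3)
n <- length(s)

# Empirical quantiles: the sorted sample values.
emp <- sort(s)

# Theoretical quantiles: standard normal quantiles at n evenly spread
# probability points, as produced by ppoints().
theo <- qnorm(ppoints(n))

plot(theo, emp, xlab="Theoretical Quantiles", ylab="Sample Quantiles",
     main="Q-Q Plot Built by Hand")

# Reference line through the first and third quartiles of both axes.
qy <- quantile(s, c(0.25, 0.75))
qx <- qnorm(c(0.25, 0.75))
slope <- unname(diff(qy) / diff(qx))
intercept <- unname(qy[1] - slope * qx[1])
abline(intercept, slope, col="maroon", lwd=2)
```

The points plotted here should match those drawn by `qqnorm(s)`. In practice we would simply call `qqnorm` and `qqline`.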

If the sample was drawn from the hypothesized distribution, then the points in the Q-Q Plot should fall near a straight line. We conclude this lesson by showing Q-Q plots for several distributions. In each case, we are testing the assumption that the sample was drawn from a normal distribution.

Example 1: Normal Distribution

set.seed(1)
r1 <- rnorm(500, 0, 2)
par(mfrow=c(1,2))
hist(r1, col="orchid")
qqnorm(r1)
qqline(r1)
par(mfrow=c(1,1))

Example 2: Right-Skewed Distribution

set.seed(1)
r2 <- rgamma(500, 4, 1)
par(mfrow=c(1,2))
hist(r2, col="orchid")
qqnorm(r2)
qqline(r2)
par(mfrow=c(1,1))

Example 3: Left-Skewed Distribution

set.seed(1)
r3 <- 20 - rgamma(500, 4, 1)
par(mfrow=c(1,2))
hist(r3, col="orchid")
qqnorm(r3)
qqline(r3)
par(mfrow=c(1,1))

Example 4: Heavy-Tailed Distribution

set.seed(1)
r4 <- 4 + rt(500, 5)
par(mfrow=c(1,2))
hist(r4, col="orchid")
qqnorm(r4)
qqline(r4)
par(mfrow=c(1,1))

Example 5: Light-Tailed Distribution

set.seed(250)
r5 <- runif(500, 4, 8)
par(mfrow=c(1,2))
hist(r5, col="orchid")
qqnorm(r5)
qqline(r5)
par(mfrow=c(1,1))

