Variance Reduction

Assume that we have two variables, \(X\) and \(Y\), that are related according to the hypothetical model \(Y = \beta_0 + \beta_1 X + e\). We collect a sample consisting of \(n\) paired observations of the form \((x_i,y_i)\). We then use the sample to create a fitted model of the form \(\hat Y = \hat \beta_0 + \hat \beta_1 X\).

Assume we wish to make a prediction about the value of \(Y\) for a new observation. Without considering the effect that \(X\) has on \(Y\), our best point estimate for the new value of \(Y\) would be \(\bar y\). To take into account the uncertainty that we know exists in our prediction, we could create a 95% prediction interval around our point prediction, \(\bar y\). Such an interval is shown in the plot on the left below. We will discuss how prediction intervals are formed in a later lesson.

If, however, we know the value of \(X\) in the new observation, then a better estimate for the value of \(Y\) would be given by \(E[Y | X = x]= \beta_0 + \beta_1 x\), which can be approximated using our fitted model. This gives us a point estimate of \(\hat y = \hat \beta_0 + \hat \beta_1 x\). Again, we know that there is some uncertainty in our prediction, so we might create a 95% prediction interval around the point prediction \(\hat y\). Such a prediction interval is shown in the plot on the right below.

Notice that the prediction interval in the plot on the right is considerably shorter than the one on the left. By taking into account the effect that \(X\) has on \(Y\), we are able to reduce the uncertainty, or variance, in our prediction. This allows us to make more precise predictions.
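As a rough illustration of this effect, the R sketch below simulates data and compares the width of a 95% prediction interval built around \(\bar y\) with one built from a fitted regression line. The coefficients, noise level, sample size, and the names `sim`, `mod_mean`, and `mod_reg` are arbitrary assumptions made for the example, not values from any data set used in this lesson. It relies on `predict()` with `interval = "prediction"`; how such intervals are constructed is left for the later lesson.

```r
# Simulated data; the coefficients and noise level are illustrative assumptions.
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 1)
sim <- data.frame(x = x, y = y)

# Intercept-only model: its point prediction is the sample mean of y.
mod_mean <- lm(y ~ 1, data = sim)
# Simple linear regression of y on x.
mod_reg <- lm(y ~ x, data = sim)

# 95% prediction intervals for a new observation with x = 5.
new_obs <- data.frame(x = 5)
pi_mean <- predict(mod_mean, newdata = new_obs, interval = "prediction", level = 0.95)
pi_reg  <- predict(mod_reg,  newdata = new_obs, interval = "prediction", level = 0.95)

# Interval widths (upper bound minus lower bound).
width_mean <- unname(pi_mean[1, "upr"] - pi_mean[1, "lwr"])
width_reg  <- unname(pi_reg[1, "upr"] - pi_reg[1, "lwr"])
c(width_mean = width_mean, width_reg = width_reg)
```

Because the simulated relationship between `x` and `y` is strong, the regression-based interval should come out much narrower than the one centered at the sample mean.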

In this lesson, we will discuss a method of measuring the amount of variance reduction we obtain in using a regression model to explain a portion of the variation in \(Y\) through its relationship with \(X\).



SST, SSR, and SSE

We will use a quantity denoted by \(r^2\) to measure the proportion of the variance in our response variable \(Y\) that is explained by the relationship between \(Y\) and a predictor \(X\). Before we define \(r^2\), we first need to introduce some related quantities.

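For a particular observation \(y_i\), notice that the quantity \(d_i = y_i - \bar y\) measures the amount by which the observation deviates from the sample mean. We will decompose this quantity into two pieces:

\(\hspace{30pt} r_i = \hat y_i - \bar y\), the amount by which the fitted value deviates from the sample mean.

\(\hspace{30pt} \hat e_i = y_i - \hat y_i\), the residual, or the amount by which the observation deviates from its fitted value.

Notice that \(d_i = r_i + \hat e_i\).

We now sum the squares of each of these three quantities over all of the points in our sample:

\(\hspace{30pt} SST = \sum d_i^2 = \sum (y_i - \bar y)^2\)

\(\hspace{30pt} SSR = \sum r_i^2 = \sum (\hat y_i - \bar y)^2\)

\(\hspace{30pt} SSE = \sum \hat e_i^2 = \sum (y_i - \hat y_i)^2\)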

SST = SSE + SSR

An important relationship between these three sums is given by the equation \(SST = SSE + SSR\).

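To establish this result, we will need to make use of the following two identities:

\(\hspace{30pt} \sum \hat e_i = 0\)

\(\hspace{30pt} \sum \hat e_i \hat y_i = 0\)

The first identity is one of our two normal equations. The second identity can be derived from the two normal equations, \(\sum \hat e_i = 0\) and \(\sum \hat e_i x_i = 0\), by writing \(\hat y_i = \hat\beta_0 + \hat\beta_1 x_i\), so that \(\sum \hat e_i \hat y_i = \hat\beta_0 \sum \hat e_i + \hat\beta_1 \sum \hat e_i x_i = 0\). Armed with these identities, we may proceed with our proof that \(SST = SSE + SSR\) as follows: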

\(\hspace{30pt} SST = \sum (y_i - \bar{y})^2\)

\(\hspace{30pt} SST = \sum [(\hat e_i + \hat y_i) - \bar{y}]^2\)

\(\hspace{30pt} SST = \sum [\hat e_i + (\hat y_i - \bar{y})]^2\)

\(\hspace{30pt} SST = \sum [\hat e_i^2 + 2\hat e_i (\hat y_i - \bar{y}) + ( \hat y_i - \bar y)^2]\)

\(\hspace{30pt} SST = \sum \hat e_i^2 + 2\sum \hat e_i (\hat y_i - \bar{y}) + \sum ( \hat y_i - \bar y)^2\)

\(\hspace{30pt} SST = SSE + 2 \sum( \hat e_i \hat y_i - \hat e_i\bar{y}) + SSR\)

\(\hspace{30pt} SST = SSE + 2 \sum \hat e_i \hat y_i - 2 \bar{y}\sum\hat e_i + SSR\)

\(\hspace{30pt} SST = SSE + 0 - 0 + SSR\)

\(\hspace{30pt} SST = SSE + SSR\)
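The decomposition can also be checked numerically. The sketch below uses simulated data (the seed, sample size, and coefficients are assumptions chosen only for illustration) and computes the three sums directly from their definitions, \(SST = \sum (y_i - \bar y)^2\), \(SSR = \sum (\hat y_i - \bar y)^2\), and \(SSE = \sum \hat e_i^2\), along with the two sums \(\sum \hat e_i\) and \(\sum \hat e_i \hat y_i\) that vanish in the proof.

```r
# Simulated data; all numeric settings here are illustrative assumptions.
set.seed(42)
x <- rnorm(50, mean = 10, sd = 2)
y <- 1 + 0.5 * x + rnorm(50)
mod_sim <- lm(y ~ x)

y_hat <- fitted(mod_sim)   # fitted values
e_hat <- resid(mod_sim)    # residuals, y - y_hat

SST <- sum((y - mean(y))^2)
SSR <- sum((y_hat - mean(y))^2)
SSE <- sum(e_hat^2)

# Both sums are zero up to floating-point rounding.
sum(e_hat)
sum(e_hat * y_hat)

# The decomposition holds up to floating-point rounding.
all.equal(SST, SSE + SSR)
```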

r-Squared

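Intuitive explanations of the meaning of the quantities \(SST\), \(SSR\), and \(SSE\) are as follows:

\(\hspace{30pt} SST\) measures the total variation of the observed values \(y_i\) around their mean \(\bar y\).

\(\hspace{30pt} SSR\) measures the portion of that variation that is explained by the regression model.

\(\hspace{30pt} SSE\) measures the portion of that variation that is left unexplained and remains in the residuals.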

Ideally, we would like for \(SSE\) to be close to 0 and for \(SSR\) to thus be close to \(SST\). We can measure the proportion of the variance in \(Y\) that is explained by our regression model using the following quantity:

\[r^2 = \frac{SSR}{SST}\]

Note that since \(0 \leq SSR \leq SST\), we get that \(0 \leq r^2 \leq 1\). The quantity \(r^2\) is a diagnostic tool that is commonly used to measure the quality of the fit in a regression model.

Notice that since \(SST = SSR + SSE\), we can rewrite the formula for \(r^2\) as follows:

\[r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}\]

The \(r^2 = 1 - \frac{SSE}{SST}\) formula for \(r^2\) is often the more useful of the two formulas, since we will have other reasons for calculating \(SSE\).
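As a quick check, the sketch below computes \(r^2\) both as \(SSR/SST\) and as \(1 - SSE/SST\) on simulated data and compares the results with the value stored in the `r.squared` component of `summary()`. The data-generating settings and the name `mod_sim` are assumptions made for the example.

```r
# Simulated data; all numeric settings here are illustrative assumptions.
set.seed(7)
x <- runif(80, 0, 5)
y <- 4 - 1.5 * x + rnorm(80, sd = 2)
mod_sim <- lm(y ~ x)

SST <- sum((y - mean(y))^2)
SSE <- sum(resid(mod_sim)^2)
SSR <- sum((fitted(mod_sim) - mean(y))^2)

r2_from_ssr <- SSR / SST                     # explained share of the variation
r2_from_sse <- 1 - SSE / SST                 # one minus the unexplained share
r2_reported <- summary(mod_sim)$r.squared    # value computed by summary()

c(r2_from_ssr = r2_from_ssr, r2_from_sse = r2_from_sse, r2_reported = r2_reported)
```

All three values should agree up to floating-point rounding.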

Residual Standard Error

Consider the hypothetical model \(Y = \beta_0 + \beta_1 X + e\). If we know the value of the variable \(X\), then the only uncertainty in the value of \(Y\) is due to the error term \(e\). We often assume that \(E[e]=0\). In that case, \(Var[e] = \sigma_e^2\) provides a measurement of the amount of error inherent in our model.

Since we do not generally have access to the hypothetical model, we cannot calculate values of \(e\) exactly. However, we can construct a fitted model, \(Y = \hat\beta_0 + \hat\beta_1 X + \hat e\), and then use \(\hat e\) as an approximation of \(e\). As such, we should be able to use the observed residuals \(\hat e_i\) to find an approximation of \(\sigma_e^2\).

It can be shown that the following expression provides an unbiased estimate of \(\sigma_e^2\):

\[s^2 = \frac{SSE}{n-2} = \frac{1}{n-2} \sum_{i=1}^n \hat e_i^2\]

The square root of this quantity is called the residual standard error (RSE). It approximates the standard deviation of the error term \(e\).

\[s = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{1}{n-2} \sum_{i=1}^n \hat e_i^2}\]
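The following sketch illustrates both points on simulated data, under assumed values for the true coefficients and the true error standard deviation (`sigma_e`, a name introduced only for this example). It first computes \(s\) by hand and compares it with the `sigma` component of `summary()`, and then averages \(SSE/(n-2)\) and \(SSE/n\) over many simulated samples to suggest why the divisor \(n-2\) is used. A simulation like this is only suggestive; it is not a proof of unbiasedness.

```r
set.seed(3)
n <- 20
sigma_e <- 2            # true error SD, assumed for the simulation
x <- runif(n, 0, 10)    # predictor values, held fixed across replications

# One data set: s computed by hand matches the value reported by summary().
y <- 1 + 0.5 * x + rnorm(n, sd = sigma_e)
mod_sim <- lm(y ~ x)
SSE <- sum(resid(mod_sim)^2)
s <- sqrt(SSE / (n - 2))
c(by_hand = s, reported = summary(mod_sim)$sigma)

# Many data sets: record SSE for each simulated sample.
sse_reps <- replicate(2000, {
  y_rep <- 1 + 0.5 * x + rnorm(n, sd = sigma_e)
  sum(resid(lm(y_rep ~ x))^2)
})

# Average of each candidate estimator, next to the true error variance.
c(mean_divide_by_n_minus_2 = mean(sse_reps / (n - 2)),
  mean_divide_by_n         = mean(sse_reps / n),
  true_variance            = sigma_e^2)
```

The average of \(SSE/(n-2)\) should land close to the true variance, while the average of \(SSE/n\) should fall below it.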

Using R to Calculate r-Squared and RSE

Let’s return to the Pearson data set. In the code chunk below, we load the data, create an OLS regression model, and then print a summary of that model.

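# Load the tab-separated father/son height data.
myData <- read.table("father_son.txt", sep="\t", header=TRUE)
# Fit an OLS regression of sheight on fheight.
mod <- lm(sheight ~ fheight, data = myData)
# Print the coefficient table, RSE, and r-squared.
summary(mod)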

Call:
lm(formula = sheight ~ fheight, data = myData)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8772 -1.5144 -0.0079  1.6285  8.9685 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 33.88660    1.83235   18.49   <2e-16 ***
fheight      0.51409    0.02705   19.01   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.437 on 1076 degrees of freedom
Multiple R-squared:  0.2513,    Adjusted R-squared:  0.2506 
F-statistic: 361.2 on 1 and 1076 DF,  p-value: < 2.2e-16

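We can see from the summary that:

\(\hspace{30pt} r^2 = 0.2513\), so roughly 25% of the variance in sheight is explained by its linear relationship with fheight.

\(\hspace{30pt} s = 2.437\), the residual standard error, computed on \(n - 2 = 1076\) degrees of freedom.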
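The printed values can also be combined with the formulas from the earlier sections to recover the sums of squares for this model. The sketch below works only from the rounded numbers shown in the summary output (2.437, 1076, and 0.2513), so the results are approximate; the variable names are chosen for this example.

```r
# Values copied from the printed summary (rounded as displayed).
s  <- 2.437     # residual standard error
df <- 1076      # residual degrees of freedom, n - 2
r2 <- 0.2513    # multiple R-squared

n   <- df + 2             # sample size
SSE <- s^2 * df           # since s^2 = SSE / (n - 2)
SST <- SSE / (1 - r2)     # since r^2 = 1 - SSE / SST
SSR <- SST - SSE          # since SST = SSE + SSR

round(c(n = n, SSE = SSE, SSR = SSR, SST = SST), 1)
```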

r-Squared and Correlation

The value \(r^2\) is related to the sample correlation \(\rho_{X,Y} = \mathrm{corr}[X,Y]\). In fact, it can be shown that:

\[r^2 = \left(\mathrm{corr}[X,Y]\right)^2\]

To establish this result, we first need to derive an alternate form for the expression \(SSR\). Notice that:

\(\hspace{30pt} SSR = \sum\limits_{i=1}^n (\hat y_i - \bar y)^2\)

\(\hspace{30pt} SSR = \sum\limits_{i=1}^n \left[\left(\hat \beta_0 + \hat \beta_1 x_i \right) - \bar y\right]^2\)

\(\hspace{30pt} SSR = \sum\limits_{i=1}^n \left[\left(\bar y - \hat\beta_1\bar x \right) + \hat \beta_1 x_i - \bar y\right]^2\)

\(\hspace{30pt} SSR = \sum\limits_{i=1}^n \left( - \hat\beta_1\bar x + \hat \beta_1 x_i\right)^2\)

\(\hspace{30pt} SSR = \hat\beta_1^2 \sum\limits_{i=1}^n \left( x_i - \bar x\right)^2\)

\(\hspace{30pt} SSR = \hat\beta_1^2 SXX\)

\(\hspace{30pt} SSR = \left(\frac{SXY}{SXX} \right)^2 SXX\)

\(\hspace{30pt} SSR = \frac{\left(SXY \right)^2}{SXX}\)

Now, recall that \(r^2 = \frac{SSR}{SST}\). We will substitute the expression above in for \(SSR\), and then simplify.

\(\hspace{30pt} r^2 = \frac{SSR}{SST}\)

\(\hspace{30pt} r^2 = \frac{\left(SXY \right)^2}{SXX} \frac{1}{SST}\)

\(\hspace{30pt} r^2 = \frac{\left(SXY \right)^2}{SXX \cdot SST}\)

\(\hspace{30pt} r^2 = \frac{\left(\sum\limits_{i=1}^n (x_i - \bar x)(y_i - \bar y) \right)^2}{\sum\limits_{i=1}^n (x_i - \bar x)^2 \cdot \sum\limits_{i=1}^n (y_i - \bar y)^2}\)

\(\hspace{30pt} r^2 = \left(\frac{\sum\limits_{i=1}^n (x_i - \bar x)(y_i - \bar y) }{\sqrt{\sum\limits_{i=1}^n (x_i - \bar x)^2} \cdot \sqrt{\sum\limits_{i=1}^n (y_i - \bar y)^2}} \right)^2\)

\(\hspace{30pt} r^2 = \left( \frac{\mathrm{cov}[X,Y]}{s_X s_Y} \right)^2\)

\(\hspace{30pt} r^2 = \left( \mathrm{corr}[X,Y] \right)^2\)

This completes our proof.
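The key steps of this argument can be verified numerically. The sketch below, run on simulated data with assumed settings, checks that the fitted slope equals \(SXY/SXX\), that \(SSR = (SXY)^2/SXX\), and that \(SSR/SST\) equals the squared sample correlation returned by `cor()`.

```r
# Simulated data; all numeric settings here are illustrative assumptions.
set.seed(11)
x <- rnorm(40, mean = 50, sd = 8)
y <- 20 + 0.7 * x + rnorm(40, sd = 6)
mod_sim <- lm(y ~ x)

SXX <- sum((x - mean(x))^2)
SXY <- sum((x - mean(x)) * (y - mean(y)))
SST <- sum((y - mean(y))^2)
SSR <- sum((fitted(mod_sim) - mean(y))^2)

# Fitted slope equals SXY / SXX.
all.equal(unname(coef(mod_sim)["x"]), SXY / SXX)

# Alternate form of SSR derived above.
all.equal(SSR, SXY^2 / SXX)

# r-squared equals the squared sample correlation.
all.equal(SSR / SST, cor(x, y)^2)
```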

Correlation Between \(Y\) and \(\hat Y\)

It can also be shown that the correlation between the fitted value \(\hat Y\) and the response \(Y\) has the same magnitude as that between the predictor \(X\) and the response \(Y\), and that the two agree exactly whenever the fitted slope \(\hat\beta_1\) is positive. In other words:

\[\mathrm{corr}\left[\hat Y,Y \right] = \left|\mathrm{corr}\left[X,Y\right]\right|\]

The proof of this fact is left as an exercise.
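Before attempting the proof, it can help to look at the two correlations numerically. The sketch below fits one model with a positive slope and one with a negative slope, using simulated data with assumed settings. The correlations coincide when the fitted slope is positive; when it is negative, \(\hat Y\) is a decreasing function of \(X\), so \(\mathrm{corr}[\hat Y, Y]\) stays positive while \(\mathrm{corr}[X, Y]\) is negative.

```r
# Simulated data; all numeric settings here are illustrative assumptions.
set.seed(5)
x <- rnorm(100)
y_up   <- 1 + 2 * x + rnorm(100)   # increasing relationship
y_down <- 1 - 2 * x + rnorm(100)   # decreasing relationship

fit_up   <- lm(y_up ~ x)
fit_down <- lm(y_down ~ x)

# Positive slope: the two correlations are equal.
c(fitted_vs_y = cor(fitted(fit_up), y_up), x_vs_y = cor(x, y_up))

# Negative slope: same magnitude, opposite sign.
c(fitted_vs_y = cor(fitted(fit_down), y_down), x_vs_y = cor(x, y_down))
```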

