Introduction to Regression Analysis

This lecture gives a general overview of regression analysis by walking through a typical regression problem: studying the relationship between the price and mileage of used cars.

Assume that we gather a sample of 20 recently sold used 2016 Ford Fictus automobiles. For each vehicle, we record the sales price and the mileage.

We will store the prices in an R vector called price. The prices are stated in thousands of dollars.

price <- c(53.7, 56.8, 58.5, 42.0, 48.9, 33.2, 22.2, 32.6, 30.3, 19.8, 
           26.1, 24.9, 18.1, 11.7, 13.3, 23.4, 13.2, 13.6, 14.8, 4.6)
hist(price, col='salmon')

Descriptive Statistics

Before looking at the mileages of the cars in the sample, let’s study the prices a bit more. We will begin by calculating the mean and standard deviation of the price of the cars in this sample.

xbar <- mean(price)
s <- sd(price)
stats <- c(xbar, s)
names(stats) <- c('mean', 'stdev')
stats
    mean    stdev 
28.08500 16.17273 

Constructing a (Naive) Prediction Interval

Assume that we wish to construct an interval that we believe will contain the prices of 95% of all used 2016 Fictuses. Knowing that 95% of all observations of a normally distributed variable fall within 1.96 standard deviations of the mean, we might (naively) construct our interval by adding and subtracting 1.96 times the sample standard deviation to and from the sample mean.

lower <- xbar - 1.96*s
upper <- xbar + 1.96*s
interval <- c(lower, upper)
names(interval) <- c('lower', 'upper')
interval
    lower     upper 
-3.613547 59.783547 

Notice that this interval is very wide, and that its lower endpoint is negative, which is impossible for a price. If we were trying to predict the price of a particular 2016 Fictus without any additional information about the car, we would not be able to provide a very precise estimate. In other words, there is a lot of variability in the sales prices of this model of vehicle.

If we had some additional information about the car whose price we were trying to predict, then perhaps we could offer a better estimate.

Relationship between Price and Mileage

Assume that in addition to recording the prices of the cars in our sample, we also recorded the mileages (in thousands of miles). We will store these mileages in a vector called mileage.

mileage <- c( 3.1,  4.1,  5.3,  7.1, 19.5, 28.3, 36.8, 37.2, 42.3,  52.3, 
             53.3, 53.4, 63.2, 68.4, 82.3, 83.9, 88.4, 97.6, 99.7, 105.9)
hist(mileage, col='cornflowerblue')

To study the relationship between price and mileage of this automobile model, we might create a scatterplot from the paired observations in our sample.

plot(price ~ mileage, pch=21, bg='orange', col='black', cex=1.5,
     xlab = 'Mileage (in 1000s of Miles)', ylab = 'Price (in 1000s of Dollars)', 
     main = 'Relationship between Mileage and Price')

We see from this plot that the cars with greater mileage tend to have a lower sales price (as you might expect). In fact, it appears that the relationship between the price and mileage of the vehicles might be roughly linear.

We will use the lm function in R to find a linear model that attempts to capture the relationship between price and mileage. We will store the resulting model in a variable called model and will then use the summary function to get some information about the model.

model <- lm(price ~ mileage)
summary(model)

Call:
lm(formula = price ~ mileage)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.329  -4.961  -1.335   5.862  10.259 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 50.54834    2.77369  18.224 4.76e-13 ***
mileage     -0.43529    0.04523  -9.625 1.60e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.702 on 18 degrees of freedom
Multiple R-squared:  0.8373,    Adjusted R-squared:  0.8283 
F-statistic: 92.63 on 1 and 18 DF,  p-value: 1.604e-08

There is a lot of information in this summary. We will eventually learn how to interpret all of the information presented here. For now, let’s focus on one piece of information: the coefficient estimates.

The coefficient estimates are shown in the summary above, but we can access them directly as follows:

model$coefficients
(Intercept)     mileage 
 50.5483433  -0.4352939 

These are the coefficients that determine the slope and intercept of our linear model. This tells us that our model has the following form:

Predicted Price = 50.55 - 0.4353 · Mileage


This relationship between price and mileage can also be represented as follows:

Actual Price = 50.55 - 0.4353 · Mileage + Unexplained Error


Let’s add our regression line to the scatter plot of price and mileage.

plot(price ~ mileage, pch=21, bg='orange', col='black', cex=1.5,
     xlab = 'Mileage (in 1000s of Miles)', ylab = 'Price (in 1000s of Dollars)', 
     main = 'Relationship between Mileage and Price')
abline(model$coefficients, col='cadetblue', lwd=2)

Fitted Values

The fitted value for any particular observation in our sample is the price that the model predicts for the car, given the mileage of that car. This is obtained by plugging the mileage into the equation:
Predicted Price = 50.55 - 0.4353 · Mileage


The fitted values for the cars in our sample are stored within the model variable:

round(model$fitted.values,1)
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20 
49.2 48.8 48.2 47.5 42.1 38.2 34.5 34.4 32.1 27.8 27.3 27.3 23.0 20.8 14.7 14.0 12.1  8.1  7.1  4.5 

To illustrate this idea, we will add the fitted values to our scatterplot.

Residuals

The residuals are the errors in the predicted prices. The residual for a particular observation is given by the equation:
Residual = Actual Price - Predicted Price


Residuals reflect the uncertainty remaining in our model.

Our model object model contains the residuals for the observations in our sample.

res <- model$residuals
round(res,1)
    1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17 
  4.5   8.0  10.3  -5.5   6.8  -5.0 -12.3  -1.8  -1.8  -8.0  -1.2  -2.4  -4.9  -9.1  -1.4   9.4   1.1 
   18    19    20 
  5.5   7.7   0.1 

Since the residuals represent the uncertainty in our predictions, we would like to get a sense of how they are distributed. To that end, we generate a histogram of the residuals.

hist(res, col='orchid')

It seems reasonable to assume that the residuals might be normally distributed with a mean of zero. Let’s calculate their sample mean and standard deviation.

res_mean <- mean(res)
res_sd <- sd(res)
res_stats <- c(res_mean, res_sd)
names(res_stats) <- c('mean', 'stdev')
round(res_stats,4)
  mean  stdev 
0.0000 6.5235 

Using the Model to Make Predictions

Assume that we are interested in purchasing a used 2016 Ford Fictus with 45,000 miles. We would like to use our model to determine a fair price for the car. We could calculate this by plugging 45 into the equation for our model. That gives:

Predicted Price = 50.55 - 0.4353 · 45 = 30.96


In other words, our model predicts that such a vehicle should cost (on average) $30,960.

We can use the R function predict to calculate this predicted value.

newdata <- data.frame(mileage = 45)
predict(model, newdata)
       1 
30.96012 

We know that the predictions made by the model are not 100% accurate. We expect there to be some error in the predictions. To better understand how much the actual price of a car with 45,000 miles might vary, we will use predict to create an interval that we are 95% certain contains the true price of the car. This interval is called a prediction interval.

predict(model, newdata, interval = 'prediction', level=0.95)
       fit      lwr      upr
1 30.96012 16.51791 45.40232
---
title: "Lesson 01 - Introduction to Regression Analysis"
author: "Robbie Beane"
output:
  html_notebook:
    theme: flatly
    toc: true
    toc_depth: 2
---


# Introduction to Regression Analysis

This lecture gives a general overview of regression analysis by walking through a typical regression problem: studying the relationship between the price and mileage of used cars.

Assume that we gather a sample of 20 recently sold used 2016 Ford Fictus automobiles. For each vehicle, we record the sales price and the mileage.

We will store the prices in an R vector called `price`. The prices are stated in thousands of dollars. 


```{r}
price <- c(53.7, 56.8, 58.5, 42.0, 48.9, 33.2, 22.2, 32.6, 30.3, 19.8, 
           26.1, 24.9, 18.1, 11.7, 13.3, 23.4, 13.2, 13.6, 14.8, 4.6)

hist(price, col='salmon')
```

# Descriptive Statistics

Before looking at the mileages of the cars in the sample, let's study the prices a bit more. We will begin by calculating the mean and standard deviation of the price of the cars in this sample. 


```{r}
xbar <- mean(price)
s <- sd(price)

stats <- c(xbar, s)
names(stats) <- c('mean', 'stdev')

stats
```

# Constructing a (Naive) Prediction Interval

Assume that we wish to construct an interval that we believe will contain the prices of 95% of all used 2016 Fictuses. Knowing that 95% of all observations of a normally distributed variable fall within 1.96 standard deviations of the mean, we might (naively) construct our interval by adding and subtracting 1.96 times the sample standard deviation to and from the sample mean.


```{r}
lower <- xbar - 1.96*s
upper <- xbar + 1.96*s

interval <- c(lower, upper)
names(interval) <- c('lower', 'upper')

interval
```

Notice that this interval is very wide, and that its lower endpoint is negative, which is impossible for a price. If we were trying to predict the price of a particular 2016 Fictus without any additional information about the car, we would not be able to provide a very precise estimate. In other words, there is a lot of variability in the sales prices of this model of vehicle.
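
As a small aside, the 1.96 multiplier used above does not need to be memorized; it is the 97.5th percentile of the standard normal distribution, which R exposes through `qnorm`. The sketch below rebuilds the naive interval from that percentile and then checks what share of the sample prices it covers. The names `z` and `naive` are introduced here only for illustration.

```{r}
# Same prices as above (thousands of dollars), repeated so this chunk runs on its own
price <- c(53.7, 56.8, 58.5, 42.0, 48.9, 33.2, 22.2, 32.6, 30.3, 19.8,
           26.1, 24.9, 18.1, 11.7, 13.3, 23.4, 13.2, 13.6, 14.8, 4.6)

# 97.5th percentile of the standard normal: the source of the 1.96 multiplier
z <- qnorm(0.975)

# Naive interval: sample mean plus or minus z sample standard deviations
naive <- mean(price) + c(-1, 1) * z * sd(price)
names(naive) <- c('lower', 'upper')
round(naive, 2)

# Share of the sample prices that fall inside the naive interval
mean(price >= naive[['lower']] & price <= naive[['upper']])
```

Because `qnorm(0.975)` is slightly smaller than 1.96, the endpoints can differ from the ones above in the third decimal place.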

If we had some additional information about the car whose price we were trying to predict, then perhaps we could offer a better estimate. 


# Relationship between Price and Mileage

Assume that in addition to recording the prices of the cars in our sample, we also recorded the mileages (in thousands of miles). We will store these mileages in a vector called `mileage`. 

```{r}
mileage <- c( 3.1,  4.1,  5.3,  7.1, 19.5, 28.3, 36.8, 37.2, 42.3,  52.3, 
             53.3, 53.4, 63.2, 68.4, 82.3, 83.9, 88.4, 97.6, 99.7, 105.9)

hist(mileage, col='cornflowerblue')
```

To study the relationship between price and mileage of this automobile model, we might create a scatterplot from the paired observations in our sample.

```{r}
plot(price ~ mileage, pch=21, bg='orange', col='black', cex=1.5,
     xlab = 'Mileage (in 1000s of Miles)', ylab = 'Price (in 1000s of Dollars)', 
     main = 'Relationship between Mileage and Price')
```


We see from this plot that the cars with greater mileage tend to have a lower sales price (as you might expect). In fact, it appears that the relationship between the price and mileage of the vehicles might be roughly linear. 
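
One way to put a number on the "roughly linear" impression is the sample correlation between the two variables. This is only a quick supplementary check, not a step from the original analysis, and the name `r` is introduced here for illustration.

```{r}
# Same data as above, repeated so this chunk runs on its own
price <- c(53.7, 56.8, 58.5, 42.0, 48.9, 33.2, 22.2, 32.6, 30.3, 19.8,
           26.1, 24.9, 18.1, 11.7, 13.3, 23.4, 13.2, 13.6, 14.8, 4.6)
mileage <- c( 3.1,  4.1,  5.3,  7.1, 19.5, 28.3, 36.8, 37.2, 42.3,  52.3,
             53.3, 53.4, 63.2, 68.4, 82.3, 83.9, 88.4, 97.6, 99.7, 105.9)

# Pearson correlation: values near -1 indicate a strong decreasing linear trend
r <- cor(mileage, price)
round(r, 3)
```

A value close to -1 (here roughly -0.92) is consistent with the downward, approximately straight-line pattern in the scatterplot.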

We will use the `lm` function in R to find a linear model that attempts to capture the relationship between price and mileage. We will store the resulting model in a variable called `model` and will then use the `summary` function to get some information about the model. 


```{r}
model <- lm(price ~ mileage)

summary(model)
```

There is a lot of information in this summary. We will eventually learn how to interpret all of the information presented here. For now, let's focus on one piece of information: the coefficient estimates. 

The coefficient estimates are shown in the summary above, but we can access them directly as follows:

```{r}
model$coefficients
```

These are the coefficients that determine the slope and intercept of our linear model. This tells us that our model has the following form:

<center>
**Predicted Price = 50.55 - 0.4353 · Mileage**
</center>
<br>
This relationship between price and mileage can also be represented as follows:

<center>
**Actual Price = 50.55 - 0.4353 · Mileage + Unexplained Error**
</center>
<br>
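
To see where these two numbers come from, it may help to compute them by hand. For a model with a single predictor, the least-squares slope equals the sample covariance of mileage and price divided by the sample variance of mileage, and the intercept is chosen so that the line passes through the point of means. The sketch below checks this against `lm`; the names `slope_hand` and `intercept_hand` are introduced here for illustration.

```{r}
# Same data as above, repeated so this chunk runs on its own
price <- c(53.7, 56.8, 58.5, 42.0, 48.9, 33.2, 22.2, 32.6, 30.3, 19.8,
           26.1, 24.9, 18.1, 11.7, 13.3, 23.4, 13.2, 13.6, 14.8, 4.6)
mileage <- c( 3.1,  4.1,  5.3,  7.1, 19.5, 28.3, 36.8, 37.2, 42.3,  52.3,
             53.3, 53.4, 63.2, 68.4, 82.3, 83.9, 88.4, 97.6, 99.7, 105.9)

# Least-squares slope: covariance of (mileage, price) over variance of mileage
slope_hand <- cov(mileage, price) / var(mileage)

# Intercept: forces the line through (mean(mileage), mean(price))
intercept_hand <- mean(price) - slope_hand * mean(mileage)

c(intercept = intercept_hand, slope = slope_hand)

# Compare with the coefficients reported by lm
all.equal(unname(coef(lm(price ~ mileage))), c(intercept_hand, slope_hand))
```
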
Let's add our regression line to the scatter plot of price and mileage. 


```{r}
plot(price ~ mileage, pch=21, bg='orange', col='black', cex=1.5,
     xlab = 'Mileage (in 1000s of Miles)', ylab = 'Price (in 1000s of Dollars)', 
     main = 'Relationship between Mileage and Price')
abline(model$coefficients, col='cadetblue', lwd=2)
```

# Fitted Values

The fitted value for any particular observation in our sample is the price that the model predicts for the car, given the mileage of that car. This is obtained by plugging the mileage into the equation:
<center>
**Predicted Price = 50.55 - 0.4353 · Mileage**
</center>
<br>
The fitted values for the cars in our sample are stored within the `model` variable:

```{r}
round(model$fitted.values,1)
```
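
The fitted values are nothing more than the model equation applied to each mileage in the sample. A minimal sketch of that calculation follows; it refits the model under the name `fit` so that the chunk runs on its own, and `b` and `fitted_hand` are likewise names introduced only for this illustration.

```{r}
# Same data as above, repeated so this chunk runs on its own
price <- c(53.7, 56.8, 58.5, 42.0, 48.9, 33.2, 22.2, 32.6, 30.3, 19.8,
           26.1, 24.9, 18.1, 11.7, 13.3, 23.4, 13.2, 13.6, 14.8, 4.6)
mileage <- c( 3.1,  4.1,  5.3,  7.1, 19.5, 28.3, 36.8, 37.2, 42.3,  52.3,
             53.3, 53.4, 63.2, 68.4, 82.3, 83.9, 88.4, 97.6, 99.7, 105.9)

fit <- lm(price ~ mileage)
b <- coef(fit)

# Plug every mileage into: Predicted Price = intercept + slope * Mileage
fitted_hand <- b[['(Intercept)']] + b[['mileage']] * mileage

# The hand-computed values agree with the ones stored in the model object
all.equal(unname(fitted(fit)), fitted_hand)
round(fitted_hand[1:3], 1)
```

Using the unrounded coefficients from `coef` avoids the small discrepancies that would come from plugging in 50.55 and 0.4353.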

To illustrate this idea, we will add the fitted values to our scatterplot. 

```{r, echo=FALSE}
plot(price ~ mileage, pch=21, bg='orange', col='black', cex=1.5,
     xlab = 'Mileage (in 1000s of Miles)', ylab = 'Price (in 1000s of Dollars)', 
     main = 'Relationship between Mileage and Price')

abline(model$coefficients, col='cadetblue', lwd=2)

points(mileage, model$fitted.values, pch=18, cex=1.5, col='darkgreen')
points(price ~ mileage, pch=21, bg='orange', col='black', cex=1.5)


```


# Residuals

The residuals are the errors in the predicted prices. The residual for a particular observation is given by the equation:
<center>
**Residual = Actual Price - Predicted Price**
</center>
<br>

Residuals reflect the uncertainty remaining in our model. 


```{r, echo=FALSE}
plot(price ~ mileage, pch=21, bg='orange', col='black', cex=1.5,
     xlab = 'Mileage (in 1000s of Miles)', ylab = 'Price (in 1000s of Dollars)', 
     main = 'Relationship between Mileage and Price')

segments(mileage, model$fitted.values, mileage, price, col='red', lwd=2)
points(price ~ mileage, pch=21, bg='orange', col='black', cex=1.5)

abline(model$coefficients, col='cadetblue', lwd=2)
```

Our model object `model` contains the residuals for the observations in our sample. 


```{r}
res <- model$residuals

round(res,1)
```
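
The stored residuals can be reproduced directly from the definition, actual price minus predicted price. The sketch below does so and then picks out the car that the model misses by the widest margin. The names `fit`, `res_hand`, and `worst` are introduced here for illustration.

```{r}
# Same data as above, repeated so this chunk runs on its own
price <- c(53.7, 56.8, 58.5, 42.0, 48.9, 33.2, 22.2, 32.6, 30.3, 19.8,
           26.1, 24.9, 18.1, 11.7, 13.3, 23.4, 13.2, 13.6, 14.8, 4.6)
mileage <- c( 3.1,  4.1,  5.3,  7.1, 19.5, 28.3, 36.8, 37.2, 42.3,  52.3,
             53.3, 53.4, 63.2, 68.4, 82.3, 83.9, 88.4, 97.6, 99.7, 105.9)

fit <- lm(price ~ mileage)

# Residual = Actual Price - Predicted Price
res_hand <- price - unname(fitted(fit))
all.equal(res_hand, unname(residuals(fit)))

# Index of the observation with the largest absolute residual
worst <- which.max(abs(res_hand))
c(observation = worst, mileage = mileage[worst],
  residual = round(res_hand[worst], 1))
```

In this sample that is the seventh car, which sold for roughly 12.3 thousand dollars less than the model predicted.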

Since the residuals represent the uncertainty in our predictions, we would like to get a sense of how they are distributed. To that end, we generate a histogram of the residuals.


```{r}
hist(res, col='orchid')
```


It seems reasonable to assume that the residuals might be normally distributed with a mean of zero. Let's calculate their sample mean and standard deviation.


```{r}
res_mean <- mean(res)
res_sd <- sd(res)

res_stats <- c(res_mean, res_sd)
names(res_stats) <- c('mean', 'stdev')

round(res_stats,4)
```
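
Two details of these numbers are worth a closer look. First, the mean of exactly zero is not a coincidence: whenever the model includes an intercept, least-squares residuals sum to zero. Second, the standard deviation computed here is slightly smaller than the "Residual standard error" reported by `summary(model)`, because `sd` divides by n - 1 while the residual standard error divides by n - 2 (one degree of freedom for each estimated coefficient). The sketch below illustrates both points; `fit`, `res_hand`, `sd_res`, and `rse` are names introduced for this example.

```{r}
# Same data as above, repeated so this chunk runs on its own
price <- c(53.7, 56.8, 58.5, 42.0, 48.9, 33.2, 22.2, 32.6, 30.3, 19.8,
           26.1, 24.9, 18.1, 11.7, 13.3, 23.4, 13.2, 13.6, 14.8, 4.6)
mileage <- c( 3.1,  4.1,  5.3,  7.1, 19.5, 28.3, 36.8, 37.2, 42.3,  52.3,
             53.3, 53.4, 63.2, 68.4, 82.3, 83.9, 88.4, 97.6, 99.7, 105.9)

fit <- lm(price ~ mileage)
res_hand <- unname(residuals(fit))
n <- length(res_hand)

# Residuals sum to zero (up to floating-point rounding)
sum(res_hand)

# sd() divides the sum of squares by n - 1 ...
sd_res <- sd(res_hand)

# ... while the residual standard error divides by n - 2
rse <- sqrt(sum(res_hand^2) / (n - 2))

# sigma() extracts the residual standard error from the fitted model
round(c(sd = sd_res, rse = rse, sigma = sigma(fit)), 4)
```
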

# Using the Model to Make Predictions

Assume that we are interested in purchasing a used 2016 Ford Fictus with 45,000 miles. We would like to use our model to determine a fair price for the car. We could calculate this by plugging 45 into the equation for our model. That gives:

<center>
**Predicted Price = 50.55 - 0.4353 · 45 = 30.96**
</center>
<br>
In other words, our model predicts that such a vehicle should cost (on average) $30,960. 

We can use the R function `predict` to calculate this predicted value. 

```{r}
newdata <- data.frame(mileage = 45)
predict(model, newdata)
```

We know that the predictions made by the model are not 100% accurate. We expect there to be some error in the predictions. To better understand how much the actual price of a car with 45,000 miles might vary, we will use `predict` to create an interval that we are 95% certain contains the true price of the car. This interval is called a **prediction interval**. 

```{r}
predict(model, newdata, interval = 'prediction', level=0.95)
```
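
For readers curious about what `predict` is doing, the interval can be reproduced with the standard textbook formula for a prediction interval in simple linear regression: the fitted value plus or minus a t critical value (with n - 2 degrees of freedom) times a standard error that combines the residual variation with the uncertainty in the estimated line. This is a sketch of that formula, not a description of R's internal code, and every name in it (`fit`, `x0`, `y0`, `s_e`, `sxx`, `se_pred`, `t_crit`, `pi_hand`) is introduced here for illustration.

```{r}
# Same data as above, repeated so this chunk runs on its own
price <- c(53.7, 56.8, 58.5, 42.0, 48.9, 33.2, 22.2, 32.6, 30.3, 19.8,
           26.1, 24.9, 18.1, 11.7, 13.3, 23.4, 13.2, 13.6, 14.8, 4.6)
mileage <- c( 3.1,  4.1,  5.3,  7.1, 19.5, 28.3, 36.8, 37.2, 42.3,  52.3,
             53.3, 53.4, 63.2, 68.4, 82.3, 83.9, 88.4, 97.6, 99.7, 105.9)

fit <- lm(price ~ mileage)
n <- length(price)
x0 <- 45                                   # mileage of the new car (thousands)

y0 <- sum(coef(fit) * c(1, x0))            # point prediction
s_e <- sigma(fit)                          # residual standard error
sxx <- sum((mileage - mean(mileage))^2)    # spread of the observed mileages

# Standard error for predicting a single new observation at x0
se_pred <- s_e * sqrt(1 + 1/n + (x0 - mean(mileage))^2 / sxx)

# Two-sided 95% critical value from the t distribution with n - 2 df
t_crit <- qt(0.975, df = n - 2)

pi_hand <- c(fit = y0, lwr = y0 - t_crit * se_pred, upr = y0 + t_crit * se_pred)
round(pi_hand, 5)
```

The resulting interval spans about 29 thousand dollars, compared with more than 63 thousand for the naive interval built from the prices alone. Knowing the mileage cuts the width of the interval by more than half.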


