We begin by reading in the data and looking at a summary of it.
df <- read.table("ex01.txt", sep="\t", header=TRUE)
summary(df)
       x               y          
 Min.   :2.124   Min.   :  8.333  
 1st Qu.:3.795   1st Qu.: 31.661  
 Median :5.436   Median : 78.320  
 Mean   :5.358   Mean   :133.155  
 3rd Qu.:6.944   3rd Qu.:199.158  
 Max.   :7.979   Max.   :624.952  
Create a scatterplot.
plot(y ~ x, df, pch=21, col="black", bg="cyan")
We will begin our analysis by considering a linear model of the form: \(Y = \beta_0 + \beta_1 X + \varepsilon\)
m1 <- lm(y ~ x, df)
summary(m1)
Call:
lm(formula = y ~ x, data = df)
Residuals:
     Min       1Q   Median       3Q      Max 
 -149.97   -50.98   -10.01    33.54   318.51 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -221.754     26.707  -8.303  5.7e-13 ***
x             66.240      4.746  13.958  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 81.69 on 98 degrees of freedom
Multiple R-squared: 0.6653, Adjusted R-squared: 0.6619
F-statistic: 194.8 on 1 and 98 DF, p-value: < 2.2e-16
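As a quick sanity check on how `summary()` derives its columns: the reported t value is just the estimate divided by its standard error. A minimal arithmetic check using the slope row from the table above:

```r
# t value = Estimate / Std. Error, using the slope row reported above
b1  <- 66.240   # slope estimate
se1 <- 4.746    # its standard error
t1  <- b1 / se1
round(t1, 2)    # close to the reported 13.958 (summary() uses unrounded values)
```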
Plot the fitted line.
plot(y ~ x, df, pch=21, col="black", bg="cyan")
abline(m1, col="darkred", lwd=2)
We now create a residual plot.
plot(m1$residuals ~ df$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)
The residuals exhibit a trend, as well as heteroskedasticity (unequal variance).
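One rough numeric check for unequal variance is to compare the residual spread in the lower and upper halves of the x range. Since `ex01.txt` is not reproduced here, the sketch below simulates data with a similar shape; the parameter values are illustrative assumptions, not estimates from the actual data.

```r
# Informal spread comparison (illustrative sketch on simulated data)
set.seed(1)
x <- runif(100, 2, 8)
y <- exp(1 + 0.62 * x + rnorm(100, sd = 0.35))  # multiplicative noise grows with x
m <- lm(y ~ x)
r <- residuals(m)
spread_lo <- sd(r[x <  median(x)])  # residual spread, lower half of x
spread_hi <- sd(r[x >= median(x)])  # residual spread, upper half of x
spread_hi / spread_lo               # a ratio well above 1 suggests heteroskedasticity
```

A formal alternative would be a test such as Breusch-Pagan, but the ratio above conveys the same visual impression numerically.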
We now assess the normality of the residuals with a histogram and a QQ-plot.
res1 <- m1$residuals
par(mfrow=c(1,2))
hist(res1, col='orchid')
qqnorm(res1)
qqline(res1)
par(mfrow=c(1,1))
These plots indicate that the residuals were drawn from a right-skewed distribution.
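One way to quantify that impression is the sample skewness, which is positive for right-skewed data and near zero for symmetric data. A small helper, demonstrated here on simulated samples rather than on `res1` itself:

```r
# Sample skewness: positive for right-skewed data, near 0 for symmetric data
skewness <- function(z) mean((z - mean(z))^3) / sd(z)^3

set.seed(1)
right_skewed <- rexp(1000) - 1   # mean-zero, right-skewed sample
skewness(right_skewed)           # clearly positive
skewness(rnorm(1000))            # near zero
```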
We will now consider a model of the form \(Y = \beta_0 + \beta_1 e^X + \varepsilon\).
m2 <- lm(y ~ exp(x), df)
summary(m2)
Call:
lm(formula = y ~ exp(x), data = df)
Residuals:
     Min       1Q   Median       3Q      Max 
-173.688  -22.301   -5.473   21.888  202.380 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 32.869993   7.814467   4.206 5.75e-05 ***
exp(x)       0.153758   0.007461  20.607  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 61.14 on 98 degrees of freedom
Multiple R-squared: 0.8125, Adjusted R-squared: 0.8106
F-statistic: 424.7 on 1 and 98 DF, p-value: < 2.2e-16
We plot the fitted curve.
plot(y ~ x, df, pch=21, col="black", bg="cyan")
ptn <- seq(from=2, to=8, by=0.1)  # grid of x values for plotting the curve
lines(ptn, predict(m2, data.frame(x=ptn)), col="darkred", lwd=2)
We generate a residual plot.
plot(m2$residuals ~ df$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)
There is still a slight trend. The heteroskedasticity is also still present.
We will add prediction intervals to the scatterplot.
plot(y ~ x, df, pch=21, col="black", bg="cyan")
p2 <- predict(m2, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, p2[,"fit"], col="darkred", lwd=2)
lines(ptn, p2[,"lwr"], col="darkorange", lwd=2)
lines(ptn, p2[,"upr"], col="darkorange", lwd=2)
We now assess the normality of the residuals with a histogram and a QQ-plot.
res2 <- m2$residuals
par(mfrow=c(1,2))
hist(res2, col='orchid')
qqnorm(res2)
qqline(res2)
par(mfrow=c(1,1))
These plots indicate that the residuals were drawn from a heavy-tailed distribution.
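Heavy tails can likewise be quantified with excess kurtosis, which is near zero for normal data and positive for heavy-tailed data. A sketch on simulated samples:

```r
# Excess kurtosis: ~0 for normal data, positive for heavy-tailed data
excess_kurtosis <- function(z) mean((z - mean(z))^4) / sd(z)^4 - 3

set.seed(2)
k_t    <- excess_kurtosis(rt(1000, df = 5))  # Student t with 5 df is heavy-tailed
k_norm <- excess_kurtosis(rnorm(1000))       # normal baseline
k_t     # clearly positive
k_norm  # near zero
```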
Let’s plot the natural log of y against x.
plot(log(y) ~ x, df, pch=21, col="black", bg="cyan")
That looks like an excellent linear relationship. Perhaps we should consider a model of the form: \(\ln(Y) = \beta_0 + \beta_1 X + \varepsilon\)
m3 <- lm(log(y) ~ x, df)
summary(m3)
Call:
lm(formula = log(y) ~ x, data = df)
Residuals:
     Min       1Q   Median       3Q      Max 
-0.81094 -0.23060  0.03466  0.28640  0.67522 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.99771    0.11466   8.702 7.91e-14 ***
x            0.62018    0.02037  30.439  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3507 on 98 degrees of freedom
Multiple R-squared: 0.9043, Adjusted R-squared: 0.9034
F-statistic: 926.5 on 1 and 98 DF, p-value: < 2.2e-16
Let’s consider the residual plot. Note that these are residuals of ln(Y), and not Y.
plot(m3$residuals ~ df$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)
There is no apparent trend or heteroskedasticity in this residual plot.
We now assess the normality of the residuals with a histogram and a QQ-plot.
res3 <- m3$residuals
par(mfrow=c(1,2))
hist(res3, col='orchid')
qqnorm(res3)
qqline(res3)
par(mfrow=c(1,1))
These plots suggest that the residuals were likely drawn from a normal (or perhaps slightly left-skewed) distribution.
We will add the fitted line and 95% prediction intervals to the scatter plot of ln(y) against x.
plot(log(y) ~ x, df, pch=21, col="black", bg="cyan")
p3 <- predict(m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, p3[,"fit"], col="darkred", lwd=2)
lines(ptn, p3[,"lwr"], col="darkorange", lwd=2)
lines(ptn, p3[,"upr"], col="darkorange", lwd=2)
We can exponentiate the predicted values of ln(Y) to transform them back to the scale of Y.
plot(y ~ x, df, pch=21, col="black", bg="cyan")
p3 <- predict(m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, exp(p3[,"fit"]), col="darkred", lwd=2)
lines(ptn, exp(p3[,"lwr"]), col="darkorange", lwd=2)
lines(ptn, exp(p3[,"upr"]), col="darkorange", lwd=2)
When using this model to create predictions for Y, we need to first generate a prediction for ln(Y), and then exponentiate. Let’s construct a 95% prediction interval for Y when X = 6.5.
ln_yhat <- predict(m3, data.frame(x=6.5), interval="prediction", level=0.95)
yhat <- exp(ln_yhat)
yhat
fit lwr upr
1 152.7606 75.78804 307.909
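As a hand-check, the point prediction can be reproduced directly from the m3 coefficient table: compute the fit on the log scale, then exponentiate.

```r
# Reproduce the 'fit' value above from the reported m3 coefficients
b0 <- 0.99771; b1 <- 0.62018
ln_fit <- b0 + b1 * 6.5   # prediction on the ln(Y) scale
exp(ln_fit)               # approximately 152.76, matching predict()
```

Because exp is strictly increasing, exponentiating the lower and upper endpoints of the log-scale interval also yields a valid 95% prediction interval for Y itself.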
Note that we cannot directly compare \(R^2\) values between the linear models and the log-linear model, since the latter's \(R^2\) is computed on the log scale. We can directly compare SSE values (after transforming the fitted values back to the scale of Y), but we shouldn't use SSE exclusively to determine which model to use.
# Linear
sum((m1$residuals)^2)
[1] 653919.3
sum((df$y - m1$fitted.values)^2)
[1] 653919.3
# Exponential
sum((m2$residuals)^2)
[1] 366355.6
# Log-Linear
sum((df$y - exp(m3$fitted.values))^2)
[1] 383890.1