We begin by reading in the data and looking at a summary of it.
df <- read.table("ex02.txt", sep="\t", header=TRUE)
summary(df)
       x               y          
 Min.   :2.283   Min.   :  46.79  
 1st Qu.:3.767   1st Qu.: 381.51  
 Median :4.985   Median : 953.84  
 Mean   :5.085   Mean   :1862.38  
 3rd Qu.:6.416   3rd Qu.:2888.49  
 Max.   :7.918   Max.   :7982.39  
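As a quick side check, we can confirm the sample size; the 98 residual degrees of freedom in the model summaries below correspond to n = 100 observations and two estimated coefficients.
nrow(df)  # should be 100, matching 98 residual df plus 2 coefficients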
Create a scatterplot.
plot(y ~ x, df, pch=21, col="black", bg="cyan")
We will begin our analysis by considering a linear model of the form: \(Y = \hat{\beta}_0 + \hat{\beta}_1 X + \varepsilon\)
m1 <- lm(y ~ x, df)
summary(m1)
Call:
lm(formula = y ~ x, data = df)
Residuals:
     Min       1Q   Median       3Q      Max 
-2189.4   -671.4   -170.1    516.8   3842.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                             2.96e-15 ***
x            1012.61      65.69  15.416  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1083 on 98 degrees of freedom
Multiple R-squared: 0.708, Adjusted R-squared: 0.705
F-statistic: 237.6 on 1 and 98 DF, p-value: < 2.2e-16
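As an optional supplement, confidence intervals for the coefficient estimates can be obtained with confint():
confint(m1)  # 95% confidence intervals for the intercept and slope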
Plot the fitted line.
plot(y ~ x, df, pch=21, col="black", bg="cyan")
abline(m1$coefficients, col="darkred", lwd=2)
We now create a residual plot.
plot(m1$residuals ~ df$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)
The residuals exhibit a trend, as well as heteroskedasticity (unequal variance).
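If a formal check is desired, one option is a Breusch-Pagan test; this is only a sketch and assumes the lmtest package is installed.
library(lmtest)  # assumed to be available; provides bptest()
bptest(m1)       # a small p-value is consistent with heteroskedastic errors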
We now assess the normality of the residuals with a histogram and a QQ-plot.
res1 <- m1$residuals
par(mfrow=c(1,2))
hist(res1, col='orchid')
qqnorm(res1)
qqline(res1)
par(mfrow=c(1,1))
These plots suggest that the residuals were drawn from a right-skewed distribution.
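A Shapiro-Wilk test offers an optional numeric check of normality, with the plots above remaining the primary diagnostic:
shapiro.test(res1)  # a small p-value is consistent with non-normal residuals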
We will now consider a model of the form \(Y = \hat{\beta}_0 + \hat{\beta}_1 e^X + \varepsilon\).
m2 <- lm(y ~ exp(x), df)
summary(m2)
Call:
lm(formula = y ~ exp(x), data = df)
Residuals:
     Min       1Q   Median       3Q      Max 
-3783.0   -543.9   -229.4    436.6   4134.5 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 689.4541   132.2450   5.213 1.03e-06 ***
exp(x)        2.2769     0.1476  15.430  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1082 on 98 degrees of freedom
Multiple R-squared: 0.7084, Adjusted R-squared: 0.7054
F-statistic: 238.1 on 1 and 98 DF, p-value: < 2.2e-16
We plot the fitted curve.
plot(y ~ x, df, pch=21, col="black", bg="cyan")
ptn <- seq(from=2, to=8, by=0.1)
lines(ptn, predict(m2, data.frame(x=ptn)), col="darkred", lwd=2)
We generate a residual plot.
plot(m2$residuals ~ df$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)
There is still a slight trend. The heteroskedasticity is also still present.
We will add prediction intervals to the scatterplot.
plot(y ~ x, df, pch=21, col="black", bg="cyan")
p2 <- predict(m2, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, p2[,"fit"], col="darkred", lwd=2)
lines(ptn, p2[,"lwr"], col="darkorange", lwd=2)
lines(ptn, p2[,"upr"], col="darkorange", lwd=2)
We now assess the normality of the residuals with a histogram and a QQ-plot.
res2 <- m2$residuals
par(mfrow=c(1,2))
hist(res2, col='orchid')
qqnorm(res2)
qqline(res2)
par(mfrow=c(1,1))
These plots suggest that the residuals were drawn from a heavy-tailed distribution.
Let’s plot the natural log of y against x.
plot(log(y) ~ x, df, pch=21, col="black", bg="cyan")
This motivates a model of the form: \(\ln(Y) = \hat{\beta}_0 + \hat{\beta}_1 X + \varepsilon\)
m3 <- lm(log(y) ~ x, df)
summary(m3)
Call:
lm(formula = log(y) ~ x, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.00249 -0.26117  0.03103  0.26751  0.99605 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.1290     0.1465   21.36   <2e-16 ***
x             0.7335     0.0274   26.77   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4517 on 98 degrees of freedom
Multiple R-squared: 0.8797, Adjusted R-squared: 0.8785
F-statistic: 716.6 on 1 and 98 DF, p-value: < 2.2e-16
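As a rough interpretation on the original scale, each one-unit increase in x multiplies the fitted value of Y by about exp(0.7335) ≈ 2.08:
exp(coef(m3)["x"])  # multiplicative change in fitted Y per unit increase in x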
Let’s consider the residual plot. Note that these are residuals of ln(Y), and not Y.
plot(m3$residuals ~ df$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)
There is no apparent heteroskedasticity in this residual plot, but there does appear to be a slight trend.
We now assess the normality of the residuals with a histogram and a QQ-plot.
res3 <- m3$residuals
par(mfrow=c(1,2))
hist(res3, col='orchid')
qqnorm(res3)
qqline(res3)
par(mfrow=c(1,1))
These plots suggest that the residuals were likely drawn from a slightly heavy-tailed distribution.
We will add the fitted line and 95% prediction intervals to the scatterplot of ln(y) against x.
plot(log(y) ~ x, df, pch=21, col="black", bg="cyan")
p3 <- predict(m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, p3[,"fit"], col="darkred", lwd=2)
lines(ptn, p3[,"lwr"], col="darkorange", lwd=2)
lines(ptn, p3[,"upr"], col="darkorange", lwd=2)
We can exponentiate the fitted values and interval bounds for ln(Y) to transform them back to the scale of Y.
plot(y ~ x, df, pch=21, col="black", bg="cyan")
p3 <- predict(m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, exp(p3[,"fit"]), col="darkred", lwd=2)
lines(ptn, exp(p3[,"lwr"]), col="darkorange", lwd=2)
lines(ptn, exp(p3[,"upr"]), col="darkorange", lwd=2)
Let’s plot the natural log of y against the natural log of x.
plot(log(y) ~ log(x), df, pch=21, col="black", bg="cyan")
This appears to be an approximately linear relationship. We will consider a model of the form: \(\ln(Y) = \hat{\beta}_0 + \hat{\beta}_1 \ln(X) + \varepsilon\)
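Exponentiating both sides shows that this log-log specification is equivalent to fitting a power law on the original scale, \(Y = e^{\hat{\beta}_0} X^{\hat{\beta}_1} e^{\varepsilon}\), so the slope estimate acts as an exponent on X.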
m4 <- lm(log(y) ~ log(x), df)
summary(m4)
Call:
lm(formula = log(y) ~ log(x), data = df)
Residuals:
     Min       1Q   Median       3Q      Max 
-0.96888 -0.24472 -0.01224  0.23851  0.91477 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.2986     0.1855   7.001  3.2e-10 ***
log(x)        3.5432     0.1154  30.699  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3997 on 98 degrees of freedom
Multiple R-squared: 0.9058, Adjusted R-squared: 0.9048
F-statistic: 942.4 on 1 and 98 DF, p-value: < 2.2e-16
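As a quick supplementary calculation, the intercept on the log scale corresponds to a multiplicative constant of roughly exp(1.2986) ≈ 3.66 in the implied power law:
exp(coef(m4)["(Intercept)"])  # implied multiplicative constant, about 3.66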
Let’s consider the residual plot. Note that these are residuals of ln(Y), and not Y.
plot(m4$residuals ~ df$x, pch=21, col="black", bg="cyan",
xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)
There is no apparent heteroskedasticity or trend in this residual plot.
We now assess the normality of the residuals with a histogram and a QQ-plot.
res4 <- m4$residuals
par(mfrow=c(1,2))
hist(res4, col='orchid')
qqnorm(res4)
qqline(res4)
par(mfrow=c(1,1))
These plots suggest that the residuals were perhaps drawn from a slightly heavy-tailed distribution.
We will add the fitted line and 95% prediction intervals to the scatterplot of ln(y) against ln(x).
plot(log(y) ~ log(x), df, pch=21, col="black", bg="cyan")
p4 <- predict(m4, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(log(ptn), p4[,"fit"], col="darkred", lwd=2)
lines(log(ptn), p4[,"lwr"], col="darkorange", lwd=2)
lines(log(ptn), p4[,"upr"], col="darkorange", lwd=2)
We can exponentiate the fitted values and interval bounds for ln(Y) to transform them back to the scale of Y.
plot(y ~ x, df, pch=21, col="black", bg="cyan")
p4 <- predict(m4, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, exp(p4[,"fit"]), col="darkred", lwd=2)
lines(ptn, exp(p4[,"lwr"]), col="darkorange", lwd=2)
lines(ptn, exp(p4[,"upr"]), col="darkorange", lwd=2)
When using this model to create predictions for Y, we need to first generate a prediction for ln(Y), and then exponentiate. Let’s construct a 95% prediction interval for Y when X = 6.5.
ln_yhat <- predict(m4, data.frame(x=6.5), interval="prediction", level=0.95)
yhat <- exp(ln_yhat)
yhat
       fit      lwr      upr
1 2781.552 1249.637 6191.424
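As a sanity check, the point prediction can be reproduced directly from the coefficients of the implied power law:
exp(coef(m4)[1]) * 6.5^coef(m4)[2]  # should be close to the fitted value of about 2782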
Note that we cannot directly compare R^2 values between models fit to Y and models fit to ln(Y), since the response variables differ. We can directly compare SSE values (provided we transform all fitted values back to the scale of Y), but we shouldn't rely on SSE alone to decide which model to use.
# Linear
sum((m1$residuals)^2)
[1] 114917119
sum((df$y - m1$fitted.values)^2)
[1] 114917119
# Exponential
sum((m2$residuals)^2)
[1] 114762921
# Log-Linear
sum((df$y - exp(m3$fitted.values))^2)
[1] 112548872
# Log-Log
sum((df$y - exp(m4$fitted.values))^2)
[1] 90860143
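For convenience, the same comparison could be wrapped in a small helper function; sse_y is just a name we introduce here for illustration.
# Helper (illustrative): SSE computed on the original scale of Y
sse_y <- function(fitted_y) sum((df$y - fitted_y)^2)
sse_y(m1$fitted.values)        # linear
sse_y(m2$fitted.values)        # exponential
sse_y(exp(m3$fitted.values))   # log-linear
sse_y(exp(m4$fitted.values))   # log-log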