We begin by reading in the data and looking at a summary of it.
df <- read.table("ex02.txt", sep="\t", header=TRUE)
summary(df)
       x               y          
 Min.   :2.283   Min.   :  46.79  
 1st Qu.:3.767   1st Qu.: 381.51  
 Median :4.985   Median : 953.84  
 Mean   :5.085   Mean   :1862.38  
 3rd Qu.:6.416   3rd Qu.:2888.49  
 Max.   :7.918   Max.   :7982.39  

Create a scatterplot.
plot(y ~ x, df, pch=21, col="black", bg="cyan")

We will begin our analysis by considering a linear model of the form: \(Y = \beta_0 + \beta_1 X + \varepsilon\)
m1 <- lm(y ~ x, df)
summary(m1) 
Call:
lm(formula = y ~ x, data = df)
Residuals:
    Min      1Q  Median      3Q     Max 
-2189.4  -671.4  -170.1   516.8  3842.2 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                             2.96e-15 ***
x            1012.61      65.69  15.416  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1083 on 98 degrees of freedom
Multiple R-squared:  0.708, Adjusted R-squared:  0.705 
F-statistic: 237.6 on 1 and 98 DF,  p-value: < 2.2e-16

Plot the fitted line.
plot(y ~ x, df, pch=21, col="black", bg="cyan")
abline(m1$coefficients, col="darkred", lwd=2)

We now create a residual plot.
plot(m1$residuals ~ df$x, pch=21, col="black", bg="cyan", 
     xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)

The residuals exhibit a trend, as well as heteroskedasticity (unequal variance).
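The visual impression of unequal variance can also be checked numerically. A minimal sketch, assuming the lmtest package is installed (it is not used elsewhere in this analysis):

```r
# Breusch-Pagan test for heteroskedasticity on the linear fit.
# A small p-value is evidence that the error variance is not constant.
library(lmtest)   # assumed to be installed
bptest(m1)
```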
We now assess the normality of the residuals with a histogram and a QQ-plot.
res1 <- m1$residuals
par(mfrow=c(1,2))
hist(res1, col='orchid')
qqnorm(res1)
qqline(res1)
par(mfrow=c(1,1))

These plots indicate that the residuals were drawn from a right-skewed distribution.
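A formal test can supplement the graphical assessment. A minimal sketch using base R's Shapiro-Wilk test:

```r
# Shapiro-Wilk test of normality on the residuals of m1.
# A small p-value is evidence against normally distributed errors.
shapiro.test(res1)
```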
We will now consider a model of the form \(Y = \beta_0 + \beta_1 e^X + \varepsilon\).
m2 <- lm(y ~ exp(x), df)
summary(m2) 
Call:
lm(formula = y ~ exp(x), data = df)
Residuals:
    Min      1Q  Median      3Q     Max 
-3783.0  -543.9  -229.4   436.6  4134.5 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 689.4541   132.2450   5.213 1.03e-06 ***
exp(x)        2.2769     0.1476  15.430  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1082 on 98 degrees of freedom
Multiple R-squared:  0.7084,    Adjusted R-squared:  0.7054 
F-statistic: 238.1 on 1 and 98 DF,  p-value: < 2.2e-16

We plot the fitted curve.
plot(y ~ x, df, pch=21, col="black", bg="cyan")
ptn <- seq(from=2, to=8, by=0.1)
lines(ptn, predict(m2, data.frame(x=ptn)), col="darkred", lwd=2)

We generate a residual plot.
plot(m2$residuals ~ df$x, pch=21, col="black", bg="cyan", 
     xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)

There is still a slight trend. The heteroskedasticity is also still present.
We will add prediction intervals to the scatterplot.
plot(y ~ x, df, pch=21, col="black", bg="cyan")
p2 <- predict(m2, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, p2[,"fit"], col="darkred", lwd=2)
lines(ptn, p2[,"lwr"], col="darkorange", lwd=2)
lines(ptn, p2[,"upr"], col="darkorange", lwd=2)

We now assess the normality of the residuals with a histogram and a QQ-plot.
res2 <- m2$residuals
par(mfrow=c(1,2))
hist(res2, col='orchid')
qqnorm(res2)
qqline(res2)
par(mfrow=c(1,1))

These plots suggest that the residuals were drawn from a heavy-tailed distribution.
Let’s plot the natural log of y against x.
plot(log(y) ~ x, df, pch=21, col="black", bg="cyan")

We will now consider a model of the form: \(\ln(Y) = \beta_0 + \beta_1 X + \varepsilon\)
m3 <- lm(log(y) ~ x, df)
summary(m3) 
Call:
lm(formula = log(y) ~ x, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.00249 -0.26117  0.03103  0.26751  0.99605 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.1290     0.1465   21.36   <2e-16 ***
x             0.7335     0.0274   26.77   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4517 on 98 degrees of freedom
Multiple R-squared:  0.8797,    Adjusted R-squared:  0.8785 
F-statistic: 716.6 on 1 and 98 DF,  p-value: < 2.2e-16

Let’s consider the residual plot. Note that these are residuals of ln(Y), and not Y.
plot(m3$residuals ~ df$x, pch=21, col="black", bg="cyan", 
     xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)

There is no heteroskedasticity in this residual plot, but there does appear to be a slight trend.
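As an aside, the slope in a log-linear model has a multiplicative interpretation: each one-unit increase in x multiplies the predicted y by \(e^{\hat{\beta}_1}\). A quick sketch:

```r
# Each one-unit increase in x multiplies the predicted y by exp(slope).
exp(coef(m3)["x"])   # exp(0.7335) is roughly 2.08
```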
We now assess the normality of the residuals with a histogram and a QQ-plot.
res3 <- m3$residuals
par(mfrow=c(1,2))
hist(res3, col='orchid')
qqnorm(res3)
qqline(res3)
par(mfrow=c(1,1))

These plots suggest that the residuals were likely drawn from a slightly heavy-tailed distribution.
We will add the fitted line and prediction intervals to the scatterplot of ln(y) against x.
plot(log(y) ~ x, df, pch=21, col="black", bg="cyan")
p3 <- predict(m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, p3[,"fit"], col="darkred", lwd=2)
lines(ptn, p3[,"lwr"], col="darkorange", lwd=2)
lines(ptn, p3[,"upr"], col="darkorange", lwd=2)

We can exponentiate the values of log(Y) to transform back to be in terms of Y.
plot(y ~ x, df, pch=21, col="black", bg="cyan")
p3 <- predict(m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, exp(p3[,"fit"]), col="darkred", lwd=2)
lines(ptn, exp(p3[,"lwr"]), col="darkorange", lwd=2)
lines(ptn, exp(p3[,"upr"]), col="darkorange", lwd=2)

Let’s plot the natural log of y against the natural log of x.
plot(log(y) ~ log(x), df, pch=21, col="black", bg="cyan")

This appears to be an approximately linear relationship. We will consider a model of the form: \(\ln(Y) = \beta_0 + \beta_1 \ln(X) + \varepsilon\)
m4 <- lm(log(y) ~ log(x), df)
summary(m4) 
Call:
lm(formula = log(y) ~ log(x), data = df)
Residuals:
     Min       1Q   Median       3Q      Max 
-0.96888 -0.24472 -0.01224  0.23851  0.91477 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.2986     0.1855   7.001  3.2e-10 ***
log(x)        3.5432     0.1154  30.699  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3997 on 98 degrees of freedom
Multiple R-squared:  0.9058,    Adjusted R-squared:  0.9048 
F-statistic: 942.4 on 1 and 98 DF,  p-value: < 2.2e-16

Let’s consider the residual plot. Note that these are residuals of ln(Y), and not Y.
plot(m4$residuals ~ df$x, pch=21, col="black", bg="cyan", 
     xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)

There is no apparent heteroskedasticity or trend in this residual plot.
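Back-transforming the fitted equation shows that the log-log model corresponds to a power law, \(y \approx e^{\hat{\beta}_0} x^{\hat{\beta}_1}\). A quick sketch:

```r
# The log-log fit corresponds to y ≈ a * x^b on the original scale,
# with a = exp(intercept) and b = the slope on log(x).
a <- exp(coef(m4)["(Intercept)"])   # exp(1.2986), roughly 3.66
b <- coef(m4)["log(x)"]             # roughly 3.54
```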
We now assess the normality of the residuals with a histogram and a QQ-plot.
res4 <- m4$residuals
par(mfrow=c(1,2))
hist(res4, col='orchid')
qqnorm(res4)
qqline(res4)
par(mfrow=c(1,1))

These plots suggest that the residuals were perhaps drawn from a slightly heavy-tailed distribution.
We will add the fitted line and prediction intervals to the scatterplot of ln(y) against ln(x).
plot(log(y) ~ log(x), df, pch=21, col="black", bg="cyan")
p4 <- predict(m4, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(log(ptn), p4[,"fit"], col="darkred", lwd=2)
lines(log(ptn), p4[,"lwr"], col="darkorange", lwd=2)
lines(log(ptn), p4[,"upr"], col="darkorange", lwd=2)

We can exponentiate the values of log(Y) to transform back to be in terms of Y.
plot(y ~ x, df, pch=21, col="black", bg="cyan")
p4 <- predict(m4, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, exp(p4[,"fit"]), col="darkred", lwd=2)
lines(ptn, exp(p4[,"lwr"]), col="darkorange", lwd=2)
lines(ptn, exp(p4[,"upr"]), col="darkorange", lwd=2)

When using this model to create predictions for Y, we need to first generate a prediction for ln(Y), and then exponentiate. Let’s construct a 95% prediction interval for Y when X = 6.5.
ln_yhat <- predict(m4, data.frame(x=6.5), interval="prediction", level=0.95)
yhat <- exp(ln_yhat)
yhat
       fit      lwr      upr
1 2781.552 1249.637 6191.424

Note that we cannot directly compare R-squared values for linear and log-linear models. We can directly compare SSE values (assuming we transform everything back to Y), but we shouldn’t use SSE exclusively to determine which model to use.
# Linear
sum((m1$residuals)^2)
[1] 114917119
sum((df$y - m1$fitted.values)^2)
[1] 114917119

# Exponential
sum((m2$residuals)^2)
[1] 114762921

# Log-Linear
sum((df$y - exp(m3$fitted.values))^2)
[1] 112548872

# Log-Log
sum((df$y - exp(m4$fitted.values))^2)
[1] 90860143
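One way to put all four models on a comparable footing is the proportion of variance in y explained on the original scale, \(1 - SSE/SST\). A minimal sketch using the models fit above:

```r
# Compare all four models on the original y scale via 1 - SSE/SST.
# The log-scale models' fitted values are exponentiated back to y units.
sst <- sum((df$y - mean(df$y))^2)
fits <- list(linear      = m1$fitted.values,
             exponential = m2$fitted.values,
             log_linear  = exp(m3$fitted.values),
             log_log     = exp(m4$fitted.values))
sapply(fits, function(f) 1 - sum((df$y - f)^2) / sst)
```

Like SSE, this measure alone shouldn’t decide the model; the residual diagnostics above matter as well.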