Load and Explore the Data

We begin by reading in the data and looking at a summary of it.

df <- read.table("ex01.txt", sep="\t", header=TRUE)
summary(df)
       x               y          
 Min.   :2.124   Min.   :  8.333  
 1st Qu.:3.795   1st Qu.: 31.661  
 Median :5.436   Median : 78.320  
 Mean   :5.358   Mean   :133.155  
 3rd Qu.:6.944   3rd Qu.:199.158  
 Max.   :7.979   Max.   :624.952  

Create a scatterplot.

plot(y ~ x, df, pch=21, col="black", bg="cyan")

Linear Model

We will begin our analysis by considering a linear model of the form: \(Y = \hat{\beta}_0 + \hat{\beta}_1 X + e\)

m1 <- lm(y ~ x, df)
summary(m1) 

Call:
lm(formula = y ~ x, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-149.97  -50.98  -10.01   33.54  318.51 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -221.754     26.707  -8.303  5.7e-13 ***
x             66.240      4.746  13.958  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 81.69 on 98 degrees of freedom
Multiple R-squared:  0.6653,    Adjusted R-squared:  0.6619 
F-statistic: 194.8 on 1 and 98 DF,  p-value: < 2.2e-16

Plot the fitted line.

plot(y ~ x, df, pch=21, col="black", bg="cyan")
abline(m1$coefficients, col="darkred", lwd=2)

We now create a residual plot.

plot(m1$residuals ~ df$x, pch=21, col="black", bg="cyan", 
     xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)

The residuals exhibit a trend, as well as heteroskedasticity (unequal variance).

We now assess the normality of the residuals with a histogram and a QQ-plot.

res1 <- m1$residuals
par(mfrow=c(1,2))
hist(res1, col='orchid')
qqnorm(res1)
qqline(res1)
par(mfrow=c(1,1))

These plots indicate that the residuals were drawn from a right-skewed distribution.

Exponential Model

We will now consider a model of the form \(Y = \hat{\beta}_0 + \hat{\beta}_1 e^X + \varepsilon\).

m2 <- lm(y ~ exp(x), df)
summary(m2) 

Call:
lm(formula = y ~ exp(x), data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-173.688  -22.301   -5.473   21.888  202.380 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 32.869993   7.814467   4.206 5.75e-05 ***
exp(x)       0.153758   0.007461  20.607  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 61.14 on 98 degrees of freedom
Multiple R-squared:  0.8125,    Adjusted R-squared:  0.8106 
F-statistic: 424.7 on 1 and 98 DF,  p-value: < 2.2e-16

We plot the fitted curve.

plot(y ~ x, df, pch=21, col="black", bg="cyan")
ptn = seq(from=2,to=8, by=0.1)
lines(ptn, predict(m2, data.frame(x=ptn)), col="darkred", lwd=2)

We generate a residual plot.

plot(m2$residuals ~ df$x, pch=21, col="black", bg="cyan", 
     xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)

There is still a slight trend. The heteroskedasticity is also still present.

We will add prediction intervals to the scatterplot.

plot(y ~ x, df, pch=21, col="black", bg="cyan")
p2 <- predict(m2, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, p2[,"fit"], col="darkred", lwd=2)
lines(ptn, p2[,"lwr"], col="darkorange", lwd=2)
lines(ptn, p2[,"upr"], col="darkorange", lwd=2)

We now assess the normality of the residuals with a histogram and a QQ-plot.

res2 <- m2$residuals
par(mfrow=c(1,2))
hist(res2, col='orchid')
qqnorm(res2)
qqline(res2)
par(mfrow=c(1,1))

These plots indicate that the residuals were drawn from a heavy-tailed distribution.

Log-Level (or Log-Linear) Model

Let’s plot the natural log of y against x.

plot(log(y) ~ x, df, pch=21, col="black", bg="cyan")

That looks like an excellent linear relationship. Perhaps we should consider a model of the form: \(\ln(Y) = \hat{\beta}_0 + \hat{\beta}_1 X + \varepsilon\)

m3 <- lm(log(y) ~ x, df)
summary(m3) 

Call:
lm(formula = log(y) ~ x, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.81094 -0.23060  0.03466  0.28640  0.67522 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.99771    0.11466   8.702 7.91e-14 ***
x            0.62018    0.02037  30.439  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3507 on 98 degrees of freedom
Multiple R-squared:  0.9043,    Adjusted R-squared:  0.9034 
F-statistic: 926.5 on 1 and 98 DF,  p-value: < 2.2e-16

Let’s consider the residual plot. Note that these are residuals of ln(Y), and not Y.

plot(m3$residuals ~ df$x, pch=21, col="black", bg="cyan", 
     xlab="x", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)

There is no apparent trend or heteroskedasticity in this residual plot.

We now assess the normality of the residuals with a histogram and a QQ-plot.

res3 <- m3$residuals
par(mfrow=c(1,2))
hist(res3, col='orchid')
qqnorm(res3)
qqline(res3)
par(mfrow=c(1,1))

These plots suggest that the residuals were likey drawn from a normal (or perhaps slightly left-skewed) distribution.

We will add the fitted curve and to the scatter plot of ln(y) against y.

plot(log(y) ~ x, df, pch=21, col="black", bg="cyan")
p3 <- predict(m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, p3[,"fit"], col="darkred", lwd=2)
lines(ptn, p3[,"lwr"], col="darkorange", lwd=2)
lines(ptn, p3[,"upr"], col="darkorange", lwd=2)

We can exponentiate the values of log(Y) to transform back to be in terms of Y.

plot(y ~ x, df, pch=21, col="black", bg="cyan")
p3 <- predict(m3, newdata=data.frame(x=ptn), interval="prediction", level=0.95)
lines(ptn, exp(p3[,"fit"]), col="darkred", lwd=2)
lines(ptn, exp(p3[,"lwr"]), col="darkorange", lwd=2)
lines(ptn, exp(p3[,"upr"]), col="darkorange", lwd=2)

Generating Predictions with the Transformed Model

When using this model to create predictions for Y, we need to first generate a prediction for ln(Y), and then exponentiate. Let’s construct a 95% prediction interval for Y when X = 6.5.

ln_yhat = predict(m3, data.frame(x=c(6.5)), interval="prediction", level=0.95)
yhat = exp(ln_yhat)
yhat
       fit      lwr     upr
1 152.7606 75.78804 307.909

Note that we cannot directly compare r^2 values for linear and log-linear models. We can directly compare SSE values (assuming we transform everything back to Y), but we shouldn’t use SSE exclusively to determine which model to use.

# Linear
sum((m1$residuals)^2)
[1] 653919.3
sum((df$y - m1$fitted.values)^2)
[1] 653919.3
# Exponential
sum((m2$residuals)^2)   # 366,355.6
[1] 366355.6
# Log-Linear
sum((df$y - exp(m3$fitted.values))^2)  
[1] 383890.1
LS0tDQp0aXRsZTogIkxlc3NvbiAxMyAtIExvZ2FyaXRobWljIFRyYW5zZm9ybWF0aW9ucyAoUGFydCAxKSINCmF1dGhvcjogIlJvYmJpZSBCZWFuZSINCm91dHB1dDoNCiAgaHRtbF9ub3RlYm9vazoNCiAgICB0aGVtZTogZmxhdGx5DQogICAgdG9jOiB5ZXMNCiAgICB0b2NfZGVwdGg6IDINCi0tLQ0KDQojIExvYWQgYW5kIEV4cGxvcmUgdGhlIERhdGENCg0KV2UgYmVnaW4gYnkgcmVhZGluZyBpbiB0aGUgZGF0YSBhbmQgbG9va2luZyBhdCBhIHN1bW1hcnkgb2YgaXQuIA0KDQpgYGB7cn0NCmRmIDwtIHJlYWQudGFibGUoImV4MDEudHh0Iiwgc2VwPSJcdCIsIGhlYWRlcj1UUlVFKQ0Kc3VtbWFyeShkZikNCmBgYA0KDQpDcmVhdGUgYSBzY2F0dGVycGxvdC4NCg0KYGBge3J9DQpwbG90KHkgfiB4LCBkZiwgcGNoPTIxLCBjb2w9ImJsYWNrIiwgYmc9ImN5YW4iKQ0KYGBgDQoNCg0KDQojIExpbmVhciBNb2RlbA0KDQpXZSB3aWxsIGJlZ2luIG91ciBhbmFseXNpcyBieSBjb25zaWRlcmluZyBhIGxpbmVhciBtb2RlbCBvZiB0aGUgZm9ybTogJFkgPSBcaGF0e1xiZXRhfV8wICsgXGhhdHtcYmV0YX1fMSBYICsgZSQgDQoNCmBgYHtyfQ0KbTEgPC0gbG0oeSB+IHgsIGRmKQ0Kc3VtbWFyeShtMSkgDQpgYGANCg0KUGxvdCB0aGUgZml0dGVkIGxpbmUuIA0KDQpgYGB7cn0NCnBsb3QoeSB+IHgsIGRmLCBwY2g9MjEsIGNvbD0iYmxhY2siLCBiZz0iY3lhbiIpDQphYmxpbmUobTEkY29lZmZpY2llbnRzLCBjb2w9ImRhcmtyZWQiLCBsd2Q9MikNCmBgYA0KDQoNCldlIG5vdyBjcmVhdGUgYSByZXNpZHVhbCBwbG90LiANCg0KYGBge3J9DQpwbG90KG0xJHJlc2lkdWFscyB+IGRmJHgsIHBjaD0yMSwgY29sPSJibGFjayIsIGJnPSJjeWFuIiwgDQogICAgIHhsYWI9IngiLCB5bGFiPSJSZXNpZHVhbHMiLCBtYWluPSJSZXNpZHVhbCBQbG90IikNCg0KYWJsaW5lKGg9MCwgY29sPSJkYXJrcmVkIiwgbHdkPTIpDQpgYGANCg0KVGhlIHJlc2lkdWFscyBleGhpYml0IGEgdHJlbmQsIGFzIHdlbGwgYXMgaGV0ZXJvc2tlZGFzdGljaXR5ICh1bmVxdWFsIHZhcmlhbmNlKS4gDQoNCldlIG5vdyBhc3Nlc3MgdGhlIG5vcm1hbGl0eSBvZiB0aGUgcmVzaWR1YWxzIHdpdGggYSBoaXN0b2dyYW0gYW5kIGEgUVEtcGxvdC4NCg0KYGBge3J9DQpyZXMxIDwtIG0xJHJlc2lkdWFscw0KcGFyKG1mcm93PWMoMSwyKSkNCmhpc3QocmVzMSwgY29sPSdvcmNoaWQnKQ0KcXFub3JtKHJlczEpDQpxcWxpbmUocmVzMSkNCnBhcihtZnJvdz1jKDEsMSkpDQpgYGANCg0KVGhlc2UgcGxvdHMgaW5kaWNhdGUgdGhhdCB0aGUgcmVzaWR1YWxzIHdlcmUgZHJhd24gZnJvbSBhIHJpZ2h0LXNrZXdlZCBkaXN0cmlidXRpb24uIA0KDQoNCg0KIyBFeHBvbmVudGlhbCBNb2RlbA0KDQpXZSB3aWxsIG5vdyBjb25zaWRlciBhIG1vZGVsIG9mIHRoZSBmb3JtICRZID0gXGhhdHtcYmV0YX1fMCArIFxoYXR7XGJldGF9XzEgZV5YICsgXHZhcmVwc2lsb24kLg0KDQpgYGB7cn0NCm0yIDwtIGxtKHkgfiBleHAoeCksIGRmKQ0Kc3VtbWFyeShtMikgDQpgYGANCg0KV2UgcGxvdCB0aGUgZml0dGVkIGN1cnZlLiANCg0KYGBge3J9DQpwbG90KHkgfiB4LCBkZiwgcGNoPTIxLCBjb2w9ImJsYWNrIiwgYmc9ImN5YW4iKQ0KDQpwdG4gPSBzZXEoZnJvbT0yLHRvPTgsIGJ5PTAuMSkNCmxpbmVzKHB0biwgcHJlZGljdChtMiwgZGF0YS5mcmFtZSh4PXB0bikpLCBjb2w9ImRhcmtyZWQiLCBsd2Q9MikNCmBgYA0KDQpXZSBnZW5lcmF0ZSBhIHJlc2lkdWFsIHBsb3QuIA0KDQpgYGB7cn0NCnBsb3QobTIkcmVzaWR1YWxzIH4gZGYkeCwgcGNoPTIxLCBjb2w9ImJsYWNrIiwgYmc9ImN5YW4iLCANCiAgICAgeGxhYj0ieCIsIHlsYWI9IlJlc2lkdWFscyIsIG1haW49IlJlc2lkdWFsIFBsb3QiKQ0KDQphYmxpbmUoaD0wLCBjb2w9ImRhcmtyZWQiLCBsd2Q9MikNCmBgYA0KDQpUaGVyZSBpcyBzdGlsbCBhIHNsaWdodCB0cmVuZC4gVGhlIGhldGVyb3NrZWRhc3RpY2l0eSBpcyBhbHNvIHN0aWxsIHByZXNlbnQuIA0KDQpXZSB3aWxsIGFkZCBwcmVkaWN0aW9uIGludGVydmFscyB0byB0aGUgc2NhdHRlcnBsb3QuIA0KDQpgYGB7cn0NCnBsb3QoeSB+IHgsIGRmLCBwY2g9MjEsIGNvbD0iYmxhY2siLCBiZz0iY3lhbiIpDQoNCnAyIDwtIHByZWRpY3QobTIsIG5ld2RhdGE9ZGF0YS5mcmFtZSh4PXB0biksIGludGVydmFsPSJwcmVkaWN0aW9uIiwgbGV2ZWw9MC45NSkNCmxpbmVzKHB0biwgcDJbLCJmaXQiXSwgY29sPSJkYXJrcmVkIiwgbHdkPTIpDQpsaW5lcyhwdG4sIHAyWywibHdyIl0sIGNvbD0iZGFya29yYW5nZSIsIGx3ZD0yKQ0KbGluZXMocHRuLCBwMlssInVwciJdLCBjb2w9ImRhcmtvcmFuZ2UiLCBsd2Q9MikNCmBgYA0KDQoNCldlIG5vdyBhc3Nlc3MgdGhlIG5vcm1hbGl0eSBvZiB0aGUgcmVzaWR1YWxzIHdpdGggYSBoaXN0b2dyYW0gYW5kIGEgUVEtcGxvdC4NCg0KYGBge3J9DQpyZXMyIDwtIG0yJHJlc2lkdWFscw0KcGFyKG1mcm93PWMoMSwyKSkNCmhpc3QocmVzMiwgY29sPSdvcmNoaWQnKQ0KcXFub3JtKHJlczIpDQpxcWxpbmUocmVzMikNCnBhcihtZnJvdz1jKDEsMSkpDQpgYGANCg0KVGhlc2UgcGxvdHMgaW5kaWNhdGUgdGhhdCB0aGUgcmVzaWR1YWxzIHdlcmUgZHJhd24gZnJvbSBhIGhlYXZ5LXRhaWxlZCBkaXN0cmlidXRpb24uIA0KIA0KDQoNCiMgTG9nLUxldmVsIChvciBMb2ctTGluZWFyKSBNb2RlbA0KDQpMZXQncyBwbG90IHRoZSBuYXR1cmFsIGxvZyBvZiB5IGFnYWluc3QgeC4gDQoNCg0KYGBge3J9DQpwbG90KGxvZyh5KSB+IHgsIGRmLCBwY2g9MjEsIGNvbD0iYmxhY2siLCBiZz0iY3lhbiIpDQpgYGANCg0KVGhhdCBsb29rcyBsaWtlIGFuIGV4Y2VsbGVudCBsaW5lYXIgcmVsYXRpb25zaGlwLiBQZXJoYXBzIHdlIHNob3VsZCBjb25zaWRlciBhIG1vZGVsIG9mIHRoZSBmb3JtOiAkXGxuKFkpID0gXGhhdHtcYmV0YX1fMCArIFxoYXR7XGJldGF9XzEgWCArIFx2YXJlcHNpbG9uJA0KDQoNCmBgYHtyfQ0KbTMgPC0gbG0obG9nKHkpIH4geCwgZGYpDQpzdW1tYXJ5KG0zKSANCmBgYA0KDQpMZXQncyBjb25zaWRlciB0aGUgcmVzaWR1YWwgcGxvdC4gTm90ZSB0aGF0IHRoZXNlIGFyZSByZXNpZHVhbHMgb2YgbG4oWSksIGFuZCBub3QgWS4gDQoNCmBgYHtyfQ0KcGxvdChtMyRyZXNpZHVhbHMgfiBkZiR4LCBwY2g9MjEsIGNvbD0iYmxhY2siLCBiZz0iY3lhbiIsIA0KICAgICB4bGFiPSJ4IiwgeWxhYj0iUmVzaWR1YWxzIiwgbWFpbj0iUmVzaWR1YWwgUGxvdCIpDQoNCmFibGluZShoPTAsIGNvbD0iZGFya3JlZCIsIGx3ZD0yKQ0KYGBgDQoNClRoZXJlIGlzIG5vIGFwcGFyZW50IHRyZW5kIG9yIGhldGVyb3NrZWRhc3RpY2l0eSBpbiB0aGlzIHJlc2lkdWFsIHBsb3QuIA0KDQoNCldlIG5vdyBhc3Nlc3MgdGhlIG5vcm1hbGl0eSBvZiB0aGUgcmVzaWR1YWxzIHdpdGggYSBoaXN0b2dyYW0gYW5kIGEgUVEtcGxvdC4NCg0KYGBge3J9DQpyZXMzIDwtIG0zJHJlc2lkdWFscw0KcGFyKG1mcm93PWMoMSwyKSkNCmhpc3QocmVzMywgY29sPSdvcmNoaWQnKQ0KcXFub3JtKHJlczMpDQpxcWxpbmUocmVzMykNCnBhcihtZnJvdz1jKDEsMSkpDQpgYGANCg0KVGhlc2UgcGxvdHMgc3VnZ2VzdCB0aGF0IHRoZSByZXNpZHVhbHMgd2VyZSBsaWtleSBkcmF3biBmcm9tIGEgbm9ybWFsIChvciBwZXJoYXBzIHNsaWdodGx5IGxlZnQtc2tld2VkKSBkaXN0cmlidXRpb24uIA0KIA0KDQoNCldlIHdpbGwgYWRkIHRoZSBmaXR0ZWQgY3VydmUgYW5kICB0byB0aGUgc2NhdHRlciBwbG90IG9mIGxuKHkpIGFnYWluc3QgeS4gDQoNCg0KDQpgYGB7cn0NCnBsb3QobG9nKHkpIH4geCwgZGYsIHBjaD0yMSwgY29sPSJibGFjayIsIGJnPSJjeWFuIikNCg0KcDMgPC0gcHJlZGljdChtMywgbmV3ZGF0YT1kYXRhLmZyYW1lKHg9cHRuKSwgaW50ZXJ2YWw9InByZWRpY3Rpb24iLCBsZXZlbD0wLjk1KQ0KbGluZXMocHRuLCBwM1ssImZpdCJdLCBjb2w9ImRhcmtyZWQiLCBsd2Q9MikNCmxpbmVzKHB0biwgcDNbLCJsd3IiXSwgY29sPSJkYXJrb3JhbmdlIiwgbHdkPTIpDQpsaW5lcyhwdG4sIHAzWywidXByIl0sIGNvbD0iZGFya29yYW5nZSIsIGx3ZD0yKQ0KYGBgDQoNCldlIGNhbiBleHBvbmVudGlhdGUgdGhlIHZhbHVlcyBvZiBsb2coWSkgdG8gdHJhbnNmb3JtIGJhY2sgdG8gYmUgaW4gdGVybXMgb2YgWS4gDQoNCmBgYHtyfQ0KcGxvdCh5IH4geCwgZGYsIHBjaD0yMSwgY29sPSJibGFjayIsIGJnPSJjeWFuIikNCg0KcDMgPC0gcHJlZGljdChtMywgbmV3ZGF0YT1kYXRhLmZyYW1lKHg9cHRuKSwgaW50ZXJ2YWw9InByZWRpY3Rpb24iLCBsZXZlbD0wLjk1KQ0KbGluZXMocHRuLCBleHAocDNbLCJmaXQiXSksIGNvbD0iZGFya3JlZCIsIGx3ZD0yKQ0KbGluZXMocHRuLCBleHAocDNbLCJsd3IiXSksIGNvbD0iZGFya29yYW5nZSIsIGx3ZD0yKQ0KbGluZXMocHRuLCBleHAocDNbLCJ1cHIiXSksIGNvbD0iZGFya29yYW5nZSIsIGx3ZD0yKQ0KYGBgDQoNCiMgR2VuZXJhdGluZyBQcmVkaWN0aW9ucyB3aXRoIHRoZSBUcmFuc2Zvcm1lZCBNb2RlbCANCg0KV2hlbiB1c2luZyB0aGlzIG1vZGVsIHRvIGNyZWF0ZSBwcmVkaWN0aW9ucyBmb3IgWSwgd2UgbmVlZCB0byBmaXJzdCBnZW5lcmF0ZSBhIHByZWRpY3Rpb24gZm9yIGxuKFkpLCBhbmQgdGhlbiBleHBvbmVudGlhdGUuIExldCdzIGNvbnN0cnVjdCBhIDk1JSBwcmVkaWN0aW9uIGludGVydmFsIGZvciBZIHdoZW4gWCA9IDYuNS4NCg0KDQpgYGB7cn0NCmxuX3loYXQgPSBwcmVkaWN0KG0zLCBkYXRhLmZyYW1lKHg9Yyg2LjUpKSwgaW50ZXJ2YWw9InByZWRpY3Rpb24iLCBsZXZlbD0wLjk1KQ0KeWhhdCA9IGV4cChsbl95aGF0KQ0KeWhhdA0KYGBgDQoNCk5vdGUgdGhhdCB3ZSBjYW5ub3QgZGlyZWN0bHkgY29tcGFyZSByXjIgdmFsdWVzIGZvciBsaW5lYXIgYW5kIGxvZy1saW5lYXIgbW9kZWxzLiBXZSBjYW4gZGlyZWN0bHkgY29tcGFyZSBTU0UgdmFsdWVzIChhc3N1bWluZyB3ZSB0cmFuc2Zvcm0gZXZlcnl0aGluZyBiYWNrIHRvIFkpLCBidXQgd2Ugc2hvdWxkbid0IHVzZSBTU0UgZXhjbHVzaXZlbHkgdG8gZGV0ZXJtaW5lIHdoaWNoIG1vZGVsIHRvIHVzZS4gDQoNCg0KYGBge3J9DQojIExpbmVhcg0Kc3VtKChtMSRyZXNpZHVhbHMpXjIpDQpzdW0oKGRmJHkgLSBtMSRmaXR0ZWQudmFsdWVzKV4yKQ0KYGBgDQoNCg0KYGBge3J9DQojIEV4cG9uZW50aWFsDQpzdW0oKG0yJHJlc2lkdWFscyleMikgICANCmBgYA0KDQoNCmBgYHtyfQ0KIyBMb2ctTGluZWFyDQpzdW0oKGRmJHkgLSBleHAobTMkZml0dGVkLnZhbHVlcykpXjIpICANCg0KYGBgDQoNCg0KDQoNCg0K