In this lesson, we will see how work with qualitative predictors in which there are more than two possible values.
Load Packages and Data
require(ggplot2)
Our goal in this lesson will be to build a regression model for the purposes of predicting the sales price of an apartment. In addition to the response variable, price
, our dataset contains two predictors: sqft
and region
. The predictor sqft
is a quantitative variable that is equal to the square footage of the apartment. The predictor region
is a qualitative variable that reports in which of three geographic regions the apartment is located. The regions are encoded A
, B
, and C
. This is synthetic data.
df <- read.table("apt_prices.txt", sep="\t", header=TRUE)
summary(df)
price sqft region
Min. : 39067 Min. : 548.4 A:154
1st Qu.:107526 1st Qu.: 762.5 B:117
Median :151706 Median : 870.6 C:156
Mean :181081 Mean : 903.7
3rd Qu.:221799 3rd Qu.:1021.1
Max. :776368 Max. :1735.8
Exploratory Analysis
We will begin our analysis by plotting price
against sqft
.
ggplot(df, aes(x=sqft, y=price, col=region)) + geom_point()
We see that heteroskedasticity is likely a concern with this dataset. We will now plot ln(price)
against ln(sqft)
.
ggplot(df, aes(x=log(sqft), y=log(price), col=region)) + geom_point()
The log-log transformation seems to have addressed the concerns regarding heteroskedasticity.
ggplot(df, aes(x=region, y=price, fill = region)) + geom_boxplot()
ggplot(df, aes(x=sqft, y=price, col=region)) + geom_point() + facet_wrap(~region)
Create Model
mod <- lm(log(price) ~ log(sqft) + region, df)
summary(mod)
Call:
lm(formula = log(price) ~ log(sqft) + region, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.83246 -0.17565 -0.02565 0.20028 0.78663
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.70981 0.44700 -1.588 0.1130
log(sqft) 1.85212 0.06602 28.055 <2e-16 ***
regionB 0.46219 0.03580 12.909 <2e-16 ***
regionC -0.07744 0.03311 -2.339 0.0198 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2911 on 423 degrees of freedom
Multiple R-squared: 0.7211, Adjusted R-squared: 0.7191
F-statistic: 364.6 on 3 and 423 DF, p-value: < 2.2e-16
\[\large\ln(\hat {price}) = -0.70981 + 1.85212\cdot \ln(sqft) + 0.46219 \cdot regionB - 0.07744\cdot regionC\]
Residual Analysis
plot(mod$residuals ~ mod$fitted.values, pch=21, col="black", bg="salmon",
xlab="Fitted Value", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)
res <- mod$residuals
par(mfrow=c(1,2))
hist(res, col='orchid', breaks=10)
qqnorm(res)
qqline(res)
par(mfrow=c(1,1))
LS0tDQp0aXRsZTogIkxlc3NvbiAxOSAtIFF1YWxpdGF0aXZlIFByZWRpY3RvcnMiDQphdXRob3I6ICJSb2JiaWUgQmVhbmUiDQpvdXRwdXQ6DQogIGh0bWxfbm90ZWJvb2s6DQogICAgdGhlbWU6IGZsYXRseQ0KICAgIHRvYzogeWVzDQogICAgdG9jX2RlcHRoOiAyDQotLS0NCg0KSW4gdGhpcyBsZXNzb24sIHdlIHdpbGwgc2VlIGhvdyB3b3JrIHdpdGggcXVhbGl0YXRpdmUgcHJlZGljdG9ycyBpbiB3aGljaCB0aGVyZSBhcmUgbW9yZSB0aGFuIHR3byBwb3NzaWJsZSB2YWx1ZXMuIA0KDQojIExvYWQgUGFja2FnZXMgYW5kIERhdGENCg0KYGBge3IsIG1lc3NhZ2U9RkFMU0V9DQpyZXF1aXJlKGdncGxvdDIpDQpgYGANCg0KT3VyIGdvYWwgaW4gdGhpcyBsZXNzb24gd2lsbCBiZSB0byBidWlsZCBhIHJlZ3Jlc3Npb24gbW9kZWwgZm9yIHRoZSBwdXJwb3NlcyBvZiBwcmVkaWN0aW5nIHRoZSBzYWxlcyBwcmljZSBvZiBhbiBhcGFydG1lbnQuIEluIGFkZGl0aW9uIHRvIHRoZSByZXNwb25zZSB2YXJpYWJsZSwgYHByaWNlYCwgb3VyIGRhdGFzZXQgY29udGFpbnMgdHdvIHByZWRpY3RvcnM6IGBzcWZ0YCBhbmQgYHJlZ2lvbmAuIFRoZSBwcmVkaWN0b3IgYHNxZnRgIGlzIGEgcXVhbnRpdGF0aXZlIHZhcmlhYmxlIHRoYXQgaXMgZXF1YWwgdG8gdGhlIHNxdWFyZSBmb290YWdlIG9mIHRoZSBhcGFydG1lbnQuIFRoZSBwcmVkaWN0b3IgYHJlZ2lvbmAgaXMgYSBxdWFsaXRhdGl2ZSB2YXJpYWJsZSB0aGF0IHJlcG9ydHMgaW4gd2hpY2ggb2YgdGhyZWUgZ2VvZ3JhcGhpYyByZWdpb25zIHRoZSBhcGFydG1lbnQgaXMgbG9jYXRlZC4gVGhlIHJlZ2lvbnMgYXJlIGVuY29kZWQgYEFgLCBgQmAsIGFuZCBgQ2AuIFRoaXMgaXMgc3ludGhldGljIGRhdGEuIA0KDQoNCmBgYHtyfQ0KZGYgPC0gcmVhZC50YWJsZSgiYXB0X3ByaWNlcy50eHQiLCBzZXA9Ilx0IiwgaGVhZGVyPVRSVUUpDQpzdW1tYXJ5KGRmKQ0KYGBgDQoNCiMgRXhwbG9yYXRvcnkgQW5hbHlzaXMNCg0KV2Ugd2lsbCBiZWdpbiBvdXIgYW5hbHlzaXMgYnkgcGxvdHRpbmcgYHByaWNlYCBhZ2FpbnN0IGBzcWZ0YC4gDQoNCmBgYHtyfQ0KZ2dwbG90KGRmLCBhZXMoeD1zcWZ0LCB5PXByaWNlLCBjb2w9cmVnaW9uKSkgKyBnZW9tX3BvaW50KCkNCmBgYA0KDQoNCldlIHNlZSB0aGF0IGhldGVyb3NrZWRhc3RpY2l0eSBpcyBsaWtlbHkgYSBjb25jZXJuIHdpdGggdGhpcyBkYXRhc2V0LiBXZSB3aWxsIG5vdyBwbG90IGBsbihwcmljZSlgIGFnYWluc3QgYGxuKHNxZnQpYC4gDQoNCmBgYHtyfQ0KZ2dwbG90KGRmLCBhZXMoeD1sb2coc3FmdCksIHk9bG9nKHByaWNlKSwgY29sPXJlZ2lvbikpICsgZ2VvbV9wb2ludCgpDQpgYGANCg0KVGhlIGxvZy1sb2cgdHJhbnNmb3JtYXRpb24gc2VlbXMgdG8gaGF2ZSBhZGRyZXNzZWQgdGhlIGNvbmNlcm5zIHJlZ2FyZGluZyBoZXRlcm9za2VkYXN0aWNpdHkuIA0KDQpgYGB7cn0NCmdncGxvdChkZiwgYWVzKHg9cmVnaW9uLCB5PXByaWNlLCBmaWxsID0gcmVnaW9uKSkgKyBnZW9tX2JveHBsb3QoKQ0KYGBgDQoNCg0KYGBge3J9DQpnZ3Bsb3QoZGYsIGFlcyh4PXByaWNlLCBmaWxsPXJlZ2lvbikpICsgZ2VvbV9kZW5zaXR5KCkgKyBmYWNldF93cmFwKH5yZWdpb24pDQpgYGANCg0KDQpgYGB7cn0NCmdncGxvdChkZiwgYWVzKHg9c3FmdCwgeT1wcmljZSwgY29sPXJlZ2lvbikpICsgZ2VvbV9wb2ludCgpICsgZmFjZXRfd3JhcCh+cmVnaW9uKQ0KYGBgDQoNCiMgQ3JlYXRlIE1vZGVsDQoNCg0KYGBge3J9DQptb2QgPC0gbG0obG9nKHByaWNlKSB+IGxvZyhzcWZ0KSArIHJlZ2lvbiwgZGYpDQpzdW1tYXJ5KG1vZCkNCmBgYA0KDQokJFxsYXJnZVxsbihcaGF0IHtwcmljZX0pID0gLTAuNzA5ODEgKyAxLjg1MjEyXGNkb3QgXGxuKHNxZnQpICsgMC40NjIxOSBcY2RvdCByZWdpb25CIC0gMC4wNzc0NFxjZG90IHJlZ2lvbkMkJA0KDQojIFJlc2lkdWFsIEFuYWx5c2lzDQoNCmBgYHtyfQ0KcGxvdChtb2QkcmVzaWR1YWxzIH4gbW9kJGZpdHRlZC52YWx1ZXMsIHBjaD0yMSwgY29sPSJibGFjayIsIGJnPSJzYWxtb24iLA0KICAgICB4bGFiPSJGaXR0ZWQgVmFsdWUiLCB5bGFiPSJSZXNpZHVhbHMiLCBtYWluPSJSZXNpZHVhbCBQbG90IikNCmFibGluZShoPTAsIGNvbD0iZGFya3JlZCIsIGx3ZD0yKQ0KYGBgDQoNCg0KYGBge3J9DQpyZXMgPC0gbW9kJHJlc2lkdWFscw0KcGFyKG1mcm93PWMoMSwyKSkNCmhpc3QocmVzLCBjb2w9J29yY2hpZCcsIGJyZWFrcz0xMCkNCnFxbm9ybShyZXMpDQpxcWxpbmUocmVzKQ0KcGFyKG1mcm93PWMoMSwxKSkNCmBgYA0KDQoNCg0KDQoNCg==