Parameter Estimates
Recall that in ordinary least squares regression, we assume that the relationship between our variables
\(X\) and
\(Y\) is determined by a
hypothetical model of the form:
\[Y = \beta_0 + \beta_1 X + e\]
We collect a sample of paired observations \((x_i, y_i)\) which we use to create a fitted model:
\[\hat Y = \hat \beta_0 + \hat \beta_1 X\] \[Y = \hat \beta_0 + \hat \beta_1 X + \hat e\]
The parameter estimates that result in \(SSE = \sum\limits_{i=1}^n \hat e_i^2\) being minimized are given by the following formulas:
\[SXX = \sum\limits_{i=1}^n \left(x_i - \bar x \right)^2\] \[SXY = \sum\limits_{i=1}^n \left(x_i - \bar x \right)\left(y_i - \bar y \right)\]
\[\hat\beta_1= \frac{SXY}{SXX} \hspace{20px} \mathrm{and} \hspace{20px}\hat\beta_0 = \bar y - \hat\beta_1 \bar x\]
We will now walk through an example in which we use R to calculate these parameter estimates. We will begin the example by loading a real-world dataset from a text file.
Pearson Dataset
In this notebook, we will study the relationship between the heights of fathers and their adult sons. For this study, we will use the Pearson data set, collected in England in 1903. This data set consists of 1078 observations, each with two features:
- fheight gives the height of the father, in inches.
- sheight gives the height of the son, in inches.
Loading and Exploring the Data
We will begin by loading the data and generating a summary of it.
In the code chunk below, the function read.table()
is used to read the date file “father_son.txt”. The parameter sep
is set to “\t”, which indicates that the values in the data file are separated by tabs. The parameter header
is set to TRUE
, which tells R that the first row of the data file contains the names of the variables whose values are stored in the columns. The data is stored in a data frame named myData
. We use the head()
function to view the first few rows of this data frame.
myData <- read.table("father_son.txt", sep="\t", header=TRUE)
head(myData, n=8)
We will use the nrow()
function to confirm that there are 1078 observations in the data frame.
nrow(myData)
[1] 1078
The summary()
function can be used to provide information regarding the distribution of each of the variables in the data frame.
summary(myData)
fheight sheight
Min. :59.01 Min. :58.51
1st Qu.:65.79 1st Qu.:66.93
Median :67.77 Median :68.62
Mean :67.69 Mean :68.68
3rd Qu.:69.60 3rd Qu.:70.47
Max. :75.43 Max. :78.36
The distributions of the heights of the fathers appears to be similar to that of the sons. The sons do seem to be slightly taller on average.
We will further explore the distributions of these variables by creating histograms.
par(mfrow=c(1,2))
hist(myData$fheight, ylim=c(0,200), xlim=c(55,80), breaks=20, col = "cyan",
xlab="Father Height in Inches", main="Histogram of\n Father Height")
hist(myData$sheight, ylim=c(0,200), xlim=c(55,80), breaks=20, col = "orange",
xlab="Son Height in Inches", main="Histogram of \n Son Height")
par(mfrow=c(1,1))
The histograms provide evidence that the heights of both populations are approximately normal.
Relationship Between Father and Son Height
We begin our study of the relationship between the heights of fathers and sons by calculating the correlation between the two variables.
cor(myData$fheight, myData$sheight)
[1] 0.5013383
A correlation of 0.5013383 indicates a moderate positive linear relationship.
We will further explore the relationship between the two variables by creating a scatterplot.
plot(myData$sheight ~ myData$fheight, pch=19, col=rgb(0.1, 0.2, 1.0, 0.6),
xlab="Father Height (inches)", ylab="Son Height (inches)",
main="Heights of Fathers and Sons")
This scatterplot does show a clear linear relationship, although it is one with a lot of noise. Given any particular father height, the range of possible son heights covers several inches.
Calulating Parameter Estimates Using lm() Function
Conveniently, R has a built-in function for calculating parameter estimates (as well as many other quantities that are useful when performing regression analysis). The function is denoted lm()
, which stands for “linear model”. We will demonstrate its usage below.
# mod <- lm(myData$sheight ~ myData$fheight)
mod <- lm(sheight ~ fheight, myData)
summary(mod)
Call:
lm(formula = sheight ~ fheight, data = myData)
Residuals:
Min 1Q Median 3Q Max
-8.8772 -1.5144 -0.0079 1.6285 8.9685
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.88660 1.83235 18.49 <2e-16 ***
fheight 0.51409 0.02705 19.01 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.437 on 1076 degrees of freedom
Multiple R-squared: 0.2513, Adjusted R-squared: 0.2506
F-statistic: 361.2 on 1 and 1076 DF, p-value: < 2.2e-16
Note that the estimates for the slope and the intercept in our model are exactly the same as what we calculated above. We can access these objects directly from the mod
variable:
mod$coefficients
(Intercept) fheight
33.886604 0.514093
plot(myData$sheight ~ myData$fheight, pch=19, col=rgb(0.1, 0.2, 1.0, 0.6),
xlab="Father Height (inches)", ylab="Son Height (inches)",
main="Heights of Fathers and Sons")
abline(mod$coefficients, col="red", lwd=2)
LS0tDQp0aXRsZTogIkxlc3NvbiAwNyAtIE9MUyBSZWdyZXNzaW9uIGluIFIiDQphdXRob3I6ICJSb2JiaWUgQmVhbmUiDQpvdXRwdXQ6DQogIGh0bWxfbm90ZWJvb2s6DQogICAgdGhlbWU6IGZsYXRseQ0KICAgIHRvYzogdHJ1ZQ0KICAgIHRvY19kZXB0aDogMg0KLS0tDQoNCiMgUGFyYW1ldGVyIEVzdGltYXRlcw0KDQpSZWNhbGwgdGhhdCBpbiBvcmRpbmFyeSBsZWFzdCBzcXVhcmVzIHJlZ3Jlc3Npb24sIHdlIGFzc3VtZSB0aGF0IHRoZSByZWxhdGlvbnNoaXAgYmV0d2VlbiBvdXIgdmFyaWFibGVzICRYJCBhbmQgJFkkIGlzIGRldGVybWluZWQgYnkgYSAqKmh5cG90aGV0aWNhbCBtb2RlbCoqIG9mIHRoZSBmb3JtOiANCjxjZW50ZXI+DQokJFkgPSBcYmV0YV8wICsgXGJldGFfMSBYICsgZSQkDQo8L2NlbnRlcj4NCg0KV2UgY29sbGVjdCBhIHNhbXBsZSBvZiBwYWlyZWQgb2JzZXJ2YXRpb25zICQoeF9pLCB5X2kpJCB3aGljaCB3ZSB1c2UgdG8gY3JlYXRlIGEgKipmaXR0ZWQgbW9kZWwqKjoNCg0KPGNlbnRlcj4NCiQkXGhhdCBZID0gXGhhdCBcYmV0YV8wICsgXGhhdCBcYmV0YV8xIFgkJA0KJCRZID0gXGhhdCBcYmV0YV8wICsgXGhhdCBcYmV0YV8xIFggKyBcaGF0IGUkJA0KPC9jZW50ZXI+DQoNClRoZSBwYXJhbWV0ZXIgZXN0aW1hdGVzIHRoYXQgcmVzdWx0IGluICRTU0UgPSBcc3VtXGxpbWl0c197aT0xfV5uIFxoYXQgZV9pXjIkIGJlaW5nIG1pbmltaXplZCBhcmUgZ2l2ZW4gYnkgdGhlIGZvbGxvd2luZyBmb3JtdWxhczogDQoNCg0KJCRTWFggPSBcc3VtXGxpbWl0c197aT0xfV5uIFxsZWZ0KHhfaSAtIFxiYXIgeCBccmlnaHQpXjIkJA0KJCRTWFkgPSBcc3VtXGxpbWl0c197aT0xfV5uIFxsZWZ0KHhfaSAtIFxiYXIgeCBccmlnaHQpXGxlZnQoeV9pIC0gXGJhciB5IFxyaWdodCkkJA0KDQokJFxoYXRcYmV0YV8xPSBcZnJhY3tTWFl9e1NYWH0gXGhzcGFjZXsyMHB4fSBcbWF0aHJte2FuZH0gXGhzcGFjZXsyMHB4fVxoYXRcYmV0YV8wID0gXGJhciB5IC0gXGhhdFxiZXRhXzEgXGJhciB4JCQNCg0KV2Ugd2lsbCBub3cgd2FsayB0aHJvdWdoIGFuIGV4YW1wbGUgaW4gd2hpY2ggd2UgdXNlIFIgdG8gY2FsY3VsYXRlIHRoZXNlIHBhcmFtZXRlciBlc3RpbWF0ZXMuIFdlIHdpbGwgYmVnaW4gdGhlIGV4YW1wbGUgYnkgbG9hZGluZyBhIHJlYWwtd29ybGQgZGF0YXNldCBmcm9tIGEgdGV4dCBmaWxlLiANCg0KIyBQZWFyc29uIERhdGFzZXQNCg0KSW4gdGhpcyBub3RlYm9vaywgd2Ugd2lsbCBzdHVkeSB0aGUgcmVsYXRpb25zaGlwIGJldHdlZW4gdGhlIGhlaWdodHMgb2YgZmF0aGVycyBhbmQgdGhlaXIgYWR1bHQgc29ucy4gRm9yIHRoaXMgc3R1ZHksIHdlIHdpbGwgdXNlIHRoZSBbUGVhcnNvbiBkYXRhIHNldF0oaHR0cDovL3d3dy5yYW5kb21zZXJ2aWNlcy5vcmcvcmFuZG9tL2RhdGEvUGVhcnNvbi5odG1sKSwgY29sbGVjdGVkIGluIEVuZ2xhbmQgaW4gMTkwMy4gVGhpcyBkYXRhIHNldCBjb25zaXN0cyBvZiAxMDc4IG9ic2VydmF0aW9ucywgZWFjaCB3aXRoIHR3byBmZWF0dXJlczoNCg0KKiAqKmZoZWlnaHQqKiBnaXZlcyB0aGUgaGVpZ2h0IG9mIHRoZSBmYXRoZXIsIGluIGluY2hlcy4NCiogKipzaGVpZ2h0KiogZ2l2ZXMgdGhlIGhlaWdodCBvZiB0aGUgc29uLCBpbiBpbmNoZXMuDQoNCiMgTG9hZGluZyBhbmQgRXhwbG9yaW5nIHRoZSBEYXRhDQoNCldlIHdpbGwgYmVnaW4gYnkgbG9hZGluZyB0aGUgZGF0YSBhbmQgZ2VuZXJhdGluZyBhIHN1bW1hcnkgb2YgaXQuIA0KDQpJbiB0aGUgY29kZSBjaHVuayBiZWxvdywgdGhlIGZ1bmN0aW9uIGByZWFkLnRhYmxlKClgIGlzIHVzZWQgdG8gcmVhZCB0aGUgZGF0ZSBmaWxlICJmYXRoZXJfc29uLnR4dCIuIFRoZSBwYXJhbWV0ZXIgYHNlcGAgaXMgc2V0IHRvICJcXHQiLCB3aGljaCBpbmRpY2F0ZXMgdGhhdCB0aGUgdmFsdWVzIGluIHRoZSBkYXRhIGZpbGUgYXJlIHNlcGFyYXRlZCBieSB0YWJzLiBUaGUgcGFyYW1ldGVyIGBoZWFkZXJgIGlzIHNldCB0byBgVFJVRWAsIHdoaWNoIHRlbGxzIFIgdGhhdCB0aGUgZmlyc3Qgcm93IG9mIHRoZSBkYXRhIGZpbGUgY29udGFpbnMgdGhlIG5hbWVzIG9mIHRoZSB2YXJpYWJsZXMgd2hvc2UgdmFsdWVzIGFyZSBzdG9yZWQgaW4gdGhlIGNvbHVtbnMuIFRoZSBkYXRhIGlzIHN0b3JlZCBpbiBhIGRhdGEgZnJhbWUgbmFtZWQgYG15RGF0YWAuIFdlIHVzZSB0aGUgYGhlYWQoKWAgZnVuY3Rpb24gdG8gdmlldyB0aGUgZmlyc3QgZmV3IHJvd3Mgb2YgdGhpcyBkYXRhIGZyYW1lLiANCg0KDQpgYGB7cn0NCm15RGF0YSA8LSByZWFkLnRhYmxlKCJmYXRoZXJfc29uLnR4dCIsIHNlcD0iXHQiLCBoZWFkZXI9VFJVRSkNCg0KaGVhZChteURhdGEsIG49OCkNCmBgYA0KDQpXZSB3aWxsIHVzZSB0aGUgYG5yb3coKWAgZnVuY3Rpb24gdG8gY29uZmlybSB0aGF0IHRoZXJlIGFyZSAxMDc4IG9ic2VydmF0aW9ucyBpbiB0aGUgZGF0YSBmcmFtZS4gDQoNCmBgYHtyfQ0KbnJvdyhteURhdGEpDQpgYGANCg0KDQpUaGUgYHN1bW1hcnkoKWAgZnVuY3Rpb24gY2FuIGJlIHVzZWQgdG8gcHJvdmlkZSBpbmZvcm1hdGlvbiByZWdhcmRpbmcgdGhlIGRpc3RyaWJ1dGlvbiBvZiBlYWNoIG9mIHRoZSB2YXJpYWJsZXMgaW4gdGhlIGRhdGEgZnJhbWUuIA0KDQpgYGB7cn0NCnN1bW1hcnkobXlEYXRhKQ0KYGBgDQoNCg0KVGhlIGRpc3RyaWJ1dGlvbnMgb2YgdGhlIGhlaWdodHMgb2YgdGhlIGZhdGhlcnMgYXBwZWFycyB0byBiZSBzaW1pbGFyIHRvIHRoYXQgb2YgdGhlIHNvbnMuIFRoZSBzb25zIGRvIHNlZW0gdG8gYmUgc2xpZ2h0bHkgdGFsbGVyIG9uIGF2ZXJhZ2UuIA0KDQpXZSB3aWxsIGZ1cnRoZXIgZXhwbG9yZSB0aGUgZGlzdHJpYnV0aW9ucyBvZiB0aGVzZSB2YXJpYWJsZXMgYnkgY3JlYXRpbmcgaGlzdG9ncmFtcy4NCg0KYGBge3J9DQpwYXIobWZyb3c9YygxLDIpKQ0KDQpoaXN0KG15RGF0YSRmaGVpZ2h0LCB5bGltPWMoMCwyMDApLCB4bGltPWMoNTUsODApLCBicmVha3M9MjAsIGNvbCA9ICJjeWFuIiwNCiAgICAgeGxhYj0iRmF0aGVyIEhlaWdodCBpbiBJbmNoZXMiLCBtYWluPSJIaXN0b2dyYW0gb2ZcbiBGYXRoZXIgSGVpZ2h0IikNCg0KaGlzdChteURhdGEkc2hlaWdodCwgeWxpbT1jKDAsMjAwKSwgeGxpbT1jKDU1LDgwKSwgYnJlYWtzPTIwLCBjb2wgPSAib3JhbmdlIiwNCiAgICAgeGxhYj0iU29uIEhlaWdodCBpbiBJbmNoZXMiLCBtYWluPSJIaXN0b2dyYW0gb2YgXG4gU29uIEhlaWdodCIpDQoNCnBhcihtZnJvdz1jKDEsMSkpDQpgYGANCg0KVGhlIGhpc3RvZ3JhbXMgcHJvdmlkZSBldmlkZW5jZSB0aGF0IHRoZSBoZWlnaHRzIG9mIGJvdGggcG9wdWxhdGlvbnMgYXJlIGFwcHJveGltYXRlbHkgbm9ybWFsLiANCg0KIyBSZWxhdGlvbnNoaXAgQmV0d2VlbiBGYXRoZXIgYW5kIFNvbiBIZWlnaHQNCg0KV2UgYmVnaW4gb3VyIHN0dWR5IG9mIHRoZSByZWxhdGlvbnNoaXAgYmV0d2VlbiB0aGUgaGVpZ2h0cyBvZiBmYXRoZXJzIGFuZCBzb25zIGJ5IGNhbGN1bGF0aW5nIHRoZSBjb3JyZWxhdGlvbiBiZXR3ZWVuIHRoZSB0d28gdmFyaWFibGVzLiANCg0KIA0KYGBge3J9DQpjb3IobXlEYXRhJGZoZWlnaHQsIG15RGF0YSRzaGVpZ2h0KQ0KYGBgDQoNCkEgY29ycmVsYXRpb24gb2YgMC41MDEzMzgzIGluZGljYXRlcyBhIG1vZGVyYXRlIHBvc2l0aXZlIGxpbmVhciByZWxhdGlvbnNoaXAuIA0KDQpXZSB3aWxsIGZ1cnRoZXIgZXhwbG9yZSB0aGUgcmVsYXRpb25zaGlwIGJldHdlZW4gdGhlIHR3byB2YXJpYWJsZXMgYnkgY3JlYXRpbmcgYSBzY2F0dGVycGxvdC4NCg0KYGBge3J9DQpwbG90KG15RGF0YSRzaGVpZ2h0IH4gbXlEYXRhJGZoZWlnaHQsIHBjaD0xOSwgY29sPXJnYigwLjEsIDAuMiwgMS4wLCAwLjYpLA0KICAgICB4bGFiPSJGYXRoZXIgSGVpZ2h0IChpbmNoZXMpIiwgeWxhYj0iU29uIEhlaWdodCAoaW5jaGVzKSIsDQogICAgIG1haW49IkhlaWdodHMgb2YgRmF0aGVycyBhbmQgU29ucyIpDQpgYGANCg0KDQpUaGlzIHNjYXR0ZXJwbG90IGRvZXMgc2hvdyBhIGNsZWFyIGxpbmVhciByZWxhdGlvbnNoaXAsIGFsdGhvdWdoIGl0IGlzIG9uZSB3aXRoIGEgbG90IG9mIG5vaXNlLiBHaXZlbiBhbnkgcGFydGljdWxhciBmYXRoZXIgaGVpZ2h0LCB0aGUgcmFuZ2Ugb2YgcG9zc2libGUgc29uIGhlaWdodHMgY292ZXJzIHNldmVyYWwgaW5jaGVzLiANCg0KIyBDYWx1bGF0aW5nIFBhcmFtZXRlciBFc3RpbWF0ZXMgVXNpbmcgdGhlIEZvcm11bGFzDQoNCkxldCdzIHVzZSBSIHRvIGNhbGN1bGF0ZSB0aGUgcGFyYW1ldGVyIGVzdGltYXRlcyBmb3IgdGhlIGxlYXN0IHNxdWFyZXMgcmVncmVzc2lvbiBsaW5lcyB1c2luZyB0aGUgZm9ybXVsYXMgbWVudGlvbmVkIGF0IHRoZSBiZWdpbm5pbmcgb2YgdGhpcyBsZXNzb24uIA0KDQpgYGB7cn0NCnggPC0gbXlEYXRhJGZoZWlnaHQNCnkgPC0gbXlEYXRhJHNoZWlnaHQNCg0KeGJhciA8LSBtZWFuKHgpDQp5YmFyIDwtIG1lYW4oeSkNCg0KU1hYIDwtIHN1bSgoeCAtIHhiYXIpXjIpDQpTWFkgPC0gc3VtKCh4IC0geGJhcikqKHkgLSB5YmFyKSkNCg0KYmV0YTEgPC0gU1hZIC8gU1hYDQpiZXRhMCA8LSB5YmFyIC0gYmV0YTEgKiB4YmFyDQoNCnBhcl9lc3QgPC0gYyhiZXRhMCwgYmV0YTEpDQpuYW1lcyhwYXJfZXN0KSA8LSBjKCJiZXRhMCIsICJiZXRhMSIpDQpwYXJfZXN0DQpgYGANCg0KYGBge3J9DQpwbG90KG15RGF0YSRzaGVpZ2h0IH4gbXlEYXRhJGZoZWlnaHQsIHBjaD0xOSwgY29sPXJnYigwLjEsIDAuMiwgMS4wLCAwLjYpLA0KICAgICB4bGFiPSJGYXRoZXIgSGVpZ2h0IChpbmNoZXMpIiwgeWxhYj0iU29uIEhlaWdodCAoaW5jaGVzKSIsDQogICAgIG1haW49IkhlaWdodHMgb2YgRmF0aGVycyBhbmQgU29ucyIpDQoNCmFibGluZShiZXRhMCwgYmV0YTEsIGNvbD0icmVkIiwgbHdkPTIpDQpgYGANCg0KDQojIENhbHVsYXRpbmcgUGFyYW1ldGVyIEVzdGltYXRlcyBVc2luZyBsbSgpIEZ1bmN0aW9uDQoNCkNvbnZlbmllbnRseSwgUiBoYXMgYSBidWlsdC1pbiBmdW5jdGlvbiBmb3IgY2FsY3VsYXRpbmcgcGFyYW1ldGVyIGVzdGltYXRlcyAoYXMgd2VsbCBhcyBtYW55IG90aGVyIHF1YW50aXRpZXMgdGhhdCBhcmUgdXNlZnVsIHdoZW4gcGVyZm9ybWluZyByZWdyZXNzaW9uIGFuYWx5c2lzKS4gVGhlIGZ1bmN0aW9uIGlzIGRlbm90ZWQgYGxtKClgLCB3aGljaCBzdGFuZHMgZm9yICJsaW5lYXIgbW9kZWwiLiBXZSB3aWxsIGRlbW9uc3RyYXRlIGl0cyB1c2FnZSBiZWxvdy4gDQoNCmBgYHtyfQ0KIyBtb2QgPC0gbG0obXlEYXRhJHNoZWlnaHQgfiBteURhdGEkZmhlaWdodCkNCm1vZCA8LSBsbShzaGVpZ2h0IH4gZmhlaWdodCwgbXlEYXRhKQ0Kc3VtbWFyeShtb2QpDQpgYGANCg0KTm90ZSB0aGF0IHRoZSBlc3RpbWF0ZXMgZm9yIHRoZSBzbG9wZSBhbmQgdGhlIGludGVyY2VwdCBpbiBvdXIgbW9kZWwgYXJlIGV4YWN0bHkgdGhlIHNhbWUgYXMgd2hhdCB3ZSBjYWxjdWxhdGVkIGFib3ZlLiBXZSBjYW4gYWNjZXNzIHRoZXNlIG9iamVjdHMgZGlyZWN0bHkgZnJvbSB0aGUgYG1vZGAgdmFyaWFibGU6DQoNCiANCmBgYHtyfQ0KbW9kJGNvZWZmaWNpZW50cw0KYGBgDQoNCmBgYHtyfQ0KcGxvdChteURhdGEkc2hlaWdodCB+IG15RGF0YSRmaGVpZ2h0LCBwY2g9MTksIGNvbD1yZ2IoMC4xLCAwLjIsIDEuMCwgMC42KSwNCiAgICAgeGxhYj0iRmF0aGVyIEhlaWdodCAoaW5jaGVzKSIsIHlsYWI9IlNvbiBIZWlnaHQgKGluY2hlcykiLA0KICAgICBtYWluPSJIZWlnaHRzIG9mIEZhdGhlcnMgYW5kIFNvbnMiKQ0KDQphYmxpbmUobW9kJGNvZWZmaWNpZW50cywgY29sPSJyZWQiLCBsd2Q9MikNCmBgYA==