Parameter Estimates

Recall that in ordinary least squares regression, we assume that the relationship between our variables \(X\) and \(Y\) is determined by a hypothetical model of the form:
\[Y = \beta_0 + \beta_1 X + e\]

We collect a sample of paired observations \((x_i, y_i)\) which we use to create a fitted model:

\[\hat Y = \hat \beta_0 + \hat \beta_1 X\] \[Y = \hat \beta_0 + \hat \beta_1 X + \hat e\]

The parameter estimates that result in \(SSE = \sum\limits_{i=1}^n \hat e_i^2\) being minimized are given by the following formulas:

\[SXX = \sum\limits_{i=1}^n \left(x_i - \bar x \right)^2\] \[SXY = \sum\limits_{i=1}^n \left(x_i - \bar x \right)\left(y_i - \bar y \right)\]

\[\hat\beta_1= \frac{SXY}{SXX} \hspace{20px} \mathrm{and} \hspace{20px}\hat\beta_0 = \bar y - \hat\beta_1 \bar x\]

We will now walk through an example in which we use R to calculate these parameter estimates. We will begin the example by loading a real-world dataset from a text file.

Pearson Dataset

In this notebook, we will study the relationship between the heights of fathers and their adult sons. For this study, we will use the Pearson data set, collected in England in 1903. This data set consists of 1078 observations, each with two features:

Loading and Exploring the Data

We will begin by loading the data and generating a summary of it.

In the code chunk below, the function read.table() is used to read the date file “father_son.txt”. The parameter sep is set to “\t”, which indicates that the values in the data file are separated by tabs. The parameter header is set to TRUE, which tells R that the first row of the data file contains the names of the variables whose values are stored in the columns. The data is stored in a data frame named myData. We use the head() function to view the first few rows of this data frame.

myData <- read.table("father_son.txt", sep="\t", header=TRUE)
head(myData, n=8)

We will use the nrow() function to confirm that there are 1078 observations in the data frame.

nrow(myData)
[1] 1078

The summary() function can be used to provide information regarding the distribution of each of the variables in the data frame.

summary(myData)
    fheight         sheight     
 Min.   :59.01   Min.   :58.51  
 1st Qu.:65.79   1st Qu.:66.93  
 Median :67.77   Median :68.62  
 Mean   :67.69   Mean   :68.68  
 3rd Qu.:69.60   3rd Qu.:70.47  
 Max.   :75.43   Max.   :78.36  

The distributions of the heights of the fathers appears to be similar to that of the sons. The sons do seem to be slightly taller on average.

We will further explore the distributions of these variables by creating histograms.

par(mfrow=c(1,2))
hist(myData$fheight, ylim=c(0,200), xlim=c(55,80), breaks=20, col = "cyan",
     xlab="Father Height in Inches", main="Histogram of\n Father Height")
hist(myData$sheight, ylim=c(0,200), xlim=c(55,80), breaks=20, col = "orange",
     xlab="Son Height in Inches", main="Histogram of \n Son Height")
par(mfrow=c(1,1))

The histograms provide evidence that the heights of both populations are approximately normal.

Relationship Between Father and Son Height

We begin our study of the relationship between the heights of fathers and sons by calculating the correlation between the two variables.

cor(myData$fheight, myData$sheight)
[1] 0.5013383

A correlation of 0.5013383 indicates a moderate positive linear relationship.

We will further explore the relationship between the two variables by creating a scatterplot.

plot(myData$sheight ~ myData$fheight, pch=19, col=rgb(0.1, 0.2, 1.0, 0.6),
     xlab="Father Height (inches)", ylab="Son Height (inches)",
     main="Heights of Fathers and Sons")

This scatterplot does show a clear linear relationship, although it is one with a lot of noise. Given any particular father height, the range of possible son heights covers several inches.

Calulating Parameter Estimates Using the Formulas

Let’s use R to calculate the parameter estimates for the least squares regression lines using the formulas mentioned at the beginning of this lesson.

x <- myData$fheight
y <- myData$sheight
xbar <- mean(x)
ybar <- mean(y)
SXX <- sum((x - xbar)^2)
SXY <- sum((x - xbar)*(y - ybar))
beta1 <- SXY / SXX
beta0 <- ybar - beta1 * xbar
par_est <- c(beta0, beta1)
names(par_est) <- c("beta0", "beta1")
par_est
    beta0     beta1 
33.886604  0.514093 
plot(myData$sheight ~ myData$fheight, pch=19, col=rgb(0.1, 0.2, 1.0, 0.6),
     xlab="Father Height (inches)", ylab="Son Height (inches)",
     main="Heights of Fathers and Sons")
abline(beta0, beta1, col="red", lwd=2)

Calulating Parameter Estimates Using lm() Function

Conveniently, R has a built-in function for calculating parameter estimates (as well as many other quantities that are useful when performing regression analysis). The function is denoted lm(), which stands for “linear model”. We will demonstrate its usage below.

# mod <- lm(myData$sheight ~ myData$fheight)
mod <- lm(sheight ~ fheight, myData)
summary(mod)

Call:
lm(formula = sheight ~ fheight, data = myData)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8772 -1.5144 -0.0079  1.6285  8.9685 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 33.88660    1.83235   18.49   <2e-16 ***
fheight      0.51409    0.02705   19.01   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.437 on 1076 degrees of freedom
Multiple R-squared:  0.2513,    Adjusted R-squared:  0.2506 
F-statistic: 361.2 on 1 and 1076 DF,  p-value: < 2.2e-16

Note that the estimates for the slope and the intercept in our model are exactly the same as what we calculated above. We can access these objects directly from the mod variable:

mod$coefficients
(Intercept)     fheight 
  33.886604    0.514093 
plot(myData$sheight ~ myData$fheight, pch=19, col=rgb(0.1, 0.2, 1.0, 0.6),
     xlab="Father Height (inches)", ylab="Son Height (inches)",
     main="Heights of Fathers and Sons")
abline(mod$coefficients, col="red", lwd=2)

LS0tDQp0aXRsZTogIkxlc3NvbiAwNyAtIE9MUyBSZWdyZXNzaW9uIGluIFIiDQphdXRob3I6ICJSb2JiaWUgQmVhbmUiDQpvdXRwdXQ6DQogIGh0bWxfbm90ZWJvb2s6DQogICAgdGhlbWU6IGZsYXRseQ0KICAgIHRvYzogdHJ1ZQ0KICAgIHRvY19kZXB0aDogMg0KLS0tDQoNCiMgUGFyYW1ldGVyIEVzdGltYXRlcw0KDQpSZWNhbGwgdGhhdCBpbiBvcmRpbmFyeSBsZWFzdCBzcXVhcmVzIHJlZ3Jlc3Npb24sIHdlIGFzc3VtZSB0aGF0IHRoZSByZWxhdGlvbnNoaXAgYmV0d2VlbiBvdXIgdmFyaWFibGVzICRYJCBhbmQgJFkkIGlzIGRldGVybWluZWQgYnkgYSAqKmh5cG90aGV0aWNhbCBtb2RlbCoqIG9mIHRoZSBmb3JtOiANCjxjZW50ZXI+DQokJFkgPSBcYmV0YV8wICsgXGJldGFfMSBYICsgZSQkDQo8L2NlbnRlcj4NCg0KV2UgY29sbGVjdCBhIHNhbXBsZSBvZiBwYWlyZWQgb2JzZXJ2YXRpb25zICQoeF9pLCB5X2kpJCB3aGljaCB3ZSB1c2UgdG8gY3JlYXRlIGEgKipmaXR0ZWQgbW9kZWwqKjoNCg0KPGNlbnRlcj4NCiQkXGhhdCBZID0gXGhhdCBcYmV0YV8wICsgXGhhdCBcYmV0YV8xIFgkJA0KJCRZID0gXGhhdCBcYmV0YV8wICsgXGhhdCBcYmV0YV8xIFggKyBcaGF0IGUkJA0KPC9jZW50ZXI+DQoNClRoZSBwYXJhbWV0ZXIgZXN0aW1hdGVzIHRoYXQgcmVzdWx0IGluICRTU0UgPSBcc3VtXGxpbWl0c197aT0xfV5uIFxoYXQgZV9pXjIkIGJlaW5nIG1pbmltaXplZCBhcmUgZ2l2ZW4gYnkgdGhlIGZvbGxvd2luZyBmb3JtdWxhczogDQoNCg0KJCRTWFggPSBcc3VtXGxpbWl0c197aT0xfV5uIFxsZWZ0KHhfaSAtIFxiYXIgeCBccmlnaHQpXjIkJA0KJCRTWFkgPSBcc3VtXGxpbWl0c197aT0xfV5uIFxsZWZ0KHhfaSAtIFxiYXIgeCBccmlnaHQpXGxlZnQoeV9pIC0gXGJhciB5IFxyaWdodCkkJA0KDQokJFxoYXRcYmV0YV8xPSBcZnJhY3tTWFl9e1NYWH0gXGhzcGFjZXsyMHB4fSBcbWF0aHJte2FuZH0gXGhzcGFjZXsyMHB4fVxoYXRcYmV0YV8wID0gXGJhciB5IC0gXGhhdFxiZXRhXzEgXGJhciB4JCQNCg0KV2Ugd2lsbCBub3cgd2FsayB0aHJvdWdoIGFuIGV4YW1wbGUgaW4gd2hpY2ggd2UgdXNlIFIgdG8gY2FsY3VsYXRlIHRoZXNlIHBhcmFtZXRlciBlc3RpbWF0ZXMuIFdlIHdpbGwgYmVnaW4gdGhlIGV4YW1wbGUgYnkgbG9hZGluZyBhIHJlYWwtd29ybGQgZGF0YXNldCBmcm9tIGEgdGV4dCBmaWxlLiANCg0KIyBQZWFyc29uIERhdGFzZXQNCg0KSW4gdGhpcyBub3RlYm9vaywgd2Ugd2lsbCBzdHVkeSB0aGUgcmVsYXRpb25zaGlwIGJldHdlZW4gdGhlIGhlaWdodHMgb2YgZmF0aGVycyBhbmQgdGhlaXIgYWR1bHQgc29ucy4gRm9yIHRoaXMgc3R1ZHksIHdlIHdpbGwgdXNlIHRoZSBbUGVhcnNvbiBkYXRhIHNldF0oaHR0cDovL3d3dy5yYW5kb21zZXJ2aWNlcy5vcmcvcmFuZG9tL2RhdGEvUGVhcnNvbi5odG1sKSwgY29sbGVjdGVkIGluIEVuZ2xhbmQgaW4gMTkwMy4gVGhpcyBkYXRhIHNldCBjb25zaXN0cyBvZiAxMDc4IG9ic2VydmF0aW9ucywgZWFjaCB3aXRoIHR3byBmZWF0dXJlczoNCg0KKiAqKmZoZWlnaHQqKiBnaXZlcyB0aGUgaGVpZ2h0IG9mIHRoZSBmYXRoZXIsIGluIGluY2hlcy4NCiogKipzaGVpZ2h0KiogZ2l2ZXMgdGhlIGhlaWdodCBvZiB0aGUgc29uLCBpbiBpbmNoZXMuDQoNCiMgTG9hZGluZyBhbmQgRXhwbG9yaW5nIHRoZSBEYXRhDQoNCldlIHdpbGwgYmVnaW4gYnkgbG9hZGluZyB0aGUgZGF0YSBhbmQgZ2VuZXJhdGluZyBhIHN1bW1hcnkgb2YgaXQuIA0KDQpJbiB0aGUgY29kZSBjaHVuayBiZWxvdywgdGhlIGZ1bmN0aW9uIGByZWFkLnRhYmxlKClgIGlzIHVzZWQgdG8gcmVhZCB0aGUgZGF0ZSBmaWxlICJmYXRoZXJfc29uLnR4dCIuIFRoZSBwYXJhbWV0ZXIgYHNlcGAgaXMgc2V0IHRvICJcXHQiLCB3aGljaCBpbmRpY2F0ZXMgdGhhdCB0aGUgdmFsdWVzIGluIHRoZSBkYXRhIGZpbGUgYXJlIHNlcGFyYXRlZCBieSB0YWJzLiBUaGUgcGFyYW1ldGVyIGBoZWFkZXJgIGlzIHNldCB0byBgVFJVRWAsIHdoaWNoIHRlbGxzIFIgdGhhdCB0aGUgZmlyc3Qgcm93IG9mIHRoZSBkYXRhIGZpbGUgY29udGFpbnMgdGhlIG5hbWVzIG9mIHRoZSB2YXJpYWJsZXMgd2hvc2UgdmFsdWVzIGFyZSBzdG9yZWQgaW4gdGhlIGNvbHVtbnMuIFRoZSBkYXRhIGlzIHN0b3JlZCBpbiBhIGRhdGEgZnJhbWUgbmFtZWQgYG15RGF0YWAuIFdlIHVzZSB0aGUgYGhlYWQoKWAgZnVuY3Rpb24gdG8gdmlldyB0aGUgZmlyc3QgZmV3IHJvd3Mgb2YgdGhpcyBkYXRhIGZyYW1lLiANCg0KDQpgYGB7cn0NCm15RGF0YSA8LSByZWFkLnRhYmxlKCJmYXRoZXJfc29uLnR4dCIsIHNlcD0iXHQiLCBoZWFkZXI9VFJVRSkNCg0KaGVhZChteURhdGEsIG49OCkNCmBgYA0KDQpXZSB3aWxsIHVzZSB0aGUgYG5yb3coKWAgZnVuY3Rpb24gdG8gY29uZmlybSB0aGF0IHRoZXJlIGFyZSAxMDc4IG9ic2VydmF0aW9ucyBpbiB0aGUgZGF0YSBmcmFtZS4gDQoNCmBgYHtyfQ0KbnJvdyhteURhdGEpDQpgYGANCg0KDQpUaGUgYHN1bW1hcnkoKWAgZnVuY3Rpb24gY2FuIGJlIHVzZWQgdG8gcHJvdmlkZSBpbmZvcm1hdGlvbiByZWdhcmRpbmcgdGhlIGRpc3RyaWJ1dGlvbiBvZiBlYWNoIG9mIHRoZSB2YXJpYWJsZXMgaW4gdGhlIGRhdGEgZnJhbWUuIA0KDQpgYGB7cn0NCnN1bW1hcnkobXlEYXRhKQ0KYGBgDQoNCg0KVGhlIGRpc3RyaWJ1dGlvbnMgb2YgdGhlIGhlaWdodHMgb2YgdGhlIGZhdGhlcnMgYXBwZWFycyB0byBiZSBzaW1pbGFyIHRvIHRoYXQgb2YgdGhlIHNvbnMuIFRoZSBzb25zIGRvIHNlZW0gdG8gYmUgc2xpZ2h0bHkgdGFsbGVyIG9uIGF2ZXJhZ2UuIA0KDQpXZSB3aWxsIGZ1cnRoZXIgZXhwbG9yZSB0aGUgZGlzdHJpYnV0aW9ucyBvZiB0aGVzZSB2YXJpYWJsZXMgYnkgY3JlYXRpbmcgaGlzdG9ncmFtcy4NCg0KYGBge3J9DQpwYXIobWZyb3c9YygxLDIpKQ0KDQpoaXN0KG15RGF0YSRmaGVpZ2h0LCB5bGltPWMoMCwyMDApLCB4bGltPWMoNTUsODApLCBicmVha3M9MjAsIGNvbCA9ICJjeWFuIiwNCiAgICAgeGxhYj0iRmF0aGVyIEhlaWdodCBpbiBJbmNoZXMiLCBtYWluPSJIaXN0b2dyYW0gb2ZcbiBGYXRoZXIgSGVpZ2h0IikNCg0KaGlzdChteURhdGEkc2hlaWdodCwgeWxpbT1jKDAsMjAwKSwgeGxpbT1jKDU1LDgwKSwgYnJlYWtzPTIwLCBjb2wgPSAib3JhbmdlIiwNCiAgICAgeGxhYj0iU29uIEhlaWdodCBpbiBJbmNoZXMiLCBtYWluPSJIaXN0b2dyYW0gb2YgXG4gU29uIEhlaWdodCIpDQoNCnBhcihtZnJvdz1jKDEsMSkpDQpgYGANCg0KVGhlIGhpc3RvZ3JhbXMgcHJvdmlkZSBldmlkZW5jZSB0aGF0IHRoZSBoZWlnaHRzIG9mIGJvdGggcG9wdWxhdGlvbnMgYXJlIGFwcHJveGltYXRlbHkgbm9ybWFsLiANCg0KIyBSZWxhdGlvbnNoaXAgQmV0d2VlbiBGYXRoZXIgYW5kIFNvbiBIZWlnaHQNCg0KV2UgYmVnaW4gb3VyIHN0dWR5IG9mIHRoZSByZWxhdGlvbnNoaXAgYmV0d2VlbiB0aGUgaGVpZ2h0cyBvZiBmYXRoZXJzIGFuZCBzb25zIGJ5IGNhbGN1bGF0aW5nIHRoZSBjb3JyZWxhdGlvbiBiZXR3ZWVuIHRoZSB0d28gdmFyaWFibGVzLiANCg0KIA0KYGBge3J9DQpjb3IobXlEYXRhJGZoZWlnaHQsIG15RGF0YSRzaGVpZ2h0KQ0KYGBgDQoNCkEgY29ycmVsYXRpb24gb2YgMC41MDEzMzgzIGluZGljYXRlcyBhIG1vZGVyYXRlIHBvc2l0aXZlIGxpbmVhciByZWxhdGlvbnNoaXAuIA0KDQpXZSB3aWxsIGZ1cnRoZXIgZXhwbG9yZSB0aGUgcmVsYXRpb25zaGlwIGJldHdlZW4gdGhlIHR3byB2YXJpYWJsZXMgYnkgY3JlYXRpbmcgYSBzY2F0dGVycGxvdC4NCg0KYGBge3J9DQpwbG90KG15RGF0YSRzaGVpZ2h0IH4gbXlEYXRhJGZoZWlnaHQsIHBjaD0xOSwgY29sPXJnYigwLjEsIDAuMiwgMS4wLCAwLjYpLA0KICAgICB4bGFiPSJGYXRoZXIgSGVpZ2h0IChpbmNoZXMpIiwgeWxhYj0iU29uIEhlaWdodCAoaW5jaGVzKSIsDQogICAgIG1haW49IkhlaWdodHMgb2YgRmF0aGVycyBhbmQgU29ucyIpDQpgYGANCg0KDQpUaGlzIHNjYXR0ZXJwbG90IGRvZXMgc2hvdyBhIGNsZWFyIGxpbmVhciByZWxhdGlvbnNoaXAsIGFsdGhvdWdoIGl0IGlzIG9uZSB3aXRoIGEgbG90IG9mIG5vaXNlLiBHaXZlbiBhbnkgcGFydGljdWxhciBmYXRoZXIgaGVpZ2h0LCB0aGUgcmFuZ2Ugb2YgcG9zc2libGUgc29uIGhlaWdodHMgY292ZXJzIHNldmVyYWwgaW5jaGVzLiANCg0KIyBDYWx1bGF0aW5nIFBhcmFtZXRlciBFc3RpbWF0ZXMgVXNpbmcgdGhlIEZvcm11bGFzDQoNCkxldCdzIHVzZSBSIHRvIGNhbGN1bGF0ZSB0aGUgcGFyYW1ldGVyIGVzdGltYXRlcyBmb3IgdGhlIGxlYXN0IHNxdWFyZXMgcmVncmVzc2lvbiBsaW5lcyB1c2luZyB0aGUgZm9ybXVsYXMgbWVudGlvbmVkIGF0IHRoZSBiZWdpbm5pbmcgb2YgdGhpcyBsZXNzb24uIA0KDQpgYGB7cn0NCnggPC0gbXlEYXRhJGZoZWlnaHQNCnkgPC0gbXlEYXRhJHNoZWlnaHQNCg0KeGJhciA8LSBtZWFuKHgpDQp5YmFyIDwtIG1lYW4oeSkNCg0KU1hYIDwtIHN1bSgoeCAtIHhiYXIpXjIpDQpTWFkgPC0gc3VtKCh4IC0geGJhcikqKHkgLSB5YmFyKSkNCg0KYmV0YTEgPC0gU1hZIC8gU1hYDQpiZXRhMCA8LSB5YmFyIC0gYmV0YTEgKiB4YmFyDQoNCnBhcl9lc3QgPC0gYyhiZXRhMCwgYmV0YTEpDQpuYW1lcyhwYXJfZXN0KSA8LSBjKCJiZXRhMCIsICJiZXRhMSIpDQpwYXJfZXN0DQpgYGANCg0KYGBge3J9DQpwbG90KG15RGF0YSRzaGVpZ2h0IH4gbXlEYXRhJGZoZWlnaHQsIHBjaD0xOSwgY29sPXJnYigwLjEsIDAuMiwgMS4wLCAwLjYpLA0KICAgICB4bGFiPSJGYXRoZXIgSGVpZ2h0IChpbmNoZXMpIiwgeWxhYj0iU29uIEhlaWdodCAoaW5jaGVzKSIsDQogICAgIG1haW49IkhlaWdodHMgb2YgRmF0aGVycyBhbmQgU29ucyIpDQoNCmFibGluZShiZXRhMCwgYmV0YTEsIGNvbD0icmVkIiwgbHdkPTIpDQpgYGANCg0KDQojIENhbHVsYXRpbmcgUGFyYW1ldGVyIEVzdGltYXRlcyBVc2luZyBsbSgpIEZ1bmN0aW9uDQoNCkNvbnZlbmllbnRseSwgUiBoYXMgYSBidWlsdC1pbiBmdW5jdGlvbiBmb3IgY2FsY3VsYXRpbmcgcGFyYW1ldGVyIGVzdGltYXRlcyAoYXMgd2VsbCBhcyBtYW55IG90aGVyIHF1YW50aXRpZXMgdGhhdCBhcmUgdXNlZnVsIHdoZW4gcGVyZm9ybWluZyByZWdyZXNzaW9uIGFuYWx5c2lzKS4gVGhlIGZ1bmN0aW9uIGlzIGRlbm90ZWQgYGxtKClgLCB3aGljaCBzdGFuZHMgZm9yICJsaW5lYXIgbW9kZWwiLiBXZSB3aWxsIGRlbW9uc3RyYXRlIGl0cyB1c2FnZSBiZWxvdy4gDQoNCmBgYHtyfQ0KIyBtb2QgPC0gbG0obXlEYXRhJHNoZWlnaHQgfiBteURhdGEkZmhlaWdodCkNCm1vZCA8LSBsbShzaGVpZ2h0IH4gZmhlaWdodCwgbXlEYXRhKQ0Kc3VtbWFyeShtb2QpDQpgYGANCg0KTm90ZSB0aGF0IHRoZSBlc3RpbWF0ZXMgZm9yIHRoZSBzbG9wZSBhbmQgdGhlIGludGVyY2VwdCBpbiBvdXIgbW9kZWwgYXJlIGV4YWN0bHkgdGhlIHNhbWUgYXMgd2hhdCB3ZSBjYWxjdWxhdGVkIGFib3ZlLiBXZSBjYW4gYWNjZXNzIHRoZXNlIG9iamVjdHMgZGlyZWN0bHkgZnJvbSB0aGUgYG1vZGAgdmFyaWFibGU6DQoNCiANCmBgYHtyfQ0KbW9kJGNvZWZmaWNpZW50cw0KYGBgDQoNCmBgYHtyfQ0KcGxvdChteURhdGEkc2hlaWdodCB+IG15RGF0YSRmaGVpZ2h0LCBwY2g9MTksIGNvbD1yZ2IoMC4xLCAwLjIsIDEuMCwgMC42KSwNCiAgICAgeGxhYj0iRmF0aGVyIEhlaWdodCAoaW5jaGVzKSIsIHlsYWI9IlNvbiBIZWlnaHQgKGluY2hlcykiLA0KICAgICBtYWluPSJIZWlnaHRzIG9mIEZhdGhlcnMgYW5kIFNvbnMiKQ0KDQphYmxpbmUobW9kJGNvZWZmaWNpZW50cywgY29sPSJyZWQiLCBsd2Q9MikNCmBgYA==