Linear regression

Applied Statistics for Life Sciences

Least squares estimation

Meadowfoam flowering

Mean flowering depends on light intensity.

  • the “response” variable is flowers per plant
  • the “explanatory” variable is light intensity

In an ANOVA model, the explanatory variable is treated as categorical.

  • one mean estimated per unique value of the explanatory variable
  • values fixed in advance by researcher for designed experiments

But what if we had observational data instead?

Could we estimate mean flowering as a continuous function of intensity?

PREVEND data

Ruff Figural Fluency Test (RFFT) is a cognitive assessment.

  • measures nonverbal fluency and executive cognitive function
  • scale: 0 (worst) to 175 (best)
casenr age rfft
126 37 136
33 36 80
145 37 102
146 37 85

Question of interest:

How much does cognitive ability as measured by RFFT decline with age on average?

PREVEND data

Ruff Figural Fluency Test (RFFT) is a cognitive assessment.

  • measures nonverbal fluency and executive cognitive function
  • scale: 0 (worst) to 175 (best)
casenr age rfft
126 37 136
33 36 80
145 37 102
146 37 85

A straight line seems to describe the RFFT-age relationship well enough.

This suggests a model:

  • mean RFFT is a linear function of age
  • remaining variation in RFFT is random

Lines

The equation of a line in slope-intercept form is:

\[ y = ax + b \]

\(a\) is the slope (rise over run) and \(b\) is the intercept:

  • the line crosses the \(y\) axis at \(b\)
  • \(y\) changes by \(a\) per unit increment in \(x\)

There is exactly one line through any two points.

Correlation

Correlation is a signed measure of strength of linear relationship.

\[ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{s_x \times s_y} \]

When data points are far from the mean at the same time in the same direction, the magnitude will be larger.

  • sign indicates direction of relationship
  • magnitude indicates strength (always between -1 and 1)

Common mistake: strength \(\neq\) slope.

Fitting lines to data

Correlation between age and RFFT:

cor(age, rfft)
[1] -0.635854
  • negative relationship (sign)
  • moderate strength

But how do we find a line that describes the trend?

Fitting lines to data

Most of these are bad. Some are worse than others. How might one measure this?

Fitting lines to data

Hint: consider the residuals – distances from the line to each point.

Measuring quality of fit

Residuals are the distances to each point: \[ \textcolor{red}{e_i} = y_i - \textcolor{blue}{\hat{y}_i} \]

Quality of fit can be measured by:

  • [bias] average error \(-\frac{1}{n}\sum_i \textcolor{red}{e_i}\)
  • [SSE] total (squared) error \(\sum_i \textcolor{red}{e_i}^2\)

Measuring quality of fit

Now consider what the bias and SSE (total squared error) capture.

Least squares line

The line with no bias and minimal total error is called the least squares line:

\[ \text{RFFT} = 134.098 - 1.191 \times \text{age} \]

With each year of age, RFFT decreases by 1.191 points on average.

The least squares line has an analytic solution:

\[ \begin{align} \text{slope}: \quad-1.191 &= \text{cor}(\text{age}, \text{RFFT})\times\frac{SD(\text{RFFT})}{SD(\text{age})} \\ \text{intercept}: \quad134.098 &= \text{mean}(\text{RFFT}) - (-1.191)\times\text{mean}(\text{age}) \end{align} \]

The SLR model

The simple linear regression model is:

\[ Y = \textcolor{blue}{\underbrace{\beta_0 + \beta_1 x}_\text{mean}} + \textcolor{red}{\underbrace{\epsilon}_\text{error}} \]

  • continuous response \(Y\)
  • explanatory variable \(x\)
  • regression coefficients \(\beta_0, \beta_1\)
  • model error \(\epsilon\)

The values that minimize error subject to the model being unbiased are:

\[\begin{align*} \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} &\quad(\text{unbiased}) \\ \hat{\beta}_1 &= \frac{s_y}{s_x}\times r &\quad(\text{minimizes SSE}) \end{align*}\]

These are called the least squares estimates.

Least squares estimates in R

According to the model, a one-unit increment in \(x\) corresponds to a \(\beta_1\)-unit change in mean \(Y\):

# fit model
fit <- lm(formula = rfft ~ age, data = prevend)
fit

Call:
lm(formula = rfft ~ age, data = prevend)

Coefficients:
(Intercept)          age  
    134.098       -1.191  

With each additional year of age, mean RFFT score decreases by an estimated 1.191 points.

  • formula = <RESPONSE> ~ <EXPLANATORY> specifies the model
  • data = <DATAFRAME> specifies the observations

Error variability and model fit

The residual standard deviation provides an estimate of error variability:

\[\textcolor{\red}{\hat{\sigma}} = \sqrt{\frac{1}{n - 2} \sum_i e_i^2} \qquad\text{(estimated error variability)}\]

The proportion of variability explained by the model is: \[ R^2 = 1 - \frac{(n - 2)\textcolor{red}{\hat{\sigma}^2}}{(n - 1)\textcolor{darkgrey}{s_y^2}} \quad\left(1 - \frac{\text{error variability}}{\text{total variability}}\right) \]

1 - (n - 2)*sigma(fit)^2/((n - 1)*var(rfft))
[1] 0.4043103

Age explains 40.43% of variability in RFFT.

Model specification

Is the relationship actually linear?

Two ways to check:

  1. Inspect observed data for nonlinear trend
  2. Inspect model residuals for “patterns”

Local smoothing is shown in blue.

The linear model is a fine approximation here – the curvature is very minor – but let’s consider an alternative model specification as a thought exercise.

An alternative model

\[ \log(\text{RFFT}) = \beta_0 + \beta_1\times\text{age} + \epsilon \]

The model now implies that the mean RFFT score is a nonlinear function of age:

\[ \text{RFFT} \propto e^{\beta_1\text{age}} \]

And we can still fit it using least squares:

lm(log(rfft) ~ age, data = prevend)

Call:
lm(formula = log(rfft) ~ age, data = prevend)

Coefficients:
(Intercept)          age  
    5.21739     -0.01951  

Linear models can be used to capture more than just linear relationships!

Kleiber’s law

Kleiber’s law refers to the relationship between metabolic rate and body mass.

We can estimate it via the SLR model: \[ \log(\text{metabolism}) = \beta_0 + \beta_1 \log(\text{mass}) + \epsilon \]

Fitted model: \[ \log(\text{metabolism}) = 5.64 + 0.74 \times \log(\text{mass}) \]

fit <- lm(log.metab ~ log.mass, data = kleiber)
fit

Call:
lm(formula = log.metab ~ log.mass, data = kleiber)

Coefficients:
(Intercept)     log.mass  
     5.6383       0.7387  

Kleiber’s law: model interpretation

Exponentiating both sides of the fitted SLR model equation:

\[ \underbrace{\text{metabolism}}_{e^{\log(\text{metabolism})}} = \underbrace{280.99}_{e^{5.64}} \times \underbrace{\text{mass}^{0.74}}_{e^{0.74 \log(\text{mass})}} \]

So we’ve really estimated what’s known as a power law relationship: \(y = ax^b\).

  • multiplicative, not additive, relationship
  • doubling \(x\) corresponds to changing \(y\) by a factor of \(2^b\)

The estimate and interval for \(\beta_1\) in the SLR model can be transformed appropriately for a more direct interpretation:

Every doubling of body mass is associated with an estimated 66.87% increase in median metabolism.

Inference in regression

Review

How much does RFFT decline with age?

Simple linear regression (SLR) model: \[ \text{RFFT} = \beta_0 + \beta_1\text{age} + \epsilon \]

# least squares estimates
fit <- lm(formula = rfft ~ age, data = prevend)
fit

Call:
lm(formula = rfft ~ age, data = prevend)

Coefficients:
(Intercept)          age  
    134.098       -1.191  

Interpretation:

With each additional year of age, mean RFFT score decreases by an estimated 1.191 points.

Review

The residual standard deviation is an estimate of the unexplained variation in RFFT.

\[ \textcolor{\red}{\hat{\sigma}} = \sqrt{\frac{1}{n - 2} \sum_i e_i^2} \]

# compute residual sd
sigma(fit)
[1] 20.51788

More unexplained variation entails more sampling variability in the model fit.

Standard errors for the coefficients

Standard errors for the coefficients are:

\[SE\left(\hat{\beta}_0\right) = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{(n - 1)s_x^2}} \qquad\text{and}\qquad SE\left(\hat{\beta}_1\right) = \hat{\sigma}\sqrt{\frac{1}{(n - 1)s_x^2}}\]

While you won’t need to know these formulae, do notice that:

  • more data \(\longrightarrow\) less sampling variability
  • more spread in \(x\) \(\longrightarrow\) less sampling variability
  • more spread in \(y\) about \(x\) \(\longrightarrow\) more sampling variability

Inference for the slope parameter

If the errors are symmetric and unimodal, then the sampling distribution of \[ T = \frac{\hat{\beta}_1 - \beta_1}{SE(\beta_1)} \] is well-approximated by a \(t_{n - 2}\) model.

  1. Significance test: \(\begin{cases} H_0: \beta_1 = 0 \\ H_A: \beta_1 \neq 0 \end{cases}\)

  2. Confidence interval: \(\hat{\beta}_1 \pm c\times SE\left(\hat{\beta}_1\right)\)

  • \(P(T > |T_\text{obs}|) \approx 0\): evidence of an association (true slope is not zero)

  • confidence interval using \(t_{206}\) critical value: (-1.389, -0.992)

Inference for the intercept is analogous, but not very common.

Model summary

The model summary shows most quantities of interest, except CIs.

# model summary
summary(fit)

Call:
lm(formula = rfft ~ age, data = prevend)

Residuals:
    Min      1Q  Median      3Q     Max 
-56.085 -14.690  -2.937  12.744  77.975 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 134.0981     6.0701   22.09   <2e-16 ***
age          -1.1908     0.1007  -11.82   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.52 on 206 degrees of freedom
Multiple R-squared:  0.4043,    Adjusted R-squared:  0.4014 
F-statistic: 139.8 on 1 and 206 DF,  p-value: < 2.2e-16
  • \(\hat{\beta}_0 = 134.10, \hat{\beta}_1 = -1.19, \hat{\sigma} = 20.52\)

  • Age explains an estimated 40.43% of variation in RFFT.

  • With each year of age mean RFFT declines by an estimated 1.19 points (SE 0.10).

  • There is a significant association between age and mean RFFT score (T = -11.82 on 206 degrees of freedom, p < 0.0001).

Take a moment to locate the quantities that support the conclusions listed at right.

Confidence intervals

Confidence interval:

# ci for model parameters
confint(fit, level = 0.95)
                 2.5 %      97.5 %
(Intercept) 122.130647 146.0654574
age          -1.389341  -0.9922471

With 95% confidence, each additional year of age is associated with a decrease in mean RFFT score of between 0.99 and 1.39 points.

Since the intercept is not meaningful in this context, we don’t interpret that interval.

Prediction in SLR

Prediction is a matter of plugging in values to the estimated model equation.

For example, the predicted RFFT score for a 55-year old is:

\[ \hat{y} = 134.0981 - 1.1908\times55 = 68.604 \]

# model prediction for a 55 year old
predict(fit, newdata = data.frame(age = 55))
       1 
68.60439 

Intervals for the mean

There are two possible ways to interpret model predictions:

  1. The estimated mean RFFT score for 55-year-olds is 68.6
  2. The predicted value of RFFT for a specific 55-year-old individual is 68.6
# model prediction for a 55 year old
predict(fit, newdata = data.frame(age = 55), 
        interval = 'confidence', 
        level = 0.95)
       fit     lwr      upr
1 68.60439 65.7101 71.49868

With 95% confidence, the mean RFFT score among 55-year-olds is estimated to be between 65.71 and 71.50 points.

Intervals for predicted values

There are two possible ways to interpret model predictions:

  1. The estimated mean RFFT score for 55-year-olds is 68.6
  2. The predicted value of RFFT for a specific 55-year-old individual is 68.6
# model prediction for a 55 year old
predict(fit, newdata = data.frame(age = 55), 
        interval = 'prediction', 
        level = 0.95)
       fit      lwr      upr
1 68.60439 28.04903 109.1598

With 95% confidence, the RFFT score for an individual 55 year old is estimated to be between 28.05 and 109.16 points.

Uncertainty bands

Pointwise intervals shown along the line provide a visual of the model uncertainty.

Why the difference? Individual observations are more variable than averages.

Kleiber’s law

fit <- lm(log.metab ~ log.mass, data = kleiber)
summary(fit)

Call:
lm(formula = log.metab ~ log.mass, data = kleiber)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.14216 -0.26466 -0.04889  0.25308  1.37616 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.63833    0.04709  119.73   <2e-16 ***
log.mass     0.73874    0.01462   50.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4572 on 93 degrees of freedom
Multiple R-squared:  0.9649,    Adjusted R-squared:  0.9645 
F-statistic:  2553 on 1 and 93 DF,  p-value: < 2.2e-16

  • \(\hat{\beta}_0 = 5.638, \hat{\beta}_1 = 0.739, \hat{\sigma} = 0.457\)

There is a significant association between body mass and metabolism (p < 0.0001): body mass explains 96.49% of variation in metabolism; with 95% confidence, a unit increment in log mass is associated with an estimated increase in mean log metabolism between 0.7097 and 0.7678.

Kleiber’s law

How much energy do we consume on a daily basis?

Conversions:

  • 110lb \(\approx\) 50kg
  • 180lb \(\approx\) 80kg
  • 1 kJ \(\approx\) 0.239 kcal
# range of body mass
mass <- seq(50, 80)

# interval for mean log energy
pred <- predict(fit, 
                newdata = data.frame(log.mass = log(mass)), 
                interval = 'confidence')
# convert
estimates <- 0.239*exp(pred)

Using the SLR model, estimated resting energy consumption is:

\[ \hat{y} = 281\times\text{mass}^{0.74} \]

Left, prediction curve with 95% confidence interval.

Kleiber’s law

How much energy do you consume on a daily basis?

Conversions:

  • 110lb \(\approx\) 50kg
  • 180lb \(\approx\) 80kg
  • 1 kJ \(\approx\) 0.239 kcal
# range of body mass
mass <- seq(50, 80)

# interval for mean log energy
pred <- predict(fit, 
                newdata = data.frame(log.mass = log(mass)), 
                interval = 'prediction')
# convert
estimates <- 0.239*exp(pred)

Using the SLR model, estimated resting energy consumption is:

\[ \hat{y} = 281\times\text{mass}^{0.74} \]

Left, prediction curve with 95% prediction interval.

Hubble constant

The Hubble constant \(H\) relates a galaxy’s relative distance and velocity as \(H = \frac{v}{d}\).

# fit SLR model without an intercept
fit <- lm(distance ~ velocity - 1, 
          data = hubble)

Least squares estimate of \(\beta = \frac{1}{H}\):

\[ \hat{\beta} = 0.0123 \]

90% CI for the age of the universe:

# interval for age of universe in bn yr
km.mpc <- 3.09e19
yr.sec <- 1/(60*60*24*365)
confint(fit, level = 0.9)*km.mpc*yr.sec/1e9
              5 %     95 %
velocity 10.98235 13.12108

With 90% confidence, the universe is estimated to be between 10.98 and 13.12 billion years old.