Methods of inference for population means

Applied Statistics for Life Sciences

Inference for one mean

Body temperatures

 mean    sd       n    se
 98.41   0.9162   39   0.1467

Is the true mean body temperature actually 98.6°F?

Seems plausible given our data.

But what if the sample mean were instead…

 \(\bar{x}\)   consistent with \(\mu = 98.6\)?
 98.30         probably still yes
 98.15         maybe
 98.00         hesitating
 97.85         skeptical
 97.40         unlikely

How much error is too much?

Consider how many standard errors away from the hypothesized value we’d be:

 \(\bar{x}\)   estimation error   no. SE's   interpretation
 98.30         -0.3               2          double the average error
 98.15         -0.45              3          triple the average error
 98.00         -0.6               4          quadruple
 97.85         -0.75              5          quintuple
 97.40         -1.2               8          octuple!

We know from interval coverage that:

  • an error greater than 2SE occurs for only about 5% of samples
  • an error greater than 3SE occurs for less than 0.5% of samples

So if the estimation error exceeds 2-3SE, the data aren’t consistent with the hypothesis.
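
As a quick check, the 'no. SE's' column above can be reproduced in R (using the standard error 0.1467 from the summary table):

# number of standard errors separating each hypothetical mean from 98.6
xbar <- c(98.30, 98.15, 98.00, 97.85, 97.40)
(xbar - 98.6)/0.1467  # roughly -2, -3, -4, -5, -8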

Basic idea for a test

If we want to test the hypothesis

\[H_0: \mu = 98.6\]

against the alternative

\[H_A: \mu \neq 98.6\]

then we consider the test statistic

\[T = \frac{\bar{x} - 98.6}{s_x/\sqrt{n}} \qquad\left(\frac{\text{estimation error}}{\text{standard error}}\right)\]

Two possible outcomes:

  1. estimation error is small relative to standard error

\(\Longrightarrow\) the data are consistent with \(H_0\)

  2. estimation error is large relative to standard error

\(\Longrightarrow\) the data favor \(H_A\)

To formalize the test we just need to decide: how many SE’s is large enough to reject \(H_0\)?

Applying the \(t\) model

If the population mean is in fact 98.6°F then \[ T = \frac{\bar{x} - 98.6}{s_x/\sqrt{n}} \qquad\left(\frac{\text{estimation error}}{\text{standard error}}\right) \] has a sampling distribution that is well-approximated by a \(t_{39 - 1}\) model.

  • actual summary statistics give \(T\) = -1.328
  • underestimate more in 9.6% of samples
  • overestimate more in 9.6% of samples

\[P(|T| > 1.328) = 0.192\]

If the hypothesis were true, we’d see at least as much estimation error 19.2% of the time. So the data are consistent with the hypothesis \(\mu = 98.6\).

A more extreme scenario

If the population mean is in fact 98.6°F then \[ T = \frac{\bar{x} - 98.6}{s_x/\sqrt{n}} \qquad\left(\frac{\text{estimation error}}{\text{standard error}}\right) \] has a sampling distribution that is well-approximated by a \(t_{39 - 1}\) model.

  • suppose instead \(\bar{x} = 98.2\) so \(T\) = -2.726
  • underestimate more in 0.48% of samples
  • overestimate more in 0.48% of samples

\[P(|T| > 2.726) = 0.0096\]

If the hypothesis were true, we’d see at least as much estimation error only 0.96% of the time. So the data are not consistent with the hypothesis \(\mu = 98.6\).
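
This tail probability can be checked with pt():

# two-sided tail probability for T = -2.726 under a t model with 38 df
2*pt(-2.726, df = 38)  # approximately 0.0096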

Where to draw the line?

We just made these assessments:

 sample.mean   se       t.stat   how.often   evaluation
 98.41         0.1467   -1.328   0.192       not unusual
 98.2          0.1467   -2.726   0.009632    unusual

The sampling frequency we’re calculating above is called a p-value: the proportion of samples for which the test statistic, under \(H_0\), would be at least as extreme as the observed value.

  • smaller \(\longrightarrow\) more unusual \(\longrightarrow\) more evidence favoring \(H_A\)
  • larger \(\longrightarrow\) less unusual \(\longrightarrow\) less evidence favoring \(H_A\)

The convention is to reject the hypothesis whenever p < 0.05. The threshold 0.05 is called the “significance level” or simply “level” of the test.

How to perform the test

To test the hypotheses:

\[ \begin{cases} H_0: \mu = \mu_0\\ H_A: \mu \neq \mu_0 \end{cases} \]

Steps:

  1. Fix the significance level \(\alpha\).
  2. Compute the \(p\)-value.
  3. Reject \(H_0\) if \(p < \alpha\).
# point estimate and standard error
temp.mean <- mean(temps$body.temp)
temp.mean.se <- sd(temps$body.temp)/sqrt(nrow(temps))

# compute test statistic
tstat <- (temp.mean - 98.6)/temp.mean.se
tstat
[1] -1.328265
# proportion of samples where T exceeds observed value
p.val <- 2*pt(abs(tstat), df = 38, lower.tail = F)
p.val
[1] 0.1920133
# decision with error rate controlled at 5%
p.val < 0.05
[1] FALSE

We call this a \(t\)-test for the population mean. Interpretation in conventional style:

The data do not provide evidence that the mean body temperature differs from 98.6°F (T = -1.328 on 38 degrees of freedom, p = 0.192).

Types of errors

There are two ways to make a mistake in a hypothesis test – two “error types”.

Reject \(H_0\) Fail to reject \(H_0\)
True \(H_0\) type I error correct decision
False \(H_0\) correct decision type II error
  • The type I error rate is controlled at the significance level \(\alpha\) by design.

  • Type II error is trickier, because the sampling distribution of the test statistic depends on the value of the population mean under specific alternatives – we’ll revisit this later.

Summing up

A hypothesis test boils down to deciding whether your estimate is too far from a hypothetical value for that hypothesis to be plausible.

To test the hypotheses:

\[ \begin{cases} H_0: &\mu = \mu_0 \\ H_A: &\mu \neq \mu_0 \end{cases} \]

We use the test statistic:

\[ T = \frac{\bar{x} - \mu_0}{SE(\bar{x})} \quad\left(\frac{\text{estimation error under } H_0}{\text{standard error}}\right) \]

We say \(H_0\) is implausible at level \(\alpha\) if:

  • \(\underbrace{P(|T| > |T_\text{observed}|)}_\text{p-value} < \alpha\)

This decision rule controls the type I error rate at level \(\alpha\).

A different example

Test the hypothesis that the average U.S. adult sleeps 8 hours.

 estimate   std.err   tstat    pval
 6.96       0.02      -42.53   <0.0001

95% confidence interval: (6.91, 7.01)

The data provide evidence that the average U.S. adult does not sleep 8 hours per night (T = -42.53 on 3178 degrees of freedom, p < 0.0001). With 95% confidence, the mean nightly hours of sleep among U.S. adults is estimated to be between 6.91 and 7.01 hours, with a point estimate of 6.96 hours (SE: 0.0245).

The test and interval convey complementary information:

  • the test tells you U.S. adults don’t sleep 8 hours
  • the interval tells you how much they do sleep

Test-interval duality

A level \(\alpha\) test rejects \(H_0: \mu = \mu_0\) exactly when \(\mu_0\) is outside the \((1 - \alpha)\times 100\)% confidence interval for \(\mu\).

Consider the \(p\)-values for a sequence of tests across hypothesized values \(\mu_0\):

  • \(p > 0.05\) precisely for \(\mu_0\) in 95% CI
  • \(p > 0.01\) precisely for \(\mu_0\) in 99% CI

In other words:

\[\text{level $\alpha$ test rejects} \Longleftrightarrow \text{$1 - \alpha$ CI excludes}\]
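
A minimal sketch of this duality in R, assuming the temps data frame from earlier: scan a grid of hypothesized values and keep those with \(p > 0.05\).

# p-values for a grid of hypothesized means
mu.grid <- seq(97.9, 98.9, by = 0.01)
p.vals <- sapply(mu.grid, function(mu0) t.test(temps$body.temp, mu = mu0)$p.value)

# values retained at the 5% level span (approximately) the 95% CI
range(mu.grid[p.vals > 0.05])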

The t.test(...) function

Since tests and intervals go together, there is a single R function that computes both.

t.test(sleep, mu = 8, conf.level = 0.95)

    One Sample t-test

data:  sleep
t = -42.533, df = 3178, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 8
95 percent confidence interval:
 6.911123 7.007090
sample estimates:
mean of x 
 6.959107 

You have to make the decision using the \(p\)-value:

  • \(p < \alpha\): reject
  • \(p > \alpha\): fail to reject

Take a moment to locate each component of the test and estimates from the output.

A lower one-sided test

Does the average U.S. adult sleep less than 7 hours?

This example leads to a lower-sided alternative:

\[ \begin{cases} H_0: &\mu = 7 \\ H_A: &\mu < 7 \end{cases} \]

The test statistic is the same as before:

\[ T = \frac{\bar{x} - 7}{SE(\bar{x})} = -1.671 \]

The lower-sided \(p\)-value is 0.0474:

For the \(p\)-value, we look at how often \(T\) is more extreme in the direction of the alternative.

  • in this case, how often \(T\) is smaller

The other direction

Does the average U.S. adult sleep more than 7 hours?

Now the alternative is in the opposite direction:

\[ \begin{cases} H_0: &\mu = 7 \\ H_A: &\mu > 7 \end{cases} \]

The test statistic is the same as before:

\[ T = \frac{\bar{x} - 7}{SE(\bar{x})} = -1.671 \]

The upper-sided \(p\)-value is 0.9526:

For the \(p\)-value, we look at how often \(T\) is more extreme in the direction of the alternative.

  • in this case, how often \(T\) is larger
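
Both one-sided \(p\)-values can be computed directly from the \(t_{3178}\) model; note that they sum to 1:

# one-sided p-values for the same test statistic
pt(-1.671, df = 3178)                      # lower-sided: about 0.0474
pt(-1.671, df = 3178, lower.tail = FALSE)  # upper-sided: about 0.9526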

Directional hypotheses

Tests for the mean can involve directional or non-directional alternatives. We refer to these as one-sided and two-sided tests, respectively.

 Test type     Alternative          Direction favoring alternative
 Upper-sided   \(\mu > \mu_0\)      larger \(T\)
 Lower-sided   \(\mu < \mu_0\)      smaller \(T\)
 Two-sided     \(\mu \neq \mu_0\)   larger \(|T|\)

The direction of the test affects the \(p\)-value calculation (and thus decision), but won’t change the test statistic.

Conceptually tricky, but easy in R:

# upper-sided
t.test(ddt, mu = mu_0, alternative = 'greater')

# lower-sided
t.test(ddt, mu = mu_0, alternative = 'less')

# two-sided (default)
t.test(ddt, mu = mu_0, alternative = 'two.sided')

Three \(t\)-tests and a puzzle

Do U.S. adults sleep 7 hours per night?

# two sided test
t.test(sleep, 
       mu = 7)

    One Sample t-test

data:  sleep
t = -1.671, df = 3178, p-value = 0.09482
alternative hypothesis: true mean is not equal to 7
95 percent confidence interval:
 6.911123 7.007090
sample estimates:
mean of x 
 6.959107 

Do U.S. adults sleep less than 7 hours per night?

# lower-sided test
t.test(sleep, 
       mu = 7, 
       alternative = 'less')

    One Sample t-test

data:  sleep
t = -1.671, df = 3178, p-value = 0.04741
alternative hypothesis: true mean is less than 7
95 percent confidence interval:
     -Inf 6.999372
sample estimates:
mean of x 
 6.959107 

Do U.S. adults sleep more than 7 hours per night?

# upper-sided test
t.test(sleep, 
       mu = 7, 
       alternative = 'greater')

    One Sample t-test

data:  sleep
t = -1.671, df = 3178, p-value = 0.9526
alternative hypothesis: true mean is greater than 7
95 percent confidence interval:
 6.918841      Inf
sample estimates:
mean of x 
 6.959107 

Why don’t the tests agree?

Resolving the puzzle

# test (A)
t.test(sleep, mu = 7, 
       alternative = 'two.sided', 
       conf.level = 0.95)

    One Sample t-test

data:  sleep
t = -1.671, df = 3178, p-value = 0.09482
alternative hypothesis: true mean is not equal to 7
95 percent confidence interval:
 6.911123 7.007090
sample estimates:
mean of x 
 6.959107 
# test (B)
t.test(sleep, mu = 7, 
       alternative = 'less', 
       conf.level = 0.95)

    One Sample t-test

data:  sleep
t = -1.671, df = 3178, p-value = 0.04741
alternative hypothesis: true mean is less than 7
95 percent confidence interval:
     -Inf 6.999372
sample estimates:
mean of x 
 6.959107 

The tests don’t agree if the same significance level is used.

Consider: a type I error occurs in each case if…

  • test (A) falsely concludes the mean differs from 7
  • test (B) falsely concludes the mean is less than 7

Using level \(\alpha = 0.05\) for both tests means that falsely inferring \(\mu < 7\) occurs:

  • less than 5% of the time for test (A)
  • exactly 5% of the time for test (B)

So test (B) is more sensitive to data suggesting \(\mu < 7\) because it permits errors at a higher rate.

Inference for two means

Review: one-sample inference

The following are 15 measurements of the pesticide DDT in kale in parts per million (ppm).

2.79, 2.93, 3.22, 3.78, 3.22, 3.38, 3.18, 3.33, 3.34, 3.06, 3.07, 3.56, 3.08, 4.64 and 3.34

C. E. Finsterwalder (1976) Collaborative study of an extension of the Mills et al method for the determination of pesticide residues in food. J. Off. Anal. Chem. 59, 169–171.

Imagine the target level for safety considerations is 3ppm or less, and you want to use this data to determine whether the mean DDT level is within safe limits.

\[ \begin{cases} H_0: &\mu = 3 \quad(\text{null hypothesis})\\ H_A: &\mu > 3 \quad(\text{alternative hypothesis}) \end{cases} \]

We choose this direction because we’re concerned with evidence that mean DDT exceeds the threshold.

Review: one-sample inference

If in fact \(\mu = 3\), then according to the \(t\) model 0.58% of samples would produce an error of this magnitude or more in the direction of the alternative:

t.test(ddt, mu = 3, alternative = 'greater')

    One Sample t-test

data:  ddt
t = 2.9059, df = 14, p-value = 0.005753
alternative hypothesis: true mean is greater than 3
95 percent confidence interval:
 3.129197      Inf
sample estimates:
mean of x 
    3.328 

The data provide strong evidence that mean DDT in kale exceeds 3ppm (T = 2.9059 on 14 degrees of freedom, p = 0.0058). With 95% confidence, the mean DDT is estimated to be at least 3.129, with a point estimate of 3.32 (SE: 0.1129).

Notice the one-sided interval! (Inf = \(\infty\).) This is called a “lower confidence bound”.

Testing a mean difference

Does the average U.S. adult wish to lose weight?

 subject   actual   desired   difference
 1         265      225       40
 2         150      150       0
 3         137      150       -13
 4         159      125       34
 5         145      125       20

weight.diffs <- brfss$weight - brfss$wtdesire
t.test(weight.diffs, 
       mu = 0, 
       alternative = 'greater')

    One Sample t-test

data:  weight.diffs
t = 4.2172, df = 59, p-value = 4.311e-05
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
 10.99824      Inf
sample estimates:
mean of x 
 18.21667 

The data provide evidence that the average U.S. adult’s actual weight exceeds their desired weight (T = 4.2172 on 59 degrees of freedom, p < 0.0001).

Inference here is on the mean difference: \(H_0: \delta = 0\) vs. \(H_A: \delta > 0\).

Can we also do inference on a difference in means?

Testing a difference in means

Peter and Rosemary Grant caught and measured birds from more than 20 generations of finches on Daphne Major.

  • severe drought in 1977 limited food to large tough seeds

  • selection pressure favoring larger and stronger beaks

  • hypothesis: beak depth increased in 1978 relative to 1976

 year   depth
 1976   10.8
 1976   7.4
 1978   11.4
 1978   10.6

To answer this, we need to test a hypothesis involving two means:

\[ \begin{cases} H_0: &\mu_{1976} = \mu_{1978} \\ H_A: &\mu_{1976} < \mu_{1978} \end{cases} \]

  • can’t do inference on a mean difference here (no pairing of observations)
  • treat each year as an independent sample

Two-sample \(t\)-test

If \(x_1, \dots, x_{58}\) are the 1976 observations and \(y_1, \dots, y_{65}\) are the 1978 observations:

  • \(\bar{x}\) is a point estimate for \(\mu_{1976}\) with standard error \(SE(\bar{x}) = \frac{s_x}{\sqrt{n_x}}\)
  • \(\bar{y}\) is a point estimate for \(\mu_{1978}\) with standard error \(SE(\bar{y}) = \frac{s_y}{\sqrt{n_y}}\)

Inference uses a new \(T\) statistic:

\[ T = \frac{(\bar{x} - \bar{y}) - \delta_0}{SE(\bar{x} - \bar{y})} \]

  • \(\delta_0\) is the hypothesized difference in means
  • \(SE(\bar{x} - \bar{y}) = \sqrt{SE(\bar{x})^2 + SE(\bar{y})^2}\)
  • \(t_\nu\) model approximates the sampling distribution when each sample meets assumptions for one-sample inference

Performing the test (by hand)

 year   depth.mean   depth.sd   n
 1976   9.453        0.9625     58
 1978   10.19        0.8073     65

Point estimates: \[ \begin{align*} \bar{x} &= \qquad\qquad\qquad\qquad\\ \bar{y} &= \\ SE(\bar{x}) &= \\ SE(\bar{y}) &= \end{align*} \] Hypotheses: \[ \begin{cases} H_0: &\mu_{1976} - \mu_{1978} = 0 \\ H_A: &\mu_{1976} - \mu_{1978} < 0 \end{cases} \]

Test statistic: \[ \begin{align*} \bar{x} - \bar{y} &= \qquad\qquad\qquad\qquad\qquad\qquad\\ SE(\bar{x} - \bar{y}) &= \sqrt{SE(\bar{x})^2 + SE(\bar{y})^2} = \\ T &= \frac{(\bar{x} - \bar{y}) - \delta_0}{SE(\bar{x} - \bar{y})} = \end{align*} \] Conclusion: [reject]/[fail to reject] \(H_0\) in favor of \(H_A\)

How can you tell without the exact \(p\) value?
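
To check your work, here is the same computation in R using the summary statistics from the table:

# point estimates and standard errors from the summary table
xbar <- 9.453; ybar <- 10.19
se.x <- 0.9625/sqrt(58)
se.y <- 0.8073/sqrt(65)

# standard error of the difference and test statistic
se.diff <- sqrt(se.x^2 + se.y^2)
(xbar - ybar - 0)/se.diff  # about -4.57: far beyond 2-3 SEs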

Performing the test (in R)

t.test(depth ~ year, data = finch,
       mu = 0, alternative = 'less')

    Welch Two Sample t-test

data:  depth by year
t = -4.5727, df = 111.79, p-value = 6.255e-06
alternative hypothesis: true difference in means between group 1976 and group 1978 is less than 0
95 percent confidence interval:
       -Inf -0.4698812
sample estimates:
mean in group 1976 mean in group 1978 
          9.453448          10.190769 

The data provide evidence that mean beak depth increased following the drought (T = -4.5727 on 111.79 degrees of freedom, p < 0.0001). With 95% confidence, the mean increase is estimated to be at least 0.4699 mm, with a point estimate of 0.7373 (SE 0.1612).

Highly similar to the one-sample \(t\)-test, but notice:

  • input is a formula depth ~ year (“depth depends on year”) and data frame finch
  • mu now indicates hypothesized difference in means \(\delta_0\)
  • decimal degrees of freedom
  • alternative is relative to the order in which groups appear

Checking test assumptions

The two-sample test is appropriate whenever two one-sample tests would be.

In other words, the test assumes that both samples are either:

  • sufficiently large; or
  • have little skew and few outliers

To check, simply inspect each histogram.

  • both distributions unimodal
  • both a bit left skewed
  • no extreme outliers
  • large sample sizes (58, 65)

Could also check side-by-side boxplots for:

  • approximate symmetry of boxes
  • outliers far from whiskers

This is also a nice visualization of differences between samples.

Cloud data

Does seeding clouds with silver iodide increase mean rainfall?

Data are rainfall measurements in a target area from 26 days when clouds were seeded and 26 days when clouds were not seeded.

  • rainfall gives volume of rainfall in acre-feet
  • treatment indicates whether clouds were seeded

Hypotheses to test: \[ \begin{cases} H_0: &\mu_\text{seeded} = \mu_\text{unseeded} \\ H_A: &\mu_\text{seeded} > \mu_\text{unseeded} \end{cases} \]

 rainfall   treatment
 334.1      seeded
 489.1      seeded
 200.7      seeded
 40.6       seeded
 21.7       unseeded
 17.3       unseeded
 68.5       unseeded
 830.1      unseeded

Cloud data: which test?

Does seeding clouds with silver iodide increase mean rainfall?

# test (A)
t.test(rainfall ~ treatment, data = cloud, 
       mu = 0, alternative = 'less')

    Welch Two Sample t-test

data:  rainfall by treatment
t = 1.9982, df = 33.855, p-value = 0.9731
alternative hypothesis: true difference in means between group seeded and group unseeded is less than 0
95 percent confidence interval:
     -Inf 512.1582
sample estimates:
  mean in group seeded mean in group unseeded 
              441.9846               164.5885 
# test (B)
t.test(rainfall ~ treatment, data = cloud, 
       mu = 0, alternative = 'greater')

    Welch Two Sample t-test

data:  rainfall by treatment
t = 1.9982, df = 33.855, p-value = 0.02689
alternative hypothesis: true difference in means between group seeded and group unseeded is greater than 0
95 percent confidence interval:
 42.63408      Inf
sample estimates:
  mean in group seeded mean in group unseeded 
              441.9846               164.5885 

You can tell which group R considers first based on which estimate is printed first.

  • 'greater' is interpreted as [FIRST GROUP] > [SECOND GROUP]
  • 'less' is interpreted as [FIRST GROUP] < [SECOND GROUP]

Cloud data: interpretation

Does seeding clouds with silver iodide increase mean rainfall?

t.test(rainfall ~ treatment, data = cloud, 
       mu = 0, alternative = 'greater')

    Welch Two Sample t-test

data:  rainfall by treatment
t = 1.9982, df = 33.855, p-value = 0.02689
alternative hypothesis: true difference in means between group seeded and group unseeded is greater than 0
95 percent confidence interval:
 42.63408      Inf
sample estimates:
  mean in group seeded mean in group unseeded 
              441.9846               164.5885 

The data provide evidence that cloud seeding increases mean rainfall (T = 1.9982 on 33.855 degrees of freedom, p = 0.02689). With 95% confidence, seeding is estimated to increase mean rainfall by at least 42.63 acre-feet, with a point estimate of 277.4 (SE 138.8199).

Extra: the equal-variance \(t\)-test

If it is reasonable to assume the (population) standard deviations are the same in each group, one can gain a bit of power by using a different standard error:

\[SE_\text{pooled}(\bar{x} - \bar{y}) = \sqrt{\frac{\color{red}{s_p^2}}{n_x} + \frac{\color{red}{s_p^2}}{n_y}} \quad\text{where}\quad \color{red}{s_p} = \underbrace{\sqrt{\frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}}}_{\text{weighted average of } s_x^2 \;\&\; s_y^2}\]

Implement by adding var.equal = T as an argument to t.test().

  • larger df is used, hence more frequent rejections
  • avoid unless you have a small sample
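
As a sketch of the formula, the pooled standard error for the finch data can be computed from the earlier summary statistics:

# pooled SD from the finch summary statistics
n.x <- 58; n.y <- 65
s.p <- sqrt(((n.x - 1)*0.9625^2 + (n.y - 1)*0.8073^2)/(n.x + n.y - 2))

# pooled standard error of the difference in means
s.p*sqrt(1/n.x + 1/n.y)  # about 0.16, close to the Welch SE of 0.1612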

Statistical power for two-sample inference

\(p\)-values and type I errors

A \(p\)-value captures how often you’d reject in error if \(H_0\) were true.

t.test(rainfall ~ treatment, data = cloud, 
       mu = 0, alternative = 'greater')

    Welch Two Sample t-test

data:  rainfall by treatment
t = 1.9982, df = 33.855, p-value = 0.02689
alternative hypothesis: true difference in means between group seeded and group unseeded is greater than 0
95 percent confidence interval:
 42.63408      Inf
sample estimates:
  mean in group seeded mean in group unseeded 
              441.9846               164.5885 

If there is no effect of cloud seeding, then we would see \(T > 1.9982\) for 2.69% of samples.

  • While unlikely, our sample could have been among that 2.69%

  • By rejecting here we are willing to be wrong 2.69% of the time

Type II errors

But you can also make a mistake when \(H_0\) is false!

t.test(rainfall ~ treatment, data = cloud, 
       mu = 0, alternative = 'greater')

    Welch Two Sample t-test

data:  rainfall by treatment
t = 1.9982, df = 33.855, p-value = 0.02689
alternative hypothesis: true difference in means between group seeded and group unseeded is greater than 0
95 percent confidence interval:
 42.63408      Inf
sample estimates:
  mean in group seeded mean in group unseeded 
              441.9846               164.5885 

Suppose we test at the 1% level.

  • The test fails to reject (\(p > 0.01\)), but that doesn’t completely rule out \(H_A\).

  • The estimated effect – an increase of 277.4 acre-feet – could be too small relative to the variability in rainfall.

Simulating type II errors

Summary stats for cloud data:

treatment mean sd n
seeded 442 650.8 26
unseeded 164.6 278.4 26

We can approximate the type II error rate by:

  1. simulating datasets with matching statistics
  2. performing upper-sided tests of no difference
  3. computing the proportion of fail-to-reject decisions
type2sim(delta = 277, n = 26, sd = 650.8, 
         alpha = 0.01, nsim = 1000)

If in fact the effect size is exactly 277, a level 1% test with similar data will fail to reject 80.3% of the time!
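
The helper type2sim(...) is used here without its definition; below is a minimal sketch of what such a function might look like, assuming normal populations sharing a common standard deviation (a hypothetical implementation, not the code behind these slides):

# sketch: simulate two-group data with true difference delta, run upper-sided
# tests of no difference, and report the proportion of fail-to-reject decisions
type2sim <- function(delta, n, sd, alpha, nsim) {
  fail <- replicate(nsim, {
    x <- rnorm(n, mean = delta, sd = sd)  # 'seeded' group
    y <- rnorm(n, mean = 0, sd = sd)      # 'unseeded' group
    t.test(x, y, alternative = 'greater')$p.value > alpha
  })
  mean(fail)  # approximate type II error rate
}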

Larger effect size

Summary stats for cloud data:

treatment mean sd n
seeded 442 650.8 26
unseeded 164.6 278.4 26

Type II error rate depends on effect size:

  • larger effect \(\longrightarrow\) lower error rate
type2sim(delta = 400, n = 26, sd = 650.8, 
         alpha = 0.01, nsim = 1000)

If in fact the effect size is exactly 400, a level 1% test with similar data will fail to reject 57.9% of the time.

Smaller effect size

Summary stats for cloud data:

treatment mean sd n
seeded 442 650.8 26
unseeded 164.6 278.4 26

Type II error rate depends on effect size:

  • larger effect \(\longrightarrow\) lower error rate
  • smaller effect \(\longrightarrow\) higher error rate
type2sim(delta = 100, n = 26, sd = 650.8, 
         alpha = 0.01, nsim = 1000)

If in fact the effect size is exactly 100, a level 1% test with similar data will fail to reject 96.3% of the time.

Statistical power

The power of a test refers to its true rejection rate under an alternative and is defined as: \[\beta = \underbrace{(1 - \text{type II error rate})}_\text{correct decision rate when alternative is true}\]

Power is often interpreted as a detection rate:

  • high type II error \(\longrightarrow\) low power \(\longrightarrow\) low detection rate
  • low type II error \(\longrightarrow\) high power \(\longrightarrow\) high detection rate

Power depends on the exact alternative scenario:

  • low for alternatives close to the hypothesized value (e.g., differences near \(\delta_0 = 0\))
  • high for alternatives far from the hypothesized value

Power curves

Power is usually represented as a curve depending on the true difference.

Power curves for the test applied to the cloud data:

Assumptions:

  • significance level \(\alpha = 0.05\)
  • population standard deviation \(\sigma = 650\) (larger of two group estimates)

Which sample size achieves a specific \(\beta, \delta\)?

Sample size calculation

If you were (re)designing the study, how much data should you collect to detect a specified effect size?

To detect a difference of 250 or more due to cloud seeding with power 0.9:

power.t.test(power = 0.9, # target power level
             delta = 250, # smallest difference
             sd = 650, # largest population SD
             sig.level = 0.01, 
             type = 'two.sample', 
             alternative = 'one.sided') 

     Two-sample t test power calculation 

              n = 177.349
          delta = 250
             sd = 650
      sig.level = 0.01
          power = 0.9
    alternative = one.sided

NOTE: n is number in *each* group

For a conservative estimate, use:

  • overestimate of the larger of the two standard deviations
  • minimum difference of interest

\(\Longrightarrow\) we need at least 178 observations in each group to detect a difference of 250 or more at least 90% of the time
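
As a sanity check, plugging the rounded-up group size back into power.t.test() should return power just above the target:

# power attained with n = 178 per group (slightly above 0.9)
power.t.test(n = 178, delta = 250, sd = 650, sig.level = 0.01,
             type = 'two.sample', alternative = 'one.sided')$power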

Revisiting body temperatures

Does mean body temperature differ between men and women?

Test \(H_0: \mu_F = \mu_M\) against \(H_A: \mu_F \neq \mu_M\)

t.test(body.temp ~ sex, data = temps,
       mu = 0, alternative = 'two.sided')

    Welch Two Sample t-test

data:  body.temp by sex
t = 1.7118, df = 34.329, p-value = 0.09595
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -0.09204497  1.07783444
sample estimates:
mean in group female   mean in group male 
            98.65789             98.16500 

Suggestive but insufficient evidence that mean body temperature differs by sex.

Notice: estimated difference (F - M) is 0.493 °F (SE 0.2879)

What if we had more data?

Here are estimates from two larger samples of 65 individuals each (compared with 19, 20):

 sex      mean.temp   se        n
 female   98.39       0.09222   65
 male     98.1        0.08667   65

  • estimated difference (F - M) is smaller: 0.2892 °F
  • but so is the standard error: SE 0.1266 (recall, more data \(\longleftrightarrow\) better precision)
t.test(body.temp ~ sex, data = temps.aug,
       mu = 0, alternative = 'two.sided')

    Welch Two Sample t-test

data:  body.temp by sex
t = 2.2854, df = 127.51, p-value = 0.02394
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 0.03881298 0.53964856
sample estimates:
mean in group female   mean in group male 
            98.39385             98.10462 

The data provide evidence that mean body temperature differs by sex (T = 2.29 on 127.51 degrees of freedom, p = 0.02394).

A statistical trap

If you collect enough data, you can detect an arbitrarily small difference in means.

So keep in mind:

  • statistical significance \(\neq\) practical significance
  • large samples will tend to produce statistically significant results

It’s a good idea to always check your point estimates and ask whether findings are practically meaningful.

Inference for many means

Chick weight data

Chick weights at 20 days of age by diet:

Here we have four means to compare.

 diet   mean    se      sd      n
 1      170.4   13.45   55.44   17
 2      205.6   22.22   70.25   10
 3      258.9   20.63   65.24   10
 4      233.9   12.52   37.57   9

This is experimental data: chicks were randomly allocated one of the four diets.

  • looks like there are differences in mean weight by diet
  • how do you test for an effect of diet?

Hypotheses comparing many means

Let \(\mu_i = \text{mean weight on diet } i = 1, 2, 3, 4\).

The hypothesis that there are no differences in means by diet is:

\[ H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 \quad (\text{no difference in means}) \]

The alternative, if this is false, is that there is at least one difference:

\[ H_A: \mu_i \neq \mu_j \;\text{for some}\; i \neq j\quad (\text{at least one difference}) \]

This is known as an “omnibus” test because of the composite alternative.

How much difference is enough?

Here are two made-up examples of the same four sample means with different SEs.

Why does it look like there’s a difference at right but not at left?

Think about the \(t\)-test: we say there’s a difference if \(T = \frac{\text{estimate} - \text{hypothesis}}{\text{variability}}\) is large.

Same idea here: we see differences if they are big relative to the variability in estimates.

Partitioning variation

Partitioning variation into two or more components is called “analysis of variance”

For the chick data, two sources of variability:

  • group variability between diets

  • error variability among chicks

The analysis of variance (ANOVA) model:

\[\color{grey}{\text{total variation}} = \color{red}{\text{group variation}} + \color{blue}{\text{error variation}}\]

We’ll base the test on the ratio \(F = \frac{\color{red}{\text{group variation}}}{\color{blue}{\text{error variation}}}\) and reject \(H_0\) if the ratio is large enough.

The \(F\) statistic: a variance ratio

The \(F\) statistic measures variability attributable to group differences relative to variability attributable to individual differences.

Notation:

  • \(\bar{x}\): “grand” mean of all observations
  • \(\bar{x}_i\): mean of observations in group \(i\)
  • \(s_i\): SD of observations in group \(i\)
  • \(k\) groups
  • \(n\) total observations
  • \(n_i\) observations per group

Measures of variability:

\[\color{red}{MSG} = \frac{1}{k - 1}\sum_i n_i(\bar{x}_i - \bar{x})^2 \quad(\color{red}{\text{group}})\] \[\color{blue}{MSE} = \frac{1}{n - k}\sum_i (n_i - 1)s_i^2 \quad(\color{blue}{\text{error}})\] Ratio:

\[F = \frac{\color{red}{MSG}}{\color{blue}{MSE}} \quad\left(\frac{\color{red}{\text{group variation}}}{\color{blue}{\text{error variation}}}\right)\]

Sampling distribution for \(F\)

Provided that

  1. each group satisfies conditions for a \(t\) test
  2. the variability (standard deviation) is about the same across groups

the \(F\) statistic has a sampling distribution well-approximated by an \(F_{k - 1, n - k}\) model.

  • numerator degrees of freedom \(k - 1\)
  • denominator degrees of freedom \(n - k\)

\(F\) models for several different numerator degrees of freedom \(k - 1\) with fixed \(n = 30\).

The ANOVA \(F\) test “by hand”

To test the hypotheses:

\[ \begin{cases} H_0: &\mu_1 = \mu_2 = \mu_3 = \mu_4 \\ H_A: &\mu_i \neq \mu_j \quad\text{for some}\quad i \neq j \end{cases} \] Calculate the \(F\) statistic:

# ingredients of mean squares
k <- nrow(chicks.summary)
n <- nrow(chicks)
n.i <- chicks.summary$n
xbar.i <- chicks.summary$mean
s.i <- chicks.summary$sd
xbar <- mean(chicks$weight)

# mean squares
msg <- sum(n.i*(xbar.i - xbar)^2)/(k - 1)
mse <- sum((n.i - 1)*s.i^2)/(n - k)

# f statistic
fstat <- msg/mse
fstat
[1] 5.463598

And reject \(H_0\) when \(F\) is large.

For a significance level \(\alpha\) test, reject \(H_0\) when \(\underbrace{P(F > F_\text{obs})}_\text{p-value} < \alpha\).

pf(fstat, 4 - 1, 46 - 4, lower.tail = F)
[1] 0.002909054

Test outcome

\[ \begin{cases} H_0: &\mu_1 = \mu_2 = \mu_3 = \mu_4 \\ H_A: &\mu_i \neq \mu_j \quad\text{for some}\quad i \neq j \end{cases} \]

\(F = \frac{\color{red}{\text{group variation}}}{\color{blue}{\text{error variation}}} = \frac{MSG}{MSE} = 5.4636\).

pf(fstat, 4 - 1, 46 - 4, lower.tail = F)
[1] 0.002909054

F = 5.4636 means the variation in weight attributable to diets is 5.46 times as large as the individual variation among chicks.

The \(p\)-value for the test is 0.0029:

  • if there is truly no difference in means, then under 1% of samples (about 3 in 1000) would produce at least as much diet-to-diet variability as we observed

  • so in this case we reject \(H_0\) at the 1% level

ANOVA in R

The aov(...) function fits ANOVA models using a formula/dataframe specification:

# fit anova model
fit <- aov(weight ~ diet, data = chicks)

# generate table
summary(fit)
             Df Sum Sq Mean Sq F value   Pr(>F)
diet          3  55881   18627   5.464 0.002909
Residuals    42 143190    3409

Conventional interpretation style closely follows that of previous inferences for one or two means:

The data provide evidence of an effect of diet on mean weight (F = 5.464 on 3 and 42 df, p = 0.0029).

Analysis of variance table

The results of an analysis of variance are traditionally displayed in a table.

 Source   Degrees of freedom   Sum of squares   Mean square                   F statistic           p-value
 Group    \(k - 1\)            SSG              \(MSG = \frac{SSG}{k - 1}\)   \(\frac{MSG}{MSE}\)   \(P(F > F_\text{obs})\)
 Error    \(n - k\)            SSE              \(MSE = \frac{SSE}{n - k}\)
  • the sum of square terms are ‘raw’ measures of variability from each source
  • the mean square terms are adjusted for the amount of data available and number of parameters

Formally, the ANOVA model says \[(n - 1)S^2 = SSG + SSE\]

Constructing the table “by hand”

 Source   DF          Sum sq.   Mean sq.                      F statistic           p-value
 Group    \(k - 1\)   SSG       \(MSG = \frac{SSG}{k - 1}\)   \(\frac{MSG}{MSE}\)   \(P(F > F_\text{obs})\)
 Error    \(n - k\)   SSE       \(MSE = \frac{SSE}{n - k}\)

Here is what the fitted model looks like in R:

# fitted anova model
aov(weight ~ diet, data = chicks)
Call:
   aov(formula = weight ~ diet, data = chicks)

Terms:
                     diet Residuals
Sum of Squares   55881.02 143190.31
Deg. of Freedom         3        42

Residual standard error: 58.38915
Estimated effects may be unbalanced

Exercise: compute MSE, MSG, and F based on this output.

Rules of thumb for level \(\alpha = 0.05\) tests:

  • \(F > 5\) almost always rejects
  • \(F > 4\) usually rejects for \(k > 2\) (need \(n > 15\))
  • \(F > 3\) usually rejects for \(k > 3\) (need \(n > 25\))
  • \(F > 2\) rarely rejects unless \(k > 8\)

Measuring effect size

A common measure of effect size is \[\eta^2 = \frac{SSG}{SST} \quad\left(\frac{\text{group variation}}{\text{total variation}}\right)\] where \[SST = SSG + SSE = (n - 1)S^2_x\] This is the proportion of total variation attributed to the grouping factor.

  • possible values \(0 \leq \eta^2 \leq 1\)
  • near 1 \(\longrightarrow\) large effect
  • near 0 \(\longrightarrow\) small effect
library(effectsize)
fit <- aov(weight ~ diet, data = chicks)
eta_squared(fit, partial = F)
# Effect Size for ANOVA (Type I)

Parameter | Eta2 |       95% CI
-------------------------------
diet      | 0.28 | [0.08, 1.00]

- One-sided CIs: upper bound fixed at [1.00].

An estimated 28% of variation in weight is attributable to diet.

The confidence interval is for the population analogue of \(\eta^2\).
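
The point estimate can also be checked by hand from the sums of squares in the fitted model:

# eta^2 = SSG/(SSG + SSE)
55881.02/(55881.02 + 143190.31)  # about 0.28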

Powering for effect size

How many chicks should we measure to detect an average group difference of 20g?

Group variation on the order of 20g amounts to roughly (taking \(n_i \approx 12\), since group sizes vary) \[SSG \approx (k - 1)(n_i\times400) \approx 14400\] which is an effect size of \[\eta^2 = \frac{14400}{143190 + 14400} = 0.0914\] i.e., group variation is around 9.14% of total variation.
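
The same arithmetic in R, taking \(n_i \approx 12\) as a rough average group size:

# rough SSG for group differences on the order of 20g (20^2 = 400)
ssg.approx <- (4 - 1)*(12*400)  # 14400

# implied effect size, using SSE = 143190 from the fitted model
ssg.approx/(143190 + ssg.approx)  # about 0.09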

To power the study to detect \(\eta^2 = 0.1\) use the approximation (assumes \(k \ll n\)):

power.anova.test(groups = 4,
                 sig.level = 0.05,
                 power = 0.8,
                 between.var = 0.1, # eta^2
                 within.var = 0.9) # 1 - eta^2

     Balanced one-way analysis of variance power calculation 

         groups = 4
              n = 33.70068
    between.var = 0.1
     within.var = 0.9
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group

We’d need 34 chicks per group.

More examples: diet and longevity

From data on lifespans of lab mice fed calorie-restricted diets:

 diet    mean    sd      n
 NP      27.4    6.134   49
 N/N85   32.69   5.125   57
 N/R50   42.3    7.768   71
 N/R40   45.12   6.703   60

ANOVA omnibus test:

\[ \begin{align*} &H_0: \mu_\text{NP} = \mu_\text{N/N85} = \mu_\text{N/R50} = \mu_\text{N/R40} \\ &H_A: \text{at least two means differ} \end{align*} \]

fit <- aov(lifetime ~ diet, data = longevity)
summary(fit)
             Df Sum Sq Mean Sq F value Pr(>F)    
diet          3  11426    3809   87.41 <2e-16 ***
Residuals   233  10152      44                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The data provide evidence that diet restriction has an effect on mean lifetime among mice (F = 87.41 on 3 and 233 degrees of freedom, p < 0.0001).

More examples: treating anorexia

From data on weight change among 72 young anorexic women after a period of therapy:

 treat   post - pre   sd      n
 ctrl    -0.45        7.989   26
 cbt     3.007        7.309   29
 ft      7.265        7.157   17

ANOVA omnibus test: \[ \begin{align*} &H_0: \mu_\text{ctrl} = \mu_\text{cbt} = \mu_\text{ft} \\ &H_A: \text{at least two means differ} \end{align*} \]

# fit anova model
fit <- aov(change ~ treat, data = anorexia)

# generate table
summary(fit)
            Df Sum Sq Mean Sq F value Pr(>F)   
treat        2    615  307.32   5.422 0.0065 **
Residuals   69   3911   56.68                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The data provide evidence of an effect of therapy on mean weight change among young women with anorexia (F = 5.422 on 2 and 69 degrees of freedom, p = 0.0065).

Post-hoc inference in ANOVA

Diet restriction and longevity

\[ \begin{align*} &H_0: \mu_\text{NP} = \mu_\text{N/N85} = \mu_\text{N/R50} = \mu_\text{N/R40} \\ &H_A: \text{at least two means differ} \end{align*} \]

fit <- aov(lifetime ~ diet, data = longevity)
summary(fit)
             Df Sum Sq Mean Sq F value Pr(>F)    
diet          3  11426    3809   87.41 <2e-16 ***
Residuals   233  10152      44                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
eta_squared(fit, partial = F, alternative = 'two.sided')
# Effect Size for ANOVA (Type I)

Parameter | Eta2 |       95% CI
-------------------------------
diet      | 0.53 | [0.44, 0.60]

The data provide evidence that caloric restriction affects mean lifespan (F = 87.41 on 3 and 233 degrees of freedom, p < 0.0001). With 95% confidence, an estimated 44%-60% of variability in lifespans is attributable to caloric intake.

Post-hoc inference in ANOVA

The ANOVA tells us there’s evidence that caloric intake affects mean lifespan. So now that we’ve inferred an effect, we might want to know:

  1. What are the mean lifespans for each level of restriction?
  2. Which means differ?
  3. How do restricted-calorie diets compare with the control (NP)?

We can answer these questions with post-hoc (after-the-fact) inferences on:

  • group means \(\mu_i\)
  • “contrasts” \(\mu_i - \mu_j\)

Estimates for group means

Interval estimates for group means in ANOVA are similar to marginal confidence intervals except for two details.

emmeans(object = fit, spec = ~ diet) |> 
  confint(level = 0.95)
 diet    estimate   SE       95% CI
 NP      27.4       0.943    (25.54, 29.26)
 N/N85   32.69      0.8743   (30.97, 34.41)
 N/R50   42.3       0.7834   (40.75, 43.84)
 N/R40   45.12      0.8522   (43.44, 46.8)
  1. Standard errors are based on a “pooled” standard deviation:

\[SE_i = \frac{s_\text{pooled}}{\sqrt{n_i}} = \frac{\sqrt{MSE}}{\sqrt{n_i}}\]

  2. Critical values are from a \(t_{n - k}\) model (instead of \(t_{n - 1}\)).

Rationale:

  • the ANOVA model assumes equal variability (standard deviations) across groups
  • better precision (for variability estimates, not means) when assumption holds

Simultaneous intervals

For multiple intervals we can distinguish two types of coverage.

  • individual coverage: how often one interval covers the population mean
  • simultaneous coverage: how often all intervals cover population means at the same time

The Bonferroni correction for \(k\) intervals consists in changing the individual coverage level to \(\left(1 - \frac{\alpha}{k}\right)\times 100\%\).

  • Effectively a width increase
  • Guarantees joint coverage of at least \((1 - \alpha)\times 100\%\)
  • Tends to be overly conservative with many means (large \(k\))
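
For example, with \(k = 4\) intervals and joint level 95%, each interval is computed at individual level \(1 - 0.05/4 = 0.9875\); the critical value comes from the \(t_{233}\) model:

# Bonferroni-adjusted critical value for k = 4 intervals, joint 95%
alpha <- 0.05; k <- 4
qt(1 - alpha/(2*k), df = 233)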

Implementation in R

# table
emmeans(object = fit, spec = ~ diet) |>
  confint(level = 0.95, adjust = 'bonferroni')
 diet  emmean    SE  df lower.CL upper.CL
 NP      27.4 0.943 233     25.0     29.8
 N/N85   32.7 0.874 233     30.5     34.9
 N/R50   42.3 0.783 233     40.3     44.3
 N/R40   45.1 0.852 233     43.0     47.3

Confidence level used: 0.95 
Conf-level adjustment: bonferroni method for 4 estimates 
  • note Bonferroni adjustment
  • these are model-based estimates that depend on the fitted ANOVA model
# visualization
emmeans(object = fit, spec = ~ diet) |>
  confint(level = 0.95, adjust = 'bonferroni') |>
  plot(xlab = 'mean lifespan (days)', 
       ylab = 'diet')

Caution: these intervals do NOT indicate which means differ significantly.

Pairwise comparisons

The difference \(\mu_{NP} - \mu_{N/N85}\) is an example of a “contrast” involving two means.

Simultaneous inference on pairwise contrasts will determine which means differ and by how much.

emmeans(object = fit, specs = ~ diet) |> 
  contrast('pairwise')
 difference          estimate   SE
 NP - (N/N85)        -5.289     1.286
 NP - (N/R50)        -14.9      1.226
 NP - (N/R40)        -17.71     1.271
 (N/N85) - (N/R50)   -9.606     1.174
 (N/N85) - (N/R40)   -12.43     1.221
 (N/R50) - (N/R40)   -2.819     1.158

Remarks:

  • point estimates are \(\bar{x}_i - \bar{x}_j\)

  • \(SE_{ij} = SE(\bar{x}_i - \bar{x}_j) = s_\text{pooled}\sqrt{\frac{1}{n_i} + \frac{1}{n_j}}\)

  • inference is based on a \(t_{n - k}\) model

    • degrees of freedom: \(n - k\)
    • tests: \(T_{ij} = \frac{\bar{x}_i - \bar{x}_j}{SE_{ij}}\)
    • intervals: \(\bar{x}_i - \bar{x}_j \pm c\times SE_{ij}\)

Multiplicity corrections must adjust for the number of contrasts (6), not means (4).

Pairwise comparisons: tests

Which means differ significantly?

emmeans then contrast then test:

emmeans(object = fit, specs = ~ diet) |>
  contrast('pairwise') |>
  test(adjust = 'bonferroni')
 contrast          estimate   SE  df t.ratio p.value
 NP - (N/N85)         -5.29 1.29 233  -4.113  0.0003
 NP - (N/R50)        -14.90 1.23 233 -12.150  <.0001
 NP - (N/R40)        -17.71 1.27 233 -13.938  <.0001
 (N/N85) - (N/R50)    -9.61 1.17 233  -8.183  <.0001
 (N/N85) - (N/R40)   -12.43 1.22 233 -10.177  <.0001
 (N/R50) - (N/R40)    -2.82 1.16 233  -2.436  0.0937

P value adjustment: bonferroni method for 6 tests 
  • p-values are adjusted for multiplicity
  • reject if adjusted p-value is below the significance threshold

Hypotheses for pairwise tests:

\[\begin{cases} H_0: \mu_i - \mu_j = 0 \\ H_A: \mu_i - \mu_j \neq 0 \end{cases}\] Test statistic:

\[T_{ij} = \frac{\bar{x}_i - \bar{x}_j}{SE_{ij}}\]

\(p\)-values are obtained from a \(t_{n - k}\) model for the sampling distribution of \(T_{ij}\).

The data provide evidence at the 5% significance level that mean lifespan differs among all levels of diet restriction except the N/R40 and N/R50 groups (p = 0.0937), for which the evidence is suggestive but inconclusive.

Multiple testing correction matters

Using unadjusted \(p\)-values will inflate type I error rates.

setting adjust = 'none':

emmeans(object = fit, specs = ~ diet) |>
  contrast('pairwise') |>
  test(adjust = 'none')
 contrast          estimate   SE  df t.ratio p.value
 NP - (N/N85)         -5.29 1.29 233  -4.113  0.0001
 NP - (N/R50)        -14.90 1.23 233 -12.150  <.0001
 NP - (N/R40)        -17.71 1.27 233 -13.938  <.0001
 (N/N85) - (N/R50)    -9.61 1.17 233  -8.183  <.0001
 (N/N85) - (N/R40)   -12.43 1.22 233 -10.177  <.0001
 (N/R50) - (N/R40)    -2.82 1.16 233  -2.436  0.0156

Failure to adjust for multiple inferences leads to a different conclusion:

The data provide evidence at the 5% significance level that mean lifespan differs among all levels of diet restriction.

This is incorrect, because the joint significance level is not 5%.

Without adjustment, the joint type I error rate could be as high as \(6\times\alpha = 6\times 0.05 = 0.3\).

Pairwise comparisons: intervals

How much do means differ?

emmeans then contrast then confint:

emmeans(object = fit, specs = ~ diet) |> 
  contrast('pairwise') |>
  confint(level = 0.95, adjust = 'bonferroni')
 contrast          estimate   SE  df lower.CL upper.CL
 NP - (N/N85)         -5.29 1.29 233    -8.71   -1.867
 NP - (N/R50)        -14.90 1.23 233   -18.16  -11.633
 NP - (N/R40)        -17.71 1.27 233   -21.10  -14.333
 (N/N85) - (N/R50)    -9.61 1.17 233   -12.73   -6.482
 (N/N85) - (N/R40)   -12.43 1.22 233   -15.67   -9.177
 (N/R50) - (N/R40)    -2.82 1.16 233    -5.90    0.261

Confidence level used: 0.95 
Conf-level adjustment: bonferroni method for 6 estimates 
  • level specifies joint coverage after adjustment

Intervals are for the contrast \(\mu_i - \mu_j\):

\[\bar{x}_i - \bar{x}_j \pm c\times SE_{ij}\]

The critical value \(c\) is obtained from the \(t_{n - k}\) model.

For a \((1 - \alpha)\times 100\%\) confidence interval with Bonferroni correction:

\[c = \left(1 - \frac{\alpha}{2k}\right) \;\text{quantile}\]
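
For the six pairwise contrasts here:

# Bonferroni critical value for 6 contrasts at joint 95%, df = n - k = 233
qt(1 - 0.05/(2*6), df = 233)  # about 2.66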

Interpretations are the same as usual:

With 95% confidence, mean lifespan on a normal diet is estimated to exceed mean lifespan on an unrestricted diet by between 1.87 and 8.71 days, with a point estimate of 5.29 days difference (SE 1.29).

Pairwise comparisons: visualizations

Another option is to visualize the pairwise comparison inferences by displaying simultaneous 95% intervals.

Easy to spot significant contrasts:

  • intervals exclude 0 \(\Leftrightarrow\) tests reject

emmeans then contrast then confint then plot:

emmeans(object = fit, specs = ~ diet) |>
  contrast('pairwise') |>
  confint(level = 0.95, adjust = 'bonferroni') |>
  plot(xlab = 'difference in mean lifespan (days)', 
       ylab = 'contrast')

Comparisons with a control

Does diet restriction increase mean lifespan, and if so by how much?

Specify contrast('trt.vs.ctrl'):

emmeans(object = fit, specs = ~ diet) |> 
  contrast('trt.vs.ctrl') |>
  confint(level = 0.95, adjust = 'dunnett')
 contrast       estimate   SE      95% CI
 (N/N85) - NP   5.289      1.286   (2.23, 8.34)
 (N/R50) - NP   14.9       1.226   (11.98, 17.81)
 (N/R40) - NP   17.71      1.271   (14.7, 20.73)
  • Multiple inference adjustment uses Dunnett’s method

    • specialized correction for comparisons with a control
    • adjust = 'dunnett'
  • Comparisons will be relative to first group or factor level in R

With 95% confidence, relative to an unrestricted diet…

  • an 85kcal/day diet increases lifespan by an estimated 2.23 to 8.34 days
  • a 50kcal/day diet increases lifespan by an estimated 11.98 to 17.81 days
  • a 40kcal/day diet increases lifespan by an estimated 14.70 to 20.73 days

Extras: log-contrasts

Can we instead estimate a percent change in lifespan relative to the control?

The contrast in log-lifetimes would be:

\[ \log(\mu_{N85}) - \log(\mu_{NP}) = \log\left(\frac{\mu_{N85}}{\mu_{NP}}\right) \] So to answer the question:

  1. refit the ANOVA model with log lifetimes
  2. compute contrasts with control group using Dunnett’s method
  3. exponentiate estimates to obtain ratios (and hence percentages)
fit.log <- aov(log(lifetime) ~ diet, data = longevity)
 contrast     estimate   95% CI
 (N/R50)/NP   1.572      (1.42, 1.73)
 (N/R40)/NP   1.688      (1.52, 1.87)

Fact: mean log lifetime = log median lifetime.

With 95% confidence, diet restriction to 40kcal (a 52.9% reduction relative to an 85kCal diet) is estimated to increase median lifespan in mice by 52% to 87%.

Extra credit: work out and interpret the interval estimate for the contrast not shown above.

Extras: Response curve

We can use the log-contrasts to estimate a response curve relating caloric reduction and change in median lifespan.

Simplifying heuristics:

  • percent increase in median lifespan (y axis) is relative to unrestricted (NP) diet
  • percent reduction in caloric intake (x axis) is relative to 85kCal diet

Nonparametric alternatives

Parametric inference

The inferences we’ve developed so far are based on simple statistical models:

  • [one- and two-sample inference] \(t_{n - 1}\) model
  • [ANOVA] \(F_{k - 1, n - k}\) model

Both models assume underlying data distributions are described by…

  1. a specific bell-shaped curve
  2. population parameters \(\mu, \sigma\)

We call these parametric methods.

Some failure modes

Parametric model assumptions don’t always hold.

Scenario 1: difficult to assess

DDT concentrations (ppm) in kale samples.

With only \(n = 12\) observations, it’s hard to assess the shape of the distribution.

Scenario 2: assumptions fail

Serum cholesterol (mg/L) on two diets.

The distribution for the oat bran group is right-skewed with an outlier to the left.

Inference for the median

 ddt    sign
 2.79   -
 2.93   -
 3.08   +
 3.18   +
 3.22   +
 3.22   +
 3.33   +
 3.34   +
 3.34   +
 3.38   +
 3.56   +
 3.78   +

Consider the following hypothesis and alternative:

\[ \begin{cases} H_0: &m = 3 \qquad(\text{median is } 3 \text{ppm})\\ H_A: &m > 3 \qquad(\text{median exceeds } 3 \text{ppm}) \end{cases} \]

If \(m = 3\) you’d expect 3ppm to evenly divide the data.

But actually 10 observations (83%) are larger and 2 observations (17%) are smaller; using combinatorics, this occurs by chance only 1.9% of the time.

The data provide evidence that median DDT in kale exceeds 3ppm (p = 0.019).

This is called a sign test, and it is nonparametric because it makes no assumptions about the underlying distribution.
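
The sign test is equivalent to a binomial test on the number of positive signs, so it can be computed directly in R:

# sign test: 10 of 12 observations exceed the hypothesized median of 3
binom.test(x = 10, n = 12, p = 0.5, alternative = 'greater')$p.value  # about 0.019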

Inference for the center

ddt di
2.79 -0.21
2.93 -0.07
3.08 0.08
3.18 0.18
3.22 0.22
3.22 0.22
3.33 0.33
3.34 0.34
3.34 0.34
3.38 0.38
3.56 0.56
3.78 0.78

Now consider:

\[ \begin{cases} H_0: &c = 3 \qquad(\text{center is } 3 \text{ppm})\\ H_A: &c > 3 \qquad(\text{center exceeds } 3 \text{ppm}) \end{cases} \] If the distribution is symmetric, deviations from center should be about the same in either direction.

The signed rank test leverages this expectation:

  1. [deviations] compute deviations \(d_i = x_i - c_0\)
  2. [signed ranks] compute \(v_i = \text{sign}(d_i)\times \text{rank}(|d_i|)\)
  3. [test statistic] add up positive signed ranks \(V = \sum_{v_i > 0} v_i\)

If \(V\) is large, the deviations above \(c_0\) outweigh those below, providing evidence favoring \(H_A\).

Inference for the center

 ddt    di      rank   sign   vi
 2.93   -0.07   1      -1     -1
 3.08   0.08    2      1      2
 3.18   0.18    3      1      3
 2.79   -0.21   4      -1     -4
 3.22   0.22    5.5    1      5.5
 3.22   0.22    5.5    1      5.5
 3.33   0.33    7      1      7
 3.34   0.34    8.5    1      8.5
 3.34   0.34    8.5    1      8.5
 3.38   0.38    10     1      10
 3.56   0.56    11     1      11
 3.78   0.78    12     1      12

Signed rank statistic: \[ \begin{align*} V &= 2+3+5.5+5.5+7+8.5+8.5+10+11+12 \\ &= 73 \end{align*} \]

There are 4096 possible sign combinations; of these, only about 0.43% give a value of \(V\) at least this large.

wilcox.test(ddt, mu = 3, alternative = 'greater')

    Wilcoxon signed rank test with continuity correction

data:  ddt
V = 73, p-value = 0.004269
alternative hypothesis: true location is greater than 3

The data provide evidence that the center of the distribution of DDT in kale exceeds 3ppm (signed rank test, p = 0.00427).

Inference comparing centers

 bp   diet
 0    FishOil
 12   FishOil
 10   FishOil
 2    FishOil
 14   FishOil
 8    FishOil
 -3   RegularOil
 -4   RegularOil
 -6   RegularOil
 0    RegularOil
 1    RegularOil
 2    RegularOil

Consider using data on blood pressure percent reduction to test: \[ \begin{cases} H_0: &c_F = c_R \qquad(\text{no effect})\\ H_A: &c_F > c_R \qquad(\text{fish oil produces greater reduction}) \end{cases} \]

If there is no effect of diet, ranks will be randomly distributed among groups. This idea leads to the rank sum test:

  1. [pool] Combine observations from both groups
  2. [rank] Sort and rank pooled observations
  3. [sum] Add up ranks in the first group

\[W = \text{rank sum} - \frac{n_1(n_1 + 1)}{2}\]

When \(W\) is near \(0\) or its maximum value \(n_1 n_2\), there is more separation between the groups.

Inference comparing centers

 bp   diet         rank
 -6   RegularOil   1
 -4   RegularOil   2
 -3   RegularOil   3
 0    FishOil      4.5
 0    RegularOil   4.5
 1    RegularOil   6
 2    FishOil      7.5
 2    RegularOil   7.5
 8    FishOil      9
 10   FishOil      10
 12   FishOil      11
 14   FishOil      12

Rank sum statistic:

\[ W = \underbrace{(4.5+7.5+9+10+11+12)}_\text{rank sum} - \underbrace{\frac{6\times 7}{2}}_\text{adjustment} = 54 - 21 = 33 \] There are 924 ways to allocate ranks to groups; among these, larger values of \(W\) occur about 0.99% of the time.

wilcox.test(bp ~ diet, data = fish.oil, alternative = 'greater')

    Wilcoxon rank sum test with continuity correction

data:  bp by diet
W = 33, p-value = 0.009903
alternative hypothesis: true location shift is greater than 0

The data provide evidence that fish oil reduces blood pressure by more than regular oil (rank sum test, p = 0.0099).

Kruskal-Wallis test

Example: AAM length in mussels sampled at five locations. Here the assumptions for ANOVA may not hold:

  • sample sizes are small
  • spread differs a bit
  • outliers (tvarminne, petersburg)
  • skewness (magadan, tillamook)

\[ \begin{cases} H_0: &c_M = c_N = c_{Ti} = c_{Tv} = c_P \\ H_A: &c_i \neq c_j \quad\text{ for some }\quad i\neq j \end{cases} \]

An ANOVA-like test can be formulated using ranks of pooled observations:

\[ U = \frac{\sum_{j = 1}^k n_j (\bar{r}_j - \bar{r})^2}{\sum_{i = 1}^n (r_i - \bar{r})^2} \qquad \left(\frac{\text{group variation}}{\text{total variation}}\right) \]

  • \(r_i\): rank of the \(i\)th observation
  • \(\bar{r}_j\): average rank within \(j\)th group
  • \(\bar{r}\): average rank

If there are location differences, \(U\) will be large.

Kruskal-Wallis test

Omnibus test for location differences:

kruskal.test(aam.length ~ location, data = mussels)

    Kruskal-Wallis rank sum test

data:  aam.length by location
Kruskal-Wallis chi-squared = 16.405, df = 4, p-value = 0.002521

Post-hoc comparisons use pairwise rank sum tests:

pairwise.wilcox.test(x = mussels$aam.length, 
                     g = mussels$location, 
                     p.adjust.method = 'bonferroni')

    Pairwise comparisons using Wilcoxon rank sum exact test 

data:  mussels$aam.length and mussels$location 

           magadan newport tillamook tvarminne
newport    1.000   -       -         -        
tillamook  1.000   1.000   -         -        
tvarminne  0.293   0.127   0.312     -        
petersburg 0.059   0.022   0.084     1.000    

P value adjustment method: bonferroni 

The data provide evidence that the distribution of AAM lengths differs by geographic location (Kruskal-Wallis test, p = 0.0025)

Pairwise comparisons indicate that distributions differ significantly between Petersburg and Newport populations (p = 0.022)

Comparison with ANOVA

The omnibus test in ANOVA gives a similar result:

            Df   Sum Sq   Mean Sq F value   Pr(>F)    
location     4 0.004520 0.0011299   7.121 0.000281 ***
Residuals   34 0.005395 0.0001587                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

But the pairwise comparisons differ:

# A tibble: 4 × 6
  contrast               estimate      SE    df t.ratio p.value
  <chr>                     <dbl>   <dbl> <dbl>   <dbl>   <dbl>
1 magadan - petersburg    -0.0254 0.00652    34   -3.90 0.00430
2 newport - tvarminne     -0.0209 0.00680    34   -3.07 0.0417 
3 newport - petersburg    -0.0286 0.00652    34   -4.39 0.00103
4 tillamook - petersburg  -0.0232 0.00621    34   -3.74 0.00670

The parametric test is more sensitive to skewness and outliers!

Caveats

Two-sample and ANOVA-type rank-based inference procedures detect location shifts only.

While we write the hypotheses in terms of centers by convention, really we’re testing:

  • \(H_0\): all observations come from one distribution
  • \(H_A\): observations in at least one group tend to be larger/smaller than the other(s)

These tests are not sensitive to alternatives in which distributions differ in shape rather than location.

Recap

Nonparametric methods provide attractive alternatives to \(t\) and \(F\) tests when assumptions don’t hold or aren’t easily checked.

  • helpful for small sample sizes or odd data distributions

  • more robust to outliers

  • fewer assumptions

 Setting        Method                Test of…          Assumptions
 One sample     Sign test             median            none
 One sample     Signed rank test      center/location   symmetry
 Two samples    Rank sum test         center/location   location shifts only
 Many samples   Kruskal-Wallis test   center/location   location shifts only