Categorical data analysis

Applied statistics for life sciences

One-sample inference for binomial data

Categorical data

Consider the frequency of diabetes among a random sample of \(n = 500\) American adults:

  • data values are Y/N \(\Rightarrow\) nominal data
  • exactly 2 categories \(\Rightarrow\) “binomial” data

Categorical data

Or, to take another example, consider the frequency of self-rated general health.

  • values are ranked labels \(\Rightarrow\) ordinal data
  • \(>2\) categories \(\Rightarrow\) “multinomial” data

Proportions

For categorical data the usual inferential target is a set of population proportions.

Sample proportions for diabetes data:

diabetes count proportion
Yes 57 0.114
No 443 0.886

\[ \begin{align*} \hat{p}_1 &= 57/500 = 0.114 \\ \hat{p}_2 &= 443/500 = 0.886 \end{align*} \]

Example interpretation:

Diabetes prevalence among U.S. adults is estimated to be 11.4%.
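These proportions can be computed directly in R; a quick sketch using the counts above:

```r
# diabetes counts from the sample above
counts <- c(Yes = 57, No = 443)

# sample proportions: each count divided by the total
prop.table(counts)  # Yes 0.114, No 0.886
```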

Proportions

For categorical data the usual inferential target is a set of population proportions.

Sample proportions for general health:

health count proportion
Excellent 47 0.1044
Vgood 162 0.36
Good 177 0.3933
Fair 53 0.1178
Poor 11 0.02444
(Missing) 50

\[ \begin{align*} \hat{p}_1 &= 47/450 = 0.1044 \\ \hat{p}_2 &= 162/450 = 0.3600 \\ \hat{p}_3 &= 177/450 = 0.3933 \\ \hat{p}_4 &= 53/450 = 0.1178 \\ \hat{p}_5 &= 11/450 = 0.0244 \end{align*} \]

Note that 10% of observations are missing (NA) and are removed to compute \(\hat{p}_j\)’s.

Example interpretation:

An estimated 10.44% of U.S. adults believe they are in excellent health.

Compositionality

Categorical data are “compositional” – the counts add up to the total sample size.

health count proportion
Excellent 47 0.1044
Vgood 162 0.36
Good 177 0.3933
Fair 53 0.1178
Poor 11 0.02444
(Missing) 50
SUM 500 1

As a result, the “last” proportion is determined by the rest:

\[ \begin{align*} \hat{p}_1 &= 47/450 = 0.1044 \\ \hat{p}_2 &= 162/450 = 0.3600 \\ \hat{p}_3 &= 177/450 = 0.3933 \\ \hat{p}_4 &= 53/450 = 0.1178 \\ \hat{p}_5 &= 1 - 0.1044 - 0.3600 - 0.3933 - 0.1178 \end{align*} \]

Compositionality

Categorical data are “compositional” – the counts add up to the total sample size.

health count proportion
Excellent 47 0.1044
Vgood 162 0.36
Good 177 0.3933
Fair 53 0.1178
Poor 11 0.02444
(Missing) 50
SUM 500 1

Mathematically this amounts to the constraint:

\[ \sum_{j=1}^K \hat{p}_j = 1 \quad\text{and}\quad \hat{p}_i = 1 - \sum_{j \neq i} \hat{p}_j \]

Consequences:

  • estimates \(\hat{p}_j\) are not independent
  • only need to compute \(\hat{p}_1, \dots, \hat{p}_{K - 1}\) to estimate \(p_1, \dots, p_K\)
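A quick numerical check of the constraint, using the health-perception counts above:

```r
# estimated proportions for the five health categories
p_hat <- c(47, 162, 177, 53, 11) / 450

sum(p_hat)            # compositionality: proportions sum to 1
1 - sum(p_hat[1:4])   # the last proportion is determined by the first K - 1
```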

Inference for binomial data

Consider testing whether diabetes prevalence in the U.S. is 10%. \(\quad\begin{cases} H_0: p = 0.1 \\ H_A: p \neq 0.1 \end{cases}\)

If so, one would expect a roughly 90-10 allocation of counts. Compare:

Expected counts
  Yes No
count 50 450
proportion 0.1 0.9
Observed counts
  Yes No
count 57 443
proportion 0.114 0.886

A measure of their difference is the chi (pronounced /ˈkaɪ/) square statistic:

\[ \chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(57 - 50)^2}{50} + \frac{(443 - 450)^2}{450} = 1.0889 \] If \(\chi^2\) is large enough, the data favor \(H_A\). (With the continuity correction that prop.test applies, the statistic is 0.9389.)
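The statistic can be verified directly in R; note that prop.test() also applies a continuity correction, which slightly reduces the value:

```r
# observed counts and expected counts under H0: p = 0.1
observed <- c(57, 443)
expected <- 500 * c(0.1, 0.9)

# plain chi-square statistic
sum((observed - expected)^2 / expected)             # 1.0889

# with the continuity correction used by prop.test()
sum((abs(observed - expected) - 0.5)^2 / expected)  # 0.9389
```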

Inference for binomial data

Under \(H_0\), the \(\chi^2\) statistic has a sampling distribution that can be approximated by a \(\chi^2_1\) model.

  • one degree of freedom (subscript)
  • the model assumes no expected counts are too small (rule of thumb: 5 or greater)

prop.test(x = 57, n = 500, p = 0.1)

    1-sample proportions test with continuity correction

data:  57 out of 500, null probability 0.1
X-squared = 0.93889, df = 1, p-value = 0.3326
alternative hypothesis: true p is not equal to 0.1
95 percent confidence interval:
 0.08814952 0.14594579
sample estimates:
    p 
0.114 

Interpretation:

The data provide no evidence that diabetes prevalence differs from 10% (\(\chi^2\) = 0.94 on 1 degree of freedom, p = 0.3326).

Confidence interval for \(p\)

A confidence interval for a binomial proportion \(p\) is:

\[ \hat{p} \pm c \times SE(\hat{p}) \quad\text{where}\quad SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]

The critical value \(c\) comes from either the empirical rule or from inverting the \(\chi^2\) test.

95% interval for diabetes prevalence using empirical rule:

\[ 0.114 \pm 2\times 0.0142 = (0.086, 0.142) \]

With 95% confidence, diabetes prevalence among U.S. adults is estimated to be between 8.6% and 14.2%.
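The interval arithmetic can be reproduced in R:

```r
# estimate and standard error
p_hat <- 57 / 500
se <- sqrt(p_hat * (1 - p_hat) / 500)

# 95% interval via the empirical rule
p_hat + c(-2, 2) * se  # approx. (0.086, 0.142)
```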

Why is \(SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\) ??

Measure of spread for binomial data: \[\sqrt{\hat{p}(1 - \hat{p})}\]

  • highest when \(\hat{p} \approx 0.5\)
  • lowest when \(\hat{p} \approx 0 \text{ or } 1\)

Analogous to estimating a mean: \[ SE\left(\hat{p}\right) = \frac{\text{spread}}{\sqrt{\text{sample size}}} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]

One-sample inference for multinomial data

Inference for multinomial data

Imagine comparing observations to data from a prior year. Did health perceptions change?

\[ \begin{align*} H_0: &\mathbf{p} = \overbrace{\left( 0.1330, 0.3240, 0.3521, 0.1273, 0.0637 \right)}^{\mathbf{p}_0} \\ H_A: &\mathbf{p} \neq \mathbf{p}_0 \\ \end{align*} \]

Compare observed counts with expectations if the population proportions were \(\mathbf{p}_0\):

  Excellent Vgood Good Fair Poor
observed 47 162 177 53 11
expected 59.83 145.8 158.4 57.3 28.65
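Expected counts are just the hypothesized proportions scaled by the sample size; a quick check (small discrepancies with the table reflect rounding of \(\mathbf{p}_0\)):

```r
# hypothesized proportions from the prior year
p0 <- c(0.1330, 0.3240, 0.3521, 0.1273, 0.0637)

# expected counts for the n = 450 non-missing observations
450 * p0  # approx. 59.8 145.8 158.4 57.3 28.7
```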

Inference for multinomial data

The same \(\chi^2\) statistic we used before still provides a measure of divergence:

  Excellent Vgood Good Fair Poor
observed 47 162 177 53 11
expected 59.83 145.8 158.4 57.3 28.65

\[ \chi^2 = \frac{(47 - 59.8)^2}{59.8} + \frac{(162 - 145.8)^2}{145.8} + \cdots + \frac{(11 - 28.7)^2}{28.7} \approx 17.94 \]

The only difference from before is that the formal test uses a \(\chi^2_4\) model for the sampling distribution.

  • in general, \(\text{df} = K - 1\)
  • \(K\) is the number of categories

Inference for multinomial data

# hypothesis
p0 <- c(0.1330, 0.3240, 0.3521, 0.1273, 0.0637)

# perform chi square test 
table(nhanes$healthgen) |>
  chisq.test(p = p0, rescale.p = T)

    Chi-squared test for given probabilities

data:  table(nhanes$healthgen)
X-squared = 17.94, df = 4, p-value = 0.001268

Interpretation:

The data provide evidence that the proportions differ from the prior year (\(\chi^2\) = 17.94 on 4 degrees of freedom, p = 0.0013).

Options in R

Can provide a table(...) input:

table(nhanes$healthgen) |>
  chisq.test(p = c(0.1330, 0.3240, 0.3521, 0.1273, 0.0637), 
             rescale.p = T)

    Chi-squared test for given probabilities

data:  table(nhanes$healthgen)
X-squared = 17.94, df = 4, p-value = 0.001268

Or input the counts directly:

chisq.test(x = c(47, 162, 177, 53, 11),
           p = c(0.1330, 0.3240, 0.3521, 0.1273, 0.0637), 
           rescale.p = T)

    Chi-squared test for given probabilities

data:  c(47, 162, 177, 53, 11)
X-squared = 17.94, df = 4, p-value = 0.001268

Other inputs:

  • p = ... gives the hypothesized population distribution, specified:

    • as proportions (not counts)
    • in the same order in which category counts appear in the input
  • rescale.p = T ensures hypothesized proportions sum to 1 (always include)

Residuals for \(\chi^2\) tests

Which proportions differ?

The residuals for the test are the normalized differences between observed and expected counts:

\[ \text{residual}_j = \frac{\text{observed}_j - \text{expected}_j}{\sqrt{\text{expected}_j}} \]

Think of this as the number of standard errors by which observed and expected counts differ.

  • sign indicates direction
  • value indicates magnitude (in SE’s)
Table: residuals from chi square test.
health observed expected resid
Excellent 47 60 -1.7
Vgood 162 146 1.3
Good 177 158 1.5
Fair 53 57 -0.57
Poor 11 29 -3.3

Significantly fewer than expected participants rated themselves as being in poor health.
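The residuals do not need to be computed by hand; chisq.test() stores them in the residuals component of its result:

```r
# re-run the test; residuals are (observed - expected) / sqrt(expected)
rslt <- chisq.test(x = c(47, 162, 177, 53, 11),
                   p = c(0.1330, 0.3240, 0.3521, 0.1273, 0.0637),
                   rescale.p = TRUE)
round(rslt$residuals, 2)  # matches the residual column above
```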

Confidence intervals

Confidence intervals can be computed as if estimates were for binomial proportions.

\[ \hat{p}_j \pm c\times SE(\hat{p}_j) \quad\text{where}\quad SE(\hat{p}_j) = \sqrt{\frac{\hat{p}_j(1 - \hat{p}_j)}{n}} \]

# MultinomCI() from the DescTools package
MultinomCI(x = c(47, 162, 177, 53, 11), method = 'wald')
Table: 95% confidence intervals for the health perceptions.
healthgen n p se lwr upr
Excellent 47 0.1044 0.01442 0.07561 0.1333
Vgood 162 0.36 0.02263 0.3147 0.4053
Good 177 0.3933 0.02303 0.3473 0.4394
Fair 53 0.1178 0.0152 0.08739 0.1482
Poor 11 0.02444 0.00728 0.009885 0.039

Examples: fair dice?

Suppose you rolled a die 200 times and tallied the following counts:

1 2 3 4 5 6
38 24 31 39 44 24

In terms of proportions, the frequencies of each roll are:

1 2 3 4 5 6
0.19 0.12 0.155 0.195 0.22 0.12
  • 1’s, 4’s, and 5’s are more frequent than expected
  • 2’s, 3’s, and 6’s are less frequent than expected

To test whether the die is fair:

# test whether the die is fair
chisq.test(x = c(38, 24, 31, 39, 44, 24),
           p = c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6),
           rescale.p = T)

    Chi-squared test for given probabilities

data:  c(38, 24, 31, 39, 44, 24)
X-squared = 10.42, df = 5, p-value = 0.06417

The data do not provide evidence that the die is unfair (\(\chi^2\) = 10.42 on 5 degrees of freedom, p = 0.06417).

… but they are certainly suggestive (the test would be significant at the 10% level).

Examples: fair dice?

Table: 95% confidence intervals.
roll est lwr.ci upr.ci
1 0.19 0.1356 0.2444
2 0.12 0.07496 0.165
3 0.155 0.1048 0.2052
4 0.195 0.1401 0.2499
5 0.22 0.1626 0.2774
6 0.12 0.07496 0.165
Table: residuals from chi square test.
roll observed expected resid
1 38 33 0.81
2 24 33 -1.6
3 31 33 -0.4
4 39 33 0.98
5 44 33 1.8
6 24 33 -1.6

The residuals and intervals tell slightly different stories.

  • The intervals for 2 and 6 miss the fair value (0.167).
  • The residual for 5 is the largest (but followed closely by 2 and 6).

Examples: segregation distortion?

Mendelian genetics assumes alleles have an equal chance of being passed on to offspring.

This entails that offspring of two AB individuals are expected to exhibit a 1:2:1 genotypic ratio:

  • 25% of offspring will be AA
  • 50% of offspring will be AB
  • 25% of offspring will be BB

Otherwise there is what’s known as segregation distortion.

Researchers tested for segregation distortion at 250 gene loci using data from 451 offspring of AB individuals of a model plant species (Arabidopsis thaliana). At one locus:

AA AB BB
84 233 134
chisq.test(x = c(84, 233, 134),
           p = c(0.25, 0.5, 0.25))

    Chi-squared test for given probabilities

data:  c(84, 233, 134)
X-squared = 11.585, df = 2, p-value = 0.00305

Since 250 loci were tested, the p-value is adjusted for multiple comparisons (Bonferroni: \(0.0031 \times 250 \approx 0.775\)). The data do not provide evidence of segregation distortion at the specified gene location (\(\chi^2 = 11.585\) on 2 degrees of freedom, adjusted p = 0.775).

Examples: segregation distortion?

Table: 95% confidence intervals and residuals from chi square test.
genotype estimate lwr.ci upr.ci observed expected residual
AA 0.1863 0.1397 0.237 84 112.8 -2.708
AB 0.5166 0.4701 0.5674 233 225.5 0.4994
BB 0.2971 0.2506 0.3478 134 112.8 2.001

In this case residuals and intervals tell the same story:

  • the AA genotype has a lower-than-expected frequency
  • the BB genotype has a higher-than-expected frequency

Two-sample inference for binomial data

Comparing binomial proportions

Is vitamin C effective at preventing common cold?

Vitamin C experiment
  Cold NoCold n
Placebo 335 76 411
VitC 302 105 407

  • vitamin C and placebo supplements were randomly allocated to 818 volunteers
  • volunteers took supplements daily for a cold season
  • study recorded how many volunteers came down with a cold

To answer the question, we can test the hypotheses \(\begin{cases} H_0: &p_1 = p_2 \\ H_A: &p_1 \neq p_2 \end{cases}\)

Expected counts

If there is no effect of vitamin C supplements and \(p_1 = p_2 = p\) then:

  • expected counts are just the sample sizes allocated by the common proportion \(p\)
  • marginal totals provide an estimate of \(p\)

For example:

  • \(\hat{p} = 637/818 = 0.7787\) of all participants had a cold
  • so an expected 77.87% of the placebo group would have a cold:\[0.7787 \times 411 = \frac{637}{818}\times 411 = 320.1\]
Observed counts.
  Cold NoCold n
Placebo 335 76 411
VitC 302 105 407
both 637 181 818
Expected counts.
  Cold NoCold n
Placebo 320.1 90.94 411
VitC 316.9 90.06 407
both 637 181 818
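The full table of expected counts can be computed from the margins in one step:

```r
# observed counts
tbl <- rbind(Placebo = c(Cold = 335, NoCold = 76),
             VitC    = c(Cold = 302, NoCold = 105))

# expected counts under H0: (row total x column total) / grand total
outer(rowSums(tbl), colSums(tbl)) / sum(tbl)  # matches the expected counts above
```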

\(\chi^2\) test for binomial proportions

Observed counts.
  Cold NoCold
Placebo 335 76
VitC 302 105
Expected counts.
  Cold NoCold
Placebo 320.1 90.94
VitC 316.9 90.06

As in the one-sample case, the \(\chi^2\) statistic measures the divergence:

\[ \chi^2 = \frac{(335 - 320.1)^2}{320.1} + \frac{(76 - 90.94)^2}{90.94} + \frac{(302 - 316.9)^2}{316.9} + \frac{(105 - 90.06)^2}{90.06} = 6.34 \]

And we compare with a \(\chi^2_1\) model to determine a p-value.

\(\chi^2\) test for binomial proportions

Typically a “continuity correction” is applied. This slightly diminishes the test statistic for a more conservative test:

\[ \chi^2_\text{adj} = \frac{(|335 - 320.1| - 0.5)^2}{320.1} + \cdots + \frac{(|105 - 90.06| - 0.5)^2}{90.06} = 5.92 \]

In R:

chisq.test(vitamin$treatment, vitamin$outcome)

    Pearson's Chi-squared test with Yates' continuity correction

data:  vitamin$treatment and vitamin$outcome
X-squared = 5.9196, df = 1, p-value = 0.01497

Interpretation:

The data provide evidence of an effect of vitamin C on the probability of having a common cold (\(\chi^2\) = 5.92 on 1 degree of freedom, p = 0.01497).

So there is an effect. But how much of a difference is there?

Confidence interval for the difference

A confidence interval for the difference in proportions is:

\[ \hat{p}_1 - \hat{p}_2 \pm c\times SE(\hat{p}_1 - \hat{p}_2) \quad\text{where}\quad \begin{cases} SE(\hat{p}_1 - \hat{p}_2) = \sqrt{SE^2(\hat{p}_1) + SE^2(\hat{p}_2)} \\ SE^2(\hat{p}_1) = \frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} \\ SE^2(\hat{p}_2) = \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2} \end{cases} \]

The critical value \(c\) comes from either the empirical rule or from inverting the \(\chi^2\) test.

For the vitamin C data:

  • point estimate: \(\hat{p}_\text{placebo} - \hat{p}_\text{vitC} = 0.0731\)

  • standard error: \(\sqrt{SE^2(\hat{p}_\text{placebo}) + SE^2(\hat{p}_\text{vitC})} = 0.0289\)

  • 95% confidence interval: \(0.0731 \pm 2\times 0.0289 = (0.0153, 0.1309)\)
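In R, this hand calculation is:

```r
# sample proportions
p1 <- 335/411   # placebo
p2 <- 302/407   # vitamin C

# standard error of the difference
se <- sqrt(p1 * (1 - p1) / 411 + p2 * (1 - p2) / 407)

# 95% interval via the empirical rule
(p1 - p2) + c(-2, 2) * se  # approx. (0.015, 0.131)
```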

Implementation in R

# variables of interest
treatment <- vitamin$treatment
outcome <- vitamin$outcome

# pass table to prop.test
table(treatment, outcome) |>
  prop.test()

    2-sample test for equality of proportions with continuity correction

data:  table(treatment, outcome)
X-squared = 5.9196, df = 1, p-value = 0.01497
alternative hypothesis: two.sided
95 percent confidence interval:
 0.01391972 0.13222111
sample estimates:
   prop 1    prop 2 
0.8150852 0.7420147 

Details:

  • order in table() matters
  • grouping should be first
  • outcome variable should be second

Test interpretation:

The data provide evidence of an effect of vitamin C on the probability of having a common cold (\(\chi^2\) = 5.92 on 1 degree of freedom, p = 0.01497).

Interval interpretation:

With 95% confidence, the probability of getting a common cold is estimated to be between 1.39% and 13.22% lower among adults who take daily vitamin C supplements.

Two-sample inference for multinomial data

Comparing multinomial proportions

Did U.S. adults’ health perceptions change between 2009-2010 and 2011-2012?

  Excellent Vgood Good Fair Poor Total
2009_10 27 78 88 28 5 226
2011_12 20 84 89 25 6 224

To answer this question, we want to compare multinomial proportions, i.e., to test:

\[ \begin{cases} H_0: &\mathbf{p}_1 = \mathbf{p}_2 \\ H_A: &\mathbf{p}_1 \neq \mathbf{p}_2 \end{cases} \]

Expected counts

If \(\mathbf{p}_1 = \mathbf{p}_2 = \mathbf{p}\) then the marginal category totals provide an estimate of \(\mathbf{p}\):

  Excellent Vgood Good Fair Poor
both 0.1044 0.36 0.3933 0.1178 0.02444

And expected counts are simply allocations of the group sizes according to \(\mathbf{p}\):

  Excellent Vgood Good Fair Poor n
2009_10 23.6 81.36 88.89 26.62 5.524 226
2011_12 23.4 80.64 88.11 26.38 5.476 224
both 47 162 177 53 11 450

For example, an expected 36% of the 2009-2010 respondents would report being in very good health:

\[ 81.36 = 0.36 \times 226 = \frac{162}{450}\times 226 \]

\(\chi^2\) test for multinomial proportions

Observed counts
  Excellent Vgood Good Fair Poor
2009_10 27 78 88 28 5
2011_12 20 84 89 25 6
Expected counts
  Excellent Vgood Good Fair Poor
2009_10 23.6 81.36 88.89 26.62 5.524
2011_12 23.4 80.64 88.11 26.38 5.476

The \(\chi^2\) statistic is the same as usual:

\[ \chi^2 = \frac{(27 - 23.6)^2}{23.6} + \frac{(78 - 81.36)^2}{81.36} + \cdots + \frac{(6 - 5.476)^2}{5.476} = 1.52 \]

\(\chi^2\) test for multinomial proportions

\[ \chi^2 = \frac{(27 - 23.6)^2}{23.6} + \frac{(78 - 81.36)^2}{81.36} + \cdots + \frac{(6 - 5.476)^2}{5.476} = 1.52 \]

A \(\chi^2_{K - 1}\) model is used to compute the \(p\)-value for the test.

table(nhanes$surveyyr, nhanes$healthgen) |>
  chisq.test()

    Pearson's Chi-squared test

data:  table(nhanes$surveyyr, nhanes$healthgen)
X-squared = 1.5223, df = 4, p-value = 0.8227

The data do not provide evidence that U.S. adults’ health perceptions changed between survey periods (\(\chi^2\) = 1.52 on 4 degrees of freedom, p = 0.8227).

Residual analysis

If there had been a significant difference, we could have used residuals to explain the change.

  Excellent Vgood Good Fair Poor
2009_10 0.6989 -0.3725 -0.09475 0.2679 -0.2231
2011_12 -0.702 0.3742 0.09517 -0.2691 0.2241

While magnitudes are small, the signs show that from 2009-2010 to 2011-2012…

  • “excellent” perceptions decreased
  • “very good” perceptions increased
  • “good” perceptions increased
  • “fair” perceptions decreased
  • “poor” perceptions increased

Confidence intervals

Confidence intervals are obtained by category-wise comparisons of binomial proportions.

95% confidence intervals for differences in proportions.
  p_2009_10 p_2011_12 diff se ci.lwr ci.upr
Excellent 0.1195 0.08929 -0.03018 0.1491 -0.3285 0.2681
Vgood 0.3451 0.375 0.02987 0.5096 -0.9894 1.049
Good 0.3894 0.3973 0.007941 0.5563 -1.105 1.121
Fair 0.1239 0.1116 -0.01229 0.1668 -0.3458 0.3212
Poor 0.02212 0.02679 0.004662 0.03474 -0.06482 0.07414

Notice that all intervals span zero: it is plausible that there was no change in any category.

Extending \(\chi^2\) tests to \(I \times J\) tables

Do ACTN3 genotype frequencies vary by race?

FAMuSS data:

  CC CT TT total
African Am 16 6 5 27
Asian 21 18 16 55
Caucasian 125 216 126 467
Hispanic 4 10 9 23
Other 7 11 5 23
total 173 261 161 595

Expected counts:

  CC CT TT total
African Am 7.85 11.84 7.31 27
Asian 15.99 24.13 14.88 55
Caucasian 135.8 204.8 126.4 467
Hispanic 6.69 10.09 6.22 23
Other 6.69 10.09 6.22 23
total 173 261 161 595
  • expected counts and chi-square statistic are calculated exactly the same way
  • degrees of freedom are now \((I - 1)\times(J - 1)\)

Extending \(\chi^2\) tests to \(I \times J\) tables

In detail:

  CC CT TT
African Am \(\frac{(16 - 7.85)^2}{7.85}\) \(\frac{(6 - 11.84)^2}{11.84}\) \(\frac{(5 - 7.306)^2}{7.306}\)
Asian \(\frac{(21 - 15.99)^2}{15.99}\) \(\frac{(18 - 24.13)^2}{24.13}\) \(\frac{(16 -14.88)^2}{14.88}\)
Caucasian \(\frac{(125 - 135.8)^2}{135.8}\) \(\frac{(216 - 204.9)^2}{204.9}\) \(\frac{(126 - 126.4)^2}{126.4}\)
Hispanic \(\frac{(4 - 6.687)^2}{6.687}\) \(\frac{(10 - 10.09)^2}{10.09}\) \(\frac{(9 - 6.224)^2}{6.224}\)
Other \(\frac{(7 - 6.687)^2}{6.687}\) \(\frac{(11 - 10.09)^2}{10.09}\) \(\frac{(5 - 6.224)^2}{6.224}\)

Then:

\[\chi^2 = \sum \text{all cells above} = 19.4\]

Implementation in R

The implementation is the same as for a \(2\times K\) table:

# construct table and pass to chisq.test
table(famuss$race, famuss$genotype) |>
  chisq.test()

    Pearson's Chi-squared test

data:  table(famuss$race, famuss$genotype)
X-squared = 19.4, df = 8, p-value = 0.01286

The data provide evidence of an association between race and genotype (\(\chi^2\) = 19.4 on 8 degrees of freedom, p = 0.01286).

Residual analysis

Which genotype/race combinations are contributing most to this inferred association?

# store result of test; display residuals
rslt <- table(famuss$race, famuss$genotype) |>
  chisq.test()
rslt$residuals
  CC CT TT
African Am 2.909 -1.698 -0.8531
Asian 1.252 -1.247 0.2897
Caucasian -0.9254 0.7789 -0.03244
Hispanic -1.039 -0.02804 1.113
Other 0.1209 0.2868 -0.4905

Look for the largest absolute residuals to explain inferred association.

African American and Asian populations have higher CC and lower CT frequencies than would be expected if genotype were independent of race.

Relative risk

Asthma data

Consider estimating the difference in proportions:

  asthma no asthma
male 30 769
female 49 781
table(asthma$sex, asthma$asthma) |>
  prop.test(conf.level = 0.9)

    2-sample test for equality of proportions with continuity correction

data:  table(asthma$sex, asthma$asthma)
X-squared = 3.6217, df = 1, p-value = 0.05703
alternative hypothesis: two.sided
90 percent confidence interval:
 -0.040137075 -0.002841347
sample estimates:
    prop 1     prop 2 
0.03754693 0.05903614 

With 90% confidence, asthma prevalence is estimated to be between 0.28 and 4.01 percentage points higher among women than among men.

Is a difference of up to 4 percentage points practically meaningful? Well, it depends:

  • yes if prevalence is very low
  • no if prevalence is very high

Relative risk

If \(p_F, p_M\) are the (population) proportions of women and men with asthma, then the relative risk of asthma among women compared with men is defined as:

\[ RR = \frac{p_F}{p_M} \qquad \left(\frac{\text{risk among women}}{\text{risk among men}}\right) \]

An estimate of the relative risk is simply the ratio of estimated proportions. For the asthma data, an estimate is: \[ \widehat{RR} = \frac{\hat{p}_F}{\hat{p}_M} = \frac{0.059}{0.038} = 1.57 \]

It is estimated that the risk of asthma among women is 1.57 times greater than among men.

Confidence intervals for relative risk

A normal model can be used to approximate the sampling distribution of \(\log(RR)\) and construct a confidence interval. If \(\hat{p}_1\) and \(\hat{p}_2\) are the two estimated proportions:

\[\log\left(\widehat{RR}\right) \pm c \times SE\left(\log\left(\widehat{RR}\right)\right) \quad\text{where}\quad SE\left(\log\left(\widehat{RR}\right)\right) = \sqrt{\frac{1 - \hat{p}_1}{\hat{p}_1n_1} + \frac{1 - \hat{p}_2}{\hat{p}_2n_2}}\]

Exponentiate endpoints to obtain an interval for relative risk.
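For the asthma data, the interval can be reproduced by hand before turning to epitools:

```r
# risks from the asthma table: 30/799 men, 49/830 women
p_m <- 30/799
p_f <- 49/830

# relative risk and the SE of its log
rr <- p_f / p_m
se_log <- sqrt((1 - p_f)/49 + (1 - p_m)/30)  # (1 - p)/(p * n) = (1 - p)/count

# 90% interval: normal critical value on the log scale, then exponentiate
exp(log(rr) + c(-1, 1) * qnorm(0.95) * se_log)  # approx. (1.083, 2.282)
```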

table(asthma$sex, asthma$asthma) |>
  riskratio(rev = 'columns', 
            method = 'wald', 
            conf.level = 0.9,
            correction = T)
         risk ratio with 90% C.I.
Predictor estimate    lower    upper
   male   1.000000       NA       NA
   female 1.572329 1.083353 2.282007

With 90% confidence, the risk of asthma is estimated to be between 1.08 and 2.28 times greater for women than for men.

Implementation with epitools

table(asthma$sex, asthma$asthma) |>
  riskratio(rev = 'columns', 
            method = 'wald', 
            conf.level = 0.9,
            correction = T)
$data
        
         no asthma asthma Total
  male         769     30   799
  female       781     49   830
  Total       1550     79  1629

$measure
        risk ratio with 90% C.I.
         estimate    lower    upper
  male   1.000000       NA       NA
  female 1.572329 1.083353 2.282007

$p.value
        two-sided
         midp.exact fisher.exact chi.square
  male           NA           NA         NA
  female 0.04412095   0.04961711 0.05703135

$correction
[1] TRUE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

riskratio is picky about input format:

  • outcome of interest should be second column
  • group of interest should be second row

It will return the relative risk

\[ RR = \frac{n_{22}/n_{2\cdot}}{n_{12}/n_{1\cdot}} \]

The data table can be reoriented using rev

  • rev = 'neither' keeps original orientation
  • rev = 'rows' reverses order of rows
  • rev = 'columns' reverses order of columns
  • rev = 'both' reverses both

Reporting results

$data
        
         no asthma asthma Total
  male         769     30   799
  female       781     49   830
  Total       1550     79  1629

$measure
        risk ratio with 90% C.I.
         estimate    lower    upper
  male   1.000000       NA       NA
  female 1.572329 1.083353 2.282007

$p.value
        two-sided
         midp.exact fisher.exact chi.square
  male           NA           NA         NA
  female 0.04412095   0.04961711 0.05703135

$correction
[1] TRUE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

The data provide evidence of an association between asthma and sex (\(\chi^2\) = 3.62 on 1 degree of freedom, p = 0.057). With 90% confidence, the risk of asthma is estimated to be between 1.08 and 2.28 times greater for women than for men, with a point estimate of 1.57.

Conventional style:

  • first report the test result
  • then the measure of association
  • include point estimates (since interval is asymmetric)

Treatment efficacy: reduction in risk

In a randomized trial for a malaria vaccine, 20 individuals were randomly allocated to receive a dose of the vaccine or a placebo.

Vaccine trials often estimate relative reduction in risk or “efficacy”:

\[ \underbrace{\frac{\hat{p}_\text{ctrl} - \hat{p}_\text{trt}}{\hat{p}_\text{ctrl}}}_\text{efficacy} = 1 - RR \]

  no infection infection
placebo 0 6
vaccine 9 5
# relative risk
rr.out <- table(malaria$treatment, malaria$outcome) |>
            riskratio(method = 'wald', correction = T)
rr.out$measure
         risk ratio with 95% C.I.
           estimate     lower     upper
  placebo 1.0000000        NA        NA
  vaccine 0.3571429 0.1768593 0.7212006
# efficacy
1 - rr.out$measure
         risk ratio with 95% C.I.
           estimate     lower     upper
  placebo 0.0000000        NA        NA
  vaccine 0.6428571 0.8231407 0.2787994

The vaccine reduces the risk of infection by an estimated 27.9% to 82.3%. (Subtracting from 1 reverses the interval, so the lower and upper columns in the printed output are swapped.)

Odds ratios

Example: case-control study

  Smokers NonSmokers total
Cancer 83 3 86
Control 72 14 86

Note: not possible to estimate the case rate (cancer prevalence) due to the study design.

# chi square test of association
table(smoking$group, smoking$smoking) |>
  chisq.test()

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(smoking$group, smoking$smoking)
X-squared = 6.5275, df = 1, p-value = 0.01062

The data provide evidence of an association between smoking and lung cancer (\(\chi^2 = 6.53\) on 1 degree of freedom, \(p = 0.0106\)).

How do we measure the association, considering we can’t estimate case rates?

Odds

If \(p\) is the true cancer prevalence (a population proportion), then the odds of cancer are:

\[ \text{odds} = \frac{p}{1 - p} \]

The odds represent the relative likelihood of an outcome.

  • \(\text{odds} = 2\): the outcome (cancer) is twice as likely to occur as to not occur
  • \(\text{odds} = 1/2\): the outcome (cancer) is half as likely to occur as to not occur
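As a quick numeric illustration:

```r
# odds corresponding to a probability p
odds <- function(p) p / (1 - p)

odds(2/3)  # 2:   twice as likely to occur as not
odds(1/3)  # 0.5: half as likely to occur as not
```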

Odds ratios

Let \(a, b, c, d\) denote population proportions.

\(\;\) outcome 1 (O1) outcome 2 (O2)
group 1 (G1) a b
group 2 (G2) c d

The odds of outcome 1 (O1) in each group are:

  • \(\frac{\textcolor{red}{a}}{\textcolor{blue}{b}}\) in group 1 (G1)
  • \(\frac{\textcolor{orange}{c}}{\textcolor{purple}{d}}\) in group 2 (G2)

The odds ratio or “relative odds” is:

\[ \omega = \frac{\text{odds}_{G1}(O1)}{\text{odds}_{G2}(O1)} = \frac{\textcolor{red}{a}/\textcolor{blue}{b}}{\textcolor{orange}{c}/\textcolor{purple}{d}} = \frac{\textcolor{red}{a}\textcolor{purple}{d}}{\textcolor{blue}{b}\textcolor{orange}{c}} \]

A surprising algebraic fact is that:

\[ \frac{\text{odds}_{G1}(O1)}{\text{odds}_{G2}(O1)} =\frac{\text{odds}_{O1}(G1)}{\text{odds}_{O2}(G1)} \]

relative odds of cancer given smoking status = relative odds of smoking given cancer status

Estimating odds ratios

The estimate is the same calculation as on the previous slide, but with sample counts.

\(\;\) Smoker (O1) NonSmoker (O2)
Case (G1) 83 3
Control (G2) 72 14

Estimate of \(\omega\): \[ \hat{\omega} = \frac{\textcolor{red}{83}\times\textcolor{purple}{14}}{\textcolor{blue}{3}\times\textcolor{orange}{72}} = 5.38 \]

Interpretation:

It is estimated that the relative odds of lung cancer are 5.38 times greater for smokers compared with nonsmokers.

Confidence intervals for odds ratios

The sampling distribution of the log odds ratio can be approximated by a normal model.

\[ \log\left(\hat{\omega}\right) \pm c \times SE\left(\log\left(\hat{\omega}\right)\right) \quad\text{where}\quad SE\left(\log\left(\hat{\omega}\right)\right) = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}} \]
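For the smoking data, the Wald interval can be reproduced by hand:

```r
# cell counts: cases (83 smokers, 3 nonsmokers), controls (72 smokers, 14 nonsmokers)
n11 <- 83; n12 <- 3; n21 <- 72; n22 <- 14

# odds ratio and the SE of its log
or <- (n11 * n22) / (n12 * n21)
se_log <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)

# 95% interval: normal critical value on the log scale, then exponentiate
exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log)  # approx. (1.486, 19.470)
```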

The oddsratio(...) function in the epitools package will compute and back-transform the interval for you.

table(smoking$group, smoking$smoking) |>
  oddsratio(rev = 'both', method = 'wald')
         odds ratio with 95% C.I.
          estimate    lower    upper
  Control  1.00000       NA       NA
  Cancer   5.37963 1.486376 19.47045
  • critical value \(c\) from the normal model
  • exponentiate to obtain an interval for \(\omega\)

With 95% confidence, the relative odds of lung cancer are estimated to be between 1.49 and 19.47 times greater for smokers compared with nonsmokers.

Implementation with epitools

table(smoking$group, smoking$smoking) |>
  oddsratio(rev = 'both', 
            method = 'wald',
            conf.level = 0.95,
            correction = T)
$data
         
          NonSmokers Smokers Total
  Control         14      72    86
  Cancer           3      83    86
  Total           17     155   172

$measure
         odds ratio with 95% C.I.
          estimate    lower    upper
  Control  1.00000       NA       NA
  Cancer   5.37963 1.486376 19.47045

$p.value
         two-sided
           midp.exact fisher.exact chi.square
  Control          NA           NA         NA
  Cancer  0.005116319  0.008822805 0.01062183

$correction
[1] TRUE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

oddsratio is picky about data inputs:

  • outcome of interest should be second column

  • group of interest should be second row

It will return the odds ratio

\[ \frac{\text{odds}_\text{R2} (\text{C2})}{\text{odds}_\text{R1}(\text{C2})} = \frac{n_{22}/n_{21}}{n_{12}/n_{11}} = \frac{n_{22} n_{11}}{n_{12}n_{21}} = \frac{da}{cb} \]

The data table can be reoriented using rev

  • rev = 'neither' keeps original orientation
  • rev = 'rows' reverses order of rows
  • rev = 'columns' reverses order of columns
  • rev = 'both' reverses both

Reporting results

$data
         
          NonSmokers Smokers Total
  Control         14      72    86
  Cancer           3      83    86
  Total           17     155   172

$measure
         odds ratio with 95% C.I.
          estimate    lower    upper
  Control  1.00000       NA       NA
  Cancer   5.37963 1.486376 19.47045

$p.value
         two-sided
           midp.exact fisher.exact chi.square
  Control          NA           NA         NA
  Cancer  0.005116319  0.008822805 0.01062183

$correction
[1] TRUE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

First report the test result, then the measure of association:

The data provide evidence of an association between smoking and lung cancer (\(\chi^2 = 6.53\) on 1 degree of freedom, \(p = 0.0106\)). With 95% confidence, the relative odds of cancer are estimated to be between 1.49 and 19.47 times greater among smokers compared with nonsmokers, with a point estimate of 5.38.

Be sure to include the point estimate, since the interval estimate is asymmetric.

Risk or odds? Rules of thumb

If a study design employs outcome-based sampling, proportions are not estimable.

  • analysis must use relative odds

Otherwise, analysis may use any measure of association.

  • difference in proportions
  • relative risk
  • treatment efficacy
  • relative odds

Risk or odds? Cyclosporiasis outbreak

An outbreak of cyclosporiasis was detected among residents of New Jersey. In a case-control study, investigators found that 21 of 30 cases and 4 of 60 controls had eaten raspberries.

Outcome-based sampling means…

  • can’t estimate risk
  • analysis should use odds ratio
oddsratio(outbreak$exposure, outbreak$group, 
          rev = 'columns', method = 'wald', 
          correction = T)
                odds ratio with 95% C.I.
Predictor        estimate    lower    upper
  no raspberries  1.00000       NA       NA
  raspberries    32.66667 9.081425 117.5048
  no raspberries raspberries
case 9 21
control 56 4

The data provide evidence of an association between raspberry consumption and illness (\(\chi^2\) = 36.89 on 1 degree of freedom, p < 0.0001). With 95% confidence, the odds of illness are estimated to be between 9.08 and 117.5 times higher among those who consumed raspberries during the outbreak.

Risk or odds? smoking and CHD

A cohort study of 3,000 smokers and 5,000 nonsmokers investigated the link between smoking and the development of coronary heart disease (CHD) over 1 year.

Two independent samples but not outcome-based sampling…

  • can estimate risk
  • analysis can use any measure
  • RR (arguably) most natural
riskratio(chd$smoking, chd$chd, rev = 'columns',
          method = 'wald', correction = T)
           risk ratio with 95% C.I.
Predictor   estimate    lower    upper
  nonsmoker 1.000000       NA       NA
  smoker    1.609195 1.196452 2.164325
  CHD no CHD
nonsmoker 87 4913
smoker 84 2916

The data provide evidence that smoking is associated with coronary heart disease (\(\chi^2\) = 9.5711 on 1 degree of freedom, p = 0.00198). With 95% confidence, the risk of CHD is estimated to be between 1.196 and 2.164 times greater among smokers compared with nonsmokers.