Applied statistics for life sciences
Consider the frequency of diabetes among a random sample of \(n = 500\) American adults:
Or, to take another example, consider the frequency of self-rated general health.
For categorical data the usual inferential target is a set of population proportions.
Sample proportions for diabetes data:
diabetes | count | proportion |
---|---|---|
Yes | 57 | 0.114 |
No | 443 | 0.886 |
\[ \begin{align*} \hat{p}_1 &= 57/500 = 0.114 \\ \hat{p}_2 &= 443/500 = 0.886 \end{align*} \]
Example interpretation:
Diabetes prevalence among U.S. adults is estimated to be 11.4%.
Sample proportions for general health:
health | count | proportion |
---|---|---|
Excellent | 47 | 0.1044 |
Vgood | 162 | 0.36 |
Good | 177 | 0.3933 |
Fair | 53 | 0.1178 |
Poor | 11 | 0.02444 |
(Missing) | 50 | – |
\[ \begin{align*} \hat{p}_1 &= 47/450 = 0.1044 \\ \hat{p}_2 &= 162/450 = 0.3600 \\ \hat{p}_3 &= 177/450 = 0.3933 \\ \hat{p}_4 &= 53/450 = 0.1178 \\ \hat{p}_5 &= 11/450 = 0.0244 \end{align*} \]
Note that 10% of observations are missing (NA) and are removed to compute \(\hat{p}_j\)’s.
Example interpretation:
An estimated 10.44% of U.S. adults believe they are in excellent health.
Categorical data are “compositional” – the counts add up to the total sample size.
health | count | proportion |
---|---|---|
Excellent | 47 | 0.1044 |
Vgood | 162 | 0.36 |
Good | 177 | 0.3933 |
Fair | 53 | 0.1178 |
Poor | 11 | 0.02444 |
(Missing) | 50 | – |
SUM | 500 | 1 |
As a result, the “last” proportion is determined by the rest:
\[ \begin{align*} \hat{p}_1 &= 47/450 = 0.1044 \\ \hat{p}_2 &= 162/450 = 0.3600 \\ \hat{p}_3 &= 177/450 = 0.3933 \\ \hat{p}_4 &= 53/450 = 0.1178 \\ \hat{p}_5 &= 1 - 0.1044 - 0.3600 - 0.3933 - 0.1178 \end{align*} \]
Mathematically this amounts to the constraint:
\[ \sum_{j=1}^K \hat{p}_j = 1 \quad\text{and}\quad \hat{p}_i = 1 - \sum_{j \neq i} \hat{p}_j \]
Consequences: only \(K - 1\) of the proportions are free to vary, which is why the tests that follow use \(K - 1\) degrees of freedom.
Consider testing whether diabetes prevalence in the U.S. is 10%. \(\quad\begin{cases} H_0: p = 0.1 \\ H_A: p \neq 0.1 \end{cases}\)
If so, one would expect a roughly 90-10 allocation of counts. Compare:
Expected counts under \(H_0\):
Yes | No | |
---|---|---|
count | 50 | 450 |
proportion | 0.1 | 0.9 |
Observed counts:
Yes | No | |
---|---|---|
count | 57 | 443 |
proportion | 0.114 | 0.886 |
A measure of their difference is the chi (pronounced /ˈkaɪ/) square statistic:
\[ \chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(57 - 50)^2}{50} + \frac{(443 - 450)^2}{450} = 1.09 \] If \(\chi^2\) is large enough, the data favor \(H_A\). (The slightly smaller value of 0.94 reported by R below arises because prop.test() applies a continuity correction, subtracting 0.5 from each absolute difference before squaring.)
Under \(H_0\), the \(\chi^2\) statistic has a sampling distribution that can be approximated by a \(\chi^2_1\) model (shown at right).
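In R, prop.test() carries out this test; the call below reproduces the output that follows.
# one-sample proportion test for diabetes prevalence
prop.test(x = 57, n = 500, p = 0.1)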
1-sample proportions test with continuity correction
data: 57 out of 500, null probability 0.1
X-squared = 0.93889, df = 1, p-value = 0.3326
alternative hypothesis: true p is not equal to 0.1
95 percent confidence interval:
0.08814952 0.14594579
sample estimates:
p
0.114
Interpretation:
The data provide no evidence that diabetes prevalence differs from 10% (\(\chi^2\) = 0.94 on 1 degree of freedom, p = 0.3326).
A confidence interval for a binomial proportion \(p\) is:
\[ \hat{p} \pm c \times SE(\hat{p}) \quad\text{where}\quad SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]
The critical value \(c\) comes from either the empirical rule or from inverting the \(\chi^2\) test.
95% interval for diabetes prevalence using the empirical rule:
\[ 0.114 \pm 2\times 0.0142 = (0.086, 0.142) \]
With 95% confidence, diabetes prevalence among U.S. adults is estimated to be between 8.6% and 14.2% (close to the interval reported by prop.test(), which is obtained by inverting the continuity-corrected test).
Measure of spread for binomial data: \[\sqrt{\hat{p}(1 - \hat{p})}\]
Analogous to estimating a mean: \[ SE\left(\hat{p}\right) = \frac{\text{spread}}{\sqrt{\text{sample size}}} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]
Imagine comparing observations to data from a prior year. Did health perceptions change?
\[ \begin{align*} H_0: &\mathbf{p} = \overbrace{\left( 0.13, 0.32, 0.35, 0.13, 0.06 \right)}^{\mathbf{p}_0} \\ H_A: &\mathbf{p} \neq \mathbf{p}_0 \\ \end{align*} \]
Compare observed counts with expectations if the population proportions were \(\mathbf{p}_0\):
Excellent | Vgood | Good | Fair | Poor | |
---|---|---|---|---|---|
observed | 47 | 162 | 177 | 53 | 11 |
expected | 59.83 | 145.8 | 158.4 | 57.3 | 28.65 |
The same \(\chi^2\) statistic we used before still provides a measure of divergence:
\[ \chi^2 = \frac{(47 - 59.8)^2}{59.8} + \frac{(162 - 145.8)^2}{145.8} + \cdots + \frac{(11 - 28.7)^2}{28.7} = 17.93 \]
The only difference from before is that the formal test uses a \(\chi^2_4\) model for the sampling distribution.
Interpretation:
The data provide evidence that the proportions differ from the prior year (\(\chi^2\) = 17.94 on 4 degrees of freedom, p = 0.0013).
In R, the test is carried out with chisq.test(...), which accepts either a table(...) or a vector of counts as input.
Other inputs:
p = ... gives the hypothesized population distribution, specified as a vector of proportions
rescale.p = T ensures the hypothesized proportions sum to 1 (always include)
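For the health perception example, a call of this form carries out the test (a sketch; the hypothesized proportions here are the rounded values shown earlier, so the statistic will differ slightly from the \(\chi^2 = 17.93\) computed from the exact prior-year figures).
# goodness-of-fit test against the prior-year distribution
gof <- chisq.test(x = c(47, 162, 177, 53, 11),
                  p = c(0.13, 0.32, 0.35, 0.13, 0.06),
                  rescale.p = TRUE)
gof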
Which proportions differ?
The residuals for the test are the normalized differences between observed and expected counts:
\[ \text{residual}_j = \frac{\text{observed}_j - \text{expected}_j}{\sqrt{\text{expected}_j}} \]
Think of this as the number of standard errors by which observed and expected counts differ.
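In R, the residuals are stored in the test object; using the hypothetical gof object from the sketch above, they can be extracted directly (again, slightly different from the table below because of the rounded hypothesized proportions).
# Pearson residuals: (observed - expected) / sqrt(expected)
gof$residuals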
health | observed | expected | resid |
---|---|---|---|
Excellent | 47 | 60 | -1.7 |
Vgood | 162 | 146 | 1.3 |
Good | 177 | 158 | 1.5 |
Fair | 53 | 57 | -0.57 |
Poor | 11 | 29 | -3.3 |
Significantly fewer than expected participants rated themselves as being in poor health.
Confidence intervals can be computed as if estimates were for binomial proportions.
\[ \hat{p}_j \pm c\times SE(\hat{p}_j) \quad\text{where}\quad SE(\hat{p}_j) = \sqrt{\frac{\hat{p}_j(1 - \hat{p}_j)}{n}} \]
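For instance, the interval for the Excellent category in the table below can be computed directly, using \(c = 2\) from the empirical rule.
# interval for the 'Excellent' category
n <- 450
p_hat <- 47 / n
se <- sqrt(p_hat * (1 - p_hat) / n)
p_hat + c(-2, 2) * se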
healthgen | n | p | se | lwr | upr |
---|---|---|---|---|---|
Excellent | 47 | 0.1044 | 0.01442 | 0.07561 | 0.1333 |
Vgood | 162 | 0.36 | 0.02263 | 0.3147 | 0.4053 |
Good | 177 | 0.3933 | 0.02303 | 0.3473 | 0.4394 |
Fair | 53 | 0.1178 | 0.0152 | 0.08739 | 0.1482 |
Poor | 11 | 0.02444 | 0.00728 | 0.009885 | 0.039 |
Suppose you rolled a die 200 times and tallied the following counts:
1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|
38 | 24 | 31 | 39 | 44 | 24 |
In terms of proportions, the frequencies of each roll are:
1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|
0.19 | 0.12 | 0.155 | 0.195 | 0.22 | 0.12 |
To test whether the die is fair:
# test whether the die is fair
chisq.test(x = c(38, 24, 31, 39, 44, 24),
           p = c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6),
           rescale.p = T)
Chi-squared test for given probabilities
data: c(38, 24, 31, 39, 44, 24)
X-squared = 10.42, df = 5, p-value = 0.06417
The data do not provide evidence that the die is unfair (\(\chi^2\) = 10.42 on 5 degrees of freedom, p = 0.06417).
… but they are certainly suggestive (the test would be significant at the 10% level).
roll | est | lwr.ci | upr.ci |
---|---|---|---|
1 | 0.19 | 0.1356 | 0.2444 |
2 | 0.12 | 0.07496 | 0.165 |
3 | 0.155 | 0.1048 | 0.2052 |
4 | 0.195 | 0.1401 | 0.2499 |
5 | 0.22 | 0.1626 | 0.2774 |
6 | 0.12 | 0.07496 | 0.165 |
roll | observed | expected | resid |
---|---|---|---|
1 | 38 | 33 | 0.81 |
2 | 24 | 33 | -1.6 |
3 | 31 | 33 | -0.4 |
4 | 39 | 33 | 0.98 |
5 | 44 | 33 | 1.8 |
6 | 24 | 33 | -1.6 |
The residuals and intervals tell slightly different stories.
Mendelian genetics assumes alleles have an equal chance of being passed on to offspring.
This entails that offspring of two AB individuals are expected to exhibit a 1:2:1 genotypic ratio (AA : AB : BB):
Otherwise there is what’s known as segregation distortion.
Researchers tested for segregation distortion at 250 gene loci using data from 451 offspring of AB individuals of a model plant species (Arabidopsis thaliana). At one locus:
AA | AB | BB |
---|---|---|
84 | 233 | 134 |
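A sketch of the single-locus test of the 1:2:1 ratio; the unadjusted p-value from this call is then corrected for the 250 loci tested to give the adjusted value reported below.
# test the 1:2:1 Mendelian ratio at this locus
chisq.test(x = c(84, 233, 134), p = c(0.25, 0.50, 0.25))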
The data do not provide evidence of segregation distortion at this locus (\(\chi^2 = 11.585\) on 2 degrees of freedom; p = 0.775 after adjustment for the 250 loci tested).
genotype | estimate | lwr.ci | upr.ci | observed | expected | residual |
---|---|---|---|---|---|---|
AA | 0.1863 | 0.1397 | 0.237 | 84 | 112.8 | -2.708 |
AB | 0.5166 | 0.4701 | 0.5674 | 233 | 225.5 | 0.4994 |
BB | 0.2971 | 0.2506 | 0.3478 | 134 | 112.8 | 2.001 |
In this case residuals and intervals tell the same story:
Is vitamin C effective at preventing the common cold?
Cold | NoCold | n | |
---|---|---|---|
Placebo | 335 | 76 | 411 |
VitC | 302 | 105 | 407 |
To answer the question, we can test the hypotheses \(\begin{cases} H_0: &p_1 = p_2 \\ H_A: &p_1 \neq p_2 \end{cases}\)
If there is no effect of vitamin C supplements and \(p_1 = p_2 = p\), then the pooled sample proportion \(\hat{p} = 637/818\) estimates \(p\), and expected counts are obtained by allocating each group's total according to \(\hat{p}\).
For example: \(320.1 = \frac{637}{818}\times 411\).
Cold | NoCold | n | |
---|---|---|---|
Placebo | 335 | 76 | 411 |
VitC | 302 | 105 | 407 |
both | 637 | 181 | 818 |
Cold | NoCold | n | |
---|---|---|---|
Placebo | 320.1 | 90.94 | 411 |
VitC | 316.9 | 90.06 | 407 |
both | 637 | 181 | 818 |
Observed:
Cold | NoCold | |
---|---|---|
Placebo | 335 | 76 |
VitC | 302 | 105 |
Expected:
Cold | NoCold | |
---|---|---|
Placebo | 320.1 | 90.94 |
VitC | 316.9 | 90.06 |
As in the one-sample case, the \(\chi^2\) statistic measures the divergence:
\[ \chi^2 = \frac{(335 - 320.1)^2}{320.1} + \frac{(76 - 90.94)^2}{90.94} + \frac{(302 - 316.9)^2}{316.9} + \frac{(105 - 90.06)^2}{90.06} = 6.34 \]
And we compare with a \(\chi^2_1\) model to determine a p-value.
Typically a “continuity correction” is applied. This slightly diminishes the test statistic for a more conservative test:
\[ \chi^2_\text{adj} = \frac{(|335 - 320.1| - 0.5)^2}{320.1} + \cdots + \frac{(|105 - 90.06| - 0.5)^2}{90.06} = 5.92 \]
In R, the test is carried out with prop.test() applied to the two-way table of counts (code and output shown below).
Interpretation:
The data provide evidence of an effect of vitamin C on the probability of having a common cold (\(\chi^2\) = 5.92 on 1 degree of freedom, p = 0.01497).
So there is an effect. But how much of a difference is there?
A confidence interval for the difference in proportions is:
\[ \hat{p}_1 - \hat{p}_2 \pm c\times SE(\hat{p}_1 - \hat{p}_2) \quad\text{where}\quad \begin{cases} SE(\hat{p}_1 - \hat{p}_2) = \sqrt{SE^2(\hat{p}_1) + SE^2(\hat{p}_2)} \\ SE^2(\hat{p}_1) = \frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} \\ SE^2(\hat{p}_2) = \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2} \end{cases} \]
The critical value \(c\) comes from either the empirical rule or from inverting the \(\chi^2\) test.
For the vitamin C data:
point estimate: \(\hat{p}_\text{placebo} - \hat{p}_\text{vitC} = 0.0731\)
standard error: \(\sqrt{SE^2(\hat{p}_\text{placebo}) + SE^2(\hat{p}_\text{vitC})} = 0.0289\)
95% confidence interval: \(0.0731 \pm 1.96\times 0.0289 = (0.0164, 0.1298)\)
# variables of interest
treatment <- vitamin$treatment
outcome <- vitamin$outcome
# pass table to prop.test
table(treatment, outcome) |>
prop.test()
2-sample test for equality of proportions with continuity correction
data: table(treatment, outcome)
X-squared = 5.9196, df = 1, p-value = 0.01497
alternative hypothesis: two.sided
95 percent confidence interval:
0.01391972 0.13222111
sample estimates:
prop 1 prop 2
0.8150852 0.7420147
Details:
The construction of table(...) matters: rows are treated as the groups being compared, and prop 1 and prop 2 are the row-wise proportions in the first outcome column.
Test interpretation:
The data provide evidence of an effect of vitamin C on the probability of having a common cold (\(\chi^2\) = 5.92 on 1 degree of freedom, p = 0.01497).
Interval interpretation:
With 95% confidence, the probability of getting a common cold is estimated to be between 1.64 and 12.98 percentage points lower among adults who take daily vitamin C supplements.
Did U.S. adults’ health perceptions change between 2009-2010 and 2011-2012?
Excellent | Vgood | Good | Fair | Poor | Total | |
---|---|---|---|---|---|---|
2009_10 | 27 | 78 | 88 | 28 | 5 | 226 |
2011_12 | 20 | 84 | 89 | 25 | 6 | 224 |
To answer this question, we want to compare multinomial proportions, i.e., to test:
\[ \begin{cases} H_0: &\mathbf{p}_1 = \mathbf{p}_2 \\ H_A: &\mathbf{p}_1 \neq \mathbf{p}_2 \end{cases} \]
If \(\mathbf{p}_1 = \mathbf{p}_2 = \mathbf{p}\) then the marginal category totals provide an estimate of \(\mathbf{p}\):
Excellent | Vgood | Good | Fair | Poor | |
---|---|---|---|---|---|
both | 0.1044 | 0.36 | 0.3933 | 0.1178 | 0.02444 |
And expected counts are simply allocations of the group sizes according to \(\mathbf{p}\):
Excellent | Vgood | Good | Fair | Poor | n | |
---|---|---|---|---|---|---|
2009_10 | 23.6 | 81.36 | 88.89 | 26.62 | 5.524 | 226 |
2011_12 | 23.4 | 80.64 | 88.11 | 26.38 | 5.476 | 224 |
both | 47 | 162 | 177 | 53 | 11 | 450 |
For example, an expected 36% of the 2009-2010 respondents would report being in very good health:
\[ 81.36 = 0.36 \times 226 = \frac{162}{450}\times 226 \]
Observed:
Excellent | Vgood | Good | Fair | Poor | |
---|---|---|---|---|---|
2009_10 | 27 | 78 | 88 | 28 | 5 |
2011_12 | 20 | 84 | 89 | 25 | 6 |
Expected:
Excellent | Vgood | Good | Fair | Poor | |
---|---|---|---|---|---|
2009_10 | 23.6 | 81.36 | 88.89 | 26.62 | 5.524 |
2011_12 | 23.4 | 80.64 | 88.11 | 26.38 | 5.476 |
The \(\chi^2\) statistic is the same as usual:
\[ \chi^2 = \frac{(27 - 23.6)^2}{23.6} + \frac{(78 - 81.36)^2}{81.36} + \cdots + \frac{(6 - 5.476)^2}{5.476} = 1.52 \]
A \(\chi^2_{K - 1}\) model is used to compute the \(p\)-value for the test.
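The output below is produced by passing the two-way table to chisq.test() (assuming the nhanes data frame named in the output).
# compare health perception distributions across survey periods
chisq.test(table(nhanes$surveyyr, nhanes$healthgen))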
Pearson's Chi-squared test
data: table(nhanes$surveyyr, nhanes$healthgen)
X-squared = 1.5223, df = 4, p-value = 0.8227
The data do not provide evidence that U.S. adults’ health perceptions changed between survey periods (\(\chi^2\) = 1.52 on 4 degrees of freedom, p = 0.8227).
If there had been a significant difference, we could have used residuals to explain the change.
Excellent | Vgood | Good | Fair | Poor | |
---|---|---|---|---|---|
2009_10 | 0.6989 | -0.3725 | -0.09475 | 0.2679 | -0.2231 |
2011_12 | -0.702 | 0.3742 | 0.09517 | -0.2691 | 0.2241 |
While magnitudes are small, the signs show that from 2009-2010 to 2011-2012…
Confidence intervals are obtained by category-wise comparisons of binomial proportions.
p_2009_10 | p_2011_12 | diff | se | ci.lwr | ci.upr | |
---|---|---|---|---|---|---|
Excellent | 0.1195 | 0.08929 | -0.03018 | 0.1491 | -0.3285 | 0.2681 |
Vgood | 0.3451 | 0.375 | 0.02987 | 0.5096 | -0.9894 | 1.049 |
Good | 0.3894 | 0.3973 | 0.007941 | 0.5563 | -1.105 | 1.121 |
Fair | 0.1239 | 0.1116 | -0.01229 | 0.1668 | -0.3458 | 0.3212 |
Poor | 0.02212 | 0.02679 | 0.004662 | 0.03474 | -0.06482 | 0.07414 |
Notice that all intervals span zero: it is plausible that there was no change in any category.
Do ACTN3 genotype frequencies vary by race?
FAMuSS data:
CC | CT | TT | total | |
---|---|---|---|---|
African Am | 16 | 6 | 5 | 27 |
Asian | 21 | 18 | 16 | 55 |
Caucasian | 125 | 216 | 126 | 467 |
Hispanic | 4 | 10 | 9 | 23 |
Other | 7 | 11 | 5 | 23 |
total | 173 | 261 | 161 | 595 |
Expected counts:
CC | CT | TT | total | |
---|---|---|---|---|
African Am | 7.85 | 11.84 | 7.31 | 27 |
Asian | 15.99 | 24.13 | 14.88 | 55 |
Caucasian | 135.8 | 204.8 | 126.4 | 467 |
Hispanic | 6.69 | 10.09 | 6.22 | 23 |
Other | 6.69 | 10.09 | 6.22 | 23 |
total | 173 | 261 | 161 | 595 |
In detail:
CC | CT | TT | |
---|---|---|---|
African Am | \(\frac{(16 - 7.85)^2}{7.85}\) | \(\frac{(6 - 11.84)^2}{11.84}\) | \(\frac{(5 - 7.306)^2}{7.306}\) |
Asian | \(\frac{(21 - 15.99)^2}{15.99}\) | \(\frac{(18 - 24.13)^2}{24.13}\) | \(\frac{(16 -14.88)^2}{14.88}\) |
Caucasian | \(\frac{(125 - 135.8)^2}{135.8}\) | \(\frac{(216 - 204.9)^2}{204.9}\) | \(\frac{(126 - 126.4)^2}{126.4}\) |
Hispanic | \(\frac{(4 - 6.687)^2}{6.687}\) | \(\frac{(10 - 10.09)^2}{10.09}\) | \(\frac{(9 - 6.224)^2}{6.224}\) |
Other | \(\frac{(7 - 6.687)^2}{6.687}\) | \(\frac{(11 - 10.09)^2}{10.09}\) | \(\frac{(5 - 6.224)^2}{6.224}\) |
Then:
\[\chi^2 = \sum \text{all cells above} = 19.4\]
The implementation is the same as for a \(2\times K\) table:
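A call of this form (assuming the famuss data frame named in the output) produces the result below.
# test for association between race and ACTN3 genotype
chisq.test(table(famuss$race, famuss$genotype))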
Pearson's Chi-squared test
data: table(famuss$race, famuss$genotype)
X-squared = 19.4, df = 8, p-value = 0.01286
The data provide evidence of an association between race and genotype (\(\chi^2\) = 19.4 on 8 degrees of freedom, p = 0.01286).
Which genotype/race combinations are contributing most to this inferred association?
CC | CT | TT | |
---|---|---|---|
African Am | 2.909 | -1.698 | -0.8531 |
Asian | 1.252 | -1.247 | 0.2897 |
Caucasian | -0.9254 | 0.7789 | -0.03244 |
Hispanic | -1.039 | -0.02804 | 1.113 |
Other | 0.1209 | 0.2868 | -0.4905 |
Look for the largest absolute residuals to explain inferred association.
African American and Asian populations have higher CC and lower CT frequencies than would be expected if genotype were independent of race.
Consider estimating the difference in proportions:
asthma | no asthma | |
---|---|---|
male | 30 | 769 |
female | 49 | 781 |
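The output below comes from prop.test() applied to the two-way table with a 90% confidence level (a sketch, assuming the asthma data frame named in the output).
# compare asthma prevalence between sexes, 90% confidence
table(asthma$sex, asthma$asthma) |>
  prop.test(conf.level = 0.9)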
2-sample test for equality of proportions with continuity correction
data: table(asthma$sex, asthma$asthma)
X-squared = 3.6217, df = 1, p-value = 0.05703
alternative hypothesis: two.sided
90 percent confidence interval:
-0.040137075 -0.002841347
sample estimates:
prop 1 prop 2
0.03754693 0.05903614
With 90% confidence, asthma prevalence is estimated to be between 0.28 and 4.01 percentage points higher among women than among men.
Is a difference of up to 4 percentage points practically meaningful? Well, it depends:
If \(p_F, p_M\) are the (population) proportions of women and men with asthma, then the relative risk of asthma among women compared with men is defined as:
\[ RR = \frac{p_F}{p_M} \qquad \left(\frac{\text{risk among women}}{\text{risk among men}}\right) \]
An estimate of the relative risk is simply the ratio of estimated proportions. For the asthma data, an estimate is: \[ \widehat{RR} = \frac{\hat{p}_F}{\hat{p}_M} = \frac{0.0590}{0.0375} = 1.57 \]
It is estimated that the risk of asthma among women is 1.57 times greater than among men.
A normal model can be used to approximate the sampling distribution of \(\log\left(\widehat{RR}\right)\) and construct a confidence interval. If \(\hat{p}_1\) and \(\hat{p}_2\) are the two estimated proportions:
\[\log\left(\widehat{RR}\right) \pm c \times SE\left(\log\left(\widehat{RR}\right)\right) \quad\text{where}\quad SE\left(\log\left(\widehat{RR}\right)\right) = \sqrt{\frac{1 - \hat{p}_1}{\hat{p}_1 n_1} + \frac{1 - \hat{p}_2}{\hat{p}_2 n_2}}\]
Exponentiate endpoints to obtain an interval for relative risk.
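For the asthma data (49 of 830 women and 30 of 799 men with asthma), the 90% interval reported below can be reproduced by hand using \(c \approx 1.645\).
# Wald interval for the relative risk, women vs. men (90% confidence)
p_f <- 49/830; p_m <- 30/799
se_log <- sqrt((1 - p_f)/49 + (1 - p_m)/30)   # (1 - p)/(p*n) = (1 - p)/x
exp(log(p_f / p_m) + c(-1.645, 1.645) * se_log)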
risk ratio with 90% C.I.
Predictor estimate lower upper
male 1.000000 NA NA
female 1.572329 1.083353 2.282007
With 90% confidence, the risk of asthma is estimated to be between 1.08 and 2.28 times greater for women than for men.
Using the riskratio(...) function from the epitools package:
library(epitools)

table(asthma$sex, asthma$asthma) |>
  riskratio(rev = 'columns',
            method = 'wald',
            conf.level = 0.9,
            correction = T)
$data
no asthma asthma Total
male 769 30 799
female 781 49 830
Total 1550 79 1629
$measure
risk ratio with 90% C.I.
estimate lower upper
male 1.000000 NA NA
female 1.572329 1.083353 2.282007
$p.value
two-sided
midp.exact fisher.exact chi.square
male NA NA NA
female 0.04412095 0.04961711 0.05703135
$correction
[1] TRUE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
riskratio(...) is picky about input format:
It will return the relative risk
\[
RR = \frac{n_{22}/n_{2\cdot}}{n_{12}/n_{1\cdot}}
\]
where \(n_{i\cdot}\) denotes the row-\(i\) total. The data table can be reoriented using the rev argument:
rev = 'neither' keeps the original orientation
rev = 'rows' reverses the order of rows
rev = 'columns' reverses the order of columns
rev = 'both' reverses both
The data provide evidence of an association between asthma and sex (\(\chi^2\) = 3.62 on 1 degree of freedom, p = 0.057). With 90% confidence, the risk of asthma is estimated to be between 1.08 and 2.28 times greater for women than for men, with a point estimate of 1.57.
Conventional style:
In a randomized trial for a malaria vaccine, 20 individuals were randomly allocated to receive a dose of the vaccine or a placebo.
Vaccine trials often estimate relative reduction in risk or “efficacy”:
\[ \underbrace{\frac{\hat{p}_\text{ctrl} - \hat{p}_\text{trt}}{\hat{p}_\text{ctrl}}}_\text{efficacy} = 1 - RR \]
no infection | infection | |
---|---|---|
placebo | 0 | 6 |
vaccine | 9 | 5 |
# relative risk
rr.out <- table(malaria$treatment, malaria$outcome) |>
riskratio(method = 'wald', correction = T)
rr.out$measure
risk ratio with 95% C.I.
estimate lower upper
placebo 1.0000000 NA NA
vaccine 0.3571429 0.1768593 0.7212006
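Efficacy and its interval can then be obtained by subtracting from one; a sketch, using the rr.out object above (note that the column labels carry over from the risk-ratio output, so the lower and upper limits appear interchanged).
# efficacy = 1 - RR
1 - rr.out$measure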
risk ratio with 95% C.I.
estimate lower upper
placebo 0.0000000 NA NA
vaccine 0.6428571 0.8231407 0.2787994
The vaccine reduces the risk of infection by an estimated 27.9% to 82.3%.
Smokers | NonSmokers | total | |
---|---|---|---|
Cancer | 83 | 3 | 86 |
Control | 72 | 14 | 86 |
Note: not possible to estimate the case rate (cancer prevalence) due to the study design.
The data provide evidence of an association between smoking and lung cancer (\(\chi^2 = 6.53\) on 1 degree of freedom, \(p = 0.0106\)).
How do we measure the association, considering we can’t estimate case rates?
If \(p\) is the true cancer prevalence (a population proportion), then the odds of cancer are:
\[ \text{odds} = \frac{p}{1 - p} \]
The odds represent the relative likelihood of an outcome.
Let \(a, b, c, d\) denote population proportions.
\(\;\) | outcome 1 (O1) | outcome 2 (O2) |
---|---|---|
group 1 (G1) | a | b |
group 2 (G2) | c | d |
The odds of outcome 1 (O1) in each group are \(\text{odds}_{G1}(O1) = a/b\) and \(\text{odds}_{G2}(O1) = c/d\).
The odds ratio or “relative odds” is:
\[ \omega = \frac{\text{odds}_{G1}(O1)}{\text{odds}_{G2}(O1)} = \frac{\textcolor{red}{a}/\textcolor{blue}{b}}{\textcolor{orange}{c}/\textcolor{purple}{d}} = \frac{\textcolor{red}{a}\textcolor{purple}{d}}{\textcolor{blue}{b}\textcolor{orange}{c}} \]
A surprising algebraic fact is that:
\[ \frac{\text{odds}_{G1}(O1)}{\text{odds}_{G2}(O1)} =\frac{\text{odds}_{O1}(G1)}{\text{odds}_{O2}(G1)} \]
relative odds of cancer given smoking status = relative odds of smoking given cancer status
The estimate is the same calculation as on the previous slide, but with sample counts.
\(\;\) | Smoker (O1) | NonSmoker (O2) |
---|---|---|
Case (G1) | 83 | 3 |
Control (G2) | 72 | 14 |
Estimate of \(\omega\): \[ \hat{\omega} = \frac{\textcolor{red}{83}\times\textcolor{purple}{14}}{\textcolor{blue}{3}\times\textcolor{orange}{72}} = 5.38 \]
Interpretation:
It is estimated that the relative odds of lung cancer are 5.38 times greater for smokers compared with nonsmokers.
The sampling distribution of the log odds ratio can be approximated by a normal model.
\[ \log\left(\hat{\omega}\right) \pm c \times SE\left(\log\left(\hat{\omega}\right)\right) \quad\text{where}\quad SE\left(\log\left(\hat{\omega}\right)\right) = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}} \]
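As a check, the 95% interval reported below can be reproduced by hand from the table counts.
# Wald interval for the odds ratio (95% confidence)
or_hat <- (83 * 14) / (3 * 72)
se_log <- sqrt(1/83 + 1/3 + 1/72 + 1/14)
exp(log(or_hat) + c(-1.96, 1.96) * se_log)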
The oddsratio(...) function in the epitools package will compute and back-transform the interval for you.
odds ratio with 95% C.I.
estimate lower upper
Control 1.00000 NA NA
Cancer 5.37963 1.486376 19.47045
With 95% confidence, the relative odds of lung cancer are estimated to be between 1.49 and 19.47 times greater for smokers compared with nonsmokers.
Using the oddsratio(...) function from the epitools package:
library(epitools)

table(smoking$group, smoking$smoking) |>
  oddsratio(rev = 'both',
            method = 'wald',
            conf.level = 0.95,
            correction = T)
$data
NonSmokers Smokers Total
Control 14 72 86
Cancer 3 83 86
Total 17 155 172
$measure
odds ratio with 95% C.I.
estimate lower upper
Control 1.00000 NA NA
Cancer 5.37963 1.486376 19.47045
$p.value
two-sided
midp.exact fisher.exact chi.square
Control NA NA NA
Cancer 0.005116319 0.008822805 0.01062183
$correction
[1] TRUE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
oddsratio(...) is picky about data inputs:
outcome of interest should be second column
group of interest should be second row
It will return the odds ratio
\[ \frac{\text{odds}_\text{R2} (\text{C2})}{\text{odds}_\text{R1}(\text{C2})} = \frac{n_{22}/n_{21}}{n_{12}/n_{11}} = \frac{n_{22} n_{11}}{n_{12}n_{21}} = \frac{da}{cb} \]
The data table can be reoriented using the rev argument:
rev = 'neither' keeps the original orientation
rev = 'rows' reverses the order of rows
rev = 'columns' reverses the order of columns
rev = 'both' reverses both
First report the test result, then the measure of association:
The data provide evidence of an association between smoking and lung cancer (\(\chi^2 = 6.53\) on 1 degree of freedom, \(p = 0.0106\)). With 95% confidence, the relative odds of cancer are estimated to be between 1.49 and 19.47 times greater among smokers compared with nonsmokers, with a point estimate of 5.38.
Be sure to include the point estimate, since the interval estimate is asymmetric.
If a study design employs outcome-based sampling, proportions are not estimable, so the odds ratio is the appropriate measure of association.
Otherwise, the analysis may use any measure of association.
An outbreak of cyclosporiasis was detected among residents of New Jersey. In a case-control study, investigators found that 21 of 30 cases and 4 of 60 controls had eaten raspberries.
Outcome-based sampling means…
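A sketch that would reproduce the measure below, entering the counts from the table directly (the original data object is not shown; the matrix name is hypothetical).
library(epitools)

# rows: exposure; columns: outcome (outcome of interest in the second column)
rasp_tab <- matrix(c(56, 9, 4, 21), nrow = 2, byrow = TRUE,
                   dimnames = list(c('no raspberries', 'raspberries'),
                                   c('no illness', 'illness')))
oddsratio(rasp_tab, method = 'wald')$measure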
odds ratio with 95% C.I.
Predictor estimate lower upper
no raspberries 1.00000 NA NA
raspberries 32.66667 9.081425 117.5048
no raspberries | raspberries | |
---|---|---|
case | 9 | 21 |
control | 56 | 4 |
The data provide evidence of an association between raspberry consumption and illness (\(\chi^2\) = 36.89 on 1 degree of freedom, p < 0.0001). With 95% confidence, the odds of illness are estimated to be between 9.08 and 117.5 times greater among those who consumed raspberries during the outbreak than among those who did not.
A cohort study of 3,000 smokers and 5,000 nonsmokers investigated the link between smoking and the development of coronary heart disease (CHD) over 1 year.
Two independent samples but not outcome-based sampling…
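A sketch that would reproduce the measure below, entering the counts from the table directly (the original data object is not shown; the matrix name is hypothetical).
library(epitools)

# rows: smoking status; columns: outcome (CHD in the second column)
chd_tab <- matrix(c(4913, 87, 2916, 84), nrow = 2, byrow = TRUE,
                  dimnames = list(c('nonsmoker', 'smoker'),
                                  c('no CHD', 'CHD')))
riskratio(chd_tab, method = 'wald')$measure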
risk ratio with 95% C.I.
Predictor estimate lower upper
nonsmoker 1.000000 NA NA
smoker 1.609195 1.196452 2.164325
CHD | no CHD | |
---|---|---|
nonsmoker | 87 | 4913 |
smoker | 84 | 2916 |
The data provide evidence that smoking is associated with coronary heart disease (\(\chi^2\) = 9.5711 on 1 degree of freedom, p = 0.00198). With 95% confidence, the risk of CHD is estimated to be between 1.196 and 2.164 times greater among smokers compared with nonsmokers.