Categorical data analysis

Applied statistics for life sciences

One-sample inference for binomial data

Categorical data

Consider the frequency of diabetes among a random sample of \(n = 500\) American adults:

  • data values are Y/N \(\Rightarrow\) nominal data
  • exactly 2 categories \(\Rightarrow\) “binomial” data

Categorical data

Or, to take another example, consider the frequency of self-rated general health.

  • values are ranked labels \(\Rightarrow\) ordinal data
  • \(>2\) categories \(\Rightarrow\) “multinomial” data

Proportions

For categorical data the usual inferential target is a set of population proportions.

Sample proportions for diabetes data:

diabetes count proportion
Yes 57 0.114
No 443 0.886

\[ \begin{align*} \hat{p}_1 &= 57/500 = 0.114 \\ \hat{p}_2 &= 443/500 = 0.886 \end{align*} \]

Example interpretation:

Diabetes prevalence among U.S. adults is estimated to be 11.4%.
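These proportions can be computed directly in R; a quick sketch using the counts above:

```r
# diabetes counts from the sample above
counts <- c(Yes = 57, No = 443)

# sample proportions: each count divided by the total
prop.table(counts)  # Yes 0.114, No 0.886
```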

Proportions

For categorical data the usual inferential target is a set of population proportions.

Sample proportions for general health:

health count proportion
Excellent 47 0.1044
Vgood 162 0.36
Good 177 0.3933
Fair 53 0.1178
Poor 11 0.02444
(Missing) 50

\[ \begin{align*} \hat{p}_1 &= 47/450 = 0.1044 \\ \hat{p}_2 &= 162/450 = 0.3600 \\ \hat{p}_3 &= 177/450 = 0.3933 \\ \hat{p}_4 &= 53/450 = 0.1178 \\ \hat{p}_5 &= 11/450 = 0.0244 \end{align*} \]

Note that 10% of observations are missing (NA) and are removed to compute \(\hat{p}_j\)’s.

Example interpretation:

An estimated 10.44% of U.S. adults believe they are in excellent health.

Compositionality

Categorical data are “compositional” – the counts add up to the total sample size.

health count proportion
Excellent 47 0.1044
Vgood 162 0.36
Good 177 0.3933
Fair 53 0.1178
Poor 11 0.02444
(Missing) 50
SUM 500 1

As a result, the “last” proportion is determined by the rest:

\[ \begin{align*} \hat{p}_1 &= 47/450 = 0.1044 \\ \hat{p}_2 &= 162/450 = 0.3600 \\ \hat{p}_3 &= 177/450 = 0.3933 \\ \hat{p}_4 &= 53/450 = 0.1178 \\ \hat{p}_5 &= 1 - 0.1044 - 0.3600 - 0.3933 - 0.1178 \end{align*} \]

Compositionality

Categorical data are “compositional” – the counts add up to the total sample size.

health count proportion
Excellent 47 0.1044
Vgood 162 0.36
Good 177 0.3933
Fair 53 0.1178
Poor 11 0.02444
(Missing) 50
SUM 500 1

Mathematically this amounts to the constraint:

\[ \sum_{j=1}^K \hat{p}_j = 1 \quad\text{and}\quad \hat{p}_i = 1 - \sum_{j \neq i} \hat{p}_j \]

Consequences:

  • estimates \(\hat{p}_j\) are not independent
  • only need to compute \(\hat{p}_1, \dots, \hat{p}_{K - 1}\) to estimate \(p_1, \dots, p_K\)
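A quick numerical check of the constraint, using the health-perception counts above:

```r
# estimated proportions for the five health categories
p_hat <- c(47, 162, 177, 53, 11) / 450

sum(p_hat)            # compositionality: proportions sum to 1
1 - sum(p_hat[1:4])   # the last proportion is determined by the first K - 1
```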

Inference for binomial data

Consider testing whether diabetes prevalence in the U.S. is 10%. \(\quad\begin{cases} H_0: p = 0.1 \\ H_A: p \neq 0.1 \end{cases}\)

If so, one would expect a roughly 90-10 allocation of counts. Compare:

Expected counts
  Yes No
count 50 450
proportion 0.1 0.9
Observed counts
  Yes No
count 57 443
proportion 0.114 0.886

A measure of their difference is the chi (pronounced /ˈkaɪ/) square statistic:

\[ \chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(57 - 50)^2}{50} + \frac{(443 - 450)^2}{450} = 1.0889 \] If \(\chi^2\) is large enough, the data favor \(H_A\). (With the continuity correction that prop.test applies, the statistic is 0.9389.)
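The statistic can be verified directly in R; note that prop.test() also applies a continuity correction, which slightly reduces the value:

```r
# observed counts and expected counts under H0: p = 0.1
observed <- c(57, 443)
expected <- 500 * c(0.1, 0.9)

# plain chi-square statistic
sum((observed - expected)^2 / expected)             # 1.0889

# with the continuity correction used by prop.test()
sum((abs(observed - expected) - 0.5)^2 / expected)  # 0.9389
```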

Inference for binomial data

Under \(H_0\), the \(\chi^2\) statistic has a sampling distribution that can be approximated by a \(\chi^2_1\) model.

  • one degree of freedom (subscript)
  • the model assumes no expected counts are too small (rule of thumb: 5 or greater)

prop.test(x = 57, n = 500, p = 0.1)

    1-sample proportions test with continuity correction

data:  57 out of 500, null probability 0.1
X-squared = 0.93889, df = 1, p-value = 0.3326
alternative hypothesis: true p is not equal to 0.1
95 percent confidence interval:
 0.08814952 0.14594579
sample estimates:
    p 
0.114 

Interpretation:

The data provide no evidence that diabetes prevalence differs from 10% (\(\chi^2\) = 0.94 on 1 degree of freedom, p = 0.3326).

Confidence interval for \(p\)

A confidence interval for a binomial proportion \(p\) is:

\[ \hat{p} \pm c \times SE(\hat{p}) \quad\text{where}\quad SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]

The critical value \(c\) comes from either the empirical rule or from inverting the \(\chi^2\) test.

95% interval for diabetes prevalence using empirical rule:

\[ 0.114 \pm 2\times 0.0142 = (0.086, 0.142) \]

With 95% confidence, diabetes prevalence among U.S. adults is estimated to be between 8.6% and 14.2%.
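The interval arithmetic can be reproduced in R:

```r
# estimate and standard error
p_hat <- 57 / 500
se <- sqrt(p_hat * (1 - p_hat) / 500)

# 95% interval via the empirical rule
p_hat + c(-2, 2) * se  # approx. (0.086, 0.142)
```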

Why is \(SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\) ??

Measure of spread for binomial data: \[\sqrt{\hat{p}(1 - \hat{p})}\]

  • highest when \(\hat{p} \approx 0.5\)
  • lowest when \(\hat{p} \approx 0 \text{ or } 1\)

Analogous to estimating a mean: \[ SE\left(\hat{p}\right) = \frac{\text{spread}}{\sqrt{\text{sample size}}} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]

One-sample inference for multinomial data

Inference for multinomial data

Imagine comparing observations to data from a prior year. Did health perceptions change?

\[ \begin{align*} H_0: &\mathbf{p} = \overbrace{\left( 0.1330, 0.3240, 0.3521, 0.1273, 0.0637 \right)}^{\mathbf{p}_0} \\ H_A: &\mathbf{p} \neq \mathbf{p}_0 \\ \end{align*} \]

Compare observed counts with expectations if the population proportions were \(\mathbf{p}_0\):

  Excellent Vgood Good Fair Poor
observed 47 162 177 53 11
expected 59.83 145.8 158.4 57.3 28.65
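Expected counts are just the hypothesized proportions scaled by the sample size; a quick check (small discrepancies with the table reflect rounding of \(\mathbf{p}_0\)):

```r
# hypothesized proportions from the prior year
p0 <- c(0.1330, 0.3240, 0.3521, 0.1273, 0.0637)

# expected counts for the n = 450 non-missing observations
450 * p0  # approx. 59.8 145.8 158.4 57.3 28.7
```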

Inference for multinomial data

The same \(\chi^2\) statistic we used before still provides a measure of divergence:

  Excellent Vgood Good Fair Poor
observed 47 162 177 53 11
expected 59.83 145.8 158.4 57.3 28.65

\[ \chi^2 = \frac{(47 - 59.8)^2}{59.8} + \frac{(162 - 145.8)^2}{145.8} + \cdots + \frac{(11 - 28.7)^2}{28.7} \approx 17.94 \]

The only difference from before is that the formal test uses a \(\chi^2_4\) model for the sampling distribution.

  • in general, \(\text{df} = K - 1\)
  • \(K\) is the number of categories

Inference for multinomial data

# hypothesis
p0 <- c(0.1330, 0.3240, 0.3521, 0.1273, 0.0637)

# perform chi square test 
table(nhanes$healthgen) |>
  chisq.test(p = p0, rescale.p = T)

    Chi-squared test for given probabilities

data:  table(nhanes$healthgen)
X-squared = 17.94, df = 4, p-value = 0.001268

Interpretation:

The data provide evidence that the proportions differ from the prior year (\(\chi^2\) = 17.94 on 4 degrees of freedom, p = 0.0013).

Options in R

Can provide a table(...) input:

table(nhanes$healthgen) |>
  chisq.test(p = c(0.1330, 0.3240, 0.3521, 0.1273, 0.0637), 
             rescale.p = T)

    Chi-squared test for given probabilities

data:  table(nhanes$healthgen)
X-squared = 17.94, df = 4, p-value = 0.001268

Or input the counts directly:

chisq.test(x = c(47, 162, 177, 53, 11),
           p = c(0.1330, 0.3240, 0.3521, 0.1273, 0.0637), 
           rescale.p = T)

    Chi-squared test for given probabilities

data:  c(47, 162, 177, 53, 11)
X-squared = 17.94, df = 4, p-value = 0.001268

Other inputs:

  • p = ... gives the hypothesized population distribution, specified:

    • as proportions (not counts)
    • in the same order in which category counts appear in the input
  • rescale.p = T ensures hypothesized proportions sum to 1 (always include)

Residuals for \(\chi^2\) tests

Which proportions differ?

The residuals for the test are the normalized differences between observed and expected counts:

\[ \text{residual}_j = \frac{\text{observed}_j - \text{expected}_j}{\sqrt{\text{expected}_j}} \]

Think of this as the number of standard errors by which observed and expected counts differ.

  • sign indicates direction
  • value indicates magnitude (in SE’s)
Table: residuals from chi square test.
health observed expected resid
Excellent 47 60 -1.7
Vgood 162 146 1.3
Good 177 158 1.5
Fair 53 57 -0.57
Poor 11 29 -3.3

Significantly fewer than expected participants rated themselves as being in poor health.
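The residuals do not need to be computed by hand; chisq.test() stores them in the residuals component of its result:

```r
# re-run the test; residuals are (observed - expected) / sqrt(expected)
rslt <- chisq.test(x = c(47, 162, 177, 53, 11),
                   p = c(0.1330, 0.3240, 0.3521, 0.1273, 0.0637),
                   rescale.p = TRUE)
round(rslt$residuals, 2)  # matches the residual column above
```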

Confidence intervals

Confidence intervals can be computed as if estimates were for binomial proportions.

\[ \hat{p}_j \pm c\times SE(\hat{p}_j) \quad\text{where}\quad SE(\hat{p}_j) = \sqrt{\frac{\hat{p}_j(1 - \hat{p}_j)}{n}} \]

# MultinomCI() from the DescTools package
MultinomCI(x = c(47, 162, 177, 53, 11), method = 'wald')
Table: 95% confidence intervals for the health perceptions.
healthgen n p se lwr upr
Excellent 47 0.1044 0.01442 0.07561 0.1333
Vgood 162 0.36 0.02263 0.3147 0.4053
Good 177 0.3933 0.02303 0.3473 0.4394
Fair 53 0.1178 0.0152 0.08739 0.1482
Poor 11 0.02444 0.00728 0.009885 0.039

Examples: fair dice?

Suppose you rolled a die 200 times and tallied the following counts:

1 2 3 4 5 6
38 24 31 39 44 24

In terms of proportions, the frequencies of each roll are:

1 2 3 4 5 6
0.19 0.12 0.155 0.195 0.22 0.12
  • 1’s, 4’s, and 5’s are more frequent than expected
  • 2’s, 3’s, and 6’s are less frequent than expected

To test whether the die is fair:

# test whether the die is fair
chisq.test(x = c(38, 24, 31, 39, 44, 24),
           p = c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6),
           rescale.p = T)

    Chi-squared test for given probabilities

data:  c(38, 24, 31, 39, 44, 24)
X-squared = 10.42, df = 5, p-value = 0.06417

The data do not provide evidence that the die is unfair (\(\chi^2\) = 10.42 on 5 degrees of freedom, p = 0.06417).

… but they are certainly suggestive (the test would be significant at the 10% level).

Examples: fair dice?

Table: 95% confidence intervals.
roll est lwr.ci upr.ci
1 0.19 0.1356 0.2444
2 0.12 0.07496 0.165
3 0.155 0.1048 0.2052
4 0.195 0.1401 0.2499
5 0.22 0.1626 0.2774
6 0.12 0.07496 0.165
Table: residuals from chi square test.
roll observed expected resid
1 38 33 0.81
2 24 33 -1.6
3 31 33 -0.4
4 39 33 0.98
5 44 33 1.8
6 24 33 -1.6

The residuals and intervals tell slightly different stories.

  • The intervals for 2 and 6 miss the fair value (0.167).
  • The residual for 5 is the largest (but followed closely by 2 and 6).

Examples: segregation distortion?

Mendelian genetics assumes alleles have an equal chance of being passed on to offspring.

This entails that offspring of two AB individuals are expected to exhibit a 1:2:1 genotypic ratio:

  • 25% of offspring will be AA
  • 50% of offspring will be AB
  • 25% of offspring will be BB

Otherwise there is what’s known as segregation distortion.

Researchers tested for segregation distortion at 250 gene loci using data from 451 offspring of AB individuals of a model plant species (Arabidopsis thaliana). At one locus:

AA AB BB
84 233 134
chisq.test(x = c(84, 233, 134),
           p = c(0.25, 0.5, 0.25))

    Chi-squared test for given probabilities

data:  c(84, 233, 134)
X-squared = 11.585, df = 2, p-value = 0.00305

Since 250 loci were tested, the p-value is adjusted for multiple comparisons (Bonferroni: \(0.0031 \times 250 \approx 0.775\)). The data do not provide evidence of segregation distortion at the specified gene location (\(\chi^2 = 11.585\) on 2 degrees of freedom, adjusted p = 0.775).

Examples: segregation distortion?

Table: 95% confidence intervals and residuals from chi square test.
genotype estimate lwr.ci upr.ci observed expected residual
AA 0.1863 0.1397 0.237 84 112.8 -2.708
AB 0.5166 0.4701 0.5674 233 225.5 0.4994
BB 0.2971 0.2506 0.3478 134 112.8 2.001

In this case residuals and intervals tell the same story:

  • the AA genotype has a lower-than-expected frequency
  • the BB genotype has a higher-than-expected frequency

Two-sample inference for binomial data

Comparing binomial proportions

Is vitamin C effective at preventing common cold?

Vitamin C experiment
  Cold NoCold n
Placebo 335 76 411
VitC 302 105 407

  • vitamin C and placebo supplements were randomly allocated to 818 volunteers
  • volunteers took supplements daily for a cold season
  • study recorded how many volunteers came down with a cold

To answer the question, we can test the hypotheses \(\begin{cases} H_0: &p_1 = p_2 \\ H_A: &p_1 \neq p_2 \end{cases}\)

Expected counts

If there is no effect of vitamin C supplements and \(p_1 = p_2 = p\) then:

  • expected counts are just the sample sizes allocated by the common proportion \(p\)
  • marginal totals provide an estimate of \(p\)

For example:

  • \(\hat{p} = 637/818 = 0.7787\) of all participants had a cold
  • so an expected 77.87% of the placebo group would have a cold:\[0.7787 \times 411 = \frac{637}{818}\times 411 = 320.1\]
Observed counts.
  Cold NoCold n
Placebo 335 76 411
VitC 302 105 407
both 637 181 818
Expected counts.
  Cold NoCold n
Placebo 320.1 90.94 411
VitC 316.9 90.06 407
both 637 181 818
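The full table of expected counts can be computed from the margins in one step:

```r
# observed counts
tbl <- rbind(Placebo = c(Cold = 335, NoCold = 76),
             VitC    = c(Cold = 302, NoCold = 105))

# expected counts under H0: (row total x column total) / grand total
outer(rowSums(tbl), colSums(tbl)) / sum(tbl)  # matches the expected counts above
```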

\(\chi^2\) test for binomial proportions

Observed counts.
  Cold NoCold
Placebo 335 76
VitC 302 105
Expected counts.
  Cold NoCold
Placebo 320.1 90.94
VitC 316.9 90.06

As in the one-sample case, the \(\chi^2\) statistic measures the divergence:

\[ \chi^2 = \frac{(335 - 320.1)^2}{320.1} + \frac{(76 - 90.94)^2}{90.94} + \frac{(302 - 316.9)^2}{316.9} + \frac{(105 - 90.06)^2}{90.06} = 6.34 \]

And we compare with a \(\chi^2_1\) model to determine a p-value.

\(\chi^2\) test for binomial proportions

Typically a “continuity correction” is applied. This slightly diminishes the test statistic for a more conservative test:

\[ \chi^2_\text{adj} = \frac{(|335 - 320.1| - 0.5)^2}{320.1} + \cdots + \frac{(|105 - 90.06| - 0.5)^2}{90.06} = 5.92 \]

In R:

chisq.test(vitamin$treatment, vitamin$outcome)

    Pearson's Chi-squared test with Yates' continuity correction

data:  vitamin$treatment and vitamin$outcome
X-squared = 5.9196, df = 1, p-value = 0.01497

Interpretation:

The data provide evidence of an effect of vitamin C on the probability of having a common cold (\(\chi^2\) = 5.92 on 1 degree of freedom, p = 0.01497).

So there is an effect. But how much of a difference is there?

Confidence interval for the difference

A confidence interval for the difference in proportions is:

\[ \hat{p}_1 - \hat{p}_2 \pm c\times SE(\hat{p}_1 - \hat{p}_2) \quad\text{where}\quad \begin{cases} SE(\hat{p}_1 - \hat{p}_2) = \sqrt{SE^2(\hat{p}_1) + SE^2(\hat{p}_2)} \\ SE^2(\hat{p}_1) = \frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} \\ SE^2(\hat{p}_2) = \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2} \end{cases} \]

The critical value \(c\) comes from either the empirical rule or from inverting the \(\chi^2\) test.

For the vitamin C data:

  • point estimate: \(\hat{p}_\text{placebo} - \hat{p}_\text{vitC} = 0.0731\)

  • standard error: \(\sqrt{SE^2(\hat{p}_\text{placebo}) + SE^2(\hat{p}_\text{vitC})} = 0.0289\)

  • 95% confidence interval: \(0.0731 \pm 2\times 0.0289 = (0.0153, 0.1309)\)
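In R, this hand calculation is:

```r
# sample proportions
p1 <- 335/411   # placebo
p2 <- 302/407   # vitamin C

# standard error of the difference
se <- sqrt(p1 * (1 - p1) / 411 + p2 * (1 - p2) / 407)

# 95% interval via the empirical rule
(p1 - p2) + c(-2, 2) * se  # approx. (0.015, 0.131)
```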

Implementation in R

# variables of interest
treatment <- vitamin$treatment
outcome <- vitamin$outcome

# pass table to prop.test
table(treatment, outcome) |>
  prop.test()

    2-sample test for equality of proportions with continuity correction

data:  table(treatment, outcome)
X-squared = 5.9196, df = 1, p-value = 0.01497
alternative hypothesis: two.sided
95 percent confidence interval:
 0.01391972 0.13222111
sample estimates:
   prop 1    prop 2 
0.8150852 0.7420147 

Details:

  • order in table() matters
  • grouping should be first
  • outcome variable should be second

Test interpretation:

The data provide evidence of an effect of vitamin C on the probability of having a common cold (\(\chi^2\) = 5.92 on 1 degree of freedom, p = 0.01497).

Interval interpretation:

With 95% confidence, the probability of getting a common cold is estimated to be between 1.39% and 13.22% lower among adults who take daily vitamin C supplements.

Two-sample inference for multinomial data

Comparing multinomial proportions

Did U.S. adults’ health perceptions change between 2009-2010 and 2011-2012?

  Excellent Vgood Good Fair Poor Total
2009_10 27 78 88 28 5 226
2011_12 20 84 89 25 6 224

To answer this question, we want to compare multinomial proportions, i.e., to test:

\[ \begin{cases} H_0: &\mathbf{p}_1 = \mathbf{p}_2 \\ H_A: &\mathbf{p}_1 \neq \mathbf{p}_2 \end{cases} \]

Expected counts

If \(\mathbf{p}_1 = \mathbf{p}_2 = \mathbf{p}\) then the marginal category totals provide an estimate of \(\mathbf{p}\):

  Excellent Vgood Good Fair Poor
both 0.1044 0.36 0.3933 0.1178 0.02444

And expected counts are simply allocations of the group sizes according to \(\mathbf{p}\):

  Excellent Vgood Good Fair Poor n
2009_10 23.6 81.36 88.89 26.62 5.524 226
2011_12 23.4 80.64 88.11 26.38 5.476 224
both 47 162 177 53 11 450

For example, an expected 36% of the 2009-2010 respondents would report being in very good health:

\[ 81.36 = 0.36 \times 226 = \frac{162}{450}\times 226 \]

\(\chi^2\) test for multinomial proportions

Observed counts
  Excellent Vgood Good Fair Poor
2009_10 27 78 88 28 5
2011_12 20 84 89 25 6
Expected counts
  Excellent Vgood Good Fair Poor
2009_10 23.6 81.36 88.89 26.62 5.524
2011_12 23.4 80.64 88.11 26.38 5.476

The \(\chi^2\) statistic is the same as usual:

\[ \chi^2 = \frac{(27 - 23.6)^2}{23.6} + \frac{(78 - 81.36)^2}{81.36} + \cdots + \frac{(6 - 5.476)^2}{5.476} = 1.52 \]

\(\chi^2\) test for multinomial proportions

\[ \chi^2 = \frac{(27 - 23.6)^2}{23.6} + \frac{(78 - 81.36)^2}{81.36} + \cdots + \frac{(6 - 5.476)^2}{5.476} = 1.52 \]

A \(\chi^2_{K - 1}\) model is used to compute the \(p\)-value for the test.

table(nhanes$surveyyr, nhanes$healthgen) |>
  chisq.test()

    Pearson's Chi-squared test

data:  table(nhanes$surveyyr, nhanes$healthgen)
X-squared = 1.5223, df = 4, p-value = 0.8227

The data do not provide evidence that U.S. adults’ health perceptions changed between survey periods (\(\chi^2\) = 1.52 on 4 degrees of freedom, p = 0.8227).

Residual analysis

If there had been a significant difference, we could have used residuals to explain the change.

  Excellent Vgood Good Fair Poor
2009_10 0.6989 -0.3725 -0.09475 0.2679 -0.2231
2011_12 -0.702 0.3742 0.09517 -0.2691 0.2241

While magnitudes are small, the signs show that from 2009-2010 to 2011-2012…

  • “excellent” perceptions decreased
  • “very good” perceptions increased
  • “good” perceptions increased
  • “fair” perceptions decreased
  • “poor” perceptions increased

Confidence intervals

Confidence intervals are obtained by category-wise comparisons of binomial proportions.

95% confidence intervals for differences in proportions.
  p_2009_10 p_2011_12 diff se ci.lwr ci.upr
Excellent 0.1195 0.08929 -0.03018 0.1491 -0.3285 0.2681
Vgood 0.3451 0.375 0.02987 0.5096 -0.9894 1.049
Good 0.3894 0.3973 0.007941 0.5563 -1.105 1.121
Fair 0.1239 0.1116 -0.01229 0.1668 -0.3458 0.3212
Poor 0.02212 0.02679 0.004662 0.03474 -0.06482 0.07414

Notice that all intervals span zero: it is plausible that there was no change in any category.

Extending \(\chi^2\) tests to \(I \times J\) tables

Do ACTN3 genotype frequencies vary by race?

FAMuSS data:

  CC CT TT total
African Am 16 6 5 27
Asian 21 18 16 55
Caucasian 125 216 126 467
Hispanic 4 10 9 23
Other 7 11 5 23
total 173 261 161 595

Expected counts:

  CC CT TT total
African Am 7.85 11.84 7.31 27
Asian 15.99 24.13 14.88 55
Caucasian 135.8 204.8 126.4 467
Hispanic 6.69 10.09 6.22 23
Other 6.69 10.09 6.22 23
total 173 261 161 595
  • expected counts and chi-square statistic are calculated exactly the same way
  • degrees of freedom are now \((I - 1)\times(J - 1)\)

Extending \(\chi^2\) tests to \(I \times J\) tables

In detail:

  CC CT TT
African Am \(\frac{(16 - 7.85)^2}{7.85}\) \(\frac{(6 - 11.84)^2}{11.84}\) \(\frac{(5 - 7.306)^2}{7.306}\)
Asian \(\frac{(21 - 15.99)^2}{15.99}\) \(\frac{(18 - 24.13)^2}{24.13}\) \(\frac{(16 -14.88)^2}{14.88}\)
Caucasian \(\frac{(125 - 135.8)^2}{135.8}\) \(\frac{(216 - 204.9)^2}{204.9}\) \(\frac{(126 - 126.4)^2}{126.4}\)
Hispanic \(\frac{(4 - 6.687)^2}{6.687}\) \(\frac{(10 - 10.09)^2}{10.09}\) \(\frac{(9 - 6.224)^2}{6.224}\)
Other \(\frac{(7 - 6.687)^2}{6.687}\) \(\frac{(11 - 10.09)^2}{10.09}\) \(\frac{(5 - 6.224)^2}{6.224}\)

Then:

\[\chi^2 = \sum \text{all cells above} = 19.4\]

Implementation in R

The implementation is the same as for a \(2\times K\) table:

# construct table and pass to chisq.test
table(famuss$race, famuss$genotype) |>
  chisq.test()

    Pearson's Chi-squared test

data:  table(famuss$race, famuss$genotype)
X-squared = 19.4, df = 8, p-value = 0.01286

The data provide evidence of an association between race and genotype (\(\chi^2\) = 19.4 on 8 degrees of freedom, p = 0.01286).

Residual analysis

Which genotype/race combinations are contributing most to this inferred association?

# store result of test; display residuals
rslt <- table(famuss$race, famuss$genotype) |>
  chisq.test()
rslt$residuals
  CC CT TT
African Am 2.909 -1.698 -0.8531
Asian 1.252 -1.247 0.2897
Caucasian -0.9254 0.7789 -0.03244
Hispanic -1.039 -0.02804 1.113
Other 0.1209 0.2868 -0.4905

Look for the largest absolute residuals to explain inferred association.

African American and Asian populations have higher CC and lower CT frequencies than would be expected if genotype were independent of race.

Relative risk

Asthma data

Consider estimating the difference in proportions:

  asthma no asthma
male 30 769
female 49 781
table(asthma$sex, asthma$asthma) |>
  prop.test(conf.level = 0.9)

    2-sample test for equality of proportions with continuity correction

data:  table(asthma$sex, asthma$asthma)
X-squared = 3.6217, df = 1, p-value = 0.05703
alternative hypothesis: two.sided
90 percent confidence interval:
 -0.040137075 -0.002841347
sample estimates:
    prop 1     prop 2 
0.03754693 0.05903614 

With 90% confidence, asthma prevalence is estimated to be between 0.28 and 4.01 percentage points higher among women than among men.

Is a difference of up to 4 percentage points practically meaningful? Well, it depends:

  • yes if prevalence is very low
  • no if prevalence is very high

Relative risk

If \(p_F, p_M\) are the (population) proportions of women and men with asthma, then the relative risk of asthma among women compared with men is defined as:

\[ RR = \frac{p_F}{p_M} \qquad \left(\frac{\text{risk among women}}{\text{risk among men}}\right) \]

An estimate of the relative risk is simply the ratio of estimated proportions. For the asthma data, an estimate is: \[ \widehat{RR} = \frac{\hat{p}_F}{\hat{p}_M} = \frac{0.059}{0.038} = 1.57 \]

It is estimated that the risk of asthma among women is 1.57 times greater than among men.

Confidence intervals for relative risk

A normal model can be used to approximate the sampling distribution of \(\log(RR)\) and construct a confidence interval. If \(\hat{p}_1\) and \(\hat{p}_2\) are the two estimated proportions:

\[\log\left(\widehat{RR}\right) \pm c \times SE\left(\log\left(\widehat{RR}\right)\right) \quad\text{where}\quad SE\left(\log\left(\widehat{RR}\right)\right) = \sqrt{\frac{1 - \hat{p}_1}{\hat{p}_1n_1} + \frac{1 - \hat{p}_2}{\hat{p}_2n_2}}\]

Exponentiate endpoints to obtain an interval for relative risk.
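For the asthma data, the interval can be reproduced by hand before turning to epitools:

```r
# risks from the asthma table: 30/799 men, 49/830 women
p_m <- 30/799
p_f <- 49/830

# relative risk and the SE of its log
rr <- p_f / p_m
se_log <- sqrt((1 - p_f)/49 + (1 - p_m)/30)  # (1 - p)/(p * n) = (1 - p)/count

# 90% interval: normal critical value on the log scale, then exponentiate
exp(log(rr) + c(-1, 1) * qnorm(0.95) * se_log)  # approx. (1.083, 2.282)
```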

table(asthma$sex, asthma$asthma) |>
  riskratio(rev = 'columns', 
            method = 'wald', 
            conf.level = 0.9,
            correction = T)
         risk ratio with 90% C.I.
Predictor estimate    lower    upper
   male   1.000000       NA       NA
   female 1.572329 1.083353 2.282007

With 90% confidence, the risk of asthma is estimated to be between 1.08 and 2.28 times greater for women than for men.

Implementation with epitools

table(asthma$sex, asthma$asthma) |>
  riskratio(rev = 'columns', 
            method = 'wald', 
            conf.level = 0.9,
            correction = T)
$data
        
         no asthma asthma Total
  male         769     30   799
  female       781     49   830
  Total       1550     79  1629

$measure
        risk ratio with 90% C.I.
         estimate    lower    upper
  male   1.000000       NA       NA
  female 1.572329 1.083353 2.282007

$p.value
        two-sided
         midp.exact fisher.exact chi.square
  male           NA           NA         NA
  female 0.04412095   0.04961711 0.05703135

$correction
[1] TRUE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

riskratio is picky about input format:

  • outcome of interest should be second column
  • group of interest should be second row

It will return the relative risk

\[ RR = \frac{n_{22}/n_{2\cdot}}{n_{12}/n_{1\cdot}} \]

The data table can be reoriented using rev

  • rev = 'neither' keeps original orientation
  • rev = 'rows' reverses order of rows
  • rev = 'columns' reverses order of columns
  • rev = 'both' reverses both

Reporting results

$data
        
         no asthma asthma Total
  male         769     30   799
  female       781     49   830
  Total       1550     79  1629

$measure
        risk ratio with 90% C.I.
         estimate    lower    upper
  male   1.000000       NA       NA
  female 1.572329 1.083353 2.282007

$p.value
        two-sided
         midp.exact fisher.exact chi.square
  male           NA           NA         NA
  female 0.04412095   0.04961711 0.05703135

$correction
[1] TRUE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

The data provide evidence of an association between asthma and sex (\(\chi^2\) = 3.62 on 1 degree of freedom, p = 0.057). With 90% confidence, the risk of asthma is estimated to be between 1.08 and 2.28 times greater for women than for men, with a point estimate of 1.57.

Conventional style:

  • first report the test result
  • then the measure of association
  • include point estimates (since interval is asymmetric)

Treatment efficacy: reduction in risk

In a randomized trial for a malaria vaccine, 20 individuals were randomly allocated to receive a dose of the vaccine or a placebo.

Vaccine trials often estimate relative reduction in risk or “efficacy”:

\[ \underbrace{\frac{\hat{p}_\text{ctrl} - \hat{p}_\text{trt}}{\hat{p}_\text{ctrl}}}_\text{efficacy} = 1 - RR \]

  no infection infection
placebo 0 6
vaccine 9 5
# relative risk
rr.out <- table(malaria$treatment, malaria$outcome) |>
            riskratio(method = 'wald', correction = T)
rr.out$measure
         risk ratio with 95% C.I.
           estimate     lower     upper
  placebo 1.0000000        NA        NA
  vaccine 0.3571429 0.1768593 0.7212006
# efficacy
1 - rr.out$measure
         risk ratio with 95% C.I.
           estimate     lower     upper
  placebo 0.0000000        NA        NA
  vaccine 0.6428571 0.8231407 0.2787994

The vaccine reduces the risk of infection by an estimated 27.9% to 82.3%. (Subtracting from 1 reverses the interval, so the lower and upper columns in the printed output are swapped.)

Odds ratios

Example: case-control study

  Smokers NonSmokers total
Cancer 83 3 86
Control 72 14 86

Note: not possible to estimate the case rate (cancer prevalence) due to the study design.

# chi square test of association
table(smoking$group, smoking$smoking) |>
  chisq.test()

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(smoking$group, smoking$smoking)
X-squared = 6.5275, df = 1, p-value = 0.01062

The data provide evidence of an association between smoking and lung cancer (\(\chi^2 = 6.53\) on 1 degree of freedom, \(p = 0.0106\)).

How do we measure the association, considering we can’t estimate case rates?

Odds

If \(p\) is the true cancer prevalence (a population proportion), then the odds of cancer are:

\[ \text{odds} = \frac{p}{1 - p} \]

The odds represent the relative likelihood of an outcome.

  • \(\text{odds} = 2\): the outcome (cancer) is twice as likely to occur as to not occur
  • \(\text{odds} = 1/2\): the outcome (cancer) is half as likely to occur as to not occur
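As a quick numeric illustration:

```r
# odds corresponding to a probability p
odds <- function(p) p / (1 - p)

odds(2/3)  # 2:   twice as likely to occur as not
odds(1/3)  # 0.5: half as likely to occur as not
```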

Odds ratios

Let \(a, b, c, d\) denote population proportions.

\(\;\) outcome 1 (O1) outcome 2 (O2)
group 1 (G1) a b
group 2 (G2) c d

The odds of outcome 1 (O1) in each group are:

  • \(\frac{\textcolor{red}{a}}{\textcolor{blue}{b}}\) in group 1 (G1)
  • \(\frac{\textcolor{orange}{c}}{\textcolor{purple}{d}}\) in group 2 (G2)

The odds ratio or “relative odds” is:

\[ \omega = \frac{\text{odds}_{G1}(O1)}{\text{odds}_{G2}(O1)} = \frac{\textcolor{red}{a}/\textcolor{blue}{b}}{\textcolor{orange}{c}/\textcolor{purple}{d}} = \frac{\textcolor{red}{a}\textcolor{purple}{d}}{\textcolor{blue}{b}\textcolor{orange}{c}} \]

A surprising algebraic fact is that:

\[ \frac{\text{odds}_{G1}(O1)}{\text{odds}_{G2}(O1)} =\frac{\text{odds}_{O1}(G1)}{\text{odds}_{O2}(G1)} \]

relative odds of cancer given smoking status = relative odds of smoking given cancer status

Estimating odds ratios

The estimate is the same calculation as on the previous slide, but with sample counts.

\(\;\) Smoker (O1) NonSmoker (O2)
Case (G1) 83 3
Control (G2) 72 14

Estimate of \(\omega\): \[ \hat{\omega} = \frac{\textcolor{red}{83}\times\textcolor{purple}{14}}{\textcolor{blue}{3}\times\textcolor{orange}{72}} = 5.38 \]

Interpretation:

It is estimated that the relative odds of lung cancer are 5.38 times greater for smokers compared with nonsmokers.

Confidence intervals for odds ratios

The sampling distribution of the log odds ratio can be approximated by a normal model.

\[ \log\left(\hat{\omega}\right) \pm c \times SE\left(\log\left(\hat{\omega}\right)\right) \quad\text{where}\quad SE\left(\log\left(\hat{\omega}\right)\right) = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}} \]
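For the smoking data, the Wald interval can be reproduced by hand:

```r
# cell counts: cases (83 smokers, 3 nonsmokers), controls (72 smokers, 14 nonsmokers)
n11 <- 83; n12 <- 3; n21 <- 72; n22 <- 14

# odds ratio and the SE of its log
or <- (n11 * n22) / (n12 * n21)
se_log <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)

# 95% interval: normal critical value on the log scale, then exponentiate
exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log)  # approx. (1.486, 19.470)
```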

The oddsratio(...) function in the epitools package will compute and back-transform the interval for you.

table(smoking$group, smoking$smoking) |>
  oddsratio(rev = 'both', method = 'wald')
         odds ratio with 95% C.I.
          estimate    lower    upper
  Control  1.00000       NA       NA
  Cancer   5.37963 1.486376 19.47045
  • critical value \(c\) from the normal model
  • exponentiate to obtain an interval for \(\omega\)

With 95% confidence, the relative odds of lung cancer are estimated to be between 1.49 and 19.47 times greater for smokers compared with nonsmokers.

Implementation with epitools

table(smoking$group, smoking$smoking) |>
  oddsratio(rev = 'both', 
            method = 'wald',
            conf.level = 0.95,
            correction = T)
$data
         
          NonSmokers Smokers Total
  Control         14      72    86
  Cancer           3      83    86
  Total           17     155   172

$measure
         odds ratio with 95% C.I.
          estimate    lower    upper
  Control  1.00000       NA       NA
  Cancer   5.37963 1.486376 19.47045

$p.value
         two-sided
           midp.exact fisher.exact chi.square
  Control          NA           NA         NA
  Cancer  0.005116319  0.008822805 0.01062183

$correction
[1] TRUE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

oddsratio is picky about data inputs:

  • outcome of interest should be second column

  • group of interest should be second row

It will return the odds ratio

\[ \frac{\text{odds}_\text{R2} (\text{C2})}{\text{odds}_\text{R1}(\text{C2})} = \frac{n_{22}/n_{21}}{n_{12}/n_{11}} = \frac{n_{22} n_{11}}{n_{12}n_{21}} = \frac{da}{cb} \]

The data table can be reoriented using rev

  • rev = 'neither' keeps original orientation
  • rev = 'rows' reverses order of rows
  • rev = 'columns' reverses order of columns
  • rev = 'both' reverses both

Reporting results

$data
         
          NonSmokers Smokers Total
  Control         14      72    86
  Cancer           3      83    86
  Total           17     155   172

$measure
         odds ratio with 95% C.I.
          estimate    lower    upper
  Control  1.00000       NA       NA
  Cancer   5.37963 1.486376 19.47045

$p.value
         two-sided
           midp.exact fisher.exact chi.square
  Control          NA           NA         NA
  Cancer  0.005116319  0.008822805 0.01062183

$correction
[1] TRUE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

First report the test result, then the measure of association:

The data provide evidence of an association between smoking and lung cancer (\(\chi^2 = 6.53\) on 1 degree of freedom, \(p = 0.0106\)). With 95% confidence, the relative odds of cancer are estimated to be between 1.49 and 19.47 times greater among smokers compared with nonsmokers, with a point estimate of 5.38.

Be sure to include the point estimate, since the interval estimate is asymmetric.

Risk or odds? Rules of thumb

If a study design employs outcome-based sampling, proportions are not estimable.

  • analysis must use relative odds

Otherwise, analysis may use any measure of association.

  • difference in proportions
  • relative risk
  • treatment efficacy
  • relative odds

Risk or odds? Cyclosporiasis outbreak

An outbreak of cyclosporiasis was detected among residents of New Jersey. In a case-control study, investigators found that 21 of 30 cases and 4 of 60 controls had eaten raspberries.

Outcome-based sampling means…

  • can’t estimate risk
  • analysis should use odds ratio
oddsratio(outbreak$exposure, outbreak$group, 
          rev = 'columns', method = 'wald', 
          correction = T)
                odds ratio with 95% C.I.
Predictor        estimate    lower    upper
  no raspberries  1.00000       NA       NA
  raspberries    32.66667 9.081425 117.5048
  no raspberries raspberries
case 9 21
control 56 4

The data provide evidence of an association between raspberry consumption and illness (\(\chi^2\) = 36.89 on 1 degree of freedom, p < 0.0001). With 95% confidence, the odds of illness are estimated to be between 9.08 and 117.5 times higher among those who consumed raspberries during the outbreak.

Risk or odds? smoking and CHD

A cohort study of 3,000 smokers and 5,000 nonsmokers investigated the link between smoking and the development of coronary heart disease (CHD) over 1 year.

Two independent samples but not outcome-based sampling…

  • can estimate risk
  • analysis can use any measure
  • RR (arguably) most natural
riskratio(chd$smoking, chd$chd, rev = 'columns',
          method = 'wald', correction = T)
           risk ratio with 95% C.I.
Predictor   estimate    lower    upper
  nonsmoker 1.000000       NA       NA
  smoker    1.609195 1.196452 2.164325
  CHD no CHD
nonsmoker 87 4913
smoker 84 2916

The data provide evidence that smoking is associated with coronary heart disease (\(\chi^2\) = 9.5711 on 1 degree of freedom, p = 0.00198). With 95% confidence, the risk of CHD is estimated to be between 1.196 and 2.164 times greater among smokers compared with nonsmokers.