Association Analysis for Count Data Using R

The most commonly used association analysis methods for count data/contingency tables compare observed frequencies with those expected under the assumption of independence:

\[ X^2 = \sum_{i=1}^k \frac{(x_i-e_i)^2}{e_i}, \] where \(k\) is the number of contingency table cells.

Other measures for the correlation of two continuous variables are:

Example: Lung Cancer Data

Data source: Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ. Kugler JW. Bartel J. Law M. Bateman M. Klatt NE. et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.

Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.

library(survival) 
glimpse(lung)
Rows: 228
Columns: 10
$ inst      <dbl> 3, 3, 3, 5, 1, 12, 7, 11, 1, 7, 6, 16, 11, 21, 12, 1, 22, 16…
$ time      <dbl> 306, 455, 1010, 210, 883, 1022, 310, 361, 218, 166, 170, 654…
$ status    <dbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ age       <dbl> 74, 68, 56, 57, 60, 74, 68, 71, 53, 61, 57, 68, 68, 60, 57, …
$ sex       <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, …
$ ph.ecog   <dbl> 1, 0, 0, 1, 0, 1, 2, 2, 1, 2, 1, 2, 1, NA, 1, 1, 1, 2, 2, 1,…
$ ph.karno  <dbl> 90, 90, 90, 90, 100, 50, 70, 60, 70, 70, 80, 70, 90, 60, 80,…
$ pat.karno <dbl> 100, 90, 90, 60, 90, 80, 60, 80, 80, 70, 80, 70, 90, 70, 70,…
$ meal.cal  <dbl> 1175, 1225, NA, 1150, NA, 513, 384, 538, 825, 271, 1025, NA,…
$ wt.loss   <dbl> NA, 15, 15, 11, 0, 0, 10, 1, 16, 34, 27, 23, 5, 32, 60, 15, …

\(2\times 2\) Contingency Table

We analyze the association between ECOG performance score (fully active vs any symptomatic patient) as rated by physician and weight loss (see ?lung for details).

lung2 <- survival::lung %>% 
  mutate(
    ecog_grp = factor(ph.ecog > 0, labels = c("fully active", "symptomatic")), 
    wt_grp = factor(wt.loss > 0, labels = c("weight loss", "weight gain"))
  ) 

(tab <- table(x = lung2$ecog_grp, y = lung2$wt_grp))
              y
x              weight loss weight gain
  fully active          22          39
  symptomatic           39         113

Chi-Squared test

chisq.test(tab)

    Pearson's Chi-squared test with Yates' continuity correction

data:  tab
X-squared = 1.8261, df = 1, p-value = 0.1766

Fisher Exact Test

For \(2 \times 2\) contingency tables, p-values are obtained directly using the hypergeometric distribution.

fisher.test(tab)

    Fisher's Exact Test for Count Data

data:  tab
p-value = 0.135
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.8158882 3.2251299
sample estimates:
odds ratio 
  1.630576 

Large Contingency Tables

In a second example, we analyze the association between ECOG and Karnofky performance scores, both rated by physician (see ?lung for details).

(tab2 <- table(x = lung2$ph.ecog, y = lung2$ph.karno))
   y
x   50 60 70 80 90 100
  0  0  0  0  3 32  28
  1  1  2  7 60 42   1
  2  5 15 25  4  0   0
  3  0  1  0  0  0   0

Chi-Squared Test

chisq.test(tab2)
Warning in chisq.test(tab2): Chi-squared approximation may be incorrect

    Pearson's Chi-squared test

data:  tab2
X-squared = 260.76, df = 15, p-value < 2.2e-16

The warning means that the smallest expected frequencies is lower than 5. It is recommended to use the Fisher’s exact test in this case.

Fisher Exact Test

For contingency tables larger than \(2 \times 2\), p-values are based on simulations, which might require a lot of time (see ?fisher.test for details).

fisher.test(tab2, simulate.p.value=TRUE)

    Fisher's Exact Test for Count Data with simulated p-value (based on
    2000 replicates)

data:  tab2
p-value = 0.0004998
alternative hypothesis: two.sided