Regression for Count Data

The most commonly used models for count data in clinical trials include:

\[ \text{log}(E(Y|x))= \beta_0 + \beta' x, \; i = 1,\ldots,n \]

Other models include hurdle or zero-inflated models, if data have more zero observations than expected.

Example: Familial Andenomatous Polyposis Data

Data source: F. M. Giardiello, S. R. Hamilton, A. J. Krush, S. Piantadosi, L. M. Hylind, P. Celano, S. V. Booker, C. R. Robinson and G. J. A. Offerhaus (1993), Treatment of colonic and rectal adenomas with sulindac in familial adenomatous polyposis. New England Journal of Medicine, 328(18), 1313–1316.

Data from a placebo-controlled trial of a non-steroidal anti-inflammatory drug in the treatment of familial andenomatous polyposis (FAP). (see ?polyps for details).

polyps <- HSAUR2::polyps
glimpse(polyps)
Rows: 20
Columns: 3
$ number <dbl> 63, 2, 28, 17, 61, 1, 7, 15, 44, 25, 3, 28, 10, 40, 33, 46, 50,…
$ treat  <fct> placebo, drug, placebo, drug, placebo, drug, placebo, placebo, …
$ age    <dbl> 20, 16, 18, 22, 13, 23, 34, 50, 19, 17, 23, 22, 30, 27, 23, 22,…

We analyze the number of colonic polyps at 12 months in dependency of treatment and age of the patient.

polyps %>% 
  ggplot(aes(y = number, x = age, color = treat)) + 
  geom_point() + theme_minimal()

Model Fit

We fit a generalized linear model for number using the Poisson distribution with default log link.

# Poisson
m1 <- glm(number ~ treat + age, data = polyps, family = poisson)
summary(m1)

Call:
glm(formula = number ~ treat + age, family = poisson, data = polyps)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.529024   0.146872   30.84  < 2e-16 ***
treatdrug   -1.359083   0.117643  -11.55  < 2e-16 ***
age         -0.038830   0.005955   -6.52 7.02e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 378.66  on 19  degrees of freedom
Residual deviance: 179.54  on 17  degrees of freedom
AIC: 273.88

Number of Fisher Scoring iterations: 5

The parameter estimates are on log-scale. For better interpretation, we can exponentiate these estimates, to obtain estimates and provide \(95\)% confidence intervals:

# OR and CI
exp(coef(m1))
(Intercept)   treatdrug         age 
 92.6681047   0.2568961   0.9619140 
exp(confint(m1))
Waiting for profiling to be done...
                 2.5 %      97.5 %
(Intercept) 69.5361752 123.6802476
treatdrug    0.2028078   0.3218208
age          0.9505226   0.9729788

Predictions for number of colonic polyps given a new 25-year-old patient on either treatment using predict():

# new 25 year old patient
new_pt <- data.frame(treat = c("drug","placebo"), age=25)
predict(m1, new_pt, type = "response")
        1         2 
 9.017654 35.102332 

Modelling Overdispersion

Poisson model assumes that mean and variance are equal, which can be a very restrictive assumption. One option to relax the assumption is adding a overdispersion constant to the relationship, i.e. \(\text{Var}(\text{response}) = \phi\cdot \mu\), which results in a quasipoisson model:

# Quasi poisson
m2 <- glm(number ~ treat + age, data = polyps, family = quasipoisson)
summary(m2)

Call:
glm(formula = number ~ treat + age, family = quasipoisson, data = polyps)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.52902    0.48106   9.415 3.72e-08 ***
treatdrug   -1.35908    0.38533  -3.527  0.00259 ** 
age         -0.03883    0.01951  -1.991  0.06284 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 10.72805)

    Null deviance: 378.66  on 19  degrees of freedom
Residual deviance: 179.54  on 17  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

Alternatively, we can explicitly model the count data with overdispersion using the negative Binomial model. In this case, the overdispersion is a function of both \(\mu\) and \(\mu^2\):

\[ \text{Var}(\text{response}) = \mu + \kappa\,\mu^2. \]

# Negative Binomial
m3 <- MASS::glm.nb(number ~ treat + age, data = polyps)
summary(m3)

Call:
MASS::glm.nb(formula = number ~ treat + age, data = polyps, init.theta = 1.719491, 
    link = log)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.52603    0.59466   7.611 2.72e-14 ***
treatdrug   -1.36812    0.36903  -3.707 0.000209 ***
age         -0.03856    0.02095  -1.840 0.065751 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.7195) family taken to be 1)

    Null deviance: 36.734  on 19  degrees of freedom
Residual deviance: 22.002  on 17  degrees of freedom
AIC: 164.88

Number of Fisher Scoring iterations: 1

              Theta:  1.719 
          Std. Err.:  0.607 

 2 x log-likelihood:  -156.880 

Both model result very similar parameter estimates, but vary in estimates for their respective standard deviation.