Logistic Regression

A model of the dependence of binary variables on explanatory variables. The logit of expectation is explained as linear for of explanatory variables. If we observed \((y_i, x_i),\) where \(y_i\) is a Bernoulli variable and \(x_i\) a vector of explanatory variables, the model for \(\pi_i = P(y_i=1)\) is

\[ \text{logit}(\pi_i)= \log\left\{ \frac{\pi_i}{1-\pi_i}\right\} = \beta_0 + \beta x_i, i = 1,\ldots,n \]

The model is especially useful in case-control studies and leads to the effect of risk factors by odds ratios.

Example: Lung Cancer Data

Data source: Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ. Kugler JW. Bartel J. Law M. Bateman M. Klatt NE. et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.

Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities (see ?lung for details).

library(survival) 
glimpse(lung)
Rows: 228
Columns: 10
$ inst      <dbl> 3, 3, 3, 5, 1, 12, 7, 11, 1, 7, 6, 16, 11, 21, 12, 1, 22, 16…
$ time      <dbl> 306, 455, 1010, 210, 883, 1022, 310, 361, 218, 166, 170, 654…
$ status    <dbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ age       <dbl> 74, 68, 56, 57, 60, 74, 68, 71, 53, 61, 57, 68, 68, 60, 57, …
$ sex       <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, …
$ ph.ecog   <dbl> 1, 0, 0, 1, 0, 1, 2, 2, 1, 2, 1, 2, 1, NA, 1, 1, 1, 2, 2, 1,…
$ ph.karno  <dbl> 90, 90, 90, 90, 100, 50, 70, 60, 70, 70, 80, 70, 90, 60, 80,…
$ pat.karno <dbl> 100, 90, 90, 60, 90, 80, 60, 80, 80, 70, 80, 70, 90, 70, 70,…
$ meal.cal  <dbl> 1175, 1225, NA, 1150, NA, 513, 384, 538, 825, 271, 1025, NA,…
$ wt.loss   <dbl> NA, 15, 15, 11, 0, 0, 10, 1, 16, 34, 27, 23, 5, 32, 60, 15, …

Model Fit

We analyze the weight loss in lung cancer patients in dependency of age, sex, ECOG performance score and calories consumed at meals.

lung2 <- survival::lung %>% 
  mutate(
    wt_grp = factor(wt.loss > 0, labels = c("weight loss", "weight gain"))
  ) 


m1 <- glm(wt_grp ~ age + sex + ph.ecog + meal.cal, data = lung2, family = binomial(link="logit"))
summary(m1)

Call:
glm(formula = wt_grp ~ age + sex + ph.ecog + meal.cal, family = binomial(link = "logit"), 
    data = lung2)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)  
(Intercept)  3.2631673  1.6488207   1.979   0.0478 *
age         -0.0101717  0.0208107  -0.489   0.6250  
sex         -0.8717357  0.3714042  -2.347   0.0189 *
ph.ecog      0.4179665  0.2588653   1.615   0.1064  
meal.cal    -0.0008869  0.0004467  -1.985   0.0471 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 202.36  on 169  degrees of freedom
Residual deviance: 191.50  on 165  degrees of freedom
  (58 observations deleted due to missingness)
AIC: 201.5

Number of Fisher Scoring iterations: 4

The model summary contains the parameter estimates \(\beta_j\) for each explanatory variable \(x_j\), corresponding to the log-odds for the response variable to take the value \(1\), conditional on all other explanatory variables remaining constant. For better interpretation, we can exponentiate these estimates, to obtain estimates for the odds instead and provide \(95\)% confidence intervals:

exp(coef(m1))
(Intercept)         age         sex     ph.ecog    meal.cal 
 26.1321742   0.9898798   0.4182250   1.5188698   0.9991135 
exp(confint(m1))
Waiting for profiling to be done...
                2.5 %      97.5 %
(Intercept) 1.0964330 730.3978786
age         0.9495388   1.0307216
sex         0.1996925   0.8617165
ph.ecog     0.9194053   2.5491933
meal.cal    0.9982107   0.9999837

Model Comparison

To compare two logistic models, one tests the difference in residual variances from both models using a \(\chi^2\)-distribution with a single degree of freedom (here at the \(5\)% level):

m2 <- glm(wt_grp ~ sex + ph.ecog + meal.cal, data = lung2, family = binomial(link="logit"))
summary(m2)

Call:
glm(formula = wt_grp ~ sex + ph.ecog + meal.cal, family = binomial(link = "logit"), 
    data = lung2)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)   
(Intercept)  2.5606595  0.7976887   3.210  0.00133 **
sex         -0.8359241  0.3637378  -2.298  0.02155 * 
ph.ecog      0.3794295  0.2469030   1.537  0.12435   
meal.cal    -0.0008334  0.0004346  -1.918  0.05517 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 202.36  on 169  degrees of freedom
Residual deviance: 191.74  on 166  degrees of freedom
  (58 observations deleted due to missingness)
AIC: 199.74

Number of Fisher Scoring iterations: 4
anova(m1, m2, test = "Chisq")
Analysis of Deviance Table

Model 1: wt_grp ~ age + sex + ph.ecog + meal.cal
Model 2: wt_grp ~ sex + ph.ecog + meal.cal
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       165     191.50                     
2       166     191.75 -1 -0.24046   0.6239

Prediction

Predictions from the model for the log-odds of a patient with new data to experience a weight loss are derived using predict():

# new female, symptomatic but completely ambulatory patient consuming 2500 calories
new_pt <- data.frame(sex=2, ph.ecog=1, meal.cal=2500)
predict(m2, new_pt, type = "response")
       1 
0.306767