Logistic Regression in R

In binary logistic regression, there is a single binary dependent variable, coded by an indicator variable. For example, if we represent a response as 1 and non-response as 0, then the corresponding probability of response, can be between 0 (certainly not a response) and 1 (certainly a response) - hence the labeling !

The logistic model models the log-odds of an event as a linear combination of one or more independent variables (explanatory variables). If we observed \((y_i, x_i),\) where \(y_i\) is a Bernoulli variable and \(x_i\) a vector of explanatory variables, the model for \(\pi_i = P(y_i=1)\) is

\[ \text{logit}(\pi_i)= \log\left\{ \frac{\pi_i}{1-\pi_i}\right\} = \beta_0 + \beta x_i, i = 1,\ldots,n \]

The model is especially useful in case-control studies and leads to the effect of risk factors by odds ratios.

Example: Lung Cancer Data

Data source: Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ. Kugler JW. Bartel J. Law M. Bateman M. Klatt NE. et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.

wgt_catn consists of: 1= patients a weight loss of zero or less, 0= patients with a weight loss of more than zero

trt01pn consists of 1= active treatment, 0 = placebo

Model Fit

We analyze the event of weight gain (or staying the same weight) in lung cancer patients in dependency of treatment (active or placebo), age, sex, ECOG performance score and calories consumed at meals. One of the most important things to remember is to ensure you tell R what your event is and what treatment comparison you are doing Active / Placebo or Placebo/Active! The easiest way to do this is to have event (or non-reference treatment) as 1, and non-event (reference treatment) as 0. Hence we will use the wt_catn and trt01pn variables.

lung<-read.csv("../data/lung_cancer.csv")
head(lung)

  inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss trt01p
1    3  306      2  74   1       1       90       100     1175      NA Active
2    3  455      2  68   1       0       90        90     1225      15 Active
3    3 1010      1  56   1       0       90        90       NA      15 Active
4    5  210      2  57   1       1       90        60     1150      11 Active
5    1  883      2  60   1       0      100        90       NA       0 Active
6   12 1022      1  74   1       1       50        80      513       0 Active
  trt01pn dose_mg dose_id wt_cat wt_catn
1       1      10       1   <NA>      NA
2       1      10       1   loss       0
3       1      10       1   loss       0
4       1      10       1   loss       0
5       1      10       1   gain       1
6       1      10       1   gain       1

m1 <- glm(wt_catn ~ trt01pn + age + sex + ph.ecog + meal.cal, data = lung, family = binomial(link="logit"))
summary(m1)


Call:
glm(formula = wt_catn ~ trt01pn + age + sex + ph.ecog + meal.cal, 
    family = binomial(link = "logit"), data = lung)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)  
(Intercept) -3.8623998  1.7675776  -2.185   0.0289 *
trt01pn      0.3887667  0.3781566   1.028   0.3039  
age          0.0122549  0.0211553   0.579   0.5624  
sex          0.8321005  0.3743793   2.223   0.0262 *
ph.ecog     -0.3763592  0.2638322  -1.427   0.1537  
meal.cal     0.0008500  0.0004486   1.895   0.0581 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 202.36  on 169  degrees of freedom
Residual deviance: 190.46  on 164  degrees of freedom
  (58 observations deleted due to missingness)
AIC: 202.46

Number of Fisher Scoring iterations: 4

The model summary contains the parameter estimates \(\beta_j\) for each explanatory variable \(x_j\), corresponding to the log-odds for the response variable to take the value \(1\), conditional on all other explanatory variables remaining constant. For better interpretation, we can exponentiate these estimates, to obtain estimates for the odds instead and provide 95% confidence intervals.

# model coefficients summary
summary(m1)$coefficients

                 Estimate   Std. Error    z value   Pr(>|z|)
(Intercept) -3.8623998098 1.7675776454 -2.1851373 0.02887878
trt01pn      0.3887666677 0.3781565596  1.0280574 0.30392281
age          0.0122549015 0.0211552875  0.5792831 0.56239813
sex          0.8321005169 0.3743792762  2.2226137 0.02624186
ph.ecog     -0.3763592487 0.2638321918 -1.4265100 0.15372119
meal.cal     0.0008499918 0.0004486401  1.8945961 0.05814593

## Wald confidence limits
cbind(est = exp(coef(m1)), 
          exp(confint.default(m1)))

                  est        2.5 %    97.5 %
(Intercept) 0.0210175 0.0006576825 0.6716544
trt01pn     1.4751603 0.7029942478 3.0954705
age         1.0123303 0.9712137510 1.0551875
sex         2.2981410 1.1033275058 4.7868397
ph.ecog     0.6863557 0.4092369943 1.1511280
meal.cal    1.0008504 0.9999706738 1.0017308

## profile-likelihood limits
cbind(est = exp(coef(m1)), 
          exp(confint(m1)))

Waiting for profiling to be done...

                  est       2.5 %    97.5 %
(Intercept) 0.0210175 0.000580124 0.6197446
trt01pn     1.4751603 0.696210092 3.0850890
age         1.0123303 0.971670916 1.0561942
sex         2.2981410 1.107651762 4.8367698
ph.ecog     0.6863557 0.405659156 1.1474521
meal.cal    1.0008504 0.999978126 1.0017613

NOTE: that there are two types of confidence intervals that you can calculate. Function confint.default gives the Wald confidence limits, which is the default option in SAS PROC LOGISTIC procedure; whereas confint gives the profile-likelihood limits.

Model Comparison

To compare two logistic models, the residual deviances (-2 * log likelihoods) are compared against a \(\chi^2\)-distribution with degrees of freedom calculated using the difference in the two models’ parameters. Below, the only difference is the inclusion/exclusion of age in the model, hence we test using \(\chi^2\) with 1 df. Here testing at the 5% level.

m2 <- glm(wt_catn ~ trt01pn + sex + ph.ecog + meal.cal, data = lung, family = binomial(link="logit"))
summary(m2)


Call:
glm(formula = wt_catn ~ trt01pn + sex + ph.ecog + meal.cal, family = binomial(link = "logit"), 
    data = lung)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)   
(Intercept) -2.9951557  0.9245655  -3.240   0.0012 **
trt01pn      0.3681563  0.3761976   0.979   0.3278   
sex          0.7919227  0.3674936   2.155   0.0312 * 
ph.ecog     -0.3312031  0.2527586  -1.310   0.1901   
meal.cal     0.0007896  0.0004378   1.803   0.0713 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 202.36  on 169  degrees of freedom
Residual deviance: 190.80  on 165  degrees of freedom
  (58 observations deleted due to missingness)
AIC: 200.8

Number of Fisher Scoring iterations: 4

anova(m1, m2, test = "LRT")

Analysis of Deviance Table

Model 1: wt_catn ~ trt01pn + age + sex + ph.ecog + meal.cal
Model 2: wt_catn ~ trt01pn + sex + ph.ecog + meal.cal
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       164     190.46                     
2       165     190.80 -1 -0.33867   0.5606

Stackexchange here has a good article describing this method and the difference between comparing 2 models using the likelihood ratio tests versus using wald tests and Pr>chisq (from the maximum likelihood estimate). Note: anova(m1, m2, test = "Chisq") and using test="LRT" as above are synonymous in this context.

Prediction

Predictions from the model for the log-odds of a patient with new data to experience a weight loss are derived using predict():

# new female, symptomatic but completely ambulatory patient consuming 2500 calories
new_pt <- data.frame(trt01pn=1, sex=2, ph.ecog=1, meal.cal=2500)
predict(m2, new_pt, type = "response")

        1 
0.6455793

Contrast statements for 2 or more treatments

To create contrasts, you can use the fit.contrast() function from the gmodels package.

This can be used with lm and glm objects:

Here if we use the 3 level treatment variable (dose_id), we have 1=10mg, 2=20mg doses for active treatment and then 3= placebo which is 0mg.

You would fit the model as above, followed by fit.contrast(). This is effective testing the null hypothesis that 0.5dose10mg + 0.5 dose20mg - placebo = 0.

lung2<-lung |> 
    mutate(dose_id2 = as.factor(lung$dose_id))

m3 <- glm (wt_catn ~ dose_id2 + age + sex + ph.ecog + meal.cal, data = lung2, family = binomial(link="logit"))

fit.contrast (m3,'dose_id2',c(0.5,0.5,-1),conf.int=0.95)

                            Estimate Std. Error   z value  Pr(>|z|)  lower CI
dose_id2 c=( 0.5 0.5 -1 ) -0.4096323  0.3801683 -1.077502 0.2812558 -1.160322
                           upper CI
dose_id2 c=( 0.5 0.5 -1 ) 0.3410574

Reference

#\| echo: false

#List all the packages needed

si <- sessioninfo::session_info(c('tidyverse','gmodels')) 
si

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.3 (2025-02-28)
 os       Ubuntu 24.04.2 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  C.UTF-8
 ctype    C.UTF-8
 tz       Europe/London
 date     2025-03-13
 pandoc   NA (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 ! package       * version date (UTC) lib source
   askpass         1.2.1   2024-10-04 [1] RSPM (R 4.4.0)
   backports       1.5.0   2024-05-23 [1] RSPM (R 4.4.0)
   base64enc       0.1-3   2015-07-28 [1] RSPM (R 4.4.0)
   bit             4.5.0   2024-09-20 [1] RSPM (R 4.4.0)
   bit64           4.5.2   2024-09-22 [1] RSPM (R 4.4.0)
   blob            1.2.4   2023-03-17 [1] RSPM (R 4.4.0)
   broom           1.0.7   2024-09-26 [1] RSPM (R 4.4.0)
   bslib           0.8.0   2024-07-29 [1] RSPM (R 4.4.0)
   cachem          1.1.0   2024-05-16 [1] RSPM (R 4.4.0)
   callr           3.7.6   2024-03-25 [1] RSPM (R 4.4.0)
   cellranger      1.1.0   2016-07-27 [1] RSPM (R 4.4.0)
 P cli             3.6.3   2024-06-21 [?] RSPM (R 4.4.0)
   clipr           0.8.0   2022-02-22 [1] RSPM (R 4.4.0)
 P colorspace      2.1-1   2024-07-26 [?] RSPM (R 4.4.0)
   conflicted      1.2.0   2023-02-01 [1] RSPM (R 4.4.0)
   cpp11           0.5.0   2024-08-27 [1] RSPM (R 4.4.0)
   crayon          1.5.3   2024-06-20 [1] RSPM (R 4.4.0)
   curl            5.2.3   2024-09-20 [1] RSPM (R 4.4.0)
   data.table      1.16.0  2024-08-27 [1] RSPM (R 4.4.0)
   DBI             1.2.3   2024-06-02 [1] RSPM (R 4.4.0)
   dbplyr          2.5.0   2024-03-19 [1] RSPM (R 4.4.0)
 P digest          0.6.37  2024-08-19 [?] RSPM (R 4.4.0)
 P dplyr         * 1.1.4   2023-11-17 [?] RSPM (R 4.4.0)
   dtplyr          1.3.1   2023-03-22 [1] RSPM (R 4.4.0)
 P evaluate        1.0.0   2024-09-17 [?] RSPM (R 4.4.0)
 P fansi           1.0.6   2023-12-08 [?] RSPM (R 4.4.0)
   farver          2.1.2   2024-05-13 [1] RSPM (R 4.4.0)
 P fastmap         1.2.0   2024-05-15 [?] RSPM (R 4.4.0)
   fontawesome     0.5.2   2023-08-19 [1] RSPM (R 4.4.0)
 P forcats       * 1.0.0   2023-01-29 [?] RSPM (R 4.4.0)
   fs              1.6.4   2024-04-25 [1] RSPM (R 4.4.0)
   gargle          1.5.2   2023-07-20 [1] RSPM (R 4.4.0)
 P gdata           3.0.1   2024-10-22 [?] RSPM (R 4.4.0)
 P generics        0.1.3   2022-07-05 [?] RSPM (R 4.4.0)
 P ggplot2       * 3.5.1   2024-04-23 [?] RSPM (R 4.4.0)
 P glue            1.8.0   2024-09-30 [?] RSPM (R 4.4.0)
 P gmodels       * 2.19.1  2024-03-06 [?] RSPM (R 4.4.0)
   googledrive     2.1.1   2023-06-11 [1] RSPM (R 4.4.0)
   googlesheets4   1.1.1   2023-06-11 [1] RSPM (R 4.4.0)
 P gtable          0.3.5   2024-04-22 [?] RSPM (R 4.4.0)
 P gtools          3.9.5   2023-11-20 [?] RSPM (R 4.4.0)
   haven           2.5.4   2023-11-30 [1] RSPM (R 4.4.0)
   highr           0.11    2024-05-26 [1] RSPM (R 4.4.0)
 P hms             1.1.3   2023-03-21 [?] RSPM (R 4.4.0)
 P htmltools       0.5.8.1 2024-04-04 [?] RSPM (R 4.4.0)
   httr            1.4.7   2023-08-15 [1] RSPM (R 4.4.0)
   ids             1.0.1   2017-05-31 [1] RSPM (R 4.4.0)
   isoband         0.2.7   2022-12-20 [1] RSPM (R 4.4.0)
   jquerylib       0.1.4   2021-04-26 [1] RSPM (R 4.4.0)
 P jsonlite        1.8.9   2024-09-20 [?] RSPM (R 4.4.0)
 P knitr           1.48    2024-07-07 [?] RSPM (R 4.4.0)
   labeling        0.4.3   2023-08-29 [1] RSPM (R 4.4.0)
   lattice         0.22-6  2024-03-20 [2] CRAN (R 4.4.3)
 P lifecycle       1.0.4   2023-11-07 [?] RSPM (R 4.4.0)
 P lubridate     * 1.9.3   2023-09-27 [?] RSPM (R 4.4.0)
 P magrittr        2.0.3   2022-03-30 [?] RSPM (R 4.4.0)
 P MASS            7.3-61  2024-06-13 [?] RSPM (R 4.4.0)
   Matrix          1.7-1   2024-10-18 [1] RSPM (R 4.4.0)
   memoise         2.0.1   2021-11-26 [1] RSPM (R 4.4.0)
   mgcv            1.9-1   2023-12-21 [2] CRAN (R 4.4.3)
   mime            0.12    2021-09-28 [1] RSPM (R 4.4.0)
   modelr          0.1.11  2023-03-22 [1] RSPM (R 4.4.0)
 P munsell         0.5.1   2024-04-01 [?] RSPM (R 4.4.0)
   nlme            3.1-166 2024-08-14 [1] RSPM (R 4.4.0)
   openssl         2.2.2   2024-09-20 [1] RSPM (R 4.4.0)
 P pillar          1.9.0   2023-03-22 [?] RSPM (R 4.4.0)
 P pkgconfig       2.0.3   2019-09-22 [?] RSPM (R 4.4.0)
   prettyunits     1.2.0   2023-09-24 [1] RSPM (R 4.4.0)
   processx        3.8.4   2024-03-16 [1] RSPM (R 4.4.0)
   progress        1.2.3   2023-12-06 [1] RSPM (R 4.4.0)
   ps              1.8.0   2024-09-12 [1] RSPM (R 4.4.0)
 P purrr         * 1.0.2   2023-08-10 [?] RSPM (R 4.4.0)
 P R6              2.5.1   2021-08-19 [?] RSPM (R 4.4.0)
   ragg            1.3.3   2024-09-11 [1] RSPM (R 4.4.0)
   rappdirs        0.3.3   2021-01-31 [1] RSPM (R 4.4.0)
   RColorBrewer    1.1-3   2022-04-03 [1] RSPM (R 4.4.0)
 P readr         * 2.1.5   2024-01-10 [?] RSPM (R 4.4.0)
   readxl          1.4.3   2023-07-06 [1] RSPM (R 4.4.0)
   rematch         2.0.0   2023-08-30 [1] RSPM (R 4.4.0)
   rematch2        2.1.2   2020-05-01 [1] RSPM (R 4.4.0)
   reprex          2.1.1   2024-07-06 [1] RSPM (R 4.4.0)
 P rlang           1.1.4   2024-06-04 [?] RSPM (R 4.4.0)
 P rmarkdown       2.28    2024-08-17 [?] RSPM (R 4.4.0)
   rstudioapi      0.16.0  2024-03-24 [1] RSPM (R 4.4.0)
   rvest           1.0.4   2024-02-12 [1] RSPM (R 4.4.0)
   sass            0.4.9   2024-03-15 [1] RSPM (R 4.4.0)
 P scales          1.3.0   2023-11-28 [?] RSPM (R 4.4.0)
   selectr         0.4-2   2019-11-20 [1] RSPM (R 4.4.0)
 P stringi         1.8.4   2024-05-06 [?] RSPM (R 4.4.0)
 P stringr       * 1.5.1   2023-11-14 [?] RSPM (R 4.4.0)
   sys             3.4.3   2024-10-04 [1] RSPM (R 4.4.0)
   systemfonts     1.1.0   2024-05-15 [1] RSPM (R 4.4.0)
   textshaping     0.4.0   2024-05-24 [1] RSPM (R 4.4.0)
 P tibble        * 3.2.1   2023-03-20 [?] RSPM (R 4.4.0)
 P tidyr         * 1.3.1   2024-01-24 [?] RSPM (R 4.4.0)
 P tidyselect      1.2.1   2024-03-11 [?] RSPM (R 4.4.0)
 P tidyverse     * 2.0.0   2023-02-22 [?] RSPM (R 4.4.0)
 P timechange      0.3.0   2024-01-18 [?] RSPM (R 4.4.0)
   tinytex         0.53    2024-09-15 [1] RSPM (R 4.4.0)
 P tzdb            0.4.0   2023-05-12 [?] RSPM (R 4.4.0)
 P utf8            1.2.4   2023-10-22 [?] RSPM (R 4.4.0)
   uuid            1.2-1   2024-07-29 [1] RSPM (R 4.4.0)
 P vctrs           0.6.5   2023-12-01 [?] RSPM (R 4.4.0)
   viridisLite     0.4.2   2023-05-02 [1] RSPM (R 4.4.0)
   vroom           1.6.5   2023-12-05 [1] RSPM (R 4.4.0)
 P withr           3.0.1   2024-07-31 [?] RSPM (R 4.4.0)
 P xfun            0.48    2024-10-03 [?] RSPM (R 4.4.0)
   xml2            1.3.6   2023-12-04 [1] RSPM (R 4.4.0)
 P yaml            2.3.10  2024-07-26 [?] RSPM (R 4.4.0)

 [1] /home/michael/source/CAMIS/renv/library/linux-ubuntu-noble/R-4.4/x86_64-pc-linux-gnu
 [2] /opt/R/4.4.3/lib/R/library

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────