Logistic Regression in R

In binary logistic regression, there is a single binary dependent variable, coded by an indicator variable. For example, if we represent a response as 1 and non-response as 0, then the corresponding probability of response, can be between 0 (certainly not a response) and 1 (certainly a response) - hence the labeling !

The logistic model models the log-odds of an event as a linear combination of one or more independent variables (explanatory variables). If we observed \((y_i, x_i),\) where \(y_i\) is a Bernoulli variable and \(x_i\) a vector of explanatory variables, the model for \(\pi_i = P(y_i=1)\) is

\[ \text{logit}(\pi_i)= \log\left\{ \frac{\pi_i}{1-\pi_i}\right\} = \beta_0 + \beta x_i, i = 1,\ldots,n \]

The model is especially useful in case-control studies and leads to the effect of risk factors by odds ratios.

Example: Lung Cancer Data

Data source: Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ. Kugler JW. Bartel J. Law M. Bateman M. Klatt NE. et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.

Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities (see ?lung for details).

library(survival) 
glimpse(lung)
Rows: 228
Columns: 10
$ inst      <dbl> 3, 3, 3, 5, 1, 12, 7, 11, 1, 7, 6, 16, 11, 21, 12, 1, 22, 16…
$ time      <dbl> 306, 455, 1010, 210, 883, 1022, 310, 361, 218, 166, 170, 654…
$ status    <dbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ age       <dbl> 74, 68, 56, 57, 60, 74, 68, 71, 53, 61, 57, 68, 68, 60, 57, …
$ sex       <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, …
$ ph.ecog   <dbl> 1, 0, 0, 1, 0, 1, 2, 2, 1, 2, 1, 2, 1, NA, 1, 1, 1, 2, 2, 1,…
$ ph.karno  <dbl> 90, 90, 90, 90, 100, 50, 70, 60, 70, 70, 80, 70, 90, 60, 80,…
$ pat.karno <dbl> 100, 90, 90, 60, 90, 80, 60, 80, 80, 70, 80, 70, 90, 70, 70,…
$ meal.cal  <dbl> 1175, 1225, NA, 1150, NA, 513, 384, 538, 825, 271, 1025, NA,…
$ wt.loss   <dbl> NA, 15, 15, 11, 0, 0, 10, 1, 16, 34, 27, 23, 5, 32, 60, 15, …

Model Fit

We analyze the event of weight gain (or staying the same weight) in lung cancer patients in dependency of age, sex, ECOG performance score and calories consumed at meals. In the original data, a positive number for the wt.loss variable is a weight loss, negative number is a gain. We start by dichotomising the response such that a result >0 is a weight loss, <= weight gain and creating a factor variable wt_grp.

One of the most important things to remember is to ensure you tell R what your event is ! We want to model Events / Non-events, and hence your reference category for wt_grp dichotomous variable below is the weight loss level. Therefore, by telling R that your reference category is weight loss, you are effectively telling R that your Event = Weight Gain !

lung2 <- survival::lung %>% 
  mutate(
    wt_grp = factor(wt.loss > 0, labels = c("weight loss", "weight gain"))
  ) 

#specify that weight loss should be used as baseline level (i.e we want to model weight gain as the event)
lung2$wt_grp <- relevel(lung2$wt_grp, ref='weight loss')

m1 <- glm(wt_grp ~ age + sex + ph.ecog + meal.cal, data = lung2, family = binomial(link="logit"))
summary(m1)

Call:
glm(formula = wt_grp ~ age + sex + ph.ecog + meal.cal, family = binomial(link = "logit"), 
    data = lung2)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)  
(Intercept)  3.2631673  1.6488207   1.979   0.0478 *
age         -0.0101717  0.0208107  -0.489   0.6250  
sex         -0.8717357  0.3714042  -2.347   0.0189 *
ph.ecog      0.4179665  0.2588653   1.615   0.1064  
meal.cal    -0.0008869  0.0004467  -1.985   0.0471 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 202.36  on 169  degrees of freedom
Residual deviance: 191.50  on 165  degrees of freedom
  (58 observations deleted due to missingness)
AIC: 201.5

Number of Fisher Scoring iterations: 4

The model summary contains the parameter estimates \(\beta_j\) for each explanatory variable \(x_j\), corresponding to the log-odds for the response variable to take the value \(1\), conditional on all other explanatory variables remaining constant. For better interpretation, we can exponentiate these estimates, to obtain estimates for the odds instead and provide \(95\)% confidence intervals:

exp(coef(m1))
(Intercept)         age         sex     ph.ecog    meal.cal 
 26.1321742   0.9898798   0.4182250   1.5188698   0.9991135 
exp(confint(m1))
Waiting for profiling to be done...
                2.5 %      97.5 %
(Intercept) 1.0964330 730.3978786
age         0.9495388   1.0307216
sex         0.1996925   0.8617165
ph.ecog     0.9194053   2.5491933
meal.cal    0.9982107   0.9999837

Model Comparison

To compare two logistic models, the residual deviances (-2 * log likelihoods) are compared against a \(\chi^2\)-distribution with degrees of freedom calculated using the difference in the two models’ parameters. Below, the only difference is the inclusion/exclusion of age in the model, hence we test using \(\chi^2\) with 1 df. Here testing at the \(5\)% level.

m2 <- glm(wt_grp ~ sex + ph.ecog + meal.cal, data = lung2, family = binomial(link="logit"))
summary(m2)

Call:
glm(formula = wt_grp ~ sex + ph.ecog + meal.cal, family = binomial(link = "logit"), 
    data = lung2)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)   
(Intercept)  2.5606595  0.7976887   3.210  0.00133 **
sex         -0.8359241  0.3637378  -2.298  0.02155 * 
ph.ecog      0.3794295  0.2469030   1.537  0.12435   
meal.cal    -0.0008334  0.0004346  -1.918  0.05517 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 202.36  on 169  degrees of freedom
Residual deviance: 191.74  on 166  degrees of freedom
  (58 observations deleted due to missingness)
AIC: 199.74

Number of Fisher Scoring iterations: 4
anova(m1, m2, test = "LRT")
Analysis of Deviance Table

Model 1: wt_grp ~ age + sex + ph.ecog + meal.cal
Model 2: wt_grp ~ sex + ph.ecog + meal.cal
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       165     191.50                     
2       166     191.75 -1 -0.24046   0.6239

Stackexchange here has a good article describing this method and the difference between comparing 2 models using the likelihood ratio tests versus using wald tests and Pr>chisq (from the maximum likelihood estimate). Note: anova(m1, m2, test = "Chisq") and using test="LRT" as above are synonymous in this context.

Prediction

Predictions from the model for the log-odds of a patient with new data to experience a weight loss are derived using predict():

# new female, symptomatic but completely ambulatory patient consuming 2500 calories
new_pt <- data.frame(sex=2, ph.ecog=1, meal.cal=2500)
predict(m2, new_pt, type = "response")
       1 
0.306767