Logistic Regression

Imports

#data manipulation
import pandas as pd
import numpy as np

#modelling
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

Background

Logistic regression models the relationship between a binary response variable and a set of explanatory variables. The logit of the expectation is modelled as a linear function of the explanatory variables. If we observe \((y_i, x_i)\), where \(y_i\) is a Bernoulli variable and \(x_i\) a vector of explanatory variables, the model for \(\pi_i = P(y_i=1)\) is

\[ \text{logit}(\pi_i) = \log\left\{ \frac{\pi_i}{1-\pi_i}\right\} = \beta_0 + \beta^\top x_i, \quad i = 1,\ldots,n \]

The model is especially useful in case-control studies, where it allows the effect of risk factors to be quantified as odds ratios.
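To see why, note that the odds for patient \(i\) are \(\exp(\beta_0 + \beta^\top x_i)\). For two patients whose covariates differ only by one unit in the \(j\)-th explanatory variable, every other term cancels in the ratio of their odds:

\[ \frac{\text{odds}(x_{ij} + 1)}{\text{odds}(x_{ij})} = \exp(\beta_j) \]

so each exponentiated coefficient is the odds ratio for a one-unit increase in that variable, holding the others fixed.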

Example: Lung cancer data

Data source: Loprinzi CL, Laurie JA, Wieand HS, Krook JE, Novotny PJ, Kugler JW, Bartel J, Law M, Bateman M, Klatt NE, et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 1994;12(3):601-7.

These data were sourced from the R package {survival} and have been downloaded and stored in the data folder.

# import and prepare the data
lung2 = pd.read_csv("../data/lung_cancer.csv")

#create weight loss factor while respecting missing values
# 1: patients with a weight loss of more than zero
# 0: patients with a weight loss of zero or less
lung2["wt_grp"] = np.where(lung2["wt.loss"].isnull(), np.nan, (lung2["wt.loss"] > 0).astype(int))

Logistic Regression Modelling

Let’s further prepare our data for modelling by selecting the explanatory variables and the dependent variable. The Python packages that we are aware of require complete data (i.e. no missing values), so for convenience of demonstrating these methods we will drop rows with missing values.

x_vars = ["age", "sex", "ph.ecog", "meal.cal"]
y_var = "wt_grp"

# drop rows with missing values 
lung2_complete = lung2.dropna(axis=0)

#select variables
x = lung2_complete[x_vars]
y = lung2_complete[y_var]
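Note that dropna() as used above removes any row with a missing value in any column of the data frame, not just in the modelling variables. If you only need completeness in the variables used by the model, a subset argument restricts the check; a minimal sketch (the resulting sample may be larger than the complete-case set used below, so estimates would differ):

# keep rows that are complete in the modelling variables only
lung2_model = lung2.dropna(subset=x_vars + [y_var])
x_alt = lung2_model[x_vars]
y_alt = lung2_model[y_var]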

Statsmodels package

We will use the sm.Logit() method to fit our logistic regression model.

#intercept column
x_sm = sm.add_constant(x)

#fit model
lr_sm = sm.Logit(y, x_sm).fit() 
print(lr_sm.summary())
Optimization terminated successfully.
         Current function value: 0.568825
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 wt_grp   No. Observations:                  167
Model:                          Logit   Df Residuals:                      162
Method:                           MLE   Df Model:                            4
Date:                Thu, 10 Oct 2024   Pseudo R-squ.:                 0.05169
Time:                        16:53:06   Log-Likelihood:                -94.994
converged:                       True   LL-Null:                       -100.17
Covariance Type:            nonrobust   LLR p-value:                   0.03484
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.3576      1.654      2.029      0.042       0.115       6.600
age           -0.0126      0.021     -0.598      0.550      -0.054       0.029
sex           -0.8645      0.371     -2.328      0.020      -1.592      -0.137
ph.ecog        0.4182      0.263      1.592      0.111      -0.097       0.933
meal.cal      -0.0009      0.000     -1.932      0.053      -0.002    1.27e-05
==============================================================================

Model fitting

In addition to the information contained in the summary, we can display the model coefficients as odds ratios:

print("Odds ratios for statsmodels logistic regression:")
print(np.exp(lr_sm.params))
Odds ratios for statsmodels logistic regression:
const       28.719651
age          0.987467
sex          0.421266
ph.ecog      1.519198
meal.cal     0.999140
dtype: float64

We can also compute the 95% confidence intervals for the odds ratios:

print("95% CI for statsmodels logistic regression:")
print(np.exp(lr_sm.conf_int(alpha = 0.05)))
95% CI for statsmodels logistic regression:
                 0           1
const     1.121742  735.301118
age       0.947449    1.029175
sex       0.203432    0.872354
ph.ecog   0.907984    2.541852
meal.cal  0.998269    1.000013
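For reporting, it can be convenient to collect the odds ratios and their confidence limits into a single table. A small sketch using the fitted results object above:

# combine odds ratios and 95% confidence limits into one table
ci = np.exp(lr_sm.conf_int(alpha=0.05))
or_table = pd.DataFrame({
    "OR": np.exp(lr_sm.params),
    "lower 95%": ci[0],
    "upper 95%": ci[1],
})
print(or_table)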

Prediction

Let’s use our trained model to make a weight loss prediction about a new patient.

# new female, symptomatic but completely ambulatory patient consuming 2500 calories
new_pt = pd.DataFrame({
    "age": [56],
    "sex": [2],
    "ph.ecog": [1.00], 
    "meal.cal": [2500]
})

# add the intercept term to the new data; with a single row every column
# looks constant, so the intercept must be forced with has_constant="add"
new_pt_sm = sm.add_constant(new_pt, has_constant="add")
print("Probability of weight loss using the statsmodels package:")
print(lr_sm.predict(new_pt_sm))
Probability of weight loss using the statsmodels package:
0    0.308057
dtype: float64

Scikit-learn Package

The scikit-learn package is a popular package for machine learning and predictive modelling.

Warning

It’s important to note that L2 regularisation is applied by default in the scikit-learn implementation of logistic regression, which will shrink the coefficients. More recent releases (scikit-learn 1.2 and later) accept penalty=None to turn the regularisation off, which is what we want here in order to match the statsmodels fit.

lr_sk = LogisticRegression(penalty=None).fit(x, y)

Unlike the statsmodels approach, scikit-learn doesn’t provide a summary method for the model, but you can extract some of the model parameters as follows:

print("Intercept for scikit learn logistic regression:")
print(lr_sk.intercept_)
print("Odds ratios for scikit learn logistic regression:")
print(np.exp(lr_sk.coef_))
Intercept for scikit learn logistic regression:
[3.35756405]
Odds ratios for scikit learn logistic regression:
[[0.98746739 0.42126736 1.51919379 0.99914048]]

However, obtaining the confidence intervals and other metrics is not directly supported in scikit-learn.
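If interval estimates are needed alongside a scikit-learn fit, one common workaround is to bootstrap them. A minimal sketch (the 500 replicates and the percentile method are illustrative choices, not part of the original analysis):

from sklearn.utils import resample

# bootstrap percentile intervals for the scikit-learn odds ratios
boot_ors = []
for i in range(500):
    xb, yb = resample(x, y, random_state=i)  # resample rows with replacement
    fit_b = LogisticRegression(penalty=None, max_iter=1000).fit(xb, yb)
    boot_ors.append(np.exp(fit_b.coef_[0]))  # odds ratios for this replicate
boot_ors = np.array(boot_ors)

lower, upper = np.percentile(boot_ors, [2.5, 97.5], axis=0)
print(pd.DataFrame({"lower": lower, "upper": upper}, index=x_vars))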

Prediction

Using the same new patient example, we can use our logistic regression model to make a prediction. The predict_proba method returns the probability for each class. If you are interested in the prediction for y = 1, i.e. the probability of weight loss, then you can select the second column as shown:

print("Probability of weight loss using the scikit-learn package:")
print(lr_sk.predict_proba(new_pt)[:,1])
Probability of weight loss using the scikit-learn package:
[0.30805813]
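If a hard classification is wanted rather than a probability, the predict method returns the class label directly, which for a binary model corresponds to a 0.5 cut-off on the probability above:

# class prediction (1 = weight loss) using the default 0.5 cut-off
print(lr_sk.predict(new_pt))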

Conclusions

There are two main ways to fit a logistic regression in Python. Each of these packages has its advantages: statsmodels is geared more towards model and coefficient interpretation in low-dimensional data settings, while the scikit-learn implementation is more appropriate for prediction-focused use cases with more complex, higher-dimensional data.