Logistic Regression
Imports
# data manipulation
import pandas as pd
import numpy as np

# modelling
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
Background
Logistic regression models the relationship between a binary variable and a set of explanatory variables. The logit of the expectation is modelled as a linear function of the explanatory variables. If we observe \((y_i, x_i),\) where \(y_i\) is a Bernoulli variable and \(x_i\) a vector of explanatory variables, the model for \(\pi_i = P(y_i=1)\) is
\[ \text{logit}(\pi_i) = \log\left\{ \frac{\pi_i}{1-\pi_i} \right\} = \beta_0 + \beta x_i, \quad i = 1, \ldots, n \]
The model is especially useful in case-control studies, where it allows the effect of risk factors to be quantified as odds ratios.
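As a quick numerical illustration of the link function (a minimal sketch, with a hypothetical linear predictor value), the inverse logit maps the linear predictor back to a probability:
import numpy as np

def inv_logit(eta):
    # inverse of the logit: maps a linear predictor to a probability
    return 1 / (1 + np.exp(-eta))

eta = 0.5  # hypothetical value of beta_0 + beta * x_i
pi = inv_logit(eta)
print(pi)                      # 0.6224593312018546
print(np.log(pi / (1 - pi)))   # applying logit recovers eta = 0.5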
Example: Lung cancer data
Data source: Loprinzi CL, Laurie JA, Wieand HS, Krook JE, Novotny PJ, Kugler JW, Bartel J, Law M, Bateman M, Klatt NE, et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.
These data were sourced from the R package {survival} and have been downloaded and stored in the data folder.
# import and prepare the data
lung2 = pd.read_csv("../data/lung_cancer.csv")

# create weight loss factor while respecting missing values
# 1: patients with a weight loss of more than zero
# 0: patients with a weight loss of zero or less
lung2["wt_grp"] = np.where(lung2["wt.loss"].isnull(), np.nan, (lung2["wt.loss"] > 0).astype(int))
Logistic Regression Modelling
Let’s further prepare our data for modelling by selecting the explanatory variables and the dependent variable. The Python packages that we are aware of require complete data (i.e. no missing values), so for convenience of demonstrating these methods we will drop rows with missing values.
= ["age", "sex", "ph.ecog", "meal.cal"]
x_vars = "wt_grp"
y_var
# drop rows with missing values
= lung2.dropna(axis=0)
lung2_complete
#select variables
= lung2_complete[x_vars]
x = lung2_complete[y_var] y
Statsmodels package
We will use the sm.Logit() method to fit our logistic regression model.
# intercept column
x_sm = sm.add_constant(x)

# fit model
lr_sm = sm.Logit(y, x_sm).fit()
print(lr_sm.summary())
Optimization terminated successfully.
Current function value: 0.568825
Iterations 5
Logit Regression Results
==============================================================================
Dep. Variable: wt_grp No. Observations: 167
Model: Logit Df Residuals: 162
Method: MLE Df Model: 4
Date: Fri, 25 Oct 2024 Pseudo R-squ.: 0.05169
Time: 08:32:33 Log-Likelihood: -94.994
converged: True LL-Null: -100.17
Covariance Type: nonrobust LLR p-value: 0.03484
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 3.3576 1.654 2.029 0.042 0.115 6.600
age -0.0126 0.021 -0.598 0.550 -0.054 0.029
sex -0.8645 0.371 -2.328 0.020 -1.592 -0.137
ph.ecog 0.4182 0.263 1.592 0.111 -0.097 0.933
meal.cal -0.0009 0.000 -1.932 0.053 -0.002 1.27e-05
==============================================================================
Model fitting
In addition to the information contained in the summary, we can display the model coefficients as odds ratios:
print("Odds ratios for statsmodels logistic regression:")
print(np.exp(lr_sm.params))
Odds ratios for statsmodels logistic regression:
const 28.719651
age 0.987467
sex 0.421266
ph.ecog 1.519198
meal.cal 0.999140
dtype: float64
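As a cross-check, exponentiating a single coefficient from the summary table by hand reproduces the corresponding odds ratio; for example, for sex:
# exp of the sex coefficient reported in the summary above
print(np.exp(-0.8645))  # approximately 0.4213, matching the table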
We can also provide the 95% confidence intervals (i.e. at the 5% significance level) for the odds ratios:
print("CI at 5% for statsmodels logistic regression:")
print(np.exp(lr_sm.conf_int(alpha = 0.05)))
CI at 5% for statsmodels logistic regression:
0 1
const 1.121742 735.301118
age 0.947449 1.029175
sex 0.203432 0.872354
ph.ecog 0.907984 2.541852
meal.cal 0.998269 1.000013
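For reporting, it can be convenient to display the odds ratios and their confidence limits side by side; a sketch using the objects already defined:
# combine point estimates and confidence limits, then exponentiate
or_table = np.exp(pd.concat([lr_sm.params, lr_sm.conf_int(alpha=0.05)], axis=1))
or_table.columns = ["OR", "2.5%", "97.5%"]
print(or_table)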
Prediction
Let’s use our trained model to make a weight loss prediction about a new patient.
# new female, symptomatic but completely ambulatory patient consuming 2500 calories
new_pt = pd.DataFrame({
    "age": [56],
    "sex": [2],
    "ph.ecog": [1.00],
    "meal.cal": [2500]
})

# add the intercept term to the new data; for a single row this must be
# forced using the has_constant="add" argument of add_constant
new_pt_sm = sm.add_constant(new_pt, has_constant="add")

print("Probability of weight loss using the statsmodels package:")
print(lr_sm.predict(new_pt_sm))
Probability of weight loss using the statsmodels package:
0 0.308057
dtype: float64
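Since the predicted probability is just the inverse logit of the linear predictor, it can also be reproduced directly from the fitted parameters (a sketch using the objects defined above):
# linear predictor for the new patient, then the inverse logit
eta = (new_pt_sm * lr_sm.params).sum(axis=1)
print(1 / (1 + np.exp(-eta)))  # matches lr_sm.predict(new_pt_sm)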
Scikit-learn Package
The scikit-learn package is a popular package for machine learning and predictive modelling. It’s important to note that L2 regularisation is applied by default in the scikit-learn implementation of logistic regression. More recent releases of the package include an option to apply no regularisation penalty.
lr_sk = LogisticRegression(penalty=None).fit(x, y)
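For comparison, the default constructor applies an L2 penalty with C=1.0, which shrinks the coefficients towards zero relative to the unpenalised fit above; a sketch:
# default fit: penalty="l2", C=1.0, so coefficients (and odds ratios) are shrunk
lr_sk_default = LogisticRegression().fit(x, y)
print(np.exp(lr_sk_default.coef_))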
Unlike the statsmodels approach, scikit-learn doesn’t have a summary method for the model, but you can extract some of the model parameters as follows:
print("Intercept for scikit learn logistic regression:")
print(lr_sk.intercept_)
print("Odds ratios for scikit learn logistic regression:")
print(np.exp(lr_sk.coef_))
Intercept for scikit learn logistic regression:
[3.35756405]
Odds ratios for scikit learn logistic regression:
[[0.98746739 0.42126736 1.51919379 0.99914048]]
However, obtaining the confidence intervals and other metrics is not directly supported in scikit-learn.
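To attach the explanatory variable names to the scikit-learn coefficients, they can be wrapped in a labelled Series (a small convenience sketch):
# label the coefficient array with the explanatory variable names
coef_table = pd.Series(lr_sk.coef_[0], index=x_vars)
print(np.exp(coef_table))  # odds ratios labelled by variable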
Prediction
Using the same new patient example, we can use our logistic regression model to make a prediction. The predict_proba method is used to return the probability for each class. If you are interested in viewing the prediction for y = 1, i.e. the probability of weight loss, then you can select the second probability as shown:
print("Probability of weight loss using the scikit-learn package:")
print(lr_sk.predict_proba(new_pt)[:,1])
Probability of weight loss using the scikit-learn package:
[0.30805813]
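Relatedly, the predict method returns the most likely class label rather than a probability:
# class prediction: 0 here, since the predicted probability of weight loss (0.308) is below 0.5
print(lr_sk.predict(new_pt))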
Conclusions
There are two main ways to fit a logistic regression using Python. Each of these packages has its advantages, with statsmodels geared more towards model and coefficient interpretation in low-dimensional data settings, and the scikit-learn implementation more appropriate for prediction-focused use cases with more complex, higher-dimensional data.