```
#data manipulation
import pandas as pd
import numpy as np
#modelling
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
```

# Logistic Regression

# Imports

# Background

Logistic regression models the relationship between a binary variable and set of explanatory variables. The logit of expectation is explained as linear for of explanatory variables. If we observed \((y_i, x_i),\) where \(y_i\) is a Bernoulli variable and \(x_i\) a vector of explanatory variables, the model for \(\pi_i = P(y_i=1)\) is

\[ \text{logit}(\pi_i)= \log\left\{ \frac{\pi_i}{1-\pi_i}\right\} = \beta_0 + \beta x_i, i = 1,\ldots,n \]

The model is especially useful in case-control studies and leads to the effect of risk factors by odds ratios.

# Example : Lung cancer data

*Data source: Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ. Kugler JW. Bartel J. Law M. Bateman M. Klatt NE. et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.*

These data were sourced from the R package {survival} and have been downloaded and stored in the `data`

folder.

```
# importing and prepare
= pd.read_csv("../data/lung_cancer.csv")
lung2
#create weight loss factor while respecting missing values
# 1: patients with a weight loss of more than zero
# 0: patients a weight loss of zero oe less
"wt_grp"] = np.where(lung2["wt.loss"].isnull(), np.nan, (lung2["wt.loss"] > 0).astype(int)) lung2[
```

# Logistic Regression Modelling

Let’s further prepare our data for modelling by selecting the explanatory variables and the dependent variable. The Python packages that we are are aware of require complete (i.e. no missing values) data so for convenience of demonstrating these methods we will drop rows with missing values.

```
= ["age", "sex", "ph.ecog", "meal.cal"]
x_vars = "wt_grp"
y_var
# drop rows with missing values
= lung2.dropna(axis=0)
lung2_complete
#select variables
= lung2_complete[x_vars]
x = lung2_complete[y_var] y
```

## Statsmodels package

We will use the `sm.Logit()`

method to fit our logistic regression model.

```
#intercept column
= sm.add_constant(x)
x_sm
#fit model
= sm.Logit(y, x_sm).fit()
lr_sm print(lr_sm.summary())
```

```
Optimization terminated successfully.
Current function value: 0.568825
Iterations 5
Logit Regression Results
==============================================================================
Dep. Variable: wt_grp No. Observations: 167
Model: Logit Df Residuals: 162
Method: MLE Df Model: 4
Date: Thu, 10 Oct 2024 Pseudo R-squ.: 0.05169
Time: 16:53:06 Log-Likelihood: -94.994
converged: True LL-Null: -100.17
Covariance Type: nonrobust LLR p-value: 0.03484
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 3.3576 1.654 2.029 0.042 0.115 6.600
age -0.0126 0.021 -0.598 0.550 -0.054 0.029
sex -0.8645 0.371 -2.328 0.020 -1.592 -0.137
ph.ecog 0.4182 0.263 1.592 0.111 -0.097 0.933
meal.cal -0.0009 0.000 -1.932 0.053 -0.002 1.27e-05
==============================================================================
```

### Model fitting

In addition to the information contained in the summary, we can display the model coefficients as odds ratios:

```
print("Odds ratios for statsmodels logistic regression:")
print(np.exp(lr_sm.params))
```

```
Odds ratios for statsmodels logistic regression:
const 28.719651
age 0.987467
sex 0.421266
ph.ecog 1.519198
meal.cal 0.999140
dtype: float64
```

We can also provide the 5% confidence intervals for the odds ratios:

```
print("CI at 5% for statsmodels logistic regression:")
print(np.exp(lr_sm.conf_int(alpha = 0.05)))
```

```
CI at 5% for statsmodels logistic regression:
0 1
const 1.121742 735.301118
age 0.947449 1.029175
sex 0.203432 0.872354
ph.ecog 0.907984 2.541852
meal.cal 0.998269 1.000013
```

### Prediction

Let’s use our trained model to make a weight loss prediction about a new patient.

```
# new female, symptomatic but completely ambulatory patient consuming 2500 calories
= pd.DataFrame({
new_pt "age": [56],
"sex": [2],
"ph.ecog": [1.00],
"meal.cal": [2500]
})
# Add intercept term to the new data; for a single row this should be
# forced using the `add_constant` command
= sm.add_constant(new_pt, has_constant="add")
new_pt_sm print("Probability of weight loss using the statsmodels package:")
print(lr_sm.predict(new_pt_sm))
```

```
Probability of weight loss using the statsmodels package:
0 0.308057
dtype: float64
```

## Scikit-learn Package

The `scikit-learn`

package is a popular package for machine learning and predictive modelling.

It’s important to note that l2 regularisation is applied by default in the `scikit-learn`

implementation of logistic regression. More recent releases of this package include an option to have no regularisation penalty.

`= LogisticRegression(penalty=None).fit(x, y) lr_sk `

Unlike the `statsmodels`

approach `scikit-learn`

doesn’t have a summary method for the model but you can extract some of the model parameters as follows:

```
print("Intercept for scikit learn logistic regression:")
print(lr_sk.intercept_)
print("Odds ratios for scikit learn logistic regression:")
print(np.exp(lr_sk.coef_))
```

```
Intercept for scikit learn logistic regression:
[3.35756405]
Odds ratios for scikit learn logistic regression:
[[0.98746739 0.42126736 1.51919379 0.99914048]]
```

However, obtaining the confidence intervals and other metrics is not directly supported in `scikit-learn`

.

### Prediction

Using the same new patient example we can use our logistic regression model to make a prediction. The `predict_proba`

method is used to return the probability for each class. If you are interested in viewing the prediction for `y = 1`

, i.e. the probability of weight loss then you can select the second probability as shown:

```
print("Probability of weight loss using the scikit-learn package:")
print(lr_sk.predict_proba(new_pt)[:,1])
```

```
Probability of weight loss using the scikit-learn package:
[0.30805813]
```

## Conclusions

There are two main ways to fit a logistic regression using python. Each of these packages have their advantages with `statsmodel`

geared more towards model and coefficient interpretation in low dimensional data settings and in contrast the `scikit-learn`

implementation more appropriate for use cases focused on prediction with more complex, higher dimensional data.