Linear Regression

To demonstrate linear regression, we examine a dataset describing the relationship between height and weight in a group of 237 teenage boys and girls. The dataset is available here and is imported into the workspace.

Descriptive Statistics

The first step is to obtain simple descriptive statistics for the numeric variables of the htwt data and one-way frequencies for the categorical variables. In pandas this can be done with describe() for the numeric columns and value_counts() for the categorical ones. There are 237 participants, aged 13.9 to 25 years. This is a cross-sectional study, with one observation per participant. We can use this dataset to examine how participants' height relates to their age and sex.

import pandas as pd
import statsmodels.api as sm

# Importing CSV
htwt = pd.read_csv("../data/htwt.csv")
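
The descriptive step described above is not shown in code, so here is a minimal sketch of it. To keep the example runnable on its own, it uses only the five rows printed by htwt.head() later in this document rather than the full CSV; the names sample, numeric_summary and sex_counts are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Illustrative excerpt: the five rows shown by htwt.head() in this document,
# used in place of the full CSV so the sketch runs on its own.
sample = pd.DataFrame({
    "SEX": ["f", "f", "f", "f", "f"],
    "AGE": [14.3, 15.5, 15.3, 16.1, 19.1],
    "HEIGHT": [56.3, 62.3, 63.3, 59.0, 62.5],
    "WEIGHT": [85.0, 105.0, 108.0, 92.0, 112.5],
})

# Descriptive statistics (count, mean, std, min, quartiles, max)
# for the numeric variables
numeric_summary = sample[["AGE", "HEIGHT", "WEIGHT"]].describe()
print(numeric_summary)

# One-way frequency table for the categorical variable
sex_counts = sample["SEX"].value_counts()
print(sex_counts)
```

Applied to the full htwt data frame, the same two calls would give the age range and the number of boys and girls quoted in the text.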

To build a regression model of the relationship between age and height that allows a separate fit for females, we first create a flag variable identifying females and an interaction variable between age and the female flag.

htwt['female'] = (htwt['SEX'] == 'f').astype(int)
htwt['fem_age'] = htwt['AGE'] * htwt['female']
htwt.head()
   ROW SEX   AGE  HEIGHT  WEIGHT  female  fem_age
0    1   f  14.3    56.3    85.0       1     14.3
1    2   f  15.5    62.3   105.0       1     15.5
2    3   f  15.3    63.3   108.0       1     15.3
3    4   f  16.1    59.0    92.0       1     16.1
4    5   f  19.1    62.5   112.5       1     19.1

Regression Analysis

Next, we fit a regression model relating height to gender, age, and the interaction variable created in the step above. We use a boolean filter to restrict the analysis to participants who are 19 years old or younger. The summary output reports a 95% confidence interval for each parameter in the model (the [0.025 0.975] columns). The model that we are fitting is height = b0 + b1 × female + b2 × age + b3 × fem_age + e

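# Restrict the analysis to participants aged 19 or younger
age_le_19 = htwt['AGE'] <= 19
X = htwt.loc[age_le_19, ['female', 'AGE', 'fem_age']]
X = sm.add_constant(X)  # add the intercept column
Y = htwt.loc[age_le_19, 'HEIGHT']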

model = sm.OLS(Y, X).fit()

model.summary()
                            OLS Regression Results

Dep. Variable:                 HEIGHT   R-squared:                       0.460
Model:                            OLS   Adj. R-squared:                  0.452
Method:                 Least Squares   F-statistic:                     60.93
Date:                Fri, 25 Oct 2024   Prob (F-statistic):           1.50e-28
Time:                        08:31:55   Log-Likelihood:                -534.17
No. Observations:                 219   AIC:                             1076.
Df Residuals:                     215   BIC:                             1090.
Df Model:                           3
Covariance Type:            nonrobust

                 coef    std err          t      P>|t|      [0.025      0.975]
const         28.8828      2.873     10.052      0.000      23.219      34.547
female        13.6123      4.019      3.387      0.001       5.690      21.534
AGE            2.0313      0.178     11.435      0.000       1.681       2.381
fem_age       -0.9294      0.248     -3.750      0.000      -1.418      -0.441

Omnibus:                        1.300   Durbin-Watson:                   2.284
Prob(Omnibus):                  0.522   Jarque-Bera (JB):                0.981
Skew:                          -0.133   Prob(JB):                        0.612
Kurtosis:                       3.191   Cond. No.                         450.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

From the coefficients table, b0, b1, b2, and b3 are estimated as b0 = 28.88, b1 = 13.61, b2 = 2.03, and b3 = -0.93.

The resulting regression model for height, age, and gender based on the available data is height = 28.8828 + 13.6123 × female + 2.0313 × age - 0.9294 × fem_age
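
Because female is a 0/1 flag and fem_age is age times that flag, the fitted equation amounts to two separate lines, one per sex. The following sketch works through that arithmetic using the coefficient estimates from the output above; the function name predicted_height and the example age of 15 are illustrative assumptions.

```python
# Coefficient estimates copied from the regression output above
b0, b1, b2, b3 = 28.8828, 13.6123, 2.0313, -0.9294

def predicted_height(age, female):
    """Fitted height for a given age and sex flag (1 = female, 0 = male)."""
    return b0 + b1 * female + b2 * age + b3 * (age * female)

# For males (female = 0) the flag and interaction terms drop out
male_intercept, male_slope = b0, b2
# For females (female = 1) they shift the intercept and the slope
female_intercept, female_slope = b0 + b1, b2 + b3

print(f"males:   height = {male_intercept:.4f} + {male_slope:.4f} x age")
print(f"females: height = {female_intercept:.4f} + {female_slope:.4f} x age")
print(predicted_height(15, female=0), predicted_height(15, female=1))
```

Under this model, each additional year of age is associated with about 2.03 more units of height for males and about 1.10 for females, within the fitted age range of 19 and under.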