```
import pandas as pd
import statsmodels.api as sm
# Importing CSV
htwt = pd.read_csv("../data/htwt.csv")
```

# Linear Regression

To demonstrate the use of linear regression, we examine a dataset that illustrates the relationship between height and weight in a group of 237 teenage boys and girls. The dataset is available here and is imported into the workspace.

### Descriptive Statistics

The first step is to obtain simple descriptive statistics for the numeric variables in the htwt data and one-way frequencies for the categorical variables. This is accomplished with a summary function. There are 237 participants, aged 13.9 to 25 years. It is a cross-sectional study, with one observation per participant. We can use this data set to examine the relationship of participants' height to their age and sex.
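As a minimal sketch of this step in pandas, `describe()` gives the numeric summaries and `value_counts()` gives the one-way frequencies. To keep the example self-contained, it assumes a small stand-in DataFrame named `sample` (a hypothetical name) holding only the first five rows of the data, so the counts below do not reflect the full 237 participants.

```python
import pandas as pd

# Stand-in for the full htwt data: only its first five rows (assumption:
# used here so the example runs without the CSV file).
sample = pd.DataFrame({
    "SEX": ["f", "f", "f", "f", "f"],
    "AGE": [14.3, 15.5, 15.3, 16.1, 19.1],
    "HEIGHT": [56.3, 62.3, 63.3, 59.0, 62.5],
    "WEIGHT": [85.0, 105.0, 108.0, 92.0, 112.5],
})

# Count, mean, std, min, quartiles, and max for each numeric variable.
numeric_summary = sample[["AGE", "HEIGHT", "WEIGHT"]].describe()

# One-way frequency table for the categorical variable.
sex_counts = sample["SEX"].value_counts()

print(numeric_summary)
print(sex_counts)
```

Applied to the full data, the same two calls on `htwt` would show the age range and the number of boys and girls.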

To build a regression model of the relationship between age and height for females, we first create a flag variable identifying females and an interaction variable between age and the female flag.

```
# Flag variable: 1 for females, 0 otherwise
htwt['female'] = (htwt['SEX'] == 'f').astype(int)
# Interaction between age and the female flag
htwt['fem_age'] = htwt['AGE'] * htwt['female']
htwt.head()
```

| | ROW | SEX | AGE | HEIGHT | WEIGHT | female | fem_age |
|---|---|---|---|---|---|---|---|
| 0 | 1 | f | 14.3 | 56.3 | 85.0 | 1 | 14.3 |
| 1 | 2 | f | 15.5 | 62.3 | 105.0 | 1 | 15.5 |
| 2 | 3 | f | 15.3 | 63.3 | 108.0 | 1 | 15.3 |
| 3 | 4 | f | 16.1 | 59.0 | 92.0 | 1 | 16.1 |
| 4 | 5 | f | 19.1 | 62.5 | 112.5 | 1 | 19.1 |

### Regression Analysis

Next, we fit a regression model representing the relationship between height and gender, age, and the interaction variable created in the step above. We use a boolean filter to restrict the analysis to those who are less than or equal to 19 years old. The model summary reports a 95% confidence interval for each of the parameters in the model. The model that we are fitting is *height = b0 + b1 × female + b2 × age + b3 × fem_age + e*

```
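# Predictors, restricted to participants aged 19 or younger
X = htwt[['female', 'AGE', 'fem_age']][htwt['AGE'] <= 19]
# Add the intercept column (const)
X = sm.add_constant(X)
# Response variable, with the same age restriction
Y = htwt['HEIGHT'][htwt['AGE'] <= 19]
# Fit the model by ordinary least squares
model = sm.OLS(Y, X).fit()
model.summary()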
```

| | | | |
|---|---|---|---|
| Dep. Variable: | HEIGHT | R-squared: | 0.460 |
| Model: | OLS | Adj. R-squared: | 0.452 |
| Method: | Least Squares | F-statistic: | 60.93 |
| Date: | Thu, 10 Oct 2024 | Prob (F-statistic): | 1.50e-28 |
| Time: | 16:52:29 | Log-Likelihood: | -534.17 |
| No. Observations: | 219 | AIC: | 1076. |
| Df Residuals: | 215 | BIC: | 1090. |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 28.8828 | 2.873 | 10.052 | 0.000 | 23.219 | 34.547 |
| female | 13.6123 | 4.019 | 3.387 | 0.001 | 5.690 | 21.534 |
| AGE | 2.0313 | 0.178 | 11.435 | 0.000 | 1.681 | 2.381 |
| fem_age | -0.9294 | 0.248 | -3.750 | 0.000 | -1.418 | -0.441 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.300 | Durbin-Watson: | 2.284 |
| Prob(Omnibus): | 0.522 | Jarque-Bera (JB): | 0.981 |
| Skew: | -0.133 | Prob(JB): | 0.612 |
| Kurtosis: | 3.191 | Cond. No. | 450. |

Notes:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

From the coefficients table, b0, b1, b2, and b3 are estimated as b0 = 28.88, b1 = 13.61, b2 = 2.03, and b3 = -0.93.

The resulting regression model for height, age, and gender based on the available data is *height = 28.8828 + 13.6123 × female + 2.0313 × age - 0.9294 × fem_age*
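Because fem_age is age × female, the fitted equation reduces to a separate line for each sex: for males (female = 0) the intercept is b0 and the slope is b2, while for females (female = 1) the intercept is b0 + b1 and the slope is b2 + b3. The sketch below works through that arithmetic using the coefficients from the summary table. The function name `predict_height` and the example age of 15 are illustrative assumptions, not part of the original analysis.

```python
# Coefficients copied from the regression summary table.
B0, B1, B2, B3 = 28.8828, 13.6123, 2.0313, -0.9294


def predict_height(age, female):
    """Predicted height from the fitted equation; female is 1 or 0.

    Hypothetical helper for illustration only.
    """
    fem_age = age * female  # interaction term, as defined in the data step
    return B0 + B1 * female + B2 * age + B3 * fem_age


# Sex-specific regression lines implied by the interaction model.
male_intercept, male_slope = B0, B2
female_intercept, female_slope = B0 + B1, B2 + B3

print(f"males:   height = {male_intercept:.4f} + {male_slope:.4f} x age")
print(f"females: height = {female_intercept:.4f} + {female_slope:.4f} x age")

# Example predictions at an illustrative age of 15.
print(round(predict_height(15, 0), 2))  # male
print(round(predict_height(15, 1), 2))  # female
```

The negative interaction coefficient means the estimated increase in height per year of age is smaller for females (about 1.10) than for males (about 2.03) within the fitted age range.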