Linear Regression

Introduction

Simple linear regression is a statistical method used to model the relationship between a continuous dependent variable and a continuous independent variable by fitting a linear equation to the observed data. It estimates how changes in the independent variable affect the dependent variable, allowing for predictions and insights about the underlying relationship. The fit is chosen to minimize the sum of squared differences between the observed values and the values predicted by the model.

The following assumptions must hold when building a linear regression model.

  1. The dependent variable must be continuous.

  2. The data you are modeling meets the “iid” criterion. That means the error terms, ε, are:

    1. independent from one another and
    2. identically distributed.

  3. The error term is normally distributed with a mean of zero.
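
The assumptions on the error term are usually checked on the residuals of a fitted model. The following is a minimal sketch of such checks, assuming synthetic data generated for illustration only (the variables x and y and the object fit are hypothetical and are not part of the htwt analysis).

```r
# Synthetic data for illustration only (not the htwt dataset)
set.seed(1)
x <- runif(200, 14, 19)
y <- 30 + 2 * x + rnorm(200, sd = 3)
fit <- lm(y ~ x)

res <- residuals(fit)

# With an intercept in the model, least-squares residuals average to zero
mean_res <- mean(res)

# Normality: Shapiro-Wilk test on the residuals
# (a large p-value gives no evidence against normality)
sw <- shapiro.test(res)

# Visual checks: residuals against fitted values should show no pattern
# and a roughly constant spread; the Q-Q plot should be close to a line
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(res)
qqline(res)
```

Independence of the errors generally follows from the study design (here, one observation per participant) rather than from a test on the residuals.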

To demonstrate the use of linear regression, we examine a dataset that illustrates the relationship between Height and Weight in a group of 237 teen-aged boys and girls. The dataset is available here and is imported into the workspace.

Descriptive Statistics

The first step is to obtain simple descriptive statistics for the variables in the htwt data. This is accomplished with the summary function, which reports the minimum, quartiles, mean and maximum of each numeric variable. Because SEX is read in as a character column, summary reports only its length, class and mode; one-way frequencies for a categorical variable can be obtained with the table function. The 237 participants range in age from 13.9 to 25 years. It is a cross-sectional study, with each participant having one observation. We can use this data set to examine the relationship of participants’ height to their age and sex.

knitr::opts_chunk$set(echo = TRUE)
# Read the data and summarise every column
htwt <- read.csv("../data/htwt.csv")
summary(htwt)
      ROW          SEX                 AGE            HEIGHT     
 Min.   :  1   Length:237         Min.   :13.90   Min.   :50.50  
 1st Qu.: 60   Class :character   1st Qu.:14.80   1st Qu.:58.80  
 Median :119   Mode  :character   Median :16.30   Median :61.50  
 Mean   :119                      Mean   :16.44   Mean   :61.36  
 3rd Qu.:178                      3rd Qu.:17.80   3rd Qu.:64.30  
 Max.   :237                      Max.   :25.00   Max.   :72.00  
     WEIGHT     
 Min.   : 50.5  
 1st Qu.: 85.0  
 Median :101.0  
 Mean   :101.3  
 3rd Qu.:112.0  
 Max.   :171.5  
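
In the output above, the character column SEX is described only by its length, class and mode. A minimal sketch of how one-way frequencies could be obtained instead, assuming a small toy vector in place of htwt$SEX (the values below are illustrative, not taken from the data):

```r
# Toy vector standing in for htwt$SEX (values are illustrative only)
sex <- c("f", "m", "f", "f", "m")

# summary() on a character vector reports only length, class and mode
summary(sex)

# table() gives the one-way frequencies
freq <- table(sex)
freq

# Converting to a factor makes summary() report the counts as well
summary(factor(sex))
```

On the real data the equivalent call would be table(htwt$SEX).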

To build a regression model in which the relationship between age and height can differ between females and males, we first create a flag variable identifying females and an interaction variable equal to the product of age and the female flag.

htwt$female <- ifelse(htwt$SEX=='f',1,0)
htwt$fem_age <- htwt$AGE * htwt$female
head(htwt)
  ROW SEX  AGE HEIGHT WEIGHT female fem_age
1   1   f 14.3   56.3   85.0      1    14.3
2   2   f 15.5   62.3  105.0      1    15.5
3   3   f 15.3   63.3  108.0      1    15.3
4   4   f 16.1   59.0   92.0      1    16.1
5   5   f 19.1   62.5  112.5      1    19.1
6   6   f 17.1   62.5  112.0      1    17.1
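
R's formula syntax can also build the interaction term without a hand-made column. The sketch below, which assumes synthetic data generated for illustration only (the data frame d is hypothetical and is not the htwt dataset), suggests that the two approaches give the same estimates.

```r
# Synthetic data for illustration only (not the htwt dataset)
set.seed(42)
n <- 100
d <- data.frame(
  SEX = sample(c("f", "m"), n, replace = TRUE),
  AGE = runif(n, 14, 19)
)
d$female  <- ifelse(d$SEX == "f", 1, 0)
d$fem_age <- d$AGE * d$female
d$HEIGHT  <- 30 + 12 * d$female + 2 * d$AGE - 0.8 * d$fem_age +
  rnorm(n, sd = 2.5)

# Manual interaction column, as in the text
fit_manual <- lm(HEIGHT ~ female + AGE + fem_age, data = d)

# Formula operator: female * AGE expands to female + AGE + female:AGE
fit_formula <- lm(HEIGHT ~ female * AGE, data = d)

# Same estimates; only the name of the interaction term differs
same <- all.equal(unname(coef(fit_manual)), unname(coef(fit_formula)))
same
```

Creating the column by hand, as done here, keeps the interaction variable visible in the data frame, while the formula form avoids storing a derived column.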

Regression Analysis

Next, we fit a regression model representing the relationship between height and gender, age, and the interaction variable created above. We use the subset argument of lm to restrict the analysis to participants who are 19 years old or younger. The summary output reports the estimate, standard error, t value and p-value for each parameter; 95% confidence intervals for the parameters can be obtained by calling confint on the fitted model. The model that we are fitting is \(height = b_0 + b_1\times female + b_2\times age + b_3\times fem\_age + e\), where \(e\) is the error term ε.

regression <- lm(HEIGHT ~ female + AGE + fem_age, data = htwt, subset = AGE <= 19)
summary(regression)

Call:
lm(formula = HEIGHT ~ female + AGE + fem_age, data = htwt, subset = AGE <= 
    19)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.2429 -1.7351  0.0383  1.6518  7.9289 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  28.8828     2.8734  10.052  < 2e-16 ***
female       13.6123     4.0192   3.387 0.000841 ***
AGE           2.0313     0.1776  11.435  < 2e-16 ***
fem_age      -0.9294     0.2478  -3.750 0.000227 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.799 on 215 degrees of freedom
Multiple R-squared:  0.4595,    Adjusted R-squared:  0.452 
F-statistic: 60.93 on 3 and 215 DF,  p-value: < 2.2e-16

# Extract the coefficient estimates, rounded to four decimal places
b0 <- round(coef(regression)[1], 4)
b1 <- round(coef(regression)[2], 4)
b2 <- round(coef(regression)[3], 4)
b3 <- round(coef(regression)[4], 4)

From the coefficients table, the parameters are estimated as b0 = 28.8828, b1 = 13.6123, b2 = 2.0313 and b3 = -0.9294.
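
A 95% confidence interval for each parameter can be reconstructed from the reported coefficients table as estimate ± t × standard error, with the t quantile taken at the 215 residual degrees of freedom. The sketch below does this by hand from the printed values; the object names est, se and ci are introduced here for illustration.

```r
# Estimates and standard errors copied from the coefficients table above
est <- c("(Intercept)" = 28.8828, female = 13.6123,
         AGE = 2.0313, fem_age = -0.9294)
se  <- c(2.8734, 4.0192, 0.1776, 0.2478)

# Two-sided 95% critical value at the 215 residual degrees of freedom
t_crit <- qt(0.975, df = 215)

# Lower and upper limits for each coefficient
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci

# With the fitted object available, confint(regression) returns these
# intervals directly (up to rounding of the printed values)
```

An interval that excludes zero corresponds to a coefficient that is significant at the 5% level in the table.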

The resulting regression model for height, age and gender, based on the available data for participants aged 19 or younger, is \(height = 28.8828 + 13.6123\times female + 2.0313\times age - 0.9294\times fem\_age\)
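
Because female takes only the values 0 and 1, the fitted model can be read as two separate lines, one per sex. The worked substitution below uses only the estimates reported above, with fem_age equal to age for females and to zero for males.

```latex
\begin{aligned}
\text{males } (female = 0):\quad
  height &= 28.8828 + 2.0313 \times age \\
\text{females } (female = 1):\quad
  height &= (28.8828 + 13.6123) + (2.0313 - 0.9294) \times age \\
         &= 42.4951 + 1.1019 \times age
\end{aligned}
```

Read this way, the interaction coefficient b3 is the difference between the female and male slopes: within the fitted age range, the estimated increase in height per year of age is about 2.03 units for males and about 1.10 units for females.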