Linear Regression

Published

25 October, 2024

To demonstrate the use of linear regression we examine a dataset that illustrates the relationship between Height and Weight in a group of 237 teen-aged boys and girls. The dataset is available at (../data/htwt.csv) and is imported to sas using proc import procedure.

Descriptive Statistics

The first step is to obtain the simple descriptive statistics for the numeric variables of htwt data, and one-way frequencies for categorical variables. This is accomplished by employing proc means and proc freq procedures There are 237 participants who are from 13.9 to 25 years old. It is a cross-sectional study, with each participant having one observation. We can use this data set to examine the relationship of participants’ height to their age and sex.

proc means data=htwt;
run;

                    Descriptive Statistics for HTWT Data Set                  
                             The MEANS Procedure

Variable  Label     N          Mean       Std Dev       Minimum       Maximum
-----------------------------------------------------------------------------
AGE       AGE     237    16.4430380     1.8425767    13.9000000    25.0000000
HEIGHT    HEIGHT  237    61.3645570     3.9454019    50.5000000    72.0000000
WEIGHT    WEIGHT  237   101.3080169    19.4406980    50.5000000   171.5000000
----------------------------------------------------------------------------
proc freq data=htwt;
tables sex;
run;

    Oneway Frequency Tabulation for Sex for HTWT Data Set                    
                    The FREQ Procedure

                                      Cumulative    Cumulative
SEX         Frequency     Percent     Frequency      Percent
-------------------------------------------------------------
f           111           46.84           111        46.84
m           126           53.16           237       100.00

In order to create a regression model to demonstrate the relationship between age and height for females, we first need to create a flag variable identifying females and an interaction variable between age and female gender flag.

data htwt2;
  set htwt;
  if sex="f" then female=1;
  if sex="m" then female=0; 
 
  *model to demonstrate interaction between female gender and age;
  fem_age = female * age;  
run;

Regression Analysis

Next, we fit a regression model, representing the relationships between gender, age, height and the interaction variable created in the datastep above. We again use a where statement to restrict the analysis to those who are less than or equal to 19 years old. We use the clb option to get a 95% confidence interval for each of the parameters in the model. The model that we are fitting is height = b0 + b1 x female + b2 x age + b3 x fem_age + e

proc reg data=htwt2;
  where age <=19;
  model height = female age fem_age / clb;
run; quit;

                        Number of Observations Read         219
                        Number of Observations Used         219

                                 Analysis of Variance

                                        Sum of           Mean
    Source                   DF        Squares         Square    F Value    Pr > F
    Model                     3     1432.63813      477.54604      60.93    <.0001
    Error                   215     1684.95730        7.83701
    Corrected Total         218     3117.59543

                 Root MSE              2.79947    R-Square     0.4595
                 Dependent Mean       61.00457    Adj R-Sq     0.4520
                 Coeff Var             4.58895

We examine the parameter estimates in the output below.

                            Parameter Estimates
                            Parameter       Standard
       Variable     DF       Estimate          Error    t Value    Pr > |t|       95% Confidence Limits
       Intercept     1       28.88281        2.87343      10.05      <.0001       23.21911       34.54650
       female        1       13.61231        4.01916       3.39      0.0008        5.69031       21.53432
       AGE           1        2.03130        0.17764      11.44      <.0001        1.68117        2.38144
       fem_age       1       -0.92943        0.24782      -3.75      0.0002       -1.41791       -0.44096

From the parameter estimates table the coefficients b0,b1,b2,b3 are estimated as b0=28.88 b1=13.61 b2=2.03 b3=-0.92942

The resulting regression model for height, age and gender based on the available data is height=28.88281 + 13.61231 x female + 2.03130 x age -0.92943 x fem_age