Linear Regression
To demonstrate the use of linear regression we examine a dataset that illustrates the relationship between Height and Weight in a group of 237 teen-aged boys and girls. The dataset is available at (../data/htwt.csv) and is imported to sas using proc import procedure.
Descriptive Statistics
The first step is to obtain the simple descriptive statistics for the numeric variables of htwt data, and one-way frequencies for categorical variables. This is accomplished by employing proc means and proc freq procedures There are 237 participants who are from 13.9 to 25 years old. It is a cross-sectional study, with each participant having one observation. We can use this data set to examine the relationship of participants’ height to their age and sex.
#| eval: false
proc means data=htwt;
run;
Descriptive Statistics for HTWT Data Set
The MEANS Procedure
Variable Label N Mean Std Dev Minimum Maximum
-----------------------------------------------------------------------------
AGE AGE 237 16.4430380 1.8425767 13.9000000 25.0000000
HEIGHT HEIGHT 237 61.3645570 3.9454019 50.5000000 72.0000000
WEIGHT WEIGHT 237 101.3080169 19.4406980 50.5000000 171.5000000 ----------------------------------------------------------------------------
#| eval: false
proc freq data=htwt;
tables sex;
run;
Oneway Frequency Tabulation for Sex for HTWT Data Set
The FREQ Procedure
Cumulative Cumulative
SEX Frequency Percent Frequency Percent
-------------------------------------------------------------
f 111 46.84 111 46.84 m 126 53.16 237 100.00
In order to create a regression model to demonstrate the relationship between age and height for females, we first need to create a flag variable identifying females and an interaction variable between age and female gender flag.
#| eval: false
data htwt2;
set htwt;
if sex="f" then female=1;
if sex="m" then female=0;
*model to demonstrate interaction between female gender and age;
fem_age = female * age;
run;
Regression Analysis
Next, we fit a regression model, representing the relationships between gender, age, height and the interaction variable created in the datastep above. We again use a where statement to restrict the analysis to those who are less than or equal to 19 years old. We use the clb option to get a 95% confidence interval for each of the parameters in the model. The model that we are fitting is height = b0 + b1 x female + b2 x age + b3 x fem_age + e
#| eval: false
proc reg data=htwt2;
where age <=19;
model height = female age fem_age / clb;
run;
quit;
Number of Observations Read 219
Number of Observations Used 219
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 3 1432.63813 477.54604 60.93 <.0001
Error 215 1684.95730 7.83701
Corrected Total 218 3117.59543
Root MSE 2.79947 R-Square 0.4595
Dependent Mean 61.00457 Adj R-Sq 0.4520 Coeff Var 4.58895
We examine the parameter estimates in the output below.
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t| 95% Confidence Limits
Intercept 1 28.88281 2.87343 10.05 <.0001 23.21911 34.54650
female 1 13.61231 4.01916 3.39 0.0008 5.69031 21.53432
AGE 1 2.03130 0.17764 11.44 <.0001 1.68117 2.38144 fem_age 1 -0.92943 0.24782 -3.75 0.0002 -1.41791 -0.44096
From the parameter estimates table the coefficients b0,b1,b2,b3 are estimated as b0=28.88 b1=13.61 b2=2.03 b3=-0.92942
The resulting regression model for height, age and gender based on the available data is height=28.88281 + 13.61231 x female + 2.03130 x age -0.92943 x fem_age