=lung;
proc logistic datasex (ref="1") ph_ecog (ref="0")/param=glm;
class wt_grp(event="weight_gain") = age sex ph_ecog meal_cal;
model run;
Logistic Regression in SAS
For a brief description of what is logistic regression see here.
Modelling the lung cancer data
In SAS, we can use proc logistic or proc genmod to perform a logistic regression.
To demonstrate the use of logistic regression we examine the same lung dataset as used in the R example here.
Some Common mistakes
Handling of missing data. Check SAS output that the number of missing values is as you expect. Make sure you have changed any
NA
results in the raw data to be missing, since SAS would considerNA
as a valid category (a non-missing character result).Make sure you consider continuous or categorical variables as you intended. Just because a variable is character or numeric in the dataset, doesn’t mean SAS will treat it that way in the model. You have to use Class row to tell SAS which variables should be treated as character factors (also using
ref=' '
to select the reference category).Be careful you are modelling the correct event (response vs non-response, or weight_gain vs weight_loss for example)
Be careful when interpreting any odds ratios that you have the factor of interest the correct way around (0 vs 1, or 1 vs 0)
If using proc logistic, be careful of how SAS creates its parameters used in the model as this determines how you can use the parameter estimates! It is often easiest to use
param=glm
so that the exp(maximum liklihood parameter estimate) = odds ratio. Check the class level information (Design variables) is as you would expect. See below for more detail on other parameterization methods.
Option 1: Using Proc Logistic with sex and ph_ecog as character factors: sex (2 levels: 1,2) and ph_ecog (4 levels:0,1,2 or 3), meal_cal as continuous. As shown below, the class
statement and ref=" "
option can be used to set the direction of the comparison, /param=glm
ensures exp(estimates)=odds ratio.
Option 2: Using proc Genmod with sex and ph_ecog as character factors: sex (2 levels: 1,2) and ph_ecog (4 levels:0,1,2 or 3), meal_cal as continuous. This gets the same results as above. You can use exp(maximum likelihood parameter estimate) to obtain the odds ratios and 95% CIs for the odds ratios. The below already uses GLM parameterization.
=lung;
proc genmod datasex (ref="1") ph_ecog (ref="0");
class wt_grp (event="weight_gain") = age sex ph_ecog meal_cal / dist=bin link=logit;
model run;
Option 3: Using proc logistic, however this time all covariates are treated as continuous parameters not categorical. You do need to ensure age
sex
ph_ecog
and meal_cal
are numeric SAS variables else you will get an error message suggesting to add the varibles to a class statement.
=lung;
proc logistic datawt_grp(event="weight_gain") = age sex ph_ecog meal_cal;
model run;
Comparison with R model Option 3 is used with the following 3 models, in order to match the same analysis performed in R. This is modelling Sex and ECOG as continuous parameters. We analyze the weight gain in lung cancer patients in dependency of age, sex, ECOG performance score and calories consumed at meals. Weight is categorized into a binary variable for Loss or Gain. The probability modelled is weight_gain.
We fit 3 models. In
Model 1: Weight Loss (1/0) = Age + Sex + ECOG + Calories
Model 2: Weight Loss (1/0) = Sex + ECOG + Calories
Model 3: Using backwards selection to compare models with/without age
"Model 1";
title1 =lung;
proc logistic datawt_grp (event="weight_gain") = age sex ph_ecog meal_cal;
model
run;
Response Profile
Ordered value wt_grp Total Frequency-----------------------------------------------------------------------------
1 weight_gain 122
2 weight_loss 48
-----------------------------------------------------------------------------
1 Results
Model
Analysis of Maximum Likelihood Estimates
-Square Pr>ChiSq
Parameter DF Estimate Standard Error Wald Chi-----------------------------------------------------------------------------
1 3.2632 1.6488 3.9168 0.0478
Intercept 1 -0.0102 0.0208 0.2389 0.6250
Age 1 -0.8717 0.3714 5.5090 0.0189
Sex 1 0.4180 0.2589 2.6069 0.1064
ph_ecog 1 -0.00089 0.000447 3.9417 0.0471
meal_cal ----------------------------------------------------------------------------
Model Fit Statistics
Criterion Intercept Only Intercept and Covariates --------------------------------------------------------
204.355 201.505
AIC 207.491 217.184
SC -2 Log L 202.355 191.505
--------------------------------------------------------
"Model 2";
title1 =lung2;
proc logistic datawt_grp (event="weight_gain") = sex ph_ecog meal_cal;
model
run;
2 Results
Model
Analysis of Maximum Likelihood Estimates
-Square Pr>ChiSq
Parameter DF Estimate Standard Error Wald Chi-----------------------------------------------------------------------------
1 2.5607 0.7977 10.3047 0.0013
Intercept 1 -0.8359 0.3637 5.2815 0.0216
Sex 1 0.3794 0.2469 2.3616 0.1244
ph_ecog 1 -0.00083 0.000435 3.6770 0.0552
meal_cal ----------------------------------------------------------------------------
Model Fit Statistics
Criterion Intercept Only Intercept and Covariates --------------------------------------------------------
204.355 201.505
AIC 207.491 217.184
SC -2 Log L 202.355 191.505
--------------------------------------------------------
"Using selection=backward to get model comparison stats";
title1 =lung2;
proc logistic datawt_grp (event="weight_gain") = age sex ph_ecog meal_cal / selection = backward stop=3;
model
run;
1: Effect age is removed
Step
Summary of Backward Elimination-Square Pr>ChiSq
Step Effect Removed DF Number In Wald Chi-----------------------------------------------------------------------------
1 Age 1 3 0.2389 0.6250
-----------------------------------------------------------------------------
Note: The number of effects in the model has reached STOP=3, (when 3 variables remain in the model, it will not proceed to remove any more even if the ones left are not significant)
3 Results
Model
Analysis of Maximum Likelihood Estimates
-Square Pr>ChiSq
Parameter DF Estimate Standard Error Wald Chi-----------------------------------------------------------------------------
1 2.5607 0.7977 10.3047 0.0013
Intercept 1 -0.8359 0.3637 5.2815 0.0216
Sex 1 0.3794 0.2469 2.3616 0.1244
ph_ecog 1 -0.00083 0.000435 3.6770 0.0552
meal_cal ----------------------------------------------------------------------------
NOTE: the chi-square test summary of backward elimination, p=0.6250 is different to the results in R, which gave a difference in deviance of -0.24046, p=0.6239.
This is because R was doing a difference in -2 * Log Likelihood test comparing the model with sex + ph_ecog + meal_cal + age vs sex + ph_ecog + meal_cal, based on \(\chi^2\) distribution with 1 df.
However, the backward elimination process in SAS uses the residual sums of squares and the F statistic. Starting with the full model, it removes the parameter with the least significant F statistic until all effects in the model have F statistics significant as a certain level. The F statistic is calculated as:
\[F=\frac{(RSS_{p-k}-RSS_p)/k}{RSS_p /(n-p-k)}\] where RSS = Residual sums of squares, n=number of observations in the analysis, p=number of parameters in fuller model (exc. intercept), k=number of degrees of freedom associated with the effect you are dropping, \[RSS_p\] =RSS for the fuller model, \[RSS_{p-k}\] = RSS for the reduced model.
Parameterization of model effects (categorical covariates) in SAS
The most common problem when fitting logistic regression in SAS, is getting SAS to model the binary variable (events) and any categorical covariates correctly.
There are two procedures which can be used to perform logistic regression in SAS, proc logistic
, and proc genmod
(using dist=bin and link=logit options). They obtain the same results when parameterized correctly. However proc logistic
has three ways to parameterize categorical variables, and failure to parameterize correctly can lead to incorrect interpretation of results. Parameterization is what SAS does when you include a categorical variable in your model. It codes that variable using new parameters and it fits those into your model.
The default parameterization is /PARAM=EFFECT
- This is shown as Example A in the design matrix example figure below.
For the example of a categorical variable trt - which has 3 treatment levels (Active 1, Active 2, Placebo). With the EFFECT option, the reference level (Placebo) is given values of “-1” for both of the x1 and x2 parameters. The design matrix consists of 1 less parameter than the number of levels of your variable (so that for 3 treatment groups you have x1 and x2 - 2 parameters in the model). Which reference category is selected is specified either by using the ref='X'
option, or the default which would be the last value of the variable. In our example (as we have not specified ref='X'
this is Placebo (as this is last in the alphabet after Active 1 and Active 2).
=dataset;
proc logistic data/PARAM=EFFECT;
CLASS TRTP resp (event="Y") = TRTP ;
MODEL
run;
or
=dataset;
proc logistic data
CLASS TRTP; resp (event="Y") = TRTP ;
MODEL run;
Alternatively, you change the parameterization in SAS using the following options:
PARAM=REF
- This is shown as Example B below - reference parameter values “0”, n-1 variables
=dataset;
proc logistic data/PARAM=REF;
CLASS TRTP resp (event="Y") = TRTP ;
MODEL run;
PARAM=GLM
- This is shown as Example C below - no reference parameter as such (output compares against last value of treatment group), n variables.
=dataset;
proc logistic data/PARAM=GLM;
CLASS TRTP resp (event="Y") = TRTP ;
MODEL run;
Design Matrix Examples
Depending on which parameterization you use, you have to interpret the estimates you obtains from the model carefully. It is also very important to check the reference assigned is the correct one.
This is also particularly important if you are using any contrast, estimate or lsmestimate statements since the coefficients used for the contrasts must be in line with the parameterization.
Example of deriving contrast statements in SAS
The table below shows the how the parameterizations using EFFECT, REF or GLM, translate to the model being fitted.
Using General model: Y= α + β1x1 + β2x1 {+β3x3}, the treatments are represented as follows:
Effect(A) REF (B) GLM (C)
Treatment --------------------------------------------------------
= α + β1 Y = α + β1 Y = α + β1
Active A Y = α + β2 Y = α + β2 Y = α + β2
Active B Y = α - β1 - β2 Y = α Y = α + β3
Placebo Y --------------------------------------------------------
Now let’s suppose we wanted to compare the average of the Active Treatment groups versus the Placebo treatment group. (A+B)/2 compared to Placebo.
EFFECT (A) parameterization
+ β1 + α + β2 ) /2 ) - (α - β1 - β2)
((α = α + 0.5 β1 + 0.5 β2 - α + β1 + β2
= 1.5 β1 + 1.5 β2
Therefore, to apply this contrast in SAS we would use:
=dataset;
PROC LOGISTIC data-default PARAM=EFFECT option is used so last trt is reference
CLASS trt ; resp (event="Y") = trt ;
MODEL "Active (A & B) vs. Placebo" trt 1.5 1.5 / e;
CONTRAST RUN;
GLM (C) parameterization
The GLM parameterization can be more intuitive.
+ β1 + α + β2)/2 - (α + β3)
((α = α + 0.5 β1 + 0.5 β2 - α - β3
= 0.5 β1 + 0.5 β2 - β3
Therefore, to apply this contrast in SAS we would use:
=dataset;
PROC LOGISTIC data/ param=glm;
CLASS trt resp (event="Y") = trt ;
MODEL "Active (2 & 3) vs. Placebo (1)" trt 0.5 0.5 -1 / e;
CONTRAST RUN;
For this reason, it is common to either use the default PARAM=Effect with ref=“X” option. or to use the param=GLM with no ref=“X” option. It is very important to check the output design matrix so you know what parameterization and reference groups are being used.
Ensuring you are modelling the correct Binary event in SAS
With logistic regression, we often want to model the number of “Successes”. However, by default, SAS sorts alphabetically/numerically and selects the first occurring EVENT alphabetically as the one it’s going to model.
It’s a common mistake, and we find SAS modelling the number of failures instead of successes. Very common when your response is: ‘Responder’ vs ‘Non-responder’, SAS will model the Non-responders as ‘N’ is alphabetically first before ‘R’!
For this reason, It is recommended to always use the event=“Y” option.
Options such as ORDER=DATA|FORMATTED|FREQ|INTERNAL
as well as descending can be used to ensure the correct levels of classification variables are being modelled. More detail here