Logistic Regression in SAS
For a brief description of what is logistic regression see here.
Modelling the lung cancer data
In SAS, we can use proc logistic or proc genmod to perform a logistic regression.
To demonstrate the use of logistic regression we examine the same lung dataset as used in the R example here.
Summary of Common Mistakes in SAS
Handling of missing data. Check SAS output that the number of missing values is as you expect. Make sure you have changed any
NA
results in the raw data to be missing, since SAS would considerNA
as a valid category (a non-missing character result).Make sure you consider continuous or categorical variables as you intended. Just because a variable is character or numeric in the dataset, doesn’t mean SAS will treat it that way in the model. You have to use Class row to tell SAS which variables should be treated as character factors. You also have to use
ref=' '
to tell SAS which is the reference category, otherwise SAS by default which use the last value of the variable alphabetically (e..g a categorical variable with 1, 2, 3 would default to 3 as the reference).Be careful you are modelling the correct event (response vs non-response, or weight_gain vs weight_loss for example)
Be careful when interpreting any odds ratios that you have the factor of interest the correct way around (0 vs 1, or 1 vs 0)
If using proc logistic, be careful of how SAS creates its parameters used in the model as this determines how you can use the parameter estimates! It is often easiest to use
param=glm
so that the exp(maximum likelihood parameter estimate) = odds ratio. Check the class level information (Design variables) is as you would expect. See below for more detail oneffect
andref
parameterization and here for more options such as polynomial coding.By default, SAS includes an intercept in the model. The intercept represents the baseline log odds of the outcome when all predictor variables are set to zero. SAS outputs a p-value testing if the baseline log odds is significantly different to zero, however this is not generally of interest because the purpose of our modelling is to find out which parameters have a significant effect on the probability of an event occurring. This baseline log odds is simply shifting the linear expression up or down so that the variable components are most accurate. The baseline log odds, may be not interpretable if it’s not possible for some variables to take a value of zero (e.g. age=0 is not yet born!). Therefore, we generally ignore the intercept and instead calculate odds ratios for parameters of interest. How your model is parameterised (param=glm, param=effect, param=ref, can also affect the estimate, so intercept estimates may not align when using different parameterization.
Modelling using Proc Genmod
Proc Genmod is a procedure that allows the fitting of Generalized Linear Models. By using the options dist=bin
and link=logit,
it fits a logistic regression as shown below. For more information see the SAS help here.
Always check that the Class Level Information matches what you expect (SAS puts the reference class level last). Also check that you are modelling the correct ‘event’ and that the algorithm has converged.
Below we are fitting trt01pn and sex as categorical variables, age, ph_ecog2 and meal_caln as continuous variables.
You can use exponential of the maximum likelihood parameter estimate and the exponential of the Wald 95% Confidence Limits to obtain the odds ratios and 95% CIs. Proc Genmod uses GLM parameterization.
#| eval: false
Example data: . = missing, trt01pn (1 or 2), sex (1 or 2), ph_ecog2 (0,1,2,3)
wt_gain (1=gain, 0=no gain)
wt_gain trt01pn age sex ph_ecog2 meal_caln
-----------------------------------------------------------------------------
. 1 74 1 1 1175
0 1 68 2 0 1225 1 2 60 1 2 .
#| eval: false
proc genmod data=lung;
class trt01pn (ref="1") sex (ref="1");
model wt_gain (event="1") = trt01pn age sex ph_ecog2 meal_caln /
dist=bin link=logit;
run;
Class Level Information
Class Levels Values
-----------------------------------------------------------------------------
trt01pn 2 2 1
sex 2 2 1
-----------------------------------------------------------------------------
Response Profile
Ordered value wt_gain Total Frequency
-----------------------------------------------------------------------------
1 1 48
2 0 122
-----------------------------------------------------------------------------
PROC GENMOD is modeling the probability that wt_gain='1'.
Algorithm Converged.
Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard Wald 95% Wald Pr>ChiSq
Error CIs Chi-Square
-----------------------------------------------------------------------------
Intercept 1 -2.6415 1.5140 -5.6090 0.3259 3.04 0.0810
trt01pn 2 1 0.3888 0.3782 -0.3524 1.1299 1.03 0.3039
trt01pn 1 0 0.0000 0.0000 0.0000 0.0000 . .
age 1 0.0123 0.0212 -0.0292 0.0537 0.34 0.5624
sex 2 1 0.8321 0.3744 0.0983 1.5659 4.97 0.0262
sex 1 0 0.0000 0.0000 0.0000 0.0000 . .
ph_ecog 1 -0.3764 0.2638 -0.8935 0.1407 2.03 0.1537
meal_cal 1 -0.0008 0.0004 -0.0000 0.0017 3.59 0.0581
scale 0 1.0000 0.0000 1.0000 1.0000
---------------------------------------------------------------------------- Note: The scale parameter was held fixed
Modelling using Proc Logistic
The same model above can also be modelled using Proc Logistic. You no longer have to specify the distribution and link function, however you do need to add an option / param=glm
on the class row. Different parameterizations are discussed later in the context of forming treatment contrast statements.
For now, all you need to know is that using /param=glm
ensures exp(estimates)=odds ratio. You will also note that in the class level information
, SAS now tells you the design variables
. This will also be important later when we learn more about parameterization.
Proc Logistic is often preferred above Proc Genmod as it outputs the Odds Ratios and 95% CIs for you, without you having to back transform them using exponential of the MLEs yourself.
NOTE: that the 95% confidence limits are being calculated using the Wald method. This assumes symmetric intervals around the maximum likelihood estimate using a normal distribution assumption (MLE +/-1.96* SE). Alternative confidence interval estimation methods exist such as the profile likelihood method but SAS does not calculate these.
#| eval: false
proc logistic data=lung;
class trt01pn (ref="1") sex (ref="1") /param=glm;
model wt_gain(event="1") = trt01pn age sex ph_ecog2 meal_caln;
run;
Response Profile
Ordered value wt_gain Total Frequency
-----------------------------------------------------------------------------
1 0 122
2 1 48
-----------------------------------------------------------------------------
Probability modeled is wt_gain=1
Note: 58 observations were deleted due to missing values fro the repsonse or
explanatory variables.
Class Level Information
Class Levels Design Variables
-----------------------------------------------------------------------------
trt01pn 2 1 0
1 0 1
sex 2 1 0
1 0 1
-----------------------------------------------------------------------------
Convergence criterion (GCONV=1E-8) satisfied.
Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard Wald Pr>ChiSq
Error Chi-Square
-----------------------------------------------------------------------------
Intercept 1 -2.6415 1.5140 3.0440 0.0810
trt01pn 2 1 0.3888 0.3782 1.0569 0.3039
trt01pn 1 0 0.0000 . . .
age 1 0.0123 0.0212 0.3356 0.5624
sex 2 1 0.8321 0.3744 4.9400 0.0262
sex 1 0 0.0000 . . .
ph_ecog 1 -0.3764 0.2638 2.0349 0.1537
meal_cal 1 -0.000850 0.000449 3.5895 0.0581
----------------------------------------------------------------------------
Odds Ratio Estimates
Effect Point Estimate 95% Wald Confidence Limits
-----------------------------------------------------------------------------
trt01pn 2 vs 1 1.475 0.703 3.095
age 1.012 0.971 1.055
sex 2 vs 1 2.298 1.103 4.787
ph_ecog 0.686 0.409 1.151
meal_cal 1.001 1.000 1.002 ----------------------------------------------------------------------------
Model Comparison
To compare two logistic models, the -2 * Log Likelihood from each model can be compared against a \(\chi^2\)-distribution with degrees of freedom calculated using the difference in the two models’ parameters.
#| eval: false
Model 1: model wt_gain(event="1") = trt01pn age sex ph_ecog2 meal_caln;
Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
--------------------------------------------------------
AIC 204.355 202.460
SC 207.491 221.274
-2 Log L 202.355 190.460
--------------------------------------------------------
Model 2: model wt_gain(event="1") = trt01pn sex ph_ecog2 meal_caln;
Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
--------------------------------------------------------
AIC 204.355 200.798
SC 207.491 216.477
-2 Log L 202.355 190.798
--------------------------------------------------------
190.460 - 190.798 = -0.338 which using a $\chi^2$-distribution corresponds to p=0.5606
SAS also allows us to fit forward or backwards stepwise selection. Below we specify to stop when we have 4 variables left in the model. This is not commonly done in practice but is included to highlight the difference in using a selection procedure compared to doing the difference betweeen the -2 * log likelihood models using a \(\chi^2\)-distribution.
#| eval: false
proc logistic data=lung;
class trt01pn (ref="1") sex (ref="1") /param=glm;
model wt_gain(event="1") = trt01pn age sex ph_ecog2 meal_caln/
selection=backward stop=4;
run;
Step 1: Effect age is removed
Summary of Backward Elimination
Step Effect Removed DF Number In Wald Chi-Square Pr>ChiSq
-----------------------------------------------------------------------------
1 Age 1 4 0.3356 0.5624 -----------------------------------------------------------------------------
NOTE: the chi-square test summary of backward elimination, p=0.5624 is slightly different to using the -2 * log likelihood models using a \(\chi^2\)-distribution p=0.5606.
This is because the backward elimination process in SAS uses the residual sums of squares and the F statistic. Starting with the full model, it removes the parameter with the least significant F statistic until all effects in the model have F statistics significant as a certain level. The F statistic is calculated as:
\[F=\frac{(RSS_{p-k}-RSS_p)/k}{RSS_p /(n-p-k)}\] where RSS = Residual sums of squares, n=number of observations in the analysis, p=number of parameters in fuller model (exc. intercept), k=number of degrees of freedom associated with the effect you are dropping, \[RSS_p\] =RSS for the fuller model, \[RSS_{p-k}\] = RSS for the reduced model.
Parameterization of model effects (categorical covariates) in SAS
The most common problem when fitting logistic regression in SAS, is getting SAS to model the binary variable (events) and any categorical covariates correctly. Using proc genmod
(using dist=bin and link=logit options), there is no issue as SAS defaults to using GLM parameterization. However using proc logistic
there are three ways to parameterize categorical variables, and the default is /PARAM=EFFECT
which can cause confusion when interpreting your model.
To demonstrate, we will now model a categorical variable called Dose, which has 3 treatment levels (1=10mg Active, 2=20 mg Active, 3=Placebo). The reference is now dose=3. You must pay close attention to the table of Class level information in order to understand how SAS is modelling your data.
CLASS X Y Z /PARAM=Effect
This is the SAS default such that if you do not specify the /param
option, SAS defaults to using this method.
With the EFFECT option, dose_id has 3 levels, and so needs 2 design variables (β1 and β2). Sex has 2 levels so uses just 1 design variable (β1). For dose_id, the reference level (Placebo) is given values of “-1” for both of the β1 and β2 parameters. General model: Y= α + β1x1 + β2x1 {+β3x3} etc, representing each parameter in the model.
#| eval: false
proc logistic data=lung;
class dose_id (ref="3") sex (ref="1") /param=effect;
model wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
run;
Class Level Information
Class Value Design Variables
--------------------------------------------------------
dose_β1 dose_β2
dose_id 1 1 0
2 0 1
3 -1 -1
sex_β3
SEX 1 -1
2 1 --------------------------------------------------------
If we want to estimate the effect of treatment (ignoring the other covariates), the Class Level Information can be translated into the table below, which is then used to form contrast statements.
α is the intercept, and β1 and β2 (including the sign +/-) are from the design variables above.
Dose_id Effect
------------------------------
10mg Active Y = α + β1
20mg Active Y = α + β2
Placebo Y = α - β1 - β2 ------------------------------
To compare 10mg Active vs Placebo, we would do the following:
(α + β1 ) - (α - β1 - β2)
= 2 β1 + β2 = 2 β1 + 1 β2
This equates to a contrast statement as follows:
#| eval: false
proc logistic data=lung;
class dose_id (ref="3") sex (ref="1") /param=effect;
model wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
CONTRAST "10mg vs. Placebo" dose_id 2 1 / e;
run;
To compare the average of (10mg Active and 20mg Active) vs Placebo, we would do the following:
((α + β1 + α + β2 ) /2 ) - (α - β1 - β2)
= α + 0.5 β1 + 0.5 β2 - α + β1 + β2 = 1.5 β1 + 1.5 β2
This equates to a contrast statement as follows:
#| eval: false
proc logistic data=lung;
class dose_id (ref="3") sex (ref="1") /param=effect;
model wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
CONTRAST "Active (10mg + 20mg) vs. Placebo" dose_id 1.5 1.5 / e;
run;
As you can see, these contrasts are not very intuitive and hence it is not reccomended to use the default SAS option of /param=effect, since its easy to end up with the wrong contrasts.
Contract Test Results
Contrast DF Wald Chi-Square Pr>ChiSq
-------------------------------------------------------- Active (10mg +20mg) vs Placebo 1 1.1610 0.2813
CLASS X Y Z /PARAM=glm
Now let’s look at the param=glm
option. GLM parameterization has a design variable for each level of a parameter. Hence for dose with 3 levels, we have 3 design variables (β1, β2 and β3).
#| eval: false
proc logistic data=lung;
class dose_id (ref="3") sex (ref="1") /param=glm;
model wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
run;
Class Level Information
Class Value Design Variables
--------------------------------------------------------
dose_β1 dose_β2 dose_β3
dose_id 1 1 0 0
2 0 1 0
3 0 0 1
sex_β3 sex_β4
SEX 1 1 0
2 0 1 --------------------------------------------------------
If we want to estimate the effect of treatment (ignoring the other covariates), the Class Level Information can be translated into the table below, which is then used to form contrast statements.
α is the intercept, and β1 and β2 and β3 (including the sign +/-) are from the design variables above.
Dose_id Effect
------------------------------
10mg Active Y = α + β1
20mg Active Y = α + β2
Placebo Y = α - β3 --------------------------------------------------------
To compare 10mg Active vs Placebo, we would do the following:
(α + β1 ) - (α - β3)
= β1 - β3 = 1 β1 - 1 β3
This equates to a contrast statement as follows:
#| eval: false
proc logistic data=lung;
class dose_id (ref="3") sex (ref="1") /param=glm;
model wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
CONTRAST "10mg vs. Placebo" trt 1 -1 / e;
run;
As you can see, this contrast is much more intuitive. If you want to compare the effect of Active (10mg) compared to placebo, you take the effect of 10mg and subtract the effect of placebo !
To compare the average of (10mg Active and 20mg Active) vs Placebo, we would do the following:
(α + β1 + α + β2)/2 - (α + β3)
= α + 0.5 β1 + 0.5 β2 - α - β3 = 0.5 β1 + 0.5 β2 - β3
This equates to a contrast statement as follows:
#| eval: false
proc logistic data=lung;
class dose_id (ref="3") sex (ref="1") /param=glm;
model wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
CONTRAST "Active (10mg + 20mg) vs. Placebo (1)" trt 0.5 0.5 -1 / e;
run;
As you can see, this contrast is much more intuitive. If you want to compare the average of Active (10mg + 20mg) compared to placebo, you take half the effect of 10mg plus half the effect of 20mg and substract the effect of placebo!
Contract Test Results
Contrast DF Wald Chi-Square Pr>ChiSq
-------------------------------------------------------- Active (10mg +20mg) vs Placebo 1 1.1610 0.2813
CLASS X Y Z /PARAM=Ref
Now let’s look at the param=ref
option. Similar to param=effect, ref parameterization uses 1 less design variable compared to the number of levels each parameter has, but the parameterization is different. For dose with 3 levels, we have 2 design variables (β1 and β2).
#| eval: false
proc logistic data=lung;
class dose_id (ref="3") sex (ref="1") /param=ref;
model wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
run;
Class Level Information
Class Value Design Variables
--------------------------------------------------------
dose_β1 dose_β2
dose_id 1 1 0
2 0 1
3 0 0
sex_β3
SEX 1 0
2 1 --------------------------------------------------------
If we want to estimate the effect of treatment (ignoring the other covariates), the Class Level Information can be translated into the table below, which is then used to form contrast statements.
α is the intercept, and β1 and β2 and β3 (including the sign +/-) are from the design variables above.
Dose_id Effect
------------------------------
10mg Active Y = α + β1
20mg Active Y = α + β2
Placebo Y = α --------------------------------------------------------
To compare 10mg Active vs Placebo, we would do the following:
(α + β1 ) - (α ) = β1
This equates to a contrast statement as follows:
#| eval: false
proc logistic data=lung;
class dose_id (ref="3") sex (ref="1") /param=glm;
model wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
CONTRAST "10mg vs. Placebo" trt 1 / e;
run;
To compare the average of (10mg Active and 20mg Active) vs Placebo, we would do the following:
(α + β1 + α + β2)/2 - α
= α + 0.5 β1 + 0.5 β2 - α = 0.5 β1 + 0.5 β2
This equates to a contrast statement as follows:
#| eval: false
proc logistic data=lung;
class dose_id (ref="3") sex (ref="1") /param=glm;
model wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
CONTRAST "Active (10mg + 20mg) vs. Placebo (1)" trt 0.5 0.5 / e;
run;
Again this is less intuitive than the param=glm parameterization, but the same results are obtained.
Contract Test Results
Contrast DF Wald Chi-Square Pr>ChiSq
-------------------------------------------------------- Active (10mg +20mg) vs Placebo 1 1.1610 0.2813
Contrast statements for 2 or more treatments
The Contrast statement, only outputs the p-value for the contrast, but it is common to also require an estimate of the difference between the treatments, with associated 95% CI. You can do this by changing contrast
to an estimate
statement. Note that the parameterization of the contrast remains the same as when using a contrast statement as shown below. These estimates and 95% CI’s can be back transformed to give you the Odds ratio of the contrast and associated 95% CI. The estimate coefficients table should be checked for accuracy versus the contrast you are trying to do.
#| eval: false
proc logistic data=lung;
class dose_id (ref="3") sex (ref="1") /param=glm;
model wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
Estimate "Active (10mg + 20mg) vs. Placebo (1)" dose_id 0.5 0.5 -1 / e cl;
run;
Estimate Coefficients
Parameter dose_id sex Row1
--------------------------------------------------------
dose_id 1 1 0.5
dose_id 2 2 0.5
dose_id 3 3 -1
--------------------------------------------------------
Estimate
Label estimate Std Err Z value Pr>|z| Alpha Lower Upper
--------------------------------------------------------------------------
Active (10mg +20mg) vs. Placebo (1)
-0.4096 0.3802 -1.08 0.2813 0.05 -1.1547 0.3355
--------------------------------------------------------------------------
The odds ratio for the comparison of Active (10mg +20mg) vs. Placebo is
exp(-0.4096) (95% CI: exp(-1.1547) to exp(0.3355)) = OR: 0.664 (95%CI for OR: 0.315 to 1.399)
Ensuring you are modelling the correct Binary event in SAS
With logistic regression, we often want to model the number of “Successes”. However, by default, SAS sorts alphabetically/numerically and selects the first occurring EVENT alphabetically as the one it’s going to model.
It’s a common mistake, and we find SAS modelling the number of failures instead of successes. Very common when your response is: ‘Responder’ vs ‘Non-responder’, SAS will model the Non-responders as ‘N’ is alphabetically first before ‘R’!
For this reason, It is recommended to always use the event=“Y” option.
Options such as ORDER=DATA|FORMATTED|FREQ|INTERNAL
as well as descending can be used to ensure the correct levels of classification variables are being modelled. More detail here
References
Created using : SAS release:9.04.01M7P08062020”