: . = missing, trt01pn (1 or 2), sex (1 or 2), ph_ecog2 (0,1,2,3)
Example datawt_gain (1=gain, 0=no gain)
wt_gain trt01pn age sex ph_ecog2 meal_caln-----------------------------------------------------------------------------
1 74 1 1 1175
. 0 1 68 2 0 1225
1 2 60 1 2 .
=lung;
proc genmod datatrt01pn (ref="1") sex (ref="1") ;
class wt_gain (event="1") = trt01pn age sex ph_ecog2 meal_caln /
model =bin link=logit;
dist
run;
Class Level Information
Class Levels Values-----------------------------------------------------------------------------
2 2 1
trt01pn 2 2 1
sex -----------------------------------------------------------------------------
Response Profile
Ordered value wt_gain Total Frequency-----------------------------------------------------------------------------
1 1 48
2 0 122
-----------------------------------------------------------------------------
='1'.
PROC GENMOD is modeling the probability that wt_gain
Algorithm Converged.
Analysis of Maximum Likelihood Estimates
95% Wald Pr>ChiSq
Parameter DF Estimate Standard Wald -Square
Error CIs Chi-----------------------------------------------------------------------------
1 -2.6415 1.5140 -5.6090 0.3259 3.04 0.0810
Intercept 2 1 0.3888 0.3782 -0.3524 1.1299 1.03 0.3039
trt01pn 1 0 0.0000 0.0000 0.0000 0.0000 . .
trt01pn 1 0.0123 0.0212 -0.0292 0.0537 0.34 0.5624
age 2 1 0.8321 0.3744 0.0983 1.5659 4.97 0.0262
sex 1 0 0.0000 0.0000 0.0000 0.0000 . .
sex 1 -0.3764 0.2638 -0.8935 0.1407 2.03 0.1537
ph_ecog 1 -0.0008 0.0004 -0.0000 0.0017 3.59 0.0581
meal_cal 0 1.0000 0.0000 1.0000 1.0000
scale ----------------------------------------------------------------------------
: The scale parameter was held fixed Note
Logistic Regression in SAS
For a brief description of what is logistic regression see here.
Modelling the lung cancer data
In SAS, we can use proc logistic or proc genmod to perform a logistic regression.
To demonstrate the use of logistic regression we examine the same lung dataset as used in the R example here.
Summary of Common Mistakes in SAS
Handling of missing data. Check SAS output that the number of missing values is as you expect. Make sure you have changed any
NA
results in the raw data to be missing, since SAS would considerNA
as a valid category (a non-missing character result).Make sure you consider continuous or categorical variables as you intended. Just because a variable is character or numeric in the dataset, doesn’t mean SAS will treat it that way in the model. You have to use Class row to tell SAS which variables should be treated as character factors. You also have to use
ref=' '
to tell SAS which is the reference category, otherwise SAS by default which use the last value of the variable alphabetically (e..g a categorical variable with 1, 2, 3 would default to 3 as the reference).Be careful you are modelling the correct event (response vs non-response, or weight_gain vs weight_loss for example)
Be careful when interpreting any odds ratios that you have the factor of interest the correct way around (0 vs 1, or 1 vs 0)
If using proc logistic, be careful of how SAS creates its parameters used in the model as this determines how you can use the parameter estimates! It is often easiest to use
param=glm
so that the exp(maximum liklihood parameter estimate) = odds ratio. Check the class level information (Design variables) is as you would expect. See below for more detail on other parameterization methods.
Modelling using Proc Genmod
Proc Genmod is a procedure that allows the fitting of Generalized Linear Models. By using the options dist=bin
and link=logit,
it fits a logistic regression as shown below. For more information see the SAS help here.
Always check that the Class Level Information matches what you expect (SAS puts the reference class level last). Also check that you are modelling the correct ‘event’ and that the algorithm has converged.
Below we are fitting trt01pn and sex as categorical variables, age, ph_ecog2 and meal_caln as continuous variables.
You can use exponential of the maximum likelihood parameter estimate and the exponential of the Wald 95% Confidence Limits to obtain the odds ratios and 95% CIs. Proc Genmod uses GLM parameterization.
Modelling using Proc Logistic
The same model above can also be modelled using Proc Logistic. You no longer have to specify the distribution and link function, however you do need to add an option / param=glm
on the class row. Different parameterizations are discussed later in the context of forming treatment contrast statements.
For now, all you need to know is that using /param=glm
ensures exp(estimates)=odds ratio. You will also note that in the class level information
, SAS now tells you the design variables
. This will also be important later when we learn more about parameterization.
Proc Logistic is often preferred above Proc Genmod as it outputs the Odds Ratios and 95% CIs for you, without you having to back transform them using exponential of the MLEs yourself.
NOTE: that the 95% confidence limits are being calculated using the Wald method. This assumes symmetric intervals around the maximum likelihood estimate using a normal distribution assumption (MLE +/-1.96* SE). Alternative confidence interval estimation methods exist such as the profile likelihood method but SAS does not calculate these.
=lung;
proc logistic datatrt01pn (ref="1") sex (ref="1") /param=glm;
class wt_gain(event="1") = trt01pn age sex ph_ecog2 meal_caln;
model
run;
Response Profile
Ordered value wt_gain Total Frequency-----------------------------------------------------------------------------
1 0 122
2 1 48
-----------------------------------------------------------------------------
=1
Probability modeled is wt_gain
: 58 observations were deleted due to missing values fro the repsonse or
Note
explanatory variables.
Class Level Information
Class Levels Design Variables-----------------------------------------------------------------------------
2 1 0
trt01pn 1 0 1
2 1 0
sex 1 0 1
-----------------------------------------------------------------------------
criterion (GCONV=1E-8) satisfied.
Convergence
Analysis of Maximum Likelihood Estimates
>ChiSq
Parameter DF Estimate Standard Wald Pr-Square
Error Chi-----------------------------------------------------------------------------
1 -2.6415 1.5140 3.0440 0.0810
Intercept 2 1 0.3888 0.3782 1.0569 0.3039
trt01pn 1 0 0.0000 . . .
trt01pn 1 0.0123 0.0212 0.3356 0.5624
age 2 1 0.8321 0.3744 4.9400 0.0262
sex 1 0 0.0000 . . .
sex 1 -0.3764 0.2638 2.0349 0.1537
ph_ecog 1 -0.000850 0.000449 3.5895 0.0581
meal_cal ----------------------------------------------------------------------------
Odds Ratio Estimates
95% Wald Confidence Limits
Effect Point Estimate -----------------------------------------------------------------------------
2 vs 1 1.475 0.703 3.095
trt01pn 1.012 0.971 1.055
age 2 vs 1 2.298 1.103 4.787
sex 0.686 0.409 1.151
ph_ecog 1.001 1.000 1.002
meal_cal ----------------------------------------------------------------------------
Model Comparison
To compare two logistic models, the -2 * Log Likelihood from each model can be compared against a \(\chi^2\)-distribution with degrees of freedom calculated using the difference in the two models’ parameters.
1: model wt_gain(event="1") = trt01pn age sex ph_ecog2 meal_caln;
Model
Model Fit Statistics
Criterion Intercept Only Intercept and Covariates --------------------------------------------------------
204.355 202.460
AIC 207.491 221.274
SC -2 Log L 202.355 190.460
--------------------------------------------------------
2: model wt_gain(event="1") = trt01pn sex ph_ecog2 meal_caln;
Model
Model Fit Statistics
Criterion Intercept Only Intercept and Covariates --------------------------------------------------------
204.355 200.798
AIC 207.491 216.477
SC -2 Log L 202.355 190.798
--------------------------------------------------------
190.460 - 190.798 = -0.338 which using a \(\chi^2\)-distribution corresponds to p=0.5606
SAS also allows us to fit forward or backwards stepwise selection. Below we specify to stop when we have 4 variables left in the model. This is not commonly done in practice but is included to highlight the difference in using a selection procedure compared to doing the difference betweeen the -2 * log likelihood models using a \(\chi^2\)-distribution.
=lung;
proc logistic datatrt01pn (ref="1") sex (ref="1") /param=glm;
class wt_gain(event="1") = trt01pn age sex ph_ecog2 meal_caln/
model =backward stop=4;
selection
run;
1: Effect age is removed
Step
Summary of Backward Elimination-Square Pr>ChiSq
Step Effect Removed DF Number In Wald Chi-----------------------------------------------------------------------------
1 Age 1 4 0.3356 0.5624
-----------------------------------------------------------------------------
NOTE: the chi-square test summary of backward elimination, p=0.5624 is slightly different to using the -2 * log likelihood models using a \(\chi^2\)-distribution p=0.5606.
This is because the backward elimination process in SAS uses the residual sums of squares and the F statistic. Starting with the full model, it removes the parameter with the least significant F statistic until all effects in the model have F statistics significant as a certain level. The F statistic is calculated as:
\[F=\frac{(RSS_{p-k}-RSS_p)/k}{RSS_p /(n-p-k)}\] where RSS = Residual sums of squares, n=number of observations in the analysis, p=number of parameters in fuller model (exc. intercept), k=number of degrees of freedom associated with the effect you are dropping, \[RSS_p\] =RSS for the fuller model, \[RSS_{p-k}\] = RSS for the reduced model.
Parameterization of model effects (categorical covariates) in SAS
The most common problem when fitting logistic regression in SAS, is getting SAS to model the binary variable (events) and any categorical covariates correctly. Using proc genmod
(using dist=bin and link=logit options), there is no issue as SAS defaults to using GLM parameterization. However using proc logistic
there are three ways to parameterize categorical variables, and the default is /PARAM=EFFECT
which can cause confusion when interpreting your model.
To demonstrate, we will now model a categorical variable called Dose, which has 3 treatment levels (1=10mg Active, 2=20 mg Active, 3=Placebo). The reference is now dose=3. You must pay close attention to the table of Class level information in order to understand how SAS is modelling your data.
CLASS X Y Z /PARAM=Effect
This is the SAS default such that if you do not specify the /param
option, SAS defaults to using this method.
With the EFFECT option, dose_id has 3 levels, and so needs 2 design variables (β1 and β2). Sex has 2 levels so uses just 1 design variable (β1). For dose_id, the reference level (Placebo) is given values of “-1” for both of the β1 and β2 parameters. General model: Y= α + β1x1 + β2x1 {+β3x3} etc, representing each parameter in the model.
=lung;
proc logistic datadose_id (ref="3") sex (ref="1") /param=effect;
class wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
model
run;
Class Level Information
Class Value Design Variables--------------------------------------------------------
dose_β1 dose_β21 1 0
dose_id 2 0 1
3 -1 -1
sex_β3 1 -1
SEX 2 1
--------------------------------------------------------
treatment (ignoring the other covariates), the Class Level Information can be translated into the table below, which is then used to form contrast statements.
If we want to estimate the effect of +/-) are from the design variables above.
α is the intercept, and β1 and β2 (including the sign
Dose_id Effect ------------------------------
10mg Active Y = α + β1
20mg Active Y = α + β2
= α - β1 - β2
Placebo Y ------------------------------
10mg Active vs Placebo, we would do the following:
To compare + β1 ) - (α - β1 - β2)
(α = 2 β1 + β2
= 2 β1 + 1 β2
:
This equates to a contrast statement as follows=lung;
proc logistic datadose_id (ref="3") sex (ref="1") /param=effect;
class wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
model "10mg vs. Placebo" dose_id 2 1 / e;
CONTRAST
run;
of (10mg Active and 20mg Active) vs Placebo, we would do the following:
To compare the average + β1 + α + β2 ) /2 ) - (α - β1 - β2)
((α = α + 0.5 β1 + 0.5 β2 - α + β1 + β2
= 1.5 β1 + 1.5 β2
:
This equates to a contrast statement as follows=lung;
proc logistic datadose_id (ref="3") sex (ref="1") /param=effect;
class wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
model "Active (10mg + 20mg) vs. Placebo" dose_id 1.5 1.5 / e;
CONTRAST
run;
/param=effect, since its easy to end up with the wrong contrasts.
As you can see, these contrasts are not very intuitive and hence it is not reccomended to use the default SAS option of
Contract Test Results-Square Pr>ChiSq
Contrast DF Wald Chi--------------------------------------------------------
Active (10mg +20mg) vs Placebo 1 1.1610 0.2813
CLASS X Y Z /PARAM=glm
Now let’s look at the param=glm
option. GLM parameterization has a design variable for each level of a parameter. Hence for dose with 3 levels, we have 3 design variables (β1, β2 and β3).
=lung;
proc logistic datadose_id (ref="3") sex (ref="1") /param=glm;
class wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
model
run;
Class Level Information
Class Value Design Variables--------------------------------------------------------
dose_β1 dose_β2 dose_β31 1 0 0
dose_id 2 0 1 0
3 0 0 1
sex_β3 sex_β4 1 1 0
SEX 2 0 1
--------------------------------------------------------
treatment (ignoring the other covariates), the Class Level Information can be translated into the table below, which is then used to form contrast statements.
If we want to estimate the effect of +/-) are from the design variables above.
α is the intercept, and β1 and β2 and β3 (including the sign
Dose_id Effect ------------------------------
10mg Active Y = α + β1
20mg Active Y = α + β2
= α - β3
Placebo Y --------------------------------------------------------
10mg Active vs Placebo, we would do the following:
To compare + β1 ) - (α - β3)
(α = β1 - β3
= 1 β1 - 1 β3
:
This equates to a contrast statement as follows=lung;
proc logistic datadose_id (ref="3") sex (ref="1") /param=glm;
class wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
model "10mg vs. Placebo" trt 1 -1 / e;
CONTRAST
run;
Active (10mg) compared to placebo, you take the effect of 10mg and subtract the effect of placebo !
As you can see, this contrast is much more intuitive. If you want to compare the effect of
of (10mg Active and 20mg Active) vs Placebo, we would do the following:
To compare the average + β1 + α + β2)/2 - (α + β3)
((α = α + 0.5 β1 + 0.5 β2 - α - β3
= 0.5 β1 + 0.5 β2 - β3
:
This equates to a contrast statement as followsdata=lung;
proc logistic dose_id (ref="3") sex (ref="1") /param=glm;
class wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
model "Active (10mg + 20mg) vs. Placebo (1)" trt 0.5 0.5 -1 / e;
CONTRAST
run;
Active (10mg + 20mg) compared to placebo, you take half the effect of 10mg plus half the effect of 20mg and substract the effect of placebo !
As you can see, this contrast is much more intuitive. If you want to compare the average of
Contract Test Results-Square Pr>ChiSq
Contrast DF Wald Chi--------------------------------------------------------
Active (10mg +20mg) vs Placebo 1 1.1610 0.2813
CLASS X Y Z /PARAM=Ref
Now let’s look at the param=ref
option. Similar to param=effect, ref parameterization uses 1 less design variable compared to the number of levels each parameter has, but the parameterization is different. For dose with 3 levels, we have 2 design variables (β1 and β2).
=lung;
proc logistic datadose_id (ref="3") sex (ref="1") /param=ref;
class wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
model
run;
Class Level Information
Class Value Design Variables--------------------------------------------------------
dose_β1 dose_β2 1 1 0
dose_id 2 0 1
3 0 0
sex_β3 1 0
SEX 2 1
--------------------------------------------------------
treatment (ignoring the other covariates), the Class Level Information can be translated into the table below, which is then used to form contrast statements.
If we want to estimate the effect of +/-) are from the design variables above.
α is the intercept, and β1 and β2 and β3 (including the sign
Dose_id Effect ------------------------------
10mg Active Y = α + β1
20mg Active Y = α + β2
= α
Placebo Y --------------------------------------------------------
10mg Active vs Placebo, we would do the following:
To compare + β1 ) - (α )
(α = β1
:
This equates to a contrast statement as follows=lung;
proc logistic datadose_id (ref="3") sex (ref="1") /param=glm;
class wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
model "10mg vs. Placebo" trt 1 / e;
CONTRAST
run;
of (10mg Active and 20mg Active) vs Placebo, we would do the following:
To compare the average + β1 + α + β2)/2 - α
((α = α + 0.5 β1 + 0.5 β2 - α
= 0.5 β1 + 0.5 β2
:
This equates to a contrast statement as followsdata=lung;
proc logistic dose_id (ref="3") sex (ref="1") /param=glm;
class wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
model "Active (10mg + 20mg) vs. Placebo (1)" trt 0.5 0.5 / e;
CONTRAST
run;
param=glm parameterization, but the same results are obtained.
Again this is less intuitive than the
Contract Test Results-Square Pr>ChiSq
Contrast DF Wald Chi--------------------------------------------------------
Active (10mg +20mg) vs Placebo 1 1.1610 0.2813
Contrast statements for 2 or more treatments
The Contrast statement, only outputs the p-value for the contrast, but it is common to also require an estimate of the difference between the treatments, with associated 95% CI. You can do this by changing contrast
to an estimate
statement. Note that the parameterization of the contrast remains the same as when using a contrast statement as shown below. These estimates and 95% CI’s can be back transformed to give you the Odds ratio of the contrast and associated 95% CI. The estimate coefficients table should be checked for accuracy versus the contrast you are trying to do.
=lung;
proc logistic datadose_id (ref="3") sex (ref="1") /param=glm;
class wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
model "Active (10mg + 20mg) vs. Placebo (1)" dose_id 0.5 0.5 -1 / e cl;
Estimate
run;
Estimate Coefficients
Parameter dose_id sex Row1--------------------------------------------------------
1 1 0.5
dose_id 2 2 0.5
dose_id 3 3 -1
dose_id --------------------------------------------------------
Estimate>|z| Alpha Lower Upper
Label estimate Std Err Z value Pr--------------------------------------------------------------------------
Active (10mg +20mg) vs. Placebo (1)
-0.4096 0.3802 -1.08 0.2813 0.05 -1.1547 0.3355
--------------------------------------------------------------------------
for the comparison of Active (10mg +20mg) vs. Placebo is
The odds ratio exp(-0.4096) (95% CI: exp(-1.1547) to exp(0.3355))
= OR: 0.664 (95%CI for OR: 0.315 to 1.399)
Ensuring you are modelling the correct Binary event in SAS
With logistic regression, we often want to model the number of “Successes”. However, by default, SAS sorts alphabetically/numerically and selects the first occurring EVENT alphabetically as the one it’s going to model.
It’s a common mistake, and we find SAS modelling the number of failures instead of successes. Very common when your response is: ‘Responder’ vs ‘Non-responder’, SAS will model the Non-responders as ‘N’ is alphabetically first before ‘R’!
For this reason, It is recommended to always use the event=“Y” option.
Options such as ORDER=DATA|FORMATTED|FREQ|INTERNAL
as well as descending can be used to ensure the correct levels of classification variables are being modelled. More detail here
References
Created using : SAS release:9.04.01M7P08062020”