Logistic Regression in SAS

Published

27 March, 2025

For a brief description of what is logistic regression see here.

Modelling the lung cancer data

In SAS, we can use proc logistic or proc genmod to perform a logistic regression.

To demonstrate the use of logistic regression we examine the same lung dataset as used in the R example here.

Summary of Common Mistakes in SAS

  1. Handling of missing data. Check SAS output that the number of missing values is as you expect. Make sure you have changed any NA results in the raw data to be missing, since SAS would consider NA as a valid category (a non-missing character result).

  2. Make sure you consider continuous or categorical variables as you intended. Just because a variable is character or numeric in the dataset, doesn’t mean SAS will treat it that way in the model. You have to use Class row to tell SAS which variables should be treated as character factors. You also have to use ref=' ' to tell SAS which is the reference category, otherwise SAS by default which use the last value of the variable alphabetically (e..g a categorical variable with 1, 2, 3 would default to 3 as the reference).

  3. Be careful you are modelling the correct event (response vs non-response, or weight_gain vs weight_loss for example)

  4. Be careful when interpreting any odds ratios that you have the factor of interest the correct way around (0 vs 1, or 1 vs 0)

  5. If using proc logistic, be careful of how SAS creates its parameters used in the model as this determines how you can use the parameter estimates! It is often easiest to use param=glm so that the exp(maximum liklihood parameter estimate) = odds ratio. Check the class level information (Design variables) is as you would expect. See below for more detail on other parameterization methods.

Modelling using Proc Genmod

Proc Genmod is a procedure that allows the fitting of Generalized Linear Models. By using the options dist=bin and link=logit, it fits a logistic regression as shown below. For more information see the SAS help here.

Always check that the Class Level Information matches what you expect (SAS puts the reference class level last). Also check that you are modelling the correct ‘event’ and that the algorithm has converged.

Below we are fitting trt01pn and sex as categorical variables, age, ph_ecog2 and meal_caln as continuous variables.

You can use exponential of the maximum likelihood parameter estimate and the exponential of the Wald 95% Confidence Limits to obtain the odds ratios and 95% CIs. Proc Genmod uses GLM parameterization.

Example data: . = missing, trt01pn (1 or 2), sex (1 or 2), ph_ecog2 (0,1,2,3)
wt_gain (1=gain, 0=no gain)

wt_gain   trt01pn  age   sex    ph_ecog2   meal_caln
-----------------------------------------------------------------------------
.         1        74    1       1          1175
0         1        68    2       0          1225
1         2        60    1       2           .


proc genmod data=lung;       
  class trt01pn (ref="1") sex (ref="1")  ;    
  model wt_gain (event="1") = trt01pn age sex ph_ecog2 meal_caln / 
        dist=bin link=logit;
run; 

Class Level Information
Class     Levels      Values
-----------------------------------------------------------------------------
trt01pn   2           2 1 
sex       2           2 1
-----------------------------------------------------------------------------

Response Profile
Ordered value     wt_gain         Total Frequency
-----------------------------------------------------------------------------
1                 1                48 
2                 0                122
-----------------------------------------------------------------------------

PROC GENMOD is modeling the probability that wt_gain='1'. 

Algorithm Converged.

                    Analysis of Maximum Likelihood Estimates                 

Parameter   DF  Estimate  Standard   Wald 95%        Wald         Pr>ChiSq
                          Error      CIs             Chi-Square
-----------------------------------------------------------------------------
Intercept   1   -2.6415   1.5140   -5.6090 0.3259    3.04         0.0810
trt01pn 2   1    0.3888   0.3782   -0.3524 1.1299    1.03         0.3039
trt01pn 1   0    0.0000   0.0000    0.0000 0.0000    .            .
age         1    0.0123   0.0212   -0.0292 0.0537    0.34         0.5624    
sex 2       1    0.8321   0.3744    0.0983 1.5659    4.97         0.0262  
sex 1       0    0.0000   0.0000    0.0000 0.0000    .            .
ph_ecog     1   -0.3764   0.2638   -0.8935 0.1407    2.03         0.1537   
meal_cal    1   -0.0008   0.0004   -0.0000 0.0017    3.59         0.0581  
scale       0    1.0000   0.0000    1.0000 1.0000
----------------------------------------------------------------------------
Note: The scale parameter was held fixed

Modelling using Proc Logistic

The same model above can also be modelled using Proc Logistic. You no longer have to specify the distribution and link function, however you do need to add an option / param=glm on the class row. Different parameterizations are discussed later in the context of forming treatment contrast statements.

For now, all you need to know is that using /param=glm ensures exp(estimates)=odds ratio. You will also note that in the class level information, SAS now tells you the design variables. This will also be important later when we learn more about parameterization.

Proc Logistic is often preferred above Proc Genmod as it outputs the Odds Ratios and 95% CIs for you, without you having to back transform them using exponential of the MLEs yourself.

NOTE: that the 95% confidence limits are being calculated using the Wald method. This assumes symmetric intervals around the maximum likelihood estimate using a normal distribution assumption (MLE +/-1.96* SE). Alternative confidence interval estimation methods exist such as the profile likelihood method but SAS does not calculate these.

proc logistic data=lung;  
    class trt01pn (ref="1") sex (ref="1") /param=glm;
    model  wt_gain(event="1") = trt01pn age sex ph_ecog2 meal_caln;
run;

Response Profile
Ordered value     wt_gain         Total Frequency
-----------------------------------------------------------------------------
1                 0                122 
2                 1                48
-----------------------------------------------------------------------------
Probability modeled is wt_gain=1

Note: 58 observations were deleted due to missing values fro the repsonse or
explanatory variables.

Class Level Information
Class     Levels      Design Variables
-----------------------------------------------------------------------------
trt01pn   2           1  0 
          1           0  1
sex       2           1  0
          1           0  1
-----------------------------------------------------------------------------
Convergence criterion (GCONV=1E-8) satisfied.

                    Analysis of Maximum Likelihood Estimates                 

Parameter   DF  Estimate  Standard     Wald         Pr>ChiSq
                          Error        Chi-Square
-----------------------------------------------------------------------------
Intercept   1   -2.6415   1.5140       3.0440       0.0810
trt01pn 2   1    0.3888   0.3782       1.0569       0.3039
trt01pn 1   0    0.0000   .            .            .
age         1    0.0123   0.0212       0.3356       0.5624    
sex 2       1    0.8321   0.3744       4.9400       0.0262  
sex 1       0    0.0000   .            .            .
ph_ecog     1   -0.3764   0.2638       2.0349       0.1537   
meal_cal    1   -0.000850 0.000449     3.5895       0.0581  
----------------------------------------------------------------------------
  
                   Odds Ratio Estimates                

Effect           Point Estimate    95% Wald Confidence Limits
-----------------------------------------------------------------------------
trt01pn 2 vs 1   1.475             0.703   3.095
age              1.012             0.971   1.055
sex     2 vs 1   2.298             1.103   4.787    
ph_ecog          0.686             0.409   1.151  
meal_cal         1.001             1.000   1.002 
----------------------------------------------------------------------------

Model Comparison

To compare two logistic models, the -2 * Log Likelihood from each model can be compared against a \(\chi^2\)-distribution with degrees of freedom calculated using the difference in the two models’ parameters.

Model 1:  model  wt_gain(event="1") = trt01pn age sex ph_ecog2 meal_caln;

Model Fit Statistics
Criterion   Intercept Only     Intercept and Covariates       
--------------------------------------------------------
AIC         204.355            202.460
SC          207.491            221.274
-2 Log L    202.355            190.460   
--------------------------------------------------------
  
Model 2:  model  wt_gain(event="1") = trt01pn sex ph_ecog2 meal_caln;

Model Fit Statistics
Criterion   Intercept Only     Intercept and Covariates       
--------------------------------------------------------
AIC         204.355            200.798
SC          207.491            216.477
-2 Log L    202.355            190.798   
--------------------------------------------------------

190.460 - 190.798 = -0.338 which using a \(\chi^2\)-distribution corresponds to p=0.5606

SAS also allows us to fit forward or backwards stepwise selection. Below we specify to stop when we have 4 variables left in the model. This is not commonly done in practice but is included to highlight the difference in using a selection procedure compared to doing the difference betweeen the -2 * log likelihood models using a \(\chi^2\)-distribution.

proc logistic data=lung;  
    class trt01pn (ref="1") sex (ref="1") /param=glm;
    model  wt_gain(event="1") = trt01pn age sex ph_ecog2 meal_caln/ 
           selection=backward stop=4;
run;  

Step 1: Effect age is removed
Summary of Backward Elimination
Step   Effect Removed   DF     Number In     Wald Chi-Square   Pr>ChiSq
-----------------------------------------------------------------------------
1      Age              1      4             0.3356            0.5624   
-----------------------------------------------------------------------------

NOTE: the chi-square test summary of backward elimination, p=0.5624 is slightly different to using the -2 * log likelihood models using a \(\chi^2\)-distribution p=0.5606.

This is because the backward elimination process in SAS uses the residual sums of squares and the F statistic. Starting with the full model, it removes the parameter with the least significant F statistic until all effects in the model have F statistics significant as a certain level. The F statistic is calculated as:

\[F=\frac{(RSS_{p-k}-RSS_p)/k}{RSS_p /(n-p-k)}\] where RSS = Residual sums of squares, n=number of observations in the analysis, p=number of parameters in fuller model (exc. intercept), k=number of degrees of freedom associated with the effect you are dropping, \[RSS_p\] =RSS for the fuller model, \[RSS_{p-k}\] = RSS for the reduced model.

Parameterization of model effects (categorical covariates) in SAS

The most common problem when fitting logistic regression in SAS, is getting SAS to model the binary variable (events) and any categorical covariates correctly. Using proc genmod (using dist=bin and link=logit options), there is no issue as SAS defaults to using GLM parameterization. However using proc logistic there are three ways to parameterize categorical variables, and the default is /PARAM=EFFECT which can cause confusion when interpreting your model.

To demonstrate, we will now model a categorical variable called Dose, which has 3 treatment levels (1=10mg Active, 2=20 mg Active, 3=Placebo). The reference is now dose=3. You must pay close attention to the table of Class level information in order to understand how SAS is modelling your data.

CLASS X Y Z /PARAM=Effect

This is the SAS default such that if you do not specify the /param option, SAS defaults to using this method.

With the EFFECT option, dose_id has 3 levels, and so needs 2 design variables (β1 and β2). Sex has 2 levels so uses just 1 design variable (β1). For dose_id, the reference level (Placebo) is given values of “-1” for both of the β1 and β2 parameters. General model: Y= α + β1x1 + β2x1 {+β3x3} etc, representing each parameter in the model.

proc logistic data=lung;  
    class dose_id (ref="3") sex (ref="1") /param=effect;
    model  wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
run;

Class Level Information
Class       Value            Design Variables
--------------------------------------------------------
                             dose_β1  dose_β2
dose_id      1                1           0
             2                0           1
             3               -1          -1
             
                             sex_β3      
SEX          1               -1
             2                1
--------------------------------------------------------

If we want to estimate the effect of treatment (ignoring the other covariates), the Class Level Information can be translated into the table below, which is then used to form contrast statements.
α is the intercept, and β1 and β2 (including the sign +/-) are from the design variables above.
                              
Dose_id       Effect          
------------------------------
10mg Active   Y = α + β1          
20mg Active   Y = α + β2          
Placebo       Y = α - β1 - β2    
------------------------------
  
To compare 10mg Active vs Placebo, we would do the following:
+ β1 ) -- β1 - β2)
= 2 β1 + β2  
= 2 β1 + 1 β2

This equates to a contrast statement as follows:
proc logistic data=lung;  
    class dose_id (ref="3") sex (ref="1") /param=effect;
    model  wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
    CONTRAST "10mg vs. Placebo" dose_id 2 1 / e; 
run;
  
To compare the average of (10mg Active and 20mg Active) vs Placebo, we would do the following:
((α + β1 + α + β2 ) /2 ) -- β1 - β2)
= α + 0.5 β1 + 0.5 β2 - α + β1 + β2  
= 1.5 β1 + 1.5 β2

This equates to a contrast statement as follows:
proc logistic data=lung;  
    class dose_id (ref="3") sex (ref="1") /param=effect;
    model  wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
    CONTRAST "Active (10mg + 20mg) vs. Placebo" dose_id 1.5 1.5 / e; 
run;

As you can see, these contrasts are not very intuitive and hence it is not reccomended to use the default SAS option of /param=effect, since its easy to end up with the wrong contrasts.

Contract Test Results
Contrast                           DF        Wald Chi-Square      Pr>ChiSq
--------------------------------------------------------
Active (10mg +20mg) vs Placebo     1          1.1610               0.2813

CLASS X Y Z /PARAM=glm

Now let’s look at the param=glm option. GLM parameterization has a design variable for each level of a parameter. Hence for dose with 3 levels, we have 3 design variables (β1, β2 and β3).

proc logistic data=lung;  
    class dose_id (ref="3") sex (ref="1") /param=glm;
    model  wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
run;

Class Level Information
Class       Value            Design Variables
--------------------------------------------------------
                             dose_β1  dose_β2 dose_β3
dose_id      1                1           0       0
             2                0           1       0  
             3                0           0       1
             
                             sex_β3   sex_β4   
SEX          1                1           0
             2                0           1
--------------------------------------------------------

If we want to estimate the effect of treatment (ignoring the other covariates), the Class Level Information can be translated into the table below, which is then used to form contrast statements.
α is the intercept, and β1 and β2 and β3 (including the sign +/-) are from the design variables above.
                              
Dose_id       Effect          
------------------------------
10mg Active   Y = α + β1          
20mg Active   Y = α + β2          
Placebo       Y = α - β3    
--------------------------------------------------------
  
To compare 10mg Active vs Placebo, we would do the following:
+ β1 ) -- β3)
=  β1 - β3  
= 1 β1 - 1 β3

This equates to a contrast statement as follows:
proc logistic data=lung;  
    class dose_id (ref="3") sex (ref="1") /param=glm;
    model  wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
    CONTRAST "10mg vs. Placebo" trt 1 -1 / e; 
run;
  
As you can see, this contrast is much more intuitive. If you want to compare the effect of Active (10mg) compared to placebo,  you take the effect of 10mg and subtract the effect of placebo !

To compare the average of (10mg Active and 20mg Active) vs Placebo, we would do the following:
((α + β1 + α + β2)/2 -+ β3) 
= α + 0.5 β1 + 0.5 β2 - α - β3 
= 0.5 β1 + 0.5 β2 - β3

This equates to a contrast statement as follows:
proc logistic data=lung;  
    class dose_id (ref="3") sex (ref="1") /param=glm;
    model  wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
CONTRAST "Active (10mg + 20mg) vs. Placebo (1)" trt 0.5 0.5 -1 / e; 
run;

As you can see, this contrast is much more intuitive. If you want to compare the average of Active (10mg + 20mg) compared to placebo,  you take half the effect of 10mg plus half the effect of 20mg and substract the effect of placebo !
  
Contract Test Results
Contrast                           DF        Wald Chi-Square      Pr>ChiSq
--------------------------------------------------------
Active (10mg +20mg) vs Placebo     1          1.1610               0.2813

CLASS X Y Z /PARAM=Ref

Now let’s look at the param=ref option. Similar to param=effect, ref parameterization uses 1 less design variable compared to the number of levels each parameter has, but the parameterization is different. For dose with 3 levels, we have 2 design variables (β1 and β2).

proc logistic data=lung;   
class dose_id (ref="3") sex (ref="1") /param=ref;
model  wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
run;

Class Level Information
Class       Value            Design Variables
--------------------------------------------------------
                             dose_β1  dose_β2 
dose_id      1                1           0       
             2                0           1         
             3                0           0       
             
                             sex_β3     
SEX          1                0           
             2                1           
--------------------------------------------------------

If we want to estimate the effect of treatment (ignoring the other covariates), the Class Level Information can be translated into the table below, which is then used to form contrast statements.
α is the intercept, and β1 and β2 and β3 (including the sign +/-) are from the design variables above.
                              
Dose_id       Effect          
------------------------------
10mg Active   Y = α + β1          
20mg Active   Y = α + β2          
Placebo       Y = α     
--------------------------------------------------------
  
To compare 10mg Active vs Placebo, we would do the following:
+ β1 ) - (α )
=  β1  

This equates to a contrast statement as follows:
proc logistic data=lung;  
    class dose_id (ref="3") sex (ref="1") /param=glm;
    model  wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
    CONTRAST "10mg vs. Placebo" trt 1  / e; 
run;
  
To compare the average of (10mg Active and 20mg Active) vs Placebo, we would do the following:
((α + β1 + α + β2)/2 -  α  
= α + 0.5 β1 + 0.5 β2 - α  
= 0.5 β1 + 0.5 β2 

This equates to a contrast statement as follows:
proc logistic data=lung;  
    class dose_id (ref="3") sex (ref="1") /param=glm;
    model  wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln;
CONTRAST "Active (10mg + 20mg) vs. Placebo (1)" trt 0.5 0.5  / e; 
run;

Again this is less intuitive than the param=glm parameterization,  but the same results are obtained.
  
Contract Test Results
Contrast                           DF        Wald Chi-Square      Pr>ChiSq
--------------------------------------------------------
Active (10mg +20mg) vs Placebo     1          1.1610               0.2813

Contrast statements for 2 or more treatments

The Contrast statement, only outputs the p-value for the contrast, but it is common to also require an estimate of the difference between the treatments, with associated 95% CI. You can do this by changing contrast to an estimate statement. Note that the parameterization of the contrast remains the same as when using a contrast statement as shown below. These estimates and 95% CI’s can be back transformed to give you the Odds ratio of the contrast and associated 95% CI. The estimate coefficients table should be checked for accuracy versus the contrast you are trying to do.

proc logistic data=lung;
class dose_id (ref="3") sex (ref="1") /param=glm;
model wt_gain(event="1") = dose_id age sex ph_ecog2 meal_caln; 
Estimate "Active (10mg + 20mg) vs. Placebo (1)" dose_id 0.5 0.5 -1 / e cl; 
run;

Estimate Coefficients
Parameter   dose_id    sex    Row1
--------------------------------------------------------
dose_id 1   1                 0.5
dose_id 2   2                 0.5 
dose_id 3   3                 -1
--------------------------------------------------------
  
Estimate
Label      estimate   Std Err   Z value   Pr>|z| Alpha  Lower  Upper
--------------------------------------------------------------------------
Active (10mg +20mg) vs. Placebo (1)
           -0.4096    0.3802    -1.08     0.2813  0.05  -1.1547  0.3355  
--------------------------------------------------------------------------

The odds ratio for the comparison of Active (10mg +20mg) vs. Placebo is 
  exp(-0.4096) (95% CI: exp(-1.1547) to exp(0.3355))
= OR: 0.664 (95%CI for OR: 0.315 to 1.399)

Ensuring you are modelling the correct Binary event in SAS

With logistic regression, we often want to model the number of “Successes”. However, by default, SAS sorts alphabetically/numerically and selects the first occurring EVENT alphabetically as the one it’s going to model.

It’s a common mistake, and we find SAS modelling the number of failures instead of successes. Very common when your response is: ‘Responder’ vs ‘Non-responder’, SAS will model the Non-responders as ‘N’ is alphabetically first before ‘R’!

For this reason, It is recommended to always use the event=“Y” option.

Options such as ORDER=DATA|FORMATTED|FREQ|INTERNAL as well as descending can be used to ensure the correct levels of classification variables are being modelled. More detail here

References

Created using : SAS release:9.04.01M7P08062020”