Generalized Estimating Equations (GEE) methods in R

INTRODUCTION

Generalized Estimating Equations (GEE) methods extend the Generalized Linear Model (GLM) framework using link functions that relate the predictors to transformed outcome variable. For dichotomous response variables, the link functions is the probit (in case of rare events complementary log-log may be preferable). For outcomes with more than two categories, the cumulative link function is used in case of ordinal variables and generalized logit for nominal variables.

GEE are marginal models and therefore estimate population-averaged effects and within-subject correlation is analysed by specifying a working correlation structure (as in MMRM). Estimators are obtained via quasi-likelihood via iterative solving of estimating equations.

EXAMPLE DATA

A SAS data of clinical trial data comparing two treatments for a respiratory disorder available in “Gee Model for Binary Data” in the SAS/STAT Sample Program Library [1] is used to create these examples.

To uniquely identify subjects, a new variable USUBJID was created by concatenating SITE and ID. Variables TREATMENT, BASELINE, and VISIT were renamed to TRTP, BASE, and AVISITN.

Additionally, two variables were created using randomly generated values to simulate variables with more than two categories. One was an ordinal variable with values 1, 2, and 3; the other was a nominal variable with categories ‘liver’, ‘lung’, and ‘bone’. Data were created in SAS (See SAS section), and imported to R.

#install.packages("readxl")
library(readxl)
resp <- read_excel("../data/resp_gee.xlsx")

Variables OUTCOME, TRTP and AVISITN were converted to factors. Since the modeling functions use the first (alphabetically) level as the reference category, TRTP levels are ordered as ‘P’ (placebo) and ‘A’ (active), to ensure that placebo is used as the reference category in the models:

resp$trtp<-factor(resp$trtp, levels=c('P', 'A'))
resp$avisitn<-factor(resp$avisitn)
resp$outcome<-factor(resp$outcome)
resp$usubjid<-factor(resp$usubjid)
resp$respnom<-factor(resp$respnom)

GEE models are run in the next section, including treatment, visit and visit by treatment interaction as fixed effects.

PACKAGES

BINARY OUTCOME

Binary outcomes can be analyzed with geepack::geelm [2] and and gee::gee [3] functions. Independence correlation matrix is used by default in both functions, and can be modified in the corstr= argument.

Both functions estimate robust “Sandwich” standard errors (SE) by default, and the gee::gee function also returns the naive SE (model-based SE).

#Create numeric outcome (values: 1/0).
resp$outcome_num<-as.numeric(resp$outcome)-1

model <- geepack::geeglm(outcome_num ~ trtp + avisitn +trtp*avisitn,
                id=usubjid,
                data=resp,
                family=binomial(link='logit'),
                corstr="independence")
summary(model)

Call:
geepack::geeglm(formula = outcome_num ~ trtp + avisitn + trtp * 
    avisitn, family = binomial(link = "logit"), data = resp, 
    id = usubjid, corstr = "independence")

 Coefficients:
               Estimate  Std.err  Wald Pr(>|W|)  
(Intercept)    -0.03509  0.26495 0.018   0.8946  
trtpA           0.81280  0.39503 4.234   0.0396 *
avisitn2       -0.42921  0.26357 2.652   0.1034  
avisitn3       -0.14080  0.29834 0.223   0.6370  
avisitn4       -0.21177  0.25344 0.698   0.4034  
trtpA:avisitn2  0.51651  0.41043 1.584   0.2082  
trtpA:avisitn3  0.31861  0.42839 0.553   0.4570  
trtpA:avisitn4 -0.11395  0.39485 0.083   0.7729  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = independence 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)        1 0.05045
Number of clusters:   111  Maximum cluster size: 4 
model<- gee::gee(outcome ~ trtp + avisitn + trtp*avisitn,
            id=usubjid,
            data=resp,
            family=binomial(link='logit'),
            corstr = "independence")
Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
   (Intercept)          trtpA       avisitn2       avisitn3       avisitn4 
      -0.03509        0.81280       -0.42921       -0.14080       -0.21177 
trtpA:avisitn2 trtpA:avisitn3 trtpA:avisitn4 
       0.51651        0.31861       -0.11395 
summary(model)

 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998) 

Model:
 Link:                      Logit 
 Variance to Mean Relation: Binomial 
 Correlation Structure:     Independent 

Call:
gee::gee(formula = outcome ~ trtp + avisitn + trtp * avisitn, 
    id = usubjid, data = resp, family = binomial(link = "logit"), 
    corstr = "independence")

Summary of Residuals:
    Min      1Q  Median      3Q     Max 
-0.7222 -0.4561  0.2778  0.3889  0.6140 


Coefficients:
               Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept)    -0.03509     0.2674 -0.1312      0.2649  -0.1324
trtpA           0.81280     0.3986  2.0389      0.3950   2.0576
avisitn2       -0.42921     0.3832 -1.1200      0.2636  -1.6285
avisitn3       -0.14080     0.3788 -0.3717      0.2983  -0.4719
avisitn4       -0.21177     0.3795 -0.5580      0.2534  -0.8356
trtpA:avisitn2  0.51651     0.5699  0.9064      0.4104   1.2585
trtpA:avisitn3  0.31861     0.5700  0.5589      0.4284   0.7437
trtpA:avisitn4 -0.11395     0.5575 -0.2044      0.3948  -0.2886

Estimated Scale Parameter:  1.018
Number of Iterations:  1

Working Correlation
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1

PROBABILITIES AND ODDS RATIO (OR)

The estimated probabilities of event and OR can be obtained using the emmeans function from a geepack::geeglm object. The code below computes probabilities and OR, along with p-values.

model <- geepack::geeglm(outcome_num ~ trtp + avisitn +trtp*avisitn,
                id=usubjid,
                data=resp,
                family=binomial(link='logit'),
                corstr="independence")


#Get predicted probabilities for each treatment.
#Get predicted probabilities for each treatment, using the covariance matrix computed by the model
prob <- emmeans::emmeans(model, ~ trtp*avisitn, data=resp, vcov.method=vcov(model), type='response')
prob
 trtp avisitn  prob     SE  df lower.CL upper.CL
 P    1       0.491 0.0662 436    0.364    0.619
 A    1       0.685 0.0632 436    0.550    0.795
 P    2       0.386 0.0645 436    0.269    0.518
 A    2       0.704 0.0621 436    0.569    0.810
 P    3       0.456 0.0660 436    0.332    0.586
 A    3       0.722 0.0609 436    0.589    0.825
 P    4       0.439 0.0657 436    0.316    0.569
 A    4       0.611 0.0663 436    0.476    0.731

Covariance estimate used: user-supplied 
Confidence level used: 0.95 
Intervals are back-transformed from the logit scale 
#Get differences between treatments by visit (option "revpairwise" is used to compare A vs P) 
diffs_mean<-emmeans::emmeans(model, ~ trtp*avisitn, data=resp, vcov.method=vcov(model))
diffs <- emmeans::contrast(diffs_mean,"revpairwise" , simple="trtp")
diffs <- as.data.frame(diffs)

#Calculate CI (alpha=0.05)
alpha<-0.05
z_crit <- qnorm(1 - alpha / 2)
diffs$low<-diffs$estimate - (z_crit*diffs$SE)
diffs$upp<-diffs$estimate + (z_crit*diffs$SE)

#Get OR applying exponential transformation;
or<-exp(diffs$estimate)
or_low<-exp(diffs$low)
or_upp<-exp(diffs$up)

#Get two-sided p-value
z <- diffs$estimate/diffs$SE
pvalue <- 2 * (1 - pnorm(z))

#Create a dataset with all the results
OR<-as.data.frame(cbind(diffs$avisitn, or,or_low, or_upp, z, round(pvalue, digits=4)))
colnames(OR)<-c('avisit', 'OR', 'lower.CL', 'upper.CL', 'Z', 'p-value')
OR
  avisit    OR lower.CL upper.CL     Z p-value
1      1 2.254   1.0393    4.889 2.058  0.0396
2      2 3.778   1.7132    8.333 3.294  0.0010
3      3 3.100   1.4050    6.840 2.802  0.0051
4      4 2.011   0.9435    4.288 1.809  0.0704

OUTCOME WITH MORE THAN TWO CATEGORIES

The functions used for binary outcomes above do not support outcomes with more than two categories, as these functions relay on the family function, which does not include the multinomial option. Nevertheless, the multgee package [4] provides two functions for estimating GEE models when the outcome has more than two categories:

The correlation matrix by default is “Exchangeable”, and it can be modified using the LORstr= argument in both functions.

Ordinal variable

model <- multgee::ordLORgee(formula = respord ~ trtp + avisitn  + trtp*avisitn,
                   data = resp,
                   id = usubjid,
                   repeated = avisitn,
                   LORstr="time.exch",
                   )

summary(model)
GEE FOR ORDINAL MULTINOMIAL RESPONSES 
version 1.6.0 modified 2017-07-10 

Link : Cumulative logit 

Local Odds Ratios:
Structure:         time.exch
Model:             3way
Homogenous scores: TRUE

call:
multgee::ordLORgee(formula = respord ~ trtp + avisitn + trtp * 
    avisitn, data = resp, id = usubjid, repeated = avisitn, LORstr = "time.exch")

Summary of residuals:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.4704 -0.3542 -0.2854  0.0003  0.6335  0.7279 

Number of Iterations: 3 

Coefficients:
               Estimate  san.se san.z Pr(>|san.z|)  
beta10          -0.5472  0.2417 -2.26        0.024 *
beta20           0.6304  0.2551  2.47        0.013 *
trtpA            0.4284  0.3432  1.25        0.212  
avisitn2        -0.0024  0.3668 -0.01        0.995  
avisitn3         0.1156  0.3228  0.36        0.720  
avisitn4        -0.1470  0.3323 -0.44        0.658  
trtpA:avisitn2  -0.2717  0.5094 -0.53        0.594  
trtpA:avisitn3  -0.5974  0.4619 -1.29        0.196  
trtpA:avisitn4  -0.3269  0.4488 -0.73        0.466  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Local Odds Ratios Estimates:
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
[1,] 0.000 0.000 1.000 0.999 1.000 0.999 1.000 0.999
[2,] 0.000 0.000 0.999 1.281 0.999 1.281 0.999 1.281
[3,] 1.000 0.999 0.000 0.000 1.000 0.999 1.000 0.999
[4,] 0.999 1.281 0.000 0.000 0.999 1.281 0.999 1.281
[5,] 1.000 0.999 1.000 0.999 0.000 0.000 1.000 0.999
[6,] 0.999 1.281 0.999 1.281 0.000 0.000 0.999 1.281
[7,] 1.000 0.999 1.000 0.999 1.000 0.999 0.000 0.000
[8,] 0.999 1.281 0.999 1.281 0.999 1.281 0.000 0.000

p-value of Null model: 0.69 

Nominal variable

model <- multgee::nomLORgee(formula = respnom ~ trtp + avisitn  + trtp*avisitn,
                   data = resp,
                   id = usubjid,
                   repeated = avisitn,
                   LORstr = "time.exch")
                   
        
summary(model)
GEE FOR NOMINAL MULTINOMIAL RESPONSES 
version 1.6.0 modified 2017-07-10 

Link : Baseline Category Logit 

Local Odds Ratios:
Structure:         time.exch
Model:             3way
Homogenous scores: TRUE

call:
multgee::nomLORgee(formula = respnom ~ trtp + avisitn + trtp * 
    avisitn, data = resp, id = usubjid, repeated = avisitn, LORstr = "time.exch")

Summary of residuals:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -0.444  -0.368  -0.333   0.000   0.611   0.778 

Number of Iterations: 1 

Coefficients:
                 Estimate  san.se san.z Pr(>|san.z|)
beta10             0.2231  0.3354  0.67         0.51
trtpA:1           -0.6286  0.5014 -1.25         0.21
avisitn2:1         0.1823  0.4310  0.42         0.67
avisitn3:1         0.0132  0.4786  0.03         0.98
avisitn4:1        -0.0690  0.4814 -0.14         0.89
trtpA:avisitn2:1   0.1625  0.6342  0.26         0.80
trtpA:avisitn3:1   0.9518  0.6979  1.36         0.17
trtpA:avisitn4:1   0.6463  0.7006  0.92         0.36
beta20             0.2719  0.3318  0.82         0.41
trtpA:2            0.0158  0.4553  0.03         0.97
avisitn2:2         0.1800  0.4254  0.42         0.67
avisitn3:2         0.1555  0.4660  0.33         0.74
avisitn4:2        -0.2719  0.5079 -0.54         0.59
trtpA:avisitn2:2  -0.2564  0.6025 -0.43         0.67
trtpA:avisitn3:2   0.1164  0.6469  0.18         0.86
trtpA:avisitn4:2   0.1561  0.6399  0.24         0.81

Local Odds Ratios Estimates:
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
[1,] 0.000 0.000 1.279 1.001 1.279 1.001 1.279 1.001
[2,] 0.000 0.000 1.001 1.000 1.001 1.000 1.001 1.000
[3,] 1.279 1.001 0.000 0.000 1.279 1.001 1.279 1.001
[4,] 1.001 1.000 0.000 0.000 1.001 1.000 1.001 1.000
[5,] 1.279 1.001 1.279 1.001 0.000 0.000 1.279 1.001
[6,] 1.001 1.000 1.001 1.000 0.000 0.000 1.001 1.000
[7,] 1.279 1.001 1.279 1.001 1.279 1.001 0.000 0.000
[8,] 1.001 1.000 1.001 1.000 1.001 1.000 0.000 0.000

p-value of Null model: 0.28 

REFERENCES

[1] SAS Institute Inc.. SAS Help Center. The GEE procedure.

[2] Generalized Estimating Equation Package

[3] Generalized Estimation Equation Solver

[4] Touloumis A. (2015). “R Package multgee: A Generalized Estimating Equations Solver for Multinomial Responses.” Journal of Statistical Software.