#install.packages("readxl")
library(readxl)
resp <- read_excel("../data/resp_gee.xlsx")Generalized Estimating Equations (GEE) methods in R
INTRODUCTION
Generalized Estimating Equations (GEE) methods extend the Generalized Linear Model (GLM) framework using link functions that relate the predictors to transformed outcome variable. For dichotomous response variables, the link functions is the probit (in case of rare events complementary log-log may be preferable). For outcomes with more than two categories, the cumulative link function is used in case of ordinal variables and generalized logit for nominal variables.
GEE are marginal models and therefore estimate population-averaged effects and within-subject correlation is analysed by specifying a working correlation structure (as in MMRM). Estimators are obtained via quasi-likelihood via iterative solving of estimating equations.
EXAMPLE DATA
A SAS data of clinical trial data comparing two treatments for a respiratory disorder available in “Gee Model for Binary Data” in the SAS/STAT Sample Program Library [1] is used to create these examples.
To uniquely identify subjects, a new variable USUBJID was created by concatenating SITE and ID. Variables TREATMENT, BASELINE, and VISIT were renamed to TRTP, BASE, and AVISITN.
Additionally, two variables were created using randomly generated values to simulate variables with more than two categories. One was an ordinal variable with values 1, 2, and 3; the other was a nominal variable with categories ‘liver’, ‘lung’, and ‘bone’. Data were created in SAS (See SAS section), and imported to R.
Variables OUTCOME, TRTP and AVISITN were converted to factors. Since the modeling functions use the first (alphabetically) level as the reference category, TRTP levels are ordered as ‘P’ (placebo) and ‘A’ (active), to ensure that placebo is used as the reference category in the models:
resp$trtp<-factor(resp$trtp, levels=c('P', 'A'))
resp$avisitn<-factor(resp$avisitn)
resp$outcome<-factor(resp$outcome)
resp$usubjid<-factor(resp$usubjid)
resp$respnom<-factor(resp$respnom)GEE models are run in the next section, including treatment, visit and visit by treatment interaction as fixed effects.
PACKAGES
BINARY OUTCOME
Binary outcomes can be analyzed with geepack::geelm [2] and and gee::gee [3] functions. Independence correlation matrix is used by default in both functions, and can be modified in the corstr= argument.
Both functions estimate robust “Sandwich” standard errors (SE) by default, and the gee::gee function also returns the naive SE (model-based SE).
#Create numeric outcome (values: 1/0).
resp$outcome_num<-as.numeric(resp$outcome)-1
model <- geepack::geeglm(outcome_num ~ trtp + avisitn +trtp*avisitn,
id=usubjid,
data=resp,
family=binomial(link='logit'),
corstr="independence")
summary(model)
Call:
geepack::geeglm(formula = outcome_num ~ trtp + avisitn + trtp *
avisitn, family = binomial(link = "logit"), data = resp,
id = usubjid, corstr = "independence")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) -0.03509 0.26495 0.018 0.8946
trtpA 0.81280 0.39503 4.234 0.0396 *
avisitn2 -0.42921 0.26357 2.652 0.1034
avisitn3 -0.14080 0.29834 0.223 0.6370
avisitn4 -0.21177 0.25344 0.698 0.4034
trtpA:avisitn2 0.51651 0.41043 1.584 0.2082
trtpA:avisitn3 0.31861 0.42839 0.553 0.4570
trtpA:avisitn4 -0.11395 0.39485 0.083 0.7729
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation structure = independence
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 1 0.05045
Number of clusters: 111 Maximum cluster size: 4
model<- gee::gee(outcome ~ trtp + avisitn + trtp*avisitn,
id=usubjid,
data=resp,
family=binomial(link='logit'),
corstr = "independence")Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept) trtpA avisitn2 avisitn3 avisitn4
-0.03509 0.81280 -0.42921 -0.14080 -0.21177
trtpA:avisitn2 trtpA:avisitn3 trtpA:avisitn4
0.51651 0.31861 -0.11395
summary(model)
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Logit
Variance to Mean Relation: Binomial
Correlation Structure: Independent
Call:
gee::gee(formula = outcome ~ trtp + avisitn + trtp * avisitn,
id = usubjid, data = resp, family = binomial(link = "logit"),
corstr = "independence")
Summary of Residuals:
Min 1Q Median 3Q Max
-0.7222 -0.4561 0.2778 0.3889 0.6140
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -0.03509 0.2674 -0.1312 0.2649 -0.1324
trtpA 0.81280 0.3986 2.0389 0.3950 2.0576
avisitn2 -0.42921 0.3832 -1.1200 0.2636 -1.6285
avisitn3 -0.14080 0.3788 -0.3717 0.2983 -0.4719
avisitn4 -0.21177 0.3795 -0.5580 0.2534 -0.8356
trtpA:avisitn2 0.51651 0.5699 0.9064 0.4104 1.2585
trtpA:avisitn3 0.31861 0.5700 0.5589 0.4284 0.7437
trtpA:avisitn4 -0.11395 0.5575 -0.2044 0.3948 -0.2886
Estimated Scale Parameter: 1.018
Number of Iterations: 1
Working Correlation
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1
PROBABILITIES AND ODDS RATIO (OR)
The estimated probabilities of event and OR can be obtained using the emmeans function from a geepack::geeglm object. The code below computes probabilities and OR, along with p-values.
model <- geepack::geeglm(outcome_num ~ trtp + avisitn +trtp*avisitn,
id=usubjid,
data=resp,
family=binomial(link='logit'),
corstr="independence")
#Get predicted probabilities for each treatment.
#Get predicted probabilities for each treatment, using the covariance matrix computed by the model
prob <- emmeans::emmeans(model, ~ trtp*avisitn, data=resp, vcov.method=vcov(model), type='response')
prob trtp avisitn prob SE df lower.CL upper.CL
P 1 0.491 0.0662 436 0.364 0.619
A 1 0.685 0.0632 436 0.550 0.795
P 2 0.386 0.0645 436 0.269 0.518
A 2 0.704 0.0621 436 0.569 0.810
P 3 0.456 0.0660 436 0.332 0.586
A 3 0.722 0.0609 436 0.589 0.825
P 4 0.439 0.0657 436 0.316 0.569
A 4 0.611 0.0663 436 0.476 0.731
Covariance estimate used: user-supplied
Confidence level used: 0.95
Intervals are back-transformed from the logit scale
#Get differences between treatments by visit (option "revpairwise" is used to compare A vs P)
diffs_mean<-emmeans::emmeans(model, ~ trtp*avisitn, data=resp, vcov.method=vcov(model))
diffs <- emmeans::contrast(diffs_mean,"revpairwise" , simple="trtp")
diffs <- as.data.frame(diffs)
#Calculate CI (alpha=0.05)
alpha<-0.05
z_crit <- qnorm(1 - alpha / 2)
diffs$low<-diffs$estimate - (z_crit*diffs$SE)
diffs$upp<-diffs$estimate + (z_crit*diffs$SE)
#Get OR applying exponential transformation;
or<-exp(diffs$estimate)
or_low<-exp(diffs$low)
or_upp<-exp(diffs$up)
#Get two-sided p-value
z <- diffs$estimate/diffs$SE
pvalue <- 2 * (1 - pnorm(z))
#Create a dataset with all the results
OR<-as.data.frame(cbind(diffs$avisitn, or,or_low, or_upp, z, round(pvalue, digits=4)))
colnames(OR)<-c('avisit', 'OR', 'lower.CL', 'upper.CL', 'Z', 'p-value')
OR avisit OR lower.CL upper.CL Z p-value
1 1 2.254 1.0393 4.889 2.058 0.0396
2 2 3.778 1.7132 8.333 3.294 0.0010
3 3 3.100 1.4050 6.840 2.802 0.0051
4 4 2.011 0.9435 4.288 1.809 0.0704
OUTCOME WITH MORE THAN TWO CATEGORIES
The functions used for binary outcomes above do not support outcomes with more than two categories, as these functions relay on the family function, which does not include the multinomial option. Nevertheless, the multgee package [4] provides two functions for estimating GEE models when the outcome has more than two categories:
The correlation matrix by default is “Exchangeable”, and it can be modified using the LORstr= argument in both functions.
Ordinal variable
model <- multgee::ordLORgee(formula = respord ~ trtp + avisitn + trtp*avisitn,
data = resp,
id = usubjid,
repeated = avisitn,
LORstr="time.exch",
)
summary(model)GEE FOR ORDINAL MULTINOMIAL RESPONSES
version 1.6.0 modified 2017-07-10
Link : Cumulative logit
Local Odds Ratios:
Structure: time.exch
Model: 3way
Homogenous scores: TRUE
call:
multgee::ordLORgee(formula = respord ~ trtp + avisitn + trtp *
avisitn, data = resp, id = usubjid, repeated = avisitn, LORstr = "time.exch")
Summary of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.4704 -0.3542 -0.2854 0.0003 0.6335 0.7279
Number of Iterations: 3
Coefficients:
Estimate san.se san.z Pr(>|san.z|)
beta10 -0.5472 0.2417 -2.26 0.024 *
beta20 0.6304 0.2551 2.47 0.013 *
trtpA 0.4284 0.3432 1.25 0.212
avisitn2 -0.0024 0.3668 -0.01 0.995
avisitn3 0.1156 0.3228 0.36 0.720
avisitn4 -0.1470 0.3323 -0.44 0.658
trtpA:avisitn2 -0.2717 0.5094 -0.53 0.594
trtpA:avisitn3 -0.5974 0.4619 -1.29 0.196
trtpA:avisitn4 -0.3269 0.4488 -0.73 0.466
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Local Odds Ratios Estimates:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0.000 0.000 1.000 0.999 1.000 0.999 1.000 0.999
[2,] 0.000 0.000 0.999 1.281 0.999 1.281 0.999 1.281
[3,] 1.000 0.999 0.000 0.000 1.000 0.999 1.000 0.999
[4,] 0.999 1.281 0.000 0.000 0.999 1.281 0.999 1.281
[5,] 1.000 0.999 1.000 0.999 0.000 0.000 1.000 0.999
[6,] 0.999 1.281 0.999 1.281 0.000 0.000 0.999 1.281
[7,] 1.000 0.999 1.000 0.999 1.000 0.999 0.000 0.000
[8,] 0.999 1.281 0.999 1.281 0.999 1.281 0.000 0.000
p-value of Null model: 0.69
Nominal variable
model <- multgee::nomLORgee(formula = respnom ~ trtp + avisitn + trtp*avisitn,
data = resp,
id = usubjid,
repeated = avisitn,
LORstr = "time.exch")
summary(model)GEE FOR NOMINAL MULTINOMIAL RESPONSES
version 1.6.0 modified 2017-07-10
Link : Baseline Category Logit
Local Odds Ratios:
Structure: time.exch
Model: 3way
Homogenous scores: TRUE
call:
multgee::nomLORgee(formula = respnom ~ trtp + avisitn + trtp *
avisitn, data = resp, id = usubjid, repeated = avisitn, LORstr = "time.exch")
Summary of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.444 -0.368 -0.333 0.000 0.611 0.778
Number of Iterations: 1
Coefficients:
Estimate san.se san.z Pr(>|san.z|)
beta10 0.2231 0.3354 0.67 0.51
trtpA:1 -0.6286 0.5014 -1.25 0.21
avisitn2:1 0.1823 0.4310 0.42 0.67
avisitn3:1 0.0132 0.4786 0.03 0.98
avisitn4:1 -0.0690 0.4814 -0.14 0.89
trtpA:avisitn2:1 0.1625 0.6342 0.26 0.80
trtpA:avisitn3:1 0.9518 0.6979 1.36 0.17
trtpA:avisitn4:1 0.6463 0.7006 0.92 0.36
beta20 0.2719 0.3318 0.82 0.41
trtpA:2 0.0158 0.4553 0.03 0.97
avisitn2:2 0.1800 0.4254 0.42 0.67
avisitn3:2 0.1555 0.4660 0.33 0.74
avisitn4:2 -0.2719 0.5079 -0.54 0.59
trtpA:avisitn2:2 -0.2564 0.6025 -0.43 0.67
trtpA:avisitn3:2 0.1164 0.6469 0.18 0.86
trtpA:avisitn4:2 0.1561 0.6399 0.24 0.81
Local Odds Ratios Estimates:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0.000 0.000 1.279 1.001 1.279 1.001 1.279 1.001
[2,] 0.000 0.000 1.001 1.000 1.001 1.000 1.001 1.000
[3,] 1.279 1.001 0.000 0.000 1.279 1.001 1.279 1.001
[4,] 1.001 1.000 0.000 0.000 1.001 1.000 1.001 1.000
[5,] 1.279 1.001 1.279 1.001 0.000 0.000 1.279 1.001
[6,] 1.001 1.000 1.001 1.000 0.000 0.000 1.001 1.000
[7,] 1.279 1.001 1.279 1.001 1.279 1.001 0.000 0.000
[8,] 1.001 1.000 1.001 1.000 1.001 1.000 0.000 0.000
p-value of Null model: 0.28
REFERENCES
[1] SAS Institute Inc.. SAS Help Center. The GEE procedure.
[2] Generalized Estimating Equation Package