R vs EAST Group sequential design

Introduction

In this vignette, we briefly compare sample size/power calculations for a group sequential design (GSD) for time to event endpoints between EAST and gsDesign, gsDesign2, and rpact. Note that, a comparison between rpact and gsDesign has been previously reported here.

There are two main methods that are generally used for GSD sample-size/power calculations for time to event endpoints under proportional hazard assumption:

Lachin & Foulkes (LF) Method (1986)
Kim & Tsiatis (KT) Method (1990)

The main difference between the two methods is that LF method requires specification of accrual duration as well as study duration, while KT method calculates study duration iteratively given accrual rates and accrual duration. In general, these two methods produce similar, but not identical results.

Both LF and KT methods are implemented in gsDesign, while KT method is implemented in EAST and rpact. gsDesign2 uses a modification of the LF method while applying an average hazard ratio (AHR) approach for non-proportional hazards (Schemper, Wakounig, and Heinze, 2009, Yung and Liu 2020). gsDesign2 also enables use of the sample size method of Yung and Liu (2020).

One additional computational difference to note for EAST vs gsDesign/gsDesign2 is the usage of different log hazard ratio variance assumptions. By default, EAST uses the variance under the null hypothesis and provides an option for using the variance under the alternative hypothesis. gsDesign, on the other hand, is using both of these variances as suggested by Lachin and Foulkes (1986). gsDesign2 has info_scale argument in gsDesign2::gs_power_ahr(), gsDesign2::gs_design_ahr(), which could be set to variance under the null or alternative hypothesis or to the combination of variances.

Below we provide an example of reproducing EAST results from this vignette using gsDesign/gsDesign2/rpact. As shown in the example, gsDesign2 and rpact can reproduce EAST calculations for GSD boundaries, while gsDesign results have minor differences. gsDesign has an option under development to support a complete concordance with EAST.

Design example

We assume that a GSD is utilized for progression-free survival (PFS) endpoint. It will be tested at one interim analysis (IA) for both efficacy and non-binding futility and then at final analysis (FA). O’Brien-Fleming spending function will be used for efficacy testing and Hwang-Shih-DeCani spending function with \(\gamma = -10\) will be used for futility.

Further design assumptions are as follows:

# PFS HR=0.6
hr1_pfs <- 0.6
# median PFS of 9.4 months in the control arm
med_pfs <- 9.4
# minimum follow-up of 10 months for PFS
minfu_pfs <- 10
# Monthly exponential dropout of 0.019  for PFS
do_rate_pfs <- 0.019
# IA timing for PFS is at approximately 75% information fraction, and is derived 
# using the number of events that was calculated by EAST which sets integer event counts to approximate targeted information
timing_pfs_rpact <- c(176/235, 1)
timing_pfs_gs <- c(0.75, 1)


# power of approximately 95% for PFS, EAST reported power will be used
power_pfs <- 0.9505021

# Enrollment period of 24 months
enroll_dur <- 24
# 1:1 randomization ratio
rand_ratio <- 1
# alpha level of 1.25%
alphal <- 0.0125

We assume that EAST was initially used to calculate the target number of events and the total sample size, and we will use gsDesign/gsDesign2/rpact to reproduce those.

Note that, in EAST the number of target events is reported as an integer, however, gsDesign/gsDesign2/rpact by default provide non-integer values which match exactly the specified information fraction. Both gsDesign/gsDesign2 can facilitate computations using integer number of events with gsDesign::toInteger() and gsDesign2::to_integer() as shown below. In order to reproduce EAST results with rpact, we will use the number of events that was calculated in EAST for informationRates argument in rpact::getDesignGroupSequential(): 176 and 235 PFS events for IA and FA respectively (please see the timing_pfs_rpact object in the code above).

For ease of comparison the results from EAST are summarized below:

Analysis	Value	Efficacy	Futility
IA1: 75%	Z	-2.6606	-0.7379
N=398	p (1-sided)	0.0039	0.2303
Events: 176	HR at bound	0.6696	0.8947
Month: 25	P(Cross) if HR=1	0.0039	0.7697
	P(Cross) if HR=0.60	0.7666	0.0040
FA	Z	-2.2798
N=398	p (1-sided)	0.0113
Events: 235	HR at bound	0.7427
Month: 34	P(Cross) if HR=1	0.0125
	P(Cross) if HR=0.60	0.9505

The comparison between EAST and gsDesign/gsDesign/rpact results is presented below using absolute difference in efficacy/futility boundaries and crossing probabilities up to 4 decimals. Non-zero values are highlighted.
Note that, in gsDesign/gsDesign Efficacy/Futility bounds refer to upper/lower bounds respectively, while in EAST these refer to the opposite directions, i.e., lower/upper bounds respectively. For the comparison purposes, we will assume that Efficacy/Futility bounds refer to upper/lower bounds respectively.

Code to reproduce EAST results

gsDesign code

gsDesign code to reproduce the above EAST results:

library(gsDesign)
pfs_gsDesign <- gsSurv(
  k = length(timing_pfs_gs),
  timing = timing_pfs_gs, 
  R = enroll_dur, 
  eta =  do_rate_pfs, 
  minfup = minfu_pfs,
  T = enroll_dur + minfu_pfs, 
  lambdaC = log(2)/med_pfs,
  hr = hr1_pfs,
  beta = 1 - power_pfs,
  alpha = alphal,
  sfu = sfLDOF,
  sfl = sfHSD, 
  sflpar = -10,
  test.type = 4
  ) |> 
  toInteger()
  

pfs_gsDesign |> gsBoundSummary()

    Analysis              Value Efficacy Futility
   IA 1: 75%                  Z   2.6606   0.7422
      N: 400        p (1-sided)   0.0039   0.2290
 Events: 176       ~HR at bound   0.6696   0.8941
   Month: 25   P(Cross) if HR=1   0.0039   0.7710
             P(Cross) if HR=0.6   0.7679   0.0040
       Final                  Z   2.2798   2.2798
      N: 400        p (1-sided)   0.0113   0.0113
 Events: 235       ~HR at bound   0.7427   0.7427
   Month: 34   P(Cross) if HR=1   0.0125   0.9875
             P(Cross) if HR=0.6   0.9510   0.0490

gsDesign vs EAST comparison using absolute differences:

Warning in raw_block(x, "html", ...): raw_block() requires Pandoc >= 2.0.0

Analysis	Value	Efficacy	Futility
IA1: 75%	Z	0.0000	0.0043
N=398	p (1-sided)	0.0000	0.0013
Events: 176	HR at bound	0.0000	0.0006
Month: 25	P(Cross) if HR=1	0.0000	0.0013
	P(Cross) if HR=0.60	0.0013	0.0000
FA	Z	0.0000
N=398	p (1-sided)	0.0000
Events: 235	HR at bound	0.0000
Month: 34	P(Cross) if HR=1	0.0000
	P(Cross) if HR=0.60	0.0005

gsDesign2 code

gsDesign2 code to reproduce the above EAST results appears below.
Note that, here gsDesign2::gs_power_ahr() is used given the number of target events for each analysis based on EAST results.

library(gsDesign2)
library(tibble)
enroll_rate <- tibble(
  stratum = "All", 
  duration = enroll_dur,
  rate = 398/enroll_dur
)
fail_rate_pfs <- tibble(
  stratum = "All", 
  duration = Inf, #could be set to Inf when proportional hazard is assumed
  fail_rate = log(2)/med_pfs,
  hr = hr1_pfs, 
  dropout_rate = do_rate_pfs
  )

pfs_gsDesign2 <- gs_power_ahr(
  enroll_rate = enroll_rate,
  fail_rate = fail_rate_pfs,
  ratio = rand_ratio, 
  event = c(176, 235),
  upper = gs_spending_bound, 
  upar = list(
    sf = gsDesign::sfLDOF, 
    total_spend = alphal
    ), 
  lower = gs_spending_bound,
  lpar = list(
    sf = gsDesign::sfHSD, 
    total_spend = 1 - power_pfs, 
    param = -10
    ),
  info_scale = "h0_info"
  ) |> 
  to_integer()

pfs_gsDesign2 |>
  summary() |>
  gsDesign2::as_gt()

Bound summary for AHR design
AHR approximations of ~HR at bound
Bound	Z	Nominal p¹	~HR at bound²	Cumulative boundary crossing probability
Bound	Z	Nominal p¹	~HR at bound²	Alternate hypothesis	Null hypothesis
Analysis: 1 Time: 25.4 N: 398 Events: 176 AHR: 0.6 Information fraction: 0.75
Futility	0.74	0.2303	0.8947	0.0040	0.7697
Efficacy	2.66	0.0039	0.6696	0.7666	0.0039
Analysis: 2 Time: 34.1 N: 398 Events: 235 AHR: 0.6 Information fraction: 1
Futility	2.28	0.0113	0.7427	0.0495	0.9875
Efficacy	2.28	0.0113	0.7427	0.9505	0.0125
¹ One-sided p-value for experimental vs control treatment. Value < 0.5 favors experimental, > 0.5 favors control.
² Approximate hazard ratio to cross bound.

gsDesign2 vs EAST comparison using absolute differences:

Warning in raw_block(x, "html", ...): raw_block() requires Pandoc >= 2.0.0

Analysis	Value	Futility
IA1: 75%	Z	0.0000
N=398	p (1-sided)	0.0000
Events: 176	HR at bound	0.0000
Month: 25	P(Cross) if HR=1	0.0000
	P(Cross) if HR=0.60	0.0000
FA	Z
N=398	p (1-sided)
Events: 235	HR at bound
Month: 34	P(Cross) if HR=1
	P(Cross) if HR=0.60

rpact code

rpact code to reproduce the above EAST results appears below.

library(rpact)
pfs_rpact_gsd <- getDesignGroupSequential(
  sided = 1, 
  alpha = alphal,
  informationRates = timing_pfs_rpact,
  typeOfDesign = "asOF",
  beta = 1 - power_pfs,
  typeBetaSpending = "bsHSD",
  gammaB = -10,
  bindingFutility = FALSE
  )

pfs_rpact <- getSampleSizeSurvival(
  design = pfs_rpact_gsd,
  accrualTime = enroll_dur,
  followUpTime = minfu_pfs,
  lambda2 = log(2)/med_pfs,
  hazardRatio = hr1_pfs,
  dropoutRate1 = 0.2, 
  dropoutRate2 = 0.2, 
  dropoutTime = 12
)

kable(summary(pfs_rpact))

Sample size calculation for a survival endpoint

Sequential analysis with a maximum of 2 looks (group sequential design), one-sided overall significance level 1.25%, power 95.1%. The results were calculated for a two-sample logrank test, H0: hazard ratio = 1, H1: hazard ratio = 0.6, control lambda(2) = 0.074, accrual time = 24, accrual intensity = 16.6, follow-up time = 10, dropout rate(1) = 0.2, dropout rate(2) = 0.2, dropout time = 12.

Stage	1	2
Planned information rate	74.9%	100%
Cumulative alpha spent	0.0039	0.0125
Cumulative beta spent	0.0040	0.0495
Stage levels (one-sided)	0.0039	0.0113
Efficacy boundary (z-value scale)	2.661	2.280
Futility boundary (z-value scale)	0.738
Efficacy boundary (t)	0.670	0.743
Futility boundary (t)	0.895
Cumulative power	0.7666	0.9505
Number of subjects	397.9	397.9
Expected number of subjects under H1		397.9
Cumulative number of events	176.0	235.0
Expected number of events under H1	189.5
Analysis time	25.34	34.00
Expected study duration under H1		27.32
Overall exit probability (under H0)	0.7736
Overall exit probability (under H1)	0.7707
Exit probability for efficacy (under H0)	0.0039
Exit probability for efficacy (under H1)	0.7666
Exit probability for futility (under H0)	0.7697
Exit probability for futility (under H1)	0.0040

Legend:

(t): treatment effect scale

rpact vs EAST comparison using absolute differences:

Warning in raw_block(x, "html", ...): raw_block() requires Pandoc >= 2.0.0

Analysis	Value	Futility
IA1: 75%	Z	0.0000
N=398	p (1-sided)	0.0000
Events: 176	HR at bound	0.0000
Month: 25	P(Cross) if HR=1	0.0000
	P(Cross) if HR=0.60	0.0000
FA	Z
N=398	p (1-sided)
Events: 235	HR at bound
Month: 34	P(Cross) if HR=1
	P(Cross) if HR=0.60

sessionInfo()

R version 4.4.3 (2025-02-28)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] rpact_4.1.0     tibble_3.2.1    gsDesign2_1.1.3 gsDesign_3.6.5 
[5] flextable_0.9.7

loaded via a namespace (and not attached):
 [1] gt_0.11.1               sass_0.4.9              utf8_1.2.4             
 [4] generics_0.1.3          tidyr_1.3.1             fontLiberation_0.1.0   
 [7] renv_1.0.10             xml2_1.3.6              r2rtf_1.1.1            
[10] digest_0.6.37           magrittr_2.0.3          evaluate_1.0.0         
[13] grid_4.4.3              fastmap_1.2.0           jsonlite_1.8.9         
[16] zip_2.3.1               purrr_1.0.2             fansi_1.0.6            
[19] scales_1.3.0            fontBitstreamVera_0.1.1 textshaping_0.4.0      
[22] cli_3.6.3               rlang_1.1.4             fontquiver_0.2.1       
[25] munsell_0.5.1           withr_3.0.1             yaml_2.3.10            
[28] gdtools_0.4.1           tools_4.4.3             officer_0.6.7          
[31] uuid_1.2-1              dplyr_1.1.4             colorspace_2.1-1       
[34] ggplot2_3.5.1           vctrs_0.6.5             R6_2.5.1               
[37] lifecycle_1.0.4         htmlwidgets_1.6.4       ragg_1.3.3             
[40] pkgconfig_2.0.3         pillar_1.9.0            gtable_0.3.5           
[43] data.table_1.16.0       glue_1.8.0              Rcpp_1.0.13            
[46] systemfonts_1.1.0       xfun_0.48               tidyselect_1.2.1       
[49] knitr_1.48              xtable_1.8-4            htmltools_0.5.8.1      
[52] rmarkdown_2.28          compiler_4.4.3          askpass_1.2.1          
[55] openssl_2.2.2

References

Lachin JM and Foulkes M. Evaluation of sample size and power for analyses of survival with allowance for nonuniform patient entry, losses to follow-up, non-compliance, and stratification. Biometrics 1986;42:507-19.
Kim K and Tsiatis AA. Study duration for clinical trials with survival response and early stopping rule. Biometrics 1990(46): 81-92.
Schemper M, Wakounig S and Heinze G. The estimation of average hazard ratios by weighted cox regression. Statistics in Medicine 2009; 28(19): 2473-2489.
Yung G and Liu Y. Sample size and power for the weighted log-rank test and Kaplan-Meier based tests with allowance for nonproportional hazards. Biometrics 2020;76:939-50.