# PFS HR=0.6
<- 0.6
hr1_pfs # median PFS of 9.4 months in the control arm
<- 9.4
med_pfs # minimum follow-up of 10 months for PFS
<- 10
minfu_pfs # Monthly exponential dropout of 0.019 for PFS
<- 0.019
do_rate_pfs # IA timing for PFS is at approximately 75% information fraction, and is derived
# using the number of events that was calculated by EAST which sets integer event counts to approximate targeted information
<- c(176/235, 1)
timing_pfs_rpact <- c(0.75, 1)
timing_pfs_gs
# power of approximately 95% for PFS, EAST reported power will be used
<- 0.9505021
power_pfs
# Enrollment period of 24 months
<- 24
enroll_dur # 1:1 randomization ratio
<- 1
rand_ratio # alpha level of 1.25%
<- 0.0125 alphal
R vs EAST Group sequential design
Introduction
In this vignette, we briefly compare sample size/power calculations for a group sequential design (GSD) for time to event endpoints between EAST and gsDesign, gsDesign2, and rpact. Note that, a comparison between rpact and gsDesign has been previously reported here.
There are two main methods that are generally used for GSD sample-size/power calculations for time to event endpoints under proportional hazard assumption:
- Lachin & Foulkes (LF) Method (1986)
- Kim & Tsiatis (KT) Method (1990)
The main difference between the two methods is that LF method requires specification of accrual duration as well as study duration, while KT method calculates study duration iteratively given accrual rates and accrual duration. In general, these two methods produce similar, but not identical results.
Both LF and KT methods are implemented in gsDesign, while KT method is implemented in EAST and rpact. gsDesign2 uses a modification of the LF method while applying an average hazard ratio (AHR) approach for non-proportional hazards (Schemper, Wakounig, and Heinze, 2009, Yung and Liu 2020). gsDesign2 also enables use of the sample size method of Yung and Liu (2020).
One additional computational difference to note for EAST vs gsDesign/gsDesign2 is the usage of different log hazard ratio variance assumptions. By default, EAST uses the variance under the null hypothesis and provides an option for using the variance under the alternative hypothesis. gsDesign, on the other hand, is using both of these variances as suggested by Lachin and Foulkes (1986). gsDesign2 has info_scale
argument in gsDesign2::gs_power_ahr()
, gsDesign2::gs_design_ahr()
, which could be set to variance under the null or alternative hypothesis or to the combination of variances.
Below we provide an example of reproducing EAST results from this vignette using gsDesign/gsDesign2/rpact. As shown in the example, gsDesign2 and rpact can reproduce EAST calculations for GSD boundaries, while gsDesign results have minor differences. gsDesign has an option under development to support a complete concordance with EAST.
Design example
We assume that a GSD is utilized for progression-free survival (PFS) endpoint. It will be tested at one interim analysis (IA) for both efficacy and non-binding futility and then at final analysis (FA). O’Brien-Fleming spending function will be used for efficacy testing and Hwang-Shih-DeCani spending function with \(\gamma = -10\) will be used for futility.
Further design assumptions are as follows:
We assume that EAST was initially used to calculate the target number of events and the total sample size, and we will use gsDesign/gsDesign2/rpact to reproduce those.
Note that, in EAST the number of target events is reported as an integer, however, gsDesign/gsDesign2/rpact by default provide non-integer values which match exactly the specified information fraction. Both gsDesign/gsDesign2 can facilitate computations using integer number of events with gsDesign::toInteger()
and gsDesign2::to_integer()
as shown below. In order to reproduce EAST results with rpact, we will use the number of events that was calculated in EAST for informationRates
argument in rpact::getDesignGroupSequential()
: 176 and 235 PFS events for IA and FA respectively (please see the timing_pfs_rpact
object in the code above).
For ease of comparison the results from EAST are summarized below:
Analysis | Value | Efficacy | Futility |
---|---|---|---|
IA1: 75% | Z | -2.6606 | -0.7379 |
N=398 | p (1-sided) | 0.0039 | 0.2303 |
Events: 176 | HR at bound | 0.6696 | 0.8947 |
Month: 25 | P(Cross) if HR=1 | 0.0039 | 0.7697 |
P(Cross) if HR=0.60 | 0.7666 | 0.0040 | |
FA | Z | -2.2798 | |
N=398 | p (1-sided) | 0.0113 | |
Events: 235 | HR at bound | 0.7427 | |
Month: 34 | P(Cross) if HR=1 | 0.0125 | |
P(Cross) if HR=0.60 | 0.9505 |
- The comparison between EAST and gsDesign/gsDesign/rpact results is presented below using absolute difference in efficacy/futility boundaries and crossing probabilities up to 4 decimals. Non-zero values are highlighted.
- Note that, in gsDesign/gsDesign Efficacy/Futility bounds refer to upper/lower bounds respectively, while in EAST these refer to the opposite directions, i.e., lower/upper bounds respectively. For the comparison purposes, we will assume that Efficacy/Futility bounds refer to upper/lower bounds respectively.
Code to reproduce EAST results
gsDesign code
- gsDesign code to reproduce the above EAST results:
library(gsDesign)
<- gsSurv(
pfs_gsDesign k = length(timing_pfs_gs),
timing = timing_pfs_gs,
R = enroll_dur,
eta = do_rate_pfs,
minfup = minfu_pfs,
T = enroll_dur + minfu_pfs,
lambdaC = log(2)/med_pfs,
hr = hr1_pfs,
beta = 1 - power_pfs,
alpha = alphal,
sfu = sfLDOF,
sfl = sfHSD,
sflpar = -10,
test.type = 4
|>
) toInteger()
|> gsBoundSummary() pfs_gsDesign
Analysis Value Efficacy Futility
IA 1: 75% Z 2.6606 0.7422
N: 400 p (1-sided) 0.0039 0.2290
Events: 176 ~HR at bound 0.6696 0.8941
Month: 25 P(Cross) if HR=1 0.0039 0.7710
P(Cross) if HR=0.6 0.7679 0.0040
Final Z 2.2798 2.2798
N: 400 p (1-sided) 0.0113 0.0113
Events: 235 ~HR at bound 0.7427 0.7427
Month: 34 P(Cross) if HR=1 0.0125 0.9875
P(Cross) if HR=0.6 0.9510 0.0490
- gsDesign vs EAST comparison using absolute differences:
Analysis | Value | Efficacy | Futility |
---|---|---|---|
IA1: 75% | Z | 0.0000 | 0.0043 |
N=398 | p (1-sided) | 0.0000 | 0.0013 |
Events: 176 | HR at bound | 0.0000 | 0.0006 |
Month: 25 | P(Cross) if HR=1 | 0.0000 | 0.0013 |
P(Cross) if HR=0.60 | 0.0013 | 0.0000 | |
FA | Z | 0.0000 | |
N=398 | p (1-sided) | 0.0000 | |
Events: 235 | HR at bound | 0.0000 | |
Month: 34 | P(Cross) if HR=1 | 0.0000 | |
P(Cross) if HR=0.60 | 0.0005 |
gsDesign2 code
- gsDesign2 code to reproduce the above EAST results appears below.
- Note that, here
gsDesign2::gs_power_ahr()
is used given the number of target events for each analysis based on EAST results.
library(gsDesign2)
library(tibble)
<- tibble(
enroll_rate stratum = "All",
duration = enroll_dur,
rate = 398/enroll_dur
)<- tibble(
fail_rate_pfs stratum = "All",
duration = Inf, #could be set to Inf when proportional hazard is assumed
fail_rate = log(2)/med_pfs,
hr = hr1_pfs,
dropout_rate = do_rate_pfs
)
<- gs_power_ahr(
pfs_gsDesign2 enroll_rate = enroll_rate,
fail_rate = fail_rate_pfs,
ratio = rand_ratio,
event = c(176, 235),
upper = gs_spending_bound,
upar = list(
sf = gsDesign::sfLDOF,
total_spend = alphal
), lower = gs_spending_bound,
lpar = list(
sf = gsDesign::sfHSD,
total_spend = 1 - power_pfs,
param = -10
),info_scale = "h0_info"
|>
) to_integer()
|>
pfs_gsDesign2 summary() |>
::as_gt() gsDesign2
Bound summary for AHR design | |||||
---|---|---|---|---|---|
AHR approximations of ~HR at bound | |||||
Bound | Z | Nominal p1 | ~HR at bound2 |
Cumulative boundary crossing probability
|
|
Alternate hypothesis | Null hypothesis | ||||
Analysis: 1 Time: 25.4 N: 398 Events: 176 AHR: 0.6 Information fraction: 0.75 | |||||
Futility | 0.74 | 0.2303 | 0.8947 | 0.0040 | 0.7697 |
Efficacy | 2.66 | 0.0039 | 0.6696 | 0.7666 | 0.0039 |
Analysis: 2 Time: 34.1 N: 398 Events: 235 AHR: 0.6 Information fraction: 1 | |||||
Futility | 2.28 | 0.0113 | 0.7427 | 0.0495 | 0.9875 |
Efficacy | 2.28 | 0.0113 | 0.7427 | 0.9505 | 0.0125 |
1 One-sided p-value for experimental vs control treatment. Value < 0.5 favors experimental, > 0.5 favors control. | |||||
2 Approximate hazard ratio to cross bound. |
- gsDesign2 vs EAST comparison using absolute differences:
Analysis | Value | Efficacy | Futility |
---|---|---|---|
IA1: 75% | Z | 0.0000 | 0.0000 |
N=398 | p (1-sided) | 0.0000 | 0.0000 |
Events: 176 | HR at bound | 0.0000 | 0.0000 |
Month: 25 | P(Cross) if HR=1 | 0.0000 | 0.0000 |
P(Cross) if HR=0.60 | 0.0000 | 0.0000 | |
FA | Z | 0.0000 | |
N=398 | p (1-sided) | 0.0000 | |
Events: 235 | HR at bound | 0.0000 | |
Month: 34 | P(Cross) if HR=1 | 0.0000 | |
P(Cross) if HR=0.60 | 0.0000 |
rpact code
- rpact code to reproduce the above EAST results appears below.
library(rpact)
<- getDesignGroupSequential(
pfs_rpact_gsd sided = 1,
alpha = alphal,
informationRates = timing_pfs_rpact,
typeOfDesign = "asOF",
beta = 1 - power_pfs,
typeBetaSpending = "bsHSD",
gammaB = -10,
bindingFutility = FALSE
)
<- getSampleSizeSurvival(
pfs_rpact design = pfs_rpact_gsd,
accrualTime = enroll_dur,
followUpTime = minfu_pfs,
lambda2 = log(2)/med_pfs,
hazardRatio = hr1_pfs,
dropoutRate1 = 0.2,
dropoutRate2 = 0.2,
dropoutTime = 12
)
kable(summary(pfs_rpact))
Sample size calculation for a survival endpoint
Sequential analysis with a maximum of 2 looks (group sequential design), one-sided overall significance level 1.25%, power 95.1%. The results were calculated for a two-sample logrank test, H0: hazard ratio = 1, H1: hazard ratio = 0.6, control lambda(2) = 0.074, accrual time = 24, accrual intensity = 16.6, follow-up time = 10, dropout rate(1) = 0.2, dropout rate(2) = 0.2, dropout time = 12.
Stage | 1 | 2 |
---|---|---|
Planned information rate | 74.9% | 100% |
Cumulative alpha spent | 0.0039 | 0.0125 |
Cumulative beta spent | 0.0040 | 0.0495 |
Stage levels (one-sided) | 0.0039 | 0.0113 |
Efficacy boundary (z-value scale) | 2.661 | 2.280 |
Futility boundary (z-value scale) | 0.738 | |
Efficacy boundary (t) | 0.670 | 0.743 |
Futility boundary (t) | 0.895 | |
Cumulative power | 0.7666 | 0.9505 |
Number of subjects | 397.9 | 397.9 |
Expected number of subjects under H1 | 397.9 | |
Cumulative number of events | 176.0 | 235.0 |
Expected number of events under H1 | 189.5 | |
Analysis time | 25.34 | 34.00 |
Expected study duration under H1 | 27.32 | |
Overall exit probability (under H0) | 0.7736 | |
Overall exit probability (under H1) | 0.7707 | |
Exit probability for efficacy (under H0) | 0.0039 | |
Exit probability for efficacy (under H1) | 0.7666 | |
Exit probability for futility (under H0) | 0.7697 | |
Exit probability for futility (under H1) | 0.0040 |
Legend:
- (t): treatment effect scale
- rpact vs EAST comparison using absolute differences:
Analysis | Value | Efficacy | Futility |
---|---|---|---|
IA1: 75% | Z | 0.0000 | 0.0000 |
N=398 | p (1-sided) | 0.0000 | 0.0000 |
Events: 176 | HR at bound | 0.0000 | 0.0000 |
Month: 25 | P(Cross) if HR=1 | 0.0000 | 0.0000 |
P(Cross) if HR=0.60 | 0.0000 | 0.0000 | |
FA | Z | 0.0000 | |
N=398 | p (1-sided) | 0.0000 | |
Events: 235 | HR at bound | 0.0000 | |
Month: 34 | P(Cross) if HR=1 | 0.0000 | |
P(Cross) if HR=0.60 | 0.0000 |
sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] rpact_4.1.0 tibble_3.2.1 gsDesign2_1.1.3 gsDesign_3.6.5
[5] flextable_0.9.7
loaded via a namespace (and not attached):
[1] gt_0.11.1 sass_0.4.9 utf8_1.2.4
[4] generics_0.1.3 tidyr_1.3.1 fontLiberation_0.1.0
[7] renv_1.0.10 xml2_1.3.6 r2rtf_1.1.1
[10] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.0
[13] grid_4.4.2 fastmap_1.2.0 jsonlite_1.8.9
[16] zip_2.3.1 purrr_1.0.2 fansi_1.0.6
[19] scales_1.3.0 fontBitstreamVera_0.1.1 textshaping_0.4.0
[22] cli_3.6.3 rlang_1.1.4 fontquiver_0.2.1
[25] munsell_0.5.1 withr_3.0.1 yaml_2.3.10
[28] gdtools_0.4.1 tools_4.4.2 officer_0.6.7
[31] uuid_1.2-1 dplyr_1.1.4 colorspace_2.1-1
[34] ggplot2_3.5.1 vctrs_0.6.5 R6_2.5.1
[37] lifecycle_1.0.4 htmlwidgets_1.6.4 ragg_1.3.3
[40] pkgconfig_2.0.3 pillar_1.9.0 gtable_0.3.5
[43] data.table_1.16.0 glue_1.8.0 Rcpp_1.0.13
[46] systemfonts_1.1.0 xfun_0.48 tidyselect_1.2.1
[49] knitr_1.48 xtable_1.8-4 htmltools_0.5.8.1
[52] rmarkdown_2.28 compiler_4.4.2 askpass_1.2.1
[55] openssl_2.2.2
References
- Lachin JM and Foulkes M. Evaluation of sample size and power for analyses of survival with allowance for nonuniform patient entry, losses to follow-up, non-compliance, and stratification. Biometrics 1986;42:507-19.
- Kim K and Tsiatis AA. Study duration for clinical trials with survival response and early stopping rule. Biometrics 1990(46): 81-92.
- Schemper M, Wakounig S and Heinze G. The estimation of average hazard ratios by weighted cox regression. Statistics in Medicine 2009; 28(19): 2473-2489.
- Yung G and Liu Y. Sample size and power for the weighted log-rank test and Kaplan-Meier based tests with allowance for nonproportional hazards. Biometrics 2020;76:939-50.