--- title: "A gentle introduction to group sequential design" output: rmarkdown::html_vignette bibliography: gsDesign.bib vignette: > %\VignetteIndexEntry{A gentle introduction to group sequential design} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include=FALSE} knitr::opts_chunk$set( collapse = FALSE, comment = "#>", dev = "svg", fig.width = 7.2916667, fig.asp = 0.618, fig.align = "center", out.width = "80%" ) options(width = 58) ``` ## Introduction This article is intended to give a gentle mathematical and statistical introduction to group sequential design. We also provide relatively simple examples from the literature to explain clinical applications. There is no programming shown, but by accessing the source for the article all required programming can be accessed; substantial commenting is provided in the source in the hope that users can understand how to implement the concepts developed here. Hopefully, the few mathematical and statistical concepts introduced will not discourage those wishing to understand some underlying concepts for group sequential design. A group sequential design enables repeated analysis of an endpoint for a clinical trial to enable possible early stopping of a trial for either a positive result, for futility, or for a safety issue. This approach can - limit exposure risk to patients and clinical trial investment past the time where known unacceptable safety risks have been established for the endpoint of interest, - limit investment in a trial where interim results suggest further evaluation for a positive efficacy finding is futile, or - accelerate the availability of a highly effective treatment by enabling early approval following an early positive finding. Examples of outcomes that might be considered include: - a continuous outcome such as change from baseline at some fixed follow-up time in the HAM-D depression score, - absolute or difference or risk ratio for a response rate (e.g., in oncology) or failure rate for a binary (yes/no) outcome, and - a hazard ratio for a time-to-event out such such as time-to-death or disease progression in an oncology trial or for time until a cardiovascular event (death, myocardial infarction or unstable angina). Examples of the above include: - a new treatment for major depression where an interim analysis of a continuous outcome stopped the trial for futility (@binneman20086), - a new treatment for patients with unstable angina undergoing balloon angioplasty with a positive interim finding for a binary outcome of death, myocardial infarction or urgent repeat intervention within 30 days (@CAPTURE), and - a new treatment for patients with lung cancer based on a positive interim finding for time-to-death (@KEYNOTE189). ## Group sequential design framework We assume - A two-arm clinical trial with a control and experimental group. - There are $k$ analyses planned for some integer $k> 1.$ - There is a natural parameter $\delta$ describing the underlying treatment difference with an estimate that has an asymptotically normal and efficient estimate $\hat\delta_j$ with variance $\sigma_j^2$ and corresponding statistical information $\mathcal{I}_j=1/\sigma_j^2$, at analysis $j=1,2,\ldots,k$. A positive value favoring experimental treatment and negative value favoring control. We assume a consistent estimate $\hat\sigma_j^2$ of $\sigma_j^2, j=1,2,\ldots,k$. - The information fraction is defined as $t_j=\mathcal{I}_i/\mathcal{I}_j$ at analysis $j=1,\ldots,k$. - Correlations between estimates at different analyses are $\text{Corr}(\hat\delta_i,\hat\delta_j)=\sqrt{\mathcal{I}_i/\mathcal{I}_j}=\sqrt{t_j}$ for $1\le i\le j\le k.$ - There is a test test $Z_j\approx\hat\delta_j/\hat{\sigma}^2_j.$ For a time-to-event outcome, $\delta$ would typically represent the logarithm of the hazard ratio for the control group versus the experimental group. For a difference in response rates, $\delta$ would represent the underlying response rates. For a continuous outcome such as the HAM-D, we would examine the difference in change from baseline at a milestone time point (e.g., at 6 weeks as in @binneman20086). For $j=1,\ldots,k$, the tests $Z_j$ are asymptotically multivariate normal with correlations as above, and for $i=1,\ldots,k$ have $\text{Cov}(Z_i,Z_j)=\text{Corr}(\hat\delta_i,\hat\delta_j)$ and $E(Z_j)=\delta\sqrt{I_j}.$ This multivariate asymptotic normal distribution for $Z_1,\ldots,Z_k$ is referred to as the *canonical form* by @JTBook who have also summarized much of the surrounding literature. ## Bounds for testing ### One-sided testing We assume that the primary test the null hypothesis $H_{0}$: $\delta=0$ against the alternative $H_{1}$: $\delta = \delta_1$ for a fixed effect size $\delta_1 > 0$ which represents a benefit of experimental treatment compared to control. We assume further that there is interest in stopping early if there is good evidence to reject one hypothesis in favor of the other. For $i=1,2,\ldots,k-1$, interim cutoffs $l_{i}< u_{i}$ are set; final cutoffs $l_{k}\leq u_{k}$ are also set. For $i=1,2,\ldots,k$, the trial is stopped at analysis $i$ to reject $H_{0}$ if $l_{j}<Z_{j}< u_{j}$, $j=1,2,\dots,i-1$ and $Z_{i}\geq u_{i}$. If the trial continues until stage $i$, $H_{0}$ is not rejected at stage $i$, and $Z_{i}\leq l_{i}$ then $H_{1}$ is rejected in favor of $H_{0}$, $i=1,2,\ldots,k$. Thus, $3k$ parameters define a group sequential design: $l_{i}$, $u_{i}$, and $\mathcal{I}_{i}$, $i=1,2,\ldots,k$. Note that if $l_{k}< u_{k}$ there is the possibility of completing the trial without rejecting $H_{0}$ or $H_{1}$. We will often restrict $l_{k}= u_{k}$ so that one hypothesis is rejected. We begin with a one-sided test. In this case there is no interest in stopping early for a lower bound and thus $l_i= -\infty$, $i=1,2,\ldots,k$. The probability of first crossing an upper bound at analysis $i$, $i=1,2,\ldots,k$, is $$\alpha_{i}^{+}(\delta)=P_{\delta}\{\{Z_{i}\geq u_{i}\}\cap_{j=1}^{i-1} \{Z_{j}< u_{j}\}\}$$ The Type I error is the probability of ever crossing the upper bound when $\delta=0$. The value $\alpha^+_{i}(0)$ is commonly referred to as the amount of Type I error spent at analysis $i$, $1\leq i\leq k$. The total upper boundary crossing probability for a trial is denoted in this one-sided scenario by $$\alpha^+(\delta) \equiv \sum_{i=1}^{k}\alpha^+_{i}(\delta)$$ and the total Type I error by $\alpha^+(0)$. Assuming $\alpha^+(0)=\alpha$ the design will be said to provide a one-sided group sequential test at level $\alpha$. ### Asymmetric two-sided testing {#binding} With both lower and upper bounds for testing and any real value $\delta$ representing treatment effect we denote the probability of crossing the upper boundary at analysis $i$ without previously crossing a bound by $$\alpha_{i}(\delta)=P_{\delta}\{\{Z_{i}\geq u_{i}\}\cap_{j=1}^{i-1} \{ l_{j}<Z_{j}< u_{j}\}\},$$ $i=1,2,\ldots,k.$ The total probability of crossing an upper bound prior to crossing a lower bound is denoted by $$\alpha(\delta)\equiv\sum_{i=1}^{k}\alpha_{i}(\delta).$$ Next, we consider analogous notation for the lower bound. For $i=1,2,\ldots,k$ denote the probability of crossing a lower bound at analysis $i$ without previously crossing any bound by $$\beta_{i}(\delta)=P_{\delta}\{\{Z_{i}\leq l_{i}\}\cap_{j=1}^{i-1}\{ l_{j} <Z_{j}< u_{j}\}\}.$$ The total lower boundary crossing probability in this case is written as $$\beta(\delta)= {\sum\limits_{i=1}^{k}} \beta_{i}(\delta).$$ When a design has final bounds equal ($l_k=u_k$), $\beta(\delta_1)$ is the Type II error which is equal to 1 minus the power of the design. In this case, $\beta_i(\delta)$ is referred to as the $\beta$-spending at analysis $i, i=1,\ldots,k$. ## Spending function design Type I error is most often defined with $\alpha_i^+(0), i=1,\ldots,k$. This is referred to as non-binding Type I error since any lower bound is ignored in the calculation. This means that if a trial is continued in spite of a lower bound being crossed at an interim analysis that Type I error is still controlled at the design $\alpha$-level. For Phase III trials used for approvals of new treatments, non-binding Type I error calculation is generally expected by regulators. For any given $0<\alpha<1$ we define a non-decreasing $\alpha$-spending function $f(t; \alpha)$ for $t\geq 0$ with $\alpha\left( 0\right) =0$ and for $t\geq 1$, $f( t; \alpha) =\alpha$. Letting $t_0=0$, we set $\alpha_j(0)$ for $j=1,\ldots,k$ through the equation $$\alpha^+_{j}(0) = f(t_j;\alpha)-f(t_{j-1}; \alpha).$$ Assuming an asymmetric lower bound, we similarly use a $\beta$-spending function and to set $\beta$-spending at analysis $j=1,\ldots, k$ as: $$\beta_{j}(\delta_1) = g(t_j;\delta_1, \beta) - g(t_{j-1};\delta_1, \beta).$$ In the following example, the function $\Phi()$ represents the cumulative distribution function for the standard normal distribution function (i.e., mean 0, standard deviation 1). The major depression study of @binneman20086 considered above used the @LanDeMets spending function approximating an O'Brien-Fleming bound for a single interim analysis half way through the trial with $$f(t; \alpha) = 2\left( 1-\Phi\left( \frac{\Phi ^{-1}(\alpha/2)}{\sqrt{t}}\right) \right).$$ $$g(t; \beta) = 2\left( 1-\Phi\left( \frac{\Phi ^{-1}(\beta/2)}{\sqrt{t}}\right) \right).$$ ```{r, message=FALSE, warning=FALSE} library(gsDesign) ``` ```{r} delta1 <- 3 # Treatment effect, alternate hypothesis delta0 <- 0 # Treatment effect, null hypothesis ratio <- 1 # Randomization ratio (experimental / control) sd <- 7.5 # Standard deviation for change in HAM-D score alpha <- 0.1 # 1-sided Type I error beta <- 0.17 # Targeted Type II error (1 - targeted power) k <- 2 # Number of planned analyses test.type <- 4 # Asymmetric bound design with non-binding futility bound timing <- .5 # information fraction at interim analyses sfu <- sfLDOF # O'Brien-Fleming spending function for alpha-spending sfupar <- 0 # Parameter for upper spending function sfl <- sfLDOF # O'Brien-Fleming spending function for beta-spending sflpar <- 0 # Parameter for lower spending function delta <- 0 endpoint <- "normal" ``` ```{r} # Derive normal fixed design sample size n <- nNormal( delta1 = delta1, delta0 = delta0, ratio = ratio, sd = sd, alpha = alpha, beta = beta ) ``` ```{r} # Derive group sequential design based on parameters above x <- gsDesign( k = k, test.type = test.type, alpha = alpha, beta = beta, timing = timing, sfu = sfu, sfupar = sfupar, sfl = sfl, sflpar = sflpar, delta = delta, # Not used since n.fix is provided delta1 = delta1, delta0 = delta0, endpoint = "normal", n.fix = n ) # Convert sample size at each analysis to integer values x <- toInteger(x) ``` The planned design used $\alpha=0.1$, one-sided and Type II error 17% (83% power) with an interim analysis at 50% of the final planned observations. This leads to Type I $\alpha$-spending of `r round(sfLDOF(alpha = 0.1, t = 0.5)$spend, 3)` and $\beta$-spending of `r round(sfLDOF(alpha = 0.17, t = 0.5)$spend, 3)` at the planned interim. An advantage of the spending function approach is that bounds can be adjusted when the number of observations at analyses are different than planned. The actual observations for experimental versus control at the analysis were 59 as opposed to the planned 67, which resulted in interim spending fraction $t_1=$ `r round(59 / x$n.I[2], 4)`. With the Lan-DeMets spending function to approximate O'Brien-Fleming bounds this results in $\alpha$-spending of `r round(sfLDOF(alpha = alpha, t = 59 / x$n.I[2])$spend, 4)` (`P(Cross) if delta=0` row in Efficacy column) and $\beta$-spending of `r round(sfLDOF(alpha = beta, t = 59 / x$n.I[2])$spend, 4)` (`P(Cross) if delta=3` row in Futility column). We note that the Z-value and 1-sided p-values in the table below correspond exactly and either can be used for evaluation of statistical significance for a trial result. The rows labeled `~delta at bound` are approximations that describe approximately what treatment difference is required to cross a bound; these should not be used for a formal evaluation of whether a bound has been crossed. The O'Brien-Fleming spending function is generally felt to provide conservative bounds for stopping at interim analysis. Most of the error spending is reserved for the final analysis in this example. The futility bound only required a small trend in the wrong direction to stop the trial; a nominal p-value of 0.77 was observed which crossed the futility bound, stopping the trial since this was greater than the futility p-value bound of 0.59. Finally, we note that at the final analysis, the cumulative probability for `P(Cross) if delta=0` is less than the planned $\alpha=0.10$. This probability represents $\alpha(0)$ which excludes the probability of crossing the lower bound at the interim analysis and the final analysis. The value of the non-binding Type I error is still $\alpha^+(0) = 0.10$. ```{r} # Updated alpha is unchanged alphau <- 0.1 # Updated sample size at each analysis n.I <- c(59, 134) # Updated number of analyses ku <- length(n.I) # Information fraction is used for spending usTime <- n.I / x$n.I[x$k] lsTime <- usTime ``` ```{r} # Update design based on actual interim sample size and planned final sample size xu <- gsDesign( k = ku, test.type = test.type, alpha = alphau, beta = x$beta, sfu = sfu, sfupar = sfupar, sfl = sfl, sflpar = sflpar, n.I = n.I, maxn.IPlan = x$n.I[x$k], delta = x$delta, delta1 = x$delta1, delta0 = x$delta0, endpoint = endpoint, n.fix = n, usTime = usTime, lsTime = lsTime ) ``` ```{r} # Summarize bounds gsBoundSummary(xu, Nname = "N", digits = 4, ddigits = 2, tdigits = 1) ``` ## References