\documentclass{article}
\usepackage[group-separator={,}, group-minimum-digits={3}]{siunitx}
\usepackage{amsmath,amsthm,amssymb}
\usepackage{booktabs}
\usepackage{hyperref}
\usepackage{graphics}
\usepackage{indentfirst}
\usepackage[utf8]{inputenc}
\usepackage[top=1in, bottom=1in, left=1in, right=1in]{geometry}
\usepackage{setspace}
\usepackage[nomarkers,figuresonly]{endfloat}
%\usepackage{float}
%\newfloat{rcode}{h!t}{rcode}
%\floatname{rcode}{Code Block}
\renewcommand{\P}{{\mathbb{P}}}
\newcommand{\E}{{\mathbb{E}}}
\newcommand{\mrm}[1]{{\mathrm{#1}}}
%%\newcommand{\verblistelt}[2]{\texttt{{#1}\${#2}}}
\renewcommand{\ni}{{\vskip0.1truein \noindent}}
\DeclareMathOperator{\var}{var}
\doublespacing
% \VignetteIndexEntry{Using pwrFDR}
\begin{document}
<<label=header,include=FALSE,echo=FALSE>>=
options(keep.source = TRUE, width = 90, digits=4)
pwrFDRpd <- packageDescription("pwrFDR")
pct <- function(x, dig=2) format(round(10^dig*100*x)/10^dig) %,% "\\\\%"
one.in.k <- function(x, out="c")
{
  nmch <- c("two","three","four","five","six","seven","eight","nine","ten")
  kk <- 2:10
  y <- kk*x
  err <- abs(y-round(y))
  idx <- which(err==min(err))
  rslt <- kk[idx]
  if(out=="c") rslt <- nmch[idx]
  rslt
}
@
<<label=seqtstdat,include=FALSE,echo=FALSE>>=
SeqTstDat10 <- as.data.frame(t(matrix(
  c(1,3.93066049143225,0.000166075536538912,0.005,1,
    1,2.74360567522164,0.00733385249064922,0.01,1,
    1,2.41434752771541,0.0177875327883759,0.015,0,
    1,2.38482062693501,0.0191857939241782,0.02,1,
    1,2.23237296410553,0.028071740070398,0.025,0,
    0,-1.90361989911666,0.0601552703570536,0.03,0,
    0,1.12448792534619,0.263796263653014,0.035,0,
    0,-1.00660629048567,0.316822712495861,0.04,0,
    0,0.932733736292402,0.353453014154629,0.045,0,
    0,0.90140806501516,0.369777327548002,0.05,0),
  5, 10)))
names(SeqTstDat10) <- c("xi", "X", "P", "BHFDR.crit", "Marked")
@
\title{Using pwrFDR in design and analysis of a multiple testing experiment (Version \Sexpr{pwrFDRpd$Version})}
\author{Grant Izmirlian}
\date{\Sexpr{Sys.Date()}}
\maketitle
\section{Introduction}
The package \texttt{pwrFDR} allows computation of the sample size required for average power or for TPX power under various sequential multiple testing procedures, such as the Benjamini-Hochberg false discovery rate (BH-FDR) procedure. Before we begin, we first load some libraries and then provide a brief review of multiple testing and sequential procedures.
<<label=load-libs, echo=TRUE, results=hide>>=
library(pwrFDR)
library(ggplot2)
library(TableMonster)
@
\ni In addition to the \texttt{pwrFDR} library, we load \texttt{ggplot2}, an advanced plotting package many readers will be familiar with, and \texttt{TableMonster}, an easy-to-use frontend to \texttt{xtable} for generating publication quality tables in \LaTeX.

\ni In short, statistical hypothesis testing, whereby a single p-value is compared to a threshold value, $\alpha$, to determine statistical significance, ensures that the resulting conclusion has false positive rate no larger than $\alpha$. This guarantee applies in the context of a single statistical hypothesis test only. Adjustment for multiple tests of hypotheses provides an algorithm whereby the comparison thresholds are adjusted so that some aggregate false positive rate is guaranteed. We will review several multiple testing procedures and discuss the aggregate false positive rate, or target of \textit{protected inference}, which each one controls. Consider a multiple testing experiment with $m$ simultaneous tests of hypotheses.
The most widely used multiple testing procedure is Bonferroni's \cite{BonferroniCE:1936}, which guarantees control of the family-wise error rate (FWER). This is the probability that one or more of the hypothesis tests results in a false positive. It is applied by referring all p-values to the common threshold $\alpha/m$. This quickly becomes overly conservative as the number of tests, $m$, grows. The Benjamini-Hochberg \cite{BenjaminiY:1995} procedure guarantees control of the false discovery rate (FDR). This is the expected proportion of hypothesis tests declared significant which are false positives. It is applied by sorting the p-values in increasing order, comparing the $i^{\mrm{th}}$ smallest to $\alpha i/m$, and then declaring statistically significant all tests whose p-values are no larger than the largest p-value falling below its threshold. Thus the Bonferroni procedure and the BH-FDR procedure control different targets of protected inference. The domain of application, and in particular the cost of a false positive, guides the choice of the target of protected inference, with higher costs (drug development) requiring a more conservative target of control, such as the FWER, and lower costs (thresholding in --omics studies) allowing for a less conservative target of control, such as the FDR.

\ni We now discuss sequential multiple testing procedures in general. The application of a sequential procedure in a multiple testing experiment usually begins with ordering the $m$ p-values from smallest to largest and then comparing each sorted p-value with a corresponding member of a sequence of criterion values. This sequence of criterion values, also non-decreasing and specific to the particular procedure, is the product of $\alpha$ and a sequence of multiple testing penalties, $\psi_m(j)$. All procedures begin by marking the rows for which the sorted p-value is less than or equal to its corresponding criterion value.

\ni Sequential procedures are defined by two distinguishing features which, in turn, provide a recipe for their application. First is the chosen sequence of multiple testing penalties, and second, whether the procedure is step-up or step-down. This latter distinction provides a recipe for calling tests significant based upon the marked/unmarked rows of p-value and criterion pairs. A step-up procedure calls significant all tests up to and including the last marked row. A step-down procedure calls significant the tests belonging to the block of contiguous marked rows beginning with the first row. If the first row is not marked, a step-down procedure calls nothing significant. Table \ref{tbl:SeqTstDat10} below shows p-values for 10 simultaneous tests of hypotheses, together with the sequence of threshold criterion values, $\alpha i/m$, with $\alpha=0.05$. This sequence of threshold criterion values should be familiar, as it is the one used in the BH-FDR procedure. Also shown in the table is an indicator of whether or not each p-value is less than or equal to its corresponding threshold value. A step-up procedure based upon the given criterion sequence will call significant the tests corresponding to the four smallest p-values. Notice that this includes the third smallest p-value, which was not smaller than its threshold value. A step-down procedure based upon the given criterion sequence would call only the first two tests significant.
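\ni To make the step-up/step-down distinction concrete, the following short sketch (our own illustration, not part of \texttt{pwrFDR}; the helper names are invented here) counts the number of significant calls under each rule when applied to the marked rows of Table \ref{tbl:SeqTstDat10}:
<<label=step-rules-sketch, echo=TRUE>>=
## Illustrative helpers (not part of pwrFDR). `marked' is the logical
## vector indicating sorted p-value <= criterion value.
step.up.ncalls <- function(marked)    # everything up to the last marked row
    if(any(marked)) max(which(marked)) else 0
step.down.ncalls <- function(marked)  # contiguous marked block from the top
    sum(cumprod(marked))
with(SeqTstDat10, c(step.up   = step.up.ncalls(P <= BHFDR.crit),
                    step.down = step.down.ncalls(P <= BHFDR.crit)))
@
\ni This returns 4 calls under the step-up rule and 2 under the step-down rule, as described above.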
<<label=show-SeqTstDat10, echo=FALSE, results=tex>>=
basic.tmPrint(SeqTstDat10, lbl="tbl:SeqTstDat10",
              cptn="Sorted p-values, their test statistics, " %,%
                   "population indicators, and BH-FDR threshold in 10 simultaneous tests")
@
\ni We now discuss the number of significant calls and, of these, which are true positives and which are false positives. Let $R$ denote the number of tests called significant by the procedure. As mentioned above in our example, Table \ref{tbl:SeqTstDat10}, $R=4$ under the step-up procedure and $R=2$ under the step-down procedure. This count partitions into the unobserved false positive count, $V$, i.e. the number of tests called significant which are distributed as the null, and the unobserved true positive count, $T$, i.e. the number of tests called significant which are distributed as the alternative, so that $V + T = R$. We see that $V=0$ under both the step-up and step-down procedures, while $T=4$ under the step-up procedure and $T=2$ under the step-down procedure. The ratio $\mrm{FDP}=V/R$ is called the false discovery proportion and the ratio $\mrm{TPP}=T/M$ is called the true positive proportion. Here $M$ is the number of statistics distributed as the alternative (more on this below). We see that $\mrm{FDP}=0$ under both the step-up and step-down procedures, while $\mrm{TPP}=4/5$ under the step-up procedure and $\mrm{TPP}=2/5$ under the step-down procedure. We note in passing that the BH-FDR procedure is the step-up procedure based upon the given criterion sequence.

\ni Within the fairly broad scope of sequential procedures considered here, the goal of protected multiple inference will be to control some summary of the false discovery proportion distribution, $\P\{\mrm{FDP}>x\}=\P\{V/R >x\}$. Protected inference must be done within the context of some definition of multiple test, or aggregate, power so that multiple testing experiments can be sized and so that we have some idea of the probability of success as defined appropriately for the application. We will consider definitions of aggregate power based upon some summary of the true positive proportion distribution, $\P\{\mrm{TPP} > x \}=\P\{ T/M > x \}$.

\ni As previously noted, the BH-FDR procedure is a step-up procedure with multiple testing penalty sequence $\psi_m(j) = j/m$. It guarantees control of the FDR, which is the expected FDP:
\begin{equation}
  \mrm{FDR}=\E[\mrm{FDP}]=\E[V/R]\nonumber
\end{equation}
\ni The type of aggregate power usually used in conjunction with the BH-FDR procedure is the average power. It is the expected TPP:
\begin{equation}
  \mrm{AvgPwr}=\E[\mrm{TPP}]=\E[T/M]\nonumber
\end{equation}
\ni Let's begin by computing the sample size required for 80\% average power under the BH-FDR procedure at $\mrm{FDR}=15\%$ when the effect size is 0.79. Besides the usual required power, type I error, and effect size, which are sufficient to calculate the sample size in the single testing case, there is one more parameter required when calculating sample size for multiple test power. Whereas in the single test case we condition upon the statistic being drawn from the null or the alternative, in the multiple testing case we must somehow make a specification regarding the number of tests distributed as the alternative. Our methodology assumes a common mixture distribution for the p-value CDF. This means that each test statistic is distributed according to the alternative hypothesis with probability $r_1$, and according to the null hypothesis with the complementary probability.
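\ni The mixture model is easy to simulate directly. The following sketch (again our own illustration; for simplicity we use normally distributed test statistics with a one-sided alternative, and the parameter values are arbitrary) draws $m$ statistics from the two-component mixture:
<<label=mixture-sketch, echo=TRUE>>=
## Two-component mixture model for the test statistics (illustration only)
set.seed(42)
m <- 1000; r.1 <- 0.05; theta <- 2.5       # illustrative parameter values
xi <- rbinom(m, 1, r.1)                    # 1 = distributed as the alternative
X  <- rnorm(m, mean = xi*theta)            # N(0,1) nulls, N(theta,1) alternatives
P  <- 1 - pnorm(X)                         # one-sided p-values
## Each p-value then has the mixture CDF
## G(t) = (1-r.1)*t + r.1*pnorm(theta - qnorm(1-t))
@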
This mixture proportion, $r_1$, is the additional parameter which must be specified. In applications, a reasonable working value is elicited from subject-matter experts. Let us assume this is 5\%, a value typical of larger --omics studies like mRNA profiling and RNAseq (\cite{AlizadehAA:2000, MortazaviA:2008}). In the call below, the remaining argument, \texttt{FDP.control.method}, is left unspecified and takes its default value, \texttt{"BHFDR"}, as we here desire.
<<label=vinfo, echo=FALSE, results=hide>>=
rhome <- Sys.getenv("R_HOME")
sep <- c("\\\\", "/")[1+(substring(rhome, 1, 1)=="/")]
rscript.path <- rhome %,% sep %,% "site-library" %,% sep %,% "pwrFDR" %,% sep %,%
                "doc" %,% sep %,% "pwrFDR-vignette.R"
@
\ni You can use this vignette file to follow along or, if you prefer, open the companion script file (all supporting text removed) at \verb!\Sexpr{rscript.path}!.

\ni We are now ready to call \texttt{pwrFDR} to calculate the sample size required for 80\% average power under the BH-FDR procedure at $\alpha=0.15$ and the above-mentioned effect size and prior probability:
%\begin{rcode}
<<label=avgpwr_minf>>=
ss.fdr.r05 <- pwrFDR(effect.size=0.79, alpha=0.15, r.1=0.05, average.power=0.80)
@
%\end{rcode}
\ni Notice that we did not specify the number of tests. The calculation is done using the infinite tests consistent limit approximation. This consistent limit exists for procedures controlling the FDR and for procedures controlling the FDX, but not for procedures controlling the family-wise error rate (FWER).

\ni As is the case with the power functions supplied in the base R library, like \texttt{power.t.test}, the routine will solve for any missing parameter, except that in this case $\alpha$ must be specified. This means that the routine will calculate the average power or TPX power (see below) under the specified procedure at the given $\alpha$, effect size, sample size, and prior probability. It will also find the sample size, effect size, or prior probability required for a specified average power or TPX power, given values of the other parameters. For example, the following four lines of code return essentially the same result but calculate, in the order listed, the average power, the sample size required for average power, the effect size required for average power, and the prior probability required for average power, respectively, for given values of the other parameters.
<<label=solve-for, echo=TRUE, results=hide>>=
find.avp <- pwrFDR(effect.size=0.79, alpha=0.15, n.sample=ss.fdr.r05$n.sample,
                   r.1=0.05)
find.ss <- pwrFDR(effect.size=0.79, alpha=0.15, r.1=0.05, average.power=0.80)
find.es <- pwrFDR(alpha=0.15, n.sample=ss.fdr.r05$n.sample, r.1=0.05,
                  average.power=0.80)
find.r1 <- pwrFDR(effect.size=0.79, alpha=0.15, n.sample=ss.fdr.r05$n.sample,
                  average.power=0.80)
@
\ni The point is that, of the four parameters (desired power, average or TPX; effect size; sample size; and prior probability), the user must specify $\alpha$ together with any three of these, and the missing one will be calculated. See the help documentation for more information.

\ni While we're at it, in order to see how much the alternative hypothesis prior probability, $r_1$, affects the required sample size, let's calculate the sample size required for 80\% average power under BH-FDR at $\alpha=0.15$ under the above settings amended to incorporate a higher prior probability, \texttt{r.1=0.10}.
<<label=avgpwr_r10_minf>>=
ss.fdr.r10 <- update(ss.fdr.r05, r.1=0.10)
@
\ni The following line generates a publication-ready table.
<<label=avgpwr-tbl, echo=TRUE, results=hide>>=
print(ss.fdr.r05, label="tbl:minf", result="tex", cptn="$m=\\infty$")
@
\ni or we can join the two tables into one, also adding a caption:
<<label=avgpwr-tbl-join, echo=TRUE, results=tex>>=
print(join.tbl(ss.fdr.r05, ss.fdr.r10), label="tbl:minf-r05-r10", result="tex",
      cptn="$m=\\infty, r_1=0.05, 0.10$")
@
\ni The first six lines are user specified parameters or default values, and the seventh through tenth lines are calculated by the function. As for the last two lines, sample size and power, as is usually the case, one is specified and the other is calculated. The first line indicates that calculations were done according to the theoretical method, which in the case of average power under FDR control is the infinite tests consistent limit approximation. Lines 2 and 4 are here default values: the default method of FDP control is \texttt{"BHFDR"}, as mentioned above, and the value \texttt{Inf} for \texttt{N.tests} signifies that the infinite tests consistent limit is being used, in which case specification of \texttt{N.tests} is not required. Not to belabor an obvious point, but this means that all derived quantities are independent of the number of simultaneous tests. We shall discuss when and how this assumption breaks down below. Lines 3, 5, and 6, being $\alpha$, $r_1$, and \texttt{effect.size}, respectively, are user specified as discussed above. Recall that results ``a'' and ``b'' differ only in the specified value of the prior probability, $r_1$, being 0.05 and 0.10, respectively. The first of the derived values, $\gamma$, on line 7, is the infinite tests consistent limit of $R/m$. This limit is the rejection rate, or expected proportion of all tests which are declared significant. The next three lines are the asymptotic standard deviations of the rejection proportion, $R/m$, the false discovery proportion, $V/R$, and the true positive proportion, $T/M$. We will see below why it is useful to know these. Lines 11 and 12 are the sample size and average power, respectively. In this case we specified the average power and the function calculated the required sample size given the other parameter values. Comparing results ``a'' and ``b'', it is clear that doubling the prior probability, $r_1$, results in roughly a doubling of the rejection rate, $\gamma$, and roughly three quarters the sample size required for \Sexpr{pct(ss.fdr.r05$average.power, 1)} average power.

\ni The package provides a simulation method as a check on the variety of theoretical methods used. In this case we must specify the number of tests. The simulation method will not solve for the sample size required for specified power, as this is impractical: the back-solver would result in more than 20 calls to the function. Thus we must instead request a computation of power (average power in this case) at a specified sample size. The simulation routine generates replicate data-sets, each containing $m$ full data records, where each record consists of a population indicator (Bernoulli, probability $r_1$), a test statistic distributed under the alternative or the null according to the value of the population indicator, and a corresponding p-value. For each simulation replicate dataset, the requested procedure is applied to the $m$ test statistics, and then the numbers of rejected tests, $R$, and true positives, $T$, are recorded. The number of statistics distributed as the alternative, $M$, is also recorded.
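\ni In miniature, each simulation replicate amounts to something like the following sketch (ours, with normal test statistics for brevity; the package works with the appropriate test-specific distributions):
<<label=sim-replicate-sketch, echo=TRUE>>=
## One simulation replicate, in miniature (illustration only)
one.rep <- function(m, r.1, theta, alpha)
{
    xi <- rbinom(m, 1, r.1)                    # population indicators
    P  <- 1 - pnorm(rnorm(m, mean=xi*theta))   # one-sided p-values
    srt <- order(P)
    marked <- P[srt] <= alpha*(1:m)/m          # BH-FDR criterion sequence
    R <- if(any(marked)) max(which(marked)) else 0   # step-up call
    T.pos <- sum(xi[srt][seq_len(R)])          # true positives among the calls
    c(M=sum(xi), R=R, T=T.pos)
}
one.rep(m=1000, r.1=0.10, theta=2.5, alpha=0.15)
@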
Of course, the number of false positives need not be recorded, as it can be found via subtraction: $V=R-T$. These per-replicate statistics are stored in the \texttt{reps} component of the \texttt{detail} attribute, which can be obtained for a given \texttt{pwrFDR} object, \texttt{result}, via the expression \texttt{detail(result)[["reps"]]}. In the following code block we call \texttt{pwrFDR} via the \texttt{"simulation"} method at the parameter settings used in \texttt{ss.fdr.r10} above, when the number of simultaneous tests is \num{10000}, \num{1000}, and 100, respectively, in the three lines of code.
<<label=avgpwr-sim, results=hide>>=
avgpwr.fdr.sim.r10.m1e4 <- pwrFDR(effect.size=0.79, alpha=0.15, r.1=0.10,
                                  n.sample=ss.fdr.r10$n.sample, N.tests=10000,
                                  meth="sim")
avgpwr.fdr.sim.r10.m1e3 <- update(avgpwr.fdr.sim.r10.m1e4, N.tests=1000)
avgpwr.fdr.sim.r10.m100 <- update(avgpwr.fdr.sim.r10.m1e4, N.tests=100)
@
<<label=genplotVoRr10m100, echo=FALSE, results=tex>>=
print(join.tbl(avgpwr.fdr.sim.r10.m1e4, avgpwr.fdr.sim.r10.m1e3,
               avgpwr.fdr.sim.r10.m100),
      label="tbl:sim_cmpst", result="tex",
      cptn="Results of simulation calls with varying `m'.")
@
\ni Comparing the simulation results in each of the three columns of table \ref{tbl:sim_cmpst} with the theoretical approximation at the same design parameters in the second column of table \ref{tbl:minf-r05-r10}, the only differences in derived results, beyond those expected from simulation error, are the designation that the \texttt{"Simulation"} method was used, the appearance of standard errors of the rejection proportion, false discovery proportion, and true positive proportion in place of asymptotic standard deviations, and the appearance of two new derived quantities, \texttt{emp.FDR} and \texttt{emp.FDX}. First, when the number of tests, \texttt{N.tests}, is specified, the function returns estimated standard errors instead of asymptotic standard deviations, these being the latter divided by the square root of the number of tests. Secondly, the two new derived quantities are the empirical FDR and FDX, derived as simulation estimates. The latter estimates $\P\{ \mrm{FDP} > \alpha \}$, the probability that the false discovery proportion exceeds $\alpha$. Notice that the empirical FDR's corresponding to the differing numbers of simultaneous tests are identical to within simulation error, but the standard error of the false discovery proportion, as well as those of the other two ratios, increases in inverse proportion to the square root of the number of tests. While the location, i.e. the mean, of the FDP distribution remains more or less constant as the number of tests decreases from \num{10000} to 100, the width of the distribution grows. The point is that the BH-FDR procedure guarantees that the mean of the FDP will be less than $\alpha$, but not what the width of the distribution will be. BH-FDR control means that if a multiple testing experiment is repeated, then the FDP's corresponding to each experiment will have average value less than $\alpha$. Per-experiment values of the FDP may vary wildly and, in fact, have high probability of being unacceptably large. Here we see that while for \num{10000} tests the probability that the FDP exceeds $\alpha$ is roughly \Sexpr{pct(avgpwr.fdr.sim.r10.m1e4$emp.FDX,0)}, it becomes quite large as the number of tests decreases, being \Sexpr{pct(avgpwr.fdr.sim.r10.m1e3$emp.FDX, 0)} and \Sexpr{pct(avgpwr.fdr.sim.r10.m100$emp.FDX, 0)} when the number of tests is \num{1000} and \num{100}, respectively.
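\ni As a sanity check, the empirical FDX for the 100-test simulation can be recomputed by hand from the per-replicate counts (\texttt{\%over\%} is \texttt{pwrFDR}'s division operator, which returns 0 when the denominator is 0); up to the tie convention at $\mrm{FDP}=\alpha$, this should reproduce the \texttt{emp.FDX} entry in the last column of table \ref{tbl:sim_cmpst}:
<<label=empFDX-by-hand, echo=TRUE>>=
## Proportion of simulation replicates with FDP exceeding alpha = 0.15
with(detail(avgpwr.fdr.sim.r10.m100)[["reps"]],
     mean((R-T) %over% R > 0.15))
@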
This raises the question of whether BH-FDR control is appropriate for a moderate to small number of simultaneous tests.

\section{Caveats Arising from FDR Control and Use of Average Power}
\ni This point regarding the appropriateness of BH-FDR control for a moderate to small number of simultaneous tests is made clearer by having a look at the distribution of the FDP as the number of tests decreases. We re-run the above simulations for 6 multiple testing experiments at the same design parameter settings when the number of tests is \num{10000}, \num{2000}, \num{1000}, \num{500}, \num{250}, and \num{100}, respectively.
<<label=cmpt-ss>>=
ss <- pwrFDR(effect.size = 0.79, average.power=0.80, r.1 = 0.10, alpha = 0.15)
avgp <- update(ss, average.power=NULL, n.sample=ss$n.sample)
@
<<label=gen-FDP-violin-plot, echo=FALSE, results=hide>>=
sim <- list()
FDX20.tbl <- TPX70.tbl <- NULL
m.lst <- list(m10000=10000, m2000=2000, m1000=1000, m500=500, m250=250, m100=100)
n.mlst <- length(m.lst)
for(k in 1:n.mlst)
{
  sim[[names(m.lst)[k]]] <- update(avgp, method = "sim", N.tests=m.lst[[k]])
  FDX20.tbl <- c(FDX20.tbl, with(detail(sim[[names(m.lst)[k]]])$reps,
                                 mean((R-T)%over%R>=0.20)))
  TPX70.tbl <- c(TPX70.tbl, with(detail(sim[[names(m.lst)[k]]])$reps,
                                 mean( T %over% M1 < 0.70)))
}
FDX20.tbl <- as.data.frame(rbind(FDX20.tbl))
names(FDX20.tbl) <- m.lst
FDX20.tbl <- format(FDX20.tbl)
TPX70.tbl <- as.data.frame(rbind(TPX70.tbl))
names(TPX70.tbl) <- m.lst
TPX70.tbl <- format(TPX70.tbl)
FDX20.TPX70.tbl <- rbind(FDX20.tbl, TPX70.tbl)
FDX20.TPX70.tbl <- cbind(` `=c("$P(\\mrm{FDP} \\geq 0.20)$",
                               "$P(\\mrm{TPP} < 0.70)$"), FDX20.TPX70.tbl)
n.sim <- nrow(detail(sim[["m10000"]])$reps)
FDP.dat <- TPP.dat <- NULL
for(k in 1:length(m.lst))
{
  FDP.dat <- c(FDP.dat, with(detail(sim[[names(m.lst)[k]]])$reps, (R-T)%over%R))
  TPP.dat <- c(TPP.dat, with(detail(sim[[names(m.lst)[k]]])$reps, T %over% M1))
}
FDP.dat <- data.frame(FDP=FDP.dat)
FDP.dat$group <- rep(unlist(m.lst), each=n.sim)
FDP.dat$group <- ordered(FDP.dat$group, levels=FDP.dat$group[n.sim*(0:(n.mlst-1))+1])
FDP.plot <- ggplot()+geom_violin(position="dodge", alpha=0.5,
                                 aes(y=FDP, x=group), data=FDP.dat) + ylim(c(0, 0.4))
TPP.dat <- data.frame(TPP=TPP.dat)
TPP.dat$group <- rep(unlist(m.lst), each=n.sim)
TPP.dat$group <- ordered(TPP.dat$group, levels=TPP.dat$group[n.sim*(0:(n.mlst-1))+1])
TPP.plot <- ggplot()+geom_violin(position="dodge", alpha=0.5,
                                 aes(y=TPP, x=group), data=TPP.dat) + ylim(c(0.6, 1))
@
<<label=prnt.excdnc, echo=FALSE, results=tex>>=
basic.tmPrint(FDX20.TPX70.tbl,
              cptn="Simulation estimates of the indicated FDP exceedance and TPP " %,%
                   "shortfall probabilities for the indicated values of $m$ when " %,%
                   "$\\alpha=0.15, r_1=0.10$, effect size 0.79, and average " %,%
                   "power 80\\%",
              lbl="tbl:FDX20")
@
\begin{figure}
\begin{centering}
<<label=plot-FDP-violin-plot, echo=FALSE, fig=TRUE, height=4>>=
FDP.plot
@
\end{centering}
\caption{Violin plots of the FDP distribution for numbers of simultaneous tests varying from 10,000 down to 100; effect.size=0.79, n.sample=\Sexpr{ss[["n.sample"]]}, r.1=0.10, alpha=0.15}
\label{fig:FDP-violin-plot}
\end{figure}
\begin{figure}
\begin{centering}
<<label=plot-TPP-violin-plot, echo=FALSE, fig=TRUE, height=4>>=
TPP.plot
@
\end{centering}
\caption{Violin plots of the TPP distribution for numbers of simultaneous tests varying from 10,000 down to 100; effect.size=0.79, n.sample=\Sexpr{ss[["n.sample"]]}, r.1=0.10, alpha=0.15}
\label{fig:TPP-violin-plot}
\end{figure}
\ni A sample size of \Sexpr{ss[["n.sample"]]} is required for \Sexpr{pct(ss[["average.power"]])} average power.
Figure \ref{fig:FDP-violin-plot} below shows violin plots of the FDP distribution for values of $m$ varying from 10,000 down to 100 when the FDR is 15\% and the other parameters are as indicated above. It is clear that the spread of the FDP distribution goes from very narrow to very disperse. In the more dispersed cases of 250 and 100 tests, it is clear that controlling the mean of the FDP distribution offers little assurance as to the value of the FDP. Table \ref{tbl:FDX20} shows the probability that the FDP exceeds 20\% for each of the indicated values of $m$. At the most extreme level of dispersion, when $m=100$, the probability that the FDP exceeds 20\% is roughly 20\%. It is easy to be lulled into thinking that FDR control at 15\% means that the FDP will be less than 15\%, but here there is a roughly 20\% probability that the FDP exceeds 20\%, a value 133\% of the target. At the larger numbers of tests, the simulation estimate of the exceedance probability is zero, suggesting that FDR control is a good indication that the FDP is controlled when the number of tests is larger.

\ni Similar caveats arise from the use of the average power to define the power of a multiple testing experiment. The average power is the expected value of the TPP, which can be thought of as the average TPP over many identical multiple testing experiments. That a given sample size guarantees average power says very little about what the TPP will be in any one given multiple testing experiment. Figure \ref{fig:TPP-violin-plot} below shows violin plots of the TPP distribution for values of $m$ varying from 10,000 down to 100 when the average power is 80\% and the other parameters are as indicated above. Once again, it is clear that the spread of the TPP distribution goes from very narrow to very disperse. In the more dispersed cases of 250 and 100 tests, it is clear that controlling the mean of the TPP distribution offers little assurance as to the value of the TPP. Table \ref{tbl:FDX20} shows the probability that the TPP is less than 70\% for each of the indicated values of $m$. At the most extreme level of dispersion, when $m=100$, the probability that the TPP is less than 70\% is roughly 15\%. It is easy to be lulled into thinking that a sample size producing 80\% average power means that the TPP will be 80\%, but here there is a 15\% probability that the TPP is less than 70\%. At the larger numbers of tests, the simulation estimate of the probability that the TPP is less than 70\% is zero, suggesting that the use of average power to size a multiple testing experiment will result in an equally high TPP when the number of tests is large.

\section{FDX Control and TPX Power}
\ni When the FDP distribution is too dispersed, as we saw above in the case of only several hundred tests, a more reliable method of controlling the value of the FDP is to control the probability that the FDP exceeds a given threshold, $\P\{\mrm{FDP}>\delta\} \leq \alpha$, known as $\mrm{FDX}$ control. A procedure due to Lehmann, Romano, and Shaikh \cite{LehmannEL:2005, RomanoJP:2006} controls the FDX. It is a step-down procedure with multiple testing penalty sequence
\begin{equation}
  \psi_m(j; \delta) = \frac{1 + \lfloor \delta j \rfloor}{m + 1 + \lfloor \delta j \rfloor - j}
\end{equation}
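\ni For concreteness, the penalty sequence is easy to compute directly. The following one-off sketch (ours; the function name is invented here and the choice $\delta=0.15$ is purely illustrative) evaluates $\psi_m(j;\delta)$ for $m=10$ tests, as in Table \ref{tbl:SeqTstDat10}:
<<label=LRS-penalty-sketch, echo=TRUE>>=
## Lehmann-Romano-Shaikh multiple testing penalties (illustration only)
psi.LRS <- function(j, m, delta)
    (1 + floor(delta*j))/(m + 1 + floor(delta*j) - j)
round(psi.LRS(1:10, m=10, delta=0.15), 4)
@
\ni As with any of the sequential procedures discussed above, the corresponding criterion values are the products $\alpha\,\psi_m(j;\delta)$.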
\ni Let us now compute the sample size required for 80\% average power under the Lehmann-Romano-Shaikh procedure at $\alpha=0.15$. This call is the same as the very first sample size calculation above, except that we specify \texttt{FDP.control.method="Romano"} to override its default value, \texttt{"BHFDR"}, and use the higher prior probability $r_1=0.20$.
<<label=cmpt-ss-Rom-Inf>>=
ss.Rom <- pwrFDR(effect.size = 0.79, average.power=0.80, r.1 = 0.20,
                 alpha = 0.15, FDP.control.method="Romano")
@
\ni Table \ref{tbl:Rom-sim} below checks this calculation by simulation, computing the average power at the resulting sample size for $m=500$, $250$, and $100$ simultaneous tests.
<<label=cmpt-avgp-Rom-Inf, echo=FALSE, results=hide>>=
avgp.Rom <- pwrFDR(effect.size = 0.79, n.sample=ss.Rom$n.sample, r.1 = 0.20,
                   alpha = 0.15, FDP.control.method="Romano")
avgp.Rom.500.sim <- update(avgp.Rom, N.tests=500, method="sim")
avgp.Rom.250.sim <- update(avgp.Rom, N.tests=250, method="sim")
avgp.Rom.100.sim <- update(avgp.Rom, N.tests=100, method="sim")
@
<<label=prt-Rom, echo=FALSE, results=tex>>=
print(join.tbl(avgp.Rom.500.sim, avgp.Rom.250.sim, avgp.Rom.100.sim),
      result="tex",
      cptn="Simulated average power under the Romano procedure, $m=500, 250, 100$.",
      label="tbl:Rom-sim")
@
\ni Finally, we repeat the exercise with \texttt{FDP.control.method="BHFDX"}, which adjusts the nominal $\alpha$ of the BH procedure so that the FDX, rather than the FDR, is controlled (see the package documentation for details). Because this adjustment depends upon the number of simultaneous tests, \texttt{N.tests} must now be specified. We again check the resulting sample sizes by simulation for $m=500$, $250$, and $100$ tests, with results shown in Table \ref{tbl:BHFDX-sim}.
<<label=cmpt-BHFDX>>=
ss.BHFDX.500 <- pwrFDR(effect.size = 0.79, average.power=0.80, r.1 = 0.20,
                       alpha = 0.15, FDP.control.method="BHFDX", N.tests=500)
ss.BHFDX.250 <- update(ss.BHFDX.500, N.tests=250)
ss.BHFDX.100 <- update(ss.BHFDX.500, N.tests=100)
avgp.BHFDX.500.sim <- update(ss.BHFDX.500, n.sample=ss.BHFDX.500$n.sample,
                             average.power=NULL, method="sim")
avgp.BHFDX.250.sim <- update(ss.BHFDX.250, n.sample=ss.BHFDX.250$n.sample,
                             average.power=NULL, method="sim")
avgp.BHFDX.100.sim <- update(ss.BHFDX.100, n.sample=ss.BHFDX.100$n.sample,
                             average.power=NULL, method="sim")
@
<<label=prt-BHFDX, echo=FALSE, results=tex>>=
print(join.tbl(avgp.BHFDX.500.sim, avgp.BHFDX.250.sim, avgp.BHFDX.100.sim),
      result="tex",
      cptn="Simulated average power under the BHFDX procedure, $m=500, 250, 100$.",
      label="tbl:BHFDX-sim")
@
\bibliographystyle{plain}
\bibliography{pwrFDR-vignette}
\end{document}