{"title": "A Bandit Approach to Sequential Experimental Design with False Discovery Control", "book": "Advances in Neural Information Processing Systems", "page_first": 3660, "page_last": 3670, "abstract": "We propose a new adaptive sampling approach to multiple testing which aims to maximize statistical power while ensuring anytime false discovery control. We consider $n$ distributions whose means are partitioned by whether they are below or equal to a baseline (nulls), versus above the baseline (true positives). In addition, each distribution can be sequentially and repeatedly sampled. Using techniques from multi-armed bandits, we provide an algorithm that takes as few samples as possible to exceed a target true positive proportion (i.e. proportion of true positives discovered) while giving anytime control of the false discovery proportion (nulls predicted as true positives). Our sample complexity results match known information theoretic lower bounds and through simulations we show a substantial performance improvement over uniform sampling and an adaptive elimination style algorithm. Given the simplicity of the approach, and its sample efficiency, the method has promise for wide adoption in the biological sciences, clinical testing for drug discovery, and maximization of click through in A/B/n testing problems.", "full_text": "A Bandit Approach to Multiple Testing with False\n\nDiscovery Control\n\nKevin Jamieson\u21e4,\u2020, Lalit Jain\u21e4\n\n{jamieson,lalitj}@cs.washington.edu\n\n\u21e4Paul G. Allen School of Computer Science & Engineering,\n\nUniversity of Washington, Seattle, WA, and\n\n\u2020Optimizely, San Francisco, CA\n\nAbstract\n\nWe propose an adaptive sampling approach for multiple testing which aims to\nmaximize statistical power while ensuring anytime false discovery control. We\nconsider n distributions whose means are partitioned by whether they are below or\nequal to a baseline (nulls), versus above the baseline (actual positives). 
In addition, each distribution can be sequentially and repeatedly sampled. Inspired by the multi-armed bandit literature, we provide an algorithm that takes as few samples as possible to exceed a target true positive proportion (i.e. proportion of actual positives discovered) while giving anytime control of the false discovery proportion (nulls predicted as actual positives). Our sample complexity results match known information theoretic lower bounds, and through simulations we show a substantial performance improvement over uniform sampling and an adaptive elimination style algorithm. Given the simplicity of the approach, and its sample efficiency, the method has promise for wide adoption in the biological sciences, clinical testing for drug discovery, and online A/B/n testing problems.

1 Introduction

Consider n possible treatments, say, drugs in a clinical trial, where each treatment either has a positive expected effect relative to a baseline (actual positive), or no difference (null), with the goal of identifying as many actual positive treatments as possible. If evaluating the ith treatment results in a noisy outcome (e.g. due to variance in the actual measurement or just diversity in the population), then given a total measurement budget of B, it is standard practice to execute and average B/n measurements of each treatment, and then output a set of predicted actual positives based on the measured effect sizes. False alarms (i.e. nulls predicted as actual positives) are controlled by either controlling the family-wise error rate (FWER), where one bounds the probability that at least one of the predictions is null, or the false discovery rate (FDR), where one bounds the expected ratio of the number of predictions that are null to the total number of predictions.
FDR is a weaker condition than FWER but is often used in favor of FWER because of its higher statistical power: more actual positives are output as predictions using the same measurements.
In the pursuit of even greater statistical power, there has recently been increased interest in the biological sciences in rejecting the uniform allocation strategy of B/n trials per treatment in favor of an adaptive allocation. Adaptive allocations partition the budget B into sequential rounds of measurements in which the measurements taken at one round inform which measurements are taken in the next [1, 2]. Intuitively, if the effect size is relatively large for some treatment, fewer trials will be necessary to identify that treatment as an actual positive relative to the others, and that savings of measurements can be allocated towards treatments with smaller effect sizes to boost the signal. However, both [1, 2] employed ad-hoc heuristics which may not only have sub-optimal statistical power, but may also result in more false alarms than expected.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

As another example, in the domain of A/B/n testing in online environments, the desire to understand and maximize click-through-rate across treatments (e.g., web-layouts, campaigns, etc.) has become ubiquitous across retail, social media, and headline optimization for the news. And in this domain, the desire for statistically rigorous adaptive sampling methods with high statistical power is explicit [3].
In this paper we propose an adaptive measurement allocation scheme that achieves near-optimal statistical power subject to FWER or FDR false alarm control. Perhaps surprisingly, we show that even if the treatment effect sizes of the actual positives are identical, adaptive measurement allocation can still substantially improve statistical power.
That is, more actual positives can be predicted using an adaptive allocation relative to the uniform allocation under the same false alarm control.

1.1 Problem Statement

Consider n distributions (or arms) and a game where at each time t, the player chooses an arm i ∈ [n] := {1, ..., n} and immediately observes a reward X_{i,t} iid∼ ν_i where X_{i,t} ∈ [0, 1]¹ and E_{ν_i}[X_{i,t}] = µ_i. For a known threshold µ0, define the sets²

H1 = {i ∈ [n] : µ_i > µ0}  and  H0 = {i ∈ [n] : µ_i = µ0} = [n] \ H1.

The values of the means µ_i for i ∈ [n] and the cardinality of H1 are unknown. The arms (treatments) in H1 have means greater than µ0 (positive effect) while those in H0 have means equal to µ0 (no effect over baseline). At each time t, after the player plays an arm, she also outputs a set of indices S_t ⊆ [n] that are interpreted as discoveries or rejections of the null-hypothesis (that is, if i ∈ S_t then the player believes i ∈ H1). For as small a τ ∈ N as possible, the goal is to have the number of true detections |S_t ∩ H1| be approximately |H1| for all t ≥ τ, subject to the number of false alarms |S_t ∩ H0| being small uniformly over all times t ∈ N. We now formally define our notions of false alarm control and true discoveries.

Definition 1 (False Discovery Rate, FDR-δ). Fix some δ ∈ (0, 1). We say an algorithm is FDR-δ if for all possible problem instances ({ν_i}_{i=1}^n, µ0) it satisfies E[|S_t ∩ H0| / (|S_t| ∨ 1)] ≤ δ for all t ∈ N simultaneously.

Definition 2 (Family-wise Error Rate, FWER-δ). Fix some δ ∈ (0, 1). We say an algorithm is FWER-δ if for all possible problem instances ({ν_i}_{i=1}^n, µ0) it satisfies P(∪_{t=1}^∞ {S_t ∩ H0 ≠ ∅}) ≤ δ.

Note FWER-δ implies FDR-δ, the former being a stronger condition than the latter. Allowing a relatively small number of false discoveries is natural, especially if |H1| is relatively large.
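As a concrete reading of the definitions, the two random quantities being controlled are the false discovery proportion |S_t ∩ H0| / (|S_t| ∨ 1) and the true positive proportion |S_t ∩ H1| / |H1|. A minimal Python sketch (with hypothetical helper names) of how one might measure them in simulation:

```python
def false_discovery_proportion(S, H0):
    # |S ∩ H0| / (|S| ∨ 1): FDR-δ bounds the expectation of this quantity at every t
    S, H0 = set(S), set(H0)
    return len(S & H0) / max(len(S), 1)

def true_positive_proportion(S, H1):
    # |S ∩ H1| / |H1|: TPR-δ,τ lower-bounds the expectation of this quantity for t ≥ τ
    S, H1 = set(S), set(H1)
    return len(S & H1) / len(H1)
```

Averaging false_discovery_proportion over independent runs estimates the left-hand side of Definition 1, while averaging the indicator of H1 ⊆ S_t estimates the detection probability appearing in the definitions that follow.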
Because µ0 is known, there exist schemes that guarantee FDR-δ or FWER-δ even if the arm means µ_i and the cardinality of H1 are unknown (see Section 2.1). It is also natural to relax the goal of identifying all arms in H1 to simply identifying a large proportion of them.

Definition 3 (True Positive Rate, TPR-δ,τ). Fix some δ ∈ (0, 1). We say an algorithm is TPR-δ,τ on an instance ({ν_i}_{i=1}^n, µ0) if E[|S_t ∩ H1| / |H1|] ≥ 1 − δ for all t ≥ τ.

Definition 4 (Family-wise Probability of Detection, FWPD-δ,τ). Fix some δ ∈ (0, 1). We say an algorithm is FWPD-δ,τ on an instance ({ν_i}_{i=1}^n, µ0) if P(H1 ⊆ S_t) ≥ 1 − δ for all t ≥ τ.

Note that FWPD-δ,τ implies TPR-δ,τ, the former being a stronger condition than the latter. Also note that P(∪_{t=1}^∞ {S_t ∩ H0 ≠ ∅}) ≤ δ and P(H1 ⊆ S_τ) ≥ 1 − δ together imply P(H1 = S_τ) ≥ 1 − 2δ. We will see that it is possible to control the number of false discoveries |S_t ∩ H0| regardless of how the player selects arms to play. It is the rate at which S_t includes H1 that can be thought of as the statistical power of the algorithm, which we formalize as its sample complexity:

Definition 5 (Sample Complexity). Fix some δ ∈ (0, 1) and an algorithm A that is FDR-δ (or FWER-δ) over all possible problem instances. Fix a particular problem instance ({ν_i}_{i=1}^n, µ0). At each time t ∈ N, A chooses an arm i ∈ [n] to obtain an observation from, and before proceeding to the next round outputs a set S_t ⊆ [n]. The sample complexity of A on this instance is the smallest time τ ∈ N such that A is TPR-δ,τ (or FWPD-δ,τ).

The sample complexity and value of τ of an algorithm will depend on the particular instance ({ν_i}_{i=1}^n, µ0). For example, if H1 = {i ∈ [n] : µ_i = µ0 + Δ} and H0 = [n] \ H1, then we expect the sample complexity to increase as Δ decreases, since at least Δ⁻² samples are necessary to determine whether an arm has mean µ0 versus µ0 + Δ. The next section will give explicit cases.

¹All results without modification apply to unbounded, sub-Gaussian random variables.
²All results generalize to the case when H0 = {i : µ_i ≤ µ0}.

                                      False alarm control
                                      FDR-δ:                                  FWER-δ:
                                      max_t E[|S_t∩H0|/(|S_t|∨1)] ≤ δ         P(∪_{t=1}^∞ {S_t∩H0 ≠ ∅}) ≤ δ
Detection    TPR-δ,τ:
probability  E[|S_τ∩H1|/|H1|] ≥ 1−δ   Theorem 2: nΔ⁻²                         Theorem 5: (n−k)Δ⁻² + kΔ⁻² log(n−k)
             FWPD-δ,τ:
             P(H1 ⊆ S_τ) ≥ 1−δ        Theorem 3: (n−k)Δ⁻² log(k) + kΔ⁻²       Theorem 4: (n−k)Δ⁻² log(k) + kΔ⁻² log(n−k)

Table 1: Informal summary of sample complexity results proved in this paper for |H1| = k, δ constant (e.g., δ = .05), and Δ = min_{i∈H1} µ_i − µ0. Uniform sampling across all settings requires at least nΔ⁻² log(n/k) samples, and in the FWER+FWPD setting requires nΔ⁻² log(n). Constants and log log factors are ignored.

Remark 1 (Impossibility of stopping time). We emphasize that just as in the non-adaptive setting, at no time can an algorithm stop and declare that it is TPR-δ,τ or FWPD-δ,τ for any finite τ ∈ N. This is because there may be an arm in H1 with a mean infinitesimally close to µ0 but distinct, such that no algorithm can determine whether it is in H0 or H1. Thus, the algorithm must run indefinitely or until it is stopped externally.
However, using an anytime confidence bound (see Section 2) one can always make statements like "either H1 ⊆ S_t, or max_{i∈H1\S_t} µ_i − µ0 ≤ ε" where the ε will depend on the width of the confidence interval.

1.2 Contributions and Informal Summary of Main Results

In Section 2 we propose an algorithm that handles all four combinations of {FDR-δ, FWER-δ} and {TPR-δ,τ, FWPD-δ,τ}. A reader familiar with the multi-armed bandit literature would expect an adaptive sampling algorithm to have a large advantage over uniform sampling when there is a large diversity in the means of H1, since larger means can be distinguished from µ0 with fewer samples. However, one should note that to declare all of H1 as discoveries, one must sample every arm in H0 at least as many times as the most sampled arm in H1; otherwise they are statistically indistinguishable. As discoveries typically uncover rare phenomena, it is common to assume |H1| = n^β for β ∈ (0, 1) [4, 5], or |H1| = o(n), but this implies that the number of samples taken from the arms in H1, regardless of how samples are allocated to those arms, will almost always be dwarfed by the number of samples allocated to the arms in H0, since there are Ω(n) of them. This line of reasoning, in part, is what motivates us to give our sample complexity results in terms of the quantities that best describe the contributions from the arms in H0, namely, the cardinality |H1| = n − |H0|, the confidence parameter δ (e.g., δ = .05), and the gap Δ := min_{i∈H1} µ_i − µ0 between the means of the arms in H0 and the smallest mean in H1. Reporting sample complexity results in terms of Δ also allows us to compare to known lower bounds in the literature [6, 4, 7, 8]. Nevertheless, we do address the case where the means of H1 are varied in Theorem 2.
An informal summary of the sample complexity results proven in this work is found in Table 1 for |H1| = k.
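The entries of Table 1 can be tabulated numerically. The sketch below (Python; the values n = 1000, k = 10, Δ = 0.1 are arbitrary illustrative choices, and constants and log log factors are dropped as in the table) compares the four adaptive rates to the uniform-sampling baselines:

```python
import math

def rates(n, k, gap):
    # informal sample complexity rates from Table 1 with |H1| = k and gap Δ;
    # constants and log log factors are ignored
    d2 = gap ** -2
    return {
        "FDR+TPR": n * d2,                                        # Theorem 2
        "FDR+FWPD": (n - k) * d2 * math.log(k) + k * d2,          # Theorem 3
        "FWER+TPR": (n - k) * d2 + k * d2 * math.log(n - k),      # Theorem 5
        "FWER+FWPD": (n - k) * d2 * math.log(k)
                     + k * d2 * math.log(n - k),                  # Theorem 4
        "uniform (FDR+TPR)": n * d2 * math.log(n / k),            # Theorem 7
        "uniform (FWER+FWPD)": n * d2 * math.log(n),              # Theorem 8
    }

for name, val in rates(n=1000, k=10, gap=0.1).items():
    print(f"{name}: {val:.3g}")
```

Even at this level of informality, the adaptive rates are visibly smaller than the corresponding uniform baselines for sparse |H1|.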
For the least strict setting of FDR+TPR, the upper-left quadrant of Table 1 matches the lower bound of [4], a sample complexity of just Δ⁻²n. In this FDR+TPR setting (which requires the fewest samples of the four settings), uniform sampling, which pulls each arm an equal number of times, has a sample complexity of at least nΔ⁻² log(n/|H1|) (see Theorem 7 in Appendix G), which exceeds all results in Table 1, demonstrating the statistical power gained by adaptive sampling. For the most strict setting of FWER+FWPD, the lower-right quadrant of Table 1 matches the lower bounds of [7, 9, 8], a sample complexity of (n−k)Δ⁻² log(k) + kΔ⁻² log(n−k). Uniform sampling in the FWER+FWPD setting has a sample complexity lower bounded by nΔ⁻² log(n) (see Theorem 8 in Appendix G). The settings of FDR+FWPD and FWER+TPR are sandwiched between these results, and we are unaware of existing lower bounds for these settings.
All the results in Table 1 are novel, and to the best of our knowledge are the first non-trivial sample complexity results for an adaptive algorithm in the fixed confidence setting, where a desired confidence δ is set and the algorithm attempts to minimize the number of samples taken to meet the desired conditions. We also derive tools that we believe may be useful outside this work: for always-valid p-values (c.f.
[3, 10]) we show that FDR is controlled for all times using the Benjamini-Hochberg procedure [11] (see Lemma 1), and we also provide an anytime high-probability bound on the false discovery proportion (see Lemma 2).
Finally, as a direct consequence of the theoretical guarantees proven in this work and the empirical performance of the FDR+TPR variant of the algorithm on real data, an algorithm faithful to the theory was implemented and is in use in production at a leading A/B testing platform [12].

1.3 Related work

Identifying arms with means above a threshold, or equivalently, multiple testing via rejecting null-hypotheses with small p-values, is a ubiquitous problem in the biological sciences. In the standard setup, each arm is given an equal number of measurements (i.e., a uniform sampling strategy), a p-value P_i is produced for each arm where P(P_i ≤ x) ≤ x for all x ∈ (0, 1] and i ∈ H0, and a procedure is then run on these p-values to declare small p-values as rejections of the null-hypothesis, or discoveries. For a set of p-values P_1 ≤ P_2 ≤ ... ≤ P_n, the so-called Bonferroni selection rule selects S_BF = {i : P_i ≤ δ/n}. The fact that FWER control implies FDR control, E[|S_BF ∩ H0| / (|S_BF| ∨ 1)] ≤ P(∪_{i∈H0}{P_i ≤ δ/n}) ≤ |H0|δ/n ≤ δ, suggests that greater statistical power (i.e. more discoveries) could be achieved with procedures designed specifically for FDR. The BH procedure [11] is one such procedure to control FDR and is widely used in practice (with its many extensions [6] and performance investigations [5]).
Recall that a uniform measurement strategy where every arm is sampled the same number of times requires nΔ⁻² log(n/k) samples in the FDR+TPR setting, and nΔ⁻² log(n) samples in the FWER+FWPD setting (Theorems 7 and 8 in Appendix G), which can be substantially worse than our adaptive procedure (see Table 1).
Adaptive sequential testing has been previously addressed in the fixed budget setting: the procedure takes a sampling budget as input, and the guarantee states that if the given budget is larger than a problem dependent constant, the procedure drives the error probability to zero and the detection probability to one. One of the first methods, called distilled sensing [13], assumed that arms from H0 were Gaussian with mean at most µ0, and successively discarded arms after repeated sampling by thresholding at µ0 (at most the median of the null distribution), thereby discarding about half the nulls at each round. The procedure made guarantees about FDR and TPR, which were later shown to be nearly optimal [4]. Specifically, [4, Corollary 4.2] implies that any procedure with max{FDR + (1 − TPR)} ≤ δ requires a budget of at least Δ⁻²n log(1/δ), which is consistent with our work. Later, another thresholding algorithm for the fixed budget setting addressed the FWER and FWPD metrics [7]. In particular, if their procedure is given a budget exceeding (n − |H1|)Δ⁻² log(|H1|) + |H1|Δ⁻² log(n − |H1|) then the FWER is driven to zero, and the FWPD is driven to one. By appealing to the optimality properties of the SPRT (which knows the distributions precisely) it was argued that this is optimal. These previous works mostly focused on the asymptotic regime as n → ∞ with |H1| = o(n).
Our paper, in contrast to these previous works, considers the fixed confidence setting: the procedure takes a desired FDR (or FWER) and TPR (or FWPD) and aims to minimize the number of samples taken before these constraints are met.
To the best of our knowledge, our paper is the first to propose a scheme for this problem in the fixed confidence regime with near-optimal sample complexity guarantees.
A related line of work is the threshold bandit problem, where all the means of H1 are assumed to be strictly above a given threshold, and the means of H0 are assumed to be strictly below the threshold [14, 15]. To identify this partition, each arm must be pulled a number of times inversely proportional to the square of its deviation from the threshold. This contrasts with our work, where the majority of arms may have means equal to the threshold and the goal is to identify arms with means greater than the threshold subject to discovery constraints. If the arms in H0 are assumed to be strictly below the threshold it is possible to declare arms as in H0. In our setting we can only ever determine that an arm is in H1 and not H0; it is impossible to detect that an arm is in H0 and not in H1.
Note that the problem considered in this paper is very related to the top-k identification problem, where the objective is to identify the unique k arms with the highest means with high probability [16, 9, 8]. Indeed, if we knew |H1|, then our FWER+FWPD setting would be equivalent to the top-k problem with k = |H1|. Lower bounds derived for the top-k problem assume the algorithm has knowledge of the values of the means, just not their indices [16, 8]. Thus, these lower bounds also apply to our setting and are what are referenced in Section 1.2.

Algorithm 1: An algorithm for identifying arms with means above a threshold µ0 using as few samples as possible subject to false alarm and true discovery conditions.
The set S_t is designed to control FDR at level δ. The set R_t is designed to control FWER at level δ.

Input: threshold µ0, confidence δ ∈ (0, e⁻¹], confidence interval φ(·,·)
Initialize: Pull each arm i ∈ [n] once and let T_i(t) denote the number of times arm i has been pulled up to time t. Set S_{n+1} = ∅, R_{n+1} = ∅, and
    If TPR: ξ_t = 1 and ν_t = 1 ∀t
    Else if FWPD: ξ_t = max{2|S_t|, 5/(3(1−4δ)) log(1/δ)} and ν_t = max{|S_t|, 1} ∀t
For t = n+1, n+2, ...
    Pull arm I_t = argmax_{i∈[n]\S_t} µ̂_{i,T_i(t)} + φ(T_i(t), δ/ξ_t)
    Apply Benjamini-Hochberg [11] selection at level δ′ = δ/(6.4 log(36/δ)) to obtain the FDR-controlled set S_{t+1}:
        s(k) = {i ∈ [n] : µ̂_{i,T_i(t)} − φ(T_i(t), δ′k/n) ≥ µ0}, ∀k ∈ [n]
        S_{t+1} = s(k̂) where k̂ = max{k ∈ [n] : |s(k)| ≥ k} (if no such k̂ exists, set S_{t+1} = S_t)
    If FWER and S_t ≠ ∅:
        Pull arm J_t = argmax_{i∈S_t\R_t} µ̂_{i,T_i(t)} + φ(T_i(t), δ/ν_t)
        Apply Bonferroni-like selection to obtain the FWER-controlled set R_{t+1}:
            ζ_t = n − (1 − 2δ′(1 + 4δ′))|S_t| + (4(1 + 4δ′)/3) log(5 log₂(n/δ′)/δ′)
            R_{t+1} = R_t ∪ {i ∈ S_t : µ̂_{i,T_i(t)} − φ(T_i(t), δ/ζ_t) ≥ µ0}

As pointed out by [14], both our setting and the threshold bandit problem can be posed as a combinatorial bandits problem as studied in [17, 18], but such generality leads to unnecessary log factors. The techniques used in this work aim to reduce extraneous log factors, a topic of recent interest in the top-1 and top-k arm identification problems [19, 20, 21, 22, 16, 8]. While these works are most similar to exact identification (FWER+FWPD), there also exist examples of approximate top-k where the objective is to find any k means that are each within ε of the best k means [9].
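For intuition, the FDR+TPR specialization of Algorithm 1 (ξ_t = ν_t = 1 and no R_t update) can be sketched in Python. This is an illustrative toy, not the paper's implementation: it uses the conservative width φ(t, δ) = √(4 log(log₂(2t)/δ)/t), runs BH at level δ (as recommended in Section 2.1, rather than at δ′), and `pull` is a caller-supplied sampling function:

```python
import math

def phi(t, delta, c=4.0):
    # anytime confidence width: sqrt(c * log(log2(2t)/delta) / t)
    return math.sqrt(c * math.log(math.log2(2 * t) / delta) / t)

def bh_select(pvals, delta):
    # Benjamini-Hochberg: reject the khat smallest p-values, where
    # khat = max{k : (k-th smallest p-value) <= delta*k/n}
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    khat = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= delta * rank / n:
            khat = rank
    return set(order[:khat])

def fdr_tpr_sampler(pull, n, mu0, delta, budget):
    # S_t loop of the FDR+TPR setting: UCB sampling outside S_t,
    # BH selection on the anytime p-values of eq. (1) with c = 4
    counts, sums = [0] * n, [0.0] * n
    for i in range(n):                      # initialize: pull each arm once
        sums[i] += pull(i); counts[i] += 1
    S = set()
    for _ in range(budget):
        candidates = [i for i in range(n) if i not in S]
        if not candidates:
            break
        it = max(candidates, key=lambda i: sums[i] / counts[i] + phi(counts[i], delta))
        sums[it] += pull(it); counts[it] += 1
        # anytime p-value P_{i,t} = min(1, log2(2t) * exp(-t*(muhat - mu0)_+^2 / 4))
        p = [min(1.0, math.log2(2 * counts[i]) *
                 math.exp(-counts[i] * max(sums[i] / counts[i] - mu0, 0.0) ** 2 / 4.0))
             for i in range(n)]
        S = bh_select(p, delta) or S        # keep the previous set if BH selects nothing
    return S
```

For example, with pull = lambda i: 1.0 if i >= 8 else 0.5, n = 10, µ0 = 0.5 and a modest budget, the loop settles on S = {8, 9}: the two arms above the threshold are pulled until their p-values pass the BH thresholds, while the nulls, whose empirical means sit at µ0, keep p-values equal to one.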
Approximate recovery is also studied in a ranking context with a symmetric difference metric [23], which is more similar to the FDR and TPR setting, but neither this work nor that one subsumes the other.
Finally, maximizing the number of discoveries subject to an FDR constraint has been studied in a sequential setting in the context of A/B testing with uniform sampling [3]. This work popularized the concept of an always-valid p-value that we employ here (see Section 2). The work of [10] controls FDR over a sequence of independent bandit problems that each outputs at most one discovery. While [10] shares much of the same vocabulary as this paper, the problem settings are very different.

2 Algorithm and Discussion

Throughout, we will assume the existence of an anytime confidence interval. Namely, if µ̂_{i,t} denotes the empirical mean of the first t bounded i.i.d. rewards in [0, 1] from arm i, then for any δ ∈ (0, 1) we assume the existence of a function φ such that for any i we have P(∩_{t=1}^∞ {|µ̂_{i,t} − µ_i| ≤ φ(t, δ)}) ≥ 1 − δ. We assume that φ(t, δ) is non-increasing in its second argument and that there exists an absolute constant c such that φ(t, δ) ≤ √(c log(log₂(2t)/δ)/t). It suffices to define φ by this upper bound with c = 4, but there are much sharper known bounds that should be used in practice (e.g., they may take the empirical variance into account); see [21, 24, 25, 26]. Anytime bounds constructed with such a φ(t, δ) are known to be tight in the sense that P(∪_{t=1}^∞ {|µ̂_{i,t} − µ_i| ≥ φ(t, δ)}) ≤ δ and that there exists an absolute constant h ∈ (0, 1) such that P({|µ̂_{i,t} − µ_i| ≥ h φ(t, δ) for infinitely many t ∈ N}) = 1 by the Law of the Iterated Logarithm [27].
Consider Algorithm 1. Before entering the for loop, time-dependent variables ξ_t and ν_t are defined that should be updated at each time t for the different settings.
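The anytime property of φ can be checked by simulation. The sketch below (Python; Bernoulli rewards and the conservative constant c = 4 are illustrative choices) counts the fraction of sample paths on which |µ̂_t − µ| ever exceeds φ(t, δ) over a finite horizon; by the anytime guarantee this fraction should be at most δ, and with c = 4 it is typically far below:

```python
import math
import random

def phi(t, delta, c=4.0):
    # anytime deviation bound: sqrt(c * log(log2(2t)/delta) / t)
    return math.sqrt(c * math.log(math.log2(2 * t) / delta) / t)

random.seed(0)
delta, mu, horizon, trials = 0.05, 0.3, 2000, 200
failures = 0
for _ in range(trials):
    running_sum, ok = 0.0, True
    for t in range(1, horizon + 1):
        running_sum += 1.0 if random.random() < mu else 0.0  # Bernoulli(mu) reward
        if abs(running_sum / t - mu) > phi(t, delta):
            ok = False
            break
    failures += 0 if ok else 1
print(failures / trials)  # fraction of paths on which the anytime bound ever failed
```

Swapping in one of the sharper variance-adaptive bounds cited above tightens φ while preserving the same experiment.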
If just FDR control is desired, the algorithm merely loops over the three lines following the for loop, pulling the arm I_t not in S_t that has the highest upper confidence bound; such strategies are common for pure-exploration problems [21, 10]. But if FWER control is desired, then at most one additional arm J_t is pulled per round to provide an extra layer of filtering and evidence before an arm is added to R_t. Below we describe the main elements of the algorithm and along the way sketch out the main arguments of the analysis, shedding light on the constants ξ_t and ν_t.

2.1 False alarm control

S_t is FDR-controlled. In addition to its use as a confidence bound, we can also use φ(t, δ) to construct:

    P_{i,t} := sup{α ∈ (0, 1] : µ̂_{i,t} − µ0 ≤ φ(t, α)} ≤ log₂(2t) exp(−t(µ̂_{i,t} − µ0)²/c).    (1)

Proposition 1 of [10] (and the proof of our Lemma 1) shows that if i ∈ H0, so that µ_i = µ0, then P_{i,t} is an anytime, sub-uniformly distributed p-value in the sense that P(∪_{t=1}^∞ {P_{i,t} ≤ x}) ≤ x. Sequences that have this property are sometimes referred to as always-valid p-values [3]. Note that if i ∈ H1, so that µ_i > µ0, we would intuitively expect the sequence {P_{i,t}}_{t=1}^∞ to be point-wise smaller than if µ_i = µ0, by the property that φ(·,·) is non-increasing in its second argument. This leads to the intuitive rule to reject the null-hypothesis (i.e., declare i ∉ H0) for those arms i ∈ [n] where P_{i,t} is very small. The Benjamini-Hochberg (BH) procedure introduced in [11] proceeds by first sorting the p-values so that P_{(1),T_{(1)}(t)} ≤ P_{(2),T_{(2)}(t)} ≤ ... ≤ P_{(n),T_{(n)}(t)}, then defines k̂ = max{k : P_{(k),T_{(k)}(t)} ≤ δk/n}, and sets S_BH = {i : P_{i,T_i(t)} ≤ δk̂/n}. Note that this procedure is identical to defining sets s(k) = {i : P_{i,T_i(t)} ≤ δk/n} = {i : µ̂_{i,T_i(t)} − φ(T_i(t), δk/n) ≥ µ0}, setting k̂ = max{k : |s(k)| ≥ k}, and S_BH = s(k̂), which is exactly the set S_t = S_BH in Algorithm 1.
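The closed form in (1) is simply the inversion of the bound φ(t, α) = √(c log(log₂(2t)/α)/t) in its second argument: setting µ̂_{i,t} − µ0 = φ(t, α) and solving for α gives α = log₂(2t) exp(−t(µ̂_{i,t} − µ0)²/c). A minimal sketch (Python, with c = 4 as in the text):

```python
import math

C = 4.0  # constant in phi(t, alpha) = sqrt(C * log(log2(2t)/alpha) / t)

def phi(t, alpha):
    return math.sqrt(C * math.log(math.log2(2 * t) / alpha) / t)

def p_value(mu_hat, mu0, t):
    # anytime p-value of eq. (1): sup{alpha in (0,1] : mu_hat - mu0 <= phi(t, alpha)},
    # i.e. log2(2t) * exp(-t * (mu_hat - mu0)^2 / C) when mu_hat > mu0, capped at 1
    gap = max(mu_hat - mu0, 0.0)
    return min(1.0, math.log2(2 * t) * math.exp(-t * gap * gap / C))
```

When the cap at 1 is not active, phi(t, p_value(mu_hat, mu0, t)) recovers µ̂ − µ0 exactly, which is the defining sup property; when µ̂ ≤ µ0 the sup is trivially 1.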
Thus, S_t in Algorithm 1 is equivalent to applying the BH procedure at a level O(δ/log(1/δ)) to the anytime p-values of (1). We now discuss the extra logarithmic factor.
Because the algorithm is pulling arms sequentially, some dependence between the p-values may be introduced. Because the anytime p-values are not independent, the BH procedure at level δ does not directly guarantee FDR-control at level δ. However, it has been shown [28] that even for arbitrarily dependent p-values the BH procedure at level δ controls FDR at level δ log(n) (and that this is nearly tight). Similarly, the following theorem, which may be of independent interest, is a significant improvement when applied to our setting.

Theorem 1. Fix δ ∈ (0, e⁻¹). Let p_1, ..., p_n be random variables such that {p_i}_{i∈H0} are independent and sub-uniformly distributed so that max_{i∈H0} P(p_i ≤ x) ≤ x. For any k ∈ {0, 1, ..., n}, let R_k := {i : p_i ≤ δk/n} and FDP̂(R_k) := n max_{p_i∈R_k} p_i / (|R_k| ∨ 1). Then

    E[ max_{k : FDP̂(R_k) ≤ δ} FDP(R_k) ] ≤ (|H0|δ/n) (2 log(2n/(|H0|δ)) + log(8e⁵ log(8n/(|H0|δ)))) ≤ 4δ log(9/δ).

In other words, any procedure that chooses a set {i : p_i ≤ δk/n} satisfying |{i : p_i ≤ δk/n}| ≥ k is FDR-controlled at level O(δ log(1/δ)).

Recall that if k̂ = max{k : FDP̂(R_k) ≤ δ} then E[FDP(R_k̂)] ≤ δ by the standard BH result. When running the algorithm we recommend using BH at level δ, not level O(δ/log(1/δ)). As T_i gets very large, P_{i,T_i(t)} → P_{i,∗}, and we know that if BH is run on the P_{i,∗} at level δ then FDR would be controlled at level δ. We believe this inflation to be somewhat of an artifact of our proofs.

R_t is FWER-controlled. A core obstacle in our analysis is the fact that we do not know the cardinality of H1.
If we did know |H1| (and equivalently know |H0| = n|H1|) then a FWER+FWPD algorithm\nis equivalent to the so-called top-k multi-armed bandit problem [9, 8] and controlling FWER would\nbe relatively simple using a Bonferroni correction:\n) \u00b50}\u2318 \uf8ff|H 0| \n|H0|\nwhich implies FWER-. Comparing the \ufb01rst expression immediately above to the de\ufb01nition of Rt\nin the algorithm, it is clear our strategy is to use |St| as a surrogate for |H1|. Note that we could\n\nP\u21e3[1t=1{b\u00b5i,t (t,\n\n) \u00b50}\u2318 \uf8ff Xi2H0\n\n[1t=1{b\u00b5i,t (t,\n\nP\u21e3 [i2H0\n\nn|H1|\n\n\n|H0|\n\n\n\n6\n\n\fuse the bound |H0| = n |H 1|\uf8ff n to guarantee FWER-, but this could be very loose and induce\nan n log(n) sample complexity. Using |St| as a surrogate for |H1| in Rt is intuitive because by the\nFDR guarantee, we know |H1| E[|St \\H 1|] = E[|St|] E[|St \\H 0|] (1 )E[|St|], implying\nthat |H0| = n |H 1|\uf8ff n (1 )E[|St|] which may be much tighter than n if E[|St|] !|H 1|.\nBecause we only know |St| and not its expectation, the extra factors in the surrogate expression used\nin Rt are used to ensure correctness with high-probability (see Lemma 7).\n2.2 Sampling strategies to boost statistical power\nThe above discussion about controlling false alarms for St and Rt holds for any choice of arms It\nand Jt that may be pulled at time t. Thus, It and Jt are chosen in order to minimize the amount of\ntime necessary to add arms into St and Rt, respectively, and optimize the sample complexity.\n\u00b5i 8t 2 N}. Because is an anytime con\ufb01dence bound, E [|I|] (1)|H1|. If = min i2H1 \u00b5i\n\u00b50, then mini2I \u00b5i \u00b50 + and we claim that with probability at least 1 O() (Section C)\n\nTPR-, \u2327 setting implies \u21e0t = \u232bt = 1. 
De\ufb01ne the random set I = {i 2H 1 :b\u00b5i,Ti(t) + (Ti(t), ) \n\nP1t=1 1{It 2H 0,I 6\u2713 St}\uf8ff P1t=1 1{It 2H 0,b\u00b5It,TIt (t) + (TIt(t), ) \u00b50 + }\n\n\uf8ff c|H0|2 log(log(2/).\n\nThus once this number of samples has been taken, either I\u2713S t, or arms in I will be repeatedly\nsampled until they are added to St since each arm i 2I has its upper con\ufb01dence bound larger than\nthose arms in H0 by de\ufb01nition. It is clear that an arm in H1 that is repeatedly sampled will eventually\nbe added to St since its anytime p-value of (1) approaches 0 at an exponential rate as it is pulled, and\nBH selects for low p-values. A similar argument holds for Jt and adding arms to Rt.\nRemark 2. While the main objective of Algorithm 1 is to identify all arms with means above a\ngiven threshold, we note that prior to adding an arm to St in the TPR setting (i.e., when \u21e0t = 1)\nAlgorithm 1 behaves identically to the nearly optimal best-arm identi\ufb01cation algorithm lil\u2019UCB of\n[21]. Thus, whether the goal is best-arm identi\ufb01cation or to identify all arms with means above a\ncertain threshold, Algorithm 1 is applicable.\n\nP(Si2H1 [1t=1{b\u00b5i,t + (t, \n\nFWPD-, \u2327 setting is more delicate and uses in\ufb02ated values of \u21e0t and \u232bt. This time, we must ensure\nthat {H1 6\u2713 St} =) maxi2H1\\Sc\nt \u00b5i \u00b50 + . Because\nthen we could argue that either H1 \u21e2S t, or only arms in H1 are sampled until they are added to St\n(mirroring the TPR argument). As in the FWER setting above, if we knew the value of |H1| the we\ncould set \u21e0t |H 1| to observe that\n) < \u00b5i}\u2318 \uf8ff|H 1| \nwhich is less than , to guarantee such a condition. But we don\u2019t know |H1| so we use |St|\nas a surrogate, resulting in the in\ufb02ated de\ufb01nitions of \u21e0t and \u232bt relative to the TPR setting. 
The\n) \u00b50 + by the\nkey argument is that either I 6\u2713 St so that maxi2I\\Sc\nde\ufb01nition of I (since \u21e0t 1), or I\u21e2S t and |St| 1\n2|H1| with high probability which implies\n\u21e0t = max{2|St|,\n3 Main Results\n\nt b\u00b5i,Ti(t) + (Ti(t), ) mini2H1\\Sc\n) < \u00b5i}) \uf8ffPi2H1 P\u21e3[1t=1{b\u00b5i,t + (t, \nt b\u00b5i,Ti(t) + (Ti(t), \n\n3(14) log(1/)}|H 1| and the union bound of the display above holds.\n\n\u21e0t\n\n\u21e0t\n\n\u21e0t\n\n\u21e0t\n\n5\n\nIn what follows, we say f . g if there exists a c > 0 that is independent of all problem parameters\nand f \uf8ff cg. The theorems provide an upper bound on the sample complexity \u2327 2 N as de\ufb01ned in\nSection 1.1 for TPR-, \u2327 or FWER-, \u2327 that holds with probability at least 1 c for different values\nof c3. We begin with the least restrictive setting, resulting in the smallest sample complexity of all the\nresults presented in this work. Note the slight generalization in the below theorem where the means\nof H0 are assumed to be no greater than \u00b50.\nTheorem 2 (FDR, TPR). Let H1 = {i 2 [n] : \u00b5i > \u00b50}, H0 = {i 2 [n] : \u00b5i \uf8ff \u00b50}. De\ufb01ne\ni = \u00b5i \u00b50 for i 2H 1, = min i2H1 i, and i = minj2H1 \u00b5j \u00b5i =+ ( \u00b50 \u00b5i) for\n3 Each theorem relies on different events holding with high probability, and consequently a different c for\neach. To have c = 1 for each of the four settings, we would have had to de\ufb01ne different constants in the\nalgorithm for each setting. We hope the reader forgives us for this attempt at minimizing clutter.\n\n7\n\n\f2\n\ni\n\ni\n\nlog(log(2\n\ni\n\n2\n\ni\n\nlog(n log(2\n\n)/) \n\nPi2H0\n\n)/) +Pi2H1\n\n|St|_1 ] \uf8ff . Moreover, with probability at least 1 2 there\n\ni 2H 0. For all t 2 N we have E[|St\\H0|\nexists a T such that\nT . minn2 log(log(2)/),\n] 1 for all t T . 
Neither argument of the minimum follows from the other. If the means of $\mathcal{H}_1$ are very diverse, so that $\max_{i \in \mathcal{H}_1} \mu_i - \mu_0 \gg \min_{i \in \mathcal{H}_1} \mu_i - \mu_0$, then the second argument of the min in Theorem 2 can be tighter than the first. But as discussed above, this advantage is inconsequential if $|\mathcal{H}_1| = o(n)$. The remaining theorems are given in terms of just $\Delta$. The $\log\log(\Delta^{-2})$ dependence is due to inverting the confidence interval and is unavoidable on at least one arm when $\Delta$ is unknown a priori, due to the law of the iterated logarithm [27, 21, 22].

Informally, Theorem 2 states that if most true detections suffice while not making too many mistakes, then $O(n\Delta^{-2})$ samples suffice. The first argument of the min is known to be tight in a minimax sense up to doubly logarithmic factors due to the lower bound of [4]. As a consequence of this work, an algorithm inspired by Algorithm 1 in this setting is now in production at one of the largest A/B testing platforms on the web. The full proof of Theorem 2 (and all others) is given in the Appendix due to space constraints.

Theorem 3 (FDR, FWPD). For all $t \in \mathbb{N}$ we have $\mathbb{E}\big[\frac{|S_t \cap \mathcal{H}_0|}{|S_t| \vee 1}\big] \leq \delta$. Moreover, with probability at least $1 - 5\delta$, there exists a $T$ such that

$$T \lesssim (n - |\mathcal{H}_1|)\Delta^{-2}\log(\max\{|\mathcal{H}_1|, \log\log(n/\delta)\}\log(\Delta^{-2})/\delta) + |\mathcal{H}_1|\Delta^{-2}\log(\log(\Delta^{-2})/\delta)$$

and $\mathcal{H}_1 \subseteq S_t$ for all $t \geq T$.

Here $T$ roughly scales like $\Delta^{-2}\big[(n - |\mathcal{H}_1|)\max\{\log(|\mathcal{H}_1|), \log\log\log(n/\delta)\} + |\mathcal{H}_1|\big]$, where the $\log\log\log(n/\delta)$ term comes from a high-probability bound on the false discovery proportion for anytime p-values (in contrast to just expectation) in Lemma 2 that may be of independent interest. While negligible for all practical purposes, it appears unnatural and we suspect that this is an artifact of our analysis.
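The diverse-means effect can be checked numerically. The sketch below evaluates the two arguments of the min in Theorem 2 on an illustrative instance of our own choosing (not one of the paper's experiments), with absolute constants dropped as the $\lesssim$ notation permits; all gaps are kept below $e^{-1/2}$ so the iterated logarithms are positive:

```python
import math

def illog(gap, delta):
    # log(log(gap^-2)/delta); requires gap < 1/sqrt(e) so the inner log is positive
    return math.log(math.log(gap ** -2) / delta)

def bound_first(n, gap, delta):
    # first argument of the min: n * Delta^-2 * log(log(Delta^-2)/delta)
    return n * gap ** -2 * illog(gap, delta)

def bound_second(mu0, mu, delta):
    # second argument: sum over H0 of tilde-Delta_i^-2 * log(n log(tilde-Delta_i^-2)/delta)
    #                + sum over H1 of Delta_i^-2 * log(log(Delta_i^-2)/delta)
    n = len(mu)
    h1_gaps = [m - mu0 for m in mu if m > mu0]          # Delta_i for i in H1
    gap = min(h1_gaps)                                  # Delta
    h0_gaps = [gap + (mu0 - m) for m in mu if m <= mu0] # tilde-Delta_i for i in H0
    t0 = sum(g ** -2 * math.log(n * math.log(g ** -2) / delta) for g in h0_gaps)
    t1 = sum(g ** -2 * illog(g, delta) for g in h1_gaps)
    return t0 + t1

delta, mu0 = 0.05, 0.0
# 200 nulls at the baseline, 800 actual positives with diverse means in (0, 0.5]
mu = [0.0] * 200 + [0.01 + 0.49 * k / 799 for k in range(800)]
first = bound_first(len(mu), 0.01, delta)
second = bound_second(mu0, mu, delta)
# on this instance the diverse-means (second) bound is the smaller of the two
```

Here $|\mathcal{H}_1| = 4n/5$, so the second argument wins; shrinking $|\mathcal{H}_1|$ toward $o(n)$ lets the $\log(n \cdot)$ factor in the $\mathcal{H}_0$ sum erase the advantage, as noted above.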
We note that if $|\mathcal{H}_1| = \Omega(\log(n))$ then the sample complexity sheds this awkwardness⁴.

The next two theorems are concerned with controlling FWER on the set $R_t$ and determining how long it takes before the claimed detection conditions are satisfied on the set $R_t$. Note we still have that FDR is controlled on the set $S_t$, but now this set feeds into $R_t$.

Theorem 4 (FWER, FWPD). For all $t$ we have $\mathbb{E}\big[\frac{|S_t \cap \mathcal{H}_0|}{|S_t| \vee 1}\big] \leq \delta$. Moreover, with probability at least $1 - 6\delta$, we have $\mathcal{H}_0 \cap R_t = \emptyset$ for all $t \in \mathbb{N}$ and there exists a $T$ such that

$$T \lesssim (n - |\mathcal{H}_1|)\Delta^{-2}\log(\max\{|\mathcal{H}_1|, \log\log(n/\delta)\}\log(\Delta^{-2})/\delta) + |\mathcal{H}_1|\Delta^{-2}\log(\max\{n - (1 - 2\delta(1 + 4\delta))|\mathcal{H}_1|, \log\log(n/\delta)\}\log(\Delta^{-2})/\delta)$$

and $\mathcal{H}_1 \subseteq R_t$ for all $t \geq T$. Note, together this implies $\mathcal{H}_1 = R_t$ for all $t \geq T$.

Theorem 4 has the strongest conditions, and therefore the largest sample complexity. Ignoring $\log\log\log(n)$ factors, $T$ roughly scales as $\Delta^{-2}\big[(n - |\mathcal{H}_1|)\log(|\mathcal{H}_1|) + |\mathcal{H}_1|\log(n - (1 - 2\delta(1 + 4\delta))|\mathcal{H}_1|)\big]$. Inspecting the top-$k$ lower bound of [8], where the arms' means in $\mathcal{H}_1$ are equal to $\mu_0 + \Delta$, the arms' means in $\mathcal{H}_0$ are equal to $\mu_0$, and the algorithm has knowledge of the cardinality of $\mathcal{H}_1$, a necessary sample complexity of $\Delta^{-2}\big[(n - |\mathcal{H}_1|)\log(|\mathcal{H}_1|) + |\mathcal{H}_1|\log(n - |\mathcal{H}_1|)\big]$ is given. It is not clear whether this small difference of $\log(n - (1 - 2\delta(1 + 4\delta))|\mathcal{H}_1|)$ versus $\log(n - |\mathcal{H}_1|)$ is an artifact of our analysis, or a fundamental limitation when the cardinality $|\mathcal{H}_1|$ is unknown. We now state our final theorem.

Theorem 5 (FWER, TPR). For all $t$ we have $\mathbb{E}\big[\frac{|S_t \cap \mathcal{H}_0|}{|S_t| \vee 1}\big] \leq \delta$.
Moreover, with probability at least $1 - 7\delta$, we have $\mathcal{H}_0 \cap R_t = \emptyset$ for all $t \in \mathbb{N}$ and there exists a $T$ such that

$$T \lesssim (n - |\mathcal{H}_1|)\Delta^{-2}\log(\log(\Delta^{-2})/\delta) + |\mathcal{H}_1|\Delta^{-2}\log(\max\{n - (1 - \eta)|\mathcal{H}_1|, \log\log(n\log(1/\delta)/\delta)\}\log(\Delta^{-2})/\delta)$$

and $\mathbb{E}\big[\frac{|R_t \cap \mathcal{H}_1|}{|\mathcal{H}_1|}\big] \geq 1 - \delta$ for all $t \geq T$, where $\eta = \delta\big(1 - 3\sqrt{2\log(1/\delta)/|\mathcal{H}_1|}\big)$.

⁴In the asymptotic $n$ regime, it is common to study the case when $|\mathcal{H}_1| = n^\beta$ for $\beta \in (0, 1)$ [4, 13].

4 Experiments

The distribution of each arm is $\nu_i = N(\mu_i, 1)$ where $\mu_i = \mu_0 = 0$ if $i \in \mathcal{H}_0$, and $\mu_i > 0$ if $i \in \mathcal{H}_1$. We consider three algorithms: i) uniform allocation with anytime BH selection as done in Algorithm 1, ii) successive elimination (SE) (see Appendix G)⁵, which performs uniform allocation on only those arms that have not yet been selected by BH, and iii) Algorithm 1 (UCB). Algorithm 1 and the BH selection rule for all algorithms use

$$\phi(t, \delta) = \sqrt{\frac{2\log(1/\delta) + 6\log\log(1/\delta) + 3\log(\log(et/2))}{t}}$$

from [25, Theorem 8]. In addition, we ran BH at level $\delta$ instead of $\delta/(6.4\log(36/\delta))$ as discussed in Section 3. Here we present the sample complexity for TPR+FDR with $\delta = 0.05$ and different parameterizations of $\mu$, $n$, $|\mathcal{H}_1|$.

The first panel shows an empirical estimate of $\mathbb{E}\big[\frac{|S_t \cap \mathcal{H}_1|}{|\mathcal{H}_1|}\big]$ at each time $t$ for each algorithm, averaged over 1000 trials. The black dashed line on the first panel denotes the level $\mathbb{E}\big[\frac{|S_t \cap \mathcal{H}_1|}{|\mathcal{H}_1|}\big] = 1 - \delta = .95$, and corresponds to the dashed black line on the second panel. The right four panels show the number of samples each algorithm takes before the true positive rate exceeds $1 - \delta = .95$, relative to the number of samples taken by UCB, for various parameterizations. Panels two, three, and four have $\Delta_i = \Delta$ for $i \in \mathcal{H}_1$, while panel five is a case where the $\Delta_i$'s are linear for $i \in \mathcal{H}_1$. While the differences are most clear on the second panel when $|\mathcal{H}_1| = 2 = o(n)$, over all cases UCB uses at least $\approx 3$ times fewer samples than uniform and SE.
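The confidence width $\phi(t, \delta)$ used in the experiments is simple to implement; the following is a direct transcription of the formula above (the function name is ours, and the expression is valid for $t \geq 2$ and $\delta < 1/e$ so that the inner logarithms are nonnegative):

```python
import math

def phi(t, delta):
    """Anytime confidence width phi(t, delta) from [25, Theorem 8], as used by
    Algorithm 1 and the BH selection rule in the experiments.
    Valid for t >= 2 and delta < 1/e (inner logs nonnegative)."""
    return math.sqrt((2 * math.log(1 / delta)
                      + 6 * math.log(math.log(1 / delta))
                      + 3 * math.log(math.log(math.e * t / 2))) / t)
```

The width shrinks roughly like $\sqrt{\log\log t / t}$, so for a fixed $\delta$ it decreases as an arm accumulates samples, e.g. `phi(100, 0.05) > phi(10000, 0.05)`; an arm's anytime lower confidence bound at time $t$ is then $\hat{\mu}_{i,t} - \phi(t, \delta)$.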
For FDR+TPR, Appendix G shows that uniform sampling roughly has a sample complexity that scales like $n\Delta^{-2}\log\big(\frac{n}{\delta|\mathcal{H}_1|}\big)$, while SE's is upper bounded by $\min\big\{n\Delta^{-2}\log\big(\frac{n}{\delta|\mathcal{H}_1|}\big),\ (n - |\mathcal{H}_1|)\Delta^{-2}\log\big(\frac{n}{\delta|\mathcal{H}_1|}\big) + \sum_{i \in \mathcal{H}_1}\Delta_i^{-2}\log(n)\big\}$. Comparing with Theorem 2 for the different cases (i.e., $|\mathcal{H}_1| = 2, \sqrt{n}, n/5$) provides insight into the relative differences between UCB, uniform, and SE on the different panels.

Acknowledgments

This work was informed and inspired by early discussions with Aaditya Ramdas on methods for controlling the false discovery rate (FDR) in multiple testing; we are grateful to have learned from a leader in the field. We also thank him for his careful reading and feedback. We'd also like to thank Martin J. Zhang for his input. We also thank the leading experimentation and A/B testing platform on the web, Optimizely, for its support, insight into its customers' needs, and for committing engineering time to implementing this research into their platform [12]. In particular, we thank Whelan Boyd, Jimmy Jin, Pete Koomen, Sammy Lee, Ajith Mascarenhas, Sonesh Surana, and Hao Xia at Optimizely for their efforts.

⁵Inspired by the best-arm identification literature [19].

References

[1] Linhui Hao, Akira Sakurai, Tokiko Watanabe, Ericka Sorensen, Chairul A Nidom, Michael A Newton, Paul Ahlquist, and Yoshihiro Kawaoka. Drosophila RNAi screen identifies host genes important for influenza virus replication. Nature, 454(7206):890, 2008.

[2] GJ Rocklin, TM Chidyausiku, I Goreshnik, A Ford, S Houliston, A Lemak, L Carter, R Ravichandran, VK Mulligan, A Chevalier, CH Arrowsmith, and D Baker. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science, 357:168-175, 2017.

[3] Ramesh Johari, Leo Pekelis, and David J Walsh. Always valid inference: Bringing sequential analysis to A/B testing. arXiv preprint arXiv:1512.04922, 2015.

[4] Rui M Castro.
Adaptive sensing performance lower bounds for sparse signal detection and support estimation. Bernoulli, 20(4):2217-2246, 2014.

[5] Maxim Rabinovich, Aaditya Ramdas, Michael I Jordan, and Martin J Wainwright. Optimal rates and tradeoffs in multiple testing. arXiv preprint arXiv:1705.05391, 2017.

[6] A. Ramdas, R. Foygel Barber, M. J. Wainwright, and M. I. Jordan. A unified treatment of multiple testing with prior knowledge. arXiv e-prints, March 2017.

[7] Matthew L Malloy and Robert D Nowak. Sequential testing for sparse recovery. IEEE Transactions on Information Theory, 60(12):7862-7873, 2014.

[8] Max Simchowitz, Kevin Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. In Conference on Learning Theory, pages 1794-1834, 2017.

[9] Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits. In ICML, volume 12, pages 655-662, 2012.

[10] Fanny Yang, Aaditya Ramdas, Kevin G Jamieson, and Martin J Wainwright. A framework for multi-A(rmed)/B(andit) testing with online FDR control. In Advances in Neural Information Processing Systems, pages 5959-5968, 2017.

[11] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), pages 289-300, 1995.

[12] Optimizely. Accelerating experimentation through machine learning, https://help.optimizely.com/Build_Campaigns_and_Experiments/Stats_Accelerator, 2018.

[13] Jarvis Haupt, Rui M Castro, and Robert Nowak. Distilled sensing: Adaptive sampling for sparse detection and estimation. IEEE Transactions on Information Theory, 57(9):6222-6235, 2011.

[14] Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem.
In International Conference on Machine Learning, pages 1690-1698, 2016.

[15] Hideaki Kano, Junya Honda, Kentaro Sakamaki, Kentaro Matsuura, Atsuyoshi Nakamura, and Masashi Sugiyama. Good arm identification via bandit feedback. arXiv preprint arXiv:1710.06360, 2017.

[16] Lijie Chen, Jian Li, and Mingda Qiao. Nearly instance optimal sample complexity bounds for top-k arm selection. In Artificial Intelligence and Statistics, pages 101-110, 2017.

[17] Shouyuan Chen, Tian Lin, Irwin King, Michael R Lyu, and Wei Chen. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, pages 379-387, 2014.

[18] Tongyi Cao and Akshay Krishnamurthy. Disagreement-based combinatorial pure exploration: Efficient algorithms and an analysis with localization. arXiv preprint arXiv:1711.08018, 2017.

[19] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079-1105, 2006.

[20] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1238-1246, 2013.

[21] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil'UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423-439, 2014.

[22] Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identification. In Conference on Learning Theory, pages 535-592, 2017.

[23] Reinhard Heckel, Max Simchowitz, Kannan Ramchandran, and Martin Wainwright. Approximate ranking from pairwise comparisons. In International Conference on Artificial Intelligence and Statistics, pages 1057-1066, 2018.

[24] A. Balsubramani.
Sharp finite-time iterated-logarithm martingale concentration. arXiv e-prints, May 2014.

[25] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17(1):1-42, 2016.

[26] Ervin Tanczos, Robert Nowak, and Bob Mankoff. A KL-LUCB algorithm for large-scale crowdsourcing. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5896-5905. 2017.

[27] Philip Hartman and Aurel Wintner. On the law of the iterated logarithm. American Journal of Mathematics, 63(1):169-176, 1941.

[28] Yoav Benjamini and Daniel Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, pages 1165-1188, 2001.

[29] Maxim Raginsky and Alexander Rakhlin. Lower bounds for passive and active learning. In Advances in Neural Information Processing Systems, pages 1026-1034, 2011.

[30] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.

[31] Pascal Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3):1269-1283, 1990.