{"title": "A framework for Multi-A(rmed)/B(andit) Testing with Online FDR Control", "book": "Advances in Neural Information Processing Systems", "page_first": 5957, "page_last": 5966, "abstract": "We propose an alternative framework to existing setups for controlling false alarms when multiple A/B tests are run over time. This setup arises in many practical applications, e.g. when pharmaceutical companies test new treatment options against control pills for different diseases, or when internet companies test their default webpages versus various alternatives over time. Our framework proposes to replace a sequence of A/B tests by a sequence of best-arm MAB instances, which can be continuously monitored by the data scientist. When interleaving the MAB tests with an online false discovery rate (FDR) algorithm, we can obtain the best of both worlds: low sample complexity and any time online FDR control. Our main contributions are: (i) to propose reasonable definitions of a null hypothesis for MAB instances; (ii) to demonstrate how one can derive an always-valid sequential p-value that allows continuous monitoring of each MAB test; and (iii) to show that using rejection thresholds of online-FDR algorithms as the confidence levels for the MAB algorithms results in both sample-optimality, high power and low FDR at any point in time. We run extensive simulations to verify our claims, and also report results on real data collected from the New Yorker Cartoon Caption contest.", "full_text": "A framework for Multi-A(rmed)/B(andit) Testing\n\nwith Online FDR Control\n\nFanny Yang\n\nDept. of EECS, U.C. Berkeley\nfanny-yang@berkeley.edu\n\nKevin Jamieson\n\nAllen School of CSE, U. of Washington\n\njamieson@cs.washington.edu\n\nAaditya Ramdas\n\nDept. of EECS and Statistics, U.C. Berkeley\n\nramdas@berkeley.edu\n\nMartin Wainwright\n\nDept. of EECS and Statistics, U.C. 
Berkeley
wainwrig@berkeley.edu

Abstract

We propose an alternative framework to existing setups for controlling false alarms when multiple A/B tests are run over time. This setup arises in many practical applications, e.g. when pharmaceutical companies test new treatment options against control pills for different diseases, or when internet companies test their default webpages versus various alternatives over time. Our framework proposes to replace a sequence of A/B tests by a sequence of best-arm MAB instances, which can be continuously monitored by the data scientist. When interleaving the MAB tests with an online false discovery rate (FDR) algorithm, we can obtain the best of both worlds: low sample complexity and any-time online FDR control. Our main contributions are: (i) to propose reasonable definitions of a null hypothesis for MAB instances; (ii) to demonstrate how one can derive an always-valid sequential p-value that allows continuous monitoring of each MAB test; and (iii) to show that using rejection thresholds of online-FDR algorithms as the confidence levels for the MAB algorithms results in sample-optimality, high power, and low FDR at any point in time. We run extensive simulations to verify our claims, and also report results on real data collected from the New Yorker Cartoon Caption contest.

1 Introduction

Randomized trials are the default option to determine whether potential improvements of an alternative method (e.g. website design for a tech company, or medication in clinical trials for pharmaceutical companies) are significant compared to a well-established default. In the applied domain, this is often colloquially referred to as A/B testing, or A/B/n testing for several alternatives. The standard practice is to divert a small amount of the traffic or patients to the alternative and control. 
If an alternative appears to be significantly better, it is implemented; otherwise, the default setting is maintained. At first glance, this procedure seems intuitive and simple. However, in cases where the aim is to optimize over one particular metric, one can do better. In particular, this common tool suffers from several downsides. (1) First, one may wish to allocate more traffic to a better treatment if it is clearly better. Yet typical A/B/n testing frameworks split the traffic uniformly over alternatives. Adaptive techniques should help to detect better alternatives faster. (2) Second, companies often desire to continuously monitor an ongoing A/B test, as they may adjust their termination criteria as time goes by and possibly stop earlier or later than originally intended. However, this practice may result in many more false alarms if not properly accounted for. This is one of the reasons for the lack of reproducibility of scientific results, an issue recently receiving increased attention from the public media. (3) Third, the lack of sufficient evidence or an insignificant improvement of the metric may make it undesirable from a practical or financial perspective to replace the default. Therefore, when a company runs hundreds to thousands of A/B tests within a year, ideally the number of statistically insignificant changes that it made should be small relative to the total number of changes made. While controlling the false alarm rate of each individual test does not achieve this type of false discovery rate (FDR) control, there are known procedures in the multiple testing literature that are tailored to this problem.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In this paper, we provide a novel framework that addresses the above shortcomings of A/B or A/B/n testing. 
The first concern is tackled by employing recent advances in adaptive sampling, such as pure-exploration multi-armed bandit (MAB) algorithms. For the second concern, we adopt the notion of any-time p-values for guilt-free continuous monitoring. Finally, we handle the third issue using recent results in online FDR control. Hence the combined framework can be described as doubly-sequential (sequences of MAB tests, each of which is itself sequential). Although each of those problems has been studied in hitherto disparate communities, how to leverage the best of all worlds, if at all possible, has remained an open problem. The main contributions of this paper are in successfully merging these ideas in a meta framework and presenting the conditions under which it can be shown to yield near-optimal sample complexity and FDR control.
The remainder of this paper is organized as follows. In Section 2, we lay out the conceptual challenges that we address in the paper, and describe a meta-algorithm that combines adaptive sampling strategies with FDR control procedures. Section 3 is devoted to the description of a concrete procedure, along with some theoretical guarantees on its properties. In Section 4, we discuss some results of our extensive experiments on both simulated and real-world data sets available to us.
2 Formal experimental setup and a meta-algorithm
In this section we provide a high-level overview of our proposed combined framework aimed at addressing the shortcomings mentioned in the introduction. A specific instantiation of this meta-algorithm is specified in Section 3, along with detailed theoretical guarantees.
For concreteness, we refer to the system designer, whether a tech company or a pharmaceutical company, as a (data) scientist. We assume that the scientist may need to conduct an infinite number of experiments sequentially, indexed by j. 
Each experiment has one default setting, referred to as the control, and K = K(j) alternative settings, called the treatments or alternatives. The scientist must return one of the K + 1 options that is the "best" according to some predefined metric, before the next experiment is started. Such a setup is a simple mathematical model both for clinical trials run by pharmaceutical labs, and for A/B/n testing used at scale by tech companies.
One full experiment consists of a sequence of steps. In each step, the scientist assigns a new person to one of the K + 1 options and observes an outcome. In practice, the role of the scientist could be taken by an adaptive algorithm, which determines each assignment by careful consideration of all previous outcomes. Borrowing terminology from the multi-armed bandit (MAB) literature, we refer to each of the K + 1 options as an arm, and each assignment to arm i is termed "pulling arm i". For concreteness, we assign the index 0 to the control arm and note that it is known to the algorithm. Furthermore, we assume that the observable metric from each pull of arm i = 0, 1, . . . , K corresponds to an independent draw from an unknown probability distribution with expectation μ_i. In the sequel we use μ_{i⋆} := max_{i=1,...,K} μ_i to denote the mean of the best arm. We refer the reader to Table 1 in Appendix A for a glossary of the notation used throughout this paper.

2.1 Some desiderata and difficulties
Given the setup above, how can we mathematically describe the guarantees that the companies might desire from an improved multiple-A/B/n testing framework? For which parts can we leverage known results, and what challenges remain?
For the purpose of addressing the first question, let us adopt terminology from the hypothesis testing literature and view each experiment as a test of a null hypothesis. 
Any claim that an alternative arm is the best is called a discovery, and when such a claim is erroneous, it is called a false discovery. When multiple hypotheses are to be tested, the scientist needs to define the quantity they want to control. While we may desire that the probability of even a single false discovery is small, this is usually far too stringent for a large and unknown number of tests and results in low power. For this reason, [1] proposed that it may be more useful to control the expected ratio of false discoveries to the total number of discoveries (called the False Discovery Rate, or FDR for short) or the ratio of the expected number of false discoveries to the expected number of total discoveries (called the modified FDR, or mFDR for short). Over the past decades, the FDR and its variants like the mFDR have become standard quantities for multiple testing applications. In the following, if not otherwise specified, we use the term FDR to denote both measures in order to simplify the presentation. In Section 3, we show that both mFDR and FDR can be controlled for different choices of procedures.
2.1.1 Challenges in viewing an MAB instance as a hypothesis test
In our setup, we want to be able to control the FDR at any time in an online manner. Online FDR procedures were first introduced by Foster and Stine [2], and have since been studied by other authors (e.g., [3, 4]). They are based on comparing a valid p-value P^j with carefully chosen levels α_j for each hypothesis test.¹ We reject the null hypothesis, represented as R_j = 1, when P^j ≤ α_j, and we set R_j = 0 otherwise.
As mentioned, we want to use adaptive MAB algorithms to test each hypothesis, since they can find a best arm among K + 1 with near-optimal sample complexity. 
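To make the FDR and mFDR quantities above concrete, here is a minimal illustrative sketch (not part of the paper's code; the function name is ours): it computes the realized false discovery proportion from indicator arrays of rejections and true nulls. The FDR is the expectation of this quantity over repeated runs, while the mFDR instead takes the ratio of the two expectations (with +1 in the denominator).

```python
import numpy as np

def false_discovery_proportion(rejected, is_true_null):
    """Realized FDP: false discoveries / max(total discoveries, 1).

    Averaging this quantity over repeated runs estimates the FDR.
    """
    rejected = np.asarray(rejected, dtype=bool)
    is_true_null = np.asarray(is_true_null, dtype=bool)
    false_discoveries = np.sum(rejected & is_true_null)
    return false_discoveries / max(int(np.sum(rejected)), 1)

# Toy example: 5 tests, tests 0, 3 and 4 are true nulls.
rejected     = [True, True, False, True, False]
is_true_null = [True, False, False, True, True]
print(false_discovery_proportion(rejected, is_true_null))  # 2 false among 3 discoveries -> 2/3
```

Note the `max(..., 1)` in the denominator, which mirrors the "∨ 1" convention of the FDR definition so that zero rejections yield an FDP of zero rather than a division error.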
However, the traditional MAB setup does not account for the asymmetry between the arms that is present in a testing setup, with one arm being the default (control) and the others being alternatives (treatments). This is the standard scenario in A/B/n testing applications: e.g., a company might prefer wrong claims that the control is the best (false negative) over wrong claims that an alternative is the best (false positive), simply because system-wide adoption of a selected alternative might involve high costs. What would be a suitable null hypothesis in this hybrid setting? For the sake of continuous monitoring, is it possible to define and compute always-valid p-values that are super-uniformly distributed under the null hypothesis when computed at any time t?
In addition to asymmetry, the practical scientist might have a different incentive than the ideal outcome for MAB algorithms, as they might not want to find the best alternative if it is not substantially better than the control. Indeed, if the net gain is small, it might be offset by the cost of implementing the change from the existing default choice. By similar reasoning, we may not require identifying the single best arm if there is a set of arms with similar means all larger than the rest. We propose a sensible null hypothesis for each experiment which incorporates the approximation and minimum-improvement requirements described above, and provide an always valid p-value which can be easily calculated at each time step in the experiment. We show that a slight modification of the usual LUCB algorithm caters to this specific null hypothesis while still maintaining near-optimal sample complexity.

Figure 1: Diagram of our MAB-FDR meta algorithm. 
The green solid arrows symbolize interaction between the MAB and FDR procedures via the FDR test levels α_j and rejection indicator variables R_j. Notice that the P^j-values are now dependent, as each α_j depends on the past rejections R_1, . . . , R_{j−1}. The eyes represent possible continuous monitoring by the scientist.

2.1.2 Interaction between MAB and FDR
In order to take advantage of the sample efficiency of best-arm bandit algorithms, it is crucial to set the confidence levels close to what is needed. Given a user-defined level α, at each hypothesis j, online FDR procedures automatically output a significance level α_j which is sufficient to guarantee FDR control, based on past decisions. Can we directly set the MAB confidence levels to these output levels α_j? If we do, our p-values are no longer independent across different hypotheses: P^j directly depends on the FDR level α_j, and each α_j in turn depends on past MAB rejections, and thus on past MAB p-values (see Figure 1). 

¹A valid P^j must be stochastically dominated by a uniform distribution on [0, 1], which we henceforth refer to as super-uniformly distributed.
Does the new interaction compromise FDR guarantees? Although known procedures as in [2, 4] guarantee FDR control for independent p-values, this does not hold for dependent p-values in general. Hence FDR control guarantees cannot simply be obtained out of the box. A key insight that emerges from our analysis is that an appropriate bandit algorithm actually shapes the p-value distribution under the null in a "good" way that allows us to control FDR.
2.2 A meta-algorithm
Procedure 1 summarizes our doubly-sequential procedure, with a corresponding flowchart in Figure 1. We will prove theoretical guarantees after instantiating the separate modules. Note that our framework allows the scientist to plug in their favorite best-arm MAB algorithm or online FDR procedure. The choice of each determines which guarantees can be proven for the entire setup. Any independent improvement in either of the two parts would immediately boost the performance of the overall framework.

Procedure 1 MAB-FDR Meta algorithm skeleton

1. The scientist sets a desired FDR control rate α.
2. For each j = 1, 2, . . . 
:
• Experiment j receives a designated control arm and some number of alternative arms.
• An online-FDR procedure returns an α_j that is some function of the past values {P^ℓ}_{ℓ=1}^{j−1}.
• An MAB procedure is executed with inputs (a) the control arm and the K(j) alternative arms and (b) confidence level α_j; it maintains an always valid p-value for each t and, if the procedure self-terminates, returns a recommended arm.
• When the MAB procedure is terminated at time t, by itself or by the user: if the arm with the highest empirical mean is not the control arm and P^j_t ≤ α_j, then we return P^j := P^j_t, and the control arm is rejected in favor of this empirically best arm.

3 A concrete procedure with guarantees
We now take the high-level road map given in Procedure 1, and show that we can obtain a concrete, practically implementable framework with FDR control and power guarantees. We first discuss the key modeling decisions we have to make in order to seamlessly embed MAB algorithms into an online FDR framework. We then outline a modified version of a commonly used best-arm algorithm, before we finally prove FDR and power guarantees for the concrete combined procedure.
3.1 Defining null hypotheses and constructing p-values
Our first task is to define a null hypothesis for each experiment. As mentioned before, the choice of the null is not immediately obvious, since we sample from multiple distributions adaptively instead of independently. In particular, we will generally not have the same number of samples for all arms. Given a default mean μ_0 and alternative means {μ_i}_{i=1}^K, we propose that the null hypothesis for the j-th experiment should be defined as

H^j_0 : μ_0 ≥ μ_i − ε   for all i = 1, . . . , K,   (1)

where we usually omit the index j for simplicity. 
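To fix ideas, the following sketch (ours, not from the paper's code) encodes the null hypothesis (1) and the outer loop of the Procedure 1 skeleton; the online-FDR rule and the MAB test are left as pluggable stubs with hypothetical interfaces (`next_level`, `mab_test`), exactly as the meta-algorithm intends:

```python
def is_true_null(mu0, mus, eps=0.0):
    """Null hypothesis (1): the control mean is within eps of every alternative."""
    return all(mu0 >= mu_i - eps for mu_i in mus)

def meta_algorithm(experiments, fdr_procedure, mab_test):
    """Skeleton of Procedure 1: interleave MAB tests with an online FDR rule.

    `fdr_procedure.next_level(past_rejections)` and `mab_test(arms, alpha)`
    are placeholders for the concrete components instantiated in Section 3;
    `mab_test` returns an always-valid p-value and a recommended arm index.
    """
    rejections = []
    for arms in experiments:                       # experiment j = 1, 2, ...
        alpha_j = fdr_procedure.next_level(rejections)
        p_value, best_arm = mab_test(arms, alpha_j)
        # Reject the control only if the empirically best arm is an
        # alternative AND the always-valid p-value clears alpha_j.
        rejections.append(best_arm != 0 and p_value <= alpha_j)
    return rejections
```

The key structural point is visible in the loop: each α_j is a function of past rejections only, which is what later allows the dependence between experiments to be handled.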
It remains to define an always valid p-value (previously defined by Johari et al. [5]) for each experiment for the purpose of continuous monitoring. It is defined as a stochastic process {P_t}_{t=1}^∞ such that for all fixed and random stopping times T, under any distribution P_0 over the arm rewards such that the null hypothesis is true, we have

P_0(P_T ≤ α) ≤ α.   (2)

When all arms are drawn independently an equal number of times, by linearity of expectation one can regard the difference of each pair of samples as a random variable drawn i.i.d. from a distribution with mean μ̃ := μ_0 − μ_i. We can then view the problem as testing the standard hypothesis H^j_0 : μ̃ ≥ −ε. However, when the arms are pulled adaptively, a different solution needs to be found: indeed, in this case, the sample means are not unbiased estimators of the true means, since the number of times an arm was pulled now depends on the empirical means of all the arms.
Our strategy is to construct always valid p-values by using the fact that p-values can be obtained by inverting confidence intervals. 
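Numerically, this inversion amounts to scanning candidate confidence levels γ and reporting the largest one at which the data do not yet contradict the null. The sketch below (ours; the grid scan and its fallback are implementation choices, not the paper's code) uses an LIL-type confidence width of the form introduced next in equation (3), assuming 1-sub-Gaussian rewards:

```python
import numpy as np

def phi(n, delta):
    """LIL-type confidence width (cf. equation (3)); delta is clipped at 0.1."""
    d = np.minimum(delta, 0.1)
    return np.sqrt((np.log(1 / d) + 3 * np.log(np.log(1 / d))
                    + 1.5 * np.log(np.log(np.e * n))) / n)

def single_arm_p_value(mu_hat_i, n_i, mu_hat_0, n_0, K, eps=0.0, grid=10_000):
    """Largest gamma at which the lower bound for arm i still fails to beat
    the upper bound for the control arm by more than eps (cf. equation (5));
    a grid scan approximates the supremum."""
    gammas = np.linspace(1e-4, 1.0, grid)
    feasible = (mu_hat_i - phi(n_i, gammas / (2 * K))
                <= mu_hat_0 + phi(n_0, gammas / 2) + eps)
    # If no grid point is feasible, the true supremum lies below the grid;
    # returning the smallest grid value is then a conservative upper bound.
    return gammas[feasible].max() if feasible.any() else gammas[0]

def always_valid_p(single_arm_ps):
    """Running minimum over time (rows) and arms (columns), cf. equation (6)."""
    return np.minimum.accumulate(np.min(np.asarray(single_arm_ps), axis=1))
```

With no evidence against the null (equal empirical means), the scan reaches γ = 1 and the p-value is 1; with a large empirical gap, only tiny γ remain feasible and the p-value collapses toward the grid floor.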
To construct always-valid confidence bounds, we resort to the fundamental concept of the law of the iterated logarithm (LIL), for which non-asymptotic versions have been recently derived and used for both bandits and testing problems (see [6], [7]). To elaborate, define the function

φ_n(δ) = √( [ log(1/δ) + 3 log(log(1/δ)) + (3/2) log(log(en)) ] / n ).   (3)

If μ̂_{i,n} is the empirical average of independent samples from a sub-Gaussian distribution, then it is known (see, for instance, [8, Theorem 8]) that for all δ ∈ (0, 1), we have

max{ P( ∪_{n=1}^∞ {μ̂_{i,n} − μ_i > φ_n(δ ∧ 0.1)} ), P( ∪_{n=1}^∞ {μ̂_{i,n} − μ_i < −φ_n(δ ∧ 0.1)} ) } ≤ δ,   (4)

where δ ∧ 0.1 := min{δ, 0.1}.
We are now ready to propose single-arm p-values of the form

P_{i,t} := sup{ γ ∈ [0, 1] | μ̂_{i,n_i(t)} − φ_{n_i(t)}(γ/(2K)) ≤ μ̂_{0,n_0(t)} + φ_{n_0(t)}(γ/2) + ε } = sup{ γ ∈ [0, 1] | LCB_i(t) ≤ UCB_0(t) + ε }.   (5)

Here we set P_{i,t} = 1 if the supremum is taken over an empty set. Given these single-arm p-values, the always-valid p-value for the experiment is defined as

P_t := min_{s ≤ t} min_{i=1,...,K} P_{i,s}.   (6)

We claim that this procedure leads to an always valid p-value (with proof in Appendix C).
Proposition 1. 
The sequence {P_t}_{t=1}^∞ defined via equation (6) is an always valid p-value.
3.2 Adaptive sampling for best-arm identification
In the traditional A/B testing setting described in the introduction, samples are allocated uniformly to the different alternatives. But by allowing adaptivity, decisions can be made with the same statistical significance using far fewer samples. Suppose moreover that there is a unique maximizer i⋆ := arg max_{i=0,1,...,K} μ_i, so that Δ_i := μ_{i⋆} − μ_i > 0 for all i ≠ i⋆. Then for any δ ∈ (0, 1), best-arm MAB algorithms can identify i⋆ with probability at least 1 − δ based on at most² Σ_{i≠i⋆} Δ_i^{−2} log(1/δ) total samples (see the paper [9] for a brief survey and [10] for an application to clinical trials). In contrast, if samples are allocated uniformly to the alternatives under the same conditions, then the most natural procedures require K max_{i≠i⋆} Δ_i^{−2} log(K/δ) samples before returning i⋆ with probability at least 1 − δ.
However, standard best-arm bandit algorithms do not by default incorporate the asymmetry induced by null hypotheses as in definition (1). Furthermore, recall that a practical scientist might desire the ability to incorporate an approximation and a minimum improvement requirement. More precisely, it is natural to consider the requirement that the returned arm i_b satisfies the bounds μ_{i_b} ≥ μ_0 + ε and μ_{i_b} ≥ μ_{i⋆} − ε for some ε > 0. In Algorithm 1 we present a modified MAB algorithm based on the common LUCB algorithm (see [11, 12]) which incorporates the above desiderata. We provide a visualization of how ε affects the usual stopping condition in Figure 4 in Appendix A.1.
The following proposition applies to Algorithm 1 run with a control arm indexed by i = 0 with mean μ_0 and alternative arms indexed by i = 1, . . . , K with means μ_i, respectively. 
Let i_b denote the random arm returned by the algorithm assuming that it exits, and define the set

S⋆ := { i⋆ ≠ 0 | μ_{i⋆} ≥ max_{i=1,...,K} μ_i − ε and μ_{i⋆} > μ_0 + ε }.   (7)

²Here we have ignored some doubly-logarithmic factors.

Algorithm 1 Best-arm identification with a control arm, for confidence δ and precision ε ≥ 0
For all t, let n_i(t) be the number of times arm i has been pulled up to time t. In addition, for each arm i let μ̂_i(t) = (1/n_i(t)) Σ_{τ=1}^{n_i(t)} r_i(τ), and define
UCB_i(t) := μ̂_{i,n_i(t)} + φ_{n_i(t)}(δ/(2K)) and LCB_i(t) := μ̂_{i,n_i(t)} − φ_{n_i(t)}(δ/2).
1. Set t = 1 and sample every arm once.
2. Repeat: Compute h_t = arg max_{i=0,1,...,K} μ̂_i(t), and ℓ_t = arg max_{i=0,1,...,K, i≠h_t} UCB_i(t).
(a) If LCB_0(t) > UCB_i(t) − ε for all i ≠ 0, then output 0 and terminate. Else if LCB_{h_t}(t) > UCB_{ℓ_t}(t) − ε and LCB_{h_t}(t) > UCB_0(t) + ε, then output h_t and terminate.
(b) If ε > 0, let u_t = arg max_{i≠0} UCB_i(t) and pull all distinct arms in {0, u_t, h_t, ℓ_t} once. If ε = 0, pull arms h_t and ℓ_t. Set t = t + 1.

Note that the mean associated with any index i⋆ ∈ S⋆, assuming that the set is non-empty, is guaranteed to be ε-superior to the control mean, and at most ε-inferior to the maximum mean over all arms.
Proposition 2. Algorithm 1 terminates in finite time with probability one. Furthermore, suppose that the samples from each arm are independent and sub-Gaussian with scale 1. Then for any δ ∈ (0, 1) and ε ≥ 0, Algorithm 1 has the following guarantees:
(a) Suppose that μ_0 > max_{i=1,...,K} μ_i − ε. Then with probability at least 1 − δ, the algorithm exits with i_b = 0 after taking at most O( Σ_{i=0}^K Δ̃_i^{−2} log(K log(Δ̃_i^{−2})/δ) ) time steps, with effective gaps
Δ̃_0 = (μ_0 + ε) − max_{j=1,...,K} μ_j and Δ̃_i = (μ_0 + ε) − μ_i.
(b) Otherwise, suppose that the set S⋆ as defined in equation (7) is non-empty. Then with probability at least 1 − δ, the algorithm exits with i_b ∈ S⋆ after taking at most O( Σ_{i=0}^K Δ̃_i^{−2} log(K log(Δ̃_i^{−2})/δ) ) time steps, with effective gaps
Δ̃_0 = min{ max_{j=1,...,K} μ_j − (μ_0 + ε), max{Δ_0, ε} } and Δ̃_i = max{ Δ_i, min{ max_{j=1,...,K} μ_j − (μ_0 + ε), ε } }.

See Appendix D for the proof of this claim. Part (a) of Proposition 2 guarantees that when no alternative arm is ε-superior to the control arm (i.e. under the null hypothesis), the algorithm stops and returns the control arm with probability at least 1 − δ. Part (b) guarantees that if there is in fact at least one alternative that is ε-superior to the control arm (i.e. 
under the alternative), then the algorithm will find at least one of them that is at most ε-inferior to the best of all possible arms.
As our algorithm is a slight modification of the LUCB algorithm, the results of [11, 12] provide insight into the number of samples taken before the algorithm terminates. Indeed, when ε = 0 and i⋆ = arg max_{i=0,1,...,K} μ_i is a unique maximizer, the nearly optimal sample complexity result of [12] implies that the algorithm terminates under settings (a) and (b) after at most max_{j≠i⋆} Δ_j^{−2} log(K log(Δ_j^{−2})/δ) + Σ_{i≠i⋆} Δ_i^{−2} log(log(Δ_i^{−2})/δ) samples (ignoring constants), where Δ_i = μ_{i⋆} − μ_i.
In our development to follow, we now bring back the index for experiment j, in particular using P^j to denote the quantity P^j_T at any stopping time T. Here the stopping time can either be defined by the scientist, or in an algorithmic manner.
3.3 Best-arm MAB interacting with online FDR
After having established null hypotheses and p-values in the context of best-arm MAB algorithms, we are now ready to embed them into an online FDR procedure. In the following, we consider p-values for the j-th experiment P^j := P^j_{T_j}, which is just the p-value as defined in equation (6) at the stopping time T_j, which depends on α_j.
We denote the sets of true null and false null hypotheses up to experiment J as H_0(J) and H_1(J) respectively, where we drop the argument whenever it is clear from the context. The variable R_j = 1_{P^j ≤ α_j} indicates whether the null hypothesis of experiment j has been rejected, where R_j = 1 denotes a claimed discovery that an alternative was better than the control. 
The false discovery rate (FDR) and modified FDR up to experiment J are then defined as

FDR(J) := E[ (Σ_{j∈H_0} R_j) / ((Σ_{i=1}^J R_i) ∨ 1) ] and mFDR(J) := (E Σ_{j∈H_0} R_j) / (E Σ_{i=1}^J R_i + 1).   (8)

Here the expectations are taken with respect to the distributions of the arm pulls and the respective sampling algorithm. In general, it is not true that control of one quantity implies control of the other. Nevertheless, in the long run (when the law of large numbers is a good approximation), one does not expect a major difference between the two quantities in practice.
The set of true nulls H_0 thus includes all experiments where H^j_0 is true, and the FDR and mFDR are well-defined for any number of experiments J, since we often desire to control FDR(J) or mFDR(J) for all J ∈ N. In order to measure power, we define the ε-best-arm discovery rate as

εBDR(J) := ( E Σ_{j∈H_1} R_j 1_{μ_{i_b} ≥ μ_{i⋆} − ε} 1_{μ_{i_b} ≥ μ_0 + ε} ) / |H_1(J)|.   (9)

We provide a concrete procedure (Procedure 2) for our doubly-sequential framework, where we use a particular online FDR algorithm due to Javanmard and Montanari [4] known as LORD; the reader should note that other online FDR procedures could be used to obtain essentially the same set of guarantees. Given a desired level α, the LORD procedure starts off with an initial "α-wealth" of W(0) < α. Based on an infinite sequence {γ_i}_{i=1}^∞ that sums to one, and the time τ_j of the most recent discovery, it uses up a fraction γ_{j−τ_j} of the remaining α-wealth to test. Whenever there is a rejection, we increase the α-wealth by α − W(0). 
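The wealth dynamics just described can be sketched directly. The following illustrative code (ours, not the paper's implementation; the function name and finite-horizon normalization are our choices) uses the γ-sequence γ_j ∝ log(j ∨ 2)/(j e^{√(log j)}) from the paper's experiments:

```python
import numpy as np

def lord_mab_levels(p_values, alpha=0.1, w0=None, horizon=10_000):
    """Sketch of the LORD update used in Procedure 2.

    The gamma sequence is normalized over a finite horizon so that it sums
    to one; w0 is the initial alpha-wealth W(0) < alpha.
    """
    if w0 is None:
        w0 = alpha / 2                       # any W(0) < alpha is admissible
    j = np.arange(1, horizon + 1)
    gamma = np.log(np.maximum(j, 2)) / (j * np.exp(np.sqrt(np.log(j))))
    gamma /= gamma.sum()
    wealth = [w0]                            # wealth[j] = W(j)
    tau = 0                                  # time of most recent discovery
    levels, rejections = [], []
    for step, p in enumerate(p_values, start=1):
        alpha_j = gamma[step - tau - 1] * wealth[tau]   # spend gamma_{j - tau_j}
        reject = p <= alpha_j
        wealth.append(wealth[-1] - alpha_j + reject * (alpha - w0))
        if reject:
            tau = step
        levels.append(float(alpha_j))
        rejections.append(bool(reject))
    return levels, rejections
```

Note that each test level depends on the p-value history only through the rejection times, which is the property exploited in the mFDR analysis of the next theorem.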
A feasible choice for a stopping time in practice is T_j := min{T(α_j), T_S}, where T_S is a maximal number of samples the scientist wants to pull and T(α_j) is the stopping time of the best-arm MAB algorithm run at confidence α_j.

Procedure 2 MAB-LORD: best-arm identification with online FDR control
1. Initialize W(0) < α, set τ_0 = 0, and choose a sequence {γ_i} such that Σ_{i=1}^∞ γ_i = 1.
2. At each step j, compute α_j = γ_{j−τ_j} W(τ_j) and update
W(j + 1) = W(j) − α_j + R_j (α − W(0)).
3. Output α_j and run Algorithm 1 using α_j-confidence, stopping at a stopping time T_j.
4. Algorithm 1 returns P^j, and we reject the null hypothesis if P^j ≤ α_j.
5. Set R_j = 1_{P^j ≤ α_j}, τ_j = τ_{j−1} ∨ jR_j, update j = j + 1 and go back to step 2.

The following theorem provides guarantees on mFDR and power for the MAB-LORD procedure.
Theorem 1 (Online mFDR control for MAB-LORD).
(a) Procedure 2 achieves mFDR control at level α for stopping times T_j = min{T(α_j), T_S}.
(b) Furthermore, if we set T_S = ∞, Procedure 2 satisfies

εBDR(J) ≥ ( E Σ_{j=1}^J 1_{j∈H_1} (1 − α_j) ) / |H_1(J)|.   (10)

See Appendix E for the proof of this claim. Note that by the arguments in the proof of Theorem 1, mFDR control itself is actually guaranteed for any generalized α-investing procedure [3] combined with any best-arm MAB algorithm. In fact, we could use any adaptive stopping time T_j which depends on the history only via the rejections R_1, . . . , R_{j−1}. Furthermore, using a modified LORD proposed by Javanmard and Montanari [13], we can also guarantee FDR control, a result we moved to Appendix F due to space constraints. It is noteworthy that small values of α not only guarantee a smaller FDR but also a higher BDR. 
However, there is no free lunch: a smaller $\alpha$ implies a smaller $\alpha_j$ at each experiment, resulting in a larger required number of pulls for the best-arm MAB algorithm.

4 Experimental results

In the following, we briefly describe some results of our experiments³ on both simulated and real-world data sets, which illustrate that, apart from FDR control, MAB-FDR (used interchangeably with MAB-LORD here) is highly advantageous in terms of sample complexity and power compared to a straightforward embedding of A/B testing in online FDR procedures. Unless otherwise noted, we set $\epsilon = 0$ in all of our simulations to focus on the main ideas and keep the discussion concise.

Competing procedures There are two natural frameworks to compare against MAB-FDR. The first, called AB-FDR or AB-LORD, swaps the MAB part for an A/B (i.e. A/B/n) test (uniformly sampling all alternatives until termination). The second comparator exchanges the online FDR control for independent testing at level $\alpha$ for all hypotheses; we call this MAB-IND. Formally, AB-FDR swaps step 3 in Procedure 2 with "Output $\alpha_j$ and uniformly sample each arm until stopping time $T_j$.", while MAB-IND swaps step 4 in Procedure 2 with "The algorithm returns $P^j$ and we reject the null hypothesis if $P^j \leq \alpha$.". In order to compare the performances of these procedures, we ran three sets of simulations using Procedure 2 with $\epsilon = 0$ and $\gamma_j = 0.07 \log(j \vee 2)/(j e^{\sqrt{\log j}})$ as in [4].

Our experiments are run on artificial data with Gaussian/Bernoulli draws and real-world Bernoulli draws from the New Yorker Cartoon Caption Contest. Recall that the sample complexity of the best-arm MAB algorithm is determined by the gaps $\Delta_j = \mu_{i^\star} - \mu_j$. One of the main relevant differences to consider between an experiment of artificial or real-world nature is thus the distribution of the means $\mu_i$ for $i = 1, \ldots, K$.
The artificial data simulations are run with a fixed gap $\Delta := \Delta_2 = \mu_{i^\star} - \mu_2$, while the means of the other arms are set uniformly in $[0, \mu_{i^\star} - \Delta]$. For our real-world simulations, we use empirical means computed from the cartoon caption contest (see details in Appendix B.1.1). In addition, the contests actually follow a natural chronological order, which makes this dataset highly relevant to our purposes. In all simulations, 60% of all the hypotheses are true nulls, and their indices are chosen uniformly. Due to space constraints, the experimental results for artificial and real-world Bernoulli draws are deferred to Appendix B.

Figure 2: (a) Power vs. truncation time $T_S$ (per hypothesis) for 50 arms and (b) sample complexity vs. number of arms for truncation time $T_S = 300$, for Gaussian draws with fixed $\mu_{i^\star} = 8$, $\Delta = 3$ over 500 hypotheses with 200 non-nulls, averaged over 100 runs, with $\alpha = 0.1$.

Power and sample complexity In this section we include figures on artificial Gaussian trials which confirm that the total number of necessary pulls to determine significance is much smaller for MAB-FDR than for AB-FDR. In Fig. 2(a) we fix the number of arms and plot the $\epsilon$BDR with $\epsilon = 0$ (BDR for short) for both procedures over different choices of truncation times $T_S$. Low BDR indicates that the algorithm often reaches the truncation time before it could stop. For Fig. 2(b) we fix $T_S$ and show how the sample complexity varies with the number of arms.

³The code for reproducing all experiments and plots in this paper is publicly available at https://github.com/fanny-yang/MABFDR

Observe in Fig.
2(a) that the power at any given truncation time is much higher for MAB-FDR than AB-FDR. This is because the best-arm MAB is more likely to satisfy the stopping criterion before any given truncation time than the uniform sampling algorithm. Fig. 2(b) qualitatively shows how the total number of necessary arm pulls for AB-FDR increases much faster with the number of arms than for MAB-FDR, before it plateaus due to the truncation. Recall that whenever the best-arm MAB stops before the truncation time in each hypothesis, the stopping criterion is met, i.e. the best arm is identified with probability at least $1 - \alpha_j$, so that the power is bound to be close to one whenever $T_j = T(\alpha_j)$.

mFDR control For Fig. 3, we again consider Gaussian draws as in Fig. 2. This time however, for each true null hypothesis we skip the bandit experiment and directly draw $P^j \sim \mathcal{U}[0, 1]$ to compare with the significance levels $\alpha_j$ from our online FDR Procedure 2 (see App. B.2 for motivation of this setting). By Theorem 1, mFDR should still be controlled, as it only requires the p-values to be super-uniform. In Fig. 3(a) we plot the instantaneous false discovery proportion

$$\mathrm{FDP}(J) = \frac{\sum_{j \in \mathcal{H}^0(J)} R_j}{\sum_{j=1}^{J} R_j}$$

over the hypothesis index for different runs with the same settings. Apart from initial fluctuations due to the relatively small denominator, observe how the guarantee on $\mathrm{FDR}(J) = \mathbb{E}\,\mathrm{FDP}(J)$, with the red line showing its empirical value, transfers to the control of each individual run (blue lines).

Figure 3: (a) Single runs of MAB-LORD (blue) and their average (red) with uniformly drawn p-values for null hypotheses and Gaussian draws as in Figure 2.
(b) mFDR over different proportions of non-nulls $\pi_1$, with the same settings, averaged over 80 runs.

In Figure 3(b), we compare the mFDR of MAB-FDR against MAB-IND and a Bonferroni-type correction. The latter uses a simple union bound and chooses $\alpha_j = \frac{6\alpha}{\pi^2 j^2}$ such that $\sum_{j=1}^{\infty} \alpha_j \leq \alpha$, and thus trivially allows for FWER control at any time, implying FDR control. As expected, Bonferroni is too conservative and barely makes any rejections, whereas the naive MAB-IND approach does not control FDR. LORD avoids both extremes and controls FDR while having reasonable power.

5 Discussion

The recent focus in popular media on the lack of reproducibility of scientific results erodes the public's confidence in published scientific research. To maintain credibility of claimed discoveries, simply decreasing the statistical significance level $\alpha$ of each individual experimental work (e.g., rejecting at level 0.001 rather than 0.05) would drastically hurt power. A common approach is instead to control the ratio of false discoveries to claimed discoveries at some desired value over many sequential experiments, requiring the significance levels $\alpha_j$ to change from experiment to experiment. Unlike earlier works on online FDR control, our framework synchronously interacts with adaptive sampling methods like MABs to make the overall sampling procedure per experiment much more efficient than uniform sampling. To the best of our knowledge, it is the first work that successfully combines the benefits of adaptive sampling and FDR control. It is worthwhile to note that any improvement, theoretical or practical, to either online FDR algorithms or best-arm identification in MAB, immediately results in a corresponding improvement for our MAB-FDR framework.

More general notions of FDR with corresponding online procedures have recently been developed by Ramdas et al. [14].
In particular, they incorporate the notion of memory and a priori importance of each hypothesis. This could prove to be a valuable extension for our setting, especially in cases when only the percentage of wrong rejections in the recent past matters. It would be useful to establish FDR control for these generalized notions of FDR as well.

Acknowledgements

This work was partially supported by Office of Naval Research MURI grant DOD-002888, Air Force Office of Scientific Research Grant AFOSR-FA9550-14-1-001, and National Science Foundation Grants CIF-31712-23800 and DMS-1309356.

References

[1] Y. Benjamini and Y. Hochberg, "Controlling the false discovery rate: a practical and powerful approach to multiple testing," Journal of the Royal Statistical Society. Series B (Methodological), pp. 289–300, 1995.

[2] D. P. Foster and R. A. Stine, "α-investing: a procedure for sequential control of expected false discoveries," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 70, no. 2, pp. 429–444, 2008.

[3] E. Aharoni and S. Rosset, "Generalized α-investing: definitions, optimality results and application to public databases," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 76, no. 4, pp. 771–794, 2014.

[4] A. Javanmard and A. Montanari, "Online rules for control of false discovery rate and false discovery exceedance," The Annals of Statistics, 2017.

[5] R. Johari, L. Pekelis, and D. J. Walsh, "Always valid inference: Bringing sequential analysis to A/B testing," arXiv preprint arXiv:1512.04922, 2015.

[6] K. G. Jamieson, M. Malloy, R. D. Nowak, and S. Bubeck, "lil'UCB: An optimal exploration algorithm for multi-armed bandits," in COLT, vol. 35, 2014, pp.
423–439.

[7] A. Balsubramani and A. Ramdas, "Sequential nonparametric testing with the law of the iterated logarithm," in Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2016, pp. 42–51.

[8] E. Kaufmann, O. Cappé, and A. Garivier, "On the complexity of best arm identification in multi-armed bandit models," The Journal of Machine Learning Research, 2015.

[9] K. Jamieson and R. Nowak, "Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting," in Information Sciences and Systems (CISS), 2014 48th Annual Conference on. IEEE, 2014, pp. 1–6.

[10] S. S. Villar, J. Bowden, and J. Wason, "Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges," Statistical Science: a review journal of the Institute of Mathematical Statistics, vol. 30, no. 2, p. 199, 2015.

[11] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone, "PAC subset selection in stochastic multi-armed bandits," in Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012, pp. 655–662.

[12] M. Simchowitz, K. Jamieson, and B. Recht, "The simulator: Understanding adaptive sampling in the moderate-confidence regime," arXiv preprint arXiv:1702.05186, 2017.

[13] A. Javanmard and A. Montanari, "On online control of false discovery rate," arXiv preprint arXiv:1502.06197, 2015.

[14] A. Ramdas, F. Yang, M. J. Wainwright, and M. I.
Jordan, "Online control of the false discovery rate with decaying memory," in Advances in Neural Information Processing Systems (NIPS) 2017, arXiv preprint arXiv:1710.00499, 2017.