{"title": "The Power of Adaptivity in Identifying Statistical Alternatives", "book": "Advances in Neural Information Processing Systems", "page_first": 775, "page_last": 783, "abstract": "This paper studies the trade-off between two different kinds of pure exploration: breadth versus depth. We focus on the most biased coin problem, asking how many total coin flips are required to identify a ``heavy'' coin from an infinite bag containing both ``heavy'' coins with mean $\\theta_1 \\in (0,1)$, and ``light'' coins with mean $\\theta_0 \\in (0,\\theta_1)$, where heavy coins are drawn from the bag with proportion $\\alpha \\in (0,1/2)$. When $\\alpha,\\theta_0,\\theta_1$ are unknown, the key difficulty of this problem lies in distinguishing whether the two kinds of coins have very similar means, or whether heavy coins are just extremely rare. While existing solutions to this problem require some prior knowledge of the parameters $\\theta_0,\\theta_1,\\alpha$, we propose an adaptive algorithm that requires no such knowledge yet still obtains near-optimal sample complexity guarantees. In contrast, we provide a lower bound showing that non-adaptive strategies require at least quadratically more samples. In characterizing this gap between adaptive and nonadaptive strategies, we make connections to anomaly detection and prove lower bounds on the sample complexity of differentiating between a single parametric distribution and a mixture of two such distributions.", "full_text": "The Power of Adaptivity in Identifying Statistical Alternatives\n\nKevin Jamieson, Daniel Haas, Ben Recht\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\n{kjamieson,dhaas,brecht}@eecs.berkeley.edu\n\nAbstract\n\nThis paper studies the trade-off between two different kinds of pure exploration: breadth versus depth. 
We focus on the most biased coin problem, asking how many total coin flips are required to identify a \u201cheavy\u201d coin from an infinite bag containing both \u201cheavy\u201d coins with mean $\\theta_1 \\in (0, 1)$, and \u201clight\u201d coins with mean $\\theta_0 \\in (0, \\theta_1)$, where heavy coins are drawn from the bag with proportion $\\alpha \\in (0, 1/2)$. When $\\alpha, \\theta_0, \\theta_1$ are unknown, the key difficulty of this problem lies in distinguishing whether the two kinds of coins have very similar means, or whether heavy coins are just extremely rare. While existing solutions to this problem require some prior knowledge of the parameters $\\theta_0, \\theta_1, \\alpha$, we propose an adaptive algorithm that requires no such knowledge yet still obtains near-optimal sample complexity guarantees. In contrast, we provide a lower bound showing that non-adaptive strategies require at least quadratically more samples. In characterizing this gap between adaptive and nonadaptive strategies, we make connections to anomaly detection and prove lower bounds on the sample complexity of differentiating between a single parametric distribution and a mixture of two such distributions.\n\n1 Introduction\n\nThe trade-off between exploration and exploitation has been an ever-present trope in the online learning literature. In contrast, this paper studies the trade-off between two different kinds of pure exploration: breadth versus depth. Consider a bag that contains an infinite number of two kinds of biased coins: \u201cheavy\u201d coins with mean $\\theta_1 \\in (0, 1)$ and \u201clight\u201d coins with mean $\\theta_0 \\in (0, \\theta_1)$. When a player picks a coin from the bag, with probability $\\alpha$ the coin is \u201cheavy\u201d and with probability $(1 - \\alpha)$ the coin is \u201clight.\u201d The player can flip any coin she picks from the bag as many times as she wants, and the goal is to identify a heavy coin using as few total flips as possible. 
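The bag model just described can be simulated in a few lines. The following is a minimal sketch, not part of the original paper: the function names draw_coin and flip and the parameter values are hypothetical choices made only for illustration.

```python
import random

def draw_coin(alpha, theta0, theta1, rng):
    # Return the mean of a freshly drawn coin: heavy with probability alpha.
    return theta1 if rng.random() < alpha else theta0

def flip(mean, rng):
    # One Bernoulli flip of a coin with the given mean (1 = heads).
    return 1 if rng.random() < mean else 0

rng = random.Random(0)
alpha, theta0, theta1 = 0.1, 0.4, 0.6  # illustrative values only (assumption)

# Breadth: draw many coins and record which ones are heavy.
coins = [draw_coin(alpha, theta0, theta1, rng) for _ in range(10000)]
heavy_fraction = sum(c == theta1 for c in coins) / len(coins)

# Depth: flip a single heavy coin many times to estimate its mean.
flips = [flip(theta1, rng) for _ in range(10000)]
heads_fraction = sum(flips) / len(flips)
```

The player never observes whether a drawn coin is heavy; only the flips are visible, which is why breadth (drawing new coins) and depth (flipping one coin repeatedly) must be traded off.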
When $\\alpha, \\theta_0, \\theta_1$ are unknown, the key difficulty of this problem lies in distinguishing whether the two kinds of coins have very similar means, or whether heavy coins are just extremely rare. That is, how does one balance flipping an individual coin many times to better estimate its mean against considering many new coins to maximize the probability of observing a heavy one? Previous work has only proposed solutions that rely on partial or full knowledge of $\\alpha, \\theta_0, \\theta_1$, limiting their applicability. In this work we propose the first algorithm that requires no knowledge of $\\alpha, \\theta_0, \\theta_1$, is guaranteed to return a heavy coin with probability at least $1 - \\delta$, and uses a total number of flips, in expectation, that nearly matches known lower bounds. Moreover, our fully adaptive algorithm supports more general sub-Gaussian sources in addition to just coins, and only ever has one \u201ccoin\u201d outside the bag at a given time, a constraint of practical importance to some applications.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn addition, we connect the most biased coin problem to anomaly detection and prove novel lower bounds on the difficulty of detecting the presence of a mixture versus just a single component of a known family of distributions (e.g. $X \\sim (1 - \\alpha) g_{\\theta_0} + \\alpha g_{\\theta_1}$ versus $X \\sim g_{\\theta}$ for some $\\theta$). We show that in detecting the presence of a mixture distribution, there is a stark difference of difficulty between when the underlying distribution parameters are known (e.g. $\\alpha, \\theta_0, \\theta_1$) and when they are not. The most biased coin problem can be viewed as an online, adaptive mixture detection problem where source distributions arrive one at a time that are either $g_{\\theta_0}$ with probability $(1 - \\alpha)$ or $g_{\\theta_1}$ with probability $\\alpha$ (e.g. 
null or anomalous) and the player adaptively chooses how many samples to take from each distribution (to increase the signal-to-noise ratio) with the goal of identifying an anomalous distribution $g_{\\theta_1}$ using as few total samples as possible. This work draws a contrast between the power of adaptive versus non-adaptive (e.g. taking the same number of samples each time) approaches to this problem, specifically when $\\alpha, \\theta_0, \\theta_1$ are unknown.\n\n1.1 Motivation and Related Work for the Most Biased Coin Problem\n\nThe most biased coin problem characterizes the inherent difficulty of real-world problems including anomaly and intrusion detection and discovery of vacant frequencies in the radio spectrum. Our interest in the problem stemmed from automated hiring of crowd workers: data labeling for machine learning applications is often performed by humans, and recent work in the crowdsourcing literature accelerates labeling by organizing workers into pools of labelers and paying them to wait for incoming data [4, 12]. Workers hired on marketplaces such as Amazon\u2019s Mechanical Turk [16] vary widely in skill, and identifying high-quality workers as quickly as possible is an important challenge. We can model each worker\u2019s performance (e.g. accuracy or speed) as a random variable so that selecting a good worker is equivalent to identifying a worker with a high mean. Since we do not observe a worker\u2019s expected performance directly, we must give them tasks from which we estimate it (like repeatedly flipping a biased coin). Arlotto et al. [3] proposed a strategy with some guarantees for a related problem but did not characterize the sample complexity of the problem, the focus of our work.\n\nThe most biased coin problem was first proposed by Chandrasekaran and Karp [8]. 
In that work, it was shown that if $\\alpha, \\theta_0, \\theta_1$ were known then there exists an algorithm based on the sequential probability ratio test (SPRT) that is optimal in that it minimizes the expected number of total flips to find a \u201cheavy\u201d coin whose posterior probability of being heavy is at least $1 - \\delta$, and the expected sample complexity of this algorithm was upper-bounded by\n\n$$\\frac{16}{(\\theta_1 - \\theta_0)^2} \\left( \\frac{1 - \\alpha}{\\alpha} + \\log\\left( \\frac{(1 - \\alpha)(1 - \\delta)}{\\alpha \\delta} \\right) \\right). \\qquad (1)$$\n\nHowever, the practicality of the proposed algorithm is severely limited as it relies critically on knowing $\\alpha$, $\\theta_0$, and $\\theta_1$ exactly. In addition, the algorithm returns to coins it has previously flipped and thus requires more than one coin to be outside the bag at a time, ruling out some applications.\n\nMalloy et al. [15] addressed some of the shortcomings of [9] (a preprint of [8]) by considering both an alternative SPRT procedure and a sequential thresholding procedure. Both of these proposed algorithms only ever have one coin out of the bag at a time. However, the former requires knowledge of all relevant parameters $\\alpha, \\theta_0, \\theta_1$, and the latter requires knowledge of $\\alpha, \\theta_0$. Moreover, these results are only presented for the asymptotic case where $\\delta \\to 0$.\n\nThe most biased coin problem can be viewed through the lens of multi-armed bandits. In the best-arm identification problem, the player has access to $K$ distributions (arms) such that if arm $i \\in [K]$ is sampled (pulled), an iid random variable with mean $\\mu_i$ is observed; the objective is to identify the arm associated with the highest mean with probability at least $1 - \\delta$ using as few pulls as possible (see [14] for a short survey). 
In the infinite armed bandit problem, the player is not confined to $K$ arms but has access to an infinite reservoir of arms such that a draw from this reservoir results in an arm with a mean $\\mu$ drawn from some distribution; the objective is to identify the highest mean possible after $n$ total pulls for any $n > 0$ with probability $1 - \\delta$ (see [7]). The most biased coin problem is an instance of this latter game with the arm reservoir distribution of means $\\mu$ defined as $P(\\mu \\ge \\theta_1 - \\epsilon) = \\alpha \\mathbf{1}_{\\epsilon \\ge 0} + (1 - \\alpha) \\mathbf{1}_{\\epsilon \\ge \\theta_1 - \\theta_0}$ for all $\\epsilon$. Previous work has focused on an alternative arm distribution reservoir that satisfies $E \\epsilon^{\\beta} \\le P(\\mu \\ge \\mu_* - \\epsilon) \\le E' \\epsilon^{\\beta}$ for some $\\mu_* \\in [0, 1]$ where $E, E'$ are constants and $\\beta$ is known [5, 21, 6, 7]. Because neither arm distribution reservoir can be written in terms of the other, neither work subsumes the other. Note that one can always apply an algorithm designed for the infinite armed bandit problem to any finite $K$-armed bandit problem by defining the arm reservoir as placing a uniform distribution over the $K$ arms. This is appealing when $K$ is very large and one wishes to guarantee nontrivial performance when the number of pulls is much less than $K$.$^1$ The most biased coin problem is a special case of the $K$-armed reservoir distribution where one arm has mean $\\theta_1$ and $K - 1$ arms have mean $\\theta_0$ with $\\alpha = \\frac{1}{K}$.\n\n$^1$All algorithms for the $K$-armed bandit problem known to these authors begin by sampling each arm once, so that until the number of pulls exceeds $K$, performance is no better than random selection.\n\nGiven that [8] and [15] are provably optimal algorithms for the most biased coin problem given knowledge of $\\alpha, \\theta_0, \\theta_1$, it is natural to consider a procedure that first estimates these unknown parameters and then uses these estimates in the algorithms of [8] or [15]. 
Indeed, in the $\\beta$-parameterized arm reservoir setting discussed above, this is exactly what Carpentier and Valko [7] propose to do, suggesting a particular estimator for $\\beta$ given a lower bound $\\widehat{\\beta} \\le \\beta$. They show that this estimator is sufficient to obtain the same sample complexity result up to log factors as when $\\beta$ was known. Sadly, through upper and lower bounds we show that for the most biased coin problem this estimate-then-explore approach requires quadratically more flips than our proposed algorithm that adapts to these unknown parameters. Specifically, we show that when $\\theta_1 - \\theta_0$ is sufficiently small one cannot use a static estimation step to determine whether $\\alpha = 0$ or $\\alpha > 0$ unless a number of samples quadratic in the optimal sample complexity is taken.\n\nOur contributions to the most biased coin problem include a novel algorithm that never has more than one coin outside the bag at a time, has no knowledge of the distribution parameters, supports distributions on $[0, 1]$ rather than just \u201ccoins,\u201d and comes within log factors of the known information-theoretic lower bound and Equation 1, which is achieved by an algorithm that knows the parameters. See Table 1 for an overview of the upper and lower bounds proved in this work for this problem. We believe that our algorithm is the first solution to the most biased coin problem that does not require prior knowledge of the problem parameters and that the same approach can be reworked to solve more general instances of the infinite-armed bandit problem, including the $\\beta$-parameterized and $K$-armed reservoir cases described above. Finally, if an algorithm is desired for arbitrary arm reservoir distributions, this work rules out an estimate-then-explore approach.\n\n1.2 Problem Statement\n\nLet $\\theta \\in \\Theta$ index a family of single-parameter probability density functions $g_{\\theta}$ and fix $\\theta_0, \\theta_1 \\in \\Theta$, $\\alpha \\in [0, 1/2]$. 
For any $\\theta \\in \\Theta$ assume that $g_{\\theta}$ is known to the procedure. Note that in the most biased coin problem, $g_{\\theta} = \\text{Bernoulli}(\\theta)$, but in general it is arbitrary (e.g. $\\mathcal{N}(\\theta, 1)$). Consider a sequence of iid Bernoulli random variables $\\xi_i \\in \\{0, 1\\}$ for $i = 1, 2, \\dots$ where each $P(\\xi_i = 1) = 1 - P(\\xi_i = 0) = \\alpha$. Let $X_{i,j}$ for $j = 1, 2, \\dots$ be a sequence of random variables drawn from $g_{\\theta_1}$ if $\\xi_i = 1$ and $g_{\\theta_0}$ otherwise, and let $\\{\\{X_{i,j}\\}_{j=1}^{M_i}\\}_{i=1}^{N}$ represent the sampling history generated by a procedure for some $N \\in \\mathbb{N}$ and $(M_1, \\dots, M_N) \\in \\mathbb{N}^N$. Any valid procedure behaves accordingly:\n\nAlgorithm 1 The most biased coin problem definition. Only the last distribution drawn may be sampled or declared heavy, enforcing the rule that only one coin may be outside the bag at a time.\n\nInitialize an empty history ($N = 1$, $M = (0, 0, \\dots)$).\nRepeat until heavy distribution declared:\n  Choose one of\n    1. draw a sample from distribution $N$, $M_N \\leftarrow M_N + 1$\n    2. draw a sample from the $(N + 1)$st distribution, $M_{N+1} = 1$, $N \\leftarrow N + 1$\n    3. declare distribution $N$ as heavy\n\nDefinition 1 We say a strategy for the most biased coin problem is $\\delta$-probably correct if for all $(\\alpha, \\theta_0, \\theta_1)$ it identifies a \u201cheavy\u201d $g_{\\theta_1}$ distribution with probability at least $1 - \\delta$.\n\nDefinition 2 (Strategies for the most biased coin problem) An estimate-then-explore strategy is a strategy that, for any fixed $m \\in \\mathbb{N}$, begins by sampling each successive coin exactly $m$ times for a number of coins that is at least the minimum necessary for any test to determine that $\\alpha \\neq 0$ with probability at least $1 - \\delta$, then optionally continues sampling with an arbitrary strategy that declares a heavy coin. 
An adaptive strategy is any strategy that is not an estimate-then-explore strategy.\n\nWe study the estimate-then-explore strategy because there exist optimal algorithms [8, 15] for the most biased coin problem if $\\alpha, \\theta_0, \\theta_1$ are known, so it is natural to consider estimating these quantities then using one of these algorithms. Note that the algorithm of [7] for the $\\beta$-parameterized infinite armed bandit problem discussed above can be considered an estimate-then-explore strategy since it first estimates $\\beta$ by sampling a fixed number of samples from a set of arms, and then uses this estimate to draw a fixed number of arms and applies a UCB-style algorithm to these arms. A contribution of this work is showing that such a strategy is infeasible for the most biased coin problem.\n\nFor all strategies that are $\\delta$-probably correct and follow the interface of Algorithm 1, our goal is to provide lower and upper bounds on the quantity $E[T] := E\\left[\\sum_{i=1}^{N} M_i\\right]$ for any $(\\alpha, \\theta_0, \\theta_1)$, where $N$ denotes the final number of coins considered.\n\n2 From Identifying Coins to Detecting Mixture Distributions\n\nAddressing the most biased coin problem, [15] analyzes perhaps the most natural strategy: fix an $m \\in \\mathbb{N}$ and flip each successive coin exactly $m$ times. The relevant questions are how large does $m$ have to be in order to guarantee correctness with probability $1 - \\delta$, and for a given $m$ how long must one wait to declare a \u201cheavy\u201d coin? The authors partially answer these questions and we improve upon them (see Section 3.2.1), which leads us to our study of the difficulty of detecting the presence of a mixture distribution. As an example of the kind of lower bounds shown in this work, if we observe a sequence of random variables $X_1, \\dots, X_n$, consider the following hypothesis test:\n\n$$\\begin{aligned} H_0 &: \\forall i \\;\\; X_1, \\dots, X_n \\sim \\mathcal{N}(\\theta, \\sigma^2) \\quad \\text{for some } \\theta \\in \\mathbb{R}, \\\\ H_1 &: \\forall i \\;\\; X_1, \\dots, X_n \\sim (1 - \\alpha) \\mathcal{N}(\\theta_0, \\sigma^2) + \\alpha \\mathcal{N}(\\theta_1, \\sigma^2) \\end{aligned} \\qquad \\text{(P1)}$$\n\nwhich will henceforth be referred to as Problem P1 or just (P1). We can show that if $\\theta_0, \\theta_1, \\alpha$ are known and $\\theta = \\theta_0$, then it is sufficient to observe just $\\max\\left\\{1/\\alpha, \\frac{\\sigma^2}{\\alpha^2 (\\theta_1 - \\theta_0)^2} \\log(1/\\delta)\\right\\}$ samples to determine the correct hypothesis with probability at least $1 - \\delta$. However, if $\\theta_0, \\theta_1, \\alpha$ are unknown then it is necessary to observe at least $\\max\\left\\{1/\\alpha, \\left(\\frac{\\sigma^2}{\\alpha (\\theta_1 - \\theta_0)^2}\\right)^2 \\log(1/\\delta)\\right\\}$ samples in expectation whenever $\\frac{(\\theta_1 - \\theta_0)^2}{\\sigma^2} \\le 1$, and $\\max\\left\\{1/\\alpha, \\frac{\\sigma^2}{\\alpha^2 (\\theta_1 - \\theta_0)^2} \\log(1/\\delta)\\right\\}$ otherwise (see Appendix C).\n\nRecognizing $\\frac{(\\theta_1 - \\theta_0)^2}{2\\sigma^2}$ as the KL divergence between the two Gaussians of $H_1$, we observe startling consequences for anomaly detection when the parameters of the underlying distributions are unknown: if the anomalous distribution is well separated from the null distribution, then detecting an anomalous component is only about as hard as observing just one anomalous sample (i.e. $1/\\alpha$) multiplied by the inverse KL divergence between the null and anomalous distributions. However, when the two distributions are not well separated then the necessary sample complexity explodes to this latter quantity squared. In Section 4 we will investigate adaptive methods for dramatically decreasing this sample complexity.\n\nOur lower bounds are based on the detection of the presence of a mixture of two distributions of an exponential family versus just a single distribution of the same family. There has been extensive work in the estimation of mixture distributions [13, 11] but this literature often assumes that the mixture coefficient $\\alpha$ is bounded away from 0 and 1 to ensure a sufficient number of samples from each distribution. 
In contrast, we highlight the regime when $\\alpha$ is arbitrarily small, as is the case in statistical anomaly detection [10, 20, 2]. Property testing (e.g. of unimodality) [1] is relevant but can lack interpretability or strength in favor of generality. Considering the exponential family allows us to make interpretable statements about the relevant problem parameters in different regimes.\n\nPreliminaries Let $P$ and $Q$ be two probability distributions with densities $p$ and $q$, respectively. For simplicity, assume $p$ and $q$ have the same support. Define the KL divergence between $P$ and $Q$ as $KL(P, Q) = \\int \\log\\left(\\frac{p(x)}{q(x)}\\right) dp(x)$. Define the $\\chi^2$ divergence between $P$ and $Q$ as $\\chi^2(P, Q) = \\int \\left(\\frac{p(x)}{q(x)} - 1\\right)^2 dq(x) = \\int \\frac{(p(x) - q(x))^2}{q(x)} dx$. Note that by Jensen\u2019s inequality\n\n$$KL(P, Q) = \\mathbb{E}_p\\left[\\log \\frac{p}{q}\\right] \\le \\log \\mathbb{E}_p\\left[\\frac{p}{q}\\right] = \\log\\left(\\chi^2(P, Q) + 1\\right) \\le \\chi^2(P, Q). \\qquad (2)$$\n\nExamples: If $P = \\mathcal{N}(\\theta_1, \\sigma^2)$ and $Q = \\mathcal{N}(\\theta_0, \\sigma^2)$ then $KL(P, Q) = \\frac{(\\theta_1 - \\theta_0)^2}{2\\sigma^2}$ and $\\chi^2(P, Q) = e^{(\\theta_1 - \\theta_0)^2 / \\sigma^2} - 1$. If $P = \\text{Bernoulli}(\\theta_1)$ and $Q = \\text{Bernoulli}(\\theta_0)$ then $KL(P, Q) = \\theta_1 \\log\\left(\\frac{\\theta_1}{\\theta_0}\\right) + (1 - \\theta_1) \\log\\left(\\frac{1 - \\theta_1}{1 - \\theta_0}\\right) \\le \\frac{(\\theta_1 - \\theta_0)^2}{\\theta_0 (1 - \\theta_0) - [(\\theta_1 - \\theta_0)(2\\theta_0 - 1)]_+}$ and $\\chi^2(P, Q) = \\frac{(\\theta_1 - \\theta_0)^2}{\\theta_0 (1 - \\theta_0)}$. All proofs appear in the appendix.\n\n3 Lower bounds\n\nWe present lower bounds on the sample complexity of $\\delta$-probably correct strategies for the most biased coin problem that follow the interface of Algorithm 1. Lower bounds are stated for any adaptive strategy in Section 3.1, non-adaptive strategies that may have knowledge of the parameters but sample each distribution the same number of times in Section 3.2.1, and estimate-then-explore strategies that do not have prior knowledge of the parameters in Section 3.2.2. 
Our lower bounds, with the exception of the adaptive strategy, are based on the difficulty of detecting the presence of a mixture distribution, and this reduction is explained in Section 3.2.\n\n3.1 Adaptive strategies\n\nThe following theorem, reproduced from [15], describes the sample complexity of any $\\delta$-probably correct algorithm for the most biased coin identification problem. Note that this lower bound holds for any procedure even if it returns to previously seen distributions to draw additional samples and even if it knows $\\alpha, \\theta_0, \\theta_1$.\n\nTheorem 1 [15, Theorem 2] Fix $\\delta \\in (0, 1)$. Let $T$ be the total number of samples taken by any procedure that is $\\delta$-probably correct in identifying a heavy distribution. Then\n\n$$E[T] \\ge c_1 \\max\\left\\{ \\frac{1 - \\delta}{\\alpha}, \\frac{(1 - \\delta)}{\\alpha \\, KL(g_{\\theta_0} | g_{\\theta_1})} \\right\\}$$\n\nwhenever $\\alpha \\le c_2$ where $c_1, c_2 \\in (0, 1)$ are absolute constants.\n\nThe above theorem is directly applicable to the special case where $g_{\\theta}$ is a Bernoulli distribution, implying a lower bound of $\\max\\left\\{ \\frac{1}{\\alpha}, \\frac{2 \\min\\{\\theta_0 (1 - \\theta_0), \\theta_1 (1 - \\theta_1)\\}}{\\alpha (\\theta_1 - \\theta_0)^2} \\right\\}$ for the most biased coin problem. The upper bounds of our proposed procedures for the most biased coin problem presented later will be compared to this benchmark.\n\n3.2 The detection of a mixture distribution and the most biased coin problem\n\nFirst observe that identifying a specific distribution $i \\le N$ as heavy (i.e. $\\xi_i = 1$) or determining that $\\alpha$ is strictly greater than 0, is at least as hard as detecting that any of the distributions up to distribution $N$ is heavy. 
Thus, a lower bound on the total expected number of samples of all considered distributions for this strictly easier detection problem is also a lower bound for the estimate-then-explore strategy for the most biased coin identification problem.\n\nThe estimate-then-explore strategy fixes an $m \\in \\mathbb{N}$ prior to starting the game and then samples each distribution exactly $m$ times, i.e. $M_i = m$ for all $i \\le N$ for some $N$. To simplify notation let $f_{\\theta}$ denote the distribution of the sufficient statistics of these $m$ samples. In general $f_{\\theta}$ is a product distribution, but when $g_{\\theta}$ is a Bernoulli distribution, as in the biased coin problem, we can take $f_{\\theta}$ to be a Binomial distribution with parameters $(m, \\theta)$. Now our problem is more succinctly described as:\n\n$$\\begin{aligned} H_0 &: \\forall i \\;\\; X_i \\sim f_{\\theta} \\quad \\text{for some } \\theta \\in \\widetilde{\\Theta} \\subseteq \\Theta, \\\\ H_1 &: \\forall i \\;\\; \\xi_i \\sim \\text{Bernoulli}(\\alpha), \\quad \\forall i \\;\\; X_i \\sim \\begin{cases} f_{\\theta_0} & \\text{if } \\xi_i = 0 \\\\ f_{\\theta_1} & \\text{if } \\xi_i = 1 \\end{cases} \\end{aligned} \\qquad \\text{(P2)}$$\n\nIf $\\theta_0$ and $\\theta_1$ are close to each other, or if $\\alpha$ is very small, it can be very difficult to decide between $H_0$ and $H_1$ even if $\\alpha, \\theta_0, \\theta_1$ are known a priori. Note that when the parameters are known, one can take $\\widetilde{\\Theta} = \\{\\theta_0\\}$. However, when the parameters are unknown, one takes $\\widetilde{\\Theta} = \\Theta$ to prove a lower bound on the sample complexity of the estimate-then-explore algorithm, which is tasked with deciding whether samples are coming from a mixture of distributions or just a single distribution within the family. That is, lower bounds on the sample complexity when the parameters are known and unknown follow by analyzing a simple binary and composite hypothesis test, respectively. In what follows, for any event $A$, let $P_i(A)$ and $E_i[A]$ denote probability and expectation of $A$ under hypothesis $H_i$ for $i \\in \\{0, 1\\}$ (the specific value of $\\theta$ in $H_0$ will be clear from context). 
The next claim is instrumental in our ability to prove lower bounds on the difficulty of the hypothesis tests.\n\nClaim 1 Any procedure that is $\\delta$-probably correct also satisfies $P_0(N < \\infty) \\le \\delta$ whenever $\\alpha = 0$.\n\n3.2.1 Sample complexity when parameters are known\n\nTheorem 2 Fix $\\delta \\in (0, 1)$. Consider the hypothesis test of Problem P2 for any fixed $\\theta \\in \\widetilde{\\Theta} \\subseteq \\Theta$. Let $N$ be the random number of distributions considered before stopping and declaring a hypothesis. If a procedure satisfies $P_0(N < \\infty) \\le \\delta$ and $P_1\\left(\\cup_{i=1}^{N} \\{\\xi_i = 1\\}\\right) \\ge 1 - \\delta$, then\n\n$$E_1[N] \\ge \\max\\left\\{ \\frac{1 - \\delta}{\\alpha}, \\frac{\\log(1/\\delta)}{KL(P_1 | P_0)} \\right\\} \\ge \\max\\left\\{ \\frac{1 - \\delta}{\\alpha}, \\frac{\\log(1/\\delta)}{\\chi^2(P_1 | P_0)} \\right\\}.$$\n\nIn particular, if $\\widetilde{\\Theta} = \\{\\theta_0\\}$ then\n\n$$E_1[N] \\ge \\max\\left\\{ \\frac{1 - \\delta}{\\alpha}, \\frac{\\log(1/\\delta)}{\\alpha^2 \\chi^2(f_{\\theta_1} | f_{\\theta_0})} \\right\\}.$$\n\nThe next corollary relates Theorem 2 to the most biased coin problem and is related to Malloy et al. [15, Theorem 4], which considers the limit as $\\alpha \\to 0$ and assumes $m$ is sufficiently large (specifically, large enough for the Chernoff-Stein lemma to apply). In contrast, our result holds for all finite $\\delta, \\alpha, m$.\n\nCorollary 1 Fix $\\delta \\in (0, 1)$. For any $m \\in \\mathbb{N}$ consider a $\\delta$-probably correct strategy that flips each coin exactly $m$ times. If $N_m$ is the number of coins considered before declaring a coin as heavy then\n\n$$\\min_{m \\in \\mathbb{N}} E[m N_m] \\ge \\frac{(1 - \\delta) \\log\\left(\\frac{\\log(1/\\delta)}{\\alpha}\\right)}{\\alpha} \\cdot \\frac{\\theta_0 (1 - \\theta_0)}{(\\theta_1 - \\theta_0)^2}.$$\n\nOne can show the existence of such a strategy with a nearly matching upper bound when $\\alpha, \\theta_0, \\theta_1$ are known (see Appendix B.1). Note that this is at least a factor of $\\log(1/\\alpha)$ larger than the sample complexity of (1) that can be achieved by an adaptive algorithm when the parameters are known.\n\n3.2.2 Sample complexity when parameters are unknown\n\nIf $\\alpha$, $\\theta_0$, and $\\theta_1$ are unknown, we cannot test $f_{\\theta_0}$ against the mixture $(1 - \\alpha) f_{\\theta_0} + \\alpha f_{\\theta_1}$. 
Instead, we have the general composite test of any individual distribution against any mixture, which is at least as hard as the hypothesis test of Problem P2 with $\\widetilde{\\Theta} = \\{\\theta\\}$ for some particular worst-case setting of $\\theta$. Without any specific form of $f_{\\theta}$, it is difficult to pick a worst case $\\theta$ that will produce a tight bound. Consequently, in this section we consider single parameter exponential families (defined formally below) to provide us with a class of distributions in which we can reason about different possible values for $\\theta$. Since exponential families include Bernoulli, Gaussian, exponential, and many other distributions, the following theorem is general enough to be useful in a wide variety of settings. The constant $C$ referred to in the next theorem is an absolute constant under certain conditions that we outline in the following remark and corollary; its explicit form is given in the proof.\n\nTheorem 3 Suppose $f_{\\theta}$ for $\\theta \\in \\Theta \\subset \\mathbb{R}$ is a single parameter exponential family so that $f_{\\theta}(x) = h(x) \\exp(\\eta(\\theta) x - b(\\eta(\\theta)))$ for some scalar functions $h, b, \\eta$ where $\\eta$ is strictly increasing. If $\\widetilde{\\Theta} = \\{\\theta_*\\}$ where $\\theta_* = \\eta^{-1}\\left((1 - \\alpha) \\eta(\\theta_0) + \\alpha \\eta(\\theta_1)\\right)$ and $N$ is the stopping time of any procedure that satisfies $P_0(N < \\infty) \\le \\delta$ and $P_1\\left(\\cup_{i=1}^{N} \\{\\xi_i = 1\\}\\right) \\ge 1 - \\delta$, then\n\n$$E_1[N] \\ge \\max\\left\\{ \\frac{1 - \\delta}{\\alpha}, \\frac{\\log(\\frac{1}{\\delta})}{C \\left( \\frac{1}{2} \\alpha (1 - \\alpha) (\\eta(\\theta_1) - \\eta(\\theta_0))^2 \\right)^2} \\right\\}$$\n\nwhere $C$ is a constant that may depend on $\\alpha, \\theta_0, \\theta_1$.\n\nThe following remark and corollary apply Theorem 3 to the special cases of Gaussian mixture model detection and the most biased coin problem, respectively.\n\nRemark 1 When $\\alpha, \\theta_0, \\theta_1$ are unknown, any procedure has no knowledge of $\\widetilde{\\Theta}$ in Problem P2 and consequently it cannot rule out $\\theta = \\theta_*$ for $H_0$ where $\\theta_*$ is defined in Theorem 3. 
If $f_{\\theta} = \\mathcal{N}(\\theta, \\sigma^2)$ for known $\\sigma$, then whenever $\\frac{(\\theta_1 - \\theta_0)^2}{\\sigma^2} \\le 1$ the constant $C$ in Theorem 3 is an absolute constant and consequently, $E_1[N] = \\Omega\\left( \\left( \\frac{\\sigma^2}{\\alpha (\\theta_1 - \\theta_0)^2} \\right)^2 \\log(1/\\delta) \\right)$. Conversely, when $\\alpha, \\theta_0, \\theta_1$ are known, then we simply need to determine whether samples came from $\\mathcal{N}(\\theta_0, \\sigma^2)$ or $(1 - \\alpha) \\mathcal{N}(\\theta_0, \\sigma^2) + \\alpha \\mathcal{N}(\\theta_1, \\sigma^2)$, and we show that it is sufficient to take just $O\\left( \\frac{\\sigma^2}{\\alpha^2 (\\theta_1 - \\theta_0)^2} \\log(1/\\delta) \\right)$ samples (see Appendix C).\n\nCorollary 2 Fix $\\delta \\in [0, 1]$ and assume $\\theta_0, \\theta_1$ are bounded sufficiently far from $\\{0, 1\\}$ such that $2(\\theta_1 - \\theta_0) \\le \\min\\{\\theta_0 (1 - \\theta_0), \\theta_1 (1 - \\theta_1)\\}$. For any $m$ let $N_m$ be the number of coins considered by a $\\delta$-probably correct estimate-then-explore strategy that flips each coin $m$ times in the exploration step. Then\n\n$$m E[N_m] \\ge \\frac{c' \\min\\{\\frac{1}{m}, \\theta_* (1 - \\theta_*)\\}}{\\left( \\alpha (1 - \\alpha) \\frac{(\\theta_1 - \\theta_0)^2}{\\theta_* (1 - \\theta_*)} \\right)^2} \\log(\\tfrac{1}{\\delta}) \\quad \\text{whenever} \\quad m \\le \\frac{\\theta_* (1 - \\theta_*)}{(\\theta_1 - \\theta_0)^2},$$\n\nwhere $c'$ is an absolute constant and $\\theta_* = \\eta^{-1}\\left((1 - \\alpha) \\eta(\\theta_0) + \\alpha \\eta(\\theta_1)\\right) \\in [\\theta_0, \\theta_1]$.\n\nRemark 2 If $\\alpha, \\theta_0, \\theta_1$ are unknown, any estimate-then-explore strategy (or the strategy described in Corollary 1) would be unable to choose an $m$ that depended on these parameters, so we can treat it as a constant. Thus, for the case when $\\theta_0$ and $\\theta_1$ are bounded away from $\\{0, 1\\}$ (e.g. 
$\\theta_0, \\theta_1 \\in [1/8, 7/8]$), the above corollary states that for any fixed $m$, whenever $\\theta_1 - \\theta_0$ is sufficiently small the number of samples necessary for these strategies to identify a heavy coin scales like $\\left( \\frac{1}{\\alpha (\\theta_1 - \\theta_0)^2} \\right)^2 \\log(1/\\delta)$. This is a striking example of the difference when parameters are known versus when they are not and effectively rules out an estimate-then-explore strategy for practical purposes.\n\nSetting | Upper Bound | Lower Bound\nFixed, known $\\alpha, \\theta_0, \\theta_1$ | $\\frac{\\log(1/(\\delta\\alpha))}{\\alpha\\epsilon^2}$ (Thm. 7) | $\\frac{\\log(\\log(1/\\delta)/\\alpha)}{\\alpha\\epsilon^2}$ (Cor. 1)\nAdaptive, known $\\alpha, \\theta_0, \\theta_1$ | $\\frac{1}{\\epsilon^2}\\left(\\frac{1}{\\alpha} + \\log(\\frac{1}{\\delta})\\right)$ ([8, 15], Thm. 4) | $\\frac{1}{\\alpha\\epsilon^2}$ ([15])\nEst+Expl, unknown $\\alpha, \\theta_0, \\theta_1$ | Unconsidered\u2020 | $\\left(\\frac{1}{\\alpha\\epsilon^2}\\right)^2 \\log(\\frac{1}{\\delta})$ (Cor. 2)\nAdaptive, unknown $\\alpha, \\theta_0, \\theta_1$ | $c \\, \\frac{\\log(\\frac{1}{\\alpha\\epsilon^2}) \\log(\\log(\\frac{1}{\\alpha\\epsilon^2})/\\delta)}{\\alpha\\epsilon^2}$ (Thm. 5) | $\\frac{1}{\\alpha\\epsilon^2}$ ([15])\n\nTable 1: Upper and lower bounds on the expected sample complexity of different $\\delta$-probably correct strategies. Fixed refers to the strategy of Corollary 1. For this table, we assume $\\min\\{\\theta_0 (1 - \\theta_0), \\theta_1 (1 - \\theta_1)\\}$ is lower bounded by a constant (e.g. $\\theta_0, \\theta_1 \\in [1/8, 7/8]$) and $\\epsilon = \\theta_1 - \\theta_0$ is sufficiently small. Also note that the upper bounds apply to distributions supported on $[0, 1]$, not just coins. All results without bracketed citations were unknown prior to this work. \u2020 Due to our discouraging lower bound for any estimate-then-explore strategy, it is inadvisable to propose an algorithm.\n\n4 Near optimal adaptive algorithm\n\nIn this section we propose an algorithm that has no prior knowledge of the parameters $\\alpha, \\theta_0, \\theta_1$ yet yields an upper bound that matches the lower bound of Theorem 1 up to logarithmic factors. 
We assume that samples from heavy or light distributions are supported on $[0, 1]$, and that drawn samples are independent and unbiased estimators of the mean, i.e., $E[X_{i,j}] = \\mu_i$ for $\\mu_i \\in \\{\\theta_0, \\theta_1\\}$. All results can be easily extended to sub-Gaussian distributions. Consider Algorithm 2, an SPRT-like procedure [18] for finding a heavy distribution given $\\delta$ and lower bounds on $\\alpha$ and $\\epsilon = \\theta_1 - \\theta_0$. It improves upon prior work by supporting arbitrary distributions on $[0, 1]$ and requiring only lower bounds on $\\alpha$ and $\\epsilon$.\n\nAlgorithm 2 Adaptive strategy for heavy distribution identification with inputs $\\alpha_0, \\epsilon_0, \\delta$\n\nGiven $\\delta \\in (0, 1/4)$, $\\alpha_0 \\in (0, 1/2)$, $\\epsilon_0 \\in (0, 1)$.\nInitialize $n = \\lceil 2 \\log(9) / \\alpha_0 \\rceil$, $m = \\lceil 64 \\epsilon_0^{-2} \\log(14 n / \\delta) \\rceil$, $A = -8 \\epsilon_0^{-1} \\log(21)$, $B = 8 \\epsilon_0^{-1} \\log(14 n / \\delta)$, $k_1 = 5$, $k_2 = \\lceil 8 \\epsilon_0^{-2} \\log(2 k_1 / \\min\\{\\delta / 8, m^{-1} \\epsilon_0^{-2}\\}) \\rceil$.\nDraw $k_1$ distributions and sample them each $k_2$ times.\nEstimate $\\widehat{\\theta}_0 = \\min_{i = 1, \\dots, k_1} \\widehat{\\mu}_{i, k_2}$, $\\widehat{\\gamma} = \\widehat{\\theta}_0 + \\epsilon_0 / 2$.\nRepeat for $i = 1, \\dots, n$:\n  Draw distribution $i$.\n  Repeat for j = 1, . . . 
, m:\n    Sample distribution i and observe $X_{i,j}$.\n    If $\\sum_{k=1}^{j} (X_{i,k} - \\hat{\\gamma}) > B$:\n      Declare distribution i to be heavy and Output distribution i.\n    Else if $\\sum_{k=1}^{j} (X_{i,k} - \\hat{\\gamma}) < A$:\n      break.\nOutput null.\n\nTheorem 4 If Algorithm 2 is run with $\\delta \\in (0, 1/4)$, $\\alpha_0 \\in (0, 1/2)$, $\\epsilon_0 \\in (0, 1)$, then the expected\nnumber of total samples taken by the algorithm is no more than\n\n$$\\frac{c' \\alpha \\log(1/\\alpha_0) + c'' \\log(1/\\delta)}{\\alpha_0 \\epsilon_0^2} \\qquad (3)$$\n\nfor some absolute constants $c', c''$, and all of the following hold: 1) with probability at least $1 - \\delta$,\na light distribution is not returned, 2) if $\\epsilon_0 \\leq \\theta_1 - \\theta_0$ and $\\alpha_0 \\leq \\alpha$, then with probability 4/5 a heavy\ndistribution is returned, and 3) the procedure takes no more than $c \\frac{\\log(1/(\\alpha_0\\delta))}{\\alpha_0\\epsilon_0^2}$\ntotal samples.\n\nThe second claim of the theorem holds only with constant probability (versus with probability $1 - \\delta$)\nsince a heavy distribution appears among the $n = \\lceil 2\\log(4)/\\alpha_0 \\rceil$ drawn distributions\nonly with constant probability. One can show that if the outer loop of the algorithm is allowed\nto run indefinitely (with m and n defined as is), $\\epsilon_0 = \\theta_1 - \\theta_0$, $\\alpha_0 = \\alpha$, and $\\hat{\\theta}_0 = \\theta_0$, then a heavy\ncoin is returned with probability at least $1 - \\delta$ and the expected number of samples is bounded by\n(3). If a tight lower bound is known on either $\\epsilon = \\theta_1 - \\theta_0$ or $\\alpha$, there is only one parameter that is\nunknown and the \u201cdoubling trick\u201d, along with Theorem 4, can be used to identify a heavy coin with\njust $\\frac{\\log(\\log(\\epsilon^{-2})/\\delta)}{\\alpha\\epsilon^2}$ and $\\frac{\\log(\\log(\\alpha^{-1})/\\delta)}{\\alpha\\epsilon^2}$ samples, respectively (see Appendix B.3).\n\nNow consider Algorithm 3, which assumes no prior knowledge of $\\alpha, \\theta_0, \\theta_1$; it is the first result for this setting\nthat we are aware of. We remark that while the placing of \u201clandmarks\u201d $(\\alpha_k, \\epsilon_k)$ throughout the search\nspace as is done in Algorithm 3 appears elementary in hindsight, it is surprising that so few can cover\nthis two dimensional space since one has to balance the exploration of $\\alpha$ and $\\epsilon$. We believe a\nsimilar approach may be generalized for more generic infinite armed bandit problems.\n\nAlgorithm 3 Adaptive strategy for heavy distribution identification with unknown parameters\nGiven $\\delta > 0$.\nInitialize $\\ell = 1$, heavy distribution h = null.\nRepeat until h is not null:\n  Set $\\gamma_\\ell = 2^\\ell$, $\\delta_\\ell = \\delta/(2\\ell^3)$\n  Repeat for $k = 0, \\dots, \\ell$:\n    Set $\\alpha_k = 2^k/\\gamma_\\ell$, $\\epsilon_k = \\sqrt{\\frac{1}{2\\alpha_k\\gamma_\\ell}}$\n    Run Algorithm 2 with $\\alpha_0 = \\alpha_k$, $\\epsilon_0 = \\epsilon_k$, $\\delta = \\delta_\\ell$ and Set h to its output.\n    If h is not null break\n  Set $\\ell = \\ell + 1$\nOutput h\n\nTheorem 5 (Unknown $\\alpha, \\theta_0, \\theta_1$) Fix $\\delta \\in (0, 1)$. If Algorithm 3 is run with $\\delta$ then with probability\nat least $1 - \\delta$ a heavy distribution is returned and the expected number of total samples taken is\nbounded by\n\n$$c \\frac{\\log_2(\\frac{1}{\\alpha\\epsilon^2})}{\\alpha\\epsilon^2} \\left(\\alpha \\log_2(\\frac{1}{\\epsilon^2}) + \\log(\\log_2(\\frac{1}{\\alpha\\epsilon^2})) + \\log(1/\\delta)\\right)$$\n\nfor an absolute constant c.\n\n5 Conclusion\nWhile all prior works have required at least partial knowledge of $\\alpha, \\theta_0, \\theta_1$ to solve the most biased\ncoin problem, our algorithm requires no knowledge of these parameters yet obtains near-optimal\nsample complexity. In addition, we have proved lower bounds on the sample complexity of detecting\nthe presence of a mixture distribution when the parameters are known or unknown, with consequences\nfor any estimate-then-explore strategy, an approach previously proposed for an infinite armed bandit\nproblem. Extending our adaptive algorithm to arbitrary arm reservoir distributions is of significant\ninterest. 
We believe a successful algorithm in this vein could have a significant impact on how\nresearchers think about sequential decision processes in both finite and uncountable action spaces.\n\nAcknowledgments Kevin Jamieson is generously supported by ONR awards N00014-15-1-2620 and N00014-13-1-0129.\nThis research is supported in part by NSF CISE Expeditions Award CCF-1139158, DOE Award\nSN10040 DE-SC0012463, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services,\nGoogle, IBM, SAP, The Thomas and Stacey Siebel Foundation, Apple Inc., Arimo, Blue Goji, Bosch, Cisco, Cray,\nCloudera, Ericsson, Facebook, Fujitsu, Guavus, HP, Huawei, Intel, Microsoft, Pivotal, Samsung, Schlumberger,\nSplunk, State Farm and VMware.\n\nReferences\n[1] Jayadev Acharya, Constantinos Daskalakis, and Gautam C Kamath. Optimal testing for properties of\ndistributions. In Advances in Neural Information Processing Systems, pages 3577\u20133598, 2015.\n\n[2] Deepak Agarwal. Detecting anomalies in cross-classified streams: a Bayesian approach. Knowledge and\nInformation Systems, 11(1):29\u201344, 2006.\n\n[3] Alessandro Arlotto, Stephen E Chick, and Noah Gans. Optimal hiring and retention policies for heterogeneous\nworkers who learn. Management Science, 60(1):110\u2013129, 2013.\n\n[4] Michael S Bernstein, Joel Brandt, Robert C Miller, and David R Karger. Crowds in two seconds: enabling\nrealtime crowd-powered interfaces. UIST, 2011.\n\n[5] Donald A. Berry, Robert W. Chen, Alan Zame, David C. Heath, and Larry A. Shepp. Bandit problems\nwith infinitely many arms. Ann. Statist., 25(5):2103\u20132116, 10 1997.\n\n[6] Thomas Bonald and Alexandre Proutiere. Two-target algorithms for infinite-armed bandits with Bernoulli\nrewards. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances\nin Neural Information Processing Systems 26, pages 2184\u20132192. 
Curran Associates, Inc., 2013.\n\n[7] Alexandra Carpentier and Michal Valko. Simple regret for infinitely many armed bandits. arXiv preprint\narXiv:1505.04627, 2015.\n\n[8] Karthekeyan Chandrasekaran and Richard Karp. Finding a most biased coin with fewest flips. In\nProceedings of The 27th Conference on Learning Theory, pages 394\u2013407, 2014.\n\n[9] Karthekeyan Chandrasekaran and Richard M. Karp. Finding the most biased coin with fewest flips. CoRR,\nabs/1202.3639, 2012. URL http://arxiv.org/abs/1202.3639.\n\n[10] Eleazar Eskin. Anomaly detection over noisy data using learned probability distributions. In Proceedings of\nthe Seventeenth International Conference on Machine Learning, ICML \u201900, pages 255\u2013262, San Francisco,\nCA, USA, 2000. Morgan Kaufmann Publishers Inc.\n\n[11] Yoav Freund and Yishay Mansour. Estimating a mixture of two product distributions. In Proceedings of\nthe twelfth annual conference on Computational learning theory, pages 53\u201362. ACM, 1999.\n\n[12] Daniel Haas, Jiannan Wang, Eugene Wu, and Michael J. Franklin. Clamshell: Speeding up crowds for\nlow-latency data labeling. Proc. VLDB Endow., 9(4):372\u2013383, December 2015. ISSN 2150-8097.\n\n[13] Moritz Hardt and Eric Price. Sharp bounds for learning a mixture of two Gaussians. ArXiv e-prints, 1404,\n2014.\n\n[14] Kevin Jamieson and Robert Nowak. Best-arm identification algorithms for multi-armed bandits in the\nfixed confidence setting. In Information Sciences and Systems (CISS), pages 1\u20136. IEEE, 2014.\n\n[15] Matthew L Malloy, Gongguo Tang, and Robert D Nowak. Quickest search for a rare distribution. In\nInformation Sciences and Systems (CISS), pages 1\u20136. IEEE, 2012.\n\n[16] MTurk. Amazon Mechanical Turk. https://www.mturk.com/.\n\n[17] David Pollard. Asymptopia. Manuscript in progress. Available at http://www.stat.yale.edu/~pollard,\n2000.\n\n[18] David Siegmund. 
Sequential analysis: tests and confidence intervals. Springer Science & Business Media,\n2013.\n\n[19] Robert Spira. Calculation of the gamma function by Stirling\u2019s formula. Mathematics of Computation, pages\n317\u2013322, 1971.\n\n[20] Gautam Thatte, Urbashi Mitra, and John Heidemann. Parametric methods for anomaly detection in\naggregate traffic. IEEE/ACM Trans. Netw., 19(2):512\u2013525, April 2011. ISSN 1063-6692.\n\n[21] Yizao Wang, Jean-Yves Audibert, and R\u00e9mi Munos. Algorithms for infinitely many-armed bandits. In\nD. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing\nSystems 21, pages 1729\u20131736. Curran Associates, Inc., 2009.", "award": [], "sourceid": 460, "authors": [{"given_name": "Kevin", "family_name": "Jamieson", "institution": "UC Berkeley"}, {"given_name": "Daniel", "family_name": "Haas", "institution": "UC Berkeley"}, {"given_name": "Benjamin", "family_name": "Recht", "institution": "UC Berkeley"}]}