{"title": "Extreme bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 1089, "page_last": 1097, "abstract": "In many areas of medicine, security, and life sciences, we want to allocate limited resources to different sources in order to detect extreme values. In this paper, we study an efficient way to allocate these resources sequentially under limited feedback. While sequential design of experiments is well studied in bandit theory, the most commonly optimized property is the regret with respect to the maximum mean reward. However, in other problems such as network intrusion detection, we are interested in detecting the most extreme value output by the sources. Therefore, in our work we study extreme regret which measures the efficiency of an algorithm compared to the oracle policy selecting the source with the heaviest tail. We propose the ExtremeHunter algorithm, provide its analysis, and evaluate it empirically on synthetic and real-world experiments.", "full_text": "Extreme bandits\n\nAlexandra Carpentier\n\nStatistical Laboratory, CMS\nUniversity of Cambridge, UK\n\nMichal Valko\nSequeL team\n\nINRIA Lille - Nord Europe, France\n\na.carpentier@statslab.cam.ac.uk\n\nmichal.valko@inria.fr\n\nAbstract\n\nIn many areas of medicine, security, and life sciences, we want to allocate lim-\nited resources to different sources in order to detect extreme values. In this paper,\nwe study an ef\ufb01cient way to allocate these resources sequentially under limited\nfeedback. While sequential design of experiments is well studied in bandit theory,\nthe most commonly optimized property is the regret with respect to the maximum\nmean reward. However, in other problems such as network intrusion detection, we\nare interested in detecting the most extreme value output by the sources. 
Therefore, in our work we study extreme regret, which measures the efficiency of an algorithm compared to the oracle policy selecting the source with the heaviest tail. We propose the EXTREMEHUNTER algorithm, provide its analysis, and evaluate it empirically on synthetic and real-world experiments.

1 Introduction

We consider problems where the goal is to detect outstanding events or extreme values in domains such as outlier detection [1], security [18], or medicine [17]. The detection of extreme values is important in many life sciences, such as epidemiology, astronomy, or hydrology, where, for example, we may want to know the peak water flow. We are also motivated by network intrusion detection, where the objective is to find the network node that was compromised, e.g., by seeking the one creating the largest number of outgoing connections at once. The search for extreme events is typically studied in the field of anomaly detection, where one seeks to find examples that are far away from the majority, according to some problem-specific distance (cf. the surveys [8, 16]).

In anomaly detection research, the concept of anomaly is ambiguous and several definitions exist [16]: point anomalies, structural anomalies, contextual anomalies, etc. These definitions are often followed by heuristic approaches that are seldom analyzed theoretically. Nonetheless, there exist some theoretical characterizations of anomaly detection. For instance, Steinwart et al. [19] consider the level sets of the distribution underlying the data, and rare events corresponding to rare level sets are then identified as anomalies. A very challenging characteristic of many problems in anomaly detection is that the data emitted by the sources tend to be heavy-tailed (e.g., network traffic [2]) and anomalies come from the sources with the heaviest distribution tails.
In this case, the rare level sets of [19] correspond to the distributions' tails and anomalies to extreme values. Therefore, we focus on the kind of anomalies that are characterized by their outburst of events or extreme values, as in the setting of [22] and [17].

Since in many cases the collection of the data samples emitted by the sources is costly, it is important to design adaptive-learning strategies that spend more time sampling sources that have a higher risk of being abnormal. The main objective of our work is the active allocation of the sampling resources for anomaly detection, in the setting where anomalies are defined as extreme values. Specifically, we consider a variation of the common setting of minimal feedback, also known as the bandit setting [14]: the learner searches for the most extreme value that the sources output by probing the sources sequentially. In this setting, it must carefully decide which sources to observe, because it only receives the observation from the source it chooses to observe. As a consequence, it needs to allocate the sampling time efficiently and should not waste it on sources that do not have an abnormal character. We call this specific setting extreme bandits, but it is also known as the max-k problem [9, 21, 20]. We emphasize that extreme bandits are poles apart from classical bandits, where the objective is to maximize the sum of observations [3]. An effective algorithm for the classical bandit setting should focus on the source with the highest mean, while an effective algorithm for the extreme bandit problem should focus on the source with the heaviest tail. It is often the case that a heavy-tailed source has a small mean, which implies that the classical bandit algorithms perform poorly for the extreme bandit problem.

The challenging part of our work dwells in the active sampling strategy to detect the heaviest tail under the limited bandit feedback.
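The limited-feedback loop just described can be made concrete with a minimal sketch (hypothetical code: the Pareto-like arms and the round-robin `policy` are illustrative placeholders, not the algorithm proposed in this paper):

```python
import random

def extreme_bandit_run(arms, policy, n, seed=0):
    """Play n rounds of the limited-feedback protocol.

    arms   -- list of callables; arms[k]() draws one sample from source k
    policy -- maps the history of (arm, sample) pairs to the next arm index
    Returns the largest observed value, i.e. the max over t <= n of X_{I_t,t}.
    """
    random.seed(seed)
    history, best = [], float("-inf")
    for _ in range(n):
        k = policy(history)
        x = arms[k]()               # bandit feedback: one sample from one source
        history.append((k, x))
        best = max(best, x)
    return best

# Placeholder heavy-tailed sources, P_k(x) = 1 - x**(-alpha_k);
# a smaller alpha means a heavier tail.
def pareto_arm(alpha):
    return lambda: (1.0 / (1.0 - random.random())) ** (1.0 / alpha)

arms = [pareto_arm(a) for a in (5.0, 1.1, 2.0)]
round_robin = lambda history: len(history) % len(arms)
print(extreme_bandit_run(arms, round_robin, n=1000))
```

The point of the sketch is the feedback model: only the chosen source's sample is revealed, so a strategy that spreads its pulls uniformly wastes most of its budget on light-tailed sources.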
We proffer EXTREMEHUNTER, a theoretically founded algorithm that sequentially allocates the resources in an efficient way, for which we prove performance guarantees. Our algorithm is efficient under a mild semi-parametric assumption common in extreme value theory, while the known results of [9, 21, 20] for the extreme bandit problem only hold in a parametric setting (see Section 4 for a detailed comparison).

2 Learning model for extreme bandits

In this section, we formalize the active (bandit) setting and characterize the measure of performance for any algorithm π. The learning setting is defined as follows. Every time step, each of the K arms (sources) emits a sample X_{k,t} ~ P_k, unknown to the learner. The precise characteristics of P_k are defined in Section 3. The learner π then chooses some arm I_t and then receives only the sample X_{I_t,t}. The performance of π is evaluated by the most extreme value found and compared to the most extreme value possible. We define the reward of a learner π as:

$$G^\pi_n = \max_{t\le n} X_{I_t,t}$$

The optimal oracle strategy is the one that chooses at each time the arm with the highest potential of revealing the highest value, i.e., the arm * with the heaviest tail. Its expected reward is then:

$$\mathbb{E}[G^*_n] = \max_{k\le K} \mathbb{E}\Big[\max_{t\le n} X_{k,t}\Big]$$

The goal of learner π is to get as close as possible to the optimal oracle strategy. In other words, the aim of π is to minimize the expected extreme regret:

Definition 1. The extreme regret in the bandit setting is defined as:

$$\mathbb{E}[R^\pi_n] = \mathbb{E}[G^*_n] - \mathbb{E}[G^\pi_n] = \max_{k\le K} \mathbb{E}\Big[\max_{t\le n} X_{k,t}\Big] - \mathbb{E}\Big[\max_{t\le n} X_{I_t,t}\Big]$$

3 Heavy-tailed distributions

In this section, we formally define our observation model. Let $X_1, \dots, X_n$ be $n$ i.i.d.
observations from a distribution P. The behavior of the statistic $\max_{i\le n} X_i$ is studied by extreme value theory. One of the main results is the Fisher-Tippett-Gnedenko theorem [11, 12] that characterizes the limiting distribution of this maximum as n converges to infinity. Specifically, it proves that a rescaled version of this maximum converges to one of three possible distributions: Gumbel, Fréchet, or Weibull. This rescaling factor depends on n. To be concise, we write "$\max_{i\le n} X_i$ converges to a distribution" to refer to the convergence of the rescaled version to a given distribution. The Gumbel distribution corresponds to the limiting distribution of the maximum of 'not too heavy tailed' distributions, such as sub-Gaussian or sub-exponential distributions. The Weibull distribution coincides with the behaviour of the maximum of some specific bounded random variables. Finally, the Fréchet distribution corresponds to the limiting distribution of the maximum of heavy-tailed random variables. As many interesting problems concern heavy-tailed distributions, we focus on Fréchet distributions in this work. The distribution function of a Fréchet random variable is defined for x ≥ m, and for two parameters α, s, as:

$$P(x) = \exp\left\{-\left(\frac{x-m}{s}\right)^{-\alpha}\right\}.$$

In this work, we consider positive distributions P : [0,∞) → [0,1]. For α > 0, the Fisher-Tippett-Gnedenko theorem also states that the statement 'P converges to an α-Fréchet distribution' is equivalent to the statement '1 − P is a −α regularly varying function in the tail'. These statements are slightly less restrictive than the definition of approximately α-Pareto distributions¹, i.e., that there exists C such that P verifies:

$$\lim_{x\to\infty} \frac{\left|1 - P(x) - Cx^{-\alpha}\right|}{x^{-\alpha}} = 0, \qquad (1)$$

or equivalently that $P(x) = 1 - Cx^{-\alpha} + o(x^{-\alpha})$. If and only if 1 − P is −α regularly varying in the tail, then the limiting distribution of $\max_i X_i$ is an α-Fréchet distribution. The assumption of −α regular variation in the tail is thus the weakest possible assumption that ensures that the (properly rescaled) maximum of samples emitted by a heavy-tailed distribution has a limit. Therefore, the very related assumption of approximate Pareto is almost minimal, but it is (provably) still not restrictive enough to ensure a convergence rate. For this reason, it is natural to introduce an assumption that is slightly stronger than (1). In particular, we assume, as is common in the extreme value literature, a second order Pareto condition, also known as the Hall condition [13].

Definition 2. A distribution P is (α, β, C, C′)-second order Pareto (α, β, C, C′ > 0) if for x ≥ 0:

$$\left|1 - P(x) - Cx^{-\alpha}\right| \le C' x^{-\alpha(1+\beta)}$$

By this definition, $P(x) = 1 - Cx^{-\alpha} + O\big(x^{-\alpha(1+\beta)}\big)$, which is stronger than the assumption $P(x) = 1 - Cx^{-\alpha} + o(x^{-\alpha})$, but similar for small β.

Remark 1. In the definition above, β defines the rate of the convergence (when x diverges to infinity) of the tail of P to the tail of a Pareto distribution $1 - Cx^{-\alpha}$. The parameter α characterizes the heaviness of the tail: the smaller the α, the heavier the tail.
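As a quick illustration (hypothetical code, not part of the paper): the exact Pareto distribution $P(x) = 1 - Cx^{-\alpha}$ satisfies Definition 2 trivially, since the left-hand side is identically zero, and inverse-transform sampling from it shows Remark 1 in action, i.e., that a smaller α yields heavier tails and larger maxima:

```python
import random

def pareto_cdf(x, alpha, C=1.0):
    """Exact Pareto distribution: P(x) = 1 - C * x**(-alpha) for x >= C**(1/alpha)."""
    return 1.0 - C * x ** (-alpha)

def sample_pareto(n, alpha, C=1.0, seed=0):
    """Inverse-transform sampling: solve P(x) = u for u ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    return [(C / (1.0 - rng.random())) ** (1.0 / alpha) for _ in range(n)]

# Definition 2 holds trivially for the exact Pareto:
# |1 - P(x) - C * x**(-alpha)| is identically zero.
assert abs(1.0 - pareto_cdf(3.0, alpha=2.0) - 3.0 ** -2.0) < 1e-12

# Remark 1: the smaller alpha, the heavier the tail, hence the larger the maxima.
n = 10_000
heavy = max(sample_pareto(n, alpha=1.1, seed=1))
light = max(sample_pareto(n, alpha=5.0, seed=1))
print(heavy, light)   # heavy is typically orders of magnitude larger
```

With the same underlying uniforms, every sample from the α = 1.1 source dominates the corresponding sample from the α = 5 source, so the heavier tail produces the larger maximum pointwise.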
In the remainder of the paper, we will therefore be concerned with learning the α_k and identifying the smallest one among the sources.

4 Related work

There is a vast body of research in offline anomaly detection, which looks for examples that deviate from the rest of the data, or that are not expected from some underlying model. A comprehensive review of many anomaly detection approaches can be found in [16] or [8]. There has also been some work in active learning for anomaly detection [1], which uses a reduction to classification. In online anomaly detection, most of the research focuses on studying the setting where a set of variables is monitored. A typical example is the monitoring of cold relief medications, where we are interested in detecting an outbreak [17]. Similarly to our focus, these approaches do not look for outliers in a broad sense but rather for an unusual burst of events [22].

In the extreme value settings above, it is often assumed that we have full information about each variable. This is in contrast to the limited feedback, or bandit, setting that we study in our work. There has recently been some interest in bandit algorithms for heavy-tailed distributions [4]. However, the goal of [4] is radically different from ours, as they maximize the sum of rewards and not the maximal reward. Bandit algorithms have already been used for network intrusion detection [15], but they typically consider the classical or restless setting. [9, 21, 20] were the first to consider the extreme bandits problem, where our setting is defined as the max-k problem. [21] and [9] consider a fully parametric setting. The reward distributions are assumed to be exactly generalized extreme value distributions.
Specifically, [21] assumes that the distributions are exactly Gumbel, $P(x) = \exp\big(-e^{-(x-m)/s}\big)$, and [9] that the distributions are exactly Gumbel or Fréchet, $P(x) = \exp\big(-((x-m)/s)^{-\alpha}\big)$. Provided that these assumptions hold, they propose an algorithm for which the regret is asymptotically negligible when compared to the optimal oracle reward. These results are interesting since they are the first for extreme bandits, but their parametric assumption is unlikely to hold in practice and the asymptotic nature of their bounds limits their impact. Interestingly, the objective of [20] is to remove the parametric assumptions of [21, 9] by offering the THRESHOLDASCENT algorithm. However, no analysis of this algorithm for extreme bandits is provided. Nonetheless, to the best of our knowledge, this is the closest competitor for EXTREMEHUNTER, and we empirically compare our algorithm to THRESHOLDASCENT in Section 7.

¹We recall the definition of the standard Pareto distribution as a distribution P where, for some constants α and C, we have that for $x \ge C^{1/\alpha}$, $P(x) = 1 - Cx^{-\alpha}$.

In this paper we also target the extreme bandit setting, but contrary to [9, 21, 20], we only make a semi-parametric assumption on the distributions: the second order Pareto assumption (Definition 2), which is standard in extreme value theory (see, e.g., [13, 10]). This is significantly weaker than the parametric assumptions made in the prior works for extreme bandits. Furthermore, we provide a finite-time regret bound for our more general semi-parametric setting (Theorem 2), while the prior works only offer asymptotic results. In particular, we provide an upper bound on the rate at which the regret becomes negligible when compared to the optimal oracle reward (Definition 1).

5 Extreme Hunter

In this section, we present our main results.
In particular, we present the algorithm and the main theorem that bounds its extreme regret. Before that, we first provide an initial result on the expectation of the maximum of second order Pareto random variables, which will set the benchmark for the oracle regret. The following theorem states that the expectation of the maximum of i.i.d. second order Pareto samples is equal, up to a negligible term, to the expectation of the maximum of i.i.d. Pareto samples. This result is crucial for assessing the benchmark for the regret, in particular the expected value of the maximal oracle sample. Theorem 1 is based on Lemma 3, both provided in the appendix.

Theorem 1. Let $X_1, \dots, X_n$ be $n$ i.i.d. samples drawn according to an (α, β, C, C′)-second order Pareto distribution P (see Definition 2). If α > 1, then:

$$\Big|\mathbb{E}\big(\max_i X_i\big) - (nC)^{1/\alpha}\,\Gamma\big(1-\tfrac{1}{\alpha}\big)\Big| \le \frac{4D_2}{n}(nC)^{1/\alpha} + \frac{2C'D_{\beta+1}}{C^{\beta+1}n^{\beta}}(nC)^{1/\alpha} + B = o\big((nC)^{1/\alpha}\big),$$

where $D_2, D_{1+\beta} > 0$ are some universal constants, and B is defined in the appendix (9).

Theorem 1 implies that the optimal strategy in hindsight attains the following expected reward:

$$\mathbb{E}[G^*_n] \approx \max_k \Big[(C_k n)^{1/\alpha_k}\,\Gamma\big(1-\tfrac{1}{\alpha_k}\big)\Big]$$

Our objective is therefore to find a learner π such that $\mathbb{E}[G^*_n] - \mathbb{E}[G^\pi_n]$ is negligible when compared to $\mathbb{E}[G^*_n]$, i.e., when compared to $(nC_*)^{1/\alpha_*}\,\Gamma\big(1-\tfrac{1}{\alpha_*}\big) \approx n^{1/\alpha_*}$, where * is the optimal arm.

From the discussion above, we know that the minimization of the extreme regret is linked with the identification of the arm with the heaviest tail. Our EXTREMEHUNTER algorithm is based on a classical idea in bandit theory: optimism in the face of uncertainty. Our strategy is to estimate $\mathbb{E}[\max_{t\le n} X_{k,t}]$ for any k and to pull the arm which maximizes its upper bound. From Definition 2, the estimation of this quantity relies heavily on an efficient estimation of α_k and C_k, and on associated confidence widths. This topic is a classic problem in extreme value theory, and such estimators exist provided that one knows a lower bound b on β_k [10, 6, 7]. From now on we assume that a constant b > 0 such that b ≤ min_k β_k is known to the learner. As we argue in Remark 2, this assumption is necessary.

Algorithm 1 EXTREMEHUNTER
  Input:
    K: number of arms
    n: time horizon
    b: where b ≤ β_k for all k ≤ K
    N: minimum number of pulls of each arm
  Initialize:
    T_k ← 0 for all k ≤ K
    δ ← exp(−log² n)/(2nK)
  Run:
  for t = 1 to n do
    for k = 1 to K do
      if T_k ≤ N then
        B_{k,t} ← ∞
      else
        estimate ĥ_{k,t} that verifies (2)
        estimate Ĉ_{k,t} using (3)
        update B_{k,t} using (5) with (2) and (4)
      end if
    end for
    Play arm k_t ← argmax_k B_{k,t}
    T_{k_t} ← T_{k_t} + 1
  end for

Since our main theoretical result is a finite-time upper bound, in the following exposition we carefully describe all the constants and stress what quantities they depend on. Let $T_{k,t}$ be the number of samples drawn from arm k at time t. Define $\delta = \exp(-\log^2 n)/(2nK)$ and consider an estimator $\hat h_{k,t}$ of $1/\alpha_k$ at time t that verifies the following condition with probability $1-\delta$, for $T_{k,t}$ larger than some constant $N_2$ that depends only on $\alpha_k$, $C_k$, $C'$ and b:

$$\Big|\frac{1}{\alpha_k} - \hat h_{k,t}\Big| \le D\sqrt{\log(1/\delta)}\;T_{k,t}^{-b/(2b+1)} = B_1(T_{k,t}), \qquad (2)$$

where D is a constant that also depends only on $\alpha_k$, $C_k$, $C'$, and b. For instance, the estimator in [6] (Theorem 3.7) verifies this property and provides D and $N_2$, but other estimators are possible. Consider the associated estimator for $C_k$:

$$\hat C_{k,t} = T_{k,t}^{1/(2b+1)}\,\frac{1}{T_{k,t}}\sum_{u=1}^{T_{k,t}} \mathbb{1}\Big\{X_{k,u} \ge T_{k,t}^{\hat h_{k,t}/(2b+1)}\Big\} \qquad (3)$$

For this estimator, we know from [7], with probability $1-\delta$, that for $T_{k,t} \ge N_2$:

$$\big|C_k - \hat C_{k,t}\big| \le E\sqrt{\log(T_{k,t}/\delta)}\,\log(T_{k,t})\;T_{k,t}^{-b/(2b+1)} = B_2(T_{k,t}), \qquad (4)$$

where E is derived in [7] in the proof of Theorem 2. Let $N = \max\big(A\log(n)^{2(2b+1)/b}, N_2\big)$, where A depends on $(\alpha_k, C_k)_k$, b, D, E, and $C'$, and is such that:

$$\max\big(2B_1(N),\, 2B_2(N)/C_k\big) \le 1, \quad N \ge \big(2D\log^2 n\big)^{(2b+1)/b}, \quad\text{and}\quad N > \Big(\frac{2\sqrt{2}\,D\log^2 n}{1-\max_k 1/\alpha_k}\Big)^{(2b+1)/b}$$

This inspires Algorithm 1, which first pulls each arm N times and then, at each time t > KN, pulls the arm that maximizes $B_{k,t}$, which we define as:

$$B_{k,t} = \Big(\big(\hat C_{k,t} + B_2(T_{k,t})\big)\,n\Big)^{\hat h_{k,t} + B_1(T_{k,t})}\;\bar\Gamma\big(\hat h_{k,t}, B_1(T_{k,t})\big), \qquad (5)$$

where $\bar\Gamma(x,y) = \tilde\Gamma(1-x-y)$, and we set $\tilde\Gamma(x) = \Gamma(x)$ for any x > 0 and $+\infty$ otherwise.

Remark 2. A natural question is whether it is possible to learn β_k as well. In fact, this is not possible for this model and a negative result was proved by [7]. The result states that in this setting it is not possible to test between two fixed values of β uniformly over the set of distributions. Thereupon, we define b as a lower bound for all β_k.
With regards to the Pareto distribution, β = ∞ corresponds to the exact Pareto distribution, while β = 0 corresponds to a distribution that is not (asymptotically) Pareto.

We show that this algorithm meets the desired properties. The following theorem states our main result by upper-bounding the extreme regret of EXTREMEHUNTER.

Theorem 2. Assume that the distributions of the arms are respectively $(\alpha_k, \beta_k, C_k, C')$-second order Pareto (see Definition 2) with $\min_k \alpha_k > 1$. If $n \ge Q$, the expected extreme regret of EXTREMEHUNTER is bounded from above as:

$$\mathbb{E}[R_n] \le L(nC_*)^{1/\alpha_*}\Big(\frac{K}{n}\log(n)^{(2b+1)/b} + n^{-\log(n)(1-1/\alpha_*)} + n^{-b/((b+1)\alpha_*)}\Big) = \mathbb{E}[G^*_n]\,o(1),$$

where L, Q > 0 are some constants depending only on $(\alpha_k, C_k)_k$, $C'$, and b (Section 6).

Theorem 2 states that the EXTREMEHUNTER strategy performs almost as well as the best (oracle) strategy, up to a term that is negligible when compared to the performance of the oracle strategy. Indeed, the regret is negligible when compared to $(nC_*)^{1/\alpha_*}$, which is the order of magnitude of the performance of the best oracle strategy $\mathbb{E}[G^*_n] = \max_{k\le K}\mathbb{E}[\max_{t\le n} X_{k,t}]$. Our algorithm thus detects the arm that has the heaviest tail.

For n large enough (as a function of $(\alpha_k, \beta_k, C_k)_k$, $C'$ and K), the first two terms in the regret become negligible when compared to the third one, and the regret is then bounded as:

$$\mathbb{E}[R_n] \le \mathbb{E}[G^*_n]\,O\big(n^{-b/((b+1)\alpha_*)}\big)$$

We make two observations: First, the larger the b, the tighter this bound is, since the model is then closer to the parametric case.
Second, a smaller α_* also tightens the bound, since the best arm is then very heavy-tailed and much easier to recognize.

6 Analysis

In this section, we prove the upper bound on the extreme regret of Algorithm 1 stated in Theorem 2. Before providing the detailed proof, we give a high-level overview and the intuitions.

In Step 1, we define the (favorable) high-probability event ξ of interest, useful for analyzing the mechanism of the bandit algorithm. In Step 2, given ξ, we bound the estimates of α_k and C_k, and use them to bound the main upper confidence bound. In Step 3, we upper-bound the number of pulls of each suboptimal arm: we prove that with high probability we do not pull them too often. This enables us to guarantee that the number of pulls of the optimal arm * is, on ξ, equal to n up to a negligible term. The final Step 4 of the proof is concerned with using this lower bound on the number of pulls of the optimal arm in order to lower-bound the expectation of the maximum of the collected samples. Such a step is typically straightforward in the classical (mean-optimizing) bandits by the linearity of the expectation. It is not straightforward in our setting. We therefore prove Lemma 2, in which we show that the expected value of the maximum of the samples on the favorable event ξ is not too far away from the one that we obtain without conditioning on ξ.

Step 1: High probability event. In this step, we define the favorable event ξ. We set $\delta \stackrel{\text{def}}{=} \exp(-\log^2 n)/(2nK)$ and consider the event ξ such that for any $k \le K$, $N \le T \le n$:

$$\Big|\frac{1}{\alpha_k} - \tilde h_k(T)\Big| \le D\sqrt{\log(1/\delta)}\;T^{-b/(2b+1)}, \qquad \big|C_k - \tilde C_k(T)\big| \le E\sqrt{\log(T/\delta)}\;T^{-b/(2b+1)},$$

where $\tilde h_k(T)$ and $\tilde C_k(T)$ are the estimates of $1/\alpha_k$ and $C_k$ respectively using the first T samples. Notice that they are not the same as $\hat h_{k,t}$ and $\hat C_{k,t}$, which are the estimates of the same quantities at time t for the algorithm, and thus with $T_{k,t}$ samples. The probability of ξ is larger than $1 - 2nK\delta$ by a union bound on (2) and (4).

Step 2: Bound on $B_{k,t}$. The following lemma holds on ξ for upper- and lower-bounding $B_{k,t}$.

Lemma 1 (proved in the appendix). On ξ, we have that for any $k \le K$ and for $T_{k,t} \ge N$:

$$(C_k n)^{1/\alpha_k}\,\Gamma\big(1-\tfrac{1}{\alpha_k}\big) \le B_{k,t} \le (C_k n)^{1/\alpha_k}\,\Gamma\big(1-\tfrac{1}{\alpha_k}\big)\Big(1 + F\log(n)\sqrt{\log(n/\delta)}\;T_{k,t}^{-b/(2b+1)}\Big) \qquad (6)$$

Step 3: Upper bound on the number of pulls of a suboptimal arm. We proceed by using the bounds on $B_{k,t}$ from the previous step to upper-bound the number of suboptimal pulls. Let * be the best arm. Assume that at round t, some arm k ≠ * is pulled. Then, by definition of the algorithm, $B_{*,t} \le B_{k,t}$, which implies by Lemma 1:

$$(C_* n)^{1/\alpha_*}\,\Gamma\big(1-\tfrac{1}{\alpha_*}\big) \le (C_k n)^{1/\alpha_k}\,\Gamma\big(1-\tfrac{1}{\alpha_k}\big)\Big(1 + F\log(n)\sqrt{\log(n/\delta)}\;T_{k,t}^{-b/(2b+1)}\Big)$$

Rearranging the terms, we get:

$$\frac{(C_* n)^{1/\alpha_*}\,\Gamma\big(1-\tfrac{1}{\alpha_*}\big)}{(C_k n)^{1/\alpha_k}\,\Gamma\big(1-\tfrac{1}{\alpha_k}\big)} - 1 \le F\log(n)\sqrt{\log(n/\delta)}\;T_{k,t}^{-b/(2b+1)} \qquad (7)$$

We now define $\Delta_k$, which is analogous to the gap in the classical bandits:

$$\Delta_k = \frac{(C_* n)^{1/\alpha_*}\,\Gamma\big(1-\tfrac{1}{\alpha_*}\big)}{(C_k n)^{1/\alpha_k}\,\Gamma\big(1-\tfrac{1}{\alpha_k}\big)} - 1$$

Since $T_{k,t} \le n$, (7) implies, for some problem-dependent constants G and G′ that depend only on $(\alpha_k, C_k)_k$, $C'$ and b, but are independent of δ, that:

$$T_{k,t} \le N + G'\Big(\frac{\log^2 n\,\log(n/\delta)}{\Delta_k^2}\Big)^{(2b+1)/(2b)} \le N + G\big(\log^2 n\,\log(n/\delta)\big)^{(2b+1)/(2b)}$$

This implies that the number $T^*$ of pulls of arm * is, with probability $1-\delta'$, at least

$$T^* \ge n - \sum_{k\ne *} G\big(\log^2 n\,\log(2nK/\delta')\big)^{(2b+1)/(2b)} - KN,$$

where $\delta' = 2nK\delta$. Since n is larger than

$$Q \ge 2KN + 2GK\big(\log^2 n\,\log(2nK/\delta')\big)^{(2b+1)/(2b)},$$

we have that $T^* \ge n/2$ as a corollary.

Step 4: Bound on the expectation. We start by lower-bounding the expected gain:

$$\mathbb{E}[G_n] = \mathbb{E}\Big[\max_{t\le n} X_{I_t,t}\Big] \ge \mathbb{E}\Big[\max_{t\le n} X_{I_t,t}\,\mathbb{1}\{\xi\}\Big] \ge \mathbb{E}\Big[\max_{t\le T^*} X_{*,t}\,\mathbb{1}\{\xi\}\Big]$$

The next lemma links the expectation of $\max_{t\le T^*} X_{*,t}$ with the expectation of $\max_{t\le T^*} X_{*,t}\,\mathbb{1}\{\xi\}$.

Lemma 2 (proved in the appendix). Let $X_1, \dots, X_T$ be i.i.d. samples from an (α, β, C, C′)-second order Pareto distribution F. Let ξ′ be an event of probability larger than 1 − δ. Then for δ < 1/2 and for $T \ge Q$ large enough so that $c\max\big(1/T, 1/T^{\beta}\big) \le 1/4$ for a given constant c > 0 that depends only on C, C′ and β, and also for $T \ge \log(2)\max\big(C(2C')^{1/\beta}, 8\log(2)\big)$:

$$\mathbb{E}\Big[\max_{t\le T} X_t\,\mathbb{1}\{\xi'\}\Big] \ge (TC)^{1/\alpha}\,\Gamma\big(1-\tfrac{1}{\alpha}\big) - \Big(4 + \frac{8}{\alpha-1}\Big)(TC)^{1/\alpha}\,\delta^{1-1/\alpha} - 2\Big(\frac{4D_2}{T}(TC)^{1/\alpha} + \frac{2C'D_{1+\beta}}{C^{1+\beta}T^{\beta}}(TC)^{1/\alpha} + B\Big).$$

Since n is large enough so that $2n^2K\delta' = 2n^2K\exp(-\log^2 n) \le 1/2$, where $\delta' = \exp(-\log^2 n)$, and the probability of ξ is larger than $1-\delta'$, we can use Lemma 2 for the optimal arm:

$$\mathbb{E}\Big[\max_{t\le T^*} X_{*,t}\,\mathbb{1}\{\xi\}\Big] \ge (T^*C_*)^{1/\alpha_*}\Big[\Gamma\big(1-\tfrac{1}{\alpha_*}\big) - \Big(4+\frac{8}{\alpha_*-1}\Big)\delta'^{\,1-1/\alpha_*} - \frac{8D_2}{T^*} - \frac{4C'D_{\max}}{(C_*)^{1+b}(T^*)^{b}} - \frac{2B}{(T^*C_*)^{1/\alpha_*}}\Big],$$

where $D_{\max} \stackrel{\text{def}}{=} \max_i D_{1+\beta_i}$. Using Step 3, we bound the above with a function of n. In particular, we lower-bound the last three terms in the brackets using $T^* \ge n/2$, and the $(T^*C_*)^{1/\alpha_*}$ factor as:

$$(T^*C_*)^{1/\alpha_*} \ge (nC_*)^{1/\alpha_*}\Big(1 - \frac{GK\big(\log^2 n\,\log(2n^2K/\delta')\big)^{(2b+1)/(2b)}}{n} - \frac{KN}{n}\Big)$$

We are now ready to relate the lower bound on the gain of EXTREMEHUNTER with the upper bound on the gain of the optimal policy (Theorem 1), which brings us the upper bound for the regret:

$$\mathbb{E}[R_n] = \mathbb{E}[G^*_n] - \mathbb{E}[G_n] \le \mathbb{E}[G^*_n] - \mathbb{E}\Big[\max_{t\le T^*} X_{*,t}\,\mathbb{1}\{\xi\}\Big] \le H(nC_*)^{1/\alpha_*}\Big(\frac{KN}{n} + \frac{GK\big(\log^2 n\,\log(2n^2K/\delta')\big)^{(2b+1)/(2b)}}{n} + \delta'^{\,1-1/\alpha_*} + \frac{1}{n} + \frac{B}{(nC_*)^{1/\alpha_*}}\Big),$$

where H is a constant that depends on $(\alpha_k, C_k)_k$, $C'$, and b. To bound the last term, we use the definition of B (9) to get the $n^{-\beta_*/((\beta_*+1)\alpha_*)}$ term, upper-bounded by $n^{-b/((b+1)\alpha_*)}$ as $b \le \beta_*$. Notice that this final term also eats up the $n^{-1}$ and $n^{-b}$ terms, since $b/((b+1)\alpha_*) \le \min(1, b)$.

We finish by using $\delta' = \exp(-\log^2 n)$ and grouping the problem-dependent constants into L to get the final upper bound:

$$\mathbb{E}[R_n] \le L(nC_*)^{1/\alpha_*}\Big(\frac{K}{n}\log(n)^{(2b+1)/b} + n^{-\log(n)(1-1/\alpha_*)} + n^{-b/((b+1)\alpha_*)}\Big)$$

Figure 1: Extreme regret as a function of time for the exact Pareto distributions (left), the approximate Pareto distributions (middle), and the network traffic data (right).

7 Experiments

In this section, we empirically evaluate EXTREMEHUNTER on synthetic and real-world data. The measure of our evaluation is the extreme regret from Definition 1.
Notice that even though we evaluate the regret as a function of time T, the extreme regret is not cumulative; it is more in the spirit of the simple regret [5]. We compare our EXTREMEHUNTER with THRESHOLDASCENT [20]. Moreover, we also compare to the classical UCB [3], as an example of an algorithm that aims for the arm with the highest mean as opposed to the heaviest tail. When the distribution of a single arm has both the highest mean and the heaviest tail, both EXTREMEHUNTER and UCB are expected to perform the same with respect to the extreme regret. In the light of Remark 2, we set b = 1 to consider a wide class of distributions.

Exact Pareto Distributions. In the first experiment, we consider K = 3 arms with the distributions $P_k(x) = 1 - x^{-\alpha_k}$, where α = [5, 1.1, 2]. Therefore, the most heavy-tailed distribution is associated with the arm k = 2. Figure 1 (left) displays the averaged result of 1000 simulations with the time horizon T = 10⁴. We observe that EXTREMEHUNTER eventually keeps allocating most of the pulls to the arm of interest. Since in this case the arm with the heaviest tail is also the arm with the largest mean, UCB also performs well and is even able to detect the best arm earlier. THRESHOLDASCENT, on the other hand, was not always able to allocate the pulls properly in 10⁴ steps. This may be due to the discretization of the rewards that this algorithm is using.

Approximate Pareto Distributions. For the exact Pareto distributions, the smaller the tail index, the higher the mean, and even UCB obtains a good performance. However, this is no longer necessarily the case for approximate Pareto distributions. For this purpose, we perform a second experiment where we mix an exact Pareto distribution with a Dirac distribution at 0. We consider K = 3 arms.
Two of the arms follow exact Pareto distributions with α1 = 1.5 and α3 = 3. The second arm, on the other hand, is a mixture with weight 0.2 on an exact Pareto distribution with α2 = 1.1 and weight 0.8 on a Dirac distribution at 0. In this setting, the second arm is the most heavy-tailed, but the first arm has the largest mean. Figure 1 (middle) shows the result. We see that UCB performs worse since it eventually focuses on the arm with the largest mean. THRESHOLDASCENT performs better than UCB but not as well as EXTREMEHUNTER.

Computer Network Traffic Data In this experiment, we evaluate EXTREMEHUNTER on heavy-tailed network traffic data collected from user laptops in an enterprise environment [2]. The objective is to allocate the sampling capacity among the computer nodes (arms) in order to find the largest outbursts of network activity. This information then serves the IT department to further investigate the source of the extreme network traffic. For each arm, a sample at time t corresponds to the number of network activity events in 4 consecutive seconds. Specifically, the network events are the starting times of packet flows. In this experiment, we selected K = 5 laptops (arms) for which the recorded sequences were long enough. Figure 1 (right) shows that EXTREMEHUNTER again outperforms both THRESHOLDASCENT and UCB.

Acknowledgements We would like to thank John Mark Agosta and Jennifer Healey for the network traffic data.
The research presented in this paper was supported by Intel Corporation, by the French Ministry of Higher Education and Research, and by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 270327 (CompLACS).

References

[1] Naoki Abe, Bianca Zadrozny, and John Langford. Outlier Detection by Active Learning. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 504–509, 2006.

[2] John Mark Agosta, Jaideep Chandrashekar, Mark Crovella, Nina Taft, and Daniel Ting. Mixture models of endhost network traffic. In IEEE Proceedings of INFOCOM, pages 225–229.

[3] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47(2-3):235–256, 2002.

[4] Sébastien Bubeck, Nicolò Cesa-Bianchi, and Gábor Lugosi. Bandits With Heavy Tail. Information Theory, IEEE Transactions on, 59(11):7711–7717, 2013.

[5] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure Exploration in Multi-armed Bandits Problems. Algorithmic Learning Theory, pages 23–37, 2009.

[6] Alexandra Carpentier and Arlene K. H. Kim. Adaptive and minimax optimal estimation of the tail coefficient. Statistica Sinica, 2014.

[7] Alexandra Carpentier and Arlene K. H. Kim.
Honest and adaptive confidence interval for the tail coefficient in the Pareto model. Electronic Journal of Statistics, 2014.

[8] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41(3):15:1–15:58, July 2009.

[9] Vincent A. Cicirello and Stephen F. Smith. The max k-armed bandit: A new model of exploration applied to search heuristic selection. AAAI Conference on Artificial Intelligence, 2005.

[10] Laurens de Haan and Ana Ferreira. Extreme Value Theory: An Introduction. Springer Series in Operations Research and Financial Engineering. Springer, 2006.

[11] Ronald Aylmer Fisher and Leonard Henry Caleb Tippett. Limiting forms of the frequency distribution of the largest or smallest member of a sample. Mathematical Proceedings of the Cambridge Philosophical Society, 24:180, 1928.

[12] Boris Gnedenko. Sur la distribution limite du terme maximum d'une série aléatoire. The Annals of Mathematics, 44(3):423–453, 1943.

[13] Peter Hall and Alan H. Welsh. Best Attainable Rates of Convergence for Estimates of Parameters of Regular Variation. The Annals of Statistics, 12(3):1079–1084, 1984.

[14] Tze L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[15] Keqin Liu and Qing Zhao. Dynamic Intrusion Detection in Resource-Constrained Cyber Networks. In IEEE International Symposium on Information Theory Proceedings, 2012.

[16] Markos Markou and Sameer Singh. Novelty detection: a review, part 1: statistical approaches. Signal Processing, 83(12):2481–2497, 2003.

[17] Daniel B. Neill and Gregory F. Cooper. A multivariate Bayesian scan statistic for early event detection and characterization. Machine Learning, 79:261–282, 2010.

[18] Carey E. Priebe, John M. Conroy, David J. Marchette, and Youngser Park. Scan Statistics on Enron Graphs.
In Computational and Mathematical Organization Theory, volume 11, pages 229–247, 2005.

[19] Ingo Steinwart, Don Hush, and Clint Scovel. A Classification Framework for Anomaly Detection. Journal of Machine Learning Research, 6:211–232, 2005.

[20] Matthew J. Streeter and Stephen F. Smith. A Simple Distribution-Free Approach to the Max k-Armed Bandit Problem. In Principles and Practice of Constraint Programming, volume 4204, pages 560–574, 2006.

[21] Matthew J. Streeter and Stephen F. Smith. An Asymptotically Optimal Algorithm for the Max k-Armed Bandit Problem. In AAAI Conference on Artificial Intelligence, pages 135–142, 2006.

[22] Ryan Turner, Zoubin Ghahramani, and Steven Bottone. Fast online anomaly detection using scan statistics. IEEE Workshop on Machine Learning for Signal Processing, 2010.