{"title": "Fighting Bandits with a New Kind of Smoothness", "book": "Advances in Neural Information Processing Systems", "page_first": 2197, "page_last": 2205, "abstract": "We focus on the adversarial multi-armed bandit problem. The EXP3 algorithm of Auer et al. (2003) was shown to have a regret bound of $O(\\sqrt{T N \\log N})$, where $T$ is the time horizon and $N$ is the number of available actions (arms). More recently, Audibert and Bubeck (2009) improved the bound by a logarithmic factor via an entirely different method. In the present work, we provide a new set of analysis tools, using the notion of convex smoothing, to provide several novel algorithms with optimal guarantees. First we show that regularization via the Tsallis entropy matches the minimax rate of Audibert and Bubeck (2009) with an even tighter constant; it also fully generalizes EXP3. Second we show that a wide class of perturbation methods lead to near-optimal bandit algorithms as long as a simple condition on the perturbation distribution $\\mathcal{D}$ is met: one needs that the hazard function of $\\mathcal{D}$ remain bounded. The Gumbel, Weibull, Frechet, Pareto, and Gamma distributions all satisfy this key property; interestingly, the Gaussian and Uniform distributions do not.", "full_text": "Fighting Bandits with a New Kind of Smoothness\n\nJacob Abernethy\n\nUniversity of Michigan\n\njabernet@umich.edu\n\nChansoo Lee\n\nUniversity of Michigan\n\nchansool@umich.edu\n\nAmbuj Tewari\n\nUniversity of Michigan\ntewaria@umich.edu\n\nAbstract\n\nWe provide a new analysis framework for the adversarial multi-armed bandit\nproblem. Using the notion of convex smoothing, we de\ufb01ne a novel family of\nalgorithms with minimax optimal regret guarantees. First, we show that regular-\nization via the Tsallis entropy, which includes EXP3 as a special case, matches\nthe O(pN T ) minimax regret with a smaller constant factor. 
Second, we show that a wide class of perturbation methods achieve a near-optimal regret as low as O(√(NT log N)), as long as the perturbation distribution has a bounded hazard function. For example, the Gumbel, Weibull, Frechet, Pareto, and Gamma distributions all satisfy this key property and lead to near-optimal algorithms.

1 Introduction

The classic multi-armed bandit (MAB) problem, generally attributed to the early work of Robbins (1952), poses a generic online decision scenario in which an agent must make a sequence of choices from a fixed set of options. After each decision is made, the agent receives some feedback in the form of a loss (or gain) associated with her choice, but no information is provided on the outcomes of alternative options. The agent's goal is to minimize the total loss over time, and the agent is thus faced with the balancing act of both experimenting with the menu of choices while also utilizing the data gathered in the process to improve her decisions. The MAB framework is not only mathematically elegant, but useful for a wide range of applications including medical experiment design (Gittins, 1996), automated poker playing strategies (Van den Broeck et al., 2009), and hyperparameter tuning (Pacula et al., 2012).

Early MAB results relied on stochastic assumptions (e.g., IID) on the loss sequence (Auer et al., 2002; Gittins et al., 2011; Lai and Robbins, 1985). As researchers began to establish non-stochastic, worst-case guarantees for sequential decision problems such as prediction with expert advice (Littlestone and Warmuth, 1994), a natural question arose as to whether similar guarantees were possible for the bandit setting. The pioneering work of Auer, Cesa-Bianchi, Freund, and Schapire (2003) answered this in the affirmative by showing that their algorithm EXP3 possesses nearly-optimal regret bounds with matching lower bounds. 
Attention later turned to the bandit version of online linear optimization, and several associated guarantees were published in the following decade (Abernethy et al., 2012; Dani and Hayes, 2006; Dani et al., 2008; Flaxman et al., 2005; McMahan and Blum, 2004).

Nearly all proposed methods have relied on a particular algorithmic blueprint: they reduce the bandit problem to the full-information setting, while using randomization to make decisions and to estimate the losses. A well-studied family of algorithms for the full-information setting is Follow the Regularized Leader (FTRL), which optimizes an objective of the following form:

arg min_{x ∈ K} ⟨L, x⟩ + R(x)   (1)

where K is the decision set, L is (an estimate of) the cumulative loss vector, and R is a regularizer, a convex function with suitable curvature to stabilize the objective. The choice of regularizer R is critical to the algorithm's performance. For example, the EXP3 algorithm (Auer, 2003) regularizes with the entropy function and achieves a nearly optimal regret bound when K is the probability simplex. For a general convex set, however, other regularizers such as self-concordant barrier functions (Abernethy et al., 2012) have tighter regret bounds.

Another class of algorithms for the full-information setting is Follow the Perturbed Leader (FTPL) (Kalai and Vempala, 2005), whose foundations date back to the earliest work in adversarial online learning (Hannan, 1957). Here we choose a distribution D on R^N, sample a random vector Z ∼ D, and solve the following linear optimization problem:

arg min_{x ∈ K} ⟨L + Z, x⟩.   (2)

FTPL is computationally simpler than FTRL due to the linearity of the objective, but it is analytically much more complex due to the randomness. For every different choice of D, an entirely new set of techniques had to be developed (Devroye et al., 2013; Van Erven et al., 2014). Rakhlin et al. (2012) and Abernethy et al. 
(2014) made some progress towards unifying the analysis framework. Their techniques, however, are limited to the full-information setting.

In this paper, we propose a new analysis framework for the multi-armed bandit problem that unifies the regularization and perturbation algorithms. The key element is a new kind of smoothness property, which we call differential consistency. It allows us to generate a wide class of both optimal and near-optimal algorithms for the adversarial multi-armed bandit problem. We summarize our main results:

1. We show that regularization via the Tsallis entropy leads to the state-of-the-art adversarial MAB algorithm, matching the minimax regret rate of Audibert and Bubeck (2009) with a tighter constant. Interestingly, our algorithm fully generalizes EXP3.

2. We show that a wide array of well-studied noise distributions lead to near-optimal regret bounds (matching those of EXP3). Furthermore, our analysis reveals a strikingly simple and appealing sufficient condition for achieving O(√T) regret: the hazard rate function of the noise distribution must be bounded by a constant. We conjecture that this requirement is in fact both necessary and sufficient.

2 Gradient-Based Prediction Algorithms for the Multi-Armed Bandit

Let us now introduce the adversarial multi-armed bandit problem. On each round t = 1, . . . , T, a learner must choose a distribution p_t ∈ Δ_N over the set of N available actions. The adversary (Nature) chooses a vector g_t ∈ [−1, 0]^N of losses, the learner samples i_t ∼ p_t, and plays action i_t. After selecting this action, the learner observes only the value g_{t,i_t}, and receives no information as to the values g_{t,j} for j ≠ i_t. This limited-information feedback is what makes the bandit problem much more challenging than the full-information setting in which the entire g_t is observed. The learner's goal is to minimize the regret. 
Regret is defined to be the difference between the realized loss and the loss of the best fixed action in hindsight:

Regret_T := max_{i ∈ [N]} Σ_{t=1}^T (g_{t,i} − g_{t,i_t}).   (3)

To be precise, we consider the expected regret, where the expectation is taken with respect to the learner's randomization.

Loss vs. Gain Note: We use the term "loss" to refer to g, although the maximization in (3) would imply that g should be thought of as a "gain" instead. We use the former term, however, as we impose the assumption that g_t ∈ [−1, 0]^N throughout the paper.

2.1 The Gradient-Based Algorithmic Template

Our results focus on a particular algorithmic template described in Framework 1, which is a slight variation of the Gradient-Based Prediction Algorithm (GBPA) of Abernethy et al. (2014). Note that the algorithm (i) maintains an unbiased estimate Ĝ_t of the cumulative losses, (ii) updates Ĝ_t by adding a single-round estimate ĝ_t that has only one non-zero coordinate, and (iii) uses the gradient of a convex function Φ̃ as the sampling distribution p_t. The choice of Φ̃ is flexible, but Φ̃ must be a differentiable convex function and its gradient must always be a probability distribution.

Framework 1 may appear restrictive, but it has served as the basis for much of the published work on adversarial MAB algorithms (Auer et al., 2003; Kujala and Elomaa, 2005; Neu and Bartók, 2013). First, the GBPA framework essentially encompasses all FTRL and FTPL algorithms (Abernethy et al., 2014), which are the core techniques not only for the full-information settings but also for the bandit settings. Second, the estimation scheme ensures that Ĝ_t remains an unbiased estimate of G_t. 
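To make the template concrete, here is a minimal Python sketch of the GBPA loop just described. This is our own illustration, not the paper's implementation; the function names (gbpa, softmax_grad) are hypothetical, and the softmax gradient of the Shannon-conjugate potential is used as one admissible choice of ∇Φ̃:

```python
import numpy as np

def softmax_grad(G, eta=0.1):
    """Gradient of the log-sum-exp potential: a valid sampling distribution."""
    w = np.exp(eta * (G - G.max()))
    return w / w.sum()

def gbpa(losses, grad_potential, rng):
    """GBPA template: sample i_t from p = grad_potential(G_hat), observe one
    loss, and build the importance-weighted single-round estimate g_hat."""
    T, N = losses.shape
    G_hat = np.zeros(N)                  # cumulative loss estimate
    total = 0.0
    for t in range(T):
        p = grad_potential(G_hat)        # sampling distribution p_t
        i = rng.choice(N, p=p)           # play arm i_t
        total += losses[t, i]
        g_hat = np.zeros(N)
        g_hat[i] = losses[t, i] / p[i]   # inverse-probability scaling
        G_hat += g_hat                   # unbiased: E[g_hat] = losses[t]
    return total

rng = np.random.default_rng(0)
losses = -rng.random((100, 5))           # adversary's g_t in [-1, 0]^N
total = gbpa(losses, softmax_grad, rng)
```

Note that only the played coordinate of g_hat is non-zero, exactly as in the estimation step of Framework 1.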
Although there is some flexibility, any unbiased estimation scheme must involve some kind of inverse-probability scaling: information theory tells us that unbiased estimates of a quantity that is observed with only probability p must necessarily involve fluctuations that scale as O(1/p).

Framework 1: Gradient-Based Prediction Algorithm (GBPA) Template for the Multi-Armed Bandit

GBPA(Φ̃): Φ̃ is a differentiable convex function such that ∇Φ̃ ∈ Δ_N and ∇_iΦ̃ > 0 for all i.
Initialize Ĝ_0 = 0.
for t = 1 to T do
  Nature: A loss vector g_t ∈ [−1, 0]^N is chosen by the Adversary
  Sampling: Learner chooses i_t according to the distribution p(Ĝ_{t−1}) = ∇Φ̃(Ĝ_{t−1})
  Cost: Learner "gains" loss g_{t,i_t}
  Estimation: Learner "guesses" ĝ_t := (g_{t,i_t} / p_{i_t}(Ĝ_{t−1})) e_{i_t}
  Update: Ĝ_t = Ĝ_{t−1} + ĝ_t

Lemma 2.1. Define Φ(G) ≡ max_i G_i, so that the expected regret of GBPA(Φ̃) can be written as

ERegret_T = Φ(G_T) − Σ_{t=1}^T ⟨∇Φ̃(Ĝ_{t−1}), g_t⟩.

Then the expected regret of the GBPA(Φ̃) can be bounded as

ERegret_T ≤ [Φ̃(0) − Φ(0)] + E_{i_1,...,i_T}[Φ(Ĝ_T) − Φ̃(Ĝ_T)] + Σ_{t=1}^T E_{i_t}[D_Φ̃(Ĝ_t, Ĝ_{t−1}) | Ĝ_{t−1}],   (4)

where the three terms are the overestimation, underestimation, and divergence penalties respectively, and the expectations are over the sampling of i_t.

Proof. Let Φ̃ be a valid convex function for the GBPA. Consider GBPA(Φ̃) being run on the loss sequence g_1, . . . , g_T. The algorithm produces a sequence of estimated losses ĝ_1, . . . , ĝ_T. Now consider GBPA-NE(Φ̃), which is GBPA(Φ̃) run with full information on the deterministic loss sequence ĝ_1, . . . , ĝ_T (there is no estimation step, and the learner updates Ĝ_t directly). 
The regret of this run can be written as

Φ(Ĝ_T) − Σ_{t=1}^T ⟨∇Φ̃(Ĝ_{t−1}), ĝ_t⟩,

and Φ(G_T) ≤ EΦ(Ĝ_T) by the convexity of Φ. Hence, it suffices to show that GBPA-NE(Φ̃) has regret at most the right-hand side of Equation 4, which is a fairly well-known result in the online learning literature; see, for example, (Cesa-Bianchi and Lugosi, 2006, Theorem 11.6) or (Abernethy et al., 2014, Section 2). For completeness, we include the full proof in Appendix A.

2.2 A New Kind of Smoothness

What has emerged as a guiding principle throughout machine learning is that enforcing stability of an algorithm can often lead immediately to performance guarantees: small modifications of the input data should not dramatically alter the output. In the context of the GBPA, algorithmic stability is guaranteed as long as the derivative ∇Φ̃(·) is Lipschitz. Abernethy et al. (2014) explored a set of conditions on ∇²Φ̃(·) that lead to optimal regret guarantees for the full-information setting. Indeed, that work discussed different settings where the regret depends on an upper bound on either the nuclear norm or the operator norm of this Hessian.

In short, regret in the full-information setting relies on the smoothness of the choice of Φ̃. In the bandit setting, however, merely a uniform bound on the magnitude of ∇²Φ̃ is insufficient to guarantee low regret; the regret (Lemma 2.1) involves terms of the form D_Φ̃(Ĝ_{t−1} + ĝ_t, Ĝ_{t−1}), where the incremental quantity ĝ_t can scale as large as the inverse of the smallest probability in p(Ĝ_{t−1}). What is needed is a stronger notion of smoothness that bounds ∇²Φ̃ in correspondence with ∇Φ̃, and we propose the following definition:

Definition 2.2 (Differential Consistency). 
For constants β, C > 0, we say that a convex function Φ̃(·) is (β, C)-differentially-consistent if for all G ∈ (−∞, 0]^N,

∇²_{ii}Φ̃(G) ≤ C (∇_iΦ̃(G))^β.

We now prove a useful bound that emerges from differential consistency, and in the following two sections we shall show how this leads to regret guarantees.

Theorem 2.3. Suppose Φ̃ is (β, C)-differentially-consistent for constants C, β > 0. Then the divergence penalty at time t in Lemma 2.1 can be upper bounded as:

E_{i_t}[D_Φ̃(Ĝ_t, Ĝ_{t−1}) | Ĝ_{t−1}] ≤ C Σ_{i=1}^N (∇_iΦ̃(Ĝ_{t−1}))^{β−1}.

Proof. For the sake of clarity, we drop the subscripts; we use Ĝ to denote the cumulative estimate Ĝ_{t−1}, ĝ to denote the marginal estimate ĝ_t = Ĝ_t − Ĝ_{t−1}, and g to denote the true loss g_t. Note that by the definition of Framework 1, ĝ is a sparse vector with one non-zero and non-positive coordinate ĝ_{i_t} = g_{t,i_t}/∇_{i_t}Φ̃(Ĝ), and it is conditionally independent given Ĝ. For a fixed i_t, let

h(r) := D_Φ̃(Ĝ + r ĝ/‖ĝ‖, Ĝ) = D_Φ̃(Ĝ − r e_{i_t}, Ĝ),

so that h''(r) = (ĝ/‖ĝ‖)ᵀ ∇²Φ̃(Ĝ + r ĝ/‖ĝ‖) (ĝ/‖ĝ‖) = e_{i_t}ᵀ ∇²Φ̃(Ĝ − r e_{i_t}) e_{i_t}. Now we can write

E_{i_t}[D_Φ̃(Ĝ + ĝ, Ĝ) | Ĝ]
 = Σ_{i=1}^N P[i_t = i] ∫_0^{‖ĝ‖} ∫_0^s h''(r) dr ds
 = Σ_{i=1}^N ∇_iΦ̃(Ĝ) ∫_0^{‖ĝ‖} ∫_0^s e_iᵀ ∇²Φ̃(Ĝ − r e_i) e_i dr ds
 ≤ Σ_{i=1}^N ∇_iΦ̃(Ĝ) ∫_0^{‖ĝ‖} ∫_0^s C (∇_iΦ̃(Ĝ − r e_i))^β dr ds
 ≤ Σ_{i=1}^N ∇_iΦ̃(Ĝ) ∫_0^{‖ĝ‖} ∫_0^s C (∇_iΦ̃(Ĝ))^β dr ds
 = C Σ_{i=1}^N (∇_iΦ̃(Ĝ))^{1+β} (‖ĝ‖²/2), where ‖ĝ‖ = |g_i|/∇_iΦ̃(Ĝ) when i_t = i, hence
 = (C/2) Σ_{i=1}^N (∇_iΦ̃(Ĝ))^{β−1} g_i²
 ≤ C Σ_{i=1}^N (∇_iΦ̃(Ĝ))^{β−1}, using g_i² ≤ 1.

The first inequality is by the supposition and the second inequality is due to the convexity of Φ̃, which guarantees that ∇_iΦ̃ is an increasing function in the i-th coordinate. Interestingly, this part of the proof critically depends on the fact that we are in the "loss" setting where g is always non-positive.

3 A Minimax Bandit Algorithm via Tsallis Smoothing

The design of a multi-armed bandit algorithm in the adversarial setting proved to be a challenging task. Ignoring the dependence on N for the moment, we note that the initial published work on EXP3 provided only an O(T^{2/3}) guarantee (Auer et al., 1995), and it was not until the final version of this work (Auer et al., 2003) that the authors obtained the optimal O(√T) rate. For the more general setting of online linear optimization, several sub-optimal rates were achieved (Dani and Hayes, 2006; Flaxman et al., 2005; McMahan and Blum, 2004) before the desired √T was obtained (Abernethy et al., 2012; Dani et al., 2008).

We can view EXP3 as an instance of the GBPA where the potential function Φ̃(·) is the Fenchel conjugate of the Shannon entropy. For any p ∈ Δ_N, the (negative) Shannon entropy is defined as H(p) := Σ_i p_i log p_i, and its conjugate is H*(G) = sup_{p ∈ Δ_N} {⟨p, G⟩ − η H(p)}. In fact, we have a closed-form expression for the supremum: H*(G) = (1/η) log(Σ_i exp(η G_i)). By inspecting the gradient of this expression, it is easy to see that EXP3 chooses the distribution p_t = ∇H*(G) every round.

The tighter EXP3 bound given by Auer et al. (2003) scales as O(√(TN log N)), and the authors provided a matching lower bound of the form Ω(√(TN)). 
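As a small numerical illustration of our own (not from the paper), the closed-form conjugate and its softmax gradient can be written down directly, and one can check by finite differences that this potential is (1, η)-differentially-consistent in the sense of Definition 2.2, since ∇²_{ii}H*(G) = η p_i(1 − p_i) ≤ η ∇_iH*(G). The function names below are hypothetical:

```python
import numpy as np

def H_star(G, eta=0.5):
    # (1/eta) * log(sum_i exp(eta * G_i)), computed stably
    m = G.max()
    return m + np.log(np.exp(eta * (G - m)).sum()) / eta

def grad_H_star(G, eta=0.5):
    # The EXP3 sampling distribution: softmax of eta * G
    w = np.exp(eta * (G - G.max()))
    return w / w.sum()

# Finite-difference check of (beta=1, C=eta)-differential consistency:
# the i-th diagonal Hessian entry should not exceed eta * (i-th gradient).
eta, h = 0.5, 1e-4
G = -np.random.default_rng(1).random(6)      # a loss vector in (-1, 0]^6
p = grad_H_star(G, eta)
for i in range(6):
    e = np.zeros(6); e[i] = h
    second = (H_star(G + e, eta) - 2 * H_star(G, eta) + H_star(G - e, eta)) / h**2
    assert second <= eta * p[i] + 1e-4       # small tolerance for FD noise
```

The check passes with room to spare, since the slack η p_i² is far larger than the finite-difference error.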
It remained an open question for some time whether there exists a minimax optimal algorithm without the log term, until Audibert and Bubeck (2009) proposed the Implicitly Normalized Forecaster (INF). The INF is implicitly defined via a specially-designed potential function with certain properties. It was not immediately clear from this result how to define a minimax-optimal algorithm using the now-standard tools of regularization and Bregman divergence.

More recently, Audibert et al. (2011) improved upon Audibert and Bubeck (2009), extending the results to the combinatorial setting, and they also discovered that INF can be interpreted in terms of Bregman divergences. We give here a reformulation of INF that leads to a very simple analysis in terms of our notion of differential consistency. Our reformulation can be viewed as a variation of EXP3, where the key modification is to replace the Shannon entropy function with the Tsallis entropy (more precisely, the negative Tsallis entropy according to its original definition) for a parameter 0 < α < 1:

S_α(p) = (1/(1 − α)) (1 − Σ_i p_i^α).

This particular function, proposed by Tsallis (1988), possesses a number of natural properties. The Tsallis entropy is in fact a generalization of the Shannon entropy, as one obtains the latter as a limiting case of the former. That is, it is easy to prove the following uniform convergence:

S_α(·) → H(·) as α → 1.

We emphasize again that one can easily show that the Tsallis-smoothing bandit algorithm is indeed identical to INF under the appropriate parameter mapping, although our analysis is simpler due to the notion of differential consistency (Definition 2.2).

Theorem 3.1. Let Φ̃(G) = max_{p ∈ Δ_N} {⟨p, G⟩ − η S_α(p)}. Then the GBPA(Φ̃) has regret at most

ERegret ≤ η (N^{1−α} − 1)/(1 − α) + N^α T/(η α).   (5)

Before proving the theorem, we note that it immediately recovers the EXP3 upper bound in the special case α → 1. An easy application of L'Hôpital's rule shows that as α → 1, (N^{1−α} − 1)/(1 − α) → log N and N^α/α → N. Choosing η = √((N log N)/T), we see that the right-hand side of (5) tends to 2√(TN log N). However, the choice α → 1 is clearly not optimal, as we show in the following statement, which follows directly from the theorem once we see that N^{1−α} − 1 < N^{1−α}.

Corollary 3.2. For any α ∈ (0, 1), if we choose η = √((1 − α) T / (α N^{1−2α})), then we have

ERegret ≤ 2 √(NT / (α(1 − α))).

In particular, the choice α = 1/2 gives a regret of no more than 4√(NT).

Proof of Theorem 3.1. We will bound each penalty term in Lemma 2.1. Since S_α is non-positive, the underestimation penalty is upper bounded by 0, and the overestimation penalty is at most η(−min S_α). The minimum of S_α occurs at (1/N, . . . , 1/N). Hence,

(overestimation penalty) ≤ −η S_α(1/N, . . . , 1/N) = η (N^{1−α} − 1)/(1 − α).   (6)

Now it remains to upper bound the divergence penalty by N^α T/(η α), i.e., by (ηα)^{−1} N^α per round. Straightforward calculus gives ∇²(η S_α)(p) = ηα diag(p_1^{α−2}, . . . , p_N^{α−2}). Let I_{Δ_N}(·) be the indicator function of Δ_N; that is, I_{Δ_N}(x) = 0 for x ∈ Δ_N and I_{Δ_N}(x) = +∞ for x ∉ Δ_N. It is clear that Φ̃(·) is the dual of the function η S_α(·) + I_{Δ_N}(·), and moreover we observe that ∇²(η S_α)(p) is a sub-Hessian of η S_α(·) + I_{Δ_N}(·) at p(G), following the setup of Penot (1994). Taking advantage of Proposition 3.2 in the latter reference, we conclude that the inverse of ∇²(η S_α)(p(G)) is a super-Hessian of Φ̃ at G. Hence,

∇²_{ii}Φ̃(G) ≤ (ηα)^{−1} p_i^{2−α}(G)

for any G and every i. What we have stated, indeed, is that Φ̃ is (2 − α, (ηα)^{−1})-differentially-consistent, and thus applying Theorem 2.3 gives

D_Φ̃(Ĝ_t, Ĝ_{t−1}) ≤ (ηα)^{−1} Σ_{i=1}^N (p_i(Ĝ_{t−1}))^{1−α}.

Noting that the (1/α)-norm and the (1/(1 − α))-norm are dual to each other, we can apply Hölder's inequality to any probability distribution p_1, . . . , p_N to obtain

Σ_{i=1}^N p_i^{1−α} = Σ_{i=1}^N p_i^{1−α} · 1 ≤ (Σ_{i=1}^N p_i)^{1−α} (Σ_{i=1}^N 1)^α = N^α.

So the divergence penalty is at most (ηα)^{−1} N^α per round, which completes the proof.

4 Near-Optimal Bandit Algorithms via Stochastic Smoothing

Let D be a continuous distribution with unbounded support, probability density function f, and cumulative distribution function F. Consider the GBPA(Φ̃(G; D)) where

Φ̃(G; D) = E_{Z_1,...,Z_N i.i.d. ∼ D} [max_i {G_i + Z_i}],

which is a stochastic smoothing of the Φ(G) = max_i G_i function. Since the max function is convex, Φ̃ is also convex. By Bertsekas (1973), we can swap the order of differentiation and expectation:

∇Φ̃(G; D) = E_{Z_1,...,Z_N i.i.d. ∼ D} [e_{i*}], where i* = arg max_{i=1,...,N} {G_i + Z_i}.   (7)

Even if the function inside the expectation is not differentiable everywhere, the swap is still valid for any subgradient as long as the subgradients are bounded. Hence, ties between coordinates (which happen with probability zero anyway) can be resolved in an arbitrary manner. It is clear that ∇Φ̃ lies in the probability simplex, and note that

∂Φ̃/∂G_i = E_{Z_1,...,Z_N} [1{G_i + Z_i > G_j + Z_j, ∀j ≠ i}] = E_{G̃_{j*}} [P_{Z_i}(Z_i > G̃_{j*} − G_i)] = E_{G̃_{j*}} [1 − F(G̃_{j*} − G_i)],   (8)

where G̃_{j*} = max_{j≠i} G_j + Z_j. 
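A Monte Carlo approximation makes Equation 7 concrete: drawing the perturbations repeatedly and recording argmax frequencies estimates ∇Φ̃(G; D), which is indeed a probability distribution over the arms. The sketch below is our own illustration (the function name is hypothetical, and Gumbel noise is just one admissible unbounded-support choice of D):

```python
import numpy as np

def smoothed_max_grad(G, n_samples=20000, seed=0):
    """Monte Carlo estimate of the gradient in Eq. 7: the probability that
    each coordinate attains the perturbed maximum G_i + Z_i."""
    rng = np.random.default_rng(seed)
    Z = rng.gumbel(size=(n_samples, len(G)))      # Z_1..Z_N iid, per sample
    winners = np.argmax(G + Z, axis=1)            # i* per sample (ties: prob. 0)
    return np.bincount(winners, minlength=len(G)) / n_samples

p = smoothed_max_grad(np.array([-0.5, -0.1, -0.9]))
```

As expected, the arm with the largest G_i receives the largest estimated probability.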
The unbounded support condition guarantees that this partial derivative is non-zero for all i given any G. So, Φ̃(G; D) satisfies the requirements of Framework 1.

4.1 Connection to Follow the Perturbed Leader

There is a straightforward way to efficiently implement the sampling step of the bandit GBPA (Framework 1) with a stochastically smoothed function. Instead of evaluating the expectation in Equation 7, we simply take a random sample. In fact, this is equivalent to the Follow the Perturbed Leader algorithm (FTPL) (Kalai and Vempala, 2005) for bandit settings. On the other hand, implementing the estimation step is hard because generally there is no closed-form expression for ∇Φ̃.

To address this issue, Neu and Bartók (2013) proposed Geometric Resampling (GR). GR uses an iterative resampling process to estimate ∇_iΦ̃. This process gives an unbiased estimate when allowed to run for an unbounded number of iterations. Even when we truncate the resampling process after M iterations, the extra regret due to the estimation bias is at most NT/e^M (an additive term). Since the lower bound for the multi-armed bandit problem is Ω(√(NT)), the choice M = √(NT) does not affect the asymptotic regret of the algorithm. In summary, all our GBPA regret bounds in this section hold for the corresponding FTPL algorithm with an extra additive NT/e^M term in the bound.

Despite the fact that perturbation-based algorithms provide a natural randomized decision strategy, they have seen few applications, mostly because they are hard to analyze. But one should expect general results to be within reach: the EXP3 algorithm, for example, can be viewed through the lens of perturbations, where the noise is distributed according to the Gumbel distribution. 
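The Geometric Resampling estimate just described can be sketched in a few lines. This is our reading of the scheme, not Neu and Bartók's exact implementation, and the Gumbel noise and parameter values are illustrative assumptions: redraw the perturbation until the played arm wins the argmax again; the number of redraws K is geometric with success probability p_played, so min(K, M) estimates 1/p_played with a truncation bias that vanishes geometrically in M:

```python
import numpy as np

def geometric_resampling(G, played, M, rng):
    """Estimate 1 / p_played by redrawing perturbations until `played`
    attains the argmax again, truncated after M trials."""
    for k in range(1, M + 1):
        Z = rng.gumbel(size=len(G))
        if np.argmax(G + Z) == played:
            return k              # K ~ Geometric(p_played), E[K] = 1/p_played
    return M                      # truncation: small downward bias

rng = np.random.default_rng(0)
G = np.array([-0.2, -0.6, -0.4])
est = np.mean([geometric_resampling(G, 0, 200, rng) for _ in range(5000)])
```

With Gumbel noise the argmax probabilities are the softmax of G, so here 1/p_0 is roughly 2.5, and the averaged estimate lands close to that value.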
Indeed, an early result of Kujala and Elomaa (2005) showed that a near-optimal MAB strategy comes about through the use of exponentially-distributed noise, and the same perturbation strategy has more recently been utilized in the work of Neu and Bartók (2013) and Kocák et al. (2014). However, a more general understanding of perturbation methods has remained elusive. For example, would Gaussian noise be sufficient for a guarantee? What about, say, the Weibull distribution?

4.2 Hazard Rate Analysis

In this section, we show that the performance of the GBPA(Φ̃(G; D)) can be characterized by the hazard function of the smoothing distribution D. The hazard rate is a standard tool in survival analysis to describe failures due to aging; for example, an increasing hazard rate models units that deteriorate with age, while a decreasing hazard rate models units that improve with age (a counterintuitive but not illogical possibility). To the best of our knowledge, the connection between hazard rates and the design of adversarial bandit algorithms has not been made before.

Definition 4.1 (Hazard rate function). The hazard rate function of a distribution D is

h_D(x) := f(x)/(1 − F(x)).

For the rest of the section, we assume that D is unbounded in the direction of +∞, so that the hazard function is well-defined everywhere. This assumption is for clarity of presentation and can easily be removed (Appendix B).

Theorem 4.2. The regret of the GBPA on Φ̃(G) = E_{Z_1,...,Z_N ∼ D} max_i {G_i + η Z_i} is at most:

(N (sup h_D)/η) T + η E_{Z_1,...,Z_N ∼ D} [max_i Z_i].

Proof. We analyze each penalty term in Lemma 2.1. Due to the convexity of Φ, the underestimation penalty is non-positive. The overestimation penalty is clearly at most η E_{Z_1,...,Z_N ∼ D}[max_i Z_i], and Lemma 4.3 proves the N (sup h_D) upper bound on the divergence penalty. It remains to provide the tuning parameter η. 
Suppose we scale the perturbation Z by η > 0, i.e., we add ηZ_i to each coordinate. It is easy to see that E[max_{i=1,...,N} ηZ_i] = η E[max_{i=1,...,N} Z_i]. For the divergence penalty, let F_η be the CDF of the scaled random variable. Observe that F_η(t) = F(t/η) and thus f_η(t) = (1/η) f(t/η). Hence, the hazard rate scales by 1/η, which completes the proof.

Lemma 4.3. The divergence penalty of the GBPA with Φ̃(G) = E_{Z ∼ D} max_i {G_i + Z_i} is at most N (sup h_D) each round.

Proof. Recall the gradient expression in Equation 8. The i-th diagonal entry of the Hessian is:

∇²_{ii}Φ̃(G) = (∂/∂G_i) E_{G̃_{j*}} [1 − F(G̃_{j*} − G_i)]
 = E_{G̃_{j*}} [(∂/∂G_i) (1 − F(G̃_{j*} − G_i))]
 = E_{G̃_{j*}} [f(G̃_{j*} − G_i)]
 = E_{G̃_{j*}} [h_D(G̃_{j*} − G_i)(1 − F(G̃_{j*} − G_i))]
 ≤ (sup h_D) E_{G̃_{j*}} [1 − F(G̃_{j*} − G_i)]
 = (sup h_D) ∇_iΦ̃(G),   (9)

where G̃_{j*} = max_{j≠i} {G_j + Z_j}, which is a random variable independent of Z_i. We now apply Theorem 2.3 with β = 1 and C = sup h_D to complete the proof.

Table 1: Distributions that give O(√(TN log N))-regret FTPL algorithms. The parameterization follows the corresponding Wikipedia pages for easy lookup. We denote the Euler-Mascheroni constant (≈ 0.58) by γ_0. Distributions marked with (*) need to be slightly modified using the conditioning trick explained in Appendix B.2. The maximum of the Frechet hazard function has to be computed numerically (Elsayed, 2012, p. 47), but elementary calculations show that it is bounded by 2α (Appendix D).

Distribution | sup_x h_D(x) | E[max_{i=1}^N Z_i] | Parameters for O(√(TN log N))
Gumbel(μ = 1, β = 1) | 1 (as x → ∞) | log N + γ_0 | N/A
Frechet (α > 1) | at most 2α | N^{1/α} Γ(1 − 1/α) | α = log N
Weibull* (λ = 1, k ≤ 1) | k (at x = 0) | O((1/k)! (log N)^{1/k}) | k = 1 (Exponential)
Pareto* (x_m = 1, α) | α (at x = 0) | α N^{1/α}/(α − 1) | α = log N
Gamma (α ≥ 1, β) | β (as x → ∞) | (log N + (α − 1) log log N − log Γ(α) + γ_0)/β | α = β = 1 (Exponential)

Corollary 4.4. The Follow the Perturbed Leader algorithm with the distributions in Table 1 (restricted to a certain range of parameters), combined with Geometric Resampling (Section 4.1) with M = √(NT), has an expected regret of order O(√(TN log N)).

Table 1 provides the two terms we need to bound. We derive the third column of the table in Appendix C using Extreme Value Theory (Embrechts et al., 1997). Note that our analysis in the proof of Lemma 4.3 is quite tight; the only place we have an inequality is where we upper bound the hazard rate. It is thus reasonable to pose the following conjecture:

Conjecture 4.5. If a distribution D has a monotonically increasing hazard rate h_D(x) that does not converge as x → +∞ (e.g., Gaussian), then there is a sequence of losses that will incur at least a linear regret.

The intuition is that if the adversary keeps incurring a high loss for the i-th arm, then with high probability G̃_{j*} − G_i will be large. So, the expectation in Equation 9 will be dominated by the hazard function evaluated at large values of G̃_{j*} − G_i.

Acknowledgments. J. Abernethy acknowledges the support of NSF under CAREER grant IIS-1453304. A. Tewari acknowledges the support of NSF under CAREER grant IIS-1452099.

References

J. Abernethy, E. Hazan, and A. Rakhlin. Interior-point methods for full-information and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164-4175, 2012.

J. Abernethy, C. Lee, A. Sinha, and A. Tewari. Online linear optimization via smoothing. In COLT, pages 807-823, 2014.

J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217-226, 2009.

J.-Y. Audibert, S. Bubeck, and G. Lugosi. 
Minimax policies for combinatorial prediction games. In COLT, 2011.

P. Auer. Using confidence bounds for exploitation-exploration trade-offs. The Journal of Machine Learning Research, 3:397-422, 2003.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In FOCS, 1995.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2003. ISSN 0097-5397.

D. P. Bertsekas. Stochastic optimization problems with nondifferentiable cost functionals. Journal of Optimization Theory and Applications, 12(2):218-231, 1973. ISSN 0022-3239.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

V. Dani and T. P. Hayes. Robbing the bandit: less regret in online geometric optimization against an adaptive adversary. In SODA, pages 937-943, 2006.

V. Dani, T. Hayes, and S. Kakade. The price of bandit information for online optimization. In NIPS, 2008.

L. Devroye, G. Lugosi, and G. Neu. Prediction by random-walk perturbation. In Conference on Learning Theory, pages 460-473, 2013.

E. Elsayed. Reliability Engineering. Wiley Series in Systems Engineering and Management. Wiley, 2012. ISBN 9781118309544. URL https://books.google.com/books?id=NdjF5G6tfLQC.

P. Embrechts, C. Klüppelberg, and T. Mikosch. Modelling Extremal Events: For Insurance and Finance. Applications of Mathematics. Springer, 1997. ISBN 9783540609315. URL https://books.google.com/books?id=BXOI2pICfJUC.

A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA, pages 385-394, 2005. 
ISBN 0-89871-585-7.

J. Gittins. Quantitative methods in the planning of pharmaceutical research. Drug Information Journal, 30(2):479-487, 1996.

J. Gittins, K. Glazebrook, and R. Weber. Multi-armed Bandit Allocation Indices. John Wiley & Sons, 2011.

J. Hannan. Approximation to Bayes risk in repeated play. In M. Dresher, A. W. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume III, pages 97-139, 1957.

A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291-307, 2005.

T. Kocák, G. Neu, M. Valko, and R. Munos. Efficient learning by implicit exploration in bandit problems with side observations. In NIPS, pages 613-621. Curran Associates, Inc., 2014.

J. Kujala and T. Elomaa. On following the perturbed leader in the bandit setting. In Algorithmic Learning Theory, pages 371-385. Springer, 2005.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.

N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212-261, 1994. ISSN 0890-5401.

H. B. McMahan and A. Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In COLT, pages 109-123, 2004.

G. Neu and G. Bartók. An efficient algorithm for learning with semi-bandit feedback. In Algorithmic Learning Theory, pages 234-248. Springer, 2013.

M. Pacula, J. Ansel, S. Amarasinghe, and U.-M. O'Reilly. Hyperparameter tuning in bandit-based adaptive operator selection. In Applications of Evolutionary Computation, pages 73-82. Springer, 2012.

J.-P. Penot. Sub-hessians, super-hessians and conjugation. Nonlinear Analysis: Theory, Methods & Applications, 23(6):689-702, 1994. 
URL http://www.sciencedirect.com/science/article/pii/0362546X94902127.

S. Rakhlin, O. Shamir, and K. Sridharan. Relax and randomize: From value to algorithms. In Advances in Neural Information Processing Systems, pages 2141-2149, 2012.

H. Robbins. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 58(5):527-535, 1952.

C. Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52(1-2):479-487, 1988.

G. Van den Broeck, K. Driessens, and J. Ramon. Monte-Carlo tree search in poker using expected reward distributions. In Advances in Machine Learning, pages 367-381. Springer, 2009.

T. Van Erven, W. Kotlowski, and M. K. Warmuth. Follow the leader with dropout perturbations. In COLT, 2014.
", "award": [], "sourceid": 1304, "authors": [{"given_name": "Jacob", "family_name": "Abernethy", "institution": "University of Michigan"}, {"given_name": "Chansoo", "family_name": "Lee", "institution": "University of Michigan Ann Arb"}, {"given_name": "Ambuj", "family_name": "Tewari", "institution": "University of Michigan"}]}