{"title": "On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 2695, "page_last": 2704, "abstract": "We investigate the optimality of perturbation based algorithms in the stochastic and adversarial multi-armed bandit problems. For the stochastic case, we provide a unified regret analysis for both sub-Weibull and bounded perturbations when rewards are sub-Gaussian. Our bounds are instance optimal for sub-Weibull perturbations with parameter 2 that also have a matching lower tail bound, and all bounded support perturbations where there is sufficient probability mass at the extremes of the support. For the adversarial setting, we prove rigorous barriers against two natural solution approaches using tools from discrete choice theory and extreme value theory. Our results suggest that the optimal perturbation, if it exists, will be of Frechet-type.", "full_text": "On the Optimality of Perturbations in Stochastic and\n\nAdversarial Multi-armed Bandit Problems\n\nBaekjin Kim\n\nDepartment of Statistics\nUniversity of Michigan\nAnn Arbor, MI 48109\nbaekjin@umich.edu\n\nAmbuj Tewari\n\nDepartment of Statistics\nUniversity of Michigan\nAnn Arbor, MI 48109\ntewaria@umich.edu\n\nAbstract\n\nWe investigate the optimality of perturbation based algorithms in the stochastic\nand adversarial multi-armed bandit problems. For the stochastic case, we provide\na uni\ufb01ed regret analysis for both sub-Weibull and bounded perturbations when\nrewards are sub-Gaussian. Our bounds are instance optimal for sub-Weibull per-\nturbations with parameter 2 that also have a matching lower tail bound, and all\nbounded support perturbations where there is suf\ufb01cient probability mass at the\nextremes of the support. For the adversarial setting, we prove rigorous barriers\nagainst two natural solution approaches using tools from discrete choice theory\nand extreme value theory. 
Our results suggest that the optimal perturbation, if it exists, will be of Fréchet-type.

1 Introduction

Beginning with the seminal work of Hannan [12], researchers have been interested in algorithms that use random perturbations to generate a distribution over available actions. Kalai and Vempala [17] showed that the perturbation idea leads to efficient algorithms for many online learning problems with large action sets. Due to the Gumbel lemma [14], the well known exponential weights algorithm [11] also has an interpretation as a perturbation based algorithm that uses Gumbel distributed perturbations. There have been several attempts to analyze the regret of perturbation based algorithms with specific distributions such as Uniform, Double-exponential, drop-out and random walk (see, e.g., [17, 18, 9, 28]). These works provided rigorous guarantees, but the techniques they used did not generalize to arbitrary perturbations. Recent work [1] provided a general framework to understand general perturbations and clarified the relation between regularization and perturbation by understanding them as different ways to smooth an underlying non-smooth potential function.

Abernethy et al. [2] extended the analysis of general perturbations to the partial information setting of the adversarial multi-armed bandit problem. They isolated bounded hazard rate as an important property of a perturbation and gave several examples of perturbations that lead to the near optimal regret bound of $O(\sqrt{KT \log K})$. Since Tsallis entropy regularization can achieve the minimax regret of $O(\sqrt{KT})$ [4, 5], the question of whether perturbations can match the power of regularizers remained open for the adversarial multi-armed bandit problem.

In this paper, we build upon previous works [1, 2] in two distinct but related directions. First, we provide the first general result for perturbation algorithms in the stochastic multi-armed bandit problem.
We present a unified regret analysis for both sub-Weibull and bounded perturbations when rewards are sub-Gaussian. Our regret bounds are instance optimal for sub-Weibull perturbations with parameter 2 (with a matching lower tail bound), and for all bounded support perturbations where there is sufficient probability mass at the extremes of the support. Since the Uniform and Rademacher distributions are instances of these bounded support perturbations, one of our results is a regret bound for a randomized version of UCB where the algorithm picks a random number in the confidence interval, or randomly chooses between the lower and upper confidence bounds, instead of always picking the upper bound. Our analysis relies on the simple but powerful observation that Thompson sampling with Gaussian priors and rewards can also be interpreted as a perturbation algorithm with Gaussian perturbations. We are able to generalize both the upper bound and lower bound of Agrawal and Goyal [3] in two respects: (1) from the special Gaussian perturbation to general sub-Weibull or bounded perturbations, and (2) from the special Gaussian rewards to general sub-Gaussian rewards.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Second, we return to the open problem mentioned above: is there a perturbation that gives us minimax optimality? We do not resolve it, but we provide rigorous proofs that there are barriers to two natural approaches to solving the open problem. (A) One cannot simply find a perturbation that is exactly equivalent to Tsallis entropy. This is surprising since Shannon entropy does have an exact equivalent perturbation, viz. Gumbel. (B) One cannot simply do a better analysis of perturbations used before [2] and plug the results into their general regret bound to eliminate the extra $O(\sqrt{\log K})$ factor.
In proving the first barrier, we use a fundamental result in discrete choice theory. For the second barrier, we rely on tools from extreme value theory.

2 Problem Setup

In every round t starting at 1, a learner chooses an action $A_t \in [K] := \{1, 2, \cdots, K\}$ out of K arms, and the environment picks a response in the form of a real-valued reward vector $g_t \in [0, 1]^K$. While the entire reward vector $g_t$ is revealed to the learner in the full information setting, in the bandit setting the learner only receives the reward associated with his choice and no information about the other arms. Thus, we denote the reward corresponding to his choice $A_t$ as $X_t = g_{t,A_t}$.

In the stochastic multi-armed bandit, the rewards $g_{t,i}$ are sampled i.i.d. from a fixed but unknown distribution with mean $\mu_i$. The adversarial multi-armed bandit is more general in that all assumptions on how rewards are assigned to arms are dropped. It only assumes that rewards are assigned by an adversary before the interaction begins. Such an adversary is called an oblivious adversary. In both environments, the learner makes a sequence of decisions $A_t$ based on the history $H_{t-1} = (A_1, X_1, \cdots, A_{t-1}, X_{t-1})$ to maximize the cumulative reward, $\sum_{t=1}^T X_t$.

As a measure of evaluating a learner, regret is the difference between the rewards the learner would have received had he played the best arm in hindsight and the rewards he actually received. Therefore, minimizing the regret is equivalent to maximizing the expected cumulative reward. We consider the expected regret, $R(T) = E[\max_{i \in [K]} \sum_{t=1}^T g_{t,i} - \sum_{t=1}^T g_{t,A_t}]$, in the adversarial setting, and the pseudo regret, $R'(T) = T \cdot \max_{i \in [K]} \mu_i - E[\sum_{t=1}^T X_t]$, in the stochastic setting. Note that the two regrets coincide when an oblivious adversary is considered.
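To make the pseudo regret concrete, here is a minimal sketch (the two-armed instance, the pull sequence, and the helper name are our illustration, not from the paper) computing $R'(T)$ for a fixed sequence of arm pulls:

```python
def pseudo_regret(mu, pulls):
    """Pseudo regret R'(T) = sum over rounds of (max_i mu_i - mu_{A_t})
    for a fixed sequence of pulled arm indices."""
    best = max(mu)
    return sum(best - mu[a] for a in pulls)

# Hypothetical two-armed instance: arm 0 is optimal, gap = 0.4.
mu = [0.9, 0.5]
pulls = [0, 1, 0, 0, 1]  # arm indices A_1, ..., A_5: two suboptimal pulls
r = pseudo_regret(mu, pulls)
```

With two pulls of the suboptimal arm and a gap of 0.4, the pseudo regret here is 0.8, matching the decomposition $R(T) = \sum_i \Delta_i E[T_i(T)]$ used later for stochastic bandits.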
An online algorithm is called a no-regret algorithm if, for every adversary, its expected regret is sub-linear in T.

We use FTPL (Follow The Perturbed Leader) to denote families of algorithms for both the stochastic and adversarial settings. The common core of FTPL algorithms consists in adding random perturbations to the estimates of the rewards of each arm prior to computing the current "best arm" (or "leader"). However, the estimates used are different in the two settings: the stochastic setting uses sample means and the adversarial setting uses inverse probability weighted estimates.

3 Stochastic Bandits

In this section, we propose FTPL algorithms for stochastic multi-armed bandits and characterize a family of perturbations that make the algorithm instance-optimal in terms of regret bounds. This work is mainly motivated by Thompson Sampling [25], one of the standard algorithms in stochastic settings. We also provide a lower bound for the regret of this FTPL algorithm.

For our analysis, we assume, without loss of generality, that arm 1 is optimal, $\mu_1 = \max_{i \in [K]} \mu_i$, and the sub-optimality gap is denoted as $\Delta_i = \mu_1 - \mu_i$. Let $\hat{\mu}_i(t) = \sum_{s=1}^t I\{A_s = i\} X_s / T_i(t)$ be the average reward received from arm i after round t, where $T_i(t) = \sum_{s=1}^t I\{A_s = i\}$ is the number of times arm i has been pulled after round t. The regret for stochastic bandits can be decomposed as $R(T) = \sum_{i=1}^K \Delta_i E[T_i(T)]$. The reward distributions are generally assumed to be sub-Gaussian with parameter 1 [20].

Definition 1 (sub-Gaussian). A random variable Z with mean $\mu = E[Z]$ is sub-Gaussian with parameter $\sigma > 0$ if it satisfies $P(|Z - \mu| \geq t) \leq \exp(-t^2/(2\sigma^2))$ for all $t \geq 0$.

Lemma 1 (Hoeffding bound for sub-Gaussians [15]). Suppose $Z_i$, $i \in [n]$, are i.i.d.
random variables with $E(Z_i) = \mu$ that are sub-Gaussian with parameter $\sigma$. Then $P(\bar{Z}_n - \mu \geq t) \vee P(\bar{Z}_n - \mu \leq -t) \leq \exp(-nt^2/(2\sigma^2))$ for all $t \geq 0$, where $\bar{Z}_n = \sum_{i=1}^n Z_i / n$.

3.1 Upper Confidence Bound and Thompson Sampling

The standard algorithms in stochastic bandits are Upper Confidence Bound (UCB1) [6] and Thompson Sampling [25]. The former compares the largest plausible estimate of the mean for each arm, based on optimism in the face of uncertainty, and is therefore deterministic, in contradistinction to the latter. At time t + 1, UCB1 chooses an action $A_{t+1}$ by maximizing the upper confidence bounds, $UCB_i(t) = \hat{\mu}_i(t) + \sqrt{2 \log T / T_i(t)}$. Regarding the instance-dependent regret of UCB1, there exists some universal constant C > 0 such that $R(T) \leq C \sum_{i: \Delta_i > 0} (\Delta_i + \log T / \Delta_i)$.

Thompson Sampling is a Bayesian solution based on the randomized probability matching approach [24]. Given the prior distribution $Q_0$, at time t + 1 it computes the posterior distribution $Q_t$ based on observed data, samples $\nu_t$ from the posterior $Q_t$, and then chooses the arm $A_{t+1} = \arg\max_{i \in [K]} \mu_i(\nu_t)$. In Gaussian Thompson Sampling, where Gaussian rewards $N(\mu_i, 1)$ and a Gaussian prior for each $\mu_i$ with mean $\mu_0$ and infinite variance are considered, the policy is to choose an index that maximizes $\theta_i(t)$ randomly sampled from the Gaussian posterior distribution $N(\hat{\mu}_i(t), 1/T_i(t))$, as stated in Alg.1-(i).
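The posterior-sampling step of Alg.1-(i) can be sketched in a few lines (an illustrative sketch; the `gaussian_ts_index` helper and its argument names are ours, not the paper's):

```python
import math
import random

def gaussian_ts_index(mu_hat, counts):
    """Gaussian Thompson Sampling step: sample theta_i ~ N(mu_hat_i, 1/(1 v T_i))
    for each arm and return the index of the largest sample."""
    samples = [random.gauss(m, 1.0 / math.sqrt(max(1, n)))
               for m, n in zip(mu_hat, counts)]
    return max(range(len(samples)), key=samples.__getitem__)
```

With many pulls the posterior variance 1/T_i shrinks, so the sampled index concentrates on the empirically best arm, while arms with few pulls keep wide posteriors and are still explored.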
Also, its regret bound is restated in Theorem 2.

Algorithm 1: Randomized probability matching algorithms via Perturbation

  Initialize $T_i(0) = 0$, $\hat{\mu}_i(0) = 0$ for all $i \in [K]$.
  for t = 1 to T do
    For each arm $i \in [K]$, compute one of:
      (i) Gaussian Thompson Sampling: $\theta_i(t-1) \sim N(\hat{\mu}_i(t-1), \frac{1}{1 \vee T_i(t-1)})$
      (ii) FTPL via Unbounded Perturbation: $\theta_i(t-1) = \hat{\mu}_i(t-1) + \frac{1}{\sqrt{1 \vee T_i(t-1)}} \cdot Z_{it}$, where the $Z_{it}$ are randomly sampled from an unbounded Z.
      (iii) FTPL via Bounded Perturbation: $\theta_i(t-1) = \hat{\mu}_i(t-1) + \sqrt{\frac{(2+\epsilon) \log T}{1 \vee T_i(t-1)}} \cdot Z_{it}$, where the $Z_{it}$ are randomly sampled from $Z \in [-1, 1]$.
    Learner chooses $A_t = \arg\max_{i \in [K]} \theta_i(t-1)$ and receives the reward $X_t \in [0, 1]$.
    Update: $\hat{\mu}_{A_t}(t) = \frac{\hat{\mu}_{A_t}(t-1) \cdot T_{A_t}(t-1) + X_t}{T_{A_t}(t-1) + 1}$, $T_{A_t}(t) = T_{A_t}(t-1) + 1$.
  end

Theorem 2 (Theorem 3 [3]). Assume that the reward distribution of each arm i is Gaussian with mean $\mu_i$ and unit variance. The Thompson sampling policy via Gaussian prior defined in Alg.1-(i) has the following instance-dependent and instance-independent regret bounds, for some $C' > 0$:

  $R(T) \leq C' \sum_{\Delta_i > 0} (\log(T \Delta_i^2)/\Delta_i + \Delta_i)$,  $R(T) \leq O(\sqrt{KT \log K})$.

Viewpoint of Follow-The-Perturbed-Leader. The more generic view of Thompson Sampling is via the idea of perturbation. We bring an interpretation of Gaussian Thompson Sampling as a Follow-The-Perturbed-Leader (FTPL) algorithm via Gaussian perturbation [20].
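As an illustration of this FTPL viewpoint, the following sketch implements the selection rule of Alg.1-(ii)/(iii) on a simulated instance (the function names and the Bernoulli test instance are our assumptions, not the paper's):

```python
import math
import random

def ftpl_stochastic(rewards_fn, K, T, perturb, bounded=False, eps=0.1):
    """FTPL for stochastic bandits, a sketch of Alg.1-(ii)/(iii).

    perturb() draws a single perturbation Z; when `bounded` is True
    (Z supported on [-1, 1]) the scale gains the extra
    sqrt((2 + eps) * log T) factor of Alg.1-(iii)."""
    counts = [0] * K
    mu_hat = [0.0] * K
    total = 0.0
    for t in range(1, T + 1):
        scale = [1.0 / math.sqrt(max(1, n)) for n in counts]
        if bounded:
            scale = [s * math.sqrt((2 + eps) * math.log(T)) for s in scale]
        theta = [m + s * perturb() for m, s in zip(mu_hat, scale)]
        a = max(range(K), key=theta.__getitem__)   # follow the perturbed leader
        x = rewards_fn(a)
        total += x
        counts[a] += 1
        mu_hat[a] += (x - mu_hat[a]) / counts[a]   # incremental sample mean
    return total, counts

random.seed(1)
# Hypothetical Bernoulli instance; Gaussian perturbation as in Alg.1-(ii).
means = [0.8, 0.4, 0.3]
total, counts = ftpl_stochastic(
    lambda a: float(random.random() < means[a]), K=3, T=2000,
    perturb=lambda: random.gauss(0.0, 1.0))
```

On this instance the optimal arm should receive the overwhelming majority of pulls, with suboptimal arms pulled only O(log T / gap^2) times, in line with Theorem 3.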
If the Gaussian random variables $\theta_i(t)$ are decomposed into the average reward of each arm $\hat{\mu}_i(t)$ and a scaled Gaussian perturbation $\eta_{it} \cdot Z_{it}$, where $\eta_{it} = 1/\sqrt{T_i(t)}$ and $Z_{it} \sim N(0, 1)$, then in round t + 1 the FTPL algorithm chooses the action $A_{t+1} = \arg\max_{i \in [K]} \hat{\mu}_i(t) + \eta_{it} \cdot Z_{it}$.

3.2 Follow-The-Perturbed-Leader

We show that the FTPL algorithm with Gaussian perturbation under the Gaussian reward setting can be extended to sub-Gaussian rewards as well as to families of sub-Weibull and bounded perturbations. The sub-Weibull family is interesting in that it includes well known families like sub-Gaussian and sub-Exponential as special cases. We propose perturbation based algorithms via sub-Weibull and bounded perturbations in Alg.1-(ii), (iii), and their regrets are analyzed in Theorems 3 and 5.

Definition 2 (sub-Weibull [29]). A random variable Z with mean $\mu = E[Z]$ is sub-Weibull(p) with $\sigma > 0$ if it satisfies $P(|Z - \mu| \geq t) \leq C_a \exp(-t^p/(2\sigma^p))$ for all $t \geq 0$.

Theorem 3 (FTPL via sub-Weibull Perturbation, Proof in Appendix A.1).
Assume that the reward distribution of each arm i is 1-sub-Gaussian with mean $\mu_i$, and that the sub-Weibull(p) perturbation Z with parameter $\sigma$ and $E[Z] = 0$ satisfies the following anti-concentration inequality:

  $P(|Z| \geq t) \geq \exp(-t^q/(2\sigma^q))/C_b$ for $t \geq 0$.  (1)

Then the Follow-The-Perturbed-Leader algorithm via Z in Alg.1-(ii) has the following instance-dependent and instance-independent regret bounds, for $p \leq q \leq 2$ (if q = 2, $\sigma \geq 1$) and $C'' = C(\sigma, p, q) > 0$:

  $R(T) \leq C'' \sum_{\Delta_i > 0} ([\log(T \Delta_i^2)]^{2/p} / \Delta_i + \Delta_i)$,  $R(T) \leq O(\sqrt{KT} (\log K)^{1/p})$.  (2)

Note that the parameters p and q can be chosen as any values with $p \leq q \leq 2$, and the algorithm achieves a smaller regret bound as p becomes larger. For nice distributions such as the Gaussian and Double-exponential, the parameters p and q can be matched at 2 and 1, respectively.

Corollary 4 (FTPL via Gaussian Perturbation). Assume that the reward distribution of each arm i is 1-sub-Gaussian with mean $\mu_i$. The Follow-The-Perturbed-Leader algorithm via Gaussian perturbation Z with parameter $\sigma$ and $E[Z] = 0$ in Alg.1-(ii) has the following instance-dependent and instance-independent regret bounds, for $C'' = C(\sigma, 2, 2) > 0$ and $\sigma \geq 1$:

  $R(T) \leq C'' \sum_{\Delta_i > 0} (\log(T \Delta_i^2)/\Delta_i + \Delta_i)$,  $R(T) \leq O(\sqrt{KT \log K})$.  (3)

Failure of Bounded Perturbation. A perturbation with bounded support cannot by itself yield an optimal FTPL algorithm. For example, consider a two-armed bandit setting with $\mu_1 = 1$ and $\mu_2 = 0$, where the rewards of each arm i are generated from a Gaussian distribution with mean $\mu_i$ and unit variance and the perturbation is uniform with support [−1, 1].
Suppose that during the first 10 rounds we have $T_1(10) = 1$, $T_2(10) = 9$, and the average rewards are $\hat{\mu}_1 = -1$ and $\hat{\mu}_2 = 1/3$; then the perturbed rewards are sampled from $\theta_1 \sim U[-2, 0]$ and $\theta_2 \sim U[0, 2/3]$. This algorithm will never choose the first arm again and accordingly incurs linear regret. To overcome this limitation of bounded support, we suggest another FTPL algorithm via bounded perturbation that adds an extra logarithmic factor in T, as stated in Alg.1-(iii).

Theorem 5 (FTPL via Bounded support Perturbation, Proof in Appendix A.3). Assume that the reward distribution of each arm i is 1-sub-Gaussian with mean $\mu_i$, that the perturbation distribution Z with $E[Z] = 0$ lies in [−1, 1], and that for any $\epsilon > 0$ there exists $0 < M_{Z,\epsilon} < \infty$ such that $P(Z \leq \sqrt{2/(2+\epsilon)}) / P(Z \geq \sqrt{2/(2+\epsilon)}) = M_{Z,\epsilon}$. Then the Follow-The-Perturbed-Leader algorithm via Z in Alg.1-(iii) has the following instance-dependent and instance-independent regret bounds, for $C''' > 0$ independent of T, K and $\Delta_i$:

  $R(T) \leq C''' \sum_{\Delta_i > 0} (\log(T)/\Delta_i + \Delta_i)$,  $R(T) \leq O(\sqrt{KT \log T})$.  (4)

Randomized Confidence Bound algorithm. Theorem 5 implies that the optimism embedded in UCB can be replaced by simple randomization. Instead of comparing upper confidence bounds, our modification compares a value randomly chosen from the confidence interval, or between the lower and upper confidence bounds, by introducing a uniform U[−1, 1] or Rademacher R{−1, 1} perturbation into the UCB1 algorithm with a slightly wider confidence interval: $A_{t+1} = \arg\max_{i \in [K]} \hat{\mu}_i(t) + \sqrt{(2+\epsilon) \log T / T_i(t)} \cdot Z_{it}$.
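A minimal sketch of this randomized selection rule (the `rcb_index` helper and its arguments are our hypothetical names, following the arg max rule above):

```python
import math
import random

def rcb_index(mu_hat, counts, T, eps=0.1, rademacher=False):
    """Randomized Confidence Bound step: UCB1's (slightly widened) bonus
    is scaled by a random Z in [-1, 1] instead of always being added."""
    def z():
        return random.choice((-1.0, 1.0)) if rademacher else random.uniform(-1.0, 1.0)
    scores = [m + math.sqrt((2 + eps) * math.log(T) / max(1, n)) * z()
              for m, n in zip(mu_hat, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With `rademacher=True` the score is a coin flip between the lower and upper confidence bounds; otherwise it is a uniform draw from the confidence interval. As the counts grow, the interval shrinks and the rule reduces to picking the empirically best arm.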
These FTPL algorithms via Uniform and Rademacher perturbations can be regarded as a randomized version of the UCB algorithm, which we call the RCB (Randomized Confidence Bound) algorithm, and they achieve the same regret bound as UCB1. The RCB algorithm is meaningful in that it can be arrived at from two different perspectives: either as a randomized variant of UCB, or by replacing the Gaussian distribution with the Uniform in Gaussian Thompson Sampling.

The regret lower bound of the FTPL algorithm in Alg.1-(ii) is built on the work of Agrawal and Goyal [3]. Theorem 6 states that the regret lower bound depends on the lower bound of the tail probability of the perturbation. As special cases, FTPL algorithms via Gaussian (q = 2) and Double-exponential (q = 1) perturbations have matching lower and upper regret bounds, $\Theta(\sqrt{KT} (\log K)^{1/q})$.

Theorem 6 (Regret lower bound, Proof in Appendix A.4). If the perturbation Z with $E[Z] = 0$ has a lower bound on its tail probability of the form $P(|Z| \geq t) \geq \exp(-t^q/(2\sigma^q))/C_b$ for $t \geq 0$, $\sigma > 0$, then the Follow-The-Perturbed-Leader algorithm via Z has expected regret lower bounded by $\Omega(\sqrt{KT} (\log K)^{1/q})$.

4 Adversarial Bandits

In this section we study two major families of online learning algorithms, Follow The Regularized Leader (FTRL) and Follow The Perturbed Leader (FTPL), as two types of smoothing, and introduce the Gradient Based Prediction Algorithm (GBPA) family for solving the adversarial multi-armed bandit problem. We then discuss an important open problem regarding the existence of an optimal FTPL algorithm. The main contributions of this section are theoretical results showing that two natural approaches to solving the open problem cannot work. We also make some conjectures on what alternative ideas might work.

4.1 FTRL and FTPL as Two Types of Smoothings and An Open Problem

Following previous work [2], we consider a general algorithmic framework, Alg.2.
There are two main ingredients of GBPA. The first is a smoothed potential $\tilde{\Phi}$ whose gradient maps the current estimate of the cumulative reward vector to a probability distribution $p_t$ over arms. The second is the construction of an unbiased estimate $\hat{g}_t$ of the reward vector, using the reward of the pulled arm only, by inverse probability weighting. This step reduces the bandit setting to the full-information setting, so that any algorithm for the full-information setting can be immediately applied to the bandit setting.

Algorithm 2: Gradient-Based Prediction Algorithm in the Bandit setting

  GBPA($\tilde{\Phi}$): $\tilde{\Phi}$ is a differentiable convex function such that $\nabla \tilde{\Phi} \in \Delta^{K-1}$ and $\nabla_i \tilde{\Phi} > 0$ for all i. Initialize $\hat{G}_0 = 0$.
  for t = 1 to T do
    A reward vector $g_t \in [0, 1]^K$ is chosen by the environment.
    Learner chooses $A_t$ randomly sampled from the distribution $p_t = \nabla \tilde{\Phi}(\hat{G}_{t-1})$.
    Learner receives the reward of the chosen arm, $g_{t,A_t}$, and estimates the reward vector as $\hat{g}_t = \frac{g_{t,A_t}}{p_{t,A_t}} e_{A_t}$.
    Update: $\hat{G}_t = \hat{G}_{t-1} + \hat{g}_t$.
  end

If we did not use any smoothing and directly used the baseline potential $\Phi(G) = \max_{w \in \Delta^{K-1}} \langle w, G \rangle$, we would be running Follow The Leader (FTL) as our full information algorithm. It is well known that FTL does not have good regret guarantees [13]. Therefore, we need to smooth the baseline potential to induce stability in the algorithm. It turns out that the two major algorithm families in online learning, Follow The Regularized Leader (FTRL) and Follow The Perturbed Leader (FTPL), correspond to two different types of smoothing.

The smoothing used by FTRL is achieved by adding a strongly convex regularizer in the dual representation of the baseline potential.
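The GBPA loop of Alg.2 can be sketched as follows. Since the gradient of a stochastically smoothed potential is generally not available in closed form, this illustrative sketch estimates it by Monte Carlo and clips probabilities away from zero; both simplifications are ours and are not part of the paper's analysis:

```python
import random

def gbpa(reward_rows, eta, perturb, n_samples=200):
    """Sketch of Alg.2 with a stochastically smoothed potential.

    The gradient of the smoothed potential is estimated by Monte Carlo:
    p_t,i ~= P(i = argmax_j G_hat_j + eta * Z_j) with i.i.d. Z_j drawn
    by perturb(). Returns the final cumulative reward estimate G_hat."""
    K = len(reward_rows[0])
    G_hat = [0.0] * K
    for g in reward_rows:
        # Monte Carlo estimate of p_t = grad of the smoothed potential.
        wins = [0] * K
        for _ in range(n_samples):
            scores = [G_hat[i] + eta * perturb() for i in range(K)]
            wins[max(range(K), key=scores.__getitem__)] += 1
        p = [max(w / n_samples, 1e-3) for w in wins]  # clip away zero probs
        a = random.choices(range(K), weights=p)[0]
        G_hat[a] += g[a] / p[a]  # inverse-probability-weighted estimate
    return G_hat

random.seed(0)
# Hypothetical instance: arm 0 always pays 1, arm 1 always pays 0;
# Gaussian perturbation chosen purely for illustration.
G = gbpa([[1.0, 0.0]] * 30, eta=1.0, perturb=lambda: random.gauss(0.0, 1.0))
```

Only the pulled arm's reward enters the update, and dividing by the pull probability keeps the estimate of the full reward vector unbiased (before clipping), which is exactly the reduction to the full-information setting described above.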
That is, we set $\tilde{\Phi}(G) = R^\star(G) = \max_{w \in \Delta^{K-1}} \langle w, G \rangle - \eta R(w)$, where R is a strongly convex function. The well known exponential weights algorithm [11] uses the Shannon entropy regularizer, $R_S(w) = \sum_{i=1}^K w_i \log(w_i)$. GBPA with the resulting smoothed potential becomes the EXP3 algorithm [7], which achieves a near-optimal regret bound of $O(\sqrt{KT \log K})$, just logarithmically worse than the lower bound $\Omega(\sqrt{KT})$. This lower bound was matched by the Implicitly Normalized Forecaster with polynomial function (Poly-INF algorithm) [4, 5], and later work [2] showed that the Poly-INF algorithm is equivalent to the FTRL algorithm via the Tsallis entropy regularizer, $R_{T,\alpha}(w) = \frac{1 - \sum_{i=1}^K w_i^\alpha}{1 - \alpha}$ for $\alpha \in (0, 1)$.

An alternate way of smoothing is stochastic smoothing, which is what FTPL algorithms use. It injects stochastic perturbations into the cumulative rewards of each arm and then finds the best arm. Given a perturbation distribution D and $Z = (Z_1, \cdots, Z_K)$ consisting of i.i.d. draws from D, the resulting stochastically smoothed potential is $\tilde{\Phi}(G; D) = E_{Z_1, \cdots, Z_K \sim D}[\max_{w \in \Delta^{K-1}} \langle w, G + \eta Z \rangle]$. Its gradient is $p_t = \nabla \tilde{\Phi}(G_t; D) = E_{Z_1, \cdots, Z_K \sim D}[e_{i^\star}] \in \Delta^{K-1}$, where $i^\star = \arg\max_i G_{t,i} + \eta Z_i$.

In Section 4.3, we recall the general regret bound proved by Abernethy et al. [2] for distributions with bounded hazard rate. They showed that a variety of natural perturbation distributions can yield a near-optimal regret bound of $O(\sqrt{KT \log K})$. However, none of the distributions they tried yielded the minimax optimal rate $O(\sqrt{KT})$.
Since FTRL with the Tsallis entropy regularizer can achieve the minimax optimal rate in adversarial bandits, the following is an important unresolved question regarding the power of perturbations.

Open Problem. Is there a perturbation D such that GBPA with a stochastically smoothed potential using D achieves the optimal regret bound $O(\sqrt{KT})$ in adversarial K-armed bandits?

Given what we currently know, there are two very natural approaches to resolving the open question in the affirmative. Approach 1: Find a perturbation that yields exactly the same choice probability function as the one used by FTRL via Tsallis entropy. Approach 2: Provide tighter control on the expected block maxima of the random variables considered as perturbations by Abernethy et al. [2].

4.2 Barrier Against First Approach: Discrete Choice Theory

The first approach is motivated by a folklore observation in online learning theory, namely, that the exponential weights algorithm [11] can be viewed as FTRL via the Shannon entropy regularizer or as FTPL via a Gumbel-distributed perturbation. Thus, we might hope to find a perturbation which is an exact equivalent of the Tsallis entropy regularizer. Since FTRL via Tsallis entropy is optimal for adversarial bandits, finding such a perturbation would immediately settle the open problem.

The relation between regularizers and perturbations has been theoretically studied in discrete choice theory [22, 16]. For any perturbation, there is always a regularizer which gives the same choice probability function. The converse, however, does not hold. The Williams-Daly-Zachary Theorem provides a characterization of the choice probability functions that can be derived via additive perturbations.

Theorem 7 (Williams-Daly-Zachary Theorem [22]). Let $C : R^K \to S^K$ be the choice probability function, with derivative (Jacobian) matrix $DC(G) = (\frac{\partial C}{\partial G_1}^\top, \frac{\partial C}{\partial G_2}^\top, \cdots, \frac{\partial C}{\partial G_K}^\top)^\top$.
The following 4 conditions are necessary and sufficient for the existence of perturbations $Z_i$ such that this choice probability function C can be written as $C_i(G) = P(\arg\max_{j \in [K]} G_j + \eta Z_j = i)$ for $i \in [K]$: (1) DC(G) is symmetric, (2) DC(G) is positive definite, (3) $DC(G) \cdot \mathbf{1} = 0$, and (4) all mixed partial derivatives of C satisfy $(-1)^j \frac{\partial^j C_{i_0}}{\partial G_{i_1} \cdots \partial G_{i_j}} > 0$ for each j = 1, ..., K − 1.

We now show that if the number of arms is greater than three, there does not exist any perturbation exactly equivalent to Tsallis entropy regularization. Therefore, the first approach to solving the open problem is doomed to failure.

Theorem 8 (Proof in Appendix A.5). When $K \geq 4$, there is no stochastic perturbation that yields the same choice probability function as the Tsallis entropy regularizer.

4.3 Barrier Against Second Approach: Extreme Value Theory

The second approach is built on the work of Abernethy et al. [2], who provided the state-of-the-art perturbation based algorithm for adversarial multi-armed bandits. Their framework covers all distributions with bounded hazard rate and shows that the regret of GBPA via a perturbation $Z \sim D$ with bounded hazard rate is upper bounded by a trade-off between the bound on the hazard rate and the expected block maxima, as stated below.

Theorem 9 (Theorem 4.2 [2]). Assume the support of D is unbounded in the positive direction and the hazard rate $h_D(x) = \frac{f(x)}{1 - F(x)}$ is bounded. Then the expected regret of GBPA($\tilde{\Phi}$) in the adversarial bandit setting is bounded by $\eta \cdot E[M_K] + \frac{K \sup h_D}{\eta} T$, where $\sup h_D = \sup_{x: f(x) > 0} h_D(x)$.
The optimal choice of $\eta$ leads to the regret bound $2\sqrt{KT \cdot \sup h_D \cdot E[M_K]}$, where $M_K = \max_{i \in [K]} Z_i$.

Abernethy et al. [2] considered several perturbations such as the Gumbel, Gamma, Weibull, Fréchet and Pareto. The best tuning of the distribution parameters (to minimize upper bounds on the product $\sup h_D \cdot E[M_K]$) always leads to the bound $O(\sqrt{KT \log K})$, which is tantalizingly close to the lower bound but does not match it. It is possible that some of their upper bounds on the expected block maxima $E[M_K]$ are loose, and that we could get closer to, or perhaps even match, the lower bound simply by doing a better job of bounding the expected block maxima (we will not worry about the supremum of the hazard rate, since their bounds can easily be shown to be tight, up to constants, using elementary calculations in Appendix B.2). We show that this approach will also not work, by characterizing the asymptotic (as $K \to \infty$) behavior of the block maxima of perturbations using extreme value theory. The statistical behavior of the block maxima, $M_K = \max_{i \in [K]} Z_i$, where the $Z_i$ are a sequence of i.i.d. random variables with distribution function F, can be described by one of three extreme value distributions: Gumbel, Fréchet and Weibull [8, 23]. The normalizing sequences $\{a_K > 0\}$ and $\{b_K\}$ are explicitly characterized [21]. Under mild conditions, $E[(M_K - b_K)/a_K] \to E_{Z \sim G}[Z] = C$ as $K \to \infty$, where G is the extreme value distribution and C is a constant, and the expected block maxima behave asymptotically as $E[M_K] = \Theta(C \cdot a_K + b_K)$. See Theorems 11-13 in Appendix B for more details.

Table 1: Asymptotic expected block maxima based on Extreme Value Theory. Gumbel-type and Fréchet-type are denoted by $\Lambda$ and $\Phi_\alpha$ respectively.
The Gamma function and the Euler-Mascheroni constant are denoted by $\Gamma(\cdot)$ and $\gamma$ respectively.

  Distribution           | Type  | sup h            | E[M_K]
  Gumbel(µ = 0, β = 1)   | Λ     | 1                | log K + γ + o(1)
  Gamma(α, 1)            | Λ     | 1                | log K + γ + o(log K)
  Weibull(α ≤ 1)         | Λ     | α                | (log K)^{1/α} + o((log K)^{1/α})
  Fréchet(α > 1)         | Φ_α   | ∈ (α/(e−1), 2α)  | Γ(1 − 1/α) · K^{1/α}
  Pareto(x_m = 1, α)     | Φ_α   | α                | Γ(1 − 1/α) · (K^{1/α} − 1)

The asymptotically tight growth rates (with explicit constants for the leading term!) of the expected block maxima of some distributions are given in Table 1. They match the upper bounds on the expected block maxima in Table 1 of Abernethy et al. [2]; that is, their upper bounds are asymptotically tight. The Gumbel, Gamma and Weibull distributions are of Gumbel-type (Λ), and their expected block maxima grow polylogarithmically in K. This implies that a Gumbel-type perturbation can never achieve the optimal regret bound despite its bounded hazard rate. The Fréchet and Pareto distributions are of Fréchet-type (Φ_α), and their expected block maxima grow as $K^{1/\alpha}$.
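The growth rates in Table 1 are easy to reproduce by simulation. The following sketch (our illustration, with inverse-transform sampling from the standard Gumbel and a Fréchet with shape α = 2) estimates $E[M_K]$ by Monte Carlo:

```python
import math
import random

def mean_block_max(sample, K, trials=2000):
    """Monte Carlo estimate of E[M_K] = E[max of K i.i.d. draws from sample()]."""
    return sum(max(sample() for _ in range(K)) for _ in range(trials)) / trials

random.seed(0)

# Gumbel-type: CDF exp(-exp(-x)); inverse transform of U ~ U(0, 1).
# Table 1 predicts E[M_K] ~ log K + gamma (~4.74 for K = 64).
gumbel = lambda: -math.log(-math.log(random.random()))
g_est = mean_block_max(gumbel, K=64)

# Frechet-type, shape alpha = 2: CDF exp(-x^(-2)); inverse transform.
# Table 1 predicts E[M_K] ~ Gamma(1 - 1/2) * K^(1/2) (~14.2 for K = 64).
frechet = lambda: (-math.log(random.random())) ** (-0.5)
f_est = mean_block_max(frechet, K=64)
```

Already at K = 64 the Fréchet-type block maximum is roughly three times the Gumbel-type one, illustrating the log K versus K^{1/α} gap that drives the barrier in this section.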
Heuristically, if α is set optimally to log K, the expected block maximum is independent of K while the supremum of the hazard rate is upper bounded by O(log K).

Conjecture. If there exists a perturbation that achieves minimax optimal regret in adversarial multi-armed bandits, it must be of Fréchet-type.

Fréchet-type perturbations can still possibly yield the optimal regret bound in a perturbation based algorithm if the expected block maximum is asymptotically bounded by a constant and the divergence term in the regret analysis of the GBPA algorithm can be shown to enjoy a tighter bound than what follows from the assumption of a bounded hazard rate.

The perturbation equivalent to Tsallis entropy (in the two-armed setting) is of Fréchet-type. Further evidence to support the conjecture can be found in the connection between FTRL and FTPL algorithms: in the two-armed bandit, the regularizer R and the perturbation $Z \sim F_D$ are in bijection via the mapping between $F_{D^\star}$ and R, $R(w) - R(0) = -\int_0^w F_{D^\star}^{-1}(1 - z)\,dz$, where $Z_1, Z_2$ are i.i.d. random variables with distribution function $F_D$ and $Z_1 - Z_2 \sim F_{D^\star}$. The difference of two i.i.d. Fréchet-type distributed random variables is conjectured to be of Fréchet-type. Thus, Tsallis entropy in the two-armed setting leads to a Fréchet-type perturbation, which supports our conjecture about optimal perturbations in adversarial multi-armed bandits. See Appendix C for more details.

5 Numerical Experiments

We present experimental results with perturbation based algorithms (Alg.1-(ii),(iii)) and compare them to the UCB1 algorithm in simulated stochastic K-armed bandits. In all experiments, the number of arms (K) is 10, the number of episodes is 1000, and the true mean rewards ($\mu_i$) are generated from U[0, 1] [19].
We consider the following four examples of 1-sub-Gaussian reward distributions, each shifted by the true mean µ_i: (a) Uniform, U[−1, 1]; (b) Rademacher, R{−1, 1}; (c) Gaussian, N(0, 1); and (d) Gaussian mixture, W · N(−1, 1) + (1 − W) · N(1, 1) where W ∼ Bernoulli(1/2). Under each reward setting, we run five different algorithms: UCB1, RCB with Uniform and Rademacher perturbations, and FTPL via Gaussian N(0, σ²) and Double-exponential(σ) perturbations, after using grid search to tune the confidence levels for the confidence based algorithms and the parameter σ for the FTPL algorithms. All tuned confidence levels and parameters are specified in Figure 1. We compare the performance of the perturbation based algorithms to UCB1 in terms of average regret R(t)/t, which is expected to converge to zero more rapidly for a better algorithm.1

The average regret plots in Figure 1 show a similar pattern: the FTPL algorithms via Gaussian and Double-exponential perturbations consistently perform best after parameter tuning, while the UCB1 algorithm works as well as them in all reward settings except the Rademacher reward. The RCB algorithms with Uniform and Rademacher perturbations are slightly worse than UCB1 in the early stages, but perform comparably to UCB1 after enough iterations. In the Rademacher reward case, which is discrete, RCB with the Uniform perturbation slightly outperforms UCB1.

Note that the main contribution of this work is to establish theoretical foundations for a large family of perturbation based algorithms (including those used in this section). Our numerical results are not intended to show the superiority of perturbation methods but to demonstrate that they are competitive with Thompson Sampling and UCB. Note that in more complex bandit problems, sampling from the posterior and optimistic optimization can prove to be computationally challenging.
Accordingly, our work paves the way for designing efficient perturbation methods in complex settings, such as stochastic linear bandits and stochastic combinatorial bandits, that have both computational advantages and low regret guarantees. Furthermore, perturbation approaches based on the Double-exponential distribution are of special interest from a privacy viewpoint since that distribution figures prominently in the theory of differential privacy [10, 27, 26].

[Figure 1: Average regret for bandit algorithms in four reward settings (best viewed in color): (a) Uniform reward, U[−1, 1]; (b) Rademacher reward, R{−1, 1}; (c) Gaussian reward, N(0, 1); (d) Gaussian mixture, 0.5 · N(−1, 1) + 0.5 · N(1, 1).]

1 https://github.com/Kimbaekjin/Perturbation-Methods-StochasticMAB

6 Conclusion

We provided the first general analysis of perturbations for the stochastic multi-armed bandit problem. We believe that our work paves the way for similar extensions to more complex settings, e.g., stochastic linear bandits, stochastic partial monitoring, and Markov decision processes. We also showed that the open problem regarding minimax optimal perturbations for adversarial bandits cannot be solved in two ways that might seem very natural. While our results are negative, they do point the way to a possible affirmative solution of the problem. They led us to a conjecture that the optimal perturbation, if it exists, will be of Fréchet-type.

Acknowledgments

We acknowledge the support of NSF CAREER grant IIS-1452099 and the UM-LSA Associate Professor Support Fund. AT was also supported by a Sloan Research Fellowship.

References

[1] Jacob Abernethy, Chansoo Lee, Abhinav Sinha, and Ambuj Tewari. Online linear optimization via smoothing. In Conference on Learning Theory, pages 807–823, 2014.

[2] Jacob Abernethy, Chansoo Lee, and Ambuj Tewari.
Fighting bandits with a new kind of smoothness. In Advances in Neural Information Processing Systems, pages 2197–2205, 2015.

[3] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.

[4] Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Conference on Learning Theory, pages 217–226, 2009.

[5] Jean-Yves Audibert and Sébastien Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11(Oct):2785–2836, 2010.

[6] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

[7] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[8] Stuart Coles, Joanna Bawa, Lesley Trenner, and Pat Dorazio. An Introduction to Statistical Modeling of Extreme Values, volume 208. Springer, 2001.

[9] Luc Devroye, Gábor Lugosi, and Gergely Neu. Prediction by random-walk perturbation. In Conference on Learning Theory, pages 460–473, 2013.

[10] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3-4):211–407, 2014.

[11] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[12] James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.

[13] Elad Hazan et al. Introduction to online convex optimization.
Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.

[14] Tamir Hazan, George Papandreou, and Daniel Tarlow, editors. Perturbations, Optimization and Statistics. MIT Press, 2017.

[15] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409–426. Springer, 1994.

[16] Josef Hofbauer and William H Sandholm. On the global convergence of stochastic fictitious play. Econometrica, 70(6):2265–2294, 2002.

[17] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

[18] Jussi Kujala and Tapio Elomaa. On following the perturbed leader in the bandit setting. In International Conference on Algorithmic Learning Theory, pages 371–385. Springer, 2005.

[19] Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028, 2014.

[20] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. preprint, 2018.

[21] Malcolm R Leadbetter, Georg Lindgren, and Holger Rootzén. Extremes and Related Properties of Random Sequences and Processes. Springer Science & Business Media, 2012.

[22] Daniel McFadden. Econometric models of probabilistic choice. In Structural Analysis of Discrete Data with Econometric Applications, pages 198–272, 1981.

[23] Sidney I Resnick. Extreme Values, Regular Variation and Point Processes. Springer, 2013.

[24] Steven L Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.

[25] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[26] Aristide Charles Yedia Tossou and Christos Dimitrakakis.
Achieving privacy in the adversarial multi-armed bandit. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[27] Aristide CY Tossou and Christos Dimitrakakis. Algorithms for differentially private multi-armed bandits. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[28] Tim Van Erven, Wojciech Kotłowski, and Manfred K Warmuth. Follow the leader with dropout perturbations. In Conference on Learning Theory, pages 949–974, 2014.

[29] Kam Chung Wong, Zifan Li, and Ambuj Tewari. Lasso guarantees for β-mixing heavy tailed time series. Annals of Statistics, 2019.