{"title": "Bootstrapping Upper Confidence Bound", "book": "Advances in Neural Information Processing Systems", "page_first": 12123, "page_last": 12133, "abstract": "Upper Confidence Bound (UCB) method is arguably the most celebrated one used in online decision making with partial information feedback. Existing techniques for constructing confidence bounds are typically built upon various concentration inequalities, which thus lead to over-exploration. In this paper, we propose a non-parametric and data-dependent UCB algorithm based on the multiplier bootstrap. To improve its finite sample performance, we further incorporate second-order correction into the above construction. In theory, we derive both problem-dependent and problem-independent regret bounds for multi-armed bandits under a much weaker tail assumption than the standard sub-Gaussianity. Numerical results demonstrate significant regret reductions by our method, in comparison with several baselines in a range of multi-armed and linear bandit problems.", "full_text": "Bootstrapping Upper Con\ufb01dence Bound\n\nBotao Hao\n\nPurdue University\n\nYasin Abbasi-Yadkori\n\nVinAI\n\nhaobotao000@gmail.com\n\nyasin.abbasi@gmail.com\n\nZheng Wen\nDeepmind\n\nGuang Cheng\n\nPurdue University\n\nzhengwen@google.com\n\nchengg@purdue.edu\n\nAbstract\n\nUpper Con\ufb01dence Bound (UCB) method is arguably the most celebrated one used\nin online decision making with partial information feedback. Existing techniques\nfor constructing con\ufb01dence bounds are typically built upon various concentration\ninequalities, which thus lead to over-exploration. In this paper, we propose a non-\nparametric and data-dependent UCB algorithm based on the multiplier bootstrap.\nTo improve its \ufb01nite sample performance, we further incorporate second-order\ncorrection into the above construction. 
In theory, we derive both problem-dependent\nand problem-independent regret bounds for multi-armed bandits with symmetric\nrewards under a much weaker tail assumption than the standard sub-Gaussianity.\nNumerical results demonstrate signi\ufb01cant regret reductions by our method, in\ncomparison with several baselines in a range of multi-armed and linear bandit\nproblems.\n\n1\n\nIntroduction\n\nIn arti\ufb01cial intelligence, learning to make decisions online plays a critical role in many \ufb01elds, such\nas personalized news recommendation [1], robotics [2] and the game of Go [3]. To learn to make\noptimal decisions as soon as possible, the decision-makers must carefully design an algorithm to\nbalance the trade-off between the exploration and exploitation [4, 5]. Over-exploration could be\nexpensive and unethical in practice, e.g., medical decision making [6, 7, 8]. On the other hand,\ninsuf\ufb01cient exploration tends to make an algorithm stuck at a sub-optimal solution. The delicate\ndesign of exploration methods stands in the heart of online learning and decision making.\nUpper Con\ufb01dence Bound (UCB) [9, 10, 11, 12, 13] is a class of highly effective algorithms in dealing\nwith the exploration-exploitation trade-off in bandits and reinforcement learning. The tightness of\ncon\ufb01dence bound, as is known, is the key ingredient to achieve the optimal degree of explorations. To\nthe best of our knowledge, nearly all the existing works construct con\ufb01dence bounds based on various\nconcentration inequalities, e.g. Hoeffding-type [10], empirical Bernstein type [14] or self-normalized\ntype [13]. Those concentration-based con\ufb01dence bounds, however, are typically conservative since\nthey are data-independent. Concentration inequalities only exploit tail information, e.g., bounded or\nsub-Gaussian, rather than the whole distribution knowledge. 
In general, the loose constant factor may\nresult in con\ufb01dence bounds that are too wide to be informative [15].\nIn this paper, we propose a non-parametric and data-dependent UCB algorithm based on the multiplier\nbootstrap [16, 17, 18, 19, 20], called bootstrapped UCB. The principle is to use the multiplier\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fbootstrapped quantile as the con\ufb01dence bound to enforce the exploration. Inspired by recent advances\non non-asymptotic guarantee and non-asymptotic inference such as [18, 19, 20, 21], we develop\nan explicit second-order correction for the multiplier bootstrapped quantile that ensures the non-\nasymptotic validity. Our algorithm is easy to implement and has the potential to be generalized to\nmore complicated models such as structured contextual bandits.\nIn theory, we develop both problem-dependent and problem-independent regret bounds for multi-\narmed bandits with symmetric rewards under a much weaker tail assumption, i.e., sub-Weibull\ndistribution, than the classical sub-Gaussianity. In this case, it is proven that the mean estimator\ncan still achieve the same problem-independent regret bound as the one under the sub-Gaussian\nassumption. Note that our result does not rely on other sophisticated approaches such as median-\nof-means or Catoni\u2019s M-estimator in [22]. A key technical tool we propose is a new concentration\ninequality for the sum of sub-Weibull random variables. Empirically, we evaluate our method in\nseveral multi-armed and linear bandit models. 
When the exact posterior is unavailable or the noise\nvariance is mis-speci\ufb01ed, the bootstrapped UCB demonstrates superior performance over variants of\nThompson sampling and concentration-based UCB due to its non-parametric and data-dependent\nnature.\nRecently, an increasing number of works [23, 24, 25, 26] study bootstrap methods for multi-armed\nand contextual bandits as an alternative to Thompson sampling. Most treat the bootstrap just as a way\nto randomize historical data (without any theoretical guarantee). One exception is [27] who derive a\nregret bound for Bernoulli bandit by adding pseudo observations. However, their method cannot be\neasily extended to unbounded cases, and their analyses heavily limit to the Bernoulli assumption. In\ncontrast, our method applies to a broader class of bandit models with rigorous regret analysis.\nThe rest of the paper is organized as follows. Section 2 introduces the basic setup and our bootstrapped\nUCB algorithm. Section 3 provides the regret analysis and Section 4 conducts several experiments.\nNotations. Throughout the paper, we denote Pw(\u00b7), Ew(\u00b7) as the probability and expectation\noperator with respect to the distribution of the vector w only, conditioning on other random variables.\nWe use similar notations for Py(\u00b7), Ey(\u00b7) with respect to y only. [n] means the set {1, 2, . . . , n}. We\ndenote boldface lower letters (e.g. x, y) as a vector. For a set E, we de\ufb01ne its complement as E c.\n\n2 Bootstrapped UCB\n\nProblem setup. As a fruit \ufb02y, we illustrate our idea on the stochastic multi-armed bandit problem\n[28, 5]. In detail, the decision-makers interact with an environment for T rounds. In round t \u2208 [T ], the\ndecision-makers pull an arm It \u2208 [K] and observes its reward yIt which is drawn from a distribution\nassociated with the arm It, denoted by PIt with an unknown mean \u00b5It. 
Without loss of generality, we assume arm 1 is the optimal arm, that is, \u00b51 = max_{k\u2208[K]} \u00b5k. In multi-armed bandit problems, the objective is to minimize the expected cumulative regret, defined as\n\nR(T) = T\u00b51 \u2212 E[ \u2211_{t=1}^T yt ] = \u2211_{k=2}^K \u2206k E[ \u2211_{t=1}^T I{It = k} ],\n\n(2.1)\n\nwhere \u2206k = \u00b51 \u2212 \u00b5k is the sub-optimality gap for arm k, and I{\u00b7} is an indicator function. Here, the second equality is from the regret decomposition lemma (Lemma 4.5 in [5]). We call an upper bound of R(T) problem-independent if the bound only depends on the distributional assumption and not on the specific bandit problem, say the gap \u2206k.\n\nUpper Confidence Bound. The upper confidence bound (UCB) algorithm [10] is based on the principle of optimism in the face of uncertainty. The key idea is to act as if the environment (parameterized by \u00b5k in multi-armed bandits) is as nice as plausibly possible. Concretely, a plausible environment refers to an upper confidence bound G(yn, 1 \u2212 \u03b1) for the true mean \u00b5, of the form\n\nG(yn, 1 \u2212 \u03b1) = {x \u2208 R : x \u2212 \u00afyn \u2264 h\u03b1(yn)},\n\n(2.2)\n\nwhere yn = (y1, . . . , yn)\u22a4 is the sample vector, \u00afyn is the empirical mean, \u03b1 \u2208 (0, 1) is the confidence level, and h\u03b1 : R^n \u2192 R+ is a threshold that could be either data-dependent or data-independent.\nDefinition 2.1. We define G(yn, 1 \u2212 \u03b1) as a non-asymptotic upper confidence bound if for any sample size n \u2265 1, the following inequality holds:\n\nP( \u00b5 \u2208 G(yn, 1 \u2212 \u03b1) ) \u2265 1 \u2212 \u03b1.\n\n(2.3)\n\nIn bandit problems, a non-asymptotic control on the confidence level is more commonly used. 
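The non-asymptotic requirement (2.3) can be checked numerically. Below is a minimal sketch, assuming standard Gaussian rewards and the Hoeffding threshold h\u03b1(yn) = (2 log(1/\u03b1)/n)^{1/2} (an illustrative choice; numpy only):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, alpha, reps = 0.0, 20, 0.05, 5000

# Hoeffding threshold for 1-sub-Gaussian rewards: h = sqrt(2*log(1/alpha)/n)
h = np.sqrt(2 * np.log(1 / alpha) / n)

# Sample means over many independent experiments of size n
ybar = rng.standard_normal((reps, n)).mean(axis=1) + mu

# mu is covered by G(y_n, 1 - alpha) exactly when mu - ybar <= h
coverage = np.mean(mu - ybar <= h)
assert coverage >= 1 - alpha   # the non-asymptotic guarantee (2.3)
```

The coverage estimated here is for one fixed, finite n; the point of Definition 2.1 is that the guarantee must hold at every such finite sample size, not merely in the limit.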
This is rather different from the asymptotic validity of confidence bounds in the statistics literature [29].\nA generic UCB algorithm selects the action based on its UCB index \u00afyn + h\u03b1(yn) for different arms. As is well known, the sharper the threshold is, the better exploration-exploitation trade-off one can achieve [5]. By the definition of quantile, the sharpest threshold in (2.2) is the (1\u2212\u03b1)-quantile of the distribution of \u00afyn \u2212 \u00b5. However, this quantile relies on the knowledge of the exact reward distribution and is therefore itself unknown. To evaluate this value, we construct a data-dependent confidence bound based on the multiplier bootstrap.\n\n2.1 Confidence Bound Based on Multiplier Bootstrap\n\nMultiplier Bootstrap. The multiplier bootstrap is a fast and easy-to-implement alternative to the standard bootstrap, and has been successfully applied in various statistical contexts [18, 19, 20]. Its goal is to approximate the distribution of the target statistic by reweighing its summands with random multipliers independent of the data. For instance, in a mean estimation problem, we define a multiplier bootstrapped estimator as n^{\u22121} \u2211_{i=1}^n wi(yi \u2212 \u00afyn) = n^{\u22121} \u2211_{i=1}^n (wi \u2212 \u00afwn)yi, where {wi}_{i=1}^n are some random variables independent of yn, called bootstrap weights. Some classical weights are as follows:\n\n\u2022 Efron\u2019s bootstrap weights. (w1, . . . , wn) is a multinomial random vector with parameters (n; n^{\u22121}, . . . , n^{\u22121}). This is the standard nonparametric bootstrap [30].\n\u2022 Gaussian weights. wi\u2019s are i.i.d. standard Gaussian random variables. This is closely related to Gaussian approximation in statistics [19].\n\u2022 Rademacher weights. wi\u2019s are i.i.d. Rademacher variables. This is closely related to symmetrization in learning theory.\n\nThe bootstrap principle suggests that the (1\u2212\u03b1)-quantile of the distribution of n^{\u22121} \u2211_{i=1}^n wi(yi \u2212 \u00afyn), conditionally on yn, could be used to approximate the (1\u2212\u03b1)-quantile of the distribution of \u00afyn \u2212 \u00b5. As the first building block, the multiplier bootstrapped quantile is defined as\n\nq\u03b1(yn \u2212 \u00afyn) := inf{ x \u2208 R | Pw( n^{\u22121} \u2211_{i=1}^n wi(yi \u2212 \u00afyn) > x ) \u2264 \u03b1 }.\n\n(2.4)\n\nThe question is whether q\u03b1(yn \u2212 \u00afyn) is a valid threshold for any sample size n \u2265 1.\n\n2.2 Second-order Correction\nMost statistical theories guarantee the asymptotic validity of q\u03b1(yn \u2212 \u00afyn) by the multiplier central limit theorem [31]. However, we show that such a claim is valid non-asymptotically at the cost of adding a second-order correction. The next theorem rigorously characterizes this phenomenon under a symmetry assumption on the reward. Moreover, in Section A in the supplement, we show that without the second-order correction, a naive bootstrapped UCB will result in linear regret.\nTheorem 2.2 (Non-asymptotic Second-order Correction). Suppose {yi}_{i=1}^n are i.i.d. random variables symmetric about their mean \u00b5, and the bootstrap weights {wi}_{i=1}^n are i.i.d. Rademacher random variables. 
For two arbitrary parameters \u03b1, \u03b4 \u2208 (0, 1), the following inequality holds for any sample size n \u2265 1:\n\nPy( \u00afyn \u2212 \u00b5 > q\u03b1(1\u2212\u03b4)(yn \u2212 \u00afyn) + \u221a(log(2/\u03b1\u03b4)/n) \u00b7 \u03d5(yn) ) \u2264 2\u03b1,\n\n(2.5)\n\nwhere the sum q\u03b1(1\u2212\u03b4)(yn \u2212 \u00afyn) + \u221a(log(2/\u03b1\u03b4)/n) \u00b7 \u03d5(yn) is the bootstrapped threshold, and \u03d5(yn) is a non-negative function satisfying Py(|\u00afyn \u2212 \u00b5| \u2265 \u03d5(yn)) \u2264 \u03b1.\n\nThe detailed proof is deferred to Section B.1 in the supplement. In (2.5), the bootstrapped threshold may be interpreted as a main term, i.e., q\u03b1(1\u2212\u03b4)(yn \u2212 \u00afyn) (at a shrunk confidence level), plus a second-order correction term, i.e., (log(2/\u03b1\u03b4)/n)^{1/2} \u03d5(yn). The latter is added to guarantee the non-asymptotic validity of the bootstrapped threshold. In the above, \u03d5(yn) could be any preliminary upper bound on \u00afyn \u2212 \u00b5. Hence, Theorem 2.2 transforms a possibly coarse prior bound \u03d5(yn) on quantiles into a more accurate version that is based on a main term estimated by the multiplier bootstrap plus a second-order correction term based on \u03d5(yn) multiplied by an O(n^{\u22121/2}) factor.\nRemark 2.3 (Choice of \u03d5(yn)). If {yi}_{i=1}^n are independent 1-sub-Gaussian random variables, a natural choice of \u03d5(yn) is (2 log(1/\u03b1)/n)^{1/2} by Hoeffding\u2019s inequality (Lemma 2). Plugging it into (2.5) and letting \u03b4 = 1/2, the bootstrapped threshold in (2.5) becomes\n\nq\u03b1/4(yn \u2212 \u00afyn) + \u221a(2 log(8/\u03b1)/n),\n\n(2.6)\n\nwhere the first term is the main term and the second is the second-order correction. Lemma B.3 in the supplement shows that the main term is of order at least O(n^{\u22121/2}) as n grows, which implies the second-order correction is just a remainder term. 
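The corrected threshold in (2.5) is straightforward to compute. A minimal numpy sketch, assuming 1-sub-Gaussian rewards and the Hoeffding choice of \u03d5(yn) from Remark 2.3 (the resample count B and the seeds here are arbitrary illustrative choices, not from the paper):

```python
import numpy as np

def bootstrapped_threshold(y, alpha, delta=0.5, B=2000, rng=None):
    """Main term q_{alpha(1-delta)} plus second-order correction, as in (2.5),
    with phi(y_n) = sqrt(2*log(1/alpha)/n) for 1-sub-Gaussian rewards."""
    rng = np.random.default_rng(0) if rng is None else rng
    y = np.asarray(y, dtype=float)
    n = len(y)
    centered = y - y.mean()
    w = rng.choice([-1.0, 1.0], size=(B, n))       # Rademacher multipliers
    boot = w @ centered / n                        # n^{-1} sum_i w_i (y_i - ybar)
    main = np.quantile(boot, 1 - alpha * (1 - delta))
    phi = np.sqrt(2 * np.log(1 / alpha) / n)       # preliminary Hoeffding bound
    return main + np.sqrt(np.log(2 / (alpha * delta)) / n) * phi

# The data-dependent threshold shrinks as the sample grows.
rng = np.random.default_rng(1)
t_small = bootstrapped_threshold(rng.standard_normal(20), alpha=0.05)
t_large = bootstrapped_threshold(rng.standard_normal(500), alpha=0.05)
assert t_large < t_small
```

Both the bootstrap quantile and the correction scale like n^{\u22121/2} up to logarithms, so the whole threshold contracts at the usual rate while remaining tied to the observed sample.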
We emphasize that the remainder term is obviously not sharp; sharpening it is left for future work.\nRemark 2.4. Existing works on UCB-type algorithms typically utilize various concentration inequalities, e.g. Hoeffding\u2019s inequality [10] or empirical Bernstein\u2019s inequality [14], to find a valid threshold h\u03b1(yn). However, these thresholds are not data-dependent and only use the tail information, rather than fully exploiting the knowledge of the whole distribution. This is typically conservative, and leads to over-exploration.\nRemark 2.5. Empirical KL-UCB [32] used empirical likelihood to build confidence intervals for general distributions that have support in [0, 1]. Although empirical KL-UCB is also data-dependent, our proposed method comes from a very different non-parametric perspective and uses different tools, namely the bootstrap. In practice, resampling tends to be more computationally efficient, as it avoids solving a convex optimization problem at each round, unlike empirical KL-UCB. Moreover, our method can work with unbounded rewards and we believe it is easier to generalize to structured bandits, e.g. the linear bandit.\n\nIn Figure 1, we compare different approaches to calculating a 95% confidence bound for the population mean based on samples from a truncated-normal distribution. When the sample size is extremely small (\u2264 10), the naive bootstrap (without any correction) cannot output a valid threshold since the bootstrapped quantile is smaller than the true 95% quantile. This confirms the necessity of the second-order correction. When the sample size increases, our bootstrapped threshold converges to the truth rapidly. This confirms that the correction term is just a small remainder term. 
Additionally, the bootstrapped threshold is shown to be sharper than Hoeffding\u2019s bound and the empirical Bernstein bound when the sample size is large (see the right panel of Figure 1).\n\nFigure 1: 95% confidence bound of the sample mean.\n\n2.3 Main Algorithm: Bootstrapped UCB\n\nBased on the above theoretical findings, we conclude that bootstrapped UCB will select the arm according to its UCB index defined as below:\n\nUCBk(t) = \u00afy_{nk,t} + q\u03b1(1\u2212\u03b4)(y_{nk,t} \u2212 \u00afy_{nk,t}) + \u221a(log(2/\u03b1\u03b4)/nk,t) \u00b7 \u03d5(y_{nk,t}),\n\n(2.7)\n\nwhere nk,t is the number of pulls for arm k until time t. Practically, we may use Monte Carlo quantile approximation to get an approximate bootstrapped quantile q\u0303\u03b1(1\u2212\u03b4)(y_{nk,t} \u2212 \u00afy_{nk,t}, wB); a corresponding theorem controlling the approximation error of the bootstrapped quantile is also derived (see Section D in the supplement for details). The algorithm is summarized in Algorithm 1. The computational complexity at step t is \u00d5(Bt) \u2264 \u00d5(BT). Compared with vanilla UCB, the extra factor B is due to resampling. In practice, the choice of B is seldom treated as a tuning parameter, but is usually determined by the available computational resources.\n\nAlgorithm 1 Bootstrapped UCB\nInput: the number of bootstrap repetitions B, hyper-parameter \u03b4.\nfor t = 1 to K do\n    Pull each arm once to initialize the algorithm.\nend\nfor t = K + 1 to T do\n    Set confidence level \u03b1 = 1/(t + 1).\n    Calculate the bootstrapped quantile q\u0303\u03b1(1\u2212\u03b4)(y_{nk,t} \u2212 \u00afy_{nk,t}, wB).\n    Pull the arm It = argmax_{k\u2208[K]} ( \u00afy_{nk,t} + q\u0303\u03b1(1\u2212\u03b4)(y_{nk,t} \u2212 \u00afy_{nk,t}, wB) + (log(2/\u03b1\u03b4)/nk,t)^{1/2} \u03d5(y_{nk,t}) ).\n    Receive reward y_{It}.\nend\n\n3 Regret Analysis\n\nIn Section 3.1, we derive regret bounds for bootstrapped UCB. 
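As a runnable companion to Algorithm 1, the following sketch simulates it on a two-armed Gaussian bandit. The Gaussian reward model, the arm means, and the Hoeffding choice of \u03d5 are illustrative assumptions for this sketch, not the paper's experimental setup:

```python
import numpy as np

def bootstrapped_ucb(means, T, B=200, delta=0.1, sigma=1.0, seed=0):
    """Sketch of Algorithm 1 on a Gaussian K-armed bandit; returns pull counts."""
    rng = np.random.default_rng(seed)
    K = len(means)
    # Initialize by pulling each arm once
    rewards = [[means[k] + sigma * rng.standard_normal()] for k in range(K)]
    for t in range(K, T):
        alpha = 1.0 / (t + 1)                 # confidence level, as in Algorithm 1
        ucb = np.empty(K)
        for k in range(K):
            y = np.asarray(rewards[k])
            n = len(y)
            w = rng.choice([-1.0, 1.0], size=(B, n))   # Rademacher multipliers
            q = np.quantile(w @ (y - y.mean()) / n, 1 - alpha * (1 - delta))
            phi = sigma * np.sqrt(2 * np.log(1 / alpha) / n)  # preliminary bound
            ucb[k] = y.mean() + q + np.sqrt(np.log(2 / (alpha * delta)) / n) * phi
        k_star = int(np.argmax(ucb))
        rewards[k_star].append(means[k_star] + sigma * rng.standard_normal())
    return np.array([len(r) for r in rewards])

counts = bootstrapped_ucb(means=[1.0, 0.0], T=300)
assert counts.argmax() == 0        # pulls concentrate on the optimal arm
```

The per-step cost is dominated by the B \u00d7 n resampling matrix, consistent with the \u00d5(Bt) complexity noted above.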
Moreover, we show that naive bootstrapped UCB will result in linear regret in some cases in Section A in the supplement.\n\n3.1 Regret Bound for Bootstrapped UCB\n\nFor multi-armed bandit problems, most of the literature [5] considers sub-Gaussian rewards. In this work, we move beyond sub-Gaussianity and consider rewards under a much weaker tail assumption, the so-called sub-Weibull distribution. As shown in [33, 34], it is characterized by the right tail of the Weibull distribution and generalizes sub-Gaussian and sub-exponential distributions.\n\nDefinition 1 (Sub-Weibull Distribution). We define y as a sub-Weibull random variable if it has a bounded \u03c8\u03b2-norm. The \u03c8\u03b2-norm of y for any \u03b2 > 0 is defined as\n\n\u2016y\u2016_{\u03c8\u03b2} := inf{ C \u2208 (0, \u221e) : E[exp(|y|^\u03b2/C^\u03b2)] \u2264 2 }.\n\nParticularly, when \u03b2 = 1 or 2, sub-Weibull random variables reduce to sub-exponential or sub-Gaussian random variables, respectively. Clearly, the smaller \u03b2 is, the heavier the tail of the random variable. The next theorem provides a corresponding concentration inequality for the sum of independent sub-Weibull random variables.\nTheorem 3.1 (Concentration Inequality for Sub-Weibull Distribution). Suppose {yi}_{i=1}^n are independent sub-Weibull random variables with \u2016yi\u2016_{\u03c8\u03b2} \u2264 \u03c3. Then there exists an absolute constant C\u03b2 only depending on \u03b2 such that for any a = (a1, . . . 
, an) \u2208 R^n and 0 < \u03b1 < 1/e^2,\n\n| \u2211_{i=1}^n ai yi \u2212 E( \u2211_{i=1}^n ai yi ) | \u2264 C\u03b2 \u03c3 ( \u2016a\u2016_2 (log \u03b1^{\u22121})^{1/2} + \u2016a\u2016_\u221e (log \u03b1^{\u22121})^{1/\u03b2} )\n\nwith probability at least 1 \u2212 \u03b1.\n\nThe proof relies on a precise characterization of the p-th moment of a Weibull random variable and standard symmetrization arguments. Details are deferred to Section B.2 in the supplement. This theorem generalizes the Hoeffding-type concentration inequalities for sub-Gaussian random variables (see, e.g. Proposition 5.10 in [35]), and Bernstein-type concentration inequalities for sub-exponential random variables (see, e.g. Proposition 5.16 in [35]) up to some constants.\nIn Theorem 3.2, we provide both problem-dependent and problem-independent regret bounds.\nTheorem 3.2. Consider a stochastic K-armed sub-Weibull bandit, where the noise follows a symmetric sub-Weibull distribution with its \u03c8\u03b2-norm upper bounded by \u03c3. Denote nk,t as the number of pulls for arm k until time t. We choose \u03d5 according to Theorem 3.1 as follows:\n\n\u03d5(y_{nk,t}) = C\u03b2 \u03c3 ( \u221a(log(1/\u03b1)/nk,t) + (log(2/\u03b1))^{1/\u03b2}/nk,t ),\n\n(3.1)\n\nand let the confidence level \u03b1 = 1/T^2. For any round T, the problem-dependent regret of bootstrapped UCB is upper bounded by\n\nR(T) \u2264 \u2211_{k:\u2206k>0} 128 C\u03b2^2 \u03c3^2 log T / \u2206k + 2^{3+1/\u03b2} C\u03b2 \u03c3 K (log T)^{1/\u03b2} + 4 \u2211_{k=2}^K \u2206k,\n\n(3.2)\n\nwhere C\u03b2 is some absolute constant from Theorem 3.1, and \u2206k is the sub-optimality gap. 
Moreover, if the round T \u2265 2^{2/\u03b2\u22123} K (log T)^{2/\u03b2\u22121}, the problem-independent regret of bootstrapped UCB is upper bounded by\n\nR(T) \u2264 32\u221a2 \u00b7 C\u03b2 \u03c3 \u221a(TK log T) + 4K\u00b5_1^\u2217.\n\n(3.3)\n\nThe main proof structure follows the standard analysis of UCB [5] and relies on a sharp upper bound for the (data-dependent) bootstrapped quantile term by Theorem 3.1. Details are deferred to Section B.3 in the supplement. When \u03b2 \u2265 1, (3.2) provides a logarithmic regret bound that matches the state-of-the-art result [5]. When \u03b2 < 1, we have a non-negligible term (log T)^{1/\u03b2} that is the price paid for heavy-tailedness. However, this term does not depend on the gap \u2206k. Therefore, we have an optimal problem-independent regret bound.\nRemark 3.3. The choice of \u03b1 = 1/T^2 leads to an easy analysis. Using techniques similar to those in Chapter 8.2 of [5], we can achieve a similar regret bound by setting \u03b1t = 1/(t log^\u03c4(t)) for any \u03c4 > 0.\nRemark 3.4. [22] considers bandits with heavy tails (moments of order (1 + \u03b5)) based on a median-of-means estimator. As mentioned in [36], there are two disadvantages of the median-of-means approach: (a) it involves an additional tuning parameter; (b) it is numerically unstable for small sample sizes. In contrast, we identify a class of heavy-tailed bandits (sub-Weibull bandits) where mean estimators can still achieve regret bounds of the same order as those under sub-Gaussian reward distributions. The reason is that although a sub-Weibull random variable has a heavier tail than a sub-Gaussian one, its tail still has an exponential-like decay.\n\n4 Experiments\n\nIn Section 4.1, we consider multi-armed bandits with both symmetric and asymmetric rewards. In Section 4.2, we extend our method to linear bandits. 
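For simulating the symmetric sub-Weibull rewards of Definition 1, one convenient family is y = \u00b5 + s \u00b7 |g|^{2/\u03b2} with g standard Gaussian and s an independent Rademacher sign. This is a hypothetical construction for illustration, not the reward model used in the experiments below; its \u03c8\u03b2-norm has a closed form that follows directly from the Gaussian moment generating function:

```python
import numpy as np

beta = 0.5                          # tails heavier than sub-exponential
C = (8.0 / 3.0) ** (1.0 / beta)     # candidate psi_beta-norm of s * |g|^(2/beta)

# For g ~ N(0, 1): |y - mu|^beta = g^2, and E[exp(g^2 / C^beta)] equals
# (1 - 2/C^beta)^(-1/2) whenever C^beta > 2. At C^beta = 8/3 this is exactly 2,
# which is the defining condition of the psi_beta-norm.
mgf = (1.0 - 2.0 / C ** beta) ** -0.5
assert np.isclose(mgf, 2.0)

# Drawing n symmetric sub-Weibull rewards around a mean mu:
rng = np.random.default_rng(0)
mu, n = 0.3, 10_000
s = rng.choice([-1.0, 1.0], size=n)               # Rademacher signs
y = mu + s * np.abs(rng.standard_normal(n)) ** (2.0 / beta)
```

Since E[exp(|y \u2212 \u00b5|^\u03b2/C^\u03b2)] is decreasing in C, the infimum in Definition 1 is attained exactly at C = (8/3)^{1/\u03b2}, so the noise here has \u03c8\u03b2-norm (8/3)^{1/\u03b2}.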
Implementation details and some additional experimental results are deferred to Section E in the supplement.\n\n4.1 Multi-armed Bandit\n\nIn this section, we compare bootstrapped UCB (Algorithm 1) with three baselines: Upper Confidence Bound based on concentration inequalities (Vanilla UCB), Thompson sampling with the normal Jeffery prior [37] (Jeffery-TS) and Thompson sampling with the Beta prior [38] (Bernoulli-TS). For bounded rewards, we also compare with Giro [27]1, a sampling-based exploration method that adds artificial pseudo observations {0, 1} to escape from local optima, and empirical KL-UCB [39] using the package PymaBandits. For the preliminary bound \u03d5(yn), we simply choose the one derived from the concentration inequality. Note that the second-order correction term in (2.5) is conservative. For practitioners, we suggest setting the correction term to \u03d5(yn)/\u221an. To be fair, we choose the confidence level \u03b1 = 1/(1 + t) for both UCB1 and bootstrapped UCB, and \u03b4 = 0.1 in (2.5). All algorithms above require knowledge of an upper bound on the noise standard deviation. The number of bootstrap repetitions is B = 200, and the number of arms is K = 5.\nFirst, we consider symmetric rewards with a mean parameter \u00b5k generated from Uniform(\u22121, 1). The noise follows either a truncated-normal distribution within [\u22121, 1] or a standard Gaussian distribution. From Figure 2, bootstrapped UCB outperforms Jeffery-TS and Vanilla UCB for the truncated-normal bandit and has comparable or sometimes better performance than empirical KL-UCB. It is clear that if the reward distribution is exactly Gaussian and the plug-in estimate for the noise standard deviation is the truth, Jeffery-TS should be the best. However, when the posterior (plots (a),(b)) or the noise standard deviation (plot (c)) are mis-specified, the performance of TS deteriorates fast. 
Since (concentration-based) Vanilla UCB only uses the tail information (bounded or sub-Gaussian), it is very conservative and results in bad regret, as expected.\n\nFigure 2: Cumulative regrets for truncated-normal bandit and Gaussian bandit. Sigma is the upper bound on the standard deviation of the noise. The results are averaged over 200 realizations.\n\n1We have implemented Giro in the unbounded reward case, where it could result in linear regret in most cases. See Figure 7 in the supplement. So, it is unclear what the best way to add pseudo observations is in this case.\n\nSecond, we consider asymmetric rewards with a mean parameter \u00b5k generated from Uniform(0.25, 0.75). For the Bernoulli bandit, the reward follows Ber(\u00b5k); for the Beta bandit, the reward follows2 Beta(v\u00b5k, v(1 \u2212 \u00b5k)) for v = 8. From Figure 3, bootstrapped UCB outperforms Vanilla UCB and Giro in both cases, and outperforms Bernoulli-TS for the Beta bandit. In fact, we do not expect to beat Bernoulli-TS for the Bernoulli bandit, since TS fully makes use of the distribution knowledge in this case. One possible explanation is that our method is non-parametric.\n\nFigure 3: Cumulative regrets for Bernoulli bandit and Beta bandit. The results are averaged over 200 realizations.\n\nThird, we demonstrate the robustness of bootstrapped UCB to mis-specification of the noise standard deviation. In the left panel of Figure 4, we consider the cumulative regret at round T = 2000 of the standard Gaussian bandit. As one can see, when we increase the plug-in upper bound on the standard deviation of the noise, bootstrapped UCB is more robust than Bernoulli-TS and Vanilla UCB.\n\nFigure 4: The left panel is the cumulative regret over noise levels while the right panel is the instance-dependent regret of various algorithms as a function of gaps. 
The results are averaged over 200 realizations.\n\nLast, we present a frequentist instance-dependent regret curve for the truncated-normal bandit; the experimental setup follows [40]. We plot cumulative regrets at T = 1000 of various algorithms with respect to the instance gap \u2206 and the mean vector \u00b5 = (\u2206, 0, 0, 0, 0). The results are summarized in the right panel of Figure 4.\n\n4.2 Linear Bandit\n\nWe extend our method to the linear bandit case. The basic setup follows the one in [15]. In detail, \u03b8\u2217 \u2208 R^d is drawn from a multivariate Gaussian distribution with mean vector \u00b5 = 0 and covariance matrix \u03a3 = 10Id. The noise follows a standard Gaussian distribution. There are 100 actions with feature vector components drawn uniformly at random from [\u22121/\u221a10, 1/\u221a10]. We consider two state-of-the-art methods: Thompson sampling for linear bandit [41] (TSL) and optimism in the face of uncertainty for linear bandits [13] (OFUL). Following the principle of constructing a second-order correction in mean problems (Theorem 2.2), we construct the bootstrapped UCB for linear bandit (BUCBL) as follows: at each round t, the action is selected as argmax_x ( x^\u22a4\u03b8\u0302t + \u03b2^{BUCBL}_{t,1\u2212\u03b4} \u2016x\u2016_{Vt^{\u22121}} ), where \u03b2^{BUCBL}_{t,1\u2212\u03b4} = q\u03b1(\u03b8\u0302^{(b)}_t \u2212 \u03b8\u0302t) + \u03b2^{OFUL}_{t,1\u2212\u03b4,\u03c3}/\u221at. The formal definitions of \u03b8\u0302t, \u03b8\u0302^{(b)}_t, \u03b2^{OFUL}_{t,1\u2212\u03b4,\u03c3} and some basic setups are given in Section E.2 in the supplement. To be fair, the confidence level for all methods is set to be \u03b4 = 1/(1 + t) and we plug in the true standard deviation of the noise for each method. From Figure 5, we can see that bootstrapped UCB greatly improves the cumulative regret over TSL and OFUL.\n\n2We adopt the technique in [38] to run Thompson sampling with [0, 1] rewards. In particular, for any reward yt \u2208 [0, 1], we draw a pseudo reward \u0177t \u223c Ber(yt), and then use \u0177t instead of yt in the algorithm.\n\nFigure 5: Cumulative regret for linear bandit.\n\n5 Conclusion\n\nIn this paper, we propose a novel class of non-parametric and data-driven UCB algorithms based on the multiplier bootstrap. It is easy to implement and has the potential to be generalized to other complex structured problems. As future work, we will evaluate our idea on other structured contextual bandits and reinforcement learning problems.\n\nAcknowledgments\n\nWe thank Tor Lattimore for helpful discussions. Guang Cheng would like to acknowledge support by NSF DMS-1712907, DMS-1811812, DMS-1821183, and Office of Naval Research (ONR N00014-18-2759). In addition, Guang Cheng is a visiting member of the Institute for Advanced Study, Princeton (funding provided by Eric and Wendy Schmidt) and a visiting Fellow of SAMSI for the Deep Learning Program in the Fall of 2019; he would like to thank both Institutes for their hospitality.\n\nReferences\n[1] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW \u201910, pages 661\u2013670, New York, NY, USA, 2010. ACM.\n\n[2] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238\u20131274, 2013.\n\n[3] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.\n\n[4] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. 
MIT press, 2018.\n\n[5] Tor Lattimore and Csaba Szepesv\u00e1ri. Bandit algorithms. preprint, 2018.\n\n[6] Hamsa Bastani and Mohsen Bayati. Online decision-making with high-dimensional covariates. Available\n\nat SSRN 2661896, 2015.\n\n[7] Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Mostly exploration-free algorithms for contextual\n\nbandits. arXiv preprint arXiv:1704.09011, 2017.\n\n9\n\n\f[8] Sarah Bird, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. Exploring or exploit-\ning? social and ethical implications of autonomous experimentation in ai. In Workshop on Fairness,\nAccountability, and Transparency in Machine Learning, 2016.\n\n[9] Peter Auer. Using con\ufb01dence bounds for exploitation-exploration trade-offs. Journal of Machine Learning\n\nResearch, 3(Nov):397\u2013422, 2002.\n\n[10] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.\n\nMachine learning, 47(2-3):235\u2013256, 2002.\n\n[11] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback.\n\n2008.\n\n[12] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized\nnews article recommendation. In Proceedings of the 19th international conference on World wide web,\npages 661\u2013670. ACM, 2010.\n\n[13] Yasin Abbasi-Yadkori, D\u00e1vid P\u00e1l, and Csaba Szepesv\u00e1ri. Improved algorithms for linear stochastic bandits.\n\nIn Advances in Neural Information Processing Systems, pages 2312\u20132320, 2011.\n\n[14] Volodymyr Mnih, Csaba Szepesv\u00e1ri, and Jean-Yves Audibert. Empirical bernstein stopping. In Proceedings\n\nof the 25th international conference on Machine learning, pages 672\u2013679. ACM, 2008.\n\n[15] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of\n\nOperations Research, 39(4):1221\u20131243, 2014.\n\n[16] Donald B Rubin. The bayesian bootstrap. 
The Annals of Statistics, pages 130–134, 1981.

[17] Chien-Fu Jeff Wu. Jackknife, bootstrap and other resampling methods in regression analysis. The Annals of Statistics, 14(4):1261–1295, 1986.

[18] Sylvain Arlot, Gilles Blanchard, and Etienne Roquain. Some nonasymptotic results on resampling in high dimension, I: Confidence regions. The Annals of Statistics, 38(1):51–82, 2010.

[19] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximation of suprema of empirical processes. The Annals of Statistics, 42(4):1564–1597, 2014.

[20] Vladimir Spokoiny and Mayya Zhilova. Bootstrap confidence sets under model misspecification. The Annals of Statistics, 43(6):2653–2675, 2015.

[21] Yun Yang, Zuofeng Shang, and Guang Cheng. Non-asymptotic theory for nonparametric testing. arXiv preprint arXiv:1702.01330, 2017.

[22] Sébastien Bubeck, Nicolò Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.

[23] Adam N. Elmachtoub, Ryan McNellis, Sechan Oh, and Marek Petrik. A practical method for solving contextual bandit problems using decision trees. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11-15, 2017.

[24] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.

[25] Liang Tang, Yexi Jiang, Lei Li, Chunqiu Zeng, and Tao Li. Personalized recommendation via parameter-free contextual bandits. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 323–332. ACM, 2015.

[26] Dean Eckles and Maurits Kaptein. Thompson sampling with the online bootstrap.
arXiv preprint arXiv:1410.4009, 2014.

[27] Branislav Kveton, Csaba Szepesvári, Zheng Wen, Mohammad Ghavamzadeh, and Tor Lattimore. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. arXiv preprint arXiv:1811.05154, 2018.

[28] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[29] George Casella and Roger L. Berger. Statistical inference, volume 2. Duxbury Pacific Grove, CA, 2002.

[30] Bradley Efron. The jackknife, the bootstrap, and other resampling plans, volume 38. SIAM, 1982.

[31] Aad W. van der Vaart. Asymptotic statistics, volume 3. Cambridge University Press, 2000.

[32] Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

[33] Arun Kumar Kuchibhotla and Abhishek Chakrabortty. Moving beyond sub-Gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. arXiv preprint arXiv:1804.02605, 2018.

[34] Mariia Vladimirova and Julyan Arbel. Sub-Weibull distributions: Generalizing sub-Gaussian and sub-exponential properties to heavier-tailed distributions. arXiv preprint arXiv:1905.04955, 2019.

[35] Roman Vershynin. Compressed sensing, chapter Introduction to the non-asymptotic analysis of random matrices, pages 210–268. Cambridge University Press, 2012.

[36] Xi Chen and Wen-Xin Zhou. Robust inference via multiplier bootstrap. The Annals of Statistics, to appear, 2019.

[37] Nathaniel Korda, Emilie Kaufmann, and Rémi Munos. Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems, pages 1448–1456, 2013.

[38] Shipra Agrawal and Navin Goyal.
Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.

[39] Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359–376, 2011.

[40] Tor Lattimore. Refining the confidence level for optimistic bandit strategies. The Journal of Machine Learning Research, 19(1):765–796, 2018.

[41] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.

[42] Sharan Vaswani, Branislav Kveton, Zheng Wen, Anup Rao, Mark Schmidt, and Yasin Abbasi-Yadkori. New insights into bootstrapping for bandits. arXiv preprint arXiv:1805.09793, 2018.

[43] P. Hitczenko, S. J. Montgomery-Smith, and K. Oleszkiewicz. Moment inequalities for sums of certain independent symmetric random variables. Studia Mathematica, 123(1):15–42, 1997.

[44] Michel Talagrand. The supremum of some canonical processes. American Journal of Mathematics, 116(2):283–325, 1994.

[45] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and processes. Springer Science & Business Media, 2013.

[46] Joseph P. Romano and Michael Wolf. Exact and approximate stepdown methods for multiple hypothesis testing. Journal of the American Statistical Association, 100(469):94–108, 2005.

[47] Noga Alon and Joel H. Spencer. The probabilistic method. John Wiley & Sons, 2004.

[48] Radoslaw Adamczak, Alexander E. Litvak, Alain Pajor, and Nicole Tomczak-Jaegermann. Restricted isometry property of matrices with independent columns and neighborly polytopes by random sampling. Constructive Approximation, 34(1):61–88, 2011.

[49] Robert Bogucki. Suprema of canonical Weibull processes.
Statistics & Probability Letters, 107:253–263, 2015.

[50] Victor de la Peña and Evarist Giné. Decoupling: From dependence to independence. Springer Science & Business Media, 2012.