{"title": "Almost Optimal Algorithms for Linear Stochastic Bandits with Heavy-Tailed Payoffs", "book": "Advances in Neural Information Processing Systems", "page_first": 8420, "page_last": 8429, "abstract": "In linear stochastic bandits, it is commonly assumed that payoffs are with sub-Gaussian noises. In this paper, under a weaker assumption on noises, we study the problem of \\underline{lin}ear stochastic {\\underline b}andits with h{\\underline e}avy-{\\underline t}ailed payoffs (LinBET), where the distributions have finite moments of order $1+\\epsilon$, for some $\\epsilon\\in (0,1]$. We rigorously analyze the regret lower bound of LinBET as $\\Omega(T^{\\frac{1}{1+\\epsilon}})$, implying that finite moments of order 2 (i.e., finite variances) yield the bound of $\\Omega(\\sqrt{T})$, with $T$ being the total number of rounds to play bandits. The provided lower bound also indicates that the state-of-the-art algorithms for LinBET are far from optimal. By adopting median of means with a well-designed allocation of decisions and truncation based on historical information, we develop two novel bandit algorithms, where the regret upper bounds match the lower bound up to polylogarithmic factors. To the best of our knowledge, we are the first to solve LinBET optimally in the sense of the polynomial order on $T$. Our proposed algorithms are evaluated based on synthetic datasets, and outperform the state-of-the-art results.", "full_text": "Almost Optimal Algorithms for Linear Stochastic\n\nBandits with Heavy-Tailed Payoffs\n\nHan Shao\u2217\n\nXiaotian Yu\u2217\n\nIrwin King\n\nMichael R. Lyu\n\nDepartment of Computer Science and Engineering\n\nThe Chinese University of Hong Kong\n{hshao,xtyu,king,lyu}@cse.cuhk.edu.hk\n\nAbstract\n\nIn linear stochastic bandits, it is commonly assumed that payoffs are with sub-\nGaussian noises. 
In this paper, under a weaker assumption on noises, we study the\nproblem of linear stochastic bandits with heavy-tailed payoffs (LinBET), where\nthe distributions have finite moments of order 1 + \u03f5, for some \u03f5 \u2208 (0, 1]. We\nrigorously analyze the regret lower bound of LinBET as $\\Omega(T^{\\frac{1}{1+\\epsilon}})$, implying that\nfinite moments of order 2 (i.e., finite variances) yield the bound of $\\Omega(\\sqrt{T})$, with T\nbeing the total number of rounds to play bandits. The provided lower bound also\nindicates that the state-of-the-art algorithms for LinBET are far from optimal. By\nadopting median of means with a well-designed allocation of decisions and truncation\nbased on historical information, we develop two novel bandit algorithms,\nwhere the regret upper bounds match the lower bound up to polylogarithmic factors.\nTo the best of our knowledge, we are the first to solve LinBET optimally in\nthe sense of the polynomial order on T. Our proposed algorithms are evaluated\nbased on synthetic datasets, and outperform the state-of-the-art results.\n\n1 Introduction\n\nThe decision-making model named Multi-Armed Bandits (MAB), where at each time step an algorithm\nchooses an arm among a given set of arms and then receives a stochastic payoff with respect\nto the chosen arm, elegantly characterizes the tradeoff between exploration and exploitation in sequential\nlearning. The algorithm usually aims at maximizing cumulative payoffs over a sequence of\nrounds. A natural and important variant of MAB is linear stochastic bandits with the expected payoff\nof each arm satisfying a linear mapping from the arm information to a real number. The model of\nlinear stochastic bandits enjoys some good theoretical properties, e.g., there exists a closed-form solution\nof the linear mapping at each time step in light of ridge regression. 
Many practical applications\ntake advantage of MAB and its variants to control decision performance, e.g., online personalized\nrecommendations (Li et al., 2010) and resource allocations (Lattimore et al., 2014).\n\nIn most previous studies of MAB and linear stochastic bandits, a common assumption is that noises\nin observed payoffs are sub-Gaussian conditional on historical information (Abbasi-Yadkori et al.,\n2011; Bubeck et al., 2012), which encompasses cases of all bounded payoffs and many unbounded\npayoffs, e.g., payoffs of an arm following a Gaussian distribution. However, there do exist practical\nscenarios of non-sub-Gaussian noises in observed payoffs for sequential decisions, such as high-\nprobability extreme returns in investments for \ufb01nancial markets (Cont and Bouchaud, 2000) and\n\ufb02uctuations of neural oscillations (Roberts et al., 2015), which are called heavy-tailed noises. Thus,\nit is signi\ufb01cant to completely study theoretical behaviours of sequential decisions in the case of\nheavy-tailed noises.\n\n\u2217The \ufb01rst two authors contributed equally.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fMany practical distributions, e.g., Pareto distributions and Weibull distributions, are heavy-tailed,\nwhich perform high tail probabilities compared with exponential family distributions. We consider\na general characterization of heavy-tailed payoffs in bandits, where the distributions have \ufb01nite\nmoments of order 1 + \u03f5, where \u03f5 \u2208 (0, 1]. When \u03f5 = 1, stochastic payoffs are generated from distri-\nbutions with \ufb01nite variances. When 0 <\u03f5< 1, stochastic payoffs are generated from distributions\nwith in\ufb01nite variances (Shao and Nikias, 1993). 
Note that, different from sub-Gaussian noises in the\ntraditional bandit setting, noises from heavy-tailed distributions do not enjoy exponentially decaying\ntails, and thus make it more difficult to learn a parameter of an arm.\n\nThe regret of MAB with heavy-tailed payoffs has been well addressed by Bubeck et al. (2013),\nwhere stochastic payoffs have bounds on raw or central moments of order 1 + \u03f5. For MAB with\nfinite variances (i.e., \u03f5 = 1), the regret of truncation algorithms or median of means recovers the optimal\nregret for MAB under the sub-Gaussian assumption. Recently, Medina and Yang (2016) investigated\ntheoretical guarantees for the problem of linear stochastic bandits with heavy-tailed payoffs\n(LinBET). It is surprising to find that, for \u03f5 = 1, the regret of bandit algorithms by Medina and Yang\n(2016) to solve LinBET is $\\widetilde{O}(T^{\\frac{3}{4}})$ (see footnote 2), which is far away from the regret of the state-of-the-art algorithms\n(i.e., $\\widetilde{O}(\\sqrt{T})$) in linear stochastic bandits under the sub-Gaussian assumption (Dani et al.,\n2008a; Abbasi-Yadkori et al., 2011). Thus, the most interesting and non-trivial question is\n\nIs it possible to recover the regret of $\\widetilde{O}(\\sqrt{T})$ when \u03f5 = 1 for LinBET?\n\nIn this paper, we answer this question affirmatively. Specifically, we investigate the problem of\nLinBET characterized by finite (1 + \u03f5)-th moments, where \u03f5 \u2208 (0, 1]. The problem of LinBET\nraises several interesting challenges. The first challenge is the lower bound of the problem, which\nremains unknown. The technical issues come from the construction of an elegant setting for LinBET,\nand the derivation of a lower bound with respect to \u03f5. The second challenge is how to develop a\nrobust estimator for the parameter in LinBET, because heavy-tailed noises greatly affect errors of\nthe conventional least-squares estimator. 
It is worth mentioning that Medina and Yang (2016) have\ntried to tackle this challenge, but their estimators do not make full use of the contextual information\nof chosen arms to eliminate the effect from heavy-tailed noises, which eventually leads to large\nregrets. The third challenge is how to successfully adopt median of means and truncation to solve\nLinBET with regret upper bounds matching the lower bound as closely as possible.\n\nOur Results. First of all, we rigorously analyze the lower bound on the problem of LinBET,\nwhich enjoys a polynomial order on T as $\\Omega(T^{\\frac{1}{1+\\epsilon}})$. The lower bound provides two essential hints:\none is that finite variances in LinBET yield a bound of $\\Omega(\\sqrt{T})$, and the other is that algorithms\nby Medina and Yang (2016) are sub-optimal. Then, we develop two novel bandit algorithms to\nsolve LinBET based on the basic techniques of median of means and truncation. Both algorithms\nadopt the optimism in the face of uncertainty principle, which is common in bandit problems\n(Abbasi-Yadkori et al., 2011; Munos et al., 2014). The regret upper bounds of the proposed\ntwo algorithms, which are $\\widetilde{O}(T^{\\frac{1}{1+\\epsilon}})$, match the lower bound up to polylogarithmic factors. To the\nbest of our knowledge, we are the first to solve LinBET almost optimally. We conduct experiments\nbased on synthetic datasets, which are generated by Student\u2019s t-distribution and Pareto distribution,\nto demonstrate the effectiveness of our algorithms. Experimental results show that our algorithms\noutperform the state-of-the-art results. The contributions of this paper are summarized as follows:\n\u2022 We provide the lower bound for the problem of LinBET characterized by finite (1 + \u03f5)-th\nmoments, where \u03f5 \u2208 (0, 1]. 
In the analysis, we construct an elegant setting of LinBET,\nwhich results in a regret bound of $\\Omega(T^{\\frac{1}{1+\\epsilon}})$ in expectation for any bandit algorithm.\n\u2022 We develop two novel bandit algorithms, which are named as MENU and TOFU (with\ndetails shown in Section 4). The MENU algorithm adopts median of means with a well-designed\nallocation of decisions and the TOFU algorithm adopts truncation via historical\ninformation. Both algorithms achieve the regret $\\widetilde{O}(T^{\\frac{1}{1+\\epsilon}})$ with high probability.\n\u2022 We conduct experiments based on synthetic datasets to demonstrate the effectiveness of\nour proposed algorithms. By comparing our algorithms with the state-of-the-art results,\nwe show improvements on cumulative payoffs for MENU and TOFU, which are strictly\nconsistent with theoretical guarantees in this paper.\n\nFootnote 2: We omit polylogarithmic factors of T for $\\widetilde{O}(\\cdot)$.\n\n2 Preliminaries and Related Work\n\nIn this section, we first present preliminaries, i.e., notations and learning setting of LinBET. Then,\nwe give a detailed discussion on the line of research for bandits with heavy-tailed payoffs.\n\n2.1 Notations\n\nFor a positive integer K, $[K] \\triangleq \\{1, 2, \\cdots, K\\}$. Let the \u2113-norm of a vector x \u2208 Rd be $\\|x\\|_\\ell \\triangleq (x_1^\\ell + \\cdots + x_d^\\ell)^{\\frac{1}{\\ell}}$,\nwhere \u2113 \u2265 1 and xi is the i-th element of x with i \u2208 [d]. For r \u2208 R, its absolute\nvalue is |r|, its ceiling integer is \u2308r\u2309, and its floor integer is \u230ar\u230b. The inner product of two vectors x, y\nis denoted by x\u22a4y = \u27e8x, y\u27e9. Given a positive definite matrix A \u2208 Rd\u00d7d, the weighted Euclidean\nnorm of a vector x \u2208 Rd is $\\|x\\|_A = \\sqrt{x^\\top A x}$. B(x, r) denotes a Euclidean ball centered at x with\nradius r \u2208 R+, where R+ is the set of positive numbers. Let e be Euler\u2019s number, and Id \u2208 Rd\u00d7d\nan identity matrix. 
Let 1{\u00b7} be an indicator function, and E[X] the expectation of X.\n\n2.2 Learning Setting\n\nFor a bandit algorithm A, we consider sequential decisions with the goal to maximize cumulative\npayoffs, where the total number of rounds for playing bandits is T. For each round t = 1,\u00b7\u00b7\u00b7 , T,\nthe bandit algorithm A is given a decision set Dt \u2286 Rd such that \u2225x\u22252 \u2264 D for any x \u2208 Dt. A\nhas to choose an arm xt \u2208 Dt and then observes a stochastic payoff yt(xt). For notation simplicity,\nwe also write yt = yt(xt). The expectation of the observed payoff for the chosen arm satisfies a\nlinear mapping from the arm to a real number as $y_t(x_t) \\triangleq \\langle x_t, \\theta_* \\rangle + \\eta_t$, where \u03b8\u2217 is an underlying\nparameter with \u2225\u03b8\u2217\u22252 \u2264 S and \u03b7t is a random noise. Without loss of generality, we assume\nE[\u03b7t|Ft\u22121] = 0, where $\\mathcal{F}_{t-1} \\triangleq \\{x_1, \\cdots, x_t\\} \\cup \\{\\eta_1, \\cdots, \\eta_{t-1}\\}$ is a \u03c3-filtration and F0 = \u2205.\nClearly, we have E[yt(xt)|Ft\u22121] = \u27e8xt, \u03b8\u2217\u27e9. For an algorithm A, to maximize cumulative payoffs\nis equivalent to minimizing the regret as\n\n$$R(A, T) \\triangleq \\left( \\sum_{t=1}^{T} \\langle x_t^*, \\theta_* \\rangle \\right) - \\left( \\sum_{t=1}^{T} \\langle x_t, \\theta_* \\rangle \\right) = \\sum_{t=1}^{T} \\langle x_t^* - x_t, \\theta_* \\rangle, \\qquad (1)$$\n\nwhere $x_t^*$ denotes the optimal decision at time t for \u03b8\u2217, i.e., $x_t^* \\in \\arg\\max_{x \\in D_t} \\langle x, \\theta_* \\rangle$. In this paper,\nwe will provide a high-probability upper bound of R(A, T) with respect to A, and provide the lower\nbound for LinBET in expectation for any algorithm. The problem of LinBET is defined as below.\n\nDefinition 1 (LinBET). Given a decision set Dt for time step t = 1,\u00b7\u00b7\u00b7 , T, an algorithm A, of\nwhich the goal is to maximize cumulative payoffs over T rounds, chooses an arm xt \u2208 Dt. 
With\nFt\u22121, the observed stochastic payoff yt(xt) is conditionally heavy-tailed, i.e., $\\mathbb{E}[|y_t|^{1+\\epsilon} \\mid \\mathcal{F}_{t-1}] \\leq b$\nor $\\mathbb{E}[|y_t - \\langle x_t, \\theta_* \\rangle|^{1+\\epsilon} \\mid \\mathcal{F}_{t-1}] \\leq c$, where \u03f5 \u2208 (0, 1], and b, c \u2208 (0, +\u221e).\n\n2.3 Related Work\n\nThe model of MAB dates back to 1952 with the original work by Robbins et al. (1952), and its\ninherent characteristic is the trade-off between exploration and exploitation. The asymptotic lower\nbound of MAB was developed by Lai and Robbins (1985), which is logarithmic with respect to the\ntotal number of rounds. An important technique called upper confidence bound was developed to\nachieve the lower bound (Lai and Robbins, 1985; Agrawal, 1995). Other related techniques to solve\nthe problem of sequential decisions include Thompson sampling (Thompson, 1933; Chapelle and Li,\n2011; Agrawal and Goyal, 2012) and Gittins index (Gittins et al., 2011).\n\nThe problem of MAB with heavy-tailed payoffs characterized by finite (1 + \u03f5)-th moments has\nbeen well investigated (Bubeck et al., 2013; Vakili et al., 2013; Yu et al., 2018). Bubeck et al. (2013)\npointed out that finite variances in MAB are sufficient to achieve regret bounds of the same order\nas the optimal regret for MAB under the sub-Gaussian assumption, and the order of T in regret\nbounds increases when \u03f5 decreases. The lower bound of MAB with heavy-tailed payoffs has been\nanalyzed (Bubeck et al., 2013), and robust algorithms by Bubeck et al. (2013) are optimal. Theoretical\nguarantees by Bubeck et al. (2013); Vakili et al. (2013) are for the setting of finite arms.\nIn Vakili et al. (2013), primary theoretical results were presented for the case of \u03f5 > 1. 
We notice\nthat the case of \u03f5 > 1 is not interesting, because it reduces to the case of finite variances in MAB.\n\nFor the problem of linear stochastic bandits, which is also named linear reinforcement learning\nby Auer (2002), the lower bound is $\\Omega(d\\sqrt{T})$ when contextual information of arms is from a\nd-dimensional space (Dani et al., 2008b). Bandit algorithms matching the lower bound up to polylogarithmic\nfactors have been well developed (Auer, 2002; Dani et al., 2008a; Abbasi-Yadkori et al.,\n2011; Chu et al., 2011). Notice that all these studies assume that stochastic payoffs contain sub-Gaussian\nnoises. More variants of MAB are discussed by Bubeck et al. (2012).\n\nIt is surprising to find that the lower bound of LinBET remains unknown. In Medina and Yang\n(2016), bandit algorithms based on truncation and median of means were presented. When \u03f5 is finite\nfor LinBET, the algorithms by Medina and Yang (2016) cannot recover the bound of $\\widetilde{O}(\\sqrt{T})$ which\nis the regret of the state-of-the-art algorithms in linear stochastic bandits under the sub-Gaussian\nassumption. Medina and Yang (2016) conjectured that it is possible to recover $\\widetilde{O}(\\sqrt{T})$ with \u03f5 being\na finite number for LinBET. Thus, it is urgent to conduct a thorough analysis of the conjecture in\nconsideration of the importance of heavy-tailed noises in real scenarios. Solving the conjecture\ngeneralizes the practical applications of bandit models. Practical motivating examples for bandits\nwith heavy-tailed payoffs include delays in end-to-end network routing (Liebeherr et al., 2012) and\nsequential investments in financial markets (Cont and Bouchaud, 2000).\n\nRecently, the assumption in stochastic payoffs of MAB was relaxed from sub-Gaussian noises to\nbounded kurtosis (Lattimore, 2017), which can be viewed as an extension of Bubeck et al. 
(2013).\nThe interesting point of Lattimore (2017) is the scale free algorithm, which might be practical in\napplications. Besides, Carpentier and Valko (2014) investigated extreme bandits, where stochastic\npayoffs of MAB follow Fr\u00e9chet distributions. The setting of extreme bandits \ufb01ts for the real scenario\nof anomaly detection without contextual information. The order of regret in extreme bandits is\ncharacterized by distributional parameters, which is similar to the results by Bubeck et al. (2013).\n\nIt is worth mentioning that, for linear regression with heavy-tailed noises, several interesting studies\nhave been conducted. Hsu and Sabato (2016) proposed a generalized method in light of median of\nmeans for loss minimization with heavy-tailed noises. Heavy-tailed noises in Hsu and Sabato (2016)\nmight come from contextual information, which is more complicated than the setting of stochastic\npayoffs in this paper. Therefore, linear regression with heavy-tailed noises usually requires a \ufb01nite\nfourth moment. In Audibert et al. (2011), the basic technique of truncation was adopted to solve\nrobust linear regression in the absence of exponential moment condition. The related studies in this\nline of research are not directly applicable for the problem of LinBET.\n\n3 Lower Bound\n\nIn this section, we provide the lower bound for LinBET. We consider heavy-tailed payoffs with \ufb01nite\n(1 + \u03f5)-th raw moments in the analysis. In particular, we construct the following setting. Assume\nd \u2265 2 is even (when d is odd, similar results can be easily derived by considering the \ufb01rst d \u2212 1\ndimensions). For Dt \u2286 Rd with t \u2208 [T ], we \ufb01x the decision set as D1 = \u00b7\u00b7\u00b7 = DT = D(d). Then,\nthe \ufb01xed decision set is constructed as D(d) ! 
$\\{(x_1, \\cdots, x_d) \\in \\mathbb{R}_+^d : x_1 + x_2 = \\cdots = x_{d-1} + x_d = 1\\}$, which is a subset of the intersection of the cube $[0, 1]^d$ and the hyperplane $x_1 + \\cdots + x_d = d/2$.\nWe define a set $S_d \\triangleq \\{(\\theta_1, \\cdots, \\theta_d) : \\forall i \\in [d/2], (\\theta_{2i-1}, \\theta_{2i}) \\in \\{(2\\Delta, \\Delta), (\\Delta, 2\\Delta)\\}\\}$ with $\\Delta \\in (0, 1/d]$. The payoff functions take values in $\\{0, (1/\\Delta)^{\\frac{1}{\\epsilon}}\\}$ such that, for every x \u2208 D(d), the expected\npayoff is $\\theta_*^\\top x$. To be more specific, we have the payoff function of x as\n\n$$y(x) = \\begin{cases} (1/\\Delta)^{\\frac{1}{\\epsilon}} & \\text{with a probability of } \\Delta^{\\frac{1}{\\epsilon}} \\theta_*^\\top x, \\\\ 0 & \\text{with a probability of } 1 - \\Delta^{\\frac{1}{\\epsilon}} \\theta_*^\\top x. \\end{cases} \\qquad (2)$$\n\nWe have the theorem for the lower bound of LinBET as below.\n\nTheorem 1 (Lower Bound of LinBET). If \u03b8\u2217 is chosen uniformly at random from Sd, and the\npayoff for each x \u2208 D(d) is in $\\{0, (1/\\Delta)^{\\frac{1}{\\epsilon}}\\}$ with mean $\\theta_*^\\top x$, then for any algorithm A and every\n$T \\geq (d/12)^{\\frac{\\epsilon}{1+\\epsilon}}$, we have\n\n$$\\mathbb{E}[R(A, T)] \\geq \\frac{d}{192} T^{\\frac{1}{1+\\epsilon}}. \\qquad (3)$$\n\n[Figure 1 (diagram). (a) Framework of MENU, with $k = \\lceil 24 \\log(eT/\\delta) \\rceil$ and $N = \\lfloor T/k \\rfloor$: each arm $x_n$ is played k times, k LSEs $\\hat{\\theta}_{n,1}, \\cdots, \\hat{\\theta}_{n,k}$ are calculated with payoffs on $\\{x_i\\}_{i=1}^{n}$, and the median of means of $\\{\\hat{\\theta}_{n,j}\\}_{j=1}^{k}$ is taken to obtain $\\hat{\\theta}_{n,k^*}$. (b) Framework of MoM, with $k = T^{\\frac{1+\\epsilon}{1+3\\epsilon}}$ and $N = T^{\\frac{2\\epsilon}{1+3\\epsilon}}$: the median of means of the k payoffs on each $x_i$, $i \\in [n]$, gives $\\tilde{l}_i$, and one LSE $\\hat{\\theta}_n$ is calculated with $\\{\\tilde{l}_i\\}_{i=1}^{n}$.]\n\nFigure 1: Framework comparison between our MENU and MoM by Medina and Yang (2016).\n\nIn the proof of Theorem 1, we first prove the lower bound when d = 2, and then generalize the\nargument to any d > 2. We notice that the parameter in the original d-dimensional space is rearranged\nto d/2 tuples, each of which is a 2-dimensional vector as $(\\theta_{2i-1}, \\theta_{2i}) \\in \\{(2\\Delta, \\Delta), (\\Delta, 2\\Delta)\\}$\nwith i \u2208 [d/2]. If the i-th tuple of the parameter is selected as (2\u2206, \u2206), then the i-th tuple of the\noptimal arm is $(x_{*,2i-1}, x_{*,2i}) = (1, 0)$. In this case, if we define the i-th tuple of the chosen arm as\n$(x_{t,2i-1}, x_{t,2i})$, the instantaneous regret is $\\Delta(1 - x_{t,2i-1})$. Then, the regret can be represented as an\nintegration of $\\Delta(1 - x_{t,2i-1})$ over D(d). Finally, with common inequalities in information theory,\nwe obtain the regret lower bound by setting $\\Delta = T^{-\\frac{\\epsilon}{1+\\epsilon}}/12$.\n\nWe notice that martingale differences to prove the lower bound for linear stochastic bandits\nin (Dani et al., 2008a) are not directly feasible for the proof of lower bound in LinBET, because\nunder our construction of heavy-tailed payoffs (i.e., Eq. (2)), the information of \u03f5 is excluded. Besides,\nour proof is partially inspired by Bubeck (2010). We show the detailed proof of Theorem 1 in\nAppendix A.\n\nRemark 1. The above lower bound provides two essential hints: one is that finite variances in\nLinBET yield a bound of $\\Omega(\\sqrt{T})$, and the other is that algorithms proposed by Medina and Yang\n(2016) are far from optimal. 
The result in Theorem 1 strongly indicates that it is possible to design\nbandit algorithms recovering $\\widetilde{O}(\\sqrt{T})$ with finite variances.\n\n4 Algorithms and Upper Bounds\n\nIn this section, we develop two novel bandit algorithms to solve LinBET, which turn out to be almost\noptimal. We rigorously prove regret upper bounds for the proposed algorithms. In particular,\nour core idea is based on the optimism in the face of uncertainty principle (OFU). The first algorithm\nis median of means under OFU (MENU) shown in Algorithm 1, and the second algorithm is\ntruncation under OFU (TOFU) shown in Algorithm 2. For comparisons, we directly name the bandit\nalgorithm based on median of means in Medina and Yang (2016) as MoM, and name the bandit\nalgorithm based on confidence region with truncation in Medina and Yang (2016) as CRT.\n\nBoth algorithms in this paper adopt the tool of ridge regression. At time step t, let $\\hat{\\theta}_t$ be the $\\ell_2$-regularized\nleast-squares estimate (LSE) of \u03b8\u2217 as $\\hat{\\theta}_t = V_t^{-1} X_t^\\top Y_t$, where $X_t \\in \\mathbb{R}^{t \\times d}$ is a matrix\nof which rows are $x_1^\\top, \\cdots, x_t^\\top$, $V_t \\triangleq X_t^\\top X_t + \\lambda I_d$, $Y_t \\triangleq (y_1, \\cdots, y_t)$ is a vector of the historical\nobserved payoffs until time t and \u03bb > 0 is a regularization parameter.\n\n4.1 MENU and Regret\n\nDescription of MENU. To conduct median of means in LinBET, it is common to allocate T\npulls of bandits among N \u2264 T epochs, and for each epoch the same arm is played multiple\ntimes to obtain an estimate of \u03b8\u2217. 
We find that there exist different ways to construct the epochs.\nWe design the framework of MENU in Figure 1(a), and show the framework of MoM designed\nby Medina and Yang (2016) in Figure 1(b). For MENU and MoM, we have the following three differences.\nFirst, for each epoch n = 1,\u00b7\u00b7\u00b7 , N, MENU plays the same arm xn by O(log(T)) times,\nwhile MoM plays the same arm by $O(T^{\\frac{1+\\epsilon}{1+3\\epsilon}})$ times. Second, at epoch n with historical payoffs,\nMENU conducts LSEs by O(log(T)) times, each of which is based on $\\{x_i\\}_{i=1}^{n}$, while MoM conducts\nLSE by one time based on intermediate payoffs calculated via median of means of observed\npayoffs. Third, MENU adopts median of means of LSEs, while MoM adopts median of means of\nthe observed payoffs. Intuitively, the execution of multiple LSEs will lead to the improved regret of\nMENU. With a better trade-off between k and N in Figure 1(a), we derive an improved upper bound\nof regret in Theorem 2.\n\nAlgorithm 1 Median of means under OFU (MENU)\n1: input d, c, \u03f5, \u03b4, \u03bb, S, T, $\\{D_n\\}_{n=1}^{N}$\n2: initialization: $k = \\lceil 24 \\log(eT/\\delta) \\rceil$, $N = \\lfloor T/k \\rfloor$, $V_0 = \\lambda I_d$, $C_0 = B(0, S)$\n3: for n = 1, 2, \u00b7\u00b7\u00b7 , N do\n4:   $(x_n, \\tilde{\\theta}_n) = \\arg\\max_{(x,\\theta) \\in D_n \\times C_{n-1}} \\langle x, \\theta \\rangle$   \u25c3 to select an arm\n5:   Play $x_n$ k times and observe payoffs $y_{n,1}, y_{n,2}, \\cdots, y_{n,k}$\n6:   $V_n = V_{n-1} + x_n x_n^\\top$\n7:   For j \u2208 [k], $\\hat{\\theta}_{n,j} = V_n^{-1} \\sum_{i=1}^{n} y_{i,j} x_i$   \u25c3 to calculate LSE for the j-th group\n8:   For j \u2208 [k], let $r_j$ be the median of $\\{\\|\\hat{\\theta}_{n,j} - \\hat{\\theta}_{n,s}\\|_{V_n} : s \\in [k] \\setminus j\\}$\n9:   $k^* = \\arg\\min_{j \\in [k]} r_j$   \u25c3 to take median of means of estimates\n10:   $\\beta_n = 3\\left((9dc)^{\\frac{1}{1+\\epsilon}} n^{\\frac{1-\\epsilon}{2(1+\\epsilon)}} + \\lambda^{\\frac{1}{2}} S\\right)$\n11:   $C_n = \\{\\theta : \\|\\theta - \\hat{\\theta}_{n,k^*}\\|_{V_n} \\leq \\beta_n\\}$   \u25c3 to update the confidence region\n12: end for\n\nIn light of Figure 1(a), we develop algorithmic procedures in Algorithm 1 for MENU. We notice that,\nin order to guarantee the median of means of LSEs not far away from the true underlying parameter\nwith high probability, we construct the confidence interval in Line 10 of Algorithm 1. Now we have\nthe following theorem for the regret upper bound of MENU.\n\nTheorem 2 (Regret Analysis for the MENU Algorithm). Assume that for all t and xt \u2208 Dt with\n\u2225xt\u22252 \u2264 D, \u2225\u03b8\u2217\u22252 \u2264 S, $|x_t^\\top \\theta_*| \\leq L$ and $\\mathbb{E}[|\\eta_t|^{1+\\epsilon} \\mid \\mathcal{F}_{t-1}] \\leq c$. Then, with probability at least 1 \u2212 \u03b4,\nfor every T \u2265 256 + 24 log(e/\u03b4), the regret of the MENU algorithm satisfies\n\n$$R(\\text{MENU}, T) \\leq 6\\left((9dc)^{\\frac{1}{1+\\epsilon}} + \\lambda^{\\frac{1}{2}} S + L\\right) T^{\\frac{1}{1+\\epsilon}} \\left(24 \\log\\left(\\frac{eT}{\\delta}\\right) + 1\\right)^{\\frac{\\epsilon}{1+\\epsilon}} \\sqrt{2d \\log\\left(1 + \\frac{TD^2}{\\lambda d}\\right)}.$$\n\nThe technical challenges in MENU (i.e., Algorithm 1) and its proofs are discussed as follows. Based\non the common techniques in linear stochastic bandits (Abbasi-Yadkori et al., 2011), to guarantee\nthe instantaneous regret in LinBET, we need to guarantee $\\|\\theta_* - \\hat{\\theta}_{n,k^*}\\|_{V_n} \\leq \\beta_n$ with high probability.\nWe attack this issue by guaranteeing $\\|\\theta_* - \\hat{\\theta}_{n,j}\\|_{V_n} \\leq \\beta_n/3$ with a probability of 3/4, which could\nreduce to a problem of bounding a weighted sum of historical noises. Interestingly, by conducting\nsingular value decomposition on $X_n$ (of which rows are $x_1^\\top, \\cdots, x_n^\\top$), we find that 2-norm of the\nweights is no greater than 1. Then the weighted sum can be bounded by a term as $O(n^{\\frac{1-\\epsilon}{2(1+\\epsilon)}})$.\nWith a standard analysis in linear stochastic bandits from the instantaneous regret to the regret, we\nachieve the above results for MENU. 
We show the detailed proof of Theorem 2 in Appendix B.\n\nRemark 2. For MENU, we adopt the assumption of heavy-tailed payoffs on central moments,\nwhich is required in the basic technique of median of means (Bubeck et al., 2013). Besides, there\nexists an implicit mild assumption in Algorithm 1 that, at each epoch n, the decision set must contain\nthe selected arm xn at least k times, which is practical in applications, e.g., online personalized\nrecommendations (Li et al., 2010). The condition of T \u2265 256 + 24 log(e/\u03b4) is required for T \u2265 k.\nThe regret upper bound of MENU is $\\widetilde{O}(T^{\\frac{1}{1+\\epsilon}})$, which implies that finite variances in LinBET are\nsufficient to achieve $\\widetilde{O}(\\sqrt{T})$.\n\n4.2 TOFU and Regret\n\nAlgorithm 2 Truncation under OFU (TOFU)\n1: input d, b, \u03f5, \u03b4, \u03bb, T, $\\{D_t\\}_{t=1}^{T}$\n2: initialization: $V_0 = \\lambda I_d$, $C_0 = B(0, S)$\n3: for t = 1, 2, \u00b7\u00b7\u00b7 , T do\n4:   $b_t = \\left(\\frac{b}{\\log(2T/\\delta)}\\right)^{\\frac{1}{\\epsilon}} t^{\\frac{1-\\epsilon}{2(1+\\epsilon)}}$\n5:   $(x_t, \\tilde{\\theta}_t) = \\arg\\max_{(x,\\theta) \\in D_t \\times C_{t-1}} \\langle x, \\theta \\rangle$   \u25c3 to select an arm\n6:   Play $x_t$ and observe a payoff $y_t$\n7:   $V_t = V_{t-1} + x_t x_t^\\top$ and $X_t^\\top = [x_1, \\cdots, x_t]$\n8:   $[u_1, \\cdots, u_d]^\\top = V_t^{-1/2} X_t^\\top$\n9:   for i = 1, \u00b7\u00b7\u00b7 , d do\n10:     $Y_i^\\dagger = (y_1 \\mathbb{1}_{u_{i,1} y_1 \\leq b_t}, \\cdots, y_t \\mathbb{1}_{u_{i,t} y_t \\leq b_t})$   \u25c3 to truncate the payoffs\n11:   end for\n12:   $\\theta_t^\\dagger = V_t^{-1/2} (u_1^\\top Y_1^\\dagger, \\cdots, u_d^\\top Y_d^\\dagger)$\n13:   $\\beta_t = 4\\sqrt{d}\\, b^{\\frac{1}{1+\\epsilon}} \\left(\\log\\left(\\frac{2dT}{\\delta}\\right)\\right)^{\\frac{\\epsilon}{1+\\epsilon}} t^{\\frac{1-\\epsilon}{2(1+\\epsilon)}} + \\lambda^{\\frac{1}{2}} S$\n14:   Update $C_t = \\{\\theta : \\|\\theta - \\theta_t^\\dagger\\|_{V_t} \\leq \\beta_t\\}$   \u25c3 to update the confidence region\n15: end for\n\nDescription of TOFU. We demonstrate the algorithmic procedures of TOFU in Algorithm 2. We\npoint out two subtle differences between our TOFU and the algorithm of CRT as follows. In TOFU,\nto obtain the accurate estimate of \u03b8\u2217, we need to trim all historical payoffs for each dimension\nindividually. Besides, the truncating operations depend on the historical information of arms. By\ncontrast, in CRT, the historical payoffs are trimmed once, which is controlled only by the number\nof rounds for playing bandits. Compared to CRT, our TOFU achieves a tighter confidence interval,\nwhich can be found from the setting of \u03b2t. Now we have the following theorem for the regret upper\nbound of TOFU.\n\nTheorem 3 (Regret Analysis for the TOFU Algorithm). Assume that for all t and xt \u2208 Dt with\n\u2225xt\u22252 \u2264 D, \u2225\u03b8\u2217\u22252 \u2264 S, $|x_t^\\top \\theta_*| \\leq L$ and $\\mathbb{E}[|y_t|^{1+\\epsilon} \\mid \\mathcal{F}_{t-1}] \\leq b$. Then, with probability at least 1 \u2212 \u03b4,\nfor every T \u2265 1, the regret of the TOFU algorithm satisfies\n\n$$R(\\text{TOFU}, T) \\leq 2 T^{\\frac{1}{1+\\epsilon}} \\left(4\\sqrt{d}\\, b^{\\frac{1}{1+\\epsilon}} \\left(\\log\\left(\\frac{2dT}{\\delta}\\right)\\right)^{\\frac{\\epsilon}{1+\\epsilon}} + \\lambda^{\\frac{1}{2}} S + L\\right) \\sqrt{2d \\log\\left(1 + \\frac{TD^2}{\\lambda d}\\right)}.$$\n\nSimilarly to the proof in Theorem 2, we can achieve the above results for TOFU. Due to space\nlimitation, we show the detailed proof of Theorem 3 in Appendix C.\n\nRemark 3. For TOFU, we adopt the assumption of heavy-tailed payoffs on raw moments.\nIt is worth pointing out that, when \u03f5 = 1, we have regret upper bound for TOFU as\n$\\widetilde{O}(d\\sqrt{T})$, which implies that we recover the same order of d as that under sub-Gaussian assumption\n(Abbasi-Yadkori et al., 2011). A weakness in TOFU is high time complexity, because for each\nround TOFU needs to truncate all historical payoffs. The time complexity might be reasonably\nreduced by dividing T into multiple epochs, each of which contains only one truncation.\n\nTable 1: Statistics of synthetic datasets in experiments. For Student\u2019s t-distribution, \u03bd denotes the\ndegree of freedom, lp denotes the location, sp denotes the scale. For Pareto distribution, \u03b1 denotes\nthe shape and sm denotes the scale. NA denotes not available.\n\ndataset | Dt {#arms, #dimensions} | distribution {parameters} | {\u03f5, b, c} | mean of the optimal arm\nS1 | {20, 10} | Student\u2019s t-distribution {\u03bd = 3, lp = 0, sp = 1} | {1.00, NA, 3.00} | 4.00\nS2 | {100, 20} | Student\u2019s t-distribution {\u03bd = 3, lp = 0, sp = 1} | {1.00, NA, 3.00} | 7.40\nS3 | {20, 10} | Pareto distribution {\u03b1 = 2, sm = $x_t^\\top \\theta_*/2$} | {0.50, 7.72, NA} | 3.10\nS4 | {100, 20} | Pareto distribution {\u03b1 = 2, sm = $x_t^\\top \\theta_*/2$} | {0.50, 54.37, NA} | 11.39\n\n5 Experiments\n\nIn this section, we conduct experiments based on synthetic datasets to evaluate the performance\nof our proposed bandit algorithms: MENU and TOFU. For comparisons, we adopt two baselines:\nMoM and CRT proposed by Medina and Yang (2016). We run multiple independent repetitions for\neach dataset in a personal computer under Windows 7 with Intel CPU@3.70GHz and 16GB memory.\n\n5.1 Datasets and Setting\n\nTo show effectiveness of bandit algorithms, we will demonstrate cumulative payoffs with respect\nto number of rounds for playing bandits over a fixed finite-arm decision set. For verifications, we\nadopt four synthetic datasets (named as S1\u2013S4) in the experiments, of which statistics are shown\nin Table 1. The experiments on heavy tails require \u03f5, b or \u03f5, c to be known, which corresponds to\nthe assumptions of Theorem 3 or Theorem 2, respectively. 
According to the required information, we can apply\nMENU or TOFU in practical applications. We adopt Student\u2019s t and Pareto distributions because\nthey are common in practice. For Student\u2019s t-distributions, we can easily estimate c, while for Pareto\ndistributions, we can easily estimate b. Besides, we can choose different parameters (e.g., larger values)\nin the distributions, and recalculate the parameters b and c.\n\nFor S1 and S2, which contain different numbers of arms and different dimensions for the contextual\ninformation, we adopt the standard Student\u2019s t-distribution to generate heavy-tailed noises. For the chosen\narm $x_t \\in D_t$, the expected payoff is $x_t^\\top \\theta_*$, and the observed payoff is the expected payoff plus a noise generated\nfrom a standard Student\u2019s t-distribution. We generate each dimension of contextual information for\nan arm, as well as the underlying parameter, from a uniform distribution over [0, 1]. The standard\nStudent\u2019s t-distribution implies that the bound for the second central moment of S1 and S2 is 3.\n\nFor S3 and S4, we adopt the Pareto distribution, where the shape parameter is set as \u03b1 = 2. We know\n$x_t^\\top \\theta_* = \\alpha s_m/(\\alpha - 1)$, implying $s_m = x_t^\\top \\theta_*/2$. Then, we set \u03f5 = 0.5, leading to the bound of the raw\nmoment as $\\mathbb{E}[|y_t|^{1.5}] = \\alpha s_m^{1.5}/(\\alpha - 1.5) = 4 s_m^{1.5}$. 
We take the maximum of $4 s_m^{1.5}$ among all arms\nas the bound of the 1.5-th raw moment. We generate arms and the parameter similarly to S1 and S2.\nIn figures, we show the average of cumulative payoffs with time evolution over ten independent\nrepetitions for each dataset, and show error bars of a standard variance for comparing the robustness\nof algorithms. For S1 and S2, we run MENU and MoM and set $T = 2 \\times 10^4$. For S3 and S4, we\nrun TOFU and CRT and set $T = 1 \\times 10^4$. For all algorithms, we set \u03bb = 1.0 and \u03b4 = 0.1.\n\n5.2 Results and Discussions\n\nWe show experimental results in Figure 2. From the figure, we clearly find that our proposed two\nalgorithms outperform MoM and CRT, which is consistent with the theoretical results in Theorems 2\nand 3. We also evaluate our algorithms with other synthetic datasets, as well as different \u03bb and \u03b4,\nand observe similar superiority of MENU and TOFU. Finally, for further comparison on regret,\ncomplexity and storage of the four algorithms, we list the results in Table 2.\n\nTable 2: Comparison on regret, complexity and storage of four algorithms.\n\nalgorithm | MoM | MENU | CRT | TOFU\nregret | $\\widetilde{O}(T^{\\frac{1+2\\epsilon}{1+3\\epsilon}})$ | $\\widetilde{O}(T^{\\frac{1}{1+\\epsilon}})$ | $\\widetilde{O}(T^{\\frac{1}{2} + \\frac{1}{2(1+\\epsilon)}})$ | $\\widetilde{O}(T^{\\frac{1}{1+\\epsilon}})$\ncomplexity | $O(T)$ | $O(T \\log T)$ | $O(T)$ | $O(T^2)$\nstorage | $O(1)$ | $O(\\log T)$ | $O(1)$ | $O(T)$\n\nFigure 2: Comparison of cumulative payoffs for synthetic datasets S1-S4 with four algorithms. Panels: (a) S1, (b) S2, (c) S3, (d) S4.\n\n6 Conclusion\n\nWe have studied the problem of LinBET, where stochastic payoffs are characterized by finite (1 +\n\u03f5)-th moments with \u03f5 \u2208 (0, 1]. We broke the traditional assumption of sub-Gaussian noises in\npayoffs of bandits, and derived theoretical guarantees based on the prior information of bounds\non finite moments. We rigorously analyzed the lower bound of LinBET, and developed two novel\nbandit algorithms with regret upper bounds matching the lower bound up to polylogarithmic factors.\nThe two algorithms are based on median of means and truncation, respectively. In the sense of\npolynomial dependence on T, we provided optimal algorithms for the problem of LinBET, and\nthus solved an open problem, which has been pointed out by Medina and Yang (2016). Finally, our\nproposed algorithms have been evaluated based on synthetic datasets, and outperformed the\nstate-of-the-art results. 
Since both algorithms in this paper require a priori knowledge of \u03f5, future directions\nin this line of research include automatic learning of LinBET without information of distributional\nmoments, and evaluation of our proposed algorithms in real-world scenarios.\n\nAcknowledgments\n\nThe work described in this paper was partially supported by the Research Grants Council of the Hong\nKong Special Administrative Region, China (No. CUHK 14208815 and No. CUHK 14210717 of the\nGeneral Research Fund), and Microsoft Research Asia (2018 Microsoft Research Asia Collaborative\nResearch Award).\n\nReferences\n\nY. Abbasi-Yadkori, D. P\u00e1l, and C. Szepesv\u00e1ri. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312\u20132320, 2011.\n\nR. Agrawal. Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054\u20131078, 1995.\n\nS. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39\u20131, 2012.\n\nJ.-Y. Audibert, O. Catoni, et al. Robust linear least squares regression. The Annals of Statistics, 39(5):2766\u20132794, 2011.\n\nP. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397\u2013422, 2002.\n\nS. Bubeck. Bandits games and clustering foundations. PhD thesis, Universit\u00e9 des Sciences et Technologie de Lille-Lille I, 2010.\n\nS. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends\u00ae in Machine Learning, 5(1):1\u2013122, 2012.\n\nS. Bubeck, N. Cesa-Bianchi, and G. Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711\u20137717, 2013.\n\nA. Carpentier and M. Valko. 
Extreme bandits. In Advances in Neural Information Processing Systems, pages 1089\u20131097, 2014.\n\nO. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249\u20132257, 2011.\n\nW. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208\u2013214, 2011.\n\nR. Cont and J.-P. Bouchaud. Herd behavior and aggregate fluctuations in financial markets. Macroeconomic Dynamics, 4(2):170\u2013196, 2000.\n\nV. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, pages 355\u2013366, 2008a.\n\nV. Dani, S. M. Kakade, and T. P. Hayes. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, pages 345\u2013352, 2008b.\n\nJ. Gittins, K. Glazebrook, and R. Weber. Multi-armed bandit allocation indices. John Wiley & Sons, 2011.\n\nD. Hsu and S. Sabato. Heavy-tailed regression with a generalized median-of-means. In International Conference on Machine Learning, pages 37\u201345, 2014.\n\nD. Hsu and S. Sabato. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17(1):543\u2013582, 2016.\n\nT. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4\u201322, 1985.\n\nT. Lattimore. A scale free algorithm for stochastic bandits with bounded kurtosis. In Advances in Neural Information Processing Systems, pages 1583\u20131592, 2017.\n\nT. Lattimore, K. Crammer, and C. Szepesv\u00e1ri. Optimal resource allocation with semi-bandit feedback. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 477\u2013486. AUAI Press, 2014.\n\nL. Li, W. Chu, J. 
Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the Nineteenth International Conference on World Wide Web, pages 661\u2013670. ACM, 2010.\n\nJ. Liebeherr, A. Burchard, and F. Ciucu. Delay bounds in communication networks with heavy-tailed and self-similar traffic. IEEE Transactions on Information Theory, 58(2):1010\u20131024, 2012.\n\nA. M. Medina and S. Yang. No-regret algorithms for heavy-tailed linear bandits. In International Conference on Machine Learning, pages 1642\u20131650, 2016.\n\nR. Munos et al. From bandits to monte-carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends\u00ae in Machine Learning, 7(1):1\u2013129, 2014.\n\nH. Robbins et al. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527\u2013535, 1952.\n\nJ. A. Roberts, T. W. Boonstra, and M. Breakspear. The heavy tail of the human brain. Current Opinion in Neurobiology, 31:164\u2013172, 2015.\n\nY. Seldin, F. Laviolette, N. Cesa-Bianchi, J. Shawe-Taylor, and P. Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086\u20137093, 2012.\n\nM. Shao and C. L. Nikias. Signal processing with fractional lower order moments: stable processes and their applications. Proceedings of the IEEE, 81(7):986\u20131010, 1993.\n\nW. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285\u2013294, 1933.\n\nS. Vakili, K. Liu, and Q. Zhao. Deterministic sequencing of exploration and exploitation for multi-armed bandit problems. IEEE Journal of Selected Topics in Signal Processing, 7(5):759\u2013767, 2013.\n\nX. Yu, H. Shao, M. R. Lyu, and I. King. Pure exploration of multi-armed bandits with heavy-tailed payoffs. 
In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 937\u2013946. AUAI Press, 2018.", "award": [], "sourceid": 5106, "authors": [{"given_name": "Han", "family_name": "Shao", "institution": "The Chinese University of Hong Kong"}, {"given_name": "Xiaotian", "family_name": "Yu", "institution": "The Chinese University of Hong Kong"}, {"given_name": "Irwin", "family_name": "King", "institution": "Chinese University of Hong Kong"}, {"given_name": "Michael", "family_name": "Lyu", "institution": "CUHK"}]}