{"title": "Improved Regret Bounds for Bandit Combinatorial Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 12050, "page_last": 12059, "abstract": "\\textit{Bandit combinatorial optimization} is a bandit framework in which a player chooses an action within a given finite set $\\mathcal{A} \\subseteq \\{ 0, 1 \\}^d$ and incurs a loss that is the inner product of the chosen action and an unobservable loss vector in $\\mathbb{R} ^ d$ in each round. In this paper, we aim to reveal the property, which makes the bandit combinatorial optimization hard. Recently, Cohen et al.~\\citep{cohen2017tight} obtained a lower bound $\\Omega(\\sqrt{d k^3 T / \\log T})$ of the regret, where $k$ is the maximum $\\ell_1$-norm of action vectors, and $T$ is the number of rounds. This lower bound was achieved by considering a continuous strongly-correlated distribution of losses. Our main contribution is that we managed to improve this bound by $\\Omega( \\sqrt{d k ^3 T} )$ through applying a factor of $\\sqrt{\\log T}$, which can be done by means of strongly-correlated losses with \\textit{binary} values. The bound derives better regret bounds for three specific examples of the bandit combinatorial optimization: the multitask bandit, the bandit ranking and the multiple-play bandit. In particular, the bound obtained for the bandit ranking in the present study addresses an open problem raised in \\citep{cohen2017tight}. In addition, we demonstrate that the problem becomes easier without considering correlations among entries of loss vectors. In fact, if each entry of loss vectors is an independent random variable, then, one can achieve a regret of $\\tilde{O}(\\sqrt{d k^2 T})$, which is $\\sqrt{k}$ times smaller than the lower bound shown above. 
These results indicate that correlation among losses is the source of large regret.", "full_text": "Improved Regret Bounds\n\nfor Bandit Combinatorial Optimization\u2217\n\nShinji Ito\u2020\n\nNEC Corporation, The University of Tokyo\n\ni-shinji@nec.com\n\nHanna Sumita\n\nTokyo Metropolitan University\n\nsumita@tmu.ac.jp\n\nDaisuke Hatano\n\nRIKEN AIP\n\ndaisuke.hatano@riken.jp\n\nKei Takemura\nNEC Corporation\n\nkei_takemura@nec.com\n\nTakuro Fukunaga\u2021\n\nChuo University, RIKEN AIP, JST PRESTO\n\nfukunaga.07s@g.chuo-u.ac.jp\n\nNaonori Kakimura\u00a7\n\nKeio University\n\nkakimura@math.keio.ac.jp\n\nKen-ichi Kawarabayashi\u00a7\nNational Institute of Informatics\n\nk-keniti@nii.ac.jp\n\nAbstract\n\nBandit combinatorial optimization is a bandit framework in which a player chooses an action within a given finite set A \u2286 {0, 1}^d and incurs a loss that is the inner product of the chosen action and an unobservable loss vector in R^d in each round. In this paper, we aim to reveal the properties that make bandit combinatorial optimization hard. Recently, Cohen et al. [8] obtained a lower bound \u2126(\u221a(dk^3 T / log T)) on the regret, where k is the maximum \u2113_1-norm of the action vectors and T is the number of rounds. This lower bound was achieved by considering a continuous strongly-correlated distribution of losses. Our main contribution is to improve this bound to \u2126(\u221a(dk^3 T)), shaving off the \u221a(log T) factor, by means of strongly-correlated losses with binary values. This bound yields better regret lower bounds for three specific examples of bandit combinatorial optimization: the multitask bandit, the bandit ranking, and the multiple-play bandit. In particular, the bound obtained for bandit ranking in the present study resolves an open problem raised in [8]. 
In addition, we demonstrate that the problem becomes easier without correlations among the entries of loss vectors. In fact, if each entry of the loss vectors is an independent random variable, then one can achieve a regret of \u02dcO(\u221a(dk^2 T)), which is \u221ak times smaller than the lower bound shown above. These results indicate that correlation among losses is the source of large regret.\n\n\u2217This work was supported by JST, ERATO, Grant Number JPMJER1201, Japan.\n\u2020This work was supported by JST, ACT-I, Grant Number JPMJPR18U5, Japan.\n\u2021This work was supported by JST, PRESTO, Grant Number JPMJPR1759, Japan.\n\u00a7This work was supported by JSPS, KAKENHI, Grant Number JP18H05291, Japan.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f1 Introduction\n\nThis paper investigates the bandit combinatorial optimization problem, defined as follows. A player is given a finite action set A \u2286 {a \u2208 {0, 1}^d | \u2016a\u2016_1 = k} and the number T of rounds for decision-making. In each round t = 1, 2, . . . , T, the player chooses an action a_t from A. At the same time, the environment privately chooses a loss vector \u2113_t = [\u2113_{t1}, . . . , \u2113_{td}]^\u22a4 \u2208 [0, 1]^d, and the player observes the loss \u2113_t^\u22a4 a_t incurred by the action a_t. The goal of the player is to minimize the expected cumulative loss E[\u2211_{t=1}^T \u2113_t^\u22a4 a_t], where the expectation is taken with respect to the player\u2019s internal randomization. 
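The interaction protocol above is easy to make concrete in code. The following minimal Python sketch is our own illustration (the toy action set, the oracle policy, and all function names are ours, not the paper's algorithm); it implements the bandit feedback loop and the comparison against the best fixed action in hindsight:\n\n```python\nimport itertools\nimport random\n\ndef dot(x, y):\n    return sum(xi * yi for xi, yi in zip(x, y))\n\ndef play(actions, losses, policy):\n    \"\"\"Bandit protocol: in round t the player picks a_t, then observes only\n    the scalar loss l_t^T a_t (never the loss vector l_t itself).\"\"\"\n    chosen, observed = [], []\n    for loss in losses:\n        a = policy(observed)            # may depend only on past scalar feedback\n        chosen.append(a)\n        observed.append(dot(loss, a))\n    return chosen, observed\n\ndef regret(actions, chosen, losses):\n    \"\"\"Cumulative loss of the played actions minus that of the best fixed action.\"\"\"\n    incurred = sum(dot(l, a) for l, a in zip(losses, chosen))\n    best = min(sum(dot(l, a) for l in losses) for a in actions)\n    return incurred - best\n\n# Toy instance: d = 4, k = 2, A = all 0/1 vectors with exactly k ones.\nd, k, T = 4, 2, 50\nA = [a for a in itertools.product((0, 1), repeat=d) if sum(a) == k]\nrng = random.Random(0)\nlosses = [[rng.random() for _ in range(d)] for _ in range(T)]\n\n# A policy that knows the best fixed action in hindsight has zero regret.\nbest_a = min(A, key=lambda a: sum(dot(l, a) for l in losses))\nchosen, _ = play(A, losses, lambda hist: best_a)\nprint(abs(regret(A, chosen, losses)) < 1e-9)  # True\n```\n\nAny real algorithm must instead infer a good action from the scalar feedback alone, which is what the bounds below quantify.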
The performance of the algorithm is measured in terms of the regret R_T, defined by R_T = max_{a\u2208A} E[\u2211_{t=1}^T \u2113_t^\u22a4 a_t \u2212 \u2211_{t=1}^T \u2113_t^\u22a4 a].\n\nIn this study, we focus on the minimax regret, the worst-case regret attained by optimal algorithms, which can be expressed as R_T := min_{algorithm} max_{{\u2113_t}_{t=1}^T \u2286 [0,1]^d} R_T. The minimax regret can be bounded from above by designing algorithms. The current best bound is R_T = O(\u221a(dk^3 T log(ed/k))), as reported in a number of papers [2; 6; 7; 10; 12]. On the other hand, lower bounds on the minimax regret can be proven by constructing a probability distribution of loss vectors for which any algorithm incurs a certain degree of regret. To obtain a lower bound, Audibert et al. [2] constructed a probability distribution of loss vectors for which arbitrary algorithms incur a regret of \u2126(\u221a(dk^2 T)), and they conjectured that this bound is tight, i.e., R_T = \u0398(\u221a(dk^2 T)). Recently, however, Cohen et al. [8] presented the lower bound R_T = \u2126(\u221a(dk^3 T / log T)), which rejected the above-mentioned conjecture and thereby decreased the gap between the upper and lower bounds to O(\u221a(log(ed/k) log T)), consisting of logarithmic terms only.\n\nThe input distribution constructed by Cohen et al. [8] to derive the lower bound has unique characteristics that cannot be found in previous studies, such as lower bounds for the multi-armed bandit [4], combinatorial semi-bandits [6; 20; 23] and combinatorial bandits [2]. In previous studies on lower bounds, only binary inputs and arm-wise independent distributions were considered, i.e., \u2113_{t1}, . . . , \u2113_{td} are mutually independent {0, 1}-valued discrete random variables. 
Such inputs were proved to result in tight lower bounds for multi-armed bandits [4] and combinatorial semi-bandits [2; 20]. In contrast to these studies, Cohen et al. [8] introduced loss vectors following a continuous distribution over [0, 1]^d and having a strong correlation among the d entries. Furthermore, the lower bound obtained by Cohen et al. [8] includes a 1/\u221a(log T) term, which does not appear in other lower bounds for bandit problems. In addition, they applied the obtained lower bounds to special cases, such as the multitask bandit and the bandit ranking problem. However, their results are restricted to problems under certain parameter constraints, and consequently, the task of identifying tight bounds for some important special cases, including the problem referred to as bandit ranking with full permutations, was left open.\n\nSuch characteristics of the input distribution defined by Cohen et al. [8] lead to the following research questions:\n\nQ. 1 Is the 1/\u221a(log T) factor in the lower bound given by Cohen et al. [8] redundant or inevitable?\n\nQ. 2 Does the continuous distribution of loss vectors make the problem essentially harder than the discrete (binary) distribution? If we restrict our consideration to loss vectors in {0, 1}^d, then the player can see the number of good arms (i \u2208 [d] s.t. \u2113_{ti} = 0) among the chosen arms S_t, which may, or may not, be more informative than actual values.\n\nQ. 3 Does the correlation of loss among different arms make the problem essentially harder than arm-wise independent loss?\n\nQ. 4 Can we obtain tight lower bounds for special cases such as the bandit ranking problem with full permutations, resolving the open question in [8]?\n\n2 Main Results\n\nOur main results can be interpreted as answers to the above four questions. 
First, we improve the regret lower bound obtained by [8] to \u2126(\u221a(dk^3 T)), shaving off a factor of \u221a(log T), as shown in Table 1. These bounds are proven by constructing a distribution of strongly-correlated losses using binary values. We apply the bounds to three specific examples of bandit combinatorial optimization\n\n\fTable 1: Regret bounds R_T for bandit combinatorial optimization.\n\nAssumption | Upper bound by algorithms | Lower bound\nNo assumption | O(\u221a(dk^2 T log |A|)) = O(\u221a(dk^3 T log(ed/k))) ([6] and [7]) | \u2126(\u221a(dk^3 T / log T)) by \u2113_t \u2208 [0, 1]^d ([8]); \u2126(\u221a(dk^3 T)) by \u2113_t \u2208 {0, 1}^d (Theorems 1 and 2)\nIndependent losses | O(\u221a(dkT log |A| log T)) = O(\u221a(dk^2 T log(ed/k) log T)) (Algorithm 1 and Theorem 3) | \u2126(\u221a(dk^2 T)) by \u2113_t \u2208 {0, 1}^d ([2])\n\nthat have high practical importance: the multitask bandit problem, the bandit ranking problem (Theorem 1), and the multiple-play bandit problem (Theorem 2). This result provides answers to Q. 1 and Q. 2 outlined in Section 1: the 1/\u221a(log T) factor in the lower bound is redundant, and the difference between continuous-valued and discrete-valued losses does not have a large impact on the hardness of the problem. This observation also addresses Q. 4, an open problem outlined in [8].\n\nThe multitask bandit problem [7; 8] is a bandit framework in which the player tries to solve k instances of the n-armed bandit problem. 
This is a special case of bandit combinatorial optimization with d = kn and\n\nA = { a \u2208 {0, 1}^d | \u2211_{i=(j\u22121)n+1}^{jn} a_i = 1 (j \u2208 [k]) }.    (1)\n\nIn the bandit ranking problem, or online ranking problem [13] with bandit feedback, the goal of the player is to find a maximum matching in the complete bipartite graph K_{k,n} with d = kn edges, where k \u2208 [n]. The set of all maximum matchings can be expressed as follows:\n\nA = { a \u2208 {0, 1}^d | \u2211_{i=(j\u22121)n+1}^{jn} a_i = 1 (j \u2208 [k]), \u2211_{i=1}^{k} a_{(i\u22121)n+j} \u2264 1 (j \u2208 [n]) }.    (2)\n\nConsidering these problems, we obtain the following regret lower bound.\n\nTheorem 1 (multitask bandit, bandit ranking). Suppose that A is defined by (1) or (2) and n \u2265 2. There is a probability distribution D over {0, 1}^d for which the following statement holds: if \u2113_t is drawn from D for t = 1, . . . , T independently, the regret for any algorithm satisfies E[R_T] = \u2126(min{\u221a(dk^3 T), k^{3/4} T}), where the expectation is taken with respect to the randomness of \u2113_t.\n\nFor the bandit ranking problem, the previous work [8] demonstrated the lower bound of \u2126(\u221a(dk^3 T / log T)) under the assumption n \u2265 2k, and the full-permutation case (k = n) was left as an open problem, as mentioned in the conclusion of that work. Theorem 1 answers this open problem: even if k = n, the minimax regret is of R_T = \u02dc\u0398(\u221a(dk^3 T)) = \u02dc\u0398(\u221a(k^5 T)), ignoring a \u221a(log k) factor. Theorem 1 can also be extended to the online shortest path problem [5], by the standard reduction from the multitask bandit to the online shortest path. 
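For small k and n, the action sets (1) and (2) can be enumerated directly. The sketch below is our own illustration (function names are ours, not the paper's): it builds the multitask set (1) and filters it down to the maximum matchings of K_{k,n} from (2), using the same row-major indexing as in the text:\n\n```python\nimport itertools\n\ndef multitask_actions(k, n):\n    \"\"\"Action set (1): k blocks of n coordinates, exactly one 1 per block\n    (k independent instances of the n-armed bandit, d = k*n).\"\"\"\n    d = k * n\n    out = []\n    for choice in itertools.product(range(n), repeat=k):\n        a = [0] * d\n        for j, i in enumerate(choice):\n            a[j * n + i] = 1\n        out.append(tuple(a))\n    return out\n\ndef is_matching(a, k, n):\n    \"\"\"Membership test for (2): every block (row) sums to 1 and every\n    column sums to at most 1, i.e. a encodes a maximum matching of K_{k,n}.\"\"\"\n    rows = all(sum(a[j * n:(j + 1) * n]) == 1 for j in range(k))\n    cols = all(sum(a[j * n + i] for j in range(k)) <= 1 for i in range(n))\n    return rows and cols\n\nacts = multitask_actions(2, 3)\nprint(len(acts))                      # n^k = 9 multitask actions\nrankings = [a for a in acts if is_matching(a, 2, 3)]\nprint(len(rankings))                  # n!/(n-k)! = 6 maximum matchings\n```\n\nThe set (2) is exactly the subset of (1) in which no two blocks pick the same column, which is why the matching count 6 is smaller than the multitask count 9.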
See, e.g., [8] for details of the reduction.\n\nThe multiple-play bandit problem [7; 16; 18; 23] is another bandit framework, in which the player can choose arbitrary k arms from a set of d arms in each round. This problem corresponds to A = ([d] choose k) := {a \u2208 {0, 1}^d | \u2016a\u2016_1 = k}.\n\nTheorem 2 (multiple-play bandit). Suppose that A = ([d] choose k). There is a probability distribution D over {0, 1}^d for which the following holds: if \u2113_t is drawn from D for t = 1, . . . , T independently, the regret for any algorithm satisfies E[R_T] = \u2126(min{ ((d\u2212k)/d) \u221a(dk^3 T), ((d\u2212k)/d)^2 k^{3/4} T }), where the expectation is taken with respect to the randomness of \u2113_t.\n\nThe above lower bound means that R_T = \u2126(\u221a(dk^3 T)) for T = \u2126(dk^{3/2}) and d = \u2126(k). It should be noted that existing works [2; 8; 20] provided weaker lower bounds only for the case of d \u2265 2k, while those provided in the present study are valid for general d and k. The proof of Theorem 2 is presented in Appendix C.\n\nA basic idea for proving a nearly tight bound is to construct an environment where all entries of \u2113_t are strongly correlated with each other; this concept was introduced by Cohen et al. [8]. If losses are strongly correlated, the observed value \u2113_t^\u22a4 a_t has a larger variance. For example, the variance is of order k if all entries are independent, while it can be of order k^2 if all entries take the same value. When the observed values \u2113_t^\u22a4 a_t have larger variance, the KL divergence among the values for different actions a is small, which implies that no algorithm can detect \u201cgood\u201d actions properly. Cohen et al. [8] constructed such an environment by means of normal distributions, which improves the lower bound by an \u02dcO(\u221ak) factor. 
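The variance gap driving this argument, order k for independent entries versus order k^2 for perfectly correlated ones, can be checked numerically. A small Monte Carlo sketch with toy parameters of our own choosing:\n\n```python\nimport random\n\ndef empirical_var(xs):\n    m = sum(xs) / len(xs)\n    return sum((x - m) ** 2 for x in xs) / len(xs)\n\nrng = random.Random(0)\nk, trials = 20, 20000\n\n# k independent fair {0,1} entries on the chosen coordinates: Var[l^T a] = k/4.\nindep = [sum(rng.random() < 0.5 for _ in range(k)) for _ in range(trials)]\n\n# All k entries share one coin: l^T a is 0 or k, so Var[l^T a] = k^2/4.\ncorr = [k * (rng.random() < 0.5) for _ in range(trials)]\n\nprint(round(empirical_var(indep), 1))  # close to k/4   = 5\nprint(round(empirical_var(corr), 1))   # close to k^2/4 = 100\n```\n\nA larger variance of the observed scalar makes the per-round feedback less informative, which is precisely why correlated losses force larger regret.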
However, their bound includes a redundant (log T)^{\u22121/2} factor due to the unbounded support of normal distributions.1 We note that their technique has been used recently for proving a lower bound for bandit PCA [17], which includes a redundant (log T)^{\u22121/2} factor as well, for the same reason.\n\nTo shave off the (log T)^{\u22121/2} factor, in this paper we introduce a novel class of discrete distributions over {0, 1}^d, so that the entries of the loss vectors are bounded and strongly correlated. To make the losses correlated, we consider d Bernoulli distributions that share a parameter, by which the observed value has a large variance of O(k^2). However, it is not straightforward to set \u201cgood\u201d actions in this approach. The previous work [8] simply decreases the mean parameter of the normal distribution to set \u201cgood\u201d actions, but this does not work here, as it causes large KL divergences between \u201cgood\u201d actions and the others in our distribution. In the present work, we adjust the parameter of the Bernoulli distributions carefully with the intention of ensuring small KL divergence, which successfully improves the regret lower bound. The idea outlined in the present study can be used to improve that of [8] in other problems as well.\n\nSecond, we show that the correlation among losses is the reason for large regret. In fact, if each entry of the loss vectors is an independent random variable, then one can achieve a regret of \u02dcO(\u221a(dk^2 T)) as below, which is \u221ak times smaller than the lower bounds in Theorems 1 and 2. This provides the answer to Q. 3: the correlation among losses makes the problem essentially harder, as the minimax regret bound becomes larger by a factor of \u02dc\u0398(\u221ak).\n\nTheorem 3 (smaller regret bound for the arm-wise independent loss). 
There exists an algorithm that achieves E[R_T] = O(\u221a(dk^2 T log T log(ed/k))) for T = \u2126(d^3), under the assumption that \u2113_t follows a distribution of mutually independent d random variables in [0, 1], i.i.d. for t = 1, 2, . . . , T.\n\nThis upper bound is nearly tight; Theorem 5 in [2] implies that any algorithm suffers E[R_T] = \u2126(\u221a(dk^2 T)) in the worst case under the same assumption as in Theorem 3.2 By combining this result and Theorem 3, we obtain the following corollary:\n\nCorollary 1. Under the same assumption as in Theorem 3, the minimax regret in bandit combinatorial optimization is of order \u02dc\u0398(\u221a(dk^2 T)), where we ignore logarithmic factors in d and T.\n\nTo prove Theorem 3, we analyze regret upper bounds for stochastic linear bandits, which are a generalization of bandit combinatorial optimization with stochastic environments. In stochastic linear bandits, a player is given a finite set A \u2286 R^d of d-dimensional vectors. In each round, the player chooses a_t \u2208 A and receives the loss L_t = \u2113\u2217^\u22a4 a_t + \u03b7_t, where \u03b7_t is noise, assumed to be conditionally \u03b1-subgaussian. We also assume that sup_{a,b\u2208A} \u2113\u2217^\u22a4(a \u2212 b) \u2264 L. 
We observe that bandit combinatorial optimization under the assumption of Theorem 3 is a special case of stochastic linear bandits with \u03b1 = \u221ak/2 and L = k.\n\nFor stochastic linear bandits with \u03b1 = 1 and L = 1, Lattimore and Szepesv\u00e1ri [19] provided an algorithm that achieves R_T = O(\u221a(dT log(|A| log T / \u03b4))).3 This upper bound, however, does not directly lead to Theorem 3, because their bound holds only for the case of \u03b1 = 1 and L = 1; if we directly\n\n1 To keep \u2113_t in the bounded region [0, 1]^d with high probability, the variances of the normal distributions need to be kept sufficiently small, which makes the KL divergence large.\n\n2 Although the original statement in [2] does not include the independence assumption, we can confirm that it is satisfied in their proof.\n\n3 In their book, the proof is left for the reader as an exercise.\n\n\fapply their result, we obtain R_T = O(\u221a(dk^2 T log(|A| log T / \u03b4))) = \u02dcO(\u221a(dk^3 T)) by multiplying the losses by 1/k. This is \u02dc\u2126(\u221ak) times larger than the bound provided in Theorem 3. To mitigate this issue, we modify their algorithm so that we can perform a more refined analysis for arbitrary \u03b1 and L. The differences between our Algorithm 1, given in Appendix D.1, and Algorithm 12 in [19] are summarized as follows:\n\n\u2022 They deal only with the case in which the noise \u03b7_t has a bounded variance, i.e., \u03b1 = 1. To deal with the case of a general \u03b1, we modify the definition (31) of T_k in their algorithm.\n\u2022 They assume that the suboptimality gap max_{a,b\u2208A} {\u2113\u2217^\u22a4(a \u2212 b)} is bounded by 1. 
To handle the changing suboptimality gaps properly, we modify the definition of \u03b5_t in their algorithm.\n\u2022 They basically consider maximization problems, while we consider minimization (this does not result in essential differences).\n\nWe demonstrate that Algorithm 1 achieves the following regret bound:\n\nTheorem 4. For any input parameters \u03b4 > 0 and \u03b5_1 > 0, with probability at least 1 \u2212 \u03b4, the output of Algorithm 1 satisfies\n\nmax_{a\u2208A} \u2211_{t=1}^T \u2113\u2217^\u22a4(a_t \u2212 a) \u2264 9\u03b1 \u221a(dT log(|A| log T / \u03b4)) + L (2d\u03b1^2/\u03b5_1^2) log(2|A|/\u03b4) + (L + \u03b5_1) d^2.    (3)\n\nTheorem 4 means that the upper bound L on \u2113\u2217^\u22a4 a_t does not affect the leading term of the regret upper bound, while \u03b1 does. By substituting \u03b1 = \u221ak/2 and L = k into the bound in Theorem 4, we obtain Theorem 3.\n\n3 Related Work\n\nBandit combinatorial optimization was first introduced by McMahan and Blum [21] and Awerbuch and Kleinberg [5]. They proposed algorithms achieving regret of \u02dcO(T^{3/4}) and \u02dcO(T^{2/3}), respectively, ignoring dependence on d and logarithmic factors in T. Algorithms with improved regret bounds have been proposed in several papers [2; 6; 7; 10]. These algorithms achieve R_T = O(\u221a(dk^3 T log(ed/k))) in our problem setting. Recently, computationally efficient algorithms achieving sublinear regret have also been introduced in [7; 9; 12; 22; 14].\n\nWith regard to lower bounds for bandit combinatorial optimization, Audibert et al. [2] showed that R_T = \u2126(\u221a(dk^2 T)), and consequently conjectured that this lower bound is tight. However, the recent work by Cohen et al. [8] rejected this conjecture by showing that R_T = \u2126(\u221a(dk^3 T / log T)).\n\nCombinatorial semi-bandit optimization is a variant of bandit combinatorial optimization in which the player can observe not only the total loss \u2113_t^\u22a4 a_t, but also the entry \u2113_{ti} for each chosen arm i \u2208 S_t. This problem was introduced by Gy\u00f6rgy et al. [11] in the context of the online shortest path problem, i.e., they considered the case in which A is the set of all subsets of edges forming a path in a given
graph.\n\nFor general action sets A \u2286 ([d] choose k), Audibert et al. [2] proposed an algorithm achieving a regret of O(\u221a(dkT)) with semi-bandit feedback, and showed that it is minimax optimal, i.e., there is an action set A \u2286 ([d] choose k) such that R_T = \u2126(\u221a(dkT)). With regard to the multiple-play bandit problem, i.e., the case of A = ([d] choose k) with semi-bandit feedback, Uchiya et al. [23] showed that R_T = O(\u221a(dkT)), but it remained open whether this bound is tight, until the recent work by Lattimore et al. [20] provided the proof that R_T = \u2126(\u221a(dkT)).\n\nThe study of stochastic linear bandits was initiated by Abe and Long [1]. They and Auer [3] considered the case of finite action sets that can change every round. Bandit combinatorial optimization with a stochastic environment can be seen as a special case of stochastic linear bandits in which the action set is included in ([d] choose k) and does not change between rounds. Auer [3] introduced a technique of dividing rounds to achieve R_T = O(\u221a(dT (log(|A|T log T))^3)) under the assumption of bounded loss. We remark that a similar technique is used in Algorithm 1. Moreover, a similar technique was used for the spectral bandits considered by Valko et al. [24], in which inappropriate arms are eliminated over several phases.\n\n\f4 Lower Bounds\n\nIn this section, we provide proofs for Theorems 1 and 2. 
First, we revisit the proofs presented in the previous work: Theorem 5 in [2] and Lemma 4 in [8], which provide regret lower bounds of order \u2126(\u221a(dk^2 T)) and \u2126(\u221a(dk^3 T / log T)) for multitask bandits, respectively. From these proofs, we observe that regret lower bounds can be derived from upper bounds on KL divergences determined by the distributions of loss vectors. Second, we construct a distribution of loss vectors whose corresponding KL divergence is small enough. Combining these two results, we obtain Theorem 1, which provides an improved lower bound for multitask bandits. Finally, we extend the proof for multitask bandits to prove Theorem 2 for multiple-play bandits.\n\n4.1 Proof idea used in the previous work\n\nThis subsection revisits the proofs of regret lower bounds for the multitask bandit given in [2] and [8]. We note that, by Yao\u2019s minimax principle, it suffices to construct a probability distribution of \u2113_t such that, in expectation, any deterministic algorithm suffers large regret.\n\nIn both proofs, the probability distribution of the loss vectors is defined as follows. First, set a parameter \u03b5 > 0, which is to be optimized later. For a\u2217 = [a\u2217_1, . . . , a\u2217_d]^\u22a4 \u2208 {0, 1}^d, a probability distribution D_{a\u2217} over R^d is defined such that \u2113 \u223c D_{a\u2217} satisfies\n\nE_{\u2113\u223cD_{a\u2217}}[\u2113_i] = 1/2 \u2212 \u03b5 a\u2217_i    (4)\n\nfor each i \u2208 [d]. More concretely, [2] define D_{a\u2217} such that the i-th entry of the vector follows the Bernoulli distribution with parameter 1/2 \u2212 \u03b5 a\u2217_i, independently. Cohen et al. [8] define D_{a\u2217} such that the i-th entry is equal to 1/2 \u2212 \u03b5 a\u2217_i + Z, where Z follows the normal distribution N(0, \u03c3^2). We can confirm that both definitions satisfy (4). 
The environment picks a\u2217 \u2208 A uniformly at random before the game begins, and then, in round t = 1, 2, . . . , T, generates a loss vector \u2113_t following D_{a\u2217} i.i.d. It should be noted that A is defined by (1) here.\n\nWe analyze the regret bounds for these loss vectors. Let S\u2217 = {i \u2208 [d] | a\u2217_i = 1}, and let a_t be the action chosen by the player in round t. Let us define N_i to be the number of rounds in [T] in which the player suffers a loss from the i-th entry of the loss vectors, i.e., N_i = |{t \u2208 [T] | a_{ti} = 1}|. Then, from (4), the regret R_T satisfies\n\nE_{\u2113_1,...,\u2113_T\u223cD_{a\u2217}}[R_T] \u2265 E_{\u2113_1,...,\u2113_T\u223cD_{a\u2217}}[\u2211_{t=1}^T \u2113_t^\u22a4 a_t \u2212 \u2211_{t=1}^T \u2113_t^\u22a4 a\u2217] = \u03b5 (kT \u2212 E_{\u2113_1,...,\u2113_T\u223cD_{a\u2217}}[\u2211_{i\u2208S\u2217} N_i]).    (5)\n\nFrom (5), to obtain a lower bound on R_T, it suffices to bound \u2211_{i\u2208S\u2217} N_i. To obtain a bound on N_i, we use the following lemma:\n\nLemma 1. Let D and D' be probability distributions over [0, 1]^d. Then, we have\n\n| E_{\u2113_1,...,\u2113_T\u223cD}[N_i] \u2212 E_{\u2113_1,...,\u2113_T\u223cD'}[N_i] | \u2264 T \u221a( \u2211_{t=1}^T E_{a_t\u223cA_t(D)}[ KL_{\u2113\u223cD, \u2113'\u223cD'}(a_t^\u22a4\u2113 || a_t^\u22a4\u2113') ] )    (6)\n\nfor any deterministic algorithm, where A_t(D) represents the probability distribution of the outputs of the algorithm in round t for inputs \u2113_1, \u2113_2, . . . , \u2113_{t\u22121} following D independently.\n\nThis lemma follows from Pinsker\u2019s inequality and the chain rule for the KL divergence. For details, see, e.g., Lemma A.1
in [4].\n\nLemma 1 gives a connection between bounds on N_i and upper bounds on KL divergences of specific distributions. To provide a bound on N_i by means of Lemma 1, Audibert et al. [2] and Cohen et al. [8] used specific properties of their distributions. We observe that their arguments rest on the fact that their distributions of loss vectors satisfy the following condition regarding the KL divergence:\n\na \u2208 A, a\u2217, \u02c6a \u2208 {0, 1}^d, \u02c6a^\u22a4 a \u2212 a\u2217^\u22a4 a = 1, \u2113\u2217 \u223c D_{a\u2217}, \u02c6\u2113 \u223c D_{\u02c6a}\n=\u21d2 KL(\u2113\u2217^\u22a4 a || \u02c6\u2113^\u22a4 a) \u2264 C_D \u03b5^2 for a constant C_D depending on {D_a}.    (7)\n\nIntuitively, the precondition of (7) means that the discrepancy with respect to the expected loss is at most \u03b5. In fact, a\u2217^\u22a4 a in (7) corresponds to the \u201cgoodness\u201d of action a for the loss vector \u2113 \u223c D_{a\u2217}, because the expected loss of action a is equal to k/2 \u2212 \u03b5 a\u2217^\u22a4 a by (4). Consequently, \u02c6a^\u22a4 a \u2212 a\u2217^\u22a4 a = 1 means that the expected loss under D_{a\u2217} is smaller than that under D_{\u02c6a} by \u03b5.\n\nWe can show that, if (7) is true, then Lemma 1 implies that if a\u2217 follows a uniform distribution over A defined by (1), we obtain E_{a\u2217, \u2113_1,...,\u2113_T\u223cD_{a\u2217}}[\u2211_{i\u2208S\u2217} N_i] \u2264 k(T/2 + T\u03b5\u221a(kT C_D/d)). Therefore, if we set \u03b5 \u2264 \u221a(d/(16 C_D kT)), we obtain E_{a\u2217, \u2113_1,...,\u2113_T\u223cD_{a\u2217}}[\u2211_{i\u2208S\u2217} N_i] \u2264 3kT/4, and consequently, we obtain E[R_T] \u2265 \u03b5kT/4 from (5). The main observation of this subsection is summarized as follows:\n\nObservation 1. 
Suppose that a family {D_{a\u2217} | a\u2217 \u2208 {0, 1}^d} of distributions with a parameter \u03b5 \u2264 \u221a(d/(16 C_D kT)) satisfies (4) and (7). If A is given by (1) with n \u2265 2, then we have a regret lower bound of E[R_T] = \u2126(\u03b5kT).\n\n4.2 Construction of the probabilistic distribution\n\nThe goal of this subsection is to construct a family {D_{a\u2217} | a\u2217 \u2208 {0, 1}^d} of distributions such that (4) and (7) are satisfied with C_D = O(1/k^2). From Observation 1, such a construction leads to a regret lower bound of E[R_T] = \u2126(\u221a(dk^3 T)) for the multitask bandit problem, thereby proving Theorem 1.\n\nThe proposed probability distribution of loss vectors is defined as follows. Set a parameter \u03b5 \u2208 [0, 2^{\u221216}], which is to be optimized later. For a\u2217 = [a\u2217_1, . . . , a\u2217_d]^\u22a4 \u2208 {0, 1}^d, let D_{a\u2217} be the distribution of \u2113 = [\u2113_1, . . . , \u2113_d]^\u22a4 \u2208 [0, 1]^d generated in the following way:\n\n(i) Draw u_0 from the uniform distribution over [0, 1].\n(ii) For i \u2208 [d], draw b_i from the Bernoulli distribution with parameter 1/2 + 2\u03b5a\u2217_i.\n(iii) For i \u2208 [d], draw u_i from the uniform distribution over [0, 1/2] if b_i = 1, and over (1/2, 1] if b_i = 0.    (8)\n(iv) Let \u2113_i = 1 if u_i \u2265 u_0, and otherwise \u2113_i = 0.\n\nWe can confirm that (4) holds for this D_{a\u2217}. 
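The sampling steps (i)\u2013(iv) translate directly into code. The sketch below is our own illustration (we deliberately use a much larger \u03b5 than the paper's \u03b5 \u2264 2^{\u221216}, purely so the mean shift is visible at a moderate sample size); it checks the mean property (4), E[\u2113_i] = 1/2 \u2212 \u03b5 a\u2217_i, empirically:\n\n```python\nimport random\n\ndef sample_loss(a_star, eps, rng):\n    \"\"\"One draw from D_{a*}: the shared threshold u0 in step (i) correlates all\n    entries, and the tilted Bernoulli in step (ii) lowers E[l_i] for i in S*.\"\"\"\n    u0 = rng.random()                                              # (i)\n    loss = []\n    for ai in a_star:\n        b = rng.random() < 0.5 + 2 * eps * ai                      # (ii)\n        u = rng.uniform(0.0, 0.5) if b else rng.uniform(0.5, 1.0)  # (iii)\n        loss.append(1 if u >= u0 else 0)                           # (iv)\n    return loss\n\nrng = random.Random(0)\na_star, eps, n = [1, 0, 1, 0], 0.1, 100000\nsums = [0] * len(a_star)\nfor _ in range(n):\n    for i, li in enumerate(sample_loss(a_star, eps, rng)):\n        sums[i] += li\nmeans = [s / n for s in sums]\nprint([round(m, 2) for m in means])  # close to 1/2 - eps*a*_i = [0.4, 0.5, 0.4, 0.5]\n```\n\nNote that the output entries are binary yet far from independent: conditioned on a small u_0, most entries are 1 simultaneously, which is exactly the O(k^2) variance the construction is after.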
In fact, step (iv) means E[\u2113_i] = Prob[u_i \u2265 u_0], and as u_0 follows the uniform distribution over [0, 1] and u_i \u2208 [0, 1], we obtain Prob[u_i \u2265 u_0] = E[u_i]. Moreover, from steps (ii) and (iii), we obtain E[u_i] = (1/4) Prob[b_i = 1] + (3/4) Prob[b_i = 0] = (1/4)(1/2 + 2\u03b5a\u2217_i) + (3/4)(1/2 \u2212 2\u03b5a\u2217_i) = 1/2 \u2212 \u03b5a\u2217_i, which means that (4) holds.\n\nLet us show that (7) is satisfied with C_D = O(1/k^2). As D_{a\u2217} is a distribution over {0, 1}^d, \u2113\u2217^\u22a4 a takes values in {0, 1, . . . , k} for any a \u2208 A and \u2113\u2217 \u223c D_{a\u2217}. For i = 0, 1, . . . , k, define P(i) = Prob[\u2113\u2217^\u22a4 a = i] and P'(i) = Prob[\u02c6\u2113^\u22a4 a = i], where \u2113\u2217 \u223c D_{a\u2217} and \u02c6\u2113 \u223c D_{\u02c6a}. Then, from the definition, the KL divergence can be expressed as follows:\n\nKL(\u2113\u2217^\u22a4 a || \u02c6\u2113^\u22a4 a) = \u2212 \u2211_{i=0}^{k} P(i) log(P'(i)/P(i)) = \u2212 \u2211_{i=0}^{k} P(i) log(1 + (P'(i) \u2212 P(i))/P(i)) \u2264 \u2212 \u2211_{i=0}^{k} P(i) ( (P'(i) \u2212 P(i))/P(i) \u2212 2 ((P'(i) \u2212 P(i))/P(i))^2 ) = 2 \u2211_{i=0}^{k} (P'(i) \u2212 P(i))^2 / P(i),\n\nwhere the inequality comes from the fact that log(1 + x) \u2265 x \u2212 2x^2 for |x| \u2264 1/2 and that |P'(i) \u2212 P(i)|/P(i) \u2264 1/2 holds,4 and the last equality holds because \u2211_{i=0}^{k} (P'(i) \u2212 P(i)) = 1 \u2212 1 = 0. Thereby, it suffices to bound (P'(i) \u2212 P(i))^2/P(i) to derive an upper bound on the KL divergence. We can then show that P(i) = \u2126(1/k) for all i. 
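The \u03b5 = 0 case underlying this claim is easy to verify by simulation: with all u_i i.i.d. uniform, the sum \u2113\u2217^\u22a4 a is uniform on {0, . . . , k}, so every P(i) equals 1/(k+1). A sketch with a toy k of our own choosing:\n\n```python\nimport random\nfrom collections import Counter\n\nrng = random.Random(1)\nk, trials = 5, 120000\ncounts = Counter()\nfor _ in range(trials):\n    u0 = rng.random()\n    # With eps = 0 every u_i is uniform on [0, 1], so l^T a simply counts\n    # how many of the k i.i.d. uniforms u_1..u_k exceed the shared threshold u0.\n    counts[sum(rng.random() >= u0 for _ in range(k))] += 1\nfreqs = [counts[i] / trials for i in range(k + 1)]\nprint(all(abs(f - 1 / (k + 1)) < 0.01 for f in freqs))  # True: P(i) = 1/(k+1)\n```\n\nThis is the order-statistics symmetry invoked next: the rank of u_0 among the k + 1 i.i.d. uniforms u_0, u_1, . . . , u_k is uniformly distributed.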
Indeed, if $\varepsilon = 0$, then we have $P(i) = 1/(k+1)$: as $\mathrm{Prob}[\ell_{ti} = 1] = \mathrm{Prob}[u_i \ge u_0]$ from the definition (8) of $\mathcal{D}_a$, and as each $u_i$ is a uniform random variable over $[0,1]$ under the condition of $\varepsilon = 0$, we have
\[
P(i) = \mathrm{Prob}\Bigl[ \sum_{j=1}^{k} \ell_{tj} = i \Bigr]
= \mathrm{Prob}\bigl[\, u_0 \text{ is the } (i+1)\text{-th smallest among } \{u_j\}_{j=0}^{k} \,\bigr]
= \frac{1}{k+1},
\]
where the last equality comes from the fact that $u_0, u_1, \dots, u_k$ are i.i.d. random variables. Even if $\varepsilon > 0$, we show in Appendix A that for $\varepsilon \le 2^{-16}$, $P(i)$ is sufficiently close to $\frac{1}{k+1}$ to have an order of $\Omega(1/k)$. Thus, we have $P(i) = \Omega(1/k)$ for all $i = 0, 1, \dots, k$ and $\varepsilon \in [0, 2^{-16}]$, and hence, $\mathrm{KL}(\ell^{*\top} a \,\|\, \hat\ell^\top a) = O\bigl( k \sum_{i=0}^{k} (P'(i) - P(i))^2 \bigr)$. Finally, by proving $|P'(i) - P(i)| = O(\varepsilon/k^2)$, we obtain the following lemma:

Footnote 4: The statement $|P'(i) - P(i)|/P(i) \le 1/2$ comes from $\varepsilon \le 2^{-16}$. See Appendix A for details.

Lemma 2. Let $a^*, \hat a \in \{0,1\}^d$ and $\ell^* \sim \mathcal{D}_{a^*}$, $\hat\ell \sim \mathcal{D}_{\hat a}$.
Then, for $\varepsilon \in [0, 2^{-16}]$ and $a \in \{0,1\}^d$ satisfying $\|a\|_1 = k$ and $\hat a^\top a - a^{*\top} a = 1$, we have
\[
\mathrm{KL}(\ell^{*\top} a \,\|\, \hat\ell^\top a) = O\Bigl( \frac{\varepsilon^2}{k^2} + \frac{\varepsilon^4}{k^{3/2}} \Bigr). \tag{9}
\]
The complete proof of this lemma is provided in Appendix A.

4.3 Improved lower bound for the multitask bandit problem

We obtain an improved lower bound for $\mathcal{A}$ defined as (1) by combining Observation 1 and Lemma 2. From Lemma 2, if $\varepsilon \le 2^{-16} k^{-1/4}$, there is a global constant $C$ for which (7) holds with $C_{\mathcal{D}} = (C/k)^2$. Consequently, setting $\varepsilon = \min\bigl\{ 2^{-16} k^{-1/4},\, \frac{1}{4C}\sqrt{dk/T} \bigr\}$, we obtain $\mathbf{E}[R_T] = \Omega(\varepsilon kT) = \Omega(\min\{ k^{3/4} T, \sqrt{dk^3 T} \})$, which provides the lower bound in Theorem 1 for $\mathcal{A}$ given by (1), i.e., the multitask bandit problem. The key point for shaving off the $\sqrt{\log T}$ factor is that the probabilistic distribution presented in Section 4.2 satisfies (7) with $C_{\mathcal{D}} = O(1/k^2)$, whereas the distribution used in the previous work [8] achieves only $C_{\mathcal{D}} = O(\log T / k^2)$.

4.4 Improved and extended lower bound for the bandit ranking problem

For the bandit ranking problem, Cohen et al. [8] identified lower bounds by considering $\ell_t \sim \mathcal{D}_{a^*}$ for $a^* \in \mathcal{A}$, similarly to the multitask bandit problem. However, this approach does not work well for the case of full permutations (i.e., with $k = n$), and has left an $\Omega(\sqrt{n})$ gap between the lower and the upper bounds, as mentioned in the conclusion of that work.
We can eliminate this $\Omega(\sqrt{n})$ gap by improving the lower bound via a surprisingly simple approach. In contrast to the probability distribution considered by Cohen et al.
[8], which has $k$ good arms ($i$ such that $a^*_i = 1$), we define the probability distribution with $m = \lceil k/2 \rceil$ good arms, i.e., we consider $a^* \in \mathcal{A}' \subseteq \{0,1\}^d$ defined by
\[
\mathcal{A}' = \Biggl\{ a \in \{0,1\}^d \;\Bigg|\; \sum_{i=(j-1)n+1}^{jn} a_i = \begin{cases} 1 & (1 \le j \le m) \\ 0 & (m < j \le k) \end{cases}, \quad \sum_{i=1}^{k} a_{(i-1)n+j} \le 1 \quad (j \in [n]) \Biggr\}. \tag{10}
\]
Lemma 3. Suppose a family $\{\mathcal{D}_{a^*} \mid a^* \in \{0,1\}^d\}$ of distributions with a parameter $\varepsilon \le \sqrt{d/(32 C_{\mathcal{D}} kT)}$ that satisfies (4) and (7). Suppose $n \ge 2$ and $1 \le k \le n$. If $a^*$ is chosen from $\mathcal{A}'$ defined by (10), and $\ell_t$ follows $\mathcal{D}_{a^*}$ for $t = 1, 2, \dots, T$, independently, then, for the bandit ranking problem defined by (2), any algorithm suffers regret of $\mathbf{E}[R_T] = \Omega(\varepsilon kT)$.

The proof of this lemma is provided in Appendix B.

The lower bound in Lemma 3 is valid even if $k = n$, while the approach of the previous work [8] considering $a^* \in \mathcal{A}$ is applicable only to the case of $n \ge 2k$. Intuitively, this difference can be explained as follows: the regret depends on the number of good arms ($i \in [d]$ such that $a^*_i = 1$) among the chosen arms ($i \in [d]$ such that $a_{ti} = 1$). If $a^*$ and $a_t$ are chosen from $\mathcal{A}$ with $k = n$, and if the chosen arms (defined by $a_t$) include $k - 1$ good arms, then the chosen arms automatically include all $k$ good arms, because $a^*$ and $a_t$ express edge sets of perfect matchings of the complete bipartite graph $K_{k,k}$. This means that, in this setting, the probability of choosing several good arms strongly affects that of choosing other good arms, which makes the analysis difficult.
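To make the definition (10) concrete: a member of $\mathcal{A}'$ is a $k \times n$ zero-one matrix (flattened into $a \in \{0,1\}^{kn}$) whose first $m = \lceil k/2 \rceil$ rows each select exactly one column, whose remaining rows are all zero, and whose columns are selected at most once. A brute-force enumeration for tiny parameters, written as our own illustrative code rather than anything from the paper, is:

```python
from itertools import product
from math import ceil
import numpy as np

def build_A_prime(n, k):
    """Enumerate A' of (10): k x n 0-1 matrices (flattened to d = k*n) whose
    first m = ceil(k/2) rows contain exactly one 1, whose remaining rows are
    all-zero, and whose columns contain at most one 1."""
    m = ceil(k / 2)
    members = []
    for bits in product([0, 1], repeat=k * n):
        A = np.array(bits).reshape(k, n)
        row_ok = all(A[j].sum() == (1 if j < m else 0) for j in range(k))
        col_ok = all(A[:, j].sum() <= 1 for j in range(n))
        if row_ok and col_ok:
            members.append(A)
    return members

# Full-permutation setting k = n = 3, so m = 2 good rows:
Ap = build_A_prime(3, 3)
print(len(Ap))  # n * (n - 1) = 6 partial assignments
```

For these constraints the count is $n(n-1)\cdots(n-m+1)$, the number of ways to assign distinct columns to the $m$ good rows, which is $6$ here.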
However, such an effect can be reduced if $a^*$ is chosen from $\mathcal{A}'$, i.e., if $a^*$ has only $m = \lceil k/2 \rceil$ good arms.
The lower bound in Theorem 1 for the bandit ranking problem, i.e., $\mathcal{A}$ given by (2), can be derived in the same way as in Section 4.3. This completes the proof of Theorem 1.

5 Conclusion

In this study, we considered regret bounds for bandit combinatorial optimization. We improved the regret lower bounds of the existing study [8] by a factor of $\sqrt{\log T}$. The obtained lower bounds apply to three practically important examples of bandit combinatorial optimization, and are valid under parameter constraints milder than those in existing studies. In particular, the bound for bandit ranking obtained in the present study resolves an open problem posed in [8]. To shave off the $\sqrt{\log T}$ factor, we introduced a novel class of distributions, which could potentially be used to improve regret lower bounds for other problems as well. Moreover, by obtaining a regret lower bound under the assumption of independent losses, we demonstrated that correlation among losses is the cause of large regret.
With respect to bandit combinatorial optimization, we decreased the gap between the upper and the lower bounds to $O(\log(ed/k))$. We leave improving this gap to a constant factor as an open question for future research.

References

[1] N. Abe and P. M. Long. Associative reinforcement learning using linear probabilistic concepts. In International Conference on Machine Learning, pages 3-11, 1999.

[2] J.-Y. Audibert, S. Bubeck, and G. Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31-45, 2013.

[3] P. Auer.
Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422, 2002.

[4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.

[5] B. Awerbuch and R. D. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 45-53. ACM, 2004.

[6] S. Bubeck, N. Cesa-Bianchi, and S. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, volume 23, pages 41.1-41.14, 2012.

[7] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404-1422, 2012.

[8] A. Cohen, T. Hazan, and T. Koren. Tight bounds for bandit combinatorial optimization. In Conference on Learning Theory, pages 629-642, 2017.

[9] R. Combes, M. S. T. M. Shahi, A. Proutiere, and M. Lelarge. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, pages 2116-2124, 2015.

[10] V. Dani, S. M. Kakade, and T. P. Hayes. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, pages 345-352, 2008.

[11] A. György, T. Linder, G. Lugosi, and G. Ottucsák. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8(Oct):2369-2403, 2007.

[12] E. Hazan and Z. Karnin. Volumetric spanners: An efficient exploration basis for learning. The Journal of Machine Learning Research, 17(1):4062-4095, 2016.

[13] D. P. Helmbold and M. K. Warmuth. Learning permutations with exponential weights. Journal of Machine Learning Research, 10(Jul):1705-1736, 2009.

[14] S. Ito, D. Hatano, H. Sumita, K. Takemura, T.
Fukunaga, N. Kakimura, and K. Kawarabayashi. Oracle-efficient algorithms for online linear optimization with bandit feedback. In Advances in Neural Information Processing Systems, 2019, to appear.

[15] J. Kiefer and J. Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363-366, 1960.

[16] J. Komiyama, J. Honda, and H. Nakagawa. Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In International Conference on Machine Learning, pages 1152-1161, 2015.

[17] W. Kotłowski and G. Neu. Bandit principal component analysis. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99, pages 1994-2024, 2019.

[18] P. Lagrée, C. Vernade, and O. Cappe. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems, pages 1597-1605, 2016.

[19] T. Lattimore and C. Szepesvári. Bandit Algorithms. Preprint, Revision 1699, 2019.

[20] T. Lattimore, B. Kveton, S. Li, and C. Szepesvari. TopRank: A practical algorithm for online stochastic ranking. In Advances in Neural Information Processing Systems, 2018.

[21] H. B. McMahan and A. Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In International Conference on Computational Learning Theory, pages 109-123, 2004.

[22] S. Sakaue, M. Ishihata, and S.-i. Minato. Efficient bandit combinatorial optimization algorithm with zero-suppressed binary decision diagrams. In International Conference on Artificial Intelligence and Statistics, pages 585-594, 2018.

[23] T. Uchiya, A. Nakamura, and M. Kudo. Algorithms for adversarial bandit problems with multiple plays. In International Conference on Algorithmic Learning Theory, pages 375-389, 2010.

[24] M. Valko, R. Munos, B. Kveton, and T. Kocák.
Spectral bandits for smooth graph functions. In International Conference on Machine Learning, pages 46-54, 2014.