{"title": "Combinatorial Bandits Revisited", "book": "Advances in Neural Information Processing Systems", "page_first": 2116, "page_last": 2124, "abstract": "This paper investigates stochastic and adversarial combinatorial multi-armed bandit problems. In the stochastic setting under semi-bandit feedback, we derive a problem-specific regret lower bound, and discuss its scaling with the dimension of the decision space. We propose ESCB, an algorithm that efficiently exploits the structure of the problem and provide a finite-time analysis of its regret. ESCB has better performance guarantees than existing algorithms, and significantly outperforms these algorithms in practice. In the adversarial setting under bandit feedback, we propose CombEXP, an algorithm with the same regret scaling as state-of-the-art algorithms, but with lower computational complexity for some combinatorial problems.", "full_text": "Combinatorial Bandits Revisited\n\nRichard Combes\u2217\n\nM. Sadegh Talebi\u2020\n\nAlexandre Proutiere\u2020\n\u2217 Centrale-Supelec, L2S, Gif-sur-Yvette, FRANCE\n\nMarc Lelarge\u2021\n\n\u2020 Department of Automatic Control, KTH, Stockholm, SWEDEN\n\n\u2021 INRIA & ENS, Paris, FRANCE\n\nrichard.combes@supelec.fr,{mstms,alepro}@kth.se,marc.lelarge@ens.fr\n\nAbstract\n\nThis paper investigates stochastic and adversarial combinatorial multi-armed ban-\ndit problems.\nIn the stochastic setting under semi-bandit feedback, we derive\na problem-speci\ufb01c regret lower bound, and discuss its scaling with the dimen-\nsion of the decision space. We propose ESCB, an algorithm that ef\ufb01ciently ex-\nploits the structure of the problem and provide a \ufb01nite-time analysis of its regret.\nESCB has better performance guarantees than existing algorithms, and signi\ufb01-\ncantly outperforms these algorithms in practice. 
In the adversarial setting under bandit feedback, we propose COMBEXP, an algorithm with the same regret scaling as state-of-the-art algorithms, but with lower computational complexity for some combinatorial problems.

1 Introduction

Multi-Armed Bandit (MAB) problems [1] constitute the most fundamental sequential decision problems with an exploration vs. exploitation trade-off. In such problems, the decision maker selects an arm in each round, and observes a realization of the corresponding unknown reward distribution. Each decision is based on past decisions and observed rewards. The objective is to maximize the expected cumulative reward over some time horizon by balancing exploitation (arms with higher observed rewards should be selected often) and exploration (all arms should be explored to learn their average rewards). Equivalently, the performance of a decision rule or algorithm can be measured through its expected regret, defined as the gap between the expected reward achieved by the algorithm and that achieved by an oracle algorithm always selecting the best arm. MAB problems have found applications in many fields, including sequential clinical trials, communication systems, and economics, see e.g. [2, 3].

In this paper, we investigate generic combinatorial MAB problems with linear rewards, as introduced in [4]. In each round n >= 1, a decision maker selects an arm M from a finite set M ⊆ {0, 1}^d and receives a reward M^T X(n) = Σ_{i=1}^d M_i X_i(n). The reward vector X(n) ∈ R_+^d is unknown. We focus here on the case where all arms consist of the same number m of basic actions, in the sense that ||M||_1 = m for all M ∈ M. After selecting an arm M in round n, the decision maker receives some feedback. We consider both (i) semi-bandit feedback, under which after round n, for all i ∈ {1, . . .
, d}, the component X_i(n) of the reward vector is revealed if and only if M_i = 1; and (ii) bandit feedback, under which only the reward M^T X(n) is revealed. Based on the feedback received up to round n − 1, the decision maker selects an arm for the next round n, and her objective is to maximize her cumulative reward over a given time horizon consisting of T rounds. The challenge in these problems resides in the very large number of arms, i.e., in their combinatorial structure: the size of M could well grow as d^m. Fortunately, one may hope to exploit the problem structure to speed up the exploration of sub-optimal arms.

We consider two instances of combinatorial bandit problems, depending on how the sequence of reward vectors is generated. We first analyze the case of stochastic rewards, where for all i ∈ {1, . . . , d}, (X_i(n))_{n>=1} are i.i.d. with Bernoulli distribution of unknown mean. The reward sequences are also independent across i. We then address the problem in the adversarial setting where the sequence of vectors X(n) is arbitrary and selected by an adversary at the beginning of the experiment.

LLR [9]:            O( m^3 d Δ_max Δ_min^{-2} log(T) )
CUCB [10]:          O( m^2 d Δ_min^{-1} log(T) )
CUCB [11]:          O( m d Δ_min^{-1} log(T) )
ESCB (Theorem 5):   O( √m d Δ_min^{-1} log(T) )

Table 1: Regret upper bounds for stochastic combinatorial optimization under semi-bandit feedback.
In the stochastic setting, we provide sequential arm selection algorithms whose performance exceeds that of existing algorithms, whereas in the adversarial setting, we devise simple algorithms whose regret has the same scaling as that of state-of-the-art algorithms, but with lower computational complexity.

2 Contribution and Related Work

2.1 Stochastic combinatorial bandits under semi-bandit feedback

Contribution. (a) We derive an asymptotic (as the time horizon T grows large) regret lower bound satisfied by any algorithm (Theorem 1). This lower bound is problem-specific and tight: there exists an algorithm that attains the bound on all problem instances, although that algorithm might be computationally expensive. To our knowledge, such lower bounds have not been proposed in the case of stochastic combinatorial bandits. The dependency in m and d of the lower bound is unfortunately not explicit. We further provide a simplified lower bound (Theorem 2) and derive its scaling in (m, d) in specific examples.

(b) We propose ESCB (Efficient Sampling for Combinatorial Bandits), an algorithm whose regret scales at most as O(√m d Δ_min^{-1} log(T)) (Theorem 5), where Δ_min denotes the expected reward difference between the best and the second-best arm. ESCB assigns an index to each arm. The index of a given arm can be interpreted as performing likelihood tests with vanishing risk on its average reward. Our indexes are the natural extension of the KL-UCB and UCB1 indexes defined for unstructured bandits [5, 21]. Numerical experiments for some specific combinatorial problems are presented in the supplementary material, and show that ESCB significantly outperforms existing algorithms.

Related work. Previous contributions on stochastic combinatorial bandits focused on specific combinatorial structures, e.g. m-sets [6], matroids [7], or permutations [8].
Generic combinatorial problems were investigated in [9, 10, 11, 12]. The proposed algorithms, LLR and CUCB, are variants of the UCB algorithm, and their performance guarantees are presented in Table 1. Our algorithms improve over LLR and CUCB by a multiplicative factor of √m.

2.2 Adversarial combinatorial problems under bandit feedback

Contribution. We present algorithm COMBEXP, whose regret is O( √( m^3 T (d + m^{1/2} λ^{-1}) log μ_min^{-1} ) ), where μ_min = min_{i∈[d]} (1/(m|M|)) Σ_{M∈M} M_i and λ is the smallest nonzero eigenvalue of the matrix E[M M^T] when M is uniformly distributed over M (Theorem 6). For most problems of interest, m(dλ)^{-1} = O(1) [4] and μ_min^{-1} = O(poly(d/m)), so that COMBEXP has O(√(m^3 d T log(d/m))) regret. A known regret lower bound is Ω(m √(dT)) [13], so the regret gap between COMBEXP and this lower bound scales at most as m^{1/2} up to a logarithmic factor.

Related work. Adversarial combinatorial bandits have been extensively investigated recently, see [13] and references therein. Some papers consider specific instances of these problems, e.g., shortest-path routing [14], m-sets [15], and permutations [16]. For generic combinatorial problems, known regret lower bounds scale as Ω(√(m d T)) and Ω(m √(d T)) (if d >= 2m) in the case of semi-bandit and bandit feedback, respectively [13].
In the case of semi-bandit feedback, [13] proposes OSMD, an algorithm whose regret upper bound matches the lower bound. [17] presents an algorithm with O(m √(d L⋆_T log(d/m))) regret, where L⋆_T is the total reward of the best arm after T rounds.

For problems with bandit feedback, [4] proposes COMBAND and derives a regret upper bound which depends on the structure of the arm set M. For most problems of interest, the regret under COMBAND is upper-bounded by O(√(m^3 d T log(d/m))). [18] addresses generic linear optimization with bandit feedback, and the proposed algorithm, referred to as EXP2 WITH JOHN'S EXPLORATION, has a regret scaling at most as O(√(m^3 d T log(d/m))) in the case of combinatorial structure. As we show next, for many combinatorial structures of interest (e.g. m-sets, matchings, spanning trees), COMBEXP yields the same regret as COMBAND and EXP2 WITH JOHN'S EXPLORATION, with lower computational complexity for a large class of problems. Table 2 summarises known regret bounds.

Lower Bound [13]:                     Ω( m √(dT) ), if d >= 2m
COMBAND [4]:                          O( √( m^3 d T log(d/m) (1 + 2m/(dλ)) ) )
EXP2 WITH JOHN'S EXPLORATION [18]:    O( √( m^3 d T log(d/m) ) )
COMBEXP (Theorem 6):                  O( √( m^3 d T (1 + m^{1/2}/(dλ)) log μ_min^{-1} ) )

Table 2: Regret of various algorithms for adversarial combinatorial bandits with bandit feedback. Note that for most combinatorial classes of interest, m(dλ)^{-1} = O(1) and μ_min^{-1} = O(poly(d/m)).

Example 1: m-sets.
M is the set of all d-dimensional binary vectors with m non-zero coordinates. We have μ_min = m/d and λ = m(d−m)/(d(d−1)) (refer to the supplementary material for details). Hence when m = o(d), the regret upper bound of COMBEXP becomes O(√(m^3 d T log(d/m))), which is the same as that of COMBAND and EXP2 WITH JOHN'S EXPLORATION.

Example 2: matchings. The set of arms M is the set of perfect matchings in K_{m,m}. d = m^2 and |M| = m!. We have μ_min = 1/m and λ = 1/(m−1). Hence the regret upper bound of COMBEXP is O(√(m^5 T log(m))), the same as for COMBAND and EXP2 WITH JOHN'S EXPLORATION.

Example 3: spanning trees. M is the set of spanning trees in the complete graph K_N. In this case, d = C(N, 2), m = N − 1, and by Cayley's formula M has N^{N−2} arms. We have log μ_min^{-1} <= 2N for N >= 2, and m/(dλ) < 7 when N >= 6. The regret upper bound of COMBAND and EXP2 WITH JOHN'S EXPLORATION becomes O(√(N^5 T log(N))). As for COMBEXP, we get the same regret upper bound O(√(N^5 T log(N))).

3 Models and Objectives

We consider MAB problems where each arm M is a subset of m basic actions taken from [d] = {1, . . . , d}. For i ∈ [d], X_i(n) denotes the reward of basic action i in round n. In the stochastic setting, for each i, the sequence of rewards (X_i(n))_{n>=1} is i.i.d. with Bernoulli distribution with mean θ_i. Rewards are assumed to be independent across actions. We denote by θ = (θ_1, . . . , θ_d)^T ∈ Θ = [0, 1]^d the vector of unknown expected rewards of the various basic actions. In the adversarial setting, the reward vector X(n) = (X_1(n), . . . , X_d(n))^T ∈ [0, 1]^d is arbitrary, and the sequence (X(n), n >= 1) is decided (but unknown) at the beginning of the experiment.

The set of arms M is an arbitrary subset of {0, 1}^d, such that each of its elements M has m basic actions.
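The constants claimed in Example 1 can be checked numerically for small (d, m): for m-sets under the uniform distribution, E[M M^T] = (a − b)I + bJ with a = P(M_i = 1), b = P(M_i = M_j = 1) and J the all-ones matrix, so its eigenvalues are a − b (multiplicity d − 1) and a + (d − 1)b. A sketch in exact rational arithmetic (the helper name is ours):

```python
from itertools import combinations
from fractions import Fraction

def msets_stats(d: int, m: int):
    """For the m-set class M = {M in {0,1}^d : ||M||_1 = m}, compute
    mu_min = min_i P(M_i = 1) and the two distinct eigenvalues of
    E[M M^T] under the uniform distribution over M."""
    arms = list(combinations(range(d), m))
    n = len(arms)
    diag = Fraction(sum(1 for a in arms if 0 in a), n)            # P(M_i = 1)
    off = Fraction(sum(1 for a in arms if 0 in a and 1 in a), n)  # P(M_i = M_j = 1)
    # E[M M^T] = (diag - off) I + off J, so its eigenvalues are
    # diag - off (multiplicity d - 1) and diag + (d - 1) off.
    lam_small = diag - off
    lam_large = diag + (d - 1) * off
    mu_min = diag  # by symmetry, P(M_i = 1) = m/d for every coordinate i
    return mu_min, lam_small, lam_large
```

For instance, d = 6 and m = 2 gives μ_min = 1/3 = m/d and smallest nonzero eigenvalue 4/15 = m(d − m)/(d(d − 1)), matching the formulas above.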
Arm M is identified with a binary column vector (M_1, . . . , M_d)^T, and we have ||M||_1 = m, ∀M ∈ M. At the beginning of each round n, a policy π selects an arm M^π(n) ∈ M based on the arms chosen in previous rounds and their observed rewards. The reward of arm M^π(n) selected in round n is Σ_{i∈[d]} M^π_i(n) X_i(n) = M^π(n)^T X(n).

We consider both semi-bandit and bandit feedback. Under semi-bandit feedback and policy π, at the end of round n, the outcomes of basic actions X_i(n) for all i ∈ M^π(n) are revealed to the decision maker, whereas under bandit feedback, only M^π(n)^T X(n) can be observed.

Let Π be the set of all feasible policies. The objective is to identify a policy in Π maximizing the cumulative expected reward over a finite time horizon T. The expectation is here taken with respect to possible randomness in the rewards (in the stochastic setting) and the possible randomization in the policy. Equivalently, we aim at designing a policy that minimizes regret, where the regret of policy π ∈ Π is defined by:

R_π(T) = max_{M∈M} E[ Σ_{n=1}^T M^T X(n) ] − E[ Σ_{n=1}^T M^π(n)^T X(n) ].

Finally, for the stochastic setting, we denote by μ_M(θ) = M^T θ the expected reward of arm M, and let M⋆(θ) ∈ M, or M⋆ for short, be any arm with maximum expected reward: M⋆(θ) ∈ arg max_{M∈M} μ_M(θ). In what follows, to simplify the presentation, we assume that the optimal M⋆ is unique.
We further define: μ⋆(θ) = M⋆^T θ, Δ_min = min_{M≠M⋆} Δ_M where Δ_M = μ⋆(θ) − μ_M(θ), and Δ_max = max_M (μ⋆(θ) − μ_M(θ)).

4 Stochastic Combinatorial Bandits under Semi-bandit Feedback

4.1 Regret Lower Bound

Given θ, define the set of parameters that cannot be distinguished from θ when selecting action M⋆(θ), and for which arm M⋆(θ) is suboptimal:

B(θ) = {λ ∈ Θ : M⋆_i(θ)(θ_i − λ_i) = 0, ∀i, μ⋆(λ) > μ⋆(θ)}.

We define X = (R_+)^{|M|} and kl(u, v) the Kullback-Leibler divergence between Bernoulli distributions of respective means u and v, i.e., kl(u, v) = u log(u/v) + (1 − u) log((1 − u)/(1 − v)). Finally, for (θ, λ) ∈ Θ^2, we define the vector kl(θ, λ) = (kl(θ_i, λ_i))_{i∈[d]}.

We derive a regret lower bound valid for any uniformly good algorithm. An algorithm π is uniformly good iff R_π(T) = o(T^α) for all α > 0 and all parameters θ ∈ Θ. The proof of this result relies on a general result on controlled Markov chains [19].

Theorem 1 For all θ ∈ Θ, for any uniformly good policy π ∈ Π, lim inf_{T→∞} R_π(T)/log(T) >= c(θ), where c(θ) is the optimal value of the optimization problem:

inf_{x∈X} Σ_{M∈M} x_M (M⋆(θ) − M)^T θ   s.t.   ( Σ_{M∈M} x_M M )^T kl(θ, λ) >= 1, ∀λ ∈ B(θ).   (1)

Observe first that optimization problem (1) is a semi-infinite linear program which can be solved for any fixed θ, but its optimal value is difficult to compute explicitly.
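For concreteness, the Bernoulli divergence kl(u, v) used throughout the paper can be computed as follows (a small self-contained sketch; the function name and the boundary guard are ours):

```python
import math

def kl_bernoulli(u: float, v: float) -> float:
    """KL divergence kl(u, v) between Bernoulli(u) and Bernoulli(v),
    with the conventions 0*log(0/x) = 0 and a guard near v in {0, 1},
    where kl(u, v) = +inf unless u = v."""
    eps = 1e-15
    v = min(max(v, eps), 1 - eps)
    res = 0.0
    if u > 0:
        res += u * math.log(u / v)
    if u < 1:
        res += (1 - u) * math.log((1 - u) / (1 - v))
    return res
```

By Pinsker's inequality, kl(u, v) >= 2(u − v)^2, a fact used later to compare the two ESCB indexes.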
Determining how c(θ) scales as a function of the problem dimensions d and m is not obvious. Also note that (1) has the following interpretation: assume that (1) has a unique solution x⋆. Then any uniformly good algorithm must select action M at least x⋆_M log(T) times over the first T rounds. From [19], we know that there exists an algorithm which is asymptotically optimal, so that its regret matches the lower bound of Theorem 1. However, this algorithm suffers from two problems: it is computationally infeasible for large problems, since it involves solving (1) T times; furthermore, it has no finite-time performance guarantees, and numerical experiments suggest that its finite-time performance on typical problems is rather poor. Further remark that if M is the set of singletons (classical bandit), Theorem 1 reduces to the Lai-Robbins bound [20], and if M is the set of m-sets (bandit with multiple plays), Theorem 1 reduces to the lower bound derived in [6]. Finally, Theorem 1 can be generalized in a straightforward manner to the case where rewards belong to a one-parameter exponential family of distributions (e.g., Gaussian, Exponential, Gamma, etc.) by replacing kl by the appropriate divergence measure.

A Simplified Lower Bound. We now study how the regret c(θ) scales as a function of the problem dimensions d and m. To this aim, we present a simplified regret lower bound. Given θ, we say that a set H ⊆ M \ M⋆ has property P(θ) iff, for all (M, M′) ∈ H^2 with M ≠ M′, we have M_i M′_i (1 − M⋆_i(θ)) = 0 for all i. We may now state Theorem 2.

Theorem 2 Let H be a maximal (inclusion-wise) subset of M with property P(θ). Define β(θ) = min_{M≠M⋆} Δ_M / |M \ M⋆|. Then:

c(θ) >= Σ_{M∈H} β(θ) / max_{i∈M\M⋆} kl( θ_i, (1/|M \ M⋆|) Σ_{j∈M⋆\M} θ_j ).

Corollary 1 Let θ ∈ [a, 1]^d for some constant a > 0 and M be such that each arm M ∈ M, M ≠ M⋆, has at most k suboptimal basic actions. Then c(θ) = Ω(|H|/k).

Theorem 2 provides an explicit regret lower bound. Corollary 1 states that c(θ) scales at least with the size of H. For most combinatorial sets, |H| is proportional to d − m (see the supplementary material for some examples), which implies that in these cases, one cannot obtain a regret smaller than O((d − m) Δ_min^{-1} log(T)). This result is intuitive since d − m is the number of parameters not observed when selecting the optimal arm. The algorithms proposed below have a regret of O(d √m Δ_min^{-1} log(T)), which is acceptable since typically √m is much smaller than d.

4.2 Algorithms

Next we present ESCB, an algorithm for stochastic combinatorial bandits that relies on arm indexes as in UCB1 [21] and KL-UCB [5]. We derive finite-time regret upper bounds for ESCB that hold even if we only assume that ||M||_1 <= m, ∀M ∈ M, instead of ||M||_1 = m, so that arms may have different numbers of basic actions.

4.2.1 Indexes

ESCB relies on arm indexes. In general, an index of arm M in round n, say b_M(n), should be defined so that b_M(n) >= M^T θ with high probability. Then, as for UCB1 and KL-UCB, applying the principle of optimism in the face of uncertainty, a natural way to devise algorithms based on indexes is to select in each round the arm with the highest index. Under a given algorithm, at time n, we define t_i(n) = Σ_{s=1}^n M_i(s), the number of times basic action i has been sampled.
The empirical mean reward of action i is then defined as θ̂_i(n) = (1/t_i(n)) Σ_{s=1}^n X_i(s) M_i(s) if t_i(n) > 0, and θ̂_i(n) = 0 otherwise. We define the corresponding vectors t(n) = (t_i(n))_{i∈[d]} and θ̂(n) = (θ̂_i(n))_{i∈[d]}.

The indexes we propose are functions of the round n and of θ̂(n). Our first index for arm M, referred to as b_M(n, θ̂(n)) or b_M(n) for short, is an extension of the KL-UCB index. Let f(n) = log(n) + 4m log(log(n)). b_M(n, θ̂(n)) is the optimal value of the following optimization problem:

max_{q∈Θ} M^T q   s.t.   (M t(n))^T kl(θ̂(n), q) <= f(n),   (2)

where we use the convention that for v, u ∈ R^d, vu = (v_i u_i)_{i∈[d]}. As we show later, b_M(n) may be computed efficiently using a line search procedure similar to that used to determine the KL-UCB index.

Our second index, c_M(n, θ̂(n)) or c_M(n) for short, is a generalization of the UCB1 and UCB-tuned indexes:

c_M(n) = M^T θ̂(n) + √( (f(n)/2) Σ_{i=1}^d M_i / t_i(n) ).

Note that, in classical bandit problems with independent arms, i.e., when m = 1, b_M(n) reduces to the KL-UCB index (which yields an asymptotically optimal algorithm) and c_M(n) reduces to the UCB-tuned index. The next theorem provides generic properties of our indexes.
An important consequence of these properties is that the expected number of times where b_{M⋆}(n, θ̂(n)) or c_{M⋆}(n, θ̂(n)) underestimate μ⋆(θ) is finite, as stated in the corollary below.

Theorem 3 (i) For all n >= 1, M ∈ M and τ ∈ [0, 1]^d, we have b_M(n, τ) <= c_M(n, τ). (ii) There exists C_m > 0 depending on m only such that, for all M ∈ M and n >= 2:

P[ b_M(n, θ̂(n)) <= M^T θ ] <= C_m n^{-1} (log(n))^{-2}.

Corollary 2 Σ_{n>=1} P[ b_{M⋆}(n, θ̂(n)) <= μ⋆ ] <= 1 + C_m Σ_{n>=2} n^{-1} (log(n))^{-2} < ∞.

Statement (i) in the above theorem is obtained by combining the Pinsker and Cauchy-Schwarz inequalities. The proof of statement (ii) is based on a concentration inequality on sums of empirical KL divergences proven in [22], which makes it possible to control the fluctuations of multivariate empirical distributions for exponential families. It should also be observed that the indexes b_M(n) and c_M(n) can be extended in a straightforward manner to the case of continuous linear bandit problems, where the set of arms is the unit sphere and one wants to maximize the dot product between the arm and an unknown vector. b_M(n) can also be extended to the case where reward distributions are not Bernoulli but lie in an exponential family (e.g. Gaussian, Exponential, Gamma, etc.), replacing kl by a suitably chosen divergence measure. A close look at c_M(n) reveals that the indexes proposed in [10], [11], and [9] are too conservative to be optimal in our setting: there, the "confidence bonus" Σ_{i=1}^d M_i / t_i(n) was replaced by (at least) m Σ_{i=1}^d M_i / t_i(n). Note that [10], [11] assume that the various basic actions are arbitrarily correlated, while we assume independence among basic actions. When independence does not hold, [11] provides a problem instance where the regret is at least Ω(m d Δ_min^{-1} log(T)).
This does not contradict our regret upper bound (scaling as O(d √m Δ_min^{-1} log(T))), since we have added the independence assumption.

4.2.2 Index computation

While the index c_M(n) is explicit, b_M(n) is defined as the solution to an optimization problem. We show that it may be computed by a simple line search. For λ >= 0, w ∈ [0, 1] and v ∈ N, define:

g(λ, w, v) = ( 1 − λv + √( (1 − λv)^2 + 4wvλ ) ) / 2.

Fix n, M, θ̂(n) and t(n). Define I = {i : M_i = 1, θ̂_i(n) ≠ 1}, and for λ > 0, define:

F(λ) = Σ_{i∈I} t_i(n) kl( θ̂_i(n), g(λ, θ̂_i(n), t_i(n)) ).

Theorem 4 If I = ∅, then b_M(n) = ||M||_1. Otherwise: (i) λ ↦ F(λ) is strictly increasing, and F(R_+) = R_+. (ii) Define λ⋆ as the unique solution to F(λ) = f(n). Then b_M(n) = ||M||_1 − |I| + Σ_{i∈I} g(λ⋆, θ̂_i(n), t_i(n)).

Theorem 4 shows that b_M(n) can be computed using a line search procedure such as bisection, as this computation amounts to solving the nonlinear equation F(λ) = f(n), where F is strictly monotone. The proof of Theorem 4 follows from the KKT conditions and the convexity of the KL divergence.

4.2.3 The ESCB Algorithm

The pseudo-code of ESCB is presented in Algorithm 1. We consider two variants of the algorithm based on the choice of the index ξ_M(n): ESCB-1 when ξ_M(n) = b_M(n) and ESCB-2 if ξ_M(n) = c_M(n). In practice, ESCB-1 outperforms ESCB-2. Introducing ESCB-2 is however instrumental in the regret analysis of ESCB-1 (in view of Theorem 3 (i)). The following theorem provides a finite-time analysis of our ESCB algorithms.
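Theorem 4 translates directly into code. The sketch below (function names are ours) computes both indexes, locating λ⋆ by bracketing and bisecting the root of F(λ) = f(n); arms are given as lists of basic-action indices:

```python
import math

def kl(u, v, eps=1e-12):
    # Bernoulli KL divergence with a guard near v in {0, 1}
    v = min(max(v, eps), 1 - eps)
    out = 0.0
    if u > 0:
        out += u * math.log(u / v)
    if u < 1:
        out += (1 - u) * math.log((1 - u) / (1 - v))
    return out

def g(lam, w, v):
    # per-coordinate maximiser from Theorem 4
    return (1 - lam * v + math.sqrt((1 - lam * v) ** 2 + 4 * w * v * lam)) / 2

def index_c(M, theta_hat, t, f_n):
    """Explicit UCB-style index c_M(n); M is a list of basic-action indices."""
    mean = sum(theta_hat[i] for i in M)
    return mean + math.sqrt(f_n / 2 * sum(1.0 / t[i] for i in M))

def index_b(M, theta_hat, t, f_n):
    """KL-UCB-style index b_M(n), via a line search on F(lam) = f_n."""
    I = [i for i in M if theta_hat[i] != 1.0]
    if not I:
        return float(len(M))
    def F(lam):
        return sum(t[i] * kl(theta_hat[i], g(lam, theta_hat[i], t[i])) for i in I)
    lo, hi = 1e-12, 1.0
    while F(hi) > f_n:   # expand until F(hi) <= f_n (with this g, F(lam) -> 0 as lam grows)
        hi *= 2
    for _ in range(200):  # bisection on the bracketed root
        mid = (lo + hi) / 2
        if F(mid) > f_n:
            lo = mid
        else:
            hi = mid
    lam_star = (lo + hi) / 2
    return len(M) - len(I) + sum(g(lam_star, theta_hat[i], t[i]) for i in I)
```

By construction b_M(n) is at least the empirical mean M^T θ̂(n) and, by Theorem 3 (i), at most c_M(n), which gives a quick sanity check on the implementation.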
The proof of this theorem borrows some ideas from the proof of [11, Theorem 3].

Theorem 5 The regret under algorithms π ∈ {ESCB-1, ESCB-2} satisfies, for all T >= 1:

R_π(T) <= 16 d √m Δ_min^{-1} f(T) + 4 d m^3 Δ_min^{-2} + C′_m,

where C′_m >= 0 does not depend on θ, d and T. As a consequence, R_π(T) = O(d √m Δ_min^{-1} log(T)) when T → ∞.

Algorithm 1 ESCB
for n >= 1 do
    Select arm M(n) ∈ arg max_{M∈M} ξ_M(n).
    Observe the rewards, and update t_i(n) and θ̂_i(n), ∀i ∈ M(n).
end for

Algorithm 2 COMBEXP
Initialization: Set q_0 = μ^0, γ = √(m log μ_min^{-1}) / ( √(m log μ_min^{-1}) + √(C (C m^2 d + m) T) ), and η = γC, with C = λ m^{-3/2}.
for n >= 1 do
    Mixing: Let q′_{n−1} = (1 − γ) q_{n−1} + γ μ^0.
    Decomposition: Select a distribution p_{n−1} over M such that Σ_M p_{n−1}(M) M = m q′_{n−1}.
    Sampling: Select a random arm M(n) with distribution p_{n−1} and incur a reward Y_n = Σ_i X_i(n) M_i(n).
    Estimation: Let Σ_{n−1} = E[M M^T], where M has law p_{n−1}. Set X̃(n) = Y_n Σ⁺_{n−1} M(n), where Σ⁺_{n−1} is the pseudo-inverse of Σ_{n−1}.
    Update: Set q̃_n(i) ∝ q_{n−1}(i) exp(η X̃_i(n)), ∀i ∈ [d].
    Projection: Set q_n to be the projection of q̃_n onto the set P using the KL divergence.
end for

ESCB with time horizon T has a complexity of O(|M| T), as neither b_M nor c_M can be written as M^T y for some vector y ∈ R^d. Assuming that the offline (static) combinatorial problem is solvable in O(V(M)) time, the complexity of the CUCB algorithm in [10] and [11] after T rounds is O(V(M) T).
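Algorithm 1 is straightforward to simulate. The toy sketch below runs ESCB-2 (the explicit index c_M) on Bernoulli rewards with arm enumeration, so it is only practical for small M; all names are ours and it is illustrative, not the authors' implementation:

```python
import math
import random

def escb2(arms, theta, T, seed=0):
    """Toy simulation of ESCB-2 (index xi_M = c_M) under semi-bandit feedback.
    `arms` is a list of tuples of basic-action indices, `theta` their Bernoulli means."""
    rng = random.Random(seed)
    d = len(theta)
    m = max(len(a) for a in arms)
    t = [0] * d            # t_i(n): number of times basic action i was sampled
    s = [0.0] * d          # cumulative observed reward of basic action i
    plays = {a: 0 for a in arms}
    for n in range(1, T + 1):
        f_n = math.log(n) + 4 * m * math.log(max(math.log(n), 1.0))
        def c_index(a):
            if any(t[i] == 0 for i in a):
                return float("inf")  # force each arm to be tried once
            mean = sum(s[i] / t[i] for i in a)
            return mean + math.sqrt(f_n / 2 * sum(1.0 / t[i] for i in a))
        arm = max(arms, key=c_index)
        plays[arm] += 1
        for i in arm:      # semi-bandit feedback: each X_i(n) of the chosen arm is observed
            x = 1.0 if rng.random() < theta[i] else 0.0
            t[i] += 1
            s[i] += x
    return plays, t
```

On a two-arm instance with a large gap, the optimal arm quickly dominates the play counts, consistent with the logarithmic regret guarantee of Theorem 5.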
Thus, if the of\ufb02ine problem is ef\ufb01ciently implementable, i.e.,\nV (M) = O(poly(d/m)), CUCB is ef\ufb01cient, whereas ESCB is not since |M| may have expo-\nIn \u00a72.5 of the supplement, we provide an extension of ESCB called\nnentially many elements.\nEPOCH-ESCB, that attains almost the same regret as ESCB while enjoying much better computa-\ntional complexity.\n\n5 Adversarial Combinatorial Bandits under Bandit Feedback\n\nWe now consider adversarial combinatorial bandits with bandit feedback. We start with the follow-\ning observation:\n\nM\u2208M M(cid:62)X = max\n\u00b5\u2208Co(M)\n\nmax\n\n\u00b5(cid:62)X,\n\nwith Co(M) the convex hull of M. We embed M in the d-dimensional simplex by dividing its\nelements by m. Let P be this scaled version of Co(M).\nInspired by OSMD [13, 18], we propose the COMBEXP algorithm, where the KL divergence\nis the Bregman divergence used to project onto P.\nProjection using the KL divergence is\naddressed in [23]. We denote the KL divergence between distributions q and p in P by\nq(i) . The projection of distribution q onto a closed convex set \u039e of\n\nKL(p, q) =(cid:80)\n\ni\u2208[d] p(i) log p(i)\n\ndistributions is p(cid:63) = arg minp\u2208\u039e KL(p, q).\nLet \u03bb be the smallest nonzero eigenvalue of E[M M(cid:62)], where M is uniformly distributed over M.\nWe de\ufb01ne the exploration-inducing distribution \u00b50 \u2208 P: \u00b50\n\u2200i \u2208 [d], and\ni . \u00b50 is the distribution over basic actions [d] induced by the uniform distri-\nlet \u00b5min = mini m\u00b50\nbution over M. The pseudo-code for COMBEXP is shown in Algorithm 2. The KL projection\nin COMBEXP ensures that mqn\u22121 \u2208 Co(M). There exists \u03bb, a distribution over M such that\nM \u03bb(M )M. This guarantees that the system of linear equations in the decomposition\nstep is consistent. We propose to perform the projection step (the KL projection of \u02dcq onto P) using\ninterior-point methods [24]. 
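For the special case of m-sets, P = {p in the simplex : p_i <= 1/m}, and the KL projection admits a closed water-filling form: p_i = min(α q̃_i, 1/m), with α chosen so that the entries sum to one. The sketch below illustrates this special case only; it is not the method of §3.4 of the supplement, the function name is ours, and it assumes strictly positive input (as produced by the exponential update):

```python
def kl_project_capped(q, cap):
    """KL projection of a probability vector q (all entries > 0) onto
    {p : sum(p) = 1, 0 <= p_i <= cap}. The minimiser has the
    water-filling form p_i = min(alpha * q_i, cap)."""
    d = len(q)
    assert cap * d >= 1.0 - 1e-12, "the capped simplex must be non-empty"
    order = sorted(range(d), key=lambda i: q[i], reverse=True)
    capped = set()
    mass_cap, mass_q = 0.0, sum(q)
    for i in order:
        # alpha that renormalises the currently-uncapped coordinates
        alpha = (1.0 - mass_cap) / mass_q
        if alpha * q[i] <= cap:
            break  # this (and every smaller) coordinate respects the cap
        capped.add(i)
        mass_cap += cap
        mass_q -= q[i]
    alpha = (1.0 - mass_cap) / mass_q if mass_q > 0 else 0.0
    return [cap if i in capped else alpha * q[i] for i in range(d)]
```

The capped coordinates sit at 1/m, and the remaining ones keep the proportions of q̃, which is exactly the KKT structure of minimizing Σ_i p_i log(p_i/q̃_i) under the cap constraints.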
We provide a simpler method in §3.4 of the supplement. The decomposition step can be efficiently implemented using the algorithm of [25] (writing m q′_{n−1} = Σ_{M∈M} λ(M) M for a distribution λ over M, with μ^0_i = (1/(m|M|)) Σ_{M∈M} M_i for all i ∈ [d]). The following theorem provides a regret upper bound for COMBEXP.

Theorem 6 For all T >= 1:

R_COMBEXP(T) <= 2 √( m^3 T (d + m^{1/2}/λ) log μ_min^{-1} ) + (m^{5/2}/λ) log μ_min^{-1}.

For most classes of M, we have μ_min^{-1} = O(poly(d/m)) and m(dλ)^{-1} = O(1) [4]. For these classes, COMBEXP has a regret of O(√(m^3 d T log(d/m))), which is a factor √(m log(d/m)) off the lower bound (see Table 2).

It might not be possible to compute the projection step exactly, and this step can be solved up to accuracy ε_n in round n. Namely, we find q_n such that KL(q_n, q̃_n) − min_{p∈P} KL(p, q̃_n) <= ε_n. Proposition 1 shows that for ε_n = O(n^{-2} log^{-3}(n)), the approximate projection gives the same regret as when the projection is computed exactly. Theorem 7 gives the computational complexity of COMBEXP with approximate projection. When Co(M) is described by polynomially (in d) many linear equalities/inequalities, COMBEXP is efficiently implementable and its running time scales (almost) linearly in T. Proposition 1 and Theorem 7 easily extend to other OSMD-type algorithms and thus might be of independent interest.

Proposition 1 If the projection step of COMBEXP is solved up to accuracy ε_n = O(n^{-2} log^{-3}(n)), we have:

R_COMBEXP(T) <= 2 √( 2 m^3 T (d + m^{1/2}/λ) log μ_min^{-1} ) + (2 m^{5/2}/λ) log μ_min^{-1}.

Theorem 7 Assume that Co(M) is defined by c linear equalities and s linear inequalities.
If the\nprojection step is solved up to accuracy \u0001n = O(n\u22122 log\n\u22123(n)), then COMBEXP has time com-\nplexity.\nThe time complexity of COMBEXP can be reduced by exploiting the structure of M (See [24,\npage 545]). In particular, if inequality constraints describing Co(M) are box constraints, the time\ncomplexity of COMBEXP is O(T [c2\u221a\nThe computational complexity of COMBEXP is determined by the structure of Co(M) and COMB-\nEXP has O(T log(T )) time complexity due to the ef\ufb01ciency of interior-point methods. In con-\ntrast, the computational complexity of COMBAND depends on the complexity of sampling from M.\nCOMBAND may have a time complexity that is super-linear in T (see [16, page 217]). For instance,\nconsider the matching problem described in Section 2. We have c = 2m equality constraints and\ns = m2 box constraints, so that the time complexity of COMBEXP is: O(m5T log(T )). It is noted\nthat using [26, Algorithm 1], the cost of decomposition in this case is O(m4). On the other hand,\nCOMBBAND has a time complexity of O(m10F (T )), with F a super-linear function, as it requires\nto approximate a permanent, requiring O(m10) operations per round. Thus, COMBEXP has much\nlower complexity than COMBAND and achieves the same regret.\n\ns(c + d) log(T ) + d4]).\n\n6 Conclusion\n\nWe have investigated stochastic and adversarial combinatorial bandits. For stochastic combinatorial\nbandits with semi-bandit feedback, we have provided a tight, problem-dependent regret lower bound\nthat, in most cases, scales at least as O((d \u2212 m)\u2206\u22121\n\u221a\nmin log(T )). We proposed ESCB, an algorithm\nwith O(d\nmin log(T )) regret. We plan to reduce the gap between this regret guarantee and\nthe regret lower bound, as well as investigate the performance of EPOCH-ESCB. For adversarial\ncombinatorial bandits with bandit feedback, we proposed the COMBEXP algorithm. 
There is a gap between the regret of COMBEXP and the known regret lower bound in this setting, and we plan to reduce it as much as possible.

Acknowledgments

A. Proutiere's research is supported by the ERC FSA grant and the SSF ICT-Psi project.

References

[1] Herbert Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169–177. Springer, 1985.
[2] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–222, 2012.
[3] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games, volume 1. Cambridge University Press, 2006.
[4] Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
[5] Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proc. of COLT, 2011.
[6] Venkatachalam Anantharam, Pravin Varaiya, and Jean Walrand. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays - Part I: i.i.d. rewards. IEEE Transactions on Automatic Control, 32(11):968–976, 1987.
[7] Branislav Kveton, Zheng Wen, Azin Ashkan, Hoda Eydgahi, and Brian Eriksson. Matroid bandits: Fast combinatorial optimization with learning. In Proc. of UAI, 2014.
[8] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Learning multiuser channel allocations in cognitive radio networks: A combinatorial multi-armed bandit formulation. In Proc. of IEEE DySPAN, 2010.
[9] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Trans.
on Networking, 20(5):1466–1478, 2012.
[10] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In Proc. of ICML, 2013.
[11] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvári. Tight regret bounds for stochastic combinatorial semi-bandits. In Proc. of AISTATS, 2015.
[12] Zheng Wen, Azin Ashkan, Hoda Eydgahi, and Branislav Kveton. Efficient learning in large-scale combinatorial semi-bandits. In Proc. of ICML, 2015.
[13] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2013.
[14] András György, Tamás Linder, Gábor Lugosi, and György Ottucsák. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8(10), 2007.
[15] Satyen Kale, Lev Reyzin, and Robert Schapire. Non-stochastic bandit slate problems. In Advances in Neural Information Processing Systems, pages 1054–1062, 2010.
[16] Nir Ailon, Kohei Hatano, and Eiji Takimoto. Bandit online optimization over the permutahedron. In Algorithmic Learning Theory, pages 215–229. Springer, 2014.
[17] Gergely Neu. First-order regret bounds for combinatorial semi-bandits. In Proc. of COLT, 2015.
[18] Sébastien Bubeck, Nicolò Cesa-Bianchi, and Sham M. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Proc. of COLT, 2012.
[19] Todd L. Graves and Tze Leung Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM J. Control and Optimization, 35(3):715–743, 1997.
[20] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[21] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer.
Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[22] Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bounds and optimal algorithms. In Proc. of COLT, 2014.
[23] I. Csiszár and P. C. Shields. Information Theory and Statistics: A Tutorial. Now Publishers Inc, 2004.
[24] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[25] H. D. Sherali. A constructive proof of the representation theorem for polyhedral sets based on fundamental definitions. American Journal of Mathematical and Management Sciences, 7(3-4):253–270, 1987.
[26] David P. Helmbold and Manfred K. Warmuth. Learning permutations with exponential weights. Journal of Machine Learning Research, 10:1705–1736, 2009.