{"title": "Minimal Exploration in Structured Stochastic Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 1763, "page_last": 1771, "abstract": "This paper introduces and addresses a wide class of stochastic bandit problems where the function mapping the arm to the corresponding reward exhibits some known structural properties. Most existing structures (e.g. linear, lipschitz, unimodal, combinatorial, dueling,...) are covered by our framework. We derive an asymptotic instance-specific regret lower bound for these problems, and develop OSSB, an algorithm whose regret matches this fundamental limit. OSSB is not based on the classical principle of ``optimism in the face of uncertainty'' or on Thompson sampling, and rather aims at matching the minimal exploration rates of sub-optimal arms as characterized in the derivation of the regret lower bound. We illustrate the efficiency of OSSB using numerical experiments in the case of the linear bandit problem and show that OSSB outperforms existing algorithms, including Thompson sampling", "full_text": "Minimal Exploration\n\nin Structured Stochastic Bandits\n\nRichard Combes\n\nCentrale-Supelec / L2S\n\nrichard.combes@supelec.fr\n\nStefan Magureanu\n\nKTH, EE School / ACL\n\nmagur@kth.se\n\nAlexandre Proutiere\nKTH, EE School / ACL\n\nalepro@kth.se\n\nAbstract\n\nThis paper introduces and addresses a wide class of stochastic bandit problems\nwhere the function mapping the arm to the corresponding reward exhibits some\nknown structural properties. Most existing structures (e.g. linear, Lipschitz, uni-\nmodal, combinatorial, dueling, . . . ) are covered by our framework. We derive an\nasymptotic instance-speci\ufb01c regret lower bound for these problems, and develop\nOSSB, an algorithm whose regret matches this fundamental limit. 
OSSB is not based on the classical principle of "optimism in the face of uncertainty" or on Thompson sampling, and rather aims at matching the minimal exploration rates of sub-optimal arms as characterized in the derivation of the regret lower bound. We illustrate the efficiency of OSSB using numerical experiments in the case of the linear bandit problem and show that OSSB outperforms existing algorithms, including Thompson sampling.

1 Introduction

Numerous extensions of the classical stochastic MAB problem [30] have been recently investigated. These extensions are motivated by applications arising in various fields, including e.g. on-line services (search engines, display ads, recommendation systems, ...), and most often concern structural properties of the mapping of arms to their average rewards. This mapping can for instance be linear [14], convex [2], unimodal [36], Lipschitz [3], or may exhibit some combinatorial structure [10, 29, 35].

In their seminal paper, Lai and Robbins [30] develop a comprehensive theory for MAB problems with unrelated arms, i.e., without structure. They derive asymptotic (as the time horizon grows large) instance-specific regret lower bounds and propose algorithms achieving this minimal regret. These algorithms have since been considerably simplified, so that today we have a few elementary index-based¹ and yet asymptotically optimal algorithms [18, 26]. Developing a similar comprehensive theory for MAB problems with structure is considerably more challenging. Due to the structure, the rewards observed for a given arm actually provide side-information about the average rewards of other arms². This side-information should be exploited so as to accelerate as much as possible the process of learning the average rewards.
Only very recently have instance-specific regret lower bounds and asymptotically optimal algorithms been derived, and only for a few MAB problems with a finite set of arms and specific structures, namely linear [31], Lipschitz [32] and unimodal [12].

In this paper, we investigate a large class of structured MAB problems. This class extends the classical stochastic MAB problem [30] in two directions: (i) it allows for any arbitrary structure; (ii) it allows different kinds of feedback. More precisely, our generic MAB problem is as follows.

¹An algorithm is index-based if the arm selection in each round is made solely by comparing the indexes of the arms, where the index of an arm only depends on the rewards observed for this arm.
²Index-based algorithms cannot be optimal in MAB problems with structure.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In each round, the decision maker selects an arm from a finite set X. Each arm x ∈ X has an unknown parameter θ(x) ∈ R, and when this arm is chosen in round t, the decision maker observes a real-valued random variable Y(x, t) with expectation θ(x) and distribution ν(θ(x)). The observations (Y(x, t))_{x∈X, t≥1} are independent across arms and rounds. If x is chosen, she also receives an unobserved and deterministic³ reward μ(x, θ), where θ = (θ(x))_{x∈X}. The parameter θ lies in a compact set Θ that encodes the structural properties of the problem. The set Θ, the class of distributions ν, and the mapping (x, θ) ↦ μ(x, θ) encode the structure of the problem and are known to the decision maker, whereas θ is initially unknown. We denote by x^π(t) the arm selected in round t under algorithm π; this selection is based on previously selected arms and the corresponding observations.
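As a concrete illustration of this model, here is a minimal sketch (our own; the class and variable names are not from the paper) instantiating the framework for a small linear bandit, where the learner observes noisy samples of θ(x) while the deterministic reward μ(x, θ) stays hidden:

```python
import numpy as np

rng = np.random.default_rng(0)

class StructuredBandit:
    """Generic model sketch: arm x has a hidden parameter theta(x); pulling x
    yields an observation Y(x, t) with distribution nu(theta(x)); the reward
    mu(x, theta) is deterministic and never observed by the learner."""

    def __init__(self, theta, mu):
        self.theta = np.asarray(theta, dtype=float)
        self.mu = mu                      # reward map (x, theta) -> mu(x, theta)

    def pull(self, x):
        # Here nu(a) is Gaussian with mean a and unit variance.
        return self.theta[x] + rng.normal()

    def reward(self, x):
        return self.mu(x, self.theta)

# Linear structure: theta(x) = <phi, x> for feature vectors x in R^2.
features = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
phi = np.array([0.3, 0.2])                # unknown to the learner in practice
env = StructuredBandit(features @ phi, mu=lambda x, th: th[x])
best = int(np.argmax([env.reward(x) for x in range(len(features))]))
```

Replacing the lambda passed as `mu` (e.g. by `g(th[x])` for some link function `g`) changes the structure without changing the interaction protocol.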
Hence the set Π of all possible arm selection rules consists of algorithms π such that for any t ≥ 1, x^π(t) is F^π_t-measurable, where F^π_t is the σ-algebra generated by (x^π(1), Y(x^π(1), 1), ..., x^π(t−1), Y(x^π(t−1), t−1)). The performance of an algorithm π ∈ Π is defined through its regret up to round T:

R^π(T, θ) = T max_{x∈X} μ(x, θ) − Σ_{t=1}^{T} E[μ(x(t), θ)].

The above MAB problem is very generic, as any kind of structure can be considered. In particular, it includes classical, linear, unimodal, dueling, and Lipschitz bandit problems as particular examples; see Section 3 for details. Our contributions in this paper are as follows:

• We derive a tight instance-specific regret lower bound satisfied by any algorithm for our generic structured MAB problem.
• We develop OSSB (Optimal Sampling for Structured Bandits), a simple and yet asymptotically optimal algorithm, i.e., its regret matches our lower bound. OSSB optimally exploits the structure of the problem so as to minimize regret.
• We briefly exemplify the numerical performance of OSSB in the case of linear bandits. OSSB outperforms existing algorithms (including Thompson Sampling [2], GLM-UCB [16], and a recently proposed asymptotically optimal algorithm [31]).

As noticed in [31], for structured bandits (even for linear bandits), no algorithm based on the principle of optimism (à la UCB) or on Thompson sampling can achieve an asymptotically minimal regret. The design of OSSB does not follow these principles, and is rather inspired by the derivation of the regret lower bound. To obtain this bound, we characterize the minimal rates at which sub-optimal arms have to be explored. OSSB aims at sampling sub-optimal arms so as to match these rates.
The latter depend on the unknown parameter θ, and so OSSB needs to accurately estimate θ. OSSB hence alternates between three phases: exploitation (playing arms with high empirical rewards), exploration (playing sub-optimal arms at well-chosen rates), and estimation (getting to know θ to tune these exploration rates).

The main technical contribution of this paper is a finite-time regret analysis of OSSB for any generic structure. In spite of the simplicity of the algorithm, its analysis is involved. Not surprisingly, it uses concentration-of-measure arguments, but it also requires establishing that the minimal exploration rates (derived in the regret lower bound) are essentially smooth with respect to the parameter θ. This complication arises due to the (additional) estimation phase of OSSB: the minimal exploration rates should converge as our estimate of θ gets more and more accurate.

The remainder of the paper is organized as follows. In the next section, we survey recent results on structured stochastic bandits. In Section 3, we illustrate the versatility of our MAB problem by casting most existing structured bandit problems into our framework. Section 4 is devoted to the derivation of the regret lower bound. In Sections 5 and 6, we present OSSB and provide an upper bound on its regret. Finally, Section 7 explores the numerical performance of OSSB in the case of linear structures.

³Usually in MAB problems, the reward is a random variable given as feedback to the decision maker. In our model, the reward is deterministic (as if it were averaged), but not observed, as the only observation is Y(x, t) if x is chosen in round t.
We will illustrate in Section 3 why usual MAB formulations are specific instances of our model.

2 Related work

Structured bandits have generated many recent contributions, since they find natural applications in the design of computer systems, for instance: recommender systems and information retrieval [28, 11], routing in networks and network optimization [22, 5, 17], and influence maximization in social networks [8]. A large number of existing structures have been investigated, including: linear [14, 34, 1, 31, 27] (linear bandits are treated here as a partial monitoring game), combinatorial [9, 10, 29, 35, 13], Lipschitz [32], unimodal [36, 12]. The results in this paper cover all models considered in the above body of work, and are the first that can be applied to problems with any structure in the set of allowed parameters.

Here, we focus on generic stochastic bandits with a finite but potentially large number of arms. Both continuous and adversarial versions of the problem have been investigated; see the survey [6]. The performance of Thompson sampling for generic bandit problems has been studied in the literature [15, 20]; however, the recent results in [31] prove that Thompson sampling is not optimal for all structured bandits. Generic structured bandits were treated in [7, 21]. The authors show that the regret of any algorithm must scale as C(θ) ln T when T → ∞, where C(θ) is the optimal value of a semi-infinite linear program, and propose asymptotically optimal algorithms. However, the proposed algorithms are involved and have poor numerical performance; furthermore, their performance guarantees are asymptotic, and no finite-time analysis is available.

To our knowledge, our algorithm is the first which covers completely generic MAB problems, is asymptotically optimal, and is amenable to a finite-time regret analysis.
Our algorithm is in the same spirit as the DMED algorithm presented in [24], as well as the algorithm in [31], but is generic enough to be optimal in any structured bandit setting. Similar to DMED, our algorithm relies on repeatedly solving an optimization problem and then exploring according to its solution, thus moving away from the UCB family of algorithms.

3 Examples

The class of MAB problems described in the introduction covers most known bandit problems, as illustrated in the six following examples.

Classical Bandits. The classical MAB problem [33] with Bernoulli rewards is obtained by making the following choices: θ(x) ∈ [0, 1]; Θ = [0, 1]^{|X|}; for any a ∈ [0, 1], ν(a) is the Bernoulli distribution with mean a; for all x ∈ X, μ(x, θ) = θ(x).

Linear Bandits. To get finite linear bandit problems [14], [31], in our framework we choose X as a finite subset of R^d; we pick an unknown vector φ ∈ R^d and define θ(x) = ⟨φ, x⟩ for all x ∈ X; the set of possible parameters is Θ = {θ = (⟨φ, x⟩)_{x∈X} : φ ∈ R^d}; for any a ∈ R, ν(a) is a Gaussian distribution with unit variance centered at a; for all x ∈ X, μ(x, θ) = θ(x). Observe that our framework also includes generalized linear bandit problems as those considered in [16]: we just need to define μ(x, θ) = g(θ(x)) for some function g.

Dueling Bandits. To model dueling bandits [27] using our framework, the set of arms is X = {(i, j) ∈ {1, . . .
, d}²}; for any x = (i, j) ∈ X, θ(x) ∈ [0, 1] denotes the probability that i is better than j, with the conventions that θ(i, j) = 1 − θ(j, i) and θ(i, i) = 1/2; Θ = {θ : ∃i* : θ(i*, j) > 1/2, ∀j ≠ i*} is the set of parameters such that there exists a Condorcet winner; for any a ∈ [0, 1], ν(a) is the Bernoulli distribution with mean a; finally, we define the rewards as μ((i, j), θ) = (1/2)(θ(i*, i) + θ(i*, j) − 1). Note that the best arm is (i*, i*) and has zero reward.

Lipschitz Bandits. For finite Lipschitz bandits [32], the set of arms X is a finite subset of a metric space endowed with a distance ℓ. For any x ∈ X, θ(x) is a scalar, the mapping x ↦ θ(x) is Lipschitz continuous with respect to ℓ, and the set of parameters is:

Θ = {θ : |θ(x) − θ(y)| ≤ ℓ(x, y), ∀x, y ∈ X}.

As in classical bandits, μ(x, θ) = θ(x). The structure is encoded by the distance ℓ; it is an example of local structure, ensuring that arms close to each other have similar rewards.

Unimodal Bandits. Unimodal bandits [23], [12] are obtained as follows. X = {1, ..., |X|}, θ(x) is a scalar, and μ(x, θ) = θ(x). The added assumption is that x ↦ θ(x) is unimodal: there exists x* ∈ X such that this mapping is strictly increasing on {1, ..., x*} and strictly decreasing on {x*, ..., |X|}.

Combinatorial Bandits. Combinatorial bandit problems with bandit feedback (see [9]) are just particular instances of linear bandits where the set of arms X is a subset of {0, 1}^d. To model combinatorial problems with semi-bandit feedback, we need a slight extension of the framework described in the introduction.
More precisely, the set of arms is still a subset of {0, 1}^d. The observation Y(x, t) is a d-dimensional random variable with independent components, with mean θ(x) and distribution ν(θ(x)) (a product distribution). There is an unknown vector φ ∈ R^d such that θ(x) = (φ(1)x(1), ..., φ(d)x(d)), and μ(x, θ) = Σ_{i=1}^{d} φ(i)x(i) (linear reward). With semi-bandit feedback, the decision maker gets detailed information about the various components of the selected arm.

4 Regret Lower Bound

To derive regret lower bounds, a strategy consists in restricting attention to so-called uniformly good algorithms [30]: π ∈ Π is uniformly good if R^π(T, θ) = o(T^a) when T → ∞, for all a > 0 and all θ ∈ Θ. A simple change-of-measure argument is then enough to prove that for MAB problems without structure, under any uniformly good algorithm, the number of times that a sub-optimal arm x is played must be greater than ln T / d(θ(x), θ(x*)) as the time horizon T grows large, where x* denotes the optimal arm and d(θ(x), θ(x*)) is the Kullback-Leibler divergence between the distributions ν(θ(x)) and ν(θ(x*)). Refer to [25] for a direct and elegant proof.

For our structured MAB problems, we follow the same strategy, and derive constraints on the number of times a sub-optimal arm x is played under any uniformly good algorithm. We show that this number is asymptotically greater than c(x, θ) ln T, where the c(x, θ)'s are the solutions of a semi-infinite linear program [19] whose constraints directly depend on the structure of the problem.

Before stating our lower bound, we introduce the following notations. For θ ∈ Θ, let x*(θ) be the optimal arm (we assume that it is unique), and define μ*(θ) = μ(x*(θ), θ).
For any x ∈ X, we denote by D(θ, λ, x) the Kullback-Leibler divergence between the distributions ν(θ(x)) and ν(λ(x)).

Assumption 1 The optimal arm x*(θ) is unique.

Theorem 1 Let π ∈ Π be a uniformly good algorithm. For any θ ∈ Θ, we have:

lim inf_{T→∞} R^π(T, θ) / ln T ≥ C(θ),     (1)

where C(θ) is the value of the optimization problem:

minimize_{η(x)≥0, x∈X}  Σ_{x∈X} η(x)(μ*(θ) − μ(x, θ))     (2)

subject to  Σ_{x∈X} η(x) D(θ, λ, x) ≥ 1,  ∀λ ∈ Λ(θ),     (3)

where

Λ(θ) = {λ ∈ Θ : D(θ, λ, x*(θ)) = 0, x*(θ) ≠ x*(λ)}.     (4)

Let (c(x, θ))_{x∈X} denote the solutions of the semi-infinite linear program (2)-(3). In this program, η(x) ln T indicates the number of times arm x is played. The regret lower bound may be understood as follows. The set Λ(θ) is the set of "confusing" parameters: if λ ∈ Λ(θ) then D(θ, λ, x*(θ)) = 0, so λ and θ cannot be differentiated by only sampling the optimal arm x*(θ). Hence distinguishing θ from λ requires sampling sub-optimal arms x ≠ x*(θ). Further, since any uniformly good algorithm must identify the best arm with high probability to ensure low regret, and x*(θ) ≠ x*(λ), any algorithm must distinguish these two parameters. The constraint (3) states that for any λ, a uniformly good algorithm should perform a hypothesis test between θ and λ, and Σ_{x∈X} η(x) D(θ, λ, x) ≥ 1 is required to ensure there is enough statistical information to perform this test.
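To make the semi-infinite program (2)-(3) concrete, here is a sketch (our own illustration, not the paper's code) for a classical 3-armed bandit with Gaussian unit-variance observations, so that D(θ, λ, x) = (θ(x) − λ(x))²/2. It approximates Λ(θ) by a finite grid of confusing parameters and solves the resulting LP with scipy:

```python
import numpy as np
from scipy.optimize import linprog

theta = np.array([0.2, 0.5, 0.9])
star = int(np.argmax(theta))                 # x*(theta)
subs = [x for x in range(len(theta)) if x != star]
gaps = theta[star] - theta                   # mu*(theta) - mu(x, theta)

def kl(a, b):
    return 0.5 * (a - b) ** 2                # KL divergence between N(a,1) and N(b,1)

# Discretize Lambda(theta): each confusing lambda agrees with theta at x*,
# but lifts one sub-optimal arm x above theta(x*), so that x*(lambda) != x*(theta).
rows = []
for x in subs:
    for v in np.linspace(theta[star] + 1e-3, theta[star] + 0.5, 50):
        rows.append([kl(theta[z], v) if z == x else 0.0 for z in subs])

# LP: minimize sum_x eta(x) gap(x)  s.t.  sum_x eta(x) D(theta, lambda, x) >= 1.
res = linprog(c=gaps[subs], A_ub=-np.array(rows), b_ub=-np.ones(len(rows)),
              bounds=[(0, None)] * len(subs))
c_rates = res.x                              # approximates c(x, theta) for x != x*
closed_form = 1.0 / kl(theta[subs], theta[star])
```

With constraints generated only from parameters lifting a single sub-optimal arm just above θ(x*), the LP recovers the known unstructured solution c(x, θ) = 1/d(θ(x), θ(x*)) up to discretization error; structured problems change only the constraint set.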
In summary, for a sub-optimal arm x, c(x, θ) ln T represents the asymptotically minimal number of times x should be sampled. Note that this lower bound is instance-specific (it depends on θ), and is attainable, since we propose an algorithm which attains it. The proof of Theorem 1 is presented in the appendix, and leverages techniques used in the context of controlled Markov chains [21].

Next, we show that with usual structures such as those considered in Section 3, the semi-infinite linear program (2)-(3) reduces to simpler optimization problems (e.g. an LP) and can sometimes even be solved explicitly. Simplifying (2)-(3) is important for us, since our proposed asymptotically optimal algorithm requires solving this program. In the following examples, please refer to Section 3 for the definitions and notations. As mentioned already, the solution of (2)-(3) for the classical MAB is c(x, θ) = 1/d(θ(x), θ(x*)).

Linear bandits. For this class of problems, [31] recently proved that (2)-(3) is equivalent to the following optimization problem:

minimize_{η(x)≥0, x∈X}  Σ_{x∈X} η(x)(θ(x*) − θ(x))

subject to  x⊤ (Σ_{z∈X} η(z) z z⊤)^{-1} x ≤ (θ(x*) − θ(x))² / 2,  ∀x ≠ x*.

Refer to [31] for the proof of this result, and for insightful discussions.

Lipschitz bandits. It can be shown that for Bernoulli rewards (the reward of arm x is θ(x)), (2)-(3) reduces to the following LP [32]:

minimize_{η(x)≥0, x∈X}  Σ_{x∈X} η(x)(θ(x*) − θ(x))

subject to  Σ_{z∈X} η(z) d(θ(z), max{θ(z), θ(x*) − ℓ(x, z)}) ≥ 1,  ∀x ≠ x*.

While the solution is not explicit, the problem reduces to an LP with |X| variables and 2|X| constraints.

Dueling bandits. The solution of (2)-(3) is as follows [27]. Assume, to simplify, that for any i ≠ i*, there exists a unique j minimizing μ((i, j), θ)/d(θ(i, j), 1/2) and such that θ(i, j) < 1/2. Let j(i) denote this index. Then for any x = (i, j), we have

c(x, θ) = 1{j = j(i)} / d(θ(i, j), 1/2).

Unimodal bandits. For such problems, it is shown in [12] that the solution of (2)-(3) is given by: for all x ∈ X,

c(x, θ) = 1{|x − x*| = 1} / d(θ(x), θ(x*)).

Hence, in unimodal bandits, under an asymptotically optimal algorithm, the sub-optimal arms contributing to the regret (i.e., those that need to be sampled Ω(ln T) times) are neighbours of the optimal arm.

5 The OSSB Algorithm

In this section we propose OSSB (Optimal Sampling for Structured Bandits), an algorithm that is asymptotically optimal, i.e., its regret matches the lower bound of Theorem 1. The OSSB pseudo-code is presented in Algorithm 1; it takes as input two parameters ε, γ > 0 that control the amount of exploration performed by the algorithm.

The design of OSSB is guided by the necessity to explore sub-optimal arms as much as prescribed by the solution of the optimization problem (2)-(3), i.e., the sub-optimal arm x should be explored c(x, θ) ln T times. If θ were known, then sampling arm x c(x, θ) ln T times for all x, and then selecting the arm with the largest estimated reward, should yield minimal regret. Since θ is unknown, we have to estimate it.
Define the empirical averages:

m(x, t) = ( Σ_{s=1}^{t} Y(x, s) 1{x(s) = x} ) / max(1, N(x, t)),

where x(s) is the arm selected in round s, and N(x, t) = Σ_{s=1}^{t} 1{x(s) = x} is the number of times x has been selected up to round t.

Algorithm 1 OSSB(ε, γ)

  s(0) ← 0; N(x, 1) ← 0, m(x, 1) ← 0, ∀x ∈ X   {Initialization}
  for t = 1, ..., T do
    Compute the solution (c(x, m(t)))_{x∈X} of the optimization problem (2)-(3), where m(t) = (m(x, t))_{x∈X}
    if N(x, t) ≥ c(x, m(t))(1 + γ) ln t, ∀x ∈ X then
      s(t) ← s(t − 1); x(t) ← x*(m(t))   {Exploitation}
    else
      s(t) ← s(t − 1) + 1
      X(t) ← arg min_{x∈X} N(x, t)/c(x, m(t));  X̄(t) ← arg min_{x∈X} N(x, t)
      if N(X̄(t), t) ≤ ε s(t) then
        x(t) ← X̄(t)   {Estimation}
      else
        x(t) ← X(t)   {Exploration}
      end if
    end if
    Select arm x(t) and observe Y(x(t), t)   {Update statistics}
    m(x, t + 1) ← m(x, t), N(x, t + 1) ← N(x, t), ∀x ≠ x(t)
    m(x(t), t + 1) ← (Y(x(t), t) + m(x(t), t) N(x(t), t)) / (N(x(t), t) + 1)
    N(x(t), t + 1) ← N(x(t), t) + 1
  end for

The key idea of OSSB is to use m(t) = (m(x, t))_{x∈X} as an estimator for θ, and to explore arms so as to match the estimated solution of the optimization problem (2)-(3), i.e. so that N(x, t) ≈ c(x, m(t)) ln t for all x. This works if we can ensure certainty equivalence, i.e. m(t) → θ as t → ∞ at a sufficiently fast rate.

OSSB alternates between three phases: exploitation, estimation and exploration. In round t, one first attempts to identify the optimal arm: we calculate x*(m(t)), the arm with the largest empirical reward.
If N(x, t) ≥ c(x, m(t))(1 + γ) ln t for all x, we enter the exploitation phase: we have enough information to infer that x*(m(t)) = x*(θ) w.h.p., and we select x(t) = x*(m(t)). Otherwise, we need to gather more information to identify the optimal arm. We have two goals: (i) make sure that all components of θ are accurately estimated, and (ii) make sure that N(x, t) ≈ c(x, m(t)) ln t for all x. We maintain a counter s(t) of the number of times we have not entered the exploitation phase. We choose between two possible arms, namely the least played arm X̄(t) and the arm X(t) which is the farthest from satisfying N(x, t) ≥ c(x, m(t)) ln t. We then consider the number of times X̄(t) has been selected. If N(X̄(t), t) is much smaller than s(t), there is a possibility that X̄(t) has not been selected enough times, so that θ(X̄(t)) is not accurately estimated; we then enter the estimation phase, where we select X̄(t) to ensure that certainty equivalence holds. Otherwise we enter the exploration phase, where we select X(t) to explore as dictated by the solution of (2)-(3), since c(x, m(t)) should be close to c(x, θ).

Theorem 2 states that OSSB is asymptotically optimal. The complete proof is presented in the Appendix, with a sketch provided in the next section. We prove Theorem 2 for Bernoulli or sub-Gaussian observations, but the analysis is easily extended to rewards in a 1-parameter exponential family of distributions.
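The three-phase decision just described can be condensed into a short sketch of one OSSB(ε, γ) round (our own illustration; in place of a general solver for (2)-(3) we plug in the closed-form rates of the classical Gaussian bandit, c(x, m) = 2/(max(m) − m(x))² for sub-optimal x):

```python
import numpy as np

def c_rates(m, floor=1e-6):
    # Classical Gaussian bandit stand-in for the solution of (2)-(3) at m.
    gaps = np.maximum(m.max() - m, floor)
    c = 2.0 / gaps ** 2
    c[int(np.argmax(m))] = 0.0           # empirical best needs no forced exploration
    return c

def ossb_choose(m, N, t, s, eps=0.0, gamma=0.0):
    """Return (arm, updated s): exploitation, estimation, or exploration step."""
    c = c_rates(m)
    if np.all(N >= (1.0 + gamma) * c * np.log(t)):
        return int(np.argmax(m)), s      # exploitation: play the empirical best
    s += 1                               # count rounds outside exploitation
    least = int(np.argmin(N))            # least played arm
    behind = int(np.argmin(N / np.maximum(c, 1e-12)))  # farthest below its target rate
    if N[least] <= eps * s:
        return least, s                  # estimation: keep m(t) converging to theta
    return behind, s                     # exploration: track c(x, m(t)) ln t

m = np.array([0.2, 0.9])
arm, s = ossb_choose(m, N=np.array([100.0, 100.0]), t=100, s=0)  # exploits: arm 1
```

With N = [1, 100] instead, the rate constraint for arm 0 fails and the same call returns arm 0 as an exploration step.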
While we state an asymptotic result here, we actually perform a finite-time analysis of OSSB, and a finite-time regret upper bound for OSSB is displayed at the end of the next section.

Assumption 2 (Bernoulli observations) θ(x) ∈ [0, 1] and ν(θ(x)) = Ber(θ(x)) for all x ∈ X.

Assumption 3 (Gaussian observations) θ(x) ∈ R and ν(θ(x)) = N(θ(x), 1) for all x ∈ X.

Assumption 4 For all x, the mapping (θ, λ) ↦ D(θ, λ, x) is continuous at all points where it is not infinite.

Assumption 5 For all x, the mapping θ ↦ μ(x, θ) is continuous.

Assumption 6 The solution to problem (2)-(3) is unique.

Theorem 2 If Assumptions 1, 4, 5 and 6 hold, and either Assumption 2 or 3 holds, then under the algorithm π = OSSB(ε, γ) with ε < 1/|X| we have:

lim sup_{T→∞} R^π(T) / ln T ≤ C(θ) F(ε, γ, θ),

with F a function such that, for all θ, F(ε, γ, θ) → 1 as ε → 0 and γ → 0.

We conclude this section with a remark on the computational complexity of OSSB. OSSB requires solving the optimization problem (2)-(3) in each round. The complexity of solving this problem strongly depends on the problem structure. For general structures, it is difficult to assess. However, for the problems exemplified in Section 3, this problem is usually easy to solve. Note that the algorithm proposed in [31] for linear bandits requires solving (2)-(3) only once, and is hence simpler to implement; its performance, however, is much worse in practice than that of OSSB, as illustrated in Section 7.

6 Finite Time Analysis of OSSB

The proof of Theorem 2 is presented in the Appendix in detail, and is articulated in four steps.
(i) We first notice that the probability of selecting a sub-optimal arm during the exploitation phase at some round t is upper bounded by P(Σ_{x∈X} N(x, t) D(m(t), θ, x) ≥ (1 + γ) ln t). Using a concentration inequality on KL-divergences (Lemma 1 in the Appendix), we show that this probability is small, and that the regret caused by the exploitation phase is upper bounded by G(γ, |X|), where G is finite and depends solely on γ and |X|.

(ii) The second step, which is the most involved, is to show Lemma 1 below, stating that the solutions of (2)-(3) are continuous. The main difficulty is that the set Λ(θ) is not finite, so that the optimization problem (2)-(3) is not a linear program. The proof strategy is similar to that used to prove Berge's maximal theorem, the additional difficulty being that the feasible set is not compact, so that Berge's theorem cannot be applied directly. Using Assumptions 1 and 5, both the value θ ↦ C(θ) and the solution θ ↦ c(θ) are continuous.

Lemma 1 The optimal value of (2)-(3), θ ↦ C(θ), is continuous. If (2)-(3) admits a unique solution c(θ) = (c(x, θ))_{x∈X} at θ, then θ ↦ c(θ) is continuous at θ.

Lemma 1 is in fact interesting in its own right, since optimization problems such as (2)-(3) occur in all bandit problems.

(iii) The third step is to upper bound the number of times the solution of (2)-(3) is not well estimated, i.e. when C(m(t)) ≥ (1 + κ)C(θ) for some κ > 0. By the previous step, this implies that ||m(t) − θ||∞ ≥ δ(κ) for some well-chosen δ(κ) > 0. Using a deviation result (Lemma 2 in the Appendix), we show that the expected regret caused by such events is finite and upper bounded by 2|X| / (ε δ²(κ)).
(iv) Finally, a counting argument ensures that the regret incurred when C(θ) ≤ C(m(t)) ≤ (1 + κ)C(θ), i.e. when the solution of (2)-(3) is well estimated, is upper bounded by (C(θ)(1 + κ) + 2εψ(θ)) ln T, where ψ(θ) = |X| ||c(θ)||∞ Σ_{x∈X} (μ*(θ) − μ(x, θ)).

Putting everything together, we obtain the finite-time regret upper bound:

R^π(T) ≤ μ*(θ) ( G(γ, |X|) + 2|X| / (ε δ²(κ)) ) + (C(θ)(1 + κ) + 2εψ(θ))(1 + γ) ln T.

This implies that:

lim sup_{T→∞} R^π(T) / ln T ≤ (C(θ)(1 + κ) + 2εψ(θ))(1 + γ).

The above holds for all κ > 0, which yields the result.

7 Numerical Experiments

To assess the efficiency of OSSB, we compare its performance for reasonable time horizons to state-of-the-art algorithms for linear bandit problems. We considered a linear bandit with Gaussian rewards of unit variance, 81 arms of unit length, d = 3, and 10 parameters θ in [0.2, 0.4]³ generated uniformly at random. In our implementation of OSSB, we use γ = ε = 0, since γ is typically chosen as 0 in the literature (see [18]) and the performance of the algorithm does not appear sensitive to the choice of ε. As baselines we select the extension of Thompson Sampling presented in [4] (using v_t = R √(0.5 d ln(t/δ)) with δ = 0.1, R = 1), GLM-UCB (using ρ(t) = √(0.5 ln t)), an extension of UCB [16], and the algorithm presented in [31].

Figure 1 presents the regret of the various algorithms averaged over the 10 parameters. OSSB clearly exhibits the best performance in terms of average regret.

Figure 1: Regret of various algorithms in the linear bandit setting with 81 arms and d = 3.
Regret is averaged over 10 randomly generated parameters and 100 trials. Colored regions represent the 95% confidence intervals.

8 Conclusion

In this paper, we develop a unified solution to a wide class of stochastic structured bandit problems. For the first time, we derive, for these problems, an asymptotic regret lower bound and devise OSSB, a simple and yet asymptotically optimal algorithm. The implementation of OSSB requires that we solve the optimization problem defining the minimal exploration rates of the sub-optimal arms. In the most general case, this problem is a semi-infinite linear program, which can be hard to solve in reasonable time. Studying the complexity of this semi-infinite LP depending on the structural properties of the reward function is an interesting research direction. Indeed, any asymptotically optimal algorithm needs to learn the minimal exploration rates of sub-optimal arms, and hence needs to solve this semi-infinite LP. Characterizing its complexity would thus yield important insights into the trade-off between the complexity of sequential arm selection algorithms and their regret.

Acknowledgments

A. Proutiere's research is supported by the ERC FSA (308267) grant. This work is supported by the French Agence Nationale de la Recherche (ANR), under grant ANR-16-CE40-0002 (project BADASS).

References

[1] Y. Abbasi-Yadkori, D. Pal, and C. Szepesvari. Improved algorithms for linear stochastic bandits. In NIPS, 2011.
[2] A. Agarwal, D. P. Foster, D. J. Hsu, S. M. Kakade, and A. Rakhlin. Stochastic convex optimization with bandit feedback. In NIPS, pages 1035-1043, 2011.
[3] R. Agrawal. The continuum-armed bandit problem. SIAM J. Control Optim., 33(6):1926-1951, 1995.
[4] S. Agrawal and N. 
Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML, 2013.\n[5] B. Awerbuch and R. Kleinberg. Online linear optimization and adaptive routing. J. Comput. Syst. Sci.,\n\n74(1):97\u2013114, 2008.\n\n[6] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit\n\nproblems. Foundations and Trends in Machine Learning, 5(1):1\u2013122, 2012.\n\n[7] A. Burnetas and M. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in\n\nApplied Mathematics, 17(2):122\u2013142, 1996.\n\n[8] A. Carpentier and M. Valko. Revealing graph bandits for maximizing local in\ufb02uence. In AISTATS, 2016.\n[9] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. J. Comput. Syst. Sci., 78(5):1404\u20131422, 2012.\n[10] W. Chen, Y. Wang, and Y. Yuan. Combinatorial multi-armed bandit: General framework and applications.\n\nIn ICML, 2013.\n\n[11] R. Combes, S. Magureanu, A. Proutiere, and C. Laroche. Learning to rank: Regret lower bound and\n\nef\ufb01cient algorithms. In SIGMETRICS, 2015.\n\n[12] R. Combes and A. Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms. In ICML,\n\n[13] R. Combes, S. Talebi, A. Proutiere, and M. Lelarge. Combinatorial bandits revisited. In NIPS, 2015.\n[14] V. Dani, T. Hayes, and S. Kakade. Stochastic linear optimization under bandit feedback. In COLT, 2008.\n[15] A. Durand and C. Gagn\u00e9. Thompson sampling for combinatorial bandits and its application to online\n\nfeature selection. In Workshops at the Twenty-Eighth AAAI Conference on Arti\ufb01cial Intelligence, 2014.\n\n[16] S. Filippi, O. Cappe, A. Garivier, and C. Szepesv\u00e1ri. Parametric bandits: The generalized linear case. In\n\nNIPS, pages 586\u2013594, 2010.\n\n[17] Y. Gai, B. Krishnamachari, and R. Jain. Combinatorial network optimization with unknown variables:\nMulti-armed bandits with linear rewards and individual observations. IEEE/ACM Trans. 
on Networking,\n20(5):1466\u20131478, 2012.\n\n[18] A. Garivier and O. Capp\u00e9. The KL-UCB algorithm for bounded stochastic bandits and beyond. In COLT,\n\n[19] K. Glashoff and S.-A. Gustafson. Linear Optimization and Approximation. Springer Verlag, Berlin, 1983.\n[20] A. Gopalan, S. Mannor, and Y. Mansour. Thompson sampling for complex online problems. In ICML,\n\n2011.\n\n2014.\n\n2011.\n\n2014.\n\n[21] T. L. Graves and T. L. Lai. Asymptotically ef\ufb01cient adaptive choice of control laws in controlled markov\n\nchains. SIAM J. Control and Optimization, 35(3):715\u2013743, 1997.\n\n[22] A. Gy\u00f6rgy, T. Linder, G. Lugosi, and G. Ottucs\u00e1k. The on-line shortest path problem under partial\n\nmonitoring. Journal of Machine Learning Research, 8(10), 2007.\n\n[23] U. Herkenrath. The n-armed bandit with unimodal structure. Metrika, 30(1):195\u2013210, 1983.\n[24] J. Honda and A. Takemura. An asymptotically optimal bandit algorithm for bounded support models. In\n\nCOLT, 2010.\n\n[25] E. Kaufmann, O. Capp\u00e9, and A. Garivier. On the complexity of best-arm identi\ufb01cation in multi-armed\n\nbandit models. Journal of Machine Learning Research, 17(1):1\u201342, 2016.\n\n[26] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptotically optimal \ufb01nite-time\n\n[27] J. Komiyama, J. Honda, H. Kashima, and H. Nakagawa. Regret lower bound and optimal algorithm in\n\nanalysis. In ALT, 2012.\n\ndueling bandit problem. In COLT, 2015.\n\n[28] B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvari. Cascading bandits: Learning to rank in the cascade\n\n[29] B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvari. Tight regret bounds for stochastic combinatorial\n\n[30] T. L. Lai and H. Robbins. Asymptotically ef\ufb01cient adaptive allocation rules. Advances in Applied\n\n[31] T. Lattimore and C. Szepesvari. The end of optimism? an asymptotic analysis of \ufb01nite-armed linear bandits.\n\n[32] S. Magureanu, R. Combes, and A. Proutiere. 
Lipschitz bandits: Regret lower bounds and optimal\n\n[33] H. Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers,\n\n[34] P. Rusmevichientong and J. Tsitsiklis. Linearly parameterized bandits. Math. Oper. Res., 35(2), 2010.\n[35] Z. Wen, A. Ashkan, H. Eydgahi, and B. Kveton. Ef\ufb01cient learning in large-scale combinatorial semi-bandits.\n\nIn ICML, 2015.\n\n[36] J. Yu and S. Mannor. Unimodal bandits. In ICML, 2011.\n\nmodel. In NIPS, 2015.\n\nsemi-bandits. In AISTATS, 2015.\n\nMathematics, 6(1):4\u201322, 1985.\n\nAISTATS, 2016.\n\nalgorithms. COLT, 2014.\n\npages 169\u2013177. Springer, 1985.\n\n9\n\n\f", "award": [], "sourceid": 1115, "authors": [{"given_name": "Richard", "family_name": "Combes", "institution": "Centrale-Supelec"}, {"given_name": "Stefan", "family_name": "Magureanu", "institution": "KTH"}, {"given_name": "Alexandre", "family_name": "Proutiere", "institution": "KTH"}]}
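The linear bandit environment of Section 7 can be sketched in a few lines. The following is a minimal reconstruction under stated assumptions, not the authors' code: the paper specifies 81 unit-norm arms, d = 3, parameters θ drawn uniformly from [0.2, 0.4]³, and Gaussian rewards of unit variance; the arm-generation scheme (here, uniform on the unit sphere), the fixed seed, and the uniform-random policy used as a sanity check are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice

d, n_arms = 3, 81

# Assumption: the paper states 81 arms of unit length but not how they
# are constructed; here we sample them uniformly on the unit sphere.
arms = rng.normal(size=(n_arms, d))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)

# Unknown parameter drawn uniformly in [0.2, 0.4]^3, as in Section 7.
theta = rng.uniform(0.2, 0.4, size=d)

def pull(x):
    """Gaussian reward with unit variance and mean <x, theta>."""
    return float(x @ theta + rng.normal())

# Per-arm gap: distance of each arm's mean reward to the best one.
means = arms @ theta
gap = means.max() - means

# Sanity check: a uniform-random policy (not OSSB) accumulates
# regret linearly in T, which any reasonable algorithm should beat.
T = 1000
plays = rng.integers(n_arms, size=T)
cum_regret = np.cumsum(gap[plays])
```

Plugging an actual bandit algorithm into this environment amounts to replacing the uniform choice of `plays` by the algorithm's arm-selection rule and feeding it the rewards returned by `pull`.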