{"title": "Improved Regret Bounds for Oracle-Based Adversarial Contextual Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 3135, "page_last": 3143, "abstract": "We propose a new oracle-based algorithm, BISTRO+, for the adversarial contextual bandit problem, where either contexts are drawn i.i.d. or the sequence of contexts is known a priori, but where the losses are picked adversarially. Our algorithm is computationally efficient, assuming access to an offline optimization oracle, and enjoys a regret of order $O((KT)^{\\frac{2}{3}}(\\log N)^{\\frac{1}{3}})$, where $K$ is the number of actions, $T$ is the number of iterations, and $N$ is the number of baseline policies. Our result is the first to break the $O(T^{\\frac{3}{4}})$ barrier achieved by recent algorithms, which was left as a major open problem. Our analysis employs the recent relaxation framework of (Rakhlin and Sridharan, ICML'16).", "full_text": "Improved Regret Bounds for Oracle-Based\n\nAdversarial Contextual Bandits\n\nVasilis Syrgkanis\nMicrosoft Research\nvasy@microsoft.com\n\nHaipeng Luo\n\nMicrosoft Research\n\nhaipeng@microsoft.com\n\nAkshay Krishnamurthy\n\nUniversity of Massachusetts, Amherst\n\nakshay@cs.umass.edu\n\nRobert E. Schapire\nMicrosoft Research\n\nschapire@microsoft.com\n\nAbstract\n\nWe propose a new oracle-based algorithm, BISTRO+, for the adversarial con-\ntextual bandit problem, where either contexts are drawn i.i.d. or the sequence\nof contexts is known a priori, but where the losses are picked adversarially. Our\nalgorithm is computationally ef\ufb01cient, assuming access to an of\ufb02ine optimization\noracle, and enjoys a regret of order O((KT ) 2\n3 ), where K is the number\nof actions, T is the number of iterations, and N is the number of baseline policies.\nOur result is the \ufb01rst to break the O(T 3\n4 ) barrier achieved by recent algorithms,\nwhich was left as a major open problem. 
Our analysis employs the recent relaxation\nframework of Rakhlin and Sridharan [7].\n\n3 (log N ) 1\n\n1\n\nIntroduction\n\nWe study online decision making problems where a learner chooses an action based on some side\ninformation (context) and incurs some cost for that action with a goal of incurring minimal cost over\na sequence of rounds. These contextual online learning settings form a powerful framework for\nmodeling many important decision-making scenarios with applications ranging from personalized\nhealth care to content recommendation and targeted advertising. Many of these applications also\ninvolve a partial feedback component, wherein costs for alternative actions are unobserved, and are\ntypically modeled as contextual bandits.\nThe contextual information present in these problems enables learning of a much richer policy\nfor choosing actions based on context. In the literature, the typical goal for the learner is to have\ncumulative cost that is not much higher than the best policy \u03c0 in a large policy class \u03a0. This is\nformalized by the notion of regret, which is the learner\u2019s cumulative cost minus the cumulative cost\nof the best \ufb01xed policy \u03c0 in hindsight.\nNaively, one can view the contextual problem as a standard online learning problem where the set\nof possible \u201cactions\u201d available at each iteration is the set of policies. This perspective is fruitful, as\nclassical algorithms, such as Hedge [5, 3] and Exp4 [2], give information theoretically optimal regret\n\nbounds of O((cid:112)T log(N )) in full-information and O((cid:112)T K log(N ) in the bandit setting, where T is\n\nthe number of rounds, K is the number of actions, and N is number of policies. However, naively\nlifting standard online learning algorithms to the contextual setting leads to a running time that is\nlinear in the number of policies. 
Given that the optimal regret is only logarithmic in $N$ and that our high-level goal is to learn a very rich policy, we want to capture policy classes that are exponentially large. When we use a large policy class, existing algorithms are no longer computationally tractable.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nTo study this computational question, a number of recent papers have developed oracle-based algorithms that only access the policy class through an optimization oracle for the offline full-information problem. Oracle-based approaches harness the research in supervised learning that focuses on designing efficient algorithms for full-information problems and use it for online and partial-feedback problems. Optimization oracles have been used in designing contextual bandit algorithms [1, 4] that achieve the optimal $O(\\sqrt{KT \\log N})$ regret while also being computationally efficient (i.e., requiring $\\mathrm{poly}(K, \\log N, T)$ oracle calls and computation). However, these results only apply when the contexts and costs are independently and identically distributed at each iteration, in contrast with the computationally inefficient approaches that can handle adversarial inputs.\n\nTwo very recent works provide the first oracle-efficient algorithms for the contextual bandit problem in adversarial settings [7, 8]. Rakhlin and Sridharan [7] consider a setting where the contexts are drawn i.i.d. from a known distribution with adversarial costs, and they provide an oracle-efficient algorithm called BISTRO with $O(T^{3/4}K^{1/2}(\\log N)^{1/4})$ regret. Their algorithm also applies in the transductive setting where the sequence of contexts is known a priori. Syrgkanis et al. [8] also obtain a $T^{3/4}$-style bound with a different oracle-efficient algorithm, but in a setting where the learner knows only the set of contexts that will arrive. Both of these results achieve very suboptimal regret bounds, as the dependence on the number of iterations is far from the optimal $O(\\sqrt{T})$ bound. A major open question posed by both works is whether the $O(T^{3/4})$ barrier can be broken.\n\nIn this paper, we provide an oracle-based contextual bandit algorithm, BISTRO+, that achieves regret $O((KT)^{2/3}(\\log N)^{1/3})$ in both the i.i.d.-context and the transductive settings considered by Rakhlin and Sridharan [7]. This bound matches the $T$-dependence of the epoch-greedy algorithm of Langford and Zhang [6], which only applies to the fully stochastic setting. As in Rakhlin and Sridharan [7], our algorithm only requires access to a value oracle, which is weaker than the standard argmax oracle, and it makes $K + 1$ oracle calls per iteration. To our knowledge, this is the best regret bound achievable by an oracle-efficient algorithm for any adversarial contextual bandit problem.\n\nOur algorithm and regret bound are based on a novel and improved analysis of the minimax problem that arises in the relaxation-based framework of Rakhlin and Sridharan [7] (hence the name BISTRO+). Our proof requires analyzing the value of a sequential game where the learner chooses a distribution over actions and then the adversary chooses a distribution over costs in some bounded finite domain with, importantly, a bounded variance. This is unlike the simpler minimax problem analyzed in [7], where the adversary is only constrained by the range of the costs.\n\nApart from showing that this more structured minimax problem has a small value, we also need to derive an oracle-based strategy for the learner that achieves the improved regret bound.
The additional\nconstraints on the game require a much more intricate argument to derive this strategy which is an\nalgorithm for solving a structured two-player minimax game (see Section 4).\n\n2 Model and Preliminaries\nBasic notation. Throughout the paper we denote with x1:t a sequence of quantities {x1, . . . , xt}\nand with (x, y, z)1:t a sequence of tuples {(x1, y1, z1), . . .}. \u2205 denotes an empty sequence. The\nvector of ones is denoted by 1 and the vector of zeroes is denoted by 0. Denote with [K] the set\n{1, . . . , K}, e1, . . . , eK the standard basis vectors in RK, and \u2206U the set of distributions over a set\nU. We also use \u2206K as a shorthand for \u2206[K].\n\nContextual online learning. We consider the following version of the contextual online learning\nproblem. On each round t = 1, . . . , T , the learner observes a context xt and then chooses a probability\ndistribution qt over a set of K actions. The adversary then chooses a cost vector ct \u2208 [0, 1]K. The\nlearner picks an action \u02c6yt drawn from distribution qt, incurs a cost ct(\u02c6yt) and observes only ct(\u02c6yt)\nand not the cost of the other actions.\nThroughout the paper we will assume that the context xt at each iteration t is drawn i.i.d. from a\ndistribution D. This is referred to as the hybrid i.i.d.-adversarial setting [7]. As in prior work [7], we\nassume that the learner can sample contexts from this distribution as needed. It is easy to adapt the\narguments in the paper to apply for the transductive setting where the learner knows the sequence of\ncontexts that will arrive. The cost vectors ct are chosen by a non-adaptive adversary.\n\n2\n\n\fThe goal of the learner is to compete with a set of policies \u03a0 of size N, where each policy \u03c0 \u2208 \u03a0 is a\nfunction mapping contexts to actions. 
The cumulative expected regret with respect to the best fixed policy in hindsight is\n\n$$\\mathrm{REG} = \\sum_{t=1}^{T} \\langle q_t, c_t \\rangle - \\min_{\\pi \\in \\Pi} \\sum_{t=1}^{T} c_t(\\pi(x_t)).$$\n\nOptimization value oracle. We will assume that we are given access to an optimization oracle that, when given as input a sequence of contexts and cost vectors $(x, c)_{1:t}$, outputs the cumulative cost of the best fixed policy, which is\n\n$$\\min_{\\pi \\in \\Pi} \\sum_{\\tau=1}^{t} c_\\tau(\\pi(x_\\tau)). \\qquad (1)$$\n\nThis can be viewed as an offline batch optimization or ERM oracle.\n\n2.1 Relaxation based algorithms\n\nWe briefly review the relaxation based framework proposed in [7]. The reader is directed to [7] for a more extensive exposition. We will also slightly augment the framework with some internal randomness that the algorithm can generate and use, which does not affect the cost of the algorithm.\n\nA crucial concept in the relaxation based framework is the information obtained by the learner at the end of each round $t \\in [T]$, which is the following tuple:\n\n$$I_t(x_t, q_t, \\hat{y}_t, c_t, S_t) = (x_t, q_t, \\hat{y}_t, c_t(\\hat{y}_t), S_t),$$\n\nwhere $\\hat{y}_t$ is the realized chosen action drawn from the distribution $q_t$, and $S_t$ is some random string drawn from some distribution that can depend on $q_t$, $\\hat{y}_t$, and $c_t(\\hat{y}_t)$, and which can be used by the algorithm in subsequent rounds.\n\nDefinition 1 A partial-information relaxation $\\mathrm{REL}(\\cdot)$ is a function that maps $(I_1, \\ldots, I_t)$ to a real value for any $t \\in [T]$. A partial-information relaxation is admissible if for any $t \\in [T]$, and for all $I_1, \\ldots, I_{t-1}$,\n\n$$\\mathbb{E}_{x_t}\\left[ \\min_{q_t} \\max_{c_t} \\mathbb{E}_{\\hat{y}_t \\sim q_t, S_t}\\left[ c_t(\\hat{y}_t) + \\mathrm{REL}(I_{1:t-1}, I_t(x_t, q_t, \\hat{y}_t, c_t, S_t)) \\right] \\right] \\le \\mathrm{REL}(I_{1:t-1}), \\qquad (2)$$\n\nand for all $x_{1:T}$, $c_{1:T}$ and $q_{1:T}$,\n\n$$\\mathbb{E}_{\\hat{y}_{1:T} \\sim q_{1:T}, S_{1:T}}\\left[ \\mathrm{REL}(I_{1:T}) \\right] \\ge -\\min_{\\pi \\in \\Pi} \\sum_{t=1}^{T} c_t(\\pi(x_t)). \\qquad (3)$$\n\nDefinition 2 Any randomized strategy $q_{1:T}$ that certifies inequalities (2) and (3) is called an admissible strategy.\n\nA basic lemma proven in [7] is that if one constructs a relaxation and a corresponding admissible strategy, then the expected regret of the admissible strategy is upper bounded by the value of the relaxation at the beginning of time.\n\nLemma 1 ([7]) Let $\\mathrm{REL}$ be an admissible relaxation and $q_{1:T}$ be an admissible strategy. Then for any $c_{1:T}$, we have $\\mathbb{E}\\left[ \\mathrm{REG} \\right] \\le \\mathrm{REL}(\\emptyset)$.\n\nWe will utilize this framework and construct a novel relaxation with an admissible strategy. We will show that the value of the relaxation at the beginning of time is upper bounded by the desired improved regret bound and that the admissible strategy can be efficiently computed assuming access to an optimization value oracle.\n\n3 A Faster Contextual Bandit Algorithm\n\nFirst we define an unbiased estimator of the cost vectors $c_t$. In addition to the usual importance weighting, we also discretize the estimated cost to either $0$ or $L$ for some constant $L \\ge K$ to be specified later. Specifically, suppose that at iteration $t$ an action $\\hat{y}_t$ is picked based on some distribution $h_t \\in \\Delta_K$.
Now, consider the random variable $X_t$, which is defined conditionally on $\\hat{y}_t$ and $h_t$, as\n\n$$X_t = 1 \\text{ with probability } \\frac{c_t(\\hat{y}_t)}{L h_t(\\hat{y}_t)}, \\quad \\text{and } X_t = 0 \\text{ with the remaining probability.} \\qquad (4)$$\n\nThis is a valid random variable whenever $\\min_y h_t(y) \\ge \\frac{1}{L}$, which will be ensured by the algorithm. This is the only randomness in the random string $S_t$ that we used in the general relaxation framework. Our construction of an unbiased estimate for each $c_t$ based on the information $I_t$ collected at the end of each round is then: $\\hat{c}_t = L X_t e_{\\hat{y}_t}$. Observe that for any $y \\in [K]$,\n\n$$\\mathbb{E}_{\\hat{y}_t, X_t}\\left[ \\hat{c}_t(y) \\right] = L \\cdot \\Pr[\\hat{y}_t = y] \\cdot \\Pr[X_t = 1 \\mid \\hat{y}_t = y] = L \\cdot h_t(y) \\cdot \\frac{c_t(y)}{L h_t(y)} = c_t(y).$$\n\nHence, $\\hat{c}_t$ is an unbiased estimate of $c_t$.\n\nWe are now ready to define our relaxation. Let $\\varepsilon_t \\in \\{-1, 1\\}^K$ be a Rademacher random vector (i.e., each coordinate is an independent Rademacher random variable, which is $-1$ or $1$ with equal probability), and let $Z_t \\in \\{0, L\\}$ be a random variable which is $L$ with probability $K/L$ and $0$ otherwise. We denote with $\\rho_t = (x, \\varepsilon, Z)_{t+1:T}$ and with $G_t$ the distribution of $\\rho_t$ which is described above.
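As a concrete illustration of the construction above, one round of the estimator can be sketched in Python (a minimal sketch; the function name and interface are hypothetical, not from the paper): draw $\\hat{y}_t \\sim h_t$, flip the Bernoulli variable $X_t$ of Eq. (4), and return $\\hat{c}_t = L X_t e_{\\hat{y}_t}$.

```python
import random

def sample_discretized_estimate(h, c, L, rng=random):
    """One round of the discretized importance-weighted cost estimator.

    h : the action distribution h_t (requires min(h) >= 1/L so that
        c[yhat] / (L * h[yhat]) is a valid Bernoulli probability)
    c : the true cost vector c_t in [0, 1]^K (used only through c[yhat])
    Returns the chosen action yhat and hat_c = L * X * e_yhat.
    """
    K = len(h)
    yhat = rng.choices(range(K), weights=h)[0]
    # X = 1 with probability c[yhat] / (L * h[yhat]), else 0  -- Eq. (4)
    X = 1 if rng.random() < c[yhat] / (L * h[yhat]) else 0
    hat_c = [0.0] * K
    hat_c[yhat] = float(L * X)
    return yhat, hat_c
```

Averaging such draws over many independent rounds recovers $c_t$, matching the unbiasedness computation above; each realized estimate has entries only in $\\{0, L\\}$, which is the discretization that the relaxation exploits.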
Our relaxation is defined as follows:\n\n$$\\mathrm{REL}(I_{1:t}) = \\mathbb{E}_{\\rho_t \\sim G_t}\\left[ R((x, \\hat{c})_{1:t}, \\rho_t) \\right], \\qquad (5)$$\n\nwhere\n\n$$R((x, \\hat{c})_{1:t}, \\rho_t) = -\\min_{\\pi \\in \\Pi} \\left( \\sum_{\\tau=1}^{t} \\hat{c}_\\tau(\\pi(x_\\tau)) + \\sum_{\\tau=t+1}^{T} 2\\varepsilon_\\tau(\\pi(x_\\tau)) Z_\\tau \\right) + (T - t)K/L.$$\n\nNote that $\\mathrm{REL}(\\emptyset)$ is the following quantity, whose first part resembles a Rademacher average:\n\n$$\\mathrm{REL}(\\emptyset) = 2\\mathbb{E}_{(x, \\varepsilon, Z)_{1:T}}\\left[ \\max_{\\pi \\in \\Pi} \\sum_{\\tau=1}^{T} \\varepsilon_\\tau(\\pi(x_\\tau)) Z_\\tau \\right] + TK/L.$$\n\nUsing the following lemma (whose proof is deferred to the supplementary material) and the fact $\\mathbb{E}[Z_\\tau^2] \\le KL$, we can upper bound $\\mathrm{REL}(\\emptyset)$ by $O(\\sqrt{TKL \\log N} + TK/L)$, which after tuning $L$ will give the claimed $O(T^{2/3})$ bound.\n\nLemma 2 Let $\\varepsilon_t$ be Rademacher random vectors, and $Z_t$ be non-negative real-valued random variables such that $\\mathbb{E}[Z_t^2] \\le M$ for some constant $M > 0$. Then\n\n$$\\mathbb{E}_{Z_{1:T}, \\varepsilon_{1:T}}\\left[ \\max_{\\pi \\in \\Pi} \\sum_{t=1}^{T} \\varepsilon_t(\\pi(x_t)) \\cdot Z_t \\right] \\le \\sqrt{2TM \\log N}.$$\n\nTo show an admissible strategy for our relaxation, we let $D = \\{L \\cdot e_i : i \\in [K]\\} \\cup \\{\\mathbf{0}\\}$. For a distribution $p \\in \\Delta_D$, we denote with $p(i)$, for $i \\in \\{0, \\ldots, K\\}$, the probability assigned to vector $e_i$, with the convention that $e_0 = \\mathbf{0}$.
Also let $\\Delta'_D = \\{p \\in \\Delta_D : p(i) \\le 1/L, \\forall i \\in [K]\\}$. Based on this notation, our admissible strategy is defined as\n\n$$q_t = \\mathbb{E}_{\\rho_t}\\left[ q_t(\\rho_t) \\right] \\quad \\text{where} \\quad q_t(\\rho_t) = \\left(1 - \\frac{K}{L}\\right) q^*_t(\\rho_t) + \\frac{1}{L} \\mathbf{1}, \\qquad (6)$$\n\nand\n\n$$q^*_t(\\rho_t) = \\mathop{\\mathrm{argmin}}_{q \\in \\Delta_K} \\max_{p_t \\in \\Delta'_D} \\mathbb{E}_{\\hat{c}_t \\sim p_t}\\left[ \\langle q, \\hat{c}_t \\rangle + R((x, \\hat{c})_{1:t}, \\rho_t) \\right]. \\qquad (7)$$\n\nAlgorithm 1 implements this admissible strategy. Note that it suffices to use $q_t(\\rho_t)$ for a single random draw $\\rho_t$ instead of $q_t$ to ensure the exact same guarantee in expectation. In Section 4 we show that $q_t(\\rho_t)$ can be computed efficiently using an optimization value oracle.\n\nWe state the main theorem of our relaxation construction and defer the proof to Section 5.\n\nAlgorithm 1 BISTRO+\n\nInput: parameter $L \\ge K$\nfor each time step $t \\in [T]$ do\n  Observe $x_t$. Draw $\\rho_t = (x, \\varepsilon, Z)_{t+1:T}$, where each $x_\\tau$ is drawn from the distribution of contexts, $\\varepsilon_\\tau$ is a Rademacher random vector, and $Z_\\tau \\in \\{0, L\\}$ is $L$ with probability $K/L$ and $0$ otherwise.\n  Compute $q_t(\\rho_t)$ based on Eq. (6) (using Algorithm 2).\n  Predict $\\hat{y}_t \\sim q_t(\\rho_t)$ and observe $c_t(\\hat{y}_t)$.\n  Create an estimate $\\hat{c}_t = L X_t e_{\\hat{y}_t}$, where $X_t$ is defined in Eq. (4) using $q_t(\\rho_t)$ as $h_t$.\nend for\n\nAlgorithm 2 Computing $q^*_t(\\rho_t)$\n\nInput: a value optimization oracle, $(x, \\hat{c})_{1:t-1}$, $x_t$ and $\\rho_t$.\nOutput: $q \\in \\Delta_K$, a solution to Eq. (7).\nCompute $\\psi_i$ as in Eq. (9) for all $i = 0, \\ldots, K$ using the optimization oracle.\nCompute $\\phi_i = (\\psi_i - \\psi_0)/L$ for all $i \\in [K]$.\nLet $m = 1$ and $q = \\mathbf{0}$.\nfor each coordinate $i \\in [K]$ do\n  Set $q(i) = \\min\\{(\\phi_i)^+, m\\}$, where $(x)^+ = \\max\\{x, 0\\}$.\n  Update $m \\leftarrow m - q(i)$.\nend for\nDistribute $m$ arbitrarily on the coordinates of $q$ if $m > 0$.\n\nTheorem 3 The relaxation defined in Equation (5) is admissible. An admissible randomized strategy for this relaxation is given by (6). The expected regret of BISTRO+ is upper bounded by\n\n$$2\\sqrt{2TKL \\log N} + TK/L, \\qquad (8)$$\n\nfor any $L \\ge K$. Specifically, setting $L = (KT/\\log N)^{1/3}$ when $T \\ge K^2 \\log N$, the regret is of order $O((KT)^{2/3}(\\log N)^{1/3})$.\n\n4 Computational Efficiency\n\nIn this section we will argue that if one is given access to a value optimization oracle (1), then one can run BISTRO+ efficiently. Specifically, we will show that the minimizer of Equation (7) can be computed efficiently via Algorithm 2.\n\nLemma 4 Computing the quantity defined in Equation (7) for any given $\\rho_t$ can be done in time $O(K)$ and with only $K + 1$ accesses to a value optimization oracle.\n\nProof: For $i \\in \\{0, \\ldots, K\\}$, let\n\n$$\\psi_i = \\min_{\\pi \\in \\Pi} \\left( \\sum_{\\tau=1}^{t-1} \\hat{c}_\\tau(\\pi(x_\\tau)) + L e_i(\\pi(x_t)) + \\sum_{\\tau=t+1}^{T} 2\\varepsilon_\\tau(\\pi(x_\\tau)) Z_\\tau \\right), \\qquad (9)$$\n\nwith the convention $e_0 = \\mathbf{0}$. Then observe that we can re-write the definition of $q^*_t(\\rho_t)$ as\n\n$$q^*_t(\\rho_t) = \\mathop{\\mathrm{argmin}}_{q \\in \\Delta_K} \\max_{p_t \\in \\Delta'_D} \\sum_{i=1}^{K} p_t(i)(L \\cdot q(i) - \\psi_i) - p_t(0) \\cdot \\psi_0.$$\n\nObserve that each $\\psi_i$ can be computed with a single oracle access. Thus we can assume that all $K + 1$ $\\psi$'s are computed efficiently and are given. We now argue how to compute the minimizer. For any $q$, the maximum over $p_t$ can be characterized as follows.
With the notation $z_i = L \\cdot q(i) - \\psi_i$ and $z_0 = -\\psi_0$ we re-write the minimax quantity as\n\n$$q^*_t(\\rho_t) = \\mathop{\\mathrm{argmin}}_{q \\in \\Delta_K} \\max_{p_t \\in \\Delta'_D} \\sum_{i=1}^{K} p_t(i) \\cdot z_i + p_t(0) \\cdot z_0.$$\n\nObserve that without the constraint that $p_t(i) \\le 1/L$ for $i > 0$, we would put all the probability mass on the maximum of the $z_i$. With the constraint, the maximizer puts as much probability mass as allowed ($1/L$) on the maximum coordinate $\\mathop{\\mathrm{argmax}}_{i \\in \\{0, \\ldots, K\\}} z_i$ and continues to the next highest quantity. We repeat this until reaching the quantity $z_0$, which is unconstrained; thus we can put all the remaining probability mass on this coordinate.\n\nLet $z_{(1)}, z_{(2)}, \\ldots, z_{(K)}$ denote the ordered $z_i$ quantities for $i > 0$ (from largest to smallest). Moreover, let $\\mu \\in [K]$ be the largest index such that $z_{(\\mu)} \\ge z_0$. By the above reasoning we get that for a given $q$, the maximum over $p_t$ is equal to (recall that we assume $L \\ge K$)\n\n$$\\sum_{t=1}^{\\mu} \\frac{z_{(t)}}{L} + \\left(1 - \\frac{\\mu}{L}\\right) z_0 = \\sum_{t=1}^{\\mu} \\frac{z_{(t)} - z_0}{L} + z_0.$$\n\nNow since $z_{(t)} < z_0$ for any $t > \\mu$, we can write the latter as\n\n$$\\sum_{t=1}^{\\mu} \\frac{z_{(t)} - z_0}{L} + z_0 = \\sum_{i=1}^{K} \\frac{(z_i - z_0)^+}{L} + z_0,$$\n\nwith the convention $(x)^+ = \\max\\{x, 0\\}$. We thus further re-write the minimax expression as\n\n$$q^*_t(\\rho_t) = \\mathop{\\mathrm{argmin}}_{q \\in \\Delta_K} \\sum_{i=1}^{K} \\frac{(z_i - z_0)^+}{L} + z_0 = \\mathop{\\mathrm{argmin}}_{q \\in \\Delta_K} \\sum_{i=1}^{K} \\left( q(i) - \\frac{\\psi_i - \\psi_0}{L} \\right)^+.$$\n\nLet $\\phi_i = \\frac{\\psi_i - \\psi_0}{L}$. The expression becomes $q^*_t(\\rho_t) = \\mathop{\\mathrm{argmin}}_{q \\in \\Delta_K} \\sum_{i=1}^{K} (q(i) - \\phi_i)^+$.\n\nThis quantity is minimized as follows: consider any $i \\in [K]$ such that $\\phi_i \\le 0$. Then putting positive mass $\\xi$ on such a coordinate $i$ leads to a marginal increase of $\\xi$ in the objective. On the other hand, if we put some mass on an index with $\\phi_i > 0$, then that will not increase the objective until we reach the point where $q(i) = \\phi_i$. Thus a minimizer will distribute probability mass of $\\min\\{\\sum_{i : \\phi_i > 0} \\phi_i, 1\\}$ on the coordinates for which $\\phi_i > 0$. The remaining mass, if any, can be distributed arbitrarily. See Algorithm 2 for details.\n\n5 Proof of Theorem 3\n\nWe verify the two conditions for admissibility.\n\nFinal condition. It is clear that inequality (3) is satisfied since the $\\hat{c}_t$ are unbiased estimates of the $c_t$:\n\n$$\\mathbb{E}_{\\hat{y}_{1:T}, X_{1:T}}\\left[ \\mathrm{REL}(I_{1:T}) \\right] = \\mathbb{E}_{\\hat{y}_{1:T}, X_{1:T}}\\left[ \\max_{\\pi \\in \\Pi} \\left( -\\sum_{\\tau=1}^{T} \\hat{c}_\\tau(\\pi(x_\\tau)) \\right) \\right] \\ge \\max_{\\pi \\in \\Pi} \\left( -\\mathbb{E}_{\\hat{y}_{1:T}, X_{1:T}}\\left[ \\sum_{\\tau=1}^{T} \\hat{c}_\\tau(\\pi(x_\\tau)) \\right] \\right) = \\max_{\\pi \\in \\Pi} \\left( -\\sum_{\\tau=1}^{T} c_\\tau(\\pi(x_\\tau)) \\right).$$\n\nt-th step condition. We now check that inequality (2) is also satisfied at some time step $t \\in [T]$. We reason conditionally on the observed context $x_t$ and show that $q_t$ defines an admissible strategy for the relaxation. For convenience let $F_t$ denote the joint distribution of the pair $(\\hat{y}_t, X_t)$. Observe that the marginal of $F_t$ on the first coordinate is equal to $q_t$. Let $q^*_t = \\mathbb{E}_{\\rho_t}\\left[ q^*_t(\\rho_t) \\right]$.
First observe that\n\n$$\\mathbb{E}_{(\\hat{y}_t, X_t) \\sim F_t}\\left[ c_t(\\hat{y}_t) \\right] = \\mathbb{E}_{\\hat{y}_t \\sim q_t}\\left[ c_t(\\hat{y}_t) \\right] = \\langle q_t, c_t \\rangle \\le \\langle q^*_t, c_t \\rangle + \\frac{1}{L} \\langle \\mathbf{1}, c_t \\rangle \\le \\mathbb{E}_{(\\hat{y}_t, X_t) \\sim F_t}\\left[ \\langle q^*_t, \\hat{c}_t \\rangle \\right] + \\frac{K}{L}.$$\n\nHence,\n\n$$\\max_{c_t \\in [0,1]^K} \\mathbb{E}_{(\\hat{y}_t, X_t) \\sim F_t}\\left[ c_t(\\hat{y}_t) + \\mathrm{REL}(I_{1:t}) \\right] \\le \\max_{c_t \\in [0,1]^K} \\mathbb{E}_{(\\hat{y}_t, X_t) \\sim F_t}\\left[ \\langle q^*_t, \\hat{c}_t \\rangle + \\mathrm{REL}(I_{1:t}) \\right] + \\frac{K}{L}.$$\n\nWe now work with the first term of the right-hand side:\n\n$$\\max_{c_t \\in [0,1]^K} \\mathbb{E}_{(\\hat{y}_t, X_t) \\sim F_t}\\left[ \\langle q^*_t, \\hat{c}_t \\rangle + \\mathrm{REL}(I_{1:t}) \\right] = \\max_{c_t \\in [0,1]^K} \\mathbb{E}_{(\\hat{y}_t, X_t) \\sim F_t}\\left[ \\mathbb{E}_{\\rho_t \\sim G_t}\\left[ \\langle q^*_t(\\rho_t), \\hat{c}_t \\rangle + R((x, \\hat{c})_{1:t}, \\rho_t) \\right] \\right].$$\n\nObserve that $\\hat{c}_t$ is a random variable taking values in $D$ and such that the probability that it is equal to $L e_y$ (for $y \\in [K]$) can be upper bounded as\n\n$$\\Pr[\\hat{c}_t = L e_y] = \\mathbb{E}_{\\rho_t \\sim G_t}\\left[ \\Pr[\\hat{c}_t = L e_y \\mid \\rho_t] \\right] = \\mathbb{E}_{\\rho_t \\sim G_t}\\left[ q_t(\\rho_t)(y) \\cdot \\frac{c_t(y)}{L \\cdot q_t(\\rho_t)(y)} \\right] \\le 1/L.$$\n\nThus we can upper bound the latter quantity by the supremum over all distributions in $\\Delta'_D$:\n\n$$\\max_{c_t \\in [0,1]^K} \\mathbb{E}_{(\\hat{y}_t, X_t)}\\left[ \\langle q^*_t, \\hat{c}_t \\rangle + \\mathrm{REL}(I_{1:t}) \\right] \\le \\max_{p_t \\in \\Delta'_D} \\mathbb{E}_{\\hat{c}_t \\sim p_t}\\left[ \\mathbb{E}_{\\rho_t \\sim G_t}\\left[ \\langle q^*_t(\\rho_t), \\hat{c}_t \\rangle + R((x, \\hat{c})_{1:t}, \\rho_t) \\right] \\right].$$\n\nNow we can continue by pushing the expectation over $\\rho_t$ outside of the supremum, i.e.,\n\n$$\\max_{p_t \\in \\Delta'_D} \\mathbb{E}_{\\hat{c}_t \\sim p_t}\\left[ \\mathbb{E}_{\\rho_t \\sim G_t}\\left[ \\langle q^*_t(\\rho_t), \\hat{c}_t \\rangle + R((x, \\hat{c})_{1:t}, \\rho_t) \\right] \\right] \\le \\mathbb{E}_{\\rho_t \\sim G_t}\\left[ \\max_{p_t \\in \\Delta'_D} \\mathbb{E}_{\\hat{c}_t \\sim p_t}\\left[ \\langle q^*_t(\\rho_t), \\hat{c}_t \\rangle + R((x, \\hat{c})_{1:t}, \\rho_t) \\right] \\right],$$\n\nand work conditionally on $\\rho_t$. Since the expression is linear in $p_t$, the supremum is realized, and by the definition of $q^*_t(\\rho_t)$, the quantity inside the expectation $\\mathbb{E}_{\\rho_t \\sim G_t}$ is equal to\n\n$$\\min_{q \\in \\Delta_K} \\max_{p_t \\in \\Delta'_D} \\mathbb{E}_{\\hat{c}_t \\sim p_t}\\left[ \\langle q, \\hat{c}_t \\rangle + R((x, \\hat{c})_{1:t}, \\rho_t) \\right].$$\n\nWe can now apply the minimax theorem and upper bound the above by\n\n$$\\max_{p_t \\in \\Delta'_D} \\min_{q \\in \\Delta_K} \\mathbb{E}_{\\hat{c}_t \\sim p_t}\\left[ \\langle q, \\hat{c}_t \\rangle + R((x, \\hat{c})_{1:t}, \\rho_t) \\right].$$\n\nSince the inner objective is linear in $q$, we continue with\n\n$$\\max_{p_t \\in \\Delta'_D} \\min_{y} \\mathbb{E}_{\\hat{c}_t \\sim p_t}\\left[ \\hat{c}_t(y) + R((x, \\hat{c})_{1:t}, \\rho_t) \\right].$$\n\nWe can now expand the definition of $R(\\cdot)$:\n\n$$\\max_{p_t \\in \\Delta'_D} \\min_{y} \\mathbb{E}_{\\hat{c}_t \\sim p_t}\\left[ \\hat{c}_t(y) + \\max_{\\pi \\in \\Pi} \\left( -\\sum_{\\tau=1}^{t} \\hat{c}_\\tau(\\pi(x_\\tau)) - \\sum_{\\tau=t+1}^{T} 2\\varepsilon_\\tau(\\pi(x_\\tau)) Z_\\tau \\right) \\right] + (T - t)K/L.$$\n\nWith the notation $A_\\pi = -\\sum_{\\tau=1}^{t-1} \\hat{c}_\\tau(\\pi(x_\\tau)) - \\sum_{\\tau=t+1}^{T} 2\\varepsilon_\\tau(\\pi(x_\\tau)) Z_\\tau$, we re-write the above as\n\n$$\\max_{p_t \\in \\Delta'_D} \\min_{y} \\mathbb{E}_{\\hat{c}_t \\sim p_t}\\left[ \\hat{c}_t(y) + \\max_{\\pi \\in \\Pi} \\left( A_\\pi - \\hat{c}_t(\\pi(x_t)) \\right) \\right] + (T - t)K/L.$$\n\nWe now upper bound the first term.
The extra term $(T - t)K/L$ will be combined with the extra $K/L$ that we have abandoned to give the correct term $(T - (t-1))K/L$ needed for $\\mathrm{REL}(I_{1:t-1})$. Observe that we can re-write the first term by using symmetrization as\n\n$$\\max_{p_t \\in \\Delta'_D} \\min_{y} \\mathbb{E}_{\\hat{c}_t \\sim p_t}\\left[ \\hat{c}_t(y) + \\max_{\\pi \\in \\Pi} \\left( A_\\pi - \\hat{c}_t(\\pi(x_t)) \\right) \\right]$$\n$$= \\max_{p_t \\in \\Delta'_D} \\mathbb{E}_{\\hat{c}_t \\sim p_t}\\left[ \\max_{\\pi \\in \\Pi} \\left( A_\\pi + \\min_{y} \\mathbb{E}_{\\hat{c}'_t \\sim p_t}[\\hat{c}'_t(y)] - \\hat{c}_t(\\pi(x_t)) \\right) \\right]$$\n$$\\le \\max_{p_t \\in \\Delta'_D} \\mathbb{E}_{\\hat{c}_t \\sim p_t}\\left[ \\max_{\\pi \\in \\Pi} \\left( A_\\pi + \\mathbb{E}_{\\hat{c}'_t \\sim p_t}[\\hat{c}'_t(\\pi(x_t))] - \\hat{c}_t(\\pi(x_t)) \\right) \\right]$$\n$$\\le \\max_{p_t \\in \\Delta'_D} \\mathbb{E}_{\\hat{c}_t, \\hat{c}'_t \\sim p_t}\\left[ \\max_{\\pi \\in \\Pi} \\left( A_\\pi + \\hat{c}'_t(\\pi(x_t)) - \\hat{c}_t(\\pi(x_t)) \\right) \\right]$$\n$$= \\max_{p_t \\in \\Delta'_D} \\mathbb{E}_{\\hat{c}_t, \\hat{c}'_t \\sim p_t, \\delta}\\left[ \\max_{\\pi \\in \\Pi} \\left( A_\\pi + \\delta\\left( \\hat{c}'_t(\\pi(x_t)) - \\hat{c}_t(\\pi(x_t)) \\right) \\right) \\right]$$\n$$\\le \\max_{p_t \\in \\Delta'_D} \\mathbb{E}_{\\hat{c}_t \\sim p_t, \\delta}\\left[ \\max_{\\pi \\in \\Pi} \\left( A_\\pi + 2\\delta \\hat{c}_t(\\pi(x_t)) \\right) \\right],$$\n\nwhere $\\delta$ is a random variable which is $-1$ and $1$ with equal probability. The last inequality follows by splitting the maximum into two equal parts.\n\nConditioning on $\\hat{c}_t$, consider the random variable $M_t$ which is $-\\max_y \\hat{c}_t(y)$ or $\\max_y \\hat{c}_t(y)$ with equal probability on the coordinates where $\\hat{c}_t$ is equal to zero, and equal to $\\hat{c}_t$ on the coordinate that achieves the maximum. This is clearly an unbiased estimate of $\\hat{c}_t$.
Thus we can upper bound the last quantity by\n\n$$\\max_{p_t \\in \\Delta'_D} \\mathbb{E}_{\\hat{c}_t \\sim p_t, \\delta}\\left[ \\max_{\\pi \\in \\Pi} \\left( A_\\pi + 2\\delta \\mathbb{E}\\left[ M_t(\\pi(x_t)) \\mid \\hat{c}_t \\right] \\right) \\right] \\le \\max_{p_t \\in \\Delta'_D} \\mathbb{E}_{\\hat{c}_t, \\delta, M_t}\\left[ \\max_{\\pi \\in \\Pi} \\left( A_\\pi + 2\\delta M_t(\\pi(x_t)) \\right) \\right].$$\n\nThe random vector $\\delta M_t$, conditioning on $\\hat{c}_t$, is equal to $-\\max_y \\hat{c}_t(y)$ or $\\max_y \\hat{c}_t(y)$ with equal probability independently on each coordinate. Moreover, observe that for any distribution $p_t \\in \\Delta'_D$, the distribution of the maximum coordinate of $\\hat{c}_t$ has support on $\\{0, L\\}$ and is equal to $L$ with probability at most $K/L$. Since the objective only depends on the distribution of the maximum coordinate of $\\hat{c}_t$, we can continue the upper bound with a maximum over any distribution of random vectors whose coordinates are $0$ with probability at least $1 - K/L$ and otherwise are $-L$ or $L$ with equal probability. Specifically, letting $\\varepsilon_t$ be a Rademacher random vector, we continue with\n\n$$\\max_{Z_t \\in \\Delta_{\\{0, L\\}} : \\Pr[Z_t = L] \\le K/L} \\mathbb{E}_{\\varepsilon_t, Z_t}\\left[ \\max_{\\pi \\in \\Pi} \\left( A_\\pi + 2\\varepsilon_t(\\pi(x_t)) Z_t \\right) \\right].$$\n\nNow observe that if we denote with $a = \\Pr[Z_t = L]$, the above is equal to\n\n$$\\max_{a : 0 \\le a \\le K/L} \\left( (1 - a) \\max_{\\pi \\in \\Pi} A_\\pi + a \\mathbb{E}_{\\varepsilon_t}\\left[ \\max_{\\pi \\in \\Pi} \\left( A_\\pi + 2\\varepsilon_t(\\pi(x_t)) L \\right) \\right] \\right).$$\n\nWe now argue that this maximum is achieved by setting $a = K/L$.
For that it suffices to show that\n\n$$\\max_{\\pi \\in \\Pi} A_\\pi \\le \\mathbb{E}_{\\varepsilon_t}\\left[ \\max_{\\pi \\in \\Pi} \\left( A_\\pi + 2\\varepsilon_t(\\pi(x_t)) L \\right) \\right],$$\n\nwhich is true by observing that with $\\pi^* = \\mathop{\\mathrm{argmax}}_{\\pi \\in \\Pi} A_\\pi$ one has\n\n$$\\mathbb{E}_{\\varepsilon_t}\\left[ \\max_{\\pi \\in \\Pi} \\left( A_\\pi + 2\\varepsilon_t(\\pi(x_t)) L \\right) \\right] \\ge \\mathbb{E}_{\\varepsilon_t}\\left[ A_{\\pi^*} + 2\\varepsilon_t(\\pi^*(x_t)) L \\right] = A_{\\pi^*} + \\mathbb{E}_{\\varepsilon_t}\\left[ 2\\varepsilon_t(\\pi^*(x_t)) L \\right] = A_{\\pi^*}.$$\n\nThus we can upper bound the quantity we want by\n\n$$\\mathbb{E}_{\\varepsilon_t, Z_t}\\left[ \\max_{\\pi \\in \\Pi} \\left( A_\\pi + 2\\varepsilon_t(\\pi(x_t)) Z_t \\right) \\right],$$\n\nwhere $\\varepsilon_t$ is a Rademacher random vector and $Z_t$ is now a random variable which is equal to $L$ with probability $K/L$ and equal to $0$ with the remaining probability.\n\nTaking expectation over $\\rho_t$ and $x_t$ and adding the $(T - (t-1))K/L$ term that we abandoned, we arrive at the desired upper bound of $\\mathrm{REL}(I_{1:t-1})$. This concludes the proof of admissibility.\n\nRegret bound. By applying Lemma 2 (see Appendix A) with $\\mathbb{E}[Z_t^2] = L^2 \\Pr[Z_t = L] = KL$ and invoking Lemma 1, we get the regret bound in Equation (8).\n\n6 Discussion\n\nIn this paper, we present a new oracle-based algorithm for adversarial contextual bandits and we prove that it achieves $O((KT)^{2/3}(\\log N)^{1/3})$ regret in the settings studied by Rakhlin and Sridharan [7]. This is the best regret bound that we are aware of among oracle-based algorithms. While our bound improves on the $O(T^{3/4})$ bounds in prior work [7, 8], achieving the optimal $O(\\sqrt{TK \\log N})$ regret bound with an oracle-based approach still remains an important open question.
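As a quick numerical sanity check of the tuning in Theorem 3, the bound of Equation (8) can be evaluated directly; the following small Python sketch uses illustrative values of $T$, $K$, $N$ (assumed, not from the paper's analysis or any experiment):

```python
import math

def regret_bound(T, K, N, L):
    # Equation (8): 2*sqrt(2*T*K*L*log N) + T*K/L, valid for any L >= K
    return 2.0 * math.sqrt(2.0 * T * K * L * math.log(N)) + T * K / L

def tuned_L(T, K, N):
    # The choice L = (K*T / log N)^(1/3) from Theorem 3
    return (K * T / math.log(N)) ** (1.0 / 3.0)

# Illustrative problem sizes (assumed)
T, K, N = 10**6, 10, 1000
L = tuned_L(T, K, N)
bound = regret_bound(T, K, N, L)
# The tuned bound is within a constant factor of (K*T)^(2/3) * (log N)^(1/3)
scale = (K * T) ** (2.0 / 3.0) * math.log(N) ** (1.0 / 3.0)
ratio = bound / scale
```

For these values the tuned $L$ exceeds $K$ (so the requirement $L \\ge K$ holds), and moving $L$ by a constant factor in either direction only worsens the bound, reflecting that the tuning roughly balances the two terms of Equation (8).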
Another interesting avenue for future work involves removing the stochastic assumption on the contexts.\n\nReferences\n\n[1] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning (ICML), 2014.\n\n[2] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science (FOCS), 1995.\n\n[3] Nicolo Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM (JACM), 1997.\n\n[4] Miroslav Dud\u00edk, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. In Uncertainty and Artificial Intelligence (UAI), 2011.\n\n[5] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.\n\n[6] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems (NIPS), 2008.\n\n[7] Alexander Rakhlin and Karthik Sridharan. BISTRO: An efficient relaxation-based method for contextual bandits. In International Conference on Machine Learning (ICML), 2016.\n\n[8] Vasilis Syrgkanis, Akshay Krishnamurthy, and Robert E. Schapire. Efficient algorithms for adversarial contextual learning.
In International Conference on Machine Learning (ICML), 2016.\n\n", "award": [], "sourceid": 1568, "authors": [{"given_name": "Vasilis", "family_name": "Syrgkanis", "institution": "Microsoft Research"}, {"given_name": "Haipeng", "family_name": "Luo", "institution": "Princeton University"}, {"given_name": "Akshay", "family_name": "Krishnamurthy", "institution": "UMass Amherst"}, {"given_name": "Robert", "family_name": "Schapire", "institution": "Microsoft Research"}]}