{"title": "The Blinded Bandit: Learning with Adaptive Feedback", "book": "Advances in Neural Information Processing Systems", "page_first": 1610, "page_last": 1618, "abstract": "We study an online learning setting where the player is temporarily deprived of feedback each time it switches to a different action. Such a model of \\emph{adaptive feedback} naturally occurs in scenarios where the environment reacts to the player's actions and requires some time to recover and stabilize after the algorithm switches actions. This motivates a variant of the multi-armed bandit problem, which we call the \\emph{blinded multi-armed bandit}, in which no feedback is given to the algorithm whenever it switches arms. We develop efficient online learning algorithms for this problem and prove that they guarantee the same asymptotic regret as the optimal algorithms for the standard multi-armed bandit problem. This result stands in stark contrast to another recent result, which states that adding a switching cost to the standard multi-armed bandit makes it substantially harder to learn, and provides a direct comparison of how feedback and loss contribute to the difficulty of an online learning problem. We also extend our results to the general prediction framework of bandit linear optimization, again attaining near-optimal regret bounds.", "full_text": "The Blinded Bandit: Learning with Adaptive Feedback\n\nOfer Dekel\nMicrosoft Research\noferd@microsoft.com\n\nElad Hazan\nTechnion\nehazan@ie.technion.ac.il\n\nTomer Koren\nTechnion\ntomerk@technion.ac.il\n\nAbstract\n\nWe study an online learning setting where the player is temporarily deprived of feedback each time it switches to a different action. Such a model of adaptive feedback naturally occurs in scenarios where the environment reacts to the player\u2019s actions and requires some time to recover and stabilize after the algorithm switches actions. 
This motivates a variant of the multi-armed bandit problem, which we call the blinded multi-armed bandit, in which no feedback is given to the algorithm whenever it switches arms. We develop efficient online learning algorithms for this problem and prove that they guarantee the same asymptotic regret as the optimal algorithms for the standard multi-armed bandit problem. This result stands in stark contrast to another recent result, which states that adding a switching cost to the standard multi-armed bandit makes it substantially harder to learn, and provides a direct comparison of how feedback and loss contribute to the difficulty of an online learning problem. We also extend our results to the general prediction framework of bandit linear optimization, again attaining near-optimal regret bounds.\n\n1 Introduction\n\nThe adversarial multi-armed bandit problem [4] is a $T$-round prediction game played by a randomized player in an adversarial environment. On each round of the game, the player chooses an arm (also called an action) from some finite set, and incurs the loss associated with that arm. The player can choose the arm randomly, by choosing a distribution over the arms and then drawing an arm from that distribution. He observes the loss associated with the chosen arm, but he does not observe the loss associated with any of the other arms. The player\u2019s cumulative loss is the sum of all the loss values that he incurs during the game. To minimize his cumulative loss, the player must trade off exploration (trying different arms to observe their loss values) and exploitation (choosing a good arm based on historical observations).\n\nThe loss values are assigned by the adversarial environment before the game begins. Each of the loss values is constrained to be in $[0, 1]$ but otherwise they can be arbitrary. 
Since the loss values are set beforehand, we say that the adversarial environment is oblivious to the player\u2019s actions.\n\nThe performance of a player strategy is measured in the standard way, using the game-theoretic notion of regret (formally defined below). Auer et al. [4] present a player strategy called EXP3, prove that it guarantees a worst-case regret of $O(\\sqrt{T})$ on any oblivious assignment of loss values, and prove that this guarantee is the best possible. A sublinear upper bound on regret implies that the player\u2019s strategy improves over time and is therefore a learning strategy, but if this upper bound has a rate of $O(\\sqrt{T})$ then the problem is called an easy online learning problem.\u00b9\n\n\u00b9 The classification of online problems into easy vs. hard is borrowed from Antos et al. [2].\n\nIn this paper, we study a variant of the standard multi-armed bandit problem where the player is temporarily blinded each time he switches arms. In other words, if the player\u2019s current choice is different from his choice on the previous round then we say that he has switched arms; he incurs the loss as before, but he does not observe this loss, or any other feedback. On the other hand, if the player chooses the same arm that he chose on the previous round, he incurs and observes his loss as usual.\u00b2 We call this setting the blinded multi-armed bandit.\n\nFor example, say that the player\u2019s task is to choose an advertising campaign (out of $k$ candidates) to reduce the frequency of car accidents. Even if a new advertising campaign has an immediate effect, the new accident rate can only be measured over time (since we must wait for a few accidents to occur) and the environment\u2019s reaction to the change cannot be observed immediately.\n\nThe blinded bandit setting can also be used to model problems where a switch introduces a temporary bias into the feedback, which makes this feedback useless. 
A good example is the well-known primacy and novelty effect [14, 15] that occurs in human-computer interaction. Say that we operate an online restaurant directory and the task is to choose the best user interface (UI) for our site (from a set of $k$ candidates). The quality of a UI is measured by the time it takes the user to complete a successful interaction with our system. Whenever we switch to a new UI, we encounter a primacy effect: users are initially confused by the unfamiliar interface and interaction times artificially increase. In some situations, we may encounter the opposite, a novelty effect: a fresh new UI could intrigue users, increase their desire to engage with the system, and temporarily decrease interaction times. In both cases, feedback is immediately available, but each switch makes the feedback temporarily unreliable.\n\nThere are also cases where switching introduces a variance in the feedback, rather than a bias. Almost any setting where the feedback is measured by a physical sensor, such as a photometer or a digital thermometer, fits in this category. Most physical sensors apply a low-pass filter to the signal they measure, and a low-pass filter in the frequency domain is equivalent to integrating the signal over a sliding window in the time domain. While the sensor may output an immediate reading, it needs time to stabilize and return to an adequate precision.\n\nThe blinded bandit setting bears a close similarity to another setting called the adversarial multi-armed bandit with switching costs. In that setting, the player incurs an additional loss each time he switches arms. This penalty discourages the player from switching frequently. At first glance, it would seem that the practical problems described above could be formulated and solved as multi-armed bandit problems with switching costs and one might question the need for our new blinded bandit setting. However, Dekel et al. 
[12] recently proved that the adversarial multi-armed bandit with switching costs is a hard online learning problem, which is a problem where the best possible regret guarantee is $\\widetilde{\\Theta}(T^{2/3})$. In other words, for any learning algorithm, there exists an oblivious setting of the loss values that forces a regret of $\\widetilde{\\Omega}(T^{2/3})$.\n\nIn this paper, we present a new algorithm for the blinded bandit setting and prove that it guarantees a regret of $O(\\sqrt{T})$ on any oblivious sequence of loss values. In other words, we prove that the blinded bandit is, surprisingly, as easy as the standard multi-armed bandit setting, despite its close similarity to the hard multi-armed bandit with switching costs problem. Our result has both theoretical and practical significance. Theoretically, it provides a direct comparison of how feedback and loss contribute to the difficulty of an online learning problem. Practically, it identifies a rich and important class of online learning problems that would seem to be a natural fit for the multi-armed bandit setting with switching costs, but are in fact much easier to learn. Moreover, to the best of our knowledge, our work is the first to consider online learning in a setting where the loss values are oblivious to the player\u2019s past actions but the feedback is adaptive.\n\nWe also extend our results and study a blinded version of the more general bandit linear optimization setting. The bandit linear optimization framework is useful for efficiently modeling problems of learning under uncertainty with extremely large, yet structured decision sets. For example, consider the problem of online routing in networks [5], where our task is to route a stream of packets between two nodes in a computer network. 
While there may be exponentially many paths between the two nodes, the total time it takes to send a packet is simply the sum of the delays on each edge in the path. If the route is switched in the middle of a long streaming transmission, the network protocol needs a while to find the new optimal transmission rate, and the delay of the first few packets after the switch can be arbitrary. This view on the packet routing problem demonstrates the need for a blinded version of bandit linear optimization.\n\n\u00b2 More generally, we could define a setting where the player is blinded for $m$ rounds following each switch, but for simplicity we focus on $m = 1$.\n\nThe paper is organized as follows. In Section 2 we formalize the setting and lay out the necessary definitions. Section 3 is dedicated to presenting our main result, which is an optimal algorithm for the blinded bandit problem. In Section 4 we extend this result to the more general setting of bandit linear optimization. We conclude in Section 5.\n\n2 Problem Setting\n\nTo describe our contribution to this problem and its significance compared to previous work, we first define our problem setting more formally and give some background on the problem.\n\nAs mentioned above, the player plays a $T$-round prediction game against an adversarial environment. Before the game begins, the environment picks a sequence of loss functions $\\ell_1, \\ldots, \\ell_T : K \\mapsto [0, 1]$ that assign loss values to arms from the set $K = \\{1, \\ldots, k\\}$. On each round $t$, the player chooses an arm $x_t \\in K$, possibly at random, which results in a loss $\\ell_t(x_t)$. In the standard multi-armed bandit setting, the feedback provided to the player at the end of round $t$ is the number $\\ell_t(x_t)$, whereas the other values of the function $\\ell_t$ are never observed.\n\nThe player\u2019s expected cumulative loss at the end of the game equals $\\mathbb{E}\\bigl[\\sum_{t=1}^{T} \\ell_t(x_t)\\bigr]$. 
Since the loss values are assigned adversarially, the player\u2019s cumulative loss is only meaningful when compared to an adequate baseline; we compare the player\u2019s cumulative loss to the cumulative loss of a fixed policy, which chooses the same arm on every round. Define the player\u2019s regret as\n\n$$R(T) = \\mathbb{E}\\Bigl[\\sum_{t=1}^{T} \\ell_t(x_t)\\Bigr] - \\min_{x \\in K} \\sum_{t=1}^{T} \\ell_t(x) . \\qquad (1)$$\n\nRegret can be positive or negative. If $R(T) = o(T)$ (namely, the regret is either negative or grows at most sublinearly with $T$), we say that the player is learning. Otherwise, if $R(T) = \\Theta(T)$ (namely, the regret grows linearly with $T$), it indicates that the player\u2019s per-round loss does not decrease with time and therefore we say that the player is not learning.\n\nIn the blinded version of the problem, the feedback on round $t$, i.e. the number $\\ell_t(x_t)$, is revealed to the player only if he chooses $x_t$ to be the same as $x_{t-1}$. On the other hand, if $x_t \\neq x_{t-1}$, then the player does not observe any feedback. The blinded bandit game is summarized in Fig. 1.\n\nParameters: action set $K$, time horizon $T$\n\u2022 Environment determines a sequence of loss functions $\\ell_1, \\ldots, \\ell_T : K \\mapsto [0, 1]$\n\u2022 On each round $t = 1, 2, \\ldots, T$:\n    1. Player picks an action $x_t \\in K$ and suffers the loss $\\ell_t(x_t) \\in [0, 1]$\n    2. If $x_t = x_{t-1}$, the number $\\ell_t(x_t)$ is revealed as feedback to the player\n    3. Otherwise, if $x_t \\neq x_{t-1}$, the player gets no feedback from the environment\n\nFigure 1: The blinded bandit game.\n\nBandit Linear Optimization. In Section 4, we consider the more general setting of online linear optimization with bandit feedback [10, 11, 1]. In this problem, on round $t$ of the game, the player chooses an action, possibly at random, which is a point $x_t$ in a fixed action set $K \\subset \\mathbb{R}^n$. 
The loss he suffers on that round is then computed by a linear function $\\ell_t(x_t) = \\ell_t \\cdot x_t$, where $\\ell_t \\in \\mathbb{R}^n$ is a loss vector chosen by the oblivious adversarial environment before the game begins. To ensure that the incurred losses are bounded, we assume that the loss vectors $\\ell_1, \\ldots, \\ell_T$ are admissible, that is, they satisfy $|\\ell_t \\cdot x| \\le 1$ for all $t$ and $x \\in K$ (in other words, the loss vectors reside in the polar set of $K$). As in the multi-armed bandit problem, the player only observes the loss he incurred, and the full loss vector $\\ell_t$ is never revealed to him. The player\u2019s performance is measured by his regret, as defined above in Eq. (1).\n\n3 Algorithm\n\nWe recall the classic EXP3 algorithm for the standard multi-armed bandit problem, and specifically focus on the version presented in Bubeck and Cesa-Bianchi [6]. The player maintains a probability distribution over the arms, which we denote by $p_t \\in \\Delta(K)$ (where $\\Delta(K)$ denotes the set of probability measures over $K$, which is simply the $k$-dimensional simplex when $K = \\{1, 2, \\ldots, k\\}$). Initially, $p_1$ is set to the uniform distribution $(\\frac{1}{k}, \\ldots, \\frac{1}{k})$. On round $t$, the player draws $x_t$ according to $p_t$, incurs and observes the loss $\\ell_t(x_t)$, and applies the update rule\n\n$$\\forall\\, x \\in K, \\qquad p_{t+1}(x) \\propto p_t(x) \\cdot \\exp\\Bigl(-\\eta \\, \\frac{\\ell_t(x_t)}{p_t(x_t)} \\cdot \\mathbf{1}_{x = x_t}\\Bigr) .$$\n\nEXP3 provides the following regret guarantee, which depends on the user-defined learning rate parameter $\\eta$:\n\nTheorem 1 (due to Auer et al. [4], taken from Bubeck and Cesa-Bianchi [6]). Let $\\ell_1, \\ldots, \\ell_T$ be an arbitrary loss sequence, where each $\\ell_t : K \\mapsto [0, 1]$. Let $x_1, \\ldots, x_T$ be the random sequence of arms chosen by EXP3 (with learning rate $\\eta > 0$) as it observes this sequence. 
Then,\n\n$$R(T) \\le \\frac{\\eta k T}{2} + \\frac{\\log k}{\\eta} .$$\n\nEXP3 cannot be used in the blinded bandit setting because the EXP3 update rule cannot be called on rounds where a switch occurs. Also, since switching actions $\\Omega(T)$ times is, in general, required for obtaining the optimal $O(\\sqrt{T})$ regret (see [12]), the player must avoid switching actions too frequently and often stick with the action that was chosen on the previous round. Due to the adversarial nature of the problem, randomization must be used in controlling the scheme of action switches.\n\nWe propose a variation on EXP3, which is presented in Algorithm 1. Our algorithm begins by drawing a sequence of independent Bernoulli random variables $b_0, b_1, \\ldots, b_{T+1}$ (i.e., such that $\\mathbb{P}(b_t = 0) = \\mathbb{P}(b_t = 1) = \\frac{1}{2}$). This sequence determines the schedule of switches and updates for the entire game. The algorithm draws a new arm (and possibly switches) only on rounds where $b_{t-1} = 0$ and $b_t = 1$ and invokes the EXP3 update rule only on rounds where $b_t = 0$ and $b_{t+1} = 1$. Note that these two events can never co-occur. Specifically, the algorithm always invokes the update rule one round before the potential switch occurs. This confirms that the algorithm relies on the value of $\\ell_t(x_t)$ only on non-switching rounds.\n\nAlgorithm 1: BLINDED EXP3\nset $p_1 \\leftarrow (\\frac{1}{k}, \\ldots, \\frac{1}{k})$, draw $x_0 \\sim p_1$\ndraw $b_0, \\ldots, b_{T+1}$ i.i.d. unbiased Bernoullis\nfor $t = 1, 2, \\ldots, T$\n    if $b_{t-1} = 0$ and $b_t = 1$\n        draw $x_t \\sim p_t$    // possible switch\n    else\n        set $x_t \\leftarrow x_{t-1}$    // no switch\n    play arm $x_t$ and incur loss $\\ell_t(x_t)$\n    if $b_t = 0$ and $b_{t+1} = 1$\n        observe $\\ell_t(x_t)$ and for all $x \\in K$, update\n            $w_{t+1}(x) \\leftarrow p_t(x) \\cdot \\exp\\bigl(-\\eta \\, \\frac{\\ell_t(x_t)}{p_t(x_t)} \\cdot \\mathbf{1}_{x = x_t}\\bigr)$\n        set $p_{t+1} \\leftarrow w_{t+1} / \\|w_{t+1}\\|_1$\n    else\n        set $p_{t+1} \\leftarrow p_t$\n\nWe set out to prove the following regret bound.\n\nTheorem 2. Let $\\ell_1, \\ldots, \\ell_T$ be an arbitrary loss sequence, where each $\\ell_t : K \\mapsto [0, 1]$. Let $x_1, \\ldots, x_T$ be the random sequence of arms chosen by Algorithm 1 as it plays the blinded bandit game on this sequence (with learning rate fixed to $\\eta = \\sqrt{\\frac{2 \\log k}{kT}}$). Then,\n\n$$R(T) \\le 6 \\sqrt{T k \\log k} .$$\n\nWe prove Theorem 2 through a sequence of lemmas. In the following, we let $\\ell_1, \\ldots, \\ell_T$ be an arbitrary loss sequence and let $x_1, \\ldots, x_T$ be the sequence of arms chosen by Algorithm 1 (with parameter $\\eta > 0$). First, we define the set\n\n$$S = \\bigl\\{ t \\in [T] : b_t = 0 \\text{ and } b_{t+1} = 1 \\bigr\\} .$$\n\nIn words, $S$ is a random subset of $[T]$ that indicates the rounds on which Algorithm 1 uses its feedback and applies the EXP3 update.\n\nLemma 1. For any $x \\in K$, it holds that\n\n$$\\mathbb{E}\\Bigl[\\sum_{t \\in S} \\ell_t(x_t) - \\sum_{t \\in S} \\ell_t(x)\\Bigr] \\le \\frac{\\eta k T}{8} + \\frac{\\log k}{\\eta} .$$\n\nProof. For any concrete instantiation of $b_0, \\ldots, b_{T+1}$, the set $S$ is fixed and the sequence $(\\ell_t)_{t \\in S}$ is an oblivious sequence of loss functions. Note that the steps performed by Algorithm 1 on the rounds indicated in $S$ are precisely the steps that the standard EXP3 algorithm would perform if it were presented with the loss sequence $(\\ell_t)_{t \\in S}$. 
Therefore, Theorem 1 guarantees that\n\n$$\\mathbb{E}\\Bigl[\\sum_{t \\in S} \\ell_t(x_t) - \\sum_{t \\in S} \\ell_t(x) \\,\\Big|\\, S\\Bigr] \\le \\frac{\\eta k |S|}{2} + \\frac{\\log k}{\\eta} .$$\n\nTaking expectations on both sides of the above and noting that $\\mathbb{E}[|S|] \\le T/4$ proves the lemma.\n\nLemma 1 proves a regret bound that is restricted to the rounds indicated by $S$. The following lemma relates that regret to the total regret, on all $T$ rounds.\n\nLemma 2. For any $x \\in K$, we have\n\n$$\\mathbb{E}\\Bigl[\\sum_{t=1}^{T} \\ell_t(x_t)\\Bigr] - \\sum_{t=1}^{T} \\ell_t(x) \\le 4\\, \\mathbb{E}\\Bigl[\\sum_{t \\in S} \\ell_t(x_t) - \\sum_{t \\in S} \\ell_t(x)\\Bigr] + \\mathbb{E}\\Bigl[\\sum_{t=1}^{T} \\|p_t - p_{t-1}\\|_1\\Bigr] .$$\n\nProof. Using the definition of $S$, we have\n\n$$\\mathbb{E}\\Bigl[\\sum_{t \\in S} \\ell_t(x)\\Bigr] = \\sum_{t=1}^{T} \\ell_t(x)\\, \\mathbb{E}[(1 - b_t) b_{t+1}] = \\frac{1}{4} \\sum_{t=1}^{T} \\ell_t(x) . \\qquad (2)$$\n\nSimilarly, we have\n\n$$\\mathbb{E}\\Bigl[\\sum_{t \\in S} \\ell_t(x_t)\\Bigr] = \\sum_{t=1}^{T} \\mathbb{E}\\bigl[\\ell_t(x_t) (1 - b_t) b_{t+1}\\bigr] . \\qquad (3)$$\n\nWe focus on the $t$\u2019th summand in the right-hand side above. Since $b_{t+1}$ is independent of $\\ell_t(x_t)(1 - b_t)$, it holds that\n\n$$\\mathbb{E}\\bigl[\\ell_t(x_t)(1 - b_t) b_{t+1}\\bigr] = \\mathbb{E}[b_{t+1}]\\, \\mathbb{E}\\bigl[\\ell_t(x_t)(1 - b_t)\\bigr] = \\frac{1}{2}\\, \\mathbb{E}\\bigl[\\ell_t(x_t)(1 - b_t)\\bigr] .$$\n\nUsing the law of total expectation, we get\n\n$$\\frac{1}{2}\\, \\mathbb{E}\\bigl[\\ell_t(x_t)(1 - b_t)\\bigr] = \\frac{1}{4}\\, \\mathbb{E}\\bigl[\\ell_t(x_t)(1 - b_t) \\,\\big|\\, b_t = 0\\bigr] + \\frac{1}{4}\\, \\mathbb{E}\\bigl[\\ell_t(x_t)(1 - b_t) \\,\\big|\\, b_t = 1\\bigr] = \\frac{1}{4}\\, \\mathbb{E}\\bigl[\\ell_t(x_t) \\,\\big|\\, b_t = 0\\bigr] .$$\n\nIf $b_t = 0$ then Algorithm 1 sets $x_t \\leftarrow x_{t-1}$ so we have that $x_t = x_{t-1}$. Therefore, the above equals $\\frac{1}{4}\\, \\mathbb{E}[\\ell_t(x_{t-1}) \\mid b_t = 0]$. Since $x_{t-1}$ is independent of $b_t$, this simply equals $\\frac{1}{4}\\, \\mathbb{E}[\\ell_t(x_{t-1})]$. H\u00f6lder\u2019s inequality can be used to upper bound\n\n$$\\mathbb{E}[\\ell_t(x_t) - \\ell_t(x_{t-1})] = \\mathbb{E}\\Bigl[\\sum_{x \\in K} \\bigl(p_t(x) - p_{t-1}(x)\\bigr) \\ell_t(x)\\Bigr] \\le \\mathbb{E}[\\|p_t - p_{t-1}\\|_1] \\cdot \\max_{x \\in K} \\ell_t(x) ,$$\n\nwhere we have used the fact that $x_t$ and $x_{t-1}$ are distributed according to $p_t$ and $p_{t-1}$ respectively (regardless of whether an update took place or not). 
Since it is assumed that $\\ell_t(x) \\in [0, 1]$ for all $t$ and $x \\in K$, we obtain\n\n$$\\frac{1}{4}\\, \\mathbb{E}\\bigl[\\ell_t(x_{t-1})\\bigr] \\ge \\frac{1}{4} \\bigl(\\mathbb{E}\\bigl[\\ell_t(x_t)\\bigr] - \\mathbb{E}[\\|p_t - p_{t-1}\\|_1]\\bigr) .$$\n\nOverall, we have shown that\n\n$$\\mathbb{E}\\bigl[\\ell_t(x_t)(1 - b_t) b_{t+1}\\bigr] \\ge \\frac{1}{4} \\bigl(\\mathbb{E}\\bigl[\\ell_t(x_t)\\bigr] - \\mathbb{E}[\\|p_t - p_{t-1}\\|_1]\\bigr) .$$\n\nPlugging this inequality back into Eq. (3) gives\n\n$$\\mathbb{E}\\Bigl[\\sum_{t \\in S} \\ell_t(x_t)\\Bigr] \\ge \\frac{1}{4}\\, \\mathbb{E}\\Bigl[\\sum_{t=1}^{T} \\ell_t(x_t) - \\sum_{t=1}^{T} \\|p_t - p_{t-1}\\|_1\\Bigr] .$$\n\nSubtracting Eq. (2) from the inequality above and multiplying by 4 concludes the proof.\n\nNext, we prove that the probability distributions over arms do not change much on consecutive rounds of EXP3.\n\nLemma 3. The distributions $p_1, p_2, \\ldots, p_T$ generated by the BLINDED EXP3 algorithm satisfy $\\mathbb{E}[\\|p_{t+1} - p_t\\|_1] \\le 2\\eta$ for all $t$.\n\nProof. Fix a round $t$; we shall prove the stronger claim that $\\|p_{t+1} - p_t\\|_1 \\le 2\\eta$ with probability 1. If no update occurred on round $t$, then $p_{t+1} = p_t$ and this holds trivially. Otherwise, we can use the triangle inequality to bound\n\n$$\\|p_{t+1} - p_t\\|_1 \\le \\|p_{t+1} - w_{t+1}\\|_1 + \\|w_{t+1} - p_t\\|_1 ,$$\n\nwith the vector $w_{t+1}$ as specified in Algorithm 1. Letting $W_{t+1} = \\|w_{t+1}\\|_1$ we have $p_{t+1} = w_{t+1} / W_{t+1}$, so we can rewrite the first term on the right-hand side above as\n\n$$\\|p_{t+1} - W_{t+1} \\cdot p_{t+1}\\|_1 = |1 - W_{t+1}| \\cdot \\|p_{t+1}\\|_1 = 1 - W_{t+1} = \\|p_t - w_{t+1}\\|_1 ,$$\n\nwhere the last equality follows by observing that $p_t \\ge w_{t+1}$ entrywise, $\\|p_t\\|_1 = 1$ and $\\|w_{t+1}\\|_1 = W_{t+1}$. 
By the definition of $w_{t+1}$, the second term on the right-hand side above equals $p_t(x_t) \\cdot \\bigl(1 - e^{-\\eta \\ell_t(x_t) / p_t(x_t)}\\bigr)$. Overall, we have\n\n$$\\|p_{t+1} - p_t\\|_1 \\le 2 p_t(x_t) \\cdot \\bigl(1 - e^{-\\eta \\ell_t(x_t) / p_t(x_t)}\\bigr) .$$\n\nUsing the inequality $1 - \\exp(-\\alpha) \\le \\alpha$, we get $\\|p_{t+1} - p_t\\|_1 \\le 2 \\eta \\ell_t(x_t)$. The claim now follows from the assumption that $\\ell_t(x_t) \\in [0, 1]$.\n\nWe can now proceed to prove our regret bound.\n\nProof of Theorem 2. Combining the bounds of Lemmas 1\u20133 proves that for any fixed arm $x \\in K$, it holds that\n\n$$\\mathbb{E}\\Bigl[\\sum_{t=1}^{T} \\ell_t(x_t)\\Bigr] - \\sum_{t=1}^{T} \\ell_t(x) \\le \\frac{\\eta k T}{2} + \\frac{4 \\log k}{\\eta} + 2 \\eta T \\le 2 \\eta k T + \\frac{4 \\log k}{\\eta} .$$\n\nIn particular, the above holds for the best arm in hindsight. Setting $\\eta = \\sqrt{\\frac{2 \\log k}{kT}}$ proves the theorem.\n\n4 Blinded Bandit Linear Optimization\n\nIn this section we extend our results to the setting of linear optimization with bandit feedback, formally defined in Section 2. We focus on the GEOMETRICHEDGE algorithm [11], which was the first algorithm for the problem to attain the optimal $O(\\sqrt{T})$ regret, and adapt it to the blinded setup.\n\nOur BLINDED GEOMETRICHEDGE algorithm is detailed in Algorithm 2. The algorithm uses a mechanism similar to that of Algorithm 1 for deciding when to avoid switching actions. Following the presentation of [11], we assume that $K \\subseteq [-1, 1]^n$ is finite and that the standard basis vectors $e_1, \\ldots, e_n$ are contained in $K$. Then, the set $E = \\{e_1, \\ldots, e_n\\}$ is a barycentric spanner of $K$ [5] that serves the algorithm as an exploration basis. 
We denote the uniform distribution over $E$ by $u_E$.\n\nAlgorithm 2: BLINDED GEOMETRICHEDGE\nParameter: learning rate $\\eta > 0$\nlet $q_1$ be the uniform distribution over $K$, and draw $x_0 \\sim q_1$\ndraw $b_0, \\ldots, b_{T+1}$ i.i.d. unbiased Bernoullis\nset $\\gamma \\leftarrow n^2 \\eta$\nfor $t = 1, 2, \\ldots, T$\n    set $p_t \\leftarrow (1 - \\gamma)\\, q_t + \\gamma\\, u_E$\n    compute covariance $C_t \\leftarrow \\mathbb{E}_{x \\sim p_t}[x x^{\\top}]$\n    if $b_{t-1} = 0$ and $b_t = 1$\n        draw $x_t \\sim p_t$    // possible switch\n    else\n        set $x_t \\leftarrow x_{t-1}$    // no switch\n    play arm $x_t$ and incur loss $\\ell_t(x_t) = \\ell_t \\cdot x_t$\n    if $b_t = 0$ and $b_{t+1} = 1$\n        observe $\\ell_t(x_t)$ and let $\\hat{\\ell}_t \\leftarrow \\ell_t(x_t) \\cdot C_t^{-1} x_t$\n        update $q_{t+1}(x) \\propto q_t(x) \\cdot \\exp(-\\eta\\, \\hat{\\ell}_t \\cdot x)$\n    else\n        set $q_{t+1} \\leftarrow q_t$\n\nThe main result of this section is an $O(\\sqrt{T})$ upper bound on the expected regret of Algorithm 2.\n\nTheorem 3. Let $\\ell_1, \\ldots, \\ell_T$ be an arbitrary sequence of linear loss functions, admissible with respect to the action set $K \\subseteq \\mathbb{R}^n$. Let $x_1, \\ldots, x_T$ be the random sequence of arms chosen by Algorithm 2 as it plays the blinded bandit game on this sequence, with learning rate fixed to $\\eta = \\sqrt{\\frac{\\log(nT)}{10 n T}}$. Then,\n\n$$R(T) \\le 4 n^{3/2} \\sqrt{T \\log(nT)} .$$\n\nWith minor modifications, our technique can also be applied to variants of the GEOMETRICHEDGE algorithm (that differ by their exploration basis) for obtaining regret bounds with improved dependence on the dimension $n$. This includes the COMBAND algorithm [8], EXP2 with John\u2019s exploration [7], and the more recent version employing volumetric spanners [13].\n\nWe now turn to prove Theorem 3. Our first step is proving an analogue of Lemma 1, using the regret bound of the GEOMETRICHEDGE algorithm proved by Dani et al. [11].\n\nLemma 4. 
For any $x \\in K$, it holds that\n\n$$\\mathbb{E}\\Bigl[\\sum_{t \\in S} \\ell_t(x_t) - \\sum_{t \\in S} \\ell_t(x)\\Bigr] \\le \\frac{\\eta n^2 T}{2} + \\frac{n \\log(nT)}{2\\eta} .$$\n\nWe proceed to prove that the distributions generated by Algorithm 2 do not change too quickly.\n\nLemma 5. The distributions $p_1, p_2, \\ldots, p_T$ produced by the BLINDED GEOMETRICHEDGE algorithm (from which the actions $x_1, x_2, \\ldots, x_T$ are drawn) satisfy $\\mathbb{E}[\\|p_{t+1} - p_t\\|_1] \\le 4 \\eta \\sqrt{n}$ for all $t$.\n\nThe proofs of both lemmas are omitted due to space constraints. We now prove Theorem 3.\n\nProof of Theorem 3. Notice that the bound of Lemma 2 is independent of the construction of the distributions $p_1, p_2, \\ldots, p_T$ and the structure of $K$, and thus applies for Algorithm 2 as well. Combining this bound with the results of Lemmas 4 and 5, it follows that for any fixed action $x \\in K$,\n\n$$\\mathbb{E}\\Bigl[\\sum_{t=1}^{T} \\ell_t(x_t)\\Bigr] - \\sum_{t=1}^{T} \\ell_t(x) \\le \\frac{\\eta n^2 T}{2} + \\frac{n \\log(nT)}{2\\eta} + 4 \\eta \\sqrt{n}\\, T \\le 5 \\eta n^2 T + \\frac{n \\log(nT)}{2\\eta} .$$\n\nSetting $\\eta = \\sqrt{\\frac{\\log(nT)}{10 n T}}$ proves the theorem.\n\n5 Discussion and Open Problems\n\nIn this paper, we studied a new online learning scenario where the player receives feedback from the adversarial environment only when his action is the same as the one from the previous round, a setting that we named the blinded bandit. We devised an optimal algorithm for the blinded multi-armed bandit problem based on the EXP3 strategy, and used similar ideas to adapt the GEOMETRICHEDGE algorithm to the blinded bandit linear optimization setting. 
In fact, a similar analysis can be applied to any online algorithm that does not change its underlying prediction distributions too quickly (in total variation distance).\n\nIn the practical examples given in the introduction, where each switch introduces a bias or a variance, we argued that the multi-armed bandit problem with switching costs is an inadequate solution, since it is unreasonable to solve an easy problem by reducing it to one that is substantially harder. Alternatively, one might consider simply ignoring the noise in the feedback after each switch and using a standard adversarial multi-armed bandit algorithm like EXP3 despite the bias or the variance. However, if we do that, the player\u2019s observed losses would no longer be oblivious (as the observed loss on round $t$ would depend on $x_{t-1}$), and the regret guarantees of EXP3 would no longer hold.\u00b3 Moreover, any multi-armed bandit algorithm with $O(\\sqrt{T})$ regret can be forced to make $\\Theta(T)$ switches [12], so the loss observed by the player could actually be non-oblivious in a constant fraction of the rounds, which would degrade the performance of EXP3.\n\nOur setting might seem similar to the related problem of label-efficient prediction (with bandit feedback), see [9]. In the label-efficient prediction setting, the feedback for the action performed on some round is received only if the player explicitly asks for it. The player may freely choose when to observe feedback, subject to a global constraint on the number of total feedback queries. In contrast, in our setting there is a strong correlation between the actions the player takes and the presence of the feedback signal. As a consequence, the player is not free to decide when he observes feedback as in the label-efficient setting. Another setting that may seem closely related to our setting is the multi-armed bandit problem with delayed feedback [16, 17]. 
In this setting, the feedback for the action performed on round $t$ is received at the end of round $t + 1$. However, note that in all of the examples we have discussed, the feedback is always immediate, but is either nonexistent or unreliable right after a switch. The important aspect of our setup, which does not apply to the label-efficient and delayed feedback settings, is that the feedback adapts to the player\u2019s past actions.\n\nOur work leaves a few interesting questions for future research. A closely related adaptive-feedback problem is one where feedback is revealed only on rounds where the player does switch actions. Can the player attain $O(\\sqrt{T})$ regret in this setting as well, or is the need to constantly switch actions detrimental to the player? More generally, we can consider other multi-armed bandit problems with adaptive feedback, where the feedback depends on the player\u2019s actions on previous rounds. It would be quite interesting to understand what kind of adaptive-feedback patterns give rise to easy problems, for which a regret of $O(\\sqrt{T})$ is attainable. Specifically, is there a problem with oblivious losses and adaptive feedback whose minimax regret is $\\widetilde{\\Theta}(T^{2/3})$, as is the case with adaptive losses?\n\nAcknowledgments\n\nThe research leading to these results has received funding from the Microsoft-Technion EC center, and the European Union\u2019s Seventh Framework Programme (FP7/2007-2013) under grant agreement n\u00b0 336078 ERC-SUBLRN.\n\n\u00b3 Auer et al. [4] also present an algorithm called EXP3.P and seemingly prove $O(\\sqrt{T})$ regret guarantees against non-oblivious adversaries. These bounds are irrelevant in our setting; see Arora et al. [3].\n\nReferences\n\n[1] J. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT, pages 263\u2013274, 2008.\n\n[2] A. Antos, G. Bart\u00f3k, D. 
P\u00e1l, and C. Szepesv\u00e1ri. Toward a classification of finite partial-monitoring games. Theoretical Computer Science, 2012.\n\n[3] R. Arora, O. Dekel, and A. Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, 2012.\n\n[4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48\u201377, 2002.\n\n[5] B. Awerbuch and R. D. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 45\u201353. ACM, 2004.\n\n[6] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1\u2013122, 2012.\n\n[7] S. Bubeck, N. Cesa-Bianchi, and S. M. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), volume 23, pages 41.1\u201341.14, 2012.\n\n[8] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404\u20131422, 2012.\n\n[9] N. Cesa-Bianchi, G. Lugosi, and G. Stoltz. Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory, 51(6):2152\u20132162, 2005.\n\n[10] V. Dani and T. P. Hayes. Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2006.\n\n[11] V. Dani, S. M. Kakade, and T. P. Hayes. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, pages 345\u2013352, 2007.\n\n[12] O. Dekel, J. Ding, T. Koren, and Y. Peres. 
Bandits with switching costs: T^{2/3} regret. arXiv preprint arXiv:1310.2997, 2013.\n\n[13] E. Hazan, Z. Karnin, and R. Meka. Volumetric spanners and their applications to machine learning. arXiv preprint arXiv:1312.6214, 2013.\n\n[14] R. Kohavi, R. Longbotham, D. Sommerfield, and R. M. Henne. Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery, 18(1):140\u2013181, 2009.\n\n[15] R. Kohavi, A. Deng, B. Frasca, R. Longbotham, T. Walker, and Y. Xu. Trustworthy online controlled experiments: Five puzzling outcomes explained. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 786\u2013794. ACM, 2012.\n\n[16] C. Mesterharm. Online learning with delayed label feedback. In Proceedings of the Sixteenth International Conference on Algorithmic Learning Theory, 2005.\n\n[17] G. Neu, A. Gy\u00f6rgy, C. Szepesv\u00e1ri, and A. Antos. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems 23, pages 1804\u20131812, 2010.\n", "award": [], "sourceid": 850, "authors": [{"given_name": "Ofer", "family_name": "Dekel", "institution": "Microsoft Research"}, {"given_name": "Elad", "family_name": "Hazan", "institution": "Technion"}, {"given_name": "Tomer", "family_name": "Koren", "institution": "Technion"}]}