{"title": "Nearly Tight Bounds for the Continuum-Armed Bandit Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 697, "page_last": 704, "abstract": null, "full_text": "Nearly Tight Bounds for the Continuum-Armed\n                                Bandit Problem\n\n\n\n                                         Robert Kleinberg\n\n\n\n\n                                            Abstract\n\n          In the multi-armed bandit problem, an online algorithm must choose\n          from a set of strategies in a sequence of n trials so as to minimize the\n          total cost of the chosen strategies. While nearly tight upper and lower\n          bounds are known in the case when the strategy set is finite, much less is\n          known when there is an infinite strategy set. Here we consider the case\n          when the set of strategies is a subset of R^d, and the cost functions are\n          continuous. In the d = 1 case, we improve on the best-known upper and\n          lower bounds, closing the gap to a sublogarithmic factor. We also\n          consider the case where d > 1 and the cost functions are convex, adapting a\n          recent online convex optimization algorithm of Zinkevich to the sparser\n          feedback model of the multi-armed bandit problem.\n\n\n\n1     Introduction\n\nIn an online decision problem, an algorithm must choose from among a set of strategies in\neach of n consecutive trials so as to minimize the total cost of the chosen strategies. 
The\ncosts of strategies are specified by a real-valued function which is defined on the entire\nstrategy set and which varies over time in a manner initially unknown to the algorithm.\nThe archetypical online decision problems are the best expert problem, in which the entire\ncost function is revealed to the algorithm as feedback at the end of each trial, and the\nmulti-armed bandit problem, in which the feedback reveals only the cost of the chosen strategy.\nThe names of the two problems are derived from the metaphors of combining expert advice\n(in the case of the best expert problem) and learning to play the best slot machine in a casino\n(in the case of the multi-armed bandit problem).\n\nThe applications of online decision problems are too numerous to be listed here. In\naddition to occupying a central position in online learning theory, algorithms for such\nproblems have been applied in numerous other areas of computer science, such as paging and\ncaching [6, 14], data structures [7], routing [4, 5], wireless networks [19], and online\nauction mechanisms [8, 15]. Algorithms for online decision problems are also applied in a\nbroad range of fields outside computer science, including statistics (sequential design of\nexperiments [18]), economics (pricing [20]), game theory (adaptive game playing [13]),\nand medical decision making (optimal design of clinical trials [10]).\n\n      M.I.T. CSAIL, Cambridge, MA 02139. Email: rdk@csail.mit.edu. Supported by a Fannie\nand John Hertz Foundation Fellowship.\n\nMulti-armed bandit problems have been studied quite thoroughly in the case of a finite\nstrategy set, and the performance of the optimal algorithm (as a function of n) is known\nup to a constant factor [3, 18]. In contrast, much less is known in the case of an infinite\nstrategy set. In this paper, we consider multi-armed bandit problems with a continuum of\nstrategies, parameterized by one or more real numbers. 
In other words, we are studying\nonline learning problems in which the learner designates a strategy in each time step by\nspecifying a d-tuple of real numbers (x_1, . . . , x_d); the cost function is then evaluated at\n(x_1, . . . , x_d) and this number is reported to the algorithm as feedback. Recent progress on\nsuch problems has been spurred by the discovery of new algorithms (e.g. [4, 9, 16, 21])\nas well as compelling applications. Two such applications are online auction mechanism\ndesign [8, 15], in which the strategy space is an interval of feasible prices, and online\noblivious routing [5], in which the strategy space is a flow polytope.\n\nAlgorithms for online decision problems are often evaluated in terms of their regret,\ndefined as the difference in expected cost between the sequence of strategies chosen by the\nalgorithm and the best fixed (i.e. not time-varying) strategy. While tight upper and lower\nbounds on the regret of algorithms for the K-armed bandit problem have been known\nfor many years [3, 18], our knowledge of such bounds for continuum-armed bandit\nproblems is much less satisfactory. For a one-dimensional strategy space, the first algorithm\nwith sublinear regret appeared in [1], while the first polynomial lower bound on regret\nappeared in [15]. For Lipschitz-continuous cost functions (the case introduced in [1]), the\nbest known upper and lower bounds for this problem are currently O(n^{3/4}) and Ω(n^{1/2}),\nrespectively [1, 15], leaving as an open question the problem of determining tight bounds\nfor the regret as a function of n. Here, we solve this open problem by sharpening the\nupper and lower bounds to O(n^{2/3} log^{1/3}(n)) and Ω(n^{2/3}), respectively, closing the gap to a\nsublogarithmic factor. 
Note that this requires improving the best known algorithm as well\nas the lower bound technique.\n\nRecently, and independently, Eric Cope [11] considered a class of cost functions obeying\na more restrictive condition on the shape of the function near its optimum, and for such\nfunctions he obtained a sharper bound on regret than the bound proved here for uniformly\nlocally Lipschitz cost functions. Cope requires that each cost function C achieves its\noptimum at a unique point θ, and that there exist constants K_0 > 0 and p ≥ 1 such that for\nall x, |C(x) - C(θ)| ≥ K_0 ‖x - θ‖^p. For this class of cost functions -- which is probably\nbroad enough to capture most cases of practical interest -- he proves that the regret of the\noptimal continuum-armed bandit algorithm is O(n^{-1/2}), and that this bound is tight.\n\nFor a d-dimensional strategy space, any multi-armed bandit algorithm must suffer regret\ndepending exponentially on d unless the cost functions are further constrained. (This is\ndemonstrated by a simple counterexample in which the cost function is identically zero\nin all but one orthant of R^d, takes a negative value somewhere in that orthant, and does\nnot vary over time.) For the best-expert problem, algorithms whose regret is polynomial\nin d and sublinear in n are known for the case of cost functions which are constrained to\nbe linear [16] or convex [21]. In the case of linear cost functions, the relevant algorithm\nhas been adapted to the multi-armed bandit setting in [4, 9]. Here we adapt the online\nconvex programming algorithm of [21] to the continuum-armed bandit setting, obtaining\nthe first known algorithm for this problem to achieve regret depending polynomially on\nd and sublinearly on n. A remarkably similar algorithm was discovered independently\nand simultaneously by Flaxman, Kalai, and McMahan [12]. 
Their algorithm and analysis\nare superior to ours, requiring fewer smoothness assumptions on the cost functions and\nproducing a tighter upper bound on regret.\n\n\n2    Terminology and Conventions\n\n\nWe will assume that a strategy set S ⊆ R^d is given, and that it is a compact subset of R^d.\nTime steps will be denoted by the numbers {1, 2, . . . , n}. For each t ∈ {1, 2, . . . , n} a cost\nfunction C_t : S → R is given. These cost functions must satisfy a continuity property\nbased on the following definition. A function f is uniformly locally Lipschitz with constant\nL (0 ≤ L < ∞), exponent α (0 < α ≤ 1), and restriction δ (δ > 0) if it is the case that\nfor all u, u' ∈ S with ‖u - u'‖ ≤ δ,\n                                      |f(u) - f(u')| ≤ L ‖u - u'‖^α.\n(Here, ‖·‖ denotes the Euclidean norm on R^d.) The class of all such functions f will be\ndenoted by ulL(α, L, δ).\n\nWe will consider two models which may govern the cost functions. The first of these\nis identical with the continuum-armed bandit problem considered in [1], except that [1]\nformulates the problem in terms of maximizing reward rather than minimizing cost. The\nsecond model concerns a sequence of cost functions chosen by an oblivious adversary.\n\nRandom The functions C_1, . . . , C_n are independent, identically distributed random\n          samples from a probability distribution on functions C : S → R. The expected cost\n          function C̄ : S → R is defined by C̄(u) = E(C(u)) where C is a random sample\n          from this distribution. This function C̄ is required to belong to ulL(α, L, δ) for\n          some specified α, L, δ. 
In addition, we assume there exist positive constants σ, s_0\n          such that if C is a random sample from the given distribution on cost functions,\n          then\n                                     E(e^{sC(u)}) ≤ e^{σ^2 s^2 / 2}       ∀ |s| ≤ s_0, u ∈ S.\n          The \"best strategy\" u* is defined to be any element of arg min_{u∈S} C̄(u). (This\n          set is non-empty, by the compactness of S.)\nAdversarial The functions C_1, . . . , C_n are a fixed sequence of functions in ulL(α, L, δ),\n          taking values in [0, 1]. The \"best strategy\" u* is defined to be any element of\n          arg min_{u∈S} Σ_{t=1}^{n} C_t(u). (Again, this set is non-empty by compactness.)\n\nA multi-armed bandit algorithm is a rule for deciding which strategy to play at time t, given\nthe outcomes of the first t - 1 trials. More formally, a deterministic multi-armed bandit\nalgorithm U is a sequence of functions U_1, U_2, . . . such that U_t : (S × R)^{t-1} → S. The\ninterpretation is that U_t(u_1, x_1, u_2, x_2, . . . , u_{t-1}, x_{t-1}) defines the strategy to be chosen at\ntime t if the algorithm's first t - 1 choices were u_1, . . . , u_{t-1} respectively, and their costs\nwere x_1, . . . , x_{t-1} respectively. A randomized multi-armed bandit algorithm is a\nprobability distribution over deterministic multi-armed bandit algorithms. (If the cost functions\nare random, we will assume their randomness is independent of the algorithm's random\nchoices.) For a randomized multi-armed bandit algorithm, the n-step regret R_n is the\nexpected difference in total cost between the algorithm's chosen strategies u_1, u_2, . . . 
, u_n and\nthe best strategy u*, i.e.\n\n                                     R_n = E[ Σ_{t=1}^{n} ( C_t(u_t) - C_t(u*) ) ].\n\nHere, the expectation is over the algorithm's random choices and (in the random-costs\nmodel) the randomness of the cost functions.\n\n\n3    Algorithms for the one-parameter case (d = 1)\n\nThe continuum-bandit algorithm presented in [1] is based on computing an estimate Ĉ of\nthe expected cost function C̄ which converges almost surely to C̄ as n → ∞. This estimate\nis obtained by devoting a small fraction of the time steps (tending to zero as n → ∞)\nto sampling the random cost functions at an approximately equally-spaced sequence of\n\"design points\" in the strategy set, and combining these samples using a kernel estimator.\nWhen the algorithm is not sampling a design point, it chooses a strategy which minimizes\nexpected cost according to the current estimate Ĉ. The convergence of Ĉ to C̄ ensures that\nthe average cost in these \"exploitation steps\" converges to the minimum value of C̄.\n\nA drawback of this approach is its emphasis on estimating the entire function C̄. 
Since the\nalgorithm's goal is to minimize cost, its estimate of C̄ need only be accurate for strategies\nwhere C̄ is near its minimum. Elsewhere a crude estimate of C̄ would have sufficed, since\nsuch strategies may safely be ignored by the algorithm. The algorithm in [1] thus uses\nits sampling steps inefficiently, focusing too much attention on portions of the strategy\ninterval where an accurate estimate of C̄ is unnecessary. We adopt a different approach\nwhich eliminates this inefficiency and also leads to a much simpler algorithm. First we\ndiscretize the strategy space by constraining the algorithm to choose strategies only from\na fixed, finite set of K equally spaced design points {1/K, 2/K, . . . , 1}. (For simplicity,\nwe are assuming here and for the rest of this section that S = [0, 1].) This reduces the\ncontinuum-armed bandit problem to a finite-armed bandit problem, and we may apply one\nof the standard algorithms for such problems. Our continuum-armed bandit algorithm is\nshown in Figure 1. The outer loop uses a standard doubling technique to transform a\nnon-uniform algorithm to a uniform one. The inner loop requires a subroutine MAB\nwhich should implement a finite-armed bandit algorithm appropriate for the cost model\nunder consideration. For example, MAB could be the algorithm UCB1 of [2] in the\nrandom case, or the algorithm Exp3 of [3] in the adversarial case. The semantics of MAB\nare as follows: it is initialized with a finite set of strategies; subsequently it recommends\nstrategies in this set, waits to learn the feedback score for its recommendation, and updates\nits recommendation when the feedback is received.\n\nThe analysis of this algorithm will ensure that its choices have low regret relative to the best\ndesign point. 
The Lipschitz regularity of C̄ guarantees that the best design point performs\nnearly as well, on average, as the best strategy in S.\n\n                 ALGORITHM CAB1\n                 T ← 1\n                 while T ≤ n\n                       K ← ⌈(T / log T)^{1/(2α+1)}⌉\n                       Initialize MAB with strategy set {1/K, 2/K, . . . , 1}.\n                       for t = T, T + 1, . . . , min(2T - 1, n)\n                          Get strategy u_t from MAB.\n                          Play u_t and discover C_t(u_t).\n                          Feed 1 - C_t(u_t) back to MAB.\n                       end\n                       T ← 2T\n                 end\n\n\n       Figure 1: Algorithm for the one-parameter continuum-armed bandit problem\n\n\nTheorem 3.1. In both the random and adversarial models, the regret of algorithm CAB1\nis O(n^{(α+1)/(2α+1)} log^{α/(2α+1)}(n)).\n\n\nProof Sketch. Let q = α/(2α+1), so that the regret bound is O(n^{1-q} log^q(n)). It suffices to\nprove that the regret in the inner loop is O(T^{1-q} log^q(T)); if so, then we may sum this\nbound over all iterations of the outer loop to get a geometric progression with constant\nratio, whose largest term is O(n^{1-q} log^q(n)). So from now on assume that T is fixed and\nthat K is defined as in Figure 1, and for simplicity renumber the T steps in this execution of\nthe inner loop so that the first is step 1 and the last is step T. Let u* be the best strategy in S,\nand let u' be the element of {1/K, 2/K, . . . , 1} nearest to u*. 
Then |u' - u*| < 1/K, so\nusing the fact that C̄ ∈ ulL(α, L, δ) (or that (1/T) Σ_{t=1}^{T} C_t ∈ ulL(α, L, δ) in the adversarial\ncase) we obtain\n\n                   E[ Σ_{t=1}^{T} ( C_t(u') - C_t(u*) ) ] = O(T / K^α) = O(T^{1-q} log^q(T)).\n\n\nIt remains to show that E[ Σ_{t=1}^{T} ( C_t(u_t) - C_t(u') ) ] = O(T^{1-q} log^q(T)). For the\nadversarial model, this follows directly from Corollary 4.2 in [3], which asserts that the regret\nof Exp3 is O(√(T K log K)). For the random model, a separate argument is required.\n(The upper bound for the adversarial model doesn't directly imply an upper bound for\nthe random model, since the cost functions are required to take values in [0, 1] in the\nadversarial model but not in the random model.) For u ∈ {1/K, 2/K, . . . , 1} let Δ(u) =\nC̄(u) - C̄(u'). Let Δ = √(K log(T)/T), and partition the set {1/K, 2/K, . . . , 1} into two\nsubsets A, B according to whether Δ(u) < Δ or Δ(u) ≥ Δ. The time steps in which the\nalgorithm chooses strategies in A contribute at most O(TΔ) = O(T^{1-q} log^q(T)) to the\nregret. For each strategy u ∈ B, one may prove that, with high probability, u is played\nonly O(log(T)/Δ(u)^2) times. (This parallels the corresponding proof in [2] and is omitted\nhere. Our hypothesis on the moment generating function of the random variable C(u) is\nstrong enough to imply the exponential tail inequality required in that proof.) 
This\nimplies that the time steps in which the algorithm chooses strategies in B contribute at most\nO(K log(T)/Δ) = O(T^{1-q} log^q(T)) to the regret, which completes the proof.\n\n\n4    Lower bounds for the one-parameter case\n\nThere are many reasons to expect that Algorithm CAB1 is an inefficient algorithm for the\ncontinuum-armed bandit problem. Chief among these is the fact that it treats the strategies\n{1/K, 2/K, . . . , 1} as an unordered set, ignoring the fact that experiments which sample\nthe cost of one strategy j/K are (at least weakly) predictive of the costs of nearby strategies.\nIn this section we prove that, contrary to this intuition, CAB1 is in fact quite close to the\noptimal algorithm. Specifically, in the regret bound of Theorem 3.1, the exponent (α+1)/(2α+1)\nis the best possible: for any β < (α+1)/(2α+1), no algorithm can achieve regret O(n^β). This lower\nbound applies to both the random and adversarial models.\n\nThe lower bound relies on a function f : [0, 1] → [0, 1] defined as the sum of a nested\nfamily of \"bump functions.\" Let B be a C^∞ bump function defined on the real line, satisfying\n0 ≤ B(x) ≤ 1 for all x, B(x) = 0 if x ≤ 0 or x ≥ 1, and B(x) = 1 if x ∈ [1/3, 2/3]. For\nan interval [a, b], let B_{[a,b]} denote the bump function B((x - a)/(b - a)), i.e. the function B rescaled\nand shifted so that its support is [a, b] instead of [0, 1]. Define a random nested sequence\nof intervals [0, 1] = [a_0, b_0] ⊇ [a_1, b_1] ⊇ . . . as follows: for k > 0, the middle third of\n[a_{k-1}, b_{k-1}] is subdivided into intervals of width w_k = 3^{-k!}, and [a_k, b_k] is one of these\nsubintervals chosen uniformly at random. 
Now let\n\n                         f(x) = 1/3 + (3^{α-1} - 1/3) Σ_{k=1}^{∞} w_k^α B_{[a_k,b_k]}(x).\n\nFinally, define a probability distribution on functions C : [0, 1] → [0, 1] by the following\nrule: sample λ uniformly at random from the open interval (0, 1) and put C(x) = λ^{f(x)}.\n\nThe relevant technical properties of this construction are summarized in the following\nlemma.\n\nLemma 4.1. Let {u*} = ∩_{k=1}^{∞} [a_k, b_k]. The function f(x) belongs to ulL(α, L, δ) for\nsome constants L, δ, it takes values in [1/3, 2/3], and it is uniquely maximized at u*. For\neach λ ∈ (0, 1), the function C(x) = λ^{f(x)} belongs to ulL(α, L, δ) for some constants\nL, δ, and is uniquely minimized at u*. The same two properties are satisfied by the function\nC̄(x) = E_{λ∈(0,1)} λ^{f(x)} = (1 + f(x))^{-1}.\nTheorem 4.2. For any randomized multi-armed bandit algorithm, there exists a probability\ndistribution on cost functions such that for all β < (α+1)/(2α+1), the algorithm's regret {R_n}_{n=1}^{∞}\nin the random model satisfies\n                                            lim sup_{n→∞} R_n / n^β = ∞.\n\nThe same lower bound applies in the adversarial model.\n\n\nProof sketch. The idea is to prove, using the probabilistic method, that there exists a nested\nsequence of intervals [0, 1] = [a_0, b_0] ⊇ [a_1, b_1] ⊇ . . ., such that if we use these intervals\nto define a probability distribution on cost functions C(x) as above, then R_n/n^β diverges\nas n runs through the sequence n_1, n_2, n_3, . . . 
defined by n_k = ⌈(1/k) (w_{k-1}/w_k) w_k^{-2α}⌉.\nAssume that intervals [a_0, b_0] ⊇ . . . ⊇ [a_{k-1}, b_{k-1}] have already been specified. Subdivide\n[a_{k-1}, b_{k-1}] into subintervals of width w_k, and suppose [a_k, b_k] is chosen uniformly at\nrandom from this set of subintervals. For any u, u' ∈ [a_{k-1}, b_{k-1}], the Kullback-Leibler\ndistance KL(C(u)‖C(u')) between the cost distributions at u and u' is O(w_k^{2α}), and it is\nequal to zero unless at least one of u, u' lies in [a_k, b_k]. This means, roughly speaking,\nthat the algorithm must sample strategies in [a_k, b_k] at least w_k^{-2α} times before being able\nto identify [a_k, b_k] with constant probability. But [a_k, b_k] could be any one of w_{k-1}/w_k\npossible subintervals, and we don't have enough time to play w_k^{-2α} trials in even a constant\nfraction of these subintervals before reaching time n_k. Therefore, with constant probability,\na constant fraction of the strategies chosen up to time n_k are not located in [a_k, b_k], and\neach of them contributes Ω(w_k^α) to the regret. This means the expected regret at time n_k is\nΩ(n_k w_k^α). 
From this, we obtain the stated lower bound using the fact that\n\n                                            n_k w_k^α = n_k^{(α+1)/(2α+1) - o(1)}.\n\nAlthough this proof sketch rests on a much more complicated construction than the lower\nbound proof for the finite-armed bandit problem given by Auer et al. in [3], one may follow\nessentially the same series of steps as in their proof to make the sketch given above into\na rigorous proof. The only significant technical difference is that we are working with\ncontinuous-valued rather than discrete-valued random variables, which necessitates using\nthe differential Kullback-Leibler distance^1 rather than working with the discrete\nKullback-Leibler distance as in [3].\n\n\n5     An online convex optimization algorithm\n\nWe turn now to continuum-armed bandit problems with a strategy space of dimension\nd > 1. As mentioned in the introduction, for any randomized multi-armed bandit\nalgorithm there is a cost function C (with any desired degree of smoothness and\nboundedness) such that the algorithm's regret is Ω(2^d) when faced with the input sequence\nC_1 = C_2 = . . . = C_n = C. As a counterpoint to this negative result, we seek interesting\nclasses of cost functions which admit a continuum-armed bandit algorithm whose regret is\npolynomial in d (and, as always, sublinear in n). 
A natural candidate is the class of convex,\nsmooth functions on a closed, bounded, convex strategy set S ⊆ R^d, since this is the most\ngeneral class of functions for which the corresponding best-expert problem is known to\nadmit an efficient algorithm, namely Zinkevich's greedy projection algorithm [21]. Greedy\nprojection is initialized with a sequence of learning rates η_1 > η_2 > . . .. It selects an\narbitrary initial strategy u_1 ∈ S and updates its strategy in each subsequent time step t\naccording to the rule u_{t+1} = P(u_t - η_t ∇C_t(u_t)), where ∇C_t(u_t) is the gradient of C_t at\nu_t and P : R^d → S is the projection operator which maps each point of R^d to the nearest\npoint of S. (Here, distance is measured according to the Euclidean norm.)\n\n     1 Defined by the formula KL(P‖Q) = ∫ log(p(x)/q(x)) dP(x), for probability distributions\nP, Q with density functions p, q.\n\nNote that greedy projection is nearly a multi-armed bandit algorithm: if the algorithm's\nfeedback when sampling strategy u_t were the vector ∇C_t(u_t) rather than the number\nC_t(u_t), it would have all the information required to run greedy projection. To adapt this\nalgorithm to the multi-armed bandit setting, we use the following idea: group the timeline\ninto phases of d + 1 consecutive steps, with a cost function C_φ for each phase φ defined by\naveraging the cost functions at each time step of φ. In each phase use trials at d + 1 affinely\nindependent points of S, located at or near u_t, to estimate the gradient ∇C_φ(u_t).^2\nTo describe the algorithm, it helps to assume that the convex set S is in isotropic position in\nR^d. (If not, we may bring it into isotropic position by an affine transformation of the\ncoordinate system. This does not increase the regret by a factor of more than d^2.) The algorithm,\nwhich we will call simulated greedy projection, works as follows. It is initialized with a\nsequence of \"learning rates\" η_1, η_2, . . . and \"frame sizes\" ν_1, ν_2, . . .. 
At the beginning of a\nphase φ, we assume the algorithm has determined a basepoint strategy u_φ. (An arbitrary\nu_φ may be used in the first phase.) The algorithm chooses a set of (d + 1) affinely\nindependent points {x_0 = u_φ, x_1, x_2, . . . , x_d} with the property that for any y ∈ S, the difference\ny - x_0 may be expressed as a linear combination of the vectors {x_i - x_0 : 1 ≤ i ≤ d}\nusing coefficients in [-2, 2]. (Such a set is called an approximate barycentric spanner, and\nmay be computed efficiently using an algorithm specified in [4].) We then choose a random\nbijection σ mapping the time steps in phase φ into the set {0, 1, . . . , d}, and in step t we\nsample the strategy y_t = u_φ + ν_φ(x_{σ(t)} - u_φ). At the end of the phase we let B_φ denote the\nunique affine function whose values at the points y_t are equal to the costs observed during\nthe phase at those points. The basepoint for the next phase φ' is determined according to\nZinkevich's update rule u_{φ'} = P(u_φ - η_φ ∇B_φ(u_φ)).^3\nTheorem 5.1. Assume that S is in isotropic position and that the cost functions satisfy\n‖∇C_t(x)‖ ≤ 1 for all x ∈ S, 1 ≤ t ≤ n, and that in addition the Hessian matrix of C_t(x) at\neach point x ∈ S has Frobenius norm bounded above by a constant. If η_k = k^{-3/4} and\nν_k = k^{-1/4}, then the regret of the simulated greedy projection algorithm is O(d^3 n^{3/4}).\n\nProof sketch. In each phase φ, let Y_φ = {y_0, . . . , y_d} be the set of points which were\nsampled, and define the following four functions: C_φ, the average of the cost functions in\nphase φ; Λ_φ, the linearization of C_φ at u_φ, defined by the formula\n\n                           Λ_φ(x) = ∇C_φ(u_φ) · (x - u_φ) + C_φ(u_φ);\nL_φ, the unique affine function which agrees with C_φ at each point of Y_φ; and B_φ, the affine\nfunction computed by the algorithm at the end of phase φ. The algorithm is simply\nrunning greedy projection with respect to the simulated cost functions B_φ, and it consequently\nsatisfies a low-regret bound with respect to those functions. The expected value of B_φ(u)\nis L_φ(u) for every u. 
(Proof: both are affine functions, and they agree on every point of\nY_φ.) Hence we obtain a low-regret bound with respect to L_φ. To transfer this over to a\nlow-regret bound for the original problem, we need to bound several additional terms: the regret\nexperienced because the algorithm was using u_φ + ν_φ(x_{σ(t)} - u_φ) instead of u_φ, the\ndifference between L_φ(u) and Λ_φ(u), and the difference between Λ_φ(u) and C_φ(u). In\neach case, the desired upper bound can be inferred from properties of barycentric spanners,\nor from the convexity of C_φ and the bounds on its first and second derivatives.\n\n   2 Flaxman, Kalai, and McMahan [12], with characteristic elegance, supply an algorithm which\ncounterintuitively obtains an unbiased estimate of the approximate gradient using only a single\nsample. Thus they avoid grouping the timeline into phases and improve the algorithm's convergence time\nby a factor of d.\n   3 Readers familiar with Kiefer-Wolfowitz stochastic approximation [17] will note the similarity\nwith our algorithm. The random bijection σ -- which is unnecessary in the Kiefer-Wolfowitz\nalgorithm -- is used here to defend against the oblivious adversary.\n\n\nReferences\n\n [1] R. AGRAWAL. The continuum-armed bandit problem. SIAM J. Control and Optimization,\n     33:1926-1951, 1995.\n\n [2] P. AUER, N. CESA-BIANCHI, AND P. FISCHER. Finite-time analysis of the multi-armed bandit\n     problem. Machine Learning, 47:235-256, 2002.\n\n [3] P. AUER, N. CESA-BIANCHI, Y. FREUND, AND R. SCHAPIRE. Gambling in a rigged casino:\n     The adversarial multi-armed bandit problem. In Proceedings of FOCS 1995.\n\n [4] B. AWERBUCH AND R. KLEINBERG. Near-Optimal Adaptive Routing: Shortest Paths and\n     Geometric Generalizations. In Proceedings of STOC 2004.\n\n [5] N. BANSAL, A. BLUM, S. CHAWLA, AND A. MEYERSON. Online oblivious routing. In\n     Proceedings of SPAA 2003: 44-49.\n\n [6] A. BLUM, C. BURCH, AND A. KALAI. Finely-competitive paging. 
In Proceedings of FOCS\n     1999.\n\n [7] A. BLUM, S. CHAWLA, AND A. KALAI. Static Optimality and Dynamic Search-Optimality\n     in Lists and Trees. Algorithmica 36(3): 249-260 (2003).\n\n [8] A. BLUM, V. KUMAR, A. RUDRA, AND F. WU. Online learning in online auctions. In Pro-\n     ceedings of SODA 2003.\n\n [9] A. BLUM AND H. B. MCMAHAN. Online geometric optimization in the bandit setting against\n     an adaptive adversary. In Proceedings of COLT 2004.\n\n[10] D. BERRY AND L. PEARSON. Optimal Designs for Two-Stage Clinical Trials with Dichoto-\n     mous Responses. Statistics in Medicine 4:487 - 508, 1985.\n\n[11] E. COPE. Regret and Convergence Bounds for Immediate-Reward Reinforcement Learning\n     with Continuous Action Spaces. Preprint, 2004.\n\n[12] A. FLAXMAN, A. KALAI, AND H. B. MCMAHAN. Online Convex Optimization in the Bandit\n     Setting: Gradient Descent Without a Gradient. To appear in Proceedings of SODA 2005.\n\n[13] Y. FREUND AND R. SCHAPIRE. Adaptive Game Playing Using Multiplicative Weights. Games\n     and Economic Behavior 29:79-103, 1999.\n\n[14] R. GRAMACY, M. WARMUTH, S. BRANDT, AND I. ARI. Adaptive Caching by Refetching. In\n     Advances in Neural Information Processing Systems 15, 2003.\n\n[15] R. KLEINBERG AND T. LEIGHTON. The Value of Knowing a Demand Curve: Bounds on\n     Regret for On-Line Posted-Price Auctions. In Proceedings of FOCS 2003.\n\n[16] A. KALAI AND S. VEMPALA. Efficient algorithms for the online decision problem. In Pro-\n     ceedings of COLT 2003.\n\n[17] J. KIEFER AND J. WOLFOWITZ. Stochastic Estimation of the Maximum of a Regression Func-\n     tion. Annals of Mathematical Statistics 23:462-466, 1952.\n\n[18] T. L. LAI AND H. ROBBINS. Asymptotically efficient adaptive allocations rules. Adv. in Appl.\n     Math. 6:4-22, 1985.\n\n[19] C. MONTELEONI AND T. JAAKKOLA. Online Learning of Non-stationary Sequences. In Ad-\n     vances in Neural Information Processing Systems 16, 2004.\n\n[20] M. ROTHSCHILD. 
A Two-Armed Bandit Theory of Market Pricing. Journal of Economic The-\n     ory 9:185-202, 1974.\n\n[21] M. ZINKEVICH. Online Convex Programming and Generalized Infinitesimal Gradient Ascent.\n     In Proceedings of ICML 2003, 928-936.\n\n\f\n", "award": [], "sourceid": 2634, "authors": [{"given_name": "Robert", "family_name": "Kleinberg", "institution": null}]}