Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Robert Kleinberg*

In the multi-armed bandit problem, an online algorithm must choose from a set of strategies in a sequence of n trials so as to minimize the total cost of the chosen strategies. While nearly tight upper and lower bounds are known in the case when the strategy set is finite, much less is known when there is an infinite strategy set. Here we consider the case when the set of strategies is a subset of Rd, and the cost functions are continuous. In the d = 1 case, we improve on the best-known upper and lower bounds, closing the gap to a sublogarithmic factor. We also con- sider the case where d > 1 and the cost functions are convex, adapting a recent online convex optimization algorithm of Zinkevich to the sparser feedback model of the multi-armed bandit problem.

1 Introduction

In an online decision problem, an algorithm must choose from among a set of strategies in each of n consecutive trials so as to minimize the total cost of the chosen strategies. The costs of strategies are specified by a real-valued function which is defined on the entire strategy set and which varies over time in a manner initially unknown to the algorithm. The archetypical online decision problems are the best expert problem, in which the entire cost function is revealed to the algorithm as feedback at the end of each trial, and the multi- armed bandit problem, in which the feedback reveals only the cost of the chosen strategy. The names of the two problems are derived from the metaphors of combining expert advice (in the case of the best expert problem) and learning to play the best slot machine in a casino (in the case of the multi-armed bandit problem).

The applications of online decision problems are too numerous to be listed here. In ad- dition to occupying a central position in online learning theory, algorithms for such prob- lems have been applied in numerous other areas of computer science, such as paging and caching [6, 14], data structures [7], routing [4, 5], wireless networks [19], and online auc- tion mechanisms [8, 15]. Algorithms for online decision problems are also applied in a broad range of fields outside computer science, including statistics (sequential design of experiments [18]), economics (pricing [20]), game theory (adaptive game playing [13]), and medical decision making (optimal design of clinical trials [10]).

Multi-armed bandit problems have been studied quite thoroughly in the case of a finite strategy set, and the performance of the optimal algorithm (as a function of n) is known

```
M.I.T. CSAIL, Cambridge, MA 02139. Email: rdk@csail.mit.edu. Supported by a Fannie and John Hertz Foundation Fellowship.
```

up to a constant factor [3, 18]. In contrast, much less is known in the case of an infinite strategy set. In this paper, we consider multi-armed bandit problems with a continuum of strategies, parameterized by one or more real numbers. In other words, we are studying online learning problems in which the learner designates a strategy in each time step by specifying a d-tuple of real numbers (x1, . . . , xd); the cost function is then evaluated at (x1, . . . , xd) and this number is reported to the algorithm as feedback. Recent progress on such problems has been spurred by the discovery of new algorithms (e.g. [4, 9, 16, 21]) as well as compelling applications. Two such applications are online auction mechanism design [8, 15], in which the strategy space is an interval of feasible prices, and online oblivious routing [5], in which the strategy space is a flow polytope.

Algorithms for online decisions problems are often evaluated in terms of their regret, de- fined as the difference in expected cost between the sequence of strategies chosen by the algorithm and the best fixed (i.e. not time-varying) strategy. While tight upper and lower bounds on the regret of algorithms for the K-armed bandit problem have been known for many years [3, 18], our knowledge of such bounds for continuum-armed bandit prob- lems is much less satisfactory. For a one-dimensional strategy space, the first algorithm with sublinear regret appeared in [1], while the first polynomial lower bound on regret ap- peared in [15]. For Lipschitz-continuous cost functions (the case introduced in [1]), the best known upper and lower bounds for this problem are currently O(n3/4) and (n1/2), respectively [1, 15], leaving as an open question the problem of determining tight bounds for the regret as a function of n. Here, we solve this open problem by sharpening the up- per and lower bounds to O(n2/3 log1/3(n)) and (n2/3), respectively, closing the gap to a sublogarithmic factor. Note that this requires improving the best known algorithm as well as the lower bound technique.

Recently, and independently, Eric Cope [11] considered a class of cost functions obeying a more restrictive condition on the shape of the function near its optimum, and for such functions he obtained a sharper bound on regret than the bound proved here for uniformly locally Lipschitz cost functions. Cope requires that each cost function C achieves its op- timum at a unique point , and that there exist constants K0 > 0 and p 1 such that for all x, |C(x) - C()| K0 x - p. For this class of cost functions -- which is probably broad enough to capture most cases of practical interest -- he proves that the regret of the optimal continuum-armed bandit algorithm is O(n-1/2), and that this bound is tight.

For a d-dimensional strategy space, any multi-armed bandit algorithm must suffer regret depending exponentially on d unless the cost functions are further constrained. (This is demonstrated by a simple counterexample in which the cost function is identically zero in all but one orthant of Rd, takes a negative value somewhere in that orthant, and does not vary over time.) For the best-expert problem, algorithms whose regret is polynomial in d and sublinear in n are known for the case of cost functions which are constrained to be linear [16] or convex [21]. In the case of linear cost functions, the relevant algorithm has been adapted to the multi-armed bandit setting in [4, 9]. Here we adapt the online convex programming algorithm of [21] to the continuum-armed bandit setting, obtaining the first known algorithm for this problem to achieve regret depending polynomially on d and sublinearly on n. A remarkably similar algorithm was discovered independently and simultaneously by Flaxman, Kalai, and McMahan [12]. Their algorithm and analysis are superior to ours, requiring fewer smoothness assumptions on the cost functions and producing a tighter upper bound on regret.

2 Terminology and Conventions

We will assume that a strategy set S Rd is given, and that it is a compact subset of Rd. Time steps will be denoted by the numbers {1, 2, . . . , n}. For each t {1, 2, . . . , n} a cost

function Ct : S R is given. These cost functions must satisfy a continuity property based on the following definition. A function f is uniformly locally Lipschitz with constant L (0 L < ), exponent (0 < 1), and restriction ( > 0) if it is the case that for all u, u S with u - u , |f(u) - f(u )| L u - u . (Here, denotes the Euclidean norm on Rd.) The class of all such functions f will be denoted by ulL(, L, ).

We will consider two models which may govern the cost functions. The first of these is identical with the continuum-armed bandit problem considered in [1], except that [1] formulates the problem in terms of maximizing reward rather than minimizing cost. The second model concerns a sequence of cost functions chosen by an oblivious adversary.

Random The functions C1, . . . , Cn are independent, identically distributed random sam- ples from a probability distribution on functions C : S R. The expected cost function C : S R is defined by C(u) = E(C(u)) where C is a random sample from this distribution. This function C is required to belong to ulL(, L, ) for some specified , L, . In addition, we assume there exist positive constants , s0 such that if C is a random sample from the given distribution on cost functions, then 1 E(esC(u)) e 2s2 2 |s| s0,u S. The "best strategy" u is defined to be any element of arg min uS C (u). (This set is non-empty, by the compactness of S.) Adversarial The functions C1, . . . , Cn are a fixed sequence of functions in ulL(, L, ), taking values in [0, 1]. The "best strategy" u is defined to be any element of arg min n uS C t=1 t(u). (Again, this set is non-empty by compactness.)

A multi-armed bandit algorithm is a rule for deciding which strategy to play at time t, given the outcomes of the first t - 1 trials. More formally, a deterministic multi-armed bandit algorithm U is a sequence of functions U1, U2, . . . such that Ut : (S R)t-1 S. The interpretation is that Ut(u1, x1, u2, x2, . . . , ut-1, xt-1) defines the strategy to be chosen at time t if the algorithm's first t - 1 choices were u1, . . . , ut-1 respectively, and their costs were x1, . . . , xt-1 respectively. A randomized multi-armed bandit algorithm is a proba- bility distribution over deterministic multi-armed bandit algorithms. (If the cost functions are random, we will assume their randomness is independent of the algorithm's random choices.) For a randomized multi-armed bandit algorithm, the n-step regret Rn is the ex- pected difference in total cost between the algorithm's chosen strategies u1, u2, . . . , un and the best strategy u, i.e.

```
n Rn = E Ct(ut) - Ct(u) . t=1
```

Here, the expectation is over the algorithm's random choices and (in the random-costs model) the randomness of the cost functions.

3 Algorithms for the one-parameter case (d = 1)

The continuum-bandit algorithm presented in [1] is based on computing an estimate ^ C of the expected cost function C which converges almost surely to C as n . This estimate is obtained by devoting a small fraction of the time steps (tending to zero as n ) to sampling the random cost functions at an approximately equally-spaced sequence of "design points" in the strategy set, and combining these samples using a kernel estimator.

When the algorithm is not sampling a design point, it chooses a strategy which minimizes expected cost according to the current estimate ^ C. The convergence of ^ C to C ensures that the average cost in these "exploitation steps" converges to the minimum value of C.

A drawback of this approach is its emphasis on estimating the entire function C. Since the algorithm's goal is to minimize cost, its estimate of C need only be accurate for strategies where C is near its minimum. Elsewhere a crude estimate of C would have sufficed, since such strategies may safely be ignored by the algorithm. The algorithm in [1] thus uses its sampling steps inefficiently, focusing too much attention on portions of the strategy interval where an accurate estimate of C is unnecessary. We adopt a different approach which eliminates this inefficiency and also leads to a much simpler algorithm. First we discretize the strategy space by constraining the algorithm to choose strategies only from a fixed, finite set of K equally spaced design points {1/K, 2/K, . . . , 1}. (For simplicity, we are assuming here and for the rest of this section that S = [0, 1].) This reduces the continuum-armed bandit problem to a finite-armed bandit problem, and we may apply one of the standard algorithms for such problems. Our continuum-armed bandit algorithm is shown in Figure 1. The outer loop uses a standard doubling technique to transform a non-uniform algorithm to a uniform one. The inner loop requires a subroutine MAB which should implement a finite-armed bandit algorithm appropriate for the cost model under consideration. For example, MAB could be the algorithm UCB1 of [2] in the random case, or the algorithm Exp3 of [3] in the adversarial case. The semantics of MAB are as follows: it is initialized with a finite set of strategies; subsequently it recommends strategies in this set, waits to learn the feedback score for its recommendation, and updates its recommendation when the feedback is received.

The analysis of this algorithm will ensure that its choices have low regret relative to the best design point. The Lipschitz regularity of C guarantees that the best design point performs nearly as well, on average, as the best strategy in S.

```
ALGORITHM CAB1 T 1 while T n 1 2+1 K T log T
Initialize MAB with strategy set {1/K, 2/K, . . . , 1}. for t = T, T + 1, . . . , min(2T - 1, n) Get strategy ut from MAB. Play ut and discover Ct(ut). Feed 1 - Ct(ut) back to MAB. end T 2T end
Figure 1: Algorithm for the one-parameter continuum-armed bandit problem
```

Theorem 3.1. In both the random and adversarial models, the regret of algorithm CAB1 +1 is O(n 2+1 log 2+1 (n)).

Proof Sketch. Let q = , so that the regret bound is O(n1-q logq(n)). It suffices to 2+1 prove that the regret in the inner loop is O(T 1-q logq(T )); if so, then we may sum this bound over all iterations of the inner loop to get a geometric progression with constant ratio, whose largest term is O(n1-q logq(n)). So from now on assume that T is fixed and that K is defined as in Figure 1, and for simplicity renumber the T steps in this iteration of

inner loop so that the first is step 1 and the last is step T . Let u be the best strategy in S, and let u be the element of {1/K, 2/K, . . . , 1} nearest to u. Then T |u - u| < 1/K, so using the fact that C ulL(,L,) (or that 1 C T t=1 t ulL(, L, ) in the adversarial case) we obtain

```
T T E Ct(u ) - Ct(u) = O T 1-q logq(T ) . K t=1
```

It remains to show that E T C t=1 t(ut) - Ct(u ) = O T 1-q logq(T ) . For the adver- sarial model, this follows directly from Corollary 4.2 in [3], which asserts that the regret of Exp3 is O T K log K . For the random model, a separate argument is required. (The upper bound for the adversarial model doesn't directly imply an upper bound for the random model, since the cost functions are required to take values in [0, 1] in the ad- versarial model but not in the random model.) For u {1/K, 2/K, . . . , 1} let (u) =

C(u) - C(u ). Let = K log(T)/T, and partition the set {1/K,2/K,... ,1} into two subsets A, B according to whether (u) < or (u) . The time steps in which the algorithm chooses strategies in A contribute at most O(T ) = O(T 1-q logq(T )) to the regret. For each strategy u B, one may prove that, with high probability, u is played only O(log(T )/(u)2) times. (This parallels the corresponding proof in [2] and is omitted here. Our hypothesis on the moment generating function of the random variable C(u) is strong enough to imply the exponential tail inequality required in that proof.) This im- plies that the time steps in which the algorithm chooses strategies in B contribute at most O(K log(T )/) = O(T 1-q logq(T )) to the regret, which completes the proof.

4 Lower bounds for the one-parameter case

There are many reasons to expect that Algorithm CAB1 is an inefficient algorithm for the continuum-armed bandit problem. Chief among these is that fact that it treats the strategies {1/K,2/K,... ,1} as an unordered set, ignoring the fact that experiments which sample the cost of one strategy j/K are (at least weakly) predictive of the costs of nearby strategies. In this section we prove that, contrary to this intuition, CAB1 is in fact quite close to the optimal algorithm. Specifically, in the regret bound of Theorem 3.1, the exponent of +1 2+1 is the best possible: for any < +1 , no algorithm can achieve regret O(n). This lower 2+1 bound applies to both the randomized and adversarial models.

The lower bound relies on a function f : [0, 1] [0, 1] defined as the sum of a nested fam- ily of "bump functions." Let B be a C bump function defined on the real line, satisfying 0 B(x) 1 for all x, B(x) = 0 if x 0 or x 1, and B(x) = 1 if x [1/3,2/3]. For an interval [a, b], let B[a,b] denote the bump function B( x-a ), i.e. the function B rescaled b-a and shifted so that its support is [a, b] instead of [0, 1]. Define a random nested sequence of intervals [0, 1] = [a0, b0] [a1, b1] . . . as follows: for k > 0, the middle third of [ak-1, bk-1] is subdivided into intervals of width wk = 3-k!, and [ak, bk] is one of these subintervals chosen uniformly at random. Now let

```
f (x) = 1/3 + 3-1 - 1/3 w k B[ak,bk](x). k=1
```

Finally, define a probability distribution on functions C : [0, 1] [0, 1] by the following rule: sample uniformly at random from the open interval (0, 1) and put C(x) = f(x).

The relevant technical properties of this construction are summarized in the following lemma.

Lemma 4.1. Let {u} = [a k=1 k, bk]. The function f (x) belongs to ulL(, L, ) for some constants L, , it takes values in [1/3, 2/3], and it is uniquely maximized at u. For each (0, 1), the function C(x) = f(x) belongs to ulL(, L, ) for some constants L, , and is uniquely minimized at u. The same two properties are satisfied by the function

C(x) = E(0,1) f(x) = (1 + f(x))-1. Theorem 4.2. For any randomized multi-armed bandit algorithm, there exists a probability distribution on cost functions such that for all < +1 , the algorithm's regret 2+1 {Rn}n=1 in the random model satisfies R lim sup n = . n n

The same lower bound applies in the adversarial model.

Proof sketch. The idea is to prove, using the probabilistic method, that there exists a nested sequence of intervals [0, 1] = [a0, b0] [a1, b1] . . ., such that if we use these intervals to define a probability distribution on cost functions C(x) as above, then Rn/n diverges as n runs through the sequence n1, n2, n3, . . . defined by nk = 1 (w . k k-1/wk)w-2 k Assume that intervals [a0, b0] . . . [ak-1, bk-1] have already been specified. Subdivide [ak-1, bk-1] into subintervals of width wk, and suppose [ak, bk] is chosen uniformly at random from this set of subintervals. For any u, u [ak-1, bk-1], the Kullback-Leibler distance KL(C(u) C(u )) between the cost distributions at u and u is O(w2) k , and it is equal to zero unless at least one of u, u lies in [ak, bk]. This means, roughly speaking, that the algorithm must sample strategies in [ak, bk] at least w-2 times before being able k to identify [ak, bk] with constant probability. But [ak, bk] could be any one of wk-1/wk possible subintervals, and we don't have enough time to play w-2 trials in even a constant k fraction of these subintervals before reaching time nk. Therefore, with constant probability, a constant fraction of the strategies chosen up to time nk are not located in [ak, bk], and each of them contributes (w) k to the regret. This means the expected regret at time nk is (nkw) k . From this, we obtain the stated lower bound using the fact that

```
+1 -o(1) n 2+1 kw k = n . k
```

Although this proof sketch rests on a much more complicated construction than the lower bound proof for the finite-armed bandit problem given by Auer et al in [3], one may follow essentially the same series of steps as in their proof to make the sketch given above into a rigorous proof. The only significant technical difference is that we are working with continuous-valued rather than discrete-valued random variables, which necessitates using the differential Kullback-Leibler distance1 rather than working with the discrete Kullback- Leibler distance as in [3].

5 An online convex optimization algorithm

We turn now to continuum-armed bandit problems with a strategy space of dimension d > 1. As mentioned in the introduction, for any randomized multi-armed bandit al- gorithm there is a cost function C (with any desired degree of smoothness and bound- edness) such that the algorithm's regret is (2d) when faced with the input sequence C1 = C2 = . . . = Cn = C. As a counterpoint to this negative result, we seek interesting classes of cost functions which admit a continuum-armed bandit algorithm whose regret is polynomial in d (and, as always, sublinear in n). A natural candidate is the class of convex, smooth functions on a closed, bounded, convex strategy set S Rd, since this is the most 1Defined by the formula KL(P Q) = R log (p(x)/q(x)) dp(x), for probability distributions P, Q with density functions p, q.

general class of functions for which the corresponding best-expert problem is known to admit an efficient algorithm, namely Zinkevich's greedy projection algorithm [21]. Greedy projection is initialized with a sequence of learning rates 1 > 2 > . . .. It selects an arbitrary initial strategy u1 S and updates its strategy in each subsequent time step t according to the rule ut+1 = P (ut - t Ct(ut)), where Ct(ut) is the gradient of Ct at ut and P : Rd S is the projection operator which maps each point of Rd to the nearest point of S. (Here, distance is measured according to the Euclidean norm.) Note that greedy projection is nearly a multi-armed bandit algorithm: if the algorithm's feedback when sampling strategy ut were the vector Ct(ut) rather than the number Ct(ut), it would have all the information required to run greedy projection. To adapt this algorithm to the multi-armed bandit setting, we use the following idea: group the timeline into phases of d + 1 consecutive steps, with a cost function C for each phase defined by averaging the cost functions at each time step of . In each phase use trials at d + 1 affinely independent points of S, located at or near ut, to estimate the gradient C(ut).2 To describe the algorithm, it helps to assume that the convex set S is in isotropic position in Rd. (If not, we may bring it into isotropic position by an affine transformation of the coordi- nate system. This does not increase the regret by a factor of more than d2.) The algorithm, which we will call simulated greedy projection, works as follows. It is initialized with a sequence of "learning rates" 1, 2, . . . and "frame sizes" 1, 2, . . .. At the beginning of a phase , we assume the algorithm has determined a basepoint strategy u. (An arbitrary u may be used in the first phase.) The algorithm chooses a set of (d + 1) affinely indepen- dent points {x0 = u, x1, x2, . . . , xd} with the property that for any y S, the difference y - x0 may be expressed as a linear combination of the vectors {xi - x0 : 1 i d} using coefficients in [-2, 2]. (Such a set is called an approximate barycentric spanner, and may computed efficiently using an algorithm specified in [4].) We then choose a random bijection mapping the time steps in phase into the set {0, 1, . . . , d}, and in step t we sample the strategy yt = u + (x(t) -u). At the end of the phase we let B denote the unique affine function whose values at the points yt are equal to the costs observed during the phase at those points. The basepoint for the next phase is determined according to Zinkevich's update rule u = P (u - B(u)).3 Theorem 5.1. Assume that S is in isotropic position and that the cost functions satisfy Ct(x) 1 for all x S,1tn, and that in addition the Hessian matrix of Ct(x) at each point x S has Frobenius norm bounded above by a constant. If k = k-3/4 and k = k-1/4, then the regret of the simulated greedy projection algorithm is O(d3n3/4).

Proof sketch. In each phase , let Y = {y0, . . . , yd} be the set of points which were sampled, and define the following four functions: C, the average of the cost functions in phase ; , the linearization of C at u, defined by the formula

```
(x) = C(u) (x - u) + C(u); L, the unique affine function which agrees with C at each point of Y; and B, the affine function computed by the algorithm at the end of phase . The algorithm is simply run- ning greedy projection with respect to the simulated cost functions B, and it consequently satisfies a low-regret bound with respect to those functions. The expected value of B(u) is L(u) for every u. (Proof: both are affine functions, and they agree on every point of
```

2Flaxman, Kalai, and McMahan [12], with characteristic elegance, supply an algorithm which counterintuitively obtains an unbiased estimate of the approximate gradient using only a single sam- ple. Thus they avoid grouping the timeline into phases and improve the algorithm's convergence time by a factor of d. 3Readers familiar with Kiefer-Wolfowitz stochastic approximation [17] will note the similarity with our algorithm. The random bijection -- which is unnecessary in the Kiefer-Wolfowitz algo- rithm -- is used here to defend against the oblivious adversary.

Y.) Hence we obtain a low-regret bound with respect to L. To transfer this over to a low- regret bound for the original problem, we need to bound several additional terms: the regret experienced because the algorithm was using u + (x(t) - u) instead of u, the dif- ference between L(u) and (u), and the difference between (u) and C(u). In each case, the desired upper bound can be inferred from properties of barycentric spanners, or from the convexity of C and the bounds on its first and second derivatives.

Do not remove: This comment is monitored to verify that the site is working properly