{"title": "Mortal Multi-Armed Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 273, "page_last": 280, "abstract": "We formulate and study a new variant of the $k$-armed bandit problem, motivated by e-commerce applications. In our model, arms have (stochastic) lifetime after which they expire. In this setting an algorithm needs to continuously explore new arms, in contrast to the standard $k$-armed bandit model in which arms are available indefinitely and exploration is reduced once an optimal arm is identified with near-certainty. The main motivation for our setting is online-advertising, where ads have limited lifetime due to, for example, the nature of their content and their campaign budget. An algorithm needs to choose among a large collection of ads, more than can be fully explored within the ads' lifetime. We present an optimal algorithm for the state-aware (deterministic reward function) case, and build on this technique to obtain an algorithm for the state-oblivious (stochastic reward function) case. Empirical studies on various reward distributions, including one derived from a real-world ad serving application, show that the proposed algorithms significantly outperform the standard multi-armed bandit approaches applied to these settings.", "full_text": "Mortal Multi-Armed Bandits\n\nDeepayan Chakrabarti\n\nYahoo! Research\n\nSunnyvale, CA 94089\n\ndeepay@yahoo-inc.com\n\nRavi Kumar\n\nYahoo! Research\n\nSunnyvale, CA 94089\n\nravikumar@yahoo-inc.com\n\nFilip Radlinski\u2217\nMicrosoft Research\n\nCambridge, UK\n\nfiliprad@microsoft.com\n\nEli Upfal\u2020\n\nBrown University\n\nProvidence, RI 02912\neli@cs.brown.edu\n\nAbstract\n\nWe formulate and study a new variant of the k-armed bandit problem, motivated\nby e-commerce applications. In our model, arms have (stochastic) lifetime after\nwhich they expire. 
In this setting an algorithm needs to continuously explore new arms, in contrast to the standard k-armed bandit model in which arms are available indefinitely and exploration is reduced once an optimal arm is identified with near-certainty. The main motivation for our setting is online advertising, where ads have limited lifetime due to, for example, the nature of their content and their campaign budgets. An algorithm needs to choose among a large collection of ads, more than can be fully explored within the typical ad lifetime.
We present an optimal algorithm for the state-aware (deterministic reward function) case, and build on this technique to obtain an algorithm for the state-oblivious (stochastic reward function) case. Empirical studies on various reward distributions, including one derived from a real-world ad serving application, show that the proposed algorithms significantly outperform the standard multi-armed bandit approaches applied to these settings.

1 Introduction

Online advertisements (ads) are a rapidly growing source of income for many Internet content providers. The content providers and the ad brokers who match ads to content are paid only when ads are clicked; this is commonly referred to as the pay-per-click model. In this setting, the goal of the ad brokers is to select ads to display from a large corpus, so as to generate the most ad clicks and revenue. The selection problem involves a natural exploration vs. exploitation tradeoff: balancing exploration for ads with better click rates against exploitation of the best ads found so far.
Following [17, 16], we model the ad selection task as a multi-armed bandit problem [5]. A multi-armed bandit models a casino with k slot machines (one-armed bandits), where each machine (arm) has a different and unknown expected payoff. 
The goal is to sequentially select slot machines to play (i.e., arms to pull) so as to maximize the expected total reward. Considering each ad as a slot machine that may or may not provide a reward when presented to users allows any multi-armed bandit strategy to be used for the ad selection problem.

*Most of this work was done while the author was at Yahoo! Research.
†Part of this work was done while the author was visiting the Department of Information Engineering at the University of Padova, Italy, supported in part by the FP6 EC/IST Project 15964 AEOLUS. Work supported in part by NSF award DMI-0600384 and ONR Award N000140610607.

A standard assumption in the multi-armed bandit setting, however, is that each arm exists perpetually. Although the payoff function of an arm is allowed to evolve over time, the evolution is assumed to be slow. Ads, on the other hand, are regularly created while others are removed from circulation. This occurs as advertisers' budgets run out, when advertising campaigns change, when holiday shopping seasons end, and due to other factors beyond the control of the ad selection system. The advertising problem is even more challenging as the set of available ads is often huge (in the tens of millions), while standard multi-armed bandit strategies converge only slowly and require time linear in the number of available options.
In this paper we initiate the study of a rapidly changing variant of the multi-armed bandit problem. We call it the mortal multi-armed bandit problem since ads (or, equivalently, available bandit arms) are assumed to be born and die regularly. In particular, we will show that while the standard multi-armed bandit setting allows for algorithms that deviate from the optimal total payoff by only O(ln t) [21], in the mortal-arm setting a regret of Ω(t) is possible.
Our analysis of the mortal multi-armed bandit problem considers two settings. 
First, in the less realistic but simpler state-aware (deterministic reward) case, pulling arm i always provides a reward that equals the expected payoff of the arm. Second, in the more realistic state-oblivious (stochastic reward) case, the reward from arm i is a Bernoulli (0/1) random variable that indicates the true payoff of the arm only in expectation. We provide an optimal algorithm for the state-aware case. This algorithm is based on characterizing the precise payoff threshold below which repeated arm pulls become suboptimal. This characterization also shows that there are cases in which linear regret is inevitable. We then extend the algorithm to the state-oblivious case, and show that it is near-optimal. Following this, we provide a general heuristic recipe for modifying standard multi-armed bandit algorithms to be more suitable in the mortal-arm setting. We validate the efficacy of our algorithms on various payoff distributions, including one empirically derived from real ads. In all cases, we show that the algorithms presented significantly outperform standard multi-armed bandit approaches.

2 Modeling mortality

Suppose we wish to select the ads to display on a webpage. Every time a user visits this webpage, we may choose one ad to display. Each ad has a different potential to provide revenue, and we wish to sequentially select the ads to maximize the total expected revenue. Formally, say that at time t, we have ads A(t) = {ad1t, . . . , adkt} from which we must pick one to show. Each adit has a payoff µit ∈ [0, 1] that is drawn from some known cumulative distribution F(µ).¹ Presenting adit at time t provides a (financial) reward R(µit); the reward function R(·) will be specified below.
If the pool of available ads A(t) were static, or if the payoffs were only slowly changing with t, this problem could be solved using any standard multi-armed bandit approach. 
As described earlier, in reality the available ads are rapidly changing. We propose the following simple model for this change: at the end of each time step t, one or more ads may die and be replaced with new ads. The process then continues with time t + 1. Note that since change happens only through replacement of ads, the number of ads k = |A(t)| remains fixed. Also, as long as an ad is alive, we assume that its payoff is fixed.
Death can be modeled in two ways, and we will address both in this work. An ad i may have a budget Li that is known a priori and revealed to the algorithm. The ad dies immediately after it has been selected Li times; we assume that the Li values are drawn from a geometric distribution, with an expected budget of L. We refer to this case as budgeted death. Alternatively, each ad may die with a fixed probability p after every time step, whether it was selected or not. This is equivalent to each ad being allocated a lifetime budget Li, drawn from a geometric distribution with parameter p, that is fixed when the arm is born but never revealed to the algorithm; in this case new arms have an expected lifetime of L = 1/p. We call this timed death. In both death settings, we assume in our theoretical analysis that at any time there is always at least one previously unexplored ad available. This reflects reality, where the number of ads is practically unlimited.
Finally, we model the reward function in two ways, the first being simpler to analyze and the latter more realistic. In the state-aware (deterministic reward) case, we assume R(µit) = µit. This provides us with complete information about each ad immediately after it is chosen to be displayed. In the state-oblivious (stochastic reward) case, we take R(µit) to be a random variable that is 1 with probability µit and 0 otherwise.

¹We limit our analysis to the case where F(µ) is stationary and known, as we are particularly interested in the long-term steady-state setting.

The mortal multi-armed bandit setting requires different performance measures than the ones used with static multi-armed bandits. In the static setting, very little exploration is needed once an optimal arm is identified with near-certainty; therefore the quality measure is the total regret over time. In our setting the algorithm needs to continuously explore newly available arms. We therefore study the long-term, steady-state, mean regret per time step of various solutions. We define this regret as the expected payoff of the best currently alive arm minus the payoff actually obtained by the algorithm.

3 Related work

Our work is most related to the study of dynamic versions of the multi-armed bandit (MAB) paradigm, where either the set of arms or their expected rewards may change over time. Motivated by task scheduling, Gittins [10] proposed a policy where only the state of the active arm (the arm currently being played) can change in a given step, and proved its optimality for the Bayesian formulation with time discounting. This seminal result gave rise to a rich line of work, a proper review of which is beyond the scope of this paper. In particular, Whittle [23] introduced an extension termed restless bandits [23, 6, 15], where the states of all arms can change in each step according to a known (but arbitrary) stochastic transition function. Restless bandits have been shown to be intractable: e.g., even with deterministic transitions, the problem of computing an (approximately) optimal strategy is PSPACE-hard [18]. 
The sleeping bandits problem, where the set of strategies is fixed but only a subset of them is available in each step, was studied in [9, 7] and recently, using a different evaluation criterion, in [13]. Strategies with expected rewards that change gradually over time were studied in [19]. The mixture-of-experts paradigm is related [11], but assumes that data tuples are provided to each expert, instead of the tuples being picked by the algorithm, as in the bandit setting.
Auer et al. [3] adopted an adversarial approach: they defined the adversarial MAB problem, where the reward distributions are allowed to change arbitrarily over time and the goal is to approach the performance of the best time-invariant policy. This formulation has been further studied in several other papers. Auer et al. [3, 1] also considered a more general definition of regret, where the comparison is to the best policy that can change arms a limited number of times. Due to the overwhelming strength of the adversary, the guarantees obtained in this line of work are relatively weak when applied to the setting that we consider in this paper.
Another aspect of our model is that unexplored arms are always available. Related work broadly comes in three flavors. First, new arms can become available over time; the optimality of Gittins' index was shown to extend to this case [22]. The second case is that of infinite-armed bandits with discrete arms, first studied by [4] and recently extended to the case of unknown payoff distributions and an unknown time horizon [20]. Finally, the bandit arms may be indexed by numbers from the real line, implying uncountably infinite bandit arms, but where "nearby" arms (in terms of distance along the real line) have similar payoffs [12, 14]. 
However, none of these approaches allows for arms to appear and then disappear, which, as we show later, critically affects any regret bounds.

4 Upper bound on mortal reward

In this section we show that in the mortal multi-armed bandit setting, the regret per time step of any algorithm can never go to zero, unlike in the standard MAB setting. Specifically, we develop an upper bound on the mean reward per step of any such algorithm for the state-aware, budgeted death case. We then use reductions between the different models to show that this bound holds for the state-oblivious and timed death cases as well.
We prove the bound assuming we always have new arms available. The expected reward of an arm is drawn from a cumulative distribution F(µ) with support in [0, 1]. For X ∼ F(µ), let E[X] be the expectation of X over F(µ). We assume that the lifetime of an arm has a geometric distribution with parameter p, and denote its expectation by L = 1/p. The following function captures the tradeoff between exploration and exploitation in our setting and plays a major role in our analysis:

Γ(µ) = ( E[X] + (1 − F(µ))(L − 1) E[X | X ≥ µ] ) / ( 1 + (1 − F(µ))(L − 1) ).   (1)

Theorem 1. Let µ̄(t) denote the maximum mean reward that any algorithm for the state-aware mortal multi-armed bandit problem can obtain in t steps in the budgeted death case. Then lim_{t→∞} µ̄(t) ≤ max_µ Γ(µ).

Proof sketch. We distinguish between fresh arm pulls, i.e., pulls of arms that were not pulled before, and repeat arm pulls. Assume that the optimal algorithm pulls τ(t) distinct (fresh) arms in t steps, and hence makes t − τ(t) repeat pulls. The expected number of repeat pulls to an arm before it expires is (1 − p)/p. 
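As a quick numeric aside (our sketch, not part of the paper): with a geometric(p) budget, the first pull of an arm consumes one unit, so the number of repeat pulls before expiry is the budget minus one, with mean (1 − p)/p. The function name is ours.

```python
import random

def mean_repeat_pulls(p: float, trials: int = 200_000, seed: int = 0) -> float:
    """Estimate E[repeat pulls] for an arm with a geometric(p) budget.

    The budget L_i counts total pulls until death; the first (fresh) pull
    uses one, so repeat pulls = L_i - 1, with expectation (1 - p)/p.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        budget = 1
        while rng.random() > p:  # survive a pull with probability 1 - p
            budget += 1
        total += budget - 1
    return total / trials
```

For example, with p = 0.1 the simulated mean is close to (1 − 0.1)/0.1 = 9.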
Thus, using Wald's equation [8], the expected number of different arms the algorithm must use for the repeat pulls is (t − τ(t)) · p/(1 − p). Let ℓ(t) ≤ τ(t) be the number of distinct arms that get pulled more than once. Using Chernoff bounds, we can show that for any δ > 0, for sufficiently large t, with probability ≥ 1 − 1/t², the algorithm uses at least ℓ(t) = p(t − τ(t))/(1 − p) · (1 − δ) different arms for the repeat pulls. Call this event E1(δ).
Next, we upper bound the expected reward of the best ℓ(t) arms found in τ(t) fresh probes. For any h > 0, let µ(h) = F⁻¹(1 − (ℓ(t)/τ(t))(1 − h)). In other words, the probability of picking an arm with expected reward greater than or equal to µ(h) is (ℓ(t)/τ(t))(1 − h). Applying the Chernoff bound, for any δ, h > 0 there exists t(δ, h) such that for all t ≥ t(δ, h), the probability that the algorithm finds at least ℓ(t) arms with expected reward at least µ(δ, h) = µ(h)(1 − δ) is bounded by 1/t². Call this event E2(δ, h).
Let E(δ, h) be the event E1(δ) ∧ ¬E2(δ, h). The expected reward of the algorithm in this event after t steps is then bounded by τ(t)E[X] + (t − τ(t))E[X | X ≥ µ(δ, h)] Pr(E(δ, h)) + (t − τ(t))(1 − Pr(E(δ, h))). As δ, h → 0, Pr(E(δ, h)) → 1, and the expected reward per step when the algorithm pulls τ(t) fresh arms is given by

lim sup_{t→∞} µ̄(t) ≤ (1/t) ( τ(t) E[X] + (t − τ(t)) E[X | X ≥ µ] ),

where µ = F⁻¹(1 − ℓ(t)/τ(t)) and ℓ(t) = (t − τ(t))p/(1 − p). 
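The omitted calculation can be made explicit (our reconstruction, under the definitions above): writing s = τ(t)/t, the relation ℓ(t)/τ(t) = 1 − F(µ) ties the fraction of fresh pulls to µ, and substituting recovers Γ.

```latex
% From 1 - F(\mu) = \ell(t)/\tau(t) with \ell(t) = (t-\tau(t))p/(1-p)
% and s = \tau(t)/t, we get s(1-p)(1-F(\mu)) = (1-s)p, hence
\[
  s = \frac{p}{p + (1-p)(1-F(\mu))} = \frac{1}{1 + (1-F(\mu))(L-1)},
  \qquad L = \tfrac{1}{p}.
\]
% Substituting into the per-step bound:
\[
  s\,\mathbb{E}[X] + (1-s)\,\mathbb{E}[X \mid X \ge \mu]
  = \frac{\mathbb{E}[X] + (1-F(\mu))(L-1)\,\mathbb{E}[X \mid X \ge \mu]}
         {1 + (1-F(\mu))(L-1)}
  = \Gamma(\mu),
\]
% so maximizing over the algorithm's implicit choice of \mu gives
% \limsup_{t\to\infty} \bar\mu(t) \le \max_\mu \Gamma(\mu).
```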
After some calculations, we get lim sup_{t→∞} µ̄(t) ≤ max_µ Γ(µ).

In Section 5.1 we present an algorithm that achieves this performance bound in the state-aware case. The following two simple reductions establish the bound for the timed death and state-oblivious models.
Lemma 2. Assuming that new arms are always available, any algorithm for the timed death model obtains at least the same reward per time step in the budgeted death model.

Although we omit the proof due to space constraints, the intuition behind this lemma is that an arm in the timed case can die no sooner than in the budgeted case (i.e., when it is always pulled). As a result, we get:
Lemma 3. Let µ̄det(t) and µ̄sto(t) denote the respective maximum mean expected rewards that any algorithm for the state-aware and state-oblivious mortal multi-armed bandit problems can obtain after running for t steps. Then µ̄sto(t) ≤ µ̄det(t).
We now present two applications of the upper bound. The first simply observes that if the time to find an optimal arm is greater than the lifetime of such an arm, then the mean reward per step of any algorithm must be smaller than the best value. This is in contrast to the standard MAB problem with the same reward distribution, where the mean regret per step tends to 0.
Corollary 4. Assume that the expected reward of a bandit arm is 1 with probability p < 1/2 and 1 − δ otherwise, for some δ ∈ (0, 1]. Let the lifetimes of arms have a geometric distribution with the same parameter p. The mean reward per step of any algorithm for this supply of arms is at most 1 − δ + δp, while the maximum expected reward is 1, yielding an expected regret per step of Ω(1).
Corollary 5. Assume arm payoffs are drawn from a uniform distribution, F(x) = x, x ∈ [0, 1]. Consider the timed death case with parameter p ∈ (0, 1). 
Then the mean reward per step is bounded by (1 − √p)/(1 − p), and the expected regret per step of any algorithm is Ω(√p).

5 Bandit algorithms for mortal arms

In this section we present and analyze a number of algorithms specifically designed for the mortal multi-armed bandit task. We develop the optimal algorithm for the state-aware case and then modify the algorithm for the state-oblivious case, yielding near-optimal regret. We also study a subset approach that can be used in tandem with any standard multi-armed bandit algorithm to substantially improve performance in the mortal multi-armed bandit setting.

5.1 The state-aware case

We now show that the algorithm DETOPT is optimal for this deterministic reward setting.

Algorithm DETOPT
input: Distribution F(µ), expected lifetime L
µ* ← argmax_µ Γ(µ)                          [Γ is defined in (1)]
while we keep playing
  i ← random new arm
  Pull arm i; R ← R(µi) = µi
  if R > µ*                                 [If arm is good, stay with it]
    Pull arm i every turn until it expires
  end if
end while

Assume the same setting as in the previous section, with a constant supply of new arms. The expected reward of an arm is drawn from cumulative distribution F(µ). Let X be a random variable with that distribution, and E[X] be its expectation over F(µ). Assume that the lifetime of an arm has a geometric distribution with parameter p, and denote its expectation by L = 1/p. Recall Γ(µ) from (1) and let µ* = argmax_µ Γ(µ). Now,
Theorem 6. 
Let DETOPT(t) denote the mean per-turn reward obtained by DETOPT after running for t steps with µ* = argmax_µ Γ(µ). Then lim_{t→∞} DETOPT(t) = max_µ Γ(µ).
Note that the analysis of the algorithm holds for both the budgeted and timed death models.

5.2 The state-oblivious case

We now present a modified version of DETOPT for the state-oblivious case. The intuition behind this modification, STOCHASTIC, is simple: instead of pulling an arm once to determine its payoff µi, the algorithm pulls each arm n times and abandons it unless it looks promising. A variant, called STOCHASTIC WITH EARLY STOPPING, abandons the arm earlier if its maximum possible future reward will still not justify its retention. For n = O(log L/ε²), STOCHASTIC gets an expected reward per step of Γ(µ* − ε) and is thus near-optimal; the details are omitted due to space constraints.

Algorithm STOCHASTIC
input: Distribution F(µ), expected lifetime L
µ* ← argmax_µ Γ(µ)                          [Γ is defined in (1)]
while we keep playing
  i ← random new arm; r ← 0
  for d = 1, . . . , n                      [Play a random arm n times]
    Pull arm i; r ← r + R(µi)
  end for
  if r > nµ*                                [If it is good, stay with it forever]
    Pull arm i every turn until it dies
  end if
end while

Algorithm STOCHASTIC WITH EARLY STOPPING
input: Distribution F(µ), expected lifetime L
µ* ← argmax_µ Γ(µ)                          [Γ is defined in (1)]
while we keep playing
  i ← random new arm; r ← 0; d ← 0
  while d < n and n − d ≥ nµ* − r           [Play random arm as long as necessary]
    Pull arm i; r ← r + R(µi); d ← d + 1
  end while
  if r > nµ*                                [If it is good, stay with it forever]
    Pull arm i every turn until it dies
  end if
end while

The subset heuristic. 
Why can't we simply use a standard multi-armed bandit (MAB) algorithm for mortal bandits as well? Intuitively, MAB algorithms invest a lot of pulls on all arms (at least logarithmic in the total number of pulls) to guarantee convergence to the optimal arm. This is necessary in the traditional bandit settings, but in the limit as t → ∞, the cost is recouped and leads to sublinear regret. However, such an investment is not justified for mortal bandits: the most gain we can get from an arm is L (if the arm has payoff 1), which reduces the importance of convergence to the best arm. In fact, as shown by Corollary 4, converging to a reasonably good arm suffices.
However, standard MAB algorithms do identify better arms very well. This suggests the following epoch-based heuristic: (a) select a subset of k/c arms uniformly at random from the total k arms at the beginning of each epoch, (b) operate a standard bandit algorithm on these until the epoch ends, and repeat. Intuitively, step (a) reduces the load on the bandit algorithm, allowing it to explore less and converge faster, in return for finding an arm that is probably optimal only among the k/c subset. Picking the right c and the epoch length then depends on balancing the speed of convergence of the bandit algorithm, the arm lifetimes, and the difference between the k-th and the k/c-th order statistics of the arm payoff distribution; in our experiments, c is chosen empirically.
Using the subset heuristic, we propose an extension of the UCB1 algorithm² [2], called UCB1K/C, for the state-oblivious case. Note that this is just one example of the use of this heuristic; any standard bandit algorithm could have been used in place of UCB1 here. In the next section, UCB1K/C is shown to perform far better than UCB1 in the mortal arms setting.
The ADAPTIVEGREEDY heuristic. 
Empirically, simple greedy MAB algorithms have previously been shown to perform well due to fast convergence. Hence, for the purpose of evaluation, we also compare to an adaptive greedy heuristic for mortal bandits. Note that the εn-greedy algorithm [2] does not apply directly to mortal bandits, since the probability εt of random exploration decays to zero for large t, which can leave the algorithm with no good choices should the best arm expire.

Algorithm UCB1K/C
input: k-armed bandit, c
while we keep playing
  S ← k/c random arms
  dead ← 0
  A_UCB1(S) ← Initialize UCB1 over arms S
  repeat
    i ← arm selected by A_UCB1(S)
    Pull arm i, provide reward to A_UCB1(S)
    x ← total arms that died this turn
    Check for newly dead arms in S, remove any
    dead ← dead + x
  until dead ≥ k/2 or |S| = 0
end while

Algorithm ADAPTIVEGREEDY
input: k-armed bandit, c
Initialization: ∀i ∈ [1, k], ri, ni ← 0
while we keep playing
  m ← argmax_i ri/ni                        [Find best arm so far]
  pm ← rm/nm
  With probability min(1, c · pm)
    j ← m
  Otherwise
    j ← uniform(1, k)                       [Pull a random arm]
  r ← R(j)
  rj ← rj + r                               [Update the observed rewards]
  nj ← nj + 1
end while

6 Empirical evaluation

In this section we evaluate the performance of UCB1K/C, STOCHASTIC, STOCHASTIC WITH EARLY STOPPING, and ADAPTIVEGREEDY in the mortal-arm, state-oblivious setting. We also compare these to the UCB1 algorithm [2], which does not consider arm mortality in its policy but is among the faster-converging standard multi-armed bandit algorithms. We present the results of simulation studies using three different distributions of arm payoffs F(·).
Uniform distributed arm payoffs. Our performance analyses assume that the cumulative payoff distribution F(·) of new arms is known. A particularly simple one is the uniform distribution, µit ∼ uniform(0, 1). 
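To make this setup concrete, here is a minimal simulation sketch (ours, not the authors' code) of the timed-death, state-oblivious model with uniform payoffs: it maximizes Γ(µ) from (1) numerically to obtain µ*, then runs the STOCHASTIC strategy against a pull-a-fresh-arm-every-turn baseline. All function names, the horizon, and the choices of n and p are illustrative.

```python
import random

def gamma(mu: float, L: float) -> float:
    """Gamma(mu) from (1), specialized to uniform F on [0, 1]:
    E[X] = 1/2, F(mu) = mu, E[X | X >= mu] = (1 + mu)/2."""
    w = (1.0 - mu) * (L - 1.0)
    return (0.5 + w * (1.0 + mu) / 2.0) / (1.0 + w)

def mu_star(L: float, grid: int = 10_000) -> float:
    """Grid-search argmax_mu Gamma(mu)."""
    return max((i / grid for i in range(grid + 1)), key=lambda m: gamma(m, L))

def geometric(p: float, rng: random.Random) -> int:
    """Lifetime with mean 1/p; by memorylessness this is also the
    remaining lifetime of a fresh arm under timed death."""
    k = 1
    while rng.random() > p:
        k += 1
    return k

def stochastic_run(p: float, n: int, steps: int, seed: int = 0) -> float:
    """Mean reward per step of STOCHASTIC: sample each fresh arm n times
    and keep pulling it until death iff its empirical total beats n * mu*."""
    rng = random.Random(seed)
    thresh = n * mu_star(1.0 / p)
    total, t = 0.0, 0
    while t < steps:
        mu, life = rng.random(), geometric(p, rng)  # fresh arm
        r = 0
        for _ in range(min(n, life, steps - t)):    # exploration pulls
            r += rng.random() < mu                  # Bernoulli(mu) reward
            t += 1
        total += r
        if r > thresh:                              # good arm: exploit it
            for _ in range(min(life - n, steps - t)):
                total += rng.random() < mu
                t += 1
    return total / steps

def fresh_arm_baseline(steps: int, seed: int = 1) -> float:
    """Pull a brand-new uniform arm every turn (mean reward 1/2)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(steps):
        mu = rng.random()
        wins += rng.random() < mu
    return wins / steps
```

With p = 0.01 the maximized Γ comes out near 0.91, i.e., a per-step regret on the order of √p, consistent with Corollary 5, and STOCHASTIC clearly beats the fresh-arm baseline.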
Figure 1(a) shows the performance of these algorithms as a function of the expected lifetime of each arm, using a timed death and state-oblivious model. The evaluation was performed over k = 1000 arms, with each curve showing the mean regret per turn obtained by each algorithm when averaged over ten runs. Each run was simulated for ten times the expected lifetime of the arms, and all parameters were empirically optimized for each algorithm and each lifetime. Repeating the evaluation with k = 100,000 arms produces qualitatively very similar performance.
We first note the striking difference between UCB1 and UCB1K/C, with the latter performing far better. In particular, even with the longest lifetimes, each arm can be sampled in expectation at most 100 times. With such limited sampling, UCB1 spends almost all the time exploring and generates almost the same regret of 0.5 per turn as would an algorithm that pulls arms at random.
In contrast, UCB1K/C is able to obtain a substantially lower regret by limiting the exploration to a subset of the arms. This demonstrates the usefulness of the K/C idea: by running the UCB1 algorithm on an appropriately sized subset of arms, the overall regret per turn is reduced drastically. In practice, with k = 1000 arms, the best performance was obtained with K/C between 4 and 40, depending on the arm lifetime.
Second, we see that STOCHASTIC outperformed UCB1K/C with optimally chosen parameters. Moreover, STOCHASTIC WITH EARLY STOPPING performs as well as ADAPTIVEGREEDY, which matches the best performance we were able to obtain by any algorithm. 

²UCB1 plays the arm j, previously pulled nj times, with the highest mean historical payoff plus √((2 ln n)/nj).

Figure 1: Comparison of the regret per turn obtained by five different algorithms assuming that new arm payoffs come from the (a) uniform distribution and (b) beta(1, 3) distribution.

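For reference, the UCB1 index defined in footnote 2 can be sketched as follows (a minimal static-arm implementation of ours, with illustrative Bernoulli arm means; mortality and the K/C subsetting are omitted):

```python
import math
import random

def ucb1(means, steps, seed=0):
    """UCB1: play the arm j maximizing mean payoff + sqrt(2 ln n / n_j),
    where n is the total number of pulls so far and n_j the pulls of arm j.

    `means` are true Bernoulli payoffs of static arms. Returns pull counts.
    """
    rng = random.Random(seed)
    k = len(means)
    rewards = [0.0] * k
    pulls = [0] * k
    for j in range(k):              # initialization: pull each arm once
        rewards[j] += rng.random() < means[j]
        pulls[j] += 1
    for n in range(k, steps):
        j = max(range(k), key=lambda a: rewards[a] / pulls[a]
                + math.sqrt(2.0 * math.log(n) / pulls[a]))
        rewards[j] += rng.random() < means[j]
        pulls[j] += 1
    return pulls
```

With one clearly better arm, UCB1 concentrates almost all of its pulls on it; the mortal-arm difficulty discussed above is that this concentration takes longer than a short-lived arm survives.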
This demonstrates that (a) the state-oblivious version of the optimal deterministic algorithm is effective in general, and (b) the early stopping criterion allows arms with poor payoff to be quickly weeded out.
Beta distributed arm payoffs. While the strategies discussed perform well when arm payoffs are uniformly distributed, it is unlikely that in a real setting the payoffs would be so well distributed. In particular, if there are occasional arms with substantially higher payoffs, we could expect that any algorithm that does not exhaustively search the available arms may obtain very high regret per turn.
Figure 1(b) shows the results when the arm payoff probabilities are drawn from the beta(1, 3) distribution. We chose this distribution as it has finite support yet tends to select small payoffs for most arms while selecting high payoffs occasionally. Once again, we see that STOCHASTIC WITH EARLY STOPPING and ADAPTIVEGREEDY perform best, with the relative ranking of all other algorithms the same as in the uniform case above. The absolute regret of the algorithms we have proposed is increased relative to that seen in Figure 1(a), but is still substantially better than that of UCB1. In fact, the regret of UCB1 has increased more under this distribution than that of any other algorithm.
Real-world arm payoffs. Considering the application that motivated this work, we now evaluate the performance of the four new algorithms when the arm payoffs come from the empirically observed distribution of clickthrough rates on real ads served by a large ad broker.
Figure 2(a) shows a histogram of the payoff probabilities for a random sample of approximately 300 real ads belonging to a shopping-related category when presented on web pages classified as belonging to the same category. The probabilities have been linearly scaled such that all ads have payoff between 0 and 1. 
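As a reference point for these comparisons, the ADAPTIVEGREEDY heuristic of Section 5 can be sketched roughly as follows (our reading of the pseudocode, on static Bernoulli arms for illustration; function and variable names are ours):

```python
import random

def adaptive_greedy(means, c, steps, seed=0):
    """ADAPTIVEGREEDY: exploit the empirically best arm with probability
    min(1, c * p_m), where p_m is its observed payoff rate; otherwise
    pull a uniformly random arm. Returns pull counts per arm."""
    rng = random.Random(seed)
    k = len(means)
    r = [0.0] * k   # observed total rewards
    n = [0] * k     # pull counts
    for _ in range(steps):
        explored = [a for a in range(k) if n[a] > 0]
        if explored:                               # best arm so far
            m = max(explored, key=lambda a: r[a] / n[a])
            p_m = r[m] / n[m]
        else:
            m, p_m = 0, 0.0                        # nothing observed yet
        if rng.random() < min(1.0, c * p_m):
            j = m                                  # exploit
        else:
            j = rng.randrange(k)                   # explore a random arm
        r[j] += rng.random() < means[j]            # Bernoulli reward
        n[j] += 1
    return n
```

Unlike εn-greedy, the exploration probability here adapts to the quality of the current best arm instead of decaying with time, so exploration never shuts off entirely; when the best arm expires, the empirical payoff of the incumbent drops and exploration resumes.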
We see that the distribution is unimodal and fairly tightly concentrated. By sampling arm payoffs from a smoothed version of this empirical distribution, we evaluated the performance of the algorithms presented earlier. Figure 2(b) shows that the performance of all the algorithms is consistent with that seen for both the uniform and beta payoff distributions. In particular, while the mean regret per turn is somewhat higher than that seen for the uniform distribution, it is still lower than when payoffs are from the beta distribution. As before, STOCHASTIC WITH EARLY STOPPING and ADAPTIVEGREEDY perform best, indistinguishable from each other.

Figure 2: (a) Distribution of real-world ad payoffs, scaled linearly such that the maximum payoff is 1, and (b) regret per turn under the real-world ad payoff distribution.

7 Conclusions

We have introduced a new formulation of the multi-armed bandit problem motivated by the real-world problem of selecting ads to display on webpages. In this setting the set of strategies available to a multi-armed bandit algorithm changes rapidly over time. We provided a lower bound of linear regret under certain payoff distributions. Further, we presented a number of algorithms that perform substantially better in this setting than previous multi-armed bandit algorithms, including one that is optimal under the state-aware setting, and one that is near-optimal under the state-oblivious setting. Finally, we provided an extension that allows any previous multi-armed bandit algorithm to be used in the case of mortal arms. 
Simulations on multiple payoff distributions, including one derived from a real-world ad serving application, demonstrate the efficacy of our approach.

Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments and suggestions.