{"title": "Multi-armed bandits on implicit metric spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 1602, "page_last": 1610, "abstract": "The multi-armed bandit (MAB) setting is a useful abstraction of many online learning tasks which focuses on the trade-off between exploration and exploitation. In this setting, an online algorithm has a fixed set of alternatives (\"arms\"), and in each round it selects one arm and then observes the corresponding reward. While the case of small number of arms is by now well-understood, a lot of recent work has focused on multi-armed bandits with (infinitely) many arms, where one needs to assume extra structure in order to make the problem tractable. In particular, in the Lipschitz MAB problem there is an underlying similarity metric space, known to the algorithm, such that any two arms that are close in this metric space have similar payoffs. In this paper we consider the more realistic scenario in which the metric space is *implicit* -- it is defined by the available structure but not revealed to the algorithm directly. Specifically, we assume that an algorithm is given a tree-based classification of arms. For any given problem instance such a classification implicitly defines a similarity metric space, but the numerical similarity information is not available to the algorithm. We provide an algorithm for this setting, whose performance guarantees (almost) match the best known guarantees for the corresponding instance of the Lipschitz MAB problem.", "full_text": "Multi-armed bandits on implicit metric spaces\n\nAleksandrs Slivkins\n\nslivkins at microsoft.com\n\nMicrosoft Research Silicon Valley\n\nMountain View, CA 94043\n\nAbstract\n\nThe multi-armed bandit (MAB) setting is a useful abstraction of many online\nlearning tasks which focuses on the trade-off between exploration and exploita-\ntion. 
In this setting, an online algorithm has a fixed set of alternatives ("arms"), and in each round it selects one arm and then observes the corresponding reward. While the case of a small number of arms is by now well-understood, a lot of recent work has focused on multi-armed bandits with (infinitely) many arms, where one needs to assume extra structure in order to make the problem tractable. In particular, in the Lipschitz MAB problem there is an underlying similarity metric space, known to the algorithm, such that any two arms that are close in this metric space have similar payoffs. In this paper we consider the more realistic scenario in which the metric space is implicit – it is defined by the available structure but not revealed to the algorithm directly. Specifically, we assume that an algorithm is given a tree-based classification of arms. For any given problem instance such a classification implicitly defines a similarity metric space, but the numerical similarity information is not available to the algorithm. We provide an algorithm for this setting, whose performance guarantees (almost) match the best known guarantees for the corresponding instance of the Lipschitz MAB problem.

1 Introduction

In a multi-armed bandit (MAB) problem, a player is presented with a sequence of trials. In each round, the player chooses one alternative from a set of alternatives ("arms") based on the past history, and receives the payoff associated with this alternative. The goal is to maximize the total payoff of the chosen arms. The multi-armed bandit setting was introduced in the 1950s and has been studied intensively since then in Operations Research, Economics and Computer Science; e.g. see [8] for background.
This setting is often used to model the tradeoff between exploration and exploitation, which is the principal issue in sequential decision-making under uncertainty.

One standard way to evaluate the performance of a multi-armed bandit algorithm is regret, defined as the difference between the expected payoff of an optimal arm and that of the algorithm. By now the multi-armed bandit problem with a small finite number of arms is quite well understood (e.g. see [22, 3, 2]). However, if the set of arms is exponentially or infinitely large, the problem becomes intractable, unless we make further assumptions about the problem instance. Essentially, an MAB algorithm needs to find a needle in a haystack; for each algorithm there are inputs on which it performs as badly as random guessing.

Bandit problems with large sets of arms have received considerable attention, e.g. [1, 5, 23, 12, 21, 10, 24, 25, 11, 4, 16, 20, 7, 19]. The common theme in these works is to assume a certain structure on payoff functions. Assumptions of this type are natural in many applications, and often lead to efficient learning algorithms; e.g. see [18, 8] for background.

In particular, the line of work [1, 17, 4, 20, 7, 19] considers the Lipschitz MAB problem, a broad and natural bandit setting in which the structure is induced by a metric on the set of arms.^1 In this setting an algorithm is given a metric space (X, D), where X is the set of arms, which represents the available similarity information (information on similarity between arms). Payoffs are stochastic: the payoff from choosing arm x is an independent random sample with expectation μ(x).
The metric space is related to payoffs via the following Lipschitz condition:^2

|μ(x) − μ(y)| ≤ D(x, y)   for all x, y ∈ X.   (1)

Performance guarantees consider regret R(t) as a function of time t, and focus on the asymptotic dependence of R(·) on a suitably defined "dimensionality" of the problem instance (X, D, μ). Various upper and lower bounds of the form R(t) = Θ̃(t^γ), γ < 1, have been proved.

We relax an important assumption in Lipschitz MAB: that the available similarity information provides numerical values in the sense of (1).^3 Specifically, following [21, 24, 25] we assume that an algorithm is (only) given a taxonomy on arms: a tree-based classification modeled by a rooted tree T whose leaf set is X. The idea is that any two arms in the same subtree are likely to have similar payoffs. Motivations include contextual advertising and web search with topical taxonomies, e.g. [25, 6, 29, 27], Monte-Carlo planning [21, 24], and Computer Go [13, 14].

We call the above formulation the Taxonomy MAB problem; a problem instance is a triple (X, T, μ). Crucially, in Taxonomy MAB no numerical similarity information is explicitly revealed. All prior algorithms for Lipschitz MAB (and in particular, all algorithms in [20, 7]) are parameterized by some numerical similarity information, and therefore do not directly apply to Taxonomy MAB.

One natural way to quantify the extent of similarity between arms in a given subtree is via the maximum difference in expected payoffs. Specifically, for each internal node v we define the width of the corresponding subtree T(v) to be W(v) = sup_{x,y ∈ X(v)} |μ(x) − μ(y)|, where X(v) is the set of leaves in T(v). Note that the subtree widths are non-increasing from root to leaves.
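For concreteness, the subtree widths W(v) can be computed bottom-up over the leaf payoffs. The following sketch is our own illustration (not the paper's code), representing a taxonomy as nested tuples whose leaves are the expected payoffs μ(x):

```python
# Sketch (ours, not the paper's code): subtree widths W(v) for a toy
# taxonomy. A node is either a leaf, represented by its expected payoff
# mu(x), or a tuple of child subtrees.

def leaf_payoffs(node):
    """Collect mu(x) over all leaves X(v) under `node`."""
    if isinstance(node, tuple):
        out = []
        for child in node:
            out.extend(leaf_payoffs(child))
        return out
    return [node]

def width(node):
    """W(v) = sup over x, y in X(v) of |mu(x) - mu(y)|."""
    mus = leaf_payoffs(node)
    return max(mus) - min(mus)

tree = ((0.9, 0.8), (0.3, (0.2, 0.1)))
# Widths are non-increasing from the root towards the leaves:
assert width(tree) >= max(width(tree[0]), width(tree[1]))
```

Here `tree`, `leaf_payoffs` and `width` are our illustrative names; a real instance would of course carry arms, not payoff values, at the leaves, with μ unknown to the algorithm.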
A standard notion of distance induced by subtree widths, henceforth called the implicit distance, is as follows: D_imp(x, y) is the width of the least common ancestor of leaves x, y. It is immediate that this is indeed a metric, and moreover that it satisfies (1). In fact, D_imp(x, y) is the smallest "width-based" distance that satisfies (1). If the widths are strictly decreasing, T can be reconstructed from D_imp. Thus, an instance (X, T, μ) of Taxonomy MAB naturally induces an instance (X, D_imp, μ) of Lipschitz MAB which (assuming the widths are strictly decreasing) encodes all relevant information. The crucial distinction is that in Taxonomy MAB the metric space (X, D_imp) is implicit: the subtree widths are not revealed to the algorithm. In particular, the algorithms in [20, 7] do not apply.

We view Lipschitz MAB as a performance benchmark for Taxonomy MAB. We are concerned with the following question: can an algorithm for Taxonomy MAB perform as if it were given the implicit metric space (X, D_imp)? More formally, we ask whether it is possible to obtain guarantees for Taxonomy MAB that (almost) match the state of the art for Lipschitz MAB.

We answer this question in the affirmative as long as the implicit metric space (X, D_imp) has a small doubling constant (see Section 2 for a milder condition). We provide an algorithm with guarantees that are almost identical to those for the zooming algorithm in [20].^4

Our algorithm proceeds by estimating the subtree widths of near-optimal subtrees. Thus, we encounter a two-pronged exploration-exploitation trade-off: samples from a given subtree reveal information not only about payoffs but also about the width, whereas in Lipschitz MAB we only need to worry about the payoffs. Dealing with this more complicated trade-off is the main new conceptual hurdle (which leads to some technical complications such as the proof of Lemma 4.4).
These complications aside, our algorithm is similar to those in [17, 20] in that it maintains a partition of the space of arms into regions (in this case, subtrees) so that each region is treated as a "meta-arm", and this partition is adapted to the high-payoff regions.

Footnote 1: This problem has been explicitly defined in [20]. Preceding work [1, 17, 9, 4] considered a few special cases such as a one-dimensional real interval with a metric defined by D(x, y) = |x − y|^α, α ∈ (0, 1].
Footnote 2: The Lipschitz constant is c_Lip = 1 without loss of generality: else, one could take the metric c_Lip × D.
Footnote 3: In the full version of [20] the setting is relaxed so that (1) needs to hold only if x is optimal, and the distances between non-optimal points do not need to be explicitly known; [7] provides a similar result.
Footnote 4: The guarantees in [7] are similar but slightly different technically.

1.1 Preliminaries

The Taxonomy MAB problem and the implicit metric space (X, D_imp) are defined as in Section 1. We assume stochastic payoffs [2]: in each round t the algorithm chooses a point x = x_t ∈ X and observes an independent random sample from a payoff distribution P_payoff(x) with support [0, 1] and expectation μ(x).^5 The payoff function μ : X → [0, 1] is not revealed to the algorithm. The goal is to minimize regret with respect to the best expected arm:

R(T) ≜ μ* T − E[ Σ_{t=1}^{T} μ(x_t) ] = E[ Σ_{t=1}^{T} Δ(x_t) ],   (2)

where μ* ≜ sup_{x∈X} μ(x) is the maximal expected payoff, and Δ(x) ≜ μ* − μ(x) is the "badness" of arm x. An arm x ∈ X is called optimal if μ(x) = μ*.

We will assume that the number of arms is finite (but possibly very large).
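As a toy illustration of the regret notion (2) (our example, not from the paper), consider uniformly random play over a small arm set, for which the expected regret has a closed form:

```python
# Toy illustration (ours) of the regret notion (2) for uniformly random
# play over a small arm set. Under uniform play E[mu(x_t)] is the mean
# payoff, so R(T) = T * (mu_star - mean(mu)) exactly.

mu = {"a": 0.9, "b": 0.5, "c": 0.1}   # hypothetical payoff function
mu_star = max(mu.values())            # mu* in (2)

def expected_regret_uniform(T):
    mean_mu = sum(mu.values()) / len(mu)
    return T * (mu_star - mean_mu)
```

A good bandit algorithm concentrates play on near-optimal arms, driving the per-round regret below this uniform baseline.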
Extension to infinitely many arms (which does not require new algorithmic ideas) is not included, to simplify presentation. Also, we will assume a known time horizon (total number of rounds), denoted T_hor.

Our guarantees are in terms of the zooming dimension [20] of (X, D_imp, μ), a concept that takes into account both the dimensionality of the metric space and the "goodness" of the payoff function. Below we specialize the definition from [20] to Taxonomy MAB.

Definition 1.1 (zooming dimension). For X' ⊂ X, define the covering number N^cov_δ(X') as the smallest number of subtrees of width at most δ that cover X'. Let X_δ ≜ {x ∈ X : 0 < Δ(x) ≤ δ}. The zooming dimension of a problem instance I = (X, T, μ), with multiplier c, is

ZoomDim(I, c) ≜ inf{ d ≥ 0 : N^cov_{δ/8}(X_δ) ≤ c δ^{−d}  ∀δ > 0 }.   (3)

In other words, we consider a covering property N^cov_{δ/8}(X_δ) ≤ c δ^{−d}, and define the zooming dimension as the smallest d such that this covering property holds for all δ > 0. The zooming dimension essentially coincides with the covering dimension of (X, D)^6 for the worst-case payoff function μ, but can be (much) smaller when μ is "benign". In particular, the zooming dimension would "ignore" a subtree with high covering dimension but significantly sub-optimal payoffs.

The doubling constant c_DBL of a metric space is the smallest k such that any ball can be covered by k balls of half the radius. (In our case, any subtree can be covered by k subtrees of half the width.) The doubling constant has been a standard notion in the theoretical computer science literature since [15]; it has since been used to characterize tractable problem instances for a variety of problems. It is known that c_DBL = O(2^d) for any bounded subset S ⊂ R^{d'} of linear dimension d, under any metric ℓ_p, p ≥ 1.
Moreover, c_DBL ≥ c 2^d if d is the covering dimension with multiplier c.

2 Statement of results

We will prove that our algorithm (TaxonomyZoom) satisfies the following regret bound:

For each instance I of Taxonomy MAB, each c > 0 and each T ≤ T_hor,
R(T) ≤ O(c K_I log T_hor)^{1/(2+d)} × T^{1−1/(2+d)},   where d = ZoomDim(I, c).   (4)

We will bound the factor K_I below. For K_I = 1 this is the guarantee for the zooming algorithm in [20] for the corresponding instance (X, D_imp, μ) of Lipschitz MAB. Note that the definition of zooming dimension allows a trade-off between c and d, and we obtain the optimal trade-off since (4) holds for all values of c at once. Following the prior work on Lipschitz MAB, we identify the exponent in (4) as the crucial parameter, as long as the multiplier c is sufficiently small.^7

Our first (and crude) bound for K_I is in terms of the doubling constant of (X, D_imp).

Theorem 2.1 (Crude bound). Given an upper bound c'_DBL on the doubling constant of (X, D_imp), TaxonomyZoom achieves (4) with K_I = f(c'_DBL) log |X|, where f(n) = n 2^n.

Footnote 5: Other than support and expectation, the "shape" of P_payoff(x) is not essential for this paper.
Footnote 6: The covering dimension is defined as in (3), replacing N^cov_{δ/8}(X_δ) with N^cov_δ(X).
Footnote 7: One can reduce ZoomDim by making c huge, e.g. ZoomDim = 0 for c = |X|. However, this is not likely to lead to useful regret bounds. A similar trade-off (dimension vs. multiplier) is implicit in [7].

Our main result (which implies Theorem 2.1) uses a more efficient bound for K_I.

Recall that in Taxonomy MAB subtree widths are not revealed, and the algorithm has to use sampling to estimate them. Informally, the taxonomy is useful for our purposes if and only if subtree widths can be efficiently estimated using random sampling.
We quantify this as a parameter called quality, and bound K_I in terms of this parameter.

We use simple random sampling: start at a tree node v and choose a branch uniformly at random at each junction. Let P(u|v) be the probability that node u is reached starting from v. The probabilities P(·|v) induce a distribution on X(v), the leaf set of subtree T(v). A sample from this distribution is called a random sample from T(v), with expected payoff μ(v) ≜ Σ_{x∈X(v)} μ(x) P(x|v).

Definition 2.2. The quality of the taxonomy for a given problem instance is the largest number q ∈ (0, 1) with the following property: for each subtree T(v) containing an optimal arm there exist tree nodes u, u' ∈ T(v) such that P(u|v) and P(u'|v) are at least q and

|μ(u) − μ(u')| ≥ (1/2) W(v).   (5)

One could use the pair u, u' in Definition 2.2 to obtain reliable estimates for W(v). The definition focuses on the difficulty of obtaining such a pair via random sampling from T(v). The definition is flexible: it allows u and u' to be at different depths (which is useful if node degrees are large and non-uniform), and the widths of other internal nodes in T(v) cannot adversely impact quality. The constant 1/2 in (5) is arbitrary; we fix it for convenience.

For a particularly simple example, consider a binary taxonomy such that for each subtree T(v) containing an optimal arm there exist grandchildren u, u' of v that satisfy (5). For instance, such u, u' exist if the width of each grandchild of v is at most (1/4) W(v). Then quality ≥ 1/4.

Theorem 2.3 (Main result). Assume a lower bound q ≤ quality(I) is known. Then TaxonomyZoom achieves (4) with K_I = (deg/q) log |X|, where deg is the degree of the taxonomy.

Theorem 2.1 follows because, letting c_DBL be the doubling constant of (X, D_imp), all node degrees are at most c_DBL and moreover quality ≥ 2^{−c_DBL} (we omit the proof from this version).

Discussion. The guarantee in Theorem 2.3 is instance-dependent: it depends on deg/quality and ZoomDim, and is meaningful only if these quantities are small compared to the number of arms (informally, we will call such problem instances "benign"). Also, the algorithm needs to know a non-trivial lower bound on quality; very conservative estimates would not suffice. However, underestimating quality (and likewise overestimating T_hor) is relatively inexpensive, as long as the "influence" of these parameters on regret is eventually dominated by the T^{1−1/(2+d)} term.

For benign problem instances, the benefit of using the taxonomy is the vastly improved dependence on the number of arms. Without a taxonomy or any other structure, the regret of any algorithm for stochastic MAB scales linearly in the number of (near-optimal) arms, for sufficiently large t. Specifically, let N_δ be the number of arms x such that δ/2 < Δ(x) ≤ δ. Then the worst-case regret (over all problem instances) cannot be better than R(t) = min(δt, Ω(N_δ / δ)).^8

An alternative approach to MAB problems on trees (without knowing the "widths") are the "tree bandit algorithms" explored in [21, 24]. Here, for each tree node v there is a separate, independent copy of UCB1 [2] or a UCB1-style index algorithm (call it A_v), so that the "arms" for A_v correspond to children u of v, and selecting a child u in a given round corresponds to playing A_u in this round.
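Returning to the random-sampling notions behind Definition 2.2: the reach probabilities P(x|v) under uniform branching, and the induced expected payoff μ(v), can be sketched as follows (our toy representation, with leaves standing for their payoff values):

```python
# Sketch (ours): reach probabilities P(x|v) under uniform branching, and
# the induced expected payoff mu(v) of a random sample from T(v).
# A node is a leaf payoff (float) or a tuple of child subtrees.

def reach_probs(node, p=1.0):
    """List of (mu(x), P(x|node)) pairs for all leaves x under `node`."""
    if not isinstance(node, tuple):
        return [(node, p)]
    out = []
    for child in node:                 # probability splits evenly per junction
        out.extend(reach_probs(child, p / len(node)))
    return out

def mu_bar(node):
    """mu(v) = sum over leaves x of mu(x) * P(x|node)."""
    return sum(m * p for m, p in reach_probs(node))

tree = ((0.9, 0.8), (0.3, (0.2, 0.1)))
# e.g. the leaf with payoff 0.2 sits at depth 3, so P(x|root) = (1/2)**3
```

In the spirit of Definition 2.2, a pair of nodes whose μ-values differ by at least half the subtree width, and which are each reached with probability at least q, makes the width estimable by repeated sampling.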
[21, 24] report successful empirical performance of such algorithms on some examples. However, regret bounds for these algorithms do not scale as well with the number of arms: even if the tree widths are given, then letting Δ_min ≜ min_{x∈X: Δ(x)>0} Δ(x), the regret bound is proportional to |X_δ|/Δ_min (where X_δ is as in Definition 1.1), whereas the regret bound in Theorem 2.3 is (essentially) in terms of the covering numbers N^cov_{δ/8}(X_δ).

Footnote 8: This is implicit from the lower-bounding analysis in [22] and [3].

3 Main algorithm

Our algorithm TaxonomyZoom(T_hor, q) is parameterized by the time horizon T_hor and the quality parameter q ≤ quality. In each round the algorithm selects one of the tree nodes, say v, and plays a randomly sampled arm x from T(v). We say that a subtree T(u) is hit in this round if u ∈ T(v) and x ∈ T(u). For each tree node v and time t, let n_t(v) be the number of times the subtree T(v) has been hit by the algorithm before time t, and let μ_t(v) be the corresponding average reward. Note that E[μ_t(v) | n_t(v) > 0] = μ(v). Define the confidence radius of v at time t as

rad_t(v) ≜ √( 8 log(T_hor |X|) / (2 + n_t(v)) ).   (6)

The meaning of the confidence radius is that |μ_t(v) − μ(v)| ≤ rad_t(v) with high probability. For each tree node v and time t, let us define the index of v at time t as

I_t(v) ≜ μ_t(v) + (1 + 2 k_A) rad_t(v),   where k_A ≜ 4 √(2/q).   (7)

Here we posit μ_t(v) = 0 if n_t(v) = 0. Let us define the width estimate^9

W_t(v) ≜ max(0, U_t(v) − L_t(v)),   where U_t(v) ≜ max_{u∈T(v), s≤t} μ_s(u) − rad_s(u) and L_t(v) ≜ min_{u∈T(v), s≤t} μ_s(u) + rad_s(u).   (8)

Here U_t(v) is the best available lower confidence bound on max_{x∈X(v)} μ(x), and L_t(v) is the best available upper confidence bound on min_{x∈X(v)} μ(x). If both bounds hold then W_t(v) ≤ W(v).

Throughout the phase, some tree nodes are designated active. We maintain the following invariant:

W_t(v) < k_A rad_t(v) for each active internal node v.   (9)

TaxonomyZoom(T_hor, q) operates as follows. Initially the only active tree node is the root. In each round, the algorithm performs the following three steps:

(S1) While Invariant (9) is violated by some v, de-activate v and activate all its children.
(S2) Select an active tree node v with the maximal index (7), breaking ties arbitrarily.
(S3) Play a randomly sampled arm from T(v).

Note that each arm is activated and deactivated at most once.

Implementation details. If an explicit representation of the taxonomy can be stored in memory, then the following simple implementation is possible. For each tree node v, we store several statistics: n_t, μ_t, U_t and L_t. Further, we maintain a linked list of active nodes, sorted by the index. Suppose in a given round t, a subtree v is chosen, and an arm x is played. We update the statistics by going up the x → v path in the tree (note that only the statistics on this path need to be updated). This update can be done in time O(depth(x)). Then one can check whether Invariant (9) holds for a given node in time O(1). So step (S1) of the algorithm can be implemented in time O(1 + N), where N is the number of nodes activated during this step. Finally, the linked list of active nodes can be updated in time O(1 + N).
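For concreteness, the confidence radius (6) and the index (7) amount to the following computations (a sketch with our own variable names; `q` is the assumed lower bound on quality):

```python
import math

# Sketch of the confidence radius (6) and index (7); variable names ours.
# T_hor: time horizon, n_arms = |X|, n_v = n_t(v) (hits of subtree T(v)),
# mu_hat = mu_t(v) (empirical average), q = assumed lower bound on quality.

def conf_radius(n_v, T_hor, n_arms):
    # rad_t(v) = sqrt( 8 log(T_hor * |X|) / (2 + n_t(v)) )
    return math.sqrt(8.0 * math.log(T_hor * n_arms) / (2 + n_v))

def index(mu_hat, n_v, T_hor, n_arms, q):
    # I_t(v) = mu_t(v) + (1 + 2 k_A) rad_t(v), with k_A = 4 sqrt(2/q)
    k_A = 4.0 * math.sqrt(2.0 / q)
    return mu_hat + (1.0 + 2.0 * k_A) * conf_radius(n_v, T_hor, n_arms)
```

Note how a smaller quality bound q inflates k_A and hence the exploration term of the index, i.e. a weaker taxonomy forces more exploration.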
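Schematically, one round of steps (S1)–(S3) with the statistics above might look as follows. This is our simplified sketch under our own data-structure assumptions (a `Node` class, deterministic toy payoffs at the leaves), not the authors' implementation:

```python
import math
import random

# Simplified sketch (ours) of one round of TaxonomyZoom, steps (S1)-(S3).
# `Node` carries the per-node statistics of Section 3; leaves hold a fixed
# toy payoff instead of a stochastic draw.

class Node:
    def __init__(self, children=(), payoff=None):
        self.children = list(children)
        self.payoff = payoff          # mu(x) for a leaf; None for internal
        self.n = 0                    # n_t(v): number of hits
        self.mu_hat = 0.0             # empirical average payoff mu_t(v)
        self.U = float("-inf")        # running lower conf. bound, as in (8)
        self.L = float("inf")         # running upper conf. bound, as in (8)

def rad(v, T_hor, n_arms):
    """Confidence radius (6)."""
    return math.sqrt(8 * math.log(T_hor * n_arms) / (2 + v.n))

def one_round(active, T_hor, n_arms, k_A):
    # (S1): de-activate internal nodes violating invariant (9),
    # activating their children instead.
    i = 0
    while i < len(active):
        v = active[i]
        width_est = max(0.0, v.U - v.L)        # W_t(v) from (8)
        if v.children and width_est >= k_A * rad(v, T_hor, n_arms):
            active.pop(i)
            active.extend(v.children)
        else:
            i += 1
    # (S2): select an active node with the maximal index (7).
    v = max(active, key=lambda u: u.mu_hat + (1 + 2 * k_A) * rad(u, T_hor, n_arms))
    # (S3): play a random sample from T(v): uniform branching to a leaf,
    # then update the statistics of every subtree that was hit.
    path, u = [v], v
    while u.children:
        u = random.choice(u.children)
        path.append(u)
    reward = u.payoff
    for w in path:
        w.n += 1
        w.mu_hat += (reward - w.mu_hat) / w.n
        r = rad(w, T_hor, n_arms)
        w.U = max(w.U, w.mu_hat - r)
        w.L = min(w.L, w.mu_hat + r)
    return reward

# Tiny demo: two arms; k_A = 4*sqrt(2/q) = 8 for q = 1/2.
random.seed(0)
root = Node(children=[Node(payoff=0.9), Node(payoff=0.1)])
active = [root]
rewards = [one_round(active, T_hor=100, n_arms=2, k_A=8.0) for _ in range(20)]
```

The sketch updates statistics top-down along the sampled path rather than bottom-up as in the paper's implementation notes; the set of updated nodes is the same.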
Then the selections in steps (S2) and (S3) are done in time O(1).

Lemma 3.1. TaxonomyZoom can be implemented with O(1) storage per tree node, so that in each round the time complexity is O(N + depth(x)), where N is the number of arms activated in step (S1), and x is the arm chosen in step (S3).

Sometimes it may be feasible (and more space-efficient) to represent the taxonomy implicitly, so that a tree node is expanded only if needed. Specifically, suppose the following interface is provided: given a tree node v, return all its children and an arbitrary arm x ∈ T(v). Then TaxonomyZoom can be implemented so that it only stores the statistics for each node u such that P(u|v) ≥ q for some active node v (rather than for all tree nodes).^10 The running times are as in Lemma 3.1.

Footnote 9: Defining U_t, L_t in (8) via s ≤ t (rather than s = t) improves performance, but is not essential for the analysis.
Footnote 10: The algorithm needs to be modified slightly; we leave the details to the full version.

4 Analysis: proof of Theorem 2.3

First, let us fix some notation. We will focus on regret up to a fixed time T ≤ T_hor. In what follows, let d = ZoomDim(I, c) for some fixed c > 0. Recall the notation X_δ ≜ {x ∈ X : 0 < Δ(x) ≤ δ} from Definition 1.1. Here δ is the "distance scale"; we will be interested in δ ≥ δ_0, for

δ_0 ≜ (K/T)^{1/(d+2)},   where K ≜ O(c deg k_A² log T_hor).   (10)

We identify a certain high-probability behavior of the algorithm, and argue deterministically conditional on the event that this behavior actually holds.

Definition 4.1.
An execution of TaxonomyZoom is called clean if for each time t ≤ T and all tree nodes v the following two properties hold:

(P1) |μ_t(v) − μ(v)| ≤ rad_t(v) as long as n_t(v) > 0.
(P2) If u ∈ T(v) then n_t(v) P(u|v) ≥ 8 log T implies n_t(u) ≥ (1/2) n_t(v) P(u|v).

Note that in a clean execution the quantities in (8) satisfy the desired high-confidence bounds: U_t(v) ≤ max_{x∈X(v)} μ(x) and L_t(v) ≥ min_{x∈X(v)} μ(x), which implies W(v) ≥ W_t(v).

Lemma 4.2. An execution of TaxonomyZoom is clean with probability at least 1 − 2 T_hor^{−2}.

Proof. For part (P1), fix a tree node v and let ζ_j be the payoff in the j-th round that v has been hit. Then { Σ_{j=1}^{n} (ζ_j − μ(v)) }_{n=1..T} is a martingale.^11 Let ζ̄_n ≜ (1/n) Σ_{j=1}^{n} ζ_j be the n-th average. Then by the Azuma-Hoeffding inequality, for any n ≤ T_hor we have:

Pr[ |ζ̄_n − μ(v)| > r(n) ] ≤ (T_hor |X|)^{−2},   where r(n) ≜ √( 8 log(T_hor |X|) / (2 + n) ).   (11)

Note that rad_t(v) = r(n_t(v)). We obtain (P1) by taking the Union Bound for (11) over all nodes v and all n ≤ T. (This is the only place where we use the log |X| term in (6).) Part (P2) is proved via a similar application of martingales and the Azuma-Hoeffding inequality.

From now on we will argue about a clean execution. Recall that by the definition of W(·),

μ(v) ≤ μ(u) + W(v) for any tree node u ∈ T(v).   (12)

The crux of the proof of Theorem 2.3 is that at all times the maximal index is at least μ*.

Lemma 4.3. Consider a clean execution of TaxonomyZoom(T_hor, q). Then the following holds: in any round t ≤ T_hor, at any point in the execution such that Invariant (9) holds, there exists an active tree node v* such that I_t(v*) ≥ μ*.

Proof. Fix an optimal arm x* ∈ X.
Note that in each round t, there exists an active tree node v* such that x* ∈ T(v*). (One can prove this by induction on t, using the (de)activation rule (S1) in TaxonomyZoom.) Fix round t and the corresponding tree node v* = v*_t.

By Definition 2.2, there exist nodes v_0, v_1 ∈ T(v*) with P(v_0|v*), P(v_1|v*) ≥ q such that |μ(v_1) − μ(v_0)| ≥ W(v*)/2.

Assume that Δ ≜ W(v*) > 0, and define f(Δ) = 8³ log(T_hor) Δ^{−2}. Then for each tree node v,

rad_t(v) ≤ Δ/8  ⟺  n_t(v) ≥ f(Δ).   (13)

Now, for the sake of contradiction, let us suppose that n_t(v*) ≥ ((1/4) k_A)² f(Δ). By (13), this is equivalent to Δ ≥ 2 k_A rad_t(v*). Note that n_t(v*) ≥ (2/q) f(Δ) by our assumption on k_A, so by property (P2) in the definition of a clean execution, for each node v_j, j ∈ {0, 1}, we have n_t(v_j) ≥ f(Δ), which implies rad_t(v_j) ≤ Δ/8. Therefore (8) gives a good estimate of W(v*), namely W_t(v*) ≥ Δ/4. It follows that W_t(v*) ≥ k_A rad_t(v*), which violates Invariant (9).

We proved that W(v*) ≤ 2 k_A rad_t(v*). Using (12), we have Δ(v*) ≤ W(v*) ≤ 2 k_A rad_t(v*) and

I_t(v*) ≥ μ(v*) + 2 k_A rad_t(v*) ≥ μ*,   (14)

where the first inequality in (14) holds by definition (7) and property (P1) of a clean execution.

Footnote 11: To make ζ_n well-defined for any n ≤ T_hor, consider a hypothetical algorithm which coincides with TaxonomyZoom for the first T_hor rounds and then proceeds so that each tree node is selected T_hor times.

We use Lemma 4.3 to show that the algorithm does not activate too many tree nodes with large badness Δ(·), and that each such node is not played too often.
For each tree node v, let N(v) be the number of times node v was selected in step (S2) of the algorithm. Call v positive if N(v) > 0. We partition all positive tree nodes and all deactivated tree nodes into sets

S_i = { positive tree nodes v : 2^{−i} < Δ(v) ≤ 2^{−i+1} },
S*_i = { deactivated tree nodes v : 2^{−i} < 4 W(v) ≤ 2^{−i+1} }.

Lemma 4.4. Consider a clean execution of algorithm TaxonomyZoom(T_hor, q).

(a) For each tree node v we have N(v) ≤ O(k_A² log T_hor) Δ^{−2}(v).
(b) If node v is de-activated at some point in the execution, then Δ(v) ≤ 4 W(v).
(c) For each i, |S*_i| ≤ 2 K_i, where K_i ≜ c 2^{(i+1) d}.
(d) For each i, |S_i| ≤ O(deg K_{i+1}).

Proof. For part (a), fix an arbitrary tree node v and let t be the last time v was selected in step (S2) of the algorithm. By Lemma 4.3, at that point in the execution there was a tree node v* such that I_t(v*) ≥ μ*. Then, using the selection rule (step (S2)) and the definition of the index (7), we have

μ* ≤ I_t(v*) ≤ I_t(v) ≤ μ(v) + (2 + 2 k_A) rad_t(v), and hence Δ(v) ≤ (2 + 2 k_A) rad_t(v).   (15)

It follows that N(v) ≤ n_t(v) ≤ O(k_A² log T_hor) Δ^{−2}(v).

For part (b), suppose tree node v was de-activated at time s. Let t be the last round in which v was selected. Then

W(v) ≥ W_s(v) ≥ k_A rad_s(v) ≥ (1/3)(2 + 2 k_A) rad_t(v) ≥ (1/3) Δ(v).   (16)

Indeed, the first inequality in (16) holds since we are in a clean execution, the second inequality in (16) holds because v was de-activated, the third inequality holds because n_s(v) = n_t(v) + 1, and the last inequality in (16) holds by (15).

For part (c), let us fix i and define Y_i = { x ∈ X : Δ(x) ≤ 2^{−i+1} }. By Definition 1.1, this set can be covered by K_i subtrees T(v_1), ..., T(v_{K_i}), each of width < 2^{−i}/4. Fix a deactivated tree node v ∈ S*_i.
For each arm x ∈ X in the subtree T(v) we have, by part (b),

Δ(x) ≤ Δ(v) + W(v) ≤ 4 W(v) ≤ 2^{−i+1},

so x ∈ Y_i and therefore is contained in some T(v_j). Note that v_j ∈ T(v), since W(v) > W(v_j). It follows that the subtrees T(v_1), ..., T(v_{K_i}) cover the leaf set of T(v).

Consider the graph G on the node set S*_i ∪ {v_1, ..., v_{K_i}}, where two nodes u, v are connected by a directed edge (u, v) if there is a path from u to v in the tree T. This is a directed forest of out-degree at least 2, whose leaf set is a subset of {v_1, ..., v_{K_i}}. Since in any directed tree of out-degree ≥ 2 the number of nodes is at most twice the number of leaves, G contains at most K_i internal nodes. Thus |S*_i| ≤ 2 K_i, proving part (c).

For part (d), let us fix i and consider a positive tree node u ∈ S_i. Since N(u) > 0, either u is active at time T_hor, or it was deactivated in some round before T_hor. In the former case, let v be the parent of u. In the latter case, let v = u. Then by part (b) we have 2^{−i} ≤ Δ(u) ≤ Δ(v) + W(v) ≤ 4 W(v), so v ∈ S*_j for some j ≤ i + 1.

For each tree node v, define its family as the set which consists of v itself and all its children. We have proved that each positive node u ∈ S_i belongs to the family of some deactivated node v ∈ ∪_{j=1}^{i+1} S*_j. Since each family consists of at most 1 + deg nodes, it follows that

|S_i| ≤ (1 + deg) Σ_{j=1}^{i+1} |S*_j| ≤ (1 + deg) Σ_{j=1}^{i+1} 2 K_j ≤ O(deg K_{i+1}).

Proof of Theorem 2.3. The theorem follows from Lemma 4.4(a-d). Let us assume a clean execution. (Recall that by Lemma 4.2 the failure probability is sufficiently small to be neglected.) Then:

Σ_{v∈S_i} N(v) Δ(v) ≤ O(k_A² log T_hor) Σ_{v∈S_i} Δ^{−1}(v) ≤ O(k_A² log T_hor) |S_i| 2^i ≤ K 2^{(i+2)(1+d)},

where K is defined in (10).
For any δ_0 = 2^{−i_0} we have

R(T) ≤ Σ_{tree nodes v} N(v) Δ(v)
     = Σ_{v: Δ(v)<δ_0} N(v) Δ(v) + Σ_{v: Δ(v)≥δ_0} N(v) Δ(v)
     ≤ δ_0 T + Σ_{i≤i_0} Σ_{v∈S_i} N(v) Δ(v)
     ≤ δ_0 T + Σ_{i≤i_0} K 2^{(i+2)(1+d)}
     ≤ δ_0 T + O(K) (8/δ_0)^{1+d}.

We obtain the desired regret bound (4) by setting δ_0 as in (10).

5 (De)parameterizing the algorithm

Recall that TaxonomyZoom needs to be parameterized by T_hor and q. The dependence on these parameters can be removed using a suitable version of the standard doubling trick: consider a "meta-algorithm" that proceeds in phases, so that in each phase i = 1, 2, 3, ... a fresh instance of TaxonomyZoom(2^i, q_i) is run for 2^i rounds, where q_i slowly decreases with i. For instance, if we take q_i = 2^{−αi} for some α ∈ (0, 1), then this meta-algorithm has regret

R(T) ≤ O(c deg log T)^{1/(2+d)} × T^{1−(1−α)/(2+d)}   for all T ≥ quality^{−1/α},   (17)

where d = ZoomDim(I, c), for any given c > 0.

While the doubling trick is very useful in the theory of online decision problems, its practical importance is questionable, as running a fresh algorithm instance in each phase seems unnecessarily wasteful. We conjecture that in practice one could run a single instance of the algorithm while gradually increasing T_hor and decreasing q. However, providing provable guarantees for this modified algorithm seems beyond the current techniques. In particular, extending a much simpler analysis of the zooming algorithm [20] to arbitrary time horizon remains a challenge.^12

Further, we conjecture that TaxonomyZoom will typically work in practice even if the parameters are misspecified, i.e. even if T_hor is too low and q is too optimistic.
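The phase schedule of the doubling trick just described can be sketched generically as follows (our illustration; actually running the TaxonomyZoom instances is elided):

```python
# Sketch (ours) of the doubling-trick schedule from Section 5: phase i
# runs a fresh TaxonomyZoom(2**i, q_i) for 2**i rounds, with
# q_i = 2**(-alpha * i) for some alpha in (0, 1).

def doubling_schedule(total_rounds, alpha=0.5):
    """Return (phase i, horizon 2**i, q_i) triples covering `total_rounds`."""
    schedule, covered, i = [], 0, 1
    while covered < total_rounds:
        horizon = 2 ** i
        schedule.append((i, horizon, 2.0 ** (-alpha * i)))
        covered += horizon
        i += 1
    return schedule

sched = doubling_schedule(10)   # phases of length 2, 4, 8 cover 10 rounds
```

Each phase discards all statistics from the previous one, which is exactly the wastefulness the conjecture above hopes to avoid.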
Indeed, recall that our algorithm is index-based, in the style of UCB1 [2]. The only place where the parameters are invoked is in the definition of the index (7), namely in the constant in front of the exploration term. It has been observed in [28, 29] that in a related MAB setting, reducing this constant to 1 from the theoretically mandated Θ(log T)-type term actually improves algorithms' performance in simulations.

6 Conclusions

In this paper, we have extended multi-armed bandit learning algorithms to settings with large numbers of available strategies. Whereas the most effective previous approaches rely on explicitly knowing the distance between available strategies, we consider the case where the distances are implicit in a hierarchy of available strategies. We have provided a learning algorithm for this setting, and shown that its performance almost matches the best known guarantees for the Lipschitz MAB problem. Further, we have shown how our approach results in stronger provable guarantees than alternative algorithms such as tree bandit algorithms [21, 24].

We conjecture that the dependence on quality (or some version thereof) is necessary for the worst-case regret bounds, even if ZoomDim is low. It is an open question whether there are non-trivial families of problem instances with low quality for which one could achieve low regret.

Our results suggest some natural extensions. Most interestingly, a number of applications recently posed as MAB problems over large sets of arms – including learning to rank online advertisements or web documents (e.g. [26, 29]) – naturally involve choosing among arms (e.g. ads) that can be classified according to any of a number of hierarchies (e.g. by class of product sold, geographic location, etc.). In particular, such different hierarchies may be of different usefulness.
Selecting among, or combining from, a set of available hierarchical representations of arms poses interesting challenges. More generally, we would like to generalize Theorem 2.3 to other structures that implicitly define a metric space on arms (in the sense of (1)). One specific target would be directed acyclic graphs. While our algorithm is well-defined for this setting, the theoretical analysis does not apply.

^12 However, [7] obtains similar guarantees for arbitrary time horizon, with a different algorithm.

References

[1] Rajeev Agrawal. The continuum-armed bandit problem. SIAM J. Control and Optimization, 33(6):1926–1951, 1995.

[2] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002. Preliminary version in 15th ICML, 1998.

[3] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002. Preliminary version in 36th IEEE FOCS, 1995.

[4] Peter Auer, Ronald Ortner, and Csaba Szepesvári. Improved Rates for the Stochastic Continuum-Armed Bandit Problem. In 20th COLT, pages 454–468, 2007.

[5] Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. J. of Computer and System Sciences, 74(1):97–114, February 2008. Preliminary version in 36th ACM STOC, 2004.

[6] Andrei Broder, Marcus Fontoura, Vanja Josifovski, and Lance Riedel. A semantic approach to contextual advertising. In 30th SIGIR, pages 559–566, 2007.

[7] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. Online Optimization in X-Armed Bandits. J. of Machine Learning Research (JMLR), 12:1587–1627, 2011. Preliminary version in NIPS 2008.

[8] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge Univ. Press, 2006.

[9] Eric Cope.
Regret and convergence bounds for immediate-reward reinforcement learning with continuous action spaces. IEEE Trans. on Automatic Control, 54(6):1243–1253, 2009. A manuscript from 2004.

[10] Varsha Dani and Thomas P. Hayes. Robbing the bandit: less regret in online geometric optimization against an adaptive adversary. In 17th ACM-SIAM SODA, pages 937–943, 2006.

[11] Varsha Dani, Thomas P. Hayes, and Sham Kakade. The Price of Bandit Information for Online Optimization. In 20th NIPS, 2007.

[12] Abraham Flaxman, Adam Kalai, and H. Brendan McMahan. Online Convex Optimization in the Bandit Setting: Gradient Descent without a Gradient. In 16th ACM-SIAM SODA, pages 385–394, 2005.

[13] Sylvain Gelly and David Silver. Combining online and offline knowledge in UCT. In 24th ICML, 2007.

[14] Sylvain Gelly and David Silver. Achieving master level play in 9x9 computer go. In 23rd AAAI, 2008.

[15] Anupam Gupta, Robert Krauthgamer, and James R. Lee. Bounded geometries, fractals, and low-distortion embeddings. In 44th IEEE FOCS, pages 534–543, 2003.

[16] Sham M. Kakade, Adam T. Kalai, and Katrina Ligett. Playing Games with Approximation Algorithms. In 39th ACM STOC, 2007.

[17] Robert Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In 18th NIPS, 2004.

[18] Robert Kleinberg. Online Decision Problems with Large Strategy Sets. PhD thesis, MIT, 2005.

[19] Robert Kleinberg and Aleksandrs Slivkins. Sharp Dichotomies for Regret Minimization in Metric Spaces. In 21st ACM-SIAM SODA, 2010.

[20] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-Armed Bandits in Metric Spaces. In 40th ACM STOC, pages 681–690, 2008.

[21] Levente Kocsis and Csaba Szepesvári. Bandit Based Monte-Carlo Planning. In 17th ECML, pages 282–293, 2006.

[22] T.L. Lai and Herbert Robbins. Asymptotically efficient Adaptive Allocation Rules.
Advances in Applied Mathematics, 6:4–22, 1985.

[23] H. Brendan McMahan and Avrim Blum. Online Geometric Optimization in the Bandit Setting Against an Adaptive Adversary. In 17th COLT, pages 109–123, 2004.

[24] Rémi Munos and Pierre-Arnaud Coquelin. Bandit algorithms for tree search. In 23rd UAI, 2007.

[25] Sandeep Pandey, Deepak Agarwal, Deepayan Chakrabarti, and Vanja Josifovski. Bandits for Taxonomies: A Model-based Approach. In SDM, 2007.

[26] Sandeep Pandey, Deepayan Chakrabarti, and Deepak Agarwal. Multi-armed Bandit Problems with Dependent Arms. In 24th ICML, 2007.

[27] Paul N. Bennett, Krysta Marie Svore, and Susan T. Dumais. Classification-enhanced ranking. In 19th WWW, pages 111–120, 2010.

[28] Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In 25th ICML, pages 784–791, 2008.

[29] Aleksandrs Slivkins, Filip Radlinski, and Sreenivas Gollapudi. Learning optimally diverse rankings over large document collections. In 27th ICML, pages 983–990, 2010.