{"title": "Multi-Armed Bandits with Metric Movement Costs", "book": "Advances in Neural Information Processing Systems", "page_first": 4119, "page_last": 4128, "abstract": "We consider the non-stochastic Multi-Armed Bandit problem in a setting where there is a fixed and known metric on the action space that determines a cost for switching between any pair of actions. The loss of the online learner has two components: the first is the usual loss of the selected actions, and the second is an additional loss due to switching between actions. Our main contribution gives a tight characterization of the expected minimax regret in this setting, in terms of a complexity measure $\\mathcal{C}$ of the underlying metric which depends on its covering numbers. In finite metric spaces with $k$ actions, we give an efficient algorithm that achieves regret of the form $\\widetilde{O}(\\max\\{\\mathcal{C}^{1/3}T^{2/3},\\sqrt{kT}\\})$, and show that this is the best possible. Our regret bound generalizes previously known regret bounds for some special cases: (i) the unit-switching cost regret $\\widetilde{\\Theta}(\\max\\{k^{1/3}T^{2/3},\\sqrt{kT}\\})$ where $\\mathcal{C}=\\Theta(k)$, and (ii) the interval metric with regret $\\widetilde{\\Theta}(\\max\\{T^{2/3},\\sqrt{kT}\\})$ where $\\mathcal{C}=\\Theta(1)$. 
For infinite metric spaces with Lipschitz loss functions, we derive a tight regret bound of $\\widetilde{\\Theta}(T^{\\frac{d+1}{d+2}})$ where $d \\ge 1$ is the Minkowski dimension of the space, which is known to be tight even when there are no switching costs.", "full_text": "Multi-Armed Bandits with Metric Movement Costs\n\nTomer Koren\nGoogle Brain\ntkoren@google.com\n\nRoi Livni\nPrinceton University\nrlivni@cs.princeton.edu\n\nYishay Mansour\nTel Aviv University and Google\nmansour@cs.tau.ac.il\n\nAbstract\n\nWe consider the non-stochastic Multi-Armed Bandit problem in a setting where there is a fixed and known metric on the action space that determines a cost for switching between any pair of actions. The loss of the online learner has two components: the first is the usual loss of the selected actions, and the second is an additional loss due to switching between actions. Our main contribution gives a tight characterization of the expected minimax regret in this setting, in terms of a complexity measure C of the underlying metric which depends on its covering numbers. In finite metric spaces with k actions, we give an efficient algorithm that achieves regret of the form $\\widetilde{O}(\\max\\{C^{1/3}T^{2/3}, \\sqrt{kT}\\})$, and show that this is the best possible. Our regret bound generalizes previously known regret bounds for some special cases: (i) the unit-switching cost regret $\\widetilde{\\Theta}(\\max\\{k^{1/3}T^{2/3}, \\sqrt{kT}\\})$ where $C = \\Theta(k)$, and (ii) the interval metric with regret $\\widetilde{\\Theta}(\\max\\{T^{2/3}, \\sqrt{kT}\\})$ where $C = \\Theta(1)$. For infinite metric spaces with Lipschitz loss functions, we derive a tight regret bound of $\\widetilde{\\Theta}(T^{\\frac{d+1}{d+2}})$ where $d \\ge 1$ is the Minkowski dimension of the space, which is known to be tight even when there are no switching costs.\n\n1 Introduction\n\nMulti-Armed Bandit (MAB) is perhaps one of the most well-studied models for learning that allows incorporating settings with limited feedback. In its simplest form, MAB can be thought of as a game between a learner and an adversary: At first, the adversary chooses an arbitrary sequence of losses $\\ell_1, \\ldots, \\ell_T$ (possibly adversarially). Then, at each round the learner chooses an action $i_t$ from a finite set of actions K. At the end of each round, the learner gets to observe her loss $\\ell_t(i_t)$, and only the loss of her chosen action. The objective of the learner is to minimize her (external) regret, defined as the expected difference between her loss, $\\sum_{t=1}^{T} \\ell_t(i_t)$, and the loss of the best action in hindsight, i.e., $\\min_{i \\in K} \\sum_{t=1}^{T} \\ell_t(i)$.\n\nOne simplification of the MAB is that it assumes that the learner can switch between actions without any cost; this is in contrast to online algorithms that maintain a state and have a cost of switching between states. One simple intermediate solution is to add further costs to the learner that penalize movements between actions. (Since we compare the learner to the single best action, the adversary has no movement and hence no movement cost.) 
This approach has been studied in the MAB with unit switching costs [2, 12], where the learner is not only penalized for her loss but also pays a unit cost any time she switches between actions. This simple penalty implicitly advocates the construction of algorithms that avoid frequent fluctuation in their decisions. Regulating switching has been successfully applied to many interesting instances such as buffering problems [16], limited-delay lossy coding [19] and dynamic pricing with patient buyers [15].\n\nThe unit switching cost assumes that any pair of actions have the same cost, which in many scenarios is far from true. For example, consider an ice-cream vendor on a beach, whose actions are to select a location and price. Clearly, changing location comes at a cost, while changing prices might come with no cost. In this case we can define an interval metric (the coast line) and the movement cost is the distance. A more involved case is a hot-dog vendor in Manhattan, who needs to select a location and price. Again, it makes sense to charge a switching cost between locations according to their distance, and in this case the Manhattan distance seems the most appropriate. Such settings are at the core of our model for MAB with movement cost. The authors of [24] considered a MAB problem equipped with an interval metric, i.e., the actions are [0, 1] and the movement cost is the distance between the actions. They proposed a new online algorithm, called the Slowly Moving Bandit (SMB) algorithm, that achieves optimal regret bound for this setting, and applied it to a dynamic pricing problem with patient buyers to achieve a new tight regret bound.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThe objective of this paper is to handle general metric spaces, both finite and infinite. 
We show how to generalize the SMB algorithm and its analysis to design optimal moving-cost algorithms for any metric space over a finite decision space. Our main result identifies an intrinsic complexity measure of the metric space, which we call the covering/packing complexity, and gives a tight characterization of the expected movement regret in terms of the complexity of the underlying metric. In particular, in finite metric spaces of complexity C with k actions, we give a regret bound of the form $\\widetilde{O}(\\max\\{C^{1/3}T^{2/3}, \\sqrt{kT}\\})$ and present an efficient algorithm that achieves it. We also give a matching $\\widetilde{\\Omega}(\\max\\{C^{1/3}T^{2/3}, \\sqrt{kT}\\})$ lower bound that applies to any metric with complexity C.\n\nWe extend our results to general continuous metric spaces. For such settings we clearly have to make some assumption about the losses, and we make the rather standard assumption that the losses are Lipschitz with respect to the underlying metric. In this setting our results depend on quite different complexity measures: the upper and lower Minkowski dimensions of the space, thus exhibiting a phase transition between the finite case (that corresponds to Minkowski dimension zero) and the infinite case. Specifically, we give an upper bound on the regret of $\\widetilde{O}(T^{\\frac{d+1}{d+2}})$ where $d \\ge 1$ is the upper Minkowski dimension. When the upper and lower Minkowski dimensions coincide, which is the case in many natural spaces such as normed vector spaces, the latter bound matches a lower bound of [10] that holds even when there are no switching costs. 
Thus, a surprising implication of our result is that in infinite action spaces (of bounded Minkowski dimension), adding movement costs does not add to the complexity of the MAB problem!\n\nOur approach extends the techniques of [24] for the SMB algorithm, which was designed to optimize over an interval metric, which is equivalent to a complete binary Hierarchically well-Separated Tree (HST) metric space. By carefully balancing and regulating its sampling distributions, the SMB algorithm avoids switching between far-apart nodes in the tree and possibly incurring large movement costs with respect to the associated metric. We show that the SMB regret guarantees are much more general than just binary balanced trees, and give an analysis of the SMB algorithm when applied to general HSTs. As a second step, we show that a rich class of trees, on which the SMB algorithm can be applied, can be used to upper-bound any general metric. Finally, we reduce the case of an infinite metric space to the finite case via simple discretization, and show that this reduction gives rise to the Minkowski dimension as a natural complexity measure. All of these contractions turn out to be optimal (up to logarithmic factors), as demonstrated by our matching lower bounds.\n\n1.1 Related Work\n\nPerhaps the most well-known classical algorithm for non-stochastic bandits is the Exp3 algorithm [4], which guarantees a regret of $\\widetilde{O}(\\sqrt{kT})$ without movement costs. However, for general MAB algorithms there are no guarantees for slow movement between actions. In fact, it is known that in the worst case $\\widetilde{\\Omega}(T)$ switches between actions are expected (see [12]).\n\nA simple case of MAB with movement cost is the uniform metric, i.e., when the distance between any two actions is the same. This setting has seen intensive study, both in terms of analyzing optimal regret rates [2, 12], as well as applications [16, 19, 15]. 
Our main technical tool for achieving lower bounds is the lower bound of Dekel et al. [12], which achieves such a bound for this special case.\n\nThe general problem of bandits with movement costs was first introduced in [24], where the authors gave an efficient algorithm for a 2-HST binary balanced tree metric, as well as for evenly spaced points on the interval. The main contribution of this paper is a generalization of these results to general metric spaces.\n\nThere is a vast and vigorous study of MAB in continuous spaces [23, 11, 5, 10, 32]. These works relate the change in the payoff to the change in the action. Specifically, there has been a vast body of research on Lipschitz MAB with stochastic payoffs [22, 29, 30, 21, 26], where, roughly, the expected reward is Lipschitz. For applying our results in continuous spaces we too need to assume Lipschitz losses; however, our metric also defines the movement cost between actions and not only relates the losses of similar actions. Our general finding is that in Euclidean spaces, one can achieve the same regret bounds when movement cost is applied. Thus, the SMB algorithm can achieve the optimal regret rate.\n\nOne can model our problem as a deterministic Markov Decision Process (MDP), where the states are the MAB actions and in every state there is an action to move the MDP to a given state (which corresponds to switching actions). The payoff would be the payoff of the MAB action associated with the state plus the movement cost to the next state. The work of Ortner [28] studies deterministic MDPs where the payoffs are stochastic, and also allows for a fixed uniform switching cost. The work of Even-Dar et al. [13] and its extensions [27, 33] studies an MDP where the payoffs are adversarial but there is full information of the payoffs. Later this work was extended to the bandit model by Neu et al. [27]. 
This line of work imposes various assumptions regarding the MDP and the benchmark policies, specifically, that the MDP is \u201cmixing\u201d and that the policies considered have full-support stationary distributions, assumptions that clearly fail in our very specific setting.\n\nBayesian MAB, such as in the Gittins index (see [17]), assumes that the payoffs are from some stochastic process. It is known that when there are switching costs then the existence of an optimal index policy is not guaranteed [6]. There have been some works on special cases with a fixed uniform switching cost [1, 3]. The most relevant work is that of Guha and Munagala [18] which for a general metric over the actions gives a constant approximation off-line algorithm. For a survey of switching costs in this context see [20].\n\nThe MAB problem with movement costs is related to the literature on online algorithms and the competitive analysis framework [8]. A prototypical online problem is the Metrical Task System (MTS) presented by Borodin et al. [9]. In a metrical task system there is a collection of states and a metric over the states. Similar to MAB, the online algorithm at each time step moves to a state, incurs a movement cost according to the metric, and suffers a loss that corresponds to that state. However, unlike MAB, in an MTS the online algorithm is given the loss prior to selecting the new state. Furthermore, competitive analysis has a much more stringent benchmark: the best sequence of actions in retrospect. Like most of the regret minimization literature, we use the best single action in hindsight as a benchmark, aiming for a vanishing average regret.\n\nOne of our main technical tools is an approximation from above of a metric via a Metric Tree (i.e., 2-HST). k-HST metrics have been vastly studied in online algorithms starting with [7]. 
The main goal is to derive a simpler metric representation (using randomized trees) that will both upper and lower bound the given metric. The main result is to show a bound of O(log n) on the expected stretch of any edge, and this is also the best possible [14]. It is noteworthy that for bandit learning, and in contrast with these works, an upper bound over the metric suffices to achieve optimal regret rate. This is since in online learning we compete against the best static action in hindsight, which does not move at all and hence has zero movement cost. In contrast, in an MTS, where one competes against the best dynamic sequence of actions, one needs both an upper and a lower bound on the metric.\n\n2 Problem Setup and Background\n\nIn this section we recall the setting of Multi-armed Bandit with Movement Costs introduced in [24], and review the necessary background required to state our main results.\n\n2.1 Multi-armed Bandits with Movement Costs\n\nIn the Multi-armed Bandits (MAB) with Movement Costs problem, we consider a game between an online learner and an adversary continuing for T rounds. There is a set K, possibly infinite, of actions (or \u201carms\u201d) that the learner can choose from. The set of actions is equipped with a fixed and known metric \u2206 that determines a cost \u2206(i, j) \u2208 [0, 1] for moving between any pair of actions i, j \u2208 K.\n\nBefore the game begins, an adversary fixes a sequence $\\ell_1, \\ldots, \\ell_T : K \\to [0, 1]$ of loss functions assigning loss values in [0, 1] to actions in K (in particular, we assume an oblivious adversary). Then, on each round t = 1, . . . , T, the learner picks an action $i_t \\in K$, possibly at random. At the end of each round t, the learner gets to observe her loss (namely, $\\ell_t(i_t)$) and nothing else. 
In contrast with the standard MAB setting, in addition to the loss $\\ell_t(i_t)$ the learner suffers an additional cost due to her movement between actions, which is determined by the metric and is equal to $\\Delta(i_t, i_{t-1})$. Thus, the total cost at round t is given by $\\ell_t(i_t) + \\Delta(i_t, i_{t-1})$.\n\nThe goal of the learner, over the course of T rounds of the game, is to minimize her expected movement-regret, which is defined as the difference between her (expected) total costs and the total costs of the best fixed action in hindsight (that incurs no movement costs); namely, the movement regret with respect to a sequence $\\ell_{1:T}$ of loss vectors and a metric \u2206 equals\n\n$$\\mathrm{Regret}_{\\mathrm{MC}}(\\ell_{1:T}, \\Delta) = \\mathbb{E}\\left[ \\sum_{t=1}^{T} \\ell_t(i_t) + \\sum_{t=2}^{T} \\Delta(i_t, i_{t-1}) \\right] - \\min_{i \\in K} \\sum_{t=1}^{T} \\ell_t(i) .$$\n\nHere, the expectation is taken with respect to the learner\u2019s randomization in choosing the actions $i_1, \\ldots, i_T$; notice that, as we assume an oblivious adversary, the loss functions $\\ell_t$ are deterministic and cannot depend on the learner\u2019s randomization.\n\n2.2 Basic Definitions in Metric Spaces\n\nWe recall basic notions in metric spaces that govern the regret in the MAB with movement costs setting. Throughout we assume a bounded metric space (K, \u2206), where for normalization we assume \u2206(i, j) \u2208 [0, 1] for all i, j \u2208 K. Given a point i \u2208 K we will denote by $B_\\epsilon(i) = \\{ j \\in K : \\Delta(i, j) \\le \\epsilon \\}$ the ball of radius $\\epsilon$ around i.\n\nThe following definitions are standard.\n\nDefinition 1 (Packing numbers). A subset P \u2282 K in a metric space (K, \u2206) is an $\\epsilon$-packing if the sets $\\{B_\\epsilon(i)\\}_{i \\in P}$ are disjoint sets. 
The $\\epsilon$-packing number of \u2206, denoted $N^p_\\epsilon(\\Delta)$, is the maximum cardinality of any $\\epsilon$-packing of K.\n\nDefinition 2 (Covering numbers). A subset C \u2282 K in a metric space (K, \u2206) is an $\\epsilon$-covering if $K \\subseteq \\cup_{i \\in C} B_\\epsilon(i)$. The $\\epsilon$-covering number of K, denoted $N^c_\\epsilon(\\Delta)$, is the minimum cardinality of any $\\epsilon$-covering of K.\n\nTree metrics and HSTs. We recall the notion of a tree metric, and in particular, a metric induced by a Hierarchically well-Separated Tree (HST); see [7] for more details. Any weighted tree defines a metric over the vertices, by considering the shortest path between each two nodes. An HST tree (2-HST tree, to be precise) is a rooted weighted tree such that: 1) the edge weight from any node to each of its children is the same and 2) the edge weights along any path from the root to a leaf are decreasing by a factor 2 per edge. We will also assume that all leaves are of the same depth in the tree (this does not imply that the tree is complete).\n\nGiven a tree $\\mathcal{T}$ we let $\\mathrm{depth}(\\mathcal{T})$ denote its height, which is the maximal length of a path from any leaf to the root. Let level(v) be the level of a node $v \\in \\mathcal{T}$, where the level of the leaves is 0 and the level of the root is $\\mathrm{depth}(\\mathcal{T})$. Given nodes $u, v \\in \\mathcal{T}$, let LCA(u, v) be their least common ancestor node in $\\mathcal{T}$. The metric which we next define is equivalent (up to a constant factor) to the standard tree metric induced over the leaves by an HST. By a slight abuse of terminology, we will call it the HST metric:\n\nDefinition 3 (HST metric). Let K be a finite set and let $\\mathcal{T}$ be a tree whose leaves are at the same depth and are indexed by elements of K. 
Then the HST metric $\\Delta_\\mathcal{T}$ over K induced by the tree $\\mathcal{T}$ is defined as follows:\n\n$$\\Delta_\\mathcal{T}(i, j) = \\frac{2^{\\mathrm{level}(\\mathrm{LCA}(i, j))}}{2^{\\mathrm{depth}(\\mathcal{T})}} \\qquad \\forall\\, i, j \\in K.$$\n\nFor an HST metric $\\Delta_\\mathcal{T}$, observe that the packing number and covering number are simple to characterize: for all $0 \\le h < \\mathrm{depth}(\\mathcal{T})$ we have that for $\\epsilon = 2^{h-H}$, where $H = \\mathrm{depth}(\\mathcal{T})$,\n\n$$N^c_\\epsilon(\\Delta_\\mathcal{T}) = N^p_\\epsilon(\\Delta_\\mathcal{T}) = \\big|\\{v \\in \\mathcal{T} : \\mathrm{level}(v) = h\\}\\big|.$$\n\nComplexity measures for finite metric spaces. We next define the two notions of complexity that, as we will later see, govern the complexity of MAB with metric movement costs.\n\nDefinition 4 (covering complexity). The covering complexity of a metric space (K, \u2206), denoted $C_c(\\Delta)$, is given by\n\n$$C_c(\\Delta) = \\sup_{0 < \\epsilon < 1} \\epsilon \\cdot N^c_\\epsilon(\\Delta).$$\n\nDefinition 5 (packing complexity). The packing complexity of a metric space (K, \u2206), denoted $C_p(\\Delta)$, is given by\n\n$$C_p(\\Delta) = \\sup_{0 < \\epsilon < 1} \\epsilon \\cdot N^p_\\epsilon(\\Delta).$$\n\nFor an HST metric, the two complexity measures coincide as its packing and covering numbers are the same. Therefore, for an HST metric $\\Delta_\\mathcal{T}$ we will simply denote the complexity of $(K, \\Delta_\\mathcal{T})$ as $C(\\mathcal{T})$. In fact, it is known that in any metric space $N^p_\\epsilon(\\Delta) \\le N^c_\\epsilon(\\Delta) \\le N^p_{\\epsilon/2}(\\Delta)$ for all $\\epsilon > 0$. Thus, for a general metric space we obtain that\n\n$$C_p(\\Delta) \\le C_c(\\Delta) \\le 2 C_p(\\Delta). \\qquad (1)$$\n\nComplexity measures for infinite metric spaces. For infinite metric spaces, we require the following definition.\n\nDefinition 6 (Minkowski dimensions). Let (K, \u2206) be a bounded metric space. 
The upper Minkowski dimension of (K, \u2206), denoted $\\overline{D}(\\Delta)$, is defined as\n\n$$\\overline{D}(\\Delta) = \\limsup_{\\epsilon \\to 0} \\frac{\\log N^p_\\epsilon(\\Delta)}{\\log(1/\\epsilon)} = \\limsup_{\\epsilon \\to 0} \\frac{\\log N^c_\\epsilon(\\Delta)}{\\log(1/\\epsilon)} .$$\n\nSimilarly, the lower Minkowski dimension is denoted by $\\underline{D}(\\Delta)$ and is defined as\n\n$$\\underline{D}(\\Delta) = \\liminf_{\\epsilon \\to 0} \\frac{\\log N^p_\\epsilon(\\Delta)}{\\log(1/\\epsilon)} = \\liminf_{\\epsilon \\to 0} \\frac{\\log N^c_\\epsilon(\\Delta)}{\\log(1/\\epsilon)} .$$\n\nWe refer to [31] for more background on the Minkowski dimensions and related notions in metric space theory.\n\n3 Main Results\n\nWe now state the main results of the paper, which give a complete characterization of the expected regret in the MAB with movement costs problem.\n\n3.1 Finite Metric Spaces\n\nThe following are the main results of the paper.\n\nTheorem 7 (Upper Bound). Let (K, \u2206) be a finite metric space over |K| = k elements with diameter \u2264 1 and covering complexity $C_c = C_c(\\Delta)$. There exists an algorithm that, for any sequence of loss functions $\\ell_1, \\ldots, \\ell_T$, guarantees that\n\n$$\\mathrm{Regret}_{\\mathrm{MC}}(\\ell_{1:T}, \\Delta) = \\widetilde{O}\\big( \\max\\big\\{ C_c^{1/3} T^{2/3}, \\sqrt{kT} \\big\\} \\big).$$\n\nTheorem 8 (Lower Bound). Let (K, \u2206) be a finite metric space over |K| = k elements with diameter \u2265 1 and packing complexity $C_p = C_p(\\Delta)$. For any algorithm there exists a sequence $\\ell_1, \\ldots, \\ell_T$ of loss functions such that\n\n$$\\mathrm{Regret}_{\\mathrm{MC}}(\\ell_{1:T}, \\Delta) = \\widetilde{\\Omega}\\big( \\max\\big\\{ C_p^{1/3} T^{2/3}, \\sqrt{kT} \\big\\} \\big).$$\n\nFor the detailed proofs, see the full version of the paper [25]. Recalling Eq. (1), we see that the regret bounds obtained in Theorems 7 and 8 are matching up to logarithmic factors. Notice that the tightness is achieved per instance; namely, for any given metric we are able to fully characterize the regret\u2019s rate of growth as a function of the intrinsic properties of the metric. 
(In particular, this is substantially stronger than demonstrating a specific metric for which the upper bound cannot be improved.) Note that for the lower bound statement in Theorem 8 we require that the diameter of K is bounded away from zero, where for simplicity we assume a constant bound of 1. Such an assumption is necessary to avoid degenerate metrics. Indeed, when the diameter is very small, the problem reduces to the standard MAB setting without any additional costs and we obtain a regret rate of $\\Omega(\\sqrt{kT})$.\n\nNotice how the above results extend known instances of the problem from previous work: for uniform movement costs (i.e., unit switching costs) over K = {1, . . . , k} we have $C_c = \\Theta(k)$, so that the obtained bound is $\\widetilde{\\Theta}(\\max\\{k^{1/3}T^{2/3}, \\sqrt{kT}\\})$, which recovers the results in [2, 12]; and for a 2-HST binary balanced tree with k leaves, we have $C_c = \\Theta(1)$ and the resulting bound is $\\widetilde{\\Theta}(\\max\\{T^{2/3}, \\sqrt{kT}\\})$, which is identical to the bound proved in [24].\n\nThe 2-HST regret bound in [24] was primarily used to obtain regret bounds for the action space K = [0, 1]. In the next section we show how this technique is extended to infinite metric spaces to obtain regret bounds that depend on the dimensionality of the action space.\n\n3.2 Infinite Metric Spaces\n\nWhen (K, \u2206) is an infinite metric space, without additional constraints on the loss functions, the problem becomes ill-posed with a linear regret rate, even without movement costs. Therefore, one has to make additional assumptions on the loss functions in order to achieve sublinear regret. One natural assumption, which is common in previous work, is to assume that the loss functions $\\ell_1, \\ldots, \\ell_T$ are all 1-Lipschitz with respect to the metric \u2206. Under this assumption, we have the following result.\n\nTheorem 9. 
Let (K, \u2206) be a metric space with diameter \u2264 1 and upper Minkowski dimension $d = \\overline{D}(\\Delta)$, such that $d \\ge 1$. There exists a strategy that, for any sequence of loss functions $\\ell_1, \\ldots, \\ell_T$, which are all 1-Lipschitz with respect to \u2206, guarantees that\n\n$$\\mathrm{Regret}_{\\mathrm{MC}}(\\ell_{1:T}, \\Delta) = \\widetilde{O}\\big( T^{\\frac{d+1}{d+2}} \\big).$$\n\nWe refer to the full version of the paper [25] for a proof of the theorem. Again, we observe that the above result extends the case of K = [0, 1] where d = 1. Indeed, for Lipschitz functions over the interval a tight regret bound of $\\widetilde{\\Theta}(T^{2/3})$ was achieved in [24], which is exactly the bound we obtain above.\n\nWe mention that a lower bound of $\\widetilde{\\Omega}(T^{\\frac{d+1}{d+2}})$ is known for MAB in metric spaces with Lipschitz cost functions, even without movement costs, where $d = \\underline{D}(\\Delta)$ is the lower Minkowski dimension.\n\nTheorem 10 (Bubeck et al. [10]). Let (K, \u2206) be a metric space with diameter \u2264 1 and lower Minkowski dimension $d = \\underline{D}(\\Delta)$, such that $d \\ge 1$. Then for any learning algorithm, there exists a sequence of loss functions $\\ell_1, \\ldots, \\ell_T$, which are all 1-Lipschitz with respect to \u2206, such that the regret (without movement costs) is $\\widetilde{\\Omega}\\big( T^{\\frac{d+1}{d+2}} \\big)$.\n\nIn many natural metric spaces in which the upper and lower Minkowski dimensions coincide (e.g., normed spaces), the bound of Theorem 9 is tight up to logarithmic factors in T. In particular, and quite surprisingly, we see that the movement costs do not add to the regret of the problem!\n\nIt is important to note that Theorem 9 holds only for metric spaces whose (upper) Minkowski dimension is at least 1. Indeed, finite metric spaces are of Minkowski dimension zero, and as we demonstrated in Section 3.1 above, an $O(\\sqrt{T})$ regret bound is not achievable. 
Finite metric spaces are associated with a complexity measure which is very different from the Minkowski dimension (i.e., the covering/packing complexity). In other words, we exhibit a phase transition between dimension d = 0 and d \u2265 1 in the rate of growth of the regret induced by the metric.\n\n4 Algorithms\n\nIn this section we turn to prove Theorem 7. Our strategy is much inspired by the approach in [24], and we employ a two-step approach: First, we consider the case that the metric is an HST metric; we then turn to deal with general metrics, and show how to upper-bound any metric with an HST metric.\n\n4.1 Tree Metrics: The Slowly-Moving Bandit Algorithm\n\nIn this section we analyze the simplest case of the problem, in which the metric $\\Delta = \\Delta_\\mathcal{T}$ is induced by an HST tree $\\mathcal{T}$ (whose leaves are associated with actions in K). In this case, our main tool is the Slowly-Moving Bandit (SMB) algorithm [24]: we demonstrate how it can be applied to general tree metrics, and analyze its performance in terms of intrinsic properties of the metric.\n\nWe begin by reviewing the SMB algorithm. In order to present the algorithm we require a few additional notations. The algorithm receives as input a tree structure over the set of actions K, and its operation depends on the tree structure. We fix an HST tree $\\mathcal{T}$ and let $H = \\mathrm{depth}(\\mathcal{T})$. For any level $0 \\le h \\le H$ and action $i \\in K$, let $A_h(i)$ be the set of leaves of $\\mathcal{T}$ that share a common ancestor with i at level h (recall that level h = 0 is the bottom-most level corresponding to the singletons). In terms of the tree metric we have that $A_h(i) = \\{ j : \\Delta_\\mathcal{T}(i, j) \\le 2^{-H+h} \\}$.\n\nThe SMB algorithm is presented in Algorithm 1. The algorithm is based on the multiplicative update method, in the spirit of Exp3 algorithms [4]. 
Similarly to Exp3, the algorithm computes at each round t an estimator $\\widetilde{\\ell}_t$ to the loss vector $\\ell_t$ using the single loss value $\\ell_t(i_t)$ observed. In addition to being an (almost) unbiased estimate for the true loss vector, the estimator $\\widetilde{\\ell}_t$ used by SMB has the additional property of inducing slowly-changing sampling distributions $p_t$: This is done by choosing at random a level $h_t$ of the tree to be rebalanced (in terms of the weights maintained by the algorithm). As a result, the marginal probabilities $p_{t+1}(A_{h_t}(i))$ are not changed at round t.\n\nIn turn, and in contrast with Exp3, the algorithm\u2019s choice of action at round t + 1 is not purely sampled from $p_t$, but rather conditioned on our last choice of level $h_t$. This is informally justified by the fact that $p_t$ and $p_{t+1}$ agree on the marginal distribution of $A_{h_t}(i_t)$, hence we can think of the level drawn at round t as if it were drawn subject to $p_{t+1}(A_{h_t}) = p_t(A_{h_t})$.\n\nInput: A tree $\\mathcal{T}$ with a finite set of leaves K, $\\eta > 0$.\nInitialize: $H = \\mathrm{depth}(\\mathcal{T})$, $A_h(i) = B_{2^{-H+h}}(i)$, $\\forall i \\in K$, $0 \\le h \\le H$\nInitialize $p_1 = \\mathrm{unif}(K)$, $h_0 = H$ and $i_0 \\sim p_1$\nFor t = 1, . . . , T:\n(1) Choose action $i_t \\sim p_t(\\cdot \\mid A_{h_{t-1}}(i_{t-1}))$, observe loss $\\ell_t(i_t)$\n(2) Choose $\\sigma_{t,0}, \\ldots, \\sigma_{t,H-1} \\in \\{\\pm 1\\}$ uniformly at random; let $h_t = \\min\\{0 \\le h \\le H : \\sigma_{t,h} < 0\\}$ where $\\sigma_{t,H} = -1$\n(3) Compute vectors $\\bar{\\ell}_{t,0}, \\ldots, \\bar{\\ell}_{t,H-1}$ recursively via\n$$\\bar{\\ell}_{t,0}(i) = \\frac{\\mathbf{1}\\{i_t = i\\}}{p_t(i)} \\ell_t(i_t),$$\nand for all $h \\ge 1$:\n$$\\bar{\\ell}_{t,h}(i) = -\\frac{1}{\\eta} \\ln\\left( \\sum_{j \\in A_h(i)} \\frac{p_t(j)}{p_t(A_h(i))} e^{-\\eta (1 + \\sigma_{t,h-1}) \\bar{\\ell}_{t,h-1}(j)} \\right)$$\n(4) Define $E_t = \\{ i : p_t(A_h(i)) < 2^h \\eta \\text{ for some } 0 \\le h < H \\}$ and set:\n$$\\widetilde{\\ell}_t = \\begin{cases} 0 & \\text{if } i_t \\in E_t; \\\\ \\bar{\\ell}_{t,0} + \\sum_{h=0}^{H-1} \\sigma_{t,h} \\bar{\\ell}_{t,h} & \\text{otherwise} \\end{cases}$$\n(5) Update:\n$$p_{t+1}(i) = \\frac{p_t(i) e^{-\\eta \\widetilde{\\ell}_t(i)}}{\\sum_{j=1}^{k} p_t(j) e^{-\\eta \\widetilde{\\ell}_t(j)}} \\qquad \\forall i \\in K$$\n\nAlgorithm 1: The SMB algorithm.\n\nA key observation is that by directly applying SMB to the metric $\\Delta_\\mathcal{T}$, we can achieve the following regret bound:\n\nTheorem 11. Let $(K, \\Delta_\\mathcal{T})$ be a metric space defined by a 2-HST $\\mathcal{T}$ with $\\mathrm{depth}(\\mathcal{T}) = H$ and complexity $C(\\mathcal{T}) = C$. Using the SMB algorithm we can achieve the following regret bound:\n\n$$\\mathrm{Regret}_{\\mathrm{MC}}(\\ell_{1:T}, \\Delta_\\mathcal{T}) = O\\Big( H \\sqrt{2^H T C \\log C} + H 2^{-H} T \\Big). \\qquad (2)$$\n\nTo show Theorem 11, we adapt the analysis of [24] (that applies only to complete binary HSTs) to handle more general HSTs. We defer this part of our analysis to the full version of the paper [25], since it follows from a technical modification of the original proof.\n\nFor a tree that is either too deep or too shallow, Eq. (2) may not necessarily lead to a sublinear regret bound, let alone optimal. The main idea behind achieving optimal regret bound for a general tree is to modify it until one of two things happens: Either we have optimized the depth so that the two terms in the right-hand side of Eq. (2) are of the same order, in which case we will show that one can achieve a regret rate of order $O(C(\\mathcal{T})^{1/3} T^{2/3})$. If we fail to do that, we show that the first term in the right-hand side is the dominant one, and it will be of order $O(\\sqrt{kT})$.\n\nFor trees that are in some sense \u201cwell behaved\u201d we have the following Corollary of Theorem 11.\n\nCorollary 12. Let $(K, \\Delta_\\mathcal{T})$ be a metric space defined by a tree $\\mathcal{T}$ over |K| = k leaves with $\\mathrm{depth}(\\mathcal{T}) = H$ and complexity $C(\\mathcal{T}) = C$. Assume that $\\mathcal{T}$ satisfies the following:\n(1) $2^{-H} H T \\le \\sqrt{2^H H C T}$;\n(2) One of the following is true:\n(a) $2^H C \\le k$;\n(b) $2^{-(H-1)} (H-1) T \\ge \\sqrt{2^{H-1} (H-1) C T}$.\nThen, the SMB algorithm can be used to attain $\\mathrm{Regret}_{\\mathrm{MC}}(\\ell_{1:T}, \\Delta_\\mathcal{T}) = \\widetilde{O}\\big( \\max\\big\\{ C^{1/3} T^{2/3}, \\sqrt{kT} \\big\\} \\big)$.\n\nThe following establishes Theorem 7 for the special case of tree metrics.\n\nLemma 13. For any tree $\\mathcal{T}$ and time horizon T, there exists a tree $\\mathcal{T}'$ (over the same set K of k leaves) that satisfies the conditions of Corollary 12, such that $\\Delta_{\\mathcal{T}'} \\ge \\Delta_\\mathcal{T}$ and $C(\\mathcal{T}') = C(\\mathcal{T})$. Furthermore, $\\mathcal{T}'$ can be constructed efficiently from $\\mathcal{T}$ (i.e., in time polynomial in |K| and T). Hence, applying SMB to the metric space $(K, \\Delta_{\\mathcal{T}'})$ leads to $\\mathrm{Regret}_{\\mathrm{MC}}(\\ell_{1:T}, \\Delta_\\mathcal{T}) = \\widetilde{O}\\big( \\max\\big\\{ C(\\mathcal{T})^{1/3} T^{2/3}, \\sqrt{kT} \\big\\} \\big)$.\n\nWe refer to [25] for the proofs of both results.\n\n4.2 General Finite Metrics\n\nFinally, we obtain the general finite case as a corollary of the following.\n\nLemma 14. Let (K, \u2206) be a finite metric space. There exists a tree metric $\\Delta_\\mathcal{T}$ over K (with |K| = k) such that $4\\Delta_\\mathcal{T}$ dominates \u2206 (i.e., such that $4\\Delta_\\mathcal{T}(i, j) \\ge \\Delta(i, j)$ for all $i, j \\in K$) for which $C(\\mathcal{T}) = O(C_c(\\Delta) \\log k)$. Furthermore, $\\mathcal{T}$ can be constructed efficiently.\n\nProof. 
Let $H$ be such that the minimal distance in $\\Delta$ is larger than $2^{-H}$. For each $r = 2^{-1}, 2^{-2}, \\ldots, 2^{-H}$ we let $\\mathcal{B}_r = \\{ B_r(i_{1,r}), \\ldots, B_r(i_{m_r,r}) \\}$ be a covering of $K$ of size $N^c_r(\\mathcal{T}) \\log k$ using balls of radius $r$. Note that finding a minimal set of balls of radius $r$ that covers $K$ is exactly the set cover problem. Hence, we can efficiently approximate it (to within an $O(\\log k)$ factor) and construct the sets $\\mathcal{B}_r$.\nWe now construct a tree graph whose nodes are associated with the cover balls. The leaves correspond to singleton balls, and hence to the action space. For each leaf $i$ we find an action $a_1(i) \\in K$ such that $i \\in B_{2^{-H+1}}(a_1(i)) \\in \\mathcal{B}_{2^{-H+1}}$. If there is more than one, we arbitrarily choose one, and we connect an edge between $i$ and $B_{2^{-H+1}}(a_1(i))$. We continue in this manner inductively to define $a_r(i)$ for every leaf $i$ and every $r$ with $2^{-H+r} < 1$: given $a_{r-1}(i)$ we find an action $a_r(i)$ such that $a_{r-1}(i) \\in B_{2^{-H+r}}(a_r(i)) \\in \\mathcal{B}_{2^{-H+r}}$, and we connect an edge between $B_{2^{-H+r-1}}(a_{r-1}(i))$ and $B_{2^{-H+r}}(a_r(i))$.\nWe now claim that the metric induced by the tree graph dominates the original metric up to a factor of 4. Let $i, j \\in K$ be such that $\\Delta_{\\mathcal{T}}(i, j) < 2^{-H+r}$; then by construction there are $i, a_1(i), a_2(i), \\ldots, a_r(i)$ and $j, a_1(j), a_2(j), \\ldots, a_r(j)$ such that $a_r(i) = a_r(j)$ and for which it holds that $\\Delta(a_s(i), a_{s-1}(i)) \\le 2^{-H+s}$ and similarly $\\Delta(a_s(j), a_{s-1}(j)) \\le 2^{-H+s}$ for every $s \\le r$. Denoting $a_0(i) = i$ and $a_0(j) = j$, we have that\n\n$$\\Delta(i, j) \\le \\sum_{s=1}^{r} \\Delta(a_{s-1}(i), a_s(i)) + \\sum_{s=1}^{r} \\Delta(a_{s-1}(j), a_s(j)) \\le 2 \\sum_{s=1}^{r} 2^{-H+s} \\le 2 \\cdot 2^{-H} \\cdot 2^{r+1} \\le 4 \\Delta_{\\mathcal{T}}(i, j). \\qquad \\square$$\n\n4.3 Infinite Metric Spaces\nFinally, we address infinite spaces by discretizing the space $K$ and reducing to the finite case. 
Recall that in this case we also assume that the loss functions are Lipschitz.\nProof of Theorem 9. Given the definition of the covering dimension $d = D(\\Delta) \\ge 1$, it is straightforward that for some constant $C > 0$ (that might depend on the metric $\\Delta$) it holds that $N^c_r(\\Delta) \\le C r^{-d}$ for all $r > 0$. Fix some $\\epsilon > 0$, and take a minimal $2\\epsilon$-covering $K'$ of $K$ of size $|K'| \\le C (2\\epsilon)^{-d} \\le C \\epsilon^{-d}$. Observe that by restricting the algorithm to pick actions from $K'$, we might lose at most $O(\\epsilon T)$ in the regret. Also, since $K'$ is minimal, the distance between any two elements in $K'$ is at least $\\epsilon$, thus the covering complexity of the space satisfies\n\n$$\\mathcal{C}_c(\\Delta) = \\sup_{r \\ge \\epsilon} r \\cdot N^c_r(\\Delta) \\le C \\sup_{r \\ge \\epsilon} r^{-d+1} \\le C \\epsilon^{-d+1},$$\n\nas we assume that $d \\ge 1$. Hence, by Theorem 7 and the Lipschitz assumption, there exists an algorithm for which\n\n$$\\mathrm{Regret}_{\\mathrm{MC}}(\\ell_{1:T}, \\Delta) = \\widetilde{O}\\Big( \\max\\big\\{ \\epsilon^{-\\frac{d-1}{3}} T^{\\frac{2}{3}},\\ \\epsilon^{-\\frac{d}{2}} T^{\\frac{1}{2}},\\ \\epsilon T \\big\\} \\Big).$$\n\nA simple computation reveals that $\\epsilon = \\Theta(T^{-\\frac{1}{d+2}})$ optimizes the above bound, and leads to $\\widetilde{O}(T^{\\frac{d+1}{d+2}})$ movement regret. $\\square$\n\nAcknowledgements\nRL is supported in funds by the Eric and Wendy Schmidt Foundation for strategic innovations. YM is supported in part by a grant from the Israel Science Foundation, a grant from the United States-Israel Binational Science Foundation (BSF), and the Israeli Centers of Research Excellence (I-CORE) program (Center No. 4/11).\n\nReferences\n[1] R. Agrawal, M. V. Hegde, and D. Teneketzis. Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching costs. IEEE Transactions on Automatic Control, 33(10):899\u2013906, 1988.\n\n[2] R. Arora, O. Dekel, and A. Tewari. 
Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1503\u20131510, 2012.\n\n[3] M. Asawa and D. Teneketzis. Multi-armed bandits with switching penalties. IEEE Transactions on Automatic Control, 41(3):328\u2013348, 1996.\n\n[4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48\u201377, 2002.\n\n[5] P. Auer, R. Ortner, and C. Szepesv\u00e1ri. Improved rates for the stochastic continuum-armed bandit problem. Proceedings of the 20th Annual Conference on Learning Theory, pages 454\u2013468, 2007.\n\n[6] J. S. Banks and R. K. Sundaram. Switching costs and the Gittins index. Econometrica, 62:687\u2013694, 1994.\n\n[7] Y. Bartal. Probabilistic approximations of metric spaces and its algorithmic applications. In 37th Annual Symposium on Foundations of Computer Science, FOCS \u201996, Burlington, Vermont, USA, 14-16 October, 1996, pages 184\u2013193, 1996.\n\n[8] A. Borodin and R. El-Yaniv. Online Computation and Competitive Analysis. Cambridge University Press, 1998.\n\n[9] A. Borodin, N. Linial, and M. E. Saks. An optimal on-line algorithm for metrical task system. Journal of the ACM (JACM), 39(4):745\u2013763, 1992.\n\n[10] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesv\u00e1ri. X-armed bandits. Journal of Machine Learning Research, 12:1587\u20131627, 2011.\n\n[11] E. Cope. Regret and convergence bounds for a class of continuum-armed bandit problems. IEEE Transactions on Automatic Control, 54(6):1243\u20131253, 2009.\n\n[12] O. Dekel, J. Ding, T. Koren, and Y. Peres. Bandits with switching costs: $T^{2/3}$ regret. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 459\u2013467. ACM, 2014.\n\n[13] E. Even-Dar, S. M. Kakade, and Y. Mansour. Online Markov decision processes. Math. 
Oper. Res., 34(3):726\u2013736, 2009.\n\n[14] J. Fakcharoenphol, S. Rao, and K. Talwar. A tight bound on approximating arbitrary metrics by tree metrics. J. Comput. Syst. Sci., 69(3):485\u2013497, 2004.\n\n[15] M. Feldman, T. Koren, R. Livni, Y. Mansour, and A. Zohar. Online pricing with strategic and patient buyers. In Annual Conference on Neural Information Processing Systems, 2016.\n\n[16] S. Geulen, B. V\u00f6cking, and M. Winkler. Regret minimization for online buffering problems using the weighted majority algorithm. In COLT, pages 132\u2013143, 2010.\n\n[17] J. Gittins, K. Glazebrook, and R. Weber. Multi-Armed Bandit Allocation Indices, 2nd Edition. John Wiley, 2011.\n\n[18] S. Guha and K. Munagala. Multi-armed bandits with metric switching costs. In International Colloquium on Automata, Languages, and Programming, pages 496\u2013507. Springer, 2009.\n\n[19] A. Gyorgy and G. Neu. Near-optimal rates for limited-delay universal lossy source coding. IEEE Transactions on Information Theory, 60(5):2823\u20132834, 2014.\n\n[20] T. Jun. A survey on the bandit problem with switching costs. De Economist, 152(4):513\u2013541, 2004.\n\n[21] R. Kleinberg and A. Slivkins. Sharp dichotomies for regret minimization in metric spaces. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 827\u2013846. Society for Industrial and Applied Mathematics, 2010.\n\n[22] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 681\u2013690. ACM, 2008.\n\n[23] R. D. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pages 697\u2013704, 2004.\n\n[24] T. Koren, R. Livni, and Y. Mansour. Bandits with movement costs and adaptive pricing. In COLT, 2017.\n\n[25] T. Koren, R. Livni, and Y. Mansour. 
Multi-armed bandits with metric movement costs. arXiv preprint arXiv:1710.08997, 2017.\n\n[26] S. Magureanu, R. Combes, and A. Proutiere. Lipschitz bandits: Regret lower bound and optimal algorithms. In COLT, pages 975\u2013999, 2014.\n\n[27] G. Neu, A. Gy\u00f6rgy, C. Szepesv\u00e1ri, and A. Antos. Online Markov decision processes under bandit feedback. IEEE Trans. Automat. Contr., 59(3):676\u2013691, 2014.\n\n[28] R. Ortner. Online regret bounds for Markov decision processes with deterministic transitions. Theor. Comput. Sci., 411(29-30):2684\u20132695, 2010.\n\n[29] A. Slivkins. Multi-armed bandits on implicit metric spaces. In Advances in Neural Information Processing Systems, pages 1602\u20131610, 2011.\n\n[30] A. Slivkins, F. Radlinski, and S. Gollapudi. Ranked bandits in metric spaces: learning diverse rankings over large document collections. Journal of Machine Learning Research, 14(Feb):399\u2013436, 2013.\n\n[31] T. Tao. 245c, notes 5: Hausdorff dimension. http://terrytao.wordpress.com/2009/05/19/245c-notes-5-hausdorff-dimension-optional/, 2009.\n\n[32] J. Yu and S. Mannor. Unimodal bandits. In Proceedings of the 28th International Conference on Machine Learning, 2011.\n\n[33] J. Y. Yu, S. Mannor, and N. Shimkin. Markov decision processes with arbitrary reward processes. Math. Oper. Res., 34(3):737\u2013757, Aug. 2009. ISSN 0364-765X.\n", "award": [], "sourceid": 2169, "authors": [{"given_name": "Tomer", "family_name": "Koren", "institution": "Google"}, {"given_name": "Roi", "family_name": "Livni", "institution": "Princeton"}, {"given_name": "Yishay", "family_name": "Mansour", "institution": "Tel Aviv University"}]}