{"title": "Phase Transitions and Cyclic Phenomena in Bandits with Switching Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 7523, "page_last": 7532, "abstract": "We consider the classical stochastic multi-armed bandit problem with a constraint on the total cost incurred by switching between actions. Under the unit switching cost structure, where the constraint limits the total number of switches, we prove matching upper and lower bounds on regret and provide near-optimal algorithms for this problem. Surprisingly, we discover phase transitions and cyclic phenomena of the optimal regret. That is, we show that associated with the multi-armed bandit problem, there are equal-length phases defined by the number of arms and switching costs, where the regret upper and lower bounds in each phase remain the same and drop significantly between phases. The results enable us to fully characterize the trade-off between regret and incurred switching cost in the stochastic multi-armed bandit problem, contributing new insights to this fundamental problem. Under the general switching cost structure, our analysis reveals a surprising connection between the bandit problem and the shortest Hamiltonian path problem.", "full_text": "Phase Transitions and Cyclic Phenomena in Bandits\n\nwith Switching Constraints\n\nDavid Simchi-Levi\n\nYunzong Xu\n\nInstitute for Data, Systems and Society\nMassachusetts Institute of Technology\n\nInstitute for Data, Systems and Society\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\ndslevi@mit.edu\n\nCambridge, MA 02139\n\nyxu@mit.edu\n\nAbstract\n\nWe consider the classical stochastic multi-armed bandit problem with a constraint\non the total cost incurred by switching between actions. Under the unit switching\ncost structure, where the constraint limits the total number of switches, we prove\nmatching upper and lower bounds on regret and provide near-optimal algorithms\nfor this problem. 
Surprisingly, we discover phase transitions and cyclic phenomena of the optimal regret. That is, we show that associated with the multi-armed bandit problem, there are equal-length phases defined by the number of arms and switching costs, where the regret upper and lower bounds in each phase remain the same and drop significantly between phases. The results enable us to fully characterize the trade-off between regret and incurred switching cost in the stochastic multi-armed bandit problem, contributing new insights to this fundamental problem. Under the general switching cost structure, our analysis reveals a surprising connection between the bandit problem and the shortest Hamiltonian path problem.

1 Introduction

The multi-armed bandit (MAB) problem is one of the most fundamental problems in online learning, with diverse applications ranging from pricing and online advertising to clinical trials. In a traditional MAB problem, the learner (i.e., decision-maker) is allowed to switch freely between actions, and an effective learning policy may incur frequent switching — indeed, the learner's task is to balance the exploration-exploitation trade-off, and both exploration (i.e., acquiring new information) and exploitation (i.e., optimizing decisions based on up-to-date information) require switching. However, in many real-world scenarios, it is costly to switch between different alternatives, and a learning policy with limited switching behavior is preferred. The learner thus has to consider the cost of switching in her learning task.

The conventional model: switching cost as a penalty. There is rich literature studying stochastic MAB with switching costs. 
Most of the papers model the switching cost as a penalty in the learner's objective, i.e., they measure a policy's regret and incurred switching cost using the same metric and the objective is to minimize the sum of these two terms (e.g., [1, 2, 7, 8]; there are other variations with discounted rewards [5, 4, 6], see [12] for a survey). Though this conventional "switching penalty" model has attracted significant research interest in the past, it has two limitations. First, under this model, the learner's total switching cost is a complete output determined by the learning algorithm. However, in many real-world applications, there are strict limits on the learner's switching behavior, which should be modeled as a hard constraint, and hence the learner's total budget of switching cost should be an input that helps determine the algorithm. In particular, while the algorithm in [8] developed for the "switching penalty" model can achieve Õ(√T) (distribution-free) regret with O(log log T) switches, if the learner wants a policy that always incurs finite switching cost independent of T, then prior literature does not provide an answer. Second, the "switching penalty" model has a fundamental weakness in studying the trade-off between regret and incurred switching cost in stochastic MAB — since the O(log log T) bound on the incurred switching cost of a policy is negligible compared with the Õ(√T) bound on its optimal regret, when adding the two terms up, the term associated with incurred switching cost is always dominated by the regret, thus no trade-off can be identified. 

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
As a result, to the best of our knowledge, prior literature has not characterized the fundamental trade-off between regret and incurred switching cost in stochastic MAB.

The BwSC model: switching as a limited resource. In this paper, we introduce the Bandits with Switching Constraints (BwSC) problem. The BwSC model addresses the issues associated with the "switching penalty" model in several ways. First, it introduces a hard constraint on the total switching cost, making the switching budget an input to learning policies, enabling us to design good policies that guarantee limited switching cost. While O(log log T) switches have proven to be sufficient for a learning policy to achieve near-optimal regret in MAB, in BwSC, we are mostly interested in the setting of finite or o(log log T) switching budget, which is highly relevant in practice. Second, by focusing on rewards in the objective function and incurred switching cost in the switching constraint, the BwSC framework enables the characterization of the fundamental trade-off between regret and maximum incurred switching cost in MAB. Third, while most prior research assumes specific structures on switching costs (e.g., unit or homogeneous costs), in reality, switching between different pairs of actions may incur heterogeneous costs that do not follow any parametric form. The BwSC model allows general switching costs, which makes it a powerful modeling framework.

Motivating examples. The BwSC framework has numerous applications, including dynamic pricing, online assortment optimization, online advertising, clinical trials and vehicle routing. A representative example is the dynamic pricing problem. Dynamic pricing with demand learning has proven its effectiveness in online retailing. However, it is well known that in practice, sellers often face business constraints that prevent them from conducting extensive price experimentation and making frequent price changes. 
For example, according to [10], Groupon limits the number of price changes, either\nbecause of implementation constraints, or for fear of confusing customers and receiving negative\ncustomer feedback.\nIn such scenarios, the seller\u2019s sequential decision-making problem can be\nmodeled as a BwSC problem, where changing from each price to another price incurs some cost, and\nthere is a limit on the total cost incurred by price changes.\nMain contributions. In this paper, we introduce the BwSC model, a general framework with strong\nmodeling power. The model overcomes the limitations of the prior \u201cswitching penalty\u201d model and\nhas both practical and theoretical values.\nWe \ufb01rst study the unit-switching-cost BwSC problem in Section 4. We develop an upper bound on\nregret by proposing a simple and intuitive policy with carefully-designed switching rules, and prove\nan information-theoretic lower bound that matches the above upper bound, indicating that our policy\nis rate-optimal up to logarithmic factors. Methodologically, the proof of the lower bound involves\na novel \u201ctracking the cover time\u201d argument that has not appeared in prior literature and may be of\nindependent interest. With the analysis described above we obtain some surprising and insightful\nresults, namely, phase transitions and cyclic phenomena of the optimal regret. That is, we show that\nassociated with the BwSC problem, there are equal-length phases, de\ufb01ned by the number of arms and\nswitching costs, where the regret upper and lower bounds in each phase remain the same and drop\nsigni\ufb01cantly between phases, see the precise de\ufb01nitions in Section 4.3.\nWe then study the general-switching-cost BwSC problem in Section 5. We propose an ef\ufb01cient policy\nand prove regret upper and lower bounds in the general setting. 
The results reveal a surprising connection between the BwSC problem and the shortest Hamiltonian path problem.

For the full version of this paper (containing additional results and missing proofs), see [14].

2 Notations, Model and Definitions

Notations. For all n1, n2 ∈ N such that n1 ≤ n2, we use [n1] to denote the set {1, . . . , n1}, and use [n1 : n2] (resp. (n1 : n2]) to denote the set {n1, n1 + 1, . . . , n2} (resp. {n1 + 1, . . . , n2}). For all x ≥ 0, we use ⌊x⌋ to denote the largest integer less than or equal to x. For ease of presentation, we define ⌊x⌋ = 0 for all x < 0. Throughout the paper, we use big O, Ω, Θ notations to hide constant factors, and use Õ, Ω̃, Θ̃ notations to hide constant factors and logarithmic factors.

Problem formulation. Consider a k-armed bandit problem where a learner chooses actions from a fixed set [k] = {1, . . . , k}. There is a total of T rounds. In each round t ∈ [T], the learner first chooses an action i_t ∈ [k], then observes a reward r_t(i_t) ∈ R. For each action i ∈ [k], the rewards of action i are drawn i.i.d. from an (unknown) distribution D_i with (unknown) expected value µ_i. We assume that the distributions D_i are standardized sub-Gaussian. Without loss of generality, we assume sup_{i,j∈[k]} |µ_i − µ_j| ∈ [0, 1].

In our problem, the learner incurs a switching cost c_{i,j} = c_{j,i} ≥ 0 each time she switches between action i and action j (i, j ∈ [k]). In particular, c_{i,i} = 0 for i ∈ [k]. There is a pre-specified switching budget S ≥ 0 representing the maximum amount of switching cost that the learner can incur in total. Once the total switching cost exceeds the switching budget S, the learner cannot switch her actions any more. The learner's goal is to maximize the expected total reward over T rounds.

Admissible policies. 
Let π denote the learner's (non-anticipating) learning policy, and π_t ∈ [k] denote the action chosen by policy π at round t ∈ [T]. More formally, π_t establishes a probability kernel acting from the space of historical actions and observations to the space of actions at round t. Let P^π_D and E^π_D be the probability measure and expectation induced by policy π and latent distributions D = (D_1, . . . , D_k). According to the problem formulation, we only need to restrict our attention to the S-switching-budget policies, which take S, k and T as input and are defined below.

Definition 1 A policy π is said to be an S-switching-budget policy if for all D,

P^π_D [ ∑_{t=1}^{T−1} c_{π_t, π_{t+1}} ≤ S ] = 1.

Let Π_S denote the set of all S-switching-budget policies, which is also the admissible policy class of the BwSC problem.

Regret. The performance of a learning policy is measured against a clairvoyant policy that maximizes the expected total reward given foreknowledge of the environment (i.e., latent distributions) D. Let µ* = max_{i∈[k]} µ_i. We define the regret of policy π as the worst-case difference between the expected performance of the optimal clairvoyant policy and the expected performance of policy π:

R^π(T) = sup_D { T µ* − E^π_D [ ∑_{t=1}^{T} µ_{π_t} ] }.

The minimax (optimal) regret of BwSC is defined as R*_S(T) = inf_{π∈Π_S} R^π(T).

In our paper, when we say a policy is "near-optimal" or "optimal up to logarithmic factors", we mean that its regret bound is optimal in T up to logarithmic factors of T, irrespective of whether the bound is optimal in k, since typically k is much smaller than T (e.g., k = O(1)).

Remark. 
There are two notions of regret in the stochastic bandit literature. The R^π(T) regret that we consider is called distribution-free, as it does not depend on D. On the other hand, one can also define the distribution-dependent regret R^π_D(T) = T µ* − E^π_D [ ∑_{t=1}^{T} µ_{π_t} ] that depends on D. This second notion of regret is only meaningful when µ_1, . . . , µ_k are well-separated. Unlike the classical MAB problem, where there are policies simultaneously achieving near-optimal bounds under both regret notions, in the BwSC problem, due to the limited switching budget, finding a policy that simultaneously achieves near-optimal bounds under both regret notions is usually impossible. In this paper, we focus on the distribution-free regret. Extensions to the distribution-dependent regret can be found in the full version of this paper [14].

Relationship between BwSC and MAB. Obviously, BwSC and MAB share the same definition of R^π(T), and the only difference between BwSC and MAB is the existence of a switching constraint π ∈ Π_S, determined by (c_{i,j}) ∈ R^{k×k}_{≥0} and S ∈ R_{≥0} (when S = ∞, BwSC degenerates to MAB). This makes BwSC a natural framework to study the trade-off between regret and incurred switching cost in MAB. That is, the trade-off between the optimal regret R*_S(T) and switching budget S in BwSC completely characterizes the trade-off between a policy's best achievable regret and its worst possible incurred switching cost in MAB. We are interested in how R*_S(T) behaves over a range of switching budgets S, and how it is affected by the structure of switching costs (c_{i,j}).

3 Other Related Work

This paper is not the first one to study online learning problems with limited switches. Indeed, a few authors have realized the practical significance of a limited switching budget. 
[10] considers a dynamic pricing model where the demand function is unknown but belongs to a known finite set, and a pricing policy is allowed to make at most m price changes. [9] studies a multi-period stochastic inventory replenishment and pricing problem with unknown parametric demand and limited price changes. We note that both [10, 9] only focus on specific decision-making problems, and their results rely on some strong assumptions about the unknown environment. By contrast, the BwSC model in our paper is generic and assumes no prior knowledge of the environment. The learning task in BwSC is thus more challenging than in previous models. In the adversarial setting, [3] studies the adversarial MAB with a limited number of switches. Since our problem is stochastic while their problem is adversarial, the results and methodologies in our paper are fundamentally different from theirs. It is worth noting that the switching constraint in BwSC is also more general than the number-of-switch constraints in the above-mentioned models.

The BwSC problem is also related to the batched bandit problem proposed by [13]. The M-batched bandit problem is defined as follows: given a classical bandit problem, the learner must split her learning process into M batches and is only able to observe the realized rewards from a given batch after the entire batch is completed. [13] studies the problem in the case of two arms. Very recently, [11] extends the results to k arms. 
The batched bandit problem and the BwSC problem are two different problems: the batched bandit problem limits observations and allows unlimited switching, while the BwSC problem limits switching and allows unlimited observations. Surprisingly, we discover some non-trivial connections between the batched bandit problem and the unit-switching-cost BwSC problem, which are presented in the full version of this paper [14].

4 Unit Switching Costs

In this section, we consider the BwSC problem with unit switching costs, where c_{i,j} = 1 for all i ≠ j. In this case, since every switch incurs a unit cost, the switching budget S can be interpreted as the maximum number of switches that the learner can make in total. Thus, the unit-switching-cost BwSC problem can be simply interpreted as "MAB with a limited number of switches".

4.1 Upper Bound on Regret

We first propose a simple and intuitive policy that provides an upper bound on the regret. Our policy, called the S-Switch Successive Elimination (SS-SE) policy, is described in Algorithm 1. The design philosophy behind the SS-SE policy is to divide the entire horizon into several pre-determined intervals (i.e., batches) and to control the number of switches in each interval. The policy thus has some similarities with the 2-armed batched policy of [13] and the k-armed batched policy of [11], which prove to be near-optimal in the batched bandit problem. However, since we are studying a different problem, directly applying a batched policy to the BwSC problem does not work. In particular, in the batched bandit problem, the number of intervals (i.e., batches) is a given constraint, while in the BwSC problem, the switching budget is the given constraint. 
We thus add two key ingredients into the SS-SE policy: (1) an index m(S) suggesting how many intervals should be used to partition the entire horizon; (2) a switching rule ensuring that the total number of switches within k actions cannot exceed the switching budget S. These two ingredients make the SS-SE policy substantially different from an ordinary batched policy.

Intuition about the policy. The policy divides the T rounds into ⌊(S − 1)/(k − 1)⌋ + 1 intervals in advance. The sizes of the intervals are designed to balance the exploration-exploitation trade-off. An active set of "good" actions A_l is maintained for each interval l and at the end of each interval some "bad" actions are eliminated before the start of the next interval. The policy controls the number of switches by ensuring that only |A_l| − 1 switches happen within each interval l and at most one switch happens between two consecutive intervals. Finally, in the last interval only the empirical best action is chosen. We show that the SS-SE policy is indeed an S-switching-budget policy and establish the following upper bound on its regret.

Algorithm 1 S-Switch Successive Elimination (SS-SE)
Input: Number of arms k, switching budget S, horizon T.
Partition: Calculate m(S) = ⌊(S − 1)/(k − 1)⌋. Divide the entire time horizon 1, . . . , T into m(S) + 1 intervals: (t_0 : t_1], (t_1 : t_2], . . . , (t_{m(S)} : t_{m(S)+1}], where the endpoints are defined by t_0 = 0 and

t_i = ⌊ k^{1 − (2 − 2^{−(i−1)})/(2 − 2^{−m(S)})} T^{(2 − 2^{−(i−1)})/(2 − 2^{−m(S)})} ⌋, ∀ i = 1, . . . , m(S) + 1.

Initialization: Let the set of all active actions in the l-th interval be A_l. Set A_1 = [k]. Let a_0 be a random action in [k].
Policy:
1: for l = 1, . . . , m(S) do
2:   if a_{t_{l−1}} ∈ A_l then
3:     Let a_{t_{l−1}+1} = a_{t_{l−1}}. Starting from this action, choose each action in A_l for (t_l − t_{l−1})/|A_l| consecutive rounds. Mark the last chosen action as a_{t_l}.
4:   else if a_{t_{l−1}} ∉ A_l then
5:     Starting from an arbitrary active action in A_l, choose each action in A_l for (t_l − t_{l−1})/|A_l| consecutive rounds. Mark the last chosen action as a_{t_l}.
6:   end if
7:   Statistical test: deactivate all actions i s.t. ∃ action j with UCB_{t_l}(i) < LCB_{t_l}(j), where
     UCB_{t_l}(i) = (empirical mean of action i in [1 : t_l]) + √( 2 log T / (number of plays of action i in [1 : t_l]) ),
     LCB_{t_l}(i) = (empirical mean of action i in [1 : t_l]) − √( 2 log T / (number of plays of action i in [1 : t_l]) ).
8: end for
9: In the last interval, choose the action with the highest empirical mean (up to round t_{m(S)}).

Theorem 1 Let π be the SS-SE policy, then π ∈ Π_S. There exists an absolute constant C ≥ 0 such that for all k ≥ 1, S ≥ 1 and T ≥ k,

R^π(T) ≤ C (log k log T) k^{1 − 1/(2 − 2^{−m(S)})} T^{1/(2 − 2^{−m(S)})},

where m(S) = ⌊(S − 1)/(k − 1)⌋.

Theorem 1 provides an upper bound on the optimal regret of the unit-switching-cost BwSC problem:

R*_S(T) = Õ(T^{1/(2 − 2^{−⌊(S−1)/(k−1)⌋})}).

4.2 Lower Bound on Regret

The SS-SE policy, though it achieves sublinear regret, seems to have many limitations that could weaken its performance, and on the surface this may suggest that the regret bound is not optimal. We discuss two points here. (1) The SS-SE policy does not make full use of its switching budget. Consider the case of 11 actions and a switching budget of 20. Since m(20) = ⌊(20 − 1)/(11 − 1)⌋ = 1 = m(11), the SS-SE policy will just run as if it could only make 11 switches, despite the fact that it has 9 additional units of switching budget (which will never be used). 
It seems that by tracking and allocating the switching budget in a more careful way, one can achieve lower regret. (2) The SS-SE policy learns from data infrequently. Note that the SS-SE policy pre-determines the number, sizes and locations of its intervals before seeing any data, and executes actions within each interval based on a pre-determined schedule. Consider again the case of 11 actions and a switching budget of 20: the SS-SE policy will split the entire horizon into two intervals and will only learn from data at the end of the first interval, after which it will choose a single action to be applied throughout the entire second interval. It seems that by learning from data more frequently, one can achieve lower regret.

While the above arguments are based on our first instinct and seem very reasonable, surprisingly, all of them prove to be wrong: no S-switch policy can theoretically do better! In fact, we match the upper bound provided by SS-SE by showing an information-theoretic lower bound in Theorem 2. This indicates that the SS-SE policy is rate-optimal up to logarithmic factors, and R*_S(T) = Θ̃(T^{1/(2 − 2^{−⌊(S−1)/(k−1)⌋})}). Note that the tightness of T is achieved per instance, i.e., for every k and every S. That is, our lower bound is substantially stronger than a single lower bound demonstrated for specific k and S. The proof of the lower bound involves a novel "tracking the cover time" argument that (to the best of our knowledge) has not appeared in previous literature and may be of independent interest. 
We state the lower bound and give a sketch of the proof below.

Theorem 2 There exists an absolute constant C > 0 such that for all k ≥ 1, S ≥ 1, T ≥ k and for all policy π ∈ Π_S,

R^π(T) ≥ { C k^{2 − 3/(2 − 2^{−m(S)})} T^{1/(2 − 2^{−m(S)})} (m(S) + 1)^{−2},   if m(S) ≤ log2 log2(T/k),
         { C √(kT),                                                        if m(S) > log2 log2(T/k),

where m(S) = ⌊(S − 1)/(k − 1)⌋.

Proof idea. For any k ≥ 1, S ≥ 1 and T ≥ k, for any S-switch policy π ∈ Π_S, we want to find an environment D such that R^π_D(T) is larger than the desired lower bound. A key challenge here is that π is an arbitrary and abstract S-switch policy — we need more information about π to construct D. With this goal in mind, we first design a concrete "primal environment" α. We use this environment to evaluate policy π, such that we can observe some key patterns revealed by policy π under α. These patterns are characterized by a series of ordered stopping times τ_1 ≤ τ_2 ≤ · · · ≤ τ_{m(S)+1}, some of which may be ∞, that are recursively defined as follows:

• τ_1 is the first time that all the actions in [k] have been chosen in period [1 : τ_1],
• τ_2 is the first time that all the actions in [k] have been chosen in period [τ_1 : τ_2],
• Generally, τ_i is the first time that all the actions in [k] have been chosen in period [τ_{i−1} : τ_i], for i = 2, . . . , m(S) + 1.

We then compare the realization of τ_1, . . . , τ_{m(S)} with a series of fixed values t_1, . . . , t_{m(S)}, which are the endpoints of the intervals defined in Algorithm 1. 
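The recursively defined cover times are easy to compute for any realized action sequence; a minimal sketch (the function name `cover_times` is ours), using `math.inf` for a τ_i that never occurs:

```python
import math

def cover_times(actions, k, n):
    """tau_1 <= ... <= tau_n: tau_i is the first round such that all k actions
    appear in the (closed) period [tau_{i-1} : tau_i]; math.inf if they never do.
    `actions` is the realized action sequence, reported rounds are 1-indexed."""
    taus, start = [], 0                  # `start` is the 0-based index of tau_{i-1}
    for _ in range(n):
        seen, tau = set(), None
        for t in range(start, len(actions)):
            seen.add(actions[t])
            if len(seen) == k:           # period [tau_{i-1} : t+1] now covers [k]
                tau = t
                break
        if tau is None:
            taus.append(math.inf)
            start = len(actions)         # all later cover times are infinite too
        else:
            taus.append(tau + 1)         # report the round 1-indexed
            start = tau                  # next period starts at tau_i (closed interval)
    return taus
```

For example, for the sequence 1, 2, 1, 1, 2, 2 with k = 2, the first three cover times are rounds 2, 3 and 5.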
Based on the possible outcomes of comparisons, we define m(S) + 1 key events:

• E_1 = {τ_1 > t_1},
• E_j = {τ_{j−1} ≤ t_{j−1}, τ_j > t_j}, for j = 2, . . . , m(S),
• E_{m(S)+1} = {τ_{m(S)} ≤ t_{m(S)}},

at least one of which must occur under π and α with probability at least 1/(m(S) + 1). We then do a case-by-case analysis as follows. In the first case, {τ_1 > t_1} occurs with certain probability, indicating that the action chosen in round τ_1 was not chosen in [1 : t_1] with certain probability; in the second case, ∃ j ∈ [2 : m(S)] such that {τ_{j−1} ≤ t_{j−1}, τ_j > t_j} occurs with certain probability, indicating that the action chosen in round τ_j was not chosen in [t_{j−1} : t_j] with certain probability; in the third case, {τ_{m(S)} ≤ t_{m(S)}} occurs with certain probability, indicating that the number of switches occurring in [t_{m(S)} : T] is at most k − 1. For each case, we construct an "auxiliary environment" β by carefully adjusting α based on the aforementioned indication. The environment β ensures two things: (1) β is "hard for π to distinguish from α", such that a crucial event E (constructed based on the indication) that occurs under π and α with certain probability also occurs under π and β with similar probability; and (2) β is "different enough from α" such that the certain occurrence probability of the event E under β makes R^π_β(T) larger than the desired lower bound. Theorem 2 then follows by R^π(T) ≥ R^π_β(T). For the complete proof of Theorem 2, see the full version of this paper [14].

Combining Theorem 1 and Theorem 2, we have

Corollary 1 For any fixed k ≥ 1, for any S ≥ 1, R*_S(T) = Θ̃(T^{1/(2 − 2^{−⌊(S−1)/(k−1)⌋})}).
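The exponent of T in Corollary 1 can be explored numerically. The following sketch (function name ours) computes it as an exact fraction and illustrates that the exponent is flat within a phase and drops exactly at a transition point:

```python
from fractions import Fraction

def regret_exponent(k, S):
    """Exponent of T in R*_S(T) = Theta~(T^(1/(2 - 2^-m(S)))),
    with m(S) = floor((S-1)/(k-1)), returned as an exact fraction."""
    m = max((S - 1) // (k - 1), 0)
    return 1 / (2 - Fraction(1, 2 ** m))   # equals 2^m / (2^(m+1) - 1)

k = 11
# Within a phase the exponent is flat ...
assert regret_exponent(k, 11) == regret_exponent(k, 20) == Fraction(2, 3)
# ... and it drops exactly at the transition point S = 2k - 1 = 21.
assert regret_exponent(k, 21) == Fraction(4, 7) < regret_exponent(k, 20)
```

The same function reproduces the Θ̃(T) regime for S < k, where m(S) = 0 and the exponent is 1.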
We briefly explain why the upper and lower bounds in Theorem 1 and Theorem 2 match in T. When m(S) ≤ log2 log2(T/k), which is the case we are mostly interested in, (m(S) + 1)^2 = o(log T), thus the upper and lower bounds match within o((log T)^2). When m(S) > log2 log2(T/k), the upper bound is O(√T log T), thus the upper and lower bounds directly match within O(log T).

4.3 Phase Transitions and Cyclic Phenomena

Corollary 1 allows us to characterize the trade-off between the switching budget S and the optimal regret R*_S(T). To illustrate this trade-off, Table 1 depicts the behavior of R*_S(T) as a function of S given a fixed k. Note that as discussed in Section 2, the relationship between R*_S(T) and S also characterizes the inherent trade-off between regret and maximum number of switches in the classical MAB problem.

Table 1: Regret as a Function of Switching Budget

S                  | [0, k)      | [k, 2k−1)   | [2k−1, 3k−2) | [3k−2, 4k−3)  | [4k−3, 5k−4)
R*_S(T)            | Θ̃(T)        | Θ̃(T^{2/3})  | Θ̃(T^{4/7})   | Θ̃(T^{8/15})   | Θ̃(T^{16/31})
R*_S(T)/R*_∞(T)    | Θ̃(T^{1/2})  | Θ̃(T^{1/6})  | Θ̃(T^{1/14})  | Θ̃(T^{1/30})   | Θ̃(T^{1/62})

As we have shown, R*_S(T) = Θ̃(T^{1/(2 − 2^{−⌊(S−1)/(k−1)⌋})}). To the best of our knowledge, this is the first time that a floor function naturally arises in the order of T in the optimal regret of an online learning problem. 
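Both rows of Table 1 follow from the closed form 1/(2 − 2^{−m}) = 2^m/(2^{m+1} − 1) with m = 0, . . . , 4; a quick check with exact rational arithmetic:

```python
from fractions import Fraction

# Exponent of T in R*_S(T) for phases 1..5 (i.e., m(S) = 0..4),
# reproducing the second row of Table 1.
exps = [Fraction(2 ** m, 2 ** (m + 1) - 1) for m in range(5)]
assert exps == [Fraction(1), Fraction(2, 3), Fraction(4, 7),
                Fraction(8, 15), Fraction(16, 31)]

# Exponent of T in the ratio R*_S(T)/R*_inf(T): divide out Theta~(sqrt(T)),
# reproducing the third row of Table 1.
ratios = [e - Fraction(1, 2) for e in exps]
assert ratios == [Fraction(1, 2), Fraction(1, 6), Fraction(1, 14),
                  Fraction(1, 30), Fraction(1, 62)]
```

The doubly exponentially shrinking ratio exponents (1/2, 1/6, 1/14, 1/30, 1/62) are exactly the regret drops discussed below.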
As a direct consequence of this floor function, we discover several surprising phenomena regarding the trade-off between S and R*_S(T) for any given k.

Definition 2 (Phases and Transition Points) For a k-armed unit-switching-cost BwSC, we call the interval [(j−1)(k−1)+1, j(k−1)+1) the j-th phase, and call j(k−1)+1 the j-th transition point (j ∈ Z_{>0}).

Fact 1 (Phase Transitions) As S increases from 0 to Θ(log log T), S will leave the j-th phase and enter the (j+1)-th phase at the j-th transition point (j ∈ Z_{>0}). Each time S arrives at a transition point, R*_S(T) will drop significantly, and stay at the same level until S arrives at the next transition point.

Fact 2 (Cyclic Phenomena) The length of each phase is always equal to k − 1, independent of S and T. We call the quantity k − 1 the budget cycle, which is the length of each phase.

Phase transitions are clearly presented in Table 1. This phenomenon seems counter-intuitive, as it suggests that increasing the switching budget does not help to decrease the best achievable regret as long as the budget does not reach the next transition point. Note that phase transitions are only exhibited when S is in the range of 0 to Θ(log log T). After S exceeds Θ(log log T), R*_S(T) will remain unchanged at the level of Θ̃(√T) — the optimal regret will only vary within logarithmic factors and there is no significant regret drop any more. Therefore, one can also view Θ(log log T) as a "final transition point" that marks the disappearance of phase transitions.

Cyclic phenomena indicate that, assuming the learner's switching budget is at a transition point, the extra switching budget that the learner needs to achieve the next regret drop (i.e., to arrive at the next transition point) is always k − 1. 
Cyclic phenomena also seem counter-intuitive: when the learner has more switching budget, she can conduct more statistical tests, eliminate more bad actions (which can be thought of as reducing k) and allocate her switching budget in a more flexible way — all of these suggest that the budget cycle should be a quantity decreasing with S. However, the cyclic phenomena tell us that the budget cycle is always a constant and no learning policy in the unit-cost BwSC (and in MAB) can escape this cycle, no matter how large S is, as long as S = o(log log T).

On the other hand, as S contains more and more budget cycles, the gap between R*_S(T) and R*_∞(T) = Θ̃(√T) does decrease dramatically. In fact, R*_S(T) decreases doubly exponentially fast as S contains more budget cycles. From Table 1, we can verify that 3 or 4 budget cycles are already enough for an S-switching-budget policy to achieve close-to-optimal regret in MAB (compared with the optimal policy with unlimited switching budget).

Finally, we give some comments on the scope of our results. Note that phase transitions and cyclic phenomena are associated with theoretical bounds of the worst-case regret, so if (1) the underlying distributions are not the worst-case distributions and we are focusing on the "actual incurred regret", or (2) T is not large enough to dominate the constants in the bounds, phase transitions and cyclic phenomena may not be exhibited.

5 General Switching Costs

We now proceed to the general case of BwSC, where c_{i,j} (= c_{j,i}) can be any non-negative real number and even ∞. The problem is significantly more challenging in this general setting. 
For this purpose, we need to enhance the framework of Section 2 to better characterize the structure of switching costs. We do this by representing switching costs via a weighted graph.

Let G = (V, E) be a (weighted) complete graph, where V = [k] (i.e., each vertex corresponds to an action), and the edge between i and j is assigned a weight c_{i,j} (∀i ≠ j). We call the weighted graph G the switching graph. In this paper, we assume the switching costs satisfy the triangle inequality: ∀i, j, l ∈ [k], c_{i,j} ≤ c_{i,l} + c_{l,j}.

The results of the unit-switching-cost model suggest that an effective policy that minimizes the worst-case regret must repeatedly visit all actions, in a manner similar to the SS-SE policy. This indicates that in the general-switching-cost model, an effective policy should repeatedly visit all vertices in the switching graph, in a most economical way to stay within budget. Motivated by this idea, we propose the Hamiltonian-Switching Successive Elimination (HS-SE) policy, and present its details in Algorithm 2. The HS-SE policy enhances the original SS-SE policy by adding two additional ingredients: (1) a pre-specified switching order: within each interval, the HS-SE policy switches based on an order determined by the shortest Hamiltonian path of the switching graph G; (2) a reversing policy: the HS-SE policy switches along one direction in the odd intervals, and along the reverse direction in the even intervals. Note that while the shortest Hamiltonian path problem is NP-hard, solving this problem is entirely an "offline" step in the HS-SE policy. That is, for a given switching graph, the learner only needs to solve this problem once.

Let H denote the total weight of the shortest Hamiltonian path of G. We give an upper bound on the regret of the HS-SE policy in Theorem 3.

Theorem 3 Let π be the HS-SE policy, then π ∈ Π_S.
There exists an absolute constant C ≥ 0 such that for all G, k = |G|, S ≥ 0, T ≥ k,

R^π(T) ≤ C (log k log T) k^{1 − 1/(2 − 2^{−m^U_G(S)})} T^{1/(2 − 2^{−m^U_G(S)})},

where m^U_G(S) = ⌊(S − max_{i,j∈[k]} c_{i,j}) / H⌋.

We then give a lower bound that is close to the above upper bound; see Theorem 4. Compared to the proof of Theorem 2 (see Section 4.2 for a proof sketch), we would like to highlight an important new step in the proof of Theorem 4. Recall that in the proof sketch of Theorem 2, we mention a step of constructing the "primal environment" α. In the proof of Theorem 4, our construction of α ensures that α has an additional property: (arg max_{i∈[k]} min_{j≠i} c_{i,j}) is the optimal action in α. This property makes (max_{i∈[k]} min_{j≠i} c_{i,j}) a lower bound on the cost incurred by switching between a sub-optimal action and the optimal action in α. Our new proof utilizes this property and makes the quantity (max_{i∈[k]} min_{j≠i} c_{i,j}) appear in the lower bound.
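The offline step of the HS-SE policy requires a shortest Hamiltonian path of the switching graph G. For a small number of arms this can be computed exactly by Held–Karp dynamic programming over vertex subsets; the sketch below is our own illustration (the cost-matrix encoding of G is an assumption, not the paper's notation) and returns the visiting order i₁ → ⋯ → i_k together with its total weight H.

```python
from itertools import combinations

def shortest_hamiltonian_path(c):
    """Exact shortest Hamiltonian path (free endpoints) on a complete graph.

    c: symmetric k x k matrix with c[i][j] = c_{i,j}.
    Returns (H, order): minimum total weight and a visiting order of all k vertices.
    Held-Karp DP: dp[(mask, v)] = cheapest path covering `mask`, ending at v.
    """
    k = len(c)
    dp = {(1 << v, v): (0.0, (v,)) for v in range(k)}
    for size in range(2, k + 1):
        for subset in combinations(range(k), size):
            mask = 0
            for v in subset:
                mask |= 1 << v
            for v in subset:
                prev = mask ^ (1 << v)
                # Extend the cheapest path over `prev` ending at u by edge (u, v).
                cost, path = min(
                    (dp[(prev, u)][0] + c[u][v], dp[(prev, u)][1] + (v,))
                    for u in subset if u != v
                )
                dp[(mask, v)] = (cost, path)
    full = (1 << k) - 1
    return min(dp[(full, v)] for v in range(k))

# Unit switching costs: every Hamiltonian path has total weight k - 1.
unit = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
H, order = shortest_hamiltonian_path(unit)
```

Held–Karp runs in O(k² 2^k) time; since the learner solves this problem only once per switching graph, exact solution is affordable for moderate k, and for larger k any Hamiltonian-path heuristic can supply the switching order, at the cost of a larger H.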
For the complete proof of Theorem 4, see the full version of this paper [14].

Theorem 4 There exists an absolute constant C > 0 such that for all G, k = |G|, S ≥ 0, T ≥ k and for all policies π ∈ Π_S,

R^π(T) ≥ C k^{−3/2 − 1/(2 − 2^{−m^L_G(S)})} T^{1/(2 − 2^{−m^L_G(S)})} (m^L_G(S) + 1)^{−2},   if m^L_G(S) ≤ log₂ log₂(T/k),
R^π(T) ≥ C √(kT),   if m^L_G(S) > log₂ log₂(T/k),

where m^L_G(S) = ⌊(S − max_{i∈[k]} min_{j≠i} c_{i,j}) / H⌋.

Algorithm 2 Hamiltonian-Switching Successive Elimination (HS-SE)
Input: Switching graph G, switching budget S, horizon T.
Offline step: Find the shortest Hamiltonian path in G: i₁ → ⋯ → i_k. Denote the total weight of the shortest Hamiltonian path as H. Calculate m^U_G(S) = ⌊(S − max_{i,j∈[k]} c_{i,j}) / H⌋.
Partition: Run the partition step in the SS-SE policy with m(S) = m^U_G(S).
Initialization: Let the set of all active actions in the l-th interval be A_l. Set A_1 = [k], a_0 = i₁.
Policy:
1: for l = 1, …, m^U_G(S) do
2:   if a_{t_{l−1}} ∈ A_l and l is odd then
3:     Let a_{t_{l−1}+1} = a_{t_{l−1}}. Starting from this action, along the direction of i₁ → ⋯ → i_k, choose each action in A_l for (t_l − t_{l−1})/|A_l| consecutive rounds. Mark the last chosen action as a_{t_l}.
4:   else if a_{t_{l−1}} ∈ A_l and l is even then
5:     Let a_{t_{l−1}+1} = a_{t_{l−1}}. Starting from this action, along the direction of i_k → ⋯ → i₁, choose each action in A_l for (t_l − t_{l−1})/|A_l| consecutive rounds. Mark the last chosen action as a_{t_l}.
6:   else if a_{t_{l−1}} ∉ A_l and l is odd then
7:     Along the direction of i₁ → ⋯ → i_k, find the first action that still remains in A_l.
     Starting from this action, along the direction of i₁ → ⋯ → i_k, choose each action in A_l for (t_l − t_{l−1})/|A_l| consecutive rounds. Mark the last chosen action as a_{t_l}.
8:   else if a_{t_{l−1}} ∉ A_l and l is even then
9:     Along the direction of i_k → ⋯ → i₁, find the first action that still remains in A_l. Starting from this action, along the direction of i_k → ⋯ → i₁, choose each action in A_l for (t_l − t_{l−1})/|A_l| consecutive rounds. Mark the last chosen action as a_{t_l}.
10:  end if
11:  Statistical test: deactivate all actions i s.t. ∃ action j with UCB_{t_l}(i) < LCB_{t_l}(j), where
       UCB_{t_l}(i) = (empirical mean of action i in [1 : t_l]) + √(2 log T / (number of plays of action i in [1 : t_l])),
       LCB_{t_l}(i) = (empirical mean of action i in [1 : t_l]) − √(2 log T / (number of plays of action i in [1 : t_l])).
12: end for
13: In the last interval, choose the action with the highest empirical mean (up to round t_{m^U_G(S)}).

Finally, we illustrate how tight the above upper and lower bounds are. When the switching costs satisfy the condition max_{i,j∈[k]} c_{i,j} = max_{i∈[k]} min_{j≠i} c_{i,j}, the two bounds directly match. When this condition is not satisfied, for any switching graph G, the above two bounds still match for a wide range of S:

[0, H + max_{i∈[k]} min_{j≠i} c_{i,j}) ∪ ( ∪_{n=1}^{∞} [nH + max_{i,j∈[k]} c_{i,j}, (n+1)H + max_{i∈[k]} min_{j≠i} c_{i,j}) ).

Even when S is not in this range, we still have m^U_G(S) ≤ m^L_G(S) ≤ m^U_G(S) + 1 for any G and any S, which means that the difference between the two indices is at most 1 and the regret bounds are always very close.
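As a quick numerical sanity check of the last claim, the sketch below (our own illustration, not from the paper) draws random switching graphs whose costs satisfy the triangle inequality, computes H by brute force, and verifies that the two indices m^U_G(S) = ⌊(S − max_{i,j} c_{i,j})/H⌋ and m^L_G(S) = ⌊(S − max_i min_{j≠i} c_{i,j})/H⌋ never differ by more than 1.

```python
import random
from itertools import permutations
from math import floor, hypot

def random_metric_costs(k, rng):
    """Pairwise Euclidean distances of random points: a symmetric cost
    matrix that automatically satisfies the triangle inequality."""
    pts = [(rng.random(), rng.random()) for _ in range(k)]
    return [[hypot(p[0] - q[0], p[1] - q[1]) for q in pts] for p in pts]

def indices(c, S):
    """Return (m^U_G(S), m^L_G(S)) for cost matrix c and switching budget S."""
    k = len(c)
    # Brute-force shortest Hamiltonian path weight H (fine for small k).
    H = min(sum(c[p[i]][p[i + 1]] for i in range(k - 1))
            for p in permutations(range(k)))
    max_edge = max(c[i][j] for i in range(k) for j in range(k) if i != j)
    max_min_edge = max(min(c[i][j] for j in range(k) if j != i) for i in range(k))
    m_upper = floor((S - max_edge) / H)      # index used by the upper bound
    m_lower = floor((S - max_min_edge) / H)  # index used by the lower bound
    return m_upper, m_lower

rng = random.Random(0)
for _ in range(50):
    c = random_metric_costs(5, rng)
    m_upper, m_lower = indices(c, S=rng.uniform(2.0, 20.0))
    # Under the triangle inequality, max_edge <= H, so the indices differ by at most 1.
    assert m_upper <= m_lower <= m_upper + 1
```

The check never fails because the triangle inequality forces max_{i,j} c_{i,j} ≤ H, so subtracting the larger quantity can lower the floor by at most one.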
In fact, it can be shown that as S increases, the gap between the upper and lower bounds decreases doubly exponentially. Therefore, the HS-SE policy is quite effective for the general BwSC problem.
For additional theoretical results for the general BwSC problem, see the full version of this paper [14].

References
[1] R. Agrawal, M. Hegde, and D. Teneketzis. Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Transactions on Automatic Control, 33(10):899–906, 1988.
[2] R. Agrawal, M. Hegde, and D. Teneketzis. Multi-armed bandit problems with multiple plays and switching cost. Stochastics and Stochastic Reports, 29(4):437–459, 1990.
[3] J. Altschuler and K. Talwar. Online learning over a finite action set with limited switching. In Conference on Learning Theory, pages 1569–1573, 2018.
[4] M. Asawa and D. Teneketzis. Multi-armed bandits with switching penalties. IEEE Transactions on Automatic Control, 41(3):328–348, 1996.
[5] J. S. Banks and R. K. Sundaram. Switching costs and the Gittins index. Econometrica, 62(3):687–694, 1994.
[6] D. Bergemann and J. Välimäki. Stationary multi-choice bandit problems. Journal of Economic Dynamics and Control, 25(10):1585–1594, 2001.
[7] M. Brezzi and T. L. Lai. Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control, 27(1):87–108, 2002.
[8] N. Cesa-Bianchi, O. Dekel, and O. Shamir. Online learning with switching costs and other adaptive adversaries. In Advances in Neural Information Processing Systems, pages 1160–1168, 2013.
[9] B. Chen and X. Chao. Parametric demand learning with limited price explorations in a backlog stochastic inventory system. IISE Transactions, pages 1–9, 2019.
[10] W. C. Cheung, D. Simchi-Levi, and H. Wang.
Dynamic pricing and demand learning with limited price experimentation. Operations Research, 65(6):1722–1731, 2017.
[11] Z. Gao, Y. Han, Z. Ren, and Z. Zhou. Batched multi-armed bandits problem. arXiv preprint arXiv:1904.01763, 2019.
[12] T. Jun. A survey on the bandit problem with switching costs. De Economist, 152(4):513–541, 2004.
[13] V. Perchet, P. Rigollet, S. Chassang, and E. Snowberg. Batched bandit problems. The Annals of Statistics, 44(2):660–681, 2016.
[14] D. Simchi-Levi and Y. Xu. Phase transitions and cyclic phenomena in bandits with switching constraints. arXiv preprint arXiv:1905.10825, 2019. URL https://arxiv.org/abs/1905.10825.