{"title": "Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards", "book": "Advances in Neural Information Processing Systems", "page_first": 199, "page_last": 207, "abstract": "In a multi-armed bandit (MAB) problem a gambler needs to choose at each round of play one of K arms, each characterized by an unknown reward distribution. Reward realizations are only observed when an arm is selected, and the gambler's objective is to maximize his cumulative expected earnings over some given horizon of play T. To do this, the gambler needs to acquire information about arms (exploration) while simultaneously optimizing immediate rewards (exploitation); the price paid due to this trade off is often referred to as the regret, and the main question is how small can this price be as a function of the horizon length T. This problem has been studied extensively when the reward distributions do not change over time; an assumption that supports a sharp characterization of the regret, yet is often violated in practical settings. In this paper, we focus on a MAB formulation which allows for a broad range of temporal uncertainties in the rewards, while still maintaining mathematical tractability. 
We fully characterize the (regret) complexity of this class of MAB problems by establishing a direct link between the extent of allowable reward \"variation\" and the minimal achievable regret, and by establishing a connection between the adversarial and the stochastic MAB frameworks.", "full_text": "Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards

Omar Besbes, Columbia University, New York, NY, ob2105@columbia.edu
Yonatan Gur, Stanford University, Stanford, CA, ygur@stanford.edu
Assaf Zeevi, Columbia University, New York, NY, assaf@gsb.columbia.edu

Abstract

In a multi-armed bandit (MAB) problem a gambler needs to choose at each round of play one of K arms, each characterized by an unknown reward distribution. Reward realizations are only observed when an arm is selected, and the gambler's objective is to maximize his cumulative expected earnings over some given horizon of play T. To do this, the gambler needs to acquire information about arms (exploration) while simultaneously optimizing immediate rewards (exploitation); the price paid due to this trade-off is often referred to as the regret, and the main question is how small this price can be as a function of the horizon length T. This problem has been studied extensively when the reward distributions do not change over time, an assumption that supports a sharp characterization of the regret yet is often violated in practical settings. In this paper, we focus on a MAB formulation which allows for a broad range of temporal uncertainties in the rewards, while still maintaining mathematical tractability.
We fully characterize the (regret) complexity of this class of MAB problems by establishing a direct link between the extent of allowable reward "variation" and the minimal achievable regret, and by establishing a connection between the adversarial and the stochastic MAB frameworks.

1 Introduction

Background and motivation. In the presence of uncertainty and partial feedback on rewards, an agent that faces a sequence of decisions needs to judiciously use information collected from past observations when trying to optimize future actions. A widely studied paradigm that captures this tension between the acquisition cost of new information (exploration) and the generation of instantaneous rewards based on the existing information (exploitation) is that of multi-armed bandits (MAB), originally proposed in the context of drug testing by [1], and placed in a general setting by [2]. The original setting has a gambler choosing among K slot machines at each round of play, and upon that selection observing a reward realization. In this classical formulation the rewards are assumed to be independent and identically distributed according to an unknown distribution that characterizes each machine. The objective is to maximize the expected sum of (possibly discounted) rewards received over a given (possibly infinite) time horizon. Since their inception, MAB problems with various modifications have been studied extensively in Statistics, Economics, Operations Research, and Computer Science, and are used to model a plethora of dynamic optimization problems under uncertainty; examples include clinical trials ([3]), strategic pricing ([4]), investment in innovation ([5]), packet routing ([6]), on-line auctions ([7]), assortment selection ([8]), and on-line advertising ([9]), to name but a few. For overviews and further references cf.
the monographs by [10] and [11] for Bayesian / dynamic programming formulations, and [12], which covers the machine learning literature and the so-called adversarial setting. Since the set of MAB instances in which one can identify the optimal policy is extremely limited, a typical yardstick to measure performance of a candidate policy is to compare it to a benchmark: an oracle that at each time instant selects the arm that maximizes expected reward. The difference between the performance of the policy and that of the oracle is called the regret. When the growth of the regret as a function of the horizon T is sublinear, the policy is long-run average optimal: its long-run average performance converges to that of the oracle. Hence the first-order objective is to develop policies with this characteristic. The precise rate of growth of the regret as a function of T provides a refined measure of policy performance. [13] is the first paper that provides a sharp characterization of the regret growth rate in the context of the traditional (stationary random rewards) setting, often referred to as the stochastic MAB problem. Most of the literature has followed this path with the objective of designing policies that exhibit the "slowest possible" rate of growth of the regret (often referred to as rate-optimal policies).

In many application domains, several of which were noted above, temporal changes in the reward distribution structure are an intrinsic characteristic of the problem. These are ignored in the traditional stochastic MAB formulation, but there have been several attempts to extend that framework. The origin of this line of work can be traced back to [14], who considered a case where only the state of the chosen arm can change, giving rise to a rich line of work (see, e.g., [15] and [16]).
In particular, [17] introduced the term restless bandits: a model in which the states (associated with reward distributions) of the arms change at each step according to an arbitrary, yet known, stochastic process. Considered a hard class of problems (cf. [18]), this line of work has led to various approximations (see, e.g., [19]), relaxations (see, e.g., [20]), and considerations of more detailed processes (see, e.g., [21] for irreducible Markov processes, and [22] for a class of history-dependent rewards).

Departure from the stationarity assumption that has dominated much of the MAB literature raises fundamental questions as to how one should model temporal uncertainty in rewards, and how to benchmark performance of candidate policies. One view is to allow the reward realizations to be selected at any point in time by an adversary. These ideas have their origins in game theory with the work of [23] and [24], and have since seen significant development; [25] and [12] provide reviews of this line of research. Within this so-called adversarial formulation, the efficacy of a policy over a given time horizon T is often measured relative to a benchmark defined by the single best action one could have taken in hindsight (after seeing all reward realizations). The single best action benchmark represents a static oracle, as it is constrained to a single (static) action. This static oracle can perform quite poorly relative to a dynamic oracle that follows the optimal dynamic sequence of actions, as the latter optimizes the (expected) reward at each time instant over all possible actions.¹ Thus, a potential limitation of the adversarial framework is that even if a policy has a "small" regret relative to a static oracle, there is no guarantee with regard to its performance relative to the dynamic oracle.

Main contributions.
The main contribution of this paper lies in fully characterizing the (regret) complexity of a broad class of MAB problems with non-stationary reward structure by establishing a direct link between the extent of reward "variation" and the minimal achievable regret. More specifically, the paper's contributions are along four dimensions. On the modeling side we formulate a class of non-stationary reward structures that is quite general, and hence can be used to realistically capture a variety of real-world phenomena, yet is mathematically tractable. The main constraint that we impose on the evolution of the mean rewards is that their variation over the relevant time horizon is bounded by a variation budget V_T, a concept that was recently introduced in [26] in the context of non-stationary stochastic approximation. This limits the power of nature compared to the adversarial setup discussed above, where rewards can be picked to maximally affect the policy's performance at each instance within {1, . . . , T}. Nevertheless, this constraint allows for a rich class of temporal changes, extending most of the treatment in the non-stationary stochastic MAB literature, which mainly focuses on a finite number of changes in the mean rewards; see, e.g., [27] and references therein.
We further discuss connections with previously studied non-stationary instances in §6.

The second dimension of contribution lies in the analysis domain. For a general class of non-stationary reward distributions we establish lower bounds on the performance of any non-anticipating policy relative to the dynamic oracle, and show that these bounds can be achieved, uniformly over the class of admissible reward distributions, by a suitable policy construction. The term "achieved" is meant in the sense of the order of the regret as a function of the time horizon T, the variation budget V_T, and the number of arms K. Our policies are shown to be minimax optimal up to a term that is logarithmic in the number of arms, and the regret is sublinear, of order (K V_T)^{1/3} T^{2/3}. Our analysis complements previously studied non-stationary instances by treating a broad and flexible class of temporal changes in the reward distributions, while still establishing optimality results and showing that sublinear regret is achievable. Our results provide a spectrum of orders of the minimax regret, ranging between order T^{2/3} (when V_T is a constant independent of T) and order T (when V_T grows linearly with T), mapping allowed variation to best achievable performance.

With the analysis described above we shed light on the exploration-exploitation trade-off that characterizes the non-stationary reward setting, and the change in this trade-off compared to the stationary setting.

¹ Under non-stationary rewards it is immediate that the single best action may be sub-optimal in many decision epochs, and the performance gap between the static and the dynamic oracles can grow linearly with T.
In particular, our results highlight the tension that exists between the need to "remember" and to "forget." This is characteristic of several algorithms that have been developed in the adversarial MAB literature, e.g., the family of exponential-weight methods such as Exp3, Exp3.S, and the like; see, e.g., [28] and [12]. In a nutshell, the fewer past observations one retains, the larger the stochastic error associated with one's estimates of the mean rewards, while at the same time using more past observations increases the risk of these estimates being biased.

One interesting observation drawn in this paper connects the adversarial MAB setting and the non-stationary environment studied here. In particular, as in [26], it is seen that an optimal policy in the adversarial setting may be suitably calibrated to perform near-optimally in the non-stationary stochastic setting. This will be further discussed after the main results are established.

2 Problem Formulation

Let K = {1, . . . , K} be a set of arms, and let T = {1, 2, . . . , T} denote the sequence of decision epochs faced by a decision-maker. At any epoch t ∈ T the decision-maker pulls one of the K arms. When pulling arm k ∈ K at epoch t ∈ T, a reward X^k_t ∈ [0, 1] is obtained, where X^k_t is a random variable with expectation μ^k_t = E[X^k_t]. We denote the best possible expected reward at decision epoch t by μ*_t, i.e., μ*_t = max_{k∈K} {μ^k_t}.

Changes in the expected rewards of the arms. We assume the expected reward of each arm μ^k_t may change at any decision epoch. We denote by μ^k = {μ^k_t}_{t=1}^{T} the sequence of expected rewards of arm k, and by μ = {μ^k}_{k=1}^{K} the sequence of vectors of all K expected rewards. We assume that the expected reward of each arm can change an arbitrary number of times, but bound the total variation of the expected rewards:

  Σ_{t=1}^{T−1} sup_{k∈K} |μ^k_t − μ^k_{t+1}|.   (1)

Let {V_t : t = 1, 2, . . .} be a non-decreasing sequence of positive real numbers such that V_1 = 0, K V_t ≤ t for all t, and for normalization purposes set V_2 = 2 · K^{−1}. We refer to V_T as the variation budget over T. We define the corresponding temporal uncertainty set as the set of reward vector sequences that are subject to the variation budget V_T over the set of decision epochs {1, . . . , T}:

  V = { μ ∈ [0, 1]^{K×T} : Σ_{t=1}^{T−1} sup_{k∈K} |μ^k_t − μ^k_{t+1}| ≤ V_T }.

The variation budget captures the constraint imposed on the non-stationary environment faced by the decision-maker. While limiting the possible evolution in the environment, it allows for numerous forms in which the expected rewards may change: continuously, in discrete shocks, and at a changing rate (Figure 1 depicts two different variation patterns that correspond to the same variation budget). In general, the variation budget V_T is designed to depend on the number of pulls T.

Admissible policies, performance, and regret. Let U be a random variable defined over a probability space (U, U, P_u). Let π_1 : U → K and π_t : [0, 1]^{t−1} × U → K for t = 2, 3, . . . be measurable functions. With some abuse of notation we denote by π_t ∈ K the action at time t, that is given by

  π_t = π_1(U) for t = 1, and π_t = π_t(X^π_{t−1}, . . . , X^π_1, U) for t = 2, 3, . . . .

Figure 1: Two instances of variation in the mean rewards: (Left) A fixed variation budget (that equals 3) is "spent" over the whole horizon. (Right) The same budget is "spent" in the first third of the horizon.

The mappings {π_t : t = 1, . . . , T} together with the distribution P_u define the class of admissible policies, denoted by P. We further denote by {H_t, t = 1, . . . , T} the filtration associated with a policy π ∈ P, such that H_1 = σ(U) and H_t = σ({X^π_j}_{j=1}^{t−1}, U) for all t ∈ {2, 3, . . .}. Note that policies in P are non-anticipating, i.e., they depend only on the past history of actions and observations, and allow for randomized strategies via their dependence on U.

We define the regret under policy π ∈ P compared to a dynamic oracle as the worst-case difference between the expected performance of pulling at each epoch t the arm which has the highest expected reward at epoch t (the dynamic oracle performance) and the expected performance under policy π:

  R^π(V, T) = sup_{μ∈V} { Σ_{t=1}^{T} μ*_t − E^π[ Σ_{t=1}^{T} μ^π_t ] },

where the expectation E^π[·] is taken with respect to the noisy rewards, as well as to the policy's actions. In addition, we denote by R*(V, T) the minimal worst-case regret that can be guaranteed by an admissible policy π ∈ P, that is, R*(V, T) = inf_{π∈P} R^π(V, T). Then, R*(V, T) is the best achievable performance. In the following sections we study the magnitude of R*(V, T).
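As a quick numerical sketch of the temporal uncertainty set defined above, the snippet below computes the total variation of a mean-reward matrix and checks membership in V. The two example patterns (a smooth sinusoidal drift and a piecewise-constant shock sequence, echoing Figure 1) are hypothetical illustrations, not instances from the paper.

```python
import numpy as np

def total_variation(mu):
    """Total variation of a K x T matrix of mean rewards:
    sum over t of sup_k |mu^k_{t+1} - mu^k_t|, as in (1)."""
    # mu[k, t] is the mean reward of arm k at epoch t+1
    return np.abs(np.diff(mu, axis=1)).max(axis=0).sum()

def in_uncertainty_set(mu, V_T):
    """Check that mu lies in [0,1]^{K x T} and respects the budget V_T."""
    return bool((mu >= 0).all() and (mu <= 1).all()
                and total_variation(mu) <= V_T + 1e-12)

T, V_T = 300, 3.0
t = np.arange(T)
# Two variation patterns drawing on the same budget:
slow = 0.5 + 0.25 * np.sin(np.pi * t / 100)        # smooth, continuous drift
shocks = np.where((t // 100) % 2 == 0, 0.25, 0.75)  # two discrete jumps of 0.5
mu = np.vstack([slow, shocks])
print(total_variation(mu), in_uncertainty_set(mu, V_T))
```

Note that the variation is computed per epoch as the worst change across arms, so two arms changing simultaneously draw on the budget only through the larger of the two changes.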
We analyze the magnitude of R*(V, T) by establishing upper and lower bounds; in these bounds we refer to a constant C as absolute if it is independent of K, V_T, and T.

3 Lower bound on the best achievable performance

We next provide a lower bound on the best achievable performance.

Theorem 1 Assume that rewards have a Bernoulli distribution. Then, there is some absolute constant C > 0 such that for any policy π ∈ P and for any T ≥ 1, K ≥ 2, and V_T ∈ [K^{−1}, K^{−1}T],

  R^π(V, T) ≥ C (K V_T)^{1/3} T^{2/3}.

We note that when reward distributions are stationary, there are known policies, such as UCB1 ([29]), that achieve regret of order √T in the stochastic setup. When the reward structure is non-stationary and defined by the class V, no policy may achieve such a performance, and the best performance must incur a regret of at least order T^{2/3}. This additional complexity embedded in the non-stationary stochastic MAB problem compared to the stationary one will be further discussed in §6. We note that Theorem 1 also holds when V_T is increasing with T. In particular, when the variation budget is linear in T, the regret grows linearly and long-run average optimality is not achievable.

The driver of the change in the best achievable performance, relative to the one established in a stationary environment, is a second trade-off introduced by the non-stationary environment (on top of the tension between exploring different arms and capitalizing on the information already collected): the trade-off between "remembering" and "forgetting." Estimating the expected rewards is done based on past observations of rewards. While keeping track of more observations may decrease the variance of mean-reward estimates, the non-stationary environment implies that "old" information is potentially less relevant due to possible changes in the underlying rewards. The changing rewards give incentive to dismiss old information, which in turn encourages enhanced exploration. The proof of Theorem 1 emphasizes the impact of these trade-offs on the achievable performance.

Key ideas in the proof. At a high level, the proof of Theorem 1 builds on ideas of identifying a worst-case "strategy" of nature (e.g., [28], proof of Theorem 5.1), adapting them to our setting. While the proof is deferred to the online companion (as supporting material), we next describe the key ideas for the case V_T = 1.² We define a subset of vector sequences V′ ⊂ V and show that when μ is drawn randomly from V′, any admissible policy must incur regret of order (K V_T)^{1/3} T^{2/3}. We define a partition of the decision horizon T into batches T_1, . . . , T_m of size Δ̃_T each (except, possibly, the last batch):

  T_j = { t : (j − 1)Δ̃_T + 1 ≤ t ≤ min{ jΔ̃_T, T } }  for all j = 1, . . . , m,   (2)

where m = ⌈T / Δ̃_T⌉ is the number of batches. In V′, in every batch there is exactly one "good" arm with expected reward 1/2 + ε for some 0 < ε ≤ 1/4, and all the other arms have expected reward 1/2. The "good" arm is drawn independently at the beginning of each batch according to a discrete uniform distribution over {1, . . . , K}. Thus, the identity of the "good" arm can change only between batches. By selecting ε such that εT / Δ̃_T ≤ V_T, any μ ∈ V′ is composed of expected reward sequences with a variation of at most V_T, and therefore V′ ⊂ V. Given the draws under which expected reward sequences are generated, nature prevents any accumulation of information from one batch to another, since at the beginning of each batch a new "good" arm is drawn independently of the history.
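The construction of V′ can be sketched numerically. The sampler below follows the proof outline for the case V_T = 1, with Δ̃_T and ε chosen per the stated scalings; the specific constants and the check at the end are illustrative assumptions, not part of the formal argument.

```python
import numpy as np

rng = np.random.default_rng(0)

def worst_case_instance(K, T, V_T=1.0):
    """Sample mean rewards from the subset V' used in the lower bound:
    one uniformly drawn 'good' arm (mean 1/2 + eps) per batch of length
    ~Delta_T; all other arms have mean 1/2."""
    delta = int(np.ceil((T / V_T) ** (2.0 / 3.0)))  # batch size ~ T^{2/3}
    eps = min(0.25, 1.0 / np.sqrt(delta))           # eps ~ 1 / sqrt(Delta_T)
    mu = np.full((K, T), 0.5)
    for start in range(0, T, delta):
        good = rng.integers(K)                      # drawn afresh each batch
        mu[good, start:start + delta] += eps
    return mu, delta, eps

mu, delta, eps = worst_case_instance(K=3, T=10_000)
# At a batch boundary the largest single-arm change is eps, so the total
# variation is at most eps * (T / delta), i.e. within the budget V_T.
changes = np.abs(np.diff(mu, axis=1)).max(axis=0).sum()
print(delta, eps, changes)
```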
The proof of Theorem 1 establishes that when ε ≈ 1/√Δ̃_T, no admissible policy can identify the "good" arm with high probability within a batch. Since there are Δ̃_T epochs in each batch, the regret that any policy must incur along a batch is of order Δ̃_T · ε ≈ √Δ̃_T, which yields a regret of order √Δ̃_T · T/Δ̃_T ≈ T/√Δ̃_T throughout the whole horizon. Selecting the smallest feasible Δ̃_T such that the variation budget constraint is satisfied leads to Δ̃_T ≈ T^{2/3}, yielding a regret of order T^{2/3} throughout the horizon.

² For the sake of simplicity, the discussion in this paragraph assumes a variation budget that is fixed and independent of T; the proof of Theorem 1 details a general treatment for a budget that depends on T.

4 A near-optimal policy

We apply the ideas underlying the lower bound in Theorem 1 to develop a rate-optimal policy for the non-stationary stochastic MAB problem with a variation budget. Consider the following policy:

Rexp3. Inputs: a positive number γ, and a batch size Δ_T.

1. Set batch index j = 1.
2. Repeat while j ≤ ⌈T/Δ_T⌉:
   (a) Set τ = (j − 1)Δ_T.
   (b) Initialization: for any k ∈ K set w^k_t = 1.
   (c) Repeat for t = τ + 1, . . . , min{T, τ + Δ_T}:
       • For each k ∈ K, set p^k_t = (1 − γ) · w^k_t / Σ_{k′=1}^{K} w^{k′}_t + γ/K.
       • Draw an arm k′ from K according to the distribution {p^k_t}_{k=1}^{K}.
       • Receive a reward X^{k′}_t.
       • For k′ set X̂^{k′}_t = X^{k′}_t / p^{k′}_t, and for any k ≠ k′ set X̂^k_t = 0. For all k ∈ K update w^k_{t+1} = w^k_t · exp{ γ X̂^k_t / K }.
   (d) Set j = j + 1, and return to the beginning of step 2.

Clearly π ∈ P.
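A minimal, self-contained sketch of the Rexp3 steps listed above (exponential weights with periodic restarts). The Bernoulli environment, the parameter values, and the `reward_fn` interface are illustrative assumptions added for the demonstration, not part of the paper's formal statement.

```python
import numpy as np

def rexp3(reward_fn, K, T, gamma, batch_size, rng):
    """Rexp3: run Exp3 with mixing parameter gamma, resetting the
    weights to 1 every batch_size epochs (step 2b of the listing)."""
    total = 0.0
    for tau in range(0, T, batch_size):                # step 2: batches
        w = np.ones(K)                                 # (b) re-initialize
        for t in range(tau, min(T, tau + batch_size)):
            p = (1 - gamma) * w / w.sum() + gamma / K  # mixed distribution
            k = rng.choice(K, p=p)
            x = reward_fn(k, t)                        # only the pulled arm is observed
            total += x
            xhat = x / p[k]                            # importance-weighted estimate
            w[k] *= np.exp(gamma * xhat / K)           # multiplicative update
    return total

# Illustrative non-stationary environment: the good arm switches mid-horizon.
rng = np.random.default_rng(1)
mu = np.full((2, 2000), 0.3)
mu[0, :1000] = 0.7                                     # arm 0 good in first half
mu[1, 1000:] = 0.7                                     # arm 1 good in second half
reward = lambda k, t: float(rng.random() < mu[k, t])   # Bernoulli rewards

earned = rexp3(reward, K=2, T=2000, gamma=0.3, batch_size=200, rng=rng)
oracle = mu.max(axis=0).sum()   # dynamic oracle: best mean at every epoch
print(earned, oracle, oracle - earned)
```

Because the weights are discarded at every restart, the policy re-learns the good arm within each batch rather than clinging to stale information from before the switch.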
The Rexp3 policy uses Exp3, a policy introduced by [30] for solving a worst-case sequential allocation problem, as a subroutine, restarting it every Δ_T epochs.

Theorem 2 Let π be the Rexp3 policy with a batch size Δ_T = ⌈(K log K)^{1/3} (T/V_T)^{2/3}⌉ and with γ = min{ 1, √( K log K / ((e − 1)Δ_T) ) }. Then, there is some absolute constant C̄ such that for every T ≥ 1, K ≥ 2, and V_T ∈ [K^{−1}, K^{−1}T]:

  R^π(V, T) ≤ C̄ (K log K · V_T)^{1/3} T^{2/3}.

Theorem 2 is obtained by establishing a connection between the regret relative to the single best action in the adversarial setting, and the regret with respect to the dynamic oracle in the non-stationary stochastic setting with a variation budget. Several classes of policies, such as exponential-weight policies (including Exp3) and polynomial-weight policies, have been shown to achieve regret of order √T with respect to the single best action in the adversarial setting (see chapter 6 of [12] for a review). While in general these policies tend to perform well numerically, there is no guarantee for their performance relative to the dynamic oracle studied in this paper, since the single best action itself may incur linear regret relative to the dynamic oracle; see also [31] for a study of the empirical performance of one class of algorithms.
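The tuning prescribed by Theorem 2 can be computed directly. The sketch below evaluates the stated formulas for Δ_T and γ, together with the order of the regret upper bound; the particular values of K, T, and V_T are arbitrary illustrative inputs.

```python
import math

def rexp3_tuning(K, T, V_T):
    """Batch size and Exp3 parameter prescribed by Theorem 2:
    Delta_T = ceil((K log K)^{1/3} (T / V_T)^{2/3}),
    gamma   = min{1, sqrt(K log K / ((e - 1) Delta_T))}."""
    delta = math.ceil((K * math.log(K)) ** (1 / 3) * (T / V_T) ** (2 / 3))
    gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * delta)))
    return delta, gamma

def regret_upper_bound_order(K, T, V_T):
    """Order of the guaranteed regret: (K log K * V_T)^{1/3} T^{2/3}."""
    return (K * math.log(K) * V_T) ** (1 / 3) * T ** (2 / 3)

delta, gamma = rexp3_tuning(K=10, T=10**6, V_T=5.0)
print(delta, gamma, regret_upper_bound_order(10, 10**6, 5.0))
```

As expected, a larger variation budget shortens the batches (more frequent restarts) while the resulting regret bound remains sublinear in T.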
The proof of Theorem 2 shows that any policy that achieves regret of order √T with respect to the single best action in the adversarial setting can be used as a subroutine to obtain near-optimal performance with respect to the dynamic oracle in our setting.

Rexp3 emphasizes the two trade-offs discussed in the previous section. The first trade-off, information acquisition versus capitalizing on existing information, is captured by the subroutine policy Exp3. In fact, any policy that achieves good performance compared to a single-best-action benchmark in the adversarial setting must balance exploration and exploitation. The second trade-off, "remembering" versus "forgetting," is captured by restarting Exp3 and forgetting any acquired information every Δ_T pulls. Thus, old information that may slow down adaptation to the changing environment is discarded. Theorem 1 and Theorem 2 together characterize the minimax regret (up to a multiplicative factor, logarithmic in the number of arms) over a full spectrum of variations V_T:

  R*(V, T) ≍ (K V_T)^{1/3} T^{2/3}.

Hence, we have quantified the impact of the extent of change in the environment on the best achievable performance in this broad class of problems. For example, for the case in which V_T = C · T^β, for some absolute constant C and 0 ≤ β < 1, the best achievable regret is of order T^{(2+β)/3}.

We finally note that restarting is only one way of adapting policies from the adversarial MAB setting to achieve near optimality in the non-stationary stochastic setting; it is a way that articulates well the principles leading to near optimality.
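The mapping from allowed variation to achievable performance in the display above is a one-line computation; the tiny sketch below (an added illustration) makes the exponent map β ↦ (2 + β)/3 explicit.

```python
def minimax_regret_exponent(beta):
    """Exponent of T in the minimax regret when V_T ~ T^beta, 0 <= beta < 1,
    per the characterization R* ~ (K V_T)^{1/3} T^{2/3}."""
    assert 0 <= beta < 1
    return (2 + beta) / 3

# Constant variation (beta = 0) recovers the T^{2/3} rate; as beta
# approaches 1 the exponent approaches 1, i.e. regret becomes linear
# and long-run average optimality is lost.
print([minimax_regret_exponent(b) for b in (0, 0.5, 0.9)])
```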
In the online companion we demonstrate that near optimality can be achieved by other adaptation methods, showing that the Exp3.S policy (given in [28]) can be tuned by α = 1/T and γ ≈ (K V_T / T)^{1/3} to achieve near optimality in our setting, without restarting.

5 Proof of Theorem 2

The structure of the proof is as follows. First, we break the horizon into a sequence of batches of size Δ_T each, and analyze the performance gap between the single best action and the dynamic oracle in each batch. Then, we plug in a known performance guarantee for Exp3 relative to the single best action, and sum over batches to establish the regret of Rexp3 relative to the dynamic oracle.

Step 1 (Preliminaries). Fix T ≥ 1, K ≥ 2, and V_T ∈ [K^{−1}, K^{−1}T]. Let π be the Rexp3 policy, tuned by γ = min{ 1, √( K log K / ((e − 1)Δ_T) ) } and Δ_T ∈ {1, . . . , T} (to be specified later on). We break the horizon T into a sequence of batches T_1, . . . , T_m of size Δ_T each (except, possibly, T_m) according to (2). Let μ ∈ V, and fix j ∈ {1, . . . , m}. We decompose the regret in batch j:

  E^π[ Σ_{t∈T_j} (μ*_t − μ^π_t) ] = ( Σ_{t∈T_j} μ*_t − E[ max_{k∈K} Σ_{t∈T_j} X^k_t ] ) + ( E[ max_{k∈K} Σ_{t∈T_j} X^k_t ] − E^π[ Σ_{t∈T_j} μ^π_t ] ) =: J_{1,j} + J_{2,j}.   (3)

The first component, J_{1,j}, is the expected loss associated with using a single action over batch j. The second component, J_{2,j}, is the expected regret relative to the best static action in batch j.

Step 2 (Analysis of J_{1,j} and J_{2,j}). Defining μ^k_{T+1} = μ^k_T for all k ∈ K, we denote the variation in expected rewards along batch T_j by V_j = Σ_{t∈T_j} max_{k∈K} |μ^k_{t+1} − μ^k_t|. We note that

  Σ_{j=1}^{m} V_j = Σ_{t=1}^{T} max_{k∈K} |μ^k_{t+1} − μ^k_t| ≤ V_T.   (4)

Let k_0 be an arm with best expected performance over T_j: k_0 ∈ arg max_{k∈K} { Σ_{t∈T_j} μ^k_t }. Then,

  E[ max_{k∈K} Σ_{t∈T_j} X^k_t ] ≥ E[ Σ_{t∈T_j} X^{k_0}_t ] = Σ_{t∈T_j} μ^{k_0}_t,   (5)

and therefore, one has:

  J_{1,j} = Σ_{t∈T_j} μ*_t − E[ max_{k∈K} Σ_{t∈T_j} X^k_t ] ≤(a) Σ_{t∈T_j} ( μ*_t − μ^{k_0}_t ) ≤ Δ_T · max_{t∈T_j} { μ*_t − μ^{k_0}_t } ≤(b) 2 V_j Δ_T   (6)

for any μ ∈ V and j ∈ {1, . . . , m}, where (a) holds by (5) and (b) holds by the following argument: otherwise there is an epoch t_0 ∈ T_j for which μ*_{t_0} − μ^{k_0}_{t_0} > 2V_j. Indeed, let k_1 = arg max_{k∈K} μ^k_{t_0}. In such a case, for all t ∈ T_j one has μ^{k_1}_t ≥ μ^{k_1}_{t_0} − V_j > μ^{k_0}_{t_0} + V_j ≥ μ^{k_0}_t, since V_j is the maximal variation in batch T_j. This, however, contradicts the optimality of k_0 over batch T_j, and thus (6) holds.

In addition, Corollary 3.2 in [28] points out that the regret incurred by Exp3 (tuned by γ = min{ 1, √( K log K / ((e − 1)Δ_T) ) }) along Δ_T epochs, relative to the single best action, is bounded by 2√(e − 1) · √(Δ_T K log K). Therefore, for each j ∈ {1, . . . , m} one has

  J_{2,j} = E[ max_{k∈K} Σ_{t∈T_j} X^k_t ] − E^π[ Σ_{t∈T_j} μ^π_t ] ≤(a) 2√(e − 1) √(Δ_T K log K)   (7)

for any μ ∈ V, where (a) holds since within each batch arms are pulled according to Exp3(γ).

Step 3 (Regret throughout the horizon). Summing over m = ⌈T/Δ_T⌉ batches we have:

  R^π(V, T) = sup_{μ∈V} { Σ_{t=1}^{T} μ*_t − E^π[ Σ_{t=1}^{T} μ^π_t ] } ≤(a) Σ_{j=1}^{m} ( 2√(e − 1) √(Δ_T K log K) + 2 V_j Δ_T ) ≤(b) ( T/Δ_T + 1 ) · 2√(e − 1) √(Δ_T K log K) + 2 Δ_T V_T   (8)

  = 2√(e − 1) · √(K log K) · T / √Δ_T + 2√(e − 1) √(Δ_T K log K) + 2 Δ_T V_T,

where (a) holds by (3), (6), and (7), and (b) follows from (4). Finally, selecting Δ_T = ⌈(K log K)^{1/3} (T/V_T)^{2/3}⌉, each of the three terms above is of order (K log K · V_T)^{1/3} T^{2/3} (using T ≥ K ≥ 2 and V_T ∈ [K^{−1}, K^{−1}T] to absorb the rounding of Δ_T), and we establish:

  R^π(V, T) ≤ C̄ (K log K · V_T)^{1/3} T^{2/3}

for some absolute constant C̄. This concludes the proof.

6 Discussion

Unknown variation budget. The Rexp3 policy relies on prior knowledge of V_T, but predictions of V_T may be inaccurate (such an estimate can be maintained from historical data if actions are occasionally randomized, for example, by fitting V_T = T^α).
Denoting the “true” variation budget by $V_T$ and the estimate used by the agent when tuning Rexp3 by $\hat{V}_T$, one may observe that the analysis in the proof of Theorem 2 holds up to equation (8), but then $\Delta_T$ is tuned using $\hat{V}_T$. This implies that when $V_T$ and $\hat{V}_T$ are “close,” Rexp3 still guarantees long-run average optimality. For example, suppose that Rexp3 is tuned by $\hat{V}_T = T^\alpha$, but the variation is $V_T = T^{\alpha+\delta}$. Then sublinear regret (of order $T^{2/3+\alpha/3+\delta}$) is guaranteed as long as $\delta < (1-\alpha)/3$; e.g., if $\alpha = 0$ and $\delta = 1/4$, Rexp3 guarantees regret of order $T^{11/12}$ (accurate tuning would have guaranteed order $T^{3/4}$). Since there are no restrictions on the rate at which the variation budget can be spent, an interesting and potentially challenging open problem is to delineate to what extent it is possible to design adaptive policies that do not use prior knowledge of $V_T$, yet guarantee “good” performance.
Contrasting with traditional (stationary) MAB problems. The characterized minimax regret in the stationary stochastic setting is of order $\sqrt{T}$ when expected rewards can be arbitrarily close to each other, and of order $\log T$ when rewards are “well separated” (see [13] and [29]). Contrasting the minimax regret (of order $V_T^{1/3} T^{2/3}$) we have established in the stochastic non-stationary MAB problem with those established in stationary settings allows one to quantify the “price of non-stationarity,” which mathematically captures the added complexity embedded in changing rewards versus stationary ones (as a function of the allowed variation). Clearly, additional complexity is introduced even when the allowed variation is fixed and independent of the horizon length.
Contrasting with other non-stationary MAB instances.
The class of MAB problems with non-stationary rewards formulated in this paper extends other MAB formulations that allow rewards to change in a more structured manner. For example, [32] consider a setting where rewards evolve according to a Brownian motion and regret is linear in $T$; our results (when $V_T$ is linear in $T$) are consistent with theirs. Two other representative studies are those of [27], who study a stochastic MAB problem in which expected rewards may change a finite number of times, and [28], who formulate an adversarial MAB problem in which the identity of the best arm may change a finite number of times. Both studies suggest policies that, utilizing the prior knowledge that the number of changes must be finite, achieve regret of order $\sqrt{T}$ relative to the best sequence of actions. However, the performance of these policies can deteriorate to regret that is linear in $T$ when the number of changes is allowed to depend on $T$. When there is finite variation ($V_T$ is fixed and independent of $T$) but not necessarily a finite number of changes, we establish that the best achievable performance deteriorates to regret of order $T^{2/3}$. In that respect, it is not surprising that the “hard case” used to establish the lower bound in Theorem 1 describes a strategy of nature that allocates variation over a large (as a function of $T$) number of changes in the expected rewards.
Low variation rates. While our formulation focuses on “significant” variation in the mean rewards, our established bounds also hold for “smaller” variation scales; when $V_T$ decreases from $O(1)$ to $O(T^{-1/2})$, the minimax regret rate decreases from $T^{2/3}$ to $\sqrt{T}$. Indeed, when the variation scale is $O(T^{-1/2})$ or smaller, the rate of regret coincides with that of the classical stochastic MAB setting.

References

[1] W. R. Thompson.
On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.

[2] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527–535, 1952.

[3] M. Zelen. Play the winner rule and the controlled clinical trial. Journal of the American Statistical Association, 64:131–146, 1969.

[4] D. Bergemann and J. Valimaki. Learning and strategic pricing. Econometrica, 64:1125–1149, 1996.

[5] D. Bergemann and U. Hege. The financing of innovation: Learning and stopping. RAND Journal of Economics, 36(4):719–752, 2005.

[6] B. Awerbuch and R. D. Kleinberg. Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. In Proceedings of the 36th ACM Symposium on Theory of Computing (STOC), pages 45–53, 2004.

[7] R. D. Kleinberg and T. Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 594–605, 2003.

[8] F. Caro and J. Gallien. Dynamic assortment with demand learning for seasonal consumer goods. Management Science, 53:276–292, 2007.

[9] S. Pandey, D. Agarwal, D. Chakrabarti, and V. Josifovski. Bandits for taxonomies: A model-based approach. In SIAM International Conference on Data Mining, 2007.

[10] D. A. Berry and B. Fristedt. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, 1985.

[11] J. C. Gittins. Multi-Armed Bandit Allocation Indices. John Wiley and Sons, 1989.

[12] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, Cambridge, UK, 2006.

[13] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.

[14] J. C.
Gittins and D. M. Jones. A dynamic allocation index for the sequential design of experiments. North-Holland, 1974.

[15] J. C. Gittins. Bandit processes and dynamic allocation indices (with discussion). Journal of the Royal Statistical Society, Series B, 41:148–177, 1979.

[16] P. Whittle. Arm-acquiring bandits. The Annals of Probability, 9:284–292, 1981.

[17] P. Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25A:287–298, 1988.

[18] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of optimal queueing network control. In Structure in Complexity Theory Conference, pages 318–322, 1994.

[19] D. Bertsimas and J. Nino-Mora. Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Operations Research, 48(1):80–90, 2000.

[20] S. Guha and K. Munagala. Approximation algorithms for partial-information based stochastic control with Markovian rewards. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 483–493, 2007.

[21] R. Ortner, D. Ryabko, P. Auer, and R. Munos. Regret bounds for restless Markov bandits. In Algorithmic Learning Theory, pages 214–228. Springer Berlin Heidelberg, 2012.

[22] M. G. Azar, A. Lazaric, and E. Brunskill. Stochastic optimization of a locally smooth function under correlated bandit feedback. arXiv preprint arXiv:1402.0562, 2014.

[23] D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6:1–8, 1956.

[24] J. Hannan. Approximation to Bayes risk in repeated play. In Contributions to the Theory of Games, Volume 3. Princeton University Press, Princeton, NJ, 1957.

[25] D. P. Foster and R. V. Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29:7–35, 1999.

[26] O. Besbes, Y. Gur, and A. Zeevi. Non-stationary stochastic optimization. Working paper, 2014.

[27] A.
Garivier and E. Moulines. On upper-confidence bound policies for switching bandit problems. In Algorithmic Learning Theory, pages 174–188. Springer Berlin Heidelberg, 2011.

[28] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 32:48–77, 2002.

[29] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.

[30] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.

[31] C. Hartland, S. Gelly, N. Baskiotis, O. Teytaud, and M. Sebag. Multi-armed bandit, dynamic environments and meta-bandits. In NIPS-2006 Workshop on Online Trading between Exploration and Exploitation, Whistler, Canada, 2006.

[32] A. Slivkins and E. Upfal. Adapting to a changing environment: The Brownian restless bandits. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 343–354, 2008.