{"title": "Rotting Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 3074, "page_last": 3083, "abstract": "The Multi-Armed Bandits (MAB) framework highlights the trade-off between acquiring new knowledge (Exploration) and leveraging available knowledge (Exploitation). In the classical MAB problem, a decision maker must choose an arm at each time step, upon which she receives a reward. The decision maker's objective is to maximize her cumulative expected reward over the time horizon. The MAB problem has been studied extensively, specifically under the assumption of the arms' rewards distributions being stationary, or quasi-stationary, over time. We consider a variant of the MAB framework, which we termed Rotting Bandits, where each arm's expected reward decays as a function of the number of times it has been pulled. We are motivated by many real-world scenarios such as online advertising, content recommendation, crowdsourcing, and more. We present algorithms, accompanied by simulations, and derive theoretical guarantees.", "full_text": "Rotting Bandits\n\nNir Levine\n\nKoby Crammer\n\nElectrical Engineering Department\n\nElectrical Engineering Department\n\nThe Technion\n\nHaifa 32000, Israel\n\nlevin.nir1@gmail.com\n\nThe Technion\n\nHaifa 32000, Israel\n\nkoby@ee.technion.ac.il\n\nShie Mannor\n\nElectrical Engineering Department\n\nThe Technion\n\nHaifa 32000, Israel\n\nshie@ee.technion.ac.il\n\nAbstract\n\nThe Multi-Armed Bandits (MAB) framework highlights the trade-off between\nacquiring new knowledge (Exploration) and leveraging available knowledge (Ex-\nploitation). In the classical MAB problem, a decision maker must choose an arm at\neach time step, upon which she receives a reward. The decision maker\u2019s objective\nis to maximize her cumulative expected reward over the time horizon. 
The MAB problem has been studied extensively, specifically under the assumption of the arms' rewards distributions being stationary, or quasi-stationary, over time. We consider a variant of the MAB framework, which we termed Rotting Bandits, where each arm's expected reward decays as a function of the number of times it has been pulled. We are motivated by many real-world scenarios such as online advertising, content recommendation, crowdsourcing, and more. We present algorithms, accompanied by simulations, and derive theoretical guarantees.\n\n1 Introduction\n\nOne of the most fundamental trade-offs in stochastic decision theory is the well celebrated Exploration vs. Exploitation dilemma. Should one acquire new knowledge at the expense of a possible sacrifice in immediate reward (Exploration), or leverage past knowledge in order to maximize the instantaneous reward (Exploitation)? Solutions that have been demonstrated to perform well are those which succeed in balancing the two. First proposed by Thompson [1933] in the context of drug trials, and later formulated in a more general setting by Robbins [1985], MAB problems serve as a distilled framework for this dilemma. In the classical setting of the MAB, at each time step the decision maker must choose (pull) between a fixed number of arms. After pulling an arm, she receives a reward which is a realization drawn from the arm's underlying reward distribution. The decision maker's objective is to maximize her cumulative expected reward over the time horizon. An equivalent objective, more typically studied, is minimizing the regret, which is defined as the difference between the optimal cumulative expected reward (under full information) and that of the policy deployed by the decision maker. The MAB formulation has been studied extensively, and was leveraged to formulate many real-world problems. 
Some examples of such modeling are online advertising [Pandey et al., 2007], routing of packets [Awerbuch and Kleinberg, 2004], and online auctions [Kleinberg and Leighton, 2003].\nMost past work (Section 6) on the MAB framework has been performed under the assumption that the underlying distributions are stationary, or possibly quasi-stationary.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn many real-world scenarios, this assumption may seem simplistic. Specifically, we are motivated by real-world scenarios where the expected reward of an arm decreases over the time instances in which it has been pulled. We term this variant Rotting Bandits. For motivational purposes, we present the following two examples.\n• Consider an online advertising problem where an agent must choose which ad (arm) to present (pull) to a user. It seems reasonable that the effectiveness (reward) of a specific ad on a user would deteriorate with exposures. Similarly, in the content recommendation context, Agarwal et al. [2009] showed that articles' CTR decays with the number of exposures.\n• Consider the problem of assigning projects through crowdsourcing systems [Tran-Thanh et al., 2012]. Given that the assignments primarily require human perception, subjects may become bored and their performance decay (e.g., license plate transcriptions [Du et al., 2013]).\nAs opposed to the stationary case, where the optimal policy is to always choose some specific arm, in the case of Rotting Bandits the optimal policy consists of choosing different arms. This results in the notion of adversarial regret vs. policy regret [Arora et al., 2012] (see Section 6). 
In this work we tackle the harder problem of minimizing the policy regret.\nThe main contributions of this paper are the following:\n• Introducing a novel, real-world oriented MAB formulation, termed Rotting Bandits.\n• Presenting an easy-to-follow algorithm for the general case, accompanied by theoretical guarantees.\n• Refining the theoretical guarantees for the case of existing prior knowledge on the rotting models, accompanied by suitable algorithms.\nThe rest of the paper is organized as follows: in Section 2 we present the model and relevant preliminaries. In Section 3 we present our algorithm along with theoretical guarantees for the general case. In Section 4 we do the same for the parameterized case, followed by simulations in Section 5. In Section 6 we review related work, and conclude with a discussion in Section 7.\n\n2 Model and Preliminaries\n\nWe consider the problem of Rotting Bandits (RB); an agent is given K arms and at each time step t = 1, 2, .. one of the arms must be pulled. We denote the arm that is pulled at time step t as i (t) ∈ [K] = {1, .., K}. When arm i is pulled for the nth time, the agent receives a time-independent, σ² sub-Gaussian random reward, r_t, with mean µ_i (n).1\nIn this work we consider two cases: (1) There is no prior knowledge on the expected rewards, except for the 'rotting' assumption to be presented shortly, i.e., a non-parametric case (NPC). (2) There is prior knowledge that the expected rewards are composed of an unknown constant part and a rotting part which is known to belong to a set of rotting models, i.e., a parametric case (PC).\nLet N_i (t) be the number of pulls of arm i at time t, not including this round's choice (N_i (1) = 0), and Π the set of all sequences i (1), i (2), .., where i (t) ∈ [K], ∀t ∈ N; i.e., π ∈ Π is an infinite sequence of actions (arms), also referred to as a policy. 
We denote the arm that is chosen by policy π at time t as π (t). The objective of an agent is to maximize the expected total reward in time T, defined for policy π ∈ Π by,\n\nJ (T; π) = E[ Σ_{t=1}^{T} µ_{π(t)} ( N_{π(t)} (t) + 1 ) ]   (1)\n\nWe consider the equivalent objective of minimizing the regret in time T, defined by,\n\nR (T; π) = max_{π̃∈Π} {J (T; π̃)} − J (T; π).   (2)\n\nAssumption 2.1. (Rotting) ∀i ∈ [K], µ_i (n) is positive, and non-increasing in n.\n\n1 Our results hold for pulls-number dependent variances σ² (n), by upper bounding them: σ² ≥ σ² (n), ∀n. It is fairly straightforward to adapt the results to pulls-number dependent variances, but we believe that the way presented conveys the setting in the clearest way.\n\n2.1 Optimal Policy\n\nLet π_max be a policy defined by,\n\nπ_max (t) ∈ argmax_{i∈[K]} {µ_i (N_i (t) + 1)}   (3)\n\nwhere, in case of a tie, it is broken randomly.\nLemma 2.1. π_max is an optimal policy for the RB problem.\nProof: See Appendix B of the supplementary material.\n\n3 Non-Parametric Case\n\nIn the NPC setting for the RB problem, the only information we have is that the expected rewards sequences are positive and non-increasing in the number of pulls. The Sliding-Window Average (SWA) approach is a heuristic for ensuring, with high probability, that at each time step the agent did not sample significantly sub-optimal arms too many times. We note that, potentially, the optimal arm changes throughout the trajectory, as Lemma 2.1 suggests. 
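The interaction protocol and the oracle policy π_max of Eq. (3) can be sketched in a few lines of code. The following is a minimal illustration (our own toy construction; the Gaussian environment and the two mean functions are assumptions for the demo, not part of the paper):

```python
import numpy as np

class RottingBandit:
    """K-armed rotting bandit: arm i's n-th pull returns a noisy reward
    with mean mu_i(n), positive and non-increasing in n (Assumption 2.1)."""

    def __init__(self, mus, sigma=0.2, seed=0):
        self.mus = mus                              # mus[i](n): expected reward of arm i's n-th pull
        self.sigma = sigma
        self.pulls = np.zeros(len(mus), dtype=int)  # N_i(t): pulls of arm i so far
        self.rng = np.random.default_rng(seed)

    def pull(self, i):
        self.pulls[i] += 1
        return self.mus[i](self.pulls[i]) + self.sigma * self.rng.standard_normal()

def pi_max(env):
    """Oracle policy of Eq. (3): pull the arm whose *next* pull has the
    highest expected reward, breaking ties randomly."""
    vals = np.array([mu(n + 1) for mu, n in zip(env.mus, env.pulls)])
    best = np.flatnonzero(vals == vals.max())
    return int(env.rng.choice(best))

# One constant arm and one arm whose mean rots geometrically with its pulls.
env = RottingBandit([lambda n: 0.5, lambda n: 0.9 ** n])
for _ in range(5):
    env.pull(pi_max(env))
print(env.pulls.tolist())  # [0, 5]: the rotting arm still offers the better next pull
```

Note that π_max needs the true mean functions, so it serves only as the regret baseline of Eq. (2); the learning algorithms below (SWA, CTO) can be plugged in its place.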
We start by assuming that we know the time horizon, and later account for the case where we do not.\nKnown Horizon\nThe idea behind the SWA approach is that after we pulled a significantly sub-optimal arm 'enough' times, the empirical average of these 'enough' pulls would be distinguishable from the optimal arm for that time step and, as such, given any time step there is a bounded number of significantly sub-optimal pulls compared to the optimal policy. A pseudo algorithm for SWA is given by Algorithm 1.\n\nAlgorithm 1 SWA\nInput : K, T, α > 0\nInitialize : M ← ⌈α 4^{2/3} σ^{2/3} K^{−2/3} T^{2/3} ln^{1/3}(√2 T)⌉, and N_i ← 0 for all i ∈ [K]\nfor t = 1, 2, .., KM do\n  Ramp up : i (t) by Round-Robin, receive r_t, and set N_{i(t)} ← N_{i(t)} + 1 ; r^{i(t)}_{N_{i(t)}} ← r_t\nend for\nfor t = KM + 1, ..., T do\n  Balance : i (t) ∈ argmax_{i∈[K]} { (1/M) Σ_{n=N_i−M+1}^{N_i} r^i_n }\n  Update : receive r_t, and set N_{i(t)} ← N_{i(t)} + 1 ; r^{i(t)}_{N_{i(t)}} ← r_t\nend for\n\nTheorem 3.1. Suppose Assumption 2.1 holds. The SWA algorithm achieves regret bounded by,\n\nR (T; π_SWA) ≤ ( α max_{i∈[K]} µ_i (1) + α^{−1/2} ) 4^{2/3} σ^{2/3} K^{1/3} T^{2/3} ln^{1/3}(√2 T) + 3K max_{i∈[K]} µ_i (1)   (4)\n\nProof: See Appendix C.1 of the supplementary material.\nWe note that the upper bound obtains its minimum for α = (2 max_{i∈[K]} µ_i (1))^{−2/3}, which can serve as a way to choose α if max_{i∈[K]} µ_i (1) is known, but α can also be given as an input to SWA to allow control on the averaging window size.\nUnknown Horizon\nIn this case we use the doubling trick in order to achieve the same horizon-dependent rate for the regret. We apply the SWA algorithm with a series of increasing horizons (powers of two, i.e., 1, 2, 4, ..) 
until reaching the (unknown) horizon. We term this algorithm wSWA (wrapper SWA).\nCorollary 3.1.1. Suppose Assumption 2.1 holds. The wSWA algorithm achieves regret bounded by,\n\nR (T; π_wSWA) ≤ ( α max_{i∈[K]} µ_i (1) + α^{−1/2} ) 8 σ^{2/3} K^{1/3} T^{2/3} ln^{1/3}(√2 T) + 3K max_{i∈[K]} µ_i (1) (log_2 T + 1)   (5)\n\nProof: See Appendix C.2 of the supplementary material.\n\n4 Parametric Case\n\nIn the PC setting for the RB problem, there is prior knowledge that the expected rewards are composed of a sum of an unknown constant part and a rotting part known to belong to a set of models, Θ; i.e., the expected reward of arm i at its nth pull is given by µ_i (n) = µ^c_i + µ (n; θ∗_i), where θ∗_i ∈ Θ. We denote {θ∗_i}_{i=1}^{K} by Θ∗. We consider two cases: The first is the asymptotically vanishing case (AV), i.e., ∀i : µ^c_i = 0. The second is the asymptotically non-vanishing case (ANV), i.e., ∀i : µ^c_i ∈ R.\nWe present a few definitions that will serve us in the following section.\nDefinition 4.1. For a function f : N → R, we define the function f*↓ : R → N ∪ {∞} by the following rule: given ζ ∈ R, f*↓ (ζ) returns the smallest N ∈ N such that ∀n ≥ N : f (n) ≤ ζ, or ∞ if such N does not exist.\nDefinition 4.2. 
For any θ1 ≠ θ2 ∈ Θ², define det_{θ1,θ2}, Ddet_{θ1,θ2} : N → R as,\n\ndet_{θ1,θ2} (n) = ( Σ_{j=1}^{n} µ (j; θ1) − Σ_{j=1}^{n} µ (j; θ2) )² / (n σ²)\n\nDdet_{θ1,θ2} (n) = ( Σ_{j=1}^{⌊n/2⌋} [µ (j; θ1) − µ (j; θ2)] − Σ_{j=⌊n/2⌋+1}^{n} [µ (j; θ1) − µ (j; θ2)] )² / (n σ²)\n\nDefinition 4.3. Let bal : N ∪ {∞} → N ∪ {∞} be defined at each point n ∈ N as the solution for,\n\nmin α  s.t.  max_{θ∈Θ} µ (α; θ) ≤ min_{θ∈Θ} µ (n; θ)\n\nWe define bal (∞) = ∞.\nAssumption 4.1. (Rotting Models) µ (n; θ) is positive, non-increasing in n, and µ (n; θ) ∈ o (1), ∀θ ∈ Θ, where Θ is a discrete known set.\nWe present an example for which, in Appendix E, we demonstrate how the different following assumptions hold. By this we intend to achieve two things: (i) show that the assumptions are not too harsh, keeping the problem relevant and non-trivial, and (ii) present a simple example of how to verify the assumptions.\nExample 4.1. The reward of arm i for its nth pull is distributed as N (µ^c_i + n^{−θ∗_i}, σ²), where θ∗_i ∈ Θ = {θ1, θ2, ..., θM}, and ∀θ ∈ Θ : 0.01 ≤ θ ≤ 0.49.\n\n4.1 Closest To Origin (AV)\n\nThe Closest To Origin (CTO) approach for RB is a heuristic that simply states that we hypothesize that the true underlying model for an arm is the one that best fits the past rewards. The fitting criterion is proximity to the origin of the sum of expected rewards shifted by the observed rewards. 
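This closest-to-origin fit is essentially a one-liner over the candidate set. A minimal sketch (our own naming; the candidate models follow Example 4.1 in the AV case, and the noise level is a toy assumption):

```python
import numpy as np

def cto_fit(rewards, models):
    """Closest To Origin (CTO): hypothesize the rotting model theta that brings
    the cumulative expected reward closest to the cumulative observed reward,
    i.e. minimize |sum_j r_j - sum_{j=1..n} mu(j; theta)| over theta."""
    n = len(rewards)
    total = sum(rewards)
    residual = {theta: abs(total - sum(mu(j) for j in range(1, n + 1)))
                for theta, mu in models.items()}
    return min(residual, key=residual.get)

# Candidate models mu(n; theta) = n^{-theta}, theta in a small discrete set.
models = {theta: (lambda n, t=theta: n ** -t) for theta in (0.1, 0.25, 0.4)}

# Noisy rewards from the true model theta* = 0.25 (AV case: no constant part).
rng = np.random.default_rng(1)
rewards = [n ** -0.25 + 0.2 * rng.standard_normal() for n in range(1, 201)]
print(cto_fit(rewards, models))  # 0.25
```

Because the candidate cumulative-mean curves separate at rate n^{1-θ} while the noise in the cumulative sum grows only as σ√n, the fitted model concentrates on the true one as pulls accumulate.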
Let r^i_1, r^i_2, .., r^i_{N_i(t)} be the sequence of rewards observed from arm i up until time t. Define,\n\nY (i, t; θ) = Σ_{j=1}^{N_i(t)} r^i_j − Σ_{j=1}^{N_i(t)} µ (j; θ), θ ∈ Θ.   (6)\n\nThe CTO approach dictates that at each decision point, we assume that the true underlying rotting model corresponds to the following proximity to origin rule (hence the name),\n\nθ̂_i (t) = argmin_{θ∈Θ} {|Y (i, t; θ)|}.   (7)\n\nThe CTOSIM version tackles the RB problem by simultaneously detecting the true rotting models and balancing between the expected rewards (following Lemma 2.1). In this approach, at every time step, each arm's rotting model is hypothesized according to the proximity rule (7). Then the algorithm simply follows an argmax rule, where the least number of pulls is used for tie breaking (randomly between an equal number of pulls). A pseudo algorithm for CTOSIM is given by Algorithm 2.\nAssumption 4.2. (Simultaneous Balance and Detection ability)\n\nbal ( max_{θ1≠θ2∈Θ²} { det*↓_{θ1,θ2} ( (1/16) ln^{−1} (ζ) ) } ) ∈ o (ζ)\n\nThe above assumption ensures that, starting from some horizon T, the underlying models can be distinguished from the others, w.p. 1 − 1/T², by their sums of expected rewards, and the arms can then be balanced, all within the horizon.\nTheorem 4.1. Suppose Assumptions 4.1 and 4.2 hold. There exists a finite step T∗_SIM, such that for all T ≥ T∗_SIM, CTOSIM achieves regret upper bounded by o (1) (which is upper bounded by max_{θ∈Θ∗} µ (1; θ)). Furthermore, T∗_SIM is upper bounded by the solution of the following,\n\nmin T  s.t.  T, b ∈ N ∪ {0}, t ∈ N^K,\n∀b, ∃t :  ‖t‖_1 ≤ T + b,\n  t_i ≥ max_{θ∈Θ∗} { m∗ ( 1/(K (T + b)²); θ ) },\n  µ (t_i + 1; θ∗_i) ≤ min_{θ̃∈Θ} [ µ ( max_{θ∈Θ∗} { m∗ ( 1/(K (T + b)²); θ ) }; θ̃ ) ]   (8)\n\n(here m∗ (p; θ) denotes the number of pulls after which rule (7) identifies model θ correctly w.p. at least 1 − p; cf. the analogous quantity m∗_diff defined in Section 4.2).\nProof: See Appendix D.1 of the supplementary material.\nRegret upper bounded by o (1) is achieved by proving that w.p. 1 − 1/T the regret vanishes, and in any case it is still bounded by a decaying term. The shown optimization bound stems from ensuring that the arms would be pulled enough times to be correctly detected, and then balanced (following the optimal policy, Lemma 2.1). Another upper bound for T∗_SIM can be found in Appendix D.1.\n\n4.2 Differences Closest To Origin (ANV)\n\nWe tackle this problem by estimating both the rotting models and the constant terms of the arms. The Differences Closest To Origin (D-CTO) approach is composed of two stages: first, detecting the underlying rotting models, then estimating and controlling the pulls due to the constant terms. We denote a∗ = argmax_{i∈[K]} {µ^c_i}, and ∆_i = µ^c_{a∗} − µ^c_i.\nAssumption 4.3. (D-Detection ability)\n\nmax_{θ1≠θ2∈Θ²} { Ddet*↓_{θ1,θ2} (ϵ) } ≤ D (ϵ) < ∞, ∀ϵ > 0\n\nThis assumption ensures that for any given probability, the models can be distinguished, by the differences (in pulls) between the first and second halves of the models' sums of expected rewards.\nModels Detection\nIn order to detect the underlying rotting models, we cancel the influence of the constant terms. Once we do this, we can detect the underlying models. Specifically, we define a criterion of proximity to the origin based on differences between the halves of the rewards sequences, as follows: define,\n\nZ (i, t; θ) = ( Σ_{j=1}^{⌊N_i(t)/2⌋} r^i_j − Σ_{j=⌊N_i(t)/2⌋+1}^{N_i(t)} r^i_j ) − ( Σ_{j=1}^{⌊N_i(t)/2⌋} µ (j; θ) − Σ_{j=⌊N_i(t)/2⌋+1}^{N_i(t)} µ (j; θ) ).   (9)\n\nThe D-CTO approach is that at each decision point, we assume that the true underlying model corresponds to the following rule,\n\nθ̂_i (t) = argmin_{θ∈Θ} {|Z (i, t; θ)|}   (10)\n\nWe define the following optimization problem, indicating the number of samples required for ensuring correct detection of the rotting models w.h.p. 
For some arm i with (unknown) rotting model θ∗_i,\n\nmin m  s.t.  P ( θ̂_i (l) ≠ θ∗_i ) ≤ p, ∀l ≥ m   (11)\n\nwhile pulling only arm i. We denote the solution to the above problem, when we use proximity rule (10), by m∗_diff (p; θ∗_i), and define m∗_diff (p) = max_{θ∈Θ} {m∗_diff (p; θ)}.\n\nAlgorithm 2 CTOSIM\nInput : K, Θ\nInitialization : N_i = 0, ∀i ∈ [K]\nfor t = 1, 2, .., K do\n  Ramp up : i (t) = t, and update N_{i(t)}\nend for\nfor t = K + 1, ..., do\n  Detect : determine {θ̂_i} by Eq. (7)\n  Balance : i (t) ∈ argmax_{i∈[K]} µ (N_i + 1; θ̂_i)\n  Update : N_{i(t)} ← N_{i(t)} + 1\nend for\n\nAlgorithm 3 D-CTOUCB\nInput : K, Θ, δ\nInitialization : N_i = 0, ∀i ∈ [K]\nfor t = 1, 2, .., K × m∗_diff (δ/K) do\n  Explore : i (t) by Round-Robin, update N_{i(t)}\nend for\nDetect : determine {θ̂_i} by Eq. (10)\nfor t = K × m∗_diff (δ/K) + 1, ..., do\n  UCB : i (t) according to Eq. (12)\n  Update : N_{i(t)} ← N_{i(t)} + 1\nend for\n\nD-CTOUCB\nWe next describe an approach with one decision point, and later on remark on the possibility of having a decision point at each time step. As explained above, after detecting the rotting models, we move to tackle the constant-terms aspect of the expected rewards. This is done in a UCB1-like approach [Auer et al., 2002a]. Given a sequence of rewards from arm i, {r^i_k}_{k=1}^{N_i(t)}, we modify them using the estimated rotting model θ̂_i, then estimate the arm's constant term, and finally choose the arm with the highest estimated expected reward, plus an upper confidence term; i.e., at time t, we pull arm i (t) according to the rule,\n\ni (t) ∈ argmax_{i∈[K]} [ µ̂^c_i (t) + µ ( N_i (t) + 1; θ̂_i (t) ) + c_{t,N_i(t)} ]   (12)\n\nwhere θ̂_i (t) is the estimated rotting model (obtained in the first stage), and,\n\nµ̂^c_i (t) = (1/N_i (t)) Σ_{j=1}^{N_i(t)} ( r^i_j − µ ( j; θ̂_i (t) ) ),  c_{t,s} = √( 8 ln (t) σ² / s )\n\nIn a case of a tie in the UCB step, it may be arbitrarily broken. A pseudo algorithm for D-CTOUCB is given by Algorithm 3, accompanied by the following theorem.\nTheorem 4.2. Suppose Assumptions 4.1 and 4.3 hold. For δ ∈ (0, 1), with probability of at least 1 − δ, the D-CTOUCB algorithm achieves regret bounded at time T by,\n\nΣ_{i∈[K], i≠a∗} max { m∗_diff (δ/K), µ*↓ (ϵ_i; θ∗_i), 32σ² ln T / (∆_i − ϵ_i)² } × ( ∆_i + µ (1; θ∗_{a∗}) ) + C (Θ∗, {µ^c_i})   (13)\n\nfor any sequence ϵ_i ∈ (0, ∆_i), ∀i ≠ a∗, where 32σ² ln T / (∆_i − ϵ_i)² is the only time-dependent factor.\nProof: See Appendix D.2 of the supplementary material.\nA few notes on the result: Instead of calculating m∗_diff (δ/K), it is possible to use any upper bound (e.g., as shown in Appendix E, max_{θ1≠θ2∈Θ²} Ddet*↓_{θ1,θ2} ( (1/8) ln^{−1} (2K/δ) ) rounded to the next even number). We cannot hope for a better rate than ln T, as the stochastic MAB is a special case of the RB problem. Finally, we can convert the D-CTOUCB algorithm to have a decision point at each step: at each time step, determine the rotting models according to proximity rule (10), followed by pulling an arm according to Eq. (12). 
We term this version D-CTOSIM-UCB.\n\n5 Simulations\n\nWe next compare the performance of the SWA and CTO approaches with benchmark algorithms.\nSetups: for all the simulations we use Normal distributions with σ² = 0.2 and T = 30,000.\nNon-Parametric: K = 2. As for the expected rewards: µ_1 (n) = 0.5, ∀n, and µ_2 (n) = 1 for its first 7,500 pulls and 0.4 afterwards. This setup is aimed to show the importance of not relying on the whole past rewards in the RB setting.\nParametric AV & ANV: K = 10. The rotting models are of the form µ (j; θ) = (⌊j/100⌋ + 1)^{−θ}, where ⌊·⌋ is the lower rounded integer, and Θ = {0.1, 0.15, .., 0.4} (i.e., plateaus of length 100, with decay between plateaus according to θ). {θ∗_i}_{i=1}^{K} were sampled with replacement from Θ, independently across arms and trajectories. {µ^c_i}_{i=1}^{K} (ANV) were sampled randomly from [0, 0.5]^K.\n\nTable 1: Number of 'wins' and p-values between the different algorithms\n\nNon-Parametric:\n        | UCB1  | DUCB  | SWUCB | wSWA\nUCB1    |   -   | <1e-5 | <1e-5 | <1e-5\nDUCB    |  100  |   -   | <1e-5 | <1e-5\nSWUCB   |  100  |  100  |   -   | <1e-5\nwSWA    |  100  |  100  |  100  |   -\n\nAV:\n        | UCB1  | DUCB  | SWUCB | wSWA  | CTO\nUCB1    |   -   | 0.81  | <1e-5 | <1e-5 | <1e-5\nDUCB    |  55   |   -   | <1e-5 | <1e-5 | <1e-5\nSWUCB   |  15   |  22   |   -   | <1e-5 | <1e-5\nwSWA    |  98   |  99   |  100  |   -   | <1e-5\nCTO     |  100  |  100  |  100  |  100  |   -\n\nANV:\n        | UCB1  | DUCB  | SWUCB | wSWA  | D-CTO\nUCB1    |   -   | 0.54  | 0.83  | <1e-5 | <1e-5\nDUCB    |  40   |   -   | 0.91  | <1e-5 | <1e-5\nSWUCB   |  50   |  50   |   -   | <1e-5 | <1e-5\nwSWA    |  97   |  98   |  97   |   -   | <1e-5\nD-CTO   |  100  |  100  |  100  |  66   |   -\n\nFigure 1: Average regret. Left: non-parametric. Middle: parametric AV. Right: parametric ANV\n\nAlgorithms: we implemented the standard benchmark algorithms for non-stationary MAB, namely UCB1 by Auer et al. 
[2002a], Discounted UCB (DUCB) and Sliding-Window UCB (SWUCB) by Garivier and Moulines [2008]. We implemented CTOSIM, D-CTOSIM-UCB, and wSWA for the relevant setups. We note that adversarial benchmark algorithms are not relevant in this case, as the rewards are unbounded.\nGrid Searches: these were performed to determine the algorithms' parameters. For DUCB, following Kocsis and Szepesvári [2006], the discount factor was chosen from γ ∈ {0.9, 0.99, .., 0.999999}; the window size for SWUCB from τ ∈ {1e3, 2e3, .., 20e3}; and α for wSWA from {0.2, 0.4, .., 1}.\nPerformance: for each of the cases, we present a plot of the average regret over 100 trajectories, specify the number of 'wins' of each algorithm over the others, and report the p-value of a paired T-test between the (end-of-trajectory) regrets of each pair of algorithms. For each trajectory and each pair of algorithms, the 'winner' is defined as the algorithm with the lesser regret at the end of the horizon.\nResults: the parameters chosen by the grid search are as follows: γ = 0.999 for the non-parametric case, and 0.999999 for the parametric cases; τ = 4e3, 8e3, and 16e3 for the non-parametric, AV, and ANV cases, respectively; α = 0.2 was chosen for all cases.\nThe average regret for the different algorithms is given by Figure 1. Table 1 shows the number of 'wins' and p-values. The table is to be read as follows: the entries under the diagonal are the number of times the algorithms from the left column 'won' against the algorithms from the top row, and the entries above the diagonal are the p-values between the two.\nWhile there is no clear 'winner' among the three benchmark algorithms across the different cases, wSWA, which does not require any prior knowledge, consistently and significantly outperformed them. 
In addition, when prior knowledge was available and CTOSIM or D-CTOSIM-UCB could be deployed, they outperformed all the others, including wSWA.\n\n6 Related Work\n\nWe turn to reviewing related work while emphasizing the differences from our problem.\nStochastic MAB: In the stochastic MAB setting [Lai and Robbins, 1985], the underlying reward distributions are stationary over time. The notion of regret is the same as in our work, but the optimal policy in this setting is one that pulls a fixed arm throughout the trajectory. The two most common approaches for this problem are: constructing Upper Confidence Bounds, which stem from the seminal work by Gittins [1979], in which he proved that index policies that compute upper confidence bounds on the expected rewards of the arms are optimal in this case (e.g., see Auer et al. [2002a], Garivier and Cappé [2011], Maillard et al. [2011]); and Bayesian heuristics such as Thompson Sampling, which was first presented by Thompson [1933] in the context of drug treatments (e.g., see Kaufmann et al. [2012], Agrawal and Goyal [2013], Gopalan et al. [2014]).\nAdversarial MAB: In the adversarial MAB setting (also referred to as the Experts Problem; see the book of Cesa-Bianchi and Lugosi [2006] for a review), the sequence of rewards is selected by an adversary (i.e., can be arbitrary). In this setting the notion of adversarial regret is adopted [Auer et al., 2002b, Hazan and Kale, 2011], where the regret is measured against the best possible fixed action that could have been taken in hindsight. 
This is as opposed to the policy regret we adopt, where the regret is measured against the best sequence of actions in hindsight.\nHybrid models: Some past works consider settings between the stochastic and the adversarial settings. Garivier and Moulines [2008] consider the case where the reward distributions remain constant over epochs and change arbitrarily at unknown time instants, similarly to Yu and Mannor [2009], who consider the same setting, only with the availability of side observations. Chakrabarti et al. [2009] consider the case where arms can expire and be replaced with new arms with arbitrary expected rewards, but as long as an arm does not expire its statistics remain the same.\nNon-Stationary MAB: Most related to our problem is the so-called non-stationary MAB. It was originally proposed by Jones and Gittins [1972], who considered a case where the reward distribution of a chosen arm can change, and it gave rise to a sequence of works (e.g., Whittle et al. [1981], Tekin and Liu [2012]) which were termed Restless Bandits and Rested Bandits. In the Restless Bandits setting, termed by Whittle [1988], the reward distributions change at each step according to a known stochastic process. Komiyama and Qin [2014] consider the case where each arm decays according to a linear combination of decaying basis functions. This is similar to our parametric case in that the reward distributions decay according to possible models, but differs fundamentally in that it belongs to the Restless Bandits setup (ours to the Rested Bandits). More examples in this line of work are Slivkins and Upfal [2008], who consider evolution of rewards according to Brownian motion, and Besbes et al. [2014], who consider bounded total variation of expected rewards. 
The latter is related to our setting when the total variation is bounded by a constant, but differs significantly in that it considers the case where the (unknown) expected rewards sequences are not affected by the actions taken, and in addition it requires bounded support, as it uses EXP3 as a sub-routine. In the Rested Bandits setting, only the reward distribution of a chosen arm changes, which is the case we consider. An optimal control policy (reward processes are known, no learning required) for bandits with non-increasing rewards and a discount factor was previously presented (e.g., Mandelbaum [1987], and Kaspi and Mandelbaum [1998]). Heidari et al. [2016] consider the case where the reward decays (as we do), but with no statistical noise (deterministic rewards), which significantly simplifies the problem. Another somewhat closely related setting is suggested by Bouneffouf and Feraud [2016], in which statistical noise exists, but the expected reward shape is known up to a multiplicative factor.\n\n7 Discussion\n\nWe introduced a novel variant of the Rested Bandits framework, which we termed Rotting Bandits. This setting deals with the case where the expected rewards generated by an arm decay (or, more generally, do not increase) as a function of pulls of that arm. This is motivated by many real-world scenarios.\nWe first tackled the non-parametric case, where there is no prior knowledge on the nature of the decay. We introduced an easy-to-follow algorithm accompanied by theoretical guarantees.\nWe then tackled the parametric case, and differentiated between two scenarios: expected rewards decay to zero (AV), and decay to different constants (ANV). For both scenarios we introduced suitable algorithms with stronger guarantees than for the non-parametric case: for the AV scenario we introduced an algorithm ensuring, in expectation, regret upper bounded by a term that decays to zero with the horizon. 
For the ANV scenario, we introduced an algorithm ensuring, with high probability, regret upper bounded by a horizon-dependent rate which is optimal for the stationary case.
We concluded with simulations that demonstrated our algorithms' superiority over benchmark algorithms for non-stationary MAB. We note that since the Rotting Bandits setting is novel, no suitable benchmarks are available, and so this paper also serves as a benchmark.
For future work we see two main interesting directions: (i) show a lower bound on the regret for the non-parametric case, and (ii) extend the scope of the parametric case to continuous parameterization.

Acknowledgment The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Program (FP/2007-2013) / ERC Grant Agreement no. 306638.

References

D. Agarwal, B.-C. Chen, and P. Elango. Spatio-temporal models for estimating click-through rate. In Proceedings of the 18th International Conference on World Wide Web, pages 21-30. ACM, 2009.

S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. In AISTATS, pages 99-107, 2013.

R. Arora, O. Dekel, and A. Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. arXiv preprint arXiv:1206.6400, 2012.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002a.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002b.

B. Awerbuch and R. D. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 45-53. ACM, 2004.

O. Besbes, Y. Gur, and A. Zeevi.
Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems, pages 199-207, 2014.

D. Bouneffouf and R. Feraud. Multi-armed bandit problem with known trend. Neurocomputing, 205:16-21, 2016.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

D. Chakrabarti, R. Kumar, F. Radlinski, and E. Upfal. Mortal multi-armed bandits. In Advances in Neural Information Processing Systems, pages 273-280, 2009.

S. Du, M. Ibrahim, M. Shehata, and W. Badawy. Automatic license plate recognition (ALPR): A state-of-the-art review. IEEE Transactions on Circuits and Systems for Video Technology, 23(2):311-325, 2013.

A. Garivier and O. Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In COLT, pages 359-376, 2011.

A. Garivier and E. Moulines. On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415, 2008.

J. C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B (Methodological), pages 148-177, 1979.

A. Gopalan, S. Mannor, and Y. Mansour. Thompson sampling for complex online problems. In ICML, volume 14, pages 100-108, 2014.

E. Hazan and S. Kale. Better algorithms for benign bandits. Journal of Machine Learning Research, 12(Apr):1287-1311, 2011.

H. Heidari, M. Kearns, and A. Roth. Tight policy regret bounds for improving and decaying bandits.

D. M. Jones and J. C. Gittins. A dynamic allocation index for the sequential design of experiments. University of Cambridge, Department of Engineering, 1972.

H. Kaspi and A. Mandelbaum. Multi-armed bandits in discrete and continuous time. Annals of Applied Probability, pages 1270-1290, 1998.

E. Kaufmann, N. Korda, and R. Munos.
Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pages 199-213. Springer, 2012.

R. Kleinberg and T. Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, pages 594-605. IEEE, 2003.

L. Kocsis and C. Szepesvári. Discounted UCB. In 2nd PASCAL Challenges Workshop, pages 784-791, 2006.

J. Komiyama and T. Qin. Time-decaying bandits for non-stationary systems. In International Conference on Web and Internet Economics, pages 460-466. Springer, 2014.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.

O.-A. Maillard, R. Munos, G. Stoltz, et al. A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In COLT, pages 497-514, 2011.

A. Mandelbaum. Continuous multi-armed bandits and multiparameter processes. The Annals of Probability, pages 1527-1556, 1987.

S. Pandey, D. Agarwal, D. Chakrabarti, and V. Josifovski. Bandits for taxonomies: A model-based approach. In SDM, pages 216-227. SIAM, 2007.

H. Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169-177. Springer, 1985.

A. Slivkins and E. Upfal. Adapting to a changing environment: the Brownian restless bandits. In COLT, pages 343-354, 2008.

C. Tekin and M. Liu. Online learning of rested and restless bandits. IEEE Transactions on Information Theory, 58(8):5588-5611, 2012.

W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.

L. Tran-Thanh, S. Stein, A. Rogers, and N. R. Jennings.
Ef\ufb01cient crowdsourcing of unknown experts using\n\nmulti-armed bandits. In European Conference on Arti\ufb01cial Intelligence, pages 768\u2013773, 2012.\n\nP. Whittle. Restless bandits: Activity allocation in a changing world. Journal of applied probability, pages\n\n287\u2013298, 1988.\n\nP. Whittle et al. Arm-acquiring bandits. The Annals of Probability, 9(2):284\u2013292, 1981.\n\nJ. Y. Yu and S. Mannor. Piecewise-stationary bandit problems with side observations. In Proceedings of the 26th\n\nAnnual International Conference on Machine Learning, pages 1177\u20131184. ACM, 2009.\n\n10\n\n\f", "award": [], "sourceid": 1745, "authors": [{"given_name": "Nir", "family_name": "Levine", "institution": "Technion - Israel Institute of Technology"}, {"given_name": "Koby", "family_name": "Crammer", "institution": "Technion"}, {"given_name": "Shie", "family_name": "Mannor", "institution": "Technion"}]}