{"title": "Tracking the Best Expert in Non-stationary Stochastic Environments", "book": "Advances in Neural Information Processing Systems", "page_first": 3972, "page_last": 3980, "abstract": "We study the dynamic regret of the multi-armed bandit and experts problems in non-stationary stochastic environments. We introduce a new parameter $\\Lambda$, which measures the total statistical variance of the loss distributions over $T$ rounds of the process, and study how this quantity affects the regret. We investigate the interaction between $\\Lambda$ and $\\Gamma$, which counts the number of times the distributions change, as well as between $\\Lambda$ and $V$, which measures how far the distributions deviate over time. One striking result we find is that even when $\\Gamma$, $V$, and $\\Lambda$ are all restricted to constants, the regret lower bound in the bandit setting still grows with $T$. The other highlight is that in the full-information setting, a constant regret becomes achievable with constant $\\Gamma$ and $\\Lambda$, as it can be made independent of $T$, while with constant $V$ and $\\Lambda$, the regret still has a $T^{1/3}$ dependency. We not only propose algorithms with upper bound guarantees, but prove their matching lower bounds as well.", "full_text": "Tracking the Best Expert in Non-stationary Stochastic Environments\n\nChen-Yu Wei\n\nYi-Te Hong\n\nChi-Jen Lu\n\nInstitute of Information Science\n\nAcademia Sinica, Taiwan\n\n{bahh723, ted0504, cjlu}@iis.sinica.edu.tw\n\nAbstract\n\nWe study the dynamic regret of the multi-armed bandit and experts problems in non-stationary stochastic environments. We introduce a new parameter Λ, which measures the total statistical variance of the loss distributions over T rounds of the process, and study how this quantity affects the regret. 
We investigate the interaction between Λ and Γ, which counts the number of times the distributions change, as well as between Λ and V, which measures how far the distributions deviate over time. One striking result we find is that even when Γ, V, and Λ are all restricted to constants, the regret lower bound in the bandit setting still grows with T. The other highlight is that in the full-information setting, a constant regret becomes achievable with constant Γ and Λ, as it can be made independent of T, while with constant V and Λ, the regret still has a T^{1/3} dependency. We not only propose algorithms with upper bound guarantees, but prove their matching lower bounds as well.\n\n1 Introduction\n\nMany situations in our daily life require us to make repeated decisions which result in some losses corresponding to our chosen actions. This can be abstracted as the well-known online decision problem in machine learning [5]. Depending on how the loss vectors are generated, two different worlds are usually considered. In the adversarial world, loss vectors are assumed to be deterministic and controlled by an adversary, while in the stochastic world, loss vectors are assumed to be sampled independently from some distributions. In both worlds, good online algorithms are known which can achieve a regret of about √T over T time steps, where the regret is the difference between the total loss of the online algorithm and that of the best offline one. Another distinction concerns the information the online algorithm receives after each action. In the full-information setting, it gets to know the whole loss vector of that step, while in the bandit setting, only the loss value of the chosen action is received. Again, in both settings, a regret of about √T turns out to be achievable.\n\nWhile the regret bounds remain of the same order in the general scenarios discussed above, things become different when some natural conditions are considered. One well-known example is that in the stochastic multi-armed bandit (MAB) problem, when the best arm (or action) is substantially better than the second best, with a constant gap between their means, a much lower regret, of the order of log T, becomes possible. This motivates us to consider other possible conditions which give a finer characterization of the problem in terms of the achievable regret.\n\nIn the stochastic world, most previous works focused on the stationary setting, in which the loss (or reward) vectors are assumed to be sampled from the same distribution for all time steps. With this assumption, although one needs to balance between exploration and exploitation in the beginning, after some trials, one can be confident about which action is the best and rest assured that there are no more surprises. On the other hand, the world around us may not be stationary, and existing learning algorithms for the stationary case may no longer work there. In fact, in a non-stationary world, the dilemma between exploration and exploitation persists, as the underlying distribution may drift as time evolves. How does the non-stationarity affect the achievable regret? How does one measure the degree of non-stationarity?\n\nIn this paper, we answer the above questions through the notion of dynamic regret, which measures the algorithm's performance against an offline algorithm allowed to select the best arm at every step.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nRelated Works. 
One way to measure the non-stationarity of a sequence of distributions is to count the number of times the distribution at a time step differs from its previous one. Let Γ − 1 be this number, so that the whole time horizon can be partitioned into Γ intervals, with each interval having a stationary distribution. In the bandit setting, a regret of about √(ΓT) is achieved by the EXP3.S algorithm in [2], as well as by the discounted UCB and sliding-window UCB algorithms in [8]. The dependency on T can be refined in the full-information setting: AdaNormalHedge [10] and Adapt-ML-Prod [7] can both achieve regret of the form √(ΓC), where C is the total first-order and second-order excess loss respectively, which is upper-bounded by T. From a slightly different Online Mirror Descent approach, [9] can also achieve a regret of about √(ΓD), where D is the sum of differences between consecutive loss vectors.\n\nAnother measure of non-stationarity, denoted by V, computes the differences between the means of consecutive distributions and sums them up. Note that this allows the best arm to change frequently, with a very large Γ, while the distributions remain similar to each other, with a small V. For such a measure V, [3] provided a bandit algorithm which achieves a regret of about V^{1/3}T^{2/3}. This regret upper bound is unimprovable in general even in the full-information setting, as a matching lower bound was shown in [4]. Again, [9] refined the upper bound in the full-information setting through the introduction of D, achieving a regret of about (ṼDT)^{1/3} for a parameter Ṽ different from but related to V: Ṽ calculates the sum of differences between consecutive realized loss vectors, while V measures that between mean loss vectors. This makes the results of [3] and [9] incomparable. The discrepancy stems from the fact that [9] considers the traditional adversarial setting, while [3] studies the non-stationary stochastic setting. In this paper, we will provide a framework that bridges these two seemingly disparate worlds.\n\nOur Results. We base ourselves in the stochastic world with non-stationary distributions, characterized by the parameters Γ and V. In addition, we introduce a new parameter Λ, which measures the total statistical variance of the distributions. Note that the traditional adversarial setting corresponds to the case with Λ = 0 and Γ ≈ V ≈ T, while the traditional stochastic setting has Λ ≈ T and Γ = V = 1. Clearly, with a smaller Λ, the learning problem becomes easier, and we would like to understand the tradeoff between Λ and the other parameters, including Γ, V, and T. In particular, we would like to know how the bounds described in the related works would change. Would all the dependency on T be replaced by Λ, or would only some partial dependency on T be shifted to Λ?\n\nFirst, we consider the effect of the variance Λ with respect to the parameter Γ. We show that in the full-information setting, a regret of about √(ΓΛ) + Γ can be achieved, which is independent of T. On the other hand, we show a sharp contrast in the bandit setting: the dependency on T is unavoidable, and a lower bound of the order of √(ΓT) exists. That is, even when there is no variance in the distributions, with Λ = 0, and the distributions change only once, with Γ = 2, any bandit algorithm cannot avoid a regret of about √T, while a full-information algorithm can achieve a constant regret independent of T.\n\nNext, we study the tradeoff between Λ and V. We show that in the bandit setting, a regret of about (ΛVT)^{1/3} + √(VT) is achievable. 
Note that this recovers the V^{1/3}T^{2/3} regret bound of [3], as Λ is at most of the order of T, but our bound becomes better when Λ is much smaller than T. Again, one may notice the dependency on T and wonder if it can also be removed in the full-information setting. We show that in the full-information setting, the regret upper bound and lower bound are both about (ΛVT)^{1/3} + V. Our upper bound is incomparable to the (ṼDT)^{1/3} bound of [9], since their adversarial setting corresponds to Λ = 0, and their D can be as large as T in our setting. Moreover, we see that while the full-information regret bound is slightly better than that in the bandit setting, there is still an unavoidable T^{1/3} dependency.\n\nOur results provide a big picture of the regret landscape in terms of the parameters Λ, Γ, V, and T, in both the full-information and bandit settings. A table summarizing our bounds as well as previous ones is given in Appendix A in the supplementary material. Finally, let us remark that our effort mostly focuses on characterizing the achievable (minimax) regrets, and most of our upper bounds are achieved by algorithms which need the knowledge of the related parameters and may not be practical. To complement this, we also propose a parameter-free algorithm, which still achieves a good regret bound and may be of independent interest.\n\n2 Preliminaries\n\nLet us first introduce some notation. For an integer K > 0, let [K] denote the set {1, . . . , K}. For a vector ℓ ∈ R^K, let ℓ_i denote its i'th component. When we need to refer to a time-indexed vector ℓ_t ∈ R^K, we will write ℓ_{t,i} to denote its i'th component. We will use the indicator function 1_C for a condition C, which gives the value 1 if C holds and 0 otherwise. 
For a vector ℓ, we let ‖ℓ‖_b denote its L_b-norm. While the standard notation O(·) is used to hide constant factors, we will use the notation Õ(·) to hide logarithmic factors.\n\nNext, let us describe the problem we study in this paper. Imagine that a learner is given the choice of a total of K actions, and has to play iteratively for a total of T steps. At step t, the learner needs to choose an action a_t ∈ [K], and then suffers a corresponding loss ℓ_{t,a_t} ∈ [0, 1], where the loss vector ℓ_t is independently drawn from a non-stationary distribution with expected losses E[ℓ_{t,i}] = µ_{t,i}, which may drift over time. After that, the learner receives some feedback from the environment. In the full-information setting, the feedback is the whole loss vector ℓ_t = (ℓ_{t,1}, ..., ℓ_{t,K}), while in the bandit setting, only the loss ℓ_{t,a_t} of the chosen action is revealed. A standard way to evaluate the learner's performance is to measure her (or his) regret, which is the difference between the total loss she suffers and that of an offline algorithm. While most prior works consider offline algorithms which can only play a fixed action for all the steps, we consider stronger offline algorithms which can take different actions in different steps. Our consideration is natural for non-stationary distributions, although it makes the regret large when compared to such stronger offline algorithms. Formally, we measure the learner's performance by its expected dynamic pseudo-regret, defined as Σ_{t=1}^T E[ℓ_{t,a_t} − ℓ_{t,u*_t}] = Σ_{t=1}^T (µ_{t,a_t} − µ_{t,u*_t}), where u*_t = arg min_i µ_{t,i} is the best action at step t. For convenience, we will simply refer to it as the regret of the learner later in the paper.\n\nWe will consider the following parameters characterizing different aspects of the environments:\n\nΓ = 1 + Σ_{t=2}^T 1_{µ_t ≠ µ_{t−1}},  V = Σ_{t=1}^T ‖µ_t − µ_{t−1}‖_∞,  and  Λ = Σ_{t=1}^T E[‖ℓ_t − µ_t‖²_2],  (1)\n\nwhere we let µ_0 be the all-zero vector. Here, Γ − 1 is the number of times the distributions switch, V measures the distance the distributions deviate, and Λ is the total statistical variance of these T distributions. We will call distributions with a small Γ switching distributions, while we will call distributions with a small V drifting distributions and call V the total drift of the distributions.\n\nFinally, we will need the following large deviation bound, known as the empirical Bernstein inequality.\n\nTheorem 2.1. [11] Let X = (X_1, ..., X_n) be a vector of independent random variables taking values in [0, 1], and let Λ_X = Σ_{1≤i<j≤n} (X_i − X_j)² / (n(n−1)) be their empirical variance. Then for any δ > 0, we have\n\nPr[ | Σ_{i=1}^n (E[X_i] − X_i) / n | > ρ(n, Λ_X, δ) ] ≤ δ,  for ρ(n, Λ, δ) = √(2Λ log(2/δ) / n) + 7 log(2/δ) / (3(n − 1)).\n\n3 Algorithms\n\nWe would like to characterize the achievable regret bounds for both switching and drifting distributions, in both the full-information and bandit settings. In particular, we would like to understand the interplay among the parameters Γ, V, Λ, and T defined in (1). 
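As a concrete illustration (our own, not part of the paper), the three quantities in display (1) can be computed directly when the sequence of mean vectors is known; the total variance Λ is estimated here from realized losses, whose expected squared deviation equals the definition in (1). All names below are hypothetical.

```python
def nonstationarity_params(mus, losses):
    """Compute Gamma, V, and a realized estimate of Lambda from display (1).

    mus:    list of T mean loss vectors mu_t (each a list of K floats).
    losses: list of T realized loss vectors ell_t, one sample per step;
            summing ||ell_t - mu_t||_2^2 gives an unbiased estimate of Lambda.
    """
    K = len(mus[0])
    gamma, v, lam = 1, 0.0, 0.0
    prev = [0.0] * K                                   # mu_0 is the all-zero vector
    for t, (mu, ell) in enumerate(zip(mus, losses)):
        if t >= 1 and mu != prev:                      # Gamma - 1 counts the switches
            gamma += 1
        v += max(abs(a - b) for a, b in zip(mu, prev))     # sup-norm drift
        lam += sum((a - b) ** 2 for a, b in zip(ell, mu))  # squared 2-norm deviation
        prev = mu
    return gamma, v, lam
```

For instance, a two-armed sequence that switches once contributes 1 to Γ − 1, while its drift enters V through the sup-norm of the change in means.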
The only known upper bound good enough for our purpose is the one by [8] for switching distributions in the bandit setting, which is close to the lower bound in our Theorem 4.1. In subsection 3.1, we provide a bandit algorithm for drifting distributions which achieves an almost optimal regret upper bound when given the parameters V, Λ, T. In subsection 3.2, we provide a full-information algorithm which works for both switching and drifting distributions. The regret bounds it achieves are also close to optimal, but it again needs the knowledge of the related parameters. To complement this, we provide a full-information algorithm in subsection 3.3 which does not need to know the parameters but achieves slightly larger regret bounds.\n\nAlgorithm 1 Rerun-UCB-V\nInitialization: Set B according to (2) and δ = 1/(KT).\nfor m = 1, . . . , T/B do\nfor t = (m − 1)B + 1, . . . , mB do\nChoose arm a_t := argmin_i (µ̂_{t,i} − λ_{t,i}), with µ̂_{t,i} and λ_{t,i} computed according to (3).\nend for\nend for\n\n3.1 Parameter-Dependent Bandit Algorithm\n\nIn this subsection, we consider drifting distributions parameterized by V and Λ. Our main result is a bandit algorithm which achieves a regret of about (ΛVT)^{1/3} + √(VT). As we aim to achieve smaller regrets for distributions with smaller statistical variances, we adopt a variant of the UCB algorithm developed by [1], called UCB-V, which takes variances into account when building its confidence interval.\n\nOur algorithm divides the time steps into T/B intervals I_1, . . . , I_{T/B}, each having B steps,¹ with\n\nB = (K²ΛT/V²)^{1/3} if KΛ² ≥ TV, and B = √(KT/V) otherwise.  (2)\n\nFor each interval, our algorithm clears all the information from previous intervals, and starts a fresh run of UCB-V. 
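Concretely, the restart structure might look like the following Python sketch; this is an illustrative reconstruction, not the authors' code. The environment callback `draw_loss` is hypothetical, and the confidence radius follows the function ρ of Theorem 2.1.

```python
import math

def rho(n, lam_hat, delta):
    # Confidence radius from Theorem 2.1 (empirical Bernstein);
    # the paper's convention sets the radius to 1 when n <= 1.
    if n <= 1:
        return 1.0
    return math.sqrt(2 * lam_hat * math.log(2 / delta) / n) \
        + 7 * math.log(2 / delta) / (3 * (n - 1))

def rerun_ucb_v(draw_loss, K, T, B, delta):
    """Restart a fresh UCB-V run every B steps (sketch of Algorithm 1).

    draw_loss(t, i) -> loss of arm i at step t (hypothetical environment).
    """
    total_loss = 0.0
    for start in range(0, T, B):
        obs = [[] for _ in range(K)]      # per-arm samples, this interval only
        for t in range(start, min(start + B, T)):
            idx = []
            for i in range(K):
                n = len(obs[i])
                if n == 0:
                    idx.append(float("-inf"))   # try each arm once first
                else:
                    mu = sum(obs[i]) / n
                    # empirical variance as in (3): average over ordered pairs
                    lam = (sum((x - y) ** 2 for x in obs[i] for y in obs[i])
                           / (n * (n - 1))) if n > 1 else 0.0
                    idx.append(mu - rho(n, lam, delta))
            a = min(range(K), key=lambda i: idx[i])   # optimistic (lowest LCB) arm
            loss = draw_loss(t, a)
            obs[a].append(loss)
            total_loss += loss
    return total_loss
```

With deterministic losses (0 for arm 0, 1 for arm 1), each restarted interval pays for exactly one forced pull of the bad arm before the lower-confidence index locks onto the good one.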
More precisely (recall from (2) that B = (K²ΛT/V²)^{1/3} if KΛ² ≥ TV and B = √(KT/V) otherwise), before step t in an interval I, it maintains for each arm i its empirical mean µ̂_{t,i}, empirical variance Λ̂_{t,i}, and size of confidence interval λ_{t,i}, defined as\n\nµ̂_{t,i} = Σ_{s∈S_{t,i}} ℓ_{s,i} / |S_{t,i}|,  Λ̂_{t,i} = Σ_{r,s∈S_{t,i}} (ℓ_{r,i} − ℓ_{s,i})² / (|S_{t,i}|(|S_{t,i}| − 1)),  and  λ_{t,i} = ρ(|S_{t,i}|, Λ̂_{t,i}, δ),  (3)\n\nwhere S_{t,i} denotes the set of steps before t in I at which arm i was played, and ρ is the function given in Theorem 2.1. Here we use the convention that µ̂_{t,i} = 0 if |S_{t,i}| = 0, while Λ̂_{t,i} = 0 and λ_{t,i} = 1 if |S_{t,i}| ≤ 1. Then at step t, our algorithm selects the optimistic arm\n\na_t := argmin_i (µ̂_{t,i} − λ_{t,i}),\n\nreceives the corresponding loss, and updates the statistics.\n\nOur algorithm is summarized in Algorithm 1, and its regret is guaranteed by the following theorem, which we prove in Appendix B in the supplementary material.\n\nTheorem 3.1. The expected regret of Algorithm 1 is at most Õ((K²ΛVT)^{1/3} + √(KVT)).\n\n3.2 Parameter-Dependent Full-Information Algorithms\n\nIn this subsection, we provide full-information algorithms for switching and drifting distributions. In fact, they are based on an existing algorithm from [6], which is known to work in a different setting: the loss vectors are deterministic and adversarial, and the offline comparator cannot switch arms. In that setting, one of their algorithms, based on gradient descent (GD), can achieve a regret of O(√D), where D = Σ_t ‖ℓ_t − ℓ_{t−1}‖²_2, which is small when the loss vectors have small deviation. 
Our first observation is that their algorithm in fact can work against a dynamic offline comparator which switches arms fewer than N times, for any given N, with its regret becoming O(√(ND)). Our second observation is that when Λ is small, each observed loss vector ℓ_t is likely to be close to its true mean µ_t, and when V is small, ℓ_t is likely to be close to ℓ_{t−1}. These two observations make it possible for us to adapt their algorithm to our setting.\n\nAlgorithm 2 Full-information GD-based algorithm\nInitialization: Let x_1 = x̂_1 = (1/K, . . . , 1/K)⊤.\nfor t = 1, 2, . . . , T do\nPlay x̂_t = arg min_{x̂∈X} (⟨ℓ_{t−1}, x̂⟩ + (1/η_t)‖x̂ − x_t‖²_2), and then receive loss vector ℓ_t.\nUpdate x_{t+1} = arg min_{x∈X} (⟨ℓ_t, x⟩ + (1/η_t)‖x − x_t‖²_2).\nend for\n\nWe show the first algorithm in Algorithm 2, with the feasible set X being the probability simplex. The idea is to use ℓ_{t−1} as an estimate for ℓ_t, in order to move x̂_t further in a possibly beneficial direction. Its regret is guaranteed by the following theorem, which we prove in Appendix C in the supplementary material.\n\nTheorem 3.2. For switching distributions parameterized by Γ and Λ, the regret of Algorithm 2 with η_t = η = √(Γ/(Λ + KΓ)) is at most O(√(ΓΛ) + √(KΓ)).\n\nNote that for switching distributions, the regret of Algorithm 2 does not depend on T, which means that it can achieve a constant regret for constant Γ and Λ.\n\n¹For simplicity of presentation, let us assume here and later in the paper that taking divisions and roots to produce blocks of time steps always yields integers. It is easy to modify our analysis to the general case without affecting the order of our regret bound.\n\n
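The two steps of Algorithm 2 are both projections onto the simplex of a gradient step: minimizing ⟨g, x⟩ + (1/η)‖x − x_t‖²_2 over the simplex gives the Euclidean projection of x_t − (η/2)g. A minimal sketch (our own illustration, not the authors' code), using the standard sort-based simplex projection:

```python
def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    rho_idx = 0
    css = 0.0
    for j, uj in enumerate(u):
        css += uj
        if uj - (css - 1.0) / (j + 1) > 0:   # largest support size with positive mass
            rho_idx = j
    theta = (sum(u[: rho_idx + 1]) - 1.0) / (rho_idx + 1)
    return [max(x - theta, 0.0) for x in v]

def optimistic_gd(losses, eta):
    """Sketch of Algorithm 2: optimistic online gradient descent on the simplex.

    losses: list of T loss vectors, each revealed after the play at that step.
    Returns the sequence of played distributions x_hat_t.
    """
    K = len(losses[0])
    x = [1.0 / K] * K                        # x_1 = uniform
    prev_loss = [0.0] * K                    # ell_0 = 0
    played = []
    for ell in losses:
        # optimistic step: use ell_{t-1} as a guess for ell_t
        x_hat = project_simplex([xi - (eta / 2) * g for xi, g in zip(x, prev_loss)])
        played.append(x_hat)
        # real update with the revealed ell_t
        x = project_simplex([xi - (eta / 2) * g for xi, g in zip(x, ell)])
        prev_loss = ell
    return played
```

On a constant loss sequence the optimistic guess is exact, so the played distribution moves all mass to the better arm within a couple of steps.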
Let us remark that although using a variant based on multiplicative updates could result in a better dependency on K, an additional factor of log T would then emerge when using existing techniques for dealing with dynamic comparators.\n\nFor drifting distributions, one can show that Algorithm 2 still works and has a good regret bound. However, a slightly better bound can be achieved, as we describe next. The idea is to divide the time steps into T/B intervals of size B, with B = (ΛT/V²)^{1/3} if ΛT > V² and B = 1 otherwise, and to re-run Algorithm 2 in each interval with an adaptive learning rate. One way to have an adaptive learning rate can be found in [9], which works well when there is only one interval. A natural way to adopt it here is to reset the learning rate at the start of each interval, but this does not lead to a good enough regret bound, as it results in some constant regret at the start of every interval. To avoid this, some careful changes are needed. Specifically, in an interval [t_1, t_2], we run Algorithm 2 with the learning rate reset as\n\nη_t = 1 / √(4 Σ_{τ=t_1}^{t−1} ‖ℓ_τ − ℓ_{τ−1}‖²_2)\n\nfor t > t_1, with η_{t_1} = ∞ initially for every interval. This has the benefit of incurring small or even no regret at the start of an interval when the loss vectors across the boundary have small or no deviation. The regret of this new algorithm is guaranteed by the following theorem, which we prove in Appendix D in the supplementary material.\n\nTheorem 3.3. For drifting distributions parameterized by V and Λ, the regret of this new algorithm is at most O((VΛT)^{1/3} + √(KV)).\n\n3.3 Parameter-Free Full-Information Algorithm\n\nThe reason that our algorithm for Theorem 3.3 needs the related parameters is to set its learning rate properly. 
To have a parameter-free algorithm, we would like to adjust the learning rate dynamically in a data-driven way. One way of doing this can be found in [7], which is based on the multiplicative-updates variant of the mirror-descent algorithm. It achieves a static regret of about √(Σ_t r²_{t,k}) against any expert k, where r_{t,k} = ⟨p_t, ℓ_t⟩ − ℓ_{t,k} is its instantaneous regret for playing p_t at step t. However, in order to work in our setting, we would like the regret bound to depend on ℓ_t − ℓ_{t−1}, as seen previously. This suggests modifying the Adapt-ML-Prod algorithm of [7] using the idea of [6], which takes ℓ_{t−1} as an estimate of ℓ_t to move p_t further in an optimistic direction.\n\nRecall that the algorithm of [7] maintains a separate learning rate η_{t,k} for each arm k at time t, and it updates the weight w_{t,k} as well as η_{t,k} using the instantaneous regret r_{t,k}. To modify the algorithm using the idea of [6], we would like to have an estimate m_{t,k} for r_{t,k}, in order to move p_{t,k} further using m_{t,k} and to update the learning rate accordingly. More precisely, at step t, we now play p_t, with\n\np_{t,k} = η_{t−1,k} w̃_{t−1,k} / ⟨η_{t−1}, w̃_{t−1}⟩, where w̃_{t−1,k} = w_{t−1,k} exp(η_{t−1,k} m_{t,k}),  (4)\n\nwhich uses the estimate m_{t,k} to move further from w_{t−1,k}.\n\nAlgorithm 3 Optimistic-Adapt-ML-Prod\nInitialization: Let w_{0,k} = 1/K and ℓ_{0,k} = 0 for every k ∈ [K].\nfor t = 1, 2, . . . , T do\nPlay p_t according to (4), and then receive loss vector ℓ_t.\nUpdate each weight w_{t,k} according to (5) and each learning rate η_{t,k} according to (6).\nend for\n\n
Then after receiving the loss vector ℓ_t, we update each weight\n\nw_{t,k} = ( w_{t−1,k} exp( η_{t−1,k} r_{t,k} − η²_{t−1,k}(r_{t,k} − m_{t,k})² ) )^{η_{t,k}/η_{t−1,k}}  (5)\n\nas well as each learning rate\n\nη_{t,k} = min{ 1/4, √( ln K / (1 + Σ_{s∈[t]} (r_{s,k} − m_{s,k})²) ) }.  (6)\n\nOur algorithm is summarized in Algorithm 3, and we will show that it achieves a regret of about √(Σ_t (r_{t,k} − m_{t,k})²) against arm k. It remains to choose an appropriate estimate m_{t,k}. One attempt is to take m_{t,k} = r_{t−1,k}, but then r_{t,k} − m_{t,k} = (⟨p_t, ℓ_t⟩ − ℓ_{t,k}) − (⟨p_{t−1}, ℓ_{t−1}⟩ − ℓ_{t−1,k}), which does not lead to a desirable bound. The other possibility is to set m_{t,k} = ⟨p_t, ℓ_{t−1}⟩ − ℓ_{t−1,k}, which can be shown to satisfy (r_{t,k} − m_{t,k})² ≤ (2‖ℓ_t − ℓ_{t−1}‖_∞)². However, it is not clear how to compute such an m_{t,k}, because it depends on p_{t,k}, which in turn depends on m_{t,k} itself. Fortunately, we can approximate it efficiently in the following way.\n\nNote that the key quantity is ⟨p_t, ℓ_{t−1}⟩. Given its value α, w̃_{t−1,k} and p_{t,k} can be seen as functions of α, defined according to (4) as w̃_{t−1,k}(α) = w_{t−1,k} exp(η_{t−1,k}(α − ℓ_{t−1,k})) and p_{t,k}(α) = η_{t−1,k} w̃_{t−1,k}(α) / Σ_i η_{t−1,i} w̃_{t−1,i}(α). Then we would like to show the existence of an α such that ⟨p_t(α), ℓ_{t−1}⟩ = α, and to find it efficiently. For this, consider the function f(α) = ⟨p_t(α), ℓ_{t−1}⟩, with p_t(α) defined above. 
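As a sketch (our own illustration, not the paper's code) of how one play might compute p_t through this function f: since f is continuous and maps [0, 1] into itself, a fixed point f(α) = α exists and bisection on f(α) − α locates it. All variable names are hypothetical.

```python
import math

def fixed_point_play(w_prev, eta_prev, ell_prev, iters=50):
    """One play of the optimistic step (sketch): find alpha with
    <p_t(alpha), ell_{t-1}> = alpha by bisection, then return the
    distribution p_t and the estimates m_{t,k} = alpha - ell_{t-1,k}."""

    def p_of(alpha):
        # w~_{t-1,k}(alpha) = w_{t-1,k} * exp(eta_{t-1,k} * (alpha - ell_{t-1,k}))
        wt = [w * math.exp(e * (alpha - l))
              for w, e, l in zip(w_prev, eta_prev, ell_prev)]
        z = sum(e * w for e, w in zip(eta_prev, wt))
        return [e * w / z for e, w in zip(eta_prev, wt)]

    lo, hi = 0.0, 1.0                  # f(0) >= 0 and f(1) <= 1, so a root exists
    for _ in range(iters):
        mid = (lo + hi) / 2
        f_mid = sum(p * l for p, l in zip(p_of(mid), ell_prev))
        if f_mid > mid:
            lo = mid                   # fixed point lies above mid
        else:
            hi = mid
    alpha = (lo + hi) / 2
    p = p_of(alpha)
    m = [alpha - l for l in ell_prev]  # m_{t,k} = <p_t, ell_{t-1}> - ell_{t-1,k}
    return p, m
```

Each bisection step halves the interval, so roughly log T iterations suffice for error 1/T, matching the claim in the text.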
It is easy to check that f is a continuous function bounded in [0, 1], which implies the existence of some fixed point α ∈ [0, 1] with f(α) = α. Using a binary search, such an α can be approximated within error 1/T in log T iterations. As such a small error does not affect the order of the regret, we will ignore it for simplicity of presentation, and assume that we indeed have ⟨p_t, ℓ_{t−1}⟩, and hence m_{t,k} = ⟨p_t, ℓ_{t−1}⟩ − ℓ_{t−1,k}, without error.\n\nThen we have the following regret bound (c.f. [7, Corollary 4]), which we prove in Appendix E in the supplementary material.\n\nTheorem 3.4. The static regret of Algorithm 3 w.r.t. any arm (or expert) k ∈ [K] is at most\n\nÔ( √(Σ_{t∈[T]} (r_{t,k} − m_{t,k})² ln K) + ln K ) ≤ Ô( √(Σ_{t∈[T]} ‖ℓ_t − ℓ_{t−1}‖²_∞ ln K) + ln K ),\n\nwhere the notation Ô(·) hides a ln ln T factor.\n\nThe regret in the theorem above is measured against a fixed arm. To achieve a dynamic regret against an offline algorithm which can switch arms, one can use a generic reduction to the so-called sleeping experts problem. In particular, we can use the idea in [7] by creating K̃ = KT sleeping experts, and run our Algorithm 3 on these K̃ experts (instead of on the K arms). More precisely, each sleeping expert is indexed by some pair (s, k), and it is asleep for steps before s and becomes awake for steps t ≥ s. At step t, it calls Algorithm 3 for the distribution p̃_t over the K̃ experts, and computes its own distribution p_t over the K arms, with p_{t,k} proportional to Σ_{s=1}^t p̃_{t,(s,k)}. Then it plays p_t, receives the loss vector ℓ_t, and feeds a modified loss vector ℓ̃_t and estimate vector m̃_t to Algorithm 3 for its update. Here, we set ℓ̃_{t,(s,k)} to the expected loss ⟨p_t, ℓ_t⟩ if expert (s, k) is asleep and to ℓ_{t,k} otherwise, while we set m̃_{t,(s,k)} to 0 if expert (s, k) is asleep and to m_{t,k} = ⟨p_t, ℓ_{t−1}⟩ − ℓ_{t−1,k} otherwise. This choice allows us to relate the regret of Algorithm 3 to that of the new algorithm, which can be seen in the proof of the following theorem, given in Appendix F in the supplementary material.\n\nTheorem 3.5. The dynamic expected regret of the new algorithm is Õ(√(ΓΛ ln K) + Γ ln K) for switching distributions and Õ((VΛT ln K)^{1/3} + √(VT ln K)) for drifting distributions.\n\n4 Lower Bounds\n\nWe study regret lower bounds in this section. In subsection 4.1, we show that for switching distributions with Γ − 1 ≥ 1 switches, there is an Ω(√(ΓT)) lower bound for bandit algorithms, even when there is no variance (Λ = 0) and there are constant loss gaps between the optimal and suboptimal arms. We also show a full-information lower bound, which almost matches our upper bound in Theorem 3.2. In subsection 4.2, we show that for drifting distributions, our upper bounds in Theorem 3.1 and Theorem 3.3 are almost tight. In particular, we show that now even for full-information algorithms, a T^{1/3} dependency in the regret turns out to be unavoidable, even for small V and Λ. 
This provides a sharp contrast to the upper bound of our Theorem 3.2, which shows that a constant regret is in fact achievable by a full-information algorithm for switching distributions with constant Γ and Λ. For simplicity of presentation, we will only discuss the case of K = 2 actions, as it is not hard to extend our proofs to the general case.\n\n4.1 Switching Distributions\n\nIn contrast to the full-information setting, the existence of switches presents a lose-lose dilemma for a bandit algorithm: in order to detect any possible switch early enough, it must explore aggressively, but this has the consequence of playing suboptimal arms too often. To fool any bandit algorithm, we will switch between two deterministic distributions, with no variance, which have mean vectors ℓ^(1) = (1/2, 1)⊤ and ℓ^(2) = (1/2, 0)⊤, respectively. Our result is the following.\n\nTheorem 4.1. The worst-case expected regret of any bandit algorithm is Ω(√(ΓT)), for Γ ≥ 2.\n\nProof. Consider any bandit algorithm A, and let us partition the T steps into Γ/2 intervals, each consisting of B = 2T/Γ steps. Our goal is to make A suffer an expected regret of Ω(√B) in each interval, while switching the loss vectors at most once per interval. As mentioned before, we will only switch between two different deterministic distributions, with mean vectors ℓ^(1) and ℓ^(2). Note that we can view these two distributions simply as two loss vectors, with ℓ^(i) having arm i as the optimal arm.\n\nIn what follows, we focus on one of the intervals, and assume that we have chosen the distributions in all previous intervals. We would like to start the interval with the loss vector ℓ^(1). Let N_2 denote the expected number of steps A plays the suboptimal arm 2 in this interval if ℓ^(1) is used for the whole interval. If N_2 ≥ √B/2, we can actually use ℓ^(1) for the whole interval with no switch, which makes A suffer an expected regret of at least (1/2) · √B/2 = √B/4 in this interval. Thus, it remains to consider the case N_2 < √B/2. In this case, A does not explore arm 2 often enough, and we let it pay by choosing an appropriate step at which to switch to the other loss vector ℓ^(2) = (1/2, 0)⊤, which has arm 2 as the optimal one. For this, let us divide the B steps of the interval into √B blocks, each consisting of √B steps. As N_2 < √B/2, there must be a block in which the expected number of steps that A plays arm 2 is at most N_2/√B < 1/2. By Markov's inequality, the probability that A ever plays arm 2 in this block is less than 1/2. This implies that when given the loss vector ℓ^(1) for all the steps till the end of this block, A never plays arm 2 in the block with probability more than 1/2. Therefore, if we make the switch to the loss vector ℓ^(2) = (1/2, 0)⊤ at the beginning of the block, then with probability more than 1/2, A still never plays arm 2 and never notices the switch in this block. As arm 2 is the optimal one with respect to ℓ^(2), the expected regret of A in this block is more than (1/2) · (1/2) · √B = √B/4.\n\nNow if we choose distributions in each interval as described above, then there are at most (Γ/2) · 2 = Γ periods of stationary distribution over the whole horizon, and the total expected regret of A can be made at least (Γ/2) · √B/4 = (Γ/2) · √(2T/Γ)/4 = Ω(√(ΓT)), which proves the theorem.\n\nFor full-information algorithms, we have the following lower bound, which almost matches our upper bound in Theorem 3.2. We provide the proof in Appendix G in the supplementary material.\n\nTheorem 4.2. 
The worst-case expected regret of any full-information algorithm is Ω(√(ΓΛ) + Γ).

4.2 Drifting Distributions

In this subsection, we show that the regret upper bounds achieved by our bandit algorithm and full-information algorithm are close to optimal by showing almost matching lower bounds. More precisely, we have the following.

Theorem 4.3. The worst-case expected regret of any full-information algorithm is Ω((ΛVT)^{1/3} + V), while that of any bandit algorithm is Ω((ΛVT)^{1/3} + √(VT)).

Proof. Let us first consider the full-information case. When ΛT ≤ 32KV², we immediately have from Theorem 4.2 the regret lower bound of Ω(Γ) ≥ Ω(V) ≥ Ω((ΛVT)^{1/3}). Thus, let us focus on the case with ΛT ≥ 32KV². In this case, V ≤ O((ΛVT)^{1/3}), so it suffices to prove a lower bound of Ω((ΛVT)^{1/3}). Fix any full-information algorithm A, and we will show the existence of a sequence of loss distributions for A to suffer such an expected regret. Following [3], we divide the time steps into T/B intervals of length B, and we set B = (ΛT/(32KV²))^{1/3} ≥ 1.

For each interval, we will pick some arm i as the optimal one, and give it some loss distribution P, while other arms are sub-optimal and all have some loss distribution Q. We need P and Q to satisfy the following three conditions: (a) P's mean is smaller than Q's by ε, (b) their variances are at most σ², and (c) their KL divergence satisfies (ln 2)KL(Q,P) ≤ ε²/σ², for some ε, σ ∈ (0, 1) to be specified later. Their existence is guaranteed by the following, which we prove in Appendix H in the supplementary material.

Lemma 4.4.
For any 0 ≤ σ ≤ 1/2 and 0 ≤ ε ≤ σ/√2, there exist distributions P and Q satisfying the three conditions above.

Let Di denote the joint distribution of such K distributions, with arm i being the optimal one, and we will use the same Di for all the steps in an interval. We will show that for any interval, there is some i such that using Di this way can make algorithm A suffer a large expected regret in the interval, conditioned on the distributions chosen for previous intervals. Before showing that, note that when we choose distributions in this way, their total variance is at most TKσ² while their total drift is at most (T/B)ε. To have them bounded by Λ and V respectively, we choose σ = √(Λ/(4KT)) and ε = VB/T, which satisfy the condition of Lemma 4.4, with our choice of B.

To find the distributions, we deal with the intervals one by one. Consider any interval, and assume that the distributions for previous intervals have been chosen. Let Ni denote the number of steps A plays arm i in this interval, and let Ei[Ni] denote its expectation when Di is used for every step of the interval, conditioned on the distributions of previous intervals. One can bound this conditional expectation in terms of a related one, denoted as Eunif[Ni], when every arm has the distribution Q for every step of the interval, again conditioned on the distributions of previous intervals. Specifically, using an almost identical argument to that in [2, proof of Theorem A.2.], one can show that

    Ei[Ni] ≤ Eunif[Ni] + (B/2)·√(B(2 ln 2)·KL(Q,P)).²    (7)
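As a quick numeric sanity check of these parameter choices, the following sketch plugs in illustrative values of T, K, Λ, V (assumptions chosen to satisfy the case ΛT ≥ 32KV² handled by the construction, not quantities from the paper) and verifies the constraints used in the proof:

```python
import math

# Sanity check of the parameter choices sigma = sqrt(Λ/(4KT)), eps = VB/T,
# B = (ΛT/(32KV^2))^(1/3).  T, K, Lam (Λ) and V are illustrative assumptions.
T, K, Lam, V = 10**6, 2, 100.0, 5.0
assert Lam * T >= 32 * K * V**2             # the case treated by the construction

B = (Lam * T / (32 * K * V**2)) ** (1 / 3)  # interval length, >= 1
sigma = math.sqrt(Lam / (4 * K * T))        # per-step standard deviation
eps = V * B / T                             # gap between the means of P and Q

assert B >= 1
assert T * K * sigma**2 <= Lam              # total variance within budget Λ
assert (T / B) * eps <= V + 1e-9            # total drift within budget V
assert sigma <= 1 / 2 and eps <= sigma / math.sqrt(2)  # conditions of Lemma 4.4
assert 2 * B * eps**2 / sigma**2 <= 1 / 4 + 1e-9       # bound used after (7)
print("all parameter constraints hold")
```

With these choices, 2B·ε²/σ² evaluates to exactly 1/4 (up to floating-point rounding), matching the bound invoked after equation (7).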
According to Lemma 4.4 and our choice of parameters, we have B(2 ln 2)·KL(Q,P) ≤ 2B·(ε²/σ²) ≤ 1/4. Summing both sides of (7) over arm i, and using the fact that Σi Eunif[Ni] = B, we get Σi Ei[Ni] ≤ B + BK/4, which implies the existence of some i such that Ei[Ni] ≤ B/K + B/4 ≤ (3/4)B. Therefore, if we choose this distribution Di, the conditional expected regret of algorithm A in this interval is at least ε(B − Ei[Ni]) ≥ εB/4.
By choosing distributions inductively in this way, we can make A suffer a total expected regret of at least (T/B)·(εB/4) ≥ Ω((ΛVT)^{1/3}). This completes the proof for the full-information case.
Next, let us consider the bandit case. From Theorem 4.1, we immediately have a lower bound of Ω(√(ΓT)) ≥ Ω(√(VT)), which implies the required bound when √(VT) ≥ (ΛVT)^{1/3}. When √(VT) ≤ (ΛVT)^{1/3}, we have V ≤ Λ²/T, which implies that V ≤ (ΛVT)^{1/3}, and we can then use the full-information bound of Ω((ΛVT)^{1/3}) just proved before. This completes the proof of the theorem.

²Note that inside the square root, we use B instead of Eunif[Ni] as in [2]. This is because in their bandit setting, Ni is the number of steps when arm i is sampled and has its information revealed to the learner, while in our full-information case, information about arm i is revealed in every step and there are at most B steps.

References

[1] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theor. Comput. Sci., 410(19):1876–1902, 2009.

[2] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J.
Comput., 32(1):48–77, 2002.

[3] Omar Besbes, Yonatan Gur, and Assaf J. Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems (NIPS), December 2014.

[4] Omar Besbes, Yonatan Gur, and Assaf J. Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.

[5] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[6] Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In The 25th Conference on Learning Theory (COLT), June 2012.

[7] Pierre Gaillard, Gilles Stoltz, and Tim van Erven. A second-order bound with excess losses. In The 27th Conference on Learning Theory (COLT), June 2014.

[8] Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for switching bandit problems. In The 22nd International Conference on Algorithmic Learning Theory (ALT), October 2011.

[9] Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), May 2015.

[10] Haipeng Luo and Robert E. Schapire. Achieving all with no parameters: AdaNormalHedge. In The 28th Conference on Learning Theory (COLT), July 2015.

[11] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample-variance penalization.
In The 22nd Conference on Learning Theory (COLT), June 2009.