{"title": "Online Learning of Non-stationary Sequences", "book": "Advances in Neural Information Processing Systems", "page_first": 1093, "page_last": 1100, "abstract": "", "full_text": "Online Learning of Non-stationary Sequences\n\nClaire Monteleoni and Tommi Jaakkola\n\nMIT Computer Science and Artificial Intelligence Laboratory\n\n200 Technology Square\nCambridge, MA 02139\n\n{cmontel,tommi}@ai.mit.edu\n\nAbstract\n\nWe consider an online learning scenario in which the learner can make predictions on the basis of a fixed set of experts. We derive upper and lower relative loss bounds for a class of universal learning algorithms involving a switching dynamics over the choice of the experts. On the basis of the performance bounds we provide the optimal a priori discretization for learning the parameter that governs the switching dynamics. We demonstrate the new algorithm in the context of wireless networks.\n\n1 Introduction\n\nWe focus on the online learning framework in which the learner has access to a set of experts but possesses no other a priori information relating to the observation sequence. In such a scenario the learner may choose to quickly identify a single best expert to rely on [12], or switch from one expert to another in response to perceived changes in the observation sequence [8], thus making assumptions about the switching dynamics. The ability to shift emphasis from one \u201cexpert\u201d to another, in response to changes in the observations, is valuable in many applications, including energy management in wireless networks.\n\nMany algorithms developed for universal prediction on the basis of a set of experts have clear performance guarantees (e.g., [12, 6, 8, 14]). The performance bounds characterize the regret relative to the best expert, or best sequence of experts, chosen in hindsight. 
Algorithms with such relative loss guarantees have also been developed for adaptive game playing [5], online portfolio management [7], paging [3] and the k-armed bandit problem [1]. Other relative performance measures for universal prediction involve comparing across systematic variations in the sequence [4].\n\nHere we extend the class of algorithms considered in [8], by learning the switching-rate parameter online, at the optimal resolution. Our goal of removing the switching-rate as a parameter is similar to Vovk\u2019s in [14], though the approach and the comparison class for the bounds differ. We provide upper and lower performance bounds, and demonstrate the utility of these algorithms in the context of wireless networks.\n\n2 Algorithms and performance guarantees\n\nThe learner has access to $n$ experts, $a_1, \\ldots, a_n$, and each expert makes a prediction at each time-step over a finite (known) time period $t = 1, \\ldots, T$. We denote the $i$th expert at time $t$ as $a_{i,t}$ to suppress any details about how the experts arrive at their predictions and what information is available to facilitate the predictions. These details may vary from one expert to another and may change over time. We denote the non-negative prediction loss of expert $i$ at time $t$ as $L(i, t)$, where the loss, a function of $t$, naturally depends on the observation $y_t \\in Y$ at time $t$. We consider here algorithms that provide a distribution $p_t(i)$, $i = 1, \\ldots, n$, over the experts at each time point. The prediction loss of such an algorithm is denoted by $L(p_t, t)$.\n\nFor the purpose of deriving learning algorithms such as Static-expert and Fixed-share described in [8], we associate the loss of each expert with a predictive probability so that $-\\log p(y_t \\mid y_{t-1}, \\ldots, y_1, i) = L(i, t)$. 
We define the loss of any probabilistic prediction to be the log-loss:\n\n$$L(p_t, t) = -\\log \\sum_{i=1}^{n} p_t(i)\\, p(y_t \\mid i, y_1, \\ldots, y_{t-1}) = -\\log \\sum_{i=1}^{n} p_t(i)\\, e^{-L(i,t)} \\qquad (1)$$\n\nMany other definitions of the loss corresponding to $p_t(\\cdot)$ can be bounded by a scaled log-loss [6, 8]. We omit such modifications here as they change neither the essential nature of the algorithms nor their analysis.\n\nThe algorithms combining expert predictions can now be derived as simple Bayesian estimation methods calculating the distribution $p_t(i) = P(i \\mid y_1, \\ldots, y_{t-1})$ over the experts on the basis of the observations seen so far. We set $p_1(i) = 1/n$ for any such method, as any other initial bias could be detrimental in terms of relative performance guarantees. Updating $p_t(\\cdot)$ involves assumptions about how the optimal choice of expert can change with time. For simplicity, we consider here only a Markov dynamics, defined by $p(i_t \\mid i_{t-1}; \\alpha)$, where $\\alpha$ parameterizes the one-step transition probabilities. Allowing switches at rate $\\alpha$, we define$^1$\n\n$$p(i_t \\mid i_{t-1}; \\alpha) = (1 - \\alpha)\\, \\delta(i_t, i_{t-1}) + \\frac{\\alpha}{n - 1}\\, [1 - \\delta(i_t, i_{t-1})] \\qquad (2)$$\n\nwhich corresponds to the Fixed-share algorithm, and yields the Static-expert algorithm when $\\alpha = 0$. The Bayesian algorithm updating $p_t(\\cdot)$ is defined analogously to forward propagation in generalized HMMs (allowing observation dependence on the past):\n\n$$p_t(i; \\alpha) = \\frac{1}{Z_t} \\sum_{j=1}^{n} p_{t-1}(j; \\alpha)\\, e^{-L(j, t-1)}\\, p(i \\mid j; \\alpha) \\qquad (3)$$\n\nwhere $Z_t$ normalizes the distribution. While we have made various probabilistic assumptions (e.g., conditional independence of expert predictions) in deriving the algorithm, the algorithms can be used in a context where no such statistical assumptions about the observation sequence or the experts are warranted. 
The performance guarantees we provide below for these algorithms do not require these assumptions.\n\n2.1 Relative loss bounds\n\nThe existing upper bound on the relative loss of the Fixed-share algorithm [8] is expressed in terms of the loss of the algorithm relative to the loss of the best k-partition of the observation sequence, where the best expert is assigned to each segment. We start by providing here a similar guarantee but characterizing the regret relative to the best Fixed-share algorithm, parameterized by $\\alpha^*$, where $\\alpha^*$ is chosen in hindsight after having seen the observation sequence. Our proof technique is different from [8] and gives rise to simple guarantees for a wider class of prediction methods, along with a lower bound on this regret.\n\n$^1$ where $\\delta(\\cdot, \\cdot)$ is the Kronecker delta.\n\nLemma 1 Let $L_T(\\alpha) = \\sum_{t=1}^{T} L(p_{t;\\alpha}, t)$, $\\alpha \\in [0, 1]$, be the cumulative loss of the Fixed-share algorithm on an arbitrary sequence of observations. Then for any $\\alpha, \\alpha^*$:\n\n$$L_T(\\alpha) - L_T(\\alpha^*) = -\\log \\left[ E_{\\hat{\\alpha} \\sim Q}\\, e^{(T-1)[D(\\hat{\\alpha} \\| \\alpha^*) - D(\\hat{\\alpha} \\| \\alpha)]} \\right] \\qquad (4)$$\n\nProof: The cumulative log-loss of the Bayesian algorithm can be expressed in terms of negative log-probability of all the observations:\n\n$$L_T(\\alpha) = -\\log \\left[ \\sum_{\\vec{s}} \\phi(\\vec{s})\\, p(\\vec{s}; \\alpha) \\right] \\qquad (5)$$\n\nwhere $\\vec{s} = \\{i_1, \\ldots, i_T\\}$, $\\phi(\\vec{s}) = \\prod_{t=1}^{T} e^{-L(i_t, t)}$, and $p(\\vec{s}; \\alpha) = p_1(i_1) \\prod_{t=2}^{T} p(i_t \\mid i_{t-1}; \\alpha)$. Consequently,\n\n$$L_T(\\alpha) - L_T(\\alpha^*) = -\\log \\frac{\\sum_{\\vec{s}} \\phi(\\vec{s})\\, p(\\vec{s}; \\alpha)}{\\sum_{\\vec{r}} \\phi(\\vec{r})\\, p(\\vec{r}; \\alpha^*)} = -\\log \\left[ \\sum_{\\vec{s}} \\left( \\frac{\\phi(\\vec{s})\\, p(\\vec{s}; \\alpha^*)}{\\sum_{\\vec{r}} \\phi(\\vec{r})\\, p(\\vec{r}; \\alpha^*)} \\right) \\frac{p(\\vec{s}; \\alpha)}{p(\\vec{s}; \\alpha^*)} \\right]$$\n\n$$= -\\log \\left[ \\sum_{\\vec{s}} Q(\\vec{s}; \\alpha^*)\\, \\frac{p(\\vec{s}; \\alpha)}{p(\\vec{s}; \\alpha^*)} \\right] = -\\log \\left[ \\sum_{\\vec{s}} Q(\\vec{s}; \\alpha^*)\\, e^{\\log \\frac{p(\\vec{s}; \\alpha)}{p(\\vec{s}; \\alpha^*)}} \\right]$$\n\n$$= -\\log \\left[ \\sum_{\\vec{s}} Q(\\vec{s}; \\alpha^*)\\, e^{(T-1)\\left( \\hat{\\alpha}(\\vec{s}) \\log \\frac{\\alpha}{\\alpha^*} + (1 - \\hat{\\alpha}(\\vec{s})) \\log \\frac{1-\\alpha}{1-\\alpha^*} \\right)} \\right]$$\n\nwhere $Q(\\vec{s}; \\alpha^*)$ is the posterior probability over the choices of experts along the sequence, induced by the hindsight-optimal switching-rate $\\alpha^*$, and $\\hat{\\alpha}(\\vec{s})$ is the empirical fraction of non-self-transitions in the selection sequence $\\vec{s}$. Since $\\hat{\\alpha} \\log \\frac{\\alpha}{\\alpha^*} + (1 - \\hat{\\alpha}) \\log \\frac{1-\\alpha}{1-\\alpha^*} = D(\\hat{\\alpha} \\| \\alpha^*) - D(\\hat{\\alpha} \\| \\alpha)$, the sum depends on $\\vec{s}$ only through $\\hat{\\alpha}(\\vec{s})$ and can be rewritten as an expectation over $\\hat{\\alpha}$ under the distribution $Q$, which gives Equation 4. $\\Box$\n\nWe obtain upper and lower bounds on regret by optimizing the expression for regret over $Q \\in \\mathcal{Q}$, the set of all distributions over $\\hat{\\alpha} \\in [0, 1]$.\n\n2.1.1 Upper bound\n\nThe upper bound follows from solving:\n\n$$\\max_{Q \\in \\mathcal{Q}} \\left\\{ -\\log \\left[ E_{\\hat{\\alpha} \\sim Q}\\, e^{(T-1)[D(\\hat{\\alpha} \\| \\alpha^*) - D(\\hat{\\alpha} \\| \\alpha)]} \\right] \\right\\}$$\n\nsubject to the constraint that $\\alpha^*$ has to be the hindsight-optimal switching-rate, i.e. that:\n\n$$\\frac{d}{d\\alpha} \\left( L_T(\\alpha) - L_T(\\alpha^*) \\right) \\Big|_{\\alpha = \\alpha^*} = 0 \\qquad \\text{(C1)}$$\n\nTheorem 1 Let $L_T(\\alpha^*) = \\min_{\\alpha} L_T(\\alpha)$ be the loss of the best Fixed-share algorithm chosen in hindsight. Then for any $\\alpha \\in [0, 1]$, $L_T(\\alpha) - L_T(\\alpha^*) \\le (T - 1)\\, D(\\alpha^* \\| \\alpha)$, where $D(\\alpha^* \\| \\alpha)$ is the relative entropy between Bernoulli distributions defined by $\\alpha^*$ and $\\alpha$.\n\nThe bound vanishes when $\\alpha = \\alpha^*$ and does not depend directly on the number of experts. The dependence on $n$ may appear indirectly through $\\alpha^*$, however. 
While the regret appears proportional to $T$, this dependence vanishes for any reasonable learning algorithm that is guaranteed to find $\\alpha$ such that $D(\\alpha^* \\| \\alpha) \\le O(1/T)$, as we will show in Section 3.\n\nTheorem 1 follows, as a special case, from an analogous result for algorithms based on arbitrary first-order Markov transition dynamics. In the general case, the regret bound is: $(T - 1) \\max_i D(P(j \\mid i; \\alpha^*) \\| P(j \\mid i; \\alpha))$, where $\\alpha, \\alpha^*$ are now transition matrices, and $D(\\cdot \\| \\cdot)$ is the relative entropy between discrete distributions. For brevity, we provide only the proof of the scalar case of Theorem 1.\n\nProof: Constraint (C1) can be expressed simply as $\\frac{d}{d\\alpha} L_T(\\alpha) |_{\\alpha = \\alpha^*} = 0$, which is equivalent to $E_{\\hat{\\alpha} \\sim Q}\\{\\hat{\\alpha}\\} = \\alpha^*$. Taking the expectation outside the logarithm in Equation 4 (Jensen\u2019s inequality) results in the upper bound. $\\Box$\n\n2.1.2 Lower bound\n\nThe relative losses obviously satisfy $L_T(\\alpha) - L_T(\\alpha^*) \\ge 0$, providing a trivial lower bound. Any non-trivial lower bound on the regret cannot be expressed only in terms of $\\alpha$ and $\\alpha^*$, but needs to incorporate some additional information about the losses along the observation sequence. We express the lower bound on the regret as a function of the relative quality $\\beta^*$ of the minimum $\\alpha^*$:\n\n$$\\beta^* = \\frac{\\alpha^*(1 - \\alpha^*)}{T - 1}\\, \\frac{d^2}{d\\alpha^2} L_T(\\alpha) \\Big|_{\\alpha = \\alpha^*} \\qquad (6)$$\n\nwhere the normalization guarantees that $\\beta^* \\le 1$. $\\beta^* \\ge 0$ for any $\\alpha^*$ that minimizes $L_T(\\alpha)$.\n\nThe lower bound is found by solving:\n\n$$\\min_{Q \\in \\mathcal{Q}} \\left\\{ -\\log \\left[ E_{\\hat{\\alpha} \\sim Q}\\, e^{(T-1)[D(\\hat{\\alpha} \\| \\alpha^*) - D(\\hat{\\alpha} \\| \\alpha)]} \\right] \\right\\}$$\n\nsubject to both constraint (C1) and\n\n$$\\frac{d^2}{d\\alpha^2} \\left( L_T(\\alpha) - L_T(\\alpha^*) \\right) \\Big|_{\\alpha = \\alpha^*} = \\frac{\\beta^* (T - 1)}{\\alpha^*(1 - \\alpha^*)} \\qquad \\text{(C2)}$$\n\nTheorem 2 Let $\\beta^*$ and $\\alpha^*$ be defined as above based on an arbitrary observation sequence, and $q_1 = \\left[ 1 + \\frac{T-1}{1-\\beta^*}\\, \\frac{1-\\alpha^*}{\\alpha^*} \\right]^{-1}$ and $q_0 = \\left[ 1 + \\frac{T-1}{1-\\beta^*}\\, \\frac{\\alpha^*}{1-\\alpha^*} \\right]^{-1}$. Then\n\n$$L_T(\\alpha) - L_T(\\alpha^*) \\ge -\\log \\left[ E_{\\hat{\\alpha} \\sim Q}\\, e^{(T-1)[D(\\hat{\\alpha} \\| \\alpha^*) - D(\\hat{\\alpha} \\| \\alpha)]} \\right] \\qquad (7)$$\n\nwhere $Q(1) = q_1$ and $Q((\\alpha^* - q_1)/(1 - q_1)) = 1 - q_1$ whenever $\\alpha \\ge \\alpha^*$; $Q(0) = q_0$ and $Q(\\alpha^*/(1 - q_0)) = 1 - q_0$ otherwise.\n\nProof omitted due to space constraints. The upper and lower bounds agree for all $\\alpha, \\alpha^* \\in (0, 1)$ when $\\beta^* \\to 1$. Thus there may exist observation sequences on which Fixed-share, using $\\alpha \\ne \\alpha^*$, must incur regret linear in $T$.\n\n2.2 Algorithm Learn-$\\alpha$\n\nWe now give an algorithm to learn the switching-rate simultaneously with updating the probability weighting over the experts. 
Since the cumulative loss $L_t(\\alpha)$ of each Fixed-share algorithm running with switching parameter $\\alpha$ can be interpreted as a negative log-probability, the posterior distribution over the switching-rate becomes\n\n$$p_t(\\alpha) = P(\\alpha \\mid y_{t-1}, \\ldots, y_1) \\propto e^{-L_{t-1}(\\alpha)} \\qquad (8)$$\n\nassuming a uniform prior over $\\alpha \\in [0, 1]$. As a predictive distribution $p_t(\\alpha)$ does not include the observation at the same time point. We can view this algorithm as finding the single best \u201c$\\alpha$-expert,\u201d where the collection of $\\alpha$-experts is given by Fixed-share algorithms running with different switching-rates, $\\alpha$.\n\nWe will consider a finite resolution version of this algorithm, allowing only $m$ possible choices for the switching-rate, $\\{\\alpha_1, \\ldots, \\alpha_m\\}$. For a sufficiently large $m$ and appropriately chosen values $\\{\\alpha_j\\}$, we expect to be able to always find $\\alpha_j \\approx \\alpha^*$ and suffer only a minimal additional loss due to not being able to represent the hindsight-optimal value exactly.\n\nLet $p_{t,j}(i)$ be the distribution over experts defined by the $j$th Fixed-share algorithm corresponding to $\\alpha_j$, and let $p_t^{top}(j)$ be the top-level algorithm producing a weighting over such Fixed-share experts. The top-level algorithm is given by\n\n$$p_t^{top}(j) = \\frac{1}{Z_t}\\, p_{t-1}^{top}(j)\\, e^{-L(p_{t-1,j},\\, t-1)} \\qquad (9)$$\n\nwhere $p_1^{top}(j) = 1/m$, and the loss per time-step becomes\n\n$$L^{top}(p_t^{top}, t) = -\\log \\sum_{j=1}^{m} p_t^{top}(j)\\, e^{-L(p_{t,j},\\, t)} = -\\log \\sum_{j=1}^{m} \\sum_{i=1}^{n} p_t^{top}(j)\\, p_{t,j}(i)\\, e^{-L(i,t)} \\qquad (10)$$\n\nas is appropriate for a hierarchical Bayesian method.\n\n3 Relative loss and optimal discretization\n\nWe derive here the optimal choice of the discrete set $\\{\\alpha_1, \\ldots, \\alpha_m\\}$ on the basis of the upper bound on relative loss. 
We begin by extending Theorem 1 to provide an analogous guarantee for the Learn-$\\alpha$ algorithm.\n\nCorollary to Theorem 1 Let $L_T^{top}$ be the cumulative loss of the hierarchical Learn-$\\alpha$ algorithm using $\\{\\alpha_1, \\ldots, \\alpha_m\\}$. Then\n\n$$L_T^{top} - L_T(\\alpha^*) \\le \\log(m) + (T - 1) \\min_{j=1,\\ldots,m} D(\\alpha^* \\| \\alpha_j) \\qquad (11)$$\n\nThe hierarchical algorithm involves two competing goals that manifest themselves in the regret: 1) the ability to identify the best Fixed-share expert, which degrades for larger $m$, and 2) the ability to find $\\alpha_j$ whose loss is close to the optimal $\\alpha$ for that sequence, which improves for larger $m$. The additional regret arising from having to consider a number of non-optimal values of the parameter, in the search, comes from the relative loss bound for the Static-expert algorithm, i.e. the relative loss associated with tracking the best single expert [8, 12]. This regret is simply $\\log(m)$ in our context. More precisely, the corollary follows directly from successive application of that single expert relative loss bound, and then our Fixed-share relative loss bound (Theorem 1):\n\n$$L_T^{top} - L_T(\\alpha^*) \\le \\log(m) + \\min_{j=1,\\ldots,m} L_T(\\alpha_j) - L_T(\\alpha^*) \\qquad (12)$$\n\n$$\\le \\log(m) + (T - 1) \\min_{j=1,\\ldots,m} D(\\alpha^* \\| \\alpha_j) \\qquad (13)$$\n\n3.1 Optimal discretization\n\nWe start by finding the smallest discrete set of switching-rate parameters so that any additional regret due to discretization does not exceed $(T - 1)\\delta$, for some threshold $\\delta$. In other words, we find $m = m(\\delta)$ values $\\alpha_1, \\ldots, \\alpha_{m(\\delta)}$ such that\n\n$$\\max_{\\alpha^* \\in [0,1]}\\ \\min_{j=1,\\ldots,m(\\delta)} D(\\alpha^* \\| \\alpha_j) = \\delta \\qquad (14)$$\n\nThe resulting discretization, a function of $\\delta$, can be found algorithmically as follows. First, we set $\\alpha_1$ so that $\\max_{\\alpha^* \\in [0, \\alpha_1]} D(\\alpha^* \\| \\alpha_1) = D(0 \\| \\alpha_1) = \\delta$, implying that $\\alpha_1 = 1 - e^{-\\delta}$. Each subsequent $\\alpha_j$ is found conditionally on $\\alpha_{j-1}$ so that\n\n$$\\max_{\\alpha^* \\in [\\alpha_{j-1}, \\alpha_j]} \\min\\{ D(\\alpha^* \\| \\alpha_{j-1}),\\ D(\\alpha^* \\| \\alpha_j) \\} = \\delta \\qquad (15)$$\n\nThe maximizing $\\alpha^*$ can be solved explicitly by equating the two relative entropies, giving\n\n$$\\alpha^* = \\log\\left( \\frac{1 - \\alpha_{j-1}}{1 - \\alpha_j} \\right) \\left( \\log\\left( \\frac{\\alpha_j}{\\alpha_{j-1}} \\right) + \\log\\left( \\frac{1 - \\alpha_{j-1}}{1 - \\alpha_j} \\right) \\right)^{-1} \\qquad (16)$$\n\nwhich lies within $[\\alpha_{j-1}, \\alpha_j]$ and is an increasing function of the new point $\\alpha_j$. Substituting this $\\alpha^*$ back into one of the relative entropies we can set $\\alpha_j$ so that $D(\\alpha^* \\| \\alpha_{j-1}) = \\delta$. The relative entropy is an increasing function of $\\alpha_j$ (through $\\alpha^*$) and the solution is obtained easily via, e.g., bisection search. The iterative procedure of generating new values $\\alpha_j$ can be stopped after the new point exceeds $1/2$; the remaining levels can be filled in by symmetry so long as we also include $1/2$. The resulting discretization is not uniform but denser towards the edges; the spacing around the edges is $O(\\delta)$, and $O(\\sqrt{\\delta})$ around $1/2$.\n\nFor small values of $\\delta$, the logarithm of the number of resulting discretization levels, or $\\log m(\\delta)$, closely approximates $-\\frac{1}{2} \\log \\delta$. We can then optimize the regret bound (11): $-\\frac{1}{2} \\log \\delta + (T - 1)\\delta$, yielding $\\delta^* = 1/(2T)$, and $m(\\delta^*) = \\sqrt{2T}$. Thus we will need $O(\\sqrt{T})$ settings of $\\alpha$, as in the case of choosing the levels uniformly with spacing $\\sqrt{\\delta}$. 
The uniform discretization would not, however, possess the same regret guarantee, resulting in a higher than necessary loss due to discretization.\n\n3.1.1 Optimized regret bound for Learn-$\\alpha$\n\nThe optimized regret bound for Learn-$\\alpha(\\delta^*)$ is thus (approximately) $\\frac{1}{2} \\log T + c$, which is comparable to analysis of universal coding for word-length $T$ [11]. The optimal discretization for learning the parameter is not affected by $n$, the number of original experts. Unlike regret bounds for Fixed-share, the value of the bound does not depend on the observation sequence. And notably, in comparison to the lower bound on Fixed-share\u2019s performance, Learn-$\\alpha$\u2019s regret is at most logarithmic in $T$.\n\n4 Application to wireless networks\n\nWe applied the Learn-$\\alpha$ algorithm to an open problem in computer networks: managing the tradeoff between energy consumption and performance in wireless nodes of the IEEE 802.11 standard [9]. Since a node cannot receive packets while asleep, yet maintaining the awake state drains energy, the existing standard uses a fixed polling time at which a node should wake from the sleep state to poll its neighbors for buffered packets. Polling at fixed intervals, however, does not respond optimally to current network activity. This problem is clearly an appropriate application for an online learning algorithm, such as Fixed-share due to [8]. Since we are concerned with wireless, mobile nodes, there is no principled way to set the switching-rate parameter a priori, as network activity varies not only over time, but across location, and the location of the mobile node is allowed to change. We can therefore expect an additional benefit from learning the switching-rate.\n\nPrevious work includes Krashinsky and Balakrishnan\u2019s [10] Bounded Slowdown algorithm which uses an adaptive control loop to change polling time based on network conditions. 
This algorithm uses parameterized exploration intervals, and the tradeoff is not managed optimally. Steinbach applied reinforcement learning [13] to this problem, yet required an unrealistic assumption: that network activity possesses the Markov property.\n\nWe instantiate the experts as deterministic algorithms assuming constant polling times. Thus we use $n$ experts, each corresponding to a different but fixed polling time in milliseconds (ms): $T_i$, $i \\in \\{1, \\ldots, n\\}$. The experts form a discretization over the range of possible polling times. We then apply the Learn-$\\alpha$ algorithm exactly as in our previous exposition, using the discretization defined by $\\delta^*$, and thus running $m(\\delta^*)$ sub-algorithms, each running Fixed-share with a different $\\alpha_j$. In this application, the learning algorithm can only receive observations, and perform learning updates, when it is awake. So our subscript $t$ here signifies only wake times, not every time epoch at which bytes might arrive.\n\nWe define the loss function, $L$, to reflect the tradeoff inherent in the conflicting goals of minimizing both the energy usage of the node, and the network latency it introduces by sleeping. We propose a loss function that is one of many functions proportional to this tradeoff. We define loss per expert $i$ as:\n\n$$\\mathrm{Loss}(i, t) = \\gamma\\, \\frac{I_t T_i^2}{2 T_t} + \\frac{1}{T_i} \\qquad (17)$$\n\nwhere $I_t$ is the observation the node receives, of how many bytes arrived upon awakening at time $t$, and $T_t$ is the length of time that the node just slept. The first term models the average latency introduced into the network by buffering those bytes, and scales $I_t$ to the number of bytes that would have arrived had the node slept for time $T_i$ instead of $T_t$, under the assumption that the bytes arrived at a uniform rate. 
The second term models the energy consumption of the node, based on the design that the node wakes only after an interval $T_t$ to poll for buffered bytes, and the fact that it consumes less energy when asleep than awake. The objective function is a sum of convex functions and thus admits a unique minimum. $\\gamma > 0$ allows for scaling between the units of information and time, and the ability to encode a preference for the ratio between energy and latency that the user favors.\n\n[Figure 1, four panels. Panels a) and b): cumulative loss versus alpha for Fixed-share(alpha), with reference levels for an arbitrary expert (500ms), the best expert (100ms), the IEEE 802.11 protocol, Static-expert, and Learn-alpha(delta*). Panels c) and d): cumulative loss of Learn-alpha versus 1/delta, for n = 10 and n = 5.]\n\nFigure 1: a) Cumulative loss of Fixed-share($\\alpha$) as a function of $\\alpha$, compared to the cumulative loss on the same trace of the 802.11 protocol, Static-expert, and Learn-$\\alpha(\\delta^*)$. 
Figure b) zooms in on the first 0.002 of the $\\alpha$ range. c) Cumulative loss of Learn-$\\alpha(\\delta)$, as a function of $1/\\delta$, when $n = 10$, and d) $n = 5$. Circles at $1/\\delta^* = 2T$.\n\n4.0.2 Experiments\n\nWe used traces of real network activity from [2], a UC Berkeley home dial-up server that monitored users accessing HTTP files from home. Multiple overlapping connections, passing through a collection node over several days, were recorded by start and end times, and number of bytes transferred. Per connection we smoothed the total number of bytes uniformly over 10ms intervals spanning its duration. We set $\\gamma = 1.0 \\times 10^{-7}$, calibrated to attain polling times within the range of the existing protocol.\n\nFigure 1a) and b) compare cumulative loss of the various algorithms on a 4-hour trace, with observation epochs every 10ms. This corresponds to approximately 26,100 training iterations for the learning algorithms. In the typical online learning scenario, $T$, the number of learning iterations, i.e. the time horizon parameter to the loss bounds, is just the number of observation epochs. In this application, the number of training epochs need not match the number of observation epochs, since the application involves sleeping during many observation epochs, and learning is only done upon awakening. In these experiments the three learning algorithms are compared with each using $n$ experts spanning the range of 1000ms at regularly spaced intervals of 100ms; to obtain a prior estimate of $T$, we therefore assume a mean sleep interval of 550ms, the mean of the experts.\n\nThe Static-expert algorithm achieved lower cumulative loss than the best expert, since it can attain the optimal smoothed value over the desired range of polling times, whereas the expert values just form a discretization. On this trace, the optimal $\\alpha$ for Fixed-share turns out to be extremely low. 
So for most settings of $\\alpha$, one would be better off using a Static-expert model, yet as the second graph shows, there is a value of $\\alpha$ below which it is beneficial to use Fixed-share. This lends validity to our fundamental goal of being able to quantify the level of non-stationarity of a process, in order to better model it. Moreover there is a clear advantage to using Learn-$\\alpha$, since without prior knowledge of the stochastic process to be observed, there is no optimal way to set $\\alpha$.\n\nFigure 1c) and d) show the cumulative loss of Learn-$\\alpha$ as a function of $1/\\delta$. We see that choosing $\\delta = \\frac{1}{2T}$ matches the point in the curve beyond which one cannot significantly reduce cumulative loss by decreasing $\\delta$. As expected, the performance of the algorithm levels off after the optimal $\\delta$ that we can compute a priori. Our results also verify that the optimal $\\delta$ is not significantly affected by the number of experts $n$.\n\n5 Conclusion\n\nWe proved upper and lower bounds on the regret for a class of online learning algorithms, applicable to any sequence of observations. The bounds extend to richer models of non-stationary sequences, allowing the switching dynamics to be governed by an arbitrary transition matrix. We derived the regret-optimal discretization (including the overall resolution) for learning the switching-rate parameter in a simple switching dynamics, yielding an algorithm with stronger guarantees than previous algorithms. We exemplified the approach in the context of energy management in wireless networks. In future work, we hope to extend the online estimation of $\\alpha$ and the optimal discretization to learning a full transition matrix.\n\nReferences\n\n[1] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: the adversarial multi-armed bandit problem. In Proc. 
of the 36th Annual Symposium on Foundations of Computer Science, pages 322\u2013331, 1995.\n\n[2] Berkeley. UC Berkeley home IP web traces. In http://ita.ee.lbl.gov/html/contrib/UCB.home-IP-HTTP.html, 1996.\n\n[3] A. Blum, C. Burch, and A. Kalai. Finely-competitive paging. In IEEE 40th Annual Symposium on Foundations of Computer Science, page 450, New York, New York, October 1999.\n\n[4] D. P. Foster and R. Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29:7\u201335, 1999.\n\n[5] Y. Freund and R. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79\u2013103, 1999.\n\n[6] D. Haussler, J. Kivinen, and M. K. Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Trans. on Information Theory, 44(5):1906\u20131925, 1998.\n\n[7] D. P. Helmbold, R. E. Schapire, Y. Singer, and M. K. Warmuth. On-line portfolio selection using multiplicative updates. In International Conference on Machine Learning, pages 243\u2013251, 1996.\n\n[8] M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32:151\u2013178, 1998.\n\n[9] IEEE. Computer society LAN MAN standards committee. In IEEE Std 802.11: Wireless LAN Medium Access Control and Physical Layer Specifications, August 1999.\n\n[10] R. Krashinsky and H. Balakrishnan. Minimizing energy for wireless web access with bounded slowdown. In MobiCom 2002, Atlanta, GA, September 2002.\n\n[11] R. Krichevsky and V. Trofimov. The performance of universal encoding. IEEE Trans. on Information Theory, 27(2):199\u2013207, 1981.\n\n[12] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. In IEEE Symposium on Foundations of Computer Science, pages 256\u2013261, 1989.\n\n[13] C. Steinbach. A reinforcement-learning approach to power management. In AI Technical Report, M.Eng Thesis, Artificial Intelligence Laboratory, MIT, May 2002.\n\n[14] V. Vovk. 
Derandomizing stochastic prediction strategies. Machine Learning, 35:247\u2013282, 1999.", "award": [], "sourceid": 2440, "authors": [{"given_name": "Claire", "family_name": "Monteleoni", "institution": null}, {"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}]}