{"title": "Online Learning with Transductive Regret", "book": "Advances in Neural Information Processing Systems", "page_first": 5214, "page_last": 5224, "abstract": "We study online learning with the general notion of transductive regret, that is, regret with respect to modification rules applying to expert sequences (as opposed to single experts) that are representable by weighted finite-state transducers. We show how transductive regret generalizes existing notions of regret, including: (1) external regret; (2) internal regret; (3) swap regret; and (4) conditional swap regret. We present a general and efficient online learning algorithm for minimizing transductive regret. We further extend it to design efficient algorithms for the time-selection and sleeping expert settings. A by-product of our study is an algorithm for swap regret which, under mild assumptions, is more efficient than existing ones, and a substantially more efficient algorithm for time-selection swap regret.", "full_text": "Online Learning with Transductive Regret

Mehryar Mohri
Courant Institute and Google Research
New York, NY
mohri@cims.nyu.edu

Scott Yang*
D. E. Shaw & Co.
New York, NY
yangs@cims.nyu.edu

Abstract

We study online learning with the general notion of transductive regret, that is, regret with respect to modification rules applying to expert sequences (as opposed to single experts) that are representable by weighted finite-state transducers. We show how transductive regret generalizes existing notions of regret, including: (1) external regret; (2) internal regret; (3) swap regret; and (4) conditional swap regret. We present a general and efficient online learning algorithm for minimizing transductive regret. We further extend it to design efficient algorithms for the time-selection and sleeping expert settings.
A by-product of our study is an algorithm for swap regret which, under mild assumptions, is more efficient than existing ones, and a substantially more efficient algorithm for time-selection swap regret.

1 Introduction

Online learning is a general framework for sequential prediction. Within that framework, a widely adopted setting is that of prediction with expert advice [Littlestone and Warmuth, 1994, Cesa-Bianchi and Lugosi, 2006], where the algorithm maintains a distribution over a set of experts. At each round, the loss assigned to each expert is revealed. The algorithm then incurs the expected value of these losses for its current distribution and next updates its distribution.

The standard benchmark for the algorithm in this scenario is the external regret, that is, the difference between its cumulative loss and that of the best (static) expert in hindsight. However, while this benchmark is useful in a variety of contexts and has led to the design of numerous effective online learning algorithms, it may not constitute a useful criterion in common cases where no single fixed expert performs well over the full course of the algorithm's interaction with the environment. This has led to several extensions of the notion of external regret, along two main directions.

The first is an extension of the notion of regret so that the learner's algorithm is compared against a competitor class consisting of dynamic sequences of experts. Research in this direction started with the work of Herbster and Warmuth [1998] on tracking the best expert, who studied the scenario of learning against the best sequence of experts with at most k switches. This work has been subsequently improved [Monteleoni and Jaakkola, 2003], generalized [Vovk, 1999, Cesa-Bianchi et al., 2012, Koolen and de Rooij, 2013], and modified [Hazan and Seshadhri, 2009, Adamskiy et al., 2012, Daniely et al., 2015].
More recently, an efficient algorithm with favorable regret guarantees has been given for the general case of a competitor class consisting of sequences of experts represented by a (weighted) finite automaton [Mohri and Yang, 2017, 2018]. This includes as special cases previous competitor classes considered in the literature.

The second direction is to consider competitor classes based on modifications of the learner's sequence of actions. This approach began with the notion of internal regret [Foster and Vohra, 1997, Hart and Mas-Colell, 2000], which considers how much better an algorithm could have performed if it had switched all instances of playing one action with another, and was subsequently generalized to the notion of swap regret [Blum and Mansour, 2007], which considers all possible in-time modifications of a learner's action sequence. More recently, Mohri and Yang [2014] introduced the notion of conditional swap regret, which considers all possible modifications of a learner's action sequence that depend on some fixed bounded history. Odalric and Munos [2011] also studied regret against history-dependent modifications and presented computationally tractable algorithms (with suboptimal regret guarantees) when the comparator class can be organized into a small number of equivalence classes.

In this paper, we consider the second direction and study regret with respect to modification rules. We first present an efficient online algorithm for minimizing swap regret (Section 3). We then introduce the notion of transductive regret in Section 4, that is, the regret of the learner's algorithm with respect to modification rules representable by a family of weighted finite-state transducers (WFSTs).

*Work done at the Courant Institute of Mathematical Sciences.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
This definition generalizes the existing notions of external, internal, swap, and conditional swap regret, and includes modification rules that apply to expert sequences, as opposed to single experts. Moreover, we present efficient algorithms for minimizing transductive regret. We further extend transductive regret to the time-selection setting (Section 5) and present efficient algorithms minimizing time-selection transductive regret. These algorithms significantly improve upon existing state-of-the-art algorithms in the special case of time-selection swap regret. Finally, in Section 6, we extend transductive regret to the sleeping experts setting and present new and efficient algorithms for minimizing sleeping transductive regret.

2 Preliminaries and notation

We consider the setting of prediction with expert advice with a set Σ of N experts. At each round t ∈ [T], an online algorithm A selects a distribution p_t over Σ, the adversary reveals a loss vector l_t ∈ [0, 1]^N, where l_t(x) is the loss of expert x ∈ Σ, and the algorithm incurs the expected loss p_t · l_t.

Let Φ ⊆ Σ^Σ denote a set of modification functions mapping the expert set to itself. The objective of the algorithm is to minimize its Φ-regret, Reg_T(A, Φ), defined as the difference between its cumulative expected loss and that of the best modification of the sequence in hindsight:

$$\mathrm{Reg}_T(\mathcal{A}, \Phi) = \max_{\varphi \in \Phi} \left\{ \sum_{t=1}^{T} \mathop{\mathbb{E}}_{x_t \sim p_t}[l_t(x_t)] - \mathop{\mathbb{E}}_{x_t \sim p_t}[l_t(\varphi(x_t))] \right\}. \qquad (1)$$

This definition coincides with the standard notion of external regret [Cesa-Bianchi and Lugosi, 2006] when Φ is reduced to the family of constant functions: Φ_ext = {φ_a : Σ → Σ : a ∈ Σ, ∀x ∈ Σ, φ_a(x) = a}; with the notion of internal regret [Foster and Vohra, 1997] when Φ is the family of functions that only switch two actions: Φ_int = {φ_{a,b} : Σ → Σ : a, b ∈ Σ, φ_{a,b}(x) = 1_{x=a} b + 1_{x=b} a + x 1_{x≠a,b}}; and with the notion of swap regret [Blum and Mansour, 2007] when Φ consists of all possible functions mapping Σ to itself: Φ_swap. In Section 4, we will introduce a more general notion of regret with modification rules applying to expert sequences, as opposed to single experts.

There are known algorithms achieving an external regret in O(√(T log N)) with a per-iteration computational cost in O(N) [Cesa-Bianchi and Lugosi, 2006], an internal regret in O(√(T log N)) with a per-iteration computational cost in O(N³) [Stoltz and Lugosi, 2005], and a swap regret in O(√(T N log N)) with a per-iteration computational cost in O(N³) [Blum and Mansour, 2007].

3 Efficient online algorithm for swap regret

In this section, we present an online algorithm, FASTSWAP, that achieves the same swap regret guarantee as the algorithm of Blum and Mansour [2007], O(√(T N log N)), but admits the more favorable per-iteration complexity of O(N² log T), under some mild assumptions.

Existing online algorithms for internal or swap regret minimization require, at each round, solving for a fixed point of an N × N stochastic matrix [Foster and Vohra, 1997, Stoltz and Lugosi, 2005, Blum and Mansour, 2007]. For example, the algorithm of Blum and Mansour [2007] is based on a meta-algorithm A that makes use of N external regret minimization sub-algorithms {A_i}_{i∈[N]} (see Figure 1). Sub-algorithm A_i is specialized in guaranteeing low regret against swapping expert i with any other expert j.
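The fixed-point computation that these swap-regret algorithms solve at each round can be sketched as follows. This is an illustrative sketch (assuming NumPy is available), not the paper's code: the rows of the matrix Q are the distributions returned by the sub-algorithms, and the meta-distribution p solves p = pQ, i.e., p is a left fixed point of the row-stochastic matrix Q.

```python
import numpy as np

def stationary_distribution(Q):
    """Solve p = p Q for a row-stochastic matrix Q, i.e. find the
    stationary distribution of the induced Markov chain, by solving
    the linear system (Q^T - I) p = 0 together with sum(p) = 1."""
    n = Q.shape[0]
    A = np.vstack([Q.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    # Least-squares solve of the (consistent) overdetermined system.
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

# Toy example: distributions returned by N = 3 sub-algorithms,
# stacked as the rows of Q (so Q is row-stochastic).
Q = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
p = stationary_distribution(Q)
assert np.allclose(p, p @ Q)        # p is a fixed point: p = p Q
assert np.isclose(p.sum(), 1.0)
```

Solving this linear system directly is what yields the O(N³) per-round cost mentioned below; the contribution of FASTSWAP is to replace it with a cheaper iterative approximation.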
The meta-algorithm A maintains a distribution p_t over the experts and, at each round t, assigns to sub-algorithm A_i only a fraction of the loss, (p_{t,i} l_t), and receives the distribution q_i (over the experts) returned by A_i.

Figure 1: Illustration of the swap regret algorithm of Blum and Mansour [2007] or the FASTSWAP algorithm, which use a meta-algorithm to control a set of N external regret minimizing algorithms. The meta-algorithm A passes the scaled losses p_{t,1} l_t, ..., p_{t,N} l_t to the sub-algorithms A_1, ..., A_N and receives the distributions q_{t,1}, ..., q_{t,N} in return.

Algorithm 1: FASTSWAP; {A_i}_{i=1}^N are external regret minimization algorithms.

    FASTSWAP((A_i)_{i=1}^N)
    for t ← 1 to T do
        for i ← 1 to N do
            q_i ← QUERY(A_i)
        Q_t ← [q_1 ⋯ q_N]ᵀ
        for j ← 1 to N do
            c_j ← min_{i∈[N]} Q_t[i, j]
        α_t ← ‖c‖₁;  τ_t ← ⌈log(1/√t) / log(1 − α_t)⌉
        if τ_t < N then
            p_t¹ ← c;  p_t ← p_t¹
            for τ ← 1 to τ_t do
                (p_t^{τ+1})ᵀ ← (p_t^τ)ᵀ (Q_t − 1cᵀ);  p_t ← p_t + p_t^{τ+1}
            p_t ← p_t / ‖p_t‖₁
        else
            p_tᵀ ← FIXED-POINT(Q_t)
        x_t ← SAMPLE(p_t);  l_t ← RECEIVELOSS()
        for i ← 1 to N do
            ATTRIBUTELOSS(p_t[i] l_t, A_i)

At each round t, the distribution p_t is selected to be the fixed point of the N × N stochastic matrix Q_t = [q_1 ⋯ q_N]ᵀ. Thus, p_t = p_t Q_t is the stationary distribution of the Markov process defined by Q_t. This choice of the distribution is natural to ensure that the learner's sequence of actions is competitive against a family of modifications, since it is invariant under a mapping that relates to this family of modifications.

The computation of a fixed point involves solving a linear system of equations; thus, the per-round complexity of these algorithms is in O(N³) using standard methods (or O(N^2.373), using the method of Coppersmith and Winograd). To improve upon this complexity in the setting of internal regret, Greenwald et al.
[2008] estimate the fixed point by applying, at each round, a single power iteration to some stochastic matrix. Their algorithm runs in O(N²) time per iteration, but at the price of a regret guarantee that is only in O(√N · T^{9/10}).

Here, we describe an efficient algorithm for swap regret, FASTSWAP. Algorithm 1 gives its pseudocode. As with the algorithm of Blum and Mansour [2007], FASTSWAP is based on a meta-algorithm A making use of N external regret minimization sub-algorithms {A_i}_{i∈[N]}. However, unlike the algorithm of Blum and Mansour [2007], which explicitly computes the stationary distribution of Q_t at round t, or that of Greenwald et al. [2008], which applies a single power iteration at each round, our algorithm applies multiple modified power iterations at round t (τ_t power iterations). Our modified power iterations are based on the REDUCEDPOWERMETHOD (RPM) algorithm introduced by Nesterov and Nemirovski [2015]. Unlike the algorithm of Greenwald et al. [2008], FASTSWAP uses a specific initial distribution at each round, applies the power method to a modification of the original stochastic matrix, and uses, as an approximation, an average of all the iterates at that round.

Theorem 1. Let A_1, ..., A_N be external regret minimizing algorithms admitting data-dependent regret bounds of the form O(√(L_T(A_i) log N)), where L_T(A_i) is the cumulative loss of A_i after T rounds. Assume that, at each round, the sum of the minimal probabilities given to an expert by these algorithms is bounded below by some constant α > 0. Then, FASTSWAP achieves a swap regret in O(√(T N log N)) with a per-iteration complexity in O(N² min{log T / log(1/(1−α)), N}).

Figure 2: (i) Example of a WFST T: I_T = 0, ilab[E_T[0]] = {a, b}, olab[E_T[1]] = {b}, E_T[2] = {(0, a, b, 1, 1), (0, b, a, 1, 1)}. (ii) Family of swap WFSTs T_φ, with φ: {a, b, c} → {a, b, c}. (iii) A more general example of a WFST, over the actions Apple, IBM, gold, silver, and sell, in which each action may only be swapped with a small set of related alternatives, via weighted transitions such as Apple:IBM/0.4.

The proof is given in Appendix D. It is based on a stability analysis bounding the additive regret term due to using an approximation of the fixed-point distribution, and the property that τ_t iterations of the reduced power method ensure a (1/√t)-approximation, where t is the number of rounds. The favorable complexity of our algorithm requires an assumption on the sum of the minimal probabilities assigned to an expert by the algorithms at each round.
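To make the role of these modified power iterations concrete, the following sketch (assuming NumPy; an illustration of the structure of Algorithm 1, not the authors' implementation) approximates the stationary distribution p = pQ starting from the vector c of column minima, repeatedly multiplying by (Q − 1cᵀ), and normalizing the accumulated sum of iterates. It converges geometrically at rate 1 − α, where α = ‖c‖₁ is the quantity the theorem assumes is bounded below.

```python
import numpy as np

def rpm_stationary(Q, n_iters):
    """Reduced-power-method-style approximation of the stationary
    distribution p = p Q of a row-stochastic matrix Q: start from the
    vector c of column minima, iterate p <- p (Q - 1 c^T), accumulate
    the iterates, and normalize their sum. Sketch only."""
    c = Q.min(axis=0)                        # c_j = min_i Q[i, j]
    M = Q - np.outer(np.ones(len(c)), c)     # nonnegative, row sums 1 - ||c||_1
    p_tau = c.copy()
    p_sum = c.copy()
    for _ in range(n_iters):
        p_tau = p_tau @ M                    # one modified power iteration
        p_sum += p_tau
    return p_sum / p_sum.sum()

Q = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
p = rpm_stationary(Q, 60)
assert np.allclose(p, p @ Q, atol=1e-8)      # approximately a fixed point
```

The iteration is justified by the identity p = c (I − M)⁻¹ = Σ_{k≥0} c Mᵏ, whose partial sums the loop accumulates; the tail after k steps shrinks like (1 − α)ᵏ.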
This is a reasonable assumption which one would expect to hold in practice if all the external regret minimizing sub-algorithms are the same. This is because the true losses assigned to each column of the stochastic matrix are the same, and the rescaling based on the distribution p_t is uniform over each row. Furthermore, since the number of rounds sufficient for a good approximation can be efficiently estimated, our algorithm can determine when it is worthwhile to switch to standard fixed-point methods, that is, when the condition τ_t > N holds. Thus, the time complexity of our algorithm is never worse than that of Blum and Mansour [2007].

4 Online algorithm for transductive regret

In this section, we consider a more general notion of regret than swap regret, where the family of modification functions applies to sequences instead of just to single experts. We will consider sequence-to-sequence mappings that can be represented by finite-state transducers. In fact, more generally, we will allow weights to be used for these mappings and will consider weighted finite-state transducers. This will lead us to define the notion of transductive regret, where the cumulative loss of an algorithm's sequence of actions is compared to that of sequences that are images of its action sequence via a transducer mapping. As we shall see, this is an extremely flexible definition that admits as special cases the standard notions of external, internal, and swap regret.

We will start with some preliminary definitions and concepts related to transducers.

4.1 Weighted finite-state transducer definitions

A weighted finite-state transducer (WFST) T is a finite automaton whose transitions are augmented with an output label and a real-valued weight, in addition to the familiar input label. Figure 2(i) shows a simple example.
We will assume both input and output labels to be elements of the alphabet Σ, which denotes the set of experts. Σ* denotes the set of all strings over the alphabet Σ.

We denote by E_T the set of transitions of T and, for any transition e ∈ E_T, we denote by ilab[e] its input label, by olab[e] its output label, and by w[e] its weight. For any state u of T, we denote by E_T[u] the set of transitions leaving u. We also extend the definition of ilab to sets and denote by ilab[E_T[u]] the set of input labels of the transitions in E_T[u].

We assume that T admits a single initial state, which we denote by I_T. For any state u and string x ∈ Σ*, we also denote by T(u, x) the set of states reached from u by reading string x as input. In particular, we will denote by T(I_T, x) the set of states reached from the initial state by reading string x as input.

The input (or output) label of a path is obtained by concatenating the input (output) transition labels along that path. The weight of a path is obtained by multiplying its transition weights. A path from the initial state to a final state is called an accepting path. A WFST maps the input label of each accepting path to its output label, with that path's weight as its probability.

The WFSTs we consider may be non-deterministic, that is, they may admit states with multiple outgoing transitions sharing the same input label. However, we will assume that, at any state, outgoing transitions sharing the same input label admit the same destination state. We will further require that, at any state, the set of output labels of the outgoing transitions be contained in the set of input labels of the same transitions. This requirement is natural for our definition of regret: our learner will use input label experts and will compete against sequences of output label experts.
Thus, the algorithm should have the option of selecting an expert sequence it must compete against.

Finally, we will assume that our WFSTs are stochastic, that is, for any state u and input label a ∈ Σ, we have Σ_{e∈E_T[u,a]} w[e] = 1. The class of WFSTs thereby defined is broad and, as we shall see, includes the families defining external, internal, and swap regret.

4.2 Transductive regret

Given any WFST T, let 𝒯 be a family of WFSTs with the same alphabet Σ, the same set of states Q, the same initial state I and final states F, but with different output labels and weights. Thus, we can write I_𝒯, F_𝒯, Q_𝒯, and 𝒯, without any ambiguity. We will also use the notation E_𝒯 when we refer to the transitions of a transducer within the family 𝒯 in a way that does not depend on the output labels or weights. We define the learner's transductive regret with respect to 𝒯 as follows:

$$\mathrm{Reg}_T(\mathcal{A}, \mathcal{T}) = \max_{\mathsf{T} \in \mathcal{T}} \left\{ \sum_{t=1}^{T} \mathop{\mathbb{E}}_{x_t \sim p_t}[l_t(x_t)] - \sum_{t=1}^{T} \mathop{\mathbb{E}}_{x_t \sim p_t}\Bigg[ \sum_{e \in E_\mathsf{T}[\mathsf{T}(I_\mathsf{T}, x_{1:t-1}), x_t]} w[e]\, l_t(\mathrm{olab}[e]) \Bigg] \right\}. \qquad (2)$$

This measures the maximum difference of the expected loss of the sequence x_1^T played by A and the expected loss of a competitor sequence, that is, a sequence image by T ∈ 𝒯 of x_1^T, where the expectation for competing sequences is both over the p_t's and the transition weights w[e] of T. We also assume that the family 𝒯 does not admit proper non-empty invariant subsets of labels out of any state, i.e., for any state u, there exists no proper subset E ⊊ E_𝒯[u] where the inclusion olab[E] ⊆ ilab[E] holds for all T ∈ 𝒯. This is not a strict requirement but will allow us to avoid cases of degenerate competitor classes.

As an example, consider the family of WFSTs T_a, a ∈ Σ, with a single state Q = I = F = {0} and with T_a defined by self-loop transitions with all input labels b ∈ Σ with the same output label a, and with uniform weights. Thus, T_a maps all labels to a.
Then, the notion of transductive regret with 𝒯 = {T_a : a ∈ Σ} coincides with that of external regret.

Similarly, consider the family of WFSTs T_φ, φ: Σ → Σ, with a single state Q = I = F = {0} and with T_φ defined by self-loop transitions with input label a ∈ Σ and output φ(a), all weights uniform. Thus, T_φ maps a symbol a to φ(a). Then, the notion of transductive regret with 𝒯 = {T_φ : φ ∈ Σ^Σ} coincides with that of swap regret (see Figure 2(ii)). The more general notion of k-gram conditional swap regret presented in Mohri and Yang [2014] can also be modeled as transductive regret with respect to a family of WFSTs (k-gram WFSTs). We present additional figures illustrating all of these examples in Appendix A.

In general, it may be desirable to design WFSTs intended for a specific task, so that an algorithm is robust against some sequence modifications more than others. In fact, such WFSTs may have been learned from past data. The definition of transductive regret is flexible and can accommodate such settings, both because a transducer can conveniently help model mappings and because the transition weights help distinguish alternatives. For instance, consider a scenario where each action naturally admits a different swapping subset, which may be only a small subset of all actions. As an example, an investor may only be expected to pick the best strategy from within a similar class of strategies. For example, instead of buying IBM, the investor could have bought Apple or Microsoft, and instead of buying gold, he could have bought silver or bronze. One can also imagine a setting where, along the sequences, some new alternatives are possible while others are excluded. Moreover, one may wish to assign different weights to some sequence modifications or penalize the investor for choosing strategies that are negatively correlated to recent choices.
The algorithms in this work are flexible enough to accommodate these environments, which can be straightforwardly modeled by a WFST. We give a simple example in Figure 2(iii) and give another illustration in Figure 5 in Appendix A, which can be easily generalized. Notice that, as we shall see later, in the case where the maximum out-degree of any state in the WFST (the size of the swapping subset) is bounded by a mild constant independent of the number of actions, our transductive regret bounds can be very favorable.

4.3 Algorithm

We now present an algorithm, FASTTRANSDUCE, seeking to minimize the transductive regret given a family 𝒯 of WFSTs.

Our algorithm is an extension of FASTSWAP. As in that algorithm, a meta-algorithm is used that assigns partial losses to external regret minimization slave algorithms and combines the distributions it receives from these algorithms via multiple reduced power method iterations. The meta-algorithm tracks the state reached in the WFST and maintains a set of external regret minimizing algorithms that help the learner perform well at every state. Thus, here, we need one external regret minimization algorithm A_{u,i} for each state u reached at time t after reading sequence x_{1:t−1} and each i ∈ Σ labeling an outgoing transition at u. The pseudocode of this algorithm is provided in Appendix B.

Let |E_𝒯|_in denote the sum of the number of transitions with distinct input labels at each state of 𝒯, that is, |E_𝒯|_in = Σ_{u∈Q_𝒯} |ilab[E_𝒯[u]]|. |E_𝒯|_in is upper bounded by the total number of transitions |E_𝒯|. Then, the following regret guarantee and computational complexity hold for FASTTRANSDUCE.

Theorem 2. Let (A_{u,i})_{u∈Q, i∈ilab[E_𝒯[u]]} be external regret minimizing algorithms admitting data-dependent regret bounds of the form O(√(L_T(A_{u,i}) log N)), where L_T(A_{u,i}) is the cumulative loss of A_{u,i} after T rounds.
Assume that, at each round, the sum of the minimal probabilities given to an expert by these algorithms is bounded below by some constant α > 0. Then, FASTTRANSDUCE achieves a transductive regret against 𝒯 that is in O(√(T |E_𝒯|_in log N)) with a per-iteration complexity in O(N² min{log T / log(1/(1−α)), N}).

The proof is given in Appendix E. The regret guarantee of FASTTRANSDUCE matches that of the swap regret algorithm of Blum and Mansour [2007] or FASTSWAP in the case where 𝒯 is chosen to be the family of swap transducers, and it matches the conditional k-gram swap regret of Mohri and Yang [2014] when 𝒯 is chosen to be that of the k-gram swap transducers. Additionally, its computational complexity is typically more favorable than that of algorithms previously presented in the literature when the assumption on α holds, and it is never worse.

Remarkably, the computational complexity of FASTTRANSDUCE is comparable to the cost of FASTSWAP, even though FASTTRANSDUCE is a regret minimization algorithm against an arbitrary family of finite-state transducers. This is because only the external regret minimizing algorithms that correspond to the current state need to be updated at each round.

5 Time-selection transductive regret

In this section, we extend the notion of time-selection functions with modification rules to the setting of transductive regret and present an algorithm that achieves the same regret guarantee as Khot and Ponnuswami [2008] in their specific setting, but with a substantially more favorable computational complexity.

Time-selection functions were first introduced in [Lehrer, 2003] as boolean functions that determine which subset of times is relevant in the calculation of regret. This concept was relaxed to the real-valued setting by Blum and Mansour [2007], who considered time-selection functions taking values in [0, 1].
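The effect of a real-valued time-selection function can be sketched in a few lines (a toy illustration with made-up numbers, not tied to any particular algorithm in the paper): I(t) ∈ [0, 1] rescales the instantaneous regret of round t, so rounds with I(t) = 0 are ignored and rounds with I(t) = 1 count fully.

```python
# Toy sketch of time-selection weighting: I(t) in [0, 1] determines how
# much round t contributes to the regret. All numbers are made up.
learner_losses    = [0.5, 0.9, 0.2, 0.8]
competitor_losses = [0.4, 0.1, 0.3, 0.2]
I = [1.0, 1.0, 0.0, 0.5]   # round 3 is irrelevant, round 4 half-counts

ts_regret = sum(i * (l - c)
                for i, l, c in zip(I, learner_losses, competitor_losses))
unweighted = sum(l - c
                 for l, c in zip(learner_losses, competitor_losses))
assert ts_regret != unweighted   # the selection changes the benchmark
```

With I ≡ 1 the time-selection regret recovers the ordinary (unweighted) regret, which is why the single-state, all-ones case below coincides with the earlier definitions.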
The authors introduced an algorithm which, for K modification rules and M time-selection functions, guarantees a regret in O(√(T N log(M K))) and admits a per-iteration complexity in O(max{N K M, N³}). For swap regret with time-selection functions, this corresponds to a regret bound of O(√(T N² log(M N))) and a per-iteration computational cost in O(N^{N+1} M). Khot and Ponnuswami [2008] improved upon this result and presented an algorithm with a regret bound in O(√(T log(M K))) and a per-iteration computational cost in O(max{M K, N³}), which is still prohibitively expensive for swap regret, since it is in O(N^N M).

Algorithm 2: FASTTIMESELECTTRANSDUCE; A_ℐ, (A_{I,u,i}) external regret algorithms.

    FASTTIMESELECTTRANSDUCE(ℐ, 𝒯, A_ℐ, (A_{I,u,i})_{I∈ℐ, u∈Q_𝒯, i∈ilab[E_𝒯[u]]})
    u ← I_𝒯
    for t ← 1 to T do
        q̃ ← QUERY(A_ℐ)
        for each I ∈ ℐ do
            for each i ∈ ilab[E_𝒯[u]] do
                q_{I,i} ← QUERY(A_{I,u,i})
            M_{t,u,I} ← [q_{I,1} 1_{1∈ilab[E_𝒯[u]]}; ...; q_{I,N} 1_{N∈ilab[E_𝒯[u]]}]
            Q_{t,u} ← Q_{t,u} + I(t) q̃_I M_{t,u,I};  Z_t ← Z_t + I(t) q̃_I
        Q_{t,u} ← Q_{t,u} / Z_t
        for each j ← 1 to N do
            c_j ← min_{i∈ilab[E_𝒯[u]]} Q_{t,u}[i, j] · 1_{j∈ilab[E_𝒯[u]]}
        α_t ← ‖c‖₁;  τ_t ← ⌈log(1/√t) / log(1 − α_t)⌉
        if τ_t < N then
            p_t⁰ ← c
            for τ ← 1 to τ_t do
                (p_t^{τ+1})ᵀ ← (p_t^τ)ᵀ (Q_{t,u} − 1cᵀ);  p_t ← p_t + p_t^τ
            p_t ← p_t / ‖p_t‖₁
        else
            p_tᵀ ← FIXED-POINT(Q_{t,u})
        x_t ← SAMPLE(p_t);  l_t ← RECEIVELOSS()
        for each I ∈ ℐ do
            l̃_t^I ← I(t) p_tᵀ M_{t,u,I} l_t − p_tᵀ l_t
            for each i ∈ ilab[E_𝒯[u]] do
                ATTRIBUTELOSS(A_{I,u,i}, p_t[i] I(t) l_t)
        ATTRIBUTELOSS(A_ℐ, l̃_t)
        u ← 𝒯[u, x_t]

We now formally define the scenario of online learning with time-selection transductive regret. Let ℐ ⊆ [0, 1]^ℕ be a family of time-selection functions. Each time-selection function I ∈ ℐ determines the importance of the instantaneous regret at each round. Then, the time-selection transductive regret is defined as:

$$\mathrm{Reg}_T(\mathcal{A}, \mathcal{I}, \mathcal{T}) = \max_{I \in \mathcal{I},\, \mathsf{T} \in \mathcal{T}} \left\{ \sum_{t=1}^{T} I(t) \mathop{\mathbb{E}}_{x_t \sim p_t}[l_t(x_t)] - \sum_{t=1}^{T} I(t) \mathop{\mathbb{E}}_{x_t \sim p_t}\Bigg[ \sum_{e \in E_\mathsf{T}[\mathsf{T}(I_\mathsf{T}, x_{1:t-1}), x_t]} w[e]\, l_t(\mathrm{olab}[e]) \Bigg] \right\}. \qquad (3)$$

When the family of transducers admits a single state, this definition coincides with the notion of time-selection regret studied by Blum and Mansour [2007] or Khot and Ponnuswami [2008]. Time-selection transductive regret is a more difficult benchmark than transductive regret because the learner must account for only a subset of the rounds being relevant, in addition to playing a strategy that is robust against a large set of possible transductions.

To handle this scenario, we propose the following strategy. We maintain an external regret minimizing algorithm A_ℐ over the set of time-selection functions. This algorithm will be responsible for ensuring that our strategy is competitive against the a posteriori optimal time-selection function. We also maintain |ℐ||Q|N other external regret minimizing algorithms, {A_{I,u,i}}_{I∈ℐ, u∈Q_𝒯, i∈ilab[E_𝒯[u]]}, which will ensure that our algorithm is robust against each of the modification rules and the potential transductions. We will then use a meta-algorithm to assign appropriate surrogate losses to each of these external regret minimizing algorithms and combine them to form a stochastic matrix. As in FASTTRANSDUCE, this meta-algorithm will also approximate the stationary distribution of the matrix and use that as the learner's strategy. We call this algorithm FASTTIMESELECTTRANSDUCE. Its pseudocode is given in Algorithm 2.

Theorem 3. Let (A_{I,u,i})_{I∈ℐ, u∈Q_𝒯, i∈ilab[E_𝒯[u]]} be external regret minimizing algorithms admitting data-dependent regret bounds of the form O(√(L_T(A_{I,u,i}) log N)), where L_T(A_{I,u,i}) is the cumulative loss of A_{I,u,i} after T rounds. Let A_ℐ be an external regret minimizing algorithm over ℐ that admits a regret in O(√(T log |ℐ|)) after T rounds.
Assume further that, at each round, the sum of the minimal probabilities given to an expert by these algorithms is bounded below by some constant α > 0. Then, FASTTIMESELECTTRANSDUCE achieves a time-selection transductive regret with respect to the time-selection family ℐ and WFST family 𝒯 that is in O(√(T (log |ℐ| + |E_𝒯|_in log N))) with a per-iteration complexity in O(N² (min{log T / log(1/(1−α)), N} + |ℐ|)).

In particular, Theorem 3 implies that FASTTIMESELECTTRANSDUCE achieves the same time-selection swap regret guarantee as the algorithm of Khot and Ponnuswami [2008] but with a per-round computational cost that is only in O(N² (min{log T / log(1/(1−α)), N} + |ℐ|)), as opposed to O(|ℐ| N^N), which is an exponential improvement. Notice that this significant improvement does not require any assumption (it holds even for α = 0).

6 Sleeping transductive regret

The standard setting of prediction with expert advice can be extended to the sleeping experts scenario studied by Freund et al. [1997], where, at each round, a subset of the experts are asleep and thus unavailable to the learner. The sleeping experts setting has been used to model problems appearing in text categorization [Cohen and Singer, 1999], calendar scheduling [Blum, 1997], or learning how to formulate search-engine queries [Cohen and Singer, 1996].

The standard benchmark in this setting is the sleeping regret, that is, the difference between the cumulative expected loss of the learner and the cumulative expected loss of the best static distribution over the experts, restricted to and normalized over the set of awake experts A_t ⊆ Σ at each round t:

$$\max_{u \in \Delta_N} \left\{ \sum_{t=1}^{T} \mathop{\mathbb{E}}_{x_t \sim p_t^{A_t}}[l_t(x_t)] - \sum_{t=1}^{T} \mathop{\mathbb{E}}_{x_t \sim u^{A_t}}[l_t(x_t)] \right\}. \qquad (4)$$

Here, for any distribution p, we use the notation p^A = p|_A / Σ_{i∈A} p_i, with p|_A(a) = p(a) 1_{a∈A}, for any a ∈ Σ and A ⊆ Σ.
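The restriction-and-renormalization operation p ↦ p^A just defined can be sketched directly (an illustrative helper with made-up numbers, not part of the paper's algorithms): zero out the asleep experts and renormalize over the awake set.

```python
# Sketch of the restriction p^A used in the sleeping-experts setting:
# zero out asleep experts and renormalize over the awake set A.
def restrict(p, awake):
    masked = {x: (w if x in awake else 0.0) for x, w in p.items()}
    z = sum(masked.values())
    if z == 0.0:
        raise ValueError("no awake expert has positive probability")
    return {x: w / z for x, w in masked.items()}

p = {"a": 0.5, "b": 0.3, "c": 0.2}
p_awake = restrict(p, awake={"a", "c"})
assert abs(sum(p_awake.values()) - 1.0) < 1e-12   # renormalized
assert p_awake["b"] == 0.0                        # asleep expert dropped
```

Note the corner case guarded above: p^A is undefined when p puts no mass on the awake set, which is why the definitions below only ever renormalize distributions with support intersecting A_t.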
An alternative definition of sleeping regret studied and bounded by Freund et al. [1997] is the following:

$$\max_{u \in \Delta_N} \left\{ \sum_{t=1}^{T} u(A_t) \mathop{\mathbb{E}}_{x_t \sim p_t^{A_t}}[l_t(x_t)] - \sum_{t=1}^{T} \mathop{\mathbb{E}}_{x_t \sim u}[1_{x_t \in A_t} l_t(x_t)] \right\}. \qquad (5)$$

This is also the definition we will be adopting in our analysis. Note that if u(A_t) does not vary with t, then the two definitions only differ by a multiplicative constant. By generalizing the results of Freund et al. [1997] to arbitrary losses, that is, beyond those that satisfy equation (6) in their paper, one can show that there exist algorithms with sleeping regret in O(√(Σ_{t=1}^T u*(A_t) E_{x_t∼p_t}[l_t(x_t)] log N)), where u* maximizes the expression to be bounded.

In this section, we extend this definition of sleeping regret to sleeping transductive regret, that is, the difference between the learner's cumulative expected loss and the cumulative expected loss of any transduction of the learner's actions among a family of finite-state transducers, where the weights of the transductions are normalized over the set of awake experts. The sleeping transductive regret can be expressed as follows:

$$\mathrm{Reg}_T(\mathcal{A}, \mathcal{T}, A_1^T) = \max_{\mathsf{T} \in \mathcal{T},\, u \in \Delta_N} \left\{ \sum_{t=1}^{T} u(A_t) \mathop{\mathbb{E}}_{x_t \sim p_t^{A_t}}[l_t(x_t)] - \sum_{t=1}^{T} \mathop{\mathbb{E}}_{x_t \sim p_t^{A_t}}\Bigg[ \sum_{e \in E_\mathsf{T}[\mathsf{T}(I_\mathsf{T}, x_{1:t-1}), x_t]} (u|_{A_t})_{\mathrm{olab}[e]}\, w[e]\, l_t(\mathrm{olab}[e]) \Bigg] \right\}. \qquad (6)$$

Figure 3: Maximum values of τ and minimum values of α in FASTSWAP experiments. The vertical bars represent the standard deviation across 16 instantiations of the same simulation.

When all experts are awake at every round, i.e., A_t = Σ, the sleeping transductive regret reduces to the standard transductive regret.
When the family of transducers corresponds to that of swap regret, we uncover a natural definition for sleeping swap regret:
$$\max_{\varphi \in \Phi_{\mathrm{swap}},\, u \in \Delta_N} \sum_{t=1}^{T} u(A_t) \mathop{\mathbb{E}}_{x_t \sim p_t^{A_t}}[l_t(x_t)] - \sum_{t=1}^{T} \mathop{\mathbb{E}}_{x_t \sim p_t^{A_t}}\big[ u_{\varphi(x_t)} 1_{\varphi(x_t) \in A_t} l_t(\varphi(x_t)) \big].$$
We now present an efficient algorithm for minimizing sleeping transductive regret, FASTSLEEPTRANSDUCE. Similar to FASTTRANSDUCE, this algorithm uses a meta-algorithm with multiple regret-minimizing sub-algorithms and a fixed-point approximation to compute the learner's strategy. However, since FASTSLEEPTRANSDUCE minimizes sleeping transductive regret, it uses sleeping regret minimizing sub-algorithms (i.e., those with regret guarantees of the form (5)). The meta-algorithm also designs a different stochastic matrix. The pseudocode of this algorithm is given in Appendix C.

Theorem 4. Assume that the sleeping regret minimizing algorithms used as inputs of FASTSLEEPTRANSDUCE achieve data-dependent regret bounds such that, if the algorithm selects the distributions $(p_t)_{t=1}^T$ with awake sets $(A_t)_{t=1}^T$ and observes losses $(l_t)_{t=1}^T$, then the regret of $\mathcal{A}_i^q$ is at most $O\big(\sqrt{\sum_{t=1}^{T} u^*(A_t)\, \mathbb{E}_{x_t \sim p_t}[l_t(x_t)] \log(N)}\big)$. Assume further that at each round, the sum of the minimal probabilities given to an expert by these algorithms is bounded below by some constant $\alpha > 0$. Then, the sleeping regret $\mathrm{Reg}_T(\mathrm{FASTSLEEPTRANSDUCE}, \mathcal{T}, A_1^T)$ of FASTSLEEPTRANSDUCE is upper bounded by $O\big(\sqrt{\sum_{t=1}^{T} u(A_t)\, |E_{\mathcal{T}}|_{\mathrm{in}} \log(N)}\big)$. Moreover, FASTSLEEPTRANSDUCE admits a per-iteration complexity in $O\big(N^2 \min\big\{\tfrac{\log T}{\log(1/(1-\alpha))}, N\big\}\big)$.

7 Experiments

In this section, we present some toy experiments illustrating the effectiveness of the Reduced Power Method for approximating the stationary distribution in FASTSWAP.

We considered $n$ base learners, where $n \in \{40, 80, 120, 160, 200\}$, each using the weighted-majority algorithm [Littlestone and Warmuth, 1994].
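As background for these experiments, the Reduced Power Method refines standard power iteration for computing the stationary distribution of a stochastic matrix, the fixed-point computation at the heart of swap-regret meta-algorithms. A minimal sketch of plain power iteration for a column-stochastic matrix follows (function name and tolerance are illustrative, not the paper's RPM):

```python
import numpy as np

def stationary_by_power_iteration(Q, tol=1e-10, max_iter=10_000):
    """Approximate the stationary distribution pi of a column-stochastic
    matrix Q (each column sums to 1), i.e. a vector with Q @ pi = pi,
    by repeated multiplication starting from the uniform distribution."""
    n = Q.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        nxt = Q @ pi
        if np.abs(nxt - pi).sum() <= tol:  # l1 distance between iterates
            return nxt
        pi = nxt
    return pi
```

Each iteration costs $O(N^2)$, so bounding the number of iterations needed per round (the quantity $\tau$ tracked below) controls the per-round cost.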
We generated losses as i.i.d. normal random variables with means in $(0.1, 0.9)$ (chosen randomly) and standard deviation equal to $0.1$. We capped the losses above and below to remain in $[0, 1]$. We ran FASTSWAP for 10,000 rounds in each simulation and repeated each simulation 16 times. The plot of the maximum $\tau$ for each simulation is shown in Figure 3. Across all simulations, the maximum $\tau$ attained was 4, so that at most 4 iterations of the RPM were needed on any given round to obtain a sufficient approximation. Thus, the per-iteration cost in these simulations was indeed in $\widetilde{O}(N^2)$, an improvement over the $O(N^3)$ cost in prior work.

8 Conclusion

We introduced the notion of transductive regret, further extended it to the time-selection and sleeping experts settings, and presented efficient online learning algorithms for all these settings with sublinear transductive regret guarantees. We both generalized the existing theory and gave more efficient algorithms in existing subcases. The algorithms and results in this paper can be further extended to the case of fully non-deterministic weighted finite-state transducers.

Acknowledgments

We thank Avrim Blum for informing us of an existing lower bound for swap regret proven by Auer [2017]. This work was partly funded by NSF CCF-1535987 and NSF IIS-1618662.

References

D. Adamskiy, W. M. Koolen, A. Chernov, and V. Vovk. A closer look at adaptive regret. In ALT, pages 290–304. Springer, 2012.

P. Auer. Personal communication, 2017.

A. Blum. Empirical support for Winnow and Weighted-Majority algorithms: Results on a calendar scheduling domain. Machine Learning, 26(1):5–23, 1997.

A. Blum and Y. Mansour. From external to internal regret. Journal of Machine Learning Research, 8:1307–1324, 2007.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

N. Cesa-Bianchi, P.
Gaillard, G. Lugosi, and G. Stoltz. Mirror descent meets fixed share (and feels no regret). In NIPS, pages 980–988, 2012.

W. W. Cohen and Y. Singer. Learning to query the web. In AAAI Workshop on Internet-Based Information Systems, 1996.

W. W. Cohen and Y. Singer. Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems, 17(2):141–173, 1999.

A. Daniely, A. Gonen, and S. Shalev-Shwartz. Strongly adaptive online learning. In Proceedings of ICML, pages 1405–1411, 2015.

D. P. Foster and R. V. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21(1-2):40–55, 1997.

Y. Freund, R. E. Schapire, Y. Singer, and M. K. Warmuth. Using and combining predictors that specialize. In STOC, pages 334–343. ACM, 1997.

A. Greenwald, Z. Li, and W. Schudy. More efficient internal-regret-minimizing algorithms. In COLT, pages 239–250, 2008.

S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.

E. Hazan and S. Kale. Computational equivalence of fixed points and no regret algorithms, and convergence to equilibria. In NIPS, pages 625–632, 2008.

E. Hazan and C. Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of ICML, pages 393–400. ACM, 2009.

M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178, 1998.

S. Khot and A. K. Ponnuswami. Minimizing wide range regret with time selection functions. In 21st Annual Conference on Learning Theory, COLT 2008, 2008.

W. M. Koolen and S. de Rooij. Universal codes from switching strategies. IEEE Transactions on Information Theory, 59(11):7168–7185, 2013.

E. Lehrer. A wide range no-regret theorem. Games and Economic Behavior, 42(1):101–115, 2003.

N. Littlestone and M. K. Warmuth.
The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

M. Mohri and S. Yang. Conditional swap regret and conditional correlated equilibrium. In NIPS, pages 1314–1322, 2014.

M. Mohri and S. Yang. Online learning with expert automata. ArXiv 1705.00132, 2017. URL http://arxiv.org/abs/1705.00132.

M. Mohri and S. Yang. Competing with automata-based expert sequences. In AISTATS, 2018.

C. Monteleoni and T. S. Jaakkola. Online learning of non-stationary sequences. In NIPS, 2003.

Y. Nesterov and A. Nemirovski. Finding the stationary states of Markov chains by iterative methods. Applied Mathematics and Computation, 255:58–65, 2015.

N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani. Algorithmic Game Theory, volume 1. Cambridge University Press, Cambridge, 2007.

M. Odalric and R. Munos. Adaptive bandits: Towards the best history-dependent strategy. In AISTATS, pages 570–578, 2011.

G. Stoltz and G. Lugosi. Internal regret in on-line portfolio selection. Machine Learning, 59(1):125–159, 2005.

V. Vovk. Derandomizing stochastic prediction strategies. Machine Learning, 35(3):247–282, 1999.