{"title": "The Pareto Regret Frontier", "book": "Advances in Neural Information Processing Systems", "page_first": 863, "page_last": 871, "abstract": "Performance guarantees for online learning algorithms typically take the form of regret bounds, which express that the cumulative loss overhead compared to the best expert in hindsight is small. In the common case of large but structured expert sets we typically wish to keep the regret especially small compared to simple experts, at the cost of modest additional overhead compared to more complex others. We study which such regret trade-offs can be achieved, and how. We analyse regret w.r.t. each individual expert as a multi-objective criterion in the simple but fundamental case of absolute loss. We characterise the achievable and Pareto optimal trade-offs, and the corresponding optimal strategies for each sample size both exactly for each finite horizon and asymptotically.", "full_text": "The Pareto Regret Frontier\n\nWouter M. Koolen\n\nQueensland University of Technology\nwouter.koolen@qut.edu.au\n\nAbstract\n\nPerformance guarantees for online learning algorithms typically take the form of\nregret bounds, which express that the cumulative loss overhead compared to the\nbest expert in hindsight is small. In the common case of large but structured expert\nsets we typically wish to keep the regret especially small compared to simple\nexperts, at the cost of modest additional overhead compared to more complex\nothers. We study which such regret trade-offs can be achieved, and how.\nWe analyse regret w.r.t. each individual expert as a multi-objective criterion in\nthe simple but fundamental case of absolute loss. We characterise the achievable\nand Pareto optimal trade-offs, and the corresponding optimal strategies for each\nsample size both exactly for each \ufb01nite horizon and asymptotically.\n\n1\n\nIntroduction\n\nOne of the central problems studied in online learning is prediction with expert advice. 
In this task a learner is given access to K strategies, customarily referred to as experts. He needs to make a sequence of T decisions with the objective of performing as well as the best expert in hindsight. This goal can be achieved with modest overhead, called regret. Typical algorithms, e.g. Hedge [1] with learning rate η = √((8 ln K)/T), guarantee

  L_T − L^k_T ≤ √((T/2) ln K)  for each expert k,   (1)

where L_T and L^k_T are the cumulative losses of the learner and expert k after all T rounds.

Here we take a closer look at that right-hand side. For it is not always desirable to have a uniform regret bound w.r.t. all experts. Instead, we may want to single out a few special experts and demand to be really close to them, at the cost of increased overhead compared to the rest. When the number of experts K is large or infinite, such favouritism even seems unavoidable for non-trivial regret bounds. The typical proof of the regret bound (1) suggests that the following can be guaranteed as well. For each choice of probability distribution q on experts, there is an algorithm that guarantees

  L_T − L^k_T ≤ √((T/2)(−ln q(k)))  for each expert k.   (2)

However, it is not immediately obvious how this can be achieved. For example, the Hedge learning rate η would need to be tuned differently for different experts. We are only aware of a single (complex) algorithm that achieves something along these lines [2]. On the flip side, it is also not obvious that this trade-off profile is optimal.

In this paper we study the Pareto (achievable and non-dominated) regret trade-offs. Let us say that a candidate trade-off ⟨r_1, ..., r_K⟩ ∈ R^K is T-realisable if there is an algorithm that guarantees

  L_T − L^k_T ≤ r_k  for each expert k.

Which trade-offs are realisable? Among them, which are optimal?
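Bound (1) is easy to sanity-check numerically. The following is a minimal sketch of our own (not the paper's algorithm): a textbook Hedge update run on synthetic i.i.d. loss vectors, an assumption made purely for illustration.

```python
import math, random

def hedge_regret(losses, eta):
    """Run Hedge with fixed learning rate eta on a sequence of loss vectors
    (entries in [0, 1]); return the learner's regret w.r.t. the best expert."""
    K = len(losses[0])
    cum = [0.0] * K                     # cumulative loss of each expert
    learner = 0.0
    for vec in losses:
        w = [math.exp(-eta * c) for c in cum]
        s = sum(w)
        learner += sum(wi / s * li for wi, li in zip(w, vec))  # dot loss
        cum = [c + l for c, l in zip(cum, vec)]
    return learner - min(cum)

random.seed(0)
T, K = 1000, 4
losses = [[random.random() for _ in range(K)] for _ in range(T)]
eta = math.sqrt(8 * math.log(K) / T)    # the tuning quoted above
assert hedge_regret(losses, eta) <= math.sqrt(T / 2 * math.log(K))
```

On this data the regret stays well below the bound; the guarantee itself of course holds for every loss sequence, not just random ones.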
And what is the strategy that witnesses these realisable trade-offs?

1.1 This paper

We resolve the preceding questions for the simplest case of absolute loss, where K = 2. We first obtain an exact characterisation of the set of realisable trade-offs. We then construct for each realisable profile a witnessing strategy. We also give a randomised procedure for optimal play that extends the randomised procedures for balanced regret profiles from [3] and later [4, 5].

We then focus on the relation between priors and regret bounds, to see if the particular form (2) is achievable, and if so, whether it is optimal. To this end, we characterise the asymptotic Pareto frontier as T → ∞. We find that the form (2) is indeed achievable but fundamentally sub-optimal. This is of philosophical interest as it hints that approaching absolute loss by essentially reducing it to information theory (including Bayesian and Minimum Description Length methods, relative entropy based optimisation (an instance of Mirror Descent), Defensive Forecasting etc.) is lossy.

Finally, we show that our solution for absolute loss equals that of K = 2 experts with bounded linear loss. We then show how to obtain the bound (1) for K ≥ 2 experts using a recursive combination of two-expert predictors. Counter-intuitively, this cannot be achieved with a balanced binary tree of predictors, but requires the most unbalanced tree possible. Recursive combination with non-uniform prior weights allows us to obtain (2) (with a higher constant) for any prior q.

1.2 Related work

Our work lies in the intersection of two lines of work, and uses ideas from both. On the one hand there are the game-theoretic (minimax) approaches to prediction with expert advice. In [6] Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth analysed the minimax strategy for absolute loss with a known time horizon T.
In [5] Cesa-Bianchi and Shamir used random walks to implement it efficiently for K = 2 experts or K ≥ 2 static experts. A similar analysis was given by Koolen in [4] with an application to tracking. In [7] Abernethy, Langford and Warmuth obtained the optimal strategy for absolute loss with experts that issue binary predictions, now controlling the game complexity by imposing a bound on the loss of the best expert. Then in [3] Abernethy, Warmuth and Yellin obtained the worst-case optimal algorithm for K ≥ 2 arbitrary experts. More general budgets were subsequently analysed by Abernethy and Warmuth in [8]. Connections between minimax values and algorithms were studied by Rakhlin, Shamir and Sridharan in [9].

On the other hand there are the approaches that do not treat all experts equally. Freund and Schapire obtain a non-uniform bound for Hedge in [1] using priors, although they leave the tuning problem open. The tuning problem was addressed by Hutter and Poland in [2] using two stages of Follow the Perturbed Leader. Even-Dar, Kearns, Mansour and Wortman characterise the achievable trade-offs when we desire especially small regret compared to a fixed average of the experts' losses in [10]. Their bounds were subsequently tightened by Kapralov and Panigrahy in [11]. An at least tangentially related problem is to ensure smaller regret when there are several good experts. This was achieved by Chaudhuri, Freund and Hsu in [12], and later refined by Chernov and Vovk in [13].

2 Setup

The absolute loss game is one of the core decision problems studied in online learning [14]. In it, the learner sequentially predicts T binary outcomes. Each round t ∈ {1, ..., T} the learner assigns a probability p_t ∈ [0, 1] to the next outcome being a 1, after which the actual outcome x_t ∈ {0, 1} is revealed, and the learner suffers absolute loss |p_t − x_t|.
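The round structure just described can be sketched in a few lines; a minimal simulation of our own (the helper name and example sequence are illustrative, not from the paper):

```python
def absolute_loss_game(strategy, outcomes):
    """Play the absolute loss game: strategy(t) returns p_t in [0, 1] and
    outcomes is the adversary's binary sequence. Returns the final regrets
    (R0, R1) w.r.t. the constant experts that always predict 0 and 1."""
    learner = expert0 = expert1 = 0.0
    for t, x in enumerate(outcomes):
        p = strategy(t)
        learner += abs(p - x)      # absolute loss of the learner
        expert0 += abs(0 - x)      # expert that always predicts 0
        expert1 += abs(1 - x)      # expert that always predicts 1
    return learner - expert0, learner - expert1

# The uniform prediction p_t = 1/2 keeps both regrets at most T/2.
r0, r1 = absolute_loss_game(lambda t: 0.5, [0, 1, 1, 0, 1])
```

On this outcome sequence the learner's loss is 2.5 against expert losses 3 and 2, so the regret pair is (-0.5, 0.5).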
Note that absolute loss equals expected 0/1 loss, that is, the probability of a mistake if a "hard" prediction in {0, 1} is sampled with bias p on 1.

Realising that the learner cannot avoid high cumulative loss without assumptions on the origin of the outcomes, the learner's objective is defined to ensure low cumulative loss compared to a fixed set of baseline strategies. Meeting this goal ensures that the easier the outcome sequence (i.e. for which some reference strategy has low loss), the lower the cumulative loss incurred by the learner.

[Figure 1: Exact regret trade-off profile. (a) The Pareto trade-off profiles for small T; the sets G_T consist of the points to the north-east of each curve. (b) Realisable trade-off profiles for T = 0, 1, 2, 3; the vertices on the profile for each horizon T are numbered 0, ..., T from left to right.]

The regret w.r.t. the strategy k ∈ {0, 1} that always predicts k is given by¹

  R^k_T := Σ_{t=1}^{T} (|p_t − x_t| − |k − x_t|).

Minimising regret, defined in this way, is a multi-objective optimisation problem. The classical approach is to "scalarise" it into the single objective R_T := max_k R^k_T, that is, to ensure small regret compared to the best expert in hindsight. In this paper we study the full Pareto trade-off curve.

Definition 1. A candidate trade-off ⟨r_0, r_1⟩ ∈ R² is called T-realisable for the T-round absolute loss game if there is a strategy that keeps the regret w.r.t. each k ∈ {0, 1} below r_k, i.e. if

  ∃p_1 ∀x_1 ··· ∃p_T ∀x_T : R⁰_T ≤ r_0 and R¹_T ≤ r_1

where p_t ∈ [0, 1] and x_t ∈ {0, 1} in each round t.
We denote the set of all T-realisable pairs by G_T.

This definition extends easily to other losses, many experts, fancy reference combinations of experts (e.g. shifts, drift, mixtures), protocols with side information etc. We consider some of these extensions in Section 5, but for now our goal is to keep it as simple as possible.

3 The exact regret trade-off profile

In this section we characterise the set G_T ⊂ R² of T-realisable trade-offs. We show that it is a convex polygon, that we subsequently characterise by its vertices and edges. We also exhibit the optimal strategy witnessing each Pareto optimal trade-off and discuss the connection with random walks. We first present some useful observations about G_T.

The linearity of the loss as a function of the prediction already renders G_T highly regular.

Lemma 2. The set G_T of T-realisable trade-offs is convex for each T.

Proof. Take r^A and r^B in G_T. We need to show that αr^A + (1 − α)r^B ∈ G_T for all α ∈ [0, 1]. Let A and B be strategies witnessing the T-realisability of these points. Now consider the strategy that in each round t plays the mixture αp^A_t + (1 − α)p^B_t. As the absolute loss is linear in the prediction, this strategy guarantees L_T = αL^A_T + (1 − α)L^B_T ≤ L^k_T + αr^A_k + (1 − α)r^B_k for each k ∈ {0, 1}.

Guarantees violated early cannot be restored later.

Lemma 3. A strategy that guarantees R^k_T ≤ r_k must maintain R^k_t ≤ r_k for all 0 ≤ t ≤ T.

¹One could define the regret R^k_T for all static reference probabilities k ∈ [0, 1], but as the loss is minimised by either k = 0 or k = 1, we immediately restrict to only comparing against these two.

Proof. Suppose toward contradiction that R^k_t > r_k at some t < T. An adversary may set all x_{t+1} ... x_T to k to fix L^k_T = L^k_t. As L_T ≥ L_t, we have R^k_T = L_T − L^k_T ≥ L_t − L^k_t = R^k_t > r_k.

The two extreme trade-offs ⟨0, T⟩ and ⟨T, 0⟩ are Pareto optimal.

Lemma 4. Fix horizon T and r_1 ∈ R. The candidate profile ⟨0, r_1⟩ is T-realisable iff r_1 ≥ T.

Proof. The static strategy p_t = 0 witnesses ⟨0, T⟩ ∈ G_T for every horizon T. To ensure R¹_T < T, any strategy will have to play p_t > 0 at some time t ≤ T. But then it cannot maintain R⁰_t = 0.

It is also intuitive that maintaining low regret becomes progressively harder with T.

Lemma 5. G_0 ⊃ G_1 ⊃ ...

Proof. Lemma 3 establishes ⊇, whereas Lemma 4 establishes ≠.

We now come to our first main result, the characterisation of G_T. We will directly characterise its south-west frontier, that is, the set of Pareto optimal trade-offs. These frontiers are graphed up to T = 10 in Figure 1a. The vertex numbering we introduce below is illustrated by Figure 1b.

Theorem 6. The Pareto frontier of G_T is the piece-wise linear curve through the T + 1 vertices

  ⟨f_T(i), f_T(T − i)⟩  for i ∈ {0, ..., T},  where  f_T(i) := Σ_{j=0}^{i} j 2^{j−T} C(T − j − 1, T − i − 1)

and C(n, k) denotes the binomial coefficient. Moreover, for T > 0 the optimal strategy at vertex i assigns to the outcome x = 1 the probability

  p_T(0) := 0,   p_T(T) := 1,   and   p_T(i) := (f_{T−1}(i) − f_{T−1}(i − 1))/2   for 0 < i < T,

and the optimal probability interpolates linearly in between consecutive vertices.

Proof. By induction on T. We first consider the base case T = 0.
By Definition 1

  G_0 = {⟨r_0, r_1⟩ | r_0 ≥ 0 and r_1 ≥ 0}

is the positive orthant, which has the origin as its single Pareto optimal vertex, and indeed ⟨f_0(0), f_0(0)⟩ = ⟨0, 0⟩. We now turn to T ≥ 1. Again by Definition 1, ⟨r_0, r_1⟩ ∈ G_T if

  ∃p ∈ [0, 1] ∀x ∈ {0, 1} : ⟨r_0 − |p − x| + |0 − x|, r_1 − |p − x| + |1 − x|⟩ ∈ G_{T−1},

that is if

  ∃p ∈ [0, 1] : ⟨r_0 − p, r_1 − p + 1⟩ ∈ G_{T−1} and ⟨r_0 + p, r_1 + p − 1⟩ ∈ G_{T−1}.

By the induction hypothesis we know that the south-west frontier curve for G_{T−1} is piecewise linear. We will characterise G_T via its frontier as well. For each r_0, let r_1(r_0) and p(r_0) denote the value and minimiser of the optimisation problem

  min_{p ∈ [0, 1]} {r_1 | both ⟨r_0, r_1⟩ ± ⟨p, p − 1⟩ ∈ G_{T−1}}.

We also refer to ⟨r_0, r_1(r_0)⟩ ± ⟨p(r_0), p(r_0) − 1⟩ as the rear(−) and front(+) contact points. For r_0 = 0 we find r_1(0) = T, with witness p(0) = 0 and rear/front contact points ⟨0, T + 1⟩ and ⟨0, T − 1⟩, and for r_0 = T we find r_1(T) = 0 with witness p(T) = 1 and rear/front contact points ⟨T − 1, 0⟩ and ⟨T + 1, 0⟩. It remains to consider the intermediate trajectory of r_1(r_0) as r_0 runs from 0 to T. Initially at r_0 = 0 the rear contact point lies on the edge of G_{T−1} entering vertex i = 0 of G_{T−1}, while the front contact point lies on the edge emanating from that same vertex. So if we increase r_0 slightly, the contact points will slide along their respective lines. By Lemma 11 (supplementary material), r_1(r_0) will trace along a straight line as a result. Once we increase r_0 enough, both the rear and front contact point will hit the vertex at the end of their edges simultaneously (a fortunate fact that greatly simplifies our analysis), as shown in Lemma 12 (supplementary material). The contact points then transition to tracing the next pair of edges of G_{T−1}. At this point r_0 the slope of r_1(r_0) changes, and we have discovered a vertex of G_T. Given that at each such transition ⟨r_0, r_1(r_0)⟩ is the midpoint between both contact points, this implies that all midpoints between successive vertices of G_{T−1} are vertices of G_T. And in addition, there are the two boundary vertices ⟨0, T⟩ and ⟨T, 0⟩.

[Figure 2: Pareto frontier of G, the asymptotically realisable trade-off rates: (a) normal scale, (b) log-log scale to highlight the tail behaviour. There is no noticeable difference with the normalised regret trade-off profile G_T/√T for T = 10000. We also graph the curve ⟨√(−ln q), √(−ln(1 − q))⟩ for all q ∈ [0, 1].]

3.1 The optimal strategy and random walks

In this section we describe how to follow the optimal strategy. First suppose we desire to witness a T-realisable trade-off that happens to be a vertex of G_T, say vertex i at ⟨f_T(i), f_T(T − i)⟩. With T rounds remaining and in state i, the strategy predicts with p_T(i). Then the outcome x ∈ {0, 1} is revealed. If x = 0, we need to witness in the remaining T − 1 rounds the trade-off ⟨f_T(i), f_T(T − i)⟩ − ⟨p_T(i), p_T(i) − 1⟩ = ⟨f_{T−1}(i − 1), f_{T−1}(T − i)⟩, which is vertex i − 1 of G_{T−1}. So the strategy transitions to state i − 1. Similarly upon x = 1 we update our internal state to i.
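The closed form of Theorem 6 and the midpoint structure underlying the state transitions above are easy to check numerically. A sketch of our own (the boundary value f_T(T) = T is hard-coded, since the displayed sum is used here only for the interior vertices):

```python
from math import comb

def f(T, i):
    """r0-coordinate of vertex i of the Pareto frontier of G_T (Theorem 6)."""
    if i == T:
        return float(T)   # boundary vertex <T, 0>
    return sum(j * 2 ** (j - T) * comb(T - j - 1, T - i - 1)
               for j in range(i + 1))

def p(T, i):
    """Optimal probability of outcome 1 at vertex i with T rounds remaining."""
    if i == 0:
        return 0.0
    if i == T:
        return 1.0
    return (f(T - 1, i) - f(T - 1, i - 1)) / 2

# Each interior vertex of G_T is the midpoint of two consecutive vertices of
# G_{T-1}, and the optimal predictions are proper probabilities.
for T in range(2, 8):
    for i in range(1, T):
        assert abs(f(T, i) - (f(T - 1, i - 1) + f(T - 1, i)) / 2) < 1e-12
        assert 0.0 <= p(T, i) <= 1.0
```

For instance f(2, 1) = 0.5, matching the T = 2 vertex ⟨0.5, 0.5⟩, and p(3, 2) = 0.75: with three rounds left and a tight budget w.r.t. expert 1, the strategy leans towards predicting 1.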
If the state ever either exceeds the number of rounds remaining or goes negative, we simply clamp it.

Second, if we desire to witness a T-realisable trade-off that is a convex combination of successive vertices, we simply follow the mixture strategy as constructed in Lemma 2. Third, if we desire to witness a sub-optimal element of G_T, we may follow any strategy that witnesses a Pareto optimal dominating trade-off.

The probability p issued by the algorithm is sometimes used to randomly sample a "hard prediction" from {0, 1}. The expression |p − x| then denotes the expected loss, which equals the probability of making a mistake. We present, following [3], a random-walk based method to sample a 1 with probability p_T(i). Our random walk starts in state ⟨T, i⟩. In each round it transitions from state ⟨T, i⟩ to either state ⟨T − 1, i⟩ or state ⟨T − 1, i − 1⟩ with equal probability. It is stopped when the state ⟨T, i⟩ becomes extreme in the sense that i ∈ {0, T}. Note that this process always terminates. Then the probability that this process is stopped with i = T equals p_T(i). In our case of absolute loss, evaluating p_T(i) and performing the random walk both take T units of time. The random walks considered in [3] for K ≥ 2 experts still take T steps, whereas direct evaluation of the optimal strategy scales rather badly with K.

4 The asymptotic regret rate trade-off profile

In the previous section we obtained for each time horizon T a combinatorial characterisation of the set G_T of T-realisable trade-offs. In this section we show that properly normalised Pareto frontiers for increasing T are better and better approximations of a certain intrinsic smooth limit curve. We obtain a formula for this curve, and use it to study the question of realisability for large T.

Definition 7.
Let us define the set G of asymptotically realisable regret rate trade-offs by

  G := lim_{T→∞} G_T/√T.

Despite the disappearance of the horizon T from the notation, the set G still captures the trade-offs that can be achieved with prior knowledge of T. Each achievable regret rate trade-off ⟨ρ_0, ρ_1⟩ ∈ G may be witnessed by a different strategy for each T. This is fine for our intended interpretation of √T·G as a proxy for G_T. We briefly mention horizon-free algorithms at the end of this section.

The literature [2] suggests that, for some constant c, ⟨√(−c ln q), √(−c ln(1 − q))⟩ should be asymptotically realisable for each q ∈ [0, 1]. We indeed confirm this below, and determine the optimal constant to be c = 1. We then discuss the philosophical implications of the quality of this bound.

We now come to our second main result, the characterisation of the asymptotically realisable trade-off rates. The Pareto frontier is graphed in Figure 2 both on normal axes for comparison to Figure 1a, and on a log-log scale to show its tails. Note the remarkable quality of approximation to G_T/√T.

Theorem 8. The Pareto frontier of the set G of asymptotically realisable trade-offs is the curve

  ⟨f(u), f(−u)⟩  for u ∈ R,  where  f(u) := u erf(√2 u) + e^{−2u²}/√(2π) + u

and erf(u) = (2/√π) ∫_0^u e^{−v²} dv is the error function. Moreover, the optimal strategy converges to

  p(u) = (1 − erf(√2 u))/2.

Proof.
We calculate the limit of normalised Pareto frontiers at vertex i = T/2 + u√T, and obtain (writing C(n, k) for the binomial coefficient)

  lim_{T→∞} f_T(T/2 + u√T)/√T
    = lim_{T→∞} (1/√T) Σ_{j=0}^{T/2+u√T} j 2^{j−T} C(T − j − 1, T/2 − u√T − 1)
    = lim_{T→∞} (1/√T) ∫_0^{T/2+u√T} j 2^{j−T} C(T − j − 1, T/2 − u√T − 1) dj
    = lim_{T→∞} ∫_{−√T/2}^{u} (u − v) 2^{(u−v)√T−T} C(T − (u − v)√T − 1, T/2 − u√T − 1) √T dv
    = ∫_{−∞}^{u} (u − v) lim_{T→∞} 2^{(u−v)√T−T} C(T − (u − v)√T − 1, T/2 − u√T − 1) √T dv
    = ∫_{−∞}^{u} (u − v) e^{−(u+v)²/2}/√(2π) dv
    = u erf(√2 u) + e^{−2u²}/√(2π) + u.

In the first step we replace the sum by an integral. We can do this as the summand is continuous in j, and the approximation error is multiplied by 2^{−T} and hence goes to 0 with T. In the second step we perform the variable substitution v = u − j/√T. We then exchange limit and integral, subsequently evaluate the limit, and in the final step we evaluate the integral.

To obtain the optimal strategy, we observe the following relation between the slope of the Pareto curve and the optimal strategy for each horizon T. Let g and h denote the Pareto curves at times T and T + 1 as a function of r_0.
The optimal strategy p for T + 1 at r_0 satisfies the system of equations

  h(r_0) + p − 1 = g(r_0 + p)
  h(r_0) − p + 1 = g(r_0 − p)

to which the solution satisfies

  1 − 1/p = (g(r_0 + p) − g(r_0 − p))/(2p) ≈ dg(r_0)/dr_0,  so that  p ≈ 1/(1 − dg(r_0)/dr_0).

Since slope is invariant under normalisation, this relation between slope and optimal strategy becomes exact as T tends to infinity, and we find

  p(u) = 1/(1 + df(r_0(u))/dr_0(u)) = 1/(1 + f′(u)/f′(−u)) = (1 − erf(√2 u))/2.

We believe this last argument is more insightful than a direct evaluation of the limit.

4.1 Square root of min log prior

Results for Hedge suggest (modulo a daunting tuning problem) that a trade-off featuring square root negative log prior akin to (2) should be realisable. We first show that this is indeed the case, we then determine the optimal leading constant, and we finally discuss its sub-optimality.

Theorem 9. The parametric curve ⟨√(−c ln q), √(−c ln(1 − q))⟩ for q ∈ [0, 1] is contained in G (i.e. asymptotically realisable) iff c ≥ 1.

Proof. By Theorem 8, the frontier of G is of the form ⟨f(u), f(−u)⟩. Our argument revolves around the tails (extreme u) of G. For large u ≫ 0, we find that f(u) ≈ 2u. For small u ≪ 0, we find that f(u) ≈ e^{−2u²}/(4√(2π)u²). This is obtained by a 3rd order Taylor series expansion around u = −∞. We need to go to 3rd order since all prior orders evaluate to 0. The additive approximation error is of order e^{−2u²}u^{−4}, which is negligible. So for large r_0 ≫ 0, the least realisable r_1 is approximately

  r_1 ≈ e^{−r_0²/2 − 2 ln r_0}/√(2π).   (3)

With the candidate relation r_0 = √(−c ln q) and r_1 = √(−c ln(1 − q)), still for large r_0 ≫ 0 so that q is small and −ln(1 − q) ≈ q, we would instead find least realisable r_1 approximately equal to

  r_1 ≈ √c · e^{−r_0²/(2c)}.   (4)

The candidate tail (4) must be at least the actual tail (3) for all large r_0. The minimal c for which this holds is c = 1. The graphs of Figure 2 illustrate this tail behaviour for c = 1, and at the same time verify that there are no violations for moderate u.

Even though the sqrt-min-log-prior trade-off is realisable, we see that its tail (4) exceeds the actual tail (3) by the factor r_0²√(2π), which gets progressively worse with the extremity of the tail r_0. Figure 2a shows that its behaviour for moderate ⟨r_0, r_1⟩ is also not brilliant. For example it gives us a symmetric bound of √(ln 2) ≈ 0.833, whereas f(0) = 1/√(2π) ≈ 0.399 is optimal.

For certain log loss games, each Pareto regret trade-off is witnessed uniquely by the Bayesian mixture of expert predictions w.r.t. a certain non-uniform prior and vice versa (not shown). In this sense the Bayesian method is the ideal answer to data compression/investment/gambling. Be that as it may, we conclude that the world of absolute loss is not information theory: simply putting a prior is not the definitive answer to non-uniform guarantees. It is a useful intuition that leads to the convenient sqrt-min-log-prior bounds. We hope that our results contribute to obtaining tighter bounds that remain manageable.

4.2 The asymptotic algorithm

The previous theorem immediately suggests an approximate algorithm for finite horizon T.
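The closed form of Theorem 8 and the numbers quoted above are easy to verify numerically; a sketch of our own, using Python's math.erf (the tolerance thresholds are our choice):

```python
from math import erf, exp, sqrt, pi, log

def f(u):
    """Asymptotic Pareto frontier of Theorem 8: the curve <f(u), f(-u)>."""
    return u * erf(sqrt(2) * u) + exp(-2 * u ** 2) / sqrt(2 * pi) + u

# Symmetric point: f(0) = 1/sqrt(2*pi) ~ 0.399, whereas the sqrt-min-log-prior
# curve at q = 1/2 only offers sqrt(ln 2) ~ 0.833 in each coordinate.
assert abs(f(0) - 1 / sqrt(2 * pi)) < 1e-12
assert sqrt(log(2)) > f(0)

# Tails: f(u) ~ 2u for large u, while f(-u) vanishes.
assert abs(f(3) - 6) < 1e-3 and f(-3) < 1e-3
```

The assertions reproduce the comparison made in the text: the prior-based curve is realisable but sits strictly outside the true frontier, already by a factor of two at its symmetric point.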
To\nT(cid:104)f (u), f (\u2212u)(cid:105) is closest to it.\napproximately witness (cid:104)r0, r1(cid:105), \ufb01nd the value of u for which\nThen play p(u). This will not guarantee (cid:104)r0, r1(cid:105) exactly, but intuitively it will be close. We leave\nanalysing this idea to the journal version. Conversely, by taking the limit of the game protocol,\nwhich involves the absolute loss function, we might obtain an interesting protocol and \u201casymptotic\u201d\nloss function2, for which u is the natural state, p(u) is the optimal strategy, and u is updated in\na certain way. Investigating such questions will probably lead to interesting insights, for example\nT \u2264 \u03c1k for all T simultaneously. Again this will be\nhorizon-free strategies that maintain Rk\npursued for the journal version.\n\n\u221a\n\nT /\n\n2 We have seen an instance of this before. When the Hedge algorithm with learning rate \u03b7 plays weights w\nand faces loss vector (cid:96), its dot loss is given by wT (cid:96). Now consider the same loss vector handed out in identical\npieces (cid:96)/n over the course of n trials, during which the weights w update as usual. In the limit of n \u2192 \u221e, the\nresulting loss becomes the mix loss \u2212 1\n\n\u03b7 ln(cid:80)\n\nk w(k)e\u2212\u03b7(cid:96)k.\n\n7\n\n\f5 Extension\n\n5.1 Beyond absolute loss\n\nT(cid:88)\n\nt \u2212 T(cid:88)\n\nIn this section we consider the general setting with K = 2 expert, that we still refer to as 0 and\n1. Here the learner plays p \u2208 [0, 1] which is now interpreted as the weight allocated to expert 1,\nthe adversary chooses a loss vector (cid:96) = (cid:104)(cid:96)0, (cid:96)1(cid:105) \u2208 [0, 1]2, and the learner incurs dot loss given by\n(1 \u2212 p)(cid:96)0 + p(cid:96)1. The regrets are now rede\ufb01ned as follows\n\nRk\n\nT :=\n\nt + (1 \u2212 pt)(cid:96)0\n\npt(cid:96)1\n\n(cid:96)k\nt\n\nfor each expert k \u2208 {0, 1}.\n\nTheorem 10. 
The T -realisable trade-offs for absolute loss and K = 2 expert dot loss coincide.\n\nt=1\n\nt=1\n\nProof. By induction on T . The loss is irrelevant in the base case T = 0. For T > 0, a trade-off\n(cid:104)r0, r1(cid:105) is T -realisable for dot loss if\n\n\u2203p \u2208 [0, 1]\u2200(cid:96) \u2208 [0, 1]2 : (cid:104)r0 + p(cid:96)1 + (1 \u2212 p)(cid:96)0 \u2212 (cid:96)0, r1 + p(cid:96)1 + (1 \u2212 p)(cid:96)0 \u2212 (cid:96)1(cid:105) \u2208 GT\u22121\n\nthat is if\nWe recover the absolute loss case by restricting \u03b4 to {\u22121, 1}. These requirements are equivalent\nsince GT is convex by Lemma 2.\n\n\u2203p \u2208 [0, 1]\u2200\u03b4 \u2208 [\u22121, 1] : (cid:104)r0 \u2212 p\u03b4, r1 + (1 \u2212 p)\u03b4(cid:105) \u2208 GT\u22121 .\n\n5.2 More than 2 experts\n\nIn the general experts problem we compete with K instead of 2 experts. We now argue that an al-\ngorithm guaranteeing Rk\ncT ln K w.r.t. each expert k can be obtained. The intuitive approach,\ncombining the K experts in a balanced binary tree of two-expert predictors, does not achieve this\n\ngoal: each internal node contributes the optimal symmetric regret of(cid:112)T /(2\u03c0). This accumulates\n\nT \u2264 \u221a\n\n\u221a\n\nto Rk\n\nT \u2264 ln K\n\ncT , where the log sits outside the square root.\n\nCounter-intuitively, the maximally unbalanced binary tree does result in a\nln K factor when the\ninternal nodes are properly skewed. At each level we combine K experts one-vs-all, permitting large\nregret w.r.t. the \ufb01rst expert but tiny regret w.r.t. the recursive combination of the remaining K \u2212 1\nexperts. The argument can be found in Appendix A.1. The same argument shows that, for any prior\nq on k = 1, 2, . . ., combining the expert with the smallest prior with the recursive combination of\n\nthe rest guarantees regret(cid:112)\u2212cT ln q(k) w.r.t. 
each expert k.\n\n\u221a\n\n6 Conclusion\n\nT(cid:104)(cid:112)\u2212 ln(q),(cid:112)\u2212 ln(1 \u2212 q)(cid:105) trade-off is achievable for any prior probability q \u2208 [0, 1], but that it\n\nWe studied asymmetric regret guarantees for the fundamental online learning setting of the absolute\nloss game. We obtained exactly the achievable skewed regret guarantees, and the corresponding\n\u221a\noptimal algorithm. We then studied the pro\ufb01le in the limit of large T . We conclude that the expected\nis not tight. We then showed how our results transfer from absolute loss to general linear losses, and\nto more than two experts.\nMajor next steps are to determine the optimal trade-offs for K > 2 experts, to replace our traditional\n\u221a\nD\u221e [18], \u2206T [19]\n\u221a\n\n\u221a\n[17],\nT \u2264 \u03c1k\netc. and to \ufb01nd the Pareto frontier for horizon-free strategies maintaining Rk\n\n[16],(cid:112)Varmax\n\nT budget by modern variants\n\n(cid:113) Lk\n\nT at any T .\n\nT [15],\nLk\n\nT (T\u2212Lk\nT )\n\nT\n\n(cid:113)\n\nT\n\nAcknowledgements\n\nThis work bene\ufb01ted substantially from discussions with Peter Gr\u00a8unwald.\n\n8\n\n\fReferences\n[1] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning\nand an application to boosting. Journal of Computer and System Sciences, 55:119\u2013139, 1997.\n[2] Marcus Hutter and Jan Poland. Adaptive online prediction by following the perturbed leader.\n\nJournal of Machine Learning Research, 6:639\u2013660, 2005.\n\n[3] Jacob Abernethy, Manfred K. Warmuth, and Joel Yellin. When random play is optimal against\nan adversary. In Rocco A. Servedio and Tong Zhang, editors, COLT, pages 437\u2013446. Omni-\npress, 2008.\n\n[4] Wouter M. Koolen. Combining Strategies Ef\ufb01ciently: High-quality Decisions from Con\ufb02icting\nAdvice. 
PhD thesis, Institute of Logic, Language and Computation (ILLC), University of Amsterdam, January 2011.

[5] Nicolò Cesa-Bianchi and Ohad Shamir. Efficient online learning via randomized rounding. In J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 343-351, 2011.

[6] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427-485, 1997.

[7] Jacob Abernethy, John Langford, and Manfred K. Warmuth. Continuous experts and the Binning algorithm. In Learning Theory, pages 544-558. Springer, 2006.

[8] Jacob Abernethy and Manfred K. Warmuth. Repeated games against budgeted adversaries. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1-9, 2010.

[9] Sasha Rakhlin, Ohad Shamir, and Karthik Sridharan. Relax and randomize: From value to algorithms. In P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2150-2158, 2012.

[10] Eyal Even-Dar, Michael Kearns, Yishay Mansour, and Jennifer Wortman. Regret to the best vs. regret to the average. Machine Learning, 72(1-2):21-37, 2008.

[11] Michael Kapralov and Rina Panigrahy. Prediction strategies without loss. In J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 828-836, 2011.

[12] Kamalika Chaudhuri, Yoav Freund, and Daniel Hsu. A parameter-free hedging algorithm. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 297-305, 2009.

[13] Alexey V. Chernov and Vladimir Vovk. Prediction with advice of unknown number of experts. In Peter Grünwald and Peter Spirtes, editors, UAI, pages 117-125. AUAI Press, 2010.

[14] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[15] Peter Auer, Nicolò Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48-75, 2002.

[16] Nicolò Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321-352, 2007.

[17] Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine Learning, 80(2-3):165-188, 2010.

[18] Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In Proceedings of the 25th Annual Conference on Learning Theory, number 23 in JMLR W&CP, pages 6.1-6.20, June 2012.

[19] Steven de Rooij, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen. Follow the leader if you can, Hedge if you must. ArXiv, 1301.0534, January 2013.
", "award": [], "sourceid": 477, "authors": [{"given_name": "Wouter", "family_name": "Koolen", "institution": "Queensland University of Technology"}]}