{"title": "Constant Regret, Generalized Mixability, and Mirror Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 7419, "page_last": 7428, "abstract": "We consider the setting of prediction with expert advice; a learner makes predictions by aggregating those of a group of experts. Under this setting, and for the right choice of loss function and ``mixing'' algorithm, it is possible for the learner to achieve a constant regret regardless of the number of prediction rounds. For example, a constant regret can be achieved for \\emph{mixable} losses using the \\emph{aggregating algorithm}. The \\emph{Generalized Aggregating Algorithm} (GAA) is a name for a family of algorithms parameterized by convex functions on simplices (entropies), which reduce to the aggregating algorithm when using the \\emph{Shannon entropy} $\\operatorname{S}$. For a given entropy $\\Phi$, losses for which a constant regret is possible using the \\textsc{GAA} are called $\\Phi$-mixable. Which losses are $\\Phi$-mixable was previously left as an open question. We fully characterize $\\Phi$-mixability and answer other open questions posed by \\cite{Reid2015}. We show that the Shannon entropy $\\operatorname{S}$ is fundamental in nature when it comes to mixability; any $\\Phi$-mixable loss is necessarily $\\operatorname{S}$-mixable, and the lowest worst-case regret of the \\textsc{GAA} is achieved using the Shannon entropy. Finally, by leveraging the connection between the \\emph{mirror descent algorithm} and the update step of the GAA, we suggest a new \\emph{adaptive} generalized aggregating algorithm and analyze its performance in terms of the regret bound.", "full_text": "Constant Regret, Generalized Mixability, and Mirror\n\nDescent\n\nZakaria Mhammedi\n\nResearch School of Computer Science\n\nAustralian National University and DATA61\n\nzak.mhammedi@anu.edu.au\n\nRobert C. 
Williamson\n\nResearch School of Computer Science\n\nAustralian National University and DATA61\n\nbob.williamson@anu.edu.au\n\nAbstract\n\nWe consider the setting of prediction with expert advice; a learner makes predictions\nby aggregating those of a group of experts. Under this setting, and for the right\nchoice of loss function and \u201cmixing\u201d algorithm, it is possible for the learner to\nachieve a constant regret regardless of the number of prediction rounds. For\nexample, a constant regret can be achieved for mixable losses using the aggregating\nalgorithm. The Generalized Aggregating Algorithm (GAA) is a name for a family\nof algorithms parameterized by convex functions on simplices (entropies), which\nreduce to the aggregating algorithm when using the Shannon entropy S. For a given\nentropy \u03a6, losses for which a constant regret is possible using the GAA are called\n\u03a6-mixable. Which losses are \u03a6-mixable was previously left as an open question.\nWe fully characterize \u03a6-mixability and answer other open questions posed by [6].\nWe show that the Shannon entropy S is fundamental in nature when it comes to\nmixability; any \u03a6-mixable loss is necessarily S-mixable, and the lowest worst-case\nregret of the GAA is achieved using the Shannon entropy. Finally, by leveraging\nthe connection between the mirror descent algorithm and the update step of the\nGAA, we suggest a new adaptive generalized aggregating algorithm and analyze\nits performance in terms of the regret bound.\n\n1\n\nIntroduction\n\nTwo fundamental problems in learning are how to aggregate information and under what circum-\nstances can one learn fast. 
In this paper, we consider the problems jointly, extending the understanding\nand characterization of exponential mixing due to [10], who showed that not only does the \u201caggregat-\ning algorithm\u201d learn quickly when the loss is suitably chosen, but that it is in fact a generalization of\nclassical Bayesian updating, to which it reduces when the loss is log-loss [12]. We consider a general\nclass of aggregating schemes, going beyond Vovk\u2019s exponential mixing, and provide a complete\ncharacterization of the mixing behavior for general losses and general mixing schemes parameterized\nby an arbitrary entropy function.\nIn the game of prediction with expert advice a learner predicts the outcome of a random variable\n(outcome of the environment) by aggregating the predictions of a pool of experts. At the end of\neach prediction round, the outcome of the environment is announced and the learner and experts\nsuffer losses based on their predictions. We are interested in algorithms that the learner can use to\n\u201caggregate\u201d the experts\u2019 predictions and minimize the regret at the end of the game. In this case,\nthe regret is de\ufb01ned as the difference between the cumulative loss of the learner and that of the best\nexpert in hindsight after T rounds.\nThe Aggregating Algorithm (AA) [10] achieves a constant regret \u2014 a precise notion of fast learning\n\u2014 for mixable losses; that is, the regret is bounded from above by a constant R(cid:96) which depends only\non the loss function (cid:96) and not on the number of rounds T . It is worth mentioning that mixability\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fis a weaker condition than exp-concavity, and contrary to the latter, mixability is an intrinsic,\nparametrization-independent notion [4].\nReid et al. [6] introduced the Generalized Aggregating Algorithm (GAA), going beyond the AA. 
The\nGAA is parameterized by the choice of a convex function \u03a6 on the simplex (entropy) and reduces to\nthe AA when \u03a6 is the Shannon entropy. The GAA can achieve a constant regret for losses satisfying\na certain condition called \u03a6-mixability (characterizing when losses are \u03a6-mixable was left as an open\nproblem). This regret depends jointly on the generalized mixability constant \u03b7\u03a6\n(cid:96) \u2014 essentially the\n\u03b7 \u03a6)-mixable \u2014 and the divergence D\u03a6(e\u03b8, q), where q \u2208 \u2206k is a prior\nlargest \u03b7 such that (cid:96) is ( 1\ndistribution over k experts and e\u03b8 is the \u03b8th standard basis element of Rk [6]. At each prediction\nround, the GAA can be divided into two steps; a substitution step where the learner picks a prediction\nfrom a set speci\ufb01ed by the \u03a6-mixability condition; and an update step where a new distribution q\nover experts is computed depending on their performance. Interestingly, this update step is exactly\nthe mirror descent algorithm [8, 5] which minimizes the weighted loss of experts.\n\nContributions. We introduce the notion of a support loss; given a loss (cid:96) de\ufb01ned on any action\nspace, there exists a proper loss (cid:96) which shares the same Bayes risk as (cid:96). When a loss is mixable,\none can essentially work with a proper (support) loss instead \u2014 this will be the \ufb01rst stepping stone\ntowards a characterization of (generalized) mixability.\nThe notion of \u03a6-mixable and the GAA were previously restricted to \ufb01nite losses. We extend these to\nallow for the use of losses which can take in\ufb01nite values (such as the log-loss), and we show in this\ncase that under the \u03a6-mixability condition a constant regret is achievable using the GAA.\nFor an entropy \u03a6 and a loss (cid:96), we derive a necessary and suf\ufb01cient condition (Theorems 13 and\n14) for (cid:96) to be \u03a6-mixable. 
In particular, if ℓ and Φ satisfy some regularity conditions, then ℓ is Φ-mixable if and only if η_ℓ Φ − S is convex on the simplex, where S is the Shannon entropy and η_ℓ is essentially the largest η such that ℓ is η-mixable [10, 9]. This implies that a loss ℓ is Φ-mixable only if it is η-mixable for some η > 0. This, combined with the fact that η-mixability is equivalent to (1/η S)-mixability (Theorem 11), reflects one fundamental aspect of the Shannon entropy.
Then, we derive an explicit expression for the generalized mixability constant η^Φ_ℓ (Corollary 17), and thus for the regret bound of the GAA. This allows us to compare the regret bound R^Φ_ℓ of any entropy Φ with that of the Shannon entropy S. In this case, we show (Theorem 18) that R^S_ℓ ≤ R^Φ_ℓ; that is, the GAA achieves the lowest worst-case regret when using the Shannon entropy — another result which reflects the fundamental nature of the Shannon entropy.
Finally, by leveraging the connection between the GAA and the mirror descent algorithm, we present a new algorithm — the Adaptive Generalized Aggregating Algorithm (AGAA). This algorithm consists of changing the entropy function at each prediction round, similar to the adaptive mirror descent algorithm [8]. We analyze the performance of this algorithm in terms of its regret bound.

Layout. In §2, we give some background on loss functions and present new results (Theorems 4 and 5) based on the new notion of a proper support loss; we show that, as far as mixability is concerned, one can always work with a proper (support) loss instead of the original loss (which can be defined on an arbitrary action space). 
In §3, we introduce the notions of classical and generalized mixability and derive a characterization of Φ-mixability (Theorems 13 and 14). We then introduce our new algorithm — the AGAA — and analyze its performance. We conclude the paper with a general discussion and directions for future work. All proofs, except for that of Theorem 16, are deferred to Appendix C.

Notation. Let m ∈ N. We denote [m] := {1, . . . , m} and m̃ := m − 1. We write ⟨·, ·⟩ for the standard inner product in Euclidean space. Let Δm := {p ∈ [0, +∞[^m : ⟨p, 1_m⟩ = 1} be the probability simplex in R^m, and let Δ̃m := {p̃ ∈ [0, +∞[^m̃ : ⟨p̃, 1_m̃⟩ ≤ 1}. We will extensively make use of the affine map ⊓_m : R^m̃ → R^m defined by

⊓_m(u) := [u_1, . . . , u_m̃, 1 − ⟨u, 1_m̃⟩]^T.   (1)

We denote by int C, ri C, and rbd C the interior, relative interior, and relative boundary of a set C ⊆ R^m, respectively [2]. The sub-differential of a function f : R^m → R ∪ {+∞} at u ∈ R^m such that f(u) < +∞ is defined by ([2])

∂f(u) := {s* ∈ R^m : f(v) ≥ f(u) + ⟨s*, v − u⟩, ∀v ∈ R^m}.   (2)

Table 1 on page 9 provides a list of the main symbols used in this paper.

2 Loss Functions

In general, a loss function is a map ℓ : X × A → [0, +∞], where X is an outcome set and A is an action set. In this paper, we only consider the case X = [n], i.e. a finite outcome space. Overloading notation slightly, we define the mapping ℓ : A → [0, +∞]^n by [ℓ(a)]_x = ℓ(x, a), ∀x ∈ [n], and denote ℓ_x(·) := [ℓ(·)]_x. We further extend the new definition of ℓ to the set ∪_{k≥1} A^k such that for x ∈ [n] and A := [a_θ]^T_{1≤θ≤k} ∈ A^k, ℓ_x(A) := [ℓ_x(a_θ)]^T_{1≤θ≤k} ∈ [0, +∞]^k. We define the effective domain of ℓ by dom ℓ := {a ∈ A : ℓ(a) ∈ [0, +∞[^n}, and the loss surface by 𝒮_ℓ := {ℓ(a) : a ∈ dom ℓ}. We say that ℓ is closed if 𝒮_ℓ is closed in R^n. The superprediction set of ℓ is defined by 𝒮_ℓ^∞ := {ℓ(a) + d : (a, d) ∈ A × [0, +∞[^n}. Let S̄_ℓ := 𝒮_ℓ^∞ ∩ [0, +∞[^n be its finite part.
Let a_0, a_1 ∈ A. The prediction a_0 is said to be better than a_1 if the component-wise inequality ℓ(a_0) ≤ ℓ(a_1) holds and there exists some x ∈ [n] such that ℓ_x(a_0) < ℓ_x(a_1) [14]. A loss ℓ is admissible if for any a ∈ A there are no better predictions.
For the rest of this paper (except for Theorem 4), we make the following assumption on losses;
Assumption 1. ℓ is a closed, admissible loss such that dom ℓ ≠ ∅.

It is clear that there is no loss of generality in considering only admissible losses. The condition that ℓ is closed is a weaker version of the more common assumption that A is compact and that a ↦ ℓ(x, a) is continuous with respect to the extended topology of [0, +∞] for all x ∈ [n] [3, 1]. In fact, we do not make any explicit topological assumptions on the set A (A is allowed to be open in our case). Our condition simply says that if a sequence of points on the loss surface converges in [0, +∞[^n, then there exists an action in A whose image through the loss is equal to the limit. 
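The vector-map view of a loss is straightforward to operationalize. The following sketch (ours, not from the paper) represents the binary squared loss as a map into [0, +∞[^2 and checks admissibility on a grid: for no pair of distinct actions should one be "better" than the other in the component-wise sense defined above.

```python
def sq_loss(p):
    """Binary squared loss as a vector map l : [0,1] -> [0,+inf[^2.
    Component x is the loss suffered when outcome x in {0,1} occurs."""
    return (p ** 2, (1.0 - p) ** 2)  # (l_0(p), l_1(p))

def better(a0, a1):
    """a0 is 'better' than a1: component-wise <=, with at least one strict inequality."""
    l0, l1 = sq_loss(a0), sq_loss(a1)
    return all(u <= v for u, v in zip(l0, l1)) and any(u < v for u, v in zip(l0, l1))

grid = [i / 100.0 for i in range(101)]
# Admissibility check on the grid: lowering the loss on one outcome must
# raise it on the other, so no action dominates another.
admissible = not any(better(a0, a1) for a0 in grid for a1 in grid if a0 != a1)
```

Since l_0 is increasing and l_1 is decreasing in p, every pair of distinct actions trades off the two components, so the check succeeds.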
For example, the 0-1 loss ℓ_{0-1} is closed, yet the map p ↦ ℓ_{0-1}(x, p) is not continuous on Δ2, for x ∈ {0, 1}.
In this paragraph, let A be the n-simplex, i.e. A = Δn. We define the conditional risk L_ℓ : Δn × Δn → R by L_ℓ(p, q) = E_{x∼p}[ℓ_x(q)] = ⟨p, ℓ(q)⟩, and the Bayes risk by L_ℓ(p) := inf_{q∈Δn} L_ℓ(p, q). In this case, the loss ℓ is proper if L_ℓ(p) = ⟨p, ℓ(p)⟩ ≤ ⟨p, ℓ(q)⟩ for all p ≠ q in Δn (and strictly proper if the inequality is strict). For example, the log-loss ℓ_log : Δn → [0, +∞]^n is defined by ℓ_log(p) = −log p, where the 'log' of a vector applies component-wise. One can easily check that ℓ_log is strictly proper. We denote by L_log its Bayes risk.
The above definition of the Bayes risk is restricted to losses defined on the simplex. For a general loss ℓ : A → [0, +∞]^n, we use the following definition;
Definition 2 (Bayes Risk). Let ℓ : A → [0, +∞]^n be a loss such that dom ℓ ≠ ∅. The Bayes risk L_ℓ : R^n → R ∪ {−∞} is defined by

∀u ∈ R^n, L_ℓ(u) := inf_{z ∈ S̄_ℓ} ⟨u, z⟩.   (3)

The support function of a set C ⊆ R^n is defined by σ_C(u) := sup_{z∈C} ⟨u, z⟩, u ∈ R^n, and thus it is easy to see that one can express the Bayes risk as L_ℓ(u) = −σ_{S̄_ℓ}(−u). Our definition of the Bayes risk is slightly different from previous ones ([3, 9, 1]) in two ways; 1) the Bayes risk is defined on all of R^n instead of [0, +∞[^n; and 2) the infimum is taken over the finite part S̄_ℓ of the superprediction set 𝒮_ℓ^∞. The first point is a mere mathematical convenience and makes no practical difference since L_ℓ(p) = −∞ for all p ∉ [0, +∞[^n. For the second point, swapping S̄_ℓ for 𝒮_ℓ^∞ in (3) does not change the value of L_ℓ for mixable losses (see Appendix D). However, we chose to work with S̄_ℓ — a subset of R^n — as it allows us to directly apply techniques from convex analysis.
Definition 3 (Support Loss). We call a map ℓ̲ : Δn → [0, +∞]^n a support loss of ℓ if

∀p ∈ ri Δn, ℓ̲(p) ∈ ∂σ_{S̄_ℓ}(−p);
∀p ∈ rbd Δn, ∃(p_m) ⊂ ri Δn, p_m → p and ℓ̲(p_m) → ℓ̲(p) component-wise as m → ∞,

where ∂σ_{S̄_ℓ} (see (2)) is the sub-differential of the support function σ_{S̄_ℓ} of the set S̄_ℓ.
Theorem 4. Any loss ℓ : A → [0, +∞]^n such that dom ℓ ≠ ∅ has a proper support loss ℓ̲ with the same Bayes risk, L_ℓ, as ℓ.

Theorem 4 shows that regardless of the action space on which the loss is defined, there always exists a proper loss whose Bayes risk coincides with that of the original loss. This fact is useful in situations where the Bayes risk contains all the information one needs — such is the case for mixability. The next theorem shows a stronger relationship between a loss and its corresponding support loss.
Theorem 5. Let ℓ : A → [0, +∞]^n be a loss and ℓ̲ be a proper support loss of ℓ. 
If the Bayes risk L_ℓ is differentiable on ]0, +∞[^n, then ℓ̲ is uniquely defined on ri Δn and

∀p ∈ dom ℓ̲, ∃a* ∈ dom ℓ, ℓ(a*) = ℓ̲(p);
∀a ∈ dom ℓ, ∃(p_m) ⊂ ri Δn, ℓ̲(p_m) → ℓ(a) component-wise as m → ∞.

Theorem 5 shows that when the Bayes risk is differentiable (a necessary condition for mixability — Proposition 12), the support loss is almost a reparametrization of the original loss, and in practice, it is enough to work with support losses instead. This will be crucial for characterizing Φ-mixability.

3 Mixability in the Game of Prediction with Expert Advice

In the game of prediction with expert advice, a learner faces a pool of k experts. At each round t, the experts announce their predictions a^t_{1:k} := [a^t_θ]^T_{1≤θ≤k} ∈ A^k, and the learner produces its own prediction a^t_M := M(a^t_{1:k}, (x^s, a^s_{1:k})_{1≤s<t}) using a merging strategy M, i.e. a function of the experts' current predictions and all past outcomes and predictions. After T > 0 rounds, the cumulative loss of each expert θ [resp. the learner] is given by Loss^ℓ_θ(T) := Σ^T_{t=1} ℓ_{x^t}(a^t_θ) [resp. Loss^ℓ_M(T) := Σ^T_{t=1} ℓ_{x^t}(a^t_M)]. We say that M achieves a constant regret if ∃R > 0, ∀T > 0, ∀θ ∈ [k], Loss^ℓ_M(T) ≤ Loss^ℓ_θ(T) + R. In what follows, this game setting will be referred to by G^n_ℓ(A, k), and we only consider the case where k ≥ 2.

3.1 The Aggregating Algorithm and η-mixability

Definition 6 (η-mixability). For η > 0, a loss ℓ : A → [0, +∞]^n is said to be η-mixable if ∀q ∈ Δk, ∀a_{1:k} ∈ A^k, ∃a* ∈ A, ∀x ∈ [n],

ℓ_x(a*) ≤ −η^{−1} log ⟨q, exp(−η ℓ_x(a_{1:k}))⟩,   (4)

where the exp applies component-wise. Letting H_ℓ := {η > 0 : ℓ is η-mixable}, we define the mixability constant of ℓ by η_ℓ := sup H_ℓ if H_ℓ ≠ ∅; and 0 otherwise. ℓ is said to be mixable if η_ℓ > 0.
If a loss ℓ is η-mixable for η > 0, the AA (Algorithm 1) achieves a constant regret in the G^n_ℓ(A, k) game [10]. In Algorithm 1, the map S_ℓ : 𝒮_ℓ^∞ → A is a substitution function of the loss ℓ [10, 4]; that is, S_ℓ satisfies the component-wise inequality ℓ(S_ℓ(s)) ≤ s, for all s ∈ 𝒮_ℓ^∞.
It was shown by Chernov et al. [1] that the η-mixability condition (4) is equivalent to the convexity of the η-exponentiated superprediction set of ℓ defined by exp(−η 𝒮_ℓ^∞) := {exp(−ηs) : s ∈ 𝒮_ℓ^∞}. Using this fact, van Erven et al. [9] showed that the mixability constant η_ℓ of a strictly proper loss ℓ : Δn → [0, +∞[^n, whose Bayes risk is twice continuously differentiable on ]0, +∞[^n, is equal to

η̲_ℓ := inf_{p̃ ∈ int Δ̃n} (λ_max([H L̃_log(p̃)]^{−1} H L̃_ℓ(p̃)))^{−1},   (5)

where H is the Hessian operator and L̃_· := L_· ∘ ⊓_n (⊓_n was defined in (1)). The next theorem extends this result by showing that the mixability constant η_ℓ of any loss ℓ is lower bounded by η̲_ℓ in (5), as long as ℓ satisfies Assumption 1 and its Bayes risk is twice differentiable.
Theorem 7. Let η > 0 and ℓ : A → [0, +∞]^n be a loss. Suppose that dom ℓ = A and that L_ℓ is twice differentiable on ]0, +∞[^n. If η̲_ℓ > 0 then ℓ is η̲_ℓ-mixable. In particular, η_ℓ ≥ η̲_ℓ.

We later show that, under the same conditions as Theorem 7, we actually have η_ℓ = η̲_ℓ (Theorem 16), which indicates that the Bayes risk contains all the information necessary to characterize mixability.
Remark 8. In practice, the requirement 'dom ℓ = A' is not necessarily a strict restriction to finite losses; it is often the case that a loss ℓ̄ : Ā → [0, +∞]^n only takes infinite values on the relative boundary of Ā (such is the case for the log-loss defined on the simplex), and thus the restriction ℓ := ℓ̄|_A, where A = ri Ā, satisfies dom ℓ = A. It follows trivially from the definition of mixability (4) that if ℓ is η-mixable and ℓ̄ is continuous with respect to the extended topology of [0, +∞]^n — a condition often satisfied — then ℓ̄ is also η-mixable.

3.2 The Generalized Aggregating Algorithm and (η, Φ)-mixability

A function Φ : R^k → R ∪ {+∞} is an entropy if it is convex, its epigraph epi Φ := {(u, h) : Φ(u) ≤ h} is closed in R^k × R, and Δk ⊆ dom Φ := {u ∈ R^k : Φ(u) < +∞}. For example, the Shannon entropy is defined by S(q) = +∞ if q ∉ [0, +∞[^k, and

∀q ∈ [0, +∞[^k, S(q) = Σ_{i∈[k] : q_i≠0} q_i log q_i.   (6)

The divergence generated by an entropy Φ is the map D_Φ : R^k × dom Φ → [0, +∞] defined by

D_Φ(v, u) := Φ(v) − Φ(u) − Φ′(u; v − u), if v ∈ dom Φ; +∞, otherwise,   (7)

where Φ′(u; v − u) := lim_{λ↓0} [Φ(u + λ(v − u)) − Φ(u)]/λ (the limit exists since Φ is convex [7]).
Definition 9 (Φ-mixability). Let Φ : R^k → R ∪ {+∞} be an entropy. A loss ℓ : A → [0, +∞]^n is (η, Φ)-mixable for η > 0 if ∀q ∈ Δk, ∀a_{1:k} ∈ A^k, ∃a* ∈ A, such that

∀x ∈ [n], ℓ_x(a*) ≤ Mix^η_Φ(ℓ_x(a_{1:k}), q) := inf_{q̂∈Δk} ⟨q̂, ℓ_x(a_{1:k})⟩ + η^{−1} D_Φ(q̂, q).   (8)

When η = 1, we simply say that ℓ is Φ-mixable and we denote Mix_Φ := Mix^1_Φ. Letting H^Φ_ℓ := {η > 0 : ℓ is (η, Φ)-mixable}, we define the generalized mixability constant of (ℓ, Φ) by η^Φ_ℓ := sup H^Φ_ℓ, if H^Φ_ℓ ≠ ∅; and 0 otherwise.
Reid et al. [6] introduced the GAA (see Algorithm 2), which uses an entropy function Φ : R^k → R ∪ {+∞} and a substitution function S_ℓ (see previous section) to specify the learner's merging strategy M. It was shown that the GAA reduces to the AA when Φ is the Shannon entropy S. It was also shown that under some regularity conditions on Φ, the GAA achieves a constant regret in the G^n_ℓ(A, k) game for any finite, (η, Φ)-mixable loss.
Our definition of Φ-mixability differs slightly from that of Reid et al. [6] — we use directional derivatives to define the divergence D_Φ. This distinction makes it possible to extend the GAA to losses which can take infinite values (such as the log-loss defined on the simplex). We show, in this case, that a constant regret is still achievable under the (η, Φ)-mixability condition. 
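For the Shannon entropy, D_S(q̂, q) is the KL divergence, and the infimum in (8) has the closed form Mix^η_S(v, q) = −η^{−1} log⟨q, exp(−ηv)⟩ — the right-hand side of (4) — attained at q̂* with q̂*_θ ∝ q_θ exp(−η v_θ), which is also the AA's multiplicative weight update. This is precisely how the GAA reduces to the AA. A small numerical sketch of both facts (ours, not from the paper; a grid search over the simplex stands in for the exact infimum):

```python
import math
from itertools import product

def kl(qhat, q):
    """D_S(qhat, q): the Bregman divergence of the Shannon entropy (KL divergence)."""
    return sum(a * math.log(a / b) for a, b in zip(qhat, q) if a > 0.0)

def mix_objective(qhat, v, q, eta):
    """The objective inside (8) for Phi = S: <qhat, v> + (1/eta) D_S(qhat, q)."""
    return sum(a * b for a, b in zip(qhat, v)) + kl(qhat, q) / eta

def mix_shannon(v, q, eta):
    """Closed form of Mix^eta_S(v, q): -(1/eta) log <q, exp(-eta v)>."""
    return -math.log(sum(b * math.exp(-eta * a) for a, b in zip(v, q))) / eta

def aa_update(q, v, eta):
    """Minimizer of (8) for Phi = S: q * exp(-eta v), renormalized (the AA update)."""
    w = [b * math.exp(-eta * a) for a, b in zip(v, q)]
    z = sum(w)
    return tuple(x / z for x in w)

eta, q, v = 0.7, (0.5, 0.3, 0.2), (1.0, 0.2, 2.5)
closed = mix_shannon(v, q, eta)
qstar = aa_update(q, v, eta)
attained = mix_objective(qstar, v, q, eta)  # should equal the closed form

# The closed form lower-bounds the objective over a fine grid on the simplex.
step = 0.01
grid_min = min(
    mix_objective((a, b, 1.0 - a - b), v, q, eta)
    for a, b in product([i * step for i in range(1, 100)], repeat=2)
    if 1.0 - a - b > 0.0
)
```

Plugging q̂* into the objective cancels the loss term against the log-ratio inside the KL, leaving exactly −(1/η) log Z, which is why the substitution and update steps of the AA emerge from a single minimization.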
Before presenting this result, we define the notion of Δ-differentiability; for l ⊆ [k], let Δ_l := {q ∈ Δk : q_θ = 0, ∀θ ∉ l}. We say that an entropy Φ is Δ-differentiable if ∀l ⊆ [k], ∀u, u_0 ∈ ri Δ_l, the map z ↦ Φ′(u; z) is linear on L^0_l := {λ(v − u_0) : (λ, v) ∈ R × Δ_l}.
Theorem 10. Let Φ : R^k → R ∪ {+∞} be a Δ-differentiable entropy. Let ℓ : A → [0, +∞]^n be a loss (not necessarily finite) such that L_ℓ is twice differentiable on ]0, +∞[^n. If ℓ is (η, Φ)-mixable, then the GAA achieves a constant regret in the G^n_ℓ(A, k) game; for any sequence (x^t, a^t_{1:k})^T_{t=1},

Loss^ℓ_GAA(T) − min_{θ∈[k]} Loss^ℓ_θ(T) ≤ R^Φ_ℓ := inf_{q∈Δk} max_{θ∈[k]} D_Φ(e_θ, q)/η^Φ_ℓ,   (9)

for initial distribution over experts q^0 = argmin_{q∈Δk} max_{θ∈[k]} D_Φ(e_θ, q), where e_θ is the θth basis element of R^k, and any substitution function S_ℓ.

Looking at Algorithm 2, it is clear that the GAA is divided into two steps; 1) a substitution step, which consists of finding a prediction a* ∈ A satisfying the mixability condition (8) using a substitution function S_ℓ; and 2) an update step, where a new distribution over experts is computed. Except for the case of the AA with the log-loss (which reduces to Bayesian updating [12]), there is not a unique choice of substitution function in general. An example of substitution function S_ℓ is the inverse loss [13]. Kamalaruban et al. [4] discuss other alternatives depending on the curvature of the Bayes risk. Although the choice of S_ℓ can affect the performance of the algorithm to some extent [4], the regret bound in (9) remains unchanged regardless of S_ℓ. On the other hand, the update step is well defined and corresponds to a mirror descent step [6] (we later use this fact to suggest a new algorithm).

Algorithm 1: Aggregating Algorithm
input: q^0 ∈ Δk; η > 0; an η-mixable loss ℓ : A → [0, +∞]^n; a substitution function S_ℓ.
output: Learner's predictions (a^t_*)
for t = 1 to T do
  Observe A^t = a^t_{1:k} ∈ A^k;
  a^t_* ← S_ℓ(−(1/η) log Σ_{θ∈[k]} q^{t−1}_θ exp(−η ℓ(a^t_θ)));
  Observe outcome x^t ∈ [n];
  q^t_θ ← q^{t−1}_θ exp(−η ℓ_{x^t}(a^t_θ)) / ⟨q^{t−1}, exp(−η ℓ_{x^t}(A^t))⟩, ∀θ ∈ [k];
end

Algorithm 2: Generalized Aggregating Algorithm
input: q^0 ∈ Δk; a Δ-differentiable entropy Φ : R^k → R ∪ {+∞}; η > 0; an (η, Φ)-mixable loss ℓ : A → [0, +∞]^n; a substitution function S_ℓ.
output: Learner's predictions (a^t_*)
for t = 1 to T do
  Observe A^t = a^t_{1:k} ∈ A^k;
  a^t_* ← S_ℓ([Mix^η_Φ(ℓ_x(A^t), q^{t−1})]^T_{1≤x≤n});
  Observe outcome x^t ∈ [n];
  q^t ← argmin_{μ∈Δk} ⟨μ, ℓ_{x^t}(A^t)⟩ + (1/η) D_Φ(μ, q^{t−1});
end

We conclude this subsection with two new and important results which will lead to a characterization of Φ-mixability. 
The first result shows that (η, S)-mixability is equivalent to η-mixability, and the second rules out losses and entropies for which Φ-mixability is not possible.
Theorem 11. Let η > 0. A loss ℓ : A → [0, +∞]^n is η-mixable if and only if ℓ is (η, S)-mixable.
Proposition 12. Let Φ : R^k → R ∪ {+∞} be an entropy and ℓ : A → [0, +∞]^n. If ℓ is Φ-mixable, then the Bayes risk satisfies L_ℓ ∈ C^1(]0, +∞[^n). If, additionally, L_ℓ is twice differentiable on ]0, +∞[^n, then Φ must be strictly convex on Δk.

It should be noted that since the Bayes risk of a loss ℓ must be differentiable for it to be Φ-mixable for some entropy Φ, Theorem 5 says that we can essentially work with a proper support loss ℓ̲ of ℓ. This will be crucial in the proof of the sufficient condition of Φ-mixability (Theorem 14).

3.3 A Characterization of Φ-Mixability

In this subsection, we first show that, given an entropy Φ : R^k → R ∪ {+∞} and a loss ℓ : A → [0, +∞]^n satisfying certain regularity conditions, ℓ is Φ-mixable if and only if

η_ℓ Φ − S is convex on Δk.   (10)

Theorem 13. Let η > 0, ℓ : A → [0, +∞]^n an η-mixable loss, and Φ : R^k → R ∪ {+∞} an entropy. If ηΦ − S is convex on Δk, then ℓ is Φ-mixable.

The converse of Theorem 13 also holds under additional smoothness conditions on Φ and ℓ;
Theorem 14. Let ℓ : A → [0, +∞]^n be a loss such that L_ℓ is twice differentiable on ]0, +∞[^n, and Φ : R^k → R ∪ {+∞} an entropy such that Φ̃ := Φ ∘ ⊓_k is twice differentiable on int Δ̃k. Then ℓ is Φ-mixable only if η̲_ℓ Φ − S is convex on Δk.

As a consequence of Theorem 14, if a loss ℓ is not classically mixable, i.e. η_ℓ = 0, it cannot be Φ-mixable for any entropy Φ. This is because η̲_ℓ Φ − S (∗)= η_ℓ Φ − S = −S is not convex (where equality '(∗)' is due to Theorem 7).
We need one more result before arriving at (10); recall that the mixability constant η_ℓ is defined as the supremum of the set H_ℓ := {η > 0 : ℓ is η-mixable}. The next lemma essentially gives a sufficient condition for this supremum to be attained when H_ℓ is non-empty — in this case, ℓ is η_ℓ-mixable.
Lemma 15. Let ℓ : A → [0, +∞]^n be a loss. If dom ℓ = A, then either H_ℓ = ∅ or η_ℓ ∈ H_ℓ.
Theorem 16. Let ℓ and Φ be as in Theorem 14 with dom ℓ = A. Then η_ℓ = η̲_ℓ. Furthermore, ℓ is Φ-mixable if and only if η_ℓ Φ − S is convex on Δk.

Proof. Suppose now that ℓ is mixable. By Lemma 15, it follows that ℓ is η_ℓ-mixable, and from Theorem 11, ℓ is (η^{−1}_ℓ S)-mixable. Substituting Φ for η^{−1}_ℓ S in Theorem 14 implies that (η̲_ℓ/η_ℓ − 1) S is convex on ri Δk. Thus, η_ℓ ≤ η̲_ℓ, and since from Theorem 7 η̲_ℓ ≤ η_ℓ, we conclude that η_ℓ = η̲_ℓ.
From Theorem 14, if ℓ is Φ-mixable then η̲_ℓ Φ − S is convex on Δk. Now suppose that η̲_ℓ Φ − S is convex on Δk. This implies that η̲_ℓ > 0, and thus from Theorem 7, ℓ is η̲_ℓ-mixable. Now since ℓ is η̲_ℓ-mixable and η̲_ℓ Φ − S is convex on Δk, Theorem 13 implies that ℓ is Φ-mixable.

Note that the condition 'dom ℓ = A' is in practice not a restriction to finite losses — see Remark 8. Theorem 16 implies that under the regularity conditions of Theorem 14, the Bayes risk L_ℓ [resp. (L_ℓ, Φ)] contains all necessary information to characterize classical [resp. generalized] mixability.
Corollary 17 (The Generalized Mixability Constant). Let ℓ and Φ be as in Theorem 16. Then the generalized mixability constant (see Definition 9) is given by

η^Φ_ℓ = η_ℓ inf_{q̃ ∈ int Δ̃k} λ_min(H Φ̃(q̃) (H S̃(q̃))^{−1}),   (11)

where Φ̃ := Φ ∘ ⊓_k, S̃ := S ∘ ⊓_k, and ⊓_k is defined in (1).
Observe that when Φ = S, (11) reduces to η^S_ℓ = η_ℓ, as expected from Theorem 11 and Theorem 16.

3.4 The (In)dependence Between ℓ and Φ and the Fundamental Nature of S

So far, we showed that the Φ-mixability of losses satisfying Assumption 1 is characterized by the convexity of ηΦ − S, where η ∈ ]0, η_ℓ] (see Theorems 13 and 14). As a result, and contrary to what was conjectured previously [6], the generalized mixability condition does not induce a correspondence between losses and entropies; for a given loss ℓ, there is no particular entropy Φ_ℓ — specific to the choice of ℓ — which minimizes the regret of the GAA. Rather, the Shannon entropy S minimizes the regret regardless of the choice of ℓ (see Theorem 18 below). 
This reflects one fundamental aspect of the Shannon entropy.
Nevertheless, given a loss $\ell$ and an entropy $\Phi$, the curvature of the loss surface $S_\ell$ determines the maximum 'learning rate' $\eta^\Phi_\ell$ of the GAA; the curvature of $S_\ell$ is linked to $\eta_\ell$ through the Hessian of the Bayes risk (see Theorem 30 in Appendix H.2), which is in turn linked to $\eta^\Phi_\ell$ through (11).
Given a loss $\ell$, we now use the expression of $\eta^\Phi_\ell$ in (11) to explicitly compare the regret bounds $R^\Phi_\ell$ and $R^{\operatorname{S}}_\ell$ achieved with the GAA (see (9)) using the entropy $\Phi$ and the Shannon entropy $\operatorname{S}$, respectively.
Theorem 18. Let $\operatorname{S}, \Phi : \mathbb{R}^k \to \mathbb{R} \cup \{+\infty\}$, where $\operatorname{S}$ is the Shannon entropy and $\Phi$ is an entropy such that $\tilde\Phi := \Phi \circ \mathsf{q}_k$ is twice differentiable on $\operatorname{int} \tilde\Delta_k$. A loss $\ell : A \to [0, +\infty[^n$ with $L_\ell$ twice differentiable on $]0, +\infty[^n$ is $\Phi$-mixable only if $R^{\operatorname{S}}_\ell \leq R^\Phi_\ell$.

Theorem 18 is consistent with Vovk's result [10, §5], which essentially states that the regret bound $R^{\operatorname{S}}_\ell = \eta_\ell^{-1} \log k$ is in general tight for $\eta$-mixable losses.

4 Adaptive Generalized Aggregating Algorithm

In this section, we take advantage of the similarity between the GAA's update step and the mirror descent algorithm (see Appendix E) to devise a modification of the GAA leading to improved regret bounds in certain cases. The GAA can be modified in (at least) two immediate ways: 1) changing the learning rate at each time step to speed up convergence; and 2) changing the entropy, i.e. the regularizer $\Phi$, at each time step, similar to the adaptive mirror descent algorithm [8, 5]. In the former case, one can use Corollary 17 to calculate the maximum 'learning rate' under the $\Phi$-mixability constraint.
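For small $k$, the infimum in (11) can be probed numerically. Below is a minimal sketch for $k = 2$, assuming the paper's convention that entropies are convex (so we take $\operatorname{S}(q) = \sum_i q_i \log q_i$) and parameterizing $\Delta_2$ by $\tilde q \in (0,1)$ via $\tilde q \mapsto (\tilde q, 1 - \tilde q)$; the Hessians are then scalars, so $\lambda_{\min}$ is just a ratio. The quadratic entropy used for comparison is our own illustrative choice, not one from the paper.

```python
import numpy as np

def min_eig_ratio(hess_phi, hess_s, grid):
    # Infimum over the grid of lambda_min(H_Phi(q) (H_S(q))^{-1}); for k = 2
    # the Hessians of the reparameterized entropies are scalars, so the
    # minimum eigenvalue of the product is simply their ratio.
    return min(hess_phi(q) / hess_s(q) for q in grid)

# Parameterize Delta_2 by q in (0, 1) via q -> (q, 1 - q).
hess_shannon = lambda q: 1.0 / q + 1.0 / (1.0 - q)  # Hessian of S(q, 1 - q)
hess_quad = lambda q: 2.0                           # Hessian of 0.5*(q^2 + (1-q)^2)

grid = np.linspace(1e-3, 1.0 - 1e-3, 10_000)

# Phi = S: the ratio is identically 1, recovering eta^S = eta_ell as in the text.
print(min_eig_ratio(hess_shannon, hess_shannon, grid))  # 1.0

# Quadratic entropy: the ratio 2*q*(1-q) vanishes towards the boundary of the
# simplex, so the infimum over int(Delta) tends to 0 for this entropy.
print(min_eig_ratio(hess_quad, hess_shannon, grid))
```

This matches the observation following (11): with the Shannon entropy the constant is not degraded, while an entropy whose curvature stays bounded near the simplex boundary drives the infimum (and hence $\eta^\Phi_\ell$) to zero.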
Here, we focus on the second method: changing the entropy at each round. Algorithm 3 displays the modified GAA, which we call the Adaptive Generalized Aggregating Algorithm (AGAA), in its most general form. In Algorithm 3, $\Phi^\star(z) := \sup_{q \in \Delta_k} \langle q, z\rangle - \Phi(q)$ is the entropic dual of $\Phi$.
Given an $(\eta, \Phi)$-mixable loss $\ell$, we verify that Algorithm 3 is well defined; for simplicity, assume that $\operatorname{dom} \ell = A$ and $L_\ell$ is twice differentiable on $]0, +\infty[^n$. From the definition of an entropy, $|\Phi| < +\infty$ on $\Delta_k$, and thus the entropic dual $\Phi^\star_t$ is defined and finite on all of $\mathbb{R}^k$ (in particular at $\theta_t$). On the other hand, from Proposition 12, $\Phi$ is strictly convex on $\Delta_k$, which implies that $\Phi^\star$ (and thus $\Phi^\star_t$) is differentiable on $\mathbb{R}^k$ (see e.g. [2, Thm. E.4.1.1]). It remains to check that $\ell$ is $(\eta, \Phi_t)$-mixable. Since for $\eta > 0$, $(\eta, \Phi_t)$-mixability is equivalent to $(\frac{1}{\eta}\Phi_t)$-mixability (by definition), Theorem 16 implies that $\ell$ is $(\eta, \Phi_t)$-mixable if and only if $\underline{\eta}_\ell \eta^{-1} \Phi_t - \operatorname{S}$ is convex on $\Delta_k$. This is in fact the case, since $\Phi_t$ is an affine transformation of $\Phi$ and we have assumed that $\ell$ is $(\eta, \Phi)$-mixable.

Algorithm 3: Adaptive Generalized Aggregating Algorithm (AGAA)
input: $\theta_1 = 0 \in \mathbb{R}^k$; a $\Delta$-differentiable entropy $\Phi : \mathbb{R}^k \to \mathbb{R} \cup \{+\infty\}$; $\eta > 0$; an $(\eta, \Phi)$-mixable loss $\ell : A \to [0, +\infty[^n$; a substitution function $\operatorname{S}_\ell$; a protocol for choosing $\beta_t$ at round $t$.
output: Learner's predictions $(a_{t*})$
for $t = 1$ to $T$ do
    Let $\Phi_t(w) := \Phi(w) - \langle w, \beta_t - \theta_t\rangle$;    // New entropy
    Observe $A_t := a^{1:k}_t \in A^k$;    // Experts' predictions
    $a_{t*} \leftarrow \operatorname{S}_\ell\big(\big[\operatorname{Mix}^\eta_{\Phi_t}(\ell_x(A_t), \nabla\Phi^\star_t(\theta_t))\big]^{\mathsf{T}}_{1 \leq x \leq n}\big)$;    // Learner's prediction
    Observe $x_t \in [n]$ and pick some $v_t \in \mathbb{R}^k$;
    $\theta_{t+1} \leftarrow \theta_t - \eta \ell_{x_t}(A_t)$;
end

In what follows, we focus on a particular instantiation of Algorithm 3 where we choose $\beta_t := -\eta \sum_{s=1}^{t-1}(\ell_{x_s}(A_s) + v_s)$ for some (arbitrary for now) $(v_s) \subset \mathbb{R}^k$. The $(v_t)$ vectors act as correction terms in the update step of the AGAA. Using standard duality properties (see Appendix A), it is easy to show that the AGAA reduces to the GAA except for the update step, where the new distribution over experts at round $t \in [T]$ is now given by

$$q_t = \nabla\Phi^\star(\nabla\Phi(q_{t-1}) - \eta \ell_{x_t}(A_t) - \eta v_t).$$

Theorem 19. Let $\Phi : \mathbb{R}^k \to \mathbb{R} \cup \{+\infty\}$ be a $\Delta$-differentiable entropy. Let $\ell : A \to [0, +\infty]^n$ be a loss such that $L_\ell$ is twice differentiable on $]0, +\infty[^n$. Let $\beta_t = -\eta \sum_{s=1}^{t-1}(\ell_{x_s}(A_s) + v_s)$,
where $v_s \in \mathbb{R}^k$ and $A_s := a^{1:k}_s \in A^k$. If $\ell$ is $(\eta, \Phi)$-mixable, then for the initial distribution $q_0 = \operatorname{argmin}_{q \in \Delta_k} \max_{\theta \in [k]} D_\Phi(e_\theta, q)$ and any sequence $(x_t, a^{1:k}_t)^T_{t=1}$, the AGAA achieves the regret

$$\forall \theta \in [k], \quad \operatorname{Loss}^\ell_{\mathrm{AGAA}}(T) - \operatorname{Loss}^\ell_\theta(T) \leq R^\Phi_\ell + \Delta R_\theta(T), \qquad (12)$$

where $\Delta R_\theta(T) := \sum_{t=1}^{T-1} \big(v^t_\theta - \langle v^t, q^t\rangle\big)$.

Theorem 19 implies that if the sequence $(v_t)$ is chosen such that $\Delta R_{\theta^*}(T)$ is negative for the best expert $\theta^*$ (in hindsight), then the regret bound '$R^\Phi_\ell + \Delta R_{\theta^*}(T)$' of the AGAA is lower than that of the GAA (see (9)), and ultimately than that of the AA (when $\Phi = \operatorname{S}$). Unfortunately, due to Vovk's result [10, §5], there is no "universal" choice of $(v_t)$ which guarantees that $\Delta R_{\theta^*}(T)$ is always negative. However, there are cases where this term is expected to be negative.
Consider a dataset where it is typical for the best experts (i.e., the $\theta^*$'s) to perform poorly at some point during the game, as measured by their average loss, for example. Under such an assumption, choosing the correction vectors $v_t$ to be negatively proportional to the average losses of the experts, i.e. $v_t := -\frac{\alpha}{t}\sum_{s=1}^t \ell_{x_s}(A_s)$ (for small enough $\alpha > 0$), would be consistent with the idea of making $\Delta R_{\theta^*}(T)$ negative. To see this, suppose expert $\theta^*$ is performing poorly during the game (say at $t < T$), as measured by its instantaneous and average loss. At that point the distribution $q_t$ would put more weight on experts performing better than $\theta^*$, i.e. those having a lower average loss.
And since $v^t_\theta$ is negatively proportional to the average loss of expert $\theta$, the quantity $v^t_{\theta^*} - \langle v^t, q^t\rangle$ would be negative, consistent with making $\Delta R_{\theta^*}(T) < 0$. On the other hand, if expert $\theta^*$ performs well during the game (say, close to the best), then $v^t_{\theta^*} - \langle v^t, q^t\rangle \simeq 0$, since $q_t$ would put comparable weights on $\theta^*$ and other experts (if any) with similar performance.
Example 1 (A Negative Regret). One can construct an example that illustrates the idea above. Consider the Brier game $G_2(\Delta_2, 2)$: a probability game with 2 experts $\{\theta_1, \theta_2\}$, 2 outcomes $\{0, 1\}$, and where the loss $\ell_{\mathrm{Brier}}$ is the Brier loss [11] (which is 1-mixable). Assume that expert $\theta_1$ consistently predicts $\Pr(x = 0) = 1/2$; expert $\theta_2$ predicts $\Pr(x = 0) = 1/4$ during the first 50 rounds, then switches to predicting $\Pr(x = 0) = 3/4$ thereafter; and the outcome is always $x = 0$. A straightforward simulation using the AGAA with the Shannon entropy, Vovk's substitution function for the Brier loss [11], and $\beta_t$ as in Theorem 19 with $v_t := -\frac{1}{8t}\sum_{s=1}^t \ell_{\mathrm{Brier}}(x_s, A_s)$ yields $R^\Phi_{\ell_{\mathrm{Brier}}} + \Delta R_{\theta^*}(T) \simeq -5$ for all $T \geq 150$, where in this case $\theta^* = \theta_2$ is the best expert for $T \geq 150$. The learner then does better than the best expert. If we use the AA instead, the learner does worse than $\theta_2$ by $\simeq R^{\operatorname{S}}_{\ell_{\mathrm{Brier}}} = \log 2$.
In real data, the situation described above, where the best expert does not necessarily perform optimally throughout the game, is typical, especially when the number of rounds $T$ is large. We have tested the aggregating algorithms on real data as studied by Vovk [11].
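Returning to Example 1, the sign of $\Delta R_{\theta^*}(T)$ in that switching scenario can be checked with a short simulation. The sketch below is not the full AGAA (it uses no substitution function and so produces no predictions); it only iterates the update $q_t = \nabla\Phi^\star(\nabla\Phi(q_{t-1}) - \eta\ell_{x_t}(A_t) - \eta v_t)$ with the Shannon entropy, for which $\nabla\Phi^\star$ is a softmax, and accumulates the correction term. Summing over all $T$ rounds rather than $T - 1$ is a simplification.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def brier(p, x):
    # Brier loss of predicting Pr(x = 0) = p when the binary outcome is x.
    pred = np.array([p, 1.0 - p])
    outcome = np.array([1.0, 0.0]) if x == 0 else np.array([0.0, 1.0])
    return ((pred - outcome) ** 2).sum()

T, eta = 200, 1.0                      # the Brier loss is 1-mixable
q = np.array([0.5, 0.5])               # uniform initial distribution over experts
cum = np.zeros(2)                      # cumulative expert losses
delta_R = np.zeros(2)                  # running Delta R_theta

for t in range(1, T + 1):
    p1 = 0.5                           # expert theta_1: always predicts 1/2
    p2 = 0.25 if t <= 50 else 0.75     # expert theta_2: switches after round 50
    loss = np.array([brier(p1, 0), brier(p2, 0)])  # the outcome is always x = 0
    cum += loss
    v = -cum / (8.0 * t)               # correction vector of Example 1
    q = softmax(np.log(q) - eta * (loss + v))      # Shannon-entropy AGAA update
    delta_R += v - np.dot(v, q)

best = int(np.argmin(cum))             # theta_2 (index 1) is best for T >= 150
print(best, delta_R[best] < 0)         # prints: 1 True
```

The correction term for the eventual best expert comes out negative, in line with the mechanism described above: while $\theta_2$ is behind, $q_t$ concentrates on $\theta_1$, so each summand $v^t_{\theta_2} - \langle v^t, q^t\rangle$ is negative.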
We compared the performance of the AA with that of the AGAA and found that the AGAA outperforms the AA, in fact achieving a negative regret on two data sets. Details of the experiments are in Appendix J.
As pointed out earlier, there are situations where $\Delta R_{\theta^*}(T) \geq 0$ even for the choice of $(v_t)$ in Example 1, and this could potentially lead to a large positive regret for the AGAA. There is an easy way to remove this risk at a small price: the outputs of the AGAA and the AA can themselves be considered as expert predictions. These predictions can in turn be passed to a new instance of the AA to yield a meta prediction. The resulting worst-case regret is guaranteed not to exceed that of the original AA instance by more than $\eta^{-1}\log 2$ for an $\eta$-mixable loss. We test this idea in Appendix J.

5 Discussion and Future Work

In this work, we derived a characterization of $\Phi$-mixability, which enables a better understanding of when a constant regret is achievable in the game of prediction with expert advice. Then, borrowing techniques from mirror descent, we proposed a new "adaptive" version of the generalized aggregating algorithm. We derived a regret bound for a specific instantiation of this algorithm and discussed certain situations where the algorithm is expected to perform well. We empirically demonstrated the performance of this algorithm on football game predictions (see Appendix J).
Vovk [10, §5] essentially showed that, given an $\eta$-mixable loss, there is no algorithm that can achieve a lower regret bound than $\eta^{-1}\log k$ on all sequences of outcomes. There is no contradiction in trying to design algorithms which perform well in expectation (maybe better than the AA) on "typical" data while keeping the worst-case regret close to $\eta^{-1}\log k$.
This was the motivation behind the AGAA. In future work, we will explore other choices of the correction vector $v_t$ with the goal of lowering the (expected) bound in (12). In the present work, we did not study the possibility of varying the learning rate $\eta$. One might obtain better regret bounds using an adaptive learning rate, as is the case with the mirror descent algorithm. Our Corollary 17 is useful in that it gives an upper bound on the maximal learning rate under the $\Phi$-mixability constraint. Finally, although our Theorem 18 states that the worst-case regret of the GAA is minimized when using the Shannon entropy, it would be interesting to study the dynamics of the AGAA with other entropies.

Table 1: A short list of the main symbols used in the paper

Symbol | Description
$\ell$ | A loss function defined on a set $A$ and taking values in $[0, +\infty]^n$ (see Sec. 2)
$S_\ell$ | The finite part of the superprediction set of a loss $\ell$ (see Sec. 2)
$\underline{\ell}$ | The support loss of a loss $\ell$ (see Def. 3)
$L_\ell$ | The Bayes risk corresponding to a loss $\ell$ (see Definition 2)
$\tilde L_\ell$ | The composition of the Bayes risk with an affine function; $\tilde L_\ell := L_\ell \circ \mathsf{q}_n$ (see (1))
$\operatorname{S}$ | The Shannon entropy (see (6))
$\eta_\ell$ | The mixability constant of $\ell$ (see Def. 6); essentially the largest $\eta$ s.t. $\ell$ is $\eta$-mixable
$\underline{\eta}_\ell$ | Essentially the largest $\eta$ such that $\eta L_\ell - L_{\log}$ is convex (see (5) and [9])
$\eta^\Phi_\ell$ | The generalized mixability constant (see Def. 9); the largest $\eta$ s.t. $\ell$ is $(\eta, \Phi)$-mixable
$\operatorname{S}_\ell$ | A substitution function of a loss $\ell$ (see Sec. 3.1)
$R^\Phi_\ell$ | The regret achieved by the GAA using entropy $\Phi$ (see (9) and Algorithm 2)

Acknowledgments

This work was supported by the Australian Research Council and DATA61.

References

[1] Alexey Chernov, Yuri Kalnishkan, Fedor Zhdanov, and Vladimir Vovk. Supermartingales in prediction with expert advice. Theoretical Computer Science, 411(29-30):2647-2669, 2010.
[2] J.-B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. Springer, 2001.
[3] Yuri Kalnishkan, Volodya Vovk, and Michael V. Vyugin. Loss functions, complexities, and the Legendre transformation. Theoretical Computer Science, 313(2):195-207, 2004.
[4] Parameswaran Kamalaruban, Robert Williamson, and Xinhua Zhang. Exp-concavity of proper composite losses. In Conference on Learning Theory, pages 1035-1065, 2015.
[5] Francesco Orabona, Koby Crammer, and Nicolò Cesa-Bianchi. A generalized online mirror descent with applications to classification and regression. Machine Learning, 99(3):411-435, 2015.
[6] Mark D. Reid, Rafael M. Frongillo, Robert C. Williamson, and Nishant Mehta. Generalized mixability via entropic duality. In Conference on Learning Theory, pages 1501-1522, 2015.
[7] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1997.
[8] Jacob Steinhardt and Percy Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In International Conference on Machine Learning, pages 1593-1601, 2014.
[9] Tim van Erven, Mark D. Reid, and Robert C. Williamson. Mixability is Bayes risk curvature relative to log loss. Journal of Machine Learning Research, 13:1639-1663, 2012.
[10] Vladimir Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2):153-173, 1998.
[11] Vladimir Vovk and Fedor Zhdanov. Prediction with expert advice for the Brier game. Journal of Machine Learning Research, 10(Nov):2445-2471, 2009.
[12] Volodya Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213-248, 2001.
[13] Robert C. Williamson. The geometry of losses. In Conference on Learning Theory, pages 1078-1108, 2014.
[14] Robert C. Williamson, Elodie Vernet, and Mark D. Reid. Composite multiclass losses. Journal of Machine Learning Research, 17(223):1-52, 2016.