{"title": "Coin Betting and Parameter-Free Online Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 577, "page_last": 585, "abstract": "In the recent years, a number of parameter-free algorithms have been developed for online linear optimization over Hilbert spaces and for learning with expert advice.  These algorithms achieve optimal regret bounds that depend on the unknown competitors, without having to tune the learning rates with oracle choices.  We present a new intuitive framework to design parameter-free algorithms for both online linear optimization over Hilbert spaces and for learning with expert advice, based on reductions to betting on outcomes of adversarial coins. We instantiate it using a betting algorithm based on the Krichevsky-Trofimov estimator.  The resulting algorithms are simple, with no parameters to be tuned, and they improve or match previous results in terms of regret guarantee and per-round complexity.", "full_text": "Coin Betting and Parameter-Free Online Learning\n\nFrancesco Orabona\nStony Brook University, Stony Brook, NY\nfrancesco@orabona.com\n\nD\u00e1vid P\u00e1l\nYahoo Research, New York, NY\ndpal@yahoo-inc.com\n\nAbstract\n\nIn recent years, a number of parameter-free algorithms have been developed for online linear optimization over Hilbert spaces and for learning with expert advice. These algorithms achieve optimal regret bounds that depend on the unknown competitors, without having to tune the learning rates with oracle choices.\nWe present a new intuitive framework to design parameter-free algorithms for both online linear optimization over Hilbert spaces and for learning with expert advice, based on reductions to betting on outcomes of adversarial coins. We instantiate it using a betting algorithm based on the Krichevsky-Trofimov estimator. 
The resulting algorithms are simple, with no parameters to be tuned, and they improve or match previous results in terms of regret guarantee and per-round complexity.\n\n1 Introduction\n\nWe consider the Online Linear Optimization (OLO) [4, 25] setting. In each round t, an algorithm chooses a point w_t from a convex decision set K and then receives a reward vector g_t. The algorithm\u2019s goal is to keep its regret small, defined as the difference between the cumulative reward of a fixed strategy u \u2208 K and its own cumulative reward, that is\n\n$$\\mathrm{Regret}_T(u) = \\sum_{t=1}^{T} \\langle g_t, u \\rangle - \\sum_{t=1}^{T} \\langle g_t, w_t \\rangle .$$\n\nWe focus on two particular decision sets, the N-dimensional probability simplex \u2206_N = {x \u2208 R^N : x \u2265 0, \u2016x\u2016_1 = 1} and a Hilbert space H. OLO over \u2206_N is referred to as the problem of Learning with Expert Advice (LEA). We assume bounds on the norms of the reward vectors: for OLO over H, we assume that \u2016g_t\u2016 \u2264 1, and for LEA we assume that g_t \u2208 [0, 1]^N.\nOLO is a basic building block of many machine learning problems. For example, Online Convex Optimization (OCO), the problem analogous to OLO where \u27e8g_t, u\u27e9 is generalized to an arbitrary convex function \u2113_t(u), is solved through a reduction to OLO [25]. LEA [17, 27, 5] provides a way of combining classifiers and it is at the heart of boosting [12]. Batch and stochastic convex optimization can also be solved through a reduction to OLO [25].\nTo achieve optimal regret, most of the existing online algorithms require the user to set the learning rate (step size) \u03b7 to an unknown/oracle value. For example, to obtain the optimal bound for Online Gradient Descent (OGD), the learning rate has to be set with the knowledge of the norm of the competitor u, \u2016u\u2016; second entry in Table 1. 
Likewise, the optimal learning rate for Hedge depends on the KL divergence between the prior weighting \u03c0 and the unknown competitor u, D(u\u2016\u03c0); seventh entry in Table 1. Recently, new parameter-free algorithms have been proposed, both for LEA [6, 8, 18, 19, 15, 11] and for OLO/OCO over Hilbert spaces [26, 23, 21, 22, 24]. These algorithms adapt to the number of experts and to the norm of the optimal predictor, respectively, without the need to tune parameters. However, their design and underlying intuition are still a challenge. Foster et al. [11] proposed a unified framework, but it is not constructive. Furthermore, all existing algorithms for LEA either have sub-optimal regret bound (e.g. extra O(log log T) factor) or sub-optimal running time (e.g. requiring solving a numerical problem in every round, or with extra factors); see Table 1.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nAlgorithm | Worst-case regret guarantee | Per-round time complexity | Adaptive | Unified analysis\nOGD, \u03b7 = 1/\u221aT [25] | O((1 + \u2016u\u2016^2) \u221aT), \u2200u \u2208 H | O(1) | - | -\nOGD, \u03b7 = U/\u221aT [25] | U \u221aT for any u \u2208 H s.t. \u2016u\u2016 \u2264 U | O(1) | - | -\n[23] | O(\u2016u\u2016 ln(1 + \u2016u\u2016 T) \u221aT), \u2200u \u2208 H | O(1) | \u2713 | -\n[22, 24] | O(\u2016u\u2016 \u221a(T ln(1 + \u2016u\u2016 T))), \u2200u \u2208 H | O(1) | \u2713 | -\nThis paper, Sec. 7.1 | O(\u2016u\u2016 \u221a(T ln(1 + \u2016u\u2016 T))), \u2200u \u2208 H | O(1) | \u2713 | \u2713\nHedge, \u03b7 = \u221a(ln N / T), \u03c0_i = 1/N [12] | O(\u221a(T ln N)), \u2200u \u2208 \u2206_N | O(N) | - | -\nHedge, \u03b7 = U/\u221aT [12] | O(U \u221aT) for any u \u2208 \u2206_N s.t. \u221a(D(u\u2016\u03c0)) \u2264 U | O(N) | - | -\n[6] | O(\u221a(T (1 + D(u\u2016\u03c0))) + ln^2 N), \u2200u \u2208 \u2206_N | O(N K)\u00b9 | \u2713 | -\n[8] | O(\u221a(T (1 + D(u\u2016\u03c0)))), \u2200u \u2208 \u2206_N | O(N K)\u00b9 | \u2713 | -\n[8, 19, 15]\u00b2 | O(\u221a(T (ln ln T + D(u\u2016\u03c0)))), \u2200u \u2208 \u2206_N | O(N) | \u2713 | -\n[11] | O(\u221a(T (1 + D(u\u2016\u03c0)))), \u2200u \u2208 \u2206_N | O(N ln max_{u \u2208 \u2206_N} D(u\u2016\u03c0))\u00b3 | \u2713 | \u2713\nThis paper, Sec. 7.2 | O(\u221a(T (1 + D(u\u2016\u03c0)))), \u2200u \u2208 \u2206_N | O(N) | \u2713 | \u2713\n\nTable 1: Algorithms for OLO over Hilbert space and LEA.\n\nContributions. We show that a more fundamental notion subsumes both OLO and LEA parameter-free algorithms. We prove that the ability to maximize the wealth in bets on the outcomes of coin flips implies OLO and LEA parameter-free algorithms. We develop a novel potential-based framework for betting algorithms. It gives intuition to previous constructions and, instantiated with the Krichevsky-Trofimov estimator, provides new and elegant algorithms for OLO and LEA. The new algorithms also have optimal worst-case guarantees on regret and time complexity; see Table 1.\n\n2 Preliminaries\n\nWe begin by providing some definitions. The Kullback-Leibler (KL) divergence between two discrete distributions p and q is D(p\u2016q) = \u2211_i p_i ln(p_i/q_i). 
If p, q are real numbers in [0, 1], we denote by D(p\u2016q) = p ln(p/q) + (1 \u2212 p) ln((1 \u2212 p)/(1 \u2212 q)) the KL divergence between two Bernoulli distributions with parameters p and q. We denote by H a Hilbert space, by \u27e8\u00b7,\u00b7\u27e9 its inner product, and by \u2016\u00b7\u2016 the induced norm. We denote by \u2016\u00b7\u2016_1 the 1-norm in R^N. A function F : I \u2192 R_+ is called logarithmically convex iff f(x) = ln(F(x)) is convex. Let f : V \u2192 R \u222a {\u00b1\u221e}, the Fenchel conjugate of f is f^* : V^* \u2192 R \u222a {\u00b1\u221e} defined on the dual vector space V^* by f^*(\u03b8) = sup_{x \u2208 V} \u27e8\u03b8, x\u27e9 \u2212 f(x). A function f : V \u2192 R \u222a {+\u221e} is said to be proper if there exists x \u2208 V such that f(x) is finite. If f is a proper lower semi-continuous convex function then f^* is also proper lower semi-continuous convex and f^{**} = f.\nCoin Betting. We consider a gambler making repeated bets on the outcomes of adversarial coin flips. The gambler starts with an initial endowment \u03b5 > 0. In each round t, he bets on the outcome of a coin flip g_t \u2208 {\u22121, 1}, where +1 denotes heads and \u22121 denotes tails. We do not make any assumption on how g_t is generated, that is, it can be chosen by an adversary.\nThe gambler can bet any amount on either heads or tails. However, he is not allowed to borrow any additional money. If he loses, he loses the betted amount; if he wins, he gets the betted amount back and, in addition to that, he gets the same amount as a reward. We encode the gambler\u2019s bet in round t by a single number w_t. The sign of w_t encodes whether he is betting on heads or tails. The absolute value encodes the betted amount. 
We define Wealth_t as the gambler\u2019s wealth at the end of round t and Reward_t as the gambler\u2019s net reward (the difference of wealth and initial endowment), that is\n\n$$\\mathrm{Wealth}_t = \\epsilon + \\sum_{i=1}^{t} w_i g_i \\qquad \\text{and} \\qquad \\mathrm{Reward}_t = \\mathrm{Wealth}_t - \\epsilon . \\tag{1}$$\n\nIn the following, we will also refer to a bet with \u03b2_t, where \u03b2_t is such that\n\n$$w_t = \\beta_t \\, \\mathrm{Wealth}_{t-1} . \\tag{2}$$\n\nThe absolute value of \u03b2_t is the fraction of the current wealth to bet, and the sign of \u03b2_t encodes whether he is betting on heads or tails. The constraint that the gambler cannot borrow money implies that \u03b2_t \u2208 [\u22121, 1]. We also generalize the problem slightly by allowing the outcome of the coin flip g_t to be any real number in the interval [\u22121, 1]; wealth and reward in (1) remain exactly the same.\n\n\u00b9 These algorithms require solving a numerical problem at each step. The number K is the number of steps needed to reach the required precision. Neither the precision nor K are calculated in these papers.\n\u00b2 The proof in [15] can be modified to prove a KL bound, see http://blog.wouterkoolen.info.\n\u00b3 A variant of the algorithm in [11] can be implemented with the stated time complexity [10].\n\n3 Warm-Up: From Betting to One-Dimensional Online Linear Optimization\n\nIn this section, we sketch how to reduce one-dimensional OLO to betting on a coin. The reasoning for generic Hilbert spaces (Section 5) and for LEA (Section 6) will be similar. We will show that the betting view provides a natural way for the analysis and design of online learning algorithms, where the only design choice is the potential function of the betting algorithm (Section 4). A specific example of coin betting potential and the resulting algorithms are in Section 7.\nAs a warm-up, let us consider an algorithm for OLO over one-dimensional Hilbert space R. Let {w_t}_{t=1}^\u221e be its sequence of predictions on a sequence of rewards {g_t}_{t=1}^\u221e, g_t \u2208 [\u22121, 1]. The total reward of the algorithm after t rounds is Reward_t = \u2211_{i=1}^{t} g_i w_i. Also, even if in OLO there is no concept of \u201cwealth\u201d, define the wealth of the OLO algorithm as Wealth_t = \u03b5 + Reward_t, as in (1).\nWe now restrict our attention to algorithms whose predictions w_t are of the form of a bet, that is w_t = \u03b2_t Wealth_{t\u22121}, where \u03b2_t \u2208 [\u22121, 1]. We will see that the restriction on \u03b2_t does not prevent us from obtaining parameter-free algorithms with optimal bounds.\nGiven the above, it is immediate to see that any coin betting algorithm that, on a sequence of coin flips {g_t}_{t=1}^\u221e, g_t \u2208 [\u22121, 1], bets the amounts w_t can be used as an OLO algorithm in a one-dimensional Hilbert space R. But, what would be the regret of such OLO algorithms?\nAssume that the betting algorithm at hand guarantees that its wealth is at least F(\u2211_{t=1}^{T} g_t) starting from an endowment \u03b5, for a given potential function F, then\n\n$$\\mathrm{Reward}_T = \\sum_{t=1}^{T} g_t w_t = \\mathrm{Wealth}_T - \\epsilon \\ge F\\left( \\sum_{t=1}^{T} g_t \\right) - \\epsilon . \\tag{3}$$\n\nIntuitively, if the reward is big we can expect the regret to be small. Indeed, the following lemma converts the lower bound on the reward to an upper bound on the regret.\nLemma 1 (Reward-Regret relationship [22]). Let V, V^* be a pair of dual vector spaces. Let F : V \u2192 R \u222a {+\u221e} be a proper convex lower semi-continuous function and let F^* : V^* \u2192 R \u222a {+\u221e} be its Fenchel conjugate. Let w_1, w_2, . . . , w_T \u2208 V and g_1, g_2, . . . , g_T \u2208 V^*. Let \u03b5 \u2208 R. Then,\n\n$$\\sum_{t=1}^{T} \\langle g_t, w_t \\rangle \\ge F\\left( \\sum_{t=1}^{T} g_t \\right) - \\epsilon \\qquad \\text{if and only if} \\qquad \\forall u \\in V^*, \\quad \\sum_{t=1}^{T} \\langle g_t, u - w_t \\rangle \\le F^*(u) + \\epsilon ,$$\n\nwhere the sum on the left is Reward_T and the sum on the right is Regret_T(u).\nApplying the lemma, we get a regret upper bound: Regret_T(u) \u2264 F^*(u) + \u03b5 for all u \u2208 H.\nTo summarize, if we have a betting algorithm that guarantees a minimum wealth of F(\u2211_{t=1}^{T} g_t), it can be used to design and analyze a one-dimensional OLO algorithm. The faster the growth of the wealth, the smaller the regret will be. Moreover, the lemma also shows that trying to design an algorithm that is adaptive to u is equivalent to designing an algorithm that is adaptive to \u2211_{t=1}^{T} g_t. Also, most importantly, methods that guarantee optimal wealth for the betting scenario are already known, see, e.g., [4, Chapter 9]. We can just re-use them to get optimal online algorithms!\n\n4 Designing a Betting Algorithm: Coin Betting Potentials\n\nFor sequential betting on i.i.d. coin flips, an optimal strategy has been proposed by Kelly [14]. The strategy assumes that the coin flips {g_t}_{t=1}^\u221e, g_t \u2208 {+1, \u22121}, are generated i.i.d. with known probability of heads. If p \u2208 [0, 1] is the probability of heads, the Kelly bet is to bet \u03b2_t = 2p \u2212 1 at each round. He showed that, in the long run, this strategy will provide more wealth than betting any other fixed fraction of the current wealth [14].\nFor adversarial coins, Kelly betting does not make sense. With perfect knowledge of the future, the gambler could always bet everything on the right outcome. Hence, after T rounds from an initial endowment \u03b5, the maximum wealth he would get is \u03b5 2^T. Instead, assume he bets the same fraction \u03b2 of his wealth at each round. Let Wealth_t(\u03b2) be the wealth of such a strategy after t rounds. 
As observed in [21], the optimal fixed fraction to bet is \u03b2^* = (\u2211_{t=1}^{T} g_t)/T and it gives the wealth\n\n$$\\mathrm{Wealth}_T(\\beta^*) = \\epsilon \\exp\\left( T \\cdot D\\left( \\frac{1}{2} + \\frac{\\sum_{t=1}^{T} g_t}{2T} \\,\\Big\\|\\, \\frac{1}{2} \\right) \\right) \\ge \\epsilon \\exp\\left( \\frac{\\left( \\sum_{t=1}^{T} g_t \\right)^2}{2T} \\right) , \\tag{4}$$\n\nwhere the inequality follows from Pinsker\u2019s inequality [9, Lemma 11.6.1].\nHowever, even without knowledge of the future, it is possible to go very close to the wealth in (4). This problem was studied by Krichevsky and Trofimov [16], who proposed that after seeing the coin flips g_1, g_2, . . . , g_{t\u22121} the empirical estimate k_t = (1/2 + \u2211_{i=1}^{t\u22121} 1[g_i = +1]) / t should be used instead of p. Their estimate is commonly called KT estimator.\u00b9 The KT estimator results in the betting\n\n$$\\beta_t = 2 k_t - 1 = \\frac{\\sum_{i=1}^{t-1} g_i}{t} , \\tag{5}$$\n\nwhich we call adaptive Kelly betting based on the KT estimator. It looks like an online and slightly biased version of the oracle choice of \u03b2^*. This strategy guarantees\u00b2\n\n$$\\mathrm{Wealth}_T \\ge \\frac{\\mathrm{Wealth}_T(\\beta^*)}{2\\sqrt{T}} = \\frac{\\epsilon}{2\\sqrt{T}} \\exp\\left( T \\cdot D\\left( \\frac{1}{2} + \\frac{\\sum_{t=1}^{T} g_t}{2T} \\,\\Big\\|\\, \\frac{1}{2} \\right) \\right) .$$\n\nThis guarantee is optimal up to constant factors [4] and mirrors the guarantee of the Kelly bet.\nHere, we propose a new set of definitions that allow us to generalize the strategy of adaptive Kelly betting based on the KT estimator. For these strategies it will be possible to prove that, for any g_1, g_2, . . . , g_t \u2208 [\u22121, 1],\n\n$$\\mathrm{Wealth}_t \\ge F_t\\left( \\sum_{i=1}^{t} g_i \\right) , \\tag{6}$$\n\nwhere F_t(x) is a certain function. We call such functions potentials. 
The betting strategy will be determined uniquely by the potential (see (c) in Definition 2), and we restrict our attention to potentials for which (6) holds. These constraints are specified in the definition below.\nDefinition 2 (Coin Betting Potential). Let \u03b5 > 0. Let {F_t}_{t=0}^\u221e be a sequence of functions F_t : (\u2212a_t, a_t) \u2192 R_+ where a_t > t. The sequence {F_t}_{t=0}^\u221e is called a sequence of coin betting potentials for initial endowment \u03b5, if it satisfies the following three conditions:\n(a) F_0(0) = \u03b5.\n(b) For every t \u2265 0, F_t(x) is even, logarithmically convex, strictly increasing on [0, a_t), and lim_{x \u2192 a_t} F_t(x) = +\u221e.\n(c) For every t \u2265 1, every x \u2208 [\u2212(t \u2212 1), (t \u2212 1)] and every g \u2208 [\u22121, 1], (1 + g \u03b2_t) F_{t\u22121}(x) \u2265 F_t(x + g), where\n\n$$\\beta_t = \\frac{F_t(x+1) - F_t(x-1)}{F_t(x+1) + F_t(x-1)} . \\tag{7}$$\n\nThe sequence {F_t}_{t=0}^\u221e is called a sequence of excellent coin betting potentials for initial endowment \u03b5 if it satisfies conditions (a)\u2013(c) and the condition (d) below.\n(d) For every t \u2265 0, F_t is twice-differentiable and satisfies x \u00b7 F_t''(x) \u2265 F_t'(x) for every x \u2208 [0, a_t).\nLet\u2019s give some intuition on this definition. First, let\u2019s show by induction on t that (b) and (c) of the definition together with (2) give a betting strategy that satisfies (6). The base case t = 0 is trivial. At time t \u2265 1, bet w_t = \u03b2_t Wealth_{t\u22121} where \u03b2_t is defined in (7), then\n\n$$\\mathrm{Wealth}_t = \\mathrm{Wealth}_{t-1} + w_t g_t = (1 + g_t \\beta_t) \\, \\mathrm{Wealth}_{t-1} \\ge (1 + g_t \\beta_t) \\, F_{t-1}\\left( \\sum_{i=1}^{t-1} g_i \\right) \\ge F_t\\left( \\sum_{i=1}^{t-1} g_i + g_t \\right) = F_t\\left( \\sum_{i=1}^{t} g_i \\right) .$$\n\nThe formula for the potential-based strategy (7) might seem strange. 
However, it is derived (see Theorem 8 in Appendix B) by minimizing the worst-case value of the right-hand side of the inequality used w.r.t. g_t in the induction proof above: F_{t\u22121}(x) \u2265 F_t(x + g_t) / (1 + g_t \u03b2_t).\nThe last point, (d), is a technical condition that allows us to seamlessly reduce OLO over a Hilbert space to the one-dimensional problem, characterizing the worst case direction for the reward vectors.\n\n\u00b9 Compared to the maximum likelihood estimate (\u2211_{i=1}^{t\u22121} 1[g_i = +1]) / (t \u2212 1), KT estimator shrinks slightly towards 1/2.\n\u00b2 See Appendix A for a proof. For lack of space, all the appendices are in the supplementary material.\n\nRegarding the design of coin betting potentials, we expect any potential that approximates the best possible wealth in (4) to be a good candidate. In fact, F_t(x) = \u03b5 exp(x^2/(2t)) / \u221at, essentially the potential used in the parameter-free algorithms in [22, 24] for OLO and in [6, 18, 19] for LEA, approximates (4) and it is an excellent coin betting potential (see Theorem 9 in Appendix B). Hence, our framework provides intuition to previous constructions and in Section 7 we show new examples of coin betting potentials.\nIn the next two sections, we present the reductions to effortlessly solve both the generic OLO case and LEA with a betting potential.\n\n5 From Coin Betting to OLO over Hilbert Space\n\nIn this section, generalizing the one-dimensional construction in Section 3, we show how to use a sequence of excellent coin betting potentials {F_t}_{t=0}^\u221e to construct an algorithm for OLO over a Hilbert space and how to prove a regret bound for it.\nWe define reward and wealth analogously to the one-dimensional case: Reward_t = \u2211_{i=1}^{t} \u27e8g_i, w_i\u27e9 and Wealth_t = \u03b5 + Reward_t. 
Given a sequence of coin betting potentials {F_t}_{t=0}^\u221e, using (7) we define the fraction\n\n$$\\beta_t = \\frac{F_t\\left( \\left\\| \\sum_{i=1}^{t-1} g_i \\right\\| + 1 \\right) - F_t\\left( \\left\\| \\sum_{i=1}^{t-1} g_i \\right\\| - 1 \\right)}{F_t\\left( \\left\\| \\sum_{i=1}^{t-1} g_i \\right\\| + 1 \\right) + F_t\\left( \\left\\| \\sum_{i=1}^{t-1} g_i \\right\\| - 1 \\right)} . \\tag{8}$$\n\nThe prediction of the OLO algorithm is defined similarly to the one-dimensional case, but now we also need a direction in the Hilbert space:\n\n$$w_t = \\beta_t \\, \\mathrm{Wealth}_{t-1} \\, \\frac{\\sum_{i=1}^{t-1} g_i}{\\left\\| \\sum_{i=1}^{t-1} g_i \\right\\|} = \\beta_t \\, \\frac{\\sum_{i=1}^{t-1} g_i}{\\left\\| \\sum_{i=1}^{t-1} g_i \\right\\|} \\left( \\epsilon + \\sum_{i=1}^{t-1} \\langle g_i, w_i \\rangle \\right) . \\tag{9}$$\n\nIf \u2211_{i=1}^{t\u22121} g_i is the zero vector, we define w_t to be the zero vector as well. For this prediction strategy we can prove the following regret guarantee, proved in Appendix C. The proof reduces the general Hilbert case to the 1-d case, thanks to (d) in Definition 2, then it follows the reasoning of Section 3.\nTheorem 3 (Regret Bound for OLO in Hilbert Spaces). Let {F_t}_{t=0}^\u221e be a sequence of excellent coin betting potentials. Let {g_t}_{t=1}^\u221e be any sequence of reward vectors in a Hilbert space H such that \u2016g_t\u2016 \u2264 1 for all t. Then, the algorithm that makes prediction w_t defined by (9) and (8) satisfies\n\n$$\\forall T \\ge 0 \\quad \\forall u \\in H \\qquad \\mathrm{Regret}_T(u) \\le F_T^*(\\|u\\|) + \\epsilon .$$\n\n6 From Coin Betting to Learning with Expert Advice\n\nIn this section, we show how to use the algorithm for OLO over one-dimensional Hilbert space R from Section 3 (which is itself based on a coin betting strategy) to construct an algorithm for LEA.\nLet N \u2265 2 be the number of experts and \u2206_N be the N-dimensional probability simplex. Let \u03c0 = (\u03c0_1, \u03c0_2, . 
. . , \u03c0_N) \u2208 \u2206_N be any prior distribution. Let A be an algorithm for OLO over the one-dimensional Hilbert space R, based on a sequence of the coin betting potentials {F_t}_{t=0}^\u221e with initial endowment\u00b3 1. We instantiate N copies of A.\nConsider any round t. Let w_{t,i} \u2208 R be the prediction of the i-th copy of A. The LEA algorithm computes p\u0302_t = (p\u0302_{t,1}, p\u0302_{t,2}, . . . , p\u0302_{t,N}) \u2208 R_{0,+}^N as\n\n$$\\widehat{p}_{t,i} = \\pi_i \\cdot [w_{t,i}]_+ , \\tag{10}$$\n\nwhere [x]_+ = max{0, x} is the positive part of x. Then, the LEA algorithm predicts p_t = (p_{t,1}, p_{t,2}, . . . , p_{t,N}) \u2208 \u2206_N as\n\n$$p_t = \\frac{\\widehat{p}_t}{\\|\\widehat{p}_t\\|_1} . \\tag{11}$$\n\nIf \u2016p\u0302_t\u2016_1 = 0, the algorithm predicts the prior \u03c0. Then, the algorithm receives the reward vector g_t = (g_{t,1}, g_{t,2}, . . . , g_{t,N}) \u2208 [0, 1]^N. Finally, it feeds the reward to each copy of A. The reward for the i-th copy of A is g\u0303_{t,i} \u2208 [\u22121, 1] defined as\n\n$$\\widetilde{g}_{t,i} = \\begin{cases} g_{t,i} - \\langle g_t, p_t \\rangle & \\text{if } w_{t,i} > 0 , \\\\ [g_{t,i} - \\langle g_t, p_t \\rangle]_+ & \\text{if } w_{t,i} \\le 0 . \\end{cases} \\tag{12}$$\n\n\u00b3 Any initial endowment \u03b5 > 0 can be rescaled to 1. Instead of F_t(x) we would use F_t(x)/\u03b5. The w_t would become w_t/\u03b5, but p_t is invariant to scaling of w_t. Hence, the LEA algorithm is the same regardless of \u03b5.\n\nThe construction above defines a LEA algorithm through the predictions p_t, based on the algorithm A. We can prove the following regret bound for it.\nTheorem 4 (Regret Bound for Experts). Let A be an algorithm for OLO over the one-dimensional Hilbert space R, based on the coin betting potentials {F_t}_{t=0}^\u221e for an initial endowment of 1. Let f_t^{\u22121} be the inverse of f_t(x) = ln(F_t(x)) restricted to [0, \u221e). Then, the regret of the LEA algorithm with prior \u03c0 \u2208 \u2206_N that predicts at each round with p_t in (11) satisfies\n\n$$\\forall T \\ge 0 \\quad \\forall u \\in \\Delta_N \\qquad \\mathrm{Regret}_T(u) \\le f_T^{-1}\\left( D(u\\|\\pi) \\right) .$$\n\nThe proof, in Appendix D, is based on the fact that (10)\u2013(12) guarantee that \u2211_{i=1}^{N} \u03c0_i g\u0303_{t,i} w_{t,i} \u2264 0 and on a variation of the change of measure lemma used in the PAC-Bayes literature, e.g. [20].\n\n7 Applications of the Krichevsky-Trofimov Estimator to OLO and LEA\n\nIn the previous sections, we have shown that a coin betting potential with a guaranteed rapid growth of the wealth will give good regret guarantees for OLO and LEA. Here, we show that the KT estimator has an associated excellent coin betting potential, which we call KT potential. Then, the optimal wealth guarantee of the KT potentials will translate to optimal parameter-free regret bounds.\nThe sequence of excellent coin betting potentials for an initial endowment \u03b5 corresponding to the adaptive Kelly betting strategy \u03b2_t defined by (5) based on the KT estimator is\n\n$$F_t(x) = \\epsilon \\, \\frac{2^t \\cdot \\Gamma\\left( \\frac{t+1}{2} + \\frac{x}{2} \\right) \\cdot \\Gamma\\left( \\frac{t+1}{2} - \\frac{x}{2} \\right)}{\\pi \\cdot t!} , \\qquad t \\ge 0, \\quad x \\in (-t-1, t+1), \\tag{13}$$\n\nwhere \u0393(x) = \u222b_0^\u221e t^{x\u22121} e^{\u2212t} dt is Euler\u2019s gamma function (see Theorem 13 in Appendix E). This potential was used to prove regret bounds for online prediction with the logarithmic loss [16][4, Chapter 9.7]. Theorem 13 also shows that the KT betting strategy \u03b2_t as defined by (5) satisfies (7).\nThis potential has the nice property that it satisfies the inequality in (c) of Definition 2 with equality when g_t \u2208 {\u22121, 1}, i.e. 
F_t(x + g_t) = (1 + g_t \u03b2_t) F_{t\u22121}(x).\nWe also generalize the KT potentials to \u03b4-shifted KT potentials, where \u03b4 \u2265 0, defined as\n\n$$F_t(x) = \\frac{2^t \\cdot \\Gamma(\\delta + 1) \\cdot \\Gamma\\left( \\frac{t+\\delta+1}{2} + \\frac{x}{2} \\right) \\cdot \\Gamma\\left( \\frac{t+\\delta+1}{2} - \\frac{x}{2} \\right)}{\\Gamma\\left( \\frac{\\delta+1}{2} \\right)^2 \\cdot \\Gamma(t + \\delta + 1)} .$$\n\nThe reason for its name is that, up to a multiplicative constant, F_t is equal to the KT potential shifted in time by \u03b4. Theorem 13 also proves that the \u03b4-shifted KT potentials are excellent coin betting potentials with initial endowment 1, and the corresponding betting fraction is \u03b2_t = (\u2211_{j=1}^{t\u22121} g_j) / (\u03b4 + t).\n\n7.1 OLO in Hilbert Space\n\nWe apply the KT potential for the construction of an OLO algorithm over a Hilbert space H. We will use (9), and we just need to calculate \u03b2_t. According to Theorem 13 in Appendix E, the formula for \u03b2_t simplifies to \u03b2_t = \u2016\u2211_{i=1}^{t\u22121} g_i\u2016 / t so that w_t = (1/t) (\u03b5 + \u2211_{i=1}^{t\u22121} \u27e8g_i, w_i\u27e9) \u2211_{i=1}^{t\u22121} g_i.\nThe resulting algorithm is stated as Algorithm 1. We derive a regret bound for it as a very simple corollary of Theorem 3 to the KT potential (13). The only technical part of the proof, in Appendix F, is an upper bound on F_t^* since it cannot be expressed as an elementary function.\nCorollary 5 (Regret Bound for Algorithm 1). Let \u03b5 > 0. Let {g_t}_{t=1}^\u221e be any sequence of reward vectors in a Hilbert space H such that \u2016g_t\u2016 \u2264 1. Then Algorithm 1 satisfies\n\n$$\\forall T \\ge 0 \\quad \\forall u \\in H \\qquad \\mathrm{Regret}_T(u) \\le \\|u\\| \\sqrt{ T \\ln\\left( 1 + \\frac{24 T^2 \\|u\\|^2}{\\epsilon^2} \\right) } + \\epsilon \\left( 1 - \\frac{1}{e \\sqrt{\\pi T}} \\right) .$$\n\nAlgorithm 1 Algorithm for OLO over Hilbert space H based on KT potential\nRequire: Initial endowment \u03b5 > 0\nfor t = 1, 2, . . . do\n    Predict with w_t \u2190 (1/t) (\u03b5 + \u2211_{i=1}^{t\u22121} \u27e8g_i, w_i\u27e9) \u2211_{i=1}^{t\u22121} g_i\n    Receive reward vector g_t \u2208 H such that \u2016g_t\u2016 \u2264 1\nend for\n\nAlgorithm 2 Algorithm for Learning with Expert Advice based on \u03b4-shifted KT potential\nRequire: Number of experts N, prior distribution \u03c0 \u2208 \u2206_N, number of rounds T\nfor t = 1, 2, . . . , T do\n    For each i \u2208 [N], set w_{t,i} \u2190 (\u2211_{j=1}^{t\u22121} g\u0303_{j,i}) / (t + T/2) \u00b7 (1 + \u2211_{j=1}^{t\u22121} g\u0303_{j,i} w_{j,i})\n    For each i \u2208 [N], set p\u0302_{t,i} \u2190 \u03c0_i [w_{t,i}]_+\n    Predict with p_t \u2190 p\u0302_t / \u2016p\u0302_t\u2016_1 if \u2016p\u0302_t\u2016_1 > 0, and p_t \u2190 \u03c0 if \u2016p\u0302_t\u2016_1 = 0\n    Receive reward vector g_t \u2208 [0, 1]^N\n    For each i \u2208 [N], set g\u0303_{t,i} \u2190 g_{t,i} \u2212 \u27e8g_t, p_t\u27e9 if w_{t,i} > 0, and g\u0303_{t,i} \u2190 [g_{t,i} \u2212 \u27e8g_t, p_t\u27e9]_+ if w_{t,i} \u2264 0\nend for\n\nIt is worth noting the elegance and extreme simplicity of Algorithm 1 and contrast it with the algorithms in [26, 22\u201324]. Also, the regret bound is optimal [26, 23]. The parameter \u03b5 can be safely set to any constant, e.g. 1. 
Its role is equivalent to the initial guess used in doubling tricks [25].\n\n7.2 Learning with Expert Advice\n\nWe will now construct an algorithm for LEA based on the \u03b4-shifted KT potential. We set \u03b4 to T/2, requiring the algorithm to know the number of rounds T in advance; we will fix this later with the standard doubling trick.\nTo use the construction in Section 6, we need an OLO algorithm for the 1-d Hilbert space R. Using the \u03b4-shifted KT potentials, the algorithm predicts for any sequence {g\u0303_t}_{t=1}^\u221e of reward\n\n$$w_t = \\beta_t \\, \\mathrm{Wealth}_{t-1} = \\beta_t \\left( 1 + \\sum_{j=1}^{t-1} \\widetilde{g}_j w_j \\right) = \\frac{\\sum_{i=1}^{t-1} \\widetilde{g}_i}{T/2 + t} \\left( 1 + \\sum_{j=1}^{t-1} \\widetilde{g}_j w_j \\right) .$$\n\nThen, following the construction in Section 6, we arrive at the final algorithm, Algorithm 2. We can derive a regret bound for Algorithm 2 by applying Theorem 4 to the \u03b4-shifted KT potential.\nCorollary 6 (Regret Bound for Algorithm 2). Let N \u2265 2 and T \u2265 0 be integers. Let \u03c0 \u2208 \u2206_N be a prior. Then Algorithm 2 with input N, \u03c0, T for any rewards vectors g_1, g_2, . . . , g_T \u2208 [0, 1]^N satisfies\n\n$$\\forall u \\in \\Delta_N \\qquad \\mathrm{Regret}_T(u) \\le \\sqrt{3T \\left( 3 + D(u\\|\\pi) \\right)} .$$\n\nHence, Algorithm 2 has both the best known guarantee on worst-case regret and per-round time complexity, see Table 1. Also, it has the advantage of being very simple.\nThe proof of the corollary is in Appendix F. The only technical part of the proof is an upper bound on f_t^{\u22121}(x), which we conveniently do by lower bounding F_t(x).\nThe reason for using the shifted potential comes from the analysis of f_t^{\u22121}(x). The unshifted algorithm would have an O(\u221a(T (log T + D(u\u2016\u03c0)))) regret bound; the shifting improves the bound to O(\u221a(T (1 + D(u\u2016\u03c0)))). By changing T/2 in Algorithm 2 to another constant fraction of T, it is possible to trade off between the two constants 3 present in the square root in the regret upper bound.\nThe requirement of knowing the number of rounds T in advance can be lifted by the standard doubling trick [25, Section 2.3.1], obtaining an anytime guarantee with a bigger leading constant,\n\n$$\\forall T \\ge 0 \\quad \\forall u \\in \\Delta_N \\qquad \\mathrm{Regret}_T(u) \\le \\frac{\\sqrt{2}}{\\sqrt{2} - 1} \\sqrt{3T \\left( 3 + D(u\\|\\pi) \\right)} .$$\n\nFigure 1: Total loss versus learning rate parameter of OGD (in log scale), compared with parameter-free algorithms DFEG [23], Adaptive Normal [22], PiSTOL [24] and the KT-based Algorithm 1.\n\nFigure 2: Regrets to the best expert after T = 32768 rounds, versus learning rate parameter of Hedge (in log scale). The \u201cgood\u201d experts are \u03b5 = 0.025 better than the others. The competitor algorithms are NormalHedge [6], AdaNormalHedge [19], Squint [15], and the KT-based Algorithm 2. \u03c0_i = 1/N for all algorithms.\n\n8 Discussion of the Results\n\nWe have presented a new interpretation of parameter-free algorithms as coin betting algorithms. This interpretation, far from being just a mathematical gimmick, reveals the common hidden structure of previous parameter-free algorithms for both OLO and LEA and also allows the design of new algorithms. For example, we show that the characteristic of parameter-freeness is just a consequence of having an algorithm that guarantees the maximum reward possible. The reductions in Sections 5 and 6 are also novel and they are in a certain sense optimal. 
In fact, the obtained Algorithms 1 and 2 achieve the optimal worst-case upper bounds on the regret, see [26, 23] and [4] respectively.\nWe have also run an empirical evaluation to show that the difference between classic online learning algorithms and parameter-free ones is real and not just theoretical. In Figure 1, we have used three regression datasets (footnote 4) and solved the OCO problem through OLO. In all three cases, we have used the absolute loss and normalized the input vectors to have L2 norm equal to 1. From the empirical results, it is clear that the optimal learning rate is completely data-dependent, yet parameter-free algorithms have performance very close to the unknown optimal tuning of the learning rate. Moreover, the KT-based Algorithm 1 seems to dominate all the other similar algorithms.\nFor LEA, we have used the synthetic setting in [6]. The dataset is composed of Hadamard matrices of size 64, where the row with constant values is removed, the rows are duplicated to 126 by inverting their signs, 0.025 is subtracted from k rows, and the matrix is replicated in order to generate T = 32768 samples. For more details, see [6]. Here, the KT-based algorithm is the one in Algorithm 2, where the term T/2 is removed, so that the final regret bound has an additional ln T term.\n
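As a rough illustration of the one-dimensional bettor that the LEA construction runs for each expert, the sketch below plays w_t = β_t Wealth_{t−1} with β_t equal to the sum of past rewards divided by (δ + t). The function name kt_bettor and its parameters are hypothetical and do not come from the paper. Here delta = T/2 corresponds to the shifted potential of Section 7.2, and delta = 0 is assumed to correspond to the variant used in the experiments, where the T/2 term is removed.

```python
def kt_bettor(rewards, delta=0.0, endowment=1.0):
    '''Play the (shifted) KT bet w_t = beta_t * Wealth_{t-1} on a reward sequence.

    rewards:   iterable of numbers in [-1, 1] (the coin outcomes)
    delta:     shift of the KT potential (assumed: T/2 in Section 7.2, 0 if unshifted)
    endowment: initial wealth (1 in the LEA construction)
    Returns the list of bets and the final wealth.
    '''
    wealth = endowment   # Wealth_0
    reward_sum = 0.0     # sum of the rewards seen before the current round
    bets = []
    for t, g in enumerate(rewards, start=1):
        # Betting fraction; |reward_sum| <= t - 1, so beta stays inside (-1, 1).
        beta = reward_sum / (delta + t)
        w = beta * wealth            # signed amount bet in round t
        bets.append(w)
        wealth += g * w              # Wealth_t = Wealth_{t-1} + g_t * w_t
        reward_sum += g
    return bets, wealth


if __name__ == '__main__':
    bets, final_wealth = kt_bettor([1, 1, 1])
    print(bets, final_wealth)
```

On the all-heads sequence [1, 1, 1] the bettor stakes nothing in the first round, half of its wealth in the second, and two thirds in the third, ending with wealth 2.5. A larger delta makes the early bets more conservative, which is the effect the shift is meant to have.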
Again, we see that the parameter-free algorithms have performance close to, or even better than, Hedge with an oracle tuning of the learning rate, with no clear winner among the parameter-free algorithms.\nNotice that since the adaptive Kelly strategy based on the KT estimator is very close to optimal, the only possible improvement is to have a data-dependent bound, for example like the ones in [24, 15, 19]. In future work, we will extend our definitions and reductions to the data-dependent case.\n\n4 Datasets available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.\n\n[Figure 1 panels: YearPredictionMSD, cpusmall, and cadata datasets, absolute loss; x-axis: U (log scale); y-axis: total loss; curves: OGD with η_t = U√(1/t), DFEG, AdaptiveNormal, PiSTOL, KT-based.]\n\n[Figure 2 panels: replicated Hadamard matrices, N = 126, with k = 2, 8, and 32 good experts; x-axis: U (log scale); y-axis: regret to best expert after T = 32768; curves: Hedge with η_t = U√(1/t), NormalHedge, AdaNormalHedge, Squint, KT-based.]\n\nAcknowledgments. The authors thank Jacob Abernethy, Nicolò Cesa-Bianchi, Satyen Kale, Chansoo Lee, Giuseppe Molteni, and Manfred Warmuth for useful discussions on this work.\n\nReferences\n\n[1] E. Artin. The Gamma Function. Holt, Rinehart and Winston, Inc., 1964.\n[2] N. Batir. Inequalities for the gamma function. Archiv der Mathematik, 91(6):554–563, 2008.\n[3] H. H. Bauschke and P. L. Combettes. 
Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Publishing Company, Incorporated, 1st edition, 2011.\n[4] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.\n[5] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. J. ACM, 44(3):427–485, 1997.\n[6] K. Chaudhuri, Y. Freund, and D. Hsu. A parameter-free hedging algorithm. In Advances in Neural Information Processing Systems 22, pages 297–305, 2009.\n[7] C.-P. Chen. Inequalities for the polygamma functions with application. General Mathematics, 13(3):65–72, 2005.\n[8] A. Chernov and V. Vovk. Prediction with advice of unknown number of experts. In Proc. of the 26th Conf. on Uncertainty in Artificial Intelligence. AUAI Press, 2010.\n[9] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2nd edition, 2006.\n[10] D. J. Foster. Personal communication, 2016.\n[11] D. J. Foster, A. Rakhlin, and K. Sridharan. Adaptive online learning. In Advances in Neural Information Processing Systems 28, pages 3375–3383. Curran Associates, Inc., 2015.\n[12] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sciences, 55(1):119–139, 1997.\n[13] A. Hoorfar and M. Hassani. Inequalities on the Lambert W function and hyperpower function. J. Inequal. Pure and Appl. Math, 9(2), 2008.\n[14] J. L. Kelly. A new interpretation of information rate. Information Theory, IRE Trans. on, 2(3):185–189, September 1956.\n[15] W. M. Koolen and T. van Erven. Second-order quantile methods for experts and combinatorial games. In Proc. of the 28th Conf. on Learning Theory, pages 1155–1175, 2015.\n[16] R. E. Krichevsky and V. K. Trofimov. The performance of universal encoding. IEEE Trans. on Information Theory, 27(2):199–206, 1981.\n[17] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.\n[18] H. Luo and R. E. Schapire. A drifting-games analysis for online learning and applications to boosting. In Advances in Neural Information Processing Systems 27, pages 1368–1376, 2014.\n[19] H. Luo and R. E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Proc. of the 28th Conf. on Learning Theory, pages 1286–1304, 2015.\n[20] D. McAllester. A PAC-Bayesian tutorial with a dropout bound, 2013. arXiv:1307.2118.\n[21] H. B. McMahan and J. Abernethy. Minimax optimal algorithms for unconstrained linear optimization. In Advances in Neural Information Processing Systems 26, pages 2724–2732, 2013.\n[22] H. B. McMahan and F. Orabona. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Proc. of the 27th Conf. on Learning Theory, pages 1020–1039, 2014.\n[23] F. Orabona. Dimension-free exponentiated gradient. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 1806–1814. Curran Associates, Inc., 2013.\n[24] F. Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 1116–1124, 2014.\n[25] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.\n[26] M. Streeter and B. McMahan. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems 25 (NIPS 2012), pages 2402–2410, 2012.\n[27] V. Vovk. A game of prediction with expert advice. J. Computer and System Sciences, 56:153–173, 1998.\n[28] E. T. Whittaker and G. N. Watson. A Course of Modern Analysis. 
Cambridge University Press, fourth edition, 1962. Reprinted.\n[29] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The context tree weighting method: Basic properties. IEEE Trans. on Information Theory, 41:653–664, 1995.\n\nA From Log Loss to Wealth\n\nGuarantees for betting or sequential investment algorithms are often expressed as upper bounds on the regret with respect to the log loss. Here, for the sake of completeness, we show how to convert such a guarantee to a lower bound on the wealth of the corresponding betting algorithm.\nWe consider the problem of predicting a binary outcome. At each round, the algorithm predicts a probability p_t ∈ [0, 1]. The adversary generates a sequence of outcomes x_t ∈ {0, 1} and the algorithm’s loss is\n\nℓ(p_t, x_t) = −x_t ln p_t − (1 − x_t) ln(1 − p_t) .\n\nWe define the regret with respect to a fixed probability β as\n\nRegret^logloss_T = Σ_{t=1}^{T} ℓ(p_t, x_t) − min_{β∈[0,1]} Σ_{t=1}^{T} ℓ(β, x_t) .\n\nLemma 7. Assume that an algorithm that predicts p_t guarantees Regret^logloss_T ≤ R_T. Then, the coin betting strategy with endowment ε and β_t = 2p_t − 1 guarantees\n\nWealth_T ≥ ε exp( T · D( 1/2 + (Σ_{t=1}^{T} g_t)/(2T) ‖ 1/2 ) − R_T )\n\nagainst any sequence of outcomes g_t ∈ [−1, +1].\n\nProof. Define x_t = (1 + g_t)/2. We have\n\nln Wealth_T = ln(Wealth_{T−1} + w_T g_T)\n= ln(Wealth_{T−1} (1 + g_T β_T))\n= ln( ε ∏_{t=1}^{T} (1 + g_t β_t) )\n= ln ε + Σ_{t=1}^{T} ln(1 + g_t β_t)\n≥ ln ε + Σ_{t=1}^{T} [ ((1 + g_t)/2) ln(1 + β_t) + ((1 − g_t)/2) ln(1 − β_t) ]\n= ln ε + Σ_{t=1}^{T} [ ((1 + g_t)/2) ln(2p_t) + ((1 − g_t)/2) ln(2(1 − p_t)) ]\n= ln ε + T ln 2 + Σ_{t=1}^{T} [ ((1 + g_t)/2) ln p_t + ((1 − g_t)/2) ln(1 − p_t) ]\n= ln ε + T ln 2 − Σ_{t=1}^{T} ℓ(p_t, x_t)\n= ln ε + T ln 2 − Regret^logloss_T − min_{β∈[0,1]} Σ_{t=1}^{T} ℓ(β, x_t)\n≥ ln ε + T ln 2 − R_T − min_{β∈[0,1]} Σ_{t=1}^{T} ℓ(β, x_t) ,\n\nwhere the first inequality is due to the concavity of ln and the second one is due to the assumption on the regret.\nIt is easy to see that β∗ = arg min_{β∈[0,1]} Σ_{t=1}^{T} ℓ(β, x_t) = (Σ_{t=1}^{T} x_t)/T. Hence, we have\n\nmin_{β∈[0,1]} Σ_{t=1}^{T} ℓ(β, x_t) = T (−β∗ ln β∗ − (1 − β∗) ln(1 − β∗)) .\n\nAlso, we have that for any β ∈ [0, 1]\n\n−β ln β − (1 − β) ln(1 − β) = −D(β ‖ 1/2) + ln 2 .\n\nPutting everything together, and noting that β∗ = 1/2 + (Σ_{t=1}^{T} g_t)/(2T), we have the stated lemma.\n\nThe lower bound on the wealth of the adaptive Kelly betting based on the KT estimator is obtained simply from the stated lemma, recalling that the log loss regret of the KT estimator is upper bounded by (1/2) ln T + ln 2.\n\nB Optimal Betting Fraction\nTheorem 8 (Optimal Betting Fraction). 
Let x ∈ R. Let F : [x − 1, x + 1] → R be a logarithmically convex function. Then,\n\narg min_{β∈(−1,1)} max_{g∈[−1,1]} F(x + g) / (1 + βg) = (F(x + 1) − F(x − 1)) / (F(x + 1) + F(x − 1)) .\n\nMoreover, β∗ = (F(x + 1) − F(x − 1)) / (F(x + 1) + F(x − 1)) satisfies\n\nln(F(x + 1)) − ln(1 + β∗) = ln(F(x − 1)) − ln(1 − β∗) .\n\nProof. We define the functions h, f : [−1, 1] × (−1, 1) → R as\n\nh(g, β) = F(x + g) / (1 + βg)    and    f(g, β) = ln(h(g, β)) = ln(F(x + g)) − ln(1 + βg) .\n\nClearly, arg min_{β∈(−1,1)} max_{g∈[−1,1]} h(g, β) = arg min_{β∈(−1,1)} max_{g∈[−1,1]} f(g, β) and we can work with f instead of h. The function h is logarithmically convex in g and thus f is convex in g. Therefore,\n\n∀β ∈ (−1, 1)    max_{g∈[−1,1]} f(g, β) = max{f(+1, β), f(−1, β)} .\n\nLet φ(β) = max{f(+1, β), f(−1, β)}. We seek to find arg min_{β∈(−1,1)} φ(β). Since f(+1, β) is decreasing in β and f(−1, β) is increasing in β, the minimum of φ(β) is at a point β∗ such that f(+1, β∗) = f(−1, β∗). In other words, β∗ satisfies\n\nln(F(x + 1)) − ln(1 + β∗) = ln(F(x − 1)) − ln(1 − β∗) .\n\nThe only solution of this equation is\n\nβ∗ = (F(x + 1) − F(x − 1)) / (F(x + 1) + F(x − 1)) .\n\nTheorem 9. The functions F_t(x) = ε exp( x^2/(2t) − (1/2) Σ_{i=1}^{t} 1/i ) are excellent coin betting potentials.\n\nProof. The first and second properties of Definition 2 are trivially true. 
For the third property, we\n\ufb01rst use Theorem 8 to have\n\nln(1 + \u03b2tg) \u2212 ln Ft(x + g) \u2265 ln(1 + \u03b2t) \u2212 ln Ft(x + 1) = ln\n\n2\n\nFt(x + 1) + Ft(x \u2212 1)\n\n,\n\n11\n\n\f\f\f\f\f\f\f\f\f\fCoin Betting and Parameter-Free Online Learning\n\nFrancesco Orabona\n\nStony Brook University, Stony Brook, NY\n\nD\u00b4avid P\u00b4al\n\nYahoo Research, New York, NY\n\nfrancesco@orabona.com\n\ndpal@yahoo-inc.com\n\nAbstract\n\nIn the recent years, a number of parameter-free algorithms have been developed\nfor online linear optimization over Hilbert spaces and for learning with expert ad-\nvice. These algorithms achieve optimal regret bounds that depend on the unknown\ncompetitors, without having to tune the learning rates with oracle choices.\nWe present a new intuitive framework to design parameter-free algorithms for both\nonline linear optimization over Hilbert spaces and for learning with expert advice,\nbased on reductions to betting on outcomes of adversarial coins. We instantiate\nit using a betting algorithm based on the Krichevsky-Tro\ufb01mov estimator. The\nresulting algorithms are simple, with no parameters to be tuned, and they improve\nor match previous results in terms of regret guarantee and per-round complexity.\n\nIntroduction\n\n1\nWe consider the Online Linear Optimization (OLO) [4, 25] setting. In each round t, an algorithm\nchooses a point wt from a convex decision set K and then receives a reward vector gt. The algo-\nrithm\u2019s goal is to keep its regret small, de\ufb01ned as the difference between its cumulative reward and\nthe cumulative reward of a \ufb01xed strategy u \u2208 K, that is\n\nT(cid:88)\n\n(cid:104)gt, u(cid:105) \u2212 T(cid:88)\n\n(cid:104)gt, wt(cid:105) .\n\nRegretT (u) =\n\nt=1\n\nt=1\n\nWe focus on two particular decision sets, the N-dimensional probability simplex \u2206N = {x \u2208\nRN : x \u2265 0,(cid:107)x(cid:107)1 = 1} and a Hilbert space H. 
OLO over \u2206N is referred to as the problem of\nLearning with Expert Advice (LEA). We assume bounds on the norms of the reward vectors: For\nOLO over H, we assume that (cid:107)gt(cid:107) \u2264 1, and for LEA we assume that gt \u2208 [0, 1]N .\nOLO is a basic building block of many machine learning problems. For example, Online Convex\nOptimization (OCO), the problem analogous to OLO where (cid:104)gt, u(cid:105) is generalized to an arbitrary\nconvex function (cid:96)t(u), is solved through a reduction to OLO [25]. LEA [17, 27, 5] provides a\nway of combining classi\ufb01ers and it is at the heart of boosting [12]. Batch and stochastic convex\noptimization can also be solved through a reduction to OLO [25].\nTo achieve optimal regret, most of the existing online algorithms require the user to set the learning\nrate (step size) \u03b7 to an unknown/oracle value. For example, to obtain the optimal bound for Online\nGradient Descent (OGD), the learning rate has to be set with the knowledge of the norm of the\ncompetitor u, (cid:107)u(cid:107); second entry in Table 1. Likewise, the optimal learning rate for Hedge depends on\nthe KL divergence between the prior weighting \u03c0 and the unknown competitor u, D (u(cid:107)\u03c0); seventh\nentry in Table 1. Recently, new parameter-free algorithms have been proposed, both for LEA [6, 8,\n18, 19, 15, 11] and for OLO/OCO over Hilbert spaces [26, 23, 21, 22, 24]. These algorithms adapt\nto the number of experts and to the norm of the optimal predictor, respectively, without the need\nto tune parameters. However, their design and underlying intuition is still a challenge. Foster et al.\n[11] proposed a uni\ufb01ed framework, but it is not constructive. Furthermore, all existing algorithms for\nLEA either have sub-optimal regret bound (e.g. extra O(log log T ) factor) or sub-optimal running\ntime (e.g. 
requiring solving a numerical problem in every round, or with extra factors); see Table 1.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fAlgorithm\n\nT\n\n[25]\n[25]\n\nOGD, \u03b7 = 1\u221a\nOGD, \u03b7 = U\u221a\n[23]\n[22, 24]\nThis paper, Sec. 7.1\n\nT\n\n(cid:113) ln N\n\nT\n\nHedge, \u03b7 =\nHedge, \u03b7 = U\u221a\n[6]\n[8]\n[8, 19, 15]2\n[11]\nThis paper, Sec. 7.2\n\nT , \u03c0i = 1\n[12]\n\nN [12]\n\nU\nO((cid:107)u(cid:107) ln(1 + (cid:107)u(cid:107) T )\n\nWorst-case regret guarantee\n\u221a\nT ), \u2200u \u2208 H\nO((1 + (cid:107)u(cid:107)2)\n\u221a\nT for any u \u2208 H s.t. (cid:107)u(cid:107) \u2264 U\n\u221a\nT ), \u2200u \u2208 H\n\nO((cid:107)u(cid:107)(cid:112)T ln(1 + (cid:107)u(cid:107) T )), \u2200u \u2208 H\nO((cid:107)u(cid:107)(cid:112)T ln(1 + (cid:107)u(cid:107) T )), \u2200u \u2208 H\nT ) for any u \u2208 \u2206N s.t.(cid:112)D (u(cid:107)\u03c0) \u2264 U\nO((cid:112)T (1 + D (u(cid:107)\u03c0)) + ln2 N ), \u2200u \u2208 \u2206N\nO((cid:112)T (1 + D (u(cid:107)\u03c0))), \u2200u \u2208 \u2206N\nO((cid:112)T (ln ln T + D (u(cid:107)\u03c0))), \u2200u \u2208 \u2206N\nO((cid:112)T (1 + D (u(cid:107)\u03c0))), \u2200u \u2208 \u2206N\nO((cid:112)T (1 + D (u(cid:107)\u03c0))), \u2200u \u2208 \u2206N\n\nT ln N ), \u2200u \u2208 \u2206N\n\nO(\n\n\u221a\n\n\u221a\n\nO(U\n\nPer-round time\n\ncomplexity\n\nAdaptive Uni\ufb01ed\nanalysis\n\nO(1)\nO(1)\nO(1)\nO(1)\nO(1)\nO(N )\nO(N )\nO(N K)1\nO(N K)1\nO(N )\nO(N )\n\nO(N ln maxu\u2208\u2206N D (u(cid:107)\u03c0))3\n\n(cid:88)\n(cid:88)\n(cid:88)\n\n(cid:88)\n(cid:88)\n(cid:88)\n(cid:88)\n(cid:88)\n\n(cid:88)\n\n(cid:88)\n(cid:88)\n\nTable 1: Algorithms for OLO over Hilbert space and LEA.\n\nContributions. We show that a more fundamental notion subsumes both OLO and LEA parameter-\nfree algorithms. We prove that the ability to maximize the wealth in bets on the outcomes of coin\n\ufb02ips implies OLO and LEA parameter-free algorithms. 
We develop a novel potential-based frame-\nwork for betting algorithms. It gives intuition to previous constructions and, instantiated with the\nKrichevsky-Tro\ufb01mov estimator, provides new and elegant algorithms for OLO and LEA. The new\nalgorithms also have optimal worst-case guarantees on regret and time complexity; see Table 1.\n\n2 Preliminaries\n\ncrete distributions p and q is D (p(cid:107)q) =(cid:80)\n\nWe begin by providing some de\ufb01nitions. The Kullback-Leibler (KL) divergence between two dis-\ni pi ln (pi/qi). If p, q are real numbers in [0, 1], we denote\nby D (p(cid:107)q) = p ln (p/q)+(1\u2212p) ln ((1 \u2212 p)/(1 \u2212 q)) the KL divergence between two Bernoulli dis-\ntributions with parameters p and q. We denote by H a Hilbert space, by (cid:104)\u00b7,\u00b7(cid:105) its inner product, and by\n(cid:107)\u00b7(cid:107) the induced norm. We denote by (cid:107)\u00b7(cid:107)1 the 1-norm in RN . A function F : I \u2192 R+ is called loga-\nrithmically convex iff f (x) = ln(F (x)) is convex. Let f : V \u2192 R \u222a {\u00b1\u221e}, the Fenchel conjugate\nof f is f\u2217 : V \u2217 \u2192 R\u222a{\u00b1\u221e} de\ufb01ned on the dual vector space V \u2217 by f\u2217(\u03b8) = supx\u2208V (cid:104)\u03b8, x(cid:105)\u2212f (x).\nA function f : V \u2192 R \u222a {+\u221e} is said to be proper if there exists x \u2208 V such that f (x) is \ufb01nite. If\nf is a proper lower semi-continuous convex function then f\u2217 is also proper lower semi-continuous\nconvex and f\u2217\u2217 = f.\nCoin Betting. We consider a gambler making repeated bets on the outcomes of adversarial coin\n\ufb02ips. The gambler starts with an initial endowment \u0001 > 0. In each round t, he bets on the outcome\nof a coin \ufb02ip gt \u2208 {\u22121, 1}, where +1 denotes heads and \u22121 denotes tails. We do not make any\nassumption on how gt is generated, that is, it can be chosen by an adversary.\nThe gambler can bet any amount on either heads or tails. 
However, he is not allowed to borrow any\nadditional money. If he loses, he loses the betted amount; if he wins, he gets the betted amount back\nand, in addition to that, he gets the same amount as a reward. We encode the gambler\u2019s bet in round t\nby a single number wt. The sign of wt encodes whether he is betting on heads or tails. The absolute\nvalue encodes the betted amount. We de\ufb01ne Wealtht as the gambler\u2019s wealth at the end of round t\nand Rewardt as the gambler\u2019s net reward (the difference of wealth and initial endowment), that is\n\nWealtht = \u0001 +\n\nwigi\n\nand\n\nRewardt = Wealtht \u2212 \u0001 .\n\n(1)\n\ni=1\n\nIn the following, we will also refer to a bet with \u03b2t, where \u03b2t is such that\n\n(2)\nThe absolute value of \u03b2t is the fraction of the current wealth to bet, and sign of \u03b2t encodes whether\nhe is betting on heads or tails. The constraint that the gambler cannot borrow money implies that\n\u03b2t \u2208 [\u22121, 1]. We also generalize the problem slightly by allowing the outcome of the coin \ufb02ip gt to\nbe any real number in the interval [\u22121, 1]; wealth and reward in (1) remain exactly the same.\n\nwt = \u03b2t Wealtht\u22121 .\n\n1These algorithms require to solve a numerical problem at each step. The number K is the number of steps\n\nneeded to reach the required precision. Neither the precision nor K are calculated in these papers.\n\n2The proof in [15] can be modi\ufb01ed to prove a KL bound, see http://blog.wouterkoolen.info.\n3A variant of the algorithm in [11] can be implemented with the stated time complexity [10].\n\n2\n\nt(cid:88)\n\n\f3 Warm-Up: From Betting to One-Dimensional Online Linear Optimization\n\nt=1 be its sequence of predictions on a sequence of rewards {gt}\u221e\n\nreward of the algorithm after t rounds is Rewardt =(cid:80)t\n\nIn this section, we sketch how to reduce one-dimensional OLO to betting on a coin. 
The reasoning\nfor generic Hilbert spaces (Section 5) and for LEA (Section 6) will be similar. We will show that\nthe betting view provides a natural way for the analysis and design of online learning algorithms,\nwhere the only design choice is the potential function of the betting algorithm (Section 4). A speci\ufb01c\nexample of coin betting potential and the resulting algorithms are in Section 7.\nAs a warm-up, let us consider an algorithm for OLO over one-dimensional Hilbert space R. Let\n{wt}\u221e\nt=1, gt \u2208 [\u22121, 1]. The total\ni=1 giwi. Also, even if in OLO there is no\nconcept of \u201cwealth\u201d, de\ufb01ne the wealth of the OLO algorithm as Wealtht = \u0001 + Rewardt, as in (1).\nWe now restrict our attention to algorithms whose predictions wt are of the form of a bet, that is\nwt = \u03b2t Wealtht\u22121, where \u03b2t \u2208 [\u22121, 1]. We will see that the restriction on \u03b2t does not prevent us\nfrom obtaining parameter-free algorithms with optimal bounds.\nGiven the above, it is immediate to see that any coin betting algorithm that, on a sequence of\ncoin \ufb02ips {gt}\u221e\nt=1, gt \u2208 [\u22121, 1], bets the amounts wt can be used as an OLO algorithm in a one-\ndimensional Hilbert space R. But, what would be the regret of such OLO algorithms?\n\nAssume that the betting algorithm at hand guarantees that its wealth is at least F ((cid:80)T\n\nt=1 gt) starting\n\nfrom an endowment \u0001, for a given potential function F , then\n\nRewardT =\n\ngtwt = WealthT \u2212 \u0001 \u2265 F\n\n\u2212 \u0001 .\n\n(3)\n\ngt\n\nT(cid:88)\n\n(cid:32) T(cid:88)\n\n(cid:33)\n\nt=1\n\nt=1\n\n(cid:125)\n\nt=1\n\nt=1\n\n\u2265 F\n\n\u2212 \u0001\n\ngt\n\n(cid:33)\n\n(cid:123)(cid:122)\n\n(cid:104)gt, wt(cid:105)\n\n(cid:32) T(cid:88)\n\nIntuitively, if the reward is big we can expect the regret to be small. 
Indeed, the following lemma\nconverts the lower bound on the reward to an upper bound on the regret.\nLemma 1 (Reward-Regret relationship [22]). Let V, V \u2217 be a pair of dual vector spaces. Let F :\nV \u2192 R\u222a{+\u221e} be a proper convex lower semi-continuous function and let F \u2217 : V \u2217 \u2192 R\u222a{+\u221e}\nbe its Fenchel conjugate. Let w1, w2, . . . , wT \u2208 V and g1, g2, . . . , gT \u2208 V \u2217. Let \u0001 \u2208 R. Then,\n\nT(cid:88)\n(cid:124)\nTo summarize, if we have a betting algorithm that guarantees a minimum wealth of F ((cid:80)T\nApplying the lemma, we get a regret upper bound: RegretT (u) \u2264 F \u2217(u) + \u0001 for all u \u2208 H.\nalgorithm that is adaptive to u is equivalent to designing an algorithm that is adaptive to(cid:80)T\n\nt=1 gt),\nit can be used to design and analyze a one-dimensional OLO algorithm. The faster the growth of\nthe wealth, the smaller the regret will be. Moreover, the lemma also shows that trying to design an\nt=1 gt.\nAlso, most importantly, methods that guarantee optimal wealth for the betting scenario are already\nknown, see, e.g., [4, Chapter 9]. We can just re-use them to get optimal online algorithms!\n\n\u2264 F \u2217(u) + \u0001 .\n\nT(cid:88)\n(cid:124)\n\n(cid:104)gt, u \u2212 wt(cid:105)\n\n\u2200u \u2208 V \u2217,\n\nif and only if\n\nRegretT (u)\n\n(cid:123)(cid:122)\n\nRewardT\n\n(cid:125)\n\nt=1\n\n4 Designing a Betting Algorithm: Coin Betting Potentials\n\nFor sequential betting on i.i.d. coin \ufb02ips, an optimal strategy has been proposed by Kelly [14].\nt=1, gt \u2208 {+1,\u22121}, are generated i.i.d. with known\nThe strategy assumes that the coin \ufb02ips {gt}\u221e\nprobability of heads. If p \u2208 [0, 1] is the probability of heads, the Kelly bet is to bet \u03b2t = 2p \u2212 1 at\neach round. 
He showed that, in the long run, this strategy will provide more wealth than betting any\nother \ufb01xed fraction of the current wealth [14].\nFor adversarial coins, Kelly betting does not make sense. With perfect knowledge of the future, the\ngambler could always bet everything on the right outcome. Hence, after T rounds from an initial\nendowment \u0001, the maximum wealth he would get is \u00012T . Instead, assume he bets the same fraction \u03b2\nof its wealth at each round. Let Wealtht(\u03b2) the wealth of such strategy after t rounds. As observed\n\nin [21], the optimal \ufb01xed fraction to bet is \u03b2\u2217 = ((cid:80)T\n(cid:80)T\nt=1 gt)/T and it gives the wealth\nt=1 gt)2\nt=1 gt\n2T\n2T\n\n(cid:17)(cid:17) \u2265 \u0001 exp\n\nWealthT (\u03b2\u2217) = \u0001 exp\n\n(cid:16) ((cid:80)T\n\n,\n\n(4)\n\n(cid:16) 1\n\nT \u00b7 D\n\n(cid:17)\n\n(cid:16)\n\n2 +\n\n(cid:13)(cid:13)(cid:13) 1\n\n2\n\n3\n\n\fwhere the inequality follows from Pinsker\u2019s inequality [9, Lemma 11.6.1].\nHowever, even without knowledge of the future, it is possible to go very close to the wealth in (4).\nThis problem was studied by Krichevsky and Tro\ufb01mov [16], who proposed that after seeing the coin\nshould be used instead of p.\n\ni=1 1[gi=+1]\n\nTheir estimate is commonly called KT estimator.1 The KT estimator results in the betting\n\n\ufb02ips g1, g2, . . . , gt\u22121 the empirical estimate kt = 1/2+(cid:80)t\u22121\n(cid:80)t\u22121\n(cid:16)\n\n\u03b2t = 2kt \u2212 1 =\n\nt\n\nt\n\ni=1 gi\n\nWealthT \u2265 WealthT (\u03b2\u2217)\n\n\u221a\n2\n\nT\n\n\u221a\n= \u0001\n2\n\nT\n\nexp\n\nT \u00b7 D\n\nwhich we call adaptive Kelly betting based on the KT estimator. It looks like an online and slightly\nbiased version of the oracle choice of \u03b2\u2217. 
This strategy guarantees2\n\n(5)\n\n(cid:16) 1\n\n(cid:80)T\n\n2 +\n\nt=1 gt\n2T\n\n(cid:17)(cid:17)\n\n(cid:13)(cid:13)(cid:13) 1\n\n2\n\n.\n\nThis guarantee is optimal up to constant factors [4] and mirrors the guarantee of the Kelly bet.\nHere, we propose a new set of de\ufb01nitions that allows to generalize the strategy of adaptive Kelly\nbetting based on the KT estimator. For these strategies it will be possible to prove that, for any\ng1, g2, . . . , gt \u2208 [\u22121, 1],\n\n(cid:33)\n\nWealtht \u2265 Ft\n\ngi\n\n,\n\n(6)\n\n(cid:32) t(cid:88)\n\ni=1\n\nwhere Ft(x) is a certain function. We call such functions potentials. The betting strategy will be\ndetermined uniquely by the potential (see (c) in the De\ufb01nition 2), and we restrict our attention to\npotentials for which (6) holds. These constraints are speci\ufb01ed in the de\ufb01nition below.\nDe\ufb01nition 2 (Coin Betting Potential). Let \u0001 > 0. Let {Ft}\u221e\nFt : (\u2212at, at) \u2192 R+ where at > t. The sequence {Ft}\u221e\npotentials for initial endowment \u0001, if it satis\ufb01es the following three conditions:\n\nt=0 be a sequence of functions\nt=0 is called a sequence of coin betting\n\n(a) F0(0) = \u0001.\n(b) For every t \u2265 0, Ft(x) is even, logarithmically convex, strictly increasing on [0, at), and\n\n(c) For every t \u2265 1, every x \u2208 [\u2212(t \u2212 1), (t \u2212 1)] and every g \u2208 [\u22121, 1], (1 + g\u03b2t) Ft\u22121(x) \u2265\n\nlimx\u2192at Ft(x) = +\u221e.\n\nFt(x + g), where\n\n\u03b2t = Ft(x+1)\u2212Ft(x\u22121)\nFt(x+1)+Ft(x\u22121) .\n\n(7)\n\nt=0 is called a sequence of excellent coin betting potentials for initial endow-\n\nThe sequence {Ft}\u221e\nment \u0001 if it satis\ufb01es conditions (a)\u2013(c) and the condition (d) below.\nt (x) for every x \u2208 [0, at).\n(d) For every t \u2265 0, Ft is twice-differentiable and satis\ufb01es x\u00b7 F (cid:48)(cid:48)\nLet\u2019s give some intuition on this de\ufb01nition. 
First, let\u2019s show by induction on t that (b) and (c) of the\nde\ufb01nition together with (2) give a betting strategy that satis\ufb01es (6). The base case t = 0 is trivial.\nAt time t \u2265 1, bet wt = \u03b2t Wealtht\u22121 where \u03b2t is de\ufb01ned in (7), then\n\nt (x) \u2265 F (cid:48)\n\nWealtht = Wealtht\u22121 +wtgt = (1 + gt\u03b2t) Wealtht\u22121\n\n(cid:32)t\u22121(cid:88)\n\n(cid:33)\n\n(cid:32)t\u22121(cid:88)\n\n(cid:33)\n\n(cid:32) t(cid:88)\n\n(cid:33)\n\n\u2265 (1 + gt\u03b2t)Ft\u22121\n\n\u2265 Ft\n\ngi\n\ngi + gt\n\n= Ft\n\ngi\n\n.\n\ni=1\n\ni=1\n\ni=1\n\nThe formula for the potential-based strategy (7) might seem strange. However, it is derived\u2014see\nTheorem 8 in Appendix B\u2014by minimizing the worst-case value of the right-hand side of the in-\nequality used w.r.t. to gt in the induction proof above: Ft\u22121(x) \u2265 Ft(x+gt)\nThe last point, (d), is a technical condition that allows us to seamlessly reduce OLO over a Hilbert\nspace to the one-dimensional problem, characterizing the worst case direction for the reward vectors.\n\n1+gt\u03b2t\n\n.\n\n1Compared to the maximum likelihood estimate\n2See Appendix A for a proof. For lack of space, all the appendices are in the supplementary material.\n\n, KT estimator shrinks slightly towards 1/2.\n\ni=1 1[gi=+1]\n\nt\u22121\n\n(cid:80)t\u22121\n\n4\n\n\fpossible wealth in (4) to be a good candidate. In fact, Ft(x) = \u0001 exp(cid:0)x2/(2t)(cid:1) /\n\nRegarding the design of coin betting potentials, we expect any potential that approximates the best\nt, essentially\nthe potential used in the parameter-free algorithms in [22, 24] for OLO and in [6, 18, 19] for LEA,\napproximates (4) and it is an excellent coin betting potential\u2014see Theorem 9 in Appendix B. 
Hence,\nour framework provides intuition to previous constructions and in Section 7 we show new examples\nof coin betting potentials.\nIn the next two sections, we presents the reductions to effortlessly solve both the generic OLO case\nand LEA with a betting potential.\n\n\u221a\n\n5 From Coin Betting to OLO over Hilbert Space\n\nWe de\ufb01ne reward and wealth analogously to the one-dimensional case: Rewardt =(cid:80)t\n\nIn this section, generalizing the one-dimensional construction in Section 3, we show how to use\na sequence of excellent coin betting potentials {Ft}\u221e\nt=0 to construct an algorithm for OLO over a\nHilbert space and how to prove a regret bound for it.\ni=1(cid:104)gi, wi(cid:105)\nt=0, using (7) we\n\nand Wealtht = \u0001 + Rewardt. Given a sequence of coin betting potentials {Ft}\u221e\nde\ufb01ne the fraction\n\n\u03b2t =\n\n.\n\n(8)\n\nThe prediction of the OLO algorithm is de\ufb01ned similarly to the one-dimensional case, but now we\nalso need a direction in the Hilbert space:\n\ni=1 gi(cid:107)\u22121)\ni=1 gi(cid:107)\u22121)\n\nFt((cid:107)(cid:80)t\u22121\nFt((cid:107)(cid:80)t\u22121\n(cid:80)t\u22121\n(cid:13)(cid:13)(cid:13)(cid:80)t\u22121\n\ni=1 gi(cid:107)+1)\u2212Ft((cid:107)(cid:80)t\u22121\ni=1 gi(cid:107)+1)+Ft((cid:107)(cid:80)t\u22121\n(cid:80)t\u22121\n(cid:13)(cid:13)(cid:13) = \u03b2t\n(cid:13)(cid:13)(cid:13)(cid:80)t\u22121\n\ni=1 gi\ni=1 gi\n\ni=1 gi\ni=1 gi\n\n(cid:13)(cid:13)(cid:13)\n\n(cid:32)\n\nt\u22121(cid:88)\n\ni=1\n\n(cid:33)\n\nIf(cid:80)t\u22121\n\nwt = \u03b2t Wealtht\u22121\n\n\u0001 +\n\n(cid:104)gi, wi(cid:105)\n\n.\n\n(9)\n\ni=1 gi is the zero vector, we de\ufb01ne wt to be the zero vector as well. For this prediction strategy\nwe can prove the following regret guarantee, proved in Appendix C. The proof reduces the general\nHilbert case to the 1-d case, thanks to (d) in De\ufb01nition 2, then it follows the reasoning of Section 3.\nTheorem 3 (Regret Bound for OLO in Hilbert Spaces). 
Let $\\{F_t\\}_{t=0}^{\\infty}$ be a sequence of excellent coin betting potentials. Let $\\{g_t\\}_{t=1}^{\\infty}$ be any sequence of reward vectors in a Hilbert space $H$ such that $\\|g_t\\| \\le 1$ for all $t$. Then, the algorithm that makes prediction $w_t$ defined by (9) and (8) satisfies\n\n$$\\forall T \\ge 0 \\quad \\forall u \\in H \\qquad \\mathrm{Regret}_T(u) \\le F_T^*(\\|u\\|) + \\epsilon .$$\n\n6 From Coin Betting to Learning with Expert Advice\n\nIn this section, we show how to use the algorithm for OLO over one-dimensional Hilbert space $\\mathbb{R}$ from Section 3 (which is itself based on a coin betting strategy) to construct an algorithm for LEA.\nLet $N \\ge 2$ be the number of experts and $\\Delta_N$ be the $N$-dimensional probability simplex. Let $\\pi = (\\pi_1, \\pi_2, \\ldots, \\pi_N) \\in \\Delta_N$ be any prior distribution. Let $A$ be an algorithm for OLO over the one-dimensional Hilbert space $\\mathbb{R}$, based on a sequence of the coin betting potentials $\\{F_t\\}_{t=0}^{\\infty}$ with initial endowment$^3$ 1. We instantiate $N$ copies of $A$.\nConsider any round $t$. Let $w_{t,i} \\in \\mathbb{R}$ be the prediction of the $i$-th copy of $A$. The LEA algorithm computes $\\hat{p}_t = (\\hat{p}_{t,1}, \\hat{p}_{t,2}, \\ldots, \\hat{p}_{t,N}) \\in \\mathbb{R}^N_{0,+}$ as\n\n$$\\hat{p}_{t,i} = \\pi_i \\cdot [w_{t,i}]_+ , \\qquad (10)$$\n\nwhere $[x]_+ = \\max\\{0, x\\}$ is the positive part of $x$. Then, the LEA algorithm predicts $p_t = (p_{t,1}, p_{t,2}, \\ldots, p_{t,N}) \\in \\Delta_N$ as\n\n$$p_t = \\frac{\\hat{p}_t}{\\|\\hat{p}_t\\|_1} . \\qquad (11)$$\n\nIf $\\|\\hat{p}_t\\|_1 = 0$, the algorithm predicts the prior $\\pi$. Then, the algorithm receives the reward vector $g_t = (g_{t,1}, g_{t,2}, \\ldots, g_{t,N}) \\in [0, 1]^N$. Finally, it feeds the reward to each copy of $A$. The reward for the $i$-th copy of $A$ is $\\tilde{g}_{t,i} \\in [-1, 1]$ defined as\n\n$$\\tilde{g}_{t,i} = g_{t,i} - \\langle g_t, p_t \\rangle \\quad \\text{if } w_{t,i} > 0 , \\qquad \\tilde{g}_{t,i} = [g_{t,i} - \\langle g_t, p_t \\rangle]_+ \\quad \\text{if } w_{t,i} \\le 0 . \\qquad (12)$$\n\n3 Any initial endowment $\\epsilon > 0$ can be rescaled to 1. Instead of $F_t(x)$ we would use $F_t(x)/\\epsilon$. The $w_t$ would become $w_t/\\epsilon$, but $p_t$ is invariant to scaling of $w_t$. Hence, the LEA algorithm is the same regardless of $\\epsilon$.\n\nThe construction above defines a LEA algorithm, given by the predictions $p_t$, based on the algorithm $A$. We can prove the following regret bound for it.\nTheorem 4 (Regret Bound for Experts). Let $A$ be an algorithm for OLO over the one-dimensional Hilbert space $\\mathbb{R}$, based on the coin betting potentials $\\{F_t\\}_{t=0}^{\\infty}$ for an initial endowment of 1. Let $f_t^{-1}$ be the inverse of $f_t(x) = \\ln(F_t(x))$ restricted to $[0, \\infty)$. Then, the regret of the LEA algorithm with prior $\\pi \\in \\Delta_N$ that predicts at each round with $p_t$ in (11) satisfies\n\n$$\\forall T \\ge 0 \\quad \\forall u \\in \\Delta_N \\qquad \\mathrm{Regret}_T(u) \\le f_T^{-1}\\left(D(u \\| \\pi)\\right) .$$\n\nThe proof, in Appendix D, is based on the fact that (10)\u2013(12) guarantee that $\\sum_{i=1}^{N} \\pi_i \\tilde{g}_{t,i} w_{t,i} \\le 0$ and on a variation of the change of measure lemma used in the PAC-Bayes literature, e.g. [20].\n\n7 Applications of the Krichevsky-Trofimov Estimator to OLO and LEA\n\nIn the previous sections, we have shown that a coin betting potential with a guaranteed rapid growth of the wealth will give good regret guarantees for OLO and LEA. Here, we show that the KT estimator has an associated excellent coin betting potential, which we call KT potential. Then, the optimal wealth guarantee of the KT potentials will translate to optimal parameter-free regret bounds.\nThe sequence of excellent coin betting potentials for an initial endowment $\\epsilon$ corresponding to the adaptive Kelly betting strategy $\\beta_t$ defined by (5) based on the KT estimator is\n\n$$F_t(x) = \\epsilon \\, \\frac{2^t \\cdot \\Gamma\\left(\\frac{t+1}{2} + \\frac{x}{2}\\right) \\cdot \\Gamma\\left(\\frac{t+1}{2} - \\frac{x}{2}\\right)}{\\pi \\cdot t!} , \\qquad t \\ge 0, \\quad x \\in (-t-1, t+1), \\qquad (13)$$\n\nwhere $\\Gamma(x) = \\int_0^{\\infty} t^{x-1} e^{-t} \\, dt$ is Euler's gamma function (see Theorem 13 in Appendix E). This potential was used to prove regret bounds for online prediction with the logarithmic loss [16][4, Chapter 9.7]. Theorem 13 also shows that the KT betting strategy $\\beta_t$ as defined by (5) satisfies (7).\nThis potential has the nice property that it satisfies the inequality in (c) of Definition 2 with equality when $g_t \\in \\{-1, 1\\}$, i.e. $F_t(x + g_t) = (1 + g_t \\beta_t) \\, F_{t-1}(x)$.\nWe also generalize the KT potentials to $\\delta$-shifted KT potentials, where $\\delta \\ge 0$, defined as\n\n$$F_t(x) = \\frac{2^t \\cdot \\Gamma(\\delta + 1) \\cdot \\Gamma\\left(\\frac{t+\\delta+1}{2} + \\frac{x}{2}\\right) \\cdot \\Gamma\\left(\\frac{t+\\delta+1}{2} - \\frac{x}{2}\\right)}{\\Gamma\\left(\\frac{\\delta+1}{2}\\right)^2 \\cdot \\Gamma(t + \\delta + 1)} .$$\n\nThe reason for its name is that, up to a multiplicative constant, $F_t$ is equal to the KT potential shifted in time by $\\delta$. Theorem 13 also proves that the $\\delta$-shifted KT potentials are excellent coin betting potentials with initial endowment 1, and the corresponding betting fraction is $\\beta_t = \\frac{\\sum_{j=1}^{t-1} g_j}{\\delta + t}$.\n\n7.1 OLO in Hilbert Space\n\nWe apply the KT potential for the construction of an OLO algorithm over a Hilbert space $H$. We will use (9), and we just need to calculate $\\beta_t$.
According to Theorem 13 in Appendix E, the formula for $\\beta_t$ simplifies to $\\beta_t = \\frac{\\left\\|\\sum_{i=1}^{t-1} g_i\\right\\|}{t}$ so that $w_t = \\frac{1}{t} \\left(\\epsilon + \\sum_{i=1}^{t-1} \\langle g_i, w_i \\rangle\\right) \\sum_{i=1}^{t-1} g_i$.\nThe resulting algorithm is stated as Algorithm 1. We derive a regret bound for it as a very simple corollary of Theorem 3 to the KT potential (13). The only technical part of the proof, in Appendix F, is an upper bound on $F_t^*$ since it cannot be expressed as an elementary function.\nCorollary 5 (Regret Bound for Algorithm 1). Let $\\epsilon > 0$. Let $\\{g_t\\}_{t=1}^{\\infty}$ be any sequence of reward vectors in a Hilbert space $H$ such that $\\|g_t\\| \\le 1$. Then Algorithm 1 satisfies\n\n$$\\forall T \\ge 0 \\quad \\forall u \\in H \\qquad \\mathrm{Regret}_T(u) \\le \\|u\\| \\sqrt{T \\ln\\left(1 + \\frac{24 T^2 \\|u\\|^2}{\\epsilon^2}\\right)} + \\epsilon \\left(1 - \\frac{1}{e \\sqrt{\\pi T}}\\right) .$$\n\nAlgorithm 1 Algorithm for OLO over Hilbert space $H$ based on KT potential\nRequire: Initial endowment $\\epsilon > 0$\nfor $t = 1, 2, \\ldots$ do\n    Predict with $w_t \\leftarrow \\frac{1}{t} \\left(\\epsilon + \\sum_{i=1}^{t-1} \\langle g_i, w_i \\rangle\\right) \\sum_{i=1}^{t-1} g_i$\n    Receive reward vector $g_t \\in H$ such that $\\|g_t\\| \\le 1$\nend for\n\nAlgorithm 2 Algorithm for Learning with Expert Advice based on $\\delta$-shifted KT potential\nRequire: Number of experts $N$, prior distribution $\\pi \\in \\Delta_N$, number of rounds $T$\nfor $t = 1, 2, \\ldots, T$ do\n    For each $i \\in [N]$, set $w_{t,i} \\leftarrow \\frac{\\sum_{j=1}^{t-1} \\tilde{g}_{j,i}}{t + T/2} \\left(1 + \\sum_{j=1}^{t-1} \\tilde{g}_{j,i} w_{j,i}\\right)$\n    For each $i \\in [N]$, set $\\hat{p}_{t,i} \\leftarrow \\pi_i [w_{t,i}]_+$\n    Predict with $p_t \\leftarrow \\hat{p}_t / \\|\\hat{p}_t\\|_1$ if $\\|\\hat{p}_t\\|_1 > 0$, and $p_t \\leftarrow \\pi$ if $\\|\\hat{p}_t\\|_1 = 0$\n    Receive reward vector $g_t \\in [0, 1]^N$\n    For each $i \\in [N]$, set $\\tilde{g}_{t,i} \\leftarrow g_{t,i} - \\langle g_t, p_t \\rangle$ if $w_{t,i} > 0$, and $\\tilde{g}_{t,i} \\leftarrow [g_{t,i} - \\langle g_t, p_t \\rangle]_+$ if $w_{t,i} \\le 0$\nend for\n\nIt is worth noting the elegance and extreme simplicity of Algorithm 1 and contrast it with the algorithms in [26, 22\u201324]. Also, the regret bound is optimal [26, 23]. The parameter $\\epsilon$ can be safely set to any constant, e.g. 1. Its role is equivalent to the initial guess used in doubling tricks [25].\n\n7.2 Learning with Expert Advice\n\nWe will now construct an algorithm for LEA based on the $\\delta$-shifted KT potential. We set $\\delta$ to $T/2$, requiring the algorithm to know the number of rounds $T$ in advance; we will fix this later with the standard doubling trick.\nTo use the construction in Section 6, we need an OLO algorithm for the 1-d Hilbert space $\\mathbb{R}$. Using the $\\delta$-shifted KT potentials, the algorithm predicts for any sequence $\\{\\tilde{g}_t\\}_{t=1}^{\\infty}$ of reward\n\n$$w_t = \\beta_t \\, \\mathrm{Wealth}_{t-1} = \\beta_t \\left(1 + \\sum_{j=1}^{t-1} \\tilde{g}_j w_j\\right) = \\frac{\\sum_{i=1}^{t-1} \\tilde{g}_i}{T/2 + t} \\left(1 + \\sum_{j=1}^{t-1} \\tilde{g}_j w_j\\right) .$$\n\nThen, following the construction in Section 6, we arrive at the final algorithm, Algorithm 2. We can derive a regret bound for Algorithm 2 by applying Theorem 4 to the $\\delta$-shifted KT potential.\nCorollary 6 (Regret Bound for Algorithm 2). Let $N \\ge 2$ and $T \\ge 0$ be integers.
Let $\\pi \\in \\Delta_N$ be a prior. Then Algorithm 2 with input $N, \\pi, T$ for any reward vectors $g_1, g_2, \\ldots, g_T \\in [0, 1]^N$ satisfies\n\n$$\\forall u \\in \\Delta_N \\qquad \\mathrm{Regret}_T(u) \\le \\sqrt{3T \\left(3 + D(u \\| \\pi)\\right)} .$$\n\nHence, Algorithm 2 has both the best known guarantee on worst-case regret and per-round time complexity, see Table 1. Also, it has the advantage of being very simple.\nThe proof of the corollary is in Appendix F. The only technical part of the proof is an upper bound on $f_t^{-1}(x)$, which we conveniently obtain by lower bounding $F_t(x)$.\nThe reason for using the shifted potential comes from the analysis of $f_t^{-1}(x)$. The unshifted algorithm would have a $O\\left(\\sqrt{T (\\log T + D(u \\| \\pi))}\\right)$ regret bound; the shifting improves the bound to $O\\left(\\sqrt{T (1 + D(u \\| \\pi))}\\right)$. By changing $T/2$ in Algorithm 2 to another constant fraction of $T$, it is possible to trade off between the two constants 3 present in the square root in the regret upper bound.\nThe requirement of knowing the number of rounds $T$ in advance can be lifted by the standard doubling trick [25, Section 2.3.1], obtaining an anytime guarantee with a bigger leading constant,\n\n$$\\forall T \\ge 0 \\quad \\forall u \\in \\Delta_N \\qquad \\mathrm{Regret}_T(u) \\le \\frac{\\sqrt{2}}{\\sqrt{2} - 1} \\sqrt{3T \\left(3 + D(u \\| \\pi)\\right)} .$$\n\nFigure 1: Total loss versus learning rate parameter of OGD (in log scale), compared with parameter-free algorithms DFEG [23], Adaptive Normal [22], PiSTOL [24] and the KT-based Algorithm 1.\n\nFigure 2: Regrets to the best expert after $T = 32768$ rounds, versus learning rate parameter of Hedge (in log scale). The \u201cgood\u201d experts are $\\epsilon = 0.025$ better than the others. The competitor algorithms are NormalHedge [6], AdaNormalHedge [19], Squint [15], and the KT-based Algorithm 2.
$\\pi_i = 1/N$ for all algorithms.\n\n8 Discussion of the Results\n\nWe have presented a new interpretation of parameter-free algorithms as coin betting algorithms. This interpretation, far from being just a mathematical gimmick, reveals the common hidden structure of previous parameter-free algorithms for both OLO and LEA and also allows the design of new algorithms. For example, we show that the characteristic of parameter-freeness is just a consequence of having an algorithm that guarantees the maximum reward possible. The reductions in Sections 5 and 6 are also novel and they are in a certain sense optimal. In fact, the obtained Algorithms 1 and 2 achieve the optimal worst case upper bounds on the regret, see [26, 23] and [4] respectively.\nWe have also run an empirical evaluation to show that the difference between classic online learning algorithms and parameter-free ones is real and not just theoretical. In Figure 1, we have used three regression datasets$^4$, and solved the OCO problem through OLO. In all the three cases, we have used the absolute loss and normalized the input vectors to have L2 norm equal to 1. From the empirical results, it is clear that the optimal learning rate is completely data-dependent, yet parameter-free algorithms have performance very close to the unknown optimal tuning of the learning rate. Moreover, the KT-based Algorithm 1 seems to dominate all the other similar algorithms.\nFor LEA, we have used the synthetic setting in [6]. The dataset is composed of Hadamard matrices of size 64, where the row with constant values is removed, the rows are duplicated to 126 inverting their signs, 0.025 is subtracted from $k$ rows, and the matrix is replicated in order to generate $T = 32768$ samples. For more details, see [6]. Here, the KT-based algorithm is the one in Algorithm 2, where the term $T/2$ is removed, so that the final regret bound has an additional $\\ln T$ term.
Again, we see that the parameter-free algorithms have a performance close to, or even better than, Hedge with an oracle tuning of the learning rate, with no clear winners among the parameter-free algorithms.\nNotice that since the adaptive Kelly strategy based on the KT estimator is very close to optimal, the only possible improvement is to have a data-dependent bound, for example like the ones in [24, 15, 19]. In future work, we will extend our definitions and reductions to the data-dependent case.\n\n4 Datasets available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.\n\n[Figure 1 panels: YearPredictionMSD, cpusmall, and cadata datasets, absolute loss. Horizontal axis: $U$ (log scale); vertical axis: total loss. Curves: OGD with $\\eta_t = U\\sqrt{1/t}$, DFEG, AdaptiveNormal, PiSTOL, KT-based.]\n\n[Figure 2 panels: replicated Hadamard matrices, $N = 126$, with $k = 2$, $k = 8$, and $k = 32$ good experts. Horizontal axis: $U$ (log scale); vertical axis: regret to best expert after $T = 32768$. Curves: Hedge with $\\eta_t = U\\sqrt{1/t}$, NormalHedge, AdaNormalHedge, Squint, KT-based.]\n\nAcknowledgments. The authors thank Jacob Abernethy, Nicol\u00f2 Cesa-Bianchi, Satyen Kale, Chansoo Lee, Giuseppe Molteni, and Manfred Warmuth for useful discussions on this work.\n\nReferences\n\n[1] E. Artin. The Gamma Function. Holt, Rinehart and Winston, Inc., 1964.\n[2] N. Batir. Inequalities for the gamma function. Archiv der Mathematik, 91(6):554\u2013563, 2008.\n[3] H. H. Bauschke and P. L. Combettes.
Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Publishing Company, Incorporated, 1st edition, 2011.\n[4] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.\n[5] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. J. ACM, 44(3):427\u2013485, 1997.\n[6] K. Chaudhuri, Y. Freund, and D. Hsu. A parameter-free hedging algorithm. In Advances in Neural Information Processing Systems 22, pages 297\u2013305, 2009.\n[7] C.-P. Chen. Inequalities for the polygamma functions with application. General Mathematics, 13(3):65\u201372, 2005.\n[8] A. Chernov and V. Vovk. Prediction with advice of unknown number of experts. In Proc. of the 26th Conf. on Uncertainty in Artificial Intelligence. AUAI Press, 2010.\n[9] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2nd edition, 2006.\n[10] D. J. Foster. personal communication, 2016.\n[11] D. J. Foster, A. Rakhlin, and K. Sridharan. Adaptive online learning. In Advances in Neural Information Processing Systems 28, pages 3375\u20133383. Curran Associates, Inc., 2015.\n[12] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sciences, 55(1):119\u2013139, 1997.\n[13] A. Hoorfar and M. Hassani. Inequalities on the Lambert W function and hyperpower function. J. Inequal. Pure and Appl. Math, 9(2), 2008.\n[14] J. L. Kelly. A new interpretation of information rate. Information Theory, IRE Trans. on, 2(3):185\u2013189, September 1956.\n[15] W. M. Koolen and T. van Erven. Second-order quantile methods for experts and combinatorial games. In Proc. of the 28th Conf. on Learning Theory, pages 1155\u20131175, 2015.\n[16] R. E. Krichevsky and V. K. Trofimov. The performance of universal encoding. IEEE Trans.
on Information Theory, 27(2):199\u2013206, 1981.\n[17] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212\u2013261, 1994.\n[18] H. Luo and R. E. Schapire. A drifting-games analysis for online learning and applications to boosting. In Advances in Neural Information Processing Systems 27, pages 1368\u20131376, 2014.\n[19] H. Luo and R. E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Proc. of the 28th Conf. on Learning Theory, pages 1286\u20131304, 2015.\n[20] D. McAllester. A PAC-Bayesian tutorial with a dropout bound, 2013. arXiv:1307.2118.\n[21] H. B. McMahan and J. Abernethy. Minimax optimal algorithms for unconstrained linear optimization. In Advances in Neural Information Processing Systems 26, pages 2724\u20132732, 2013.\n[22] H. B. McMahan and F. Orabona. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Proc. of the 27th Conf. on Learning Theory, pages 1020\u20131039, 2014.\n[23] F. Orabona. Dimension-free exponentiated gradient. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 1806\u20131814. Curran Associates, Inc., 2013.\n[24] F. Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 1116\u20131124, 2014.\n[25] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107\u2013194, 2011.\n[26] M. Streeter and B. McMahan. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems 25 (NIPS 2012), pages 2402\u20132410, 2012.\n[27] V. Vovk. A game of prediction with expert advice. J. Computer and System Sciences, 56:153\u2013173, 1998.\n[28] E. T. Whittaker and G. N. Watson. A Course of Modern Analysis.
Cambridge University Press, fourth edition, 1962. Reprinted.\n[29] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The context tree weighting method: Basic properties. IEEE Trans. on Information Theory, 41:653\u2013664, 1995.\n\nA From Log Loss to Wealth\n\nGuarantees for betting or sequential investment algorithms are often expressed as upper bounds on the regret with respect to the log loss. Here, for the sake of completeness, we show how to convert such a guarantee to a lower bound on the wealth of the corresponding betting algorithm.\nWe consider the problem of predicting a binary outcome. The algorithm predicts at each round a probability $p_t \\in [0, 1]$. The adversary generates a sequence of outcomes $x_t \\in \\{0, 1\\}$ and the algorithm's loss is\n\n$$\\ell(p_t, x_t) = -x_t \\ln p_t - (1 - x_t) \\ln(1 - p_t) .$$\n\nWe define the regret with respect to a fixed probability $\\beta$ as\n\n$$\\mathrm{Regret}^{\\mathrm{logloss}}_T = \\sum_{t=1}^{T} \\ell(p_t, x_t) - \\min_{\\beta \\in [0,1]} \\sum_{t=1}^{T} \\ell(\\beta, x_t) .$$\n\nLemma 7. Assume that an algorithm that predicts $p_t$ guarantees $\\mathrm{Regret}^{\\mathrm{logloss}}_T \\le R_T$. Then, the coin betting strategy with endowment $\\epsilon$ and $\\beta_t = 2 p_t - 1$ guarantees\n\n$$\\mathrm{Wealth}_T \\ge \\epsilon \\exp\\left(T \\cdot D\\left(\\frac{1}{2} + \\frac{\\sum_{t=1}^{T} g_t}{2T} \\,\\Big\\|\\, \\frac{1}{2}\\right) - R_T\\right)$$\n\nagainst any sequence of outcomes $g_t \\in [-1, +1]$.\n\nProof. Define $x_t = \\frac{1 + g_t}{2}$.
We have\n\n$$\\ln \\mathrm{Wealth}_T = \\ln(\\mathrm{Wealth}_{T-1} + w_T g_T) = \\ln\\left(\\mathrm{Wealth}_{T-1} (1 + g_T \\beta_T)\\right) = \\ln\\left(\\epsilon \\prod_{t=1}^{T} (1 + g_t \\beta_t)\\right) = \\ln \\epsilon + \\sum_{t=1}^{T} \\ln(1 + g_t \\beta_t)$$\n$$\\ge \\ln \\epsilon + \\sum_{t=1}^{T} \\left[\\frac{1 + g_t}{2} \\ln(1 + \\beta_t) + \\frac{1 - g_t}{2} \\ln(1 - \\beta_t)\\right]$$\n$$= \\ln \\epsilon + \\sum_{t=1}^{T} \\left[\\frac{1 + g_t}{2} \\ln(2 p_t) + \\frac{1 - g_t}{2} \\ln(2 (1 - p_t))\\right]$$\n$$= \\ln \\epsilon + T \\ln 2 + \\sum_{t=1}^{T} \\left[\\frac{1 + g_t}{2} \\ln p_t + \\frac{1 - g_t}{2} \\ln(1 - p_t)\\right]$$\n$$= \\ln \\epsilon + T \\ln 2 - \\sum_{t=1}^{T} \\ell(p_t, x_t)$$\n$$= \\ln \\epsilon + T \\ln 2 - \\mathrm{Regret}^{\\mathrm{logloss}}_T - \\min_{\\beta \\in [0,1]} \\sum_{t=1}^{T} \\ell(\\beta, x_t)$$\n$$\\ge \\ln \\epsilon + T \\ln 2 - R_T - \\min_{\\beta \\in [0,1]} \\sum_{t=1}^{T} \\ell(\\beta, x_t) ,$$\n\nwhere the first inequality is due to the concavity of $\\ln$ and the second one is due to the assumption on the regret.\nIt is easy to see that $\\beta^* = \\arg\\min_{\\beta \\in [0,1]} \\sum_{t=1}^{T} \\ell(\\beta, x_t) = \\frac{\\sum_{t=1}^{T} x_t}{T}$. Hence, we have\n\n$$\\min_{\\beta \\in [0,1]} \\sum_{t=1}^{T} \\ell(\\beta, x_t) = T \\left(-\\beta^* \\ln \\beta^* - (1 - \\beta^*) \\ln(1 - \\beta^*)\\right) .$$\n\nAlso, we have that for any $\\beta \\in [0, 1]$\n\n$$-\\beta \\ln \\beta - (1 - \\beta) \\ln(1 - \\beta) = -D\\left(\\beta \\,\\Big\\|\\, \\frac{1}{2}\\right) + \\ln 2 .$$\n\nPutting all together, and noting that $\\beta^* = \\frac{1}{2} + \\frac{\\sum_{t=1}^{T} g_t}{2T}$, we have the stated lemma.\n\nThe lower bound on the wealth of the adaptive Kelly betting based on the KT estimator is obtained simply from the stated lemma, recalling that the log loss regret of the KT estimator is upper bounded by $\\frac{1}{2} \\ln T + \\ln 2$.\n\nB Optimal Betting Fraction\n\nTheorem 8 (Optimal Betting Fraction).
Let $x \\in \\mathbb{R}$. Let $F : [x - 1, x + 1] \\to \\mathbb{R}$ be a logarithmically convex function. Then,\n\n$$\\arg\\min_{\\beta \\in (-1,1)} \\max_{g \\in [-1,1]} \\frac{F(x + g)}{1 + \\beta g} = \\frac{F(x + 1) - F(x - 1)}{F(x + 1) + F(x - 1)} .$$\n\nMoreover, $\\beta^* = \\frac{F(x+1) - F(x-1)}{F(x+1) + F(x-1)}$ satisfies\n\n$$\\ln(F(x + 1)) - \\ln(1 + \\beta^*) = \\ln(F(x - 1)) - \\ln(1 - \\beta^*) .$$\n\nProof. We define the functions $h, f : [-1, 1] \\times (-1, 1) \\to \\mathbb{R}$ as\n\n$$h(g, \\beta) = \\frac{F(x + g)}{1 + \\beta g} \\qquad \\text{and} \\qquad f(g, \\beta) = \\ln(h(g, \\beta)) = \\ln(F(x + g)) - \\ln(1 + \\beta g) .$$\n\nClearly, $\\arg\\min_{\\beta \\in (-1,1)} \\max_{g \\in [-1,1]} h(g, \\beta) = \\arg\\min_{\\beta \\in (-1,1)} \\max_{g \\in [-1,1]} f(g, \\beta)$ and we can work with $f$ instead of $h$. The function $h$ is logarithmically convex in $g$ and thus $f$ is convex in $g$. Therefore,\n\n$$\\forall \\beta \\in (-1, 1) \\qquad \\max_{g \\in [-1,1]} f(g, \\beta) = \\max\\{f(+1, \\beta), f(-1, \\beta)\\} .$$\n\nLet $\\phi(\\beta) = \\max\\{f(+1, \\beta), f(-1, \\beta)\\}$. We seek to find the $\\arg\\min_{\\beta \\in (-1,1)} \\phi(\\beta)$. Since $f(+1, \\beta)$ is decreasing in $\\beta$ and $f(-1, \\beta)$ is increasing in $\\beta$, the minimum of $\\phi(\\beta)$ is at a point $\\beta^*$ such that $f(+1, \\beta^*) = f(-1, \\beta^*)$. In other words, $\\beta^*$ satisfies\n\n$$\\ln(F(x + 1)) - \\ln(1 + \\beta^*) = \\ln(F(x - 1)) - \\ln(1 - \\beta^*) .$$\n\nThe only solution of this equation is\n\n$$\\beta^* = \\frac{F(x + 1) - F(x - 1)}{F(x + 1) + F(x - 1)} .$$\n\nTheorem 9. The functions $F_t(x) = \\epsilon \\exp\\left(\\frac{x^2}{2t} - \\frac{1}{2} \\sum_{i=1}^{t} \\frac{1}{i}\\right)$ are excellent coin betting potentials.\n\nProof. The first and second properties of Definition 2 are trivially true.
For the third property, we first use Theorem 8 to have\n\n$$\\ln(1 + \\beta_t g) - \\ln F_t(x + g) \\ge \\ln(1 + \\beta_t) - \\ln F_t(x + 1) = \\ln \\frac{2}{F_t(x + 1) + F_t(x - 1)} ,$$", "award": [], "sourceid": 320, "authors": [{"given_name": "Francesco", "family_name": "Orabona", "institution": "Yahoo Research"}, {"given_name": "David", "family_name": "Pal", "institution": "Google"}]}