{"title": "MetaGrad: Multiple Learning Rates in Online Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3666, "page_last": 3674, "abstract": "In online convex optimization it is well known that certain subclasses of objective functions are much easier than arbitrary convex functions. We are interested in designing adaptive methods that can automatically get fast rates in as many such subclasses as possible, without any manual tuning. Previous adaptive methods are able to interpolate between strongly convex and general convex functions. We present a new method, MetaGrad, that adapts to a much broader class of functions, including exp-concave and strongly convex functions, but also various types of stochastic and non-stochastic functions without any curvature. For instance, MetaGrad can achieve logarithmic regret on the unregularized hinge loss, even though it has no curvature, if the data come from a favourable probability distribution. MetaGrad's main feature is that it simultaneously considers multiple learning rates. Unlike all previous methods with provable regret guarantees, however, its learning rates are not monotonically decreasing over time and are not tuned based on a theoretically derived bound on the regret. Instead, they are weighted directly proportional to their empirical performance on the data using a tilted exponential weights master algorithm.", "full_text": "MetaGrad: Multiple Learning Rates\n\nin Online Learning\n\nTim van Erven\nLeiden University\n\ntim@timvanerven.nl\n\nWouter M. Koolen\n\nCentrum Wiskunde & Informatica\n\nwmkoolen@cwi.nl\n\nAbstract\n\nIn online convex optimization it is well known that certain subclasses of objective\nfunctions are much easier than arbitrary convex functions. We are interested in\ndesigning adaptive methods that can automatically get fast rates in as many such\nsubclasses as possible, without any manual tuning. 
Previous adaptive methods\nare able to interpolate between strongly convex and general convex functions. We\npresent a new method, MetaGrad, that adapts to a much broader class of functions,\nincluding exp-concave and strongly convex functions, but also various types of\nstochastic and non-stochastic functions without any curvature. For instance,\nMetaGrad can achieve logarithmic regret on the unregularized hinge loss, even though\nit has no curvature, if the data come from a favourable probability distribution.\nMetaGrad\u2019s main feature is that it simultaneously considers multiple learning rates.\nUnlike previous methods with provable regret guarantees, however, its learning\nrates are not monotonically decreasing over time and are not tuned based on a\ntheoretically derived bound on the regret. Instead, they are weighted directly\nproportional to their empirical performance on the data using a tilted exponential\nweights master algorithm.\n\n1 Introduction\n\nMethods for online convex optimization (OCO) [28, 12] make it possible to optimize parameters\nsequentially, by processing convex functions in a streaming fashion. This is important in time series\nprediction where the data are inherently online; but it may also be convenient to process offline data\nsets sequentially, for instance if the data do not all fit into memory at the same time or if parameters\nneed to be updated quickly when extra data become available.\nThe difficulty of an OCO task depends on the convex functions f_1, f_2, ..., f_T that need to be\noptimized. The argument of these functions is a d-dimensional parameter vector w from a convex\ndomain U. 
Although this is abstracted away in the general framework, each function f_t usually\nmeasures the loss of the parameters on an underlying example (x_t, y_t) in a machine learning task.\nFor example, in classification f_t might be the hinge loss f_t(w) = max{0, 1 \u2212 y_t\u27e8w, x_t\u27e9} or the\nlogistic loss f_t(w) = ln(1 + e^{\u2212y_t\u27e8w, x_t\u27e9}), with y_t \u2208 {\u22121, +1}. Thus the difficulty depends both on\nthe choice of loss and on the observed data.\nThere are different methods for OCO, depending on assumptions that can be made about the functions.\nThe simplest and most commonly used strategy is online gradient descent (GD), which does not\nrequire any assumptions beyond convexity. GD updates parameters w_{t+1} = w_t \u2212 \u03b7_t \u2207f_t(w_t) by\ntaking a step in the direction of the negative gradient, where the step size is determined by a parameter\n\u03b7_t called the learning rate. For learning rates \u03b7_t \u221d 1/\u221at, GD guarantees that the regret over T\nrounds, which measures the difference in cumulative loss between the online iterates w_t and the best\noffline parameters u, is bounded by O(\u221aT) [33]. Alternatively, if it is known beforehand that the\nfunctions are of an easier type, then better regret rates are sometimes possible. For instance, if the\nfunctions are strongly convex, then logarithmic regret O(ln T) can be achieved by GD with much\nsmaller learning rates \u03b7_t \u221d 1/t [14], and, if they are exp-concave, then logarithmic regret O(d ln T)\ncan be achieved by the Online Newton Step (ONS) algorithm [14].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThis partitions OCO tasks into categories, leaving it to the user to choose the appropriate algorithm\nfor their setting. Such a strict partition, apart from being a burden on the user, depends on an extensive\ncataloguing of all types of easier functions that might occur in practice. (See Section 3 for several\nways in which the existing list of easy functions can be extended.) 
It also immediately raises the\nquestion of whether there are cases in between logarithmic and square-root regret (there are, see\nTheorem 3 in Section 3), and which algorithm to use then. And, third, it presents the problem that\nthe appropriate algorithm might depend on (the distribution of) the data (again see Section 3), which\nmakes it entirely impossible to select the right algorithm beforehand.\nThese issues motivate the development of adaptive methods, which are no worse than O(\u221aT) for\ngeneral convex functions, but also automatically take advantage of easier functions whenever possible.\nImportant steps in this direction are the adaptive GD algorithm of Bartlett, Hazan, and Rakhlin\n[2] and its proximal improvement by Do, Le, and Foo [8], which are able to interpolate between\nstrongly convex and general convex functions if they are provided with a data-dependent strong\nconvexity parameter in each round, and significantly outperform the main non-adaptive method\n(i.e. Pegasos, [29]) in the experiments of Do et al. Here we consider a significantly richer class of\nfunctions, which includes exp-concave functions, strongly convex functions, general convex functions\nthat do not change between rounds (even if they have no curvature), and stochastic functions whose\ngradients satisfy the so-called Bernstein condition, which is well-known to enable fast rates in offline\nstatistical learning [1, 10, 19]. The latter group can again include functions without curvature, like\nthe unregularized hinge loss. All these cases are covered simultaneously by a new adaptive method\nwe call MetaGrad, for multiple eta gradient algorithm. MetaGrad maintains a covariance matrix of\nsize d \u00d7 d where d is the parameter dimension. In the remainder of the paper we call this version full\nMetaGrad. A reference implementation is available from [17]. 
We also design and analyze a faster\napproximation that only maintains the d diagonal elements, called diagonal MetaGrad. Theorem 7\nbelow implies the following:\nTheorem 1. Let g_t = \u2207f_t(w_t) and V_T^u = \u2211_{t=1}^T ((u \u2212 w_t)^\u22a4 g_t)^2. Then the regret of full MetaGrad\nis simultaneously bounded by O(\u221a(T ln ln T)), and by\n\n\u2211_{t=1}^T f_t(w_t) \u2212 \u2211_{t=1}^T f_t(u) \u2264 \u2211_{t=1}^T (w_t \u2212 u)^\u22a4 g_t \u2264 O(\u221a(V_T^u d ln T) + d ln T) for any u \u2208 U. (1)\n\nTheorem 1 bounds the regret in terms of a measure of variance V_T^u that depends on the distance of\nthe algorithm\u2019s choices w_t to the optimum u, and which, in favourable cases, may be significantly\nsmaller than T. Intuitively, this happens, for instance, when there is a stable optimum u that the\nalgorithm\u2019s choices w_t converge to. Formal consequences are given in Section 3, which shows that\nthis bound implies faster than O(\u221aT) regret rates, often logarithmic in T, for all functions in the rich\nclass mentioned above. In all cases the dependence on T in the rates matches what we would expect\nbased on related work in the literature, and in most cases the dependence on the dimension d is also\nwhat we would expect. Only for strongly convex functions is there an extra factor d. It is an open\nquestion whether this is a fundamental obstacle for which an even more general adaptive method is\nneeded, or whether it is an artefact of our analysis.\nThe main difficulty in achieving the regret guarantee from Theorem 1 is tuning a learning rate\nparameter \u03b7. In theory, \u03b7 should be roughly 1/\u221a(V_T^u), but this is not possible using any existing\ntechniques, because the optimum u is unknown in advance, and tuning in terms of a uniform upper\nbound max_u V_T^u ruins all desired benefits. MetaGrad therefore runs multiple slave algorithms, each\nwith a different learning rate, and combines them with a novel master algorithm that learns the\nempirically best learning rate for the OCO task in hand. 
The slaves are instances of exponential\nweights on the continuous parameters u with a suitable surrogate loss function, which in particular\ncauses the exponential weights distributions to be multivariate Gaussians. For the full version of\nMetaGrad, the slaves are closely related to the ONS algorithm on the original losses, where each\nslave receives the master\u2019s gradients instead of its own. It is shown that \u2308(1/2) log_2 T\u2309 + 1 slaves suffice,\nwhich is at most 16 as long as T \u2264 10^9, and therefore seems computationally acceptable. If not, then\nthe number of slaves can be further reduced at the cost of slightly worse constants in the bound.\n\nProtocol 1: Online Convex Optimization from First-order Information\nInput: Convex set U\n1: for t = 1, 2, ... do\n2:   Learner plays w_t \u2208 U\n3:   Environment reveals convex loss function f_t : U \u2192 R\n4:   Learner incurs loss f_t(w_t) and observes (sub)gradient g_t = \u2207f_t(w_t)\n5: end for\n\nRelated Work If we disregard computational efficiency, then the result of Theorem 1 can be\nachieved by finely discretizing the domain U and running the Squint algorithm for prediction with\nexperts with each discretization point as an expert [16]. MetaGrad may therefore also be seen as a\ncomputationally efficient extension of Squint to the OCO setting.\nOur focus in this work is on adapting to sequences of functions f_t that are easier than general convex\nfunctions. A different direction in which faster rates are possible is by adapting to the domain U. As\nwe assume U to be fixed, we consider an upper bound D on the norm of the optimum u to be known.\nIn contrast, Orabona and P\u00e1l [24, 25] design methods that can adapt to the norm of u. One may also\nlook at the shape of U. As can be seen in the analysis of the slaves, MetaGrad is based on a spherical\nGaussian prior on R^d, which favours u with small \u2113_2-norm. 
This is appropriate for U that are similar\nto the Euclidean ball, but less so if U is more like a box (\u2113_\u221e-ball). In this case, it would be better\nto run a copy of MetaGrad for each dimension separately, similarly to how the diagonal version of\nthe AdaGrad algorithm [9, 21] may be interpreted as running a separate copy of GD with a separate\nlearning rate for each dimension. AdaGrad further uses an adaptive tuning of the learning rates that is\nable to take advantage of sparse gradient vectors, as can happen on data with rarely observed features.\nWe briefly compare to AdaGrad in some very simple simulations in Appendix A.1.\nAnother notion of adaptivity is explored in a series of work [13, 6, 31] obtaining tighter bounds\nfor linear functions f_t that vary little between rounds (as measured either by their deviation from\nthe mean function or by successive differences). Such bounds imply super fast rates for optimizing\na fixed linear function, but reduce to slow O(\u221aT) rates in the other cases of easy functions that\nwe consider. Finally, the way MetaGrad\u2019s slaves maintain a Gaussian distribution on parameters u\nis similar in spirit to AROW and related confidence weighted methods, as analyzed by Crammer,\nKulesza, and Dredze [7] in the mistake bound model.\n\nOutline We start with the main definitions in the next section. Then Section 3 contains an extensive\nset of examples where Theorem 1 leads to fast rates, Section 4 presents the MetaGrad algorithm,\nand Section 5 provides the analysis leading to Theorem 7, which is a more detailed statement of\nTheorem 1 with an improved dependence on the dimension in some particular cases and with exact\nconstants. The details of the proofs can be found in the appendix.\n\n2 Setup\n\nLet U \u2286 R^d be a closed convex set, which we assume contains the origin 0 (if not, it can always\nbe translated). 
We consider algorithms for Online Convex Optimization over U, which operate\naccording to the protocol displayed in Protocol 1. Let w_t \u2208 U be the iterate produced by the\nalgorithm in round t, let f_t : U \u2192 R be the convex loss function produced by the environment and let\ng_t = \u2207f_t(w_t) be the (sub)gradient, which is the feedback given to the algorithm.\u00b9 We abbreviate the\nregret with respect to u \u2208 U as R_T^u = \u2211_{t=1}^T (f_t(w_t) \u2212 f_t(u)), and define our measure of variance as\nV_T^u = \u2211_{t=1}^T ((u \u2212 w_t)^\u22a4 g_t)^2 for the full version of MetaGrad and V_T^u = \u2211_{t=1}^T \u2211_{i=1}^d (u_i \u2212 w_{t,i})^2 g_{t,i}^2\nfor the diagonal version. By convexity of f_t, we always have f_t(w_t) \u2212 f_t(u) \u2264 (w_t \u2212 u)^\u22a4 g_t. Defining\nR\u0303_T^u = \u2211_{t=1}^T (w_t \u2212 u)^\u22a4 g_t, this implies the first inequality in Theorem 1: R_T^u \u2264 R\u0303_T^u. A stronger\nrequirement than convexity is that a function f is exp-concave, which (for exp-concavity parameter\n1) means that e^{\u2212f} is concave. Finally, we impose the following standard boundedness assumptions,\ndistinguishing between the full version of MetaGrad (left column) and the diagonal version (right\ncolumn): for all u, v \u2208 U, all dimensions i and all times t,\n\nfull                          diag\n\u2016u \u2212 v\u2016 \u2264 D^full       |u_i \u2212 v_i| \u2264 D^diag\n\u2016g_t\u2016 \u2264 G^full         |g_{t,i}| \u2264 G^diag.     (2)\n\nHere, and throughout the paper, the norm of a vector (e.g. \u2016g_t\u2016) will always refer to the \u2113_2-norm.\nFor the full version of MetaGrad, the Cauchy-Schwarz inequality further implies that (u \u2212 v)^\u22a4 g_t \u2264\n\u2016u \u2212 v\u2016 \u00b7 \u2016g_t\u2016 \u2264 D^full G^full.\n\n\u00b9 If f_t is not differentiable at w_t, any choice of subgradient g_t \u2208 \u2202f_t(w_t) is allowed.\n\n3 Fast Rate Examples\n\nIn this section, we motivate our interest in the adaptive bound (1) by giving a series of examples in\nwhich it provides fast rates. 
These fast rates are all derived from two general sufficient conditions:\none based on the directional derivative of the functions f_t and one for stochastic gradients that satisfy\nthe Bernstein condition, which is the standard condition for fast rates in off-line statistical learning.\nSimple simulations that illustrate the conditions are provided in Appendix A.1 and proofs are also\npostponed to Appendix A.\n\nDirectional Derivative Condition In order to control the regret with respect to some point u, the\nfirst condition requires a quadratic lower bound on the curvature of the functions f_t in the direction\nof u:\nTheorem 2. Suppose, for a given u \u2208 U, there exist constants a, b > 0 such that the functions f_t all\nsatisfy\n\nf_t(u) \u2265 f_t(w) + a (u \u2212 w)^\u22a4 \u2207f_t(w) + b ((u \u2212 w)^\u22a4 \u2207f_t(w))^2 for all w \u2208 U. (3)\n\nThen any method with regret bound (1) incurs logarithmic regret, R_T^u = O(d ln T), with respect to u.\nThe case a = 1 of this condition was introduced by Hazan, Agarwal, and Kale [14], who show that\nit is satisfied for all u \u2208 U by exp-concave and strongly convex functions. The rate O(d ln T) is\nalso what we would expect by summing the asymptotic offline rate obtained by ridge regression on\nthe squared loss [30, Section 5.2], which is exp-concave. Our extension to a > 1 is technically a\nminor step, but it makes the condition much more liberal, because it may then also be satisfied by\nfunctions that do not have any curvature. For example, suppose that f_t = f is a fixed convex function\nthat does not change with t. Then, when u* = arg min_u f(u) is the offline minimizer, we have\n(u* \u2212 w)^\u22a4 \u2207f(w) \u2208 [\u2212G^full D^full, 0], so that\n\nf(u*) \u2212 f(w) \u2265 (u* \u2212 w)^\u22a4 \u2207f(w) \u2265 2 (u* \u2212 w)^\u22a4 \u2207f(w) + (1/(D^full G^full)) ((u* \u2212 w)^\u22a4 \u2207f(w))^2,\n\nwhere the first inequality uses only convexity of f. 
Thus condition (3) is satisfied by any fixed convex\nfunction, even if it does not have any curvature at all, with a = 2 and b = 1/(G^full D^full).\n\nBernstein Stochastic Gradients The possibility of getting fast rates even without any curvature\nis intriguing, because it goes beyond the usual strong convexity or exp-concavity conditions. In\nthe online setting, the case of fixed functions f_t = f seems rather restricted, however, and may in\nfact be handled by offline optimization methods. We therefore seek to loosen this requirement by\nreplacing it by a stochastic condition on the distribution of the functions f_t. The relation between\nvariance bounds like Theorem 1 and fast rates in the stochastic setting is studied in depth by Koolen,\nGr\u00fcnwald, and Van Erven [19], who obtain fast rate results both in expectation and in probability.\nHere we provide a direct proof only for the expected regret, which allows a simplified analysis.\nSuppose the functions f_t are independent and identically distributed (i.i.d.), with common distribution\nP. Then we say that the gradients satisfy the (B, \u03b2)-Bernstein condition with respect to the stochastic\noptimum u* = arg min_{u \u2208 U} E_{f\u223cP}[f(u)] if\n\n(w \u2212 u*)^\u22a4 E_f[\u2207f(w) \u2207f(w)^\u22a4] (w \u2212 u*) \u2264 B ((w \u2212 u*)^\u22a4 E_f[\u2207f(w)])^\u03b2 for all w \u2208 U. (4)\n\nThis is an instance of the well-known Bernstein condition from offline statistical learning [1, 10],\napplied to the linearized excess loss (w \u2212 u*)^\u22a4 \u2207f(w). As shown in Appendix H, imposing the\ncondition for the linearized excess loss is a weaker requirement than imposing it for the original\nexcess loss f(w) \u2212 f(u*).\n\nTheorem 3. If the gradients satisfy the (B, \u03b2)-Bernstein condition for B > 0 and \u03b2 \u2208 (0, 1] with\nrespect to u* = arg min_{u \u2208 U} E_{f\u223cP}[f(u)], then any method with regret bound (1) incurs expected\nregret E[R_T^{u*}] = O((B d ln T)^{1/(2\u2212\u03b2)} T^{(1\u2212\u03b2)/(2\u2212\u03b2)} + d ln T).\n\nFor \u03b2 = 1, the rate becomes O(d ln T), just like for fixed functions, and for smaller \u03b2 it is in between\nlogarithmic and O(\u221a(dT)). For instance, the hinge loss on the unit ball with i.i.d. data satisfies the\nBernstein condition with \u03b2 = 1, which implies an O(d ln T) rate. (See Appendix A.4.) 
It is common\nto add \u2113_2-regularization to the hinge loss to make it strongly convex, but this example shows that this\nis not necessary to get logarithmic regret.\n\n4 MetaGrad Algorithm\n\nAlgorithm 1: MetaGrad Master\nInput: Grid of learning rates 1/(5DG) \u2265 \u03b7_1 \u2265 \u03b7_2 \u2265 ... with prior weights \u03c0_1^{\u03b7_1}, \u03c0_1^{\u03b7_2}, ...   \u25b7 As in (8)\n1: for t = 1, 2, ... do\n2:   Get prediction w_t^\u03b7 \u2208 U of slave (Algorithm 2) for each \u03b7\n3:   Play w_t = (\u2211_\u03b7 \u03c0_t^\u03b7 \u03b7 w_t^\u03b7) / (\u2211_\u03b7 \u03c0_t^\u03b7 \u03b7) \u2208 U   \u25b7 Tilted Exponentially Weighted Average\n4:   Observe gradient g_t = \u2207f_t(w_t)\n5:   Update \u03c0_{t+1}^\u03b7 = \u03c0_t^\u03b7 e^{\u2212\u03b1 \u2113_t^\u03b7(w_t^\u03b7)} / \u2211_{\u03b7'} \u03c0_t^{\u03b7'} e^{\u2212\u03b1 \u2113_t^{\u03b7'}(w_t^{\u03b7'})} for all \u03b7   \u25b7 Exponential Weights with surrogate loss (6)\n6: end for\n\nIn this section we explain the two versions (full and diagonal) of the MetaGrad algorithm. We will\nmake use of the following definitions:\n\nfull                          diag\nM_t^full := g_t g_t^\u22a4        M_t^diag := diag(g_{t,1}^2, ..., g_{t,d}^2)\n\u03b1^full := 1                 \u03b1^diag := 1/d.     (5)\n\nDepending on context, w_t \u2208 U will refer to the full or diagonal MetaGrad prediction in round t. 
In\nthe remainder we will drop the superscript from the letters above, which will always be clear from\ncontext.\nMetaGrad will be defined by means of the following surrogate loss \u2113_t^\u03b7(u), which depends on a\nparameter \u03b7 > 0 that trades off regret compared to u with the square of the scaled directional\nderivative towards u (full case) or its approximation (diag case):\n\n\u2113_t^\u03b7(u) := \u2212\u03b7 (w_t \u2212 u)^\u22a4 g_t + \u03b7^2 (u \u2212 w_t)^\u22a4 M_t (u \u2212 w_t). (6)\n\nOur surrogate loss consists of a linear and a quadratic part. Using the language of Orabona, Crammer,\nand Cesa-Bianchi [26], the data-dependent quadratic part causes a \u201ctime-varying regularizer\u201d and\nDuchi, Hazan, and Singer [9] would call it \u201ctemporal adaptation of the proximal function\u201d. The sum\nof quadratic terms in our surrogate is what appears in the regret bound of Theorem 1.\nThe MetaGrad algorithm is a two-level hierarchical construction, displayed as Algorithms 1 (master\nalgorithm that learns the learning rate) and 2 (sub-module, a copy running for each learning rate \u03b7\nfrom a finite grid). Based on our analysis in the next section, we recommend using the grid in (8).\n\nMaster The task of the Master Algorithm 1 is to learn the empirically best learning rate \u03b7 (parameter\nof the surrogate loss \u2113_t^\u03b7), which is notoriously difficult to track online because the regret is\nnon-monotonic over rounds and may have multiple local minima as a function of \u03b7 (see [18] for a study\nin the expert setting). The standard technique is therefore to derive a monotonic upper bound on\nthe regret and tune the learning rate optimally for the bound. In contrast, our approach, inspired\nby the approach for combinatorial games of Koolen and Van Erven [16, Section 4], is to have our\nmaster aggregate the predictions of a discrete grid of learning rates. 
Although we provide a formal\nanalysis of the regret, the master algorithm does not depend on the outcome of this analysis, so any\nslack in our bounds does not feed back into the algorithm. The master is in fact very similar to\nthe well-known exponential weights method (line 5), run on the surrogate losses, except that in the\npredictions the weights of the slaves are tilted by their learning rates (line 3), having the effect of\ngiving a larger weight to larger \u03b7. The internal parameter \u03b1 is set to \u03b1^full from (5) for the full version\nof the algorithm, and to \u03b1^diag for the diagonal version.\n\nAlgorithm 2: MetaGrad Slave\nInput: Learning rate 0 < \u03b7 \u2264 1/(5DG), domain size D > 0\n1: w_1^\u03b7 = 0 and \u03a3_1^\u03b7 = D^2 I\n2: for t = 1, 2, ... do\n3:   Issue w_t^\u03b7 to master (Algorithm 1)\n4:   Observe gradient g_t = \u2207f_t(w_t)   \u25b7 Gradient at master point w_t\n5:   Update \u03a3_{t+1}^\u03b7 = ((1/D^2) I + 2\u03b7^2 \u2211_{s=1}^t M_s)^{\u22121}\n       w\u0303_{t+1}^\u03b7 = w_t^\u03b7 \u2212 \u03a3_{t+1}^\u03b7 (\u03b7 g_t + 2\u03b7^2 M_t (w_t^\u03b7 \u2212 w_t))\n       w_{t+1}^\u03b7 = \u03a0_U^{\u03a3_{t+1}^\u03b7}(w\u0303_{t+1}^\u03b7)   with projection \u03a0_U^\u03a3(w) = arg min_{u \u2208 U} (u \u2212 w)^\u22a4 \u03a3^{\u22121} (u \u2212 w)\n6: end for\nImplementation: For M_t = M_t^diag only maintain diagonal of \u03a3_t^\u03b7. For M_t = M_t^full use rank-one\nupdate \u03a3_{t+1}^\u03b7 = \u03a3_t^\u03b7 \u2212 (2\u03b7^2 \u03a3_t^\u03b7 g_t g_t^\u22a4 \u03a3_t^\u03b7) / (1 + 2\u03b7^2 g_t^\u22a4 \u03a3_t^\u03b7 g_t) and simplify w\u0303_{t+1}^\u03b7 = w_t^\u03b7 \u2212 \u03b7 \u03a3_{t+1}^\u03b7 g_t (1 + 2\u03b7 g_t^\u22a4 (w_t^\u03b7 \u2212 w_t)).\n\nSlaves The role of the Slave Algorithm 2 is to guarantee small surrogate regret for a fixed learning\nrate \u03b7. We consider two versions, corresponding to whether we take rank-one or diagonal matrices\nM_t (see (5)) in the surrogate (6). The first version maintains a full d \u00d7 d covariance matrix and has\nthe best regret bound. 
The second version uses only diagonal matrices (with d non-zero entries),\nthus trading off a weaker bound with a better run-time in high dimensions. Algorithm 2 presents\nthe update equations in a computationally efficient form. Their intuitive motivation is given in the\nproof of Lemma 5, where we show that the standard exponential weights method with Gaussian prior\nand surrogate losses \u2113_t^\u03b7(u) yields a Gaussian posterior with mean w_t^\u03b7 and covariance matrix \u03a3_t^\u03b7. The\nfull version of MetaGrad is closely related to the Online Newton Step algorithm [14] running on the\noriginal losses f_t: the differences are that each Slave receives the Master\u2019s gradients g_t = \u2207f_t(w_t)\ninstead of its own \u2207f_t(w_t^\u03b7), and that an additional term 2\u03b7^2 M_t (w_t^\u03b7 \u2212 w_t) in line 5 adjusts for the\ndifference between the Slave\u2019s parameters w_t^\u03b7 and the Master\u2019s parameters w_t. MetaGrad is therefore\na bona fide first-order algorithm that only accesses f_t through g_t. We also note that we have chosen\nthe Mirror Descent version that iteratively updates and projects (see line 5). One might alternatively\nconsider the Lazy Projection version (as in [34, 23, 32]) that forgets past projections when updating\non new data. Since projections are typically computationally expensive, we have opted for the Mirror\nDescent version, which we expect to project less often, since a projected point seems less likely to\nupdate to a point outside of the domain than an unprojected point.\n\nTotal run time As mentioned, the running time is dominated by the slaves. Ignoring the projection,\na slave with full covariance matrix takes O(d^2) time to update, while slaves with diagonal covariance\nmatrix take O(d) time. If there are m slaves, this makes the overall computational effort respectively\nO(md^2) and O(md), both in time per round and in memory. Our analysis below indicates that\nm = 1 + \u2308(1/2) log_2 T\u2309 slaves suffice, so m \u2264 16 as long as T \u2264 10^9. 
In addition, each slave may\nincur the cost of a projection, which depends on the shape of the domain U. To get a sense for the\nprojection cost we consider a typical example. For the Euclidean ball a diagonal projection can be\nperformed using a few iterations of Newton\u2019s method to get the desired precision. Each such iteration\ncosts O(d) time. This is generally considered affordable. For full projections the story is starkly\ndifferent. We typically reduce to the diagonal case by a basis transformation, which takes O(d^3) to\ncompute using SVD. Hence here the projection dwarfs the other run time by an order of magnitude.\nWe refer to [9] for examples of how to compute projections for various domains U. Finally, we\nremark that a potential speed-up is possible by running the slaves in parallel.\n\n5 Analysis\n\nWe conduct the analysis in three parts. We first discuss the master, then the slaves and finally their\ncomposition. The idea is the following. The master guarantees for all \u03b7 simultaneously that\n\n0 = \u2211_{t=1}^T \u2113_t^\u03b7(w_t) \u2264 \u2211_{t=1}^T \u2113_t^\u03b7(w_t^\u03b7) + master regret compared to \u03b7-slave. (7a)\n\nThen each \u03b7-slave takes care of learning u, with regret O(d ln T):\n\n\u2211_{t=1}^T \u2113_t^\u03b7(w_t^\u03b7) \u2264 \u2211_{t=1}^T \u2113_t^\u03b7(u) + \u03b7-slave regret compared to u. (7b)\n\nThese two statements combine to\n\n\u03b7 \u2211_{t=1}^T (w_t \u2212 u)^\u22a4 g_t \u2212 \u03b7^2 V_T^u = \u2212\u2211_{t=1}^T \u2113_t^\u03b7(u) \u2264 sum of regrets above (7c)\n\nand the overall result follows by optimizing \u03b7.\n\n5.1 Master\nTo show that we can aggregate the slave predictions, we consider the potential \u03a6_T :=\n\u2211_\u03b7 \u03c0_1^\u03b7 e^{\u2212\u03b1 \u2211_{t=1}^T \u2113_t^\u03b7(w_t^\u03b7)}. In Appendix B, we bound the last factor e^{\u2212\u03b1 \u2113_T^\u03b7(w_T^\u03b7)} above by its tangent at\nw_T^\u03b7 = w_T and obtain an objective that can be shown to be equal to \u03a6_{T\u22121} regardless of the gradient\ng_T if w_T is chosen according to the Master algorithm. 
It follows that the potential is non-increasing:\nLemma 4 (Master combines slaves). The Master Algorithm guarantees 1 = \u03a6_0 \u2265 \u03a6_1 \u2265 ... \u2265 \u03a6_T.\nAs 0 \u2264 \u2212(1/\u03b1) ln \u03a6_T \u2264 \u2211_{t=1}^T \u2113_t^\u03b7(w_t^\u03b7) \u2212 (1/\u03b1) ln \u03c0_1^\u03b7, this implements step (7a) of our overall proof\nstrategy, with master regret \u2212(1/\u03b1) ln \u03c0_1^\u03b7. We further remark that we may view our potential function \u03a6_T\nas a game-theoretic supermartingale in the sense of Chernov, Kalnishkan, Zhdanov, and Vovk [5],\nand this lemma as establishing that the MetaGrad Master is the corresponding defensive forecasting\nstrategy.\n\n5.2 Slaves\nNext we implement step (7b), which requires proving an O(d ln T) regret bound in terms of the\nsurrogate loss for each MetaGrad slave. In the full case, the surrogate loss is jointly exp-concave, and\nin light of the analysis of ONS by Hazan, Agarwal, and Kale [14] such a result is not surprising. For\nthe diagonal case, the surrogate loss lacks joint exp-concavity, but we can use exp-concavity in each\ndirection separately, and verify that the projections that tie the dimensions together do not cause any\ntrouble. In Appendix C we analyze both cases simultaneously, and obtain the following bound on the\nregret:\nLemma 5 (Surrogate regret bound). For 0 < \u03b7 \u2264 1/(5DG), let \u2113_t^\u03b7(u) be the surrogate losses as defined\nin (6) (either the full or the diagonal version). Then the regret of Slave Algorithm 2 is bounded by\n\n\u2211_{t=1}^T \u2113_t^\u03b7(w_t^\u03b7) \u2264 \u2211_{t=1}^T \u2113_t^\u03b7(u) + (1/(2D^2)) \u2016u\u2016^2 + (1/2) ln det(I + 2\u03b7^2 D^2 \u2211_{t=1}^T M_t) for all u \u2208 U.\n\n5.3 Composition\nTo complete the analysis of MetaGrad, we first put the regret bounds for the master and slaves together\nas in (7c). We then discuss how to choose the grid of \u03b7s, and optimize \u03b7 over this grid to get our main\nresult. Proofs are postponed to Appendix D.\nTheorem 6 (Grid point regret). 
The full and diagonal versions of MetaGrad, with corresponding\ndefinitions from (2) and (5), guarantee that, for any grid point \u03b7 with prior weight \u03c0_1^\u03b7,\n\nR\u0303_T^u \u2264 \u03b7 V_T^u + [(1/(2D^2)) \u2016u\u2016^2 \u2212 (1/\u03b1) ln \u03c0_1^\u03b7 + (1/2) ln det(I + 2\u03b7^2 D^2 \u2211_{t=1}^T M_t)] / \u03b7 for all u \u2208 U.\n\nGrid We now specify the grid points and corresponding prior. Theorem 6 above implies that any\ntwo \u03b7 that are within a constant factor of each other will guarantee the same bound up to essentially\nthe same constant factor. We therefore choose an exponentially spaced grid with a heavy tailed prior\n(see Appendix E):\n\n\u03b7_i := 2^{\u2212i}/(5DG) and \u03c0_1^{\u03b7_i} := C/((i + 1)(i + 2)) for i = 0, 1, 2, ..., \u2308(1/2) log_2 T\u2309, (8)\n\nwith normalization C = 1 + 1/(1 + \u2308(1/2) log_2 T\u2309). At the cost of a worse constant factor in the\nbounds, the number of slaves can be reduced by using a larger spacing factor, or by omitting some\nof the smallest learning rates. The net effect of (8) is that, for any \u03b7 \u2208 [1/(5DG\u221aT), 2/(5DG)] there is an\n\u03b7_i \u2208 [(1/2)\u03b7, \u03b7], for which \u2212ln \u03c0_1^{\u03b7_i} \u2264 2 ln(i + 2) = O(ln ln(1/\u03b7_i)) = O(ln ln(1/\u03b7)). As these costs\nare independent of T, our regret guarantees still hold if the grid (8) is instantiated with T replaced by\nany upper bound.\nThe final step is to apply Theorem 6 to this grid, and to properly select the learning rate \u03b7_i in the\nbound. This leads to our main result:\n\nTheorem 7 (MetaGrad Regret Bound). 
Let S_T = \u2211_{t=1}^T M_t and V_{T,i}^u = \u2211_{t=1}^T (u_i \u2212 w_{t,i})^2 g_{t,i}^2.\nThen the regret of MetaGrad, with corresponding definitions from (2) and (5) and with grid and prior\nas in (8), is bounded by\n\nR\u0303_T^u \u2264 \u221a(8 V_T^u ((1/D^2) \u2016u\u2016^2 + \u039e_T + (1/\u03b1) C_T)) + 5DG ((1/D^2) \u2016u\u2016^2 + \u039e_T + (1/\u03b1) C_T) for all u \u2208 U,\n\nwhere\n\n\u039e_T \u2264 min{ ln det(I + (D^2 rk(S_T)/V_T^u) S_T), rk(S_T) ln((D^2/V_T^u) \u2211_{t=1}^T \u2016g_t\u2016^2) } = O(d ln(D^2 G^2 T))\n\nfor the full version of the algorithm,\n\n\u039e_T = \u2211_{i=1}^d ln((D^2 \u2211_{t=1}^T g_{t,i}^2)/V_{T,i}^u) = O(d ln(D^2 G^2 T))\n\nfor the diagonal version, and C_T = 4 ln(3 + (1/2) log_2 T) = O(ln ln T) in both cases. Moreover, for\nboth versions of the algorithm, the regret is simultaneously bounded by\n\nR\u0303_T^u \u2264 \u221a(8 D^2 (\u2211_{t=1}^T \u2016g_t\u2016^2) ((1/D^2) \u2016u\u2016^2 + (1/\u03b1) C_T)) + 5DG ((1/D^2) \u2016u\u2016^2 + (1/\u03b1) C_T) for all u \u2208 U.\n\nThese two bounds together show that the full version of MetaGrad achieves the new adaptive guarantee\nof Theorem 1. The diagonal version behaves like running the full version separately per dimension,\nbut with a single shared learning rate.\n\n6 Discussion and Future Work\n\nOne may consider extending MetaGrad in various directions. In particular it would be interesting\nto speed up the method in high dimensions, for instance by sketching [20]. A broader question is\nto identify and be adaptive to more types of easy functions that are of practical interest. One may\nsuspect there to be a price (in regret overhead and in computation) for broader adaptivity, but based\non our results for MetaGrad it does not seem like we are already approaching the point where this\nprice is no longer worth paying.\n\nAcknowledgments We would like to thank Haipeng Luo and the anonymous reviewers (in\nparticular Reviewer 6) for valuable comments. Koolen acknowledges support by the Netherlands\nOrganization for Scientific Research (NWO, Veni grant 639.021.439).\n\nReferences\n[1] P. L. Bartlett and S. Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311\u2013334, 2006.\n\n[2] P. L. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online gradient descent. 
In NIPS 20, pages 65\u201372, 2007.\n[3] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2/3):321\u2013352, 2007.\n[4] A. V. Chernov and V. Vovk. Prediction with advice of unknown number of experts. In Proc. of the 26th Conf. on Uncertainty in Artificial Intelligence, pages 117\u2013125, 2010.\n[5] A. V. Chernov, Y. Kalnishkan, F. Zhdanov, and V. Vovk. Supermartingales in prediction with expert advice. Theoretical Computer Science, 411(29-30):2647\u20132669, 2010.\n[6] C.-K. Chiang, T. Yang, C.-J. Lee, M. Mahdavi, C.-J. Lu, R. Jin, and S. Zhu. Online optimization with gradual variations. In Proc. of the 25th Annual Conf. on Learning Theory (COLT), pages 6.1\u20136.20, 2012.\n[7] K. Crammer, A. Kulesza, and M. Dredze. Adaptive regularization of weight vectors. In NIPS 22, pages 414\u2013422, 2009.\n[8] C. B. Do, Q. V. Le, and C.-S. Foo. Proximal regularization for online and batch learning. In Proc. of the 26th Annual International Conf. on Machine Learning (ICML), pages 257\u2013264, 2009.\n[9] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121\u20132159, 2011.\n[10] T. van Erven, P. D. Gr\u00fcnwald, N. A. Mehta, M. D. Reid, and R. C. Williamson. Fast rates in statistical and online learning. Journal of Machine Learning Research, 16:1793\u20131861, 2015.\n[11] P. Gaillard, G. Stoltz, and T. van Erven. A second-order bound with excess losses. In Proc. of the 27th Annual Conf. on Learning Theory (COLT), pages 176\u2013196, 2014.\n[12] E. Hazan. Introduction to online optimization. Draft, April 10, 2016, ocobook.cs.princeton.edu, 2016.\n[13] E. Hazan and S. Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine Learning, 80(2-3):165\u2013188, 2010.\n[14] E. 
Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169\u2013192, 2007.\n[15] S. Ihara. Information Theory for Continuous Systems. World Scientific, 1993.\n[16] W. M. Koolen and T. van Erven. Second-order quantile methods for experts and combinatorial games. In Proc. of the 28th Annual Conf. on Learning Theory (COLT), pages 1155\u20131175, 2015.\n[17] W. M. Koolen and T. van Erven. MetaGrad open source code. bitbucket.org/wmkoolen/metagrad, 2016.\n[18] W. M. Koolen, T. van Erven, and P. D. Gr\u00fcnwald. Learning the learning rate for prediction with expert advice. In NIPS 27, pages 2294\u20132302, 2014.\n[19] W. M. Koolen, P. D. Gr\u00fcnwald, and T. van Erven. Combining adversarial guarantees and stochastic fast rates in online learning. In NIPS 29, 2016.\n[20] H. Luo, A. Agarwal, N. Cesa-Bianchi, and J. Langford. Efficient second order online learning by sketching. In NIPS 29, 2016.\n[21] H. B. McMahan and M. J. Streeter. Adaptive bound optimization for online convex optimization. In Proc. of the 23rd Annual Conf. on Learning Theory (COLT), pages 244\u2013256, 2010.\n[22] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean. Efficient estimation of word representations in vector space. International Conf. on Learning Representations, 2013. Arxiv.org/abs/1301.3781.\n[23] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221\u2013259, 2009.\n[24] F. Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In NIPS 27, pages 1116\u20131124, 2014.\n[25] F. Orabona and D. P\u00e1l. Coin betting and parameter-free online learning. In NIPS 29, 2016.\n[26] F. Orabona, K. Crammer, and N. Cesa-Bianchi. A generalized online mirror descent with applications to classification and regression. Machine Learning, 99(3):411\u2013435, 2015.\n[27] J. Schmidhuber. 
Deep learning in neural networks: An overview. Neural Networks, 61:85\u2013117, 2015.\n[28] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107\u2013194, 2012.\n[29] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3\u201330, 2011.\n[30] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low noise and fast rates. In NIPS 23, pages 2199\u20132207, 2010.\n[31] J. Steinhardt and P. Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In Proc. of the 31st Annual International Conf. on Machine Learning (ICML), pages 1593\u20131601, 2014.\n[32] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543\u20132596, 2010.\n[33] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proc. of the 20th Annual International Conf. on Machine Learning (ICML), pages 928\u2013936, 2003.\n[34] M. Zinkevich. Theoretical Guarantees for Algorithms in Multi-Agent Settings. PhD thesis, Carnegie Mellon University, 2004.\n", "award": [], "sourceid": 1823, "authors": [{"given_name": "Tim", "family_name": "van Erven", "institution": "Leiden University"}, {"given_name": "Wouter", "family_name": "Koolen", "institution": "Centrum Wiskunde & Informatica"}]}