{"title": "Adaptive Averaging in Accelerated Descent Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 2991, "page_last": 2999, "abstract": "We study accelerated descent dynamics for constrained convex optimization. This dynamics can be described naturally as a coupling of a dual variable accumulating gradients at a given rate $\\eta(t)$, and a primal variable obtained as the weighted average of the mirrored dual trajectory, with weights $w(t)$. Using a Lyapunov argument, we give sufficient conditions on $\\eta$ and $w$ to achieve a desired convergence rate. As an example, we show that the replicator dynamics (an example of mirror descent on the simplex) can be accelerated using a simple averaging scheme. We then propose an adaptive averaging heuristic which adaptively computes the weights to speed up the decrease of the Lyapunov function. We provide guarantees on adaptive averaging in continuous-time, prove that it preserves the quadratic convergence rate of accelerated first-order methods in discrete-time, and give numerical experiments to compare it with existing heuristics, such as adaptive restarting. The experiments indicate that adaptive averaging performs at least as well as adaptive restarting, with significant improvements in some cases.", "full_text": "Adaptive Averaging in Accelerated Descent Dynamics\n\nWalid Krichene \u2217\n\nUC Berkeley\n\nwalid@eecs.berkeley.edu\n\nAlexandre M. Bayen\n\nUC Berkeley\n\nbayen@berkeley.edu\n\nPeter L. Bartlett\n\nUC Berkeley and QUT\n\nbartlett@cs.berkeley.edu\n\nAbstract\n\nWe study accelerated descent dynamics for constrained convex optimization. This\ndynamics can be described naturally as a coupling of a dual variable accumulating\ngradients at a given rate \u03b7(t), and a primal variable obtained as the weighted average\nof the mirrored dual trajectory, with weights w(t). 
Using a Lyapunov argument,\nwe give suf\ufb01cient conditions on \u03b7 and w to achieve a desired convergence rate. As\nan example, we show that the replicator dynamics (an example of mirror descent\non the simplex) can be accelerated using a simple averaging scheme.\nWe then propose an adaptive averaging heuristic which adaptively computes the\nweights to speed up the decrease of the Lyapunov function. We provide guarantees\non adaptive averaging in continuous-time, prove that it preserves the quadratic\nconvergence rate of accelerated \ufb01rst-order methods in discrete-time, and give\nnumerical experiments to compare it with existing heuristics, such as adaptive\nrestarting. The experiments indicate that adaptive averaging performs at least as\nwell as adaptive restarting, with signi\ufb01cant improvements in some cases.\n\n1\n\nIntroduction\n\nWe study the problem of minimizing a convex function f over a feasible set X , a closed convex subset\nof E = Rn. We will assume that f is differentiable, that its gradient \u2207f is a Lipschitz function with\nLipschitz constant L, and that the set of minimizers S = arg minx\u2208X f (x) is non-empty. We will\nfocus on the study of continuous-time, \ufb01rst-order dynamics for optimization. First-order methods\nhave seen a resurgence of interest due to the signi\ufb01cant increase in both size and dimensionality of the\ndata sets typically encountered in machine learning and other applications, which makes higher-order\nmethods computationally intractable in most cases. Continuous-time dynamics for optimization\nhave been studied for a long time, e.g. [6, 9, 5], and more recently [20, 2, 1, 3, 11, 23], in which a\nconnection is made between Nesterov\u2019s accelerated methods [14, 15] and a family of continuous-time\nODEs. 
Many optimization algorithms can be interpreted as a discretization of a continuous-time process, and studying the continuous-time dynamics is useful for many reasons: the analysis is often simpler in continuous time, it can help guide the design and analysis of new algorithms, and it provides intuition and insight into the discrete process. For example, Su et al. show in [20] that Nesterov's original method [14] is a discretization of a second-order ODE, and use this interpretation to propose a restarting heuristic which empirically speeds up the convergence. In [11], we generalize this approach to the proximal version of Nesterov's method [15] which applies to constrained convex problems, and show that the continuous-time ODE can be interpreted as coupled dynamics of a dual variable $Z(t)$ which evolves in the dual space $E^*$, and a primal variable $X(t)$ which is obtained as the weighted average of a non-linear transformation of the dual trajectory. More precisely,

$$\begin{cases} \dot Z(t) = -\frac{t}{r}\nabla f(X(t)) \\ X(t) = \frac{\int_0^t \tau^{r-1}\nabla\psi^*(Z(\tau))\,d\tau}{\int_0^t \tau^{r-1}\,d\tau} \\ X(0) = \nabla\psi^*(Z(0)) = x_0, \end{cases}$$

*Walid Krichene is currently affiliated with Google. walidk@google.com

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

where $r \geq 2$ is a fixed parameter, the initial condition $x_0$ is a point in the feasible set $\mathcal{X}$, and $\nabla\psi^*$ is a Lipschitz function that maps from the dual space $E^*$ to the feasible set $\mathcal{X}$, which we refer to as the mirror map (such a function can be constructed using standard results from convex analysis, by taking the convex conjugate of a strongly convex function $\psi$ with domain $\mathcal{X}$; see the supplementary material for a brief review of the definition and basic properties of mirror maps).
Using a Lyapunov argument, we show that the solution trajectories of this ODE exhibit a quadratic convergence rate, i.e. if $f^\star$ is the minimum of $f$ over the feasible set, then $f(X(t)) - f^\star \leq C/t^2$ for a constant $C$ which depends on the initial conditions. This formalized an interesting connection between acceleration and averaging, which had been observed in [8] in the special case of unconstrained quadratic minimization.

A natural question that arises is whether different averaging schemes can be used to achieve the same rate, or perhaps faster rates. In this article, we provide a positive answer. We study a broad family of Accelerated Mirror Descent (AMD) dynamics, given by

$$\mathrm{AMD}_{w,\eta} \quad \begin{cases} \dot Z(t) = -\eta(t)\nabla f(X(t)) \\ X(t) = \frac{X(t_0)W(t_0) + \int_{t_0}^t w(\tau)\nabla\psi^*(Z(\tau))\,d\tau}{W(t)}, \quad \text{with } W(t) = \int_0^t w(\tau)\,d\tau \\ X(t_0) = \nabla\psi^*(Z(t_0)) = x_0, \end{cases} \qquad (1)$$

parameterized by two positive, continuous weight functions $w$ and $\eta$, where $w$ is used in the averaging and $\eta$ determines the rate at which $Z$ accumulates gradients. This is illustrated in Figure 1. In our formulation we choose to initialize the ODE at $t_0 > 0$ instead of 0 (to guarantee existence and uniqueness of a solution, as discussed in Section 2). We give a unified study of this ODE using an appropriate Lyapunov function, given by

$$L_r(X, Z, t) = r(t)(f(X) - f^\star) + D_{\psi^*}(Z, z^\star), \qquad (2)$$

where $D_{\psi^*}$ is the Bregman divergence associated with $\psi^*$ (a non-negative function defined on $E^* \times E^*$), and $r(t)$ is a desired convergence rate (a non-negative function defined on $\mathbb{R}_+$).
By construction, $L_r$ is a non-negative function on $\mathcal{X} \times E^* \times \mathbb{R}_+$. If $t \mapsto L_r(X(t), Z(t), t)$ is a non-increasing function for all solution trajectories $(X(t), Z(t))$, then $L_r$ is said to be a Lyapunov function for the ODE, in reference to Aleksandr Mikhailovich Lyapunov [12]. We give in Theorem 2 a sufficient condition on $\eta$, $w$ and $r$ for $L_r$ to be a Lyapunov function for $\mathrm{AMD}_{w,\eta}$, and show that under these conditions, $f(X(t))$ converges to $f^\star$ at the rate $1/r(t)$.

Figure 1: Illustration of $\mathrm{AMD}_{w,\eta}$. The dual variable $Z$ evolves in the dual space $E^*$ and accumulates negative gradients at a rate $\eta(t)$, and the primal variable $X(t)$ (green solid line) is obtained by averaging the mirrored trajectory $\{\nabla\psi^*(Z(\tau)), \tau \in [t_0, t]\}$ (green dashed line), with weights $w(\tau)$.

In Section 3, we give an equivalent formulation of $\mathrm{AMD}_{w,\eta}$ written purely in the primal space. We give several examples of these dynamics for simple constraint sets. In particular, when the feasible set is the probability simplex, we derive an accelerated version of the replicator dynamics, an ODE that plays an important role in evolutionary game theory [22] and viability theory [4].

Many heuristics have been developed to empirically speed up the convergence of accelerated methods. Most of these heuristics consist in restarting the ODE (or the algorithm in discrete time) whenever a simple condition is met. For example, a gradient restart heuristic is proposed in [17], in which the algorithm is restarted whenever the trajectory forms an acute angle with the gradient (which intuitively indicates that the trajectory is not making progress), and a speed restarting heuristic is proposed in [20], in which the ODE is restarted whenever the speed $\|\dot X(t)\|$ decreases (which intuitively indicates that progress is slowing).
These heuristics are known to empirically improve the speed of convergence, but provide few guarantees. For example, the gradient restart in [17] is only studied for unconstrained quadratic problems, and the speed restart in [20] is only studied for unconstrained strongly convex problems. In particular, it is not guaranteed (to our knowledge) that these heuristics preserve the original convergence rate of the non-restarted method when the objective function is not strongly convex. In Section 4, we propose a new heuristic that provides such guarantees, and that is based on a simple idea for adaptively computing the weights $w(t)$ along the solution trajectories. The heuristic simply decreases the time derivative of the Lyapunov function $L_r(X(t), Z(t), t)$ whenever possible. Thus it preserves the $1/r(t)$ convergence rate. Other adaptive methods have been applied to convex optimization, such as Adagrad [7] and Adam [10], which adapt the learning rate in first-order methods by maintaining moment estimates of the observed gradients. They are particularly well suited to problems with sparse gradients. While these methods are similar in spirit to adaptive averaging, they are not designed for accelerated methods. In Section 5, we give numerical experiments in which we compare the performance of adaptive averaging and restarting. The experiments indicate that adaptive averaging compares favorably in all of the examples, and gives a significant improvement in some cases. We conclude with a brief discussion in Section 6.

2 Accelerated mirror descent with generalized averaging

We start by giving an equivalent form of $\mathrm{AMD}_{w,\eta}$, which we use to briefly discuss existence and uniqueness of a solution.
Writing the second equation as $X(t)W(t) - X(t_0)W(t_0) = \int_{t_0}^t w(\tau)\nabla\psi^*(Z(\tau))\,d\tau$, then taking the time-derivative, we have

$$\dot X(t)W(t) + X(t)w(t) = w(t)\nabla\psi^*(Z(t)).$$

Thus the ODE is equivalent to

$$\mathrm{AMD}'_{w,\eta} \quad \begin{cases} \dot Z(t) = -\eta(t)\nabla f(X(t)) \\ \dot X(t) = \frac{w(t)}{W(t)}\left(\nabla\psi^*(Z(t)) - X(t)\right) \\ X(t_0) = \nabla\psi^*(Z(t_0)) = x_0. \end{cases}$$

The following theorem guarantees existence and uniqueness of the solution.

Theorem 1. Suppose that $W(t_0) > 0$. Then $\mathrm{AMD}_{w,\eta}$ has a unique maximal (i.e. defined on a maximal interval) solution $(X(t), Z(t))$ that is $C^1([t_0, +\infty))$. Furthermore, for all $t \geq t_0$, $X(t)$ belongs to the feasible set $\mathcal{X}$.

Proof. Recall that, by assumption, $\nabla f$ and $\nabla\psi^*$ are both Lipschitz, and $w, \eta$ are continuous. Furthermore, $W(t)$ is non-decreasing and continuous, as the integral of a non-negative function, thus $w(t)/W(t) \leq w(t)/W(t_0)$. This guarantees that on any finite interval $[t_0, T)$, the functions $\eta(t)$ and $w(t)/W(t)$ are bounded. Therefore, $-\eta(t)\nabla f(X)$ and $\frac{w(t)}{W(t)}(\nabla\psi^*(Z) - X)$ are Lipschitz functions of $(X, Z)$, uniformly in $t \in [t_0, T)$. By the Cauchy-Lipschitz theorem (e.g. Theorem 2.5 in [21]), there exists a unique $C^1$ solution defined on $[t_0, T)$. Since $T$ is arbitrary, this defines a unique solution on all of $[t_0, +\infty)$. Indeed, any two solutions defined on $[t_0, T_1)$ and $[t_0, T_2)$ with $T_2 > T_1$ coincide on $[t_0, T_1)$.
Finally, feasibility of the solution follows from the fact that $\mathcal{X}$ is convex and $X(t)$ is the weighted average of points in $\mathcal{X}$, specifically $x_0$ and the set $\{\nabla\psi^*(Z(\tau)), \tau \in [t_0, t]\}$.

Note that in general, it is important to initialize the ODE at $t_0$ and not 0, since $W(0) = 0$ and $w(t)/W(t)$ can diverge at 0, in which case one cannot apply the Cauchy-Lipschitz theorem. It is possible, however, to prove existence and uniqueness with $t_0 = 0$ for some choices of $w$, by taking a sequence of Lipschitz ODEs that approximate the original one, as is done in [20], but this is a technicality and does not matter for practical purposes.

We now move to our main result for this section. Suppose that $r$ is an increasing, positive, differentiable function on $[t_0, +\infty)$, and consider the candidate Lyapunov function $L_r$ defined in (2), where the Bregman divergence term is given by

$$D_{\psi^*}(z, y) := \psi^*(z) - \psi^*(y) - \langle \nabla\psi^*(y), z - y \rangle,$$

and $z^\star$ is a point in the dual space such that $\nabla\psi^*(z^\star) = x^\star$ belongs to the set of minimizers $S$.
Let $(X(t), Z(t))$ be the unique maximal solution trajectory of $\mathrm{AMD}_{w,\eta}$. Taking the derivative of $t \mapsto L_r(X(t), Z(t), t) = r(t)(f(X(t)) - f^\star) + D_{\psi^*}(Z(t), z^\star)$, we have

$$\begin{aligned}
\frac{d}{dt} L_r(X(t), Z(t), t) &= r'(t)(f(X(t)) - f^\star) + r(t)\left\langle \nabla f(X(t)), \dot X(t) \right\rangle + \left\langle \dot Z(t), \nabla\psi^*(Z(t)) - \nabla\psi^*(z^\star) \right\rangle \\
&= r'(t)(f(X(t)) - f^\star) + r(t)\left\langle \nabla f(X(t)), \dot X(t) \right\rangle + \left\langle -\eta(t)\nabla f(X(t)),\ X(t) + \tfrac{W(t)}{w(t)}\dot X(t) - x^\star \right\rangle \\
&\leq (f(X(t)) - f^\star)(r'(t) - \eta(t)) + \left\langle \nabla f(X(t)), \dot X(t) \right\rangle\!\left( r(t) - \tfrac{\eta(t)W(t)}{w(t)} \right), \qquad (3)
\end{aligned}$$

where we used the expressions for $\dot Z$ and $\nabla\psi^*(Z)$ from $\mathrm{AMD}'_{w,\eta}$ in the second equality, and convexity of $f$ in the last inequality. Equipped with this bound, it becomes straightforward to give sufficient conditions for $L_r$ to be a Lyapunov function.

Theorem 2. Suppose that for all $t \in [t_0, +\infty)$,
1. $\eta(t) \geq r'(t)$, and
2. $\left\langle \nabla f(X(t)), \dot X(t) \right\rangle\left( r(t) - \frac{\eta(t)W(t)}{w(t)} \right) \leq 0$.

Then $L_r$ is a Lyapunov function for $\mathrm{AMD}_{w,\eta}$, and for all $t \geq t_0$, $f(X(t)) - f^\star \leq \frac{L_r(X(t_0), Z(t_0), t_0)}{r(t)}$.

Proof. The two conditions, combined with inequality (3), imply that $\frac{d}{dt} L_r(X(t), Z(t), t) \leq 0$, thus $L_r$ is a Lyapunov function.
Finally, since $D_{\psi^*}$ is non-negative and $L_r$ is decreasing, we have

$$f(X(t)) - f^\star \leq \frac{L_r(X(t), Z(t), t)}{r(t)} \leq \frac{L_r(X(t_0), Z(t_0), t_0)}{r(t)},$$

which proves the claim.

Note that the second condition depends on the solution trajectory $X(t)$, and may be hard to check a priori. However, we give one special case in which the condition trivially holds.

Corollary 1. Suppose that for all $t \in [t_0, +\infty)$, $\eta(t) = \frac{w(t)r(t)}{W(t)}$ and $\frac{w(t)}{W(t)} \geq \frac{r'(t)}{r(t)}$. Then $L_r$ is a Lyapunov function for $\mathrm{AMD}_{w,\eta}$, and for all $t \geq t_0$, $f(X(t)) - f^\star \leq \frac{L_r(X(t_0), Z(t_0), t_0)}{r(t)}$.

Next, we describe a method to construct weight functions $w, \eta$ that satisfy the conditions of Corollary 1, given a desired rate $r$. Of course, it suffices to construct $w$ that satisfies $\frac{w(t)}{W(t)} \geq \frac{r'(t)}{r(t)}$, then to set $\eta(t) = \frac{w(t)r(t)}{W(t)}$. We can reparameterize the weight function by writing $\frac{w(t)}{W(t)} = a(t)$. Then integrating from $t_0$ to $t$, we have

$$\frac{W(t)}{W(t_0)} = e^{\int_{t_0}^t a(\tau)\,d\tau}, \quad \text{and} \quad w(t) = w(t_0)\,\frac{a(t)}{a(t_0)}\, e^{\int_{t_0}^t a(\tau)\,d\tau}. \qquad (4)$$

Therefore the conditions of the corollary are satisfied whenever $w(t)$ is of the form (4) and $a : \mathbb{R}_+ \to \mathbb{R}_+$ is a continuous, positive function with $a(t) \geq \frac{r'(t)}{r(t)}$. Note that the expression of $w$ is defined up to the constant $w(t_0)$, which reflects the fact that the condition of the corollary is scale-invariant (if the condition holds for a function $w$, then it holds for $\alpha w$ for all $\alpha > 0$).

Example 1. Let $r(t) = t^2$. Then $r'(t)/r(t) = 2/t$, and we can take $a(t) = \frac{\beta}{t}$ with $\beta \geq 2$.
Then $w(t) = \frac{a(t)}{a(t_0)} e^{\int_{t_0}^t a(\tau)d\tau} = \frac{\beta/t}{\beta/t_0}\, e^{\beta\ln(t/t_0)} = (t/t_0)^{\beta-1}$ and $\eta(t) = \frac{w(t)r(t)}{W(t)} = \beta t$, and we recover the weighting scheme used in [11].

Example 2. More generally, if $r(t) = t^p$, $p \geq 1$, then $r'(t)/r(t) = p/t$, and we can take $a(t) = \frac{\beta}{t}$ with $\beta \geq p$. Then $w(t) = (t/t_0)^{\beta-1}$, and $\eta(t) = \frac{w(t)r(t)}{W(t)} = \beta t^{p-1}$.

We also exhibit in the following a second energy function that is guaranteed to decrease under the same conditions. This energy function, unlike the Lyapunov function $L_r$, does not guarantee a specific convergence rate. However, it captures a natural measure of energy in the system. To define this energy function, we will use the following characterization of the inverse mirror map: by duality of the subdifferentials (e.g. Theorem 23.5 in [18]), we have for a pair of convex conjugate functions $\psi$ and $\psi^*$ that $x \in \partial\psi^*(x^*)$ if and only if $x^* \in \partial\psi(x)$. To simplify the discussion, we will assume that $\psi$ is also differentiable, so that $(\nabla\psi^*)^{-1} = \nabla\psi$ (this assumption can be relaxed). In what follows, we will denote by $\check X = \nabla\psi(X)$ and $\check Z = \nabla\psi^*(Z)$.

Theorem 3. Let $(X(t), Z(t))$ be the unique maximal solution of $\mathrm{AMD}_{w,\eta}$, and let $\check X = \nabla\psi(X)$. Consider the energy function

$$E_r(t) = f(X(t)) + \frac{1}{r(t)} D_{\psi^*}(Z(t), \check X(t)). \qquad (5)$$

Then if $w, \eta$ satisfy condition (2) of Theorem 2, $E_r$ is a decreasing function of time.

Proof. To make the notation more concise, we omit the explicit dependence on time in this proof. We have $D_{\psi^*}(Z, \check X) = \psi^*(Z) - \psi^*(\check X) - \langle X, Z - \check X \rangle$, since $\nabla\psi^*(\check X) = X$. Taking the time-derivative, we have

$$\frac{d}{dt} D_{\psi^*}(Z, \check X) = \left\langle \nabla\psi^*(Z), \dot Z \right\rangle - \left\langle \nabla\psi^*(\check X), \dot{\check X} \right\rangle - \left\langle \dot X, Z - \check X \right\rangle - \left\langle X, \dot Z - \dot{\check X} \right\rangle = \left\langle \nabla\psi^*(Z) - X, \dot Z \right\rangle - \left\langle \dot X, Z - \check X \right\rangle.$$

Using the second equation in $\mathrm{AMD}'_{w,\eta}$, we have $\nabla\psi^*(Z) - X = \frac{1}{a}\dot X$, and $\left\langle \dot X, Z - \check X \right\rangle = a\left\langle \nabla\psi^*(Z) - \nabla\psi^*(\check X), Z - \check X \right\rangle \geq 0$ by monotonicity of $\nabla\psi^*$. Combining, we have

$$\frac{d}{dt} D_{\psi^*}(Z, \check X) \leq \frac{1}{a}\left\langle \dot X, \dot Z \right\rangle = -\frac{\eta}{a}\left\langle \dot X, \nabla f(X) \right\rangle,$$

and we can finally bound the derivative of $E_r$:

$$\frac{d}{dt} E_r(t) = \left\langle \nabla f(X), \dot X \right\rangle + \frac{1}{r}\frac{d}{dt} D_{\psi^*}(Z, \check X) - \frac{r'}{r^2} D_{\psi^*}(Z, \check X) \leq \left\langle \nabla f(X), \dot X \right\rangle\left( 1 - \frac{\eta}{ar} \right) - \frac{r'}{r^2} D_{\psi^*}(Z, \check X).$$

Therefore condition (2) of Theorem 2 implies that $\frac{d}{dt} E_r(t) \leq 0$.

This energy function can be interpreted, loosely speaking, as the sum of a potential energy given by $f(X)$, and a kinetic energy given by $\frac{1}{r(t)} D_{\psi^*}(Z, \check X)$: indeed, when the problem is unconstrained, one can take $\psi^*(z) = \frac{1}{2}\|z\|^2$, in which case $\nabla\psi^* = \nabla\psi = I$, the identity, and $D_{\psi^*}(Z, \check X) = \frac{1}{2}\|\check Z - X\|^2 = \frac{1}{2}\|\frac{\dot X}{a}\|^2$, a quantity proportional to the kinetic energy.

3 Primal Representation and Example Dynamics

An equivalent primal representation can be obtained by rewriting the equations in terms of $\check Z = \nabla\psi^*(Z)$ and its derivatives ($\check Z$ is a primal variable that remains in $\mathcal{X}$, since $\nabla\psi^*$ maps into $\mathcal{X}$). In this section, we assume that $\psi^*$ is twice differentiable on $E^*$.
Taking the time derivative of $\check Z(t) = \nabla\psi^*(Z(t))$, we have

$$\dot{\check Z}(t) = \nabla^2\psi^*(Z(t))\,\dot Z(t) = -\eta(t)\,\nabla^2\psi^* \circ \nabla\psi(\check Z(t))\,\nabla f(X(t)),$$

where $\nabla^2\psi^*(z)$ is the Hessian of $\psi^*$ at $z$, defined as $\nabla^2\psi^*(z)_{ij} = \frac{\partial^2\psi^*(z)}{\partial z_j \partial z_i}$. Then using the averaging expression for $X$, we can write $\mathrm{AMD}_{w,\eta}$ in the following primal form

$$\mathrm{AMD}^p_{w,\eta} \quad \begin{cases} \dot{\check Z}(t) = -\eta(t)\,\nabla^2\psi^* \circ \nabla\psi(\check Z(t))\,\nabla f\left(\frac{x_0 W(t_0) + \int_{t_0}^t w(\tau)\check Z(\tau)\,d\tau}{W(t)}\right) \\ \check Z(t_0) = x_0. \end{cases} \qquad (6)$$

A similar derivation can be made for the mirror descent ODE without acceleration, which can be written as follows [11] (see also the original derivation of Nemirovski and Yudin in Chapter 3 of [13])

$$\mathrm{MD} \quad \begin{cases} \dot Z(t) = -\nabla f(X(t)) \\ X(t) = \nabla\psi^*(Z(t)) \\ X(t_0) = x_0. \end{cases} \qquad (7)$$

Note that this can be interpreted as a limit case of $\mathrm{AMD}_{w,\eta}$ with $\eta(t) \equiv 1$ and $w(t)$ a Dirac function at $t$. Taking the time derivative of $X(t) = \nabla\psi^*(Z(t))$, we have $\dot X(t) = \nabla^2\psi^*(Z(t))\dot Z(t)$, which leads to the primal form of the mirror descent ODE

$$\mathrm{MD}^p \quad \begin{cases} \dot X(t) = -\nabla^2\psi^* \circ \nabla\psi(X(t))\,\nabla f(X(t)) \\ X(t_0) = x_0. \end{cases}$$

The operator $\nabla^2\psi^* \circ \nabla\psi$ appears in both primal representations (6) and (7), and multiplies the gradient of $f$. It can be thought of as a transformation of the gradient which ensures that the primal trajectory remains in the feasible set; this is illustrated in the supplementary material. For some choices of $\psi$, $\nabla^2\psi^* \circ \nabla\psi$ has a simple expression. We give two examples below.

We also observe that in its primal form, $\mathrm{AMD}^p_{w,\eta}$ is a generalization of the ODE family studied in [23], which can be written as $\frac{d}{dt}\nabla\psi(X(t) + e^{-\alpha(t)}\dot X(t)) = -e^{\alpha(t)+\beta(t)}\nabla f(X(t))$, for which they prove the convergence rate $O(e^{-\beta(t)})$. This corresponds to setting, in our notation, $a(t) = e^{\alpha(t)}$, $r(t) = e^{\beta(t)}$, and taking $\eta(t) = a(t)r(t)$ (which corresponds to the condition of Corollary 1).

Positive-orthant-constrained dynamics. Suppose that $\mathcal{X}$ is the positive orthant $\mathbb{R}^n_+$, and consider the negative entropy function $\psi(x) = \sum_i x_i \ln x_i$. Then its dual is $\psi^*(z) = \sum_i e^{z_i - 1}$, and we have $\nabla\psi(x)_i = 1 + \ln x_i$ and $\nabla^2\psi^*(z)_{ij} = \delta_i^j e^{z_i - 1}$, where $\delta_i^j$ is 1 if $i = j$ and 0 otherwise. Thus for all $x \in \mathbb{R}^n_+$, $\nabla^2\psi^* \circ \nabla\psi(x) = \operatorname{diag}(x)$. Therefore, the primal forms (7) and (6) reduce to, respectively,

$$\begin{cases} \forall i,\ \dot X_i = -X_i \nabla f(X)_i \\ X(0) = x_0 \end{cases} \qquad\qquad \begin{cases} \forall i,\ \dot{\check Z}_i = -\eta(t)\,\check Z_i \nabla f(X)_i \\ \check Z(t_0) = x_0, \end{cases}$$

where for the second ODE we write $X$ compactly to denote the weighted average given by the second equation of $\mathrm{AMD}_{w,\eta}$. When $f$ is affine, the mirror descent ODE leads to the Lotka-Volterra equation, which has applications in economics and ecology. For the mirror descent ODE, one can verify that the solution remains in the positive orthant, since $\dot X_i$ tends to 0 as $X_i$ approaches the boundary of the feasible set. Similarly for the accelerated version, $\dot{\check Z}$ tends to 0 as $\check Z$ approaches the boundary, thus $\check Z$ remains feasible, and so does $X$ by convexity.

Simplex-constrained dynamics: the replicator equation. Now suppose that $\mathcal{X}$ is the $n$-simplex, $\mathcal{X} = \Delta = \{x \in \mathbb{R}^n_+ : \sum_{i=1}^n x_i = 1\}$.
Consider the distance-generating function $\psi(x) = \sum_{i=1}^n x_i \ln x_i + \delta_{\mathcal{X}}(x)$, where $\delta_{\mathcal{X}}(\cdot)$ is the convex indicator function of the feasible set. Then its conjugate is $\psi^*(z) = \ln\left(\sum_{i=1}^n e^{z_i}\right)$, defined on $E^*$, and we have $\nabla\psi(x)_i = 1 + \ln x_i$, $\nabla\psi^*(z)_i = \frac{e^{z_i}}{\sum_k e^{z_k}}$, and $\nabla^2\psi^*(z)_{ij} = \delta_i^j \frac{e^{z_i}}{\sum_k e^{z_k}} - \frac{e^{z_i} e^{z_j}}{\left(\sum_k e^{z_k}\right)^2}$. Then it is simple to calculate $\nabla^2\psi^* \circ \nabla\psi(x)_{ij} = \delta_i^j \frac{x_i}{\sum_k x_k} - \frac{x_i x_j}{\left(\sum_k x_k\right)^2} = \delta_i^j x_i - x_i x_j$. Therefore, the primal forms (7) and (6) reduce to, respectively,

$$\begin{cases} \forall i,\ \dot X_i + X_i\left(\nabla f(X)_i - \langle X, \nabla f(X)\rangle\right) = 0 \\ X(0) = x_0 \end{cases} \qquad\qquad \begin{cases} \forall i,\ \dot{\check Z}_i + \eta(t)\check Z_i\left(\nabla f(X)_i - \langle \check Z, \nabla f(X)\rangle\right) = 0 \\ \check Z(0) = x_0. \end{cases}$$

The first ODE is known as the replicator dynamics [19], and has many applications in evolutionary game theory [22] and viability theory [4], among others. See the supplementary material for additional discussion on the interpretation and applications of the replicator dynamics. This example shows that the replicator dynamics can be accelerated simply by performing the original replicator update on the variable $\check Z$, in which (i) the gradient of the objective function is scaled by $\eta(t)$ at time $t$, and (ii) the gradient is evaluated at $X(t)$, the weighted average of the $\check Z$ trajectory.

4 Adaptive Averaging Heuristic

In this section, we propose an adaptive averaging heuristic for adaptively computing the weights $w$. Note that in Corollary 1, we simply set $a(t) = \frac{\eta(t)}{r(t)}$ so that $\left\langle \nabla f(X(t)), \dot X(t) \right\rangle\left( r(t) - \frac{\eta(t)}{a(t)} \right)$ is identically zero (thus trivially satisfying condition (2) of Theorem 2).
However, from the bound (3), if this term is negative, then this helps further decrease the Lyapunov function $L_r$ (as well as the energy function $E_r$). A simple strategy is then to adaptively choose $a(t)$ as follows:

$$\begin{cases} a(t) = \frac{\eta(t)}{r(t)} & \text{if } \left\langle \nabla f(X(t)), \dot X(t) \right\rangle > 0, \\ a(t) \geq \frac{\eta(t)}{r(t)} & \text{otherwise.} \end{cases} \qquad (8)$$

If we further have $\eta(t) \geq r'(t)$, then the conditions of Theorem 2 and Theorem 3 are satisfied, which guarantee that $L_r$ is a Lyapunov function and that the energy $E_r$ decreases. In particular, such a heuristic would preserve the convergence rate $r(t)$ by Theorem 2.

We now propose a discrete version of the heuristic when $r(t) = t^2$. We consider the quadratic rate in particular since in this case the discretization proposed by [11] preserves the quadratic rate, and corresponds to a first-order accelerated method² for which many heuristics have been developed, such as the restarting heuristics [17, 20] discussed in the introduction. To satisfy condition (1) of Theorem 2, we choose $\eta(t) = \beta t$ with $\beta \geq 2$. Note that in this case, $\frac{\eta(t)}{r(t)} = \frac{\beta}{t}$. In the supplementary material, we propose a discretization of the heuristic (8), using the correspondence $t = k\sqrt{s}$, for a step size $s$. The resulting algorithm is summarized in Algorithm 1, where $\psi^*$ is a smooth distance-generating function, and $R$ is a regularizer assumed to be strongly convex and smooth. We give a bound on the convergence rate of Algorithm 1 in the supplementary material. The proof relies on a discrete counterpart of the Lyapunov function $L_r$.

The algorithm keeps $a_k = a_{k-1}$ whenever $f(\tilde x^{(k+1)}) \leq f(\tilde x^{(k)})$, and sets $a_k$ to $\frac{\beta}{k\sqrt{s}}$ otherwise. This results in a non-increasing sequence $a_k$.
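A minimal continuous-time sketch of the heuristic (8), under the same simplifying assumptions as in the earlier check (identity mirror map, unconstrained quadratic, forward Euler with illustrative parameters): keep $a(t)$ at its current value while $\langle \nabla f(X(t)), \dot X(t)\rangle \leq 0$, and reset it to the default $\beta/t$ otherwise. Since $\beta/t$ decreases, a held-constant $a$ automatically satisfies $a(t) \geq \beta/t$.

```python
import numpy as np

def amd_adaptive_a(grad, x0, t0=1.0, T=10.0, dt=1e-3, beta=2.0):
    """Euler sketch of AMD'_{w,eta} with eta(t) = beta*t, r(t) = t^2, and a(t)
    chosen by the adaptive rule (8): reset a to beta/t whenever
    <grad f(X), dX/dt> > 0 (no progress), keep it constant otherwise."""
    X, Z, t = x0.copy(), x0.copy(), t0
    a = beta / t0
    while t < T:
        g = grad(X)
        Xdot = a * (Z - X)                 # dX/dt = a(t)(grad_psi*(Z) - X)
        if np.dot(g, Xdot) > 0:            # fall back to the default schedule
            a = beta / t
            Xdot = a * (Z - X)
        Z = Z - dt * (beta * t) * g        # dZ/dt = -eta(t) grad f(X)
        X = X + dt * Xdot
        t += dt
    return X

c = np.array([1.0, -0.5])
Xf = amd_adaptive_a(lambda x: x - c, np.zeros(2))
```

By construction the rule preserves conditions (1) and (2) of Theorem 2, so the $1/t^2$ rate should still hold (up to Euler discretization error) while allowing larger averaging weights on productive stretches of the trajectory.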
It is worth observing that in continuous time, from the expression (4), a constant $a(t)$ over an interval $[t_1, t_2]$ corresponds to an exponential increase in the weight $w(t)$ over that interval, while $a(t) = \frac{\beta}{t}$ corresponds to a polynomial increase $w(t) = (t/t_0)^{\beta-1}$. Intuitively, adaptive averaging increases the weights $w(t)$ on portions of the trajectory which make progress.

Algorithm 1 Accelerated mirror descent with adaptive averaging
1: Initialize $\tilde x^{(0)} = x_0$, $\check z^{(0)} = x_0$, $a_1 = \frac{\beta}{\sqrt{s}}$
2: for $k \in \mathbb{N}$ do
3:   $\check z^{(k+1)} = \arg\min_{\check z \in \mathcal{X}}\ \beta k s \left\langle \nabla f(x^{(k)}), \check z \right\rangle + D_\psi(\check z, \check z^{(k)})$
4:   $\tilde x^{(k+1)} = \arg\min_{\tilde x \in \mathcal{X}}\ \gamma s \left\langle \nabla f(x^{(k)}), \tilde x \right\rangle + R(\tilde x, x^{(k)})$
5:   $x^{(k+1)} = \lambda_{k+1}\check z^{(k+1)} + (1 - \lambda_{k+1})\tilde x^{(k+1)}$, with $\lambda_k = \frac{\sqrt{s}\,a_k}{1 + \sqrt{s}\,a_k}$
6:   $a_k = \min\left(a_{k-1}, \frac{\beta_{\max}}{k\sqrt{s}}\right)$
7:   if $f(\tilde x^{(k+1)}) - f(\tilde x^{(k)}) > 0$ then
8:     $a_k = \frac{\beta}{k\sqrt{s}}$

5 Numerical Experiments

In this section, we compare our adaptive averaging heuristic (in its discrete version given in Algorithm 1) to existing restarting heuristics. We consider simplex-constrained problems and take the distance-generating function $\psi$ to be the entropy function, so that the resulting algorithm is a discretization of the accelerated replicator ODE studied in Section 3. We perform the experiments in $\mathbb{R}^3$ so that we can visualize the solution trajectories (the supplementary material contains additional experiments in higher dimension). We consider different objective functions: a strongly convex quadratic given by $f(x) = (x - s)^T A (x - s)$ for a positive definite matrix $A$, a weakly convex quadratic, a linear function $f(x) = c^T x$, and the Kullback-Leibler divergence $f(x) = D_{KL}(x^\star, x)$. We compare the following methods:

1.
The original accelerated mirror descent method (in which the weights follow a predetermined schedule given by $a_k = \frac{\beta}{k\sqrt{s}}$),
2. Our adaptive averaging, in which $a_k$ is computed adaptively following Algorithm 1,
3. The gradient restarting heuristic in [17], in which the algorithm is restarted from the current point whenever $\left\langle \nabla f(x^{(k)}), x^{(k+1)} - x^{(k)} \right\rangle > 0$,
4. The speed restarting heuristic in [20], in which the algorithm is restarted from the current point whenever $\|x^{(k+1)} - x^{(k)}\| \leq \|x^{(k)} - x^{(k-1)}\|$.

The results are shown in Figure 2. Each subfigure is divided into four plots: clockwise from the top left, we show the value of the objective function, the trajectory on the simplex, the value of the energy function $E_r$, and the value of the Lyapunov function $L_r$.

²For faster rates $r(t) = t^p$, $p > 2$, it is possible to discretize the ODE and preserve the convergence rate, as proposed by Wibisono et al. [23]; however, this discretization results in a higher-order method such as Nesterov's cubic accelerated Newton method [16].

The experiments show that adaptive averaging compares favorably to the restarting heuristics on all these examples, with a significant improvement in the strongly convex case. Additionally, the experiments confirm that under the adaptive averaging heuristic, the Lyapunov function is decreasing. This is not the case for the restarting heuristics, as can be seen on the weakly convex example. It is interesting to observe, however, that the energy function $E_r$ is non-increasing for all the methods in our experiments. If we interpret the energy as the sum of a potential and a kinetic term, then this could be explained intuitively by the fact that restarting keeps the potential energy constant and decreases the kinetic energy (since the velocity is reset to zero).
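For concreteness, the following is a sketch of the kind of setup used in these experiments: Algorithm 1 on the simplex with the entropy mirror map, so that both arg-min steps have closed-form multiplicative-weights solutions. Here we also take $R$ to be the entropy Bregman divergence (KL) and $\gamma = 1$, one concrete choice of the regularizer assumed in Algorithm 1; the objective, step size, and parameter values are illustrative assumptions, not the paper's exact experimental settings.

```python
import numpy as np

def mw_step(p, step_times_grad):
    # argmin_{q in simplex} <g, q> * step + KL(q, p): multiplicative weights
    q = p * np.exp(-step_times_grad)
    return q / q.sum()

def amd_adaptive_averaging(f, grad_f, x0, s=1e-2, beta=3.0, beta_max=10.0,
                           iters=2000):
    """Sketch of Algorithm 1 with the entropy mirror map on the simplex
    (a discretization of the accelerated replicator dynamics)."""
    rs = np.sqrt(s)
    xt, z = x0.copy(), x0.copy()
    a = beta / rs                         # a_1 = beta / sqrt(s)
    for k in range(1, iters + 1):
        lam = rs * a / (1.0 + rs * a)     # lambda_k, from the current a
        x = lam * z + (1.0 - lam) * xt    # line 5 (convex combination)
        g = grad_f(x)
        z = mw_step(z, beta * k * s * g)  # line 3
        xt_new = mw_step(x, s * g)        # line 4, with gamma = 1 and R = KL
        a = min(a, beta_max / (k * rs))   # line 6: adaptive averaging
        if f(xt_new) > f(xt):             # lines 7-8: reset the schedule
            a = beta / (k * rs)
        xt = xt_new
    return xt

# Illustrative simplex-constrained quadratic with interior minimizer p.
p = np.array([0.2, 0.3, 0.5])
f = lambda x: 0.5 * np.sum((x - p)**2)
res = amd_adaptive_averaging(f, lambda x: x - p, np.full(3, 1.0 / 3.0))
```

Because both updates are multiplicative, the iterates stay in the (relative interior of the) simplex by construction, mirroring the feasibility argument for the continuous dynamics.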
It is also worth observing that even though the Lyapunov function L_r is non-increasing, it will not necessarily converge to 0 when there is more than one minimizer (its limit will depend on the choice of z⋆ in the definition of L_r).
Finally, we observe that the methods have a different qualitative behavior: the original accelerated method typically exhibits oscillations around the set of minimizers. The heuristics alleviate these oscillations in different ways: intuitively, adaptive averaging acts by increasing the weights on portions of the trajectory which make the most progress, while the restarting heuristics reset the velocity to zero whenever the algorithm detects that the trajectory is moving in a bad direction. The speed restarting heuristic seems to be more conservative, in that it restarts more frequently.

Figure 2: Examples of accelerated descent with adaptive averaging and restarting. (a) Strongly convex quadratic. (b) Weakly convex function. (c) Linear function. (d) KL divergence.

6 Conclusion

Motivated by the averaging formulation of accelerated mirror descent, we studied a family of ODEs with a generalized averaging scheme, and gave simple sufficient conditions on the weight functions to guarantee a given convergence rate in continuous time. We showed, as an example, how the replicator ODE can be accelerated by averaging. Our adaptive averaging heuristic preserves the convergence rate (since it preserves the Lyapunov function), and it seems to perform at least as well as other heuristics for first-order accelerated methods, and in some cases considerably better. This encourages further investigation into the performance of adaptive averaging, both theoretically (by attempting to prove faster rates, e.g. for strongly convex functions) and numerically, by testing it on other methods, such as the higher-order accelerated methods proposed in [23].

References

[1] H. Attouch and J.
Peypouquet. The rate of convergence of Nesterov's accelerated forward-backward method is actually faster than 1/k². SIAM Journal on Optimization, 26(3):1824–1834, 2016.

[2] H. Attouch, J. Peypouquet, and P. Redont. Fast convergence of an inertial gradient-like system with vanishing viscosity. CoRR, abs/1507.04782, 2015.

[3] H. Attouch, J. Peypouquet, and P. Redont. Fast convex optimization via inertial dynamics with Hessian driven damping. CoRR, abs/1601.07113, 2016.

[4] J.-P. Aubin. Viability Theory. Birkhäuser Boston Inc., Cambridge, MA, USA, 1991.

[5] A. Bloch, editor. Hamiltonian and Gradient Flows, Algorithms, and Control. American Mathematical Society, 1994.

[6] A. A. Brown and M. C. Bartholomew-Biggs. Some effective methods for unconstrained optimization based on the solution of systems of ordinary differential equations. Journal of Optimization Theory and Applications, 62(2):211–224, 1989.

[7] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, July 2011.

[8] N. Flammarion and F. R. Bach. From averaging to acceleration, there is only a step-size. In 28th Conference on Learning Theory, COLT, pages 658–695, 2015.

[9] U. Helmke and J. Moore. Optimization and Dynamical Systems. Communications and Control Engineering Series. Springer-Verlag, 1994.

[10] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.

[11] W. Krichene, A. Bayen, and P. Bartlett. Accelerated mirror descent in continuous and discrete time. In NIPS, 2015.

[12] A. Lyapunov. General Problem of the Stability of Motion. Control Theory and Applications Series. Taylor & Francis, 1992.

[13] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. Wiley, 1983.

[14] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.

[15] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[16] Y. Nesterov. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159–181, 2008.

[17] B. O'Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015. ISSN 1615-3375.

[18] R. Rockafellar. Convex Analysis. Princeton University Press, 1970.

[19] K. Sigmund. Complexity, Language, and Life: Mathematical Approaches, chapter A Survey of Replicator Equations, pages 88–104. Springer Berlin Heidelberg, Berlin, Heidelberg, 1986.

[20] W. Su, S. Boyd, and E. Candès. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In NIPS, 2014.

[21] G. Teschl. Ordinary Differential Equations and Dynamical Systems, volume 140. American Mathematical Society, 2012.

[22] J. W. Weibull. Evolutionary Game Theory. MIT Press, 1997.

[23] A. Wibisono, A. C. Wilson, and M. I. Jordan. A variational perspective on accelerated methods in optimization. CoRR, abs/1603.04245, 2016.