{"title": "A Latent Variational Framework for Stochastic Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 5646, "page_last": 5656, "abstract": "This paper provides a unifying theoretical framework for stochastic optimization algorithms by means of a latent stochastic variational problem. Using techniques from stochastic control, the solution to the variational problem is shown to be equivalent to that of a Forward Backward Stochastic Differential Equation (FBSDE). By solving these equations, we recover a variety of existing adaptive stochastic gradient descent methods. This framework establishes a direct connection between stochastic optimization algorithms and a secondary latent inference problem on gradients, where a prior measure on gradient observations determines the resulting algorithm.", "full_text": "A Latent Variational Framework for Stochastic\n\nOptimization\n\nPhilippe Casgrain\n\nDepartment of Statistical Sciences\n\nUniversity of Toronto\nToronto, ON, Canada\n\np.casgrain@mail.utoronto.ca\n\nAbstract\n\nThis paper provides a unifying theoretical framework for stochastic optimization\nalgorithms by means of a latent stochastic variational problem. Using techniques\nfrom stochastic control, the solution to the variational problem is shown to be equiv-\nalent to that of a Forward Backward Stochastic Differential Equation (FBSDE).\nBy solving these equations, we recover a variety of existing adaptive stochastic\ngradient descent methods. This framework establishes a direct connection between\nstochastic optimization algorithms and a secondary latent inference problem on\ngradients, where a prior measure on gradient observations determines the resulting\nalgorithm.\n\n1\n\nIntroduction\n\nStochastic optimization algorithms are tools which are crucial to solving optimization problems\narising in machine learning. 
The initial motivation for these algorithms comes from the fact that computing the gradients of a target loss function becomes increasingly difficult as the scale and dimension of an optimization problem grow. In these large-scale optimization problems, deterministic gradient-based optimization algorithms perform poorly due to the computational load of repeatedly computing gradients. Stochastic optimization algorithms remedy this issue by replacing exact gradients of the target loss with a computationally cheap gradient estimator, trading off noise in gradient estimates for computational efficiency at each step.

To illustrate this idea, consider the problem of minimizing a generic risk function f : ℝ^d → ℝ, taking the form

f(x) = (1/|N|) Σ_{z∈N} ℓ(x; z),   (1)

where ℓ : ℝ^d × Z → ℝ, and where we define the set N := {z_i ∈ Z, i = 1, …, N} to be a set of training points. In this definition, we interpret ℓ(x; z) as the model loss at a single training point z ∈ N for the parameters x ∈ ℝ^d.

When N and d are large, computing the gradients of f can be time-consuming. Knowing this, let us consider the path of an optimization algorithm as given by {x_t}_{t∈ℕ}. Rather than computing ∇f(x_t) directly at each point of the optimization process, we may instead collect noisy samples of gradients as

g_t = (1/|N^m_t|) Σ_{z∈N^m_t} ∇_x ℓ(x_t; z),   (2)

where for each t, N^m_t ⊆ N is an independent sample of size m from the set of training points. We assume that m ≪ N is chosen small enough so that g_t can be computed at a significantly lower cost than ∇f(x_t).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
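As a minimal illustration of the estimator (2), the following sketch contrasts the full gradient of (1) with a mini-batch sample. The least-squares loss ℓ(x; z) = ½(a⊤x − y)² with z = (a, y), and all names and constants below, are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, m = 5, 1000, 32                 # dimension, training set size, batch size
A = rng.normal(size=(N, d))           # features a_i of the training points z_i
x_true = rng.normal(size=d)
y = A @ x_true                        # targets y_i, so x_true minimizes f

def full_gradient(x):
    """Exact gradient of f(x) = (1/N) sum_i 0.5*(a_i^T x - y_i)^2."""
    return A.T @ (A @ x - y) / N

def noisy_gradient(x):
    """g_t: average of grad l over an independent sample of size m, as in (2)."""
    idx = rng.choice(N, size=m, replace=False)
    return A[idx].T @ (A[idx] @ x - y[idx]) / m

x = rng.normal(size=d)
g = noisy_gradient(x)   # unbiased but noisy estimate of full_gradient(x)
```

Each call to `noisy_gradient` costs O(md) rather than O(Nd), and its variance shrinks as the batch size m grows, which is exactly the trade-off described above.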
Using the collection of noisy gradients {g_t}_{t∈ℕ}, stochastic optimization algorithms construct an estimator ∇̂f(x_t) of the gradient ∇f(x_t) in order to determine the next step x_{t+1} of the optimizer.

This paper presents a theoretical framework which provides new perspectives on stochastic optimization algorithms, and explores the implicit model assumptions that are made by existing ones. We achieve this by extending the approach taken by Wibisono et al. (2016) to stochastic algorithms. The key step in our approach is to interpret the task of optimization with a stochastic algorithm as a latent variational problem. As a result, we can recover algorithms from this framework which have built-in online learning properties. In particular, these algorithms use an online Bayesian filter on the stream of noisy gradient samples, g_t, to compute estimates of ∇f(x_t). Under various model assumptions on ∇f and g, we recover a number of common stochastic optimization algorithms.

1.1 Related Work

There is a rich literature on stochastic optimization algorithms as a consequence of their effectiveness in machine learning applications. Each algorithm introduces its own variation on the gradient estimator ∇̂f(x_t), as well as other features which can improve the speed of convergence to an optimum. Amongst the simplest of these is stochastic gradient descent and its variants Robbins and Monro (1951), which use an estimator based on single gradient samples. Others, such as Lucas et al. (2018); Nesterov, use momentum and acceleration as features to enhance convergence, and can be interpreted as using exponentially weighted moving averages as gradient estimators. Adaptive gradient descent methods such as AdaGrad from Duchi et al.
(2011) and Adam from Kingma and Ba (2014) use similar moving average estimators, as well as dynamically updated normalization factors. For a survey which covers many modern stochastic optimization methods, see Ruder (2016).

There exist a number of theoretical interpretations of various aspects of stochastic optimization. Cesa-Bianchi et al. (2004) have shown a parallel between stochastic optimization and online learning. Some previous related works, such as Gupta et al. (2017), provide a general model for adaptive methods, generalizing the subgradient projection approach of Duchi et al. (2011). Aitchison (2018) uses a Bayesian model to explain the various features of gradient estimators used in stochastic optimization algorithms. This paper differs from these works by naturally generating stochastic algorithms from a variational principle, rather than attempting to explain their individual features. This work is most similar to that of Wibisono et al. (2016), who provide a variational model for continuous deterministic optimization algorithms.

There is a large body of research on continuous-time approximations to deterministic optimization algorithms via dynamical systems (ODEs) (da Silva and Gazeau (2018); Krichene et al. (2015); Su et al. (2014); Wilson et al. (2016)), as well as approximations to stochastic optimization algorithms by stochastic differential equations (SDEs) (Krichene and Bartlett (2017); Mertikopoulos and Staudigl (2018); Raginsky and Bouvrie (2012); Xu et al. (2018a,b)). In particular, the most similar of these works, Raginsky and Bouvrie (2012); Xu et al. (2018a,b), study continuous approximations to stochastic mirror descent by adding exogenous Brownian noise to the continuous dynamics derived in Wibisono et al. (2016). This work differs by deriving continuous stochastic dynamics for optimizers from a broader theoretical framework, rather than positing the continuous dynamics as-is.
Although the equations studied in these papers may resemble some of the results derived in this one, they differ in a number of ways. First, this paper finds that the randomness present in the optimizer dynamics obtained here is not generated by an exogenous source of noise, but is in fact an explicit function of the randomness generated by the stochastic gradients observed during the optimization process. Another important difference is that the optimizer dynamics presented in this paper make no use of the gradients of the objective function, ∇f (which is inaccessible to a stochastic optimizer), and are only a function of the stream of stochastic gradients g_t.

1.2 Contribution

To the author's knowledge, this is the first paper to produce a theoretical model for stochastic optimization based on a variational interpretation. This paper extends the continuous variational framework of Wibisono et al. (2016) to model stochastic optimization. From this model, we derive optimality conditions in the form of a system of forward-backward stochastic differential equations (FBSDEs), and provide bounds on the expected rate of convergence of the resulting optimization algorithm to the optimum. By discretizing solutions of the continuous system of equations, we can recover a number of well-known stochastic optimization algorithms, demonstrating that these algorithms can be obtained as solutions of the variational model under various assumptions on the loss function, f(x), that is being minimized.

1.3 Paper Structure

In Section 2 we define a continuous-time surrogate model of stochastic optimization. Section 3 uses this model to motivate a stochastic variational problem over optimizers, in which we search for stochastic optimization algorithms which achieve optimal average performance over a collection of minimization problems.
In Section 4 we show that the necessary and sufficient conditions for optimality of the variational problem can be expressed as a system of Forward-Backward Stochastic Differential Equations. Theorem 4.2 provides rates of convergence for the optimal algorithm to the optimum of the minimization problem. Lastly, Section 5 recovers SGD, mirror descent, momentum, and other optimization algorithms as discretizations of the continuous optimality equations derived in Section 4 under various model assumptions. The proofs of the mathematical results of this paper are found within the appendices.

2 A Statistical Model for Stochastic Optimization

Over the course of the section, we present a variational model for stochastic optimization. The ultimate objective will be to construct a framework for measuring the average performance of an algorithm over a random collection of optimization problems. We define random variables in an ambient probability space (Ω, P, G = {G_t}_{t∈[0,T]}), where G_t is a filtration which we will define at a later point in this section. We assume that loss functions are drawn from a random variable f : Ω → C¹(ℝ^d). Each draw from the random variable satisfies f(x) ∈ ℝ for fixed x ∈ ℝ^d, and f is assumed to be almost-surely continuously differentiable in x. In addition, we make the technical assumption that E‖∇f(x)‖² < ∞ for all x ∈ ℝ^d.

We define an optimizer X = (X^ν_t)_{t≥0} as a controlled process satisfying X^ν_t ∈ ℝ^d for all t ≥ 0, with initial condition X_0 ∈ ℝ^d. The paths of X are assumed to be continuously differentiable in time, so that the dynamics of the optimizer may be written as dX^ν_t = ν_t dt, where ν_t ∈ ℝ^d represents the control, and where we use the superscript to express the explicit dependence of X^ν on the control ν.
We may also write the optimizer in its integral form as X^ν_t = X_0 + ∫_0^t ν_u du, demonstrating that the optimizer is entirely characterized by a pair (ν, X_0) consisting of a control process ν and an initial condition X_0. Using an explicit Euler discretization with step size ε > 0, the optimizer can be approximately represented through the update rule X^ν_{t+ε} ≈ X^ν_t + ε ν_t. This leads to the interpretation of ν_t as the (infinitesimal) step the algorithm takes at each point t during the optimization process.

In order to capture the essence of stochastic optimization, we construct our model so that optimizers have restricted access to the gradients of the loss function f. Rather than being able to directly observe ∇f over the path of X^ν_t, we assume that the algorithm may only use a noisy source of gradient samples, modeled by a càdlàg semi-martingale¹ g = (g_t)_{t≥0}. As a simple motivating example, we can consider the model g_t = ∇f(X^ν_t) + ξ_t, where ξ_t is a white noise process. This particular model for the noisy gradient process can be interpreted as consisting of observing ∇f(X^ν_t) plus an independent source of noise. This concrete example will be useful to keep in mind to make sense of the results which we present over the course of the paper.

To make the concept of information restriction mathematically rigorous, we restrict ourselves only to optimizers X^ν which are measurable with respect to the information generated by the noisy gradient process g. To do this, we first define the global filtration G, as G_t = σ((g_u)_{u∈[0,t]}, f), the sigma algebra generated by the paths of g as well as the realizations of the loss surface f.
The filtration G_t is defined so that it contains the complete set of information generating the optimization problem until time t.

¹A càdlàg (continue à droite, limite à gauche) process is a continuous-time process that is almost-surely right-continuous with finite left limit at each point t. A semi-martingale is the sum of a process of finite variation and a local martingale. For more information on continuous-time stochastic processes and these definitions, see the canonical text Jacod and Shiryaev (2013).

Next, we define the coarser filtration F_t = σ((g_u)_{u∈[0,t]}) ⊂ G_t, generated strictly by the paths of the noisy gradient process. This filtration represents the total set of information available to the optimizer up until time t. This allows us to formally restrict the flow of information to the algorithm by restricting ourselves to optimizers which are adapted to F_t. More precisely, we say that the optimizer's control ν is admissible if

ν ∈ A := { ω = (ω_t)_{t≥0} : ω is F-adapted, E ∫_0^T ‖ω_t‖² + ‖∇f(X^ω_t)‖² dt < ∞ }.   (3)

The set of optimizers generated by A can be interpreted as the set of optimizers which may only use the source of noisy gradients, which have bounded expected travel distance, and which have square-integrable gradients over their path.

3 The Optimizer's Variational Problem

Having defined the set of admissible optimization algorithms, we set out to select those which are optimal in an appropriate sense. We proceed similarly to Wibisono et al. (2016), by proposing an objective functional which measures the performance of the optimizer over a finite time period.

The motivation for the optimizer's performance metric comes from a physical interpretation of the optimization process. We can think of our optimization process as a particle traveling through a potential field defined by the target loss function f.
As the particle travels through the potential field, it may either gain or lose momentum depending on its location and velocity, which will in turn affect the particle's trajectory. Naturally, we may seek to find the path of a particle which reaches the optimum of the loss function while minimizing the total amount of kinetic and potential energy that is spent. We therefore turn to the Lagrangian interpretation of classical mechanics, which provides a framework for obtaining solutions to this problem. Over the remainder of this section, we lay out the Lagrangian formalism for the optimization problem we defined in Section 2.

To define a notion of energy in the optimization process, we provide a measure of distance in the parameter space. We use the Bregman divergence as the measure of distance within our parameter space, which can embed additional information about the geometry of the optimization problem. The Bregman divergence, D_h, is defined as

D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩,   (4)

where h : ℝ^d → ℝ is a strictly convex function satisfying h ∈ C². We assume here that the gradients of h are L-Lipschitz smooth for a fixed constant L > 0. The choice of h determines the way we measure distance, and is typically chosen so that it mimics features of the loss function f. In particular, this quantity plays a central role in mirror descent and non-linear sub-gradient algorithms. For more information on this connection and on the Bregman divergence, see Nemirovsky and Yudin (1983) and Beck and Teboulle (2003).

We define the total energy in our problem as the kinetic energy, accumulated through the movement of the optimizer, and the potential energy generated by the loss function f. Under the assumption that f almost surely admits a global minimum x⋆
= argmin_{x∈ℝ^d} f(x), we may represent the total energy via the Bregman Lagrangian as

L(t, X, ν) = e^{γ_t} ( e^{α_t} D_h(X + e^{−α_t} ν, X) − e^{β_t} ( f(X) − f(x⋆) ) ),   (5)

where the first term is the kinetic energy and the second the potential energy, for fixed inputs (t, X, ν), and where we assume that γ, α, β : ℝ_+ → ℝ are deterministic and satisfy γ, α, β ∈ C¹. The functions γ, α, β can be interpreted as hyperparameters which tune the energy present at any state of the optimization process. An important property to note is that the Lagrangian is itself a random variable, due to the randomness introduced by the latent loss function f.

The objective is then to find an optimizer within the admissible set A which can get close to the minimum x⋆ = argmin_{x∈ℝ^d} f(x), while simultaneously minimizing the energy cost over a finite time period [0, T]. The approach taken in classical mechanics and in Wibisono et al. (2016) fixes the endpoint of the optimizer at x⋆. Since we assume that the function f is not directly visible to our optimizer, it is not possible to add a constraint of this type that will hold almost surely. Instead, we introduce a soft constraint which penalizes the algorithm's endpoint in proportion to its distance to the global minimum, f(X_T) − f(x⋆). As such, we define the expected action functional J : A → ℝ as

J(ν) = E[ ∫_0^T L(t, X^ν_t, ν_t) dt + e^{δ_T} ( f(X^ν_T) − f(x⋆) ) ],   (6)

where the integral is the total path energy, the terminal term is the soft end-point constraint, and δ_T is an additional model hyperparameter which controls the strength of the soft constraint.

With this definition in place, the objective will be to select amongst admissible optimizers for those which minimize the expected action. Hence, we seek optimizers which solve the stochastic variational problem

ν* = argmin_{ν∈A} J(ν).   (7)

Remark 1.
Note that the variational problem (7) is identical to the one with Lagrangian

L̃(t, X, ν) = e^{γ_t} ( e^{α_t} D_h(X + e^{−α_t} ν, X) − e^{β_t} f(X) )   (8)

and terminal penalty e^{δ_T} f(X^ν_T), since they differ by constants independent of ν. Because of this, the results presented in Section 4 also hold in the case where x⋆ and f(x⋆) do not exist or are infinite.

4 Critical Points of the Expected Action Functional

In order to solve the variational problem (7), we make use of techniques from the calculus of variations and infinite-dimensional convex analysis to provide optimality conditions for the variational problem (7). To address issues of information restriction, we rely on the stochastic control techniques developed by Casgrain and Jaimungal (2018a,b,c).

The approach we take relies on the fact that a necessary condition for the optimality of a Gâteaux differentiable functional J is that its Gâteaux derivative vanishes in all directions. Computing the Gâteaux derivative of J, we find an equivalence between the Gâteaux derivative vanishing and a system of Forward-Backward Stochastic Differential Equations (FBSDEs), yielding a generalization of the Euler-Lagrange equations to the context of our optimization problem. The precise result is stated in Theorem 4.1 below.

Theorem 4.1 (Stochastic Euler-Lagrange Equation).
A control ν* ∈ A is a critical point of J if and only if ((∂L/∂ν), M) is a solution to the system of FBSDEs,

d(∂L/∂ν)_t = E[ (∂L/∂X)_t | F_t ] dt + dM_t  ∀ t < T,    (∂L/∂ν)_T = −e^{δ_T} E[ ∇f(X_T) | F_T ],   (9)

where we define the processes

(∂L/∂ν)_t = e^{γ_t} ( ∇h(X^{ν*}_t + e^{−α_t} ν*_t) − ∇h(X^{ν*}_t) ),   (10)

(∂L/∂X)_t = e^{γ_t+α_t} ( ∇h(X^{ν*}_t + e^{−α_t} ν*_t) − ∇h(X^{ν*}_t) − e^{−α_t} ∇²h(X^{ν*}_t) ν*_t − e^{β_t−α_t} ∇f(X^{ν*}_t) ),   (11)

and where the process M = (M_t)_{t∈[0,T]} is an F-adapted martingale. As a consequence, if the solution to this FBSDE is unique, then it is the unique critical point of the functional J up to null sets.

Proof. See Appendix C.

Theorem 4.1 presents an analogue of the Euler-Lagrange equation with free terminal boundary. Rather than obtaining an ODE as in the classical result, we obtain an FBSDE², with backwards process (∂L/∂ν)_t, and forward state processes E[(∂L/∂X)_t | F_t], ∫_0^t ‖ν_u‖ du and X^{ν*}_t. We can also interpret the dynamics of equation (9) as being the filtered optimal dynamics of (Wibisono et al., 2016, Equation 2.3), E[(∂L/∂X)_t | F_t], plus the increments of the data-dependent martingale M_t, with mechanics similar to that of the 'innovations process' of filtering theory.

²For a background on FBSDEs, we point readers to Carmona (2016); Ma et al. (1999); Pardoux and Tang (1999). At a high level, the solution to an FBSDE of the form (9) consists of a pair of processes ((∂L/∂ν), M), which simultaneously satisfy the dynamics and the boundary condition of (9). Intuitively, the martingale part of the solution can be interpreted as a random process which guides (∂L/∂X)_t towards the boundary condition at time T.
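As a concrete illustration of the backward process, the following short computation (an illustrative special case, not part of the paper's statements) evaluates (10) for the Euclidean choice h(x) = ½‖x‖², so that ∇h(x) = x:

```latex
% Illustrative special case: h(x) = \tfrac{1}{2}\|x\|^2, hence \nabla h(x) = x.
\left(\frac{\partial L}{\partial \nu}\right)_t
  = e^{\gamma_t}\left( \nabla h\!\left(X^{\nu^*}_t + e^{-\alpha_t}\nu^*_t\right)
      - \nabla h\!\left(X^{\nu^*}_t\right) \right)
  = e^{\gamma_t}\left( X^{\nu^*}_t + e^{-\alpha_t}\nu^*_t - X^{\nu^*}_t \right)
  = e^{\gamma_t - \alpha_t}\,\nu^*_t .
```

In this case the backward process is simply an exponentially weighted velocity, and the FBSDE couples this velocity to the filtered gradient information entering through E[(∂L/∂X)_t | F_t].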
This martingale term should not be interpreted as a source of noise, but as an explicit function of the data, as is evident from its explicit form

M_t = E[ ∫_0^T (∂L/∂X)_u du − e^{δ_T} ∇f(X_T) | F_t ].   (12)

A feature of equation (9) is that optimality relies on the projection of (∂L/∂X)_t onto F_t. Thus, the optimization algorithm makes use of past noisy gradient observations in order to make local gradient predictions. Local gradient predictions are updated using a Bayesian mechanism, where the prior model for ∇f is conditioned on the noisy gradient information contained in F_t. This demonstrates that the solution depends only on the gradients of f along the path of X_t, and on no higher-order properties.

4.1 Expected Rates of Convergence of the Continuous Algorithm

Using the dynamics (9), we obtain a bound on the rate of convergence of the continuous optimization algorithm that is analogous to Wibisono et al. (2016, Theorem 2.1). We introduce the Lyapunov energy functional

E_t = D_h(x⋆, X^{ν*}_t + e^{−α_t} ν_t) + e^{β_t} ( f(X^{ν*}_t) − f(x⋆) ) − [∇h(X^{ν*} + e^{−α} ν), X^{ν*} + e^{−α} ν]_t,   (13)

where we define x⋆ to be a global minimum of f. Under additional model assumptions, and by showing that this quantity is a super-martingale with respect to the filtration F, we obtain an upper bound for the expected rate of convergence of X_t towards the minimum.

Theorem 4.2 (Convergence Rate). Assume that the function f is almost surely convex and that the scaling conditions γ̇_t = e^{α_t} and β̇_t ≤ e^{α_t} hold. Moreover, assume that in addition to h having L-Lipschitz smooth gradients, h is also μ-strongly-convex with μ > 0. Define x⋆ = argmin_{x∈ℝ^d} f(x) to be a global minimum of f. If x⋆
exists almost surely, the optimizer defined by FBSDE (9) satisfies

E[ f(X_t) − f(x⋆) ] = O( e^{−β_t} max{ 1, E[ [e^{γ·} M]_t ] } ),   (14)

where [e^{γ·} M]_t represents the quadratic variation of the process e^{γ_t} M_t, and where M is the martingale part of the solution defined in Theorem 4.1.

Proof. See Appendix D.

We may interpret the term E[ [e^{γ·} M]_t ] as a penalty on the rate of convergence, which scales with the amount of noise present in our gradient observations. To see this, note that if there is no noise in our gradient observations, we obtain that F_t = G_t, and hence M_t ≡ 0, which recovers the exact deterministic dynamics of Wibisono et al. (2016) and the optimal convergence rate O(e^{−β_t}). If the noise in our gradient estimates is large, we can expect E[ [e^{γ·} M]_t ] to grow quickly and to counteract the shrinking effects of e^{−β_t}. Thus, in the case of a convex objective function f, any presence of gradient noise will proportionally hurt the rate of convergence to an optimum. We also point out that there will be a nontrivial dependence of E[ [e^{γ·} M]_t ] on all model hyperparameters, the specific definition of the random variable f, and the model for the noisy gradient stream, (g_t)_{t≥0}.

Remark 2. We do not assume that the conditions of Theorem 4.2 carry through the remainder of the paper. In particular, Section 5 studies models which may not guarantee almost-sure convexity of the latent loss function.

5 Recovering Discrete Optimization Algorithms

In this section, we use the optimality equations of Theorem 4.1 to produce discrete stochastic optimization algorithms. The procedure we take is as follows. We first define a model for the processes (∇f(X_t), g_t)_{t∈[0,T]}.
Second, we solve the optimality FBSDE (9) in closed form, or approximate the solution via the first-order singular perturbation (FOSP) technique described in Appendix A. Lastly, we discretize the solutions with a simple forward-Euler scheme in order to recover discrete algorithms.

Over the course of Sections 5.1 and 5.2, we show that various simple models for (∇f(X_t), g_t)_{t∈[0,T]} and different specifications of h produce many well-known stochastic optimization algorithms. These establish the conditions, in the context of the variational problem of Section 2, under which each of these algorithms is optimal. As a consequence, this allows us to understand the prior assumptions which these algorithms make on the gradients of the objective function they are trying to minimize, and on the way noise is introduced in the sampling of stochastic gradients, (g_t)_{t≥0}.

5.1 Stochastic Gradient Descent and Stochastic Mirror Descent

Here we propose a Gaussian model on gradients which loosely represents the behavior of mini-batch stochastic gradient descent with a training set of size n and mini-batches of size m. By specifying a martingale model for ∇f(X_t), we recover the stochastic gradient descent and stochastic mirror descent algorithms as solutions to the variational problem described in Section 2.

Let us assume that ∇f(X_t) = σ W^f_t, where σ > 0 and (W^f_t)_{t≥0} is a Brownian motion. Next, assume that the noisy gradient samples obtained from mini-batches over the course of the optimization evolve according to the model g_t = σ(W^f_t + ρ W^ε_t), where ρ = √((n−m)/m) and W^ε is an independent copy of W^f. Here, we choose ρ so that V[g_t] = (n/m) V[∇f(X_t)] = O(m^{−1}), which allows the variance to scale in m and n as it does with mini-batches.

Using symmetry, we obtain the trivial solution to the gradient filter, E[∇f(X_t) | F_t] = (1 + ρ²)^{−1} g_t, implying that the best estimate of the gradient at the point X_t will be the most recent mini-batch sample observed, re-scaled by a constant depending on n and m.
Using this expression for the filter, we obtain the following result.

Proposition 5.1. The FOSP approximation to the solution of the optimality equations (9) can be expressed as

dX_t = e^{α_t} ( ∇h*( ∇h(X_t) − Φ̃_t (1 + ρ²)^{−1} g_t ) − X^{ν*}_t ) dt,   (15)

where h* is the convex dual of h, and where Φ̃_t = e^{−γ_t} ( Φ_0 + ∫_0^t e^{α_u+β_u+γ_u} du ) is a deterministic learning rate with Φ_0 = e^{δ_T} − ∫_0^T e^{α_u+β_u+γ_u} du. When h has the form h(x) = x^⊤ M x for a symmetric positive-definite matrix M, the FOSP approximation is exact, and (15) is the exact solution to the optimality FBSDE (9). The martingale portion of the solution to (9) can be expressed as M_t = M_0 − (1 + ρ²)^{−1} ∫_0^t e^{α_u+β_u+γ_u} dg_u.

Proof. See Appendix E.1.

To obtain a discrete optimization algorithm from the result of Proposition 5.1, we employ a forward-Euler discretization of the ODE (15) on the finite mesh T = {t_0 = 0, t_{k+1} = t_k + e^{−α_{t_k}} : k ∈ ℕ}. This discretization results in the update rule

X_{t_{k+1}} = ∇h*( ∇h(X_{t_k}) − Φ̃_{t_k} g_{t_k} ),   (16)

corresponding exactly to mirror descent (e.g. see Beck and Teboulle (2003)) using the noisy mini-batch gradients g_t and a time-varying learning rate Φ̃_{t_k}. Moreover, setting h(x) = ½‖x‖², we recover the update rule X_{t_{k+1}} − X_{t_k} = −Φ̃_{t_k} g_{t_k}, exactly corresponding to mini-batch SGD with a time-dependent learning rate.

This derivation demonstrates that the solution to the variational problem described in Section 2, under the assumption of a Gaussian model for the evolution of gradients, recovers mirror descent and SGD. In particular, the martingale gradient model proposed in this section can be roughly interpreted as assuming that gradients behave as random walks over the path of the optimizer.
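A discretized sketch of the mirror-descent update (16), assuming an entropic mirror map on the probability simplex and a noisy linear loss. Both of these choices, and all constants below, are illustrative assumptions rather than the paper's prescriptions:

```python
import numpy as np

def mirror_step(x, g, lr):
    """One entropic mirror-descent step on the probability simplex.
    Mirror map grad h(x) = 1 + log x; grad h* re-normalizes back to the simplex."""
    z = np.log(x) - lr * g        # gradient step taken in the dual (mirror) space
    w = np.exp(z - z.max())       # back through grad h*, numerically stabilized
    return w / w.sum()

rng = np.random.default_rng(1)
c = np.array([0.3, 0.1, 0.6])     # linear loss f(x) = c^T x over the simplex
x = np.ones(3) / 3                # start at the uniform distribution
for k in range(200):
    g = c + 0.1 * rng.normal(size=3)   # noisy mini-batch-style gradient sample
    x = mirror_step(x, g, lr=0.1)
# Probability mass concentrates on the lowest-cost coordinate (index 1).
```

The iterate stays on the simplex by construction, which is the usual motivation for choosing a non-Euclidean h; with h(x) = ½‖x‖² the same loop reduces to the SGD update described above.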
Moreover, the optimal gradient filter E[∇f(X_t) | F_t] = (1 + ρ²)^{−1} g_t shows that, for the algorithm to be optimal, mini-batch gradients should be re-scaled in proportion to (1 + ρ²)^{−1} = m/n.

5.2 Kalman Gradient Descent and Momentum Methods

Using a linear state-space model for gradients, we can recover both the Kalman Gradient Descent algorithm of Vuckovic (2018) and the momentum-based optimization methods of Polyak (1964). We assume that each component of ∇f(X_t) = (∇_i f(X_t))^d_{i=1} is modeled independently as a linear diffusive process. Specifically, we assume that there exist processes y_i = (y_{i,t})_{t≥0} so that for each i, ∇_i f(X_t) = b^⊤ y_{i,t}, where y_{i,t} ∈ ℝ^{d̃} is the solution to the linear SDE dy_{i,t} = −A y_{i,t} dt + L dW_{i,t}. In particular, we use the notation ŷ_{i,j,t} to refer to element (i, j) of ŷ ∈ ℝ^{d×d̃}, and the notation ŷ_{·,j,t} = (ŷ_{i,j,t})^d_{i=1}. We assume here that A, L ∈ ℝ^{d̃×d̃} are positive-definite matrices and that each of the W_i = (W_{i,t})_{t≥0} is an independent d̃-dimensional Brownian motion.

Next, we assume that we may write each element of the noisy gradient process as g_{i,t} = b^⊤ y_{i,·,t} + σ ξ_{i,t}, where σ > 0 and where the ξ_i = (ξ_{i,t})_{t≥0} are independent white noise processes. Noting that E[∇_i f(X_{t+h}) | F_t] = b^⊤ e^{−Ah} y_{i,t}, we find that this model implicitly assumes that gradients are expected to decrease exponentially in magnitude as a function of time, at a rate determined by the eigenvalues of the matrix A. The parameters σ and L can be interpreted as controlling the scale of the noise within the observation and signal processes.

Using this model, we obtain that the filter can be expressed as E[∇_i f(X_t) | F_t] = b^⊤ ŷ_{i,t}, where ŷ_{i,t} = E[y_{i,t} | F_t].
The process ŷ_{i,t} is expressed as the solution to the Kalman-Bucy³ filtering equations

dŷ_{i,t} = −A ŷ_{i,t} dt + σ^{−1} P̄_t b dB̂_{i,t},    dP̄_t/dt = −A P̄_t − P̄_t A^⊤ − σ^{−2} P̄_t b b^⊤ P̄_t^⊤ + L L^⊤,   (17)

with the initial conditions ŷ_{i,0} = 0 and P̄_0 = E[y_{i,0} y_{i,0}^⊤], and where we define the innovations process dB̂_{i,t} = σ^{−1} (g_{i,t} − b^⊤ ŷ_{i,t}) dt, with the property that each B̂_i is an independent F-adapted Brownian motion.

³For information on continuous-time filtering and the Kalman-Bucy filter, we refer the reader to the text of Bensoussan (2004) or the lecture notes of Van Handel (2007).

Inserting the linear state-space model and its filter into the optimality equations (9), we obtain the following result.

Proposition 5.2 (State-Space Model Solution to the FOSP). Assume that the gradient state-space model described above holds. The FOSP approximation to the solution of the optimality equations (9) can be expressed as

dX_t = e^{α_t} ( ∇h*( ∇h(X_t) − Σ^{d̃}_{j=1} Φ̃_{j,t} ŷ_{·,j,t} ) − X^{ν*}_t ) dt,   (18)

where Φ̃_t = e^{−γ_t} ( b^⊤ e^{−At} Φ_0 + ∫_0^t e^{α_u+β_u+γ_u} b^⊤ e^{−A(t−u)} du ) ∈ ℝ^{d̃} is a deterministic learning rate, where e^{A} represents the matrix exponential, and where Φ_0 = e^{δ_T} e^{AT} − ∫_0^T e^{α_u+β_u+γ_u} e^{Au} du can be chosen to have arbitrarily large eigenvalues by scaling δ_T. The martingale portion of the solution of (9) can be expressed as M_t = M_0 − σ^{−1} ∫_0^t e^{α_u+β_u+γ_u} b^⊤ e^{−A(t−u)} P̄_u b dB̂_u.

Proof. See Appendix E.2.

5.2.1 Kalman Gradient Descent

In order to recover Kalman Gradient Descent, we discretize the processes X^{ν*}_t and ŷ over the finite mesh T introduced in Section 5.1. Applying a forward-Euler-Maruyama discretization of (18) and the filtering equations (17), we obtain the discrete dynamics

y_{i,t_{k+1}} = (I − e^{−α_{t_k}} A) y_{i,t_k} + L e^{−α_{t_k}} w_{i,k},    g_{i,t_k} = b^⊤ y_{i,t_k} + σ e^{−α_{t_k}} ξ_{i,k},   (19)

where each of the ξ_{i,k} and w_{i,k} are standard Gaussian random variables of appropriate size.
The filter $\hat y_{i,k} = \mathbb{E}\big[y_{t_k} \mid \{g_{t_{k'}}\}_{k'=1}^{k}\big]$ for the discrete equations can be written as the solution to the discrete Kalman filtering equations, provided in Appendix B. Discretizing the process $X^{\nu^*}$ over $\mathcal{T}$ with the forward Euler scheme, we obtain discrete dynamics for the optimizer in terms of the Kalman filter $\hat y$, as

$$X_{t_{k+1}} = \nabla h^*\Big(\nabla h(X_{t_k}) - \textstyle\sum_{j=1}^{\tilde d} \tilde F_{j,t_k}\, \hat y_{\cdot,j,k}\Big)\,, \tag{20}$$

yielding a generalized version of the Kalman gradient descent of Vuckovic (2018) with $\tilde d$ states for each gradient element. Setting $h(x) = \frac{1}{2}\|x\|^2$, $\tilde d = 1$ and $b = 1$ recovers the original Kalman gradient descent algorithm with a time-varying learning rate.

Just as in Section 5.1, we interpret each $g_{t_k}$ as being a mini-batch gradient, as in equation (2). The algorithm (20) computes a Kalman filter from these noisy mini-batch observations and uses it to update the optimizer's position.

³For information on continuous-time filtering and the Kalman–Bucy filter, we refer the reader to the text of Bensoussan (2004) or the lecture notes of Van Handel (2007).

5.2.2 Momentum and Generalized Momentum Methods

By considering the asymptotic behavior of the Kalman gradient descent method described in Section 5.2.1, we recover a generalized version of momentum gradient descent methods, which includes mirror descent behavior, as well as multiple momentum states. Let us assume that $\alpha_t = \alpha_0$ remains constant in time. Then, using the asymptotic update rule for the Kalman filter, as shown in Proposition B.2, together with equation (20), we obtain the update rule

$$X_{t_{k+1}} = \nabla h^*\Big(\nabla h(X_{t_k}) - \textstyle\sum_{j=1}^{\tilde d} \tilde F_{j,t_k}\, \hat y_{\cdot,j,k}\Big)\,, \qquad \hat y_{i,\cdot,k} = \big(\tilde A - K_\infty b^\top \tilde A\big)\, \hat y_{i,\cdot,k-1} + K_\infty\, g_{i,k}\,, \tag{21}$$

where $\tilde A = I - e^{\alpha_0} A$ and where $K_\infty \in \mathbb{R}^{\tilde d}$ is defined in the statement of Proposition B.2.
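A toy run of the recursion (21) can be sketched as follows, with $\tilde d = 2$ momentum states per coordinate and the mirror map $h(x) = \frac{1}{2}\|x\|^2$, so that $\nabla h$ and $\nabla h^*$ are both the identity. The matrix `A_tilde`, the gain `K_inf`, the learning rates `F_tilde`, and the noisy quadratic loss are all illustrative assumptions, not quantities computed from the paper's Proposition B.2.

```python
import numpy as np

# Sketch of the generalized momentum recursion (21) with d_tilde = 2 momentum
# states per coordinate, for h(x) = ||x||^2 / 2. All parameters are
# illustrative stand-ins, not values derived in the paper.
rng = np.random.default_rng(2)

d, d_tilde = 4, 2
A_tilde = np.diag([0.9, 0.7])     # stand-in for I - e^{alpha_0} A
K_inf = np.array([0.3, 0.2])      # stand-in steady-state Kalman gain
b = np.ones(d_tilde)
F_tilde = np.array([0.15, 0.15])  # per-state learning rates

x = 3.0 * rng.normal(size=d)      # optimizer state X_0
x0 = x.copy()
yhat = np.zeros((d, d_tilde))     # d_tilde momentum states per coordinate

for _ in range(400):
    g = x + 0.5 * rng.normal(size=d)         # noisy gradient of ||x||^2 / 2
    # yhat_k = (A_tilde - K_inf b^T A_tilde) yhat_{k-1} + K_inf g_k, per coord
    T = A_tilde - np.outer(K_inf, b) @ A_tilde
    yhat = yhat @ T.T + np.outer(g, K_inf)
    x = x - yhat @ F_tilde                   # update (21) with identity h

print(np.linalg.norm(x0), np.linalg.norm(x))
```

Each coordinate carries two momentum states that decay at different rates through `A_tilde`, and the position update combines them linearly through `F_tilde`.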
This yields a generalized momentum update rule in which we keep track of $\tilde d$ momentum states $(\hat y_{i,j,k})_{j=1}^{\tilde d}$, and update the optimizer's position using a linear update rule. This algorithm is most similar to the Aggregated Momentum technique of Lucas et al. (2018), which also keeps track of multiple momentum states that decay at different rates.

Under the special case where $\tilde d = 1$, $b = 1$, and $h(x) = \frac{1}{2}\|x\|^2$, we recover the exact momentum update rule of Polyak (1964) as

$$X_{t_{k+1}} - X_{t_k} = -\tilde F_{t_k}\, \hat y_k\,, \qquad \hat y_{i,k} = p_1\, \hat y_{k-1} + p_2\, g_{t_k}\,, \tag{22}$$

where we have a scalar learning rate $\tilde F_{t_k}$, where $p_1 = \tilde A - K_\infty b^\top \tilde A$ and $p_2 = K_\infty$ are positive scalars, and where the $g_{t_k}$ are mini-batch draws from the gradient as in equation (2).

The recovery of the momentum algorithm of Polyak (1964) has some interesting consequences. Since $p_1$ and $p_2$ are functions of the model parameters $\sigma$, $A$ and $\alpha_0$, we obtain a direct relationship between the optimal choice of the momentum parameters, the assumed scale of the gradient noise $\sigma, \Lambda > 0$, and the assumed expected rate of decay of gradients, as given by $e^{-At}$.
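To illustrate this relationship, the sketch below computes $p_1$ and $p_2$ from hypothetical scalar model parameters in the case $\tilde d = 1$, $b = 1$. The concrete values of $A$, $\Lambda$, $\sigma$ and the step `eps` (standing in for $e^{\alpha_0}$) are arbitrary assumptions, and the steady-state gain is obtained by iterating the scalar Riccati recursion to its fixed point numerically, rather than via the closed form of the paper's Proposition B.2.

```python
# How the momentum coefficients p1, p2 in (22) can arise from model
# parameters, in the scalar case d_tilde = 1, b = 1. Parameter values are
# arbitrary assumptions for illustration.
A, Lam, sigma, eps = 2.0, 0.6, 0.4, 0.05

a_tilde = 1.0 - eps * A   # scalar analogue of A_tilde = I - e^{alpha_0} A
q = (Lam * eps) ** 2      # process-noise variance
r = (sigma * eps) ** 2    # observation-noise variance

P = 1.0
for _ in range(10_000):   # iterate the Riccati recursion to its fixed point
    P_pred = a_tilde ** 2 * P + q
    K_inf = P_pred / (P_pred + r)
    P = (1.0 - K_inf) * P_pred

p1 = a_tilde - K_inf * a_tilde   # momentum (decay) coefficient
p2 = K_inf                       # gradient weight

# yhat_k = p1 * yhat_{k-1} + p2 * g_k is exactly a heavy-ball momentum
# buffer: a larger assumed noise sigma raises p1 and lowers p2, i.e. the
# filter smooths more aggressively.
print(p1, p2)
```

In this way, a choice of momentum hyperparameters corresponds to an implicit belief about the gradient noise scale and the gradient decay rate.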
This result gives insight into how momentum parameters should be chosen in terms of the practitioner's prior beliefs on the optimization problem.

6 Discussion and Future Research Directions

Over the course of the paper we present a variational framework on optimizers, which interprets the task of stochastic optimization as an inference problem on a latent surface that we wish to optimize. By solving a variational problem over continuous optimizers with asymmetric information, we find that optimal algorithms should satisfy a system of FBSDEs projected onto the filtration $\mathcal{F}$ generated by the noisy observations of the latent process.

By solving these FBSDEs and obtaining continuous-time optimizers, we find a direct relationship between the measure assigned to the latent surface and the way the resulting algorithm treats observed data. In particular, assigning simple prior models to the pair of processes $(\nabla f(X_t), g_t)_{t \in [0,T]}$ recovers a number of well-known and widely used optimization algorithms. The fact that this framework naturally recovers these algorithms invites further study. In particular, it remains an open question whether it is possible to recover other stochastic algorithms via this framework, particularly those with second-order scaling adjustments such as ADAM or AdaGrad.

From a more technical perspective, the intent is to further explore the properties of the optimization model presented here and the form of the algorithms it suggests. In particular, the optimality FBSDE (9) is nonlinear, high-dimensional and intractable in general, making it difficult to use existing FBSDE approximation techniques, so new tools may need to be developed to understand the full extent of its behavior.

Lastly, numerical work on the algorithms generated by this framework can provide some insights as to which prior gradient models work well when discretized.
The extension of symplectic and quasi-symplectic stochastic integrators to the BSDEs and SDEs that appear in this paper also has the potential for interesting future work.

References

Laurence Aitchison. A unified theory of adaptive stochastic gradient descent as Bayesian filtering. arXiv preprint arXiv:1807.07540, 2018.

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

Alain Bensoussan. Stochastic control of partially observable systems. Cambridge University Press, 2004.

René Carmona. Lectures on BSDEs, stochastic control, and stochastic differential games with financial applications, volume 1. SIAM, 2016.

Philippe Casgrain and Sebastian Jaimungal. Mean field games with partial information for algorithmic trading. arXiv preprint arXiv:1803.04094, 2018a.

Philippe Casgrain and Sebastian Jaimungal. Mean-field games with differing beliefs for algorithmic trading. arXiv preprint arXiv:1810.06101, 2018b.

Philippe Casgrain and Sebastian Jaimungal. Trading algorithms with learning in latent alpha models. arXiv preprint arXiv:1806.04472, 2018c.

Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

André Belotto da Silva and Maxime Gazeau. A general system of differential equations to model first order adaptive algorithms. arXiv preprint arXiv:1810.13108, 2018.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Vineet Gupta, Tomer Koren, and Yoram Singer. A unified approach to adaptive regularization in online and stochastic optimization.
arXiv preprint arXiv:1706.06569, 2017.

Jean Jacod and Albert Shiryaev. Limit theorems for stochastic processes, volume 288. Springer Science & Business Media, 2013.

Svetlana Janković, Miljana Jovanović, and Jasmina Djordjević. Perturbed backward stochastic differential equations. Mathematical and Computer Modelling, 55(5-6):1734–1745, 2012.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Walid Krichene and Peter L Bartlett. Acceleration and averaging in stochastic descent dynamics. In Advances in Neural Information Processing Systems, pages 6796–6806, 2017.

Walid Krichene, Alexandre Bayen, and Peter L Bartlett. Accelerated mirror descent in continuous and discrete time. In Advances in Neural Information Processing Systems, pages 2845–2853, 2015.

James Lucas, Shengyang Sun, Richard Zemel, and Roger Grosse. Aggregated momentum: Stability through passive damping. arXiv preprint arXiv:1804.00325, 2018.

Jin Ma, J-M Morel, and Jiongmin Yong. Forward-backward stochastic differential equations and their applications. Number 1702. Springer Science & Business Media, 1999.

Panayotis Mertikopoulos and Mathias Staudigl. On the convergence of gradient-like flows with noisy gradient input. SIAM Journal on Optimization, 28(1):163–197, 2018.

Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.

Yu Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, volume 27, 1983.

Etienne Pardoux and Shanjian Tang. Forward-backward stochastic differential equations and quasilinear parabolic PDEs. Probability Theory and Related Fields, 114(2):123–150, 1999.

Boris T Polyak. Some methods of speeding up the convergence of iteration methods.
USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

Maxim Raginsky and Jake Bouvrie. Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 6793–6800. IEEE, 2012.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.

Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.

Ramon Van Handel. Stochastic calculus, filtering, and stochastic control. Course notes, URL http://www.princeton.edu/~rvan/acm217/ACM217.pdf, 2007.

James Vuckovic. Kalman gradient descent: Adaptive variance reduction in stochastic optimization. arXiv preprint arXiv:1810.12273, 2018.

Jean Walrand and Antonis Dimakis. Random processes in systems - lecture notes. Department of Electrical Engineering and Computer Sciences, University of California, Berkeley CA 94720, August 2006.

Andre Wibisono, Ashia C Wilson, and Michael I Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.

Ashia C Wilson, Benjamin Recht, and Michael I Jordan. A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2016.

Pan Xu, Tianhao Wang, and Quanquan Gu. Accelerated stochastic mirror descent: From continuous-time dynamics to discrete-time algorithms. In International Conference on Artificial Intelligence and Statistics, pages 1087–1096, 2018a.

Pan Xu, Tianhao Wang, and Quanquan Gu.
Continuous and discrete-time accelerated stochastic mirror descent for strongly convex functions. In International Conference on Machine Learning, pages 5488–5497, 2018b.