{"title": "Trivializations for Gradient-Based Optimization on Manifolds", "book": "Advances in Neural Information Processing Systems", "page_first": 9157, "page_last": 9168, "abstract": "We introduce a framework to study the transformation of problems with manifold constraints into unconstrained problems through parametrizations in terms of a Euclidean space.\nWe call these parametrizations trivializations.\nWe prove conditions under which a trivialization is sound in the context of gradient-based optimization and we show how two large families of trivializations have overall favorable properties, but also suffer from a performance issue.\nWe then introduce dynamic trivializations, which solve this problem, and we show how these form a family of optimization methods that lie between trivializations and Riemannian gradient descent, and combine the benefits of both of them.\nWe then show how to implement these two families of trivializations in practice for different matrix manifolds. To this end, we prove a formula for the gradient of the exponential of matrices, which can be of practical interest on its own.\nFinally, we show how dynamic trivializations improve the performance of existing methods on standard tasks designed to test long-term memory within neural networks.", "full_text": "Trivializations for Gradient-Based\n\nOptimization on Manifolds\n\nMario Lezcano-Casado\nDepartment of Mathematics\n\nUniversity of Oxford\n\nmario.lezcanocasado@maths.ox.ac.uk\n\nOxford,\n\nAbstract\n\nWe introduce a framework to study the transformation of problems with manifold\nconstraints into unconstrained problems through parametrizations in terms of\na Euclidean space. We call these parametrizations trivializations. We prove\nconditions under which a trivialization is sound in the context of gradient-based\noptimization and we show how two large families of trivializations have overall\nfavorable properties, but also suffer from a performance issue. 
We then introduce dynamic trivializations, which solve this problem, and we show how these form a family of optimization methods that lie between trivializations and Riemannian gradient descent, and combine the benefits of both of them. We then show how to implement these two families of trivializations in practice for different matrix manifolds. To this end, we prove a formula for the gradient of the exponential of matrices, which can be of practical interest on its own. Finally, we show how dynamic trivializations improve the performance of existing methods on standard tasks designed to test long-term memory within neural networks.1

1 Introduction

Constrained optimization allows us to place restrictions on the family of objects being optimized. When the restrictions are simple, for example, having a vector with entries in [0, 1] or [−1, 1], simple element-wise parametrizations using sigmoid functions or tanh allow the design of powerful models such as LSTM [Hochreiter and Schmidhuber, 1997] and GRU [Cho et al., 2014] through the method of gating. This kind of vector regularization is now standard, and most of the advanced neural network architectures use it as a basic building block [Bahdanau et al., 2014]. Constraints on matrices, on the other hand, are much more challenging.
Most of the interesting sets of matrices turn out to have a manifold structure. Optimization on manifolds is both theoretically and practically challenging due to the inherent complexity of the objects involved. Even so, optimization on matrix manifolds has proven to be rather useful in many different subfields of machine learning and neural networks (NN). 
Examples of interesting matrix manifolds in the context of gradient-based optimization are the set of positive definite matrices in Bayesian statistics [Rasmussen and Williams, 2005], orthogonal matrices within RNNs [Arjovsky et al., 2016, Helfrich et al., 2018, Lezcano-Casado and Martínez-Rubio, 2019], NNs with structured linear layers via the QR or the SVD decomposition [Berg et al., 2018, Zhang et al., 2018, Kingma and Dhariwal, 2018], or invertible matrices in normalizing flows [Berg et al., 2018] and VAEs [Tomczak and Welling, 2016].

1An implementation can be found at: https://github.com/Lezcano/expRNN

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this paper we aim to provide a theoretically sound but also efficiently implementable framework to perform optimization on these and other matrix manifolds in the context of gradient-based optimization.

Outline of the paper and summary of the main contributions

In this paper, we study parametrizations of the form φ : R^n → M. We consider the transformation of a constrained optimization problem into an unconstrained one:

Initial problem:  min_{x ∈ M} f(x)        Unconstrained problem:  min_{y ∈ R^n} f(φ(y)).

We call this process trivialization and we say that φ is a trivialization map. In Section 4, we show that whenever φ is regular enough—a diffeomorphism—these parametrizations act as a change of metric on M, and thus, applying gradient descent to this new problem is equivalent to performing RGD on the original problem with this new metric, for which standard convergence results hold. After this, we look at two large families of parametrizations, the Riemannian exponential and the Lie exponential. 
We analyze these from the point of view of the framework presented before, and we point out a problem that they present: they may create saddle points or local minima near certain regions of the manifold.
In Section 5, we introduce dynamic trivializations. They can be described as follows:
Main idea: Lift the function f to the current tangent space T_{x_i}M using a map r_{x_i} : T_{x_i}M → M by considering the trivialization f ∘ r_{x_i} (think r_{x_i} = exp_{x_i}, or, for efficiency, any retraction). Optimize f ∘ r_{x_i} on T_{x_i}M for a while using any standard optimization method like ADAM, RMSPROP, or ADAGRAD, since T_{x_i}M is a linear space. When we reach a point y_k ∈ T_{x_i}M at which r_{x_i} might create saddle points or local minima, we consider the current point in the manifold x_{i+1} := r_{x_i}(y_k) and we start optimizing the function f ∘ r_{x_{i+1}}, i.e., we lift the problem to T_{x_{i+1}}M.
This family of methods has Riemannian gradient descent and classic trivializations as limit cases, and in particular, it combines the strengths of the two. Furthermore, we show that these methods give a natural generalization of Euclidean optimizers to manifolds.
In Section 6 we show how to compute the gradients associated with the Lie exponential and some cases of the Riemannian exponential for matrix manifolds. To this end, we compute a formula that allows for the approximation of the gradient of the exponential of matrices to machine precision. We also show some examples of how to use this theory to perform optimization on some matrix manifolds. In Appendix E we compile an extended list of examples that we hope might be helpful to the reader.
Finally, in Section 7 we show how dynamic trivializations improve previously developed optimization techniques in the context of optimization with orthogonal constraints.

2 Related Work

Optimization on manifolds. 
Most of the results on optimization on manifolds have found analogues in the Riemannian setting [Udriste, 1994, Absil et al., 2009]. Algorithms like conjugate gradient descent or the Newton method were first devised for specific families of manifolds [Smith, 1993, Edelman et al., 1998], and then they were derived for general Riemannian manifolds [Bonnabel, 2013, Sato and Iwai, 2015, Boumal et al., 2016].
Optimization methods on manifolds can be classified into two families: those that follow geodesics, and those that follow retractions—i.e., first-order approximations to geodesics. In the first family, convergence rates have been proven for most first-order methods, both stochastic and non-stochastic [Zhang and Sra, 2016], and even purely first-order accelerated methods [Zhang and Sra, 2018]. When it comes to retractions, rates of convergence have been proved in the Lipschitz setting for first and second-order methods [Boumal et al., 2016].
Trivialization. The trick of parametrizing a Lie group with elements in the Lie algebra through the Lie exponential map has been commonly used under the name of trivialization in the area of differential equations on manifolds [Magnus, 1954, Iserles and Nørsett, 1999, Iserles et al., 2000]. We borrow the term, as the general idea behind these methods and ours is rather similar.
Optimization through parametrizations. Parametrizing a manifold in terms of a Euclidean space is a common technique in optimization and machine learning, for example when doing computations on symmetric positive definite matrices [Arsigny et al., 2006, 2007], compact Lie groups [Lezcano-Casado and Martínez-Rubio, 2019], the special orthogonal group [Helfrich et al., 2018] or the unitary group [Jing et al., 2017, Maduranga et al., 2018]. 
In [Dreisigmeyer, 2018], it is used through the Riemannian exponential to adapt 0th-order methods to naturally reductive homogeneous manifolds.
Our work finds the closest connections in the papers [Lezcano-Casado and Martínez-Rubio, 2019, Helfrich et al., 2018, Maduranga et al., 2018]. These papers present the use of the Lie exponential and the Cayley map for optimization on SO(n). Our framework can be seen as an extension that can be implemented on top of them at a negligible execution cost. We also show that this theoretical improvement translates into a better convergence in practice in Section 7.

3 Problem Set-Up

We include a short introduction to the concepts used from differential and Riemannian geometry in Appendix A.
We are interested in approximating the following problem over a connected manifold M:

min_{x ∈ M} f(x).

A differentiable manifold does not intrinsically carry any metric information. As such, if one is interested in talking about concepts like the distance to the optimum, or the steepest descent direction, it is necessary to put additional structure on the problem. One way to do this is to consider a Riemannian metric g on M, turning M into a Riemannian manifold.

3.1 The classic approach: Riemannian gradient descent

Given a complete metric on M, we can define geodesics γ_{p,v} : [0, ∞) → M such that γ_{p,v}(0) = p, γ′_{p,v}(0) = v for v ∈ T_pM. Then, the Riemannian exponential map is defined simply as the map that maps rays starting at the origin in the tangent space to geodesics on M. In symbols, exp_p(tv) := γ_{p,v}(t) for t ≥ 0.
Using the Riemannian exponential, one can define Riemannian gradient descent in an analogous way to the Euclidean case:

x_{t+1} = exp_{x_t}(−η ∇f(x_t)).

In plain words, the algorithm follows the geodesic in the direction of steepest descent −∇f(x_t) for a time η > 0. 
This approach has been extensively studied in the literature and it has been proven to enjoy similar convergence properties to its Euclidean counterpart [Absil et al., 2009, Bonnabel, 2013, Boumal et al., 2016, Zhang et al., 2016].
Sometimes it is convenient, due to computational constraints, to use a first-order approximation to the exponential rather than the exponential map itself. This idea is encapsulated in the concept of a retraction.
Definition 3.1 (Retraction). A differentiable map r : TM → M is called a retraction if for every p ∈ M, the map r_p : T_pM → M satisfies r_p(0) = p and (dr_p)_0 = Id.
The update rule of Riemannian gradient descent along a retraction r is then given by

x_{t+1} = r_{x_t}(−η ∇f(x_t)).

In many cases, this update rule is enough to recover the same convergence guarantees as in Riemannian gradient descent along the exponential map [Boumal et al., 2016].
The main problem of Riemannian gradient descent comes from a practical point of view. On many practical problems, it has been empirically shown that algorithms like ADAM [Kingma and Ba, 2014], ADAGRAD [Duchi et al., 2011] or RMSPROP [Tieleman and Hinton, 2012] outperform vanilla SGD. These algorithms were designed to work on R^n, and although generalizations for product manifolds are in order [cf., Becigneul and Ganea, 2019], it is not clear how to generalize them to most manifolds used in practice, and thus take advantage of them in the Riemannian setting.

4 Trivializations

We now introduce trivializations. Trivializations are functions that allow us to transform a constrained problem on a manifold into an unconstrained one.
Definition 4.1 (Trivialization). Given a manifold M, we define a trivialization as a surjective map

φ : R^n → M.

Example 4.2. 
The simplest examples are found when M has a product structure, i.e., for vectors. For example, for a fixed n > 0, consider component-wise functions like rectified linear units, relu : R^n → (R_+)^n, parametrizing non-negative vectors, or the sigmoid function σ : R^n → (0, 1)^n.
Having a trivialization in hand, we can transform a constrained optimization problem into an unconstrained one by composing f with φ:

min_{y ∈ R^n} f(φ(y)).

Remark. When considering a parametrization φ, the gradient ∇f(x) changes into the gradient ∇(f ∘ φ)(y) for x = φ(y). For a 1-dimensional trivialization, by the chain rule, if φ′(y) = 0 for many y ∈ R, φ will not be a good parametrization, because then ∇(f ∘ φ)(y) = ∇f(φ(y))φ′(y) = 0, even though ∇f(x) might not be zero. As such, not all trivializations are equally good.
We formalize this intuition for general trivializations in the following theorem.
Theorem 4.3. Let φ : R^n → M be a diffeomorphism. Then, solving the problem min_{y ∈ R^n} f(φ(y)) through gradient descent amounts to solving the problem min_{x ∈ M} f(x) using Riemannian gradient descent for a certain metric on M induced by φ.
Proof. See Appendix B.
This result tells us that, if φ is a diffeomorphism, φ will not add local minima or saddle points. It will simply act as a change of metric on the manifold. This already explains the good behavior of the tanh and sigmoid functions present in an LSTM or GRU in the context of gating.
At first sight, the condition of φ being a diffeomorphism seems too restrictive for general manifolds. We now introduce two parametrizations that are diffeomorphisms on almost all of the manifold.2

4.1 The Riemannian trivialization

Consider now the Riemannian exponential map. 
By the Hopf–Rinow theorem, it is surjective whenever (M, g) is connected and complete. As such, in these cases, for any point p ∈ M, the Riemannian exponential map exp_{M,p} : T_pM (≅ R^n) → M is an example of a trivialization.
Geometric intuition about the Riemannian trivialization. A direct corollary of Gauss' lemma says that the metric induced by the exponential parametrization exp_{M,p} is a first-order approximation to the metric on the manifold around the point p [Petersen, 2016, Lemma 5.5.7]. In other words, for points near p, the Riemannian trivialization changes the metric into a new one that agrees with the original to first order, and in particular preserves the distance to p.
Let us now look at the behavior of the Riemannian trivialization in global terms.
Theorem 4.4 (Properties of the Riemannian trivialization). Let (M, g) be a connected, complete Riemannian manifold. Fix a point p ∈ M, and let U_p ⊆ T_pM be the largest radially convex open neighborhood of zero on which exp_{M,p} is a diffeomorphism.3 Then exp_{M,p}(Ū_p) = M.
Furthermore, define the cut locus in T_pM as C̃_p := Ū_p \ U_p. If V ⊆ T_pM is another open neighborhood of the origin that contains a point in C̃_p, then exp_{M,p} is not a diffeomorphism on V.
Proof. See Section 5.7.3 in [Petersen, 2016].
Theorem 4.4 combined with Theorem 4.3 tells us that there exists a radially convex neighborhood of zero on which exp_{M,p} acts as a change of metric, and that exp_{M,p} stops being a diffeomorphism at the boundary—and hence can add minima or saddle points at these points. As the image of Ū_p is the whole of M, if we write C_p := exp_{M,p}(C̃_p), we have that M decomposes into the disjoint union of exp_{M,p}(U_p) and C_p. 
The set C_p is called the cut locus at p.
The cut locus is a remarkably slippery object of study given that, in general, it is not differentiable. Nonetheless, we can still measure the relative size of this set in a topological sense, by means of the Hausdorff dimension.

2This is taken with respect to the canonical Borel measure on the manifold induced by the metric.
3A more formal way to define it would be Ū_p := {v ∈ T_pM | exp_p(tv) is length minimizing for t ∈ [0, 1]}.

Theorem 4.5 (Itoh and Tanaka [1998]). Let M be a connected and complete Riemannian manifold of dimension n. For a point p ∈ M, the Hausdorff dimension of C̃_p is either 0 or n − 1, and the Hausdorff dimension of C_p is an integer less than n.
Putting this result in the more familiar language of measures, we can argue that, although the cut locus can introduce problems in practice, the problematic set is not too large.4
Corollary 4.6. C̃_p has Lebesgue measure zero on T_pM.
Proof. By the definition of Hausdorff dimension, a set of dimension n − 1 has n-Hausdorff measure 0. Finally, just note that the n-Hausdorff measure is a multiple of the Lebesgue measure.

4.2 The Lie trivialization

We now introduce a useful trivialization for Lie groups and other matrix manifolds. Recall that for a Lie group G we define its Lie algebra as the tangent space at the identity element, g := T_eG. In Lie group theory there is a canonical trivialization given by the Lie exponential. For matrix Lie groups, which are the groups that we are interested in, the Lie exponential is exactly the exponential of matrices. We will denote the exponential of a matrix A as exp(A) or e^A for short.
For connected and compact Lie groups—e.g., SO(n), U(n), SU(n), Sp(n)—this map is surjective and it coincides with the Riemannian trivialization at the identity for a suitable metric. 
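For a compact group like SO(n), this trivialization is easy to sketch numerically: skew-symmetric matrices (the Lie algebra) parametrize the whole group through the matrix exponential. A minimal sketch with scipy (the parameter layout below is an illustrative choice, not the paper's implementation):

```python
import numpy as np
from scipy.linalg import expm

def skew(v, n):
    """Build a skew-symmetric matrix (an element of the Lie algebra so(n))
    from the n*(n-1)/2 free parameters in v."""
    A = np.zeros((n, n))
    iu = np.triu_indices(n, k=1)
    A[iu] = v
    return A - A.T

# Lie trivialization of SO(3): exp maps the Lie algebra so(3) onto the group,
# so any unconstrained vector v parametrizes a special orthogonal matrix.
v = np.array([0.3, -1.2, 0.7])
Q = expm(skew(v, 3))
# Q satisfies Q^T Q = I and det Q = +1, i.e. Q lies in SO(3).
```

Gradient-based optimization can then act on the unconstrained vector v, with the orthogonality constraint satisfied exactly by construction at every step.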
If it is not surjective, we can still use it as a trivialization of the image of g under exp. In Section 6.2 we explain how to use the exponential parametrization on the whole Lie group, even when it is not surjective. Trivializations of this form for compact Lie groups were already studied in Lezcano-Casado and Martínez-Rubio [2019].
The following theorem is a generalization to matrix Lie groups of a classic result.
Theorem 4.7 (Properties of the Lie exponential). Let G be a matrix Lie group. The Lie exponential is a diffeomorphism on the set U = {A ∈ g | |Im(λ_i(A))| < π}, where λ_i(A) are the eigenvalues of A.
Proof. See Appendix C.

This result is the counterpart of Theorem 4.4 for the Lie trivialization on general matrix Lie groups. The boundary of this set has properties similar to those of the cut locus of the Riemannian trivialization for groups like GL(n) or SO(n).5 As such, this trivialization presents the same problem as the Riemannian trivialization: it works as a change of metric for points that are close to the identity matrix, but it creates local minima and saddle points at some points of the manifold, which we might encounter as the optimization method progresses.

5 Dynamic Trivializations

In the last section, we have seen rather general families of trivializations that cover most of the manifolds used in practice. We have seen how these trivializations act as a change of metric around the initial point—p in the case of the Riemannian trivialization and the identity matrix in the case of the Lie trivialization—but we have also shown that the optimization process can be affected as it deviates from the initial point.
Note that, in the case of the exponential trivialization, we have a map from any tangent space of M onto M, but we are using just one of them as a trivialization. We can leverage the structure of TM to solve the problem that the trivializations introduced above present. 
Instead of always using exp_{M,p}, we can use it for just K optimization steps and then change p to the point of the manifold at which we find ourselves after those K steps. This idea is formalized in the following algorithm.

4The analogous result for C_p with respect to the Borel measure induced by the volume form is also true.
5The constant π is tight for matrix manifolds that contain matrices with eigenvalues that are 2πi apart. For these manifolds, the matrix exponential fails to be a diffeomorphism on some points of the boundary of U.

Figure 1: Example of the trivialization and dynamic trivialization procedure for K = 4.

Algorithm 5.1 (Dynamic trivialization through retractions). Given a retraction r, an integer K > 0 or K = ∞, and a starting point p_0, the dynamic trivialization induced by r is defined as the sequence of problems indexed by i = 0, 1, …

min_{y ∈ T_{p_i}M} f(r_{p_i}(y))

where p_{i+1} := r_{p_i}(y_{i,K}) ∈ M, and y_{i,k} ∈ T_{p_i}M for k = 1, …, K is a sequence of approximations given by a Euclidean optimization algorithm—e.g., SGD, ADAM, ADAGRAD, RMSPROP, …—applied to the i-th problem with starting point y_{i,0} = 0. We say that p_i is the basis at step i.
Remark. Note that in this case we have dropped the condition that r_p : T_pM → M be surjective. This is because, as long as M is connected, we can still reach any point of M in the optimization process by changing the basis of the dynamic trivialization whenever K < ∞.
This procedure has two interesting limit cases.
Generalization of trivializations. For K = ∞, i.e., no change of basis, it reduces to the trivialization algorithms described in Section 4 with the trivialization r_{p_0}, provided that r_{p_0} is surjective.
Generalization of Riemannian gradient descent. In the case K = 1, we are changing the basis of the trivialization on every step. 
When the optimization process used to generate the iterates y_{i,k} is regular SGD, this method recovers exactly stochastic Riemannian gradient descent using r as a retraction. For this, just note that by the chain rule and the definition of a retraction

d(f ∘ r_{p_i})_0 = (df)_{r_{p_i}(0)} ∘ (dr_{p_i})_0 = (df)_{r_{p_i}(0)} = (df)_{p_i}.

From this it follows that

∇(f ∘ r_{p_i})(0) = ∇f(p_i),

so the first iterate is y_{i,1} = −η ∇f(p_i) and, for a learning rate η > 0, the update rule can be rewritten as

p_{i+1} = r_{p_i}(−η ∇f(p_i)),

and the p_{i+1} are exactly the iterates given by doing Riemannian stochastic gradient descent along the retraction r.
In particular, we have proved that for r = exp_M we recover stochastic Riemannian gradient descent. As such, we can see dynamic trivializations as an interpolation between the trivialization method using exp_M and stochastic Riemannian gradient descent.
More interesting is perhaps the case when we use a different optimizer to generate the iterates y_{i,k}. In this case, dynamic trivializations yield a natural generalization to manifolds of the algorithm used to generate the iterates, i.e., ADAM, ADAGRAD, RMSPROP, etc.

6 Gradient Computations and Examples

The last missing piece needed to implement dynamic trivializations is the explicit computation of their gradients. We will do so for the two families presented above.

6.1 The matrix exponential

We first look at the matrix exponential. This function not only defines the Lie trivialization, but it is also essential to compute the Riemannian exponential in many matrix manifolds (cf., Appendix E). 
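Algorithm 5.1 can be sketched in a few lines. The sketch below instantiates it on SO(3) with the retraction r_B(A) = B exp(A) for skew-symmetric A, plain gradient descent as the Euclidean optimizer, and an illustrative quadratic objective; the gradient of A ↦ f(B exp(A)) is obtained through scipy's Fréchet derivative of the matrix exponential via the adjoint of (d exp)_A. The objective, step size, and K are assumptions made for the example, not the paper's settings:

```python
import numpy as np
from scipy.linalg import expm, expm_frechet

def trivialized_grad(f_grad, B, A):
    """Gradient of A -> f(B @ expm(A)) for skew-symmetric A, using the
    adjoint of the differential of expm (computed with expm_frechet),
    projected back onto the skew-symmetric matrices."""
    Q = B @ expm(A)
    G = expm_frechet(A.T, B.T @ f_grad(Q))[1]  # adjoint of (d exp)_A
    return (G - G.T) / 2                        # project onto so(n)

# Illustrative objective over SO(3): f(Q) = ||Q - T||_F^2 / 2 with T in SO(3).
T = expm(np.array([[0., 1., 0.], [-1., 0., 0.], [0., 0., 0.]]))
f_grad = lambda Q: Q - T

B = np.eye(3)                 # basis p_i of the current trivialization
A = np.zeros((3, 3))          # iterate y in the tangent space (skew)
K, eta = 10, 0.5
for step in range(100):
    A -= eta * trivialized_grad(f_grad, B, A)  # Euclidean step on T_{p_i}M
    if (step + 1) % K == 0:   # change of basis: p_{i+1} := r_{p_i}(y_{i,K})
        B = B @ expm(A)
        A = np.zeros((3, 3))
```

After the loop, B exp(A) is close to the target T; each change of basis resets the tangent iterate to the origin, where the trivialization is best conditioned.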
In order to implement the dynamic trivialization algorithm within the context of first-order methods, we need an approximation of the trivialization map and its gradient.
The current fastest machine-precision approximation to the matrix exponential was formulated in Al-Mohy and Higham [2009b]. On the other hand, it is not clear how to compute the gradient of this parametrization. The following proposition settles this problem.
Proposition 6.1 (Gradient of the exponential parametrization). Let f : R^{n×n} → R be a function defined on matrices, and let exp be the matrix exponential. Then

∇(f ∘ exp)(A) = (d exp)_{Aᵀ}(∇f(e^A)).

Proof. See Appendix D.

This proposition, together with the approximation algorithm for d exp presented in Al-Mohy and Higham [2009a], allows us to approximate this gradient to machine precision.
This formula readily allows for the implementation of the Riemannian dynamic trivialization on many matrix manifolds. We give examples of some of these in Appendix E.

6.2 Lie exponential for matrix Lie groups

The Lie exponential on a Lie group G is defined just on the Lie algebra g = T_eG. On matrix Lie groups, we can identify any tangent space of G with g. Explicitly, if Ã ∈ T_BG, then B⁻¹Ã ∈ g. Furthermore, if we choose a left-invariant metric on the Lie group, we can then use left multiplication to map the result exp(B⁻¹Ã) to a neighborhood of B. In symbols, we can define

exp_B : T_BG → G,    Ã ↦ B exp(B⁻¹Ã).

We give the gradient of this parametrization in Corollary D.3. This function constitutes a dynamic trivialization on any connected matrix Lie group, like, for example, SO(n), U(n), SL(n), or GL⁺(n).

6.3 Other retractions

Sometimes one cannot afford to approximate the exponential exactly, as it can be very costly. In this case, the standard alternatives are retractions Boumal et al. 
[2016].
Cayley map. This is one of the most well-known retractions to optimize over SO(n) (cf., Absil et al. [2009], Helfrich et al. [2018]):

cay : Skew(n) → SO(n),    A ↦ (I + A)(I − A)⁻¹.

This can be made into a dynamic retraction using the same trick as we used with the exponential, considering cay_B(Ã) = B cay(B⁻¹Ã), for B ∈ SO(n), Ã ∈ T_B SO(n).
Projectors. Another common retraction used in matrix manifolds M ⊆ R^{n×n} is the one given by π_M(x + v) for x ∈ M, v ∈ T_xM, and π_M the projection from R^{n×n} onto M. For example, for M = SO(n), a matrix B ∈ R^{n×n} with SVD decomposition B = UΣVᵀ has projection onto SO(n) given by π_{SO(n)}(B) = UVᵀ,6 with gradient computed in Kenney and Laub [1991, Eq. 2.18].
We work out more useful examples for common manifolds in Appendix E.

6Formally, π_{SO(n)} is well-defined for matrices such that det B > 0, that is, π_{SO(n)} : GL⁺(n) → SO(n). Note that this function is not a diffeomorphism but a submersion. Theorem 4.3 can be extended to this case.

Table 1: Best test accuracy at MNIST and P-MNIST.

MODEL      N    MNIST   P-MNIST
DTRIV1     170  98.3    95.2
DTRIV100   170  98.2    95.1
DTRIV∞     170  98.1    95.0
EXPRNN     170  98.0    94.9
SCORNN     170  97.2    94.8
SCURNN     116  97.6    94.9
LSTM       128  81.9    79.5
RGD        116  94.7    92.5
DTRIV1     360  98.4    96.3
DTRIV100   360  98.8    96.4
DTRIV∞     360  98.9    96.5
EXPRNN     360  98.4    96.2
SCORNN     360  98.1    95.9
SCURNN     250  98.3    96.2
LSTM       256  88.8    88.8
RGD        256  96.1    93.9
DTRIV1     512  98.7    96.7
DTRIV100   512  99.1    96.7
DTRIV∞     512  99.0    96.8
EXPRNN     512  98.7    96.6
SCORNN     512  98.2    96.5
LSTM       512  91.9    91.8
RGD        512  97.3    94.7

Table 2: Test MSE at the end of the epoch with the lowest validation MSE for the TIMIT task.

MODEL      N    VAL. MSE  TEST MSE
DTRIV1     224  6.55      6.54
DTRIV100   224  4.80      4.77
DTRIV∞     224  4.75      4.71
EXPRNN     224  5.34      5.30
SCORNN     224  9.26      8.50
SCURNN     128  9.42      7.23
LSTM       84   15.42     14.30
RGD        128  15.07     14.58
DTRIV1     322  4.56      4.55
DTRIV100   322  3.80      3.76
DTRIV∞     322  3.39      3.76
EXPRNN     322  4.42      4.38
SCORNN     322  8.48      7.82
LSTM       120  13.93     12.95
RGD        192  15.10     14.50
DTRIV1     425  4.21      4.17
DTRIV100   425  2.02      1.99
DTRIV∞     425  2.00      1.97
EXPRNN     425  5.52      5.48
SCORNN     425  7.97      7.36
SCURNN     258  4.40      3.39
LSTM       158  13.66     12.62
RGD        256  14.96     14.69

7 Experiments

In this section, we assess the effectiveness of dynamic trivializations (DTRIV) in the context of orthogonal optimization. We test the framework with the basis changed every K = 1, 100, ∞ steps. We compare it against the most performant approaches previously presented for this task, and against a vanilla LSTM. 
These approaches are orthogonal exponential trivialization [EXPRNN, Lezcano-Casado and Martínez-Rubio, 2019], orthogonal and unitary Cayley trivializations [SCORNN / SCURNN, Helfrich et al., 2018, Maduranga et al., 2018], and Riemannian gradient descent [RGD, Wisdom et al., 2016].
The architecture on which we test the dynamic trivializations is the same as in the papers above: a vanilla RNN with an orthogonal layer parametrized using the Lie trivialization (cf., Section 6.2),

h_{t+1} = σ(exp_B(A) h_t + T x_{t+1}).

The update procedure for B was described in Algorithm 5.1 (K = 1, 100, ∞).
Remark. Note that RGD is equivalent to DTRIV1 together with the optimizer SGD. Furthermore, EXPRNN is equivalent to DTRIV∞, except that EXPRNN has its basis at the identity matrix while DTRIV∞ has its basis at the matrix to which it is initialized.

We test this architecture on two different tasks that have become the standard to test the performance of RNNs in the context of long-term recall and long-term memory, namely pixel-by-pixel MNIST and the TIMIT dataset [Arjovsky et al., 2016, Henaff et al., 2016, Mhammedi et al., 2017, Helfrich et al., 2018, Maduranga et al., 2018, Lezcano-Casado and Martínez-Rubio, 2019]. We do not present results for the copying problem, as the task is too simple to draw any meaningful conclusions, as explained in Henaff et al. [2016].7
We detail all the hyperparameters and set-up in Appendix F. The code and instructions to replicate these experiments can be found at https://github.com/Lezcano/expRNN.

7For reference, dynamic trivializations are also able to converge to the correct answer stably, as EXPRNN does.

7.1 Pixel-by-pixel MNIST

This task consists of classifying the hand-written images of numbers in the MNIST dataset [LeCun and Cortes, 2010] by processing them as a sequence pixel-by-pixel. Each image has 28 × 28 pixels, so the sequences are of length 784. 
The unpermuted task (MNIST) processes the row-by-row \ufb02attened\nimage, the permuted task (P-MNIST) samples a permutation of size 784 at the beginning and then\nuses it to permute all the images after \ufb02attening them. This task was introduced in Le et al. [2015].\nTable 1 is structured so that architectures with the same number of parameters are compared together.\nAs we can see, the addition of any dynamic trivialization to the Lie parametrization improves the\nresults on this experiment by 0.4% out of the 1.3% possible in the largest size. Moreover, it always\nimproves the previous results, suggesting that it is always a better option to use dynamic trivializations\nrather than just plain trivializations. In general, we saw that DTRIV100 and DTRIV\u221e gave the highest\nstability and the best results across the experiments.\n\n7.2 TIMIT speech dataset\n\nThe TIMIT dataset [S Garofolo et al., 1992] is a set of variable-length real-world speech recordings.\nThese recordings are \ufb01rst downsampled to 8kHz and then transformed into log-magnitudes via a\nshort-time Fourier transform, giving sequences of 129 complex numbers per step, and a variable\nlength between 61 and 490. The task consists of predicting the next log-magnitude given the previous\nones. This experiment was introduced in Wisdom et al. [2016].\nIn this experiment we see a similar behavior of the dynamic trivializations as the one already seen in\nthe MNIST and P-MNIST experiments. It also happens in this experiment that DTRIV100 and DTRIV\u221e\nalways improve the performance of their static counterparts with base at the identity and of RGD.\nIn the experiments in SCURNN they explicitly mention that they are computing the MSE without\ndiscarding the zeros used to pad the variable-length sequences [Maduranga et al., 2018]. 
As such, when computing the MSE, they divide by an incorrect number (the length of the longest sequence in the batch times the number of elements in the batch) rather than by the correct one (the sum of the lengths of all the sequences in the batch). We computed the correct validation and test loss in Table 2.

8 Conclusion and Future Work

In this paper we have presented a novel way to perform optimization on manifolds that combines the strengths of the two most popular optimization techniques used in machine learning and neural networks: parametrizations and Riemannian gradient descent. We have shown that, by moving the initial point of the parametrization, the metric is distorted less from the Euclidean one, and we can thereby improve the convergence of the neural network.
We leave open an interesting line of research based on applying dynamic trivializations to allow optimization on other interesting manifolds. As a first step in this direction, we detail in Appendix E example computations for the most common manifolds used in optimization.

Acknowledgements

We would like to thank Jaime Mendizabal and Momchil Konstantinov for their very useful feedback and suggestions, and Prof. Andras Juhasz for the computing power.
The work of MLC was supported by the Oxford-James Martin Graduate Scholarship and the “la Caixa” Banking Foundation (LCF/BQ/EU17/11590067).

References

P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.

A. H. Al-Mohy and N. J. Higham. Computing the Fréchet derivative of the matrix exponential, with an application to condition number estimation. SIAM Journal on Matrix Analysis and Applications, 30(4):1639–1657, 2009a.

A. H. Al-Mohy and N. J. Higham.
A new scaling and squaring algorithm for the matrix exponential. SIAM Journal on Matrix Analysis and Applications, 31(3):970–989, 2009b.

E. Andruchow, G. Larotonda, L. Recht, and A. Varela. The left invariant metric in the general linear group. Journal of Geometry and Physics, 86:241–257, 2014.

M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016.

V. Arsigny, O. Commowick, X. Pennec, and N. Ayache. A log-Euclidean framework for statistics on diffeomorphisms. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 924–931. Springer, 2006.

V. Arsigny, P. Fillard, X. Pennec, and N. Ayache. Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM Journal on Matrix Analysis and Applications, 29(1):328–347, 2007.

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

G. Becigneul and O.-E. Ganea. Riemannian adaptive optimization methods. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1eiqi09K7.

R. v. d. Berg, L. Hasenclever, J. M. Tomczak, and M. Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.

A. L. Besse. Einstein Manifolds. Classics in Mathematics. Springer-Verlag, Berlin, 2008. Reprint of the 1987 edition.

S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.

S. Bonnabel and R. Sepulchre. Riemannian metric and geometric mean for positive semidefinite matrices of fixed rank. SIAM Journal on Matrix Analysis and Applications, 31(3):1055–1070, 2009.

N. Boumal, P.-A. Absil, and C. Cartis.
Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 2016.

K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

M. do Carmo. Riemannian Geometry. Mathematics (Boston, Mass.). Birkhäuser, 1992. ISBN 9783764334901. URL https://books.google.co.uk/books?id=uXJQQgAACAAJ.

D. W. Dreisigmeyer. Direct search methods on reductive homogeneous spaces. Journal of Optimization Theory and Applications, 176(3):585–604, 2018.

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.

J. Gallier and J. Quaintance. Differential Geometry and Lie Groups: A Computational Perspective. Available at https://www.seas.upenn.edu/~jean/diffgeom-spr-I.pdf, 2019.

S. Gallot, D. Hulin, and J. Lafontaine. Riemannian Geometry. Springer, 2 edition, 2012.

B. Hall. Lie Groups, Lie Algebras, and Representations: An Elementary Introduction. Graduate Texts in Mathematics. Springer International Publishing, 2015. ISBN 9783319134673. URL https://books.google.es/books?id=didACQAAQBAJ.

K. Helfrich, D. Willmott, and Q. Ye. Orthogonal recurrent neural networks with scaled Cayley transform. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 1969–1978. PMLR, 2018.

S. Helgason. Differential Geometry, Lie Groups, and Symmetric Spaces, volume 80. Academic Press, 1979.

M. Henaff, A. Szlam, and Y. LeCun.
Recurrent orthogonal networks and long-memory tasks. In Proceedings of the 33rd International Conference on Machine Learning, pages 2034–2042. JMLR.org, 2016.

N. J. Higham. Functions of Matrices: Theory and Computation, volume 104. SIAM, 2008.

E. Hille. On roots and logarithms of elements of a complex Banach algebra. Mathematische Annalen, 136(1):46–57, 1958.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

A. Iserles and S. Nørsett. On the solution of linear differential equations in Lie groups. Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 357(1754):983–1019, 1999.

A. Iserles, H. Z. Munthe-Kaas, S. P. Nørsett, and A. Zanna. Lie-group methods. Acta Numerica, 9:215–365, 2000.

J.-i. Itoh and M. Tanaka. The dimension of a cut locus on a smooth Riemannian manifold. Tohoku Mathematical Journal, Second Series, 50(4):571–575, 1998.

L. Jing, Y. Shen, T. Dubcek, J. Peurifoy, S. Skirlo, Y. LeCun, M. Tegmark, and M. Soljačić. Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In International Conference on Machine Learning, pages 1733–1741, 2017.

C. Kenney and A. J. Laub. Polar decomposition and matrix sign function condition estimates. SIAM Journal on Scientific and Statistical Computing, 12(3):488–504, 1991.

D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.

Q. V. Le, N. Jaitly, and G. E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.

Y. LeCun and C.
Cortes. MNIST handwritten digit database, 2010. URL http://yann.lecun.com/exdb/mnist/.

J. Lee. Introduction to Smooth Manifolds. Springer, 2 edition, 2013.

J. Lee. Introduction to Riemannian Manifolds. Springer, 2 edition, 2018.

M. Lezcano-Casado and D. Martínez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. arXiv preprint arXiv:1901.08428, 2019.

K. D. Maduranga, K. E. Helfrich, and Q. Ye. Complex unitary recurrent neural networks using scaled Cayley transform. arXiv preprint arXiv:1811.04142, 2018.

W. Magnus. On the exponential solution of differential equations for a linear operator. Communications on Pure and Applied Mathematics, 7(4):649–673, 1954.

Z. Mhammedi, A. Hellicar, A. Rahman, and J. Bailey. Efficient orthogonal parametrisation of recurrent neural networks using Householder reflections. In International Conference on Machine Learning, pages 2401–2409, 2017.

B. O’Neill. Semi-Riemannian Geometry With Applications to Relativity. Pure and Applied Mathematics. Elsevier Science, 1983. ISBN 9780080570570. URL https://books.google.co.uk/books?id=CGk1eRSjFIIC.

P. Petersen. Riemannian Geometry. Springer, 3 edition, 2016. ISBN 3319266527.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005. ISBN 026218253X.

W. Rossmann. Lie Groups: An Introduction Through Linear Groups. Oxford Graduate Texts in Mathematics. Oxford University Press, 2006. ISBN 9780199202515. URL https://books.google.co.uk/books?id=bAjulQ65W-UC.

J. S Garofolo, L. Lamel, W. M Fisher, J. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue. TIMIT acoustic-phonetic continuous speech corpus. Linguistic Data Consortium, 11 1992.

H. Sato and T. Iwai. A new, globally convergent Riemannian conjugate gradient method.
Optimization, 64(4):1011–1031, 2015.

S. T. Smith. Geometric Optimization Methods for Adaptive Filtering. PhD thesis, Harvard University, Cambridge, MA, USA, 1993. UMI Order No. GAX93-31032.

T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

J. M. Tomczak and M. Welling. Improving variational auto-encoders using Householder flow. arXiv preprint arXiv:1611.09630, 2016.

C. Udriste. Convex Functions and Optimization Methods on Riemannian Manifolds. Mathematics and Its Applications. Springer Netherlands, 1994. ISBN 9780792330028.

H.-C. Wang et al. Discrete nilpotent subgroups of Lie groups. Journal of Differential Geometry, 3(3-4):481–492, 1969.

S. Wisdom, T. Powers, J. Hershey, J. Le Roux, and L. Atlas. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pages 4880–4888, 2016.

H. Zhang and S. Sra. First-order methods for geodesically convex optimization. In Conference on Learning Theory, pages 1617–1638, 2016.

H. Zhang and S. Sra. Towards Riemannian accelerated gradient methods. arXiv preprint arXiv:1806.02812, 2018.

H. Zhang, S. J. Reddi, and S. Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4592–4600, 2016.

J. Zhang, Q. Lei, and I. Dhillon. Stabilizing gradients for deep neural networks via efficient SVD parameterization. In International Conference on Machine Learning, pages 5801–5809, 2018.