{"title": "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2933, "page_last": 2941, "abstract": "A central challenge to many fields of science and engineering involves minimizing non-convex error functions over continuous, high dimensional spaces. Gradient descent or quasi-Newton methods are almost ubiquitously used to perform such minimizations, and it is often thought that a main source of difficulty for these local methods to find the global minimum is the proliferation of local minima with much higher error than the global minimum. Here we argue, based on results from statistical physics, random matrix theory, neural network theory, and empirical evidence, that a deeper and more profound difficulty originates from the proliferation of saddle points, not local minima, especially in high dimensional problems of practical interest. Such saddle points are surrounded by high error plateaus that can dramatically slow down learning, and give the illusory impression of the existence of a local minimum. Motivated by these arguments, we propose a new approach to second-order optimization, the saddle-free Newton method, that can rapidly escape high dimensional saddle points, unlike gradient descent and quasi-Newton methods. We apply this algorithm to deep or recurrent neural network training, and provide numerical evidence for its superior optimization performance.", "full_text": "Identifying and attacking the saddle point\n\nproblem in high-dimensional non-convex optimization\n\nYann N. 
Dauphin Razvan Pascanu Caglar Gulcehre Kyunghyun Cho\n\nUniversit\u00b4e de Montr\u00b4eal\n\ndauphiya@iro.umontreal.ca, r.pascanu@gmail.com,\n\ngulcehrc@iro.umontreal.ca, kyunghyun.cho@umontreal.ca\n\nSurya Ganguli\n\nStanford University\n\nsganguli@standford.edu\n\nYoshua Bengio\n\nUniversit\u00b4e de Montr\u00b4eal, CIFAR Fellow\nyoshua.bengio@umontreal.ca\n\nAbstract\n\nA central challenge to many fields of science and engineering involves minimizing\nnon-convex error functions over continuous, high dimensional spaces. Gradient descent\nor quasi-Newton methods are almost ubiquitously used to perform such minimizations,\nand it is often thought that a main source of difficulty for these local methods to find\nthe global minimum is the proliferation of local minima with much higher error than\nthe global minimum. Here we argue, based on results from statistical physics, random\nmatrix theory, neural network theory, and empirical evidence, that a deeper and more\nprofound difficulty originates from the proliferation of saddle points, not local minima,\nespecially in high dimensional problems of practical interest. Such saddle points are\nsurrounded by high error plateaus that can dramatically slow down learning, and give the\nillusory impression of the existence of a local minimum. Motivated by these arguments,\nwe propose a new approach to second-order optimization, the saddle-free Newton method,\nthat can rapidly escape high dimensional saddle points, unlike gradient descent and\nquasi-Newton methods. 
We apply this algorithm to deep or recurrent neural network training, and provide numerical evidence for its superior optimization performance.\n\n1 Introduction\n\nIt is often the case that our geometric intuition, derived from experience within a low dimensional physical world, is inadequate for thinking about the geometry of typical error surfaces in high-dimensional spaces. To illustrate this, consider minimizing a randomly chosen error function of a single scalar variable, given by a single draw of a Gaussian process. Rasmussen and Williams (2005) have shown that such a random error function would have many local minima and maxima, with high probability over the choice of the function, but saddles would occur with negligible probability. On the other hand, as we review below, typical random Gaussian error functions over N scalar variables, or dimensions, are increasingly likely to have saddle points rather than local minima as N increases. Indeed, the ratio of the number of saddle points to local minima increases exponentially with the dimensionality N.\nA typical problem for both local minima and saddle points is that they are often surrounded by plateaus of small curvature in the error. While gradient descent dynamics are repelled away from a saddle point to lower error by following directions of negative curvature, this repulsion can occur slowly due to the plateau. Second order methods, like the Newton method, are designed to rapidly descend plateaus surrounding local minima by multiplying the gradient steps with the inverse of the Hessian matrix. However, the Newton method does not treat saddle points appropriately; as argued below, saddle points instead become attractive under the Newton dynamics. Thus, given the proliferation of saddle points, not local minima, in high dimensional problems, the entire theoretical justification for quasi-Newton methods, i.e. 
the ability to rapidly descend to the bottom of a convex local minimum, becomes less relevant in high dimensional non-convex optimization. In this work, which is an extension of the previous report by Pascanu et al. (2014), we first want to raise awareness of this issue, and second, propose an alternative approach to second-order optimization that aims to rapidly escape from saddle points. This algorithm leverages second-order curvature information in a fundamentally different way than quasi-Newton methods, and also, in numerical experiments, outperforms them in some high dimensional problems involving deep or recurrent networks.\n\n2 The prevalence of saddle points in high dimensions\n\nHere we review arguments from disparate literatures suggesting that saddle points, not local minima, provide a fundamental impediment to rapid high dimensional non-convex optimization. One line of evidence comes from statistical physics. Bray and Dean (2007); Fyodorov and Williams (2007) study the nature of critical points of random Gaussian error functions on high dimensional continuous domains using replica theory (see Parisi (2007) for a recent review of this approach).\nOne particular result by Bray and Dean (2007) derives how critical points are distributed in the \u03b5 vs \u03b1 plane, where \u03b1 is the index, or the fraction of negative eigenvalues of the Hessian at the critical point, and \u03b5 is the error attained at the critical point. Within this plane, critical points concentrate on a monotonically increasing curve as \u03b1 ranges from 0 to 1, implying a strong correlation between the error \u03b5 and the index \u03b1: the larger the error, the larger the index. The probability that a critical point lies an O(1) distance off the curve is exponentially small in the dimensionality N, for large N. 
This implies that critical points with error \u03b5 much larger than that of the global minimum are exponentially likely to be saddle points, with the fraction of negative curvature directions being an increasing function of the error. Conversely, all local minima, which necessarily have index 0, are likely to have an error very close to that of the global minimum. Intuitively, in high dimensions, the chance that all the directions around a critical point lead upward (positive curvature) is exponentially small w.r.t. the number of dimensions, unless the critical point is the global minimum or stands at an error level close to it, i.e., it is unlikely one can find a way to go further down.\nThese results may also be understood via random matrix theory. We know that for a large Gaussian random matrix the eigenvalue distribution follows Wigner\u2019s famous semicircular law (Wigner, 1958), with both mode and mean at 0. The probability that any given eigenvalue is positive or negative is thus 1/2. Bray and Dean (2007) showed that the eigenvalues of the Hessian at a critical point are distributed in the same way, except that the semicircular spectrum is shifted by an amount determined by \u03b5. For the global minimum, the spectrum is shifted so far right that all eigenvalues are positive. As \u03b5 increases, the spectrum shifts to the left and accrues more negative eigenvalues, as well as a density of eigenvalues around 0, indicating the typical presence of plateaus surrounding saddle points at large error. Such plateaus would slow the convergence of first order optimization methods, yielding the illusion of a local minimum.\nThe random matrix perspective also concisely and intuitively crystallizes the striking difference between the geometry of low and high dimensional error surfaces. For N = 1, an exact saddle point is a zero-probability event, as it means randomly picking an eigenvalue of exactly 0. 
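Both effects are easy to check numerically. The sketch below (NumPy; an illustrative aside, not code from the paper) samples symmetric Gaussian random matrices as stand-ins for the Hessian at a random critical point: roughly half the eigenvalues are negative, and the probability that all directions curve upward collapses as the dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_hessian(n):
    # Symmetric Gaussian random matrix: a stand-in for the Hessian
    # at a random critical point of a random Gaussian error function.
    a = rng.standard_normal((n, n))
    return (a + a.T) / np.sqrt(2 * n)

# Wigner's semicircular law: the spectrum is centred at 0, so each
# eigen-direction curves up or down with probability ~1/2.
eigs = np.linalg.eigvalsh(random_hessian(500))
frac_negative = float(np.mean(eigs < 0))
print(frac_negative)  # close to 0.5

# Probability that *all* N directions curve upward, i.e. that the
# critical point is a local minimum; it vanishes rapidly with N.
def prob_all_positive(n, trials=2000):
    hits = sum(np.linalg.eigvalsh(random_hessian(n)).min() > 0
               for _ in range(trials))
    return hits / trials

for n in (1, 2, 5, 10):
    print(n, prob_all_positive(n))
```

With this scaling, the estimate is about 0.5 for N = 1 and is already essentially zero by N = 10, consistent with the exponential suppression of local minima described above.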
As N grows, it becomes exponentially unlikely to randomly pick all eigenvalues to be positive or negative, and therefore most critical points are saddle points.\nFyodorov and Williams (2007) review qualitatively similar results derived for random error functions superimposed on a quadratic error surface. These works indicate that for typical, generic functions chosen from a random Gaussian ensemble of functions, local minima with high error are exponentially rare in the dimensionality of the problem, but saddle points with many directions of negative curvature and approximate plateau directions are exponentially likely. However, is this result for generic error landscapes applicable to the error landscapes of practical problems of interest?\nBaldi and Hornik (1989) analyzed the error surface of a multilayer perceptron (MLP) with a single linear hidden layer. Such an error surface shows only saddle points and no local minima. This result is qualitatively consistent with the observation made by Bray and Dean (2007). Indeed, Saxe et al. (2014) analyzed the dynamics of learning in the presence of these saddle points, and showed that they arise due to scaling symmetries in the weight space of a deep linear MLP. These scaling symmetries enabled Saxe et al. (2014) to find new exact solutions to the nonlinear dynamics of learning in deep linear networks. These learning dynamics exhibit plateaus of high error followed by abrupt transitions to better performance. They qualitatively recapitulate aspects of the hierarchical development of semantic concepts in infants (Saxe et al., 2013).\nIn Saad and Solla (1995), the dynamics of stochastic gradient descent are analyzed for soft committee machines. This work explores how well a student network can learn to imitate a randomly chosen teacher network. Importantly, it was observed that learning can go through an initial phase of being trapped in the symmetric submanifold of weight space. 
In this submanifold, the student\u2019s hidden units compute similar functions over the distribution of inputs. The slow learning dynamics within this submanifold originates from saddle point structures (caused by permutation symmetries among hidden units) and their associated plateaus (Rattray et al., 1998; Inoue et al., 2003). The exit from the plateau associated with the symmetric submanifold corresponds to the differentiation of the student\u2019s hidden units to mimic the teacher\u2019s hidden units. Interestingly, this exit from the plateau is achieved by following directions of negative curvature associated with a saddle point, in directions perpendicular to the symmetric submanifold.\n\nFigure 1 (MNIST and CIFAR-10): (a) and (c) show how critical points are distributed in the \u03b5\u2013\u03b1 plane. Note that they concentrate along a monotonically increasing curve. (b) and (d) plot the distributions of eigenvalues of the Hessian at three different critical points. Note that the y axes are in logarithmic scale. The vertical lines in (b) and (d) depict the position of 0.\n\nMizutani and Dreyfus (2010) look at the effect of negative curvature on learning and, implicitly, at the effect of saddle points in the error surface. Their findings are similar. They show that the error surface of a single layer MLP has saddle points where the Hessian matrix is indefinite.\n\n3 Experimental validation of the prevalence of saddle points\n\nIn this section, we experimentally test whether the theoretical predictions presented by Bray and Dean (2007) for random Gaussian fields hold for neural networks. 
To our knowledge, this is the first attempt to measure the relevant statistical properties of neural network error surfaces and to test if the theory developed for random Gaussian fields generalizes to such cases.\nIn particular, we are interested in how the critical points of a single layer MLP are distributed in the \u03b5\u2013\u03b1 plane, and how the eigenvalues of the Hessian matrix at these critical points are distributed. We used a small MLP trained on a down-sampled version of MNIST and CIFAR-10. The Newton method was used to identify critical points of the error function. The results are in Fig. 1. More details about the setup are provided in the supplementary material.\nThis empirical test confirms that the observations by Bray and Dean (2007) qualitatively hold for neural networks. Critical points concentrate along a monotonically increasing curve in the \u03b5\u2013\u03b1 plane. Thus the prevalence of high error saddle points does indeed pose a severe problem for training neural networks. While the eigenvalues do not seem to be exactly distributed according to the semicircular law, their distribution does shift to the left as the error increases. The large mode at 0 indicates that there is a plateau around any critical point of the error function of a neural network.\n\n4 Dynamics of optimization algorithms near saddle points\n\nGiven the prevalence of saddle points, it is important to understand how various optimization algorithms behave near them. Let us focus on non-degenerate saddle points for which the Hessian is not singular. 
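Before analyzing each method, it may help to see the basic phenomenon in code. The sketch below (NumPy; illustrative, not from the paper) applies an exact Newton step to the classical saddle f(x, y) = 5x^2 - y^2 used later in Fig. 2. Because the function is quadratic, a single step lands exactly on the critical point regardless of its index, which is both why the Newton method can locate saddle points (as in Sec. 3 above) and why it treats them as attractors.

```python
import numpy as np

# Classical saddle structure from Fig. 2: f(x, y) = 5x^2 - y^2,
# with a saddle at the origin (curvature +10 along x, -2 along y).
def grad(p):
    return np.array([10.0 * p[0], -2.0 * p[1]])

H = np.array([[10.0, 0.0],
              [0.0, -2.0]])  # constant Hessian of a quadratic

p = np.array([0.5, 0.5])             # start away from the critical point
p = p - np.linalg.solve(H, grad(p))  # one full Newton step

print(p)  # lands exactly on the saddle (0, 0) instead of descending
```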
These critical points can be locally analyzed by re-parameterizing the function according to Morse\u2019s lemma below (see chapter 7.3, Theorem 7.16 in Callahan (2010) or the supplementary material for details):\n\nf(\u03b8\u2217+\u2206\u03b8) = f(\u03b8\u2217) + (1/2) \u2211_{i=1}^{n_\u03b8} \u03bb_i \u2206v_i^2, (1)\n\nwhere \u03bb_i represents the ith eigenvalue of the Hessian, and \u2206v_i are the new parameters of the model corresponding to motion along the eigenvectors e_i of the Hessian of f at \u03b8\u2217.\nIf finding the local minima of our function is the desired outcome of our optimization algorithm, we argue that an optimal algorithm would move away from the saddle point at a speed that is inversely proportional to the flatness of the error surface, and hence dependent on how trustworthy this descent direction is further away from the current position.\nA step of the gradient descent method always points away from a nearby saddle point (SGD in Fig. 2). Assuming equation (1) is a good approximation of our function, we will analyze the optimality of the step according to how well the resulting \u2206v optimizes the right hand side of (1). If an eigenvalue \u03bb_i is positive (negative), then the step moves toward (away from) \u03b8\u2217 along \u2206v_i, because the restriction of f to the corresponding eigenvector direction \u2206v_i achieves a minimum (maximum) at \u03b8\u2217. The drawback of the gradient descent method is not the direction, but the size of the step along each eigenvector. The step, along any direction e_i, is given by \u2212\u03bb_i\u2206v_i, and so small steps are taken in directions corresponding to eigenvalues of small absolute value.\n\nFigure 2: Behaviors of different optimization methods near a saddle point for (a) classical saddle structure 5x^2 \u2212 y^2; (b) monkey saddle structure x^3 \u2212 3xy^2. The yellow dot indicates the starting point. 
SFN stands for the saddle-free Newton method we propose.\n\nThe Newton method solves the slowness problem by rescaling the gradients in each direction with the inverse of the corresponding eigenvalue, yielding the step \u2212\u2206v_i. However, this approach can result in moving toward the saddle point. Specifically, if an eigenvalue is negative, the Newton step moves along the eigenvector in a direction opposite to the gradient descent step, and thus moves in the direction of \u03b8\u2217. \u03b8\u2217 becomes an attractor for the Newton method (see Fig. 2), which can get stuck at this saddle point and not converge to a local minimum. This behavior justifies our use of the Newton method to find critical points of any index in Fig. 1.\nA trust region approach is one way of scaling second order methods to non-convex problems. In one such method, the Hessian is damped to remove negative curvature by adding a constant \u03b1 to its diagonal, which is equivalent to adding \u03b1 to each of its eigenvalues. Projecting the new step along the different eigenvectors of the modified Hessian is equivalent to rescaling the projection of the gradient on each such direction by the inverse of the modified eigenvalue \u03bb_i+\u03b1, which yields the step \u2212(\u03bb_i/(\u03bb_i+\u03b1))\u2206v_i. To ensure the algorithm does not converge to the saddle point, one must increase the damping coefficient \u03b1 enough so that \u03bb_min+\u03b1 > 0 even for the most negative eigenvalue \u03bb_min. This ensures that the modified Hessian is positive definite. However, the drawback is again a potentially small step size in many eigen-directions incurred by a large damping factor \u03b1 (the rescaling factors in each eigen-direction are no longer proportional to the curvature).\nBesides damping, another approach to dealing with negative curvature is to ignore it. 
This can be done regardless of the approximation strategy used for the Newton method, such as a truncated Newton method or a BFGS approximation (see Nocedal and Wright (2006), chapters 4 and 7). However, such algorithms cannot escape saddle points, as they ignore the very directions of negative curvature that must be followed to achieve escape.\nNatural gradient descent is a first order method that relies on the curvature of the parameter manifold. That is, natural gradient descent takes a step that induces a constant change in the behaviour of the model, as measured by the KL-divergence between the model before and after taking the step. The resulting algorithm is similar to the Newton method, except that it relies on the Fisher information matrix F.\nIt is argued by Rattray et al. (1998); Inoue et al. (2003) that natural gradient descent can address certain saddle point structures effectively. Specifically, it can resolve those saddle points arising from having units behaving very similarly. Mizutani and Dreyfus (2010), however, argue that natural gradient descent also suffers from negative curvature. One particular known issue is the over-realizable regime, where, around the stationary solution \u03b8\u2217, the Fisher matrix is rank-deficient. Numerically, this means that the Gauss-Newton direction can be orthogonal to the gradient at some distant point from \u03b8\u2217 (Mizutani and Dreyfus, 2010), causing optimization to converge to some non-stationary point. Another weakness is that the difference S between the Hessian and the Fisher information matrix can be large near certain saddle points that exhibit strong negative curvature. This means that the landscape close to these critical points may be dominated by S, meaning that the rescaling provided by F^\u22121 is not optimal in all directions.\nThe same is true for TONGA (Le Roux et al., 2007), an algorithm similar to natural gradient descent. 
It uses the covariance of the gradients as the rescaling factor. As these gradients vanish approaching a critical point, their covariance will result in much larger steps than needed near critical points.\n\n5 Generalized trust region methods\n\nIn order to attack the saddle point problem, and overcome the deficiencies of the above methods, we will define a class of generalized trust region methods, and search for an algorithm within this space. This class involves a straightforward extension of classical trust region methods via two simple changes: (1) we allow the minimization of a first-order Taylor expansion of the function instead of always relying on a second-order Taylor expansion as is typically done in trust region methods, and (2) we replace the constraint on the norm of the step \u2206\u03b8 by a constraint on the distance between \u03b8 and \u03b8+\u2206\u03b8. Thus the choice of distance function and Taylor expansion order specifies an algorithm. If we define T_k(f,\u03b8,\u2206\u03b8) to indicate the k-th order Taylor series expansion of f around \u03b8 evaluated at \u03b8+\u2206\u03b8, then we can summarize a generalized trust region method as:\n\n\u2206\u03b8 = argmin_{\u2206\u03b8} T_k(f,\u03b8,\u2206\u03b8) with k \u2208 {1,2} s.t. 
d(\u03b8,\u03b8+\u2206\u03b8) \u2264 \u2206. (2)\n\nFor example, the \u03b1-damped Newton method described above arises as a special case with k = 2 and d(\u03b8,\u03b8+\u2206\u03b8) = ||\u2206\u03b8||_2^2, where \u03b1 is implicitly a function of \u2206.\n\n6 Attacking the saddle point problem\n\nAlgorithm 1 Approximate saddle-free Newton\nRequire: Function f(\u03b8) to minimize\nfor i = 1 \u2192 M do\n  V \u2190 k Lanczos vectors of \u2202^2f/\u2202\u03b8^2\n  s(\u03b1) = f(\u03b8+V\u03b1)\n  |\u02c6H| \u2190 |\u2202^2s/\u2202\u03b1^2|, computed via an eigendecomposition of \u02c6H\n  for j = 1 \u2192 m do\n    g \u2190 \u2212\u2202s/\u2202\u03b1\n    \u03bb \u2190 argmin_\u03bb s((|\u02c6H|+\u03bbI)^\u22121 g)\n    \u03b8 \u2190 \u03b8 + V(|\u02c6H|+\u03bbI)^\u22121 g\n  end for\nend for\n\nWe now search for a solution to the saddle-point problem within the family of generalized trust region methods. In particular, the analysis of optimization algorithms near saddle points discussed in Sec. 4 suggests a simple heuristic solution: rescale the gradient along each eigen-direction e_i by 1/|\u03bb_i|. This achieves the same optimal rescaling as the Newton method, while preserving the sign of the gradient, thereby turning saddle points into repellers, not attractors, of the learning dynamics. The idea of taking the absolute value of the eigenvalues of the Hessian was suggested before; see, for example, (Nocedal and Wright, 2006, chapter 3.4) or Murray (2010, chapter 4.1). However, we are not aware of any proper justification of this algorithm, or even a detailed exploration (empirical or otherwise) of this idea. One cannot simply replace H by |H|, where |H| is the matrix obtained by taking the absolute value of each eigenvalue of H, without proper justification. While we might be able to argue that this heuristic modification does the right thing near critical points, is it still the right thing far away from the critical points? 
How can we express this step in terms of the existing methods? Here we show this heuristic solution arises naturally from our generalized trust region approach.\nUnlike classical trust region approaches, we consider minimizing a first-order Taylor expansion of the loss (k = 1 in Eq. (2)). This means that the curvature information has to come from the constraint, by picking a suitable distance measure d (see Eq. (2)). Since the minimum of the first order approximation of f is at infinity, we know that this optimization dynamics will always jump to the border of the trust region. So we must ask how far from \u03b8 we can trust the first order approximation of f. One answer is to bound the discrepancy between the first and second order Taylor expansions of f by imposing the following constraint:\n\nd(\u03b8,\u03b8+\u2206\u03b8) = |f(\u03b8) + \u2207f\u2206\u03b8 + (1/2)\u2206\u03b8^\u22a4H\u2206\u03b8 \u2212 f(\u03b8) \u2212 \u2207f\u2206\u03b8| = (1/2)|\u2206\u03b8^\u22a4H\u2206\u03b8| \u2264 \u2206, (3)\n\nwhere \u2207f is the partial derivative of f with respect to \u03b8, and \u2206 \u2208 R is some small value that indicates how much discrepancy we are willing to accept. Note that the distance measure d takes into account the curvature of the function.\nEq. (3) is not easy to solve for \u2206\u03b8 in more than one dimension. Alternatively, one could take the square of the distance, but this would yield an optimization problem with a constraint that is quartic in \u2206\u03b8, and therefore also difficult to solve. We circumvent these difficulties through a Lemma:\n\nFigure 3 (top row: MNIST; bottom row: CIFAR-10; panels (a)\u2013(f)): Empirical evaluation of different optimization algorithms for a single-layer MLP trained on the rescaled MNIST and CIFAR-10 dataset. 
In (a) and (d) we look at the minimum error obtained by the different algorithms considered, as a function of the model size. (b) and (e) show the optimal training curves for the three algorithms. The error is plotted as a function of the number of epochs. (c) and (f) track the norm of the largest negative eigenvalue.\n\nLemma 1. Let A be a nonsingular square matrix in R^{n\u00d7n}, and x \u2208 R^n be some vector. Then it holds that |x^\u22a4Ax| \u2264 x^\u22a4|A|x, where |A| is the matrix obtained by taking the absolute value of each of the eigenvalues of A.\n\nProof. See the supplementary material for the proof.\n\nInstead of the originally proposed distance measure in Eq. (3), we approximate the distance by its upper bound \u2206\u03b8^\u22a4|H|\u2206\u03b8, based on Lemma 1. This results in the following generalized trust region method:\n\n\u2206\u03b8 = argmin_{\u2206\u03b8} f(\u03b8) + \u2207f\u2206\u03b8 s.t. \u2206\u03b8^\u22a4|H|\u2206\u03b8 \u2264 \u2206. (4)\n\nNote that, as discussed before, we can replace the inequality constraint with an equality one, as the first order approximation of f has a minimum at infinity and the algorithm always jumps to the border of the trust region. Similar to (Pascanu and Bengio, 2014), we use Lagrange multipliers to obtain the solution of this constrained optimization. This gives (up to a scalar that we fold into the learning rate) a step of the form:\n\n\u2206\u03b8 = \u2212\u2207f |H|^\u22121. (5)\n\nThis algorithm, which we call the saddle-free Newton method (SFN), leverages curvature information in a fundamentally different way, to define the shape of the trust region, rather than using a Taylor expansion to second order as in classical methods. Unlike gradient descent, it can move further (less far) in directions of low (high) curvature. It is identical to the Newton method when the Hessian is positive definite, but unlike the Newton method, it can escape saddle points. 
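Both Lemma 1 and the escape behavior can be verified on a toy example. The sketch below (NumPy; illustrative, not from the paper) checks the lemma on random symmetric matrices and then contrasts Newton and saddle-free Newton steps on the classical saddle f(x, y) = 5x^2 - y^2 of Fig. 2: replacing H by |H| flips the sign of the step along the negative-curvature direction, so the iterates descend away from the saddle instead of converging to it.

```python
import numpy as np

rng = np.random.default_rng(1)

def abs_eig(A):
    # |A|: take the absolute value of each eigenvalue (A symmetric here).
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.abs(w)) @ V.T

# Numerical check of Lemma 1: |x^T A x| <= x^T |A| x.
for _ in range(100):
    A = rng.standard_normal((5, 5))
    A = (A + A.T) / 2
    x = rng.standard_normal(5)
    assert abs(x @ A @ x) <= x @ abs_eig(A) @ x + 1e-10

# Saddle-free Newton vs. Newton on f(x, y) = 5x^2 - y^2 (saddle at origin).
f = lambda p: 5 * p[0] ** 2 - p[1] ** 2
grad = lambda p: np.array([10.0 * p[0], -2.0 * p[1]])
H = np.diag([10.0, -2.0])

p_newton = np.array([0.5, 0.5])
p_sfn = np.array([0.5, 0.5])
for _ in range(10):
    p_newton = p_newton - np.linalg.solve(H, grad(p_newton))
    p_sfn = p_sfn - np.linalg.solve(abs_eig(H), grad(p_sfn))

print(f(p_newton))  # 0.0: Newton is stuck at the saddle
print(f(p_sfn) < f(np.array([0.5, 0.5])))  # True: SFN escaped to lower error
```

Along the positive-curvature direction SFN takes the same step as Newton; along the negative-curvature direction it moves away from the saddle, and the error decreases without bound on this unbounded toy surface.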
Furthermore, unlike gradient descent, the escape is rapid even along directions of weak negative curvature (see Fig. 2).\nThe exact implementation of this algorithm is intractable in a high dimensional problem, because it requires the exact computation of the Hessian. Instead, we use an approach similar to Krylov subspace descent (Vinyals and Povey, 2012). We optimize the function in a lower-dimensional Krylov subspace, \u02c6f(\u03b1) = f(\u03b8+\u03b1V). The k Krylov subspace vectors V are found through Lanczos iteration of the Hessian. With high probability, these vectors span the eigenvectors of the Hessian with the k largest (in magnitude) eigenvalues. This reparametrization through \u03b1 greatly reduces the dimensionality and allows us to use the exact saddle-free Newton method in the subspace.1 See Alg. 1 for the pseudocode.\n\n1 In the Krylov subspace, \u2202\u02c6f/\u2202\u03b1 = V(\u2202f/\u2202\u03b8)^\u22a4 and \u2202^2\u02c6f/\u2202\u03b1^2 = V(\u2202^2f/\u2202\u03b8^2)V^\u22a4.\n\nFigure 4 (Deep Autoencoder: panels (a), (b); Recurrent Neural Network: panels (c), (d)): Empirical results on training deep autoencoders on MNIST and a recurrent neural network on Penn Treebank. (a) and (c): the learning curves for SGD and for SGD followed by the saddle-free Newton method. (b) The evolution of the magnitude of the most negative eigenvalue and the norm of the gradients w.r.t. the number of epochs (deep autoencoder). 
(d) The distribution of eigenvalues of the RNN solutions found by SGD and the\nSGD continued with saddle-free Newton method.\n\n7 Experimental validation of the saddle-free Newton method\n\nIn this section, we empirically evaluate the theory suggesting the existence of many saddle points in\nhigh-dimensional functions by training neural networks.\n\n7.1 Existence of Saddle Points in Neural Networks\n\nIn this section, we validate the existence of saddle points in the cost function of neural networks, and see how\neach of the algorithms we described earlier behaves near them. In order to minimize the effect of any type of\napproximation used in the algorithms, we train small neural networks on the scaled-down version of MNIST\nand CIFAR-10, where we can compute the update directions by each algorithm exactly. Both MNIST and\nCIFAR-10 were downsampled to be of size 10\u00d710.\nWe compare minibatch stochastic gradient descent (MSGD), damped Newton and the proposed saddle-free\nNewton method (SFN). The hyperparameters of SGD were selected via random search (Bergstra and Bengio,\n2012), and the damping coefficients for the damped Newton and saddle-free Newton2 methods were selected\nfrom a small set at each update.\nThe theory suggests that the number of saddle points increases exponentially as the dimensionality of the\nfunction increases. From this, we expect that it becomes more likely for the conventional algorithms such as\nSGD and Newton methods to stop near saddle points, resulting in worse performance (on training samples).\nFigs. 3 (a) and (d) clearly confirm this. With the smallest network, all the algorithms perform comparably, but\nas the size grows, the saddle-free Newton algorithm outperforms the others by a large margin.\nA closer look into the different behavior of each algorithm is presented in Figs. 3 (b) and (e) which show the\nevolution of training error over optimization. 
We can see that the proposed saddle-free Newton method escapes, or does not get stuck at all, near a saddle point where both the SGD and Newton methods appear trapped. In particular, around the 10th epoch in the case of MNIST, we can observe the saddle-free Newton method rapidly escaping from the saddle point. Furthermore, Figs. 3 (c) and (f) provide evidence that the distribution of eigenvalues shifts toward the right as error decreases for all algorithms, consistent with the theory of random error functions. The distribution shifts more for SFN, suggesting it can successfully avoid saddle points at intermediate error levels (and large index).\n\n7.2 Effectiveness of the saddle-free Newton Method in Deep Feedforward Neural Networks\n\nHere, we further show the effectiveness of the proposed saddle-free Newton method in a larger neural network having seven hidden layers. The neural network is a deep autoencoder trained on (full-scale) MNIST, considered a standard benchmark problem for assessing the performance of optimization algorithms on neural networks (Sutskever et al., 2013). In this large-scale problem, we used the Krylov subspace descent approach described earlier with 500 subspace vectors.\nWe first trained the model with SGD and observed that learning stalls after achieving a mean-squared error (MSE) of 1.0. We then continued with the saddle-free Newton method, which rapidly escaped the (approximate) plateau at which SGD was stuck (see Fig. 4 (a)). Furthermore, even in these large scale experiments, we were able to confirm that the distribution of Hessian eigenvalues shifts right as error decreases, and that the proposed saddle-free Newton algorithm accelerates this shift (see Fig. 4 (b)).\n\n2 Damping is used for numerical stability.\n\nThe model trained with SGD followed by the saddle-free Newton method was able to reach a state-of-the-art MSE of 0.57, compared to the previous best error of 0.69 achieved by the Hessian-Free method (Martens, 2010). 
7.3 Recurrent Neural Networks: A Hard Optimization Problem

Recurrent neural networks are widely known to be more difficult to train than feedforward neural networks (see, e.g., Bengio et al., 1994; Pascanu et al., 2013). In practice they tend to underfit, and in this section we test whether the proposed saddle-free Newton method can help avoid this underfitting, under the assumption that it is caused by saddle points. We trained a small recurrent neural network with 120 hidden units for the task of character-level language modeling on the Penn Treebank corpus. As in the previous experiment, we trained the model with SGD until it was clear that learning had stalled, and from there on continued training with the saddle-free Newton method.
In Fig. 4 (c), we see a trend similar to what we observed in the previous experiments with feedforward neural networks. SGD quickly stops making progress and does not improve performance further, suggesting that the algorithm is stuck on a plateau, possibly around a saddle point. As soon as we apply the proposed saddle-free Newton method, the error drops significantly. Furthermore, Fig. 4 (d) clearly shows that the solution found by the saddle-free Newton method has fewer negative eigenvalues, consistent with the theory of random Gaussian error functions. In addition to the saddle-free Newton method, we also tried continuing with the damped truncated Newton method, but without much success.

8 Conclusion

In summary, we have drawn from disparate literatures spanning statistical physics and random matrix theory to neural network theory, to argue that (a) non-convex error surfaces in high dimensional spaces generically suffer from a proliferation of saddle points, and (b) in contrast to conventional wisdom derived from low dimensional intuition, local minima with high error are exponentially rare in high dimensions.
Moreover, we have provided the first experimental tests of these theories by performing new measurements of the statistical properties of critical points in neural network error surfaces. These tests were enabled by a novel application of Newton's method to search for critical points of any index (fraction of negative eigenvalues), and they confirmed the main qualitative prediction of the theory: that the index of a critical point tightly and positively correlates with its error level.
Motivated by this theory, we developed a framework of generalized trust region methods to search for algorithms that can rapidly escape saddle points. This framework allows us to leverage curvature information in a fundamentally different way than classical methods, by defining the shape of the trust region rather than locally approximating the function to second order. Through further approximations, we derived an exceedingly simple algorithm, the saddle-free Newton method, which rescales the gradient by the inverse of the Hessian with its eigenvalues replaced by their absolute values. This algorithm had previously remained heuristic and theoretically unjustified, as well as numerically unexplored within the context of deep and recurrent neural networks. Our work shows that near saddle points it can achieve rapid escape by combining the best of gradient descent and Newton methods while avoiding the pitfalls of both. Moreover, through our generalized trust region approach, our work shows that this algorithm is sensible even far from saddle points. Finally, we demonstrated improved optimization on several neural network training problems.
For the future, we are mainly interested in two directions. The first is to explore methods beyond Krylov subspaces, such as the one of Sohl-Dickstein et al. (2014), that allow the saddle-free Newton method to scale to high dimensional problems where we cannot easily compute the entire Hessian matrix.
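When the entire Hessian is out of reach, Krylov subspace machinery of the kind used in the large-scale experiments needs only Hessian-vector products. Below is a minimal Lanczos sketch under that assumption; the toy spectrum, subspace size, and function names are ours, not taken from the paper.

```python
import numpy as np

def lanczos_tridiag(hvp, dim, num_vectors, rng):
    # Build a Krylov subspace of the Hessian from Hessian-vector products
    # alone, with full reorthogonalization for numerical stability. The
    # eigenvalues of the small tridiagonal matrix T (the Ritz values)
    # approximate the extreme Hessian eigenvalues without ever forming
    # the dim-by-dim Hessian explicitly.
    Q = np.zeros((dim, num_vectors))
    alpha = np.zeros(num_vectors)
    beta = np.zeros(num_vectors - 1)
    q = rng.standard_normal(dim)
    q /= np.linalg.norm(q)
    for j in range(num_vectors):
        Q[:, j] = q
        w = hvp(q)
        alpha[j] = q @ w
        w -= Q[:, : j + 1] @ (Q[:, : j + 1].T @ w)  # reorthogonalize
        if j < num_vectors - 1:
            beta[j] = np.linalg.norm(w)
            q = w / beta[j]
    return np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)

# Toy "Hessian" with a known spectrum: a bulk in [-1, 1] plus one outlier at 5.
spectrum = np.concatenate([np.linspace(-1.0, 1.0, 49), [5.0]])
H = np.diag(spectrum)
rng = np.random.default_rng(0)
T = lanczos_tridiag(lambda v: H @ v, dim=50, num_vectors=20, rng=rng)
ritz = np.linalg.eigvalsh(T)  # approximates the extreme eigenvalues of H
```

With 20 subspace vectors, the largest Ritz value already matches the outlier eigenvalue at 5 to high precision, which is the property that makes a few hundred subspace vectors sufficient even for very high dimensional problems.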
The second is to further analyze the theoretical properties of critical points in the problem of training a neural network. More generally, it is likely that a deeper understanding of the statistical properties of high dimensional error surfaces will guide the design of novel non-convex optimization algorithms that could impact many fields across science and engineering.

Acknowledgments

We would like to thank the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012). We would also like to thank CIFAR and the Canada Research Chairs program for funding, and Compute Canada and Calcul Québec for providing computational resources. Razvan Pascanu is supported by a DeepMind Google Fellowship. Surya Ganguli thanks the Burroughs Wellcome and Sloan Foundations for support.

References

Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1), 53–58.
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements.
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166. Special issue on recurrent neural networks.
Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, 281–305.
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy).
Bray, A. J. and Dean, D. S. (2007). Statistics of critical points of Gaussian fields on large-dimensional spaces. Physical Review Letters, 98, 150201.
Callahan, J. (2010). Advanced Calculus: A Geometric View.
Undergraduate Texts in Mathematics. Springer.
Fyodorov, Y. V. and Williams, I. (2007). Replica symmetry breaking condition exposed by random matrix calculation of landscape complexity. Journal of Statistical Physics, 129(5-6), 1081–1116.
Inoue, M., Park, H., and Okada, M. (2003). On-line learning theory of soft committee machines with correlated hidden units: steepest gradient descent and natural gradient descent. Journal of the Physical Society of Japan, 72(4), 805–810.
Le Roux, N., Manzagol, P.-A., and Bengio, Y. (2007). Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems.
Martens, J. (2010). Deep learning via Hessian-free optimization. In International Conference on Machine Learning, pages 735–742.
Mizutani, E. and Dreyfus, S. (2010). An analysis on negative curvature induced by singularity in multi-layer neural-network learning. In Advances in Neural Information Processing Systems, pages 1669–1677.
Murray, W. (2010). Newton-type methods. Technical report, Department of Management Science and Engineering, Stanford University.
Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer.
Parisi, G. (2007). Mean field theory of spin glasses: statistics and dynamics. Technical report, arXiv:0706.0094.
Pascanu, R. and Bengio, Y. (2014). Revisiting natural gradient for deep networks. In International Conference on Learning Representations.
Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In ICML'2013.
Pascanu, R., Dauphin, Y., Ganguli, S., and Bengio, Y. (2014). On the saddle point problem for non-convex optimization. Technical report, arXiv:1405.4604.
Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
Rattray, M., Saad, D., and Amari, S. I. (1998). Natural gradient descent for on-line learning.
Physical Review Letters, 81(24), 5461–5464.
Saad, D. and Solla, S. A. (1995). On-line learning in soft committee machines. Physical Review E, 52, 4225–4243.
Saxe, A., McClelland, J., and Ganguli, S. (2013). Learning hierarchical category structure in deep neural networks. In Proceedings of the 35th Annual Meeting of the Cognitive Science Society, pages 1271–1276.
Saxe, A., McClelland, J., and Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations.
Sohl-Dickstein, J., Poole, B., and Ganguli, S. (2014). Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. In ICML'2014.
Sutskever, I., Martens, J., Dahl, G. E., and Hinton, G. E. (2013). On the importance of initialization and momentum in deep learning. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1139–1147. JMLR Workshop and Conference Proceedings.
Vinyals, O. and Povey, D. (2012). Krylov subspace descent for deep learning. In AISTATS.
Wigner, E. P. (1958). On the distribution of the roots of certain symmetric matrices. The Annals of Mathematics, 67(2), 325–327.