{"title": "How SGD Selects the Global Minima in Over-parameterized Learning: A Dynamical Stability Perspective", "book": "Advances in Neural Information Processing Systems", "page_first": 8279, "page_last": 8288, "abstract": "The question of which global minima are accessible by a stochastic gradient decent (SGD) algorithm with specific learning rate and batch size is studied from the perspective of dynamical stability. The concept of non-uniformity is introduced, which, together with sharpness, characterizes the stability property of a global minimum and hence the accessibility of a particular SGD algorithm to that global minimum. In particular, this analysis shows that learning rate and batch size play different roles in minima selection. Extensive empirical results seem to correlate well with the theoretical findings and provide further support to these claims.", "full_text": "How SGD Selects the Global Minima in\n\nOver-parameterized Learning: A Dynamical Stability\n\nPerspective\n\nSchool of Mathematical Sciences\n\nProgram in Applied and Computational Mathematics\n\nLei Wu\n\nPeking University\n\nBeijing, 100081, P.R. China\n\nleiwu@pku.edu.cn\n\nChao Ma\n\nPrinceton University\n\nPrinceton, NJ 08544, USA\nchaom@princeton.edu\n\nWeinan E\n\nDepartment of Mathematics and Program in Applied and Computational Mathematics\n\nPrinceton University, Princeton, NJ 08544, USA and\n\nBeijing Institute of Big Data Research, Beijing, 100081, P.R. China\n\nweinan@math.princeton.edu\n\nAbstract\n\nThe question of which global minima are accessible by a stochastic gradient decent\n(SGD) algorithm with speci\ufb01c learning rate and batch size is studied from the\nperspective of dynamical stability. The concept of non-uniformity is introduced,\nwhich, together with sharpness, characterizes the stability property of a global\nminimum and hence the accessibility of a particular SGD algorithm to that global\nminimum. 
In particular, this analysis shows that learning rate and batch size play different roles in minima selection. Extensive empirical results seem to correlate well with the theoretical findings and provide further support to these claims.

1 Introduction

In machine learning we are always faced with the following dilemma: the function that we minimize is the empirical risk, but the one we are really interested in is the population risk. In the old days, when typical models had only a few isolated minima, this issue was not so pressing. But now, in the setting of over-parametrized learning, e.g. deep learning, where there is a large set of global minima, all of which have zero training error but possibly very different test errors, this issue becomes highly relevant. In fact one might say that the task of optimization algorithms has become: find the set of parameters with the smallest test error among all the ones with zero training error.

At the moment this is clearly an impossible task, since we do not have much explicit information about the population risk. Therefore in this paper we take a limited view and ask the question: which global minima (of the empirical risk, of course) are accessible to a particular optimization algorithm with a particular set of hyper-parameters? In other words, how do different optimization algorithms with different hyper-parameters pick out different sets of global minima?

Specifically, in deep learning one of the most puzzling issues is the recent observation that SGD tends to select so-called flat minima, and flatter minima seem to generalize better [3, 13, 7]. Several very interesting attempts have been made to understand this issue. Goyal et al. [2] and Hoffer et al. [4] numerically studied how the learning rate and batch size impact the test accuracy of the solutions found by SGD. Jastrzębski et al.
[6] suggested that the ratio η/B between the learning rate and the batch size is a key factor that affects the flatness of the minima selected. Zhu et al. [15] demonstrated that the specific non-isotropic structure of the noise is important for SGD to find flat minima. Of particular interest to this work is the observation in [15] that the minima found by GD (gradient descent) can be unstable for SGD. As shown in Figure 1, when switching the algorithm from GD to SGD at a point close to a global minimum, SGD escapes from that minimum and converges to another global minimum which generalizes better. The time it takes for the escape is very short compared to the time required for SGD to converge.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Fast escape phenomenon in fitting corrupted FashionMNIST. When the optimizer is switched from GD to SGD with the same learning rate, even though the GD iterate is already quite close to a global minimum, one observes a fast escape from that global minimum and subsequent convergence to another global minimum. As shown in the right panel, the global minimum ultimately found by SGD generalizes better in this example.

In this paper, we make an attempt to study these issues systematically from a dynamical stability perspective. Our focus will be on SGD and GD, but the principle is applicable to any other optimization algorithm. We begin our analysis by explaining the escape phenomenon in Figure 1 using a toy example that reproduces the basic features of this process. We then formalize the intuition into an analytical argument. The analysis leads to a sharpness-non-uniformity diagram that characterizes the kind of minima that are stable and hence accessible for SGD.
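As a preview, this fast escape is easy to reproduce in miniature. The sketch below is our own illustration, not the paper's experiment: it iterates x ← (1 − η aᵢ)x with per-sample curvatures aᵢ drawn uniformly from the illustrative set {0.2, 3.8}. The mean curvature ā = 2 satisfies the GD stability condition |1 − ηā| ≤ 1 at η = 0.7, yet the sampled factor 1 − 0.7 × 3.8 has magnitude 1.66 > 1, so SGD is driven away from a minimum at which GD stays put.

```python
import random

random.seed(0)
curvatures = [0.2, 3.8]   # per-sample curvatures a_i at the minimum (illustrative choice)
eta = 0.7                 # learning rate
abar = sum(curvatures) / len(curvatures)   # mean curvature seen by GD: 2.0

x_gd = x_sgd = 1e-5       # tiny perturbation away from the minimum at x = 0
for _ in range(200):
    x_gd = (1 - eta * abar) * x_gd                         # full-batch (GD) step
    x_sgd = (1 - eta * random.choice(curvatures)) * x_sgd  # single-sample (SGD) step

print(abs(x_gd) < 1e-5)   # True: |1 - eta*abar| = 0.4 < 1, so GD contracts toward 0
print(abs(x_sgd) > 1.0)   # True with this seed: the SGD iterate blows up and escapes
```

The escape here is driven purely by the instability of the sampled dynamics, not by slow diffusion, which is the point made quantitative in Sections 2 and 3.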
We show that both the sharpness and the non-uniformity are important for the selection of global minima, although we do observe in experiments that these two quantities are strongly correlated for deep neural network models. Our extensive numerical results give strong support to the theoretical analysis.

2 The Mechanism of Escape

To understand how the fast escape happens, let us first consider a one-dimensional problem f(x) = (1/2)(f_1(x) + f_2(x)) with

f_1(x) = min{x², 0.1(x − 1)²},  f_2(x) = min{x², 1.9(x − 1)²}.

The landscape is shown in Figure 2. This function has two global minima, at x = 0 and x = 1. We find that for x_0 = 1 + ε, SGD always escapes and converges to x = 0 instead of staying in the initial basin, as long as ε ≠ 0 and η is relatively large (one trajectory is shown in red in Figure 2). In contrast, SGD starting from x_0 = ε with the same learning rate always converges to x = 0, and we never observe the escape phenomenon. By comparison, GD started from a neighborhood of either minimum with the same learning rate simply converges to that nearby minimum; no escape occurs.

Intuitive explanation. For this simple case, the above phenomena can be easily explained. (1) The two minima have the same sharpness f″ = 2, so when the learning rate is small enough (η ≤ 2/f″ = 1 in this problem), both minima are stable for GD. (2) For SGD, however, each iteration randomly picks one function from f_1 and f_2 and applies a gradient descent step to that function. Since f_1″(1) = 0.2 and f_2″(1) = 3.8, SGD with learning rate η = 0.7 is stable for f_1 but unstable for f_2 (0.7 × 3.8 > 2). Thus x = 1 is not stable. In contrast, η = 0.7 is stable for both f_1 and f_2 around x = 0, since f_1″(0) = f_2″(0) = 2. This intuition can be formalized by the following stability analysis.

Formal argument. Without loss of generality, let us assume that x = 0 is the global minimum of interest.
Consider the following more general one-dimensional optimization problem, f(x) = (1/2n) Σ_{i=1}^n a_i x², with a_i ≥ 0 for all i ∈ [n]. SGD with batch size B = 1 is given by

x_{t+1} = x_t − η a_ξ x_t = (1 − η a_ξ) x_t,  (1)

where ξ is a random variable with P(ξ = i) = 1/n. Hence we have

E x_t = (1 − η ā)^t x_0,  (2)
E x_t² = [(1 − η ā)² + η² s²]^t x_0²,  (3)

where ā = Σ_{i=1}^n a_i / n and s² = Σ_{i=1}^n a_i² / n − ā². Therefore, for SGD to be stable around x = 0, we not only need |1 − η ā| ≤ 1 (the stability condition of GD), but also (1 − η ā)² + η² s² ≤ 1. Otherwise, x_t blows up exponentially fast. In particular, SGD can only select minima with s ≤ 1/η, whereas GD has no such requirement.

For the toy example discussed above, at x = 1 we have s = 1.8 > 1/0.7, so, as predicted, SGD with η = 0.7 escapes and finally converges to the minimum x = 0, where s = 0 < 1/0.7.

Figure 2: Motivating example. (Left) One trajectory of SGD with learning rate η = 0.7 and x_0 = 1 − 10⁻⁵, showing convergence to 0. GD with the same learning rate would converge to 1. (Right) The value of the objective function, showing a burst during the escape.

3 Linear Stability Analysis

Let us consider the minimization of the training error

f(x) = (1/n) Σ_{i=1}^n f_i(x)

by a general optimizer

x_{t+1} = x_t − G(x_t; ξ_t),  (4)

where ξ_t is a random variable independent of x_t, and the ξ_t are i.i.d. For SGD, G(x_t; ξ_t) = η ∇f_{ξ_t}(x_t); for GD, G(x_t; ξ_t) = η Σ_{i=1}^n ∇f_i(x_t)/n. In practice, G(x_t; ξ_t) usually depends on tunable hyper-parameters such as the learning rate and the batch size.

Definition 1 (Fixed point).
We say x* is a fixed point of the stochastic dynamics (4) if G(x*; ξ) = 0 for any ξ.

It should be remarked that such fixed points do not always exist. However, for the over-parametrized learning (OPL) problems of interest, all the global minima of f(x) are fixed points of the popular optimizers such as SGD, Adam, etc. Note that a specific optimizer can only select dynamically stable fixed points: if a fixed point is unstable, a small perturbation will drive the optimizer away from it. To formalize this, we introduce a notion of stability for the stochastic dynamics (4), which extends the classical notion of linear stability in dynamical systems [11].

Definition 2 (Linear stability). Let x* be a fixed point of the stochastic dynamics (4). Consider the linearized dynamical system

x̃_{t+1} = x̃_t − A_{ξ_t}(x̃_t − x*),  (5)

where A_{ξ_t} = ∇_x G(x*, ξ_t). We say that x* is linearly stable if there exists a constant C such that

E[‖x̃_t − x*‖²] ≤ C ‖x̃_0 − x*‖²  for all t > 0.  (6)

3.1 Stochastic Gradient Descent

In this section, we derive the stability condition for SGD. Let x* be the fixed point of interest. Consider the quadratic approximation of f near x*: f(x) ≈ (1/2n) Σ_{i=1}^n (x − x*)ᵀ H_i (x − x*), with H_i = ∇²f_i(x*). Here we have assumed f(x*) = 0. The corresponding linearized SGD is given by

x_{t+1} = x_t − (η/B) Σ_{j=1}^B H_{ξ_j} (x_t − x*),  (7)

where B is the batch size and ξ = {ξ_1, …, ξ_B} is a uniform random sample of size B drawn without replacement from {1, 2, …, n}. To characterize the stability of this dynamical system, we need the following two quantities.

Definition 3.
Let H = (1/n) Σ_{i=1}^n H_i and Σ = (1/n) Σ_{i=1}^n H_i² − H². We define a = λ_max(H) to be the sharpness and s = λ_max(Σ^{1/2}) to be the non-uniformity.

Theorem 1. The global minimum x* is linearly stable for SGD with learning rate η and batch size B if the following condition is satisfied:

λ_max{ (I − ηH)² + [η²(n − B) / (B(n − 1))] Σ } ≤ 1.  (8)

When d = 1, this is a sufficient and necessary condition.

The proof can be found in Appendix A.

When d = 1 (one dimension), condition (8) reads (1 − ηa)² + η²(n − B)s²/(B(n − 1)) ≤ 1, which for B = 1 is exactly the condition (1 − ηa)² + η²s² ≤ 1 introduced in Section 2. The condition (8) is sharp but not intuitive. A less sharp but simpler necessary condition for (8) is

0 ≤ a ≤ 2/η,  0 ≤ s ≤ (1/η) √(B(n − 1)/(n − B)).  (9)

This is obtained by requiring the two terms on the left-hand side of (8) to satisfy the stability condition separately. In particular, when B ≪ n the largest non-uniformity allowed is roughly √B/η. As shown in the next section, numerical experiments in deep learning indicate that condition (9) is quite sharp.

Figure 3: The sharpness-non-uniformity diagram, showing the rectangular region that is linearly stable for SGD. The left and right panels show the influence of the batch size B and the learning rate η, respectively. Notice that in the left panel, the stability region of GD is the unbounded strip between a = 0 and a = 2/η.

The sharpness-non-uniformity diagram of SGD. Now assume that the learning rate η is fixed. We use a and s as features of the global minima to show how GD and SGD \"select\" minima. From the results above, we know that the global minima that GD can converge to satisfy a ≤ 2/η, while the global minima that SGD can converge to satisfy the more restrictive condition (9).
The stability regions are visualized in Figure 3, which is called the sharpness-non-uniformity diagram.

From the sharpness-non-uniformity diagram we see that, when the learning rate is fixed, the set of global minima that are linearly stable for SGD is much smaller than that for GD. This means that, compared to GD, SGD filters out global minima with large non-uniformity.

3.2 Some Remarks

Roles of the learning rate and batch size. Our analysis shows that the learning rate and the batch size play different roles in global minimum selection. As shown in Figure 3, increasing the learning rate forces SGD to choose global minima closer to the origin in the sharpness-non-uniformity diagram, i.e. minima with smaller sharpness and smaller non-uniformity. Decreasing the batch size, on the other hand, only forces SGD to choose global minima with smaller non-uniformity.

Local stability for general loss functions. As is well known in the theory of dynamical systems [11], asymptotic convergence to a particular critical point depends only on the local stability of that critical point, and locally one can always make a linear approximation of the dynamical system, or a quadratic approximation of the objective function, as long as these approximations are non-degenerate. Therefore our findings are of general relevance, even for problems with non-convex loss functions, as long as the non-degeneracy holds. However, as shown in Figure 4, the quadratic approximation, i.e. the linearization of SGD, is not suited to the loss function shown as the solid curve, whose Hessian vanishes at the minima. This happens to be the case for classification problems with the cross entropy used as the loss function. In this paper, we will focus on the case when the quadratic approximation is locally valid.
Therefore, in the following experiments, we use the mean squared error rather than the cross entropy as the loss function.

Figure 4: Good quad. approx. vs. bad quad. approx.

4 Experiments

In this section, we present numerical results¹ in deep learning in connection with the analysis above. We consider two classification problems, described in Table 1. Since the computation of the non-uniformity is prohibitively expensive, in most cases we select only 1000 training examples to speed up training. For CIFAR10, only examples from the categories \"airplane\" and \"automobile\" are considered. We refer to Appendix C for the network architectures and the computational method for the sharpness and non-uniformity.

Table 1: Experimental setup

Network type | Dataset      | # of parameters | # of training examples
FNN          | FashionMNIST | 898,510         | 1000
VGG          | CIFAR10      | 71,410          | 1000

4.1 Learning rate is crucial for the sharpness of the selected minima

First we study how the learning rate affects the sharpness of the solutions. We focus on GD, but as we will show later, the general trend is the same for SGD. We trained the models with different learning rates and report their sharpness in Table 2. All the models are trained for a sufficient number of iterations to achieve a training loss smaller than 10⁻⁴. As we can see by comparing the second

Table 2: Sharpness of the solutions found by GD with different learning rates. Each experiment is repeated 5 times with independent random initialization. We report the mean and standard deviation of the sharpness in the second and third rows of the table. The fourth row shows the largest possible sharpness predicted by our theory. Dashes indicate that GD blows up at that learning rate.
Notice that GD tends to select the sharpest possible minima.

η              | 0.01        | 0.05       | 0.1         | 0.5       | 1         | 5
FashionMNIST   | 53.5 ± 4.3  | 39.3 ± 0.5 | 19.6 ± 0.15 | 3.9 ± 0.0 | 1.9 ± 0.0 | 0.4 ± 0.0
CIFAR10        | 198.9 ± 0.6 | 39.8 ± 0.2 | 19.8 ± 0.1  | 3.6 ± 0.4 | -         | -
prediction 2/η | 200         | 40         | 20          | 4         | 2         | 0.4

and third rows with the fourth row, the numerical results are very close to the theoretical prediction of the largest possible sharpness 2/η, especially in the large learning rate regime. This may seem somewhat surprising, since the stability analysis only requires that the sharpness not exceed 2/η. Although there are plenty of flatter minima, for instance those found by using larger learning rates, GD with a small learning rate does not find them. One tempting explanation for this phenomenon is that the density of sharp minima is much larger than the density of flat minima; hence, when an optimizer is stable for both sharp and flat minima, it tends to find the sharp ones. It should also be remarked that the same phenomenon is not expected to hold for very small learning rates, since there are not that many really sharp global minima either.

¹The code is available at https://github.com/leiwu1990/sgd.stability

4.2 SGD converges to solutions with low non-uniformity

Our theory reveals that a crucial difference between GD and SGD is that SGD must converge to solutions that fit all the data uniformly well. To verify this, we trained a large number of models with different learning rates and batch sizes; the results are shown in Figure 5.

In Figure 5a, we plot the non-uniformity against the batch size, where batch size B = 1000 corresponds to GD. We see that the solutions found by SGD indeed have much lower non-uniformity than those found by GD.
For example, in the CIFAR10 experiment with a fixed learning rate η = 0.01, the non-uniformity of GD solutions is about 350, whereas the same quantity is only about 100 for SGD with batch size B = 4. This value is around half of the highest possible non-uniformity predicted by our theory, which is about √B/η = √4/0.01 = 200. We also observe that the non-uniformity drops almost monotonically as the batch size decreases. These results suggest that our prediction for the non-uniformity is correct, although not as tight as the one for the sharpness.

Figure 5b shows the influence of the batch size on the sharpness. We see that SGD always favors flatter minima than GD, and the smaller the batch size, the flatter the solutions. This phenomenon cannot be explained directly by our theory; a possible explanation is provided in the next section.

Figure 5: The influence of the batch size on the non-uniformity (panel a) and the sharpness (panel b). For each set of hyper-parameters, we trained the models with 5 independent random initializations and display the mean and standard deviation. The total number of samples is 1000, so the right-most values in each panel correspond to GD.

4.3 The selection mechanism of SGD

To investigate the accuracy of the global minima selection criteria introduced in Section 3, we trained a large number of models and display their sharpness and non-uniformity in Figure 6. To take into account the influence of initialization, we tried three different initializations: uniform initialization U[−v/√n_in, v/√n_in], with v = 1, 2, 3 for FashionMNIST and v = 0.5, 1, 1.5 for CIFAR10. Larger v causes the optimizers to diverge. We choose learning rate η = 0.5 for FashionMNIST and η = 0.01 for CIFAR10. For each specific set of hyper-parameters, we trained 5 models with independent initializations.
The predicted largest possible non-uniformities are displayed as horizontal dashed lines.

As we can see, all the solutions lie within the region predicted by the theory. Specifically, for relatively large batch sizes, e.g. B = 25, the non-uniformity of the solutions is quite close to the predicted upper bound. For very small batch sizes, say B = 4, the non-uniformities for both datasets are significantly lower than the predicted upper bound.

Another interesting observation is that the non-uniformity and the sharpness appear to be strongly correlated. This partially explains why SGD tends to select the flatter minima shown in Figure 5b rather than sharp minima with low non-uniformity: as the non-uniformity is reduced, the sharpness is reduced simultaneously. This mechanism is clearly visible in Figure 6.

Figure 6: The sharpness-non-uniformity diagram for the minima selected by SGD. Different colors correspond to different sets of hyper-parameters. The dashed lines show the predicted upper bounds for the non-uniformity. The data of each color lie below the corresponding dashed line.

The strong correlation between sharpness and non-uniformity is not a consequence of our theory. To further explore the generality of this observed correlation, we trained many more models with a variety of learning rates, batch sizes and initializations. In addition, we considered a larger-scale case: a 14-layer ResNet for CIFAR-10 with 10,000 training examples. We plot sharpness against non-uniformity in Figure 7.
The results are consistent with Figure 6.

Figure 7: Scatter plots of sharpness and non-uniformity, suggesting that the non-uniformity and the sharpness are roughly proportional to each other for these models.

4.4 Back to the escape phenomenon

Now we are ready to look at the escape phenomenon more closely. We first trained two models using GD with η = 0.1, one on FashionMNIST and one on a corrupted FashionMNIST dataset. The latter contains an extra 200 training examples with random labels and serves as a more complex dataset, on which the effect of regularization is more significant. Information about the two solutions is summarized in Table 3. Starting from these two solutions, several SGD and GD runs with larger learning rates were launched; their dynamics are visualized in Figure 8.

Table 3: Information about the initializations of the escape experiment.

dataset                | sharpness | non-uniformity | test acc
FashionMNIST           | 19.7      | 45.2           | 80.04
Corrupted FashionMNIST | 19.9      | 51.7           | 71.44

Prediction of escape. According to Table 3, the sharpnesses of both starting points are larger than 19. Our theory predicts that they are not stable for GD with a learning rate larger than 2/19 ≈ 0.105; therefore it is not surprising that GD with learning rates 0.3 and 0.5 escapes. For SGD with η = 0.1, B = 1 and η = 0.1, B = 4, the largest non-uniformity at which SGD can stay at a minimum is at most √1/0.1 = 10 and √4/0.1 = 20, respectively. The non-uniformities of the two starting minima (45.2 and 51.7) are far larger, so they are unstable for SGD with η = 0.1, B = 1 and η = 0.1, B = 4. In Figure 8, we see that all these predictions are confirmed. In the corrupted FashionMNIST experiment, we also notice that SGD with η = 0.1, B = 100 fails to escape from the starting point.
This is due to the fact that the non-uniformity of that solution (51.7) is smaller than √100/0.1 = 100; SGD with η = 0.1, B = 100 will therefore be stable around that point. Overall, we see that the escape can be predicted very well using the sharpness together with the non-uniformity.

Figure 8: (a) FashionMNIST; (b) Corrupted FashionMNIST. The first three columns display the dynamics of the training accuracy, the sharpness and the test accuracy, respectively. To better show the escape process, we show only the first 1,500 iterations. All the optimizers are run for enough iterations to achieve a training error smaller than 10⁻⁴, and the test accuracies of the final solutions are shown in the legends. The fourth column displays scatter plots of the sharpness and the test accuracy of 200 models, obtained with different learning rates, batch sizes and initializations.

The process of escape. Now we focus on the first and second columns of Figure 8, i.e. the dynamics of the training accuracy and the sharpness. The dynamics display a sudden escape to a flatter region; after that, the training error is gradually reduced while the sharpness does not change much. We also see that the escape process takes only a few iterations, far fewer than the number of iterations needed for the optimizers to converge (see the first column of Figure 8). This justifies our premise that linear stability is the real \"driving force\" behind this escape phenomenon.
Viewing SGD as an SDE [8, 15, 6, 5] cannot explain this, since noise-driven escape is exponentially slow [1].

Implications for generalization. Let us examine the third column of Figure 8, together with the legends, where the test accuracies of the final solutions are reported. As expected, GD with large learning rates and SGD with small batch sizes indeed converge to solutions that generalize better. Moreover, this effect is more significant on the more complex dataset, corrupted FashionMNIST. This is expected from previous observations that sharpness and generalization error are correlated [3, 7, 6]. To see the extent of this correlation, we plot the test accuracy against the sharpness in the fourth column of Figure 8. As we can see, the correlation between test accuracy and sharpness is stronger for the corrupted FashionMNIST dataset than for the clean FashionMNIST.

It should also be noted that one can construct examples in which flatter solutions have worse test accuracy. One such counterexample is given in Appendix B.

5 Related Work and Discussion

The phenomenon of escape from sharp minima was first studied by Hu et al. [5], who suggested that escape from sharp minimizers is easier than from flat ones. Zhu et al. [15] suggested that the non-isotropic structure of SGD noise is essential for the fast escape. Jastrzębski et al. [6] suggested that the noise factor η/B determines the sharpness of the solutions selected by SGD. A similar argument is used in Smith and Le [10] and Goyal et al. [2] in connection with the test performance degradation observed in large-batch training. These works viewed SGD as a diffusion process. In contrast, we analyzed SGD from a dynamical stability perspective.
Our theory and experiments show that the learning rate and the batch size play different roles in minima selection.

(Figure 8 legends, final test accuracies. FashionMNIST: 80.73 (η = 0.1, B = 1), 80.40 (η = 0.1, B = 4), 80.26 (η = 0.3, B = 1000), 80.39 (η = 0.5, B = 1000). Corrupted FashionMNIST: 70.90 (η = 0.1, B = 100), 72.42 (η = 0.1, B = 4), 72.69 (η = 0.3, B = 1200), 72.52 (η = 0.5, B = 1200).)

Yin et al. [14] proposed a quantity called gradient diversity, defined by Σ_{i=1}^n ‖∇f_i‖²/‖∇f‖², and used it to analyze the parallel efficiency of SGD. This quantity is similar in spirit to the ratio between non-uniformity and sharpness. However, it is not well defined at the global minima of over-parameterized models, since there ∇f_i = 0 for every i ∈ [n].

On a technical level, our work also bears some similarity to the convergence rate analysis of SGD in Ma et al. [9], since both focus on quadratic problems. However, it should be stressed that our interest is not the convergence rate at all, but whether it is possible for a particular optimization algorithm with a particular set of hyper-parameters to converge to a particular global minimum. Even though we also use formulas derived from quadratic problems to illustrate our findings, we emphasize that the issue we are concerned with here is local in nature, and locally one can almost always make the quadratic approximation. We therefore expect our results to hold for general (even non-convex) loss functions, whereas the explicit results of Ma et al. [9] hold only for the quadratic problem they considered.

Quasi-Newton methods. Wilson et al.
[12] found that, compared to vanilla SGD, adaptive gradient methods tend to select solutions that generalize worse. This phenomenon can be explained as follows. Consider the adaptive optimizer x_{t+1} = x_t − η D_t⁻¹ ∇L(x_t), whose stability condition is λ_max(D⁻¹H) ≤ 2/η. For algorithms that attempt to approximate Newton's method, we have D ≈ H; consequently, almost all minima can be selected as long as η ≤ 2. This suggests that such algorithms tend to select sharper minimizers. As an illustration, we used L-BFGS to train a model on corrupted FashionMNIST and observed that L-BFGS always selects relatively sharp minima, even when the learning rate is well tuned. Indeed, as shown in Figure 9, starting from the best solutions selected by L-BFGS, both GD and SGD can escape and converge to flatter solutions that generalize better, as long as they are also well tuned. We suspect that this might provide an explanation for why adaptive gradient methods perform worse in terms of generalization, but further work is needed to see whether this really holds.

Figure 9: Escape of GD and SGD from the minima (test accuracy 69.5%) selected by well-tuned L-BFGS. The training accuracy, sharpness and test accuracy are shown as functions of the number of iterations.

6 Conclusion

We have discussed the mechanism of global minima selection from the perspective of dynamical stability. Through linear stability analysis, we have demonstrated that sharpness and non-uniformity are both important for the selection. For both GD and SGD, larger learning rates give rise to flatter solutions, and SGD tends to select minima with smaller non-uniformity than GD.
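The selection rules just summarized can be collected into a small checker. The sketch below is our own packaging of the necessary condition (9) (the function names are ours); as a sanity check, it is applied to the corrupted-FashionMNIST starting minimum of Section 4.4 (sharpness 19.9, non-uniformity 51.7, n = 1200 training examples).

```python
import math

def stability_bounds(eta, B, n):
    """Largest sharpness and non-uniformity allowed by condition (9)
    for a minimum to be linearly stable under SGD(eta, B) on n samples."""
    a_max = 2.0 / eta
    # For B >= n (full batch, i.e. GD) the non-uniformity constraint disappears.
    s_max = math.inf if B >= n else (1.0 / eta) * math.sqrt(B * (n - 1) / (n - B))
    return a_max, s_max

def is_selectable(a, s, eta, B, n):
    a_max, s_max = stability_bounds(eta, B, n)
    return a <= a_max and s <= s_max

a, s, n = 19.9, 51.7, 1200   # corrupted-FashionMNIST starting minimum (Table 3)
print(is_selectable(a, s, eta=0.1, B=4, n=n))    # False: SGD with B = 4 escapes
print(is_selectable(a, s, eta=0.1, B=100, n=n))  # True: SGD with B = 100 stays
print(is_selectable(a, s, eta=0.3, B=n, n=n))    # False: GD with eta = 0.3 escapes (a > 2/0.3)
```

These three outcomes match the behavior observed in Figure 8b: the exact bound for B = 4 is (1/0.1)√(4·1199/1196) ≈ 20, essentially the √B/η ≈ 20 approximation used in the text.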
For neural networks, it was observed empirically that non-uniformity is roughly proportional to the sharpness. This might explain why SGD tends to select flatter minima than GD in deep learning.

Regarding the connection to generalization, our current understanding is that it can go both ways. On one hand, one can construct examples for which sharper solutions generalize better; one such example is given in the appendix. On the other hand, there is plenty of evidence that flatness is strongly (positively) correlated with generalization for neural networks. This work suggests the following picture of the optimization process. While GD can readily converge to a global minimum near the initialization point, SGD has to work harder to find one that it can converge to. In the process, SGD manages to find a set of parameters that fit the data more uniformly, and this increased uniformity also helps to improve the model's ability to fit other data, thereby increasing the test accuracy. In any case, this is still very much a problem for further investigation.

[Figure 9 panels: training accuracy, sharpness, and test accuracy versus the number of iterations, for GD with η = 0.8 (test acc 70.82) and SGD with η = 0.4, B = 4 (test acc 73.25).]

Acknowledgement

We are grateful to Zhanxing Zhu for very helpful discussions. The work performed here is supported in part by ONR grant N00014-13-1-0338 and the Major Program of NNSFC under grant 91130005.

References

[1] Crispin Gardiner. Stochastic Methods, volume 4. Springer Berlin, 2009.

[2] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

[3] S. Hochreiter and J. Schmidhuber. Flat minima.
Neural Computation, 9(1):1–42, 1997.

[4] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1729–1739, 2017.

[5] Wenqing Hu, Chris Junchi Li, Lei Li, and Jian-Guo Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017.

[6] Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.

[7] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations (ICLR), 2017.

[8] Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modified equations and adaptive stochastic gradient algorithms. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2101–2110. PMLR, Aug 2017.

[9] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 3325–3334. PMLR, Jul 2018.

[10] Samuel L. Smith and Quoc V. Le. A Bayesian perspective on generalization and stochastic gradient descent. In International Conference on Learning Representations, 2018.

[11] Steven H. Strogatz. Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering. CRC Press, 2018.

[12] Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning.
In Advances in Neural Information Processing Systems, pages 4151–4161, 2017.

[13] Lei Wu, Zhanxing Zhu, and Weinan E. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv preprint arXiv:1706.10239, 2017.

[14] Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and Peter Bartlett. Gradient diversity: A key ingredient for scalable distributed learning. In International Conference on Artificial Intelligence and Statistics, pages 1998–2007, 2018.

[15] Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from minima and regularization effects. arXiv preprint arXiv:1803.00195, 2018.