{"title": "Large Scale Structure of Neural Network Loss Landscapes", "book": "Advances in Neural Information Processing Systems", "page_first": 6709, "page_last": 6717, "abstract": "There are many surprising and perhaps counter-intuitive properties of optimization of deep neural networks. We propose and experimentally verify a unified phenomenological model of the loss landscape that incorporates many of them. High dimensionality plays a key role in our model. Our core idea is to model the loss landscape as a set of high dimensional \\emph{wedges} that together form a large-scale, inter-connected structure and towards which optimization is drawn. We first show that hyperparameter choices such as learning rate, network width and $L_2$ regularization, affect the path optimizer takes through the landscape in similar ways, influencing the large scale curvature of the regions the optimizer explores. Finally, we predict and demonstrate new counter-intuitive properties of the loss-landscape. We show an existence of low loss subspaces connecting a set (not only a pair) of solutions, and verify it experimentally. Finally, we analyze recently popular ensembling techniques for deep networks in the light of our model.", "full_text": "Large Scale Structure of Neural Network Loss\n\nLandscapes\n\nStanislav Fort\u2217\nGoogle Research\nZurich, Switzerland\n\nStanislaw Jastrzebski\u2020\nNew York University\n\nNew York, United States\n\nAbstract\n\nThere are many surprising and perhaps counter-intuitive properties of optimization\nof deep neural networks. We propose and experimentally verify a uni\ufb01ed phe-\nnomenological model of the loss landscape that incorporates many of them. High\ndimensionality plays a key role in our model. Our core idea is to model the loss\nlandscape as a set of high dimensional wedges that together form a large-scale,\ninter-connected structure and towards which optimization is drawn. 
We first show that hyperparameter choices such as learning rate, network width and L2 regularization affect the path the optimizer takes through the landscape in similar ways, influencing the large scale curvature of the regions the optimizer explores. We then predict and demonstrate new counter-intuitive properties of the loss landscape: we show the existence of low loss subspaces connecting a set (not only a pair) of solutions, and verify this experimentally. Finally, we analyze recently popular ensembling techniques for deep networks in the light of our model.\n\n1 Introduction\n\nThe optimization of deep neural networks is still relatively poorly understood. One intriguing property is that despite their massive over-parametrization, their optimization dynamics is surprisingly simple in many respects. For instance, Li et al. [2018a] show that in spite of the typically very high number of trainable parameters, constraining optimization to a small number of randomly chosen directions often suffices to reach a comparable accuracy. Fort and Scherlis [2018] extend this observation and analyze its geometric implications for the landscape; Goodfellow et al. [2014] show that there is a smooth path connecting initialization and the final minima. Another work shows how it is possible to train only a small percentage of weights while reaching a good final test performance [Frankle and Carbin, 2019].\nInspired by these and some other investigations we propose a phenomenological model for the loss surface of deep networks. We model the loss surface as a union of n-dimensional (lower dimension than the full space, although still very high) manifolds that we call n-wedges, see Figure 1. 
Our model is capable of reproducing previous experimental results (such as the low-dimensionality of optimization) as well as making new predictions.\nFirst, we show how common regularizers (learning rate, batch size, L2 regularization, dropout, and network width) all influence the optimization trajectory in a similar way. We find that increasing their regularization strength leads, up to some point, to a similar effect: an increase in the width of the radial tunnel (see Figure 2 and Section 3.3 for discussion) that the optimization travels along. This is a further step towards understanding the common role that different hyperparameters play in regularizing training of deep networks [Jastrz\u0119bski et al., 2017, Smith et al., 2018].\n\n\u2217This work was done as a part of the Google AI Residency program.\n\u2020This work was partially done while the author was an intern at Google Zurich.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: A model of the low-loss manifold comprising 2-wedges in a 3-dimensional space. Due to the difficulty of representing high dimensional objects, the closest visualization of our landscape model is built with 2-dimensional low-loss wedges (the disks) in a 3-dimensional space (Left panel). The particular details such as angles between wedges differ in real loss landscapes. Optimization moves the network primarily radially at first. Two optimization paths are shown. A low-loss connection, illustrated as a dashed line, is provided by the wedges\u2019 intersections. In real landscapes, the dimension of the wedges is very high. In the Central panel, the test loss on a two dimensional subspace including two independently optimized optima (dark points) is shown (a CNN on CIFAR-10), clearly displaying a high loss wall in between them on the linear path (lime). 
The Right panel shows a subspace including an optimized midpoint (aqua) and the low loss 1-tunnel we constructed connecting the two optima. Notice the large size (compared to the radius) of the low-loss basins between optima.\n\nFigure 2: A view of three individually optimized optima and their associated low-loss radial tunnels. A CNN was trained on CIFAR-10 from 3 different initializations. Along optimization, the radius of the network\u2019s configuration grows. The figure shows experimentally observed training set loss in the vicinity of the three optima at 4 different radii (4 different epochs). The lime lines show that in order to move from one optimum to another on a linear interpolation, we need to leave the low loss basins surrounding the optima. Individual low-loss radial tunnels are approximately orthogonal to each other. Note that in high dimensions, the neighborhood of a point in our n-wedge (see Figure 1) will look like a tunnel locally, while its large scale geometry might be that of a wedge (Section 3.3).\n\nMost importantly, our work analyzes surprising new effects that high dimensions have on the landscape properties and highlights how our intuition about hills and valleys in 2D often fails us. Based on our model, we predict and demonstrate the existence of higher dimensional generalizations of low loss tunnels that we call m-tunnels. Whereas a tunnel would connect two optima on a low loss path, m-tunnels connect (m + 1) optima on an effectively m-dimensional hypersurface. We also critically examine the recently popular Stochastic Weight Averaging (SWA) technique [Izmailov et al., 2018] in the light of our model.\n\n2 Related work\n\nA large number of papers show that the loss landscape, despite its large dimension D, exhibits surprisingly simple properties. These works, however, often do not attempt to construct a single model that would contain all of those properties at once. 
This is one of the key inspirations for constructing our theoretical model of the loss surface \u2013 we want to build a unified framework that incorporates all of those experimental results and can make new verifiable predictions. In the following, we present relevant observations from the literature and frame them in the light of our model.\n\nLong and short directions First, the linear path from the initialization to the optimized solution typically has a monotonically decreasing loss along it, encountering no significant obstacles along the way [Goodfellow et al., 2014]. To us, this suggests the existence of long directions of the solution manifold \u2013 directions in which the loss changes slowly. On the other end of the spectrum, Jastrz\u0119bski et al. [2018] and Xing et al. [2018] characterize the shape of the sharpest directions in which optimization happens. While Goodfellow et al. [2014] observe a large scale property of the landscape, the other experiments are inherently local.\n\nDistributed and dense manifold Another phenomenon is that constraining optimization to a random, low-dimensional hyperplanar cut through the weight space provides comparable performance to full-space optimization, given that the dimension of the plane d is larger than an architecture- and problem-specific intrinsic dimension d_int [Li et al., 2018a]. The performance depends not only on the hyperplane\u2019s dimension, but also on the radius at which it is positioned in the weight space. The results are stable under reinitializations of the plane [Fort and Scherlis, 2018]. This suggests the special role of the radial direction as well as the distributed nature of the low-loss manifold. 
Put simply, good solutions are everywhere, and they are distributed densely enough that even a random, low-dimensional hyperplane hits them consistently.\n\nConnectivity Finally, a pair of independently optimized optima has a high loss wall on a linear weight-space interpolation between them (as well as on any (P \u221d 1 \u2212 k/D) random path). However, a low-loss connector can be found between them, such that each point along such a path is a low-loss point itself [Draxler et al., 2018, Garipov et al., 2018]. This suggests connectivity of different parts of the solution manifold.\n\nLoss surface of deep networks In this paper we present a new model for the loss surface of deep networks. Theoretical work on the subject was pioneered by Choromanska et al. [2015]. An important finding from this and follow-up work is that all minima in the loss surface are in some sense global [Nguyen and Hein, 2017]. Some papers have also looked at ways of visualizing the loss surface [Goodfellow et al., 2014, Keskar et al., 2017, Li et al., 2018b]. However, as we demonstrate, those often do not capture the properties relevant to SGD, as they choose their projection planes at random.\nAn important feature of the loss surface is its curvature. Some of the first studies of curvature were carried out by LeCun et al. [1998], Sagun et al. [2016]. A significant, though not well understood, phenomenon is that curvature correlates with generalization in many networks. In particular, optimization using a lower learning rate or a larger batch size tends to steer optimization to both sharper and better generalizing regions of the loss landscape [Keskar et al., 2017, Jastrz\u0119bski et al., 2017]. Li et al. 
[2018b] also suggest that overall smoothness of the loss surface is an important factor for network generalization.\n\n3 Building a toy model for the loss landscape\n\nIn this section we will gradually build a phenomenological toy model of the landscape in an informal manner. Then we will perform experiments on the toy model and argue that they reproduce some of the intriguing properties of the real loss landscape. In the next section we will use these insights to propose a formal model of the loss landscape of real neural networks.\n\n3.1 Loss landscape as high dimensional wedges that intersect\n\nTo start with, we postulate that the loss landscape is a union of high dimensional manifolds whose dimension is only slightly lower than that of the full space D. This construction is based on the key surprising properties of real landscapes discussed in Section 2, and we do not attempt to build it up from simple assumptions. Rather, we focus on creating a phenomenological model that can, at the same time, reconcile the many intriguing results about neural network optimization.\nTo make it more precise, let us imagine a simple scenario \u2013 there are two types of points in the loss landscape: good, low loss points, and bad, high loss points. This is an artificial distinction that we will get rid of soon, but it will be helpful for the discussion. Let n of their linear dimensions be long, of infinite linear extent, and D \u2212 n dimensions short, of length \u03b5. Let us construct a very simple toy model where we take all possible n-tuples of axes in the D-dimensional space, and position one cuboid such that its long axes align with each n-tuple of axes. We take the union of the cuboids corresponding to all n-tuples of axes. In such a way, we tiled the D-dimensional space with n-long-dimensional objects that all intersect at the origin and radiate from it to infinity. We will start referring to the cuboids as n-wedges now3. 
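As a concrete illustration of this construction, its connectivity implication can be checked in a small numpy sketch. The dimensions, axis choices and the idealized zero-width wedges below are illustrative assumptions of ours, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 20, 18  # illustrative: 18-wedges in a 20-dimensional space

def dist_to_wedge(P, long_axes):
    """Distance of configuration P to the n-wedge whose long directions
    are `long_axes`: the norm of the remaining (short) components.
    (Idealized: the wedge's short extent epsilon is taken to be zero.)"""
    short = np.setdiff1d(np.arange(len(P)), long_axes)
    return float(np.linalg.norm(P[short]))

# Two points lying exactly on two different wedges.
axes1, axes2 = np.arange(0, 18), np.arange(2, 20)
P1 = np.zeros(D); P1[axes1] = rng.normal(size=n)
P2 = np.zeros(D); P2[axes2] = rng.normal(size=n)

# The linear midpoint leaves both wedges (nonzero surrogate loss) ...
mid = 0.5 * (P1 + P2)
off_wedge = min(dist_to_wedge(mid, axes1), dist_to_wedge(mid, axes2)) > 0

# ... but a two-segment path through the wedges' intersection (the origin)
# stays at zero loss: radially in along wedge 1, then out along wedge 2.
on_path = max(
    max(dist_to_wedge(t * P1, axes1) for t in np.linspace(0, 1, 11)),
    max(dist_to_wedge(t * P2, axes2) for t in np.linspace(0, 1, 11)),
)
```

Here `off_wedge` comes out true while `on_path` stays at zero, mirroring the claim that a linear interpolation between optima crosses high loss while a connector through a wedge intersection does not.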
An illustration of such a model is shown in Figure 1. Since these objects have an infinite extent in n out of D dimensions (and s = D \u2212 n short directions), they necessarily intersect. If the number of short directions is small (s \u226a D), then the intersection of two such cuboids has \u2248 2s short directions.\nEven such a simplistic model makes definite predictions. First, every low loss point is connected to every other point. If they do not lie on the same wedge (as is very likely for randomly initialized points), we can always go from one to the intersection of their respective wedges, and continue on the other wedge, as illustrated in Figure 1. A linear path between the optima, however, would take us out of the low loss wedges to the area of high loss. As such, this model has an in-built connectivity between all low loss points, while making the linear path (or any other random path) necessarily encounter high loss areas.\nSecondly, the deviations from the linear path to the low loss path should be approximately aligned until we reach the wedge intersection, and change orientation after that. That is indeed what we observe in real networks, as illustrated in Figure 3. Finally, the number of short directions should be higher in the middle of our low loss path between two low loss points/optima on different wedges. We observe that in real networks as well, as shown in Figure 4.\n\n3.2 Building the toy model\n\nWe are now ready to fully specify the toy model. In the previous section, we discussed informally dividing points in the loss landscape into good and bad points. Here we would like to build the loss function we call the surrogate loss. Let our configuration be P \u2208 R^D as before, and let us initialize it at random component-wise. 
Let Ltoy(P) denote the surrogate loss for our toy landscape model, which reaches zero once a point lands on one of the wedges, and increases with the point\u2019s distance to the nearest wedge. These properties are satisfied by a simple Euclidean distance to the nearest point on the nearest wedge, which we use as our surrogate loss in our toy model.\nMore precisely, the way we calculate our surrogate loss Ltoy(P) is: 1) sort the components of P based on their absolute value, 2) take the D \u2212 n smallest values, 3) take the square root of the sum of their squares. This simple procedure yields the L2 distance to the nearest n-wedge in the toy model. Importantly, it allows us to optimize in the model landscape without having to explicitly model any of the wedges in memory. Here we align the n-wedges with the axes; however, we verified that our conclusions are not dependent on this, nor on their exact mutual orientations or numbers.\n\n3.3 Experiments on the toy model\n\nHaving a function that maps any configuration P to a surrogate loss Ltoy(P), we can perform, in TensorFlow, simulations of the same experiments that we do on real networks and verify that they exhibit similar properties.\n\n3This name reflects that in practice their width along the short directions is variable.\n\nOptimizing on random low-dimensional hyperplanar cuts. On our toy landscape model, we replicated the same experiments that were performed in Li et al. [2018a] and Fort and Scherlis [2018]. In the two papers, it was established that constraining optimization to a randomly chosen, d-dimensional hyperplane in the weight space yields comparable results to full-space optimization, given d > d_intrinsic, which is small (d_intrinsic \u226a D) and dataset and architecture specific.\nThe way we replicated this on our toy model was equivalent to the treatment in Fort and Scherlis [2018]. 
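Before moving on, the sort-and-norm surrogate loss from Section 3.2 can be written directly in numpy (a minimal sketch with illustrative values; the paper's own implementation was in TensorFlow):

```python
import numpy as np

def toy_loss(P, n):
    """Surrogate loss: L2 distance from configuration P (shape (D,))
    to the nearest axis-aligned n-wedge.
    1) sort components by absolute value,
    2) keep the D - n smallest,
    3) return the square root of the sum of their squares."""
    shortest = np.sort(np.abs(P))[: len(P) - n]
    return float(np.sqrt(np.sum(shortest ** 2)))

# A point with at most n nonzero components lies exactly on a wedge.
P_on = np.array([5.0, -2.0, 7.0, 0.0, 0.0])
# A small perturbation along the remaining short directions leaves it.
P_off = P_on + np.array([0.0, 0.0, 0.0, 0.6, 0.8])
```

For `n = 3`, `toy_loss(P_on, 3)` is exactly 0, while `toy_loss(P_off, 3)` is approximately 1.0, the size of the perturbation along the two short directions.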
Let the full-space configuration P depend on the within-hyperplane position \u03b8 \u2208 R^d as P = \u03b8M + P0, where M \u2208 R^(d\u00d7D) defines the hyperplane\u2019s axes and P0 \u2208 R^D is its offset from the origin. We used TensorFlow to optimize the within-hyperplane coordinates \u03b8 directly, minimizing the toy landscape surrogate loss Ltoy(P(\u03b8)). We observed very similar behavior to the one in real neural networks: we could successfully minimize the loss, given that the dimension d of the hyperplane was > d_lim. Since we explicitly constructed the underlying landscape, we were able to relate this limiting dimension to the dimensionality of the wedges n as d_lim = D \u2212 n. As in real networks, the random initialization of the hyperplane had very little effect on the optimization. This makes our toy model consistent with one of the most surprising behaviors of real networks. Expressed simply, optimization on random, low-dimensional hyperplanes works well provided the hyperplane dimension supplies at least the number of short directions the underlying landscape manifold has.\n\nBuilding a low-loss tunnel between 2 optima. While we built our landscape in a way that explicitly allows a path between any two low loss points, we wondered if we could construct such paths the same way we do in real networks (see for instance Draxler et al. [2018]), as illustrated in Figure 3.\nWe took two random initializations I1 and I2 and optimized them using our surrogate loss Ltoy until convergence. As expected, randomly chosen points converged to different n-wedges, and therefore the linear path between them went through a region of high loss, exactly as real networks do. We then chose the midpoint between the two points, put up a tangent hyperplane there and optimized, as described in Section 4. 
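The hyperplane-constrained minimization used above can be sketched in a few lines of numpy. This is an illustrative reimplementation under our own choices of D, n, d, random hyperplane and step-size schedule, not the paper's TensorFlow setup:

```python
import numpy as np

rng = np.random.default_rng(1)
D, n = 50, 40  # 40-wedges: D - n = 10 short directions, so d_lim = 10
d = 12         # hyperplane dimension, slightly above d_lim

def toy_loss_and_grad(P, n):
    """Distance to the nearest axis-aligned n-wedge and its (sub)gradient:
    only the D - n smallest-|P_i| components contribute."""
    idx = np.argsort(np.abs(P))[: len(P) - n]
    L = float(np.sqrt(np.sum(P[idx] ** 2)))
    g = np.zeros_like(P)
    if L > 0:
        g[idx] = P[idx] / L
    return L, g

M = rng.normal(size=(d, D)) / np.sqrt(D)  # random hyperplane axes
P0 = rng.normal(size=D)                   # random offset (initialization)
theta = np.zeros(d)

L0, _ = toy_loss_and_grad(theta @ M + P0, n)
for t in range(4000):                     # plain gradient descent on theta
    L, g = toy_loss_and_grad(theta @ M + P0, n)
    theta -= 0.1 / (1 + 0.01 * t) * (M @ g)  # chain rule: dL/dtheta = M g
```

With `d >= D - n` the loss drops far below its random-offset starting value `L0`; rerunning with `d < D - n` instead leaves a nonzero floor, matching the relation `d_lim = D - n`.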
In this way, we were able to construct a low loss path between the two optima, the same way we did for real networks. We also observed the clustering of deviations from the linear path into two halves, as illustrated in Figure 3.\n\nLiving on a wedge, probing randomly, and seeing a hole. Finally, another property of optimization of deep networks is that along the optimization trajectory, the loss surface looks locally like a valley or a tunnel [Jastrz\u0119bski et al., 2018, Xing et al., 2018]. This is also replicated in our model. A curious consequence of high dimensions is as follows. Imagine being at a point P0 a short distance from one of the wedges, leading to a loss L0 = Ltoy(P0). Imagine computing the loss along vectors of length a in random directions from P0 as P0 + av, where v is a unit vector in a random direction. The change in loss corresponding to the vector length a will very likely be almost independent of the direction v and will be increasing. In other words, it will look like you are at the bottom of a circular well of low loss, and everywhere you look (at random), the loss increases at approximately the same rate. Importantly, for this to be true, the point P0 does not have to be the global minimum, i.e. exactly on one of our wedges. We explored this effect numerically in our toy model, as well as observed it in real networks, as demonstrated in Figure 2 on a real, trained network. Even if the manifold of low loss points were made of extended wedge-like objects (as in our model), locally (and importantly when probing in random directions), it would look like a network of radial tunnels. We use the size of those tunnels to characterize real networks, and the effect of different regularization techniques, as shown in Figure 5.\n\n4 Experiments\n\nOur main goal in this section is to validate our understanding of the loss landscape. 
In the first section we investigate low-loss tunnels (connectors) between optima to show properties of our model. Interestingly, we see a match between the toy model and the real neural networks here. This investigation leads us naturally to discovering m-connectors \u2013 low loss subspaces connecting a large number of optima at the same time.\nNext, we investigate how optimization traverses the loss landscape and relate it to our model. In the final section we show that our model has concrete consequences for the ensembling techniques of deep networks. All in all we demonstrate that our model of the loss landscape is consistent with previous observations and capable of making new predictions.\n\nFigure 3: Building a low-loss tunnel. The Left panel shows a sketch of our algorithm. The Middle panel shows cosines between the vector deviations, from the linear interpolation, of points on a low-loss connector between optima. Deviations in the first and second half of the connector are aligned within each group but essentially orthogonal mutually. The Right panel shows that label predictions along the tunnel change linearly up to the middle, and stay constant from then on. This is consistent with each wedge representing a particular function. The results were obtained with a CNN on CIFAR-10.\n\n4.1 Experimental setting\n\nWe use a SimpleCNN model (3 \u00d7 3 convolutional filters, followed by 2 \u00d7 2 pooling, 16, 32, 32 channels with the tanh non-linearity) and run experiments on the CIFAR-10 dataset. Unless otherwise noted (as in Figure 6), we trained with a constant learning rate using the Adam optimizer.\nTo further verify the validity of our landscape model, we performed the same experiments on CNN and fully-connected networks of different widths and depths, explored ReLU as well as tanh non-linearities, and used MNIST, Fashion MNIST, CIFAR-10 and CIFAR-100 datasets. 
We do not, however, present their results in the plots directly.\n\n4.2 Examining and building tunnels between optima\n\nTo validate the model we look more closely at previous observations about the paths between individual optima made in Draxler et al. [2018], Garipov et al. [2018]. Our main idea is to show that the real loss landscape contains a connector structure similar to the one we discussed in Section 3.3.\nIn Draxler et al. [2018], Garipov et al. [2018] it is shown that a low loss path exists between pairs of optima. To construct them, the authors use relatively complex algorithms. We achieved the same using a simple algorithm which, in addition, allowed us to diagnose the geometry of the underlying loss landscape.\nTo find a low-loss connector between two optima, we use a simple and efficient algorithm: 1) Construct a linear interpolation between the optima and divide it into segments. 2) At each segment, put up a (D \u2212 1)-dimensional hyperplane normal to the linear interpolation. 3) For each hyperplane, start at the intersection with the linear interpolation, and minimize the training loss, constraining the gradients to lie within the hyperplane. 4) Connect the optimized points by linear interpolations. This simple approach is sufficient for finding low-loss connectors between pairs of optima.\nThis corroborates our phenomenological model of the landscape. In our model we need to switch from one n-wedge to another when finding a low-loss connector, and we observe the same alignment when optimizing on our toy landscape. The predicted labels change in the same manner, suggesting that each n-wedge corresponds to a particular family of functions. Note also that the number of short directions in Figure 4 at the endpoints (300; corresponding to the original optima themselves) is very similar to the Li et al. 
[2018a] intrinsic dimension for a CNN on the dataset, further supporting our model, which predicts this correspondence (Section 3.3).\n\n4.3 Finding low loss m-connectors\n\nIn addition, we discover the existence of m-connectors between (m + 1)-tuples of optima, to which our algorithm naturally extends. In the convex hull defined by the optima we choose points where we put up (D \u2212 m)-dimensional hyperplanes and optimize as before. We experimentally verified the existence of m-connectors up to m = 10 optima at once, going beyond previous works that only dealt with what we call 1-connectors, i.e. tunnels between pairs of optima. This is a natural generalization of the concept of a tunnel between optima in high dimensions. Another new prediction of our landscape model is that the number of short directions in the middle of an m-connector should scale with m, which is what we observe, as visible in Figure 4. Note that the same property is exhibited by our toy model. We hope that m-connectors might be useful for developing new ensembling techniques in the future.\n\nFigure 4: The number of short directions along a 1-connector (tunnel) between 2 optima (Left panel), the number of short directions in the middle of an m-connector between (m + 1) optima (Middle panel) and the effect of learning rate on the number of short directions (Right panel). The results were obtained with a CNN on Fashion MNIST.\n\n4.4 The effect of learning rate, batch size and regularization\n\nOur model states that optimization is guided through a set of low-loss wedges that appear as radial tunnels on low dimensional projections (see Figure 2). A natural question is which tunnels are selected. While these observations do not directly confirm our model, they make it more actionable for the community.\n\nFigure 5: The effect of learning rate, L2 regularization and dropout rate on the angular width of radial low-loss tunnels. 
The results were obtained with a CNN on CIFAR-10.\n\nWe observe that the learning rate, batch size, L2 regularization and dropout rate have a measurable, and similar, effect on the geometrical properties of the radial tunnel that SGD selects. In particular, we observe that the angular width of the low-loss tunnels (the distance to which one can move from an optimum until a high loss is hit) changes as follows: 1) Higher learning rate \u2192 wider tunnels, as shown in Figure 5. We found the effect of batch size to be of a similar kind, with a larger batch size leading to narrower tunnels. 2) Higher L2 regularization \u2192 wider tunnels, as shown in Figure 5. This effect disappears when regularization becomes too strong. 3) Higher dropout rate \u2192 wider tunnels, as shown in Figure 5. This effect disappears when the dropout rate is too high. We hope these results will lead to a better understanding of the somewhat interchangeable effect of hyperparameters on generalization performance [Jastrz\u0119bski et al., 2017, Smith et al., 2018] and will put them into a more geometrical light.\n\n4.5 Consequences for ensembling procedures\n\nThe wedge structure of the loss landscape has concrete consequences for different ensembling procedures. Since functions whose configuration vectors lie on different wedges typically have the most different class label predictions, in general an ensembling procedure averaging predicted probabilities benefits from using optima from different wedges. This can easily be achieved by starting from independent random initializations; however, such an approach is costly in terms of how many epochs one has to train for.\nAlternative approaches, such as snapshot ensembling [Huang et al., 2017], have been proposed. 
For them, a cyclical learning rate schedule is used, where during the large learning rate phase the network configuration moves significantly through the landscape, and during the low learning rate phase it finds a local minimum. Then the configuration (a snapshot) is stored. The predictions of many snapshots are used during inference. While this provides higher accuracy, the inference is slow as the predictions from many models have to be calculated. Stochastic Weight Averaging (SWA) [Izmailov et al., 2018] has been proposed to remedy this by storing the average configuration vector over the snapshots.\nWe predicted that if our model of the loss landscape is correct, SWA will not function well for high learning rates that would make the configuration change wedges between snapshots. We verified this with a CNN on CIFAR-10 as demonstrated in Figure 6.\nThe bottom line for ensembling techniques is that practitioners should tune the learning rate carefully in these approaches. Our result also suggests that cyclic learning rates can indeed find optima on different n-wedges that provide greater diversity for ensembling, if the maximum learning rate is high enough. However, these are unsuitable for weight averaging, as their mean weights fall outside of the low loss areas.\n\nFigure 6: Stochastic Weight Averaging (SWA) does not work well when wedges are changed between snapshots. For high maximum learning rates during cyclical snapshotting, the configurations obtained do not lie on the same wedge and therefore their weight averages lie in a high loss area. The average configuration performs worse than a single snapshot (Panel 1). For a low learning rate, the snapshots lie within the same wedge and therefore their average performs well (Panel 3). 
The advantage that prediction averaging (in orange) has over weight averaging (in green) is quantified in Panel 4.\n\n5 Conclusion\n\nWe propose a phenomenological model of the loss landscape of neural networks that exhibits optimization behavior previously observed in the literature. High dimensionality of the loss surface plays a key role in our model. Further, we studied how optimization travels through the loss landscape guided by this manifold. Finally, our model gives new predictions about ensembling of neural networks.\nWe conducted experiments characterizing real neural network loss landscapes and verified that the equivalent experiments performed on our toy model produce corresponding results. We generalized the notion of low-loss connectors between pairs of optima to an m-dimensional connector between a set of optima, explored it experimentally, and used it to constrain our landscape model. We observed that the learning rate (and other regularizers) leads optimization to explore different parts of the landscape, and we quantified the results by measuring the radial low-loss tunnel width \u2013 the typical distance we can move from an optimum until we hit a high loss area. We also make a direct and quantitative connection between the dimensionality of our landscape model and the intrinsic dimension of optimization in deep neural networks.\nFuture work could focus on exploring the implications of our model for more efficient methods of training neural networks.\n\nReferences\n\nChunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes, 2018a.\n\nStanislav Fort and Adam Scherlis. The goldilocks zone: Towards better understanding of neural network loss landscapes, 2018.\n\nIan J. Goodfellow, Oriol Vinyals, and Andrew M. Saxe. Qualitatively characterizing neural network optimization problems, 2014.\n\nJonathan Frankle and Michael Carbin. 
The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.\n\nStanis\u0142aw Jastrz\u0119bski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in sgd, 2017.\n\nSamuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don\u2019t decay the learning rate, increase the batch size. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1Yy1BxCZ.\n\nPavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization, 2018.\n\nStanis\u0142aw Jastrz\u0119bski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length. arXiv e-prints, art. arXiv:1807.05031, Jul 2018.\n\nChen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A Walk with SGD. arXiv e-prints, art. arXiv:1802.08770, Feb 2018.\n\nFelix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred A Hamprecht. Essentially no barriers in neural network energy landscape. arXiv preprint arXiv:1803.00885, 2018.\n\nTimur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8789\u20138798. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8095-loss-surfaces-mode-connectivity-and-fast-ensembling-of-dnns.pdf.\n\nAnna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. 
Journal of Machine Learning Research, 38:192\u2013204, 2015. ISSN 1532-4435.\n\nQuynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML\u201917, pages 2603\u20132612. JMLR.org, 2017. URL http://dl.acm.org/citation.cfm?id=3305890.3305950.\n\nNitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR. OpenReview.net, 2017.\n\nHao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6389\u20136399, 2018b.\n\nYann LeCun, L\u00e9on Bottou, Genevieve B. Orr, and Klaus-Robert M\u00fcller. Efficient backprop. In Neural Networks: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop, pages 9\u201350, London, UK, 1998. Springer-Verlag. ISBN 3-540-65311-2. URL http://dl.acm.org/citation.cfm?id=645754.668382.\n\nLevent Sagun, L\u00e9on Bottou, and Yann LeCun. Singularity of the hessian in deep learning. arXiv preprint arXiv:1611.07476, 2016.\n\nGao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, and Kilian Q. Weinberger. Snapshot ensembles: Train 1, get m for free, 2017.\n", "award": [], "sourceid": 3636, "authors": [{"given_name": "Stanislav", "family_name": "Fort", "institution": "Stanford University / Google Research"}, {"given_name": "Stanislaw", "family_name": "Jastrzebski", "institution": "New York University"}]}