{"title": "Visualizing the Loss Landscape of Neural Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 6389, "page_last": 6399, "abstract": "Neural network training relies on our ability to find \"good\" minimizers of highly non-convex loss functions. It is well known that certain network architecture designs (e.g., skip connections) produce loss functions that train easier, and well-chosen training parameters (batch size, learning rate, optimizer) produce minimizers that generalize better. However, the reasons for these differences, and their effect on the underlying loss landscape, is not well understood. In this paper, we explore the structure of neural loss functions, and the effect of loss landscapes on generalization, using a range of visualization methods. First, we introduce a simple \"filter normalization\" method that helps us visualize loss function curvature, and make meaningful side-by-side comparisons between loss functions. Then, using a variety of visualizations, we explore how network architecture affects the loss landscape, and how training parameters affect the shape of minimizers.", "full_text": "Visualizing the Loss Landscape of Neural Nets\n\nHao Li1, Zheng Xu1, Gavin Taylor2, Christoph Studer3, Tom Goldstein1\n\n1University of Maryland, College Park 2United States Naval Academy 3Cornell University\n\n{haoli,xuzh,tomg}@cs.umd.edu, taylor@usna.edu, studer@cornell.edu\n\nAbstract\n\nNeural network training relies on our ability to \ufb01nd \u201cgood\u201d minimizers of highly\nnon-convex loss functions. It is well-known that certain network architecture\ndesigns (e.g., skip connections) produce loss functions that train easier, and well-\nchosen training parameters (batch size, learning rate, optimizer) produce minimiz-\ners that generalize better. However, the reasons for these differences, and their\neffect on the underlying loss landscape, are not well understood. 
In this paper, we\nexplore the structure of neural loss functions, and the effect of loss landscapes on\ngeneralization, using a range of visualization methods. First, we introduce a simple\n\u201c\ufb01lter normalization\u201d method that helps us visualize loss function curvature and\nmake meaningful side-by-side comparisons between loss functions. Then, using\na variety of visualizations, we explore how network architecture affects the loss\nlandscape, and how training parameters affect the shape of minimizers.\n\n1\n\nIntroduction\n\nTraining neural networks requires minimizing a high-dimensional non-convex loss function \u2013 a\ntask that is hard in theory, but sometimes easy in practice. Despite the NP-hardness of training\ngeneral neural loss functions [3], simple gradient methods often \ufb01nd global minimizers (parameter\ncon\ufb01gurations with zero or near-zero training loss), even when data and labels are randomized before\ntraining [43]. However, this good behavior is not universal; the trainability of neural nets is highly\ndependent on network architecture design choices, the choice of optimizer, variable initialization, and\na variety of other considerations. Unfortunately, the effect of each of these choices on the structure of\nthe underlying loss surface is unclear. Because of the prohibitive cost of loss function evaluations\n(which requires looping over all the data points in the training set), studies in this \ufb01eld have remained\npredominantly theoretical.\n\n(a) without skip connections\n\n(b) with skip connections\n\nFigure 1: The loss surfaces of ResNet-56 with/without skip connections. The proposed \ufb01lter\nnormalization scheme is used to enable comparisons of sharpness/\ufb02atness between the two \ufb01gures.\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fVisualizations have the potential to help us answer several important questions about why neural\nnetworks work. 
In particular, why are we able to minimize highly non-convex neural loss functions?\nAnd why do the resulting minima generalize? To clarify these questions, we use high-resolution\nvisualizations to provide an empirical characterization of neural loss functions, and explore how\ndifferent network architecture choices affect the loss landscape. Furthermore, we explore how the\nnon-convex structure of neural loss functions relates to their trainability, and how the geometry\nof neural minimizers (i.e., their sharpness/\ufb02atness, and their surrounding landscape), affects their\ngeneralization properties.\nTo do this in a meaningful way, we propose a simple \u201c\ufb01lter normalization\u201d scheme that enables us to\ndo side-by-side comparisons of different minima found during training. We then use visualizations to\nexplore sharpness/\ufb02atness of minimizers found by different methods, as well as the effect of network\narchitecture choices (use of skip connections, number of \ufb01lters, network depth) on the loss landscape.\nOur goal is to understand how loss function geometry affects generalization in neural nets.\n\n1.1 Contributions\n\nWe study methods for producing meaningful loss function visualizations. 
Then, using these visualization methods, we explore how loss landscape geometry affects generalization error and trainability.\nMore specifically, we address the following issues:\n\n\u2022 We reveal faults in a number of visualization methods for loss functions, and show that simple visualization strategies fail to accurately capture the local geometry (sharpness or flatness) of loss function minimizers.\n\n\u2022 We present a simple visualization method based on \u201cfilter normalization.\u201d The sharpness of minimizers correlates well with generalization error when this normalization is used, even when making comparisons across disparate network architectures and training methods. This enables side-by-side comparisons of different minimizers1.\n\n\u2022 We observe that, when networks become sufficiently deep, neural loss landscapes quickly transition from being nearly convex to being highly chaotic. This transition from convex to chaotic behavior coincides with a dramatic rise in generalization error, and ultimately with a lack of trainability.\n\n\u2022 We observe that skip connections promote flat minimizers and prevent the transition to chaotic behavior, which helps explain why skip connections are necessary for training extremely deep networks.\n\n\u2022 We quantitatively measure non-convexity by calculating the smallest (most negative) eigenvalues of the Hessian around local minima, and visualizing the results as a heat map.\n\n\u2022 We study the visualization of SGD optimization trajectories (Appendix B). We explain the difficulties that arise when visualizing these trajectories, and show that optimization trajectories lie in an extremely low dimensional space. 
This low dimensionality can be explained by the presence of large, nearly convex regions in the loss landscape, such as those observed in our 2-dimensional visualizations.\n\n2 Theoretical Background\n\nNumerous theoretical studies have been done on our ability to optimize neural loss functions [6, 5]. Theoretical results usually make restrictive assumptions about the sample distributions, non-linearity of the architecture, or loss functions [17, 32, 41, 37, 10, 40]. For restricted network classes, such as those with a single hidden layer, globally optimal or near-optimal solutions can be found by common optimization methods [36, 27, 39]. For networks with specific structures, there likely exists a monotonically decreasing path from an initialization to a global minimum [33, 16]. Swirszcz et al. [38] show counterexamples that achieve \u201cbad\u201d local minima for toy problems.\nSeveral works have addressed the relationship between sharpness/flatness of local minima and their generalization ability. Hochreiter and Schmidhuber [19] defined \u201cflatness\u201d as the size of the connected region around the minimum where the training loss remains low. Keskar et al. [25] characterize flatness using eigenvalues of the Hessian, and propose \u03b5-sharpness as an approximation, which looks at the maximum loss in a neighborhood of a minimum. Dinh et al. [8], Neyshabur et al. [31] show that these quantitative measures of sharpness are not invariant to symmetries in the network, and are thus not sufficient to determine generalization ability. Chaudhari et al. [4] used local entropy as a measure of sharpness, which is invariant to the simple transformation in [8], but difficult to accurately compute.\n\n1Code and plots are available at https://github.com/tomgoldstein/loss-landscape\n
Dziugaite and Roy [9] connect sharpness to PAC-Bayes bounds for generalization.\n\n3 The Basics of Loss Function Visualization\nNeural networks are trained on a corpus of feature vectors (e.g., images) {x_i} and accompanying labels {y_i} by minimizing a loss of the form L(\u03b8) = (1/m) \u2211_{i=1}^{m} \u2113(x_i, y_i; \u03b8), where \u03b8 denotes the parameters (weights) of the neural network, the function \u2113(x_i, y_i; \u03b8) measures how well the neural network with parameters \u03b8 predicts the label of a data sample, and m is the number of data samples. Neural nets contain many parameters, and so their loss functions live in a very high-dimensional space. Unfortunately, visualizations are only possible using low-dimensional 1D (line) or 2D (surface) plots. Several methods exist for closing this dimensionality gap.\n\n1-Dimensional Linear Interpolation One simple and lightweight way to plot loss functions is to choose two parameter vectors \u03b8 and \u03b8\u2032, and plot the values of the loss function along the line connecting these two points. We can parameterize this line by choosing a scalar parameter \u03b1, and defining the weighted average \u03b8(\u03b1) = (1 \u2212 \u03b1)\u03b8 + \u03b1\u03b8\u2032. Finally, we plot the function f(\u03b1) = L(\u03b8(\u03b1)). This strategy was taken by Goodfellow et al. [14], who studied the loss surface along the line between a random initial guess and a nearby minimizer obtained by stochastic gradient descent. This method has been widely used to study the \u201csharpness\u201d and \u201cflatness\u201d of different minima, and the dependence of sharpness on batch size [25, 8]. Smith and Topin [35] use the same technique to show different minima and the \u201cpeaks\u201d between them, while Im et al. [22] plot the line between minima obtained via different optimizers.\nThe 1D linear interpolation method suffers from several weaknesses. First, it is difficult to visualize non-convexities using 1D plots. 
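As a concrete illustration of the 1D interpolation curve f(α) = L(θ(α)) described above, here is a minimal sketch. It assumes the loss is exposed as a function of a flat parameter vector; the names `loss_fn`, `theta_a`, and `theta_b` are illustrative, not taken from the paper's released code.

```python
import numpy as np

def interpolate_loss(loss_fn, theta_a, theta_b, num_points=25):
    """Evaluate f(alpha) = L((1 - alpha) * theta_a + alpha * theta_b)
    at evenly spaced alpha values in [0, 1]."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn((1.0 - a) * theta_a + a * theta_b)
                       for a in alphas])
    return alphas, losses

# Toy example: a quadratic "loss" evaluated between two points placed
# symmetrically around the minimizer at the origin, so the 1D curve
# dips to zero at alpha = 0.5.
loss = lambda theta: float(np.sum(theta ** 2))
alphas, losses = interpolate_loss(loss, np.array([-1.0, 0.0]),
                                  np.array([1.0, 0.0]))
```

In practice θ and θ′ would be, e.g., a random initialization and an SGD solution, and each `loss_fn` call is a full pass over the training set, which is what makes these plots expensive.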
Indeed, Goodfellow et al. [14] found that loss functions appear to lack local minima along the minimization trajectory. We will see later, using 2D methods, that some loss functions have extreme non-convexities, and that these non-convexities correlate with the difference in generalization between different network architectures. Second, this method does not consider batch normalization [23] or invariance symmetries in the network. For this reason, the visual sharpness comparisons produced by 1D interpolation plots may be misleading; this issue will be explored in depth in Section 5.\n\nContour Plots & Random Directions To use this approach, one chooses a center point \u03b8* in the graph, and chooses two direction vectors, \u03b4 and \u03b7. One then plots a function of the form f(\u03b1) = L(\u03b8* + \u03b1\u03b4) in the 1D (line) case, or\n\nf(\u03b1, \u03b2) = L(\u03b8* + \u03b1\u03b4 + \u03b2\u03b7)   (1)\n\nin the 2D (surface) case2. This approach was used in [14] to explore the trajectories of different minimization methods. It was also used in [22] to show that different optimization algorithms find different local minima within the 2D projected space. Because of the computational burden of 2D plotting, these methods generally result in low-resolution plots of small regions that have not captured the complex non-convexity of loss surfaces. Below, we use high-resolution visualizations over large slices of weight space to visualize how network design affects non-convex structure.\n\n4 Proposed Visualization: Filter-Wise Normalization\n\nThis study relies heavily on plots of the form (1) produced using random direction vectors \u03b4 and \u03b7, each sampled from a random Gaussian distribution with appropriate scaling (described below). 
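A 2D surface of the form (1) can be sketched in the same spirit. Again this assumes a loss exposed as a function of a flat parameter vector; the names below are illustrative.

```python
import numpy as np

def loss_surface_2d(loss_fn, theta_star, delta, eta, alphas, betas):
    """Evaluate f(alpha, beta) = L(theta* + alpha*delta + beta*eta) on a
    grid, as in equation (1); theta_star is the center point and delta,
    eta are two direction vectors of the same shape."""
    surface = np.empty((len(alphas), len(betas)))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            surface[i, j] = loss_fn(theta_star + a * delta + b * eta)
    return surface

rng = np.random.default_rng(0)
theta_star = np.zeros(10)                     # stand-in for a minimizer
delta, eta = rng.normal(size=10), rng.normal(size=10)
grid = np.linspace(-1.0, 1.0, 11)
loss = lambda theta: float(np.sum(theta ** 2))
surface = loss_surface_2d(loss, theta_star, delta, eta, grid, grid)
```

Each grid point costs one loss evaluation over the training data, so an n × n contour plot is n² times the cost of a single evaluation, which is why high-resolution 2D plots are expensive.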
While the \u201crandom directions\u201d approach to plotting is simple, it fails to capture the intrinsic geometry of loss surfaces, and cannot be used to compare the geometry of two different minimizers or two different networks. This is because of the scale invariance in network weights. When ReLU non-linearities are used, the network remains unchanged if we (for example) multiply the weights in one layer of a network by 10, and divide the next layer by 10. This invariance is even more prominent when batch normalization is used. In this case, the size (i.e., norm) of a filter is irrelevant because the output of each layer is re-scaled during batch normalization. For this reason, a network\u2019s behavior remains unchanged if we re-scale the weights. Note that this scale invariance applies only to rectified networks.\n\n2When making 2D plots in this paper, batch normalization parameters are held constant, i.e., random directions are not applied to batch normalization parameters.\n\nScale invariance prevents us from making meaningful comparisons between plots, unless special precautions are taken. A neural network with large weights may appear to have a smooth and slowly varying loss function; perturbing the weights by one unit will have very little effect on network performance if the weights live on a scale much larger than one. However, if the weights are much smaller than one, then that same unit perturbation may have a catastrophic effect, making the loss function appear quite sensitive to weight perturbations. Keep in mind that neural nets are scale invariant; if the small-parameter and large-parameter networks in this example are equivalent (because one is simply a rescaling of the other), then any apparent differences in the loss function are merely an artifact of scale invariance. This scale invariance was exploited by Dinh et al. 
[8] to build pairs of equivalent networks that have different apparent sharpness.\nTo remove this scaling effect, we plot loss functions using filter-wise normalized directions. To obtain such directions for a network with parameters \u03b8, we begin by producing a random Gaussian direction vector d with dimensions compatible with \u03b8. Then, we normalize each filter in d to have the same norm as the corresponding filter in \u03b8. In other words, we make the replacement d_{i,j} \u2190 (d_{i,j} / ||d_{i,j}||) ||\u03b8_{i,j}||, where d_{i,j} represents the jth filter (not the jth weight) of the ith layer of d, and ||\u00b7|| denotes the Frobenius norm. Note that the filter-wise normalization is different from that of [22], which normalizes the direction without considering the norm of individual filters. Note that filter normalization is not limited to convolutional (Conv) layers but also applies to fully connected (FC) layers. An FC layer is equivalent to a Conv layer with a 1 \u00d7 1 output feature map, and the filter corresponds to the weights that generate one neuron.\nDo contour plots of the form (1) capture the natural distance scale of loss surfaces when the directions \u03b4 and \u03b7 are filter normalized? We answer this question in the affirmative in Section 5 by showing that the sharpness of filter-normalized plots correlates well with generalization error, while plots without filter normalization can be very misleading. In Appendix A.2, we also compare filter-wise normalization to layer-wise normalization (and no normalization), and show that filter normalization produces superior correlation between sharpness and generalization error.\n\n5 The Sharp vs Flat Dilemma\n\nSection 4 introduces the concept of filter normalization, and provides an intuitive justification for its use. In this section, we address the issue of whether sharp minimizers generalize better than flat minimizers. 
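Before proceeding, the filter-wise rescaling of Section 4 can be sketched in a few lines. This is an illustration, not the paper's released implementation; it assumes weights are stored as a list of per-layer arrays whose first axis indexes filters, and the small constant guarding against zero-norm filters is an illustrative choice.

```python
import numpy as np

def filter_normalize(direction, theta):
    """Rescale each filter of a random direction so it has the same
    Frobenius norm as the corresponding filter of theta:
    d_ij <- d_ij / ||d_ij|| * ||theta_ij||."""
    normalized = []
    for d_layer, t_layer in zip(direction, theta):
        d = d_layer.copy()
        for j in range(d.shape[0]):  # first axis indexes filters
            d[j] *= np.linalg.norm(t_layer[j]) / (np.linalg.norm(d[j]) + 1e-10)
        normalized.append(d)
    return normalized

rng = np.random.default_rng(0)
theta = [rng.normal(size=(4, 3, 3, 3))]      # one conv layer with 4 filters
direction = [rng.normal(size=(4, 3, 3, 3))]  # random Gaussian direction
normalized = filter_normalize(direction, theta)
```

After normalization, a unit step along the direction perturbs each filter by an amount proportional to that filter's own magnitude, which is what makes sharpness comparable across networks with different weight scales.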
In doing so, we will see that the sharpness of minimizers correlates well with generalization error when filter normalization is used. This enables side-by-side comparisons between plots. In contrast, the sharpness of non-normalized plots may appear distorted and unpredictable.\nIt is widely thought that small-batch SGD produces \u201cflat\u201d minimizers that generalize well, while large batches produce \u201csharp\u201d minima with poor generalization [4, 25, 19]. This claim is disputed though, with Dinh et al. [8], Kawaguchi et al. [24] arguing that generalization is not directly related to the curvature of loss surfaces, and some authors proposing specialized training methods that achieve good performance with large batch sizes [20, 15, 7]. Here, we explore the difference between sharp and flat minimizers. We begin by discussing difficulties that arise when performing such a visualization, and how proper normalization can prevent such plots from producing distorted results.\nWe train a CIFAR-10 classifier using a 9-layer VGG network [34] with batch normalization for a fixed number of epochs. We use two batch sizes: a large batch size of 8192 (16.4% of the training data of CIFAR-10), and a small batch size of 128. Let \u03b8_s and \u03b8_l indicate the solutions obtained by running SGD using small and large batch sizes, respectively3. Using the linear interpolation approach\n\n3In this section, we consider the \u201crunning mean\u201d and \u201crunning variance\u201d as trainable parameters and include them in \u03b8. Note that the original study by Goodfellow et al. [14] does not consider batch normalization. 
These parameters are not included in \u03b8 in future sections, as they are only needed when interpolating between two minimizers.\n\n[Figure 2 panels: (a) test errors 7.37% / 11.07%; (b) ||\u03b8||_2, WD=0; (c) WD=0; (d) test errors 6.0% / 10.19%; (e) ||\u03b8||_2, WD=5e-4; (f) WD=5e-4]\n\nFigure 2: (a) and (d) are the 1D linear interpolation of VGG-9 solutions obtained by small-batch and large-batch training methods. The blue lines are loss values and the red lines are accuracies. The solid lines are training curves and the dashed lines are for testing. Small batch is at abscissa 0, and large batch is at abscissa 1. The corresponding test errors are shown below each panel. (b) and (e) show the change of the weight norm ||\u03b8||_2 during training. When weight decay is disabled, the weight norm grows steadily during training without constraints. (c) and (f) are the weight histograms, which verify that small-batch methods produce more large weights with zero weight decay and more small weights with non-zero weight decay.\n\n[14], we plot the loss values on both training and testing data sets of CIFAR-10, along a direction containing the two solutions, i.e., f(\u03b1) = L(\u03b8_s + \u03b1(\u03b8_l \u2212 \u03b8_s)).\nFigure 2(a) shows linear interpolation plots with \u03b8_s at x-axis location 0, and \u03b8_l at location 1. As observed by [25], we can clearly see that the small-batch solution is quite wide, while the large-batch solution is sharp. However, this sharpness balance can be flipped simply by turning on weight decay [26]. Figure 2(d) shows results of the same experiment, except this time with a non-zero weight decay parameter. This time, the large batch minimizer is considerably flatter than the sharp small batch minimizer. However, we see that small batches generalize better in all experiments; there is no apparent correlation between sharpness and generalization. 
We will see that these sharpness\ncomparisons are extremely misleading, and fail to capture the endogenous properties of the minima.\nThe apparent differences in sharpness can be explained by examining the weights of each minimizer.\nHistograms of the network weights are shown for each experiment in Figure 2(c) and (f). We see\nthat, when a large batch is used with zero weight decay, the resulting weights tend to be smaller\nthan in the small batch case. We reverse this effect by adding weight decay; in this case the large\nbatch minimizer has much larger weights than the small batch minimizer. This difference in scale\noccurs for a simple reason: A smaller batch size results in more weight updates per epoch than a\nlarge batch size, and so the shrinking effect of weight decay (which imposes a penalty on the norm of\nthe weights) is more pronounced. The evolution of the weight norms during training is depicted in\nFigure 2(b) and (e). Figure 2 is not visualizing the endogenous sharpness of minimizers, but rather\njust the (irrelevant) weight scaling. The scaling of weights in these networks is irrelevant because\nbatch normalization re-scales the outputs to have unit variance. However, small weights still appear\nmore sensitive to perturbations, and produce sharper looking minimizers.\n\nFilter Normalized Plots We repeat the experiment in Figure 2, but this time we plot the loss\nfunction near each minimizer separately using random \ufb01lter-normalized directions. This removes the\napparent differences in geometry caused by the scaling depicted in Figure 2(c) and (f). The results,\npresented in Figure 3, still show differences in sharpness between small batch and large batch minima,\nhowever these differences are much more subtle than it would appear in the un-normalized plots. 
For comparison, sample un-normalized plots and layer-normalized plots are shown in Section A.2 of the Appendix.\n\n[Figure 3 panels (weight decay, batch size, test error): (a) 0.0, 128, 7.37%; (b) 0.0, 8192, 11.07%; (c) 5e-4, 128, 6.00%; (d) 5e-4, 8192, 10.19%; (e) 0.0, 128, 7.37%; (f) 0.0, 8192, 11.07%; (g) 5e-4, 128, 6.00%; (h) 5e-4, 8192, 10.19%]\n\nFigure 3: The 1D and 2D visualization of solutions obtained using SGD with different weight decay and batch size. The title of each subfigure contains the weight decay, batch size, and test error.\n\nWe also visualize these results using two random directions and contour plots. The weights obtained with small batch size and non-zero weight decay have wider contours than the sharper large batch minimizers. Results for ResNet-56 appear in Figure 13 of the Appendix. Using the filter-normalized plots in Figure 3, we can make side-by-side comparisons between minimizers, and we see that now sharpness correlates well with generalization error. Large batches produced visually sharper minima (although not dramatically so) with higher test error.\n\n6 What Makes Neural Networks Trainable? Insights on the (Non)Convexity Structure of Loss Surfaces\n\nOur ability to find global minimizers to neural loss functions is not universal; it seems that some neural architectures are easier to minimize than others. For example, using skip connections, the authors of [18] trained extremely deep architectures, while comparable architectures without skip connections are not trainable. Furthermore, our ability to train seems to depend strongly on the initial parameters from which training starts. Using visualization methods, we do an empirical study of neural architectures to explore why the non-convexity of loss functions seems to be problematic in some situations, but not in others. We aim to provide insight into the following questions: Do loss functions have significant non-convexity at all? 
If prominent non-convexities exist, why are they not problematic in all situations? Why are some architectures easy to train, and why are results so sensitive to the initialization? We will see that different architectures have extreme differences in non-convexity structure that answer these questions, and that these differences correlate with generalization error.\n\n[Figure 4 panels: (a) ResNet-110, no skip connections; (b) DenseNet, 121 layers]\nFigure 4: The loss surfaces of ResNet-110-noshort and DenseNet for CIFAR-10.\n\n[Figure 5 panels: (a) ResNet-20, 7.37%; (b) ResNet-56, 5.89%; (c) ResNet-110, 5.79%; (d) ResNet-20-NS, 8.18%; (e) ResNet-56-NS, 13.31%; (f) ResNet-110-NS, 16.44%]\nFigure 5: 2D visualization of the loss surface of ResNet and ResNet-noshort with different depth.\n\nExperimental Setup To understand the effects of network architecture on non-convexity, we trained a number of networks, and plotted the landscape around the obtained minimizers using the filter-normalized random direction method described in Section 4. We consider three classes of neural networks: 1) ResNets [18] that are optimized for performance on CIFAR-10. We consider ResNet-20/56/110, where each name indicates the number of layers. 2) \u201cVGG-like\u201d networks that do not contain shortcut/skip connections. We produced these networks simply by removing the shortcut connections from ResNets. We call these networks ResNet-20/56/110-noshort. 3) \u201cWide\u201d ResNets that have more filters per layer than the CIFAR-10 optimized networks. All models are trained on the CIFAR-10 dataset using SGD with Nesterov momentum, batch size 128, and 0.0005 weight decay for 300 epochs. The learning rate was initialized at 0.1, and decreased by a factor of 10 at epochs 150, 225, and 275. Deeper experimental VGG-like networks (e.g., ResNet-56-noshort, as described below) required a smaller initial learning rate of 0.01. 
High resolution 2D plots of the minimizers for different neural networks are shown in Figure 5 and Figure 6. Results are shown as contour plots rather than surface plots because this makes it extremely easy to see non-convex structures and evaluate sharpness. For surface plots of ResNet-56, see Figure 1. Note that the center of each plot corresponds to the minimizer, and the two axes parameterize two random directions with filter-wise normalization as in (1). We make several observations below about how architecture affects the loss landscape.\n\nThe Effect of Network Depth From Figure 5, we see that network depth has a dramatic effect on the loss surfaces of neural networks when skip connections are not used. The network ResNet-20-noshort has a fairly benign landscape dominated by a region with convex contours in the center, and no dramatic non-convexity. This isn\u2019t too surprising: the original VGG networks for ImageNet had 19 layers and could be trained effectively [34]. However, as network depth increases, the loss surface of the VGG-like nets spontaneously transitions from (nearly) convex to chaotic. ResNet-56-noshort has dramatic non-convexities and large regions where the gradient directions (which are normal to the contours depicted in the plots) do not point towards the minimizer at the center. Also, the loss function becomes extremely large as we move in some directions. ResNet-110-noshort displays even more dramatic non-convexities, and becomes extremely steep as we move in all directions shown in the plot. Furthermore, note that the minimizers at the center of the deep VGG-like nets seem to be fairly sharp. In the case of ResNet-56-noshort, the minimizer is also fairly ill-conditioned, as the contours near the minimizer have significant eccentricity.\n\nShortcut Connections to the Rescue Shortcut connections have a dramatic effect on the geometry of the loss functions. 
In Figure 5, we see that residual connections prevent the transition to chaotic behavior as depth increases. In fact, the width and shape of the 0.1-level contour is almost identical for the 20- and 110-layer networks. Interestingly, the effect of skip connections seems to be most important for deep networks. For the more shallow networks (ResNet-20 and ResNet-20-noshort), the effect of skip connections is fairly unnoticeable. However, residual connections prevent the explosion of non-convexity that occurs when networks get deep. This effect seems to apply to other kinds of skip connections as well; Figure 4 shows the loss landscape of DenseNet [21], which shows no noticeable non-convexity.\n\n[Figure 6 panels: (a) k = 1, 5.89%; (b) k = 2, 5.07%; (c) k = 4, 4.34%; (d) k = 8, 3.93%; (e) k = 1, 13.31%; (f) k = 2, 10.26%; (g) k = 4, 9.69%; (h) k = 8, 8.70%]\nFigure 6: Wide-ResNet-56 on CIFAR-10 both with shortcut connections (top) and without (bottom). The label k = 2 means twice as many filters per layer. Test error is reported below each figure.\n\nWide Models vs Thin Models To see the effect of the number of conv filters per layer, we compare the narrow CIFAR-optimized ResNets (ResNet-56) with Wide-ResNets [42] by multiplying the number of filters per layer by k = 2, 4, and 8. From Figure 6, we see that wider models have loss landscapes with less chaotic behavior. Increased network width resulted in flat minima and wide regions of apparent convexity. We see that increased width prevents chaotic behavior, and skip connections dramatically widen minimizers. 
Finally, note that sharpness correlates extremely well with test error.\n\nImplications for Network Initialization One interesting property seen in Figure 5 is that loss landscapes for all the networks considered seem to be partitioned into a well-defined region of low loss value and convex contours, surrounded by a well-defined region of high loss value and non-convex contours. This partitioning of chaotic and convex regions may explain the importance of good initialization strategies, and also the easy training behavior of \u201cgood\u201d architectures. When using normalized random initialization strategies such as those proposed by Glorot and Bengio [12], typical neural networks attain an initial loss value less than 2.5. The well-behaved loss landscapes in Figure 5 (ResNets, and shallow VGG-like nets) are dominated by large, flat, nearly convex attractors that rise to a loss value of 4 or greater. For such landscapes, a random initialization will likely lie in the \u201cwell-behaved\u201d loss region, and the optimization algorithm might never \u201csee\u201d the pathological non-convexities that occur on the high-loss chaotic plateaus. Chaotic loss landscapes (ResNet-56/110-noshort) have shallower regions of convexity that rise to lower loss values. For sufficiently deep networks with shallow enough attractors, the initial iterate will likely lie in the chaotic region where the gradients are uninformative. In this case, the gradients \u201cshatter\u201d [1], and training is impossible. SGD was unable to train a 156-layer network without skip connections (even with very low learning rates), which adds weight to this hypothesis.
First, note that visually flatter minimizers consistently correspond to lower test error, which further strengthens our assertion that filter normalization is a natural way to visualize loss function geometry. Second, we notice that chaotic landscapes (deep networks without skip connections) result in worse training and test error, while more convex landscapes have lower error values. In fact, the most convex landscapes (Wide-ResNets in the top row of Figure 6) generalize the best of all, and show no noticeable chaotic behavior.\n\nAre we really seeing convexity? We are viewing the loss surface under a dramatic dimensionality reduction, and we need to be careful interpreting these plots. For this reason, we quantify the level of convexity in loss functions by computing the principal curvatures, which are simply eigenvalues of the Hessian. A truly convex function has no negative curvatures (the Hessian is positive semi-definite), while a non-convex function has negative curvatures. It can be shown that the principal curvatures of a dimensionality reduced plot (with random Gaussian directions) are weighted averages of the principal curvatures of the full-dimensional surface (the weights are Chi-square random variables). This has several consequences. First of all, if non-convexity is present in the dimensionality reduced plot, then non-convexity must be present in the full-dimensional surface as well. However, apparent convexity in the low-dimensional surface does not mean the high-dimensional function is truly convex. Rather, it means that the positive curvatures are dominant (more formally, the mean curvature, or average eigenvalue, is positive).\nWhile this analysis is reassuring, one may still wonder if there is significant \u201chidden\u201d non-convexity that these visualizations fail to capture. 
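Such extreme-curvature estimates can be obtained without ever forming the Hessian explicitly, using only Hessian-vector products. Below is a minimal stand-alone sketch on a toy diagonal Hessian; the function names and the toy matrix are illustrative, and plain power iteration is used here as a simpler stand-in for the implicitly restarted Lanczos method the paper actually uses.

```python
import numpy as np

def extreme_eigenvalues(hvp, dim, iters=500, seed=0):
    """Estimate the largest and smallest Hessian eigenvalues using only
    Hessian-vector products (hvp), via power iteration plus a spectral
    shift. (A simpler stand-in for implicitly restarted Lanczos.)"""
    rng = np.random.default_rng(seed)

    def dominant(mv):
        # Power iteration: converges to the largest-magnitude eigenvalue.
        v = rng.normal(size=dim)
        v /= np.linalg.norm(v)
        for _ in range(iters):
            w = mv(v)
            v = w / np.linalg.norm(w)
        return float(v @ mv(v))  # Rayleigh quotient

    lam_1 = dominant(hvp)
    # Power iteration on H - lam_1 * I finds the eigenvalue farthest
    # from lam_1, i.e., the opposite end of the spectrum.
    lam_2 = dominant(lambda v: hvp(v) - lam_1 * v) + lam_1
    return max(lam_1, lam_2), min(lam_1, lam_2)

# Toy Hessian with eigenvalues {3, 1, -2}: negative curvature present,
# so the corresponding "loss" is non-convex at this point.
H = np.diag([3.0, 1.0, -2.0])
lam_max, lam_min = extreme_eigenvalues(lambda v: H @ v, dim=3)
ratio = abs(lam_min / lam_max)  # the kind of quantity mapped in Figure 7
```

For a real network, `hvp` would be implemented with automatic differentiation (differentiating the gradient-vector inner product), so each product costs roughly two backward passes and the Hessian is never materialized.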
To answer this question, we calculate the minimum and maximum eigenvalues of the Hessian, λ_min and λ_max.4 Figure 7 maps the ratio |λ_min/λ_max| across the loss surfaces studied above (using the same minimizer and the same random directions). Blue color indicates a more convex region (near-zero negative eigenvalues relative to the positive eigenvalues), while yellow indicates significant levels of negative curvature. We see that the convex-looking regions in our surface plots do indeed correspond to regions with insignificant negative eigenvalues (i.e., there are not major non-convex features that the plot missed), while chaotic regions contain large negative curvatures. For convex-looking surfaces like DenseNet, the negative eigenvalues remain extremely small (less than 1% the size of the positive curvatures) over a large region of the plot.

Figure 7: For each point in the filter-normalized surface plots, we calculate the maximum and minimum eigenvalue of the Hessian, and map the ratio of these two. (a) ResNet-56, (b) ResNet-56-noshort, (c) DenseNet-121.

Visualizing the trajectories of SGD and Adam We provide a study of visualization methods for the trajectories of optimizers in Appendix B.

7 Conclusion

We presented a visualization technique that provides insights into the consequences of a variety of choices facing the neural network practitioner, including network architecture, optimizer selection, and batch size. Neural networks have advanced dramatically in recent years, largely on the back of anecdotal knowledge and theoretical results with complex assumptions. For progress to continue to be made, a more general understanding of the structure of neural networks is needed.
Our hope is that effective visualization, when coupled with continued advances in theory, can result in faster training, simpler models, and better generalization.

Acknowledgements

Li, Xu, and Goldstein were supported by the Office of Naval Research (N00014-17-1-2078), DARPA Lifelong Learning Machines (FA8650-18-2-7833), DARPA YFA (D18AP00055), and the Sloan Foundation. Taylor was supported by ONR (N0001418WX01582), and the DOD HPC Modernization Program. Studer was supported in part by Xilinx, Inc. and by the NSF under grants ECCS-1408006, CCF-1535897, CCF-1652065, CNS-1717559, and ECCS-1824379.

4We compute these using an implicitly restarted Lanczos method that requires only Hessian-vector products (which are calculated directly using automatic differentiation), and does not require an explicit representation of the Hessian or its factorization.

References

[1] David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? In ICML, 2017.

[2] Andrew J Ballard, Ritankar Das, Stefano Martiniani, Dhagash Mehta, Levent Sagun, Jacob D Stevenson, and David J Wales. Energy landscapes for machine learning. Physical Chemistry Chemical Physics, 19(20):12585–12603, 2017.

[3] Avrim Blum and Ronald L Rivest. Training a 3-node neural network is NP-complete. In NIPS, 1989.

[4] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR, 2017.

[5] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.

[6] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.
In NIPS, 2014.

[7] Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In AISTATS, 2017.

[8] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In ICML, 2017.

[9] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In UAI, 2017.

[10] C Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. In ICLR, 2017.

[11] Marcus Gallagher and Tom Downs. Visualization of learning in multilayer perceptron networks using principal component analysis. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 33(1):28–34, 2003.

[12] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.

[13] Tom Goldstein and Christoph Studer. PhaseMax: Convex phase retrieval via basis pursuit. arXiv preprint arXiv:1610.07531, 2016.

[14] Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. In ICLR, 2015.

[15] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

[16] Benjamin D Haeffele and René Vidal. Global optimality in neural network training. In CVPR, 2017.

[17] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. In ICLR, 2017.

[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[19] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

[20] Elad Hoffer, Itay Hubara, and Daniel Soudry.
Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. In NIPS, 2017.

[21] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In CVPR, 2017.

[22] Daniel Jiwoong Im, Michael Tao, and Kristin Branson. An empirical analysis of deep network loss surfaces. arXiv preprint arXiv:1612.04010, 2016.

[23] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[24] Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.

[25] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.

[26] Anders Krogh and John A Hertz. A simple weight decay can improve generalization. In NIPS, 1992.

[27] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. arXiv preprint arXiv:1705.09886, 2017.

[28] Qianli Liao and Tomaso Poggio. Theory of deep learning II: Landscape of the empirical risk in deep learning. arXiv preprint arXiv:1703.09833, 2017.

[29] Zachary C Lipton. Stuck in a what? Adventures in weight space. In ICLR Workshop, 2016.

[30] Eliana Lorch. Visualizing deep network training trajectories with PCA. In ICML Workshop on Visualization for Deep Learning, 2016.

[31] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In NIPS, 2017.

[32] Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In ICML, 2017.

[33] Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks.
In ICML, 2016.

[34] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[35] Leslie N Smith and Nicholay Topin. Exploring loss function topology with cyclical learning rates. arXiv preprint arXiv:1702.04283, 2017.

[36] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.

[37] Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.

[38] Grzegorz Swirszcz, Wojciech Marian Czarnecki, and Razvan Pascanu. Local minima in training of deep networks. arXiv preprint arXiv:1611.06310, 2016.

[39] Yuandong Tian. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In ICML, 2017.

[40] Bo Xie, Yingyu Liang, and Le Song. Diverse neural network learns true target functions. In AISTATS, 2017.

[41] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Global optimality conditions for deep neural networks. In ICLR, 2017.

[42] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.

[43] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization.
In ICLR, 2017.