{"title": "Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 14601, "page_last": 14610, "abstract": "Mode connectivity is a surprising phenomenon in the loss landscape of deep nets. Optima---at least those discovered by gradient-based optimization---turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piece-wise linear, with as few as two segments. \n\nWe give mathematical explanations for this phenomenon, assuming generic properties (such as dropout stability and noise stability) of well-trained deep nets, which have previously been identified as part of understanding the generalization properties of deep nets. Our explanation holds for realistic multilayer nets, and experiments are presented to verify the theory.", "full_text": "Explaining Landscape Connectivity of Low-cost\n\nSolutions for Multilayer Nets\n\nRohith Kuditipudi\nDuke University\n\nXiang Wang\nDuke University\n\nrohith.kuditipudi@duke.edu\n\nxwang@cs.duke.edu\n\nHolden Lee\n\nPrinceton University\n\nholdenl@princeton.edu\n\nYi Zhang\n\nPrinceton University\n\ny.zhang@cs.princeton.edu\n\nzhiyuanli@cs.princeton.edu\n\nWei Hu\n\nPrinceton University\n\nhuwei@cs.princeton.edu\n\nPrinceton University and Institute for Advanced Study\n\narora@cs.princeton.edu\n\nZhiyuan Li\n\nPrinceton University\n\nSanjeev Arora\n\nRong Ge\n\nDuke University\n\nrongge@cs.duke.edu\n\nAbstract\n\nMode connectivity (Garipov et al., 2018; Draxler et al., 2018) is a surprising\nphenomenon in the loss landscape of deep nets. Optima\u2014at least those discovered\nby gradient-based optimization\u2014turn out to be connected by simple paths on\nwhich the loss function is almost constant. 
Often, these paths can be chosen to be\npiece-wise linear, with as few as two segments.\nWe give mathematical explanations for this phenomenon, assuming generic prop-\nerties (such as dropout stability and noise stability) of well-trained deep nets,\nwhich have previously been identi\ufb01ed as part of understanding the generalization\nproperties of deep nets. Our explanation holds for realistic multilayer nets, and\nexperiments are presented to verify the theory.\n\n1\n\nIntroduction\n\nEfforts to understand how and why deep learning works have led to a focus on the optimization\nlandscape of the training loss. Since optimization to near-zero training loss occurs for many choices\nof random initialization, it is clear that the landscape contains many global optima (or near-optima).\nHowever, the loss can become quite high when interpolating between found optima, suggesting that\nthese optima occur at the bottom of \u201cvalleys\u201d surrounded on all sides by high walls. Therefore the\nphenomenon of mode connectivity (Garipov et al., 2018; Draxler et al., 2018) came as a surprise:\noptima (at least the ones discovered by gradient-based optimization) are connected by simple paths in\nthe parameter space, on which the loss function is almost constant. In other words, the optima are not\nwalled off in separate valleys as hitherto believed. More surprisingly, the paths connecting discovered\noptima can be piece-wise linear with as few as two segments.\nMode connectivity begs for theoretical explanation. One paper (Freeman and Bruna, 2016) attempted\nsuch an explanation for 2-layer nets, even before the discovery of the phenomenon in multilayer nets.\nHowever, they require the width of the net to be exponential in some relevant parameters. Others\n(Venturi et al., 2018; Liang et al., 2018; Nguyen et al., 2018; Nguyen, 2019) require special structure\nin their networks where the number of neurons needs to be greater than the number of training data\npoints. 
Thus it remains an open problem to explain mode connectivity even in the 2-layer case with realistic parameter settings, let alone for standard multilayer architectures.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

At first sight, finding a mathematical explanation of the mode connectivity phenomenon for multilayer nets (e.g., for a 50-layer ResNet on ImageNet) appears very challenging. However, the glimmer of hope is that since the phenomenon exists for a variety of architectures and datasets, it must arise from some generic property of trained nets. The fact that the connecting paths between optima can have as few as two linear segments further bolsters this hope.

Strictly speaking, empirical findings such as in (Garipov et al., 2018; Draxler et al., 2018) do not show connectivity between all optima, but only for typical optima discovered by gradient-based optimization. It remains an open question whether connectivity holds for all optima in overparametrized nets. Section 5 answers this question in the negative, via a simple example of an overparametrized two-layer net, not all of whose optima are connected via low-cost paths.

Thus to explain mode connectivity one must seek generic properties that hold for optima obtained via gradient-based optimization on realistic data. A body of work that could be a potential source of such generic properties is the ongoing effort to understand the generalization puzzle of over-parametrized nets, specifically, to understand the "true model capacity". For example, Morcos et al. (2018) note that networks that generalize are insensitive to linear restrictions in the parameter space. Arora et al. (2018) define a noise stability property of deep nets, whereby adding Gaussian noise to the output of a layer is found to have minimal effect on the vector computed at subsequent layers. Such properties seem to arise in a variety of architectures purely from gradient-based optimization, without any explicit noise-injection during training, though of course using small-batch gradient estimates is an implicit source of noise-injection. (Sometimes training also explicitly injects noise, e.g. dropout or batch-normalization, but that is not needed for noise stability to emerge.)

Since resilience to perturbations arises in a variety of architectures, such resilience counts as a "generic" property, and it is natural to prove mode connectivity as a consequence. We carry this out in the current paper. Note that our goal here is not to explain every known detail of mode connectivity, but rather to give a plausible first-cut explanation.

First, in Section 3 we explain mode connectivity by assuming the network is trained via dropout. In fact, the required property is weaker: so long as there exists even a single dropout pattern that keeps the training loss close to optimal on the two solutions, our proof constructs a piece-wise linear path between them. The number of linear segments grows linearly with the depth of the net.

Then, in Section 4 we make a stronger assumption of noise stability along the lines of Arora et al. (2018) and show that it implies mode connectivity using paths with 10 linear segments. While this assumption is strong, it appears to be close to what is satisfied in practice. (Of course, one could explicitly train deep nets to satisfy the needed noise stability assumption, and the theory then applies directly to them.)

1.1 Related work

The landscape of the loss function for training neural networks has received a lot of attention. Dauphin et al. (2014) and Choromanska et al. (2015) conjectured that local minima of multi-layer neural networks have similar loss function values, and proved the result in idealized settings.
For linear networks, it is known (Kawaguchi, 2016) that all local minima are also globally optimal.

Several theoretical works have explored whether a neural network has spurious valleys (non-global minima that are surrounded by other points with higher loss). Freeman and Bruna (2016) showed that for a two-layer net, if it is sufficiently overparametrized then all the local minimizers are (approximately) connected. However, in order to guarantee a small loss along the path they need the number of neurons to be exponential in the number of input dimensions. Venturi et al. (2018) proved that if the number of neurons is larger than either the number of training samples or the intrinsic dimension (infinite for standard architectures), then the neural network cannot have spurious valleys. Liang et al. (2018) proved similar results for the binary classification setting. Nguyen et al. (2018) and Nguyen (2019) relaxed the requirement on overparametrization, but still require the output layer to have more direct connections than the number of training samples.

Some other papers have studied the existence of spurious local minima. Yun et al. (2018) showed that in most cases neural networks have spurious local minima. Note that a local minimum need only have loss no larger than the points in its neighborhood, so a local minimum is not necessarily a spurious valley. Safran and Shamir (2018) found spurious local minima for simple two-layer neural networks under a Gaussian input distribution. These spurious local minima are indeed spurious valleys as they have positive definite Hessian.

2 Preliminaries

Notations. For a vector v, we use ‖v‖ to denote its ℓ2 norm. For a matrix A, we use ‖A‖ to denote its operator norm and ‖A‖_F to denote its Frobenius norm. We use [n] to denote the set {1, 2, ..., n}, and I_n to denote the identity matrix in R^{n×n}. We use O(·), Ω(·) to hide constants and Õ(·), Ω̃(·) to hide poly-logarithmic factors.

Neural network. In most of the paper, we consider fully connected neural networks with ReLU activations; we write φ for the ReLU function applied entry-wise. Note however that our results can also be extended to convolutional neural networks (in particular, see Remark 1 and the experiments in Section 6).

Suppose the network has d layers. Let the vector before activation at layer i be x_i, i ∈ [d], where x_d is just the output. For convenience, we also denote the input x as x_0. Let A_i be the weight matrix at the i-th layer, so that x_i = A_i φ(x_{i-1}) for 2 ≤ i ≤ d and x_1 = A_1 x_0. For any layer i, 1 ≤ i ≤ d, let the width of the layer be h_i. We use [A_i]_j to denote the j-th column of A_i. Let the maximum width of the hidden layers be h_max := max{h_1, h_2, ..., h_{d-1}} and the minimum width of the hidden layers be h_min := min{h_1, h_2, ..., h_{d-1}}.

We use Θ to denote the set of parameters of the neural network; in our specific model, Θ = R^{h_1×h_0} × R^{h_2×h_1} × ··· × R^{h_d×h_{d-1}}, which consists of all the weight matrices {A_i}.

Throughout the paper, we use f_θ, θ ∈ Θ to denote the function computed by the neural network. For data (x, y) ∼ D, the loss is defined as L_D(f_θ) := E_{(x,y)∼D}[l(y, f_θ(x))], where the loss function l(y, ŷ) is convex in its second argument. We omit the distribution D when it is clear from the context.

Mode connectivity and spurious valleys. Fixing a neural network architecture, a data set D and a loss function, we say two sets of parameters (solutions) θ_A and θ_B are ε-connected if there is a path π(t) : [0, 1] → Θ that is continuous with respect to t and satisfies: 1. π(0) = θ_A; 2. π(1) = θ_B; and 3. for any t ∈ [0, 1], L(f_{π(t)}) ≤ max{L(f_{θ_A}), L(f_{θ_B})} + ε.
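To make the definition concrete, ε-connectivity along a candidate path can be checked numerically by sampling the loss along each segment. A minimal numpy sketch for a single straight segment; the helper name and the toy quadratic loss are ours for illustration, not from the paper:

```python
import numpy as np

def path_loss_barrier(theta_a, theta_b, loss_fn, num_points=101):
    """Estimate the barrier max_t L(pi(t)) - max{L(theta_a), L(theta_b)} along
    the straight segment pi(t) = (1 - t) * theta_a + t * theta_b.
    The two solutions are eps-connected along this path iff the barrier <= eps."""
    ts = np.linspace(0.0, 1.0, num_points)
    losses = [loss_fn((1.0 - t) * theta_a + t * theta_b) for t in ts]
    return max(losses) - max(loss_fn(theta_a), loss_fn(theta_b))

# Toy check: for a convex loss, a single segment never rises above its endpoints,
# so the barrier is zero; for deep nets the barrier between optima can be large.
quad_loss = lambda theta: float(np.sum(theta ** 2))
barrier = path_loss_barrier(np.array([1.0, 0.0]), np.array([0.0, 1.0]), quad_loss)
```

A piecewise-linear path with k segments is just the maximum of this quantity over its k segments.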
If ε = 0, we omit ε and just say they are connected.

If all local minimizers are connected, then we say that the loss function has the mode connectivity property. However, as we later show in Section 5, this property is very strong and fails even for overparametrized two-layer nets. Therefore we restrict our attention to classes of low-cost solutions that can be found by gradient-based algorithms (in particular, in Section 3 we focus on solutions that are dropout stable, and in Section 4 we focus on solutions that are noise stable). We say the loss function has the ε-mode connectivity property with respect to a class of low-cost solutions C if any two minimizers in C are ε-connected.

Mode connectivity is closely related to the notion of spurious valleys and connected sublevel sets (Venturi et al., 2018). If a loss function has all its sublevel sets ({θ : L(f_θ) ≤ λ}) connected, then it has the mode connectivity property. When the network only has the mode connectivity property with respect to a class of solutions C, as long as the class C contains a global minimizer, we know there are no spurious valleys in C.

However, we emphasize that neither mode connectivity nor the absence of spurious valleys implies that any local search algorithm can efficiently find the global minimizer. These notions only suggest that it is unlikely for local search algorithms to get completely stuck.

3 Connectivity of dropout-stable optima

In this section we show that dropout-stable solutions are connected. More concretely, we define a solution θ to be ε-dropout stable if we can remove a subset of half its neurons in each layer such that the loss remains steady.

Definition 1.
(Dropout Stability) A solution θ is ε-dropout stable if for all i such that 1 ≤ i < d, there exists a subset of at most ⌊h_j/2⌋ hidden units in each of the layers j from i through d−1 such that after rescaling the outputs of these hidden units (or equivalently, the corresponding rows and/or columns of the relevant weight matrices) by some factor r¹ and setting the outputs of the remaining units to zero, we obtain a parameter θ_i such that L(f_{θ_i}) ≤ L(f_θ) + ε.

Intuitively, if a solution is ε-dropout stable then it is essentially only using half of the network's capacity. We show that such solutions are connected:

Theorem 1. Let θ_A and θ_B be two ε-dropout stable solutions. Then there exists a path in parameter space π : [0, 1] → Θ between θ_A and θ_B such that L(f_{π(t)}) ≤ max{L(f_{θ_A}), L(f_{θ_B})} + ε for 0 ≤ t ≤ 1. In other words, letting C be the set of solutions that are ε-dropout stable, a ReLU network has the ε-mode connectivity property with respect to C.

Our path construction in Theorem 1 consists of two key steps. First we show that we can rescale at least half the hidden units in both θ_A and θ_B to zero via continuous paths of low loss, thus obtaining two parameters θ_A^1 and θ_B^1 satisfying the criteria in Definition 1.

Lemma 1. Let θ be an ε-dropout stable solution and let θ_i be specified as in Definition 1 for 1 ≤ i < d. Then there exists a path in parameter space π : [0, 1] → Θ between θ and θ_1 passing through each θ_i such that L(f_{π(t)}) ≤ L(f_θ) + ε for 0 ≤ t ≤ 1.

Though naively one might expect to be able to directly connect the weights of θ and θ_1 via interpolation, such a path may incur high loss as the loss function is not convex over Θ. In our proof of Lemma 1, we rely on a much more careful construction. The construction uses two types of steps: (a) interpolate between two weights in the top layer (the loss is convex in the top layer weights); (b) if a set of neurons already have their output weights set to zero, then we can change their input weights arbitrarily. See Figure 1 for an example path for a 3-layer network. Here we have separated the weight matrices into equally sized blocks: A_3 = [L_3 R_3], A_2 = [L_2 C_2; D_2 R_2] and A_1 = [L_1; B_1]. The path consists of 6 steps alternating between type (a) and type (b). Note that for all the type (a) steps, we only update the top layer weights; for all the type (b) steps, we only change rows of a weight matrix (inputs to neurons) if the corresponding columns in the previous matrix (outputs of neurons) are already 0. In Section A we show how such a path can be generalized to any number of layers.

Figure 1: Example path, 6 line segments from a 3-layer network to its dropout version. [The figure tabulates the blocks of A_3, A_2 and A_1 at each of the seven parameter settings (1)–(7) along the path; the table is omitted here.] Red denotes weights that have changed between steps while green denotes the zeroed weights that allow us to make these changes without affecting our output.

¹Note our results will also work if r is allowed to vary for each layer.

We then show that we can permute the hidden units of θ_A^1 such that its non-zero units do not intersect with those of θ_B^1, thus allowing us to interpolate between these two parameters. This is formalized in the following lemma; the proof is deferred to the supplementary material.

Lemma 2. Let θ and θ′ be two solutions such that at least ⌈h_i/2⌉ of the units in each hidden layer i have been set to zero in both. Then there exists a path in parameter space π : [0, 1] → Θ between θ and θ′ with 8 line segments such that L(f_{π(t)}) ≤ max{L(f_θ), L(f_{θ′})}.

Theorem 1 follows immediately from Lemma 1 and Lemma 2, as one can first connect θ_A to its dropout version θ_A^1 using Lemma 1, then connect θ_A^1 to the dropout version θ_B^1 of θ_B using Lemma 2, and finally connect θ_B^1 to θ_B using Lemma 1 again.

Finally, our results can be generalized to convolutional networks if we use channel-wise dropout (Tompson et al., 2015; Keshari et al., 2018).

Remark 1. For convolutional networks, a channel-wise dropout will randomly set entire channels to 0 and rescale the remaining channels by an appropriate factor. Theorem 1 can be extended to work with channel-wise dropout on convolutional networks.

4 Connectivity via noise stability

In this section, we relate mode connectivity to another notion of robustness for neural networks: noise stability. It has been observed (Morcos et al., 2018) that neural networks often perform just as well when a small amount of noise is injected into the hidden layers. This was formalized in (Arora et al., 2018), where the authors showed that noise stable networks tend to generalize well. In this section we use a very similar notion of noise stability, and show that all noise stable solutions can be connected as long as the network is sufficiently overparametrized.

We begin in Section 4.1 by restating the definitions of noise stability in (Arora et al., 2018) and highlighting the key differences in our definitions.
In Section 6 we verify these assumptions in practice. In Section 4.2, we first prove that noise stability implies dropout stability (meaning Theorem 1 applies) and then show that it is in fact possible to connect noise stable neural networks via even simpler paths than those for merely dropout stable networks.

4.1 Noise stability

First we introduce some additional notations and assumptions. In this section, we consider a finite and fixed training set S. For a network parameter θ, the empirical loss function is L(θ) = (1/|S|) Σ_{(x,y)∈S} l(y, f_θ(x)). Here the loss function l(y, ŷ) is assumed to be β-Lipschitz in ŷ: for any ŷ, ŷ′ ∈ R^{h_d} and any y ∈ R^{h_d}, we have |l(y, ŷ) − l(y, ŷ′)| ≤ β‖ŷ − ŷ′‖. Note that the standard cross entropy loss over the softmax function is √2-Lipschitz.

For any two layers i ≤ j, let M^{i,j} be the operator for the composition of these layers, such that x_j = M^{i,j}(x_i). Let J^{i,j}_{x_i} be the Jacobian of M^{i,j} at input x_i. Since the activation functions are ReLUs, we know M^{i,j}(x_i) = J^{i,j}_{x_i} x_i.

Arora et al. (2018) used several quantities to define noise stability. We state the definitions of these quantities below.

Definition 2 (Noise Stability Quantities). Given a sample set S, the layer cushion of layer i is defined as

µ_i := min_{x∈S} ‖A_i φ(x_{i−1})‖ / (‖A_i‖_F ‖φ(x_{i−1})‖).

For any two layers i ≤ j, the interlayer cushion µ_{i,j} is defined as

µ_{i,j} := min_{x∈S} ‖J^{i,j}_{x_i} x_i‖ / (‖J^{i,j}_{x_i}‖ ‖x_i‖).

Furthermore, for any layer i the minimal interlayer cushion is defined as² µ_{i→} := min_{i≤j≤d} µ_{i,j}. The activation contraction c is defined as

c := max_{x∈S, 1≤i≤d−1} ‖x_i‖ / ‖φ(x_i)‖.

Intuitively, these quantities measure the stability of the network's output to noise, both for a single layer and across multiple layers. Note that the definition of the interlayer cushion is slightly different from the original definition in (Arora et al., 2018). Specifically, in the denominator of our definition of the interlayer cushion, we replace the Frobenius norm of J^{i,j}_{x_i} by its spectral norm. Under the original definition, the interlayer cushion is at most 1/√h_i, simply because J^{i,i}_{x_i} = I_{h_i} and hence µ_{i,i} = 1/√h_i there. With this new definition, the interlayer cushion need not depend on the layer width h_i.

²Note that J^{i,i}_{x_i} = I_{h_i} and µ_{i,i} = 1.

The final quantity of interest is interlayer smoothness, which measures how close the network's behavior is to its linear approximation under noise. Our focus here is on the noise generated by the dropout procedure (Algorithm 1). Let θ = {A_1, A_2, ..., A_d} be the weights of the original network, and let θ_i = {A_1, Â_2, ..., Â_i, A_{i+1}, ..., A_d} be the result of applying Algorithm 1 to the weight matrices from layer 2 to layer i.³ For any input x, let x̂^i_i(t) and x̂^i_{i−1}(t) be the vectors before activation at layer i using parameters θt + θ_i(1 − t) and θt + θ_{i−1}(1 − t), respectively.

Definition 3 (Interlayer Smoothness). Given the scenario above, define the interlayer smoothness ρ to be the largest number such that, with probability at least 1/2 over the randomness in Algorithm 1, for any two layers i, j satisfying 2 ≤ i ≤ j ≤ d, every x ∈ S, and 0 ≤ t ≤ 1,

‖M^{i,j}(x̂^i_i(t)) − J^{i,j}_{x_i} x̂^i_i(t)‖ ≤ ‖x̂^i_i(t) − x_i‖ ‖x_j‖ / (ρ‖x_i‖),
‖M^{i,j}(x̂^i_{i−1}(t)) − J^{i,j}_{x_i} x̂^i_{i−1}(t)‖ ≤ ‖x̂^i_{i−1}(t) − x_i‖ ‖x_j‖ / (ρ‖x_i‖).

If the network is smooth (has Lipschitz gradient), then interlayer smoothness holds as long as ‖x̂^i_i(t) − x_i‖ and ‖x̂^i_{i−1}(t) − x_i‖ are small. Essentially, the assumption here is that the network behaves smoothly in the random directions generated by randomly dropping out columns of the matrices.

Similar to (Arora et al., 2018), we have defined multiple quantities measuring the noise stability of a network. These quantities are in practice small constants, as we verify experimentally in Section 6. Finally, we combine all these quantities to define a single overall measure of the noise stability of a network.

Definition 4 (Noise Stability). For a network θ with layer cushion µ_i, minimal interlayer cushion µ_{i→}, activation contraction c and interlayer smoothness ρ, if the minimum hidden-layer width h_min is at least Ω̃(1), ρ ≥ 3d, and ‖φ(x̂^i_i(t))‖_∞ = O(1/√h_i) ‖φ(x̂^i_i(t))‖ for 1 ≤ i ≤ d−1 and 0 ≤ t ≤ 1, we say the network θ is ε-noise stable for

ε = c d^{3/2} max_{x∈S} ‖f_θ(x)‖ / ( h_min^{1/2} · min_{2≤i≤d} (µ_i µ_{i→}) ).

The smaller ε, the more robust the network. Note that ε is small as long as the hidden layer width h_min is large compared to the noise stability parameters. Intuitively, we can think of ε as a single parameter that captures the noise stability of the network.

4.2 Noise stability implies dropout stability

We now show that noise stable local minimizers must also be dropout stable, from which it follows that noise stable local minimizers are connected. We first define the dropout procedure we will be using, in Algorithm 1.

Algorithm 1 Dropout(A_i, p)
Input: layer matrix A_i ∈ R^{h_i×h_{i−1}}, dropout probability 0 < p < 1.
Output: Â_i ∈ R^{h_i×h_{i−1}}.
1: For each j ∈ [h_{i−1}], let δ_j be an i.i.d. Bernoulli random variable which takes the value 0 with probability p and the value 1/(1−p) with probability 1−p.
2: For each j ∈ [h_{i−1}], let [Â_i]_j be δ_j [A_i]_j, where [Â_i]_j and [A_i]_j are the j-th columns of Â_i and A_i respectively.

The main theorem that we prove in this section is:

Theorem 2. Let θ_A and θ_B be two fully connected networks that are both ε-noise stable. Then there exists a path with 10 line segments in parameter space π : [0, 1] → Θ between θ_A and θ_B such that⁴ L(f_{π(t)}) ≤ max{L(f_{θ_A}), L(f_{θ_B})} + Õ(ε) for 0 ≤ t ≤ 1.

To prove the theorem, we first show that the networks θ_A and θ_B are Õ(ε)-dropout stable. This is captured in the following main lemma:

Lemma 3. Let θ be an ε-noise stable network, and let θ_1 be the network with weight matrices from layer 2 to layer d dropped out by Algorithm 1 with dropout probability Ω̃(1/h_min) ≤ p ≤ 3/4. For any 2 ≤ i ≤ d, assume ‖[A_i]_j‖ = O(√p) ‖A_i‖_F for 1 ≤ j ≤ h_{i−1}. For any 0 ≤ t ≤ 1, define the network on the segment from θ to θ_1 as θ_t := θ + t(θ_1 − θ). Then, with probability at least 1/4 over the weights generated by Algorithm 1, L(f_{θ_t}) ≤ L(f_θ) + Õ(√p ε) for any 0 ≤ t ≤ 1.

³Note that A_1 is excluded because dropping out columns in Â_2 already drops out the neurons in layer 1; dropping out columns in A_1 would drop out input coordinates, which is not necessary.
⁴Here Õ(·) hides logarithmic factors in relevant quantities including |S|, d, ‖x‖, 1/ε and h_i‖A_i‖ for layers i ∈ [d].

The main difference between Lemma 3 and Lemma 1 is that we can now directly interpolate between the original network and its dropout version, which reduces the number of segments required.
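Algorithm 1 is short enough to state as code. A numpy sketch (the function name is ours); the key point is the 1/(1 − p) rescaling of surviving columns, which makes the dropped-out matrix an unbiased estimate of the original, E[Â_i] = A_i:

```python
import numpy as np

def dropout_columns(A, p, rng):
    """Algorithm 1: zero out each column of A independently with probability p
    and scale each surviving column by 1/(1 - p), so that E[A_hat] = A."""
    delta = np.where(rng.random(A.shape[1]) < p, 0.0, 1.0 / (1.0 - p))
    return A * delta  # broadcasting scales column j by delta_j

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 8))
A_hat = dropout_columns(A, p=0.5, rng=rng)
# With p = 0.5, every column of A_hat is either identically zero
# or exactly 2x the corresponding column of A.
kept = ~np.all(A_hat == 0.0, axis=0)
```

Interpolating θ_t = θ + t(θ_1 − θ) then simply means linearly blending each A_i with its dropped-out counterpart Â_i.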
This is mainly because in the noise stable setting, we can prove that after dropping out the neurons, not only does the output remain stable but every intermediate layer also remains stable. From Lemma 3, the proof of Theorem 2 is very similar to the proof of Theorem 1; the detailed proof is given in Section B.

The additional power of Lemma 3 also allows us to consider a smaller dropout probability. The theorem below allows us to trade the dropout fraction against the energy barrier ε that we can prove: if the network is highly overparametrized, one can choose a small dropout probability p, which allows the energy barrier ε to be smaller.

Theorem 3. Suppose there exists a network θ* with layer width h*_i for each layer i that achieves loss L(f_{θ*}), and minimum hidden layer width h*_min = Ω̃(1). Let θ_A and θ_B be two ε-noise stable networks. For any dropout probability p with 1.5 max_{1≤i≤d−1}(h*_i/h_i) ≤ p ≤ 3/4, if ‖[A_i]_j‖ = O(√p) ‖A_i‖_F for any 2 ≤ i ≤ d and 1 ≤ j ≤ h_{i−1}, then there exists a path with 13 line segments in parameter space π : [0, 1] → Θ between θ_A and θ_B such that L(f_{π(t)}) ≤ max{L(f_{θ_A}) + Õ(√p ε), L(f_{θ_B}) + Õ(√p ε), L(f_{θ*})} for 0 ≤ t ≤ 1.

Intuitively, we prove this theorem by connecting θ_A and θ_B via the neural network θ* with narrow hidden layers. The detailed proof is given in Section B.

5 Disconnected modes in two-layer nets

The mode connectivity property does not hold for every neural network. Freeman and Bruna (2016) gave a counter-example showing that if the network is not overparametrized, then there can be different global minima of the neural network that are not connected. Venturi et al. (2018) showed that spurious valleys can exist for 2-layer ReLU nets with an arbitrary number of hidden units, but again they do not extend their result to the overparametrized setting. In this section, we show that even if a neural network is overparametrized, in the sense that there exists a network of smaller width that can achieve optimal loss, there can still be two global minimizers that are not connected.

In particular, suppose we are training a two-layer ReLU student network with h hidden units to fit a dataset generated by a ground truth two-layer ReLU teacher network with h_t hidden units, such that the samples in the dataset are drawn from some input distribution and the labels are computed via forward passes through the teacher network. The following theorem demonstrates that regardless of the degree to which the student network is overparametrized, we can always construct such a dataset for which the global minima are not connected.

Theorem 4. For any width h and any convex loss function l : R × R → R such that l(y, ŷ) is minimized when y = ŷ, there exists a dataset generated by a ground-truth teacher network with two hidden units (i.e. h_t = 2) and one output unit such that the global minimizers are not connected for a student network with h hidden units.

Our proof is based on an explicit construction. The detailed construction is given in Section C.

6 Experiments

We now demonstrate that our assumptions and theoretical findings accurately characterize mode connectivity in practical settings. In particular, we empirically validate our claims using standard convolutional architectures, for which we treat individual filters as the hidden units and apply channel-wise dropout (see Remark 1), trained on datasets such as CIFAR-10 and MNIST.

Training with dropout is not necessary for a network to be either dropout-stable or noise-stable. Recall that our definition of dropout-stability merely requires the existence of a particular sub-network with half the width of the original that achieves low loss. Moreover, as Theorem 3 suggests, if there exists a narrow network that achieves low loss, then we need only be able to drop out a number of filters equal to the width of the narrow network in order to connect local minima.

Figure 2: Results for convolutional networks trained on MNIST.

First, we demonstrate in the left plot of Figure 2 that on MNIST, 3-layer convolutional nets (not counting the output layer) with 32 3×3 filters in each layer tend to be fairly dropout stable, both in the original sense of Definition 1 and especially if we relax the definition to allow for wider subnetworks, despite the fact that no dropout was applied in training. For each trial, we randomly sampled 20 dropout networks with exactly ⌊32(1−p)⌋ non-zero filters in each layer and report the performance of the best one. In the center plot, we verify that for p = 0.2 we can construct a linear path π(t) : [0, 1] → Θ from our convolutional net to a dropout version of itself. Similar results were observed when varying p. Finally, in the right plot we demonstrate the existence of 3-layer convolutional nets just a few filters wide that are able to achieve low loss on MNIST. Taken together, these results indicate that our path construction in Theorem 3 performs well in practical settings.
In particular,\nwe can connect two convolutional nets trained on MNIST by way of \ufb01rst interpolating between the\noriginal nets and their dropped out versions with p = 0.2, and then connecting the dropped out\nversions by way of a narrow subnetwork with at most b32pc non-zero \ufb01lters.\nWe also demonstrate that the VGG-11 (Simonyan and Zisserman, 2014) architecture trained with\nchannel-wise dropout (Tompson et al., 2015; Keshari et al., 2018) with p = 0.25 at the \ufb01rst three\nlayers5 and p = 0.5 at the others on CIFAR-10 converges to a noise stable minima\u2014as measured\nby layer cushion, interlayer cushion, activation contraction and interlayer smoothness. The network\nunder investigation achieves 95% training and 91% test accuracy with channel-wise dropout activated,\nin comparison to 99% training and 92% test accuracy with dropout turned off. Figure 3 plots the\ndistribution of the noise stability parameters over different data points in the training set, from which\nwe can see they behave nicely. Interestingly, we also discovered that networks trained without\nchannel-wise dropout exhibit similarly nice behavior on all but the \ufb01rst few layers. Finally, in\nFigure 3, we demonstrate that the training loss and accuracy obtained via the path construction\nin Theorem 3 between two noise stable VGG-11 networks \u2713A and \u2713B remain fairly low and high\nrespectively\u2014particularly in comparison to directly interpolating between the two networks, which\nincurs loss as high as 2.34 and accuracy as low as 10%, as shown in Section D.2.\nFurther details on all experiments are provided in Section D.1.\n\n5we \ufb01nd the \ufb01rst three layers are less resistant to channel-wise dropout.\n\n8\n\n\fFigure 3: Left) Distribution of layer cushion, activation contraction, interlayer cushion and interlayer\nsmoothness of the 6-th layer of a VGG-11 network on the training set. The other layers\u2019 parameters\nare exhibited in Section D.3. 
Right) The loss and training accuracy along the path between two noise stable VGG-11 networks described in Theorem 3.

Acknowledgments
Rong Ge acknowledges funding from NSF CCF-1704656, NSF CCF-1845171 (CAREER), the Sloan Fellowship and Google Faculty Research Award. Sanjeev Arora acknowledges funding from the NSF, ONR, Simons Foundation, Schmidt Foundation, Amazon Research, DARPA and SRC.

References
Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. (2018). Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204.

Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.

Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, F. A. (2018). Essentially no barriers in neural network energy landscape. arXiv preprint arXiv:1803.00885.

Freeman, C. D. and Bruna, J. (2016). Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540.

Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., and Wilson, A. G. (2018). Loss surfaces, mode connectivity, and fast ensembling of DNNs. In Advances in Neural Information Processing Systems, pages 8789–8798.

Kawaguchi, K. (2016). Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594.

Keshari, R., Singh, R., and Vatsa, M. (2018). Guided dropout. arXiv preprint arXiv:1812.03965.

Liang, S., Sun, R., Li, Y., and Srikant, R. (2018). Understanding the loss surface of neural networks for binary classification.
In International Conference on Machine Learning, pages 2840–2849.

Morcos, A. S., Barrett, D. G., Rabinowitz, N. C., and Botvinick, M. (2018). On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959.

Nguyen, Q. (2019). On connected sublevel sets in deep learning. arXiv preprint arXiv:1901.07417.

Nguyen, Q., Mukkamala, M. C., and Hein, M. (2018). On the loss landscape of a class of deep neural networks with no bad local valleys. arXiv preprint arXiv:1809.10749.

Safran, I. and Shamir, O. (2018). Spurious local minima are common in two-layer ReLU neural networks. In International Conference on Machine Learning, pages 4430–4438.

Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Tompson, J., Goroshin, R., Jain, A., LeCun, Y., and Bregler, C. (2015). Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 648–656.

Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434.

Venturi, L., Bandeira, A. S., and Bruna, J. (2018). Spurious valleys in two-layer neural network optimization landscapes. arXiv preprint arXiv:1802.06384.

Yun, C., Sra, S., and Jadbabaie, A. (2018).
A critical view of global optimality in deep learning. arXiv preprint arXiv:1802.03487.