{"title": "Piecewise Strong Convexity of Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 12973, "page_last": 12983, "abstract": "We study the loss surface of a feed-forward neural network with ReLU non-linearities, regularized with weight decay. We show that the regularized loss function is piecewise strongly convex on an important open set which contains, under some conditions, all of its global minimizers. This is used to prove that local minima of the regularized loss function in this set are isolated, and that every differentiable critical point in this set is a local minimum, partially addressing an open problem given at the Conference on Learning Theory (COLT) 2015; our result is also applied to linear neural networks to show that with weight decay regularization, there are no non-zero critical points in a norm ball obtaining training error below a given threshold. We also include an experimental section where we validate our theoretical work and show that the regularized loss function is almost always piecewise strongly convex when restricted to stochastic gradient descent trajectories for three standard image classification problems.", "full_text": "Piecewise Strong Convexity of Neural Networks\n\nTristan Milne\n\nDepartment of Mathematics\n\nUniversity of Toronto\n\nToronto, Ontario, Canada\n\ntmilne@math.toronto.edu\n\nAbstract\n\nWe study the loss surface of a feed-forward neural network with ReLU non-\nlinearities, regularized with weight decay. We show that the regularized loss\nfunction is piecewise strongly convex on an important open set which contains,\nunder some conditions, all of its global minimizers. 
This is used to prove that\nlocal minima of the regularized loss function in this set are isolated, and that every\ndifferentiable critical point in this set is a local minimum, partially addressing\nan open problem given at the Conference on Learning Theory (COLT) 2015; our\nresult is also applied to linear neural networks to show that with weight decay\nregularization, there are no non-zero critical points in a norm ball obtaining training\nerror below a given threshold. We also include an experimental section where we\nvalidate our theoretical work and show that the regularized loss function is almost\nalways piecewise strongly convex when restricted to stochastic gradient descent\ntrajectories for three standard image classi\ufb01cation problems.\n\n1\n\nIntroduction\n\nNeural networks are an extremely popular tool with a variety of applications, from object ([SHK+14],\n[HZRS15]) and speech recognition [GMH13] to the automatic creation of realistic synthetic images\n[WLZ+17]. The optimization problem of \ufb01nding the weights of a neural network such that the\nresulting function approximates a target function on training data is the focus of this paper.\nDespite strong empirical success on this optimization problem, there is still much to know about the\nlandscape of the loss function besides the fact that it is not globally convex. How many local minima\ndoes this function have? Are the local minima isolated from each other? What about the existence of\nlocal maxima or saddle points? We aim to answer several of these questions in this paper, at least for\nthe loss function restricted to an important set of weights.\nImportant papers on the study of a neural network\u2019s loss function include [CHM+15] and [Kaw16].\nIn [CHM+15], the authors study the loss surface of a non-linear neural network. 
Under some\nassumptions, including independence of the network\u2019s input and the non-linearities, the loss function\nof the non-linear network can be reduced to a loss function for a linear network, which is then written\nas the Hamiltonian of a spin glass model. Recent ideas from spin glass theory are then used on\nthe network\u2019s loss function, proving statements about the distribution of its critical values. This\npaper established an important \ufb01rst step in the direction of understanding the loss surface of neural\nnetworks, but leaves a gap between theory and practise due to its analysis of the expected value of\nthe non-linear network over the behaviour of the ReLU non-linearities, which reduces the non-linear\nnetwork to a linear one. Moreover, the authors assume that the components of the input data vector are\nindependently distributed, which need not be true in important applications such as object recognition,\nwhere the pixel values of the input image are strongly correlated.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fIn [Kaw16], the results of [CHM+15] are improved and several strong assumptions in that paper are\neither weakened or eliminated. This paper studies the loss surface of a linear neural network with\na quadratic loss function, and establishes many strong conclusions, including the facts that every\nlocal minimum for such a network is a global minimum, there are no local maxima, and, under some\nassumptions on the rank of the product of weight matrices, the Hessian of the loss function has at\nleast one negative eigenvalue at a saddle point, which has implications for the convergence of gradient\ndescent via [LSJR16]. All this is proven under some mild assumptions on the distribution of the\nlabelled data. 
These results, although comprehensive for linear networks, only hold for a network with ReLU activation functions if one first takes an expectation over the network's non-linearities as in [CHM+15], and thus have limited applicability to non-linear networks.
Our contribution is to the understanding of the loss function for a neural network with ReLU non-linearities, quadratic error, and weight decay regularization. In contrast to [CHM+15] and [Kaw16], we do not take an expectation over the network's non-linearities, and instead study the network in its fully non-linear form. Without making any assumptions on the labelled data distribution, we prove that the loss function has no local maxima in any open neighbourhood where the loss function is infinitely differentiable and non-constant. We also prove that there is a non-empty open set where

1. the regularized loss function is piecewise strongly convex, with domains of convexity determined by the ReLU non-linearity,
2. every differentiable critical point of the regularized loss function is a local minimum, and,
3. local minima of the regularized loss function are isolated.

This open set contains the origin, all points in a norm ball where training error is below a given threshold, and, under some conditions, all global minimizers of the regularized loss function. Our results also provide an explicit description of the set where piecewise strong convexity is guaranteed in terms of the size of the network's parameters and architecture, which allows for networks to be designed to maximize the size of this set. This set also implicitly depends on the training data. We emphasize the similarity of these results to those proved in [CHM+15], which include the fact that there is a value of the Hamiltonian (i.e. the loss function of the network) below which critical points have a high probability of being low index. 
Our results \ufb01nd a threshold of the same loss function\nunder which every differentiable critical point of the regularized function in a bounded set is a local\nminimum, and therefore has index 0. Since we make no assumptions on the distribution of training\ndata, we therefore hope that our results provide at least a partial answer to the open problem given in\n[CLA15].\nWe also include an experimental section where we validate our theoretical results on a toy problem\nand show that restricted to stochastic gradient descent trajectories, the loss function for commonly\nused neural networks on image classi\ufb01cation problems is almost always piecewise strongly convex.\nThis shows that even though the loss function may be non-convex outside the open set we have found,\nits gradient descent trajectories are similar to those of a strongly convex function, which hints at a\ngeneral explanation for the success of \ufb01rst order optimization methods applied to neural networks.\n\n1.1 Related Work\n\nA related paper is [HM16], which shows that for linear residual networks, the only critical points in a\nnorm ball at the origin are global minima, provided the training data is generated by a linear map\nwith Gaussian noise. In contrast, we show that for a non-linear network with any training data, there\nis an open set containing the origin where every differentiable critical point is a local minimum.\nOur result on the existence of regions on which the loss function is piecewise well behaved is similar\nto the results in [SS16], which shows that for a two layer network with a scalar output, weight space\ncan be divided into convex regions, and on each of these the loss function has a basin-like structure\nwhich resembles a convex function in that every local minimum is a global minimum. Compared to\n[SS16], our conclusions apply to a smaller set in weight space, but are compatible with more network\narchitectures (e.g. 
networks with more than two layers).\nAnother relevant paper is [SC16], which proves that for neural networks with a single hidden layer\nand leaky ReLUs, every differentiable local minimum of the training error is a global minimum\nunder mild over-parametrization; they also have an extension of this result to deeper networks under\nstronger conditions. With small modi\ufb01cations to our proofs, our results can be applied to networks\n\n2\n\n\fwith leaky ReLU\u2019s, and thus our paper and [SC16] provide complementary viewpoints; if every\ndifferentiable local minimum is a global minimum, then differentiable local minima must satisfy\nour error threshold as long as the network has enough capacity to obtain zero training error, and\nthus piecewise strong convexity is guaranteed around these points. Our results in combination with\n[SC16] therefore describe, in some cases, the regularized loss surface around differentiable local\nminima of the training error.\nConvergence of gradient methods for deep neural networks has recently been shown in [AZLS18]\nand [OS19] with some restricted architectures. Since our results on the structure of the loss function\nare limited to a subset of weight space, they say little about the convergence of gradient descent with\narbitrary initialization. Our empirical results in Section 4, however, show that the loss function is\nwell behaved outside this subset, and indicate methods for generating new convergence results.\n\n2 Summary of Results\n\nHere we give informal statements of our main theorems and outline the general argument. In Section\n3.1 we start by showing that a network of ReLU\u2019s is in\ufb01nitely differentiable (i.e. smooth) almost\neverywhere in weight space. Using the properties of subharmonic functions we then prove that the\ncorresponding loss function is either constant or not smooth in any open neighbourhood of a local\nmaximum. 
In Section 3.2, we estimate the second derivatives of the loss function in terms of the loss function itself, giving the following theorem in Section 3.3.
Theorem 1. Let ℓ be the loss function for a feed-forward neural network with ReLU non-linearities, a scalar output, a quadratic error function, and training data {a_i, f(a_i)}_{i=1}^N. Let

  ℓ_λ(W) = ℓ(W) + (λ/2)‖W‖²

be the same loss function with weight decay regularization. Then there is an open set U, containing the origin, all points W in a norm ball where training error is below a certain threshold, and, in certain cases, all global minimizers of ℓ_λ, such that ℓ_λ is piecewise strongly convex on U.

See Theorem 3 for a more precise statement of this result. In Section 3.4, we use piecewise strong convexity to prove the following theorem.
Theorem 2. For the same network as in Theorem 1, every differentiable critical point of ℓ_λ in U is a local minimum, and every local minimum of ℓ_λ in U is an isolated local minimum.

We conclude our theoretical section with an application of Theorem 2. We end with Section 4, where we empirically validate our theoretical results on a toy problem and demonstrate, for more complex problems, that the loss function is almost always piecewise strongly convex on stochastic gradient descent trajectories.

3 Theoretical Results

Consider a neural network with an n_0 dimensional input a. The neural network will have H hidden layers, which means there are H + 2 layers in total, including the input layer (layer 0) and the output layer (layer H + 1). The width of the ith layer will be denoted by n_i, and we will assume n_i > 1 for all i = 1, . . . , H. We will consider scalar neural networks for simplicity (i.e. n_{H+1} = 1), though these results can be easily extended to non-scalar networks. 
The weights connecting the ith layer to the (i + 1)st layer will be given by matrices W_i ∈ R^{n_i × n_{i+1}}, with w^i_{j,k} giving the weight of the connection between neuron j in layer i and neuron k in layer i + 1. The neural network y : R^{n_0} × R^m → R is given by the function

  y(a, W) = σ(W_Hᵀ σ(W_{H−1}ᵀ σ(. . . σ(W_0ᵀ a) . . .))),   (1)

where W = (W_0, . . . , W_H) is the collection of all the network weights, and σ(x_1, . . . , x_n) = (max(x_1, 0), . . . , max(x_n, 0)) is the ReLU non-linearity. Let f : R^{n_0} → R be the target function, and let {a_i, f(a_i)}_{i=1}^N be a set of labelled training data. The loss function is given by ℓ : R^m → R,

  ℓ(W) = (1/2N) Σ_{i=1}^N (f(a_i) − y(a_i, W))².

We will refer to ℓ(W) as the training error. The regularized loss function is given by ℓ_λ : R^m → R,

  ℓ_λ(W) = ℓ(W) + (λ/2)‖W‖²,

where λ > 0, and ‖W‖ is the standard Euclidean norm of the weights. We will start by writing y(a, W) as a matrix product. Define S_i : R^{n_0} × R^m → R^{n_i × n_i} by

  S_i(a, W) = diag(h_i(W_{i−1}ᵀ σ(. . . σ(W_0ᵀ a) . . .))),   (2)

where h_i : R^{n_i} → R^{n_i} is given by

  h_i(x_1, . . . , x_{n_i}) = (1_{x_1>0}(x_1), . . . , 1_{x_{n_i}>0}(x_{n_i})),

where 1_{x_i>0} is the indicator function of the positive real numbers, equal to 1 if the argument is positive, and zero otherwise. We will call the S_i matrices the "switches" of the network. It is clear that

  y(a, W) = S_{H+1}(a, W) W_Hᵀ S_H(a, W) W_{H−1}ᵀ S_{H−1}(a, W) . . . S_1(a, W) W_0ᵀ a.

3.1 Differentiability of the neural network

Here we state that the network is differentiable, at least on the majority of points in weight space.
Lemma 1. 
For any a ∈ R^{n_0}, the map W ↦ y(a, W) is smooth almost everywhere in R^m.

Although the proof of this lemma is deferred to the supplementary materials, the crux of the argument is simple; the network is piecewise analytic as a function of the weights, with domains of definition delineated by the zero sets of the inputs to the ReLU non-linearities. These inputs are themselves locally analytic functions, and the zero set of a non-zero real analytic function has Lebesgue measure zero; for a concise proof of this fact, see [Mit15].
Given that y(a, W) is differentiable for almost all W, we compute some derivatives in the coordinate directions of weight space, and use the result to prove a lemma about the existence of local maxima.
Lemma 2. If W* is a local maximum of ℓ, then ℓ is not smooth on any open neighbourhood of W*, unless ℓ is constant on that neighbourhood.

This lemma is proved in the supplementary material, and involves showing that ℓ is a subharmonic function on any neighbourhood where it is smooth, and then using the maximum principle [McO03]. Note that the same proof applies to deep linear networks. These networks are everywhere differentiable, so we conclude that the loss functions of linear networks have no local maxima unless they are constant. This yields a simpler proof of Lemma 2.3 (iii) from [Kaw16].

3.2 Estimating Second Derivatives

This section will be devoted to estimating the second derivatives of ℓ. This relates to the convexity of ℓ_λ, since

  d²/dt²|_{t=0} ℓ_λ(W + tX) = d²/dt²|_{t=0} ℓ(W + tX) + λ‖X‖².

Hence, if there exists θ > 0 such that

  d²/dt²|_{t=0} ℓ(W + tX) ≥ −θ‖X‖²,

then, provided λ > θ, the loss function ℓ_λ will be at least locally convex. 
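The decomposition above is easy to sanity-check numerically. The following is a minimal sketch (illustrative code, not from the paper): for a tiny ReLU network with quadratic error, the directional second derivative of ℓ_λ, computed by central differences, exceeds that of ℓ by exactly λ‖X‖², since the two losses differ only by the quadratic term (λ/2)‖W‖².

```python
import numpy as np

# Minimal sketch (not the paper's code): numerically check that weight decay
# shifts the directional second derivative of the loss by exactly λ‖X‖².
# Tiny network: one hidden layer (H = 1), scalar output, ReLU, quadratic error.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))          # training inputs a_i
f = A @ np.array([0.5, -1.0, 0.25])   # targets f(a_i)

def loss(w, lam=0.0):
    W0 = w[:6].reshape(3, 2)          # layer 0 weights
    W1 = w[6:].reshape(2, 1)          # layer 1 weights
    y = np.maximum(np.maximum(A @ W0, 0.0) @ W1, 0.0).ravel()
    return 0.5 * np.mean((f - y) ** 2) + 0.5 * lam * w @ w

w = rng.normal(size=8)
X = rng.normal(size=8)
lam, t = 0.3, 1e-4

def second_deriv(g):                  # central difference for d²/dt² g(w + tX)
    return (g(w + t * X) - 2 * g(w) + g(w - t * X)) / t ** 2

gap = second_deriv(lambda v: loss(v, lam)) - second_deriv(lambda v: loss(v, 0.0))
print(abs(gap - lam * X @ X) < 1e-5)
```

Because ℓ_λ(W) = ℓ(W) + (λ/2)‖W‖² exactly, the gap equals λ‖X‖² up to floating point rounding, even at points where the ReLU terms are not smooth.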
To estimate the second derivative of the loss function in an arbitrary direction, we first define the following norm, which measures the maximum operator norm of the weight matrices; it is the same norm used in [HM16].
Definition 1. For a parameter W ∈ R^m, define the norm

  ‖W‖* = max_{0≤i≤H} ‖W_i‖₂,   (3)

where ‖·‖₂ denotes the standard operator norm induced by the Euclidean norm.

The proof of Lemma 1 shows that ℓ is everywhere equal to a loss function for a collection of linear neural networks which are obtained from y(a_i, W) by holding the switches {S_j(a_i, W)}_{i,j} constant. The next lemma estimates the second derivative of such a function in an arbitrary direction.
Lemma 3. Suppose ‖a_i‖₂ ≤ r for all 1 ≤ i ≤ N. Fix W ∈ R^m, and set φ : R^m → R equal to ℓ, but with switches {S_j(a_i, W)}_{i,j} held constant as determined by W and the dataset {a_i}_{i=1}^N. The second derivative of φ in direction X ∈ R^m satisfies

  d²/dt²|_{t=0} φ(W + tX) ≥ −√2 H(H + 1) ‖W‖*^{H−1} ‖X‖₂ r φ(W)^{1/2}.   (4)

The proof of Lemma 3 is in the supplementary material, and uses standard tools.

3.3 Piecewise Convexity

Lemma 3 shows that the second derivative of ℓ in an arbitrary direction is bounded below by a term which depends on ℓ. Thus, as the loss function gets small, the negative part of the second derivative of the regularized loss function will be overwhelmed by the positive part contributed by the weight decay term. This observation leads to the following definition and theorem.
Definition 2. 
For a fixed neural network architecture with H hidden layers, and r > 0 satisfying

  ‖a_i‖₂ ≤ r,  1 ≤ i ≤ N,   (5)

define the open set U(λ, θ), where λ > θ > 0, by

  U(λ, θ) = {W ∈ R^m | ℓ(W)^{1/2} ‖W‖*^{H−1} < (λ − θ) / (√2 H(H + 1) r)}.   (6)

If we restrict to the norm ball B(R) = {W ∈ R^m | ‖W‖* ≤ R}, we have

  U(λ, θ) ∩ B(R) ⊃ {W ∈ B(R) | ℓ(W)^{1/2} < (λ − θ) / (√2 H(H + 1) r R^{H−1})}.   (7)

Thus, U(λ, θ) ∩ B(R) contains all points in B(R) obtaining training error below the threshold given in (7); this is the inclusion referred to in Theorem 1. The set U(λ, θ) is non-empty provided H > 1, as it contains an open neighbourhood of 0 ∈ R^m. Further, U(λ, θ) contains all points W obtaining zero training error, if these points exist; [ZBH+16] gives some sufficient conditions on the network architecture for obtaining zero training error. Lemma 4 gives conditions under which U(λ, θ) contains all global minimizers of ℓ_λ.
Lemma 4. Let ε = inf_{W ∈ R^m} ℓ_λ(W). If

  ε < λ^{1+1/H} / (2 (H(H + 1) r)^{2/H}),   (8)

then there exists θ such that U(λ, θ) contains all global minimizers of ℓ_λ.

Lemma 4 is proved in the supplementary materials. It follows since the term ℓ(W)^{1/2} ‖W‖*^{H−1} is roughly bounded by ℓ_λ(W). 
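Definition 2 is straightforward to evaluate for a given parameter. The following sketch (illustrative, not the paper's code) tests membership in U(λ, θ) given the weight matrices, the training error ℓ(W), the input bound r, and the depth H.

```python
import numpy as np

# Direct transcription of Definition 2 (a sketch, not the paper's code): test
# whether a parameter W lies in U(λ, θ) given its training error ℓ(W).
def star_norm(weights):
    """‖W‖* = max over layers of the spectral (operator) norm ‖W_i‖₂."""
    return max(np.linalg.norm(Wi, ord=2) for Wi in weights)

def in_U(weights, train_err, lam, theta, r, H):
    """W ∈ U(λ, θ) iff ℓ(W)^{1/2} ‖W‖*^{H-1} < (λ - θ) / (√2 H (H+1) r)."""
    assert lam > theta > 0
    lhs = np.sqrt(train_err) * star_norm(weights) ** (H - 1)
    return lhs < (lam - theta) / (np.sqrt(2) * H * (H + 1) * r)

# Example: H = 2 hidden layers, inputs bounded by r = 1 (values are ours).
W = [0.1 * np.eye(3), 0.1 * np.ones((3, 2)), 0.1 * np.ones((2, 1))]
print(in_U(W, train_err=1e-4, lam=0.4, theta=0.1, r=1.0, H=2))
```

As (6) suggests, membership becomes harder to certify as the training error, the operator norms of the layers, or the depth grow.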
Observe that by evaluating ℓ_λ(W) at W = 0, we get the inequality

  ε ≤ ℓ(0) = (1/2N) Σ_{i=1}^N f(a_i)².   (9)

Thus, as long as λ is large enough to satisfy

  (1/2N) Σ_{i=1}^N f(a_i)² < λ^{1+1/H} / (2 (H(H + 1) r)^{2/H}),   (10)

Lemma 4 shows that there exists θ such that U(λ, θ) contains all global minimizers of ℓ_λ. Thus, for any problem, we have found a range of λ values where our set U(λ, θ) is guaranteed to contain all global minimizers of the problem.
Theorem 3 shows that we can characterise ℓ_λ on U(λ, θ).
Theorem 3 (Piecewise Strong Convexity). With U(λ, θ) defined as above, there exist closed sets B_i ⊂ R^m for i = 1, . . . , L such that

  U(λ, θ) = ∪_{i=1}^L B_i ∩ U(λ, θ),   (11)

and smooth functions φ_i : V_i ⊂ R^m → R, V_i open, satisfying

  H(φ_i)(W) ≥ θ I_m,  ∀W ∈ V_i,   (12)

where H(φ_i) is the Hessian matrix, B_i ∩ U(λ, θ) ⊂ V_i for all i, and

  ℓ_λ|_{B_i ∩ U(λ,θ)} = φ_i|_{B_i ∩ U(λ,θ)}.   (13)

The proof of Theorem 3 is in the supplementary materials. The sets B_i are obtained from enumerating all possible values of the switches {S_j(a_k, W)}_{j,k} as we vary W, and taking B_i as the topological closure of all weights W giving the ith value of the switches. The functions φ_i are obtained from ℓ_λ by fixing the switches according to the definition of B_i.
Note that this theorem implies Theorem 1, with U given by U(λ, θ). This shows that when the training error is small enough in a bounded region, as prescribed by U(λ, θ) through (7), the function ℓ_λ is locally strongly convex. 
Note also that our estimates depend implicitly on the training data {a_i, f(a_i)}_{i=1}^N and the widths of the network's layers, since these quantities affect the size of the set U(λ, θ).

3.4 Isolated Local Minima

The next two lemmas are an application of Theorem 3, and prove Theorem 2.
Lemma 5. Every differentiable critical point of ℓ_λ in U(λ, θ) is an isolated local minimum.
Lemma 6. Every local minimum of ℓ_λ in U(λ, θ) is an isolated local minimum.

The proofs of these lemmas are given in the supplementary materials. Note the subtle difference between the two; Lemma 5 only applies to critical points where ℓ_λ is differentiable, while Lemma 6 also applies to non-differentiable local minima. Lemma 5 can be applied to prove that non-zero local minima obtaining training error below our threshold do not exist when the network is linear.
Lemma 7. If ℓ_λ is the regularized loss function for a linear neural network with H ≥ 1 and n_i > 1 for all i = 1, . . . , H, then ℓ_λ has no non-zero critical points on the set U(λ) given by

  U(λ) = ∪_{θ>0} U(λ, θ) = {W ∈ R^m | ℓ(W)^{1/2} ‖W‖*^{H−1} < λ / (√2 H(H + 1) r)}.   (14)

This lemma is proved in the supplementary materials, and involves the use of rotation matrices to demonstrate that any non-zero local minimum in U(λ) cannot be an isolated local minimum.

4 Experiments

We begin our experimental analysis with a regression experiment to validate our theoretical results; namely, that the region U(λ, θ) is entered over the course of training. We used a target function f(x) = x/4, with 100 data points sampled uniformly in [−1, 1], and a neural network with H = 1, n₁ = 2, no biases, no ReLU on the output, weight decay parameter λ = 0.4, and learning rate η = 0.1. 
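A minimal numpy re-implementation of this toy experiment might look as follows (a sketch under the stated hyperparameters; the paper's experiments were run in PyTorch, and details such as the initialization scale and step count are our assumptions):

```python
import numpy as np

# A sketch of the toy experiment (numpy re-implementation, details as in the
# text): gradient descent on ℓλ for f(x) = x/4, H = 1, n₁ = 2, no biases,
# no ReLU on the output, λ = 0.4, η = 0.1. We record when ℓ(W) first drops
# below the U(λ) threshold from (6); with H = 1, ‖W‖*⁰ = 1, so the condition
# is simply ℓ(W) < (λ / (√2 H (H+1) r))².
rng = np.random.default_rng(1)
a = rng.uniform(-1.0, 1.0, size=(100, 1))
f = a.ravel() / 4.0
lam, eta, H, r = 0.4, 0.1, 1, 1.0
threshold = (lam / (np.sqrt(2) * H * (H + 1) * r)) ** 2  # ℓ(W) < this ⇒ W ∈ U(λ)

W0 = rng.normal(scale=0.5, size=(1, 2))                  # input → hidden
W1 = rng.normal(scale=0.5, size=(2, 1))                  # hidden → output
entered = None
for step in range(500):
    h = np.maximum(a @ W0, 0.0)                          # ReLU hidden layer
    y = (h @ W1).ravel()                                 # linear output
    err = y - f
    train_err = 0.5 * np.mean(err ** 2)                  # ℓ(W) = (1/2N) Σ err²
    if entered is None and train_err < threshold:
        entered = step
    # gradients of ℓλ = ℓ + (λ/2)‖W‖²
    gy = err[:, None] / len(a)
    gW1 = h.T @ gy + lam * W1
    gh = gy @ W1.T * (h > 0)
    gW0 = a.T @ gh + lam * W0
    W0 -= eta * gW0
    W1 -= eta * gW1
print(entered is not None, train_err < threshold)
```

With λ = 0.4 and r = 1 the threshold is ℓ(W) < 0.02, the horizontal dashed line in the right panel of Figure 1; since ℓ(0) ≈ 0.01 for this target, the regularized minimizers lie inside U(λ).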
In this experiment, we measure the change in the regularized loss ℓ_λ from when gradient descent first enters U(λ) to convergence, as a fraction of the total change in ℓ_λ from initialization to convergence; this quantity measures the portion of the gradient descent trajectory contained in U(λ). We ran 1000 independent trials in PyTorch, distinguished by their random initializations, and found that the bound given in (6) is satisfied for some θ for 87.2% ± 22.3% (mean ± standard deviation) of the loss change over training. A histogram of the loss change fractions over these trials is given in Figure 1, as well as a plot of the training error ℓ for a single initialization, selected to show the trajectory entering U(λ) over the course of training. The histogram in Figure 1 shows that, for this simple experiment, most gradient descent trajectories inhabit U(λ) for almost the entire training run.

Figure 1: Left: Histograms of loss change fractions for f(x) = x/4 over 1000 trials. Right: A plot of the training loss ℓ versus time for a single training run on the same toy example. When the loss is below the horizontal dashed line, the current parameters W are in U(λ).

For more complicated architectures, we found that the bound in (6) is difficult to satisfy while still obtaining good test performance. The goal of the rest of this section is therefore to empirically determine when the loss function for standard neural network architectures is piecewise convex, which may occur before the bound in (6) is satisfied. Because the Hessian of the loss function ℓ_λ is difficult to compute for neural networks with many parameters, we turn our focus to ℓ_λ restricted to gradient descent trajectories, which is similar to the analysis in [SMG13]. 
To that end, let W : R₊ → R^m be a solution to the ODE

  Ẇ(t) = −∇ℓ_λ(W(t)),  W(0) = W₀,   (15)

and let γ(t) = ℓ_λ(W(t)) be the restriction of ℓ_λ to this curve. We have

  γ̈(t) = 2 ∇ℓ_λ(W(t))ᵀ H(ℓ_λ)(W(t)) ∇ℓ_λ(W(t)).   (16)

If this is always positive, then ℓ_λ is piecewise convex on gradient descent trajectories. We are especially interested in the right hand side of (16) normalized by ‖∇ℓ_λ‖², due to Lemma 8.
Lemma 8. Suppose that γ(t) is C² over [0, t*] ⊂ R, and that

  2 ∇ℓ_λ(W(t))ᵀ H(ℓ_λ)(W(t)) ∇ℓ_λ(W(t)) / ‖∇ℓ_λ(W(t))‖² ≥ C.   (17)

Then we have the convergence estimate

  ‖∇ℓ_λ(W(t))‖² ≤ ‖∇ℓ_λ(W(0))‖² e^{−Ct}  ∀t ∈ [0, t*].   (18)

This lemma is proved in the supplementary material, and makes use of Grönwall's inequality. We may compute the second derivative of γ(t) while avoiding computing H(ℓ_λ) by observing that

  2 ∇ℓ_λᵀ H(ℓ_λ)(W) ∇ℓ_λ = ∇ℓ_λᵀ ∇‖∇ℓ_λ‖²,   (19)

and therefore we may compute (16) by two backpropagations.
One notable difference between our theoretical work and the following experiments is that we focus on image classification tasks below, and hence the loss function is given by a composition of a softmax and cross entropy, as opposed to a square Euclidean distance.
For the MNIST, CIFAR10, and CIFAR100 datasets, we produce two plots generated by training neural networks using stochastic gradient descent (SGD). 
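The identity (19) can be checked on any smooth function. Below is a small numpy sketch (not the paper's PyTorch code) where central finite differences of ‖∇ℓ‖² stand in for the second backpropagation:

```python
import numpy as np

# Sketch verifying identity (19) on a small smooth test function (not the
# paper's networks): 2 ∇ℓᵀ H(ℓ) ∇ℓ = ∇ℓᵀ ∇‖∇ℓ‖², which lets one compute the
# trajectory second derivative γ̈ with two gradient evaluations (in PyTorch,
# two backpropagations) instead of forming the Hessian.
rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5))
Q = A @ A.T + np.eye(5)          # ℓ(w) = ½ wᵀQw, so ∇ℓ = Qw and H(ℓ) = Q
w = rng.normal(size=5)

grad = Q @ w
lhs = 2.0 * grad @ Q @ grad      # 2 ∇ℓᵀ H ∇ℓ, the trajectory curvature γ̈

# ∇‖∇ℓ‖² by central finite differences, standing in for the second backprop
eps = 1e-6
g_norm_grad = np.array([
    (np.sum((Q @ (w + eps * e)) ** 2) - np.sum((Q @ (w - eps * e)) ** 2)) / (2 * eps)
    for e in np.eye(5)
])
rhs = grad @ g_norm_grad
print(np.isclose(lhs, rhs, rtol=1e-4))
```

In PyTorch, the same quantity would be obtained by backpropagating ℓ_λ to get ∇ℓ_λ, then backpropagating the scalar ‖∇ℓ_λ‖² (built with a differentiable first gradient) and taking its inner product with ∇ℓ_λ.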
The first is of ℓ_λ and the normalized second derivative, as given by the left hand side of (17), versus training time, represented using logging periods, calculated over a single SGD trajectory. The second is a histogram which runs the same test over several trials, distinguished by their random initializations, and records the percent change in γ(t) between when it first became piecewise convex and convergence. In other words, let t₀ be the first time after which γ̈(t) is always positive; in our experiments, such a t₀ always exists. Let t₁ and t = 0 be the terminal and initial time, respectively. To compute the histograms in the left column of Figure 2, we compute the loss change fraction

  (γ(t₀) − γ(t₁)) / (γ(0) − γ(t₁))   (20)

over n separately initialized training trials on a given dataset. We are interested in (20) as it measures how much of the SGD trajectory is spent in a piecewise convex regime.
Our experimental set-up for each data set is summarized in Table 1, and all experiments were implemented in PyTorch, with a mini-batch size of 128. Results are summarized in Table 2 in the form mean ± standard deviation, calculated across all trials for each data set. The column "Norm. 2nd. Deriv. (%10)" in Table 2 is the 10th percentile of the normalized second derivatives in the piecewise convex regime, as calculated within each trial; we present this instead of the minimum normalized second derivative, as the minimum is quite noisy, and as such may not reflect the behaviour of the second derivative in bulk.

Table 1: Experimental Set-up

Dataset    Model     Epochs   Batch Norm.   λ          Trials   Learn. Rate
MNIST      LeNet-5   2        No            5 · 10⁻⁴   100      0.05
CIFAR10    ResNet22  65       Yes           1 · 10⁻⁶   20       0.1/0.02
CIFAR100   ResNet22  65       Yes           1 · 10⁻⁷   20       0.1/0.02

4.1 MNIST Experiments

In our first experiment we used the LeNet-5 architecture [LBB+98] on MNIST. The histogram generated for MNIST in Figure 2 shows that the loss function is piecewise convex over most of the SGD trajectory, independent of initialization. The upper right plot of Figure 2 confirms this; the normalized second derivative, in red, is negative for a very small amount of time, and then is consistently larger than 10, indicating that the loss function is piecewise strongly convex on most of this SGD trajectory. This is similar to our theoretical results, as piecewise strong convexity only seems to occur after the loss is below a certain threshold.

4.2 CIFAR10 and CIFAR100 Experiments

This group of experiments deals with image classification on the CIFAR10 and CIFAR100 datasets. We trained the ResNet22 architecture with PyTorch, and produced the same plots as with MNIST. This version of ResNet22 for CIFAR10/CIFAR100 was taken from the code accompanying [FOA18], and has a learning rate of 0.1, decreasing to 0.02 after 60 epochs.
In the middle left histogram in Figure 2, we see that over 20 trials on CIFAR10, every SGD trajectory was in the piecewise convex regime for its entire course. 
This is again reflected in the middle right plot of Figure 2, where we also see the normalized second derivative being consistently larger than 10. The test set accuracies are given in Table 2 for all three data sets, and are non-optimal; this is intentional, as we wanted to study the loss surface without optimization of hyperparameters. Note that Table 2 shows that large values of the normalized second derivative are consistently observed across all initializations, and for every dataset.

Table 2: Experimental Results

Dataset    Test Acc. (%)   Norm. 2nd. Deriv. (%10)   Loss Frac.
MNIST      97.32 ± 1.00    21.19 ± 2.12              0.80 ± 0.09
CIFAR10    84.93 ± 0.35    8.82 ± 2.80               1.00 ± 0.00
CIFAR100   54.01 ± 0.53    11.80 ± 0.27              1.00 ± 0.02

To see if this behaviour is consistent with more challenging classification tasks, we tested the same network on CIFAR100; the results are summarized in the bottom row of Figure 2, and are consistent with CIFAR10, with only 1 trajectory out of 20 failing to be entirely piecewise convex. We conjecture that the extra convexity observed in CIFAR10/100 compared to MNIST is due to the presence of batch normalization, which has been shown to improve the optimization landscape [STIM18].

Figure 2: Left: Histograms of loss change fractions for three image classification problems over several trials. Right: A plot of the loss function and normalized second derivative versus time for a single training run on the same datasets. The normalized second derivative is clipped at 10 for legibility. Top: MNIST, Middle: CIFAR10, Bottom: CIFAR100. Best viewed in color.

5 Conclusion

We have established a number of facts about the critical points of the regularized loss function for a neural network with ReLU activation functions. 
In particular, there are, in a certain sense, no differentiable local maxima, and, on an important set in weight space, the loss function is piecewise strongly convex, with every differentiable critical point an isolated local minimum. We then applied this result to prove that, for linear networks with weight decay regularization, there are no non-zero local minima obtaining training error below a certain threshold. Finally, we established the relevance of our theory to a toy problem through experiments, and demonstrated empirically that the loss function for standard neural networks is piecewise strongly convex along most SGD trajectories. Future directions of research include investigating whether the re-parametrization offered by Residual Networks [HZRS15] can improve our analysis, as was observed in [HM16], as well as determining where ℓλ satisfies (17), which would allow for stronger theorems on the convergence of SGD.

6 Acknowledgements

We would like to thank the anonymous referees for their thoughtful reviews. Thanks also to Professor Adam Stinchcombe for the use of his GPUs. This work was partially supported by an NSERC PGS-D and by the Mackenzie King Open Scholarship.

References

[AZLS18] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song.
A convergence theory for deep learning via over-parametrization. arXiv preprint arXiv:1811.03962, 2018.

[CHM+15] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.

[CLA15] Anna Choromanska, Yann LeCun, and Gérard Ben Arous. Open problem: The landscape of the loss surfaces of multilayer networks. In Conference on Learning Theory, pages 1756–1760, 2015.

[FOA18] Chris Finlay, Adam Oberman, and Bilal Abbasi. Improved robustness to adversarial examples using Lipschitz regularization of the loss. arXiv preprint arXiv:1810.00953, 2018.

[GMH13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE, 2013.

[HM16] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.

[HZRS15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[Kaw16] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.

[LBB+98] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[LSJR16] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1246–1257, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR.

[McO03] Robert C. McOwen. Partial Differential Equations: Methods and Applications.
Prentice Hall, 2nd edition, 2003.

[Mit15] Boris Mityagin. The zero set of a real analytic function. arXiv preprint arXiv:1512.07276, 2015.

[OS19] Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674, 2019.

[SC16] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

[SHK+14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

[SMG13] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

[SS16] Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pages 774–782, 2016.

[STIM18] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pages 2483–2493, 2018.

[WLZ+17] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. CoRR, abs/1711.11585, 2017.

[ZBH+16] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.