{"title": "Spectrally-normalized margin bounds for neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6240, "page_last": 6249, "abstract": "This paper presents a margin-based multiclass generalization bound for neural networks that scales with their margin-normalized \"spectral complexity\": their Lipschitz constant, meaning the product of the spectral norms of the weight matrices, times a certain correction factor. This bound is empirically investigated for a standard AlexNet network trained with SGD on the MNIST and CIFAR10 datasets, with both original and random labels; the bound, the Lipschitz constants, and the excess risks are all in direct correlation, suggesting both that SGD selects predictors whose complexity scales with the difficulty of the learning task, and secondly that the presented bound is sensitive to this complexity.", "full_text": "Spectrally-normalized margin bounds\n\nfor neural networks\n\nPeter L. Bartlett\u2217\n\nDylan J. Foster\u2020\n\nMatus Telgarsky\u2021\n\nAbstract\n\nThis paper presents a margin-based multiclass generalization bound for neural net-\nworks that scales with their margin-normalized spectral complexity: their Lipschitz\nconstant, meaning the product of the spectral norms of the weight matrices, times\na certain correction factor. This bound is empirically investigated for a standard\nAlexNet network trained with SGD on the mnist and cifar10 datasets, with both\noriginal and random labels; the bound, the Lipschitz constants, and the excess risks\nare all in direct correlation, suggesting both that SGD selects predictors whose\ncomplexity scales with the dif\ufb01culty of the learning task, and secondly that the\npresented bound is sensitive to this complexity.\n\n1 Overview\nNeural networks owe their astonishing success not only to their ability to \ufb01t any data set: they also\ngeneralize well, meaning they provide a close \ufb01t on unseen data. 
A classical statistical adage is that models capable of fitting too much will generalize poorly; what’s going on here?\nLet’s navigate the many possible explanations provided by statistical theory. A first observation is that any analysis based solely on the number of possible labellings of a finite training set — as is the case with VC dimension — is doomed: if the function class can fit all possible labels (as is the case with neural networks in standard configurations [Zhang et al., 2017]), then this analysis cannot distinguish it from the collection of all possible functions!\n\nFigure 1: An analysis of AlexNet [Krizhevsky et al., 2012] trained with SGD on cifar10, both with original and with random labels. Triangle-marked curves track excess risk across training epochs (on a log scale), with an ‘x’ marking the earliest epoch with zero training error. Circle-marked curves track Lipschitz constants, normalized so that the two curves for random labels meet. The Lipschitz constants tightly correlate with excess risk, and moreover normalizing them by margins (resulting in the square-marked curve) neutralizes growth across epochs.\n\n∗ University of California, Berkeley and Queensland University of Technology.\n† Cornell University.\n‡ University of Illinois, Urbana-Champaign.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nNext let’s consider scale-sensitive measures of complexity, such as Rademacher complexity and metric entropy, which work directly with real-valued function classes, and moreover are sensitive to their magnitudes. 
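One concrete scale-sensitive quantity examined throughout this paper is the Lipschitz constant of the network, bounded by the product of its layers' spectral norms. A minimal numpy sketch (the function name and arguments are our own, not the paper's; convolutional layers would first need to be expressed as matrices):

```python
import numpy as np

def lipschitz_bound(weights, rhos=None):
    """Upper bound the Lipschitz constant of x -> sigma_L(A_L ... sigma_1(A_1 x))
    by prod_i rho_i * ||A_i||_sigma, where rho_i is the Lipschitz constant of
    the i-th nonlinearity and ||.||_sigma is the spectral norm."""
    if rhos is None:
        rhos = [1.0] * len(weights)  # ReLU and max-pooling are 1-Lipschitz
    return float(np.prod([rho * np.linalg.norm(A, 2) for A, rho in zip(weights, rhos)]))
```

Here `np.linalg.norm(A, 2)` returns the largest singular value of a matrix, i.e. its spectral norm.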
Figure 1 plots the excess risk (the test error minus the training error) across\ntraining epochs against one candidate scale-sensitive complexity measure, the Lipschitz constant of\nthe network (the product of the spectral norms of their weight matrices), and demonstrates that they\nare tightly correlated (which is not the case for, say, the l2 norm of the weights). The data considered\nin Figure 1 is the standard cifar10 dataset, both with original and with random labels, which has\nbeen used as a sanity check when investigating neural network generalization [Zhang et al., 2017].\nThere is still an issue with basing a complexity measure purely on the Lipschitz constant (although\nit has already been successfully employed to regularize neural networks [Cisse et al., 2017]): as\ndepicted in Figure 1, the measure grows over time, despite the excess risk plateauing. Fortunately,\nthere is a standard resolution to this issue: investigating the margins (a precise measure of con\ufb01dence)\nof the outputs of the network. This tool has been used to study the behavior of 2-layer networks,\nboosting methods, SVMs, and many others [Bartlett, 1996, Schapire et al., 1997, Boucheron et al.,\n2005]; in boosting, for instance, there is a similar growth in complexity over time (each training\niteration adds a weak learner), whereas margin bounds correctly stay \ufb02at or even decrease. This\nbehavior is recovered here: as depicted in Figure 1, even though standard networks exhibit growing\nLipschitz constants, normalizing these Lipschitz constants by the margin instead gives a decaying\ncurve.\n1.1 Contributions\nThis work investigates a complexity measure for neural networks that is based on the Lipschitz\nconstant, but normalized by the margin of the predictor. The two central contributions are as follows.\n\u2022 Theorem 1.1 below will give the rigorous statement of the generalization bound that is\nthe basis of this work. 
In contrast to prior work, this bound: (a) scales with the Lipschitz\nconstant (product of spectral norms of weight matrices) divided by the margin; (b) has\nno dependence on combinatorial parameters (e.g., number of layers or nodes) outside of\nlog factors; (c) is multiclass (with no explicit dependence on the number of classes); (d)\nmeasures complexity against a reference network (e.g., for the ResNet [He et al., 2016], the\nreference network has identity mappings at each layer). The bound is stated below, with a\ngeneral form and analysis summary appearing in Section 3 and the full details relegated to\nthe appendix.\n\n\u2022 An empirical investigation, in Section 2, of neural network generalization on the standard\ndatasets cifar10, cifar100, and mnist using the preceding bound. Rather than using the\nbound to provide a single number, it can be used to form a margin distribution as in Figure 2.\nThese margin distributions will illuminate the following intuitive observations: (a) cifar10\nis harder than mnist; (b) random labels make cifar10 and mnist much more dif\ufb01cult;\n(c) the margin distributions (and bounds) converge during training, even though the weight\nmatrices continue to grow; (d) l2 regularization (\u201cweight decay\u201d) does not signi\ufb01cantly\nimpact margins or generalization.\n\nA more detailed description of the margin distributions is as follows. Suppose a neural network\ncomputes a function f : Rd \u2192 Rk, where k is the number of classes; the most natural way to\nconvert this to a classi\ufb01er is to select the output coordinate with the largest magnitude, meaning\nx (cid:55)\u2192 arg maxj f (x)j. 
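This conversion from network outputs to labels, together with the per-example margin it induces (the gap between the correct coordinate and the largest other coordinate), can be sketched as follows (a minimal numpy sketch; the function name and arrays are our own, not the paper's):

```python
import numpy as np

def margins(logits, labels):
    """Predicted labels arg max_j f(x)_j, and margins f(x)_y - max_{j != y} f(x)_j,
    for logits of shape (n, k) and integer labels of shape (n,)."""
    n = logits.shape[0]
    preds = logits.argmax(axis=1)
    correct = logits[np.arange(n), labels]   # f(x)_y
    rest = logits.copy()
    rest[np.arange(n), labels] = -np.inf     # exclude the true class
    return preds, correct - rest.max(axis=1)

# Two points, three classes: margins are 2.0 - 0.5 and 1.1 - 0.9.
preds, m = margins(np.array([[2.0, 0.5, 0.1], [0.2, 0.9, 1.1]]), np.array([0, 2]))
```

A positive margin means the point is classified correctly, with larger values indicating more confidence.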
The margin, then, measures the gap between the output for the correct label and other labels, meaning f(x)_y − max_{j≠y} f(x)_j.\nUnfortunately, margins alone do not seem to say much; see for instance Figure 2a, where the collections of all margins for all data points — the unnormalized margin distribution — are similar for cifar10 with and without random labels. What is missing is an appropriate normalization, as in Figure 2b. This normalization is provided by Theorem 1.1, which can now be explained in detail.\nTo state the bound, a little bit of notation is necessary. The networks will use L fixed nonlinearities (σ_1, . . . , σ_L), where σ_i : R^{d_{i−1}} → R^{d_i} is ρ_i-Lipschitz (e.g., as with coordinate-wise ReLU and max-pooling, as discussed in Appendix A.1); occasionally, it will also hold that σ_i(0) = 0. Given L weight matrices A = (A_1, . . . , A_L), let F_A denote the function computed by the corresponding network:\n\nF_A(x) := σ_L(A_L σ_{L−1}(A_{L−1} · · · σ_1(A_1 x) · · · )).  (1.1)\n\n(a) Margins.\n\n(b) Normalized margins.\n\nFigure 2: Margin distributions at the end of training AlexNet on cifar10, with and without random labels. With proper normalization, random labels demonstrably correspond to a harder problem.\n\nThe network output F_A(x) ∈ R^{d_L} (with d_0 = d and d_L = k) is converted to a class label in {1, . . . , k} by taking the arg max over components, with an arbitrary rule for breaking ties. Whenever input data x_1, . . . , x_n ∈ R^d are given, collect them as rows of a matrix X ∈ R^{n×d}. Occasionally, notation will be overloaded to discuss F_A(X^⊤), a matrix whose ith column is F_A(x_i). Let W denote the maximum of {d, d_1, . . . , d_L}. The l_2 norm ‖·‖_2 is always computed entry-wise; thus, for a matrix, it corresponds to the Frobenius norm.\nNext, define a collection of reference matrices (M_1, . . . , M_L) with the same dimensions as A_1, . . . , A_L; for instance, to obtain a good bound for ResNet [He et al., 2016], it is sensible to set M_i := I, the identity map, and the bound below will worsen as the network moves farther from the identity map; for AlexNet [Krizhevsky et al., 2012], the simple choice M_i = 0 suffices. Finally, let ‖·‖_σ denote the spectral norm and ‖·‖_{p,q} denote the (p, q) matrix norm, defined by ‖A‖_{p,q} := ‖(‖A_{:,1}‖_p, . . . , ‖A_{:,m}‖_p)‖_q for A ∈ R^{d×m}. The spectral complexity R_{F_A} = R_A of a network F_A with weights A is defined as\n\nR_A := (∏_{i=1}^L ρ_i ‖A_i‖_σ) (∑_{i=1}^L ‖A_i^⊤ − M_i^⊤‖_{2,1}^{2/3} / ‖A_i‖_σ^{2/3})^{3/2}.  (1.2)\n\nThe following theorem provides a generalization bound for neural networks whose nonlinearities are fixed but whose weight matrices A have bounded spectral complexity R_A.\nTheorem 1.1. Let nonlinearities (σ_1, . . . , σ_L) and reference matrices (M_1, . . . , M_L) be given as above (i.e., σ_i is ρ_i-Lipschitz and σ_i(0) = 0). Then for (x, y), (x_1, y_1), . . . , (x_n, y_n) drawn iid from any probability distribution over R^d × {1, . . . , k}, with probability at least 1 − δ over ((x_i, y_i))_{i=1}^n, every margin γ > 0 and network F_A : R^d → R^k with weight matrices A = (A_1, . . .
, A_L) satisfy\n\nPr[arg max_j F_A(x)_j ≠ y] ≤ R̂_γ(F_A) + Õ( (‖X‖_2 R_A / (γn)) ln(W) + √(ln(1/δ)/n) ),\n\nwhere R̂_γ(f) ≤ n^{−1} ∑_i 1[f(x_i)_{y_i} ≤ γ + max_{j≠y_i} f(x_i)_j] and ‖X‖_2 = √(∑_i ‖x_i‖_2^2).\n\nThe full proof and a generalization beyond spectral norms are relegated to the appendix, but a sketch is provided in Section 3, along with a lower bound. Section 3 also gives a discussion of related work: briefly, it’s essential to note that margin and Lipschitz-sensitive bounds have a long history in the neural networks literature [Bartlett, 1996, Anthony and Bartlett, 1999, Neyshabur et al., 2015]; the distinction here is the sensitivity to the spectral norm, and that there is no explicit appearance of combinatorial quantities such as numbers of parameters or layers (outside of log terms, and indices to summations and products).\nTo close, miscellaneous observations and open problems are collected in Section 4.\n2 Generalization case studies via margin distributions\nIn this section, we empirically study the generalization behavior of neural networks, via margin distributions and the generalization bound stated in Theorem 1.1.\n\n(a) Mnist is easier than cifar10.\n\n(b) Random mnist is as hard as random cifar10!\n\n(c) cifar100 is (almost) as hard as cifar10 with random labels!\n\n(d) Random inputs are harder than random labels.\n\nFigure 3: A variety of margin distributions. Axes are re-scaled in Figure 3a, but identical in the other subplots; the cifar10 (blue) and random cifar10 (green) distributions are the same each time.\n\nBefore proceeding with the plots, it’s a good time to give a more refined description of the margin distribution, one that is suitable for comparisons across datasets. Given n pattern/label pairs ((x_i, y_i))_{i=1}^n, with patterns as rows of matrix X ∈ R^{n×d}, and given a predictor F_A : R^d → R^k, the (normalized) margin distribution is the univariate empirical distribution of the labeled data points each transformed into a single scalar according to\n\n(x, y) ↦ (F_A(x)_y − max_{i≠y} F_A(x)_i) / (R_A ‖X‖_2 / n),\n\nwhere the spectral complexity R_A is from eq. (1.2). The normalization is thus derived from the bound in Theorem 1.1, but ignoring log terms.\nTaken this way, the margin distributions for two datasets can be interpreted as follows. Considering any fixed point on the horizontal axis, if the cumulative distribution of one density is lower than the other, then it corresponds to a lower right-hand side in Theorem 1.1. For no reason other than visual interpretability, the plots here will instead depict a density estimate of the margin distribution. The vertical and horizontal axes are rescaled in different plots, but the random and true cifar10 margin distributions are always the same.\nA little more detail about the experimental setup is as follows. All experiments were implemented in Keras [Chollet et al., 2015]. In order to minimize conflating effects of optimization and regularization, the optimization method was vanilla SGD with step size 0.01, and all regularization (weight decay, batch normalization, etc.) was disabled. “cifar” in general refers to cifar10; however, cifar100 will also be explicitly mentioned. 
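The scalar transformation above can be sketched as follows (a minimal numpy sketch under our own naming, assuming dense weight matrices, 1-Lipschitz nonlinearities ρ_i = 1, and zero reference matrices M_i = 0, the choice the paper uses for AlexNet; `margins` holds the raw margins f(x)_y − max_{j≠y} f(x)_j):

```python
import numpy as np

def spectral_complexity(weights):
    """R_A of eq. (1.2) with rho_i = 1 and M_i = 0:
    (prod_i ||A_i||_sigma) * (sum_i ||A_i^T||_{2,1}^{2/3} / ||A_i||_sigma^{2/3})^{3/2}."""
    spec = [np.linalg.norm(A, 2) for A in weights]             # spectral norms
    # ||A^T||_{2,1} is the sum of the l2 norms of the rows of A.
    two_one = [np.linalg.norm(A, axis=1).sum() for A in weights]
    prod = float(np.prod(spec))
    corr = sum((b / s) ** (2.0 / 3.0) for b, s in zip(two_one, spec)) ** 1.5
    return prod * corr

def normalized_margins(margins, weights, X):
    """Divide raw margins by R_A * ||X||_2 / n, as in the display above;
    ||X||_2 is entry-wise, i.e. the Frobenius norm of the data matrix."""
    n = len(margins)
    return margins / (spectral_complexity(weights) * np.linalg.norm(X) / n)
```

With this scaling, histograms of `normalized_margins` over a dataset are directly comparable across networks and datasets, which is how the plots in this section are produced.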
The network architecture is essentially AlexNet [Krizhevsky et al., 2012] with all normalization/regularization removed, and with no adjustments of any kind (even to the learning rate) across the different experiments.\nComparing datasets. A first comparison is of cifar10 and the standard mnist digit data. mnist is considered “easy”, since any of a variety of methods can achieve roughly 1% test error. The “easiness” is corroborated by Figure 3a, where the margin distribution for mnist places all its mass far to the right of the mass for cifar10. Interestingly, randomizing the labels of mnist, as in Figure 3b, results in a margin distribution to the left of not only cifar10, but also slightly to the left of (but close to) cifar10 with randomized labels.\n\n(a) Margins across epochs for cifar10.\n\n(b) Various levels of l2 regularization for cifar10.\n\nFigure 4\n\nNext, Figure 3c compares cifar10 and cifar100, where cifar100 uses the same input images as cifar10; indeed, cifar10 is obtained from cifar100 by collapsing the original 100 categories into 10 groups. Interestingly, cifar100, from the perspective of margin bounds, is just as difficult as cifar10 with random labels. This is consistent with the large observed test error on cifar100 (which has not been “optimized” in any way via regularization).\nLastly, Figure 3d replaces the cifar10 input images with random images sampled from Gaussians matching the first- and second-order image statistics (see [Zhang et al., 2017] for similar experiments).\nConvergence of margins. As was pointed out in Section 1, the weights of the neural networks do not seem to converge in the usual sense during training (the norms grow continually). 
However, as depicted in Figure 4a, the sequence of (normalized) margin distributions is itself converging.\nRegularization. As remarked in [Zhang et al., 2017], regularization only seems to bring minor benefits to test error (though adequate to be employed in all cutting edge results). This observation is certainly consistent with the margin distributions in Figure 4b, which do not improve (e.g., by shifting to the right) in any visible way under regularization. An open question, discussed further in Section 4, is to design regularization that improves margins.\n3 Analysis of margin bound\nThis section will sketch the proof of Theorem 1.1, give a lower bound, and discuss related work.\n3.1 Multiclass margin bound\nThe starting point of this analysis is a margin-based bound for multiclass prediction. To state the bound, first recall that the margin operator M : R^k × {1, . . . , k} → R is defined as M(v, y) := v_y − max_{i≠y} v_i, and define the ramp loss ℓ_γ : R → R_+ as\n\nℓ_γ(r) := 0 if r < −γ;  1 + r/γ if r ∈ [−γ, 0];  1 if r > 0,\n\nand ramp risk as R_γ(f) := E(ℓ_γ(−M(f(x), y))). Given a sample S := ((x_1, y_1), . . . , (x_n, y_n)), define an empirical counterpart R̂_γ of R_γ as R̂_γ(f) := n^{−1} ∑_i ℓ_γ(−M(f(x_i), y_i)); note that R_γ and R̂_γ respectively upper bound the probability and fraction of errors on the source distribution and training set. Lastly, given a set of real-valued functions H, define the Rademacher complexity as R(H|_S) := n^{−1} E sup_{h∈H} ∑_{i=1}^n ε_i h(x_i, y_i), where the expectation is over the Rademacher random variables (ε_1, . . . , ε_n), which are independent, uniform ±1-valued.\nWith this notation in place, the basic bound is as follows.\nLemma 3.1. Given functions F with F ∋ f : R^d → R^k and any γ > 0, define F_γ := {(x, y) ↦ ℓ_γ(−M(f(x), y)) : f ∈ F}. Then, with probability at least 1 − δ over a sample S of size n, every f ∈ F satisfies Pr[arg max_i f(x)_i ≠ y] ≤ R̂_γ(f) + 2R((F_γ)|_S) + 3√(ln(1/δ)/(2n)).\n\nThis bound is a direct consequence of standard tools in Rademacher complexity. In order to instantiate this bound, covering numbers will be used to directly upper bound the Rademacher complexity term R((F_γ)|_S). Interestingly, the choice of directly working in terms of covering numbers seems essential to providing a bound with no explicit dependence on k; by contrast, prior work primarily handles multiclass via a Rademacher complexity analysis on each coordinate of a k-tuple of functions, and pays a factor of √k [Zhang, 2004].\n3.2 Covering number complexity upper bounds\nThis subsection proves Theorem 1.1 via Lemma 3.1 by controlling, via covering numbers, the Rademacher complexity R((F_γ)|_S) for networks with bounded spectral complexity.\nThe notation here for (proper) covering numbers is as follows. Let N(U, ε, ‖·‖) denote the least cardinality of any subset V ⊆ U that covers U at scale ε with norm ‖·‖, meaning\n\nsup_{A∈U} min_{B∈V} ‖A − B‖ ≤ ε.\n\nChoices of U that will be used in the present work include both the image F|_S of data S under some function class F, as well as the conceptually simpler choice of a family of matrix products.\nThe full proof has the following steps: (I) A matrix covering bound for the affine transformation of each layer is provided in Lemma 3.2; handling whole layers at once allows for more flexible norms. 
(II) An induction on layers then gives a covering number bound for entire networks; this analysis is only sketched here for the special case of norms used in Theorem 1.1, but the full proof in the appendix culminates in a bound for more general norms (cf. Lemma A.7). (III) The preceding whole-network covering number leads to Theorem 1.1 via Lemma 3.1 and standard techniques.\nStep (I), matrix covering, is handled by the following lemma. The covering number considers the matrix product XA, where A will be instantiated as the weight matrix for a layer, and X is the data passed through all layers prior to the present layer.\nLemma 3.2. Let conjugate exponents (p, q) and (r, s) be given with p ≤ 2, as well as positive reals (a, b, ε) and positive integer m. Let matrix X ∈ R^{n×d} be given with ‖X‖_p ≤ b. Then\n\nln N({XA : A ∈ R^{d×m}, ‖A‖_{q,s} ≤ a}, ε, ‖·‖_2) ≤ ⌈a²b²m^{2/r} / ε²⌉ ln(2dm).\n\nThe proof relies upon the Maurey sparsification lemma [Pisier, 1980], which is stated in terms of sparsifying convex hulls, and in its use here is inspired by covering number bounds for linear predictors [Zhang, 2002]. To prove Theorem 1.1, this matrix covering bound will be instantiated for the case of ‖A‖_{2,1}. It is possible to instead scale with ‖A‖_2 and ‖X‖_2, but even for the case of the identity matrix X = I, this incurs an extra dimension factor. The use of ‖A‖_{2,1} here thus helps Theorem 1.1 avoid any appearance of W and L outside of log terms; indeed, the goal of covering a whole matrix at a time (rather than the more standard vector covering) was to allow this greater sensitivity and avoid combinatorial parameters.\nStep (II) above, the induction on layers, proceeds as follows. Let X_i denote the output of layer i (thus X_0 = X), and inductively suppose there exists a cover element X̂_i depending on covering matrices (Â_1, . . . , Â_{i−1}) chosen to cover weight matrices in earlier layers. Thanks to Lemma 3.2, there also exists Â_i so that ‖A_i X̂_i − Â_i X̂_i‖_2 ≤ ε_i. The desired cover element is thus X̂_{i+1} = σ_i(Â_i X̂_i) where σ_i is the nonlinearity in layer i; indeed, supposing σ_i is ρ_i-Lipschitz,\n\n‖X_{i+1} − X̂_{i+1}‖_2 ≤ ρ_i ‖A_i X_i − Â_i X̂_i‖_2 ≤ ρ_i (‖A_i X_i − A_i X̂_i‖_2 + ‖A_i X̂_i − Â_i X̂_i‖_2) ≤ ρ_i ‖A_i‖_σ ‖X_i − X̂_i‖_2 + ρ_i ε_i,\n\nwhere the first term is controlled with the inductive hypothesis. Since X̂_{i+1} depends on each choice (Â_1, . . . , Â_i), the cardinality of the full network cover is the product of the individual matrix covers.\nThe preceding proof had no sensitivity to the particular choice of norms; it merely required an operator norm on A_i, as well as some other norm that allows matrix covering. Such an analysis is presented in full generality in Appendix A.5. Specializing to the particular case of spectral norms and (2, 1) group norms leads to the following full-network covering bound.\nTheorem 3.3. Let fixed nonlinearities (σ_1, . . . , σ_L) and reference matrices (M_1, . . . , M_L) be given, where σ_i is ρ_i-Lipschitz and σ_i(0) = 0. Let spectral norm bounds (s_1, . . . , s_L) and matrix (2, 1) norm bounds (b_1, . . . , b_L) be given. Let data matrix X ∈ R^{n×d} be given, where the n rows correspond to data points. Let H_X denote the family of matrices obtained by evaluating X with all choices of network F_A: H_X := {F_A(X^⊤) : A = (A_1, . . . , A_L), ‖A_i‖_σ ≤ s_i, ‖A_i^⊤ − M_i^⊤‖_{2,1} ≤ b_i}, where each matrix has dimension at most W along each axis. Then for any ε > 0,\n\nln N(H_X, ε, ‖·‖_2) ≤ (‖X‖_2² ln(2W²) / ε²) (∏_{j=1}^L s_j² ρ_j²) (∑_{i=1}^L (b_i / s_i)^{2/3})³.\n\nWhat remains is (III): Theorem 3.3 can be combined with the standard Dudley entropy integral upper bound on Rademacher complexity (see e.g. Mohri et al. [2012]), which combined with Lemma 3.1 gives Theorem 1.1.\n3.3 Rademacher complexity lower bounds\nBy reduction to the linear case (i.e., removing all nonlinearities), it is easy to provide a lower bound on the Rademacher complexity of the networks studied here. Unfortunately, this bound only scales with the product of spectral norms, and not the other terms in R_A (cf. eq. (1.2)).\nTheorem 3.4. Consider the setting of Theorem 3.3, but all nonlinearities are the ReLU z ↦ max{0, z}, the output dimension is d_L = 1, and all non-output dimensions are at least 2 (and hence W ≥ 2). Let data S := (x_1, . . . , x_n) be collected into data matrix X ∈ R^{n×d}. Then there is a c such that for any scalar r > 0,\n\nR({F_A : A = (A_1, . . . , A_L), ∏_i ‖A_i‖_σ ≤ r}|_S) ≥ c ‖X‖_2 r.\n\nNote that, due to the nonlinearity, the lower bound should indeed depend on ∏_i ‖A_i‖_σ and not ‖∏_i A_i‖_σ; as a simple sanity check, there exist networks for which the latter quantity is 0, but the network does not compute the zero function.\n3.4 Related work\nTo close this section on proofs, it is a good time to summarize connections to existing literature. The algorithmic idea of large margin classifiers was introduced in the linear case by Vapnik [1982] (see also [Boser et al., 1992, Cortes and Vapnik, 1995]). Vapnik [1995] gave an intuitive explanation of the performance of these methods based on a sample-dependent VC-dimension calculation, but without generalization bounds. The first rigorous generalization bounds for large margin linear classifiers [Shawe-Taylor et al., 1998] required a scale-sensitive complexity analysis of real-valued function classes. At the same time, a large margins analysis was developed for two-layer networks [Bartlett, 1996], indeed with a proof technique that inspired the layer-wise induction used to prove Theorem 1.1 in the present work. Margin theory was quickly extended to many other settings (see for instance the survey by Boucheron et al. [2005]), one major success being an explanation of the generalization ability of boosting methods, which exhibit an explicit growth in the size of the function class over time, but a stable excess risk [Schapire et al., 1997]. The contribution of the present work is to provide a margin bound (and corresponding Rademacher analysis) that can be adapted to various operator norms at each layer. Additionally, the present work operates in the multiclass setting, and avoids an explicit dependence on the number of classes k, which seems to appear in prior work [Zhang, 2004, Tewari and Bartlett, 2007].\nThere are numerous generalization bounds for neural networks, including VC-dimension and fat-shattering bounds (many of these can be found in [Anthony and Bartlett, 1999]). 
Scale-sensitive analysis of neural networks started with [Bartlett, 1996], which can be interpreted in the present setting as utilizing data norm ‖·‖_∞ and operator norm ‖·‖_{∞→∞} (equivalently, the norm ‖A_i^⊤‖_{1,∞} on weight matrix A_i). This analysis can be adapted to give a Rademacher complexity analysis [Bartlett and Mendelson, 2002], and has been adapted to other norms [Neyshabur et al., 2015], although the ‖·‖_∞ setting appears to be necessary to avoid extra combinatorial factors. More work is still needed to develop complexity analyses that have matching upper and lower bounds, and also to determine which norms are well-adapted to neural networks as used in practice.\nThe present analysis utilizes covering numbers, and is most closely connected to earlier covering number bounds [Anthony and Bartlett, 1999, Chapter 12], themselves based on the earlier fat-shattering analysis [Bartlett, 1996]; however, the technique here of pushing an empirical cover through layers is akin to VC dimension proofs for neural networks [Anthony and Bartlett, 1999]. The use of Maurey’s sparsification lemma was inspired by linear predictor covering number bounds [Zhang, 2002].\nComparison to preprint. The original preprint of this paper [Bartlett et al., 2017] featured a slightly different version of the spectral complexity R_A, given by (∏_{i=1}^L ρ_i ‖A_i‖_σ) (∑_{i=1}^L ‖A_i − M_i‖_1^{2/3} / ‖A_i‖_σ^{2/3})^{3/2}. In the present version (1.2), each ‖A_i − M_i‖_1 term is replaced by ‖A_i^⊤ − M_i^⊤‖_{2,1}. This is a strict improvement since for any matrix A ∈ R^{d×m} one has ‖A‖_{2,1} ≤ ‖A‖_1, and in general the gap between these two norms can be as large as √d. On a related note, all of the figures in this paper use the ℓ_1 norm in the spectral complexity R_A instead of the (2, 1) norm. Variants of the experiments described in Section 2 were carried out using each of the l_1, (2, 1), and l_2 norms in the (∑_{i=1}^L (·)^{2/3})^{3/2} term with negligible difference in the results.\nSince spectrally-normalized margin bounds were first proposed in the preprint [Bartlett et al., 2017], subsequent works [Neyshabur et al., 2017, Neyshabur, 2017] re-derived a similar spectrally-normalized bound using the PAC-Bayes framework. Specifically, these works showed that R_A may be replaced by (up to log(W) factors): (∏_{i=1}^L ρ_i ‖A_i‖_σ) · L (∑_{i=1}^L (√W ‖A_i − M_i‖_2)² / ‖A_i‖_σ²)^{1/2}. Unfortunately, this bound never improves on Theorem 1.1, and indeed can be derived from it as follows. First, the dependence on the individual matrices A_i in the second term of this bound can be obtained from Theorem 1.1 because for any A ∈ R^{d×m} it holds that ‖A^⊤‖_{2,1} ≤ √d ‖A‖_2. Second, the functional form (∑_{i=1}^L (·)^{2/3})^{3/2} appearing in Theorem 1.1 may be replaced by the form L(∑_{i=1}^L (·)²)^{1/2} appearing above by using that ‖α‖_{2/3} ≤ L‖α‖_2 for any α ∈ R^L (this inequality following, for instance, from Jensen’s inequality).\n4 Further observations and open problems\nAdversarial examples. Adversarial examples are a phenomenon where the neural network predictions can be altered by adding seemingly imperceptible noise to an input [Goodfellow et al., 2014]. This phenomenon can be connected to margins as follows. The margin is nothing more than the distance an input must traverse before its label is flipped; consequently, low margin points are more susceptible to adversarial noise than high margin points. Concretely, taking the 100 lowest margin inputs from cifar10 and adding uniform noise at scale 0.15 yielded flipped labels on 5.86% of the images, whereas the same level of noise on high margin points yielded 0.04% flipped labels. Can the bounds here suggest a way to defend against adversarial examples?\nRegularization. It was observed in [Zhang et al., 2017] that explicit regularization contributes little to the generalization performance of neural networks. In the margin framework, standard weight decay (l_2) regularization seemed to have little impact on margin distributions in Section 2. On the other hand, in the boosting literature, special types of regularization were developed to maximize margins [Shalev-Shwartz and Singer, 2008]; perhaps a similar development can be performed here?\nSGD. The present analysis applies to predictors that have large margins; what is missing is an analysis verifying that SGD applied to standard neural networks returns large margin predictors! Indeed, perhaps SGD returns not simply large margin predictors, but predictors that are well-behaved in a variety of other ways that can be directly translated into refined generalization bounds.\nImprovements to Theorem 1.1. 
There are several directions in which Theorem 1.1 might be improved. Can a better choice of layer geometries (norms) yield better bounds on practical networks? Can the nonlinearities' worst-case Lipschitz constant be replaced with an (empirically) averaged quantity? Alternatively, can better lower bounds rule out these directions?

Rademacher vs. covering. Is it possible to prove Theorem 1.1 solely via Rademacher complexity, with no invocation of covering numbers?

Acknowledgements

The authors thank Srinadh Bhojanapalli, Ryan Jian, Behnam Neyshabur, Maxim Raginsky, Andrew J. Risteski, and Belinda Tzen for useful conversations and feedback. The authors thank Ben Recht for giving a provocative lecture at the Simons Institute, stressing the need for understanding of both generalization and optimization of neural networks. M.T. and D.F. acknowledge the use of a GPU machine provided by Karthik Sridharan and made possible by an NVIDIA GPU grant. D.F. acknowledges the support of the NDSEG fellowship. P.B. gratefully acknowledges the support of the NSF through grant IIS-1619362 and of the Australian Research Council through an Australian Laureate Fellowship (FL110100281) and through the ARC Centre of Excellence for Mathematical and Statistical Frontiers. The authors thank the Simons Institute for the Theory of Computing Spring 2017 program on the Foundations of Machine Learning. Lastly, the authors are grateful to La Burrita (both the north and the south Berkeley campus locations) for upholding the glorious tradition of the California Burrito.

References

Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

Peter Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.

Peter L. Bartlett.
For valid generalization the size of the weights is more important than the size of the network. In NIPS, 1996.

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463–482, Nov 2002.

Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, pages 144–152, New York, NY, USA, 1992. ACM. ISBN 0-89791-497-X.

Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.

François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.

Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In ICML, 2017.

Corinna Cortes and Vladimir N. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. 2014. arXiv:1412.6572 [stat.ML].

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

Behnam Neyshabur. Implicit regularization in deep learning. CoRR, abs/1709.01953, 2017. URL http://arxiv.org/abs/1709.01953.

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In COLT, 2015.

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro.
A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. CoRR, abs/1707.09564, 2017.

Gilles Pisier. Remarques sur un résultat non publié de B. Maurey. Séminaire Analyse fonctionnelle (dit), pages 1–12, 1980.

Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In ICML, pages 322–330, 1997.

Shai Shalev-Shwartz and Yoram Singer. On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms. In COLT, 2008.

J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inf. Theor., 44(5):1926–1940, September 1998.

Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007–1025, 2007.

Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.

Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. ICLR, 2017.

Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.

Tong Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225–1251, 2004.