{"title": "Uniform convergence may be unable to explain generalization in deep learning", "book": "Advances in Neural Information Processing Systems", "page_first": 11615, "page_last": 11626, "abstract": "Aimed at explaining the surprisingly good generalization behavior of overparameterized deep networks, recent works have developed a variety of generalization bounds for deep learning,  all  based on the fundamental learning-theoretic technique of uniform convergence. While\nit is well-known that many of these existing bounds are numerically large, through numerous experiments, we bring to light a more concerning aspect of these bounds: \nin practice,  these bounds can {\\em increase} with the training dataset size. Guided by our observations,\nwe then present examples of overparameterized linear classifiers and neural networks trained by  gradient descent (GD) where uniform convergence provably cannot ``explain generalization'' -- even if we take into account the implicit bias of GD {\\em to the fullest extent possible}. More precisely, even if we consider only the set of classifiers output by GD, which have test errors less than some small $\\epsilon$ in our settings, we show that applying (two-sided) uniform convergence on this set of classifiers will yield only a vacuous generalization guarantee larger than $1-\\epsilon$. Through these findings,\nwe cast doubt on the power of uniform convergence-based generalization bounds to provide a complete picture of why overparameterized deep networks generalize well.", "full_text": "Uniform convergence may be unable to explain\n\ngeneralization in deep learning\n\nVaishnavh Nagarajan\n\nDepartment of Computer Science\n\nCarnegie Mellon University\n\nPittsburgh, PA\n\nvaishnavh@cs.cmu.edu\n\nJ. 
Zico Kolter
Department of Computer Science
Carnegie Mellon University &
Bosch Center for Artificial Intelligence
Pittsburgh, PA
zkolter@cs.cmu.edu

Abstract

Aimed at explaining the surprisingly good generalization behavior of overparameterized deep networks, recent works have developed a variety of generalization bounds for deep learning, all based on the fundamental learning-theoretic technique of uniform convergence. While it is well-known that many of these existing bounds are numerically large, through numerous experiments, we bring to light a more concerning aspect of these bounds: in practice, these bounds can increase with the training dataset size. Guided by our observations, we then present examples of overparameterized linear classifiers and neural networks trained by gradient descent (GD) where uniform convergence provably cannot "explain generalization" – even if we take into account the implicit bias of GD to the fullest extent possible. More precisely, even if we consider only the set of classifiers output by GD, which have test errors less than some small $\epsilon$ in our settings, we show that applying (two-sided) uniform convergence on this set of classifiers will yield only a vacuous generalization guarantee larger than $1 - \epsilon$. Through these findings, we cast doubt on the power of uniform convergence-based generalization bounds to provide a complete picture of why overparameterized deep networks generalize well.

1 Introduction

Explaining why overparameterized deep networks generalize well [28, 38] has become an important open question in deep learning. How is it possible that a large network can be trained to perfectly fit randomly labeled data (essentially by memorizing the labels), and yet the same network, when trained to perfectly fit real training data, generalizes well to unseen data?
This called for a "rethinking" of conventional, algorithm-independent techniques to explain generalization. Specifically, it was argued that learning-theoretic approaches must be reformed by identifying and incorporating the implicit bias/regularization of stochastic gradient descent (SGD) [6, 35, 30]. Subsequently, a huge variety of novel and refined, algorithm-dependent generalization bounds for deep networks have been developed, all based on uniform convergence, the most widely used tool in learning theory. The ultimate goal of this ongoing endeavor is to derive bounds on the generalization error that (a) are small, ideally non-vacuous (i.e., < 1), (b) reflect the same width/depth dependence as the generalization error (e.g., become smaller with increasing width, as has been surprisingly observed in practice), (c) apply to the network learned by SGD (without any modification or explicit regularization), and (d) increase with the proportion of randomly flipped training labels (i.e., increase with memorization).

While every bound meets some of these criteria (and sheds a valuable but partial insight into generalization in deep learning), there is no known bound that meets all of them simultaneously. While most bounds [29, 3, 11, 31, 27, 32] apply to the original network, they are neither numerically small for realistic dataset sizes, nor do they exhibit the desired width/depth dependencies (in fact, these bounds grow exponentially with the depth). The remaining bounds hold either only on a compressed network [2], or a stochastic network [21], or a network that has been further modified via optimization, or more than one of the above [8, 39]. Extending these bounds to the original network is understood to be highly non-trivial [27].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
While strong width-independent bounds have been derived for two-layer ReLU networks [23, 1], these rely on a carefully curated, small learning rate and/or large batch size. (We refer the reader to Appendix A for a tabular summary of these bounds.)

In our paper, we bring to light another fundamental issue with existing bounds. We demonstrate that these bounds violate another natural but largely overlooked criterion for explaining generalization: (e) the bounds should decrease with the dataset size at the same rate as the generalization error. In fact, we empirically observe that these bounds can increase with the dataset size, which is arguably more concerning than the fact that they are large for one specific dataset size.

Motivated by the seemingly insurmountable hurdles towards developing bounds satisfying all five of the above necessary criteria, we take a step back and examine how the underlying technique of uniform convergence may itself be inherently limited in the overparameterized regime. Specifically, we present examples of overparameterized linear classifiers and neural networks trained by GD (or SGD) where uniform convergence can provably fail to explain generalization. Intuitively, our examples highlight that overparameterized models trained by gradient descent can learn decision boundaries that are largely "simple" – and hence generalize well – but have "microscopic complexities" which cannot be explained away by uniform convergence. Thus our results call into question the active ongoing pursuit of using uniform convergence to fully explain generalization in deep learning.

Our contributions in more detail. We first show that, in practice, certain weight norms of deep ReLU networks, such as the distance from initialization, increase polynomially with the number of training examples (denoted by m).
We then show that, as a result, existing generalization bounds – all of which depend on such weight norms – fail to reflect a dependence on m even reasonably similar to that of the actual test error, violating criterion (e); for sufficiently small batch sizes, these bounds even grow with the number of examples. This observation uncovers a conceptual gap in our understanding of the puzzle, by pointing towards a source of vacuity unrelated to parameter count.

As our second contribution, we consider three example setups of overparameterized models trained by (stochastic) gradient descent – a linear classifier, a sufficiently wide neural network with ReLUs, and an infinite-width neural network with exponential activations (with the hidden layer weights frozen) – that learn some underlying data distribution with small generalization error (say, at most $\epsilon$). These settings also reproduce our observation that norms such as distance from initialization grow with the dataset size m. More importantly, we prove that, in these settings, any two-sided uniform convergence bound would yield a (nearly) vacuous generalization bound.

Notably, this vacuity holds even if we "aggressively" take implicit regularization into account while applying uniform convergence – described more concretely as follows. Recall that, roughly speaking, a uniform convergence bound essentially evaluates the complexity of a hypothesis class (see Definition 3.2). As suggested by Zhang et al. [38], one can tighten uniform convergence bounds by pruning the hypothesis class to remove extraneous hypotheses never picked by the learning algorithm for the data distribution of interest.
In our setups, even if we apply uniform convergence on the set of only those hypotheses picked by the learner whose test errors are all negligible (at most $\epsilon$), one can get no better than a nearly vacuous bound on the generalization error (one that is at least $1 - \epsilon$). In this sense, we say that uniform convergence provably cannot explain generalization in our settings. Finally, we note that while nearly all existing uniform convergence-based techniques are two-sided, we show that even PAC-Bayesian bounds, which are typically presented only as one-sided convergence, also boil down to nearly vacuous guarantees in our settings.

1.1 Related Work

Weight norms vs. training set size m. Prior works like Neyshabur et al. [30] and Nagarajan and Kolter [26] have studied the behavior of weight norms in deep learning. Although these works do not explicitly study the dependence of these norms on the training set size m, one can infer from their plots that weight norms of deep networks show some increase with m. Belkin et al. [4] reported a similar paradox in kernel learning, observing that norms which appear in kernel generalization bounds increase with m, and that this is due to noise in the labels. Kawaguchi et al. [19] showed that there exist linear models with arbitrarily large weight norms that can generalize well, although such weights are not necessarily found by gradient descent. We crucially supplement these observations in three ways. First, we empirically and theoretically demonstrate how, even with zero label noise (unlike [4]) and under gradient descent (unlike [19]), a significant level of m-dependence can arise in the weight norms – significant enough to make even the generalization bound grow with m. Second, we identify uniform convergence as the root cause behind this issue, and third, and most importantly, we provably demonstrate that this is so.

Weaknesses of Uniform Convergence.
Traditional wisdom is that uniform convergence bounds are a bad choice for complex classifiers like k-nearest neighbors because these hypothesis classes have infinite VC dimension (which motivated the need for stability-based generalization bounds in these cases [33, 5]). However, this sort of argument against uniform convergence may still leave one with the faint hope that, by aggressively pruning the hypothesis class (depending on the algorithm and the data distribution), one can achieve meaningful uniform convergence. In contrast, we seek to rigorously and thoroughly rule out uniform convergence in the settings we study. We do this by first defining the tightest form of uniform convergence in Definition 3.3 – one that lower bounds any uniform convergence bound – and then showing that even this bound is vacuous in our settings. Additionally, we note that we show this kind of failure of uniform convergence for linear classifiers, a much simpler model than k-nearest neighbors.

For deep networks, Zhang et al. [38] showed that applying uniform convergence on the whole hypothesis class fails, and that it should instead be applied in an algorithm-dependent way. Ours is a much different claim – that uniform convergence is inherently problematic, in that even the algorithm-dependent application would fail – casting doubt on the rich line of post-Zhang et al. [38] algorithm-dependent approaches. At the same time, we must add the disclaimer that our results do not preclude the possibility that uniform convergence may still work if GD is run with explicit regularization (such as weight decay). Such a regularized setting, however, is not the main focus of the generalization puzzle [38, 28].

Prior works [36, 34] have also focused on understanding uniform convergence for learnability of learning problems.
Roughly speaking, learnability is a strict notion that need not hold even when an algorithm generalizes well for simple distributions in a learning problem. While we defer the details of these works to Appendix I, we emphasize here that these results are orthogonal to (i.e., neither imply nor contradict) our results.

2 Existing bounds vs. training set size

As we stated in criterion (e) in the introduction, a fundamental requirement of a generalization bound, however numerically large the bound may be, is that it should vary inversely with the training dataset size m, like the observed generalization error. Such a requirement is satisfied even by standard parameter-count-based VC-dimension bounds, like $O(dh/\sqrt{m})$ for depth-$d$, width-$h$ ReLU networks [13]. Recent works have "tightened" the parameter-count-dependent terms in these bounds by replacing them with seemingly innocuous norm-based quantities; however, we show below that this has also inadvertently introduced training-set-size dependencies in the numerator, contributing to the vacuity of the bounds. With these dependencies, the generalization bounds even increase with the training dataset size for small batch sizes.

Setup and notations. We focus on fully connected networks of depth d = 5 and width h = 1024 trained on MNIST, although we consider other settings in Appendix B. We use SGD with learning rate 0.1 and batch size 1 to minimize the cross-entropy loss until 99% of the training data are classified correctly by a margin of at least $\gamma^\star = 10$; i.e., if we denote by $f(x)[y]$ the real-valued logit output (i.e., pre-softmax) on class y for an input x, we ensure that for 99% of the data (x, y), the margin $\Gamma(f(x), y) := f(x)[y] - \max_{y' \neq y} f(x)[y']$ is at least $\gamma^\star$.
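The margin-based stopping criterion above can be sketched in a few lines of numpy. This is our own minimal illustration (function names are ours, not from the paper's code): it computes the multiclass margin $\Gamma(f(x), y)$ from a batch of logits and checks the "99% of points at margin $\geq \gamma^\star$" condition.

```python
import numpy as np

def margins(logits, labels):
    """Multiclass margin Gamma(f(x), y) = f(x)[y] - max_{y' != y} f(x)[y']."""
    n = logits.shape[0]
    correct = logits[np.arange(n), labels]
    masked = logits.astype(float).copy()
    masked[np.arange(n), labels] = -np.inf      # exclude the true class
    runner_up = masked.max(axis=1)
    return correct - runner_up

def stopping_criterion_met(logits, labels, gamma_star=10.0, frac=0.99):
    """True once at least `frac` of the points have margin >= gamma_star."""
    return np.mean(margins(logits, labels) >= gamma_star) >= frac
```

Note that the margin is positive exactly when the point is classified correctly, so this criterion also implies at least 99% training accuracy.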
We emphasize that, from the perspective of generalization guarantees, this stopping criterion helps standardize training across different hyperparameter values, including different values of m [30]. Now, observe that for this particular stopping criterion, the test error empirically decreases with the size m as $1/m^{0.43}$, as seen in Figure 1 (third plot). However, we will see that the story is starkly different for the generalization bounds.

Norms grow with training set size m. Before we examine the overall generalization bounds themselves, we first focus on two quantities that recur in the numerator of many recent bounds: the $\ell_2$ distance of the weights from their initialization [8, 26] and the product of spectral norms of the weight

Figure 1: Experiments in Section 2: In the first figure, we plot (i) the $\ell_2$ distance of the network from the initialization and (ii) the $\ell_2$ distance between the weights learned on two random draws of training data starting from the same initialization. In the second figure, we plot the product of spectral norms of the weight matrices. In the third figure, we plot the test error. In the fourth figure, we plot the bounds from [31, 3]. Note that we have presented log-log plots, and the exponent of m can be recovered from the slope of these plots.

matrices of the network [31, 3]. We observe in Figure 1 (first two plots, blue lines) that both these quantities grow at a polynomial rate with m: the former at a rate of at least $m^{0.4}$ and the latter at a rate of $m$. Our observation is a follow-up to Nagarajan and Kolter [26], who argued that while the distance of the parameters from the origin grows with width as $\Omega(\sqrt{h})$, the distance from initialization is width-independent (and even decreases with width); hence, they concluded that incorporating the initialization would improve generalization bounds by a $\Omega(\sqrt{h})$ factor.
However, our observations imply that, even though distance from initialization would help explain generalization better in terms of width, it conspicuously fails to help explain generalization in terms of its dependence on m (and so does distance from origin, as we show in Appendix Figure 5).¹

Additionally, we also examine another quantity as an alternative to distance from initialization: the $\ell_2$ diameter of the parameter space explored by SGD. That is, for a fixed initialization and data distribution, we consider the set of all parameters learned by SGD across all draws of a dataset of size m; we then consider the diameter of the smallest ball enclosing this set. If this diameter exhibited better behavior than the above quantities, one could explain generalization better by replacing the distance from initialization with the distance from the center of this ball in existing bounds. As a lower bound on this diameter, we consider the distance between the weights learned on two independently drawn datasets from the given initialization. Unfortunately, we observe that even this quantity shows a similarly undesirable behavior with respect to m as distance from initialization (see Figure 1, first plot, orange line).

The bounds grow with training set size m. We now turn to evaluating existing guarantees from Neyshabur et al. [31] and Bartlett et al. [3]. As we note later, our observations apply to many other bounds too. Let $W_1, \ldots, W_d$ be the weights of the learned network (with $W_1$ being the weights adjacent to the inputs), $Z_1, \ldots, Z_d$ the random initialization, $\mathcal{D}$ the true data distribution and $S$ the training dataset. For all inputs $x$, let $\|x\|_2 \leq B$. Let $\|\cdot\|_2, \|\cdot\|_F, \|\cdot\|_{2,1}$ denote the spectral norm, the Frobenius norm and the matrix $(2,1)$-norm respectively; let $\mathbb{1}[\cdot]$ be the indicator function.
Recall that $\Gamma(f(x), y) := f(x)[y] - \max_{y' \neq y} f(x)[y']$ denotes the margin of the network on a datapoint. Then, for any constant $\gamma$, these generalization guarantees are written as follows, ignoring log factors:

$$\Pr_{\mathcal{D}}[\Gamma(f(x), y) \leq 0] \;\leq\; \frac{1}{m} \sum_{(x,y) \in S} \mathbb{1}[\Gamma(f(x), y) \leq \gamma] \;+\; \text{generalization error bound}. \qquad (1)$$

Here the generalization error bound is of the form $O\Big(\frac{B d \sqrt{h}}{\gamma \sqrt{m}} \prod_{k=1}^{d} \|W_k\|_2 \times \text{dist}\Big)$, where dist equals $\sqrt{\sum_{k=1}^{d} \frac{\|W_k - Z_k\|_F^2}{\|W_k\|_2^2}}$ in [31] and $\frac{1}{d\sqrt{h}} \Big(\sum_{k=1}^{d} \big(\frac{\|W_k - Z_k\|_{2,1}}{\|W_k\|_2}\big)^{2/3}\Big)^{3/2}$ in [3].

In our experiments, since we train the networks to fit at least 99% of the datapoints with a margin of 10, we set $\gamma = 10$ in the above bounds so that the first (train error) term on the right hand side of Equation 1 becomes a small value of at most 0.01. We then plot in Figure 1 (fourth plot) the second term above, namely the generalization error bounds, and observe that all these bounds grow with the sample size m as $\Omega(m^{0.68})$, thanks to the fact that the terms in the numerator of these bounds grow with m.

¹It may be tempting to think that our observations are peculiar to the cross-entropy loss, for which the optimization algorithm diverges. But we observe that even for the squared-error loss (Appendix B), where the optimization procedure does not diverge to infinity, distance from initialization grows with m.
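The two recurring numerator quantities, and the "dist" term of [31], are straightforward to compute from a list of weight matrices. Below is our own minimal sketch (the function names are ours); `np.linalg.norm(W, 2)` gives the spectral norm and `np.linalg.norm(W, "fro")` the Frobenius norm.

```python
import numpy as np

def distance_from_init(Ws, Zs):
    """l2 distance of all parameters from init: sqrt(sum_k ||W_k - Z_k||_F^2)."""
    return np.sqrt(sum(np.linalg.norm(W - Z, "fro") ** 2 for W, Z in zip(Ws, Zs)))

def spectral_product(Ws):
    """Product of layer-wise spectral norms: prod_k ||W_k||_2."""
    return float(np.prod([np.linalg.norm(W, 2) for W in Ws]))

def neyshabur_dist(Ws, Zs):
    """dist term in the style of [31]: sqrt(sum_k ||W_k - Z_k||_F^2 / ||W_k||_2^2)."""
    return np.sqrt(sum(np.linalg.norm(W - Z, "fro") ** 2 / np.linalg.norm(W, 2) ** 2
                       for W, Z in zip(Ws, Zs)))
```

Tracking these quantities across training runs with different m is exactly how the growth rates reported above (at least $m^{0.4}$ and $m$, respectively) can be measured.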
Since we are free to plug any $\gamma$ into Equation 1, one may wonder whether there exists a better choice of $\gamma$ for which we would observe a smaller increase with m (since the plotted terms depend inversely on $\gamma$). However, in Appendix Figure 6 we establish that even for larger values of $\gamma$, this m-dependence remains. Also note that, although we do not plot the bounds from [27, 11], these have nearly identical norms in their numerator, and so one would not expect those bounds to show radically better behavior with respect to m. Finally, we defer experiments conducted for other varied settings, and the neural network bound from [32], to Appendix B.

While the bounds might show better m-dependence in other settings – indeed, for larger batch sizes, we show in Appendix B that the bounds behave better – we believe that the egregious breakdown of these bounds in this setting (and in many other hyperparameter settings presented in Appendix B) must imply fundamental issues with the bounds themselves. While this may be addressed to some extent with a better understanding of implicit regularization in deep learning, we regard our observations as a call for taking a step back and clearly understanding any inherent limitations in the theoretical tool underlying all these bounds, namely uniform convergence.²

3 Provable failure of uniform convergence

Preliminaries. Let $\mathcal{H}$ be a class of hypotheses mapping from $\mathcal{X}$ to $\mathbb{R}$, and let $\mathcal{D}$ be a distribution over $\mathcal{X} \times \{-1, +1\}$. The loss function we mainly care about is the 0-1 error; but since a direct analysis of the uniform convergence of the 0-1 error is hard, sometimes a more general margin-based surrogate of this error (also called the ramp loss) is analyzed for uniform convergence. Specifically, given the classifier's logit output $y' \in \mathbb{R}$ and the true label $y \in \{-1, +1\}$, define

$$L^{(\gamma)}(y', y) = \begin{cases} 1 & yy' \leq 0 \\ 1 - \frac{yy'}{\gamma} & yy' \in (0, \gamma) \\ 0 & yy' \geq \gamma. \end{cases}$$

Note that $L^{(0)}$ is the 0-1 error, and $L^{(\gamma)}$ an upper bound on the 0-1 error. We define for any $L$ the expected loss as $L_{\mathcal{D}}(h) := \mathbb{E}_{(x,y) \sim \mathcal{D}}[L(h(x), y)]$ and the empirical loss on a dataset $S$ of $m$ datapoints as $\hat{L}_S(h) := \frac{1}{m} \sum_{(x,y) \in S} L(h(x), y)$. Let $\mathcal{A}$ be the learning algorithm and let $h_S$ be the hypothesis output by the algorithm on a dataset $S$ (assume that any training-data-independent randomness, such as the initialization/data-shuffling, is fixed).

For a given $\delta \in (0, 1)$, the generalization error of the algorithm is essentially a bound on the difference between the error of the hypothesis $h_S$ learned on a training set $S$ and the expected error over $\mathcal{D}$, that holds with high probability of at least $1 - \delta$ over the draws of $S$. More formally:

Definition 3.1. The generalization error of $\mathcal{A}$ with respect to loss $L$ is the smallest value $\epsilon_{\mathrm{gen}}(m, \delta)$ such that: $\Pr_{S \sim \mathcal{D}^m}\big[ L_{\mathcal{D}}(h_S) - \hat{L}_S(h_S) \leq \epsilon_{\mathrm{gen}}(m, \delta) \big] \geq 1 - \delta$.

To theoretically bound the generalization error of the algorithm, the most common approach is to provide a two-sided uniform convergence bound on the hypothesis class used by the algorithm, where, for a given draw of $S$, we look at convergence for all the hypotheses in $\mathcal{H}$ instead of just $h_S$:

Definition 3.2.
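The ramp loss and the empirical loss above translate directly into code. Below is a minimal sketch of our own (names are ours), with `gamma == 0` handled as the plain 0-1 error:

```python
import numpy as np

def ramp_loss(y_logit, y, gamma):
    """Ramp loss L^(gamma): 1 if y*y' <= 0, 0 if y*y' >= gamma, linear in between."""
    margin = y * y_logit
    if gamma == 0:                                # L^(0) is exactly the 0-1 error
        return float(margin <= 0)
    return float(np.clip(1.0 - margin / gamma, 0.0, 1.0))

def empirical_loss(h, S, gamma):
    """hat L_S(h) = (1/m) * sum_{(x, y) in S} L^(gamma)(h(x), y)."""
    return float(np.mean([ramp_loss(h(x), y, gamma) for x, y in S]))
```

Since the ramp loss dominates the 0-1 error pointwise, a uniform convergence bound for $L^{(\gamma)}$ immediately bounds the expected 0-1 error in terms of the empirical margin loss, as in Equation 1.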
The uniform convergence bound with respect to loss $L$ is the smallest value $\epsilon_{\mathrm{unif}}(m, \delta)$ such that: $\Pr_{S \sim \mathcal{D}^m}\big[ \sup_{h \in \mathcal{H}} \big| L_{\mathcal{D}}(h) - \hat{L}_S(h) \big| \leq \epsilon_{\mathrm{unif}}(m, \delta) \big] \geq 1 - \delta$.

²Side note: before we proceed to the next section, where we blame uniform convergence for the above problems, we briefly note that we considered another, simpler possibility. Specifically, we hypothesized that, for some (not all) existing bounds, the above problems could arise from an issue that does not involve uniform convergence, which we term pseudo-overfitting. Roughly speaking, a classifier pseudo-overfits when its decision boundary is simple but its real-valued output has large "bumps" around some or all of its training datapoints. As discussed in Appendix C, deep networks pseudo-overfit only to a limited extent, and hence pseudo-overfitting does not provide a complete explanation for the issues faced by these bounds.

Tightest algorithm-dependent uniform convergence. The bound given by $\epsilon_{\mathrm{unif}}$ can be tightened by ignoring the many extraneous hypotheses in $\mathcal{H}$ never picked by $\mathcal{A}$ for a given simple distribution $\mathcal{D}$. This is typically done by focusing on a norm-bounded class of hypotheses that the algorithm $\mathcal{A}$ implicitly restricts itself to. Let us take this to the extreme by applying uniform convergence on "the smallest possible class" of hypotheses, namely, only those hypotheses that are picked by $\mathcal{A}$ under $\mathcal{D}$, excluding everything else. Observe that pruning the hypothesis class any further would not imply a bound on the generalization error, and hence applying uniform convergence on this aggressively pruned hypothesis class would yield the tightest possible uniform convergence bound.
Recall that we care about this formulation because our goal is to rigorously and thoroughly rule out the possibility that any kind of uniform convergence bound, however cleverly applied, can explain generalization in our settings of interest (which we will describe later).

To formally capture this bound, it is helpful to first rephrase the above definition of $\epsilon_{\mathrm{unif}}$: we can say that $\epsilon_{\mathrm{unif}}(m, \delta)$ is the smallest value for which there exists a set of sample sets $S_\delta \subseteq (\mathcal{X} \times \{-1, 1\})^m$ such that $\Pr_{S \sim \mathcal{D}^m}[S \in S_\delta] \geq 1 - \delta$ and, furthermore, $\sup_{S \in S_\delta} \sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - \hat{L}_S(h)| \leq \epsilon_{\mathrm{unif}}(m, \delta)$. Observe that this definition is equivalent to Definition 3.2. Extending this rephrased definition, we can define the tightest uniform convergence bound by replacing $\mathcal{H}$ here with only those hypotheses that are explored by the algorithm $\mathcal{A}$ under the datasets belonging to $S_\delta$:

Definition 3.3. The tightest algorithm-dependent uniform convergence bound with respect to loss $L$ is the smallest value $\epsilon_{\mathrm{unif\text{-}alg}}(m, \delta)$ for which there exists a set of sample sets $S_\delta$ such that $\Pr_{S \sim \mathcal{D}^m}[S \in S_\delta] \geq 1 - \delta$ and, if we define the space of hypotheses explored by $\mathcal{A}$ on $S_\delta$ as $\mathcal{H}_\delta := \bigcup_{S \in S_\delta} \{h_S\} \subseteq \mathcal{H}$, the following holds: $\sup_{S \in S_\delta} \sup_{h \in \mathcal{H}_\delta} \big| L_{\mathcal{D}}(h) - \hat{L}_S(h) \big| \leq \epsilon_{\mathrm{unif\text{-}alg}}(m, \delta)$.

In the following sections, through examples of overparameterized models trained by GD (or SGD), we argue that even the above tightest algorithm-dependent uniform convergence can fail to explain generalization; i.e., in these settings, even though $\epsilon_{\mathrm{gen}}$ is smaller than a negligible value $\epsilon$, we show that $\epsilon_{\mathrm{unif\text{-}alg}}$ is large (specifically, at least $1 - \epsilon$).
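For intuition, Definition 3.3 can be evaluated mechanically in a toy, finite setting. The sketch below is purely illustrative and entirely ours (the paper works with continuous distributions): given a candidate finite $S_\delta$ and a deterministic learner, it forms $\mathcal{H}_\delta$ and returns the double supremum.

```python
def unif_alg_bound(S_delta, learner, test_loss, emp_loss):
    """Toy evaluation of the quantity in Definition 3.3 for a finite S_delta:
    H_delta = {h_S : S in S_delta}; returns sup_{S in S_delta} sup_{h in H_delta}
    |L_D(h) - hat L_S(h)|."""
    H_delta = [learner(S) for S in S_delta]
    return max(abs(test_loss(h) - emp_loss(h, S))
               for S in S_delta for h in H_delta)
```

The key point visible even in this toy form: the supremum pairs every hypothesis with every retained dataset, so a single "bad" dataset left inside $S_\delta$ for some other hypothesis is enough to blow up the bound.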
Before we delve into these examples, below we quickly outline the key mathematical idea by which uniform convergence is made to fail.

Consider a scenario where the algorithm generalizes well; i.e., for every training set $\tilde{S}$, $h_{\tilde{S}}$ has zero error on $\tilde{S}$ and small test error. While this means that $h_{\tilde{S}}$ has small error on random draws of a test set, it may still be possible that for every such $h_{\tilde{S}}$, there exists a corresponding "bad" dataset $\tilde{S}'$ – one that is not random, but rather dependent on $\tilde{S}$ – on which $h_{\tilde{S}}$ has a large empirical error (say 1). Unfortunately, uniform convergence runs into trouble when dealing with such bad datasets. Specifically, as we can see from the above definition, uniform convergence demands that $|L_{\mathcal{D}}(h_{\tilde{S}}) - \hat{L}_S(h_{\tilde{S}})|$ be small on all datasets in $S_\delta$, which excludes only a $\delta$ fraction of the datasets. While it may be tempting to think that we can somehow exclude the bad datasets as part of this $\delta$ fraction, there is a significant catch here: we cannot carve out a $\delta$ fraction specific to each hypothesis; we can ignore only a single chunk of $\delta$ mass common to all the hypotheses in $\mathcal{H}_\delta$. This restriction turns out to be a tremendous bottleneck: despite ignoring this $\delta$ fraction, for most $h_{\tilde{S}} \in \mathcal{H}_\delta$, the corresponding bad set $\tilde{S}'$ would still be left in $S_\delta$. Then, for all such $h_{\tilde{S}}$, $L_{\mathcal{D}}(h_{\tilde{S}})$ would be small but $\hat{L}_S(h_{\tilde{S}})$ large; we can then set the $S$ inside the $\sup_{S \in S_\delta}$ to be $\tilde{S}'$ to conclude that $\epsilon_{\mathrm{unif\text{-}alg}}$ is indeed vacuous. This is the kind of failure we will demonstrate for a high-dimensional linear classifier in the following section, for a ReLU neural network in Section 3.2, and for an infinitely wide exponential-activation neural network in Appendix F – all trained by GD or SGD.³

Note: Our results about the failure of uniform convergence hold even for bounds that output a different value for each hypothesis. In this case, the tightest uniform convergence bound for a given hypothesis would be at least as large as $\sup_{S \in S_\delta} |L_{\mathcal{D}}(h_{\tilde{S}}) - \hat{L}_S(h_{\tilde{S}})|$, which by a similar argument would be vacuous for most draws of the training set $\tilde{S}$. We discuss this in more detail in Appendix G.4.

3.1 High-dimensional linear classifier

Why a linear model? Although we present a neural network example in the next section, we first emphasize why it is also important to understand how uniform convergence could fail for linear classifiers trained using GD. First, it is more natural to expect uniform convergence to yield poorer bounds for more complicated classifiers; linear models are arguably the simplest of classifiers, and hence showing failure of uniform convergence in these models is, in a sense, the most interesting. Secondly, recent works (e.g., [17]) have shown that as the width of a deep network goes to infinity, under some conditions, the network converges to a high-dimensional linear model (trained on a high-dimensional transformation of the data) – thus making the study of high-dimensional linear models relevant to us. Note that our example is not aimed at modeling the setup of such linearized neural networks. However, it does provide valuable intuition about the mechanism by which uniform convergence fails, and we show how this extends to neural networks in the later sections.

³In Appendix H, the reader can find a more abstract setting illustrating this mathematical idea more clearly.

Setup. Let each input be a $K + D$ dimensional vector (think of $K$ as a small constant and $D$ as much larger than $m$). We denote any input $x$ by $(x_1, x_2)$, where $x_1 \in \mathbb{R}^K$ and $x_2 \in \mathbb{R}^D$.
Let the centers of the (two) classes be determined by an arbitrary vector u ∈ R^K such that ‖u‖_2 = 1/√m. Let D be such that the label y has equal probability of being +1 and −1, and x1 = 2·y·u, while x2 is sampled independently from a spherical Gaussian, N(0, (32/D)·I).⁴ Note that the distribution is linearly separable based on the first few (K) dimensions. For the learning algorithm A, consider a linear classifier with weights w = (w1, w2) and whose output is h(x) = w1·x1 + w2·x2. Assume the weights are initialized to the origin. Given a dataset S, A takes a gradient step of learning rate 1 to maximize y·h(x) for each (x, y) ∈ S. Hence, regardless of the batch size, the learned weights would satisfy w1 = 2mu and w2 = Σ_i y^(i) x2^(i). Note that effectively w1 is aligned correctly along the class boundary while w2 is high-dimensional Gaussian noise. It is fairly simple to show that this algorithm achieves zero training error for most draws of the training set. At the same time, for this setup, we have the following lower bound on uniform convergence for the L(γ) loss:⁵

Theorem 3.1. For any ε, δ > 0 with δ ≤ 1/4, when D = Ω(max(m ln(m/δ), m ln(1/ε))), for any γ ∈ [0, 1], the L(γ) loss satisfies ε_gen(m, δ) ≤ ε, while ε_unif-alg(m, δ) ≥ 1 − ε. Furthermore, for all γ ≥ 0, for the L(γ) loss, ε_unif-alg(m, δ) ≥ 1 − ε_gen(m, δ).

In other words, even the tightest uniform convergence bound is nearly vacuous despite good generalization. In order to better appreciate the implications of this statement, it will be helpful to look at the bound a standard technique would yield here. For example, the Rademacher complexity of the class of ℓ2-norm bounded linear classifiers would yield a bound of the form O(‖w‖_2/(γ*√m)), where γ* is the margin on the training data. In this setup, the weight norm grows with dataset size as ‖w‖_2 = Θ(√m) (which follows from the fact that w2 is a Gaussian with m/D variance along each of the D dimensions) and γ* = Θ(1). Hence, the Rademacher bound here would evaluate to a constant much larger than ε. One might persist and think that, perhaps, the characterization of w as bounded in ℓ2 norm does not fully capture the implicit bias of the algorithm. Are there other properties of the Gaussian w2 that one could take into account to identify an even smaller class of hypotheses for which uniform convergence may work after all? Unfortunately, our statement rules this out: even after fixing w1 to the learned value (2mu), and for any possible 1 − δ truncation of the Gaussian w2, the resulting pruned class of weights, despite all of them having test error less than ε, would give only nearly vacuous uniform convergence bounds, as ε_unif-alg(m, δ) ≥ 1 − ε.

Proof outline. We now provide an outline of our argument for Theorem 3.1, deferring the proof to the appendix. First, the small generalization (and test) error arises from the fact that w1 is aligned correctly along the true boundary; at the same time, the noisy part of the classifier, w2, is poorly aligned with at least 1 − ε mass of the test inputs, and hence does not dominate the output of the classifier on test data, preserving the good fit of w1 on the test data. On the other hand, at a very high level, under the purview of uniform convergence, we can argue that the noise vector w2 is effectively stripped of its randomness.
This misleads uniform convergence into believing that the D noisy dimensions (where D > m) contribute meaningfully to the representational complexity of the classifier, thereby giving nearly vacuous bounds. We describe this more concretely below.

As a key step in our argument, we show that w.h.p. over draws of S, even though the learned classifier h_S correctly classifies most of the randomly picked test data, it completely misclassifies a "bad" dataset, namely S′ = {((x1, −x2), y) | (x, y) ∈ S}, which is the noise-negated version of S. Now recall that to compute ε_unif-alg one has to begin by picking a sample set space S_δ of mass 1 − δ. We first argue that for any choice of S_δ, there must exist S* such that all of the following four events hold: (i) S* ∈ S_δ, (ii) the noise-negated S′* ∈ S_δ, (iii) h_S* has test error less than ε, and (iv) h_S* completely misclassifies S′*. We prove the existence of such an S* by arguing that over draws from D^m, there is non-zero probability of picking a dataset that satisfies these four conditions. Note that our argument for this crucially makes use of the fact that we have designed the "bad" dataset in a way that it has the same distribution as the training set, namely D^m. Finally, for a given S_δ, if we have an S* satisfying (i) to (iv), we can prove our claim as ε_unif-alg(m, δ) = sup_{S ∈ S_δ} sup_{h ∈ H_δ} |L_D(h) − L̂_S(h)| ≥ |L_D(h_S*) − L̂_{S′*}(h_S*)| = |L_D(h_S*) − 1| ≥ 1 − ε.

⁴ As noted in Appendix G.3, it is easy to extend the discussion by assuming that x1 is spread out around 2yu.
⁵ While it is obvious from Theorem 3.1 that the bound is nearly vacuous for any γ ∈ [0, 1], in Appendix G.1 we argue that even for any γ ≥ 1, the guarantee is nearly vacuous, although in a slightly different sense.

Remark 3.1. Our analysis depends on the fact that ε_unif-alg is a two-sided convergence bound (which is what existing techniques bound), and our result would not apply to hypothetical one-sided uniform convergence bounds. While PAC-Bayes based bounds are typically presented as one-sided bounds, we show in Appendix J that even these are lower-bounded by the two-sided ε_unif-alg. To the best of our knowledge, it is non-trivial to make any of these tools purely one-sided.

Remark 3.2. The classifier modified by setting w2 ← 0 has small test error and also enjoys non-vacuous bounds, as it has very few parameters. However, such a bound would not fully explain why the original classifier generalizes well. One might then wonder if such a bound could be extended to the original classifier, as was explored in Nagarajan and Kolter [27] for deep networks. Our result implies that no such extension is possible in this particular example.

3.2 ReLU neural network

We now design a non-linearly separable task (with no "noisy" dimensions) where a sufficiently wide ReLU network trained in the standard manner, as in the experiments of Section 2, leads to failure of uniform convergence. For our argument, we will rely on a classifier trained empirically, in contrast to our linear examples, where we rely on an analytically derived expression for the learned classifier. Thus, this section illustrates that the effects we modeled theoretically in the linear classifier are indeed reflected in typical training settings, even though here it is difficult to precisely analyze the learning process. We also refer the reader to Appendix F, where we present an example of a neural network with exponential activation functions for which we do derive a closed-form expression.

Setup.
We consider 1000-dimensional data, where the two classes are distributed uniformly over two origin-centered hyperspheres of radius 1 and 1.1, respectively. We vary the number of training examples from 4k to 65k (thus ranging through typical dataset sizes like that of MNIST). Observe that, compared to the linear example, this data distribution is more realistic in two ways. First, we do not have specific dimensions in the data that are noisy, and second, the data dimensionality here is a constant less than m. Given samples from this distribution, we train a two-layer ReLU network with h = 100k hidden units to minimize cross-entropy loss using SGD with learning rate 0.1 and batch size 64. We train the network until 99% of the data is classified by a margin of 10.

As shown in Figure 2 (blue line), in this setup the 0-1 error (i.e., L(0)), as approximated by the test set, decreases with m ∈ [2^12, 2^16] at the rate of O(m^{-0.5}). Now, to prove failure of uniform convergence, we empirically show that a completely misclassified "bad" dataset S′ can be constructed in a manner similar to that of the previous example. In this setting, we pick S′ by simply projecting every training datapoint on the inner hypersphere onto the outer one and vice versa, and then flipping the labels. Then, as shown in Figure 2 (orange line), S′ is completely misclassified by the learned network. Furthermore, as in the previous example, we have S′ ~ D^m because the distributions are uniform over the hyperspheres. Having established these facts, the rest of the argument follows as in the previous setting, implying failure of uniform convergence, as in Theorem 3.1, here too.

In Figure 2 (right), we visualize how the learned boundaries are skewed around the training data in a way that S′ is misclassified.
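The construction of S′ above can be sketched in a few lines of numpy: sample the two-hypersphere data, then swap each training point onto the other sphere (keeping its direction) and flip its label. This is a minimal sketch of the data and the projection only; training the 100k-unit network itself is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, m = 1000, 4096            # data dimension and a typical training-set size
r_inner, r_outer = 1.0, 1.1    # radii of the two class hyperspheres

def sample(n):
    """n labelled points, uniform on two origin-centered hyperspheres."""
    y = rng.choice([-1.0, 1.0], size=n)
    x = rng.standard_normal((n, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)        # uniform direction
    x *= np.where(y == 1.0, r_inner, r_outer)[:, None]   # label picks the radius
    return x, y

x, y = sample(m)

# Bad dataset S': move each point onto the *other* hypersphere and flip its
# label. Scaling by r_inner*r_outer / ||x||^2 maps radius r_inner to r_outer
# and vice versa, so S' has exactly the distribution D^m.
x_bad = x * (r_inner * r_outer) / np.sum(x**2, axis=1, keepdims=True)
y_bad = -y
```

A network that fits (x, y) with a large margin but assigns each point of x_bad the label of its nearby training neighbor, rather than y_bad, misclassifies all of S′, which is the behavior reported in Figure 2.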
Note that S′ is misclassified even when it has as many as 60k points, and even though the network was not explicitly trained to misclassify those points. Intuitively, this demonstrates that the boundary learned by the ReLU network has sufficient complexity to hurt uniform convergence while not affecting the generalization error, at least in this setting. We discuss the applicability of this observation to other hyperparameter settings in Appendix G.2.

Figure 2: In the first figure, we plot the error of the ReLU network on test data and on the bad dataset S′, for the task described in Section 3.2. The second and third images correspond to the decision boundary learned in this task, in the 2D quadrant containing two training datapoints (depicted as × and •). The black lines correspond to the two hyperspheres, while the brown and blue regions correspond to the class output by the classifier. Here, we observe that the boundaries are skewed around the training data in a way that misclassifies the nearest point from the opposite class (corresponding to S′, which is not explicitly marked). The fourth image corresponds to two random (test) datapoints, where the boundaries are fairly random and very likely to be located in between the hyperspheres (better confirmed by the low test error).

Deep learning conjecture. Extending the above insights more generally, we conjecture that in overparameterized deep networks, SGD finds a fit that is simple at a macroscopic level (leading to good generalization) but also has many microscopic fluctuations (hurting uniform convergence). To make this more concrete, for illustration, consider the high-dimensional linear model that sufficiently wide networks have been shown to converge to [17].
That is, roughly, these networks can be written as h(x) = wᵀφ(x), where φ(x) is a rich high-dimensional representation of x computed from many random features (chosen independently of the training data). Inspired by our linear model in Section 3.1, we conjecture that the weights w learned on a dataset S can be expressed as w1 + w2, where w1ᵀφ(x) dominates the output on most test inputs and induces a simple decision boundary. That is, it may be possible to apply uniform convergence to the function w1ᵀφ(x) to obtain a small generalization bound. On the other hand, w2 corresponds to meaningless signals that gradient descent gathered from the high-dimensional representation of the training set S. Crucially, these signals would be specific to S, and hence not likely to correlate with most of the test data; i.e., w2ᵀφ(x) would be negligible on most test data, thereby not affecting the generalization error significantly. However, w2ᵀφ(x) can still create complex fluctuations on the boundary, in low-probability regions of the input space (whose locations would depend on S, as in our examples). As we argued, this can lead to failure of uniform convergence. Perhaps existing works that have achieved strong uniform convergence bounds on modified networks may have done so by implicitly suppressing w2, whether by compression, optimization, or stochasticization. Revisiting these works may help verify our conjecture.

4 Conclusion and Future Work

A growing variety of uniform convergence based bounds [29, 3, 11, 2, 31, 8, 39, 23, 1, 27, 32] have sought to explain generalization in deep learning. While these may provide partial intuition about the puzzle, we ask a critical, high-level question: by pursuing this broad direction, is it possible to achieve the grand goal of a small generalization bound that shows appropriate dependence on the sample size, width, depth, label noise, and batch size?
We cast doubt on this by first empirically showing that existing bounds can, surprisingly, increase with training set size for small batch sizes. We then presented example setups, including that of a ReLU neural network, for which uniform convergence provably fails to explain generalization, even after taking implicit bias into account.

Future work in understanding implicit regularization in deep learning may be better guided with our knowledge of the sample-size dependence in the weight norms. To understand generalization, it may also be promising to explore other learning-theoretic techniques like, say, algorithmic stability [9, 12, 5, 34]; our linear setup might also inspire new tools. Overall, through our work, we call for going beyond uniform convergence to fully explain generalization in deep learning.

Acknowledgements. Vaishnavh Nagarajan is supported by a grant from the Bosch Center for AI.

References

[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv:1811.04918, 2018. URL http://arxiv.org/abs/1811.04918.

[2] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In The 35th International Conference on Machine Learning, ICML, 2018.

[3] Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 2017.

[4] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, 2018.

[5] Olivier Bousquet and André Elisseeff. Stability and generalization.
Journal of Machine Learning Research, 2, 2002.

[6] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. International Conference on Learning Representations (ICLR), 2018.

[7] Felix Dräxler, Kambis Veschgini, Manfred Salmhofer, and Fred A. Hamprecht. Essentially no barriers in neural network energy landscape. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, 2018.

[8] Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, 2017.

[9] Vitaly Feldman and Jan Vondrák. Generalization bounds for uniformly stable algorithms. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 2018.

[10] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P. Vetrov, and Andrew G. Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 2018.

[11] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. Computational Learning Theory, COLT 2018, 2018.

[12] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the 33rd International Conference on Machine Learning, ICML, 2016.

[13] Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension bounds for piecewise linear neural networks. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, 2017.

[14] Geoffrey E.
Hinton and Drew van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory, COLT, 1993.

[15] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1), 1997.

[16] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Advances in Neural Information Processing Systems, 2017.

[17] Arthur Jacot, Clément Hongler, and Franck Gabriel. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 2018.

[18] Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos J. Storkey. Width of minima reached by stochastic gradient descent is influenced by learning rate to batch size ratio. In Artificial Neural Networks and Machine Learning - ICANN 2018 - 27th International Conference on Artificial Neural Networks, 2018.

[19] Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. 2017. URL http://arxiv.org/abs/1710.05468.

[20] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. International Conference on Learning Representations (ICLR), 2017.

[21] John Langford and Rich Caruana. (Not) bounding the true error. In Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001], 2001.

[22] John Langford and John Shawe-Taylor. PAC-Bayes & margins.
In Advances in Neural Information Processing Systems 15 [Neural Information Processing Systems, NIPS 2002], 2002.

[23] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 2018.

[24] David McAllester. Simplified PAC-Bayesian margin bounds. In Learning Theory and Kernel Machines. Springer Berlin Heidelberg, 2003.

[25] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2012.

[26] Vaishnavh Nagarajan and J. Zico Kolter. Generalization in deep networks: The role of distance from initialization. Deep Learning: Bridging Theory and Practice Workshop at Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 2017.

[27] Vaishnavh Nagarajan and J. Zico Kolter. Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience. In International Conference on Learning Representations (ICLR), 2019.

[28] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. International Conference on Learning Representations Workshop Track, 2015.

[29] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Proceedings of The 28th Conference on Learning Theory, COLT, 2015.

[30] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. Advances in Neural Information Processing Systems, 2017.

[31] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro.
A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. International Conference on Learning Representations (ICLR), 2018.

[32] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations (ICLR), 2019.

[33] W. H. Rogers and T. J. Wagner. A finite sample distribution-free performance bound for local discrimination rules. The Annals of Statistics, 6(3), 1978.

[34] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11, 2010.

[35] Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. International Conference on Learning Representations (ICLR), 2018.

[36] V. N. Vapnik and A. Ya. Chervonenkis. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. 1971.

[37] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.

[38] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. International Conference on Learning Representations (ICLR), 2017.

[39] Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. Non-vacuous generalization bounds at the ImageNet scale: a PAC-Bayesian compression approach. In International Conference on Learning Representations (ICLR), 2019.