{"title": "Modern Neural Networks Generalize on Small Data Sets", "book": "Advances in Neural Information Processing Systems", "page_first": 3619, "page_last": 3628, "abstract": "In this paper,  we  use a linear program to empirically decompose fitted neural networks into ensembles of low-bias sub-networks. We show that these sub-networks are relatively uncorrelated which leads to an  internal regularization process, very much like a random forest, which can explain why a neural network is surprisingly resistant to overfitting.  We then demonstrate this in practice by applying large neural networks, with hundreds of parameters per training observation, to a  collection of 116 real-world data sets from the UCI Machine Learning Repository.   This collection of data sets contains a much smaller number of training examples than the types of image classification tasks generally studied in the deep learning literature, as well as non-trivial label noise. We show  that even in this setting deep neural nets are capable of achieving superior classification accuracy without overfitting.", "full_text": "Modern Neural Networks Generalize on Small Data\n\nSets\n\nWharton School University of Pennsylvania\n\nWharton School University of Pennsylvania\n\nAbraham J. Wyner\n\nDepartment of Statistics\n\nPhiladelphia, PA 19104\n\najw@wharton.upenn.edu\n\nMatthew Olson\n\nDepartment of Statistics\n\nPhiladelphia, PA 19104\n\nmaolson@wharton.upenn.edu\n\nRichard Berk\n\nDepartment of Statistics\n\nWharton School University of Pennsylvania\n\nPhiladelphia, PA 19104\n\nberkr@wharton.upenn.edu\n\nAbstract\n\nIn this paper, we use a linear program to empirically decompose \ufb01tted neural net-\nworks into ensembles of low-bias sub-networks. We show that these sub-networks\nare relatively uncorrelated which leads to an internal regularization process, very\nmuch like a random forest, which can explain why a neural network is surprisingly\nresistant to over\ufb01tting. 
We then demonstrate this in practice by applying large\nneural networks, with hundreds of parameters per training observation, to a collection\nof 116 real-world data sets from the UCI Machine Learning Repository.\nThis collection of data sets contains a much smaller number of training examples\nthan the types of image classification tasks generally studied in the deep learning\nliterature, as well as non-trivial label noise. We show that even in this setting\ndeep neural nets are capable of achieving superior classification accuracy without\noverfitting.\n\n1 Introduction\n\nA recent focus in the deep learning community has been resolving the \u201cparadox\u201d that extremely large,\nhigh capacity neural networks are able to simultaneously memorize training data and achieve good\ngeneralization error. In a number of experiments, Zhang et al. [24] demonstrated that large neural\nnetworks were capable of both achieving state-of-the-art performance on image classification tasks\nand perfectly fitting training data with permuted labels. The apparent consequence of these\nobservations was to question traditional measures of complexity considered in statistical learning\ntheory.\nA great deal of recent research has aimed to explain the generalization ability of very high capacity\nneural networks [14]. A number of different streams have emerged in this literature [18, 16, 17].\nZhang et al. [24] suggest that stochastic gradient descent (SGD) may provide implicit\nregularization by encouraging low complexity solutions to the neural network optimization problem.\nAs an analogy, they point out that SGD applied to under-determined least squares problems produces\nsolutions with minimal $\\ell_2$ norm. Other streams of research have aimed at exploring the effect of\nmargins on generalization error [1, 16]. 
This line of thought is similar to the margin-based views of\nAdaBoost in the boosting literature that bound test performance in terms of the classifier\u2019s confidence\nin its predictions. Other research has investigated the sharpness of local minima found by training a\nneural network with SGD [9, 18]. The literature is extensive, and this review is far from complete.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nThe empirical investigations found in this literature tend to be concentrated on a small set of image\nclassification data sets. For instance, every research article mentioned above with an\nempirical component considers at least one of the following four data sets: MNIST, CIFAR-10,\nCIFAR-100, or ImageNet. In fact, in both NIPS 2017 and ICML 2017, over half of all accepted\npapers that mentioned \u201cneural networks\u201d in the abstract or title used one of these data sets. All of\nthese data sets share characteristics that may narrow their generality: similar problem domain, very\nlow noise rates, balanced classes, and relatively large training sizes (60k training points at minimum).\nIn this work, we consider a much richer class of small data sets from the UCI Machine Learning\nRepository in order to study the \u201cgeneralization paradox.\u201d These data sets contain features not found\nin the image classification data, such as small sample sizes and nontrivial, heteroscedastic label noise.\nAlthough not without its faults [20], the UCI repository provides a much needed alternative to the\nstandard image data sets.\nAs part of our investigation we will establish that large neural networks with tens of thousands\nof parameters are capable of achieving superior test accuracy on data sets with only hundreds of\nobservations. This is surprising, as it is commonly thought that deep neural networks require large\ndata sets to train properly [19, 7]. 
We believe that this gap in knowledge has led to the common\nmisbelief that unregularized, deep neural networks will necessarily over\ufb01t the types of data considered\nby small-data professions. In fact, we establish that with minimal tuning, deep neural networks\nare able to achieve performance on par with a random forest classi\ufb01er, which is considered to have\nstate-of-the-art performance on data sets from the UCI Repository [10].\nThe mechanism by which a random forest is able to generalize well on small data sets is straightfor-\nward: a random forest is an ensemble of low-bias, decorrelated trees. Randomization combined with\naveraging reduces the ensemble\u2019s variance, smoothing out the predictions from fully grown trees. It\nis clear that a neural network should excel at bias reduction, but the way in which it achieves variance\nreduction is much more mysterious. The same paradox has been examined at length in the literature\non AdaBoost, and in particular, it has been conjectured that the later stages of AdaBoost serve as a\nbagging phase which reduces variance [4, 5, 6, 23].\nOne of the central aims of this paper is to identify the variance stabilization that occurs when training\na deep neural network. To this end, the later half of this paper focuses on empirically decomposing a\nneural network into an ensemble of sub-networks, each of which achieves low training error and less\nthan perfect pairwise correlation. In this way, we view neural networks in a similar spirit to random\nforests. One can use this perspective as a window to viewing the success of recent residual network\narchitectures that are \ufb01t with hundreds or thousands of hidden layers [13, 22]. 
These deeper layers\nmight serve more as a bagging mechanism, rather than additional levels of feature hierarchy, as is\ncommonly cited for the success of deep networks [3].\nThe key takeaways from this paper are summarized as follows:\n\n\u2022 Large neural networks with hundreds of parameters per training observation are able to\ngeneralize well on small, noisy data sets.\n\u2022 Despite a bewildering set of tuning parameters [2], neural networks can be trained on small\ndata sets with minimal tuning.\n\u2022 Neural networks have a natural interpretation as an ensemble of low-bias classifiers whose\npairwise correlations are less than one. This ensemble view offers a novel perspective on\nthe generalization paradox found in the literature.\n\n2 Ensemble View of Deep Neural Networks\n\nIn this section, we establish that a neural network has a natural interpretation as an ensemble classifier.\nThis perspective allows us to borrow insights from random forests and model stacking to gain better\ninsight as to how a neural network with many parameters per observation is still able to generalize\nwell on small data sets. We also outline a procedure for decomposing fitted neural networks into\nensemble components of low-bias, decorrelated sub-networks. This procedure will be illustrated for a\nnumber of neural networks fit to real data in Section 3.3.\n\n2.1 Network Decomposition\n\nWe will begin by recalling some familiar notation for a feed-forward neural network in a binary\nclassification setting. In the case of a network with L hidden layers, each layer with M hidden nodes,\nwe may write the network\u2019s prediction at a point $x \\in \\mathbb{R}^p$ as\n\n$$z^{\\ell+1} = W^{\\ell+1} g(z^{\\ell}), \\quad \\ell = 0, \\ldots, L, \\qquad f(x) = \\sigma(z^{L+1}) \\tag{1}$$\n\nwhere \u03c3 is the sigmoid function, g is an activation function, $W^{L+1} \\in \\mathbb{R}^{1 \\times M}$, $W^{1} \\in \\mathbb{R}^{M \\times p}$, and\n$W^{\\ell} \\in \\mathbb{R}^{M \\times M}$ for $\\ell = 2, \\ldots, L$ (and $z^{0} \\equiv x$). Assume that any bias terms have been absorbed.\nFor the models considered in this paper, L = 10, M = 100, and g is taken to be the elu activation\nfunction [8]. It is also helpful to abuse notation a bit, and to write $z^{\\ell}(x)$ as the output at hidden layer\n$\\ell$ when x is fed through the network.\nThere are many ways to decompose a neural network into an ensemble of sub-networks: one way to\ndo this is at the final hidden layer. Let us fix an integer $K \\le M$ and consider a matrix $\\alpha \\in \\mathbb{R}^{M \\times K}$\nsuch that $\\sum_{k=1}^{K} \\alpha_{m,k} = W^{L+1}_{1,m}$ for $m = 1, \\ldots, M$. We can then write the final logit output as a\ncombination of units from the final hidden layer:\n\n$$z^{L+1}(x) = W^{L+1} g(z^{L}(x)) = \\sum_{m=1}^{M} W^{L+1}_{1,m} \\, g(z^{L}_{m}(x)) = \\sum_{m=1}^{M} \\sum_{k=1}^{K} \\alpha_{m,k} \\, g(z^{L}_{m}(x)) = \\sum_{k=1}^{K} \\sum_{m=1}^{M} \\alpha_{m,k} \\, g(z^{L}_{m}(x)) = \\sum_{k=1}^{K} f_k(x) \\tag{2}$$\n\nwhere $f_k(x) = \\sum_{m=1}^{M} \\alpha_{m,k} \\, g(z^{L}_{m}(x))$. In words, we have simply decomposed the final layer of\nthe network (at the logit level) into a sum of component networks. The weights that define the kth\nsub-network are stored in the kth column of \u03b1.\nWe will find in Section 3 that networks trained on a number of binary classification problems\nhave decompositions such that each $f_k$ fits the training data perfectly, and such that out-of-sample\ncorrelation between each $f_i$ and $f_j$ is relatively small. This situation is reminiscent of how a\nrandom forest works: by averaging the outputs of low-bias, low-correlation trees. 
We argue that it is\nthrough this implicit regularization mechanism that overparametrized neural networks are able to\nsimultaneously achieve zero training error and good generalization performance.\n\n2.2 Ensemble Hunting\n\nThe decomposition in Equation 2 is of course entirely open-ended without further restrictions on \u03b1,\nthe weights defining each sub-network. Broadly speaking, we want to search for a set of ensemble\ncomponents that are both diverse and low-bias. As a proxy for the latter, we impose the constraint\nthat each sub-network achieves very high training accuracy. We will restrict our analysis to networks\nthat obtain 100% training accuracy, and we will demand that each sub-network $f_k$ do so as well.\nAs a proxy for diversity, we desire that each sub-network in the ensemble should be built from a\ndifferent part of the full network, to the extent possible. One strategy for accomplishing\nthis is to require that the columns of \u03b1 are sparse and have non-overlapping components. In practice,\nwe found that the integer programs required to find these matrices were computationally intractable\nwhen coupled with the other constraints we consider. Our approach was simply to force a random\nselection of half the entries of each column to be zero through linear constraints.\nWe will now outline our ensemble search more precisely. For each of the K columns of \u03b1, we\nsampled integers $(m_{1,k}, \\ldots, m_{M/2,k})$ uniformly without replacement from the integers 1 to M, and\nwe included the linear constraints $\\alpha_{m_{j,k},k} = 0$. Thus, we constrained each sub-network $f_k$ to be\na weighted sum of no more than M/2 units from the final hidden layer. Under this scheme, two\nsub-networks share 0.25M hidden units on average, and any given hidden unit appears in about half\nof the sub-networks. 
We then used linear programming to find a matrix $\\alpha \\in \\mathbb{R}^{M \\times K}$ that satisfied the\nrequired constraints:\n\n$$\\sum_{k=1}^{K} \\alpha_{m,k} = W^{L+1}_{1,m}, \\quad 1 \\le m \\le M$$\n$$\\alpha_{m_{j,k},k} = 0, \\quad 1 \\le j \\le M/2, \\; 1 \\le k \\le K$$\n$$\\left( \\sum_{m=1}^{M} \\alpha_{m,k} \\, g(z^{L}_{m}(x_i)) \\right) y_i \\ge 0, \\quad 1 \\le i \\le n, \\; 1 \\le k \\le K$$\n\nIn summary, the first set of constraints ensures that the sub-networks $f_k$ decompose the full network,\nthat is, so that $z^{L+1}(x) = \\sum_{k=1}^{K} f_k(x)$ for all $x \\in \\mathbb{R}^p$. The second set of constraints zeros out\nhalf of the entries in each column of \u03b1, encouraging diversity among the sub-networks. The final set of\nconstraints ensures that each sub-network achieves 100% accuracy on the training data (non-negative\nmargin).\nNotice that we are simply looking for any feasible \u03b1 for this system, and these constraints are rough\nheuristics for ensuring that our sub-networks are diverse and low-bias. We could have additionally incorporated\na loss function which further penalized similarity among the columns of \u03b1, for example by maximizing\npairwise distances between them. However, most reasonable distance measures - such as norms\nor Kullback-Leibler divergence - are convex, and maximizing a convex function is difficult. Finally,\nwe emphasize that these constraints are very demanding, and rule out trivial decompositions. For\ninstance, in a number of experiments, we were not able to find any feasible \u03b1 for networks with a\nsmall number of hidden layers.\n\n2.3 Model Example\n\nAs a first application of ensemble hunting, we will consider a simulated model in two dimensions. 
The\nresponse y takes values \u22121 and 1 according to the following probability model, where $x \\in [-1, 1]^2$:\n\n$$p(y = 1 \\mid x) = \\begin{cases} 1 & \\text{if } \\|x\\|_2 \\le 0.3 \\\\ 0.15 & \\text{otherwise.} \\end{cases}$$\n\nThe Bayes rule for this model is to assign a label y = 1 if x is inside a circle centered at the origin\nwith radius 0.3, and to assign y = \u22121 otherwise. This rule implies a minimal possible classification\nerror rate of approximately 10%.\nOur training set consists of 400 (x, y) pairs, where the predictors x form an evenly spaced grid\nof values on $[-1, 1]^2$. We train two classifiers: a neural network with L = 10 hidden layers and\nM = 100 hidden nodes per layer, and a random forest. Both classifiers are trained until they\nachieve 100% training accuracy. The decision surfaces implied by these classifiers, as well as the\nBayes rule, are plotted in Figure 1. Evaluated on a hold-out set of size n = 10,000, the neural\nnetwork and random forest achieve test accuracies of 85% and 84%, respectively. Inspecting Figure 1,\nwe see that although each classifier fits the training data perfectly, the fits around the noise points\noutside the circle are confined to small neighborhoods. Our goal is to explain how these types of fits\noccur in practice.\nThe bottom two panels of Figure 1 show the decision surfaces produced by the random forest and a\nsingle random forest tree. Comparing these surfaces illustrates the power of ensembling: the single\ntree has lower accuracy than the full forest, as evidenced by the black patches outside of the circle.\nHowever, these patches are relatively small, and when all the trees are averaged together, most of\nthem are smoothed out toward the Bayes rule. Averaging works because the random forest trees are\ndiverse by construction. 
We would like to extend this line of reasoning to explain the fit produced by\nthe neural network.\n\nFigure 1: Panels: (a) Bayes Rule, (b) Neural Network, (c) Random Forest, (d) Single Tree. In each panel, the black region indicates points for which the classifier returns a label of\ny = 1. Training points with class label y = 1 are shown in red, and points with class label y = \u22121\nare shown in blue.\n\nUnlike a random forest, for which it is relatively easy to find sub-ensembles with low training error\nand low correlation, the corresponding search in the case of neural networks requires more work.\nUsing the ensemble-hunting procedure outlined in the previous section, we decompose the network\ninto K = 8 sub-networks $f_1, \\ldots, f_8$, and we plot their associated response surfaces in Figure 2.\nThe test accuracies of these sub-networks range from 79% to 82%, and every classifier fits the\ntraining data perfectly by construction. When examining these surfaces, it is curious that they all look\nsomewhat different, especially near the edges of the domain. Using the test set, we compute that\nthe average error correlation across sub-networks is 60%. We would like to emphasize that in this\nexample, the performance of both classifiers is actually quite good, especially compared to a\nsimpler procedure such as CART.\nOne surprising conclusion from this exercise was that the final layer of our fitted neural network\nwas highly redundant: the final layer could be decomposed into 8 distinct classifiers, each of which\nachieved 100% training accuracy. Common intuition for the success of neural networks suggests\nthat deep layers provide a rich hierarchy of features useful for classification [3]. For instance, when\ntraining a network to distinguish cats and dogs, earlier layers might be able to detect edges, while\nlater layers learn to detect ears or other complicated shapes. 
There are no complicated features to\nlearn in this example: the Bayes decision boundary is a circle, which can be easily constructed from\na network with one hidden layer and a handful of hidden nodes. Our analysis here suggests that these\nlater layers might serve mostly in a variance reducing capacity. The full network\u2019s test accuracy is\nhigher than any of its components, which is possible only because their mistakes have relatively low\ncorrelation.\n\nFigure 2: Decision surfaces implied from a decomposition of the neural network from Section 2.3. Panels (a) through (h) correspond to sub-networks $f_1$ through $f_8$.\n\n3 Empirical Results\n\nIn this section we will discuss the results from a large scale classification study comparing the\nperformance of a deep neural network and a random forest on a collection of 116 data sets from\nthe UCI Machine Learning Repository. We also discuss empirical ensemble decompositions for a\nnumber of trained neural networks from binary classification tasks.\n\n3.1 Data Summary\n\nThe collection of data sets we consider was first analyzed in a large-scale study comparing the\naccuracy of 147 different classifiers [10]. This collection is salient for several reasons. First,\nFern\u00e1ndez-Delgado et al. [10] found that random forests had the best accuracy of all the classifiers\nin the study (neural networks with many layers were not included). Thus, random forests can be\nconsidered a gold standard to which to compare competing classifiers. Second, this collection of data\nsets presents a very different test bed from the usual image and speech data sets found in the neural\nnetwork literature. In particular, these data sets span a wide variety of domains, including agriculture,\ncredit scoring, health outcomes, ecology, and engineering applications, to name a few. 
Importantly,\nthese data sets also reflect a number of realities found in data analysis in areas apart from computer\nscience, such as highly imbalanced classes, non-trivial Bayes error rates, and discrete (categorical)\nfeatures.\nThese data sets tend to have a small number of observations: the median number of training cases\nis 601, and the smallest data set has only 10 observations. It is interesting to note that these small\nsizes lead to highly overparameterized models: on average, each network has 155 parameters per\ntraining observation. The number of features ranges from 3 to 262, and half of the data sets include\ncategorical features. Finally, the number of classes ranges from 2 to 100. See Table 1 for a more\ndetailed summary of the data sets.\n\n      CATEGORICAL  CLASSES  FEATURES      N\nMIN             0        2         3     10\n25%             0        2         8    208\n50%             4        3        15    601\n75%             8        6        33   2201\nMAX           256      100       262  67557\n\nTable 1: Dataset Summary\n\n3.2 Experimental Setting\n\nFor each data set in our collection, we fit three classifiers: a random forest, and neural networks\nwith and without dropout. Importantly, the training process was completely non-adaptive. One of\nour goals was to illustrate the extent to which deep neural networks can be used as \u201coff-the-shelf\u201d\nclassifiers.\n\nFigure 3: Plots of cross-validated accuracy, panels (a) and (b). Each point corresponds to a data set.\n\n3.2.1 Implementation Details\n\nBoth networks shared the following architecture and training specifications:\n\n\u2022 10 hidden layers\n\u2022 100 nodes per layer\n\u2022 200 epochs of gradient descent using the Adam optimizer with a learning rate of 0.001 [15]\n\u2022 He-initialization for each hidden layer [12]\n\u2022 Elu activation function [8]\n\nOur architecture was chosen simply to ensure that each network had the capacity to achieve\nperfect training accuracy in most cases. In practice, we found that networks without dropout achieved\n100% training accuracy after a couple dozen epochs of training.\nThe only difference between the networks involved the presence of explicit regularization. More\nspecifically, one network was fit using dropout with a keep-rate of 0.85, while the other network\nwas fit without explicit regularization. Dropout can be thought of as a ridge-type penalty and is often\nused to mitigate overfitting [21]. There are other types of regularization not considered in this paper,\nincluding weight decay, early stopping, and max-norm constraints.\nEach random forest was fit with 500 trees, using defaults known to work well in practice [6]. In\nparticular, in each training instance, the number of randomly chosen variables to consider for splitting at each tree\nnode was $\\sqrt{p}$, where p is the number of input variables. Although we did not tune this parameter, the\nperformance we observe is very similar to that found in [10].\nWe turn first to Figure 3, which plots the cross-validated accuracy of the neural network classifiers\nand the random forest for each data set. In the first panel, we see that a random forest outperforms\nan unregularized neural network on 72 out of 116 data sets, although by a small margin. The mean\ndifference in accuracy is 2.4%, which is statistically significant at the 0.01 level by a Wilcoxon signed\nrank test. We notice that the gap between the two classifiers tends to be the smallest on data sets with\nlow Bayes error rate - those points in the upper right hand portion of the plot. We also notice that\nthere exist data sets for which either a random forest or a neural network significantly outperforms\nthe other. For example, a neural network achieves an accuracy of 90.3% on the monks-2 data set,\ncompared to 62.9% for a random forest. 
Incidentally, the base-rate for the majority class is 65.0%\nin this data set, indicating that the random forest has completely overfit the data.\nTurning to the second plot in Figure 3, we see that dropout improves the performance of the neural\nnetwork relative to the random forest. The mean difference between classifiers is now decreased to\n1.5%, which is still significant at the 0.01 level. The largest improvement in accuracy occurs in data\nsets for which the random forest achieved an accuracy of between 75% and 85%. It is also worth\nnoting that the performance difference between the neural networks with and without dropout is less\nthan one percent, and this difference is not statistically significant.\nWhile it might not be surprising that explicit regularization helps when fitting noisy data sets, it is\nsurprising that its absence does not lead to a complete collapse in performance. Neural networks\nwith many layers are dramatically more expressive than shallow networks [3], which suggests deeper\nnetworks should also be more susceptible to fitting noise when the Bayes error rate is non-zero. We\nfind this is not the case.\n\nFigure 4: The left figure displays the ratio of test error of the best sub-network to the full network,\nwhile the right figure displays the average error correlation among sub-networks.\n\n3.3 Ensemble Formation\n\nWe will now carry over the ensemble decomposition analysis from Section 2 to the binary classifiers\nfit in Section 3 using K = 10 sub-networks. We restrict our analysis to data sets with at least 500\nobservations, and for which the fitted neural network achieved 100% training accuracy. 
All results\nare reported over 25 randomly chosen 80-20 training-testing splits, and all metrics we report were\nobtained from the testing split.\nIn the \ufb01rst \ufb01gure of Figure 4, we report the test accuracy of the best sub-network as a fraction of the\nfull network. For example, in the statlog-australian-credit data set, the average value of this fraction\nwas 1.06 (over all 25 training-testing splits), meaning that the best sub-network outperformed the\nfull network by 6% on average. Conversely, in other data sets, such as tic-tac-toe data set, the best\nsub-network had worse performance than the full network across all training-testing splits.\nIn the second \ufb01gure of Figure 4, we report the error correlation averaged over the 10 sub-networks.\nEnsembles of classi\ufb01ers work best when the mistakes made by each component have low correlation\n- this is the precise motivation for the random forest algorithm. Strikingly, we observe that the errors\nmade by the sub-networks tend to have low correlation in every data set. Empirically, this illustrates\nthat a neural network can be decomposed as a collection of diverse sub-networks. In particular, the\nerror correlation in the tic-tac-toe data set is around 0.25, which reconciles our observation that the\nfull network performed better than the best sub-network.\n\n4 Discussion\n\nWe established that deep neural networks can generalize well on small, noisy data sets despite\nmemorizing the training data. In order to explain this behavior, we offered a novel perspective on\nneural networks which views them through the lens of ensemble classi\ufb01ers. Some commonly used\nwisdom when training neural networks is to choose an architecture which allows one suf\ufb01cient\ncapacity to \ufb01t the training data, and then scale back with regularization [2]. 
Contrast this with the\nmantra of a random forest: fit the training data perfectly with very deep decision trees, and then rely\non randomization and averaging for variance reduction (see footnote 1). We have shown that the same mantra can\nbe applied to a deep neural network. Rather than each layer presenting an ever increasing hierarchy\nof features, it is plausible that the final layers offer an ensemble mechanism.\nFinally, we remark that the small size of data sets we consider and relatively small network sizes have\nobvious computational advantages, which allow for rapid experimentation. Some of the recent norms\nproposed for explaining neural network generalization are intractable on networks with millions of\nparameters: Schatten norms, for example, require computing a full SVD [11]. In the settings we\nconsider here, such calculations are trivial. Future research should aim to discern a mechanism for\nthe decorrelation we observe, and to explore the link between decorrelation and generalization.\n\nReferences\n[1] Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. (2017). Spectrally-normalized margin bounds\nfor neural networks. In Advances in Neural Information Processing Systems, pages 6240\u20136249.\n\n[2] Bengio, Y. (2012). Practical recommendations for gradient-based training of deep architectures.\n\nIn Neural networks: Tricks of the trade, pages 437\u2013478. Springer.\n\n[3] Bengio, Y. et al. (2009). Learning deep architectures for ai. Foundations and Trends\u00ae in Machine\n\nLearning, 2(1):1\u2013127.\n\n[4] Breiman, L. (2000a). Some infinity theory for predictor ensembles. Technical report, Technical\n\nReport 579, Statistics Dept. UCB.\n\n[5] Breiman, L. (2000b). Special invited paper. additive logistic regression: A statistical view of\n\nboosting: Discussion. The annals of statistics, 28(2):374\u2013377.\n\n[6] Breiman, L. (2001). Random forests. Machine Learning, 45:5\u201332.\n\n[7] Chollet, F. (2017). 
Deep learning with python. Manning Publications Co.\n\n[8] Clevert, D.-A., Unterthiner, T., and Hochreiter, S. (2015). Fast and accurate deep network\n\nlearning by exponential linear units (elus). arXiv preprint arXiv:1511.07289.\n\n[9] Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. (2017). Sharp minima can generalize for deep\n\nnets. In International Conference on Machine Learning, pages 1019\u20131028.\n\n[10] Fern\u00e1ndez-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014). Do we need hundreds\nof classifiers to solve real world classification problems. J. Mach. Learn. Res, 15(1):3133\u20133181.\n\n[11] Golowich, N., Rakhlin, A., and Shamir, O. (2018). Size-independent sample complexity of\n\nneural networks. In Conference On Learning Theory, pages 297\u2013299.\n\n[12] He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing\nhuman-level performance on imagenet classification. In Proceedings of the IEEE international\nconference on computer vision, pages 1026\u20131034.\n\n[13] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In\nProceedings of the IEEE conference on computer vision and pattern recognition, pages 770\u2013778.\n\n[14] Kawaguchi, K., Kaelbling, L. P., and Bengio, Y. (2017). Generalization in deep learning. arXiv\n\npreprint arXiv:1710.05468.\n\n[15] Kingma, D. P. and Ba, J. L. (2015). Adam: A method for stochastic optimization. In Proceedings\n\nof the 3rd International Conference on Learning Representations (ICLR).\n\n[16] Liang, T., Poggio, T., Rakhlin, A., and Stokes, J. (2017). Fisher-rao metric, geometry, and\n\ncomplexity of neural networks. arXiv preprint arXiv:1711.01530.\n\n1 We do not intend to argue that one should always use a very deep network - the optimal architecture will of\ncourse vary from data set to data set. 
However, as in the case of a random forest, high capacity networks work\nsurprisingly well in a number of settings.\n\n9\n\n\f[17] Lin, H. W., Tegmark, M., and Rolnick, D. (2017). Why does deep and cheap learning work so\n\nwell? Journal of Statistical Physics, 168(6):1223\u20131247.\n\n[18] Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. (2017). Exploring generalization\n\nin deep learning. In Advances in Neural Information Processing Systems, pages 5947\u20135956.\n\n[19] Rolnick, D., Veit, A., Belongie, S., and Shavit, N. (2017). Deep learning is robust to massive\n\nlabel noise. arXiv preprint arXiv:1705.10694.\n\n[20] Segal, M. R. (2004). Machine learning benchmarks and random forest regression.\n\n[21] Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).\nDropout: a simple way to prevent neural networks from over\ufb01tting. Journal of machine learning\nresearch, 15(1):1929\u20131958.\n\n[22] Veit, A., Wilber, M. J., and Belongie, S. (2016). Residual networks behave like ensembles\nof relatively shallow networks. In Advances in Neural Information Processing Systems, pages\n550\u2013558.\n\n[23] Wyner, A. J., Olson, M., Bleich, J., and Mease, D. (2017). Explaining the success of adaboost\nand random forests as interpolating classi\ufb01ers. Journal of Machine Learning Research, 18(48):1\u2013\n33.\n\n[24] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep\nlearning requires rethinking generalization. International Conference on Learning Representations.\n\n10\n\n\f", "award": [], "sourceid": 1827, "authors": [{"given_name": "Matthew", "family_name": "Olson", "institution": "The Voleon Group"}, {"given_name": "Abraham", "family_name": "Wyner", "institution": "University of Pennsylvania"}, {"given_name": "Richard", "family_name": "Berk", "institution": null}]}