{"title": "Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias", "book": "Advances in Neural Information Processing Systems", "page_first": 9334, "page_last": 9345, "abstract": "Despite the phenomenal success of deep neural networks in a broad range of learning tasks, there is a lack of theory to understand the way they work. In particular, Convolutional Neural Networks (CNNs) are known to perform much better than Fully-Connected Networks (FCNs) on spatially structured data: the architectural structure of CNNs benefits from prior knowledge on the features of the data, for instance their translation invariance. The aim of this work is to \nunderstand this fact through the lens of dynamics in the loss landscape. \n\nWe introduce a method that maps a CNN to its equivalent FCN (denoted as eFCN). Such an embedding enables the comparison of CNN and FCN training dynamics directly in the FCN space.\nWe use this method to test a new training protocol, which consists in training a CNN, embedding it to FCN space at a certain ``relax time'', then resuming the training in FCN space. We observe that for all relax times, the deviation from the CNN subspace is small, and the final performance reached by the eFCN is higher than that reachable by a standard FCN of same architecture. More surprisingly, for some intermediate relax times, the eFCN outperforms the CNN it stemmed, by combining the prior information of the CNN and the expressivity of the FCN in a complementary way. The practical interest of our protocol is limited by the very large size of the highly sparse eFCN. However, it offers interesting insights into the persistence of architectural bias under stochastic gradient dynamics. It shows the existence of some rare basins in the FCN loss landscape associated with very good generalization. 
These can only be accessed thanks to the CNN prior, which helps navigate the landscape during the early stages of optimization.", "full_text": "Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias

Stéphane d'Ascoli
stephane.dascoli@ens.fr
Laboratoire de Physique de l'Ecole normale supérieure ENS, Université PSL, CNRS, Sorbonne Université, Université Paris-Diderot, Sorbonne Paris Cité, Paris, France

Levent Sagun
leventsagun@fb.com
Facebook AI Research, Facebook, Paris, France

Joan Bruna
bruna@cims.nyu.edu
Courant Institute of Mathematical Sciences and Center for Data Science, New York University, New York City, United States

Giulio Biroli
giulio.biroli@lps.ens.fr
Laboratoire de Physique de l'Ecole normale supérieure ENS, Université PSL, CNRS, Sorbonne Université, Université Paris-Diderot, Sorbonne Paris Cité, Paris, France

Abstract

Despite the phenomenal success of deep neural networks in a broad range of learning tasks, there is a lack of theory to understand the way they work. In particular, Convolutional Neural Networks (CNNs) are known to perform much better than Fully-Connected Networks (FCNs) on spatially structured data: the architectural structure of CNNs benefits from prior knowledge about the features of the data, for instance their translation invariance. The aim of this work is to understand this fact through the lens of dynamics in the loss landscape.
We introduce a method that maps a CNN to its equivalent FCN (denoted as eFCN). Such an embedding enables the comparison of CNN and FCN training dynamics directly in the FCN space. We use this method to test a new training protocol, which consists of training a CNN, embedding it into FCN space at a certain “relax time”, then resuming the training in FCN space.
We observe that for all relax times, the deviation from the CNN subspace is small, and the final performance reached by the eFCN is higher than that reachable by a standard FCN of the same architecture. More surprisingly, for some intermediate relax times, the eFCN outperforms the CNN it stemmed from, by combining the prior information of the CNN and the expressivity of the FCN in a complementary way. The practical interest of our protocol is limited by the very large size of the highly sparse eFCN. However, it offers interesting insights into the persistence of architectural bias under stochastic gradient dynamics. It shows the existence of some rare basins in the FCN loss landscape associated with very good generalization. These can only be accessed thanks to the CNN prior, which helps navigate the landscape during the early stages of optimization.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

In the classic dichotomy between model-based and data-based approaches to solving complex tasks, Convolutional Neural Networks (CNNs) correspond to a particularly efficient tradeoff. CNNs capture key geometric prior information for spatial/temporal tasks through the notion of local translation invariance. Yet, they combine this prior with a high flexibility that allows them to be scaled to millions of parameters and to leverage large datasets with gradient-descent learning strategies, typically operating in the 'interpolating' regime, i.e. where the training data is fit perfectly.
Such a regime challenges the classic notion of model selection in statistics, whereby increasing the number of parameters trades off bias against variance [38].
On the one hand, several recent works studying the role of optimization in this tradeoff argue that model size is not always a good predictor of overfitting [30, 38, 29, 18, 7], and consider instead other complexity measures of the function class, which favor CNNs due to their smaller complexity [14]. On the other hand, authors have also considered geometric aspects of the energy landscape, such as the width of basins [24], as a proxy for generalisation. However, these properties of the landscape do not appear to account for the benefits associated with specific architectures. Additionally, considering the implicit bias due to the optimization scheme [35, 20] is not enough to justify the performance gains without considering the architectural bias. Despite the important insights on the role of over-parametrization in optimization [13, 3, 36], the architectural bias prevails as a major factor in explaining good generalization in visual classification tasks: over-parametrized CNN models generalize well, but large neural networks without any convolutional constraints do not.
In this work, we attempt to further disentangle the bias stemming from the architecture from that of the optimization scheme by showing that the CNN prior plays a favorable role mostly at the beginning of optimization. Geometrically, the CNN prior defines a low-dimensional subspace within the space of parameters of generic Fully-Connected Networks (FCNs) (this subspace is linear since the CNN constraints of weight sharing and locality are linear; see Figure 1 for a sketch of the core idea).
Even though the optimization scheme is able to minimize the training loss with or without the constraints (for sufficiently over-parametrized models [19, 38]), the CNN subspace provides a “better route” that navigates the loss landscape towards solutions with better generalization performance.
Yet, surprisingly, we observe that leaving this subspace at an appropriate time can result in an FCN with an equivalent or even better generalization than a CNN. Our numerical experiments suggest that the CNN subspace as well as its vicinity are good candidates for high-performance solutions. Furthermore, we observe a threshold distance from the CNN space beyond which the performance drops back down to the vanilla FCN accuracy level. Our results offer a new perspective on the success of the convolutional architecture: within FCN loss landscapes there exist rare basins associated with very good generalization, characterised not so much by their width as by their distance to the CNN subspace. These can be accessed thanks to the CNN prior, and are otherwise missed in the usual training of FCNs.
The rest of the paper is structured as follows. Section 2 discusses prior work relating architecture and optimization biases. Section 3 presents our CNN to FCN embedding algorithm and training procedure, and Section 4 describes and analyses the experiments performed on the CIFAR-10 dataset [25]. We conclude in Section 5 by describing theoretical setups compatible with our observations and consequences for practical applications.

2 Related Work

The relationship between CNNs and FCNs is an instance of trading off prior information against expressivity within neural networks. There is abundant literature exploring the relationship between different neural architectures, for different purposes.
One can roughly classify these works according to whether they attempt to map a large model into a smaller one, or vice-versa.
In the first category, one of the earliest efforts to introduce structure within FCNs with the goal of improving generalization was Nowlan and Hinton's soft weight sharing networks [32], in which the weights are regularized via a Mixture of Gaussians. Another highly popular line of work attempts to distill the “knowledge” of a large model (or an ensemble of models) into a smaller one [8, 22, 4], with the goal of improving both computational efficiency and generalization performance. Network

Figure 1: White background: ambient, M-dimensional, fully-connected space. Yellow subspace: linear, m-dimensional convolutional subspace. We have m ≪ M. Red manifold: (near-)zero-loss, (approximate-)solution set for a given training dataset. Note that it is a nontrivial manifold due to continuous symmetries (see also the related work section on mode connectivity) and it intersects with the CNN subspace. Blue path: a CNN initialized and trained with the convolutional constraints. Purple path: an FCN model initialized and trained without the constraints. Green paths: snapshots taken along the CNN training that are lifted to the ambient FCN space, and trained in the FCN space without the constraints.

pruning [21] and the recent “Lottery Ticket Hypothesis” [15] are other remarkable instances of the benefits of model reduction.
In the second category, which is more directly related to our work, authors have attempted to build larger models by embedding small architectures into larger ones, such as the Net2Net model [10] or more elaborate follow-ups [34].
In these works, however, the motivation is to accelerate learning by some form of knowledge transfer between the small model and the large one, whereas our motivation is to understand the specific role of architectural bias in generalization.
In the infinite-width context, [31] study the role of translation equivariance of CNNs compared to FCNs. They find that in this limit, weight sharing does not play any role in the Bayesian treatment of CNNs, despite providing significant improvement in the finite-channel setup.
The links between generalization error and the geometry and topology of the optimization landscape have also been extensively studied in recent times. [14] compare generalisation bounds between CNNs and FCNs, establishing a sample complexity advantage in the case of linear activations. [28, 27] obtain specific generalisation bounds for CNN architectures. [9] propose a different optimization objective, whereby a bilateral filtering of the landscape favors dynamics into wider valleys. [24] explore the link between sharpness of local minima and generalization through Hessian analysis [33], and [37] argue in terms of the volume of basins of attraction. The characterization of the loss landscape along paths connecting different models has been studied recently, e.g. in [16], [17], and [12]. The existence of rare basins leading to better generalization was found and highlighted in simple models in [5, 6]. The role of the CNN prior within the ambient FCN loss landscape and its implications for generalization properties were not considered in any of these works. In the following we address this point by building on these previous investigations of the landscape properties.

3 CNN to FCN Embedding

In both FCNs and CNNs, each feature of a layer is calculated by applying a non-linearity to a weighted sum over the features of the previous layer (or over all the pixels of the image, for the first layer).
CNNs are a particular type of FCN which make use of two key ingredients to reduce their number of redundant parameters: locality and weight sharing.
Locality: In FCNs, the sum is taken over all the features of the previous layer. In locally connected networks (LCNs), locality is imposed by restricting the sum to a small receptive field (a box of adjacent features of the previous layer). The set of weights of this restricted sum is called a filter. For a given receptive field, one may create multiple features (or channels) by using several different filters. This procedure makes use of the spatial structure of the data and reduces the number of fitting parameters.
Weight sharing: CNNs are a particular type of LCN where all the filters of a given channel use the same set of weights. This procedure makes use of the somewhat universal properties of feature-extracting filters, such as edge detectors, and reduces the number of fitting parameters even more drastically.
When mapping a CNN to its equivalent FCN (eFCN), we obtain very sparse (due to locality) and redundant (due to weight sharing) weight matrices (see Sec. A of the Supplemental Material for some intuition on the mapping). This typically results in a large memory overhead, as the eFCN of a simple CNN can take several orders of magnitude more space in memory. Therefore, we present the core ideas on a simple 3-layer CNN on CIFAR-10 [25], and show similar results for AlexNet on CIFAR-100 in Sec. B of the Supplemental Material.
In the mapping¹, all layers apart from the convolutional layers (ReLU, Dropout, MaxPool and fully-connected) are left unchanged except for proper reshaping.
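As a minimal illustration of this mapping, the following sketch builds the dense weight matrix equivalent to a single-channel, stride-1, no-padding convolution and checks that it is functionally identical to the convolution itself (a hypothetical toy example, not the paper's full multi-channel implementation; the sparsity comes from locality, the repeated filter copies from weight sharing):

```python
import numpy as np

def conv2d_valid(x, w):
    """Direct 'valid' 2D cross-correlation of image x with filter w."""
    H, W = x.shape
    k = w.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out

def conv_to_dense(w, H, W):
    """Lift a k x k filter to the (sparse, redundant) dense weight matrix of
    the equivalent fully-connected layer acting on a flattened H x W image."""
    k = w.shape[0]
    Ho, Wo = H - k + 1, W - k + 1
    M = np.zeros((Ho * Wo, H * W))       # one row per output position
    for i in range(Ho):
        for j in range(Wo):
            row = np.zeros((H, W))
            row[i:i + k, j:j + k] = w    # weight sharing: same filter in every row
            M[i * Wo + j] = row.ravel()  # locality: zeros outside the receptive field
    return M

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
w = rng.normal(size=(3, 3))
dense = conv_to_dense(w, 8, 8)
# Functional equivalence of the lifted layer and the convolution:
assert np.allclose(dense @ x.ravel(), conv2d_valid(x, w).ravel())
# Most entries of the dense matrix are exactly zero (9 non-zeros per row of 64 columns).
sparsity = (dense == 0).mean()
```

The assertion mirrors the identity f(θ_CNN) = f(Φ(θ_CNN)) stated in Sec. 4, restricted to one linear layer.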
Each convolutional layer is mapped to a fully-connected layer. As a result, for a given CNN, we obtain its eFCN counterpart with an end-to-end fully-connected architecture which is functionally identical to the original CNN.

4 Experiments

We are given input-label pairs for a supervised classification task, (x, y), with x ∈ R^d and y the index of the correct class for a given image x. The network, parametrized by θ, outputs ŷ = f_x(θ). To distinguish between different architectures we denote the CNN weights by θ_CNN ∈ R^m and the eFCN weights by θ_eFCN ∈ R^M. Let us denote the embedding function described in Sec. 3 by Φ : R^m → R^M, where m ≪ M, and with a slight abuse of notation use f(·) for both CNN and eFCN. Dropping the explicit input dependency for simplicity, we have:

f(θ_CNN) = f(Φ(θ_CNN)) = f(θ_eFCN).

For the experiments, we prepare the CIFAR-10 dataset for training without data augmentation. The optimizer is set to stochastic gradient descent with a constant learning rate of 0.1 and a minibatch size of 250. We turn off the momentum and weight decay to focus simply on the stochastic gradient dynamics, and we do not adjust the learning rate throughout the training process. In the following, we focus on a convolutional architecture with 3 layers, 64 channels at each layer followed by ReLU and MaxPooling operators, and a single fully connected layer that outputs prediction probabilities. In our experience, this VanillaCNN strikes a good balance of simplicity and performance, in that its equivalent FCN version does not suffer from memory issues yet it significantly outperforms any FCN model trained from scratch. We study the following protocol:

1. Initialize the VanillaCNN at θ_CNN_init and train for 150 epochs. At the end of training, θ_CNN_final reaches ∼72% test accuracy.

2. Along the way, save k snapshots of the weights at logarithmically spaced epochs {t_0 = 0, t_1, ..., t_{k−2}, t_{k−1} = 150}. This provides k CNN points denoted by {θ_CNN_{t_0} = θ_CNN_init, θ_CNN_{t_1}, ..., θ_CNN_{t_{k−1}}}.

3. Lift each one to its eFCN: {Φ(θ_CNN_{t_0}), ..., Φ(θ_CNN_{t_{k−1}})} = {θ_eFCN_{t_0}, ..., θ_eFCN_{t_{k−1}}} (so that only m among a total of M parameters are non-zero).

4. Train these k eFCNs in the FCN space for 100 epochs under the same conditions, except for a smaller learning rate of 0.01. We obtain k solutions {θ_eFCN_{t_0,final}, ..., θ_eFCN_{t_{k−1},final}}.

5. For comparison, train a standard FCN (with the same architecture as the eFCNs but with the default PyTorch initialization) for 100 epochs under the same conditions as the eFCNs, and denote the resulting weights by θ_FCN_final. The latter reaches ∼55% test accuracy.

This process gives us one CNN solution, one FCN solution, and k eFCN solutions, labeled as

θ_CNN_final, θ_FCN_final, and {θ_eFCN_{t_0,final}, ..., θ_eFCN_{t_{k−1},final}}    (1)

which we analyze in the following subsections. Note that due to the difference in size between the CNN and the eFCNs, it is unclear what learning rate would give a fair comparison. One solution, shown in Sec. B of the Supplemental Material, is to use an adaptive learning rate optimizer such as Adam.

¹The source code may be found at: https://github.com/sdascoli/anarchitectural-search.

4.1 Performance and training dynamics of eFCNs

Our first aim is to characterize the training dynamics of eFCNs and study how their training evolution depends on their relax time t_w ∈ {t_0 = 0, t_1, ..., t_{k−2}, t_{k−1} = 150} (in epochs). When the architectural constraint is relaxed, the loss decreases monotonically to zero (see the left panel of Fig. 2).
The initial losses are smaller for larger t_w, as expected, since those t_w correspond to CNNs trained for longer. In the right panel of Fig. 2, we show a more surprising result: test accuracy increases monotonically in time for all t_w, thus showing that relaxing the constraints does not lead to overfitting or catastrophic forgetting. Hence, from the point of view of the FCN space, it is not as if the CNN dynamics took place on an unstable region from which the constraints of locality and weight sharing prevented it from falling off. It is quite the contrary: the CNN dynamics takes place in a basin, and when the constraints are relaxed, the system keeps going down on the training surface and up in test accuracy, as opposed to falling back to the standard FCN regime.

Figure 2: Training loss (left) and test accuracy (right) on CIFAR-10 vs. training time in logarithmic scale, including the initial point. Different models are color-coded as follows: the VanillaCNN is shown in black, the standard FCN in red, and the eFCNs with their relax times t_w are indicated by the gradient ranging from purple to light green.

In Fig. 3 (left) we compare the final test accuracies reached by the eFCNs with those of the CNN and the standard FCN. We find two main results. First, the accuracy of the eFCN for t_w = 0 is approximately 62.5%, well above the standard FCN result of 57.5%. This shows that imposing an untrained CNN prior is already enough to find a solution with much better performance than a standard FCN. Hence the CNN prior brings us to a good region of the landscape to start with. The second result, perhaps even more remarkable, is that at intermediate relax times (t_w ∼ 20 epochs), the eFCN reaches, and even exceeds, the final test accuracy reached by the CNN it stemmed from. This supports the idea that the constraints are mostly helpful for navigating the landscape during the early stages of optimization.
At late relax times, the eFCN is initialized close to the bottom of the landscape and has little room to move, hence the test accuracy stays the same as that of the fully trained CNN.

4.2 A closer look at the landscape

A widespread idea in the deep learning literature is that the sharpness of the minima of the training loss is related to generalization performance [24, 23], the intuition being that flat minima reduce the effect of the difference between training loss and test loss. This motivates us to compare the first and second order properties of the landscape explored by the eFCNs and the CNNs they stem from. To do so, we investigate the norm of the gradient of the training loss, |∇L|, and the top eigenvalue of the Hessian of the training loss, λ_max, in the central and right panels of Fig. 3 (we calculate the latter using a power method).
We point out several interesting observations. First, the sharpness (|∇L|) and steepness (λ_max) indicators increase then decrease during the training of the CNN (as analyzed in [1]), and display a maximum around t_w ≈ 20, which coincides with the relax time of best improvement for the eFCNs.

Figure 3: Left: Performance of eFCNs reached at the end of training (red crosses) compared to the best CNN accuracy (straight line) and the best FCN accuracy (dashed line). Center: Norm of the gradient for eFCNs at the beginning and at the end of training. Right: Largest eigenvalue of the Hessian for eFCNs at the beginning and at the end of training.
In all figures the x-axis is the relax time t_w.

Second, we see that after training the eFCNs, these indicators plummet by an order of magnitude, which is particularly surprising at very late relax times, where it appeared in the left panel of Fig. 3 (see also Fig. 4) as if the eFCNs were hardly moving away from initialization. This supports the idea that when the constraints are relaxed, the extra degrees of freedom lead us to wider basins, possibly explaining the gain in performance.

4.3 How far does the eFCN escape from the CNN subspace?

Figure 4: Left panel: relax time t_w of the eFCN vs. δ, the measure of deviation from the CNN subspace through the locality constraint, at the final point of eFCN training. Middle panel: δ vs. the initial loss value. Right panel: δ vs. final test accuracy of eFCN models. For reference, the blue point in the middle and right panels indicates the deviation measure for a standard FCN, where δ ∼ 97%.

A major question naturally arises: how far do the eFCNs move away from their initial condition? Do they stay close to the sparse configuration they were initialized in? To answer this question, we quantify how locality is violated once the constraints are relaxed (violation of weight sharing will be studied in Sec. 4.4). To this end, we consider a natural decomposition of the weights in the FCN space into two parts, θ = (θ_local, θ_off-local), where θ_off-local = 0 for an eFCN when it is initialized from a CNN. A visualization of these blocks may be found in Sec. A of the Supplemental Material. We then study the ratio δ of the squared norm of the off-local weights to the total squared norm, δ(θ) = ||θ_off-local||² / ||θ||², which is a measure of the deviation of the model from the CNN subspace.
Fig. 4 (left) shows that the deviation δ at the end of eFCN training decreases monotonically with its relax time t_w.
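As a small sketch (with a hypothetical weight shape and locality mask; the actual block decomposition over all layers is described in Sec. A of the Supplemental Material), δ can be computed from a binary mask that selects the local blocks:

```python
import numpy as np

def off_local_ratio(theta, local_mask):
    """Deviation from the CNN subspace: delta = ||theta_off_local||^2 / ||theta||^2.
    local_mask is 1 inside the local (receptive-field) blocks and 0 elsewhere."""
    off_local = theta * (1.0 - local_mask)
    return np.sum(off_local ** 2) / np.sum(theta ** 2)

rng = np.random.default_rng(0)
local_mask = (rng.random((32, 32)) < 0.05).astype(float)   # sparse local support
theta_init = rng.normal(size=(32, 32)) * local_mask        # eFCN right at relax time
delta_init = off_local_ratio(theta_init, local_mask)       # exactly 0 by construction
theta_late = theta_init + 0.1 * rng.normal(size=(32, 32))  # after some FCN training
delta_late = off_local_ratio(theta_late, local_mask)       # small but non-zero
```

By construction δ = 0 at the relax time, and it grows only as the off-local weights move away from zero during the unconstrained training.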
Indeed, the earlier we relax the constraints (and therefore the higher the initial loss of the eFCN), the further the eFCN escapes from the CNN subspace, as emphasized in Fig. 4 (middle). However, even at early relax times, the eFCNs stay rather close to the CNN subspace, since the ratio never exceeds 8%, whereas it is around 97% for a regular FCN (since the number of off-local weights is much larger than the number of local weights). This underlines the persistence of the architectural bias under the stochastic gradient dynamics.
Fig. 4 (right) shows that as we move away from the CNN subspace, performance stays high and then plummets down to the FCN level. This hints at a critical distance from the CNN subspace within which eFCNs behave like CNNs, and beyond which they fall back to the standard FCN regime. We further explore this high-performance vicinity of the CNN subspace using interpolations in weight space in Sec. C of the Supplemental Material.

4.4 What role do the extra degrees of freedom play in learning?

How can the eFCN use the extra degrees of freedom to improve performance? From Fig. 5, we see that the off-local part of the eFCN is useless on its own (with the local part masked off). However, when combined with the local part, it may greatly improve performance when the constraints are relaxed early enough. This hints at the fact that the local and off-local parts are performing complementary tasks.
To understand what tasks the two parts are performing, we show in Fig.
6 a “filter” from the first layer of the eFCN (whose receptive field is of the size of the images, since locality is relaxed). Note that each CNN filter gives rise to many eFCN filters: one for each position of the CNN filter on the image, since weight sharing is relaxed. Here we show the one obtained when the CNN filter (local block) is on the top left of the image. We see that off-local blocks stay orders of magnitude smaller than the local blocks, as expected from Sec. 4.3, where we saw that locality was almost conserved. We also see that local blocks hardly change during training, showing that weight sharing of the local blocks is also almost conserved.
More surprisingly, we see that for t_w > 0, distinctive shapes of the images are learned by the eFCN off-local blocks, which perform some kind of template matching. Note that the silhouettes are particularly clear for the intermediate relax time (middle row), at which we know from Sec. 4.1 that the eFCN had the best improvement over the CNN. Hence, the eFCN combines template matching with convolutional feature extraction in a complementary way.
Note that by itself, template matching is very inefficient for complicated and varied images such as those of the CIFAR-10 dataset. Hence it cannot be observed in standard FCNs, as shown in Fig. 7, where we reproduce the counterpart of Fig. 6 for the FCN in the left and middle images (they correspond to initial and final training times, respectively). To reveal the silhouettes learned, we need to look at the pixelwise difference between the two images, i.e. focus on the change due to training (this is unnecessary for the eFCN, whose off-local weights started at zero). In the right image of Fig. 7, we see that a loose texture emerges; however, it is not as sharp as that of the eFCN weights after training.
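The local/off-local ablation of Fig. 5 can be sketched on a single linear layer (a hypothetical toy layer, not the paper's trained eFCN). For a linear map the two masked outputs sum exactly to the full output; it is the network's nonlinearities that break this additivity and allow the combination to achieve more than either part alone:

```python
import numpy as np

def masked_forward(theta, x, mask):
    """Forward pass through a linear layer keeping only the weights selected
    by mask (the others are zeroed out), as in the Fig. 5 ablation."""
    return (theta * mask) @ x

rng = np.random.default_rng(0)
theta = rng.normal(size=(10, 64))                        # toy dense layer
local_mask = (rng.random((10, 64)) < 0.1).astype(float)  # hypothetical local blocks
x = rng.normal(size=64)

y_full = theta @ x
y_local = masked_forward(theta, x, local_mask)        # off-local blocks masked out
y_off = masked_forward(theta, x, 1.0 - local_mask)    # local blocks masked out
# Without a nonlinearity the two contributions are exactly additive:
assert np.allclose(y_local + y_off, y_full)
```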
Template matching is only useful as a cherry on the cake alongside more efficient learning procedures.

Figure 5: Contributions to the test accuracy of the local blocks (off-local blocks masked out), in orange, and off-local blocks (local blocks masked out), in blue. Combining them together yields a large gain in performance for the eFCN, in green.

5 Discussion and Conclusion

In this work, we examined the inductive bias of CNNs, and challenged the accepted view that FCNs are unable to generalize as well as CNNs on visual tasks. Specifically, we showed that the CNN prior is mainly useful during the early stages of training, to prevent the unconstrained FCN from falling prey to spurious solutions with poor generalization too early.

Figure 6: Heatmap of the weights of an eFCN “filter” from the first layer just at relax time (left column), after training for 11 epochs (middle column), and after training for 78 epochs (right column). The eFCNs were initialized at relax times t_w = 0 (top row), t_w = 13 (middle row), and t_w = 115 (bottom row). The colors indicate the natural logarithm of the absolute value of the weights. Note that the convolutional filters, in the top right, vary little and remain orders of magnitude larger than the off-local blocks, whereas the off-local blocks pick up strong signals from images as sharp silhouettes appear.

Figure 7: Same heatmap of weights as shown in Fig. 6 but for a standard FCN at a randomly initialized point (left) and after training for 150 epochs (middle). The pixelwise difference is shown on the right panel.
A loose texture appears, but it is by no means as sharp as the silhouettes of the eFCNs.

Our experimental results show that there exists a vicinity of the CNN subspace with high generalization properties, and one may even enhance the performance of CNNs by exploring it, if one relaxes the CNN constraints at an appropriate time during training. The extra degrees of freedom are used to perform complementary tasks which alone are unhelpful. This offers interesting theoretical perspectives, in relation to other high-dimensional estimation problems, such as spiked tensor models [2], where a smart initialization, containing prior information on the problem, is used to provide an initial condition that bypasses the regions where the estimation landscape is “rough” and full of spurious minima.
On the practical front, despite the performance gains obtained, our algorithm remains highly impractical due to the large number of degrees of freedom required by our eFCNs. However, more efficient strategies involving a less drastic relaxation of the CNN constraints (e.g., relaxing the weight sharing but keeping the locality constraint, as in locally-connected networks [11]) could be of potential interest to practitioners.

Acknowledgments

We would like to thank Alp Riza Guler and Ilija Radosavovic for helpful discussions. We acknowledge funding from the Simons Foundation (#454935, Giulio Biroli). JB acknowledges the partial support by the Alfred P.
Sloan Foundation, NSF RI-1816753, NSF CAREER CIF 1845360, and Samsung Electronics.

References

[1] Alessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep neural networks. arXiv preprint arXiv:1711.08856, 2017.

[2] Anima Anandkumar, Yuan Deng, Rong Ge, and Hossein Mobahi. Homotopy analysis for tensor PCA. arXiv preprint arXiv:1610.09322, 2016.

[3] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018.

[4] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654–2662, 2014.

[5] Carlo Baldassi, Christian Borgs, Jennifer T Chayes, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proceedings of the National Academy of Sciences, 113(48):E7655–E7662, 2016.

[6] Carlo Baldassi, Fabrizio Pittorino, and Riccardo Zecchina. Shaping the learning landscape in neural networks around wide flat minima. arXiv preprint arXiv:1905.07833, 2019.

[7] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.

[8] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541. ACM, 2006.

[9] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.

[10] Tianqi Chen, Ian Goodfellow, and Jonathon Shlens.
Net2net: Accelerating learning via knowl-\n\nedge transfer. arXiv preprint arXiv:1511.05641, 2015.\n\n[11] Adam Coates and Andrew Y Ng. Selecting receptive \ufb01elds in deep networks. In Advances in\n\nneural information processing systems, pages 2528\u20132536, 2011.\n\n[12] Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred A Hamprecht. Essentially no\n\nbarriers in neural network energy landscape. arXiv preprint arXiv:1803.00885, 2018.\n\n[13] Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient\ndescent learns one-hidden-layer cnn: Don\u2019t be afraid of spurious local minima. arXiv preprint\narXiv:1712.00779, 2017.\n\n[14] Simon S Du, Yining Wang, Xiyu Zhai, Sivaraman Balakrishnan, Ruslan R Salakhutdinov, and\nAarti Singh. How many samples are needed to estimate a convolutional neural network? In\nAdvances in Neural Information Processing Systems, pages 373\u2013383, 2018.\n\n[15] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable\n\nneural networks. arXiv preprint arXiv:1803.03635, 2018.\n\n[16] C Daniel Freeman and Joan Bruna. Topology and geometry of deep recti\ufb01ed network optimiza-\n\ntion landscapes. arXiv preprint arXiv:1611.01540, 2016.\n\n10\n\n\f[17] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wil-\nson. Loss surfaces, mode connectivity, and fast ensembling of dnns. In Advances in Neural\nInformation Processing Systems, pages 8789\u20138798, 2018.\n\n[18] Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, St\u00e9phane d\u2019Ascoli,\nGiulio Biroli, Cl\u00e9ment Hongler, and Matthieu Wyart. Scaling description of generalization with\nnumber of parameters in deep learning. arXiv preprint arXiv:1901.01608, 2019.\n\n[19] Mario Geiger, Stefano Spigler, St\u00e9phane d\u2019Ascoli, Levent Sagun, Marco Baity-Jesi, Giulio\nBiroli, and Matthieu Wyart. 
The jamming transition as a paradigm to understand the loss\nlandscape of deep neural networks. arXiv preprint arXiv:1809.09349, 2018.\n\n[20] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias\n\nin terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018.\n\n[21] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural net-\nworks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149,\n2015.\n\n[22] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.\n\narXiv preprint arXiv:1503.02531, 2015.\n\n[23] Stanis\u0142aw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua\narXiv preprint\n\nBengio, and Amos Storkey. Three factors in\ufb02uencing minima in sgd.\narXiv:1711.04623, 2017.\n\n[24] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping\nTak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima.\narXiv preprint arXiv:1609.04836, 2016.\n\n[25] Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images.\n\nTechnical Report, pages 1\u201360, 2009.\n\n[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classi\ufb01cation with deep\nconvolutional neural networks. In Advances in neural information processing systems, pages\n1097\u20131105, 2012.\n\n[27] Jaeho Lee and Maxim Raginsky. Learning \ufb01nite-dimensional coding schemes with nonlinear\n\nreconstruction maps. arXiv preprint arXiv:1812.09658, 2018.\n\n[28] Philip M Long and Hanie Sedghi. Size-free generalization bounds for convolutional neural\n\nnetworks. arXiv preprint arXiv:1905.12600, 2019.\n\n[29] Brady Neal, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon Lacoste-\nJulien, and Ioannis Mitliagkas. 
A modern take on the bias-variance tradeoff in neural networks.\narXiv preprint arXiv:1810.08591, 2018.\n\n[30] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro.\nTowards understanding the role of over-parametrization in generalization of neural networks.\narXiv preprint arXiv:1805.12076, 2018.\n\n[31] Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Jiri Hron, Daniel A\nAbola\ufb01a, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks\nwith many channels are gaussian processes. 2018.\n\n[32] Steven J Nowlan and Geoffrey E Hinton. Simplifying neural networks by soft weight-sharing.\n\nNeural computation, 4(4):473\u2013493, 1992.\n\n[33] Levent Sagun, Utku Evci, V. U\u02d8gur G\u00fcney, Yann Dauphin, and L\u00e9on Bottou. Empirical analysis\nICLR 2018 Workshop Contribution,\n\nof the hessian of over-parametrized neural networks.\narXiv:1706.04454, 2017.\n\n[34] Shreyas Saxena and Jakob Verbeek. Convolutional neural fabrics. In Advances in Neural\n\nInformation Processing Systems, pages 4053\u20134061, 2016.\n\n11\n\n\f[35] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The\nimplicit bias of gradient descent on separable data. Journal of Machine Learning Research,\n19(70), 2018.\n\n[36] Luca Venturi, Afonso Bandeira, and Joan Bruna. Neural networks with \ufb01nite intrinsic dimension\n\nhave no spurious valleys. arXiv preprint arXiv:1802.06384, 2018.\n\n[37] Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learning: Perspec-\n\ntive of loss landscapes. arXiv preprint arXiv:1706.10239, 2017.\n\n[38] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding\n\ndeep learning requires rethinking generalization. 
arXiv preprint arXiv:1611.03530, 2016.\n\n12\n\n\f", "award": [], "sourceid": 4983, "authors": [{"given_name": "St\u00e9phane", "family_name": "d'Ascoli", "institution": "ENS / FAIR"}, {"given_name": "Levent", "family_name": "Sagun", "institution": "EPFL"}, {"given_name": "Giulio", "family_name": "Biroli", "institution": "ENS"}, {"given_name": "Joan", "family_name": "Bruna", "institution": "NYU"}]}