{"title": "One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers", "book": "Advances in Neural Information Processing Systems", "page_first": 4932, "page_last": 4942, "abstract": "The success of lottery ticket initializations (Frankle and Carbin, 2019) suggests that small, sparsified networks can be trained so long as the network is initialized appropriately. Unfortunately, finding these \"winning ticket'' initializations is computationally expensive. One potential solution is to reuse the same winning tickets across a variety of datasets and optimizers. However, the generality of winning ticket initializations remains unclear. Here, we attempt to answer this question by generating winning tickets for one training configuration (optimizer and dataset) and evaluating their performance on another configuration. Perhaps surprisingly, we found that, within the natural images domain, winning ticket initializations generalized across a variety of datasets, including Fashion MNIST, SVHN, CIFAR-10/100, ImageNet, and Places365, often achieving performance close to that of winning tickets generated on the same dataset. Moreover, winning tickets generated using larger datasets consistently transferred better than those generated using smaller datasets. We also found that winning ticket initializations generalize across optimizers with high performance. These results suggest that winning ticket initializations generated by sufficiently large datasets contain inductive biases generic to neural networks more broadly which improve training across many settings and provide hope for the development of better initialization methods.", "full_text": "One ticket to win them all: generalizing lottery ticket\n\ninitializations across datasets and optimizers\n\nAri S. 
Morcos\u2217\n\nFacebook AI Research\narimorcos@fb.com\n\nMichela Paganini\n\nFacebook AI Research\n\nmichela@fb.com\n\nHaonan Yu\n\nFacebook AI Research\nhaonanu@gmail.com\n\nYuandong Tian\n\nFacebook AI Research\n\nyuandong@fb.com\n\nAbstract\n\nThe success of lottery ticket initializations [7] suggests that small, sparsi\ufb01ed net-\nworks can be trained so long as the network is initialized appropriately. Unfortu-\nnately, \ufb01nding these \u201cwinning ticket\u201d initializations is computationally expensive.\nOne potential solution is to reuse the same winning tickets across a variety of\ndatasets and optimizers. However, the generality of winning ticket initializations\nremains unclear. Here, we attempt to answer this question by generating winning\ntickets for one training con\ufb01guration (optimizer and dataset) and evaluating their\nperformance on another con\ufb01guration. Perhaps surprisingly, we found that, within\nthe natural images domain, winning ticket initializations generalized across a vari-\nety of datasets, including Fashion MNIST, SVHN, CIFAR-10/100, ImageNet, and\nPlaces365, often achieving performance close to that of winning tickets generated\non the same dataset. Moreover, winning tickets generated using larger datasets\nconsistently transferred better than those generated using smaller datasets. We also\nfound that winning ticket initializations generalize across optimizers with high\nperformance. 
These results suggest that winning ticket initializations generated\nby suf\ufb01ciently large datasets contain inductive biases generic to neural networks\nmore broadly which improve training across many settings and provide hope for\nthe development of better initialization methods.\n\n1\n\nIntroduction\n\nThe recently proposed lottery ticket hypothesis [7, 8] argues that the initialization of over-\nparameterized neural networks contains much smaller sub-network initializations, which, when\ntrained in isolation, reach similar performance as the full network. The presence of these \u201clucky\u201d\nsub-network initializations has several intriguing implications. First, it suggests that the most com-\nmonly used initialization schemes, which have primarily been discovered heuristically [10, 14], are\nsub-optimal and have signi\ufb01cant room to improve. This is consistent with work which suggests that\ntheoretically-grounded initialization schemes can enable the training of extremely deep networks\nwith hundreds of layers [13, 27, 29, 33, 35]. Second, it suggests that over-parameterization is not\nnecessary during the course of training as has been argued previously [1, 2, 5, 6, 24, 25], but rather\nover-parameterization is merely necessary to \ufb01nd a \u201cgood\u201d initialization of an appropriately parame-\nterized network. If this hypothesis is true, it indicates that by training and performing inference in\nnetworks which are 1-2 orders of magnitude larger than necessary, we are wasting large amounts of\n\n\u2217To whom correspondence should be addressed\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fcomputation. 
Taken together, these results hint toward the development of a better, more principled\ninitialization scheme.\n\nHowever, the process to \ufb01nd winning tickets requires repeated cycles of alternating training (poten-\ntially large) models from scratch to convergence and pruning, making winning tickets computationally\nexpensive to \ufb01nd. Moreover, it remains unclear whether the properties of winning tickets which lead\nto high performance are speci\ufb01c to the precise combination of architecture, optimizer, and dataset,\nor whether winning tickets contain inductive biases which improve training more generically. This\nis a critical distinction for evaluating the future utility of winning ticket initializations: if winning\ntickets are over\ufb01t to the dataset and optimizer with which they were generated, a new winning ticket\nwould need to be generated for each novel dataset which would require training of the full model and\niterative pruning, signi\ufb01cantly blunting the impact of these winning tickets. In contrast, if winning\ntickets feature more generic inductive biases such that the same winning ticket generalizes across\ntraining conditions and datasets, it unlocks the possibility of generating a small number of such\nwinning tickets and reusing them across datasets. Further, it hints at the possibility of parameterizing\nthe distribution of such tickets, allowing us to sample generic, dataset-independent winning tickets.\n\nHere, we investigate this question by asking whether winning tickets found in the context of one\ndataset improve training of sparsi\ufb01ed models on other datasets as well. We demonstrate that individual\nwinning tickets which improve training across many natural image datasets can be found, and that,\nfor many datasets, these transferred winning tickets performed almost as well as (and in some cases,\nbetter than) dataset-speci\ufb01c initializations. 
We also show that winning tickets generated by larger\ndatasets (both with more training samples and more classes) generalized across datasets substantially\nbetter than those generated by small datasets. Finally, we \ufb01nd that winning tickets can also generalize\nacross optimizers, con\ufb01rming that dataset- and optimizer-independent winning ticket initializations\ncan be generated.\n\n2 Related work\n\nOur work is most directly inspired by the lottery ticket hypothesis, which argues that the probability\nof sampling a lucky, trainable sub-network initialization grows with network size due to the com-\nbinatorial explosion of available sub-network initializations. The lottery ticket hypothesis was \ufb01rst\npostulated and examined in smaller models and datasets in [7] and analyzed in large models and\ndatasets, leading to the development of late resetting, in [8]. The lottery ticket hypothesis has recently\nbeen challenged by [20], which argued that, if randomly initialized sub-networks with structured\npruning are scaled appropriately and trained for long enough, they can match performance from\nwinning tickets. [9] also evaluated the lottery ticket hypothesis in the context of large-scale models\nand were unable to \ufb01nd successful winning tickets, although, critically, they did not use iterative\npruning and late resetting, both of which have been found to be necessary to induce winning tickets in\nlarge-scale models. However, all of these studies have only investigated situations in which winning\ntickets are evaluated in an identical setting to that in which they generated, and therefore do not\nmeasure the generality of winning tickets.\n\nThis work is also strongly inspired by the model pruning literature as well, as the choice of pruning\nmethodology can have large impacts on the structure of the resultant winning tickets. 
In particular,\nwe use magnitude pruning, in which the lowest magnitude weights are pruned first, which was first\nproposed by [12]. A number of variants have been proposed as well, including structured variants\n[19] and those which enable pruned weights to recover during training [11, 37]. Many other pruning\nmethods have been proposed, including greedy methods [22], methods based on variational dropout\n[21], and those based on the similarity between the activations of feature maps [3, 28].\n\nTransfer learning has been studied extensively, with many studies aiming to transfer learned representations from one dataset to another [17, 31, 34, 38]. Pruning has also been analyzed in the context\nof transfer learning, primarily by fine-tuning pruned networks on novel datasets and tasks [22, 37].\nThese results have demonstrated that training models on one dataset, pruning, and then fine-tuning on\nanother dataset can often result in high performance on the transfer dataset. However, in contrast\nto the present work, these studies investigate the transfer of learned representations, whereas we\nanalyze the transfer of initializations across datasets.\n\nFigure 1: Global vs. layerwise pruning. (a) CIFAR-10 winning ticket performance for global pruning (blue)\nvs. layerwise pruning (red), along with a random ticket. Error bars represent mean \u00b1 standard deviation\nacross six random seeds. (b) Ratio of global pruning rate to layerwise pruning rate at each convolutional layer\nin VGG19. 
Level represents the pruning iteration such that lighter blues represent lower overall pruning rates,\nwhile the darkest blue represents a 0.999 overall pruning rate.\n\n3 Approach\n\n3.1 The lottery ticket hypothesis\n\nThe lottery ticket hypothesis proposes that \"lucky\" sub-network initializations are present within the\ninitializations of over-parameterized networks which, when trained in isolation, can reach the same\nor, in some cases, better test accuracy than the full model, even when over 99% of the parameters\nhave been removed [7]. This effect is also present for large-scale models trained on ImageNet,\nsuggesting that this phenomenon is not specific to small models [7]. In the simplest method to\nfind and evaluate winning tickets, models are trained to convergence, pruned, and then the set of\nremaining weights is reset to its values at the start of training. This smaller model is then trained to\nconvergence again (\u201cwinning ticket\u201d) and compared to a model with the same number of parameters\nbut randomly drawn initial parameter values (\u201crandom ticket\u201d). A good winning ticket is one which\nsignificantly outperforms random tickets. However, while this straightforward approach finds good\nwinning tickets for simple models and datasets (e.g., MLPs trained on MNIST), this method fails for\nmore complicated architectures and datasets, which require several \u201ctricks\u201d to generate good winning\ntickets:\n\nIterative pruning When models are pruned, they are typically pruned according to some criterion\n(see related work for more details). While these pruning criteria can often be effective, they are\nonly rough estimates of weight importance, and are often noisy. 
As such, pruning a large fraction of\nweights in one step (\u201cone-shot pruning\u201d) will often prune weights which were actually important.\nOne strategy to combat this issue is to instead perform many iterations of alternately training and\npruning, with each iteration pruning only a small fraction of weights [12]. By only pruning a small\nfraction of weights on each iteration, iterative pruning helps to de-noise the pruning process, and\nproduces substantially better pruned models and winning tickets [7, 8]. For this work, we used\nmagnitude pruning with an iterative pruning rate of 20% of remaining parameters.\n\nLate resetting\nIn the initial investigation of lottery tickets, winning ticket weights were reset to\ntheir values at the beginning of training (training iteration 0), and learning rate warmup was found to\nbe necessary for winning tickets on large models [7]. However, in follow-up work, Frankle et al. [8]\nfound that simply resetting the winning ticket weights to their values at training iteration k, with k\nmuch smaller than the number of total training iterations, consistently produces better winning tickets\nand removes the need for learning rate warmup. This approach has been termed \u201clate resetting.\u201d In\nthis work, we independently confirmed the importance of late resetting and used late resetting for all\nexperiments (see Appendix A.1 for the precise late resetting values used for each experiment).\n\nGlobal pruning Pruning can be performed in two different ways: locally and globally. In local\npruning, weights are pruned within each layer separately, such that every layer will have the same\nfraction of pruned parameters. In global pruning, all layers are pooled together prior to pruning, allowing the pruning fraction to vary across layers. Consistently, we observed that global pruning\nleads to higher performance than local pruning (Figure 1a). As such, global pruning is used throughout\nthis work. Interestingly, we observed that global magnitude pruning preferentially prunes weights\nin deeper layers, leaving the first layer in particular relatively unpruned (Figure 1b; light blues\nrepresent early pruning iterations with low overall pruning fractions while dark blues represent late\npruning iterations with high pruning fractions). An intuitive explanation for this result is that because\ndeeper layers have many more parameters than early layers, pruning at a constant rate harms early\nlayers more since they have fewer absolute parameters remaining. For example, the first layer of\nVGG19 contains only 1792 parameters, so pruning this layer at a rate of 99% would result in only 18\nparameters remaining, significantly harming the expressivity of the network.\n\nRandom masks\nIn the sparse pruning setting, winning ticket initializations contain two sources of\ninformation: the values of the remaining weights and the structure of the pruning mask. In previous\nwork [7\u20139], the structure of the pruning mask was maintained for the random ticket initializations,\nwith only the values of the weights themselves randomized. However, the structure of this mask\ncontains a substantial amount of information and requires foreknowledge of the winning ticket\ninitialization (see Figure A1 for detailed comparisons of different random masks). For this work, we\ntherefore consider random tickets to have both randomly drawn weight values (from the initialization\ndistribution) and randomly permuted masks. 
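The pruning procedure used here (global magnitude pruning at an iterative rate of 20% of remaining parameters) can be sketched as follows. This is a minimal NumPy illustration, not the code used for the experiments; the layer sizes are placeholders:

```python
import numpy as np

def global_magnitude_prune(weights, masks, rate=0.2):
    """One iteration of global magnitude pruning: pool the surviving
    weights of all layers, find the magnitude threshold below which
    `rate` of them fall, and prune those, so that per-layer pruning
    fractions are free to vary."""
    surviving = np.concatenate([np.abs(w[m]) for w, m in zip(weights, masks)])
    threshold = np.quantile(surviving, rate)
    return [m & (np.abs(w) >= threshold) for w, m in zip(weights, masks)]

# Toy arrays standing in for a small and a large layer.
rng = np.random.default_rng(0)
weights = [rng.normal(size=1792), rng.normal(size=100_000)]
masks = [np.ones_like(w, dtype=bool) for w in weights]

for _ in range(3):  # three rounds at 20% of remaining weights each
    masks = global_magnitude_prune(weights, masks)

remaining = sum(m.sum() for m in masks) / sum(m.size for m in masks)
# roughly 0.8 ** 3 of the weights survive after three rounds
```

Permuting each layer's mask (e.g. `rng.permutation(m.ravel()).reshape(m.shape)`) and redrawing the weight values from the initialization distribution would then give the random-ticket baseline described above.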
For a more detailed discussion of the impact of different\nmask structures, see Appendix A.2.\n\n3.2 Models\n\nExperiments were performed using two models: a modi\ufb01ed form of VGG19 [30] and a ResNet50\n[15]. For VGG19, structure is as in [30], except all fully-connected layers were removed such that\nthe last layer is simply a linear layer from a global average pool of the last convolutional layer to the\nnumber of output classes (as in [7, 8]). For details of model architecture and hyperparameters used in\ntraining, see Appendix A.1.\n\n3.3 Transferring winning tickets across datasets and optimizers\n\nIn order to evaluate the generality of winning tickets, we generate winning tickets in one training\ncon\ufb01guration (\u201csource\u201d) and evaluate performance in a different con\ufb01guration (\u201ctarget\u201d). For many\ntransfers, this requires changing the output layer size, since datasets have different numbers of output\nclasses. Since the winning ticket initialization is only de\ufb01ned for the source architecture and therefore\ncannot be transferred to different topologies, we simply excluded this layer from the winning ticket\nand randomly reinitialized it. Since the last convolutional layer of our models was globally average\npooled prior to the \ufb01nal linear layer, changes in input dimension did not require modi\ufb01cation to the\nmodel.\n\nFor standard lottery ticket experiments in which the source and target dataset are identical, each\niteration of training represents the winning ticket performance for the model at the current pruning\nfraction. However, because the source and target dataset/optimizer are different for our experiments\nand because we primarily care about performance on the target dataset for this study, we must\nre-evaluate each winning ticket\u2019s performance on the target dataset, adding an additional training run\nfor each pruning iteration. 
We therefore run two additional training runs at each pruning iteration:\none for the winning ticket and one for the random ticket on the target configuration.\n\n4 Results\n\nFor all experiments, we plot test accuracy at convergence as a function of the fraction of pruned\nweights. For each curve, 6 replicates with different random seeds were run, with shaded error regions\nrepresenting \u00b1 1 standard deviation. For comparisons on a given target dataset or optimizer, models\nwere trained for the same number of epochs (see Appendix A.1 for details).\n\n4.1 Transfer within the same data distribution\n\nAs a first test of whether winning tickets generalize, we investigate the simplest form of transfer:\ngeneralization across samples drawn from the same data distribution. To measure this, we divided the CIFAR-10 dataset into two halves: CIFAR-10a and CIFAR-10b. Each half contained 25,000\ntraining images with 2,500 images per class. We then asked whether winning tickets generated\nusing CIFAR-10a would produce increased performance on CIFAR-10b. To evaluate the impact of\ntransferring a winning ticket, we compared the CIFAR-10a ticket to both a random ticket and to a\nwinning ticket generated on CIFAR-10b itself (Figure 2).\n\nFigure 2: Transferring winning tickets within the same data distribution. CIFAR-10 was divided into two\nhalves (\u201c10a\u201d and \u201c10b\u201d), each of which contained 25,000 total examples with 2,500 images per class. Winning\ntickets generated using CIFAR-10a generalized well to CIFAR-10b for both VGG19 (a) and ResNet50 (b). Error\nbars represent mean \u00b1 standard deviation across six random seeds.\n\n
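The class-balanced split into CIFAR-10a and CIFAR-10b can be sketched as below; this is a hypothetical NumPy helper (the function name and seed are illustrative, not taken from the paper):

```python
import numpy as np

def split_in_half_per_class(labels, seed=0):
    """Split example indices into two halves containing an equal
    number of examples from every class."""
    rng = np.random.default_rng(seed)
    half_a, half_b = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        half_a.extend(idx[: len(idx) // 2])
        half_b.extend(idx[len(idx) // 2:])
    return np.sort(half_a), np.sort(half_b)

# Toy labels standing in for the 50,000 CIFAR-10 training labels.
labels = np.repeat(np.arange(10), 5000)
cifar_10a, cifar_10b = split_in_half_per_class(labels)
# each half: 25,000 disjoint indices, 2,500 per class
```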
Interestingly, for ResNet50 models, while\nboth CIFAR-10a and CIFAR-10b winning tickets outperformed random tickets at extreme pruning\nfractions, both under-performed random tickets at low pruning fractions, suggesting that ResNet\nwinning tickets may be particularly sensitive to smaller datasets at low pruning fractions.\n\n4.2 Transfer across datasets\n\nOur experiments on transferring winning tickets across training data drawn from the same distribution\nsuggest that winning tickets are not over\ufb01t to the particular data samples presented during training,\nbut winning tickets may still be over\ufb01t to the data distribution itself. To answer this question, we\nperformed a large set of experiments to assess whether winning tickets generated on one dataset\ngeneralize to different datasets within the same domain (natural images).\n\nWe used six different natural image datasets of various complexity to test this: Fashion-MNIST [32],\nSVHN [23], CIFAR-10 [18], CIFAR-100 [18], ImageNet [4], and Places365 [36]. These datasets\nvary across a number of axes, including grayscale vs. color, input size, number of output classes,\nand training set size. Since each pruning curve comprises the result of training six models (to\ncapture variability due to random seed) from scratch at each pruning fraction, these experiments\nrequired extensive computation, especially for models trained on large datasets such as ImageNet\nand Places365. We therefore only evaluated tickets from larger datasets on these datasets to best\nprioritize computation. For all comparisons, we performed experiments on both VGG19 (Figure 3)\nand ResNet50 (Figure 4).\n\nAcross all comparisons, several key trends emerged. 
First and foremost, individual winning tickets\nwhich generalize across all datasets with performance close to that of winning tickets generated on\nthe target dataset can be found (e.g., winning tickets sourced from the ImageNet and Places365 datasets).\nThis result suggests that a substantial fraction of the inductive bias provided by winning tickets is\ndataset-independent (at least within the same domain and for large source datasets), and provides\nhope that individual tickets or distributions of such winning tickets may be generated once and used\nacross different tasks and environments.\n\nSecond, we consistently observed that winning tickets generated on larger, more complex datasets\ngeneralized substantially better than those generated on small datasets. This is particularly noticeable\nfor winning tickets generated on the ImageNet and Places365 source datasets, which demonstrated\ncompetitive performance across all datasets. Interestingly, this effect was not merely a result of\ntraining set size, but also appeared to be impacted by the number of classes. This is most clearly\nexemplified by the differing performance of winning tickets generated on CIFAR-10 and CIFAR-100,\nboth of which feature 50,000 total training examples, but differ in the number of classes. 
Consistently,\nCIFAR-100 tickets generalized better than CIFAR-10 tickets, even to simple datasets such as Fashion-MNIST and SVHN, suggesting that simply increasing the number of classes while keeping the dataset size fixed may lead to substantial gains in winning ticket generalization (e.g., compare CIFAR-10\nand CIFAR-100 ticket performance in Figure 3e).\n\nFigure 3: Transferring VGG19 winning tickets across datasets. Winning ticket performance on target\ndatasets: SVHN (a), Fashion-MNIST (b), CIFAR-10 (c), CIFAR-100 (d), ImageNet (e), and Places365 (f).\nWithin each plot, each line represents a different source dataset for the winning ticket. In cases where the y-axis\nhas been narrowed to make small differences visible, full y-axes are provided as insets. Error bars represent\nmean \u00b1 standard deviation across six random seeds.\n\nInterestingly, we also observed that when networks are extremely over-parameterized relative to the\ncomplexity of the task, as when we apply VGG19 to Fashion-MNIST, we found that transferred\nwinning tickets dramatically outperformed winning tickets generated on Fashion-MNIST itself at\nlow pruning rates (Figure 3b). In this setting, large networks trained on Fashion-MNIST overfit\ndramatically, leading to very low test accuracy at low pruning rates, which gradually improved as more\nweights were pruned. 
Winning tickets generated on other datasets, however, bypassed this problem,\nreaching high accuracy at the same pruning rates which were untrainable from even Fashion-MNIST\nwinning tickets (Figure 3b), again suggesting that transferred tickets provide additional regularization\nagainst overfitting.\n\nFinally, we observed that winning ticket transfer success was roughly similar across VGG19\nand ResNet50 models, but several differences were present. Consistent with previous results [8], performance on large-scale datasets began to rapidly degrade for ResNet50 models when approximately\n5-10% of weights remained, in contrast to VGG19 models, which only demonstrated small decreases\nin accuracy even with 99.9% of weights pruned. By comparison, at extreme pruning fractions on small\ndatasets, ResNet50 models consistently achieved similar or only slightly degraded performance\nrelative to over-parameterized models (e.g., Figures 4c and 4d). These results suggest that ResNet50\nmodels may have a sharper \u201cpruning cliff\u201d than VGG19 models.\n\nFigure 4: Transferring ResNet50 winning tickets across datasets. Winning ticket performance on target\ndatasets: SVHN (a), Fashion-MNIST (b), CIFAR-10 (c), CIFAR-100 (d), ImageNet (e), and Places365 (f).\nWithin each plot, each line represents a different source dataset for the winning ticket. In cases where the y-axis\nhas been narrowed to make small differences visible, full y-axes are provided as insets. Error bars represent\nmean \u00b1 standard deviation across six random seeds.\n\n4.3 Transfer across optimizers\n\nIn the above sections, we demonstrated that winning tickets can be transferred across datasets within\nthe same domain, suggesting that winning tickets learn generic inductive biases which improve\ntraining. However, it is possible that winning tickets are also specific to the particular optimizer that is\nused. 
The starting point provided by a winning ticket allows a particular optimum to be reachable, but\nextensions to standard stochastic gradient descent (SGD) may alter the reachability of certain states,\nsuch that a winning ticket generated using one optimizer will not generalize to another optimizer.\n\nTo test this, we generated winning tickets using two optimizers (SGD with momentum and Adam\n[16]), and evaluated whether winning tickets generated using one optimizer increased performance\nwith the other optimizer. We found that, as with transfer across datasets, transferred winning tickets for\nVGG models achieved similar performance to those generated using the source optimizer (Figure 5).\nInterestingly, we found that tickets transferred from SGD to Adam under-performed random tickets\nat low pruning fractions, but significantly outperformed random tickets at high pruning fractions\n(Figure 5b). Overall, this result suggests that VGG winning tickets are not overfit to the\nparticular optimizer used during generation, suggesting that VGG winning tickets are optimizer-independent.\n\nFigure 5: Transferring VGG winning tickets across optimizers. Ticket performance when training using\nSGD w/ momentum (a) and Adam (b). Within each plot, each line represents a different source optimizer for the\nwinning ticket. In cases where the y-axis has been narrowed to make small differences visible, full y-axes are\nprovided as insets. Error bars represent mean \u00b1 standard deviation across six random seeds.\n\n5 Discussion\n\nIn this work, we demonstrated that winning tickets are capable of transferring across a variety of\ntraining configurations, suggesting that winning tickets drawn from sufficiently large datasets are not\noverfit to a particular optimizer or dataset, but rather feature inductive biases which improve training\nof sparsified models more generally (Figures 3 and 4). We also found that winning tickets generated\nagainst datasets with more samples and more classes consistently transfer better, suggesting that\nlarger datasets encourage more generic winning tickets. 
Together, these results suggest that winning\nticket initializations satisfy a necessary precondition (generality) for the eventual construction of a\nlottery ticket initialization scheme, and provide greater insights into the factors which make winning\nticket initializations unique.\n\n5.1 Caveats and next steps\n\nThe generality of lottery ticket initializations is encouraging, but a number of key limitations remain.\nFirst, while our results suggest that only a handful of winning tickets need to be generated, generating\nthese winning tickets via iterative pruning is very slow, requiring retraining the source model as many\nas 30 times serially for extreme pruning fractions (e.g., 0.999). This issue is especially prevalent\ngiven our observation that larger datasets produce more generic winning tickets, as these models\nrequire signi\ufb01cant compute for each training run.\n\nSecond, we have only evaluated the transfer of winning tickets across datasets within the same domain\n(natural images) and task (object classi\ufb01cation). It is possible that winning ticket initializations\nconfer inductive biases which are only good for a given data type or task structure, and that these\ninitializations may not transfer to other tasks, domains, or multi-modal settings. Future work will be\nrequired to assess the generalization of winning tickets across domains and diverse task sets.\n\nThird, while we found that transferred winning tickets often achieved roughly similar performance to\nthose generated on the same dataset, there was often a small, but noticeable gap between the transfer\nticket and the same dataset ticket, suggesting that a small fraction of the inductive bias conferred by\nwinning tickets is dataset-dependent. This effect was particularly pronounced for winning tickets\ngenerated on small datasets. However, it remains unclear which aspects of winning tickets are\ndataset-dependent and which aspects are dataset-independent. 
Working to understand this difference, and investigating ways to close this gap, will be important future work, and may also aid transfer across domains.

Fourth, we only evaluated situations where the network topology is fixed in both the source and target training configurations. This is limiting, since it means a new winning ticket must be generated for each and every architecture topology, though our experiments suggest that a small number of layers may be re-initialized without substantially damaging the winning ticket. As such, the development of methods to parameterize winning tickets for novel architectures will be an important direction for future studies.

Finally, a critical question remains unanswered: what makes winning tickets special? While our results shed some light on this by suggesting that whatever makes winning tickets unique is somewhat generic, what precisely makes them special is still unclear. Understanding these properties will be critical for the future development of better initialization strategies inspired by lottery tickets.

References

[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in over-parameterized neural networks, going beyond two layers. November 2018. URL http://arxiv.org/abs/1811.04918.

[2] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song.
A convergence theory for deep learning via over-parameterization. November 2018. URL http://arxiv.org/abs/1811.03962.

[3] Babajide O Ayinde, Tamer Inanc, and Jacek M Zurada. Redundant feature pruning for accelerated inference in deep neural networks. Neural Networks, May 2019. ISSN 0893-6080. doi: 10.1016/j.neunet.2019.04.021. URL http://www.sciencedirect.com/science/article/pii/S0893608019301273.

[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[5] Simon S Du and Jason D Lee. On the power of over-parametrization in neural networks with quadratic activation. March 2018. URL http://arxiv.org/abs/1803.01206.

[6] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. October 2018. URL http://arxiv.org/abs/1810.02054.

[7] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL http://arxiv.org/abs/1803.03635.

[8] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. The lottery ticket hypothesis at scale. March 2019. URL http://arxiv.org/abs/1903.01611.

[9] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. February 2019. URL http://arxiv.org/abs/1902.09574.

[10] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 9:249–256, 2010. ISSN 1532-4435. URL http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2010_GlorotB10.pdf.

[11] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R.
Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1379–1387. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6165-dynamic-network-surgery-for-efficient-dnns.pdf.

[12] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.

[13] Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 571–581. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7338-how-to-start-training-the-effect-of-initialization-and-architecture.pdf.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.

[17] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better ImageNet models transfer better? May 2018. URL http://arxiv.org/abs/1805.08974.

[18] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[19] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets.
arXiv preprint arXiv:1608.08710, 2016.

[20] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2019. URL http://arxiv.org/abs/1810.05270.

[21] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2498–2507. JMLR.org, 2017.

[22] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. November 2016. URL http://arxiv.org/abs/1611.06440.

[23] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.

[24] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. December 2014. URL http://arxiv.org/abs/1412.6614.

[25] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BygfghAcYX&noteId=BygfghAcYX.

[26] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

[27] Arnu Pretorius, Elan van Biljon, Steve Kroon, and Herman Kamper. Critical initialisation for deep signal propagation in noisy rectifier neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5717–5726.
Curran Associates, Inc., 2018.

[28] Zhuwei Qin, Fuxun Yu, Chenchen Liu, and Xiang Chen. Interpretable convolutional filter pruning, 2019. URL https://openreview.net/forum?id=BJ4BVhRcYX.

[29] Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. In International Conference on Learning Representations, 2016. URL http://arxiv.org/abs/1611.01232.

[30] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015. URL https://arxiv.org/abs/1409.1556.

[31] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. In ICANN, 2018. URL http://arxiv.org/abs/1808.01974.

[32] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.

[33] Greg Yang and Sam S. Schoenholz. Deep mean field theory: Layerwise variance and width variation as methods to control gradient explosion. In International Conference on Learning Representations Workshop Track, 2018. URL https://openreview.net/forum?id=rJGY8GbR-.

[34] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3320–3328. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5347-how-transferable-are-features-in-deep-neural-networks.pdf.

[35] Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Residual learning without normalization via better initialization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1gsz30cKX.

[36] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba.
Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[37] Michael H Zhu and Suyog Gupta. To prune, or not to prune: Exploring the efficacy of pruning for model compression. In International Conference on Learning Representations Workshop Track, February 2018. URL https://openreview.net/pdf?id=Sy1iIDkPM.

[38] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. July 2017. URL http://arxiv.org/abs/1707.07012.