{"title": "Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask", "book": "Advances in Neural Information Processing Systems", "page_first": 3597, "page_last": 3607, "abstract": "The recent \"Lottery Ticket Hypothesis\" paper by Frankle & Carbin showed that a simple approach to creating sparse networks (keep the large weights) results in models that are trainable from scratch, but only when starting from the same initial weights. The performance of these networks often exceeds the performance of the non-sparse base model, but for reasons that were not well understood. In this paper we study the three critical components of the Lottery Ticket (LT) algorithm, showing that each may be varied significantly without impacting the overall results. Ablating these factors leads to new insights for why LT networks perform as well as they do. We show why setting weights to zero is important, how signs are all you need to make the re-initialized network train, and why masking behaves like training. Finally, we discover the existence of Supermasks, or masks that can be applied to an untrained, randomly initialized network to produce a model with performance far better than chance (86% on MNIST, 41% on CIFAR-10).", "full_text": "Deconstructing Lottery Tickets:\nZeros, Signs, and the Supermask\n\nHattie Zhou\n\nJanice Lan\n\nRosanne Liu\n\nJason Yosinski\n\nUber\n\nUber AI\n\nUber AI\n\nUber AI\n\nhattie@uber.com\n\njanlan@uber.com\n\nrosanne@uber.com\n\nyosinski@uber.com\n\nAbstract\n\nThe recent \u201cLottery Ticket Hypothesis\u201d paper by Frankle & Carbin showed that a\nsimple approach to creating sparse networks (keeping the large weights) results\nin models that are trainable from scratch, but only when starting from the same\ninitial weights. The performance of these networks often exceeds the performance\nof the non-sparse base model, but for reasons that were not well understood. 
In this paper we study the three critical components of the Lottery Ticket (LT) algorithm, showing that each may be varied significantly without impacting the overall results. Ablating these factors leads to new insights for why LT networks perform as well as they do. We show why setting weights to zero is important, how signs are all you need to make the reinitialized network train, and why masking behaves like training. Finally, we discover the existence of Supermasks, masks that can be applied to an untrained, randomly initialized network to produce a model with performance far better than chance (86% on MNIST, 41% on CIFAR-10).\n\n1 Introduction\n\nMany neural networks are over-parameterized [3, 4], enabling compression of each layer [4, 21, 8] or of the entire network [14]. Some compression approaches enable more efficient computation by pruning parameters, by factorizing matrices, or via other tricks [8, 10, 13, 16\u201318, 20\u201323]. Unfortunately, although sparse networks created via pruning often work well, training sparse networks directly often fails, with the resulting networks underperforming their dense counterparts [16, 8].\n\nA recent work by Frankle & Carbin [5] was thus surprising to many researchers when it presented a simple algorithm for finding sparse subnetworks within larger networks that are trainable from scratch. Their approach to finding these sparse, performant networks is as follows: after training a network, set all weights smaller than some threshold to zero, pruning them (similarly to other pruning approaches [9, 8, 15]), rewind the rest of the weights to their initial configuration, and then retrain the network from this starting configuration but with the zero weights frozen (not trained). Using this approach, they obtained two intriguing results.\n\nFirst, they showed that the pruned networks performed well. 
Aggressively pruned networks (with 95 percent to 99.5 percent of weights pruned) showed no drop in performance compared to the much larger, unpruned network. Moreover, networks only moderately pruned (with 50 percent to 90 percent of weights pruned) often outperformed their unpruned counterparts. Second, they showed that these pruned networks train well only if they are rewound to their initial state, including the specific initial weights that were used. Reinitializing the same network topology with new weights causes it to train poorly. As pointed out in [5], it appears that the specific combination of pruning mask and weights underlying the mask forms a more efficient subnetwork found within the larger network, or, as named by the original study, a lucky winning \u201cLottery Ticket,\u201d or LT.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Different mask criteria can be thought of as segmenting the 2D (wi = initial weight value, wf = final weight value) space into regions corresponding to mask values of 1 vs 0. The ellipse represents in cartoon form the area occupied by the positively correlated initial and final weights from a given layer. The mask criterion shown, identified by two horizontal lines that separate the whole region into mask-1 (blue) areas and mask-0 (grey) areas, corresponds to the large_final criterion used in [5]: weights with large final magnitude are kept and weights with final values near zero are pruned.\n\nWhile Frankle & Carbin [5] clearly demonstrated LT networks to be effective, their work raises many intriguing questions about the underlying mechanics of these subnetworks. What about LT networks causes them to show better performance? Why are the mask and the initial set of weights so tightly coupled, such that re-initializing the network makes it less trainable? 
Why does simply selecting large weights constitute an effective criterion for choosing a mask? We attempt to answer these questions by varying the essential steps in the lottery ticket algorithm, described below:\n\n0. Initialize a mask m to all ones. Randomly initialize the parameters w of a network f(x; w \u2299 m).\n\n1. Train the parameters w of the network f(x; w \u2299 m) to completion. Denote the initial weights before training wi and the final weights after training wf.\n\n2. Mask Criterion. Use the mask criterion M(wi, wf) to produce a masking score for each currently unmasked weight. Rank the weights in each layer by their scores, set the mask value for the top p% to 1, the bottom (100 \u2212 p)% to 0, breaking ties randomly. Here p may vary by layer, and we follow the ratios chosen in [5], summarized in Table S1. In [5] the mask selected weights with large final value, corresponding to M(wi, wf) = |wf|.\n\n3. Mask-1 Action. Take some action with the weights with mask value 1. In [5] these weights were reset to their initial values and marked for training in the next round.\n\n4. Mask-0 Action. Take some action with the weights with mask value 0. In [5] these weights were pruned: set to 0 and frozen during any subsequent training.\n\n5. Repeat from 1 if performing iterative pruning.\n\nIn this paper we perform ablation studies along the above three dimensions of variability, considering alternate mask criteria (Section 2), alternate mask-1 actions (Section 3), and alternate mask-0 actions (Section 4). These studies in aggregate reveal new insights for why lottery ticket networks work as they do. Along the way we discover the existence of Supermasks\u2014masks that produce above-chance performance when applied to untrained networks (Section 5). 
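As a concrete companion to the steps above, one pruning round for a single layer can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: `train_fn` is a stand-in for training the masked network to completion, only the large_final criterion of [5] is shown, and ties at the threshold are kept rather than broken randomly.

```python
import numpy as np

def lottery_ticket_round(w_init, train_fn, mask, p_keep):
    """One round of LT pruning for a single layer (steps 1-4 above).

    w_init:   initial weights, np.ndarray
    train_fn: stand-in for SGD; maps masked initial weights -> final weights
    mask:     current binary mask, same shape as w_init
    p_keep:   fraction of currently unmasked weights to keep
    """
    # Step 1: train the masked network to completion.
    w_final = train_fn(w_init * mask)

    # Step 2: mask criterion from [5]: large_final, M(wi, wf) = |wf|.
    scores = np.abs(w_final)
    alive = mask.astype(bool)
    k = int(round(p_keep * alive.sum()))
    threshold = np.sort(scores[alive])[::-1][k - 1] if k > 0 else np.inf
    new_mask = (scores >= threshold) & alive

    # Step 3 (mask-1 action): rewind kept weights to their initial values.
    # Step 4 (mask-0 action): set pruned weights to zero (and freeze them).
    w_next = w_init * new_mask
    return new_mask.astype(float), w_next
```

Iterative pruning (step 5) simply feeds `new_mask` and the original `w_init` back into the next round.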
We make our code available at https://github.com/uber-research/deconstructing-lottery-tickets.\n\n2 Mask criteria\n\nWe begin our investigation with a study of different Mask Criteria, or functions that decide which weights to keep vs. prune. In this paper, we define the mask for each individual weight as a function of the weight\u2019s values both at initialization and after training: M(wi, wf). We can visualize this function as a set of decision boundaries in a 2D space as shown in Figure 1. In [5], the mask criterion simply keeps weights with large final magnitude; we refer to this as the large_final mask, M(wi, wf) = |wf|.\n\nWe experiment with mask criteria based on final weights (large_final and small_final), initial weights (large_init and small_init), a combination of the two (large_init_large_final and small_init_small_final), and how much weights move (magnitude_increase and movement). We also include random as a control case, which chooses masks randomly. These nine masks are depicted along with their associated equations in Figure 2. Note that the main difference between magnitude_increase and movement is that those weights that change sign are more likely to be kept in the movement criterion than the magnitude_increase criterion.\n\nThe nine criteria and their scoring formulas are:\n\nlarge_final: |wf|\nsmall_final: \u2212|wf|\nlarge_init: |wi|\nsmall_init: \u2212|wi|\nlarge_init_large_final: min(\u03b1|wf|, |wi|)\nsmall_init_small_final: \u2212max(\u03b1|wf|, |wi|)\nmagnitude_increase: |wf| \u2212 |wi|\nmovement: |wf \u2212 wi|\nrandom: 0\n\nFigure 2: Mask criteria studied in this section, starting with large_final that was used in [5]. Names we use to refer to the various methods are given along with the formula that projects each (wi, wf) pair to a score. 
Weights with the largest scores (colored regions) are kept, and weights with the smallest scores (gray regions) are pruned. The x axis in each small figure is wi and the y axis is wf. In two methods, \u03b1 is adjusted as needed to align percentiles between wi and wf. When masks are created, ties are broken randomly, so a score of 0 for every weight results in random masks.\n\nFigure 3: Test accuracy at early stopping iteration of different mask criteria for four networks at various pruning rates. Each line is a different mask criterion, with bands around the best-performing mask criteria (large_final and magnitude_increase) and the baseline (random) depicting the min and max over 5 runs. Stars represent points where large_final or magnitude_increase is significantly above the other at p < 0.05. The eight mask criteria form four groups of inverted pairs (each column of the legend represents one such pair) that act as controls for each other. We observe that large_final and magnitude_increase have the best performance, with magnitude_increase having slightly higher accuracy in Conv2 and Conv4. See Figure S1 for results on convergence speed.\n\nIn this section and throughout the remainder of the paper, we follow the experimental framework from [5] and perform iterative pruning experiments on a 3-layer fully-connected network (FC) trained on MNIST [12] and on three convolutional neural networks (CNNs), Conv2, Conv4, and Conv6 (small CNNs with 2/4/6 convolutional layers, same as used in [5]) trained on CIFAR-10 [11]. For more architecture and training details, see Section S1 in Supplementary Information. We hope to expand these experiments to larger datasets and deeper models in future work. In particular, [6] shows that the original LT algorithm as proposed does not generalize to ResNet on ImageNet. 
It would be valuable to see how well the experiments in this paper generalize to harder problems.\n\nResults of all criteria are shown in Figure 3 for the four networks (FC, Conv2, Conv4, Conv6). The accuracy shown is the test accuracy at an early stopping iteration1 of training. For all figures in this paper, the line depicts the mean over five runs, and the band (if shown) depicts the min and max obtained over five runs. In some cases the band is omitted for visual clarity.\n\nNote that the first six criteria as depicted in Figure 2 form three opposing pairs; in each case, we observe that when one member of the pair performs better than the random baseline, the opposing member performs worse than it. Moreover, the magnitude_increase criterion turns out to work just as well as the large_final criterion, and in some cases significantly better2.\n\nThe conclusion so far is that although large_final is a very competitive mask criterion, the LT behavior is not limited to this mask criterion, as other mask criteria (magnitude_increase, large_init_large_final, movement) can also match or exceed the performance of the original network. This partially answers our question about the efficacy of different mask criteria. Still unanswered: why either of the two front-running criteria (magnitude_increase, large_final) should work well in the first place. 
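For reference, the nine scoring rules of Figure 2 are small enough to write down directly. The following is a sketch, not the paper's code, with \u03b1 fixed to 1 for the two combined criteria (the paper instead adjusts \u03b1 to align percentiles between wi and wf):

```python
import numpy as np

# Each criterion maps (w_i, w_f) elementwise to a score; higher scores are kept.
MASK_CRITERIA = {
    "large_final":            lambda wi, wf: np.abs(wf),
    "small_final":            lambda wi, wf: -np.abs(wf),
    "large_init":             lambda wi, wf: np.abs(wi),
    "small_init":             lambda wi, wf: -np.abs(wi),
    # alpha = 1 here; in the paper alpha aligns percentiles of |wi| and |wf|.
    "large_init_large_final": lambda wi, wf: np.minimum(np.abs(wf), np.abs(wi)),
    "small_init_small_final": lambda wi, wf: -np.maximum(np.abs(wf), np.abs(wi)),
    "magnitude_increase":     lambda wi, wf: np.abs(wf) - np.abs(wi),
    "movement":               lambda wi, wf: np.abs(wf - wi),
    "random":                 lambda wi, wf: np.zeros_like(wi),  # ties broken randomly
}
```

Written this way, the pairing noted above is visible: each criterion in a pair is the negation of the other, so keeping the top scores under one is equivalent to keeping the bottom scores under the other.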
We uncover those details in the following two sections.\n\n3 Mask-1 actions: the sign-ificance of initial weights\n\nNow that we have explored various ways of choosing which weights to keep and prune, we will consider how we should initialize the kept weights. In particular, we want to explore an interesting observation in [5]: the pruned, skeletal LT networks train well when rewound to their original initialization, but degrade in performance when randomly reinitialized.\n\nWhy does reinitialization cause LT networks to train poorly? Which components of the original initialization are important? To investigate, we keep all other treatments the same as [5] and try a number of variants in the treatment of the 1-masked, trainable weights, in terms of how to reinitialize them before the subnetwork training:\n\n\u2022 \u201cReinit\u201d experiments: reinitialize kept weights based on the original init distribution.\n\u2022 \u201cReshuffle\u201d experiments: reinitialize while respecting the original distribution of remaining weights in that layer by reshuffling the kept weights\u2019 initial values.\n\u2022 \u201cConstant\u201d experiments: reinitialize by setting 1-masked weight values to a positive or negative constant; thus every weight on a layer becomes one of three values: \u2212\u03b1, 0, or \u03b1, with \u03b1 being the standard deviation of each layer\u2019s original initialization.\n\nAll of the reinitialization experiments are based on the same original networks and use the large_final mask criterion with iterative pruning. We include the original LT network (\u201crewind, large final\u201d) and the randomly pruned network (\u201crandom\u201d) as baselines for comparison.\n\nWe find that none of these three variants alone is able to train as well as the original LT network, shown as dashed lines in Figure 4. 
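A minimal sketch of the three reinitialization variants for one layer may help make the treatments concrete. It is illustrative only: `reinit` assumes a zero-mean normal initialization, and the `constant` variant shows only values of magnitude \u03b1 (with the sign handled by the `keep_sign` flag rather than sampled randomly).

```python
import numpy as np

def reinit_kept_weights(w_init, mask, variant, keep_sign, rng):
    """Reinitialize the 1-masked weights of one layer.

    variant: 'reinit'    -- resample from the layer's init distribution
             'reshuffle' -- permute the kept weights' own initial values
             'constant'  -- set to alpha, the std of the original init
    keep_sign: if True, force each new value to carry its original init sign.
    """
    kept = mask.astype(bool)
    vals = w_init[kept]
    if variant == "reinit":
        new = rng.normal(0.0, w_init.std(), size=vals.shape)
    elif variant == "reshuffle":
        new = rng.permutation(vals)
    elif variant == "constant":
        new = np.full(vals.shape, w_init.std())
    else:
        raise ValueError(variant)
    if keep_sign:
        new = np.abs(new) * np.sign(vals)
    out = np.zeros_like(w_init)  # 0-masked weights stay pruned at zero
    out[kept] = new
    return out
```

With `keep_sign=True` and `variant="constant"`, every surviving weight becomes \u2212\u03b1, 0, or \u03b1, as in the "Constant" experiments.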
However, all three variants work better when we ensure that the new values of the kept weights are of the same sign as their original initial values. These are shown as solid color lines in Figure 4. Clearly, the common factor in all working variants, including the original rewind action, is the sign. As long as you keep the sign, reinitialization is not a deal breaker; in fact, even setting all kept weights to a constant value consistently performs well! The significance of the sign suggests, in contrast to [5], that the basin of attraction for an LT network is actually quite large: optimizers work well anywhere in the correct sign quadrant for the weights, but encounter difficulty crossing the zero barrier between signs.\n\n4 Mask-0 actions: masking is training\n\nWhat should we do with weights that are pruned? This question may seem trivial, as deleting them (equivalently: setting them to zero) is the standard practice. The term \u201cpruning\u201d implies the dropping of connections by setting weights to zero, and these weights are thought of as unimportant. However, if the value of zero for the pruned weights is not important to the performance of the network, we should expect that we can set pruned weights to some other value, such as leaving them frozen at their initial values, without hurting the trainability of the network. This turns out to not be the case. We show in this section that zero values actually matter, that an alternative freezing approach results in better-performing networks, and that masking can be viewed as a way of training.\n\n1The early stopping criterion we employ in this paper is the iteration of minimum validation loss.\n2We run a t-test for each pruning percentage based on a sample of 5 independent runs for each mask criterion.\n\nFigure 4: The effects of various 1-actions for four networks at various pruning rates. All reinitialization experiments use the large_final mask criterion with iterative pruning. Dotted lines represent the three described methods, and solid lines are those three except with each weight having the same sign as its original initialization. Shaded bands around notable runs depict the min and max over 5 runs. Stars represent points where \u201crewind (large_final)\u201d or \u201cconstant, init sign\u201d is significantly above the other at a p < 0.05 level; none appear, indicating no difference in performance between the two. The original large_final and random are included as baselines. See Figure S4 for results on convergence speed.\n\nTypical network pruning procedures [9, 8, 15] perform two actions on pruned weights: set them to zero, and freeze them in subsequent training (equivalent to removing those connections from the network). It is unclear which of these two components leads to the increased performance in LT networks. To separate the two factors, we run a simple experiment: we reproduce the LT iterative pruning experiments in which network weights are masked out in alternating train/mask/rewind cycles, but try an additional treatment: freeze masked weights at their initial values instead of at zero. If zero isn\u2019t special, both should perform similarly.\n\nFigure 5 shows the results for this experiment. We find that networks perform significantly better when weights are frozen specifically at zero than at random initial values. For these networks masked via the LT large_final criterion3, zero would seem to be a particularly good value to set pruned weights to. At high levels of pruning, freezing at the initial values may perform better, which makes sense since having a large number of zeros means having lots of dead connections.\n\nSo why does zero work better than initial values? 
One hypothesis is that the mask criterion we use tends to mask to zero those weights that were headed toward zero anyway.\n\nTo test out this hypothesis, we propose another mask-0 action halfway between freezing at zero and freezing at initialization: for any zero-masked weight, freeze it to zero if it moved toward zero over the course of training, and freeze it at its random initial value if it moved away from zero. We show two variants of this experiment in Figure 5. In the first variant, we apply it directly as stated to zero-masked weights (to be pruned). We see that by doing so we achieve comparable performance to the original LT networks at low pruning rates and better at high pruning rates. In the second variant, we extend this action to one-masked weights too; that is, we initialize every weight to zero if it moved toward zero during training, regardless of the pruning action on it. We see that the performance of Variant 2 is even better than that of Variant 1, suggesting that this new mask-0 action can be a beneficial mask-1 action too. These results support our hypothesis that the benefit derived from freezing values to zero comes from the fact that those values were moving toward zero anyway4. This view of masking as training provides a new perspective on 1) why certain mask criteria work well (large_final and magnitude_increase both bias towards setting pruned weights close to their final values in the previous round of training), 2) the important contribution of the value of pruned weights to the overall performance of pruned networks, and 3) the benefit of setting these select weights to zero as a better initialization for the network.\n\n3Figure S3 illustrates why the large_final criterion biases weights that were moving toward zero during training toward zero in the mask, effectively pushing them further in the direction they were headed.\n\nFigure 5: Performance of network pruning using different treatments of pruned weights (mask-0 actions). Horizontal black lines represent the performance of training the original, full network, averaged over five runs. Solid blue lines represent the original LT algorithm, which freezes pruned weights at zero. Dotted blue lines freeze pruned weights at their initial values. Grey lines show the new proposed 0-action\u2014set to zero if they decreased in magnitude by the end of training, otherwise set to their initialization values. Two variants are shown: 1) new treatment applied to only pruned weights (dashdotted grey lines); 2) new treatment applied to all weights (dashed grey lines).\n\n5 Supermasks\n\nThe hypothesis above suggests that for certain mask criteria, like large_final, masking is training: the masking operation tends to move weights in the direction they would have moved during training. If so, just how powerful is this training operation? To answer this, we can start from the beginning\u2014not training the network at all, but simply applying a mask to the randomly initialized network.\n\nIt turns out that with a well-chosen mask, an untrained network can already attain a test accuracy far better than chance. 
This might come as a surprise, because if you use a randomly initialized and untrained network to, say, classify images of handwritten digits from the MNIST dataset, you would expect accuracy to be no better than chance (about 10%). But now imagine you multiply the network weights by a mask containing only zeros and ones. In this instance, weights are either unchanged or deleted entirely, but the resulting network now achieves nearly 40 percent accuracy at the task! This is strange, but it is exactly what we observe with masks created using the large_final criterion.\n\nIn randomly-initialized networks with large_final masks, it is not implausible to have better-than-chance performance, since the masks are derived from the training process. The large improvement in performance is still surprising, however, since the only transmission of information from the training back to the initial network is via a zero-one mask based on a simple criterion. We call masks that can produce better-than-chance accuracy without training of the underlying weights \u201cSupermasks\u201d.\n\nWe now turn our attention to finding better Supermasks. First, we simply gather all masks instantiated in the process of creating the networks shown in Figure 2, apply them to the original, randomly initialized networks, and evaluate the accuracy without training the network. Next, compelled by the demonstration in Section 3 of the importance of signs and in Section 4 of keeping large weights, we define a new large_final_same_sign mask criterion that selects for weights with large final magnitudes that also maintained the same sign by the end of training. This criterion, as well as the control case of large_final_diff_sign, is depicted in Figure 6. Performances of Supermasks produced by all 10 criteria are included in Figure 7, compared with two baselines: networks untrained and unmasked (untrained_baseline) and networks fully trained (trained_baseline). For simplicity, we evaluate Supermasks based on one-shot pruning rather than iterative pruning.\n\n4Additional control variants of this experiment can be seen in Supplementary Information Section S3.\n\nFigure 6: (left) Untrained networks perform at chance (10% accuracy) on MNIST if they are randomly initialized, or randomly initialized and randomly masked. However, applying the large_final mask improves the network accuracy beyond the chance level. (right) The large_final_same_sign mask criterion, max(0, wi wf / |wi|), that tends to produce the best Supermasks. In contrast to the large_final mask in Figure 1, this criterion masks out the quadrants where the sign of wi and wf differ. We include large_final_diff_sign, max(0, \u2212wi wf / |wi|), as a control.\n\nWe see that large_final_same_sign significantly outperforms the other mask criteria in terms of accuracy at initialization. We can create networks that obtain a remarkable 80% test accuracy on MNIST and 24% on CIFAR-10 without training using this simple mask criterion. Another curious observation is that if we apply the mask to a signed constant (as described in Section 3) rather than the actual initial weights, we can produce even higher test accuracy of up to 86% on MNIST and 41% on CIFAR-10! Detailed results across network architectures, pruning percentages, and these two treatments are shown in Figure 7.\n\nWe find it fascinating that these Supermasks exist and can be found via such simple criteria. 
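The two criteria of Figure 6 reduce to one-liners; a sketch:

```python
import numpy as np

def large_final_same_sign(wi, wf):
    # wi * wf / |wi| equals sign(wi) * wf, so the score is |wf| when the
    # sign survived training and 0 otherwise.
    return np.maximum(0.0, wi * wf / np.abs(wi))

def large_final_diff_sign(wi, wf):
    # Control criterion: keeps only weights whose sign flipped.
    return np.maximum(0.0, -wi * wf / np.abs(wi))
```

Under the first criterion a weight is kept only if it is both large at the end of training and in the same sign quadrant it started in, combining the lessons of Sections 3 and 4.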
As an aside, they also present a method for network compression, since we only need to save a binary mask and a single random seed to reconstruct the full weights of the network.\n\n5.1 Optimizing the Supermask\n\nWe have shown that Supermasks derived using simple heuristics greatly enhance the performance of the underlying network immediately, with no training involved. In this section we are interested in how far we can push the performance of Supermasks by training the mask, instead of training network weights. Similar works in this domain include training networks with binary weights [1, 2], or training masks to adapt a base network to multiple tasks [19]. Our work differs in that the base network is randomly initialized, never trained, and masks are optimized for the original task.\n\nWe do so by creating a trainable mask variable for each layer while freezing all original parameters for that layer at their random initialization values. For an original weight tensor w and a mask tensor m of the same shape, we have as the effective weight w\u2032 = wi \u2299 g(m), where wi denotes the initial values weights are frozen at, \u2299 is element-wise multiplication, and g is a point-wise function that transforms a matrix of continuous values into binary values.\n\nWe train the masks with g(m) = Bern(S(m)), where Bern(p) is the Bernoulli sampler with probability p, and S(m) is the sigmoid function. The Bernoulli sampling adds some stochasticity that helps with training, mitigates the bias of all things starting at the same value, and uses in effect the expected value of S(m), which is especially useful when it is close to 0.5.\n\nBy training the m matrix with SGD, we obtained up to 95.3% test accuracy on MNIST and 65.4% on CIFAR-10. Results are shown in Figure 7, along with all the heuristic-based, unlearned Supermasks. Note that there is no straightforward way to control for the pruning percentage. 
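A forward pass of the effective weights w′ = wi ⊙ Bern(S(m)) can be sketched as below. This is an assumption-laden sketch showing only the sampling; propagating gradients back to m, as done when the mask is trained with SGD, is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def supermask_forward(w_init, m, rng):
    """Effective weights w' = w_init * Bern(sigmoid(m)).

    w_init is frozen at its random initialization; only the real-valued
    mask variable m is trained. Each forward pass draws a fresh binary
    mask, so the network sees a stochastic subnetwork of itself.
    """
    probs = sigmoid(m)
    g = (rng.random(m.shape) < probs).astype(w_init.dtype)  # Bernoulli sample
    return w_init * g
```

Note that nothing here pins down the fraction of zeros in the sampled mask directly; it follows from the magnitudes of the entries of m.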
Instead, we initialize m with larger or smaller magnitudes, which nudges the network toward pruning more or less. This allows us to produce masks with amounts of pruning (percentages of zeros) ranging from 7% to 89%. Further details about the training can be seen in Section S6.\n\nFigure 7: Comparison of Supermask performances in terms of test accuracy on MNIST and CIFAR-10 classification tasks. Subfigures are across two network structures (top: FC on MNIST, bottom: Conv4 on CIFAR-10), as well as 1-action treatments (left: weights are at their original initialization, right: weights are converted to signed constants). No training is performed in any network. Within heuristic-based Supermasks (excluding learned_mask), the large_final_same_sign mask creates the highest-performing Supermask by a wide margin. Note that aside from the five independent runs performed to generate uncertainty bands for this plot, every point on this plot is from the same underlying network, just with different masks. See Figure S6 for performance on all four networks.\n\n5.2 Dynamic Weight Rescaling\n\nOne beneficial trick in Supermask training is to dynamically rescale the values of weights based on the sparsity of the network in the current training iteration. For each training iteration and for each layer, we multiply the underlying weights by the ratio of the total number of weights in the layer over the number of ones in the corresponding mask. Dynamic rescaling leads to significant improvements in the performance of the masked networks, which is illustrated in Table 1.\n\nTable 1 summarizes the best test accuracy obtained through different treatments. 
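The per-layer rescaling of Section 5.2 is a one-line operation; a sketch:

```python
import numpy as np

def dynamic_weight_rescale(w, mask):
    """Dynamic Weight Rescaling for one layer: scale the masked weights by
    (total number of weights) / (number of ones in the mask), recomputed at
    every training iteration as the mask changes."""
    n_total = mask.size
    n_kept = mask.sum()
    return w * mask * (n_total / n_kept)
```

Intuitively, the scaling roughly compensates for the reduced number of active connections as the mask sparsifies, keeping the scale of layer outputs closer to that of the dense network.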
The results show a striking improvement of learned Supermasks over heuristic-based ones. Learned Supermasks result in performance close to training the full network, which suggests that a network upon initialization already contains powerful subnetworks that work well without training.\n\nTable 1: Test accuracy of the best Supermasks with various initialization treatments. Values shown are the max over any prune percentage and averaged over four or more runs. The first two columns show untrained networks with heuristic-based masks, where \u201cinit\u201d stands for the initial, untrained weights, and \u201cS.C.\u201d is the signed constant approach, which replaces each random initial weight with its sign as described in Section 3. The next two columns show results for untrained weights overlaid with learned masks; and the two after add the Dynamic Weight Rescaling (DWR) approach. The final column shows the performance of networks with weights trained directly using gradient descent. Asterisks mark the performance of the best Supermask variation.\n\nNetwork | mask \u2299 init | mask \u2299 S.C. | learned mask \u2299 init | learned mask \u2299 S.C. | DWR learned mask \u2299 init | DWR learned mask \u2299 S.C. | trained weights\nMNIST FC | 79.3 | 86.3 | 95.3 | 96.4 | 97.8 | 98.0* | 97.7\nCIFAR Conv2 | 22.3 | 37.4 | 64.4 | 66.3* | 65.0 | 66.0 | 69.2\nCIFAR Conv4 | 23. | 39.7 | 65.4 | 66.2 | 71.7 | 72.5* | 75.4\nCIFAR Conv6 | 24.0 | 41.0 | 65.3 | 65.4 | 76.3 | 76.5* | 78.3\n\n6 Conclusion\n\nIn this paper, we have studied how three components of LT-style network pruning\u2014mask criterion, treatment of kept weights during retraining (mask-1 action), and treatment of pruned weights during retraining (mask-0 action)\u2014come together to produce sparse and performant subnetworks. We proposed the hypothesis that networks work well when pruned weights are set close to their final values. Building on this hypothesis, we introduced alternative freezing schemes and other mask criteria that meet or exceed current approaches by respecting this basic rule. We also showed that the only element of the original initialization that is crucial to the performance of LT networks is the sign, not the relative magnitude of the weights. Finally, we demonstrated that the masking procedure can be thought of as a training operation, and consequently we uncovered the existence of Supermasks, which can produce partially working networks without training.\n\nAcknowledgments\n\nThe authors would like to acknowledge Jonathan Frankle, Joel Lehman, Zoubin Ghahramani, Sam Greydanus, Kevin Guo, and members of the Deep Collective research group at Uber AI for combinations of helpful discussion, ideas, feedback on experiments, and comments on early drafts of this work.\n\nReferences\n\n[1] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3123\u20133131. Curran Associates, Inc., 2015.\n\n[2] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or \u22121. 
arXiv preprint arXiv:1602.02830, 2016.

[3] Yann Dauphin and Yoshua Bengio. Big neural networks waste capacity. CoRR, abs/1301.3583, 2013.

[4] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148\u20132156, 2013.

[5] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), volume abs/1803.03635, 2019.

[6] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. The lottery ticket hypothesis at scale. CoRR, abs/1903.01611, 2019.

[7] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249\u2013256, 2010.

[8] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2015.

[9] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1135\u20131143. Curran Associates, Inc., 2015.

[10] Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 164\u2013171. Morgan-Kaufmann, 1993.

[11] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[12] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278\u20132324, 1998.

[13] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598\u2013605. Morgan-Kaufmann, 1990.

[14] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. In International Conference on Learning Representations, April 2018.

[15] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

[16] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations (ICLR), volume abs/1608.08710, 2017.

[17] C. Louizos, K. Ullrich, and M. Welling. Bayesian compression for deep learning. ArXiv e-prints, May 2017.

[18] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5058\u20135066, 2017.

[19] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), pages 67\u201382, 2018.

[20] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations (ICLR), 2017.

[21] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. CoRR, abs/1608.03665, 2016.

[22] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5687\u20135695, 2017.

[23] Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang. Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, pages 1476\u20131483, 2015.