Review for NeurIPS paper: Pruning neural networks without any data by iteratively conserving synaptic flow

NeurIPS 2020

Pruning neural networks without any data by iteratively conserving synaptic flow

Review 1

Summary and Contributions: This work proposes a novel pruning criterion called synaptic saliency, which is based on the Hadamard product between the weight magnitudes and the gradients. The concept is extended to a data-agnostic pruning algorithm and an iterative pruning algorithm.
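
To make the criterion concrete, here is a minimal sketch, assuming the score takes the form S(w) = dR/dw ⊙ w for some scalar function R of the weights. The toy objective, the finite-difference gradient, and all function names are illustrative assumptions, not the authors' implementation.

```python
def numerical_grad(f, w, eps=1e-6):
    """Central-difference gradient of the scalar function f at the weight list w."""
    grad = []
    for i in range(len(w)):
        hi = w[:i] + [w[i] + eps] + w[i + 1:]
        lo = w[:i] + [w[i] - eps] + w[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def saliency(f, w):
    """Elementwise (Hadamard) product of the gradient dR/dw and the weights w."""
    return [g * wi for g, wi in zip(numerical_grad(f, w), w)]


def toy_objective(w):
    """Hypothetical scalar objective R(w) = (w0 * w1 + w2)^2, for illustration only."""
    return (w[0] * w[1] + w[2]) ** 2


# Each entry is dR/dw_i * w_i; here the values are approximately [-4, -4, 12].
scores = saliency(toy_objective, [0.5, -2.0, 3.0])
```

A pruning rule built on such a score would then remove the weights with the smallest scores; how the scores are ranked and how many weights are removed per step is a separate design choice.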

Strengths: The method is very well motivated, with sound insights into preventing catastrophic layer-collapse. The authors also connect the idea to an explanation of the success of magnitude pruning, which is an interesting observation. The authors establish the superiority of SynFlow both theoretically and empirically. Given the completeness of the paper, I believe it is a work that should be accepted.

Weaknesses: The paper only provides comparisons with random pruning algorithms in the iterative pruning experiments, although I believe this is acceptable given the scope of the paper.

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: ========== Post-rebuttal ========== Thank you for the response. I now have less confidence in the present experiments, for two reasons: 1. I agree with R3 that the bug in the baseline models (SNIP and GraSP) is quite serious, and it would be better for the authors to fix it for all the experiments, as the current response only provides part of them. 2. As R2 and R3 mentioned, the iterative process could be the major reason for the effectiveness in preventing layer collapse, which is also shown in Figure B in the response. Admittedly, the iterative process alone could not solve layer collapse entirely, but it would be better for the authors to provide comparisons to stronger baselines and an ablation study. I like the contributions of this paper, including the identification of layer collapse, the interesting ideas behind the SynFlow algorithm, the explanation of the success of IMP, and the plausible improvements shown in the current experiments. However, given the above reasons, I am unfortunately decreasing my score to 6.

Review 2

Summary and Contributions: The paper investigates unstructured pruning of neural networks at initialization. It shows that most approaches suffer from layer collapse, where a full layer is pruned, leading to catastrophic performance. The paper shows that gradient-based pruning approaches respect conservation laws: their saliency score is conserved at every hidden unit and layer of a neural network. It then argues that layer collapse results from the conservation laws that apply to gradient-based pruning approaches, combined with one-shot pruning. The authors then empirically demonstrate that iterative magnitude pruning avoids layer collapse: retraining encourages the magnitude scores to observe a conservation law, which, combined with iterative pruning, leads to an increase in the saliency scores of the largest layer. Finally, the paper proposes SynFlow, a data-independent gradient-based pruning approach. SynFlow combines a data-independent score function that respects the conservation laws with an iterative estimation of the saliency to avoid layer collapse. The authors then validate SynFlow on the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. The empirical evaluation shows that SynFlow outperforms previous pruning-at-initialization approaches such as SNIP and GraSP.
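
The iterative, data-free procedure summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes a bias-free linear network, the objective R = 1^T |W_L| ... |W_1| 1 evaluated on an all-ones input, scores of the form dR/dW ⊙ W, and an exponential schedule that keeps a fraction compression^(-k/rounds) of the weights after round k. The toy architecture and all names are hypothetical.

```python
import random


def synflow_scores(weights, masks):
    """Score each weight of a bias-free linear network for the data-free
    objective R = 1^T |W_L| ... |W_1| 1, using score = dR/dW * W."""
    absw = [[[abs(w) * m for w, m in zip(w_row, m_row)]
             for w_row, m_row in zip(W, M)] for W, M in zip(weights, masks)]
    # Forward pass: push an all-ones input through the |W| layers.
    fwd = [[1.0] * len(absw[0][0])]
    for W in absw:
        fwd.append([sum(w * a for w, a in zip(row, fwd[-1])) for row in W])
    # Backward pass: pull an all-ones output gradient back through each layer.
    bwd = [[1.0] * len(absw[-1])]
    for W in reversed(absw):
        grad_in = [sum(W[i][j] * bwd[0][i] for i in range(len(W)))
                   for j in range(len(W[0]))]
        bwd.insert(0, grad_in)
    return [[[bwd[l + 1][i] * W[i][j] * fwd[l][j] for j in range(len(W[0]))]
             for i in range(len(W))] for l, W in enumerate(absw)]


def prune_iteratively(weights, compression, rounds):
    """Re-score and globally prune `rounds` times (assumed exponential schedule)."""
    masks = [[[1.0] * len(row) for row in W] for W in weights]
    total = sum(len(row) for W in weights for row in W)
    for k in range(1, rounds + 1):
        scores = synflow_scores(weights, masks)
        keep = max(1, int(total * compression ** (-k / rounds)))
        ranked = sorted(((s, l, i, j) for l, S in enumerate(scores)
                         for i, row in enumerate(S) for j, s in enumerate(row)),
                        reverse=True)
        # Rebuild the mask from the top-ranked weights; zero scores stay pruned.
        masks = [[[0.0] * len(row) for row in W] for W in weights]
        for s, l, i, j in ranked[:keep]:
            if s > 0.0:
                masks[l][i][j] = 1.0
    return masks


random.seed(0)
sizes = [8, 8, 8, 4]  # hypothetical toy architecture
weights = [[[random.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
masks = prune_iteratively(weights, compression=10.0, rounds=100)
remaining_per_layer = [int(sum(map(sum, M))) for M in masks]
```

Setting `rounds=1` in this sketch turns it into single-shot pruning with the same score, which is one simple way to separate the effect of the score from the effect of the iteration.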

Strengths: The paper identifies layer collapse as a clear bottleneck in pruning approaches and proposes an algorithm to address this issue. The empirical evaluation validates the proposed approach. Both the identification of layer collapse in approaches that prune at initialization and the SynFlow algorithm could be of interest to the community. The reason for the effectiveness of iterative magnitude pruning (IMP) is also an open question. The authors propose a plausible explanation for IMP's good performance and provide empirical evidence supporting their claims.

Weaknesses: The main limitation of this work is the absence of a comparison with lottery ticket approaches. In particular, the authors claim in the abstract that they ‘identify highly sparse trainable subnetworks at initialization, without ever training’. However, it is unclear whether the sparse network found by SynFlow achieves performance similar to that of the subnetwork identified through the lottery ticket procedure. SynFlow requires re-estimating the saliency scores, while previous approaches such as SNIP and GraSP perform one-shot pruning. It would be informative to discuss the computational cost associated with the different approaches. How would SynFlow compare, in terms of performance, with SNIP or GraSP using iterative estimation of their saliency scores?

Correctness: Claims and method appear correct to me.

Clarity: Paper is clear and pleasant to read.

Relation to Prior Work: Related work is clearly discussed.

Reproducibility: Yes

Additional Feedback: In Figure 3, why do the conservation laws only seem to hold for GraSP and SynFlow and not for SNIP? Is it because of the absolute value used when computing the saliency criterion? In Figure 5, is the performance reported after pruning or after retraining of the pruned network? --------------- Post-rebuttal update: Thanks for your rebuttal, which addresses most of my concerns. I will keep my original, positive rating.

Review 3

Summary and Contributions: This paper proposes a new criterion for pruning neural networks at initialization, without ever accessing the data. The authors first define the Maximal Critical Compression, i.e., the maximal compression rate a given pruning algorithm can reach before causing layer-collapse. Then the authors propose synaptic saliency, a class of gradient-based scores (the previous works SNIP and GraSP also fall in this class), which is conserved at every hidden unit and layer of a neural network. The conservation law reveals that single-shot pruning will result in layer-collapse. As a mitigation, the authors propose SynFlow, which iteratively prunes the network at initialization without access to any data. The experiments are conducted on CIFAR-10, CIFAR-100, and Tiny-ImageNet, showing that SynFlow outperforms the baselines by a large margin.
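
The link between layer-wise conservation and single-shot layer-collapse can be illustrated with a deliberately simple toy case. The sketch below assumes a bias-free linear network in which every weight has the same value, the objective R = sum of outputs for an all-ones input, and scores of the form dR/dw * w. These choices, the layer sizes, and the function names are assumptions made for illustration, not the setting of the paper's experiments.

```python
def layer_scores(sizes, weight=1.0):
    """Return (score_per_weight, weight_count) for each layer of a bias-free
    linear network in which every weight equals `weight` (toy assumption)."""
    w = abs(weight)
    # fwd[l]: signal reaching each unit of layer l from an all-ones input.
    fwd = [1.0]
    for n_in in sizes[:-1]:
        fwd.append(fwd[-1] * n_in * w)
    # bwd[l]: gradient of R = sum(outputs) with respect to each unit of layer l.
    bwd = [1.0]
    for n_out in reversed(sizes[1:]):
        bwd.insert(0, bwd[0] * n_out * w)
    # Score of one weight between layers l and l+1: dR/dw * w.
    return [(bwd[l + 1] * w * fwd[l], sizes[l] * sizes[l + 1])
            for l in range(len(sizes) - 1)]


def single_shot_keep(sizes, keep):
    """Keep the `keep` highest-scoring weights globally; count survivors per layer."""
    per_layer = layer_scores(sizes)
    flat = sorted(((s, l) for l, (s, n) in enumerate(per_layer) for _ in range(n)),
                  reverse=True)
    kept = [0] * len(per_layer)
    for _, l in flat[:keep]:
        kept[l] += 1
    return kept


sizes = [4, 16, 16, 2]  # hypothetical widths; the middle layer has the most weights
layer_totals = [s * n for s, n in layer_scores(sizes)]  # equal across layers
kept = single_shot_keep(sizes, keep=96)                 # middle layer ends up empty
```

Because every layer carries the same total score, the layer with the most weights has the lowest score per weight, so a single global threshold removes that layer first. In this toy case the 256-weight middle layer is emptied while the two smaller layers survive intact.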

Strengths:
- The authors propose a general class of gradient-based scores for pruning neural networks at initialization, which unifies the previous works SNIP and GraSP in this class.
- The proposed method is well-motivated based on the observation of layer-collapse and the conservation law.
- The experiments are thorough and carefully designed.
- The study on Iterative Magnitude Pruning is interesting and offers insights on why IMP performs so well in practice.

Weaknesses:
Method:
- Theorem 1 is not new; similar results were already presented in previous work [Liang et al., 2019].
- Theorem 1 does not hold for neural networks with bias terms. In the proof of Theorem 1, in L171, no bias terms appear in the computation of z_j. However, ResNet and VGGNet both have bias terms at each layer, so this theorem does not apply to them.
- I do not think data independence is a good feature in designing pruning algorithms, since the network architecture should depend on the data distribution. For example, when the input data is very sparse and most of its dimensions are not correlated with the prediction, you can probably prune a lot at the input layer. Therefore, I am not convinced by the motivation for data-independent pruning.
Experiments:
- For magnitude pruning, do you apply it to a pretrained network or a randomly initialized network? I am a bit surprised that it performs almost the worst among these methods.
- The comparison between SNIP, GraSP, and SynFlow is not fair, because SynFlow prunes the network iteratively. I would therefore suggest that the authors include comparisons with an iterative version of SNIP.
- Why does GraSP perform much worse than SNIP? When computing the saliency of GraSP and SNIP, it seems that you did not set the model to training mode. Is this a bug? The results would be more convincing if you adopted the same settings as used in the baseline papers.
Though this is an interesting paper, I am inclined to reject it at the current stage, as I am not convinced by the empirical results due to the potential bugs in the implementation of SNIP and GraSP. Besides, Theorem 1 does not apply to the networks studied in the experiments, as they all have bias terms. The authors should clarify this difference, and further experiments may be needed to study the role of bias terms in computing the saliency score.
Liang, Tengyuan, et al. "Fisher-Rao metric, geometry, and complexity of neural networks." The 22nd International Conference on Artificial Intelligence and Statistics. 2019.
==================== After Rebuttal ======================
Thanks to the authors for their response. Unfortunately, I am still inclined toward rejection, for the following reasons:
1. [Theorem 1] Although the authors' response has addressed my concerns about applying Thm 1 to networks with bias terms, it is still not straightforward to see whether it applies to BatchNorm layers. A detailed proof will be necessary. Besides, Thm 1 is obvious given the results in [Liang et al., 2019].
2. [Bug is critical] More importantly, the results of all the baseline models (SNIP and GraSP) were wrong due to the bugs in the code. Though the authors claimed they have rerun all of the experiments during the rebuttal period, only part of the results are provided in the rebuttal due to the one-page limitation. In general, I believe it would be unfair to accept a paper with buggy implementations of all the baseline models, especially since all the baseline models have public code available.
I believe this is an interesting paper, but for the aforementioned reasons, I vote for rejection at the current stage. I strongly recommend that the authors resubmit this work to a near-future conference with more convincing results and detailed proofs.

Correctness: The method itself is correct, and the empirical methodology is also correct. The proofs of the theorems are correct, but Theorem 1 does not apply to the networks (VGGNets and ResNets) adopted in the experiments.

Clarity: The paper is overall well written, but the structure can be improved. E.g., Section 5 could be moved to the appendix, as it does not cohere well with the paragraphs before and after it. The authors should also highlight their methods more.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper develops a unifying theoretical framework that explains why existing single-shot pruning algorithms at initialization suffer from layer-collapse. It also designs a new data-agnostic pruning algorithm, SynFlow, that provably avoids layer-collapse and reaches Maximal Critical Compression. Lastly, it achieves remarkable and consistent performance gains on several benchmark models and datasets.

Strengths: 1. This work has a theoretical framework and solid experiments to support its arguments. 2. It achieves remarkable and consistent performance gains on several benchmark models and datasets when the compression ratio is very high.

Weaknesses: 1. When the compression ratio is lower than 10, SynFlow's performance is comparable to other SOTA methods. One thing I am unsure about is whether we really need such a high compression ratio, e.g., 100 or higher. Such a high compression ratio hurts the model's performance significantly, which makes the resulting performance not very meaningful.

Correctness: Yes.

Clarity: Good writing overall but it can be better.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: