NeurIPS 2020

Pruning Filter in Filter


Review 1

Summary and Contributions: Unlike conventional channel pruning or shape-wise pruning, the paper proposes a new pruning method (PFF) that learns the optimal shape of each filter and performs stripe selection within each filter. An efficient implementation of the pruning method is also introduced.

Strengths: 1. The idea is novel. Unlike conventional channel pruning or shape-wise pruning, the paper proposes a new pruning method that learns the optimal shape of each filter and performs stripe selection within each filter. 2. The claims are reasonable: the method keeps each filter independent of the others, so it does not break the independence assumption among filters. 3. Pruning stripes in filters leads to regular sparse computation and can easily achieve acceleration on general-purpose processors.

Weaknesses: I have several questions: 1. Some SOTA pruning methods, such as [1] and [2], were not compared. 2. The actual inference latency of the pruned networks is not provided. [1] Yu, Jiahui, and Thomas S. Huang. "Universally slimmable networks and improved training techniques." Proceedings of the IEEE International Conference on Computer Vision, 2019. [2] Chin, Ting-Wu, et al. "Towards Efficient Model Compression via Learned Global Ranking." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

Correctness: The claims and method are correct. The empirical methodology is correct.

Clarity: The paper is well written and easy to follow.

Relation to Prior Work: The paper discusses related work well, including weight pruning, channel pruning, and shape-wise pruning. In particular, compared with shape-wise pruning [1], which removes the weights located at the same position across all filters, the proposed method keeps each filter independent of the others and can thus lead to a more efficient network structure. The proposed method utilizes a Filter Skeleton (FS) to prune spatial groups. FS can be viewed as a mask that determines whether to preserve a given location. Full-stack filters [2] also utilize a mask to determine the states of weights in filters; the difference from [2] should be discussed. [1] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016. [2] Han, Kai, et al. "Full-stack filters to build minimum viable CNNs." arXiv preprint arXiv:1908.02023 (2019).
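To make the contrast concrete, here is a minimal sketch (my own illustration, not code from either paper; all names, shapes, and the random masks are assumed) of the difference between a shape-wise mask shared across all filters [1] and a per-filter Filter Skeleton:

```python
import torch

n_filters, channels, K = 64, 32, 3
weight = torch.randn(n_filters, channels, K, K)

# Shape-wise pruning (e.g., [1]): a single K x K mask shared by ALL filters,
# so a pruned spatial position disappears from every filter simultaneously.
shared_mask = (torch.rand(K, K) > 0.3).float()
shape_wise_pruned = weight * shared_mask                    # broadcasts over (n, c)

# Filter Skeleton (FS): each filter gets its own K x K mask, so filters stay
# independent of each other and can end up with different shapes.
filter_skeleton = (torch.rand(n_filters, K, K) > 0.3).float()
stripe_wise_pruned = weight * filter_skeleton.unsqueeze(1)  # broadcasts over c
```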

Reproducibility: Yes

Additional Feedback: Update after rebuttal: I have read the rebuttal and the other reviews. I keep my score of 6. The paper is novel in that it learns an optimal shape for each filter and provides an efficient implementation. I also agree with the other reviewers that some of the presentation and claims remain unclear. I tend to vote for accepting this submission.


Review 2

Summary and Contributions: This paper proposes PFF (pruning filter in filter), a pruning method that prunes stripes of a filter instead of the whole filter, using L1 regularization on a Filter Skeleton. Here, the Filter Skeleton is a soft mask of dimension n_l x K x K (for n_l filters with kernel size K x K) that multiplies the weights of the convolution filters and determines which stripes of length c_l (the number of channels in a filter) will be pruned. PFF trains the neural network with L1 regularization on the Filter Skeleton and prunes stripes by thresholding the Filter Skeleton. PFF achieves finer granularity while remaining hardware friendly. It also achieves state-of-the-art pruning ratios on the CIFAR-10 and ImageNet datasets without much accuracy drop.
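For concreteness, the following is a minimal sketch of the mechanism described above, written from this summary rather than taken from the authors' code; the module name, the regularization strength, and the pruning threshold are all my own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StripeConv2d(nn.Module):
    """Convolution whose weights are gated by a learnable Filter Skeleton.

    The skeleton has shape (n_l, K, K): one entry per spatial position of each
    filter, i.e. per stripe of length c_l along the channel axis.
    (Illustrative sketch only; not the authors' implementation.)
    """
    def __init__(self, in_channels, out_channels, kernel_size, padding=1):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(out_channels, in_channels, kernel_size, kernel_size) * 0.01)
        # Filter Skeleton: soft mask over (filter, row, col), initialized to 1.
        self.skeleton = nn.Parameter(torch.ones(out_channels, kernel_size, kernel_size))
        self.padding = padding

    def forward(self, x):
        # Broadcast the (n_l, K, K) skeleton over the channel axis so that a
        # single skeleton entry scales an entire stripe of c_l weights.
        masked_weight = self.weight * self.skeleton.unsqueeze(1)
        return F.conv2d(x, masked_weight, padding=self.padding)

def skeleton_l1_penalty(model, alpha=1e-5):
    """L1 regularization on all Filter Skeletons (added to the task loss)."""
    return alpha * sum(m.skeleton.abs().sum()
                       for m in model.modules() if isinstance(m, StripeConv2d))

def prune_stripes(model, threshold=0.05):
    """Zero out stripes whose skeleton value falls below a threshold."""
    with torch.no_grad():
        for m in model.modules():
            if isinstance(m, StripeConv2d):
                keep = (m.skeleton.abs() > threshold).float()
                m.skeleton.mul_(keep)
                m.weight.mul_(keep.unsqueeze(1))
```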

Strengths: The paper is well written and easy to follow. The main idea is simple and achieves good performance. Since the implementation details are thoroughly described and the attached code is well explained, it is easy to follow the whole process of the paper end to end.

Weaknesses: First, the concept of pruning along the filter axis of a convolution filter was introduced in [Kang], so PFF is not the first to prune the filter axis of a convolution filter. Moreover, shape-wise pruning methods [13] prune the filter axis of a convolution filter, while PFF prunes the channel axis. L100-108 explains the benefit of pruning the channel axis rather than the filter axis, but the experiments comparing these two kinds of methods are very insufficient. Thus, substantially more experiments comparing shape-wise pruning methods and PFF are required to make the case. Additionally, the Filter Skeleton and the soft mask used in [24] share a similar concept. In summary, the novelty of the proposed approach is very limited. [Kang] Kang, H. J. (2019). Accelerator-aware pruning for convolutional neural networks. IEEE Transactions on Circuits and Systems for Video Technology. [13] Vadim Lebedev and Victor Lempitsky. Fast ConvNets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2554–2564, 2016. [24] Shaohui Lin, Rongrong Ji, Chenqian Yan, Baochang Zhang, Liujuan Cao, Qixiang Ye, Feiyue Huang, and David Doermann. Towards optimal structured CNN pruning via generative adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2790–2799, 2019.

Correctness: The claims and methods in the paper are correct. The paper prunes the same baseline models used by the compared methods and evaluates the pruned models on standard image classification tasks.

Clarity: The paper is well-written.

Relation to Prior Work: The paper sufficiently covers the prior works and discusses how the proposed method relates to them.

Reproducibility: Yes

Additional Feedback: As I mentioned above, in order to claim that pruning the channel axis is better than pruning the filter axis (i.e., that PFF is preferable to shape-wise pruning), substantial experiments comparing shape-wise pruning methods and PFF are required to make the case. ----- Thank you for your answers to the feedback. I still don't think that the stripe-wise pruning method is new, but I agree that the analyses in this paper and the performance of the method are worth being heard by the efficient-inference community.


Review 3

Summary and Contributions: The paper proposes a method to "prune the filter in the filter" (PFF) that prunes a trained neural network. It treats a tensorial filter as multiple 1x1 filters and learns to remove these 1x1 filters through a sparsity-constrained optimization. It additionally introduces a "Filter Skeleton" whose values reflect the optimal shape of each filter. It demonstrates that the proposed PFF achieves state-of-the-art pruning ratios on the CIFAR-10 and ImageNet datasets without accuracy drop.
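As a small self-contained illustration of the "multiple 1x1 filters" view (my own sketch, not the authors' code; shapes, names, and the padding choice are assumed): a K x K convolution can be rewritten as a sum of K*K 1x1 convolutions applied to shifted inputs, which is why individual stripes can be removed without breaking the layer.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 16, 16)                # (batch, c_l, H, W)
weight = torch.randn(4, 8, 3, 3)             # (n_l, c_l, K, K), K = 3
pad = 1                                      # "same" padding for K = 3

full = F.conv2d(x, weight, padding=pad)

# Rebuild the same output from K*K stripes: each stripe is a 1x1 filter of
# length c_l taken at spatial offset (i, j) and applied to a shifted input.
xp = F.pad(x, (pad, pad, pad, pad))
H, W = x.shape[2], x.shape[3]
out = torch.zeros_like(full)
for i in range(3):
    for j in range(3):
        stripe = weight[:, :, i, j].unsqueeze(-1).unsqueeze(-1)  # (n_l, c_l, 1, 1)
        out += F.conv2d(xp[:, :, i:i + H, j:j + W], stripe)

print(torch.allclose(full, out, atol=1e-4))  # True: the stripe view is exact
```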

Strengths: The idea of decomposing tensorial filters into a "Filter Skeleton" and sparse factors is interesting. It resembles tensor decomposition but exploits a discriminative objective function, Eq. (1). It makes sense to prune the decomposed 1x1 filters while maintaining a fairly reasonable "Filter Skeleton", which can be thought of as a high-order PCA on tensors.

Weaknesses: The paper introduces several new concepts, such as "Filter Skeleton" and "optimal shape", without clarifying them well. This makes the paper a little hard to follow. Here are some other questions that need to be answered or clarified.
1. "Fine-tuning" seems like an important step in the standard filter pruning pipeline, as stated in Line 31. The authors claim that PFF keeps "high performance even without a fine-tuning process" in Line 152, but the paper does not compare the performance with and without fine-tuning. So how is this conclusion derived? Equation (1) says that the filter pruning requires a "training" step over the training set, so why is it advantageous to skip fine-tuning the network parameters, given that solving Eq. (1) still requires backpropagating through the whole network?
2. Eq. (1) and Figure 2: It is not clear why a dot product is used. Is W*I reconstructing the original filter?
3. It is not clear what "shape" means in Line 10, Line 42, and Line 200. Can the authors clarify? Perhaps the concept is known to a specific audience, but it is not self-explanatory in this paper. In Line 200, it is still not clear what the "shape of the filter" means and why the shape should matter. Should it be expected to matter or not? Why does Figure 4 show that "shape" matters? Line 203 seems to indicate that the performance is not affected by "shape".
4. Line 178: It is not clear what "one-shot finetuning" means.
5. Figure 5: What are the x-axes?
6. Line 209: How can one tell that "the weights of the network become more stable"?
7. Figure 6: How are "top-1", "top-2", ..., "top-10" defined?
Typos:
Line 154: "a optimal" --> "an optimal"
Line 155: "the filter do not lose much..." --> "the filter does not lose much..."
Line 161: "how PFF prune the network" --> "how PFF prunes the network"
Line 172: "... implementation that train the model..." --> "... implementation that trains the model..."
Line 222: "may towards a better understanding"

Correctness: One claim in the paper is that PFF keeps "high performance even without a fine-tuning process" (Line 152), but the paper does not compare the performance with and without fine-tuning, so it is unclear how this conclusion is derived. Moreover, Equation (1) says that the filter pruning requires a "training" step over the training set, so it is not clear why it is advantageous to skip fine-tuning the network parameters, given that solving Eq. (1) still requires backpropagating through the whole network.

Clarity: The paper is a little hard to follow. It introduces several new concepts which are not clarified well. Figures 1 and 2 are hard to understand.

Relation to Prior Work: The paper presents related work fairly well.

Reproducibility: No

Additional Feedback: Addressing the points listed in "Weaknesses" would improve the paper with respect to presentation and clarity.
----------------------------after rebuttal
Thanks to the authors for the rebuttal! I have read all the reviews and the rebuttal. I will maintain my initial rating. My concerns remain the unclear presentation and claims in this paper. For example, the authors answer my question(s) by pointing to L112-L119 of the paper (but I believe they are actually referring to Lines 202-206): "We surprisingly find that the network still achieves 80.58% test accuracy with only 12.64% parameters left. This observation shows that even though the weights of filters are randomly initialized, the network still has good representation capability if we could find an optimal shape of the filters. After learning the shape of each filter, we fix the architecture of the network and finetune the weights. The network ultimately achieves 91.93% accuracy on the test set."
1. This reads as though fine-tuning is still crucial (the authors mention CIFAR-10 in the Figure 4 caption when writing the paragraph that includes the above sentences).
2. It is not clear what "randomly initialized" means here. If the weights are randomly initialized, why not fine-tune, as re-iterated by the authors?
3. When the authors report 80.58% accuracy with only 12.64% of the parameters left and claim this could be an "optimal shape", I would like to see the performance of "suboptimal shapes" with a similar portion of parameters left (e.g., randomly dropping filters). This would help in understanding how and why the "optimal shape" matters or not. In other words, the paper does not provide a comparison to justify the "optimal shape" claim.