Reviews: Focused Quantization for Sparse CNNs

Reviewer 1

Summary of the paper: This paper proposes a distribution-aware quantization scheme that chooses between recentralized and shift quantization based on the weight distributions in the kernels. The end-to-end compression pipeline, applied to multiple architectures, achieves good compression ratios over the original models. Finally, the proposed implementation makes better use of hardware optimizations than earlier methods.

Originality: Ideas based on weight distributions have been around, but the proposed method is novel in how it uses the weight distributions of sparse CNNs to obtain optimal quantization strategies. To the best of my knowledge, this work is novel and well grounded.

Quality: The experimental results are good but can be improved by following the suggestions later in this review. The idea is simple and easy to follow. I think this is good-quality work that will help the community.

Clarity and writing: The paper was easy to follow and understand, barring a few minor things:
1) The citation style differs from the usual NeurIPS numeric style, e.g. [1], [3,4,5]. Instead, the authors use the "abc et al." format, which takes up space. Please consider switching to the numeric format.
2) In Figure 2, I am not sure what "/ 128" means. Please explain it for the ease of the reader.
3) The gaps between Table 1 and its caption and between Table 2 and its caption are very small. Please reformat them so they look clean.
4) Please change the title in the manuscript from "Focused Quantization for Sparse DNNs" to "Focused Quantization for Sparse CNNs". I think this mistake was overlooked during submission.

Positives:
1) Good idea and nice use of the weight distributions of sparse CNNs to select the right quantization technique based on bit shifts.
2) Extensive experimental results and a good ablation study.
Issues and questions:
1) The only major issue, in my view, is that not all compression techniques also deliver good computational gains, because of various constraints. To support the efficiency claims made in the paper, it would be good to report the inference time for each of the models in Table 1 and Table 2. This will make the experiments stronger.
2) The conclusion claims power-efficiency savings for future CNN accelerators in IoT devices. Given that IoT devices are mostly single-threaded, it would be good if the authors could give an example of what they expect in the future. A real-world deployment and case study would make the paper's results stand out even more.

I have read the author response thoroughly and I am happy with it. The response addressed my concerns, and I would like to see the number of gates for the 8-bit quantized W+A in Table 4 in the camera-ready version. I also agree with the other reviewers' comments and would like to see them addressed in the camera-ready version as completely as in the rebuttal.

Reviewer 2

Originality: I believe the quantization approach proposed in this paper is novel and provides a new general framework for quantizing sparse CNNs.

Quality: The overall quality of this paper is good. The experimental results are detailed and provide solid support for the effectiveness of the proposed approach. However, I think the experimental section could be further strengthened by considering more datasets.

Clarity: The paper is very well written. In particular, it shows the intuition behind the proposed method and how the method relates to existing machine learning frameworks such as minimum description length optimization.

Significance: I believe the paper shows promising results on the problem of quantizing sparse CNNs.

Reviewer 3

This work treats the sparse weight distribution as a mixture of Gaussian distributions and uses Focused Quantization to quantize the sparse weights. The integrated quantization and pruning method helps compress the model size. Recentralized Quantization uses MLE to find the appropriate hyperparameters, and Shift Quantization uses powers-of-2 values to quantize a single high-probability region.

In terms of the compression algorithm, the idea is novel and makes a lot of sense. In Section 3.6, there is a small mistake: the variational free energy should be written as L(θ, α, Φ). Since α is responsible for the cross-entropy loss, while Φ is composed of hyperparameters related to the discrepancies between the weight distributions before and after quantization, α is not included in the set Φ.

In terms of computation reduction, the powers-of-2 values allow integer multiplications to be replaced with shift operations, which are cheap to implement in hardware. However, the whole pipeline in this work is too complicated. In line 258, the authors mention that they have only one high-precision computing unit. However, Recentralized Quantization actually introduces two high-precision parameters, μ and σ (in addition, these two parameters are different for each weight). The dot products in this paper therefore require more high-precision multiply-add units.

This paper is well written and easy to follow, except that some figures are confusing. For example, in the last two blocks of Fig. 2, the meaning of the y-axis should be defined clearly. Also, in Fig. 4, the standard deviation σ seems to be missing.

The experimental results are solid. However, this paper mainly presents a quantization method for sparse weights, and the real effectiveness of this quantization scheme is not well evaluated. The authors should provide a comparison between their model and the model after pruning but before quantization to demonstrate the quantization performance.
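
As an illustration of the powers-of-2 point in this review, the sketch below rounds weights to signed powers of two and then multiplies an integer activation by such a weight using only a bit shift. It is a minimal sketch, not the paper's implementation: the function names `shift_quantize` and `shift_multiply`, the exponent range, and the choice to round in log space are all assumptions made for illustration.

```python
import numpy as np

def shift_quantize(w, min_exp=-8, max_exp=0):
    """Round each weight to a signed power of two (hypothetical helper).

    Rounding is done on log2(|w|), and the exponent is clipped to
    [min_exp, max_exp]. Exact zeros (e.g. pruned weights) stay zero.
    Returns the quantized values plus the integer sign and exponent arrays.
    """
    w = np.asarray(w, dtype=np.float64)
    sign = np.sign(w)
    nonzero = w != 0
    exp = np.zeros(w.shape, dtype=np.int64)
    exp[nonzero] = np.clip(
        np.round(np.log2(np.abs(w[nonzero]))), min_exp, max_exp
    ).astype(np.int64)
    quantized = np.where(nonzero, sign * np.exp2(exp.astype(np.float64)), 0.0)
    return quantized, sign.astype(np.int64), exp

def shift_multiply(x, sign, exp):
    """Multiply integer x by sign * 2**exp using shifts only (hypothetical helper).

    A negative exponent becomes a right shift, which floors inexact results.
    """
    x, sign, exp = int(x), int(sign), int(exp)
    if sign == 0:
        return 0
    shifted = x << exp if exp >= 0 else x >> -exp
    return shifted if sign > 0 else -shifted

# Example weights; the third one stands in for a pruned (zero) weight.
q, s, e = shift_quantize([0.3, -0.12, 0.0, 1.0])
print(q)                               # quantized weights
print(shift_multiply(64, s[0], e[0]))  # 64 * 0.25 computed as 64 >> 2
```

With these inputs the quantized weights are 0.25, -0.125, 0.0 and 1.0, so each multiplication reduces to one shift and an optional negation. The sketch does not model the per-component μ and σ of Recentralized Quantization, which is where the reviewer's concern about extra high-precision multiply-add units arises.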

Paper ID:	2993
Title:	Focused Quantization for Sparse CNNs
