Review for NeurIPS paper: HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks

NeurIPS 2020

Review 1

Summary and Contributions: This paper proposes the Hessian trace as a metric for automatically deciding the number of quantization bits for each layer, in contrast to previous attempts that use the top Hessian eigenvalue. Some mathematical analysis is provided to support the claim that the Hessian trace is better than the top Hessian eigenvalue, and memory footprint and model accuracy are compared on several models using the ImageNet dataset. The paper also shows that the Hessian trace computation can be simplified by using Hutchinson's algorithm.
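For readers unfamiliar with Hutchinson's algorithm, the following is a minimal sketch of the general estimator, not the authors' implementation. It assumes access to a Hessian-vector product; here that product is stood in for by an explicit small matrix, and the names `hutchinson_trace` and `make_matvec` are hypothetical.

```python
import random


def hutchinson_trace(hvp, dim, num_samples=100, seed=0):
    """Estimate trace(H) as the average of z^T H z over random Rademacher z.

    `hvp` is any function mapping a vector v to H v, so H itself is never
    formed. This is the generic estimator, not the paper's code.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Rademacher vector: each entry is +1 or -1 with equal probability.
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hz = hvp(z)
        total += sum(zi * hzi for zi, hzi in zip(z, hz))
    return total / num_samples


def make_matvec(matrix):
    """Toy stand-in for a Hessian-vector product, using an explicit matrix."""
    def hvp(v):
        return [sum(a * x for a, x in zip(row, v)) for row in matrix]
    return hvp
```

In a real network the matrix-vector product would come from automatic differentiation rather than an explicit matrix, which is what makes the estimator cheap relative to forming the Hessian.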

Strengths: - Hessian-related metrics have been widely adopted to capture the different sensitivities of layers. This paper compares a few different Hessian-related approaches and provides some mathematical analysis of why the Hessian trace can be considered a good metric for choosing the number of quantization bits. - Experiments are performed on large models such as ResNet50 and RetinaNet.

Weaknesses: - The novelty is quite limited because the main contribution of this paper is just to show that the Hessian trace can be a better metric than the top Hessian eigenvalue introduced in HAWQ[7]. - The only example showing why the Hessian trace is good appears at the beginning of Section 2.1 (100x^2+y^2 vs 100x^2+99y^2). What is missing to verify the major claim is a report of the few largest Hessian eigenvalues in some example models. If the top Hessian eigenvalue is strongly dominant in general, then the Hessian trace would offer no distinct advantage while increasing computational complexity. Empirical results are needed to support the major claims. - HAQ and a few other previous works do not rely on the top Hessian eigenvalue. Hence, this paper is mainly comparing with HAWQ[7] only. It is necessary to show that the Hessian trace is superior to other methods such as machine-learning-based ideas, e.g., HAQ[22]. - In Tables 1 and 2, the comparisons are not fair because many methods assume a fixed number of quantization bits, while this paper relies on the relaxed assumption that each layer can have a different number of quantization bits. Experimental results should be compared under the same principles and hardware requirements. - It is difficult to understand what the authors want to claim with Figure 4. Since perturbation is the metric that the authors have chosen to minimize, it is natural that the AMQ results lie on the Pareto frontier. What is missing is the relationship between total perturbation and model accuracy.
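The two quadratics cited from Section 2.1 can be checked directly. The sketch below is only an illustration of that toy example: both Hessians are diagonal, so their eigenvalues are the diagonal entries, and the helper names are hypothetical.

```python
# Hessian diagonals of the two toy losses from the Section 2.1 example.
# d^2/dx^2 (100 x^2) = 200, d^2/dy^2 (y^2) = 2, d^2/dy^2 (99 y^2) = 198.
hess_f1 = [200.0, 2.0]    # f1(x, y) = 100 x^2 + y^2
hess_f2 = [200.0, 198.0]  # f2(x, y) = 100 x^2 + 99 y^2


def top_eigenvalue(diag):
    """For a diagonal Hessian the eigenvalues are the diagonal entries."""
    return max(diag)


def average_trace(diag):
    """Trace divided by the number of parameters."""
    return sum(diag) / len(diag)


def loss_increase(diag, delta):
    """Exact loss increase of the quadratic when every coordinate is moved
    by `delta` away from the minimum at the origin: 0.5 * delta^T H delta."""
    return 0.5 * sum(h * delta * delta for h in diag)
```

Both functions share a top eigenvalue of 200, while the average traces are 101 and 199, and the second function is more sensitive to the same perturbation. Whether real networks behave like the second case is exactly the empirical question raised above.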

Correctness: Figure 4, Table 1 and Table 2 need to be revised so that the comparisons are fair.

Clarity: This paper is well written.

Relation to Prior Work: This paper is mainly compared with HAWQ[7]. Many other previous works need to be compared as well (especially HAQ[22], which follows a learning-based approach).

Reproducibility: No

Additional Feedback: Unfortunately, this paper shows limited novelty, and most claims are made only in comparison with HAWQ[7]. Experimental comparisons with previous works need to be made under the same assumptions and hardware resources. ---------------------- [response to authors' rebuttal] Thanks for addressing my concerns in detail. I raise my score to 6. Please include your answers in the final manuscript.

Review 2

Summary and Contributions: This paper considers a mixed-precision quantization scheme. To address three major limitations of existing works, this paper presents AMQ, which 1) theoretically proves that the right sensitivity metric is the average Hessian trace, instead of just the top Hessian eigenvalue; 2) develops a Pareto-frontier-based method for automatic bit-precision selection of different layers without any manual intervention; 3) develops the first Hessian-based analysis for mixed-precision activation quantization, which is very beneficial for object detection. It also achieves new state-of-the-art results for a wide range of tasks.
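As a rough illustration of what Pareto-frontier-based bit selection means in general, the sketch below enumerates per-layer bit assignments for a two-layer toy model and keeps the non-dominated trade-offs between model size and total perturbation. It is not the paper's procedure: the layer sizes, the perturbation values, and all function names are hypothetical.

```python
from itertools import product

# Hypothetical toy inputs: parameter count per layer, and an assumed
# sensitivity-weighted perturbation for each candidate bit width.
layer_params = [1000, 4000]
layer_perturbation = [
    {2: 8.0, 4: 2.0, 8: 0.1},    # layer 0 (assumed to be sensitive)
    {2: 1.0, 4: 0.3, 8: 0.02},   # layer 1 (assumed to be robust)
]


def enumerate_configs(params, perturbation):
    """Return (size_in_bits, total_perturbation, bits_per_layer) tuples."""
    bit_choices = [sorted(p) for p in perturbation]
    points = []
    for bits in product(*bit_choices):
        size = sum(n * b for n, b in zip(params, bits))
        total = sum(p[b] for p, b in zip(perturbation, bits))
        points.append((size, total, bits))
    return points


def pareto_front(points):
    """Keep points that no other point beats on both size and perturbation."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


def best_under_budget(front, budget_bits):
    """Lowest-perturbation frontier point that fits the size budget."""
    feasible = [p for p in front if p[0] <= budget_bits]
    return min(feasible, key=lambda p: p[1]) if feasible else None
```

With these made-up numbers, a budget of 20000 bits selects 8 bits for the sensitive layer and 2 bits for the robust one. Exhaustive enumeration is only feasible for toy models like this.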

Strengths: Overall, this paper is organized well and easy to follow. Meanwhile, the results are SOTA as shown in the tables.

Weaknesses: 1) In Section 2, to theoretically prove that the right sensitivity metric is the average Hessian trace, the paper assumes that the coefficients are identical for all the eigenvectors (i.e., Eq.(2)). Besides, it also assumes that $||\Delta W_i^*||=||\Delta W_j^*||$ for different layers. These are very strong assumptions, so the theoretical analysis is not convincing. Meanwhile, there is no experimental result that directly justifies the lemma. 2) This paper utilizes Eq.(9) as the metric for bit allocation. To support the average Hessian trace, a strong baseline should be $||Q(W_i)-W_i||$. In fact, the latter has become a very common and natural metric for mixed-precision quantization in practice. Therefore, it is necessary to include these experimental results in the ablation study. Overall, this paper proposes some improvements based on HAWQ and shows superior performance compared to existing works. However, on the one hand, the theoretical analysis is not convincing because of the strong assumptions. On the other hand, without sufficient ablation studies, it is hard to verify the correctness and effectiveness of the proposed sensitivity metric. ################## Post-rebuttal I have read the authors' feedback and the comments from other reviewers. The authors are encouraged to include the answers to my concerns in the final version. I change my overall score to "marginally above the threshold". Some minor comments: - It would be better to conduct an ablation study on the effectiveness of mixed-precision activations. - Compared with HAWQ, the "automatic design" only marginally improves the accuracy on ImageNet. Therefore, the motivation of "Automatic Mixed-precision Quantization" has weakened somewhat.
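To make the requested baseline concrete, the sketch below computes $||Q(W_i)-W_i||$ per layer and ranks layers by it. The symmetric uniform quantizer is an assumed, illustrative choice of Q (the paper's quantizer may differ), and the function names and toy weights are hypothetical.

```python
import math


def quantize_uniform(weights, bits):
    """Symmetric uniform quantizer, an assumed stand-in for Q.

    Requires bits >= 2 so that at least one positive level exists.
    """
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]


def l2_perturbation(weights, bits):
    """The baseline sensitivity score ||Q(W) - W||_2 for one layer."""
    quantized = quantize_uniform(weights, bits)
    return math.sqrt(sum((q - w) ** 2 for q, w in zip(quantized, weights)))


def rank_layers_by_perturbation(layers, bits):
    """Layer names ordered from most to least perturbed at this bit width."""
    return sorted(
        layers,
        key=lambda name: l2_perturbation(layers[name], bits),
        reverse=True,
    )


# Hypothetical toy weights, for illustration only.
toy_layers = {
    "conv1": [0.5, -0.25, 0.125, 1.0],
    "fc": [0.02, -0.01, 0.015, -0.02],
}
```

An ablation of the kind requested would allocate bits from this ranking alone and compare the resulting accuracy against the allocation produced by the trace-based metric.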

Correctness: I personally think the claims and method are ok.

Clarity: The paper is well written.

Relation to Prior Work: Yes, this paper has included discussions on the differences.

Reproducibility: No

Additional Feedback: ############### Post-rebuttal It would be better to release the training code to help reproduce the results in this paper (I'm not fully convinced that the performance improvements come entirely from the "automatic" scheme, leaving out the training tricks).

Review 3

Summary and Contributions: This is one of the Hessian-based approaches for determining the precision of each layer of a model, which reduces the search space compared to manual or RL methods. The method claims better performance than the previous HAWQ method by taking the average Hessian trace instead of the top Hessian eigenvalue. In addition, the paper discusses the Hessian for activations, which is normally considered computationally prohibitive.

Strengths: The authors provide a theoretical proof that the Hessian trace is a good indicator of layer sensitivity to quantization. The authors also demonstrate efficient approximate methods to calculate the Hessian trace and the Hessian for activations.

Weaknesses: The improvement in performance from HAWQ to AMQ seems moderate to the reviewer (< 0.5% on ResNet and 0.6% on SqueezeNet), which leaves open the question of whether this is an incremental update or a disruptive one. It would be good to have a HAWQ result for RetinaNet in Table 2 as well.

Correctness: The claims and empirical methods are fairly correct. In Section 3.1, the authors claim that their method is "significantly" faster than HAQ[22] and DNAS[23]. Could the authors add some data to support that? And how does the speed compare to HAWQ?

Clarity: The paper is well written.

Relation to Prior Work: Yes, this paper compares AMQ with HAWQ thoroughly.

Reproducibility: Yes

Additional Feedback: The authors' rebuttal addressed my questions fairly. They responded to my first question of whether ~0.5% should be counted as a breakthrough and explained the other advantages that AMQ has over HAWQ. In addition, the authors agreed to add the comparison between AMQ and HAWQ on RetinaNet, which strengthens the case for AMQ. Therefore, I will stand by my original positive score. I decided not to increase my score because, while the authors provided some missing information, the fundamental points have not changed.

Review 4

Summary and Contributions: The paper proposes an automatic approach to select the quantization precision for both weights and activations for each layer in a mixed-precision neural network. The approach outperforms prior art, HAWQ, both in theory and in practice.

Strengths: Significance: this work is centered around improving the prior art of HAWQ[7]. It certainly accomplishes this with the automatic selection of bit-depths, weight and activation quantization, as well as the low-complexity trace computation. Theoretical grounding: Lemma 1 gives a solid justification for the proposed algorithm and its advantage over HAWQ. The Pareto frontier approach for selecting bit-depth is also grounded in first principles. Empirical evaluation: the work is validated on a mobile architecture (SqueezeNet), where the effect of low precision is more pronounced, and on object detection, where previous quantization work rarely gets involved.

Weaknesses: Significance: the improvement over HAWQ is incremental: the ImageNet results are marginally (<0.5%) better at the same model size.

Correctness: I do not fully understand Assumption 1: why does the fine-tuning perturbation have to lie in the direction of the sum of H's eigenvectors (as stated in Eq.(2))? Regarding the proof of Lemma 1 (line 122): since the perturbation Delta W does not go to zero (and is probably precision dependent), it may not be safe to ignore the higher-order terms in the Taylor expansion. In this case, additional assumptions on higher-order curvature may be necessary.
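The concern about higher-order terms can be illustrated with a one-dimensional toy loss. This is a sketch under stated assumptions and not an analysis of the paper's networks: the loss w^2 + w^3 and the function names are hypothetical, chosen so that the gradient vanishes at w = 0 and the second derivative there is 2.

```python
def loss(w):
    """Toy loss with a local minimum at w = 0 and a nonzero cubic term."""
    return w ** 2 + w ** 3


def second_order_estimate(delta):
    """Taylor estimate of the loss increase at w = 0: 0.5 * H * delta^2,
    using H = 2 (the gradient term vanishes at the minimum)."""
    return 0.5 * 2.0 * delta ** 2


def relative_error(delta):
    """Fraction of the true loss increase missed by the second-order term."""
    true_increase = loss(delta) - loss(0.0)
    return abs(true_increase - second_order_estimate(delta)) / true_increase
```

For this toy loss the relative error equals delta / (1 + delta), so it is under 1% for a perturbation of 0.01 but about a third for a perturbation of 0.5. A coarse quantization step plays the role of the larger perturbation.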

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: