NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:4004
Title:Normalization Helps Training of Quantized LSTM

Reviewer 1

I think building the connection between the norm of the weight matrices and the gradient is original, but using normalization to recover the accuracy lost to quantization is not so impressive. My detailed concerns are as follows:

1. In Eq. (4), the condition for gradient explosion is lambda_1 > 1. Table 1 reports the matrix norm, but that is only an indirect indicator. Can you directly visualize the lambda_1 values in several typical LSTMs, before and after quantization, without normalization?
2. In all three normalizations, the trainable scaling factors (g) still affect the value of lambda_1, and thus the exploding condition, but the authors say little about this. Can you present the g and lambda_1 values in typical LSTMs, before and after quantization, with normalization?
3. On the sequential MNIST task, the batch normalization (shared) method failed completely. Could you explain why its behavior on this task is so different from the others?
4. I note that quantization with scaling factors (e.g. BWN, TWN, alternating LSTM) can also improve accuracy, but I did not see these results in Table 4. Could you please provide them for a better comparison?
5. Do your conclusions still hold for GRUs? A similar chain-rule analysis of backpropagation could be carried out there.
6. What about multilayer LSTMs?
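A minimal sketch of the kind of before/after measurement requested in point 1, under loud assumptions: it uses the spectral norm (largest singular value) of a random square matrix as a stand-in, not the paper's lambda_1 from Eq. (4), and it applies plain sign binarization with no scaling factor. The helper names `spectral_norm` and `binarize` are hypothetical and do not come from the paper or its code.

```python
import math
import random


def spectral_norm(W, iters=200):
    """Estimate the largest singular value of W by power iteration on W^T W.

    Returns 0.0 in the degenerate case where the iterate collapses to zero
    (for example, an all-zero matrix).
    """
    rows, cols = len(W), len(W[0])
    v = [1.0 / math.sqrt(cols)] * cols  # unit-norm starting vector
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = W v
        w = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # w = W^T u
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:
            return 0.0
        v = [x / norm_w for x in w]  # renormalize the iterate
    Wv = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in Wv))  # ||W v|| for unit v


def binarize(W):
    """Sign binarization to {-1, +1}; no scaling factor is applied (assumption)."""
    return [[1.0 if x >= 0.0 else -1.0 for x in row] for row in W]


rng = random.Random(0)  # fixed seed so the comparison is repeatable
n = 16
# Stand-in "full-precision" weights: small Gaussian entries (assumption).
W_full = [[rng.gauss(0.0, 0.1) for _ in range(n)] for _ in range(n)]
W_bin = binarize(W_full)

norm_full = spectral_norm(W_full)
norm_bin = spectral_norm(W_bin)
print(f"full precision: {norm_full:.3f}  binarized: {norm_bin:.3f}")
```

For an n-by-n sign matrix the Frobenius norm is exactly n, so its spectral norm is at least sqrt(n). That is a property of unscaled binarization in this toy setting, not a statement about the paper's measurements.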

Reviewer 2

While a number of extreme low-precision quantization techniques have been developed for DNNs in recent years, they were not directly applicable to recurrent architectures. Recent work [1] proposed an extreme low-precision quantization method that uses batch normalization and compresses recurrent neural networks without a large drop in accuracy, achieving state-of-the-art performance. In this paper, the authors propose a theoretical explanation of the difficulties of training LSTMs with low-precision weights and empirically explore combinations of different normalization techniques with different quantization schemes. The authors show experimentally that simply introducing weight or layer normalization allows many standard quantization techniques to be applied without modification.

Comments, suggestions, and questions:

- Figure 1 is quite difficult to read. I would suggest using a logarithmic scale. I would also suggest adding a reference to footnote 3 in the caption of the figure to increase clarity.
- Line 183: “On text8, we use a” -> “On Text8, we use a”.
- Do I understand correctly that a simple application of batch normalization works better than, or similarly to, SBN?
- In [1] the authors claim that the size of their network is 5 KBytes, while in Table 4 the SBN size is 5526 KBytes. Can the authors please clarify?

Overall, the propositions on upper bounds are quite straightforward, and the experimental results are simple combinations of previous works. Nonetheless, it is quite interesting that the simple use of different normalization techniques makes different quantization schemes applicable again.

[1] A. Ardakani, Z. Ji, S. C. Smithson, B. H. Meyer, and W. J. Gross. Learning recurrent binary/ternary weights. In International Conference on Learning Representations, 2019.

I would like to thank the authors for their comments. I updated my score since the answers to my questions are mostly positive.

Reviewer 3

The strengths of the paper besides the above contributions:

- The paper is clearly written.
- The systematic comparisons are thorough and convincing.
- The theoretical analysis is useful for understanding the problem and provides the justification for the solutions.
- Code is provided for reproducibility.

The weaknesses that could be further improved:

- The approach used in the theoretical analysis cannot give a necessary or a sufficient condition for the exploding gradient problem in LSTMs, since lambda_2 in Proposition 2.1 is typically not zero. The analyses of the various normalization approaches are also made only on the upper bound of the gradient magnitude, not on the gradient itself. It would be great to have a more direct proof of the effectiveness of the solutions.
- The analysis in Section 3 only shows that increased scaling of the weight matrices does not affect the upper bound of the gradient magnitude due to \sigma. It would be good to also analyze the effect of all the normalization scaling parameters 'g' on the exploding gradient problem.
- The paper would be even stronger if it showed that the normalization techniques still allow 1-bit or 2-bit quantized LSTMs to achieve performance comparable to a full-precision counterpart on a much larger network consisting of multiple LSTM layers, e.g. for machine translation or speech recognition, where there is a strong practical need for compressing models, especially for embedded/mobile use cases.
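The point about the scaling parameters 'g' can be illustrated with a minimal sketch, assuming the standard weight-normalization reparameterization applied per row (row i becomes g[i] * v_i / ||v_i||). The paper's exact formulation may differ, and the helper names `weight_normalize` and `frobenius_norm` are hypothetical. Rescaling the underlying weights leaves the normalized matrix unchanged, while its norm is set entirely by g.

```python
import math


def weight_normalize(V, g):
    """Row-wise weight normalization: row i becomes g[i] * V[i] / ||V[i]||."""
    out = []
    for row, gain in zip(V, g):
        norm = math.sqrt(sum(x * x for x in row))
        out.append([gain * x / norm for x in row])
    return out


def frobenius_norm(W):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in W for x in row))


V = [[0.3, -0.4], [1.0, 2.0]]  # toy unnormalized weights (illustrative values)
g = [1.0, 1.0]

W = weight_normalize(V, g)
# Scaling V by 10 (e.g. a larger-magnitude quantized matrix) changes nothing.
W_scaled = weight_normalize([[10.0 * x for x in row] for row in V], g)
# Scaling g by 5 scales the normalized matrix, and hence its norm, by 5.
W_big_g = weight_normalize(V, [5.0, 5.0])

print(frobenius_norm(W), frobenius_norm(W_scaled), frobenius_norm(W_big_g))
```

In this toy setting the magnitude of the normalized matrix is controlled by g alone, which is why an analysis that covers only the scale of the weight matrices leaves the effect of g open.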