Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
I really liked this paper. The idea of creating space for ECC bits within the network is both interesting and practical. The evaluation is thorough and, on the whole, the paper is well written and easy to follow.
a) The paper assumes a system that uses a specific approach for protecting DNN weights against memory faults. However, it does not mention how that system handles potential memory faults in its input/output tensors. Not storing input/output tensors in memory, or using different memory protection strategies for different data (weights vs. input/output), imposes conditions on the system and makes the solution not generally applicable.
b) Section 3 of the paper asserts the following (lines 67-69): "This work focuses on protections of 8-bit quantized DNN models. The reason are two-fold: 1) 8-bit quantization has been the de facto step before model deployment to reduce model size while providing lower latency with little degradation in model accuracy. 2) Previous studies [10, 15]" However, the paper does not provide any references or comparisons with lower precisions (e.g. 4-bit) to justify reason 1). The optimal quantization precision (balancing accuracy and model size) for any network depends on its weight distribution, and it might well be lower than 8 bits. Lastly, making the two protection methods heavily dependent on reason 1) imposes more restrictions on the systems where such techniques can be used.
c) The paper demonstrates the impact of its protection techniques only for CNNs (VGG16, ResNet18, and SqueezeNet). It is not obvious that techniques applicable to convolutional and fully connected layers in CNNs will also apply to element-wise multiplication/addition operations (layers) in LSTMs. It seems more apt to call the proposed techniques memory protection for CNNs only. Additionally, Section 6.1 asserts that the chosen CNN models cover a variety of CNNs. Providing more reasons to support that claim would better justify the choice of models.
d) Section 6.2 illustrates the point in b) above.
By reducing 3000 large values (beyond [-64, 63]) in the first seven positions to fit in 6 bits (+1 sign bit), the accuracy of the networks considered remains unchanged. This raises the question: why can’t all the weights be reduced to 6 bits (+1 sign bit) without any drop in accuracy? And if it is possible to reduce the design to fewer bits, why use an 8-bit implementation at all? In an extreme scenario, a network that has a negligible drop in accuracy with 4-bit weights would be forced to use 8-bit precision in order to use the techniques proposed in this paper. Traditional ECC on such a 4-bit network (12.5% penalty on 50% of the space) would cost less space than the proposed zero-space ECC, with identical reliability guarantees.
e) Lastly, the paper mentions in the abstract that the proposed techniques can be applied to safety-critical systems (e.g. autonomous vehicles). However, arranging weights such that their spatial distribution is fixed (e.g. a large weight always sitting in the last byte of a 64-bit data block) will make such systems susceptible to security threats. With that concern in mind, the proposed solutions would require some more work before being applied to safety-critical systems.
Minor Issues
a) Neither Figure 3 nor the text referencing it in Section 5 clearly explains the definition of the fault rates used to create the figure. My guess would be the number of bit flips on a target bit in n (e.g. 10000) inferences.
b) The fault rate definition should be added for Table 2 as well.
c) There are a few minor typos on the following lines: 60, 63, 211 and 247.
Apart from the above concerns, the paper explains its ideas clearly. Good job!
Author Feedback Comments
a) The authors justify well how memory protection for weights is more pressing than for input/output tensors. At the same time, in-place zero-space ECC requires a specific memory arrangement wherein all 8-bit words have an error-check bit, which is connected to ECC hardware with additional wiring.
This implies that an inference engine that wishes to adopt the proposed techniques has three options for managing the input/output tensor dataflow:
1) Use a separate (relatively small) memory with (or preferably without) ECC hardware for input/output tensors.
2) Rely on a processor to control the input/output tensor dataflow. (It is possible this is already in place, but then the 12.5% savings on the inference engine memory area might be much smaller at the system level.)
3) Somehow manage to store input/output tensors in the same memory as the weights and yet read their values correctly despite the weight-specific ECC hardware in place.
Unless the authors demonstrate that option 3 works, which might have an additional cost of its own, the inference engine will have to incur the cost of option 1 or 2. Considering these costs raises the question of whether the 12.5% area savings of the zero-space technique (the most effective one) is worth it.
b) Based on the author feedback, I can see that 8-bit quantization offers a sweet spot in terms of ease of implementation (through many industry-supported DNN libraries) and improved latency without compromising much on accuracy. Keeping this in mind, it would be fitting to reword Section 3 of the paper to clearly communicate the above reasons for choosing 8-bit quantization. Labeling 8-bit quantization as the de facto step before deployment, or as the most resilient to memory faults, seems to ignore the possibility of using even lower precision for certain networks/scenarios.
c) I appreciate the authors delineating their choice of CNNs for their experiments, and changing their application scope from DNNs to CNNs.
d) As well noted by R3, the techniques proposed in the paper rest on a couple of empirical observations:
1) CNN accuracy does not degrade with quantization to 8 bits.
2) CNN weights have a distribution wherein a negligible fraction of weights lie in the [64, 128) range (absolute value of weights).
These two observations, though empirically common, need to be validated before applying these techniques to a new CNN.
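As a rough illustration of how the points under d) might be checked for a new network, the following Python sketch counts the quantized weights that fall outside [-64, 63] and compares relative storage costs. The function names and the input array are hypothetical; the 8-byte block with the last position reserved for a large weight and the 12.5% ECC overhead come from the discussion above. This is a sketch under those assumptions, not the authors' code.

```python
import numpy as np

def large_weight_stats(q_weights, block=8, lo=-64, hi=63):
    """Return (fraction of weights outside [lo, hi],
    number of such weights in the first block-1 positions of each block).

    q_weights is assumed to be an array of int8-quantized weights.
    Trailing weights that do not fill a whole block are ignored
    for the per-block count.
    """
    w = np.asarray(q_weights, dtype=np.int8).ravel()
    large = (w < lo) | (w > hi)
    frac_large = float(large.mean())

    n_blocks = len(w) // block
    blocks = large[: n_blocks * block].reshape(n_blocks, block)
    # Large values anywhere except the last position of a block are the
    # ones that would have to be reduced to fit in 6 bits + sign.
    misplaced = int(blocks[:, : block - 1].sum())
    return frac_large, misplaced

def relative_storage(bits, ecc_overhead):
    """Storage relative to unprotected 8-bit weights (baseline = 1.0)."""
    return bits / 8 * (1 + ecc_overhead)
```

For the 4-bit scenario raised earlier, `relative_storage(4, 0.125)` evaluates to 0.5625 of the unprotected 8-bit footprint, compared with `relative_storage(8, 0.0)`, i.e. 1.0, for zero-space ECC on 8-bit weights.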
1. This paper is mainly based on observations and empirical results. For example, the authors should provide more details on the regularization step of the regulated QAT process in Section 4.1. If we simply clip the weights, then how is convergence guaranteed? The authors should elaborate more on the theoretical analysis.
2. It seems the paper is outside the scope of NeurIPS. The authors should consider submitting the paper to EDA conferences such as DAC.
3. In Section 5, it is not clear why majority-vote protection is better used under low error rate scenarios.
4. The experiment section is extremely vague. The authors claim that “We use the ImageNet dataset (ILSVRC 2012) for model training and evaluation.” However, the only results based on ImageNet are in Figure 4(b) and 5(b). Are the results in Table 2 based on ImageNet or CIFAR-10? Additionally, are the models trained from scratch or fine-tuned? The accuracy curves in Figure 5 look strange to me. For example, the accuracy of SqueezeNet on ImageNet increases from ~10% to ~55% within 2 iterations, and the accuracy of VGG16 on CIFAR-10 is already over 93% at epoch 0.
5. After a thorough look through the code, I did not find the training module for the three selected models (i.e., VGG16, ResNet18, SqueezeNet) on ImageNet. The authors should provide a description of their code.
Minor issues:
1. In Figure 1, in the percentage row, shouldn’t the range be [-128, 128]?
2. There are many typos to be corrected. For example, in the last paragraph of Section 2, “mgeneral” should be “general”.