{"title": "Post training 4-bit quantization of convolutional networks for rapid-deployment", "book": "Advances in Neural Information Processing Systems", "page_first": 7950, "page_last": 7958, "abstract": "Convolutional neural networks require significant memory bandwidth and storage for intermediate computations, apart from substantial computing resources. Neural network quantization has significant benefits in reducing the amount of intermediate results, but it often requires the full datasets and time-consuming fine tuning to recover the accuracy lost after quantization. This paper introduces the first practical 4-bit post training quantization approach: it does not involve training the quantized model (fine-tuning), nor it requires the availability of the full dataset. We target the quantization of both activations and weights and suggest three complementary methods for minimizing quantization error at the tensor level, two of whom obtain a closed-form analytical solution. Combining these methods, our approach achieves accuracy that is just a few percents less the state-of-the-art baseline across a wide range of convolutional models. 
The source code to replicate all experiments is available on GitHub: \\url{https://github.com/submission2019/cnn-quantization}.", "full_text": "Post training 4-bit quantization of convolutional networks for rapid-deployment\n\nRon Banner1, Yury Nahshan1, and Daniel Soudry2\n\nIntel \u2013 Artificial Intelligence Products Group (AIPG)1\n\nTechnion \u2013 Israel Institute of Technology2\n\n{ron.banner, yury.nahshan}@intel.com\n\ndaniel.soudry@gmail.com\n\nAbstract\n\nConvolutional neural networks require significant memory bandwidth and storage for intermediate computations, apart from substantial computing resources. Neural network quantization has significant benefits in reducing the amount of intermediate results, but it often requires the full dataset and time-consuming fine-tuning to recover the accuracy lost after quantization. This paper introduces the first practical 4-bit post-training quantization approach: it does not involve training the quantized model (fine-tuning), nor does it require the availability of the full dataset. We target the quantization of both activations and weights and suggest three complementary methods for minimizing quantization error at the tensor level, two of which admit a closed-form analytical solution. Combining these methods, our approach achieves accuracy that is just a few percent below the state-of-the-art baseline across a wide range of convolutional models. The source code to replicate all experiments is available on GitHub: https://github.com/submission2019/cnn-quantization.\n\n1 Introduction\n\nA significant drawback of deep learning models is their computational cost. Low precision is one of the key techniques being actively studied recently to overcome this problem. 
With hardware support, low-precision training and inference can compute more operations per second, reduce memory bandwidth and power consumption, and allow larger networks to fit into a device.\nThe majority of the literature on neural network quantization involves some sort of training, either from scratch (Hubara et al., 2016) or as a fine-tuning step from a pre-trained floating point model (Han et al., 2015). Training is a powerful method to compensate for the model's accuracy loss due to quantization. Yet, it is not always applicable in real-world scenarios since it requires the full-size dataset, which is often unavailable for reasons such as privacy or proprietary restrictions, or when using an off-the-shelf pre-trained model for which data is no longer accessible. Training is also time-consuming, requiring very long periods of optimization as well as skilled manpower and computational resources.\nConsequently, it is often desirable to reduce the model size by quantizing weights and activations post-training, without the need to re-train/fine-tune the model. These methods, commonly referred to as post-training quantization, are simple to use and allow for quantization with limited data. At 8-bit precision, they provide close to floating point accuracy in several popular models, e.g., ResNet, VGG, and AlexNet. Their importance is evident from recent industrial publications focusing on quantization methods that avoid re-training (Goncharenko et al., 2018; Choukroun et al., 2019; Meller et al., 2019; Migacz, 2017).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nUnfortunately, post-training quantization below 8 bits usually incurs significant accuracy degradation (Krishnamoorthi, 2018; Jacob et al., 2018). This paper focuses on CNN post-training quantization to 4-bit representation. 
In the absence of a training set, our methods aim at minimizing the local error introduced during the quantization process (e.g., round-off errors). To that end, we adopt prior knowledge about the statistical characterization of neural network distributions, which tend to have a bell-shaped distribution around the mean. This enables us to design efficient quantization schemes that minimize the mean-squared quantization error at the tensor level, avoiding the need for re-training.\n\nOur contributions\n\nOur paper suggests three new contributions for post-training quantization:\n\n1. Analytical Clipping for Integer Quantization (ACIQ): We suggest limiting (henceforth, clipping) the range of activation values within the tensor. While this introduces distortion to the original tensor, it reduces the rounding error in the part of the distribution containing most of the information. Our method approximates the optimal clipping value analytically from the distribution of the tensor by minimizing the mean-square-error measure. This analytical threshold is simple to use during run-time and can easily be integrated with other techniques for quantization.\n\n2. Per-channel bit allocation: We introduce a bit allocation policy to determine the optimal bit-width for each channel. Given a constraint on the average per-channel bit-width, our goal is to allocate to each channel the desired bit-width representation so that the overall mean-square-error is minimized. We solve this problem analytically and show that, under certain assumptions about the input distribution, the optimal quantization step size of each channel is proportional to the 2/3-power of its range.\n\n3. Bias-correction: We observe an inherent bias in the mean and the variance of the weight values following their quantization. We suggest a simple method to compensate for this bias.\n\nWe use ACIQ for activation quantization and bias-correction for quantizing weights. 
Our per-channel bit allocation method is used for quantizing both weights and activations (we explain the reasons for this configuration in Section 5). These methods are evaluated on six ImageNet models. ACIQ and bias-correction improve, on average, the 4-bit baselines by 3.2% and 6.0%, respectively. Per-channel bit allocation improves the baselines, on average, by 2.85% for activation quantization and 6.3% for weight quantization. When the three methods are used in combination to quantize both weights and activations, most of the degradation is restored without re-training, as can be seen in Figure 1.\n\nFigure 1: Top-1 accuracy of floating-point models converted directly to 4-bit weights and activations without retraining. For some models, the combination of the three methods reduces the quantization-induced degradation enough to make retraining unnecessary, enabling for the first time a rapid deployment of 4-bit models (detailed numerical results appear in Table 1).\n\nPrevious works\n\nPerhaps the most relevant previous work that relates to our clipping study (ACIQ) is due to Migacz (2017), who also proposes to clip activations post-training. Migacz (2017) suggests a time-consuming iterative method to search for a suitable clipping threshold based on the Kullback-Leibler Divergence (KLD) measure. This requires collecting statistics for activation values before deploying the model, either during training or by running a few calibration batches on the FP32 model. A drawback is that values encountered at runtime may not obey the previously observed statistics.\nCompared to the KLD method, ACIQ avoids searching for candidate threshold values to identify the optimal clipping, which allows the clipping threshold to be adjusted dynamically at runtime. 
In addition, we show in the Appendix that our analytical clipping approach outperforms KLD in almost all models for 4-bit quantization, even when it uses only statistical information (i.e., not tensor values observed at runtime). Zhao et al. (2019) compared ACIQ (from an earlier version of this manuscript) to KLD for higher bit-widths of 5 to 8 bits. It was found that ACIQ typically outperforms KLD for weight clipping and performs about the same for activation clipping.\nSeveral new post-training quantization schemes have recently been suggested to handle statistical outliers. Meller et al. (2019) suggest weight factorization that arranges the network to be more tolerant of quantization by equalizing channels and removing outliers. A similar approach has recently been suggested by Zhao et al. (2019), who suggest duplicating channels containing outliers and halving their values to move outliers toward the center of the distribution without changing network functionality. Unlike our method, which focuses on 4-bit quantization, the focus of these schemes was post-training quantization for larger bit-widths.\n\n2 ACIQ: Analytical Clipping for Integer Quantization\n\nIn the following, we derive a generic expression for the expected quantization noise as a function of the clipping value for either Gaussian or Laplace distributions. In the Appendix, we consider the case where convolutions and rectified linear units (ReLU) are fused to avoid noise accumulation, resulting in folded Gaussian and Laplace distributions.\nLet X be a high-precision tensor-valued random variable with a probability density function f(x). Without loss of generality, we assume a preprocessing step has been made so that the average value in the tensor is zero, i.e., E(X) = \u00b5 = 0 (we do not lose generality since we can always subtract and add this mean). 
Assuming a bit-width M, we would like to quantize the values in the tensor uniformly to 2^M discrete values.\nCommonly (e.g., in GEMMLOWP (Jacob et al., 2017)), integer tensors are uniformly quantized between the tensor maximal and minimal values. In the following, we show that this is suboptimal, and suggest a model where the tensor values are clipped to the range [\u2212\u03b1, \u03b1] to reduce quantization noise. For any x \u2208 IR, we define the clipping function clip(x, \u03b1) as follows:\n\nclip(x, \u03b1) = x if |x| \u2264 \u03b1, and sign(x) \u00b7 \u03b1 if |x| > \u03b1    (1)\n\nDenoting by \u03b1 the clipping value, the range [\u2212\u03b1, \u03b1] is partitioned into 2^M equal quantization regions. Hence, the quantization step \u2206 between two adjacent quantized values is established as follows:\n\n\u2206 = 2\u03b1 / 2^M    (2)\n\nOur model assumes values are rounded to the midpoint of the region (bin), i.e., for every index i \u2208 [0, 2^M \u2212 1], all values that fall in [\u2212\u03b1 + i \u00b7 \u2206, \u2212\u03b1 + (i + 1) \u00b7 \u2206] are rounded to the midpoint q_i = \u2212\u03b1 + (2i + 1) \u00b7 \u2206/2, as illustrated in Figure 2 (left). Then, the expected mean-square-error between X and its quantized version Q(X) can be written as follows:\n\nE[(X \u2212 Q(X))^2] = \u222b_{\u2212\u221e}^{\u2212\u03b1} f(x) \u00b7 (x + \u03b1)^2 dx + \u03a3_{i=0}^{2^M\u22121} \u222b_{\u2212\u03b1+i\u2206}^{\u2212\u03b1+(i+1)\u2206} f(x) \u00b7 (x \u2212 q_i)^2 dx + \u222b_{\u03b1}^{\u221e} f(x) \u00b7 (x \u2212 \u03b1)^2 dx    (3)\n\nEq. 3 is composed of three parts. The first and last terms quantify the contribution of clip(x, \u03b1) to the expected mean-square-error. Note that for symmetrical distributions around zero (e.g., Gaussian N(0, \u03c3^2) or Laplace(0, b)), these two terms are equal and their sum can therefore be evaluated by multiplying either of them by 2. 
The second term corresponds to the expected mean-square-error when the range [\u2212\u03b1, \u03b1] is quantized uniformly to 2^M discrete levels, i.e., the quantization noise introduced when high-precision values in the range [\u2212\u03b1, \u03b1] are rounded to the nearest discrete value.\n\nFigure 2: Left: an activation distribution quantized uniformly in the range [\u2212\u03b1, \u03b1] with 2^M equal quantization intervals (bins). Right: expected mean-square-error as a function of the clipping value for different quantization levels (Laplace with \u00b5 = 0 and b = 1). Analytical results, stated by Eq. 5, are in good agreement with simulations, which were obtained by clipping and quantizing 10,000 values generated from a Laplace distribution.\n\nQuantization noise: We approximate the density function f by the construction of a piece-wise linear function whose segment breakpoints are points in f, as illustrated on the right side of Figure 2. In the appendix we use this construction to show that the quantization noise satisfies the following:\n\n\u03a3_{i=0}^{2^M\u22121} \u222b_{\u2212\u03b1+i\u00b7\u2206}^{\u2212\u03b1+(i+1)\u00b7\u2206} f(x) \u00b7 (x \u2212 q_i)^2 dx \u2248 (2 \u00b7 \u03b1^3)/(3 \u00b7 2^{3M}) \u00b7 \u03a3_{i=0}^{2^M\u22121} 1/(2\u03b1) = \u03b1^2 / (3 \u00b7 2^{2M})    (4)\n\nClipping noise: In the appendix we show that the clipping noise for the case of Laplace(0, b) satisfies the following:\n\n\u222b_{\u2212\u221e}^{\u2212\u03b1} f(x) \u00b7 (x + \u03b1)^2 dx = \u222b_{\u03b1}^{\u221e} f(x) \u00b7 (x \u2212 \u03b1)^2 dx = b^2 \u00b7 e^{\u2212\u03b1/b}\n\nWe can finally state Eq. 
3 for the Laplace case as follows:\n\nE[(X \u2212 Q(X))^2] \u2248 2 \u00b7 b^2 \u00b7 e^{\u2212\u03b1/b} + (2 \u00b7 \u03b1^3)/(3 \u00b7 2^{3M}) \u00b7 \u03a3_{i=0}^{2^M\u22121} f(q_i) = 2 \u00b7 b^2 \u00b7 e^{\u2212\u03b1/b} + \u03b1^2 / (3 \u00b7 2^{2M})    (5)\n\nOn the right side of Figure 2, we plot the mean-square-error as a function of the clipping value for various bit-widths.\nFinally, to find the optimal clipping value \u03b1 for which the mean-square-error is minimized, the corresponding derivative with respect to \u03b1 is set equal to zero:\n\n\u2202E[(X \u2212 Q(X))^2] / \u2202\u03b1 = 2\u03b1 / (3 \u00b7 2^{2M}) \u2212 2b \u00b7 e^{\u2212\u03b1/b} = 0    (6)\n\nSolving Eq. 6 numerically for bit-widths M = 2, 3, 4 yields optimal clipping values of \u03b1* = 2.83b, 3.89b, 5.03b, respectively. In practice, ACIQ uses \u03b1* to optimally clip values by estimating the Laplace parameter b = E(|X \u2212 E(X)|) from the input distribution X, and multiplying by the appropriate constant (e.g., 5.03 for 4 bits).\nIn the appendix, we provide a similar analysis for the Gaussian case. We also compare the validation accuracy against the standard GEMMLOWP approach (Jacob et al., 2017) and demonstrate significant improvements in all studied models for 3-bit activation quantization.\n\n3 Per-channel bit-allocation\n\nWith classical per-channel quantization, we have a dedicated scale and offset for each channel. Here we take a further step and consider the case where different channels have different numbers of bits of precision. For example, instead of restricting all channel values to have the same 4-bit representation, we allow some of the channels to have a higher bit-width while limiting other channels to a lower bit-width. 
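As a concrete illustration of the uniform quantizer used throughout (clip to [-alpha, alpha], round to one of the 2^M bin midpoints) and of the ACIQ clipping rule, here is a minimal NumPy sketch; the function names are ours, and the constants are those obtained from Eq. 6:

```python
import numpy as np

# Optimal clipping constants for a Laplace(0, b) tensor, obtained by
# solving Eq. 6 numerically: alpha* = C[M] * b.
ACIQ_LAPLACE_CONST = {2: 2.83, 3: 3.89, 4: 5.03}

def aciq_quantize(x, num_bits):
    """Clip to [-alpha, alpha] and round to the 2^M bin midpoints (Eqs. 1-3)."""
    b = np.mean(np.abs(x - np.mean(x)))       # Laplace scale, b = E|X - E(X)|
    alpha = ACIQ_LAPLACE_CONST[num_bits] * b  # analytical clipping value
    delta = 2.0 * alpha / 2 ** num_bits       # quantization step (Eq. 2)
    idx = np.floor((np.clip(x, -alpha, alpha) + alpha) / delta)
    idx = np.minimum(idx, 2 ** num_bits - 1)  # x == alpha falls in the last bin
    return -alpha + (2 * idx + 1) * delta / 2 # bin midpoints q_i

def maxrange_quantize(x, num_bits):
    """GEMMLOWP-style baseline: quantize over the full range of the tensor."""
    alpha = np.max(np.abs(x))
    delta = 2.0 * alpha / 2 ** num_bits
    idx = np.minimum(np.floor((x + alpha) / delta), 2 ** num_bits - 1)
    return -alpha + (2 * idx + 1) * delta / 2

rng = np.random.default_rng(0)
x = rng.laplace(0.0, 1.0, size=100_000)
mse_aciq = np.mean((x - aciq_quantize(x, 4)) ** 2)
mse_full = np.mean((x - maxrange_quantize(x, 4)) ** 2)
```

On heavy-tailed (Laplace-like) data, clipping at alpha* gives a markedly lower mean-square-error than full-range quantization, mirroring the right panel of Figure 2.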
The only requirement we have is that the total number of bits written to or read from memory remains unchanged (i.e., the average per-channel bit-width is kept at 4).\nGiven a layer with n channels, we formulate the problem as an optimization problem aiming to find a solution that allocates a quota of B quantization intervals (bins) to all the different channels. Limiting the number of bins B translates into a constraint on the number of bits that one needs to write to memory. Our goal is to minimize the overall layer quantization noise in terms of mean-square-error. Assuming channel i has values in the range [\u2212\u03b1_i, \u03b1_i] quantized to M_i bits of precision, Eq. 5 provides the quantization noise in terms of expected mean-square-error. We employ Eq. 5 to introduce a Lagrangian with a multiplier \u03bb that enforces the requirement on the number of bins:\n\nL(M_0, M_1, ..., M_{n\u22121}, \u03bb) = \u03a3_i (2 \u00b7 b^2 \u00b7 e^{\u2212\u03b1_i/b} + \u03b1_i^2 / (3 \u00b7 2^{2M_i})) + \u03bb \u00b7 (\u03a3_i 2^{M_i} \u2212 B)    (7)\n\nThe first term in the Lagrangian is the total layer quantization noise (i.e., the sum of mean-square-errors over all channels as defined by Eq. 5). The second term captures the quota constraint on the total number of allowed bins B. By setting to zero the partial derivative of the Lagrangian function L(\u00b7) with respect to M_i, we obtain for each channel index i \u2208 [0, n \u2212 1] the following equation:\n\n\u2202L(M_0, M_1, ..., M_{n\u22121}, \u03bb) / \u2202M_i = \u2212(2 ln 2 \u00b7 \u03b1_i^2) / (3 \u00b7 2^{2M_i}) + \u03bb \u00b7 ln 2 \u00b7 2^{M_i} = 0    (8)\n\nBy setting to zero the partial derivative of the Lagrangian function L(\u00b7) with respect to \u03bb, we take into account the constraint on the number of allowed bins:\n\n\u2202L(M_0, M_1, ..., M_{n\u22121}, \u03bb) / \u2202\u03bb = \u03a3_i 2^{M_i} \u2212 B = 0    (9)\n\nConsidering Eq. 8 and Eq. 
9, we have a separate equation for each channel i \u2208 [0, n \u2212 1] and an additional equation for the Lagrangian multiplier \u03bb. In the Appendix, we show that the solution to this system of equations yields the following simple rule for the optimal bin allocation of each channel i:\n\nB*_i = 2^{M_i} = (\u03b1_i^{2/3} / \u03a3_j \u03b1_j^{2/3}) \u00b7 B    (10)\n\nBy taking the logarithm of both sides, we translate Eq. 10 into a bit-width assignment M_i for each channel i. Since M_i is an integer, it includes a rounding operation:\n\nM_i = round( log2( (\u03b1_i^{2/3} / \u03a3_j \u03b1_j^{2/3}) \u00b7 B ) )    (11)\n\nFigure 3 illustrates the mean-square-error in a synthetic experiment including two channels i, j, each having different values of \u03b1_i, \u03b1_j. The results of the experiment show that the optimal allocations determined by Eq. 10 are in good agreement with the best allocations found by the experiment. Finally, the validation accuracy of per-channel bit-allocation is compared in the appendix when activations are quantized on average to 3-bit precision. Unlike the baseline method that assigns a precision of exactly 3 bits to each channel in a layer, the per-channel bit-allocation method does not change the total bit rate to memory but significantly improves validation accuracy in all models.\n\n4 Bias-Correction\n\nWe observe an inherent bias in the mean and the variance of the weight values following their quantization. Formally, denoting by W_c \u2286 W the weights of channel c and by W^q_c its quantized version, we observe that E(W_c) \u2260 E(W^q_c) and ||W_c \u2212 E(W_c)||_2 \u2260 ||W^q_c \u2212 E(W^q_c)||_2. We suggest compensating for this quantization bias. 
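The allocation rule of Eqs. 10-11 reduces to a few lines of code; a minimal sketch (the function `allocate_bits` and its argument names are ours):

```python
import numpy as np

def allocate_bits(alphas, avg_bits=4):
    """Per-channel bit-widths via Eqs. 10-11: a quota of B = n * 2**avg_bits
    bins is split among channels in proportion to alpha_i ** (2/3)."""
    alphas = np.asarray(alphas, dtype=float)
    B = len(alphas) * 2 ** avg_bits              # total bin quota for the layer
    share = alphas ** (2.0 / 3.0)
    bins = share / share.sum() * B               # optimal bins per channel (Eq. 10)
    return np.round(np.log2(bins)).astype(int)   # integer bit-widths (Eq. 11)

# A channel with an 8x wider range earns extra bits at the expense of a narrow one.
bits = allocate_bits([1.0, 8.0], avg_bits=4)     # -> array([3, 5])
```

Because of the rounding in Eq. 11, the realized bin count can deviate slightly from the quota B; the constraint holds exactly before rounding.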
To that end, we first evaluate correction constants for each channel c as follows:\n\n\u00b5_c = E(W_c) \u2212 E(W^q_c),    \u03be_c = ||W_c \u2212 E(W_c)||_2 / ||W^q_c \u2212 E(W^q_c)||_2    (12)\n\nThen, we compensate for the bias in W^q_c for each channel c as follows:\n\nw \u2190 \u03be_c \u00b7 (w + \u00b5_c),    \u2200w \u2208 W^q_c    (13)\n\nFigure 3: Optimal bin-allocation in a synthetic experiment consisting of a pair of channels i, j, each consisting of 1000 values taken from N(0, \u03b1_i^2) and N(0, \u03b1_j^2). The overall bin quota for the layer is set to B = 32, equivalent in terms of memory bandwidth to the number of bins allocated for two channels at 4-bit precision. As indicated by the vertical lines in the plot, the optimal allocations (predicted by Eq. 10) coincide with the best allocations found by the experiment.\n\nWe consider a setup where each channel has a different scale and offset (per-channel quantization). We can therefore compensate for this bias by folding, for each channel c, the correction terms \u00b5_c and \u03be_c into the scale and offset of the channel. In the appendix, we demonstrate the benefit of using bias-correction for 3-bit weight quantization.\n\n5 Combining our quantization methods\n\nIn the previous sections, we introduced each of the quantization methods independently of the other methods. In this section, we consider their efficient integration.\n\n5.1 Applicability\n\nWe use per-channel bit allocation for both weights and activations. We found no advantage in doing any kind of weight clipping. This is in line with earlier works that also report no advantage to weight clipping for larger bit-widths (Migacz, 2017; Zhao et al., 2019). Therefore, ACIQ was considered for quantizing activations only. On the other hand, bias correction could in principle be implemented for both weights and activations. 
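The per-channel correction of Eqs. 12-13 amounts to a small affine fix-up; a minimal sketch (the function name and the toy quantizer are ours):

```python
import numpy as np

def bias_correct(w, wq):
    """Apply the correction of Eqs. 12-13 to one channel of quantized weights wq."""
    mu = w.mean() - wq.mean()                  # mean correction (Eq. 12)
    xi = np.linalg.norm(w - w.mean()) / np.linalg.norm(wq - wq.mean())
    return xi * (wq + mu)                      # corrected weights (Eq. 13)

rng = np.random.default_rng(0)
w = rng.normal(0.1, 1.0, size=4096)   # one channel of weights
wq = np.round(w * 2) / 2              # a toy coarse quantizer, for illustration only
wc = bias_correct(w, wq)
```

Adding mu and then scaling by xi restores the channel's deviation norm exactly; and since xi and mu fold into the per-channel scale and offset, the correction adds no runtime cost.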
Yet, unlike bias correction for weights, which can be done offline before model deployment, the activation bias must be estimated by running input images, which might not be available for gathering statistics post-training. As the online alternative of estimating the activation bias on the fly during run-time might be prohibitive, we considered the bias correction method only for the weights.\n\n5.2 Interaction between quantization methods\n\nWe conduct a study to investigate how each quantization method affects performance. We consider four quantization methods: (1) ACIQ; (2) Bias-correction; (3) Per-channel bit-allocation for weights; (4) Per-channel bit-allocation for activations. In Figure 4, we demonstrate the interaction between these methods at different quantization levels for various models. In the appendix, we report the results of an experiment on ResNet101 where all possible interactions are evaluated (16 combinations).\n\nFigure 4: An ablation study showing that the methods work in synergy and effectively at 3-4 bit precision.\n\n6 Experiments & Results\n\nThis section reports experiments on post-training quantization using six convolutional models originally pre-trained on the ImageNet dataset. We consider the following baseline setup:\nPer-channel quantization of weights and activations: It is often the case that the distributions of weights and activations vary significantly between different channels. In these cases, calculating a scale-factor per channel can provide good accuracy for post-training quantization (Krishnamoorthi, 2018). The per-channel scale has been shown to be important both for inference (Rastegari et al., 2016) and for training (Wu et al., 2018).\nFused ReLU: In convolutional neural networks, most convolutions are followed by a rectified linear unit (ReLU), zeroing the negative values. There are many scenarios where these two operations can be fused to avoid the accumulation of quantization noise. 
In these settings, we can ignore the negative values and find an optimal clipping value \u03b1 for the positive half space [0, \u03b1]. Fused ReLU provides a smaller dynamic range, which leads to a smaller spacing between the different quantization levels and therefore a smaller round-off error upon quantization. In the Appendix, we provide a detailed analysis of the optimal value of \u03b1.\nWe follow the common practice of quantizing the first and the last layers, as well as average/max-pooling layers, to 8-bit precision. Table 1 summarizes our results for 4-bit post-training quantization. In the appendix we provide additional results for 3-bit quantization.\n\n7 Conclusion\n\nLearning quantization for numerical precision of 4 bits and below has long been shown to be effective (Lin et al., 2017; McKinstry et al., 2018; Zhou et al., 2016; Choi et al., 2018). However, these schemes pose major obstacles that hinder their practical use. For example, many DNN developers only provide the pre-trained networks in full precision, without the training dataset, for reasons such as privacy or the sheer size of the data. Consequently, quantization schemes involving training have largely been ignored by the industry. This gap has led to intensive research efforts by several tech giants and start-ups to improve post-training quantization: (1) Samsung (Lee et al., 2018), (2) Huawei (Choukroun et al., 2019), (3) Hailo Technologies (Meller et al., 2019), (4) NVIDIA (Migacz, 2017).\nOur main findings in this paper suggest that, with just a few percent accuracy degradation, retraining CNN models may be unnecessary for 4-bit quantization.\n\nTable 1: ImageNet Top-1 validation accuracy with post-training quantization using the three methods suggested by this work. 
Quantizing activations (8W4A): (A) Baseline consists of per-channel quantization of activations and fused ReLU; each channel is quantized to 4-bit precision with a uniform quantization step between the maximum and minimum values of the channel (GEMMLOWP, Jacob et al. (2017)). (B) ACIQ optimally clips the values within each channel before applying quantization. (C) Per-channel bit allocation assigns to each activation channel an optimal bit-width without exceeding an average of 4 bits per channel, as determined by Eq. 11. (D) ACIQ + Per-channel bit allocation quantizes the activation tensors in a two-stage pipeline: bit-allocation and clipping. Quantizing weights (4W8A): (A) Baseline consists of per-channel quantization of weights. (B) Bias-correction compensates for the quantization bias using Eq. 13. (C) Per-channel bit allocation assigns to each weight channel the optimal bit-width without violating the quota of allowed bits, which translates on average to 4 bits per channel. (D) Bias-Corr + Per-channel bit allocation quantizes the weight tensors in a three-stage pipeline: per-channel bit-allocation, quantization, and bias correction to compensate for the quantization bias. Quantizing weights and activations (4W4A): Baseline consists of a combination of the above two baseline settings, i.e., (4W8A) and (8W4A). 
Our pipeline incorporates into the baseline all methods suggested by our work, namely, ACIQ for activation quantization, per-channel bit allocation of both weights and activations, and bias correction for weight quantization.\n\nQuantizing activations: 8-bit weights, 4-bit activations (8W4A)\n(Per-channel quantization of activations + fused ReLU)\n\nMethod | VGG | VGG-BN | IncepV3 | Res18 | Res50 | Res101\nBaseline | 68.8% | 70.6% | 70.9% | 61.5% | 68.3% | 66.5%\nACIQ | 70.1% | 72.0% | 72.7% | 66.6% | 71.8% | 72.6%\nPer-channel bit allocation | 69.7% | 72.6% | 74.3% | 65.0% | 71.3% | 70.8%\nACIQ + Per-channel bit allocation | 70.7% | 72.8% | 75.1% | 68.0% | 73.6% | 75.6%\nReference (FP32) | 71.6% | 73.4% | 77.2% | 69.7% | 76.1% | 77.3%\n\nQuantizing weights: 4-bit weights, 8-bit activations (4W8A)\n(Per-channel quantization of weights)\n\nMethod | VGG | VGG-BN | IncepV3 | Res18 | Res50 | Res101\nBaseline | 70.5% | 68.5% | 38.4% | 59.7% | 72.5% | 74.6%\nBias-Correction | 71.0% | 71.7% | 59.5% | 67.4% | 74.8% | 76.3%\nPer-channel bit-allocation | 71.0% | 71.9% | 61.4% | 66.7% | 75.0% | 76.4%\nBias-Corr + Per-channel bit-allocation | 71.2% | 72.4% | 68.2% | 68.3% | 75.3% | 76.9%\nReference (FP32) | 71.6% | 73.4% | 77.2% | 69.7% | 76.1% | 77.3%\n\nQuantizing weights and activations: 4-bit weights, 4-bit activations (4W4A)\n(Per-channel quantization of weights & activations + fused ReLU)\n\nMethod | VGG | VGG-BN | IncepV3 | Res18 | Res50 | Res101\nBaseline | 67.2% | 64.5% | 30.6% | 51.6% | 62.0% | 62.6%\nAll methods combined | 70.5% | 71.8% | 66.4% | 67.0% | 73.8% | 75.0%\nReference (FP32) | 71.6% | 73.4% | 77.2% | 69.7% | 76.1% | 77.3%\n\nReferences\n\nChoi, Jungwook, Wang, Zhuo, Venkataramani, Swagath, Chuang, Pierce I-Jen, Srinivasan, Vijayalakshmi, and Gopalakrishnan, Kailash. PACT: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.\n\nChoukroun, Yoni, Kravchik, Eli, and Kisilev, Pavel. Low-bit quantization of neural networks for efficient inference. 
arXiv preprint arXiv:1902.06822, 2019.\n\nGoncharenko, Alexander, Denisov, Andrey, Alyamkin, Sergey, and Terentev, Evgeny. Fast adjustable threshold for uniform neural network quantization. arXiv preprint arXiv:1812.07872, 2018.\n\nHan, Song, Mao, Huizi, and Dally, William J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.\n\nHubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks. In NIPS. US Patent 62/317,665, Filed, 2016.\n\nJacob, Benoit, Kligys, Skirmantas, Chen, Bo, Zhu, Menglong, Tang, Matthew, Howard, Andrew, Adam, Hartwig, and Kalenichenko, Dmitry. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704\u20132713, 2018.\n\nJacob, Benoit et al. gemmlowp: a small self-contained low-precision GEMM library, 2017.\n\nKrishnamoorthi, Raghuraman. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.\n\nLee, Jun Haeng, Ha, Sangwon, Choi, Saerom, Lee, Won-Jo, and Lee, Seungwon. Quantization for rapid deployment of deep neural networks. arXiv preprint arXiv:1810.05488, 2018.\n\nLin, Xiaofan, Zhao, Cong, and Pan, Wei. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pp. 345\u2013353, 2017.\n\nMcKinstry, Jeffrey L, Esser, Steven K, Appuswamy, Rathinakumar, Bablani, Deepika, Arthur, John V, Yildiz, Izzet B, and Modha, Dharmendra S. Discovering low-precision networks close to full-precision networks for efficient embedded inference. arXiv preprint arXiv:1809.04191, 2018.\n\nMeller, Eldad, Finkelstein, Alexander, Almog, Uri, and Grobman, Mark. Same, same but different - recovering neural network quantization error through weight factorization. 
arXiv preprint arXiv:1902.01917, 2019.\n\nMigacz, S. 8-bit inference with TensorRT. In GPU Technology Conference, 2017.\n\nRastegari, Mohammad, Ordonez, Vicente, Redmon, Joseph, and Farhadi, Ali. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525\u2013542. Springer, 2016.\n\nWu, Shuang, Li, Guoqi, Chen, Feng, and Shi, Luping. Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680, 2018.\n\nZhao, Ritchie, Hu, Yuwei, Dotzel, Jordan, De Sa, Christopher, and Zhang, Zhiru. Improving neural network quantization using outlier channel splitting. arXiv preprint arXiv:1901.09504, 2019.\n\nZhou, Shuchang, Wu, Yuxin, Ni, Zekun, Zhou, Xinyu, Wen, He, and Zou, Yuheng. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.\n", "award": [], "sourceid": 4363, "authors": [{"given_name": "Ron", "family_name": "Banner", "institution": "Intel - Artificial Intelligence Products Group (AIPG)"}, {"given_name": "Yury", "family_name": "Nahshan", "institution": "Intel - Artificial Intelligence Products Group (AIPG)"}, {"given_name": "Daniel", "family_name": "Soudry", "institution": "Technion"}]}