{"title": "Scalable methods for 8-bit training of neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 5145, "page_last": 5153, "abstract": "Quantized Neural Networks (QNNs) are often used to improve network efficiency during the inference phase, i.e. after the network has been trained. Extensive research in the field suggests many different quantization schemes. Still, the number of bits required, as well as the best quantization scheme, are yet unknown. Our theoretical analysis suggests that most of the training process is robust to substantial precision reduction, and points to only a few specific operations that require higher precision. Armed with this knowledge, we quantize the model parameters, activations and layer gradients to 8-bit, leaving at higher precision only the final step in the computation of the weight gradients. Additionally, as QNNs require batch-normalization to be trained at high precision, we introduce Range Batch-Normalization (BN) which has significantly higher tolerance to quantization noise and improved computational complexity. Our simulations show that Range BN is equivalent to the traditional batch norm if a precise scale adjustment, which can be approximated analytically, is applied. 
To the best of the authors' knowledge, this work is the first to quantize the weights, activations, as well as a substantial volume of the gradients stream, in all layers (including batch normalization) to 8-bit while showing state-of-the-art results over the ImageNet-1K dataset.", "full_text": "Scalable Methods for 8-bit Training of Neural Networks\n\nRon Banner1\u2217, Itay Hubara2\u2217, Elad Hoffer2\u2217, Daniel Soudry2\n{itayhubara, elad.hoffer, daniel.soudry}@gmail.com\n{ron.banner}@intel.com\n\n(1) Intel - Artificial Intelligence Products Group (AIPG)\n(2) Technion - Israel Institute of Technology, Haifa, Israel\n\nAbstract\n\nQuantized Neural Networks (QNNs) are often used to improve network efficiency during the inference phase, i.e. after the network has been trained. Extensive research in the field suggests many different quantization schemes. Still, the number of bits required, as well as the best quantization scheme, are yet unknown. Our theoretical analysis suggests that most of the training process is robust to substantial precision reduction, and points to only a few specific operations that require higher precision. Armed with this knowledge, we quantize the model parameters, activations and layer gradients to 8-bit, leaving at a higher precision only the final step in the computation of the weight gradients. Additionally, as QNNs require batch-normalization to be trained at high precision, we introduce Range Batch-Normalization (BN), which has significantly higher tolerance to quantization noise and improved computational complexity. Our simulations show that Range BN is equivalent to the traditional batch norm if a precise scale adjustment, which can be approximated analytically, is applied. 
To the best of the authors\u2019 knowledge, this work is the first to quantize the weights, activations, as well as a substantial volume of the gradients stream, in all layers (including batch normalization) to 8-bit while showing state-of-the-art results over the ImageNet-1K dataset.\n\n1 Introduction\n\nDeep Neural Networks (DNNs) have achieved remarkable results in many fields, making them the most common off-the-shelf approach for a wide variety of machine learning applications. However, as networks get deeper, using neural network (NN) algorithms and training them on conventional general-purpose digital hardware is highly inefficient. The main computational effort is due to the massive amounts of multiply-accumulate operations (MACs) required to compute the weighted sums of the neurons\u2019 inputs and the parameters\u2019 gradients.\n\nMuch work has been done to reduce the size of networks. The conventional approach is to compress a trained (full precision) network [4, 19, 12] using weight sharing, low-rank approximation, quantization, pruning, or some combination thereof. For example, Han et al., 2015 [7] successfully pruned several state-of-the-art large-scale networks and showed that the number of parameters can be reduced by an order of magnitude.\n\nSince training neural networks requires approximately three times more computation power than just evaluating them, quantizing the gradients is a critical step towards faster training machines. Previous work demonstrated that by quantizing network parameters and intermediate activations during the training phase, more computationally efficient DNNs could be constructed.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\u2217Equal contribution\n\n\fResearchers [6, 5] have shown that 16-bit is sufficient precision for most network training, but further quantization (i.e., 8-bit) results in severe degradation. 
Our work is the first to almost exclusively train at 8-bit without harming classification accuracy. This is achieved by overcoming two main obstacles known to hamper numerical stability: batch normalization and gradient computations.\n\nThe traditional batch normalization [11] implementation requires the computation of sum-of-squares, square-root and reciprocal operations; these require high precision (to avoid zero variance) and a large dynamic range. It should come as no surprise that previous attempts to use low-precision networks did not use batch normalization layers [21] or kept them in full precision [24]. This work replaces the batch norm operation with Range Batch-Norm (Range BN), which normalizes inputs by the range of the input distribution (i.e., max(x) \u2212 min(x)). This measure is more suitable for low-precision implementations. Range BN is shown analytically to approximate the original batch normalization by multiplying this range with a scale adjustment that depends on the size of the batch and equals (2 \u00b7 ln(n))^{-0.5}. Experiments on ImageNet with ResNet-18 and ResNet-50 showed no distinguishable difference between the accuracy of Range BN and traditional BN.\n\nThe second obstacle is related to gradient quantization. Given an upstream gradient gl from layer l, layer l \u2212 1 needs to apply two different matrix multiplications: one for the layer gradient gl\u22121 and the other for the weight gradient gW, which is needed for the update rule. Our analysis indicates that the statistics of the gradient gl violate the assumptions at the crux of common quantization schemes. As such, quantizing these gradients constitutes the main cause of degradation in performance through training. Accordingly, we suggest using two versions of the layer gradients gl, one at low precision (8-bit) and another at higher precision (16-bit). 
The idea is to keep at 16 bits only those calculations involving gl that do not create a performance bottleneck, while keeping the rest at 8 bits. As the gradients gW are required only for the weight update, they are computed using the 16-bit copy of gl. On the other hand, the gradient gl\u22121 is required for the entire backward stream and as such is computed using the corresponding 8-bit version of gl. In most layers of the DNN these computations can be performed in parallel. Hence gW can be computed at high precision in parallel with gl\u22121, without interrupting the propagation of gl to lower layers. We denote the use of two different arithmetic precisions in the differentiation process as \"Gradients Bifurcation\".\n\n2 Previous Work\n\nWhile several works [6, 5] have shown that training at 16-bit is sufficient for most networks, more aggressive quantization schemes have also been suggested [24, 16, 14, 10]. In the extreme case, the quantization process uses only one bit, resulting in binarized neural networks (BNNs) [9], where both weights and activations are constrained to -1 and 1. However, for more complex models and challenging datasets, the extreme compression rate resulted in a loss of accuracy. Recently, Mishra et al. [15] showed that this accuracy loss can be prevented by merely increasing the number of filter maps in each layer, thus suggesting that quantized neural networks (QNNs) do not possess an inherent convergence problem. Nevertheless, increasing the number of filter maps enlarges the number of parameters quadratically, which raises questions about the efficiency of this approach.\n\nIn addition to the quantization of the forward pass, a growing interest is directed towards the quantization of the gradient propagation in neural networks. 
A fully quantized method, allowing both forward and backward low-precision operations, would enable the use of dedicated hardware, with considerable computational, memory, and power benefits. Previous attempts to discretize the gradients managed either to reduce them to 16-bit without loss of accuracy [5] or to apply a more aggressive approach and reduce the precision to 6-8 bits [24, 10] with a noticeable degradation. Batch normalization is mentioned by [21] as a bottleneck for network quantization and is either replaced by a constant scaling layer kept in full precision, or avoided altogether; this clearly has some impact on performance (e.g., AlexNet trained over ImageNet resulted in a top-1 error of 51.6%, where the state of the art is near 42%), and better ways to quantize normalization are explicitly called for. Recently, an L1 batch norm with only linear operations in both forward and backward propagation was suggested by [22, 8] with improved numerical stability. Yet, our experiments show that with 8-bit training even L1 batch norm is prone to overflows when summing over many large positive values. Finally, Wen et al. [20] focused on quantizing the gradient updates to ternary values to reduce the communication bandwidth in distributed systems.\n\nWe claim that although more aggressive quantization methods exist, 8-bit precision may prove to have a \"sweet-spot\" quality to it, by enabling training with no loss of accuracy and without modifying the original architecture. Moreover, we note that 8-bit quantization is better suited for future and even current hardware, much of which can already benefit from 8-bit operations [17]. 
So far, to the best of our knowledge, no work has succeeded in quantizing the activations, weights, and gradients of all layers (including batch normalization) to 8-bit without any degradation.\n\n3 Range Batch-Normalization\n\nFor a layer with n \u00d7 d-dimensional input x = (x(1), x(2), ..., x(d)), traditional batch norm normalizes each dimension\n\n\u02c6x(d) = (x(d) \u2212 \u00b5d) / \u221aVar[x(d)],    (1)\n\nwhere \u00b5d is the expectation over x(d), n is the batch size, and Var[x(d)] = (1/n) ||x(d) \u2212 \u00b5d||\u2082\u00b2. The term \u221aVar[x(d)] involves sums of squares that can lead to numerical instability as well as to arithmetic overflows when dealing with large values. The Range BN method replaces the above term by normalizing according to the range of the input distribution (i.e., max(\u00b7) \u2212 min(\u00b7)), making it more tolerant to quantization. For a layer with d-dimensional input x = (x(1), x(2), ..., x(d)), Range BN normalizes each dimension\n\n\u02c6x(d) = (x(d) \u2212 \u00b5d) / (C(n) \u00b7 range(x(d) \u2212 \u00b5d)),    (2)\n\nwhere \u00b5d is the expectation over x(d), n is the batch size, C(n) = 1/\u221a(2 \u00b7 ln(n)) is a scale adjustment term, and range(x) = max(x) \u2212 min(x).\n\nThe main idea behind Range BN is to use the scale adjustment C(n) to approximate the standard deviation \u03c3 (traditionally used in vanilla batch norm) by multiplying it with the range of the input values. Assuming the input follows a Gaussian distribution, the range (spread) of the input is highly correlated with the standard deviation magnitude; therefore, by normalizing the range by C(n) we can estimate \u03c3. Note that the Gaussian assumption is a common approximation (e.g., Soudry et al. [18]), based on the fact that the neural input x(d) is a sum of many inputs, so we expect it to be approximately Gaussian by the central limit theorem.\n\nWe now turn to derive the normalization term C(n). 
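As a concrete illustration, the forward normalization of Eq. 2 can be sketched in a few lines of NumPy. This is our own illustrative re-implementation under the assumptions stated in the text (Gaussian-like inputs, per-dimension statistics over the batch axis), not the authors' released code, and it omits the learned scale and shift parameters of a full batch-norm layer.

```python
import numpy as np

def range_bn(x):
    """Range BN forward pass (Eq. 2): normalize each feature dimension of an
    (n, d) batch by C(n) * range(x - mu) instead of the standard deviation."""
    n = x.shape[0]
    c_n = 1.0 / np.sqrt(2.0 * np.log(n))               # scale adjustment C(n)
    centered = x - x.mean(axis=0)                      # x(d) - mu_d
    rng = centered.max(axis=0) - centered.min(axis=0)  # max - min per dimension
    return centered / (c_n * rng)

# For Gaussian inputs, C(n) * range approximates the standard deviation,
# so the output scale should be close to that of vanilla batch norm.
gen = np.random.default_rng(0)
x = gen.normal(loc=3.0, scale=2.0, size=(256, 8))
out = range_bn(x)
vanilla = (x - x.mean(axis=0)) / x.std(axis=0)
```

Note how only max and min reductions are needed per dimension, which is the property that makes the operation tolerant to low-precision arithmetic.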
The expectation of the maximum of Gaussian random variables is bounded as follows [13]:\n\n0.23\u03c3 \u00b7 \u221aln(n) \u2264 E[max(x(d) \u2212 \u00b5d)] \u2264 \u221a2 \u00b7 \u03c3 \u00b7 \u221aln(n).    (3)\n\nSince x(d) \u2212 \u00b5d is symmetrical with respect to zero (centred at zero and assumed Gaussian), it holds that E[max(\u00b7)] = \u2212E[min(\u00b7)]; hence,\n\n0.23\u03c3 \u00b7 \u221aln(n) \u2264 \u2212E[min(x(d) \u2212 \u00b5d)] \u2264 \u221a2 \u00b7 \u03c3 \u00b7 \u221aln(n).    (4)\n\nTherefore, by summing Equations 3 and 4 and multiplying the three parts of the inequality by the normalization term C(n), Range BN in Eq. 2 approximates the original standard deviation measure \u03c3 as follows:\n\n0.325\u03c3 \u2264 C(n) \u00b7 range(x(d) \u2212 \u00b5d) \u2264 2\u03c3.\n\nImportantly, the scale adjustment term C(n) plays a major role in the success of Range BN. Performance degraded in simulations when C(n) was not used or was modified to nearby values.\n\n4 Quantized Back-Propagation\n\nQuantization methods: Following [23], we used the GEMMLOWP quantization scheme as described in Google\u2019s open source library [1]. A detailed explanation of this approach is given in the Appendix. While GEMMLOWP is widely used for deployment, to the best of the authors\u2019 knowledge this is the first time GEMMLOWP quantization is applied for training. Note that the activations\u2019 maximum and minimum values are computed by the Range BN operator, so finding the normalization scale (see Appendix) does not require additional O(n) operations.\n\nFinally, we note that good convergence was achieved only by using stochastic rounding [6] for the gradient quantization. 
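Stochastic rounding, as used here for the gradient quantization, can be sketched as follows. The helper below is a generic unbiased rounding to a grid of step delta, our own illustration rather than the GEMMLOWP kernel itself.

```python
import numpy as np

def stochastic_round(x, delta, gen):
    """Round x to the grid {k * delta} stochastically: round up with
    probability equal to the fractional remainder, so that
    E[stochastic_round(x)] == x (an unbiased quantizer)."""
    scaled = x / delta
    floor = np.floor(scaled)
    frac = scaled - floor                   # in [0, 1)
    up = gen.random(x.shape) < frac         # round up with probability frac
    return (floor + up) * delta

gen = np.random.default_rng(0)
x = np.full(100_000, 0.3)
q = stochastic_round(x, delta=1.0, gen=gen)
# On average the quantized values recover x; round-to-nearest would map
# every 0.3 to 0 and systematically bias the accumulated weight updates.
```

This unbiasedness is exactly why the text reports good convergence only with stochastic rounding: a biased quantizer lets gradient noise accumulate in the weight updates.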
This behaviour is not surprising: the gradients eventually serve for the weight update, so an unbiased quantization scheme is required to avoid noise accumulation.\n\nGradients Bifurcation: In the back-propagation algorithm we recursively calculate the gradients of the loss function L with respect to Il, the input of the l-th neural layer,\n\ngl = \u2202L / \u2202Il,    (5)\n\nstarting from the last layer. Each layer needs to derive two sets of gradients to perform the recursive update. The layer activation gradients,\n\ngl\u22121 = gl Wl^T,    (6)\n\nserve the back-propagation (BP) phase and are thus passed to the next layer, and the weight gradients,\n\ngWl = gl (Il\u22121)^T,    (7)\n\nare used to update the weights in layer l. Since the backward pass requires twice the amount of multiplications compared to the forward pass, quantizing the gradients is a crucial step towards faster training machines. Since gl, the gradients streaming from layer l, are required to compute gl\u22121, it is important to expedite the matrix multiplication described in Eq. 6. The second set of gradients, derived in Eq. 7, is not required for this sequential process, and thus we choose to keep this matrix multiplication at full precision. We argue that the extra time required for this matrix multiplication is comparably small to the time required to communicate the gradients gl. Thus, in this work the gradients used for the weight gradient derivation are still in float. In Section 6, we show empirically that bifurcation of the gradients is crucial for high accuracy results.\n\nStraight-Through Estimator: Similar to previous work [9, 15], we used the straight-through estimator (STE) approach to approximate differentiation through discrete variables. 
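The Gradients Bifurcation of Eqs. 6-7 can be sketched for a single linear layer. This is an illustrative NumPy mock-up: the `quantize` helper is a toy round-to-nearest stand-in for the paper's GEMMLOWP-based 8- and 16-bit quantizers, and the weight layout (W of shape (in, out)) is our own convention, equivalent to Eqs. 6-7 up to transposition.

```python
import numpy as np

def quantize(x, bits):
    """Toy uniform quantizer standing in for the paper's 8/16-bit schemes."""
    delta = np.abs(x).max() / (2 ** (bits - 1)) + 1e-12
    return np.round(x / delta) * delta

def backward_bifurcated(g_l, W, I_prev):
    """Backward pass of one linear layer with Gradients Bifurcation:
    the layer gradient g_{l-1} (Eq. 6) uses an 8-bit copy of g_l, while
    the weight gradient gW (Eq. 7) uses a higher-precision 16-bit copy."""
    g8 = quantize(g_l, bits=8)    # low-precision copy, propagated downstream
    g16 = quantize(g_l, bits=16)  # high-precision copy, used only for updates
    g_prev = g8 @ W.T             # Eq. 6: g_{l-1} = g_l W^T
    g_W = I_prev.T @ g16          # Eq. 7: gW = g_l I^T (our layout)
    return g_prev, g_W

gen = np.random.default_rng(0)
W = gen.normal(size=(4, 3))        # layer weights (in, out)
I_prev = gen.normal(size=(8, 4))   # layer input, batch of 8
g_l = gen.normal(size=(8, 3))      # upstream gradient
g_prev, g_W = backward_bifurcated(g_l, W, I_prev)
```

The two matrix multiplications are independent, which is what lets the slower 16-bit product for g_W run in parallel with the 8-bit product that keeps the backward stream moving.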
This is the simplest and most hardware-friendly approach to deal with the fact that the exact derivative of discrete variables is zero almost everywhere.\n\n5 When is quantization of neural networks possible?\n\nThis section provides some of the foundations needed for understanding the internal representation of quantized neural networks. It is well known that when batch norm is applied after a convolution layer, the output is invariant to the norm of the weights of the preceding layer [11], i.e., BN(C \u00b7 W \u00b7 x) = BN(W \u00b7 x) for any given constant C. This quantity is often described geometrically as the norm of the weight tensor, and in the presence of this invariance, the only measure that needs to be preserved upon quantization is the directionality of the weight tensor. In the following we show that quantization preserves the direction (angle) of high-dimensional vectors when W follows a Gaussian distribution. More specifically, for networks with M-bit fixed point representation, the angle is preserved when the number of quantization levels 2^M is much larger than \u221a(2 ln(N)), where N is the size of the quantized vector. This shows that significant quantization is possible in practical settings. Taking for example the dimensionality of the joint product in a batch with 1024 examples corresponding to the last layer of ResNet-50, we need no more than 8 bits of precision to preserve the angle well (i.e., \u221a(2 ln(3 \u00b7 3 \u00b7 2048 \u00b7 1024)) = 5.7 << 2^8). We stress that this result relies heavily on the values being distributed according to a Gaussian distribution, and it suggests why some vectors are robust to quantization (e.g., weights and activations) while others are more fragile (e.g., gradients).\n\n\f5.1 Problem Statement\n\nGiven a vector of weights W = (w0, w1, ..., wN\u22121), where the weights follow a Gaussian distribution W \u223c N(0, \u03c3), we would like to measure the cosine similarity (i.e., the cosine of the angle) between W and Q(W), where Q(\u00b7) is a quantization function. More formally, we are interested in estimating the following geometric measure:\n\ncos(\u03b8) = W \u00b7 Q(W) / (||W||\u2082 \u00b7 ||Q(W)||\u2082).    (8)\n\nFigure 1: Graphic illustration of the angle between the full precision vector W and its low precision counterpart, which we model as W + \u03b5 where \u03b5i \u223c U(\u2212\u2206/2, \u2206/2).\n\nWe next define the quantization function Q(\u00b7) using a fixed quantization step \u2206 between adjacent quantized levels as follows:\n\nQ(x) = \u2206 \u00b7 (\u230ax/\u2206\u230b + 1/2), where \u2206 = max(|W|) / 2^M.    (9)\n\nWe consider the case where the quantization step \u2206 is much smaller than mean(|W|). Under this assumption the correlation between W and the quantization noise W \u2212 Q(W) = (\u03b50, \u03b51, ..., \u03b5N\u22121) is negligible, and the noise can be approximated as additive. Our model assumes an additive quantization noise \u03b5 with a uniform distribution, i.e., \u03b5i \u223c U[\u2212\u2206/2, \u2206/2] for each index i. 
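The geometric measure of Eq. 8 and the quantizer of Eq. 9 are easy to reproduce numerically. The sketch below is our own illustration (using a smaller Gaussian vector than the full ResNet-50 example, for speed) and checks that an 8-bit quantization barely rotates the vector:

```python
import numpy as np

def quantize(w, m_bits):
    """Eq. 9: fixed-step quantizer with Delta = max(|w|) / 2^M."""
    delta = np.abs(w).max() / (2 ** m_bits)
    return delta * (np.floor(w / delta) + 0.5)

def cosine(u, v):
    """Eq. 8: cosine of the angle between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

gen = np.random.default_rng(0)
w = gen.normal(0.0, 1.0, size=2048 * 1024)  # high-dimensional Gaussian vector
c = cosine(w, quantize(w, m_bits=8))
# With 2^M = 256 >> sqrt(2 ln N) ~ 5.4 here, the angle is well preserved.
```

Repeating the measurement with Laplacian or heavy-tailed samples in place of `w` shows a noticeably lower cosine, consistent with the claim that the Gaussian assumption is what makes weights and activations robust.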
Our goal is to estimate the angle between W and W + \u03b5 at high dimension (i.e., N \u2192 \u221e).\n\n5.2 Angle preservation during quantization\n\nIn order to estimate the angle between W and W + \u03b5, we first estimate the angle between W and \u03b5. It is well known that if \u03b5 and W are independent, then at high dimension the angle between W and \u03b5 tends to \u03c0/2 [2], i.e., we get a right-angle triangle with W and \u03b5 as the legs, while W + \u03b5 is the hypotenuse, as illustrated in Figure 1. The cosine of the angle \u03b8 in that triangle can be approximated as follows:\n\ncos(\u03b8) = ||W|| / ||W + \u03b5|| \u2265 ||W|| / (||W|| + ||\u03b5||).    (10)\n\nSince W is Gaussian, we have that E(||W||) \u2245 \u221aN \u00b7 \u03c3 in high dimensions [3]. Additionally, in the Appendix we show that E(||\u03b5||) \u2264 \u221a(N/12) \u00b7 \u2206. Moreover, at high dimensions, the relative error made by considering E||X|| instead of the random variable ||X|| becomes asymptotically negligible [2]. Therefore, the following holds in high dimensions:\n\ncos(\u03b8) \u2265 \u03c3 / (\u03c3 + E(\u2206)/\u221a12) = 2^M \u00b7 \u03c3 / (2^M \u00b7 \u03c3 + E(max(|W|))/\u221a12).    (11)\n\nFinally, E(max(|W|)) \u2264 \u221a2 \u00b7 \u03c3 \u00b7 \u221aln(N) when W follows a Gaussian distribution [13], establishing the following:\n\ncos(\u03b8) \u2265 2^M / (2^M + \u221aln(N)/\u221a6).    (12)\n\nEq. 12 establishes that when 2^M >> \u221aln(N) the angle is preserved during quantization. It is easy to see that in most practical settings this condition holds even for challenging quantizations. Moreover, this result depends heavily on the Gaussian assumption on W (the transition from Equation 11 to Equation 12).\n\n6 Experiments\n\nWe evaluated the ideas of Range Batch-Norm and Quantized Back-Propagation on multiple different models and datasets. The code to replicate all of our experiments is available on-line 2.\n\n2https://github.com/eladhoffer/quantized.pytorch\n\n\f6.1 Experiment results on cifar-10 dataset\n\nTo validate our assumption that the cosine similarity is a good measure for the quality of the quantization, we ran a set of experiments on the Cifar-10 dataset, each with a different number of bits, and then plotted the average angle and the final accuracy. As can be seen in Figure 2, there is a high correlation between the two. Taking a closer look, the following additional observations can be made: (1) during quantization, the direction of vectors is better preserved in the forward pass compared to the backward pass; (2) validation accuracy tightly follows the cosine of the angle in the backward pass, indicating gradient quantization as the primary bottleneck; (3) as expected, the bound on E(cos(\u03b8)) in Eq. 12 holds in the forward pass, but less so in the backward pass, where the Gaussian assumption tends to break. The histograms in Figure 2 further confirm that the layer gradients gl do not follow a Gaussian distribution. These are the values that are bifurcated into low and high precision copies to reduce noise accumulation.\n\nFigure 2: Left: empirical and theoretical analysis of the cosine similarity, cos(\u03b8), with respect to the number of bits used for quantization. Right: histograms of activations, layer gradients gl and weight gradients gW. 
To emphasize that gl do not follow a Gaussian distribution, the histograms are plotted on a log scale (ResNet-18, Cifar-10).\n\nFigure 3: Equivalent accuracy with standard and range batch-norm (ResNet-50, ImageNet).\n\n6.2 Experiment results on ImageNet dataset: Range Batch-Normalization\n\nWe ran experiments with ResNet-50 on the ImageNet dataset showing the equivalence between the standard batch-norm and Range BN in terms of accuracy. The only difference between the experiments was the use of Range BN instead of the traditional batch-norm. Figure 3 compares the two, showing equivalence when models are trained at high precision. We also ran simulations on other datasets and models. When examining the final results, both were equivalent, i.e., 32.5% vs. 32.4% error for ResNet-18 on ImageNet and 10.5% vs. 10.7% for ResNet-56 on Cifar-10. To conclude, these simulations show that we can replace standard batch-norm with Range BN while keeping accuracy unchanged. Replacing the sum of squares and square root operations in standard batch-norm by a few maximum and minimum operations has a major benefit in low-precision implementations.\n\n\f6.3 Experiment results on ImageNet dataset: Putting it all together\n\nWe conducted experiments using Range BN together with Quantized Back-Propagation. To validate this low precision scheme, we quantized the vast majority of operations to 8-bit. The only operations left at higher precision were the updates (float32) needed to accumulate small changes from stochastic gradient descent, and a copy of the layer gradients at 16 bits needed to compute gW. Note that the float32 updates are done once per minibatch while the propagations are done for each example (e.g., for a minibatch of 256 examples the updates constitute less than 0.4% of the training effort). 
Figure 4 presents the results of this experiment on the ImageNet dataset using ResNet-18 and ResNet-50. We provide additional results using more aggressive quantizations in Appendix F.\n\nFigure 4: Comparing a full precision run against an 8-bit run with Quantized Back-Propagation and Range BN (ResNet-18 and ResNet-50 trained on ImageNet).\n\n7 Discussion\n\nIn this study, we investigate the internal representation of low precision neural networks and present guidelines for their quantization. Considering the preservation of direction during quantization, we analytically show that significant quantization is possible for vectors with a Gaussian distribution. On the forward pass the inputs to each layer are known to be distributed according to a Gaussian distribution, but on the backward pass we observe that the layer gradients gl do not follow this distribution. Our experiments further show that the angle is not well preserved on the backward pass, and moreover the final validation accuracy tightly follows that angle. Accordingly, we bifurcate the layer gradients gl and use a 16-bit copy for the computation of the weight gradients gW while keeping the computation of the next layer gradient gl\u22121 at 8-bit. This enables the (slower) 16-bit computation of gW to be done in parallel with gl\u22121, without interrupting the propagation of the layer gradients.\n\nWe further show that Range BN is comparable to the traditional batch norm in terms of accuracy and convergence rate. This makes it a viable alternative for low precision training. During the forward-propagation phase, the square and square root operations are avoided and replaced by max(\u00b7) and min(\u00b7) operations. 
During the back-propagation phase, the derivative of max(\u00b7) or min(\u00b7) is set to one at the coordinates where the maximal or minimal values are attained, and is set to zero otherwise.\n\nFinally, we combine the two novelties into a single training scheme and demonstrate, for the first time, that 8-bit training on a large scale dataset does not harm accuracy. Our quantization approach has major performance benefits in terms of speed, memory, and energy. By replacing float32 with int8, multiplications become 16 times faster and at least 15 times more energy efficient [10]. This impact is attained for 2/3 of all the multiplications, namely the forward pass and the calculation of the layer gradients gl. The weight gradients gW are computed as a product of an 8-bit precision tensor (the layer input) with a 16-bit precision tensor (the unquantized version of gl), resulting in an 8x speedup and at least 2x power savings for the rest of the multiplications. Although previous works considered even lower precision quantization (down to 1-bit), we claim that 8-bit quantization may prove to be of greater interest. Furthermore, 8-bit matrix multiplication is available as an off-the-shelf operation in existing hardware and can be easily adopted and used with our methods.\n\n\fAcknowledgments\n\nThis research was supported by the Israel Science Foundation (grant No. 31/1031), and by the Taub Foundation. A Titan Xp used for this research was donated by the NVIDIA Corporation. The authors are pleased to acknowledge that the work reported in this paper was substantially performed at Intel - Artificial Intelligence Products Group (AIPG).\n\nReferences\n\n[1] Benoit, J., Pete, W., Miao, W., et al. gemmlowp: a small self-contained low-precision gemm library, 2017. https://github.com/google/gemmlowp.\n\n[2] Biau, G. and Mason, D. M. High-dimensional p-norms. In Mathematical Statistics and Limit Theorems, pp. 21\u201340. 
Springer, 2015.\n\n[3] Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. The convex geometry of linear inverse\n\nproblems. Foundations of Computational mathematics, 12(6):805\u2013849, 2012.\n\n[4] Chen, W., Wilson, J., Tyree, S., Weinberger, K., and Chen, Y. Compressing neural networks with the\n\nhashing trick. In International Conference on Machine Learning, pp. 2285\u20132294, 2015.\n\n[5] Das, D., Mellempudi, N., Mudigere, D., et al. Mixed precision training of convolutional neural networks\n\nusing integer operations. 2018.\n\n[6] Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. Deep learning with limited numerical\nprecision. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp.\n1737\u20131746, 2015.\n\n[7] Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning,\n\ntrained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.\n\n[8] Hoffer, E., Banner, R., Golan, I., and Soudry, D. Norm matters: ef\ufb01cient and accurate normalization\n\nschemes in deep networks. arXiv preprint arXiv:1803.01814, 2018.\n\n[9] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks. In\n\nAdvances in Neural Information Processing Systems 29 (NIPS\u201916), 2016.\n\n[10] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Quantized neural networks: Training\n\nneural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.\n\n[11] Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal\n\ncovariate shift. arXiv preprint arXiv:1502.03167, 2015.\n\n[12] Jaderberg, M., Vedaldi, A., and Zisserman, A. Speeding up convolutional neural networks with low rank\n\nexpansions. arXiv preprint arXiv:1405.3866, 2014.\n\n[13] Kamath, G. Bounds on the expectation of the maximum of samples from a gaussian. 
URL http://www.gautamkamath.com/writings/gaussian max.pdf, 2015.\n\n[14] Lin, X., Zhao, C., and Pan, W. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pp. 344\u2013352, 2017.\n\n[15] Mishra, A., Nurvitadhi, E., Cook, J. J., and Marr, D. WRPN: Wide reduced-precision networks. arXiv preprint arXiv:1709.01134, 2017.\n\n[16] Miyashita, D., Lee, E. H., and Murmann, B. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025, 2016.\n\n[17] Rodriguez, A., Segal, E., Meiri, E., et al. Lower numerical precision deep learning inference and training. https://software.intel.com/en-us/articles/lower-numerical-precision-deep-learning-inference-and-training, 2018.\n\n[18] Soudry, D., Hubara, I., and Meir, R. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In Advances in Neural Information Processing Systems, pp. 963\u2013971, 2014.\n\n[19] Ullrich, K., Meeds, E., and Welling, M. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008, 2017.\n\n\f[20] Wen, W., Xu, C., Yan, F., et al. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pp. 1508\u20131518, 2017.\n\n[21] Wu, S., Li, G., Chen, F., and Shi, L. Training and inference with integers in deep neural networks. International Conference on Learning Representations (ICLR), 2018.\n\n[22] Wu, S., Li, G., Deng, L., et al. L1-norm batch normalization for efficient training of deep neural networks. arXiv preprint arXiv:1802.09769, 2018.\n\n[23] Wu, Y., Schuster, M., Chen, Z., et al. Google\u2019s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.\n\n[24] Zhou, S., Ni, Z., Zhou, X., et al. 
DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.\n", "award": [], "sourceid": 2468, "authors": [{"given_name": "Ron", "family_name": "Banner", "institution": "Intel - Artificial Intelligence Products Group (AIPG)"}, {"given_name": "Itay", "family_name": "Hubara", "institution": "Technion"}, {"given_name": "Elad", "family_name": "Hoffer", "institution": "Technion"}, {"given_name": "Daniel", "family_name": "Soudry", "institution": "Technion"}]}