{"title": "HitNet: Hybrid Ternary Recurrent Neural Network", "book": "Advances in Neural Information Processing Systems", "page_first": 604, "page_last": 614, "abstract": "Quantization is a promising technique to reduce the model size, memory footprint, and massive computation operations of recurrent neural networks (RNNs) for embedded devices with limited resources. Although extreme low-bit quantization has achieved impressive success on convolutional neural networks, it still suffers from huge accuracy degradation on RNNs with the same low-bit precision. In this paper, we first investigate the accuracy degradation on RNN models under different quantization schemes, and the distribution of tensor values in the full precision model. Our observation reveals that due to the difference between the distributions of weights and activations, different quantization methods are suitable for different parts of models. Based on our observation, we propose HitNet, a hybrid ternary recurrent neural network, which bridges the accuracy gap between the full precision model and the quantized model. In HitNet, we develop a hybrid quantization method to quantize weights and activations. Moreover, we introduce a sloping factor motivated by prior work on Boltzmann machine to activation functions, further closing the accuracy gap between the full precision model and the quantized model. Overall, our HitNet can quantize RNN models into ternary values, {-1, 0, 1}, outperforming the state-of-the-art quantization methods on RNN models significantly. We test it on typical RNN models, such as Long-Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), on which the results outperform previous work significantly. 
For example, we improve the perplexity per word (PPW) of a ternary LSTM on the Penn Tree Bank (PTB) corpus from 126 (the state-of-the-art result to the best of our knowledge) to 110.3, against a full precision baseline of 97.2, and a ternary GRU from 142 to 113.5, against a full precision baseline of 102.7.", "full_text": "HitNet: Hybrid Ternary Recurrent Neural Network

Peiqi Wang1,2,4, Xinfeng Xie4, Lei Deng4, Guoqi Li3, Dongsheng Wang1,2*, Yuan Xie4

1Department of Computer Science and Technology, Tsinghua University

2Beijing National Research Center for Information Science and Technology

3Department of Precision Instrument, Tsinghua University

4Department of Electrical and Computer Engineering, University of California, Santa Barbara

wpq14@mails.tsinghua.edu.cn, {liguoqi, wds}@mail.tsinghua.edu.cn

{xinfeng, leideng, yuanxie}@ucsb.edu

Abstract

Quantization is a promising technique to reduce the model size, memory footprint, and computational cost of neural networks for deployment on embedded devices with limited resources. Although quantization has achieved impressive success in convolutional neural networks (CNNs), it still suffers from large accuracy degradation on recurrent neural networks (RNNs), especially in the extremely low-bit cases. In this paper, we first investigate the accuracy degradation of RNNs under different quantization schemes and visualize the distribution of tensor values in the full precision models. Our observation reveals that, due to the different distributions of weights and activations, different quantization methods should be used for each part. Accordingly, we propose HitNet, a hybrid ternary RNN, which bridges the accuracy gap between the full precision model and the quantized model with ternary weights and activations. In HitNet, we develop a hybrid quantization method to quantize weights and activations. 
Moreover, we introduce a sloping factor into the activation functions to address the error-sensitivity problem, further closing the mentioned accuracy gap. We test our method on typical RNN models, such as Long-Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). Overall, HitNet can quantize RNN models into ternary values of {-1, 0, 1} and significantly outperform the state-of-the-art methods for extremely quantized RNNs. Specifically, we improve the perplexity per word (PPW) of a ternary LSTM on the Penn Tree Bank (PTB) corpus from 126 to 110.3 and a ternary GRU from 142 to 113.5.

1 Introduction

Recurrent Neural Networks (RNNs) yield great results across many natural language processing applications, including speech recognition, machine translation, language modeling, and question answering [1, 2, 3, 4, 5]. The emergence of various RNN architectures, such as Long-Short-Term Memory (LSTM) [6] and Gated Recurrent Units (GRU) [7], has achieved state-of-the-art accuracy in these applications. To further improve model accuracy, researchers often build deeper internal recurrence by increasing the recurrent sequence length, the number of hidden units, or the number of layers. However, these methods usually produce large models that are hard to deploy in practice, especially on embedded devices with limited computation and memory resources. Model compression is a promising way to alleviate these problems. There exist several techniques to reduce the amount of model weights, such as low-rank approximation [8], sparsity utilization [9, 10, 11], and

*Corresponding Author

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Theoretical memory saving for an LSTM model through quantization. W_t denotes ternary weights, A_t denotes ternary activations, and A_FP denotes full precision activations.

data quantization [12, 13]. 
Among these compression techniques, quantizing data into lower bitwidths is particularly promising.

Previous studies that attempt to quantize RNNs to extremely low-bit precision usually focus on weights while retaining activations in full precision [14, 15, 16]. Activations include all dynamic tensors during processing, whereas weights are learned and fixed after training. As shown in Figure 1, quantizing only the weights into ternary values {-1, 0, 1} saves merely 1.4× memory in the training phase, while quantizing both weights and activations can achieve up to 16× memory saving. Ternary values can be further compressed by using only 1 bit to represent -1 and +1 without storing the zero values on some specialized devices [17], reducing memory requirements by up to 32×. For the inference phase, although weights occupy most of the memory, quantizing activations further simplifies the floating-point computations. Under the ternary scenario, the costly full precision matrix arithmetic can be transformed into simple logical operations, which are friendly to efficient implementations on customized hardware [18].

Quantization has achieved impressive success and satisfied the accuracy requirement on convolutional neural networks (CNNs) even with only ternary weights and activations [19, 20, 21, 22, 23], while the results on RNNs are still unsatisfactory. The huge accuracy gap between the quantized models and the original models prevents RNNs from using extremely low-bit precision. This motivates us to investigate the reason for the huge accuracy loss of aggressive quantization on RNNs. There are several differences between RNNs and CNNs, which may explain why state-of-the-art quantization methods succeed on CNNs while similar methods do not work well on RNNs. One major reason is the recurrent structure in RNN models. 
Activations in RNNs are iteratively updated along the temporal dimension through the same network structure and weights. This structure makes quantized RNNs prone to error accumulation. In addition, the strongly nonlinear activation functions in RNNs (e.g., sigmoid or tanh) are more sensitive to small errors than the piecewise linear functions in CNNs (e.g., ReLU). Although previous studies [14, 24, 25] tried to bridge this gap in various ways, the accuracy of quantized RNNs, especially those using extremely low-bit precision, is far from satisfactory.

In this paper, we analyze the value distributions of weights and activations to reveal the reason behind the accuracy degradation. Based on our observation, we propose a hybrid ternary quantization method called HitNet to significantly bridge the accuracy gap. In HitNet, we take full advantage of various quantization methods. Specifically, we use the threshold ternary quantization method for weight quantization and the Bernoulli ternary quantization method for activation quantization. 
Moreover, to further decrease the accuracy gap between the quantized models and the original models, we introduce a sloping factor into the activation functions to address the error-sensitivity problem.

Our contributions can be summarized as follows:

• To the best of our knowledge, our work is the first to present a comprehensive analysis regarding the impact of different quantization methods on the accuracy of RNNs.

• Motivated by the above analysis, we propose a hybrid ternary quantization method, termed HitNet, which adopts different quantization methods to quantize weights and activations according to their statistical characteristics.

• To further narrow the accuracy gap, we introduce a sloping factor into the activation functions. HitNet achieves significant accuracy improvement compared with previous work; e.g., the perplexity per word (PPW) of a ternary LSTM on the Penn Tree Bank (PTB) corpus decreases from 126 to 110.3 and that of a GRU from 142 to 113.5.

2 Related Work

Model compression is widely used to reduce the size of RNN models with redundancy, e.g., low-rank approximation [8], sparsity utilization [9, 10, 11], and data quantization [12, 13]. All of these approaches are independent, and we focus on data quantization in this work. The biggest challenge of quantization is to maintain application accuracy. Some methods [24, 26, 19] propose various transformation functions before uniform quantization or design different threshold decision approaches, trying to balance the distribution of quantized data and utilize the limited quantized states efficiently. Zhou et al. [26] recursively partitioned the parameters by percentiles into balanced bins before quantization to balance the data distribution. He et al. 
[24] introduced a parameter-dependent adaptive threshold to take full advantage of the available parameter space. A few prior studies [14, 26] introduce preconditioned coefficients or employ various transformation functions before and after quantization to decrease the accuracy degradation. Others formulated quantization as an optimization problem and then used different approaches to search for optimal quantization coefficients, such as greedy approximation [27] and alternating multi-bit quantization [25]. A few studies [16, 28] added extra components to reorganize the original RNN architecture and fine-tune the quantized models, but they introduce extra computation, negatively impacting the performance of both inference and training. Most of these studies still present a big accuracy gap when using extremely low-bit precision. Our HitNet addresses this problem by quantizing both the weights and activations of RNNs into ternary states with impressive accuracy improvement over previous work.

3 Analysis of Quantized RNNs

In this section, we provide a comprehensive study of the accuracy loss of quantized RNNs under different state-of-the-art quantization methods. We take LSTM as a case study to understand how these quantization methods work. The basic unit is the LSTM cell, which can be described as

i_t, f_t, g_t, o_t = σ(W_x x_t + b_x + W_h h_{t-1} + b_h)
c_t = f_t * c_{t-1} + i_t * g_t
h_t = o_t * σ(c_t)    (1)

where σ denotes the activation function. The computation contains: input gate i_t, forget gate f_t, candidate cell state g_t, output gate o_t, cell state c_t, and hidden state h_t. The activation function is σ(x) = sigmoid(x) = 1/(1+e^{-x}) for i_t, f_t, and o_t, and σ(x) = tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}) for g_t and h_t. We refer to the first four items as gate computations in the following sections due to their similar formats. 
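For reference, the cell update of Eq. (1) can be sketched in NumPy as follows. This is an illustrative sketch, not the paper's implementation; the stacked layout of W_x and W_h (four H-row blocks in i, f, g, o order) is our assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W_x, b_x, W_h, b_h):
    """One LSTM step following Eq. (1). W_x has shape (4H, D) and
    W_h has shape (4H, H), stacking the i/f/g/o blocks row-wise."""
    H = h_prev.shape[0]
    gates = W_x @ x_t + b_x + W_h @ h_prev + b_h  # shape (4H,)
    i = sigmoid(gates[0:H])          # input gate
    f = sigmoid(gates[H:2 * H])      # forget gate
    g = np.tanh(gates[2 * H:3 * H])  # candidate cell state
    o = sigmoid(gates[3 * H:4 * H])  # output gate
    c = f * c_prev + i * g           # cell state update
    h = o * np.tanh(c)               # hidden state
    return h, c
```

The four gate pre-activations share one matrix product per input, which is why the paper groups them as "gate computations".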
Weights in gate computations can be separated into two parts: one related to the current input (i.e., W_x x_t + b_x) and the other related to the previous state (i.e., W_h h_{t-1} + b_h). Here W_x (W_h) merges W_xi (W_hi), W_xf (W_hf), W_xg (W_hg), and W_xo (W_ho) for clarity. Activations represent the intermediate data propagated in the forward pass.

Existing quantization methods can be classified into three categories: uniform quantization [26, 25], deterministic quantization (threshold-based quantization) [19, 21], and stochastic quantization [12, 13]. We use Q(X) to represent the quantization method, where X denotes the input tensor. Hence, the computation of its corresponding gradient in the back propagation is governed by

∂E/∂X = (∂E/∂Q)(∂Q/∂X)    (2)

where E denotes the loss function. For the non-differentiable part of Q(X), we adopt the straight-through estimator (STE) method [29]. An STE can be regarded as an operator with an arbitrary forward operation but a unit derivative in the backward pass. In the rest of this section, we conduct a detailed analysis of the different quantization methods. All evaluations in this section adopt an LSTM model with one hidden layer of 300 units. The sequence length is set to 35, and the model is applied to the Penn Tree Bank (PTB) corpus [30]. The accuracy is measured in perplexity per word (PPW), and a lower PPW means better accuracy.

Table 1: The testing perplexity per word (PPW) of quantized LSTMs on PTB using different quantization methods. W_2bit represents 2-bit weights, W_FP represents full precision weights, and W_t represents ternary weights. The symbol A for activations has the analogous meaning. 
UQ denotes Uniform Quantization, TTQ denotes Threshold Ternary Quantization, and BTQ denotes Bernoulli Ternary Quantization.

(a) UQ
         W_2bit   W_FP
A_2bit   638.9    108.9
A_FP     103.3    97.2

(b) TTQ
         W_t      W_FP
A_t      125.5    108.7
A_FP     98.6     97.2

(c) BTQ
         W_t      W_FP
A_t      189.5    108.3
A_FP     110.9    97.2

3.1 Uniform Quantization

The basic and straightforward way to quantize a full precision value x into a k-bit fixed-point one is uniform quantization (UQ) [31], which can be implemented as

q_k(x) = round[(2^k - 1)x] / (2^k - 1).    (3)

This quantization essentially reduces the representation precision rather than changing the dynamic range, thus x needs to be mapped into [0,1] before quantization and then recovered back to the original range. Thus, the practical UQ is usually modified to

UQ(X) = 2 max(|X|) [ q_k( X / (2 max(|X|)) + 1/2 ) - 1/2 ].    (4)

To observe how UQ affects the accuracy of RNNs, we evaluate a 2-bit quantized LSTM on PTB; the results are shown in Table 1(a). We find that the 2-bit representation is relatively acceptable when quantizing either the weights or the activations alone with UQ. However, the accuracy degrades dramatically when we apply this quantization to both weights and activations.

3.2 Threshold Ternary Quantization

Deterministic quantization assigns the quantized states deterministically according to an estimated threshold, and threshold ternary quantization (TTQ) [21] is such a method for converting full precision values to ternary ones. To retain as much information as possible, an estimated threshold θ and a non-negative scaling factor α are often used to minimize the quantization error. The TTQ method is implemented
The TTQ method is implemented\nas\n\n(cid:40)+1\n\nf (x) =\n\n0\n\u22121\n\nx > \u03b8\n|x| \u2264 \u03b8\nx < \u2212\u03b8\n\nT T Q(X) = \u03b1f (X).\n\n(6)\nAssume that the data in RNNs obey a normal distribution X \u223c N (0, \u03c32). We de\ufb01ne a set \u0398 =\n{i||xi| > \u03b8}, and I\u0398 be the indicator function of the set \u0398. Therefore, approximately solving an\noptimization problem of minimizing L2 distance between the original full-precision values and\nthe quantized ternary values, we can get an sub-optimal threshold \u03b8 and the scaling factor \u03b1 [21]\ncalculated by\n\n\u03b8 \u2248 2\n3\n\u03b1 \u2248 1\n|\u0398|\n\nE(|X|)\n\n(cid:88)|x|I\u0398(x).\n\n(7)\n\nWe quantize both weights and activations in RNNs to ternary values using TTQ and the evaluation\nresults are shown in Table 1(b). Quantizing weights has small accuracy loss compared to the original\nmodel while quantizing activations suffers from larger accuracy loss. Quantizing both of them also\nshows large accuracy degradation although it is much better than the result in the UQ case.\n\n4\n\n\f(a) Wxo with FP\n\n(b) Who with FP\n\n(c) ot with FP\n\n(d) Wxo with 2-bit UQ\n\n(e) Who with 2-bit UQ\n\n(f) ot with 2-bit UQ\n\n(g) Wxo with TTQ\n\n(h) Who with TTQ\n\n(i) ot with TTQ\n\n(j) Wxo with BTQ\n\n(k) Who with BTQ\n\n(l) ot with BTQ\n\nFigure 2: The distribution of weights (i.e. Wxo, Who) and activations of output gate (i.e. ot) in LSTM\nunder different quantization methods. (a)-(c) show the distributions in full precision (FP), (d)-(f) use\nthe 2-bit uniform quantization (UQ), (g)-(i) use the threshold ternary quantization (TTQ), (j)-(l) use\nthe Bernoulli ternary quantization (BTQ). The model used here is an LSTM trained on Penn Tree\nBank (PTB) corpus.\n\n3.3 Bernoulli Ternary Quantization\n\nStochastic quantization performs well in CNNs [13, 23]. Speci\ufb01cally, for the ternary case, the\nquantized value is sampled from a Bernoulli distribution. 
Thus, Bernoulli ternary quantization (BTQ) is formulated as

f(x) = +1 with probability p if x > 0;  0 with probability 1 - p;  -1 with probability p if x < 0,  where p ~ Bernoulli(|x|)    (8)

BTQ(X) = f(X).    (9)

We also present the quantization test for BTQ, as shown in Table 1(c). For weights outside the range [-1,1], we apply a tanh function for magnitude scaling before quantization [19]. We can see that the activation quantization obtains slightly better accuracy than quantizing the weights. Quantizing both of them still suffers from an unacceptable accuracy loss, although it is also better than the UQ result.

3.4 Characteristic Analysis

To understand the accuracy gap between the original full precision models and the quantized low-bit models under different quantization schemes, we first visualize the distributions of weights (i.e., W_xo and W_ho) and activations (i.e., o_t) in Figure 2. Here we take the output gate as an example for analysis. The distributions of full-precision data are shown in Figure 2(a)-2(c), and the remaining figures represent models using the various aforementioned quantization methods. 
These distributions are consistent with our prior knowledge of the weights. The distribution of weights in RNNs follows a normal distribution, although the range of W_ho is slightly different from that of W_xo, as shown in Figure 2(a) and Figure 2(b). However, the distribution of activations is very different from that of weights. The activation range is limited to [0,1] by the activation function. More importantly, most of the values are located near the two poles instead of the middle of this range.

The distributions of the quantized data under the 2-bit UQ scheme are shown in Figure 2(d)-2(f). This quantization method does not fully utilize the representation ability of 2 bits with 4 states: most values in the tensor concentrate on 2 states due to the unbalanced data distribution (the other 2 states are too small to see in the figures). For the quantized activations shown in Figure 2(f), only 2 states are valid due to the activation functions. This greatly degrades the model's expressive power. Some previous studies [26, 32] try to balance the distribution of quantized values to sufficiently utilize these 4 states for better representation. 
Usually, they introduce new transformation functions to force the data into a balanced distribution before mapping them to 2-bit values. However, such methods change the original data distribution to some extent, resulting in an unavoidable accuracy gap between the quantized model and the original model.

Figure 2(g)-2(i) show the distributions of quantized data under the TTQ scheme, which provide a good explanation for the comparable accuracy obtained when quantizing only the weights. First, the distribution of quantized weights is similar to the original normal distribution. Second, all 3 states are fully utilized in the weight quantization. However, for the activation quantization, TTQ does not present the good bipolar distribution of the original FP data. Therefore, although TTQ achieves better accuracy than UQ, it still degrades the accuracy noticeably when quantizing both weights and activations. Note that there are only 2 states in the quantized activations; this is because we take o_t, which has only positive values, as the example. The same holds for the following BTQ scheme.

Figure 2(j)-2(l) present the distributions under the BTQ scheme. Different from TTQ, activations under BTQ present a bipolar distribution similar to the original FP data. Through probabilistic sampling, BTQ gives a better approximation of the middle data in the stochastic sense. However, although the range of the weight values is preserved, the 3 ternary states are not balanced due to the lack of global statistical information (e.g., θ and α in TTQ). Therefore, the BTQ scheme still suffers from a similar accuracy loss when quantizing both the weights and activations.

In summary, these distributions under different quantization schemes provide insightful observations:

• TTQ is preferable for weights. 
Among the three quantization schemes, TTQ utilizes the 3 states efficiently and presents a balanced distribution.

• BTQ better approximates the bipolar distribution of the activations, and its probabilistic sampling preserves the middle data in the stochastic sense.

Although both still suffer from obvious accuracy degradation, the above comprehensive analysis motivates us to propose a hybrid ternary quantization method that takes full advantage of the different quantization schemes.

4 HitNet

Based on the above comprehensive comparisons, we propose HitNet in this section. We first introduce a constant sloping factor into the activation functions to guide activations toward the two poles, and then apply a hybrid ternary quantization method. Specifically, we use TTQ to quantize weights and BTQ to quantize activations to fully leverage the advantages of both. The hybrid quantization is governed by

i_t, f_t, g_t, o_t = σ(TTQ(W_x) x_t + TTQ(b_x) + TTQ(W_h) h_{t-1} + TTQ(b_h))
c_t = f_t * c_{t-1} + i_t * g_t
h_t = BTQ(o_t * σ(c_t))    (10)

We quantize the weights and hidden states since they are the only parts involved in costly matrix operations; the remaining lightweight vector operations can be kept in high precision. In the training phase, the weights can be quantized right after each gradient step, while the activations need to be quantized every time the activation computation occurs. 
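For illustration, the two quantizers combined in Eq. (10) can be sketched in NumPy as follows. This is a minimal sketch, not the authors' implementation: the straight-through estimator used for the backward pass is omitted, and the function names `ttq` and `btq` are ours.

```python
import numpy as np

def ttq(X):
    """Threshold ternary quantization: threshold at theta = (2/3) E|X|,
    then rescale surviving entries by alpha, the mean magnitude over them."""
    theta = (2.0 / 3.0) * np.mean(np.abs(X))
    mask = np.abs(X) > theta
    alpha = np.mean(np.abs(X[mask])) if mask.any() else 0.0
    return alpha * np.sign(X) * mask

def btq(X, rng):
    """Bernoulli ternary quantization: emit sign(x) with probability |x|
    (inputs assumed pre-scaled into [-1, 1]), and 0 otherwise."""
    p = np.clip(np.abs(X), 0.0, 1.0)
    keep = rng.random(X.shape) < p
    return np.sign(X) * keep
```

In training, `ttq` would be reapplied to the weights after every gradient step, while `btq` runs on the hidden state at every time step, matching the description above.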
The threshold θ and the scaling factor α in TTQ are updated according to the actual value distribution of the data to be quantized.

Figure 3: Activation functions under different λ: (a) the sigmoid(·) function and (b) the tanh(·) function.

Figure 4: The distribution of the output gate in LSTM with different sloping factors: (a) λ = 0.1, (b) λ = 0.4, (c) λ = 0.7, (d) λ = 1.

Because the activations present not only a bipolar distribution but also many middle values, none of the prior methods can avoid significant accuracy loss under extremely low-bit quantization. The middle data between the two poles also play a non-negligible role in the model accuracy. If we can control the bipolarity of the activation distribution, we have an opportunity to reduce the quantization error. Fortunately, prior work [33] proposed a temperature factor for the sigmoid(·) function in deep belief networks to control the activation distribution for better accuracy. It is also applicable to RNNs due to the similar form of the activation functions. Therefore, we introduce a constant sloping factor into the activation functions to guide the activation distribution to concentrate toward the two poles as much as possible. Furthermore, based on the previous analysis, we propose a hybrid ternary quantization method that uses TTQ for weight quantization and BTQ for activation quantization.

4.1 Hybrid Ternary Quantization

We introduce a constant sloping factor λ into the RNN cells to control the activation distribution:

\hat{sigmoid}(x) = 1 / (1 + e^{-x/λ}),    \hat{tanh}(x) = (e^{x/λ} - e^{-x/λ}) / (e^{x/λ} + e^{-x/λ}).    (11)

Figure 3 shows the variation tendency of the activation functions under different λ. 
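The sloped activations of Eq. (11) amount to rescaling the pre-activation by 1/λ. A minimal sketch (the function names are ours):

```python
import numpy as np

def sigmoid_sloped(x, lam):
    """Sigmoid with sloping factor lam (Eq. 11): a smaller lam gives a
    steeper curve, pushing outputs toward the poles 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x / lam))

def tanh_sloped(x, lam):
    """Tanh with sloping factor lam (Eq. 11), equivalent to tanh(x / lam)."""
    return np.tanh(x / lam)
```

At λ = 1 both reduce to the standard sigmoid and tanh, so the sloping factor only changes the steepness around zero, not the output range.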
As prior work [33] pointed out, a properly small λ (λ < 1) can scale the propagated gradients and thus alleviate the gradient vanishing problem to some extent, improving accuracy; whereas a too-small λ leads to an extremely bipolar activation distribution that harms the model's expressive power. A smaller λ also produces a more bipolar activation distribution in RNNs, as shown in Figure 4. This reduces the amount of middle data and consequently reduces the ternarization error of BTQ. Therefore, it is possible to obtain better accuracy on the quantized models if we carefully configure a properly small λ. We apply the modified activation functions during both the training and inference phases and verify this prediction through the experimental results in Table 2. We use the same LSTM model as in Section 3 for comparing the existing quantization schemes. The table reports the accuracy under different λ for the full-precision model, the quantized model, and the gap between them. We make the following observations:

• A properly small λ (0.3-0.6) can slightly improve the accuracy of the full-precision model, while a too-small λ (<0.2) obviously degrades the model functionality.

• A smaller λ generates a more bipolar activation distribution that reduces the ternarization error. It usually brings better accuracy on the quantized model, except when λ is so small that the full-precision baseline itself becomes much worse. Jointly considering these two points, we configure
Jointly considering these two points, we con\ufb01gure\n\u03bb=0.3-0.4 in this network.\n\n7\n\n8642024680.00.20.40.60.81.0Sigmoid=0.1=0.3=0.5=0.7151050510151.000.750.500.250.000.250.500.751.00Tanh=0.1=0.3=0.5=0.70.00.20.40.60.81.0output gate0.00.51.01.52.02.5Frequency (x103)0.00.20.40.60.81.0output gate0.00.20.50.81.01.21.51.82.0Frequency (x103)0.00.20.40.60.81.0output gate0.00.20.40.60.81.0Frequency (x103)0.00.20.40.60.81.0output gate0.00.20.40.60.81.0Frequency (x103)\fTable 2: The testing PPW of LSTM models on PTB with different sloping factor \u03bb.\n\n\u03bb\n\nFull precision\n\nHitNet\n\nPPW degradation\n\n0.1\n123.5\n136.4\n-12.9\n\n0.2\n100.1\n115.7\n-15.6\n\n0.3\n96.4\n110.6\n-14.2\n\n0.4\n96.3\n110.3\n-14\n\n0.5\n95.8\n112.8\n-17\n\n0.6\n96.7\n114.6\n-17.9\n\n0.7\n97.1\n115.6\n-18.5\n\n0.8\n97.1\n116.3\n-19.2\n\n0.9\n97.1\n119.9\n-22.8\n\n1\n\n97.2\n120.1\n-22.9\n\nTable 3: The testing PPW of HitNet on the PTB dataset.\n\nModels Kapur et al.[28] He et al.[24]\nLSTM\nGRU\n\n152.2\nN/A\n\n152\n150\n\nZhou et al.[26] HitNet(ours)\n\nFP(baseline)\n\n126\n142\n\n110.3\n113.5\n\n97.2\n102.7\n\n\u2022 Actually, our hybrid ternary quantization method using TTQ for weights and BTQ for\nactivations can improve the accuracy of the quantized model even without introducing the\nsloping factor (i.e. \u03bb=1) compared to previous schemes. This can be seen if we combine\nTable 2 and Table 1.\n\nIn conclusion, our HitNet tries to \ufb01nd the best quantization strategies for different components in\nRNN models bene\ufb01t from the hybrid quantization paradigm. Besides, the introduced sloping factor\ncan guide the activation distribution more bipolar without affecting the accuracy of the full-precision\nmodel (actually slight improvement) and bridges the accuracy gap between the full-precision model\nand the quantized model. For the rest of the conducted experiments, we adopt the sloping factor \u03bb\nwith the best accuracy (i.e. 
λ=0.4).

4.2 Overall Results

In this section, we evaluate the effectiveness of the proposed HitNet on language modeling with two typical RNN models (LSTM [6] and GRU [7]). The task is to predict the next word, so we use perplexity per word (PPW) to measure accuracy; a lower PPW means better accuracy. For all RNN models in our experiments, a word embedding layer is used on the input side. We initialize the learning rate to 20 and decrease it by a factor of 4 at any epoch where the validation error exceeds the current best record. The sequence length is set to 35, and the gradient norm is clipped to the range [-0.25, 0.25]. In addition, we set the maximum number of epochs to 40 and the dropout rate to 0.2. Our baseline does not match the state-of-the-art result, which is beyond the scope of this paper; it could be further improved by optimizing the hyper-parameter configuration.

We first use the Penn Tree Bank (PTB) corpus [30], which has a 10K vocabulary. We conduct experiments on both LSTM and GRU and provide comparisons to prior work. For a fair comparison, we use a model with one hidden layer of 300 hidden units, the same as previous counterparts. We use a batch size of 20 for training, and the results are shown in Table 3. We take the accuracy of the 2-bit quantized models from previous work for comparison. Prior work [28] does not report a result for a 2-bit quantized GRU on PTB, so we leave that entry blank. Although there is still an accuracy gap between HitNet and the original full-precision model, HitNet outperforms existing quantization methods significantly. Furthermore, the 2-bit quantization in previous work actually has 4 states, while HitNet has only 3. Compared to these existing studies, HitNet achieves much better accuracy even while using one less state for quantizing both the weights and activations.

We also apply our model to other larger datasets. 
Wikidata-2 is a larger corpus with a 33K vocabulary; it is roughly 2x larger in dataset size and 3x larger in vocabulary than PTB. We train both LSTM and GRU with one 512-unit hidden layer and set the batch size to 50. Text8 is another large corpus, composed of a pre-processed version of the first 100 million characters of a Wikipedia dump. We follow the same setting as [34], and the resulting vocabulary size is about 44K. On this dataset, we train the models with one 1024-unit hidden layer and set the batch size to 50. The results are shown in Table 4. HitNet is still able to achieve acceptable results on these larger datasets. We did not compare against previous work on these datasets because few prior studies focused on extremely low-bit quantization there.

Table 4: The testing PPW of HitNet on Wiki-2 and Text8 datasets.

Dataset     LSTM Full Precision  LSTM HitNet  GRU Full Precision  GRU HitNet
Wikidata-2  114.37               132.49       124.5               126.72
Text8       151.4                172.6        158.2               169.1

Compared to prior work, HitNet demonstrates several advantages:

• HitNet adopts hybrid quantization methods for different data types to achieve better accuracy, providing a good match between the data distribution and the quantization scheme.

• HitNet introduces a sloping factor that can control the activation distribution to reduce the ternarization error.

• HitNet uses only 3 data states (less than 2 bits), which yields significant memory reduction via data compression and speedup via transforming full-precision matrix arithmetic into simple logic operations. Moreover, the 0 values in HitNet require no computation and can
Actually, the 0 values in HitNet require no computation and can be skipped for further efficiency optimization on specialized platforms [9, 35, 36].

5 Conclusion and Future Work

We introduce HitNet, a hybrid ternary quantization method for RNNs that significantly narrows the accuracy gap between the full-precision model and the quantized model, outperforming existing quantization methods. We comprehensively analyze why previous extremely low-bit quantization of RNNs fails: the distributions of weights and activations are completely different, yet prior work treats them identically, leading to huge accuracy degradation in low-bit quantization. In HitNet, we quantize RNN models into the ternary values {-1, 0, 1} by using TTQ (threshold ternary quantization) for the weights and BTQ (Bernoulli ternary quantization) for the activations. In addition, a sloping factor is introduced into the activation functions, which pushes the activation distribution toward a more bipolar shape and further reduces the ternarization error.
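As a rough illustration of the two quantizers named above, the sketch below ternarizes weights with a magnitude threshold and activations with unbiased Bernoulli sampling. The specific threshold rule (`frac * max|W|`) and the sampling form are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def ttq(w, frac=0.05):
    """Threshold ternary quantization (sketch): weights whose magnitude
    exceeds a threshold map to +/-1, the rest to 0. The threshold as a
    fraction of max|W| is an illustrative choice."""
    delta = frac * np.abs(w).max()
    return np.sign(w) * (np.abs(w) > delta)

def btq(a):
    """Bernoulli ternary quantization (sketch) for activations in [-1, 1]:
    quantize a to sign(a) with probability |a|, otherwise to 0, so the
    result is unbiased in expectation: E[q] = a."""
    a = np.clip(a, -1.0, 1.0)
    keep = rng.random(a.shape) < np.abs(a)
    return np.sign(a) * keep
```

Under this sketch, an activation of 0.5 quantizes to +1 about half the time and to 0 otherwise, so its sample mean approaches 0.5, while every quantized tensor contains only the three states {-1, 0, 1}.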
Our experiments on both LSTM and GRU models over the PTB, Wikidata-2, and Text8 datasets demonstrate that HitNet achieves significantly better accuracy than previous work, while delivering >16x memory savings through only 3 data states and extreme computational simplification through logic-only operations.

In future work, we aim to deploy HitNet on specialized hardware to obtain practical acceleration and efficiency improvements, since many characteristics of HitNet can be exploited for such benefits. For example, the full-precision data can be compressed to 2 bits or less, yielding a >16x memory saving; all matrix arithmetic is converted into logic operations, which can accelerate execution and reduce energy consumption; and all zero values need not be stored or computed, enabling further optimization.

Acknowledgement

The authors were supported by the Beijing National Research Center for Information Science and Technology, the Beijing Innovation Center for Future Chip, and the Scalable Energy-efficient Architecture Lab (SEAL) at UCSB. This research was supported in part by the National Key Research and Development Plan of China under Grant No. 2016YFB1000303, the National Science Foundation (NSF) under Grants No. 1725447 and 1730309, and the National Natural Science Foundation of China under Grant No. 61876215. This work was also supported in part by a grant from the China Scholarship Council.

References

[1] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6645–6649, IEEE, 2013.

[2] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.

[3] T. Mikolov, M.
Kara\ufb01\u00e1t, L. Burget, J. \u02c7Cernock`y, and S. Khudanpur, \u201cRecurrent neural network based\nlanguage model,\u201d in Eleventh Annual Conference of the International Speech Communication Association,\n2010.\n\n[4] H. Sak, A. Senior, and F. Beaufays, \u201cLong short-term memory recurrent neural network architectures for\nlarge scale acoustic modeling,\u201d in Fifteenth annual conference of the international speech communication\nassociation, 2014.\n\n[5] M. Iyyer, J. Boyd-Graber, L. Claudino, R. Socher, and H. Daum\u00e9 III, \u201cA neural network for factoid question\nanswering over paragraphs,\u201d in Proceedings of the 2014 Conference on Empirical Methods in Natural\nLanguage Processing (EMNLP), pp. 633\u2013644, 2014.\n\n[6] S. Hochreiter and J. Schmidhuber, \u201cLong short-term memory,\u201d Neural computation, vol. 9, no. 8, pp. 1735\u2013\n\n1780, 1997.\n\n[7] K. Cho, B. Van Merri\u00ebnboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio,\n\u201cLearning phrase representations using rnn encoder-decoder for statistical machine translation,\u201d arXiv\npreprint arXiv:1406.1078, 2014.\n\n[8] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, \u201cLow-rank matrix factorization\nfor deep neural network training with high-dimensional output targets,\u201d in Acoustics, Speech and Signal\nProcessing (ICASSP), 2013 IEEE International Conference on, pp. 6655\u20136659, IEEE, 2013.\n\n[9] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, \u201cEie: ef\ufb01cient inference engine\non compressed deep neural network,\u201d in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual\nInternational Symposium on, pp. 243\u2013254, IEEE, 2016.\n\n[10] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. 
Wang, et al., \u201cEse: Ef\ufb01cient\nspeech recognition engine with sparse lstm on fpga,\u201d in Proceedings of the 2017 ACM/SIGDA International\nSymposium on Field-Programmable Gate Arrays, pp. 75\u201384, ACM, 2017.\n\n[11] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, \u201cLearning structured sparsity in deep neural networks,\u201d in\n\nAdvances in Neural Information Processing Systems, pp. 2074\u20132082, 2016.\n\n[12] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, \u201cBinarized neural networks:\nTraining deep neural networks with weights and activations constrained to+ 1 or-1,\u201d arXiv preprint\narXiv:1602.02830, 2016.\n\n[13] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, \u201cQuantized neural networks: Training\n\nneural networks with low precision weights and activations,\u201d arXiv preprint arXiv:1609.07061, 2016.\n\n[14] J. Ott, Z. Lin, Y. Zhang, S.-C. Liu, and Y. Bengio, \u201cRecurrent neural networks with limited numerical\n\nprecision,\u201d arXiv preprint arXiv:1608.06902, 2016.\n\n[15] C. Z. Tang and H. K. Kwan, \u201cMultilayer feedforward neural networks with single powers-of-two weights,\u201d\n\nIEEE Transactions on Signal Processing, vol. 41, no. 8, pp. 2724\u20132727, 1993.\n\n[16] S. Han, H. Mao, and W. J. Dally, \u201cDeep compression: Compressing deep neural networks with pruning,\n\ntrained quantization and huffman coding,\u201d arXiv preprint arXiv:1510.00149, 2015.\n\n[17] Y. van de Burgt, E. Lubberman, E. J. Fuller, S. T. Keene, G. C. Faria, S. Agarwal, M. J. Marinella, A. A.\nTalin, and A. Salleo, \u201cA non-volatile organic electrochemical device as a low-voltage arti\ufb01cial synapse for\nneuromorphic computing,\u201d Nature materials, vol. 16, no. 4, p. 414, 2017.\n\n[18] M. Horowitz, \u201c1.1 computing\u2019s energy problem (and what we can do about it),\u201d in Solid-State Circuits\n\nConference Digest of Technical Papers (ISSCC), 2014 IEEE International, pp. 
10–14, IEEE, 2014.

[19] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.

[20] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in European Conference on Computer Vision, pp. 525–542, Springer, 2016.

[21] F. Li, B. Zhang, and B. Liu, “Ternary weight networks,” arXiv preprint arXiv:1605.04711, 2016.

[22] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” arXiv preprint arXiv:1612.01064, 2016.

[23] L. Deng, P. Jiao, J. Pei, Z. Wu, and G. Li, “Gxnor-net: Training deep neural networks with ternary weights and activations without full-precision memory under a unified discretization framework,” Neural Networks, vol. 100, pp. 49–58, 2018.

[24] Q. He, H. Wen, S. Zhou, Y. Wu, C. Yao, X. Zhou, and Y. Zou, “Effective quantization methods for recurrent neural networks,” arXiv preprint arXiv:1611.10176, 2016.

[25] C. Xu, J. Yao, Z. Lin, W. Ou, Y. Cao, Z. Wang, and H. Zha, “Alternating multi-bit quantization for recurrent neural networks,” arXiv preprint arXiv:1802.00150, 2018.

[26] S.-C. Zhou, Y.-Z. Wang, H. Wen, Q.-Y. He, and Y.-H. Zou, “Balanced quantization: An effective and efficient approach to quantized neural networks,” Journal of Computer Science and Technology, vol. 32, no. 4, pp. 667–682, 2017.

[27] Y. Guo, A. Yao, H. Zhao, and Y. Chen, “Network sketching: Exploiting binary structure in deep cnns,” arXiv preprint arXiv:1706.02021, 2017.

[28] S. Kapur, A. Mishra, and D. Marr, “Low precision rnns: Quantizing rnns without losing accuracy,” arXiv preprint arXiv:1710.07706, 2017.

[29] Y. Bengio, N. Léonard, and A.
Courville, \u201cEstimating or propagating gradients through stochastic neurons\n\nfor conditional computation,\u201d arXiv preprint arXiv:1308.3432, 2013.\n\n[30] A. Taylor, M. Marcus, and B. Santorini, \u201cThe penn treebank: an overview,\u201d in Treebanks, pp. 5\u201322,\n\nSpringer, 2003.\n\n[31] R. M. Gray and D. L. Neuhoff, \u201cQuantization,\u201d IEEE transactions on information theory, vol. 44, no. 6,\n\npp. 2325\u20132383, 1998.\n\n[32] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, \u201cDeep learning with limited numerical\n\nprecision,\u201d in International Conference on Machine Learning, pp. 1737\u20131746, 2015.\n\n[33] G. Li, L. Deng, Y. Xu, C. Wen, W. Wang, J. Pei, and L. Shi, \u201cTemperature based restricted boltzmann\n\nmachines,\u201d Scienti\ufb01c reports, vol. 6, p. 19133, 2016.\n\n[34] T. Mikolov, A. Joulin, S. Chopra, M. Mathieu, and M. Ranzato, \u201cLearning longer memory in recurrent\n\nneural networks,\u201d arXiv preprint arXiv:1412.7753, 2014.\n\n[35] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, \u201cCnvlutin: Ineffectual-\nneuron-free deep neural network computing,\u201d in ACM SIGARCH Computer Architecture News, vol. 44,\npp. 1\u201313, IEEE Press, 2016.\n\n[36] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, \u201cCambricon-x: An\naccelerator for sparse neural networks,\u201d in Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM\nInternational Symposium on, pp. 
1–12, IEEE, 2016.