{"title": "Normalization Helps Training of Quantized LSTM", "book": "Advances in Neural Information Processing Systems", "page_first": 7346, "page_last": 7356, "abstract": "The long short-term memory (LSTM), though powerful, is memory and computation expensive. To alleviate this problem, one approach is to compress its weights by quantization. However, existing quantization methods usually have inferior performance when used on LSTMs. In this paper, we first show theoretically that training a quantized LSTM is difficult because quantization makes the exploding gradient problem more severe, particularly when the LSTM weight matrices are large. We then show that the popularly used weight/layer/batch normalization schemes can help stabilize the gradient magnitude in training quantized LSTMs. Empirical results show that the normalized quantized LSTMs achieve significantly better results than their unnormalized counterparts. Their performance is also comparable with the full-precision LSTM, while being much smaller in size.", "full_text": "Normalization Helps Training of Quantized LSTM

Lu Hou¹, Jinhua Zhu², James T. Kwok¹, Fei Gao³, Tao Qin³, Tie-yan Liu³

¹Hong Kong University of Science and Technology, Hong Kong
{lhouab,jamesk}@cse.ust.hk
²University of Science and Technology of China, Hefei, China
teslazhu@mail.ustc.edu.cn
³Microsoft Research, Beijing, China
{feiga, taoqin, tyliu}@microsoft.com

Abstract

The long short-term memory (LSTM), though powerful, is memory and computation expensive. To alleviate this problem, one approach is to compress its weights by quantization. However, existing quantization methods usually have inferior performance when used on LSTMs. In this paper, we first show theoretically that training a quantized LSTM is difficult because quantization makes the exploding gradient problem more severe, particularly when the LSTM weight matrices are large.
We then show that the popularly used weight/layer/batch normalization schemes can help stabilize the gradient magnitude in training quantized LSTMs. Empirical results show that the normalized quantized LSTMs achieve significantly better results than their unnormalized counterparts. Their performance is also comparable with the full-precision LSTM, while being much smaller in size.

1 Introduction

The long short-term memory (LSTM) [10] has achieved remarkable performance in various sequence modeling tasks [26, 28, 14]. Though powerful, the high-dimensional input/hidden/output and recursive computation across long time steps lead to space and time inefficiencies [29, 1], limiting its use on low-end devices with limited hardware resources.

In the LSTM, the weight matrices account for most of the time and space complexity. To lighten the computational demands, a popular approach is to quantize each weight to fewer bits. Previous weight quantization methods are mainly used on feedforward networks. In BinaryConnect [5], each weight is binarized. By introducing a scaling parameter, the binary weight network [23], ternary weight network [17], and loss-aware binarized/ternarized networks [11, 12] often report performance that is even comparable with the full-precision network. However, when used to quantize LSTMs, their performance is usually inferior, and BinaryConnect even fails [11, 1]. To alleviate this problem, one has to use more bits and/or much more sophisticated quantization functions [9, 29, 22, 18].

On the other hand, normalization techniques, such as weight normalization [24], layer normalization [2] and batch normalization [13], have been found useful in improving deep network training and performance. In particular, while batch normalization was initially limited to feedforward networks, it has recently been extended to LSTMs [4]. Very recently, Ardakani et al.
[1] used this extension to train binarized/ternarized LSTMs, and achieved state-of-the-art performance. However, it remains unclear why batch normalization works well on quantized LSTMs, and it also leaves open the question of whether weight normalization and layer normalization may work as well. Besides, the batch normalization extension in [4, 1] requires storing separate full-precision population statistics for each time step. These can cost even more storage than the quantized LSTM model itself. Moreover, it has to be used with a stochastic weight quantization function, which can be expensive due to the sampling operation.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this paper, we first study theoretically why the quantized LSTM is difficult to train. Similar to the analysis of the vanilla RNN in [21], we study the LSTM by investigating an upper bound on its backpropagated gradient w.r.t. the hidden states. We show that in each backpropagation step, the scale of this gradient is controlled by a number bounded by the norm of the LSTM weights. Quantization tends to increase these norms, particularly for large models, making the exploding gradient problem much more severe than in the full-precision counterpart. We then study the quantized LSTM with weight, layer, and batch normalization. Unlike the batch-normalized LSTM in [1], which requires a new stochastic weight quantization, we propose to apply normalization directly on top of any existing quantization method. We show that these normalization methods make the gradient invariant to weight scaling, and thus alleviate the problem of a possibly large weight-norm increase caused by quantization. Experiments are performed on various quantization methods with weight/layer/batch normalization. Results show that all three normalization schemes work well with quantized LSTMs, and achieve better results than their unnormalized counterparts. With only one or two bits, the normalized quantized LSTMs achieve comparable performance with the full-precision baseline. Moreover, weight/layer normalization perform as well as batch normalization (with separate statistics), but are more memory efficient.

Notations: For a vector x, ‖x‖ = \sqrt{\sum_i x_i^2} is its ℓ2-norm, and Diag(x) returns a diagonal matrix with x on the diagonal. For two vectors x and y, x ⊙ y denotes their element-wise multiplication. For a matrix X, ‖X‖_2 is its spectral norm (which is equal to its largest singular value).

2 Problem of Exploding Gradient

It is well-known that the vanilla RNN suffers from exploding and vanishing gradient problems due to long-term dependencies [3, 21]. To avoid vanishing gradients, the LSTM employs a self-connection in the cell [10]. Its recurrence (without using peephole connections) is:

\begin{bmatrix} i_t \\ f_t \\ a_t \\ o_t \end{bmatrix}
= \begin{bmatrix} W_{xi} x_t + W_{hi} h_{t-1} \\ W_{xf} x_t + W_{hf} h_{t-1} \\ W_{xa} x_t + W_{ha} h_{t-1} \\ W_{xo} x_t + W_{ho} h_{t-1} \end{bmatrix}
+ \begin{bmatrix} b_i \\ b_f \\ b_a \\ b_o \end{bmatrix},   (1)

c_t = σ(i_t) ⊙ tanh(a_t) + σ(f_t) ⊙ c_{t-1},   (2)
h_t = σ(o_t) ⊙ tanh(c_t).   (3)

Here, x_t, h_t and c_t are the input, hidden state and cell state at time t, W_{xi}, W_{xf}, W_{xa}, W_{xo} ∈ R^{d×r}, W_{hi}, W_{hf}, W_{ha}, W_{ho} ∈ R^{d×d} are the weight matrices, and b_i, b_f, b_a, b_o ∈ R^d are the biases.

In the following, we show that the gradients in the LSTM can still explode. For the vanilla RNN, the backpropagated gradient takes the form of a product of Jacobian matrices. By studying an upper bound on the gradient magnitude, a necessary condition for exploding gradients can be derived [21].
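The recurrence (1)-(3) is compact enough to state directly in code. Below is a minimal NumPy sketch of one LSTM step; the dimensions and random initialization are purely illustrative and are not the paper's experimental setup:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, Wx, Wh, b):
    """One step of the recurrence (1)-(3).

    Wx: 4 input-to-hidden matrices (d x r), Wh: 4 hidden-to-hidden (d x d),
    b: 4 bias vectors (d,), stacked in i/f/a/o order.
    """
    i, f, a, o = [Wx[k] @ x_t + Wh[k] @ h_prev + b[k] for k in range(4)]  # eq. (1)
    c_t = sigmoid(i) * np.tanh(a) + sigmoid(f) * c_prev                   # eq. (2)
    h_t = sigmoid(o) * np.tanh(c_t)                                       # eq. (3)
    return h_t, c_t

rng = np.random.default_rng(0)
d, r, T = 8, 5, 20
Wx = rng.normal(0, 1 / np.sqrt(r), size=(4, d, r))
Wh = rng.normal(0, 1 / np.sqrt(d), size=(4, d, d))
b = np.zeros((4, d))
h, c = np.zeros(d), np.zeros(d)
for t in range(T):
    h, c = lstm_step(rng.normal(size=r), h, c, Wx, Wh, b)
print(h.shape)  # (8,)
```

The gate pre-activations i, f, a, o correspond exactly to the four block rows of (1).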
In this paper, we follow the same approach, and study upper bounds on the gradient magnitude in the LSTM. Because of the introduction of c_t, the backpropagated gradient in the LSTM is not of the same simple form as that of the vanilla RNN, and the upper bound analysis is much more difficult. Lower bounds would be more desirable for deriving a sufficient condition for exploding gradients. However, even for the vanilla RNN, only an upper bound can be derived [21]. Moreover, as will be demonstrated empirically in Section 4, these upper bounds can still help explain the behavior of exploding gradients (Figures 2-3) and the failure of BinaryConnect and TerConnect in the quantized LSTM (Tables 2-4).

2.1 Exploding Gradient in LSTM

Let the loss be ξ = \sum_{m=1}^T ξ_m, where T is the number of time steps unrolled, and ξ_m is the loss at time step m. In backpropagation, recall that we first obtain ∂ξ_m/∂h_m and ∂ξ_m/∂c_m, and then backpropagate from ∂ξ_m/∂h_t and ∂ξ_m/∂c_t to ∂ξ_m/∂h_{t-1} and ∂ξ_m/∂c_{t-1} (for t ≤ m). We study the exploding gradient problem of the LSTM by first considering ‖∂ξ_m/∂h_{t-1}‖ and ‖∂ξ_m/∂h_t‖ at adjacent time steps. Let γ_1 = max_{1≤t≤m, 1≤j≤d} |[c_{t-1}]_j|, and

λ_1 = (1/4)‖W_{hi}‖_2 + (γ_1/4)‖W_{hf}‖_2 + ‖W_{ha}‖_2 + (1/4)‖W_{ho}‖_2,
λ_2 = (1/4)‖W_{hi}‖_2 + (γ_1/4)‖W_{hf}‖_2 + ‖W_{ha}‖_2.   (4)

Proposition 2.1 ‖∂ξ_m/∂h_{t-1}‖ ≤ λ_1 ‖∂ξ_m/∂h_t‖ + λ_2 ‖∂ξ_m/∂c_{t+1}‖.

When λ_2 = 0, Proposition 2.1 simplifies to ‖∂ξ_m/∂h_{t-1}‖ ≤ λ_1 ‖∂ξ_m/∂h_t‖. By induction, for any time step p < t, ‖∂ξ_m/∂h_p‖ ≤ λ_1^{t-p} ‖∂ξ_m/∂h_t‖. The norm of this backpropagated gradient can grow exponentially when λ_1 > 1, leading to exploding gradients. Hence, we have the following corollary.

Corollary 1 When λ_2 = 0, a necessary condition for exploding gradients in the LSTM is λ_1 > 1.

Empirically, λ_2 is rarely zero (Figure 1). The upper bound of ‖∂ξ_m/∂h_{t-1}‖ in Proposition 2.1 is then even larger, and the gradient may explode even more easily.

2.2 Exploding Gradient in Quantized LSTM

From (1)-(3), most of the LSTM's parameters are due to the matrices W_{xi}, W_{xf}, W_{xa}, W_{xo}, W_{hi}, W_{hf}, W_{ha}, W_{ho}. In the sequel, we use W_{x*} and W_{h*} to denote these matrices. The computation is also dominated by matrix-vector multiplications of the form W_{x*} x_t + W_{h*} h_{t-1} [1]. Quantizing these weight matrices can thus significantly reduce space and time [11, 29, 1].

The following propositions show that a large d leads to a large ‖W_{h*}‖_2 for both the binarized LSTM (Proposition 2.2) and the m-bit quantized LSTM (Proposition 2.3).

Proposition 2.2 [11] For any W ∈ {-1, +1}^{d×d}, ‖W‖_2 ≥ √d. Equality holds iff all singular values of W are the same.

Proposition 2.3 For any W ∈ {-B_k, ..., -B_1, B_0, B_1, ..., B_k}^{d×d} where 0 = B_0 < B_1 < ··· < B_k, we have ‖W‖_2 ≥ (1 - s) B_1 √d, where s is the sparsity (fraction of zero elements) of W. Equality holds iff all singular values of W are the same.

For ternarization, s > 0 and B_1 = 1, so the lower bound in Proposition 2.3 is smaller than that for binarization in Proposition 2.2.
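Propositions 2.2 and 2.3 are easy to check numerically. The sketch below binarizes and ternarizes a random matrix and compares the spectral norms against the lower bounds; the 0.7 · mean|w| ternarization threshold is a TWN-style choice assumed here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
# PyTorch-default-style uniform initialization for a d x d recurrent matrix
W = rng.uniform(-1 / np.sqrt(d), 1 / np.sqrt(d), size=(d, d))

spec = lambda M: np.linalg.norm(M, 2)   # spectral norm = largest singular value

W_bin = np.sign(W)                      # binarization to {-1, +1}
# ternarization to {-1, 0, +1} with an assumed TWN-style threshold
delta = 0.7 * np.abs(W).mean()
W_ter = np.sign(W) * (np.abs(W) > delta)
s = (W_ter == 0).mean()                 # sparsity, ~0.35 for this init

print(spec(W))                          # small (~1.15) for this init
print(spec(W_bin), np.sqrt(d))          # >= sqrt(d) ~ 22.6 (Prop. 2.2)
print(spec(W_ter), (1 - s) * np.sqrt(d))  # >= (1 - s) * sqrt(d) (Prop. 2.3, B_1 = 1)
```

As the propositions predict, both quantized norms sit far above the full-precision norm, and the ternary bound is looser than the binary one by the factor (1 - s).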
Table 1 compares the spectral norms of W_{h*} before/after binarization (using BinaryConnect) and ternarization (using TerConnect in Section 4). As can be seen, quantization increases the spectral norm, especially for large d and for binarization. When ‖W_{h*}‖_2 becomes larger after quantization, λ_1 and λ_2 in (4) also become large, and the exploding gradient problem becomes more severe. Empirically, though BinaryConnect [5] achieves remarkable performance on feedforward networks, it fails on LSTMs [11].

Table 1: Average spectral norm of 10 d × d weight matrices (obtained by various initialization methods) before/after binarization and ternarization. "Pytorch default" refers to the default uniform initialization used in the PyTorch implementation of the LSTM. For ternarization, the sparsity of the matrix is around 0.35 empirically. (The first three columns are full-precision.)

d    | Pytorch default | Xavier initialization [6] | He initialization [8] | binarized    | ternarized
512  | 1.15 ± 0.01     | 1.98 ± 0.01               | 2.81 ± 0.02           | 44.76 ± 0.28 | 36.18 ± 0.18
1024 | 1.15 ± 0.00     | 1.99 ± 0.01               | 2.81 ± 0.01           | 63.64 ± 0.30 | 51.33 ± 0.28
2048 | 1.15 ± 0.00     | 2.00 ± 0.01               | 2.82 ± 0.01           | 90.32 ± 0.30 | 72.81 ± 0.17

To alleviate the exploding gradient problem, empirical success has been observed by adding a scaling factor to the quantized weight [11, 12]. For example, in binarization, the binarized values become {-α, +α} for some α > 0. We speculate that the success of this simple method lies in the fact that, for any α ≥ 0, ‖αW_{h*}‖_2 = α‖W_{h*}‖_2. By using α < 1, the norm becomes smaller and the exploding gradient problem can be alleviated (empirical results are shown in Appendix B).
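The rescaling identity ‖αW‖_2 = α‖W‖_2 relied on here is the absolute homogeneity of the spectral norm, and can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 256
W = np.sign(rng.normal(size=(d, d)))    # binary weights in {-1, +1}
alpha = 0.05                            # a small (illustrative) scaling factor

n1 = np.linalg.norm(alpha * W, 2)       # ||alpha * W||_2
n2 = alpha * np.linalg.norm(W, 2)       # alpha * ||W||_2
print(n1, n2)                           # the two agree
```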
However, α is usually learned, and there is no guarantee that it is small enough to compensate for the increase in ‖W_{h*}‖_2 caused by quantization.

3 Normalization in LSTM

In this section, we theoretically study the properties of (full-precision and quantized) LSTMs with weight normalization [24], layer normalization [2], and batch normalization [13], and how these properties help optimization of the quantized LSTMs. In general, let the normalization function be N(·). The normalized LSTM satisfies the recurrence:

\begin{bmatrix} \tilde{i}_t \\ \tilde{f}_t \\ \tilde{a}_t \\ \tilde{o}_t \end{bmatrix}
= \begin{bmatrix} N(W_{xi} x_t) + N(W_{hi} h_{t-1}) \\ N(W_{xf} x_t) + N(W_{hf} h_{t-1}) \\ N(W_{xa} x_t) + N(W_{ha} h_{t-1}) \\ N(W_{xo} x_t) + N(W_{ho} h_{t-1}) \end{bmatrix}
+ \begin{bmatrix} b_i \\ b_f \\ b_a \\ b_o \end{bmatrix},   (5)

c_t = σ(\tilde{i}_t) ⊙ tanh(\tilde{a}_t) + σ(\tilde{f}_t) ⊙ c_{t-1},   (6)
h_t = σ(\tilde{o}_t) ⊙ tanh(c_t).   (7)

3.1 Weight Normalization (WN)

Weight normalization re-parameterizes the weight vector to decouple its length from its direction. In a LSTM, each row W_{j,:} of a weight matrix W (where W can be W_{h*} or W_{x*}) is separately normalized. The jth element of a weight-normalized vector WN(Wx) is WN(W_{j,:} x) = (g_j / ‖W_{j,:}‖) W_{j,:} x, where g_j is a trainable scaling factor. For W_{h*}, let the corresponding g_* be the largest g_j across all rows of WN(W_{h*} x), and D_* = Diag([‖(W_{h*})_{1,:}‖, ‖(W_{h*})_{2,:}‖, ..., ‖(W_{h*})_{d,:}‖]^⊤).

Proposition 3.1 With weight normalization,

‖∂ξ_m/∂h_{t-1}‖ ≤ ( (g_i/4)‖D_i^{-1} W_{hi}‖_2 + (γ_1 g_f/4)‖D_f^{-1} W_{hf}‖_2 + g_a‖D_a^{-1} W_{ha}‖_2 + (g_o/4)‖D_o^{-1} W_{ho}‖_2 ) ‖∂ξ_m/∂h_t‖
  + ( (g_i/4)‖D_i^{-1} W_{hi}‖_2 + (γ_1 g_f/4)‖D_f^{-1} W_{hf}‖_2 + g_a‖D_a^{-1} W_{ha}‖_2 ) ‖∂ξ_m/∂c_{t+1}‖.

Compared to the unnormalized LSTM (Proposition 2.1), the norm of the backpropagated gradient is now related to the g_*'s and D_*'s. As will be demonstrated in Appendix C, the g_* value only increases slightly after quantization, and so we ignore its effect in the theoretical analysis. When W_{h*} is scaled by a factor α, D_* is also scaled by α, and so D_*^{-1} W_{h*} is not affected. Hence, the backpropagation of ‖∂ξ_m/∂h_t‖ in the quantized LSTM is not affected by the possibly large scaling of the weight matrix caused by quantization (Propositions 2.2 and 2.3), and the exploding gradient problem can be alleviated.

3.2 Layer Normalization (LN)

Layer normalization normalizes the neuron activations in each layer to zero mean and unit variance, and can stabilize the hidden state dynamics of RNNs. Let x ∈ R^d be the input (which can be W_{x*} x_t or W_{h*} h_{t-1} in (5)) to layer normalization, with mean µ and standard deviation σ computed over its d elements.
Let z = (x - µ1)/σ be the normalized vector (with zero mean and unit variance). The output of layer normalization is y = LN(x) = g ⊙ z + b, where g and b are the scaling and bias parameters. For layer normalization applied to W_{h*} h_{t-1}, let g_* = g_k and σ_* = σ_k, where k = arg max_{1≤j≤d} g_j.

Proposition 3.2 With layer normalization,

‖∂ξ_m/∂h_{t-1}‖ ≤ ( (1/4)(g_i/σ_i)‖W_{hi}‖_2 + (γ_1/4)(g_f/σ_f)‖W_{hf}‖_2 + (g_a/σ_a)‖W_{ha}‖_2 + (1/4)(g_o/σ_o)‖W_{ho}‖_2 ) ‖∂ξ_m/∂h_t‖
  + ( (1/4)(g_i/σ_i)‖W_{hi}‖_2 + (γ_1/4)(g_f/σ_f)‖W_{hf}‖_2 + (g_a/σ_a)‖W_{ha}‖_2 ) ‖∂ξ_m/∂c_{t+1}‖.

If the elements of W_{h*} grow twice as large, the corresponding σ_* will also be twice as large, and ‖W_{h*}‖_2/σ_* remains unchanged. Thus, the backpropagation of ‖∂ξ_m/∂h_t‖ is again not affected by the scaling of W_{h*}.

3.3 Batch Normalization (BN)

Recently, batch normalization has achieved state-of-the-art performance with quantized LSTMs [1]. However, the underlying reason remains unclear. Besides, it has to be used with a stochastic weight quantization function, which is expensive due to the underlying sampling operation. It is also memory expensive, as separate mean and variance statistics have to be stored for each time step.

In this work, we propose to directly apply batch normalization on top of any existing quantization method in the LSTM.
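The scale-invariance arguments of Sections 3.1 and 3.2 can be verified numerically. The sketch below implements row-wise weight normalization and layer normalization (simplified; the usual numerical epsilon is omitted so the invariance is exact) and checks that their outputs are unchanged when the weight matrix is scaled by an arbitrary α:

```python
import numpy as np

def weight_norm(W, g, x):
    # row-wise weight normalization: [WN(Wx)]_j = (g_j / ||W_j,:||) * (W_j,: x)
    return g * (W @ x) / np.linalg.norm(W, axis=1)

def layer_norm(v, g, b):
    # LN(v) = g * (v - mu) / sigma + b, with mu, sigma over the d elements
    mu, sigma = v.mean(), v.std()
    return g * (v - mu) / sigma + b

rng = np.random.default_rng(0)
d = 16
W, x = rng.normal(size=(d, d)), rng.normal(size=d)
g, b = np.ones(d), np.zeros(d)
alpha = 37.0   # an arbitrary weight scaling, e.g. caused by quantization

wn1, wn2 = weight_norm(W, g, x), weight_norm(alpha * W, g, x)
ln1, ln2 = layer_norm(W @ x, g, b), layer_norm((alpha * W) @ x, g, b)
print(np.allclose(wn1, wn2), np.allclose(ln1, ln2))  # True True
```

Scaling W by α scales the row norms (for WN) and the mean/standard deviation (for LN) by the same α, so it cancels exactly; this is the mechanism behind Propositions 3.1 and 3.2.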
As will be seen in Section 4, this yields comparable or even better performance than [1], and is much cheaper in space when the batch statistics are shared across different time steps.

Batch normalization operates on a minibatch. At time t, let x^k_t ∈ R^r, h^k_t, c^k_t ∈ R^d be the input, hidden state and cell state for sample k in a minibatch of N samples, and H_t = [h^1_t, ..., h^N_t]^⊤ ∈ R^{N×d}, X_t = [x^1_t, ..., x^N_t]^⊤ ∈ R^{N×r}. The input to batch normalization is X ∈ R^{N×d} (which can be X_t W_{x*}^⊤ or H_{t-1} W_{h*}^⊤), with mean µ_j and standard deviation σ_j for the jth column. The batch normalization output Y ∈ R^{N×d} has Y_{:,j} = BN(X_{:,j}) = g_j (X_{:,j} - µ_j 1)/σ_j + b_j 1, where g = [g_1, ..., g_d]^⊤ ∈ R^d and b = [b_1, ..., b_d]^⊤ ∈ R^d are the scaling parameters and biases, respectively. For batch normalization applied to H_{t-1} W_{h*}^⊤, let (σ_*, g_*) = arg max_{{σ_1,...,σ_d},{g_1,...,g_d}} g_j/σ_j. Let γ_2 = max_{1≤t≤m, 1≤j≤d, 1≤k≤N} |[c^k_{t-1}]_j|.

Proposition 3.3 For the unnormalized LSTM in (1),

\sum_{k=1}^N ‖∂ξ_m/∂h^k_{t-1}‖² ≤ ( (1/2)‖W_{hi}‖²_2 + (γ_2²/2)‖W_{hf}‖²_2 + 8‖W_{ha}‖²_2 + (1/4)‖W_{ho}‖²_2 ) \sum_{k=1}^N ‖∂ξ_m/∂h^k_t‖²
  + ( (1/2)‖W_{hi}‖²_2 + (γ_2²/2)‖W_{hf}‖²_2 + 8‖W_{ha}‖²_2 ) \sum_{k=1}^N ‖∂ξ_m/∂c^k_{t+1}‖².

Proposition 3.4 With batch normalization,

\sum_{k=1}^N ‖∂ξ_m/∂h^k_{t-1}‖² ≤ ( (1/2)(g_i²/σ_i²)‖W_{hi}‖²_2 + (γ_2²/2)(g_f²/σ_f²)‖W_{hf}‖²_2 + 8(g_a²/σ_a²)‖W_{ha}‖²_2 + (1/4)(g_o²/σ_o²)‖W_{ho}‖²_2 ) \sum_{k=1}^N ‖∂ξ_m/∂h^k_t‖²
  + ( (1/2)(g_i²/σ_i²)‖W_{hi}‖²_2 + (γ_2²/2)(g_f²/σ_f²)‖W_{hf}‖²_2 + 8(g_a²/σ_a²)‖W_{ha}‖²_2 ) \sum_{k=1}^N ‖∂ξ_m/∂c^k_{t+1}‖².

In contrast to the unnormalized LSTM (Proposition 3.3), when batch normalization is used, if the elements of W_{h*} in the quantized LSTM grow twice as large, the corresponding σ_* will also be twice as large, and ‖W_{h*}‖²_2/σ_*² remains unchanged.
Thus, it is again unaffected by the scaling of W_{h*}.

Remark 3.1 In summary, by using weight, layer or batch normalization, the backpropagation of ‖∂ξ_m/∂h_t‖ in the quantized LSTM is not affected by the possibly large scaling of the weight matrix caused by quantization, and the exploding gradient problem can be alleviated.

Remark 3.2 Note that the storage requirements of the normalization schemes differ. The full-precision LSTM requires 32 × 4(rd + d² + d) bits to store the W_{x*}'s, W_{h*}'s, and b_*'s in (1), while the m-bit unnormalized LSTM requires m × 4(rd + d²) + 32 × 4d bits. When normalization is used on the m-bit LSTM, weight normalization requires 32 × 8d additional bits to store the scaling parameters g_j's. Layer normalization is slightly more expensive, and needs 32 × 16d bits to store the scaling parameters and biases. Batch normalization needs 32 × 32d extra bits when the mean and variance statistics are shared across time steps, which is even more expensive but still small (compared to the LSTM size). However, when separate statistics are used in each time step, the additional space becomes 32 × 16d + 32 × 16Td bits, which can be large for large T. As will be shown in Section 4, empirically, using shared statistics performs similarly to using separate statistics on language modeling tasks, but worse on (permuted) sequential MNIST classification.

4 Experiments

Experiments are performed on character/word-level language modeling and sequential MNIST classification. We compare with the full-precision LSTM, and popular state-of-the-art quantized LSTMs including (i) 1-bit LSTMs with/without normalization: binarized using BinaryConnect (BC) [5], the binary weight network (BWN) [23], and loss-aware binarization (LAB) [11]. We also compare with the recent stochastically binarized LSTM (denoted SBN) with batch normalization in [1]; (ii) 2-bit LSTMs with/without normalization: ternarized using ternary weight networks (TWN) [17], and loss-aware ternarization with approximate solution (LAT) [12]. Analogous to BinaryConnect, we also include a baseline called TerConnect¹, which ternarizes weights to {-1, 0, +1} using the same threshold as TWN, but does not scale the ternary weights. We also compare with the 2-bit alternating LSTM [29], and the stochastically ternarized LSTM (denoted STN) with batch normalization [1].

Table 2: Test BPC and size (in KB) of LSTM on character-level language modeling. "N/A" means that the loss became NaN after the first epoch. The method with the lowest BPC in each group is highlighted (in the original paper).

precision | quantization  | normalization    | War and Peace BPC / size | Penn Treebank BPC / size | Text8 BPC / size
full      | -             | -                | 1.72 / 4504  | 1.45 / 4800  | 1.46 / 63375
full      | -             | weight           | 1.73 / 4520  | 1.45 / 4816  | 1.48 / 63438
full      | -             | layer            | 1.69 / 4536  | 1.43 / 4832  | 1.45 / 63500
full      | -             | batch (shared)   | 1.72 / 4568  | 1.45 / 4864  | 1.46 / 63625
full      | -             | batch (separate) | 1.72 / 7736  | 1.45 / 8032  | 1.46 / 86000
1-bit     | SBN           | batch (separate) | 1.78 / 3785  | 1.60 / 3794  | 1.54 / 27464
1-bit     | BinaryConnect | -                | 4.24 / 149   | 2.51 / 158   | N/A / 2011
1-bit     | BinaryConnect | weight           | 1.74 / 165   | 1.50 / 174   | 1.50 / 2073
1-bit     | BinaryConnect | layer            | 1.69 / 181   | 1.49 / 190   | 1.47 / 2136
1-bit     | BinaryConnect | batch (shared)   | 1.72 / 213   | 1.51 / 222   | 1.47 / 2261
1-bit     | BinaryConnect | batch (separate) | 1.72 / 3381  | 1.50 / 3390  | 1.48 / 24636
1-bit     | BWN           | -                | 1.89 / 149   | 1.56 / 158   | 1.56 / 2011
1-bit     | BWN           | weight           | 1.74 / 165   | 1.51 / 174   | 1.50 / 2073
1-bit     | BWN           | layer            | 1.70 / 181   | 1.49 / 190   | 1.47 / 2136
1-bit     | BWN           | batch (shared)   | 1.72 / 213   | 1.50 / 222   | 1.47 / 2261
1-bit     | BWN           | batch (separate) | 1.72 / 3381  | 1.51 / 3390  | 1.48 / 24636
1-bit     | LAB           | -                | 1.86 / 149   | 1.56 / 158   | 1.58 / 2011
1-bit     | LAB           | weight           | 1.73 / 165   | 1.51 / 174   | 1.50 / 2073
1-bit     | LAB           | layer            | 1.70 / 181   | 1.49 / 190   | 1.47 / 2136
1-bit     | LAB           | batch (shared)   | 1.71 / 213   | 1.50 / 222   | 1.47 / 2261
1-bit     | LAB           | batch (separate) | 1.72 / 3381  | 1.50 / 3390  | 1.47 / 24636
2-bit     | STN           | batch (separate) | 1.72 / 3521  | 1.60 / 3944  | 1.51 / 15303
2-bit     | TerConnect    | -                | 6.35 / 289   | 5.84 / 308   | N/A / 3990
2-bit     | TerConnect    | weight           | 1.72 / 305   | 1.42 / 324   | 1.42 / 4053
2-bit     | TerConnect    | layer            | 1.67 / 321   | 1.43 / 340   | 1.44 / 4115
2-bit     | TerConnect    | batch (shared)   | 1.70 / 353   | 1.44 / 372   | 1.44 / 4240
2-bit     | TerConnect    | batch (separate) | 1.71 / 3521  | 1.45 / 3540  | 1.44 / 26615
2-bit     | TWN           | -                | 1.86 / 289   | 1.51 / 308   | 1.54 / 2990
2-bit     | TWN           | weight           | 1.71 / 305   | 1.45 / 324   | 1.43 / 4053
2-bit     | TWN           | layer            | 1.67 / 321   | 1.43 / 340   | 1.44 / 4115
2-bit     | TWN           | batch (shared)   | 1.70 / 353   | 1.44 / 372   | 1.44 / 4240
2-bit     | TWN           | batch (separate) | 1.70 / 3521  | 1.45 / 3540  | 1.44 / 26615
2-bit     | LAT           | -                | 1.80 / 289   | 1.48 / 308   | 1.50 / 2990
2-bit     | LAT           | weight           | 1.69 / 305   | 1.42 / 324   | 1.44 / 4053
2-bit     | LAT           | layer            | 1.65 / 321   | 1.40 / 340   | 1.41 / 4115
2-bit     | LAT           | batch (shared)   | 1.68 / 353   | 1.41 / 372   | 1.41 / 4240
2-bit     | LAT           | batch (separate) | 1.68 / 3521  | 1.42 / 3540  | 1.41 / 26615

4.1 Character-level Language Modeling

The LSTM takes as input a character sequence, and predicts the next character at each time step. The training objective is the cross-entropy loss over target sequences, and performance is evaluated by bits per character (BPC). Experiments are performed on three benchmark data sets: (i) Leo Tolstoy's War and Peace; (ii) the Penn Treebank Corpus [27]; and (iii) Text8. On War and Peace and Penn Treebank, we use a one-layer LSTM with 512 hidden units² as in [11, 12].
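The model sizes reported in Table 2 follow the accounting of Remark 3.2, which can be scripted as below. The vocabulary size r = 87 is an assumed illustrative value (the actual configurations are in the appendices), so the printed numbers are indicative only:

```python
def lstm_bits(d, r, m=32, norm=None, T=1):
    """Storage in bits following Remark 3.2. m is the weight bit-width."""
    bits = m * 4 * (r * d + d * d) + 32 * 4 * d      # quantized weights + fp biases
    if norm == "weight":
        bits += 32 * 8 * d                           # scaling parameters g_j
    elif norm == "layer":
        bits += 32 * 16 * d                          # scaling parameters and biases
    elif norm == "batch_shared":
        bits += 32 * 32 * d                          # g, b plus shared mean/variance
    elif norm == "batch_separate":
        bits += 32 * 16 * d + 32 * 16 * T * d        # per-time-step statistics
    return bits

kb = lambda bits: bits / 8 / 1024
d, r = 512, 87          # r = 87 is an assumed character-vocabulary size
print(round(kb(lstm_bits(d, r, m=32))))              # full-precision
print(round(kb(lstm_bits(d, r, m=1))))               # 1-bit, no normalization
print(round(kb(lstm_bits(d, r, m=1, norm="layer")))) # 1-bit + layer normalization
```

The dominant m × 4(rd + d²) term explains why the 1-bit and 2-bit models are over an order of magnitude smaller than the full-precision baseline, while the normalization overheads are only O(d) (or O(Td) for separate batch statistics).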
On Text8, we use a one-layer LSTM with 2000 hidden units as in [1]. Adam is used as the optimizer. The detailed setup is in Appendix A.1.

Table 2 shows the testing BPC values and the size of the LSTM parameters, including the additional storage due to normalization parameters and statistics (where applicable). Note that [1] does not count this additional storage.

Normalization: For the quantized LSTM, the normalized version consistently outperforms its unnormalized counterpart. In particular, directly applying BinaryConnect achieves very poor performance on War and Peace and Penn Treebank, and fails on Text8. With weight / layer / batch (shared) normalization, BinaryConnect achieves comparable or even better results than the full-precision LSTM, while being over 20x smaller. For batch normalization, using shared statistics across all time steps yields performance similar to using separate statistics, but with much smaller storage.

Comparison with Full-Precision LSTM: The 1-bit normalized LSTM performs similarly to the unnormalized full-precision baseline on War and Peace and Text8. The 2-bit normalized LSTM significantly outperforms the unnormalized full-precision baseline on all three data sets, and requires much less storage. Moreover, compared with the normalized full-precision LSTM, the 2-bit normalized LSTM has competitive performance, but requires much smaller storage.

Comparison of Different Quantization Methods: For the unnormalized 1-bit LSTM, BWN and LAB perform significantly better than BinaryConnect. They have an additional scaling parameter which is empirically smaller than 1 (as can be seen in Appendix B), and can thus alleviate the exploding gradient problem (Section 2.2).

¹Note that this is different from the stochastic ternary-connect in [19].
²Ardakani et al. [1] use 1000 hidden units, so their BPC results are not directly comparable.
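The learned scaling of BWN can be sketched as follows. The closed form α = mean|W| (with B = sign(W)) is the minimizer of ‖W - αB‖_F used by the binary weight network [23]; for common initializations it is far below 1, consistent with the discussion above:

```python
import numpy as np

def bwn_binarize(W):
    """BWN-style binarization: W ~ alpha * sign(W),
    with alpha = mean(|W|), the closed-form minimizer of ||W - alpha*B||_F."""
    alpha = np.abs(W).mean()
    return alpha, np.sign(W)

rng = np.random.default_rng(0)
d = 512
W = rng.uniform(-1 / np.sqrt(d), 1 / np.sqrt(d), size=(d, d))
alpha, B = bwn_binarize(W)
print(alpha)                            # small, far below 1 for this init
print(np.linalg.norm(alpha * B, 2))     # much smaller than ||sign(W)||_2
```

With α ≪ 1, the spectral norm of the effective weight αB stays near the full-precision scale, which is the speculated reason BWN/LAB avoid BinaryConnect's gradient explosion.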
As BWN and LAB differ from BinaryConnect only in\nthe scaling parameter, the three perform similarly when normalization is applied, as the normalized\nLSTM is invariant to weight rescaling. For the unnormalized 2-bit quantized LSTMs, TWN and LAT\nalso perform signi\ufb01cantly better than TerConnect.\nComparison with SBN and STN: Directly applying normalization on top of quantization con-\nsistently outperforms the batch-normalized LSTM in [1].\nIn particular, 1-bit layer-normalized\nBinaryConnect achieves similar and often better performance than 2-bit STN. Moreover, bina-\nrized/ternarized model with weight/layer/batch (shared) normalization is signi\ufb01cantly smaller than\nSBN and STN, which use separate statistics for different time steps and normalization on the cell.\n\u03bb1, \u03bb2 Values: Figures 1(a)-1(b) show \u03bb1, \u03bb2 (in Propositions 2.1, 3.1-3.2) for the unnormalized\nfull-precision and BC-binarized LSTMs with weight/layer normalization on Penn Treebank, and\nFigures 1(c)-1(d) show the values (in Propositions 3.3-3.4) with batch normalization.3 As can be seen,\nnormalization reduces \u03bb1, \u03bb2 in the binarized LSTM. The corresponding g\u2217 values are in Appendix C.\n\n(a) \u03bb1.\n\n(b) \u03bb2.\n\n(c) \u03bb1.\n\n(d) \u03bb2.\n\nFigure 1: \u03bb1 and \u03bb2 values in full-precision and binarized LSTMs (with and without normalization).\n\nGradient Magnitude: Figure 2 shows the backpropagated gradient norms4 for the unnormalized\nfull-precision LSTM, and BC-normalized LSTM with/without normalization. The gradients of the\nunnormalized binarized LSTM explode quickly during backpropagation (Figure 2(b)), while the\nnormalized binarized LSTM has stable gradient \ufb02ow similar to the full-precision baseline. 
More results can be found in Appendix D.1.

Figure 2: Gradient norms of the unnormalized full-precision LSTM and the BC-binarized LSTM with/without normalization, for character-level language modeling on Penn Treebank. Panels: (a) full-precision; (b) BC; (c) BC+WN; (d) BC+LN; (e) BC+BN (shared). Since backpropagation operates in the backward direction, each plot is best read from right to left.

³Propositions 3.3 and 3.4 are based on the squared weight norm. Hence, Figures 1(c)-1(d) plot the coefficients before \sum_{k=1}^N ‖∂ξ_m/∂h^k_t‖² and \sum_{k=1}^N ‖∂ξ_m/∂c^k_{t+1}‖². With an abuse of notation, we still call these λ1 and λ2.
⁴Figures 2(a)-2(d) show ‖∂ξ/∂h_t‖. Each curve corresponds to the gradient from one sample in the mini-batch. For better visualization, we only show 10 curves. Figure 2(e) shows \sum_{k=1}^N ‖∂ξ/∂h^k_t‖² for the whole mini-batch.

4.2 Word-level Language Modeling

In this section, we perform experiments on predicting the next word on Penn Treebank. We use a one-layer LSTM, with d = 300 as in [1, 29], and d = 650 as in [1, 30]. We use the same data preparation and training procedures as in [1, 20]. The optimizer is SGD. The detailed setup is in Appendix A.2.

Table 3 shows the testing perplexity (PPL) results.
BinaryConnect and TerConnect fail when directly applied. However, with normalization, they achieve performance comparable with their full-precision counterparts. Quantized models with normalization usually outperform their unnormalized counterparts, and have comparable or even better performance than the full-precision baseline. Again, directly applying normalization to existing quantization methods on the LSTM performs similarly to or even better than SBN and STN, while requiring much less storage when weight/layer/batch (shared) normalization is used. For batch normalization, using shared mean and variance statistics across all time steps performs similarly to using separate statistics. Preliminary results on a 2-layer LSTM also show similar observations (details are in Appendix E).

Table 3: Test PPL and size (in KB) of LSTM for word-level language modeling on Penn Treebank. For the alternating LSTM, only results for d = 300 are reported in [29].

| precision | quantization | normalization | PPL (d=300) | size (d=300) | PPL (d=650) | size (d=650) |
|---|---|---|---|---|---|---|
| full | - | - | 91.5 | 2817 | 87.6 | 13213 |
| full | - | weight | 86.1 | 2827 | 86.2 | 13234 |
| full | - | layer | 87.4 | 2836 | 84.5 | 13254 |
| full | - | batch (shared) | 90.2 | 2855 | 86.3 | 13295 |
| full | - | batch (separate) | 90.5 | 3492 | 87.9 | 14676 |
| 1-bit | SBN | batch (separate) | 92.2 | 852 | 87.2 | 2068 |
| 1-bit | BinaryConnect | - | 8247.4 | 93 | 1244.2 | 423 |
| 1-bit | BinaryConnect | weight | 87.6 | 102 | 84.8 | 443 |
| 1-bit | BinaryConnect | layer | 89.4 | 111 | 82.3 | 463 |
| 1-bit | BinaryConnect | batch (shared) | 92.4 | 130 | 84.8 | 504 |
| 1-bit | BinaryConnect | batch (separate) | 91.9 | 767 | 85.6 | 1885 |
| 1-bit | BWN | - | 94.7 | 93 | 83.5 | 423 |
| 1-bit | BWN | weight | 89.4 | 102 | 85.9 | 443 |
| 1-bit | BWN | layer | 91.4 | 111 | 84.2 | 463 |
| 1-bit | BWN | batch (shared) | 91.5 | 130 | 86.6 | 504 |
| 1-bit | BWN | batch (separate) | 93.0 | 767 | 87.3 | 1885 |
| 2-bit | alternating LSTM | - | 103.1 | 180 | - | - |
| 2-bit | STN | batch (separate) | 90.7 | 940 | 86.1 | 2481 |
| 2-bit | TerConnect | - | 113.8 | 180 | 113.8 | 835 |
| 2-bit | TerConnect | weight | 86.5 | 190 | 84.9 | 856 |
| 2-bit | TerConnect | layer | 88.2 | 199 | 82.5 | 876 |
| 2-bit | TerConnect | batch (shared) | 90.6 | 218 | 85.8 | 917 |
| 2-bit | TerConnect | batch (separate) | 91.6 | 855 | 86.5 | 2298 |
| 2-bit | TWN | - | 89.8 | 180 | 84.2 | 835 |
| 2-bit | TWN | weight | 87.1 | 190 | 85.6 | 856 |
| 2-bit | TWN | layer | 90.5 | 199 | 84.1 | 876 |
| 2-bit | TWN | batch (shared) | 92.1 | 218 | 85.5 | 917 |
| 2-bit | TWN | batch (separate) | 91.2 | 855 | 87.5 | 2298 |
| 3-bit | alternating LSTM | - | 93.8 | 268 | - | - |
| 4-bit | alternating LSTM | - | 91.4 | 356 | - | - |

Figure 3 shows the norms of the backpropagated gradients in the full-precision and binarized LSTMs. As can be seen, the gradients of the unnormalized binarized LSTM explode quickly during backpropagation (Figure 3(b)), while the normalized binarized LSTM has stable gradient flow similar to the full-precision baseline. This agrees with Propositions 2.2-2.3 and Table 1 that the spectral norm of the weight matrix becomes larger (i) for large d, and (ii) for BinaryConnect than for TerConnect, leading to a more severe exploding gradient problem. More results can be found in Appendix D.2.

[Figure 3: Gradient norms of the unnormalized full-precision LSTM and the BC-binarized LSTM with/without normalization for word-level language modeling (with d = 300) on Penn Treebank. Panels: (a) full-precision; (b) BC; (c) BC+WN; (d) BC+LN; (e) BC+BN (shared).]

4.3 Sequential MNIST

We perform experiments on the sequential version of MNIST classification, which processes one image pixel at a time.
We follow the settings in [16, 4, 1], and use both MNIST and permuted MNIST (pMNIST) [4]. In MNIST, the 28 × 28 pixels are processed in scanline order. In pMNIST, they are processed in a fixed random order. The optimizer is Adam. The detailed setup is in Appendix A.3.

Table 4 shows the test accuracies and sizes⁵ of the LSTM parameters. BinaryConnect and TerConnect, which cannot be trained without normalization on this task, have results comparable with the full-precision baselines when used with normalization.

Batch normalization with shared mean and variance statistics across all time steps has inferior performance, while storing separate mean and variance statistics for the 28 × 28 = 784 time steps is too memory-expensive. In contrast, weight and layer normalization achieve high model compression, with performance comparable to or even better than their batch-normalized counterparts.

Though batch normalization with shared statistics performs similarly to using separate statistics on the language modeling tasks (Tables 2-3), it fails on this sequential MNIST task. We speculate that this is because, in this task, each time step corresponds to an input pixel. The use of shared batch normalization statistics implicitly assumes that different pixels have similar characteristics. However, this may not be reasonable (e.g., pixels around the edge are typically darker).

Table 4: Test accuracy (%) and size (KB) of LSTM on the sequential MNIST task.
\u201cN/A\" means that\nthe loss becomes NaN after the \ufb01rst epoch.\n\nnormalization MNIST pMNIST\n\n-\n\nweight\nlayer\n\nprecision\n\nquantization\n\nfull\n\n1-bit\n\n2-bit\n\n-\n\nSBN\n\nBinaryConnect\n\nBWN\n\nSTN\n\nTerConnect\n\nTWN\n\nbatch (shared)\nbatch (separate)\nbatch (separate)\n\nweight\nlayer\n\nbatch (shared)\nbatch (separate)\n\nweight\nlayer\n\nbatch (shared)\nbatch (separate)\nbatch (separate)\n\nweight\nlayer\n\nbatch (shared)\nbatch (separate)\n\n-\n\n-\n\n-\n\n-\n\nweight\nlayer\n\nbatch (shared))\nbatch (separate)\n\n98.9\n98.4\n98.0\n21.4\n99.0\n98.6\nN/A\n98.7\n98.9\n20.6\n98.7\n98.7\n98.7\n98.8\n20.6\n98.6\n98.8\nN/A\n98.9\n98.8\n23.6\n98.8\n98.6\n98.6\n98.8\n26.5\n98.7\n\n90.2\n90.2\n90.7\n35.4\n93.7\n89.9\nN/A\n91.4\n91.2\n36.5\n91.2\n89.7\n91.3\n90.8\n40.1\n91.1\n91.9\nN/A\n92.4\n92.5\n34.1\n93.2\n90.4\n92.1\n91.7\n38.3\n93.1\n\nsize\n159\n163\n166\n172\n5066\n5526\n\n8\n11\n14\n21\n4914\n\n8\n11\n14\n21\n4914\n5531\n13\n16\n19\n25\n4919\n13\n16\n18\n25\n4919\n\n5 Conclusion\n\nIn this paper, we show that quantized LSTMs are hard to train because the scales of the quantized\nLSTM weights can be very large, making the gradients easy to explode. We then show that applying\nweight, layer or batch normalization can enable the gradient magnitude to be invariant to this possibly\nlarge scaling, and thus alleviates the exploding gradient problem. Experiments on various tasks\nshow that the normalized quantized LSTM can be easily trained, achieves comparable or even better\nperformance than its full-precision counterpart, but saves much storage due to quantization.\n\n5Note that [1] does not count the additional storage for batch statistics, which is indeed much larger than the\n\nmodel itself on this sequential MNIST task (where the number of time steps is T = 784).\n\n9\n\n\fAcknowledgments\n\nThis research project is partially funded by Microsoft Research Asia.\n\nReferences\n[1] A. Ardakani, Z. Ji, S. C. Smithson, B. H. Meyer, and W. 
J. Gross. Learning recurrent binary/ternary weights. In International Conference on Learning Representations, 2019.

[2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. Preprint arXiv:1607.06450, 2016.

[3] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

[4] T. Cooijmans, N. Ballas, C. Laurent, C. Gulcehre, and A. Courville. Recurrent batch normalization. In International Conference on Learning Representations, 2016.

[5] M. Courbariaux, Y. Bengio, and J. P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Neural Information Processing Systems, pages 3105–3113, 2015.

[6] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[7] A. Graves. Supervised sequence labelling. In Supervised Sequence Labelling with Recurrent Neural Networks, pages 5–13. Springer, 2012.

[8] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision, pages 1026–1034, 2015.

[9] Q. He, H. Wen, S. Zhou, Y. Wu, C. Yao, X. Zhou, and Y. Zou. Effective quantization methods for recurrent neural networks. Preprint arXiv:1611.10176, 2016.

[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[11] L. Hou, Q. Yao, and J. T. Kwok. Loss-aware binarization of deep networks. In International Conference on Learning Representations, 2017.

[12] L. Hou, Q. Yao, and J. T. Kwok. Loss-aware weight quantization of deep networks. In International Conference on Learning Representations, 2018.

[13] S. Ioffe and C. Szegedy.
Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

[14] A. Karpathy, J. Johnson, and F. F. Li. Visualizing and understanding recurrent networks. In International Conference on Learning Representations, 2016.

[15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[16] Q. Le, N. Jaitly, and G. E. Hinton. A simple way to initialize recurrent networks of rectified linear units. Preprint arXiv:1504.00941, 2015.

[17] F. Li and B. Liu. Ternary weight networks. Preprint arXiv:1605.04711, 2016.

[18] Z. Li, D. He, F. Tian, W. Chen, T. Qin, L. Wang, and T. Liu. Towards binary-valued gates for robust LSTM training. In International Conference on Machine Learning, 2018.

[19] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio. Neural networks with few multiplications. In International Conference on Learning Representations, 2016.

[20] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur. Recurrent neural network based language model. In Annual Conference of the International Speech Communication Association, 2010.

[21] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.

[22] A. Polino, R. Pascanu, and D. Alistarh. Model compression via distillation and quantization. In International Conference on Learning Representations, 2018.

[23] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, 2016.

[24] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks.
In Neural Information Processing Systems, pages 901–909, 2016.

[25] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How does batch normalization help optimization? In Neural Information Processing Systems, 2018.

[26] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Neural Information Processing Systems, pages 3104–3112, 2014.

[27] A. Taylor, M. Marcus, and B. Santorini. The Penn treebank: An overview. In Treebanks, pages 5–22. Springer, 2003.

[28] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.

[29] C. Xu, J. Yao, Z. Lin, W. Ou, Y. Cao, Z. Wang, and H. Zha. Alternating multi-bit quantization for recurrent neural networks. In International Conference on Learning Representations, 2018.

[30] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. Preprint arXiv:1409.2329, 2014.