{"title": "FastGRNN: A Fast, Accurate, Stable and Tiny Kilobyte Sized Gated Recurrent Neural Network", "book": "Advances in Neural Information Processing Systems", "page_first": 9017, "page_last": 9028, "abstract": "This paper develops the FastRNN and FastGRNN algorithms to address the twin RNN limitations of inaccurate training and inefficient prediction. Previous approaches have improved accuracy at the expense of prediction costs making them infeasible for resource-constrained and real-time applications. Unitary RNNs have increased accuracy somewhat by restricting the range of the state transition matrix's singular values but have also increased the model size as they require a larger number of hidden units to make up for the loss in expressive power. Gated RNNs have obtained state-of-the-art accuracies by adding extra parameters thereby resulting in even larger models. FastRNN addresses these limitations by adding a residual connection that does not constrain the range of the singular values explicitly and has only two extra scalar parameters. FastGRNN then extends the residual connection to a gate by reusing the RNN matrices to match state-of-the-art gated RNN accuracies but with a 2-4x smaller model. Enforcing FastGRNN's matrices to be low-rank, sparse and quantized resulted in accurate models that could be up to 35x smaller than leading gated and unitary RNNs. This allowed FastGRNN to accurately recognize the \"Hey Cortana\" wakeword with a 1 KB model and to be deployed on severely resource-constrained IoT microcontrollers too tiny to store other RNN models. 
FastGRNN's code is available at (https://github.com/Microsoft/EdgeML/).", "full_text": "FastGRNN: A Fast, Accurate, Stable and Tiny\nKilobyte Sized Gated Recurrent Neural Network\n\nAditya Kusupati\u2020, Manish Singh\u00a7, Kush Bhatia\u2021,\nAshish Kumar\u2021, Prateek Jain\u2020 and Manik Varma\u2020\n\n\u2020Microsoft Research India\n\n\u00a7Indian Institute of Technology Delhi\n\u2021University of California Berkeley\n\n{t-vekusu,prajain,manik}@microsoft.com, singhmanishiitd@gmail.com\n\nkush@cs.berkeley.edu, ashish_kumar@berkeley.edu\n\nAbstract\n\nThis paper develops the FastRNN and FastGRNN algorithms to address the twin\nRNN limitations of inaccurate training and inef\ufb01cient prediction. Previous ap-\nproaches have improved accuracy at the expense of prediction costs making them\ninfeasible for resource-constrained and real-time applications. Unitary RNNs have\nincreased accuracy somewhat by restricting the range of the state transition matrix\u2019s\nsingular values but have also increased the model size as they require a larger num-\nber of hidden units to make up for the loss in expressive power. Gated RNNs have\nobtained state-of-the-art accuracies by adding extra parameters thereby resulting\nin even larger models. FastRNN addresses these limitations by adding a residual\nconnection that does not constrain the range of the singular values explicitly and\nhas only two extra scalar parameters. FastGRNN then extends the residual connec-\ntion to a gate by reusing the RNN matrices to match state-of-the-art gated RNN\naccuracies but with a 2-4x smaller model. Enforcing FastGRNN\u2019s matrices to be\nlow-rank, sparse and quantized resulted in accurate models that could be up to\n35x smaller than leading gated and unitary RNNs. This allowed FastGRNN to\naccurately recognize the \"Hey Cortana\" wakeword with a 1 KB model and to be\ndeployed on severely resource-constrained IoT microcontrollers too tiny to store\nother RNN models. 
FastGRNN\u2019s code is available at [30].\n\n1\n\nIntroduction\n\nObjective: This paper develops the FastGRNN (an acronym for a Fast, Accurate, Stable and Tiny\nGated Recurrent Neural Network) algorithm to address the twin RNN limitations of inaccurate\ntraining and inef\ufb01cient prediction. FastGRNN almost matches the accuracies and training times of\nstate-of-the-art unitary and gated RNNs but has signi\ufb01cantly lower prediction costs with models\nranging from 1 to 6 Kilobytes for real-world applications.\nRNN training and prediction: It is well recognized that RNN training is inaccurate and unstable\nas non-unitary hidden state transition matrices could lead to exploding and vanishing gradients for\nlong input sequences and time series. An equally important concern for resource-constrained and\nreal-time applications is the RNN\u2019s model size and prediction time. Squeezing the RNN model and\ncode into a few Kilobytes could allow RNNs to be deployed on billions of Internet of Things (IoT)\nendpoints having just 2 KB RAM and 32 KB \ufb02ash memory [17, 29]. Similarly, squeezing the RNN\nmodel and code into a few Kilobytes of the 32 KB L1 cache of a Raspberry Pi or smartphone, could\nsigni\ufb01cantly reduce the prediction time and energy consumption and make RNNs feasible for real-\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\ftime applications such as wake word detection [27, 11, 12, 42, 43], predictive maintenance [46, 1],\nhuman activity recognition [3, 2], etc.\nUnitary and gated RNNs: A number of techniques have been proposed to stabilize RNN training\nbased on improved optimization algorithms [40, 26], unitary RNNs [5, 24, 37, 47, 50, 54, 25] and\ngated RNNs [20, 13, 14]. While such approaches have increased the RNN prediction accuracy they\nhave also signi\ufb01cantly increased the model size. 
Unitary RNNs have avoided gradients exploding and\nvanishing by limiting the range of the singular values of the hidden state transition matrix. This has\nled to only limited gains in prediction accuracy as the optimal transition matrix might often not be\nclose to unitary. Unitary RNNs have compensated by learning higher dimensional representations but,\nunfortunately, this has led to larger model sizes. Gated RNNs [20, 13, 14] have stabilized training by\nadding extra parameters leading to state-of-the-art prediction accuracies but with models that might\nsometimes be even larger than unitary RNNs.\nFastRNN: This paper demonstrates that standard RNN training could be stabilized with the addition\nof a residual connection [19, 44, 22, 7] having just 2 additional scalar parameters. Residual con-\nnections for RNNs have been proposed in [22] and further studied in [7]. This paper proposes the\nFastRNN architecture and establishes that a simple variant of [22, 7] with learnt weighted residual\nconnections (2) can lead to provably stable training and near state-of-the-art prediction accuracies\nwith lower prediction costs than all unitary and gated RNNs. 
In particular, FastRNN's prediction accuracies could be: (a) up to 19% higher than a standard RNN's; (b) often above those of all unitary RNNs; and (c) just shy of the accuracies of leading gated RNNs. FastRNN's empirical performance can be understood on the basis of theorems proving that, for an input sequence with T steps and an appropriate setting of the residual connection weights: (a) FastRNN converges to a stationary point within O(1/ε²) SGD iterations (see Theorem 3.1), independent of T, while the same analysis for a standard RNN reveals an upper bound of O(2^T) iterations; and (b) FastRNN's generalization error bound is independent of T, whereas the same proof technique reveals an exponential bound for standard RNNs.

FastGRNN: Inspired by this analysis, this paper develops the novel FastGRNN architecture by converting the residual connection to a gate while reusing the RNN matrices. This allowed FastGRNN to match, and sometimes exceed, state-of-the-art prediction accuracies of LSTM, GRU, UGRNN and other leading gated RNN techniques while having 2-4x fewer parameters. Enforcing FastGRNN's matrices to be low-rank, sparse and quantized led to a minor decrease in the prediction accuracy but resulted in models that could be up to 35x smaller and fit in 1-6 Kilobytes for many applications. For instance, using a 1 KB model, FastGRNN could match the prediction accuracies of all other RNNs at the task of recognizing the "Hey Cortana" wakeword. This allowed FastGRNN to be deployed on IoT endpoints, such as the Arduino Uno, which were too small to hold other RNN models. On slightly larger endpoints, such as the Arduino MKR1000 or Due, FastGRNN was found to be 18-42x faster at making predictions than other leading RNN methods.

Contributions: This paper makes two contributions.
First, it rigorously studies the residual-connection-based FastRNN architecture, which could often outperform unitary RNNs in terms of training time, prediction accuracy and prediction cost. Second, inspired by FastRNN, it develops the FastGRNN architecture, which could almost match state-of-the-art accuracies and training times but with prediction costs that could be lower by an order of magnitude. FastRNN and FastGRNN's code can be downloaded from [30].

2 Related Work

Residual connections: Residual connections have been studied extensively in CNNs [19, 44] as well as RNNs [22, 7]. The Leaky Integration Unit architecture [22] proposed residual connections for RNNs but was unable to learn the state transition matrix due to the problem of exploding and vanishing gradients. It therefore sampled the state transition matrix from a hand-crafted distribution with spectral radius less than one. This limitation was addressed in [7], where the state transition matrix was learnt but the residual connections were applied to only a few hidden units and with randomly sampled weights. Unfortunately, the distribution from which the weights were sampled could lead to an ill-conditioned optimization problem. In contrast, the FastRNN architecture leads to provably stable training with just two learnt weights connected to all the hidden units.

(a) FastRNN - Residual Connection    (b) FastGRNN - Gate

Figure 1: Block diagrams for FastRNN (a) and FastGRNN (b). FastGRNN uses shared matrices W, U to compute both the hidden state ht as well as the gate zt.

Unitary RNNs: Unitary RNNs [5, 50, 37, 24, 47, 25] stabilize RNN training by learning only well-conditioned state transition matrices. This limits their expressive power and prediction accuracy while increasing training time. For instance, SpectralRNN [54] learns a transition matrix with singular values in 1 ± ε. Unfortunately, the training algorithm converged only for small ε, thereby limiting accuracy on most datasets. Increasing the number of hidden units was found to increase accuracy somewhat, but at the cost of increased training time, prediction time and model size.

Gated RNNs: Gated architectures [20, 13, 14, 23] achieve state-of-the-art classification accuracies by adding extra parameters but also increase model size and prediction time. This has resulted in a trend to reduce the number of gates and parameters, with UGRNN [14] simplifying GRU [13], which in turn simplifies LSTM [20]. FastGRNN can be seen as a natural simplification of UGRNN where the RNN matrices are reused within the gate and are made low-rank, sparse and quantized so as to compress the model.

Efficient training and prediction: Efficient prediction algorithms have often been obtained by making sparsity and low-rank assumptions. Most unitary methods effectively utilize a low-rank representation of the state transition matrix to control prediction and training complexity [24, 54]. Sparsity, low-rank, and quantization were shown to be effective in RNNs [51, 39, 48], CNNs [18], trees [29] and nearest neighbour classifiers [17]. FastGRNN builds on these ideas to utilize low-rank, sparse and quantized representations for learning kilobyte-sized classifiers without compromising on classification accuracy. Other approaches to speed up RNN training and prediction are based on replacing sequential hidden state transitions by parallelizable convolutions [9] or on learning skip connections [10] so as to avoid evaluating all the hidden states.
Such techniques are complementary to the ones proposed in this paper and can be used to further improve FastGRNN's performance.

3 FastRNN and FastGRNN

Notation: Throughout the paper, the parameters of an RNN are denoted by matrices W ∈ R^{D̂×D}, U ∈ R^{D̂×D̂} and bias vectors b ∈ R^{D̂}, often with subscripts if multiple vectors are required to specify the architecture. a ⊙ b denotes the Hadamard product of a and b, i.e., (a ⊙ b)_i = a_i b_i. ‖·‖_0 denotes the number of non-zero entries of a matrix or vector. ‖·‖_F and ‖·‖_2 denote the Frobenius norm and the spectral norm of a matrix, respectively. Unless specified, ‖·‖ denotes the ‖·‖_2 norm of a matrix or vector. a⊤b = Σ_i a_i b_i denotes the inner product of a and b.

The standard RNN architecture [41] is known to be unstable to train due to exploding or vanishing gradients and hence is shunned in favour of more expensive gated architectures. This paper studies the FastRNN architecture, which is inspired by weighted residual connections [22, 19], and shows that FastRNN can be significantly more stable and accurate than the standard RNN while preserving its prediction complexity. In particular, Section 3.1.1 demonstrates parameter settings for FastRNN that guarantee well-conditioned gradients as well as a faster convergence rate and smaller generalization error than the standard RNN. This paper further strengthens FastRNN to develop the FastGRNN architecture, which is more accurate than unitary methods [5, 54] and provides accuracy comparable to the state-of-the-art gated RNNs at up to 35x lower computational cost (see Table 3).

3.1 FastRNN

Let X = [x1, . . . , xT] be the input data, where xt ∈ R^D denotes the t-th step feature vector.
Then, the goal of multi-class RNNs is to learn a function F : R^{D×T} → {1, . . . , L} that predicts one of L classes for the given data point X. The standard RNN architecture has a provision to produce an output at every time step, but we focus on the setting where each data point is associated with a single label that is predicted at the end of the time horizon T. The standard RNN maintains a vector of hidden state ht ∈ R^{D̂} which captures the temporal dynamics in the input data, i.e.,

ht = tanh(W xt + U ht−1 + b).   (1)

As explained in the next section, learning U, W in the above architecture is difficult as the gradient can have an exponentially large (in T) condition number. Unitary methods explicitly control the condition number of the gradient, but their training time can be significantly larger or the generated model can be less accurate. Instead, FastRNN uses a simple weighted residual connection to stabilize the training by generating well-conditioned gradients. In particular, FastRNN updates the hidden state ht as follows:

h̃t = σ(W xt + U ht−1 + b),
ht = α h̃t + β ht−1,   (2)

where 0 ≤ α, β ≤ 1 are trainable weights that are parameterized by the sigmoid function, and σ : R → R is a non-linear function such as tanh, sigmoid, or ReLU, and can vary across datasets. Given hT, the label for a given point X is predicted by applying a standard classifier, e.g., logistic regression, to hT. Typically, α ≪ 1 and β ≈ 1 − α, especially for problems with larger T. FastRNN updates the hidden state in a controlled manner, with α, β limiting the extent to which the current feature vector xt updates the hidden state. Also, FastRNN has only 2 more parameters than the standard RNN and requires only D̂ more computations, which is a tiny fraction of the RNN's per-step computation complexity.
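As a concrete illustration, the update (2) can be sketched in a few lines of NumPy (a minimal sketch with assumed names and shapes, not the released implementation in [30]):

```python
import numpy as np

def fastrnn_forward(X, W, U, b, alpha, beta, nonlin=np.tanh):
    """FastRNN recurrence, eq. (2):

        ~h_t = sigma(W x_t + U h_{t-1} + b)
        h_t  = alpha * ~h_t + beta * h_{t-1}

    X: (T, D) input sequence; W: (Dh, D); U: (Dh, Dh); b: (Dh,).
    alpha, beta are scalars already in (0, 1); during training the paper
    parameterizes them through a sigmoid so they stay in that range.
    Returns the final hidden state h_T of shape (Dh,).
    """
    h = np.zeros(U.shape[0])
    for x in X:
        h_tilde = nonlin(W @ x + U @ h + b)  # candidate state
        h = alpha * h_tilde + beta * h       # weighted residual connection
    return h
```

With β = 1 − α and α small, each step perturbs the hidden state only slightly, which is what keeps the products of Jacobians in the gradient well conditioned.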
Unlike unitary methods [5, 23, 54], FastRNN does not introduce expensive structural constraints on U and hence scales well to large datasets with standard optimization techniques [28].

3.1.1 Analysis

This section shows how FastRNN addresses the issue of ill-conditioned gradients, leading to stable training and smaller generalization error. For simplicity, assume that the label decision function is one dimensional and is given by f(X) = v⊤hT. Let L(X, y; θ) = L(f(X), y; θ) be the logistic loss function for the given labeled data point (X, y) with parameters θ = (W, U, v). Then, the gradient of L w.r.t. W, U, v is given by:

∂L/∂U = α Σ_{t=0}^{T} Dt ( Π_{k=t}^{T−1} (α U⊤ Dk+1 + β I) ) (∇_{hT} L) h⊤_{t−1},   (3)

∂L/∂W = α Σ_{t=0}^{T} Dt ( Π_{k=t}^{T−1} (α U⊤ Dk+1 + β I) ) (∇_{hT} L) x⊤_t,   ∂L/∂v = (−y exp(−y · v⊤hT) / (1 + exp(−y · v⊤hT))) hT,   (4)

where ∇_{hT} L = −c(θ) · y · v, and c(θ) = 1 / (1 + exp(y · v⊤hT)). A critical term in the above expression is M(U) = Π_{k=t}^{T−1} (α U⊤ Dk+1 + β I), whose condition number, κ_{M(U)}, is bounded by:

κ_{M(U)} ≤ (1 + (α/β) max_k ‖U⊤ Dk+1‖)^{T−t} / (1 − (α/β) max_k ‖U⊤ Dk+1‖)^{T−t},   (5)

where Dk = diag(σ′(W xk + U hk−1 + b)) is the Jacobian matrix of the pointwise nonlinearity. Also, if α = 1 and β = 0, which corresponds to the standard RNN, the condition number of M(U) can be as large as (max_k ‖U⊤ Dk+1‖ / λ_min(U⊤ Dk+1))^{T−t}, where λ_min(A) denotes the minimum singular value of A. Hence, the gradient's condition number for the standard RNN can be exponential in T. This implies that, relative to the average eigenvalue, the gradient can explode or vanish in certain directions, leading to unstable training.

In contrast to the standard RNN, if β ≈ 1 and α ≈ 0, then the condition number κ_{M(U)} for FastRNN is bounded by a small term. For example, if β = 1 − α and α = 1 / (T max_k ‖U⊤ Dk+1‖), then κ_{M(U)} = O(1). Existing unitary methods are also motivated by a similar observation, but they attempt to control κ_{M(U)} by restricting the condition number, κ_U, of U, which can still lead to ill-conditioned gradients as U⊤ Dk+1 might still be very small in certain directions. By using residual connections, FastRNN is able to address this issue, and hence trains faster and produces more accurate models than the state-of-the-art unitary RNNs.

Finally, by using the above observations and a careful perturbation analysis, we can provide the following convergence and generalization error bounds for FastRNN:

Theorem 3.1 (Convergence Bound). Let [(X1, y1), . . . , (Xn, yn)] be the given labeled sequential training data.
Let L(θ) = (1/n) Σ_i L(Xi, yi; θ) be the loss function, with θ = (W, U, v) the parameters of the FastRNN architecture (2), with β = 1 − α and α such that

α ≤ min{ 1 / (4T · |D‖U‖₂ − 1|), 1 / (4T · R_U), 1 / (T · |‖U‖₂ − 1|) },

where D = sup_{θ,k} ‖D_k^θ‖₂. Then randomized stochastic gradient descent [15], a minor variation of SGD, when applied to the data for a maximum of M iterations, outputs a solution θ̂ satisfying E[‖∇_θ L(θ̂)‖₂²] ≤ B_M ≤ ε, where R_X = ‖X‖_F for X ∈ {U, W, v}, L(θ0) is the loss of the initial classifier, the step-size of the k-th SGD iteration is fixed as γ_k = min{ 1/O(αT), √(D̄/M) }, k ∈ [M], D̄ ≥ 0, and the maximum number of iterations is bounded by M = O((αT/ε²) · poly(L(θ0), R_W R_U R_v, D̄)), ε ≥ 0.

Theorem 3.2 (Generalization Error Bound). [6] Let Y, Ŷ ⊆ [0, 1] and let F_T denote the class of FastRNNs with ‖U‖_F ≤ R_U, ‖W‖_F ≤ R_W. Let the final classifier be given by σ(v⊤hT), ‖v‖₂ ≤ R_v. Let L : Y × Ŷ → [0, B] be any 1-Lipschitz loss function. Let D be any distribution on X × Y such that ‖x_{it}‖₂ ≤ R_x a.s. Let 0 ≤ δ ≤ 1. For all β = 1 − α and α such that

α ≤ min{ 1 / (4T · |D‖U‖₂ − 1|), 1 / (4T · R_U), 1 / (T · |‖U‖₂ − 1|) },

where D = sup_{θ,k} ‖D_k^θ‖₂, we have that, with probability at least 1 − δ, all functions f ∈ v ∘ F_T satisfy

E_D[L(f(X), y)] ≤ (1/n) Σ_{i=1}^{n} L(f(Xi), yi) + C · O(αT)/√n + B √(ln(1/δ)/n),

where C = R_W R_U R_x R_v represents the boundedness of the parameter matrices and the data.

The convergence bound states that if α = O(1/T) then the algorithm converges to a stationary point in time that is constant with respect to T and polynomial with respect to all the other problem parameters. The generalization bound states that for α = O(1/T), the generalization error of FastRNN is independent of T. In contrast, a similar proof technique provides exponentially poor (in T) error bounds and convergence rates for the standard RNN. However, this is an upper bound, so potentially significantly better error bounds for RNNs might exist; matching lower bound results for the standard RNN is an interesting research direction. Also, an O(T²) generalization error bound can be argued using VC-dimension style arguments [4]. But such bounds hold only for specific settings, like binary y, and are independent of the problem hardness parameterized by the size of the weight matrices (R_W, R_U).

Finally, note that the above analysis fixes α = O(1/T), β = 1 − α, but in practice FastRNN learns α, β (which is similar to performing cross-validation on α, β).
However, interestingly, across datasets the learnt α, β values indeed display a similar scaling with respect to T for large T (see Figure 2).

3.2 FastGRNN

While FastRNN controls the condition number of the gradient reasonably well, its expressive power might be limited for some datasets. This concern is addressed by a novel architecture, FastGRNN, that uses a scalar weighted residual connection for each and every coordinate of the hidden state ht. That is,

zt = σ(W xt + U ht−1 + bz),
h̃t = tanh(W xt + U ht−1 + bh),
ht = (ζ(1 − zt) + ν) ⊙ h̃t + zt ⊙ ht−1,   (6)

where 0 ≤ ζ, ν ≤ 1 are trainable parameters that are parameterized by the sigmoid function, and σ : R → R is a non-linear function such as tanh or sigmoid and can vary across datasets. Note that each coordinate of zt plays the role of the parameter β in (2), and the coordinates of ζ(1 − zt) + ν simulate the α parameter; also, if ν ≈ 0 and ζ ≈ 1, then this matches the intuition that α + β = 1. It was observed that, across all datasets, this gating mechanism outperformed the simple vector extension of FastRNN in which each coordinate of α and β is learnt (see Appendix G).

FastGRNN computes each coordinate of the gate zt using a non-linear function of xt and ht−1. To minimize the number of parameters, FastGRNN reuses the matrices W, U for the vector-valued gating function as well. Hence, FastGRNN's inference complexity is almost the same as that of the standard RNN, but its accuracy and training stability are on par with those of expensive gated architectures like GRU and LSTM.

Sparse low-rank representation: FastGRNN further compresses the model size by using a low-rank and a sparse representation of the parameter matrices W, U.
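The update (6) can be sketched as follows (a minimal NumPy sketch with assumed names; the low-rank, sparse and quantized compression is omitted here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fastgrnn_forward(X, W, U, b_z, b_h, zeta, nu):
    """FastGRNN recurrence, eq. (6); the same W, U serve both the gate
    and the candidate state, which is what keeps the model small.

        z_t  = sigmoid(W x_t + U h_{t-1} + b_z)               # gate
        ~h_t = tanh(W x_t + U h_{t-1} + b_h)                  # candidate
        h_t  = (zeta * (1 - z_t) + nu) * ~h_t + z_t * h_{t-1}

    zeta, nu are scalars in (0, 1) (sigmoid-parameterized during training).
    """
    h = np.zeros(U.shape[0])
    for x in X:
        pre = W @ x + U @ h          # shared pre-activation, computed once
        z = sigmoid(pre + b_z)
        h_tilde = np.tanh(pre + b_h)
        h = (zeta * (1.0 - z) + nu) * h_tilde + z * h
    return h
```

Note that the shared pre-activation `pre` is computed once per step, so the gate adds only two bias vectors and two scalars over the standard RNN.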
Specifically,

W = W1 (W2)⊤, U = U1 (U2)⊤, ‖Wi‖0 ≤ s_w^i, ‖Ui‖0 ≤ s_u^i, i ∈ {1, 2},   (7)

where W1 ∈ R^{D̂×r_w}, W2 ∈ R^{D×r_w}, and U1, U2 ∈ R^{D̂×r_u}. The hyperparameters r_w, s_w, r_u, s_u provide an efficient way to control the accuracy-memory trade-off for FastGRNN and are typically set via fine-grained validation. In particular, such compression is critical for the FastGRNN model to fit on resource-constrained devices. Second, this low-rank representation brings down the prediction time by reducing the cost at each time step from O(D̂(D + D̂)) to O(r_w(D + D̂) + r_u D̂). This enables FastGRNN to provide on-device prediction in real time on battery-constrained devices.

3.2.1 Training FastGRNN

The parameters of FastGRNN, Θ_FastGRNN = (Wi, Ui, bh, bz, ζ, ν), are trained jointly using projected batch stochastic gradient descent (b-SGD) (or other stochastic optimization methods) with typical batch sizes ranging from 64 to 128. In particular, the optimization problem is given by:

min_{Θ_FastGRNN : ‖Wi‖0 ≤ s_w^i, ‖Ui‖0 ≤ s_u^i, i ∈ {1, 2}}  J(Θ_FastGRNN) = (1/n) Σ_j L(Xj, yj; Θ_FastGRNN),   (8)

where L denotes the appropriate loss function (typically softmax cross-entropy). The training procedure for FastGRNN is divided into 3 stages:

(I) Learning low-rank representation (L): In the first stage of training, FastGRNN is trained for e1 epochs with the model as specified by (7) using b-SGD.
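Projection onto the sparsity constraints in (8) amounts to hard thresholding, i.e., keeping the s largest-magnitude entries of a matrix. An illustrative sketch (not the released training code):

```python
import numpy as np

def hard_threshold(A, s):
    """Project matrix A onto {X : ||X||_0 <= s} by keeping its s
    largest-magnitude entries and zeroing out the rest."""
    if s >= A.size:
        return A.copy()
    out = np.zeros_like(A)
    if s == 0:
        return out
    flat = np.abs(A).ravel()
    idx = np.argpartition(flat, -s)[-s:]  # indices of the s largest |A| entries
    out.ravel()[idx] = A.ravel()[idx]
    return out
```

Applying this projection every few batches while otherwise running plain b-SGD is exactly the iterative-hard-thresholding pattern used in the staged training below.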
This stage of optimization ignores the\nsparsity constraints on the parameters and learns a low-rank representation of the parameters.\n(II) Learning sparsity structure (S): FastGRNN is next trained for e2 epochs using b-SGD, pro-\njecting the parameters onto the space of sparse low-rank matrices after every few batches while\nmaintaining support between two consecutive projection steps. This stage, using b-SGD with Iterative\nHard Thresholding (IHT), helps FastGRNN identify the correct support for parameters (Wi, Ui).\n(III) Optimizing with \ufb01xed parameter support: In the last stage, FastGRNN is trained for e3\nepochs with b-SGD while freezing the support set of the parameters.\nIn practice, it is observed that e1 = e2 = e3 = 100 generally leads to the convergence of FastGRNN\nto a good solution. Early stopping is often deployed in stages (II) and (III) to obtain the best models.\n\n3.3 Byte Quantization (Q)\n\nFastGRNN further compresses the model by quantizing each element of Wi, Ui, restricting them to\nat most one byte along with byte indexing for sparse models. However, simple integer quantization\nof Wi, Ui leads to a large loss in accuracy due to gross approximation. Moreover, while such a\nquantization reduces the model size, the prediction time can still be large as non-linearities will\nrequire all the hidden states to be \ufb02oating point. FastGRNN overcomes these shortcomings by training\nWi and Ui using piecewise-linear approximation of the non-linear functions, thereby ensuring that\nall the computations can be performed with integer arithmetic. During training, FastGRNN replaces\nthe non-linear function in (6) with their respective approximations and uses the above mentioned\ntraining procedure to obtain \u0398FastGRNN. 
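The piecewise-linear surrogates and the one-byte quantization can be sketched as follows (illustrative choices; the exact surrogates and quantization scheme in the released implementation may differ):

```python
import numpy as np

def quantize_tanh(x):
    """Piecewise-linear surrogate of tanh: clip(x, -1, 1)."""
    return np.clip(x, -1.0, 1.0)

def quantize_sigmoid(x):
    """Piecewise-linear surrogate of sigmoid: clip((x + 1) / 2, 0, 1)."""
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def quantize_matrix(A):
    """Symmetric one-byte quantization: A is approximated as scale * Aq
    with Aq an int8 matrix and scale a single float per matrix."""
    amax = np.max(np.abs(A))
    scale = amax / 127.0 if amax > 0 else 1.0
    Aq = np.round(A / scale).astype(np.int8)
    return Aq, scale
```

Because the surrogates are piecewise-linear, the hidden state stays in a fixed range and the whole recurrence can be evaluated with integer multiplies, adds and clips.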
The floating point parameters are then jointly quantized to ensure that all the relevant entities are integer-valued and the entire inference computation can be executed efficiently with integer arithmetic without a significant drop in accuracy. For instance, Tables 4 and 5 show that on several datasets FastGRNN models are 3-4x faster than their corresponding FastGRNN-Q models on common IoT boards with no floating point unit (FPU). FastGRNN-LSQ, FastGRNN "minus" the Low-rank, Sparse and Quantized components, is the base model with no compression.

4 Experiments

Datasets: FastRNN and FastGRNN's performance was benchmarked on the following IoT tasks, where low model sizes and prediction times are critical to the success of the application: (a) Wakeword-2 [45] - detecting utterances of the "Hey Cortana" wakeword; (b) Google-30 [49] and Google-12 - detection of utterances of 30 and 10 commands plus background noise and silence; and (c) HAR-2 [3] and DSA-19 [2] - Human Activity Recognition (HAR) from an accelerometer and gyroscope on a Samsung Galaxy S3 smartphone, and Daily and Sports Activity (DSA) detection from a resource-constrained IoT wearable device with 5 Xsens MTx sensors having accelerometers, gyroscopes and magnetometers on the torso and four limbs. Traditional RNN tasks typically do not have prediction constraints and are therefore not the focus of this paper. Nevertheless, for the sake of completeness, experiments were also carried out on benchmark RNN tasks such as language modeling on the Penn Treebank (PTB) dataset [33], star rating prediction on a scale of 1 to 5 of Yelp reviews [52] and classification of MNIST images on a pixel-by-pixel sequence [32, 31]. All datasets, apart from Wakeword-2, are publicly available and their pre-processing and feature extraction details are provided in Appendix B.
The publicly provided training set for each dataset was subdivided into 80% for training and 20% for validation. Once the hyperparameters had been fixed, the algorithms were trained on the full training set and results were reported on the publicly available test set. Table 1 lists the statistics of all datasets.

Baseline algorithms and implementation: FastRNN and FastGRNN were compared to the standard RNN [41], leading unitary RNN approaches such as SpectralRNN [54], Orthogonal RNN (oRNN) [37], Efficient Unitary Recurrent Neural Networks (EURNN) [24] and FactoredRNN [47], and state-of-the-art gated RNNs including UGRNN [14], GRU [13] and LSTM [20]. Details of these methods are provided in Section 2. Native Tensorflow implementations were used for the LSTM and GRU architectures. For all the other RNNs, publicly available implementations provided by the authors were used, taking care to ensure that published results could be reproduced, thereby verifying the code and hyper-parameter settings. All experiments were run on an Nvidia Tesla P40 GPU with CUDA 9.0 and cuDNN 7.1 on a machine with an Intel Xeon 2.60 GHz CPU with 12 cores.

Hyper-parameters: The hyper-parameters of each algorithm were set by a fine-grained validation wherever possible, or according to the settings recommended by the authors otherwise. Adam, Nesterov Momentum and SGD were used to optimize each algorithm on each dataset, and the optimizer with the best validation performance was selected. The learning rate was initialized to 10^-2 for all architectures except for RNNs, where the learning rate was initialized to 10^-3 to ensure stable training. Each algorithm was run for 200 epochs, after which the learning rate was decreased by a factor of 10^-1 and the algorithm run again for another 100 epochs. This procedure was carried out on all datasets except for Pixel MNIST, where the learning rate was decayed by 1/2 after each pass of 200 epochs.
Batch sizes between 64 and 128 training points were tried for most architectures, and a batch size of 100 was found to work well in general, except for standard RNNs, which required a batch size of 512. FastRNN used tanh as the non-linearity in most cases, except for a few (indicated by +) where ReLU gave slightly better results. Table 11 in the Appendix lists the non-linearity, optimizer and hyper-parameter settings for FastGRNN on all datasets.

Table 1: Dataset Statistics

Dataset        | #Train  | #Features | #Time Steps | #Test
Google-12      | 22,246  | 3,168     | 99          | 3,081
Google-30      | 51,088  | 3,168     | 99          | 6,835
Wakeword-2     | 195,800 | 5,184     | 162         | 83,915
Yelp-5         | 500,000 | 38,400    | 300         | 500,000
HAR-2          | 7,352   | 1,152     | 128         | 2,947
Pixel-MNIST-10 | 60,000  | 784       | 784         | 10,000
PTB-10000      | 929,589 | —         | 300         | 82,430
DSA-19         | 4,560   | 5,625     | 125         | 4,560

Table 2: PTB Language Modeling - 1 Layer

Method       | Test Perplexity | Train Perplexity | Model Size (KB) | Train Time (min)
RNN          | 144.71          | 68.11            | 129             | 9.11
FastRNN      | 127.76+         | 109.07           | 513             | 11.20
FastGRNN-LSQ | 115.92          | 89.58            | 513             | 12.53
FastGRNN     | 116.11          | 81.31            | 39              | 13.75
SpectralRNN  | 130.20          | 65.42            | 242             | —
UGRNN        | 119.71          | 65.25            | 256             | 11.12
LSTM         | 117.41          | 69.44            | 2052            | 13.52

Evaluation criteria: The emphasis in this paper is on designing RNN architectures which can run on low-memory IoT devices and which are efficient at prediction time. As such, the model size of each architecture is reported along with its training time and classification accuracy (F1 score on the Wakeword-2 dataset and perplexity on the PTB dataset). Prediction times on some of the popular IoT boards are also reported. Note that, for NLP applications such as PTB and Yelp, just the model size of the various RNN architectures has been reported.
In a real application, the size of the learnt word-vector embeddings (10 MB for FastRNN and FastGRNN) would also have to be considered.
Results: Tables 2 and 3 compare the performance of FastRNN, FastGRNN and FastGRNN-LSQ to state-of-the-art RNNs. Three points are worth noting about FastRNN's performance. First, FastRNN's prediction accuracy gains over a standard RNN ranged from 2.34% on the Pixel-MNIST dataset to 19% on the Google-12 dataset. Second, FastRNN's prediction accuracy could surpass leading unitary RNNs on 6 out of the 8 datasets, with gains of up to 2.87% and 3.77% over SpectralRNN on the Google-12 and DSA-19 datasets respectively. Third, FastRNN's training speedups over all unitary and gated RNNs ranged from 1.2x over UGRNN on the Yelp-5 and DSA-19 datasets to 196x over EURNN on the Google-12 dataset. This demonstrates that the vanishing and exploding gradient problem can be overcome by the addition of a simple weighted residual connection to the standard RNN architecture, thereby allowing FastRNN to train efficiently and stably. This also demonstrates that the residual connection offers a theoretically principled architecture that can often result in accuracy gains without limiting the expressive power of the hidden state transition matrix.
Tables 2 and 3 also demonstrate that FastGRNN-LSQ could be more accurate and faster to train than all unitary RNNs. Furthermore, FastGRNN-LSQ could match the accuracies and training times of state-of-the-art gated RNNs while having models that were 1.18-4.87x smaller. This demonstrates that extending the residual connection to a gate which reuses the RNN matrices increased accuracy with virtually no increase in model size over FastRNN in most cases.
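For intuition, the two updates discussed above can be sketched in a few lines of NumPy. This is a hedged reconstruction from the paper's description: a candidate state is computed as in a standard RNN and combined with the previous state, either through two scalars α, β (FastRNN) or through a gate z that reuses the same W and U (FastGRNN). The scalar parameterizations of α, β, ζ, ν and all training details are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fastrnn_step(x, h, W, U, b, alpha, beta):
    """FastRNN: a weighted residual connection with two scalars alpha, beta."""
    h_tilde = np.tanh(W @ x + U @ h + b)
    return alpha * h_tilde + beta * h

def fastgrnn_step(x, h, W, U, b_z, b_h, zeta, nu):
    """FastGRNN: the residual connection extended to a gate z that reuses
    the same W and U as the hidden-state update."""
    pre = W @ x + U @ h           # shared pre-activation, computed once
    z = sigmoid(pre + b_z)        # gate
    h_tilde = np.tanh(pre + b_h)  # candidate state
    return (zeta * (1.0 - z) + nu) * h_tilde + z * h
```

Reusing the pre-activation for both the gate and the candidate state is what lets the gate be added with only two extra bias vectors and two scalars, rather than a second pair of matrices.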
In fact, on Google-30 and Pixel-MNIST FastGRNN-LSQ's model size was lower than FastRNN's as it had a lower hidden dimension, indicating that the gate efficiently increased expressive power.
Finally, Tables 2 and 3 show that FastGRNN's accuracy was at most 1.13% worse than the best RNN but its model could be up to 35x smaller, even as compared to low-rank unitary methods such as SpectralRNN. Figures 3 and 4 in the Appendix also show that FastGRNN-LSQ's and FastGRNN's classification accuracies could be higher than those obtained by the best unitary and gated RNNs for any given model size in the 0-128 KB range. This demonstrates the effectiveness of making FastGRNN's parameters low-rank, sparse and quantized and allows FastGRNN to fit on the Arduino Uno, which has just 2 KB RAM and 32 KB flash memory. In particular, FastGRNN was able to recognize the "Hey Cortana" wakeword just as accurately as leading RNNs but with a 1 KB model.

Table 3: FastGRNN had up to 35x smaller models than leading RNNs with almost no loss in accuracy. In each panel the methods are grouped as in the paper: the standard RNN baseline; the proposed FastRNN, FastGRNN-LSQ and FastGRNN; the unitary SpectralRNN, EURNN, oRNN and FactoredRNN; and the gated UGRNN, GRU and LSTM.

Google-12
Method       | Accuracy (%) | Model Size (KB) | Train Time (hr)
RNN          | 73.25        | 56              | 1.11
FastRNN      | 92.21+       | 56              | 0.61
FastGRNN-LSQ | 93.18        | 57              | 0.63
FastGRNN     | 92.10        | 5.5             | 0.75
SpectralRNN  | 91.59        | 228             | 19.00
EURNN        | 76.79        | 210             | 120.00
oRNN         | 88.18        | 102             | 16.00
FactoredRNN  | 53.33        | 1114            | 7.00
UGRNN        | 92.63        | 75              | 0.78
GRU          | 93.15        | 248             | 1.23
LSTM         | 92.30        | 212             | 1.36

Google-30
Method       | Accuracy (%) | Model Size (KB) | Train Time (hr)
RNN          | 80.05        | 63              | 2.13
FastRNN      | 91.60+       | 96              | 1.30
FastGRNN-LSQ | 92.03        | 45              | 1.41
FastGRNN     | 90.78        | 6.25            | 1.77
SpectralRNN  | 88.73        | 128             | 11.00
EURNN        | 56.35        | 135             | 19.00
oRNN         | 86.95        | 120             | 35.00
FactoredRNN  | 40.57        | 1150            | 8.52
UGRNN        | 90.54        | 260             | 2.11
GRU          | 91.41        | 257             | 2.70
LSTM         | 90.31        | 219             | 2.63

Yelp-5
Method       | Accuracy (%) | RNN Model Size (KB) | Train Time (hr)
RNN          | 47.59        | 130                 | 3.33
FastRNN      | 55.38        | 130                 | 3.61
FastGRNN-LSQ | 59.51        | 130                 | 3.91
FastGRNN     | 59.43        | 8                   | 4.62
SpectralRNN  | 56.56        | 89                  | 4.92
EURNN        | 59.01        | 122                 | 72.00
oRNN         | —            | —                   | —
FactoredRNN  | —            | —                   | —
UGRNN        | 58.67        | 258                 | 4.34
GRU          | 59.02        | 388                 | 8.12
LSTM         | 59.49        | 516                 | 8.61

HAR-2
Method       | Accuracy (%) | Model Size (KB) | Train Time (hr)
RNN          | 91.31        | 29              | 0.11
FastRNN      | 94.50+       | 29              | 0.06
FastGRNN-LSQ | 95.38        | 29              | 0.08
FastGRNN     | 95.59        | 3               | 0.10
SpectralRNN  | 95.48        | 525             | 0.73
EURNN        | 93.11        | 12              | 0.84
oRNN         | 94.57        | 22              | 2.72
FactoredRNN  | 78.65        | 1               | 0.11
UGRNN        | 94.53        | 37              | 0.12
GRU          | 93.62        | 71              | 0.13
LSTM         | 93.65        | 74              | 0.18

Wakeword-2
Method       | F1 Score | Model Size (KB) | Train Time (hr)
RNN          | 89.17    | 8               | 0.28
FastRNN      | 97.09    | 8               | 0.69
FastGRNN-LSQ | 98.19    | 8               | 0.83
FastGRNN     | 97.83    | 1               | 1.08
SpectralRNN  | 96.75    | 17              | 7.00
EURNN        | 92.22    | 24              | 69.00
oRNN         | —        | —               | —
FactoredRNN  | —        | —               | —
UGRNN        | 98.17    | 16              | 1.00
GRU          | 97.63    | 24              | 1.38
LSTM         | 97.82    | 32              | 1.71

Pixel-MNIST-10
Method       | Accuracy (%) | Model Size (KB) | Train Time (hr)
RNN          | 94.10        | 71              | 45.56
FastRNN      | 96.44        | 166             | 15.10
FastGRNN-LSQ | 98.72        | 71              | 12.57
FastGRNN     | 98.20        | 6               | 16.97
SpectralRNN  | 97.70        | 25              | 122.00
EURNN        | 95.38        | 64              | —
oRNN         | 97.20        | 49              | —
FactoredRNN  | 94.60        | 125             | —
UGRNN        | 97.29        | 84              | 15.17
GRU          | 98.70        | 123             | 23.67
LSTM         | 97.80        | 265             | 26.57

DSA-19
Method       | Accuracy (%) | Model Size (KB) | Train Time (min)
RNN          | 71.68        | 20              | 1.11
FastRNN      | 84.14        | 97              | 1.92
FastGRNN-LSQ | 85.00        | 208             | 2.15
FastGRNN     | 83.73        | 3.25            | 2.10
SpectralRNN  | 80.37        | 50              | 2.25
EURNN        | —            | —               | —
oRNN         | 72.52        | 18              | —
FactoredRNN  | 73.20        | 1154            | —
UGRNN        | 84.74        | 399             | 2.31
GRU          | 84.84        | 270             | 2.33
LSTM         | 84.84        | 526             | 2.58

Table 4: Prediction time in ms on the Arduino MKR1000
Method      | Google-12 | HAR-2 | Wakeword-2
FastGRNN    | 537       | 162   | 175
FastGRNN-Q  | 2282      | 553   | 755
RNN         | 12028     | 2249  | 2232
UGRNN       | 22875     | 4207  | 6724
SpectralRNN | 70902     | —     | 10144

Table 5: Prediction time in ms on the Arduino Due
Method      | Google-12 | HAR-2 | Wakeword-2
FastGRNN    | 242       | 62    | 77
FastGRNN-Q  | 779       | 172   | 238
RNN         | 3472      | 590   | 653
UGRNN       | 6693      | 1142  | 1823
SpectralRNN | 17766     | 55558 | 2691

Prediction on IoT boards: Unfortunately, apart from FastGRNN, most RNNs were too large to fit on an Arduino Uno. On the slightly more powerful Arduino MKR1000, which has an ARM Cortex M0+ microcontroller operating at 48 MHz with 32 KB RAM and 256 KB flash memory, Table 4 shows that FastGRNN could achieve the same prediction accuracy while being 25-45x faster at prediction than UGRNN and 57-132x faster than SpectralRNN. Results on the even more powerful Arduino Due are presented in Table 5, while results on the Raspberry Pi are presented in Table 12 of the Appendix.
Ablations, extensions and parameter settings: Enforcing that FastGRNN's matrices be low-rank led to a slight increase in prediction accuracy and a reduction in prediction costs, as shown in the ablation experiments in Tables 8, 9 and 10 in the Appendix. Adding sparsity and quantization led to a slight drop in accuracy but resulted in significantly smaller models. Next, Table 16 in the Appendix shows that regularization and layering techniques [36] that have been proposed to increase the prediction accuracy of other gated RNNs are also effective for FastGRNN and can lead to reductions in perplexity on the PTB dataset. Finally, Figure 2 and Table 7 of the Appendix measure the agreement between FastRNN's theoretical analysis and empirical observations. Figure 2 (a) shows that the α learnt on datasets with T time steps is a decreasing function of T, and Figure 2 (b) shows that the learnt α and β follow the relation α/β ≈ O(1/T) for large T, which is one of the settings in which FastRNN's gradients stabilize and training converges quickly, as proved by Theorems 3.1 and 3.2.
Furthermore, β can be seen to be close to 1 − α for large T in Figure 2 (c), as assumed in Section 3.1.1 for the convergence of long sequences. For instance, the relative error between β and 1 − α was 2.15% for Google-12 with 99 timesteps, 3.21% for HAR-2 with 128 timesteps and 0.68% for MNIST-10 with 112 timesteps. However, for short sequences, where there was a lower likelihood of gradients exploding or vanishing, β was found to deviate significantly from 1 − α as this led to improved prediction accuracy. Enforcing that β = 1 − α on short sequences was found to drop accuracy by up to 1.5%.

Figure 2: Plots (a) and (b) show the variation of α and α/β of FastRNN with respect to 1/T for three datasets. Plot (c) shows the relation between β and 1 − α. In accordance with Theorem 3.1, the learnt values of α and α/β scale as O(1/T) while β → 1 − α for long sequences.

5 Conclusions

This paper proposed the FastRNN and FastGRNN architectures for efficient RNN training and prediction. FastRNN could lead to provably stable training by incorporating a residual connection with two scalar parameters into the standard RNN architecture. FastRNN was demonstrated to have lower training times, lower prediction costs and higher prediction accuracies than leading unitary RNNs in most cases. FastGRNN extended the residual connection to a gate reusing the RNN matrices and was able to match the accuracies of state-of-the-art gated RNNs but with significantly lower prediction costs. FastGRNN's model could be compressed to 1-6 KB without compromising accuracy in many cases by enforcing that its parameters be low-rank, sparse and quantized.
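The three-step parameterization just summarized can be illustrated with a one-shot sketch applied to a single weight matrix. In the paper these constraints are enforced while the model is trained rather than applied after the fact, so the function below, with all names hypothetical, only shows the structure of the compressed parameters:

```python
import numpy as np

def compress(M, rank, keep_frac, scale=127):
    """Illustrative low-rank + sparse + quantized factorization of M:
    (1) low-rank: M ~ M1 @ M2 via truncated SVD,
    (2) sparse: zero all but the largest-magnitude entries of each factor,
    (3) quantized: store each factor as int8 with one float scale."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M1 = U[:, :rank] * s[:rank]   # left factor absorbs singular values
    M2 = Vt[:rank, :]

    def sparsify_quantize(A):
        k = max(1, int(keep_frac * A.size))
        thresh = np.sort(np.abs(A), axis=None)[-k]   # magnitude cutoff
        A = np.where(np.abs(A) >= thresh, A, 0.0)
        a = np.abs(A).max()
        return np.round(A / a * scale).astype(np.int8), a / scale

    return sparsify_quantize(M1), sparsify_quantize(M2)
```

Storing int8 factors plus one float scale per factor is roughly how byte quantization shrinks storage 4x relative to float32, on top of the low-rank and sparsity savings.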
This allowed FastGRNN to make accurate predictions efficiently on severely resource-constrained IoT devices too tiny to hold other RNN models.

Acknowledgements

We are grateful to Ankit Anand, Niladri Chatterji, Kunal Dahiya, Don Dennis, Inderjit S. Dhillon, Dinesh Khandelwal, Shishir Patil, Adithya Pratapa, Harsha Vardhan Simhadri and Raghav Somani for helpful discussions and feedback. KB acknowledges the support of the NSF through grant IIS-1619362 and of the AFOSR through grant FA9550-17-1-0308.

References

[1] S. Ahmad, A. Lavin, S. Purdy, and Z. Agha. Unsupervised real-time anomaly detection for streaming data. Neurocomputing, 262:134–147, 2017.

[2] K. Altun, B. Barshan, and O. Tunçel. Comparative study on classifying human activities with miniature inertial and magnetic sensors. Pattern Recognition, 43(10):3605–3620, 2010. URL https://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities.

[3] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz. Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine. In International Workshop on Ambient Assisted Living, pages 216–223. Springer, 2012. URL https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones.

[4] M. Anthony and P. L. Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.

[5] M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016.

[6] P. L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

[7] Y. Bengio, N. Boulanger-Lewandowski, and R.
Pascanu. Advances in optimizing recurrent networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8624–8628. IEEE, 2013.

[8] K. Bhatia, K. Dahiya, H. Jain, Y. Prabhu, and M. Varma. The Extreme Classification Repository: Multi-label Datasets & Code. URL http://manikvarma.org/downloads/XC/XMLRepository.html.

[9] J. Bradbury, S. Merity, C. Xiong, and R. Socher. Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576, 2016.

[10] V. Campos, B. Jou, X. G. i Nieto, J. Torres, and S.-F. Chang. Skip RNN: Learning to skip state updates in recurrent neural networks. In International Conference on Learning Representations, 2018.

[11] G. Chen, C. Parada, and G. Heigold. Small-footprint keyword spotting using deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 4087–4091. IEEE, 2014.

[12] G. Chen, C. Parada, and T. N. Sainath. Query-by-example keyword spotting using long short-term memory networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5236–5240. IEEE, 2015.

[13] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

[14] J. Collins, J. Sohl-Dickstein, and D. Sussillo. Capacity and trainability in recurrent neural networks. arXiv preprint arXiv:1611.09913, 2016.

[15] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[16] N. Golowich, A. Rakhlin, and O. Shamir. Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541, 2017.

[17] C. Gupta, A. S. Suggala, A. Gupta, H. V. Simhadri, B. Paranjape, A. Kumar, S. Goyal, R. Udupa, M. Varma, and P. Jain.
ProtoNN: Compressed and accurate kNN for resource-scarce devices. In Proceedings of the International Conference on Machine Learning, August 2017.

[18] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR, 2016.

[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[20] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[21] H. Inan, K. Khosravi, and R. Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.

[22] H. Jaeger, M. Lukosevicius, D. Popovici, and U. Siewert. Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks, 20(3):335–352, 2007.

[23] L. Jing, C. Gulcehre, J. Peurifoy, Y. Shen, M. Tegmark, M. Soljačić, and Y. Bengio. Gated orthogonal recurrent units: On learning to forget. arXiv preprint arXiv:1706.02761, 2017.

[24] L. Jing, Y. Shen, T. Dubcek, J. Peurifoy, S. Skirlo, M. Tegmark, and M. Soljačić. Tunable efficient unitary neural networks (EUNN) and their application to RNN. In International Conference on Machine Learning, 2017.

[25] C. Jose, M. Cisse, and F. Fleuret. Kronecker recurrent units. In J. Dy and A. Krause, editors, International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2380–2389, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.

[26] S. Kanai, Y. Fujiwara, and S. Iwamura. Preventing gradient explosions in gated recurrent units. In Advances in Neural Information Processing Systems, pages 435–444, 2017.

[27] V. Këpuska and T. Klein.
A novel wake-up-word speech recognition system, wake-up-word recognition task, technology and evaluation. Nonlinear Analysis: Theory, Methods & Applications, 71(12):e2772–e2789, 2009.

[28] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[29] A. Kumar, S. Goyal, and M. Varma. Resource-efficient machine learning in 2 KB RAM for the internet of things. In Proceedings of the International Conference on Machine Learning, August 2017.

[30] A. Kusupati, D. Dennis, C. Gupta, A. Kumar, S. Patil, and H. Simhadri. The EdgeML Library: An ML library for machine learning on the Edge, 2017. URL https://github.com/Microsoft/EdgeML.

[31] Q. V. Le, N. Jaitly, and G. E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.

[32] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[33] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

[34] J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 165–172. ACM, 2013.

[35] G. Melis, C. Dyer, and P. Blunsom. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589, 2017.

[36] S. Merity, N. S. Keskar, and R. Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.

[37] Z. Mhammedi, A. Hellicar, A. Rahman, and J. Bailey. Efficient orthogonal parametrisation of recurrent neural networks using householder reflections. In International Conference on Machine Learning, 2017.

[38] T. Mikolov and G. Zweig.
Context dependent recurrent neural network language model. In SLT, pages 234–239, 2012.

[39] S. Narang, E. Elsen, G. Diamos, and S. Sengupta. Exploring sparsity in recurrent neural networks. arXiv preprint arXiv:1704.05119, 2017.

[40] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.

[41] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.

[42] T. N. Sainath and C. Parada. Convolutional neural networks for small-footprint keyword spotting. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[43] Siri Team, Apple. Hey Siri: An on-device DNN-powered voice trigger for Apple's personal assistant, 2017. URL https://machinelearning.apple.com/2017/10/01/hey-siri.html.

[44] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

[45] STCI, Microsoft. Wakeword dataset.

[46] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, 2015.

[47] E. Vorontsov, C. Trabelsi, S. Kadoury, and C. Pal. On orthogonality and learning recurrent networks with long term dependencies. In International Conference on Machine Learning, 2017.

[48] Z. Wang, J. Lin, and Z. Wang. Accelerating recurrent neural networks: A memory-efficient approach. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25(10):2763–2775, 2017.

[49] P. Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018. URL http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz.

[50] S. Wisdom, T. Powers, J. Hershey, J.
Le Roux, and L. Atlas. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pages 4880–4888, 2016.

[51] J. Ye, L. Wang, G. Li, D. Chen, S. Zhe, X. Chu, and Z. Xu. Learning compact recurrent neural networks with block-term tensor decomposition. arXiv preprint arXiv:1712.05134, 2017.

[52] Yelp Inc. Yelp dataset challenge, 2017. URL https://www.yelp.com/dataset/challenge.

[53] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.

[54] J. Zhang, Q. Lei, and I. S. Dhillon. Stabilizing gradients for deep neural networks via efficient SVD parameterization. In International Conference on Machine Learning, 2018.