{"title": "Shallow RNN: Accurate Time-series Classification on Resource Constrained Devices", "book": "Advances in Neural Information Processing Systems", "page_first": 12916, "page_last": 12926, "abstract": "Recurrent Neural Networks (RNNs) capture long dependencies and context, and\n2 hence are the key component of typical sequential data based tasks. However, the\nsequential nature of RNNs dictates a large inference cost for long sequences even if\nthe hardware supports parallelization. To induce long-term dependencies, and yet\nadmit parallelization, we introduce novel shallow RNNs. In this architecture, the\nfirst layer splits the input sequence and runs several independent RNNs. The second\nlayer consumes the output of the first layer using a second RNN thus capturing\nlong dependencies. We provide theoretical justification for our architecture under\nweak assumptions that we verify on real-world benchmarks. Furthermore, we show\nthat for time-series classification, our technique leads to substantially improved\ninference time over standard RNNs without compromising accuracy. For example,\nwe can deploy audio-keyword classification on tiny Cortex M4 devices (100MHz\nprocessor, 256KB RAM, no DSP available) which was not possible using standard\nRNN models. 
Similarly, using ShaRNN in the popular Listen-Attend-Spell (LAS)\narchitecture for phoneme classification [4], we can reduce the lag in phoneme\nclassification by 10-12x while maintaining state-of-the-art accuracy.", "full_text": "Shallow RNNs: A Method for Accurate Time-series Classification on Tiny Devices\n\nDon Kurian Dennis* (Carnegie Mellon University), Durmus Alp Emre Acar (Boston University), Vikram Mandikal\u2020 (University of Texas at Austin), Vinu Sankar Sadasivan\u2020 (IIT Gandhinagar), Harsha Vardhan Simhadri (Microsoft Research India), Venkatesh Saligrama (Boston University), Prateek Jain (Microsoft Research India)\n\nAbstract\n\nRecurrent Neural Networks (RNNs) capture long dependencies and context, and hence are the key component of typical sequential data based tasks. However, the sequential nature of RNNs dictates a large inference cost for long sequences even if the hardware supports parallelization. To induce long-term dependencies, and yet admit parallelization, we introduce novel shallow RNNs. In this architecture, the first layer splits the input sequence and runs several independent RNNs. The second layer consumes the output of the first layer using a second RNN thus capturing long dependencies. We provide theoretical justification for our architecture under weak assumptions that we verify on real-world benchmarks. Furthermore, we show that for time-series classification, our technique leads to substantially improved inference time over standard RNNs without compromising accuracy. For example, we can deploy audio-keyword classification on tiny Cortex M4 devices (100MHz processor, 256KB RAM, no DSP available) which was not possible using standard RNN models. 
Similarly, using ShaRNN in the popular Listen-Attend-Spell (LAS) architecture for phoneme classification [4], we can reduce the lag in phoneme classification by 10-12x while maintaining state-of-the-art accuracy.\n\n1 Introduction\n\nWe focus on the challenging task of time-series classification on tiny devices, a problem arising in several industrial and consumer applications [25, 22, 30], where tiny edge-devices perform sensing, monitoring and prediction within a limited time and resource budget. A prototypical example is an interactive cane for people with visual impairment, capable of recognizing gestures that are observed as time-traces on a sensor embedded onto the cane [24].\n\nTime series or sequential data naturally exhibit temporal dependencies. Sequential models such as RNNs are particularly well-suited in this context because they can account for temporal dependencies by attempting to derive relations from the previous inputs. Nevertheless, directly leveraging RNNs for prediction in the constrained scenarios mentioned above is challenging. As observed by several authors [28, 14, 29, 9], the sequential nature by which RNNs process data fundamentally limits parallelization, leading to large training and inference costs. In particular, in time-series classification, the processing time at inference scales with the size, T, of the receptive window, which is unacceptable in resource-constrained settings.\n\n*Work done as a Research Fellow at Microsoft Research India.\n\u2020Work done during internships at Microsoft Research India.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nA solution proposed in the literature [28, 14, 29, 9] is to replace sequential processing with parallelizable feed-forward and convolutional networks. 
A key insight exploited here is that most applications require a relatively small receptive window, and that this size can be increased with tree-structured networks and dilated convolutions. Nevertheless, feedforward/convolutional networks utilize substantial working memory, which makes them difficult to deploy on tiny devices. For this reason, other methods such as [32, 2] are also not applicable in our setting. For example, a standard audio keyword detection task with a relatively modest setup of 32 conv filters would itself need a working memory of 500KB and about 32x more computation than a baseline RNN model (see Section 5).\n\nShallow RNNs. To address these challenges, we design a novel layered RNN architecture that is parallelizable with limited recurrence, while still maintaining the receptive field length (T) and the size of the baseline RNN. Concretely, we propose a simple 2-layer architecture that we refer to as ShaRNN. Both layers of ShaRNN are composed of a collection of shallow recurrent neural networks that operate independently. More precisely, each sequential data point (receptive window) is divided into independent parts called bricks of size k, and a shared RNN operates on each brick independently, thus ensuring a small model size and short recurrence. That is, ShaRNN\u2019s bottom layer restarts from an initial state after every k \u226a T steps, and hence only has a short recurrence. The outputs of the T/k parallel RNNs are input as a sequence into a second-layer RNN, which then outputs a prediction after T/k time. 
In this way, for k \u2248 O(\u221aT) we obtain a speedup of O(\u221aT) in inference time in the following two settings:\n\n(a) Parallelization: here we parallelize inference over T/k independent RNNs, thus admitting speed-ups on multi-threaded architectures;\n\n(b) Streaming: here we utilize receptive (sliding) windows and reuse computation from older sliding windows/receptive fields.\n\nWe also note that, in contrast to the proposed feed-forward methods or truncated RNN methods [23], our proposal admits fully receptive fields and thus does not result in loss of information. We further enhance ShaRNN by combining it with the recent MI-RNN method [10] to reduce the receptive window sizes; we call the resulting method MI-ShaRNN.\n\nWhile a feedforward layer could be used in lieu of the RNN in our second layer, such layers lead to too large an increase in model size and working RAM to be admissible on tiny devices.\n\nPerformance and Deployability. We compare the two-layer MI-ShaRNN approach against other state-of-the-art methods on a variety of benchmark datasets, tabulating both accuracy and budgets. We show that the proposed 2-layer MI-ShaRNN exhibits significant improvement in inference time while also improving accuracy. For example, on the Google-13 dataset, MI-ShaRNN achieves 1% higher accuracy than baseline methods while providing a 5-10x improvement in inference cost. A compelling aspect of the architecture is that it allows for reuse of most of the computation, which leads to its deployability on the tiniest of devices. In particular, we show empirically that the method can be deployed for real-time time-series classification on devices such as those based on the tiny ARM Cortex M4 microprocessor3, with just 256KB RAM, a 100MHz clock speed and no dedicated Digital Signal Processing (DSP) hardware. 
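To make the parallel-inference arithmetic of setting (a) concrete, here is a tiny sketch (illustrative arithmetic only, not code from the paper; `sequential_depth` is an invented helper): with T/k first-layer bricks evaluated in parallel, followed by a T/k-step second-layer RNN, the sequential depth is k + T/k, which is minimized near k = \u221aT.

```python
def sequential_depth(T: int, k: int) -> int:
    """Sequential RNN steps under full brick parallelism:
    k steps for the (parallel) first-layer bricks, plus T/k steps
    for the second-layer RNN that consumes one output per brick."""
    assert T % k == 0
    return k + T // k

T = 100
depths = {k: sequential_depth(T, k) for k in (2, 4, 5, 10, 20, 25, 50)}
best_k = min(depths, key=depths.get)
# k = 10 = sqrt(T) minimizes the depth: 10 + 100/10 = 20 steps,
# versus 100 sequential steps for a fully recurrent RNN.
```

For T = 100 this gives a depth of 20 rather than 100, matching the O(\u221aT) speedup described above.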
Finally, we demonstrate that we can replace the bi-LSTM based encoder-decoder of the LAS architecture [4] by ShaRNN while maintaining close to the best accuracy on the publicly available TIMIT dataset [13]. This enables us to deploy the LAS architecture in a streaming fashion with a lag of 1 second in phoneme prediction and O(1) amortized cost per time-step; the standard LAS model would incur a lag of about 8 seconds, as it processes the entire 8 seconds of audio before producing predictions.\n\nTheory. We provide theoretical justification for the ShaRNN architecture and show that significant parallelization can be achieved if the network satisfies some relatively weak assumptions. We also point out that additional layers can be introduced in the architecture, leading to hierarchical processing. While we do not experiment with this concept here, we note that it offers potential for exponential improvement in inference time.\n\n3https://en.wikipedia.org/wiki/ARM_Cortex-M#Cortex-M4\n\nIn summary, the following are our main contributions:\n\n\u2022 We show that under relatively weak assumptions, recurrence in RNNs, and consequently the inference cost, can be reduced significantly.\n\n\u2022 We demonstrate this inference efficiency via a two-layer ShaRNN (and MI-ShaRNN) architecture that uses only shallow RNNs with a small amount of recurrence.\n\n\u2022 We benchmark MI-ShaRNN (an enhancement of ShaRNN with MI-RNN) on several datasets and observe that it learns nearly as accurate models as standard RNNs and MI-RNN. Due to limited recurrence, ShaRNN saves 5-10x computation cost over baseline methods. 
We deploy the MI-ShaRNN model on a tiny microcontroller for real-time audio keyword detection, which, prior to this work, was not possible with standard RNNs due to the large inference cost with receptive (sliding) windows. We also deploy ShaRNN in the LAS architecture to enable streaming phoneme classification with less than 1 second of lag in prediction.\n\n2 Related Work\n\nStacked Architecture. Our multi-layered RNN resembles the stacked RNNs studied in the literature [15, 16, 27], but the two are unrelated. The goal of stacked RNNs is to produce complex models that subsume conventional RNNs. Each layer is fully recurrent and feeds its output to the next level, which is another fully recurrent RNN. As such, stacked RNN architectures lead to increased model size and recurrence, which results in worse inference time than standard RNNs.\n\nRecurrent Nets (Training). Conventional works on RNNs primarily address challenges arising during training. In particular, for a large receptive window T, RNNs suffer from vanishing and exploding gradient issues. A number of works propose to circumvent this issue in various ways, such as gated architectures [7, 17], adding residual connections to RNNs [18, 1, 21], or constraining the learnt parameters [31]. Several recent works attempt to reduce the number of gates and parameters [8, 6, 21] to reduce model size, but as such suffer from poor inference time, since they are still fully recurrent. Different from these works, our focus is on reducing model size as well as inference time; we view these works as complementary to our paper.\n\nRecurrent Nets (Inference Time). Recent works have begun to focus on RNN inference cost. [3] proposes to learn skip connections that can avoid evaluating all the hidden states. [10] exploits the domain knowledge that the true signature is significantly shorter than the time-trace to trim down the length of the sliding windows. 
Both of these approaches are complementary to ours, and we indeed leverage the second in our approach. A recent work on dilated RNNs [5] is interesting: while it could serve as a potential solution, we note that, in its original form, the dilated RNN also has a fully recurrent first layer, which makes it infeasible in our setting. One remedy is to introduce dilation in the first layer to improve inference time. But dilation skips steps and hence can miss out on critical local context.\n\nFinally, CNN based methods [28, 14, 29, 9, 2] allow higher parallelization in sequential tasks but, as discussed in Section 1, also lead to significantly larger working RAM requirements when compared to RNNs, and thus cannot be considered for deployment on tiny devices (see Section 5).\n\n3 Problem Formulation and Proposed ShaRNN Method\n\nIn this paper, we primarily focus on the time-series classification problem, although the techniques apply to more general sequence-to-sequence problems like the phoneme classification problem discussed in Section 5. Let Z = {(X_1, y_1), ..., (X_n, y_n)}, where X_i is the i-th sequential data point with X_i = [x_{i,1}, x_{i,2}, ..., x_{i,T}] \u2208 R^{d\u00d7T}, and x_{i,t} \u2208 R^d is the t-th time-step data point. y_i \u2208 [C] is the label of X_i, where C is the number of class labels. x_{i,t:t+k} is shorthand for [x_{i,t}, ..., x_{i,t+k}]. Given training data Z, the goal is to learn a classifier f : R^{d\u00d7T} \u2192 [C] that can be used for efficient inference, especially on tiny devices. Recurrent Neural Networks (RNNs) are popularly used for modeling such sequential problems and maintain a hidden state h_{t\u22121} \u2208 R^{d\u0302} at the t-th step that is updated using:\n\nh_t = R(h_{t\u22121}, x_t), t \u2208 [T], \u0177 = f(h_T),\n\nwhere \u0177 is the prediction obtained by applying a classifier f on h_T and d\u0302 is the dimensionality of the hidden state. 
Due to the sequential nature of RNNs, the inference cost of an RNN is \u03a9(T) even if the hardware supports a large amount of parallelization. Furthermore, practical applications require handling a continuous stream of data, e.g., a smart-speaker listening for certain audio keywords.\n\nA standard approach is to use sliding windows (receptive fields) to form a stream of test points on which inference can be applied. That is, given a stream X = [x_1, x_2, ...], we form sliding windows X^s = x_{(s\u22121)\u00b7\u03c9+1 : (s\u22121)\u00b7\u03c9+T} \u2208 R^{d\u00d7T}, which stride by \u03c9 > 0 time-steps after each inference. The RNN is then applied to each sliding window X^s, which implies that the amortized cost for processing each time-step data point (x_t) is \u0398(T/\u03c9). To ensure high resolution in prediction, \u03c9 is required to be a fairly small constant independent of T. Thus, the amortized inference cost for each time-step point is O(T), which is prohibitively large for tiny devices. So, we study the following key question: \u201cCan we process each time-step point in a data stream in o(T) computational steps?\u201d\n\n3.1 ShaRNN\n\nShallow RNNs (ShaRNN) are a hierarchical collection of RNNs organized at two levels. The T/k RNNs at the ground layer operate completely in parallel with fully shared parameters and activation functions, thus ensuring small model size and parallel execution. An RNN at the next level takes inputs from the ground layer and subsequently outputs a prediction.\n\nFormally, given a sequential point X = [x_1, ..., x_T] (e.g. a sliding window in streaming data), we split it into bricks of size k, where k is a parameter of the algorithm. That is, we form T/k bricks: B = [B_1, ..., B_{T/k}], where B_j = x_{((j\u22121)\u00b7k+1):(j\u00b7k)}. 
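The brick decomposition just defined, with a shared first-layer RNN per brick and a second-layer RNN over the brick summaries, can be sketched in a few lines of NumPy. This is a minimal, hypothetical illustration: a plain tanh RNN cell stands in for both layers (the paper uses LSTM cells), and all sizes and parameter names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_hat, T, k = 4, 8, 32, 8          # input dim, hidden dim, window length, brick size

# Shared parameters for a plain tanh RNN cell (stand-in for R1 and R2).
W1, U1 = rng.standard_normal((d_hat, d)) * 0.1, rng.standard_normal((d_hat, d_hat)) * 0.1
W2, U2 = rng.standard_normal((d_hat, d_hat)) * 0.1, rng.standard_normal((d_hat, d_hat)) * 0.1

def rnn(W, U, inputs):
    """Run a tanh RNN from a zero initial state; return the final hidden state."""
    h = np.zeros(W.shape[0])
    for x in inputs:
        h = np.tanh(W @ x + U @ h)
    return h

def sharnn_forward(X):
    """X: array of T input vectors. Split into T/k bricks, run the shared
    first-layer RNN independently on each brick (restarting every k steps),
    then feed the T/k brick summaries to the second-layer RNN."""
    bricks = [X[j * k:(j + 1) * k] for j in range(T // k)]
    nu1 = [rnn(W1, U1, b) for b in bricks]   # first layer: k-deep recurrence only
    return rnn(W2, U2, nu1)                  # second layer: T/k recurrent steps

X = rng.standard_normal((T, d))
v = sharnn_forward(X)
assert v.shape == (d_hat,)
```

A classifier f would then map the final state `v` to class scores; only the bottom loop over bricks is parallelizable.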
Now, ShaRNN applies a standard recurrent model R^{(1)} : R^{d\u00d7k} \u2192 R^{d\u0302_1} on each brick, where d\u0302_1 is the dimensionality of the hidden states of R^{(1)}. That is,\n\n\u03bd^{(1)}_j = R^{(1)}(B_j), j \u2208 [T/k].\n\nNote that R^{(1)} can be any standard RNN model like GRU, LSTM, etc. We now feed the output of each brick into another RNN to produce the final state/feature vector, which is then fed into a feed-forward layer. That is,\n\n\u03bd^{(2)}_{T/k} = R^{(2)}([\u03bd^{(1)}_1, ..., \u03bd^{(1)}_{T/k}]), \u0177 = f(\u03bd^{(2)}_{T/k}),\n\nwhere R^{(2)} is the second-layer RNN and can also be any standard RNN model. \u03bd^{(2)}_{T/k} \u2208 R^{d\u0302_2} is the hidden state obtained by applying R^{(2)} to \u03bd^{(1)}_{1:T/k}, and f applies the standard feed-forward network to \u03bd^{(2)}_{T/k}. See Figure 1 for a block diagram of the architecture. That is, ShaRNN is defined by parameters \u039b composed of the shared RNN parameters at the ground level, the RNN parameters at the next level, and the classifier weights for making a prediction. We train ShaRNN by minimizing an empirical loss function over the training set Z.\n\nNaturally, ShaRNN is an approximation of a true RNN and in principle has less modeling power (and recurrence). But as discussed in Section 4 and shown by our empirical results in Section 5, ShaRNN can still capture enough context from the entire sequence to effectively model a variety of time-series classification problems with large T (typically T \u2265 100). Due to the T/k parallel RNNs in the bottom layer that are processed by R^{(2)} in the second layer, ShaRNN inference cost can be reduced to O(T/k + k) on multi-threaded architectures with k-wise parallelization; k = \u221aT leads to the smallest inference cost.\n\nStreaming. Recall that in the streaming setting, we form sliding windows X^s = x_{s\u00b7\u03c9+1 : s\u00b7\u03c9+T} \u2208 R^{d\u00d7T} by striding each window by \u03c9 > 0 time-steps. 
Hence, if \u03c9 = k \u00b7 q for q \u2208 N, then the inference cost of X^{s+1} can be reduced by reusing the \u03bd^{(1)}_j vectors previously computed for X^s, for all j \u2208 [q+1, T/k]. The claim below provides a formal result for the same.\n\nClaim 1. Let both layers\u2019 RNNs R^{(1)} and R^{(2)} of ShaRNN have the same hidden size and per-time-step computation complexity C_1. Then, given T and \u03c9, the additional cost of applying ShaRNN to X^{s+1} given X^s is O(T/k + q \u00b7 k) \u00b7 C_1, where X^s = x_{(s\u22121)\u00b7\u03c9+1 : (s\u22121)\u00b7\u03c9+T}, \u03c9 is the stride length of the sliding window, and the brick size k satisfies \u03c9 = q \u00b7 k for some integer q \u2265 1. Consequently, the total amortized cost can be bounded by O(\u221a(q \u00b7 T) \u00b7 C_1) if k = \u221a(T/q).\n\nSee Appendix A for a proof of the claim.\n\nFigure 1: (a) ShaRNN applies RNN R^{(1)} independently to bricks x_{1:k}, x_{k+1:2k}, ... to compute \u03bd^{(1)}_j for all j. The second-layer RNN R^{(2)} produces class labels or, in the multi-layer case, inputs for the next layer. Note that the \u03bd^{(1)}_j can be reused for evaluating the next window. (b), (c): Mean squared approximation error and prediction accuracy of ShaRNN with zeroth and first order approximation (M = 1, 2 respectively in Claim 3) for different brick sizes k (for Google-13). Note the large error with M = 1 (same as the truncation method in [23]). M = 2 brings significant improvement, especially for small k, but clearly needs larger M to achieve better accuracy. (d): Comparison of the norm of the gradient vs the Hessian of R(h_t, x_{t+1:t+k}) with varying k. R is FastRNN [21] with swish activation. A smaller Hessian norm indicates that the first-order approximation of R (Claim 3) by ShaRNN is more accurate than the 0-th order one (ShaRNN with M = 1) suggested by [23].\n\n3.2 Multi-layer ShaRNN\n\nThe claim above shows that selecting small k leads to a large number of bricks and hence a large number of points to be processed by the second-layer RNN R^{(2)}, which then becomes the bottleneck in inference. However, using the same approach, we can replace the second layer with another layer of ShaRNN to bring down the cost. By repeating this process, we can design a general L-layer architecture where each layer is equipped with an RNN model R^{(l)} and the output of the j-th brick at layer l is given by:\n\n\u03bd^{(l)}_j = R^{(l)}([\u03bd^{(l\u22121)}_{(j\u22121)\u00b7k+1}, ..., \u03bd^{(l\u22121)}_{(j\u22121)\u00b7k+k}]),\n\nfor all 1 \u2264 j \u2264 T/k^l, where \u03bd^{(0)}_j = x_j. The predicted label is given by \u0177 = f(\u03bd^{(L)}_{T/k^{L\u22121}}).\n\nUsing an argument similar to the claim in the previous section, we can reduce the total inference cost to O(log T) by using k = O(1) and L = log T.\n\nClaim 2. Let all layers of the multi-layer ShaRNN have the same hidden size and per-time-step complexity C_1, and let k = \u03c9. Then, the additional cost of applying ShaRNN to X^{s+1} is O(T/k^L + L \u00b7 k) \u00b7 C_1, where X^s = x_{(s\u22121)\u00b7\u03c9+1 : (s\u22121)\u00b7\u03c9+T}. Consequently, selecting L = log(T), k = O(1), and assuming \u03c9 = O(1), the total amortized cost is O(C_1 \u00b7 log(T)).\n\nThat is, we can achieve exponential speed-up over the O(T) cost of a standard RNN. However, such a model can lead to a large loss in accuracy. 
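The cost accounting in Claim 2 can be checked with a toy calculator (illustrative arithmetic only, ignoring constant factors and the C_1 term; `multilayer_cost` is an invented helper):

```python
import math

def multilayer_cost(T, k, L):
    """Per-stride cost units from Claim 2 (up to the C1 factor):
    T / k**L steps at the top layer plus L * k recomputed brick steps."""
    return T / k ** L + L * k

T = 1024
# With one level of bricking the formula reduces to T/k + k,
# which is minimized near k = sqrt(T).
two_layer = min(multilayer_cost(T, k, 1) for k in range(1, T + 1))
# Many layers: k = O(1), L = log T gives O(log T) amortized cost.
deep = multilayer_cost(T, 2, int(math.log2(T)))   # 1024/2**10 + 2*10 = 21
```

For T = 1024 the shallow variant costs about 2\u221aT = 64 units per stride, while the log-depth variant costs 21, illustrating the exponential gap the claim describes.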
Moreover, the constants in the cost for large L are large enough that a network with smaller L might be more efficient for typical values of T.\n\n3.3 MI-ShaRNN\n\nRecently, [10] showed that several time-series training datasets are coarse and that the sliding-window size T can be decreased significantly by using their multi-instance based algorithm (MI-RNN). MI-RNN finds tight windows around the actual signature of the class, which leads to significantly smaller models and reduces inference cost. Our ShaRNN architecture is orthogonal to MI-RNN and can be combined with it to obtain an even higher amount of inference saving. That is, MI-RNN takes the dataset Z = {(X_1, y_1), ..., (X_n, y_n)}, with X_i being a sequential data point over T steps, and produces a new set of points X'_j with labels y'_j, where each X'_j is a sequential data point over T' steps and T' \u2264 T. MI-ShaRNN applies ShaRNN to the output of MI-RNN so that the inference cost depends only on T' \u2264 T, and captures the key signal in each data point.\n\n4 Analysis\n\nIn this section, we provide the theoretical underpinnings of the ShaRNN approach and put it in the context of the work of [23], which discusses RNN models for which we can get rid of almost all of the recurrence.\n\nLet R : R^{d+d\u0302} \u2192 R^{d\u0302} be a standard RNN model that maps the given hidden state h_{t\u22121} \u2208 R^{d\u0302} and data point x_t \u2208 R^d into the next hidden state h_t = R(h_{t\u22121}, x_t). Overloading notation, R(h_0, x_1, ..., x_t) = R(h_{t\u22121}, x_t). We define a function to be recurrent if the following holds:\n\nR(h_0, x_1, ..., x_t) = R(R(h_0, x_1, ..., x_{t\u22121}), x_t).\n\nThe final class prediction using the feed-forward layer is given by \u0177 = f(h_T) = f(R(h_0, x_{1:T})). Now, ShaRNN attempts to untangle and approximate the dependency of f(h_T) and R(h_0, x_{1:T}) on h_0 by using Taylor\u2019s theorem. The claim below gives the condition under which the approximation error introduced by ShaRNN is small.\n\nClaim 3. Let R(h_0, x_1, ..., x_t) be an RNN and let \u2016\u2207^M_h R(h, x_{t:t+k})\u2016 \u2264 O(\u03b5 \u00b7 M!) for some \u03b5 \u2265 0, where \u2207^M_h is the M-th order derivative with respect to h. Also let \u2016R(h_0, x_{1:t}) \u2212 h_0\u2016 = O(1) and \u2016\u2207^m_h R(h_0, x_{t+1:t+k})\u2016 = O(m!) for all t \u2208 [T]. Then, there exists a ShaRNN defined by functions R^{(1)}, R^{(2)} and brick size k such that \u2016R^{(2)}(\u03bd^{(1)}_1, ..., \u03bd^{(1)}_{T/k}) \u2212 R(h_0, x_{1:T})\u2016 \u2264 \u03b5 \u00b7 M \u00b7 T, where \u03bd^{(1)}_j = R^{(1)}(h_0, x_{(j\u22121)\u00b7k+1 : j\u00b7k}).\n\nSee Appendix A for a detailed proof of the claim.\n\nThe claim above shows that the hidden state computed by ShaRNN is close to the state computed by a fully recursive RNN; hence the final output \u0177 would also be close. We now compare this result to that of [23], which showed that \u2016R(h_0, x_{1:T}) \u2212 R(h_0, x_{T\u2212k+1:T})\u2016 \u2264 \u03b5 for large enough k if R satisfies a contraction property, i.e., if \u2016R(h_{t\u22121}, x_t) \u2212 R(h'_{t\u22121}, x_t)\u2016 \u2264 \u03bb\u2016h_{t\u22121} \u2212 h'_{t\u22121}\u2016 with \u03bb < 1. However, \u03bb < 1 is a strict requirement and does not hold in practice. Due to this, if we only compute R(h_0, x_{T\u2212k+1:T}) as suggested by the above result (for some reasonable values of k), the resulting accuracy on several datasets drops significantly (see Figure 1(b),(c)).\n\nIn the context of Claim 3, the result of [23] is a special case with M = 1, i.e., it only applies a 0-th order Taylor series expansion. 
Figure 1(d) shows how the norm of the gradient, which bounds the error due to the 0-th order expansion, is significantly larger than the norm of the Hessian, which bounds the error due to the 1-st order expansion.\n\nCase study with FastRNN: We now instantiate Claim 3 for a simple FastRNN model [21] with a first-order approximation, i.e., with M = 2 in Claim 3.\n\nClaim 4. Let R(h_0, x_1, ..., x_t) be a FastRNN model with parameters U, W. Let \u2016U\u2016 \u2264 O(1) and \u2016\u2207^2_h R(h_0, x_{t:t+k})\u2016 \u2264 O(\u03b5) for any k-length sequence. Then, there exists a ShaRNN defined by functions R^{(1)}, R^{(2)} and brick size k such that \u2016R^{(2)}(\u03bd^{(1)}_1, ..., \u03bd^{(1)}_{T/k}) \u2212 R(h_0, x_{1:T})\u2016 \u2264 \u03b5, where \u03bd^{(1)}_j = R^{(1)}(h_0, x_{(j\u22121)\u00b7k+1 : j\u00b7k}).\n\nNote that \u2016U\u2016 = O(1) holds for all the benchmarks that were tried in [21]. Moreover, this assumption is significantly weaker than the typical \u2016U\u2016 < 1 assumption required by [23]. Finally, the Hessian term is significantly smaller than the derivative term (Figure 1(d)); hence the approximation error and prediction error should be significantly smaller than what we would get with the 0-th order approximation (see Figure 1(b),(c)).\n\n5 Empirical Results\n\nWe conduct experiments to study: a) the performance of MI-ShaRNN with varying hidden state dimensions at both layers R^{(1)} and R^{(2)}, to understand how its accuracy stacks up against baseline\n\nTable 1: The table compares the maximum accuracy achieved by each method for different model sizes, i.e., different hidden-state sizes, indicated by the numbers in brackets; MI-ShaRNN reports two numbers, for the first and the second layer, respectively. The table also reports the corresponding computational cost (amortized number of flops required per data-point inference) for each method. 
T denotes the number of time-steps for the dataset, T' denotes the trimmed number of time-steps obtained by MI-RNN, and k is the selected brick length for MI-ShaRNN. Note that on all but one dataset, MI-ShaRNN achieves similar or better accuracy compared to baseline LSTMs.\n\nDataset | Baseline LSTM: Acc(%), Flops, T | MI-RNN: Acc(%), Flops, T' | MI-ShaRNN: Acc(%), Flops, k\nGoogle-13 | 91.13 (64), 4.89M, 99 | 93.16 (64), 2.42M, 49 | 94.01 (64, 32), 0.59M, 8\nHAR-6 | 93.04 (32), 1.36M, 128 | 91.78 (32), 0.51M, 48 | 94.02 (32, 8), 0.17M, 16\nGesturePod-5 | 97.13 (48), 8.37M, 400 | 98.43 (48), 4.19M, 200 | 99.21 (48, 32), 0.83M, 20\nSTCI-2 | 99.01 (32), 2.67M, 162 | 98.43 (32), 1.33M, 81 | 99.23 (32, 32), 0.30M, 8\nDSA-19 | 85.17 (64), 7.23M, 129 | 88.11 (64), 5.05M, 90 | 87.36 (64, 48), 1.10M, 15\n\nFigure 2: (a),(b),(c): Accuracy vs inference cost: we vary model size (hidden dimensions) to obtain accuracy vs inference cost curves for different methods. All three plots show that MI-ShaRNN produces more accurate models with as much as an 8-10x reduction in inference cost. (d): Error rate of the standard LAS method [12] and of ShaRNN based streaming LAS with varying brick sizes k on the TIMIT [13] dataset. We report results when both Listener+Speller use ShaRNN vs when only the Listener uses it. ShaRNN Listener+Speller with k = 64 incurs 12x smaller lag in phoneme prediction vs baseline LAS (k = 784).\n\nmodels across different model sizes; b) the inference cost improvement that MI-ShaRNN produces for standard time-series classification problems over baseline models and MI-RNN models; and c) whether MI-ShaRNN can enable certain time-series classification tasks on devices based on the tiny Cortex M4 with only a 100MHz processor and 256KB RAM. Recall that MI-ShaRNN uses ShaRNN on top of the trimmed data points given by MI-RNN. 
MI-RNN is known to have better performance than baseline LSTMs, so naturally MI-ShaRNN has better performance than ShaRNN. Hence, we present results for MI-ShaRNN and compare them to MI-RNN to demonstrate the advantage of the ShaRNN technique.\n\nDatasets: We benchmark our method on standard datasets from different domains: audio keyword detection (Google-13), wake word detection (STCI-2), activity recognition (HAR-6), sports activity recognition (DSA-19), and gesture recognition (GesturePod-5). The number after the hyphen in a dataset name indicates the number of classes. See Table 3 in the appendix for more details about the datasets. All the datasets are available online (see Table 3) except STCI-2, which is a proprietary wake word detection dataset.\n\nBaselines: We compare our algorithm MI-ShaRNN (LSTM) against the baseline LSTM method as well as the MI-RNN (LSTM) method. Note that MI-RNN as well as MI-ShaRNN build upon an RNN cell. For simplicity and consistency, we have selected LSTM as the base cell for all the methods, but each of them can be trained with other RNN cells like GRU [7] or FastRNN [21]. We implemented all the algorithms in TensorFlow and used Adam for training the models [19]. The inference code for the Cortex M4 device was written in C and compiled onto the device. All the presented numbers are averaged over 5 independent runs. The implementation of our algorithm is released as part of the EdgeML [11] library.\n\nHyperparameter selection: The main hyperparameters are: a) the hidden state sizes for both layers of MI-ShaRNN, and b) the brick size k for MI-ShaRNN. In addition, the number of time-steps T is associated with each dataset. MI-RNN prunes down T and works with T' \u2264 T time-steps. 
Table 2: Deployment on Cortex M4: accuracy of different methods vs inference-time cost (ms) on an M4 device with 256KB RAM and a 100MHz processor. For low-latency keyword spotting (Google-13), the total inference time budget is 120 ms.\n\nMethod (hidden size) | Baseline (16) | Baseline (32) | MI-RNN (16) | MI-RNN (32) | MI-ShaRNN (16, 16) | MI-ShaRNN (32, 16)\nAcc. (%) | 86.99 | 89.84 | 89.78 | 92.61 | 91.42 | 92.67\nCost (ms) | 456 | 999 | 226 | 494 | 70.5 | 117\n\nWe provide results with varying hidden state sizes to illustrate the trade-offs involved in selecting this hyperparameter (Figure 2). We select k \u2248 \u221aT, with some variation to optimize w.r.t. the stride length \u03c9 for each dataset; we also provide an ablation study to illustrate the impact of different choices of k on accuracy and inference cost (Figure 3, Appendix).\n\nComparison of accuracies: Table 1 compares the accuracy of MI-ShaRNN against the baselines and MI-RNN for different hidden dimensions at R^{(1)} and R^{(2)}. In terms of prediction accuracy, MI-ShaRNN performs much better than the baselines and is competitive with MI-RNN on all the datasets. For example, with only k = 8, MI-ShaRNN achieves 94% accuracy on the Google-13 dataset, while the MI-RNN model is applied for T = 49 steps and the baseline LSTM for T = 99 steps. That is, with only 8-deep recurrence, MI-ShaRNN competes with the accuracies of 49- and 99-deep LSTMs.\n\nFor inference cost, we study the amortized cost per data point in the sliding-window setting (see Section 3). That is, the baseline and MI-RNN recompute the entire prediction from scratch for each sliding window. 
But MI-ShaRNN can re-use computation in the first layer (see Section 3), leading to significant savings in inference cost. We report inference cost as the additional floating point operations (flops) each model needs to execute for every new inference. For simplicity, we treat addition and multiplication as having the same cost. The number of non-linearity computations is small and nearly the same for all the methods, so we ignore it.

Table 1 clearly shows that, to achieve the best accuracy, MI-ShaRNN is up to 10x faster than the baselines and up to 5x faster than MI-RNN, even on a single-threaded hardware architecture. Figure 2 shows the computation vs. accuracy trade-off for three datasets. We observe that for a range of desired accuracy values, MI-ShaRNN is 5-10x faster than the baselines.

Next, we compute accuracy and flops for MI-ShaRNN with different brick sizes k (see Figure 3 of the Appendix). As expected, the k ≈ √T setting requires the fewest flops for inference, but the story for accuracy is more complicated. For this dataset, we do not observe any particular trend for accuracy; all the accuracy values are similar, irrespective of k.

Deployment of Google-13 on Cortex M4: We use ShaRNN to deploy a real-time keyword spotting model (Google-13) on a Cortex M4 device. For time-series classification (Section 3), we need to slide windows and infer classes on each window. Due to the small working RAM of M4 devices (256KB), for real-time recognition the method needs to finish the following tasks within a budget of 120 ms: collect data from the microphone buffer, process it, produce an ML-based inference, and smooth out predictions for one final output.

Standard LSTM models for this task work on 1 s windows, whose featurization generates a 32 × 99 feature matrix; here T = 99. So even a relatively small LSTM (hidden size 16) takes 456 ms to process one window, exceeding the time budget (Table 2).
MI-RNN is faster but still requires 226 ms. Recently, a few CNN-based methods have also been designed for low-resource keyword spotting [26, 20]. However, with just 40 filters applied to the standard 32 × 99 filter-bank features, the working memory requirement balloons to ≈ 500KB, which is beyond typical M4 devices' memory budget. Similarly, the compute requirement of such architectures also easily exceeds the latency budget of 120 ms. See Figure 4 in the Appendix for a comparison between CNN models and ShaRNN.

In contrast, our method produces an inference in only 70 ms and is thus well within the latency budget of the M4. Moreover, MI-ShaRNN holds only two arrays in the working RAM: a) the input features for one brick and b) the buffered final states from previous bricks. For the deployed MI-ShaRNN model, with T = 49 time-steps and brick size k = 8, the working RAM requirement is just 1.5 KB.

ShaRNN for Streaming Listen Attend Spell (LAS): LAS is a popular architecture for phoneme classification in a given audio stream. It forms non-overlapping time windows of length 784 (≈ 8 seconds) and applies an encoder-decoder architecture to predict a sequence of phonemes. We study LAS applied to the TIMIT dataset [13]. We enhance the standard LAS architecture to exploit the time-annotated ground truth available in the TIMIT dataset, which improved the baseline phoneme error rate from the publicly reported 0.271 to 0.22. Both the encoder and decoder layers in standard and enhanced LAS consist of fully recurrent bi-LSTMs. So for each time window (of length 784) we would need to apply the entire encoder-decoder architecture to predict the phoneme sequence, implying a potential lag of ≈ 8 seconds (784 steps) in prediction.

Instead, using ShaRNN we can divide both the encoder and decoder layers into bricks of size k. This makes it possible to emit a phoneme classification every k steps, thereby bringing the lag down from 784 steps to k steps.
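Since a 784-step window spans roughly 8 seconds, each step corresponds to about 10 ms, and the prediction lag shrinks linearly with the brick size. The arithmetic, assuming a uniform step duration:

```python
def lag_seconds(k, steps_per_window=784, window_seconds=8.0):
    """Lag before a phoneme prediction can be emitted when the LAS
    encoder/decoder operate on bricks of k steps, given that one
    784-step window spans ~8 seconds of audio."""
    return k * window_seconds / steps_per_window
```

For example, `lag_seconds(784)` recovers the full-window lag of 8 s, while `lag_seconds(64)` ≈ 0.65 s, in line with the roughly 0.6-second lag reported below for k = 64.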
However, due to the small brick size k, in principle we might lose a significant amount of context information. But thanks to the corrective second layer in ShaRNN (Figure 1), we observe little loss in accuracy. Figure 2 shows the performance of two variants of ShaRNN + LAS: a) ShaRNN Listener, which uses ShaRNN only in the encoding layer, and b) ShaRNN Listener + Speller, which uses ShaRNN in both the encoding and decoding layers. Figure 2 (d) shows that using ShaRNN in both the encoder and decoder is more beneficial than using it only in the encoder layer. Furthermore, decreasing k from 784 to 64 leads to a marginal increase in error from 0.22 to 0.238 while reducing the lag significantly, from 8 seconds to 0.6 seconds. In fact, even at k = 64 this model's performance is significantly better than the reported error of standard LAS (0.27) [12]. See Appendix C for details.

References

[1] Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu. Advances in optimizing recurrent networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.

[2] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576, 2016.

[3] Víctor Campos, Brendan Jou, Xavier Giró i Nieto, Jordi Torres, and Shih-Fu Chang. Skip RNN: Learning to skip state updates in recurrent neural networks. In International Conference on Learning Representations, 2018.

[4] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.

[5] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark A Hasegawa-Johnson, and Thomas S Huang. Dilated recurrent neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R.
Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, 2017.

[6] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

[7] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

[8] Jasmine Collins, Jascha Sohl-Dickstein, and David Sussillo. Capacity and trainability in recurrent neural networks. arXiv preprint arXiv:1611.09913, 2016.

[9] Yann Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In ICML, 2017.

[10] Don Kurian Dennis, Chirag Pabbaraju, Harsha Vardhan Simhadri, and Prateek Jain. Multiple instance learning for efficient sequential data classification on resource-constrained devices. In NeurIPS, 2018.

[11] Don Kurian Dennis, Sridhar Gopinath, Chirag Gupta, Ashish Kumar, Aditya Kusupati, Shishir G Patil, and Harsha Vardhan Simhadri. EdgeML: Machine Learning for resource-constrained edge devices.

[12] Janna Escur. Exploring automatic speech recognition with TensorFlow. Master's thesis, 2018.

[13] John S Garofolo, Lori F Lamel, William M Fisher, Jonathan G Fiscus, David S Pallett, Nancy L Dahlgren, and Victor Zue. DARPA TIMIT acoustic phonetic continuous speech corpus, 1993.

[14] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann Dauphin. Convolutional sequence to sequence learning. In ICML, 2017.

[15] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.

[16] Michiel Hermans and Benjamin Schrauwen. Training and analysing deep recurrent neural networks. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, 2013.

[17] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.

[18] Herbert Jaeger, Mantas Lukosevicius, Dan Popovici, and Udo Siewert. Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks, 20, 2007.

[19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[20] Rajath Kumar, Vaishnavi Yeruva, and Sriram Ganapathy. On convolutional LSTM modeling for joint wake-word detection and text dependent speaker verification. In Proc. Interspeech 2018, 2018.

[21] Aditya Kusupati, Manish Singh, Kush Bhatia, Ashish Kumar, Prateek Jain, and Manik Varma. FastGRNN: A fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. In NeurIPS, 2018.

[22] Benoît Latré, Bart Braem, Ingrid Moerman, Chris Blondia, and Piet Demeester. A survey on wireless body area networks. Wireless Networks, 2011.

[23] John Miller and Moritz Hardt. When recurrent models don't need to be recurrent. arXiv preprint arXiv:1805.10369, 4, 2018.

[24] Shishir G. Patil, Don Kurian Dennis, Chirag Pabbaraju, Nadeem Shaheer, Harsha Vardhan Simhadri, Vivek Seshadri, Manik Varma, and Prateek Jain. GesturePod: Enabling on-device gesture-based interaction for white cane users. Technical report, New York, NY, USA, 2019.

[25] Charith Perera, Chi Harold Liu, and Srimal Jayawardena. The emerging internet of things marketplace from an industrial perspective: A survey. IEEE Transactions on Emerging Topics in Computing, 3(4), 2015.

[26] Tara N Sainath and Carolina Parada. Convolutional neural networks for small-footprint keyword spotting. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[27] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013.

[28] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. In SSW, 2016.

[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.

[30] Allen Y Yang, Sameer Iyengar, Shankar Sastry, Ruzena Bajcsy, Philip Kuryloski, and Roozbeh Jafari. Distributed segmentation and classification of human actions using a wearable motion sensor network. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 2008.

[31] Jiong Zhang, Qi Lei, and Inderjit S Dhillon. Stabilizing gradients for deep neural networks via efficient SVD parameterization. arXiv preprint arXiv:1803.09327, 2018.

[32] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms, 2015.