{"title": "Multiple Instance Learning for Efficient Sequential Data Classification on Resource-constrained Devices", "book": "Advances in Neural Information Processing Systems", "page_first": 10953, "page_last": 10964, "abstract": "We study the problem of fast and efficient classification of sequential data (such as\ntime-series) on tiny devices, which is critical for various IoT related applications\nlike audio keyword detection or gesture detection. Such tasks are cast as a standard classification task by sliding windows over the data stream to construct data points. Deploying such classification modules on tiny devices is challenging as predictions over sliding windows of data need to be invoked continuously at a high frequency. Each such predictor instance in itself is expensive as it evaluates large models over long windows of data. In this paper, we address this challenge by exploiting the following two observations about classification tasks arising in typical IoT related applications: (a) the \"signature\" of a particular class (e.g. an audio keyword) typically occupies a small fraction of the overall data, and (b) class signatures tend to be discernible early on in the data. We propose a method, EMI-RNN, that exploits these observations by using a multiple instance learning formulation along with an early prediction technique to learn a model that achieves better accuracy compared to baseline models, while simultaneously reducing computation by a large fraction. For instance, on a gesture detection benchmark [ 25 ], EMI-RNN improves standard LSTM model\u2019s accuracy by up to 1% while requiring 72x less computation. This enables us to deploy such models for continuous real-time prediction on a small device such as Raspberry Pi0 and Arduino variants, a task that the baseline LSTM could not achieve. 
Finally, we also provide an analysis of our multiple instance learning algorithm in a simple setting and show that the proposed algorithm converges to the global optima at a linear rate, one of the first such results in this domain. The code for EMI-RNN is available at: https://github.com/Microsoft/EdgeML/tree/master/tf/examples/EMI-RNN", "full_text": "Multiple Instance Learning for Efficient Sequential Data Classification on Resource-constrained Devices

Don Kurian Dennis, Chirag Pabbaraju, Harsha Vardhan Simhadri, Prateek Jain
Microsoft Research, India
{t-dodenn, t-chpab, harshasi, prajain}@microsoft.com

Abstract

We study the problem of fast and efficient classification of sequential data (such as time-series) on tiny devices, which is critical for various IoT related applications like audio keyword detection or gesture detection. Such tasks are cast as a standard classification task by sliding windows over the data stream to construct data points. Deploying such classification modules on tiny devices is challenging as predictions over sliding windows of data need to be invoked continuously at a high frequency. Each such predictor instance in itself is expensive as it evaluates large models over long windows of data. In this paper, we address this challenge by exploiting the following two observations about classification tasks arising in typical IoT related applications: (a) the "signature" of a particular class (e.g. an audio keyword) typically occupies a small fraction of the overall data, and (b) class signatures tend to be discernible early on in the data. We propose a method, EMI-RNN, that exploits these observations by using a multiple instance learning formulation along with an early prediction technique to learn a model that achieves better accuracy compared to baseline models, while simultaneously reducing computation by a large fraction. 
For instance, on a gesture detection benchmark [26], EMI-RNN requires 72x less computation than a standard LSTM while improving accuracy by 1%. This enables us to deploy such models for continuous real-time prediction on devices as small as a Raspberry Pi0 and Arduino variants, a task that the baseline LSTM could not achieve. Finally, we also provide an analysis of our multiple instance learning algorithm in a simple setting and show that the proposed algorithm converges to the global optima at a linear rate, one of the first such results in this domain. The code for EMI-RNN is available at [14].

1 Introduction

Classification of sequential data: Several critical applications, especially in the Internet of Things (IoT) domain, require real-time predictions on sensor data. For example, wrist bands attempt to recognize gestures or activities (such as walking, climbing) from Inertial Measurement Unit (IMU) sensor data [2, 26]. Similarly, several audio applications require detection of specific keywords like "up", "down" or certain urban sounds [35, 32, 29, 5].
LSTM based models: Typically, such problems are modeled as a multi-class classification problem where each sliding window over the sensor data forms a sequential data point and each class represents one category of events to be detected. Additionally, there is a "noise" class which denotes a no-event data point. Recurrent Neural Networks (RNNs) like LSTMs [21] are a popular tool for modeling such problems, where the RNN produces an embedding of the given sequential data point that can then be consumed by a standard classifier like logistic regression to predict the label of the point [20, 30].
Deployment on tiny devices: Such event detection applications, especially in the IoT domain, require deployment of inference on tiny edge devices with capabilities comparable to an Arduino Uno [13] or a Raspberry Pi0 [19]. 
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Further, the data is sampled at a high rate for good resolution, and the data buffer is often limited by small DRAMs. This means that the prediction needs to be completed before the buffer is refreshed. For example, the GesturePod device [26] requires the buffer to be refreshed every 100 ms and hence predictions must be completed in 100 ms using just 32KB of RAM. Another critical requirement is that the lag between the event and the detection should be very short. This not only requires fast but also "early" prediction, i.e., the solution must classify an event before even observing the entire event, based on a prefix of the signature of the event.
Drawbacks of existing solutions: Unfortunately, existing state-of-the-art solutions like the ones based on standard RNN models like LSTM are difficult to deploy on tiny devices as they are computationally expensive. Typically, the training data is collected "loosely" to ensure that the event is captured somewhere in the time-series (see Figure 1). For example, in the case of the urban sound detection problem [5], the training data contains the relevant sound somewhere in the given 1-2 secs clip, while the actual vehicle noise itself might last only for a few hundred milliseconds. That is, the sliding window length should be greater than 1 sec, while ideally we should restrict ourselves to 100ms window lengths. But, it is difficult to pin-point the window of data where the particular event occurred. This phenomenon holds across various domains such as keyword detection [35], gesture recognition [26] and activity recognition [2]. Due to the large number of time-steps required to accommodate the buffer around the actual signature, the model size needs to be large enough to capture a large amount of variation in the location of the signature in the data point. 
Moreover, storing the entire data point can be challenging as the data for each time-step can be multidimensional, e.g., IMU data. Finally, due to the large computation time and the failure to tightly capture the signature of the event, the lag in prediction can be large.
Our method: In this work, we propose a method that addresses the above-mentioned challenges in deployment on tiny devices. In particular, our method is motivated by two observations: a) the actual "signature" of the event forms a small fraction of the data point. So, we can split the data point into multiple instances where only a few of the instances are positive instances, i.e., they contain the actual class/event, while the remaining instances represent noise (see Figure 1). b) The event class can be discerned by observing a prefix of the signature of the class. For example, if the audio category is that of repetitive sounds from a jack-hammer, then the class can be predicted with high accuracy using only a few initial samples from the data. We also observe that in many such applications, a few false positives are allowed, as typically data from a positive event detection is uploaded to a powerful cloud server that can further prune out false positives.
Based on the above observations, we propose a robust multi-instance learning (MIL) method with early stopping. Our solution splits the given sequential data point into a collection of windows, referred to here as instances (see Figure 1). All the instances formed from a sequential data point are assigned its class label. Based on observation (a) above, each sequential data point typically has only a few consecutive positive instances, i.e., instances (sub-windows) that contain the true class "signature". The remaining negative instances have noisy labels and hence the classification problem with these labeled instances can be challenging. 
Our MIL algorithm uses the above mentioned structure with an iterative thresholding technique to provide an efficient and accurate algorithm. We train our model to predict the label after observing the data for only a few time-steps of the instance. That is, we jointly train our robust MIL model with an early classification loss function to obtain an accurate model that also reduces computation significantly by predicting early. For simplicity, we instantiate our algorithm for a standard LSTM based architecture. However, our solution is independent of the base architecture and can be applied to other RNN based models like GRU [9], svdRNN [?], UGRNN [11], CNN-LSTM [28] etc.
We also provide a theoretical analysis of our thresholding based robust MIL method. Note that existing generic robust learning results [6, 15] do not apply to sequential data based MIL problems, as the number of incorrectly labeled instances forms a large fraction of all instances (see Figure 1). Instead, we study our algorithm in a simple separable setting and show that despite the presence of a large amount of noise, the algorithm converges to the optimal solution in a small number of iterations. Our analysis represents one of the first positive results for MIL in a more interesting non-homogeneous MIL setting that is known to be NP-hard in general [4, 27].
Empirical results: We present empirical results for our method when applied to five data sets related to activity detection, gesture recognition [2, 26], keyword detection [35] etc. For each of the benchmarks, we observe that our method is able to significantly reduce the computational time while also improving accuracy (in most of the benchmarks) over the baseline LSTM architecture. 
In four of the five data sets we examine, our methods save 65-90% of computation without losing accuracy compared to the baseline (LSTM trained on the entire time-series data) when compared at a fixed LSTM hidden dimension. In fact, for the keyword spotting task [35], our method can provide up to 2% more accurate predictions while saving 2-3x in compute time. This enables deployment on a Raspberry Pi0 device, a task beyond the reach of normal LSTMs. Additionally, in all the data sets, our method is able to improve over baseline LSTM models using a much smaller hidden dimension, thus making inference up to 72x less expensive.

Figure 1: A time-series data point X_i bagged into sliding windows of width ω, just long enough to contain the true class signature.

Figure 2: EMI-RNN architecture. Z_{i,τ} are the "positive" instances from each data point X_i which are then combined with the E-RNN loss (3.2.1) to train the model parameters.

1.1 Related Work
NN architectures for sequential data classification: RNN based architectures are popular for learning with sequential data as they are able to capture long-term dependencies and tend to have a small working memory requirement. Gated architectures like LSTM [21], GRU [8, 10], UGRNN [11] have been shown to be successful for a variety of sequential data based applications. However, they tend to be more expensive than vanilla RNNs, so there has been a lot of recent work on stabilizing the training and performance of RNNs [3, ?]. In addition, CNN based architectures [29] have also been proposed, such as CNN-LSTM [16], but these methods tend to have a large working RAM requirement. Our method is independent of the underlying architecture and can be combined with any of the above RNN-based techniques. 
For our experiments, we focus on the LSTM architecture as it tends to be stable and provides nearly state-of-the-art accuracy on most datasets.
MIL/Robust learning: Multi-instance learning (MIL) techniques are popular for modeling one-sided noise where positive bags of instances can contain several negative instances as well. Existing practical algorithms [1] iteratively refine the set of positive instances to learn a robust classifier but typically do not have strong theoretical properties. On the other hand, [27] consider a homogeneous setting where one can obtain strong theoretical bounds, but the assumptions are too strong for any practical setting and the resulting algorithm is just a standard classifier. In contrast, our algorithm provides significant accuracy improvements on benchmarks, and at the same time, we can analyze the algorithm in a non-homogeneous setting, which represents one of the first positive results in this setting.
Early classification: Recently, several papers have studied the problem of early classification in sequential data [25, 33, 12, 36, 17]; however, these techniques assume a pre-trained classifier and learn a policy for early classification independently, which can lead to a significant loss in accuracy. [24] introduced an early-classification method for the specific task of human activity detection in video. Independent of our work, [7] developed an architecture to reduce the computational cost of RNNs by skipping certain hidden-state updates. They show that this technique leads to almost no loss in accuracy. Their technique is complementary to our work as our focus is on early classification, i.e., predicting the class before observing the entire sequential data point. 
Further, our EMI-RNN architecture uses joint training with the MIL formulation to ensure that such early classification is accurate for a large fraction of points.

2 Problem Formulation and Notation
Suppose we are given a dataset of labeled sequential data points Z = [(X_1, y_1), . . . , (X_n, y_n)] where each sequential data point X_i = [x_{i,1}, x_{i,2}, . . . , x_{i,T}] is a sequence of T data points with x_{i,t} ∈ R^d. That is, the t-th time-step data point of the i-th sequential data point is x_{i,t}. Throughout the paper, v_i denotes the i-th component of a vector v. Let y_i ∈ [L] denote the class of the i-th data point X_i; [L] := {−1, 1, 2, . . . , L}. Here y_i = −1 denotes "noisy" points that do not correspond to any class. For example, in the keyword detection application, X_i is an audio clip with x_{i,t} being the t-th sample from the clip and y_i the keyword spoken in the clip. If no keyword is present, y_i = −1.
Given Z, the goal is to learn a classifier f : R^{d×T} → R^L that can accurately predict the class of an input sequential data point X using argmax_b f_b(X). In addition, for f to be deployable on tiny devices in real-world settings, it must satisfy three key requirements: a) small memory footprint, b) can be computed quickly on resource-constrained devices, c) should predict the correct class after observing as few time-step data points of X as possible, i.e., as early as possible.
We focus on problems where each class has a certain "signature" in the sequential data (see Figure 1). The goal is to train f to identify the signature efficiently and with minimal lag. Further, due to the architectural constraints of tiny devices, it might not be possible to buffer the entire sequential point in memory. 
This implies that early classification with a small number of time-step data points is critical to minimizing the lag in prediction.
Due to the sequential nature of the data, existing approaches use recurrent neural network (RNN) based solutions [21, 8]. However, as mentioned in Section 1, such solutions do not fare well on the three key requirements mentioned above due to training data inefficiency (see Section 1). That is, while the actual class-signature is a small part of the entire data point, in the absence of more fine-grained labels, existing methods typically process the entire sequential data point to predict the class label, which leads to expensive computation and large model sizes.
In the next section, we describe our method that addresses all the key requirements mentioned above: low memory footprint, low computational requirement and small lag in prediction.

3 Method
As mentioned in the previous section, although the entire sequential data point X_i = [x_{i,1}, x_{i,2}, . . . , x_{i,T}] is labeled with class y_i, there is a small subset of consecutive but unknown time-steps that capture the signature of the label, for instance the part of an audio clip where a particular keyword is spoken (see Figure 1). Moreover, the label signature is significantly distinct from other labels, and hence can be detected reasonably early without even observing the entire signature. In this section, we introduce a method that exploits both these insights to obtain a classifier with smaller prediction complexity as well as a smaller "lag" in label prediction.
We would like to stress that our method is orthogonal to the actual neural-network architecture used for classification and can be used with any of the state-of-the-art classification techniques employed for sequential data. In this work we instantiate our method with a simple LSTM based architecture for its ease of training [26, 18, 23]. 
Our method is based on two key techniques; the next two subsections describe the two techniques separately, and Section 3.3 then describes our final method that combines both techniques to provide an efficient and "early" sequential data classifier.

3.1 MI-RNN
As the true label signature can be a tiny fraction of a given sequential data point, we break the data point itself into multiple overlapping instances such that at least one of the instances covers the true label signature (see Figure 1). That is, given a sequential data point X_i, we construct a bag X_i = [Z_{i,1}, . . . , Z_{i,T−ω+1}] where Z_{i,τ} = [x_{i,τ}, . . . , x_{i,τ+ω−1}] ∈ R^{d×ω} represents the τ-th instance (or sub-window) of data points. ω is the instance-length, i.e., the number of time-steps in each instance Z_{i,τ}, and should be set close to the typical length of the label signature. For example, if all keywords in our dataset can be captured in 10 time-steps then ω should be set close to 10.
While a small number of the Z_{i,τ} instances have label y_i, the remaining instances are essentially noise and should be set to have label y_{i,τ} = −1. Since we do not have this label information a priori, the label of each instance in the i-th sequential data point is initialized to y_i. This leads to a heavily noisy labelling. Because of this, the problem can be viewed as a multi-instance learning problem with only "bag" level labels available rather than instance level labels. Similarly, the problem can also be viewed as a robust learning problem where the label for most of the instances is wrong.
Naturally, existing techniques for robust learning do not apply as they are not suitable for the large amount of noise in this setting. 
Moreover, these methods and multi-instance learning methods fail to exploit structure in the problem: the existence of a consecutive set of instances with the given label. Instead, we study a simple optimization problem that captures the above mentioned structure:

min_{f, s_i, 1 ≤ i ≤ n} (1/n) Σ_{i,τ} δ_{i,τ} ℓ(f(Z_{i,τ}), y_i), such that δ_{i,τ} = 1 if τ ∈ [s_i, s_i + k − 1], and δ_{i,τ} = 0 otherwise.   (3.1.1)

Algorithm 1 MI-RNN: Multi-instance RNN
Require: Sequential data {(X_1, y_1), . . . , (X_n, y_n)} with T steps, instance-length ω, parameter k
1: Create multiple instances: {Z_{i,τ}, 1 ≤ τ ≤ T − ω + 1} with Z_{i,τ} = [x_{i,τ}, . . . , x_{i,τ+ω−1}] for all i
2: Initialize: f ← Train-LSTM({(Z_{i,τ}, y_i), ∀i, τ})
3: repeat
4:   s_i = argmax_{τ'} Σ_{τ' ≤ τ ≤ τ'+k−1} f_{y_i}(Z_{i,τ}), ∀i
5:   f ← Train-LSTM({(Z_{i,τ}, y_{i,τ}), τ ∈ [s_i, s_i + k − 1], ∀i})
6: until Convergence

Algorithm 2 Inference for E-RNN
Require: An instance Z = [x_1, x_2, . . . , x_ω], and a probability threshold p̂.
1: for t = 1, 2, . . . , ω do
2:   p_t ← MAX(w^T o_t)
3:   ℓ_t ← ARGMAX(w^T o_t)
4:   if p_t ≥ p̂ or t = ω then
5:     return [ℓ_t, p_t]
6:   end if
7: end for

We propose a thresholding based method that exploits the structure in the label noise and is still simple enough to be implemented efficiently on top of LSTM training. Algorithm 1 presents pseudo-code of our algorithm. The Train-LSTM procedure can use any standard optimization algorithm such as SGD+Momentum [31] or Adam [22] to train a standard LSTM with a multi-class logistic regression layer. 
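The instance construction of Section 3.1 and the consecutive-window selection step of Algorithm 1 can be sketched outside of any deep-learning framework. The following is a minimal NumPy illustration, not the paper's released TensorFlow code; `make_instances` and `select_positive_instances` are hypothetical helper names, and a generic per-instance score array stands in for the trained LSTM's class scores f_{y_i}(Z_{i,τ}):

```python
import numpy as np

def make_instances(X, omega):
    """Bag one sequential data point X (shape T x d) into all
    overlapping instances of omega consecutive time-steps."""
    T = X.shape[0]
    return np.stack([X[tau:tau + omega] for tau in range(T - omega + 1)])

def select_positive_instances(scores, k):
    """Re-labelling step of Algorithm 1: given per-instance scores
    for the bag label y_i, return the start index s_i of the k
    consecutive instances whose score sum is highest."""
    window_sums = np.convolve(scores, np.ones(k), mode="valid")
    return int(np.argmax(window_sums))

# toy usage: 10 time-steps of 3-dimensional data, instance length 4
X = np.arange(30, dtype=float).reshape(10, 3)
Z = make_instances(X, omega=4)                        # 7 instances, each 4 x 3
scores = np.array([0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.0])
s = select_positive_instances(scores, k=3)            # start of best run of 3
```

Only the instances Z[s : s + k] would then be fed back into the next Train-LSTM round; the remaining instances are ignored for that iteration.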
Input to this procedure is the instance-level data constructed by windowing the sequential data, along with the corresponding estimated instance labels. Naturally, our technique applies even if we use any other architecture to learn the classifier f.
We further refine the trained LSTM by re-estimating the "correct" set of instances per sequential data point using a simple thresholding scheme. In particular, we find k (a parameter) consecutive instances for which the sum of predictions (using function f) for the data-point label y is highest and include them in the training set; the remaining instances are ignored for that particular iteration. The predicted label of a given point X = [x_1, . . . , x_T] is given by: y = argmax_b max_τ Σ_{t ∈ [τ, τ+k−1]} f_b(x_t).
While our algorithm is similar to the hard-thresholding approach popular in robust learning [6, 15], both the thresholding operator and the setting are significantly different. For example, robust learning techniques are expected to succeed when the amount of noisy labels is small, but in our problem only k out of T instances are correctly labeled (typically k ≪ T). Nonetheless, we are still able to analyze our method in a simple setting where the data itself is sampled from a well-separated distribution.

3.1.1 Analysis
Let Z = [(Z^P_1, ȳ^P_1), . . . , (Z^P_n, ȳ^P_n), (Z^N_1, ȳ^N_1), . . . , (Z^N_n, ȳ^N_n)] be the given dataset, where Z^P_i denotes the i-th "positive" bag and Z^N_i denotes the i-th "negative" bag. For simplicity, we assume that each positive and negative bag contains T points. ȳ^P_i denotes the label vector for the i-th positive bag. For simplicity, we only consider the binary classification problem in this section. That is, ȳ^P_i ∈ {−1, 1}^T. Also, let the number of positives (i.e., points with label +1) be k and let all the positives be consecutive in the index set 1 ≤ τ ≤ T. By definition, ȳ^N_{i,τ} = −1 for all instances in the bag. Note that ȳ^P_i are not available a priori. Let S ⊆ [n] × [T] be an index set of columns, i.e., S = {(i_1, τ_1), . . . , (i_|S|, τ_|S|)}. Now, given Z^P, Z^N, the goal is to learn a linear classifier w s.t. sign(w^T Z^P_{i,τ}) = ȳ^P_{i,τ} and sign(w^T Z^N_{i,τ}) = −1, i.e., each point in the positive/negative bag is correctly classified. To this end, we use the following modified version of Algorithm 1:

w_0 = argmin_w Σ_{i,τ} (1 − w^T Z^P_{i,τ})² + Σ_{i,τ} (−1 − w^T Z^N_{i,τ})²,
S_{r+1} = ∪_{i=1}^n ∪_{τ=k_1}^{k_1+k} (i, τ), where k_1 = argmax_{k_1} Σ_{τ=k_1}^{k_1+k} ⟨w_r, Z^P_{i,τ}⟩, ∀i, ∀r ≥ 0,
w_{r+1} = argmin_w Σ_{(i,τ) ∈ S_{r+1}} (1 − ⟨w, Z^P_{i,τ}⟩)² + (1/T) Σ_{i,τ} (−1 − ⟨w, Z^N_{i,τ}⟩)², ∀r ≥ 0,   (3.1.2)

where k is a parameter that specifies the number of true positives in each bag, i.e., Σ_τ I[y^P_{i,τ} > 0] = k. Note that the above optimization algorithm is essentially an alternating optimization based algorithm where in each iteration we alternately estimate the classifier w and the correct set of positive points in positive bags. Naturally, the problem is non-convex and is in fact NP-hard in general, and hence the above algorithm might not even converge. However, by restricting the noise in the bags to a class of "nice" distributions, we can show that the above algorithm converges to the global optima at a linear rate. That is, within a small number of steps the algorithm is able to learn the correct set of positive points in each bag and hence the optimal classifier w* as well.
Theorem 3.1. Let Z^P = [Z^P_1, . . . , Z^P_n], Z^N = [Z^N_1, . . . , Z^N_n] be the data matrix of positive and negative bags, respectively, with each bag containing T points, i.e., Z^P_i = [x^P_{i,1}, . . . , x^P_{i,T}] and Z^N_i = [x^N_{i,1}, . . . , x^N_{i,T}]. Let ȳ^P be the true labels of each data point in positive bags. Let each data point be given by: x^P_{i,τ} = 0.5(y^P_{i,τ} + 1)μ⁺ + 0.5(1 − y^P_{i,τ})μ⁻ + g^P_{i,τ} and x^N_{i,τ} = μ⁻ + g^N_{i,τ}. Let g^N_{i,τ} be sampled i.i.d. from D and let D be sub-Gaussian with sub-Gaussian norm ψ(D). W.l.o.g., let each g^N_{i,τ} be isometric random vectors. Also, let g^P_{i,τ} be arbitrary vectors that satisfy: ‖Σ_{(i,τ) ∈ S} g^P_{i,τ}‖ ≤ γ|S| for all S, where γ > 0 is a constant. Moreover, let n ≥ dT C²_ψ / k², where C²_ψ > 0 is a constant dependent only on ψ(D). Let ‖Δμ‖₂ ≥ 400 C_ψ (γ² + 1) · (‖μ⁺‖ + ‖μ⁻‖) log(nT) where Δμ = μ⁺ − μ⁻. Then, R = log n iterations of the Algorithm in (3.1.2) recover the true positive set S* = {(i, τ) : ȳ^P_{i,τ} = +1} exactly, with probability ≥ 1 − 30/n^20.
See supplement for a detailed proof of the above theorem.
Remark 1: Note that we allow the positive set of points to be arbitrarily dependent on each other. The required condition on G^P can be easily satisfied by a dependent set of vectors, for example if g^P_{i,τ} ∼ (1/d) g^P_{i,τ−1} + N(0, I). Also, note that due to the arbitrary dependence of the "positive" set of points, the above problem is not a homogeneous MIML problem and has a more realistic model than the extensively studied homogeneous MIML model [27]. Unfortunately, non-homogeneous MIML problems are in general NP-hard [27]. By exploiting structure in the problem, our thresholding based algorithm is still able to obtain the optimal solution with small computational complexity.
Remark 2: While our technique is similar to hard-thresholding based robust learning works [6, 15], our results hold despite the presence of large amounts of noise (1 − k/T). Existing robust learning results require the fraction of errors to be less than a small constant and hence do not apply here.
Remark 3: While our current proof holds only for squared loss, we believe that similar techniques can be used for classification loss functions as well; further investigation is left for future work.

3.2 Early Classification (E-RNN)
LSTM models for classification tasks typically pipe the hidden state at the final time-step to a standard classifier like logistic regression to obtain the predicted label. As the number of time-steps T can be large, going over all the time-steps can be slow and might exceed the prediction budget of tiny devices. Furthermore, in practice, most of the data points belong to the "noise" class (y = −1), which should be easy to predict early on. To this end, we teach the LSTM when it can stop early by piping the output of each step (instead of just the last step) of the LSTM to the classifier. The network is then trained to optimize the sum of the loss of the classifier's output at each step. That is,

min Σ_i Σ_{t=1}^T ℓ(w^T o_{i,t}),   (3.2.1)

where w is the weight of the fully connected (F.C.) 
layer, and o_{i,t} is the output of the t-th step of the LSTM when processing data point X_i.
For inference, if at step t a class is predicted with probability greater than a tunable threshold p̂, we stop the LSTM and output the class. For efficiency, we predict the probability only after every κ steps. For wake-word detection type tasks where the noise class forms an overwhelming majority of test points, we provide early classification only on the noise class (Figure 4c). Algorithm 2 describes our inference procedure for early classification. We overload notation and assume that f(Z_{i,τ}) outputs a probability distribution over all classes.

3.3 EMI-RNN
MI-RNN and E-RNN are complementary techniques: MI-RNN reduces the number of time-steps in the sequential data point and E-RNN provides early prediction. Naturally, a combination of the two techniques should provide an even more efficient solution. Interestingly, here the total benefit is larger than the sum of its parts. Early classification should become more effective once MI-RNN identifies instances that contain true and tight class signatures, as these signatures are unique to that class. However, since the two methods are trained separately, a naïve combination ends up reducing the accuracy significantly in many cases (see Figure 4).

Table 1: Accuracies of MI-RNN and E-RNN methods compared to a baseline LSTM on the five datasets. Each row corresponds to experiments with a fixed hidden dimension. The first column under MI-RNN (at Round 0) lists the test accuracy of an LSTM trained on instances (windows) before any refinement of instance labels. The next column lists the test accuracy after the completion of MI-RNN. The third column lists the computation saved due to the fewer number of steps needed for the shorter windows. The columns under E-RNN list the fraction of examples that E-RNN can predict early at the 55% (and 75%) time step and the overall prediction accuracy.

Dataset (Hidden dim.) | LSTM  | MI-RNN at Round 0 | MI-RNN | Computation saved % | E-RNN @55%: % early / overall acc. | E-RNN @75%: % early / overall acc.
HAR-6 (8)             | 89.54 | 90.83 | 91.92 | 62.5 | 79.74 / 89.75 | 81.50 / 89.78
HAR-6 (16)            | 92.90 | 92.16 | 93.89 |      | 86.80 / 91.24 | 87.68 / 91.24
HAR-6 (32)            | 93.04 | 93.75 | 91.78 |      | 85.06 / 91.75 | 85.88 / 91.85
Google-13 (16)        | 86.99 | 88.06 | 89.78 | 50.5 | 35.08 / 84.18 | 55.14 / 84.27
Google-13 (32)        | 89.84 | 91.80 | 92.61 |      | 41.05 / 88.31 | 64.41 / 88.42
Google-13 (64)        | 91.13 | 92.87 | 93.16 |      | 59.13 / 92.24 | 85.74 / 92.43
STCI-2 (8)            | 98.07 | 97.38 | 98.08 | 50   | 50.17 / 96.39 | 74.94 / 96.52
STCI-2 (16)           | 98.78 | 98.35 | 99.07 |      | 54.15 / 98.16 | 84.53 / 98.35
STCI-2 (32)           | 99.01 | 98.50 | 98.96 |      | 53.25 / 98.24 | 81.89 / 98.32
GesturePod-6 (16)     |   -   | 96.23 | 98.00 | 50   |   -   /   -   |   -   /   -
GesturePod-6 (32)     | 94.04 | 98.27 | 99.13 |      | 39.38 / 84.48 | 58.93 / 84.48
GesturePod-6 (48)     | 97.13 | 98.27 | 98.43 |      | 76.68 / 96.39 | 99.13 / 96.55
DSA-19 (32)           | 84.56 | 86.97 | 87.01 | 28   | 55.48 / 83.72 | 56.40 / 83.68
DSA-19 (48)           | 85.35 | 84.42 | 89.60 |      | 68.81 / 82.63 | 69.16 / 82.54
DSA-19 (64)           | 85.17 | 85.08 | 88.11 |      | 41.00 / 85.48 | 41.27 / 85.52

To alleviate this concern, we propose an architecture (see Figure 2) that trains MI-RNN and E-RNN jointly. 
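To make the per-step loss of (3.2.1) concrete, here is a small NumPy sketch (not the authors' implementation) that sums a cross-entropy loss, one plausible choice of ℓ, over every LSTM output o_{i,t} rather than only the last one; `erenn_loss` is an illustrative name:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def erenn_loss(outputs, w, y):
    """Sum-of-steps loss in the spirit of (3.2.1): pipe every step's
    LSTM output (shape T x h) through the same classifier w (h x L)
    and sum the per-step cross-entropy against the label y."""
    logits = outputs @ w                  # (T, L): per-step class scores
    probs = softmax(logits)
    return float(-np.log(probs[:, y]).sum())

# sanity check: zero outputs give uniform per-step probabilities, so
# the summed loss over T = 5 steps and 3 classes is 5 * log(3)
loss = erenn_loss(np.zeros((5, 4)), np.zeros((4, 3)), y=0)
```

In a real training loop, this scalar would be minimized over the LSTM parameters and w together; only the choice of ℓ and the summation over steps are taken from (3.2.1).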
That is, we break each data point into multiple instances as in Section 3.1.1 and replace the loss function in the Train-LSTM procedure of Algorithm 1 with the sum loss function of Section 3.2.1. We call this modified version of Algorithm 1 Early Multi-Instance RNN (EMI-RNN). EMI-RNN is able to learn an LSTM that is capable of predicting short signatures accurately as well as stopping early. For instance, on the GesturePod-6 dataset, early classification at the 55% time step leads to a 10% drop in accuracy for E-RNN. With EMI-RNN, not only is this accuracy drop recovered, but an additional 3% improvement and a computational saving of 80% are obtained (see Table 1, Figure 3).
For training, we first apply Algorithm 1 to train MI-RNN and form a new training dataset D = {Z_{i,τ} : τ ∈ [s_i, s_i + k − 1], 1 ≤ i ≤ n} by pruning each bag of instances X_i down to the set of positive instances Z_{i,τ}. We then use D to train the LSTM with the E-RNN loss function (Section 3.2.1), ensuring accurate prediction with early classification. For inference, we use the tunable Algorithm 2 for early prediction on each instance and then compute y = argmax_b max_τ Σ_{t ∈ [τ, τ+k−1]} f_b^{E-RNN}(x_t), where f^{E-RNN} is computed using our jointly trained LSTM model f applied within Algorithm 2.

4 Empirical Results

We empirically evaluate our methods on five sequential (time-series) datasets: three activity/gesture recognition datasets and two audio-related datasets. The details of the datasets, including their sources, are reported in Table 4 in the supplement. HAR-6, GesturePod-6 and DSA-19 are multi-class activity recognition datasets. Of the three, only the GesturePod-6 dataset has negative examples.

Device            Hidden Dim.   LSTM     MI-RNN   EMI-RNN
RPi0 (22.50 ms)       16         28.14    14.06     5.62
                      32         74.46    37.41    14.96
                      64        226.1    112.6     45.03
RPi3 (26.39 ms)       16         12.76     6.48     2.59
                      32         33.10    16.47     6.58
                      64         92.09    46.28    18.51

Table 2: Prediction time in milliseconds for keyword spotting (Google-13) on Raspberry Pi 0 and 3 for a simple LSTM implementation in C. Constraints for real-time detection are listed in brackets.

Figure 3: Trade-off between accuracy gains and computational savings of EMI-RNN over the baseline method. Hidden dimension listed in parentheses.

Figure 4: (a) HAR-6, (b) STCI-2, (c) STCI-2: noise detector. Here, (a) and (b) compare the accuracy and computational saving of early classification (Algorithm 2) on the EMI-RNN model vs. the MI-RNN model. Clearly, for a fixed computational-savings requirement, joint training with EMI-RNN provides significantly more accurate models. In (c), early noise detection with E-RNN on STCI-2 is shown. It can be seen that the drop in accuracy on the predicted fraction is very low (hidden dimension 32).

Google-13 is a multi-class keyword detection dataset where all but 12 classes are treated as negative examples. STCI-2 is a proprietary audio dataset where the goal is to recognize a wake-word (e.g., Hey Cortana) from the background. Although not a prerequisite, in these datasets the actual signature of the activity or audio example is shorter than the length of the training examples.
For training, in addition to the RNN's hyperparameters, EMI-RNN requires the selection of three more hyperparameters. There is a principled way of selecting them. The instance width ω can be set from domain knowledge of the signature (e.g., the longest length of a keyword in keyword spotting), or can be found by cross-validation (Figure 5a in the supplement).
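The instance construction above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the paper's released implementation: `make_instances` and `num_signature_instances` are hypothetical helper names, and the sensor shapes are made up for the example.

```python
import numpy as np

def make_instances(X, instance_width, stride):
    """Split one sequential data point X of shape (T, d) into a bag of
    overlapping instances of shape (num_instances, instance_width, d)."""
    T = X.shape[0]
    starts = range(0, T - instance_width + 1, stride)
    return np.stack([X[s:s + instance_width] for s in starts])

def num_signature_instances(instance_width, stride):
    """k = ceil(omega / instance stride): the number of consecutive
    instances guaranteed to contain a class signature of width omega."""
    return -(-instance_width // stride)  # integer ceiling division

X = np.random.randn(48, 6)                    # e.g. 48 time steps of 6 sensor channels
bag = make_instances(X, instance_width=16, stride=8)
print(bag.shape)                              # (5, 16, 6)
print(num_signature_instances(16, 8))         # 2
```

With an audio stride of ~25 ms, ω would instead be set from the longest keyword length, as discussed above.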
The stride between instances is set to the value typically used for a vanilla RNN, based on domain knowledge (e.g., ∼25 ms for audio) or on cross-validation. The value of k, the number of consecutive instances that contain a signature, is set to ⌈ω / instance stride⌉.
We use the standard accuracy measure to compare the various methods. That is, for a given algorithm, we compute the predicted label ŷ_j for each sequential data point X_j^test, 1 ≤ j ≤ m, in the test set and then compute accuracy as (1/m) Σ_{j=1}^m I[ŷ_j == y_j]. Table 1 compares the accuracies of the MI-RNN and E-RNN methods with that of an LSTM trained on the full-length training data [(X_1, y_1), ..., (X_n, y_n)].

              LSTM                     EMI-RNN
Dataset       Hidden Dim.  Accuracy    Hidden Dim.  Accuracy   Speedup   Speedup @~1% Drop
HAR-6             32        93.04          16        93.89     10.5x     42x
Google-13         64        91.13          32        92.61      8x       32x
STCI-2            32        99.01          16        99.07      8x       32x
GesturePod-6      48        97.13           8        98.00     72x       -
DSA-19            64        85.17          32        87.01      5.5x     -

Table 3: The table shows the hidden size of EMI-RNN required to achieve the same or slightly higher accuracy than the baseline LSTM, and the corresponding inference speed-up over the LSTM. The last column shows the inference speed-up provided by EMI-RNN if the model size is decreased further while ensuring that the accuracy is at most 1% lower than the LSTM's accuracy.

The results in Table 1 show that for a fixed hidden dimension, MI-RNN can cut the computation needed for inference by up to a factor of two while simultaneously improving the prediction accuracy. The accuracy gain of MI-RNN over the baseline is significantly higher for small model sizes (e.g., a gain of 2.5% for Google-13), as MI-RNN is able to prune away noisy instances. Further, Table 1 also demonstrates that E-RNN can train a model that accurately predicts a large fraction of time-series data points early. At 55% of the time steps, up to 80% of the data points can be confidently predicted. In real-world tasks such as wake-word detection, most data points do not contain positive signatures and are hence just noise. In such situations, it is critical to eliminate the negative examples (i.e., noise) as early as possible. Figure 4c demonstrates that around the 100-th time step (of 160), noise can be identified with ∼100% accuracy.
The drawback of E-RNN is that on some datasets the overall accuracy, which includes data points that could not be classified until the last step, degrades. In one instance (GesturePod-6, 32 dim.), accuracy drops by almost 10%. This problem can be addressed by the EMI-RNN method, which jointly trains for the right window labels and for early classification. Figure 3 demonstrates the trade-off between the accuracy gain of the LSTM trained by the EMI-RNN method (over the baseline LSTM) and the percentage of computation saved due to fewer LSTM steps, at a fixed hidden dimension. The trade-off is tuned by adjusting the probability threshold p̂ for early prediction.
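The threshold-based early-prediction rule can be sketched as follows. This is a minimal stand-in in the spirit of Algorithm 2, not the released implementation: `step_fn` and `classify` are hypothetical placeholders for the trained recurrent cell and its softmax output layer.

```python
import numpy as np

def early_predict(x_steps, step_fn, classify, p_hat=0.9, kappa=4):
    """Run the recurrent model step by step, check the class distribution
    every kappa steps, and stop as soon as some class probability
    exceeds the threshold p_hat."""
    h = None
    t = 0
    for t, x in enumerate(x_steps, start=1):
        h = step_fn(x, h)                      # one recurrent update
        if t % kappa == 0:
            probs = classify(h)                # distribution over classes
            if probs.max() >= p_hat:
                return int(probs.argmax()), t  # early exit: (class, steps used)
    return int(classify(h).argmax()), t        # fall back to full-length prediction

# Toy stand-ins whose confidence in class 1 grows with the number of steps.
step_fn = lambda x, h: (0.0 if h is None else h) + x
classify = lambda h: np.array([1.0 - min(h, 20.0) / 20.0, min(h, 20.0) / 20.0])

label, steps = early_predict([1.0] * 32, step_fn, classify, p_hat=0.9, kappa=4)
print(label, steps)  # 1 20 -- stops at step 20 of 32
```

Lowering `p_hat` trades accuracy for earlier exits, which is exactly the knob being tuned here.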
In all cases except STCI-2, EMI-RNN can save 65%–90% of the computation without losing accuracy compared to the baseline. In fact, in most cases EMI-RNN provides up to 2% more accurate predictions while delivering a 2–3× saving in computation. Similarly, we can trade off accuracy for a higher computational speed-up: EMI-RNN outperforms baseline LSTM models using less than half the number of states, thus providing large computational gains (Table 3).
Improvements in computation time can critically enable real-time predictions on tiny devices such as the Raspberry Pi. Consider the example of keyword spotting, as in Google-13. Typically, predictions on audio samples are made on sliding windows of 30 ms, i.e., every 30 ms we extract spectral features and start a new LSTM to predict the keyword in the trailing window. For real-time prediction, this must be completed within 30 ms. Table 2 demonstrates that while a standard LSTM cannot meet this deadline, LSTMs trained with the MI-RNN and EMI-RNN techniques comfortably accommodate a 32-hidden-dimension LSTM while learning a model that is 1.5% more accurate. Note that the times for EMI-RNN in the table were estimated by choosing the probability threshold p̂ that provides at least a 1% accuracy improvement over the baseline.

5 Conclusion

This paper proposed the EMI-RNN algorithm for sequential data classification. EMI-RNN is based on a multi-instance learning (MIL) formulation of the sequential data classification problem and exploits techniques from robust learning to ensure efficient training. Analysis of EMI-RNN showed that it can be efficiently trained to recover the globally optimal solution in interesting non-homogeneous MIL settings.
Furthermore, on several benchmarks, EMI-RNN outperformed the baseline LSTM while providing up to a 70x reduction in inference cost.
While this paper restricted its attention to fixed-length, sensor-related problems, the application of a similar approach to natural language processing problems should be of interest. Relaxing the restrictive assumptions on data distributions in the analysis of EMI-RNN should also be of wide interest.

References

[1] Jaume Amores. Multiple instance classification: Review, taxonomy and comparative study. Artificial Intelligence, 201:81–105, 2013.
[2] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, and Jorge Luis Reyes-Ortiz. A public domain dataset for human activity recognition using smartphones. In ESANN, 2013.
[3] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016.
[4] Peter Auer, Philip M Long, and Aravind Srinivasan. Approximating hyper-rectangles: Learning and pseudo-random sets. In Proceedings of the twenty-ninth annual ACM symposium on Theory of Computing, pages 314–323. ACM, 1997.
[5] J. P. Bello, C. Mydlarz, and J. Salamon. Sound analysis in smart cities. In Computational Analysis of Sound Scenes and Events, pages 373–397. 2018.
[6] Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust Regression via Hard Thresholding. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems, 2015.
[7] Víctor Campos, Brendan Jou, Xavier Giró i Nieto, Jordi Torres, and Shih-Fu Chang. Skip RNN: Learning to skip state updates in recurrent neural networks. In International Conference on Learning Representations, 2018.
[8] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches.
arXiv preprint\narXiv:1409.1259, 2014.\n\n[9] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,\nHolger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder\u2013\ndecoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical\nMethods in Natural Language Processing (EMNLP), pages 1724\u20131734, 2014.\n\n[10] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation\nof gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555,\n2014.\n\n[11] Jasmine Collins, Jascha Sohl-Dickstein, and David Sussillo. Capacity and trainability in\n\nrecurrent neural networks. arXiv preprint arXiv:1611.09913, 2016.\n\n[12] Asma Dachraoui, Alexis Bondu, and Antoine Cornu\u00e9jols. Early classi\ufb01cation of time series\nas a non myopic sequential decision making problem. In Machine Learning and Knowledge\nDiscovery in Databases, pages 433\u2013447, 2015.\n\n[13] Alessandro D\u2019Ausilio. Arduino: A low-cost multipurpose lab equipment. Behavior research\n\nmethods, 44(2):305\u2013313, 2012.\n\n[14] Don Kurian Dennis, Chirag Pabbaraju, Harsha Vardhan Simhadri, and Prateek Jain. The\n\nEdgeML Libraray: Code for EMI-RNN, 2018. https://github.com/Microsoft/EdgeML.\n\n[15] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in\nhigh dimensions without the computational intractability. In 2016 IEEE 57th Annual Symposium\non Foundations of Computer Science (FOCS), pages 655\u2013664, Oct 2016.\n\n[16] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini\nVenugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for\nvisual recognition and description. In Proceedings of the IEEE conference on computer vision\nand pattern recognition, pages 2625\u20132634, 2015.\n\n[17] Mohamed F. Ghalwash and Zoran Obradovic. 
Early classification of multivariate temporal observations by extraction of interpretable shapelets. BMC Bioinformatics, 13(1):195, Aug 2012.
[18] Chirag Gupta, Arun Sai Suggala, Ankit Goyal, Harsha Vardhan Simhadri, Bhargavi Paranjape, Ashish Kumar, Saurabh Goyal, Raghavendra Udupa, Manik Varma, and Prateek Jain. ProtoNN: Compressed and accurate kNN for resource-scarce devices. In Proceedings of the 34th International Conference on Machine Learning, pages 1331–1340, 2017.
[19] Gareth Halfacree and Eben Upton. Raspberry Pi User Guide. Wiley Publishing, 1st edition, 2012.
[20] Nils Y. Hammerla, Shane Halloran, and Thomas Ploetz. Deep, convolutional, and recurrent models for human activity recognition using wearables. arXiv preprint arXiv:1604.08880, 2016.
[21] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[23] Ashish Kumar, Saurabh Goyal, and Manik Varma. Resource-efficient machine learning in 2 KB RAM for the internet of things. In Proceedings of the 34th International Conference on Machine Learning, pages 1935–1944, 2017.
[24] S. Ma, L. Sigal, and S. Sclaroff. Learning activity progression in LSTMs for activity detection and early detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1942–1950, June 2016.
[25] U. Mori, A. Mendiburu, S. Dasgupta, and J. A. Lozano. Early classification of time series by simultaneously optimizing the accuracy and earliness.
IEEE Transactions on Neural Networks and Learning Systems, pages 1–10, 2017.
[26] Shishir Patil, Don Kurian Dennis, Chirag Pabbaraju, Rajanikant Deshmukh, Harsha Simhadri, Manik Varma, and Prateek Jain. GesturePod: Programmable gesture recognition for augmenting assistive devices. Technical report, May 2018. https://www.microsoft.com/en-us/research/publication/gesturepod-programmable-gesture-recognition-augmenting-assistive-devices/.
[27] Sivan Sabato and Naftali Tishby. Multi-instance learning with any hypothesis class. Journal of Machine Learning Research, 13(Oct):2999–3039, 2012.
[28] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak. Convolutional, long short-term memory, fully connected deep neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4580–4584, 2015.
[29] Tara N Sainath and Carolina Parada. Convolutional neural networks for small-footprint keyword spotting. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[30] Hasim Sak, Andrew W. Senior, Kanishka Rao, and Françoise Beaufays. Fast and accurate recurrent neural network acoustic models for speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[31] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, pages 1139–1147, 2013.
[32] Raphael Tang, Weijie Wang, Zhucheng Tu, and Jimmy Lin. An experimental analysis of the power consumption of convolutional neural networks for keyword spotting. arXiv preprint arXiv:1711.00333, 2017.
[33] Romain Tavenard and Simon Malinowski. Cost-aware early classification of time series.
In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 632–647, 2016.
[34] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, Theory and Applications, chapter 5, pages 210–268. Cambridge University Press, 2012.
[35] Pete Warden. Speech commands: A public dataset for single-word speech recognition, 2017. Dataset available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz.
[36] Zhengzheng Xing, Jian Pei, and Philip S. Yu. Early classification on time series. Knowledge and Information Systems, 31(1):105–127, 2012.