{"title": "A Latent Source Model for Nonparametric Time Series Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1088, "page_last": 1096, "abstract": "For classifying time series, a nearest-neighbor approach is widely used in practice with performance often competitive with or better than more elaborate methods such as neural networks, decision trees, and support vector machines. We develop theoretical justification for the effectiveness of nearest-neighbor-like classification of time series. Our guiding hypothesis is that in many applications, such as forecasting which topics will become trends on Twitter, there aren't actually that many prototypical time series to begin with, relative to the number of time series we have access to, e.g., topics become trends on Twitter only in a few distinct manners whereas we can collect massive amounts of Twitter data. To operationalize this hypothesis, we propose a latent source model for time series, which naturally leads to a \"weighted majority voting\" classification rule that can be approximated by a nearest-neighbor classifier. We establish nonasymptotic performance guarantees of both weighted majority voting and nearest-neighbor classification under our model accounting for how much of the time series we observe and the model complexity. Experimental results on synthetic data show weighted majority voting achieving the same misclassification rate as nearest-neighbor classification while observing less of the time series. We then use weighted majority to forecast which news topics on Twitter become trends, where we are able to detect such \"trending topics\" in advance of Twitter 79% of the time, with a mean early advantage of 1 hour and 26 minutes, a true positive rate of 95%, and a false positive rate of 4%.", "full_text": "A Latent Source Model for Nonparametric Time Series Classification\n\nGeorge H. 
Chen\nMIT\ngeorgehc@mit.edu\n\nStanislav Nikolov\nTwitter\nsnikolov@twitter.com\n\nDevavrat Shah\nMIT\ndevavrat@mit.edu\n\nAbstract\n\nFor classifying time series, a nearest-neighbor approach is widely used in practice with performance often competitive with or better than more elaborate methods such as neural networks, decision trees, and support vector machines. We develop theoretical justification for the effectiveness of nearest-neighbor-like classification of time series. Our guiding hypothesis is that in many applications, such as forecasting which topics will become trends on Twitter, there aren\u2019t actually that many prototypical time series to begin with, relative to the number of time series we have access to, e.g., topics become trends on Twitter only in a few distinct manners whereas we can collect massive amounts of Twitter data. To operationalize this hypothesis, we propose a latent source model for time series, which naturally leads to a \u201cweighted majority voting\u201d classification rule that can be approximated by a nearest-neighbor classifier. We establish nonasymptotic performance guarantees of both weighted majority voting and nearest-neighbor classification under our model accounting for how much of the time series we observe and the model complexity. Experimental results on synthetic data show weighted majority voting achieving the same misclassification rate as nearest-neighbor classification while observing less of the time series. 
We then use weighted majority to forecast which news topics on Twitter become trends, where we are able to detect such \u201ctrending topics\u201d in advance of Twitter 79% of the time, with a mean early advantage of 1 hour and 26 minutes, a true positive rate of 95%, and a false positive rate of 4%.\n\n1 Introduction\n\nRecent years have seen an explosion in the availability of time series data related to virtually every human endeavor \u2014 data that demands to be analyzed and turned into valuable insights. A key recurring task in mining this data is being able to classify a time series. As a running example used throughout this paper, consider a time series that tracks how much activity there is for a particular news topic on Twitter. Given this time series up to the present time, we ask \u201cwill this news topic go viral?\u201d Borrowing Twitter\u2019s terminology, we label the time series a \u201ctrend\u201d and call its corresponding news topic a trending topic if the news topic goes viral; otherwise, the time series has label \u201cnot trend\u201d. We seek to forecast whether a news topic will become a trend before it is declared a trend (or not) by Twitter, amounting to a binary classification problem. Importantly, we skirt the discussion of what makes a topic considered trending as this is irrelevant to our mathematical development.\u00b9 Furthermore, we remark that handling the case where a single time series can have different labels at different times is beyond the scope of this paper.\n\n\u00b9 While it is not public knowledge how Twitter defines a topic to be a trending topic, Twitter does provide information for which topics are trending topics. 
We take these labels to be ground truth, effectively treating how a topic goes viral as a black box supplied by Twitter.\n\nNumerous standard classification methods have been tailored to classify time series, yet a simple nearest-neighbor approach is hard to beat in terms of classification performance on a variety of datasets [20], with results competitive with or better than various other more elaborate methods such as neural networks [15], decision trees [16], and support vector machines [19]. More recently, researchers have examined which distance to use with nearest-neighbor classification [2, 7, 18] or how to boost classification performance by applying different transformations to the time series before using nearest-neighbor classification [1]. These existing results are mostly experimental, lacking theoretical justification for both when nearest-neighbor-like time series classifiers should be expected to perform well and how well.\n\nIf we don\u2019t confine ourselves to classifying time series, then as the amount of data tends to infinity, nearest-neighbor classification has been shown to achieve a probability of error that is at worst twice the Bayes error rate, and when considering the nearest $k$ neighbors with $k$ allowed to grow with the amount of data, then the error rate approaches the Bayes error rate [5]. However, rather than examining the asymptotic case where the amount of data goes to infinity, we instead pursue nonasymptotic performance guarantees in terms of how large a training dataset we have and how much we observe of the time series to be classified. To arrive at these nonasymptotic guarantees, we impose a low-complexity structure on time series.\n\nOur contributions. 
We present a model for which nearest-neighbor-like classification performs well by operationalizing the following hypothesis: In many time series applications, there are only a small number of prototypical time series relative to the number of time series we can collect. For example, posts on Twitter are generated by humans, who are often behaviorally predictable in aggregate. This suggests that topics they post about only become trends on Twitter in a few distinct manners, yet we have at our disposal enormous volumes of Twitter data. In this context, we present a novel latent source model: time series are generated from a small collection of $m$ unknown latent sources, each having one of two labels, say \u201ctrend\u201d or \u201cnot trend\u201d. Our model\u2019s maximum a posteriori (MAP) time series classifier can be approximated by weighted majority voting, which compares the time series to be classified with each of the time series in the labeled training data. Each training time series casts a weighted vote in favor of its ground truth label, with the weight depending on how similar the time series being classified is to the training example. The final classification is \u201ctrend\u201d or \u201cnot trend\u201d depending on which label has the higher overall vote. The voting is nonparametric in that it does not learn parameters for a model and is driven entirely by the training data. The unknown latent sources are never estimated; the training data serve as a proxy for these latent sources. Weighted majority voting itself can be approximated by a nearest-neighbor classifier, which we also analyze.\n\nUnder our model, we show sufficient conditions so that if we have $n = \\Theta(m \\log \\frac{m}{\\delta})$ time series in our training data, then weighted majority voting and nearest-neighbor classification correctly classify a new time series with probability at least $1 - \\delta$ after observing its first $\\Omega(\\log \\frac{m}{\\delta})$ time steps. As our analysis accounts for how much of the time series we observe, our results readily apply to the \u201conline\u201d setting in which a time series is to be classified while it streams in (as is the case for forecasting trending topics) as well as the \u201coffline\u201d setting where we have access to the entire time series. Also, while our analysis yields matching error upper bounds for the two classifiers, experimental results on synthetic data suggest that weighted majority voting outperforms nearest-neighbor classification early on when we observe very little of the time series to be classified. Meanwhile, a specific instantiation of our model leads to a spherical Gaussian mixture model, where the latent sources are Gaussian mixture components. We show that existing performance guarantees on learning spherical Gaussian mixture models [6, 10, 17] require more stringent conditions than what our results need, suggesting that learning the latent sources is overkill if the goal is classification.\n\nLastly, we apply weighted majority voting to forecasting trending topics on Twitter. We emphasize that our goal is precognition of trends: predicting whether a topic is going to be a trend before it is actually declared to be a trend by Twitter or, in theory, any other third party that we can collect ground truth labels from. Existing work that identifies trends on Twitter [3, 4, 13] instead, as part of its trend detection, defines models for what trends are, which we do not do, nor do we assume we have access to such definitions. 
(The same could be said of previous work on novel document detection on Twitter [11, 12].) In our experiments, weighted majority voting is able to predict whether a topic will be a trend in advance of Twitter 79% of the time, with a mean early advantage of 1 hour and 26 minutes, a true positive rate of 95%, and a false positive rate of 4%. We empirically find that the Twitter activity of a news topic that becomes a trend tends to follow one of a finite number of patterns, which could be thought of as latent sources.\n\nOutline. Weighted majority voting and nearest-neighbor classification for time series are presented in Section 2. We provide our latent source model and theoretical performance guarantees of weighted majority voting and nearest-neighbor classification under this model in Section 3. Experimental results for synthetic data and forecasting trending topics on Twitter are in Section 4.\n\n2 Weighted Majority Voting and Nearest-Neighbor Classification\n\nGiven a time series\u00b2 $s : \\mathbb{Z} \\to \\mathbb{R}$, we want to classify it as having either label $+1$ (\u201ctrend\u201d) or $-1$ (\u201cnot trend\u201d). To do so, we have access to labeled training data $\\mathcal{R}_+$ and $\\mathcal{R}_-$, which denote the sets of all training time series with labels $+1$ and $-1$ respectively.\n\nWeighted majority voting. Each positively-labeled example $r \\in \\mathcal{R}_+$ casts a weighted vote $e^{-\\gamma d^{(T)}(r, s)}$ for whether time series $s$ has label $+1$, where $d^{(T)}(r, s)$ is some measure of similarity between the two time series $r$ and $s$, superscript $(T)$ indicates that we are only allowed to look at the first $T$ time steps (i.e., time steps $1, 2, \\dots, T$) of $s$ (but we\u2019re allowed to look outside of these time steps for the training time series $r$), and constant $\\gamma \\ge 0$ is a scaling parameter that determines the \u201csphere of influence\u201d of each example. 
Similarly, each negatively-labeled example in $\\mathcal{R}_-$ also casts a weighted vote for whether time series $s$ has label $-1$.\n\nThe similarity measure $d^{(T)}(r, s)$ could, for example, be squared Euclidean distance: $d^{(T)}(r, s) = \\sum_{t=1}^{T} (r(t) - s(t))^2 \\triangleq \\|r - s\\|_T^2$. However, this similarity measure only looks at the first $T$ time steps of training time series $r$. Since time series in our training data are known, we need not restrict our attention to their first $T$ time steps. Thus, we use the following similarity measure:\n\n$$d^{(T)}(r, s) = \\min_{\\Delta \\in \\{-\\Delta_{\\max}, \\dots, 0, \\dots, \\Delta_{\\max}\\}} \\sum_{t=1}^{T} (r(t + \\Delta) - s(t))^2 = \\min_{\\Delta \\in \\{-\\Delta_{\\max}, \\dots, 0, \\dots, \\Delta_{\\max}\\}} \\|r * \\Delta - s\\|_T^2, \\tag{1}$$\n\nwhere we minimize over integer time shifts with a pre-specified maximum allowed shift $\\Delta_{\\max} \\ge 0$. Here, we have used $q * \\Delta$ to denote time series $q$ advanced by $\\Delta$ time steps, i.e., $(q * \\Delta)(t) = q(t + \\Delta)$.\n\nFinally, we sum up all of the weighted $+1$ votes and then all of the weighted $-1$ votes. The label with the majority of overall weighted votes is declared as the label for $s$:\n\n$$\\widehat{L}^{(T)}(s; \\gamma) = \\begin{cases} +1 & \\text{if } \\sum_{r \\in \\mathcal{R}_+} e^{-\\gamma d^{(T)}(r, s)} \\ge \\sum_{r \\in \\mathcal{R}_-} e^{-\\gamma d^{(T)}(r, s)}, \\\\ -1 & \\text{otherwise.} \\end{cases} \\tag{2}$$\n\nUsing a larger time window size $T$ corresponds to waiting longer before we make a prediction. We need to trade off how long we wait and how accurate we want our prediction. Note that $k$-nearest-neighbor classification corresponds to only considering the $k$ nearest neighbors of $s$ among all training time series; all other votes are set to 0. With $k = 1$, we obtain the following classifier:\n\nNearest-neighbor classifier. Let $\\widehat{r} = \\arg\\min_{r \\in \\mathcal{R}_+ \\cup \\mathcal{R}_-} d^{(T)}(r, s)$ be the nearest neighbor of $s$. Then we declare the label for $s$ to be:\n\n$$\\widehat{L}^{(T)}_{NN}(s) = \\begin{cases} +1 & \\text{if } \\widehat{r} \\in \\mathcal{R}_+, \\\\ -1 & \\text{if } \\widehat{r} \\in \\mathcal{R}_-. \\end{cases} \\tag{3}$$\n\n3 A Latent Source Model and Theoretical Guarantees\n\nWe assume there to be $m$ unknown latent sources (time series) that generate observed time series. Let $\\mathcal{V}$ denote the set of all such latent sources; each latent source $v : \\mathbb{Z} \\to \\mathbb{R}$ in $\\mathcal{V}$ has a true label $+1$ or $-1$. 
Let $\\mathcal{V}_+ \\subset \\mathcal{V}$ be the set of latent sources with label $+1$, and $\\mathcal{V}_- \\subset \\mathcal{V}$ be the set of those with label $-1$. The observed time series are generated from latent sources as follows:\n\n1. Sample latent source $V$ from $\\mathcal{V}$ uniformly at random.\u00b3 Let $L \\in \\{\\pm 1\\}$ be the label of $V$.\n2. Sample integer time shift $\\Delta$ uniformly from $\\{0, 1, \\dots, \\Delta_{\\max}\\}$.\n3. Output time series $S : \\mathbb{Z} \\to \\mathbb{R}$ to be latent source $V$ advanced by $\\Delta$ time steps, followed by adding noise signal $E : \\mathbb{Z} \\to \\mathbb{R}$, i.e., $S(t) = V(t + \\Delta) + E(t)$. The label associated with the generated time series $S$ is the same as that of $V$, i.e., $L$. Entries of noise $E$ are i.i.d. zero-mean sub-Gaussian with parameter $\\sigma$, which means that for any time index $t$,\n\n$$\\mathbb{E}[\\exp(\\lambda E(t))] \\le \\exp\\left(\\tfrac{1}{2} \\lambda^2 \\sigma^2\\right) \\quad \\text{for all } \\lambda \\in \\mathbb{R}. \\tag{4}$$\n\nThe family of sub-Gaussian distributions includes a variety of distributions, such as a zero-mean Gaussian with standard deviation $\\sigma$ and a uniform distribution over $[-\\sigma, \\sigma]$.\n\n\u00b2 We index time using $\\mathbb{Z}$ for notational convenience but will assume time series to start at time step 1.\n\n\u00b3 While we keep the sampling uniform for clarity of presentation, our theoretical guarantees can easily be extended to the case where the sampling is not uniform. The only change is that the number of training data needed will be larger by a factor of $\\frac{1}{m \\pi_{\\min}}$, where $\\pi_{\\min}$ is the smallest probability of a particular latent source occurring.\n\nFigure 1: Example of latent sources superimposed, where each latent source is shifted vertically in amplitude such that every other latent source has label $+1$ and the rest have label $-1$.\n\nThe above generative process defines our latent source model. Importantly, we make no assumptions about the structure of the latent sources. For instance, the latent sources could be tiled as shown in Figure 1, where they are evenly separated vertically and alternate between the two different classes $+1$ and $-1$. 
With a parametric model like a $k$-component Gaussian mixture model, estimating these latent sources could be problematic. For example, if we take any two adjacent latent sources with label $+1$ and cluster them, then this cluster could be confused with the latent source having label $-1$ that is sandwiched in between. Noise only complicates estimating the latent sources. In this example, the $k$-component Gaussian mixture model needed for label $+1$ would require $k$ to be the exact number of latent sources with label $+1$, which is unknown. In general, the number of samples we need from a Gaussian mixture model to estimate the mixture component means is exponential in the number of mixture components [14]. As we discuss next, for classification, we sidestep learning the latent sources altogether, instead using training data as a proxy for latent sources. At the end of this section, we compare our sample complexity for classification versus some existing sample complexities for learning Gaussian mixture models.\n\nClassification. If we knew the latent sources and if noise entries $E(t)$ were i.i.d. $\\mathcal{N}(0, \\frac{1}{2\\gamma})$ across $t$, then the maximum a posteriori (MAP) estimate for label $L$ given an observed time series $S = s$ is\n\n$$\\widehat{L}^{(T)}_{\\mathrm{MAP}}(s; \\gamma) = \\begin{cases} +1 & \\text{if } \\Lambda^{(T)}_{\\mathrm{MAP}}(s; \\gamma) \\ge 1, \\\\ -1 & \\text{otherwise,} \\end{cases} \\tag{5}$$\n\nwhere\n\n$$\\Lambda^{(T)}_{\\mathrm{MAP}}(s; \\gamma) \\triangleq \\frac{\\sum_{v_+ \\in \\mathcal{V}_+} \\sum_{\\Delta_+ \\in \\mathcal{D}_+} \\exp\\left(-\\gamma \\|v_+ * \\Delta_+ - s\\|_T^2\\right)}{\\sum_{v_- \\in \\mathcal{V}_-} \\sum_{\\Delta_- \\in \\mathcal{D}_+} \\exp\\left(-\\gamma \\|v_- * \\Delta_- - s\\|_T^2\\right)}, \\tag{6}$$\n\nand $\\mathcal{D}_+ \\triangleq \\{0, \\dots, \\Delta_{\\max}\\}$.\n\nHowever, we do not know the latent sources, nor do we know if the noise is i.i.d. Gaussian. We assume that we have access to training data as given in Section 2. We make a further assumption that the training data were sampled from the latent source model and that we have $n$ different training time series. Denote $\\mathcal{D} \\triangleq \\{-\\Delta_{\\max}, \\dots, 0, \\dots, \\Delta_{\\max}\\}$. Then we approximate the MAP classifier by using training data as a proxy for the latent sources. Specifically, we take ratio (6), replace the inner sum by a minimum in the exponent, replace $\\mathcal{V}_+$ and $\\mathcal{V}_-$ by $\\mathcal{R}_+$ and $\\mathcal{R}_-$, and replace $\\mathcal{D}_+$ by $\\mathcal{D}$ to obtain the ratio:\n\n$$\\Lambda^{(T)}(s; \\gamma) \\triangleq \\frac{\\sum_{r_+ \\in \\mathcal{R}_+} \\exp\\left(-\\gamma \\min_{\\Delta_+ \\in \\mathcal{D}} \\|r_+ * \\Delta_+ - s\\|_T^2\\right)}{\\sum_{r_- \\in \\mathcal{R}_-} \\exp\\left(-\\gamma \\min_{\\Delta_- \\in \\mathcal{D}} \\|r_- * \\Delta_- - s\\|_T^2\\right)}. \\tag{7}$$\n\nPlugging $\\Lambda^{(T)}$ in place of $\\Lambda^{(T)}_{\\mathrm{MAP}}$ in classification rule (5) yields the weighted majority voting rule (2). Note that weighted majority voting could be interpreted as a smoothed nearest-neighbor approximation whereby we only consider the time-shifted version of each example time series that is closest to the observed time series $s$. If we didn\u2019t replace the summations over time shifts with minimums in the exponent, then we would have a kernel density estimate in the numerator and in the denominator [9, Chapter 7] (where the kernel is Gaussian) and our main theoretical result for weighted majority voting to follow would still hold using the same proof.\u2074\n\nLastly, applications may call for trading off true and false positive rates. We can do this by generalizing decision rule (5) to declare the label of $s$ to be $+1$ if $\\Lambda^{(T)}(s; \\gamma) \\ge \\theta$ and vary parameter $\\theta > 0$. The resulting decision rule, which we refer to as generalized weighted majority voting, is thus:\n\n$$\\widehat{L}^{(T)}_{\\theta}(s; \\gamma) = \\begin{cases} +1 & \\text{if } \\Lambda^{(T)}(s; \\gamma) \\ge \\theta, \\\\ -1 & \\text{otherwise,} \\end{cases} \\tag{8}$$\n\nwhere setting $\\theta = 1$ recovers the usual weighted majority voting (2). This modification to the classifier can be thought of as adjusting the priors on the relative sizes of the two classes. Our theoretical results to follow actually cover this more general case rather than only that of $\\theta = 1$.\n\nTheoretical guarantees. 
We now present the main theoretical results of this paper, which identify sufficient conditions under which generalized weighted majority voting (8) and nearest-neighbor classification (3) can classify a time series correctly with high probability, accounting for the size of the training dataset and how much we observe of the time series to be classified. First, we define the \u201cgap\u201d between $\\mathcal{R}_+$ and $\\mathcal{R}_-$ restricted to time length $T$ and with maximum time shift $\\Delta_{\\max}$ as:\n\n$$G^{(T)}(\\mathcal{R}_+, \\mathcal{R}_-, \\Delta_{\\max}) \\triangleq \\min_{\\substack{r_+ \\in \\mathcal{R}_+,\\, r_- \\in \\mathcal{R}_-, \\\\ \\Delta_+, \\Delta_- \\in \\mathcal{D}}} \\|r_+ * \\Delta_+ - r_- * \\Delta_-\\|_T^2. \\tag{9}$$\n\nThis quantity measures how far apart the two different classes are if we only look at length-$T$ chunks of each time series and allow all shifts of at most $\\Delta_{\\max}$ time steps in either direction.\n\nOur first main result is stated below. We defer proofs to the longer version of this paper.\n\nTheorem 1. (Performance guarantee for generalized weighted majority voting) Let $m_+ = |\\mathcal{V}_+|$ be the number of latent sources with label $+1$, and $m_- = |\\mathcal{V}_-| = m - m_+$ be the number of latent sources with label $-1$. 
For any $\\beta > 1$, under the latent source model with $n > \\beta m \\log m$ time series in the training data, the probability of misclassifying time series $S$ with label $L$ using generalized weighted majority voting $\\widehat{L}^{(T)}_{\\theta}(\\cdot; \\gamma)$ satisfies the bound\n\n$$\\mathbb{P}(\\widehat{L}^{(T)}_{\\theta}(S; \\gamma) \\ne L) \\le \\left(\\frac{\\theta m_+}{m} + \\frac{m_-}{\\theta m}\\right)(2\\Delta_{\\max} + 1)\\, n \\exp\\left(-(\\gamma - 4\\sigma^2\\gamma^2)\\, G^{(T)}(\\mathcal{R}_+, \\mathcal{R}_-, \\Delta_{\\max})\\right) + m^{-\\beta + 1}. \\tag{10}$$\n\nAn immediate consequence is that given error tolerance $\\delta \\in (0, 1)$ and with choice $\\gamma \\in (0, \\frac{1}{4\\sigma^2})$, upper bound (10) is at most $\\delta$ (by having each of the two terms on the right-hand side be $\\le \\frac{\\delta}{2}$) if $n > m \\log \\frac{2m}{\\delta}$ (i.e., $\\beta = 1 + \\log \\frac{2}{\\delta} / \\log m$), and\n\n$$G^{(T)}(\\mathcal{R}_+, \\mathcal{R}_-, \\Delta_{\\max}) \\ge \\frac{\\log\\left(\\frac{\\theta m_+}{m} + \\frac{m_-}{\\theta m}\\right) + \\log(2\\Delta_{\\max} + 1) + \\log n + \\log \\frac{2}{\\delta}}{\\gamma - 4\\sigma^2\\gamma^2}. \\tag{11}$$\n\nThis means that if we have access to a large enough pool of labeled time series, i.e., the pool has $\\Omega(m \\log \\frac{m}{\\delta})$ time series, then we can subsample $n = \\Theta(m \\log \\frac{m}{\\delta})$ of them to use as training data. Then with choice $\\gamma = \\frac{1}{8\\sigma^2}$, generalized weighted majority voting (8) correctly classifies a new time series $S$ with probability at least $1 - \\delta$ if\n\n$$G^{(T)}(\\mathcal{R}_+, \\mathcal{R}_-, \\Delta_{\\max}) = \\Omega\\left(\\sigma^2 \\left(\\log\\left(\\frac{\\theta m_+}{m} + \\frac{m_-}{\\theta m}\\right) + \\log(2\\Delta_{\\max} + 1) + \\log \\frac{m}{\\delta}\\right)\\right). \\tag{12}$$\n\nThus, the gap between sets $\\mathcal{R}_+$ and $\\mathcal{R}_-$ needs to grow logarithmically in the number of latent sources $m$ in order for weighted majority voting to classify correctly with high probability. Assuming that the original unknown latent sources are separated (otherwise, there is no hope to distinguish between the classes using any classifier) and the gap in the training data grows as $G^{(T)}(\\mathcal{R}_+, \\mathcal{R}_-, \\Delta_{\\max}) = \\Omega(\\sigma^2 T)$ (otherwise, the closest two training time series from opposite classes are within noise of each other), then observing the first $T = \\Omega(\\log(\\theta + \\frac{1}{\\theta}) + \\log(2\\Delta_{\\max} + 1) + \\log \\frac{m}{\\delta})$ time steps from the time series is sufficient to classify it correctly with probability at least $1 - \\delta$.\n\n\u2074 We use a minimum rather than a summation over time shifts to make the method more similar to existing time series classification work (e.g., [20]), which minimizes over time warpings rather than simple shifts.\n\nA similar result holds for the nearest-neighbor classifier (3).\n\nTheorem 2. (Performance guarantee for nearest-neighbor classification) For any $\\beta > 1$, under the latent source model with $n > \\beta m \\log m$ time series in the training data, the probability of misclassifying time series $S$ with label $L$ using the nearest-neighbor classifier $\\widehat{L}^{(T)}_{NN}(\\cdot)$ satisfies the bound\n\n$$\\mathbb{P}(\\widehat{L}^{(T)}_{NN}(S) \\ne L) \\le (2\\Delta_{\\max} + 1)\\, n \\exp\\left(-\\frac{1}{16\\sigma^2}\\, G^{(T)}(\\mathcal{R}_+, \\mathcal{R}_-, \\Delta_{\\max})\\right) + m^{-\\beta + 1}. \\tag{13}$$\n\nOur generalized weighted majority voting bound (10) with $\\theta = 1$ (corresponding to regular weighted majority voting) and $\\gamma = \\frac{1}{8\\sigma^2}$ matches our nearest-neighbor classification bound, since then $\\gamma - 4\\sigma^2\\gamma^2 = \\frac{1}{16\\sigma^2}$, suggesting that the two methods have similar behavior when the gap grows with $T$. In practice, we find weighted majority voting to outperform nearest-neighbor classification when $T$ is small, and then as $T$ grows large, the two methods exhibit similar performance in agreement with our theoretical analysis. For small $T$, it could still be fairly likely that the nearest neighbor found has the wrong label, dooming the nearest-neighbor classifier to failure. 
Weighted majority voting, on the other hand, can recover from this situation as there may be enough correctly labeled training time series close by that contribute to a higher overall vote for the correct class. This robustness of weighted majority voting makes it favorable in the online setting where we want to make a prediction as early as possible.\n\nSample complexity of learning the latent sources. If we can estimate the latent sources accurately, then we could plug these estimates in place of the true latent sources in the MAP classifier and achieve classification performance close to optimal. If we restrict the noise to be Gaussian and assume $\\Delta_{\\max} = 0$, then the latent source model corresponds to a spherical Gaussian mixture model. We could learn such a model using Dasgupta and Schulman\u2019s modified EM algorithm [6]. Their theoretical guarantee depends on the true separation between the closest two latent sources, namely $G^{(T)*} \\triangleq \\min_{v, v' \\in \\mathcal{V} \\text{ s.t. } v \\ne v'} \\|v - v'\\|_2^2$, which needs to satisfy $G^{(T)*} \\gg \\sigma^2 \\sqrt{T}$. Then with $n = \\Omega\\left(\\max\\left\\{1, \\frac{\\sigma^2 T}{G^{(T)*}}\\right\\} m \\log \\frac{m}{\\delta}\\right)$, $G^{(T)*} = \\Omega(\\sigma^2 \\log \\frac{m}{\\varepsilon})$, and\n\n$$T = \\Omega\\left(\\max\\left\\{1, \\frac{\\sigma^4 T^2}{(G^{(T)*})^2}\\right\\} \\log\\left[\\frac{m}{\\delta} \\max\\left\\{1, \\frac{\\sigma^4 T^2}{(G^{(T)*})^2}\\right\\}\\right]\\right), \\tag{14}$$\n\ntheir algorithm achieves, with probability at least $1 - \\delta$, an additive $\\varepsilon \\sigma \\sqrt{T}$ error (in Euclidean distance) close to optimal in estimating every latent source. In contrast, our result is in terms of gap $G^{(T)}(\\mathcal{R}_+, \\mathcal{R}_-, \\Delta_{\\max})$ that depends not on the true separation between two latent sources but instead on the minimum observed separation in the training data between two time series of opposite labels. In fact, our gap, in their setting, grows as $\\Omega(\\sigma^2 T)$ even when their gap $G^{(T)*}$ grows sublinearly in $T$. In particular, while their result cannot handle the regime where $O(\\sigma^2 \\log \\frac{m}{\\delta}) \\le G^{(T)*} \\le \\sigma^2 \\sqrt{T}$, ours can, using $n = \\Theta(m \\log \\frac{m}{\\delta})$ training time series and observing the first $T = \\Omega(\\log \\frac{m}{\\delta})$ time steps to classify a time series correctly with probability at least $1 - \\delta$; see the longer version of this paper for details.\n\nVempala and Wang [17] have a spectral method for learning Gaussian mixture models that can handle smaller $G^{(T)*}$ than Dasgupta and Schulman\u2019s approach but requires $n = \\widetilde{\\Omega}(T^3 m^2)$ training data, where we\u2019ve hidden the dependence on $\\sigma^2$ and other variables of interest for clarity of presentation. Hsu and Kakade [10] have a moment-based estimator that doesn\u2019t have a gap condition but, under a different non-degeneracy condition, requires substantially more samples for our problem setup, i.e., $n = \\Omega((m^{14} + T m^{11})/\\varepsilon^2)$ to achieve an $\\varepsilon$ approximation of the mixture components. These results need substantially more training data than what we\u2019ve shown is sufficient for classification.\n\nTo fit a Gaussian mixture model to massive training datasets, in practice, using all the training data could be prohibitively expensive. In such scenarios, one could instead non-uniformly subsample $O(T m^3/\\varepsilon^2)$ time series from the training data using the procedure given in [8] and then feed the resulting smaller dataset, referred to as an $(m, \\varepsilon)$-coreset, to the EM algorithm for learning the latent sources. This procedure still requires more training time series than needed for classification and lacks a guarantee that the estimated latent sources will be close to the true latent sources.\n\nFigure 2: Results on synthetic data, plotting the classification error rate on test data for weighted majority voting, the nearest-neighbor classifier, and the oracle MAP classifier. (a) Classification error rate vs. number of initial time steps $T$ used; training set size: $n = \\beta m \\log m$ where $\\beta = 8$. (b) Classification error rate at $T = 100$ vs. $\\beta$. All experiments were repeated 20 times with newly generated latent sources, training data, and test data each time. Error bars denote one standard deviation above and below the mean value.\n\nFigure 3: How news topics become trends on Twitter. The top left shows some time series of activity leading up to a news topic becoming trending. These time series superimposed look like clutter, but we can separate them into different clusters, as shown in the next five plots. Each cluster represents a \u201cway\u201d that a news topic becomes trending.\n\n4 Experimental Results\n\nSynthetic data. We generate $m = 200$ latent sources, where each latent source is constructed by first sampling i.i.d. $\\mathcal{N}(0, 100)$ entries per time step and then applying a 1D Gaussian smoothing filter with scale parameter 30. Half of the latent sources are labeled $+1$ and the other half $-1$. 
Then $n = \\beta m \\log m$ training time series are sampled as per the latent source model where the noise added is i.i.d. $\\mathcal{N}(0, 1)$ and $\\Delta_{\\max} = 100$. We similarly generate 1000 time series to use as test data. We set $\\gamma = 1/8$ for weighted majority voting. For $\\beta = 8$, we compare the classification error rates on test data for weighted majority voting, nearest-neighbor classification, and the MAP classifier with oracle access to the true latent sources as shown in Figure 2(a). We see that weighted majority voting outperforms nearest-neighbor classification but as $T$ grows large, the two methods\u2019 performances converge to that of the MAP classifier. Fixing $T = 100$, we then compare the classification error rates of the three methods using varying amounts of training data, as shown in Figure 2(b); the oracle MAP classifier is also shown but does not actually depend on training data. We see that as $\\beta$ increases, both weighted majority voting and nearest-neighbor classification steadily improve in performance.\n\nForecasting trending topics on Twitter. We provide only an overview of our Twitter results here, deferring full details to the longer version of this paper. We sampled 500 examples of trends at random from a list of June 2012 news trends, and 500 examples of non-trends based on phrases appearing in user posts during the same month. As we do not know how Twitter chooses what phrases are considered as candidate phrases for trending topics, it\u2019s unclear what the size of the non-trend category is in comparison to the size of the trend category. Thus, for simplicity, we intentionally control for the class sizes by setting them equal. In practice, one could still expressly assemble the training data to have pre-specified class sizes and then tune $\\theta$ for generalized weighted majority voting (8). In our experiments, we use the usual weighted majority voting (2) (i.e., $\\theta = 1$) to classify time series, where $\\Delta_{\\max}$ is set to the maximum possible (we consider all shifts).\n\nPer topic, we created its time series based on a pre-processed version of the raw rate of how often the topic was shared, i.e., its Tweet rate. We empirically found that how news topics become trends tends to follow a finite number of patterns; a few examples of these patterns are shown in Figure 3. We randomly divided the set of trends and non-trends into two halves, one to use as training data and one to use as test data. We applied weighted majority voting, sweeping over $\\gamma$, $T$, and data pre-processing parameters. As shown in Figure 4(a), one choice of parameters allows us to detect trending topics in advance of Twitter 79% of the time, and when we do, we detect them an average of 1.43 hours earlier. Furthermore, we achieve a true positive rate (TPR) of 95% and a false positive rate (FPR) of 4%. Naturally, there are tradeoffs between TPR, FPR, and how early we make a prediction (i.e., how small $T$ is). As shown in Figure 4(c), an \u201caggressive\u201d parameter setting yields early detection and high TPR but high FPR, and a \u201cconservative\u201d parameter setting yields low FPR but late detection and low TPR. An \u201cin-between\u201d setting can strike the right balance.\n\nFigure 4: Results on Twitter data. (a) Weighted majority voting achieves a low error rate (FPR of 4%, TPR of 95%) and detects trending topics in advance of Twitter 79% of the time, with a mean of 1.43 hours when it does; parameters: $\\gamma = 10$, $T = 115$, $T_{\\text{smooth}} = 80$, $h = 7$. (b) Envelope of all ROC curves shows the tradeoff between TPR and FPR. (c) Distribution of detection times for \u201caggressive\u201d (top), \u201cconservative\u201d (bottom) and \u201cin-between\u201d (center) parameter settings.\n\nAcknowledgements. 
This work was supported in part by the Army Research Office under MURI\nAward 58153-MA-MUR. GHC was supported by an NDSEG fellowship.\n\nReferences\n[1] Anthony Bagnall, Luke Davis, Jon Hills, and Jason Lines. Transformation based ensembles for time series classification. In Proceedings of the 12th SIAM International Conference on Data Mining, pages 307\u2013319, 2012.\n\n[2] Gustavo E.A.P.A. Batista, Xiaoyue Wang, and Eamonn J. Keogh. A complexity-invariant distance measure for time series. In Proceedings of the 11th SIAM International Conference on Data Mining, pages 699\u2013710, 2011.\n\n[3] Hila Becker, Mor Naaman, and Luis Gravano. Beyond trending topics: Real-world event identification on Twitter. In Proceedings of the Fifth International Conference on Weblogs and Social Media, 2011.\n\n[4] Mario Cataldi, Luigi Di Caro, and Claudio Schifanella. Emerging topic detection on twitter based on temporal and social terms evaluation. In Proceedings of the 10th International Workshop on Multimedia Data Mining, 2010.\n\n[5] Thomas M. Cover and Peter E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21\u201327, 1967.\n\n[6] Sanjoy Dasgupta and Leonard Schulman. A probabilistic analysis of EM for mixtures of separated, spherical gaussians. Journal of Machine Learning Research, 8:203\u2013226, 2007.\n\n[7] Hui Ding, Goce Trajcevski, Peter Scheuermann, Xiaoyue Wang, and Eamonn Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. Proceedings of the VLDB Endowment, 1(2):1542\u20131552, 2008.\n\n[8] Dan Feldman, Matthew Faulkner, and Andreas Krause. Scalable training of mixture models via coresets. In Advances in Neural Information Processing Systems 24, 2011.\n\n[9] Keinosuke Fukunaga. Introduction to statistical pattern recognition (2nd ed.). Academic Press Professional, Inc., 1990.\n\n[10] Daniel Hsu and Sham M. 
Kakade. Learning mixtures of spherical gaussians: Moment methods and spectral decompositions, 2013. arXiv:1206.5766.\n\n[11] Shiva Prasad Kasiviswanathan, Prem Melville, Arindam Banerjee, and Vikas Sindhwani. Emerging topic detection using dictionary learning. In Proceedings of the 20th ACM Conference on Information and Knowledge Management, pages 745\u2013754, 2011.\n\n[12] Shiva Prasad Kasiviswanathan, Huahua Wang, Arindam Banerjee, and Prem Melville. Online l1-dictionary learning with application to novel document detection. In Advances in Neural Information Processing Systems 25, pages 2267\u20132275, 2012.\n\n[13] Michael Mathioudakis and Nick Koudas. Twittermonitor: trend detection over the Twitter stream. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010.\n\n[14] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of gaussians. In 51st Annual IEEE Symposium on Foundations of Computer Science, pages 93\u2013102, 2010.\n\n[15] Alex Nanopoulos, Rob Alcock, and Yannis Manolopoulos. Feature-based classification of time-series data. International Journal of Computer Research, 10, 2001.\n\n[16] Juan J. Rodr\u00edguez and Carlos J. Alonso. Interval and dynamic time warping-based decision trees. In Proceedings of the 2004 ACM Symposium on Applied Computing, 2004.\n\n[17] Santosh Vempala and Grant Wang. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841\u2013860, 2004.\n\n[18] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207\u2013244, 2009.\n\n[19] Yi Wu and Edward Y. Chang. Distance-function design and fusion for sequence data. In Proceedings of the 2004 ACM International Conference on Information and Knowledge Management, 2004.\n\n[20] Xiaopeng Xi, Eamonn J. Keogh, Christian R. 
Shelton, Li Wei, and Chotirat Ann Ratanamahatana. Fast time series classification using numerosity reduction. In Proceedings of the 23rd International Conference on Machine Learning, 2006.", "award": [], "sourceid": 578, "authors": [{"given_name": "George", "family_name": "Chen", "institution": "MIT"}, {"given_name": "Stanislav", "family_name": "Nikolov", "institution": "Twitter"}, {"given_name": "Devavrat", "family_name": "Shah", "institution": "MIT"}]}