{"title": "Precision and Recall for Time Series", "book": "Advances in Neural Information Processing Systems", "page_first": 1920, "page_last": 1930, "abstract": "Classical anomaly detection is principally concerned with point-based anomalies, those anomalies that occur at a single point in time. Yet, many real-world anomalies are range-based, meaning they occur over a period of time. Motivated by this observation, we present a new mathematical model to evaluate the accuracy of time series classification algorithms. Our model expands the well-known Precision and Recall metrics to measure ranges, while simultaneously enabling customization support for domain-specific preferences.", "full_text": "Precision and Recall for Time Series\n\nNesime Tatbul \u2217\nIntel Labs and MIT\n\ntatbul@csail.mit.edu\n\nTae Jun Lee \u2217\n\nMicrosoft\n\ntae_jun_lee@alumni.brown.edu\n\nStan Zdonik\n\nBrown University\n\nsbz@cs.brown.edu\n\nMejbah Alam\n\nIntel Labs\n\nmejbah.alam@intel.com\n\nJustin Gottschlich\n\nIntel Labs\n\njustin.gottschlich@intel.com\n\nAbstract\n\nClassical anomaly detection is principally concerned with point-based anomalies,\nthose anomalies that occur at a single point in time. Yet, many real-world anomalies\nare range-based, meaning they occur over a period of time. Motivated by this\nobservation, we present a new mathematical model to evaluate the accuracy of time\nseries classi\ufb01cation algorithms. Our model expands the well-known Precision and\nRecall metrics to measure ranges, while simultaneously enabling customization\nsupport for domain-speci\ufb01c preferences.\n\nIntroduction\n\n1\nAnomaly detection (AD) is the process of identifying non-conforming items, events, or behaviors\n[1, 9]. The proper identi\ufb01cation of anomalies can be critical for many domains. Examples include\nearly diagnosis of medical diseases [22], threat detection for cyber-attacks [3, 18, 36], or safety\nanalysis for self-driving cars [38]. 
Many real-world anomalies can be detected in time series data. Therefore, systems that detect anomalies should reason about them as they occur over a period of time. We call such events range-based anomalies. Range-based anomalies constitute a subset of both contextual and collective anomalies [9]. More precisely, a range-based anomaly is one that occurs over a consecutive sequence of time points, where no non-anomalous data points exist between the beginning and the end of the anomaly.

The standard metrics for evaluating time series classification algorithms today, Precision and Recall, have been around since the 1950s. Originally formulated to evaluate document retrieval algorithms by counting the number of documents that were correctly returned against those that were not [6], Precision and Recall can be formally defined as follows [1] (where TP, FP, FN are the number of true positives, false positives, and false negatives, respectively):

Precision = TP ÷ (TP + FP)    (1)
Recall = TP ÷ (TP + FN)    (2)

Informally, Precision is the fraction of all detected anomalies that are real anomalies, whereas Recall is the fraction of all real anomalies that are successfully detected. In this sense, Precision and Recall are complementary, and this characterization proves useful when they are combined (e.g., using the Fβ-Score, where β represents the relative importance of Recall to Precision) [6]. Such combinations help gauge the quality of anomaly predictions. While useful for point-based anomalies, classical Precision and Recall suffer from the inability to represent domain-specific time series anomalies. This has a negative side-effect on the advancement of AD systems. In particular, many time series AD systems' accuracy is being misrepresented, because point-based Precision and Recall are being used to measure their effectiveness for range-based anomalies. Moreover, the need to accurately identify time series anomalies is growing in importance due to the explosion of streaming and real-time systems [2, 5, 16, 19, 27, 28, 30, 31, 37, 40].

To address this need, we redefine Precision and Recall to encompass range-based anomalies. Unlike prior work [2, 25], our new mathematical definitions extend their classical counterparts, enabling our model to subsume the classical one. Further, our formulation is more broadly generalizable by providing specialization functions that can control a domain's bias along a multi-dimensional axis to properly accommodate the needs of that specific domain. Thus, the key contribution of this paper is a new, customizable mathematical model, which can be used to evaluate and compare the results of AD algorithms. Although outside the scope of this paper, our model can also be used as the objective function for machine learning (ML) training, which may have a profound impact on AD training strategies, giving rise to fundamentally new ML techniques in the future.

In the remaining sections, we first detail the problem and provide an overview of prior work. In Section 4, we formally present our new model. Section 5 provides a detailed experimental study of our new model compared to the classical model and a recent scoring model provided by Numenta [25]. Finally, we conclude with a brief discussion of future directions.

*Lead authors.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Point-based vs. range-based anomalies. (a) Precision = 0.6, Recall = 0.5; (b) Precision = ?, Recall = ?

2 Problem motivation and design goals

Classical precision and recall are defined for sets of independent points (Figure 1a), which is sufficient for point-based anomalies.
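As a concrete reference point, the classical point-based definitions (Equations 1 and 2) and the Fβ combination mentioned in the introduction can be sketched in a few lines of Python. This is a minimal sketch, not code from the paper; the point sets are invented to reproduce the Precision = 0.6, Recall = 0.5 values of Figure 1a.

```python
def precision_recall(real, pred):
    """Classical point-based Precision (Eq. 1) and Recall (Eq. 2)."""
    tp = len(real & pred)                  # correctly detected points
    fp = len(pred - real)                  # false alarms
    fn = len(real - pred)                  # missed anomalous points
    precision = tp / (tp + fp) if pred else 0.0
    recall = tp / (tp + fn) if real else 0.0
    return precision, recall

def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta is the relative importance of Recall [6]."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Invented point sets: 3 of 5 predictions correct, 3 of 6 real points found.
real = {3, 4, 5, 10, 11, 12}
pred = {4, 5, 10, 20, 21}
p, r = precision_recall(real, pred)        # p = 3/5 = 0.6, r = 3/6 = 0.5
```

With these values, a precision-leaning profile (β < 1) scores higher than a recall-leaning one (β > 1), since precision exceeds recall here.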
Time series AD algorithms, on the other hand, work with sequences of time intervals (Figure 1b) [11, 17, 18, 21, 23, 30]. Therefore, important time-series-specific characteristics cannot be captured by the classical model. Many of these distinctions are due to partial overlaps. In the point-based case, a predicted anomaly point is either a member of the set of real anomalies (a TP) or not (an FP). In the range-based case, a predicted anomaly range might partially overlap with a real one. In this case, a single prediction range is partially a TP and partially an FP at the same time. The size of this partial overlap needs to be quantified, but there may be additional criteria. For example, one may want to consider the position of the overlap. After all, a range consists of an ordered collection of points, and the order might be meaningful for the application. For instance, detecting the earlier portion of an anomaly (i.e., its "front-end") might be more critical for a real-time application to reduce the time to react to it [24] (see Section 4.3 for more examples). Furthermore, overlaps are no longer "1-to-1" as in the classical model; one or more predicted anomaly ranges may (partially) overlap with one or more real ones. Figure 1b illustrates one such situation (P2 vs. R2, R3). The specific domain might care whether each independent anomaly range is detected as a single unit or not. Thus, we may also want to capture cardinality when measuring overlap. If all anomalies were point-based, or if there were no partial-overlap situations, then the classical model would suffice.
However, for general range-based anomalies, the classical model falls short. Motivated by these observations, we propose a new model with the following design goals:

• Expressive: captures criteria unique to range-based anomalies, e.g., overlap position and cardinality.
• Flexible: supports adjustable weights across multiple criteria for domain-specific needs.
• Extensible: supports inclusion of additional domain-specific criteria that cannot be known a priori.

Our model builds on the classical precision/recall model as a solid foundation, and extends it to be used more effectively for range-based AD (and time series classification, in general).

3 Related work

There is a growing body of research emphasizing the importance of time series classification [10, 19, 27], including anomaly detection. Many new ML techniques have been proposed to handle time series anomalies for a wide range of application domains, from space shuttles to web services [16, 30, 31, 41]. MacroBase [7] and SPIRIT [33] dynamically detect changes in time series when analyzing fast, streaming data. Lipton et al. investigated the viability of existing ML-based AD techniques on medical time series [29], while Patcha and Park have drawn out weaknesses of AD systems in properly handling cyber-attacks in a time series setting, amongst others [34].

Table 1: Notation
R, Ri: set of real anomaly ranges; the ith real anomaly range
P, Pj: set of predicted anomaly ranges; the jth predicted anomaly range
N, Nr, Np: number of all points, number of real anomaly ranges, number of predicted anomaly ranges
α: relative weight of the existence reward
γ(), ω(), δ(): overlap cardinality function, overlap size function, positional bias function
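The notation of Table 1 maps naturally onto a small data representation. The following is a hypothetical sketch (the half-open interval convention and the variable values are ours, chosen only for illustration):

```python
# Hypothetical encoding of Table 1's notation: each anomaly range is a
# half-open (start, end) integer interval over a time series of N points.
R = [(3, 6), (10, 13)]             # real anomaly ranges R1, R2
P = [(4, 6), (11, 12), (20, 22)]   # predicted anomaly ranges P1, P2, P3

N = 30                             # number of all points in the series
Nr = len(R)                        # number of real anomaly ranges
Np = len(P)                        # number of predicted anomaly ranges
```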
Despite these efforts, techniques designed to judge the efficacy of time series AD systems remain largely underdeveloped. Evaluation measures have been investigated in other related areas [8, 13], e.g., sensor-based activity recognition [12] and time series change point detection [4]. Most notably, Ward et al. provide a fine-grained analysis of errors by dividing activity events into segments to account for situations such as event fragmentation/merging and timing offsets [39]. Because these approaches do not take positional bias into account and do not provide a tunable model like ours, we see them as complementary. Similar to our work, Lavin and Ahmad have also noted the lack of proper time series measurement techniques for AD in their Numenta Anomaly Benchmark (NAB) work [25]. Yet, unlike ours, the NAB scoring model remains point-based in nature and has a fixed bias (early detection), which makes it less generalizable than our model. We return to NAB in greater detail in Section 5.3.

4 Precision and recall for ranges

In this section, we propose a new way to compute precision and recall for ranges. We first formulate recall, and then follow a similar methodology to formulate precision. Finally, we provide examples and guidelines to illustrate the practical use of our model. Table 1 summarizes our notation.

4.1 Range-based recall

The basic purpose of recall is to reward a prediction system when real anomalies are successfully identified (TPs), and to penalize it when they are not (FNs). Given a set of real anomaly ranges R = {R1, .., RNr} and a set of predicted anomaly ranges P = {P1, .., PNp}, our RecallT(R, P) formulation iterates over the set of all real anomaly ranges (R), computing a recall score for each real anomaly range (Ri ∈ R) and adding them up into a total recall score.
This total score is then divided by the total number of real anomalies (Nr) to obtain an average recall score for the whole time series.

RecallT(R, P) = ( Σ_{i=1}^{Nr} RecallT(Ri, P) ) ÷ Nr    (3)

When computing the recall score RecallT(Ri, P) for a single anomaly range Ri, we consider:

• Existence: Catching the existence of an anomaly (even by predicting only a single point in Ri), by itself, might be valuable for the application.
• Size: The larger the size of the correctly predicted portion of Ri, the higher the recall score.
• Position: In some cases, not only size, but also the relative position of the correctly predicted portion of Ri might matter to the application.
• Cardinality: Detecting Ri with a single prediction range Pj ∈ P may be more valuable than doing so with multiple different ranges in P in a fragmented manner.

We capture all of these considerations as a sum of two main reward terms weighted by α and (1 − α), respectively, where 0 ≤ α ≤ 1. α represents the relative importance of rewarding existence, whereas (1 − α) represents the relative importance of rewarding size, position, and cardinality, all of which stem from the actual overlap between Ri and the set of all predicted anomaly ranges (Pj ∈ P).

RecallT(Ri, P) = α × ExistenceReward(Ri, P) + (1 − α) × OverlapReward(Ri, P)    (4)

If anomaly range Ri is identified (i.e., |Ri ∩ Pj| ≥ 1 for some Pj ∈ P), then a reward is earned.

ExistenceReward(Ri, P) = 1, if Σ_{j=1}^{Np} |Ri ∩ Pj| ≥ 1; 0, otherwise    (5)

Additionally, a cumulative overlap reward that depends on three functions, 0 ≤ γ() ≤ 1, 0 ≤ ω() ≤ 1, and δ() ≥ 1, is earned.
These capture the cardinality (γ()), size (ω()), and position (δ()) aspects of the overlap.

function ω(AnomalyRange, OverlapSet, δ)
    MyValue ← 0
    MaxValue ← 0
    AnomalyLength ← length(AnomalyRange)
    for i ← 1, AnomalyLength do
        Bias ← δ(i, AnomalyLength)
        MaxValue ← MaxValue + Bias
        if AnomalyRange[i] in OverlapSet then
            MyValue ← MyValue + Bias
    return MyValue/MaxValue
(a) Overlap size

function δ(i, AnomalyLength)  ▷ Flat bias
    return 1
function δ(i, AnomalyLength)  ▷ Front-end bias
    return AnomalyLength − i + 1
function δ(i, AnomalyLength)  ▷ Back-end bias
    return i
function δ(i, AnomalyLength)  ▷ Middle bias
    if i ≤ AnomalyLength/2 then
        return i
    else
        return AnomalyLength − i + 1
(b) Positional bias

Figure 2: Example definitions for ω() and δ() functions.

More specifically, the cardinality term serves as a scaling factor for the rewards earned from overlap size and position.

OverlapReward(Ri, P) = CardinalityFactor(Ri, P) × Σ_{j=1}^{Np} ω(Ri, Ri ∩ Pj, δ)    (6)

When Ri overlaps with only one predicted anomaly range, the cardinality factor reward is the largest (i.e., 1). Otherwise, it receives a value 0 ≤ γ() ≤ 1 defined by the application.

CardinalityFactor(Ri, P) = 1, if Ri overlaps with at most one Pj ∈ P; γ(Ri, P), otherwise    (7)

Note that both the weight (α) and the functions (γ(), ω(), and δ()) are tunable according to the needs of the application. We illustrate how they can be customized with examples in Section 4.3.

4.2 Range-based precision

As seen in Equations 1 and 2, the key difference between precision and recall is that precision penalizes FPs instead of FNs.
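Equations 3-7, together with the ω() and δ() definitions of Figure 2, can be sketched as executable Python. This is a minimal sketch, not the authors' implementation; it assumes half-open integer intervals, and the function and variable names are ours.

```python
def delta_flat(i, length):              # flat positional bias (Figure 2b)
    return 1

def delta_front(i, length):             # front-end positional bias
    return length - i + 1

def points(rng):
    """Expand a half-open (start, end) range into its set of time points."""
    return set(range(rng[0], rng[1]))

def omega(rng, overlap, delta):
    """Overlap size reward of Figure 2a: bias-weighted fraction covered."""
    start, end = rng
    length = end - start
    max_value = my_value = 0
    for i in range(1, length + 1):      # 1-based position within the range
        bias = delta(i, length)
        max_value += bias
        if start + i - 1 in overlap:    # this point is covered by a prediction
            my_value += bias
    return my_value / max_value if max_value else 0.0

def recall_t(R, P, alpha=0.0, gamma=lambda x: 1.0, delta=delta_flat):
    """RecallT of Equations 3-7 for ranges given as (start, end) pairs."""
    total = 0.0
    for r in R:
        overlaps = [o for o in (points(r) & points(p) for p in P) if o]
        existence = 1.0 if overlaps else 0.0                            # Eq. 5
        card = 1.0 if len(overlaps) <= 1 else gamma(len(overlaps))      # Eq. 7
        overlap_reward = card * sum(omega(r, o, delta) for o in overlaps)  # Eq. 6
        total += alpha * existence + (1 - alpha) * overlap_reward       # Eq. 4
    return total / len(R)                                               # Eq. 3
```

With unit-size ranges and the defaults above (α = 0, γ() = 1, flat bias), the sketch reduces to classical recall, consistent with the subsumption conditions discussed in Section 4.3.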
Given a set of real anomaly ranges R = {R1, .., RNr} and a set of predicted anomaly ranges P = {P1, .., PNp}, our PrecisionT(R, P) formula iterates over the set of all predicted anomaly ranges (P), computing a precision score for each predicted anomaly range (Pi ∈ P) and adding them up into a total precision score. This total score is then divided by the total number of predicted anomalies (Np) to obtain an average precision score for the whole time series.

PrecisionT(R, P) = ( Σ_{i=1}^{Np} PrecisionT(R, Pi) ) ÷ Np    (8)

When computing PrecisionT(R, Pi) for a single predicted anomaly range Pi, there is no need for an existence reward, since precision by definition emphasizes prediction quality, and existence by itself is too low a bar for judging the quality of a prediction (i.e., α = 0). On the other hand, the overlap reward is still needed to capture the cardinality, size, and position aspects of a prediction.

PrecisionT(R, Pi) = CardinalityFactor(Pi, R) × Σ_{j=1}^{Nr} ω(Pi, Pi ∩ Rj, δ)    (9)

As in our recall formulation, the γ(), ω(), and δ() functions are tunable according to application semantics. Although not explicitly shown in our notation, these functions can be defined differently for RecallT and PrecisionT (see Sections 4.3 and 5 for examples).

4.3 Customization guidelines and examples

Figure 2a provides an example of the ω() function for size, which can be used with a δ() function for positional bias. In general, for both RecallT and PrecisionT, we expect ω() to always be defined as in our example, due to its additive nature and direct proportionality to the size of the overlap. δ(), on the other hand, can be defined in multiple different ways, as illustrated with four examples in Figure 2b.
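Equations 8-9 mirror the recall formulation with the roles of R and P swapped and no existence term. A self-contained sketch under the same assumptions as before (half-open integer intervals, flat positional bias, names ours):

```python
def points(rng):
    """Expand a half-open (start, end) range into its set of time points."""
    return set(range(rng[0], rng[1]))

def omega_flat(rng, overlap):
    """Size reward with flat positional bias: fraction of the range covered."""
    size = rng[1] - rng[0]
    return len(overlap) / size if size else 0.0

def precision_t(R, P, gamma=lambda x: 1.0):
    """PrecisionT of Equations 8-9: iterate over predictions, alpha = 0."""
    total = 0.0
    for p in P:
        overlaps = [o for o in (points(p) & points(r) for r in R) if o]
        card = 1.0 if len(overlaps) <= 1 else gamma(len(overlaps))
        total += card * sum(omega_flat(p, o) for o in overlaps)   # Eq. 9
    return total / len(P)                                         # Eq. 8
```

For example, one fully correct prediction and one fully spurious prediction average to a PrecisionT of 0.5, and a single prediction spanning two distinct real ranges is downscaled once γ() = 1/x is supplied.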
If all index positions of an anomaly range are equally important for the application, then the flat bias function should be used. In this case, simply, the larger the size of the overlap, the higher the overlap reward will be. If earlier (later) index positions carry more weight than later (earlier) ones, then the front-end (back-end) bias function should be used instead. Finally, the middle bias function is for prioritizing the mid-portions of anomaly ranges. In general, we expect δ() to be a function in which the value of an index position i monotonically increases/decreases based on its distance from a well-defined reference point of the anomaly range. From a semantic point of view, δ() signifies the urgency of when an anomaly is detected and reacted to. For example, front-end bias is preferable in scenarios where early response is critical, such as cancer detection or real-time systems. Back-end bias is useful in scenarios where responses are irreversible (e.g., firing of a missile, output of a program). In such scenarios, it is generally more desirable to delay raising detection of an anomaly until there is absolute certainty of its presence. Middle bias can be useful when there is a clear trade-off between the effects of early and late detection, i.e., an anomaly should not be reacted to too early or too late. Hard braking in autonomous vehicles is an example: braking too early may unnecessarily startle passengers, while braking too late may cause a collision. We expect that for RecallT, δ() will typically be set to one of the four functions in Figure 2b depending on the domain; whereas a flat bias function will suffice for PrecisionT in most domains, since an FP is typically considered uniformly bad wherever it appears in a prediction range.

We expect the γ() function to be defined similarly for both RecallT and PrecisionT.
Intuitively, in both cases, the cardinality factor should be inversely proportional to the number of distinct ranges that a given anomaly range overlaps. Thus, we expect γ() to generally be structured as a reciprocal rational function 1/f(x), where f(x) ≥ 1 is a single-variable polynomial and x represents the number of distinct overlap ranges. A typical example for γ() is 1/x.

Before we conclude this section, it is important to note that our precision and recall formulas subsume their classical counterparts, i.e., RecallT ≡ Recall and PrecisionT ≡ Precision, when: (i) all Ri ∈ R and Pj ∈ P are represented as unit-size ranges (e.g., range [1, 3] as [1, 1], [2, 2], [3, 3]), and (ii) α = 0, γ() = 1, ω() is as in Figure 2a, and δ() returns flat positional bias as in Figure 2b.

5 Experimental study

In this section, we present an experimental study of our range-based model applied to the results of state-of-the-art AD algorithms over a collection of diverse time series datasets. We aim to demonstrate two key features of our model: (i) its expressive power for range-based anomalies compared to the classical model, and (ii) its flexibility in supporting diverse scenarios in comparison to the Numenta scoring model [25]. Furthermore, we provide a cost analysis for computing our new metrics, which is important for their practical application.

5.1 Setup

System: All experiments were run on a Windows 10 machine with an Intel® Core™ i5-6300HQ processor running at 2.30 GHz with 4 cores and 8 GB of RAM.

Datasets: We used a mixture of real and synthetic datasets. The real datasets are taken from the NAB Data Corpus [32], whereas the synthetic datasets are generated by the Paranom tool [15]. All data is time-ordered, univariate, numeric time series, for which anomalous points/ranges (i.e., "the ground truth") are already known.
NYC-Taxi: A real dataset collected by the NYC Taxi and Limousine Commission. The data represents the number of passengers over time, recorded as 30-minute aggregates. There are five anomalies: the NYC Marathon, Thanksgiving, Christmas, New Year's Day, and a snow storm. Twitter-AAPL: A real dataset with a time series of the number of tweets that mention Apple's ticker symbol (AAPL), recorded as 5-minute aggregates. Machine-Temp: A real dataset of readings from a temperature sensor attached to a large industrial machine, with three anomalies: a planned shutdown, an unidentified error, and a catastrophic failure. ECG: A dataset from Paranom based on a real electrocardiogram (ECG) dataset [20, 30], augmented to include additional synthetic anomalies beyond the original single pre-ventricular contraction anomaly. Space-Shuttle: A dataset based on sensor measurements from valves on a NASA space shuttle; also generated by Paranom based on a real dataset from the literature, with multiple additional synthetic anomalies [20, 30]. Sine: A dataset from Paranom that includes a sine wave oscillating between 0.2 and 0.5 with a complete period over 360 timestamps. It contains many stochastic anomalous events ranging from 50 to 100 time intervals. Time-Guided: A Paranom dataset with monotonically increasing univariate values, in which multiple range-based stochastic anomalies appear with inverted negative values.

Anomaly Detectors: For our first set of experiments, we trained a TensorFlow-implemented version of LSTM-AD, a long short-term memory model for AD [30], on all datasets. For training and testing, we carefully partition each dataset to ensure anomaly ranges remain intact in spite of the segmentation. We interpret adjacent anomalous points as part of a single anomaly range. Thus, the LSTM-AD testing phase output is a sequence of predicted anomaly ranges for each dataset.
Secondly, to illustrate how our model can be used in evaluating and comparing predictions from multiple detectors, we repeat this procedure with two additional detectors: Greenhouse [26] and Luminol [28].

Figure 3: Our model vs. the classical point-based model. (a) Recall; (b) Precision; (c) F1-Score.

Compared Scoring Models: We evaluated the prediction accuracy of each output from the anomaly detectors using three models: classical point-based precision/recall, the Numenta scoring model [25] (described in detail below), and our range-based precision/recall model. Unless otherwise stated, we use the following default parameter settings for computing our RecallT and PrecisionT equations: α = 0, γ() = 1, ω() is as in Figure 2a, and δ() returns flat bias as in Figure 2b.

5.2 Comparison to the classical point-based model

First, we compare our range-based model with the classical point-based model. The goal of this experiment is twofold: (i) verify that our model subsumes the classical one, and (ii) show that our model can capture additional variations for anomaly ranges that the classical one cannot.

We present our results in Figure 3 with three bar charts, which show Recall, Precision, and F1-Score values for our LSTM-AD testing over the seven datasets, respectively. Recall values are computed using Equation 2 for Recall (bar labeled Recall_Classical) and Equation 3 for RecallT (bars labeled Recall_T_*). Similarly, Precision values are computed using Equation 1 for Precision (bar labeled Precision_Classical) and Equation 8 for PrecisionT (bars labeled Precision_T_*). Finally, F1-Score values are computed using the following well-known equation, which represents the harmonic mean of Precision and Recall [6]:

F1-Score = (2 × Precision × Recall) ÷ (Precision + Recall)    (10)

We first observe that, in all graphs, the first two bars are equal for all datasets.
This demonstrates that our model, when properly configured as described at the end of Section 4.3, subsumes the classical one. Next, we analyze the impact of positional bias (δ()) on each metric. Note that γ() = 1/x in what follows.

Recall: In Figure 3a, we provide RecallT measurements for four positional biases: flat, front, back, and middle (see Figure 2b). Recall is perfect (= 1.0) in all runs for the Machine-Temp dataset. When we analyze the real and predicted anomaly ranges for this dataset, we see two real anomaly ranges that LSTM-AD predicts as a single anomaly range. Both real anomalies are fully predicted (i.e., there are no false negatives and x = 1 for both ranges). Therefore, recall has the largest possible value independent of what δ() is, which is the expected behavior for this scenario.

For all other datasets except Time-Guided, RecallT is smaller than Recall. This is generally expected, as real anomaly ranges are rarely captured (i) entirely by LSTM-AD (i.e., the overlap reward is only partially earned) and (ii) by exactly one predicted anomaly range (i.e., the cardinality factor downscales the overlap reward). Analyzing the real and predicted anomalies for Time-Guided reveals that LSTM-AD predicts 8/12 ranges fully and 4/12 ranges half-way (i.e., no range is completely missed and x = 1 for all 12 ranges). Furthermore, differences among RecallT biases are mainly due to the four half-predicts. All half-predicts lie at back-ends, which explains why Recall_T_Back is larger than the other biases. This illustrates the positional bias sensitivity of RecallT, which cannot be captured by the classical Recall.

We make similar observations for ECG and Sine. That is, different positional biases lead to visibly different RecallT values. These demonstrate the sensitivity of RecallT to the positions of correctly predicted portions of anomalies. For Space-Shuttle, NYC-Taxi, and Twitter-AAPL, the differences are not as pronounced.
The data indicates that these datasets contain few real anomaly ranges (5, 3, and 2, respectively), but LSTM-AD predicts a significantly larger number of small anomaly ranges. As such, overlaps are small and highly fragmented, likely making δ() less dominant than ω() and γ().

Precision: In Figure 3b, we present PrecisionT only with flat positional bias, as we expect this to be the typical use case (see Section 4.3). In general, we observe that Precision and PrecisionT follow a similar pattern. Also, PrecisionT is typically smaller than Precision, in all but two datasets (Time-Guided and NYC-Taxi). For Time-Guided, precision values turn out to be almost identical, because both real and predicted anomaly ranges are too narrow, diminishing the differences between points and ranges. When reviewing the result for NYC-Taxi, we found many narrow-range predictions against 3 wide-range real anomalies, with a majority of the predictions being false positives. Because the narrow predictions earn similar overlap rewards as unit-size/point predictions, yet the number of predicted anomaly ranges (Np) is relatively smaller than the total number of points in those Np ranges, the value of PrecisionT comes out slightly larger than Precision for this dataset. Overall, this scenario demonstrates that PrecisionT is more range-aware than Precision and can, therefore, judge the exactness of range-based predictions more accurately.

F1-Score: In Figure 3c, we present F1-Scores. The general behavior is as in the classical model, because F1 is the harmonic mean of the respective recall and precision values in both models. A noteworthy observation is that the F1-Scores display similar sensitivity to positional bias variations as in the recall graph of Figure 3a.

Figure 4: Our model vs. the Numenta model. (a) Numenta Standard; (b) Numenta Reward-Low-FP; (c) Numenta Reward-Low-FN.
This demonstrates that our range-based model's expressiveness for recall and precision carries over into combined metrics like the F1-Score.

5.3 Comparison to the Numenta Anomaly Benchmark (NAB) scoring model

In this section, we report results from our LSTM-AD experiments with the Numenta Anomaly Benchmark (NAB) scoring model [25]. Our goals are: (i) to determine if our model can mimic the NAB model, and (ii) to examine some of the flexibilities our model has beyond the NAB model.

Background on NAB's Scoring Model: While focusing entirely on real-time streaming applications, the NAB authors claim that: (i) only early anomaly detection matters (i.e., what we call front-end bias), and (ii) only the first detection point of an anomaly range matters (i.e., an emphasis on precision over recall and a bias toward single-point TP prediction systems). They then propose a scoring model, based on anomaly windows, application profiles, and a sigmoidal scoring function, which is specifically designed to reward detections earlier in a window. NAB's scoring model also includes weights that can be adjusted according to three predefined application profiles: Standard, Reward-Low-FP, and Reward-Low-FN. While this approach may be suitable for restricted domains, it does not generalize to all range-based anomalies. For example, in the identification of medical anomalies (e.g., cancer detection), it is critical to identify when an illness is regressing due to the use of life-threatening medical treatments [14]. Under NAB's single-point reward system, there is no clear distinction for such state changes.
Furthermore, as discussed in a follow-up analysis by Singh and Olinsky [35], the NAB scoring system has a number of limitations that make it challenging to use in real-world applications, even within its restricted domain (e.g., determining anomaly window size a priori, ambiguities in the scoring function, magic numbers, and non-normalized scoring due to the lack of a lower bound). Given these irregularities, it is difficult, and perhaps ill-advised, to make a direct and unbiased comparison to the NAB scoring model. Instead, we focus on comparing relative approximations of our Fβ-Scores to NAB scores, rather than their absolute values.

To obtain behavior similar to NAB, we used the following settings for our model: α = 0, γ() = 1, ω() is as in Figure 2a, and δ() is front bias for RecallT and flat bias for PrecisionT. Because NAB's scoring model makes point-based predictions, we represent each Pi as a point. For each run, we compute Recall_T_Front and Precision_T_Flat and their Fβ-Score corresponding to the NAB application profile under consideration (i.e., F1_T for Numenta_Standard, F0.5_T for Numenta_Reward_Low_FP, and F2_T for Numenta_Reward_Low_FN).

Figure 4 contains three bar charts, one for each Numenta profile. The datasets on the x-axis are presented in descending order of the Numenta accuracy scores to facilitate comparative analysis. For ease of exposition, we scaled the negative Numenta scores by a factor of 100 and lower-bounded the y-axis at -1. At a high level, although the accuracy values vary slightly across the charts, their approximation follows a similar pattern. Therefore, we only discuss Figure 4a.

Figure 5: Evaluating multiple anomaly detectors. (a) Sine dataset; (b) ECG dataset; (c) NYC-Taxi dataset.

In Figure 4a, both models generally decrease in value from left to right, with a few exceptions, the Space-Shuttle dataset being the most striking.
In analyzing the data for the Space-Shuttle, we noted a small number of mid-range real anomalies and a larger number of mid-range predicted anomalies. On the other hand, when we analyzed similar data for Sine, we found it had many narrow-range real and predicted anomalies that appear close together. These findings reveal that Space-Shuttle data had more FPs and TPs, which is reflected in its smaller PrecisionT score and larger RecallT score (Figure 3). Because NAB favors precision over recall (i.e., NAB heavily penalizes FPs), this discrepancy is further magnified. Overall, this shows that not only can our model mimic the NAB scoring system, but it can also identify additional intricacies missed by NAB.

Table 2: Sensitivity to positional bias

                   Numenta_Standard   F1_T_Front_Flat   F1_T_Back_Flat
  Front-Predicted  0.67               0.42              0.11
  Back-Predicted   0.63               0.11              0.42

To further illustrate, we investigated how the two models behave under two contrasting positional bias scenarios: (i) anomaly predictions overlapping with front-end portions of real anomaly ranges (Front-Predicted) and (ii) anomaly predictions overlapping with back-end portions of real anomaly ranges (Back-Predicted), as shown in Table 2. We artificially generated the desired ranges in a symmetric fashion to make them directly comparable. We then scored them using Numenta_Standard, F1_T_Front_Flat, and F1_T_Back_Flat. As shown in Table 2, NAB's scoring function is not sufficiently sensitive to distinguish between the two scenarios. This is not surprising, as NAB was designed to reward early detections. However, our model distinguishes between the two scenarios when its positional bias is set appropriately.
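The positional-bias sensitivity shown in Table 2 can be sketched in a few lines. Below is a minimal illustration of a range-based recall along the lines of the model's δ() and ω() definitions, assuming α = 0 (no existence reward) and γ() = 1 (no fragmentation penalty); all function and variable names here are ours, not from the paper:

```python
# Ranges are closed integer intervals (start, end). delta() is the
# positional bias: flat weighs all positions equally; front weighs
# earlier positions more; back weighs later positions more.

def delta_flat(i, length):
    return 1.0

def delta_front(i, length):
    return float(length - i + 1)  # earlier positions get larger weight

def delta_back(i, length):
    return float(i)               # later positions get larger weight

def omega(real, covered, delta):
    """Overlap reward: total bias weight of the covered positions of a
    real range, normalized by the total weight of the whole range."""
    start, end = real
    length = end - start + 1
    max_value = sum(delta(i, length) for i in range(1, length + 1))
    my_value = sum(delta(i, length)
                   for i in range(1, length + 1)
                   if (start + i - 1) in covered)
    return my_value / max_value

def recall_t(reals, preds, delta):
    """Range-based recall, averaged over all real ranges
    (alpha = 0, gamma() = 1)."""
    total = 0.0
    for rs, re in reals:
        covered = {t for ps, pe in preds
                     for t in range(ps, pe + 1)
                     if rs <= t <= re}
        total += omega((rs, re), covered, delta)
    return total / len(reals)

# One real range of 10 points; a prediction hitting its front vs. its back.
reals = [(1, 10)]
front_pred = [(1, 3)]
back_pred = [(8, 10)]

print(recall_t(reals, front_pred, delta_flat))   # -> 0.3
print(recall_t(reals, back_pred, delta_flat))    # -> 0.3
print(recall_t(reals, front_pred, delta_front))  # much higher than back
print(recall_t(reals, back_pred, delta_front))
```

With the flat bias both scenarios score 0.3, while the front bias rewards the Front-Predicted scenario far more heavily (27/55 vs. 6/55), mirroring the contrast between the F1_T_Front_Flat and F1_T_Back_Flat columns of Table 2.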
This experiment demonstrates our model's flexibility and generality over the NAB scoring model.

5.4 Evaluating multiple anomaly detectors

We analyzed the three scoring models for their effectiveness in evaluating and comparing prediction results from multiple anomaly detectors: LSTM-AD [30], Greenhouse [26], and Luminol (bitmap) [28]. Please note that the primary goal of this experiment is to compare the scoring models' behaviors for a specific application scenario, not to evaluate the AD algorithms themselves. In this experiment, we suppose an application that requires early detection of anomaly ranges in a non-fragmented manner, with precision and recall being equally important. In our model, this corresponds to the following settings: α = 0, γ() = 1/x for both PrecisionT and RecallT, δ() with front bias for RecallT and flat bias for PrecisionT, and β = 1 for the Fβ-Score. In Numenta, the closest application profile is Standard, whereas in the classical model, the only tunable parameter is β = 1 for the Fβ-Score.

Figure 5 presents our results from running the detectors on three selected datasets: Sine (synthetic), ECG (augmented), and NYC-Taxi (real). In Figure 5a for Sine, all scoring models are in full agreement that Luminol is the most accurate and Greenhouse is the least. Detailed analysis of the data reveals that Luminol achieves by far the highest precision, whereas Greenhouse achieves by far the lowest. All scoring models recognize this obvious behavior equally well. In Figure 5b for ECG, all scoring models generally agree that Greenhouse performs poorly. It turns out that FPs clearly dominate in Greenhouse's anomaly predictions. Classical and Numenta score LSTM-AD and Luminol similarly, while our model strongly favors LSTM-AD over the others.
Data indicates that this is because LSTM-AD makes a relatively smaller number of predictions clustered around the real anomaly ranges, boosting the number of TPs and making both precision and recall relatively greater. Luminol differs from LSTM-AD mainly in that it makes many more predictions, causing more fragmentation and severe score degradation due to γ() in our model. Finally, in Figure 5c for NYC-Taxi, all scoring models largely disagree on the winner, except that LSTM-AD and Greenhouse perform somewhat similarly according to all. Indeed, both of these detectors make many narrow predictions against 3 relatively wide real anomaly ranges. Our model scores Greenhouse the best, mainly because its front-biased RecallT comes out to be much greater than the others'. Unlike the two precision/recall models, Numenta strongly favors Luminol, even though this detector misses one of the 3 real anomaly ranges in its entirety. Overall, this experiment shows the effectiveness of our model in capturing application requirements, particularly when comparing the results of multiple anomaly detectors, even in the presence of subtleties in the data.

5.5 Cost analysis

In this section, we analyze the computational overhead of our RecallT and PrecisionT equations.

[Figure 6: Cost analysis. (a) Naive; (b) Optimized.]

Naive: In our RecallT and PrecisionT equations, we keep track of two sets, R and P. A naive algorithm, which compares each Ri ∈ R with all Pj ∈ P for RecallT (and vice versa for PrecisionT), has a computational complexity of O(Nr × Np).

Optimization1: We can reduce the number of comparisons by taking advantage of sequential relationships among ranges. Assume that all ranges are represented as timestamp pairs, ordered by their first timestamps. We iterate over R and P simultaneously, comparing only those pairs that are in close proximity in terms of their timestamps.
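A minimal sketch of such a simultaneous sweep, assuming both lists hold (start, end) timestamp pairs sorted by start (function and variable names are ours, not from the paper):

```python
# Two-pointer sweep over sorted range lists: instead of comparing every
# Ri against every Pj, we advance a lower cursor past predictions that
# end before the current real range begins, then collect only the
# predictions that start before it ends.

def overlapping_pairs(reals, preds):
    """Return (i, j) for every overlapping pair (reals[i], preds[j]).
    Both inputs are lists of (start, end) pairs sorted by start."""
    pairs = []
    j_lo = 0
    for i, (rs, re) in enumerate(reals):
        # Predictions ending before this real range starts can never
        # overlap it (or any later real range).
        while j_lo < len(preds) and preds[j_lo][1] < rs:
            j_lo += 1
        j = j_lo
        # Every remaining prediction starting at or before the end of
        # this real range overlaps it.
        while j < len(preds) and preds[j][0] <= re:
            pairs.append((i, j))
            j += 1
    return pairs

reals = [(1, 5), (10, 14), (20, 25)]
preds = [(3, 4), (11, 12), (13, 22)]
print(overlapping_pairs(reals, preds))  # -> [(0, 0), (1, 1), (1, 2), (2, 2)]
```

The cursors only move forward, so the cost is linear in the number of ranges plus the number of overlapping pairs found, in line with the complexity discussion above.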
This optimized algorithm brings the computational complexity down to O(max{Nr, Np}).

Optimization2: An additional optimization is possible in computing the positional bias functions (e.g., flat). Instead of computing them for each point, we apply them in closed form, leading to a single computation per range. We see room for further optimizations (e.g., using bit vectors and bitwise operations), but exhaustive optimization is beyond the goals of our current analysis.

We implemented five alternative ways of computing recall and precision, and analyzed how the computation time for each of them scales with an increasing number of real and predicted anomaly ranges. Two of these are naive algorithms for the classical and our recall/precision equations; the rest are their optimized variants. The ranges used in this experiment were randomly generated from a time series of 50K data points. Figure 6 shows our results. We plot the naive and optimized approaches in separate graphs, as their costs differ by orders of magnitude. In Figure 6a, we see that computing our range-based metrics naively is significantly more costly than doing so for the classical ones. Fortunately, we can optimize our model's computational cost. The optimized results in Figure 6b show a clear improvement in performance: when optimized, both the classical metrics and ours can be computed in almost three orders of magnitude less time than their corresponding naive baselines. A factor of 2-3 difference remains, but at this scale our model's cost is notably closer to that of the classical model.
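As a concrete illustration of Optimization2, the flat positional bias admits a closed form: because every position carries weight 1, the per-point sum collapses to the ratio of the overlap length to the range length. A small sketch under that assumption (names are ours, not from the paper):

```python
# Flat positional bias computed two ways: per point, and in closed form
# as a single arithmetic expression per (real, predicted) range pair.
# Ranges are closed integer intervals (start, end).

def omega_flat_pointwise(real, pred):
    """Per-point computation: weight 1 at every overlapped position."""
    start, end = real
    overlap = sum(1 for t in range(start, end + 1)
                  if pred[0] <= t <= pred[1])
    return overlap / (end - start + 1)

def omega_flat_closed_form(real, pred):
    """Closed form: overlap length over range length, no loop."""
    lo = max(real[0], pred[0])
    hi = min(real[1], pred[1])
    overlap = max(0, hi - lo + 1)
    return overlap / (real[1] - real[0] + 1)

real, pred = (5, 14), (12, 30)
assert omega_flat_pointwise(real, pred) == omega_flat_closed_form(real, pred)
print(omega_flat_closed_form(real, pred))  # -> 0.3
```

The closed form replaces an O(range length) loop with O(1) work per range pair, which is where the per-range "single computation" savings comes from.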
This analysis shows that our new range-based metrics can be computed efficiently, and that their overhead compared to the classical ones is acceptably low.

6 Conclusions and future directions

Precision and recall are commonly used to evaluate the accuracy of anomaly detection systems. Yet, classical precision and recall were designed for point-based data, whereas for time series data, anomalies come in the form of ranges. In response, we offer a new formal model that accounts for range-specific issues, such as partial overlaps between real and predicted ranges as well as their relative positions. Moreover, our approach allows users to supply customizable bias functions that enable weighing multiple criteria differently. We ran experiments with diverse datasets and anomaly detectors, comparing our model against two others, which illustrated that our new model is expressive, flexible, and extensible. Our evaluation model comes with a number of tunable parameters. Given an application, these parameters must be properly set to suit the needs of that domain. Though our model is extensible, we expect that the example settings and guidelines we provided in the paper will be sufficient in most practical domains. In ongoing work, we have been actively collaborating with domain experts to apply our model in real use cases, such as autonomous driving. Future work includes designing new range-based ML techniques that are optimized for our new accuracy model.

Acknowledgments

We thank Eric Metcalf for his help with experiments. This research has been funded in part by Intel and by NSF grant IIS-1526639.

References

[1] C. C. Aggarwal. Outlier Analysis. Springer, 2013.

[2] S. Ahmad, A. Lavin, S. Purdy, and Z. Agha. Unsupervised Real-time Anomaly Detection for Streaming Data. Neurocomputing, 262:134–147, 2017.

[3] A. AlEroud and G. Karabatis.
A Contextual Anomaly Detection Approach to Discover Zero-Day Attacks.\n\nIn International Conference on CyberSecurity (ICCS), pages 40\u201345, 2012.\n\n[4] S. Aminikhanghahi and D. J. Cook. A Survey of Methods for Time Series Change Point Detection.\n\nKnowledge and Information Systems, 51(2):339\u2013367, 2017.\n\n[5] O. Anava, E. Hazan, and A. Zeevi. Online Time Series Prediction with Missing Data. In International\n\nConference on Machine Learning (ICML), pages 2191\u20132199, 2015.\n\n[6] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval: The Concepts and Technology behind\n\nSearch (Second Edition). Addison-Wesley, 2011.\n\n[7] P. Bailis, E. Gan, S. Madden, D. Narayanan, K. Rong, and S. Suri. MacroBase: Prioritizing Attention in\nFast Data. In ACM International Conference on Management of Data (SIGMOD), pages 541\u2013556, 2017.\n\n[8] V. Chandola, V. Mithal, and V. Kumar. Comparative Evaluation of Anomaly Detection Techniques for\n\nSequence Data. In IEEE International Conference on Data Mining (ICDM), pages 743\u2013748, 2008.\n\n[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly Detection: A Survey. ACM Computing Surveys, 41(3):\n\n15:1\u201315:58, 2009.\n\n[10] G. H. Chen, S. Nikolov, and D. Shah. A Latent Source Model for Nonparametric Time Series Classi\ufb01cation.\n\nIn Annual Conference on Neural Information Processing Systems (NIPS), pages 1088\u20131096, 2013.\n\n[11] H. Cheng, P. Tan, C. Potter, and S. Klooster. Detection and Characterization of Anomalies in Multivariate\n\nTime Series. In SIAM International Conference on Data Mining (SDM), pages 413\u2013424, 2009.\n\n[12] D. J. Cook and N. C. Krishnan. Activity Learning: Discovering, Recognizing, and Predicting Human\n\nBehavior from Sensor Data. John Wiley & Sons, 2015.\n\n[13] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and Mining of Time Series\nData: Experimental Comparison of Representations and Distance Measures. 
Proceedings of the VLDB Endowment, 1(2):1542–1552, 2008.

[14] S. Formenti and S. Demaria. Systemic Effects of Local Radiotherapy. The Lancet Oncology, pages 718–726, 2017.

[15] J. Gottschlich. Paranom: A Parallel Anomaly Dataset Generator. https://arxiv.org/abs/1801.03164/, 2018.

[16] S. Guha, N. Mishra, G. Roy, and O. Schrijvers. Robust Random Cut Forest Based Anomaly Detection on Streams. In International Conference on Machine Learning (ICML), pages 2712–2721, 2016.

[17] M. Hermans and B. Schrauwen. Training and Analyzing Deep Recurrent Neural Networks. In Annual Conference on Neural Information Processing Systems (NIPS), pages 190–198, 2013.

[18] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion Detection using Sequences of System Calls. Journal of Computer Security, 6(3):151–180, 1998.

[19] M. Hüsken and P. Stagge. Recurrent Neural Networks for Time Series Classification. Neurocomputing, 50:223–235, 2003.

[20] E. Keogh. UCR Time-Series Datasets. http://www.cs.ucr.edu/~eamonn/discords/, 2005.

[21] E. Keogh, J. Lin, and A. Fu. HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence. In IEEE International Conference on Data Mining (ICDM), pages 226–233, 2005.

[22] K. Kourou, T. P. Exarchos, K. P. Exarchos, M. V. Karamouzis, and D. I. Fotiadis. Machine Learning Applications in Cancer Prognosis and Prediction. Computational and Structural Biotechnology Journal, 13:8–17, 2015.

[23] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen. Sketch-based Change Detection: Methods, Evaluation, and Applications. In ACM SIGCOMM Conference on Internet Measurement (IMC), pages 234–247, 2003.

[24] N. Laptev, S. Amizadeh, and I. Flint. Generic and Scalable Framework for Automated Time-series Anomaly Detection. In ACM International Conference on Knowledge Discovery and Data Mining (KDD), pages 1939–1947, 2015.

[25] A. Lavin and S.
Ahmad. Evaluating Real-Time Anomaly Detection Algorithms - The Numenta Anomaly\nBenchmark. In IEEE International Conference on Machine Learning and Applications (ICMLA), pages\n38\u201344, 2015.\n\n[26] T. J. Lee, J. Gottschlich, N. Tatbul, E. Metcalf, and S. Zdonik. Greenhouse: A Zero-Positive Machine\nLearning System for Time-Series Anomaly Detection. In Inaugural Conference on Systems and Machine\nLearning (SysML), 2018.\n\n[27] S. C. Li and B. Marlin. A Scalable End-to-end Gaussian Process Adapter for Irregularly Sampled Time\nSeries Classi\ufb01cation. In Annual Conference on Neural Information Processing Systems (NIPS), pages\n1812\u20131820, 2016.\n\n[28] LinkedIn. LinkedIn\u2019s Anomaly Detection and Correlation Library. https://github.com/linkedin/\n\nluminol/, 2018.\n\n[29] Z. C. Lipton, D. C. Kale, C. Elkan, and R. C. Wetzel. Learning to Diagnose with LSTM Recurrent Neural\n\nNetworks. In International Conference on Learning Representations (ICLR), 2016.\n\n[30] P. Malhotra, L. Vig, G. Shroff, and P. Agarwal. Long Short Term Memory Networks for Anomaly Detection\nin Time Series. In European Symposium on Arti\ufb01cial Neural Networks, Computational Intelligence, and\nMachine Learning (ESANN), pages 89\u201394, 2015.\n\n[31] P. Malhotra, A. Ramakrishnan, G. Anand, L. Vig, P. Agarwal, and G. Shroff. LSTM-based Encoder-Decoder\n\nfor Multi-sensor Anomaly Detection. ICML Anomaly Detection Workshop, 2016.\n\n[32] Numenta. NAB Data Corpus. https://github.com/numenta/NAB/tree/master/data/, 2017.\n\n[33] S. Papadimitriou, J. Sun, and C. Faloutsos. Streaming Pattern Discovery in Multiple Time-Series. In\n\nInternational Conference on Very Large Data Bases (VLDB), pages 697\u2013708, 2005.\n\n[34] A. Patcha and J. Park. An Overview of Anomaly Detection Techniques: Existing Solutions and Latest\n\nTechnological Trends. Computer Networks, 51(12):3448\u20133470, 2007.\n\n[35] N. Singh and C. Olinsky. Demystifying Numenta Anomaly Benchmark. 
In International Joint Conference\n\non Neural Networks (IJCNN), pages 1570\u20131577, 2017.\n\n[36] M. Tavallaee, N. Stakhanova, and A. A. Ghorbani. Toward Credible Evaluation of Anomaly-based Intrusion\n\nDetection Methods. IEEE Transactions on Systems, Man, and Cybernetics, 40(5):516\u2013524, 2010.\n\n[37] Twitter. Twitter\u2019s Open-Source Anomaly Detection R Package. https://github.com/twitter/\n\nAnomalyDetection/, 2015.\n\n[38] C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. N. Clark, J. Dolan, D. Duggins, T. Galatali,\nC. Geyer, M. Gittleman, S. Harbaugh, M. Hebert, T. M. Howard, S. Kolski, A. Kelly, M. Likhachev,\nM. McNaughton, N. Miller, K. Peterson, B. Pilnick, R. Rajkumar, P. Rybski, B. Salesky, Y.-W. Seo,\nS. Singh, J. Snider, A. Stentz, W. R. Whittaker, Z. Wolkowicki, J. Ziglar, H. Bae, T. Brown, D. Demitrish,\nB. Litkouhi, J. Nickolaou, V. Sadekar, W. Zhang, J. Struble, M. Taylor, M. Darms, and D. Ferguson.\nAutonomous Driving in Urban Environments: Boss and the Urban Challenge. Journal of Field Robotics,\n25(8):425\u2013466, 2008.\n\n[39] J. A. Ward, P. Lukowicz, and H. W. Gellersen. Performance Metrics for Activity Recognition. ACM\n\nTransactions on Intelligent Systems and Technology (TIST), 2(1):6:1\u20136:23, 2011.\n\n[40] H. Yu, N. Rao, and I. S. Dhillon. Temporal Regularized Matrix Factorization for High-dimensional Time\nSeries Prediction. In Annual Conference on Neural Information Processing Systems (NIPS), pages 847\u2013855,\n2016.\n\n[41] S. Zhai, Y. Cheng, W. Lu, and Z. Zhang. 
Deep Structured Energy Based Models for Anomaly Detection. In International Conference on Machine Learning (ICML), pages 1100–1109, 2016.