{"title": "Statistical Analysis of Nearest Neighbor Methods for Anomaly Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 10923, "page_last": 10933, "abstract": "Nearest-neighbor (NN) procedures are well studied and widely used in both supervised and unsupervised learning problems. In this paper we are concerned with investigating the performance of NN-based methods for anomaly detection. We first show through extensive simulations that NN methods compare favorably to some of the other state-of-the-art algorithms for anomaly detection based on a set of benchmark synthetic datasets. We further consider the performance of NN methods on real datasets, and relate it to the dimensionality of the problem. Next, we analyze the theoretical properties of NN-methods for anomaly detection by studying a more general quantity called distance-to-measure (DTM), originally developed in the literature on robust geometric and topological inference. We provide finite-sample uniform guarantees for the empirical DTM and use them to derive misclassification rates for anomalous observations under various settings. In our analysis we rely on Huber's contamination model and formulate mild geometric regularity assumptions on the underlying distribution of the data.", "full_text": "Statistical Analysis of Nearest Neighbor Methods for\n\nAnomaly Detection\n\n1Department of Statistics and Data Science, Carnegie Mellon University\n\n2Heinz College of Information Systems and Public Policy, Carnegie Mellon University\n\nXiaoyi Gu1, Leman Akoglu2, Alessandro Rinaldo1\n\n{xgu1,lakoglu}@andrew.cmu.edu, arinaldo@cmu.edu\n\nAbstract\n\nNearest-neighbor (NN) procedures are well studied and widely used in both super-\nvised and unsupervised learning problems. In this paper we are concerned with\ninvestigating the performance of NN-based methods for anomaly detection. 
We\n\ufb01rst show through extensive simulations that NN methods compare favorably to\nsome of the other state-of-the-art algorithms for anomaly detection based on a\nset of benchmark synthetic datasets. We further consider the performance of NN\nmethods on real datasets, and relate it to the dimensionality of the problem. Next,\nwe analyze the theoretical properties of NN-methods for anomaly detection by\nstudying a more general quantity called distance-to-measure (DTM), originally\ndeveloped in the literature on robust geometric and topological inference. We\nprovide \ufb01nite-sample uniform guarantees for the empirical DTM and use them to\nderive misclassi\ufb01cation rates for anomalous observations under various settings. In\nour analysis we rely on Huber\u2019s contamination model and formulate mild geometric\nregularity assumptions on the underlying distribution of the data.\n\n1\n\nIntroduction\n\nAnomaly detection is the process of detecting instances that deviate signi\ufb01cantly from the other\nsample members. The problem of detecting anomalies can arise in many different applications, such\nas fraud detection in \ufb01nancial transactions, intrusion detection for security systems, and various\nmedical examinations.\nDepending on the availability of data labels, there are multiple setups for anomaly detection. The\n\ufb01rst is the supervised setup, where labels are available for both normal and anomalous instances\nduring the training stage. Because of its similarity to the standard classi\ufb01cation setup, numerous\nclassi\ufb01cation methods with good empirical performance and well-studied theoretical properties can\nbe adopted. The second setup is the semi-supervised setup, where training data only comprise\nnormal instances and no anomalies. 
This setup is widely used in the intrusion detection literature. Well-known methods with theoretical guarantees include kNNG [1], BP-kNNG [2] and BCOPS [3], with the first two methods developed based on the geometric entropy minimization (GEM) principle proposed in [1], and the third on conformal prediction. Methods under this setup essentially target the estimation of high-density regions, treating low-density points as anomalies. The third setup is the unsupervised setup, which is the most flexible yet challenging. For the rest of the paper, we will focus only on this setup and do not assume any prior knowledge of the data labels.\nMany empirical methods have been developed in the unsupervised setup; they can be roughly classified into four categories: density based methods such as the Robust KDE (RKDE) [4], Local Outlier Factor (LOF) [5], and mixture models (EGMM); distance based methods such as kNN [6] and Angle-based Outlier Detection (ABOD) [7]; model based methods such as the one-class SVM (OCSVM) [8], SVDD [9], and autoencoders [10]; and ensemble methods such as Isolation Forest (IForest) [11] and LODA [12]. In practice, ensemble methods are often favored for their computational efficiency and robustness to tuning parameters, yet there is little theoretical understanding of how and why these algorithms work. Recent work on NN-methods combines kNN with sub-sampling [13, 14] or bagging [15, 16], and shows that such methods are comparable to the other state-of-the-art methods, both in performance and computational efficiency. Moreover, some theoretical results [14, 17] have been developed on how these methods work.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn this paper, we focus on studying NN-methods in the unsupervised setting, without any sub-sampling or bagging. 
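Concretely, the two classical NN detectors that recur throughout the paper score a point either by its kth-nearest-neighbor distance (kthNN) or by the average of its k smallest neighbor distances (kNN), as described at the start of Section 2. A minimal brute-force sketch (function and variable names are ours, not from the paper's released code):

```python
import numpy as np

def nn_anomaly_scores(X, k, method="kNN"):
    """NN-based anomaly scores for every row of X.

    method="kthNN": distance to the k-th nearest neighbor.
    method="kNN":   average distance over the k nearest neighbors.
    Larger scores indicate more anomalous points.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances; O(n^2 d) is fine for a sketch.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)        # exclude each point's self-distance
    knn = np.sort(D, axis=1)[:, :k]    # the k smallest distances per point
    return knn[:, -1] if method == "kthNN" else knn.mean(axis=1)

# A point far away from a tight cluster receives the largest score.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, size=(50, 2)), [[5.0, 5.0]]])
assert nn_anomaly_scores(X, k=3).argmax() == 50
assert nn_anomaly_scores(X, k=3, method="kthNN").argmax() == 50
```

Since the kth distance is the maximum of the k smallest distances, the kthNN score always dominates the kNN score at the same point.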
We begin with an empirical analysis of NN-methods on a set of synthetic benchmark datasets and show that they compare favorably to the other state-of-the-art algorithms. We further discuss their performance on real datasets and relate it to the dimensionality of the problem. Next, we provide a statistical analysis of NN-methods by analyzing the distance-to-a-measure (DTM) [18], a generalization of the NN scheme. The quantity was initially introduced in the robust topological inference literature, in which the DTM proves to be an effective distance-like function for shape reconstruction in the presence of outliers [19]. We give finite sample uniform guarantees on the empirical DTM, and also demonstrate how the DTM classifies the anomalies, under suitable assumptions on the underlying distribution of the data. Our theoretical results differ significantly, both in assumptions and goals, from those provided in [14, 17], and provide complementary insights into the performance of NN-based methods both for anomaly detection and for more general tasks.\n\n2 Empirical Performance of NN-methods\n\nTwo versions of the NN anomaly detection algorithm have been proposed: kthNN [20] and kNN [6]. kthNN assigns the anomaly score of an instance by computing the distance to its kth nearest neighbor, whereas kNN takes the average distance over all k nearest neighbors. Both methods are shown to have competitive performance in various comparative studies [21, 22, 12, 23]. In particular, the comparative study by Goldstein and Uchida [21] is one of the most comprehensive analyses to date that includes a discussion of NN-methods and, at the same time, aligns with the unsupervised anomaly detection setup. However, the authors omit the analysis of ensemble methods, some of which are considered state-of-the-art algorithms (e.g., IForest and LODA). Emmott et al. 
[24] constructed a large corpus (over 20,000) of synthetic benchmark datasets that vary across multiple aspects (e.g., clusteredness, separability, difficulty, etc.). The authors evaluate the performance of eight top-performing algorithms, including IForest and LODA, but omit the analysis of NN-methods.\nIn this section, we provide a comprehensive empirical analysis of NN-methods by comparing kNN, kthNN, and DTM2 (see footnote 1) to IForest, LOF and LODA on (1) the corpus of synthetic datasets developed in [24], (2) 23 real datasets from the ODDS library [25], and (3) 6 high dimensional datasets from the UCI library [26]. The code for all our experiments is publicly available (footnote 2). In general, no one methodology should be expected to perform well in all possible scenarios; in Appendix D we present different examples in which IForest, LODA, LOF and DTM2 perform very differently. For all our experiments, we set the following hyperparameters: sub-sampling size = 256 and number of trees = 100 for IForest; k = 0.03 × (sample size) for all distance based methods, for comparable results; for LODA, we use 100 projections, with each projection using approximately √d features. A discussion of the robustness of distance-based methods to the choice of the hyperparameter k can be found in [27, 28].\n\n2.1 Comparison on Benchmark Datasets\n\nWe first complement Emmott et al.'s study [24] by extending it to NN-based detectors. We calculate the ROC-AUC (AUC) and Average Precision (AP) scores for each method on each benchmark, and compute their respective quantiles on the empirical distributions for AUC and AP scores (refer to Appendix E in [24] for more details on treating AUC and AP as random variables). We say that an algorithm fails on a benchmark with metric AUC (or AP) at significance level α if the computed AUC (or AP) quantile is less than (1 − α). 
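As a small illustration of this failure criterion, the failure rate described next is simply the fraction of benchmarks whose metric quantile falls below 1 − α. A sketch with made-up quantiles (in the actual study the quantiles come from the benchmark corpus of [24]):

```python
import numpy as np

def failure_rate(quantiles, alpha=0.001):
    """Fraction of benchmarks on which an algorithm fails, i.e. its AUC
    (or AP) quantile is below 1 - alpha. `quantiles` holds one quantile
    per benchmark, computed on the metric's empirical distribution."""
    q = np.asarray(quantiles, dtype=float)
    return float(np.mean(q < 1.0 - alpha))

# Toy example: 3 of 5 hypothetical benchmarks fall below the 0.999 cutoff.
assert failure_rate([0.9995, 0.9, 0.5, 0.99999, 0.3], alpha=0.001) == 0.6
```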
Then, the failure rate for each algorithm is the percentage of failures over the entire benchmark corpus. The failure rate gives a better measure of the overall performance of the different methods across the benchmark corpus than the average AUC (or AP) scores, as it takes into account the difficulty of each dataset.\nThe results are shown in Table 1, where the top section is copied from [24] and the bottom section shows the failure rates we obtained for kNN, kthNN, and DTM2. The "Either" column indicates failure under at least one of the two metrics. Among all methods, IForest gives the lowest failure rates for all three metrics. kNN and DTM2 turn out to be the next-best top performers, falling marginally behind IForest.\n\nFootnote 1: DTM2 stands for the empirical DTM (see Section 3) with q = 2. We include its empirical analysis here for comparison purposes.\nFootnote 2: https://github.com/xgu1/DTM\n\nTable 1: Algorithm Failure Rate with Significance Level α = 0.001.\n\nMethod | AUC | AP | Either\nABOD | 0.5898 | 0.6784 | 0.7000\nIForest | 0.5520 | 0.6514 | 0.6741\nLODA | 0.6187 | 0.6955 | 0.7194\nLOF | 0.6016 | 0.7071 | 0.7331\nRKDE | 0.6122 | 0.7030 | 0.7194\nOCSVM | 0.7218 | 0.7342 | 0.7969\nSVDD | 0.8482 | 0.8868 | 0.9080\nEGMM | 0.6188 | 0.7146 | 0.7303\nkNN | 0.5646 | 0.6744 | 0.6960\nkthNN | 0.5831 | 0.6886 | 0.7100\nDTM2 | 0.5669 | 0.6761 | 0.6977\n\n2.2 Comparison on Datasets from the ODDS library\n\nNext, we compare the performance of IForest, LODA, LOF, DTM2, kNN and kthNN on 23 real datasets from the ODDS library [25].\n\nFigure 1: Boxplots for AUC and AP scores on 23 real datasets; panels (a) AUC and (b) AP.\n\nFigure 1 presents the overall distributions of the AUC and AP scores of the six methods as boxplots. 
It appears that all methods except LOF have comparable performance; we further verified this claim via pairwise Wilcoxon signed-rank tests between methods, which showed no statistically significant differences at level 0.05. The full performance table (Table 3) is given in the Appendix, with the last row of the table showing the average rank of each method.\n\nTable 2: AUC and AP performance on high dimensional datasets. The n and d columns give the number of samples and the dimension of each dataset.\n\n(a) AUC\nDataset | n | d | IForest | LODA | LOF | kNN | kthNN | DTM2 | DTMF2\ngisette | 3850 | 4970 | 0.5176 | 0.5023 | 0.6753 | 0.5696 | 0.5429 | 0.5692 | 0.7051\nisolet | 4886 | 617 | 0.5485 | 0.5421 | 0.7330 | 0.6810 | 0.6480 | 0.6796 | 0.7645\nletter | 4586 | 617 | 0.5459 | 0.5600 | 0.7846 | 0.7162 | 0.6826 | 0.7149 | 0.8096\nmadelon | 1430 | 500 | 0.5427 | 0.5327 | 0.5450 | 0.5608 | 0.5552 | 0.5607 | 0.5546\ncancer | 385 | 30 | 0.9626 | 0.9528 | 0.8097 | 0.9780 | 0.9756 | 0.9773 | 0.6937\nionosphere | 242 | 33 | 0.9265 | 0.9118 | 0.9450 | 0.9832 | 0.9803 | 0.9824 | 0.9372\n\n(b) AP\nDataset | n | d | IForest | LODA | LOF | kNN | kthNN | DTM2 | DTMF2\ngisette | 3850 | 4970 | 0.0877 | 0.0907 | 0.1628 | 0.1093 | 0.1015 | 0.1092 | 0.1723\nisolet | 4886 | 617 | 0.1003 | 0.1005 | 0.2343 | 0.2074 | 0.1846 | 0.2070 | 0.2458\nletter | 4586 | 617 | 0.0980 | 0.0956 | 0.2921 | 0.2328 | 0.2054 | 0.2319 | 0.3010\nmadelon | 1430 | 500 | 0.0974 | 0.1067 | 0.1171 | 0.1209 | 0.1181 | 0.1209 | 0.1166\ncancer | 385 | 30 | 0.8277 | 0.6274 | 0.3121 | 0.8813 | 0.8840 | 0.8864 | 0.2800\nionosphere | 242 | 33 | 0.7222 | 0.7438 | 0.6058 | 0.8903 | 0.8801 | 0.8868 | 0.6105\n\n2.3 Effect of the dimension\n\nWe then take a closer look at the performance of IForest, LODA, LOF, DTM2, kNN and kthNN when the data is high dimensional. Additionally, we include in our experiments the analysis of DTMF2, a quantity defined as the inverse ratio of the DTM2 of a point and the average DTM2 of its k-nearest neighbors. DTMF2 can be interpreted as a LOF version of DTM2 and is described in Appendix A. 
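A sketch of how such a quantity can be computed (our own reading of the definition just given — the authoritative version is in Appendix A of the paper; in particular, we orient the ratio so that larger values mean more anomalous, by analogy with LOF):

```python
import math
import numpy as np

def dtm2_scores(X, m):
    """Empirical DTM with q = 2 for every sample point (brute force).
    k = ceil(m * n); each point is included among its own neighbors,
    since it carries mass 1/n under the empirical measure."""
    X = np.asarray(X, dtype=float)
    k = math.ceil(m * len(X))
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = np.sort(D, axis=1)[:, :k]
    return np.sqrt((knn ** 2).mean(axis=1))

def dtmf2_scores(X, m):
    """Hypothetical DTMF2 sketch: a point's DTM2 relative to the average
    DTM2 of its k nearest neighbors (self excluded), LOF-style."""
    X = np.asarray(X, dtype=float)
    k = math.ceil(m * len(X))
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nbrs = np.argsort(D, axis=1)[:, :k]   # indices of the k nearest neighbors
    base = dtm2_scores(X, m)
    return base / base[nbrs].mean(axis=1)

# An isolated point has a much larger DTM2 than its neighbors, so its
# ratio stands out.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, size=(50, 2)), [[5.0, 5.0]]])
assert dtmf2_scores(X, m=0.1).argmax() == 50
```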
We consider six high dimensional real datasets from the UCI library [26] (see [12] for details) and compute the AUC and AP scores for each algorithm. The results are presented in Table 2. The n and d columns stand for the number of samples and the dimension of the datasets. On the datasets gisette, isolet and letter, the performance of IForest and LODA degrades significantly; the NN-methods give somewhat better performance, whereas LOF and DTMF2 show significantly stronger performance. However, on the datasets cancer and ionosphere, whose dimensions are somewhat lower, the situation is reversed, with LOF and DTMF2 giving significantly worse performance than the others. This is consistent with our findings in Section 2.2. The deficiency of IForest in high dimensions is expected, as the IForest trees are generated by random partitioning along a randomly selected feature; in high dimensions, there is a high probability that a large number of features are neglected in the process. From another perspective, [29] discusses the various effects of dimensionality in the context of anomaly detection. In particular, the authors describe a concentration effect of distances in high dimensions, which has a negative effect on IForest, or any other method that relies on pairwise distances between points to compute anomaly scores. NN-methods, on the other hand, are somewhat more robust in high dimensions, as ranking the distance values is still feasible.\nOverall, our experiments show that IForest and NN-methods are the top two methods, with excellent overall performance on both low dimensional synthetic and real datasets. However, NN-methods exhibit better performance than IForest when the data is high dimensional. 
In the following sections, we provide a theoretical understanding of how NN-methods work under the anomaly detection framework.\n\n3 Theoretical Analysis\n\nIn this section we formalize the setting for a simple yet natural anomaly detection problem based on the classic Huber contamination model [30, 31], whereby a target distribution generating normal observations is corrupted by a distribution from which anomalous observations are drawn. We introduce the notion of distance-to-a-measure (DTM) [18], an overall functional of the data based on nearest neighbor statistics, and provide finite sample bounds on the empirical nearest neighbor radii and on the rates of consistency of the DTM in the supremum norm. These theoretical guarantees are novel and may be of independent interest. Finally, we derive conditions under which DTM-based methods provably separate normal and anomalous points, as a function of the level of contamination and the separation between the normal distribution and the anomalous distribution. All the proofs are given in Appendix B.\n\n3.1 Problem Setup\n\nWe assume we observe n i.i.d. realizations Xn = (X1, . . . , Xn) from a distribution P on Rd that follows the Huber contamination model [30, 31]\n\nP = (1 − ε)P0 + εP1,\n\nwhere P0 and P1 are, respectively, the underlying distributions of the normal and anomalous instances, and ε ∈ [0, 1) is the proportion of contamination. Letting S0 and S1 be the supports of P0 and P1, respectively, we further assume that S0 ∩ S1 = ∅. The distributions P0 and P1, their supports and the level of contamination ε are unknown.\nOur goal is to devise a procedure that is able to discriminate the normal observations Xi, belonging to S0, from the anomalous ones, falling in the set S1. 
Since we will be focusing exclusively on NN methods, we begin by introducing a population counterpart to the notion of kth nearest neighbor. Throughout the article, for any x ∈ Rd and r > 0, B(x, r) denotes the closed Euclidean ball of radius r centered at x.\nDefinition 3.1 (p-NN radius). Let p ∈ (0, 1). For any x, define rp(x) to be the radius of the smallest ball centered at x with P-probability mass at least p. Formally,\n\nrp(x) = inf{r > 0 : P(B(x, r)) ≥ p}.\n\nNaturally, the empirical p-NN radius is defined as\n\nr̂p(x) = inf{r > 0 : Pn(B(x, r)) ≥ p},\n\nwhere Pn is the empirical measure that puts mass 1/n on each Xi. Setting k = ⌈np⌉, r̂p(x) is simply the kth-nearest-neighbor radius of the point x with respect to the sample (X1, . . . , Xn). Thus,\n\nPn(B(x, r̂p(x))) = (1/n) |{X1, . . . , Xn} ∩ B(x, r̂p(x))| = k/n.\n\nWe will impose the following mild regularity assumptions on the distribution P:\n\n• Assumption (A0): The sets S0 and S1 have diameters bounded by some L > 0, and are disjoint from each other.\n\n• Assumption (A1): There exist positive constants C = C(P) and ν0 = ν0(P) such that, for all 0 < ν < ν0 and γ ∈ R,\n\n|P(B(x, rp(x) + γ)) − P(B(x, rp(x)))| ≤ ν ⇒ |γ| < Cν,\n\nfor P-almost every x.\n\n• Assumption (A2): P0 satisfies the (a, b)-condition: for b > 0 and any x ∈ S0, there exists a = a(x) > 0 such that P0(B(x, r)) ≥ min{1, ar^b} for all r > 0.\n\nIntuitively, assumption (A1) implies that P has non-zero probability content around the boundary of B(x, rp(x)). Observing further that the function r ∈ R+ ↦ Fx(r) = P(B(x, r)) is the c.d.f. 
of the random variable ∥X − x∥, where X ∼ P, a sufficient condition for (A1) to hold is that, uniformly over all x, Fx has its derivative uniformly bounded away from zero in a fixed neighborhood of rp(x). This condition, originally formulated in [19] to derive bootstrap-based confidence bands for the DTM function, appears to be a natural regularity assumption in the analysis of NN-type methods.\nWhen a(x) = a for all x ∈ S0, assumption (A2) reduces to a condition widely used in the literature on statistical inference for geometric and topological data analysis [32, 33]. Such a condition requires the support of P0 not to locally resemble a lower dimensional set; in particular, it prevents S0 from having thin ridges or outward cusps. When (A2) is violated, it becomes impossible to estimate S0, no matter the size of the sample. The parameter b can be interpreted as the intrinsic dimension of P. In particular, if P admits a strictly positive density on a D-dimensional smooth manifold, then it can be shown that b = D.\nDefinition 3.2 (DTM [18]). The distance-to-a-measure (DTM) with respect to a probability distribution P with parameter m ∈ (0, 1) and power q ≥ 1 is defined as\n\nd(x) = dP,m,q(x) = ( (1/m) ∫₀^m rp(x)^q dp )^{1/q}.   (1)\n\nWhen q = ∞, we set d(x) = dP,m,∞(x) = rm(x).\nIt is immediate from the definition that a point x ∈ Rd has a small DTM value d(x) if its p-NN radii, averaged across all p ∈ (0, m), are small. Intuitively, d(x) can be thought of as a measure of the distance of x from the bulk of the mass of the probability distribution P, at the level of accuracy specified by the parameter m. 
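Computationally, the empirical counterpart of Definition 3.1 reduces to an order statistic: with k = ⌈np⌉, the empirical p-NN radius at x is the distance from x to its k-th nearest sample point. A minimal sketch (function name ours):

```python
import math
import numpy as np

def empirical_p_nn_radius(X, x, p):
    """Empirical p-NN radius: the smallest r with P_n(B(x, r)) >= p.
    With k = ceil(n * p), this is the distance from x to the k-th
    nearest sample point (x need not be one of the samples)."""
    X = np.asarray(X, dtype=float)
    k = math.ceil(len(X) * p)
    return np.sort(np.linalg.norm(X - x, axis=1))[k - 1]

# Five points on a line: the ball around 0 needs radius 2 to capture
# empirical mass 2/5, so the 0.4-NN radius at 0 is the 2nd-nearest distance.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
assert empirical_p_nn_radius(X, np.array([0.0]), 0.4) == 2.0
```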
The choice of the parameter q allows one to weight differently the impact of large versus small p-NN radii.\nBy substituting rp(x) with r̂p(x) in (1), the empirical DTM can be seen to be\n\nd̂(x) = dPn,m,q(x) = ( (1/k) Σ_{Xi∈Nk(x)} ∥Xi − x∥^q )^{1/q},\n\nwhere k = ⌈mn⌉ and Nk(x) denotes the set of k-nearest neighbors to x in the sample. Different values of q ≥ 1 yield different NN-functionals. In particular, the empirical DTM with q = 1 is equivalent to the kNN method, and the empirical DTM with q = ∞ is equivalent to kthNN. The notion of DTM was initially introduced in the geometric inference literature [19], where the DTM was developed for shape reconstruction in the presence of outliers. The DTM is known to have several nice properties: it is 1-Lipschitz, and it is robust to perturbations of the original distribution in the Wasserstein distance. The case of q = 2 is special: the corresponding DTM, denoted below as DTM2, is also semi-concave and distance-like, and its sub-level sets enjoy strong regularity properties. Chazal et al. [19] have also derived the limiting distribution and a confidence band for the DTM.\n\n3.2 Uniform bounds for r̂p and d̂\n\nIn this section we derive finite sample bounds on the deviation of r̂p and d̂ from rp and dP,m,q, respectively, that hold uniformly over all x ∈ Rd or only over the sample points. These theoretical guarantees are, to the best of our knowledge, novel and may be of independent interest.\nTheorem 3.3. Let δ ∈ (0, 1) and set βn = √((4/n)((d + 1) log 2n + log(8/δ))). Under assumption (A1), with probability at least 1 − δ we have that\n\nsup_x |r̂p(x) − rp(x)| ≤ C(βn² + βn√p),\n\nwhere C is the constant introduced in Assumption (A1), simultaneously over all p ∈ (0, 1) such that\n\np + βn² + βn√p ≤ 1 and p − βn² − βn√p ≥ 0.   (2)\n\nThe dimension d enters the previous bound in such a way that, for all p satisfying (2), sup_x |r̂p(x) − rp(x)| → 0 with probability tending to 1, provided that (d log n)/n → 0. If we limit the supremum only to the sample points, then the dependence on the dimension disappears altogether and we can instead achieve a nearly-parametric rate of √((log n)/n).\nTheorem 3.4. Let δ ∈ (0, 1) and set αn = √((4/(n − 1))(log 2(n − 1) + log(8n/δ))). Under assumption (A1), with probability at least 1 − δ we have that\n\nmax_{i=1,...,n} |r̂p(Xi) − rp(Xi)| ≤ C(αn² + αn√p + 2/n),\n\nwhere C is the constant introduced in Assumption (A1), simultaneously over all p ∈ (0, 1) such that\n\np + αn² + αn√p ≤ 1 and p − 2/n − αn² − αn√(p − 2/n) ≥ 0.   (3)\n\nThe results in Theorem 3.3 and Theorem 3.4 yield the following uniform bounds for the DTM of all orders.\nTheorem 3.5. 
Under assumptions (A0) and (A1), with probability at least 1 − δ,\n\nsup_x |d(x) − d̂(x)| ≤ C1 βn(βn + √m),   (4)\n\nand\n\nmax_{i=1,...,n} |d(Xi) − d̂(Xi)| ≤ C2 αn(αn + √m + 2/n),   (5)\n\nwhere βn and αn are defined in Theorem 3.3 and Theorem 3.4, and C1 and C2 are positive constants depending on q, the diameter bound L in Assumption (A0) and the constant C in Assumption (A1).\nRemark. The bound in Theorem 3.5 holds for all choices of q ≥ 1, including the case q = ∞. Evaluating explicitly the integral ∫₀^m (βn + √p)^q dp would bring out an explicit dependence on q but would not lead to better rates.\n\n3.3 DTM for anomaly detection: theoretical guarantees\n\nWe are now ready to derive some theoretical guarantees on the performance of DTM-based methods for discriminating normal and anomalous points in the sample (X1, . . . , Xn) according to the Huber-contamination model described above in Section 3.1. We recall that in our setting, a sample point Xi is normal if it belongs to the support S0 of P0, and is otherwise deemed an anomaly if it lies in S1, the support of P1, where S1 ∩ S0 = ∅.\nThe methodology we consider is quite simple, and it is consistent with the prevailing practice of assigning to each sample point a score that expresses its degree of being anomalous compared to the other points. In detail, we rank the sample points based on their empirical DTM values, and we declare the points with the largest empirical DTM values to be anomalies. This simple procedure will work perfectly well if\n\nmax_{Xi∈S0} d̂(Xi) < min_{Xi∈S1} d̂(Xi)\n\nand if the difference between the two quantities is large. In general, of course, one would expect that some sample points in S0 may have larger empirical DTM values than some of the points in S1. 
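The procedure just described — compute d̂ at every sample point and declare the largest values anomalous — can be sketched as follows (a simplified illustration under Euclidean distance, not the authors' released code):

```python
import math
import numpy as np

def empirical_dtm_scores(X, m, q=2.0):
    """Empirical DTM score for every sample point: the q-mean of the
    distances to its k = ceil(m * n) nearest sample points (self
    included, since the empirical measure puts mass 1/n on each point).
    q = 1 recovers the kNN score and q = inf the kthNN score."""
    X = np.asarray(X, dtype=float)
    k = math.ceil(m * len(X))
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = np.sort(D, axis=1)[:, :k]
    if math.isinf(q):
        return knn[:, -1]
    return (knn ** q).mean(axis=1) ** (1.0 / q)

def rank_by_dtm(X, m, q=2.0):
    """Sample indices ordered from most to least anomalous."""
    return np.argsort(-empirical_dtm_scores(X, m, q))

# The separation event above corresponds to every anomaly preceding
# every normal point in this ranking.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, size=(50, 2)), [[5.0, 5.0]]])
assert rank_by_dtm(X, m=0.1)[0] == 50
```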
The extent to which such incorrect labeling occurs depends on two key factors: how closely the empirical DTM tracks the true DTM, and whether the population DTM could itself discriminate normal points from anomalous ones. The former issue can be handled using the high probability bounds on the stochastic fluctuations of the empirical DTM obtained in the previous section. The latter issue instead requires specifying some degree of separation between the mixture components P0 and P1, both in terms of the distance between their supports and in terms of how their probability mass is distributed. There is more than one way to formalize this setting. Here we choose to remain completely agnostic to the form of the contaminating distribution P1, on which we impose virtually no constraints. On the other hand, we require the normal distribution P0 to satisfy condition (A2) above in such a way that points inside the support have larger values of a(x) than points near the boundary of S0. This condition, which is satisfied if, for example, P0 admits a Lebesgue density whose values increase as a function of the distance from the boundary of S0, ensures that the population DTM will be large near the boundary of S0 and small everywhere else. As a result, incorrect labeling of normal points will only occur around the boundary of S0, and not inside the bulk of the distribution P0. We formalize this intuition in our next result, which is purely deterministic.\nProposition 3.6. Under assumptions (A0) and (A2), suppose that a(x) = g(d(x, ∂S0)), where g(z) is a non-decreasing function on [0, z0) for some z0, and g(z) ≥ g(z0) for all z ≥ z0. 
Let\n\nη = min_{x∈S0, y∈S1} ∥x − y∥   (6)\n\nbe the distance between S0 and S1, and let h > 0 be a given threshold parameter. For any m > ε, additionally assume that\n\ng(z0) ≥ g0 := (m/(1−ε)) [ ((b+q)/b) ( ((m−ε)/m) η^q − h ) ]^{−b/q} for 1 ≤ q < ∞, and g(z0) ≥ g0 := (m/(1−ε)) (η − h)^{−b} for q = ∞.   (7)\n\nNext, define the "safety zone" Aη as\n\nAη = { x ∈ S0 : d(x, ∂S0) ≥ g⁻¹(g0) }.   (8)\n\nThen, we have\n\nsup_{x∈Aη} dP,m,q(x) + h < inf_{y∈S1} dP,m,q(y).   (9)\n\nThe main message from the previous result is that there exists a subset Aη of the support of the normal distribution, which intuitively corresponds to a region deep inside the support of P0 of high density, over which the population DTM is smaller than at any point in the support S1 of the contaminating distribution. Thus, the true DTM is guaranteed to perfectly separate Aη from S1, making mistakes (possibly) only for the normal points in S0 \\ Aη.\nNotice that the definition of Aη depends on all the relevant quantities, namely the contamination parameter ε, the probability parameter m, the dimension b of P0 and the order q of the DTM, through the expression (7). Importantly, it is necessary that m > ε, as otherwise inequality (9) may not be satisfied. 
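To see where the q = ∞ branch of (7) comes from, the following back-of-the-envelope argument may help (our reconstruction from the definitions above, not the authors' verbatim proof, and ignoring strictness of the inequalities):

```latex
% For y in S_1 and m > eps: any ball around y of radius
% r < dist(y, S_0) has P-mass at most eps < m, hence
\[
  d_{P,m,\infty}(y) = r_m(y) \ge \operatorname{dist}(y, S_0) \ge \eta .
\]
% For x in S_0 with a(x) \ge g_0, assumption (A2) gives
\[
  P(B(x,r)) \ge (1-\varepsilon)\, P_0(B(x,r)) \ge (1-\varepsilon)\, g_0\, r^b ,
\]
% so the m-radius of x is controlled:
\[
  d_{P,m,\infty}(x) = r_m(x)
  \le \Bigl( \tfrac{m}{(1-\varepsilon)\, g_0} \Bigr)^{1/b}
  = \eta - h
  \quad \text{when } g_0 = \tfrac{m}{1-\varepsilon}\,(\eta - h)^{-b} ,
\]
% and therefore d_{P,m,\infty}(x) + h \le \eta \le d_{P,m,\infty}(y),
% which is (9) for q = \infty.
```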
For example, we can take P1 to have a point mass at a single point y; then rt(y) = 0 for all t ≤ m, and the right hand side of (9) is zero.\nWhen g(0) = a0 > 0, which occurs, e.g., if P0 has a density bounded away from 0 over its support, we have Aη = S0 whenever\n\nη > ( (m/(m−ε)) ( (b/(b+q)) (m/(a0(1−ε)))^{q/b} + h ) )^{1/q}.\n\nThat is, when S0 and S1 are sufficiently well-separated, the DTM will classify all the points in S0 as normal.\nThe parameter h serves as a buffer that allows one to replace the DTM function d(x) with any estimator that is close to it in the supremum norm by no more than h. Thus, we may plug in the high-probability bounds of Theorem 3.3 and Theorem 3.4 to conclude that, with high probability, the empirical DTM will identify all normal instances within Aη correctly.\nCorollary 3.6.1. Taking h to be twice the upper bound in (5), we get, with probability at least 1 − δ,\n\nmax_{Xi∈Aη} d̂(Xi) < min_{Xi∈S1} d̂(Xi).\n\nSimilarly, if h is twice the upper bound in (4), we have that\n\nsup_{x∈Aη} d̂(x) < inf_{y∈S1} d̂(y).   (10)\n\nThe guarantee in (10) calls for a higher sample complexity, one that depends on the dimension d. At the same time, it extends to all the points in Aη and not just the sample points. Thus the DTM can accurately identify not only the normal instances in the sample but any other normal instance, such as future observations.\n\n3.4 Illustrative examples\n\nWe illustrate the separation condition in Proposition 3.6 with the following example. Consider a collection of normal instances generated from a standard normal distribution. 
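A minimal version of this experiment can be sketched as follows (our own illustration, not the authors' released code: a 2-D Gaussian cloud of normals, a tight cluster of 5 anomalies at varying separation, and DTM2 scores):

```python
import math
import numpy as np

def misclassified_normals(sep, n=200, n_anom=5, m=0.05, seed=0):
    """Number of normal points whose DTM2 score is at least the smallest
    anomaly score, when a tight cluster of anomalies sits at distance
    `sep` from the center of a standard 2-D Gaussian cloud."""
    rng = np.random.default_rng(seed)
    normals = rng.normal(0.0, 1.0, size=(n, 2))
    anomalies = rng.normal(0.0, 0.05, size=(n_anom, 2)) + np.array([sep, 0.0])
    X = np.vstack([normals, anomalies])
    k = math.ceil(m * len(X))
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.sqrt((np.sort(D, axis=1)[:, :k] ** 2).mean(axis=1))
    return int(np.sum(scores[:n] >= scores[n:].min()))

# Far-away anomalies are perfectly separated; as they approach the
# normal cloud, boundary points start to outscore them.
assert misclassified_normals(10.0) == 0
assert misclassified_normals(1.5) >= 1
```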
Figure 2 shows the mis-classification rates for DTM2 as a cluster of 5 anomalies approaches the normal instances, for three different underlying distributions: Gaussian, moon-shaped, and circle. The color of each point represents its class, with black denoting normal instances and red denoting anomalies. The radius of the circle around each point represents its empirical DTM score, and the color of the circle represents its predicted class from DTM2. As the anomalies approach the normal instances, more and more data around the boundary of the normal distribution get mis-classified as anomalies.\n\nFigure 2: Performance of DTM when the separation distance between the normal instances and anomalies gradually decreases (panels: high, medium and low separation). Top: Gaussian; Middle: Moon-Shaped; Bottom: Circle.\n\n4 Conclusions\n\nIn this paper we have presented empirical evidence, based on simulated and real-life benchmark datasets, that NN-based methods show very good performance at identifying anomalous instances in an unsupervised anomaly detection set-up. We have introduced a simple but natural framework for anomaly detection based on the Huber contamination model and have used it to characterize the performance of a class of NN methods for anomaly detection that are based on the distance-to-a-measure (DTM) functional. Our results relate various geometric and analytic properties of the underlying distribution to the accuracy of DTM-methods for anomaly detection. We are able to demonstrate that, under mild conditions, NN methods will mis-classify normal points only around the boundary of the support of the distribution generating normal instances, and we have quantified this phenomenon rigorously. 
Finally, we have derived novel finite-sample bounds on the nearest neighbor radii and on the rate of convergence of the empirical DTM to the true DTM that may be of independent interest.