{"title": "Anomaly Detection with Score functions based on Nearest Neighbor Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 2250, "page_last": 2258, "abstract": "We propose a novel non-parametric adaptive anomaly detection algorithm for high dimensional data based on score functions derived from nearest neighbor graphs on n-point nominal data. Anomalies are declared whenever the score of a test sample falls below q, which is supposed to be the desired false alarm level. The resulting anomaly detector is shown to be asymptotically optimal in that it is uniformly most powerful for the specified false alarm level, q, for the case when the anomaly density is a mixture of the nominal and a known density. Our algorithm is computationally efficient, being linear in dimension and quadratic in data size. It does not require choosing complicated tuning parameters or function approximation classes and it can adapt to local structure such as local change in dimensionality. We demonstrate the algorithm on both artificial and real data sets in high dimensional feature spaces.", "full_text": "Anomaly Detection with Score functions based on\n\nNearest Neighbor Graphs\n\nManqi Zhao\nECE Dept.\n\nBoston University\nBoston, MA 02215\nmqzhao@bu.edu\n\nVenkatesh Saligrama\n\nECE Dept.\n\nBoston University\nBoston, MA, 02215\n\nsrv@bu.edu\n\nAbstract\n\nWe propose a novel non-parametric adaptive anomaly detection algorithm for high\ndimensional data based on score functions derived from nearest neighbor graphs\non n-point nominal data. Anomalies are declared whenever the score of a test\nsample falls below \u03b1, which is supposed to be the desired false alarm level. The\nresulting anomaly detector is shown to be asymptotically optimal in that it is uni-\nformly most powerful for the speci\ufb01ed false alarm level, \u03b1, for the case when\nthe anomaly density is a mixture of the nominal and a known density. 
Our algorithm is computationally efficient, being linear in dimension and quadratic in data size. It does not require choosing complicated tuning parameters or function approximation classes and it can adapt to local structure such as local change in dimensionality. We demonstrate the algorithm on both artificial and real data sets in high dimensional feature spaces.

1 Introduction

Anomaly detection involves detecting statistically significant deviations of test data from the nominal distribution. In typical applications the nominal distribution is unknown and generally cannot be reliably estimated from nominal training data due to a combination of factors such as limited data size and high dimensionality.

We propose an adaptive non-parametric method for anomaly detection based on score functions that map data samples to the interval [0, 1]. Our score function is derived from a K-nearest neighbor graph (K-NNG) on n-point nominal data. An anomaly is declared whenever the score of a test sample falls below α (the desired false alarm error). The efficacy of our method rests upon its close connection to multivariate p-values. In statistical hypothesis testing, a p-value is any transformation of the feature space to the interval [0, 1] that induces a uniform distribution on the nominal data. When test samples with p-values smaller than α are declared as anomalies, the false alarm error is less than α.

We develop a novel notion of p-values based on measures of level sets of likelihood ratio functions. Our notion provides a characterization of the optimal anomaly detector, in that it is uniformly most powerful for a specified false alarm level for the case when the anomaly density is a mixture of the nominal and a known density.
We show that our score function is asymptotically consistent, namely, it converges to our multivariate p-value as the data length approaches infinity.

Anomaly detection has been extensively studied. It is also referred to as novelty detection [1, 2], outlier detection [3], one-class classification [4, 5] and single-class classification [6] in the literature. Approaches to anomaly detection can be grouped into several categories. In parametric approaches [7] the nominal densities are assumed to come from a parameterized family and generalized likelihood ratio tests are used for detecting deviations from nominal. It is difficult to use parametric approaches when the distribution is unknown and data is limited. A K-nearest neighbor (K-NN) anomaly detection approach is presented in [3, 8]. There an anomaly is declared whenever the distance to the K-th nearest neighbor of the test sample exceeds a threshold. In comparison our anomaly detector utilizes the global information available from the entire K-NN graph to detect deviations from the nominal. In addition it has provable optimality properties. Learning-theoretic approaches attempt to find decision regions, based on nominal data, that separate nominal instances from their outliers. These include the one-class SVM of Schölkopf et al. [9], where the basic idea is to map the training data into the kernel space and to separate them from the origin with maximum margin. Other algorithms along this line of research include support vector data description [10], the linear programming approach [1], and the single-class minimax probability machine [11].
While these approaches provide impressive computationally efficient solutions on real data, it is generally difficult to precisely relate tuning parameter choices to a desired false alarm probability.

Scott and Nowak [12] derive decision regions based on minimum volume (MV) sets, which does provide Type I and Type II error control. They approximate (in appropriate function classes) level sets of the unknown nominal multivariate density from training samples. Related work by Hero [13] based on geometric entropic minimization (GEM) detects outliers by comparing test samples to the most concentrated subset of points in the training sample. This most concentrated set is the K-point minimum spanning tree (MST) for n-point nominal data and converges asymptotically to the minimum entropy set (which is also the MV set). Nevertheless, computing the K-MST for n-point data is generally intractable. To overcome these computational limitations [13] proposes heuristic greedy algorithms based on a leave-one-out K-NN graph, which, while inspired by the K-MST algorithm, is no longer provably optimal. Our approach is related to these latter techniques, namely, the MV sets of [12] and the GEM approach of [13]. We develop score functions on the K-NNG which turn out to be empirical estimates of the volume of the MV sets containing the test point. The volume, which is a real number, is a sufficient statistic for ensuring optimal guarantees. In this way we avoid explicit high-dimensional level set computation. Yet our algorithms lead to statistically optimal solutions with the ability to control false alarm and miss error probabilities.

The main features of our anomaly detector are summarized below. (1) Like [13] our algorithm scales linearly with dimension and quadratically with data size and can be applied to high dimensional feature spaces.
(2) Like [12] our algorithm is provably optimal in that it is uniformly most powerful for the specified false alarm level, α, for the case that the anomaly density is a mixture of the nominal and any other density (not necessarily uniform). (3) We do not require assumptions of linearity, smoothness, continuity of the densities or convexity of the level sets. Furthermore, our algorithm adapts to the inherent manifold structure or local dimensionality of the nominal density. (4) Like [13] and unlike other learning-theoretic approaches such as [9, 12] we do not require choosing complex tuning parameters or function approximation classes.

2 Anomaly Detection Algorithm: Score functions based on K-NNG

In this section we present our basic algorithm devoid of any statistical context. Statistical analysis appears in Section 3. Let S = {x_1, x_2, ..., x_n} be the nominal training set of size n belonging to the unit cube [0, 1]^d. For notational convenience we use η and x_{n+1} interchangeably to denote a test point. Our task is to declare whether the test point is consistent with the nominal data or deviates from it. If the test point is an anomaly, it is assumed to come from a mixture of the nominal distribution underlying the training data and another known density (see Section 3).

Let d(x, y) be a distance function denoting the distance between any two points x, y ∈ [0, 1]^d. For simplicity we denote the distances by d_ij = d(x_i, x_j). In the simplest case we assume the distance function to be Euclidean. However, we also consider geodesic distances to exploit the underlying manifold structure. The geodesic distance is defined as the shortest distance on the manifold. The Geodesic Learning algorithm, a subroutine in Isomap [14, 15], can be used to efficiently and consistently estimate the geodesic distances.
In addition, by means of selective weighting of different coordinates, the distance function can also account for pronounced changes in local dimensionality. This can be accomplished, for instance, through Mahalanobis distances or as a by-product of local linear embedding [16]. However, we skip these details here and assume that a suitable distance metric is chosen.

Once a distance function is defined, our next step is to form a K-nearest neighbor graph (K-NNG) or alternatively an ε-neighbor graph (ε-NG). The K-NNG is formed by connecting each x_i to the K closest points {x_{i1}, ..., x_{iK}} in S − {x_i}. We then sort the K nearest distances for each x_i in increasing order d_{i,i1} ≤ ··· ≤ d_{i,iK} and denote R_S(x_i) = d_{i,iK}, that is, the distance from x_i to its K-th nearest neighbor. We construct the ε-NG by connecting x_i and x_j if and only if d_ij ≤ ε. In this case we define N_S(x_i) as the degree of point x_i in the ε-NG.

For the simple case when the anomalous density is an arbitrary mixture of the nominal and the uniform density¹ we consider the following two score functions associated with the two graphs K-NNG and ε-NG respectively. The score functions map the test data η to the interval [0, 1]:

K-LPE:  p̂_K(η) = (1/n) Σ_{i=1}^n I{R_S(η) ≤ R_S(x_i)}   (1)

ε-LPE:  p̂_ε(η) = (1/n) Σ_{i=1}^n I{N_S(η) ≥ N_S(x_i)}   (2)

where I{·} is the indicator function.

Finally, given a pre-defined significance level α (e.g., 0.05), we declare η to be anomalous if p̂_K(η), p̂_ε(η) ≤ α. We call this algorithm the Localized p-value Estimation (LPE) algorithm.
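The K-LPE score of Equation (1) is a simple rank statistic and can be sketched in a few lines of pure Python directly from the definitions above. This is our own illustrative sketch, not code from the paper; the helper names (`knn_radius`, `k_lpe_score`) and the toy Gaussian data are ours.

```python
import math
import random

def dist(x, y):
    # Euclidean distance, the simplest choice of metric d(x, y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_radius(point, others, k):
    # R_S(point): distance from `point` to its K-th nearest neighbor in `others`
    return sorted(dist(point, o) for o in others)[k - 1]

def k_lpe_score(eta, train, k):
    # Equation (1): fraction of nominal points whose K-NN radius is at least
    # as large as the test point's radius. Scores near 0 flag anomalies.
    r_eta = knn_radius(eta, train, k)
    radii = [knn_radius(x, train[:i] + train[i + 1:], k)
             for i, x in enumerate(train)]
    return sum(1 for r in radii if r_eta <= r) / len(train)

# Decision rule: declare eta anomalous iff k_lpe_score(eta, train, k) <= alpha
random.seed(0)
nominal = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]
print(k_lpe_score((8.0, 8.0), nominal, 6))  # far from the nominal mass: score near 0
print(k_lpe_score((0.0, 0.0), nominal, 6))  # typical nominal point: larger score
```

The leave-one-out list slicing mirrors the definition of R_S(x_i) over S − {x_i}; a production version would precompute the training radii once and reuse them for every test point, matching the O(n²d) preprocessing cost discussed below.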
This choice is motivated by its close connection to multivariate p-values (see Section 3).

The score function K-LPE (or ε-LPE) measures the relative concentration of the point η compared to the training set. Section 3 establishes that the scores for nominally generated data are asymptotically uniformly distributed in [0, 1]. Scores for anomalous data are clustered around 0. Hence when scores below level α are declared as anomalous, the false alarm error is asymptotically smaller than α (since the integral of a uniform distribution from 0 to α is α).

Figure 1: Left: Level sets of the nominal bivariate Gaussian mixture distribution used to illustrate the K-LPE algorithm. Middle: Results of K-LPE with K = 6 and Euclidean distance metric for m = 150 test points drawn from an equal mixture of the 2D uniform and the (nominal) bivariate distributions. Scores for the test points are based on 200 nominal training samples. Scores falling below a threshold level 0.05 are declared as anomalies. The dotted contour corresponds to the exact bivariate Gaussian density level set at level α = 0.05. Right: The empirical distribution of the test-point scores associated with the bivariate Gaussian appears to be uniform, while scores for the test points drawn from the 2D uniform distribution cluster around zero.

Figure 1 illustrates the use of the K-LPE algorithm for anomaly detection when the nominal data is a 2D Gaussian mixture. The middle panel of Figure 1 shows that the detection results based on K-LPE are consistent with the theoretical contour for significance level α = 0.05. The right panel of Figure 1 shows the empirical distribution (derived from kernel density estimation) of the score function K-LPE for the nominal (solid blue) and the anomaly (dashed red) data. We can see that the curve for the nominal data is approximately uniform in the interval [0, 1] and the curve for the anomaly data has a peak at 0.
Therefore choosing the threshold α = 0.05 will approximately control the Type I error within 0.05 and minimize the Type II error. We also take note of the inherent robustness of our algorithm. As seen from the figure (right), small changes in α lead to small changes in the actual false alarm and miss levels.

¹When the mixing density is not uniform but, say, f_1, the score functions must be modified to p̂_K(η) = (1/n) Σ_{i=1}^n I{1/(R_S(η)f_1(η)) ≤ 1/(R_S(x_i)f_1(x_i))} and p̂_ε(η) = (1/n) Σ_{i=1}^n I{N_S(η)/f_1(η) ≥ N_S(x_i)/f_1(x_i)} for the two graphs K-NNG and ε-NG respectively.

To summarize the above discussion, our LPE algorithm has three steps:
(1) Inputs: significance level α, distance metric (Euclidean, geodesic, weighted, etc.).
(2) Score computation: construct the K-NNG (or ε-NG) based on d_ij and compute the score function K-LPE from Equation (1) (or ε-LPE from Equation (2)).
(3) Decision: declare η to be anomalous if and only if p̂_K(η) ≤ α (or p̂_ε(η) ≤ α).

Computational Complexity: Computing each pairwise distance requires O(d) operations, and hence O(n²d) operations for all the nodes in the training set. In the worst case, computing the K-NN graph (for small K) and the functions R_S(·), N_S(·) requires O(n²) operations over all the nodes in the training data.
Finally, computing the score for each test point requires O(nd + n) operations (given that R_S(·), N_S(·) have already been computed).

Remark: LPE is fundamentally different from non-parametric density estimation or level set estimation schemes (e.g., MV-set). Those approaches involve explicit estimation of high dimensional quantities and are thus hard to apply in high dimensional problems. By computing scores for each test sample we avoid high-dimensional computation. Furthermore, as we will see in the following section, the scores are estimates of multivariate p-values. These turn out to be sufficient statistics for optimal anomaly detection.

3 Theory: Consistency of LPE

A statistical framework for the anomaly detection problem is presented in this section. We establish that anomaly detection is equivalent to thresholding p-values for multivariate data. We then show that the score functions developed in the previous section are asymptotically consistent estimators of the p-values. Consequently, it will follow that the strategy of declaring an anomaly when a test sample has a low score is asymptotically optimal.

Assume that the data belong to the d-dimensional unit cube [0, 1]^d and that the nominal data is sampled from a multivariate density f_0(x) supported on [0, 1]^d. Anomaly detection can be formulated as a composite hypothesis testing problem. Suppose the test data η comes from a mixture distribution, namely, f(η) = (1 − π)f_0(η) + πf_1(η), where f_1(η) is a mixing density supported on [0, 1]^d. Anomaly detection involves testing the nominal hypothesis H_0 : π = 0 versus the alternative (anomaly) H_1 : π > 0. The goal is to maximize the detection power subject to the false alarm level α, namely, P(declare H_1 | H_0) ≤ α.

Definition 1. Let P_0 be the nominal probability measure and let f_1(·) be P_0-measurable.
Suppose the likelihood ratio f_1(x)/f_0(x) does not have non-zero flat spots on any open ball in [0, 1]^d. Define the p-value of a data point η as

p(η) = P_0( x : f_1(x)/f_0(x) ≥ f_1(η)/f_0(η) )

Note that the definition naturally accounts for singularities which may arise if the support of f_0(·) is a lower dimensional manifold. In this case we encounter f_1(η) > 0, f_0(η) = 0 and the p-value p(η) = 0; here an anomaly is always declared (low score).

The above formula can be thought of as a mapping η → [0, 1]. Furthermore, the distribution of p(η) under H_0 is uniform on [0, 1]. However, as noted in the introduction, there are other such transformations. To build intuition about the above transformation and its utility, consider the following example. When the mixing density is uniform, namely, f_1(η) = U(η) where U(η) is uniform over [0, 1]^d, note that Ω_α = {η | p(η) ≥ α} is a density level set at level α. It is well known (see [12]) that such a density level set is equivalent to a minimum volume set of level α. The minimum volume set at level α is known to be the uniformly most powerful decision region for testing H_0 : π = 0 versus the alternative H_1 : π > 0 (see [13, 12]). The generalization to arbitrary f_1 is described next.

Theorem 1. The uniformly most powerful test for testing H_0 : π = 0 versus the alternative (anomaly) H_1 : π > 0 at a prescribed level α of significance, P(declare H_1 | H_0) ≤ α, is:

φ(η) = { H_1, p(η) ≤ α ; H_0, otherwise }

Proof. We provide the main idea for the proof.
First, measure theoretic arguments are used to establish p(X) as a random variable over [0, 1] under both nominal and anomalous distributions. Next, when X ∼ f_0, i.e., distributed with the nominal density, it follows that the random variable p(X) ∼ U[0, 1]. When X ∼ f = (1 − π)f_0 + πf_1 with π > 0, the random variable p(X) ∼ g, where g(·) is a monotonically decreasing PDF supported on [0, 1]. Consequently, the uniformly most powerful test for a significance level α is to declare p-values smaller than α as anomalies.

Next we derive the relationship between the p-values and our score function. By definition, R_S(η) and R_S(x_i) are correlated because the neighborhoods of η and x_i might overlap. We modify our algorithm to simplify our analysis. We assume n is odd (say) and can be written as n = 2m + 1. We divide the training set S into two parts:

S = S_1 ∪ S_2 = {x_0, x_1, ..., x_m} ∪ {x_{m+1}, ..., x_{2m}}

We modify ε-LPE to p̂_ε(η) = (1/m) Σ_{x_i∈S_1} I{N_{S_2}(η) ≥ N_{S_1}(x_i)} (or K-LPE to p̂_K(η) = (1/m) Σ_{x_i∈S_1} I{R_{S_2}(η) ≤ R_{S_1}(x_i)}). Now R_{S_2}(η) and R_{S_1}(x_i) are independent.

Furthermore, we assume f_0(·) satisfies the following two smoothness conditions:

1. the Hessian matrix H(x) of f_0(x) is always dominated by a matrix with largest eigenvalue λ_M, i.e., ∃M s.t. H(x) ⪯ M ∀x and λ_max(M) ≤ λ_M;
2. on the support of f_0(·), its value is always lower bounded by some β > 0.

We have the following theorem.

Theorem 2. Consider the setup above with the training data {x_i}_{i=1}^n generated i.i.d. from f_0(x). Let η ∈ [0, 1]^d be an arbitrary test sample.
It follows that, for a suitable choice of K and under the above smoothness conditions,

|p̂_K(η) − p(η)| → 0 as n → ∞ almost surely, ∀η ∈ [0, 1]^d.

For simplicity, we limit ourselves to the case when f_1 is uniform. The proof of Theorem 2 consists of two steps:

• We show that the expectation E_{S_1}[p̂_ε(η)] → p(η) as n → ∞ (Lemma 3). This result is then extended to K-LPE (i.e., E_{S_1}[p̂_K(η)] → p(η)) in Lemma 4.
• Next we show that p̂_K(η) → E_{S_1}[p̂_K(η)] as n → ∞ via a concentration inequality (Lemma 5).

Lemma 3 (ε-LPE). By picking ε = m^{−3/(5d)} √(d/(2πe)), with probability at least 1 − e^{−βm^{1/15}/2},

l_m(η) ≤ E_{S_1}[p̂_ε(η)] ≤ u_m(η)   (3)

where

l_m(η) = P_0{x : (f_0(η) − Δ_1)(1 − Δ_2) ≥ (f_0(x) + Δ_1)(1 + Δ_2)} − e^{−βm^{1/15}/2}
u_m(η) = P_0{x : (f_0(η) + Δ_1)(1 + Δ_2) ≥ (f_0(x) − Δ_1)(1 − Δ_2)} + e^{−βm^{1/15}/2}

with Δ_1 = λ_M m^{−6/(5d)}/(2πe(d + 2)) and Δ_2 = 2m^{−1/6}.

Proof. We only prove the lower bound since the upper bound follows along similar lines.
By interchanging the expectation with the summation,

E_{S_1}[p̂_ε(η)] = E_{S_1}[(1/m) Σ_{x_i∈S_1} I{N_{S_2}(η) ≥ N_{S_1}(x_i)}] = (1/m) Σ_{x_i∈S_1} E_{x_i} E_{S_1\x_i}[I{N_{S_2}(η) ≥ N_{S_1}(x_i)}] = E_{x_1}[P_{S_1\x_1}(N_{S_2}(η) ≥ N_{S_1}(x_1))]

where the last equality follows from the symmetric structure of {x_0, x_1, ..., x_m}.

Clearly the objective of the proof is to show P_{S_1\x_1}(N_{S_2}(η) ≥ N_{S_1}(x_1)) → I{f_0(η) ≥ f_0(x_1)} as n → ∞. Skipping technical details, this can be accomplished in two steps. (1) Note that N_S(x_1) is a binomial random variable with success probability q(x_1) := ∫_{B_ε} f_0(x_1 + t)dt. This relates P_{S_1\x_1}(N_{S_2}(η) ≥ N_{S_1}(x_1)) to I{q(η) ≥ q(x_1)}. (2) We relate I{q(η) ≥ q(x_1)} to I{f_0(η) ≥ f_0(x_1)} based on the smoothness condition. The details of these two steps are shown below.

Note that N_{S_1}(x_1) ∼ Binom(m, q(x_1)). By the Chernoff bound for the binomial distribution, we have

P_{S_1\x_1}(N_{S_1}(x_1) − mq(x_1) ≥ δ) ≤ e^{−δ²/(2mq(x_1))}

that is, N_{S_1}(x_1) is concentrated around mq(x_1). This implies

P_{S_1\x_1}(N_{S_2}(η) ≥ N_{S_1}(x_1)) ≥ I{N_{S_2}(η) ≥ mq(x_1) + δ_{x_1}} − e^{−δ_{x_1}²/(2mq(x_1))}   (4)

We choose δ_{x_1} = q(x_1)m^γ (γ will be specified later) and reformulate equation (4) as

P_{S_1\x_1}(N_{S_2}(η) ≥ N_{S_1}(x_1)) ≥ I{N_{S_2}(η)/(m Vol(B_ε)) ≥ (q(x_1)/Vol(B_ε))(1 + 2/m^{1−γ})} − e^{−q(x_1)m^{2γ−1}/2}   (5)

Next, we relate q(x_1) (or ∫_{B_ε} f_0(x_1 + t)dt) to f_0(x_1) via Taylor's expansion and the smoothness condition on f_0,

| (1/Vol(B_ε)) ∫_{B_ε} f_0(x_1 + t)dt − f_0(x_1) | ≤ (λ_M/2) · (1/Vol(B_ε)) ∫_{B_ε} ‖t‖² dt = λ_M ε²/(2d(d + 2))   (6)

and then equation (5) becomes

P_{S_1\x_1}(N_{S_2}(η) ≥ N_{S_1}(x_1)) ≥ I{N_{S_2}(η)/(m Vol(B_ε)) ≥ (f_0(x_1) + λ_M ε²/(2d(d + 2)))(1 + 2/m^{1−γ})} − e^{−q(x_1)m^{2γ−1}/2}

By applying the same steps to N_{S_2}(η) as in equation (4) (Chernoff bound) and equation (6) (Taylor's expansion), we have, with probability at least 1 − e^{−q(η)m^{2γ−1}/2},

E_{x_1}[P_{S_1\x_1}(N_{S_2}(η) ≥ N_{S_1}(x_1))] ≥ P_{x_1}{(f_0(η) − λ_M ε²/(2d(d + 2)))(1 − 2/m^{1−γ}) ≥ (f_0(x_1) + λ_M ε²/(2d(d + 2)))(1 + 2/m^{1−γ})} − e^{−q(x_1)m^{2γ−1}/2}

Finally, by choosing ε² = m^{−6/(5d)} · d/(2πe) and γ = 5/6, we prove the lemma.

Lemma 4 (K-LPE). By picking K = (1 − 2m^{−1/6}) m^{2/5} (f_0(η) − Δ_1), with probability at least 1 − e^{−βm^{1/15}/2},

l_m(η) ≤ E_{S_1}[p̂_K(η)] ≤ u_m(η)   (7)

Proof. The proof is very similar to the proof of Lemma 3 and we only give a brief outline here. Now the objective is to show P_{S_1\x_1}(R_{S_2}(η) ≤ R_{S_1}(x_1)) → I{f_0(η) ≥ f_0(x_1)} as n → ∞. The basic idea is to use the result of Lemma 3. To accomplish this, we note that {R_{S_2}(η) ≤ R_{S_1}(x_1)} contains the events {N_{S_2}(η) ≥ K} ∩ {N_{S_1}(x_1) ≤ K}, or equivalently

{N_{S_2}(η) − q(η)m ≥ K − q(η)m} ∩ {N_{S_1}(x_1) − q(x_1)m ≤ K − q(x_1)m}   (8)

By the tail probability of the binomial distribution, the probability of the above two events converges to 1 exponentially fast if K − q(η)m < 0 and K − q(x_1)m > 0. By using the same two-step bounding technique developed in the proof of Lemma 3, these two inequalities are implied by

K − m^{2/5} (f_0(η) − Δ_1) < 0 and K − m^{2/5} (f_0(x_1) + Δ_1) > 0

Therefore if we choose K = (1 − 2m^{−1/6}) m^{2/5} (f_0(η) − Δ_1), we have with probability at least 1 − e^{−βm^{1/15}/2},

P_{S_1\x_1}(R_{S_2}(η) ≤ R_{S_1}(x_1)) ≥ I{(f_0(η) − Δ_1)(1 − Δ_2) ≥ (f_0(x_1) + Δ_1)(1 + Δ_2)} − e^{−βm^{1/15}/2}

Remark: Lemma 3 and Lemma 4 were proved with specific choices of ε and K. However, they can be chosen in a range of values, which will lead to different lower and upper bounds. We will show in Section 4 via simulation that our LPE algorithm is generally robust to the choice of the parameter K.

Lemma 5. Suppose K = cm^{2/5} and denote p̂_K(η) = (1/m) Σ_{x_i∈S_1} I{R_{S_2}(η) ≤ R_{S_1}(x_i)}. We have

P_0(|E_{S_1}[p̂_K(η)] − p̂_K(η)| > δ) ≤ 2e^{−2δ²m^{1/5}/(c²γ_d²)}

where γ_d is a constant defined as the minimal number of cones centered at the origin of angle π/6 that cover R^d.

Proof. We cannot apply the Law of Large Numbers in this case because the indicators I{R_{S_2}(η) ≤ R_{S_1}(x_i)} are correlated. Instead, we need a more general concentration-of-measure inequality such as McDiarmid's inequality [17]. Denote F(x_0, ..., x_m) = (1/m) Σ_{x_i∈S_1} I{R_{S_2}(η) ≤ R_{S_1}(x_i)}. From Corollary 11.1 in [18],

sup_{x_0,...,x_m,x_i′} |F(x_0, ..., x_i, ..., x_m) − F(x_0, ..., x_i′, ..., x_m)| ≤ Kγ_d/m   (9)

Then the lemma directly follows from applying McDiarmid's inequality.

Theorem 2 directly follows from the combination of Lemma 4 and Lemma 5 and a standard application of the first Borel–Cantelli lemma. We have used the Euclidean distance in Theorem 2. When the support of f_0 lies on a lower dimensional manifold (say of dimension d′ < d), adopting the geodesic metric leads to faster convergence. It turns out that d′ replaces d in the expression for Δ_1 in Lemma 3.

4 Experiments

First, to test the sensitivity of K-LPE to parameter changes, we run K-LPE on the benchmark dataset Banana [19] with K varying from 2 to 12. We randomly pick 109 points with "+1" label and regard them as the nominal training data. The test data comprises 108 "+1" points and 183 "−1" points (ground truth), and the algorithm is supposed to predict "+1" data as nominal and "−1" data as anomalous. Scores computed for the test set using Equation (1) are oblivious to the true f_1 density ("−1" labels).
The Euclidean distance metric is adopted for this experiment.

To control the false alarm at level α, points with score smaller than α are predicted as anomalies. Empirical false alarm and true positive rates are computed from the ground truth. We vary α to obtain the empirical ROC curve. The above procedure is followed for the rest of the experiments in this section. As shown in Figure 2(a), the LPE algorithm is insensitive to K. For comparison we plot the empirical ROC curve of the one-class SVM of [9]. For our OC-SVM implementation, for a fixed bandwidth c, we obtain the empirical ROC curve by varying ν. We then vary the bandwidth, c, to obtain the best (in terms of AUC) ROC curve. The optimal bandwidth turns out to be c = 1.5. In LPE, if we set α = 0.05 we get empirical FA = 0.06, and for α = 0.08, empirical FA = 0.09. For OC-SVM we are unaware of any natural way of picking c and ν to control the FA rate based on training data.

Next, we apply our K-LPE to the problem where the nominal and anomalous data are generated in the following way:

f_0 ∼ (1/2) N([8, 0]ᵀ, diag(1, 9)) + (1/2) N([−8, 0]ᵀ, diag(1, 9)),   f_1 ∼ N(0, diag(49, 49))   (10)

We call the ROC curve corresponding to the optimal Bayesian classifier the Clairvoyant ROC (the red dashed curve in Figure 2(b)). The other two curves are averaged (over 15 trials) empirical ROC curves via LPE. Here we set K = 6 and n = 40 or n = 160. We see that for a relatively small training set of size 160 the average empirical ROC curve is very close to the clairvoyant ROC curve.

Finally, we ran LPE on three real-world datasets: Wine, Ionosphere [20] and the MNIST US Postal Service (USPS) database of handwritten digits.
If there are more than two labels in a data set, we artificially regard points with one particular label as nominal and the points with the other labels as anomalous. For example, for the USPS dataset, we regard instances of digit 0 as nominal and instances of digits 1, ..., 9 as anomalies. The data points are normalized to be within [0, 1]^d and we use geodesic distance [14].

(a) SVM vs. K-LPE for Banana Data    (b) Clairvoyant vs. K-LPE

Figure 2: (a) Empirical ROC curve of K-LPE on the banana dataset with K = 2, 4, 6, 8, 10, 12 (with n = 400) vs the empirical ROC curve of the one-class SVM developed in [9]; (b) Empirical ROC curves of the K-LPE algorithm vs the clairvoyant ROC curve (f_0 is given by Equation (10)) for K = 6 and for n = 40 or 160.

The ROC curves are shown in Figure 3. The feature dimension of Wine is 13 and we apply the ε-LPE algorithm with ε = 0.9 and n = 39. The test set is a mixture of 20 nominal points and 158 anomaly points. The feature dimension of Ionosphere is 34 and we apply the K-LPE algorithm with K = 9 and n = 175. The test set is a mixture of 50 nominal points and 126 anomaly points. The feature dimension of USPS is 256 and we apply the K-LPE algorithm with K = 9 and n = 400. The test set is a mixture of 367 nominal points and 33 anomaly points. For USPS, setting α = 0.05 induces an empirical false positive rate of 6.1% and an empirical false alarm rate of 5.7% (in contrast, FP = 7% and FA = 9% with ν = 5% for OC-SVM as reported in [9]).
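The ROC protocol used throughout this section (sweep α, declare scores below α anomalous, read off empirical rates against ground truth) can be sketched as follows. The function name and the synthetic scores are ours, used only for illustration; under H_0 the scores are approximately uniform, so the empirical false alarm rate tracks α.

```python
import random

def empirical_roc(nominal_scores, anomaly_scores, alphas):
    # For each threshold alpha: predict "anomaly" when score <= alpha, then
    # compute empirical false alarm and detection (true positive) rates.
    curve = []
    for a in alphas:
        fa = sum(s <= a for s in nominal_scores) / len(nominal_scores)
        tp = sum(s <= a for s in anomaly_scores) / len(anomaly_scores)
        curve.append((fa, tp))
    return curve

# Synthetic scores mimicking the theory: nominal scores ~ Uniform[0, 1],
# anomaly scores clustered near 0 (here Uniform[0, 0.1] for illustration).
random.seed(1)
nominal_scores = [random.random() for _ in range(500)]
anomaly_scores = [0.1 * random.random() for _ in range(200)]
for fa, tp in empirical_roc(nominal_scores, anomaly_scores, [0.01, 0.05, 0.2, 1.0]):
    print(f"FA = {fa:.3f}, TP = {tp:.3f}")
```

Because both rates are nondecreasing in α, sweeping α from 0 to 1 traces the full ROC curve from (0, 0) to (1, 1).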
In practice we find K-LPE preferable to ε-LPE, and as a rule of thumb setting K ≈ n^{2/5} is generally effective.

Figure 3: ROC curves on real datasets via LPE: (a) Wine dataset with D = 13, n = 39, ε = 0.9; (b) Ionosphere dataset with D = 34, n = 175, K = 9; (c) USPS dataset with D = 256, n = 400, K = 9.

5 Conclusion

In this paper we proposed a novel non-parametric adaptive anomaly detection algorithm which leads to a computationally efficient solution with provable optimality guarantees. Our algorithm takes a K-nearest neighbor graph as input and produces a score for each test point. The scores turn out to be empirical estimates of the volumes of minimum-volume level sets containing the test points. While minimum-volume level sets provide an optimal characterization for anomaly detection, they are high dimensional quantities and generally difficult to compute reliably in high dimensional feature spaces. Nevertheless, a sufficient statistic for the optimal tradeoff between false alarms and misses is the volume of the MV set itself, which is a real number.
By computing score functions we avoid computing high dimensional quantities and still ensure optimal control of false alarms and misses. The computational cost of our algorithm scales linearly in dimension and quadratically in data size.

References

[1] C. Campbell and K. P. Bennett, "A linear programming approach to novelty detection," in Advances in Neural Information Processing Systems 13. MIT Press, 2001, pp. 395–401.
[2] M. Markou and S. Singh, "Novelty detection: a review – part 1: statistical approaches," Signal Processing, vol. 83, pp. 2481–2497, 2003.
[3] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," in Proceedings of the ACM SIGMOD Conference, 2000.
[4] R. Vert and J. Vert, "Consistency and convergence rates of one-class SVMs and related algorithms," Journal of Machine Learning Research, vol. 7, pp. 817–854, 2006.
[5] D. Tax and K. R. Müller, "Feature extraction for one-class classification," in Artificial Neural Networks and Neural Information Processing, Istanbul, Turkey, 2003.
[6] R. El-Yaniv and M. Nisenson, "Optimal single-class classification strategies," in Advances in Neural Information Processing Systems 19.
MIT Press, 2007.
[7] I. V. Nikiforov and M. Basseville, Detection of Abrupt Changes: Theory and Applications. Prentice-Hall, New Jersey, 1993.
[8] K. Zhang, M. Hutter, and H. Jin, "A new local distance-based outlier detection approach for scattered real-world data," March 2009, arXiv:0903.3257v1 [cs.LG].
[9] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443–1471, 2001.
[10] D. Tax, "One-class classification: Concept-learning in the absence of counter-examples," Ph.D. dissertation, Delft University of Technology, June 2001.
[11] G. R. G. Lanckriet, L. El Ghaoui, and M. I. Jordan, "Robust novelty detection with single-class MPM," in Neural Information Processing Systems Conference, vol. 18, 2005.
[12] C. Scott and R. D. Nowak, "Learning minimum volume sets," Journal of Machine Learning Research, vol. 7, pp. 665–704, 2006.
[13] A. O. Hero, "Geometric entropy minimization (GEM) for anomaly detection and localization," in Neural Information Processing Systems Conference, vol. 19, 2006.
[14] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, pp. 2319–2323, 2000.
[15] M. Bernstein, V. de Silva, J. C. Langford, and J. B. Tenenbaum, "Graph approximations to geodesics on embedded manifolds," 2000.
[16] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, pp. 2323–2326, 2000.
[17] C. McDiarmid, "On the method of bounded differences," in Surveys in Combinatorics. Cambridge University Press, 1989, pp. 148–188.
[18] L. Devroye, L. Györfi, and G.
Lugosi, A Probabilistic Theory of Pattern Recognition. Springer-Verlag New York, Inc., 1996.
[19] "Benchmark repository." [Online]. Available: http://ida.first.fhg.de/projects/bench/benchmarks.htm
[20] A. Asuncion and D. J. Newman, "UCI machine learning repository," 2007. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html