{"title": "Efficient anomaly detection using bipartite k-NN graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 478, "page_last": 486, "abstract": "Learning minimum volume sets of an underlying nominal distribution is a very effective approach to anomaly detection. Several approaches to learning minimum volume sets have been proposed in the literature, including the K-point nearest neighbor graph (K-kNNG) algorithm based on the geometric entropy minimization (GEM) principle [4]. The K-kNNG detector, while possessing several desirable characteristics, suffers from high computation complexity, and in [4] a simpler heuristic approximation, the leave-one-out kNNG (L1O-kNNG) was proposed. In this paper, we propose a novel bipartite k-nearest neighbor graph (BP-kNNG) anomaly detection scheme for estimating minimum volume sets. Our bipartite estimator retains all the desirable theoretical properties of the K-kNNG, while being computationally simpler than the K-kNNG and the surrogate L1O-kNNG detectors. We show that BP-kNNG is asymptotically consistent in recovering the p-value of each test point. Experimental results are given that illustrate the superior performance of BP-kNNG as compared to the L1O-kNNG and other state of the art anomaly detection schemes.", "full_text": "Ef\ufb01cient anomaly detection using\n\nbipartite k-NN graphs\n\nKumar Sricharan\nDepartment of EECS\nUniversity of Michigan\nAnn Arbor, MI 48104\n\nkksreddy@umich.edu\n\nAlfred O. Hero III\nDepartment of EECS\nUniversity of Michigan\nAnn Arbor, MI 48104\nhero@umich.edu\n\nAbstract\n\nLearning minimum volume sets of an underlying nominal distribution is a very ef-\nfective approach to anomaly detection. Several approaches to learning minimum\nvolume sets have been proposed in the literature, including the K-point nearest\nneighbor graph (K-kNNG) algorithm based on the geometric entropy minimiza-\ntion (GEM) principle [4]. 
The K-kNNG detector, while possessing several desirable characteristics, suffers from high computation complexity, and in [4] a simpler heuristic approximation, the leave-one-out kNNG (L1O-kNNG), was proposed. In this paper, we propose a novel bipartite k-nearest neighbor graph (BP-kNNG) anomaly detection scheme for estimating minimum volume sets. Our bipartite estimator retains all the desirable theoretical properties of the K-kNNG, while being computationally simpler than the K-kNNG and the surrogate L1O-kNNG detectors. We show that BP-kNNG is asymptotically consistent in recovering the p-value of each test point. Experimental results are given that illustrate the superior performance of BP-kNNG as compared to the L1O-kNNG and other state of the art anomaly detection schemes.

1 Introduction

Given a training set of normal events, the anomaly detection problem aims to identify unknown, anomalous events that deviate from the normal set. This novelty detection problem arises in applications where failure to detect anomalous activity could lead to catastrophic outcomes, for example, detection of faults in mission-critical systems, quality control in manufacturing, and medical diagnosis.

Several approaches have been proposed for anomaly detection. One class of algorithms assumes a family of parametrically defined nominal distributions. Examples include Hotelling's T test and the Fisher F-test, which are both based on a Gaussian distribution assumption. The drawback of these algorithms is model mismatch: the supposed distribution need not be a correct representation of the nominal data, which can then lead to poor false alarm rates. More recently, several non-parametric methods based on minimum volume (MV) set estimation have been proposed. These methods aim to find the minimum volume set that recovers a certain probability mass α with respect to the unknown probability density of the nominal events. 
If a new event falls within the MV set, it is classified as normal and otherwise as anomalous.

Estimation of minimum volume sets is a difficult problem, especially for high dimensional data. There are two types of approaches to this problem: (1) transform the MV estimation problem into an equivalent density level set estimation problem, which requires estimation of the nominal density; and (2) directly identify the minimal set using function approximation and non-parametric estimation [10, 6, 9]. Both types of approaches involve explicit approximation of high dimensional quantities (the multivariate density function in the first case and the boundary of the minimum volume set in the second) and are therefore not easily applied to high dimensional problems.

The GEM principle developed by Hero [4] for determining MV sets circumvents the above difficulties by using the asymptotic theory of random Euclidean graphs instead of function approximation. However, the GEM-based K-kNNG anomaly detection scheme proposed in [4] is computationally difficult. To address this issue, a surrogate L1O-kNNG anomaly detection scheme was proposed in [4]. L1O-kNNG is computationally simpler than K-kNNG, but loses some desirable properties of the K-kNNG, including asymptotic consistency, as shown below.

In this paper, we use the GEM principle to develop a bipartite k-nearest neighbor (k-NN) graph-based anomaly detection algorithm. 
BP-kNNG retains the desirable properties of the GEM principle and as a result inherits the following features: (i) it is not restricted to linear or even convex decision regions, (ii) it is completely non-parametric, (iii) it is optimal in that it converges to the uniformly most powerful (UMP) test when the anomalies are drawn from a mixture of the nominal density and the uniform density, (iv) it does not require knowledge of anomalies in the training sample, (v) it is asymptotically consistent in recovering the p-value of the test point, and (vi) it produces estimated p-values, allowing for false positive rate control.

K-LPE [13] and RRS [7] are anomaly detection methods which are also based on k-NN graphs. BP-kNNG differs from L1O-kNNG, K-LPE and RRS in the following respects. L1O-kNNG, K-LPE and RRS do not use bipartite graphs; we will show that the bipartite nature of BP-kNNG results in significant computational savings. In addition, the K-LPE and RRS test statistics involve only the k-th nearest neighbor distance, while the statistic in BP-kNNG, like that of L1O-kNNG, involves a summation of the power weighted distances of all the edges in the k-NN graph. This results in increased robustness to outliers in the training sample. Finally, we will show that the mean square rate of convergence of p-values in BP-kNNG, O(T^(-2/(2+d))), is faster than the convergence rate of K-LPE, O(T^(-2/5) + T^(-6/5d)), where T is the size of the nominal training sample and d is the dimension of the data.

The rest of this paper is organized as follows. In Section 2, we outline the statistical framework for minimum volume set anomaly detection. In Section 3, we describe the GEM principle and the K-kNNG and L1O-kNNG anomaly detection schemes proposed in [4]. Next, in Section 4, we develop our bipartite k-NN graph (BP-kNNG) method for anomaly detection. 
We show consistency of the method and compare its computational complexity with that of the K-kNNG, L1O-kNNG and K-LPE algorithms. In Section 5, we show simulation results that illustrate the superior performance of BP-kNNG over L1O-kNNG. We also show that our method compares favorably to other state of the art anomaly detection schemes when applied to real world data from the UCI repository [1]. We conclude with a short discussion in Section 6.

2 Statistical novelty detection

The problem setup is as follows. We assume that a training sample X_T = {X_1, . . . , X_T} of d-dimensional vectors is available. Given a new sample X, the objective is to declare X either a 'nominal' event consistent with X_T or an 'anomalous' event which deviates from X_T. We seek to find a functional D and corresponding detection rule D(x) > 0, so that X is declared nominal if D(X) > 0 holds and anomalous otherwise. The acceptance region is given by A = {x : D(x) > 0}. We seek to further constrain the choice of D to allow as few false negatives as possible for a fixed allowance of false positives.

To formulate this problem, we adopt the standard statistical framework for testing composite hypotheses. We assume that the training sample X_T is an i.i.d. sample drawn from an unknown d-dimensional probability distribution f_0(x) on [0,1]^d. Let X have density f on [0,1]^d. The anomaly detection problem can be formulated as testing the hypotheses H_0 : f = f_0 versus H_1 : f ≠ f_0. For a given α ∈ (0,1), we seek an acceptance region A that satisfies Pr(X ∈ A | H_0) ≥ 1 − α. This requirement maintains the false positive rate at a level no greater than α. Let A = {A : ∫_A f_0(x)dx ≥ 1 − α} denote the collection of acceptance regions of level α. 
The most suitable acceptance region from the collection A is the set which minimizes the false negative rate. Assume that the density f is bounded above by some constant C. In this case the false negative rate is bounded by Cλ(A), where λ(·) is the Lebesgue measure in R^d. Consider the relaxed problem of minimizing the upper bound Cλ(A), or equivalently the volume λ(A) of A. The optimal acceptance region with a maximum false alarm rate α is therefore given by the minimum volume set of level α: Λ_α = min{λ(A) : ∫_A f_0(x)dx ≥ 1 − α}.

Define the minimum entropy set of level α to be Ω_α = min{H_ν(A) : ∫_A f_0(x)dx ≥ 1 − α}, where H_ν(A) = (1 − ν)^(-1) log ∫_A f_0^ν(x)dx is the Rényi ν-entropy of the density f_0 over the set A. It can be shown that when f_0 is a Lebesgue density in R^d, the minimum volume set and the minimum entropy set are equivalent, i.e. Λ_α and Ω_α are identical. Therefore, the optimal decision rule for a given level of false alarm α is to declare an anomaly if X ∉ Ω_α.

This decision rule has a strong optimality property [4]: when f_0 is Lebesgue continuous and has no 'flat' regions over its support, this decision rule is a uniformly most powerful (UMP) test at level 1 − α for the null hypothesis that the test point has density f(x) equal to the nominal f_0(x), versus the alternative hypothesis that f(x) = (1 − ε)f_0(x) + εU(x), where U(x) is the uniform density over [0,1]^d and ε ∈ [0,1]. 
Furthermore, the power function is given by β = Pr(X ∉ Ω_α | H_1) = (1 − ε)α + ε(1 − λ(Ω_α)).

3 GEM principle

In this section, we briefly review the geometric entropy minimization (GEM) principle method [4] for determining minimum entropy sets Ω_α of level α. The GEM method directly estimates the critical region Ω_α for detecting anomalies using minimum coverings of subsets of points in a nominal training sample. These coverings are obtained by constructing minimal graphs, e.g., the k-minimal spanning tree or the k-nearest neighbor graph, covering a K-point subset that is a given proportion of the training sample. Points in the training sample that are not covered by the K-point minimal graphs are identified as tail events.

In particular, let X_{K,T} denote one of the (T choose K) K-point subsets of X_T. The k-nearest neighbors (k-NN) of a point X_i ∈ X_{K,T} are the k closest points to X_i among X_{K,T} − X_i. Denote the corresponding set of edges between X_i and its k-NN by {e_i(1), . . . , e_i(k)}. For any subset X_{K,T}, define the total power weighted edge length of the k-NN graph on X_{K,T}, with power weighting γ (0 < γ < d), as

L_kNN(X_{K,T}) = Σ_{i=1}^{K} Σ_{l=1}^{k} |e_{t_i}(l)|^γ,

where {t_1, . . . , t_K} are the indices of X_i ∈ X_{K,T}. Define the K-kNNG graph to be the K-point k-NN graph having minimal length min_{X_{K,T} ⊂ X_T} L_kNN(X_{K,T}) over all (T choose K) subsets X_{K,T}, and denote the corresponding length-minimizing subset of K points by X*_{K,T} = argmin_{X_{K,T} ⊂ X_T} L_kNN(X_{K,T}).

The K-kNNG thus specifies a minimal graph covering X*_{K,T} of size K. This graph can be viewed as capturing the densest regions of X_T. If X_T is an i.i.d. sample from a multivariate density f_0(x) and if lim_{K,T→∞} K/T = ρ, then the set X*_{K,T} converges a.s. 
to the minimum ν-entropy set containing a proportion of at least ρ of the mass of f_0(x), where ν = 1 − γ/d [4]. This set can be used to perform anomaly detection.

3.1 K-kNNG anomaly detection

Given a test sample X, denote the pooled sample X_{T+1} = X_T ∪ {X} and determine the K-kNNG graph over X_{T+1}. Declare X to be an anomaly if X ∉ X*_{K,T+1} and nominal otherwise. When the density f_0 is Lebesgue continuous, it follows from [4] that as K, T → ∞, this anomaly detection algorithm has a false alarm rate that converges to α = 1 − K/T and power that converges to that of the minimum volume set test of level α. An identical detection scheme based on the K-minimal spanning tree has also been developed in [4].

The K-kNNG anomaly detection scheme therefore offers a direct approach to detecting outliers while bypassing the more difficult problems of density estimation and level set estimation in high dimensions. However, this algorithm requires construction of k-nearest neighbor graphs (or k-minimal spanning trees) over (T choose K) different subsets. For each input test point, the runtime of this algorithm is therefore O(dK^2 (T choose K)). As a result, the K-kNNG method is not well suited to anomaly detection for large sample sizes.

3.2 L1O-kNNG

To address the computational problems of K-kNNG, Hero [4] proposed implementing the K-kNNG for the simplest case K = T − 1. The runtime of this algorithm for each input test point is O(dT^2). Clearly, the L1O-kNNG is of much lower complexity than the K-kNNG scheme. However, the L1O-kNNG detects anomalies at a fixed false alarm rate 1/(T + 1), where T is the training sample size. To detect anomalies at a higher false alarm rate α*, one would have to subsample the training set and use only T* = 1/α* − 1 training samples. This destroys any hope of asymptotic consistency for the L1O-kNNG.

In the next section, we propose a different GEM-based algorithm that uses bipartite graphs. The algorithm has a much faster runtime than the L1O-kNNG and, unlike the L1O-kNNG, is asymptotically consistent and can operate at any specified false alarm rate α. We describe our algorithm below.

4 BP-kNNG

Let {X_N, X_M} be a partition of X_T with card{X_N} = N and card{X_M} = M = T − N respectively. As above, let X_{K,N} denote one of the (N choose K) subsets of K distinct points from X_N. Define the bipartite k-NN graph on {X_{K,N}, X_M} to be the set of edges linking each X_i ∈ X_{K,N} to its k nearest neighbors in X_M. Define the total power weighted edge length of this bipartite k-NN graph, with power weighting γ (0 < γ < d) and a fixed number of edges s (1 ≤ s ≤ k) corresponding to each vertex X_i ∈ X_{K,N}, to be

L_{s,k}(X_{K,N}, X_M) = Σ_{i=1}^{K} Σ_{l=k−s+1}^{k} |e_{t_i}(l)|^γ,

where {t_1, . . . , t_K} are the indices of X_i ∈ X_{K,N} and {e_{t_i}(1), . . . , e_{t_i}(k)} are the k-NN edges in the bipartite graph originating from X_{t_i} ∈ X_{K,N}. Define the bipartite K-kNNG graph to be the one having minimal weighted length min_{X_{K,N} ⊂ X_N} L_{s,k}(X_{K,N}, X_M) over all (N choose K) subsets X_{K,N}, and denote the corresponding minimizing subset of K points by X*_{K,N} = argmin_{X_{K,N} ⊂ X_N} L_{s,k}(X_{K,N}, X_M).

Using the theory of partitioned k-NN graph entropy estimators [11], it follows that as k/M → 0, k, N → ∞ and for fixed s, the set X*_{K,N} converges a.s. 
to the minimum ν-entropy set Ω_{1−ρ} containing a proportion of at least ρ of the mass of f_0(x), where ρ = lim_{K,N→∞} K/N and ν = 1 − γ/d.

This suggests using the bipartite k-NN graph to detect anomalies in the following way. Given a test point X, denote the pooled sample X_{N+1} = X_N ∪ {X} and determine the optimal bipartite K-kNNG graph X*_{K,N+1} over {X_{K,N+1}, X_M}. Now declare X to be an anomaly if X ∉ X*_{K,N+1} and nominal otherwise. It is clear that by the GEM principle, this algorithm detects false alarms at a rate that converges to α = 1 − K/T and power that converges to that of the minimum volume set test of level α.

We can equivalently determine X*_{K,N+1} as follows. For each X_i ∈ X_N, construct d_{s,k}(X_i) = Σ_{l=k−s+1}^{k} |e_i(l)|^γ. For each test point X, define d_{s,k}(X) = Σ_{l=k−s+1}^{k} |e_X(l)|^γ, where {e_X(1), . . . , e_X(k)} are the k-NN edges from X to X_M. Now, choose the K points among X_N ∪ {X} with the K smallest of the N + 1 edge lengths {d_{s,k}(X_i), X_i ∈ X_N} ∪ {d_{s,k}(X)}. Because of the bipartite nature of the construction, this is equivalent to choosing X*_{K,N+1}. This leads to the proposed BP-kNNG anomaly detection algorithm described by Algorithm 1.

Algorithm 1 Anomaly detection scheme using bipartite k-NN graphs
1. Input: training samples X_T, test samples X, false alarm rate α
2. Training phase
   a. Create partition {X_N, X_M}
   b. Construct k-NN bipartite graph on partition
   c. Compute k-NN lengths d_{s,k}(X_i) = Σ_{l=k−s+1}^{k} |e_i(l)|^γ for each X_i ∈ X_N
3. Test phase: detect anomalous points
   for each input test sample X do
      Compute k-NN length d_{s,k}(X) = Σ_{l=k−s+1}^{k} |e_X(l)|^γ
      if (1/N) Σ_{X_i ∈ X_N} 1(d_{s,k}(X_i) < d_{s,k}(X)) ≥ 1 − α then
         Declare X to be anomalous
      else
         Declare X to be non-anomalous
      end if
   end for

4.1 BP-kNNG p-value estimates

The p-value is a score between 0 and 1 that is associated with the likelihood that a given point X_0 comes from a specified nominal distribution. The BP-kNNG generates an estimate of the p-value that is asymptotically consistent, guaranteeing that the BP-kNNG detector is a consistent novelty detector.

Specifically, for a given test point X_0, the true p-value associated with X_0 in a minimum volume set test is given by p_true(X_0) = ∫_{S(X_0)} f_0(z)dz, where S(X_0) = {z : f_0(z) ≤ f_0(X_0)} and E(X_0) = {z : f_0(z) = f_0(X_0)}. p_true(X_0) is the minimal level α at which X_0 would be rejected. The empirical p-value associated with the BP-kNNG is defined as

p_bp(X_0) = (1/N) Σ_{X_i ∈ X_N} 1(d_{s,k}(X_i) ≥ d_{s,k}(X_0)).   (1)

4.2 Asymptotic consistency and optimal convergence rates

Here we prove that the BP-kNNG detector is asymptotically consistent by showing that, for a fixed number of edges s, E[(p_bp(X_0) − p_true(X_0))^2] → 0 as k/M → 0, k, N → ∞. In the process, we also obtain rates of convergence of this mean-squared error. These rates depend on k, N and M, and result in the specification of an optimal number of neighbors k and an optimal partition ratio N/M that achieve the best trade-off between bias and variance of the p-value estimates p_bp(X_0). We assume that the density f_0 (i) is bounded away from 0 and ∞ and is continuous on its support S, (ii) has no flat spots over its support set, and (iii) has a finite number of modes. Let E denote the expectation w.r.t. the density f_0, and B, V denote the bias and variance operators. Throughout this section, assume without loss of generality that {X_1, . . . 
, X_N} ∈ X_N and {X_{N+1}, . . . , X_T} ∈ X_M.

Bias: We first introduce the oracle p-value p_orac(X_0) = (1/N) Σ_{X_i ∈ X_N} 1(f_0(X_i) ≤ f_0(X_0)) and note that E[p_orac(X_0)] = p_true(X_0). The distance e_i(l) of a point X_i ∈ X_N to its l-th nearest neighbor in X_M is related to the bipartite l-nearest-neighbor density estimate f̂_l(X_i) = (l − 1)/(M c_d e_i^d(l)) (Section 2.3, [11]), where c_d is the unit ball volume in d dimensions. Let

e(X) = Σ_{l=k−s+1}^{k} ( ((k − 1)/(l − 1)) f̂_l(X) )^{ν−1} − s(f(X))^{ν−1}

and

δ(X_i, X_0) = δ_i = (f(X_i))^{ν−1} − (f(X_0))^{ν−1}.

We then have

B[p_bp(X_0)] = E[p_bp(X_0)] − p_true(X_0) = E[p_bp(X_0) − p_orac(X_0)]
            = E[1(d_{s,k}(X_1) ≥ d_{s,k}(X_0))] − E[1(f(X_1) ≤ f(X_0))]
            = E[1(e(X_1) − e(X_0) + δ_1 ≤ 0) − 1(δ_1 ≤ 0)].

This bias will be non-zero when 1(e(X_1) − e(X_0) + δ_1 ≤ 0) ≠ 1(δ_1 ≤ 0). First we investigate this condition when δ_1 > 0. In this case, for 1(e(X_1) − e(X_0) + δ_1 ≤ 0) ≠ 1(δ_1 ≤ 0), we need −e(X_1) + e(X_0) ≥ δ_1. Likewise, when δ_1 ≤ 0, 1(e(X_1) − e(X_0) + δ_1 ≤ 0) ≠ 1(δ_1 ≤ 0) occurs when e(X_1) − e(X_0) > |δ_1|.

From the theory developed in [11], for any fixed s, |e(X)| = O((k/M)^{1/d}) + O(1/√k) with probability greater than 1 − o(1/M). This implies that

B[p_bp(X_0)] = E[1(e(X_1) − e(X_0) + δ_1 ≤ 0) − 1(δ_1 ≤ 0)]
            = Pr{|δ_1| = O((k/M)^{1/d} + 1/√k)} + o(1/M) = O((k/M)^{1/d} + 1/√k),   (2)

where the last step follows from our assumption that the density f_0 is continuous and has a finite number of modes.

Variance: Define b_i = 1(e(X_i) − e(X_0) + δ_i ≤ 0) − 1(δ_i ≤ 0). 
We can compute the variance in a similar manner to the bias as follows (for additional details, please refer to the supplementary material):

V[p_bp(X_0)] = (1/N) V[1(e(X_1) − e(X_0) + δ_1 ≤ 0)] + ((N − 1)/N) Cov[b_1, b_2]
            = O(1/N) + E[b_1 b_2] − E[b_1]E[b_2] = O(1/N + (k/M)^{2/d} + 1/k).   (3)

Consistency of p-values: From (2) and (3), we obtain an asymptotic representation of the estimated p-value: E[(p_bp(X_0) − p_true(X_0))^2] = O((k/M)^{2/d}) + O(1/k) + O(1/N). This implies that p_bp converges in mean square to p_true, for a fixed number of edges s, as k/M → 0, k, N → ∞.

Optimal choice of parameters: The optimal choice of k to minimize the MSE is given by k = Θ(M^{2/(2+d)}). For fixed M + N = T, to minimize MSE, N should then be chosen to be of order O(M^{(4+d)/(4+2d)}), which implies that M = Θ(T). The mean square convergence rate for this optimal choice of k and partition ratio N/M is given by O(T^{−2/(2+d)}). In comparison, the K-LPE method requires that k grow with the sample size at rate k = Θ(T^{2/5}); the mean square rate of convergence of the p-values in K-LPE is then O(T^{−2/5} + T^{−6/5d}). The rate of convergence of the p-values is therefore faster in the case of BP-kNNG as compared to K-LPE.

4.3 Comparison of run time complexity

Here we compare the complexity of BP-kNNG with that of K-kNNG, L1O-kNNG and K-LPE. For a single query point X, the runtime of K-kNNG is O(dK^2 (T choose K)), while the complexity of the surrogate L1O-kNNG algorithm and of K-LPE is O(dT^2). On the other hand, the complexity of the proposed BP-kNNG algorithm is dominated by the computation of d_{s,k}(X_i) for each X_i ∈ X_N and d_{s,k}(X), which is O(dNM) = O(dT^{(8+3d)/(4+2d)}) = o(dT^2).

For the K-kNNG, L1O-kNNG and K-LPE, a new k-NN graph has to be constructed on {X_N ∪ {X}} for every new query point X. 
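As a concrete illustration of this one-time training cost, here is a minimal NumPy sketch of the bipartite scheme of Algorithm 1. This is our reconstruction, not the authors' code: the function names (`bp_knng_train`, `bp_knng_pvalue`), the brute-force distance computation, and the default parameters are all ours.

```python
import numpy as np

def bp_knng_train(X_train, k=5, s=1, frac_N=0.1, gamma=1.0, rng=None):
    """Training phase of a BP-kNNG sketch: partition X_train into {X_N, X_M},
    then compute and store the power-weighted bipartite k-NN lengths
    d_{s,k}(X_i) for every X_i in X_N, once."""
    rng = np.random.default_rng(rng)
    T = len(X_train)
    idx = rng.permutation(T)
    N = int(frac_N * T)
    X_N, X_M = X_train[idx[:N]], X_train[idx[N:]]

    def d_sk(points):
        # Euclidean distances from each point to all of X_M (brute force).
        dists = np.linalg.norm(points[:, None, :] - X_M[None, :, :], axis=2)
        # Keep the s largest of the k nearest-neighbor edges, power-weighted:
        # sum over l = k-s+1, ..., k of |e(l)|^gamma.
        knn = np.sort(dists, axis=1)[:, :k]
        return (knn[:, k - s:] ** gamma).sum(axis=1)

    return {"d_train": d_sk(X_N), "d_sk": d_sk}

def bp_knng_pvalue(model, X_test):
    """Empirical p-value of eq. (1): fraction of X_N whose stored length
    d_{s,k}(X_i) is at least d_{s,k}(X_0)."""
    d_test = model["d_sk"](np.atleast_2d(X_test))
    return (model["d_train"][None, :] >= d_test[:, None]).mean(axis=1)
```

A test point is then declared anomalous when its p-value falls at or below the desired false alarm rate α; note that `d_train` is computed once during training, and each new query only needs its distances to X_M.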
On the other hand, because of the bipartite construction of our k-NN graph, d_k(X_i) for each X_i ∈ X_N needs to be computed and stored only once. For every new query X that comes in, the cost to compute d_k(X) is only O(dM) = O(dT). For a total of L query points, the overall runtime complexity of our algorithm is therefore much smaller than that of the L1O-kNNG, K-LPE and K-kNNG anomaly detection schemes: O(dT(T^{(4+d)/(4+2d)} + L)), compared to O(dLT^2), O(dLT^2) and O(dLK^2 (T choose K)) respectively.

5 Simulation comparisons

We compare the L1O-kNNG and the bipartite BP-kNNG schemes on a simulated data set. The training set contains 1000 realizations drawn from a 2-dimensional Gaussian density f_0 with mean 0 and diagonal covariance σ^2 I with σ = 0.1. The test set contains 500 realizations drawn from 0.8 f_0 + 0.2 U, where U is the uniform density on [0,1]^2. Samples from the uniform distribution are classified to be anomalies. 
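The simulated data set above can be generated with a short script (a sketch under the stated parameters; the random seed and variable names are ours, and the analytic check of the minimum-volume disk follows the formula given later in this section):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 0.1

# Nominal density f0: 2-D Gaussian, mean 0, covariance sigma^2 * I.
train = rng.normal(0.0, sigma, size=(1000, 2))

# Test set: mixture 0.8*f0 + 0.2*U, with U uniform on [0, 1]^2.
from_uniform = rng.random(500) < 0.2
test = np.where(from_uniform[:, None],
                rng.random((500, 2)),                 # anomalous draws from U
                rng.normal(0.0, sigma, size=(500, 2)))  # nominal draws from f0

# For this f0, the minimum volume set of level alpha is a disk centered at
# the origin with radius sqrt(2*sigma^2*log(1/alpha)); the empirical mass
# of the disk should be close to 1 - alpha.
alpha = 0.05
radius = np.sqrt(2 * sigma**2 * np.log(1 / alpha))
inside = np.linalg.norm(train, axis=1) <= radius  # inside.mean() ≈ 0.95
```

The disk check is a quick sanity test that the analytic minimum-volume set really captures mass 1 − α under f_0.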
The percentage of anomalies in the test set is therefore 20%.

[Figure 1: Comparison of performance of L1O-kNNG and BP-kNNG. (a) ROC curves for L1O-kNNG and BP-kNNG; the curve labeled 'clairvoyant' is the ROC of the UMP anomaly detector. (b) Comparison of observed false alarm rates for L1O-kNNG and BP-kNNG with the desired false alarm rates.]

Data set         Sample size   Dimension   Anomaly class
HTTP (KDD'99)    567497        3           attack (0.4%)
Forest           286048        10          class 4 vs class 2 (0.9%)
Mulcross         262144        4           2 clusters (10%)
SMTP (KDD'99)    95156         3           attack (0.03%)
Shuttle          49097         9           class 2,3,5,6,7 vs class 1 (7%)

Table 1: Description of data used in anomaly detection experiments.

The distribution f_0 has essential support on the unit square. For this simple case, the minimum volume set of level α is a disk centered at the origin with radius √(2σ^2 log(1/α)), so that λ(Ω_α) = 2πσ^2 log(1/α). By the power expression of Section 2, the power of the uniformly most powerful (UMP) test is then (1 − ε)α + ε(1 − 2πσ^2 log(1/α)).

L1O-kNNG and BP-kNNG were implemented in Matlab 7.6 on a 2 GHz Intel processor with 3 GB of RAM. The value of k was set to 5. For the BP-kNNG, we set s = 1, N = 100 and M = 900. In Fig. 1(a), we compare the detection performance of L1O-kNNG and BP-kNNG against the 'clairvoyant' UMP detector in terms of the ROC. We note that the proposed BP-kNNG is closer to the optimal UMP test than the L1O-kNNG. In Fig. 
1(b), we note the close agreement between desired and observed false alarm rates for BP-kNNG. Note that the L1O-kNNG significantly underestimates its false alarm rate at higher levels of true false alarm. In the case of the L1O-kNNG, it took an average of 60 ms to test each instance for possible anomaly; the total run-time over the 500 test instances was therefore 60 × 500 = 30000 ms. For the BP-kNNG, a single instance took an average of 57 ms, but when all the instances were processed together, the total run time was only 97 ms. This significant savings in runtime is due to the fact that the bipartite graph does not have to be constructed separately for each new test instance; it suffices to construct it once on the entire data set.

5.1 Experimental comparisons

In this section, we compare our algorithm to several other state of the art anomaly detection algorithms, namely: MassAD [12], isolation forest (or iForest) [5], two distance-based methods, ORCA [2] and K-LPE [13], a density-based method, LOF [3], and the one-class support vector machine (or 1-SVM) [9]. All the methods are tested on the five largest data sets used in [5]. The data characteristics are summarized in Table 1. One of the anomaly data generators is Mulcross [8] and the other four are from the UCI repository [1]. Full details about the data can be found in [5].

The comparison performance is evaluated in terms of averaged AUC (area under the ROC curve) and processing time (a total of training and test time). Results for BP-kNNG are compared with results for L1O-kNNG, K-LPE, MassAD, iForest and ORCA in Table 2. The results for MassAD, iForest and ORCA are reproduced from [12]. MassAD and iForest were implemented in Matlab and tested on an AMD Opteron machine with a 1.8 GHz processor and 4 GB memory. 
The results for ORCA, LOF and 1-SVM were conducted using the same experimental setting but on a faster 2.3 GHz machine. We exclude the results for LOF and 1-SVM from Table 2 because MassAD, iForest and ORCA have been shown to outperform LOF and 1-SVM in [12].

                      AUC                                  Time (secs)
Data sets   BP    L10  K-LPE  Mass  iF    ORCA    BP    L10    K-LPE  Mass  iF    ORCA
HTTP        0.99  NA   NA     1.00  1.00  0.36    3.81  .10/i  .19/i  34    147   9487
Forest      0.86  NA   NA     0.91  0.87  0.83    7.54  .18/i  .18/i  18    79    6995
Mulcross    1.00  NA   NA     0.99  0.96  0.33    4.68  .26/i  .17/i  17    75    2512
SMTP        0.90  NA   NA     0.86  0.88  0.87    0.74  .11/i  .17/i  7     26    267
Shuttle     0.99  NA   NA     0.99  1.00  0.60    1.54  .45/i  .16/i  4     15    157

Table 2: Comparison of anomaly detection schemes in terms of AUC and run-time for BP-kNNG (BP) against L1O-kNNG (L10), K-LPE, MassAD (Mass), iForest (iF) and ORCA. When reporting results for L1O-kNNG and K-LPE, we report the processing time per test instance (/i). We are unable to report the AUC for K-LPE and L1O-kNNG because of the large processing time. We note that BP-kNNG compares favorably in terms of AUC while also requiring the least run-time.

                       Desired false alarm
Data sets         0.01    0.02    0.05    0.1     0.2
HTTP (KDD'99)     0.007   0.015   0.063   0.136   0.216
Forest            0.009   0.015   0.035   0.071   0.150
Mulcross          0.008   0.014   0.040   0.096   0.186
SMTP (KDD'99)     0.006   0.017   0.046   0.099   0.204
Shuttle           0.026   0.030   0.045   0.079   0.179

Table 3: Comparison of desired and observed false alarm rates for BP-kNNG. There is good agreement between the desired and observed rates.

We implemented BP-kNNG, L1O-kNNG and K-LPE in Matlab on an Intel 2 GHz processor with 3 GB RAM. We note that this machine is comparable to the AMD Opteron machine with a 1.8 GHz processor. We choose T = 10^4 training samples and fix k = 50 in all three cases. 
For BP-kNNG, we fix s = 5 and N = 10^3. When reporting results for L1O-kNNG and K-LPE, we report the processing time per test instance (/i). We are unable to report the AUC for K-LPE because of the large processing time, and for L1O-kNNG because it cannot operate at high false alarm rates.

From the results in Table 2, we see that BP-kNNG performs comparably in terms of AUC to the other algorithms, while having the least processing time across all algorithms (implemented on different, but comparable, machines). In addition, BP-kNNG allows the specification of a threshold for anomaly detection at a desired false alarm rate. This is corroborated by the results in Table 3, where we see that the observed false alarm rates across the different data sets are close to the desired false alarm rates.

6 Conclusions

The geometric entropy minimization (GEM) principle was introduced in [4] to extract minimal set coverings that can be used to detect anomalies from a set of training samples. In this paper we propose a bipartite k-nearest neighbor graph (BP-kNNG) anomaly detection algorithm based on the GEM principle. BP-kNNG inherits the theoretical optimality properties of GEM methods, including consistency, while being an order of magnitude faster than the methods proposed in [4].

We compared BP-kNNG against state of the art anomaly detection algorithms and showed that BP-kNNG compares favorably in terms of both ROC performance and computation time. In addition, BP-kNNG enjoys several other advantages, including the ability to detect anomalies at a desired false alarm rate. The p-value of each test point can also be easily computed via (1), making BP-kNNG easily extendable to incorporate false discovery rate constraints.

References

[1] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.

[2] S. D. Bay and M. Schwabacher. 
Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, pages 29–38, New York, NY, USA, 2003. ACM.

[3] M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD '00, pages 93–104, New York, NY, USA, 2000. ACM.

[4] A. O. Hero. Geometric entropy minimization (GEM) for anomaly detection and localization. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 585–592. MIT Press, 2006.

[5] F. T. Liu, K. M. Ting, and Z. Zhou. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 413–422, Washington, DC, USA, 2008. IEEE Computer Society.

[6] C. Park, J. Z. Huang, and Y. Ding. A computable plug-in estimator of minimum volume sets for novelty detection. Operations Research, 58(5):1469–1480, 2010.

[7] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. SIGMOD Rec., 29:427–438, May 2000.

[8] D. M. Rocke and D. L. Woodruff. Identification of outliers in multivariate data. Journal of the American Statistical Association, 91(435):1047–1061, 1996.

[9] B. Schölkopf, R. Williamson, A. Smola, J. Shawe-Taylor, and J. Platt. Support vector method for novelty detection. In Advances in Neural Information Processing Systems, volume 12, 2000.

[10] C. Scott and R. Nowak. Learning minimum volume sets. J. Machine Learning Res., 7:665–704, 2006.

[11] K. Sricharan, R. Raich, and A. O. Hero. Empirical estimation of entropy functionals with confidence. ArXiv e-prints, December 2010.

[12] K. M. Ting, G. Zhou, T. F. Liu, and J. S. C. Tan. Mass estimation and its applications. 
In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, pages 989–998, New York, NY, USA, 2010. ACM.

[13] M. Zhao and V. Saligrama. Anomaly detection with score functions based on nearest neighbor graphs. Computing Research Repository, abs/0910.5461, 2009.
", "award": [], "sourceid": 354, "authors": [{"given_name": "Kumar", "family_name": "Sricharan", "institution": null}, {"given_name": "Alfred", "family_name": "Hero", "institution": null}]}