{"title": "Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic", "book": "Advances in Neural Information Processing Systems", "page_first": 1959, "page_last": 1967, "abstract": "The detection of anomalous activity in graphs is a statistical problem that arises in many applications, such as network surveillance, disease outbreak detection, and activity monitoring in social networks. Beyond its wide applicability, graph structured anomaly detection serves as a case study in the difficulty of balancing computational complexity with statistical power. In this work, we develop from first principles the generalized likelihood ratio test for determining if there is a well connected region of activation over the vertices in the graph in Gaussian noise. Because this test is computationally infeasible, we provide a relaxation, called the Lov\\'asz extended scan statistic (LESS) that uses submodularity to approximate the intractable generalized likelihood ratio. We demonstrate a connection between LESS and maximum a-posteriori inference in Markov random fields, which provides us with a poly-time algorithm for LESS. Using electrical network theory, we are able to control type 1 error for LESS and prove conditions under which LESS is risk consistent. Finally, we consider specific graph models, the torus, $k$-nearest neighbor graphs, and $\\epsilon$-random graphs. 
We show that on these graphs our results provide near-optimal performance by matching our results to known lower bounds.", "full_text": "Near-optimal Anomaly Detection in Graphs\n\nusing Lov\u00b4asz Extended Scan Statistic\n\nJames Sharpnack\n\nMachine Learning Department\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\njsharpna@gmail.com\n\nAkshay Krishnamurthy\n\nComputer Science Department\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nakshaykr@cs.cmu.edu\n\nAarti Singh\n\nMachine Learning Department\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\naarti@cs.cmu.edu\n\nAbstract\n\nThe detection of anomalous activity in graphs is a statistical problem that arises in\nmany applications, such as network surveillance, disease outbreak detection, and\nactivity monitoring in social networks. Beyond its wide applicability, graph struc-\ntured anomaly detection serves as a case study in the dif\ufb01culty of balancing com-\nputational complexity with statistical power. In this work, we develop from \ufb01rst\nprinciples the generalized likelihood ratio test for determining if there is a well\nconnected region of activation over the vertices in the graph in Gaussian noise.\nBecause this test is computationally infeasible, we provide a relaxation, called the\nLov\u00b4asz extended scan statistic (LESS) that uses submodularity to approximate the\nintractable generalized likelihood ratio. We demonstrate a connection between\nLESS and maximum a-posteriori inference in Markov random \ufb01elds, which pro-\nvides us with a poly-time algorithm for LESS. Using electrical network theory,\nwe are able to control type 1 error for LESS and prove conditions under which\nLESS is risk consistent. Finally, we consider speci\ufb01c graph models, the torus, k-\nnearest neighbor graphs, and \u01eb-random graphs. 
We show that on these graphs our\nresults provide near-optimal performance by matching our results to known lower\nbounds.\n\n1\n\nIntroduction\n\nDetecting anomalous activity refers to determining if we are observing merely noise (business as\nusual) or if there is some signal in the noise (anomalous activity). Classically, anomaly detection\nfocused on identifying rare behaviors and aberrant bursts in activity over a single data source or\nchannel. With the advent of large surveillance projects, social networks, and mobile computing,\ndata sources often are high-dimensional and have a network structure. With this in mind, statistics\nneeds to comprehensively address the detection of anomalous activity in graphs. In this paper, we\nwill study the detection of elevated activity in a graph with Gaussian noise.\n\nIn reality, very little is known about the detection of activity in graphs, despite a variety of real-world\napplications such as activity detection in social networks, network surveillance, disease outbreak de-\ntection, biomedical imaging, sensor network detection, gene network analysis, environmental moni-\ntoring and malware detection. Sensor networks might be deployed for detecting nuclear substances,\nwater contaminants, or activity in video surveillance. By exploiting the sensor network structure\n\n1\n\n\f(based on proximity), one can detect activity in networks when the activity is very faint. Recent\ntheoretical contributions in the statistical literature[1, 2] have detailed the inherent dif\ufb01culty of such\na testing problem but have positive results only under restrictive conditions on the graph topology.\nBy combining knowledge from high-dimensional statistics, graph theory and mathematical pro-\ngramming, the characterization of detection algorithms over any graph topology by their statistical\nproperties is possible.\n\nAside from the statistical challenges, the computational complexity of any proposed algorithms\nmust be addressed. 
Due to the combinatorial nature of graph based methods, problems can easily\nshift from having polynomial-time algorithms to having running times exponential in the size of\nthe graph. The applications of graph structured inference require that any method be scalable to\nlarge graphs. As we will see, the ideal statistical procedure will be intractable, suggesting that\napproximation algorithms and relaxations are necessary.\n\n1.1 Problem Setup\nConsider a connected, possibly weighted, directed graph G defined by a set of vertices V (|V | = p)\nand directed edges E (|E| = m), which are ordered pairs of vertices. Furthermore, the edges may be\nassigned weights, {We}e\u2208E, that determine the relative strength of the interactions of the adjacent\nvertices. For each vertex, i \u2208 V , we assume that there is an observation yi that has a Normal\ndistribution with mean xi and variance 1. This is called the graph-structured normal means problem,\nand we observe one realization of the random vector\n\ny = x + \u03be,\n\n(1)\n\nwhere x \u2208 Rp, \u03be \u223c N (0, Ip\u00d7p). The signal x will reflect the assumption that there is an active\ncluster (C \u2286 V ) in the graph, by making xi > 0 if i \u2208 C and xi = 0 otherwise. Furthermore,\nthe allowable clusters, C, must have a small boundary in the graph. Specifically, we assume that\nthere are parameters \u03c1, \u00b5 (possibly dependent on p) such that the class of graph-structured activation\npatterns x is given as follows:\n\nX = {x : x = (\u00b5/\u221a|C|) 1C, C \u2208 C},\n\nC = {C \u2286 V : out(C) \u2264 \u03c1}\n\nHere out(C) = \u03a3(u,v)\u2208E Wu,v I{u \u2208 C, v \u2208 \u00afC} is the total weight of edges leaving the cluster C.\nIn other words, the set of activated vertices C has a small cut size in the graph G. While we assume\nthat the noise variance is 1 in (1), this is equivalent to the more general model in which E\u03bei\u00b2 = \u03c3\u00b2\nwith \u03c3 known. 
If we wanted to consider known \u03c32 then we would apply all our algorithms to y/\u03c3\nand replace \u00b5 with \u00b5/\u03c3 in all of our statements. For this reason, we call \u00b5 the signal-to-noise ratio\n(SNR), and proceed with \u03c3 = 1.\n\nIn graph-structured activation detection we are concerned with statistically testing the null against\nthe alternative hypotheses,\n\nH0 : y \u223c N (0, I)\nH1 : y \u223c N (x, I), x \u2208 X\n\n(2)\n\nH0 represents business as usual (such as sensors returning only noise) while H1 encompasses all of\nthe foreseeable anomalous activity (an elevated group of noisy sensor observations). Let a test be a\nmapping T (y) \u2208 {0, 1}, where 1 indicates that we reject the null. It is imperative that we control\nboth the probability of false alarm, and the false acceptance of the null. To this end, we de\ufb01ne our\nmeasure of risk to be\n\nR(T ) = E0[T ] + sup\nx\u2208X\n\nEx[1 \u2212 T ]\n\nwhere Ex denote the expectation with respect to y \u223c N (x, I). These terms are also known as the\nprobability of type 1 and type 2 error respectively. This setting should not be confused with the\nBayesian testing setup (e.g. as considered in [2, 3]) where the patterns, x, are drawn at random.\nWe will say that H0 and H1 are asymptotically distinguished by a test, T , if in the setting of large\ngraphs, limp\u2192\u221e R(T ) = 0. If such a test exists then H0 and H1 are asymptotically distinguishable,\notherwise they are asymptotically indistinguishable (which occurs whenever the risk does not tend\nto 0). We will be characterizing regimes for \u00b5 in which our test asymptotically distinguishes H0\nfrom H1.\n\n2\n\n\fThroughout the study, let the edge-incidence matrix of G be \u2207 \u2208 Rm\u00d7p such that for e = (v, w) \u2208\nE, \u2207e,v = \u2212We, \u2207e,w = We and is 0 elsewhere. For directed graphs, vertex degrees refer to dv =\nout({v}). 
Let k.k denote the \u21132 norm, k.k1 be the \u21131 norm, and (x)+ be the positive components\nof the vector x. Let [p] = {1, . . . , p}, and we will be using the o notation, namely if non-negative\nsequences satisfy an/bn \u2192 0 then an = o(bn) and bn = \u03c9(an).\n1.2 Contributions\n\nSection 3 highlights what is known about the hypothesis testing problem 2, particularly we provide\na regime for \u00b5 in which H0 and H1 are asymptotically indistinguishable. In section 4.1, we derive\nthe graph scan statistic from the generalized likelihood ratio principle which we show to be a com-\nputationally intractable procedure. In section 4.2, we provide a relaxation of the graph scan statistic\n(GSS), the Lov\u00b4asz extended scan statistic (LESS), and we show that it can be computed with suc-\ncessive minimum s \u2212 t cut programs (a graph cut that separates a source vertex from a sink vertex).\nIn section 5, we give our main result, Theorem 5, that provides a type 1 error control for both test\nstatistics, relating their performance to electrical network theory. In section 6, we show that GSS\nand LESS can asymptotically distinguish H0 and H1 in signal-to-noise ratios close to the lowest\npossible for some important graph models. All proofs are in the Appendix.\n\n2 Related Work\n\nGraph structured signal processing. There have been several approaches to signal processing over\ngraphs. Markov random \ufb01elds (MRF) provide a succinct framework in which the underlying signal\nis modeled as a draw from an Ising or Potts model [4, 5]. We will return to MRFs in a later section,\nas it will relate to our scan statistic. A similar line of research is the use of kernels over graphs. The\nstudy of kernels over graphs began with the development of diffusion kernels [6], and was extended\nthrough Green\u2019s functions on graphs [7]. 
While these methods are used to estimate binary signals\n(where xi \u2208 {0, 1}) over graphs, little is known about their statistical properties and their use in\nsignal detection. To the best of our knowledge, this paper is the \ufb01rst connection made between\nanomaly detection and MRFs.\n\nNormal means testing. Normal means testing in high-dimensions is a well established and funda-\nmental problem in statistics. Much is known when H1 derives from a smooth function space such as\nBesov spaces or Sobolev spaces[8, 9]. Only recently have combinatorial structures such as graphs\nbeen proposed as the underlying structure of H1. A signi\ufb01cant portion of the recent work in this area\n[10, 3, 1, 2] has focused on incorporating structural assumptions on the signal, as a way to mitigate\nthe effect of high-dimensionality and also because many real-life problems can be represented as\ninstances of the normal means problem with graph-structured signals (see, for an example, [11]).\n\nGraph scan statistics. In spatial statistics, it is common, when searching for anomalous activity\nto scan over regions in the spatial domain, testing for elevated activity[12, 13]. There have been\nscan statistics proposed for graphs, most notably the work of [14] in which the authors scan over\nneighborhoods of the graphs de\ufb01ned by the graph distance. Other work has been done on the theory\nand algorithms for scan statistics over speci\ufb01c graph models, but are not easily generalizable to\narbitrary graphs [15, 1]. More recently, it has been found that scanning over all well connected\nregions of a graph can be computationally intractable, and so approximations to the intractable\nlikelihood-based procedure have been studied [16, 17]. 
We follow in this line of work, with a\nrelaxation to the intractable generalized likelihood ratio test.\n\n3 A Lower Bound and Known Results\n\nIn this section we highlight the previously known results about the hypothesis testing problem (2).\nThis problem was studied in [17], in which the authors demonstrated the following lower bound,\nwhich derives from techniques developed in [3].\n\nTheorem 1. [17] Hypotheses H0 and H1 de\ufb01ned in Eq. (2) are asymptotically indistinguishable if\n\n\u00b5 = o smin(cid:26) \u03c1\n\ndmax\n\nwhere dmax is the maximum degree of graph G.\n\nlog(cid:18) pd2\n\n\u03c12 (cid:19) ,\u221ap(cid:27)!\n\nmax\n\n3\n\n\fNow that a regime of asymptotic indistinguishability has been established, it is instructive to consider\ntest statistics that do not take the graph into account (viz. the statistics are unaffected by a change\nin the graph structure). Certainly, if we are in a situation where a naive procedure perform near-\noptimally, then our study is not warranted. As it turns out, there is a gap between the performance\nof the natural unstructured tests and the lower bound in Theorem 1.\nProposition 2. [17] (1) The thresholding test statistic, maxv\u2208[p] |yv|, asymptotically distinguishes\nH0 from H1 if \u00b5 = \u03c9(|C| log(p/|C|)).\n(2) The sum test statistic,Pv\u2208[p] yv, asymptotically distinguishes H0 from H1 if \u00b5 = \u03c9(p/|C|).\nAs opposed to these naive tests one can scan over all clusters in C performing individual likelihood\nratio tests. This is called the scan statistic, and it is known to be a computationally intractable\ncombinatorial optimization. Previously, two alternatives to the scan statistic have been developed:\nthe spectral scan statistic [16], and one based on the uniform spanning tree wavelet basis [17]. 
The\nformer is indeed a relaxation of the ideal, computationally intractable, scan statistic, but in many\nimportant graph topologies, such as the lattice, provides sub-optimal statistical performance. The\nuniform spanning tree wavelets in effect allows one to scan over a subclass of the class, C, but tends\nto provide worse performance (as we will see in section 6) than that presented in this work. The\ntheoretical results in [17] are similar to ours, but they suffer additional log-factors.\n\n4 Method\n\nAs we have noted the fundamental dif\ufb01culty of the hypothesis testing problem is the composite\nnature of the alternative hypothesis. Because the alternative is indexed by sets, C \u2208 C(\u03c1), with a\nlow cut size, it is reasonable that the test statistic that we will derive results from a combinatorial\noptimization program. In fact, we will show we can express the generalized likelihood ratio (GLR)\nstatistic in terms of a modular program with submodular constraints. This will turn out to be a\npossibly NP-hard program, as a special case of such programs is the well known knapsack problem\n[18]. With this in mind, we provide a convex relaxation, using the Lov\u00b4asz extension, to the ideal\nGLR statistic. This relaxation conveniently has a dual objective that can be evaluated with a binary\nMarkov random \ufb01eld energy minimization, which is a well understood program. We will reserve\nthe theoretical statistical analysis for the following section.\n\nSubmodularity. Before we proceed, we will introduce the reader to submodularity and the Lov\u00b4asz\nextension. (A very nice introduction to submodularity can be found in [19].) For any set, which we\n\nmay as well take to be the vertex set [p], we say that a function F : {0, 1}p \u2192 R is submodular\nif for any A, B \u2286 [p], F (A) + F (B) \u2265 F (A \u2229 B) + F (A \u222a B). (We will interchangeably use\nthe bijection between 2[p] and {0, 1}p de\ufb01ned by C \u2192 1C .) 
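The cut function out(\u00b7) from Section 1.1 is the running example of a submodular function in this paper. The following minimal numpy sketch (the graph, weights, and clusters are arbitrary toy choices for illustration) checks the submodular inequality for out(\u00b7), and checks the standard fact that its Lov\u00b4asz extension, computed by the sorting formula, coincides with k(\u2207x)+k1 for a suitably signed edge-incidence matrix:

```python
import numpy as np

# Small weighted directed graph on p = 4 vertices (arbitrary illustration).
p = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
W = np.array([1.0, 2.0, 0.5, 1.5, 1.0])

def out_cut(C):
    """out(C): total weight of edges leaving the vertex set C."""
    return sum(w for (u, v), w in zip(edges, W) if u in C and v not in C)

# Submodularity: F(A) + F(B) >= F(A | B) + F(A & B) for the cut function.
A, B = {0, 1}, {1, 2}
assert out_cut(A) + out_cut(B) >= out_cut(A | B) + out_cut(A & B) - 1e-12

# Edge-incidence matrix, with signs chosen here so that (grad @ 1_C)_+ picks
# out edges leaving C, i.e. ||(grad x)_+||_1 = out(C) at indicator vectors.
grad = np.zeros((len(edges), p))
for e, (u, v) in enumerate(edges):
    grad[e, u], grad[e, v] = W[e], -W[e]

def lovasz(F, x):
    """Lovasz extension of a set function F (with F(empty) = 0), via the
    sorting formula: sum over the descending order j_1, ..., j_p of x."""
    order = np.argsort(-x)
    val, prev = 0.0, 0.0
    for i in range(p):
        S = set(order[: i + 1])
        val += (F(S) - prev) * x[order[i]]
        prev = F(S)
    return val

rng = np.random.default_rng(0)
x = rng.random(p)
# The Lovasz extension of out(.) equals ||(grad x)_+||_1 on [0, 1]^p.
assert np.isclose(lovasz(out_cut, x), np.maximum(grad @ x, 0).sum())
# And it agrees with out(C) at indicator vectors, as in Proposition 3.
ind = np.zeros(p); ind[[0, 1]] = 1.0
assert np.isclose(lovasz(out_cut, ind), out_cut({0, 1}))
```

The equivalence in the last two assertions is what licenses replacing the combinatorial constraint out(C) \u2264 \u03c1 by a convex constraint over the hypercube, as done for the LESS in Section 4.2.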
In this way, a submodular function\nexperiences diminishing returns, as additions to large sets tend to be less dramatic than additions to\nsmall sets. But while this diminishing returns phenomenon is akin to concave functions, for opti-\nmization purposes submodularity acts like convexity, as it admits efficient minimization procedures.\n\nMoreover, for every submodular function there is a Lov\u00b4asz extension f : [0, 1]p \u2192 R defined in the\nfollowing way: for x \u2208 [0, 1]p let xji denote the ith largest element of x; then\n\nf (x) = xj1 F ({j1}) + \u03a3i=2,...,p (F ({j1, . . . , ji}) \u2212 F ({j1, . . . , ji\u22121})) xji\n\nThe class of submodular functions is similar to the class of convex functions in that it is closed under\naddition and non-negative scalar multiplication. The following facts about Lov\u00b4asz extensions will be important.\nProposition 3. [19] Let F be submodular and f be its Lov\u00b4asz extension. Then f is convex, f (x) =\nF (x) if x \u2208 {0, 1}p, and\n\nmin{F (x) : x \u2208 {0, 1}p} = min{f (x) : x \u2208 [0, 1]p}\n\nWe are now sufficiently prepared to develop the test statistics that will be the focus of this paper.\n\n4.1 Graph Scan Statistic\n\nIt is instructive, when faced with a class of probability distributions, indexed by subsets C \u2208 2[p],\nto think about what techniques we would use if we knew the correct set C \u2208 C (which is often\ncalled oracle information). One would in this case be only testing the null hypothesis H0 : x = 0\nagainst the simple alternative H1 : x \u221d 1C . 
In this situation, we would employ the likelihood\nratio test because by the Neyman-Pearson lemma it is the uniformly most powerful test statistic.\nThe maximum likelihood estimator for x is 1C 1\u22a4C y/|C| (the MLE of \u00b5 is 1\u22a4C y/\u221a|C|) and the\nlikelihood ratio turns out to be\n\nexp{\u2212(1/2)k1C 1\u22a4C y/|C| \u2212 yk\u00b2} / exp{\u2212(1/2)kyk\u00b2} = exp{(1\u22a4C y)\u00b2/(2|C|)}\n\nHence, the log-likelihood ratio is proportional to (1\u22a4C y)\u00b2/|C| and thresholding this at z\u00b2_{1\u2212\u03b1/2} gives\nus a size \u03b1 test.\n\nThis reasoning has been subject to the assumption that we had oracle knowledge of C. A\nnatural statistic, when C is unknown, is the generalized log-likelihood ratio (GLR) defined by\nmax (1\u22a4C y)\u00b2/|C| s.t. C \u2208 C. We will work with the graph scan statistic (GSS),\n\n\u02c6s = max 1\u22a4C y/\u221a|C| s.t. C \u2208 C(\u03c1) = {C : out(C) \u2264 \u03c1}\n\n(3)\n\nwhich is nearly equivalent to the GLR. (We can in fact evaluate \u02c6s for y and \u2212y, taking a maximum,\nand obtain the GLR, but statistically this is nearly the same.) Notice that there is no guarantee that\nthe program above is computationally feasible. In fact, it belongs to a class of programs, specifically\nmodular programs with submodular constraints, that is known to contain NP-hard instantiations,\nsuch as the ratio cut program and the knapsack program [18]. Hence, we are compelled to form a\nrelaxation of the above program, that will with luck provide a feasible algorithm.\n\n4.2 Lov\u00b4asz Extended Scan Statistic\n\nIt is common, when faced with combinatorial optimization programs that are computationally in-\nfeasible, to relax the domain from the discrete {0, 1}p to a continuous domain, such as [0, 1]p.\nGenerally, the hope is that optimizing the relaxation will approximate the combinatorial program\nwell. 
First we require that we can relax the constraint out(C) \u2264 \u03c1 to the hypercube [0, 1]p. This\nwill be accomplished by replacing it with its Lov\u00b4asz extension k(\u2207x)+k1 \u2264 \u03c1. We then form the\nrelaxed program, which we will call the Lov\u00b4asz extended scan statistic (LESS),\n\n\u02c6l = max t\u2208[p] max x x\u22a4y/\u221at s.t. x \u2208 X (\u03c1, t) = {x \u2208 [0, 1]p : k(\u2207x)+k1 \u2264 \u03c1, 1\u22a4x \u2264 t}\n\n(4)\n\nWe will find that not only can this be solved with a convex program, but the dual objective is a\nminimum binary Markov random field energy program. To this end, we will briefly go over binary\nMarkov random fields, which we will find can be used to solve our relaxation.\n\nBinary Markov Random Fields. Much of the previous work on graph structured statistical proce-\ndures assumes a Markov random field (MRF) model, in which there are discrete labels assigned to\neach vertex in [p], and the observed variables {yv}v\u2208[p] are conditionally independent given these\nlabels. Furthermore, the prior distribution on the labels is drawn according to an Ising model (if\nthe labels are binary) or a Potts model otherwise. The task is then to compute a Bayes rule from\nthe posterior of the MRF. The majority of the previous work assumes that we are interested in the\nmaximum a-posteriori (MAP) estimator, which is the Bayes rule for the 0/1-loss. This can generally\nbe written in the form,\n\nmin x\u2208{0,1}p \u03a3v\u2208[p] \u2212lv(xv|yv) + \u03a3v\u2260u\u2208[p] Wv,u I{xv \u2260 xu}\n\nwhere lv is a data dependent log-likelihood. Such programs are called graph-representable in [20],\nand are known to be solvable in the binary case with s-t graph cuts. Thus, by the min-cut max-flow\ntheorem the value of the MAP objective can be obtained by computing a maximum flow. 
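As a concrete illustration of this min-cut reduction (a sketch, not the authors' implementation; the vertex potentials and pairwise weights below are arbitrary toy values), the following builds the s-t graph for a binary Potts energy and checks, via a simple Edmonds-Karp max-flow, that the min-cut value equals the brute-force minimum energy:

```python
import itertools
from collections import defaultdict, deque

# Toy binary MRF: energy(x) = sum_v theta[v][x_v] + sum_{(u,v)} w * [x_u != x_v].
p = 4
theta = [(0.0, 2.0), (3.0, 0.5), (1.0, 1.0), (2.5, 0.0)]   # (cost of 0, cost of 1)
pairs = {(0, 1): 1.0, (1, 2): 2.0, (2, 3): 0.5}

def energy(x):
    return sum(theta[v][x[v]] for v in range(p)) + \
           sum(w for (u, v), w in pairs.items() if x[u] != x[v])

# s-t graph, with label 1 = "source side": cutting s->v (capacity theta[v][0])
# pays the cost of x_v = 0, cutting v->t (capacity theta[v][1]) pays x_v = 1,
# and each Potts term becomes a pair of opposite edges with capacity w.
S, T = p, p + 1
cap = defaultdict(float)
for v in range(p):
    cap[(S, v)] += theta[v][0]
    cap[(v, T)] += theta[v][1]
for (u, v), w in pairs.items():
    cap[(u, v)] += w
    cap[(v, u)] += w

def max_flow(capacities, s, t):
    """Edmonds-Karp: repeatedly augment along BFS paths in the residual graph."""
    cap = dict(capacities)
    adj = defaultdict(set)
    for (u, v) in list(cap):
        adj[u].add(v); adj[v].add(u)
        cap.setdefault((v, u), 0.0)
    flow = 0.0
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 1e-12:
                    parent[v] = u; q.append(v)
        if t not in parent:
            return flow
        path, v = [], t                      # recover the augmenting path
        while parent[v] is not None:
            path.append((parent[v], v)); v = parent[v]
        aug = min(cap[e] for e in path)      # bottleneck capacity
        for (u, v) in path:
            cap[(u, v)] -= aug; cap[(v, u)] += aug
        flow += aug

best = min(energy(x) for x in itertools.product((0, 1), repeat=p))
assert abs(max_flow(cap, S, T) - best) < 1e-9
```

The construction is valid because the Potts pairwise term is submodular (for nonnegative weights), which is exactly the graph-representability condition of [20].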
More\nrecently, a dual-decomposition algorithm has been developed in order to parallelize the computation\nof the MAP estimator for binary MRFs [21, 22].\n\nWe are now ready to state our result regarding the dual form of the LESS program, (4).\nProposition 4. Let \u03b70, \u03b71 \u2265 0, and de\ufb01ne the dual function of the LESS,\ny\u22a4x \u2212 \u03b701\u22a4x \u2212 \u03b71k\u2207xk0\n\ng(\u03b70, \u03b71) = max\n\nx\u2208{0,1}p\n\n5\n\n\fThe LESS estimator is equal to the following minimum of convex optimizations\n\n\u02c6l = max\nt\u2208[p]\n\n1\n\u221at\n\nmin\n\n\u03b70,\u03b71\u22650\n\ng(\u03b70, \u03b71) + \u03b70t + \u03b71\u03c1\n\ng(\u03b70, \u03b71) is the objective of a MRF MAP problem, which is poly-time solvable with s-t graph cuts.\n\n5 Theoretical Analysis\n\nSo far we have developed a lower bound to the hypothesis testing problem, shown that some com-\nmon detectors do not meet this guarantee, and developed the Lov\u00b4asz extended scan statistic from\n\ufb01rst principles. We will now provide a thorough statistical analysis of the performance of LESS.\nPreviously, electrical network theory, speci\ufb01cally the effective resistances of edges in the graph,\nhas been useful in describing the theoretical performance of a detector derived from uniform span-\nning tree wavelets [17]. As it turns out the performance of LESS is also dictated by the effective\nresistances of edges in the graph.\n\nEffective Resistance. Effective resistances have been extensively studied in electrical network the-\nory [23]. We de\ufb01ne the combinatorial Laplacian of G to be \u2206 = D \u2212 W (Dv,v = out({v}) is the\ndiagonal degree matrix). A potential difference is any z \u2208 R|E| such that it satis\ufb01es Kirchoff \u2019s poten-\ntial law: the total potential difference around any cycle is 0. Algebraically, this means that \u2203x \u2208 Rp\nsuch that \u2207x = z. 
The Dirichlet principle states that any solution to the following program gives\nan absolute potential x that satisfies Kirchhoff\u2019s law:\n\nmin x x\u22a4\u2206x s.t. xS = vS\n\nfor source/sinks S \u2282 [p] and some voltage constraints vS \u2208 R|S|. By Lagrangian calculus, the\nsolution to the above program is given by x = \u2206\u2020v where v is 0 over the complement of S and vS over S, and \u2020\nindicates the Moore-Penrose pseudoinverse. The effective resistance between a source v \u2208 V and\na sink w \u2208 V is the potential difference required to create a unit flow between them. Hence, the\neffective resistance between v and w is rv,w = (\u03b4v \u2212 \u03b4w)\u22a4\u2206\u2020(\u03b4v \u2212 \u03b4w), where \u03b4v is the Dirac delta\nfunction. There is a close connection between effective resistances and random spanning trees. The\nuniform spanning tree (UST) is a random spanning tree, chosen uniformly at random from the set of\nall distinct spanning trees. The foundational Matrix-Tree theorem [24, 23] states that the probability\nof an edge, e, being included in the UST is equal to the edge weight times the effective resistance,\nWe re. The UST is an essential component of the proof of our main theorem, in that it provides a\nmechanism for unravelling the graph while still preserving the connectivity of the graph.\n\nWe are now in a position to state the main theorem, which will allow us to control the type 1 error\n(the probability of false alarm) of both the GSS and its relaxation the LESS.\n\nTheorem 5. Let rC = max{\u03a3(u,v)\u2208E:u\u2208C Wu,v r(u,v) : C \u2208 C} be the maximum effective\nresistance of the boundary of a cluster C. The following statements hold under the null hypothesis\nH0 : x = 0:\n\n1. The graph scan statistic, with probability at least 1 \u2212 \u03b1, is smaller than\n\n\u02c6s \u2264 (\u221arC + \u221a((1/2) log p)) \u221a(2 log(p \u2212 1)) + \u221a(2 log 2) + \u221a(2 log(1/\u03b1))\n\n2. 
The Lov\u00b4asz extended scan statistic, with probability at least 1 \u2212 \u03b1 is smaller than\n\n\u02c6l \u2264\n\nlog(2p) + 1\n\nr(cid:16)\u221arC +q 1\n\n2 log p(cid:17)\n\n2\n\nlog p\n\n+ 2vuut \u221arC +r 1\n\n2\n\nlog p!2\n\nlog p\n\n+p2 log p +p2 log(1/\u03b1)\n\n(5)\n\n(6)\n\nThe implication of Theorem 5 is that the size of the test may be controlled at level \u03b1 by selecting\nthresholds given by (5) and (6) for GSS and LESS respectively. Notice that the control provided\nfor the LESS is not signi\ufb01cantly different from that of the GSS. This is highlighted by the following\nCorollary, which combines Theorem 5 with a type 2 error bound to produce an information theoretic\nguarantee for the asymptotic performance of the GSS and LESS.\n\n6\n\n\fCorollary 6. Both the GSS and the LESS asymptotically distinguish H0 from H1 if\n\n\u00b5\n\u03c3\n\n= \u03c9(cid:16)max{prC log p, log p}(cid:17)\n\nTo summarize we have established that the performance of the GSS and the LESS are dictated by\nthe effective resistances of cuts in the graph. While the condition in Cor. 6 may seem mysterious,\nthe guarantee in fact nearly matches the lower bound for many graph models as we now show.\n\n6 Speci\ufb01c Graph Models\n\nTheorem 5 shows that the effective resistance of the boundary plays a critical role in characterizing\nthe distinguishability region of both the the GSS and LESS. On speci\ufb01c graph families, we can\ncompute the effective resistances precisely, leading to concrete detection guarantees that we will see\nnearly matches the lower bound in many cases. Throughout this section, we will only be working\nwith undirected, unweighted graphs.\n\nRecall that Corollary 6 shows that an SNR of \u03c9(cid:0)\u221arC log p(cid:1) is suf\ufb01cient while Theorem 1 shows\nthat \u2126(cid:16)p\u03c1/dmax log p(cid:17) is necessary for detection. Thus if we can show that rC \u2248 \u03c1/dmax, we\n\nwould establish the near-optimality of both the GSS and LESS. 
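Effective resistances as defined in Section 5 can be computed directly from the Laplacian pseudoinverse. This numpy sketch (a path graph chosen arbitrarily for illustration) checks that every edge of a tree has unit resistance, that resistances add in series, and that the edge resistances sum to p \u2212 1:

```python
import numpy as np

# Unweighted path graph on p = 4 vertices: 0 - 1 - 2 - 3.
p = 4
edges = [(0, 1), (1, 2), (2, 3)]

# Combinatorial Laplacian Delta = D - W and its Moore-Penrose pseudoinverse.
W = np.zeros((p, p))
for u, v in edges:
    W[u, v] = W[v, u] = 1.0
Delta = np.diag(W.sum(axis=1)) - W
Dpinv = np.linalg.pinv(Delta)

def resistance(v, w):
    """r_{v,w} = (delta_v - delta_w)^T Delta^+ (delta_v - delta_w)."""
    d = np.zeros(p)
    d[v], d[w] = 1.0, -1.0
    return d @ Dpinv @ d

# Every edge of a tree has effective resistance 1 ...
assert all(np.isclose(resistance(u, v), 1.0) for u, v in edges)
# ... resistances add in series: r_{0,3} = 3 on the path ...
assert np.isclose(resistance(0, 3), 3.0)
# ... and the edge resistances sum to p - 1 (Foster's theorem).
assert np.isclose(sum(resistance(u, v) for u, v in edges), p - 1)
```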
Foster\u2019s theorem lends evidence to\nthe fact that the effective resistances should be much smaller than the cut size:\n\nTheorem 7. (Foster\u2019s Theorem [25, 26])\n\nre = p \u2212 1\n\nXe\u2208E\n\nRoughly speaking, the effective resistance of an edge selected uniformly at random is \u2248 (p\u22121)/m =\nd\u22121\nave so the effective resistance of a cut is \u2248 \u03c1/dave. This intuition can be formalized for speci\ufb01c\nmodels and this improvement by the average degree bring us much closer to the lower bound.\n\n6.1 Edge Transitive Graphs\n\nAn edge transitive graph, G, is one for which there is a graph automorphism mapping e0 to e1 for any\npair of edges e0, e1. Examples include the l-dimensional torus, the cycle, and the complete graph\nKp. The existence of these automorphisms implies that every edge has the same effective resistance,\nand by Foster\u2019s theorem, we know that these resistances are exactly (p \u2212 1)/m. Moreover, since\nedge transitive graphs must be d-regular, we know that m = \u0398(pd) so that re = \u0398(1/d). Thus as\na corollary to Theorem 5 we have that both the GSS and LESS are near-optimal (optimal modulo\nlogarithmic factors whenever \u03c1/d \u2264 \u221ap) on edge transitive graphs:\n\nCorollary 8. Let G be an edge-transitive graph with common degree d. Then both the GSS and\nLESS distinguish H0 from H1 provided that:\n\n6.2 Random Geometric Graphs\n\n\u00b5 = \u03c9(cid:16)max{p\u03c1/d log p, log p}(cid:17)\n\nAnother popular family of graphs are those constructed from a set of points in RD drawn according\nto some density. These graphs have inherent randomness stemming from sampling of the density,\nand thus earn the name random geometric graphs. The two most popular such graphs are symmetric\nk-nearest neighbor graphs and \u01eb-graphs. We characterize the distinguishability region for both.\nIn both cases, a set of points z1, . . . , zp are drawn i.i.d. from a density f support over RD, or a subset\nof RD. 
Our results require mild regularity conditions on f , which, roughly speaking, require that\nsupp(f ) is topologically equivalent to the cube and has density bounded away from zero (See [27]\nfor a precise de\ufb01nition). To form a k-nearest neighbor graph Gk, we associate each vertex i with a\npoint zi and we connect vertices i, j if zi is amongst the k-nearest neighbors, in \u21132, of zj or vice\nversa. In the the \u01eb-graph, G\u01eb we connect vertices i, j if ||zi, zj|| \u2264 \u01eb for some metric \u03c4 .\nThe relationship re \u2248 1/d, which we used for edge-transitive graphs, was derived in Corollaries 8\nand 9 in [27] The precise concentration arguments, which have been done before [17], lead to the\nfollowing corollary regarding the performance of the GSS and LESS on random geometric graphs:\n\n7\n\n\fFigure 1: A comparison of detection procedures: spectral scan statistic (SSS), UST wavelet detector\n\n(Wavelet), and LESS. The graphs used are the square 2D Torus, kNN graph (k \u2248 p1/4), and \u01eb-graph\n(with \u01eb \u2248 p\u22121/3); with \u00b5 = 4, 4, 3 respectively, p = 225, and |C| \u2248 p1/2.\nCorollary 9. Let Gk be a k-NN graph with k/p \u2192 0, k(k/p)2/D \u2192 \u221e and suppose the density\nf meets the regularity conditions in [27]. Then both the GSS and LESS distinguish H0 from H1\nprovided that:\n\nIf G\u01eb is an \u01eb-graph with \u01eb \u2192 0, n\u01ebD+2 \u2192 \u221e then both distinguish H0 from H1 provided that:\n\n\u00b5 = \u03c9(cid:16)max{p\u03c1/k log p, log p}(cid:17)\n\u00b5 = \u03c9(cid:18)max(cid:26)r \u03c1\np\u01ebD log p, log p(cid:27)(cid:19)\n\nThe corollary follows immediately form Corollary 6 and the proofs in [17]. 
Since under the regu-\nlarity conditions, the maximum degree is \u0398(k) and \u0398(p\u01ebD) in k-NN and \u01eb-graphs respectively, the\n\ncorollary establishes the near optimality (again provided that \u03c1/d \u2264 \u221ap) of both test statistics.\n\nWe performed some experiments using the MRF based algorithm outlined in Prop. 4. Each exper-\niment is made with graphs with 225 vertices, and we report the true positive rate versus the false\npositive rate as the threshold varies (also known as the ROC.) For each graph model, LESS provides\ngains over the spectral scan statistic[16] and the UST wavelet detector[17], each of the gains are\nsigni\ufb01cant except for the \u01eb-graph which is more modest.\n\n7 Conclusions\n\nTo summarize, while Corollary 6 characterizes the performance of GSS and LESS in terms of ef-\nfective resistances, in many speci\ufb01c graph models, this can be translated into near-optimal detection\nguarantees for these test statistics. We have demonstrated that the LESS provides guarantees similar\nto that of the computationally intractable generalized likelihood ratio test (GSS). Furthermore, the\nLESS can be solved through successive graph cuts by relating it to MAP estimation in an MRF.\nFuture work includes using these concepts for localizing the activation, making the program robust\nto missing data, and extending the analysis to non-Gaussian error.\n\nAcknowledgments\n\nThis research is supported in part by AFOSR under grant FA9550-10-1-0382 and NSF under grant\nIIS-1116458. AK is supported in part by a NSF Graduate Research Fellowship. We would like to\nthank Sivaraman Balakrishnan for his valuable input in the theoretical development of the paper.\n\nReferences\n\n[1] E. Arias-Castro, E.J. Candes, and A. Durand. Detection of an anomalous cluster in a network. The Annals\n\nof Statistics, 39(1):278\u2013304, 2011.\n\n[2] L. Addario-Berry, N. Broutin, L. Devroye, and G. Lugosi. On combinatorial testing problems. 
The Annals of Statistics, 38(5):3063–3092, 2010.

[3] E. Arias-Castro, E.J. Candes, H. Helgason, and O. Zeitouni. Searching for a trail of evidence in a maze. The Annals of Statistics, 36(4):1726–1757, 2008.

[4] V. Cevher, C. Hegde, M.F. Duarte, and R.G. Baraniuk. Sparse signal recovery using Markov random fields. Technical report, DTIC Document, 2009.

[5] P. Ravikumar and J.D. Lafferty. Quadratic programming relaxations for metric labeling and Markov random field MAP estimation. In Proceedings of the 23rd International Conference on Machine Learning, 2006.

[6] R.I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 315–322, 2002.

[7] A. Smola and R. Kondor. Kernels and regularization on graphs. Learning Theory and Kernel Machines, pages 144–158, 2003.

[8] Y.I. Ingster. Minimax testing of nonparametric hypotheses on a distribution density in the l_p metrics. Theory of Probability and its Applications, 31:333, 1987.

[9] Y.I. Ingster and I.A. Suslina. Nonparametric Goodness-of-Fit Testing under Gaussian Models, volume 169. Springer Verlag, 2003.

[10] E. Arias-Castro, D. Donoho, and X. Huo. Near-optimal detection of geometric objects by fast multiscale methods. IEEE Transactions on Information Theory, 51(7):2402–2425, 2005.

[11] L. Jacob, P. Neuvial, and S. Dudoit. Gains in power from structured two-sample tests of means on graphs. arXiv preprint arXiv:1009.5173, 2010.

[12] Daniel B. Neill and Andrew W. Moore. Rapid detection of significant spatial clusters. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 256–265. ACM, 2004.

[13] Deepak Agarwal, Andrew McGregor, Jeff M. Phillips, Suresh Venkatasubramanian, and Zhengyuan Zhu. Spatial scan statistics: approximations and performance study.
In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 24–33. ACM, 2006.

[14] Carey E. Priebe, John M. Conroy, David J. Marchette, and Youngser Park. Scan statistics on Enron graphs. Computational & Mathematical Organization Theory, 11(3):229–247, 2005.

[15] Chih-Wei Yi. A unified analytic framework based on minimum scan statistics for wireless ad hoc and sensor networks. IEEE Transactions on Parallel and Distributed Systems, 20(9):1233–1245, 2009.

[16] J. Sharpnack, A. Rinaldo, and A. Singh. Changepoint detection over graphs with the spectral scan statistic. arXiv preprint arXiv:1206.0773, 2012.

[17] James Sharpnack, Akshay Krishnamurthy, and Aarti Singh. Detecting activations over graphs using spanning tree wavelet bases. arXiv preprint arXiv:1206.0937, 2012.

[18] Christos H. Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Courier Dover Publications, 1998.

[19] Francis Bach. Convex analysis and optimization with submodular functions: a tutorial. arXiv preprint arXiv:1010.4207, 2010.

[20] Vladimir Kolmogorov and Ramin Zabih. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):147–159, 2004.

[21] Petter Strandmark and Fredrik Kahl. Parallel and distributed graph cuts by dual decomposition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2085–2092. IEEE, 2010.

[22] David Sontag, Amir Globerson, and Tommi Jaakkola. Introduction to dual decomposition for inference. Optimization for Machine Learning, 1, 2011.

[23] R. Lyons and Y. Peres. Probability on Trees and Networks. Book in preparation, 2000.

[24] G. Kirchhoff. Ueber die Auflösung der Gleichungen, auf welche man bei der Untersuchung der linearen Vertheilung galvanischer Ströme geführt wird.
Annalen der Physik, 148(12):497–508, 1847.

[25] R.M. Foster. The average impedance of an electrical network. Contributions to Applied Mechanics (Reissner Anniversary Volume), pages 333–340, 1949.

[26] P. Tetali. Random walks and the effective resistance of networks. Journal of Theoretical Probability, 4(1):101–109, 1991.

[27] Ulrike von Luxburg, Agnes Radl, and Matthias Hein. Hitting and commute times in large graphs are often misleading. ReCALL, 2010.

[28] R. Tyrrell Rockafellar. Convex Analysis, volume 28. Princeton University Press, 1997.

[29] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 2001.

[30] Wai Shing Fung and Nicholas J.A. Harvey. Graph sparsification by edge-connectivity and random spanning trees. arXiv preprint arXiv:1005.0265, 2010.

[31] Michel Ledoux. The Concentration of Measure Phenomenon, volume 89. American Mathematical Society, 2001.
", "award": [], "sourceid": 992, "authors": [{"given_name": "James", "family_name": "Sharpnack", "institution": "CMU"}, {"given_name": "Akshay", "family_name": "Krishnamurthy", "institution": "CMU"}, {"given_name": "Aarti", "family_name": "Singh", "institution": "CMU"}]}