{"title": "Identifying graph-structured activation patterns in networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2137, "page_last": 2145, "abstract": "We consider the problem of identifying an activation pattern in a complex, large-scale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of high-dimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size.", "full_text": "Identifying graph-structured activation patterns in\n\nnetworks\n\nMachine Learning Department, Statistics Department\n\nJames Sharpnack\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\njsharpna@andrew.cmu.edu\n\nAarti Singh\n\nMachine Learning Department\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\naartisingh@cmu.edu\n\nAbstract\n\nWe consider the problem of identifying an activation pattern in a complex, large-\nscale network that is embedded in very noisy measurements. 
This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task, especially if the network is large (the pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of high-dimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size.

1 Introduction

The problem of identifying high-dimensional activation patterns embedded in noise is important for applications such as contamination monitoring by a sensor network, determining the set of differentially expressed genes, and anomaly detection in networks. Formally, we consider the problem of identifying a pattern corrupted by noise that is observed at the p nodes of a network:

y_i = x_i + ζ_i,   i ∈ [p] = {1, . . . , p}   (1)

Here y_i denotes the observation at node i, x = [x_1, . . . , x_p] ∈ R^p (or {0, 1}^p) is the p-dimensional unknown continuous (or binary) activation pattern, and the noise ζ_i iid∼ N(0, σ²), the Gaussian distribution with mean zero and variance σ².
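A minimal simulation of the observation model in Eq. (1); the pattern, network size, and noise level below are illustrative choices, not taken from the paper:

```python
import numpy as np

def observe(x, sigma, rng):
    """Eq. (1): y_i = x_i + zeta_i with zeta_i iid N(0, sigma^2)."""
    return x + sigma * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
p, sigma = 100, 2.0                            # illustrative network size and noise level
x = rng.integers(0, 2, size=p).astype(float)   # a binary activation pattern in {0, 1}^p
y = observe(x, sigma, rng)                     # noisy per-node measurements
```

At this noise level the per-node signal is far below the noise standard deviation, which is precisely the regime the paper targets.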
This problem is particularly challenging when the network is large-scale, and hence x is a high-dimensional pattern embedded in heavy noise. Classical approaches to this problem in the signal processing and statistics literature involve either thresholding the measurements at every node, or in the discrete case, matching the observed noisy measurements with all possible patterns (also known as the scan statistic). The first approach does not work well when the noise level is too high, rendering the per-node activity statistically insignificant. In this case, multiple hypothesis testing effects imply that the noise variance needs to decrease as the number of nodes p increases [10, 1] to enable consistent mean square error (MSE) recovery. The second approach based on the scan statistic is computationally infeasible in high-dimensional settings as the number of discrete patterns scales exponentially (≥ 2^p) in the number of dimensions p.

In practice, network activation patterns tend to be structured due to statistical dependencies in the network activation process. Thus, it is possible to recover activation patterns in a computationally and statistically efficient manner in noisy high-dimensional settings by leveraging the structure of the dependencies between node measurements. In this paper, we study the limits of MSE recovery of high-dimensional, graph-structured noisy patterns.

Figure 1: Threshold of noise variance below which consistent MSE recovery of network activation patterns is possible. If the activation is independent at each node, noise variance needs to decrease as network size p increases (in blue). If dependencies in the activation process are harnessed, noise variance can increase as p^γ where 0 < γ < 1 depends on network interactions (in red).
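The per-node thresholding baseline discussed above is easy to state in code. The sketch below (threshold, sizes, and noise levels are illustrative choices) shows why it degrades once the noise swamps the per-node signal:

```python
import numpy as np

def threshold_estimate(y, tau=0.5):
    """Naive per-node rule: declare node i active iff y_i > tau.

    This ignores all network structure, so every one of the p
    per-node tests must be individually reliable -- which fails
    once the noise level sigma is large.
    """
    return (y > tau).astype(float)

rng = np.random.default_rng(1)
p = 1000
x = np.zeros(p)
x[:50] = 1.0                                   # 50 active nodes

y_mild = x + 0.1 * rng.standard_normal(p)      # mild noise
y_heavy = x + 3.0 * rng.standard_normal(p)     # heavy noise

mse_mild = np.mean((threshold_estimate(y_mild) - x) ** 2)
mse_heavy = np.mean((threshold_estimate(y_heavy) - x) ** 2)
```

Under mild noise the per-node rule is essentially error-free, while under heavy noise close to half the nodes are misclassified, consistent with the multiple-testing limitation described above.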
Speci\ufb01cally, we assume that the patterns x\nare generated from a probabilistic model, either Gaussian graphical model (GGM) or Ising (binary),\nbased on a general graph structure G(V, E), where V denotes the p vertices and E denotes the edges.\n\nGaussian graphical model:\nIsing model:\n\np(x) \u221d exp(\u2212xT \u03a3\u22121x)\np(x) \u221d exp\n\n(cid:16)\u2212(cid:80)\n\n(i,j)\u2208E Wij(xi \u2212 xj)2(cid:17) \u221d exp(\u2212xT Lx)\n\n(2)\n\nmatrix and D is the diagonal matrix of node degrees di =(cid:80)\n\nIn the Ising model, L = D \u2212 W denotes the graph Laplacian, where W is the weighted adjacency\nj:(i,j)\u2208E Wij. In the Gaussian graphical\nmodel, L = \u03a3\u22121 denotes the inverse covariance matrix whose zero entries indicate the absence of an\nedge between the corresponding nodes in the graph. The graphical model implies that all patterns are\nnot equally likely to occur in the network. Patterns in which the values of nodes that are connected\nby an edge agree are more likely, the likelihood being determined by the weights Wij of the edges.\nThus, the graph structure dictates the statistical dependencies in network measurements. We assume\nthat this graph structure is known, either because it corresponds to the physical topology of the\nnetwork or it can be learnt using network measurements [18, 25].\nIn this paper, we are concerned with the following problem: What is the largest amount of noise that\ncan be tolerated, as a function of the graph and parameters of the model, while allowing for con-\nsistent reconstruction of graph-structured network activation patterns? If the activations at network\nnodes are independent of each other, the noise variance (\u03c32) must decrease with network size p to\nensure consistent MSE recovery [10, 1]. We show that by exploiting the network dependencies, it\nis possible to consistently recover high-dimensional patterns when the noise variance is much larger\n(can grow with the network size p). 
See Figure 1.

We characterize the learnability of graph-structured patterns based on the eigenspectrum of the network. To this end, we propose using an estimator based on thresholding the projection of the network measurements onto the graph Laplacian eigenvectors. This is motivated by the fact that in the Ising model, unlike the GGM, the Bayes rule and its risk have no known closed form. Our results indicate that the noise threshold is determined by the eigenspectrum of the Laplacian. For the GGM this procedure reduces to PCA and the noise threshold depends on the eigenvalues of the covariance matrix, as expected. We show that for simple graph structures, such as hierarchical or lattice graphs, as well as the random Erdős-Rényi graph, the noise threshold can grow with the network size p. Thus, leveraging the structure of network interactions can enable extraction of high-dimensional patterns embedded in heavy noise.

This paper is organized as follows. We discuss related work in Section 2. Limits of MSE recovery for graph-structured patterns are investigated in Section 3 for the binary Ising model, and in Section 4 for the Gaussian graphical model. In Section 5, we analyze the noise threshold for some simple deterministic and random graph structures. Simulation results are presented in Section 6, and concluding discussion in Section 7. Proof sketches are included in the Appendix.

2 Related work

Given a prior, the Bayes optimal estimators are known to be the posterior mean under MSE, the Maximum A Posteriori (MAP) rule under 0/1 loss, and the posterior centroid under Hamming loss [8]. However, these estimators and their corresponding risks (expected loss) have no closed form for the Ising graphical model and are intractable to analyze. The estimator we propose based on the graph Laplacian eigenbasis is both easy to compute and analyze.
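A sketch of this eigenbasis projection estimator, alongside the closed-form GGM posterior mean (Eq. (4) in Section 4), on an illustrative chain graph. The graph, the choice of k, and the noise level are arbitrary demonstration choices, not the paper's experimental setup:

```python
import numpy as np

def laplacian_chain(p):
    """Unweighted chain graph on p nodes: L = D - W."""
    W = np.zeros((p, p))
    i = np.arange(p - 1)
    W[i, i + 1] = W[i + 1, i] = 1.0
    return np.diag(W.sum(axis=1)) - W

def eigenmap_estimator(y, L, k):
    """x_hat = U_[k] U_[k]^T y: keep the k Laplacian eigenvectors with
    smallest eigenvalues (np.linalg.eigh sorts eigenvalues ascending)."""
    _, U = np.linalg.eigh(L)
    Uk = U[:, :k]
    return Uk @ (Uk.T @ y)

def ggm_posterior_mean(y, L, sigma):
    """Bayes rule under the GGM prior: (2 sigma^2 L + I)^{-1} y."""
    return np.linalg.solve(2 * sigma**2 * L + np.eye(L.shape[0]), y)

rng = np.random.default_rng(2)
p, sigma = 200, 1.0
L = laplacian_chain(p)
x = np.where(np.arange(p) < p // 2, 1.0, 0.0)  # smooth, piecewise-constant pattern
y = x + sigma * rng.standard_normal(p)

mse_raw = np.mean((y - x) ** 2)                               # ~ sigma^2
mse_eig = np.mean((eigenmap_estimator(y, L, k=10) - x) ** 2)  # projection estimator
mse_post = np.mean((ggm_posterior_mean(y, L, sigma) - x) ** 2)
```

Because the pattern is smooth with respect to the chain, almost all of its energy lies in the low-eigenvalue subspace, and both structured estimators improve substantially on the raw measurements.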
Eigenbasis of the graph Laplacian\nhas been successfully used for problems, such as clustering [20, 24], dimensionality reduction [5],\nand semi-supervised learning [4, 3]. The work on graph and manifold regularization [4, 3, 23, 2] is\nclosely related and assumes that the function of interest is smooth with respect to the graph, which is\nessentially equivalent to assuming a graphical model prior of the form in Eq. (2). However, the use\nof graph Laplacian is theoretically justi\ufb01ed mainly in the embedded setting [6, 21], where the data\npoints are sampled from or near a low-dimensional manifold, and the graph weights are the distances\nbetween two points as measured by some kernel. To the best of our knowledge, no previous work\nstudies the noise threshold for consistent MSE recovery of arbitrary graph-structured patterns.\nThere have been several attempts at constructing multi-scale basis for graphs that can ef\ufb01ciently rep-\nresent localized activation patterns, notably diffusion wavelets [9] and treelets [17], however their\napproximation capabilities are not well understood. More recently, [22] and [14] independently pro-\nposed unbalanced Haar wavelets and characterized their approximation properties for tree-structured\nbinary patterns. We argue in Section 5.1 that the unbalanced Haar wavelets are a special instance of\ngraph Laplacian eigenbasis when the underlying graph is hierarchical. On the other hand, a lattice\ngraph structure yields activations that are globally supported and smooth, and in this case the Lapla-\ncian eigenbasis corresponds to the Fourier transform (see Section 5.2). Thus, the graph Laplacian\neigenbasis provides an ef\ufb01cient representation for patterns whose structure is governed by the graph.\n3 Denoising binary graph-structured patterns\nThe binary Ising model is essentially a discrete version of the GGM, however, the Bayes rule and\nrisk for the Ising model have no known closed form. 
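The Ising prior of Eq. (2) weights patterns through the Laplacian quadratic form. The identity Σ_{(i,j)∈E} W_ij (x_i − x_j)² = x^T L x behind it can be checked numerically; the 4-node weighted graph below is an illustrative example, not from the paper:

```python
import numpy as np

def laplacian(W):
    """Graph Laplacian L = D - W for a symmetric weighted adjacency W."""
    return np.diag(W.sum(axis=1)) - W

# An illustrative weighted graph on 4 nodes.
W = np.array([[0., 2., 0., 1.],
              [2., 0., 3., 0.],
              [0., 3., 0., 1.],
              [1., 0., 1., 0.]])
L = laplacian(W)

x = np.array([1., 1., 0., 0.])   # a binary pattern in {0, 1}^4

# Sum over unordered edges of W_ij (x_i - x_j)^2 (the 1/2 undoes the
# double count over ordered pairs) ...
edge_energy = 0.5 * np.sum(W * (x[:, None] - x[None, :]) ** 2)
# ... equals the Laplacian quadratic form x^T L x.
quad_form = x @ L @ x
```

Patterns that agree across heavily weighted edges have low energy and hence high prior probability; the constant pattern has zero energy, reflecting L1 = 0.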
For binary graph-structured patterns drawn\nfrom an Ising prior, we suggest a different estimator based on projections onto the graph Laplacian\neigenbasis. Let the graph Laplacian L have spectral decomposition, L = U\u039bUT , and denote the\n\ufb01rst k eigenvectors (corresponding to the smallest eigenvalues) of L by U[k]. De\ufb01ne the estimator\n(3)\nwhich is a hard thresholding of the projection of network measurements y = [y1, . . . , yp] onto the\ngraph Laplacian eigenbasis. The following theorem bounds the MSE of this estimator.\nTheorem 1. The Bayes MSE of the estimator in Eq. (3) for the observation model in Eq. (1), when\nthe binary activation patterns are drawn from the Ising prior of Eq. (2) is bounded as\n\n(cid:98)xk = U[k]UT\n\n[k]y,\n\nE[(cid:107)(cid:98)xk \u2212 x(cid:107)2] \u2264 min\n\nRB :=\n\n1\np\n\n(cid:18)\n\n(cid:19)\n\n1,\n\n\u03b4\n\n\u03bbk+1\n\n+\n\nk\u03c32\np\n\n+ e\u2212p\n\nwhere 0 < \u03b4 < 2 is a constant and \u03bbk+1 is the (k + 1)th smallest eigenvalue of L.\n\nThrough this bias-variance decomposition, we see the eigenspectrum of the graph Laplacian deter-\nmines a bound on the MSE for binary graph-structured activations. In practice, k can be chosen\nusing FDR[1] in the eigendomain or cross-validation.\n\nRemark: Consider the binarized estimator (cid:98)x(cid:48)\nE[dH ((cid:98)x(cid:48), x)] = MSE((cid:98)x(cid:48)) \u2264 4MSE((cid:98)x), by the triangle inequality.\n\ni = 1(cid:98)xi>1/2, i \u2208 [p]. Then the results of Theo-\n\nrem 1 also provide an upper bound on the expected Hamming distance of this new estimator since\n\n4 Denoising Gaussian graph-structured patterns\nIf the network activation patterns are generated by a Gaussian graphical model, it is easy to see that\nthe eigenvalues of the Laplacian (inverse covariance) determine the MSE decay. Consider the GGM\nprior as in Eq. 
(2), then the posterior distribution is\n\nx|y \u223c N(cid:16)\n\n(2\u03c32L + I)\u22121y,(cid:0)2L + \u03c3\u22122I(cid:1)\u22121(cid:17)\n\n(cid:80)\n(4)\nwhere I is the identity matrix. The posterior mean is the Bayes optimal estimator with Bayes MSE,\ni\u2208[p](2\u03bbi + \u03c3\u22122)\u22121, where {\u03bbi}i\u2208[p] are the ordered eigenvalues of L. For the GGM, we obtain\n1\np\na result similar to Theorem 1 for the sake of bounding the performance of the Bayes rule.\n\n,\n\n3\n\n\fFigure 2: Weight matrices corresponding to hierarchical dependencies between node variables.\n\nTheorem 2. The Bayes MSE of the estimator in Eq. (3) for the observation model in Eq. (1), when\nthe activation patterns are drawn from the Gaussian graphical model prior of Eq. (2) is bounded as\n\nE[(cid:107)(cid:98)xk \u2212 x(cid:107)2] =\n\n1\np\n\nRB :=\n\n1\np\n\np(cid:88)\n\ni=k+1\n\n1\n2\u03bbi\n\n+\n\nk\u03c32\np\n\n\u2264 1\n\n2\u03bbk+1\n\n+\n\nk\u03c32\np\n\nHence, the Bayes MSE for the estimator of Eq. (3) under the GGM or Ising prior is bounded above\nby 2/\u03bbk + \u03c32k/p + e\u2212p which is the form used to prove Corollaries 1, 2, 3 in the next section.\n5 Noise threshold for some simple graphs\nIn this section, we discuss the eigenspectrum of some simple graphs and use the MSE bounds derived\nin the previous section to analyze the amount of noise that can be tolerated while ensuring consistent\nMSE recovery of high-dimensional patterns. In all these examples, we \ufb01nd that the tolerable noise\nlevel scales as \u03c32 = o(p\u03b3), where \u03b3 \u2208 (0, 1) characterizes the strength of network interactions.\n5.1 Hierarchical structure\nConsider that, under an appropriate permutation of rows and columns, the weight matrix W has\nthe hierarchical block form shown in Figure 2. 
This corresponds to hierarchical graph structured\ndependencies between node variables, where \u0001(cid:96) > \u0001(cid:96)+1 denote the strength of interactions between\nnodes that are in the same block at level (cid:96) = 0, 1, . . . , L. It is easy to see that in this case the\neigenvectors u of the graph Laplacian correspond to unbalanced Haar wavelet basis (proposed in\n[22, 14]), i.e. u \u221d 1|c2| 1c2 \u2212 1|c1| 1c1, where c1 and c2 are groups of variables within blocks at the\nsame level that are merged together at the next level (see [19] for the case of a full dyadic hierarchy).\nLemma 1. For a dyadic hierarchy with L levels, the eigenvectors of the graph Laplacian are the\nstandard Haar wavelet basis and there are L + 1 unique eigenvalues with the smallest eigenvalue\n\u03bb0 = 0, and the (cid:96)th smallest unique eigenvalue ((cid:96) \u2208 [L]) is 2(cid:96)\u22121-fold degenerate and given as\n\nL(cid:88)\n\n\u03bb(cid:96) =\n\ni=L\u2212(cid:96)+1\n\n2i\u22121\u0001i + 2L\u2212(cid:96)\u0001L\u2212(cid:96)+1.\n\nUsing the bound on MSE as given in Theorems 1 and 2, we can now derive the noise threshold that\nallows for consistent MSE recovery of high-dimensional patterns as the network size p \u2192 \u221e.\nCorollary 1. Consider a graph-structured pattern drawn from an Ising model or the GGM with\nweight matrix W of the hierarchical block form as depicted in Figure 2. If \u0001(cid:96) = 2\u2212(cid:96)(1\u2212\u03b2) \u2200(cid:96) \u2264\n\u03b3 log2 p+1, for constants \u03b3, \u03b2 \u2208 (0, 1), and \u0001(cid:96) = 0 otherwise, then the noise threshold for consistent\nMSE recovery (RB = o(1)) is\n\n\u03c32 = o(p\u03b3).\n\nThus, if we take advantage of the network interaction structure, it is possible to tolerate noise with\nvariance that scales with the network size p, whereas without exploiting structure the noise vari-\nance needs to decrease with p, as discussed in the introduction. 
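The degeneracy structure stated in Lemma 1 can be checked numerically by building a dyadic hierarchical weight matrix. In the sketch below the level weights (written `eps`, increasing toward finer blocks so that nearby nodes couple more strongly) are illustrative choices, not the weight sequence of Corollary 1:

```python
import numpy as np
from collections import Counter

def hier_weights(depth, eps):
    """Dyadic hierarchical weight matrix on 2**depth nodes.

    eps[l] is the interaction strength for node pairs whose deepest
    common block sits at depth l. Illustrative construction: constant
    blocks at every level, as in the block form of Figure 2.
    """
    if depth == 0:
        return np.zeros((1, 1))
    A = hier_weights(depth - 1, eps[1:])
    n = A.shape[0]
    B = np.full((n, n), eps[0])      # cross-block interactions at this depth
    return np.block([[A, B], [B, A]])

levels = 4                            # L levels -> p = 2^L = 16 nodes
eps = [1.0, 2.0, 4.0, 8.0]            # illustrative, increasing toward the leaves
W = hier_weights(levels, eps)
Lap = np.diag(W.sum(axis=1)) - W
evals = np.sort(np.linalg.eigvalsh(Lap))

# Lemma 1: L+1 unique eigenvalues, smallest one 0, and the l-th smallest
# nonzero eigenvalue 2^{l-1}-fold degenerate (Haar eigenvectors).
unique = sorted(Counter(np.round(evals, 6)).items())
mults = [m for _, m in unique]
```

For these weights every node has degree 32, and the five distinct eigenvalues come out in ascending order with multiplicities 1, 1, 2, 4, 8, matching the Haar-wavelet eigenstructure described above.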
Larger \u03b3 implies stronger network\ninteractions, and hence larger the noise threshold.\n5.2 Regular Lattice structure\nNow consider the lattice graph which is constructed by placing vertices in a regular grid on a d\ndimensional torus and adding edges of weight 1 to adjacent points. Let p = nd. For d = 1\n\n4\n\n\fthis is a cycle which has a circulant weight matrix w, with eigenvalues {2 cos( 2\u03c0k\np ) : k \u2208 [p]} and\neigenvectors correspond to the discrete Fourier transform [13]. Let i = (i1, ..., id), j = (j1, ..., jd) \u2208\n[n]d. Then the weight matrix of the lattice in d dimensions is\n\nWi,j = wi1,j1\u03b4i2,j2...\u03b4id,jd + ... + wid,jd \u03b4i1,j1 ...\u03b4id\u22121,jd\u22121\n\n(5)\nwhere \u03b4 is the Kronecker delta function. This form for W and since all nodes have same degree\ngives us a closed form for the eigenvalues of the Laplacian, along with a concentration inequality.\nLemma 2. Let \u03bbL\u2022 be an eigenvalue of the Laplacian, L, of the lattice graph in d dimensions with\np = nd vertices, chosen uniformly at random. Then\n\nP{\u03bbL\u2022 \u2264 d} \u2264 exp{\u2212d/8}.\n(6)\nk \u2265 d and k = (cid:100)pe\u2212d/8(cid:101). So, the risk bound becomes O(2/d +\n\nHence, we can choose k such that \u03bbL\n\u03c32e\u2212d/8 + e\u2212p), and as we increase dimensions of the lattice the MSE decays linearly.\nCorollary 2. Consider a graph-structured pattern drawn from an Ising model or GGM based on a\nlattice graph in d dimensions with p = nd vertices. 
If n is a constant and d = 8\u03b3 ln p, for some\nconstant \u03b3 \u2208 (0, 1), then the noise threshold for consistent MSE recovery (RB = o(1)) is given as:\n\n\u03c32 = o(p\u03b3).\n\nAgain, the noise variance can increase with the network size p, and larger \u03b3 implies stronger network\ninteractions as each variables interacts with more number of neighbors (d is larger).\n\n5.3 Erd\u00a8os-R\u00b4enyi random graph structure\nErd\u00a8os-R\u00b4enyi (ER) random graphs are generated by adding edges with weight 1 between any two\nvertices within the vertex set V (of size p) with probability qp. It is known that the probability of\nedge inclusion (qp) determines large geometric properties of the graph [11]. Real world networks\nare generally sparse, so we set qp = p\u2212(1\u2212\u03b3), where \u03b3 \u2208 (0, 1). Larger \u03b3 implies higher probability\nof edge inclusion and stronger network interaction structure. Using the degree distribution [7], and\na result from perturbation theory, we bound the quantiles of the eigenspectrum of L.\nLemma 3. Let \u03bb\u2022 denote an eigenvalue of L chosen uniformly at random. Let PG be the probability\nmeasure induced by the ER random graph and P\u2022 be the uniform distribution over eigenvalues\nconditional on the graph. Then, for any \u03b1p increasing in p,\n\nPG{P\u2022{\u03bb\u2022 \u2264 p\u03b3/2 \u2212 p\u03b3\u22121} \u2265 \u03b1pp\u2212\u03b3} = O(1/\u03b1p)\n\n(7)\nHence, we are able to set the sequence of quantiles for the eigenvalue distribution kp = (cid:100)\u03b1pp1\u2212\u03b3(cid:101)\nsuch that PG{\u03bbkp \u2264 p\u03b3/2 \u2212 p\u03b3\u22121} = O(1/\u03b1p). So, we obtain a bound for the expected Bayes\nMSE (with respect to the graph) EG[RB] \u2264 O(p\u2212\u03b3) + \u03c32O(\u03b1pp\u2212\u03b3) + O(1/\u03b1p).\nCorollary 3. 
Consider a graph G drawn from an Erd\u00a8os-R\u00b4enyi random graph model with p vertices\nand probability of edge inclusion qp = p\u2212(1\u2212\u03b3) for some constant \u03b3 \u2208 (0, 1). If the latent graph-\nstructured pattern is drawn from an Ising model or a GGM with the Laplacian of G, then the noise\nvariance that can be tolerated while ensuring consistent MSE recovery (RB = oPG (1)) is given as:\n\n\u03c32 = o(p\u03b3).\n\n6 Experiments\nWe simulate patterns from the Ising model de\ufb01ned on hierarchical, lattice and ER graphs. Since the\nIsing distribution admits a closed form for the distribution of one node conditional on the rest of the\nnodes, a Gibbs sampler can be employed. Histograms of the eigenspectrum for the hierarchical tree\ngraph with a large depth, the lattice graph in high dimensions, and a draw from the ER graph with\nmany nodes is shown in \ufb01gures 3(a), 4(a), 5(a) respectively. The eigenspectrum of the lattice and\nER graphs illustrate the concentration of the eigenvalues about the expected degree of each node.\nWe use iterative eigenvalue solvers to form our estimator and choose the quantile k by minimizing\nthe bound in Theorem 1. We compute the Bayes MSE (by taking multiple draws) of our estimator\nfor a noisy sample of node measurements. We observe in all of the models that the eigenmap\nestimator is a substantial improvement over Naive (the Bayes estimator that ignores the structure).\n\n5\n\n\f(a) Eigenvalue Histogram for hierarchical tree.\n\n(b) Estimator Performance\n\nFigure 3: The eigenvalue histogram for the binary tree, L = 11, \u03b2 = .1 (left) and the performance\nof various estimators (right) with \u03b2 = 0.05 and \u03c32 = 4, both with \u03b3 = 1.\n\n(a) Eigenvalue Histogram for Lattice.\n\n(b) Estimator Performance\n\nFigure 4: The eigenvalue histogram for the lattice with d = 10 and p = 510 (left) and estimator\nperformances (right) with p = 3d and \u03c32 = 1. 
Notice that the eigenvalues concentrate around 2d.

(a) Eigenvalue Histogram for Erdős-Rényi.

(b) Estimator Performance

Figure 5: The eigenvalue histogram for a draw from the ER graph with p = 2500 and q_p = p^{-.5} (left) and the estimator performances (right) with q_p = p^{-.75} and σ² = 4. Notice that the eigenvalues are concentrated around p^γ where q_p = p^{-(1-γ)}.

(a) Eigenvalue Histogram for Watts-Strogatz.

(b) Estimator Performance

Figure 6: The eigenvalue histogram for a draw from the Watts-Strogatz graph with d = 5 and p = 4^5 with 0.25 probability of rewiring (left) and estimator performances (right) with 4^d vertices and σ² = 4. Notice that the eigenvalues are concentrated around 2d.

See Figures 3(b), 4(b), 5(b). For the hierarchical model, we also sample from the posterior using a Gibbs sampler and estimate the posterior mean (the Bayes rule under MSE). We find that the posterior mean is only a slight improvement over the eigenmap estimator (Figure 3(b)), despite its difficulty to compute. Also, a binarized version of these estimators does not substantially change the MSE. We also simulate graphs from the Watts-Strogatz 'small world' model [26], which is known to be an appropriate model for self-organizing systems such as biological systems and human networks. The 'small world' graph is generated by forming the lattice graph described in Section 5.2, then rewiring each edge with some constant probability to another vertex uniformly at random such that loops are never created. We observe that the eigenvalues concentrate (more tightly than the lattice graph) around the expected degree 2d (Figure 6(a)) and note that, like the ER model, the eigenspectrum converges to a nearly semi-circular distribution [12].
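The concentration of the ER eigenspectrum around the expected degree (p − 1)q_p ≈ p^γ is easy to reproduce directly; the sizes and seed below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

p, gamma = 2000, 0.5
q = p ** (-(1 - gamma))                 # q_p = p^{-(1-gamma)}; expected degree ~ p^gamma
U = rng.random((p, p))
W = np.triu(U < q, 1).astype(float)     # Bernoulli(q) edges, no self-loops
W = W + W.T                             # symmetric adjacency
L = np.diag(W.sum(axis=1)) - W

evals = np.linalg.eigvalsh(L)
mean_deg = (p - 1) * q                  # ~ p**gamma (about 44.7 here)
frac_small = np.mean(evals <= p**gamma / 2)   # spectral mass below p^gamma/2, cf. Lemma 3
```

The bulk of the spectrum sits near the expected degree, and only a vanishing fraction of eigenvalues (including the trivial zero eigenvalue) falls below p^γ/2, which is the behavior Lemma 3 quantifies.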
Similarly, the MSE decays in a fashion similar\nto the ER model (Figure 6(b)).\n\n7 Discussion\nIn this paper, we have characterized the improvement in noise threshold, below which consistent\nMSE recovery of high-dimensional network activation patterns embedded in heavy noise is possi-\nble, as a function of the network size and parameters governing the statistical dependencies in the\nactivation process. Our results indicate that by leveraging the network interaction structure, it is\npossible to tolerate noise with variance that increases with the size of the network whereas with-\nout exploiting dependencies in the node measurements, the noise variance needs to decrease as the\nnetwork size grows to accommodate for multiple hypothesis testing effects.\nWhile we have only considered MSE recovery, it is often possible to detect the presence of patterns\nin much heavier noise, even though the activation values may not be accurately recovered [16].\nEstablishing the noise threshold for detection, deriving upper bounds on the noise threshold, and\nextensions to graphical models with higher-order interaction terms are some of the directions for\nfuture work. In addition, the thresholding estimator based on the graph Laplacian eigenbasis can\nalso be used in high-dimensional linear regression or compressed sensing framework to incorporate\nstructure, in addition to sparsity, of the relevant variables.\nAppendix\nProof sketch of Theorem 1: First, we argue that whp, xT Lx \u2264 \u03b4p, where 0 < \u03b4 < 2 is a constant.\nLet \u2126 = {x : xT Lx \u2264 \u03b4p} and \u00af\u2126 denotes its complement. By Markov\u2019s inequality, for t > 0,\n\nLet \u03bd denote the uniform distribution over {0, 1}p and N (L) =(cid:82) \u03bd(dx)e\u2212xT Lx. 
Then,\nwhere the last step follows since N (L) = (cid:80)\n\nP{xT Lx > C} = P{etxT Lx > etC} \u2264 e\u2212tC EetxT Lx\n(cid:82) \u03bd(dx)e\u2212xT (1\u2212t)Lx\n\nEexT (tL)x =(cid:82) \u03bd(dx)N (L)\u22121e\u2212xT LxexT (tL)x =\n\n= N ((1\u2212t)L)\n\nN (L), N ((1 \u2212 t)L) \u2264 2p,\u2200t \u2208 (0, 1). This gives us the Chernoff-type bound,\nP( \u00af\u2126) \u2264 P{xT Lx > C} \u2264 e\u2212tC2p = e(log 2\u2212tC/p)p \u2264 e\u2212p\n\nN (L)\n\nx\u2208{0,1}p e\u2212xT Lx and L(cid:126)1 = 0 implying that 1 \u2264\n\nN (L) \u2264 2p\n\nby setting C = \u03b4p and \u03b4 = 1+log 2\n\n. If we choose t < 1+log 2\n\n2\n\nthen \u03b4 < 2.\n\nt\n\nLet ui denote the ith eigenvector of the graph Laplacian L, then under this orthonormal basis,\n\nE[(cid:107)(cid:98)xk \u2212 x(cid:107)2] \u2264 E[\n\np(cid:88)\n\ni=k+1\n\ni+kx, i \u2208 [p \u2212 k] and note that xT Lx =(cid:80)p\n\nWe now establish that supx:xT Lx\u2264\u03b4p\nLet \u02dcxi = uT\nith eigenvalue of L. Consider the primal problem,\n\ni=k+1(uT\n\n(cid:80)p\n\np(cid:88)\ni x)2 \u2265(cid:80)p\n\ni=k+1\n\ni x2| \u2126] + pP ( \u00af\u2126) + k\u03c32 \u2264\nuT\n\nsup\n\nx:xT Lx\u2264\u03b4p\n\ni x2 + p e\u2212p + k\u03c32.\nuT\n\ni=1 \u03bbi(uT\n\ni x)2 \u2264 p min(1, \u03b4/\u03bbk+1), and the result follows.\ni , for \u03bbi the\np\u2212k(cid:88)\n\ni=k+1 \u03bbi \u02dcx2\n\n\u03bbj \u02dcx2\n\nj \u2264 \u03b4p, \u02dcx \u2208 Rp\u2212k\n\np\u2212k(cid:88)\n\nmax\n\nj such that\n\u02dcx2\n\nj=1\n\nj=1\n\n7\n\n\fi\n\n1\np\n\ni=k+1(uT\n\nE||U[k]\u03b6||2 = 1\n\n(cid:80)p\ni=k+1(2\u03bbi)\u22121 + \u03c32k/p \u2264 (2\u03bbk+1)\u22121 + \u03c32k/p.\n\n(cid:80)p\n) independently over i \u2208 [p]. Then E||\u02dcx||2 =(cid:80)p\n\nNote that x contained within the ellipsoid xT Lx \u2264 \u03b4p, x \u2208 {0, 1}p implies that \u02dcx is feasible, so\na solution to the optimization upper bounds supx:xT Lx\u2264\u03b4p\ni x)2. 
By forming the dual\nproblem, we \ufb01nd that the solution, x\u2217, to the primal problem attains a bound of ||\u02dcx||2 \u2264 ||\u02dcx\u2217||2 =\n\u03b4p/\u03bbk+1. Also, ||\u02dcx||2 \u2264 ||x||2 \u2264 p, so we obtain the desired bound.\nE||(cid:98)x\u2212 x||2 =\ni x \u223c\nProof sketch of Theorem 2: Under the same notation as the previous proof, notice that uT\nN (0, (2\u03bb)\u22121\nE||\u02dcx||2 + 1\nProof sketch of Corollary 1: Let (cid:96)\u2217 = (1 \u2212 \u03b3) log2 p. Since \u0001i = 2\u2212i(1\u2212\u03b2) \u2200i < L \u2212 (cid:96)\u2217 + 1 and\n\u0001i = 0 otherwise, we have for (cid:96) \u2265 (cid:96)\u2217 and since L = log2 p, \u03bb(cid:96) \u2265 2\u03b2(L\u2212(cid:96)\u2217)2\u03b2\u22121 = p\u03b2\u03b32\u03b2\u22121, which\nis increasing in p. Therefore, we can pick k = 2(cid:96)\u2217\nProof sketch of Lemma 2: If v1, ..., vd are a subset of the eigenvectors of w with eigenvalues\n\u03bb1, ..., \u03bbd, then W (v1 \u2297 ... \u2297 vd) = (\u03bb1 + ... + \u03bbd)(v1 \u2297 ... \u2297 vd) where \u2297 denotes tensor product.\nNoting that the Dii = 2d,\u2200i \u2208 [n]d then we see that the Laplacian L has eigenvalues \u03bbL\ni =\n2d \u2212 \u03bbW\nn ) for some k \u2208 [n]. Let i be\ndistributed uniformly over [n]d. Then E[\u03bbw\n\n) for all i \u2208 [n]d. Recall \u03bbw\n\n/p = p\u2212\u03b3, the result follows.\n\ni = (cid:80)[d]\n\ni=k+1(2\u03bbi)\u22121 and, so, 1\n\n] = 0, and by Hoeffding\u2019s inequality,\n\nand since 2(cid:96)\u2217\n\nj (2 \u2212 \u03bbw\n\nk = 2 cos( 2\u03c0k\n\np\n\np\n\np\n\nij\n\nP{ d(cid:88)\nSo, using t = d we get that P{(cid:80)d\n\nj=1\n\nij\n\nij\n\n(2 \u2212 \u03bbw\nj=1(2 \u2212 \u03bbw\n\nij\n\n) \u2212 2d \u2264 \u2212t} \u2264 exp{\u22122t2/16d}\n\n) \u2264 d} \u2264 exp{\u2212d\n\n8 } and the result follows.\n\nProof of Lemma 3: We introduce a random variable \u2022 that is uniform over [p]. 
Note that, con-\nditioned on this random variable, d\u2022 \u223c Binomial(p \u2212 1, qp) and Var(d\u2022) \u2264 pqp. We decompose\nthe Laplacian, L = D \u2212 W = ( \u00afdI \u2212 W) + (D \u2212 \u00afdI), into the expected degree of each vertex\n( \u00afd = (p \u2212 1)qp), W and the deviations from the expected degree and use the following lemma.\nLemma 4 (Wielandt-Hoffman Theorem). [15, 27] Suppose A = B+C are symmetric p\u00d7p matrices\ni }p\nand denote the ordered eigenvalues by {\u03bbA\ni , \u03bbB\ni \u2212 \u03bbB\n(\u03bbA\n\ni=1. If ||.||F denotes the Frobenius norm,\ni )2 \u2264 ||C||2\n\n(8)\nF /p = Var(d\u2022) and so EG||\u03bb \u00afdI\u2212W \u2212 \u03bbL||2/p \u2264 pqp = p\u03b3 (i). Also, it\nNotice that EG||D \u2212 \u00afdI||2\nis known that for \u03b3 \u2208 (0, 1) the eigenvalues converge to a semicircular distribution[12] such that\n\nPG{|\u03bbW\u2022 | \u2264 2(cid:112)pqp(1 \u2212 qp)} \u2192 1. Since 2(cid:112)pqp(1 \u2212 qp) \u2264 2p\u03b3/2, we have EG[(\u03bbW\u2022 )2] \u2264 4p\u03b3\n\np(cid:88)\n\ni=1\n\nF\n\nfor large enough p (ii). Using triangle inequality,\n\nEG[(\u03bbL\u2022 \u2212 (p \u2212 1)qp)2] \u2264 EG[(\u03bbL\u2022 \u2212 ((p \u2212 1)qp \u2212 \u03bbW\u2022 ))2] + EG[(\u03bbW\u2022 )2] \u2264 5p\u03b3,\n\n(9)\n\nwhere the last step follows using (i), (ii) and \u03bb\n\n\u00afdI\u2212W\ni\n\n= (p \u2212 1)qp \u2212 \u03bbW\n\ni\n\nPG{P\u2022{\u03bbL\u2022 \u2264 p\u03b3\n2\n\n\u2212 p\u03b3\u22121} \u2265 \u03b1pp\u2212\u03b3} \u2264 p\u03b3\n\u03b1p\n\nEG[P\u2022{\u03bbL\u2022 \u2264 p\u03b3\n2\n\n. By Markov\u2019s inequality,\n\n\u2212 p\u03b3\u22121}]\n\n(10)\n\nfor any \u03b1p which is an increasing positive function in p. 
We now analyze the right hand side.\n\nP\u2022{|\u03bbL\u2022 \u2212 (p \u2212 1)qp| \u2265 \u0001} \u2264 \u0001\u22122E\u2022[(\u03bbL\u2022 \u2212 (p \u2212 1)qp)2]\n\nNote that P\u2022{\u03bbL\u2022 \u2264 pqp \u2212 qp \u2212 \u0001} \u2264 P\u2022{|\u03bbL\u2022 \u2212 (p \u2212 1)qp| \u2265 \u0001} and setting \u0001 = pqp/2 = p\u03b3/2,\n\nP\u2022{\u03bbL\u2022 \u2264 p\u03b3/2 \u2212 p\u03b3\u22121} \u2264 4p\u22122\u03b3E\u2022[(\u03bbL\u2022 \u2212 (p \u2212 1)qp)2].\n\nHence, we are able to complete the lemma, such that for p large enough, using Eqs. (10) and (9)\n\nPG{P\u2022{\u03bbL\u2022 \u2264 p\u03b3\n2\n\n\u2212 p\u03b3\u22121} \u2265 \u03b1pp\u2212\u03b3} \u2264 4\n\u03b1pp\u03b3\n\nEG[E\u2022[(\u03bbL\u2022 \u2212 (p \u2212 1)qp)2]] \u2264 20\n\u03b1p\n\n.\n\n(11)\n\nProof sketch of Corollary 3: By lemma 3 and appropriately specifying the quantiles,\nEGRB \u2264 EG\n\nNote that we have the freedom to choose \u03b1p =(cid:112)p\u03b3/\u03c32 making \u03c32O(\u03b1pp\u2212\u03b3) = O((cid:112)\u03c32/p\u03b3) =\n\np\u03b3/2 \u2212 p\u03b3\u22121 + \u03c32O(\u03b1pp\u2212\u03b3) + e\u2212p\n\n+ \u03c32 kp\np\n\n+ e\u2212p\n\n+ O(\n\n) (12)\n\n1\n\u03b1p\n\n\u03bbkp\n\n\u2264\n\n2\n\no(1) and O(1/\u03b1p) = o(1) if \u03c32 = o(p\u03b3).\n\n(cid:21)\n\n(cid:18)\n\n(cid:20) 2\n\n(cid:19)\n\n8\n\n\fReferences\n[1] F. Abramovich, Y. Benjamini, D. L. Donoho, and I. M. Johnstone, Adapting to unknown sparsity by\n\ncontrolling the false discovery rate, Annals of Statistics 34 (2006), no. 2, 584\u2013653.\n\n[2] Rie K. Ando and Tong Zhang, Learning on graph with laplacian regularization, Advances in Neural\n\nInformation Processing Systems (NIPS), 2006.\n\n[3] M. Belkin and P. 
Niyogi, Semi-supervised learning on Riemannian manifolds, Machine Learning 56 (2004), no. 1-3, 209–239.

[4] Mikhail Belkin, Irina Matveeva, and Partha Niyogi, Regularization and semi-supervised learning on large graphs, Conference on Learning Theory (COLT), 2004.

[5] Mikhail Belkin and Partha Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15 (2003), no. 6, 1373–1396.

[6] Mikhail Belkin and Partha Niyogi, Convergence of Laplacian eigenmaps, Advances in Neural Information Processing Systems (NIPS), 2006.

[7] B. Bollobás, Random graphs, Cambridge University Press, 2001.

[8] Luis E. Carvalho and Charles E. Lawrence, Centroid estimation in discrete high-dimensional spaces with applications in biology, PNAS 105 (2008), no. 9, 3209–3214.

[9] R. Coifman and M. Maggioni, Diffusion wavelets, Applied and Computational Harmonic Analysis 21 (2006), no. 1, 53–94.

[10] D. L. Donoho, I. M. Johnstone, J. C. Hoch, and A. S. Stern, Maximum entropy and the nearly black object, Journal of the Royal Statistical Society, Series B 54 (1992), 41–81.

[11] P. Erdős and A. Rényi, On the evolution of random graphs, Publication of the Mathematical Institute of the Hungarian Academy of Sciences, 1960, pp. 17–61.

[12] Illés J. Farkas, Imre Derényi, Albert-László Barabási, and Tamás Vicsek, Spectra of real-world graphs: beyond the semi-circle law, Physical Review E 64 (2001), 1–12.

[13] Bernard Friedman, Eigenvalues of composite matrices, Mathematical Proceedings of the Cambridge Philosophical Society 57 (1961), 37–49.

[14] M. Gavish, B. Nadler, and R. Coifman, Multiscale wavelets on trees, graphs and high dimensional data: theory and applications to semi-supervised learning, 27th International Conference on Machine Learning (ICML), 2010.

[15] S. Jalan and J. N. Bandyopadhyay, Random matrix analysis of network Laplacians, Tech.
Report cond-mat/0611735, Nov 2006.

[16] J. Jin and D. L. Donoho, Higher criticism for detecting sparse heterogeneous mixtures, Annals of Statistics 32 (2004), no. 3, 962–994.

[17] A. B. Lee, B. Nadler, and L. Wasserman, Treelets: an adaptive multi-scale basis for sparse unordered data, Annals of Applied Statistics 2 (2008), no. 2, 435–471.

[18] N. Meinshausen and P. Bühlmann, High-dimensional graphs and variable selection with the lasso, Annals of Statistics 34 (2006), no. 3, 1436–1462.

[19] A. T. Ogielski and D. L. Stein, Dynamics on ultrametric spaces, Physical Review Letters 55 (1985), 1634–1637.

[20] J. Shi and J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Analysis and Machine Intelligence 22 (2000), 888–905.

[21] A. Singer, From graph to manifold Laplacian: the convergence rate, Applied and Computational Harmonic Analysis 21 (2006), no. 1, 135–144.

[22] A. Singh, R. Nowak, and R. Calderbank, Detecting weak but hierarchically-structured patterns in networks, 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[23] A. Smola and R. Kondor, Kernels and regularization on graphs, Conference on Learning Theory (COLT), 2003.

[24] Ulrike von Luxburg, A tutorial on spectral clustering, Statistics and Computing 17 (2007), no. 4, 395–416.

[25] M. Wainwright, P. Ravikumar, and J. D. Lafferty, High-dimensional graphical model selection using ℓ1-regularized logistic regression, Advances in Neural Information Processing Systems (NIPS), 2006.

[26] Duncan J. Watts and Steven H. Strogatz, Collective dynamics of 'small-world' networks, Nature 393 (1998), no. 6684, 440–442.

[27] Choujun Zhan, Guanrong Chen, and Lam F.
Yeung, On the distribution of Laplacian eigenvalues versus node degrees in complex networks, Physica A 389 (2010), 1779–1788.