{"title": "Robust Spectral Detection of Global Structures in the Data by Learning a Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 541, "page_last": 549, "abstract": "Spectral methods are popular in detecting global structures in the given data that can be represented as a matrix. However when the data matrix is sparse or noisy, classic spectral methods usually fail to work, due to localization of eigenvectors (or singular vectors) induced by the sparsity or noise. In this work, we propose a general method to solve the localization problem by learning a regularization matrix from the localized eigenvectors. Using matrix perturbation analysis, we demonstrate that the learned regularizations suppress down the eigenvalues associated with localized eigenvectors and enable us to recover the informative eigenvectors representing the global structure. We show applications of our method in several inference problems: community detection in networks, clustering from pairwise similarities, rank estimation and matrix completion problems. Using extensive experiments, we illustrate that our method solves the localization problem and works down to the theoretical detectability limits in different kinds of synthetic data. This is in contrast with existing spectral algorithms based on data matrix, non-backtracking matrix, Laplacians and those with rank-one regularizations, which perform poorly in the sparse case with noise.", "full_text": "Robust Spectral Detection of Global Structures in the\n\nData by Learning a Regularization\n\nInstitute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China\n\nPan Zhang\n\npanzhang@itp.ac.cn\n\nAbstract\n\nSpectral methods are popular in detecting global structures in the given data that\ncan be represented as a matrix. 
However, when the data matrix is sparse or noisy,\nclassic spectral methods usually fail to work, due to localization of eigenvectors\n(or singular vectors) induced by the sparsity or noise. In this work, we propose\na general method to solve the localization problem by learning a regularization\nmatrix from the localized eigenvectors. Using matrix perturbation analysis, we\ndemonstrate that the learned regularizations suppress the eigenvalues associated\nwith localized eigenvectors and enable us to recover the informative eigenvectors\nrepresenting the global structure. We show applications of our method\nin several inference problems: community detection in networks, clustering from\npairwise similarities, rank estimation and matrix completion problems. Using extensive\nexperiments, we illustrate that our method solves the localization problem\nand works down to the theoretical detectability limits in different kinds of synthetic\ndata. This is in contrast with existing spectral algorithms based on the data\nmatrix, non-backtracking matrix, Laplacians and those with rank-one regularizations,\nwhich perform poorly in the sparse case with noise.\n\n1 Introduction\n\nIn many statistical inference problems, the task is to detect, from given data, a global structure such\nas low-rank structure or clustering. The task is usually hard to solve since modern datasets usually\nhave a large dimensionality. When the dataset can be represented as a matrix, spectral methods are\npopular as they give a natural way to reduce the dimensionality of data using eigenvectors or singular\nvectors.\nFrom the point of view of inference, data can be seen as measurements of the underlying\nstructure. Thus more data gives more precise information about the underlying structure.\nHowever, in many situations we do not have enough measurements, i.e.\nthe data matrix is\nsparse, and standard spectral methods then suffer from localization problems and do not work well. 
One\nexample is community detection in sparse networks, where the task is to partition nodes into\ngroups such that there are many edges connecting nodes within the same group and comparatively\nfew edges connecting nodes in different groups. It is well known that when the graph has a large\nconnectivity c, simply using the first few eigenvectors of the adjacency matrix A \u2208 {0, 1}^{n\u00d7n}\n(with Aij = 1 denoting an edge between node i and node j, and Aij = 0 otherwise) gives a good\nresult. In this case, like that of a sufficiently dense Erd\u0151s-R\u00e9nyi (ER) random graph with average\ndegree c, the spectral density follows Wigner\u2019s semicircle rule, P(\u03bb) = \u221a(4c \u2212 \u03bb^2)/(2\u03c0c), and there\nis a gap between the edge of the bulk of eigenvalues and the informative eigenvalue that represents the\nunderlying community structure. However, when the network is large and sparse, the spectral density\nof the adjacency matrix deviates from the semicircle, and the informative eigenvalue is hidden in the\nbulk of eigenvalues, as displayed in Fig. 1 left. The eigenvectors associated with the largest eigenvalues\n(which are roughly proportional to log n/ log log n for ER random graphs) are localized on the large-degree\nnodes, and thus reveal only local structures about large degrees rather than the underlying global\nstructure. Other standard matrices for spectral clustering [19, 22], e.g. Laplacian, random walk\nmatrix, normalized Laplacian, all have localization problems but on different local structures such\nas dangling trees.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nAnother example is the matrix completion problem, which asks to infer missing entries of a matrix\nA \u2208 R^{m\u00d7n} with rank r \u226a \u221a(mn) from only a few observed entries. A popular method for this\nproblem is based on the singular value decomposition (SVD) of the data matrix. 
However, it is\nwell known that when the matrix is sparse, SVD-based methods perform very poorly, because the\nsingular vectors corresponding to the largest singular values are localized, i.e. highly concentrated\non high-weight column or row indices.\nA simple way to ease the pain of localization induced by high degree or weight is trimming [6, 13],\nwhich sets to zero columns or rows with a large degree or weight. However, trimming throws away\npart of the information, and thus does not work all the way down to the theoretical limit in the\ncommunity detection problem [6, 15]. It also performs worse than other methods in the matrix completion\nproblem [25].\nIn recent years, many methods have been proposed for the sparsity problem. One kind of method\nuses new linear operators related to belief propagation and the Bethe free energy, such as the\nnon-backtracking matrix [15] and Bethe Hessian [24]. Another kind of method adds to the data matrix or\nits variant a rank-one regularization matrix [2, 11, 16\u201318, 23]. These methods are quite successful\nin some inference problems in the sparse regime. However, in our understanding none of them works\nin a general way to solve the localization problem. For instance, the non-backtracking matrix and\nthe Bethe Hessian work very well when the graph has a locally-tree-like structure, but they again\nhave localization problems when the system has short loops or sub-structures like triangles and\ncliques. Moreover, their performance is sensitive to the noise in the data [10]. Rank-one regularizations\nhave been used for a long time in practice; the most famous example is the \u201cteleportation\u201d term\nin the Google matrix. However, there is no satisfactory way to determine the optimal amount of\nregularization in general. 
Moreover, analogous to the non-backtracking matrix and Bethe Hessian,\nthe rank-one regularization approach is also sensitive to the noise, as we will show in the paper.\nThe main contribution of this paper is to illustrate how to solve the localization problem of spectral\nmethods for general inference problems in sparse regime and with noise, by learning a proper\nregularization that is specific for the given data matrix from its localized eigenvectors. In the following\ntext we will first discuss in Sec. 2 that all three methods for community detection in sparse\ngraphs can be put into the framework of regularization. Thus the drawbacks of existing methods\ncan be seen as improper choices of regularizations. In Sec. 3 we investigate how to choose a good\nregularization that is dedicated for the given data, rather than taking a fixed-form regularization as\nin the existing approaches. We use matrix perturbation analysis to illustrate how the regularization\nworks in penalizing the localized eigenvectors, and making the informative eigenvectors that\ncorrelate with the global structure float to the top positions in spectrum. In Sec. 4 we use extensive\nnumerical experiments to validate our approach on several well-studied inference problems,\nincluding the community detection in sparse graphs, clustering from sparse pairwise entries, rank\nestimation and matrix completion from few entries.\n\nFigure 1: Spectral density of the adjacency matrix (left) and X-Laplacian (right) of a graph generated\nby the stochastic block model with n = 10000 nodes, average degree c = 3, q = 2 groups and\n\u03b5 = 0.125. 
Red arrows point to eigenvalues out of the bulk.\n\n2 Regularization as a unified framework\n\nWe see that the above three methods for the community detection problem in sparse graphs, i.e.\ntrimming, non-backtracking/Bethe Hessian, and rank-one regularizations, can be understood as\ndifferent ways of regularization. In this framework, we consider a regularized matrix\n\nL = \u02c6A + \u02c6R. (1)\n\nHere matrix \u02c6A is the data matrix or its (symmetric) variant, such as \u02dcA = D^{\u22121/2} A D^{\u22121/2} with\nD denoting the diagonal matrix of degrees, and matrix \u02c6R is a regularization matrix. The rank-one\nregularization approaches [2, 11, 16\u201318, 23] fall naturally into this framework as they set \u02c6R to be a\nrank-one matrix, \u2212\u03b611^T (with 1 the all-ones vector), with \u03b6 being a tunable parameter controlling the strength of regularization.\nIt is also easy to see that in trimming, \u02c6A is set to be the adjacency matrix and \u02c6R contains entries\nthat remove columns or rows with high degrees from A.\nFor spectral algorithms using the non-backtracking matrix, the relation to the form of Eq. (1) is not\nstraightforward. However, we can link them using the theory of the graph zeta function [8], which says that an\neigenvalue \u00b5 of the non-backtracking operator satisfies the following quadratic eigenvalue equation,\n\ndet[\u00b5^2 I \u2212 \u00b5A + (D \u2212 I)] = 0,\n\nwhere I is the identity matrix. It indicates that a particular vector v that is related to the eigenvector\nof the non-backtracking matrix satisfies (A \u2212 (D \u2212 I)/\u00b5) v = \u00b5v. Thus the spectral clustering algorithm\nusing the non-backtracking matrix is equivalent to the spectral clustering algorithm using a matrix\nof the form in Eq. (1), with \u02c6A = A, \u02c6R = \u2212(D \u2212 I)/\u00b5, and \u00b5 acting as a parameter. We note here that\nthe parameter need not be an eigenvalue of the non-backtracking matrix. 
Actually a\nrange of parameters works well in practice, like those estimated from the spin-glass transition of the\nsystem [24]. So we have related the different approaches to resolving localization of spectral algorithms\nin sparse graphs to the framework of regularization. Although this relation is in the context of\ncommunity detection in networks, we think it is a general point of view that also applies when the data matrix has a\ngeneral form rather than a {0, 1} matrix.\nAs we have argued in the introduction, the above three ways of regularization work only from case to case\nand have different problems, especially when the system has noise. It means that in the framework\nof regularization, the effective regularization matrix \u02c6R added by these methods does not work in a\ngeneral way and is not robust. In our understanding, the problem arises from the fact that in all\nthese methods, the form of regularization is fixed for all kinds of data, regardless of the different reasons\nfor the localization. Thus one way to solve the problem would be to look for regularizations\nthat are specific to the given data, as a feature.\nIn the following section we will introduce our\nmethod, explicitly addressing how to learn such regularizations from localized eigenvectors of the\ndata matrix.\n\n3 Learning regularizations from localized eigenvectors\n\nThe reason that the informative eigenvectors are hidden in the bulk is that some random eigenvectors\nhave large eigenvalues, due to localization, which represents the local structures of the system. On\nthe other hand, if these eigenvectors were not localized, they would be expected to have smaller\neigenvalues than the informative ones, which reveal the global structures of the graph. This is the\nmain assumption that our idea is based on.\n\nIn this work we use the Inverse Participation Ratio (IPR), I(v) = \u2211_{i=1}^{n} v_i^4, to quantify the amount\nof localization of a (normalized) eigenvector v. 
IPR has been used frequently in physics, for example\nfor distinguishing the extended state from the localized state when applied on the wave function [3].\nIt is easy to check that I(v) ranges from 1/n for vector {1/\u221an, 1/\u221an, ..., 1/\u221an} to 1 for vector\n{0, ..., 0, 1, 0, ..., 0}. That is, a larger I(v) indicates more localization in vector v.\nOur idea is to create a matrix LX with similar structures to A, but with non-localized leading\neigenvectors. We call the resulting matrix the X-Laplacian, and define it as LX = A + X, where matrix A is\nthe data matrix (or its variant), and X is learned using the procedure detailed below:\n\nAlgorithm 1: Regularization Learning\nInput: Real symmetric matrix A, number of eigenvectors q, learning rate \u03b7 = O(1), threshold \u2206.\nOutput: X-Laplacian, LX, whose leading eigenvectors reveal the global structures in A.\n\n1. Set X to be the all-zero matrix.\n2. Find the set of eigenvectors U = {u1, u2, ..., uq} associated with the q algebraically largest\neigenvalues of LX = A + X.\n3. Identify the eigenvector v that has the largest inverse participation ratio among the q\neigenvectors in U. That is, find v = argmax_{u\u2208U} I(u).\n4. If I(v) < \u2206, return LX = A + X; otherwise, \u2200i, Xii \u2190 Xii \u2212 \u03b7 v_i^2, then go to step 2.\n\nWe can see that the regularization matrix X is a diagonal matrix whose diagonal entries are learned\ngradually from the most localized vector among the first several eigenvectors. The effect of X is to\npenalize the localized eigenvectors, by suppressing the eigenvalues associated with the localized\neigenvectors. The learning continues until all q leading eigenvectors are delocalized, and thus\nare supposed to correlate with the global structure rather than the local structures. As an example,\nwe show the effect of X on the spectrum in Fig. 1, where we plot the spectrum of the\nadjacency matrix (i.e. 
before learning X) and the X-Laplacian (i.e. after learning X) of a sparse\nnetwork generated by the stochastic block model with q = 2 groups. For the adjacency matrix in\nthe left panel, localized eigenvectors have large eigenvalues and contribute a tail to the semicircle,\ncovering the informative eigenvalue and leaving only one eigenvalue out of the bulk, namely the one\ncorresponding to the eigenvector that essentially sorts vertices according to their degree. The spectral density\nof the X-Laplacian is shown in the right panel of Fig. 1. We can see that the right corner of the continuous\npart of the spectral density, which appears in the spectrum of the adjacency matrix, is missing here. This\nis because, due to the effect of X, the eigenvalues that are associated with localized eigenvectors in\nthe adjacency matrix are pushed into the bulk, maintaining a gap between the edge of the bulk and the\ninformative eigenvalue (pointed to by the left red arrow in the figure).\nThe key procedure of the algorithm is the learning part in step 4, which updates the diagonal terms of\nmatrix X using the most localized eigenvector v. Throughout the paper, by default we use learning\nrate \u03b7 = 10 and threshold \u2206 = 5/n. As \u03b7 = O(1) and v_i^2 = O(1/n), we can treat the learned entries\nin each step, \u02c6L, as a perturbation to matrix LX. After applying this perturbation, we anticipate that\nan eigenvalue of LX changes from \u03bbi to \u03bbi + \u02c6\u03bbi, and an eigenvector changes from ui to ui + \u02c6ui. If\nwe assume that matrix LX is not ill-conditioned, and the first few eigenvectors that we care about\nare distinct, then we have \u02c6\u03bb_i = u_i^T \u02c6L u_i. Derivation of the above expression is straightforward, but\nfor completeness we put the derivations in the SI text. In our algorithm, \u02c6L is a diagonal matrix\nwith entries \u02c6L_{ii} = \u2212\u03b7 v_i^2, with v denoting the identified eigenvector that has the largest inverse\nparticipation ratio, so the last equation can be written as \u02c6\u03bb_i = \u2212\u03b7 \u2211_k v_k^2 u_{ik}^2. For the identified vector\nv, we further have\n\n\u02c6\u03bb_v = \u2212\u03b7 \u2211_i v_i^4 = \u2212\u03b7 I(v). (2)\n\nIt means that the eigenvalue of the identified eigenvector with inverse participation ratio I(v) is decreased\nby the amount \u03b7I(v). That is, the more localized the eigenvector is, the larger the penalty on its eigenvalue.\nIn addition to the penalty on the localized eigenvalues, we see that the leading eigenvectors are\ndelocalizing during learning. We have analyzed the change of eigenvectors after the perturbation given\nby the identified vector v, and obtained (see SI for the derivations) the change of an eigenvector \u02c6ui\nas a function of all the other eigenvalues and eigenvectors,\n\n\u02c6u_i = \u2212\u03b7 \u2211_{j\u2260i} [\u2211_k u_{jk} v_k^2 u_{ik} / (\u03bb_i \u2212 \u03bb_j)] u_j.\n\nThen the inverse participation ratio of the new vector ui + \u02c6ui can be written as\n\nI(u_i + \u02c6u_i) = I(u_i) \u2212 4\u03b7 \u2211_{l=1}^{n} \u2211_{j\u2260i} u_{jl}^2 v_l^2 u_{il}^4 / (\u03bb_i \u2212 \u03bb_j) \u2212 4\u03b7 \u2211_{l=1}^{n} \u2211_{j\u2260i} \u2211_{k\u2260l} u_{il}^3 v_k^2 u_{jk} u_{ik} u_{jl} / (\u03bb_i \u2212 \u03bb_j). (3)\n\nAs eigenvectors ui and uj are orthogonal to each other, the term 4\u03b7 \u2211_{l=1}^{n} \u2211_{j\u2260i} u_{jl}^2 v_l^2 u_{il}^4 / (\u03bb_i \u2212 \u03bb_j) can be\nseen as a signal term and the last term can be seen as a cross-talk noise with zero mean. We see\nthat the cross-talk noise has a small variance, and empirically its effect can be neglected. For the\nleading eigenvector corresponding to the largest eigenvalue \u03bbi = \u03bb1, it is straightforward to see that\nthe signal term is strictly positive. Thus if the learning is slow enough, the perturbation will always\ndecrease the inverse participation ratio of the leading eigenvector. 
This is essentially an argument\nfor convergence of the algorithm. For other top eigenvectors, i.e. the second and third eigenvectors\nand so on, though \u03bbi \u2212 \u03bbj is not strictly positive, there are many more positive terms than negative\nterms in the sum, thus the signal should be positive with a high probability. Thus one can conclude\nthat the process of learning X delocalizes the first few eigenvectors.\nAn example illustrating the learning process is shown in Fig. 2, where we plot the second\neigenvector vs. the third eigenvector at several time steps during the learning, for a network generated\nby the stochastic block model with q = 3 groups. We see that at t = 0, i.e. without learning,\nboth eigenvectors are localized, with a large range of distribution in entries. The color of the eigenvector\nentries encodes the group membership in the planted partition. We see that at t = 0 the three colors\nare mixed together, indicating that the two eigenvectors are not correlated with the planted partition. At\nt = 4 the three colors begin to separate, and the range of the entry distribution becomes smaller, indicating that\nthe localization is lighter. At t = 25, the three colors are more separated, and the partition obtained by applying\nthe k-means algorithm to these vectors successfully recovers 70% of the group memberships.\nMoreover, we can see that the range of entries of the eigenvectors shrinks to [\u22120.06, 0.06], giving a small\ninverse participation ratio.\n\nFigure 2: The second eigenvector V2 compared with the third eigenvector V3 of LX for a network at\nthree steps with t = 0, 4 and 25 during learning. The network has n = 42000 nodes, q = 3 groups,\naverage degree c = 3, \u03b5 = 0.08; the three colors represent group labels in the planted partition.\n\n4 Numerical evaluations\n\nIn this section we validate our approach with experiments on several inference problems, i.e. 
community detection problems, clustering from sparse pairwise entries, rank estimation and matrix\ncompletion from a few entries. We will compare the performance of the X-Laplacian (using the mean-removed\ndata matrix) with recently proposed state-of-the-art spectral methods in the sparse regime.\n\n4.1 Community Detection\n\nFirst we use synthetic networks generated by the stochastic block model [9], and its variant with\nnoise [10]. The standard Stochastic Block Model (SBM), also called the planted partition model, is\na popular model to generate ensembles of networks with community structure. There are q groups\nof nodes and a planted partition {t_i^\u2217} \u2208 {1, ..., q}. Edges are generated independently according\nto a q \u00d7 q matrix {pab}. Without loss of generality here we discuss the commonly studied case\nwhere the q groups have equal size and where {pab} has only two distinct entries, pab = cin/n if\na = b and cout/n if a \u2260 b. Given the average degree of the graph, there is a so-called detectability\ntransition \u03b5\u2217 = cout/cin = (\u221ac \u2212 1)/(\u221ac \u2212 1 + q) [7], beyond which point it is not possible to\nobtain any information about the planted partition. It is also known that spectral algorithms based on\nthe non-backtracking matrix succeed all the way down to the transition [15]. This transition was\nrecently established rigorously in the case of q = 2 [20, 21]. Comparisons of spectral methods using\ndifferent matrices are shown in Fig. 3 left. From the figure we see that the X-Laplacian works as\nwell as the non-backtracking matrix, down to the detectability transition, while the direct use of the\nadjacency matrix, i.e. LX before learning, does not work well when \u03b5 exceeds about 0.1.\nIn the right panel of Fig. 3, each network is generated by the stochastic block model with the same\nparameters as in the left panel, but with 10 extra cliques, each of which contains 10 randomly selected\nnodes. 
These cliques do not carry information about the planted partition, hence act as noise to the\nsystem. In addition to the non-backtracking matrix, X-Laplacian, and the adjacency matrix, we put\ninto comparison the results obtained using other classic and newly proposed matrices, including\nBethe Hessian [24], Normalized Laplacian (N. Laplacian) Lsym = I \u2212 \u02dcA, and regularized and\nnormalized Laplacian (R.N. Laplacian) LA = \u02dcA \u2212 \u03b611^T, with an optimized regularization \u03b6 (we\nhave scanned the whole range of \u03b6, and chosen an optimal one that gives the largest overlap, i.e.\nfraction of correctly reconstructed labels, in most cases). From the figure we see that with the\nnoise added, only the X-Laplacian works down to the original transition (of the SBM without cliques). All\nother matrices fail to detect the community structure with \u03b5 > 0.15.\nWe have tested other kinds of noisy models, including the noisy stochastic block model, as proposed\nin [10]. Our results show that the X-Laplacian works well (see SI text) while all other spectral\nmethods do not work at all on this dataset [10]. Moreover, in addition to the classic stochastic block\nmodel, we have extensively evaluated our method on networks generated by the degree-corrected\nstochastic block model [12], and the stochastic block model with extensive triangles. We\nobtained results qualitatively similar to those in Fig. 3: the X-Laplacian works as well as the state-of-the-art\nspectral methods for these datasets. The figures and detailed results can be found in the SI text.\nWe have also tested real-world networks with an expert division, and found that although the expert\ndivision is usually easy to detect by directly using the adjacency matrix, the X-Laplacian significantly\nimproves the accuracy of detection. 
For example, on the political blogs network [1], spectral\nclustering using the adjacency matrix gives 83 mis-classified labels among 1222 labels in total, while\nthe X-Laplacian gives only 50 mis-classified labels.\n\nFigure 3: Accuracy of community detection, represented by overlap (fraction of correctly reconstructed\nlabels) between the inferred partition and the planted partition, for several methods on networks\ngenerated by the stochastic block model with average degree c = 3 (left) and with 10 extra size-10\ncliques (right). All networks have n = 10000 nodes and q = 2 groups, \u03b5 = cout/cin. The black dashed\nlines denote the theoretical detectability transition. Each data point is averaged over 20 realizations.\n\n4.2 Clustering from sparse pairwise measurements\n\nConsider the problem of grouping n items into clusters based on the similarity matrix S \u2208 R^{n\u00d7n},\nwhere Sij is the pairwise similarity between items i and j. Here we consider not using all pairwise\nsimilarities, but only O(n) random samples of them. In other words, the similarity graph which\nencodes the information of the global clustering structure is sparse, rather than the complete graph.\nThere are many motivations for choosing such sparse observations, for example in some cases all\nmeasurements are simply not available or cannot even be stored.\nIn this section we use the generative model recently proposed in [26], since there is a theoretical\nlimit that can be used to evaluate algorithms. Without loss of generality, we consider the problem\nwith only q = 2 clusters. The model in [26] first assigns items to hidden clusters {ti} \u2208 {1, 2}^n, then\ngenerates similarities between randomly sampled pairs of items according to the probability distributions\npin and pout, associated with the membership of the two items. 
There is a theoretical limit \u02c6c satisfying\n\n1/\u02c6c = (1/q) \u222b ds (pin(s) \u2212 pout(s))^2 / (pin(s) + (q \u2212 1) pout(s)),\n\nsuch that with c < \u02c6c no algorithm could obtain any partial information of\nthe planted clusters, while with c > \u02c6c some algorithms, e.g. spectral clustering using the Bethe\nHessian [26], achieve partial recovery of the planted clusters.\n\nSimilar to community detection in sparse graphs, spectral algorithms directly using the eigenvectors\nof a similarity matrix S do not work well, due to the localization of eigenvectors induced\nby the sparsity. To evaluate whether our method, the X-Laplacian, solves the localization problem,\nand how it works compared with the Bethe Hessian, in Fig. 4 we plot the performance (in overlap,\nthe fraction of correctly reconstructed group labels) of three algorithms on the same set of similarity\nmatrices. For all the datasets there are two groups with distributions pin and pout being Gaussian\nwith unit variance and mean 0.75 and \u22120.75 respectively. In the left panel of Fig. 4 the topology\nof pairwise entries is a random graph; the Bethe Hessian works down to the theoretical limit, while directly\nusing the measurement matrix gives a poor performance. We can also see that the X-Laplacian\nhas fixed the localization problem of directly using the measurement matrix, and works almost\nas well as the Bethe Hessian. We note that the Bethe Hessian needs to know the parameters (i.e.\nparameters of distributions pin and pout), while the X-Laplacian does not use them at all.\nIn the right panel of Fig. 
4, on top of the ER random graph topology, we add some noisy local\nstructures by randomly selecting 20 nodes and connecting neighbors of each selected node to each\nother. The weights for the local pairwise entries were set to 1, so that the noisy structures do not contain\ninformation about the underlying clustering. We can see that the Bethe Hessian is influenced by the noisy\nlocal structures and fails to work, while the X-Laplacian solves the localization problems induced by\nsparsity, and is robust to the noise. We have also tested other kinds of noise by adding cliques or\nhubs, and obtained similar results (see SI text).\n\nFigure 4: Spectral clustering using sparse pairwise measurements. The X-axis denotes the average\nnumber of pairwise measurements per data point, and the Y-axis is the fraction of correctly reconstructed\nlabels, maximized over permutations. The model used to generate pairwise measurements\nis proposed in [26], see text for detailed descriptions. In the left panel, the topologies of the pairwise\nmeasurements are random graphs. In the right panel, in addition to the random graph topology\nthere are 20 randomly selected nodes with all their neighbors connected. Each point in the figure is\naveraged over 20 realizations of size 10^4.\n\n4.3 Rank estimation and Matrix Completion\n\nThe last problem we consider in this paper for evaluating the X-Laplacian is completion of a low rank\nmatrix from few entries. This problem has many applications including the famous collaborative\nfiltering. A problem that is closely related to it is the rank estimation from revealed entries. Indeed\nestimating the rank of the matrix is usually the first step before actually doing the matrix completion.\nThe problem is defined as follows: let Atrue = U V^T, where U \u2208 R^{n\u00d7r} and V \u2208 R^{m\u00d7r} are chosen\nuniformly at random and r \u226a \u221a(nm) is the ground-truth rank. Only few, say c\u221a(mn), entries of\nmatrix Atrue are revealed. 
That is, we are given a matrix A \u2208 R^{n\u00d7m} which contains only a subset of the entries of\nAtrue, with other elements being zero. Many algorithms have been proposed for matrix completion,\nincluding nuclear norm minimization [5] and methods based on the singular value decomposition [4]\netc. Trimming, which sets to zero all rows and columns with a large number of revealed entries, is usually\nintroduced to control the localization of singular vectors and to estimate the rank using the gap of\nsingular values [14]. Analogous to the community detection problem, trimming is not supposed to\nwork optimally when matrix A is sparse. Indeed, in [25] the authors reported that their approach based\non the Bethe Hessian outperforms trimming+SVD when the topology of revealed entries is a sparse\nrandom graph. Moreover, the authors of [25] show that the number of negative eigenvalues of the Bethe\nHessian gives a more accurate estimate of the rank of A than that based on trimming+SVD.\n\nHowever, we see that if the topology is not locally-tree-like but with some noise, for example with\nsome additional cliques, both trimming of the data matrix and Bethe Hessian perform much worse,\nreporting a wrong rank, and giving a large reconstruction error, as illustrated in Fig. 5. In the left\npanel of the figure we plot the eigenvalues of the Bethe Hessian, and singular values of the trimmed\nmatrix A with true rank rtrue = 2. We can see that both of them are continuously distributed: there\nis no clear gap in the singular values of the trimmed A, and the Bethe Hessian has lots of negative eigenvalues.\nIn this case, since matrix A could be a non-square matrix, we need to define the X-Laplacian as\nLX = [0, A; A^T, 0] \u2212 X, where [0, A; A^T, 0] denotes the symmetric block matrix with off-diagonal blocks A and A^T. The eigenvalues of LX are also plotted in Fig. 5 where one can see clearly\nthat there is a gap between the second largest eigenvalue and the third one. Thus the correct rank\ncan be estimated using the value minimizing consecutive eigenvalues, as suggested in [14].\nAfter estimating the rank of the matrix, matrix completion is done by using a local optimization\nalgorithm [27] starting from initial matrices that are obtained using the first r singular vectors of\ntrimming+SVD, the first r eigenvectors of the Bethe Hessian and of the X-Laplacian with estimated rank r, respectively.\nThe results are shown in Fig. 5 right, where we plot the probability that the obtained root mean\nsquare error (RMSE) is smaller than 10^{\u22127} as a function of the average number of revealed entries per\nrow c, for the ER random-graph topology plus noise represented by several cliques. We can see that\nthe X-Laplacian outperforms the Bethe Hessian and Trimming+SVD with c \u2265 13. Moreover, when c \u2265 18,\nonly the X-Laplacian gives an accurate completion for all instances.\n\nFigure 5: (Left:) Singular values of sparse data matrix with trimming, eigenvalues of the Bethe\nHessian and X-Laplacian. The data matrix is the outer product of two vectors of size 1000. Their\nentries are Gaussian random variables with mean zero and unit variance, so the rank of the original\nmatrix is 2. The topology of revealed observations is a random graph with average degree c = 8\nplus 10 random cliques of size 20. (Right:) Fraction of samples for which the RMSE is smaller than 10^{\u22127},\namong 100 samples of rank-3 data matrix U V^T of size 1000 \u00d7 1000, with the entries of U and V\ndrawn from a Gaussian distribution of mean 0 and unit variance. The topology of revealed entries is\nthe random graph with varying average degree c plus 10 size-20 cliques.\n\n5 Conclusion and discussion\n\nWe have presented the X-Laplacian, a general approach for detecting latent global structure in a\ngiven data matrix. 
It is a completely data-driven approach that learns different forms of regularization\nfor different data, to solve the problem of localization of eigenvectors or singular vectors. The\nmechanism for de-localizing eigenvectors during the learning of regularizations has been illustrated\nusing matrix perturbation analysis. We have validated our method using extensive numerical\nexperiments, and shown that it outperforms state-of-the-art algorithms on various inference problems\nin the sparse regime and with noise.\nIn this paper we discuss the X-Laplacian using directly the (mean-removed) data matrix A, but\nwe note that the data matrix is not the only choice for the X-Laplacian. In fact, we have tested\napproaches using several variants of A, such as the normalized data matrix \u00c3, and found that they work as\nwell. We also tried learning regularizations for the Bethe Hessian, and found that this succeeds in repairing\nthe Bethe Hessian when it has a localization problem. These results indicate that our scheme of\nregularization-learning is a general spectral approach for hard inference problems.\nA (Matlab) demo of our method can be found at http://panzhang.net.\n\nReferences\n[1] L. A. Adamic and N. Glance. The political blogosphere and the 2004 US election: divided they blog. In Proceedings of the 3rd international workshop on Link discovery, pages 36\u201343. ACM, 2005.\n\n[2] A. A. Amini, A. Chen, P. J. Bickel, E. Levina, et al. Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics, 41(4):2097\u20132122, 2013.\n\n[3] R. Bell and P. Dean. Atomic vibrations in vitreous silica. Discussions of the Faraday Society, 50:55\u201361, 1970.\n\n[4] J.-F. Cai, E. J. Cand\u00e8s, and Z. Shen. A singular value thresholding algorithm for matrix completion. 
SIAM Journal on Optimization, 20(4):1956\u20131982, 2010.\n\n[5] E. J. Cand\u00e8s and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717\u2013772, 2009.\n\n[6] A. Coja-Oghlan. Graph partitioning via adaptive spectral techniques. Combinatorics, Probability and Computing, 19:227\u2013284, 2010.\n\n[7] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborov\u00e1. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E, 84:066106, Dec 2011.\n\n[8] K.-i. Hashimoto. Zeta functions of finite graphs and representations of p-adic groups. Advanced Studies in Pure Mathematics, 15:211\u2013280, 1989.\n\n[9] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109\u2013137, 1983.\n\n[10] A. Javanmard, A. Montanari, and F. Ricci-Tersenghi. Phase transitions in semidefinite relaxations. Proceedings of the National Academy of Sciences, 113(16):E2218, 2016.\n\n[11] A. Joseph and B. Yu. Impact of regularization on spectral clustering. arXiv preprint arXiv:1312.1733, 2013.\n\n[12] B. Karrer and M. E. J. Newman. Stochastic blockmodels and community structure in networks. Phys. Rev. E, 83:016107, Jan 2011.\n\n[13] R. H. Keshavan, A. Montanari, and S. Oh. Low-rank matrix completion with noisy observations: a quantitative comparison. In Communication, Control, and Computing, 2009. Allerton 2009. 47th Annual Allerton Conference on, pages 1216\u20131222. IEEE, 2009.\n\n[14] R. H. Keshavan, S. Oh, and A. Montanari. Matrix completion from a few entries. In Information Theory, 2009. ISIT 2009. IEEE International Symposium on, pages 324\u2013328. IEEE, 2009.\n\n[15] F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborov\u00e1, and P. Zhang. Spectral redemption in clustering sparse networks. Proc. Natl. Acad. Sci. USA, 110(52):20935\u201320940, 2013.\n\n[16] C. M. Le, E. 
Levina, and R. Vershynin. Sparse random graphs: regularization and concentration of the Laplacian. arXiv preprint arXiv:1502.03049, 2015.\n\n[17] C. M. Le and R. Vershynin. Concentration and regularization of random graphs. arXiv preprint arXiv:1506.00669, 2015.\n\n[18] J. Lei, A. Rinaldo, et al. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215\u2013237, 2014.\n\n[19] U. V. Luxburg, M. Belkin, O. Bousquet, and Pertinence. A tutorial on spectral clustering. Stat. Comput, 2007.\n\n[20] L. Massouli\u00e9. Community detection thresholds and the weak Ramanujan property. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 694\u2013703. ACM, 2014.\n\n[21] E. Mossel, J. Neeman, and A. Sly. Stochastic block models and reconstruction. arXiv preprint arXiv:1202.1499, 2012.\n\n[22] A. Y. Ng, M. I. Jordan, Y. Weiss, et al. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849\u2013856, 2002.\n\n[23] T. Qin and K. Rohe. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In Advances in Neural Information Processing Systems, pages 3120\u20133128, 2013.\n\n[24] A. Saade, F. Krzakala, and L. Zdeborov\u00e1. Spectral clustering of graphs with the Bethe Hessian. In Advances in Neural Information Processing Systems, pages 406\u2013414, 2014.\n\n[25] A. Saade, F. Krzakala, and L. Zdeborov\u00e1. Matrix completion from fewer entries: Spectral detectability and rank estimation. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1261\u20131269. Curran Associates, Inc., 2015.\n\n[26] A. Saade, M. Lelarge, F. Krzakala, and L. Zdeborov\u00e1. Clustering from sparse pairwise measurements. To appear in IEEE International Symposium on Information Theory (ISIT). IEEE, arXiv:1601.06683, 2016.\n\n[27] S. G. Johnson. 
The NLopt nonlinear-optimization package, 2014.\n", "award": [], "sourceid": 301, "authors": [{"given_name": "Pan", "family_name": "Zhang", "institution": "Institute of Theoretical Physics"}]}