{"title": "Revisiting the Bethe-Hessian: Improved Community Detection in Sparse Heterogeneous Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 4037, "page_last": 4047, "abstract": "Spectral clustering is one of the most popular, yet still incompletely understood, methods for community detection on graphs. This article studies spectral clustering based on the Bethe-Hessian matrix H_r= (r^2\u22121)I_n+D\u2212rA for sparse heterogeneous graphs (following the degree-corrected stochastic block model) in a two-class setting. For a specific value r=\u03b6, clustering is shown to be insensitive to the degree heterogeneity. We then study the behavior of the informative eigenvector of H_\u03b6 and, as a result, predict the clustering accuracy. The article concludes with an overview of the generalization to more than two classes along with extensive simulations on synthetic and real networks corroborating our findings.", "full_text": "Revisiting the Bethe-Hessian: Improved Community\n\nDetection in Sparse Heterogeneous Graphs\n\nLorenzo Dall\u2019Amico\n\nGIPSA-lab, UGA, CNRS, Grenoble INP\n\nlorenzo.dall-amico@gipsa-lab.fr\n\nRomain Couillet\n\nGIPSA-lab, UGA, CNRS, Grenoble INP\n\nL2S, CentraleSup\u00e9lec, University of Paris Saclay\n\nNicolas Tremblay\n\nGIPSA-lab, UGA, CNRS, Grenoble INP\n\nAbstract\n\nSpectral clustering is one of the most popular, yet still incompletely understood,\nmethods for community detection on graphs. This article studies spectral cluster-\ning based on the Bethe-Hessian matrix Hr = (r2 \u2212 1)In + D \u2212 rA for sparse\nheterogeneous graphs (following the degree-corrected stochastic block model) in a\ntwo-class setting. For a speci\ufb01c value r = \u03b6, clustering is shown to be insensitive to\nthe degree heterogeneity. We then study the behavior of the informative eigenvector\nof H\u03b6 and, as a result, predict the clustering accuracy. 
The article concludes with an overview of the generalization to more than two classes along with extensive simulations on synthetic and real networks corroborating our findings.

1 Introduction

Network theory studies the interaction of connected systems of agents. Real networks tend to be structured in affinity classes, and the problem of clustering consists in retrieving these unknown classes from the observed pairwise interactions of the network [1]. Belief propagation (BP) is an efficient way to reconstruct communities and, under certain conditions (see [2]), was proved to give optimal reconstruction. On the negative side, BP suffers from a possibly long convergence time and a non-trivial implementation. Among the alternative clustering algorithms, spectral techniques proved particularly efficient in terms of speed and analytical tractability [3, 4, 5, 6]. In the dense regime, in particular, where the average node degree scales like the size of the network, random matrix theory [4, 7, 8] manages to predict the asymptotic spectral clustering performances and to identify transition points beyond which non-trivial asymptotic classification is achievable. This is however not the typical condition for real networks, which tend instead to be sparse. For a graph G(V, E) with |V| = n nodes, sparsity means that the average degree d does not depend on the size of the network, and in particular d ≪ n.
Both standard spectral clustering methods and their associated random matrix asymptotics collapse in this regime. As an answer, many intuitions emerged from statistical physics and led to important seminal steps.
Notably, two deeply connected matrices recently proved to overcome the problem of sparsity: the n×n Bethe-Hessian [9] H_r, with r ∈ ℝ a parameter to be fixed (the study of which is the object of the present article), and the non-symmetric non-backtracking operator B ∈ {0, 1}^{2|E|×2|E|} [10]. Both matrices were introduced and studied under the homogeneous-degree stochastic block model (SBM). Narrowing to the case of two communities, it was proved both experimentally and theoretically [11, 2, 12, 13] that, if there exists an algorithm able to detect communities better than random guessing, then these two matrices can be used to produce a non-trivial node partition. Both algorithms are said to work down to the detectability threshold.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

However, real networks are rarely homogeneous and typically follow a power-law degree distribution [14]. The results of [15, 16] generalize the above studies to heterogeneous networks, generated by degree-corrected stochastic block models (DC-SBM) [17], and suggest that both B and H_r provide also in this case non-trivial clustering down to the detectability threshold. Yet, a precise characterization of their behavior and performances is still lacking; the present article shows that some aspects of the behavior of B and H_r have indeed been overlooked.
Spectral clustering in sparse heterogeneous networks has also been tackled using various regularized Laplacian matrices [18, 19, 20] but, to our knowledge, these are not proved to operate down to the detectability threshold. These structurally different methods are discussed in the concluding remarks.
The main message of the present communication is that, under a DC-SBM setting, the choice of r in H_r proposed in [9] for the SBM setting is suboptimal.
We propose and theoretically support an improved parametrization r = ζ that allows the Bethe-Hessian H_ζ to efficiently detect communities in sparse and heterogeneous graphs. In detail, under the DC-SBM setting, a) we propose a spectral algorithm on H_ζ which performs efficiently down to the detectability threshold, with an informative eigenvector not tainted by the degree distribution (unlike in [9]); b) the algorithm is generalized to k-class clustering with a consistent estimation procedure for k; c) substantial performance improvements over the originally proposed Bethe-Hessian are testified by simulations on synthetic and real networks.
The remainder of the article is organized as follows: Section 2 argues for the optimal value r = ζ for H_r and, based on heuristic arguments, studies the behavior of the informative eigenvector of H_ζ, concluding with an explicit expression of the clustering performance; Section 3 provides an unsupervised method to estimate ζ, drawing on connections with the non-backtracking matrix B; Section 4 extends the algorithm to a k-class scenario; numerical supports are then provided in Section 5 on both synthetic and real networks; concluding remarks close the article.
Reproducibility. A Python implementation of the proposed algorithm along with codes to reproduce the results of the article are available at lorenzodallamico.github.io/codes.

2 Model and Main Results

2.1 Model setting

Consider an undirected binary graph G(V, E), with nodes V = {1, …, n} (|V| = n) and edges E ⊂ V × V (|E| = m). Let σ ∈ {−1, 1}^n be the vector of class labels, both classes being of equal size (i.e., ∑_i σ_i = 0), and let C = (cin cout; cout cin). These assumptions are meant to set the problem in a more readable symmetric scenario. Section 4 extends the results to multiple classes of possibly different sizes. In order to account both for sparsity and heterogeneity, we consider the DC-SBM as a generative model for G. Denoting A ∈ {0, 1}^{n×n} the adjacency matrix defined by A_ij = 1_{(i,j)∈E}, the DC-SBM generates edges independently according to:

P(A_ij = 1 | σ_i, σ_j, θ_i, θ_j) = θ_i θ_j C_{σ_i,σ_j}/n,   (1)

where θ = (θ_1, …, θ_n) is the vector of random intrinsic connection "probabilities" of each node. The θ_i's are assumed i.i.d. and independent of the class labels, and we impose E[θ_i] = 1, E[θ_i²] = Φ. The 1/n term bounds the degree of each node to an n-independent value, making the network sparse. Denoting c = (cin + cout)/2, the detectability condition [16] reads:

α ≡ (cin − cout)/√c ≥ 2/√Φ ≡ α_c.   (2)

For α < α_c, no algorithm can partition the nodes better than by random guess. Letting D = diag(A1) be the degree matrix, the Bethe-Hessian is defined as

H_r = (r² − 1)I_n + D − rA,   r ∈ ℝ.   (3)

This matrix was originally proposed in [9] for r = √(cΦ), which asymptotically provides non-trivial clustering down to the detectability threshold (for α > α_c). The informative eigenvector of H_r is associated with the second smallest eigenvalue and we denote it x_r^(2).
The components of x_{√(cΦ)}^(2) are however strongly tainted by the θ_i's, noticeably altering the algorithm performance.

We show here that for α ≥ α_c there exists a value ζ ≤ √(cΦ) for which the components of the second eigenvector x_ζ^(2) of H_ζ align to the labels irrespective of the θ_i's, thus largely improving the algorithm performance while maintaining detectability down to the threshold.

2.2 Informative eigenvector of H_r

In the sequel we assume that: (i) being sparse, the graph can locally be approximated by a tree [21], and therefore P(σ_∂i | σ_i) ≃ ∏_{j∈∂i} P(σ_j | σ_i), with ∂i the neighbourhood of i; (ii) n → ∞ and c is bounded by an n-independent value while being arbitrarily larger than one, i.e., n ≫ c ≫ 1.
For ease of notation we work here with D − rA rather than H_r, both having the same eigenvectors. The core of our proposed method lies in the following observation, related to the action of H_r on σ:

[(D − rA)σ]_i = d_i σ_i [1 − r (|∂_i^(s)|/d_i − |∂_i^(o)|/d_i)]   (4)

where |∂_i^(s)| (resp., |∂_i^(o)|) stands for the number of neighbors of i belonging to the same (resp., opposite) class as i. We show next that a proper choice of r can annihilate the right-hand side of (4) "on average", turning (4) into an eigenvector equation whenever the typical degrees d_i are not too small.
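As a quick numerical sanity check of (4) and of the forthcoming choice r = ζ, the sketch below (our own minimal setup, not the authors' released implementation; the sampler and variable names are ours) draws a two-class DC-SBM as in (1) with heavy-tailed θ_i's and compares the residual ‖(D − rA)σ‖ at r = ζ against r = √(cΦ):

```python
import numpy as np

def sample_dcsbm(n, cin, cout, theta, rng):
    """Two-class DC-SBM of Eq. (1): P(A_ij = 1) = theta_i theta_j C_{sigma_i,sigma_j} / n."""
    sigma = np.ones(n)
    sigma[n // 2 :] = -1                            # two equal-size classes
    C = np.where(np.equal.outer(sigma, sigma), cin, cout)
    P = np.clip(np.outer(theta, theta) * C / n, 0.0, 1.0)
    A = (rng.random((n, n)) < P).astype(float)
    A = np.triu(A, 1)
    return A + A.T, sigma                           # undirected, no self-loops

rng = np.random.default_rng(0)
n, cin, cout = 2000, 13, 3
theta = rng.uniform(3, 10, n) ** 4                  # heavy-tailed degree weights
theta /= theta.mean()                               # enforce E[theta_i] = 1
A, sigma = sample_dcsbm(n, cin, cout, theta, rng)

d = A.sum(axis=1)
c = (cin + cout) / 2
Phi = np.mean(theta**2)
zeta = (cin + cout) / (cin - cout)                  # the value derived in Eq. (6)

# sigma nearly annihilates (D - zeta A), much less so (D - sqrt(c Phi) A)
res_zeta = np.linalg.norm(d * sigma - zeta * (A @ sigma))
res_sqrt = np.linalg.norm(d * sigma - np.sqrt(c * Phi) * (A @ sigma))
print(res_zeta < res_sqrt)  # True
```

The residual at r = ζ only carries the fluctuations of |∂_i^(s)| − |∂_i^(o)| around their means, while any other r leaves a systematic term of order d_i in each component.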
To this end, we need to quantify the random variables |∂_i^(s)| and |∂_i^(o)|. From a Bayesian perspective, σ and θ are unknown parameters and A (and thus the d_i) known realizations. We may thus write

P(σ_i | σ_j, A_ij = 1) = P(σ_i, σ_j | A_ij = 1)/P(σ_j | A_ij = 1) = 2 ∫∫ P(σ_i, σ_j, θ_i, θ_j | A_ij = 1) dθ_i dθ_j
∝ ∫∫ P(A_ij = 1 | σ_i, σ_j, θ_i, θ_j) P(σ_i, σ_j, θ_i, θ_j) dθ_i dθ_j ∝ C(σ_i, σ_j),

where we used the facts that the classes are of equal size (P(σ_i) is constant), and the θ_i are i.i.d., independent of the classes with E[θ_i] = 1. Normalizing, one finally obtains P(σ_i | σ_j, A_ij = 1) = C(σ_i, σ_j)/(cin + cout), which is independent of the degree distribution. We further know that |∂_i^(s)| + |∂_i^(o)| = d_i, which is a deterministic observation. Given the locally tree-like structure of the graph, neighbors of the same node are conditionally independent (see (i)), so that |∂_i^(s)| is the sum of d_i i.i.d. Bernoulli random variables with parameter p = cin/(cin + cout). We thus obtain

E[[(D − rA)σ]_i | A] = d_i σ_i (1 − r (cin − cout)/(cin + cout)).   (5)

This equation suggests that, for the expectation of (4) to be an eigenvector equation in the large (but finite) d_i regime, r should be taken equal to

r = (cin + cout)/(cin − cout) = 2√c/α ≡ ζ_α,   (6)

with α as in (2) the proper control parameter for the clustering problem (as shown e.g., in [7, 15, 16, 22]). For simplicity of notation, the dependence on α of ζ = ζ_α will be made explicit only when relevant. Intuitively, this calculus suggests that ζ is the only value of r that ensures that H_r has an informative eigenvector not significantly tainted by the degree distribution. This claim is supported by the following two remarks.
Remark 1 (Consistency of ζ for trivial classification). In the limit of trivial clustering where cout → 0, |∂_i^(s)| and |∂_i^(o)| tend to their mean. In particular, for cout = 0, ζ = 1 and (D − ζA)σ = (D − A)σ = 0, so that σ is an exact eigenvector of H_{ζ=1} associated with its zero eigenvalue.
Remark 2 (Mapping to Ising). The original intuition behind the Bethe-Hessian matrix arises from a mapping of the community labels into the spins of an Ising Hamiltonian. The "temperature-related" parameter r guarantees a correct mapping only for r = ζ. This is elaborated in detail in Section A of the supplementary material.
Remark 3 (Disassortative networks). Although one commonly assumes an assortative model for the communities, by which cin > cout, the Bethe-Hessian matrix is oblivious of the sign of cin − cout. The case where cout > cin does not invalidate the above analysis, which then results in ζ < 0. Clustering with H_ζ is thus also valid in disassortative networks.

In practice, for a given (non-averaged) realization of the σ_i's, σ is not an exact eigenvector of H_ζ. By a perturbation analysis around σ, we next analyze the behavior of the corresponding informative eigenvector of H_ζ and theoretically predict the overlap performance.

2.3 Performance Analysis

To generalize the averaged analysis of (5), we perturb σ by a "noise" term δ and write x_ζ^(2) ≡ σ + δ. Since ζ is however maintained, the associated eigenvalue of D − ζA, which is zero in (5), now possibly deviates from zero; this eigenvalue is denoted λ_α, i.e.,

(D − ζ_α A)(σ + δ) = λ_α (σ + δ).   (7)

From Remark 1, we already know that lim_{α→√(2cin)} λ_α = 0. In the following, expectations are taken for a fixed realization of the network, i.e. E[·] ≡ E[·|A]. Writing |∂_i^s| = E[|∂_i^s|] + Δ_i and |∂_i^o| = E[|∂_i^o|] − Δ_i, where we exploited the relation |∂_i^s| + |∂_i^o| = d_i, we obtain:

[(D − ζ_α A)(σ + δ)]_i = −2ζ_α σ_i Δ_i + d_i δ_i − ζ_α ∑_{j∈∂i} δ_j.   (8)

The random variable Δ_i is a sum of d_i independent (centered) Bernoulli random variables, tending in the large c limit to a zero mean Gaussian, i.e.,

Δ_i ∼ N(0, d_i cin cout/(cin + cout)²) ≡ N(0, d_i f_α²/ζ_α²),   f_α ≡ √(cin cout)/(cin − cout) = (1/α)√(c − α²/4).   (9)

Our analysis of (8) relies on the following claim that we shall justify next.
Assumption 1. The random variables δ_i, 1 ≤ i ≤ n, are distributed as δ_i ∼ N(−µ_α σ_i, f_α² β_i²) for some µ_α ∈ ℝ depending on α only, and β_i ∈ ℝ depending on i only. Besides, the δ_i's are "weakly dependent" in the sense that E[δ_i δ_j] = E[δ_i]E[δ_j] + O(1/c).
The elements of Assumption 1 rely on the following observations:
• Weak dependence: This claim follows from the weak dependence of the Δ_i's, which results from the sparse (and thus locally tree-like) nature of the graph.
• Gaussianity: The right-hand side of (8) features 3 random variables, the leftmost being Gaussian and the rightmost the sum of d_i variables tending to an (asymptotically independent) Gaussian. It is thus reasonable that δ_i be Gaussian (so as to ensure (7)) yet not independent of Δ_i or ∑_{j∈∂i} δ_j.
• Mean of δ_i: The symmetry of the problem at hand (equal class sizes, same affinity cin for each class), along with the fact that the right-hand side of (4) vanishes in its first order approximation in d_i, suggest that the mean of δ_i does not depend in the first order on d_i but only on σ_i. The amplitude of the mean then depends on α, characterized here through µ_α.
• Variance of δ_i: The variance appears as the product of two terms: one that depends on i (β_i) and one that depends on α. This follows from assuming that the fluctuations of δ_i follow the fluctuations of Δ_i, for which the variance is similarly factorized.

Imposing the norm of the eigenvector x_ζ^(2) = σ + δ to be constant with respect to α, and the boundary condition µ_{α_c} = 1 (i.e., there is no information about the classes at the detectability threshold), we find the following explicit expressions for µ_α and β_i:

1 − µ_α = √((cΦ − ζ_α²)/(cΦ − 1)),   β_i = 2/√d_i.

Details are provided in Section B of the supplementary material. Figure 1-(a) supports the analysis by comparing this prediction to simulations for a synthetic network with power law degree distribution.

Figure 1: (a) Theoretical values of mean and variance (red line indicates 1 − µ_α ± 2f_α/√c) vs simulation (green dots) for power-law distributed θ_i's (θ_i ∼ Z⁻¹[U(3, 10)]⁴). (b) Theoretical (10) vs simulated overlap, averaged over 10 realizations, for θ_i constant (left), and power-law distributed (right). For both figures, n = 5000, cout = 6, cin = 7 → 36.

The previous line of argument provides a large dimensional approximation for the performance of spectral clustering based on the eigenvector x_ζ^(2). The performance measure of interest is the overlap, defined as Ov ≡ 2 max_{P_σ̂} [(1/n) ∑_{i=1}^n δ_{σ_i,σ̂_i} − 1/2], where σ̂ denotes the vector of estimated labels, P_σ̂ the set of permutations of the labels, and δ the Kronecker symbol (δ_ij = 1 if i = j, and 0 otherwise).
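For two balanced classes, the permutation maximum in this definition reduces to taking the better of the identity and the global label swap. A minimal helper (ours, not taken from the paper's released code) reads:

```python
import numpy as np

def overlap(sigma, sigma_hat):
    """Two-class overlap Ov = 2 max_P [ (1/n) sum_i delta_{sigma_i, sigma_hat_i} - 1/2 ],
    with P ranging over the two permutations of the labels {-1, +1}."""
    agree = np.mean(np.asarray(sigma) == np.asarray(sigma_hat))
    return 2 * max(agree, 1 - agree) - 1    # = 2 * (max_P agreement - 1/2)

sigma = np.array([1, 1, 1, 1, -1, -1, -1, -1])
print(overlap(sigma, sigma))    # 1.0 : perfect recovery
print(overlap(sigma, -sigma))   # 1.0 : labels are recovered up to a global swap
print(overlap(sigma, np.array([1, -1, 1, -1, 1, -1, 1, -1])))  # 0.0 : chance level
```

The normalization thus maps random guessing to 0 and exact recovery (up to a label swap) to 1.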
In this particularly symmetric setting, one may simply take σ̂_i = sign[(x_ζ^(2))_i], where sign is the sign function (Remark 5 underlines the necessity not to cluster based on sign in asymmetric scenarios). From the expressions of µ_α and β_i, we find that, conditionally to A,

E[Ov] ≃ (1/n) ∑_{i=1}^n erf(√((α² d_i/(8c − 2α²)) · (cΦ − ζ_α²)/(cΦ − 1)))   (10)

(proof details are provided in Section B of the supplementary material). Figure 1-(b) compares the prediction of Equation (10) to simulations on networks with θ_i = 1 constant (left) or power-law distributed (right). The observed match on this 5 000-node synthetic network is close to perfect.
As a side remark, our analysis reveals an interesting connection between H_ζ and D⁻¹A.
Remark 4 (Relation to the random walk Laplacian). Similar to A, D − A, and D^{−1/2}AD^{−1/2}, the matrix D⁻¹A is claimed inappropriate as a spectral community detection matrix for sparse graphs. This is in fact a slight overstatement: as already observed in [20], as the graph under study gets sparser, D⁻¹A still possesses one or possibly more informative eigenvectors, however not necessarily corresponding to dominant isolated eigenvalues (it was in particular noted that for the real network polblogs [23] the informative eigenvector is associated to the third and not the second largest eigenvalue). This observation is easily explained in our analysis framework. Similar to our derivation for D − ζA, the average action of D⁻¹A on the class vector σ reads E[[D⁻¹Aσ]_i | A] = σ_i/ζ and thus, for large d_i, σ is a close eigenvector to D⁻¹A, correctly predicting the existence of an informative eigenvalue also for this matrix.
However, the associated eigenvalue 1/ζ decays with increasing ζ and thus with harder detection tasks, hence explaining why the informative eigenvectors are associated with eigenvalues found deeper into the spectrum of D⁻¹A.

3 Estimating ζ

While r = ζ is more appropriate a choice than r = √(cΦ), ζ is not readily accessible (as it depends on cin − cout), unlike √(cΦ) which is easily estimated from the d_i's. To estimate ζ, we elaborate on the deep relations between the Bethe-Hessian H_r and the non-backtracking operator B ∈ ℝ^{2|E|×2|E|} defined, for all (ij), (lm) ∈ E_D, the set of directed edges of G, as B_{(ij)(lm)} = δ_jl(1 − δ_im).
When r is an eigenvalue of B, then det H_r = 0 [11, 24]. This is convenient as B only has a few isolated real eigenvalues (B is non-symmetric) that can send the associated isolated eigenvalues of H_r to zero. This provides us with two alternative methods to estimate ζ.

3.1 Exploiting the eigenvalues outside the bulk of B

It is proved in [15] that, for the DC-SBM and beyond the phase transition (α > α_c), the eigenvalues γ_1, …, γ_2m of B, decreasingly sorted in modulus, satisfy in the large n setting: γ_1 → Φ(cin + cout)/2, γ_2 → Φ(cin − cout)/2 > √γ_1 and, for i > 2, lim sup_n |γ_i| ≤ √γ_1, almost surely.
Since ζ = lim_n γ_1/γ_2, denoting ν_i(r) the eigenvalues of H_r sorted in increasing order, this result conveys the following first method to estimate ζ.

Figure 2: Superposed spectra of B for 3 values of α: n = 4000, cin = 12, 11, 10 and cout = 1, 2, 3 (cin + cout is fixed); θ with power law distribution; all eigenvalues displayed in blue except the top three dominant real ones, displayed in colors for each (cin, cout) pair.

Method 1 (First estimation of ζ).
Under the previous notations, ζ ≃ γ_1/γ_2. The eigenvalues γ_1 and γ_2 of B can be estimated by a line search over r ∈ (√ρ(B), ∞) on the changes of sign of ν_1(r) and ν_2(r), which correspond to r = γ_1 and r = γ_2, respectively.¹

3.2 Exploiting the eigenvalues inside the bulk of B

The matrix B can be obtained from the linearization of the belief propagation (BP) equations (see [10] for details). In particular, the linear expansion to first order of the beliefs around their fixed points yields Bw ≃ ζw. According to this argument, one expects the matrix B to have a real eigenvalue equal to ζ, with² ζ ≤ √(cΦ). Figure 2 visually emphasizes this eigenvalue for three different values of α, maintaining c constant. The matrix B thus has four real eigenvalues inside its main bulk: −1, 0, 1 and ζ. As the community detection problem gets harder, both ζ and γ_2 shift towards the edge of the bulk (from the left for the former and from the right for the latter) and then meet exactly at √(cΦ) when α = α_c. Then, for α < α_c, they reach the complex part of the bulk.
More fundamentally, simulations further suggest that the eigenvector associated with the null eigenvalue of H_ζ is precisely x_ζ^(2) = σ + δ studied in Section 2.3. This indicates that the informative eigenvalue λ_α of D − ζ_α A = H_{ζ_α} − (ζ_α² − 1)I_n in Equation (7) coincides with −(ζ_α² − 1). It further explains why H_{√(cΦ)}, initially proposed in [9], works well close to the detectability threshold, as ζ → √(cΦ) when α → α_c.
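Both estimation methods reduce to locating a change of sign of an eigenvalue of H_r along r. As an illustration, the sketch below (our own minimal setup: a homogeneous two-class SBM instance, scipy's eigsh and brentq; not the paper's released code) locates the change of sign of ν_2(r) inside (1, √ρ(B)), which is the estimate of ζ formalized in Method 2 below:

```python
import numpy as np
from scipy import sparse
from scipy.optimize import brentq
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(1)
n, cin, cout = 2000, 18, 2                    # easy regime, theta_i = 1 (Phi = 1)
sigma = np.ones(n)
sigma[n // 2 :] = -1
P = np.where(np.equal.outer(sigma, sigma), cin, cout) / n
A = sparse.csr_matrix(np.triu(rng.random((n, n)) < P, 1).astype(float))
A = A + A.T
d = np.asarray(A.sum(axis=1)).ravel()
D, I = sparse.diags(d), sparse.identity(n)

def nu(p, r):
    """p-th smallest eigenvalue of H_r = (r^2 - 1) I + D - r A."""
    w = eigsh((r * r - 1) * I + D - r * A, k=p, which="SA", return_eigenvectors=False)
    return np.sort(w)[p - 1]

rho_B = (d**2).sum() / d.sum()                # footnote 1: rho(B) ~ sum d_i^2 / sum d_i
zeta_hat = brentq(lambda r: nu(2, r), 1.001, np.sqrt(rho_B), xtol=1e-3)
zeta_true = (cin + cout) / (cin - cout)       # Eq. (6), known here by construction
print(zeta_hat, zeta_true)                    # the two agree on this easy instance
```

The bracket is valid because ν_2(r) is positive just above r = 1 (the bulk of H_r is positive there) and negative at r = √ρ(B) beyond the transition, so a standard root finder isolates ζ.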
We thus expect most of the improvement of the choice r = ζ to emerge in the easier scenarios.
Note that, as was already observed in [9], if |r| > 1, then the eigenvalues of the bulk of H_r are strictly positive for |r| ≠ √(cΦ). As a consequence, x_ζ^(2) is necessarily isolated when α > α_c, and so spectral clustering on H_ζ works down to the detectability threshold. To the best of our knowledge, this property is not formally proved, but we point out that it agrees with the shape of the spectrum of B: if the bulk of H_r were negative for some |r| > 1, then there would be a 'continuum' of real eigenvalues in [1, √(cΦ)] if r > 1 (in the assortative case). As this is not the case, the smallest eigenvalue in the bulk of H_r cannot be negative.
Claim 1 (Informative eigenvalue of H_{ζ_α}). The eigenvalue associated to the informative eigenvector of H_{ζ_α} is equal to zero. Equivalently, the eigenvalue λ_α associated to the informative eigenvector of D − ζ_α A is given by λ_α = −(ζ_α² − 1) = −4f_α², which vanishes for cout → 0.
This claim gives rise to a second method to estimate ζ.

¹The spectral radius of the matrix B, ρ(B), can be estimated as ρ(B) ≃ ∑_i d_i²/∑_i d_i.
²This eigenvalue is visible in [10, 11] but not commented upon.

Figure 3: Overlap comparison as a function of α, using the second smallest eigenvector of H_r, for different values of r. In color code the values of r, ranging from r = 1 (blue) to r = cΦ (yellow). The red squares indicate r = (cin − cout)Φ/2, which is equivalent to clustering with the matrix B [10], the purple hexagons represent the Bethe-Hessian of [9], the green diamonds are the proposed Algorithm 1 and the blue crosses are the graph Laplacian. In the top left corner, a zoom of the overlap close to the transition.
For these simulations, n = 5000, cin : 15 → 9.4, cout : 1 → 6.6 (while keeping c fixed), θ_i ∼ [U(3, 10)]⁴.

Method 2 (Second estimation of ζ). Under the previous notations, ν_2(ζ) = 0. The parameter ζ then corresponds to the position of the change of sign of ν_2(r) in the set r ∈ (1, √ρ(B)).

4 Extension to multiple uneven-sized classes

The analysis performed in the previous sections is resilient to heterogeneous degree distributions and can be generalized to k uneven-sized classes, with last clustering step by k-means. To this end, let Π ∈ ℝ^{k×k} be diagonal with Π_ii the fraction of nodes in class i, and assume CΠ1 = c1. This assumption is a standard hypothesis [10, 22, 11, 25] which ensures that the averaged node connectivity is independent of the class. For 1 ≤ p ≤ k, let (τ_p, v^(p)) be the p-th largest eigenpair of CΠ, and let u^(p) ∈ ℝ^n be defined as u_i^(p) = v_{ℓ_i}^(p) for all 1 ≤ i ≤ n, with ℓ_i the class of node i. The vector u^(p) contains plateaus with heights corresponding to the values of v^(p). Repeating the arguments of Section 2 (see details in Section C of the supplementary material), we obtain k choices for r:

E[[(D − rA)u^(p)]_i] = d_i u_i^(p) (1 − r τ_p/c)   and thus   r = c/τ_p ≡ ζ_p,   1 ≤ p ≤ k.   (11)

Since the largest eigenpair (c, 1) of CΠ is not informative of the class structure, only the k − 1 next largest eigenvectors v^(p) of CΠ are informative. The vector u^(p) (corresponding to the p-th largest eigenvalue τ_p) is in one-to-one mapping with v^(p) and corresponds to the p-th smallest value of ζ_p = c/τ_p. Considering r = √(cΦ), all the informative eigenvalues of H_r are negative [9].
By decreasing r they progressively become positive: for r = ζ_k (the largest among the ζ_p) the k-th smallest eigenvalue is the first to hit zero. By further decreasing r, all the informative eigenvalues follow, until r = ζ_1 = 1, for which the smallest eigenvalue is null. We conclude that u^(p) is associated with the p-th smallest eigenvector x_{ζ_p}^(p) of H_{ζ_p}.
Method 1 and Method 2 both generalize to this scenario. In particular, the outer eigenvalues of B converge as γ_p → τ_p Φ, and the linearization of BP retrieves the fixed points as ζ_p = c/τ_p.
For k > 2, the value r = √(cΦ) still plays an important role. It was chosen in [9] because, asymptotically, for this value of r only the informative eigenvalues of H_{√(cΦ)} are negative. The number of classes is then directly obtained from counting the number of negative eigenvalues of H_{√(cΦ)}. The relation between H_r and B further guarantees that the number of isolated eigenvalues of B (hence of H_r) is asymptotically equal to the number of detectable classes.

Figure 4: (a) Comparison of spectral clustering for θ_i = 1 (left) and with power law distribution θ_i ∼ Z⁻¹[U(3, 10)]⁴ (right). "D⁻¹A best" indicates spectral clustering on the best (among the first 25) eigenvector of D⁻¹A. Here, n = 5000, cout = 1, cin = 2 → 16, averaged over 10 samples; the error bars indicate one standard deviation. (b) x_ζ^(2) (top) and x_{√(cΦ)}^(2) (bottom) for power law distributed θ_i (left), and for θ_i = θ_0 for i ≤ n/4 and n/2 ≤ i ≤ 3n/4, and θ_i = 4θ_0 otherwise (right).

Remark 5 (On k-means versus sign-based clustering). Under a symmetric 2-class setting with classes of equal size, the classification of the entries of the informative eigenvector of H_r can be performed based on their signs.
This sign-based method first does not generalize to more than two classes or to uneven-sized classes, where k-means or expectation-maximization based clustering is required. But it also hides the fact that the eigenvector entries may be quite concentrated around zero (close to 0⁺ or 0⁻ according to the class) and thus not clustered, a situation where k-means has no discriminative power. Simulations (and reported results in [9] based on signs rather than k-means) suggest that the informative eigenvector of H_{√(cΦ)} precisely suffers this condition. We have demonstrated here instead that the informative eigenvector of H_ζ has the convenient feature of being genuinely clustered.

5 Experimental validation

Our results can be summarized by Algorithm 1, where we recall that ν_p(r) is the p-th smallest eigenvalue of H_r and where x_r^(p) indicates the corresponding eigenvector.
Figure 3 depicts the overlap, as a function of α, of the output of a two-class k-means on the informative eigenvector of H_r, for different values of r, ranging from 1 to cΦ. When α is large enough, small values of r lead to better partitions than large values of r, which are more affected by degree heterogeneity. However, for small r, the informative eigenvector does not necessarily correspond to the second smallest eigenvalue, leading to a meaningless partition. On the contrary, larger values of r show isolated eigenvectors also in the "hard regime". We recall that r = ζ is an α-dependent parameter: for α → α_c, ζ is "large enough" so that the informative eigenvalue is isolated, while for α → √(2cin) it is "small enough" to give good partitions. Also the value r = (cin − cout)Φ/2 is α-dependent, and it corresponds to clustering with B as indicated in [10].
While it gives good partitions very close to the transition, this choice of r seems largely sub-optimal for easier tasks.

Figure 4-(a) compares the overlaps obtained with Algorithm 1 versus related spectral clustering methods based on H_{√(cΦ)}, D^{-1}A and A. In accordance with Remark 5, k-means clustering (rather than sign-based) on the informative eigenvectors is systematically performed. For θ_i = 1, the left display recovers the results of [9], evidencing a strong advantage of H_r over Laplacian methods. Since the degrees are similar, both r = √(cΦ) and r = ζ induce similar H_r performances. The improvement provided by H_ζ arises in the right display for power-law distributed θ_i, with most of the gain appearing away from the detection threshold. Both displays also depict the performance of D^{-1}A based on its second largest eigenvector and on an oracle choice of the informative eigenvector with maximal overlap. These curves confirm Remark 4 on the non-dominant position of the informative eigenvector of D^{-1}A in hard tasks.^3 Figure 4-(b) depicts the informative eigenvectors of H_{√(cΦ)} and H_ζ, demonstrating the negative impact of the θ_i on H_{√(cΦ)}, in stark contrast with the resilience of H_ζ.

^3 The low performance of D^{-1}A, even in an oracle setting, can be attributed to the high density of eigenvalues in the bulk of the spectrum, which induces a "dispersion" of the informative eigenvectors onto the eigenvectors associated with neighboring eigenvalues.
The class information is thus "spread" across several eigenvectors.

Algorithm 1 Improved Bethe-Hessian Community Detection
1: Input: adjacency matrix of the undirected graph G
2: Detect the number of classes: k̂ ← |{i : ν_i(√(cΦ)) < 0}|
3: for 2 ≤ p ≤ k̂ do
4:     ζ_p ← r such that ν_p(r) = 0
5:     X_p ← x^(p)_{ζ_p}
6: Estimate the community labels ℓ̂ as the output of k̂-class k-means on the rows of X = [X_2, . . . , X_k̂].
return: estimated number k̂ of communities and label vector ℓ̂.

Table 1 next provides a comparison of the algorithm performances on real networks, both labelled and unlabelled, confirming the overall superiority of Algorithm 1, quite unlike H_{√(cΦ)}, which fails on several examples.^4

Labelled networks (k known, overlap comparison):
Network        | n    | k  | Alg. 1 | H_{√(cΦ)} | A
Karate [28]    | 34   | 2  | 1.00   | 1.00      | 1.00
Dolphins [29]  | 62   | 2  | 0.97   | 0.87      | 0.65
Polbooks [30]  | 105  | 3  | 0.77   | 0.74      | 0.57
Football [31]  | 115  | 12 | 0.92   | 0.92      | 0.92
Polblogs [23]  | 1221 | 2  | 0.91   | 0.32      | 0.26

Unlabelled networks [32] (k estimated, modularity comparison):
Network    | n    | k̂  | Alg. 1 | H_{√(cΦ)} | A
Mail       | 1133 | 21 | 0.40   | 0.92      | 0.32
Facebook   | 4039 | 65 | 0.50   | 0.15      | 0.38
Power grid | 4941 | 53 | 0.77   | 0.34      | 0.31
Gnutella   | 6301 | 5  | 0.48   | 0.18      | 0.14
Wikipedia  | 7115 | 21 | 0.61   | 0.21      | 0.15

Table 1: Performance comparison on real networks: labelled datasets with k known (overlap comparison) and unlabelled networks [32] with k estimated (modularity comparison). Only assortative features are taken into account.

6 Concluding Remarks

Beyond the demonstration of the superiority of H_ζ over H_{√(cΦ)}, originally proposed in [9], the article provides a consistent understanding of the natural limitations and strengths of the wide class of spectral clustering methods involving combinations of A and D.

Yet, other methods, the performances of which cannot always be compared on even grounds, have been proposed in the literature that marginally relate to the present study. This is notably the case of [18], which performs spectral clustering on L_τ = (D + τI_n)^{-1/2} A (D + τI_n)^{-1/2} (with a proposed choice τ = c), aiming to neutralize the deleterious effects of small d_i. Although it evidently affects the spectrum (and thus the informative structure) of A through the non-linear normalization, simulations on L_τ suggest performances competitive with those of H_ζ in almost all studied examples. A systematic analysis of this and similarly proposed methods in the literature is clearly called for.

Despite its demonstrated significant performance improvement, Algorithm 1 suffers from a slightly larger computational cost than most competing methods (O(nk³) instead of the usual O(nk²) complexity in the case of sparse graphs) due to the successive estimations of ζ.
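As an illustration of these successive estimations, the core of Algorithm 1 reduces to a one-dimensional root search on ν_p(r). The following is a minimal dense-matrix sketch under our own assumptions: the helper names are hypothetical, √(cΦ) is crudely estimated from the degree sequence, and a practical implementation would instead rely on sparse eigensolvers.

```python
# Sketch of Algorithm 1 (hypothetical helper names; dense linear algebra
# for readability -- a practical version would use sparse eigensolvers).
import numpy as np

def bethe_hessian(A, r):
    """H_r = (r^2 - 1) I_n + D - r A."""
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))
    return (r ** 2 - 1.0) * np.eye(n) + D - r * A

def nu(A, r, p):
    """p-th smallest eigenvalue of H_r (p = 1, 2, ...)."""
    return np.linalg.eigvalsh(bethe_hessian(A, r))[p - 1]

def find_zeta(A, p, r_max, tol=1e-8):
    """Bisection for zeta_p on (1, r_max]: nu_p(r) > 0 near r = 1 and
    nu_p(r_max) < 0 for an informative p, so a sign change locates the
    root of nu_p(r) = 0."""
    lo, hi = 1.0, r_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if nu(A, mid, p) > 0:
            lo = mid   # still below zeta_p
        else:
            hi = mid   # at or above zeta_p
    return 0.5 * (lo + hi)

def bh_embedding(A):
    """Estimate k as the number of negative eigenvalues of H_{sqrt(c*Phi)},
    then stack the informative eigenvectors x^(p)_{zeta_p}, p = 2..k.
    Assumption: c*Phi is estimated from the degrees as mean(d^2)/mean(d) - 1."""
    d = A.sum(axis=1)
    r_max = np.sqrt((d ** 2).mean() / d.mean() - 1.0)
    k = int((np.linalg.eigvalsh(bethe_hessian(A, r_max)) < 0).sum())
    cols = []
    for p in range(2, k + 1):
        zeta_p = find_zeta(A, p, r_max)
        _, vecs = np.linalg.eigh(bethe_hessian(A, zeta_p))
        cols.append(vecs[:, p - 1])  # eigenvector of the p-th smallest eigenvalue
    X = np.column_stack(cols) if cols else np.zeros((A.shape[0], 0))
    return k, X  # rows of X are then fed to k-class k-means (step 6)
```

The bisection exploits that, on (1, √(cΦ)], each informative ν_p(r) changes sign exactly at r = ζ_p; each of the k̂ − 1 root searches costs an eigendecomposition per step, which is the source of the O(nk³) overhead mentioned above.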
We are currently working on improving this computation time.

From a theoretical standpoint, the requirement c ≫ 1 is still inappropriate for many practical networks. A first consequence of smaller values of c is the loss of Gaussianity of the eigenvector entries, as already evidenced in Figures 1 and 4, where Gaussianity is clearly lost in the easiest tasks in favor of a "one-sided" distribution. This suggests further improvement of our analysis framework, along with the development of algorithms more appropriate than k-means to handle the last clustering step.

Acknowledgments

This work is supported by the ANR Project RMT4GRAPH (ANR-14-CE28-0006), the IDEX GSTATS Chair at University Grenoble Alpes and by CNRS PEPS I3A (Project RW4SPEC). The authors thank Jean-Louis Barrat for fruitful discussions.

^4 In Table 1, the modularity is defined as M = (1/(2|E|)) Σ_{i,j=1}^n (A_ij − d_i d_j/(2|E|)) δ(ℓ̂_i, ℓ̂_j); see, e.g., [26, 27].

References

[1] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.

[2] Elchanan Mossel, Joe Neeman, and Allan Sly. Belief propagation, robust reconstruction and optimal recovery of block models. In Conference on Learning Theory, pages 356–370, 2014.

[3] Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915, 2011.

[4] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[5] Lennart Gulikers, Marc Lelarge, and Laurent Massoulié. A spectral method for community detection in moderately sparse degree-corrected stochastic block models. Advances in Applied Probability, 49(3):686–721, 2017.

[6] J. Lei and A. Rinaldo. Consistency of spectral clustering in stochastic block models.
The Annals of Statistics, 43(1):215–237, 2015.

[7] Raj Rao Nadakuditi and Mark EJ Newman. Graph spectra and the detectability of community structure in networks. Physical Review Letters, 108(18):188701, 2012.

[8] Hafiz Tiomoko Ali and Romain Couillet. Random matrix improved community detection in heterogeneous networks. In 2016 50th Asilomar Conference on Signals, Systems and Computers, pages 1385–1389. IEEE, 2016.

[9] Alaa Saade, Florent Krzakala, and Lenka Zdeborová. Spectral clustering of graphs with the Bethe Hessian. In Advances in Neural Information Processing Systems, pages 406–414, 2014.

[10] Florent Krzakala, Cristopher Moore, Elchanan Mossel, Joe Neeman, Allan Sly, Lenka Zdeborová, and Pan Zhang. Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences, 110(52):20935–20940, 2013.

[11] Charles Bordenave, Marc Lelarge, and Laurent Massoulié. Non-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), pages 1347–1357. IEEE, 2015.

[12] Laurent Massoulié. Community detection thresholds and the weak Ramanujan property. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 694–703. ACM, 2014.

[13] Elchanan Mossel, Joe Neeman, and Allan Sly. Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, 162(3-4):431–461, 2015.

[14] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

[15] Lennart Gulikers, Marc Lelarge, and Laurent Massoulié.
Non-backtracking spectrum of degree-corrected stochastic block models. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), volume 67 of Leibniz International Proceedings in Informatics (LIPIcs), pages 44:1–44:27, 2017.

[16] Lennart Gulikers, Marc Lelarge, and Laurent Massoulié. An impossibility result for reconstruction in the degree-corrected stochastic block model. The Annals of Applied Probability, 28(5):3002–3027, 2018.

[17] Brian Karrer and Mark EJ Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107, 2011.

[18] Tai Qin and Karl Rohe. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In Advances in Neural Information Processing Systems, pages 3120–3128, 2013.

[19] Can M Le, Elizaveta Levina, and Roman Vershynin. Concentration and regularization of random graphs. Random Structures & Algorithms, 51(3):538–561, 2017.

[20] Antony Joseph and Bin Yu. Impact of regularization on spectral clustering. arXiv preprint arXiv:1312.1733, 2013.

[21] Amir Dembo and Andrea Montanari. Gibbs measures and phase transitions on sparse random graphs. Brazilian Journal of Probability and Statistics, 24(2):137–211, 2010.

[22] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011.

[23] Lada A Adamic and Natalie Glance. The political blogosphere and the 2004 US election: divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, pages 36–43. ACM, 2005.

[24] Audrey Terras. Zeta Functions of Graphs: A Stroll Through the Garden, volume 128. Cambridge University Press, 2010.

[25] Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: thresholds and algorithms.
Advances in Physics, 65(5):453–552, 2016.

[26] Mark EJ Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004.

[27] M. E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103:8577–8582, 2006.

[28] Wayne W Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4):452–473, 1977.

[29] David Lusseau, Karsten Schneider, Oliver J Boisseau, Patti Haase, Elisabeth Slooten, and Steve M Dawson. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behavioral Ecology and Sociobiology, 54(4):396–405, 2003.

[30] http://www.orgnet.com/.

[31] Michelle Girvan and Mark EJ Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.

[32] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.