{"title": "Achieving the KS threshold in the general stochastic block model with linearized acyclic belief propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 1334, "page_last": 1342, "abstract": "The stochastic block model (SBM) has long been studied in machine learning and network science as a canonical model for clustering and community detection. In the recent years, new developments have demonstrated the presence of threshold phenomena for this model, which have set new challenges for algorithms. For the {\\it detection} problem in symmetric SBMs, Decelle et al.\\ conjectured that the so-called Kesten-Stigum (KS) threshold can be achieved efficiently. This was proved for two communities, but remained open from three communities. We prove this conjecture here, obtaining a more general result that applies to arbitrary SBMs with linear size communities. The developed algorithm is a linearized acyclic belief propagation (ABP) algorithm, which mitigates the effects of cycles while provably achieving the KS threshold in $O(n \\ln n)$ time. This extends prior methods by achieving universally the KS threshold while reducing or preserving the computational complexity. ABP is also connected to a power iteration method on a generalized nonbacktracking operator, formalizing the spectral-message passing interplay described in Krzakala et al., and extending results from Bordenave et al.", "full_text": "Achieving the KS threshold in the general stochastic\nblock model with linearized acyclic belief propagation\n\nApplied and Computational Mathematics and EE Dept.\n\nEmmanuel Abbe\n\nPrinceton University\n\neabbe@princeton.edu\n\nColin Sandon\n\nDepartment of Mathematics\n\nPrinceton University\n\nsandon@princeton.edu\n\nAbstract\n\nThe stochastic block model (SBM) has long been studied in machine learning and\nnetwork science as a canonical model for clustering and community detection. 
In recent years, new developments have demonstrated the presence of threshold phenomena for this model, which have set new challenges for algorithms. For the detection problem in symmetric SBMs, Decelle et al. conjectured that the so-called Kesten-Stigum (KS) threshold can be achieved efficiently. This was proved for two communities, but remained open for three and more communities. We prove this conjecture here, obtaining a general result that applies to arbitrary SBMs with linear size communities. The developed algorithm is a linearized acyclic belief propagation (ABP) algorithm, which mitigates the effects of cycles while provably achieving the KS threshold in O(n ln n) time. This extends prior methods by achieving the KS threshold universally while reducing or preserving the computational complexity. ABP is also connected to a power iteration method on a generalized nonbacktracking operator, formalizing the spectral-message passing interplay described in Krzakala et al., and extending results from Bordenave et al.

1 Introduction

The stochastic block model (SBM) is widely used as a model for community detection and as a benchmark for clustering algorithms. The model emerged in multiple scientific communities: in machine learning and statistics as the SBM [1, 2, 3, 4], in computer science as the planted partition model [5, 6, 7], and in mathematics as the inhomogeneous random graph model [8]. Although the model was defined as far back as the 80s, mainly studied for the exact recovery problem, it resurged in recent years due in part to fascinating conjectures on the detection problem, established in [9] (and backed in [10]) from deep but non-rigorous statistical physics arguments. For efficient algorithms, the following was conjectured:

Conjecture 1. 
(See formal definitions below.) In the stochastic block model with n vertices, k balanced communities, edge probability a/n inside the communities and b/n across, it is possible to detect communities in polynomial time if and only if

(a - b)^2 / (k(a + (k - 1)b)) > 1.    (1)

In other words, the problem of efficiently detecting communities is conjectured to have a sharp threshold at the above, which is called the Kesten-Stigum (KS) threshold. Establishing such thresholds is of primary importance for the development of algorithms. A prominent example is Shannon's coding theorem, which gives a sharp threshold for coding algorithms at the channel capacity, and which has led to the development of coding algorithms used in communication standards. In the area of

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

clustering, where establishing rigorous benchmarks is a challenge, the quest for sharp thresholds is likely to also have fruitful outcomes.

Interestingly, classical clustering algorithms do not seem to suffice for achieving the threshold in (1). This includes spectral methods based on the adjacency matrix or Laplacians, as well as SDPs. For standard spectral methods, a first issue is that the fluctuations in the node degrees produce high-degree nodes that disrupt the eigenvectors from concentrating on the clusters. This issue is further enhanced on real networks where degree variations are important. A classical trick is to trim such high-degree nodes [11, 12], throwing away some information, but this does not seem to suffice. SDPs are a natural alternative, but they also stumble before the threshold [13, 14], focusing on the most likely rather than typical clusterings.

Significant progress has already been achieved on Conjecture 1. 
In particular, the conjecture is settled for k = 2, with the achievability part proved in [15, 16] and [17], and the impossibility part in [10]. Achievability results were also obtained in [17] for SBMs with multiple communities that satisfy a certain asymmetry condition (see Theorem 5 in [17]). Conjecture 1 remained open for k ≥ 3.

In their original paper [9], Decelle et al. conjectured that belief propagation (BP) achieves the KS threshold. The main issue when applying BP to the SBM is the classical one: the presence of cycles in the graph makes the behavior of the algorithm difficult to understand, and BP is susceptible to settling in wrong fixed points. While empirical studies of BP on loopy graphs have shown that convergence still takes place in some cases [18], obtaining rigorous results in the context of loopy graphs remains a long-standing challenge for message passing algorithms, and achieving the KS threshold requires precisely running BP to an extent where the graph is not even tree-like. We address this challenge in the present paper, with a linearized version of BP that mitigates cycles.

Note that establishing formally the converse of Conjecture 1 (i.e., that efficient detection is impossible below the threshold) for arbitrary k seems out of reach at the moment, as the problem behaves very differently for small rather than arbitrary k. Indeed, except for a few low values of k, it is proven in [19, 20] that the threshold in (1) does not coincide with the information-theoretic threshold. Since it is possible to detect below the threshold with inefficient algorithms, proving formally the converse of Conjecture 1 would require major headway in complexity theory. On the other hand, [9] already provides non-rigorous arguments that the converse holds.

1.1 This paper

This paper proves the achievability part of Conjecture 1. 
Our main result applies to a more general context, with a generalized notion of detection that applies to arbitrary SBMs. In particular,

• we show that an approximate belief propagation (ABP) algorithm that mitigates cycles achieves the KS threshold universally. The simplest linearized¹ version of BP is to repeatedly update beliefs about a vertex's community based on its neighbors' suspected communities while avoiding backtracking. However, this works ideally only if the graph is a tree. The correct response to a cycle would be to discount information reaching the vertex along either branch of the cycle, to compensate for the redundancy of the two branches. Due to computational issues, we simply prevent information from cycling around constant size cycles.

• we show how ABP can be interpreted as a power iteration method on a generalized r-nonbacktracking operator, i.e., a spectral algorithm that uses a matrix counting the number of r-nonbacktracking walks rather than the adjacency matrix. The random initialization of the beliefs in ABP corresponds to the random vector to which the power iteration is applied, formalizing the connection described in [22]. While using r = 2 backtracks may suffice to achieve the threshold, larger backtracks are likely to help mitigate the presence of small loops in networks.

Our results are closest to [16, 17], while diverging in several key parts. A few technical expansions in the paper are similar to those carried out in [16], such as the weighted sums over nonbacktracking walks and the SAW decomposition in [16], similar to our compensated nonbacktracking walk counts and Shard decomposition. Our modifications are developed to cope with the general SBM, in particular to compensate for the dominant eigenvalues in the latter setting. 
Our algorithm complexity is also slightly reduced by a logarithmic factor.

¹Other forms of approximate message passing algorithms have been studied for dense graphs, in particular [21] for compressed sensing.

Our algorithm is also closely related to [17], which focuses on extracting the eigenvectors of the standard nonbacktracking operator. Our proof technique is different from the one in [17], so that we can cope with the setting of Conjecture 1. We also implement the eigenvector extractions in a belief propagation fashion. Another difference with [17] is that we rely on nonbacktracking operators of higher orders r. While r = 2 is arguably the simplest implementation and may suffice for the sole purpose of achieving the KS threshold, a larger r is likely to be beneficial in practice. For example, an adversary may add triangles for which ABP with r = 2 would fail while larger r would succeed. Finally, the approach of ABP can be extended beyond the linearized setting to move from detection to an optimal accuracy, as discussed in Section 5.

2 Results

2.1 A general notion of detection

The stochastic block model (SBM) is a random graph model with clusters defined as follows.

Definition 1. For k ∈ Z+, a probability distribution p ∈ (0, 1)^k, a k × k symmetric matrix Q with nonnegative entries, and n ∈ Z+, we define SBM(n, p, Q/n) as the probability distribution over ordered pairs (σ, G) of an assignment of vertices to one of k communities and an n-vertex graph, generated by the following procedure. First, each vertex v ∈ V(G) is independently assigned a community σ_v under the probability distribution p. Then, for every v ≠ v', an edge is drawn in G between v and v' with probability Q_{σ_v,σ_{v'}}/n, independently of other edges. 
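Definition 1's generative procedure translates directly into a sampler; the following is a minimal sketch (the function name and parameter values are illustrative, not from the paper):

```python
import numpy as np

def sample_sbm(n, p, Q, rng):
    """Sample (sigma, G) from SBM(n, p, Q/n) as in Definition 1.

    p is a probability vector over k communities, Q a symmetric k x k
    matrix with nonnegative entries; each pair v != v' is connected
    independently with probability Q[sigma_v, sigma_v'] / n.
    """
    k = len(p)
    sigma = rng.choice(k, size=n, p=p)              # independent community labels
    probs = Q[np.ix_(sigma, sigma)] / n             # pairwise edge probabilities
    upper = np.triu(rng.random((n, n)) < probs, 1)  # draw each unordered pair once
    A = upper | upper.T                             # symmetric adjacency matrix
    return sigma, A

rng = np.random.default_rng(1)
Q = np.array([[8.0, 2.0], [2.0, 8.0]])              # illustrative parameters
sigma, A = sample_sbm(500, np.array([0.5, 0.5]), Q, rng)
```

With these parameters the expected degree is (8 + 2)/2 = 5, so the sampled graph is sparse, matching the regime of Conjecture 1.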
We sometimes say that G is drawn under SBM(n, p, Q/n) without specifying σ, and define Ω_i = {v : σ_v = i}.

Definition 2. The SBM is called symmetric if p is uniform and if Q takes the same value on the diagonal and the same value off the diagonal.

Our goal is to find an algorithm that can distinguish between vertices from one community and vertices from another community in a non-trivial way.

Definition 3. Let A be an algorithm that takes a graph as input and outputs a partition of its vertices into two sets. A solves detection (or weak recovery) in graphs drawn from SBM(n, p, Q/n) if there exists ε > 0 such that the following holds. When (σ, G) is drawn from SBM(n, p, Q/n) and A(G) divides its vertices into S and S^c, with probability 1 - o(1) there exist i, j ∈ [k] such that |Ω_i ∩ S|/|Ω_i| - |Ω_j ∩ S|/|Ω_j| > ε.

In other words, an algorithm solves detection if it divides the graph's vertices into two sets such that vertices from different communities have different probabilities of being assigned to one of the sets. An alternate definition (see for example Decelle et al. [9]) requires the algorithm to divide the vertices into k sets such that there exists ε > 0 for which there is an identification of the sets with the communities labelling at least a max_i p_i + ε fraction of the vertices correctly with high probability. In the 2-community symmetric case, the two definitions are equivalent. In a two community asymmetric case where p = (.2, .8), an algorithm that could find a set containing 2/3 of the vertices from the large community and 1/3 of the vertices from the small community would satisfy Definition 3; however, it would not satisfy the previous definition due to the biased prior. If all communities have the same size, this distinction is meaningless and we have the following equivalence:

Lemma 1. 
Let k > 0, Q be a k × k symmetric matrix with nonnegative entries, p be the uniform distribution over k sets, and A be an algorithm that solves detection in graphs drawn from SBM(n, p, Q/n). Then A also solves detection according to Decelle et al.'s criterion [9], provided that we consider it as returning k - 2 empty sets in addition to its actual output.

Proof. Let (σ, G) ~ SBM(n, p, Q/n) and A(G) return S and S'. There exists ε > 0 such that with high probability (whp) there exist i, j such that |Ω_i ∩ S|/|Ω_i| - |Ω_j ∩ S|/|Ω_j| > ε. So, if we map S to community i and S' to community j, the algorithm classifies at least |Ω_i ∩ S|/n + |Ω_j ∩ S'|/n = |Ω_j|/n + |Ω_i ∩ S|/n - |Ω_j ∩ S|/n ≥ 1/k + ε/k - o(1) of the vertices correctly whp.

2.2 Achieving efficiently and universally the KS threshold

Given parameters p and Q for the SBM, let P be the diagonal matrix such that P_{i,i} = p_i for each i ∈ [k]. Also, let λ_1, ..., λ_h be the distinct eigenvalues of PQ in order of nonincreasing magnitude.

Definition 4. The signal-to-noise ratio of SBM(n, p, Q/n) is defined by SNR := λ_2^2/λ_1.

Theorem 1. Let k ∈ Z+, p ∈ (0, 1)^k be a probability distribution, Q be a k × k symmetric matrix with nonnegative entries, and G be drawn under SBM(n, p, Q/n). 
If SNR > 1, then there exist r ∈ Z+, c > 0, and m : Z+ → Z+ such that ABP(G, m(n), r, c, (λ_1, ..., λ_h)), described in the next section, solves detection and runs in O(n log n) time.

For the symmetric SBM, this settles the achievability part of Conjecture 1, as the condition SNR > 1 reads in this case SNR = ((a - b)/k)^2 / ((a + (k - 1)b)/k) = (a - b)^2/(k(a + (k - 1)b)) > 1.

3 The linearized acyclic belief propagation algorithm (ABP)

3.1 Vanilla version

We first present a simplified version of our algorithm that captures its essence while avoiding the technicalities required for the proof, which is described in Section 3.3.

ABP*(G, m, r, λ_1):

1. For each vertex v, randomly draw x_v with a Normal distribution. For all adjacent v, v' in G, set y^(1)_{v,v'} = x_{v'}, and set y^(t)_{v,v'} = 0 whenever t < 1.

2. For each 1 < t ≤ m, set

z^(t-1)_{v,v'} = y^(t-1)_{v,v'} - (1/(2|E(G)|)) Σ_{(v'',v''') ∈ E(G)} y^(t-1)_{v'',v'''}    (2)

for all adjacent v, v'. For each adjacent v, v' that are not part of a cycle of length r or less, set

y^(t)_{v,v'} = Σ_{v'' : (v',v'') ∈ E(G), v'' ≠ v} z^(t-1)_{v',v''},

and for the other adjacent v, v' in G, let the other vertex in the cycle that is adjacent to v be v''', the length of the cycle be r', and set

y^(t)_{v,v'} = Σ_{v'' : (v',v'') ∈ E(G), v'' ≠ v} z^(t-1)_{v',v''} - Σ_{v'' : (v,v'') ∈ E(G), v'' ≠ v', v'' ≠ v'''} z^(t-r')_{v,v''},

unless t = r', in which case set y^(t)_{v,v'} = Σ_{v'' : (v',v'') ∈ E(G), v'' ≠ v} z^(t-1)_{v',v''} - z^(1)_{v''',v}.

3. Set y'_v = Σ_{v' : (v',v) ∈ E(G)} y^(m)_{v,v'} for every v ∈ G and return ({v : y'_v > 0}, {v : y'_v ≤ 0}).

Remarks. (1) In the r = 2 case, one can exit step 2 after the second line. As mentioned above, we rely on a less compact version of the algorithm to prove the theorem, but expect that the above also succeeds at detection as long as m > 2 ln(n)/ln(SNR).

(2) What the algorithm does if (v, v') is in multiple cycles of length r or less is unspecified, as there is no such edge with probability 1 - o(1) in the sparse SBM. This can be modified for more general settings, applying the adjustment independently for each such cycle, setting

y^(t)_{v,v'} = Σ_{v'' : (v',v'') ∈ E(G), v'' ≠ v} z^(t-1)_{v',v''} - Σ_{r'=1}^{r} Σ_{v''' : (v,v''') ∈ E(G)} C^(r')_{v''',v,v'} Σ_{v'' : (v,v'') ∈ E(G), v'' ≠ v', v'' ≠ v'''} z^(t-r')_{v,v''},

where C^(r')_{v''',v,v'} denotes the number of length-r' cycles that contain v''', v, v' as consecutive vertices.

(3) The purpose of setting z^(t-1)_{v,v'} as in step 2 is to ensure that the average value of the y^(t) is approximately 0, and thus that the eventual division of the vertices into two sets is roughly even. An alternate way of doing this is to simply let z^(t-1)_{v,v'} = y^(t-1)_{v,v'} and then compensate for any bias of y^(t) towards positive or negative values at the end. More specifically, define Y to be the n × m matrix such that for all t and v, Y_{v,t} = Σ_{v' : (v',v) ∈ E(G)} y^(t)_{v,v'}, and M to be the m × m matrix such that M_{i,i} = 1 and M_{i,i+1} = -λ_1 for all i, with all other entries of M equal to 0. Then set y' = Y M^{m'} e_m, where e_m ∈ R^m denotes the unit vector with 1 in the m-th entry, and m' is a suitable integer.

3.2 Spectral implementation

One way of looking at this algorithm for r = 2 is the following. 
Given a vertex v in community i, the expected number of vertices v' in community j that are adjacent to v is approximately e_j · PQ e_i. For any such v', the expected number of vertices in community j' that are adjacent to v', not counting v, is approximately e_{j'} · PQ e_j, and so on. In order to explore this connection, define the graph's nonbacktracking walk matrix W as the 2|E(G)| × 2|E(G)| matrix such that for all v ∈ V(G) and all distinct v' and v'' adjacent to v, W_{(v,v''),(v',v)} = 1, and all other entries in W are 0.

Now, let w be an eigenvector of PQ with eigenvalue λ_i, and let w ∈ R^{2|E(G)|} be the vector such that w_{(v,v')} = w_{σ_{v'}}/p_{σ_{v'}} for all (v, v') ∈ E(G). For any small t, we would expect that w · W^t w ≈ λ_i^t ||w||_2^2, which strongly suggests that w is correlated with an eigenvector of W with eigenvalue λ_i. For any such w with i > 1, dividing G's vertices into those with positive entries in w and those with negative entries in w would put all vertices from some communities in the first set, and all vertices from the other communities in the second. So, we suspect that an eigenvector of W with the eigenvalue of second greatest magnitude would have entries that are correlated with the corresponding vertices' communities.

We could simply extract this eigenvector, but a faster approach would be to take a random vector y and then compute W^m y for some suitably large m. That will be approximately equal to a linear combination of W's dominant eigenvectors. 
Its dominant eigenvector is expected to have an eigenvalue of approximately λ_1 and to have all of its entries approximately equal, so if we instead compute (W - (λ_1/(2|E(G)|)) J)^m y, where J is the all-ones matrix, the component of y proportional to W's dominant eigenvector will be reduced to negligible magnitude, leaving a vector that is approximately proportional to W's eigenvector of second largest eigenvalue. This is essentially what the ABP algorithm does for r = 2.

This vanilla approach does not, however, extend obviously to the case with multiple eigenvalues. In such cases, we will have to subtract multiples of the identity matrix instead of J, because we will not know enough about W's eigenvectors to find a matrix that cancels out one of them in particular. These are significant challenges to overcome to prove the general result and Conjecture 1.

For higher values of r, the spectral view of ABP can be understood as described above but introducing the following generalized nonbacktracking operator as a replacement to W:

Definition 5. Given a graph, define the r-nonbacktracking matrix W^(r), of dimension equal to the number of directed paths of length r - 1 in the graph, with entry W^(r)_{(v_1,v_2,...,v_r),(v'_1,v'_2,...,v'_r)} equal to 1 if v'_{i+1} = v_i for each 1 ≤ i < r and v'_1 ≠ v_r, and equal to 0 otherwise.

Figure 1: Two paths of length 3 that contribute to an entry of 1 in W^(4).

3.3 Full version

The main modifications in the proof are as follows. First, at the end we assign vertices to sets with probabilities that scale linearly with their entries in y' instead of simply assigning them based on the signs of their entries. This allows us to convert the fact that the average values of y'_v for v in different communities are different into a detection result. 
Second, we remove a small fraction of the edges from the graph at random at the beginning of the algorithm (the graph-splitting step), defining y''_v to be the sum of y'_{v'} over all v' connected to v by paths of a suitable length with removed edges at their ends, in order to eliminate some dependency issues. Also, instead of just compensating for PQ's dominant eigenvalue, we also compensate for some of its smaller eigenvalues, and subtract multiples of y^(t-1) from y^(t) for some t instead of subtracting the average value of y^(t) from all of its entries for all t. We refer to [19] for the full description of the algorithm. Note that while it is easier to prove that the ABP algorithm works, the ABP* algorithm should work at least as well in practice.

4 Proof technique

For simplicity, consider first the two community symmetric case. Consider determining the community of v using belief propagation, assuming some preliminary guesses about the vertices t edges away from it, and assuming that the subgraph of G induced by the vertices within t edges of v is a tree. For any vertex v' such that d(v, v') < t, let C_{v'} be the set of the children of v'. 
If we believe, based on either our prior knowledge or propagation of beliefs up to these vertices, that v'' is in community 1 with probability 1/2 + (1/2)ε_{v''} for each v'' ∈ C_{v'}, then the algorithm will conclude that v' is in community 1 with a probability of

Π_{v'' ∈ C_{v'}} ((a+b)/2 + ((a-b)/2)ε_{v''}) / [Π_{v'' ∈ C_{v'}} ((a+b)/2 + ((a-b)/2)ε_{v''}) + Π_{v'' ∈ C_{v'}} ((a+b)/2 - ((a-b)/2)ε_{v''})].

If all of the ε_{v''} are close to 0, then this is approximately equal to (see also [9, 22])

(1 + Σ_{v'' ∈ C_{v'}} ((a-b)/(a+b))ε_{v''}) / (2 + Σ_{v'' ∈ C_{v'}} ((a-b)/(a+b))ε_{v''} + Σ_{v'' ∈ C_{v'}} (-1)((a-b)/(a+b))ε_{v''}) = 1/2 + (1/2) Σ_{v'' ∈ C_{v'}} ((a-b)/(a+b))ε_{v''}.

That means that the belief propagation algorithm will ultimately assign an average probability of approximately 1/2 + (1/2)((a-b)/(a+b))^t Σ_{v'' : d(v,v'') = t} ε_{v''} to the possibility that v is in community 1. If there exists ε such that E_{v'' ∈ Ω_1}[ε_{v''}] = ε and E_{v'' ∈ Ω_2}[ε_{v''}] = -ε (recall that Ω_i = {v : σ_v = i}), then on average we would expect to assign a probability of approximately 1/2 + (1/2)((a-b)^2/(2(a+b)))^t ε to v being in its actual community, which is enhanced as t increases when SNR > 1. 
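The first-order approximation above can be checked numerically; here is a small illustrative sketch (the parameters a, b and the child biases ε are arbitrary choices, not values from the paper):

```python
import numpy as np

def bp_update(a, b, eps):
    """Exact BP belief that v' is in community 1, given that each child v''
    is believed to be in community 1 with probability 1/2 + eps_{v''}/2
    (two-community symmetric SBM with inside/across parameters a, b)."""
    p1 = np.prod((a + b) / 2 + (a - b) / 2 * eps)
    p2 = np.prod((a + b) / 2 - (a - b) / 2 * eps)
    return p1 / (p1 + p2)

def linearized_update(a, b, eps):
    """First-order expansion: 1/2 + (1/2) sum((a-b)/(a+b) * eps)."""
    return 0.5 + 0.5 * np.sum((a - b) / (a + b) * eps)

a, b = 5.0, 1.0                        # illustrative parameters
eps = np.array([0.01, -0.02, 0.015])  # small biases of the children
exact = bp_update(a, b, eps)
approx = linearized_update(a, b, eps)
# For small eps the two agree up to second-order terms in eps.
```

Since the discrepancy is quadratic in the ε's, the linearized update is an accurate surrogate exactly in the regime where the beliefs are weak, which is the regime relevant near the KS threshold.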
Note that since the variance in the probability assigned to the possibility that v is in its actual community will also grow as ((a-b)^2/(2(a+b)))^t, the chance that this will assign a probability of greater than 1/2 to v being in its actual community will be 1/2 + Θ(((a-b)^2/(2(a+b)))^{t/2}).

One idea for the initial estimate is to simply guess the vertices' communities at random, in the expectation that the fractions of the vertices from the two communities assigned to a community will differ by Θ(1/√n) by the Central Limit Theorem. Unfortunately, for any t large enough that ((a-b)^2/(2(a+b)))^{t/2} > √n, we have that ((a+b)/2)^t > n, which means that our approximation breaks down before t gets large enough to detect communities. In fact, t would have to be so large that not only would neighborhoods not be tree-like, but vertices would have to be exhausted.

One way to handle this would be to stop counting vertices that are t edges away from v, and instead count each vertex a number of times equal to the number of length-t paths from v to it.² Unfortunately, finding all length-t paths starting at v can be done efficiently enough only for values of t that are smaller than what is needed to amplify a random guess to the extent needed here. We could instead calculate the number of length-t walks from v to each vertex more quickly, but this count would probably be dominated by walks that go to a high-degree vertex and then leave and return to it repeatedly, which would throw the calculations off. 
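The walk counts in this discussion can be obtained from matrix powers: A^t counts length-t walks, while powers of the nonbacktracking operator W of Section 3.2 count nonbacktracking walks. A small sketch on a toy graph (the graph and the choice t = 4 are arbitrary):

```python
import numpy as np

# Toy graph: a triangle with a pendant vertex (vertices 0, 1, 2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n = 4
A = np.zeros((n, n), dtype=int)
for u, v in edges:
    A[u, v] = A[v, u] = 1

t = 4
walks = np.linalg.matrix_power(A, t)  # walks[u, v] = number of length-t walks u -> v

# Nonbacktracking matrix on directed edges: W[(v,v''),(v',v)] = 1 for v' != v''.
directed = edges + [(v, u) for u, v in edges]
idx = {e: i for i, e in enumerate(directed)}
W = np.zeros((len(directed), len(directed)), dtype=int)
for (v1, v2) in directed:          # later edge (v1 -> v2)
    for (u1, u2) in directed:      # earlier edge (u1 -> u2)
        if u2 == v1 and u1 != v2:  # consecutive and non-backtracking
            W[idx[(v1, v2)], idx[(u1, u2)]] = 1

# Length-t nonbacktracking walks u -> v: sum W^(t-1) entries over start edges
# leaving u and end edges entering v.
Wp = np.linalg.matrix_power(W, t - 1)
nb = np.zeros((n, n), dtype=int)
for e_end in directed:
    for e_start in directed:
        nb[e_start[0], e_end[1]] += Wp[idx[e_end], idx[e_start]]
```

Every nonbacktracking walk is in particular a walk, so `nb` is entrywise dominated by `walks`; the gap between the two is exactly the contribution of the back-and-forth walks that the text warns about.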
On the other hand, most reasonably short nonbacktracking walks are likely to be paths, so counting each vertex a number of times equal to the number of nonbacktracking walks of length t from v to it seems like a reasonable modification. That said, it is still possible that there is a vertex that is in cycles such that most nonbacktracking walks simply leave and return to it many times. In order to mitigate this, we use r-nonbacktracking walks, walks in which no vertex reoccurs within r steps of a previous occurrence, so that walks cannot return to any vertex more than t/r times.

Unfortunately, this algorithm alone would not work, because the original guesses will inevitably be biased towards one community or the other. So, most of the vertices will have more r-nonbacktracking walks of length t from them to vertices that were suspected of being in that community than to vertices of the other. One way to deal with this bias would be to subtract the average number of r-nonbacktracking walks to vertices in each set from each vertex's counts. Unfortunately, that will tend to undercompensate for the bias when applied to high-degree vertices and overcompensate for it when applied to low-degree vertices. So, we modify the algorithm that counts the differences between the numbers of r-nonbacktracking walks leading to vertices in the two sets to subtract off the average at every step, in order to prevent a major bias from building up.

²This type of approach is considered in [23].

One of the features of our approach is that it extends fairly naturally to the general SBM. Despite the potential presence of more than 2 communities, we still only assign one value to each vertex, and output a partition of the graph's vertices into two sets in the expectation that different communities will have different fractions of their vertices in the second set. 
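The step-by-step mean subtraction just described can be illustrated with a simplified simulation, in the spirit of ABP* with r = 2 but, for brevity, without the nonbacktracking and cycle corrections; all parameter values here are arbitrary illustrations, not choices from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample a 2-community symmetric SBM (illustrative parameters).
n, a, b = 1000, 10.0, 2.0
sigma = rng.integers(0, 2, size=n)
prob = np.where(sigma[:, None] == sigma[None, :], a / n, b / n)
upper = np.triu(rng.random((n, n)) < prob, 1)
A = (upper | upper.T).astype(float)

# Linearized updates with the average message subtracted at every step,
# so that no global bias towards one community builds up.
m = 12
y = A * rng.standard_normal(n)[None, :]            # y[v, v'] = x_{v'} on edges
for _ in range(m):
    z = (y - y.sum() / max(A.sum(), 1.0)) * A      # mean-subtracted messages
    s = z.sum(axis=1)                              # total message leaving each vertex
    y = A * s[None, :]                             # y[v, v'] aggregates z[v', .]
score = y.sum(axis=1)

# The two communities typically receive different fractions of positive
# scores, which is the detection criterion of Definition 3.
f0 = (score[sigma == 0] > 0).mean()
f1 = (score[sigma == 1] > 0).mean()
```

Because the average message is subtracted before every propagation step, the bias of the random initialization is never amplified, mirroring the role of the z-step in ABP*.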
One complication is that the method of preventing the results from being biased towards one community does not work as well in the general case. The problem is that, by only assigning one value to each vertex, we compress our beliefs onto one dimension. That means that the algorithm cannot detect biases orthogonal to that dimension, and thus cannot subtract them off. So, we cancel out the bias by subtracting multiples of the counts of the numbers of r-nonbacktracking walks of some shorter length that will also have been affected by it.

More concretely, we assign each vertex an initial value, x_v, at random. Then, we compute a matrix Y such that for each v ∈ G and 0 ≤ t ≤ m, Y_{v,t} is the sum over all r-nonbacktracking walks of length t ending at v of the initial values associated with their starting vertices. Next, for each v we compute a weighted sum of Y_{v,1}, Y_{v,2}, ..., Y_{v,m}, where the weighting is such that any biases in the entries of Y resulting from the initial values should mostly cancel out. We then use these to classify the vertices.

Proof outline for Theorem 1. If we were going to prove that ABP* worked, we would probably define W_r[S]((v_0, ..., v_m)) to be 1 if for every consecutive subsequence (i_1, ..., i_{m'}) ⊆ S, we have that v_{i_1 - 1}, ..., v_{i_{m'}} is an r-nonbacktracking walk, and 0 otherwise. Next, we would define

W_r((v_0, ..., v_m)) = Σ_{S ⊆ (1,...,m)} (-2|E(G)|)^{-|S|} W_r[S]((v_0, ..., v_m))

and W_m(x, v) = Σ_{v_0,...,v_m ∈ G : v_m = v} x_{v_0} W_r((v_0, ..., v_m)), and we would have that y'_v = W_m(x, v) for x and y' as in ABP*. As explained above, we rely on a different approach to cope with the general SBM. In order to prove that the algorithm works, we make the following definitions.

Definition 6. For any r ≥ 1 and series of vertices v_0, ..., v_m, let W_r((v_0, ..., v_m)) be 1 if v_0, ..., v_m is an r-nonbacktracking walk and 0 otherwise. 
Also, for any r ≥ 1, series of vertices v_0, ..., v_m, and (c_0, ..., c_m) ∈ R^{m+1}, let

W_{(c_0,...,c_m)}[r]((v_0, ..., v_m)) = Σ_{(i_0,...,i_{m'}) ⊆ (0,...,m)} (Π_{i ∉ (i_0,...,i_{m'})} (-c_i/n)) W_r((v_{i_0}, v_{i_1}, ..., v_{i_{m'}})).

In other words, W_{(c_0,...,c_m)}[r]((v_0, ..., v_m)) is the sum over all subsequences of (v_0, ..., v_m) that form r-nonbacktracking walks of the products of the negatives of the c_i/n corresponding to the elements of (v_0, ..., v_m) that are not in the walks. Finally, let

W_{m/{c_i}}(x, v) = Σ_{v_0,...,v_m ∈ G : v_m = v} x_{v_0} W_{(c_0,...,c_m)}[r]((v_0, ..., v_m)).

The reason these definitions are important is that for each v and t, we have that

Y_{v,t} = Σ_{v_0,...,v_t ∈ G : v_t = v} x_{v_0} W_r((v_0, ..., v_t)),

and y^(m)_v is equal to W_{m/{c_i}}(x, v) for suitable (c_0, ..., c_m). For the full ABP algorithm, both terms in the above equality refer to G as it is after some of its edges are removed at random in the 'graph splitting' step (which explains the presence of 1 - γ factors in [19]). One can easily prove that if v_0, ..., v_t are distinct, σ_{v_0} = i and σ_{v_t} = j, then

E[W_{(c_0,...,c_t)}[r]((v_0, ..., v_t))] = e_i · P^{-1}(PQ)^t e_j / n^t,

and most of the rest of the proof centers around showing that the W_{(c_0,...,c_m)}[r]((v_0, ..., v_m)) such that v_0, ..., v_m are not all distinct do not contribute enough to the sums to matter. 
That starts with a bound on |E[W_{(c_0, ..., c_m)}[r]((v_0, ..., v_m))]| whenever there is no i, j \ne j' such that v_j = v_{j'}, |i - j| \le r, and c_i \ne 0; and continues with an explanation of how to re-express any W_{(c_0, ..., c_m)}[r]((v_0, ..., v_m)) as a linear combination of expressions of the form W_{(c'_0, ..., c'_{m'})}[r]((v'_0, ..., v'_{m'})) which have this property. Then we use these to prove that for suitable (c_0, ..., c_m), the sum of |E[W_{(c_0, ..., c_m)}[r]((v_0, ..., v_m))]| over all sufficiently repetitive (v_0, ..., v_m) is sufficiently small. Next, we observe that

    W_{(c_0, ..., c_m)}[r]((v_0, ..., v_m)) \, W_{(c''_0, ..., c''_m)}[r]((v''_0, ..., v''_m)) = W_{(c_0, ..., c_m, 0, ..., 0, c''_m, ..., c''_0)}[r]((v_0, ..., v_m, u_1, ..., u_r, v''_m, ..., v''_0))

if u_1, ..., u_r are new vertices that are connected to all other vertices, and use that fact to translate bounds on expected values into bounds on variances.

That allows us to show that if m and (c_0, ..., c_m) have the appropriate properties and w is an eigenvector of PQ with eigenvalue \lambda_j and magnitude 1, then with high probability

    \left| \sum_{v \in V(G)} (w_{\sigma_v}/p_{\sigma_v}) W_{m/{c_i}}(x, v) \right| = O\left( \sqrt{n} \log(n) \prod_{0 \le i \le m} |\lambda_j - c_i| + \sqrt{n} \prod_{0 \le i \le m} |\lambda_s - c_i| \right)

and

    \left| \sum_{v \in V(G)} (w_{\sigma_v}/p_{\sigma_v}) W_{m/{c_i}}(x, v) \right| = \Omega\left( \sqrt{n} \prod_{0 \le i \le m} |\lambda_j - c_i| \right).

We also show that under appropriate conditions, Var[W_{m/{c_i}}(x, v)] = O((1/n) \prod_{0 \le i \le m} (\lambda_s - c_i)^2). Together, these facts would allow us to prove that the differences between the average values of W_{m/{c_i}}(x, v) in different communities are large enough relative to the variance of W_{m/{c_i}}(x, v) to let us detect communities, except for one complication. Namely, these bounds are not quite good enough to rule out the possibility that there is a constant-probability scenario in which the empirical variance of {W_{m/{c_i}}(x, v)} is large enough to disrupt our efforts at using W_{m/{c_i}}(x, v) for detection. Although we do not expect this to actually happen, we rely on the graph splitting step described in Section 3.3 to discard this potential scenario.

5 Conclusions and extensions

This algorithm is intended to classify vertices with an accuracy nontrivially better than that attained by guessing randomly, but it is not hard to convert it to an algorithm that classifies vertices with optimal accuracy. Once one has reasonable initial guesses of which communities the vertices are in, one can simply run full belief propagation on these guesses. This requires bridging the gap between dividing the vertices into two sets that are correlated with their communities in an unknown way and assigning each vertex a nontrivial probability distribution for how likely it is to be in each community. One way to do this is to divide G's vertices into those that have positive and negative values of y', and divide its directed edges into those that have positive and negative values of y^{(m)}. We would generally expect that edges from vertices in different communities will have different probabilities of corresponding to positive values of y^{(m)}. Now, let d' be the largest integer such that at least \sqrt{n} of the vertices have degree at least d', let S be the set of vertices with degree exactly d', and for each v \in S, let \xi_v = |{v' : (v, v') \in E(G), y'_{(v,v')} > 0}|. We would expect that for any given community i, the probability distribution of \xi_v for v \in \Omega_i would be essentially a binomial distribution with parameters d' and some unknown probability.
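The text leaves the mixture-fitting step unspecified; under the stated assumption that \xi_v for v in community i is approximately Binomial(d', p_i), one simple option is expectation-maximization on a k-component binomial mixture. The sketch below is our own (the function name, initialization, and iteration count are illustrative assumptions, not the paper's procedure):

```python
import math

def fit_binomial_mixture(xis, d, k, iters=300):
    """EM for a mixture of k Binomial(d, p_i) components, fit to the observed
    counts xis (one count in {0, ..., d} per vertex).  Returns mixture
    weights ws and success probabilities ps.  A sketch only."""
    ps = [(i + 1) / (k + 1) for i in range(k)]   # crude spread-out init
    ws = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each observation
        resp = []
        for xi in xis:
            lik = [ws[i] * math.comb(d, xi) * ps[i] ** xi
                   * (1 - ps[i]) ** (d - xi) for i in range(k)]
            tot = sum(lik)
            resp.append([l / tot for l in lik] if tot > 0 else [1.0 / k] * k)
        # M-step: re-estimate weights and per-component success probabilities
        for i in range(k):
            ri = sum(r[i] for r in resp)
            ws[i] = ri / len(xis)
            ps[i] = sum(r[i] * xi for r, xi in zip(resp, xis)) / (d * ri) \
                if ri > 0 else ps[i]
    return ws, ps
```

The recovered components would then be matched to communities and used to seed the final belief propagation step, as the text goes on to describe.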
So, compute probabilities such that the observed distribution of values of \xi_v approximately matches the appropriate weighted sum of k binomial distributions. Next, go through all identifications of the communities with these binomial distributions that are consistent with the community sizes, and determine which one most accurately predicts the connectivity rates between vertices that have each possible value of \xi when the edge in question is ignored; treat this as the mapping of communities to binomial distributions. Then, for each adjacent v and v', determine the probability distribution of what community v is in based on the signs of y'_{(v'',v)} for all v'' \ne v'. Finally, use these as the starting probabilities for BP with a depth of ln(n)/3 ln(\lambda_1).

Acknowledgments

This research was supported by NSF CAREER Award CCF-1552131 and ARO grant W911NF-16-1-0051.

References

[1] P. W. Holland, K. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[2] Peter J. Bickel and Aiyou Chen. A nonparametric view of network models and Newman-Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068–21073, 2009.

[3] B. Karrer and M. E. J. Newman. Stochastic blockmodels and community structure in networks. Phys. Rev. E, 83:016107, Jan 2011.

[4] C. Gao, Z. Ma, A. Y. Zhang, and H. H. Zhou. Achieving Optimal Misclassification Proportion in Stochastic Block Model. ArXiv e-prints, May 2015.

[5] T. N. Bui, S. Chaudhuri, F. T. Leighton, and M. Sipser. Graph bisection algorithms with good average case behavior. Combinatorica, 7(2):171–191, 1987.

[6] R. B. Boppana. Eigenvalues and graph bisection: An average-case analysis. In 28th Annual Symposium on Foundations of Computer Science, pages 280–285, 1987.

[7] F. McSherry.
Spectral partitioning of random graphs. In 42nd IEEE Symposium on Foundations of Computer Science, pages 529–537, 2001.

[8] Béla Bollobás, Svante Janson, and Oliver Riordan. The phase transition in inhomogeneous random graphs. Random Struct. Algorithms, 31(1):3–122, August 2007.

[9] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E, 84:066106, December 2011.

[10] Elchanan Mossel, Joe Neeman, and Allan Sly. Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, 162(3):431–461, 2015.

[11] A. Coja-Oghlan. Graph partitioning via adaptive spectral techniques. Comb. Probab. Comput., 19(2):227–284, March 2010.

[12] V. Vu. A simple SVD algorithm for finding hidden partitions. ArXiv:1404.3918, April 2014.

[13] Olivier Guédon and Roman Vershynin. Community detection in sparse networks via Grothendieck's inequality. Probability Theory and Related Fields, 165(3):1025–1049, 2016.

[14] Andrea Montanari and Subhabrata Sen. Semidefinite programs on sparse random graphs and their application to community detection. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, pages 814–827, New York, NY, USA, 2016. ACM.

[15] L. Massoulié. Community detection thresholds and the weak Ramanujan property. In STOC 2014: 46th Annual Symposium on the Theory of Computing, pages 1–10, New York, United States, June 2014.

[16] E. Mossel, J. Neeman, and A. Sly. A proof of the block model threshold conjecture. Available online at arXiv:1311.4115 [math.PR], January 2014.

[17] Charles Bordenave, Marc Lelarge, and Laurent Massoulié. Non-backtracking spectrum of random graphs: Community detection and non-regular Ramanujan graphs.
In FOCS '15, pages 1347–1357, Washington, DC, USA, 2015. IEEE Computer Society.

[18] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI'99, pages 467–475, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

[19] E. Abbe and C. Sandon. Detection in the stochastic block model with multiple clusters: proof of the achievability conjectures, acyclic BP, and the information-computation gap. ArXiv:1512.09080, Dec. 2015.

[20] J. Banks and C. Moore. Information-theoretic thresholds for community detection in sparse networks. ArXiv:1601.02658, January 2016.

[21] David L. Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.

[22] Florent Krzakala, Cristopher Moore, Elchanan Mossel, Joe Neeman, Allan Sly, Lenka Zdeborová, and Pan Zhang. Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences, 110(52):20935–20940, 2013.

[23] S. Bhattacharyya and P. J. Bickel. Community Detection in Networks using Graph Distance. ArXiv:1401.3915, January 2014.