{"title": "Exploiting Tradeoffs for Exact Recovery in Heterogeneous Stochastic Block Models", "book": "Advances in Neural Information Processing Systems", "page_first": 4871, "page_last": 4879, "abstract": "The Stochastic Block Model (SBM) is a widely used random graph model for networks with communities. Despite the recent burst of interest in community detection under the SBM from statistical and computational points of view, there are still gaps in understanding the fundamental limits of recovery. In this paper, we consider the SBM in its full generality, where there is no restriction on the number and sizes of communities or how they grow with the number of nodes, as well as on the connectivity probabilities inside or across communities. For such stochastic block models, we provide guarantees for exact recovery via a semidefinite program as well as upper and lower bounds on SBM parameters for exact recoverability. Our results exploit the tradeoffs among the various parameters of heterogenous SBM and provide recovery guarantees for many new interesting SBM configurations.", "full_text": "Exploiting Tradeoffs for Exact Recovery in\n\nHeterogeneous Stochastic Block Models\n\nAmin Jalali\n\nDepartment of Electrical Engineering\n\nUniversity of Washington\n\nSeattle, WA 98195\namjalali@uw.edu\n\nQiyang Han\n\nDepartment of Statistics\nUniversity of Washington\n\nSeattle, WA 98195\nroyhan@uw.edu\n\nIoana Dumitriu\n\nDepartment of Mathematics\n\nUniversity of Washington\n\nSeattle, WA 98195\ndumitriu@uw.edu\n\nMaryam Fazel\n\nDepartment of Electrical Engineering\n\nUniversity of Washington\n\nSeattle, WA 98195\nmfazel@uw.edu\n\nAbstract\n\nThe Stochastic Block Model (SBM) is a widely used random graph model for\nnetworks with communities. Despite the recent burst of interest in community\ndetection under the SBM from statistical and computational points of view, there\nare still gaps in understanding the fundamental limits of recovery. 
In this paper,\nwe consider the SBM in its full generality, where there is no restriction on the\nnumber and sizes of communities or how they grow with the number of nodes, as\nwell as on the connectivity probabilities inside or across communities. For such\nstochastic block models, we provide guarantees for exact recovery via a semidef-\ninite program as well as upper and lower bounds on SBM parameters for exact\nrecoverability. Our results exploit the tradeoffs among the various parameters\nof heterogenous SBM and provide recovery guarantees for many new interesting\nSBM con\ufb01gurations.\n\n1 Introduction\n\nA fundamental problem in network science and machine learning is to discover structures in large,\ncomplex networks (e.g., biological, social, or information networks). Community or cluster detec-\ntion underlies many decision tasks, as a basic step that uses pairwise relations between data points\nin order to understand more global structures in the data. Applications include recommendation\nsystems [27], image segmentation [24, 20], learning gene network structures in bioinformatics, e.g.,\nin protein detection [9] and population genetics [17].\n\nIn spite of a long history of heuristic algorithms (see, e.g., [18] for an empirical overview), as well as\nstrong research interest in recent years on the theoretical side as brie\ufb02y reviewed in the sequel, there\nare still gaps in understanding the fundamental information theoretic limits of recoverability (i.e., if\nthere is enough information to reveal the communities) and computational tractability (if there are\nef\ufb01cient algorithms to recover them). 
This is particularly true in the case of sparse graphs (that test\nthe limits of recoverability), graphs with heterogeneous communities (communities varying greatly\nin size and connectivity), graphs with a number of communities that grows with the number of nodes,\nand partially observed graphs (with various observation models).\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\f1.1 Exact Recovery for Heterogenous Stochastic Block Model\n\nThe stochastic block model (SBM), \ufb01rst introduced and studied in mathematical sociology by Hol-\nland, Laskey and Leinhardt in 1983 [16], can be described as follows. Consider n vertices partitioned\ninto r communities V1, V2, . . . , Vr , of sizes n1, n2, . . . , nr. We endow the kth community with an\nErd\u02ddos-R\u00e9nyi random graph model G(nk, pk) and draw an edge between pairs of nodes in different\ncommunities independently with probability q; i.e., for any pair of nodes i and j , if i, j \u2208 Vk for\nsome k \u2208 {1, . . . , r} we draw an edge with probability pk, and draw an edge with probability q if\nthey are in different communities. We assume q < mink pk in order for the idea of communities to\nmake sense. This de\ufb01nes a distribution over random graphs known as the stochastic block model. In\nthis paper, we assume the above model while allowing the number of communities to grow with the\nnumber of nodes (similar to [13, 15, 23]). 
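The generative model just described (Erdős–Rényi blocks inside communities, probability q across) is easy to sketch in code. The sampler below is illustrative, not from the paper; the community sizes and probabilities in the usage line are chosen arbitrarily.

```python
import numpy as np

def sample_sbm(sizes, ps, q, rng=None):
    """Sample a symmetric adjacency matrix from the heterogenous SBM:
    an edge inside community k appears with probability ps[k], an edge
    across communities with probability q (no self-loops)."""
    rng = np.random.default_rng(rng)
    n = sum(sizes)
    # Probability matrix: q everywhere, p_k on the k-th diagonal block.
    P = np.full((n, n), q, dtype=float)
    start = 0
    for nk, pk in zip(sizes, ps):
        P[start:start + nk, start:start + nk] = pk
        start += nk
    # Sample the strict upper triangle and symmetrize.
    U = rng.random((n, n)) < P
    A = np.triu(U, k=1)
    return (A | A.T).astype(int)

A = sample_sbm(sizes=[60, 30, 10], ps=[0.5, 0.6, 0.9], q=0.1)
```

Setting p_k = 1 and q = 0 recovers a disjoint union of cliques, which is a convenient degenerate case for testing.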
We refer to this model as the heterogeneous stochastic\nblock model to contrast our study of this general setting with previous works on special cases of\nSBM such as 1) homogenous SBM where the communities are equivalent (they are of the same size\nand the connectivity probabilities are equal,) e.g., [12], or, 2) SBM with linear-sized communities,\nwhere the number of communities is \ufb01xed and all community sizes are O(n); e.g., [1].\n\n1.2 Statistical and Computational Regimes\n\nWhat we can infer about the community structure from a single draw of the random graph varies\nbased on the regime of model parameters. Often, the following scenarios are considered.\n\n1. Recovery, where the proportion of misclassi\ufb01ed nodes is negligible; either 0 (corresponding to\nexact recovery with strong consistency, and considered in [12, 1]) or asymptotically 0 (corre-\nsponding to exact recovery with weak consistency as considered in [23, 22, 28]) as the number of\nnodes grows.\n\n2. Approximation, where a \ufb01nite fraction (bounded away from 1) of the vertices is recovered. This\nregime was \ufb01rst introduced in [13, 14], and has been considered in many other works since then;\ne.g., see [15] and references therein.\n\nBoth recovery and approximation can be studied from statistical and computational points of view.\n\nStatistically, one can ask about the parameter regimes for which the model can be recovered or ap-\nproximated. Such characterizations are specially important when an information-theoretical lower\nbound (below which recovery is not possible with high probability) is shown to be achievable with\nan algorithm (with high probability), hence characterizing a phase transition in model parameters.\nRecently, there has been signi\ufb01cant interest in identifying such sharp thresholds for various param-\neter regimes.\n\nComputationally, one might be interested to study algorithms for recovery or approximation. 
In\nthe older approach, algorithms were studied to provide upper bounds on the parameter regimes for\nrecovery or approximation. See [10] or [1, Section 5] for a summary of such results. More recently,\nthe paradigm has shifted towards understanding the limitations and strengths of tractable methods\n(e.g. see [21] on semide\ufb01nite programming based methods) and assessing whether successful re-\ntrieval can be achieved by tractable algorithms at the sharp statistical thresholds or there is a gap.\nSo far, it is understood that there is no such gap in the case of exact recovery (weak and strong)\nand approximation of binary SBM as well as the exact recovery of linear-sized communities [1].\nHowever, this is still an open question for more general cases; e.g., see [2] and the list of unresolved\nconjectures therein.\n\nThe statistical-computational picture for SBM with only two equivalent communities has been fully\ncharacterized in a series of recent papers. Apart from the binary SBM, the best understood cases are\nwhere there is a \ufb01nite number r of equivalent or linear-sized communities. Outside of the settings\ndescribed above, the full picture has not yet emerged and many questions are unresolved.\n\n1.3 This paper\n\nThe community detection problem studied in this paper is stated as: given the adjacency matrix of\na graph generated by the heterogenous stochastic block model, for what SBM parameters we can\nrecover the labels of all vertices, with high probability, using an algorithm that has been proved to\ndo so. We consider a convex program in (2.4) and an estimator similar to the maximum likelihood\n\n2\n\n\festimator in (2.5) and characterize parts of the model space for which exact recovery is possible via\nthese algorithms. Theorems 1 and 2 provide suf\ufb01cient conditions for the convex recovery program\nand Theorem 3 provides suf\ufb01cient conditions for the modi\ufb01ed maximum likelihood estimator to\nexactly recover the underlying model. 
In Section 2.3, we extend the above bounds to the case of partial observations, i.e., when each entry of the matrix is observed uniformly at random with some probability γ and the results are recorded. We also provide an information-theoretic lower bound, describing an impossibility regime for exact recovery in heterogenous SBM, in Theorem 4. All of our results hold only with high probability, as this is the best one can hope for; with tiny probability the model can generate graphs like the complete graph, for which the partition is unrecoverable.

The results of this paper provide a clear improvement in the understanding of stochastic block models by exploiting tradeoffs among SBM parameters. We identify a key parameter (or summary statistic), defined in (2.1) and referred to as relative density, which shows up in our results and provides improvements in the statistical assessment and efficient computational approaches for certain configurations of heterogenous SBM; examples are given in Section 3 to illustrate a number of such beneficial tradeoffs, such as:

• semidefinite programming can successfully recover communities of size O(√log n) under mild conditions on the other communities (see Example 3 for details), while log n has long been believed to be the threshold for the smallest community size;

• the sizes of the communities can be very spread out, or the inter- and intra-community probabilities can be very close, and the model can still be efficiently recoverable, while existing methods (e.g., the peeling strategy [3]) provide false negatives.

While these results are a step towards understanding the information-computational picture of the heterogenous SBM with a growing number of communities, we cannot comment on phase transitions or a possible information-computational gap (see Section 1.2) in this setup based on the results of this paper.

2 Main Results

Consider the heterogenous
stochastic block model described above. In the proofs, we can allow for isolated nodes (communities of size 1), which are omitted from the model here to simplify the presentation. Denote by Y the set of admissible adjacency matrices according to a community assignment as above, i.e.,

Y := {Y ∈ {0,1}^{n×n} : Y is a valid community matrix w.r.t. V_1, ..., V_r where |V_k| = n_k}.

Define the relative density of community k as

\[ \rho_k = (p_k - q)\, n_k \tag{2.1} \]

which can be seen as the increase in the average degree of a node in community k under the SBM, relative to its average degree in an Erdős–Rényi model. Define n_min and n_max as the minimum and maximum of n_1, ..., n_r, respectively. The total variance over the kth community is defined as σ_k² = n_k p_k (1 − p_k), and we let σ_0² = n q (1 − q). Moreover, consider

\[ \sigma_{\max}^2 = \max_{k=1,\dots,r} \sigma_k^2 = \max_{k=1,\dots,r} n_k p_k (1 - p_k)\,. \tag{2.2} \]

A Bernoulli random variable with parameter p is denoted by Ber(p), and a binomial random variable with parameters n and p is denoted by Bin(n, p). The Neyman chi-square divergence between the two discrete random variables Ber(p) and Ber(q) is given by

\[ \widetilde{D}(p, q) := \frac{(p - q)^2}{q (1 - q)} \tag{2.3} \]

and we have \(\widetilde{D}(p, q) \ge D_{\mathrm{KL}}(p, q) := D_{\mathrm{KL}}(\mathrm{Ber}(p), \mathrm{Ber}(q))\). The chi-square divergence is an instance of a more general family of divergence functions called f-divergences or Ali–Silvey distances. This family also includes the KL divergence, total variation distance, Hellinger distance, and Chernoff distance as special cases.
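These divergences are cheap to evaluate; the snippet below (illustrative, not from the paper) computes the chi-square divergence of (2.3) and the Bernoulli KL divergence, so the ordering \(\widetilde{D} \ge D_{\mathrm{KL}}\) can be checked numerically.

```python
import math

def chi_sq(p, q):
    # Neyman chi-square divergence between Ber(p) and Ber(q), Eq. (2.3).
    return (p - q) ** 2 / (q * (1 - q))

def kl_bern(p, q):
    # KL divergence D_KL(Ber(p) || Ber(q)), for p, q strictly inside (0, 1).
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
```

For instance, with p = 0.5 and q = 0.1, chi_sq gives 0.16/0.09 ≈ 1.78 while kl_bern gives ≈ 0.51, consistent with the stated inequality.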
Moreover, the divergence used in [1] is an f-divergence; recall from (2.3) that \(\widetilde{D}(p,q) = (p-q)^2 / (q(1-q))\). Lastly, log denotes the natural logarithm (base e), and the notation θ ≳ 1 means that θ is bounded below by a positive universal constant.

2.1 Convex Recovery

Inspired by the success of semidefinite programs in community detection (e.g., see [15, 21]), we consider a natural convex relaxation of the maximum likelihood estimator, similar to the one used in [12], for exact recovery of the heterogeneous SBM with a growing number of communities. Assuming that \(\zeta = \sum_{k=1}^{r} n_k^2\) is known, we solve

\[ \hat{Y} = \arg\max_{Y} \sum_{i,j} A_{ij} Y_{ij} \quad \text{subject to} \quad \|Y\|_\star \le n\,, \ \ \sum_{i,j} Y_{ij} = \zeta\,, \ \ 0 \le Y_{ij} \le 1\,, \tag{2.4} \]

where ‖·‖⋆ denotes the nuclear norm (the sum of the singular values of the matrix).

We prove two theorems giving conditions under which the above convex program outputs the true community matrix with high probability. In establishing these performance guarantees, we follow the standard dual certificate argument in convex analysis while utilizing strong matrix concentration results from random matrix theory [8, 25, 26, 5]. These results allow us to bound the spectral radius of the matrix A − E[A], where A is an instance of the adjacency matrix generated under the heterogenous SBM. The proofs of both theorems, along with the matrix concentration bounds, are given in Appendix A.

Theorem 1 Under the heterogenous stochastic block model, the output of the semidefinite program in (2.4) coincides with Y⋆ with high probability, provided that

\[ \rho_k^2 \gtrsim \sigma_k^2 \log n_k\,, \qquad \widetilde{D}(p_{\min}, q) \gtrsim \frac{\log n_{\min}}{n_{\min}}\,, \qquad \rho_{\min}^2 \gtrsim \max\{\sigma_{\max}^2,\ nq(1-q),\ \log n\}\,, \]

and \(\sum_{k=1}^{r} n_k^{-\alpha} = o(1)\) for some α > 0.

Proof Sketch.
For Y \u22c6 to be the unique solution of (2.4), we need to show that for any feasible\nY 6= Y \u22c6 , the following quantity\n\nhA, Y \u22c6 \u2212 Y i = hE[A], Y \u22c6 \u2212 Y i + hA \u2212 E[A], Y \u22c6 \u2212 Y i\n\nis strictly positive. In bounding the second term above, we make use of the constraint kY k\u22c6 \u2264 n =\nkY \u22c6k\u22c6 by constructing a dual certi\ufb01cate from A \u2212 E[A] . This is where the bounds on the spectral\nnorm (dual norm for the nuclear norm) of A \u2212 E[A] enter and we use matrix concentration bounds\n(see Lemma 7 in Appendix A).\n\nThe \ufb01rst condition of Theorem 1 is equivalent to each community being connected, second condition\nensures that each community is identi\ufb01able (pmin \u2212 q is large enough), and the third condition\nrequires minimal density to dominate global variability. The assumption Pr\nk = o(1) is\ntantamount to saying that the number of tiny communities cannot be too large (e.g., the number\nof polylogarithmic-size communities cannot be a power of n). In other words, one needs to have\nmostly large communities (growing like n\u01eb, for some \u01eb > 0) for this assumption to be satis\ufb01ed.\nNote, however, that the condition does not restrict the number of communities of size n\u01eb for any\n\ufb01xed \u01eb > 0 .\nIn fact, Theorem 1 allows us to describe a regime in which tiny communities of\nsize O(\u221alog n) are recoverable provided that they are very dense and that only few tiny or small\ncommunities exist; see Example 3. 
The second theorem imposes more stringent conditions on the relative density, hence only allowing for communities of size down to log n, but relaxes the condition that only a small number of nodes can be in small communities.

Theorem 2 Under the heterogenous stochastic block model, the output of the semidefinite program in (2.4) coincides with Y⋆, with high probability, provided that

\[ \rho_k^2 \gtrsim \sigma_k^2 \log n\,, \qquad \widetilde{D}(p_{\min}, q) \gtrsim \frac{\log n}{n_{\min}}\,, \qquad \rho_{\min}^2 \gtrsim \max\{\sigma_{\max}^2\,,\ nq(1-q)\}\,. \]

The proof of Theorem 2 is similar to the proof of Theorem 1, except that we use a different matrix concentration bound (see Lemma 10 in Appendix A).

2.2 Recoverability Lower and Upper Bounds

Next, we consider an estimator, inspired by maximum likelihood estimation, and identify a subset of the model space which is exactly recoverable via this estimator. The proposed estimation approach is not computationally tractable and is only used to examine the conditions under which exact recovery is possible. For a fixed Y ∈ Y and an observed matrix A, the likelihood function is given by

\[ P_Y(A) = \prod_{i<j} p_{\tau(i,j)}^{A_{ij} Y_{ij}}\, (1 - p_{\tau(i,j)})^{(1-A_{ij}) Y_{ij}}\, q^{A_{ij}(1-Y_{ij})}\, (1-q)^{(1-A_{ij})(1-Y_{ij})}\,, \]

where τ(i,j) = k whenever i, j ∈ V_k. Theorem 3 provides sufficient conditions under which the optimal solution Ŷ of the non-convex recovery program in (2.5) coincides with Y⋆, with a probability not less than \(1 - 7\,\frac{p_{\max}-q}{p_{\min}-q}\, n^{2-\eta}\) for an appropriate η > 0 appearing in those conditions.

Notice that ρ_min = min_{k=1,...,r} n_k (p_k − q) and p_min = min_{k=1,...,r} p_k do not necessarily correspond to the same community. Similar to the proof of Theorem 1, we establish ⟨A, Y⋆ − Y⟩ > 0 for any Y ∈ Y; this time, however, we use a counting argument (see Lemma 11 in Appendix B) similar to the one in [12].
The proofs for this theorem and the next one are given in Appendix B.

Finally, to provide a better picture of community detection for heterogenous SBM, we provide the following necessary conditions for exact recovery. Notice that Theorems 1 and 2 require \(\widetilde{D}(q, p_k)\) (in their first condition) and \(\widetilde{D}(p_k, q)\) (in their second condition) to be bounded from below for recoverability by the SDP. Similarly, the conditions of Theorem 4 can be seen as average-case and worst-case upper bounds on these divergences.

Theorem 4 If any of the following conditions holds,

(1) 2 ≤ n_k ≤ n/e, and \( 4 \sum_{k=1}^{r} n_k^2\, \widetilde{D}(p_k, q) \le \frac{1}{2} \sum_{k} n_k \log \frac{n}{n_k} - r - 2 \),

(2) n ≥ 128, r ≥ 2, and \( \max_k \bigl\{ n_k \widetilde{D}(p_k, q) + n_k \widetilde{D}(q, p_k) \bigr\} \le \frac{1}{12} \log (n - n_{\min}) \),

then \( \inf_{\hat Y} \sup_{Y^\star \in \mathcal{Y}} \mathbb{P}[\hat Y \ne Y^\star] \ge \frac{1}{2} \), where the infimum is taken over all measurable estimators Ŷ based on the realization A generated according to the heterogenous stochastic block model.

2.3 Partial Observations

In the general stochastic block model, we assume that the entries of a symmetric adjacency matrix A ∈ {0,1}^{n×n} have been generated according to a combination of Erdős–Rényi models with parameters that depend on the true community matrix. In the case of partial observations, we assume that each entry of A has been observed independently with probability γ. In fact, every entry of the input matrix falls into one of three categories: observed as one, denoted by Ω_1; observed as zero, denoted by Ω_0; and unobserved, which corresponds to Ω^c where Ω = Ω_0 ∪ Ω_1. If an estimator only takes the observed part of the matrix as input, one can revise the underlying probabilistic model to incorporate both the stochastic block model and the observation model; i.e.,
a revised distribution for the entries of A as

\[ A_{ij} = \begin{cases} \mathrm{Ber}(\gamma p_k) & i, j \in V_k \text{ for some } k\,, \\ \mathrm{Ber}(\gamma q) & i \in V_k \text{ and } j \in V_l \text{ for } k \ne l\,, \end{cases} \]

yields the same output from an estimator that only takes in the observed values. Therefore, the estimators in (2.4) and (2.5), as well as the results of Theorems 1, 2, and 3, can be easily adapted to the case of partially observed graphs. It is worth mentioning that the above model for partially observed SBM (which is another SBM) is different from another random model known as the Censored Block Model (CBM) [4]. In the SBM, the absence of an edge provides information, whereas in the CBM it does not.

3 Tradeoffs in Heterogenous SBM

As can be seen from the results presented in this paper, and the main summary statistics they utilize (the relative densities ρ_1, ..., ρ_r), the parameters of the SBM can vary significantly and still satisfy the same recoverability conditions. In the following, we examine a number of such tradeoffs, which lead to recovery guarantees for interesting SBM configurations. Here, a configuration is a list of community sizes n_k, their connectivity probabilities p_k, and the inter-community connectivity probability q. A triple (m, p, k) represents k communities of size m each, with connectivity parameter p. We do not worry about whether m and k are always integers; if they are not, one can always round up or down as needed so that the total number of vertices is n, without changing the asymptotics. Moreover, when the O(·) notation is used, we mean that appropriate constants can be determined. A detailed list of computations for the examples in this section is given in Appendix D.

Table 1: A summary of examples in Section 3.
Each row gives the important aspect of the corresponding example, as well as whether, under appropriate regimes of parameters, it satisfies the conditions of the theorems proved in this paper.

        importance                                  convex recovery   convex recovery   recoverability
                                                    by Thm. 1         by Thm. 2         by Thm. 3
  Ex. 1   {ρ_k} instead of (p_min, n_min)               ×                 ×                 ✓
  Ex. 2   stronger guarantees for convex recovery       ✓                 ✓                 ✓
  Ex. 3   n_min = √(log n)                              ✓                 ×                 ×
  Ex. 4   many small communities, n_max = O(n)          ✓                 ✓                 ✓
  Ex. 5   n_min = O(log n), spread in sizes             ×                 ✓                 ✓
  Ex. 6   small p_min − q                               ✓                 ✓                 ✓

Better Summary Statistics. It is intuitive that using summary statistics such as (p_min, n_min) for a heterogenous SBM, where the n_k's and p_k's are allowed to take very different values, can be very limiting. Examples 1 and 2 give configurations that are guaranteed to be recoverable by our results but fail the existing recoverability conditions in the literature.

Example 1 Suppose we have two communities of sizes n_1 = n − √n and n_2 = √n, with p_1 = n^{−2/3} and p_2 = 1/log n, while q = n^{−2/3−0.01}. The bound we obtain in Theorem 3 makes it clear that this case is theoretically solvable (the modified maximum likelihood estimator successfully recovers it). By contrast, Theorem 3.1 in [7] (specialized to the case of no outliers), requiring

\[ n_{\min}^2 (p_{\min} - q)^2 \gtrsim \bigl( \sqrt{p_{\min} n_{\min}} + \sqrt{nq} \bigr)^2 \log n\,, \tag{3.1} \]

would fail and provide no guarantee for recoverability.

Example 2 Consider a configuration as

(n − n^{2/3}, n^{−1/3+ε}, 1), (√n, O(1/log n), n^{1/6}), q = n^{−2/3+3ε},

where ε is a small quantity, e.g., ε = 0.1. Either of Theorems 1 and 2 certifies this case as recoverable via the semidefinite program (2.4) with high probability.
By contrast, using the p_min = n^{−1/3+ε} and n_min = √n heuristic, neither the condition of Theorem 3.1 in [7] (given in (3.1)) nor the condition of Theorem 2.5 in [12] is fulfilled, hence providing no recovery guarantee for this configuration.

3.1 Small communities can be efficiently recovered

Most algorithms for clustering the SBM run into the problem of small communities [11, 6, 19], often because the models employed do not allow for enough parameter variation to identify the key quantities involved. The next three examples attempt to provide an idea of how small the community sizes can be, how many small communities are allowed, and how wide the spread of community sizes can be, as characterized by our results.

Example 3 (smallest community size for convex recovery) Consider a configuration as

(√(log n), O(1), m), (n_2, O((log n)/√n), √n), q = O((log n)/n),

where n_2 = √n − m√((log n)/n) to ensure a total of n vertices. Here, we assume m ≤ n/(2√(log n)), which implies n_2 ≥ √n/2. It is straightforward to verify the conditions of Theorem 1.

To our knowledge, this is the first example in the literature for which semidefinite programming based recovery works and allows the recovery of (a few) communities of size smaller than log n. Previously, log n was considered to be the standard bound on the community size for exact recovery, as illustrated by Theorem 2.5 of [12] in the case of equivalent communities. We have thus shown that it is possible, in the right circumstances (when the sizes are spread and the smaller the community the denser it is), to recover very small communities (down to size √(log n)), if there are just a few of them (at most polylogarithmic in n).
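As a quick numerical sanity check (illustrative only; the configuration and constants are chosen arbitrarily, since the theorems hide their constants), one can tabulate the quantities entering Theorem 1 for a concrete heterogenous configuration:

```python
import math

# A concrete two-community configuration (numbers chosen for illustration).
n = 10_000
sizes, ps, q = [9_000, 1_000], [0.5, 0.8], 0.1

rho = [(pk - q) * nk for nk, pk in zip(sizes, ps)]           # relative densities (2.1)
sigma2 = [nk * pk * (1 - pk) for nk, pk in zip(sizes, ps)]   # per-community variances
chi = (min(ps) - q) ** 2 / (q * (1 - q))                     # chi-square divergence (2.3)

# The quantities compared in Theorem 1 (up to its unspecified constants):
lhs = min(rho) ** 2                                          # rho_min^2
rhs = max(max(sigma2), n * q * (1 - q), math.log(n))
ratio_per_community = [r ** 2 / (s2 * math.log(nk))          # rho_k^2 vs sigma_k^2 log n_k
                       for r, s2, nk in zip(rho, sigma2, sizes)]
```

Here lhs = 490000 dominates rhs = 2250, and each per-community ratio is far above 1, so this configuration sits comfortably inside the regime of Theorem 1 (modulo the hidden constants).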
The significant improvement we made in the bound on the size of the smallest community is due to the fact that we were able to perform a closer analysis of the semidefinite program by utilizing stronger matrix concentration bounds, mainly borrowed from [8, 25, 26, 5]. For more details, see Appendix A.2.

Notice that the condition of Theorem 3 is not satisfied. This is not an inconsistency (as Theorem 3 gives only an upper bound for the threshold), but indicates the limitation of this theorem in characterizing all recoverable cases.

Spreading the sizes. As mentioned before, while Theorem 1 allows for going lower than the standard log n bound on the community size for exact recovery, it requires the number of very small communities to be relatively small. On the other hand, Theorem 2 provides us with the option of having many small communities, but requires the smallest community to be of size O(log n). We explore two cases with many small communities in the following.

Example 4 Consider a configuration where small communities are dense and there is one big community,

((1/2) n^ε, O(1), n^{1−ε}), ((1/2) n, n^{−α} log n, 1), q = O(n^{−β} log n),

with 0 < ε < 1 and 0 < α < β < 1. We are interested in how large the number of small communities can be. The conditions of Theorems 1 and 2 then both require that

\[ \tfrac{1}{2}(1 - \alpha) < \epsilon < 2(1 - \alpha)\,, \qquad \epsilon > 2\alpha - \beta\,, \tag{3.2} \]

and are depicted in Figure 1. Since we have not specified the constants in our results, we only consider strict inequalities.

Figure 1: The space of parameters in Equation (3.2), bounded by the planes 2α + ε = 2, α + 2ε = 1, and 2α = β + ε. The face defined by β = α is shown with dotted edges.
The three gray faces in the back correspond to β = 1, α = 0, and ε = 1. The green plane (corresponding to the last condition in (3.2)) comes from controlling the intra-community interactions uniformly (the interested reader is referred to Equations (A.8) and (A.9) in the supplementary material), which might be only an artifact of our proof and can possibly be improved.

Notice that the small communities are as dense as can be, but the large one is not necessarily very dense. By picking ε to be just over 1/4, we can make α just shy of 1/2, and β very close to 1. As far as we can tell, there are no results in the literature surveyed that cover such a case, although the clever "peeling" strategy introduced in [3] would recover the largest community. The strongest result in [3] that seems applicable here is Corollary 4 (which works for non-constant probabilities). The algorithm in [3] works to recover a large community (larger than O(√n log² n)), subject to the existence of a gap in the community sizes (roughly, there should be no community sizes between O(√n) and O(√n log² n)). Therefore, in this example, after a single iteration, the algorithm will stop, despite the continued existence of a gap, as there is no community with size above the gap. Hence the "peeling" strategy would fail to recover all the communities in this example.

Example 5 Consider a configuration with many small dense communities of size log n. We are interested in how large the spread of community sizes can be for the semidefinite program to work. As required by Theorems 1 and 2, and to control σ_max (defined in (2.2)), the larger a community, the smaller its connectivity probability should be; therefore we choose the largest community at the threshold of connectivity (required for recovery).
Consider the community sizes and probabilities:

(log n, O(1), n/log n − m√(n/log n)), (√(n log n), O(√((log n)/n)), m), q = O((log n)/n),

where m is a constant. Again, we round up or down where necessary to make sure the sizes are integers and the total number of vertices is n. All the conditions of Theorem 2 are satisfied, and exact convex recovery is possible via the semidefinite program. Note that the last condition of Theorem 1 is not satisfied, since there are too many small communities. Also note that the alternative methods proposed in the literature surveyed would not be applicable; in particular, the gap condition in [3] is not satisfied for this case from the start.

3.2 Weak communities are efficiently recoverable

The following examples illustrate how small p_min − q can be in order for the recovery, respectively the convex recovery, algorithms to still be guaranteed to work. When some p_k is very close to q, the Erdős–Rényi model G(n_k, p_k) looks very similar to the ambient edges from G(n, q). Again, we are going to exploit the possible tradeoffs in the parameters of the SBM to guarantee recovery. Note that the difference in p_min − q for the two types of recovery is noticeable, indicating that there is a significant difference between what we know to be recoverable and what we can recover efficiently by our convex method. We consider both dense graphs (where p_min is O(1)) and sparse ones.

Example 6 Consider a configuration where all of the probabilities are O(1) and

(n_1, p_min, 1), (n_min, p_2, 1), (n_3, p_3, (n − n_1 − n_min)/n_3), q = O(1),

where p_2 − q and p_3 − q are O(1). On the other hand, we assume p_min − q = f(n) is small. For recoverability by Theorem 3, we need f(n) ≳ (log n)/n_min and f²(n) ≳ (log n)/n_1. Notice that, since n ≳ n_1 ≳ n_min, we should have f(n) ≳ √((log n)/n).
For the convex program to recover this configuration (by Theorem 1 or 2), we need n_min ≳ √n and f²(n) ≳ max{n/n_1², (log n)/n_min}, while all the probabilities are O(1).

Note that if all the probabilities, as well as p_min − q, are O(1), then by Theorem 3 all communities down to a logarithmic size should be recoverable. However, the success of convex recovery is guaranteed by Theorems 1 and 2 only when n_min ≳ √n.

For a configuration similar to Example 6, where the probabilities are not O(1), recoverability by Theorem 3 requires f(n) ≳ max{√(p_min (log n)/n), n^{−c}} for some appropriate c > 0.

4 Discussion

We have provided a series of extensions to prior works (especially [12, 1]) by considering exact recovery for the stochastic block model in its full generality, with a growing number of communities. By capturing the tradeoffs among the various parameters of the SBM, we have identified interesting SBM configurations that are efficiently recoverable via semidefinite programs. However, there are still interesting problems that remain open. Sharp thresholds for recovery or approximation of heterogenous SBM, models for partial observation (non-uniform, based on prior information, or adaptive as in [28]), as well as overlapping communities (e.g., [1]) are important future directions. Moreover, other estimators similar to the ones considered in this paper can be analyzed, e.g., when the unknown parameters in the maximum likelihood estimator, or ζ in (2.4), are estimated from the given observations.

References

[1] E. Abbe and C. Sandon. Community detection in general stochastic block models: fundamental limits and efficient recovery algorithms. arXiv preprint arXiv:1503.00609, 2015.

[2] E. Abbe and C. Sandon.
Detection in the stochastic block model with multiple clusters: proof of the achievability conjectures, acyclic BP, and the information-computation gap. arXiv preprint arXiv:1512.09080, 2015.

[3] N. Ailon, Y. Chen, and H. Xu. Breaking the small cluster barrier of graph clustering. In ICML, pages 995–1003, 2013.

[4] A. S. Bandeira. An efficient algorithm for exact recovery of vertex variables from edge measurements, 2015.

[5] A. S. Bandeira and R. van Handel. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. arXiv preprint arXiv:1408.6185, 2014.

[6] R. B. Boppana. Eigenvalues and graph bisection: An average-case analysis. In 28th Annual Symposium on Foundations of Computer Science, pages 280–285. IEEE, 1987.

[7] T. T. Cai and X. Li. Robust and computationally feasible community detection in the presence of arbitrary outlier nodes. Ann. Statist., 43(3):1027–1059, 2015.

[8] S. Chatterjee. Matrix estimation by universal singular value thresholding. Ann. Statist., 43(1):177–214, 2015.

[9] J. Chen and B. Yuan. Detecting functional modules in the yeast protein–protein interaction network. Bioinformatics, 22(18):2283–2290, 2006.

[10] Y. Chen, A. Jalali, S. Sanghavi, and H. Xu. Clustering partially observed graphs via convex optimization. J. Mach. Learn. Res., 15:2213–2238, 2014.

[11] Y. Chen, S. Sanghavi, and H. Xu. Clustering sparse graphs. In Advances in Neural Information Processing Systems, pages 2204–2212, 2012.

[12] Y. Chen and J. Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. J. Mach. Learn. Res., 17(27):1–57, 2016.

[13] A. Coja-Oghlan. Graph partitioning via adaptive spectral techniques. Combin. Probab. Comput., 19(2):227–284, 2010.

[14] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová.
Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011.

[15] O. Guédon and R. Vershynin. Community detection in sparse networks via Grothendieck's inequality. Probability Theory and Related Fields, pages 1–25, 2015.

[16] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[17] D. Jiang, C. Tang, and A. Zhang. Cluster analysis for gene expression data: A survey. IEEE Transactions on Knowledge and Data Engineering, 16(11):1370–1386, 2004.

[18] J. Leskovec, K. J. Lang, and M. Mahoney. Empirical comparison of algorithms for network community detection. In Proceedings of the 19th International Conference on World Wide Web, pages 631–640. ACM, 2010.

[19] F. McSherry. Spectral partitioning of random graphs. In 42nd IEEE Symposium on Foundations of Computer Science, pages 529–537. IEEE, 2001.

[20] M. Meila and J. Shi. A random walks view of spectral segmentation. 2001.

[21] A. Montanari and S. Sen. Semidefinite programs on sparse random graphs. arXiv preprint arXiv:1504.05910, 2015.

[22] E. Mossel, J. Neeman, and A. Sly. Consistency thresholds for the planted bisection model. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 69–75. ACM, 2015.

[23] K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915, 2011.

[24] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[25] D.-C. Tomozei and L. Massoulié. Distributed user profiling via spectral methods. Stoch. Syst., 4(1):1–43, 2014.

[26] V. Vu.
A simple SVD algorithm for finding hidden partitions. arXiv preprint arXiv:1404.3918, 2014.

[27] J. Xu, R. Wu, K. Zhu, B. Hajek, R. Srikant, and L. Ying. Jointly clustering rows and columns of binary matrices: Algorithms and trade-offs. In The 2014 ACM International Conference on Measurement and Modeling of Computer Systems, pages 29–41. ACM, 2014.

[28] S.-Y. Yun and A. Proutiere. Community detection via random and adaptive sampling. In Proceedings of The 27th Conference on Learning Theory, pages 138–175, 2014.