{"title": "Mean Field for the Stochastic Blockmodel: Optimization Landscape and Convergence Issues", "book": "Advances in Neural Information Processing Systems", "page_first": 10694, "page_last": 10704, "abstract": "Variational approximation has been widely used in large-scale Bayesian inference recently, the simplest kind of which involves imposing a mean field assumption to approximate complicated latent structures. Despite the computational scalability of mean field, theoretical studies of its loss function surface and the convergence behavior of iterative updates for optimizing the loss are far from complete. In this paper, we focus on the problem of community detection for a simple two-class Stochastic Blockmodel (SBM). Using batch co-ordinate ascent (BCAVI) for updates, we give a complete characterization of all the critical points and show different convergence behaviors with respect to initializations. When the parameters are known, we show a significant proportion of random initializations will converge to ground truth. On the other hand, when the parameters themselves need to be estimated, a random initialization will converge to an uninformative local optimum.", "full_text": "Mean Field for the Stochastic Blockmodel:\n\nOptimization Landscape and Convergence Issues\n\nSoumendu Sunder Mukherjee\u2217\n\nPurnamrita Sarkar\u2217\n\nInterdisciplinary Statistical Research Unit (ISRU)\n\nDepartment of Statistics and Data Science\n\nIndian Statistical Institute, Kolkata\n\nKolkata 700108, India\n\nsoumendu041@gmail.com\n\nUniversity of Texas, Austin\n\nAustin, TX 78712\n\npurna.sarkar@austin.utexas.edu\n\nY. X. 
Rachel Wang\u2217\n\nSchool of Mathematics and Statistics\n\nUniversity of Sydney\nNSW 2006, Australia\n\nrachel.wang@sydney.edu.au\n\nBowei Yan\n\nDepartment of Statistics and Data Science\n\nUniversity of Texas, Austin\n\nAustin, TX 78712\n\nboweiy@utexas.edu\n\nAbstract\n\nVariational approximation has been widely used in large-scale Bayesian inference recently, the simplest kind of which involves imposing a mean field assumption to approximate complicated latent structures. Despite the computational scalability of mean field, theoretical studies of its loss function surface and the convergence behavior of iterative updates for optimizing the loss are far from complete. In this paper, we focus on the problem of community detection for a simple two-class Stochastic Blockmodel (SBM). Using batch co-ordinate ascent (BCAVI) for updates, we show different convergence behavior with respect to different initializations. When the parameters are known, we show that a random initialization can converge to the ground truth, whereas in the case when the parameters themselves need to be estimated, a random initialization will converge to an uninformative local optimum.\n\n1 Introduction\n\nVariational approximation has recently gained a huge momentum in contemporary Bayesian statistics [13, 5, 11]. Mean field is the simplest type of variational approximation, and is a popular tool in large scale Bayesian inference. It is particularly useful for problems which involve complicated latent structure, so that direct computation with the likelihood is not feasible. The main idea of variational approximation is to obtain a tractable lower bound on the complete log-likelihood of any model. 
This is, in fact, akin to the Expectation Maximization algorithm [6], where one obtains a lower bound on the marginal log-likelihood function via the expectation with respect to the conditional distribution of the latent variables under the current estimates of the underlying parameters. In contrast, for mean field variational approximation, the lower bound or ELBO is computed using the expectation with respect to a product distribution over the latent variables.\nWhile there are many advances in developing new mean field type approximation methods for Bayesian models, the theoretical behavior of these algorithms is not well understood. There is one line of work that studies the asymptotic consistency of variational inference. Most of the existing theoretical work focuses on the global optimizer of variational methods. For example, for Latent Dirichlet Allocation (LDA) [5] and Gaussian mixture models, it is shown in [16] that the global optimizer is statistically consistent. [23] connects variational estimators to profile M-estimation, and shows consistency and asymptotic normality of those estimators. For Stochastic Blockmodels (SBM) [10, 9], [3] shows that the global optimizer of the variational log-likelihood is consistent and asymptotically normal. For more general cases, [22] proves a variational Bernstein-von Mises theorem, which states that the variational posterior converges to \u201cthe Kullback-Leibler minimizer of a normal distribution, centered at the truth\u201d.\n\n\u2217Equal contribution.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nRecently, a lot more effort is being directed towards understanding the statistical convergence behavior of non-convex algorithms in general. For Gaussian mixture models and exponential families with missing data, [19, 21] prove local convergence to the true parameters. 
The same authors also show that the covariance matrix from variational Bayesian approximation for the Gaussian mixture model is \u201ctoo small\u201d compared with that obtained for the maximum likelihood estimator [20]. The robustness of variational Bayes estimators is further discussed in [8]. For LDA, [2] shows that, with proper initialization, variational inference algorithms converge to the global optimum.\nTo be concrete, let us take the community detection problem in networks. Here the latent structure involves unknown community memberships. Optimization of the likelihood involves a combinatorial search, and thus is infeasible for large-scale graphs. The mean field approximation has been widely used for this task [4, 26]. In [3], it was proved that the global optimum of the mean field approximation to the likelihood behaves optimally in the dense degree regime, where the average expected degree of the network grows faster than the logarithm of the number of vertices.\nIn [26], it is shown that if the initialization of mean field is close enough to the truth then one gets convergence to the truth at the minimax rate. However, in practice, it is usually not possible to obtain such an initialization unless one uses a pilot algorithm. Most initialization techniques like Spectral Clustering [17, 15] will return correct clustering in the dense degree regime, thus rendering mean field updates redundant.\nIndeed, in most practical scenarios, one simply uses multiple random initializations, which usually fails miserably. However, to understand the behavior of random initializations, one needs to better understand the landscape of the mean field loss. There are few such works for non-convex optimization in the literature; notable examples include [14, 7, 12, 24]. 
In [24], the authors fully characterize the landscape of the likelihood of the equal proportion Gaussian Mixture Model with two components, where the main message is that most random initializations should indeed converge to the ground truth. In contrast, for topic models, it has been established that, for some parameter regimes, variational inference exhibits instability and returns a posterior mean that is uncorrelated with the truth [7]. In this respect, for network models, there has not been much work characterizing the behavior of the variational loss surface.\nIn this article, in the context of a stochastic blockmodel, we give a complete characterization of all the critical points and establish the behavior of random initializations for batch co-ordinate ascent (BCAVI) updates for mean field likelihood (with known and unknown model parameters). Our results thus complement the results of [25].\nFor simplicity, we work with equal-sized two class stochastic blockmodels. We show that, when the model parameters are known, random initializations can converge to the ground truth. We also analyze the setting with unknown model parameters, where they are estimated jointly with the cluster memberships. In this case, we see that indeed, with high probability, a random initialization never converges to the ground truth, thus showing the critical importance of a good initialization for network models.\n\n2 Setup and preliminaries\n\nThe stochastic blockmodel SBM(B, Z, \u03c0) is a generative model of networks with community structure on n nodes. It is defined as follows: there are K communities {1, . . . , K} and each node belongs to a single community. This membership is captured by the rows of the n \u00d7 K matrix Z, where the ith row of Z, i.e. $Z_{i\\star}$, is the community membership vector of the ith node and has a Multinomial(1; \u03c0) distribution, independently of the other rows. 
Given the community structure, links between pairs of nodes are determined solely by the block memberships of the nodes in an independent manner. That is, if A denotes the adjacency matrix of the network, then given Z, Aij and Akl are independent for (i, j) \u2260 (k, l), i < j, k < l, and\n\nP(Aij = 1 | Z) = P(Aij = 1 | Zia = 1, Zjb = 1) = Bab.\n\nB = ((Bab)) is called the block (or community) probability matrix. We have the natural restriction that B is symmetric for undirected networks.\nThe block memberships are hidden variables and one only observes the network in practice. The goal often is to fit an appropriate SBM to learn the community structure, if any, and also estimate the parameters B and \u03c0.\nThe complete likelihood for the SBM is given by\n\n$$P(A, Z; B, \\pi) = \\prod_{i < j} \\prod_{a,b} \\left(B_{ab}^{A_{ij}} (1 - B_{ab})^{1 - A_{ij}}\\right)^{Z_{ia} Z_{jb}} \\prod_{i} \\prod_{a} \\pi_a^{Z_{ia}}. \\tag{1}$$\n\nAs Z is not observable, if we integrate out Z, we get the data likelihood\n\n$$P(A; B, \\pi) = \\sum_{Z \\in \\mathcal{Z}} P(A, Z; B, \\pi), \\tag{2}$$\n\nwhere $\\mathcal{Z}$ is the space of all n \u00d7 K matrices with exactly one 1 in each row.\nIn principle we can optimize the data likelihood to estimate B and \u03c0. However, P(A; B, \u03c0) involves a sum over a complicated large finite set (the cardinality of this set is $K^n$), and hence is not easy to deal with. A well-known alternative approach is to optimize the variational log-likelihood [3], which has a less complicated dependency structure, the simplest of which is mean field log-likelihood (see, e.g., [18]). 
We defer a detailed discussion of the mean field principle to the supplementary material.\nFor the SBM, the variational log-likelihood with respect to a distribution \u03c8 is given by\n\n$$\\sum_{Z} \\log\\left(\\frac{P(A, Z; B, \\pi)}{\\psi(Z)}\\right) \\psi(Z) = E_{\\psi}\\left(\\sum_{i < j,\\, a, b} Z_{ia} Z_{jb} (\\theta_{ab} A_{ij} - f(\\theta_{ab}))\\right) - \\mathrm{KL}(\\psi \\,\\|\\, \\pi^{\\otimes n}),$$\n\nwhere $\\theta_{ab} = \\log\\left(\\frac{B_{ab}}{1 - B_{ab}}\\right)$, $f(\\theta) = \\log(1 + e^{\\theta})$ and $\\pi^{\\otimes n}$ denotes the product measure on $\\mathcal{Z}$ with the rows of Z being i.i.d. Multinomial(1; \u03c0). A special case of the variational log-likelihood is the mean field log-likelihood (see, e.g., [18]), where one approximates \u03a8 by\n\n$$\\Psi_{MF} \\equiv \\left\\{\\psi : \\psi(z_1, \\ldots, z_n) = \\prod_{j=1}^{n} \\psi_j(z_j)\\right\\}. \\tag{3}$$\n\nDefine $\\ell_{MF}(\\psi, \\theta, \\pi) = \\sum_{i < j,\\, a, b} \\psi_{ia} \\psi_{jb} (\\theta_{ab} A_{ij} - f(\\theta_{ab})) - \\sum_{i} \\mathrm{KL}(\\psi_i \\,\\|\\, \\pi)$. For SBM the mean field approximation is equivalent to optimizing $\\ell_{MF}(\\psi, \\theta, \\pi)$ as follows:\n\n$$\\max_{\\psi} \\ \\ell_{MF}(\\psi, \\theta, \\pi) \\quad \\text{subject to} \\quad \\sum_{a} \\psi_{ia} = 1, \\ \\text{for all } 1 \\le i \\le n, \\qquad \\psi_{ia} \\ge 0, \\ \\text{for all } 1 \\le i \\le n, \\ 1 \\le a \\le K,$$\n\nwhere each $\\psi_i$ is a discrete probability distribution over {1, . . . , K}.\n\n2.1 Mean field updates for a two-parameter two-block SBM\n\nConsider the stochastic blockmodel with two blocks with prior block probabilities \u03c0 and 1 \u2212 \u03c0 respectively and block probability matrix B = (p \u2212 q)I + qJ, where p > q, I is the identity matrix, and $J = \\mathbf{1}\\mathbf{1}^{\\top}$ is the matrix of all 1\u2019s. For simplicity, we will denote $\\psi_{i1}$ as $\\psi_i$. 
Then the mean field log-likelihood is\n\n$$\\ell(\\psi, p, q, \\pi) = \\frac{1}{2} \\sum_{i,j:\\, i \\neq j} [\\psi_i (1 - \\psi_j) + \\psi_j (1 - \\psi_i)] \\left[A_{ij} \\log\\left(\\frac{q}{1 - q}\\right) + \\log(1 - q)\\right] + \\frac{1}{2} \\sum_{i,j:\\, i \\neq j} [\\psi_i \\psi_j + (1 - \\psi_i)(1 - \\psi_j)] \\left[A_{ij} \\log\\left(\\frac{p}{1 - p}\\right) + \\log(1 - p)\\right] - \\sum_{i} \\left[\\log\\left(\\frac{\\psi_i}{\\pi}\\right) \\psi_i + \\log\\left(\\frac{1 - \\psi_i}{1 - \\pi}\\right) (1 - \\psi_i)\\right].$$\n\nFor simplicity of exposition, we will assume that \u03c0 (which is essentially a prior on the block memberships) is known and equals 1/2. Let $C_i$, i = 1, 2 be the two communities. Let $\\tilde{\\pi} = |C_1|/n$. It is clear that $\\tilde{\\pi} = \\frac{1}{2} + O_P(1/\\sqrt{n})$. Assuming $\\tilde{\\pi} = \\frac{1}{2}$ from the start will not change our conclusions but makes the algebra a lot nicer, which we do henceforth. Now\n\n$$\\frac{\\partial \\ell}{\\partial \\psi_i} = \\frac{1}{2} \\sum_{j:\\, j \\neq i} 2[1 - 2\\psi_j] \\left[A_{ij} \\log\\left(\\frac{q}{1 - q}\\right) + \\log(1 - q)\\right] + \\frac{1}{2} \\sum_{j:\\, j \\neq i} 2[2\\psi_j - 1] \\left[A_{ij} \\log\\left(\\frac{p}{1 - p}\\right) + \\log(1 - p)\\right] - \\log\\left(\\frac{\\psi_i}{1 - \\psi_i}\\right) = 4t \\sum_{j:\\, j \\neq i} (\\psi_j - \\tfrac{1}{2})(A_{ij} - \\lambda) - \\log\\left(\\frac{\\psi_i}{1 - \\psi_i}\\right),$$\n\nwhere $t = \\frac{1}{2} \\log\\left(\\frac{p(1 - q)}{q(1 - p)}\\right)$ and $\\lambda = \\frac{1}{2t} \\log\\left(\\frac{1 - q}{1 - p}\\right)$. Detailed calculations of other first and second order partial derivatives are given in Section 2 of the supplementary article [1]. 
The co-ordinate ascent (CAVI) updates for \u03c8 are\n\n$$\\log \\frac{\\psi_i^{(new)}}{1 - \\psi_i^{(new)}} = 4t \\sum_{j \\neq i} (\\psi_j - \\tfrac{1}{2})(A_{ij} - \\lambda).$$\n\nIntroducing an intermediate variable \u03be for the updates, let $f(x) = \\log\\left(\\frac{x}{1 - x}\\right)$ and $\\xi_i = f(\\psi_i)$. Then at iteration s, the batch version (BCAVI) of this is\n\n$$\\xi^{(s)} = 4t (A - \\lambda (J - I)) (\\psi^{(s-1)} - \\tfrac{1}{2} \\mathbf{1}),$$\n\nand $\\psi^{(s)} = g(\\xi^{(s)})$ with $g(x) = 1/(1 + e^{-x})$. The population version (replacing A by $E(A \\mid Z) = Z B Z^{\\top} - pI =: P - pI$) of BCAVI is\n\n$$\\xi^{(s)} = 4t (P - pI - \\lambda (J - I)) (\\psi^{(s-1)} - \\tfrac{1}{2} \\mathbf{1}).$$\n\nThe matrix M := P \u2212 pI \u2212 \u03bb(J \u2212 I) will appear many times later. There are updates for p, q as well, which can be expressed compactly in terms of \u03c8. We describe these in detail in (8).\n\n3 Main results\n\nIn this section, we state and discuss our main results. All the proofs appear in the supplementary article [1].\nNote: In the following, these vectors will appear repeatedly: $\\psi = \\tfrac{1}{2}\\mathbf{1}, \\mathbf{1}, \\mathbf{0}, \\mathbf{1}_{C_1}, \\mathbf{1}_{C_2}$. Among these, $\\mathbf{1}$ corresponds to the case where every node is assigned by \u03c8 to $C_1$, and, similarly, $\\mathbf{0}$ to the case where every node is assigned to $C_2$. On the other hand, $\\mathbf{1}_{C_i}$ are the indicators of the clusters $C_i$ and hence correspond to the ground truth community assignment. Finally, $\\tfrac{1}{2}\\mathbf{1}$ corresponds to the solution where a node belongs to each community with equal probability.\nProposition 3.1. Suppose 1 > p > q > 0. Then\n\n1. $\\frac{(p - q)(1 + p - q)}{2(1 - q)p} < t < \\frac{(p - q)(1 - p + q)}{2(1 - p)q}$, and\n\n2. q < \u03bb < p.\n\nThe eigendecomposition of P \u2212 \u03bbJ will play a crucial role in our analysis. 
Note that it has rank two and two eigenvalues $e_{\\pm} = n\\alpha_{\\pm}$, where $\\alpha_+ = \\frac{p + q}{2} - \\lambda$ and $\\alpha_- = \\frac{p - q}{2}$, with eigenvectors $\\mathbf{1}$ and $\\mathbf{1}_{C_1} - \\mathbf{1}_{C_2}$ respectively.\nNow, the eigenvalues of M are $\\nu_1 = e_+ - (p - \\lambda)$, $\\nu_2 = e_- - (p - \\lambda)$ and $\\nu_j = -(p - \\lambda)$, j = 3, . . . , n. The eigenvector of M corresponding to $\\nu_1$ is $u_1 = \\mathbf{1}$, and the one corresponding to $\\nu_2$ is $u_2 = \\mathbf{1}_{C_1} - \\mathbf{1}_{C_2}$.\n\n3.1 Known p, q:\n\nIn this case, we need only consider the updates for \u03c8. The population BCAVI updates are\n\n$$\\xi^{(s+1)} = 4tM(\\psi^{(s)} - \\tfrac{1}{2}\\mathbf{1}). \\tag{4}$$\n\nWe consider the case where the true p, q are of the same order, that is, $p \\asymp q \\asymp \\rho_n$ with $\\rho_n$ possibly going to 0. In the known p, q case $\\tfrac{1}{2}\\mathbf{1}$ is a saddle point of the population mean field log-likelihood.\nProposition 3.2. $\\psi = \\tfrac{1}{2}\\mathbf{1}$ is a saddle point of the population mean field log-likelihood when p and q are known, for all n large enough.\n\nNow we will write the BCAVI updates in the eigenvector coordinates of M. To this end, define $\\zeta_i^{(s)} = \\langle \\psi^{(s)}, u_i \\rangle / \\|u_i\\|^2 = \\langle \\psi^{(s)}, u_i \\rangle / n$, for i = 1, 2. We can then write\n\n$$\\psi^{(s)} = \\langle \\psi^{(s)}, u_1/\\|u_1\\| \\rangle\\, u_1/\\|u_1\\| + \\langle \\psi^{(s)}, u_2/\\|u_2\\| \\rangle\\, u_2/\\|u_2\\| + v^{(s)} = \\zeta_1^{(s)} u_1 + \\zeta_2^{(s)} u_2 + v^{(s)}.$$\n\nSo, using (4) in conjunction with the above decomposition, coordinate-wise we have:\n\n$$\\begin{aligned} \\xi_i^{(s+1)} &= 4tn\\left((\\zeta_1^{(s)} - \\tfrac{1}{2})\\alpha_+ + \\sigma_i \\zeta_2^{(s)} \\alpha_-\\right) + 4t\\nu_3\\left((\\zeta_1^{(s)} - \\tfrac{1}{2}) + \\sigma_i \\zeta_2^{(s)} + v_i^{(s)}\\right) && (5) \\\\ &=: n a_{\\sigma_i}^{(s)} + b_i^{(s)}, && (6) \\end{aligned}$$\n\nwhere $\\sigma_i = 1$, if i is in $C_1$, and \u22121 otherwise.\nTheorem 3.3 (Population behavior). 
The limit behavior of the population BCAVI updates is characterized by the signs of $\\alpha_+$ and $a^{(0)}_{\\pm 1}$, where $\\alpha_+ = (p + q)/2 - \\lambda$ and $a^{(s)}_{\\pm 1}$ for iteration s is defined in (5). Assume that $|n a^{(0)}_{\\pm 1}| \\to \\infty$. Define $\\ell(\\psi^{(0)}) = \\mathbb{1}(a^{(0)}_{+1} > 0)\\mathbf{1}_{C_1} + \\mathbb{1}(a^{(0)}_{-1} > 0)\\mathbf{1}_{C_2}$. Then, we have\n\n$$\\frac{\\|\\psi^{(1)} - \\ell(\\psi^{(0)})\\|^2}{n} = O(\\exp(-\\Theta(n \\min\\{|a^{(0)}_{+1}|, |a^{(0)}_{-1}|\\}))) = o(1).$$\n\nWe also have for any s \u2265 2\n\n$$\\frac{\\|\\psi^{(s)} - \\ell(\\psi^{(0)})\\|^2}{n} = \\begin{cases} O(\\exp(-\\Theta(nt\\alpha_-))), & \\text{if } a^{(0)}_{+1} a^{(0)}_{-1} < 0, \\\\ O(\\exp(-\\Theta(nt|\\alpha_+|))), & \\text{if } a^{(0)}_{+1} a^{(0)}_{-1} > 0 \\text{ and } \\alpha_+ > 0. \\end{cases}$$\n\nFinally, if $a^{(0)}_{+1} a^{(0)}_{-1} > 0$ and $\\alpha_+ < 0$, then, for any s \u2265 2, we have\n\n$$\\min\\left\\{\\frac{\\|\\psi^{(s)} - \\mathbf{1}\\|^2}{n}, \\frac{\\|\\psi^{(s)} - \\mathbf{0}\\|^2}{n}\\right\\} = O(\\exp(-\\Theta(nt|\\alpha_+|))).$$\n\nIn fact, in this case, $\\psi^{(s)}$ cycles between $\\mathbf{1}$ and $\\mathbf{0}$, in the sense that it is close to $\\mathbf{1}$ in one iteration, to $\\mathbf{0}$ in the next, and so on.\nRemark 3.1. We see from Theorem 3.3 that, essentially, we have exponential convergence within two iterations.\n\nNow we turn to the sample behavior. To distinguish from the population case, we denote the sample BCAVI updates as\n\n$$\\hat{\\xi}^{(s+1)} = 4t \\hat{M} (\\hat{\\psi}^{(s)} - \\tfrac{1}{2}\\mathbf{1}), \\tag{7}$$\n\nwhere $\\hat{M} = A - \\lambda(J - I)$ and $\\hat{\\psi}^{(s)}$ depends on A for s \u2265 1. Note that $\\hat{\\psi}^{(0)} = \\psi^{(0)}$.\nTheorem 3.4 (Sample behavior). 
For all s \u2265 1, the same conclusion as Theorem 3.3 holds for the sample BCAVI updates with high probability as long as $n|a^{(0)}_{\\pm 1}| \\gg \\max\\{\\sqrt{n\\rho_n}\\, \\|\\psi^{(0)} - \\tfrac{1}{2}\\mathbf{1}\\|_{\\infty}, 1\\}$, $\\sqrt{n\\rho_n} = \\Omega(\\log n)$ and $\\psi^{(0)}$ is independent of A.\n\nFrom Theorem 3.3, we can calculate lower bounds on the volumes of the basins of attraction of the limit points of the population BCAVI updates. We have the following corollary.\n\nCorollary 3.5. Define the set of initialization points converging to a stationary point c as\n\n$$S_c := \\{v \\mid \\limsup_{s \\to \\infty} n^{-1}\\|\\psi^{(s)} - c\\|^2 = O(\\exp(-\\Theta(nt \\min\\{|\\alpha_+|, \\alpha_-\\}))), \\text{ when } \\psi^{(0)} = v\\}.$$\n\nLet M be some measure on $[0, 1]^n$, absolutely continuous with respect to the Lebesgue measure. Consider the stationary point $\\mathbf{1}$, then\n\n$$M(S_{\\mathbf{1}}) \\ge \\lim_{\\gamma \\uparrow 1} M(H^{\\gamma}_{+} \\cap H^{\\gamma}_{-} \\cap [0, 1]^n),$$\n\nwhere the half-spaces $H^{\\gamma}_{\\pm}$ are given as\n\n$$H^{\\gamma}_{\\pm} = \\left\\{x \\mid \\langle x, \\alpha_+ u_1 \\pm \\alpha_- u_2 \\rangle > \\frac{n\\alpha_+}{2} + \\frac{n^{1-\\gamma}}{4t}\\right\\}.$$\n\nSimilar formulas can be obtained for the other stationary points.\n\nFor specific measures M, one can obtain explicit formulas for these volumes. 
In practice, these are quite easy to calculate by Monte Carlo simulations.\nIn fact, using arguments that go into the proof of Theorem 3.3, we can show that in the large n limit, there are only five stationary points of the mean field log-likelihood, namely $\\tfrac{1}{2}\\mathbf{1}$, $\\mathbf{1}$, $\\mathbf{0}$, $\\mathbf{1}_{C_1}$, and $\\mathbf{1}_{C_2}$.\n\n3.2 Unknown p, q:\n\nIn this case, the BCAVI updates are\n\n$$p^{(s)} = \\frac{(\\psi^{(s-1)})^{\\top} A \\psi^{(s-1)} + (\\mathbf{1} - \\psi^{(s-1)})^{\\top} A (\\mathbf{1} - \\psi^{(s-1)})}{(\\psi^{(s-1)})^{\\top} (J - I) \\psi^{(s-1)} + (\\mathbf{1} - \\psi^{(s-1)})^{\\top} (J - I) (\\mathbf{1} - \\psi^{(s-1)})}, \\qquad q^{(s)} = \\frac{(\\psi^{(s-1)})^{\\top} A (\\mathbf{1} - \\psi^{(s-1)})}{(\\psi^{(s-1)})^{\\top} (J - I) (\\mathbf{1} - \\psi^{(s-1)})},$$\n\n$$t^{(s)} = \\frac{1}{2} \\log\\left(\\frac{p^{(s)}(1 - q^{(s)})}{q^{(s)}(1 - p^{(s)})}\\right), \\qquad \\lambda^{(s)} = \\frac{1}{2t^{(s)}} \\log\\left(\\frac{1 - q^{(s)}}{1 - p^{(s)}}\\right), \\tag{8}$$\n\n$$\\xi^{(s)} = 4t^{(s)} (A - \\lambda^{(s)} (J - I)) (\\psi^{(s-1)} - \\tfrac{1}{2}\\mathbf{1}). \\tag{9}$$\n\nSimilar to before, $p \\asymp q \\asymp \\rho_n$ with $\\rho_n$ possibly going to 0. In the population version, we would replace A with E(A | Z) = P \u2212 pI.\nIn this case with unknown p, q, our next result shows that $\\tfrac{1}{2}\\mathbf{1}$ changes from a saddle point (Proposition 3.2) to a local maximum.\nProposition 3.6. Let n \u2265 2. Then $(\\psi, p, q) = \\left(\\tfrac{1}{2}\\mathbf{1}, \\frac{\\mathbf{1}^{\\top} A \\mathbf{1}}{n(n-1)}, \\frac{\\mathbf{1}^{\\top} A \\mathbf{1}}{n(n-1)}\\right)$ is a strict local maximum of the mean field log-likelihood.\n\nSince p, q and \u03c8 are unknown and need to be estimated iteratively, we derive the following updates for $p^{(1)}$ and $q^{(1)}$ given the initialization $\\psi^{(0)}$ and show that they can be written in terms of the projection of the initialization onto the principal eigenspace of P.\nLemma 3.1. Let $x = \\psi^{\\top}\\psi + (\\mathbf{1} - \\psi)^{\\top}(\\mathbf{1} - \\psi)$ and $y = 2\\psi^{\\top}(\\mathbf{1} - \\psi) = n - x$. 
If $\\psi = \\zeta_1 u_1 + \\zeta_2 u_2 + w$, where $w \\in \\mathrm{span}\\{u_1, u_2\\}^{\\perp}$, then\n\n$$p^{(1)} = \\frac{p + q}{2} + \\frac{(p - q)(\\zeta_2^2 - x/2n^2)}{\\zeta_1^2 + (1 - \\zeta_1)^2 - x/n^2} + O_P(\\sqrt{\\rho_n}/n), \\qquad q^{(1)} = \\frac{p + q}{2} - \\frac{(p - q)(\\zeta_2^2 + y/2n^2)}{2\\zeta_1(1 - \\zeta_1) - y/n^2} + O_P(\\sqrt{\\rho_n}/n). \\tag{10}$$\n\nSince $\\psi^{\\top}(\\mathbf{1} - \\psi) > 0$, we have $\\zeta_1(1 - \\zeta_1) \\ge \\zeta_2^2$. This gives:\n\n$$p^{(1)} \\in \\left(\\frac{p + q}{2} + O_P(\\sqrt{\\rho_n}/n),\\ p\\right], \\qquad q^{(1)} \\in \\left[q,\\ \\frac{p + q}{2} + O_P(\\sqrt{\\rho_n}/n)\\right). \\tag{11}$$\n\nIt is interesting to note that, by (11), $p^{(1)}$ is always larger than $q^{(1)}$ except when it is $O(\\sqrt{\\rho_n}/n)$ close to (p + q)/2. In that regime, one needs to worry about the sign of t and \u03bb. In all other regimes, t, \u03bb are positive.\nUsing the update forms in Lemma 3.1, the following result shows that the stationary points of the population mean field log-likelihood lie in the principal eigenspace $\\mathrm{span}\\{u_1, u_2\\}$ of P in a limiting sense.\nProposition 3.7. Consider the case with unknown p, q and $\\rho_n \\to 0$, $n\\rho_n \\to \\infty$. Let $(\\psi, \\tilde{p}, \\tilde{q})$ be a stationary point of the population mean field log-likelihood. If $\\psi = \\psi_u + \\psi_{u^{\\perp}}$, where $\\psi_u \\in \\mathrm{span}\\{u_1, u_2\\}$ and $\\psi_{u^{\\perp}} \\perp \\mathrm{span}\\{u_1, u_2\\}$, then $\\|\\psi_{u^{\\perp}}\\| = o(\\sqrt{n})$ as $n \\to \\infty$.\nLemma 3.1 basically shows that if $\\zeta_2$ is vanishing, then $p^{(1)}$ and $q^{(1)}$ concentrate around the average of the conditional expectation matrix, i.e. (p + q)/2. The next result shows that if one uses independent and identically distributed initialization, then $\\zeta_2$ is indeed vanishing. 
This is not surprising, since $\\zeta_2$ measures correlation with the second eigenvector $u_2$ of P, which is basically the $\\mathbf{1}_{C_1} - \\mathbf{1}_{C_2}$ vector.\nConsider a simple random initialization, where the entries of $\\psi^{(0)}$ are i.i.d. with mean \u00b5. We show that it converges to $\\tfrac{1}{2}$ with small deviations within one update. This shows the futility of random initialization.\nLemma 3.2. Consider the initial distribution $\\psi_i^{(0)} \\overset{iid}{\\sim} f_{\\mu}$, where f is a distribution supported on (0, 1) with mean \u00b5. If \u00b5 is bounded away from 0 and 1 and $n\\rho_n = \\Omega(\\log^2 n)$, then $\\psi_i^{(1)} = \\tfrac{1}{2} + O_P(\\log n/\\sqrt{n})$ uniformly for all i, where $\\psi^{(1)}$ is computed using the sample update.\nPerhaps, it is also instructive to analyze the case where the initialization is in fact correlated with the truth, i.e. $E[\\psi_i^{(0)}] = \\mu_{\\sigma_i}$. To this end, we will consider the following initialization scheme.\nLemma 3.3. Consider an initial $\\psi^{(0)}$ such that\n\n$$\\zeta_1 = \\frac{(\\psi^{(0)})^{\\top} \\mathbf{1}}{n} = \\frac{\\mu_1 + \\mu_2}{2} + O_P(1/\\sqrt{n}), \\qquad \\zeta_2 = \\frac{(\\psi^{(0)})^{\\top} u_2}{n} = \\frac{\\mu_1 - \\mu_2}{2} + O_P(1/\\sqrt{n}). \\tag{12}$$\n\nIf $\\mu_1, \\mu_2$ are bounded away from 0 and 1 and satisfy\n\n$$|\\mu_1 - \\mu_2| > \\max\\left(2|\\mu_1 + \\mu_2 - 1| + O_P\\left(\\frac{\\sqrt{\\rho_n \\log^2 n/n}}{p - q}\\right),\\ \\left(\\frac{\\rho_n \\log n}{n(p - q)^2}\\right)^{1/3}\\right), \\tag{13}$$\n\nand $n\\rho_n = \\Omega(\\log^2 n)$, then $\\psi^{(1)} = \\mathbf{1}_{C_1} + O_P(\\exp(-\\Omega(\\log n)))$ or $\\mathbf{1}_{C_2} + O_P(\\exp(-\\Omega(\\log n)))$, where the error term is uniform for all the coordinates.\nRemark 3.2. 
The lemma states that provided the separation between p and q does not vanish too fast, if the initial $\\psi^{(0)}$ is centered around two slightly different means, e.g., $\\mu_1 = 1/2 + c_n$ and $\\mu_2 = 1/2 - c_n$ for some sequence $c_n \\to 0$, then we converge to the truth within one iteration.\n\n4 Numerical results\n\nIn Figure 1-(a), we have generated a network from an SBM with parameters p = 0.4, q = 0.025, and two equal sized blocks of 100 nodes each. We generate 5000 initializations $\\psi^{(0)}$ from $\\mathrm{Beta}(\\alpha, \\beta)^{\\otimes n}$ (for four sets of \u03b1 and \u03b2) and map them to $a^{(0)}_{\\pm 1}$. We perform sample BCAVI updates on $\\psi^{(0)}$ with known p, q and color the points in the $a^{(0)}_{\\pm 1}$ co-ordinates according to the limit points they have converged to. In this case, $\\alpha_+ > 0$, hence based on Theorems 3.3 and 3.4, we expect points with $a^{(0)}_{+1} a^{(0)}_{-1} < 0$ to converge to the ground truth (colored green or magenta) and those with $a^{(0)}_{+1} a^{(0)}_{-1} > 0$ to converge to $\\mathbf{0}$ or $\\mathbf{1}$. As expected, points falling in the center of the first and third quadrants have converged to $\\mathbf{0}$ or $\\mathbf{1}$. The points converging to the ground truth lie more toward the boundaries but mostly remain in the same quadrants, suggesting possible perturbations arising from the sample noise and small network size. We see that this issue is alleviated when we increase n.\n\nFigure 1 (panels (a)\u2013(d)): n = 200 and 5000, $\\psi^{(0)} \\sim \\mathrm{Beta}(\\alpha, \\beta)^{\\otimes n}$ for various values of \u03b1 and \u03b2. These $\\psi^{(0)}$ are mapped to $(a^{(0)}_{+1}, a^{(0)}_{-1})$ (see (5)) and plotted. C1 (magenta) and C2 (green) correspond to the limit points $\\mathbf{1}_{C_1}$ and $\\mathbf{1}_{C_2}$. Other limit points are \u2018Ones\u2019, i.e. $\\mathbf{1}$ (blue) and \u2018Zeros\u2019, i.e. $\\mathbf{0}$ (red).\n\nThe notable thing is, in Figure 1-(a) and (d), the Beta distribution has mean 0.16 and 0.71 respectively. So the initialization is more skewed towards values that are closer to zero or closer to one. 
In these cases most of the random runs converge to all zeros or all ones, with very few converging to the ground truth. However, for Figure 1-(b) and (d), the mean of the Beta is 0.3 and 0.7, and we see considerably more convergences to the ground truth. Also, (b) and (d) are, in some sense, mirror images of each other, i.e. in one, the majority converges to $\\mathbf{0}$; whereas in the other, the majority converges to $\\mathbf{1}$.\nIn Figure 2-(a), we examine initializations of the type described in Lemma 3.3 and the resulting estimation error. For each $c_0$, we initialize $\\psi^{(0)}$ such that $E(\\psi^{(0)}) = (1/2 + c_0)\\mathbf{1}_{C_1} + (1/2 - c_0)\\mathbf{1}_{C_2}$ with iid noise. The y-axis shows the average distance between $\\psi^{(20)}$ and the true Z from 500 such initializations, as measured by $\\|\\psi^{(20)} - Z\\|_1/n$. For every choice of p, q, a network of size 400 with two equal sized blocks was generated. In all cases, sufficiently large $c_0$ guarantees convergence to the truth. We also observe that the performance deteriorates when p \u2212 q becomes small, either when p decreases or when the network becomes sparser.\n\n5 Discussion\n\nIn this paper, we work with the BCAVI mean field variational algorithm for a simple two class stochastic blockmodel with equal sized classes. Mean field methods are used widely for their scalability. However, existing theoretical works typically analyze the behavior of the global optima, or the local convergence behavior when initialized near the ground truth. In the simple setting considered, we show two interesting results. First, we show that, when the model parameters are known, random initializations may lead to convergence to the ground truth. In contrast, when the parameters are not known, but estimated, we show that a random initialization converges, with high probability, to a meaningless local optimum. 
This shows the futility of using multiple random initializations, which is typically done in practice when no prior knowledge is available.\nIn view of recent works on the optimization landscape for Gaussian mixtures [12, 24], we would like to comment that, despite falling into the category of latent variable models, the SBM has fundamental differences from Gaussian mixtures which require different analysis techniques. The posterior probabilities of the latent labels in the latter model can be easily estimated when the parameters are known, whereas this is not the case for SBM since the posterior probability P(Zi|A) depends on the entire network. The significance of Theorem 3.3 lies in characterizing the convergence of label estimates given the correct parameters for general initializations, which is different from the type of parameter convergence shown in [12, 24]. Furthermore, as most of the existing literature for the SBM focuses on estimating the labels first, our results provide an important complementary direction by suggesting that one could start with parameter estimation instead. A natural direction is to investigate how robust the results on the known p, q setting are when we can estimate p and q within some small error.\nWhile we only show results for two classes, we expect that our main theoretical results generalize well to K > 2 and will leave the analysis for future work. As an illustration, consider a setting similar to that of Figure 1-(a) but for n = 450 with K = 3 equal sized classes. The parameters p = 0.5, q = 0.01 are known and $\\psi^{(0)}$ is initialized with a Dirichlet(0.1, 0.1, 0.1) distribution. Each row of the matrix in Figure 2-(b) represents a stationary cluster membership vector from a random initialization.\nIn Figure 2, all 1000 random initializations converge to stationary points \u03c8 lying in the span of $\\{\\mathbf{1}_{C_1}, \\mathbf{1}_{C_2}, \\mathbf{1}_{C_3}\\}$, which are the membership vectors for each class. 
We represent the node memberships\nwith different colors, and there are 1 + (3 choose 2) = 4 different types of stationary points, not counting\nlabel permutations. Another stationary point (the all ones vector that puts everyone in the same class)\ncan be obtained with other initialization schemes, e.g., when the rows of \u03c8(0) are identical. For a\ngeneral K-blockmodel, we conjecture that the number of stationary points grows exponentially\nwith K. Similar to Figure 1-(a), a significant fraction of the random initializations converge to the\nground truth. On the other hand, when p, q are unknown, random initializations always converge to\nthe uninformative stationary point (1/3, 1/3, 1/3), analogous to Lemma 3.2.\n\nFigure 2: (a) Average distance between the estimated \u03c8 and the true Z with respect to c0, where\nE(\u03c8(0)) = (1/2 + c0)1C1 + (1/2 \u2212 c0)1C2. (b) Convergence to stationary points for known p, q,\nK = 3. Rows permuted for clarity.\n\nAcknowledgements\n\nSSM thanks Professor Peter J. Bickel for helpful discussions. PS is partially funded by NSF grant\nDMS1713082. YXRW is supported by the ARC DECRA Fellowship.\n\nReferences\n[1] Appendix for \u201cMean Field for the Stochastic Blockmodel: Optimization Landscape and Convergence\nIssues\u201d. 2018.\n\n[2] Pranjal Awasthi and Andrej Risteski. On some provably correct cases of variational inference\nfor topic models. In Advances in Neural Information Processing Systems, pages 2098\u20132106,\n2015.\n\n[3] Peter Bickel, David Choi, Xiangyu Chang, and Hai Zhang. Asymptotic normality of maximum\nlikelihood and its variational approximation for stochastic blockmodels. The Annals of Statistics,\npages 1922\u20131943, 2013.\n\n[4] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for\nstatisticians. Journal of the American Statistical Association, 112(518):859\u2013877, 2017.\n\n[5] David M. Blei, Andrew Y. 
Ng, and Michael I. Jordan. Latent Dirichlet allocation. J. Mach.\nLearn. Res., 3:993\u20131022, March 2003.\n\n[6] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete\ndata via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological),\npages 1\u201338, 1977.\n\n[7] Behrooz Ghorbani, Hamid Javadi, and Andrea Montanari. An instability in variational inference\nfor topic models. arXiv preprint arXiv:1802.00568, 2018.\n\n[8] Ryan Giordano, Tamara Broderick, and Michael I. Jordan. Covariances, robustness, and\nvariational Bayes. arXiv preprint arXiv:1709.02536, 2017.\n\n[9] Jake M. Hofman and Chris H. Wiggins. Bayesian approach to network modularity. Physical\nReview Letters, 100(25):258701, 2008.\n\n[10] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels:\nFirst steps. Social Networks, 5(2):109\u2013137, 1983.\n\n[11] Tommi S. Jaakkola and Michael I. Jordan. Improving the mean field approximation via the\nuse of mixture distributions. In Learning in Graphical Models, pages 163\u2013173. MIT Press,\nCambridge, MA, USA, 1999.\n\n[12] Chi Jin, Yuchen Zhang, Sivaraman Balakrishnan, Martin J. Wainwright, and Michael I. Jordan.\nLocal maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic\nconsequences. In Advances in Neural Information Processing Systems, pages 4116\u20134124,\n2016.\n\n[13] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction\nto variational methods for graphical models. Mach. Learn., 37(2):183\u2013233, November\n1999.\n\n[14] Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for non-convex\nlosses. 07 2016.\n\n[15] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an\nalgorithm. 
In Advances in Neural Information Processing Systems, pages 849\u2013856, 2002.\n\n[16] Debdeep Pati, Anirban Bhattacharya, and Yun Yang. On statistical optimality of variational\nBayes. arXiv preprint arXiv:1712.08983, 2017.\n\n[17] Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional\nstochastic blockmodel. The Annals of Statistics, pages 1878\u20131915, 2011.\n\n[18] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and\nvariational inference. Found. Trends Mach. Learn., 1(1-2):1\u2013305, January 2008.\n\n[19] Bo Wang and D. M. Titterington. Convergence and asymptotic normality of variational Bayesian\napproximations for exponential family models with missing values. In Proceedings of the 20th\nConference on Uncertainty in Artificial Intelligence, pages 577\u2013584. AUAI Press, 2004.\n\n[20] Bo Wang and D. M. Titterington. Inadequacy of interval estimates corresponding to variational\nBayesian approximations. In AISTATS, 2005.\n\n[21] Bo Wang and D. M. Titterington. Convergence properties of a general algorithm for calculating\nvariational Bayesian estimates for a normal mixture model. Bayesian Analysis, 1(3):625\u2013650,\n2006.\n\n[22] Yixin Wang and David M. Blei. Frequentist consistency of variational Bayes. arXiv preprint\narXiv:1705.03439, 2017.\n\n[23] Ted Westling and Tyler H. McCormick. Beyond prediction: A framework for inference with\nvariational approximations in mixture models. arXiv preprint arXiv:1510.08151, 2015.\n\n[24] Ji Xu, Daniel J. Hsu, and Arian Maleki. Global analysis of expectation maximization for\nmixtures of two Gaussians. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett,\neditors, Advances in Neural Information Processing Systems 29, pages 2676\u20132684. Curran\nAssociates, Inc., 2016.\n\n[25] Anderson Y. Zhang and Harrison H. Zhou. 
Theoretical and computational guarantees of mean\nfield variational inference for community detection. arXiv preprint arXiv:1710.11268, 2017.\n\n[26] Fengshuo Zhang and Chao Gao. Convergence rates of variational posterior distributions. arXiv\npreprint arXiv:1712.02519, 2017.", "award": [], "sourceid": 6802, "authors": [{"given_name": "Soumendu Sundar", "family_name": "Mukherjee", "institution": "University of California, Berkeley"}, {"given_name": "Purnamrita", "family_name": "Sarkar", "institution": "UT Austin"}, {"given_name": "Y. X. Rachel", "family_name": "Wang", "institution": "University of Sydney"}, {"given_name": "Bowei", "family_name": "Yan", "institution": "Jump Trading"}]}