{"title": "Clustering Sparse Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 2204, "page_last": 2212, "abstract": "We develop a new algorithm to cluster sparse unweighted graphs -- i.e. partition the nodes into disjoint clusters so that there is higher density within clusters, and low across clusters. By sparsity we mean the setting where both the in-cluster and across cluster edge densities are very small, possibly vanishing in the size of the graph. Sparsity makes the problem noisier, and hence more difficult to solve. Any clustering involves a tradeoff between minimizing two kinds of errors: missing edges within clusters and present edges across clusters. Our insight is that in the sparse case, these must be {\\em penalized differently}. We analyze our algorithm's performance on the natural, classical and widely studied ``planted partition'' model (also called the stochastic block model); we show that our algorithm can cluster sparser graphs, and with smaller clusters, than all previous methods. This is seen empirically as well.", "full_text": "Clustering Sparse Graphs\n\nDepartment of Electrical and Computer Engineering\n\nYudong Chen\n\nThe University of Texas at Austin\n\nAustin, TX 78712\n\nydchen@utexas.edu\n\nDepartment of Electrical and Computer Engineering\n\nSujay Sanghavi\n\nThe University of Texas at Austin\n\nAustin, TX 78712\n\nsanghavi@mail.utexas.edu\n\nHuan Xu\n\nMechanical Engineering Department\n\nNational University of Singapore\n\nSingapore 117575, Singapore\n\nmpexuh@nus.edu.sg\n\nAbstract\n\nWe develop a new algorithm to cluster sparse unweighted graphs \u2013 i.e. partition\nthe nodes into disjoint clusters so that there is higher density within clusters, and\nlow across clusters. By sparsity we mean the setting where both the in-cluster and\nacross cluster edge densities are very small, possibly vanishing in the size of the\ngraph. 
Sparsity makes the problem noisier, and hence more dif\ufb01cult to solve.\nAny clustering involves a tradeoff between minimizing two kinds of errors: miss-\ning edges within clusters and present edges across clusters. Our insight is that in\nthe sparse case, these must be penalized differently. We analyze our algorithm\u2019s\nperformance on the natural, classical and widely studied \u201cplanted partition\u201d model\n(also called the stochastic block model); we show that our algorithm can cluster\nsparser graphs, and with smaller clusters, than all previous methods. This is seen\nempirically as well.\n\n1\n\nIntroduction\n\nThis paper proposes a new algorithm for the following task: given a sparse undirected unweighted\ngraph, partition the nodes into disjoint clusters so that the density of edges within clusters is higher\nthan the edges across clusters. In particular, we are interested in settings where even within clus-\nters the edge density is low, and the density across clusters is an additive (or small multiplicative)\nconstant lower.\nSeveral large modern datasets and graphs are sparse; examples include the web graph, social graphs\nof various social networks, etc. Clustering naturally arises in these settings as a means/tool for\ncommunity detection, user pro\ufb01ling, link prediction, collaborative \ufb01ltering etc. More generally,\nthere are several clustering applications where one is given as input a set of similarity relationships,\nbut this set is quite sparse. Unweighted sparse graph clustering corresponds to a special case in\nwhich all similarities are either \u201c1\u201d or \u201c0\u201d.\nAs has been well-recognized, sparsity complicates clustering, because it makes the problem noisier.\nJust for intuition, imagine a random graph where every edge has a (potentially different) probability\npij (which can be re\ufb02ective of an underlying clustering structure) of appearing in the graph. 
Consider now the edge random variable, which is 1 if there is an edge, and 0 else. Then, in the sparse graph setting of small pij \u2192 0, the mean of this variable is pij but its standard deviation is \u221apij, which can be much larger. This problem gets worse as pij gets smaller. Another parameter governing problem difficulty is the size of the clusters; smaller clusters are easier to lose in the noise.\nOur contribution: We propose a new algorithm for sparse unweighted graph clustering. Clearly, there will be two kinds of deviations (i.e. errors) between the given graph and any candidate clustering: missing edges within clusters, and present edges across clusters. Our key realization is that for sparse graph clustering, these two types of error should be penalized differently. Doing so gives us a combinatorial optimization problem; our algorithm is a particular convex relaxation of the same, based on the fact that the cluster matrix is low-rank (we elaborate below). Our main analytical result in this paper is theoretical guarantees on its performance for the classical planted partition model [10], also called the stochastic block model [1, 22], for random clustered graphs. While this model has a rich literature (e.g., [4, 7, 10, 20]), we show that our algorithm outperforms (up to at most log factors) every existing method in this setting (i.e. it recovers the true clustering for a bigger range of sparsity and cluster sizes). Both the level of sparsity and the number and sizes of the clusters are allowed to be functions of n, the total number of nodes. In fact, we show that in a sense we are close to the boundary at which \u201cany\u201d spectral algorithm can be expected to work. 
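Returning to the sparsity intuition above (the \u221apij standard-deviation point), here is a tiny numeric sketch; the helper relative_noise is our own illustration, not part of the paper:

```python
import math

# For a Bernoulli(p) edge variable the mean is p but the standard deviation
# is sqrt(p * (1 - p)) ~ sqrt(p) for small p, so the noise-to-signal ratio
# std/mean grows without bound as p -> 0 (the sparse regime).
def relative_noise(p):
    return math.sqrt(p * (1 - p)) / p

for p in [0.5, 0.1, 0.01]:
    print(p, relative_noise(p))
```

At p = 0.5 the standard deviation equals the mean; at p = 0.01 it is roughly ten times the mean, which is the sense in which sparsity makes the problem noisier.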
Our simulation study confirms our theoretical finding that the proposed method is effective in clustering sparse graphs and outperforms existing methods.\nThe rest of the paper is organized as follows: Section 1.1 provides an overview of related work; Section 2 presents both the precise algorithm, and the idea behind it; Section 3 presents the main results \u2013 analytical results on the planted partition / stochastic block model \u2013 which are shown to outperform existing methods; Section 4 provides simulation results; and finally, the proofs of the main theoretical results are outlined in Section 5.\n\n1.1 Related Work\n\nThe general field of clustering, or even graph clustering, is too vast for a detailed survey here; we focus on the most related threads, and therein too primarily on work which provides theoretical \u201ccluster recovery\u201d guarantees on the resulting algorithms.\nCorrelation clustering: As mentioned above, every candidate clustering will have two kinds of errors; correlation clustering [2] weighs them equally, thus the objective is to find the clustering which minimizes just the total number of errors. This is an NP-hard problem, and [2] develops approximation algorithms. Subsequently, there has been much work on devising alternative approximation algorithms for both the weighted and unweighted cases, and for both agreement and disagreement objectives [12, 13, 3, 9]. Approximations based on LP relaxation [11] and SDP relaxation [25, 19], followed by rounding, have also been developed. All of this line of work is on worst-case guarantees. We emphasize that while we do convex relaxation as well, we do not do rounding; rather, our convex program itself yields an optimal clustering.\nPlanted partition model / Stochastic block model: This is a natural and classic model for studying graph clustering in the average case, and is also the setting for our performance guarantees. 
Our results are directly comparable to work here; we formally define this setting in Section 3, where we also present a detailed comparison after some notation and our theorem.\nSparse and low-rank matrix decomposition: It has recently been shown [8, 6] that, under certain conditions, it is possible to recover a low-rank matrix from sparse errors of arbitrary magnitude; this has even been applied to graph clustering [17]. Our algorithm turns out to be a weighted version of sparse and low-rank matrix decomposition, with different elements of the sparse part penalized differently, based on the given input. To our knowledge, ours is the first paper to study any weighted version; in that sense, while our weights have a natural motivation in our setting, our results are likely to have broader implications, for example robust versions of PCA when not all errors are created equal, but have a corresponding prior.\n\n2 Algorithm\n\nIdea: Our algorithm is a convex relaxation of a natural combinatorial objective for the sparse clustering problem. We now briefly motivate this objective, and then formally describe our algorithm. Recall that we want to find a clustering (i.e. a partition of the nodes) such that in-cluster connectivity is denser than across-cluster connectivity. Said differently, we want a clustering that has a small number of errors, where an error is either (a) an edge between two nodes in different clusters, or (b) a missing edge between two nodes in the same cluster. A natural (combinatorial) objective is to minimize a weighted combination of the two types of errors.\nThe correlation clustering setup [2] gives equal weights to the two types of errors. However, for sparse graphs, this will yield clusters with a very small number of nodes. 
This is because there is sparsity both within clusters and across clusters; grouping nodes in the same cluster will result in a lot of errors of type (b) above, without yielding corresponding gains in errors of type (a) \u2013 even when the nodes may actually belong to the same cluster. This can be very easily seen: suppose, for example, the \u201ctrue\u201d clustering has two clusters of equal size, and the in-cluster and across-cluster edge densities are both less than 1/4. Then, when both errors are weighted equally, the clustering which puts every node in a separate cluster will have lower cost than the true clustering.\nTo get more meaningful solutions, we penalize the two types of errors differently. In particular, sparsity means that we can expect many more errors of type (b) in any solution, and hence we should give this a (potentially much) smaller weight than errors of type (a). Our crucial insight is that we can know what kind of error will (potentially) occur on any given edge from the given adjacency matrix itself. In particular, if aij = 1 for some pair i, j, then in any clustering it will either have no error, or an error of type (a); it will never be an error of type (b). Similarly, if aij = 0 then it can only be an error of type (b), if at all. Our algorithm is a convex relaxation of the combinatorial problem of finding the minimum cost clustering, with the cost for an error on edge i, j determined based on the value of aij. Perhaps surprisingly, this simple idea yields better results than the extensive literature already in place for planted partitions.\nWe proceed by representing the given adjacency matrix A as the sum of two matrices A = Y + S, where we would like Y to be a cluster matrix, with yij = 1 if and only if i, j are in the same cluster, and 0 otherwise.1,2 
S is the corresponding error matrix as compared to the given A, and has entries +1, \u22121 and 0.\nWe now make a cost matrix C \u2208 R^{n\u00d7n} based on the insight above; we choose two values cA and cAc, and set cij = cA if the corresponding aij = 1, and cij = cAc if aij = 0. However, the diagonal entries are cii = 0. With this setup, we have\n\nCombinatorial Objective:\n\nmin_{Y,S} ||C \u25e6 S||_1\ns.t. Y + S = A,\nY is a cluster matrix. (1)\n\nHere C \u25e6 S denotes the matrix obtained via element-wise product between the two matrices C, S, i.e. (C \u25e6 S)ij = cij sij. Also || \u00b7 ||_1 denotes the element-wise \u21131 norm (i.e. the sum of the absolute values of the elements).\nAlgorithm: Our algorithm involves solving a convex relaxation of this combinatorial objective, by replacing the \u201cY is a cluster matrix\u201d constraint with (i) constraints 0 \u2264 yij \u2264 1 for all elements i, j, and (ii) a nuclear norm3 penalty ||Y||_* in the objective. The latter encourages Y to be low-rank, and is based on the well-established insight that the cluster matrix (being a block-diagonal collection of 1\u2019s) is low-rank. Thus we have our algorithm:\nSparse Graph Clustering:\n\nmin_{Y,S} ||Y||_* + ||C \u25e6 S||_1 (2)\ns.t. 0 \u2264 yij \u2264 1, \u2200 i, j,\nY + S = A. (3)\n\nOnce \u02c6Y is obtained, check if it is a cluster matrix (say e.g. via an SVD, which will also reveal cluster membership if it is). If it is not, any one of several rounding/aggregation ideas can be used empirically. Our theoretical results provide sufficient conditions under which the optimum of the convex program is integral and a clustering, with no rounding required. Section 3 in the supplementary material provides details on fast implementation for large matrices; this is one reason we did not include a semidefinite constraint on Y in our algorithm. Our algorithm has two positive parameters: cA, cAc. We defer discussion on how to choose them until after our main result.\nComments: Based on the given A and these values, the optimal \u02c6Y may or may not be a cluster matrix. If \u02c6Y is a cluster matrix, then clearly it minimizes the combinatorial objective above. Additionally, it is not hard to see (proof in the supplementary material) that its performance is \u201cmonotone\u201d, in the sense that adding edges \u201caligned with\u201d \u02c6Y cannot result in a different optimum, as summarized in the following lemma. This shows that, in the terminology of [19, 4, 14], our method is robust under a classical semi-random model where an adversary can add edges within clusters and remove edges between clusters.\nLemma 1. Suppose \u02c6Y is the optimum of Formulation (2) for a given A. Suppose now we arbitrarily change some edges of A to obtain \u02dcA, by (a) choosing some edges such that \u02c6yij = 1 but aij = 0, and making \u02dcaij = 1, and (b) choosing some edges where \u02c6yij = 0 but aij = 1, and making \u02dcaij = 0. Then, \u02c6Y is also an optimum of Formulation (2) with \u02dcA as the input.\n\n1 In this paper we will assume the convention that aii = 1 and yii = 1 for all nodes i.\n2 In other words, Y is the adjacency matrix of a graph consisting of disjoint cliques.\n3 The nuclear norm of a matrix is the sum of its singular values.\n\nOur theoretical guarantees characterize when the optimal \u02c6Y will be a cluster matrix, and recover the clustering, in a natural classical problem setting called the planted partition model [10]. 
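Before moving to the guarantees, here is a toy numpy sketch of the combinatorial objective (1) on a small planted instance, illustrating why the two error types need different weights; the instance, the seed and the helper weighted_cost are our own illustration, not part of the paper:

```python
import numpy as np

# Toy sketch of objective (1): the cost matrix C charges c_A on present
# edges (a_ij = 1) and c_Ac on absent ones, and a candidate cluster matrix
# Y pays the weighted error ||C o (A - Y)||_1.
def weighted_cost(A, Y, c_A, c_Ac):
    C = np.where(A == 1, c_A, c_Ac)
    np.fill_diagonal(C, 0.0)       # convention: diagonal costs c_ii = 0
    return np.abs(C * (A - Y)).sum()

rng = np.random.default_rng(0)
n, K = 20, 10                      # two planted clusters of size 10
Ytrue = np.zeros((n, n))
Ytrue[:K, :K] = 1.0
Ytrue[K:, K:] = 1.0
p, q = 0.2, 0.05                   # sparse: both densities below 1/4
U = np.where(Ytrue == 1, rng.random((n, n)) < p, rng.random((n, n)) < q)
A = np.triu(U, 1).astype(float)
A = A + A.T
np.fill_diagonal(A, 1.0)           # convention: a_ii = 1
Ysingle = np.eye(n)                # every node in its own cluster

equal = (weighted_cost(A, Ytrue, 1.0, 1.0),
         weighted_cost(A, Ysingle, 1.0, 1.0))
skewed = (weighted_cost(A, Ytrue, 1.0, 0.1),
          weighted_cost(A, Ysingle, 1.0, 0.1))
# With equal weights the all-singletons clustering is cheaper; giving
# type (b) errors (missing in-cluster edges) a smaller weight makes the
# true clustering the cheaper one.
```

This reproduces, numerically, the two-cluster example above: equal weighting prefers singletons on sparse graphs, while the asymmetric costs do not.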
These theoretical guarantees also provide guidance on how one would pick parameter values in practice; we thus defer discussion on parameter picking until after we present our main theorem.\n\n3 Performance Guarantees\n\nIn this section we provide analytical performance guarantees for our algorithm under a natural and classical graph clustering setting: (a generalization of) the planted partition model [10]. We first describe the model, and then our results.\n(Generalized) Planted partition model: Consider a random graph generated as follows: the n nodes are partitioned into r disjoint clusters, which we will refer to as the \u201ctrue\u201d clusters. Let K be the minimum cluster size. For every pair of nodes i, j that belong to the same cluster, edge (i, j) is present in the graph with probability that is at least \u00afp, while for every pair where the nodes are in different clusters the edge is present with probability at most \u00afq. We call this model the \u201cgeneralized\u201d planted partition because we allow for clusters to be of different sizes, and the edge probabilities also to be different (but uniformly bounded as mentioned). The objective is to find the partition, given the random graph generated from it.\nRecall that A is the given adjacency matrix, and let Y\u2217 be the matrix corresponding to the true clusters as above \u2013 i.e. y\u2217ij = 1 if and only if i, j are in the same true cluster, and 0 otherwise. Our result below establishes conditions under which our algorithm, specifically the convex program (2)-(3), yields this Y\u2217 as the unique optimum (without any further need for rounding etc.) with high probability (w.h.p.). Throughout the paper, with high probability means with probability at least 1 \u2212 c0 n^{\u221210} for some absolute constant c0.\nTheorem 1. Suppose we choose\n\ncA = (1/(16\u221a(n log n))) min{ \u221a((1 \u2212 \u00afq)/\u00afq), \u221a(n/log^4 n) }, and\ncAc = (1/(16\u221a(n log n))) min{ \u221a(\u00afp/(1 \u2212 \u00afp)), 1 }.\n\nThen (Y\u2217, A \u2212 Y\u2217) is the unique optimal solution to Formulation (2) w.h.p. provided \u00afq \u2264 1/4, and\n\n(\u00afp \u2212 \u00afq)/\u221a\u00afp \u2265 c1 (\u221an/K) log^2 n,\n\nwhere c1 is an absolute positive constant.\n\nOur theorem quantifies the tradeoff between the two quantities governing the hardness of a planted partition problem \u2013 the difference in edge densities p \u2212 q, and the minimum cluster size K \u2013 required for our algorithm to succeed, i.e. to recover the planted partition without any error. Note that here p, q and K are allowed to scale with n. We now discuss and remark on our result, and then compare its performance to past approaches and theoretical results in Table 1.\nNote that we need K to be \u2126(\u221an log^2 n). This will be achieved only when \u00afp \u2212 \u00afq is a constant that does not change with n; indeed in this extreme our theorem becomes a \u201cdense graph\u201d result, matching e.g. the scaling in [17, 19]. If (\u00afp \u2212 \u00afq)/\u221a\u00afp decreases with n, corresponding to a sparser regime, then the minimum size of K required will increase.\nA nice feature of our work is that we need \u00afp \u2212 \u00afq to be large only as compared to \u221a\u00afp; several other existing results (see Table 1) require a lower bound (as a function only of n, or of n, K) on \u00afp \u2212 \u00afq itself. This allows us to guarantee recovery for much sparser graphs than all existing results. For example, when K is \u0398(n), \u00afp and \u00afp \u2212 \u00afq can be as small as \u0398(log^4 n / n). 
This scaling is close to optimal: if \u00afp < (log n)/n then each cluster will be almost surely disconnected, and if \u00afp \u2212 \u00afq = o(1/n) then on average a node has equally many neighbours in its own cluster and in another cluster \u2013 both are ill-posed situations in which one cannot hope to recover the underlying clustering. When K = \u2126(\u221an log^2 n), \u00afp and \u00afp \u2212 \u00afq can be \u0398(n log^4 n / K^2), while the previous best result for this regime requires at least \u0398(n^2/K^3) [20].\nParameters: Our algorithm has two parameters: cA and cAc. The theorem provides a way to choose their values, assuming we know the values of the bounds \u00afp, \u00afq. To estimate these from data, we can use the following rule of thumb; our empirical results are based on this rule. If all the clusters have equal size K, it is easy to verify that the first eigenvalue of E[A \u2212 I] is K(p \u2212 q) \u2212 p + nq with multiplicity 1, the second eigenvalue is K(p \u2212 q) \u2212 p with multiplicity n/K \u2212 1, and the third eigenvalue is \u2212p with multiplicity n \u2212 n/K [16]. We thus have the following rule of thumb:\n\n1. Compute the eigenvalues of A \u2212 I, denoted as \u03bb1, . . . , \u03bbn, sorted in decreasing order.\n2. Let r = arg max_{i=1,...,n\u22121} (\u03bbi \u2212 \u03bbi+1). Set K = n/r.\n3. Solve for p and q from the equations K(p \u2212 q) \u2212 p + nq = \u03bb1 and K(p \u2212 q) \u2212 p = \u03bb2.\n\nTable 1: Comparison with literature. This table shows the lower-bound requirements on K and p \u2212 q that existing literature needs for exact recovery of the planted partitions/clusters. Note that this table is under the assumption that every cluster is of size K, and the edge densities are uniformly p and q (for within and across clusters respectively). 
As can be seen, our algorithm achieves a better p \u2212 q scaling than every other result. And, we achieve a better K scaling than every other result except Shamir [23], Oymak & Hassibi [21] and Giesen & Mitsche [15]; we are off by at most a log^2 n factor from each of these. Perhaps more importantly, we use a completely different algorithmic approach from all of the others.\n\nPaper | Min. cluster size K | Density difference p \u2212 q\nBoppana [5] | n/2 | \u2126(\u221a(p log n)/\u221an)\nJerrum & Sorkin [18] | n/2 | \u2126(1/n^{1/6\u2212\u03f5})\nCondon & Karp [10] | \u2126(n) | \u2126(1/n^{1/2\u2212\u03f5})\nCarson & Impagliazzo [7] | n/2 | \u03c9(\u221a(p log n)/\u221an)\nFeige & Kilian [14] | n/2 | \u2126(\u221a(n log n)/n)\nShamir [23] | \u2126(\u221a(n log n)) | \u2126(\u221a(n log n)/K)\nMcSherry [20] | \u2126(n^{2/3}) | \u2126(\u221a(p n^2/K^3))\nOymak & Hassibi [21] | \u2126(\u221an) | \u2126(max{\u221a(q log n)/K, \u221an/K})\nGiesen & Mitsche [15] | \u2126(\u221an) | \u2126(max{\u221a(log n)/K, \u221an/K})\nBollobas [4] | \u2126(n/log^{1/8} n) | \u2126(max{\u221a(q/n), log n/n})\nThis paper | \u2126(\u221an log^2 n) | \u2126(\u221a(pn) log^2 n / K)\n\nFigure 1: (a) Comparison of our method with Single-Linkage clustering (SLINK), spectral clustering, and the low-rank-plus-sparse (L+S) approach. The area above each curve is the set of values of (p, q) for which a method successfully recovers the underlying true clustering. (b) More detailed results for the area in the box in (a). The experiments are conducted on synthetic data with n = 1000 nodes and r = 5 clusters with equal size K = 200.\n\n4 Empirical Results\n\nWe perform experiments on synthetic data, and compare with other methods. We generate a graph using the planted partition model with n = 1000 nodes, r = 5 clusters with equal size K = 200, and p, q \u2208 [0, 1]. 
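The planted partition generation and the eigenvalue rule of thumb of Section 3 can be sketched together as follows; this is our own illustration on a smaller synthetic instance than the paper's n = 1000, and all variable names are ours:

```python
import numpy as np

# Generate a planted partition graph (smaller than the paper's experiment).
rng = np.random.default_rng(1)
n, r = 400, 4
K, p, q = n // r, 0.6, 0.05
labels = np.repeat(np.arange(r), K)
same = labels[:, None] == labels[None, :]
U = np.where(same, rng.random((n, n)) < p, rng.random((n, n)) < q)
A = np.triu(U, 1).astype(float)
A = A + A.T
np.fill_diagonal(A, 1.0)

# Step 1: eigenvalues of A - I, sorted in decreasing order.
lam = np.sort(np.linalg.eigvalsh(A - np.eye(n)))[::-1]
# Step 2: the largest gap lambda_i - lambda_{i+1} locates r; set K = n/r.
gaps = lam[:-1] - lam[1:]
r_hat = int(np.argmax(gaps)) + 1
K_hat = n // r_hat
# Step 3: solve K(p-q) - p + nq = lambda_1 and K(p-q) - p = lambda_2.
q_hat = (lam[0] - lam[1]) / n
p_hat = (lam[1] + K_hat * q_hat) / (K_hat - 1)
print(r_hat, K_hat, round(p_hat, 3), round(q_hat, 3))
```

On this instance the gap statistic recovers r = 4 and the solved (p_hat, q_hat) land close to the planted (0.6, 0.05), matching the expected-eigenvalue formulas quoted in the rule of thumb.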
We apply our method to the data, where we use the fast solver described in the supplementary material. We estimate p and q using the heuristic described in Section 3, and choose the weights cA and cAc according to the main theorem.4 Due to numerical accuracy, the output \u02c6Y of our algorithm may not be integral, so we apply the following simple rounding: compute the mean \u00afy of the entries of \u02c6Y, and round each entry of \u02c6Y to 1 if it is greater than \u00afy, and to 0 otherwise. We measure the error by ||Y\u2217 \u2212 round(\u02c6Y)||_1, which is simply the number of misclassified pairs. We say our method succeeds if it misclassifies less than 0.1% of the pairs.\nFor comparison, we consider three alternative methods: (1) Single-Linkage clustering (SLINK) [24], a hierarchical clustering method that merges the most similar clusters in each iteration. We use the difference between neighbourhoods, namely ||Ai\u00b7 \u2212 Aj\u00b7||_1, as the distance measure between nodes i and j, and stop when SLINK finds a clustering with r = 5 clusters. (2) A spectral clustering method [26], where we run SLINK on the top r = 5 singular vectors of A. (3) The low-rank-plus-sparse (L+S) approach [17, 21], followed by the same rounding scheme. Note that the first two methods assume knowledge of r, which is not available to our method. Success is measured in the same way as above.\nFor each q, we find the smallest p for which a method succeeds, and average over 20 trials. The results are shown in Figure 1(a), where the area above each curve corresponds to the range of feasible (p, q) for each method. It can be seen that our method subsumes all the others, in that we succeed for a strictly larger range of (p, q). 
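The mean-threshold rounding and pair-error count just described can be sketched as follows; here Yhat is a synthetic near-integral matrix standing in for solver output, and round_by_mean is our own name for the heuristic:

```python
import numpy as np

# Round a near-integral matrix by thresholding at the mean of its entries,
# then count misclassified pairs against the true cluster matrix.
def round_by_mean(Yhat):
    return (Yhat > Yhat.mean()).astype(float)

rng = np.random.default_rng(2)
n, K = 40, 20
Ytrue = np.zeros((n, n))
Ytrue[:K, :K] = 1.0
Ytrue[K:, K:] = 1.0
Yhat = Ytrue + 0.05 * rng.standard_normal((n, n))  # near-integral output
errors = np.abs(Ytrue - round_by_mean(Yhat)).sum() # misclassified pairs
print(errors)
```

Because the entries of a near-integral solution concentrate around 0 and 1, the entry-mean sits between the two groups and the thresholding recovers the cluster matrix exactly here.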
Figure 1(b) shows more detailed results for sparse graphs (p \u2264 0.3, q \u2264 0.1), for which SLINK and the trace-norm-plus-unweighted-\u21131 (L+S) approach completely fail, while our method significantly outperforms the spectral method, the only alternative method that works in this regime.\n\n5 Proof of Theorem 1\n\nOverview: Let S\u2217 \u225c A \u2212 Y\u2217. The proof consists of two main steps: (a) developing a new approximate dual certificate condition, i.e. a set of stipulations which, if satisfied by any matrix W, would guarantee the optimality of (Y\u2217, S\u2217), and (b) constructing a W that satisfies these stipulations with high probability. While at a high level these two steps have been employed in several papers on sparse and low-rank matrix decomposition, our analysis is different because it relies critically on the specific clustering setting we are in. Thus, even though we are looking at a potentially more involved setting with input-dependent weights on the sparse matrix regularizer, our proof is much simpler than several others in this space. Also, existing proofs do not cover our setting.\n\n4 We point out that searching for the best cA and cAc while keeping cA/cAc fixed might lead to better performance, which we do not pursue here.\n\nPreliminaries: Define the support sets \u2126 \u225c support(S\u2217) and R \u225c support(Y\u2217). Their complements are \u2126c and Rc respectively. 
Due to the constraints (3) in our convex program, if (Y\u2217 + \u2206, S\u2217 \u2212 \u2206) is a feasible solution to the convex program (2), then it has to be that \u2206 \u2208 D, where\n\nD \u225c {M \u2208 R^{n\u00d7n} | \u2200(i, j) \u2208 R : \u22121 \u2264 mij \u2264 0; \u2200(i, j) \u2208 Rc : 1 \u2265 mij \u2265 0}.\n\nThus we only need to execute steps (a), (b) above for optimality over this restricted set of deviations. Finally, we define the (now standard) projection operators: P\u2126(M) is the matrix whose (i, j)th entry is mij if (i, j) \u2208 \u2126, and 0 else. Let the SVD of Y\u2217 be U0\u03a30U0\u22a4 (notice that Y\u2217 is a symmetric positive semidefinite matrix), let PT\u22a5(M) \u225c (I \u2212 U0U0\u22a4)M(I \u2212 U0U0\u22a4) be the projection of M onto the space of matrices whose columns and rows are orthogonal to those of Y\u2217, and let PT(M) \u225c M \u2212 PT\u22a5(M).\n\nStep (a) - Dual certificate condition: The following proposition provides a sufficient condition for the optimality of (Y\u2217, S\u2217).\nProposition 1 (New Dual Certificate Conditions for Clustering). If there exists a matrix W \u2208 R^{n\u00d7n} and a positive number \u03f5 obeying the following conditions:\n\n1. ||PT\u22a5(W)|| \u2264 1.\n2. ||PT(W)||\u221e \u2264 (\u03f5/2) min{cAc, cA}.\n3. \u27e8P\u2126(U0U0\u22a4 + W), \u2206\u27e9 = (1 + \u03f5)||P\u2126(C \u25e6 \u2206)||_1, \u2200\u2206 \u2208 D.\n4. \u27e8P\u2126c(U0U0\u22a4 + W), \u2206\u27e9 \u2265 \u2212(1 \u2212 \u03f5)||P\u2126c(C \u25e6 \u2206)||_1, \u2200\u2206 \u2208 D,\n\nthen (Y\u2217, S\u2217) is the unique optimal solution to the convex program (2).\n\nThe proof is in the supplementary material; it also involves several steps unique to our clustering setup here.\nStep (b) - Dual certificate construction: We now construct a W, and show that it satisfies the conditions in Proposition 1 w.h.p. (but not always, and this is key to its simple construction). To keep the notation light, we consider the standard planted partition model, where the edge probabilities are uniform; that is, for every pair of nodes in the same cluster, there is an edge between them with probability p \u2265 \u00afp, and for every pair where the nodes are in different clusters, the edge is present with probability q \u2264 \u00afq. It is straightforward to adapt the proof to the general case with non-uniform edge probabilities. We define W \u225c W1 + W2, where\n\nW1 \u225c \u2212P\u2126(U0U0\u22a4) + \u2211_{m=1}^{r} ((1 \u2212 p)/p) (1/km) 1_{Rm \u2229 \u2126c},\nW2 \u225c (1 + \u03f5) [ C \u25e6 S\u2217 + (cAc(1 \u2212 p)/p) 1_{R \u2229 \u2126c} \u2212 (cA q/(1 \u2212 q)) 1_{Rc \u2229 \u2126c} ].\n\nIntuitively speaking, the idea is that W1 and W2 are zero-mean random matrices, so they are likely to have small norms. To prove Theorem 1, it remains to show that W satisfies the desired conditions w.h.p.; this is done below, with proof in the supplementary, and is much simpler than similar proofs in the sparse-plus-low-rank literature.\nProposition 2. 
Under the assumptions of Theorem 1, with high probability, W satisfies the conditions in Proposition 1 with \u03f5 = (2 log^2 n / K) \u221a(n/p).\n\n6 Conclusion\n\nWe presented a convex optimization formulation, essentially a weighted version of low-rank matrix decomposition, to address graph clustering where the graph is sparse. We showed that under a wide range of problem parameters, the proposed method is guaranteed to recover the correct clustering. In fact, our theoretical analysis shows that the proposed method outperforms, i.e., succeeds under less restrictive conditions than, every existing method in this setting. Simulation studies also validate the efficiency and effectiveness of the proposed method.\nThis work is motivated by the analysis of large-scale social networks, where, inherently, even actors (nodes) within one cluster are more likely than not to be unconnected. As such, immediate goals for future work include faster algorithm implementations, as well as developing effective postprocessing schemes (e.g., rounding) when the obtained solution is not an exact cluster matrix.\n\nAcknowledgments\n\nS. Sanghavi would like to acknowledge NSF grants 0954059 and 1017525, and ARO grant W911NF1110265. The research of H. Xu is partially supported by the Ministry of Education of Singapore through NUS startup grant R-265-000-384-133.\n\nReferences\n\n[1] P. Holland, K.B. Laskey, and S. Leinhardt. Stochastic blockmodels: Some first steps. Social Networks, 5:109\u2013137, 1983.\n[2] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56(1):89\u2013113, 2004.\n[3] H. Becker. A survey of correlation clustering. Available online at http://www1.cs.columbia.edu/~hila/clustering.pdf, 2005.\n[4] B. Bollobas and A.D. Scott. Max cut for random graphs with a planted partition. Combinatorics, Probability and Computing, 13(4-5):451\u2013474, 2004.\n[5] R.B. Boppana. 
Eigenvalues and graph bisection: An average-case analysis. In 28th Annual Symposium on Foundations of Computer Science, pages 280\u2013285. IEEE, 1987.\n[6] E.J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Arxiv preprint arXiv:0912.3599, 2009.\n[7] T. Carson and R. Impagliazzo. Hill-climbing finds random planted bisections. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 903\u2013909. Society for Industrial and Applied Mathematics, 2001.\n[8] V. Chandrasekaran, S. Sanghavi, S. Parrilo, and A. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572\u2013596, 2011.\n[9] M. Charikar, V. Guruswami, and A. Wirth. Clustering with qualitative information. In 44th Annual IEEE Symposium on Foundations of Computer Science, pages 524\u2013533. IEEE, 2003.\n[10] A. Condon and R.M. Karp. Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms, 18(2):116\u2013140, 2001.\n[11] E. Demaine and N. Immorlica. Correlation clustering with partial information. In Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques, pages 71\u201380, 2003.\n[12] E.D. Demaine, D. Emanuel, A. Fiat, and N. Immorlica. Correlation clustering in general weighted graphs. Theoretical Computer Science, 361(2):172\u2013187, 2006.\n[13] D. Emanuel and A. Fiat. Correlation clustering \u2013 minimizing disagreements on arbitrary weighted graphs. In Algorithms \u2013 ESA 2003, pages 208\u2013220, 2003.\n[14] U. Feige and J. Kilian. Heuristics for semirandom graph problems. Journal of Computer and System Sciences, 63(4):639\u2013671, 2001.\n[15] J. Giesen and D. Mitsche. Bounding the misclassification error in spectral partitioning in the planted partition model. 
In Graph-Theoretic Concepts in Computer Science, pages 409\u2013420. Springer, 2005.\n[16] J. Giesen and D. Mitsche. Reconstructing many partitions using spectral techniques. In Fundamentals of Computation Theory, pages 433\u2013444. Springer, 2005.\n[17] A. Jalali, Y. Chen, S. Sanghavi, and H. Xu. Clustering partially observed graphs via convex optimization. Arxiv preprint arXiv:1104.4803, 2011.\n[18] M. Jerrum and G.B. Sorkin. The Metropolis algorithm for graph bisection. Discrete Applied Mathematics, 82(1-3):155\u2013175, 1998.\n[19] C. Mathieu and W. Schudy. Correlation clustering with noisy input. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 712\u2013728. Society for Industrial and Applied Mathematics, 2010.\n[20] F. McSherry. Spectral partitioning of random graphs. In 42nd IEEE Symposium on Foundations of Computer Science, pages 529\u2013537. IEEE, 2001.\n[21] S. Oymak and B. Hassibi. Finding dense clusters via \u201clow rank + sparse\u201d decomposition. Arxiv preprint arXiv:1104.5186, 2011.\n[22] K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional stochastic block model. Technical Report 791, Statistics Department, UC Berkeley, 2010.\n[23] R. Shamir and D. Tsur. Improved algorithms for the random cluster graph model. Random Structures & Algorithms, 31(4):418\u2013449, 2007.\n[24] R. Sibson. SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30\u201334, 1973.\n[25] C. Swamy. Correlation clustering: maximizing agreements via semidefinite programming. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 526\u2013527. Society for Industrial and Applied Mathematics, 2004.\n[26] U. Von Luxburg. A tutorial on spectral clustering. 
Statistics and Computing, 17(4):395\u2013416, 2007.\n", "award": [], "sourceid": 1100, "authors": [{"given_name": "Yudong", "family_name": "Chen", "institution": null}, {"given_name": "Sujay", "family_name": "Sanghavi", "institution": null}, {"given_name": "Huan", "family_name": "Xu", "institution": null}]}