{"title": "Graph Clustering With Missing Data: Convex Algorithms and Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 2996, "page_last": 3004, "abstract": "We consider the problem of finding clusters in an unweighted graph, when the graph is partially observed. We analyze two programs, one which works for dense graphs and one which works for both sparse and dense graphs, but requires some a priori knowledge of the total cluster size, that are based on the convex optimization approach for low-rank matrix recovery using nuclear norm minimization. For the commonly used Stochastic Block Model, we obtain \\emph{explicit} bounds on the parameters of the problem (size and sparsity of clusters, the amount of observed data) and the regularization parameter characterize the success and failure of the programs. We corroborate our theoretical findings through extensive simulations. We also run our algorithm on a real data set obtained from crowdsourcing an image classification task on the Amazon Mechanical Turk, and observe significant performance improvement over traditional methods such as k-means.", "full_text": "Graph Clustering With Missing Data : Convex\n\nAlgorithms and Analysis\n\nRamya Korlakai Vinayak, Samet Oymak, Babak Hassibi\n\nDepartment of Electrical Engineering\n\n{ramya, soymak}@caltech.edu, hassibi@systems.caltech.edu\n\nCalifornia Institute of Technology, Pasadena, CA 91125\n\nAbstract\n\nWe consider the problem of \ufb01nding clusters in an unweighted graph, when the\ngraph is partially observed. We analyze two programs, one which works for dense\ngraphs and one which works for both sparse and dense graphs, but requires some a\npriori knowledge of the total cluster size, that are based on the convex optimization\napproach for low-rank matrix recovery using nuclear norm minimization. 
For the commonly used Stochastic Block Model, we obtain explicit bounds on the parameters of the problem (size and sparsity of clusters, the amount of observed data) and the regularization parameter that characterize the success and failure of the programs. We corroborate our theoretical findings through extensive simulations. We also run our algorithm on a real data set obtained from crowdsourcing an image classification task on the Amazon Mechanical Turk, and observe significant performance improvement over traditional methods such as k-means.

1 Introduction

Clustering [1] broadly refers to the problem of identifying data points that are similar to each other. It has applications in various problems in machine learning, data mining [2, 3], social networks [4-6], bioinformatics [7, 8], etc. In this paper we focus on graph clustering [9] problems where the data is in the form of an unweighted graph. Clearly, observing the entire graph on n nodes requires n(n-1)/2 measurements. In most practical scenarios this is infeasible and we can only expect to have partial observations. That is, for some node pairs we know whether there exists an edge between them or not, whereas for the rest of the node pairs we do not have this knowledge. This leads us to the problem of clustering graphs with missing data.

Given the adjacency matrix of an unweighted graph, a cluster is defined as a set of nodes that are densely connected to each other when compared to the rest of the nodes. We consider the problem of identifying such clusters when the input is a partially observed adjacency matrix. We use the popular Stochastic Block Model (SBM) [10] or Planted Partition Model [11] to analyze the performance of the proposed algorithms. SBM is a random graph model where the edge probability depends on whether the pair of nodes being considered belongs to the same cluster or not.
More specifically, the edge probability is higher when both nodes belong to the same cluster. Further, we assume that each entry of the adjacency matrix of the graph is observed independently with probability r. We define the model in detail in Section 2.1.

1.1 Clustering by Low-Rank Matrix Recovery and Completion

The idea of using convex optimization for clustering has been proposed in [12-21]. While each of these works differs in certain ways (we comment on their relation to the current paper in Section 1.3), the common approach they use for clustering is inspired by recent work on low-rank matrix recovery and completion via regularized nuclear norm (trace norm) minimization [22-26].

In the case of unweighted graphs, an ideal clustered graph is a union of disjoint cliques. Given the adjacency matrix of an unweighted graph with clusters (denser connectivity inside the clusters compared to outside), we can interpret it as an ideal clustered graph with missing edges inside the clusters and erroneous edges in between clusters. Recovering the low-rank matrix corresponding to the disjoint cliques is equivalent to finding the clusters.

We will look at the following well-known convex program, which aims to recover and complete the low-rank matrix (L) from the partially observed adjacency matrix (Aobs):

Simple Convex Program:

minimize over L, S:  ‖L‖⋆ + λ‖S‖₁   (1.1)
subject to
  1 ≥ Li,j ≥ 0 for all i, j ∈ {1, 2, . . . , n}   (1.2)
  Lobs + Sobs = Aobs   (1.3)

where λ ≥ 0 is the regularization parameter, ‖·‖⋆ is the nuclear norm (sum of the singular values of the matrix), and ‖·‖₁ is the l1-norm (sum of absolute values of the entries of the matrix). S is the sparse error matrix that accounts for the missing edges inside the clusters and erroneous edges outside the clusters on the observed entries.
Lobs and Sobs denote the entries of L and S that correspond to the observed part of the adjacency matrix.

Program 1.1 is very simple and intuitive. Further, it does not require any information other than the observed part of the adjacency matrix. In [13], the authors analyze Program 1.1 without the constraint (1.2). While dropping (1.2) makes the convex program less effective, it does allow [13] to make use of low-rank matrix completion results for its analysis. In [16] and [21], the authors analyze Program 1.1 when the entire adjacency matrix is observed. In [17], the authors study a slightly more general program, where the regularization parameter is different for the extra edges and the missing edges. However, the adjacency matrix is completely observed.

It is not difficult to see that, when the edge probability inside the cluster is p < 1/2 (as n → ∞), Program 1.1 will return L0 = 0 as the optimal solution (since if the cluster is not dense enough it is more costly to complete the missing edges). As a result our analysis of Program 1.1, and the main result of Theorem 1, assumes p > 1/2. Clearly, there are many instances of graphs we would like to cluster where p < 1/2. If the total size of the cluster region (i.e., the total number of edges in the cluster, denoted by |R|) is known, then the following convex program can be used, and can be shown to work for p < 1/2 (see Theorem 2).

Improved Convex Program:

minimize over L, S:  ‖L‖⋆ + λ‖S‖₁   (1.4)
subject to
  1 ≥ Li,j ≥ Si,j ≥ 0 for all i, j ∈ {1, 2, . . . , n}   (1.5)
  Li,j = Si,j whenever Aobs_i,j = 0   (1.6)
  sum(L) ≥ |R|   (1.7)

As before, L is the low-rank matrix corresponding to the ideal cluster structure and λ ≥ 0 is the regularization parameter.
However, S is now the sparse error matrix that accounts only for the missing edges inside the clusters on the observed part of the adjacency matrix. [16] and [19] study programs similar to Program 1.4 for the case of a completely observed adjacency matrix. In [19], the constraint (1.7) is a strict equality. In [15] the authors analyze a program close to Program 1.4 but without the l1 penalty.

If |R| is not known, it is possible to solve Program 1.4 for several values of |R| until the desired performance is obtained. Our empirical results, reported in Section 3, suggest that the solution is not very sensitive to the choice of |R|.

1.2 Our Contributions

• We analyze the Simple Convex Program 1.1 for the SBM with partial observations. We provide explicit bounds on the regularization parameter, as a function of the parameters of the SBM, that characterize the success and failure conditions of Program 1.1 (see results in Section 2.2). We show that clusters that are either too small or too sparse constitute the bottleneck. Our analysis is helpful in understanding the phase transition from failure to success for the simple approach.
• We also analyze the Improved Convex Program 1.4. We explicitly characterize the conditions on the parameters of the SBM and the regularization parameter for successfully recovering clusters using this approach (see results in Section 2.3).
• Apart from providing theoretical guarantees and corroborating them with simulation results (Section 3), we also apply Programs 1.1 and 1.4 on a real data set (Section 3.3) obtained by crowdsourcing an image labeling task on Amazon Mechanical Turk.

1.3 Related Work

In [13], the authors consider the problem of identifying clusters from partially observed unweighted graphs.
For the SBM with partial observations, they analyze Program 1.1 without constraint (1.2), and show that under certain conditions, the minimum cluster size must be at least O(√(n log⁴(n)/r)) for successful recovery of the clusters. Unlike our analysis, the exact requirement on the cluster size is not known (since the constant of proportionality is not known). Also, they do not provide conditions under which the approach fails to identify the clusters. Finding explicit bounds on the constant of proportionality is critical to understanding the phase transition from failure to successfully identifying clusters.

The works [14-19] analyze convex programs similar to Programs 1.1 and 1.4 for the SBM and show that the minimum cluster size should be at least O(√n) for successfully recovering the clusters. However, the exact requirement on the cluster size is not known. Also, they do not provide explicit conditions for failure, and except for [16] they do not address the case when the data is missing. In contrast, we consider the problem of clustering with missing data. We explicitly characterize the constants by providing bounds on the model parameters that decide if Programs 1.1 and 1.4 can successfully identify clusters. Furthermore, for Program 1.1, we also explicitly characterize the conditions under which the program fails.

In [16], the authors extend their results to partial observations by scaling the edge probabilities by r (the observation probability), which will not work for r < 1/2 or 1/2 < p < 1/2r in Program 1.1. [21] analyzes Program 1.1 for the SBM and provides conditions for success and failure of the program when the entire adjacency matrix is observed. The dependence on the number of observed entries emerges non-trivially in our analysis.
Further, [21] does not address the drawback of Program 1.1, which is the requirement p > 1/2, whereas in our work we analyze Program 1.4, which overcomes this drawback.

2 Partially Observed Unweighted Graph

2.1 Model

Definition 2.1 (Stochastic Block Model). Let A = Aᵀ be the adjacency matrix of a graph on n nodes with K disjoint clusters of size ni each, i = 1, 2, · · · , K. Let 1 ≥ pi ≥ 0, i = 1, · · · , K and 1 ≥ q ≥ 0. For l > m,

  Al,m = 1 w.p. pi, if both nodes l, m are in the same cluster i;
  Al,m = 1 w.p. q, if nodes l, m are not in the same cluster.   (2.1)

If pi > q for each i, then we expect the density of edges to be higher inside the clusters compared to outside. We will say the random variable Y has a Φ(r, δ) distribution, for 0 ≤ δ, r ≤ 1, written as Y ∼ Φ(r, δ), if

  Y = 1 w.p. rδ;  Y = 0 w.p. r(1 − δ);  Y = ∗ w.p. (1 − r),

where ∗ denotes unknown.

Definition 2.2 (Partial Observation Model). Let A be the adjacency matrix of a random graph generated according to the Stochastic Block Model of Definition 2.1. Let 0 < r ≤ 1. Each entry of the adjacency matrix A is observed independently with probability r. Let Aobs denote the observed adjacency matrix. Then for l > m: (Aobs)l,m ∼ Φ(r, pi) if both the nodes l and m belong to the same cluster i. Otherwise, (Aobs)l,m ∼ Φ(r, q).

2.2 Results: Simple Convex Program

Let [n] = {1, 2, · · · , n}. Let R be the union of the regions induced by the clusters and Rc = [n] × [n] − R its complement. Note that |R| = Σ_{i=1}^K ni² and |Rc| = n² − Σ_{i=1}^K ni².
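The two definitions above are easy to simulate. The sketch below (an illustration, not code from the paper; the function name is ours) draws A from the Stochastic Block Model and then reveals each entry independently with probability r, using −1 in place of the ∗ of Φ(r, ·):

```python
import numpy as np

def sample_partially_observed_sbm(cluster_sizes, p_list, q, r, rng):
    """Definitions 2.1 and 2.2: draw a symmetric 0/1 adjacency matrix with
    K disjoint clusters, then observe each entry independently w.p. r.
    Unobserved entries (the '*' of Phi(r, .)) are marked with -1; the
    diagonal is left unobserved since the model only specifies l > m."""
    n = int(np.sum(cluster_sizes))
    labels = np.repeat(np.arange(len(cluster_sizes)), cluster_sizes)
    same = labels[:, None] == labels[None, :]
    p_row = np.repeat(np.asarray(p_list, dtype=float), cluster_sizes)
    prob = np.where(same, p_row[:, None], q)       # p_i inside cluster i, q outside
    upper = np.triu(rng.random((n, n)) < prob, 1)  # draw edges for l > m only
    A = (upper | upper.T).astype(int)              # symmetrize, zero diagonal
    obs = np.triu(rng.random((n, n)) < r, 1)
    obs = obs | obs.T
    return np.where(obs, A, -1)
```

With p = 1, q = 0 and r = 1 this reduces to the ideal clustered graph (a union of disjoint cliques), which is convenient for sanity checks.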
Let nmin := min_{1≤i≤K} ni, pmin := min_{1≤i≤K} pi and nmax := max_{1≤i≤K} ni.

The following definitions are important to describe our results.
• Define Di := ni r (2pi − 1) as the effective density of cluster i and Dmin := min_{1≤i≤K} Di.
• γsucc := max_{1≤i≤K} 2r √ni √(2(1/r − 1) + 4(q(1 − q) + pi(1 − pi))) and γfail := Σ_{i=1}^K ni²/n.
• Λsucc⁻¹ := 2r√n √(1/r − 1 + 4q(1 − q)) + γsucc and Λfail⁻¹ := √(rq(n − γfail)).

We note that the thresholds Λsucc and Λfail depend only on the parameters of the model. Some simple algebra shows that Λsucc < Λfail.

Theorem 1 (Simple Program). Consider a random graph generated according to the Partial Observation Model of Definition 2.2 with K disjoint clusters of sizes {ni}_{i=1}^K, and probabilities {pi}_{i=1}^K and q, such that pmin > 1/2 > q > 0. Given ε > 0, there exist positive constants c′1, c′2 such that:
1. If λ ≥ (1 + ε)Λfail, then Program 1.1 fails to correctly recover the clusters with probability 1 − c′1 exp(−c′2 |Rc|).
2. If 0 < λ ≤ (1 − ε)Λsucc,
  • If Dmin ≥ (1 + ε)/λ, then Program 1.1 succeeds in correctly recovering the clusters with probability 1 − c′1 n² exp(−c′2 nmin).
  • If Dmin ≤ (1 − ε)/λ, then Program 1.1 fails to correctly recover the clusters with probability 1 − c′1 exp(−c′2 nmin).

Discussion:
1. Theorem 1 characterizes the success and failure of Program 1.1 as a function of the regularization parameter λ. In particular, if λ > Λfail, Program 1.1 fails with high probability.
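To make these quantities concrete, the sketch below evaluates Dmin, Λsucc and Λfail for a sample parameter set. The formulas are transcribed as we read the definitions of this section from the (garbled) source, so treat both the function and the numbers as illustrative rather than authoritative:

```python
import numpy as np

def simple_program_thresholds(ns, ps, q, r):
    """Evaluate D_min, Lambda_succ, Lambda_fail from Section 2.2
    (as transcribed here; illustrative only)."""
    ns, ps = np.asarray(ns, float), np.asarray(ps, float)
    n = ns.sum()
    D = ns * r * (2 * ps - 1)  # effective densities D_i
    gamma_succ = (2 * r * np.sqrt(ns)
                  * np.sqrt(2 * (1 / r - 1)
                            + 4 * (q * (1 - q) + ps * (1 - ps)))).max()
    gamma_fail = (ns ** 2).sum() / n
    lam_succ = 1.0 / (2 * r * np.sqrt(n)
                      * np.sqrt(1 / r - 1 + 4 * q * (1 - q)) + gamma_succ)
    lam_fail = 1.0 / np.sqrt(r * q * (n - gamma_fail))
    return D.min(), lam_succ, lam_fail

# Parameters in the style of Section 3: three clusters of 200, p = 0.85,
# q = 0.1, and half of the entries observed (r = 0.5).
D_min, lam_succ, lam_fail = simple_program_thresholds([200] * 3, [0.85] * 3, 0.1, 0.5)
```

For this parameter set the ordering Λsucc < Λfail claimed in the text holds, and Dmin > 1/Λsucc, so a λ slightly below Λsucc falls in the success regime of Theorem 1.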
If λ < Λsucc, Program 1.1 succeeds with high probability if and only if Dmin > 1/λ. However, Theorem 1 has nothing to say about Λsucc < λ < Λfail.
2. Small Cluster Regime: When nmax = o(n), we have Λsucc⁻¹ = 2r√n √(1/r − 1 + 4q(1 − q)). For simplicity let pi = p for all i, which yields Dmin = nmin r(2p − 1). Then Dmin > Λsucc⁻¹ implies

  nmin > (2√n / (2p − 1)) √(1/r − 1 + 4q(1 − q)),   (2.2)

giving a lower bound on the minimum cluster size that is sufficient for success.

2.3 Results: Improved Convex Program

The following definitions are critical to describe our results.
• Define D̃i := ni r (pi − q) as the effective density of cluster i and D̃min := min_{1≤i≤K} D̃i.
• γ̃succ := 2r max_{1≤i≤K} √ni √((1 − pi)(1/r − 1 + pi) + (1 − q)(1/r − 1 + q)).
• Λ̃succ⁻¹ := 2r√n √((1/r − 1 + q)(1 − q)) + γ̃succ.

Figure 1: Region of success (white region) and failure (black region) of Program 1.1 with λ = 1.01 Dmin⁻¹. In (a), observation probability r versus edge probability p inside the clusters; in (b), r versus minimum cluster size. The solid red curve is the threshold for success (λ < Λsucc) and the dashed green line is the threshold for failure (λ > Λfail) as predicted by Theorem 1.

We note that the threshold Λ̃succ depends only on the parameters of the model.
Theorem 2 (Improved Program).
Consider a random graph generated according to the Partial Observation Model of Definition 2.2, with K disjoint clusters of sizes {ni}_{i=1}^K, and probabilities {pi}_{i=1}^K and q, such that pmin > q > 0. Given ε > 0, there exist positive constants c′1, c′2 such that: if 0 < λ ≤ (1 − ε)Λ̃succ and D̃min ≥ (1 + ε)/λ, then Program 1.4 succeeds in recovering the clusters with probability 1 − c′1 n² exp(−c′2 nmin).

Discussion:¹
1. Theorem 2 gives a sufficient condition for the success of Program 1.4 as a function of λ. In particular, for any λ > 0, we succeed if D̃min⁻¹ < λ < Λ̃succ.
2. Small Cluster Regime: When nmax = o(n), we have Λ̃succ⁻¹ = 2r√n √((1/r − 1 + q)(1 − q)). For simplicity let pi = p for all i, which yields D̃min = nmin r(p − q). Then D̃min > Λ̃succ⁻¹ implies

  nmin > (2√n / (p − q)) √((1/r − 1 + q)(1 − q)),   (2.3)

which gives a lower bound on the minimum cluster size that is sufficient for success.

3. (p, q) as a function of n: We now briefly discuss the regime in which cluster sizes are large (i.e., O(n)) and we are interested in the parameters (p, q) as a function of n that allow the proposed approaches to be successful. Critical to Program 1.4 is the constraint (1.6): Li,j = Si,j when Aobs_i,j = 0 (which is the only constraint involving the adjacency Aobs). With missing data, Aobs_i,j = 0 with probability r(1 − p) inside the clusters and r(1 − q) outside the clusters. Defining p̂ = rp + 1 − r and q̂ = rq + 1 − r, the number of constraints in (1.6) becomes statistically equivalent to those of a fully observed graph where p and q are replaced by p̂ and q̂.
Consequently, for a fixed r > 0, from (2.3), we require p ≥ p − q ≳ O(1/√n) for success. However, setting the unobserved entries to 0 yields Ai,j = 0 with probability 1 − rp inside the clusters and 1 − rq outside the clusters. This is equivalent to a fully observed graph where p and q are replaced by rp and rq. In this case, we can allow p ≈ O(1/n) for success, which is order-wise better, and matches the results in McSherry [27]. Intuitively, clustering a fully observed graph with parameters p̂ = rp + 1 − r and q̂ = rq + 1 − r is much more difficult than one with rp and rq, since the links are more noisy in the former case. Hence, while it is beneficial to leave the unobserved entries blank in Program 1.1, for Program 1.4 it is in fact beneficial to set the unobserved entries to 0.

¹The proofs for Theorems 1 and 2 are provided in the supplementary material.

Figure 2: Simulation results for the Improved Program. (a) Region of success (white region) and failure (black region) of Program 1.4 with λ = 0.49 Λ̃succ. The solid red curve is the threshold for success (D̃min > λ⁻¹) as predicted by Theorem 2. (b) Comparison of the range of edge probability p for the Simple Program 1.1 and the Improved Program 1.4.

3 Experimental Results

We implement Programs 1.1 and 1.4 using the inexact augmented Lagrange method of multipliers [28]. Note that this method solves Programs 1.1 and 1.4 approximately. Further, numerical imprecision will prevent the entries of the output of the algorithms from being strictly equal to 0 or 1. We use the mean of all the entries of the output as a hard threshold to round each entry.
That is, if an entry is less than the threshold, it is rounded to 0, and to 1 otherwise. We compare the output of the algorithm after rounding to the optimal solution (L0), and declare success if the number of wrong entries is less than 0.1%.

Set Up: We consider an unweighted graph on n = 600 nodes with 3 disjoint clusters. For simplicity the clusters are of equal size, n1 = n2 = n3, and the edge probability inside the clusters is the same, p1 = p2 = p3 = p. The edge probability outside the clusters is fixed, q = 0.1. We generate the adjacency matrix randomly according to the Stochastic Block Model 2.1 and the Partial Observation Model 2.2. All the results are an average over 20 experiments.

3.1 Simulations for Simple Convex Program

Dependence between r and p: In the first set of experiments we keep n1 = n2 = n3 = 200, and vary p from 0.55 to 1 and r from 0.05 to 1 in steps of 0.05.
Dependence between nmin and r: In the second set of experiments we keep the edge probability inside the clusters fixed, p = 0.85. The cluster size is varied from nmin = 20 to nmin = 200 in steps of 20 and r is varied from 0.05 to 1 in steps of 0.05.
In both experiments, we set the regularization parameter λ = 1.01 Dmin⁻¹, ensuring that Dmin > 1/λ, enabling us to focus on observing the transition around Λsucc and Λfail. The outcomes of the experiments are shown in Figures 1a and 1b. The experimental region of success is shown in white and the region of failure is shown in black. The theoretical region of success is above the solid red curve (λ < Λsucc) and the region of failure is below the dashed green curve (λ > Λfail). As we can see, the transition indeed occurs between the two thresholds Λsucc and Λfail.

3.2 Simulations for Improved Convex Program

We keep the cluster size n1 = n2 = n3 = 200 and vary p from 0.15 to 1 and r from 0.05 to 1 in steps of 0.05.
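The rounding step and success criterion described at the start of this section can be sketched as follows (the function names are ours; L0 denotes the ideal cluster matrix):

```python
import numpy as np

def round_and_score(L, L0):
    """Round a solver output at the mean of its entries (as in Section 3)
    and report the fraction of entries that disagree with the ideal L0."""
    threshold = L.mean()
    L_rounded = (L >= threshold).astype(int)  # < threshold -> 0, else 1
    error = np.mean(L_rounded != L0)
    return L_rounded, error

def is_success(L, L0):
    """Success is declared when fewer than 0.1% of the entries are wrong."""
    return round_and_score(L, L0)[1] < 1e-3
```

Because an ideal solution has entries clustered near 0 and 1, the mean of the output is a natural separating threshold even under solver imprecision.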
We set the regularization parameter λ = 0.49 Λ̃succ, ensuring that λ < Λ̃succ, enabling us to focus on observing the condition of success around D̃min. The outcome of this experiment is shown in Figure 2a. The experimental region of success is shown in white and the region of failure is shown in black. The theoretical region of success is above the solid red curve.

Comparison with the Simple Convex Program: In this experiment, we are interested in observing the range of p for which Programs 1.1 and 1.4 work. Keeping the cluster size n1 = n2 = n3 = 200 and r = 1, we vary the edge probability inside the clusters from p = 0.15 to p = 1 in steps of 0.05. For each instance of the adjacency matrix, we run both Programs 1.1 and 1.4. We plot the probability of success of both algorithms in Figure 2b. As we can observe, Program 1.1 starts succeeding only after p > 1/2, whereas Program 1.4 starts at p ≈ 0.35.

Figure 3: Result of using (a) Program 1.1 (Simple) and (b) Program 1.4 (Improved) on the real data set. (c) Comparing the clustering output after running Program 1.1 and Program 1.4 with the output of applying k-means clustering directly on A (with unknown entries set to 0).

3.3 Labeling Images: Amazon MTurk Experiment

Creating a training dataset by labeling images is a tedious task. It would be useful to crowdsource this task instead. Consider a specific example of a set of images of dogs of different breeds. We want to cluster them such that the images of dogs of the same breed are in the same cluster. One could show a set of images to each worker, and ask him/her to identify the breed of the dog in each of those images.
But such a task would require the workers to be experts in identifying dog breeds. A more reasonable task is to ask the workers to compare pairs of images, and for each pair, answer whether they think the dogs in the images are of the same breed or not. If we have n images, then there are n(n − 1)/2 distinct pairs of images, and it quickly becomes unreasonable to compare all possible pairs. This is an example where we could obtain a subset of the data and try to cluster the images based on the partial observations.

Image Data Set: We used images of 3 different breeds of dogs: Norfolk Terrier (172 images), Toy Poodle (151 images) and Bouvier des Flandres (150 images) from the Stanford Dogs Dataset [29]. We uploaded all 473 images of dogs to an image hosting server (we used imgur.com).

MTurk Task: We used Amazon Mechanical Turk [30] as the platform for crowdsourcing. For each worker, we showed 30 pairs of images chosen randomly from the n(n − 1)/2 possible pairs. The task assigned to the worker was to compare each pair of images, and answer whether they think the dogs belong to the same breed or not. If the worker's response is a "yes", we fill the entry of the adjacency matrix corresponding to the pair with 1, and with 0 if the answer is a "no".

Collected Data: We recorded around 608 responses. We were able to fill 16,750 out of 111,628 entries in A. That is, we observed 15% of the total number of entries. Compared with the true answers (which we know a priori), the answers given by the workers had around 23.53% errors (3,941 out of 16,750). The empirical parameters for the partially observed graph thus obtained are shown in Table 1.

We ran Program 1.1 and Program 1.4 with regularization parameter λ = 1/√n. Further, for Program 1.4, we set the size of the cluster region |R| to 0.125 times n(n − 1)/2.
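The adjacency-matrix construction just described (a 1 for a "yes", a 0 for a "no", unknown elsewhere) amounts to the following sketch. The function name is ours, and the paper does not specify how conflicting answers for the same pair are merged, so here a later answer simply overwrites an earlier one:

```python
import numpy as np

def adjacency_from_responses(n, responses):
    """Fill a partially observed adjacency matrix from worker answers.
    responses holds (i, j, same_breed) triples; -1 marks unobserved pairs.
    A repeated answer for the same pair overwrites the previous one
    (an assumption; the merging rule is not specified in the text)."""
    A_obs = -np.ones((n, n), dtype=int)
    for i, j, same in responses:
        A_obs[i, j] = A_obs[j, i] = int(same)
    return A_obs
```

The resulting matrix can be fed directly to either program, with the −1 entries treated as the unobserved part.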
Figure 3a shows the recovered matrices. Entries with value 1 are depicted by white and 0 by black. In Figure 3c we compare the clusters output by running the k-means algorithm directly on the adjacency matrix A (with unknown entries set to 0) to those obtained by running the k-means algorithm on the matrices recovered by Program 1.1 (Simple Program) and Program 1.4 (Improved Program) respectively. The overall error with k-means was 40.8%, whereas the error was significantly reduced to 15.86% and 7.19% when we used the matrices recovered from Programs 1.1 and 1.4 respectively (see Table 2). Further, note that for running the k-means algorithm we need to know the exact number of clusters. A common heuristic is to identify the top K eigenvalues that are much larger than the rest.

Table 1: Empirical Parameters from the real data.

Params  Value     Params  Value
n       473       r       0.1500
K       3         q       0.1929
n1      172       p1      0.7587
n2      151       p2      0.6444
n3      150       p3      0.7687

Table 2: Number of miss-classified images

Clusters →  1    2    3   Total
K-means     39   150  4   193
Simple      9    57   8   74
Improved    1    29   4   34

In Figure 4 we plot the sorted eigenvalues for the adjacency matrix A and the recovered matrices. We can see that the top 3 eigenvalues are very easily distinguished from the rest for the matrix recovered after running Program 1.4.

A sample of the data is shown in Figure 5. We observe that factors such as color, grooming, posture, face visibility, etc.
can result in confusion while comparing image pairs. Also, note that the ability of the workers to distinguish the dog breeds is neither guaranteed nor uniform. Thus, the edge probabilities inside and outside the clusters are not uniform. Nonetheless, Programs 1.1 and 1.4, especially Program 1.4, are quite successful in clustering the data with only 15% observations.

Figure 4: Plot of sorted eigenvalues for (1) the adjacency matrix with unknown entries filled by 0, (2) the recovered adjacency matrix from Program 1.1, (3) the recovered adjacency matrix from Program 1.4.

Figure 5: Sample images of three breeds of dogs that were used in the MTurk experiment.

The authors thank the anonymous reviewers for their insightful comments. This work was supported in part by the National Science Foundation under grants CCF-0729203, CNS-0932428 and CIF-1018927, by the Office of Naval Research under the MURI grant N00014-08-1-0747, and by a grant from Qualcomm Inc. The first author is also supported by the Schlumberger Foundation Faculty for the Future Program Grant.

References

[1] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Comput. Surv., 31(3):264-323, September 1999.

[2] M. Ester, H.-P. Kriegel, and X. Xu. A database interface for clustering in large spatial databases. In Proceedings of the 1st international conference on Knowledge Discovery and Data mining (KDD'95), pages 94-99. AAAI Press, August 1995.

[3] Xiaowei Xu, Jochen Jäger, and Hans-Peter Kriegel. A fast parallel clustering algorithm for large spatial databases. Data Min. Knowl. Discov., 3(3):263-290, September 1999.

[4] Nina Mishra, Robert Schreiber, Isabelle Stanton, and Robert Tarjan. Clustering Social Networks. In Anthony Bonato and Fan R. K. Chung, editors, Algorithms and Models for the Web-Graph, volume 4863 of Lecture Notes in Computer Science, chapter 5, pages 56-67.
Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.

[5] Pedro Domingos and Matt Richardson. Mining the network value of customers. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '01, pages 57-66, New York, NY, USA, 2001. ACM.

[6] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75-174, 2010.

[7] Ying Xu, Victor Olman, and Dong Xu. Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics, 18(4):536-545, 2002.

[8] Qiaofeng Yang and Stefano Lonardi. A parallel algorithm for clustering protein-protein interaction networks. In CSB Workshops, pages 174-177. IEEE Computer Society, 2005.

[9] Satu Elisa Schaeffer. Graph clustering. Computer Science Review, 1(1):27-64, 2007.

[10] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109-137, 1983.

[11] Anne Condon and Richard M. Karp. Algorithms for graph partitioning on the planted partition model. Random Struct. Algorithms, 18(2):116-140, 2001.

[12] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. In John D. Lafferty, Christopher K. I. Williams, John Shawe-Taylor, Richard S. Zemel, and Aron Culotta, editors, NIPS, pages 2496-2504. Curran Associates, Inc., 2010.

[13] Ali Jalali, Yudong Chen, Sujay Sanghavi, and Huan Xu. Clustering partially observed graphs via convex optimization. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML '11, pages 1001-1008, New York, NY, USA, June 2011. ACM.

[14] Brendan P. W. Ames and Stephen A.
Vavasis. Convex optimization for the planted k-disjoint-clique\n\nproblem. Math. Program., 143(1-2):299\u2013337, 2014.\n\n[15] Brendan P. W. Ames and Stephen A. Vavasis. Nuclear norm minimization for the planted clique and\n\nbiclique problems. Math. Program., 129(1):69\u201389, September 2011.\n\n[16] S. Oymak and B. Hassibi.\narXiv:1104.5186, April 2011.\n\nFinding Dense Clusters via \u201dLow Rank + Sparse\u201d Decomposition.\n\n[17] Yudong Chen, Sujay Sanghavi, and Huan Xu. Clustering sparse graphs. In Peter L. Bartlett, Fernando\nC. N. Pereira, Christopher J. C. Burges, Lon Bottou, and Kilian Q. Weinberger, editors, NIPS, pages\n2213\u20132221, 2012.\n\n[18] Yudong Chen, Ali Jalali, Sujay Sanghavi, and Constantine Caramanis. Low-rank matrix recovery from\n\nerrors and erasures. IEEE Transactions on Information Theory, 59(7):4324\u20134337, 2013.\n\n[19] Brendan P. W. Ames. Robust convex relaxation for the planted clique and densest k-subgraph problems.\n\n2013.\n\n[20] Nir Ailon, Yudong Chen, and Huan Xu. Breaking the small cluster barrier of graph clustering. CoRR,\n\nabs/1302.4549, 2013.\n\n[21] Ramya Korlakai Vinayak, Samet Oymak, and Babak Hassibi. Sharp performance bounds for graph clus-\ntering via convex optimizations. In Proceedings of the 39th International Conference on Acoustics, Speech\nand Signal Processing, ICASSP \u201914, 2014.\n\n[22] Emmanuel J. Candes and Justin Romberg. Quantitative robust uncertainty principles and optimally sparse\n\ndecompositions. Found. Comput. Math., 6(2):227\u2013254, April 2006.\n\n[23] Emmanuel J. Candes and Benjamin Recht. Exact matrix completion via convex optimization. Found.\n\nComput. Math., 9(6):717\u2013772, December 2009.\n\n[24] Emmanuel J. Cand`es, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? J.\n\nACM, 58(3):11:1\u201311:37, June 2011.\n\n[25] Venkat Chandrasekaran, Sujay Sanghavi, Pablo A. Parrilo, and Alan S. Willsky. 
Rank-sparsity incoher-\n\nence for matrix decomposition. SIAM Journal on Optimization, 21(2):572\u2013596, 2011.\n\n[26] Venkat Chandrasekaran, Pablo A. Parrilo, and Alan S. Willsky. Rejoinder: Latent variable graphical\n\nmodel selection via convex optimization. CoRR, abs/1211.0835, 2012.\n\n[27] Frank McSherry. Spectral partitioning of random graphs.\n\nSociety, 2001.\n\nIn FOCS, pages 529\u2013537. IEEE Computer\n\n[28] Zhouchen Lin, Minming Chen, and Yi Ma. The Augmented Lagrange Multiplier Method for Exact\n\nRecovery of Corrupted Low-Rank Matrices. Mathematical Programming, 2010.\n\n[29] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset for \ufb01ne-\ngrained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Confer-\nence on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.\n\n[30] Michael Buhrmester, Tracy Kwang, and Samuel D. Gosling. Amazon\u2019s Mechanical Turk: A new source\n\nof inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1):3\u20135, January 2011.\n\n9\n\n\f", "award": [], "sourceid": 1563, "authors": [{"given_name": "Ramya", "family_name": "Korlakai Vinayak", "institution": "California Institute of Technology"}, {"given_name": "Samet", "family_name": "Oymak", "institution": "Caltech"}, {"given_name": "Babak", "family_name": "Hassibi", "institution": "Caltech"}]}