{"title": "Graphons, mergeons, and so on!", "book": "Advances in Neural Information Processing Systems", "page_first": 2307, "page_last": 2315, "abstract": "In this work we develop a theory of hierarchical clustering for graphs. Our modelling assumption is that graphs are sampled from a graphon, which is a powerful and general model for generating graphs and analyzing large networks. Graphons are a far richer class of graph models than stochastic blockmodels, the primary setting for recent progress in the statistical theory of graph clustering. We define what it means for an algorithm to produce the \"correct\" clustering, give sufficient conditions under which a method is statistically consistent, and provide an explicit algorithm satisfying these properties.", "full_text": "Graphons, mergeons, and so on!

Justin Eldridge, Mikhail Belkin, Yusu Wang
The Ohio State University
{eldridge, mbelkin, yusu}@cse.ohio-state.edu

Abstract

In this work we develop a theory of hierarchical clustering for graphs. Our modeling assumption is that graphs are sampled from a graphon, which is a powerful and general model for generating graphs and analyzing large networks. Graphons are a far richer class of graph models than stochastic blockmodels, the primary setting for recent progress in the statistical theory of graph clustering. We define what it means for an algorithm to produce the "correct" clustering, give sufficient conditions under which a method is statistically consistent, and provide an explicit algorithm satisfying these properties.

1 Introduction

A fundamental problem in the theory of clustering is that of defining a cluster. There is no single answer to this seemingly simple question.
The right approach depends on the nature of the data and the proper modeling assumptions. In a statistical setting where the objects to be clustered come from some underlying probability distribution, it is natural to define clusters in terms of the distribution itself. The task of a clustering, then, is twofold – to identify the appropriate cluster structure of the distribution and to recover that structure from a finite sample. Thus we would like to say that a clustering is good if it is in some sense close to the ideal structure of the underlying distribution, and that a clustering method is consistent if it produces clusterings which converge to the true clustering, given larger and larger samples. Proving the consistency of a clustering method deepens our understanding of it, and provides justification for using the method in the appropriate setting.

In this work, we consider the setting in which the objects to be clustered are the vertices of a graph sampled from a graphon – a very general random graph model of significant recent interest. We develop a statistical theory of graph clustering in the graphon model; to the best of our knowledge, this is the first general consistency framework developed for such a rich family of random graphs.

The specific contributions of this paper are threefold. First, we define the clusters of a graphon. Our definition results in a graphon having a tree of clusters, which we call its graphon cluster tree. We introduce an object called the mergeon, which is a particular representation of the graphon cluster tree that encodes the heights at which clusters merge. Second, we develop a notion of consistency for graph clustering algorithms in which a method is said to be consistent if its output converges to the graphon cluster tree.
Here the graphon setting poses subtle yet fundamental challenges which differentiate it from classical clustering models, and which must be carefully addressed. Third, we prove the existence of consistent clustering algorithms. In particular, we provide sufficient conditions under which a graphon estimator leads to a consistent clustering method. We then identify a specific practical algorithm which satisfies these conditions, and in doing so present a simple graph clustering algorithm which provably recovers the graphon cluster tree.

Related work. Graphons are objects of significant recent interest in graph theory, statistics, and machine learning. The theory of graphons is rich and diverse; a graphon can be interpreted as a generalization of a weighted graph with uncountably many nodes, as the limit of a sequence of finite graphs, or, more importantly for the present work, as a very general model for generating unweighted, undirected graphs. Conveniently, any graphon can be represented as a symmetric, measurable function W : [0, 1]² → [0, 1], and it is this representation that we use throughout this paper. The graphon as a graph limit was introduced in recent years by [16], [5], and others. The interested reader is directed to the book by Lovász [15] on the subject. There has also been a considerable recent effort to produce consistent estimators of the graphon, including the work of [20], [8], [2], [18], and others. We will analyze a simple modification of the graphon estimator proposed by [21] and show that it leads to a graph clustering algorithm which is a consistent estimator of the graphon cluster tree.

29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Much of the previous statistical theory of graph clustering methods assumes that graphs are generated by the so-called stochastic blockmodel.
The simplest form of the model generates a graph with n nodes by assigning each node, randomly or deterministically, to one of two communities. An edge between two nodes is added with probability α if they are from the same community and with probability β otherwise. A graph clustering method is said to achieve exact recovery if it identifies the true community assignment of every node in the graph with high probability as n → ∞. The blockmodel is a special case of a graphon model, and our notion of consistency will imply exact recovery of communities.

Stochastic blockmodels are widely studied, and it is known that, for example, spectral methods like that of [17] are able to recover the communities exactly as n → ∞, provided that α and β remain constant, or that the gap between them does not shrink too quickly. For a summary of consistency results in the blockmodel, see [1], which also provides information-theoretic thresholds for the conditions under which exact recovery is possible. In a related direction, [4] examines the ability of spectral clustering to withstand noise in a hierarchical block model.

The density setting. The problem of defining the underlying cluster structure of a probability distribution goes back to Hartigan [12], who considered the setting in which the objects to be clustered are points sampled from a density f : X → R+. In this case, the high density clusters of f are defined to be the connected components of the upper level sets {x : f(x) ≥ λ} for any λ > 0. The set of all such clusters forms the so-called density cluster tree. Hartigan [12] defined a notion of consistency for the density cluster tree, and proved that single-linkage clustering is not consistent. In recent years, [9] and [14] have demonstrated methods which are Hartigan consistent.
[10] introduced a distance between a clustering of the data and the density cluster tree, called the merge distortion metric. A clustering method is said to be consistent if the trees it produces converge in merge distortion to the density cluster tree. It is shown that convergence in merge distortion is stronger than Hartigan consistency, and that the method of [9] is consistent in this stronger sense.

In the present work, we will be motivated by the approach taken in [12] and [10]. We note, however, that there are significant and fundamental differences between the density case and the graphon setting. Specifically, it is possible for two graphons to be equivalent in the same way that two graphs are: up to a relabeling of the vertices. As such, a graphon W is a representative of an equivalence class of graphons modulo appropriately defined relabelings. It is therefore necessary to define the clusters of W in a way that does not depend upon the particular representative used. A similar problem occurs in the density setting when we wish to define the clusters not of a single density function, but rather of a class of densities which are equal almost everywhere; Steinwart [19] provides an elegant solution. But while the domain of a density is equipped with a meaningful metric – the mass of a ball around a point x is the same under two equivalent densities – the ambient metric on the vertices of a graphon is not useful. As a result, approaches such as that of [19] do not directly apply to the graphon case, and we must carefully produce our own. Additionally, we will see that the procedure for sampling a graph from a graphon involves latent variables which are in principle unrecoverable from data. These issues have no analogue in the classical density setting, and present very distinct challenges.

Miscellany. Due to space constraints, most of the (rather involved) technical details are in the appendix.
We will use [n] to denote the set {1, . . . , n}, △ for the symmetric difference, µ for the Lebesgue measure on [0, 1], and bold letters to denote random variables.

2 The graphon model

In order to discuss the statistical properties of a graph clustering algorithm, we must first model the process by which graphs are generated. Formally, a random graph model is a sequence of random variables G1, G2, . . . such that the range of Gn consists of undirected, unweighted graphs with node set [n], and the distribution of Gn is invariant under relabeling of the nodes – that is, isomorphic graphs occur with equal probability. A random graph model of considerable recent interest is the graphon model, in which the distribution over graphs is determined by a symmetric, measurable function W : [0, 1]² → [0, 1] called a graphon. Informally, a graphon W may be thought of as the weight matrix of an infinite graph whose node set is the continuous unit interval, so that W(x, y) represents the weight of the edge between nodes x and y.

Interpreting W(x, y) as a probability suggests the following graph sampling procedure: to draw a graph with n nodes, we first select n points x1, . . . , xn at random from the uniform distribution on [0, 1] – we can think of these xi as being random "nodes" in the graphon. We then sample a random graph G on node set [n] by admitting the edge (i, j) with probability W(xi, xj); by convention, self-edges are not sampled. It is important to note that while we begin by drawing a set of nodes {xi} from the graphon, the graph as given to us is labeled by integers.
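The sampling procedure just described is easy to simulate. The following is a minimal sketch (not from the paper; the function and variable names are ours), which also illustrates that a two-community stochastic blockmodel is simply the special case of a piecewise-constant graphon:

```python
import numpy as np

def sample_graph(W, n, rng=None):
    """Sample an n-node graph from graphon W: draw latent labels
    x_1..x_n uniformly on [0, 1], then include edge (i, j) with
    probability W(x_i, x_j); self-edges are not sampled."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(size=n)                 # latent node labels in [0, 1]
    P = W(x[:, None], x[None, :])           # edge probability matrix
    U = rng.uniform(size=(n, n))
    A = np.triu(U < P, k=1)                 # sample upper triangle only
    A = (A | A.T).astype(int)               # symmetrize; diagonal stays zero
    return A, x

# A two-community stochastic blockmodel as a piecewise-constant graphon:
# probability alpha within a block, beta across blocks.
alpha, beta = 0.8, 0.1
W_sbm = lambda x, y: np.where((x < 0.5) == (y < 0.5), alpha, beta)
A, x = sample_graph(W_sbm, 50, rng=0)
```

Note that the returned adjacency matrix A carries only integer node labels; the latent labels x are discarded in the observed data, which is exactly the point made next.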
Therefore, the correspondence between node i in the graph and node xi in the graphon is latent.

It can be shown that this sampling procedure defines a distribution on finite graphs, such that the probability of a graph G = ([n], E) is given by

PW(G = G) = ∫_[0,1]ⁿ ∏_{(i,j)∈E} W(xi, xj) ∏_{(i,j)∉E} [1 − W(xi, xj)] ∏_{i∈[n]} dxi.   (1)

For a fixed choice of x1, . . . , xn ∈ [0, 1], the integrand represents the likelihood that the graph G is sampled when the probability of the edge (i, j) is assumed to be W(xi, xj). By integrating over all possible choices of x1, . . . , xn, we obtain the probability of the graph.

A very general class of random graph models may be represented as graphons. In particular, a random graph model G1, G2, . . . is said to be consistent if the random graph Fk−1 obtained by deleting node k from Gk has the same distribution as Gk−1. A random graph model is said to be local if whenever S, T ⊂ [k] are disjoint, the random subgraphs of Gk induced by S and T are independent random variables. A result of Lovász and Szegedy [16] is that any consistent, local random graph model is equivalent to the distribution on graphs defined by PW for some graphon W; the converse is true as well. That is, any such random graph model is equivalent to a graphon.

A particular random graph model is not uniquely defined by a graphon – it is clear from Equation 1 that two graphons W1 and W2 which are equal almost everywhere (i.e., differ on a set of measure zero) define the same distribution on graphs.
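Equation 1 can be checked numerically by Monte Carlo: averaging the integrand over uniform draws of x1, . . . , xn estimates the probability of a fixed graph. A sketch (names are ours, not the paper's code):

```python
import itertools
import numpy as np

def graph_probability(W, edges, n, num_samples=20000, rng=0):
    """Monte Carlo estimate of Equation (1): average, over uniform
    draws of (x_1..x_n), the likelihood of observing exactly the
    edge set `edges` on node set {0, ..., n-1}."""
    rng = np.random.default_rng(rng)
    edges = {frozenset(e) for e in edges}
    total = 0.0
    for _ in range(num_samples):
        x = rng.uniform(size=n)
        lik = 1.0
        for i, j in itertools.combinations(range(n), 2):
            p = W(x[i], x[j])
            lik *= p if frozenset((i, j)) in edges else (1.0 - p)
        total += lik
    return total / num_samples

# Sanity check: for the constant graphon W = 1/2, every graph on n
# nodes has probability 2^(-n(n-1)/2); with n = 3 that is 1/8.
p = graph_probability(lambda x, y: 0.5, [(0, 1)], 3, num_samples=100)
```

For the constant graphon the integrand does not depend on x, so the estimate is exact; for non-constant graphons it converges as num_samples grows.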
In fact, the distribution defined by W is unchanged by "relabelings" of W's nodes. More formally, if Σ is the sigma-algebra of Lebesgue measurable subsets of [0, 1] and µ is the Lebesgue measure, we say that a relabeling function φ : ([0, 1], Σ) → ([0, 1], Σ) is measure preserving if for any measurable set A ∈ Σ, µ(φ⁻¹(A)) = µ(A). We define the relabeled graphon Wᵠ by Wᵠ(x, y) = W(φ(x), φ(y)). By analogy with finite graphs, we say that graphons W1 and W2 are weakly isomorphic if they are equivalent up to relabeling, i.e., if there exist measure preserving maps φ1 and φ2 such that W1^φ1 = W2^φ2 almost everywhere. Weak isomorphism is an equivalence relation, and most of the important properties of a graphon in fact belong to its equivalence class. For instance, a powerful result of [15] is that two graphons define the same random graph model if and only if they are weakly isomorphic.

An example of a graphon W is shown in Figure 1a. It is conventional to plot the graphon as one typically plots an adjacency matrix: with the origin in the upper-left corner. Darker shades correspond to higher values of W. Figure 1b depicts a graphon Wᵠ which is weakly isomorphic to W. In particular, Wᵠ is the relabeling of W by the measure preserving transformation φ(x) = 2x mod 1. As such, the graphons shown in Figures 1a and 1b define the same distribution on graphs. Figure 1c shows the adjacency matrix A of a graph of size n = 50 sampled from the distribution defined by the equivalence class containing W and Wᵠ.

Figure 1: (a) Graphon W. (b) Wᵠ weakly isomorphic to W. (c) An instance of a graph adjacency sampled from W.
Note that it is in principle not possible to determine from A alone which of the graphons W or Wᵠ it was sampled from, or to which node in W a particular column of A corresponds.

3 The graphon cluster tree

We now identify the cluster structure of a graphon. We will define a graphon's clusters such that they are analogous to the maximally-connected components of a finite graph. It turns out that the collection of all clusters has hierarchical structure; we call this object the graphon cluster tree. We propose that the goal of clustering in the graphon setting is the recovery of the graphon cluster tree.

Connectedness and clusters. Consider a finite weighted graph. It is natural to cluster the graph into connected components. In fact, because of the weighted edges, we can speak of the clusters of the graph at various levels. More precisely, we say that a set of nodes A is internally connected – or, from now on, just connected – at level λ if for every pair of nodes in A there is a path between them such that every node along the path is also in A, and the weight of every edge in the path is at least λ. Equivalently, A is connected at level λ if and only if for every partitioning of A into disjoint, non-empty sets A1 and A2 there is an edge of weight λ or greater between A1 and A2. The clusters at level λ are then the largest connected components at level λ.

A graphon is, in a sense, an infinite weighted graph, and we will define the clusters of a graphon using the example above as motivation. In doing so, we must be careful to make our notion robust to changes of the graphon on a set of zero measure, as such changes do not affect the graph distribution defined by the graphon. We base our definition on that of Janson [13], who defined what it means for a graphon to be connected as a whole.
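The finite-graph notion of internal connectedness at level λ described above can be tested directly by searching within A while crossing only edges of weight at least λ. A minimal sketch (names are ours):

```python
from collections import deque

def connected_at_level(weights, A, lam):
    """Check internal connectedness of node set A at level lam:
    BFS inside A, crossing only edges of weight >= lam.
    `weights` is a symmetric matrix (list of lists)."""
    A = list(A)
    if not A:
        return False
    inside = set(A)
    seen = {A[0]}
    queue = deque([A[0]])
    while queue:
        u = queue.popleft()
        for v in inside:
            if v not in seen and weights[u][v] >= lam:
                seen.add(v)
                queue.append(v)
    return seen == inside

# Path 0-1-2 with weights 0.9 and 0.4: the set {0, 1, 2} is
# connected at level 0.4 but disconnected at level 0.5.
w = [[0, 0.9, 0],
     [0.9, 0, 0.4],
     [0, 0.4, 0]]
```

Note how the search is confined to A: a path leaving A does not count, matching the "internal" part of the definition.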
We extend the definition in [13] to speak of the connectivity of subsets of the graphon's nodes at a particular height. Our definition is directly analogous to the notion of internal connectedness in finite graphs.

Definition 1 (Connectedness). Let W be a graphon, and let A ⊂ [0, 1] be a set of positive measure. We say that A is disconnected at level λ if there exists a measurable S ⊂ A such that 0 < µ(S) < µ(A), and W < λ almost everywhere on S × (A \ S). Otherwise, we say that A is connected at level λ.

We now identify the clusters of a graphon; as in the finite case, we will frame our definition in terms of maximally-connected components. We begin by gathering all subsets of [0, 1] which should belong to some cluster at level λ. Naturally, if a set is connected at level λ, it should be in a cluster at level λ; for technical reasons, we will also say that a set which is connected at all levels λ′ < λ (though perhaps not at λ) should be contained in a cluster at level λ, as well. That is, for any λ, the collection Aλ of sets which should be contained in some cluster at level λ is Aλ = { A ∈ Σ : µ(A) > 0 and A is connected at every level λ′ < λ }. Now suppose A1, A2 ∈ Aλ, and that there is a set A ∈ Aλ such that A ⊃ A1 ∪ A2. Naturally, the cluster to which A belongs should also contain A1 and A2, since both are subsets of A. We will therefore consider A1 and A2 to be equivalent, in the sense that they should be contained in the same cluster at level λ. More formally, we define a relation ∼λ on Aλ by A1 ∼λ A2 ⟺ ∃A ∈ Aλ s.t. A ⊃ A1 ∪ A2.
It can be verified that ∼λ is an equivalence relation on Aλ; see Claim 9 in Appendix B.

Each equivalence class A in the quotient space Aλ/∼λ consists of connected sets which should intuitively be clustered together at level λ. Naturally, we will define the clusters to be the largest elements of each class; in some sense, these are the maximally-connected components at level λ. More precisely, suppose A is such an equivalence class. It is clear that in general no single member A ∈ A can contain all other members of A, since adding a null set (i.e., a set of measure zero) to A results in a larger set A′ which is nevertheless still a member of A. However, we can find a member A* ∈ A which contains all but a null set of every other set in A. More formally, we say that A* is an essential maximum of the class A if A* ∈ A and for every A ∈ A, µ(A \ A*) = 0. A* is of course not unique, but it is unique up to a null set; i.e., for any two essential maxima A1, A2 of A, we have µ(A1 △ A2) = 0. We will write the set of essential maxima of A as ess max A; the fact that the essential maxima are well-defined is proven in Claim 10 in Appendix B. We then define clusters as the maximal members of each equivalence class in Aλ/∼λ:

Definition 2 (Clusters). The set of clusters at level λ in W, written CW(λ), is defined to be the countable collection CW(λ) = { ess max A : A ∈ Aλ/∼λ }.

Note that a cluster C of a graphon is not a subset of the unit interval per se, but rather an equivalence class of subsets which differ only by null sets. It is often possible to treat clusters as sets rather than equivalence classes, and we may write µ(C), C ∪ C′, etc., without ambiguity. In addition, if φ : [0, 1] → [0, 1] is a measure preserving transformation, then φ⁻¹(C) is well-defined.

For a concrete example of our notion of a cluster, consider the graphon W depicted in Figure 1a. A, B, and C represent sets of the graphon's nodes. By our definitions there are three clusters at level λ3: A, B, and C. Clusters A and B merge into a cluster A ∪ B at level λ2, while C remains a separate cluster. Everything is joined into a cluster A ∪ B ∪ C at level λ1.

We have taken care to define the clusters of a graphon in such a way as to be robust to changes of measure zero to the graphon itself. In fact, clusters are also robust to measure preserving transformations. The proof of this result is non-trivial, and comprises Appendix C.

Claim 1. Let W be a graphon and φ a measure preserving transformation. Then C is a cluster of Wᵠ at level λ if and only if there exists a cluster C′ of W at level λ such that C = φ⁻¹(C′).

Cluster trees and mergeons. The set of all clusters of a graphon at any level has hierarchical structure in the sense that, given any pair of distinct clusters C1 and C2, either one is "essentially" contained within the other, i.e., C1 ⊂ C2 or C2 ⊂ C1, or they are "essentially" disjoint, i.e., µ(C1 ∩ C2) = 0, as is proven by Claim 8 in Appendix B. Because of this hierarchical structure, we call the set CW of all clusters from any level of the graphon W the graphon cluster tree of W. It is this tree that we hope to recover by applying a graph clustering algorithm to a graph sampled from W.

We may naturally speak of the height at which pairs of distinct clusters merge in the cluster tree. For instance, let C1 and C2 be distinct clusters of C. We say that the merge height of C1 and C2 is the level λ at which they are joined into a single cluster, i.e., max{λ : C1 ∪ C2 ∈ C(λ)}. However, while the merge height of clusters is well-defined, the merge height of individual points is not. This is because the cluster tree is not a collection of sets, but rather a collection of equivalence classes of sets, and so a point does not belong to any one cluster more than any other. Note that this is distinct from the classical density case considered in [12], [9], and [10], where the merge height of any pair of points is well-defined.

Nevertheless, consider a measurable function M : [0, 1]² → [0, 1] which assigns a merge height to every pair of points. While the value of M on any given pair is arbitrary, the value of M on sets of positive measure is constrained. Intuitively, if C is a cluster at level λ, then we must have M ≥ λ almost everywhere on C × C. If M satisfies this constraint for every cluster C we call M a mergeon for C, as it is a graphon which determines a particular choice for the merge heights of every pair of points in [0, 1]. More formally:

Definition 3 (Mergeon). Let C be a cluster tree. A mergeon¹ of C is a graphon M such that for all λ ∈ [0, 1], M⁻¹[λ, 1] = ∪_{C∈CW(λ)} C × C, where M⁻¹[λ, 1] = {(x, y) ∈ [0, 1]² : M(x, y) ≥ λ}.

¹The definition given here involves a slight abuse of notation. For a precise – but more technical – version, see Appendix A.2.

An example of a mergeon and the cluster tree it represents is shown in Figure 2. In fact, the cluster tree depicted is that of the graphon W from Figure 1a. The mergeon encodes the height at which clusters A, B, and C merge. In particular, the fact that M = λ2 everywhere on A × B represents the merging of A and B at level λ2 in W.

Figure 2: (a) Cluster tree CW of W. (b) Mergeon M of CW.

It is clear that in general there is no unique mergeon representing a graphon cluster tree; however, the above definition implies that two mergeons representing the same cluster tree are equal almost everywhere. Additionally, we have the following two claims, whose proofs are in Appendix B.

Claim 2. Let C be a cluster tree, and suppose M is a mergeon representing C. Then C ∈ C(λ) if and only if C is a cluster in M at level λ. In other words, the cluster tree of M is also C.

Claim 3. Let W be a graphon and M a mergeon of the cluster tree of W. If φ is a measure preserving transformation, then Mᵠ is a mergeon of the cluster tree of Wᵠ.

4 Notions of consistency

We have so far defined the sense in which a graphon has hierarchical cluster structure. We now turn to the problem of determining whether a clustering algorithm is able to recover this structure when applied to a graph sampled from a graphon. Our approach is to define a distance between the infinite graphon cluster tree and a finite clustering. We will then define consistency by requiring that a consistent method converge to the graphon cluster tree in this distance for all inputs minus a set of vanishing probability.

Merge distortion. A hierarchical clustering C of a set S – or, from now on, just a clustering of S – is a hierarchical collection of subsets of S such that S ∈ C and for all C, C′ ∈ C, either C ⊂ C′, C′ ⊂ C, or C ∩ C′ = ∅. Suppose C is a clustering of a finite set S consisting of graphon nodes; i.e., S ⊂ [0, 1]. How might we measure the distance between this clustering and a graphon cluster tree C? Intuitively, the two trees are close if every pair of points in S merges in C at about the same level as they merge in C.
But this informal description faces two problems. First, C is a collection of equivalence classes of sets, and so the height at which any pair of points merges in C is not defined. Recall, however, that the cluster tree has an alternative representation as a mergeon. A mergeon does define a merge height for every pair of nodes in a graphon, and thus provides a solution to this first issue. Second, the clustering C is not equipped with a height function, and so the height at which any pair of points merges in C is also undefined. Following [10], our approach is to induce a merge height function on the clustering using the mergeon in the following way:

Definition 4 (Induced merge height). Let M be a mergeon, and suppose S is a finite subset of [0, 1]. Let C be a clustering of S. The merge height function on C induced by M is defined by M̂C(s, s′) = min_{u,v∈C(s,s′)} M(u, v) for every (s, s′) ∈ S × S, where C(s, s′) denotes the smallest cluster C ∈ C which contains both s and s′.

We measure the distance between a clustering C and the cluster tree C using the merge distortion:

Definition 5. Let M be a mergeon, S a finite subset of [0, 1], and C a clustering of S. The merge distortion is defined by dS(M, M̂C) = max_{s,s′∈S, s≠s′} |M(s, s′) − M̂C(s, s′)|.

Defining the induced merge height and merge distortion in this way leads to an especially meaningful interpretation of the merge distortion. In particular, if the merge distortion between C and C is ε, then any two clusters of C which are separated at level λ but merge below level λ − ε are correctly separated in the clustering C. A similar result guarantees that a cluster in C is connected in C to within ε of the correct level. For a precise statement of these results, see Claim 5 in Appendix A.4.

The label measure.
We will use the merge distortion to measure the distance between C, a hierarchical clustering of a graph, and C, the graphon cluster tree. Recall, however, that the nodes of a graph sampled from a graphon have integer labels. That is, C is a clustering of [n], and not of a subset of [0, 1]. Hence, in order to apply the merge distortion, we must first relabel the nodes of the graph, placing them in direct correspondence with nodes of the graphon, i.e., points in [0, 1].

Recall that we sample a graph of size n from a graphon W by first drawing n points x1, . . . , xn uniformly at random from the unit interval. We then generate a graph on node set [n] by connecting nodes i and j with probability W(xi, xj). However, the nodes of the sampled graph are not labeled by x1, . . . , xn, but rather by the integers 1, . . . , n. Thus we may think of xi as being the "true" latent label of node i. In general the latent node labeling is not recoverable from data, as is demonstrated by the figure to the right. We might suppose that the graph shown is sampled from the graphon above it, and that node 1 corresponds to a, node 2 to b, node 3 to c, and node 4 to d. However, it is just as likely that node 4 corresponds to d′, and so neither labeling is more "correct". It is clear, though, that some labelings are less likely than others. For instance, the existence of the edge (1, 2) makes it impossible that 1 corresponds to a and 2 to c, since W(a, c) is zero.

Therefore, given a graph G = ([n], E) sampled from a graphon, there are many possible relabelings of G which place its nodes in correspondence with nodes of the graphon, but some are more likely than others. The merge distortion depends on which labeling of G we assume, but, intuitively, a good clustering of G will have small distortion with respect to highly probable labelings, and only have large distortion on improbable labelings.
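Given a labeling of the sample, the induced merge height and merge distortion of Definitions 4 and 5 are straightforward to compute. A sketch (names are ours), treating the mergeon as a callable and a clustering as a list of sets:

```python
import itertools

def smallest_common_cluster(clustering, s, t):
    """C(s, s'): the smallest cluster containing both points, where
    the clustering is a list of frozensets forming a hierarchy."""
    containing = [C for C in clustering if s in C and t in C]
    return min(containing, key=len)

def merge_distortion(M, S, clustering):
    """Definitions 4 and 5: max over distinct pairs s, s' of
    |M(s, s') - min_{u,v in C(s,s')} M(u, v)|."""
    worst = 0.0
    for s, t in itertools.combinations(S, 2):
        C = smallest_common_cluster(clustering, s, t)
        induced = min(M(u, v) for u in C for v in C)
        worst = max(worst, abs(M(s, t) - induced))
    return worst

# A two-block mergeon (merge height 0.9 within a block, 0.3 across)
# and a sample of three points; the block-respecting clustering has
# zero distortion, while the trivial one-cluster tree does not.
M = lambda x, y: 0.9 if (x < 0.5) == (y < 0.5) else 0.3
S = [0.1, 0.2, 0.7]
good = [frozenset(S), frozenset({0.1, 0.2}), frozenset({0.7})]
```

The example mergeon here is hypothetical, chosen only to make the two cases easy to verify by hand.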
Our approach is to assign a probability to every pair (G, S )\nof a graph and possible labeling. We will thus be able to measure the probability mass of the set of\n\n6\n\n\f\u220f\n\n[\n\n]\n\n\u220f\n\n})\n\n)\n\n\u2211\n\n(\u222b\n\n({\n\n\u222a\n\nW(xi, x j)\n\n1 \u2212 W(xi, x j)\n\n,\n\n(i, j)\u2208E(G)\n\n(i, j)\ufffdE(G)\n\npairs for which a method performs poorly, i.e., results in a large merge distortion.\nMore formally, let Gn denote the set of all undirected, unweighted graphs on node set [n], and let\n\u03a3n be the sigma-algebra of Lebesgue-measurable subsets of [0, 1]n. A graphon W induces a unique\nproduct measure \u039bW,n de\ufb01ned on the product sigma-algebra 2Gn \u00d7 \u03a3n such that for all G \u2208 2Gn and\nS \u2208 \u03a3n:\n\n\u039bW,n(G \u00d7 S) =\n\nG\u2208G\n\nS LW(S|G) dS\n\n, where LW(S | G) =\n\nwhere E(G) represents the edge set of the graph G. We recognize LW(S | G) as the integrand in\nEquation 1 for the probability of a graph as determined by a graphon. If G is \ufb01xed, integrating\nLW(S | G) over all S \u2208 [0, 1]n gives the probability of G under the model de\ufb01ned by W.\nWe may now formally de\ufb01ne our notion of consistency. First, some notation: If C is a clustering of\n[n] and S = (x1, . . . , xn), write C \u25e6 S to denote the relabeling of C by S , in which i is replaced by xi\nin every cluster. Then if f is a hierarchical graph clustering method, f (G) \u25e6 S is a clustering of S ,\nand \u02c6M f (G)\u25e6S denotes the merge function induced on f (G) \u25e6 S by M.\nDe\ufb01nition 6 (Consistency). Let W be a graphon and M be a mergeon of W. A hierarchical graph\nclustering method f is said to be a consistent estimator of the graphon cluster tree of W if for any\n\ufb01xed \u03f5 > 0, as n \u2192 \u221e, \u039bW,n\nThe choice of mergeon for the graphon W does not a\ufb00ect consistency, as any two mergeons of the\nsame graphon di\ufb00er on a set of measure zero. 
Furthermore, consistency is with respect to the random graph model, and not to any particular graphon representing the model. The following claim, the proof of which is in Appendix B, makes this precise.

Claim 4. Let W be a graphon and φ a measure preserving transformation. A clustering method f is a consistent estimator of the graphon cluster tree of W if and only if it is a consistent estimator of the graphon cluster tree of W^φ.

Consistency and the blockmodel. If a graph clustering method is consistent in the sense defined above, it is also consistent in the stochastic blockmodel; i.e., it ensures strict recovery of the communities with high probability as the size of the graphs grows large. For instance, suppose W is a stochastic blockmodel graphon with α along the block-diagonal and β < α everywhere else. W has two clusters at level α, merging into one cluster at level β. When the merge distortion between the graphon cluster tree and a clustering C is less than α − β, which will eventually be the case with high probability if the method is consistent, the two clusters are totally disjoint in C; this implication is made precise by Claim 5 in Appendix A.4.

5 Consistent algorithms

We now demonstrate that consistent clustering methods exist. We present two results: first, we show that any method which is capable of consistently estimating the probability of each edge in a random graph leads to a consistent clustering method. We then analyze a modification of an existing algorithm to show that it consistently estimates edge probabilities. As a corollary, we identify a graph clustering method which satisfies our notion of consistency. Our results will be for graphons which are piecewise Lipschitz (or weakly isomorphic to a piecewise Lipschitz graphon):

Definition 7 (Piecewise Lipschitz). We say that B = {B1, . . .
, Bk} is a block partition if each Bi is an open, half-open, or closed interval in [0, 1] with positive measure, Bi ∩ Bj is empty whenever i ≠ j, and ∪B = [0, 1]. We say that a graphon W is piecewise c-Lipschitz if there exists a block partition B such that for any (x, y) and (x′, y′) in Bi × Bj, |W(x, y) − W(x′, y′)| ≤ c(|x − x′| + |y − y′|).

Our first result concerns methods which are able to consistently estimate edge probabilities in the following sense. Let S = (x1, . . . , xn) be an ordered set of n uniform random variables drawn from the unit interval. Fix a graphon W, and let P be the random matrix whose ij entry is given by W(xi, xj). We say that P is the random edge probability matrix. Assuming that W has structure, it is possible to estimate P from a single graph sampled from W. We say that an estimator P̂ of P is consistent in max-norm if, for any ε > 0, lim_{n→∞} P(max_{i≠j} |Pij − P̂ij| > ε) = 0. The following nontrivial theorem, whose proof comprises Appendix D, states that any estimator which is consistent in this sense leads to a consistent clustering algorithm:

Theorem 1. Let W be a piecewise c-Lipschitz graphon. Let P̂ be a consistent estimator of P in max-norm. Let f be the clustering method which performs single-linkage clustering using P̂ as a similarity matrix. Then f is a consistent estimator of the graphon cluster tree of W.

Algorithm 1 Clustering by nbhd.
smoothing
Require: Adjacency matrix A, C ∈ (0, 1)
  % Step 1: Compute the estimated edge probability matrix P̂ using the
  % neighborhood smoothing algorithm based on [21]
  n ← Size(A)
  h ← C √((log n)/n)
  for i ≠ j ∈ [n] × [n] do
    Â ← A after setting row/column j to zero
    for i′ ∈ [n] \ {i, j} do
      d_j(i, i′) ← max_{k ≠ i, i′, j} |(Â²/n)_{ik} − (Â²/n)_{i′k}|
    end for
    q_{ij} ← the h-th quantile of {d_j(i, i′) : i′ ≠ i, j}
    N_{ij} ← {i′ ≠ i, j : d_j(i, i′) ≤ q_{ij}}
  end for
  for (i, j) ∈ [n] × [n] do
    P̂_{ij} ← (1/2) [ (1/|N_{ij}|) ∑_{i′ ∈ N_{ij}} A_{i′j} + (1/|N_{ji}|) ∑_{j′ ∈ N_{ji}} A_{ij′} ]
  end for
  % Step 2: Cluster P̂ with single linkage
  C ← the single linkage clusters of P̂
  return C

Estimating the matrix of edge probabilities has been a direction of recent research; however, we are aware only of results which show consistency in mean squared error, that is, the literature contains estimators for which (1/n²)‖P − P̂‖²_F tends to zero in probability. One practical method is the neighborhood smoothing algorithm of [21]. The method constructs for each node i in the graph G a neighborhood of nodes Ni which are similar to i in the sense that for every i′ ∈ Ni, the corresponding column A_{i′} of the adjacency matrix is close to A_i in a particular distance. A_{ij} is clearly not a good estimate of the probability of the edge (i, j), as it is either zero or one; however, if the graphon is piecewise Lipschitz, the average of A_{i′j} over i′ ∈ N_{ij} will intuitively tend to the true probability. Like others, the method of [21] is proven to be consistent in mean squared error. Since Theorem 1 requires consistency in max-norm, we analyze a slight modification of this algorithm and show that it consistently estimates P in this stronger sense. The technical details are in Appendix E.

Theorem 2. If the graphon W is piecewise Lipschitz, the modified neighborhood smoothing algorithm in Appendix E is a consistent estimator of P in max-norm.

As a corollary, we identify a practical graph clustering algorithm which is a consistent estimator of the graphon cluster tree. The algorithm is shown in Algorithm 1, and details are in Appendix E.2. Appendix F contains experiments in which the algorithm is applied to real and synthetic data.

Corollary 1. If the graphon W is piecewise Lipschitz, Algorithm 1 is a consistent estimator of the graphon cluster tree of W.

6 Discussion

We have presented a consistency framework for clustering in the graphon model and demonstrated that a practical clustering algorithm is consistent. We now identify two interesting directions of future research. First, it would be interesting to consider the extension of our framework to sparse random graphs; many real-world networks are sparse, and the graphon generates dense graphs. Recently, however, sparse models which extend the graphon have been proposed; see [7, 6]. It would be interesting to see what modifications are necessary to apply our framework in these models.

Second, it would be interesting to consider alternative ways of defining the ground truth clustering of a graphon. Our construction is motivated by interpreting the graphon W not only as a random graph model, but also as a similarity function, which may not be desirable in certain settings. For example, consider a "bipartite" graphon W, which is one along the block anti-diagonal and zero elsewhere.
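This bipartite graphon is easy to examine numerically. The sketch below assumes the convention that "bipartite" means W is one on the two anti-diagonal blocks of [0, 1]² (edges appear only across the halves), and approximates the operator square (W ∘ W)(x, y) = ∫ W(x, z) W(z, y) dz on a grid.

```python
import numpy as np

# Bipartite graphon: one when x and y lie in opposite halves of [0, 1],
# zero otherwise (the assumed anti-diagonal convention).
def W_bip(x, y):
    return np.where((x < 0.5) != (y < 0.5), 1.0, 0.0)

# Approximate the operator square (W ∘ W)(x, y) = ∫ W(x, z) W(z, y) dz
# by a Riemann sum on a uniform grid of midpoints.
m = 400
z = (np.arange(m) + 0.5) / m                 # grid midpoints in [0, 1]
Wm = W_bip(z[:, None], z[None, :])           # discretized kernel
W2 = Wm @ Wm / m                             # matrix product / m ≈ integral

# W links the two halves at every positive level, while its operator
# square concentrates on the diagonal blocks: 1/2 within, 0 across.
same_half, opposite_half = W2[0, 0], W2[0, -1]
print(same_half, opposite_half)              # ≈ 0.5 and 0.0
```

This is the computation behind the transformation discussed next: the squared graphon separates the two halves that W itself does not.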
The cluster tree of W consists of a single cluster at all levels, whereas the ideal bipartite clustering has two clusters. Therefore, consider applying a transformation S to W which maps it to a "similarity" graphon. The goal of clustering then becomes the recovery of the cluster tree of S(W) given a random graph sampled from W. For instance, let S : W ↦ W², where W² is the operator square of the bipartite graphon W. The cluster tree of S(W) has two clusters at all positive levels, and so represents the desired ground truth. In general, any such transformation S leads to a different clustering goal. We speculate that, with minor modification, the framework herein can be used to prove consistency results in a wide range of graph clustering settings.

Acknowledgements. This work was supported by NSF grant IIS-1550757.

References

[1] Emmanuel Abbe, Afonso S Bandeira, and Georgina Hall. Exact recovery in the stochastic block model. IEEE Trans. Inf. Theory, 62(1):471–487, 2015.

[2] Edoardo M Airoldi, Thiago B Costa, and Stanley H Chan. Stochastic blockmodel approximation of a graphon: Theory and consistent estimation. In C J C Burges, L Bottou, M Welling, Z Ghahramani, and K Q Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 692–700. Curran Associates, Inc., 2013.

[3] Robert B Ash and Catherine Doleans-Dade. Probability and measure theory. Academic Press, 2000.

[4] Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, and Aarti Singh. Noise thresholds for spectral clustering. In J Shawe-Taylor, R S Zemel, P L Bartlett, F Pereira, and K Q Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 954–962. Curran Associates, Inc., 2011.

[5] C Borgs, J T Chayes, L Lovász, V T Sós, and K Vesztergombi. Convergent sequences of dense graphs I: Subgraph frequencies, metric properties and testing. Adv.
Math., 219(6):1801–1851, 20 December 2008.

[6] Christian Borgs, Jennifer T Chayes, Henry Cohn, and Nina Holden. Sparse exchangeable graphs and their limits via graphon processes. arXiv:1601.07134, 26 January 2016.

[7] François Caron and Emily B Fox. Sparse graphs using exchangeable random measures. arXiv:1401.1137, 6 January 2014.

[8] Stanley Chan and Edoardo Airoldi. A consistent histogram estimator for exchangeable graph models. In Proceedings of The 31st International Conference on Machine Learning, pages 208–216, 2014.

[9] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, pages 343–351, 2010.

[10] Justin Eldridge, Mikhail Belkin, and Yusu Wang. Beyond Hartigan consistency: Merge distortion metric for hierarchical clustering. In Proceedings of The 28th Conference on Learning Theory, pages 588–606, 2015.

[11] M Girvan and M E J Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. U. S. A., 99(12):7821–7826, 11 June 2002.

[12] J. A. Hartigan. Consistency of single linkage for high-density clusters. Journal of the American Statistical Association, 76(374):388–394, June 1981.

[13] Svante Janson. Connectedness in graph limits. arXiv:0802.3795, 26 February 2008.

[14] Samory Kpotufe and Ulrike von Luxburg. Pruning nearest neighbor cluster trees. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 225–232, New York, NY, USA, 2011. ACM.

[15] László Lovász. Large networks and graph limits, volume 60. American Mathematical Soc., 2012.

[16] László Lovász and Balázs Szegedy. Limits of dense graph sequences. J. Combin. Theory Ser. B, 96(6):933–957, November 2006.

[17] F McSherry. Spectral partitioning of random graphs.
In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, pages 529–537, October 2001.

[18] Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Stat., 39(4):1878–1915, August 2011.

[19] I Steinwart. Adaptive density level set clustering. In Proceedings of The 24th Conference on Learning Theory, pages 703–737, 2011.

[20] Patrick J Wolfe and Sofia C Olhede. Nonparametric graphon estimation. arXiv:1309.5936, 23 September 2013.

[21] Yuan Zhang, Elizaveta Levina, and Ji Zhu. Estimating network edge probabilities by neighborhood smoothing. arXiv:1509.08588, 29 September 2015.