{"title": "On clustering network-valued data", "book": "Advances in Neural Information Processing Systems", "page_first": 7071, "page_last": 7081, "abstract": "Community detection, which focuses on clustering nodes or detecting communities in (mostly) a single network, is a problem of considerable practical interest and has received a great deal of attention in the research community. While being able to cluster within a network is important, there are emerging needs to be able to \\emph{cluster multiple networks}. This is largely motivated by the routine collection of network data that are generated from potentially different populations. These networks may or may not have node correspondence. When node correspondence is present, we cluster networks by summarizing a network by its graphon estimate, whereas when node correspondence is not present, we propose a novel solution for clustering such networks by associating a computationally feasible feature vector to each network based on traces of powers of the adjacency matrix. We illustrate our methods using both simulated and real data sets, and theoretical justifications are provided in terms of consistency.", "full_text": "On clustering network-valued data\n\nSoumendu Sundar Mukherjee\n\nDepartment of Statistics\n\nUniversity of California, Berkeley\nBerkeley, California 94720, USA\n\nsoumendu@berkeley.edu\n\nPurnamrita Sarkar\n\nDepartment of Statistics and Data Sciences\n\nUniversity of Texas, Austin\nAustin, Texas 78712, USA\n\npurna.sarkar@austin.utexas.edu\n\nLizhen Lin\n\nDepartment of Applied and Computational Mathematics and Statistics\n\nUniversity of Notre Dame\n\nNotre Dame, Indiana 46556, USA\n\nlizhen.lin@nd.edu\n\nAbstract\n\nCommunity detection, which focuses on clustering nodes or detecting communities in (mostly) a single network, is a problem of considerable practical interest and has received a great deal of attention in the research community. 
While being able to cluster within a network is important, there are emerging needs to be able to cluster multiple networks. This is largely motivated by the routine collection of network data that are generated from potentially different populations. These networks may or may not have node correspondence. When node correspondence is present, we cluster networks by summarizing a network by its graphon estimate, whereas when node correspondence is not present, we propose a novel solution for clustering such networks by associating a computationally feasible feature vector to each network based on traces of powers of the adjacency matrix. We illustrate our methods using both simulated and real data sets, and theoretical justifications are provided in terms of consistency.\n\n1 Introduction\n\nA network, which is used to model interactions or communications among a set of agents or nodes, is arguably one of the most common and important representations for modern complex data. Networks are ubiquitous in many scientific fields, ranging from computer networks, brain networks and biological networks, to social networks, co-authorship networks and many more. Over the past few decades, great advancement has been made in developing models and methodologies for inference of networks. There is a range of probabilistic models for networks, starting from the relatively simple Erdős–Rényi model [12], stochastic blockmodels and their extensions [15, 17, 6], to infinite-dimensional graphons [28, 13]. These models are often used for community detection, i.e. clustering the nodes in a network. 
Various community detection algorithms or methods have been proposed, including modularity-based methods [21], spectral methods [25], likelihood-based methods [8, 11, 7, 4], and optimization-based approaches like those based on semidefinite programming [5], etc. The majority of the work in the community detection literature, including the above-mentioned works, focuses on finding communities among the nodes in a single network. While this is still a very important problem with many open questions, there is an emerging need to be able to detect clusters among multiple network-valued objects, where a network itself is a fundamental unit of data.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fThis is largely motivated by the routine collection of populations or subpopulations of network-valued data objects. Technological advancement and the explosion of complex data in many domains have made this a somewhat common practice. There has been some notable work on graph kernels in the Computer Science literature [27, 26]. In these works the goal is to efficiently compute different types of kernel-based similarity measures (or their approximations) between networks. In contrast, we ask the following statistical questions. Can we cluster networks consistently from a mixture of graphons, 1) when there is node correspondence and 2) when there is not? The first situation arises, for example, when one has a network evolving over time, or multiple instances of a network between well-defined objects. If one thinks of them as random samples from a mixture of graphons, then can we cluster them? We propose a simple and general algorithm to address this question, which operates by first obtaining a graphon estimate of each of the networks, constructing a distance matrix between those graphon estimates, and then performing spectral clustering on the distance matrix. 
We call this algorithm Network Clustering based on Graphon Estimates (NCGE).\n\nThe second situation arises when one is interested in global properties of a network. This setting is closer to that of graph kernels. Say we have co-authorship networks from Computer Science and High Energy Physics. Are these different types of networks? There has been a lot of empirical and algorithmic work on featurizing networks or computing kernels between networks. But most of these features require expensive computation. We propose a simple feature based on traces of powers of the adjacency matrix for this purpose, which is very cheap to compute as it involves only matrix multiplication. We cluster the networks based on these features and call this method Network Clustering based on Log Moments (NCLM).\n\nWe provide some theoretical guarantees for our algorithms in terms of consistency, in addition to extensive simulations and real data examples. The simulation results show that NCGE clearly outperforms the naive yet popular method of clustering (vectorized) adjacency matrices in various settings. We also show that, in the absence of node correspondence, NCLM is consistently better and faster than methods which featurize networks with different global statistics and graphlet kernels. We also apply NCLM to separate out a mixed bag of real-world networks, like co-authorship networks from different domains and ego networks.\n\nThe rest of the paper is organized as follows. In Section 2 we briefly describe graphon-estimation methods and other related work. Next, in Section 3 we formally describe our setup and introduce our algorithms. Section 4 contains some theory for these algorithms. In Section 5 we provide simulations and real data examples. 
We conclude with a discussion in Section 6.\n\n2 Related work\n\nThe focus of this paper is on 1) clustering networks which have node correspondence based on estimating the underlying graphon and 2) clustering networks without node correspondence based on global properties of the networks. In this section we first cite two methods of obtaining graphon estimates, which we will use in our first algorithm. Second, we cite existing work that summarizes a network using different statistics and compares those to obtain a measure of similarity.\n\nA prominent estimator of graphons is the so-called Universal Singular Value Thresholding (USVT) estimator proposed by [9]. The main idea behind USVT is to essentially estimate the low-rank structure of the population matrix by thresholding the singular values of the observed matrix at a universal cutoff, and then use the retained singular values and the corresponding singular vectors to construct an estimate of the population matrix.\n\nAnother recent work [29] proposes a novel, statistically consistent and computationally efficient approach for estimating the link probability matrix by neighborhood smoothing. Typically, for large networks USVT is a lot more scalable than the neighborhood-smoothing approach. There are several other methods for graphon estimation, e.g., by fitting a stochastic blockmodel [24]. These methods can also be used in our algorithm.\n\n2\n\n\fIn [10], a graph-based method for change-point detection is proposed, where an independent sequence of observations is considered. These are generated i.i.d. under the null hypothesis, whereas under the alternative, after a change point, the underlying distribution changes. The goal is to find this change point. The observations can be high-dimensional vectors or even networks, with the latter bearing some resemblance to our first framework. 
This can\nbe viewed as clustering the observations into \u201cpast\u201d and \u201cfuture\u201d. We remark here that our\ngraphon-estimation based clustering algorithm suggests an alternative method for change\npoint detection in networks, namely by looking at the second eigenvector of the distance\nmatrix between estimated graphons. Another related work is due to [14] which aims to\nextend the classical large sample theory to model network-valued objects.\nFor comparing global properties of networks, there have been many interesting works that\nfeaturize networks, see, for instance, [3]. In the Computer Science literature, graph kernels\nhave gained much attention [27, 26]. In these works the goal is to e\ufb03ciently compute di\ufb00erent\ntypes of kernel based similarity measures (exact or approximate) between networks.\n\n3 A framework for clustering networks\n\nLet G be a binary random network or graph with n nodes. Denote by A its adjacency\nmatrix, which is an n by n symmetric matrix with binary entries. That is, Aij = Aji \u2208\n{0, 1}, 1 \u2264 i < j \u2264 n, where Aij = 1 if there is an observed edge between nodes i and j, and\nAij = 0 otherwise. All the diagonal elements of A are structured to be zero (i.e. Aii = 0).\nWe assume the following random Bernoulli model with\n\nAij | Pij \u223c Bernoulli(Pij), i < j,\n\n(1)\n\nwhere Pij = P(Aij = 1) is the probability of link formation between nodes i and j. We\ndenote the link probability matrix as P = ((Pij)). The edge probabilities are often modeled\nusing the so-called graphons. A graphon f is a nonnegative bounded, measurable symmetric\nfunction f : [0, 1]2 \u2192 [0, 1]. Given such an f, one can use the model\n\nPij = f(\u03bei, \u03bej),\n\n(2)\n\nwhere \u03bei, \u03bej are i.i.d. uniform random variables on (0, 1). In fact, any (in\ufb01nite) exchangeable\nnetwork arises in this way (by Aldous-Hoover representation [2, 16]).\nOur current work focuses on the problem of clustering networks. 
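As an illustration, the sampling scheme in (1)-(2) can be sketched in a few lines of NumPy; the smooth graphon f below is a hypothetical choice, used only to make the example concrete:

```python
import numpy as np

def sample_graphon_network(f, n, rng):
    """Sample a simple undirected adjacency matrix via (1)-(2):
    draw xi_i ~ Uniform(0, 1) i.i.d., set P_ij = f(xi_i, xi_j),
    then A_ij | P_ij ~ Bernoulli(P_ij) for i < j, with A_ii = 0."""
    xi = rng.uniform(size=n)
    P = f(xi[:, None], xi[None, :])            # link probability matrix
    upper = rng.uniform(size=(n, n)) < P       # Bernoulli draws
    A = np.triu(upper, k=1).astype(int)        # keep only i < j
    return A + A.T                             # symmetrize; diagonal stays zero

rng = np.random.default_rng(0)
f = lambda x, y: 0.3 * np.exp(-(x - y) ** 2)   # hypothetical graphon, illustration only
A = sample_graphon_network(f, n=100, rng=rng)
```

Any symmetric measurable f with values in [0, 1] can be plugged in the same way.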
Unlike in a traditional setup, where one observes a single network (with a potentially growing number of nodes) and the goal often is to cluster the nodes, here we observe multiple networks and are interested in clustering these networks viewed as fundamental data units.\n\n3.1 Node correspondence present\n\nA simple and natural model for this is what we call the graphon mixture model for obvious reasons: there are only K (fixed) underlying graphons f1, . . . , fK giving rise to link probability matrices Π1, . . . , ΠK and we observe T networks sampled i.i.d. from the mixture model\n\nPmix(A) = Σ_{i=1}^{K} qi PΠi(A), (3)\n\nwhere the qi's are the mixing proportions and PP(A) = ∏_{u<v} Puv^{Auv} (1 − Puv)^{1−Auv} is the probability of observing the adjacency matrix A when the link probability matrix is given by P. Consider n nodes, and T independent networks Ai, i ∈ [T], which define edges between these n nodes. We propose the following simple and general algorithm (Algorithm 1) for clustering them.\n\n3\n\n\fAlgorithm 1 Network Clustering based on Graphon Estimates (NCGE)\n1: Graphon estimation. Given A1, . . . , AT , estimate their corresponding link probability matrices P1, . . . , PT using any one of the 'blackbox' algorithms such as USVT ([9]), the neighborhood smoothing approach by [29], etc. Call these estimates P̂1, . . . , P̂T .\n2: Forming a distance matrix. Compute the T by T distance matrix D̂ with D̂ij = ‖P̂i − P̂j‖F , where ‖·‖F is the Frobenius norm.\n3: Clustering. Apply the spectral clustering algorithm to the distance matrix D̂.\n\nWe will from now on denote the above algorithm with the different graphon estimation ('blackbox') approaches as follows: the algorithm with USVT as blackbox will be denoted by CL-USVT and the one with the neighborhood smoothing method as blackbox will be denoted by CL-NBS. 
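To make Algorithm 1 concrete, here is a minimal NumPy sketch. The graphon-estimation step uses a crude USVT-style estimator (singular values thresholded at (2 + η)√n, a simplification of [9]; the paper treats this step as a black box), and the clustering step runs a small k-means on the top-K eigenvectors of the distance matrix:

```python
import numpy as np

def usvt(A, eta=0.01):
    """Crude USVT-style estimate of the link-probability matrix:
    keep singular values above (2 + eta) * sqrt(n), clip to [0, 1]."""
    n = A.shape[0]
    U, s, Vt = np.linalg.svd(A)
    keep = s >= (2 + eta) * np.sqrt(n)
    return np.clip((U[:, keep] * s[keep]) @ Vt[keep], 0.0, 1.0)

def kmeans(X, K, iters=50):
    """Tiny k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, K):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def ncge(networks, K):
    """Algorithm 1 (NCGE): graphon estimates -> Frobenius distances -> spectral clustering."""
    P_hat = [usvt(A) for A in networks]
    T = len(networks)
    D = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1, T):
            D[i, j] = D[j, i] = np.linalg.norm(P_hat[i] - P_hat[j])  # ||P_i - P_j||_F
    w, V = np.linalg.eigh(D)
    top = np.argsort(-np.abs(w))[:K]   # leading K eigenvectors by |eigenvalue|
    return kmeans(V[:, top], K)
```

Swapping a neighborhood-smoothing estimator [29] in for usvt gives CL-NBS; clustering the vectorized adjacency matrices directly gives the CL-NAIVE baseline.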
We will compare these two algorithms with the CL-NAIVE method, which does not estimate the underlying graphon, but clusters vectorized adjacency matrices directly (in the spirit of [10]).\n\n3.2 Node correspondence absent\n\nWe use certain graph statistics to construct a feature vector. The basic statistics we choose are the traces of powers of the adjacency matrix, suitably normalized, and we call them graph moments:\n\nmk(A) = trace((A/n)^k). (4)\n\nThese statistics are related to various path/subgraph counts. For example, m2(A) is the normalized count of the total number of edges and m3(A) is the normalized triangle count of A. Higher order moments are actually counts of closed walks (or directed circuits). The reason we use graph moments instead of subgraph counts is that the latter are quite difficult to compute, and present-day algorithms work only for subgraphs up to size 5. On the contrary, graph moments are easy to compute as they only involve matrix multiplication. While it may seem that this is essentially the same as comparing the eigenspectrum, it is not clear how many eigenvalues one should use. Even if one could estimate the number of large eigenvalues using a USVT-type estimator, this number is different for different networks. The trace takes into account the relative magnitudes of the eigenvalues naturally. In fact, we tried (see Section 5) using the top few eigenvalues as the sole features, but the results were not as satisfactory as using mk.\n\nWe now present our second algorithm (Algorithm 2). In step 2 below we take d to be the standard Euclidean metric.\n\nAlgorithm 2 Network Clustering based on Log Moments (NCLM)\n1: Moment calculation. For each network Ai, i ∈ [T] and a positive integer J, compute the feature vector gJ(Ai) := (log m1(Ai), log m2(Ai), . . . , log mJ(Ai)).\n2: Forming a distance matrix. For some metric d, set D̂ij = d(gJ(Ai), gJ(Aj)).\n3: Clustering. Apply the spectral clustering algorithm to the distance matrix D̂.\n\nNote: The rationale behind taking a logarithm of the graph moments is that if we have two graphs with the same degree density but different sizes, then the degree density will not play any role in the distance (which is necessary because the degree density would subdue any other differences otherwise). The parameter J counts, in some sense, the effective number of eigenvalues we are using.\n\n4 Theory\n\nWe will only mention our main results and discuss some of the consequences here. All the proofs and further details can be found in the supplementary article [1].\n\n4\n\n\f4.1 Results on NCGE\n\nWe can think of D̂ij as estimating Dij = ‖Pi − Pj‖F .\n\nTheorem 4.1. Suppose D = ((Dij)) has rank K. Let V (resp. V̂ ) be the T × K matrix whose columns correspond to the leading K eigenvectors (corresponding to the K largest-in-magnitude eigenvalues) of D (resp. D̂). Let γ = γ(K, n, T) be the K-th smallest eigenvalue of D in magnitude. Then there exists an orthogonal matrix Ô such that\n\n‖V̂ Ô − V‖²_F ≤ (64T/γ²) Σ_i ‖P̂i − Pi‖²_F .\n\nCorollary 4.2. Assume for some absolute constants α, β > 0 the following holds for each i ∈ [T]:\n\n‖P̂i − Pi‖²_F / n² ≤ Ci n^{−α} (log n)^β, (5)\n\neither in expectation or with high probability (≥ 1 − εi,n). Then in expectation or with high probability (≥ 1 − Σ_i εi,n) we have that\n\n‖V̂ Ô − V‖²_F ≤ 64 CT T² n^{2−α} (log n)^β / γ², (6)\n\nwhere CT = max_{1≤i≤T} Ci. If there are K (fixed, not growing with T) underlying graphons, then the constant CT does not depend on T. 
Table 1 reports values of α, β for various graphon estimation procedures (under assumptions on the underlying graphons that are described in the supplementary article [1]).\n\nTable 1: Values of α, β for various graphon estimation procedures.\n\nProcedure | USVT | NBS | Minimax rate\nα | 1/3 | 1/2 | 1\nβ | 0 | 1/2 | 1\n\nWhile it is hard to obtain an explicit lower bound on γ in general, let us consider a simple equal-weight mixture of two graphons to illustrate the relationship between γ and the separation between graphons. Let the distance between the population graphons be dn, i.e. ‖Π1 − Π2‖F = dn. Then we have\n\nD = Z [0, dn; dn, 0] Z^T,\n\nwhere the i-th row of the binary matrix Z has a single one at position l if network Ai is sampled from Πl. The nonzero eigenvalues of this matrix are T nd/2 and −T nd/2. Thus in this case γ = T nd/2. As a result (6) becomes\n\n‖V̂ Ô − V‖²_F ≤ 256 CT n^{−α} (log n)^β / d². (7)\n\nLet us look at a more specific case of blockmodels with the same number (= m) of clusters of equal sizes (= n/m) to gain some insight into d. Let C be an n × m binary matrix of memberships such that Cib = 1 if node i within a blockmodel comes from cluster b. Consider two blockmodels Π1 = CB1C^T with B1 = (p − q)Im + qEm and Π2 = CB2C^T with B2 = (p′ − q′)Im + q′Em, where Im is the identity matrix of order m (here the only difference between the models comes from link formation probabilities within/between blocks, the blocks remaining the same). In this case\n\nd² = ‖Π1 − Π2‖²_F / n² = (1/m)(p − p′)² + (1 − 1/m)(q − q′)².\n\nThe bound (6) can be turned into a bound on the proportion of “misclustered” networks, defined appropriately. 
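As a quick numerical sanity check of the displayed formula for d², one can compare it against a direct computation of ‖Π1 − Π2‖²_F / n² for two such blockmodels; the parameter values below are arbitrary illustrations:

```python
import numpy as np

m, n = 4, 200                                   # m equal-size blocks, n nodes (illustrative)
p, q = 0.30, 0.10                               # within/between probabilities of Pi_1
p2, q2 = 0.20, 0.15                             # within/between probabilities of Pi_2
C = np.kron(np.eye(m), np.ones((n // m, 1)))    # n x m membership matrix
B1 = (p - q) * np.eye(m) + q * np.ones((m, m))  # B = (p - q) I_m + q E_m
B2 = (p2 - q2) * np.eye(m) + q2 * np.ones((m, m))
Pi1, Pi2 = C @ B1 @ C.T, C @ B2 @ C.T

d2_direct = np.linalg.norm(Pi1 - Pi2) ** 2 / n ** 2
d2_formula = (p - p2) ** 2 / m + (1 - 1 / m) * (q - q2) ** 2
assert abs(d2_direct - d2_formula) < 1e-12      # the two computations agree
```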
There are several ways to define misclustered nodes in the context of community detection in stochastic blockmodels that are easy to analyze with spectral clustering (see, e.g., [25, 18]). These definitions work in our context too. For example, if we\n\n5\n\n\fuse Definition 4 of [25] and denote by M the set of misclustered networks, then from the proof of their Theorem 1, we have\n\n|M| ≤ 8 mT ‖V̂ Ô − V‖²_F ,\n\nwhere mT = max_{j=1,...,K} (Z^T Z)jj is the maximum number of networks coming from any of the graphons.\n\n4.2 Results on NCLM\n\nWe first establish concentration of trace(A^k). The proof uses Talagrand's concentration inequality, which requires additional results on Lipschitz continuity and convexity. This is obtained via decomposing A ↦ trace(A^k) into a linear combination of convex-Lipschitz functions.\n\nTheorem 4.3 (Concentration of moments). Let A be the adjacency matrix of a random graph with link-probability matrix P, and for any k let ψk(A) := (n/√2) mk(A). Then\n\nP(|ψk(A) − Eψk(A)| > t) ≤ 4 exp(−(t − 4√(k/2))²/16).\n\nAs a consequence of this, we can show that gJ(A) concentrates around ḡJ(A) := (log Em2(A), . . . , log EmJ(A)).\n\nTheorem 4.4 (Concentration of gJ(A)). Let EA = ρS, where ρ ∈ (0, 1), min_{i,j} Sij = Ω(1), and Σ_{i,j} Sij = n². Then ‖ḡJ(A)‖ = Θ(J^{3/2} log(1/ρ)), and for any 0 < δ < 1 satisfying δJ log(1/ρ) = Ω(1), we have\n\nP(‖gJ(A) − ḡJ(A)‖ ≥ δ J^{3/2} log(1/ρ)) ≤ J C1 e^{−C2 n² ρ^{2J}}.\n\nWe expect that ḡJ will be a good population-level summary for many models. In general, it is hard to show an explicit separation result for ḡJ . However, in simple models, we can do explicit computations to show separation. 
For example, in a two-parameter blockmodel B = (p − q)Im + qEm, with equal block sizes, we have Em2(A) = (p/m + (m − 1)q/m)(1 + o(1)), Em3(A) = (p³/m² + (m − 1)pq²/m² + (m − 1)(m − 2)q³/6m²)(1 + o(1)) and so on. Thus we see that if m = 2, then ḡ2 should be able to distinguish between such blockmodels (i.e. different p, q).\n\nNote: After this paper was submitted, we came to know of a concurrent work [20] that provides a topological/combinatorial perspective on the expected graph moments Emk(A). Theorem 1 in [20] shows that under some mild assumptions on the model (satisfied, for example, by generalized random graphs with bounded kernels as long as the average degree grows to infinity), E trace(A^k) = E(# of closed k-walks) will be asymptotic to E(# of closed k-walks that trace out a k-cycle) plus 1{k even} E(# of closed k-walks that trace out a (k/2 + 1)-tree). For even k, if the degree grows fast enough, k-cycles tend to dominate, whereas for sparser graphs trees tend to dominate. From this and our concentration results, we can expect NCLM to be able to tell apart graphs which are different in terms of the counts of these simpler closed k-walks. Incidentally, the authors of [20] also show that the expected count of closed non-backtracking walks of length k is dominated by walks tracing out k-cycles. Thus if one uses counts of closed non-backtracking k-walks (i.e. moments of the non-backtracking matrix) instead of just closed k-walks as features, one would expect similar performance on denser networks, but in sparser settings it may lead to improvements because of the absence of the non-informative trees in lower-order even moments.\n\n5 Simulation study and data analysis\n\nIn this section, we describe the results of our experiments with simulated and real data to evaluate the performance of NCGE and NCLM. 
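A sketch of the moment featurization behind NCLM: since trace(A) = 0 for simple graphs, this illustration starts the feature vector at m2 (matching the population summary ḡJ above); everything reduces to plain matrix multiplication:

```python
import numpy as np

def log_moments(A, J):
    """Feature vector of log graph moments log m_k(A), k = 2..J,
    where m_k(A) = trace((A/n)^k); computed by repeated multiplication.
    Assumes all m_k > 0 (e.g. graphs containing triangles)."""
    n = A.shape[0]
    B = A / n
    M = B @ B                     # (A/n)^2
    g = [np.log(np.trace(M))]
    for _ in range(3, J + 1):
        M = M @ B                 # now (A/n)^k
        g.append(np.log(np.trace(M)))
    return np.array(g)

rng = np.random.default_rng(2)
def er(p, n=200):
    A = np.triu(rng.uniform(size=(n, n)) < p, 1).astype(float)
    return A + A.T

g_a, g_b, g_c = (log_moments(er(p), J=5) for p in (0.3, 0.3, 0.1))
# graphs of the same density get nearby features; different densities separate
assert np.linalg.norm(g_a - g_b) < np.linalg.norm(g_a - g_c)
```

Pairwise Euclidean distances between such feature vectors give the matrix D̂ of Algorithm 2, which is then fed to spectral clustering.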
We measure performance in terms of clustering error, which is the minimum normalized Hamming distance between the estimated label vector and all K! permutations of the true label assignment. Clustering accuracy is one minus clustering error.\n\n6\n\n\fNode correspondence present: We provide two simulated data experiments1 for clustering networks with node correspondence. In each experiment twenty 150-node networks were generated from a mixture of two graphons, 13 networks from the first and the other 7 from the second. We also used a scalar multiplier with the graphons to ensure that the networks are not too dense. The average degree for all these experiments was around 20-25. We report the average error bars from a few random runs.\n\nFirst we generate a mixture of graphons from two blockmodels, with probability matrices (pi − qi)Im + qiEm with i ∈ {1, 2}. We use p2 = p1(1 + ε) and q2 = q1(1 + ε) and measure clustering accuracy as the multiplicative error ε is increased from 0.05 to 0.15. We compare CL-USVT, CL-NBS and CL-NAIVE and the results are summarized in Figure 1(A). We have observed two things. First, CL-USVT and CL-NBS start distinguishing the graphons better as ε increases (as the theory suggests). Second, the naive approach does not do a good job even when ε increases.\n\nFigure 1: We show the behavior of the three algorithms when ε increases, when the underlying network is generated from (A) a blockmodel, and (B) a smooth graphon.\n\nIn the second simulation, we generate the networks from two smooth graphons Π1 and Π2, where Π2 = Π1(1 + ε) (here Π1 corresponds to the graphon 3 appearing in Table 1 of [29]). As is seen from Figure 1(B), here also CL-USVT and CL-NBS outperform the naive algorithm by a huge margin. 
Also, CL-NBS is consistently better than CL-USVT, which shows that the accuracy of the graphon estimation procedure is important (for example, USVT is known to perform worse as the network becomes sparser).\n\nNode correspondence absent: We show the efficacy of our approach via two sets of experiments. We compare our log-moment based method NCLM with three other methods. The first is Graphlet Kernels [26] with 3, 4 and 5 graphlets, denoted by GK3, GK4 and GK5 respectively. In the second method (GraphStats), we use six different network-based statistics to summarize each graph; these statistics are the algebraic connectivity, the local and global clustering coefficients [23], the distance distribution [19] for 3 hops, the Pearson correlation coefficient [22] and the rich-club metric [30]. We also compare against graphs summarized by the top J eigenvalues of A/n (TopEig). These are detailed in the supplementary article [1].\n\nFor each distance matrix D̂ computed with NCLM, GraphStats and TopEig, we calculate a similarity matrix K = exp(−t D̂), where t is chosen as the value, within a range, which maximizes the relative eigengap (λK(K) − λK+1(K))/λK+1(K). It would be interesting to have a data-dependent range for t.\n\nFor each matrix K we calculate the top few eigenvectors, say N many, and do K-means on them to get the final clustering. We use N = K; however, for GK3, GK4 and GK5, we had to use a smaller N, which boosted their clustering accuracy.\n\nFirst we construct four sets of parameters for the two-parameter blockmodel (also known as the planted partition model): Θ1 = {p = 0.1, q = 0.05, K = 2, ρ = 0.6}, Θ2 = {p = 0.1, q = 0.05, K = 2, ρ = 1}, Θ3 = {p = 0.1, q = 0.05, K = 8, ρ = 0.6}, and Θ4 = {p = 0.2, q = 0.1, K = 8, ρ = 0.6}. Note that the first two settings differ only in the density parameter ρ. The second two settings differ in the within and across cluster probabilities. The first two and second two differ in K. For each parameter setting we generate two sets of 20 graphs, one with n = 500 and the other with n = 1000.\n\n1Code used in this paper is publicly available at https://github.com/soumendu041/clustering-network-valued-data.\n\n7\n\n\fFor choosing J, we calculate the moments for a large J; compute a kernel similarity matrix for each choice of J and report the one with the largest relative eigengap between the K-th and (K + 1)-th eigenvalues. We show these plots in the supplementary article [1]. We see that the eigengap increases and levels off after a point. However, as J increases, the computation time increases, so there is a trade-off. We report the accuracy for J = 5; J = 8 attains the same accuracy in 48 seconds.\n\nTable 2: Error of 6 different methods on the simulated networks.\n\nMethod | NCLM (J = 5) | GK3 | GK4 | GK5 | GraphStats (J = 6) | TopEig (J = 5)\nError | 0 | 0.5 | 0.36 | 0.26 | 0.37 | 0.18\nTime (s) | 25 | 14 | 16 | 38 | 94 | 8\n\nWe see that NCLM performs the best. For GK3, GK4 and GK5, if one uses the top two eigenvectors, and clusters those into 4 groups (since there are four parameter settings), the errors are respectively 0.08, 0.025 and 0.03. This means that, for clustering, one needs to estimate the effective rank of the graphlet kernels as well. TopEig performs better than GraphStats, which has trouble separating out Θ2 and Θ4.\n\nNote: Intuitively one would expect that, if there is node correspondence between the graphs, clustering based on graphon estimates would work better, because it aims to estimate the underlying probabilistic model for comparison. However, in our experiments we found that a properly tuned NCLM matched the performance of NCGE. This is probably because a properly tuned NCLM captures the global features that distinguish two graphons. 
We leave it for future work to compare their performance theoretically.\n\nReal Networks: We cluster about fifty real-world networks. We use 11 co-authorship networks between 15,000 researchers from the High Energy Physics corpus of the arXiv, 11 co-authorship networks with 21,000 nodes from Citeseer (which had Machine Learning in their abstracts), 17 co-authorship networks (each with about 3000 nodes) from the NIPS conference and finally 10 Facebook ego networks2. The average degrees vary between 0.2 and 0.4 for the co-authorship networks and are around 10 for the ego networks. Each co-authorship network is dynamic, i.e. a node corresponds to an author in that corpus and this node index is preserved in the different networks over time. The ego networks are different in that sense, since each network is the subgraph of Facebook induced by the neighbors of a given central or "ego" node. The sizes of these networks vary between 350 and 4000.\n\nTable 3: Clustering error of 6 different methods on a collection of real-world networks consisting of co-authorship networks from Citeseer, the High Energy Physics (HEP-Th) corpus of arXiv, NIPS and ego networks from Facebook.\n\nMethod | NCLM (J = 8) | GK3 | GK4 | GK5 | GraphStats (J = 8) | TopEig (J = 30)\nError | 0.1 | 0.6 | 0.6 | 0.6 | 0.16 | 0.32\nTime (s) | 2.7 | 45 | 50 | 60 | 765 | 14\n\nTable 3 summarizes the performance of the different algorithms and their running time to compute distances between the graphs. We use the different sources of networks as labels, i.e. HEP-Th will be one cluster, etc. We explore different choices of J, and see that the best performance is from NCLM, with J = 8, followed closely by GraphStats. TopEig (J in this case is where the eigenspectra of the larger networks have a knee) and the graph kernels do not perform very well. GraphStats takes 765 seconds to complete, whereas NCLM finishes in 2.7 seconds. 
This is because the networks are large but extremely sparse, and so calculation of matrix powers is comparatively cheap.\n\n2https://snap.stanford.edu/data/egonets-Facebook.html\n\n8\n\n\fFigure 2: Kernel matrix for NCLM on 49 real networks.\n\nIn Figure 2, we plot the kernel similarity matrix obtained using NCLM on the real networks (the higher the value, the more similar the points are). The first 11 networks are from HEP-Th, whereas the next 11 are from Citeseer. The next 17 are from NIPS and the remaining ones are the ego networks from Facebook. First note that {HEP-Th, Citeseer}, NIPS and Facebook are well separated. However, HEP-Th and Citeseer are hard to separate out. This is also verified by the bad performance of TopEig in separating out the first two (shown in Section 5). However, in Figure 2, we can see that the Citeseer networks are different from HEP-Th in the sense that they are not as strongly connected inside as HEP-Th.\n\n6 Discussion\n\nWe consider the problem of clustering network-valued data for two settings, both of which are prevalent in practice. In the first setting, different network objects have node correspondence. This includes clustering brain networks obtained from fMRI data, where each node corresponds to a specific region in the brain, or co-authorship networks between a set of authors where the connections vary from one year to another. In the second setting, node correspondence is not present, e.g., when one wishes to compare different types of networks: co-authorship networks, Facebook ego networks, etc. One may be interested in seeing if co-authorship networks are more "similar" to each other than ego or friendship networks.\n\nWe present two algorithms for these two settings based on a simple general theme: summarize a network into a possibly high-dimensional feature vector and then cluster these feature vectors. 
In the first setting, we propose NCGE, where each network is represented by its graphon estimate. A variety of graphon estimation algorithms can be used for this purpose. We show that if the graphon estimation is consistent, then NCGE clusters networks generated from a finite mixture of graphons in a consistent way, provided those graphons are sufficiently different. In the second setting, we propose to represent a network by an easy-to-compute summary statistic, namely the vector of the log-traces of the first few powers of a suitably normalized version of the adjacency matrix. We call this method NCLM, show that the summary statistic concentrates around its expectation, and argue that this expectation should be able to separate networks generated from different models. Using simulated and real data experiments, we show that NCGE is vastly superior to the naive but often-used method of comparing adjacency matrices directly, and that NCLM outperforms most computationally expensive alternatives for differentiating networks without node correspondence. In conclusion, we believe that these methods will provide practitioners with a powerful and computationally tractable tool for comparing network-structured data in a range of disciplines.

Acknowledgments
We thank Professor Peter J. Bickel for helpful discussions. SSM was partially supported by NSF-FRG grant DMS-1160319 and a Loève Fellowship. PS was partially supported by NSF grant DMS 1713082. LL was partially supported by NSF grants IIS 1663870, DMS 1654579 and a DARPA grant N-66001-17-1-4041.

References
[1] Supplement to "On clustering network-valued data". 2017.
[2] David J. Aldous. Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis, 11(4):581–598, 1981.
[3] S. Aliakbary et al. Learning an integrated distance metric for comparing structure of complex networks.
Chaos, 25(2):177–214, 2015.
[4] Arash A. Amini, Aiyou Chen, Peter J. Bickel, and Elizaveta Levina. Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics, 41(4):2097–2122, 2013.
[5] Arash A. Amini and Elizaveta Levina. On semidefinite relaxations for the block model. arXiv preprint arXiv:1406.5647, 2014.
[6] Brian Ball, Brian Karrer, and M. E. J. Newman. Efficient and principled method for detecting communities in networks. Physical Review E, 84(3):036103, 2011.
[7] Peter Bickel, David Choi, Xiangyu Chang, and Hai Zhang. Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. The Annals of Statistics, 41(4):1922–1943, 2013.
[8] Peter J. Bickel and Aiyou Chen. A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences of the United States of America, 106(50):21068–21073, 2009.
[9] Sourav Chatterjee. Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177–214, 2015.
[10] Hao Chen and Nancy Zhang. Graph-based change-point detection. The Annals of Statistics, 43(1):139–176, 2015.
[11] David S. Choi, Patrick J. Wolfe, and Edoardo M. Airoldi. Stochastic blockmodels with a growing number of classes. Biometrika, 99(2):273–284, 2012.
[12] Paul Erdős and Alfréd Rényi. On random graphs I. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.
[13] Chao Gao, Yu Lu, and Harrison H. Zhou. Rate-optimal graphon estimation. The Annals of Statistics, 43(6):2624–2652, 2015.
[14] C. E. Ginestet, P. Balachandran, S. Rosenberg, and E. D. Kolaczyk. Hypothesis testing for network data in functional neuroimaging. ArXiv e-prints, July 2014.
[15] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps.
Social Networks, 5(2):109–137, 1983.
[16] D. N. Hoover. Relations on probability spaces and arrays of random variables. Technical report, Institute for Advanced Study, Princeton, 1979.
[17] Brian Karrer and M. E. J. Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83:016107, 2011.
[18] Jing Lei and Alessandro Rinaldo. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215–237, 2015.
[19] Priya Mahadevan, Dmitri Krioukov, Kevin Fall, and Amin Vahdat. Systematic topology analysis and generation using degree correlations. SIGCOMM Comput. Commun. Rev., 36(4):135–146, 2006.
[20] Pierre-André G. Maugis, Sofia C. Olhede, and Patrick J. Wolfe. Topology reveals universal features for network comparison. arXiv preprint arXiv:1705.05677, 2017.
[21] M. E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006.
[22] M. E. J. Newman. Assortative mixing in networks. Physical Review Letters, 89(20):208701, 2002.
[23] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003.
[24] Sofia C. Olhede and Patrick J. Wolfe. Network histograms and universality of blockmodel approximation. Proceedings of the National Academy of Sciences of the United States of America, 111(41):14722–14727, 2014.
[25] Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic block model. The Annals of Statistics, 39:1878–1915, 2011.
[26] N. Shervashidze, S. V. N. Vishwanathan, T. Petri, K. Mehlhorn, and K. M. Borgwardt. Efficient graphlet kernels for large graph comparison. In JMLR Workshop and Conference Proceedings Volume 5: AISTATS 2009, pages 488–495, Cambridge, MA, USA, 2009. MIT Press.
[27] S. V. N.
Vishwanathan, Nicol N. Schraudolph, Risi Kondor, and Karsten M. Borgwardt. Graph kernels. Journal of Machine Learning Research, 11:1201–1242, 2010.
[28] P. J. Wolfe and S. C. Olhede. Nonparametric graphon estimation. ArXiv e-prints, September 2013.
[29] Y. Zhang, E. Levina, and J. Zhu. Estimating network edge probabilities by neighborhood smoothing. ArXiv e-prints, September 2015.
[30] Shi Zhou and Raul J. Mondragón. The rich club phenomenon in the internet topology. IEEE Communications Letters, 8(3):180–182, 2004.