{"title": "Community Detection on Evolving Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 3522, "page_last": 3530, "abstract": "Clustering is a fundamental step in many information-retrieval and data-mining applications. Detecting clusters in graphs is also a key tool for finding the community structure in social and behavioral networks. In many of these applications, the input graph evolves over time in a continual and decentralized manner, and, to maintain a good clustering, the clustering algorithm needs to repeatedly probe the graph. Furthermore, there are often limitations on the frequency of such probes, either imposed explicitly by the online platform (e.g., in the case of crawling proprietary social networks like twitter) or implicitly because of resource limitations (e.g., in the case of crawling the web). In this paper, we study a model of clustering on evolving graphs that captures this aspect of the problem. Our model is based on the classical stochastic block model, which has been used to assess rigorously the quality of various static clustering methods. In our model, the algorithm is supposed to reconstruct the planted clustering, given the ability to query for small pieces of local information about the graph, at a limited rate. We design and analyze clustering algorithms that work in this model, and show asymptotically tight upper and lower bounds on their accuracy. 
Finally, we perform simulations, which demonstrate that our main asymptotic results hold true also in practice.", "full_text": "Community Detection on Evolving Graphs\n\nAris Anagnostopoulos\n\nSapienza University of Rome\n\naris@dis.uniroma1.it\n\nJakub \u0141\u0105cki\n\nSapienza University of Rome\n\nj.lacki@mimuw.edu.pl\n\nSilvio Lattanzi\n\nGoogle\n\nsilviol@google.com\n\nStefano Leonardi\n\nSapienza University of Rome\n\nleonardi@dis.uniroma1.it\n\nMohammad Mahdian\n\nGoogle\n\nmahdian@google.com\n\nAbstract\n\nClustering is a fundamental step in many information-retrieval and data-mining\napplications. Detecting clusters in graphs is also a key tool for finding the community\nstructure in social and behavioral networks. In many of these applications, the input\ngraph evolves over time in a continual and decentralized manner, and, to maintain\na good clustering, the clustering algorithm needs to repeatedly probe the graph.\nFurthermore, there are often limitations on the frequency of such probes, either\nimposed explicitly by the online platform (e.g., in the case of crawling proprietary\nsocial networks like twitter) or implicitly because of resource limitations (e.g., in\nthe case of crawling the web).\nIn this paper, we study a model of clustering on evolving graphs that captures\nthis aspect of the problem. Our model is based on the classical stochastic block\nmodel, which has been used to assess rigorously the quality of various static\nclustering methods. In our model, the algorithm is supposed to reconstruct the\nplanted clustering, given the ability to query for small pieces of local information\nabout the graph, at a limited rate. We design and analyze clustering algorithms\nthat work in this model, and show asymptotically tight upper and lower bounds on\ntheir accuracy. 
Finally, we perform simulations, which demonstrate that our main\nasymptotic results hold true also in practice.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n1 Introduction\n\nThis work studies the problem of detecting the community structure of a dynamic network according\nto the framework of evolving graphs [3]. In this model the underlying graph evolves over time,\nsubject to a probabilistic process that modifies the vertices and the edges of the graph. The algorithm\ncan learn the changes that take place in the network only by probing the graph at a limited rate. The\nmain question for the evolving graph model is to design strategies for probing the graph, so as to\nobtain information that is sufficient to maintain a solution that is competitive with a solution that can\nbe computed if the entire underlying graph is known.\nThe motivation for studying this model comes from the inadequacy of the classical computational\nparadigm, which assumes perfect knowledge of the input data and an algorithm that terminates. The\nevolving graph model captures the evolving and decentralized nature of large-scale online social\nnetworks. An important part of the model is that only a limited number of probes can be made at\neach time step. This assumption is motivated by the limitations imposed by many social network\nplatforms such as Twitter or Facebook, where the network is constantly evolving and the access to the\nstructure is possible through an API that implements a rate-limited oracle. Even in cases where such\nrate-limits are not exogenously imposed (e.g., when the network under consideration is the Web),\nresource constraints often prohibit us from making too many probes in each time step (probing\nlarge graphs stored across many machines is a costly operation). The evolving graph model has been\nconsidered for PageRank computation [4] and connectivity problems [3]. 
This work is the first to\naddress the problem of community detection in the evolving graph model.\nOur probabilistic model of the evolution of the community structure of a network is based on the\nstochastic block model (SBM) [1, 2, 5, 10]. It is a widely accepted model of probabilistic networks\nfor the study of community-detection methods, which generates graphs with an embodied community\nstructure. In the basic form of the model, vertices of a graph are first partitioned into k disjoint\ncommunities in a probabilistic manner. Then, two nodes of the same community are linked with\nprobability p, and two nodes of distinct communities are linked with probability q, where p > q. All\nthe connections are mutually independent.\nWe make a first step in the study of community detection in the evolving-graph model by considering\nan evolving stochastic block model, which allows nodes to change their communities according to a\ngiven stochastic process.\n\n1.1 Our Contributions\n\nOur first step is to define a meaningful model for community detection on evolving graphs. We do\nthis by extending the stochastic block model to the evolving setting. The evolving stochastic block\nmodel generates an n-node graph, whose nodes are partitioned into k communities. At each time\nstep, some nodes may change their communities in a random fashion. Namely, with probability 1/n\neach node is reassigned; when this happens it is moved to the ith community Ci (which we also\ncall a cluster Ci) with probability \u03b1i, where \u03b11, . . . , \u03b1k form a probability distribution. After being\nreassigned, the neighborhood of the node is updated accordingly.\nWhile these changes are being performed, we are unaware of them. 
Yet, at each step, we have a\nbudget of \u03b2 queries that we can perform on the graph (later we will specify values for \u03b2 that allow\nus to obtain meaningful results: a value of \u03b2 that is too small may not allow the algorithm to catch\nup with the changes; a value that is too large makes the problem trivial and unrealistic). A query of\nthe algorithm consists in choosing a single node. The result of the query is the list of neighbors of\nthe chosen node at the moment of the query. Our goal is to design an algorithm that is able to issue\nqueries over time in such a way that at each step it may report a partitioning {\u01081, . . . , \u0108k} that is\nas close as possible to the real one {C1, . . . , Ck}. A difficulty of the evolving-graph model is that,\nbecause we observe the process for an infinite amount of time, even events with negligible probability\nwill take place. Thus we should design algorithms that are able to provide guarantees for most of the\ntime and recover even after highly unlikely events take place.\nLet us now present our results at a high level. For simplicity of the description, let us assume that\np = 1, q = 0 and that the query model is slightly different, namely the algorithm can discover the\nentire contents of a given cluster with one query.1\nWe first study algorithms that at each step pick the cluster to query independently at random from\nsome predefined distribution. One natural idea is to pick a cluster proportionally to its size (which\nis essentially the same as querying the cluster of a node chosen uniformly at random). However,\nwe show that a better strategy is to query a cluster proportionally to the square root of its size.\nWhile the two strategies are equivalent if the cluster probabilities \u03b11, . . . , \u03b1k are uniform, the latter\nbecomes better in the case of skewed distributions. 
For example, if we have n^{1/3} clusters, and\nthe associated probabilities are \u03b1i \u223c 1/i^2, the first strategy incorrectly classifies O(n^{1/3}) nodes in\neach step (in expectation), compared to only O(log^2 n) nodes misclassified by the second strategy.\nFurthermore, our experimental analysis suggests that the strategy of probing a cluster with a\nfrequency proportional to the square root of its size is not only efficient in theory, but it may be a\ngood choice for practical application as well.\nWe later improve this result and give an algorithm that uses a mixture of cluster and node queries.\nIn the considered example when \u03b1i \u223c 1/i^2, at each step it reports clusterings with only O(1)\nmisclassified nodes (in expectation). Although the query strategy and the error bound expressed in\nterms of \u03b11, . . . , \u03b1k are both quite complex, we are able to show that the algorithm is optimal, by giving a\nmatching lower bound.\n\n1 In our analysis we show that the assumption about the query model can be dropped at the cost of increasing\nthe number of queries that we perform at each time step by a constant factor.\n\nFinally, we also show how to deal with the case when 1 \u2265 p > q \u2265 0. In this case querying node v\nprovides us with only partial information about its cluster: it is connected to only a subset of the nodes\nin C. In this case we impose some assumptions on p and q, and we provide an algorithm that given\na node can discover the entire contents of its cluster with O(log n/p) node queries. This algorithm\nallows us to extend the previous results to the case when p > q > 0 (and p and q are sufficiently\nfar from each other), at the cost of performing \u03b2 = O(log n/p) queries per step. 
Even though the\nevolving graph model requires the algorithm to issue a low number of queries, our analysis shows\nthat (under reasonable assumptions on p and q) this small number of queries is sufficient to maintain\na high-quality clustering.\nOur theoretical results hold for large enough n. Therefore, we also perform simulations, which\ndemonstrate that our final theoretically optimal algorithm is able to beat the other algorithms even for\nsmall values of n.\n\n2 Related Work\n\nClustering and community-detection techniques have been studied by hundreds of researchers. In\nsocial networks, detecting the clustering structure is a basic primitive for finding communities of\nusers, that is, sets of users sharing similar interests or affiliations [12, 16]. In recommendation\nnetworks cluster discovery is often used to improve the quality of recommendation systems [13].\nOther relevant applications of clustering can be found in image processing, bioinformatics, image\nanalysis and text classification.\nPrior to the evolving model, a number of dynamic computation models have been studied, such as\nonline computation (the input data are revealed step by step), dynamic algorithms and data structures\n(the input data are modified dynamically), and streaming computation (the input data are revealed\nstep by step while the algorithm is space constrained). Hartmann et al. [9] presented a survey of\nresults for clustering dynamic networks in some of the previously mentioned models. However, none\nof the aforementioned models captures the relevant features of the dynamic evolution of large-scale\ndata sets: the data evolves at a slow pace and an algorithm can learn the data changes only by probing\nspecific portions of the graph at some cost.\nThe stochastic block model, used by sociologists [10], has recently received growing attention in\ncomputer science, machine learning, and statistics [1, 2, 5, 6, 17]. 
At the theoretical level, most work\nhas studied the range of parameters for which the communities can be recovered from the generated\ngraph, both in the case of two [1, 7, 11, 14, 15] and in the case of more [2, 5] communities.\nAnother line of research focused on studying different dynamic versions of the stochastic block\nmodel [8, 18, 19, 20]. Yet, there is a lack of theoretical work on modeling and analyzing stochastic\nblock models, and more generally community detection, on evolving graphs. This paper makes the\nfirst step in this direction.\n\n3 Model\n\nIn this paper we analyze an evolving extension of the stochastic block model [10]. We call this new\nmodel the evolving stochastic block model. In this model we consider a graph of n nodes, each of which\nis assigned to one of k clusters, and the probability that two nodes have an edge between them\ndepends on the clusters to which they are assigned. More formally, consider a probability distribution\n\u03b11, . . . , \u03b1k (i.e., \u03b1i > 0 and \u2211_i \u03b1i = 1). Without loss of generality, throughout the paper we assume\n\u03b11 \u2265 . . . \u2265 \u03b1k. Also, for each 1 \u2264 i \u2264 k we assume that \u03b1i < 1 \u2212 \u03b5_\u03b1 for some constant\n0 < \u03b5_\u03b1 < 1.\nAt the beginning, each node independently picks one of the k clusters. The probability that the node\npicks cluster i is \u03b1i. We denote this clustering of the nodes by C. Nodes that pick the same cluster\ni are connected with a fixed probability pi (which may depend on n), whereas pairs of nodes that\npick two different clusters i and j are connected with probability qij (also possibly dependent on n).\nNote that qij = qji and the edges are independent of each other. We denote p := min_{1\u2264i\u2264k} pi and\nq := max_{1\u2264i,j\u2264k} qij.\n\nSo far, our model is very similar to the classic stochastic block model. 
Now we introduce its main\ndistinctive property, namely the evolution dynamics.\nEvolution model: In our analysis, we assume that the graph evolves in discrete time steps indexed\nby natural numbers. The nodes change their cluster in a random manner. At each time step, every\nnode v is reassigned with probability 1/n (independently from other nodes). When this happens, v\nfirst deletes all the edges to its neighbors, then selects a new cluster i with probability \u03b1i and finally\nadds new edges with probability pi to nodes in cluster i and with probability qij to nodes in cluster j,\nfor every j \u2260 i. For 1 \u2264 i \u2264 k and t \u2208 N, we denote by C_i^t the set of nodes assigned to cluster i\njust after the reassignments in time step t. Note that we use Ci to denote the cluster itself, but C_i^t to\ndenote its contents.\nQuery model: We assume that the algorithm may gather information about the clusters by issuing\nqueries. In a single query the algorithm chooses a single node v and learns the list of current neighbors\nof v. In each time step, the graph is probed after all reassignments are made.\nWe study algorithms that learn the cluster structure of the graph. The goal of our algorithm is to\nreport an approximate clustering \u0108 of the graph at the end of each time step, that is close to the true\nclustering C. We define the distance between two clusterings (partitions) C = {C1, C2, . . . , Ck} and\n\u0108 = {\u01081, \u01082, . . . , \u0108k} of the nodes as\n\n$$d(C, \\hat{C}) = \\min_{\\pi} \\sum_{i=1}^{k} |C_i \\triangle \\hat{C}_{\\pi(i)}|,$$\n\nwhere the minimum is taken over all the permutations \u03c0 of {1, . . . , k}, and \u25b3 denotes the symmetric\ndifference between two sets, i.e., A \u25b3 B = (A \\ B) \u222a (B \\ A).2 The distance d(C, \u0108) is called the\nerror of the algorithm (or of the returned clustering).\nFinally, in our analysis we assume that p and q are far apart. More formally, we assume the following.\nAssumption 1. For every i \u2208 [k], and parameters K, \u03bb and \u03bb\u2032 that we fix later, we have: (i)\np \u03b1i > Kq, (ii) p^2 \u03b1i n \u2265 \u03bb log n and (iii) p \u03b1i n \u2265 \u03bb\u2032 log n.\nLet us now discuss the above assumptions. Observe that (iii) follows from (ii). However, we prefer\nto make them separate, as we mostly rely only on (iii). Assumption (iii) is necessary to assure that\nmost of the nodes in the cluster have at least a single edge to another node in the same cluster. In the\nanalysis, we set \u03bb\u2032 to be large enough (yet, still constant), to assure that for every given time t each\nnode has \u2126(log n) edges to nodes of the same cluster, with high probability.\nWe use Assumption 1 in an algorithm that, given a node v, finds all nodes of the cluster of v (correctly\nwith high probability3) and issues only O(log n/p) queries. Our algorithm also uses (ii), which is\nslightly stronger than (iii) (it implies that two nodes from the same cluster have many neighbors in\ncommon), as well as (i), which guarantees that (on average) most neighbors of a node v belong to the\ncluster of v.\nDiscussion: The assumed graph model is relatively simple, certainly not complex enough to claim\nthat it accurately models real-world graphs. Nevertheless, this work is the first attempt to formally\nstudy clustering in dynamic graphs and several simplifying assumptions are necessary to obtain\nprovable guarantees. Even with this basic model, the analysis is rather involved. 
Dealing with difficult\nfeatures of a more advanced model would overshadow our main findings.\nWe believe that if we want to keep the number of queries low, that is, O(log n/p), Assumption 1\ncannot be relaxed considerably, that is, p and q cannot be too close to each other. At the same time,\nrecovery of clusters in the (nonevolving) stochastic block model has also been studied for stricter\nranges of parameters. However, the known algorithms in such settings inspect considerably more\nnodes and require that the cluster probabilities \u03b11, . . . , \u03b1k are close to being uniform [5]. The results\nthat apply to the case with many clusters with nonuniform sizes require that p and q are relatively far\napart. We note that in studying the classic stochastic block model it is a standard assumption to know\np and q, so we also assume it in this work for the sake of simplicity.\n\n2 Note that we can extend this definition to pairs of clusterings with different numbers of clusters just by\nadding empty clusters to the clustering with a smaller number of clusters.\n\n3 We define the term with high probability in Section 4.\n\nOur model assumes that (in expectation) only one node changes its cluster at every time step. However,\nwe believe that the analysis can be extended to the case when c > 1 nodes change their cluster every\nstep (in expectation) at the cost of using c times more queries.\nGeneralizing the results of this paper to more general models is a challenging open problem. Some\ninteresting directions are, for example, using graph models with overlapping communities or\nanalyzing a more general model of moving nodes between clusters.\n\n4 Algorithms and Main Results\n\nIn this section we outline our main results. For simplicity, we omit some technical details, mostly\nconcerning probability. 
In particular, we say that an event happens with high probability, if it happens\nwith probability at least 1 \u2212 1/n^c, for some constant c > 1, but in this section we do not specify how\nthis constant is defined.4\nWe are interested in studying the behavior of the algorithm in an arbitrary time step. We start by\nstating a lemma showing that to obtain an algorithm that can run indefinitely, it suffices to\ndesign an algorithm that uses \u03b2 queries per step, initializes in O(n log n) steps and works with\nhigh probability for n^2 steps.\nLemma 1. Assume that there exists an algorithm for clustering evolving graphs that issues \u03b2 queries\nper step and that at each time step t such that t = \u2126(n log n) and t \u2264 n^2 it reports a clustering of\nexpected error E correctly with high probability.\nThen, there exists an algorithm for clustering evolving graphs that issues 2\u03b2 queries per step and at\neach time step t such that t = \u2126(n log n) it reports a clustering of expected error O(E).\n\nTo prove this lemma, we show that it suffices to run a new instance of the assumed algorithm every\nn^2 steps. In this way, when the first instance is no longer guaranteed to work, the second one has\nfinished initializing and can be used to report clusterings.\n\n4.1 Simulating Node Queries\n\nWe now show how to reduce the problem to the setting in which an algorithm can query for the entire\ncontents of a cluster. This is done in two steps. As a first step, we give an algorithm for detecting the\ncluster of a given node v by using only O(log n/p) node queries.\nThis algorithm maintains a score for each node in the graph. Initially, the scores are all equal to 0. The\nalgorithm queries O(log n/p) neighbors of v and adds a score of 1 to every neighbor of a neighbor\nof v. 
We use Assumption 1 to prove that after this step, with high probability there is a gap between\nthe minimum score of a node inside the cluster of v and the maximum score of a node outside it.\nLemma 2. Suppose that Assumption 1 holds. Then, there exists an algorithm that, given a node v,\ncorrectly identifies all nodes in the cluster of v with high probability. It issues O(log n/p) queries.\n\nObserve that Lemma 2 effectively reduces our problem to the case when p = 1 and q = 0: a single\nexecution of the algorithm gives us the entire cluster of a node, just like a single query for this node\nin the case when p = 1 and q = 0.\nIn the second step, we give a data structure that maintains an approximate clustering of the nodes and\ndetects the number of clusters k together with (approximate) cluster probabilities. Internally, it uses\nthe algorithm of Lemma 2.\nLemma 3. Suppose that Assumption 1 holds. Then there exists a data structure that at each time\nstep t = \u2126(n) may answer the following queries:\n\n1. Given a cluster number i, return a node v such that Pr(v \u2208 C_i^t) \u2265 1/2.\n2. Given C_i^t (the contents of cluster Ci), return i.\n3. Return k and a sequence \u03b1\u20321, . . . , \u03b1\u2032k such that for each 1 \u2264 i \u2264 k we have \u03b1i/2 \u2264 \u03b1\u2032i \u2264 3\u03b1i/2.\n\nThe data structure runs correctly for n^2 steps with high probability and issues O(log n/p) queries\nper step.\nFurthermore if p = 1 and q = 0, it makes only 1 query per step.\n\n4 Usually, the constant c can be made arbitrarily large by tuning the constants of Assumption 1.\n\nNote that because the data structure can only use node queries to access the graph, it imposes its own\nnumbering on the clusters that it uses consistently. Let us now describe the high-level idea behind it.\nIn each step the data structure selects a node uniformly at random and discovers its entire cluster using\nthe algorithm of Lemma 2. 
We show that this implies that within any n/16 time steps each cluster\nis queried at least once with high probability. The main challenge lies in refreshing the knowledge\nabout the clusters. The data structure internally maintains a clustering D1, . . . , Dk. However, when\nit queries some cluster C, it is not clear which of D1, . . . , Dk the cluster C corresponds to. To deal with\nthis, we show that the number of changes in each cluster within n/16 time steps is so low (again, with\nhigh probability), that there is a single cluster D \u2208 {D1, . . . , Dk}, for which |D \u2229 C| > |C|/2.\nThe data structure of Lemma 3 can be used to simulate cluster queries in the following way.\nAssume we want to discover the contents of cluster i. First, we use the data structure to get a node v\nsuch that Pr(v \u2208 C_i^t) \u2265 1/2. Then, we can use the algorithm of Lemma 2 to get the entire cluster C\u2032 of\nnode v. Finally, we may use the data structure again to verify whether C\u2032 is indeed C_i^t. This is the\ncase with probability at least 1/2.\nMoreover, the data structure allows us to assume that the algorithms are initially only given the\nnumber of nodes n and the values of p and q, because the data structure can provide to the algorithms\nboth the number of clusters k and their (approximate) probabilities.\n\n4.2 Clustering Algorithms\n\nUsing the results of Section 4.1, we may now assume that algorithms may query the clusters directly.\nThis allows us to give a simple clustering algorithm. The algorithm first computes a probability\ndistribution \u03c11, . . . , \u03c1k on the clusters, which is a function of the cluster probability distribution\n\u03b11, . . . , \u03b1k. Although the cluster probability distribution is not a part of the input data, we may use an\napproximate distribution \u03b1\u20321, . . . , \u03b1\u2032k given by the data structure of Lemma 3; this increases the error\nof the algorithm only by a constant factor. In each step the algorithm picks a cluster independently at\nrandom from the distribution \u03c11, . . . , \u03c1k and queries it.\nIn order to determine the probability distribution \u03c11, . . . , \u03c1k, we express the upper bound on the error\nin terms of this distribution and then find the sequence \u03c11, . . . , \u03c1k that minimizes this error.\nTheorem 4. Suppose that Assumption 1 holds. Then there exists an algorithm for clustering evolving\ngraphs that issues O(log n/p) queries per step and that for each time step t = \u2126(n) reports a\nclustering of expected error $O\\left(\\left(\\sum_{i=1}^{k} \\sqrt{\\alpha_i}\\right)^2\\right)$. Furthermore if p = 1 and q = 0, it issues only\nO(1) queries per step.\n\nThe clusterings given by this algorithm already have low error, but still we are able to give a better\nresult. Whenever the algorithm of Theorem 4 queries some cluster Ci, it finds the correct cluster\nassignment for all nodes that have been reassigned to Ci since it was last queried. These nodes\nare immediately assigned to the right cluster. However, by querying Ci the algorithm also discovers\nwhich nodes have been recently reassigned from Ci (they used to be in Ci when it was last queried,\nbut are not there now). Our improved algorithm maintains a queue of such nodes and in each step\nremoves two nodes from this queue and locates them. In order to locate a single node v, we first\ndiscover its cluster C(v) (using the algorithm of Lemma 2) and then use the data structure of Lemma 3\nto find the cluster number of C(v). Once we do that, we can assign v to the right cluster immediately.\nThis results in a better bound on the error.\nTheorem 5. 
Assume that \u03b11 \u2265 . . . \u2265 \u03b1k. Suppose that Assumption 1 holds. Then there exists an\nalgorithm for clustering evolving graphs that issues O(log n/p) queries per step and that for each\n\ntime step t = \u2126(n log n) reports a clustering of expected error O\n\n1\u2264i\u2264k\n\n\u03b1i\n\ni