{"title": "Bayesian Hierarchical Community Discovery", "book": "Advances in Neural Information Processing Systems", "page_first": 1601, "page_last": 1609, "abstract": "We propose an efficient Bayesian nonparametric model for discovering hierarchical community structure in social networks. Our model is a tree-structured mixture of potentially exponentially many stochastic blockmodels. We describe a family of greedy agglomerative model selection algorithms whose worst case scales quadratically in the number of vertices of the network, but independent of the number of communities. Our algorithms are two orders of magnitude faster than the infinite relational model, achieving comparable or better accuracy.", "full_text": "Bayesian Hierarchical Community Discovery\n\nCharles Blundell\u2217\nDeepMind Technologies\ncharles@deepmind.com\n\nYee Whye Teh\nDepartment of Statistics, University of Oxford\ny.w.teh@stats.ox.ac.uk\n\nAbstract\n\nWe propose an efficient Bayesian nonparametric model for discovering hierarchical community structure in social networks. Our model is a tree-structured mixture of potentially exponentially many stochastic blockmodels. We describe a family of greedy agglomerative model selection algorithms that take just one pass through the data to learn a fully probabilistic, hierarchical community model. In the worst case, our algorithms scale quadratically in the number of vertices of the network, but independently of the number of nested communities. In practice, our algorithms run two orders of magnitude faster than the Infinite Relational Model, while achieving comparable or better accuracy.\n\n1 Introduction\n\nPeople often organise themselves into groups or communities. For example, friends may form cliques, scientists may have recurring collaborations, and politicians may form factions. Consequently the structure found in social networks is often studied by inferring these groups. 
Using community membership one may then make predictions about the presence or absence of unobserved connectivity in the social network. Sometimes these communities possess hierarchical structure. For example, within science, the community of physicists may be split into those working on various branches of physics, and each branch refined repeatedly until finally reaching the particular specialisation of an individual physicist.\nMuch previous work on social networks has focused on discovering flat community structure. The stochastic blockmodel [1] places each individual in a community according to the block structure of the social network\u2019s adjacency matrix, whilst the mixed membership stochastic blockmodel [2] extends the stochastic blockmodel to allow individuals to belong to several flat communities simultaneously. Both models are parametric methods and require the number of flat communities to be known.\nGreedy hierarchical clustering has previously been applied directly to discovering hierarchical community structure [3]. These methods do not require the community structure to be flat or the number of communities to be known. Such schemes are often computationally efficient, scaling quadratically in the number of individuals for a dense network, or linearly in the number of edges for a sparse network [4]. However, these methods do not yield a full probabilistic account of the data, in terms of parameters and the discovered structure.\nSeveral authors have also proposed Bayesian approaches to inferring community structure. The Infinite Relational Model (IRM; [5, 6, 7]) infers a flat community structure. The IRM has been extended to infer hierarchies [8] by augmenting it with a tree, but this comes at considerable computational cost. [9] and [10] propose methods limited to hierarchies of depth two, whilst [11] propose methods limited to hierarchies of known depth. 
The Mondrian process [12] provides a flexible prior on trees and a likelihood model for relational data. Current Bayesian nonparametric methods do not scale well to larger networks because the inference algorithms used make many small changes to the model. Such schemes can take a large number of iterations to converge on an adequate solution whilst each iteration often scales unfavourably in the number of communities or vertices.\nWe shall describe a greedy Bayesian hierarchical clustering method for discovering community structure in social networks. Our work builds upon Bayesian approaches to greedy hierarchical clustering [13, 14], extending these approaches to relational data. We call our method Bayesian Hierarchical Community Discovery (BHCD). BHCD produces good results two orders of magnitude faster than a single iteration of the IRM, and its worst case run-time is quadratic in the number of vertices of the graph and independent of the number of communities.\nThe remainder of the paper is organised as follows. Section 2 describes the stochastic blockmodel. In Section 3 we introduce our model as a hierarchical mixture of stochastic blockmodels. In Section 4 we describe an efficient scheme for inferring hierarchical community structure with our model. Section 5 demonstrates BHCD on several data sets. We conclude with a brief discussion in Section 6.\n\n\u2217Part of the work was done whilst at the Gatsby Unit, University College London.\n\n2 Stochastic Blockmodels\n\nA stochastic blockmodel [1] consists of a partition, \u03c6, of vertices V and, for each pair of clusters p and q in \u03c6, a parameter, \u03b8pq, giving the probability that an edge is present between a node of cluster p and a node of cluster q. Suppose V = {a, b, c, d}, then one way to partition the vertices would be to form clusters ab, c and d, which we shall write as \u03c6 = ab|c|d, where | denotes a split between clusters. 
The probability of an adjacency matrix, D, given a stochastic blockmodel, is as follows:\n\n$$P(D \\mid \\phi, \\{\\theta_{pq}\\}_{p,q \\in \\phi}) = \\prod_{p,q \\in \\phi} \\theta_{pq}^{n^1_{pq}} (1 - \\theta_{pq})^{n^0_{pq}} \\qquad (1)$$\n\nwhere n^1_pq is the total number of edges in D between the vertices in clusters p and q, and n^0_pq is the total number of observed absent edges in D between the vertices in clusters p and q.\nWhen modelling communities, we expect the edge appearance probabilities within a cluster to be different to those between different clusters. Hence we place a different prior on each of these cases. Similar approaches have been taken to adapt the IRM to community detection [7], where non-conjugate priors were used at increased computational cost. In the interest of computational efficiency, our model will instead use conjugate priors but with differing hyperparameters. \u03b8pp will have a Beta(\u03b1, \u03b2) prior and \u03b8pq, p \u2260 q, will have a Beta(\u03b4, \u03bb) prior. The hyperparameters are picked such that \u03b1 > \u03b2 and \u03b4 < \u03bb, which correspond to a prior belief of a higher density of edges within a community than across communities. Integrating out the edge appearance parameters, we obtain the following likelihood of a particular partition \u03c6:\n\n$$P(D \\mid \\phi) = \\prod_{p \\in \\phi} \\frac{B(\\alpha + n^1_{pp}, \\beta + n^0_{pp})}{B(\\alpha, \\beta)} \\prod_{p,q \\in \\phi,\\, p \\neq q} \\frac{B(\\delta + n^1_{pq}, \\lambda + n^0_{pq})}{B(\\delta, \\lambda)} \\qquad (2)$$\n\nwhere B(\u00b7,\u00b7) is the Beta function. We may generalise this to use any exponential family:\n\n$$p(D \\mid \\phi) = \\prod_{p \\in \\phi} f(\\sigma_{pp}) \\prod_{p,q \\in \\phi,\\, p \\neq q} g(\\sigma_{pq}) \\qquad (3)$$\n\nwhere f(\u00b7) is the marginal likelihood of the on-diagonal blocks, and g(\u00b7) is the marginal likelihood of the off-diagonal blocks. 
We use \u03c3pq to denote the sufficient statistics from a (p, q)-block of the adjacency matrix: all of the elements whose row indices are in cluster p and whose column indices are in cluster q. For the remainder of the paper, we shall focus on the beta-Bernoulli case given in (2) for concreteness, i.e., \u03c3pq = (n^1_pq, n^0_pq), with f(x, y) = B(\u03b1 + x, \u03b2 + y) / B(\u03b1, \u03b2) and g(x, y) = B(\u03b4 + x, \u03bb + y) / B(\u03b4, \u03bb). For clarity of exposition, we shall focus on modelling undirected or symmetric networks with no self-edges, so \u03c3pq = \u03c3qp and \u03c3{x}{x} = 0 for each vertex x, but in general this restriction is not necessary.\nOne approach to inferring \u03c6 is to fix the number of communities K then use maximum likelihood estimation or Bayesian inference to assign vertices to each of the communities [1, 15]. Another approach is to use variational Bayes, combined with an upper bound on the number of communities, to determine the number of communities and community assignments [16].\n\nFigure 1: Hierarchical decomposition of the adjacency matrix into tree-consistent partitions. Black squares indicate edge presence, white squares indicate edge absence, grey squares are unobserved.\n\n3 Bayesian Hierarchical Communities\n\nIn this section we shall develop a Bayesian nonparametric approach to community discovery. Our model organises the communities into a nested hierarchy T, with all vertices in one community at the root and singleton vertices at the leaves. Each vertex belongs to all communities along the path from the root to the leaf containing it. 
We describe the probabilistic model relating the hierarchy of communities to the observed network connectivity data here, whilst in the next section we will develop a greedy model selection procedure for learning the hierarchy T from data.\nWe begin with the marginal probability of the adjacency matrix D under a stochastic blockmodel:\n\n$$p(D) = \\sum_{\\phi} p(\\phi)\\, p(D \\mid \\phi) \\qquad (4)$$\n\nIf the Chinese restaurant process (CRP) is used as the prior on partitions p(\u03c6), then (4) corresponds to the marginal likelihood of the IRM. Computing (4) typically requires an approximation: the space of partitions \u03c6 is large and so the number of partitions in the above sum grows at least exponentially in the number of vertices.\nWe shall take a different approach: we use a tree to define a prior on partitions, where only partitions that are consistent with the tree are included in the sum. This allows us to evaluate (4) exactly. The tree will represent the hierarchical community structure discovered in the network. Each internal node of the tree corresponds to a community and the leaves of the tree are the vertices of the adjacency matrix, D. Figure 1 shows how a tree defines a collection of partitions for inclusion in the sum in (4). The adjacency matrix on the left is explained by our model, conditioned upon the tree on the upper left, by its five tree-consistent partitions. Various blocks within the adjacency matrix are explained either by the on-diagonal model f or the off-diagonal model g, according to each partition. Note that the block structure of the off-diagonal model depends on the structure of the tree T, not just on the partition \u03c6. The model always includes the trivial partition of all vertices in a single community and also the singleton partition of all vertices in separate communities.\nMore precisely, we shall denote trees as a nested collection of sets of vertices. 
For example, the tree in Figure 1 is T = {{a, b},{c, d, e}, f}. The set of partitions consistent with a tree T may be expressed formally as in [14]:\n\n$$\\Phi(T) = \\{\\mathrm{leaves}(T)\\} \\cup \\{\\phi_1 | \\ldots | \\phi_{n_T} : \\phi_i \\in \\Phi(T_i),\\ T_i \\in \\mathrm{ch}(T)\\} \\qquad (5)$$\n\nwhere leaves(T) are the leaves of the tree T, ch(T) are its children, and so Ti is the ith subtree of tree T. The marginal likelihood of the tree T can be written as:\n\n$$p(D \\mid T) = p(D_{TT} \\mid T) = \\sum_{\\phi} p(\\phi \\mid T)\\, p(D_{TT} \\mid \\phi, T) \\qquad (6)$$\n\nwhere the notation DTT is short for D_{leaves(T),leaves(T)}, the block of D whose rows and columns correspond to the leaves of T.\nOur prior on partitions p(\u03c6|T) is motivated by the following generative process: Begin at the root of the tree, S = T. With probability \u03c0S, stop and generate DSS according to the on-diagonal model f. Otherwise, with probability 1 \u2212 \u03c0S, generate all inter-cluster edges between the children of the current node according to g, and recurse on each child of the current tree S. The resulting prior on tree-consistent partitions p(\u03c6|T) factorises as:\n\n$$p(\\phi \\mid T) = \\prod_{S \\in \\mathrm{ancestor}_T(\\phi)} (1 - \\pi_S) \\prod_{S \\in \\mathrm{subtree}_T(\\phi)} \\pi_S \\qquad (7)$$\n\nwhere subtreeT(\u03c6) are the subtrees in T corresponding to the clusters in partition \u03c6 and ancestorT(\u03c6) are the ancestors of trees in subtreeT(\u03c6). The prior probability of partitions not consistent with T is zero. Following [14], we define \u03c0S = 1 \u2212 (1 \u2212 \u03b3)^|ch(S)|, where \u03b3 is a parameter of the model. This choice of \u03c0S gives higher likelihood to non-binary trees over cascading binary trees when the data has no hierarchical structure [14]. 
Similarly, the likelihood of each partition p(D|\u03c6, T) factorises as:\n\n$$p(D_{TT} \\mid \\phi, T) = \\prod_{S \\in \\mathrm{ancestor}_T(\\phi)} g\\left(\\sigma^{\\neg\\mathrm{ch}}_{SS}\\right) \\prod_{S \\in \\mathrm{subtree}_T(\\phi)} f(\\sigma_{SS}) \\qquad (8)$$\n\nwhere \u03c3SS are the sufficient statistics of the adjacency matrix D among the leaves of tree S, and \u03c3^\u00acch_SS are the sufficient statistics of the edges between different children of S:\n\n$$\\sigma^{\\neg\\mathrm{ch}}_{SS} = \\sigma_{SS} - \\sum_{C \\in \\mathrm{ch}(S)} \\sigma_{CC} \\qquad (9)$$\n\nThe set of tree-consistent partitions given in (5) has at most O(2^n) partitions, for n vertices. However, due to the structure of the prior on partitions (7) and the block model (8), the marginal likelihood (6) may be calculated by dynamic programming, in O(n + m) time where n is the number of vertices and m is the number of edges. Combining (7) and (8) and expanding (6) by breadth-first traversal of the tree yields the following recursion for the marginal likelihood of the generative process given at the beginning of the section:\n\n$$p(D_{TT} \\mid T) = \\pi_T f(\\sigma_{TT}) + (1 - \\pi_T)\\, g\\left(\\sigma^{\\neg\\mathrm{ch}}_{TT}\\right) \\prod_{C \\in \\mathrm{ch}(T)} p(D_{CC} \\mid C) \\qquad (10)$$\n\n4 Agglomerative Model Selection\n\nIn this section we describe how to learn the hierarchy of communities T. The problem is treated as one of greedy model selection: each tree T is a different model, and we wish to find the model that best explains the data. The tree is built in a bottom-up greedy agglomerative fashion, starting from a forest consisting of n trivial trees, each corresponding to exactly one vertex. Each iteration then merges two of the trees in the forest. At each iteration, each vertex in the network is a leaf of exactly one tree in the forest. The algorithm finishes when just one tree remains. 
We define the likelihood of the forest F as the probability of data described by each tree in the forest times that for the data corresponding to edges between different trees:\n\n$$p(D \\mid F) = g\\left(\\sigma^{\\neg\\mathrm{ch}}_{FF}\\right) \\prod_{T \\in F} p(D_{TT} \\mid T) \\qquad (11)$$\n\nwhere \u03c3^\u00acch_FF are the sufficient statistics of the edges between different trees in the forest.\nThe initial forest, F^(0), consists of a singleton tree for each vertex of the network. At each iteration a pair of trees in the forest F is chosen to be merged, resulting in forest F*. Which pair of trees to merge, and how to merge them, is determined by considering which pair and type of merger yields the largest Bayes factor improvement over the current model. If the trees I and J are merged to form the tree M, then the Bayes factor score is:\n\n$$\\mathrm{SCORE}(M; I, J) = \\frac{p(D_{MM} \\mid F^{\\star})}{p(D_{MM} \\mid F)} = \\frac{p(D_{MM} \\mid M)}{p(D_{II} \\mid I)\\, p(D_{JJ} \\mid J)\\, g(\\sigma_{IJ})} \\qquad (12)$$\n\nwhere p(DMM|M), p(DII|I) and p(DJJ|J) are given by (10) and \u03c3IJ are the sufficient statistics of the edges connecting leaves(I) and leaves(J). Note that the Bayes factor score is based on data local to the merge, i.e., it considers the probability of the connectivity data only among the leaves of the newly merged tree. This permits efficient local computations and makes the assumption that local community structure should depend only on the local connectivity structure.\nWe consider three possible mergers of two trees I and J into M. See Figure 2, where for concreteness we take I = {Ta, Tb, Tc} and J = {Td, Te} where Ta, Tb, Tc, Td, Te are subtrees. M may be formed by joining I and J together using a new node, giving M = {I, J}. Alternatively M may be formed by absorbing J as a child of I, yielding M = {J} \u222a ch(I), or vice versa, M = {I} \u222a ch(J).\n\nFigure 2: Different merge operations.\n\nAlgorithm 1: Bayesian hierarchical community discovery.\nInitialise F, {pI, \u03c3^\u00acch_II}_{I\u2208F}, {\u03c3IJ}_{I,J\u2208F}.\nfor each unique pair I, J \u2208 F do\n    Let M := MERGE(I; J), compute pM and SCORE(M; I, J), and add M to the heap.\nend for\nwhile heap is not empty do\n    Pop I = MERGE(X; Y) off the top of heap.\n    if X \u2208 F and Y \u2208 F then\n        F \u2190 (F \\ {X, Y}) \u222a {I}.\n        for each tree J \u2208 F \\ {I} do\n            Compute \u03c3IJ, \u03c3MM, and \u03c3^\u00acch_MM using (13).\n            Let M := MERGE(I; J), compute pM and SCORE(M; I, J), and add M to heap.\n        end for\n    end if\nend while\nreturn the only tree in F\n\nThe algorithm for finding T is described in Algorithm 1. The algorithm maintains a forest F of trees, the likelihood pI = p(DII|I) of each tree I \u2208 F according to (10), and the sufficient statistics {\u03c3^\u00acch_II}_{I\u2208F}, {\u03c3IJ}_{I,J\u2208F} needed for efficient computation. It also maintains a heap of potential merges ordered by the SCORE (12), used for determining the ordering of merges. At each iteration, the best potential merge, say of trees X and Y resulting in tree I, is picked off the heap. If either X or Y is not in F, this means that the tree has been used in a previous merge, so the potential merge is discarded and the next potential merge is considered. After a successful merge, the sufficient statistics associated with the new tree I are computed using the previously computed ones:\n\n$$\\sigma_{IJ} = \\sigma_{XJ} + \\sigma_{YJ} \\quad \\text{for } J \\in F,\\ J \\neq I, \\qquad \\sigma_{MM} = \\sigma_{II} + \\sigma_{JJ} + \\sigma_{IJ},$$\n\n$$\\sigma^{\\neg\\mathrm{ch}}_{MM} = \\begin{cases} \\sigma_{IJ} & \\text{if } M \\text{ is formed by joining } I \\text{ and } J, \\\\\\\\ \\sigma^{\\neg\\mathrm{ch}}_{II} + \\sigma_{IJ} & \\text{if } M \\text{ is formed by } J \\text{ absorbed into } I, \\\\\\\\ \\sigma^{\\neg\\mathrm{ch}}_{JJ} + \\sigma_{IJ} & \\text{if } M \\text{ is formed by } I \\text{ absorbed into } J. \\end{cases} \\qquad (13)$$\n\nThese sufficient statistics are computed based on previous cached values, allowing each inner loop of the algorithm to take O(1) time. Finally, potential mergers of I with other trees J in the forest are considered and added onto the heap. In the algorithm, MERGE(I; J) denotes the best of the three possible merges of I and J.\nAlgorithm 1 is structurally the same as that in [14], and so has time complexity O(n^2 log n). The difference is that additional care is needed to cache the sufficient statistics, allowing for O(1) computation per inner loop of the algorithm. We shall refer to Algorithm 1 as BHCD.\n\n4.1 Variations\n\nBHCD will consider merging trees that have no edges between them if the merge score (12) is high enough. This does not seem to be reasonable behaviour, as communities that are completely disconnected should not be merged. We can alter BHCD by simply prohibiting such merges between trees that have no edges between them. The resulting algorithm we call BHCD sparse, as it only needs to perform computations on the parts of the network with edges present. Empirically, we have found that it produces better results than BHCD and runs faster for sparse networks, although in the worst case it has the same time complexity O(n^2 log n) as BHCD.\nAs BHCD runs, several merges can have the same score. 
In particular, at the first iteration all pairs of vertices connected by an edge have the same score. In such situations, we break the ties at random. Different tie breaks can yield different results and so different runs on the same data may yield different trees. Where we want a single tree, we use R restarts (R = 50 in experiments) and pick the tree with the highest likelihood according to (10). Where we wish to make predictions, we will construct predictive probabilities (see next section) by averaging all R trees.\n\n4.2 Predictions\n\nFor link prediction, we wish to obtain the predictive distribution of a previously unobserved element of the adjacency matrix. This is easily achieved by traversing one path of the tree from the root towards the leaves, hence the computational complexity is linear in the depth of the tree. In particular, suppose we wish to predict the edge between x and y, Dxy, given the observed edges D, then the predictive distribution can be computed recursively starting with S = T:\n\n$$p(D_{xy} \\mid D_{SS}, S) = r_S f(D_{xy} \\mid \\sigma_{SS}) + (1 - r_S) \\begin{cases} p(D_{xy} \\mid D_{CC}, C) & \\text{if } \\exists C \\in \\mathrm{ch}(S) : x, y \\in \\mathrm{leaves}(C), \\\\\\\\ g(D_{xy} \\mid \\sigma^{\\neg\\mathrm{ch}}_{SS}) & \\text{if } \\forall C \\in \\mathrm{ch}(S) : x, y \\notin \\mathrm{leaves}(C), \\end{cases} \\qquad r_S = \\frac{\\pi_S f(\\sigma_{SS})}{p(D_{SS} \\mid S)} \\qquad (14)$$\n\nwhere rS is the probability that the cluster consisting of leaves(S) is present if the cluster corresponding to its parent is not present, given the data D and the tree T. The predictive distribution is a mixture of a number of on-diagonal posterior f terms (weighted by the responsibilities rS), and finally an off-diagonal posterior g term. Hence the worst-case computational complexity is \u0398(n).\n\n5 Experiments\n\nWe now demonstrate BHCD on three data sets. Firstly we examine qualitative performance on Sampson\u2019s monastery network. 
Then we demonstrate the speed and accuracy of our method on\na subset of the NIPS 1\u201317 co-authorship network compared to IRM\u2014one of the fastest Bayesian\nnonparametric models for these data. Finally we show hierarchical structure found examining the\nfull NIPS 1\u201317 co-authorship network. In our experiments we set the model hyperparameters \u03b1 =\n\u03b4 = 1.0, \u03b2 = \u03bb = 0.2, and \u03b3 = 0.4 which we found to work well. In the \ufb01rst two experiments\nwe shall compare four variations of BHCD: BHCD, BHCD sparse, BHCD restricted to binary trees,\nand BHCD sparse restricted to binary trees. Binary-only variations of BHCD only consider joins,\nnot absorptions, and so may run faster. They also tend to produce better predictive results as they\naverage over a larger number of partitions. But, as we shall see below, the hierarchies found can be\nmore dif\ufb01cult to interpret than the non-binary hierarchies found.\nSampson\u2019s Monastery Network Figure 3 shows the result of running six variants of BHCD on time\nfour of Sampson\u2019s monastery network [17]. Sampson observed the monastery at \ufb01ve times\u2014time\nfour is the most interesting time as it was before four of the monks were expelled. We treated positive\naf\ufb01liations as edges, and negative af\ufb01liations as observed absent edges, and unknown af\ufb01liation as\nmissing data. [17], using a variety of methods, found four \ufb02at groups, shown at the top of Figure 3:\nYoung Turks (Albert, Boniface, Gregory, Hugh, John Bosco, Mark, Winfrid), Loyal Opposition\n(Ambrose, Berthold, Bonaventure, Louis, Peter), Outcasts (Basil, Elias, Simplicius), and Interstitial\ngroup (Amand, Ramuald, Victor).\nAs can be seen in Figure 3, most BHCD variants \ufb01nd clear block diagonal structure in the adjacency\nmatrix. The binary versions \ufb01nd similar structure to the non-binary versions, up to permutations of\nthe children of the non-binary trees. 
BHCD global is led astray by out-of-date scores on its heap and so finds less coherent structure. The log likelihoods of the trees in Figure 3 are \u22126.62 (BHCD) and \u221222.80 (BHCD sparse), whilst the log likelihoods of the binary trees in Figure 3 are \u22128.32 (BHCD binary) and \u221224.68 (BHCD sparse binary). BHCD finds the most likely tree, and rose trees typically better explain the data than binary trees.\nBHCD finds the Young Turks and Loyal Opposition groups and chooses to merge some members of the Interstitial group into the Loyal Opposition and Amand into the Outcasts. Mark, however, is placed in a separate community: although Mark has a positive affiliation with Gregory, Mark also has a negative affiliation with John Bosco and so BHCD elects to create a new community to account for this discrepancy.\n\nTable 1: Time complexities of different methods. n = # vertices, m = # edges, K = # communities, F = # latent factors, I = # iterations per restart, R = # restarts.\nMethod          Time complexity\nIRM (na\u00efve)     O(n^2 K^2 I R)\nIRM (sparse)    O(m K^2 I R)\nLFRM [18]       O(n^2 F^2 I R)\nIMMM [9]        O(n^2 K^2 I R)\nILA [10]        O(n^2 (F + K^2) I R)\n[8]             O(n^2 K^2 I R)\nBHCD            O(n^2 log(n) R)\n\nFigure 3: Sampson\u2019s monastery network. White indicates a positive affiliation, black negative, whilst grey indicates unknown. From top to bottom: Sampson\u2019s clustering, BHCD, BHCD-sparse, BHCD with binary trees, BHCD-sparse-binary.\n\nFigure 4: NIPS-234 comparison using log predictive, accuracy and AUC, averaged across 10 cross-validation folds. Panels: Average Log Predictive, Accuracy, and Area Under the Curve (AUC), each against run time (s).\n\nFigure 5: Clusters of authors found in NIPS 1\u201317. Top 10 most collaborating authors shown for all clusters with more than 15 vertices.\n\nNIPS-234 Next we applied BHCD to a subset of the NIPS co-authorship dataset [19]. We compared its predictive performance to both IRM using MCMC and also inference in the IRM using greedy search, using the publicly available C implementation [20]. Our implementation of BHCD is also in C. As can be seen from Table 1, BHCD has significantly lower computational complexity than other Bayesian nonparametric methods, even those inferring flat hierarchies. This is because it is a simpler model and uses a simpler inference method; thus we do not expect it to yield better predictive results, but instead to get good results quickly. Unlike the other listed methods, BHCD\u2019s worst case complexity does not depend upon the number of communities, and BHCD always terminates after a fixed number of steps so has no I factor. This latter factor, I, corresponds to the number of MCMC steps or the number of greedy search steps; it may be large and may need to scale as the number of vertices increases.\nFollowing [18, 10] we restricted the network to the 234 most connected individuals. Figure 4 shows the average log predictive probability of held out data, accuracy and area under the receiver operating characteristic curve (AUC) over time for both BHCD and IRM. For the IRM, each point represents a single Gibbs step (for MCMC) or a search step (for greedy search). For BHCD, each point represents a complete run of the inference algorithm. BHCD is able to make reasonable predictions before the IRM has completed a single Gibbs scan. We used the same 10 cross-validation folds as used in [10] and so our results are quantitatively comparable to their results for the Latent Factor Relational Model (LFRM [18]) and their model, the Infinite Latent Attributes model (ILA). BHCD performs similarly to LFRM, worse than ILA, and better than IRM. After about 10 seconds, the sparse variants of BHCD make as good predictions on NIPS-234 as the IRM after about 1000 seconds. 
Notably the sparse variations are faster than the non-sparse variants of BHCD, as the NIPS co-authorship network is sparse.\nFull NIPS The full NIPS dataset has 2864 vertices and 9466 edges. Figure 5 shows part of the hierarchy discovered by BHCD. The full inferred hierarchy is large, having 646 internal nodes. We cut the tree and retained the top portion of the hierarchy, shown above the clusters. We merged all the leaves of a subtree T into a flat cluster when $r_T \\prod_{A \\in \\mathrm{ancestor}_T} (1 - r_A) > 0.5$, where rT is given in (14). This quantity corresponds to the probability of picking that particular subtree in the predictive distribution. Amongst these clusters we included only those with at least 15 members in Figure 5. We include hierarchies with a lower cut-off in the supplementary material.\n\n6 Discussion and Future Work\n\nWe proposed an efficient Bayesian procedure for discovering hierarchical communities in social networks. Experimentally our procedure discovers reasonable hierarchies and is able to make predictions about two orders of magnitude faster than one of the fastest existing Bayesian nonparametric schemes, whilst attaining comparable performance. Our inference procedure scales as O(n^2 log n) through a novel caching scheme, where n is the number of vertices, making the procedure suitable for large dense networks. However our likelihood can be computed in O(n + m) time, where m is the number of edges. This disparity between inference and likelihood suggests that in future it may be possible to improve the scalability of the model on sparse networks, where m \u226a n^2. Another way to scale up the model would be to investigate parameterising the network using the sufficient statistics of triangles, as in [21], instead of edges. 
Others [7] have found that non-conjugate likelihoods\ncan yield improved predictions\u2014thus adapting our scheme to work with non-conjugate likelihoods\nand doing hyperparameter inference could also be fruitful next steps.\nAcknowledgements We thank the Gatsby Charitable Foundation for generous funding.\n\n8\n\nAmari SWaibel ADoya KYang HCortes CFinke MCichocki AMurata NHaffner PLi YBengio YObermayer KBishop CSinger YKawato MBaldi PMoore ATresp VMorgan NSmyth PJaakkola TFreeman WDarrell TFisher JWillsky AWainwright MSudderth E BIhler A TTaycher LAdelson E HLeCun YVapnik VGuyon IDenker JGraf HSimard PHenderson DJackel LBottou LHubbard WKoch CMead CLiu SHarris JHoriuchi TWawrzynek JLuo JLazzaro JRuderman DBair WMeir RAlspector JAllen RJayakumar AEl-Yaniv RSatyanarayana SDomany EZeppenfeld TLippe DLapedes APoggio TMukherjee SPontil MGeiger DRudra AVetter TSerre THeisele BGirosi FRiesenhuber MSejnowski TMovellan J RViola PBartlett MMovellan JCohen MLittlewort GEkman PAndreou ALarsen JSingh SBarto AKearns MOpper MSutton RSchapire RMansour YIsbell C LLittman MMcAllester DMaass WBrown TZador AClaiborne BNatschlager TTsai KSontag EMainen ZCamevale NSontag E DStork DWolff GWatanabe TBoonyanit KLeung MKritayakirana KPeterson ASchwartz EBurr JMurray MJain ATebelskis JWang XSchmidbauer OSloboda TMcNair ABallard DSaito HOsterholtz LWoszczyna MMarchand MGolea MMason LTenorio MLee WBaxter JJapkowicz NFrean MSokolova MTsirukis AScholkopf BWeston JMuller KShawe-Taylor JSmola ABartlett PRatsch GWilliamson RPlatt JElisseeff A\fReferences\n[1] P. Holland, K.B. Laskey, and S. Leinhardt. Stochastic blockmodels: Some \ufb01rst steps. Social\n\nNetworks, 5:109137, 1983.\n\n[2] Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership\n\nstochastic blockmodel. Journal of Machine Learning Research, 9:1981\u20132014, 2008.\n\n[3] M. Girvan and M. E. J. Newman. Community structure in social and biological networks.\n\nPNAS, 99:7821\u20137826, 2002.\n\n[4] A. 
Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, 70, 2004.\n\n[5] Charles Kemp, Joshua B. Tenenbaum, Thomas L. Griffiths, Takeshi Yamada, and Naonori Ueda. Learning systems of concepts with an infinite relational model. AAAI, 2006.\n\n[6] Zhao Xu, Volker Tresp, Kai Yu, and Hans-Peter Kriegel. Infinite hidden relational models. Uncertainty in Artificial Intelligence (UAI), 2006.\n\n[7] Morten M\u00f8rup and Mikkel N. Schmidt. Bayesian community detection. Neural Computation, 24:2434\u20132456, 2012.\n\n[8] T. Herlau, M. M\u00f8rup, M. N. Schmidt, and L. K. Hansen. Detecting hierarchical structure in networks. In Cognitive Information Processing, 2012.\n\n[9] Phaedon-Stelios Koutsourelakis and Tina Eliassi-Rad. Finding mixed-memberships in social networks. 2008 AAAI Spring Symposium on Social Information Processing (AAAI-SS\u201908), 2008.\n\n[10] Konstantina Palla, David A. Knowles, and Zoubin Ghahramani. An infinite latent attribute model for network data. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012. July 2012.\n\n[11] Qirong Ho, Ankur P. Parikh, Le Song, and Eric P. Xing. Multiscale community blockmodel for network exploration. Proceedings of the Fourteenth International Workshop on Artificial Intelligence and Statistics (AISTATS), 2011.\n\n[12] D. M. Roy and Y. W. Teh. The Mondrian process. In Advances in Neural Information Processing Systems, volume 21, 2009.\n\n[13] K. A. Heller and Z. Ghahramani. Bayesian hierarchical clustering. In Proceedings of the International Conference on Machine Learning, volume 22, 2005.\n\n[14] C. Blundell, Y. Teh, and K. A. Heller. Bayesian Rose trees. UAI, 2010.\n\n[15] T. Snijders and K. Nowicki. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14:75\u2013100, 1997.\n\n[16] Jake M. 
Hofman and Chris H. Wiggins. Bayesian approach to network modularity. Physical Review Letters, 100(25):258701, 2008.\n\n[17] S. F. Sampson. A novitiate in a period of change: An experimental and case study of social relationships. 1968.\n\n[18] Kurt T. Miller, Thomas L. Griffiths, and Michael I. Jordan. Nonparametric latent feature models for link prediction. Neural Information Processing Systems (NIPS), 2009.\n\n[19] A. Globerson, G. Chechik, F. Pereira, and N. Tishby. Euclidean embedding of co-occurrence data. Journal of Machine Learning Research, 8:2265\u20132295, 2007.\n\n[20] Charles Kemp. Infinite relational model implementation. http://www.psy.cmu.edu/~ckemp/code/irm.html. Accessed: 2013-04-08.\n\n[21] Q. Ho, J. Yin, and E. P. Xing. On triangular versus edge representations \u2014 towards scalable modeling of networks. Neural Information Processing Systems (NIPS), 2012.", "award": [], "sourceid": 796, "authors": [{"given_name": "Charles", "family_name": "Blundell", "institution": "Gatsby Unit, UCL"}, {"given_name": "Yee Whye", "family_name": "Teh", "institution": "University of Oxford"}]}