{"title": "Soft Clustering on Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 1553, "page_last": 1560, "abstract": null, "full_text": "Soft Clustering on Graphs\n\nKai Yu1 , Shipeng Yu2 , Volker Tresp1 1 Siemens AG, Corporate Technology 2 Institute for Computer Science, University of Munich kai.yu@siemens.com, volker.tresp@siemens.com spyu@dbs.informatik.uni-muenchen.de\n\nAbstract\nWe propose a simple clustering framework on graphs encoding pairwise data similarities. Unlike usual similarity-based methods, the approach softly assigns data to clusters in a probabilistic way. More importantly, a hierarchical clustering is naturally derived in this framework to gradually merge lower-level clusters into higher-level ones. A random walk analysis indicates that the algorithm exposes clustering structures in various resolutions, i.e., a higher level statistically models a longer-term diffusion on graphs and thus discovers a more global clustering structure. Finally we provide very encouraging experimental results.\n\n1\n\nIntroduction\n\nClustering has been widely applied in data analysis to group similar objects. Many algorithms are either similarity-based or model-based. In general, the former (e.g., normalized cut [5]) requires no assumption on data densities but simply a similarity function, and usually partitions data exclusively into clusters. In contrast, model-based methods apply mixture models to fit data distributions and assign data to clusters (i.e. mixture components) probabilistically. This soft clustering is often desired, as it encodes uncertainties on datato-cluster assignments. However, their density assumptions can sometimes be restrictive, e.g. clusters have to be Gaussian-like in Gaussian mixture models (GMMs). In contrast to flat clustering, hierarchical clustering makes intuitive senses by forming a tree of clusters. 
Despite its wide application, the technique is usually achieved by heuristics (e.g., single link) and lacks theoretical backing. Only a few principled algorithms exist so far, in which a Gaussian or spherical-shape assumption is often made [3, 1, 2]. This paper suggests a novel graph-factorization clustering (GFC) framework that employs data affinities while partitioning the data probabilistically. A hierarchical clustering algorithm (HGFC) is further derived by merging lower-level clusters into higher-level ones. An analysis based on graph random walks suggests that our clustering method models data affinities as empirical transitions generated by a mixture of latent factors. This view differs significantly from conventional model-based clustering, since here the mixture model is not for the data objects directly but for their relations. Clusters with arbitrary shapes can be modeled by our method, since only pairwise similarities are considered. Interestingly, we prove that the higher-level clusters are associated with longer-term diffusive transitions on the graph, amounting to smoother and more global similarity functions on the data manifold. Therefore the cluster hierarchy exposes the observed affinity structure gradually at different resolutions, somewhat similar to the wavelet methods that analyze signals at different bandwidths. To the best of our knowledge, this property has never been considered by other agglomerative hierarchical clustering algorithms (e.g., see [3]). The paper is organized as follows. In the following section we describe a clustering algorithm based on similarity graphs. In Sec. 3 we generalize the algorithm to hierarchical clustering, followed by a discussion from the random walk point of view in Sec. 4. Finally we present the experimental results in Sec. 5 and conclude the paper in Sec. 
6.\n\n2\n\nGraph-factorization clustering (GFC)\n\nData similarity relations can be conveniently encoded by a graph, where vertices denote data objects and adjacency weights represent data similarities. This section introduces graph-factorization clustering, a probabilistic partitioning of graph vertices. Formally, let G(V, E) be a weighted undirected graph with vertices V = {v_i}_{i=1}^n and edges E = {(v_i, v_j)}. Let W = {w_ij} be the adjacency matrix, where w_ij = w_ji, w_ij > 0 if (v_i, v_j) is in E, and w_ij = 0 otherwise. For instance, w_ij can be computed by an RBF similarity function on the features of objects i and j, or by a binary indicator (0 or 1) of k-nearest-neighbor affinity.\n\n2.1 Bipartite graphs\n\nBefore presenting the main idea, it is necessary to introduce bipartite graphs. Let K(V, U, F) be a bipartite graph (e.g., Fig. 1(b)), where V = {v_i}_{i=1}^n and U = {u_p}_{p=1}^m are the two disjoint vertex sets, and F contains all the edges connecting V and U. Let B = {b_ip} denote the n x m adjacency matrix, with b_ip >= 0 the weight of edge [v_i, u_p]. The bipartite graph K induces a similarity between v_i and v_j [6]:\n\n    w_ij = sum_{p=1}^m b_ip b_jp / lambda_p = (B Lambda^{-1} B^T)_ij,  with  Lambda = diag(lambda_1, . . . , lambda_m),    (1)\n\nwhere lambda_p = sum_{i=1}^n b_ip denotes the degree of vertex u_p in U. We can interpret Eq. (1) from the perspective of Markov random walks on graphs: w_ij is essentially a quantity proportional to the stationary probability of a direct transition between v_i and v_j, denoted by p(v_i, v_j). Without loss of generality, we normalize W to ensure sum_ij w_ij = 1, so that w_ij = p(v_i, v_j). For a bipartite graph K(V, U, F) there are no direct links between vertices in V, and all paths from v_i to v_j must go through vertices in U. This indicates\n\n    p(v_i, v_j) = p(v_i) p(v_j|v_i) = d_i sum_p p(u_p|v_i) p(v_j|u_p) = sum_p p(v_i, u_p) p(u_p, v_j) / lambda_p,\n\nwhere p(v_j|v_i) is the conditional transition probability from v_i to v_j, and d_i = p(v_i) is the degree of v_i. This directly leads to Eq. 
(1) with b_ip = p(v_i, u_p).\n\n2.2 Graph factorization by bipartite graph construction\n\nFor a bipartite graph K, p(u_p|v_i) = b_ip / d_i is the conditional probability of a transition from v_i to u_p. If the size of U is smaller than that of V, namely m < n, then p(u_p|v_i) indicates how likely data point i is to belong to vertex p. This property suggests that one can construct a bipartite graph K(V, U, F) to approximate a given G(V, E), and thereby obtain a soft clustering structure in which U corresponds to the clusters (see Fig. 1(a)-(b)).\n\nFigure 1: (a) The original graph representing data affinities; (b) The bipartite graph representing data-to-cluster relations; (c) The induced cluster affinities.\n\nEq. (1) suggests that this approximation can be done by minimizing ell(W, B Lambda^{-1} B^T), given a distance ell(., .) between two adjacency matrices. To make the problem easier to solve, we remove the coupling between B and Lambda via H = B Lambda^{-1} and obtain\n\n    min_{H, Lambda} ell(W, H Lambda H^T),  s.t.  sum_{i=1}^n h_ip = 1,  H in R_+^{n x m},  Lambda in D_+^{m x m},    (2)\n\nwhere D_+^{m x m} denotes the set of m x m diagonal matrices with positive diagonal entries. This problem is a symmetric variant of non-negative matrix factorization [4]. In this paper we focus on the divergence distance between matrices. The following theorem suggests an alternating optimization approach to find a local minimum.\n\nTheorem 2.1. For the divergence distance ell(X, Y) = sum_ij (x_ij log(x_ij / y_ij) - x_ij + y_ij), the cost function in Eq. (2) is non-increasing under the update rules (a tilde denotes the updated quantities)\n\n    h~_ip  proportional to  h_ip sum_j [w_ij / (H Lambda H^T)_ij] lambda_p h_jp,  normalized s.t. sum_i h~_ip = 1;    (3)\n    lambda~_p  proportional to  lambda_p sum_ij [w_ij / (H Lambda H^T)_ij] h_ip h_jp,  normalized s.t. sum_p lambda~_p = sum_ij w_ij.    (4)\n\nThe distance is invariant under the updates if and only if H and Lambda are at a stationary point.\n\nSee the Appendix for all the proofs in this paper. Similar to a GMM, p(u_p|v_i) = b_ip / sum_q b_iq is the soft probabilistic assignment of vertex v_i to cluster u_p. The method can be seen as a counterpart of mixture models on graphs. 
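The alternating updates of Theorem 2.1 are straightforward to implement. The following is a minimal NumPy sketch (function and variable names are ours, not the authors'); it factorizes an adjacency matrix W into H Lambda H^T under the divergence distance via the multiplicative updates (3)-(4):

```python
import numpy as np

def gfc(W, m, n_iter=200, seed=0):
    """Graph-factorization clustering: approximate W by H @ diag(lam) @ H.T
    under the divergence distance, via the multiplicative updates (3)-(4).
    Illustrative sketch only; not the authors' code."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    H = rng.random((n, m))
    H /= H.sum(axis=0)                 # columns sum to 1, the constraint in Eq. (2)
    lam = np.full(m, W.sum() / m)      # so that sum_p lam_p = sum_ij w_ij
    for _ in range(n_iter):
        R = W / np.maximum((H * lam) @ H.T, 1e-12)  # ratio w_ij / (H Lam H^T)_ij
        H = H * (R @ (H * lam))        # Eq. (3): h_ip <- h_ip sum_j R_ij lam_p h_jp
        H /= H.sum(axis=0)             # renormalize columns
        lam = lam * np.einsum('ij,ip,jp->p', R, H, H)  # Eq. (4)
        lam *= W.sum() / lam.sum()     # renormalize so sum_p lam_p = sum_ij w_ij
    soft = H * lam                     # b_ip = h_ip lam_p
    soft /= soft.sum(axis=1, keepdims=True)  # p(u_p | v_i), soft assignments
    return H, lam, soft
```

On a k-nearest-neighbor graph, `soft[i]` then plays the role of the GMM posterior p(u_p | v_i) described in the text.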
The time complexity is O(m^2 N), with N being the number of nonzero entries in W. This can be very efficient if W is sparse (e.g., for a k-nearest-neighbor graph the complexity O(m^2 nk) scales linearly with the sample size n).\n\n3\n\nHierarchical graph-factorization clustering (HGFC)\n\nAs a nice property of the proposed graph factorization, a natural affinity between two clusters u_p and u_q can be computed as\n\n    p(u_p, u_q) = sum_{i=1}^n b_ip b_iq / d_i = (B^T D^{-1} B)_pq,  with  D = diag(d_1, . . . , d_n).    (5)\n\nThis is similar to Eq. (1), but derived from the other direction of two-hop transitions, U -> V -> U. Note that the similarity between clusters p and q takes into account a weighted average of contributions from all the data (see Fig. 1(c)).\n\nLet G_0(V_0, E_0) be the initial graph describing the similarities of all m_0 = n data points, with adjacency matrix W_0. Based on G_0 we can build a bipartite graph K_1(V_0, V_1, F_1), with m_1 < m_0 vertices in V_1. A hierarchical clustering method is motivated by the observation that the cluster similarity in Eq. (5) suggests a new adjacency matrix W_1 for a graph G_1(V_1, E_1), where V_1 is formed by the clusters and E_1 contains the edges connecting them. We can then group those clusters by constructing another bipartite graph K_2(V_1, V_2, F_2) with m_2 < m_1 vertices in V_2, such that W_1 is again factorized as in Eq. (2), and a new graph G_2(V_2, E_2) can be built. In principle we can repeat this procedure until only one cluster remains. Algorithm 1 summarizes this procedure.\n\nAlgorithm 1 Hierarchical Graph-Factorization Clustering (HGFC)\nRequire: n data objects and a similarity measure\n1: build the similarity graph G_0(V_0, E_0) with adjacency matrix W_0, and let m_0 = n\n2: for l = 1, 2, . . . 
, do\n3:   choose m_l < m_{l-1}\n4:   factorize G_{l-1} to obtain K_l(V_{l-1}, V_l, F_l) with adjacency matrix B_l\n5:   build a graph G_l(V_l, E_l) with adjacency matrix W_l = B_l^T D_l^{-1} B_l, where D_l's diagonal entries are obtained by summing over B_l's columns (i.e., the row sums of B_l)\n6: end for\n\nThe algorithm ends up with a hierarchical clustering structure. For level l, we can assign data to the obtained m_l clusters via a propagation from the bottom level of clusters. Based on the chain rule of Markov random walks, the soft (i.e., probabilistic) assignment of v_i in V_0 to cluster v_p^(l) in V_l is given by\n\n    p(v_p^(l) | v_i) = sum_{v^(l-1) in V_{l-1}} . . . sum_{v^(1) in V_1} p(v_p^(l) | v^(l-1)) . . . p(v^(1) | v_i) = (D_1^{-1} B~_l)_ip,    (6)\n\nwhere B~_l = B_1 D_2^{-1} B_2 D_3^{-1} B_3 . . . D_l^{-1} B_l. One can interpret this by deriving an equivalent bipartite graph K~_l(V_0, V_l, F~_l) and treating B~_l as the equivalent adjacency matrix attached to the equivalent edges F~_l connecting the data V_0 and the clusters V_l.\n\n4\n\nAnalysis of the proposed algorithms\n\n4.1 Flat clustering: statistical modeling of single-hop transitions\n\nIn this section we provide some insights into the suggested clustering algorithm, mainly from the perspective of random walks on graphs. Suppose that, from a stationary stage of a random walk on G(V, E), one observes c_ij single-hop transitions between v_i and v_j within a unit time frame. Following the graph-based view of similarities, if two data points are similar or related, transitions between them are likely to happen; thus we connect the observed similarities to the frequency of transitions via w_ij proportional to c_ij. If the observed transitions are i.i.d. samples from a true distribution p(v_i, v_j) = (H Lambda H^T)_ij behind which lies a bipartite graph, then the log-likelihood with respect to the observed transitions is\n\n    L(H, Lambda) = log prod_ij p(v_i, v_j)^{c_ij}  proportional to  sum_ij w_ij log (H Lambda H^T)_ij.    (7)\n\nThen we have the following conclusion.\n\nProposition 4.1. For a weighted undirected graph G(V, E) and the log-likelihood defined in Eq. 
(7), the following results hold: (i) minimizing the divergence distance ell(W, H Lambda H^T) is equivalent to maximizing the log-likelihood L(H, Lambda); (ii) the updates Eq. (3) and Eq. (4) correspond to a standard EM algorithm for maximizing L(H, Lambda).\n\nFigure 2: The similarities of vertices to a fixed vertex (marked in the left panel) on a 6-nearest-neighbor graph, induced respectively by clustering level l = 2 (middle panel) and l = 6 (right panel). A darker color means a higher similarity.\n\n4.2 Hierarchical clustering: statistical modeling of multi-hop transitions\n\nThe adjacency matrix W_0 of G_0(V_0, E_0) only models one-hop transitions that follow direct links from vertices to their neighbors. However, a random walk is a diffusion process on the graph: within a relatively long period, a walker starting from a vertex has a chance to reach faraway vertices through multi-hop transitions. Obviously, multi-hop transitions induce a slowly decaying similarity function on the graph. Based on the chain rule of the Markov process, the equivalent adjacency matrix for t-hop transitions is\n\n    A_t = W_0 (D^{-1} W_0)^{t-1} = A_{t-1} D^{-1} W_0.    (8)\n\nGenerally speaking, a slowly decaying similarity function on the similarity graph captures the global affinity structure of the data manifold, while a rapidly decaying similarity function only reveals the local affinity structure. The following proposition states that in the suggested HGFC, a higher-level clustering implicitly employs a more global similarity measure induced by multi-hop Markov random walks.\n\nProposition 4.2. For a given hierarchical clustering structure that starts from a bottom graph G_0(V_0, E_0) and reaches a higher level G_k(V_k, E_k), the vertices V_l at level 0 < l <= k induce an equivalent adjacency matrix on V_0, namely A_t with t = 2^(l-1) as defined in Eq. (8). 
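Eq. (8) can be illustrated in a few lines of NumPy (a sketch with our own naming, not the authors' code): the t-hop adjacency spreads similarity mass to vertices that share no direct edge, and by Proposition 4.2 one level of HGFC doubles t.

```python
import numpy as np

def multi_hop_adjacency(W0, t):
    """Equivalent adjacency matrix for t-hop transitions, Eq. (8):
    A_t = W0 (D^{-1} W0)^(t-1), with D = diag(row sums of W0)."""
    D_inv = np.diag(1.0 / W0.sum(axis=1))
    A = W0.copy()
    for _ in range(t - 1):
        A = A @ D_inv @ W0          # A_t = A_{t-1} D^{-1} W0
    return A
```

On a chain graph, for example, A_2 already links vertices two hops apart while preserving symmetry and vertex degrees, so doubling t yields an increasingly smooth, global similarity.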
Therefore the presented hierarchical clustering algorithm HGFC applies time windows of different sizes to examine the random walk, deriving similarity measures at different scales that expose the local and global clustering structures of the data manifold. Fig. 2 illustrates the similarities of vertices to a fixed vertex at clustering levels l = 2 and l = 6, which correspond to time periods t = 2 and t = 32. For the short period t = 2 the similarity is very local and helps to uncover low-level clusters, while over the longer period t = 32 the similarity function is rather global.\n\n5\n\nEmpirical study\n\nWe apply HGFC to USPS handwritten digits and Newsgroup text data. For the USPS data we use the images of digits 1, 2, 3 and 4, with 1269, 929, 824 and 852 images per class, respectively. Each image is represented as a 256-dimensional vector. The text data contain 3970 documents covering 4 categories: autos, motorcycles, baseball, and hockey. Each document is represented by an 8014-dimensional TFIDF feature vector. Our method employs a 10-nearest-neighbor graph, with an RBF similarity measure for USPS and cosine similarity for Newsgroup. We perform 4-level HGFC and set the cluster numbers, from bottom to top, to 100, 20, 10 and 4 for both data sets. We compare HGFC with two popular agglomerative hierarchical clustering algorithms, single link and complete link (e.g., [3]). Both methods merge the two closest clusters at each step.\n\nFigure 3: Visualization of HGFC for USPS data set. 
Left: mean images of the top 3 clustering levels, along with a Hinton graph representing the soft (probabilistic) assignments of 10 randomly chosen digits (shown on the left) to the clusters of the 3rd level from the top; Middle: a Hinton graph showing the soft cluster assignments from the 3rd level to the 2nd level; Right: a Hinton graph showing the soft assignments from the 2nd level to the 1st (top) level.\n\nFigure 4: Comparison of clustering methods on USPS (left) and Newsgroup (right), evaluated by normalized mutual information (NMI). Higher values indicate better quality.\n\nSingle link defines the cluster distance as the smallest point-wise distance between two clusters, while complete link uses the largest one. A third compared method is normalized cut [5], which partitions the data into two clusters; we apply the algorithm recursively to produce a top-down hierarchy of 2, 4, 8, 16, 32 and 64 clusters. We also compare with the k-means algorithm, for k = 4, 10, 20 and 100. Before showing the comparison, we visualize part of the clustering results for the USPS data in Fig. 3. On top of the left figure, we show the top three levels of the hierarchy with 4, 10 and 20 clusters, respectively, where each cluster is represented by its mean image, an average over all the images weighted by their posterior probabilities of belonging to this cluster. Then 10 randomly sampled digits with soft cluster assignments to the 3rd-level clusters are illustrated with a Hinton graph. The middle and right figures in Fig. 3 show the assignments between clusters across the hierarchy. The clear diagonal block structure in all the Hinton graphs indicates a very meaningful cluster hierarchy.\n\n       Normalized cut           HGFC                K-means\n\"1\"   635  630    1    3   1254    3    8    4   1265    1    0    3\n\"2\"     2    4  744  179      1  886   33    9     17  720   95   97\n\"3\"     2    1  817    4      1    4  816    3     10    9  796    9\n\"4\"    10    6    1  835      4    8    2  838     58   20    0  774\n\nTable 1: Confusion matrices of clustering results, 4 clusters, USPS data. 
In each confusion matrix, rows correspond to true classes and columns to the found clusters.\n\n            Normalized cut           HGFC                K-means\nautos      858   98   30    2   772  182   13   21   977    7    4    0\nmotor.      79  893   16    5    42  934    5   12   985    3    5    0\nbaseball    44   33  875   40    15   33  843  101    39  835  114    4\nhockey      11    8  893   85     7   21   11  958    16    4  900   77\n\nTable 2: Confusion matrices of clustering results, 4 clusters, Newsgroup data. In each confusion matrix, rows correspond to true classes and columns to the found clusters.\n\nWe compare the clustering methods by evaluating the normalized mutual information (NMI) in Fig. 4, defined as the mutual information between clusters and true classes, normalized by the maximum of the marginal entropies. Moreover, in order to assess the clustering quality more directly, we also show the confusion matrices for the case of 4 clusters in Table 1 and Table 2. We omit the confusion matrices of single link and complete link from the tables, to save space and because of their clearly poor performance compared with the others. The results show that single link performs poorly, as it greedily merges nearby data and tends to form one big cluster plus some outliers. Complete link is more balanced but still unsatisfactory; on the Newsgroup data it even gets stuck at the 3601st merge because all the similarities between the remaining clusters are 0. Top-down hierarchical normalized cut obtains reasonable results, but sometimes cannot split one big cluster (see the tables). The confusion matrices indicate that k-means does well for the digit images but relatively worse for the high-dimensional textual data. In contrast, Fig. 4 shows that HGFC gives significantly higher NMI values than the competitors on both tasks. 
It also produces confusion matrices with clear diagonal structures (see Tables 1 and 2), which indicates a very good clustering quality.\n\n6\n\nConclusion and Future Work\n\nIn this paper we have proposed a probabilistic graph partitioning method for clustering data objects based on their pairwise similarities. A novel hierarchical clustering algorithm, HGFC, has been derived, where a higher level in HGFC corresponds to a statistical model of random walk transitions over a longer period, giving rise to a more global clustering structure. Experiments show very encouraging results. In this paper we have specified the number of clusters at each level empirically; in the near future we plan to investigate effective methods to determine it automatically. Another direction is hierarchical clustering on directed graphs, as well as its applications in web mining.\n\nAppendix\n\nProof of Theorem 2.1. We first notice that sum_p lambda_p = sum_ij w_ij under the constraints sum_i h_ip = 1. Therefore we can normalize W by sum_ij w_ij and, after convergence, multiply all lambda_p by this quantity to get the solution. Under this assumption we are maximizing L(H, Lambda) = sum_ij w_ij log (H Lambda H^T)_ij with the extra constraint sum_p lambda_p = 1. We first fix Lambda and show that update Eq. (3) will not decrease L(H) := L(H, Lambda). We prove this by constructing an auxiliary function f(H, H') such that f(H, H') <= L(H) and f(H, H) = L(H). Then the update H_{t+1} = arg max_H f(H, H_t) will not decrease L(H), since L(H_{t+1}) >= f(H_{t+1}, H_t) >= f(H_t, H_t) = L(H_t). Define\n\n    f(H, H') = sum_ij w_ij sum_p q_p^{ij} [ log(h_ip lambda_p h_jp) - log q_p^{ij} ],  with  q_p^{ij} = h'_ip lambda_p h'_jp / sum_l h'_il lambda_l h'_jl.\n\nf(H, H) = L(H) can be easily verified, and f(H, H') <= L(H) follows from the concavity of the log function. It is then straightforward to verify Eq. (3) by setting the derivative of f with respect to h_ip to zero; the normalization is due to the constraints and can be formally derived from this procedure with a Lagrange formalism. Similarly we can define an auxiliary function for Lambda with H fixed, and verify Eq. (4). 
Proof of Proposition 4.1. (i) follows directly from the proof of Theorem 2.1. To prove (ii) we take u_p as the missing data and follow the standard derivation of the EM algorithm. In the E-step we estimate the posterior probability of u_p for the pair (v_i, v_j) using Bayes' rule: p^(u_p | v_i, v_j) proportional to p(v_i|u_p) p(v_j|u_p) p(u_p). Then in the M-step we maximize the \"complete\"-data likelihood\n\n    L^(G) = sum_ij w_ij sum_p p^(u_p | v_i, v_j) log [ p(v_i|u_p) p(v_j|u_p) p(u_p) ]\n\nwith respect to the model parameters h_ip = p(v_i|u_p) and lambda_p = p(u_p), under the constraints sum_i h_ip = 1 and sum_p lambda_p = 1. By setting the corresponding derivatives to zero we obtain h_ip proportional to sum_j w_ij p^(u_p | v_i, v_j) and lambda_p proportional to sum_ij w_ij p^(u_p | v_i, v_j). It is easy to check that these are equivalent to the updates Eq. (3) and Eq. (4), respectively.\n\nProof of Proposition 4.2. We give a brief proof. Suppose that at level l the data-to-cluster relationship is described by K~_l(V_0, V_l, F~_l) (see Eq. (6)) with adjacency matrix B~_l, degrees D_1 for V_0 (the row sums of B_1), and degrees Lambda~_l for V_l. In this case the induced adjacency matrix on V_0 is W~_l = B~_l Lambda~_l^{-1} B~_l^T, and the adjacency matrix of V_l is W_l = B~_l^T D_1^{-1} B~_l. Let K_{l+1}(V_l, V_{l+1}, F_{l+1}) be the bipartite graph connecting V_l and V_{l+1}, with adjacency matrix B_{l+1} and degrees Lambda_{l+1} for V_{l+1}. Then the adjacency matrix on V_0 induced by level l+1 is\n\n    W~_{l+1} = B~_{l+1} Lambda_{l+1}^{-1} B~_{l+1}^T = W~_l D_1^{-1} W~_l,\n\nwhere the relations B~_{l+1} = B~_l Lambda~_l^{-1} B_{l+1} and W_l = B_{l+1} Lambda_{l+1}^{-1} B_{l+1}^T = B~_l^T D_1^{-1} B~_l are applied. Given the initial condition from the bottom level, W~_1 = W_0, it is not difficult to obtain W~_l = A_t with t = 2^(l-1).\n\nReferences\n\n[1] J. Goldberger and S. Roweis. Hierarchical clustering of a mixture model. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17 (NIPS*04), pages 505-512, 2005.\n[2] K. A. Heller and Z. Ghahramani. Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning, pages 297-304, 2005.\n[3] S. D. Kamvar, D. Klein, and C. D. 
Manning. Interpreting and extending classical agglomerative clustering algorithms using a model-based approach. In Proceedings of the 19th International Conference on Machine Learning, pages 283-290, 2002.\n[4] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13 (NIPS*00), pages 556-562, 2001.\n[5] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.\n[6] D. Zhou, B. Schölkopf, and T. Hofmann. Semi-supervised learning on directed graphs. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17 (NIPS*04), pages 1633-1640, 2005.\n", "award": [], "sourceid": 2948, "authors": [{"given_name": "Kai", "family_name": "Yu", "institution": null}, {"given_name": "Shipeng", "family_name": "Yu", "institution": null}, {"given_name": "Volker", "family_name": "Tresp", "institution": null}]}