{"title": "Semi-supervised Learning with Penalized Probabilistic Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 849, "page_last": 856, "abstract": null, "full_text": " Semi-supervised Learning with Penalized\n Probabilistic Clustering\n\n\n\n Zhengdong Lu and Todd K. Leen\n Department of Computer Science and Engineering\n OGI School of Science and Engineering , OHSU\n Beaverton, OR 97006\n {zhengdon,tleen}@cse.ogi.edu\n\n\n\n Abstract\n\n While clustering is usually an unsupervised operation, there are circum-\n stances in which we believe (with varying degrees of certainty) that items\n A and B should be assigned to the same cluster, while items A and C\n should not. We would like such pairwise relations to influence cluster\n assignments of out-of-sample data in a manner consistent with the prior\n knowledge expressed in the training set. Our starting point is proba-\n bilistic clustering based on Gaussian mixture models (GMM) of the data\n distribution. We express clustering preferences in the prior distribution\n over assignments of data points to clusters. This prior penalizes cluster\n assignments according to the degree with which they violate the prefer-\n ences. We fit the model parameters with EM. Experiments on a variety\n of data sets show that PPC can consistently improve clustering results.\n\n\n\n1 Introduction\n\nWhile clustering is usually executed completely unsupervised, there are circumstances in\nwhich we have prior belief that pairs of samples should (or should not) be assigned to\nthe same cluster. 
Such pairwise relations may arise from a perceived similarity (or dissimilarity) between
samples, or from a desire that the algorithmically generated clusters match the geometric
cluster structure perceived by the experimenter in the original data. Continuity, which
suggests that neighboring pairs of samples in a time series or in an image are likely to
belong to the same class of object, is also a source of clustering preferences. We would
like these preferences to be incorporated into the cluster structure so that the assignment
of out-of-sample data to clusters captures the concept(s) that give rise to the preferences
expressed in the training data.

Some work [1, 2, 3] has been done on adapting traditional clustering methods, such as K-
means, to incorporate pairwise relations. These models are based on hard clustering, and
the clustering preferences are expressed as hard pairwise constraints that must be satisfied.
While this work was in progress, we became aware of the algorithm of Shental et al. [4],
who propose a Gaussian mixture model (GMM) for clustering that incorporates hard
pairwise constraints.

In this paper, we propose a soft clustering algorithm based on GMM that expresses cluster-
ing preferences (in the form of pairwise relations) in the prior probability on assignments of
data points to clusters. This framework naturally accommodates both hard constraints and
soft preferences, with the preferences expressed as a Bayesian probability that pairs of
points should (or should not) be assigned to the same cluster. We call the algorithm
Penalized Probabilistic Clustering (PPC). Experiments on several datasets demonstrate
that PPC can consistently improve the clustering result by incorporating reliable prior
knowledge.


2 Prior Knowledge for Cluster Assignments

PPC begins with a standard GMM

    P(x|Θ) = Σ_{α=1}^{M} π_α P(x|α, θ_α)

where Θ = (π_1, . . . , π_M, θ_1, . . . , θ_M). 
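For concreteness, the mixture density above can be evaluated directly; the following is a minimal NumPy sketch (the two-component parameters are illustrative, not taken from the paper):

```python
import numpy as np

def gauss_pdf(x, mean, cov):
    """Multivariate normal density P(x | alpha, theta_alpha), theta = (mean, cov)."""
    d = x.shape[0]
    diff = x - mean
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))

def gmm_density(x, weights, means, covs):
    """P(x | Theta) = sum_alpha pi_alpha * P(x | alpha, theta_alpha)."""
    return sum(w * gauss_pdf(x, m, c) for w, m, c in zip(weights, means, covs))

# Toy two-component mixture in one dimension (illustrative parameters)
weights = [0.5, 0.5]
means = [np.array([0.0]), np.array([4.0])]
covs = [np.eye(1), np.eye(1)]
p = gmm_density(np.array([0.0]), weights, means, covs)
```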
We augment the dataset X = {x_i}, i = 1, . . . , N with
(latent) cluster assignments Z = {z_i = z(x_i)}, i = 1, . . . , N to form the familiar complete
data (X, Z). The complete data likelihood is

    P(X, Z|Θ) = P(X|Z, Θ) P(Z|Θ).    (1)


2.1 Prior distribution in latent space

We incorporate our clustering preferences by manipulating the prior probability P(Z|Θ).
In the standard Gaussian mixture model, the prior distribution is trivial: P(Z|Θ) = ∏_i π_{z_i}.
We incorporate prior knowledge (our clustering preferences) through a weighting function
g(Z) that has large values when the assignment of data points to clusters Z conforms to
our preferences, and low values when Z conflicts with our preferences. Hence we write

    P(Z|Θ, G) = (∏_i π_{z_i}) g(Z) / Σ_{Z'} (∏_j π_{z'_j}) g(Z') = (1/K) (∏_i π_{z_i}) g(Z)    (2)

where the sum in the normalizer K is over all possible assignments of the data to clusters.
The likelihood of the data, given a specific cluster assignment, is independent of the
cluster assignment preferences, and so the complete data likelihood is

    P(X, Z|Θ, G) = (1/K) P(X|Z, Θ) (∏_i π_{z_i}) g(Z) = (1/K) P(X, Z|Θ) g(Z),    (3)

where P(X, Z|Θ) is the complete data likelihood for a standard GMM. The data likelihood
is the sum of the complete data likelihood over all possible Z, that is, L(X|Θ) =
P(X|Θ, G) = Σ_Z P(X, Z|Θ, G), which can be maximized with the EM algorithm. Once
the model parameters are fit, we do soft clustering of new data according to the posterior
probabilities p(α|x, Θ). (Note that cluster assignment preferences are not expressed for the
new data, only for the training data.)


2.2 Pairwise relations

Pairwise relations provide a special case of the framework discussed above. 
We specify
two types of pairwise relations:

 link: two samples should be assigned to the same cluster

 do-not-link: two samples should be assigned to different clusters.

The weighting factor given to the cluster assignment configuration Z is simple:

    g(Z) = exp( Σ_{i,j} W^p_{ij} δ(z_i, z_j) ),

where δ is the Kronecker δ-function and W^p_{ij} is the weight associated with the sample pair
(x_i, x_j). It satisfies

    W^p_{ij} ∈ [−∞, ∞],  W^p_{ij} = W^p_{ji}.

The weight W^p_{ij} reflects our preference and confidence in assigning x_i and x_j to one
cluster. We use a positive W^p_{ij} when we prefer to assign x_i and x_j to one cluster (link),
and a negative W^p_{ij} when we prefer to assign them to different clusters (do-not-link). The
magnitude |W^p_{ij}| reflects how certain we are of the preference. If W^p_{ij} = 0, we have no prior
knowledge on the assignment relevancy of x_i and x_j. In the extreme case where |W^p_{ij}| → ∞,
any Z violating the pairwise relation between x_i and x_j has zero prior probability, since
for those assignments

    P(Z|Θ, G) = ∏_n π_{z_n} exp(Σ_{i,j} W^p_{ij} δ(z_i, z_j)) / Σ_{Z'} ∏_n π_{z'_n} exp(Σ_{i,j} W^p_{ij} δ(z'_i, z'_j)) → 0.

Such relations then become hard constraints, while relations with |W^p_{ij}| < ∞ are called
soft preferences. 
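To make the construction concrete, the following sketch builds the weight matrix from a soft link and evaluates the penalized prior by brute-force enumeration over all assignments; this is feasible only for tiny N, and the weights and mixing proportions are illustrative:

```python
import itertools
import numpy as np

def g(Z, W):
    """Weighting factor g(Z) = exp( sum_{i,j} W_ij * delta(z_i, z_j) )."""
    n = len(Z)
    s = sum(W[i, j] for i in range(n) for j in range(n) if Z[i] == Z[j])
    return np.exp(s)

def penalized_prior(Z, pi, W):
    """P(Z | Theta, G): the product-of-pi prior rescaled by g(Z), then normalized."""
    n, M = len(Z), len(pi)
    num = np.prod([pi[z] for z in Z]) * g(Z, W)
    den = sum(np.prod([pi[z] for z in Zp]) * g(Zp, W)
              for Zp in itertools.product(range(M), repeat=n))
    return num / den

# Three samples, two components; one soft link between samples 0 and 1
W = np.zeros((3, 3))
W[0, 1] = W[1, 0] = 2.0            # positive weight: prefer the same cluster
pi = [0.5, 0.5]
p_same = penalized_prior((0, 0, 1), pi, W)   # samples 0 and 1 together
p_diff = penalized_prior((0, 1, 1), pi, W)   # samples 0 and 1 apart
```

Because the double sum counts each symmetric pair twice, assignments satisfying the link are favored by a factor exp(2 W^p_{01}) over those violating it; this doubling is the source of the factor of 2 in the Gibbs conditional of equation (5).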
In the remainder of this paper, we will use W^p to denote the prior knowledge on pairwise
relations, that is,

    P(X, Z|Θ, W^p) = (1/K) P(X, Z|Θ) exp( Σ_{i,j} W^p_{ij} δ(z_i, z_j) ).    (4)


2.3 Model fitting

We use the EM algorithm [5] to fit the model parameters Θ:

    Θ* = arg max_Θ L(X|Θ, G).

The expectation step (E-step) and maximization step (M-step) are

    E-step: Q(Θ, Θ^{(t−1)}) = E_{Z|X}[ log P(X, Z|Θ, G) | X, Θ^{(t−1)}, G ]
    M-step: Θ^{(t)} = arg max_Θ Q(Θ, Θ^{(t−1)}).

In the M-step, the optimal mean and covariance matrix of each component are

    μ_k = Σ_{j=1}^{N} x_j P(k|x_j, Θ^{(t−1)}, G) / Σ_{j=1}^{N} P(k|x_j, Θ^{(t−1)}, G)

    Σ_k = Σ_{j=1}^{N} P(k|x_j, Θ^{(t−1)}, G) (x_j − μ_k)(x_j − μ_k)^T / Σ_{j=1}^{N} P(k|x_j, Θ^{(t−1)}, G).

However, the update of the prior probability of each component is more difficult than for
the standard GMM; we need to find

    {π_1, . . . , π_M} = arg max_π [ Σ_{l=1}^{M} Σ_{i=1}^{N} P(l|x_i, Θ^{(t−1)}, G) log π_l − log K(π) ].

In this paper, we use a numerical method to find the solution.


2.4 Posterior Inference and Gibbs sampling

The M-step requires the cluster membership posterior. Computing this posterior is simple
for the standard GMM since each data point x_i can be assigned to a cluster independent of
the other data points, and we have the familiar cluster origin posterior p(z_i = k|x_i, Θ).

For the PPC model, calculating the posteriors is no longer trivial. If two sample points x_i
and x_j participate in a pairwise relation, equation (4) tells us

    P(z_i, z_j|X, Θ, W^p) ≠ P(z_i|X, Θ, W^p) P(z_j|X, Θ, W^p),

and the posterior probabilities of x_i and x_j cannot be computed separately.

For pairwise relations, the joint posterior distribution must be calculated over the entire
transitive closure of the "link" or "do-not-link" relations. See Fig. 
1 for an illustration.

Figure 1: (a) Links (solid line) and do-not-links (dotted line) among six samples; (b) Rele-
vancy (solid line) translated from links in (a)

In the remainder of this paper, we will refer to the smallest sets of samples whose posterior
assignment probabilities can be calculated independently as cliques. The posterior proba-
bility of a given sample x_i in a clique T is calculated by marginalizing the posterior over
the entire clique,

    P(z_i = k|X, Θ, W^p) = Σ_{Z_T : z_i = k} P(Z_T|X_T, Θ, W^p),

with the posterior on the clique given by

    P(Z_T|X_T, Θ, W^p) = P(Z_T, X_T|Θ, W^p) / P(X_T|Θ, W^p)
                       = P(Z_T, X_T|Θ, W^p) / Σ_{Z_T} P(Z_T, X_T|Θ, W^p).

Computing the posterior probability of a sample in clique T requires time O(M^{|T|}),
where |T| is the size of clique T and M is the number of components in the mixture model.
This is very expensive if |T| is big and M ≥ 2. Hence small cliques are required to make
the marginalization computationally reasonable.

In some circumstances it is natural to limit ourselves to the special case of pairwise relations
with |T| ≤ 2, called non-overlapping relations. See Fig. 2 for an illustration. More generally,
we can avoid the expensive computation in posterior inference by breaking large cliques
into many small ones. To do this, we need to ignore some links or do-not-links. In section
3.2, we give an application of this idea.

For some choices of g(Z), the posterior probability can be given in a simple form even
when the clique is big. One example is when there are only hard links. This case is useful
when we are sure that a group of samples comes from one source. 
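The exact clique marginalization above can be sketched in a few lines of Python; the brute-force loop over all M^{|T|} assignments makes the cost explicit. The component densities, mixing weights, and pairwise weight below are illustrative stand-ins for a fitted model, not values from the paper:

```python
import itertools
import numpy as np

def clique_posterior(i, X_T, pi, comp_pdfs, W):
    """P(z_i = k | X_T, Theta, W^p) by summing the joint P(Z_T, X_T | Theta, W^p)
    over all M^{|T|} clique assignments Z_T -- hence the need for small cliques."""
    n, M = len(X_T), len(pi)
    post = np.zeros(M)
    for Z in itertools.product(range(M), repeat=n):
        lik = np.prod([pi[z] * comp_pdfs[z](x) for z, x in zip(Z, X_T)])
        pen = np.exp(sum(W[a, b] for a in range(n) for b in range(n)
                         if Z[a] == Z[b]))
        post[Z[i]] += lik * pen          # unnormalized P(Z_T, X_T | Theta, W^p)
    return post / post.sum()

# Two 1-D samples near the first component, with a strong do-not-link between them
comp_pdfs = [lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi),          # N(0, 1)
             lambda x: np.exp(-0.5 * (x - 4.0) ** 2) / np.sqrt(2 * np.pi)]  # N(4, 1)
W = np.array([[0.0, -10.0], [-10.0, 0.0]])
post0 = clique_posterior(0, [0.1, 0.2], [0.5, 0.5], comp_pdfs, W)
```

Even though both samples sit near the first component, the do-not-link pushes a substantial share of sample 0's posterior mass toward the second component: with the illustrative numbers above, post0 is roughly [0.6, 0.4] rather than nearly [1, 0].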
For more general cases,
where exact inference is computationally prohibitive, we propose to use Gibbs sampling
[6] to estimate the posterior probability.

 Figure 2: (a) Overlapping pairwise relations; (b) Non-overlapping pairwise relations.

In Gibbs sampling, we estimate P(z_i|X, Θ, G) as a sample mean

    P(z_i = k|X, Θ, G) = E[ δ(z_i, k) | X, Θ, G ] ≈ (1/S) Σ_{t=1}^{S} δ(z_i^{(t)}, k)

where the sum is over a sequence of S samples from P(Z|X, Θ, G) generated by the
Gibbs MCMC. The t-th sample in the sequence is generated by the usual Gibbs sampling
technique:

    Pick z_1^{(t)} from the distribution P(z_1 | z_2^{(t−1)}, z_3^{(t−1)}, . . . , z_N^{(t−1)}, X, G, Θ)
    Pick z_2^{(t)} from the distribution P(z_2 | z_1^{(t)}, z_3^{(t−1)}, . . . , z_N^{(t−1)}, X, G, Θ)
    . . .
    Pick z_N^{(t)} from the distribution P(z_N | z_1^{(t)}, z_2^{(t)}, . . . , z_{N−1}^{(t)}, X, G, Θ)

For pairwise relations it is helpful to introduce some notation. Let Z_{−i} denote an as-
signment of data points to clusters that leaves out the assignment of x_i. Let U(i) be
the indices of the set of samples that participate in a pairwise relation with sample x_i,
U(i) = {j : W^p_{ij} ≠ 0}. Then we have

    P(z_i|Z_{−i}, X, Θ, W^p) ∝ P(x_i, z_i|Θ) exp( 2 Σ_{j∈U(i)} W^p_{ij} δ(z_i, z_j) ).    (5)

When W^p is sparse, the size of U(i) is small; thus calculating P(z_i|Z_{−i}, X, Θ, W^p) is
very cheap and Gibbs sampling can effectively estimate the posterior probability.


3 Experiments

3.1 Clustering with different numbers of hard pairwise constraints

In this experiment, we demonstrate how the number of pairwise relations affects the per-
formance of clustering. We apply the PPC model to three UCI data sets: Iris, Waveform, and
Pendigits. The Iris data set has 150 samples and three classes, with 50 samples in each class;
the Waveform data set has 5000 samples and three classes, with one third of the samples in
each class; the Pendigits data set includes four classes (digits 0, 6, 8, 9), each with 750 samples. 
All data sets have labels
for all samples, which are used to generate the relations and to evaluate performance.

We try PPC (with the number of components equal to the number of classes) with varying
numbers of pairwise relations. For each number of relations, we conduct 100 runs and
calculate the averaged classification accuracy. In each run, the data set is randomly split
into a training set (90%) and a test set (10%). The pairwise relations are generated as
follows: we randomly pick two samples from the training set without replacement and check
their labels. If the two have the same label, we add a link constraint between them;
otherwise, we add a do-not-link constraint. Note that the generated pairwise relations are
non-overlapping, as described in section 2.4. The model fitted on the training set is applied
to the test set. Results on the three data sets are shown in Fig. 3 (a), (b), and (c),
respectively. As Fig. 3 indicates, PPC consistently improves its clustering accuracy on the
training set as more pairwise constraints are added; moreover, the improvement brought by
the constraints generalizes to the test set.

Figure 3: The performance of PPC with various numbers of relations. (a) On Iris data;
(b) on Waveform data; (c) on Pendigits data. Each panel plots the averaged classification
accuracy on the training and test sets against the number of relations.


3.2 Hard pairwise constraints for encoding partial labels

The experiment in this subsection shows the application of pairwise constraints to partially
labeled data. For example, consider a problem with six classes A, B, . . . , F. 
The classes are
grouped into several class-sets C1 = {A, B, C}, C2 = {D, E}, C3 = {F}. The samples
are partially labeled in the sense that we are told which class-set a sample is from, but not
which specific class it is from. We can logically derive a do-not-link constraint between
any pair of samples known to belong to different class-sets, while no link constraint can be
derived if a class-set has more than one class in it.

Fig. 4 (a) shows a 120x400 region of the Greenland ice sheet, from the NASA Langley DAAC.
This region is partially labeled into snow area and non-snow area, as indicated in Fig. 4 (b).
The snow area can be ice, melting snow, or dry snow, while the non-snow area can be bare
land, water, or cloud. Each pixel has attributes from seven spectral bands. To segment the
image, we first divide the image into 5x5x7 blocks (175-dimensional vectors). We use the
first 50 principal components as feature vectors.

For PPC, we use half of the data samples for training and the rest for testing. Hard
do-not-link constraints (only on the training set) are generated as follows: for each block
in the non-snow area, we randomly choose (without replacement) six blocks from the snow
area to build do-not-link constraints. By doing this, we obtain cliques of size seven (1
non-snow block + 6 snow blocks). As in section 3.1, we apply the model fitted with PPC to
the test set and combine the clustering results on both sets into a complete picture. Typical
clustering results of a 3-component standard GMM and 3-component PPC are shown in Fig.
4 (c) and (d), respectively. As Fig. 4 shows, the standard GMM gives a clustering that is
clearly in disagreement with the human labeling in Fig. 4 (b). The PPC segmentation makes
far fewer mis-assignments of snow areas (tagged white and gray) to non-snow (black) than
does the GMM. The PPC segmentation properly labels almost all of the non-snow regions
as non-snow. 
Furthermore, the segmentation of the snow areas into the two classes (not
labeled) tagged white and gray in Fig. 4 (d) reflects subtle differences in the snow regions
captured by the gray-scale image from spectral channel 2, as shown in Fig. 4 (a).

Figure 4: (a) Gray-scale image from spectral channel 2. (b) Partial label given by an
expert; black pixels denote non-snow area and white pixels denote snow area. Clustering
results of standard GMM (c) and PPC (d). (c) and (d) are colored according to the image
blocks' assignments.


3.3 Soft pairwise preferences for texture image segmentation

In this subsection, we propose an unsupervised texture image segmentation algorithm as
an application of the PPC model. As in section 3.2, the image is divided into blocks and
rearranged into feature vectors. We use a GMM to model those feature vectors, with the
aim that each Gaussian component represents one texture. However, the standard GMM
often fails to give a good segmentation because it cannot make use of the spatial continuity
of the image, which is essential in many image segmentation models, such as random fields
[7]. In our algorithm, the spatial continuity is incorporated as soft link preferences with
uniform weight between each block and its neighbors. The complete data likelihood is

    P(X, Z|Θ, W^p) = (1/K) P(X, Z|Θ) exp( Σ_i Σ_{j∈U(i)} w δ(z_i, z_j) ),    (6)

where U(i) denotes the neighbors of the i-th block. The EM algorithm can be roughly in-
terpreted as iterating over two steps: 1) estimating the texture description (parameters of
the mixture model) based on the segmentation, and 2) segmenting the image based on the
texture description given by step 1. Gibbs sampling is used to estimate the posterior
probability in each EM iteration. Equation (5) reduces to

    P(z_i|Z_{−i}, X, Θ, W^p) ∝ P(x_i, z_i|Θ) exp( 2w Σ_{j∈U(i)} δ(z_i, z_j) ).

The image shown in Fig. 5 (a) is combined from four Brodatz textures.1 
This image is
divided into 7x7 blocks which are then rearranged into 49-dimensional vectors. We use those
vectors' first five principal components as the associated feature vectors. For the PPC model,
soft links with weight w are added between each block and its four neighbors, as shown in
Fig. 5 (b). Typical clustering results of a 4-component standard GMM and 4-component PPC
with w = 2 are shown in Fig. 5 (c) and Fig. 5 (d), respectively. PPC clearly achieves a better
segmentation after incorporating spatial continuity.

 1 Downloaded from http://sipi.usc.edu/services/database/Database.html, April 2004.

Figure 5: (a) Texture combination. (b) One block and its four neighbors. Clustering results
of standard GMM (c) and PPC (d). (c) and (d) are shaded according to the blocks'
assignments to clusters.

4 Conclusion and Discussion
We have proposed a probabilistic clustering model that incorporates prior knowledge in
the form of pairwise relations between samples. Unlike previous work in semi-supervised
clustering, this work formulates clustering preferences as a Bayesian prior over the assign-
ment of data points to clusters, and so naturally accommodates both hard constraints and
soft preferences. To address the computational difficulty brought by large cliques, we proposed
a Markov chain estimation method that reduces the computational cost. Experiments on
different data sets show that pairwise relations can consistently improve the performance of
the clustering process.


Acknowledgments

The authors thank Ashok Srivastava for helpful conversations. This work was funded by
NASA Collaborative Agreement NCC 2-1264.


References

[1] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained K-means clustering with back-
 ground knowledge. In Proceedings of the Eighteenth International Conference on Machine
 Learning, pages 577-584, 2001.

[2] S. Basu, A. Banerjee, and R. Mooney. Semi-supervised clustering by seeding. 
In Proceedings
 of the Nineteenth International Conference on Machine Learning, pages 19-26, 2002.

[3] D. Klein, S. Kamvar, and C. Manning. From instance-level to space-level constraints: making
 the most of prior knowledge in data clustering. In Proceedings of the Nineteenth International
 Conference on Machine Learning, pages 307-313, 2002.

[4] N. Shental, A. Bar-Hillel, T. Hertz, and D. Weinshall. Computing Gaussian mixture models
 with EM using equivalence constraints. In Advances in Neural Information Processing Systems,
 volume 15, 2003.

[5] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM
 algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.

[6] R. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report
 CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.

[7] C. Bouman and M. Shapiro. A multiscale random field model for Bayesian image segmentation.
 IEEE Transactions on Image Processing, 3:162-177, March 1994.
", "award": [], "sourceid": 2610, "authors": [{"given_name": "Zhengdong", "family_name": "Lu", "institution": null}, {"given_name": "Todd", "family_name": "Leen", "institution": null}]}