{"title": "Maximum Margin Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 1537, "page_last": 1544, "abstract": null, "full_text": " Maximum Margin Clustering\n\n\n\n Linli Xu James Neufeld Bryce Larson Dale Schuurmans\n University of Waterloo\n University of Alberta\n\n\n\n Abstract\n\n We propose a new method for clustering based on finding maximum mar-\n gin hyperplanes through data. By reformulating the problem in terms\n of the implied equivalence relation matrix, we can pose the problem as\n a convex integer program. Although this still yields a difficult com-\n putational problem, the hard-clustering constraints can be relaxed to a\n soft-clustering formulation which can be feasibly solved with a semidef-\n inite program. Since our clustering technique only depends on the data\n through the kernel matrix, we can easily achieve nonlinear clusterings in\n the same manner as spectral clustering. Experimental results show that\n our maximum margin clustering technique often obtains more accurate\n results than conventional clustering methods. The real benefit of our ap-\n proach, however, is that it leads naturally to a semi-supervised training\n method for support vector machines. By maximizing the margin simul-\n taneously on labeled and unlabeled training data, we achieve state of the\n art performance by using a single, integrated learning principle.\n\n\n1 Introduction\n\nClustering is one of the oldest forms of machine learning. Nevertheless, it has received a\nsignificant amount of renewed attention with the advent of nonlinear clustering methods\nbased on kernels. 
Kernel based clustering methods continue to have a significant impact on recent work in machine learning [14, 13], computer vision [16], and bioinformatics [9].

Although many variations of kernel based clustering have been proposed in the literature, most of these techniques share a common "spectral clustering" framework that follows a generic recipe: one first builds the kernel ("affinity") matrix, normalizes the kernel, performs dimensionality reduction, and finally clusters (partitions) the data based on the resulting representation [17].

In this paper, our primary focus will be on the final partitioning step where the actual clustering occurs. Once the data has been preprocessed and a kernel matrix has been constructed (and its rank possibly reduced), many variants have been suggested in the literature for determining the final partitioning of the data. The predominant strategies include using k-means clustering [14], minimizing various forms of graph cut cost [13] (relaxations of which amount to clustering based on eigenvectors [17]), and finding strongly connected components in a Markov chain defined by the normalized kernel [4]. Some other recent alternatives are correlation clustering [12] and support vector clustering [1].

What we believe is missing from this previous work, however, is a simple connection to other types of machine learning, such as semisupervised and supervised learning. In fact, one of our motivations is to seek unifying machine learning principles that can be used to combine different types of learning problems in a common framework. For example, a useful goal for any clustering technique would be to find a way to integrate it seamlessly with a supervised learning technique, to obtain a principled form of semisupervised learning. A good example of this is [18], which proposes a general random field model based on a given kernel matrix.
They then find a soft cluster assignment on unlabeled data that minimizes a joint loss with observed labels on supervised training data. Unfortunately, this technique actually requires labeled data to cluster the unlabeled data. Nevertheless, it is a useful approach.

Our goal in this paper is to investigate another standard machine learning principle--maximum margin classification--and modify it for clustering, with the goal of achieving a simple, unified way of solving a variety of problems, including clustering and semisupervised learning.

Although one might be skeptical that clustering based on large margin discriminants can perform well, we will see below that, combined with kernels, this strategy can often be more effective than conventional spectral clustering. Perhaps more significantly, it also immediately suggests a simple semisupervised training technique for support vector machines (SVMs) that appears to improve the state of the art.

The remainder of this paper is organized as follows. After establishing the preliminary ideas and notation in Section 2, we tackle the problem of computing a maximum margin clustering for a given kernel matrix in Section 3. Although it is not obvious that this problem can be solved efficiently, we show that the optimal clustering problem can in fact be formulated as a convex integer program. We then propose a relaxation of this problem which yields a semidefinite program that can be used to efficiently compute a soft clustering. Section 4 gives our experimental results for clustering. Then, in Section 5 we extend our approach to semisupervised learning by incorporating additional labeled training data in a seamless way.
We then present experimental results for semisupervised learning in Section 6 and conclude.

2 Preliminaries

Since our main clustering idea is based on finding maximum margin separating hyperplanes, we first need to establish the background ideas from SVMs as well as establish the notation we will use.

For SVM training, we assume we are given labeled training examples (x^1, y_1), ..., (x^N, y_N), where each example is assigned to one of two classes y_i \in \{-1, +1\}. The goal of an SVM of course is to find the linear discriminant f_{w,b}(x) = w^T \phi(x) + b that maximizes the minimum misclassification margin

    \gamma^* = \max_{w,b,\gamma} \gamma   subject to   y_i (w^T \phi(x^i) + b) \ge \gamma, \forall i = 1..N,   \|w\|_2 = 1    (1)

Here the Euclidean normalization constraint on w ensures that the Euclidean distance between the data and the separating hyperplane (in \phi(x) space) determined by w^*, b^* is maximized. It is easy to show that this same w^*, b^* is a solution to the quadratic program

    \gamma^{*-2} = \min_{w,b} \|w\|^2   subject to   y_i (w^T \phi(x^i) + b) \ge 1, \forall i = 1..N    (2)

Importantly, the minimum value of this quadratic program, \gamma^{*-2}, is just the inverse square of the optimal solution value to (1) [10].

To cope with potentially inseparable data, one normally introduces slack variables to reduce the dependence on noisy examples. This leads to the so called soft margin SVM (and its dual), which is controlled by a tradeoff parameter C

    \gamma^{-2} = \min_{w,b,\xi} \|w\|^2 + C e^T \xi   subject to   y_i (w^T \phi(x^i) + b) \ge 1 - \xi_i, \forall i = 1..N,   \xi \ge 0

              = \max_{\lambda} 2 \lambda^T e - \langle K \circ \lambda\lambda^T, y y^T \rangle   subject to   0 \le \lambda \le C,   \lambda^T y = 0    (3)

The notation we use in this dual formulation requires some explanation, since we will use it below: Here K denotes the N \times N kernel matrix formed from the inner products of feature vectors \Phi = [\phi(x^1), ..., \phi(x^N)], such that K = \Phi^T \Phi. Thus k_{ij} = \phi(x^i)^T \phi(x^j). The vector e denotes the vector of all 1 entries. We let A \circ B denote componentwise matrix multiplication, and let \langle A, B \rangle = \sum_{ij} a_{ij} b_{ij}.
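The componentwise notation can be sanity-checked numerically. The following sketch (illustrative only; random data, not part of the original paper) confirms the identity \lambda^T (K \circ y y^T) \lambda = \langle K \circ \lambda\lambda^T, y y^T \rangle that the dual derivation relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
Phi = rng.normal(size=(3, N))        # feature vectors phi(x^i) as columns
K = Phi.T @ Phi                      # kernel matrix, k_ij = phi(x^i)^T phi(x^j)
y = rng.choice([-1.0, 1.0], size=N)  # arbitrary labels
lam = rng.uniform(0, 1, size=N)      # arbitrary dual vector

def inner(A, B):
    # <A, B> = sum_ij a_ij * b_ij
    return float(np.sum(A * B))

lhs = float(lam @ (K * np.outer(y, y)) @ lam)          # lambda^T (K o yy^T) lambda
rhs = inner(K * np.outer(lam, lam), np.outer(y, y))    # <K o lambda lambda^T, yy^T>
assert abs(lhs - rhs) < 1e-10
```

Both sides expand to \sum_{ij} k_{ij} \lambda_i \lambda_j y_i y_j, so the equality is exact up to floating point.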
Note that (3) is derived from the standard dual SVM by using the fact that \lambda^T (K \circ y y^T) \lambda = \langle K \circ \lambda\lambda^T, y y^T \rangle.

To summarize: for supervised maximum margin training, one takes a given set of labeled training data (x^1, y_1), ..., (x^N, y_N), forms the kernel matrix K on data inputs, forms the kernel matrix y y^T on target outputs, sets the slack parameter C, and solves the quadratic program (3) to obtain the dual solution \lambda^* and the inverse square maximum margin value \gamma^{-2}. Once these are obtained, one can then recover a classifier directly from \lambda^* [15].

Of course, our main interest initially is not to find a large margin classifier given labels on the data, but instead to find a labeling that results in a large margin classifier.

3 Maximum margin clustering

The clustering principle we investigate is to find a labeling so that if one were to subsequently run an SVM, the margin obtained would be maximal over all possible labelings. That is, given data x^1, ..., x^N, we wish to assign the data points to two classes y_i \in \{-1, +1\} so that the separation between the two classes is as wide as possible.

Unsurprisingly, this is a hard computational problem. However, with some reformulation we can express it as a convex integer program, which suggests that there might be some hope of obtaining practical solutions. More usefully, we can relax the integer constraint to obtain a semidefinite program that yields soft cluster assignments which approximately maximize the margin. Therefore, one can obtain soft clusterings efficiently using widely available software. Before proceeding with the main development, however, there are some preliminary issues we need to address.

First, we clearly need to impose some sort of constraint on the class balance, since otherwise one could simply assign all the data points to the same class and obtain an unbounded margin.
A related issue is that we would also like to avoid the problem of separating a single outlier (or very small group of outliers) from the rest of the data. Thus, to mitigate these effects we will impose a constraint that the difference in class sizes be bounded. This will turn out to be a natural constraint for semisupervised learning and is very easy to enforce. Second, we would like the clustering to behave gracefully on noisy data where the classes may in fact overlap, so we adopt the soft margin formulation of the maximum margin criterion. Third, although it is indeed possible to extend our approach to the multiclass case [5], the extension is not simple, and for ease of presentation we focus on simple two class clustering in this paper. Finally, there is a small technical complication that arises with one of the SVM parameters: an unfortunate nonconvexity problem arises when we include the use of the offset b in the underlying large margin classifier. We currently do not have a way to avoid this nonconvexity, so we set b = 0 and only consider homogeneous linear classifiers. The consequence of this restriction is that the constraint \lambda^T y = 0 is removed from the dual SVM quadratic program (3). Although this would seem like a harsh restriction, the negative effects are mitigated by centering the data at the origin, which can always be imposed. Nevertheless, dropping this restriction remains an important question for future research. With these caveats in mind, we proceed to the main development.

We wish to solve for a labeling y \in \{-1, +1\}^N that leads to a maximum (soft) margin.
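On a toy problem this principle can be illustrated by brute force before any reformulation: enumerate balanced labelings, estimate \gamma^{-2}(y) for each by solving the box-constrained dual, and keep the labeling with the widest margin. The sketch below is not the paper's algorithm; it uses a simple projected-gradient ascent in place of a QP solver, takes \ell = 0 (perfectly balanced classes), and the data, step size, and iteration count are all illustrative assumptions.

```python
import itertools
import numpy as np

def dual_value(K, y, C=100.0, iters=3000):
    """Approximate max_lam 2 lam^T e - lam^T (K o yy^T) lam, 0 <= lam <= C,
    by projected gradient ascent (projection = clipping to the box).
    At the optimum this value equals the inverse squared margin gamma^{-2}."""
    Q = K * np.outer(y, y)                       # K o yy^T, positive semidefinite
    lam = np.zeros(len(y))
    eta = 1.0 / (2.0 * np.linalg.norm(Q, 2) + 1e-9)   # 1/L step for gradient 2e - 2Q lam
    for _ in range(iters):
        lam = np.clip(lam + eta * (2.0 - 2.0 * Q @ lam), 0.0, C)
    return 2.0 * lam.sum() - lam @ Q @ lam

# two well-separated point clouds, roughly centered at the origin (b = 0 setting)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-2, 0], 0.2, size=(4, 2)),
               rng.normal([+2, 0], 0.2, size=(4, 2))])
K = X @ X.T                                      # linear kernel
truth = np.array([-1.0] * 4 + [1.0] * 4)

# enumerate perfectly balanced labelings (e^T y = 0); the widest margin is the
# labeling with the smallest inverse squared margin
best_y, best_val = None, np.inf
for pos in itertools.combinations(range(8), 4):
    y = -np.ones(8)
    y[list(pos)] = 1.0
    val = dual_value(K, y)
    if val < best_val:
        best_y, best_val = y, val

assert np.array_equal(best_y, truth) or np.array_equal(best_y, -truth)
```

Note that the dual value depends on y only through y y^T, so y and -y always tie; this symmetry is exactly what motivates optimizing over M = y y^T in the next section.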
Straightforwardly, one could attempt to tackle this optimization problem by directly formulating

    \min_{y \in \{-1,+1\}^N} \gamma^{-2}(y)   subject to   -\ell \le e^T y \le \ell,   where

    \gamma^{-2}(y) = \max_{\lambda} 2 \lambda^T e - \langle K \circ \lambda\lambda^T, y y^T \rangle   subject to   0 \le \lambda \le C

Unfortunately, \gamma^{-2}(y) is not a convex function of y, and this formulation does not lead to an effective algorithmic approach. In fact, to obtain an efficient technique for solving this problem we need two key insights.

The first key step is to re-express this optimization, not directly in terms of the cluster labels y, but instead in terms of the label kernel matrix M = y y^T. The main advantage of doing so is that the inverse soft margin \gamma^{-2} is in fact a convex function of M

    \gamma^{-2}(M) = \max_{\lambda} 2 \lambda^T e - \langle K \circ \lambda\lambda^T, M \rangle   subject to   0 \le \lambda \le C    (4)

The convexity of \gamma^{-2} with respect to M is easy to establish, since this quantity is just a maximum over linear functions of M [3]. This observation parallels one of the key insights of [10], here applied to M instead of K.

Unfortunately, even though we can pose a convex objective, it does not allow us to immediately solve our problem, because we still have to relate M to y, and M = y y^T is not a convex constraint. Thus, the main challenge is to find a way to constrain M to ensure M = y y^T while respecting the class balance constraints -\ell \le e^T y \le \ell. One obvious way to enforce M = y y^T would be to impose the constraint that rank(M) = 1, since combined with M \in \{-1, +1\}^{N \times N} this forces M to have a decomposition y y^T for some y \in \{-1, +1\}^N. Unfortunately, rank(M) = 1 is not a convex constraint on M [7].

Our second key idea is to realize that one can indirectly enforce the desired relationship M = y y^T by imposing a different set of linear constraints on M. To do so, notice that any such M must encode an equivalence relation over the training points.
That is, if M = y y^T for some y \in \{-1, +1\}^N then we must have

    m_{ij} = +1 if y_i = y_j,   and   m_{ij} = -1 if y_i \ne y_j

Therefore, to enforce the constraint M = y y^T for y \in \{-1, +1\}^N it suffices to impose the set of constraints: (1) M encodes an equivalence relation, namely that it is transitive, reflexive and symmetric; (2) M has at most two equivalence classes; and (3) M has at least two equivalence classes. Fortunately, we can enforce each of these requirements by imposing a set of linear constraints on M \in \{-1, +1\}^{N \times N}, respectively:

    L1:  m_{ii} = 1;  m_{ij} = m_{ji};  m_{ik} \ge m_{ij} + m_{jk} - 1;   \forall ijk
    L2:  m_{jk} \ge -m_{ij} - m_{ik} - 1;   \forall ijk
    L3:  \sum_i m_{ij} \le N - 2;   \forall j

The result is that with only linear constraints on M we can enforce the condition M = y y^T.[1] Finally, we can enforce the class balance constraint -\ell \le e^T y \le \ell by imposing the additional set of linear constraints:

    L4:  -\ell \le \sum_i m_{ij} \le \ell;   \forall j

which obviates L3.

The combination of these two steps leads to our first main result: One can solve for a hard clustering y that maximizes the soft margin by solving a convex integer program. To accomplish this, one first solves for the equivalence relation matrix M in

    \min_{M \in \{-1,+1\}^{N \times N}} \max_{\lambda} 2 \lambda^T e - \langle K \circ \lambda\lambda^T, M \rangle   subject to   0 \le \lambda \le C,  L1, L2, L4    (5)

Then, from the solution M^*, recover the optimal cluster assignment y^* simply by setting y^* to any column vector in M^*.

Unfortunately, the formulation (5) is still not practical, because convex integer programming is still a hard computational problem.

[1] Interestingly, for M \in \{-1, +1\}^{N \times N} the first two sets of linear constraints can be replaced by the compact set of convex constraints diag(M) = e, M \succeq 0 [7, 11]. However, when we relax the integer constraint below, this equivalence is no longer true, and we realize some benefit in keeping the linear equivalence relation constraints.
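As a concrete check of the constraints L1-L3, the following sketch (illustrative, not from the paper) verifies on small examples that a two-class relation matrix M = y y^T satisfies all three, that a single-class matrix violates L3, and that a three-class relation matrix violates L2.

```python
from itertools import product
import numpy as np

def satisfies(M):
    """Check the linear constraints L1-L3 on M in {-1,+1}^{NxN}."""
    N = len(M)
    idx = range(N)
    # L1: reflexive, symmetric, transitive
    L1 = (np.all(np.diag(M) == 1) and np.all(M == M.T) and
          all(M[i, k] >= M[i, j] + M[j, k] - 1 for i, j, k in product(idx, repeat=3)))
    # L2: at most two equivalence classes
    L2 = all(M[j, k] >= -M[i, j] - M[i, k] - 1 for i, j, k in product(idx, repeat=3))
    # L3: at least two equivalence classes
    L3 = all(M[:, j].sum() <= N - 2 for j in idx)
    return L1, L2, L3

y = np.array([1, 1, -1, -1, 1])
two_class = np.outer(y, y)                  # valid: M = yy^T for y in {-1,+1}^5
one_class = np.ones((5, 5), dtype=int)      # everything in one cluster
labels3 = np.array([0, 0, 1, 1, 2])         # three clusters
three_class = np.where(labels3[:, None] == labels3[None, :], 1, -1)

assert satisfies(two_class) == (True, True, True)
assert satisfies(one_class)[2] is False     # L3 rules out a single class
assert satisfies(three_class)[1] is False   # L2 rules out a third class
```

For instance, with three classes one can pick i, j, k from three different clusters, so m_{ij} = m_{ik} = -1 forces m_{jk} \ge 1, contradicting m_{jk} = -1.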
Therefore, we are compelled to take one further step and relax the integer constraint on M to obtain a convex optimization problem over a continuous parameter space

    \min_{M \in [-1,+1]^{N \times N}} \max_{\lambda} 2 \lambda^T e - \langle K \circ \lambda\lambda^T, M \rangle   subject to   0 \le \lambda \le C,  L1, L2, L4,  M \succeq 0    (6)

This can be turned into an equivalent semidefinite program using essentially the same derivation as in [10], yielding

    \min_{M, \delta, \mu, \nu} \delta   subject to   L1, L2, L4,  \mu \ge 0,  \nu \ge 0,  M \succeq 0,

    [ M \circ K            e + \mu - \nu        ]
    [ (e + \mu - \nu)^T    \delta - 2 C \nu^T e ]  \succeq 0    (7)

This gives us our second main result: To solve for a soft clustering y that approximately maximizes the soft margin, first solve the semidefinite program (7), and then from the solution matrix M^* recover the soft cluster assignment y^* by setting y^* = \sqrt{\lambda_1} v_1, where \lambda_1, v_1 are the maximum eigenvalue and corresponding eigenvector of M^*.[2]

4 Experimental results

We implemented the maximum margin clustering algorithm based on the semidefinite programming formulation (7), using the SeDuMi library, and tested it on various data sets.

In these experiments we compared the performance of our maximum margin clustering technique to the spectral clustering method of [14] as well as straightforward k-means clustering. Both maximum margin clustering and spectral clustering were run with the same radial basis function kernel and matching width parameters. In fact, in each case, we chose the best width parameter for spectral clustering by searching over a small set of five widths related to the scale of the problem. In addition, the slack parameter for maximum margin clustering was simply set to an arbitrary value.[3]

To assess clustering performance we first took a set of labeled data, removed the labels, ran the clustering algorithms, labeled each of the resulting clusters with the majority class according to the original training labels, and finally measured the number of misclassifications made by each clustering.

Our first experiments were conducted on the synthetic data sets depicted in Figure 1.
Table 1 shows that for the first three sets of data (Gaussians, Circles, AI) maximum margin and spectral clustering obtained identical small error rates, which were in turn significantly smaller than those obtained by k-means. However, maximum margin clustering demonstrates a substantial advantage on the fourth data set (Joined Circles) over both spectral and k-means clustering.

[2] One could also employ randomized rounding to choose a hard class assignment y.
[3] It turns out that the slack parameter C did not have a significant effect on any of our preliminary investigations, so we just set it to C = 100 for all of the experiments reported here.

We also conducted clustering experiments on real data sets, two of which are depicted in Figures 2 and 3: a database of images of handwritten digits of twos and threes (Figure 2), and a database of face images of two people (Figure 3). The last two columns of Table 1 show that maximum margin clustering obtains a slight advantage on the handwritten digits data, and a significant advantage on the faces data.

5 Semi-supervised learning

Although the clustering results are reasonable, we have an additional goal of adapting the maximum margin approach to semisupervised learning. In this case, we assume we are given a labeled training set (x^1, y_1), ..., (x^n, y_n) as well as an unlabeled training set x^{n+1}, ..., x^N, and the goal is to combine the information in these two data sets to produce a more accurate classifier.

In the context of large margin classifiers, many techniques have been proposed for incorporating unlabeled data in an SVM, most of which are intuitively based on ensuring that large margins are also preserved on the unlabeled training data [8, 2], just as in our case.
However, none of these previous proposals have formulated a convex optimization procedure that is guaranteed to directly maximize the margin, as we propose in Section 3.

For our procedure, extending the maximum margin clustering approach of Section 3 to semisupervised training is easy: We simply add constraints on the matrix M to force it to respect the observed equivalence relations among the labeled training data. In addition, we impose the constraint that each unlabeled example belongs to the same class as at least one labeled training example. These conditions can be enforced with the simple set of additional linear constraints

    S1:  m_{ij} = y_i y_j   for labeled examples i, j \in \{1, ..., n\}
    S2:  \sum_{i=1}^{n} m_{ij} \ge 2 - n   for unlabeled examples j \in \{n+1, ..., N\}

Note that the observed training labels y_i for i \in \{1, ..., n\} are constants, and therefore the new constraints are still linear in the parameters of M that are being optimized.

The resulting training procedure is similar to that of [6], with the addition of the constraints L1-L4 and S2, which enforce two classes and facilitate the ability to perform clustering on the unlabeled examples.

6 Experimental results

We tested our approach to semisupervised learning on various two class data sets from the UCI repository. We compared the performance of our technique to the semisupervised SVM technique of [8]. In each case, we evaluated the techniques transductively. That is, we split the data into a labeled and unlabeled part, held out the labels of the unlabeled portion, trained the semisupervised techniques, reclassified the unlabeled examples using the learned results, and measured the misclassification error on the held out labels.

Here we see that the maximum margin approach based on semidefinite programming can often outperform the approach of [8]. Table 2 shows that our maximum margin method is effective at exploiting unlabeled data to improve the prediction of held out labels.
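Stepping back to the constraints S1 and S2: they are straightforward to state in code. The sketch below (toy labeling and helper names are illustrative assumptions, not the paper's implementation) checks both on a candidate relation matrix for a labeled/unlabeled split, and shows how S2 rules out an unlabeled point that agrees with no labeled point.

```python
import numpy as np

def semi_constraints_hold(M, y_labeled):
    """Check S1 and S2 on a candidate relation matrix M, where the first
    n points carry observed labels y_labeled and the rest are unlabeled."""
    n, N = len(y_labeled), len(M)
    # S1: m_ij = y_i y_j on labeled pairs
    s1 = np.array_equal(M[:n, :n], np.outer(y_labeled, y_labeled))
    # S2: sum_{i=1..n} m_ij >= 2 - n for every unlabeled column j, i.e. each
    # unlabeled point shares a class with at least one labeled point
    s2 = all(M[:n, j].sum() >= 2 - n for j in range(n, N))
    return bool(s1 and s2)

y_full = np.array([1, 1, -1, -1, 1, -1])    # toy ground-truth labeling
M = np.outer(y_full, y_full)
ok = semi_constraints_hold(M, y_full[:3])   # treat the first 3 points as labeled

bad = M.copy()                              # detach the last point from every
bad[:3, 5] = -1                             # labeled point: S1 still holds,
bad[5, :3] = -1                             # but S2 is now violated
bad_ok = semi_constraints_hold(bad, y_full[:3])

assert ok and not bad_ok
```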
In every case, it significantly reduces the error of plain SVM, and obtains the best overall performance of the semisupervised learning techniques we have investigated.

Figure 1: Four artificial data sets used in the clustering experiments. Each data set consists of eighty two-dimensional points. The points and stars show the two classes discovered by maximum margin clustering.

Figure 2: A sampling of the handwritten digits (twos and threes). Each row shows a random sampling of images from a cluster discovered by maximum margin clustering. Maximum margin made very few misclassifications on this data set, as shown in Table 1.

Figure 3: A sampling of the face data (two people). Each row shows a random sampling of images from a cluster discovered by maximum margin clustering. Maximum margin made no misclassifications on this data set, as shown in Table 1.

                        Gaussians   Circles   AI     Joined Circles   Digits   Faces
    Maximum Margin      1.25        0         0      1                3        0
    Spectral Clustering 1.25        0         0      24               6        16.7
    K-means             5           50        38.5   50               7        24.4

Table 1: Percentage misclassification errors of the various clustering algorithms on the various data sets.

                 HWD 1-7   HWD 2-3   UCI Austra.   UCI Flare   UCI Vote   UCI Diabet.
    Max Marg     3.3       4.7       32            34          14         35.55
    Spec Clust   4.2       6.4       48.7          40.7        13.8       44.67
    TSVM         4.6       5.4       38.7          33.3        17.5       35.89
    SVM          4.5       10.9      37.5          37          20.4       39.44

Table 2: Percentage misclassification errors of the various semisupervised learning algorithms on the various data sets. SVM uses no unlabeled data.
TSVM is due to [8].

7 Conclusion

We have proposed a simple, unified principle for clustering and semisupervised learning based on the maximum margin principle popularized by supervised SVMs. Interestingly, this criterion can be approximately optimized using an efficient semidefinite programming formulation. The results on both clustering and semisupervised learning are competitive with, and sometimes exceed, the state of the art. Overall, margin maximization appears to be an effective way to achieve a unified approach to these different learning problems.

For future work we plan to address the restrictions of the current method, including the omission of an offset b and the restriction to two class problems. We note that a multiclass extension to our approach is possible, but it is complicated by the fact that it cannot be conveniently based on the standard multiclass SVM formulation of [5].

Acknowledgements

Research supported by the Alberta Ingenuity Centre for Machine Learning, NSERC, MITACS, IRIS and the Canada Research Chairs program.

References

[1] A. Ben-Hur, D. Horn, H. Siegelman, and V. Vapnik. Support vector clustering. Journal of Machine Learning Research, 2, 2001.

[2] K. Bennett and A. Demiriz. Semi-supervised support vector machines. In Advances in Neural Information Processing Systems 11 (NIPS-98), 1998.

[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge U. Press, 2004.

[4] C. Chennubhotla and A. Jepson. EigenCuts: Half-lives of EigenFlows for spectral clustering. In Advances in Neural Information Processing Systems (NIPS-02), 2002.

[5] K. Crammer and Y. Singer. On the algorithmic interpretation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 2001.

[6] T. De Bie and N. Cristianini. Convex methods for transduction. In Advances in Neural Information Processing Systems 16 (NIPS-03), 2003.

[7] C. Helmberg.
Semidefinite programming for combinatorial optimization. Technical Report ZIB-Report ZR-00-34, Konrad-Zuse-Zentrum Berlin, 2000.

[8] T. Joachims. Transductive inference for text classification using support vector machines. In International Conference on Machine Learning (ICML-99), 1999.

[9] Y. Kluger, R. Basri, J. Chang, and M. Gerstein. Spectral biclustering of microarray cancer data: co-clustering genes and conditions. Genome Research, 13, 2003.

[10] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 2004.

[11] M. Laurent and S. Poljak. On a positive semidefinite relaxation of the cut polytope. Linear Algebra and its Applications, 223/224, 1995.

[12] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. In Conference on Foundations of Computer Science (FOCS-02), 2002.

[13] N. Cristianini, J. Shawe-Taylor, and J. Kandola. Spectral kernel methods for clustering. In Advances in Neural Information Processing Systems 14 (NIPS-01), 2001.

[14] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems 14 (NIPS-01), 2001.

[15] B. Schoelkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

[16] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22(8), 2000.

[17] Y. Weiss. Segmentation using eigenvectors: a unifying view. In International Conference on Computer Vision (ICCV-99), 1999.

[18] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions.
In International Conference on Machine Learning (ICML-03), 2003.
", "award": [], "sourceid": 2602, "authors": [{"given_name": "Linli", "family_name": "Xu", "institution": null}, {"given_name": "James", "family_name": "Neufeld", "institution": null}, {"given_name": "Bryce", "family_name": "Larson", "institution": null}, {"given_name": "Dale", "family_name": "Schuurmans", "institution": null}]}