{"title": "Generalized Maximum Margin Clustering and Unsupervised Kernel Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1417, "page_last": 1424, "abstract": null, "full_text": "Generalized Maximum Margin Clustering and Unsupervised Kernel Learning\n\nHamed Valizadegan Computer Science and Engineering Michigan State University East Lansing, MI 48824 valizade@msu.edu\n\nRong Jin Computer Science and Engineering Michigan State University East Lansing, MI 48824 rongjin@cse.msu.edu\n\nAbstract\nMaximum margin clustering was recently proposed and has shown promising performance [1, 2]. It extends the theory of support vector machines to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that call into question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary passes through the origin, which makes it unsuitable for clustering unbalanced datasets. Third, it is sensitive to the choice of kernel function and requires an external procedure to determine appropriate values for the kernel parameters. In this paper, we propose a \"generalized maximum margin clustering\" framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundary, including those that do not pass through the origin. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. 
We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository.\n\n1 Introduction\n\nData clustering, the unsupervised classification of samples into groups, has been an important research area in machine learning for several decades. A large number of algorithms have been developed for data clustering, including the k-means algorithm [3], mixture models [4], and spectral clustering [5, 6, 7, 8, 9]. More recently, maximum margin clustering [1, 2] was proposed for data clustering and has shown promising performance. The key idea of maximum margin clustering is to extend the theory of support vector machines to unsupervised learning. However, despite its success, the following three major problems have prevented maximum margin clustering from being applied to real-world applications:\n\n• High computational cost. The number of parameters in maximum margin clustering is quadratic in the number of examples. Thus, it is difficult to scale to large datasets. Figure 1 shows the computational time (in seconds) of the maximum margin clustering algorithm with respect to different numbers of examples. We clearly see that the computational time increases dramatically when we apply the maximum margin clustering algorithm to even modest numbers of examples.\n\n• Requiring clustering boundaries to pass through the origin. One important assumption made by maximum margin clustering in [1] is that the clustering boundaries pass through the origin. To this end, maximum margin clustering requires centering the data points around the origin before clustering. It is important to note that centering the data points at the origin does not guarantee that clustering boundaries go through the origin, particularly when cluster sizes are unbalanced, with one cluster significantly larger than the other.\n\n• Sensitivity to the choice of kernel function. Figure 2(b) shows the clustering error of maximum margin clustering for the synthesized data of two overlapped Gaussian clusters (Figure 2(a)) using the RBF kernel with different kernel widths. We see that the performance of maximum margin clustering depends critically on the choice of kernel width. The same problem is also observed in spectral clustering [10]. Although a number of studies [8, 9, 10, 6] are devoted to automatically identifying appropriate kernel matrices in clustering, they are either heuristic approaches or require additional labeled data.\n\nFigure 1: The scalability of the original maximum margin clustering algorithm versus the generalized maximum margin clustering algorithm (horizontal axis: number of samples; vertical axis: time in seconds).\n\nFigure 2: Clustering error of spectral clustering using the RBF kernel with different kernel widths. (a) Data distribution. (b) Clustering error versus kernel width. The horizontal axis of Figure 2(b) represents the percentage of the distance range (i.e., the difference between the maximum and the minimum distance) that is used for the kernel width.\n\nIn this paper, we propose a \"generalized maximum margin clustering\" framework that resolves the above three problems simultaneously. In particular, the proposed framework reformulates the problem of maximum margin clustering to include the bias term in the classification boundary, and therefore removes the assumption that clustering boundaries have to pass through the origin. 
Furthermore, the new formulation reduces the number of parameters to be linear in the number of examples, and therefore significantly reduces the computational cost. Finally, it is equipped with the capability of unsupervised kernel learning, and is therefore able to determine the appropriate kernel matrix and clustering memberships simultaneously. More interestingly, we will show that spectral clustering, such as the normalized cut algorithm, can be viewed as a special case of generalized maximum margin clustering. The remainder of the paper is organized as follows: Section 2 reviews the work on maximum margin clustering and kernel learning. Section 3 presents the framework of generalized maximum margin clustering. Our empirical studies are presented in Section 4. Section 5 concludes this work.\n\n2 Related Work\n\nThe key idea of maximum margin clustering is to extend the theory of support vector machines to unsupervised learning. Given the training examples $D = (x_1, x_2, \\ldots, x_n)$ and their class labels $y = (y_1, y_2, \\ldots, y_n) \\in \\{-1, +1\\}^n$, the dual problem of the support vector machine can be written as:\n\n$$\\max_{\\alpha \\in \\mathbb{R}^n} \\; \\alpha^\\top e - \\frac{1}{2} \\alpha^\\top \\operatorname{diag}(y) K \\operatorname{diag}(y) \\alpha \\quad \\text{s.t.} \\quad 0 \\le \\alpha \\le C, \\; \\alpha^\\top y = 0 \\qquad (1)$$\n\nwhere $K \\in \\mathbb{R}^{n \\times n}$ is the kernel matrix and $\\operatorname{diag}(y)$ stands for the diagonal matrix that uses the vector y as its diagonal elements. To apply the above formulation to unsupervised learning, the maximum margin clustering approach relaxes the class labels y to continuous variables, and searches for both y and $\\alpha$ that maximize the classification margin. This leads to the following optimization problem:\n\n$$\\min_{y, \\nu, \\delta, \\lambda} \\; t \\quad \\text{s.t.} \\quad \\begin{pmatrix} (y y^\\top) \\circ K & e + \\nu - \\delta + \\lambda y \\\\ (e + \\nu - \\delta + \\lambda y)^\\top & t - 2C \\delta^\\top e \\end{pmatrix} \\succeq 0, \\quad \\nu \\ge 0, \\; \\delta \\ge 0$$\n\nwhere $\\circ$ stands for the element-wise product between two matrices. To convert the above problem into a convex programming problem, the authors of [1] make two important relaxations. The first one relaxes $y y^\\top$ into a positive semidefinite (PSD) matrix $M \\succeq 0$ whose diagonal elements are set to 1. 
The second relaxation sets $\\lambda = 0$, which is equivalent to assuming that there is no bias term b in the expression of classification boundaries, or in other words, that classification boundaries have to pass through the origin of the data. These two relaxations simplify the above optimization problem as follows:\n\n$$\\min_{M, \\nu, \\delta} \\; t \\quad \\text{s.t.} \\quad \\begin{pmatrix} M \\circ K & e + \\nu - \\delta \\\\ (e + \\nu - \\delta)^\\top & t - 2C \\delta^\\top e \\end{pmatrix} \\succeq 0, \\quad \\nu \\ge 0, \\; \\delta \\ge 0, \\; M \\succeq 0 \\qquad (2)$$\n\nFinally, a few additional constraints on M are added to the above optimization problem to prevent skewed cluster sizes [1]. As a consequence of these two relaxations, the number of parameters is increased from n to $n^2$, which significantly increases the computational cost. Furthermore, by setting $\\lambda = 0$, the maximum margin clustering algorithm requires clustering boundaries to pass through the origin of the data, which is unsuitable for clustering data with unbalanced clusters. Another important problem with the above maximum margin clustering is the difficulty in determining the appropriate kernel similarity matrix K. Although many kernel-based clustering algorithms set the kernel parameters manually, there are several studies devoted to the automatic selection of kernel functions, in particular the kernel width for the RBF kernel, i.e., $\\sigma$ in $\\exp\\left(-\\|x_i - x_j\\|_2^2 / (2\\sigma^2)\\right)$. Shi et al. [8] recommended choosing the kernel width as 10% to 20% of the range of the distance between samples. However, in our experiment, we found that this is not always a good choice, and in many situations it produces poor results. Ng et al. [9] chose the kernel width that provides the least distorted clusters by running the same clustering algorithm several times for each kernel width. Although this approach seems to generate good results, it requires running separate experiments for each kernel width, and therefore could be computationally intensive. Zelnik-Manor and Perona [10] proposed a self-tuning spectral clustering algorithm that computes a different local kernel width for each data point $x_i$. 
In particular, the local kernel width for each $x_i$ is computed as the distance from $x_i$ to its k-th nearest neighbor. Although empirical study seems to show the effectiveness of this approach, it is unclear how to find the optimal k in computing the local kernel width. As we will see in the experiment section, the clustering accuracy depends heavily on the choice of k. Finally, we briefly review the existing work on kernel learning. Most previous work focuses on supervised kernel learning. The representative approaches in this category include kernel alignment [11, 12], semidefinite programming [13], and spectral graph partitioning [6]. Unlike these approaches, the proposed framework is designed for unsupervised kernel learning.\n\n3 Generalized Maximum Margin Clustering and Unsupervised Kernel Learning\n\nWe first present the proposed clustering algorithm for the hard margin case, followed by the extensions to soft margin and unsupervised kernel learning.\n\n3.1 Hard Margin\n\nIn the case of hard margin, the dual problem of SVM is almost identical to the problem in Eqn. (1) except that the parameter $\\alpha$ does not have the upper bound C. Following [13], we further convert the problem in (1) into its dual form:\n\n$$\\min_{\\nu, y, \\lambda} \\; \\frac{1}{2} (e + \\nu + \\lambda y)^\\top \\operatorname{diag}(y) K^{-1} \\operatorname{diag}(y) (e + \\nu + \\lambda y) \\quad \\text{s.t.} \\quad \\nu \\ge 0, \\; y \\in \\{+1, -1\\}^n \\qquad (3)$$\n\nwhere e is a vector with all its elements being one. Unlike the treatment in [13], which rewrites the above problem as a semidefinite programming problem, we introduce a variable z that is defined as follows:\n\n$$z = \\operatorname{diag}(y)(e + \\nu)$$\n\nGiven that $\\nu \\ge 0$, the above expression for z is essentially equivalent to the constraint $|z_i| \\ge 1$, or $z_i^2 \\ge 1$, for $i = 1, 2, \\ldots, n$. Then, since $\\operatorname{diag}(y) y = e$, the optimization problem in (3) is rewritten as follows:\n\n$$\\min_{z, \\lambda} \\; \\frac{1}{2} (z + \\lambda e)^\\top K^{-1} (z + \\lambda e) \\quad \\text{s.t.} \\quad z_i^2 \\ge 1, \\; i = 1, 2, \\ldots, n \\qquad (4)$$\n\nNote that the above problem may not have unique solutions for z and $\\lambda$ due to the translation invariance of the objective function. 
More specifically, given an optimal solution z and $\\lambda$, we may be able to construct another solution $z'$ and $\\lambda'$ such that $z' = z + \\epsilon e$ and $\\lambda' = \\lambda - \\epsilon$. Evidently, both solutions result in the same value for the objective function in (4). Furthermore, with appropriately chosen $\\epsilon$, the new solution $z'$ and $\\lambda'$ will be able to satisfy the constraint $z_i'^2 \\ge 1$. Thus, $z'$ and $\\lambda'$ is another optimal solution for (3). This is in fact related to the problem in SVM where the bias term b may not be unique [14]. To remove the translation invariance from the objective function, we introduce an additional term $C_e (z^\\top e)^2$ into the objective function, i.e.\n\n$$\\min_{z, \\lambda} \\; \\frac{1}{2} (z + \\lambda e)^\\top K^{-1} (z + \\lambda e) + C_e (z^\\top e)^2 \\quad \\text{s.t.} \\quad z_i^2 \\ge 1, \\; i = 1, 2, \\ldots, n \\qquad (5)$$\n\nwhere the constant $C_e$ weights the importance of the penalty term against the original objective. It is set to 10,000 in our experiment. For simplicity of expression, we further define $w = (z; \\lambda)$ and $P = (I_n, e)$. Then, the problem in (5) becomes\n\n$$\\min_{w \\in \\mathbb{R}^{n+1}} \\; w^\\top P^\\top K^{-1} P w + C_e (e_0^\\top w)^2 \\quad \\text{s.t.} \\quad w_i^2 \\ge 1, \\; i = 1, 2, \\ldots, n \\qquad (6)$$\n\nwhere $e_0$ is a vector with all its elements being 1 except its last element, which is zero. We then construct the Lagrangian as follows:\n\n$$L(w, \\gamma) = w^\\top P^\\top K^{-1} P w + C_e (e_0^\\top w)^2 - \\sum_{i=1}^n \\gamma_i \\left( w^\\top I_{n+1}^i w - 1 \\right) = w^\\top \\Big( P^\\top K^{-1} P + C_e e_0 e_0^\\top - \\sum_{i=1}^n \\gamma_i I_{n+1}^i \\Big) w + \\sum_{i=1}^n \\gamma_i$$\n\nwhere $I_{n+1}^i$ is an $(n + 1) \\times (n + 1)$ matrix with all the elements being zero except the i-th diagonal element, which is 1. Hence, the dual problem of (6) is\n\n$$\\max_{\\gamma \\in \\mathbb{R}^n} \\; \\sum_{i=1}^n \\gamma_i \\quad \\text{s.t.} \\quad P^\\top K^{-1} P + C_e e_0 e_0^\\top - \\sum_{i=1}^n \\gamma_i I_{n+1}^i \\succeq 0, \\quad \\gamma_i \\ge 0, \\; i = 1, 2, \\ldots, n \\qquad (7)$$\n\nFinally, the solution w can be computed using the KKT condition, i.e.,\n\n$$\\Big( P^\\top K^{-1} P + C_e e_0 e_0^\\top - \\sum_{i=1}^n \\gamma_i I_{n+1}^i \\Big) w = 0_{n+1}$$\n\nIn other words, the solution w is proportional to the eigenvector of the matrix $P^\\top K^{-1} P + C_e e_0 e_0^\\top - \\sum_{i=1}^n \\gamma_i I_{n+1}^i$ for the zero eigenvalue. Since $w_i = (1 + \\nu_i) y_i$, $i = 1, 2, \\ldots, n$, and $\\nu_i \\ge 0$, the class labels $\\{y_i\\}_{i=1}^n$ can be inferred directly from the sign of $\\{w_i\\}_{i=1}^n$.\n\nRemark I It is important to realize that the problem in (5) is non-convex due to the non-convex constraint $w_i^2 \\ge 1$. Thus, the optimal solution found by the dual problem in (7) is not necessarily the optimal solution for the primal problem in (5). Our hope is that although the solution found by the dual problem is not optimal for the primal problem, it is still a good solution for the primal problem in (5). This is similar to the SDP relaxation made by the maximum margin clustering algorithm in (2), which relaxes a non-convex programming problem into a convex one. However, unlike the relaxation made in (2), which increases the number of variables from n to $n^2$, the new formulation of maximum margin clustering does not increase the number of parameters (i.e., $\\gamma$), and therefore will be computationally more efficient. This is shown in Figure 1, in which the computational time of generalized maximum margin clustering increases much more slowly than that of the maximum margin clustering algorithm.\n\nRemark II To avoid the high computational cost in estimating $K^{-1}$, we replace $K^{-1}$ with its normalized graph Laplacian $L(K)$ [15], which is defined as $L(K) = I - D^{-1/2} K D^{-1/2}$, where D is a diagonal matrix whose diagonal elements are computed as $D_{i,i} = \\sum_{j=1}^n K_{i,j}$, $i = 1, 2, \\ldots, n$. This is equivalent to defining a kernel matrix $\\tilde{K} = L(K)^{\\dagger}$, where $\\dagger$ stands for the pseudo-inverse operator. More interestingly, we have the following theorem showing the relationship between generalized maximum margin clustering and the normalized cut.\n\nTheorem 1. The normalized cut algorithm is a special case of the generalized maximum margin clustering in (7) if the following conditions hold, i.e., (1) $K^{-1}$ is set to be the normalized Laplacian $L(K)$, (2) all the $\\gamma$'s are enforced to be the same, i.e., $\\gamma_i = \\gamma_0$, $i = 1, 2, \\ldots, n$, and (3) $C_e \\gg 1$. 
Proof sketch: Given conditions 1 to 3 in the theorem, the optimization problem in (7) becomes $\\max \\gamma_0$ s.t. $L(K) \\succeq \\gamma_0 I_n$, and the solution for this problem is the largest eigenvector of $L(K)$.\n\n3.2 Soft Margin\n\nWe extend the formulation in (7) to the case of soft margin by considering the following problem:\n\n$$\\min_{\\nu, y, \\delta, \\lambda} \\; \\frac{1}{2} (e + \\nu - \\delta + \\lambda y)^\\top \\operatorname{diag}(y) K^{-1} \\operatorname{diag}(y) (e + \\nu - \\delta + \\lambda y) + C_\\delta \\sum_{i=1}^n \\delta_i^2 \\quad \\text{s.t.} \\quad \\nu \\ge 0, \\; \\delta \\ge 0, \\; y \\in \\{+1, -1\\}^n \\qquad (8)$$\n\nwhere $C_\\delta$ weights the importance of the clustering errors against the clustering margin. Similar to the previous derivation, we introduce the slack variable z and simplify the above problem as follows:\n\n$$\\min_{z, \\delta, \\lambda} \\; \\frac{1}{2} (z + \\lambda e)^\\top K^{-1} (z + \\lambda e) + C_e (z^\\top e)^2 + C_\\delta \\sum_{i=1}^n \\delta_i^2 \\quad \\text{s.t.} \\quad (z_i + \\delta_i)^2 \\ge 1, \\; \\delta_i \\ge 0, \\; i = 1, 2, \\ldots, n \\qquad (9)$$\n\nBy approximating $(z_i + \\delta_i)^2$ as $z_i^2 + \\delta_i^2$, we have the dual form of the above problem written as:\n\n$$\\max_{\\gamma \\in \\mathbb{R}^n} \\; \\sum_{i=1}^n \\gamma_i \\quad \\text{s.t.} \\quad P^\\top K^{-1} P + C_e e_0 e_0^\\top - \\sum_{i=1}^n \\gamma_i I_{n+1}^i \\succeq 0, \\quad 0 \\le \\gamma_i \\le C_\\delta, \\; i = 1, 2, \\ldots, n \\qquad (10)$$\n\nThe main difference between the above formulation and the one in (7) is the introduction of the upper bound $C_\\delta$ for $\\gamma$ in the case of soft margin. In the experiment, we set the parameter $C_\\delta$ to 100,000, a very large value.\n\n3.3 Unsupervised Kernel Learning\n\nAs already pointed out, the performance of many clustering algorithms depends on the right choice of the kernel similarity matrix. To address this problem, we extend the formulation in (10) by including a kernel learning mechanism. In particular, we assume that a set of m kernel similarity matrices $K_1, K_2, \\ldots, K_m$ is available. Our goal is to identify the linear combination of kernel matrices, i.e., $K = \\sum_{i=1}^m \\beta_i K_i$, that leads to the optimal clustering accuracy. More specifically, we need to solve the following optimization problem:\n\n$$\\max_{\\gamma, \\beta} \\; \\sum_{i=1}^n \\gamma_i \\quad \\text{s.t.} \\quad P^\\top \\Big( \\sum_{i=1}^m \\beta_i K_i \\Big)^{-1} P + C_e e_0 e_0^\\top - \\sum_{i=1}^n \\gamma_i I_{n+1}^i \\succeq 0, \\quad 0 \\le \\gamma_i \\le C_\\delta, \\; i = 1, 2, \\ldots, n, \\quad \\sum_{i=1}^m \\beta_i = 1, \\; \\beta_i \\ge 0, \\; i = 1, 2, \\ldots, m \\qquad (11)$$\n\nUnfortunately, it is difficult to solve the above problem due to the complexity introduced by $\\left( \\sum_{i=1}^m \\beta_i K_i \\right)^{-1}$. Hence, we consider an alternative problem. We first introduce a set of normalized graph Laplacians $L_1, L_2, \\ldots, L_m$. Each Laplacian $L_i$ is constructed from the kernel similarity matrix $K_i$. We then define the inverse of the combined matrix as $K^{-1} = \\sum_{i=1}^m \\beta_i L_i$. Then, we have the following optimization problem:\n\n$$\\max_{\\gamma, \\beta} \\; \\sum_{i=1}^n \\gamma_i \\quad \\text{s.t.} \\quad \\sum_{i=1}^m \\beta_i P^\\top L_i P + C_e e_0 e_0^\\top - \\sum_{i=1}^n \\gamma_i I_{n+1}^i \\succeq 0, \\quad 0 \\le \\gamma_i \\le C_\\delta, \\; i = 1, 2, \\ldots, n, \\quad \\sum_{i=1}^m \\beta_i = 1, \\; \\beta_i \\ge 0, \\; i = 1, 2, \\ldots, m \\qquad (12)$$\n\nBy solving the above problem, we are able to resolve both $\\gamma$ (corresponding to clustering memberships) and $\\beta$ (corresponding to kernel learning) simultaneously.\n\nFigure 3: Data distribution of the three synthesized datasets: (a) Overlapped Gaussian, (b) Two Circles, (c) Two Connected Circles.\n\n4 Experiment\n\nWe tested the generalized maximum margin clustering algorithm on both synthetic datasets and real datasets from the UCI repository. Figure 3 gives the distribution of the synthetic datasets. The four UCI datasets used in our study are \"Vote\", \"Digits\", \"Ionosphere\", and \"Breast\". These four datasets comprise 218, 180, 351, and 285 examples, respectively, and each example in these four datasets is represented by 17, 64, 35, and 32 features, respectively. Since the \"Digits\" dataset consists of multiple classes, we further decompose it into four binary datasets, each containing a pair of digits that are difficult to distinguish. Both the normalized cut algorithm [8] and the maximum margin clustering algorithm [1] are used as baselines. The RBF kernel is used throughout this study to construct the kernel similarity matrices. 
In our first experiment, we examine the optimal performance of each clustering algorithm by using the optimal kernel width, which is found through an exhaustive search. The optimal clustering errors of these three algorithms are summarized in the first three columns of Table 1. It is clear that the generalized maximum margin clustering algorithm achieves similar or better performance than both maximum margin clustering and normalized cut for most datasets when they are given the optimal kernel matrices. Note that the results of maximum margin clustering are reported for a subset of samples (80 instances) of the UCI datasets due to out-of-memory problems.\n\nTable 1: Clustering error (%) of normalized cut (NC), maximum margin clustering (MMC), generalized maximum margin clustering (GMMC) and self-tuning spectral clustering (ST).\nDataset | NC (optimal kernel width) | MMC (optimal kernel width) | GMMC (optimal kernel width) | GMMC (unsupervised kernel learning) | ST (best k) | ST (worst k)\nTwo Circles | 2 | 0 | 0 | 0 | 0 | 50\nTwo Jointed Circles | 7 | 6.25 | 0 | 0 | 1 | 45\nTwo Gaussian | 1.25 | 2.5 | 1.25 | 3.75 | 5 | 7.5\nVote | 25 | 15 | 9.6 | 11.90 | 11 | 40\nDigits 3-8 | 35 | 10 | 5.6 | 5.6 | 5 | 50\nDigits 1-7 | 45 | 31.25 | 2.2 | 3 | 0 | 47\nDigits 2-7 | 34 | 1.25 | 0.5 | 5.6 | 1.5 | 50\nDigits 8-9 | 48 | 3.75 | 16 | 12 | 9 | 48\nIonosphere | 25 | 21.25 | 23.5 | 27.3 | 26.5 | 48\nBreast | 36.5 | 38.75 | 36.1 | 37 | 37.5 | 41.5\n\nIn the second experiment, we evaluate the effectiveness of unsupervised kernel learning. Ten kernel matrices are created by using the RBF kernel with the kernel width varied from 10% to 100% of the range of distance between any two examples. We compare the proposed unsupervised kernel learning to the self-tuning spectral clustering algorithm in [10]. One problem with the self-tuning spectral clustering algorithm is that its clustering error usually depends on the parameter k, i.e., the number of nearest neighbors used for computing the kernel width. To provide a full picture of self-tuning spectral clustering, we vary k from 1 to 15, and calculate both the best and the worst performance over the different values of k. 
The last three columns of Table 1 summarize the clustering errors of generalized maximum margin clustering and self-tuning spectral clustering with both the best and the worst k. First, observe the big gap between the best and worst performance of self-tuning spectral clustering with different choices of k, which implies that this algorithm is sensitive to the parameter k. Second, for most datasets, generalized maximum margin clustering achieves performance similar to that of self-tuning spectral clustering with the best k. Furthermore, for a number of datasets, the unsupervised kernel learning method achieves performance close to that obtained with the optimal kernel width. Both results indicate that the proposed algorithm for unsupervised kernel learning is effective in identifying appropriate kernels.\n\n5 Conclusion\n\nIn this paper, we proposed a framework for generalized maximum margin clustering. Compared to the existing algorithm for maximum margin clustering, the new framework has three advantages: 1) it reduces the number of parameters from $n^2$ to n, and therefore has a significantly lower computational cost, 2) it allows for clustering boundaries that do not pass through the origin, and 3) it can automatically identify the appropriate kernel similarity matrix through unsupervised kernel learning. Our empirical study with three synthetic datasets and four UCI datasets shows the promising performance of our proposed algorithm.\n\nReferences\n[1] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In Advances in Neural Information Processing Systems (NIPS) 17, 2004.\n[2] L. Xu and D. Schuurmans. Unsupervised and semi-supervised multi-class support vector machines. In Proceedings of the 20th National Conference on Artificial Intelligence (AAAI-05), 2005.\n[3] J. Hartigan and M. Wong. A k-means clustering algorithm. Appl. Statist., 28:100-108, 1979.\n[4] R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. 
SIAM Review, 26:195-239, 1984.\n[5] C. Ding, X. He, H. Zha, M. Gu, and H. Simon. A min-max cut algorithm for graph partitioning and data clustering. In Proc. IEEE Int'l Conf. Data Mining, 2001.\n[6] F. R. Bach and M. I. Jordan. Learning spectral clustering. In Advances in Neural Information Processing Systems 16, 2004.\n[7] R. Jin, C. Ding, and F. Kang. A probabilistic approach for optimizing spectral clustering. In Advances in Neural Information Processing Systems 18, 2006.\n[8] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.\n[9] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, 2001.\n[10] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In Advances in Neural Information Processing Systems 17, pages 1601-1608, 2005.\n[11] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. S. Kandola. On kernel-target alignment. In NIPS, pages 367-373, 2001.\n[12] X. Zhu, J. Kandola, Z. Ghahramani, and J. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. In Advances in Neural Information Processing Systems 17, pages 1641-1648, 2005.\n[13] G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27-72, 2004.\n[14] C. J. C. Burges and D. J. Crisp. Uniqueness theorems for kernel methods. Neurocomputing, 55(1-2):187-220, 2003.\n[15] F. R. K. Chung. Spectral Graph Theory. Amer. Math. Society, 1997.\n", "award": [], "sourceid": 3072, "authors": [{"given_name": "Hamed", "family_name": "Valizadegan", "institution": null}, {"given_name": "Rong", "family_name": "Jin", "institution": null}]}