{"title": "Discriminative K-means for Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 1649, "page_last": 1656, "abstract": "We present a theoretical study on the discriminative clustering framework, recently proposed for simultaneous subspace selection via linear discriminant analysis (LDA) and clustering. Empirical results have shown its favorable performance in comparison with several other popular clustering algorithms. However, the inherent relationship between subspace selection and clustering in this framework is not well understood, due to the iterative nature of the algorithm. We show in this paper that this iterative subspace selection and clustering is equivalent to kernel K-means with a specific kernel Gram matrix. This provides significant and new insights into the nature of this subspace selection procedure. Based on this equivalence relationship, we propose the Discriminative K-means (DisKmeans) algorithm for simultaneous LDA subspace selection and clustering, as well as an automatic parameter estimation procedure. We also present the nonlinear extension of DisKmeans using kernels. We show that the learning of the kernel matrix over a convex set of pre-specified kernel matrices can be incorporated into the clustering formulation. The connection between DisKmeans and several other clustering algorithms is also analyzed. 
The presented theories and algorithms are evaluated through experiments on a collection of benchmark data sets.", "full_text": "Discriminative K-means for Clustering\n\nJieping Ye\nArizona State University\nTempe, AZ 85287\njieping.ye@asu.edu\n\nZheng Zhao\nArizona State University\nTempe, AZ 85287\nzhaozheng@asu.edu\n\nMingrui Wu\nMPI for Biological Cybernetics\nT\u00fcbingen, Germany\nmingrui.wu@tuebingen.mpg.de\n\nAbstract\n\nWe present a theoretical study on the discriminative clustering framework, recently proposed for simultaneous subspace selection via linear discriminant analysis (LDA) and clustering. Empirical results have shown its favorable performance in comparison with several other popular clustering algorithms. However, the inherent relationship between subspace selection and clustering in this framework is not well understood, due to the iterative nature of the algorithm. We show in this paper that this iterative subspace selection and clustering is equivalent to kernel K-means with a specific kernel Gram matrix. This provides significant and new insights into the nature of this subspace selection procedure. Based on this equivalence relationship, we propose the Discriminative K-means (DisKmeans) algorithm for simultaneous LDA subspace selection and clustering, as well as an automatic parameter estimation procedure. We also present the nonlinear extension of DisKmeans using kernels. We show that the learning of the kernel matrix over a convex set of pre-specified kernel matrices can be incorporated into the clustering formulation. The connection between DisKmeans and several other clustering algorithms is also analyzed. The presented theories and algorithms are evaluated through experiments on a collection of benchmark data sets.\n\n1 Introduction\nApplications in various domains such as text/web mining and bioinformatics often lead to very high-dimensional data. Clustering such high-dimensional data sets is a contemporary challenge, due to the curse of dimensionality. A common practice is to project the data onto a low-dimensional subspace through unsupervised dimensionality reduction such as Principal Component Analysis (PCA) [9] and various manifold learning algorithms [1, 13] before the clustering. However, the projection may not necessarily improve the separability of the data for clustering, due to the inherent separation between subspace selection (via dimensionality reduction) and clustering.\nOne natural way to overcome this limitation is to integrate dimensionality reduction and clustering in a joint framework. Several recent works [5, 10, 16] incorporate supervised dimensionality reduction such as Linear Discriminant Analysis (LDA) [7] into the clustering framework, performing clustering and LDA dimensionality reduction simultaneously. The algorithm, called Discriminative Clustering (DisCluster) in the following discussion, works in an iterative fashion, alternating between LDA subspace selection and clustering. In this framework, clustering generates the class labels for LDA, while LDA provides the subspace for clustering. Empirical results have shown the benefits of clustering in a low-dimensional discriminative space rather than in the principal component space (generative). However, the integration between subspace selection and clustering in DisCluster is not well understood, due to the intertwined and iterative nature of the algorithm.\nIn this paper, we analyze this discriminative clustering framework by studying several fundamental and important issues: (1) What do we really gain by performing clustering in a low-dimensional discriminative space? (2) What is the nature of its iterative process alternating between subspace selection and clustering? (3) Can this iterative process be simplified and improved? 
(4) How to estimate the parameter involved in the algorithm?\nThe main contributions of this paper are summarized as follows: (1) We show that the LDA projection can be factored out from the integrated LDA subspace selection and clustering formulation. This results in a simple trace maximization problem associated with a regularized Gram matrix of the data, which is controlled by a regularization parameter \u03bb; (2) The solution to this trace maximization problem leads to the Discriminative K-means (DisKmeans) algorithm for simultaneous LDA subspace selection and clustering. DisKmeans is shown to be equivalent to kernel K-means, where discriminative subspace selection essentially constructs a kernel Gram matrix for clustering. This provides new insights into the nature of this subspace selection procedure; (3) The DisKmeans algorithm is dependent on the value of the regularization parameter \u03bb. We propose an automatic parameter tuning process (model selection) for the estimation of \u03bb; (4) We propose the nonlinear extension of DisKmeans using kernels. We show that the learning of the kernel matrix over a convex set of pre-specified kernel matrices can be incorporated into the clustering formulation, resulting in a semidefinite program (SDP) [15]. We evaluate the presented theories and algorithms through experiments on a collection of benchmark data sets.\n2 Linear Discriminant Analysis and Discriminative Clustering\nConsider a data set consisting of n data points {x_i}_{i=1}^n \u2282 R^m. For simplicity, we assume the data is centered, that is, \u2211_{i=1}^n x_i/n = 0. Denote X = [x_1, \u00b7\u00b7\u00b7, x_n] as the data matrix whose i-th column is given by x_i. In clustering, we aim to group the data {x_i}_{i=1}^n into k clusters {C_j}_{j=1}^k. Let F \u2208 R^{n\u00d7k} be the cluster indicator matrix defined as follows:\n\nF = {f_{i,j}}_{n\u00d7k}, where f_{i,j} = 1 iff x_i \u2208 C_j. (1)\n\nWe can define the weighted cluster indicator matrix as follows [4]:\n\nL = [L_1, L_2, \u00b7\u00b7\u00b7, L_k] = F (F^T F)^{-1/2}. (2)\n\nIt follows that the j-th column of L is given by\n\nL_j = (0, . . . , 0, 1, . . . , 1, 0, . . . , 0)^T / n_j^{1/2}, (3)\n\nwhere the n_j entries equal to one correspond to the points in the j-th cluster C_j, and n_j is the sample size of C_j. Denote \u00b5_j = \u2211_{x\u2208C_j} x/n_j as the mean of the j-th cluster C_j. The within-cluster scatter, between-cluster scatter, and total scatter matrices are defined as follows [7]:\n\nS_w = \u2211_{j=1}^k \u2211_{x_i\u2208C_j} (x_i \u2212 \u00b5_j)(x_i \u2212 \u00b5_j)^T, S_b = \u2211_{j=1}^k n_j \u00b5_j \u00b5_j^T = X L L^T X^T, S_t = X X^T. (4)\n\nIt follows that trace(S_w) captures the intra-cluster distance, and trace(S_b) captures the inter-cluster distance. It can be shown that S_t = S_w + S_b.\nGiven the cluster indicator matrix F (or L), Linear Discriminant Analysis (LDA) aims to compute a linear transformation (projection) P \u2208 R^{m\u00d7d} that maps each x_i in the m-dimensional space to a vector \u02c6x_i in the d-dimensional space (d < m) as follows: x_i \u2208 R^m \u2192 \u02c6x_i = P^T x_i \u2208 R^d, such that the following objective function is maximized [7]: trace((P^T S_w P)^{-1} P^T S_b P). Since S_t = S_w + S_b, the optimal transformation matrix P is also given by maximizing the following objective function:\n\ntrace((P^T S_t P)^{-1} P^T S_b P). (5)\n\nFor high-dimensional data, the estimation of the total scatter (covariance) matrix is often not reliable. The regularization technique [6] is commonly applied to improve the estimation as follows:\n\n\u02dcS_t = S_t + \u03bb I_m = X X^T + \u03bb I_m, (6)\n\nwhere I_m is the identity matrix of size m and \u03bb > 0 is a regularization parameter.\nIn Discriminative Clustering (DisCluster) [5, 10, 16], the transformation matrix P and the weighted cluster indicator matrix L are computed by maximizing the following objective function:\n\nf(L, P) \u2261 trace((P^T \u02dcS_t P)^{-1} P^T S_b P) = trace((P^T (X X^T + \u03bb I_m) P)^{-1} P^T X L L^T X^T P). (7)\n\nThe algorithm works in an intertwined and iterative fashion, alternating between the computation of L for a given P and the computation of P for a given L. More specifically, for a given L, P is given by the standard LDA procedure. Since trace(AB) = trace(BA) for any two matrices [8], for a given P, the objective function f(L, P) can be expressed as:\n\nf(L, P) = trace(L^T X^T P (P^T (X X^T + \u03bb I_m) P)^{-1} P^T X L). (8)\n\nNote that L is not an arbitrary matrix, but a weighted cluster indicator matrix, as defined in Eq. 
(3).\nThe optimal L can be computed by applying the gradient descent strategy [10] or by solving a kernel K-means problem [5, 16] with X^T P (P^T (X X^T + \u03bb I_m) P)^{-1} P^T X as the kernel Gram matrix [4]. The algorithm is guaranteed to converge in terms of the value of the objective function f(L, P), as the value of f(L, P) monotonically increases and is bounded from above.\nExperiments [5, 10, 16] have shown the effectiveness of DisCluster in comparison with several other popular clustering algorithms. However, the inherent relationship between subspace selection via LDA and clustering is not well understood, and there is a need for further investigation. We show in the next section that the iterative subspace selection and clustering in DisCluster is equivalent to kernel K-means with a specific kernel Gram matrix. Based on this equivalence relationship, we propose the Discriminative K-means (DisKmeans) algorithm for simultaneous LDA subspace selection and clustering.\n3 DisKmeans: Discriminative K-means with a Fixed \u03bb\nAssume that \u03bb is a fixed positive constant. Let's consider the maximization of the function in Eq. (7):\n\nf(L, P) = trace((P^T (X X^T + \u03bb I_m) P)^{-1} P^T X L L^T X^T P). (9)\n\nHere, P is a transformation matrix and L is a weighted cluster indicator matrix as in Eq. (3). It follows from the Representer Theorem [14] that the optimal transformation matrix P \u2208 R^{m\u00d7d} can be expressed as P = X H, for some matrix H \u2208 R^{n\u00d7d}. Denote G = X^T X as the Gram matrix, which is symmetric and positive semidefinite. It follows that\n\nf(L, P) = trace((H^T (G G + \u03bb G) H)^{-1} H^T G L L^T G H). (10)\n\nWe show that the matrix H can be factored out from the objective function in Eq. (10), thus dramatically simplifying the optimization problem in the original DisCluster algorithm. The main result is summarized in the following theorem:\nTheorem 3.1. Let G be the Gram matrix defined as above and \u03bb > 0 be the regularization parameter. Let L\u2217 and P\u2217 be the optimal solution to the maximization of the objective function f(L, P) in Eq. (7). Then L\u2217 solves the following maximization problem:\n\nL\u2217 = arg max_L trace(L^T (I_n \u2212 (I_n + (1/\u03bb) G)^{-1}) L). (11)\n\nProof. Let G = U \u03a3 U^T be the Singular Value Decomposition (SVD) [8] of G, where U \u2208 R^{n\u00d7n} is orthogonal, \u03a3 = diag(\u03c3_1, \u00b7\u00b7\u00b7, \u03c3_t, 0, \u00b7\u00b7\u00b7, 0) \u2208 R^{n\u00d7n} is diagonal, and t = rank(G). Let U_1 \u2208 R^{n\u00d7t} consist of the first t columns of U and \u03a3_t = diag(\u03c3_1, \u00b7\u00b7\u00b7, \u03c3_t) \u2208 R^{t\u00d7t}. Then\n\nG = U \u03a3 U^T = U_1 \u03a3_t U_1^T. (12)\n\nDenote R = (\u03a3_t^2 + \u03bb \u03a3_t)^{-1/2} \u03a3_t U_1^T L and let R = M \u03a3_R N^T be the SVD of R, where M and N are orthogonal and \u03a3_R is diagonal with rank(\u03a3_R) = rank(R) = q. Define the matrix Z = U diag((\u03a3_t^2 + \u03bb \u03a3_t)^{-1/2} M, I_{n\u2212t}), where diag(A, B) is a block diagonal matrix. It follows that\n\nZ^T (G G + \u03bb G) Z = diag(I_t, 0), Z^T (G L L^T G) Z = diag(\u02dc\u03a3, 0), (13)\n\nwhere \u02dc\u03a3 = (\u03a3_R)^2 is diagonal with non-increasing diagonal entries. It can be verified that\n\nf(L, P) \u2264 trace(\u02dc\u03a3) = trace((G G + \u03bb G)^+ G L L^T G) = trace(L^T G (G G + \u03bb G)^+ G L) = trace(L^T (I_n \u2212 (I_n + (1/\u03bb) G)^{-1}) L), (14)\n\nwhere the equality holds when P = X H and H consists of the first q columns of Z.\n3.1 Computing the Weighted Cluster Matrix L\nThe weighted cluster indicator matrix L solving the maximization problem in Eq. (11) can be computed by solving a kernel K-means problem [5] with the kernel Gram matrix given by\n\n\u02dcG = I_n \u2212 (I_n + (1/\u03bb) G)^{-1}. (15)\n\nThus, DisCluster is equivalent to a kernel K-means problem. We call the algorithm Discriminative K-means (DisKmeans).\n3.2 Constructing the Kernel Gram Matrix via Subspace Selection\nThe kernel Gram matrix in Eq. (15) can be expressed as\n\n\u02dcG = U diag(\u03c3_1/(\u03bb + \u03c3_1), \u03c3_2/(\u03bb + \u03c3_2), \u00b7\u00b7\u00b7, \u03c3_n/(\u03bb + \u03c3_n)) U^T. (16)\n\nRecall that the original DisCluster algorithm involves alternating LDA subspace selection and clustering. The analysis above shows that the LDA subspace selection in DisCluster essentially constructs a kernel Gram matrix for clustering. More specifically, all the eigenvectors of G are kept unchanged, while the following transformation is applied to the eigenvalues:\n\n\u03a6(\u03c3) = \u03c3/(\u03bb + \u03c3).\n\nThis elucidates the nature of the subspace selection procedure in DisCluster. The clustering algorithm is dramatically simplified by removing the iterative subspace selection. We thus address issues (1)\u2013(3) in Section 1. The last issue will be addressed in Section 4 below.\n3.3 Connection with Other Clustering Approaches\nConsider the limiting case when \u03bb \u2192 \u221e. It follows from Eq. 
(16) that \u02dcG \u2192 G/\u03bb. The optimal L is thus given by solving the following maximization problem:\n\narg max_L trace(L^T G L).\n\nThe solution is given by the standard K-means clustering [4, 5].\nConsider the other extreme case when \u03bb \u2192 0. It follows from Eq. (16) that \u02dcG \u2192 U_1 U_1^T. Note that the columns of U_1 form the full set of (normalized) principal components [9]. Thus, the algorithm is equivalent to clustering in the (full) principal component space.\n4 DisKmeans\u03bb: Discriminative K-means with Automatically Tuned \u03bb\nOur experiments show that the value of the regularization parameter \u03bb has a significant impact on the performance of DisKmeans. In this section, we show how to incorporate the automatic tuning of \u03bb into the optimization framework, thus addressing issue (4) in Section 1.\nThe maximization problem in Eq. (11) is equivalent to the minimization of the following function:\n\ntrace(L^T (I_n + (1/\u03bb) G)^{-1} L). (17)\n\nIt is clear that a small value of \u03bb leads to a small value of the objective function in Eq. (17). To overcome this problem, we include an additional penalty term to control the eigenvalues of the matrix I_n + (1/\u03bb) G. This leads to the following optimization problem:\n\nmin_{L,\u03bb} g(L, \u03bb) \u2261 trace(L^T (I_n + (1/\u03bb) G)^{-1} L) + log det(I_n + (1/\u03bb) G). (18)\n\nNote that the objective function in Eq. (18) is closely related to the negative log marginal likelihood function in Gaussian Processes [12] with I_n + (1/\u03bb) G as the covariance matrix. We have the following main result for this section:\nTheorem 4.1. Let G be the Gram matrix defined above and let L be a given weighted cluster indicator matrix. Let G = U \u03a3 U^T = U_1 \u03a3_t U_1^T be the SVD of G with \u03a3_t = diag(\u03c3_1, \u00b7\u00b7\u00b7, \u03c3_t) as in Eq. (12), and let a_i be the i-th diagonal entry of the matrix U_1^T L L^T U_1. Then for a fixed L, the optimal \u03bb\u2217 solving the optimization problem in Eq. (18) is given by minimizing the following objective function:\n\n\u2211_{i=1}^t \u03bb a_i/(\u03bb + \u03c3_i) + \u2211_{i=1}^t log(1 + \u03c3_i/\u03bb). (19)\n\nProof. Let U = [U_1, U_2], that is, U_2 is the orthogonal complement of U_1. It follows that\n\nlog det(I_n + (1/\u03bb) G) = log det(I_t + (1/\u03bb) \u03a3_t) = \u2211_{i=1}^t log(1 + \u03c3_i/\u03bb), (20)\n\ntrace(L^T (I_n + (1/\u03bb) G)^{-1} L) = trace(L^T U_1 (I_t + (1/\u03bb) \u03a3_t)^{-1} U_1^T L) + trace(L^T U_2 U_2^T L) = \u2211_{i=1}^t (1 + \u03c3_i/\u03bb)^{-1} a_i + trace(L^T U_2 U_2^T L). (21)\n\nThe result follows as the second term in Eq. (21), trace(L^T U_2 U_2^T L), is a constant.\nWe can thus solve the optimization problem in Eq. (18) iteratively as follows: For a fixed \u03bb, we update L by minimizing the objective function in Eq. (17), which is equivalent to the DisKmeans algorithm; for a fixed L, we update \u03bb by minimizing the objective function in Eq. (19), which is a single-variable optimization and can be solved efficiently using the line search method. 
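Taken together, Eqs. (15) and (19) suggest a compact alternating procedure: build the spectrally filtered kernel of Section 3.2, cluster, then re-tune \u03bb by a one-dimensional search. The following is a minimal sketch, assuming numpy/scipy; the function names, the fixed iteration count, and the use of scipy's kmeans2 (standing in for a generic kernel K-means routine) are illustrative choices, not the authors' implementation:

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.optimize import minimize_scalar

def diskmeans_auto_lambda(X, k, lam0=1.0, n_iter=5):
    """Sketch of DisKmeans with automatic lambda tuning.

    X: m x n matrix of centered data points (columns); k: number of clusters.
    Alternates kernel K-means on Gtilde = I - (I + G/lam)^{-1} (Eq. (15))
    with a bounded 1-D minimization of Eq. (19) over lam.
    """
    G = X.T @ X                              # Gram matrix G = X^T X
    sig, U = np.linalg.eigh(G)               # G = U diag(sig) U^T
    sig = np.clip(sig, 0.0, None)
    pos = sig > 1e-12                        # the t = rank(G) positive eigenvalues
    lam = lam0
    labels = None
    for _ in range(n_iter):
        # Gtilde = U diag(sig/(lam+sig)) U^T (Eq. (16)); its feature map is
        # U * sqrt(sig/(lam+sig)), so plain K-means on these rows is kernel K-means.
        feats = U * np.sqrt(sig / (lam + sig))
        labels = kmeans2(feats, k, minit='++')[1]
        # Weighted cluster indicator L = F (F^T F)^{-1/2}  (Eq. (2)).
        F = np.eye(k)[labels]
        L = F / np.sqrt(np.maximum(F.sum(axis=0), 1.0))
        # a_i = i-th diagonal entry of U^T L L^T U; then minimize Eq. (19) over lam.
        a = np.sum((U.T @ L) ** 2, axis=1)
        def obj(l):
            return np.sum(l * a[pos] / (l + sig[pos]) + np.log1p(sig[pos] / l))
        lam = minimize_scalar(obj, bounds=(1e-6, 1e6), method='bounded').x
    return labels, lam
```

Consistent with Section 3.3, the filter sig/(lam + sig) used here approaches sig/lam (plain K-means) as lam grows, and a 0/1 indicator of the principal subspace as lam shrinks.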
We call the algorithm DisKmeans\u03bb, whose solution depends on the initial value of \u03bb.\n5 Kernel DisKmeans: Nonlinear Discriminative K-means Using Kernels\nThe DisKmeans algorithm can be easily extended to deal with nonlinear data using the kernel trick. Kernel methods [14] work by mapping the data into a high-dimensional feature space F equipped with an inner product through a nonlinear mapping \u03c6: R^m \u2192 F. The nonlinear mapping can be implicitly specified by a symmetric kernel function K, which computes the inner product of the images of each data pair in the feature space. For a given training data set {x_i}_{i=1}^n, the kernel Gram matrix G_K is defined as follows: G_K(i, j) = (\u03c6(x_i), \u03c6(x_j)). For a given G_K, the weighted cluster indicator matrix L = [L_1, \u00b7\u00b7\u00b7, L_k] in kernel DisKmeans is given by minimizing the following objective function:\n\ntrace(L^T (I_n + (1/\u03bb) G_K)^{-1} L) = \u2211_{j=1}^k L_j^T (I_n + (1/\u03bb) G_K)^{-1} L_j. (22)\n\nThe performance of kernel DisKmeans is dependent on the choice of the kernel Gram matrix. Following [11], we assume that G_K is restricted to be a convex combination of a given set of kernel Gram matrices {G_i}_{i=1}^\u2113 as G_K = \u2211_{i=1}^\u2113 \u03b8_i G_i, where the coefficients {\u03b8_i}_{i=1}^\u2113 satisfy \u2211_{i=1}^\u2113 \u03b8_i trace(G_i) = 1 and \u03b8_i \u2265 0 for all i. If L is given, the optimal coefficients {\u03b8_i}_{i=1}^\u2113 may be computed by solving a semidefinite programming (SDP) problem as follows:\nTheorem 5.1. Let G_K be constrained to be a convex combination of a given set of kernel matrices {G_i}_{i=1}^\u2113 as G_K = \u2211_{i=1}^\u2113 \u03b8_i G_i satisfying the constraints defined above. Then the optimal G_K minimizing the objective function in Eq. (22) is given by solving the following SDP problem:\n\nmin_{t_1, \u00b7\u00b7\u00b7, t_k, \u03b8} \u2211_{j=1}^k t_j\ns.t. [I_n + (1/\u03bb) \u2211_{i=1}^\u2113 \u03b8_i G_i, L_j; L_j^T, t_j] \u2ab0 0, for j = 1, \u00b7\u00b7\u00b7, k,\n\u03b8_i \u2265 0 for all i, \u2211_{i=1}^\u2113 \u03b8_i trace(G_i) = 1. (23)\n\nProof. It follows as L_j^T (I_n + (1/\u03bb) G_K)^{-1} L_j \u2264 t_j is equivalent, by the Schur complement, to the linear matrix inequality [I_n + (1/\u03bb) \u2211_{i=1}^\u2113 \u03b8_i G_i, L_j; L_j^T, t_j] \u2ab0 0.\nThis leads to an iterative algorithm alternating between the computation of the kernel Gram matrix G_K and the computation of the cluster indicator matrix L. The parameter \u03bb can also be incorporated into the SDP formulation by treating the identity matrix I_n as one of the kernel Gram matrices as in [11]. The algorithm is named Kernel DisKmeans\u03bb. Note that unlike the kernel learning in [11], the class label information is not available in our formulation.\n6 Empirical Study\nIn this section, we empirically study the properties of DisKmeans and its variants, and evaluate the performance of the proposed algorithms in comparison with several other representative algorithms, including Locally Linear Embedding (LLE) [13] and Laplacian Eigenmap (Leigs) [1].\nExperiment Setup: All algorithms were implemented using Matlab and experiments were conducted on a Pentium IV 2.4GHz PC with 1.5GB RAM. 
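Returning briefly to the multiple-kernel formulation of Section 5: solving Eq. (23) requires an SDP solver, but the feasible set and the objective of Eq. (22) are straightforward to state directly. The sketch below (numpy only; the function names are illustrative, and rescaling \u03b8 to satisfy the trace constraint is one convenient choice, not prescribed by the paper) evaluates the objective for a fixed feasible convex combination rather than optimizing it:

```python
import numpy as np

def combine_kernels(kernels, theta):
    """Form G_K = sum_i theta_i G_i with theta_i >= 0 and
    sum_i theta_i * trace(G_i) = 1 (theta is rescaled to enforce the constraint)."""
    theta = np.asarray(theta, dtype=float)
    assert np.all(theta >= 0.0)
    s = sum(t * np.trace(G) for t, G in zip(theta, kernels))
    theta = theta / s
    GK = sum(t * G for t, G in zip(theta, kernels))
    return theta, GK

def kernel_diskmeans_objective(GK, L, lam):
    """Objective of Eq. (22): trace(L^T (I_n + G_K/lam)^{-1} L), minimized over L."""
    n = GK.shape[0]
    M = np.linalg.inv(np.eye(n) + GK / lam)   # (I_n + G_K/lambda)^{-1}
    return float(np.trace(L.T @ M @ L))
```

Since the inverse factor has eigenvalues in (0, 1], the objective always lies between 0 and k for a weighted cluster indicator L, which gives a quick sanity check on any candidate combination.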
We test these algorithms on eight benchmark data sets. They are five UCI data sets [2]: banding, soybean, segment, satimage, and pendigits; one biological data set: leukemia (http://www.upo.es/eps/aguilar/datasets.html); and two image data sets: ORL (http://www.uk.research.att.com/facedatabase.html, sub-sampled to a size of 100*100 = 10000 from 10 persons) and USPS (ftp://ftp.kyb.tuebingen.mpg.de/pub/bs/data/). See Table 1 for more details. To make the results of different algorithms comparable, we first run K-means, and the clustering result of K-means is used to construct the set of k initial centroids, for all experiments. This process is repeated 50 times with different sub-samples from the original data sets. We use two standard measurements, the accuracy (ACC) and the normalized mutual information (NMI), to measure the performance.\n\nTable 1: Summary of benchmark data sets\nData Set | # CL (k) | # DIM (m) | # INST (n)\nbanding | 2 | 29 | 238\nsoybean | 15 | 35 | 562\nsegment | 7 | 19 | 2309\npendigits | 10 | 16 | 10992\nsatimage | 6 | 36 | 6435\nleukemia | 2 | 7129 | 72\nORL | 10 | 10304 | 100\nUSPS | 10 | 256 | 9298\n\nFigure 1: The effect of the regularization parameter \u03bb on DisKmeans and DisCluster.\n\nEffect of the regularization parameter \u03bb: Figure 1 shows the accuracy (y-axis) of DisKmeans and DisCluster for different \u03bb values (x-axis). We can observe that \u03bb has a significant impact on the performance of DisKmeans. This justifies the development of an automatic parameter tuning process in Section 4. We can also observe from the figure that as \u03bb \u2192 \u221e, the performance of DisKmeans approaches that of K-means on all eight benchmark data sets. This is consistent with our theoretical analysis in Section 3.3. It is clear that in many cases, \u03bb = 0 is not the best choice.\nEffect of parameter tuning in DisKmeans\u03bb: Figure 2 shows the accuracy of DisKmeans\u03bb using 4 data sets. 
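The ACC and NMI measurements used throughout this section can be made concrete; below is one standard realization (the paper does not spell out its implementation, so the details, e.g. Hungarian matching for ACC and square-root normalization for NMI, are conventional choices rather than the authors' code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_acc(y_true, y_pred):
    """Accuracy after the best one-to-one matching between cluster labels and
    class labels, found with the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = max(y_true.max(), y_pred.max()) + 1
    C = np.zeros((D, D), dtype=int)           # contingency table C[pred, true]
    for t, p in zip(y_true, y_pred):
        C[p, t] += 1
    rows, cols = linear_sum_assignment(-C)    # maximize matched counts
    return C[rows, cols].sum() / len(y_true)

def nmi(y_true, y_pred):
    """Normalized mutual information: I(T;P) / sqrt(H(T) H(P))."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    C = np.zeros((y_pred.max() + 1, y_true.max() + 1))
    for t, p in zip(y_true, y_pred):
        C[p, t] += 1
    Pxy = C / n                               # joint distribution
    Px = Pxy.sum(1, keepdims=True)            # cluster marginal
    Py = Pxy.sum(0, keepdims=True)            # class marginal
    nz = Pxy > 0
    I = (Pxy[nz] * np.log(Pxy[nz] / (Px @ Py)[nz])).sum()
    Ht = -(Py[Py > 0] * np.log(Py[Py > 0])).sum()
    Hp = -(Px[Px > 0] * np.log(Px[Px > 0])).sum()
    return I / np.sqrt(Ht * Hp) if Ht > 0 and Hp > 0 else 0.0
```

Both measures are label-permutation invariant, which is what makes them suitable for comparing clusterings against ground-truth classes.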
In the figure, the x-axis denotes the different \u03bb values used as the starting point for DisKmeans\u03bb. The result of DisKmeans (without parameter tuning) is also presented for comparison. We can observe from the figure that in many cases the tuning process is able to significantly improve the performance. We observe similar trends on the other four data sets and the results are omitted.\n\nFigure 2: The effect of the parameter tuning in DisKmeans\u03bb using 4 data sets. The x-axis denotes the different \u03bb values used as the starting point for DisKmeans\u03bb.\n\nFigure 2 also shows that the tuning process is dependent on the initial value of \u03bb due to its non-convex optimization, and that as \u03bb \u2192 \u221e, the effect of the tuning process becomes less pronounced. Our results show that a value of \u03bb that is neither too large nor too small works well.\n\nFigure 3: Comparison of the trace value achieved by DisKmeans and DisCluster. The x-axis denotes the number of iterations in DisCluster. The trace value of DisCluster is bounded from above by that of DisKmeans.\n\nDisKmeans versus DisCluster: Figure 3 compares the trace value achieved by DisKmeans and the trace value achieved in each iteration of DisCluster on 4 data sets for a fixed \u03bb. It is clear that the trace value of DisCluster increases in each iteration but is bounded from above by that of DisKmeans. We observe a similar trend on the other four data sets and the results are omitted. This is consistent with our analysis in Section 3 that both algorithms optimize the same objective function, and DisKmeans is a direct approach for the trace maximization without the iterative process.\nClustering evaluation: Table 2 presents the accuracy (ACC) and normalized mutual information (NMI) results of various algorithms on all eight data sets. In the table, DisKmeans (or DisCluster) with \u201cmax\u201d and \u201cave\u201d stands for the maximal and average performance achieved by DisKmeans and DisCluster using \u03bb from a wide range between 10^{\u22126} and 10^6. We can observe that DisKmeans\u03bb is competitive with other algorithms. It is clear that the average performance of DisKmeans\u03bb is robust against different initial values of \u03bb. We can also observe that the average performance of DisKmeans and DisCluster is quite similar, while DisCluster is less sensitive to the value of \u03bb.\n\n7 Conclusion\nIn this paper, we analyze the discriminative clustering (DisCluster) framework, which integrates subspace selection and clustering. We show that the iterative subspace selection and clustering in DisCluster is equivalent to kernel K-means with a specific kernel Gram matrix. We then propose the DisKmeans algorithm for simultaneous LDA subspace selection and clustering, as well as an automatic parameter tuning procedure. The connection between DisKmeans and several other clustering algorithms is also studied. 
The presented analysis and algorithms are verified through experiments on a collection of benchmark data sets.\nWe present the nonlinear extension of DisKmeans in Section 5. Our preliminary studies have shown the effectiveness of Kernel DisKmeans\u03bb in learning the kernel Gram matrix. However, the SDP formulation is limited to small-sized problems. We plan to explore efficient optimization techniques for this problem. Partial label information may be incorporated into the proposed formulations. This leads to semi-supervised clustering [3]. We plan to examine various semi-supervised learning techniques within the proposed framework and their effectiveness for clustering from both labeled and unlabeled data.\n\nTable 2: Accuracy (ACC) and Normalized Mutual Information (NMI) results on 8 data sets. \u201cmax\u201d and \u201cave\u201d stand for the maximal and average performance achieved by DisKmeans and DisCluster using \u03bb from a wide range of values between 10^{\u22126} and 10^6. We present the result of DisKmeans\u03bb with different initial \u03bb values. LLE stands for Locally Linear Embedding and LEI for Laplacian Eigenmap. \u201cAVE\u201d stands for the mean of ACC or NMI on 8 data sets for each algorithm.\n\nACC:\nData Set | DisKmeans max | DisKmeans ave | DisCluster max | DisCluster ave | DisKmeans\u03bb 10^{\u22122} | 10^{\u22121} | 10^0 | 10^1 | LLE | LEI\nbanding | 0.771 | 0.768 | 0.771 | 0.767 | 0.771 | 0.771 | 0.771 | 0.771 | 0.648 | 0.764\nsoybean | 0.641 | 0.634 | 0.633 | 0.632 | 0.639 | 0.639 | 0.638 | 0.637 | 0.630 | 0.649\nsegment | 0.687 | 0.664 | 0.676 | 0.672 | 0.664 | 0.659 | 0.671 | 0.680 | 0.594 | 0.663\npendigits | 0.699 | 0.690 | 0.696 | 0.690 | 0.700 | 0.696 | 0.696 | 0.697 | 0.599 | 0.697\nsatimage | 0.701 | 0.651 | 0.654 | 0.642 | 0.696 | 0.712 | 0.696 | 0.683 | 0.627 | 0.663\nleukemia | 0.775 | 0.763 | 0.738 | 0.738 | 0.738 | 0.753 | 0.738 | 0.738 | 0.714 | 0.686\nORL | 0.744 | 0.738 | 0.739 | 0.738 | 0.749 | 0.743 | 0.748 | 0.748 | 0.733 | 0.317\nUSPS | 0.712 | 0.628 | 0.692 | 0.683 | 0.684 | 0.702 | 0.680 | 0.684 | 0.631 | 0.700\nAVE | 0.716 | 0.692 | 0.700 | 0.695 | 0.705 | 0.709 | 0.705 | 0.705 | 0.647 | 0.642\n\nNMI:\nData Set | DisKmeans max | DisKmeans ave | DisCluster max | DisCluster ave | DisKmeans\u03bb 10^{\u22122} | 10^{\u22121} | 10^0 | 10^1 | LLE | LEI\nbanding | 0.225 | 0.221 | 0.225 | 0.219 | 0.225 | 0.225 | 0.225 | 0.225 | 0.093 | 0.213\nsoybean | 0.707 | 0.701 | 0.698 | 0.696 | 0.706 | 0.707 | 0.704 | 0.704 | 0.691 | 0.709\nsegment | 0.632 | 0.612 | 0.615 | 0.608 | 0.629 | 0.625 | 0.628 | 0.632 | 0.539 | 0.618\npendigits | 0.669 | 0.656 | 0.660 | 0.654 | 0.661 | 0.658 | 0.658 | 0.660 | 0.577 | 0.645\nsatimage | 0.593 | 0.537 | 0.551 | 0.541 | 0.597 | 0.608 | 0.596 | 0.586 | 0.493 | 0.548\nleukemia | 0.218 | 0.199 | 0.163 | 0.163 | 0.163 | 0.185 | 0.163 | 0.163 | 0.140 | 0.043\nORL | 0.794 | 0.789 | 0.789 | 0.788 | 0.800 | 0.795 | 0.801 | 0.800 | 0.784 | 0.327\nUSPS | 0.647 | 0.544 | 0.629 | 0.613 | 0.612 | 0.637 | 0.609 | 0.612 | 0.569 | 0.640\nAVE | 0.561 | 0.532 | 0.541 | 0.535 | 0.549 | 0.555 | 0.548 | 0.548 | 0.486 | 0.468\n\nAcknowledgments\n\nThis research is sponsored by the National Science Foundation Grant IIS-0612069.\n\nReferences\n[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. In NIPS, 2003.\n[2] C.L. Blake and C.J. Merz. 
UCI repository of machine learning databases, 1998.\n[3] O. Chapelle, B. Sch\u00f6lkopf, and A. Zien. Semi-Supervised Learning. The MIT Press, 2006.\n[4] I. S. Dhillon, Y. Guan, and B. Kulis. A unified view of kernel k-means, spectral clustering and graph partitioning. Technical report, Department of Computer Sciences, University of Texas at Austin, 2005.\n[5] C. Ding and T. Li. Adaptive dimension reduction using discriminant analysis and k-means clustering. In ICML, 2007.\n[6] J. H. Friedman. Regularized discriminant analysis. JASA, 84(405):165\u2013175, 1989.\n[7] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press.\n[8] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins Univ. Press, 1996.\n[9] I.T. Jolliffe. Principal Component Analysis. Springer, 2nd edition, 2002.\n[10] F. De la Torre Frade and T. Kanade. Discriminative cluster analysis. In ICML, pages 241\u2013248, 2006.\n[11] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5:27\u201372, 2004.\n[12] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.\n[13] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323\u20132326, 2000.\n[14] B. Sch\u00f6lkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.\n[15] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 38:49\u201395, 1996.\n[16] J. Ye, Z. Zhao, and H. Liu. Adaptive distance metric learning for clustering. In CVPR, 2007.", "award": [], "sourceid": 737, "authors": [{"given_name": "Jieping", "family_name": "Ye", "institution": null}, {"given_name": "Zheng", "family_name": "Zhao", "institution": null}, {"given_name": "Mingrui", "family_name": "Wu", "institution": null}]}