{"title": "Optimal Scoring for Unsupervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2241, "page_last": 2249, "abstract": "We are often interested in casting classification and clustering problems in a regression framework, because it is feasible to achieve some statistical properties in this framework by imposing some penalty criteria. In this paper we illustrate optimal scoring, which was originally proposed for performing Fisher linear discriminant analysis by regression, in the application of unsupervised learning. In particular, we devise a novel clustering algorithm that we call optimal discriminant clustering (ODC). We associate our algorithm with the existing unsupervised learning algorithms such as spectral clustering, discriminative clustering and sparse principal component analysis. Thus, our work shows that optimal scoring provides a new approach to the implementation of unsupervised learning. This approach facilitates the development of new unsupervised learning algorithms.", "full_text": "Optimal Scoring for Unsupervised Learning\n\nZhihua Zhang and Guang Dai\n\nCollege of Computer Science & Technology\n\nZhejiang University\n\nHangzhou, Zhejiang, 310027 China\n\nAbstract\n\nWe are often interested in casting classi\ufb01cation and clustering problems as a re-\ngression framework, because it is feasible to achieve some statistical properties\nin this framework by imposing some penalty criteria. In this paper we illustrate\noptimal scoring, which was originally proposed for performing the Fisher linear\ndiscriminant analysis by regression, in the application of unsupervised learning. In\nparticular, we devise a novel clustering algorithm that we call optimal discriminant\nclustering. We associate our algorithm with the existing unsupervised learning al-\ngorithms such as spectral clustering, discriminative clustering and sparse principal\ncomponent analysis. 
Experimental results on a collection of benchmark datasets validate the effectiveness of the optimal discriminant clustering algorithm.

1 Introduction

Fisher linear discriminant analysis (LDA) is a classical method that considers dimensionality reduction and classification jointly. LDA estimates a low-dimensional discriminative space, defined by linear transformations, by maximizing the ratio of between-class scatter to within-class scatter. It is well known that LDA is equivalent to a least mean squared error procedure in the binary classification problem [4]. It is of great interest to obtain a similar relationship in multi-class problems, and a significant literature has emerged to address this issue [6, 8, 12, 14]. This line of work provides another approach to performing LDA by regression, in which penalty criteria can be tractably introduced to achieve statistical properties, as in regularized LDA [5] and sparse discriminant analysis [2].

It is also desirable to explore unsupervised learning problems in a regression framework. Recently, Zou et al. [17] reformulated principal component analysis (PCA) as a regression problem and then devised a sparse PCA by imposing the lasso (elastic net) penalty [10, 16] on the regression vector. In this paper we consider unsupervised learning problems via optimal scoring, which was originally proposed to perform LDA by regression [6]. In particular, we devise a novel unsupervised framework based on optimal scoring and the ridge penalty.

This framework can be used for dimensionality reduction and clustering simultaneously. We are mainly concerned with the application to clustering. In particular, we propose a clustering algorithm that we call optimal discriminant clustering (ODC). Moreover, we establish a connection between our clustering algorithm and discriminative clustering algorithms [3, 13] as well as spectral clustering algorithms [7, 15]. 
This implies that we can cast these clustering algorithms as regression-type problems. In turn, this facilitates the introduction of penalty terms such as the lasso and elastic net, yielding sparse unsupervised learning algorithms.

Throughout this paper, Im denotes the m×m identity matrix, 1m the m×1 vector of ones, 0 the zero vector or matrix of appropriate size, and Hm = Im − (1/m)1m1m' the m×m centering matrix. For an m×1 vector a = (a1, . . . , am)', diag(a) represents the m×m diagonal matrix with a1, . . . , am as its diagonal entries. For an m×m matrix A = [aij], we let A+ be the Moore-Penrose inverse of A, tr(A) the trace of A, rk(A) the rank of A, and ‖A‖F = sqrt(tr(A'A)) the Frobenius norm of A.

2 Problem Formulation

We are concerned with a multi-class classification problem. Given a set of n p-dimensional data points {x1, . . . , xn} ⊂ X ⊂ Rp, we assume that the xi are grouped into c disjoint classes and that each xi belongs to exactly one class. Let V = {1, 2, . . . , n} denote the index set of the data points xi and partition V into c disjoint subsets Vj; i.e., Vi ∩ Vj = ∅ for i ≠ j and ∪_{j=1}^c Vj = V, where the cardinality of Vj is nj, so that Σ_{j=1}^c nj = n.

We also make use of a matrix representation of the problem. In particular, we let X = [x1, . . . , xn]' be the n×p data matrix, and E = [eij] the n×c indicator matrix with eij = 1 if input xi is in class j and eij = 0 otherwise. Let Π = diag(n1, . . . , nc), Π^{1/2} = diag(√n1, . . . , √nc), π = (n1, . . . , nc)' and √π = (√n1, . . . , √nc)'. It follows that 1n'E = 1c'Π = π', E1c = 1n, 1c'π = n, E'E = Π and Π^{-1}π = 1c.

2.1 Scoring Matrices

Hastie et al. [6] defined a scoring matrix for the c-class classification problem as a c×(c−1) matrix Θ ∈ R^{c×(c−1)} such that Θ'(E'E)Θ = Θ'ΠΘ = I_{c−1}. The jth row of Θ defines a scoring or scaling for the jth class. Here we refine this definition as:

Definition 1 Given a c-class classification problem with the cardinality of the jth class being nj, a c×(c−1) matrix Θ is referred to as the class scoring matrix if it satisfies

Θ'ΠΘ = I_{c−1} and π'Θ = 0.

It follows from this definition that ΘΘ' = Π^{-1} − (1/n)1c1c'. In the literature [15], the authors presented a specific example of Θ = (θ1, . . . , θ_{c−1})'. That is,

θ1' = ( sqrt((n−n1)/(n n1)), −sqrt(n1/(n(n−n1))) 1_{c−1}' )

and

θl' = ( 0·1_{l−1}', sqrt(Σ_{j=l+1}^c nj / (nl Σ_{j=l}^c nj)), −sqrt(nl / (Σ_{j=l}^c nj · Σ_{j=l+1}^c nj)) 1_{c−l}' )

for l = 2, . . . , c−1. In particular, when c = 2, Θ = (√n2/√(n n1), −√n1/√(n n2))' is a 2-dimensional vector.

Let Y = EΘ (n×(c−1)). We then have Y'Y = I_{c−1} and 1n'Y = 0. To address an unsupervised clustering problem with c classes, we relax the setting Y = EΘ and give the following definition.

Definition 2 An n×(c−1) matrix Y is referred to as the sample scoring matrix if it satisfies

Y'Y = I_{c−1} and 1n'Y = 0.

Note that c does not necessarily represent the number of classes in this definition. For example, we view c−1 as the dimension of a reduced dimensional space in the dimensionality reduction problem.

2.2 Optimal Scoring for LDA

To devise a classifier for the c-class classification problem, we consider a penalized optimal scoring model, which is defined by

min_{Θ, W} { f(Θ, W) ≜ (1/2)‖EΘ − HnXW‖F² + (σ²/2) tr(W'W) }     (1)

under the constraints Θ'ΠΘ = I_{c−1} and π'Θ = 0, where Θ ∈ R^{c×(c−1)} and W ∈ R^{p×(c−1)}. Compared with the setting in [6], we add the constraint π'Θ = 0. The reason is that 1n'HnXW = 0; we thus impose 1n'EΘ = π'Θ = 0 for consistency.

Denote

R = Π^{-1/2} E'HnX(X'HnX + σ²Ip)^{-1} X'HnE Π^{-1/2}.

Since R√π = 0, there exists a c×(c−1) orthogonal matrix Δ, the columns of which are the eigenvectors of R. 
That is, Δ satisfies Δ'Δ = I_{c−1} and Δ'√π = 0.

Theorem 1 A minimizer of Problem (1) is Θ̂ = Π^{-1/2}Δ and Ŵ = (X'HnX + σ²Ip)^{-1} X'HnE Θ̂. Here [Δ, (1/√n)√π] is the c×c matrix of the orthonormal eigenvectors of R.

Since an arbitrary class scoring matrix Θ has rank c−1, we can write Θ = Θ̂Υ, where Υ is some (c−1)×(c−1) orthonormal matrix. Moreover, it follows from ΘΘ' = Π^{-1} − (1/n)1c1c' that the between-class scatter matrix is given by

Σb = X'HnEΘΘ'E'HnX = X'HnEΘ̂Θ̂'E'HnX.

Accordingly, we can also write the generalized eigenproblem for the penalized LDA as

X'HnEΘ̂Θ̂'E'HnX A = (X'HnX + σ²Ip) A Λ,

because the total scatter matrix is Σ = X'HnX. We now obtain

Ŵ Θ̂'E'HnX A = AΛ.

It is well known that ŴΘ̂'E'HnX and Θ̂'E'HnXŴ have the same nonzero eigenvalues. Moreover, Θ̂'E'HnX A is the eigenvector matrix of Θ̂'E'HnXŴ. We thus establish the relationship between A in the penalized LDA and W in the penalized optimal scoring model (1).

3 Optimal Scoring for Unsupervised Learning

In this section we extend the notion of optimal scoring to unsupervised learning problems, leading to a new framework for dimensionality reduction and clustering analysis simultaneously.

3.1 Framework

In particular, we relax EΘ in (1) to a sample scoring matrix Y and define the following penalized model:

min_{Y, W} { f(Y, W) ≜ (1/2)‖Y − HnXW‖F² + (σ²/2) tr(W'W) }     (2)

under the constraints 1n'Y = 0 and Y'Y = I_{c−1}. The following theorem provides a solution for this problem.

Theorem 2 A minimizer of Problem (2) is Ŷ and Ŵ = (X'HnX + σ²Ip)^{-1} X'Hn Ŷ, where Ŷ is the n×(c−1) orthogonal matrix of the top eigenvectors of HnX(X'HnX + σ²Ip)^{-1}X'Hn.

The proof is given in Appendix A. Note that all the eigenvalues of HnX(X'HnX + σ²Ip)^{-1}X'Hn are between 0 and 1. In particular, when σ² = 0, the eigenvalues are either 1 or 0. In this case, if rk(HnX) ≥ c−1, f(Ŷ, Ŵ) achieves its minimum 0; otherwise the minimum value is (c−1−rk(HnX))/2.

With the estimates of Y and W, we can develop an unsupervised learning procedure. It is clear that W can be treated as a non-orthogonal projection matrix, and HnXW is then the low-dimensional configuration of X. Under this treatment, we obtain a new alternative to the regression formulation of PCA by Zou et al. [17]. In this paper, however, we concentrate on the application of the framework to clustering analysis.

3.2 Optimal Discriminant Clustering

Our clustering procedure is given in Algorithm 1. 
We refer to this procedure as optimal discriminant clustering because of its relationship with LDA, which is shown by the connection between (1) and (2). Assume that X̃ = [x̃1, . . . , x̃n]' (n×r) is a feature matrix corresponding to the data matrix X. In this case, we have

S = HnX̃(X̃'HnX̃ + σ²Ir)^{-1}X̃'Hn = C(C + σ²In)^{-1},

where C = HnX̃X̃'Hn is the n×n centered kernel matrix. This implies that we can obtain Ŷ without the explicit use of the feature matrix X̃. Moreover, we can compute Z by

Z = HnX̃(X̃'HnX̃ + σ²Ir)^{-1}X̃'Hn Y = SY.

We are thus able to devise this clustering algorithm by using a reproducing kernel K(·,·): X×X → R such that K(xi, xj) = x̃i'x̃j and K = X̃X̃'.

Algorithm 1 Optimal Discriminant Clustering Algorithm
1: procedure ODC(HnX, c, σ²)
2:   Estimate Ŷ and Ŵ according to Theorem 2;
3:   Calculate Z = [z1, . . . , zn]' = HnXŴ;
4:   Perform K-means on the zi;
5:   Return the partition of the zi as the partition of the xi.
6: end procedure

3.3 Related Work

We now explore the connection of optimal discriminant clustering with the discriminative clustering algorithm [3] and spectral clustering [7]. Recall that Ŷ is the matrix of the c−1 top eigenvectors of C(C + σ²In)^{-1}. Note that if λ ≠ 0 is an eigenvalue of C with associated eigenvector u, then λ/(λ + σ²) (≠ 0) is an eigenvalue of C(C + σ²In)^{-1} with the same associated eigenvector u. Moreover, λ/(λ + σ²) is increasing in λ. This implies that Ŷ is also the matrix of the c−1 top eigenvectors of C. As we know, spectral clustering applies a rounding scheme such as K-means directly to Ŷ. 
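The solution of Theorem 2 and the configuration Z used in Algorithm 1 can be sketched numerically. The following is a minimal NumPy sketch (not the authors' code), with illustrative sizes and random data as assumptions, and with the final K-means rounding step of Algorithm 1 omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, c, sigma2 = 30, 5, 3, 0.1           # illustrative sizes, not from the paper
X = rng.normal(size=(n, p))

Hn = np.eye(n) - np.ones((n, n)) / n      # centering matrix H_n
Xc = Hn @ X                                # H_n X
A = Xc.T @ Xc + sigma2 * np.eye(p)        # X'H_nX + sigma^2 I_p
S = Xc @ np.linalg.solve(A, Xc.T)         # H_nX (X'H_nX + sigma^2 I_p)^{-1} X'H_n

# Theorem 2: Y_hat is the matrix of the top c-1 eigenvectors of the symmetric S
evals, evecs = np.linalg.eigh(S)          # eigenvalues in ascending order
Y_hat = evecs[:, -(c - 1):]
W_hat = np.linalg.solve(A, Xc.T @ Y_hat)  # W_hat from Theorem 2
Z = Xc @ W_hat                            # low-dimensional configuration; equals S Y_hat
```

One can check that Ŷ'Ŷ = I_{c−1}, 1n'Ŷ = 0 and Z = SŶ hold up to numerical precision; running K-means on the rows of Z would complete Algorithm 1.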
We thus have a relationship between spectral clustering and optimal discriminant clustering.

We now study the relationship between the discriminative clustering algorithm and the spectral clustering algorithm. Let M be a linear transformation from the r-dimensional feature space of X̃ to an s-dimensional transformed feature space F, namely

F = X̃M,

where M is an r×s matrix of rank s (s < r). The corresponding scatter matrices in the F-space are thus given by M'ΣM and M'ΣbM. The discriminative clustering algorithm [3, 13] in a reproducing kernel Hilbert space (RKHS) tries to solve the problem

argmax_{E, M} f(E, M) ≜ tr((M'(Σ + σ²Ir)M)^{-1} M'ΣbM)
            = tr((M'(X̃'HnX̃ + σ²Ir)M)^{-1} M'X̃'HnE(E'E)^{-1}E'HnX̃M).

Applying the discussion in [15] to HnX̃M(M'(X̃'HnX̃ + σ²Ir)M)^{-1}M'X̃'Hn, we have the following relaxation problem:

max_{Y ∈ R^{n×(c−1)}, M ∈ R^{r×s}} tr(Y'HnX̃M(M'(X̃'HnX̃ + σ²Ir)M)^{-1}M'X̃'HnY),
s.t. Y'Y = I_{c−1} and Y'1n = 0.     (3)

Express M = X̃'HnB + N, where N satisfies N'X̃'Hn = 0 (i.e., N ∈ span{X̃'Hn}⊥) and B is some n×s matrix. Under the condition of either σ² = 0 or N = 0 (i.e., M ∈ span{X̃'Hn}), we can obtain

HnX̃M(M'(X̃'HnX̃ + σ²Ir)M)^{-1}M'X̃'Hn = CB(B'(CC + σ²C)B)^{-1}B'C.

Again, if λ ≠ 0 is an eigenvalue of C with associated eigenvector u, then λ/(λ + σ²) (≠ 0) is an eigenvalue of C(CC + σ²C)+C with associated eigenvector u. 
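The eigenvalue-shrinkage fact invoked here (passing from C to C(C + σ²In)^{-1} rescales each eigenvalue λ to λ/(λ + σ²) while keeping the eigenvectors) is easy to verify numerically. A small sketch, with an arbitrary PSD matrix standing in for the centered kernel C (all sizes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 8, 0.5
A = rng.normal(size=(n, 4))
C = A @ A.T                                   # a rank-4 PSD stand-in for the centered kernel

lam = np.linalg.eigvalsh(C)                   # ascending eigenvalues of C
M = C @ np.linalg.inv(C + sigma2 * np.eye(n))
lam_M = np.linalg.eigvalsh(M)                 # ascending eigenvalues of C(C + sigma^2 I)^{-1}

# the map t -> t/(t + sigma^2) is increasing, so the eigenvalue order is preserved
shrunk = lam / (lam + sigma2)
```

Because the map is monotone, the top c−1 eigenvectors of C(C + σ²In)^{-1} and of C coincide, which is exactly the step relating ODC to spectral clustering.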
Moreover, λ/(λ + σ²) is increasing in λ. We now directly obtain the following theorem from Theorem 3.1 in [13].

Theorem 3 Let Y* and M* be the solution of Problem (3). Then

(i) If σ² = 0, Y* is the solution of the following problem:

argmax_{Y ∈ R^{n×(c−1)}} tr(Y'CC+Y),
s.t. Y'Y = I_{c−1} and Y'1n = 0.

(ii) If M ∈ span{X̃'Hn}, Y* is the solution of the following problem:

argmax_{Y ∈ R^{n×(c−1)}} tr(Y'CY),
s.t. Y'Y = I_{c−1} and Y'1n = 0.

Theorem 3 shows that discriminative clustering is essentially equivalent to spectral clustering. This further leads us to a relationship between discriminative clustering and optimal discriminant clustering, via the relationship between spectral clustering and optimal discriminant clustering. In summary, we are able to unify both discriminative clustering and spectral clustering into the optimal scoring framework in (2).

Table 1: Summary of the benchmark datasets, where c is the number of classes, p is the dimension of the input vector, and n is the number of samples in the dataset.

Types   Dataset                     c    p     n
Face    ORL                         40   1024  400
Face    Yale                        15   1024  165
Face    PIE                         68   1024  6800
Gene    SRBCT                       4    2308  63
UCI     Iris                        4    4     150
UCI     Yeast                       10   8     1484
UCI     Image segmentation          7    19    2100
UCI     Statlog landsat satellite   7    36    2000

4 Experimental Study

To evaluate the performance of our optimal discriminant clustering (ODC) algorithm, we conducted experimental comparisons with other related clustering algorithms on several real-world datasets. In particular, the comparison was implemented on three face datasets, the "SRBCT" gene dataset, and four UCI datasets. 
Further details of these datasets are summarized in Table 1.

To evaluate performance, we employed two standard measures: the Normalized Mutual Information (NMI) and the Clustering Error (CE). For NMI, the larger the value, the better the performance; for CE, the smaller the value, the better. More details and the corresponding implementations of both measures can be found in [11].

In the experiments, we compared our ODC with four other clustering algorithms: conventional K-means [1], normalized cut (NC) [9], DisCluster [3] and DisKmeans [13]. It is worth noting that the two discriminative clustering algorithms, DisCluster [3] and DisKmeans [13], are very closely related to our ODC, because they are in essence derived from discriminant analysis criteria (also see the analysis in Section 3.3). In addition, the implementation code for NC is available at http://www.cis.upenn.edu/~jshi/software/. For the sake of simplicity, the parameter σ² in ODC is sought from the range σ² ∈ {10^{-3}, 10^{-2.5}, 10^{-2}, 10^{-1.5}, 10^{-1}, 10^{-0.5}, 10^{0}, 10^{0.5}, 10^{1}, 10^{1.5}, 10^{2}, 10^{2.5}, 10^{3}}. Similarly, the parameters of the other clustering algorithms were also searched over a wide range.

For simplicity, we report the best results of each clustering algorithm over these parameter settings on each dataset. Table 2 summarizes the NMI and CE on all datasets. According to the NMI values in Table 2, our ODC outperforms the other clustering algorithms on five datasets: ORL, SRBCT, iris, yeast and image segmentation. According to the CE values in Table 2, the performance of our ODC is the best among all the algorithms on all the datasets, while NC and DisKmeans achieve almost the same performance as ODC on the SRBCT and iris datasets, respectively. 
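For concreteness, NMI can be computed as in the following generic sketch. The paper follows the implementation of [11], which may differ in details; in particular, the arithmetic-mean normalization of the two entropies used here is our assumption:

```python
import numpy as np
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two cluster labelings
    (arithmetic-mean normalization of the entropies is assumed)."""
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    pab = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        ps = np.array(list(counts.values())) / n
        return float(-np.sum(ps * np.log(ps)))

    # mutual information I(A; B) from the joint and marginal counts
    mi = sum((nij / n) * np.log(n * nij / (pa[i] * pb[j]))
             for (i, j), nij in pab.items())
    denom = (entropy(pa) + entropy(pb)) / 2
    return mi / denom if denom > 0 else 0.0

# a clustering that is perfect up to label permutation scores 1
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
```

The label-permutation invariance shown in the last line is why NMI (like CE) is a suitable measure for comparing a clustering against ground-truth classes.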
Also, it is seen that the DisCluster algorithm performs quite differently under NMI and CE. The main reason is that the final solution of DisCluster is very sensitive to the initial variables and to numerical computation.

Figure 1: The NMI versus the parameter σ tuning in ODC on all datasets, where the NMI of K-means is used as the baseline: (a) ORL; (b) Yale; (c) PIE; (d) SRBCT; (e) iris; (f) yeast; (g) image segmentation; (h) statlog landsat satellite.

In order to reveal the effect of the parameter σ on ODC, Figures 1 and 2 depict the NMI and CE results of ODC for different values of σ on all datasets. Similar to [11, 13], we used the results of K-means as a baseline. From Figures 1 and 2, we can see that, as for conventional clustering algorithms (including those compared here), the parameter σ has a significant impact on the performance of ODC, especially when the results are measured by NMI. 
In contrast to the results in Figure 1, the effect of the parameter σ is less pronounced in Figure 2.

Table 2: Clustering results: the Normalized Mutual Information (NMI) and the Clustering Error (CE) (%) of all clustering algorithms on the different datasets.

NMI
Dataset                     K-means  NC      DisCluster  DisKmeans  ODC
ORL                         0.7971   0.8015  0.7978      0.8531     0.8567
Yale                        0.6237   0.6203  0.5974      0.5641     0.5766
PIE                         0.1140   0.2232  0.1940      0.3360     0.3035
SRBCT                       0.2509   0.3722  0.3216      0.2683     0.3966
Iris                        0.6595   0.6876  0.7248      0.7353     0.7353
Yeast                       0.2968   0.2915  0.2993      0.3020     0.3041
Image segmentation          0.5830   0.5500  0.5700      0.5934     0.5942
Statlog landsat satellite   0.6126   0.6316  0.6152      0.6009     0.6166

CE (%)
Dataset                     K-means  NC      DisCluster  DisKmeans  ODC
ORL                         38.25    34.50   38.75       29.00      28.50
Yale                        45.45    46.06   45.45       45.45      44.84
PIE                         79.82    79.82   77.35       66.23      65.52
SRBCT                       55.55    47.61   50.79       53.96      47.61
Iris                        16.66    15.33   12.66       11.33      11.33
Yeast                       59.43    59.90   59.43       57.07      56.73
Image segmentation          45.14    49.47   45.95       41.66      40.23
Statlog landsat satellite   32.30    32.65   32.25       31.20      30.50

5 Concluding Remarks

In this paper we have proposed a regression framework that deals with unsupervised dimensionality reduction and clustering simultaneously. The framework is based on optimal scoring and the ridge penalty. In particular, we have developed a new clustering algorithm called optimal discriminant clustering (ODC). 
ODC can efficiently identify the optimal solution, and it has an underlying relationship with discriminative clustering and spectral clustering.

Figure 2: The CE (%) versus the parameter σ tuning in ODC on all datasets, where the CE (%) of K-means is used as the baseline: (a) ORL; (b) Yale; (c) PIE; (d) SRBCT; (e) iris; (f) yeast; (g) image segmentation; (h) statlog landsat satellite.

This framework also allows us to develop sparse unsupervised learning algorithms; that is, we can alternatively consider the following optimization problem:

min_{Y, W} f(Y, W) = (1/2)‖Y − HnXW‖F² + (λ1/2) tr(W'W) + λ2‖W‖1

under the constraints 1n'Y = 0 and Y'Y = I_{c−1}. We will study this further.

Acknowledgement

This work has been supported in part by the Program for Changjiang Scholars and Innovative Research Team in University (IRT0652, PCSIRT), China.

A Proof of Theorem 2

For simplicity, we replace HnX by X and let q = c−1 in the following derivation. 
Consider the Lagrange function

L(Y, W, B, b) = (1/2)tr(Y'Y) − tr(Y'XW) + (1/2)tr(W'(X'X + σ²Ip)W) − (1/2)tr(B(Y'Y − Iq)) − tr(b'Y'1n),

where B is a q×q symmetric matrix of Lagrange multipliers and b is a q×1 vector of Lagrange multipliers. By direct differentiation, it can be shown that

∂L/∂Y = Y − XW − YB − 1nb',
∂L/∂W = (X'X + σ²Ip)W − X'Y.

Letting ∂L/∂Y = 0, we have

Y − XW − YB − 1nb' = 0.

Pre-multiplying both sides of the above equation by 1n', we obtain b = 0. Thus, it follows from ∂L/∂Y = 0 and ∂L/∂W = 0 that

Y − XW − YB = 0,
W = (X'X + σ²Ip)^{-1}X'Y.

Substituting the second equation into the first, we further have

(In − X(X'X + σ²Ip)^{-1}X')Y = YB.

Now we take the spectral decomposition B = UBΛBUB', where UB is a q×q orthonormal matrix and ΛB is a q×q diagonal matrix. We thus have (In − X(X'X + σ²Ip)^{-1}X')YUB = YUBΛB. 
This shows that the diagonal entries of ΛB and the columns of YUB are eigenvalues and associated eigenvectors of In − X(X'X + σ²Ip)^{-1}X'.

We first consider the case n ≥ p. Let the SVD of X be X = UΓV', where U (n×p) and V (p×p) are orthogonal and Γ = diag(γ1, . . . , γp) (p×p) is a diagonal matrix with γ1 ≥ γ2 ≥ ··· ≥ γp ≥ 0. We then have X(X'X + σ²Ip)^{-1}X' = UΛU', where Λ = diag(λ1, . . . , λp) with λi = γi²/(γi² + σ²). There exists an n×(n−p) orthogonal matrix U3 whose last column is (1/√n)1n and for which [U, U3] is an n×n orthonormal matrix. That is, U3 is the eigenvector matrix of X(X'X + σ²Ip)^{-1}X' corresponding to the eigenvalue 0. Let U1 be the n×q matrix of the first q columns of [U, U3].

We now define Ŷ = U1, Ŵ = (X'X + σ²Ip)^{-1}X'U1, UB = Iq and ΛB = diag(1−λ1, . . . , 1−λq), where λi = 0 whenever i > p. It is easily seen that such a Ŷ satisfies Ŷ'Ŷ = Iq and Ŷ'1n = 0, due to U1'U1 = Iq and X'1n = 0. Moreover, we have

f(Ŷ, Ŵ) = q/2 − (1/2) Σ_{i=1}^q λi = q/2 − (1/2) Σ_{i=1}^q γi²/(γi² + σ²),

where γi = 0 whenever i > p. Note that all the eigenvalues of X(X'X + σ²Ip)^{-1}X' are between 0 and 1. In particular, when σ² = 0, the eigenvalues are either 1 or 0. In this case, if rk(X) ≥ q, f(Ŷ, Ŵ) achieves its minimum 0; otherwise the minimum value is (q − rk(X))/2.

To verify that (Ŷ, Ŵ) is a minimizer of Problem (2), we consider the Hessian matrix of L with respect to (Y, W). Let vec(Y') = (y11, . . . , y1q, y21, . . . , ynq)' and vec(W') = (w11, . . . , w1q, w21, . . . , wpq)'. The Hessian matrix is then given by

H(Y, W) = [ ∂²L/∂vec(Y')∂vec(Y')'   ∂²L/∂vec(Y')∂vec(W')'
            ∂²L/∂vec(W')∂vec(Y')'   ∂²L/∂vec(W')∂vec(W')' ]
         = [ (Iq−B)⊗In   −Iq⊗X
             −Iq⊗X'      Iq⊗(X'X + σ²Ip) ].

Let C' = [C1', C2'], where C1 and C2 are n×q and p×q, be an arbitrary nonzero (n+p)×q matrix such that C1'[1n, Ŷ] = 0, which is equivalent to C1'1n = 0 and C1'U1 = 0.

If rk(X) ≤ q, we have C1'X = 0. Hence,

vec(C')'H(Ŷ, Ŵ)vec(C') = tr(C1'C1(Iq − B)) − 2tr(C1'XC2) + tr(C2'(X'X + σ²Ip)C2)
                        = tr(C1'C1(Iq − B)) + tr(C2'(X'X + σ²Ip)C2) ≥ 0.

This implies that (Ŷ, Ŵ) is a minimizer of Problem (2).

In the case that rk(X) = m > q, we have p > q. Thus we can partition U and V into U = [U1, U2] and V = [V1, V2], where V1 and V2 are p×q and p×(p−q). Thus,

vec(C')'H(Ŷ, Ŵ)vec(C') = tr(C1'C1(Iq − B)) − 2tr(C1'XC2) + tr(C2'(X'X + σ²Ip)C2)
  ≥ tr(C1'U2Λ2U2'C1) − 2tr(C1'U2Γ2V2'C2) + tr(C2'V2D2V2'C2) + tr(C1'U3U3'C1Λ1) + tr(C2'V1D1V1'C2)
  = tr[(Λ2^{1/2}U2'C1 − D2^{1/2}V2'C2)'(Λ2^{1/2}U2'C1 − D2^{1/2}V2'C2)] + tr(C1'U3U3'C1Λ1) + tr(C2'V1D1V1'C2) ≥ 0.

Here Λ1 = diag(λ1, . . . , λq), Λ2 = diag(λ_{q+1}, . . . , λp), Γ1 = diag(γ1, . . . , γq), Γ2 = diag(γ_{q+1}, . . . , γp), D1 = Γ1² + σ²Iq and D2 = Γ2² + σ²I_{p−q}, so that Γ2 = D2^{1/2}Λ2^{1/2}. Moreover, we use the fact that

tr(C1'U2U2'C1Λ1) ≥ tr(C1'U2Λ2U2'C1),

because the matrices λiI_{p−q} − Λ2 for i = 1, . . . , q are positive semidefinite.

If n < p, we again take the SVD of X as X = UΓV'. But now U is n×n, V is p×n, and Γ is n×n. Using this SVD, we obtain the same result as in the case n ≥ p.

References
[1] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, first edition, 2007.
[2] L. Clemmensen, T. Hastie, and B. Ersbøll. Sparse discriminant analysis. Technical report, June 2008.
[3] F. De la Torre and T. Kanade. Discriminative cluster analysis. In The 23rd International Conference on Machine Learning, 2006.
[4] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, New York, second edition, 2001.
[5] T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. 
The Annals of Statistics, 23(1):73–102, 1995.
[6] T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428):1255–1270, 1994.
[7] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems 14, 2002.
[8] C. H. Park and H. Park. A relationship between linear discriminant analysis and the generalized minimum squared error solution. SIAM Journal on Matrix Analysis and Applications, 27(2):474–492, 2005.
[9] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[10] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
[11] M. Wu and B. Schölkopf. A local learning approach for clustering. In Advances in Neural Information Processing Systems 19, 2007.
[12] J. Ye. Least squares linear discriminant analysis. In The Twenty-Fourth International Conference on Machine Learning, 2007.
[13] J. Ye, Z. Zhao, and M. Wu. Discriminative k-means for clustering. In Advances in Neural Information Processing Systems 20, 2008.
[14] Z. Zhang, G. Dai, and M. I. Jordan. A flexible and efficient algorithm for regularized Fisher discriminant analysis. In The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2009.
[15] Z. Zhang and M. I. Jordan. Multiway spectral clustering: A margin-based perspective. Statistical Science, 23(3):383–403, 2008.
[16] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.
[17] H. Zou, T. Hastie, and R. Tibshirani. 
Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15:265–286, 2006.", "award": [], "sourceid": 780, "authors": [{"given_name": "Zhihua", "family_name": "Zhang", "institution": null}, {"given_name": "Guang", "family_name": "Dai", "institution": null}]}