{"title": "Nonlinear Discriminant Analysis Using Kernel Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 568, "page_last": 574, "abstract": null, "full_text": "Nonlinear Discriminant Analysis using Kernel Functions \n\nVolker Roth & Volker Steinhage \n\nUniversity of Bonn, Institute of Computer Science III \nRömerstrasse 164, D-53117 Bonn, Germany \n{roth, steinhag}@cs.uni-bonn.de \n\nAbstract \n\nFisher's linear discriminant analysis (LDA) is a classical multivariate technique both for dimension reduction and classification. The data vectors are transformed into a low dimensional subspace such that the class centroids are spread out as much as possible. In this subspace LDA works as a simple prototype classifier with linear decision boundaries. However, in many applications the linear boundaries do not adequately separate the classes. We present a nonlinear generalization of discriminant analysis that uses the kernel trick of representing dot products by kernel functions. The presented algorithm allows a simple formulation of the EM-algorithm in terms of kernel functions, which leads to a unified concept for unsupervised mixture analysis, supervised discriminant analysis and semi-supervised discriminant analysis with partially unlabelled observations in feature spaces. \n\n1 Introduction \n\nClassical linear discriminant analysis (LDA) projects N data vectors that belong to c different classes into a (c - 1)-dimensional space in such a way that the ratio of between-group scatter S_B and within-group scatter S_W is maximized [1]. LDA formally consists of an eigenvalue decomposition of S_W^{-1} S_B, leading to the so-called canonical variates, which contain the whole class-specific information in a (c - 1)-dimensional subspace. 
The canonical variates can be ordered by decreasing eigenvalue size, indicating that the first variates contain the major part of the information. As a consequence, this procedure allows low dimensional representations and therefore a visualization of the data. Besides interpreting LDA only as a technique for dimensionality reduction, it can also be seen as a multi-class classification method: the set of linear discriminant functions defines a partition of the projected space into regions that are identified with class membership. A new observation x is assigned to the class with centroid closest to x in the projected space. \nTo overcome the limitation of only linear decision functions, some attempts have been made to incorporate nonlinearity into the classical algorithm. HASTIE et al. [2] introduced the so-called model of Flexible Discriminant Analysis: LDA is reformulated in the framework of linear regression estimation, and a generalization of this method is given by using nonlinear regression techniques. The proposed regression techniques implement the idea of using nonlinear mappings to transform the input data into a new space in which again a linear regression is performed. In real world applications this approach has to deal with numerical problems due to the dimensional explosion resulting from nonlinear mappings. In recent years, approaches that avoid such explicit mappings by using kernel functions have become popular. The main idea is to construct algorithms that require only dot products of pattern vectors, which can be computed efficiently in high-dimensional spaces. Examples of this type of algorithm are the Support Vector Machine [3] and Kernel Principal Component Analysis [4]. 
\nIn this paper we show that it is possible to formulate classical linear regression, and therefore also linear discriminant analysis, exclusively in terms of dot products. Therefore, kernel methods can be used to construct a nonlinear variant of discriminant analysis. We call this technique Kernel Discriminant Analysis (KDA). Contrary to a similar approach that has been published recently [5], our algorithm is a real multi-class classifier and inherits from classical LDA the convenient property of data visualization. \n\n2 Review of Linear Discriminant Analysis \n\nUnder the assumption of the data being centered (i.e. \\sum_i x_i = 0), the scatter matrices S_B and S_W are defined by \n\nS_B = \\sum_{j=1}^{c} (1/n_j) \\sum_{l,m=1}^{n_j} x_l^{(j)} (x_m^{(j)})^T, (1) \n\nS_W = \\sum_{j=1}^{c} \\sum_{l=1}^{n_j} (x_l^{(j)} - (1/n_j) \\sum_{m=1}^{n_j} x_m^{(j)}) (x_l^{(j)} - (1/n_j) \\sum_{m=1}^{n_j} x_m^{(j)})^T, (2) \n\nwhere n_j is the number of patterns x_l^{(j)} that belong to class j. \nLDA chooses a transformation matrix V that maximizes the objective function \n\nJ(V) = |V^T S_B V| / |V^T S_W V|. (3) \n\nThe columns of an optimal V are the generalized eigenvectors that correspond to the nonzero eigenvalues in S_B v_i = \\lambda_i S_W v_i. \nIn [6] and [7] we have shown that the standard LDA algorithm can be restated exclusively in terms of dot products of input vectors. The final equation is an eigenvalue equation in terms of dot product matrices, which are of size N x N. Since the solution of high-dimensional generalized eigenvalue equations may cause numerical problems (N may be large in real world applications), we present an improved algorithm that reformulates discriminant analysis as a regression problem. Moreover, this version allows a simple implementation of the EM-algorithm in feature spaces. 
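The scatter matrices (1)-(2) and the generalized eigenproblem above can be illustrated with a short NumPy sketch. This is our own illustration, not the paper's implementation; all function and variable names are ours, and a small ridge term is added only to keep S_W invertible:

```python
import numpy as np

def lda_variates(X, y, n_components=None):
    """Canonical variates via the generalized eigenproblem S_B v = lambda S_W v.

    X: (N, d) globally centered data matrix; y: (N,) integer class labels.
    """
    classes = np.unique(y)
    d = X.shape[1]
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for j in classes:
        Xj = X[y == j]
        nj = len(Xj)
        mj = Xj.mean(axis=0)
        S_B += nj * np.outer(mj, mj)      # between-group scatter, cf. eq. (1)
        S_W += (Xj - mj).T @ (Xj - mj)    # within-group scatter, cf. eq. (2)
    # small ridge keeps S_W nonsingular, in the spirit of the penalty used later
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W + 1e-8 * np.eye(d), S_B))
    order = np.argsort(-evals.real)
    k = n_components if n_components is not None else len(classes) - 1
    return evecs.real[:, order[:k]]       # columns: leading canonical variates
```

Projecting the data onto the returned columns and assigning each point to the nearest class centroid in that (c - 1)-dimensional space reproduces the prototype classifier described in the introduction.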
\n3 Linear regression analysis \n\nIn this section we give a brief review of linear regression analysis, which we use as \"building block\" for LDA. The task of linear regression analysis is to approximate the regression function by a linear function \n\nr(x) = E(Y|X = x) \\approx c + x^T \\beta, (4) \n\non the basis of a sample (y_1, x_1), ..., (y_N, x_N). Let now y denote the vector (y_1, ..., y_N)^T and X denote the data matrix whose rows are the input vectors. Using a quadratic loss function, the optimal parameters c and \\beta are chosen to minimize the average squared residual \n\nASR = N^{-1} ||y - c 1_N - X\\beta||^2 + \\beta^T \\Omega \\beta. (5) \n\n1_N denotes an N-vector of ones, and \\Omega denotes a ridge-type penalty matrix \\Omega = \\epsilon I which penalizes the coefficients of \\beta. Assuming the data being centered, i.e. \\sum_{i=1}^{N} x_i = 0, the parameters of the regression function are given by: \n\nc = N^{-1} \\sum_{i=1}^{N} y_i =: \\mu_y, \\beta = (X^T X + \\epsilon I)^{-1} X^T y. (6) \n\n4 LDA by optimal scoring \n\nIn this section the LDA problem is linked to linear regression using the framework of penalized optimal scoring. We give an overview of the detailed derivation in [2] and [8]. Considering again the problem with c classes and N data vectors, the class memberships are represented by a categorical response variable G with c levels. It is useful to code the N responses in terms of the indicator matrix Z: Z_{ij} = 1 if the i-th data vector belongs to class j, and 0 otherwise. The point of optimal scoring is to turn categorical variables into quantitative ones by assigning scores to classes: the score vector \\theta assigns the real number \\theta_j to the j-th level of G. The vector Z\\theta then represents a vector of scored training data and is regressed onto the data matrix X. 
The simultaneous estimation of scores and regression coefficients constitutes the optimal scoring problem: minimize the criterion \n\nASR(\\theta, \\beta) = N^{-1} [||Z\\theta - X\\beta||^2 + \\beta^T \\Omega \\beta] (7) \n\nunder the constraint N^{-1} ||Z\\theta||^2 = 1. According to (6), for a given score \\theta the minimizing \\beta is given by \n\n\\beta_{OS} = (X^T X + \\Omega)^{-1} X^T Z\\theta, (8) \n\nand the partially minimized criterion becomes: \n\nmin_\\beta ASR(\\theta, \\beta) = 1 - N^{-1} \\theta^T Z^T M(\\Omega) Z \\theta, (9) \n\nwhere M(\\Omega) = X (X^T X + \\Omega)^{-1} X^T denotes the regularized hat or smoother matrix. Minimizing (9) under the constraint N^{-1} ||Z\\theta||^2 = 1 can be performed by the following procedure: \n\n1. Choose an initial matrix \\Theta_0 satisfying the constraint N^{-1} \\Theta_0^T Z^T Z \\Theta_0 = I and set \\Theta_0^* = Z \\Theta_0. \n2. Run a multi-response regression of \\Theta_0^* onto X: \\hat{\\Theta}_0^* = M(\\Omega) \\Theta_0^* = XB, where B is the matrix of regression coefficients. \n3. Eigenanalyze \\Theta_0^{*T} \\hat{\\Theta}_0^* to obtain the optimal scores, and update the matrix of regression coefficients: B^* = BW, with W being the matrix of eigenvectors. \n\nIt can be shown that the final matrix B^* is, up to a diagonal scale matrix, equivalent to the matrix of LDA vectors, see [8]. \n\n5 Ridge regression using only dot products \n\nThe penalty matrix \\Omega in (5) assures that the penalized d x d covariance matrix \\hat{\\Sigma} = X^T X + \\epsilon I is a symmetric nonsingular matrix. Therefore, it has d eigenvectors e_i with corresponding positive eigenvalues \\gamma_i such that the following equations hold: \n\n\\hat{\\Sigma} = \\sum_{i=1}^{d} \\gamma_i e_i e_i^T, \\hat{\\Sigma}^{-1} = \\sum_{i=1}^{d} (1/\\gamma_i) e_i e_i^T. (10) \n\nThe first equation implies that the l leading eigenvectors e_i, with eigenvalues \\gamma_i > \\epsilon, have an expansion in terms of the input vectors. Note that l is the number of nonzero eigenvalues of the unpenalized covariance matrix X^T X. 
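The consequence of this expansion property can be checked numerically before the formal derivation: the ridge estimate of (6) always coincides with a vector in the span of the input vectors. The following is our own sketch with random data, not part of the original derivation:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, eps = 20, 50, 0.5            # more dimensions than samples
X = rng.normal(size=(N, d))
X = X - X.mean(axis=0)             # centered data, as assumed in the text
y = rng.normal(size=N)

# primal ridge estimate, eq. (6): beta = (X^T X + eps I)^{-1} X^T y
beta_primal = np.linalg.solve(X.T @ X + eps * np.eye(d), X.T @ y)

# dual form: beta = X^T alpha with alpha = (X X^T + eps I)^{-1} y
alpha = np.linalg.solve(X @ X.T + eps * np.eye(N), y)
beta_dual = X.T @ alpha

assert np.allclose(beta_primal, beta_dual)   # identical coefficient vectors
```

The equality rests on the matrix identity (X^T X + eps I)^{-1} X^T = X^T (X X^T + eps I)^{-1}, which is the algebraic core of the dot-product formulation derived in the text.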
Together with (6), it follows for the general case, when the dimensionality d may exceed l, that \\beta can be written as the sum of two terms: an expansion in terms of the vectors x_i with coefficients \\alpha_i, and a similar expansion in terms of the remaining eigenvectors: \n\n\\beta = \\sum_{i=1}^{N} \\alpha_i x_i + \\sum_{j=l+1}^{d} \\zeta_j e_j = X^T \\alpha + \\sum_{j=l+1}^{d} \\zeta_j e_j, (11) \n\nwith \\alpha = (\\alpha_1, ..., \\alpha_N)^T. However, the last term can be dropped, since every eigenvector e_j, j = l+1, ..., d is orthogonal to every vector x_i and does not influence the value of the regression function (4). \nThe problem of penalized linear regression can therefore be stated as minimizing \n\nASR(\\alpha) = N^{-1} [||y - X X^T \\alpha||^2 + \\alpha^T X \\Omega X^T \\alpha]. (12) \n\nA stationary vector \\alpha is determined by \n\n\\alpha = (X X^T + \\epsilon I)^{-1} y. (13) \n\nLet now the dot product matrix K be defined by K_{ij} = x_i^T x_j, and let for a given test point x_t the dot product vector k_t be defined by k_t = X x_t. With this notation the regression function of a test point x_t reads \n\nr(x_t) = \\mu_y + k_t^T (K + \\epsilon I)^{-1} y. (14) \n\nThis equation requires only dot products and we can apply the kernel trick. The final equation (14), up to the constant term \\mu_y, has also been found by SAUNDERS et al. [9]. They restated ridge regression in dual variables and optimized the resulting criterion function with a Lagrange multiplier technique. Note that our derivation, which is a direct generalization of the standard linear regression formalism, leads in a natural way to a class of more general regression functions including the constant term. \n\n6 LDA using only dot products \n\nSetting \\beta = X^T \\alpha as in (11) and using the notation of section 5, for a given score \\theta the optimal vector \\alpha is given by: \n\n\\alpha_{OS} = (X X^T + \\epsilon I)^{-1} Z\\theta. 
\n(15) \n\nAnalogous to (9), the partially minimized criterion becomes: \n\nmin_\\alpha ASR(\\theta, \\alpha) = 1 - N^{-1} \\theta^T Z^T \\tilde{M}(\\Omega) Z \\theta, (16) \n\nwith \\tilde{M}(\\Omega) = X X^T (X X^T + \\Omega)^{-1} = K (K + \\epsilon I)^{-1}. \nTo minimize (16) under the constraint N^{-1} ||Z\\theta||^2 = 1, the procedure described in section 4 can be used when M(\\Omega) is substituted by \\tilde{M}(\\Omega). The matrix Y whose rows are the input vectors projected onto the column vectors of B^* is given by: \n\nY = X B^* = K (K + \\epsilon I)^{-1} Z \\Theta_0 W. (17) \n\nNote that again the dot product matrix K is all that is needed to calculate Y. \n\n7 The kernel trick \n\nThe main idea of constructing nonlinear algorithms is to apply the linear methods not in the space of observations but in a feature space F that is related to the former by a nonlinear mapping \\phi: R^d -> F, x -> \\phi(x). \nAssuming that the mapped data are centered in F, i.e. \\sum_{i=1}^{N} \\phi(x_i) = 0, the presented algorithms remain formally unchanged if the dot product matrix K is computed in F: K_{ij} = (\\phi(x_i) . \\phi(x_j)). As shown in [4], this assumption can be dropped by writing \\tilde{\\phi}(x_i) := \\phi(x_i) - (1/N) \\sum_{k=1}^{N} \\phi(x_k) instead of the mapping \\phi. \nComputation of dot products in feature spaces can be done efficiently by using kernel functions k(x_i, x_j) [3]: for some choices of k there exists a mapping \\phi into some feature space F such that k acts as a dot product in F. Among possible kernel functions there are e.g. Radial Basis Function (RBF) kernels of the form k(x, y) = exp(-||x - y||^2 / c). \n\n8 The EM-algorithm in feature spaces \n\nLDA can be derived as the maximum likelihood method for normal populations with different means and common covariance matrix \\Sigma (see [11]). Coding the class membership of the observations in the matrix Z as in section 4, LDA maximizes the (complete data) log-likelihood function \n\nl_C = \\sum_{i=1}^{N} \\sum_{k=1}^{c} Z_{ik} log(\\pi_k \\phi_k(x_i)). \n\nThis concept can be generalized to the case that only the group membership of N_c < N observations is known ([14], p. 679): the EM-algorithm provides a convenient method for maximizing the likelihood function with missing data: \n\nE-step: set p_ki = Prob(x_i in class k): \n\np_ki = Z_ik if the class membership of x_i has been observed, and p_ki = \\pi_k \\phi_k(x_i) / \\sum_{k'=1}^{c} \\pi_{k'} \\phi_{k'}(x_i) otherwise, where \\phi_k(x_i) \\propto exp[-(1/2) (x_i - \\mu_k)^T \\Sigma^{-1} (x_i - \\mu_k)]. \n\nM-step: set \\pi_k = N^{-1} \\sum_{i=1}^{N} p_ki. \n\nThe idea behind this approach is that even an unclassified observation can be used for estimation if it is given a proper weight according to its posterior probability of class membership. The M-step can be seen as weighted mean and covariance maximum likelihood estimates in a weighted and augmented problem: we augment the data by replicating the N observations c times, with the l-th such replication having observation weights p_li. The maximization of the likelihood function can be achieved via a weighted and augmented LDA. It turns out that it is not necessary to explicitly replicate the observations and run a standard LDA: the optimal scoring version of LDA described in section 4 allows an implicit solution of the augmented problem that still uses only N observations. Instead of using a response indicator matrix Z, one uses a blurred response matrix \\tilde{Z}, whose rows consist of the current class probabilities for each observation. At each M-step this \\tilde{Z} is used in a multiple linear regression followed by an eigen-decomposition. A detailed derivation is given in [11]. Since we have shown that the optimal scoring problem can be solved in feature spaces using kernel functions, this is also the case for the whole EM-algorithm: the E-step requires only differences in Mahalanobis distances, which are supplied by KDA. 
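The E- and M-steps above can be sketched in a few lines. This is our own simplified illustration in input space with a shared spherical covariance (Sigma = I), not the paper's kernelized version, which obtains the Mahalanobis distances from KDA; all names are ours:

```python
import numpy as np

def e_step(X, mus, pis, labels=None):
    """Posterior class probabilities p_ki under a shared spherical covariance.

    labels: array of class indices, with -1 marking unlabelled observations.
    """
    # phi_k(x) proportional to exp(-1/2 ||x - mu_k||^2), i.e. Sigma = I
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)
    logp = np.log(pis)[None, :] - 0.5 * d2
    P = np.exp(logp - logp.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    if labels is not None:                    # observed memberships stay fixed
        obs = labels >= 0
        P[obs] = np.eye(len(pis))[labels[obs]]
    return P

def m_step(X, P):
    """Mixing proportions and weighted class means from the blurred responses."""
    pis = P.mean(axis=0)                      # pi_k = N^-1 sum_i p_ki
    mus = (P.T @ X) / P.sum(axis=0)[:, None]  # weighted mean estimates
    return mus, pis
```

Iterating the two functions with mostly unlabelled data reproduces the semi-supervised behaviour described in the text: the few labelled points anchor the components, while the unlabelled points refine the estimates through their posterior weights.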
\nAfter iterated application of the E- and M-steps, an observation is classified to the class k with highest probability p_k. This leads to a unified framework for pure mixture analysis (N_c = 0), pure discriminant analysis (N_c = N) and the semi-supervised models of discriminant analysis with partially unclassified observations (0 < N_c < N) in feature spaces. \n\n9 Experiments \n\nWaveform data: We illustrate KDA on a popular simulated example, taken from [10], p. 49-55, and used in [2, 11]. It is a three class problem with 21 variables. The learning set consisted of 100 observations per class. The test set was of size 1000. The results are given in table 1. \n\nTable 1: Results for waveform data. The values are averages over 10 simulations. The 4 entries above the line are taken from [11]. QDA: quadratic discriminant analysis, FDA: flexible discriminant analysis, MDA: mixture discriminant analysis. \n\nTechnique | Training Error [%] | Test Error [%] \nLDA | 12.1(0.6) | 19.1(0.6) \nQDA | 3.9(0.4) | 20.5(0.6) \nFDA (best model parameters) | 10.0(0.6) | 19.1(0.6) \nMDA (best model parameters) | 13.9(0.5) | 15.5(0.5) \nKDA (RBF kernel, \\sigma = 2, \\epsilon = 1.5) | 10.7(0.6) | 14.1(0.7) \n\nThe Bayes risk for the problem is about 14% [10]. KDA outperforms the other nonlinear versions of discriminant analysis and reaches the Bayes rate within the error bounds, indicating that one cannot expect significant further improvement using other classifiers. Figure 1 demonstrates the data visualization property of KDA. Since for a 3 class problem the dimensionality of the projected space equals 2, the data can be visualized without any loss of information. In the left plot one can see the projected learning data and the class centroids; the right plot shows the test data and again the class centroids of the learning set. \n\nFigure 1: Data visualization with KDA. 
Left: learning set; right: test set. \n\nTo demonstrate the effect of using unlabeled data for classification, we repeated the experiment with waveform data using only 20 labeled observations per class. We compared the classification results on a test set of size 300 using only the labeled data (error rate E_1) with the results of the EM-model, which considers the test data as incomplete measurements during an iterative maximization of the likelihood function (error rate E_2). Using an RBF kernel (\\sigma = 250), we obtained the following mean error rates over 20 simulations: E_1 = 30.5(3.6)%, E_2 = 17.1(2.7)%. The classification performance could be drastically improved by including the unlabelled data in the learning process. \nObject recognition: We tested KDA on the MPI Chair Database(1). It consists of 89 regularly spaced views from the upper viewing hemisphere of 25 different classes of chairs as a training set and 100 random views of each class as a test set. The available images are downscaled to 16 x 16 pixels. We did not use the additional 4 edge detection patterns for each view. Classification results for several classifiers are given in table 2. \n\nTable 2: Classification results [%] on the MPI chair database. KDA, poly. kernel: 2.1. \n\nFor a comparison of the computational performance we also trained the SVM-light implementation (V 2.0) on the data [13]. In this experiment with 25 classes the KDA algorithm proved to be significantly faster than the SVM: using the RBF kernel, KDA was 3 times faster; with the polynomial kernel, KDA was 20 times faster than SVM-light. \n\n10 Discussion \n\nIn this paper we present a nonlinear version of classical linear discriminant analysis. The main idea is to map the input vectors into a high- or even infinite dimensional feature space and to apply LDA in this enlarged space. Restating LDA in a way that only dot products of input vectors are needed makes it possible to use kernel representations of dot products. 
This overcomes numerical problems in high-dimensional feature spaces. We studied the classification performance of the KDA classifier on simulated waveform data and on the MPI chair database, which has been widely used for benchmarking in the literature. For medium size problems, especially if the number of classes is high, the KDA algorithm proved to be significantly faster than a SVM while leading to the same classification performance. From classical LDA the presented algorithm inherits the convenient property of data visualization, since it allows low dimensional views of the data vectors. This makes an intuitive interpretation possible, which is helpful in many practical applications. The presented KDA algorithm can be used as the maximization step in an EM algorithm in feature spaces. This allows unlabeled observations to be included in the learning process, which can improve classification results. Studying the performance of KDA for other classification problems, as well as a theoretical comparison of the optimization criteria used in the KDA and SVM algorithms, will be the subject of future work. \n\n(1) The database is available via ftp://ftp.mpik-tueb.mpg.de/pub/chair_dataset/ \n\nAcknowledgements \n\nThis work was supported by Deutsche Forschungsgemeinschaft, DFG. We profited heavily from discussions with Armin B. Cremers, John Held and Lothar Hermes. \n\nReferences \n\n[1] R. Duda and P. Hart, Pattern Classification and Scene Analysis. Wiley & Sons, 1973. \n[2] T. Hastie, R. Tibshirani, and A. Buja, \"Flexible discriminant analysis by optimal scoring,\" JASA, vol. 89, pp. 1255-1270, 1994. \n[3] V. N. Vapnik, Statistical Learning Theory. Wiley & Sons, 1998. \n[4] B. Schölkopf, A. Smola, and K.-R. Müller, \"Nonlinear component analysis as a kernel eigenvalue problem,\" Neural Computation, vol. 10, no. 5, pp. 1299-1319, 1998. \n[5] S. Mika, G. Rätsch, J. Weston, B. 
Schölkopf, and K.-R. Müller, \"Fisher discriminant analysis with kernels,\" in Neural Networks for Signal Processing IX (Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, eds.), pp. 41-48, IEEE, 1999. \n[6] V. Roth and V. Steinhage, \"Nonlinear discriminant analysis using kernel functions,\" Tech. Rep. IAI-TR-99-7, Department of Computer Science III, Bonn University, 1999. \n[7] V. Roth, A. Pogoda, V. Steinhage, and S. Schröder, \"Pattern recognition combining feature- and pixel-based classification within a real world application,\" in Mustererkennung 1999 (W. Förstner, J. Buhmann, A. Faber, and P. Faber, eds.), Informatik aktuell, pp. 120-129, 21. DAGM Symposium, Bonn, Springer, 1999. \n[8] T. Hastie, A. Buja, and R. Tibshirani, \"Penalized discriminant analysis,\" Annals of Statistics, vol. 23, pp. 73-102, 1995. \n[9] C. Saunders, A. Gammerman, and V. Vovk, \"Ridge regression learning algorithm in dual variables,\" tech. rep., Royal Holloway, University of London, 1998. \n[10] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Monterey, CA: Wadsworth and Brooks/Cole, 1984. \n[11] T. Hastie and R. Tibshirani, \"Discriminant analysis by Gaussian mixtures,\" JRSSB, vol. 58, pp. 158-176, 1996. \n[12] B. Schölkopf, Support Vector Learning. PhD thesis, 1997. R. Oldenbourg Verlag, Munich. \n[13] T. Joachims, \"Making large-scale SVM learning practical,\" in Advances in Kernel Methods - Support Vector Learning (B. Schölkopf, C. Burges, and A. Smola, eds.), MIT Press, 1999. \n[14] B. Flury, A First Course in Multivariate Statistics. Springer, 1997. ", "award": [], "sourceid": 1736, "authors": [{"given_name": "Volker", "family_name": "Roth", "institution": null}, {"given_name": "Volker", "family_name": "Steinhage", "institution": null}]}