{"title": "A Mathematical Programming Approach to the Kernel Fisher Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 591, "page_last": 597, "abstract": null, "full_text": "A Mathematical Programming Approach to the Kernel Fisher Algorithm \n\nSebastian Mika*, Gunnar Rätsch*, and Klaus-Robert Müller*+ \n\n*GMD FIRST.IDA, Kekuléstraße 7, 12489 Berlin, Germany \n+University of Potsdam, Am Neuen Palais 10, 14469 Potsdam \n\n{mika, raetsch, klaus}@first.gmd.de \n\nAbstract \n\nWe investigate a new kernel-based classifier: the Kernel Fisher Discriminant (KFD). A mathematical programming formulation based on the observation that KFD maximizes the average margin permits an interesting modification of the original KFD algorithm yielding the sparse KFD. We find that both KFD and the proposed sparse KFD can be understood in a unifying probabilistic context. Furthermore, we show connections to Support Vector Machines and Relevance Vector Machines. From this understanding, we are able to outline an interesting kernel-regression technique based upon the KFD algorithm. Simulations support the usefulness of our approach. \n\n1 Introduction \n\nRecent years have shown an enormous interest in kernel-based classification algorithms, primarily in Support Vector Machines (SVMs) [2]. The success of SVMs seems to be triggered by (i) their good generalization performance, (ii) the existence of a unique solution, and (iii) the strong theoretical background of structural risk minimization [12], supporting the good empirical results. One of the key ingredients responsible for this success is the use of Mercer kernels, allowing for nonlinear decision surfaces which might even incorporate some prior knowledge about the problem to solve. 
For our purpose, a Mercer kernel can be defined as a function k: ℝⁿ × ℝⁿ → ℝ for which some (nonlinear) mapping Φ: ℝⁿ → F into a feature space F exists, such that k(x, y) = (Φ(x) · Φ(y)). Clearly, the use of such kernel functions is not limited to SVMs. The interpretation as a dot product in another space makes it particularly easy to develop new algorithms: take any (usually linear) method and reformulate it using the training samples only in dot products, which are then replaced by the kernel. Examples thereof, among others, are Kernel-PCA [9] and the Kernel Fisher Discriminant (KFD [4]; see also [8, 1]). \n\nIn this article we consider algorithmic ideas for KFD. Interestingly, KFD - although exhibiting a similarly good performance as SVMs - has no explicit concept of a margin. This is noteworthy since the margin is often regarded as an explanation for good generalization in SVMs. We will give an alternative formulation of KFD which makes the difference between both techniques explicit and allows a better understanding of the algorithms. Another advantage of the new formulation is that we can derive more efficient algorithms for optimizing KFDs, which e.g. have sparseness properties or can be used for regression. \n\n2 A Review of Kernel Fisher Discriminant \n\nThe idea of the KFD is to solve the problem of Fisher's linear discriminant in a kernel feature space F, thereby yielding a nonlinear discriminant in the input space. First we fix some notation. Let {xᵢ | i = 1, …, ℓ} be our training sample and y ∈ {−1, 1}^ℓ the vector of corresponding labels. Furthermore, define 1 ∈ ℝ^ℓ as the vector of all ones, 1₁, 1₂ ∈ ℝ^ℓ as binary (0, 1) vectors corresponding to the class labels, and let I, I₁, and I₂ be appropriate index sets over ℓ and the two classes, respectively (with ℓₖ = |Iₖ|). 
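The Mercer kernel definition above can be made concrete with a small sketch (our own illustrative code, not part of the paper): using the RBF kernel exp(−‖x−y‖²/c) that also appears in the experiments later, the Gram matrix is symmetric positive semi-definite, and feature-space geometry can be evaluated through k alone, without ever computing Φ. The helper name rbf_kernel and all constants are assumptions of this sketch.

```python
import numpy as np

def rbf_kernel(X, Y, c=1.0):
    """Mercer kernel k(x, y) = exp(-||x - y||^2 / c), an implicit
    dot product (Phi(x) . Phi(y)) in a feature space F."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / c)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
K = rbf_kernel(X, X)

# Mercer condition, checked numerically: the Gram matrix is
# symmetric and positive semi-definite.
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() > -1e-8

# The kernel trick: the squared feature-space distance
# ||Phi(x) - Phi(y)||^2 = k(x, x) - 2 k(x, y) + k(y, y)
# is computed without forming Phi explicitly.
d2 = K[0, 0] - 2 * K[0, 1] + K[1, 1]
assert d2 >= 0.0
```

Any linear method written purely in terms of such dot products can then be kernelized by substituting K for the Euclidean Gram matrix, which is exactly the route taken for KFD below.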
\nIn the linear case, Fisher's discriminant is computed by maximizing the coefficient J(w) = (wᵀS_B w)/(wᵀS_W w) of between- and within-class variance, i.e. S_B = (m₂ − m₁)(m₂ − m₁)ᵀ and S_W = Σ_{k=1,2} Σ_{i∈Iₖ} (xᵢ − mₖ)(xᵢ − mₖ)ᵀ, where mₖ denotes the sample mean for class k. To solve the problem in a kernel feature space F one needs a formulation which makes use of the training samples only in terms of dot products. One first shows [4] that there exists an expansion for w ∈ F in terms of mapped training patterns, i.e. \n\nw = Σ_{i=1}^{ℓ} αᵢ Φ(xᵢ). (1) \n\nUsing some straightforward algebra, the optimization problem for the KFD can then be written as [5]: \n\nJ(α) = (αᵀMα)/(αᵀNα) = (αᵀμ)²/(αᵀNα), (2) \n\nwhere μₖ = (1/ℓₖ) K 1ₖ, N = KKᵀ − Σ_{k=1,2} ℓₖ μₖμₖᵀ, μ = μ₂ − μ₁, M = μμᵀ, and Kᵢⱼ = (Φ(xᵢ) · Φ(xⱼ)). \n\n3 A Mathematical Programming Formulation \n\nThe maximization of (2) can be cast as a mathematical program over the expansion coefficients α, a threshold b, and slack variables ξ ∈ ℝ^ℓ: \n\nmin_{α,b,ξ} ‖ξ‖² + C · P(α) (4) \nsubject to Kα + 1b = y + ξ, (4a) \n1ₖᵀξ = 0 for k = 1, 2, (4b) \n\nwhere P(α) is a regularizer with weighting factor C. Constraint (4a) pulls the outputs onto the labels up to the residuals ξ, and the constraints (4b) force the average residual of each class to zero, so that the average output of each class equals its label: the average margin is kept fixed while ‖ξ‖², the within-class variance of the outputs, is minimized. \n\n4 A Probabilistic Interpretation \n\nKFD, as formulated above, can also be understood in a probabilistic context. Consider the outputs f(x) = (w · Φ(x)) + b, where w is given by (1). Assuming a Gaussian noise model with variance σ², the likelihood can be written as \n\np(y|α, σ²) ∝ exp(−(1/(2σ²)) Σᵢ ((w · Φ(xᵢ)) + b − yᵢ)²) = exp(−(1/(2σ²)) ‖ξ‖²). \n\nNow, assume some prior p(α|C) over the weights with hyper-parameters C. Computing the posterior we would end up with the Relevance Vector Machine (RVM) [11]. An advantage of the RVM approach is that all hyper-parameters σ and C are estimated automatically. The drawback, however, is that one has to solve a hard, computationally expensive optimization problem. The following simplifications show how KFD can be seen as an approximation to this probabilistic approach. Assuming the noise variance σ² is known (i.e. dropping all terms depending solely on σ) and taking the logarithm of the posterior p(y|α, σ²) p(α|C) yields the following optimization problem: \n\nmin_{α,b} ‖ξ‖² − log p(α|C), (5) \n\nsubject to the constraint (4a). Interpreting the prior as a regularization operator P, introducing an appropriate weighting factor C, and adding the two zero-mean constraints (4b) yields the KFD problem (4). The latter are necessary for classification, as the two classes are independently assumed to be zero-mean Gaussians. This probabilistic interpretation has some appealing properties, which we outline in the following: \n\nInterpretation of outputs The probabilistic framework reflects the fact that the outputs produced by KFD can be interpreted as probabilities, thus making it possible to assign a confidence to the final classification. This is in contrast to SVMs, whose outputs cannot directly be seen as probabilities. \n\nNoise models In the above illustration we assumed a Gaussian noise model and some yet unspecified prior, which was then interpreted as a regularizer. Of course, one is not limited to Gaussian models. E.g. 
assuming a Laplacian noise model we would get ‖ξ‖₁ instead of ‖ξ‖² in the objective (5) or (4), respectively. Table 1 gives a selection of different noise models and their corresponding loss functions which could be used (cf. Figure 1 for an illustration). All of them still lead to convex linear or quadratic programming problems in the KFD framework. \n\nTable 1: Loss functions for the slack variables ξ and their corresponding density/noise models in a probabilistic framework [10]. \n\n  loss function                                 density model \n  ε-ins.:    |ξ|_ε                              (1/(2(1+ε))) exp(−|ξ|_ε) \n  Laplacian: |ξ|                                (1/2) exp(−|ξ|) \n  Gaussian:  ξ²/2                               (1/√(2π)) exp(−ξ²/2) \n  Huber's:   ξ²/(2σ) if |ξ| ≤ σ, else |ξ| − σ/2  ∝ exp(−ξ²/(2σ)) if |ξ| ≤ σ, else exp(σ/2 − |ξ|) \n\nRegularizers Still open in this probabilistic interpretation is the choice of the prior or regularizer p(α|C). One choice would be a zero-mean Gaussian, as for the RVM. Assuming again that this Gaussian's variance C is known and a multiple of the identity, this would lead to a regularizer of the form P(α) = ‖α‖². Crucially, choosing a single, fixed variance parameter for all α we would no longer achieve sparsity as in the RVM. But of course any other choice, e.g. from Table 1, is possible. Especially interesting is the choice of a Laplacian prior, which in the optimization procedure corresponds to an ℓ₁-loss on the α's, i.e. P(α) = ‖α‖₁. This choice leads to sparse solutions in the KFD, as the ℓ₁-norm can be seen as an approximation to the ℓ₀-norm. In the following we call this particular setting sparse KFD (SKFD). \n\nFigure 1: Illustration of Gaussian, Laplacian, Huber's robust, and ε-insensitive loss functions (dotted) and corresponding densities (solid). \n\nRegression and connection to SVM Considering the program (4), it is rather simple to modify the KFD approach for regression. Instead of ±1 outputs y we now have real-valued y's. 
And instead of two classes there is only one class left. Thus, we can use KFD for regression as well by simply dropping the distinction between classes in constraint (4b). The remaining constraint requires the average error to be zero while the variance of the errors is minimized. \n\nThis also gives a connection to SVM regression (e.g. [12]), where one uses the ε-insensitive loss for ξ (cf. Table 1) and a K-regularizer, i.e. P(α) = αᵀKα = ‖w‖². Finally, we can draw the connection to an SVM classifier as well. In SVM classification one maximizes the (smallest) margin, traded off against the complexity controlled by ‖w‖². In contrast, besides parallels in the algorithmic formulation, in KFD there is no explicit concept of a margin. Instead, implicitly the average margin, i.e. the average distance of samples from different classes, is maximized. \n\nOptimization Besides a more intuitive understanding, the formulation (4) also allows for deriving more efficient algorithms. Using a sparsity regularizer (i.e. SKFD) one could employ chunking techniques during the optimization of (4). However, the problem of selecting a good working set is not solved yet, and contrary to e.g. SVMs, for KFD all samples will influence the final solution via the constraints (4a), not just the ones with αᵢ ≠ 0. Thus these samples cannot simply be eliminated from the optimization problem. Another interesting option induced by (4) is to use a sparsity regularizer and a linear loss function, e.g. the Laplacian loss (cf. Table 1). This results in a linear program which we call linear sparse KFD (LSKFD). It can be solved very efficiently by column generation techniques known from mathematical programming. A final possibility to optimize (4) for the standard KFD problem (i.e. quadratic loss and regularizer) is described in [6]. There one uses a greedy approximation scheme which iteratively constructs a (sparse) solution to the full problem. Such an approach is straightforward to implement and much faster than solving a quadratic program, provided that the number of non-zero α's necessary to get a good approximation to the full solution is small. \n\n5 Experiments \n\nIn this section we present experimental results aiming (i) to show that KFD and some of the variants proposed here are capable of producing state-of-the-art results, and (ii) to compare the influence of different settings for the regularization P(α) and the loss function applied to ξ in kernel-based classifiers. \n\nThe Output Distribution In an initial experiment we compare the output distributions generated by an SVM and the KFD (cf. Figure 2). By maximizing the smallest margin and using linear slack variables for patterns which do not achieve a reasonable margin, the SVM produces a training output sharply peaked around ±1 with Laplacian tails inside the margin area (the inside margin area is the interval [−1, 1], the outside area its complement). In contrast, KFD produces normal distributions which have a small variance along the discriminating direction. Comparing the distributions on the training set to those on the test set, there is almost no difference for KFD. In this sense the direction found on the training data is consistent with the test data. For the SVM the output distribution on the test set is significantly different. In the example given in Figure 2 the KFD performed slightly better than the SVM (1.5% vs. 1.7%; for both, the best parameters found by 5-fold cross-validation were used), a fact that is surprising looking only at the training distribution (which is perfectly separated for the SVM but has some overlap for KFD). 
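The standard KFD outputs can be reproduced, at least qualitatively, by a minimal sketch. Assuming a quadratic loss, the quadratic regularizer P(α) = ‖α‖², and, for simplicity, dropping the zero-mean constraints (4b), program (4) reduces to a regularized least-squares problem in α and b; the function names and the toy data below are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rbf(X, Y, c=1.0):
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / c)

def kfd_simplified(X, y, c=1.0, C=0.01):
    """Solve min_{a, b} ||K a + 1 b - y||^2 + C ||a||^2, i.e.
    program (4) with quadratic loss and regularizer and the
    zero-mean constraints (4b) dropped for simplicity."""
    K = rbf(X, X, c)
    n = len(y)
    A = np.hstack([K, np.ones((n, 1))])   # unknowns stacked as [a; b]
    reg = C * np.eye(n + 1)
    reg[-1, -1] = 0.0                     # do not penalize the bias b
    sol = np.linalg.solve(A.T @ A + reg, A.T @ y)
    return sol[:-1], sol[-1], K

rng = np.random.default_rng(1)
# Two well-separated Gaussian classes as toy data, labels -1 / +1.
X = np.vstack([rng.normal(-1.0, 0.5, (40, 2)),
               rng.normal(+1.0, 0.5, (40, 2))])
y = np.hstack([-np.ones(40), np.ones(40)])

a, b, K = kfd_simplified(X, y)
out = K @ a + b
# Training outputs cluster around the class targets -1 and +1.
assert np.mean(np.sign(out) == y) > 0.95
```

Restoring the constraints (4b), or swapping the quadratic loss and regularizer for the ℓ₁ choices of Table 1, turns this least-squares system into the quadratic or linear programs (SKFD, LSKFD) discussed above.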
\n\n(Figure panels: SVM training set, SVM test set, KFD training set.) \n\nFigure 2: Comparison of output distributions on training and test set for SVM and KFD for optimal parameters on the ringnorm dataset (averaged over 100 different partitions). It is clearly observable that the training and test set distributions for KFD are almost identical, while they are considerably different for SVM. \n\nPerformance To evaluate the performance of the various KFD approaches on real data sets we performed an extensive comparison to SVMs.¹ The results in Table 2 show the average test error and the standard deviation of the averages' estimation over 100 runs with different realizations of the datasets. To estimate the necessary parameters, we ran 5-fold cross-validation on the first five realizations of the training sets and took the model parameters to be the median over the five estimates (see [7] for details of the experimental setup). \n\n¹Thanks to M. Zwitter and M. Soklic for the breast cancer data. All data sets used in the experiments can be obtained via http://www.first.gmd.de/~raetsch/. \n\nFrom Table 2 it can be seen that both SVM and the KFD variants perform equally well on average. In terms of (4), KFD denotes the formulation with quadratic regularizer, SKFD the one with ℓ₁-regularizer, and LSKFD the one with ℓ₁-regularizer and ℓ₁-loss on ξ. The comparable performance might be seen as an indicator that maximizing the smallest margin or the average margin does not make a big difference on the data sets studied. The same seems to be true for using different regularizers and loss functions. Noteworthy is the significantly higher degree of sparsity for the sparse KFD variants. \n\nRegression To show that the proposed KFD regression works in principle, we conducted a toy experiment on the sinc function (cf. Figure 3). In terms of the number of support vectors we obtain similarly sparse results as with RVMs [11], i.e. a much smaller number of non-zero coefficients than in SVM regression. A thorough evaluation is currently being carried out. \n\nFigure 3: Illustration of KFD regression. The left panel shows a fit to the noise-free sinc function sampled at 100 equally spaced points, the right panel with Gaussian noise of std. dev. 0.2 added. In both cases we used the RBF-kernel exp(−‖x−y‖²/c) of width c = 4.0 and c = 3.0, respectively. The regularization was C = 0.01 and C = 0.1 (small dots: training samples, circled dots: SVs). \n\n            SVM               KFD         SKFD              LSKFD \nBanana      11.5±0.07 (78%)   10.8±0.05   11.2±0.48 (86%)   10.6±0.04 (92%) \nB.Cancer    26.0±0.47 (42%)   25.8±0.46   25.2±0.44 (88%)   25.8±0.47 (88%) \nDiabetes    23.5±0.17 (57%)   23.2±0.16   23.1±0.18 (97%)   23.6±0.18 (97%) \nGerman      23.6±0.21 (58%)   23.7±0.22   23.6±0.23 (96%)   24.1±0.23 (98%) \nHeart       16.0±0.33 (51%)   16.1±0.34   16.4±0.31 (88%)   16.0±0.36 (96%) \nRingnorm    1.7±0.01 (62%)    1.5±0.01    1.6±0.01 (85%)    1.5±0.01 (94%) \nF.Solar     32.4±0.18 (9%)    33.2±0.17   33.4±0.17 (67%)   34.4±0.23 (99%) \nThyroid     4.8±0.22 (79%)    4.2±0.21    4.7±0.22 (89%)    4.3±0.18 (88%) \nTitanic     22.4±0.10 (10%)   23.2±0.20   22.6±0.17 (8%)    22.5±0.20 (95%) \nWaveform    9.9±0.04 (60%)    9.9±0.04    10.1±0.04 (81%)   10.2±0.04 (96%) \n\nTable 2: Comparison between KFD, sparse KFD (SKFD), sparse KFD with linear loss on ξ (LSKFD), and SVMs (see text). All experiments were carried out with RBF-kernels exp(−‖x−y‖²/c). Best result in bold face, second best in italics. 
The numbers in brackets denote the fraction of expansion coefficients which were zero. \n\n6 Conclusion and Outlook \n\nIn this work we showed how KFD can be reformulated as a mathematical programming problem. This allows a better understanding of KFD and interesting extensions: First, a probabilistic interpretation gives new insights about connections to RVM, SVM, and regularization properties. Second, using a Laplacian prior, i.e. an ℓ₁ regularizer, yields the sparse algorithm SKFD. Third, the more general modeling permits a very natural KFD algorithm for regression. Finally, due to the quadratic programming formulation, we can use tricks known from the SVM literature like chunking or active set methods for solving the optimization problem. However, the optimal choice of a working set is not completely resolved and is still an issue of ongoing research. In this sense sparse KFD inherits some of the most appealing properties of both SVM and RVM: a unique, mathematical programming solution from SVM, and a higher sparsity together with interpretable outputs from RVM. \n\nOur experimental studies show a competitive performance of our new KFD algorithms compared to SVMs. This indicates that neither the margin nor sparsity nor a specific output distribution alone seems to be responsible for the good performance of kernel machines. Further theoretical and experimental research is therefore needed to learn more about this interesting question. Our future research will also investigate the role of output distributions and their difference between training and test set. \n\nAcknowledgments This work was partially supported by grants of the DFG (JA 379/7-1, 9-1). Thanks to K. Tsuda for helpful comments and discussions. \n\nReferences \n\n[1] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach. Neural Computation, 12(10):2385-2404, 2000. \n\n[2] B.E. Boser, I.M. 
Guyon, and V.N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, 1992. \n\n[3] J.H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165-175, 1989. \n\n[4] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for Signal Processing IX, pages 41-48. IEEE, 1999. \n\n[5] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, A.J. Smola, and K.-R. Müller. Invariant feature extraction and classification in kernel spaces. In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 526-532. MIT Press, 2000. \n\n[6] S. Mika, A.J. Smola, and B. Schölkopf. An improved training algorithm for kernel Fisher discriminants. In Proceedings AISTATS 2001. Morgan Kaufmann, 2001. To appear. \n\n[7] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, March 2001. Also NeuroCOLT Technical Report NC-TR-1998-021. \n\n[8] V. Roth and V. Steinhage. Nonlinear discriminant analysis using kernel functions. In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 568-574. MIT Press, 2000. \n\n[9] B. Schölkopf, A.J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998. \n\n[10] A.J. Smola. Learning with Kernels. PhD thesis, Technische Universität Berlin, 1998. \n\n[11] M.E. Tipping. The relevance vector machine. In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 652-658. MIT Press, 2000. \n\n[12] V.N. Vapnik. 
The nature of statistical learning theory. Springer Verlag, New York, 1995. ", "award": [], "sourceid": 1930, "authors": [{"given_name": "Sebastian", "family_name": "Mika", "institution": null}, {"given_name": "Gunnar", "family_name": "R\u00e4tsch", "institution": null}, {"given_name": "Klaus-Robert", "family_name": "M\u00fcller", "institution": null}]}