{"title": "Bayesian Model Selection for Support Vector Machines, Gaussian Processes and Other Kernel Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 603, "page_last": 609, "abstract": null, "full_text": "Bayesian model selection for Support \n\nVector machines, Gaussian processes and \n\nother kernel classifiers \n\nMatthias Seeger \n\nInstitute for Adaptive and Neural Computation \n\nUniversity of Edinburgh \n\n5 Forrest Hill, Edinburgh EHI 2QL \n\nseeger@dai.ed.ac.uk \n\nAbstract \n\nWe present a variational Bayesian method for model selection over \nfamilies of kernels classifiers like Support Vector machines or Gaus(cid:173)\nsian processes. The algorithm needs no user interaction and is able \nto adapt a large number of kernel parameters to given data without \nhaving to sacrifice training cases for validation. This opens the pos(cid:173)\nsibility to use sophisticated families of kernels in situations where \nthe small \"standard kernel\" classes are clearly inappropriate. We \nrelate the method to other work done on Gaussian processes and \nclarify the relation between Support Vector machines and certain \nGaussian process models. \n\n1 \n\nIntroduction \n\nBayesian techniques have been widely and successfully used in the neural networks \nand statistics community and are appealing because of their conceptual simplicity, \ngenerality and consistency with which they solve learning problems. In this paper \nwe present a new method for applying the Bayesian methodology to Support Vector \nmachines. We will briefly review Gaussian Process and Support Vector classification \nin this section and clarify their relationship by pointing out the common roots. \nAlthough we focus on classification here, it is straightforward to apply the methods \nto regression problems as well. In section 2 we introduce our algorithm and show \nrelations to existing methods. 
Finally, we present experimental results in section 3 and close with a discussion in section 4.

Let X be a measure space (e.g. X = \mathbb{R}^d) and D = (X, t) = \{(x_1, t_1), \dots, (x_n, t_n)\}, x_i \in X, t_i \in \{-1, +1\}, a noisy i.i.d. sample from a latent function y: X \to \mathbb{R}, where P(t|y) denotes the noise distribution. Given further points x_* we wish to predict t_* so as to minimize the error probability P(t|x_*, D), or (more difficult) to estimate this probability. Generative Bayesian methods attack this problem by placing a stochastic process prior P(y(\cdot)) over the space of latent functions and then computing posterior and predictive distributions P(y|D), P(y_*|x_*, D) as

P(y|D) = \frac{P(D|y) P(y)}{P(D)}, \qquad P(y_*|D, x_*) = \int P(y_*|y) P(y|D) \, dy \qquad (1)

where y = (y(x_i))_i, y_* = y(x_*), the likelihood P(D|y) = \prod_i P(t_i|y_i) and P(D) is a normalization constant. P(t|x_*, D) can then be obtained by averaging P(t|y_*) over P(y_*|x_*, D). Gaussian process (GP) or spline smoothing models use a Gaussian process prior on y(\cdot), which can be seen as a function of X into a set of random variables such that for each finite X' \subset X the corresponding variables are jointly Gaussian (see [15] for an introduction). A GP is determined by a mean function¹ x \mapsto E[y(x)] and a positive definite covariance kernel K(x, x'). Gaussian process classification (GPC) amounts to specifying available prior knowledge by choosing a class of kernels K(x, x'|\theta), \theta \in \Theta, where \theta is a vector of hyperparameters, and a hyperprior P(\theta). Usually these choices are guided by simple attributes of y(\cdot) such as smoothness, trends and differentiability, but more general approaches to kernel design have also been considered [5]. 
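Restricted to finitely many inputs, the prior P(y(\cdot)) in (1) is just a multivariate Gaussian and the likelihood factorizes over the sample. The following sketch (our own Python/NumPy code, with an illustrative squared-exponential covariance and a logistic noise model as one possible choice of P(t|y); none of the names are from the paper) makes this concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

def se_kernel(X1, X2):
    # an illustrative positive definite covariance kernel K(x, x')
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

X = rng.normal(size=(5, 2))               # five inputs in R^2
K = se_kernel(X, X) + 1e-10 * np.eye(5)   # jitter for numerical stability

# restricted to finitely many inputs, the GP prior is just N(0, K):
y = rng.multivariate_normal(np.zeros(5), K)

# likelihood P(D|y) = prod_i P(t_i|y_i), here with a logistic noise model:
t = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
sigma = lambda u: 1.0 / (1.0 + np.exp(-u))
lik = np.prod(sigma(t * y))
```

The posterior in (1) is then proportional to this likelihood times the N(0, K) prior density.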
For 2-class classification the most common noise distribution is the binomial one, where P(t|y) = \sigma(t y), \sigma(u) = (1 + \exp(-u))^{-1} is the logistic function, and y is the logit \log(P(+1|x)/P(-1|x)) of the target distribution. For this noise model the integral in (1) is not analytically tractable, but a range of approximative techniques based on Laplace approximations [16], Markov chain Monte Carlo [7], variational methods [2] or mean field algorithms [8] are known. We follow [16]. The Laplace approach to GPC is to approximate the posterior P(y|D, \theta) by the Gaussian distribution N(\hat y, H^{-1}), where \hat y = \mathrm{argmax}\, P(y|D, \theta) is the posterior mode and H = \nabla_y \nabla_y (-\log P(y|D, \theta)), evaluated at \hat y. Then it is easy to show that the predictive distribution is Gaussian with mean k(x_*)' K^{-1} \hat y and variance k_* - k(x_*)' K^{-1} k(x_*), where K is the covariance matrix (K(x_i, x_j))_{ij}, k(\cdot) = (K(x_i, \cdot))_i, k_* = K(x_*, x_*) and the prime denotes transposition. The final discriminant is therefore a linear combination of the K(x_i, \cdot).

The discriminative approach to the prediction problem is to choose a loss function g(t, y), being an approximation to the misclassification loss² I_{\{t y \le 0\}}, and then to search for a discriminant y(\cdot) which minimizes E[g(t, y(x_*))] for the points x_* of interest (see [14]). Support Vector classification (SVC) uses the \epsilon-insensitive loss (SVC loss) g(t, y) = [1 - t y]_+, [u]_+ = u I_{\{u \ge 0\}}, which is an upper bound on the misclassification loss, and a reproducing kernel Hilbert space (RKHS) with kernel K(x, x'|\theta) as hypothesis space for y(\cdot). Indeed, Support Vector models and the Laplace method for Gaussian processes are special cases of spline smoothing models in RKHS where the aim is to minimize the functional

\sum_{i=1}^n g(t_i, y_i) + \lambda \|y(\cdot)\|_K^2 \qquad (2)

where \| \cdot \|_K denotes the norm of the RKHS. 
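A minimal sketch of the Laplace approach just described (our own Python/NumPy code on toy 1-d data, assuming a unit-amplitude squared-exponential kernel): Newton iteration finds the posterior mode \hat y, after which the predictive mean k(x_*)' K^{-1} \hat y and variance k_* - k(x_*)' K^{-1} k(x_*) follow by linear algebra.

```python
import numpy as np

sigma = lambda u: 1.0 / (1.0 + np.exp(-u))

def laplace_mode(K, t, iters=30):
    # Newton iteration for y_hat = argmax log P(y|D, theta) under the
    # logistic noise model: gradient t*sigma(-t*y) - K^{-1}y,
    # Hessian -(K^{-1} + W) with W = diag(sigma(y)(1 - sigma(y))).
    y = np.zeros(len(t))
    Kinv = np.linalg.inv(K)
    for _ in range(iters):
        grad = t * sigma(-t * y) - Kinv @ y
        W = np.diag(sigma(y) * (1.0 - sigma(y)))
        y = y + np.linalg.solve(Kinv + W, grad)
    return y

X = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
t = np.sign(X)                                    # labels follow the sign of x
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2) + 1e-8 * np.eye(len(X))

y_hat = laplace_mode(K, t)

# predictive mean at a test point x_*: a linear combination of the K(x_i, .)
x_star = 1.2
k_star = np.exp(-0.5 * (X - x_star) ** 2)
mean = k_star @ np.linalg.solve(K, y_hat)
var = 1.0 - k_star @ np.linalg.solve(K, k_star)   # k_* = 1 for this kernel
```

A positive predictive mean at x_* = 1.2 classifies the point into the +1 class, as expected from the training labels.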
It can be shown that the minimizer of (2) can be written as k(\cdot)' K^{-1} \hat y, where \hat y maximizes

-\sum_{i=1}^n g(t_i, y_i) - \lambda y' K^{-1} y. \qquad (3)

All these facts can be found in [13]. Now (3) is, up to terms not depending on y, the log posterior in the above GP framework if we choose g(t, y) = -\log P(t|y) and absorb \lambda into \theta. For the SVC loss, (3) can be transformed into a dual problem via y = K\alpha, where \alpha is a vector of dual variables, which can be efficiently solved using quadratic programming techniques. [12] is an excellent reference.

¹W.l.o.g. we only consider GPs with mean function 0 in what follows.
²I_A denotes the indicator function of the set A \subset \mathbb{R}.

Note that the SVC loss cannot be written as the negative log of a noise distribution, so we cannot reduce SVC to a special case of a Gaussian process classification model. Although a generative model for SVC is given in [11], it is easier and less problematic to regard SVC as an efficient approximation to a proper Gaussian process model. Various such models have been proposed (see [8], [4]). In this work, we simply normalize the SVC loss pointwise, i.e. use a Gaussian process model with the normalized SVC loss g(t, y) = [1 - t y]_+ + \log Z(y), Z(y) = \exp(-[1 - y]_+) + \exp(-[1 + y]_+). Note that g(t, y) is a close approximation of the (unnormalized) SVC loss. The reader might miss the SVM bias parameter, which we dropped here for clarity, but it is straightforward to apply this semiparametric extension to GP models too³.

2 A variational method for kernel classification

The real Bayesian way to deal with the hyperparameters \theta is to average P(y_*|x_*, D, \theta) over the posterior P(\theta|D) in order to obtain the predictive distribution P(y_*|x_*, D). This can be approximated by Markov chain Monte Carlo methods [7], [16] or simply by P(y_*|x_*, D, \hat\theta), \hat\theta = \mathrm{argmax}\, P(\theta|D). 
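The pointwise normalization can be written down directly. The sketch below (our own Python code, not the paper's implementation) defines the unnormalized and normalized SVC losses and demonstrates that \exp(-g(t, y)) defines a proper noise distribution:

```python
import numpy as np

def hinge(u):
    # [u]_+ = u * 1{u >= 0}
    return np.maximum(u, 0.0)

def svc_loss(t, y):
    # the (unnormalized) SVC loss [1 - t*y]_+
    return hinge(1.0 - t * y)

def Z(y):
    # pointwise normalizer Z(y) = exp(-[1 - y]_+) + exp(-[1 + y]_+)
    return np.exp(-hinge(1.0 - y)) + np.exp(-hinge(1.0 + y))

def normalized_svc_loss(t, y):
    # g(t, y) = [1 - t*y]_+ + log Z(y)
    return svc_loss(t, y) + np.log(Z(y))

y = np.linspace(-3.0, 3.0, 13)
# exp(-g(t, y)) sums to one over t in {-1, +1}, i.e. it is a proper P(t|y):
p = np.exp(-normalized_svc_loss(1.0, y)) + np.exp(-normalized_svc_loss(-1.0, y))
```

Since |log Z(y)| never exceeds 1 - log 2 here, the normalized loss stays close to the plain SVC loss, as claimed in the text.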
The latter approach, called maximum a-posteriori (MAP), can be justified in the limit of large n and often works well in practice. The basic challenge of MAP is to calculate the evidence

P(D|\theta) = \int P(D, y|\theta) \, dy = \int \exp\left(-\sum_{i=1}^n g(t_i, y_i)\right) N(y|0, K(\theta)) \, dy. \qquad (4)

Our plan is to attack (4) by a variational approach. Let \tilde P be a density from a model class \Gamma chosen to approximate the posterior P(y|D, \theta). Then:

-\log P(D|\theta) = -\int \tilde P(y) \log \frac{P(D, y|\theta)\, \tilde P(y)}{P(y|D, \theta)\, \tilde P(y)} \, dy = F(\tilde P, \theta) - \int \tilde P(y) \log \frac{P(y|D, \theta)}{\tilde P(y)} \, dy \qquad (5)

where we call F(\tilde P, \theta) = E_{\tilde P}[-\log P(D, y|\theta)] + E_{\tilde P}[\log \tilde P(y)] the variational free energy. The second term in (5) is the well-known Kullback-Leibler divergence between \tilde P and the posterior, which is nonnegative and equals zero iff \tilde P(y) = P(y|D, \theta) almost everywhere with respect to the distribution \tilde P. Thus, F is an upper bound on -\log P(D|\theta), and changing (\tilde P, \theta) to decrease F either enlarges the evidence or decreases the divergence between the posterior and its approximation, both being favourable. This idea has been introduced in [3] as ensemble learning⁴ and has been successfully applied to MLPs [1]. The latter work also introduced the model class \Gamma we use here, namely the class of Gaussians with mean \mu and factor-analyzed covariance \Sigma = V + \sum_{j=1}^M c_j c_j', V diagonal with positive elements⁵.

³This is the \"random effects model with improper prior\" of [13], p. 19, and works by placing a flat improper prior on the bias parameter.
⁴We average different discriminants (given by y) over the ensemble \tilde P.
⁵Although there is no danger of overfitting, the use of full covariances would render the optimization more difficult and more time and memory consuming.

Hinton and van Camp [3] used diagonal covariances, which would be M = 0 in our setting. 
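Since P(D, y|\theta) = P(D|y) N(y|0, K(\theta)), the free energy splits into the expected loss term plus a Kullback-Leibler divergence between the Gaussian \tilde P and the prior. A sketch of the factor-analyzed covariance and this KL part (our own Python/NumPy code; the closed form for the Gaussian KL is standard and is stated here as an assumption, not taken from the paper):

```python
import numpy as np

def fa_covariance(v, C):
    # factor-analyzed covariance Sigma = V + sum_j c_j c_j',
    # V = diag(v) with positive entries, C holds the M vectors c_j as
    # columns; only O(M n) parameters instead of O(n^2) for a full Sigma.
    return np.diag(v) + C @ C.T

def gauss_kl(mu, Sigma, K):
    # KL(N(mu, Sigma) || N(0, K)): the part of the free energy F that
    # does not involve the loss terms.
    n = len(mu)
    Kinv = np.linalg.inv(K)
    return 0.5 * (np.trace(Kinv @ Sigma) + mu @ Kinv @ mu - n
                  + np.log(np.linalg.det(K) / np.linalg.det(Sigma)))

rng = np.random.default_rng(2)
n, M = 6, 2
v = rng.uniform(0.5, 1.5, n)
C = 0.3 * rng.normal(size=(n, M))
Sigma = fa_covariance(v, C)
kl = gauss_kl(np.zeros(n), Sigma, np.eye(n))
```

Minimizing F then trades this divergence from the prior against the expected loss under \tilde P.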
By choosing a small M, we are able to track the most important correlations between the components in the posterior using O(Mn) parameters to represent \tilde P.

Having agreed on \Gamma, the criterion F and its gradients with respect to \theta and the parameters of \tilde P can easily and efficiently be computed, except for the generic term

E_{\tilde P}\left[\sum_{i=1}^n g(t_i, y_i)\right] = \sum_{i=1}^n E_{\tilde P}[g(t_i, y_i)], \qquad (6)

a sum of one-dimensional Gaussian expectations which are, depending on the actual g, either analytically tractable or can be approximated using a quadrature algorithm. For example, the expectation for the normalized SVC loss can be decomposed into expectations over the (unnormalized) SVC loss and over \log Z(y) (see end of section 1). While the former can be computed analytically, the latter expectation can be handled by replacing \log Z(y) by a piecewise defined tight bound such that the integral can be solved analytically. For the GPC loss, (6) cannot be solved analytically and was in our experiments approximated by Gaussian quadrature.

We can optimize F using a nested loop algorithm as follows. In the inner loop we run an optimizer to minimize F w.r.t. \tilde P for fixed \theta. We used a conjugate gradients optimizer since the number of parameters of \tilde P is rather large. The outer loop is an optimizer minimizing F w.r.t. \theta, and we chose a Quasi-Newton method here since the dimension of \Theta is usually rather small and gradients w.r.t. \theta are costly to evaluate.

We can use the resulting minimizer (\tilde P, \hat\theta) of F in two different ways. The most natural is to discard \tilde P, plug \hat\theta into the original architecture and predict using the mode of P(y|D, \hat\theta) as an approximation to the true posterior mode, benefitting from a kernel now adapted to the given data. 
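Each term in (6) is a one-dimensional Gaussian expectation, so a short quadrature routine suffices when no closed form is available. A sketch (our own code, not the paper's implementation) using Gauss-Hermite quadrature:

```python
import numpy as np

def gauss_expectation(g, mu, var, order=40):
    # E[g(y)] for y ~ N(mu, var) via Gauss-Hermite quadrature:
    # E[g(y)] ~= (1/sqrt(pi)) * sum_k w_k * g(mu + sqrt(2*var) * x_k)
    x, w = np.polynomial.hermite.hermgauss(order)
    return (w @ g(mu + np.sqrt(2.0 * var) * x)) / np.sqrt(np.pi)

# expected GPC loss -log sigma(t*y) at a site with t = +1, y ~ N(0.5, 1.0)
expected_loss = gauss_expectation(lambda y: np.log1p(np.exp(-y)), 0.5, 1.0)
```

Summing such terms over the n sites, with mu and var read off the marginals of \tilde P, evaluates the loss part of F.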
This is particularly interesting for Support Vector machines due to the sparseness of the final kernel expansion (typically only a small fraction of the components of the weight vector K^{-1} \hat y are non-zero; the corresponding datapoints are termed Support Vectors), which allows very efficient predictions for a large number of test points. However, we can also retain \tilde P and use it as a Gaussian approximation of the posterior P(y|D, \theta). Doing so, we can use the variance of the approximative predictive distribution P(y_*|x_*, D) to derive error bars for our predictions, although the interpretation of these figures is somewhat complicated in the case of kernel discriminants like SVM whose loss function does not correspond to a noise distribution.

2.1 Relations to other methods

Let us have a look at alternative ways to maximize (4). If the loss g(t, y) is twice differentiable everywhere, progress can be made by replacing g by its second order Taylor expansion around the mode of the integrand. This is known as the Laplace approximation and is used in [16] to maximize (4) approximately. However, this technique cannot be used for nondifferentiable losses of the \epsilon-insensitive type⁶.

Nevertheless, for the SVC loss the evidence (4) can be approximated in a Laplace-like fashion [11], and it will be interesting to compare the results of this work with ours. This approximation can be evaluated very efficiently, but is not continuous⁷ w.r.t. \theta and difficult to optimize if the dimension of \Theta is not small. 

⁶The nondifferentiabilities cannot be ignored since with probability one a nonzero number of the \hat y_i sit exactly at these margin locations.
⁷Although continuity can be accomplished by a further modification, see [11].
Opper and Winther [8] use mean field ideas to derive an approximate leave-one-out test error estimator which can be quickly evaluated, but suffers from the typical noisiness of cross-validation scores. Kwok [6] applies the evidence framework to Support Vector machines, but the technique seems to be restricted to kernels with a finite eigenfunction expansion (see [13] for details).

It is interesting to compare our variational method to the Laplace method of [16] and the variational technique of [2]. Let g(t, y) be differentiable and suppose that for given \theta we restrict ourselves to approximating (6) by replacing g(t_i, y_i) with the expansion

g(t_i, \mu_i) + \frac{\partial g}{\partial y}(t_i, \mu_i)(y_i - \mu_i) + \frac{1}{2} \frac{\partial^2 g}{\partial y^2}(t_i, \hat y_i)(y_i - \mu_i)^2, \qquad (7)

where \hat y is the posterior mode. This changes the criterion F to F_approx, say. Then it is easy to show that the Gaussian approximation to the posterior employed by the Laplace method, namely N(\hat y, (K^{-1} + W)^{-1}), W = \mathrm{diag}(\sigma(\hat y_i)(1 - \sigma(\hat y_i))), minimizes F_approx w.r.t. \tilde P if full covariances \Sigma are used, and plugging this minimizer into F_approx we end up with the evidence approximation which is maximized by the Laplace method. The latter is not a variational technique, since the approximation (7) to the loss function is not an upper bound, and it works only for differentiable loss functions. If we upper bound the loss function g(t, y) by a quadratic polynomial and add the variational parameters of this bound to the parameters of \tilde P, our method becomes broadly similar to the lower bound algorithm of [2]. Indeed, since for fixed variational parameters of the polynomials we can easily solve for the mean and covariance of \tilde P, the former parameters are the only essential ones. However, the quadratic upper bound is poor for functions like the SVC loss, and in these cases our bound is expected to be tighter. 
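The site precisions W_ii = \sigma(\hat y_i)(1 - \sigma(\hat y_i)) are just the second derivative of the GPC loss -\log \sigma(t y) with respect to y. A quick finite-difference check (our own sketch) confirms this and shows that the curvature is independent of the label t:

```python
import numpy as np

sigma = lambda u: 1.0 / (1.0 + np.exp(-u))
loss = lambda t, y: np.log1p(np.exp(-t * y))   # GPC loss -log sigma(t*y)

def second_derivative(t, y, h=1e-4):
    # central finite-difference estimate of d^2/dy^2 loss(t, y)
    return (loss(t, y + h) - 2.0 * loss(t, y) + loss(t, y - h)) / h ** 2

# analytically, d^2/dy^2 [-log sigma(t*y)] = sigma(y) * (1 - sigma(y))
curvatures = [sigma(y0) * (1.0 - sigma(y0)) for y0 in (-1.5, 0.0, 2.0)]
```

This is why W in the Laplace covariance (K^{-1} + W)^{-1} depends on \hat y alone and not on the labels.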

3 Experiments

We tested our variational algorithm on a number of datasets from the UCI machine learning repository and the DELVE archive of the University of Toronto⁸: Leptograpsus crabs, Pima Indian diabetes, Wisconsin Breast Cancer, Ringnorm, Twonorm and Waveform (class 1 against 2). Descriptions may be found on the web. In each case we normalized the whole set to zero mean, unit variance in all input columns, picked a training set at random and used the rest for testing. We chose (for X = \mathbb{R}^d) the well-known squared-exponential kernel (see [15]):

K(x, x'|\theta) = C \left( \exp\left( -\frac{1}{2} \sum_{i=1}^d w_i (x_i - x'_i)^2 \right) + v \right), \qquad \theta = ((w_i)_i, C, v)'. \qquad (8)

All parameters are constrained to be positive, so we optimized their logarithms. We did not use a prior on \theta (see the comment at the end of this section). For comparison we trained a Gaussian process classifier with the Laplace method (also without hyperprior) and a Support Vector machine using 10-fold cross-validation to select the free parameters. In the latter case we constrained the scale parameters w_i to be equal (it is infeasible to adapt d + 2 hyperparameters to the data using cross-validation) and dropped the v parameter while allowing for a bias parameter. As mentioned above, within the variational method we can use the posterior mode \hat y as well as the mean \mu of \tilde P for prediction, and we tested both methods. Error bars were not computed. The baseline method was a linear discriminant trained to minimize the squared error. Table 1 shows the test errors the different methods attained.

⁸See http://www.cs.utoronto.ca/~delve and http://www.ics.uci.edu/~mlearn/MLRepository.html.

             train   test   Var. GP      Lapl.   Var. SVM     SVM     Lin.
Name         size    size    ŷ     μ     GP       ŷ     μ     10-CV   discr.
crabs          80     120     3     4     4        4     4       4       3
pima          200     332    66    68    66       64    66      67      67
wdbc          300     269    11    11     8       10    10       9      19
twonorm       300    7100   233   224   223      297   260     163     207
ringnorm      400    7000   119   124   126      184   129     160    1763
waveform      800    2504   206   204   206      221   211     197     220

Table 1: Number of test errors for various methods.

These results show that the new algorithm performs as well as the other methods we considered. They have of course to be regarded in combination with how much effort was necessary to produce them. It took us almost a whole day and a lot of user interaction to do the cross-validation model selection. The rule of thumb that a lot of Support Vectors at the upper bound indicates too large a parameter C in (8) failed for at least two of these sets, so we had to start with very coarse grids and sweep through several stages of refinement.

An effect known as automatic relevance determination (ARD) (see [7]) can be nicely observed on some of the datasets by monitoring the scale parameters w_i in (8). Indeed, our variational SVC algorithm almost completely ignored (by driving the corresponding w_i to very small values) 3 of the 5 dimensions in \"crabs\", 2 of 7 in \"pima\" and 3 of 21 in \"waveform\". On \"wdbc\", it detected dimension 24 as particularly important with regard to separation, all this in harmony with the GP Laplace method. Thus, a sensibly parameterized kernel family together with a method of the Bayesian kind allows us to gain additional important information from a dataset, which might be used to improve the experimental design.

Results of experiments with the methods tested above and hyperpriors, as well as a more detailed analysis of the experiments, can be found in [9].

4 Discussion

We have shown how to perform model selection for Support Vector machines using approximative Bayesian variational techniques. 
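The ARD effect reported in section 3 can be made concrete with kernel (8): once a scale parameter w_i is driven towards zero, the kernel, and hence the classifier, becomes insensitive to that input dimension. A small sketch (our own Python/NumPy code with illustrative values):

```python
import numpy as np

def ard_se_kernel(X1, X2, w, C=1.0, v=0.0):
    # kernel (8): K(x, x') = C * (exp(-1/2 * sum_i w_i (x_i - x'_i)^2) + v)
    d2 = (X1[:, None, :] - X2[None, :, :]) ** 2
    return C * (np.exp(-0.5 * (d2 * w).sum(-1)) + v)

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 3))
Xnoise = X.copy()
Xnoise[:, 2] += rng.normal(size=4)   # perturb the third input dimension only

w = np.array([1.0, 1.0, 1e-8])       # ARD: third dimension effectively ignored
K1 = ard_se_kernel(X, X, w)
K2 = ard_se_kernel(Xnoise, Xnoise, w)
```

With w_2 near zero, K1 and K2 agree to within roundoff: the perturbed dimension no longer influences the covariance, which is exactly the behaviour observed on the datasets above.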
Our method is applicable to a wide range of loss functions and is able to adapt a large number of hyperparameters to given data. This allows for the use of sophisticated kernels and Bayesian techniques like automatic relevance determination (see [7]), which is not possible using other common model selection criteria like cross-validation. Since our method is fully automatic, it is easy for non-experts to use⁹, and as the evidence is computed on the training set, no training data has to be sacrificed for validation. We refer to [9], where the topics of this paper are investigated in much greater detail.

A pressing issue is the unfortunate scaling of the method with the training set size n, which is currently O(n³)¹⁰. We are currently exploring the applicability of the powerful approximations of [10], which might bring us very much closer to the desired O(n²) scaling (see also [2]). Another interesting issue would be to connect our method with the work of [5], who use generative models to derive kernels in situations where the \"standard kernels\" are not applicable or not reasonable.

⁹As an aside, this opens the possibility of comparing SVMs against other fully automatic methods within the DELVE project (see section 3).

Acknowledgments

We thank Chris Williams, Amos Storkey, Peter Sollich and Carl Rasmussen for helpful and inspiring discussions. This work was partially funded by a scholarship of the Dr. Erich Muller foundation. We are grateful to the Division of Informatics for supporting our visit in Edinburgh, and to Chris Williams for making it possible.

References

[1] David Barber and Christopher Bishop. Ensemble learning for multi-layer networks. In Advances in NIPS, number 10, pages 395-401. MIT Press, 1997.

[2] Mark N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. 
PhD thesis, University of Cambridge, 1997.

[3] Geoffrey E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 6th Annual Conference on Computational Learning Theory, pages 5-13, 1993.

[4] Tommi Jaakkola, Marina Meila, and Tony Jebara. Maximum entropy discrimination. In Advances in NIPS, number 13. MIT Press, 1999.

[5] Tommi S. Jaakkola and David Haussler. Exploiting generative models in discriminative classifiers. In Advances in NIPS, number 11, 1998.

[6] James Tin-Yau Kwok. Integrating the evidence framework and the Support Vector machine. Submitted to ESANN 99, 1999.

[7] Radford M. Neal. Monte Carlo implementation of Gaussian process models for Bayesian classification and regression. Technical Report 9702, Department of Statistics, University of Toronto, January 1997.

[8] Manfred Opper and Ole Winther. GP classification and SVM: Mean field results and leave-one-out estimator. In Advances in Large Margin Classifiers. MIT Press, 1999.

[9] Matthias Seeger. Bayesian methods for Support Vector machines and Gaussian processes. Master's thesis, University of Karlsruhe, Germany, 1999. Available at http://www.dai.ed.ac.uk/~seeger.

[10] John Skilling. Maximum entropy and Bayesian methods. Cambridge University Press, 1988.

[11] Peter Sollich. Probabilistic methods for Support Vector machines. In Advances in NIPS, number 13. MIT Press, 1999.

[12] Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998.

[13] Grace Wahba. Spline Models for Observational Data. CBMS-NSF Regional Conference Series. SIAM, 1990.

[14] Grace Wahba. Support Vector machines, reproducing kernel Hilbert spaces and the randomized GACV. Technical Report 984, University of Wisconsin, 1997.

[15] Christopher K. I. Williams. 
Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer, 1997.

[16] Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Trans. PAMI, 20(12):1342-1351, 1998.

¹⁰The running time is essentially the same as that of the Laplace method, thus being comparable to the fastest known Bayesian GP algorithm.
", "award": [], "sourceid": 1722, "authors": [{"given_name": "Matthias", "family_name": "Seeger", "institution": null}]}