{"title": "The Kernel Gibbs Sampler", "book": "Advances in Neural Information Processing Systems", "page_first": 514, "page_last": 520, "abstract": null, "full_text": "The Kernel Gibbs Sampler\n\nThore Graepel\nStatistics Research Group\nComputer Science Department\nTechnical University of Berlin\nBerlin, Germany\nguru@cs.tu-berlin.de\n\nRalf Herbrich\nStatistics Research Group\nComputer Science Department\nTechnical University of Berlin\nBerlin, Germany\nralfh@cs.tu-berlin.de\n\nAbstract\n\nWe present an algorithm that samples the hypothesis space of kernel classifiers. A uniform prior over normalised weight vectors combined with a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the chosen line. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking label noise into account.\n\n1 Introduction\n\nTwo great ideas have dominated recent developments in machine learning: the application of kernel methods and the popularisation of Bayesian inference. Focusing on the task of classification, various connections between the two areas exist: kernels have long been a part of Bayesian inference in the disguise of covariance functions that characterise priors over functions [9]. 
Also, attempts have been made to re-derive the support vector machine (SVM) [1] - possibly the most prominent representative of kernel methods - as a maximum a-posteriori (MAP) estimator in a Bayesian framework [8]. While this work suggests good strategies for evidence-based model selection, the MAP estimator is not truly Bayesian in spirit because it is not based on the concept of model averaging, which is crucial to Bayesian reasoning. As a consequence, the MAP estimator is generally not as robust as a real Bayesian estimator. While this drawback is inconsequential in a noise-free setting or in a situation dominated by feature noise, it may have severe consequences when the data is contaminated by label noise, which may lead to a multi-modal posterior distribution. In order to make use of the full Bayesian posterior distribution it is necessary to generate samples from this distribution. This contribution is concerned with the generation of samples from the Bayesian posterior over the hypothesis space of linear classifiers in arbitrary kernel spaces in the case of label noise. In contrast to [8] we consider normalised weight vectors, ||w||_K = 1, because the classification given by a linear classifier only depends on the spatial direction of the weight vector w and not on its length. This point of view leads to a hypothesis space isomorphic to the surface of an n-dimensional sphere which - in the absence of prior information - is naturally equipped with a uniform prior over directions. Incorporating the label noise model into the likelihood then leads to a piecewise constant posterior on the surface of the sphere. The kernel Gibbs sampler (KGS) is designed to sample from this type of posterior by iteratively choosing a random direction and sampling from the resulting piecewise constant one-dimensional density in the fashion of a hit-and-run algorithm [7]. 
The resulting samples can be used in various ways: i) In Bayesian transduction [3] the decision about the labels of new test points can be inferred by a majority decision of the sampled classifiers. ii) The posterior mean - the Bayes point machine (BPM) solution [4] - can be calculated as an approximation to transduction. iii) The binary entropy of candidate training points can be calculated to determine their information content for active learning [2]. iv) The model evidence [5] can be evaluated for the purpose of model selection. We would like to point out, however, that the KGS is limited in practice to a sample size of m ≈ 100 and should thus be thought of as an analytical tool to advance our understanding of the interaction of kernel methods and Bayesian reasoning.\n\nThe paper is structured as follows: in Section 2 we introduce the learning scenario and explain our Bayesian approach to linear classifiers in kernel spaces. The kernel Gibbs sampler is explained in detail in Section 3. Different applications of the KGS are discussed in Section 4, followed by an experimental demonstration of the BPM solution based on the KGS under label noise conditions. We denote n-tuples by italic bold letters (e.g. x), vectors by roman bold letters (e.g. x), random variables by sans serif font (e.g. X), and vector spaces by calligraphic capitalised letters (e.g. X). The symbols P, E and I denote a probability measure, the expectation of a random variable and the indicator function, respectively.\n\n2 Bayesian Learning in Kernel Spaces\n\nWe consider learning given a sequence x = (x_1, ..., x_m) ∈ X^m and y = (y_1, ..., y_m) ∈ {-1, +1}^m drawn iid from a fixed distribution P_XY = P_Z over the space X × {-1, +1} = Z of input-output pairs. The hypotheses are linear classifiers x ↦ ⟨w, φ(x)⟩_K =: ⟨w, x⟩_K in some fixed feature space K ⊆ ℓ_2^n, where we assume that a mapping φ : X → K is chosen a priori.¹ 
Since all we need for learning is the real-valued output ⟨w, x_i⟩_K of the classifier w at the m training points x_1, ..., x_m, we can assume that w can be expressed as (see [9])\n\nw = Σ_{i=1}^m α_i x_i .    (1)\n\nThus, it suffices to learn the m expansion coefficients α ∈ R^m rather than the n components of w ∈ K. This is particularly useful if the dimensionality dim(K) = n of the feature space K is much greater than the number m of training points (or possibly infinite). From (1) we see that all that is needed is the inner product function k(x, x') = ⟨φ(x), φ(x')⟩_K, also known as the kernel (see [9] for a detailed introduction to the theory of kernels).\n\n¹For the sake of convenience, we sometimes abbreviate φ(x) by x. This, however, should not be confused with the n-tuple x denoting the training objects.\n\nFigure 1: Illustration of the (log) posterior distribution on the surface of a 3-dimensional sphere {w ∈ R^3 | ||w||_K = 1} resulting from a label noise model with a label flip rate of q = 0.20: (a) m = 10, (b) m = 1000. The log posterior is plotted over the longitude and latitude, and for small sample size it is multi-modal due to the label noise. The classifier w* labelling the data (before label noise) was at (π/2, π).\n\nIn a Bayesian spirit we consider a prior P_W over possible weight vectors w ∈ W of unit length, i.e. W = {v ∈ K | ||v||_K = 1}. Given an iid training set z = (x, y) and a likelihood model P_{Y|X=x,W=w} we obtain the posterior P_{W|Z=z} using Bayes' formula\n\nP_{W|Z=z}(w) = P_{Y=y|X=x,W=w}(y) P_W(w) / E_W[P_{Y=y|X=x,W=w}(y)] .    (2)\n\nBy the iid assumption and the independence of the denominator from w we obtain\n\nP_{W|Z=z}(w) ∝ P_W(w) ∏_{i=1}^m P_{Y=y_i|X=x_i,W=w}(y_i) ,\n\nwhere the product term is abbreviated L[w, z]. In the absence of specific prior knowledge, symmetry suggests taking P_W uniform on W. 
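To make the representation (1) concrete, here is a minimal Python sketch (not from the paper; the kernel choice and all names are illustrative) of evaluating a classifier through its expansion coefficients and of the matrix G_ij = k(x_i, x_j) used by the algorithm of Section 3:

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma=0.5):
    # One illustrative choice of kernel k(x, x') = exp(-gamma ||x - x'||^2).
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

def gram_matrix(X, kernel=rbf_kernel):
    # G_ij = k(x_i, x_j) over the m training points.
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

def decision(alpha, X, x_new, kernel=rbf_kernel):
    # Real-valued output <w, phi(x_new)>_K with w = sum_i alpha_i x_i, eq. (1):
    # only kernel evaluations are needed, never phi itself.
    return sum(a * kernel(x_i, x_new) for a, x_i in zip(alpha, X))
```

Note that only k appears; the (possibly infinite-dimensional) map φ never has to be computed.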
Furthermore, we choose the likelihood model\n\nP_{Y|X=x,W=w}(y) = q if y⟨w, x⟩_K ≤ 0, and 1 - q otherwise,\n\nwhere q specifies the assumed level of label noise. Please note the difference to the commonly assumed model of feature noise, which essentially assumes noise in the (mapped) input vectors x instead of the labels y and constitutes the basis of the soft-margin SVM [1]. Thus the likelihood L[w, z] of the weight vector w is given by\n\nL[w, z] = q^{m·R_emp[w,z]} (1 - q)^{m·(1 - R_emp[w,z])} ,    (3)\n\nwhere the training error R_emp[w, z] is defined as\n\nR_emp[w, z] = (1/m) Σ_{i=1}^m I_{y_i⟨w,x_i⟩_K ≤ 0} .\n\nFigure 2: Schematic view of the kernel Gibbs sampling procedure. Two data points y_1 x_1 and y_2 x_2 divide the space of normalised weight vectors W into four equivalence classes with different posterior density, indicated by the gray shading. In each iteration, starting from w_{j-1}, a random direction v with v ⊥ w_{j-1} is generated. We sample from the piecewise constant density on the great circle determined by the plane defined by w_{j-1} and v. In order to obtain ζ*, we calculate the 2m angles ζ_i where the training samples intersect with the circle and keep track of the number m·e_i of training errors for each region i.\n\nClearly, the posterior P_{W|Z=z} is piecewise constant for all w with equal training error R_emp[w, z] (see Figure 1).\n\n3 The Kernel Gibbs Sampler\n\nIn order to sample from P_{W|Z=z} on W we suggest a Markov chain sampling method. For a given value of q, the sampling scheme can be decomposed into the following steps (see Figure 2):\n\n1. Choose an arbitrary starting point w_0 ∈ W and set j = 0.\n2. Choose a direction v ∈ W in the tangent space {v ∈ W | ⟨v, w_j⟩_K = 0}.\n3. 
Calculate all m hit points b_i ∈ W from w_j in direction v with the hyperplane having normal y_i x_i. Before normalisation, this is achieved by [4]\n\nb_i = w_j - (⟨w_j, x_i⟩_K / ⟨v, x_i⟩_K) v .\n\n4. Calculate the 2m angular distances ζ_i from the current position w_j:\n\n∀i ∈ {1, ..., m}: ζ_{2i-1} = -sign(⟨v, b_i⟩_K) arccos(⟨w_j, b_i⟩_K) ,\n∀i ∈ {1, ..., m}: ζ_{2i} = (ζ_{2i-1} + π) mod 2π .\n\n5. Sort the ζ_i in ascending order, i.e. find a permutation Π : {1, ..., 2m} → {1, ..., 2m} such that\n\n∀i ∈ {2, ..., 2m}: ζ_{Π(i-1)} ≤ ζ_{Π(i)} .\n\n6. Calculate the training errors e_i of the 2m intervals [ζ_{Π(i)}, ζ_{Π(i+1)}] by evaluating the classifier at the interval midpoint,\n\ne_i = R_emp[cos((ζ_{Π(i)} + ζ_{Π(i+1)})/2) w_j - sin((ζ_{Π(i)} + ζ_{Π(i+1)})/2) v, z] .\n\nHere, we used the shorthand notation ζ_{Π(2m+1)} = ζ_{Π(1)}.\n\n7. Sample an angle ζ* using the piecewise uniform distribution and (3).\n8. Calculate a new sample w_{j+1} by w_{j+1} = cos(ζ*) w_j - sin(ζ*) v.\n9. Set j ← j + 1 and go back to step 2.\n\nSince the algorithm is carried out in feature space K we can use\n\nw = Σ_{i=1}^m α_i x_i ,    v = Σ_{i=1}^m ν_i x_i ,    b = Σ_{i=1}^m β_i x_i .\n\nFor the inner products and norms it follows that, e.g.,\n\n⟨w, v⟩_K = α'Gν ,    ||w||²_K = α'Gα ,\n\nwhere the m × m matrix G is known as the Gram matrix and is given by\n\nG_ij = ⟨x_i, x_j⟩_K = k(x_i, x_j) .\n\nAs a consequence, the above algorithm can be implemented in arbitrary kernel spaces making use of k only.\n\n4 Applications of the Kernel Gibbs Sampler\n\nThe kernel Gibbs sampler provides samples from the full posterior distribution over the hypothesis space of linear classifiers in kernel space for the case of label noise. These samples can be used for various tasks related to learning. In the following we will present a selection of these tasks. 
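The nine steps of Section 3 can be sketched compactly in coefficient space. The following Python fragment is an illustrative re-implementation, not the authors' code: it computes the sign-change angles directly from the classifier outputs Gα and Gν (equivalent to going through the hit points b_i of steps 3-4), and it assumes 0 < q < 1 and a positive definite Gram matrix G:

```python
import numpy as np

def kgs_step(alpha, G, y, q, rng):
    """One hit-and-run iteration in the style of the kernel Gibbs sampler.

    alpha: expansion coefficients of the current w_j (alpha' G alpha = 1),
    G: Gram matrix, y: labels in {-1, +1}, q: assumed label flip rate.
    """
    m = len(y)
    # Step 2: random direction v in the tangent space, <v, w_j>_K = 0.
    nu = rng.standard_normal(m)
    nu -= (alpha @ G @ nu) * alpha          # K-orthogonalise against w_j
    nu /= np.sqrt(nu @ G @ nu)              # normalise so that ||v||_K = 1
    # Outputs of w_j and v at the training points.
    s, t = G @ alpha, G @ nu
    # Steps 3-4: angles where y_i <w(theta), x_i>_K changes sign, with
    # w(theta) = cos(theta) w_j - sin(theta) v.
    z = np.arctan2(s, t) % (2 * np.pi)
    # Step 5: both crossings per point, sorted; close the circle at the end.
    zetas = np.sort(np.concatenate([z, (z + np.pi) % (2 * np.pi)]))
    zetas = np.append(zetas, zetas[0] + 2 * np.pi)
    # Step 6: training error of each interval, evaluated at its midpoint.
    mids = 0.5 * (zetas[:-1] + zetas[1:])
    out = np.cos(mids)[:, None] * s - np.sin(mids)[:, None] * t
    errs = np.mean(y * out <= 0, axis=1)
    # Step 7: interval mass = length * q^(m e) (1-q)^(m (1-e)), eq. (3).
    logw = np.log(np.diff(zetas)) + m * (
        errs * np.log(q) + (1 - errs) * np.log(1 - q))
    w = np.exp(logw - logw.max())
    k = rng.choice(len(w), p=w / w.sum())
    zeta_star = rng.uniform(zetas[k], zetas[k + 1])
    # Step 8: new sample w_{j+1} = cos(zeta*) w_j - sin(zeta*) v.
    return np.cos(zeta_star) * alpha - np.sin(zeta_star) * nu
```

Because ⟨v, w_j⟩_K = 0 and both vectors have unit K-norm, the returned coefficients again satisfy α'Gα = 1, so iterating the step keeps the chain on W.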
Bayesian Transduction. Given a sample from the posterior distribution over hypotheses, a good strategy for prediction is to let the sampled classifiers vote on each new test data point. This mode of prediction is closest to the Bayesian spirit and has been shown, for the zero-noise case, to yield excellent generalisation performance [3]. Also, the fraction of votes for the majority decision is an excellent indicator of the reliability of the final estimate: rejecting those test points with the closest decision results in a great reduction of the generalisation error on the remaining test points. Given the posterior P_{W|Z=z}, the transductive decision is\n\nBT_z(x) = sign(E_{W|Z=z}[sign(⟨W, x⟩_K)]) .    (4)\n\nIn practice, this estimator is approximated by replacing the expectation E_{W|Z=z} by a sum over the sampled weight vectors w_j.\n\nBayes Point Machines. For classification, Bayesian transduction requires the whole collection of sampled weight vectors w_j in memory. Since this may be impractical for large data sets we would like to derive a single classifier w from the Bayesian posterior. An excellent approximation of the transductive decision BT_z(x) by a single classifier is obtained by exchanging the expectation with the inner sign-function in (4). Then the classifier h_bp is given by\n\nh_bp(x) = sign(⟨E_{W|Z=z}[W], x⟩_K) =: sign(⟨w_bp, x⟩_K) ,    (5)\n\nwhere the classifier w_bp is referred to as the Bayes point and has been shown to yield generalisation performance superior to the well-known support vector solution w_SVM, which, in turn, can be looked upon as an approximation to w_bp in the noise-free case [4]. Again, w_bp is estimated by replacing the expectation by the mean over samples w_j. Note that there exists no SVM equivalent w_SVM to the Bayes point w_bp in the case of label noise - a fact to be elaborated on in the experimental part in Section 5. 
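Given posterior samples, both estimators (4) and (5) reduce to a few lines. A sketch (names and array layout are assumptions; samples are represented by their expansion coefficients):

```python
import numpy as np

def transduce(alphas, K_test):
    """Majority vote of sampled classifiers, eq. (4).

    alphas: (n_samples, m) expansion coefficients of posterior samples w_j,
    K_test: (n_test, m) kernel values k(x, x_i) between test and training points.
    Returns predictions in {-1, +1} and the vote fraction as a confidence score.
    """
    votes = np.sign(K_test @ np.asarray(alphas).T)   # each sample's decision
    mean_vote = votes.mean(axis=1)
    pred = np.where(mean_vote >= 0, 1, -1)
    confidence = np.abs(mean_vote)                   # margin of the majority
    return pred, confidence

def bayes_point(alphas):
    # Single-classifier approximation, eq. (5): the posterior mean in
    # coefficient space, w_bp = sum_i (mean alpha)_i x_i.
    return np.mean(np.asarray(alphas), axis=0)
```

The confidence returned by `transduce` is exactly the quantity used above for rejection: test points with the smallest vote margins are the ones whose decisions are least reliable.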
Figure 3: A set of 50 samples w_j of the posterior P_{W|Z=z} for various noise levels q (q = 0.0, q = 0.1, q = 0.2). Shown are the resulting decision boundaries in data space X.\n\nActive Learning. The Bayesian posterior can also be employed to determine the usefulness of candidate training points - a task that can be considered as a dual counterpart to Bayesian transduction. This is particularly useful when the label y of a training point x is more expensive to obtain than the training point x itself. It was shown in the context of \"Query by Committee\" [2] that the binary entropy\n\nS(x, z) = -p_+ log_2 p_+ - p_- log_2 p_-\n\nwith p_± = P_{W|Z=z}(±⟨W, x⟩_K > 0) is an indicator of the information content of a data point x with regard to the learning task. Samples w_j from the Bayesian posterior P_{W|Z=z} make it possible to estimate S for a given candidate training point x and the current training set z, so as to decide on the basis of S whether it is worthwhile to query the corresponding label y.\n\nEvidence Estimation for Model Selection. Bayesian model selection is often based on a quantity called the evidence [5] of the model, given by the denominator of (2), E_W[P_{Y=y|X=x,W=w}(y)]. In the PAC-Bayesian framework this quantity has been demonstrated to be responsible for the generalisation performance of a model [6]. It turns out that in the zero-noise case the margin (the quantity maximised by the SVM) is a measure of the evidence of the model used [4]. In the case of label noise the KGS serves to estimate this quantity.\n\n5 Experiments\n\nIn a first experiment we used a surrogate dataset of m = 76 data points x in X = R^2 and the kernel k(x, x') = exp(-½||x - x'||²_2). Using the KGS we sampled 50 different classifiers with weight vectors w_j for various noise levels q and plotted the resulting decision boundaries {x ∈ R^2 | ⟨w_j, x⟩_K = 0} in Figure 3 (circles and crosses depict different classes). 
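A sample-based estimate of the binary entropy S follows directly. A sketch, assuming the samples are given by their expansion coefficients and `k_candidate` holds the kernel values k(x, x_i) for the candidate point:

```python
import numpy as np

def entropy_score(alphas, k_candidate):
    """Binary entropy S(x, z) of a candidate point x, estimated from
    posterior samples: 1 bit when the samples split evenly on x,
    0 bits when they all agree (x carries no information)."""
    outputs = np.asarray(alphas) @ np.asarray(k_candidate)  # <w_j, x>_K
    p_plus = np.mean(outputs > 0)
    p_minus = 1.0 - p_plus
    s = 0.0
    for p in (p_plus, p_minus):
        if p > 0:                     # convention: 0 log 0 := 0
            s -= p * np.log2(p)
    return s
```

In an active-learning loop one would rank the unlabelled candidates by this score and query the label of the highest-entropy point.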
As can be seen from these plots, increasing the noise level q leads to more diverse classifiers on the training set z.\n\nIn a second experiment we investigated the generalisation performance of the Bayes point machine (see (5)) in the case of label noise. In R^3 we generated 100 random training and test sets of size m_train = 100 and m_test = 1000, respectively. For each normalised point x ∈ R^3 the longitude and latitude were sampled from a Beta(5, 5) and a Beta(0.1, 0.1) distribution, respectively. The classes y were obtained by randomly flipping the classes assigned by the classifier w* at (π/2, π) (see also Figure 1) with a true label flip rate of q* = 5%. In Figure 4 we plotted the estimated generalisation error for a BPM (trained using 100 samples w_j from the KGS) and a quadratic soft-margin SVM at different label noise levels q and margin slack penalisations λ, respectively.\n\nFigure 4: Comparison of BPMs and SVMs on data contaminated by label noise. Generalisation errors of BPMs (circled error-bars) and soft-margin SVMs (triangled error-bars) vs. assumed noise level q and margin slack penalisation λ, respectively. The dataset consisted of m = 100 observations with a label noise of 5% (dotted line), and we used k(x, x') = ⟨x, x'⟩_X + λ·I_{x=x'}. Note that the abscissa is jointly used for q and λ.\n\nClearly, the BPM with the correct noise model outperformed the SVM irrespective of the chosen level of regularisation. Interestingly, the BPM appears to be quite \"robust\" w.r.t. the choice of the label noise parameter q.\n\n6 Conclusion and Future Research\n\nThe kernel Gibbs sampler provides an analytical tool for the exploration of various Bayesian aspects of learning in kernel spaces. It provides a well-founded way of dealing with label noise but suffers from its computational complexity, which - so far - makes it inapplicable to large scale applications. 
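Toy data of the kind used in this second experiment can be generated as follows. This is a sketch only: the mapping of the Beta draws to angle ranges and the placement of the fixed classifier w* are assumptions, since the text states just the distributions and the flip rate:

```python
import numpy as np

def make_sphere_data(m, q_true, rng):
    """Points on the unit sphere in R^3 with Beta-distributed angles,
    labelled by a fixed classifier and flipped with probability q_true."""
    lon = 2 * np.pi * rng.beta(5.0, 5.0, size=m)            # longitude
    lat = np.pi * rng.beta(0.1, 0.1, size=m) - np.pi / 2    # latitude
    X = np.stack([np.cos(lat) * np.cos(lon),
                  np.cos(lat) * np.sin(lon),
                  np.sin(lat)], axis=1)
    w_star = np.array([0.0, 0.0, 1.0])   # stand-in for the true classifier w*
    y = np.where(X @ w_star > 0, 1, -1)
    flip = rng.random(m) < q_true        # label noise: flip with rate q_true
    return X, np.where(flip, -y, y)
```

With `q_true = 0.05` this reproduces the qualitative setting of Figure 4: a sphere-valued sample whose labels disagree with the noise-free classifier on roughly 5% of the points.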
Therefore it will be an interesting topic for future research to invent new sampling schemes that may be able to trade accuracy for speed and would thus be applicable to large data sets.\n\nAcknowledgements. This work was partially done while RH and TG were visiting Robert C. Williamson at the ANU Canberra. Thanks, Bob, for your great hospitality!\n\nReferences\n[1] C. Cortes and V. Vapnik. Support Vector Networks. Machine Learning, 20:273-297, 1995.\n[2] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133-168, 1997.\n[3] T. Graepel, R. Herbrich, and K. Obermayer. Bayesian Transduction. In Advances in Neural Information Processing Systems 12, pages 456-462, 2000.\n[4] R. Herbrich, T. Graepel, and C. Campbell. Bayesian learning in reproducing kernel Hilbert spaces. Technical report, Technical University of Berlin, 1999. TR 99-11.\n[5] D. MacKay. The evidence framework applied to classification networks. Neural Computation, 4(5):720-736, 1992.\n[6] D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 230-234, Madison, Wisconsin, 1998.\n[7] R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical report, Dept. of Computer Science, University of Toronto, 1993. CRG-TR-93-1.\n[8] P. Sollich. Probabilistic methods for Support Vector Machines. In Advances in Neural Information Processing Systems 12, pages 349-355, San Mateo, CA, 2000. Morgan Kaufmann.\n[9] G. Wahba. Support Vector Machines, Reproducing Kernel Hilbert Spaces and the randomized GACV. Technical report, Department of Statistics, University of Wisconsin, Madison, 1997. TR-NO-984.
", "award": [], "sourceid": 1802, "authors": [{"given_name": "Thore", "family_name": "Graepel", "institution": null}, {"given_name": "Ralf", "family_name": "Herbrich", "institution": null}]}