{"title": "Bayesian Transduction", "book": "Advances in Neural Information Processing Systems", "page_first": 456, "page_last": 462, "abstract": null, "full_text": "Bayesian Transduction \n\nThore Graepel, Ralf Herbrich and Klaus Obermayer \n\nDepartment of Computer Science \n\nTechnical University of Berlin \n\nFranklinstr. 28/29, 10587 Berlin, Germany \n\n{graepeI2, raith, oby} @cs.tu-berlin.de \n\nAbstract \n\nTransduction is an inference principle that takes a training sam(cid:173)\nple and aims at estimating the values of a function at given points \ncontained in the so-called working sample as opposed to the whole \nof input space for induction. Transduction provides a confidence \nmeasure on single predictions rather than classifiers -\na feature \nparticularly important for risk-sensitive applications. The possibly \ninfinite number of functions is reduced to a finite number of equiv(cid:173)\nalence classes on the working sample. A rigorous Bayesian analysis \nreveals that for standard classification loss we cannot benefit from \nconsidering more than one test point at a time. The probability \nof the label of a given test point is determined as the posterior \nmeasure of the corresponding subset of hypothesis space. We con(cid:173)\nsider the PAC setting of binary classification by linear discriminant \nfunctions (perceptrons) in kernel space such that the probability of \nlabels is determined by the volume ratio in version space. We \nsuggest to sample this region by an ergodic billiard. Experimen(cid:173)\ntal results on real world data indicate that Bayesian Transduction \ncompares favourably to the well-known Support Vector Machine, \nin particular if the posterior probability of labellings is used as a \nconfidence measure to exclude test points of low confidence. \n\n1 \n\nIntroduction \n\nAccording to Vapnik [9], when solving a given problem one should avoid solving a \nmore general problem as an intermediate step. 
The reasoning behind this principle is that, in order to solve the more general task, resources may be wasted or compromises may have to be made which would not have been necessary for the solution of the problem at hand. A direct application of this common-sense principle reduces the more general problem of inferring a functional dependency on the whole of input space to the problem of estimating the values of a function at given points (the working sample), a paradigm referred to as transductive inference. More formally, given a probability measure $P_{XY}$ on the space of data $X \times Y = X \times \{-1,+1\}$, a training sample $S = \{(x_1, y_1), \ldots, (x_l, y_l)\}$ is generated i.i.d. according to $P_{XY}$. Additionally, $m$ data points $W = \{x_{l+1}, \ldots, x_{l+m}\}$ are drawn: the working sample. The goal is to label the objects of the working sample $W$ using a fixed set $H$ of functions $f : X \to \{-1,+1\}$ so as to minimise a predefined loss. In contrast, inductive inference aims at choosing a single function $f^* \in H$ best suited to capture the dependency expressed by the unknown $P_{XY}$. Obviously, if we have a transductive algorithm $A(W, S, H)$ that assigns to each working sample $W$ a set of labels given the training sample $S$ and the set $H$ of functions, we can define a function $f_S : X \to \{-1,+1\}$ by $f_S(x) = A(\{x\}, S, H)$ as a result of the transduction algorithm. There are two crucial differences to induction, however: i) $A(\{x\}, S, H)$ is not restricted to select a single decision function $f \in H$ for each $x$; ii) a transduction algorithm can give performance guarantees on particular labellings instead of functions. In practical applications this difference may be of great importance. \n\nAfter all, in risk-sensitive applications (medical diagnosis, financial and critical control applications) it often matters to know how confident we are about a given prediction.
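The reduction of a possibly infinite function class to finitely many labellings of the working sample can be made concrete with a small sketch; the hypothesis class and data below are hypothetical, chosen purely for illustration:

```python
# Toy illustration: a finite, hypothetical hypothesis class of
# 1-D threshold classifiers. The paper's class H (kernel
# perceptrons) is infinite, but on a working sample W it induces
# only finitely many equivalence classes of labellings.

def make_threshold(t):
    # sign(x - t), with ties broken towards +1
    return lambda x: 1 if x >= t else -1

H = [make_threshold(t) for t in (0.5, 1.5, 2.5, 3.5)]

S = [(1.0, -1), (3.0, 1)]   # training sample (x, y)
W = [2.0, 4.0]              # working sample, unlabelled

# Version space: hypotheses consistent with the training sample.
V = [f for f in H if all(f(x) == y for x, y in S)]

# Distinct labellings of W induced by the version space.
labellings = {tuple(f(x) for x in W) for f in V}
print(sorted(labellings))   # [(-1, 1), (1, 1)]
```

Only two of the four conceivable labellings of $W$ survive the consistency constraint; the posterior mass that falls on each surviving labelling is exactly the kind of per-prediction confidence that transduction can report.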
In this case a general confidence measure of the classifier w.r.t. the whole input distribution would not provide the desired warranty at all. Note that for linear classifiers some guarantee can be obtained from the margin [7], which in Section 4 we will demonstrate to be too coarse a confidence measure. The idea of transduction was put forward in [8], where also first algorithmic ideas can be found. Later, [1] suggested an algorithm for transduction based on linear programming, and [3] highlighted the need for confidence measures in transduction. \n\nThe paper is structured as follows: A Bayesian approach to transduction is formulated in Section 2. In Section 3 the function class of kernel perceptrons is introduced, to which the Bayesian transduction scheme is applied. For the estimation of volumes in parameter space we present a kernel billiard as an efficient sampling technique. Finally, we demonstrate experimentally in Section 4 how the confidence measure for labellings helps Bayesian Transduction to achieve low generalisation error at a low rejection rate of test points and thus to outperform Support Vector Machines (SVMs). \n\n2 Bayesian Transductive Classification \n\nSuppose we are given a training sample $S = \{(x_1, y_1), \ldots, (x_l, y_l)\}$ drawn i.i.d. from $P_{XY}$ and a working sample $W = \{x_{l+1}, \ldots, x_{l+m}\}$ drawn i.i.d. from $P_X$. Given a prior $P_H$ over the set $H$ of functions and a likelihood $P_{(XY)^l|H=f}$ we obtain a posterior probability $P_{H|(XY)^l=S} \stackrel{\mathrm{def}}{=} P_{H|S}$ by Bayes' rule. This posterior measure induces a probability measure on labellings $b \in \{-1,+1\}^m$ of the working sample by^1 \n\n$P_{Y^m|S,W}(b) = P_{H|S}(\{f : f(x_{l+1}) = b_1 \wedge \cdots \wedge f(x_{l+m}) = b_m\})$.   (1) \n\nFor the sake of simplicity let us assume a PAC-style setting, i.e. there exists a function $f^*$ in the space $H$ such that $P_{Y|X=x}(y) = \delta(y - f^*(x))$. In this case one can define the so-called version space as the set of functions consistent with the training sample, \n\n$V(S) = \{f \in H : \forall i \in \{1, \ldots, l\} : f(x_i) = y_i\}$,   (2) \n\noutside which the posterior $P_{H|S}$ vanishes.
Then $P_{Y^m|S,W}(b)$ represents the prior measure of functions consistent with the training sample $S$ and the labelling $b$ on the working sample $W$, normalised by the prior measure of functions consistent with $S$ alone. The measure $P_H$ can be used to incorporate prior knowledge into the inference process. If no such knowledge is available, considerations of symmetry may lead to \"uninformative\" priors. \n\n^1 Note that the number of different labellings $b$ implementable by $H$ is bounded above by the value of the growth function $\Pi_H(|W|)$ [8, p. 321]. \n\nGiven the measure $P_{Y^m|S,W}$ over labellings, in order to arrive at a risk-minimal decision w.r.t. the labelling we need to define a loss function $l : Y^m \times Y^m \to \mathbb{R}^+$ between labellings and minimise its expectation, \n\n$R(b, S, W) = E_{Y^m|S,W}[l(b, Y^m)] = \sum_{b'} l(b, b') \, P_{Y^m|S,W}(b')$,   (3) \n\nwhere the summation runs over all the $2^m$ possible labellings $b'$ of the working sample. Let us consider two scenarios: \n\n1. A 0-1-loss on the exact labelling $b$, i.e. for two labellings $b$ and $b'$ \n\n$l_e(b, b') = 1 - \prod_{i=1}^{m} \delta(b_i - b'_i) \quad \Leftrightarrow \quad R_e(b, S, W) = 1 - P_{Y^m|S,W}(b)$.   (4) \n\nIn this case choosing the labelling $b_e = \mathrm{argmin}_b R_e(b, S, W)$ of highest joint probability $P_{Y^m|S,W}(b)$ minimises the risk. This non-labelwise loss is appropriate if the goal is to exactly identify a combination of labels, e.g. the combination of handwritten digits defining a postal zip code. Note that classical SVM transduction (see, e.g., [8, 1]), by maximising the margin on the combined training and working sample, approximates this strategy and hence does not minimise the standard classification risk on single instances as intended. \n\n2. A 0-1-loss on the single labels $b_i$, i.e. for two labellings $b$ and $b'$ \n\n$l_s(b, b') = \frac{1}{m} \sum_{i=1}^{m} \left(1 - \delta(b_i - b'_i)\right)$,   (5) \n\n$R_s(b, S, W) = \sum_{b'} l_s(b, b') \, P_{Y^m|S,W}(b') = \frac{1}{m} \sum_{i=1}^{m} \left(1 - P_{H|S}(\{f : f(x_{l+i}) = b_i\})\right)$. \n\nDue to the independent treatment of the loss at working sample points, the risk $R_s(b, S, W)$ is minimised by the labelling of highest marginal probability of the labels, i.e. $b_i^s = \mathrm{argmax}_{y \in Y} P_{H|S}(\{f : f(x_{l+i}) = y\})$. Thus, in the case of the labelwise loss (5), a working sample of size $m > 1$ does not offer any advantage over working samples of size one w.r.t. the Bayes-optimal decision. Since this corresponds to the standard classification setting, we will restrict ourselves to working samples of size $m = 1$, i.e. to one working point $x_{l+1}$. \n\n3 Bayesian Transduction by Volume \n\n3.1 The Kernel Perceptron \n\nWe consider transductive inference for the class of kernel perceptrons. The decision functions are given by \n\n$f(x) = \mathrm{sign}(\langle w, \phi(x) \rangle_F) = \mathrm{sign}\left(\sum_{i=1}^{l} \alpha_i k(x_i, x)\right)$, \qquad $w = \sum_{i=1}^{l} \alpha_i \phi(x_i) \in F$, \n\nwhere the mapping $\phi : X \to F$ maps from input space $X$ to a feature space $F$ completely determined by the inner product function (kernel) $k : X \times X \to \mathbb{R}$ (see [9, 10]). \n\nFigure 1: Schematic view of data space (left) and parameter space (right) for a classification toy example. Using the duality given by $\langle w, \phi(x) \rangle_F = 0$, data points on the left correspond to hyperplanes on the right, while hyperplanes on the left can be thought of as points on the right. \n\nGiven a training sample $S = \{(x_i, y_i)\}_{i=1}^{l}$ we can define the version space - the set of all perceptrons compatible with the training data - as in (2), with the additional constraint $\|w\|_F = 1$ ensuring uniqueness. In order to obtain a prediction on the label $b_1$ of the working point $x_{l+1}$ we note that $x_{l+1}$ bisects the volume $V$ of version space into two sub-volumes $V^+$ and $V^-$, where the perceptrons in $V^+$ would classify $x_{l+1}$ as $b_1 = +1$ and those in $V^-$ as $b_1 = -1$.
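To picture this sub-volume ratio, one can fall back on naive Monte-Carlo rejection sampling of unit-norm weight vectors; the sketch below does this for plain linear perceptrons on hypothetical 2-D toy data. It only illustrates the quantity being estimated; the kernel billiard of Section 3.2 is the method the paper actually proposes, since rejection sampling becomes hopeless once version space occupies a tiny fraction of the sphere.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, linearly separable toy data in R^2.
X = np.array([[1.0, 0.2], [0.8, -0.1], [-1.0, 0.1], [-0.7, -0.3]])
y = np.array([1, 1, -1, -1])
x_test = np.array([0.3, 1.0])      # working point x_{l+1}

# Draw unit-norm weight vectors uniformly; keep those lying in
# version space (zero training error). The fraction of kept
# vectors labelling x_test as +1 estimates p+ = V+ / V.
n_hits = n_plus = 0
while n_hits < 2000:
    w = rng.standard_normal(2)
    w /= np.linalg.norm(w)
    if np.all(np.sign(X @ w) == y):
        n_hits += 1
        n_plus += int(w @ x_test >= 0)

p_plus = n_plus / n_hits
confidence = 2 * max(p_plus, 1 - p_plus) - 1   # two-sided confidence
print(p_plus, confidence)
```

The confidence computed in the last step, twice the larger label probability minus one, is the measure the paper derives later for the billiard estimates.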
\nThe ratio $p^+ = V^+/V$ is the probability of the labelling $b_1 = +1$ given a uniform prior $P_H$ over $w$ and the class of kernel perceptrons, and accordingly for $b_1 = -1$ (see Figure 1). Already Vapnik [8, p. 323] noticed that it is troublesome to estimate sub-volumes of version space. As a solution to this problem we suggest using a billiard algorithm. \n\n3.2 Kernel Billiard for Volume Estimation \n\nThe method of playing billiard in version space was first introduced by Rujan [6] for the purpose of estimating its centre of mass, and was subsequently refined and extended to kernel spaces by [4]. For Bayesian Transduction the idea is to bounce the billiard ball in version space and to record how much time it spends in each of the sub-volumes of interest. Under the assumption of ergodicity [2] w.r.t. the uniform measure, in the limit the accumulated flight times for each sub-volume are proportional to the sub-volume itself. \n\nSince the trajectory is located in $F$, each position $w$ and direction $v$ of the ball can be expressed as a linear combination of the $\phi(x_i)$, i.e. \n\n$w = \sum_{i=1}^{l} \alpha_i \phi(x_i)$, \qquad $v = \sum_{i=1}^{l} \beta_i \phi(x_i)$, \qquad $\langle w, v \rangle_F = \sum_{i,j=1}^{l} \alpha_i \beta_j k(x_i, x_j)$, \n\nwhere $\alpha, \beta$ are real vectors with $l$ components that fully determine the state of the billiard. The algorithm for the determination of the label $b_1$ of $x_{l+1}$ proceeds as follows: \n\n1. Initialise the starting position $w_0$ in $V(S)$ using any kernel perceptron algorithm that achieves zero training error (e.g. SVM [9]). Set $V^+ = V^- = 0$. \n\n2. Find the closest boundary of $V(S)$ starting from the current $w$ in direction $v$, where the flight times $\tau_j$ for all points including $x_{l+1}$ are determined using \n\n$\tau_j = -\frac{\langle w, \phi(x_j) \rangle_F}{\langle v, \phi(x_j) \rangle_F}$. \n\nThe smallest positive flight time $\tau_c = \min_{j : \tau_j > 0} \tau_j$ in kernel space corresponds to the closest data point boundary $\phi(x_c)$ on the hypersphere.
Note that if $\tau_c \to \infty$ we randomly generate a direction $v$ pointing towards version space, i.e. $y \langle v, \phi(x) \rangle_F > 0$, assuming the last bounce was at $\phi(x)$. \n\n3. Calculate the ball's new position $w'$ according to \n\n$w' = \frac{w + \tau_c v}{\|w + \tau_c v\|_F}$. \n\nCalculate the distance $d = \|w - w'\|_{\mathrm{sphere}} = \arccos(1 - \|w - w'\|_F^2 / 2)$ on the hypersphere and add it to the volume estimate $V^y$ corresponding to the current label $y = \mathrm{sign}(\langle w + w', \phi(x_{l+1}) \rangle_F)$. If the test point $\phi(x_{l+1})$ was hit, i.e. $c = l + 1$, keep the old direction vector $v$. Otherwise update to the reflection direction $v'$, \n\n$v' = v - 2 \langle v, \phi(x_c) \rangle_F \, \phi(x_c)$. \n\nGo back to step 2 unless the stopping criterion (8) is met. \n\nNote that in practice one trajectory can be calculated in advance and used for all test points. The estimators of the probability of the labellings are then given by $\hat{p}^+ = V^+/(V^+ + V^-)$ and $\hat{p}^- = V^-/(V^+ + V^-)$. Thus, the algorithm outputs \n\n$b_1 \stackrel{\mathrm{def}}{=} \mathrm{argmax}_{y \in Y} \hat{p}^y$   (6) \n\nwith confidence \n\n$C_{\mathrm{trans}} \stackrel{\mathrm{def}}{=} \left(2 \cdot \max(\hat{p}^+, \hat{p}^-) - 1\right) \in [0, 1]$.   (7) \n\nNote that the Bayes Point Machine (BPM) [4] aims at an optimal approximation of the transductive classification (6) by a single function $f \in H$, and that the well-known SVM can be viewed as an approximation of the BPM by the centre of the largest ball in version space. Thus, treating the real-valued output $|f(x_{l+1})| \stackrel{\mathrm{def}}{=} C_{\mathrm{ind}}$ of SVM classifiers as a confidence measure can be considered an approximation of (7). The consequences will be demonstrated experimentally in the following section. \n\nDisregarding the issue of mixing time [2] and the dependence of trajectories, we assume for the stopping criterion that the fraction $p_i^+$ of time $t_i^+$ spent in volume $V^+$ on trajectory $i$ of length $(t_i^+ + t_i^-)$ is a random variable having expectation $p^+$. Hoeffding's inequality [5] bounds the probability of a deviation from the expectation $p^+$ by more than $\epsilon$, \n\n$P\left(\frac{1}{n} \sum_{i=1}^{n} p_i^+ - p^+ \geq \epsilon\right) \leq \exp(-2n\epsilon^2) \leq \eta$.   (8) \n\nThus, if we want the deviation $\epsilon$ from the true label probability to be less than $\epsilon = 0.05$ with probability at least $1 - \eta = 0.99$, we need approximately $n \approx 1000$ bounces. The computational effort of the above algorithm for a working set of size $m$ is of order $O(nl(m + l))$. \n\nFigure 2: Generalisation error vs. rejection rate for Bayesian Transduction and SVMs for the thyroid data set ($\sigma = 3$) (a) and the heart data set ($\sigma = 10$) (b). The error bars in both directions indicate one standard deviation of the estimated means. The upper curve depicts the result for the SVM algorithm; the lower curve is the result obtained by Bayesian Transduction. \n\n4 Experimental Results \n\nWe focused on the confidence $C_{\mathrm{trans}}$ that Bayesian Transduction provides together with the prediction $b_1$ of the label. If the confidence $C_{\mathrm{trans}}$ reflects the reliability of a label estimate at a given test point, then rejecting those test points whose predictions carry low confidence should lead to a reduction in generalisation error on the remaining test points. In the experiments we varied a rejection threshold $\theta$ in $[0, 1]$, thus obtaining for each $\theta$ a rejection rate together with an estimate of the generalisation error at non-rejected points. Both these curves were linked by their common $\theta$-axis, resulting in a generalisation error versus rejection rate plot. \n\nWe used the UCI^2 data sets thyroid and heart because they are medical applications for which the confidence of single predictions is particularly important. Also, a high rejection rate due to too conservative a confidence measure may incur considerable costs.
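The evaluation protocol just described, sweeping a rejection threshold and recording the error on the points that remain, can be sketched as follows; the confidence model used here is synthetic, invented only so that the curve has the qualitative shape reported for $C_{\mathrm{trans}}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic test-set outcomes: a confidence in [0, 1) per point and
# whether the prediction was correct; mistakes are made more likely
# at low confidence (an assumption for illustration, not data from
# the experiments).
n = 1000
conf = rng.uniform(size=n)
correct = rng.uniform(size=n) < (0.85 + 0.15 * conf)

def error_vs_rejection(conf, correct, thresholds):
    # For each threshold theta: reject points with conf < theta and
    # report (rejection rate, error on the accepted points).
    curve = []
    for theta in thresholds:
        accept = conf >= theta
        rej = 1.0 - accept.mean()
        err = 1.0 - correct[accept].mean() if accept.any() else 0.0
        curve.append((rej, err))
    return curve

curve = error_vs_rejection(conf, correct, np.linspace(0.0, 1.0, 11))
for rej, err in curve:
    print(f'reject {rej:6.1%}   error {err:6.1%}')
```

Linking rejection rate and accepted-point error through their common threshold gives exactly the kind of generalisation error versus rejection rate curve shown in Figure 2.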
We trained a Support Vector Machine using RBF kernels $k(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$ with $\sigma$ chosen so as to ensure the existence of a version space. We used 100 different training samples obtained by random 60%:40% splits of the whole data set. The margin $C_{\mathrm{ind}}$ of each test point was calculated as a confidence measure of SVM classifications. For comparison we determined the labels $b_1$ and the resulting confidences $C_{\mathrm{trans}}$ using the Bayesian Transduction algorithm (see Section 3) with the same value of the kernel parameter. Since the rejection rate for Bayesian Transduction was in both cases higher than for SVMs at the same level $\theta$, we determined the $\theta_{\max}$ which achieves the same rejection rate for the SVM confidence measure as Bayesian Transduction achieves at $\theta = 1$ (thyroid: $\theta_{\max} = 2.15$, heart: $\theta_{\max} = 1.54$). The results for the two data sets are depicted in Figure 2. \n\nIn the thyroid example, Figure 2 (a), one can see that $C_{\mathrm{trans}}$ is indeed an appropriate indicator of confidence: at a rejection rate of approximately 20% the generalisation error approaches zero at minimal variance. For any desired generalisation error, Bayesian Transduction needs to reject significantly fewer examples of the test set as compared to SVM classifiers, e.g. 4% fewer at 2.3% generalisation error. The results on the heart data set show even more pronounced characteristics w.r.t. the rejection rate. Note that the confidence measures considered cannot capture the effects of noise in the data, which leads to a generalisation error of 16.4% even at maximal rejection $\theta = 1$, corresponding to the Bayes error under the given function class. \n\n^2 UCI: University of California at Irvine, Machine Learning Repository. \n\n5 Conclusions and Future Work \n\nIn this paper we presented a Bayesian analysis of transduction.
The required volume estimates for kernel perceptrons in version space are performed by an ergodic billiard in kernel space. Most importantly, transduction not only determines the label of a given point but also returns a confidence measure of the classification in the form of the probability of the label under the model. Using this confidence measure to reject test examples then leads to an improved generalisation error over SVMs. The billiard algorithm can be extended to the case of non-zero training error by allowing the ball to penetrate walls, a property that is captured by adding a constant $\lambda$ to the diagonal of the kernel matrix [4]. Further research will aim at the discovery of PAC-Bayesian bounds on the generalisation error of transduction. \n\nAcknowledgements \n\nWe are greatly indebted to U. Kockelkorn for many interesting suggestions and discussions. This project was partially funded by the Technical University of Berlin via FIP 13/41. \n\nReferences \n\n[1] K. Bennett. Combining support vector and mathematical programming methods for classification. In Advances in Kernel Methods - Support Vector Learning, chapter 19, pages 307-326. MIT Press, 1998. \n\n[2] I. Cornfeld, S. Fomin, and Y. Sinai. Ergodic Theory. Springer Verlag, 1982. \n\n[3] A. Gammerman, V. Vovk, and V. Vapnik. Learning by transduction. In Proceedings of Uncertainty in AI, pages 148-155, Madison, Wisconsin, 1998. \n\n[4] R. Herbrich, T. Graepel, and C. Campbell. Bayesian learning in reproducing kernel Hilbert spaces. Technical report, Technical University of Berlin, 1999. TR 99-11. \n\n[5] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963. \n\n[6] P. Rujan. Playing billiard in version space. Neural Computation, 9:99-122, 1997. \n\n[7] J. Shawe-Taylor. Confidence estimates of classification accuracy on new examples.
Technical report, Royal Holloway, University of London, 1996. NC2-TR-1996-054. \n\n[8] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, 1982. \n\n[9] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995. \n\n[10] G. Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadelphia, 1990. \n", "award": [], "sourceid": 1712, "authors": [{"given_name": "Thore", "family_name": "Graepel", "institution": null}, {"given_name": "Ralf", "family_name": "Herbrich", "institution": null}, {"given_name": "Klaus", "family_name": "Obermayer", "institution": null}]}