{"title": "Large Scale Bayes Point Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 528, "page_last": 534, "abstract": null, "full_text": "Large Scale Bayes Point Machines \n\nRalf Herbrich \n\nStatistics Research Group \n\nComputer Science Department \nTechnical University of Berlin \n\nThore Graepel \n\nStatistics Research Group \n\nComputer Science Department \nTechnical University of Berlin \n\nralfh@cs.tu-berlin.de \n\nguru@cs.tu-berlin.de \n\nAbstract \n\nalso known as the Bayes point -\n\nThe concept of averaging over classifiers is fundamental to the \nBayesian analysis of learning. Based on this viewpoint, it has re(cid:173)\ncently been demonstrated for linear classifiers that the centre of \nmass of version space (the set of all classifiers consistent with the \ntraining set) -\nexhibits excel(cid:173)\nlent generalisation abilities. However, the billiard algorithm as pre(cid:173)\nsented in [4] is restricted to small sample size because it requires \no (m 2 ) of memory and 0 (N . m2 ) computational steps where m \nis the number of training patterns and N is the number of random \ndraws from the posterior distribution. In this paper we present a \nmethod based on the simple perceptron learning algorithm which \nallows to overcome this algorithmic drawback. The method is al(cid:173)\ngorithmically simple and is easily extended to the multi-class case. \nWe present experimental results on the MNIST data set of hand(cid:173)\nwritten digits which show that Bayes point machines (BPMs) are \ncompetitive with the current world champion, the support vector \nmachine. In addition, the computational complexity of BPMs can \nbe tuned by varying the number of samples from the posterior. \nFinally, rejecting test points on the basis of their (approximative) \nposterior probability leads to a rapid decrease in generalisation er(cid:173)\nror, e.g. 0.1% generalisation error for a given rejection rate of 10%. 
\n\n1 \n\nIntroduction \n\nKernel machines have recently gained a lot of attention due to the popularisation \nof the support vector machine (SVM) [13] with a focus on classification and the \nrevival of Gaussian Processes (GP) for regression [15]. Subsequently, SVMs have \nbeen modified to handle regression [12] and GPs have been adapted to the problem \nof classification [8]. Both schemes essentially work in the same function space that is \ncharacterised by kernels (SVM) and covariance functions (GP), respectively. While \nthe formal similarity of the two methods is striking the underlying paradigms of \ninference are very different. The SVM was inspired by results from statistical/PAC \nlearning theory while GPs are usually considered in a Bayesian framework. This \nideological clash can be viewed as a continuation in machine learning of the by \nnow classical disagreement between Bayesian and frequentistic statistics. With \n\n\fregard to algorithmics the two schools of thought appear to favour two different \nas a consequence of the \nmethods of learning and predicting: the SVM community -\nformulation of the SVM as a quadratic programming problem -\nfocuses on learning \nas optimisation while the Bayesian community favours sampling schemes based on \nthe Bayesian posterior. Of course there exists a strong relationship between the two \nideas, in particular with the Bayesian maximum a posteriori (MAP) estimator being \nthe solution of an optimisation problem. Interestingly, the two viewpoints have \nrecently been reconciled theoretically in the so-called PAC-Bayesian framework [5] \nthat combines the idea of a Bayesian prior with PAC-style performance guarantees \nand has been the basis of the so far tightest margin bound for SVMs [3]. In practice, \noptimisation based algorithms have the advantage of a unique, deterministic solution \nand the availability of the cost function as an indicator for the quality of the solution. 
\nIn contrast, Bayesian algorithms based on sampling and voting are more flexible and \nhave the so-called \"anytime\" property, providing a relatively good solution at any \npoint in time. Often, however, they suffer from the computational costs of sampling \nthe Bayesian posterior. \n\nIn this contribution we review the idea of the Bayes point machine (BPM) as an \napproximation to Bayesian inference for linear classifiers in kernel space in Section \n2. In contrast to the GP viewpoint we do not define a Gaussian prior on the length \nIlwllx: of the weight vector. Instead, we only consider weight vectors of length \nIlwllx: = 1 because it is only the spatial direction of the weight vector that matters \nfor classification. It is then natural to define a uniform prior on the resulting ball(cid:173)\nshaped hypothesis space. Hence, we determine the centre of mass (\"Bayes point\") of \nthe resulting posterior that is uniform in version space, i.e. in the zero training error \nregion. While the version space could be sampled using some form of Gibbs sampling \n(see, e.g. [6] for an overview) or an ergodic dynamic system such as a billiard [4] \nwe suggest to use the perceptron algorithm trained on permutations of the training \nset for sampling in Section 3. This extremely simple sampling scheme proves to be \nefficient enough to make the BPM applicable to large data sets. We demonstrate \nthis fact in Section 4 on the well-known MNIST data set containing 60 000 samples \nof handwritten digits and show how an approximation to the posterior probability of \nclassification provided by the BPM can even be used for test-point rejection leading \nto a great reduction in generalisation error on the remaining samples. \nWe denote n-tuples by italic bold letters (e.g. x = (Xl, ... ,xn )), vectors by roman \nbold letters (e.g. x), random variables by sans serif font (e.g. X) and vector spaces \nby calligraphic capitalised letters (e.g. X). 
The symbols P, E and I denote a prob(cid:173)\nability measure, the expectation of a random variable and the indicator function, \nrespectively. \n\n2 Bayes Point Machines \n\nLet us consider the task of classifying patterns X E X into one of the two classes \ny E Y = {-1, + 1} using functions h : X ~ Y from a given set 1t known as the \nhypothesis space. In this paper we shall only be concerned with linear classifiers: \n\n1t={xf-tsign((\u00a2(x),w)x;) IWEW}, W={wEK I Ilwllx:=1}, \n\n(1) \n\nwhere \u00a2 : X ~ K ~ i~ is known I as the feature map and has to fixed beforehand. \nIf all that is needed for learning and classification are the inner products (., .)x: in \nthe feature space K, it is convenient to specify \u00a2 only by its inner product function \n\n1 For notational convenience we shall abbreviate cf> (x) by x. This should not be confused \n\nwith the set x of training points. \n\n\fk : X X X -t IR known as the kernel, i.e. \n\n\"Ix, x' EX: \n\nk (x, x') = (\u00a2 (x) , \u00a2 (x')}JC . \n\nFor simplicity, let us assume that there exists a classifier2 w* E W that labels all \nour data, i.e. \n\nPYlx=x,w=w' (y) = Ih_.(x)=y. \n\n(2) \nThis assumption can easily be relaxed by introducing slack variables as done in the \nsoft margin variant of the SVM. Then given a training set z = (x, y) of m points \nXi together with their classes Yi assigned by hw' drawn iid from an unknown data \ndistribution Pz = PYIXPX we can assume the existence of a version space V (z), i.e. \nthe set of all classifiers w E W consistent with z: \n\n(3) \nIn a Bayesian spirit we incorporate all of our prior knowledge about w* into a \nprior distribution Pw over W. In the absence of any a priori knowledge we suggest \na uniform prior over the spatial direction of weight vectors w. Now, given the \ntraining set z we update our prior belief by Bayes' formula, i.e. 
\n\nPw1zm=z (W) = \n\nPzmlw=w (z) Pw (w) 0:1 PYIX=Xi,W=W (Yi) Pw (W) \n= -=-=~~----''-'-----':'::''''':'----:~c-'-\nEw [0:1 PY1X=Xi,W=W (Yi)] \nEw [PzmIW=w (Z)] \n\nPw(w) \n\n~w(V(z)) \n\n{ \n\nifwEV(Z) \notherwise \n\nwhere the first line follows from the independence and the fact that x has no depen(cid:173)\ndence on w and the second line follows from (2) and (3). The Bayesian classification \nof a novel test point x is then given by \n\nBayesz (x) = \n\nargmaxyEy Pw1zm=z ({hw (x) = y}) \n\n= sign (EWlzm=z [hw (x)]) \n= sign (Ew1zm=z [sign ((x, W}dD \n\nUnfortunately, the strategy Bayesz is in general not contained in the set 1-l of \nclassifiers considered beforehand. Since Pw1zm=z is only non-zero inside version \nspace, it has been suggested to use the centre of mass w crn as an approximation for \nBayesz , i.e. \n\nsign (Ew1zm=z [(x, W}JCl) \nsign ((x, wcrn}d , \nEWlzm=z [W] . \n\nwcrn \n\n(4) \n\nThis classifier is called the Bayes point. In a previous work [4] we calculated Wcrn \nusing a first order Markov chain based on a billiard-like algorithm (see also [10]). \nWe entered the version space V (z) using a perceptron algorithm and started play(cid:173)\ning billiards in version space V (z) thus creating a sequence of pseudo-random \nsamples Wi due to the chaotic nature of the billiard dynamics. Playing billiards \nin V (z) is possible because each training point (Xi, Yi) E z defines a hyperplane \n{w E W I Yi (Xi, w}JC = O} ~ W. Hence, the version space is a convex polyhedron \non the surface of W. After N bounces of the billiard ball the Bayes point was \nestimated by \n\n1 N \n\n\\Vcrn = N LWi. \n\ni=1 \n\n2We synonymously call h E 11. and w E W a classifier because there is a one-to-one \n\ncorrespondence between the two by virtue of (1) . \n\n\fAlthough this algorithm shows excellent generalisation performance when compared \nto state-of-the art learning algorithms like support vector machines (SVM) [13], its \neffort scales like 0 (m 2 ) and 0 (N . 
m2 ) in terms of memory and computational \nrequirements, respectively. \n\n3 Sampling the Version Space \n\nClearly, all we need for estimating the Bayes point (4) is a set of classifiers W drawn \nuniformly from V (z). In order to save computational resources it might be advan(cid:173)\ntageous to achieve a uniform sample only approximately. The classical perceptron \nlearning algorithm offers the possibility to obtain up to m! different classifiers in ver(cid:173)\nsion space simply by learning on different permutations of the training set. Given \na permutation II : {I, ... , m} -+ {I, ... , m} the perceptron algorithm works as \nfollows: \n\n1. Start with Wo = 0 and t = O. \n2. For all i E {I, ... , m}, if YII(i) (XII(i), Wt) K. :::; 0 then Wt+! = Wt + YII(i) XII (i) \n\nand t ~ t + 1. \n\n3. Stop, if for all i E {I, ... ,m}, YII(i) (XII(i), Wt) K. > O. \n\nA classical theorem due to Novikoff [7] guarantees the convergence of this procedure \nand furthermore provides an upper bound on the number t of mistakes needed until \nconvergence. More precisely, if there exists a classifier WSVM with margin \n\n. \n'Y% WSVM = mIll \n\n) \n\n( \n\nYi(Xi,WSVM)K. \n\nIlwsVM 11K. \n\n(Xi ,y;)E% \n\nthen the number of mistakes until convergence - which is an upper bound on \nis not more than R2 (x) y;2 (WSVM), where R (x) \nthe sparsity of the solution -\nis the smallest real number such that V x Ex: II \u00a2 (x) II K. \n:::; R (x). The quantity \n'Y% (WSVM) is maximised for the solution WSVM found by the SVM, and whenever \nthe SVM is theoretically justified by results from learning theory (see [11, 13]) the \nratio d = R2 (x) 'Y;2 (WSVM) is considerably less than m, say d\u00ab m. \nAlgorithmically, we can benefit from this sparsity by the following \"trick\": since \n\nm \n\nW = 2: QiXi \n\ni=l \n\nall we need to store is the m-dimensional vector o. Furthermore, we keep track of \nthe m-dimensional vector 0 of real valued outputs \n\n0i = Yi (Xi, Wt)K. 
= 2: Qjk (Xi, Xj) \n\nm \n\nj=l \n\nof the current solution at the i-th training point. By definition, in the beginning 0 = \n0=0. Now, if 0i :::; 0 we update Qi by Qi +Yi and update 0 by OJ ~ OJ +Yik (Xi, Xj) \nwhich requires only m kernel calculations. In summary, the memory requirement of \nthis algorithm is 2m and the number of kernel calculations is not more than d\u00b7m. As \na consequence, the computational requirement of this algorithm is no more than the \ncomputational requirement for the evaluation ofthe margin 'Y% (WSVM)! We suggest \nto use this efficient perceptron learning algorithm in order to obtain samples Wi for \nthe computation of the Bayes point by (4). \n\n\f(a) \n\n(b) \n\n(c) \n\nFigure 1: (a) Histogram of generalisation errors (estimated on a test set) using \na kernel Gibbs sampler. (b) Histogram of generalisation errors (estimated on a \ntest set) using a kernel perceptron. (c) QQ plot of distributions (a) and (b). The \nstraight line indicates that both distribution are very similar. \n\nIn order to investigate the usefulness of this approach experimentally, we compared \nthe distribution of generalisation errors of samples obtained by perceptron learning \non permuted training sets (as suggested earlier by [14]) with samples obtained by \na full Gibbs sampling [2]. For computational reasons, we used only 188 training \npatterns and 453 test patterns of the classes \"I\" and \"2\" from the MNIST data set3 . \nIn Figure 1 (a) and (b) we plotted the distribution over 1000 random samples using \nthe kernel4 \n\nk(x,x') = \u00ab(x,x'h+1)5 . \n\n(5) \nUsing a quantile-quantile (QQ) plot technique we can compare both distributions \nin one graph (see Figure 1 (c)). These plots suggest that by simple permutation \nof the training set we are able to obtain a sample of classifiers exhibiting the same \ngeneralisation error distribution as with time-consuming Gibbs sampling. 
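The sampling scheme above can be sketched compactly in NumPy. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names, the convergence cap `max_epochs`, and the normalisation of each sample to ||w||_K = 1 before averaging are our own choices; the dual (alpha) representation and the cached output vector o follow the \"trick\" described in the text.

```python
import numpy as np

def kernel_perceptron(K, y, perm, max_epochs=1000):
    """Dual perceptron run on one permutation of the training set.

    K    : (m, m) precomputed kernel matrix, K[i, j] = k(x_i, x_j)
    y    : (m,) labels in {-1, +1}
    perm : order in which the training points are visited
    Returns alpha such that w = sum_i alpha_i * phi(x_i) lies in version space.
    """
    m = len(y)
    alpha = np.zeros(m)
    out = np.zeros(m)                 # cached outputs <x_j, w_t>_K for all j
    for _ in range(max_epochs):
        mistakes = 0
        for i in perm:
            if y[i] * out[i] <= 0:    # mistake (or zero output) on point i
                alpha[i] += y[i]      # w <- w + y_i * phi(x_i)
                out += y[i] * K[:, i] # refresh cache: m kernel values per update
                mistakes += 1
        if mistakes == 0:             # consistent with all points: in V(z)
            return alpha
    raise RuntimeError("perceptron did not converge; data may be inseparable")

def bayes_point(K, y, n_samples=10, seed=None):
    """Estimate the Bayes point w_cm as the mean of version-space samples."""
    rng = np.random.default_rng(seed)
    m = len(y)
    total = np.zeros(m)
    for _ in range(n_samples):
        alpha = kernel_perceptron(K, y, rng.permutation(m))
        alpha /= np.sqrt(alpha @ K @ alpha)  # scale each sample to ||w||_K = 1
        total += alpha
    return total / n_samples                 # dual coefficients of w_cm
```

A test point x is then classified by sign(sum_j alpha_j k(x_j, x)); for a point in the training set this is simply sign((K @ alpha)[i]). Since every sampled classifier lies in version space, their average also separates the training set whenever the data are separable in feature space.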
\n\n4 Experimental Results \n\nIn our large scale experiment we used the full MNIST data set with 60000 training \nexamples and 10000 test examples of 28 x 28 grey value images of handwritten \ndigits. As input vector x we used the 784 dimensional vector of grey values. The \nimages were labelled by one of the ten classes \"0\" to \"I\". For each of the ten classes \ny = {O, ... , 9} we ran the perceptron algorithm N = 10 times each time labelling \nall training points of class y by + 1 and the remaining training points by -1. On \nan Ultra Sparc 10 each learning trial took approximately 20 - 30 minutes. For \nthe classification of a test image x we calculated the real-valued output of all 100 \ndifferent classifiers5 by \n\nIi (x) = \n\nwhere we used the kernel k given by (5). (Oi)j refers to the expansion coefficient \ncorresponding to the i- th classifier and the j - th data point. Now, for each of the \n\n3 available at http://wvw .research. att. comryann/ocr/mnist/. \n4We decided to use this kernel because it showed excellent generalisation performance \n\nwhen using the support vector machine. \n\n5For notational simplicity we assume that the first N classifiers are classifiers for the \n\nclass \"0\", the next N for class \"1\" and so on. \n\n\frejection rate generalisation error \n\n0% \n1% \n2% \n3% \n4% \n5% \n6% \n7% \n8% \n9% \n10% \n\n1.46% \n1.10% \n0.87% \n0.67% \n0.49% \n0.37% \n0.32% \n0.26% \n0.21% \n0.14% \n0.11% \n\n004 \n\nrejection rate \n\nOOB \n\n010 \n\nFigure 2: Generalisation error as a function of the rejection rate for the MNIST data \nset. The SVM achieved 1.4% without rejection as compared to 1.46% for the BPM. \nNote that by rejection based on the real-valued output the generalisation error \ncould be reduced to 0.1% indicating that this measure is related to the probability \nof misclassification of single test points. \n\nten classes we calculated the real-valued decision of the Bayes point Wy by \n\nibp,y (x) = N L: ii+yN (x) . 
\n\n1 N \n\ni=l \n\nIn a Bayesian spirit, the final decision was carried out by \n\nhbp (x) = argmaxyE {O, ... ,9} ibp,y (x) . \n\nNote that ibp ,y (x) [9] can be interpreted as an (unnormalised) approximation of \nthe posterior probability that x is of class y when restricted to the function class \n(1). In order to test the dependence of the generalisation error on the magnitude \nmaxy ibp,y (x) we fixed a certain rejection rate r E [0,1] and rejected the set of \nr\u00b7 10000 test points with the smallest value of maxy ibp,y (x). The resulting plot \nis depicted in Figure 2. \n\nAs can be seen from this plot, even without rejection the Bayes point has excellent \ngeneralisation performance6 . Furthermore, rejection based on the real-valued out(cid:173)\nput ibp (x) turns out to be excellent thus reducing the generalisation error to 0.1%. \nOne should also bear in mind that the learning time for this simple algorithm was \ncomparable to that of SVMs. \n\nA very advantageous feature of our approach as compared to SVMs are its adjustable \ntime and memory requirements and the \"anytime\" availability of a solution due to \nsampling. If the training set grows further and we are not able to spend more time \nwith learning, we can adjust the number N of samples used at the price of slightly \nworse generalisation error. \n\n5 Conclusion \n\nIn this paper we have presented an algorithm for approximating the Bayes point by \nrerunning the classical perceptron algorithm with a permuted training set. Here we \n6Note that the best know result on this data set if 1.1 achieved with a polynomial \nkernel of degree four. Nonetheless, for reason of fairness we compared the results of both \nalgorithms using the same kernel. \n\n\fparticularly exploited the sparseness of the solution which must exist whenever the \nsuccess of the SVM is theoretically justified. The restriction to the zero training \nerror case can be overcome by modifying the kernel as \n\nk>.. 
(x, x') = k(x, x') + lambda * I_{x=x'} . \n\nThis technique is well known and was already suggested by Vapnik in 1995 (see [1]). Another interesting question raised by our experimental findings is the following: how closely is the distribution of generalisation errors over random samples from version space related to the distribution of generalisation errors of the up to m! different classifiers found by the classical perceptron algorithm? \n\nAcknowledgements We would like to thank Bob Williamson for helpful discussions and suggestions on earlier drafts. Parts of this work were done during a research stay of both authors at the ANU Canberra. \n\nReferences \n\n[1] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995. \n[2] T. Graepel and R. Herbrich. The kernel Gibbs sampler. In Advances in Neural Information Processing Systems 13, 2001. \n[3] R. Herbrich and T. Graepel. A PAC-Bayesian margin bound for linear classifiers: Why SVMs work. In Advances in Neural Information Processing Systems 13, 2001. \n[4] R. Herbrich, T. Graepel, and C. Campbell. Robust Bayes point machines. In Proceedings of ESANN 2000, pages 49-54, 2000. \n[5] D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 230-234, Madison, Wisconsin, 1998. \n[6] R. M. Neal. Markov chain Monte Carlo methods based on 'slicing' the density function. Technical report, Department of Statistics, University of Toronto, 1997. TR-9722. \n[7] A. Novikoff. On convergence proofs for perceptrons. In Report at the Symposium on Mathematical Theory of Automata, pages 24-26, Polytechnic Institute of Brooklyn, 1962. \n[8] M. Opper and O. Winther. Gaussian processes for classification: Mean field algorithms. Neural Computation, 12(11), 2000. \n[9] J. Platt. Probabilities for SV machines. 
In Advances in Large Margin Classifiers, pages 61-74. MIT Press, 2000. \n[10] P. Rujan and M. Marchand. Computing the Bayes kernel classifier. In Advances in Large Margin Classifiers, pages 329-348. MIT Press, 2000. \n[11] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998. \n[12] A. J. Smola. Learning with Kernels. PhD thesis, Technische Universität Berlin, 1998. \n[13] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995. \n[14] T. Watkin. Optimal learning with a neural network. Europhysics Letters, 21:871-877, 1993. \n[15] C. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. Technical report, Neural Computing Research Group, Aston University, 1997. NCRG/97/012. \n", "award": [], "sourceid": 1922, "authors": [{"given_name": "Ralf", "family_name": "Herbrich", "institution": null}, {"given_name": "Thore", "family_name": "Graepel", "institution": null}]}