{"title": "The Relevance Vector Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 652, "page_last": 658, "abstract": null, "full_text": "The Relevance Vector Machine \n\nMichael E. Tipping \n\nMicrosoft Research \n\nSt George House, 1 Guildhall Street \n\nCambridge CB2 3NH, U.K. \nmtipping~microsoft.com \n\nAbstract \n\nThe support vector machine (SVM) is a state-of-the-art technique \nfor regression and classification, combining excellent generalisation \nproperties with a sparse kernel representation. However, it does \nsuffer from a number of disadvantages, notably the absence of prob(cid:173)\nabilistic outputs, the requirement to estimate a trade-off parameter \nand the need to utilise 'Mercer' kernel functions. In this paper we \nintroduce the Relevance Vector Machine (RVM), a Bayesian treat(cid:173)\nment of a generalised linear model of identical functional form to \nthe SVM. The RVM suffers from none of the above disadvantages, \nand examples demonstrate that for comparable generalisation per(cid:173)\nformance, the RVM requires dramatically fewer kernel functions. \n\n1 \n\nIntrod uction \n\nIn supervised learning we are given a set of examples of input vectors {Xn}~=l \nalong with corresponding targets {tn}~=l' the latter of which might be real values \n(in regression) or class labels (classification). From this 'training' set we wish to \nlearn a model of the dependency of the targets on the inputs with the objective of \nmaking accurate predictions of t for previously unseen values of x. In real-world \ndata, the presence of noise (in regression) and class overlap (in classification) implies \nthat the principal modelling challenge is to avoid 'over-fitting' of the training set. \n\nA very successful approach to supervised learning is the support vector machine \n(SVM) [8]. 
It makes predictions based on a function of the form \n\ny(x) = \u2211_{n=1}^{N} w_n K(x, x_n) + w_0,    (1) \n\nwhere {w_n} are the model 'weights' and K(\u00b7,\u00b7) is a kernel function. The key feature of the SVM is that, in the classification case, its target function attempts to minimise the number of errors made on the training set while simultaneously maximising the 'margin' between the two classes (in the feature space implicitly defined by the kernel). This is an effective 'prior' for avoiding over-fitting, which leads to good generalisation, and which furthermore results in a sparse model dependent only on a subset of kernel functions: those associated with training examples x_n that lie either on the margin or on the 'wrong' side of it. State-of-the-art results have been reported on many tasks where SVMs have been applied. \n\nHowever, the support vector methodology does exhibit significant disadvantages: \n\n\u2022 Predictions are not probabilistic. In regression the SVM outputs a point estimate, and in classification, a 'hard' binary decision. Ideally, we desire to estimate the conditional distribution p(t|x) in order to capture uncertainty in our prediction. In regression this may take the form of 'error-bars', but it is particularly crucial in classification where posterior probabilities of class membership are necessary to adapt to varying class priors and asymmetric misclassification costs. \n\n\u2022 Although relatively sparse, SVMs make liberal use of kernel functions, the requisite number of which grows steeply with the size of the training set. \n\n\u2022 It is necessary to estimate the error/margin trade-off parameter C (and in regression, the insensitivity parameter \u03b5 too). This generally entails a cross-validation procedure, which is wasteful both of data and computation. \n\n\u2022 The kernel function K(\u00b7,\u00b7) must satisfy Mercer's condition. 
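As an illustrative aside (not part of the original paper), the prediction function (1) can be sketched in a few lines of NumPy. The Gaussian kernel and all names here are assumptions chosen for illustration, not anything the paper prescribes:

```python
import numpy as np

def gaussian_kernel(x, xn, gamma=1.0):
    # Illustrative Gaussian (RBF) kernel; gamma is an assumed width parameter.
    return np.exp(-gamma * np.sum((x - xn) ** 2))

def y(x, X_train, w, w0, kernel=gaussian_kernel):
    # Equation (1): y(x) = sum_n w_n K(x, x_n) + w_0
    return w0 + sum(wn * kernel(x, xn) for wn, xn in zip(w, X_train))
```

Any kernel with this call signature could be substituted; for the SVM it must additionally satisfy Mercer's condition, a restriction the RVM will drop.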
\n\nIn this paper, we introduce the 'relevance vector machine' (RVM), a probabilistic sparse kernel model identical in functional form to the SVM. Here we adopt a Bayesian approach to learning, where we introduce a prior over the weights governed by a set of hyperparameters, one associated with each weight, whose most probable values are iteratively estimated from the data. Sparsity is achieved because in practice we find that the posterior distributions of many of the weights are sharply peaked around zero. Furthermore, unlike the support vector classifier, the non-zero weights in the RVM are not associated with examples close to the decision boundary, but rather appear to represent 'prototypical' examples of classes. We term these examples 'relevance' vectors, in deference to the principle of automatic relevance determination (ARD) which motivates the presented approach [4, 6]. \n\nThe most compelling feature of the RVM is that, while capable of generalisation performance comparable to an equivalent SVM, it typically utilises dramatically fewer kernel functions. Furthermore, the RVM suffers from none of the other limitations of the SVM outlined above. \n\nIn the next section, we introduce the Bayesian model, initially for regression, and define the procedure for obtaining hyperparameter values, and thus weights. In Section 3, we give brief examples of application of the RVM in the regression case, before developing the theory for the classification case in Section 4. Examples of RVM classification are then given in Section 5, concluding with a discussion. \n\n2 Relevance Vector Regression \n\nGiven a dataset of input-target pairs {x_n, t_n}_{n=1}^N, we follow the standard formulation and assume p(t|x) is Gaussian N(t | y(x), \u03c3\u00b2). The mean of this distribution for a given x is modelled by y(x) as defined in (1) for the SVM. 
The likelihood of the dataset can then be written as \n\np(t|w, \u03c3\u00b2) = (2\u03c0\u03c3\u00b2)^{-N/2} exp{ -||t - \u03a6w||\u00b2 / (2\u03c3\u00b2) },    (2) \n\nwhere t = (t_1, ..., t_N)^T, w = (w_0, ..., w_N)^T and \u03a6 is the N \u00d7 (N+1) 'design' matrix with \u03a6_nm = K(x_n, x_{m-1}) and \u03a6_n1 = 1. Maximum-likelihood estimation of w and \u03c3\u00b2 from (2) will generally lead to severe overfitting, so we encode a preference for smoother functions by defining an ARD Gaussian prior [4, 6] over the weights: \n\np(w|\u03b1) = \u220f_{i=0}^{N} N(w_i | 0, \u03b1_i^{-1}),    (3) \n\nwith \u03b1 a vector of N+1 hyperparameters. This introduction of an individual hyperparameter for every weight is the key feature of the model, and is ultimately responsible for its sparsity properties. The posterior over the weights is then obtained from Bayes' rule: \n\np(w|t, \u03b1, \u03c3\u00b2) = (2\u03c0)^{-(N+1)/2} |\u03a3|^{-1/2} exp{ -(1/2)(w - \u03bc)^T \u03a3^{-1} (w - \u03bc) },    (4) \n\nwith \n\n\u03a3 = (\u03a6^T B \u03a6 + A)^{-1},    (5) \n\u03bc = \u03a3 \u03a6^T B t,    (6) \n\nwhere we have defined A = diag(\u03b1_0, \u03b1_1, ..., \u03b1_N) and B = \u03c3^{-2} I_N. Note that \u03c3\u00b2 is also treated as a hyperparameter, which may be estimated from the data. \n\nBy integrating out the weights, we obtain the marginal likelihood, or evidence [2], for the hyperparameters: \n\np(t|\u03b1, \u03c3\u00b2) = (2\u03c0)^{-N/2} |B^{-1} + \u03a6 A^{-1} \u03a6^T|^{-1/2} exp{ -(1/2) t^T (B^{-1} + \u03a6 A^{-1} \u03a6^T)^{-1} t }.    (7) \n\nFor ideal Bayesian inference, we should define hyperpriors over \u03b1 and \u03c3\u00b2, and integrate out the hyperparameters too. However, such marginalisation cannot be performed in closed form here, so we adopt a pragmatic procedure, based on that of MacKay [2], and optimise the marginal likelihood (7) with respect to \u03b1 and \u03c3\u00b2, which is essentially the type II maximum likelihood method [1]. This is equivalent to finding the maximum of p(\u03b1, \u03c3\u00b2|t), assuming a uniform (and thus improper) hyperprior. We then make predictions, based on (4), using these maximising values. 
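To make the linear algebra of equations (5)-(7) concrete, here is a minimal NumPy sketch (not from the paper; all function names are illustrative) computing the posterior statistics and the log of the marginal likelihood:

```python
import numpy as np

def posterior_stats(Phi, t, alpha, sigma2):
    # Equations (5) and (6): posterior covariance and mean of the weights.
    A = np.diag(alpha)                          # A = diag(alpha_0, ..., alpha_N)
    B = np.eye(len(t)) / sigma2                 # B = sigma^{-2} I_N
    Sigma = np.linalg.inv(Phi.T @ B @ Phi + A)  # (5)
    mu = Sigma @ Phi.T @ B @ t                  # (6)
    return mu, Sigma

def log_evidence(Phi, t, alpha, sigma2):
    # Log of equation (7), with C = B^{-1} + Phi A^{-1} Phi^T.
    C = np.eye(len(t)) * sigma2 + Phi @ np.diag(1.0 / alpha) @ Phi.T
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (len(t) * np.log(2 * np.pi) + logdet
                   + t @ np.linalg.solve(C, t))
```

In a practical implementation the explicit inverses would be replaced by Cholesky solves for numerical stability; the direct transcription above is kept deliberately close to the equations.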
\n\n2.1 Optimising the hyperparameters \n\nValues of \u03b1 and \u03c3\u00b2 which maximise (7) cannot be obtained in closed form, and we consider two alternative formulae for iterative re-estimation of \u03b1. First, by considering the weights as 'hidden' variables, an EM approach gives: \n\n\u03b1_i^new = 1 / \u27e8w_i\u00b2\u27e9_{p(w|t, \u03b1, \u03c3\u00b2)} = 1 / (\u03a3_ii + \u03bc_i\u00b2).    (8) \n\nSecond, direct differentiation of (7) and rearranging gives: \n\n\u03b1_i^new = \u03b3_i / \u03bc_i\u00b2,    (9) \n\nwhere we have defined the quantities \u03b3_i = 1 - \u03b1_i \u03a3_ii, which can be interpreted as a measure of how 'well-determined' each parameter w_i is by the data [2]. Generally, this latter update was observed to exhibit faster convergence. \n\nFor the noise variance, both methods lead to the same re-estimate: \n\n(\u03c3\u00b2)^new = ||t - \u03a6\u03bc||\u00b2 / (N - \u2211_i \u03b3_i).    (10) \n\nIn practice, during re-estimation, we find that many of the \u03b1_i approach infinity, and from (4), p(w_i|t, \u03b1, \u03c3\u00b2) becomes infinitely peaked at zero - implying that the corresponding kernel functions can be 'pruned'. While space here precludes a detailed explanation, this occurs because there is an 'Occam' penalty to be paid for smaller values of \u03b1_i, due to their appearance in the determinant in the marginal likelihood (7). For some \u03b1_i, a lesser penalty can be paid by explaining the data with increased noise \u03c3\u00b2, in which case those \u03b1_i \u2192 \u221e. \n\n3 Examples of Relevance Vector Regression \n\n3.1 Synthetic example: the 'sinc' function \n\nThe function sinc(x) = |x|^{-1} sin|x| is commonly used to illustrate support vector regression [8], where in place of the classification margin, the \u03b5-insensitive region is introduced, a 'tube' of \u00b1\u03b5 around the function within which errors are not penalised. In this case, the support vectors lie on the edge of, or outside, this region. For example, using linear spline kernels and with \u03b5 
= 0.01, the approximation of sinc(x) based on 100 uniformly-spaced noise-free samples in [-10, 10] utilises 39 support vectors [8]. \n\nBy comparison, we approximate the same function with a relevance vector model utilising the same kernel. In this case the noise variance is fixed at 0.01\u00b2 and \u03b1 alone re-estimated. The approximating function is plotted in Figure 1 (left), and requires only 9 relevance vectors. The largest error is 0.0087, compared to 0.01 in the SV case. Figure 1 (right) illustrates the case where Gaussian noise of standard deviation 0.2 is added to the targets. The approximation uses 6 relevance vectors, and the noise is automatically estimated, using (10), as \u03c3 = 0.189. \n\nFigure 1: Relevance vector approximation to sinc(x): noise-free data (left), and with added Gaussian noise of \u03c3 = 0.2 (right). The estimated functions are drawn as solid lines with relevance vectors shown circled, and in the added-noise case (right) the true function is shown dashed. \n\n3.2 Some benchmarks \n\nThe table below illustrates regression performance on some popular benchmark datasets - Friedman's three synthetic functions (results averaged over 100 randomly generated training sets of size 240 with a 1000-example test set) and the 'Boston housing' dataset (averaged over 100 randomised 481/25 train/test splits). The prediction error obtained and the number of kernel functions required for both support vector regression (SVR) and relevance vector regression (RVR) are given. \n\nDataset           errors (SVR / RVR)     kernels (SVR / RVR) \nFriedman #1       2.92 / 2.80            116.6 / 59.4 \nFriedman #2       4140 / 3505            110.3 / 6.9 \nFriedman #3       0.0202 / 0.0164        106.5 / 11.5 \nBoston Housing    8.04 / 7.46            142.8 / 39.0 
4 Relevance Vector Classification \n\nWe now extend the relevance vector approach to the case of classification - i.e. where it is desired to predict the posterior probability of class membership given the input x. We generalise the linear model by applying the logistic sigmoid function \u03c3(y) = 1/(1 + e^{-y}) to y(x) and writing the likelihood as \n\nP(t|w) = \u220f_{n=1}^{N} \u03c3{y(x_n)}^{t_n} [1 - \u03c3{y(x_n)}]^{1-t_n}.    (11) \n\nHowever, we cannot integrate out the weights to obtain the marginal likelihood analytically, and so utilise an iterative procedure based on that of MacKay [3]: \n\n1. For the current, fixed, values of \u03b1 we find the most probable weights w_MP (the location of the posterior mode). This is equivalent to a standard optimisation of a regularised logistic model, and we use the efficient iteratively-reweighted least-squares algorithm [5] to find the maximum. \n\n2. We compute the Hessian at w_MP: \n\n\u2207\u2207 log p(t, w|\u03b1) |_{w_MP} = -(\u03a6^T B \u03a6 + A),    (12) \n\nwhere B_nn = \u03c3{y(x_n)}[1 - \u03c3{y(x_n)}], and this is negated and inverted to give the covariance \u03a3 for a Gaussian approximation to the posterior over weights, and from that the hyperparameters \u03b1 are updated using (9). Note that there is no 'noise' variance \u03c3\u00b2 here. \n\nThis procedure is repeated until some suitable convergence criteria are satisfied. Note that in the Bayesian treatment of multilayer neural networks, the Gaussian approximation is considered a weakness of the method if the posterior mode is unrepresentative of the overall probability mass. However, for the RVM, we note that p(t, w|\u03b1) is log-concave (i.e. the Hessian is negative-definite everywhere), which gives us considerably more confidence in the Gaussian approximation. 
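As a sketch only (assumed names, not the paper's code), step 1 above - finding w_MP by Newton/IRLS for the likelihood (11) under the prior (3) - together with the Gaussian approximation of step 2 might look like:

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def posterior_mode(Phi, t, alpha, n_iter=25):
    # Newton (IRLS) ascent to the mode w_MP of p(w | t, alpha), then the
    # Gaussian approximation whose covariance is the inverted negated
    # Hessian of equation (12).
    A = np.diag(alpha)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Phi @ w)
        g = Phi.T @ (t - p) - A @ w      # gradient of log p(t, w | alpha)
        B = np.diag(p * (1.0 - p))       # B_nn = sigma{y(x_n)}[1 - sigma{y(x_n)}]
        H = Phi.T @ B @ Phi + A          # negative Hessian, equation (12)
        w = w + np.linalg.solve(H, g)    # Newton step
    Sigma = np.linalg.inv(H)             # covariance of the approximation
    return w, Sigma
```

The hyperparameters would then be re-estimated via (9) from the diagonal of this covariance and w_MP, and the whole loop repeated to convergence; the fixed iteration count here is a simplification.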
\n\n5 Examples of RVM Classification \n\n5.1 Synthetic example: Gaussian mixture data \n\nWe first utilise artificially generated data in two dimensions in order to illustrate graphically the selection of relevance vectors. Class 1 (denoted by 'x') was sampled from a single Gaussian, and overlaps to a small degree class 2 ('.'), sampled from a mixture of two Gaussians. \n\nA relevance vector classifier was compared to its support vector counterpart, using the same Gaussian kernel. A value of C for the SVM was selected using 5-fold cross-validation on the training set. The results for a typical dataset of 200 examples are given in Figure 2. The test errors for the RVM (9.32%) and SVM (9.48%) are comparable, but the remarkable feature of contrast is the complexity of the classifiers. The support vector machine utilises 44 kernel functions compared to just 3 for the relevance vector method. \n\nIt is also notable that the relevance vectors are some distance from the decision boundary (in x-space). Given further analysis, this observation can be seen to be consistent with the hyperparameter update equations. A more qualitative explanation is that the output of a basis function lying on or near the decision boundary is a poor indicator of class membership, and such basis functions are naturally 'penalised' under the Bayesian framework. \n\n[Panel titles: SVM: error=9.48% vectors=44 (left); RVM: error=9.32% vectors=3 (right).] \n\nFigure 2: Results of training functionally identical SVM (left) and RVM (right) classifiers on a typical synthetic dataset. 
The decision boundary is shown dashed, and relevance/support vectors are shown circled to emphasise the dramatic reduction in complexity of the RVM model. \n\n5.2 Real examples \n\nIn the table below we give error and complexity results for the 'Pima Indian diabetes' and the 'U.S.P.S. handwritten digit' datasets. The former task has been recently used to illustrate Bayesian classification with the related Gaussian Process (GP) technique [9], and we utilised those authors' split of the data into 200 training and 332 test examples and quote their result for the GP case. The latter dataset is a popular support vector benchmark, comprising 7291 training examples along with a 2007-example test set, and the SVM result is quoted from [7]. \n\nDataset         errors (SVM / GP / RVM)    kernels (SVM / GP / RVM) \nPima Indians    67 / 68 / 65               109 / 200 / 4 \nU.S.P.S.        4.4% / - / 5.1%            2540 / - / 316 \n\nIn terms of prediction accuracy, the RVM is marginally superior on the Pima set, but outperformed by the SVM on the digit data. However, consistent with other examples in this paper, the RVM classifiers utilise many fewer kernel functions. Most strikingly, the RVM achieves state-of-the-art performance on the diabetes dataset with only 4 kernels. It should be noted that reduced set methods exist for subsequently pruning support vector models to reduce the required number of kernels at the expense of some increase in error (e.g. see [7] for some example results on the U.S.P.S. data). \n\n6 Discussion \n\nExamples in this paper have effectively demonstrated that the relevance vector machine can attain a comparable (and for regression, apparently superior) level of generalisation accuracy as the well-established support vector approach, while at the same time utilising dramatically fewer kernel functions - implying a considerable saving in memory and computation in a practical implementation. 
Importantly, we also benefit from the absence of any additional nuisance parameters to set, apart from the need to choose the type of kernel and any associated parameters. \n\nIn fact, for the case of kernel parameters, we have obtained improved (both in terms of accuracy and sparsity) results for all the benchmarks given in Section 3.2 when optimising the marginal likelihood with respect to multiple input scale parameters in Gaussian kernels (q.v. [9]). Furthermore, we may also exploit the Bayesian formalism to guide the choice of kernel itself [2], and it should be noted that the presented methodology is applicable to arbitrary basis functions, so we are not limited, for example, to the use of 'Mercer' kernels as in the SVM. \n\nA further advantage of the RVM classifier is its standard formulation as a probabilistic generalised linear model. This implies that it can be extended to the multiple-class case in a straightforward and principled manner, without the need to train and heuristically combine multiple dichotomous classifiers as is standard practice for the SVM. Furthermore, the estimation of posterior probabilities of class membership is a major benefit, as these convey a principled measure of uncertainty of prediction, and are essential if we wish to allow adaptation for varying class priors, along with incorporation of asymmetric misclassification costs. \n\nHowever, it must be noted that the principal disadvantage of relevance vector methods is in the complexity of the training phase, as it is necessary to repeatedly compute and invert the Hessian matrix, requiring O(N\u00b2) storage and O(N\u00b3) computation. For large datasets, this makes training considerably slower than for the SVM. Currently, memory constraints limit us to training on no more than 5,000 examples, but we have developed approximation methods for handling larger datasets which were employed on the U.S.P.S. 
handwritten digit database. We note that while the case for Bayesian methods is generally strongest when data is scarce, the sparseness of the resulting classifier induced by the Bayesian framework presented here is a compelling motivation to apply relevance vector techniques to larger datasets. \n\nAcknowledgements \n\nThe author wishes to thank Chris Bishop, John Platt and Bernhard Sch\u00f6lkopf for helpful discussions, and JP again for his Sequential Minimal Optimisation code. \n\nReferences \n\n[1] J. O. Berger. Statistical decision theory and Bayesian analysis. Springer, New York, second edition, 1985. \n[2] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992. \n[3] D. J. C. MacKay. The evidence framework applied to classification networks. Neural Computation, 4(5):720-736, 1992. \n[4] D. J. C. MacKay. Bayesian non-linear modelling for the prediction competition. In ASHRAE Transactions, vol. 100, pages 1053-1062. ASHRAE, Atlanta, Georgia, 1994. \n[5] I. T. Nabney. Efficient training of RBF networks for classification. In Proceedings of ICANN99, pages 210-215, London, 1999. IEE. \n[6] R. M. Neal. Bayesian Learning for Neural Networks. Springer, New York, 1996. \n[7] B. Sch\u00f6lkopf, S. Mika, C. J. C. Burges, P. Knirsch, K.-R. M\u00fcller, G. R\u00e4tsch, and A. J. Smola. Input space versus feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000-1017, 1999. \n[8] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998. \n[9] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Trans. Pattern Analysis and Machine Intelligence, 20(12):1342-1351, 1998. \n", "award": [], "sourceid": 1719, "authors": [{"given_name": "Michael", "family_name": "Tipping", "institution": null}]}