{"title": "Incremental Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1025, "page_last": 1032, "abstract": null, "full_text": "Incremental Gaussian Processes\n\nJoaquin Qui \u02dcnonero-Candela\n\nOle Winther\n\nInformatics and Mathematical Modelling\n\nInformatics and Mathematical Modelling\n\nTechnical University of Denmark\n\nDK-2800 Lyngby, Denmark\n\nTechnical University of Denmark\n\nDK-2800 Lyngby, Denmark\n\njqc@imm.dtu.dk\n\nowi@imm.dtu.dk\n\nAbstract\n\nIn this paper, we consider Tipping\u2019s relevance vector machine (RVM)\n[1] and formalize an incremental training strategy as a variant of the\nexpectation-maximization (EM) algorithm that we call Subspace EM\n(SSEM). Working with a subset of active basis functions, the sparsity\nof the RVM solution will ensure that the number of basis functions and\nthereby the computational complexity is kept low. We also introduce\na mean \ufb01eld approach to the intractable classi\ufb01cation model that is ex-\npected to give a very good approximation to exact Bayesian inference\nand contains the Laplace approximation as a special case. We test the\nalgorithms on two large data sets with O(103 (cid:0) 104) examples. The re-\nsults indicate that Bayesian learning of large data sets, e.g. the MNIST\ndatabase is realistic.\n\n1 Introduction\n\nTipping\u2019s relevance vector machine (RVM) both achieves a sparse solution like the support\nvector machine (SVM) [2, 3] and the probabilistic predictions of Bayesian kernel machines\nbased upon a Gaussian process (GP) priors over functions [4, 5, 6, 7, 8]. Sparsity is in-\nteresting both with respect to fast training and predictions and ease of interpretation of the\nsolution. Probabilistic predictions are desirable because inference is most naturally for-\nmulated in terms of probability theory, i.e. 
we can manipulate probabilities through Bayes' theorem, reject uncertain predictions, and so on.\n\nIt seems that Tipping's relevance vector machine takes the best of both worlds. It is a GP with a covariance matrix spanned by a small number of basis functions, reducing the computationally expensive matrix inversion from O(N³), where N is the number of training examples, to O(M²N), where M is the number of basis functions. Simulation studies have shown very sparse solutions, M ≪ N, with good test performance [1]. However, starting the RVM learning with as many basis functions as examples, i.e. one basis function in each training input point, leads to the same complexity as for Gaussian processes (GPs), since in the initial step no basis functions are removed. This led Tipping to suggest, in an appendix of Ref. [1], an incremental learning strategy that starts with only a single basis function and adds basis functions along the iterations, and to formalize it very recently [9]. The total number of basis functions is kept low because basis functions are also removed. In this paper we formalize this strategy using straightforward expectation-maximization (EM) [10] arguments to prove that the scheme is guaranteed to converge to a local maximum of the likelihood of the model parameters.\n\nReducing the computational burden of Bayesian kernel learning is a subject of current interest. This can be achieved by numerical approximations to matrix inversion [11] and by suboptimal projections onto finite subspaces of basis functions without an explicit parametric form of those basis functions [12, 13]. Using mixtures of GPs [14, 15] to make the kernel function input dependent is also a promising technique. None of the Bayesian methods can currently compete in terms of speed with the efficient SVM optimization schemes that have been developed; see e.g. 
[3].\n\nThe rest of the paper is organized as follows. In section 2 we present extended linear models in a Bayesian perspective, the regression model and the standard EM approach. In section 3 we introduce a variation of the EM algorithm, the Subspace EM (SSEM), that works well with sparse solution models. In section 4 we present the second main contribution of the paper: a mean field approach to RVM classification. Section 5 gives results for the Mackey-Glass time series and preliminary results on the MNIST handwritten digit database. We conclude in section 6.\n\n2 Regression\n\nAn extended linear model is built by transforming the input space by an arbitrary set of basis functions φj : R^D → R that perform a non-linear transformation of the D-dimensional input space. A linear model is applied to the transformed space, whose dimension is equal to the number of basis functions M:\n\ny(xi) = Σ_{j=1..M} ωj φj(xi) = Φ(xi) · ω    (1)\n\nwhere Φ(xi) ≡ [φ1(xi), ..., φM(xi)] denotes the ith row of the design matrix Φ and ω = (ω1, ..., ωM)^T is the weight vector. The output of the model is thus a linear superposition of completely general basis functions. While it is possible to optimize the parameters of the basis functions for the problem at hand [1, 16], in this paper we assume that they are given.\n\nThe simplest possible regression learning scenario can be described as follows: a set of N input-target training pairs {xi, ti}, i = 1..N, are assumed to be independent and contaminated with Gaussian noise of variance σ². The likelihood of the parameters ω is given by\n\np(t|ω, σ²) = (2πσ²)^(-N/2) exp( -||t - Φω||² / (2σ²) )    (2)\n\nwhere t = (t1, ..., tN)^T is the target vector. 
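The model of eq. (1) and the likelihood of eq. (2) can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' code: the Gaussian basis functions, the width, and all variable names are our own choices.

```python
import numpy as np

def design_matrix(X, centers, nu2=10.0):
    """Phi[i, j] = phi_j(x_i): Gaussian basis function j evaluated at input i."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * nu2))

def log_likelihood(t, Phi, w, sigma2):
    """Log of the Gaussian likelihood of eq. (2)."""
    N = len(t)
    r = t - Phi @ w
    return -0.5 * N * np.log(2.0 * np.pi * sigma2) - r @ r / (2.0 * sigma2)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
Phi = design_matrix(X, X)   # one basis function centred on each training input
w = rng.normal(size=20)
y = Phi @ w                 # model outputs, eq. (1)
```

Centring one basis function on each training input, as above, is the construction that later yields sparsity in the inputs.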
Regularization is introduced in Bayesian learning by means of a prior distribution over the weights. In general, the implied prior over functions is a very complicated distribution. However, choosing a Gaussian prior on the weights, the prior over functions also becomes Gaussian, i.e. a Gaussian process. For the specific choice of a factorized distribution with variance 1/αj:\n\np(ωj|αj) = sqrt(αj/2π) exp( -αj ωj²/2 )    (3)\n\nthe prior over functions p(y|α) is N(0, Φ A^{-1} Φ^T), i.e. a Gaussian process with covariance function given by\n\nCov(xi, xj) = Σ_{k=1..M} (1/αk) φk(xi) φk(xj)    (4)\n\nwhere α = (α1, ..., αM)^T and A = diag(α1, ..., αM). We can now see how sparseness in terms of the basis vectors may arise: if 1/αk = 0, the kth basis vector Φk ≡ [φk(x1), ..., φk(xN)]^T, i.e. the kth column of the design matrix, will not contribute to the model. Associating a basis function with each input point may thus lead to a model with a sparse representation in the inputs, i.e. a solution spanned by only a subset of all input points. This is exactly the idea behind the relevance vector machine, introduced by Tipping [17]. 
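The covariance of eq. (4) is simply Φ A⁻¹ Φᵀ, and the pruning effect of a large αk can be checked directly. A small sketch (our own naming; a large finite value stands in for α → ∞):

```python
import numpy as np

def prior_covariance(Phi, alpha):
    """GP prior covariance of eq. (4): Phi A^{-1} Phi^T with A = diag(alpha)."""
    return (Phi / alpha) @ Phi.T   # scales column k of Phi by 1/alpha_k

rng = np.random.default_rng(1)
Phi = rng.normal(size=(5, 3))
alpha = np.array([1.0, 2.0, 1e12])                   # alpha_3 -> "infinity"
K = prior_covariance(Phi, alpha)
K_pruned = prior_covariance(Phi[:, :2], alpha[:2])   # basis 3 removed outright
```

With α3 effectively infinite, K and K_pruned agree to numerical precision: the third column of the design matrix no longer contributes to the prior over functions.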
We will see in the following how this also leads to a lower computational complexity than using a regular Gaussian process kernel.\n\nThe posterior distribution over the weights, obtained through Bayes' rule, is a Gaussian distribution\n\np(ω|t, α, σ²) = p(t|ω, σ²) p(ω|α) / p(t|α, σ²) = N(ω|μ, Σ)    (5)\n\nwhere N(ω|μ, Σ) is a Gaussian distribution with mean μ and covariance Σ evaluated at ω. The mean and covariance are given by\n\nμ = σ^{-2} Σ Φ^T t    (6)\nΣ = (σ^{-2} Φ^T Φ + A)^{-1}    (7)\n\nThe uncertainty about the optimal value of the weights captured by the posterior distribution (5) can be used to build probabilistic predictions. Given a new input x*, the model gives a Gaussian predictive distribution for the corresponding target t*:\n\np(t*|x*, α, σ²) = ∫ p(t*|x*, ω, σ²) p(ω|t, α, σ²) dω = N(t*|y*, σ*²)    (8)\n\nwhere\n\ny* = Φ(x*) · μ    (9)\nσ*² = σ² + Φ(x*) · Σ · Φ(x*)^T    (10)\n\nFor regression it is natural to use y* and σ* as the prediction and the error bar on the prediction, respectively. The computational complexity of making predictions is thus O(M²P + M³ + M²N), where M is the number of selected basis functions (relevance vectors, RVs) and P is the number of predictions. 
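Eqs. (6)-(10) translate directly into numpy. The sketch below (our own naming; a plain matrix inverse is used for clarity where a Cholesky factorization would be preferred) computes the posterior over the weights and a predictive mean and variance:

```python
import numpy as np

def posterior(Phi, t, alpha, sigma2):
    """Posterior mean and covariance of the weights, eqs. (6)-(7)."""
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))
    mu = Sigma @ Phi.T @ t / sigma2
    return mu, Sigma

def predict(phi_star, mu, Sigma, sigma2):
    """Predictive mean and variance at a new input, eqs. (9)-(10)."""
    y_star = phi_star @ mu
    var_star = sigma2 + phi_star @ Sigma @ phi_star
    return y_star, var_star

rng = np.random.default_rng(3)
Phi = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
t = Phi @ w_true                     # noise-free targets for the check below
mu, Sigma = posterior(Phi, t, alpha=1e-8 * np.ones(3), sigma2=1e-6)
y_star, var_star = predict(rng.normal(size=3), mu, Sigma, sigma2=1e-6)
```

With negligible noise and a vague prior, the posterior mean recovers the generating weights, and the predictive variance is always at least the noise level σ².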
The last two terms come from the computation of Σ in eq. (7).\n\nThe likelihood of the training targets (2) can be marginalized with respect to the weights to obtain the marginal likelihood, which is also a Gaussian distribution:\n\np(t|α, σ²) = ∫ p(t|ω, σ²) p(ω|α) dω = N(t|0, σ²I + Φ A^{-1} Φ^T).    (11)\n\nEstimating the hyperparameters {αj} and the noise σ² can be achieved by maximizing (11). This is naturally carried out in the framework of the expectation-maximization (EM) algorithm, since the sufficient statistics of the weights (which act as hidden variables) are available for this type of model. In other cases, e.g. for adapting the length scale of the kernel [4], gradient methods have to be used. For regression the E-step is exact (the lower bound on the marginal likelihood is made equal to the marginal likelihood) and consists in estimating the mean and covariance, (6) and (7), of the posterior distribution of the weights (5). For classification the E-step will be approximate; in this paper we present a mean field approach for obtaining the sufficient statistics.\n\nThe M-step corresponds to maximizing the expectation of the log marginal likelihood under the posterior, with respect to σ² and α, which gives the following update rules:\n\nαj_new = 1/⟨ωj²⟩ = 1/(μj² + Σjj),    (σ²)_new = (1/N)( ||t - Φμ||² + (σ²)_old Σj γj ),\n\nwhere the average ⟨·⟩ is over the posterior p(ω|t, α, σ²), and the quantity γj ≡ 1 - αj Σjj is a measure of how 'well-determined' each weight ωj is by the data [18, 1]. 
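One full iteration, an exact E-step (eqs. (6)-(7)) followed by the M-step updates just stated, can be sketched as follows. This is our own toy illustration, not the authors' code: the synthetic data, all names, and the use of a plain matrix inverse are assumptions.

```python
import numpy as np

def em_iteration(Phi, t, alpha, sigma2):
    """E-step: posterior moments, eqs. (6)-(7); M-step: the update rules above."""
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))
    mu = Sigma @ Phi.T @ t / sigma2
    gamma = 1.0 - alpha * np.diag(Sigma)        # well-determinedness of each weight
    alpha_new = 1.0 / (mu**2 + np.diag(Sigma))  # alpha_j <- 1 / <omega_j^2>
    r = t - Phi @ mu
    sigma2_new = (r @ r + sigma2 * gamma.sum()) / len(t)
    return mu, alpha_new, sigma2_new

rng = np.random.default_rng(2)
Phi = rng.normal(size=(40, 5))
w_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])   # two irrelevant basis functions
t = Phi @ w_true + 0.1 * rng.normal(size=40)
alpha, sigma2 = np.ones(5), 1.0
for _ in range(50):
    mu, alpha, sigma2 = em_iteration(Phi, t, alpha, sigma2)
```

After a few dozen iterations the α's of the irrelevant basis functions grow large relative to those of the relevant ones, which is the pruning mechanism exploited throughout the paper.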
One can obtain a different update rule that gives faster convergence. Although it is suboptimal in the EM sense, we have never observed it decrease the lower bound on the marginal log-likelihood. The rule, derived in [1], is obtained by differentiating (11) and making a particular choice of independent terms, as in [18]. It makes use of the quantities {γj}:\n\nαj_new = γj / μj²,    (σ²)_new = ||t - Φμ||² / (N - Σj γj).    (12)\n\nIn the optimization process many αj grow to infinity, which effectively deletes the corresponding weight and basis function. Note that the EM update and the MacKay update for αj depend on the likelihood only implicitly, so they remain valid for the classification model considered below.\n\nA serious limitation of the EM algorithm and its variants for problems of this type is that the complexity of computing the covariance of the weights (7) in the E-step is O(M³ + M²N). At least in the first iteration, where no basis functions have been deleted, M = N, and we face the same complexity explosion that limits the applicability of Gaussian processes to large training sets. This led Tipping [1] to consider a constructive, or incremental, training paradigm in which one basis function is added before each E-step; since basis functions are also removed in the M-step, in practice the total number of basis functions and the complexity remain low [9]. In the following section we introduce a new algorithm that formalizes this procedure and can be proven to increase the marginal likelihood in each step.\n\n3 Subspace EM\n\nWe introduce an incremental approach to the EM algorithm, the Subspace EM (SSEM), that can be directly applied to training models that, like the RVM, rely on a linear superposition of completely general basis functions, both for classification and for regression. 
Instead of starting with a full model, i.e. one where all basis functions are present with finite α values, we start with a fully pruned model with all αj set to infinity: effectively, we start with no model. The model is grown by iteratively moving some αj previously set to infinity into the active set of α's. The active set at iteration n, Rn, contains the indices of the basis vectors with α less than infinity:\n\nR1 = {1},    Rn = {i | i ∈ Rn-1 ∧ αi ≤ L} ∪ {n}    (13)\n\nwhere L is an arbitrarily defined, very large but finite number. Observe that Rn contains at most one more element (index) than Rn-1. If some of the α's indexed by Rn-1 happen to reach L at the nth step, Rn can contain fewer elements than Rn-1. Figure 1 gives a schematic description of the SSEM algorithm.\n\nAt iteration n the E-step is taken only in the subspace spanned by the weights whose indices are in Rn. This reduces the computational complexity of the E-step to O(M³), where M is the number of relevance vectors.\n\nSince the initial value of αj is infinity for all j, for regression the E-step always yields an equality between the log marginal likelihood and its lower bound. At any step n, the posterior can be exactly projected onto the space spanned by the weights ωj with j ∈ Rn, because αk = ∞ for all k not in Rn. Hence, in the regression case the SSEM never decreases the log marginal likelihood. Figure 2 illustrates the convergence of the SSEM algorithm compared to that of the EM algorithm for regression.\n\n1. Set αj = L for all j (L is a very large number). Set n = 1.\n2. Update the set of active indices Rn.\n3. Perform an E-step in the subspace of the ωj with j ∈ Rn.\n4. Perform the M-step for all αj with j ∈ Rn.\n5. 
If all basis functions have been visited, end; else go to 2.\n\nFigure 1: Schematics of the SSEM algorithm.\n\nFigure 2: Training on 400 samples of the Mackey-Glass time series, testing on 2000 cases. Log marginal likelihood as a function of elapsed CPU time (left) and corresponding number of relevance vectors (right) for both SSEM and standard EM.\n\nWe perform one EM step each time a new basis function is added to the active set. Once all the examples have been visited, we switch to the batch EM algorithm on the active set until some convergence criterion is satisfied, for example until the relative increase in the likelihood falls below a certain threshold. In practice, some 50 of these batch EM iterations are enough.\n\n4 Classification\n\nUnlike the model discussed above, the classification model does not allow analytical inference. Here we discuss the adaptive TAP mean field approach, initially proposed for Gaussian processes [8], which is readily translated to RVMs. 
The mean field approach has the appealing features that it retains the computational efficiency of RVMs, is exact for regression, and reduces to the Laplace approximation in the limit where all the variability comes from the prior distribution.\n\nWe consider binary t = ±1 classification using the probit likelihood with 'input' noise σ²:\n\np(t|y(x)) = erf( t y(x)/σ ),    (14)\n\nwhere Dz ≡ e^{-z²/2} dz / sqrt(2π) and erf(x) ≡ ∫_{-∞}^{x} Dz is an error function (or cumulative Gaussian distribution). The advantage of using this sigmoid rather than the commonly used 0/1-logistic is that under the mean field approximation we can derive an analytical expression for the predictive distribution p(t*|x*, t) = ∫ p(t*|y) p(y|x*, t) dy needed for making Bayesian predictions. Both a variational approach and the advanced mean field approach used here make a Gaussian approximation for p(y|x*, t) [8], with mean and variance given by the regression results y* and σ*² of eqs. (9) and (10). This leads to the following approximation for the predictive distribution:\n\np(t*|x*, t) = ∫ erf( t* y/σ ) p(y|x*, t) dy = erf( t* y*/σ* ).    (15)\n\nHowever, the mean and covariance of the weights are no longer found by analytical expressions; they have to be obtained from a set of non-linear mean field equations, which also follow from equivalent assumptions of Gaussianity for the training set outputs y(xi) in averages over reduced (or cavity) posteriors.\n\nIn the following we only state the results that follow from combining the RVM Gaussian process kernel (4) with the results of [8]. 
The sufficient statistics of the weights are written in terms of a set of O(N) mean field parameters:\n\nΣ = (A + Φ^T Ω Φ)^{-1}    (16)\nμ = A^{-1} Φ^T τ    (17)\n\nwhere τi ≡ ∂ ln Z(yc_i, Vc_i + σ²) / ∂yc_i and\n\nZ(yc_i, Vc_i + σ²) ≡ ∫ p(ti | yc_i + z sqrt(Vc_i + σ²)) Dz = erf( ti yc_i / sqrt(Vc_i + σ²) ).    (18)\n\nThe last equality holds for the likelihood eq. (14). Here yc_i and Vc_i are the mean and variance of the so-called cavity field; the mean value is yc_i = Φ(xi) · μ - Vc_i τi. The distinction between the different approximation schemes lies solely in the variance Vc_i: Vc_i = 0 is the Laplace approximation, Vc_i = [Φ A^{-1} Φ^T]ii is the so-called naive mean field theory, and an improved estimate is available from the adaptive TAP mean field theory [8]. Lastly, the diagonal matrix Ω is the equivalent of the noise variance in the regression model (compare eqs. (16) and (7)) and is given by Ωi = -(∂τi/∂yc_i) / (1 + Vc_i ∂τi/∂yc_i). This set of non-linear equations is readily solved (i.e. 
fast and stable) by making Newton-Raphson updates in μ, treating the remaining quantities as auxiliary variables:\n\nΔμ = (I + A^{-1} Φ^T Ω Φ)^{-1} (A^{-1} Φ^T τ - μ) = Σ (Φ^T τ - A μ).    (19)\n\nThe computational complexity of the E-step for classification is higher than in the regression case because an M × M matrix must be constructed and inverted many times (typically 20), once for each step of the iterative Newton method.\n\n5 Simulations\n\nWe illustrate the performance of the SSEM for regression on the Mackey-Glass chaotic time series, which is well known for its strong non-linearity. In [16] we showed that the RVM has an order of magnitude better performance than carefully tuned neural networks for time series prediction on the Mackey-Glass series. The inputs are formed by L = 16 samples spaced 6 periods apart, xk = [z(k-6), z(k-12), ..., z(k-6L)], and the targets are chosen as tk = z(k), to perform six-steps-ahead prediction (see [19] for details). We use Gaussian basis functions of fixed variance ν² = 10. The test set comprises 5804 examples.\n\nWe perform prediction experiments for different sizes of the training set, in each case with 10 repetitions over different partitions of the data into training and test sets. We compare the test error, the number of RVs selected and the computer time needed for the batch and the SSEM methods. Figure 3 presents the results obtained with the growth method relative to the results obtained with the batch method. 
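The construction of the training pairs from the scalar series can be sketched as follows. A synthetic sine wave stands in for the actual Mackey-Glass series here, and the function and variable names are our own:

```python
import numpy as np

def delay_embed(z, L=16, tau=6):
    """Inputs x_k = [z(k-tau), z(k-2*tau), ..., z(k-L*tau)], targets t_k = z(k)."""
    start = L * tau                      # first k with a full lag vector available
    lags = tau * np.arange(1, L + 1)
    X = np.stack([z[k - lags] for k in range(start, len(z))])
    return X, z[start:]

z = np.sin(0.05 * np.arange(500))        # stand-in for the Mackey-Glass series
X, t = delay_embed(z)
```

Since the most recent input component is z(k-6), predicting t_k = z(k) from x_k is exactly the six-steps-ahead task described above.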
As expected, the relative computer time of the growth method compared with the batch method decreases with the size of the training set.\n\nFigure 3: Left: regression; mean values over 10 repetitions of the relative test error, number of RVs and computer time for the Mackey-Glass data, up to 2400 training examples and 5804 test examples. Right: classification; log marginal likelihood, test and training errors while training one class against all the others, 60000 training and 10000 test examples.\n\nFor a few thousand examples the SSEM method is an order of magnitude faster than the batch method. The batch method proved faster only for 100 training examples, and could not be used with data sets of thousands of examples on the machine on which we ran the experiments because of its high memory requirements. This is the reason why we only ran the comparison for up to 2400 training examples on the Mackey-Glass data set.\n\nOur experiments for classification were, at the time of sending this paper to press, very preliminary: we chose a very large data set, the MNIST database of handwritten digits [20], with 60000 training and 10000 test images of size 28 × 28 pixels, and used PCA to project them down to 16-dimensional vectors. We performed a preliminary experiment consisting of a one-against-all binary classification problem to illustrate that Bayesian approaches to classification can be used on very large data sets with the SSEM algorithm. 
We train on 13484 examples (the 6742 ones and 6742 non-one digits selected at random from the rest) and use 800 basis functions for both the batch EM and the Subspace EM. Figure 3 shows the convergence of the SSEM in terms of the log marginal likelihood and the training and test probabilities of error. The test probability of error is 0.74 percent with the SSEM algorithm and 0.66 percent with the batch EM. Under the same conditions the SSEM needed 55 minutes, while the batch EM needed 186 minutes. The SSEM yields a machine with 28 basis functions and the batch EM one with 31 basis functions.\n\n6 Conclusion\n\nWe have presented a new approach to Bayesian training of linear models, based on a subspace extension of the EM algorithm that we call Subspace EM (SSEM). The new method iteratively builds models from a potentially large library of basis functions. It is especially well suited to models constructed to yield a sparse solution, i.e. a solution spanned by a number M of basis functions that is much smaller than N, the number of examples. A prime example is Tipping's relevance vector machine, which typically produces solutions sparser than those of support vector machines. With the SSEM algorithm the computational complexity and memory requirements decrease from O(N³) and O(N²) to O(M²N) (somewhat higher for classification) and O(NM). For classification, we have presented a mean field approach that is expected to be a very good approximation to exact inference and contains the widely used Laplace approximation as an extreme case. We have applied the SSEM algorithm to one large regression data set and one large classification data set. Although the latter results are preliminary, we believe they demonstrate that Bayesian learning is possible for very large data sets. 
Similar methods should also be applicable beyond supervised learning.\n\nAcknowledgments. JQC is funded by the EU Multi-Agent Control Research Training Network - EC TMR grant HPRNCT-1999-00107. We thank Lars Kai Hansen for very useful discussions.\n\nReferences\n\n[1] Michael E. Tipping, “Sparse Bayesian learning and the relevance vector machine,” Journal of Machine Learning Research, vol. 1, pp. 211-244, 2001.\n\n[2] Vladimir N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.\n\n[3] Bernhard Schölkopf and Alex J. Smola, Learning with Kernels, MIT Press, Cambridge, 2002.\n\n[4] Carl E. Rasmussen, Evaluation of Gaussian Processes and Other Methods for Non-linear Regression, Ph.D. thesis, Dept. of Computer Science, University of Toronto, 1996.\n\n[5] Chris K. I. Williams and Carl E. Rasmussen, “Gaussian processes for regression,” in Advances in Neural Information Processing Systems, 1996, number 8, pp. 514-520.\n\n[6] D. J. C. MacKay, “Gaussian processes: a replacement for supervised neural networks?,” Tech. Rep., Cavendish Laboratory, Cambridge University, 1997. Notes for a tutorial at NIPS 1997.\n\n[7] Radford M. Neal, Bayesian Learning for Neural Networks, Springer, New York, 1996.\n\n[8] Manfred Opper and Ole Winther, “Gaussian processes for classification: mean field algorithms,” Neural Computation, vol. 12, pp. 2655-2684, 2000.\n\n[9] Michael Tipping and Anita Faul, “Fast marginal likelihood maximisation for sparse Bayesian models,” in International Workshop on Artificial Intelligence and Statistics, 2003.\n\n[10] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Statist. Soc. B, vol. 39, pp. 
185-197, 1977.\n\n[11] Chris Williams and Matthias Seeger, “Using the Nyström method to speed up kernel machines,” in Advances in Neural Information Processing Systems, 2001, number 13, pp. 682-688.\n\n[12] Alex J. Smola and Peter L. Bartlett, “Sparse greedy Gaussian process regression,” in Advances in Neural Information Processing Systems, 2001, number 13, pp. 619-625.\n\n[13] Lehel Csató and Manfred Opper, “Sparse representation for Gaussian process models,” in Advances in Neural Information Processing Systems, 2001, number 13, pp. 444-450.\n\n[14] Volker Tresp, “Mixtures of Gaussian processes,” in Advances in Neural Information Processing Systems, 2000, number 12, pp. 654-660.\n\n[15] Carl E. Rasmussen and Zoubin Ghahramani, “Infinite mixtures of Gaussian process experts,” in Advances in Neural Information Processing Systems, 2002, number 14.\n\n[16] Joaquin Quiñonero-Candela and Lars Kai Hansen, “Time series prediction based on the relevance vector machine with adaptive kernels,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002.\n\n[17] Michael E. Tipping, “The relevance vector machine,” in Advances in Neural Information Processing Systems, 2000, number 12, pp. 652-658.\n\n[18] David J. C. MacKay, “Bayesian interpolation,” Neural Computation, vol. 4, no. 3, pp. 415-447, 1992.\n\n[19] Claus Svarer, Lars K. Hansen, Jan Larsen, and Carl E. Rasmussen, “Designer networks for time series processing,” in IEEE NNSP Workshop, 1993, pp. 78-87.\n\n[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, 1998, vol. 86, pp. 
2278-2324.", "award": [], "sourceid": 2173, "authors": [{"given_name": "Joaquin", "family_name": "Candela", "institution": null}, {"given_name": "Ole", "family_name": "Winther", "institution": null}]}