{"title": "A Marginalized Particle Gaussian Process Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1187, "page_last": 1195, "abstract": "We present a novel marginalized particle Gaussian process (MPGP) regression, which provides a fast, accurate online Bayesian filtering framework to model the latent function. Using a state space model established by the data construction procedure, our MPGP recursively filters out the estimation of hidden function values by a Gaussian mixture. Meanwhile, it provides a new online method for training hyperparameters with a number of weighted particles. We demonstrate the estimated performance of our MPGP on both simulated and real large data sets. The results show that our MPGP is a robust estimation algorithm with high computational efficiency, which outperforms other state-of-art sparse GP methods.", "full_text": "A Marginalized Particle Gaussian Process Regression\n\nYali Wang\n\nand Brahim Chaib-draa\n\nDepartment of Computer Science\n\nLaval University\n\nQuebec, Quebec G1V0A6\n\n{wang,chaib}@damas.ift.ulaval.ca\n\nAbstract\n\nWe present a novel marginalized particle Gaussian process (MPGP) regression,\nwhich provides a fast, accurate online Bayesian \ufb01ltering framework to model the\nlatent function. Using a state space model established by the data construction\nprocedure, our MPGP recursively \ufb01lters out the estimation of hidden function\nvalues by a Gaussian mixture. Meanwhile, it provides a new online method for\ntraining hyperparameters with a number of weighted particles. We demonstrate\nthe estimated performance of our MPGP on both simulated and real large data\nsets. 
The results show that our MPGP is a robust estimation algorithm with high computational efficiency, which outperforms other state-of-the-art sparse GP methods.\n\n1 Introduction\n\nThe Gaussian process (GP) is a popular nonparametric Bayesian method for nonlinear regression. However, the $O(n^3)$ computational cost of training a GP model severely limits its applicability in practice when the number of training points $n$ is larger than a few thousand [1]. A number of attempts have been made to reduce this cost. One typical method is the sparse pseudo-input Gaussian process (SPGP) [2], which uses a pseudo-input data set with $m$ inputs ($m \\ll n$) to parameterize the GP predictive distribution and so reduce the computational burden. A sparse spectrum Gaussian process (SSGP) [3] was then proposed to further improve the performance of SPGP while retaining the computational efficiency by using a stationary trigonometric Bayesian model with $m$ basis functions. However, both SPGP and SSGP learn hyperparameters offline by maximizing the marginal likelihood before making the inference, so they risk falling into a local optimum. Another recent model is the Kalman filter Gaussian process (KFGP) [4], which reduces the computational load by correlating function values of data subsets at each Kalman filter iteration. But it still underfits or overfits if the hyperparameters are badly learned offline.\n\nIn contrast, we propose in this paper an online marginalized particle filter to simultaneously learn the hyperparameters and hidden function values. By collecting small data subsets sequentially, we establish a novel state space model which allows us to estimate the marginal posterior distribution (not the marginal likelihood) of hyperparameters online with a number of weighted particles. 
For each particle, a Kalman filter is applied to estimate the posterior distribution of hidden function values. We explain the method in detail below and show its validity via experiments on large datasets.\n\n2 Data Construction\n\nIn practice, the whole training data set is usually constructed by gathering small subsets several times. For the $t$th collection, the training subset $(X_t, y_t)$ consists of $n_t$ input-output pairs: $\\{(x_t^1, y_t^1), \\cdots, (x_t^{n_t}, y_t^{n_t})\\}$. Each scalar output $y_t^i$ is generated from a nonlinear function $f(x_t^i)$ of a $d$-dimensional input vector $x_t^i$ with an additive Gaussian noise $N(0, a_0^2)$. All the pairs are separately organized as an input matrix $X_t$ and output vector $y_t$. For simplicity, the whole training data with $T$ collections is symbolized as $(X_{1:T}, y_{1:T})$. The goal is a regression problem: estimating the function value of $f(x)$ at $m$ test inputs $X_\\star = [x_\\star^1, \\cdots, x_\\star^m]$ given $(X_{1:T}, y_{1:T})$.\n\n3 Gaussian Process Regression\n\nA Gaussian process (GP) represents a distribution over functions, which is a generalization of the Gaussian distribution to an infinite dimensional function space. Formally, it is a collection of random variables, any finite number of which have a joint Gaussian distribution [1]. Similar to a Gaussian distribution specified by a mean vector and covariance matrix, a GP is fully defined by a mean function $m(x) = E[f(x)]$ and covariance function $k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]$. Here we follow the practical choice that $m(x)$ is set to be zero. Moreover, due to the spatial nonstationary phenomena in the real world, we choose $k(x, x')$ as $k_{SE}(x, x') + k_{NN}(x, x')$, where\n\n$$k_{SE} = a_1^2 \\exp[-0.5 a_2^{-2} (x - x')^T (x - x')]$$\n\nis the stationary squared exponential covariance function and\n\n$$k_{NN} = a_3^2 \\sin^{-1}[a_4^{-2} \\tilde{x}^T \\tilde{x}' ((1 + a_4^{-2} \\tilde{x}^T \\tilde{x})(1 + a_4^{-2} \\tilde{x}'^{T} \\tilde{x}'))^{-0.5}]$$\n\nis the nonstationary neural network covariance function with the augmented input $\\tilde{x} = [1 \\; x^T]^T$. For simplicity, all the hyperparameters are collected into a vector $\\theta = [a_0 \\; a_1 \\; a_2 \\; a_3 \\; a_4]^T$.\n\nThe regression problem could be solved by the standard GP with the following two steps. First, learning $\\theta$ given $(X_{1:T}, y_{1:T})$. One technique is to draw samples from $p(\\theta | X_{1:T}, y_{1:T})$ using Markov Chain Monte Carlo (MCMC) [5, 6]; another popular way is to maximize the log evidence $p(y_{1:T} | X_{1:T}, \\theta)$ via a gradient based optimizer [1]. Second, estimating the distribution of the function value $p(f(X_\\star) | X_{1:T}, y_{1:T}, X_\\star, \\theta)$. From the perspective of GP, a function $f(x)$ could be loosely considered as an infinitely long vector in which each random variable is the function value at an input $x$, and any finite set of function values is jointly Gaussian distributed. Hence, the joint distribution $p(y_{1:T}, f(X_\\star) | X_{1:T}, X_\\star, \\theta)$ is a multivariate Gaussian distribution. Then according to the conditional property of the Gaussian distribution, $p(f(X_\\star) | X_{1:T}, y_{1:T}, X_\\star, \\theta)$ is also Gaussian distributed with the following mean vector $\\bar{f}(X_\\star)$ and covariance matrix $P(X_\\star, X_\\star)$ [1, 7]:\n\n$$\\bar{f}(X_\\star) = K_\\theta(X_\\star, X_{1:T}) [K_\\theta(X_{1:T}, X_{1:T}) + a_0^2 I]^{-1} y_{1:T}$$\n\n$$P(X_\\star, X_\\star) = K_\\theta(X_\\star, X_\\star) - K_\\theta(X_\\star, X_{1:T}) [K_\\theta(X_{1:T}, X_{1:T}) + a_0^2 I]^{-1} K_\\theta(X_\\star, X_{1:T})^T$$\n\nIf there are $n$ training inputs and $m$ test inputs, then $K_\\theta(X_\\star, X_{1:T})$ denotes an $m \\times n$ covariance matrix in which each entry is calculated by the covariance function $k(x, x')$ with the learned $\\theta$. $K_\\theta(X_{1:T}, X_{1:T})$ and $K_\\theta(X_\\star, X_\\star)$ are constructed similarly.\n\n4 Marginalized Particle Gaussian Process Regression\n\nEven though GP is an elegant nonparametric method for Bayesian regression, it is commonly infeasible for large data sets due to an $O(n^3)$ scaling for learning the model. In order to derive a computationally tractable GP model which preserves the estimation accuracy, we first derive a state space model from the data construction procedure, then propose a marginalized particle filter to estimate the hidden $f(X_\\star)$ and $\\theta$ in an online Bayesian filtering framework.\n\n4.1 State Space Model\n\nThe standard state space model (SSM) consists of the state equation and observation equation. The state equation reflects the Markovian evolution of hidden states (the hyperparameters and function values). 
For the hidden static hyperparameter $\\theta$, a popular method in filtering techniques is to add an artificial evolution using kernel smoothing which guarantees the estimation convergence [8, 9, 10]:\n\n$$\\theta_t = b \\theta_{t-1} + (1 - b) \\bar{\\theta}_{t-1} + s_{t-1} \\qquad (1)$$\n\nwhere $b = (3\\delta - 1)/(2\\delta)$, $\\delta$ is a discount factor which is typically around 0.95-0.99, $\\bar{\\theta}_{t-1}$ is the Monte Carlo mean of $\\theta$ at $t - 1$, and $s_{t-1} \\sim N(0, r^2 \\Sigma_{t-1})$, $r^2 = 1 - b^2$, $\\Sigma_{t-1}$ is the Monte Carlo variance matrix of $\\theta$ at $t - 1$. For hidden function values, we explore the relation between the $(t - 1)$th and $t$th data subsets. For simplicity, we denote $X_t^c = X_t \\cup X_\\star$ and $f_t^c = f(X_t^c)$. If $f(x) \\sim GP(0, k(x, x'))$, then the prior distribution $p(f_t^c, f_{t-1}^c | X_{t-1}^c, X_t^c, \\theta_t)$ is jointly Gaussian:\n\n$$p(f_t^c, f_{t-1}^c | X_{t-1}^c, X_t^c, \\theta_t) = N\\left(0, \\begin{bmatrix} K_{\\theta_t}(X_t^c, X_t^c) & K_{\\theta_t}(X_t^c, X_{t-1}^c) \\\\ K_{\\theta_t}(X_t^c, X_{t-1}^c)^T & K_{\\theta_t}(X_{t-1}^c, X_{t-1}^c) \\end{bmatrix}\\right)$$\n\nThen according to the conditional property of the Gaussian distribution, we get\n\n$$p(f_t^c | f_{t-1}^c, X_{t-1}^c, X_t^c, \\theta_t) = N(G(\\theta_t) f_{t-1}^c, Q(\\theta_t)) \\qquad (2)$$\n\nwhere\n\n$$G(\\theta_t) = K_{\\theta_t}(X_t^c, X_{t-1}^c) K_{\\theta_t}^{-1}(X_{t-1}^c, X_{t-1}^c) \\qquad (3)$$\n\n$$Q(\\theta_t) = K_{\\theta_t}(X_t^c, X_t^c) - K_{\\theta_t}(X_t^c, X_{t-1}^c) K_{\\theta_t}^{-1}(X_{t-1}^c, X_{t-1}^c) K_{\\theta_t}(X_t^c, X_{t-1}^c)^T \\qquad (4)$$\n\nThis conditional density (2) could be transformed into a linear equation of the function value with an additive Gaussian noise $v_t^f \\sim N(0, Q(\\theta_t))$:\n\n$$f_t^c = G(\\theta_t) f_{t-1}^c + v_t^f \\qquad (5)$$\n\nFinally, the observation (output) equation could be directly 
obtained from the $t$th data collection:\n\n$$y_t = H_t f_t^c + v_t^y \\qquad (6)$$\n\nwhere $H_t = [I_{n_t} \\; 0]$ is an index matrix to make $H_t f_t^c = f(X_t)$, since $y_t$ is only obtained from the $t$th training inputs $X_t$. The noise $v_t^y \\sim N(0, R(\\theta_t))$ is from Section 2, where $R(\\theta_t) = a_{0,t}^2 I$. Note that $a_0$ is a fixed unknown hyperparameter. We use the symbol $a_{0,t}$ only for consistency with the artificial evolution of $\\theta$. To sum up, our SSM is fully specified by (1), (5), (6).\n\n4.2 Bayesian Inference by Marginalized Particle Filter\n\nIn contrast to the GP regression with a two-step offline inference in Section 3, we propose an online filtering framework to simultaneously learn hyperparameters and estimate hidden function values. According to the SSM above, the inference problem is to compute the posterior distribution $p(f_t^c, \\theta_{1:t} | X_{1:t}, X_\\star, y_{1:t})$. One technique is MCMC, but MCMC usually suffers from a long convergence time. Hence we choose another popular technique, the particle filter. However, for our SSM, the traditional sampling importance resampling (SIR) particle filter will introduce unnecessary computational load because (5) in the SSM is a linear structure given $\\theta_t$. This inspires us to apply a more efficient marginalized particle filter (also called Rao-Blackwellised particle filter) [9, 11, 12, 13] to deal with the estimation problem by combining the Kalman filter with the particle filter. Using Bayes rule, the posterior could be factorized as\n\n$$p(f_t^c, \\theta_{1:t} | X_{1:t}, X_\\star, y_{1:t}) = p(\\theta_{1:t} | X_{1:t}, X_\\star, y_{1:t}) \\, p(f_t^c | \\theta_{1:t}, X_{1:t}, X_\\star, y_{1:t})$$\n\n$p(\\theta_{1:t} | X_{1:t}, X_\\star, y_{1:t})$ is a marginal posterior which could be solved by the particle filter. 
After obtaining the estimation of $\\theta_{1:t}$, the second term $p(f_t^c | \\theta_{1:t}, X_{1:t}, X_\\star, y_{1:t})$ could be computed by the Kalman filter since $f_t^c$ is the hidden state in the linear substructure (equation (5)) of the SSM.\n\nThe detailed inference procedure is as follows. First, $p(\\theta_{1:t} | X_{1:t}, X_\\star, y_{1:t})$ should be factorized in a recursive form so that it could be applied in the sequential importance sampling framework:\n\n$$p(\\theta_{1:t} | X_{1:t}, X_\\star, y_{1:t}) \\propto p(y_t | y_{1:t-1}, \\theta_{1:t}, X_{1:t}, X_\\star) \\, p(\\theta_t | \\theta_{t-1}) \\, p(\\theta_{1:t-1} | X_{1:t-1}, X_\\star, y_{1:t-1})$$\n\nAt each iteration of the sequential importance sampling, the particles for the hyperparameter vector are drawn from the proposal distribution $p(\\theta_t | \\theta_{t-1})$ (easily obtained from equation (1)), then the importance weight for each particle at $t$ could be computed according to $p(y_t | y_{1:t-1}, \\theta_{1:t}, X_{1:t}, X_\\star)$. This distribution could be solved analytically:\n\n$$p(y_t | y_{1:t-1}, \\theta_{1:t}, X_{1:t}, X_\\star) = \\int p(y_t, f_t^c | y_{1:t-1}, \\theta_{1:t}, X_{1:t}, X_\\star) \\, df_t^c$$\n\n$$= \\int p(y_t | f_t^c, \\theta_t, X_t, X_\\star) \\, p(f_t^c | y_{1:t-1}, \\theta_{1:t}, X_{1:t}, X_\\star) \\, df_t^c$$\n\n$$= \\int N(H_t f_t^c, R(\\theta_t)) \\, N(f_{t|t-1}^c, P_{t|t-1}^c) \\, df_t^c$$\n\n$$= N(H_t f_{t|t-1}^c, H_t P_{t|t-1}^c H_t^T + R(\\theta_t)) \\qquad (7)$$\n\nwhere $p(y_t | f_t^c, \\theta_t, X_t, X_\\star)$ follows a Gaussian distribution $N(H_t f_t^c, R(\\theta_t))$ (see equation (6)), and $p(f_t^c | y_{1:t-1}, \\theta_{1:t}, X_{1:t}, X_\\star) = N(f_{t|t-1}^c, P_{t|t-1}^c)$ is the prediction step of the Kalman filter for $f_t^c$, which is also Gaussian distributed with the predictive mean $f_{t|t-1}^c$ and covariance $P_{t|t-1}^c$.\n\nSecond, we explain how to compute $p(f_t^c | \\theta_{1:t}, X_{1:t}, X_\\star, y_{1:t})$ using the prediction-update Kalman filter. According to recursive Bayesian filtering, this posterior could be factorized as:\n\n$$p(f_t^c | \\theta_{1:t}, X_{1:t}, X_\\star, y_{1:t}) = \\frac{p(y_t | f_t^c, \\theta_t, X_t, X_\\star) \\, p(f_t^c | y_{1:t-1}, \\theta_{1:t}, X_{1:t}, X_\\star)}{p(y_t | y_{1:t-1}, \\theta_{1:t}, X_{1:t}, X_\\star)} \\qquad (8)$$\n\nIn the prediction step, the goal is to compute $p(f_t^c | y_{1:t-1}, \\theta_{1:t}, X_{1:t}, X_\\star)$, which is an integral:\n\n$$p(f_t^c | y_{1:t-1}, \\theta_{1:t}, X_{1:t}, X_\\star) = \\int p(f_t^c, f_{t-1}^c | y_{1:t-1}, \\theta_{1:t}, X_{1:t}, X_\\star) \\, df_{t-1}^c$$\n\n$$= \\int p(f_t^c | f_{t-1}^c, \\theta_t, X_{t-1:t}, X_\\star) \\, p(f_{t-1}^c | y_{1:t-1}, \\theta_{1:t-1}, X_{1:t-1}, X_\\star) \\, df_{t-1}^c$$\n\n$$= \\int N(G(\\theta_t) f_{t-1}^c, Q(\\theta_t)) \\, N(f_{t-1|t-1}^c, P_{t-1|t-1}^c) \\, df_{t-1}^c$$\n\n$$= N(G(\\theta_t) f_{t-1|t-1}^c, G(\\theta_t) P_{t-1|t-1}^c G(\\theta_t)^T + Q(\\theta_t)) \\qquad (9)$$\n\nwhere $p(f_t^c | f_{t-1}^c, \\theta_t, X_{t-1:t}, X_\\star)$ is directly from (2), and $p(f_{t-1}^c | y_{1:t-1}, \\theta_{1:t-1}, X_{1:t-1}, X_\\star) = N(f_{t-1|t-1}^c, P_{t-1|t-1}^c)$ is the posterior estimation for $f_{t-1}^c$. Since $p(f_t^c | y_{1:t-1}, \\theta_{1:t}, X_{1:t}, X_\\star)$ could also be expressed as $N(f_{t|t-1}^c, P_{t|t-1}^c)$, the prediction step is summarized as:\n\n$$f_{t|t-1}^c = G(\\theta_t) f_{t-1|t-1}^c, \\quad P_{t|t-1}^c = G(\\theta_t) P_{t-1|t-1}^c G(\\theta_t)^T + Q(\\theta_t) \\qquad (10)$$\n\nIn the update step, the current observation density $p(y_t | f_t^c, \\theta_t, X_t, X_\\star) = N(H_t f_t^c, R(\\theta_t))$ is used to correct the prediction. Putting (7) and (9) into (8), $p(f_t^c | \\theta_{1:t}, X_{1:t}, X_\\star, y_{1:t}) = N(f_{t|t}^c, P_{t|t}^c)$ is actually Gaussian distributed with the Kalman gain $\\Gamma_t$, where:\n\n$$\\Gamma_t = P_{t|t-1}^c H_t^T (H_t P_{t|t-1}^c H_t^T + R(\\theta_t))^{-1} \\qquad (11)$$\n\n$$f_{t|t}^c = f_{t|t-1}^c + \\Gamma_t (y_t - H_t f_{t|t-1}^c), \\quad P_{t|t}^c = P_{t|t-1}^c - \\Gamma_t H_t P_{t|t-1}^c \\qquad (12)$$\n\nFinally, the whole algorithm ($t = 1, 2, 3, \\ldots$) is summarized as follows:\n\n\u2022 For $i = 1, 2, \\ldots, N$\n  \u2013 Draw $\\theta_t^i \\sim p(\\theta_t | \\tilde{\\theta}_{t-1}^i)$ according to (1)\n  \u2013 Use $\\theta_t^i$ to specify $k(x, x')$ in the GP and construct $G(\\theta_t^i)$, $Q(\\theta_t^i)$, $R(\\theta_t^i)$ in (3-4) and (6)\n  \u2013 Kalman predict: put $\\tilde{f}_{t-1|t-1}^{c,i}$, $\\tilde{P}_{t-1|t-1}^{c,i}$ into (10) to compute $f_{t|t-1}^{c,i}$ and $P_{t|t-1}^{c,i}$\n  \u2013 Kalman update: put $f_{t|t-1}^{c,i}$, $P_{t|t-1}^{c,i}$ into (11) and (12) to compute $f_{t|t}^{c,i}$ and $P_{t|t}^{c,i}$\n  \u2013 Put $f_{t|t-1}^{c,i}$, $P_{t|t-1}^{c,i}$, $R(\\theta_t^i)$ into (7) to compute the importance weight $\\bar{w}_t^i$\n\u2022 Normalize the weights: $w_t^i = \\bar{w}_t^i / (\\sum_{i=1}^N \\bar{w}_t^i)$ ($i = 1, \\ldots, N$)\n\u2022 Hyperparameter and hidden function value estimation:\n$$\\hat{\\theta}_t = \\sum_{i=1}^N w_t^i \\theta_t^i$$\n$$\\hat{f}_{t|t}^c = \\sum_{i=1}^N w_t^i f_{t|t}^{c,i} \\Rightarrow \\hat{f}_{t|t}^\\star = H_t^\\star \\hat{f}_{t|t}^c$$\n$$\\hat{P}_{t|t}^c = \\sum_{i=1}^N w_t^i (P_{t|t}^{c,i} + (f_{t|t}^{c,i} - \\hat{f}_{t|t}^c)(f_{t|t}^{c,i} - \\hat{f}_{t|t}^c)^T) \\Rightarrow \\hat{P}_{t|t}^\\star = H_t^\\star \\hat{P}_{t|t}^c (H_t^\\star)^T$$\nwhere $H_t^\\star = [0 \\; I_m]$ is an index matrix to get the function value estimation at $X_\\star$\n\u2022 Resampling: For $i = 1, \\ldots, N$, resample $\\theta_t^i$, $f_{t|t}^{c,i}$, $P_{t|t}^{c,i}$ with respect to the importance weight $w_t^i$ to obtain $\\tilde{\\theta}_t^i$, $\\tilde{f}_{t|t}^{c,i}$, $\\tilde{P}_{t|t}^{c,i}$ for the next step\n\nFigure 1: Estimation result comparison. (a-b) show the estimation for $f_1$ at $t = 10$ by SE-KFGP (blue line with blue dashed interval in (a)), SE-MPGP (red line with red dashed interval in (a)), SENN-KFGP (blue line with blue dashed interval in (b)), SENN-MPGP (red line with red dashed interval in (b)). The black crosses are the training outputs at $t = 10$, the black line is the true $f(X_\\star)$. The notation of (c-d), (e-f), (g-h) is the same as (a-b) except that (c-d) are for $f_2$ at $t = 10$, (e-f) are for $f_1$ at $t = 100$, (g-h) are for $f_2$ at $t = 50$. (i-m), (n-r) are the estimation of the log hyperparameters ($\\log(a_0)$ to $\\log(a_4)$) for $f_1$, $f_2$ over time.\n\nAt each iteration, our marginalized particle Gaussian process (MPGP) uses a small training subset to estimate $f(X_\\star)$ by Kalman filters, and learns hyperparameters online by weighted particles. The computational cost of the marginalized particle filter is governed by $O(N T S^3)$ [10], where $N$ is the number of particles, $T$ is the number of data collections, and $S$ is the size of each collection. This could largely reduce the computational load. Moreover, the MPGP propagates the previous estimation to improve the current accuracy in the recursive filtering framework. From the algorithm above, we also find that $f(X_\\star)$ is estimated as a Gaussian mixture at each iteration since each hyperparameter particle is accompanied by a Kalman filter for $f(X_\\star)$. Hence the MPGP could accelerate the computational speed, while preserving the accuracy. Additionally, it is worth mentioning that the Kalman filter GP (KFGP) [4] is a special case of our MPGP since the KFGP first trains the hyperparameter vector offline and uses it to specify the SSM, then estimates $p(f_t^c | \\theta_{1:t}, X_{1:t}, X_\\star, y_{1:t})$ by the Kalman filter. 
But the offline learning procedure in KFGP will either take a long time when using a large extra training set or fall into an unsatisfactory local optimum when using a small extra training set. In our MPGP, the local optimum could be used as the initial setting of hyperparameters, then the underlying $\\theta$ could be learned online by the marginalized particle filter to improve the performance.\n\nFinally, to avoid confusion, we should clarify the difference between our MPGP and the GP modeled Bayesian filters [14, 15]. The goal of GP modeled Bayesian filters is to use GP modeling for Bayesian filtering; in contrast, our MPGP uses Bayesian filtering for GP modeling.\n\n5 Experiments\n\nTwo Synthetic Datasets: The proposed MPGP is first evaluated on two simulated one-dimensional datasets. One is a function with a sharp peak which is spatially inhomogeneously smooth [16]: $f_1(x) = \\sin(x) + 2 \\exp(-30 x^2)$. For $f_1(x)$, we gather the training data with 100 collections. For each collection, we randomly select 30 inputs from [-2, 2], then calculate their outputs by adding a Gaussian noise $N(0, 0.3^2)$ to their function values. The test inputs range from -2 to 2 with an interval of 0.05. The other function has a discontinuity [17]: if $0 \\le x \\le 0.3$, $f_2(x) = N(x; 0.6, 0.2^2) + N(x; 0.15, 0.05^2)$; if $0.3 < x \\le 1$, $f_2(x) = N(x; 0.6, 0.2^2) + N(x; 0.15, 0.05^2) + 4$. For $f_2(x)$, we gather the training data with 50 collections. For each collection, we randomly select 60 inputs from [0, 1], then calculate their outputs by adding a Gaussian noise $N(0, 0.8^2)$ to their function values. The test inputs range from 0 to 1 with an interval of 0.02.\n\nThe first experiment aims to evaluate the estimation performance in comparison with KFGP in [4]. We denote SE-KFGP, SENN-KFGP as KFGP with the covariance function $k_{SE}$, KFGP with the covariance function $k_{SE} + k_{NN}$. 
Similarly, SE-MPGP and SENN-MPGP are MPGP with $k_{SE}$ and MPGP with $k_{SE} + k_{NN}$. The number of particles in MPGP is set to 10. The evaluation criteria are the test Normalized Mean Square Error (NMSE) and the test Mean Negative Log Probability (MNLP) as suggested in [3]. First, Figure 1 shows that the estimation performance of both KFGP and MPGP improves and tends to converge over time (see (a-h)), since the previous estimation is incorporated into the current estimation in the recursive Bayesian filtering. Second, for both $f_1$ and $f_2$, the estimation of MPGP is better than that of KFGP according to the NMSE and MNLP comparison in Figure 2. The KFGP uses offline learned hyperparameters all the time. In contrast, MPGP initializes hyperparameters using the ones from KFGP, then learns the true hyperparameters online (see (i-r) in Figure 1). Hence the MNLP of MPGP is much lower than that of KFGP. Finally, if we only focus on our MPGP, we find that SENN-MPGP is better than SE-MPGP since SENN-MPGP takes the spatial nonstationary phenomenon into account.\n\nFigure 2: The NMSE and MNLP of KFGP and MPGP for $f_1$, $f_2$ over time.\n\nThe second experiment aims to illustrate the average performance of SE-MPGP and SENN-MPGP when the number of particles increases. For each number of particles, we run the SE-MPGP and SENN-MPGP 5 times and compute the average NMSE and MNLP. From Figure 3, we find the following. First, as the number of particles increases, the NMSE and MNLP of SE-MPGP and SENN-MPGP decrease at first and then converge, while the running time keeps increasing. The reason is that both the estimation accuracy and the computational load of particle filters increase with the number of particles. Second, the average performance of SENN-MPGP is better than SE-MPGP since it captures the spatial nonstationarity, but SENN-MPGP needs more running time since the size of the hyperparameter vector to be inferred increases.\n\nFigure 3: The NMSE and MNLP of MPGP as a function of the number of particles. The first row is for $f_1$, the second row is for $f_2$.\n\nThe third experiment aims to compare our MPGP with the benchmarks. The state-of-the-art sparse GP methods we choose are the sparse pseudo-input Gaussian process (SPGP) [2] and the sparse spectrum Gaussian process (SSGP) [3]. Moreover, we also want to examine the robustness of our MPGP, since we should clarify whether the good estimation of our MPGP heavily depends on the order of training data collection. 
Hence, we randomly shuffle the order of the training subsets we used before, then implement SPGP with 5 pseudo inputs (5-SPGP), SSGP with 10 basis functions (10-SSGP), SE-MPGP with 5 particles (5-SE-MPGP), SENN-MPGP with 5 particles (5-SENN-MPGP).\n\nTable 1: Benchmarks Comparison for Synthetic Datasets. NMSEi, MNLPi, RTimei represent the NMSE, MNLP and running time for the function $f_i$ ($i = 1, 2$).\n\nMethod | NMSE1 | MNLP1 | RTime1 | NMSE2 | MNLP2 | RTime2\n5-SPGP | 0.2243 | 0.5409 | 28.6418s | 0.5445 | 1.5950 | 30.3578s\n10-SSGP | 0.0887 | 0.1606 | 18.8605s | 0.1144 | 1.1208 | 10.2025s\n5-SE-MPGP | 0.0880 | 1.6318 | 12.5737s | 0.1687 | 1.3524 | 12.4801s\n5-SENN-MPGP | 0.0881 | 0.1820 | 18.7513s | 0.1289 | 1.1782 | 11.5909s\n\nTable 2: Benchmarks Comparison. Data1 is the temperature dataset. Data2 is the pendulum dataset.\n\nData1 | NMSE | MNLP | RTime\n5-SPGP | 0.48 | 1.62 | 181.3s\n10-SSGP | 0.27 | 1.33 | 97.16s\n5-SE-MPGP | 0.11 | 1.05 | 50.99s\n5-SENN-MPGP | 0.10 | 1.16 | 59.25s\n\nData2 | NMSE | MNLP | RTime\n10-SPGP | 0.61 | 1.98 | 16.54s\n10-SSGP | 1.04 | 10.85 | 23.59s\n20-SE-MPGP | 0.63 | 2.20 | 7.04s\n20-SENN-MPGP | 0.58 | 2.12 | 8.60s\n\nIn Table 1, our 5-SE-MPGP mainly outperforms SPGP except that its MNLP1 is worse than that of SPGP. The reason is that the synthetic functions are nonstationary but SE-MPGP uses a stationary SE kernel. Hence we run 5-SENN-MPGP with a nonstationary kernel to show that our MPGP is competitive with SSGP, and much better than SPGP with a shorter running time.\n\nGlobal Surface Temperature Dataset: We present here a preliminary analysis of the Global Surface Temperature Dataset in January 2011 (http://data.giss.nasa.gov/gistemp/). We first gather the training data with 100 collections. For each collection, we randomly select 90 data points where the input vector is the longitude and latitude location and the output is the temperature (\u00b0C). There are two test data sets: the first one is a grid test input set (Longitude: -180:40:180, Latitude: -90:20:90) that is used to show the estimated surface temperature. The second test input set (100 points) is randomly selected from the data website after obtaining all the training data.\n\nThe first experiment aims to show the predicted surface temperature at the grid test inputs. We set the number of particles in the SE-MPGP and SENN-MPGP to 20. From Figure 4, the KFGP methods are stuck in the local optimum: SE-KFGP seems to underfit since it does not model the cold region around the location (100, 50), and SENN-KFGP seems to overfit since it unexpectedly models a cold region around (-100, -50). 
In contrast, SE-MPGP and SENN-MPGP fit the data set suitably via online hyperparameter learning.\n\nThe second experiment evaluates the estimation error of our MPGP using the second test data. We first run all the methods to compute the NMSE and MNLP over iterations. From the first row of Figure 5, the NMSE and MNLP of MPGP are lower than those of KFGP. Moreover, those of SENN-MPGP are much lower than those of SE-MPGP, which shows that SENN-MPGP successfully models the spatial nonstationarity of the temperature data. Then we change the number of particles. For each number, we run SE-MPGP and SENN-MPGP 3 times to evaluate the average NMSE, MNLP and running time. The results show that SENN-MPGP fits the data better than SE-MPGP, but the trade-off is a longer running time.\n\nThe third experiment compares our MPGP with the benchmarks. All the notation is the same as in the third experiment on the simulated data. We also randomly shuffle the order of training subsets to examine robustness. From Table 2, the comparison results show that our MPGP uses a shorter running time with a better estimation performance than SPGP and SSGP.\n\nPendulum Dataset: This is a small data set which contains 315 training points. In [3], it is mentioned that the SSGP model seems to overfit this data due to the gradient ascent optimization. We are interested in whether our method can successfully capture the nonlinear property of this pendulum data. We first collect the training data 9 times, with 35 training points in each collection. Then, 100 test points are randomly selected for evaluating the performance. From Table 2, our SENN-MPGP obtains the estimation with the fastest speed and the smallest NMSE among all the methods, and the MNLP is competitive with SPGP.\n\nFigure 4: The temperature estimation at $t = 100$. 
The first row (from left to right): the temperature value bar, the full training observation plot, the grid test output estimation by SE-KFGP, SENN-KFGP, SE-MPGP, SENN-MPGP. The black crosses are the observations at $t = 100$. The second row (from left to right) is the estimation of log hyperparameters ($\\log(a_0)$ to $\\log(a_4)$).\n\nFigure 5: The NMSE and MNLP evaluation. The first row: the NMSE and MNLP over iteration. The second row: the average NMSE, MNLP, and running time as a function of the number of particles.\n\n6 Conclusion\n\nWe have proposed a novel Bayesian filtering framework for GP regression, which is a fast and accurate online method. Our MPGP framework not only estimates the function value successfully, but also provides a new technique for learning the unknown static hyperparameters by estimating the marginal posterior of hyperparameters online. The small training set at each iteration largely reduces the computational load, while the estimation performance improves over iterations because recursive filtering propagates the previous estimation to enhance the current estimation. In comparison with other benchmarks, we have shown that our MPGP could provide a robust estimation with a competitive computational speed. 
In the future, it would be interesting to explore the time-varying function estimation with our MPGP.\n\nReferences\n\n[1] C. E. Rasmussen, C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA, 2006.\n\n[2] E. Snelson, Z. Ghahramani, Sparse Gaussian processes using pseudo-inputs, in: NIPS, 2006, pp. 1257\u20131264.\n\n[3] M. L.-Gredilla, J. Q.-Candela, C. E. Rasmussen, A. R. F.-Vidal, Sparse spectrum Gaussian process regression, Journal of Machine Learning Research 11 (2010) 1865\u20131881.\n\n[4] S. Reece, S. Roberts, An introduction to Gaussian processes for the Kalman filter expert, in: FUSION, 2010.\n\n[5] R. M. 
Neal, Monte Carlo implementation of Gaussian process models for Bayesian regression\nand classification, Tech. rep., Department of Statistics, University of Toronto (1997).\n\n[6] D. J. C. MacKay, Introduction to Gaussian processes, in: Neural Networks and Machine Learning,\n1998, pp. 133\u2013165.\n\n[7] M. P. Deisenroth, Efficient reinforcement learning using Gaussian processes, Ph.D. thesis,\nKarlsruhe Institute of Technology (2010).\n\n[8] J. Liu, M. West, Combined parameter and state estimation in simulation-based filtering, in:\nSequential Monte Carlo Methods in Practice, 2001, pp. 197\u2013223.\n\n[9] P. Li, R. Goodall, V. Kadirkamanathan, Estimation of parameters in a linear state space model\nusing a Rao-Blackwellised particle filter, IEE Proceedings on Control Theory and Applications\n151 (2004) 727\u2013738.\n\n[10] N. Kantas, A. Doucet, S. S. Singh, J. M. Maciejowski, An overview of sequential Monte Carlo\nmethods for parameter estimation in general state space models, in: 15th IFAC Symposium\non System Identification, 2009.\n\n[11] A. Doucet, N. de Freitas, K. Murphy, S. Russell, Rao-Blackwellised particle filtering for dynamic\nBayesian networks, in: UAI, 2000, pp. 176\u2013183.\n\n[12] N. de Freitas, Rao-Blackwellised particle filtering for fault diagnosis, in: IEEE Aerospace\nConference Proceedings, 2002, pp. 1767\u20131772.\n\n[13] T. Sch\u00f6n, F. Gustafsson, P.-J. Nordlund, Marginalized particle filters for mixed linear/nonlinear\nstate-space models, IEEE Transactions on Signal Processing 53 (2005) 2279\u20132289.\n\n[14] J. Ko, D. Fox, GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation\nmodels, in: IROS, 2008, pp. 3471\u20133476.\n\n[15] M. P. Deisenroth, R. Turner, M. F. Huber, U. D. Hanebeck, C. E. Rasmussen, Robust filtering\nand smoothing with Gaussian processes, IEEE Transactions on Automatic Control.\n\n[16] I. DiMatteo, C. R. 
Genovese, R. E. Kass, Bayesian curve fitting with free-knot splines,\nBiometrika 88 (2001) 1055\u20131071.\n\n[17] S. A. Wood, Bayesian mixture of splines for spatially adaptive nonparametric regression,\nBiometrika 89 (2002) 513\u2013528.\n", "award": [], "sourceid": 583, "authors": [{"given_name": "Yali", "family_name": "Wang", "institution": null}, {"given_name": "Brahim", "family_name": "Chaib-draa", "institution": null}]}