{"title": "Gaussian process regression with Student-t likelihood", "book": "Advances in Neural Information Processing Systems", "page_first": 1910, "page_last": 1918, "abstract": "In the Gaussian process regression the observation model is commonly assumed to be Gaussian, which is convenient in computational perspective. However, the drawback is that the predictive accuracy of the model can be significantly compromised if the observations are contaminated by outliers. A robust observation model, such as the Student-t distribution, reduces the influence of outlying observations and improves the predictions. The problem, however, is the analytically intractable inference. In this work, we discuss the properties of a Gaussian process regression model with the Student-t likelihood and utilize the Laplace approximation for approximate inference. We compare our approach to a variational approximation and a Markov chain Monte Carlo scheme, which utilize the commonly used scale mixture representation of the Student-t distribution.", "full_text": "Gaussian process regression with Student-t likelihood\n\nJarno Vanhatalo\n\nDepartment of Biomedical Engineering\n\nand Computational Science\n\nHelsinki University of Technology\n\nFinland\n\nPasi Jyl\u00a8anki\n\nDepartment of Biomedical Engineering\n\nand Computational Science\n\nHelsinki University of Technology\n\nFinland\n\njarno.vanhatalo@tkk.fi\n\npasi.jylanki@tkk.fi\n\nAki Vehtari\n\nDepartment of Biomedical Engineering\n\nand Computational Science\n\nFinland\n\nHelsinki University of Technology\n\naki.vehtari@tkk.fi\n\nAbstract\n\nIn the Gaussian process regression the observation model is commonly assumed\nto be Gaussian, which is convenient in computational perspective. However, the\ndrawback is that the predictive accuracy of the model can be signi\ufb01cantly com-\npromised if the observations are contaminated by outliers. 
A robust observation model, such as the Student-t distribution, reduces the influence of outlying observations and improves the predictions. The problem, however, is the analytically intractable inference. In this work, we discuss the properties of a Gaussian process regression model with the Student-t likelihood and utilize the Laplace approximation for approximate inference. We compare our approach to a variational approximation and a Markov chain Monte Carlo scheme, which utilize the commonly used scale mixture representation of the Student-t distribution.

1 Introduction

A commonly used observation model in Gaussian process (GP) regression is the normal distribution. This is convenient since the inference is analytically tractable up to the covariance function parameters. However, a known limitation of the Gaussian observation model is its non-robustness, and replacing the normal distribution with a heavy-tailed one, such as the Student-t distribution, can be useful in problems with outlying observations.

If both the prior and the likelihood are Gaussian, the posterior is Gaussian with its mean between the prior mean and the observations. In the case of a conflict, this compromise is not supported by either of the information sources. Thus, outlying observations may significantly reduce the accuracy of the inference. For example, a single corrupted observation may pull the posterior expectation of the unknown function value considerably far from the level described by the other observations (see Figure 1). A robust, or outlier-prone, observation model would, however, down-weight the outlying observations more, the further away they are from the other observations and the prior mean.

The idea of robust regression is not new. Outlier rejection was described already by De Finetti [1], and theoretical results were given by Dawid [2] and O'Hagan [3].
The Student-t observation model with linear regression was studied already by West [4] and Geweke [5], and Neal [6] introduced it for GP regression. Other robust observation models include, for example, mixtures of Gaussians, the Laplace distribution, and input-dependent observation models [7-10]. The challenge with the Student-t model is the inference, which is analytically intractable. A common approach has been to use the scale-mixture representation of the Student-t distribution [5], which enables Gibbs sampling [5, 6], and a factorized variational approximation (VB) for the posterior inference [7, 11].

(a) Gaussian observation model. (b) Student-t observation model.

Figure 1: An example of regression with outliers by Neal [6]. On the left the Gaussian and on the right the Student-t observation model. The true function is plotted with a black line.

Here, we discuss the properties of GP regression with a Student-t likelihood and utilize the Laplace approximation for the approximate inference. We discuss the known weaknesses of the approximation scheme and show that in practice it works very well and quickly. We use several different data sets to compare it to both a full MCMC and a factorized VB, which utilize the scale mixture equivalent of the Student-t distribution. We show that the predictive performances are similar and that the Laplace method approximates the posterior covariance somewhat better than VB. We also point out some of the similarities between these two methods and discuss their differences.

2 Robust regression with Gaussian processes

Consider a regression problem, where the data comprise observations y_i = f(x_i) + ε_i at input locations X = {x_i}_{i=1}^n, and the observation errors ε_1, ..., ε_n are zero-mean exchangeable random variables.
The object of inference is the latent function f, which is given a Gaussian process prior. This implies that any finite subset of latent variables, f = {f(x_i)}_{i=1}^n, has a multivariate Gaussian distribution. In particular, at the observed input locations X the latent variables have the distribution

p(f |X) = N(f |μ, K_f,f),   (1)

where K_f,f is the covariance matrix and μ the mean function. For notational simplicity, we will use a zero-mean Gaussian process. Each element in the covariance matrix is a realization of the covariance function, [K_f,f]_ij = k(x_i, x_j), which represents the prior assumptions about the smoothness of the latent function (for a detailed introduction to GP regression see [12]). The covariance function used in this work is the stationary squared exponential

k_se(x_i, x_j) = σ_se^2 exp(−Σ_{d=1}^D (x_{i,d} − x_{j,d})^2 / l_d^2),

where σ_se^2 is the scaling parameter and the l_d are the length-scales.

A formal definition of robustness is given, for example, in terms of an outlier-prone observation model. The observation model is defined to be outlier-prone of order n if p(f | y_1, ..., y_{n+1}) → p(f | y_1, ..., y_n) as y_{n+1} → ∞ [3, 4]. That is, the effect of a single conflicting observation on the posterior becomes asymptotically negligible as the observation approaches infinity. This contrasts heavily with the Gaussian observation model, where each observation influences the posterior no matter how far it is from the others. The zero-mean Student-t distribution

p(y_i | f_i, σ, ν) = Γ((ν + 1)/2) / (Γ(ν/2) √(νπ) σ) × (1 + (y_i − f_i)^2/(νσ^2))^{−(ν + 1)/2},   (2)

where ν is the degrees of freedom and σ the scale parameter [13], is outlier-prone of order 1, and it can reject up to m outliers if there are at least 2m observations in all [3].
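To make the heavy-tail behavior concrete, the density (2) can be compared with a Gaussian of the same scale. Below is a minimal sketch using scipy.stats; the parameter values and the observation grid are our own illustrative choices, not from the paper:

```python
import numpy as np
from scipy import stats

# Student-t observation model of eq. (2): nu degrees of freedom, scale sigma.
# (Values below are illustrative only.)
nu, sigma = 4.0, 0.5
f = 0.0  # latent function value at the input in question

def logp_t(y):
    # log-density of observation y under the Student-t noise model
    return stats.t.logpdf(y, df=nu, loc=f, scale=sigma)

def logp_g(y):
    # log-density under a Gaussian noise model with the same scale
    return stats.norm.logpdf(y, loc=f, scale=sigma)

# A conflicting observation far from f is penalized far less by the
# Student-t model: its log-density decays like -(nu+1)*log|y - f|,
# while the Gaussian's decays like -(y - f)^2 / (2 sigma^2).
for y in [0.5, 3.0, 10.0]:
    print(y, logp_t(y), logp_g(y))
```

This logarithmic (rather than quadratic) decay of the log-density is what makes the outlier rejection described above possible.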
From here on we collect all the hyperparameters into θ = {σ_se^2, l_1, ..., l_D, σ, ν}.

3 Inference with the Laplace approximation

3.1 The conditional posterior of the latent variables

Our approach is motivated by the Laplace approximation in GP classification [14]. A similar approximation has been considered by West [4] in the case of robust linear regression and by Rue et al. [15] in their integrated nested Laplace approximation (INLA). Below we follow the notation of Rasmussen and Williams [12].

A second-order Taylor expansion of log p(f | y, θ) around the mode gives a Gaussian approximation

p(f | y, θ) ≈ q(f | y, θ) = N(f | f̂, Σ),   (3)

where f̂ = arg max_f p(f | y, θ) and Σ^-1 is the Hessian of the negative log conditional posterior at the mode f̂ [12, 13]:

Σ^-1 = −∇∇ log p(f | y, θ)|_{f = f̂} = K_f,f^-1 + W,   (4)

where W is diagonal with entries

W_ii = −(ν + 1) (r_i^2 − νσ^2) / (r_i^2 + νσ^2)^2,   r_i = y_i − f_i,

and W_ji = 0 if i ≠ j.

3.2 The maximum a posteriori estimate of the hyperparameters

To find a maximum a posteriori (MAP) estimate for the hyperparameters, we write p(θ| y) ∝ p(y |θ)p(θ), where

p(y |θ) = ∫ p(y | f) p(f |X, θ) df   (5)

is the marginal likelihood. To find an approximation, q(y |θ), for the marginal likelihood one can utilize the Laplace method a second time [12].
A Taylor expansion of the logarithm of the integrand in (5) around f̂ gives a Gaussian integral over f multiplied by a constant, giving

log q(y |θ) = log p(y | f̂) − (1/2) f̂^T K_f,f^-1 f̂ − (1/2) log |K_f,f| − (1/2) log |K_f,f^-1 + W|.   (6)

The hyperparameters can then be optimized by maximizing the approximate log marginal posterior, log q(θ| y) ∝ log q(y |θ) + log p(θ). This is differentiable with respect to θ, which enables the use of gradient-based optimization to find θ̂ = arg max_θ q(θ| y) [12].

3.3 Making predictions

The approximate posterior distribution of a latent variable f_* at a new input location x_* is also Gaussian, and therefore defined by its mean and variance [12]

E_{q(f | y,θ)}[f_* |X, y, x_*] = K_*,f K_f,f^-1 f̂ = K_*,f ∇ log p(y | f̂)   (7)
Var_{q(f | y,θ)}[f_* |X, y, x_*] = K_*,* − K_*,f (K_f,f + W^-1)^-1 K_f,*.   (8)

The predictive distribution of a new observation is obtained by marginalizing over the posterior distribution of f_*,

q(y_* |X, y, x_*) = ∫ p(y_* | f_*) q(f_* |X, y, x_*) df_*,   (9)

which can be evaluated, for example, with Gaussian quadrature integration.

3.4 Properties of the Laplace approximation

The Student-t distribution is not log-concave, and therefore the posterior distribution may be multimodal. The immediate concern from this is that a unimodal Laplace approximation may give a poor estimate of the posterior. This is, however, a problem for all unimodal approximations, such as the VB in [7, 11].

Figure 2: A comparison of the Laplace and VB approximations for p(f |θ, y) in the case of a single observation with the Student-t likelihood and a Gaussian prior: (a) greater prior variance than likelihood variance; (b) equal prior and likelihood variance. The likelihood is centered at zero and the prior mean is altered. The upper plots show the probability density functions and the lower plots the variance of the true posterior and its approximations as a function of the posterior mean.

Another concern is that the estimate of the posterior precision, Σ^-1 = −∇∇ log p(f | y, θ)|_{f = f̂}, is essentially uncontrolled. However, at a posterior mode f̂ the Hessian Σ^-1 is always positive definite, and in practice it approximates the truth rather well according to our experiments. If the optimization for f ends up in a saddle point, or the mode is very flat, Σ^-1 may be close to singular, which leads to problems in the implementation. In this section, we discuss these issues with simple examples and address the implementation in section 4.

Consider a single observation y_i = 0 from a Student-t distribution with a Gaussian prior for its mean, f_i. The behavior of the true posterior, the Laplace approximation, and VB as a function of the prior mean is illustrated in the upper plots of Figure 2. The dotted lines represent the situation where the observation is a clear outlier, in which case the posterior is very close to the prior (cf. section 2). The solid lines represent a situation where the prior and data agree, and the dashed lines represent a situation where the prior and data conflict moderately.

The posterior of the mean is unimodal if Σ(f_i)^-1 = τ_i^-2 + W(f_i) > 0 for all f_i ∈ R, where τ_i^2 is the prior variance and W(f_i) is the Hessian of the negative log likelihood at f_i (see equations (3) and (4)). With ν and σ fixed, W(f_i) reaches its (negative) minimum at |y_i − f_i| = √(3ν)σ, where Σ^-1 = τ_i^-2 − (ν + 1)/(8νσ^2). Therefore, the posterior distribution is unimodal if τ_i^-2 > (ν + 1)/(8νσ^2), or in terms of variances, if Var[y_i | f_i, ν, σ]/τ_i^2 > (ν + 1)/(8(ν − 2)) (for ν > 2). It follows that the most problematic situation for the Laplace approximation is when the prior is much wider than the likelihood. Then, in the case of a moderate conflict (|y_i − f̂_i| close to √(3ν)σ), the posterior may be multimodal (see Figure 2(a)), meaning that it is unclear whether the observation is an outlier or not. In this case W(f_i) is negative and Σ^-1 may be close to zero, which reflects uncertainty about the location. In the implementation this may lead to numerical problems, but in practice the problem becomes concrete only seldom, as described in section 4.

The negative values of W relate to a decrease in the posterior precision compared to the prior precision. As long as the total precision remains positive, this approximates the behavior of the true posterior rather well. The Student-t likelihood leads to a decrease in the variance from prior to posterior only if the prior mean and the observation are consistent with each other, as shown in Figure 2.
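The unimodality condition above is easy to check numerically. A small sketch (the values of ν, σ, and the prior variance τ^2 are illustrative choices, not from the paper):

```python
import numpy as np

def W(r, nu, sigma):
    # Hessian of the negative log Student-t likelihood, eq. (4), with r = y - f
    return -(nu + 1.0) * (r**2 - nu * sigma**2) / (r**2 + nu * sigma**2) ** 2

def is_unimodal(tau2, nu, sigma):
    # the single-observation posterior is unimodal when the prior precision
    # exceeds the largest negative curvature (nu + 1) / (8 nu sigma^2)
    return 1.0 / tau2 > (nu + 1.0) / (8.0 * nu * sigma**2)

nu, sigma = 4.0, 1.0
# W is most negative at |y - f| = sqrt(3 nu) sigma, where it equals
# -(nu + 1)/(8 nu sigma^2); verify against a dense grid of residuals
r_star = np.sqrt(3.0 * nu) * sigma
grid = np.linspace(0.0, 20.0, 2001)
assert W(r_star, nu, sigma) <= W(grid, nu, sigma).min() + 1e-12

print(is_unimodal(tau2=1.0, nu=nu, sigma=sigma))    # tight prior -> True
print(is_unimodal(tau2=100.0, nu=nu, sigma=sigma))  # wide prior  -> False
```

A wide prior combined with a moderate conflict is exactly the case where the mode-finding and Cholesky machinery of section 4 has to be careful.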
This precision-decreasing behavior is not captured by the factorized VB approximation [7], where W in q(f |θ, y) is replaced with a strictly positive diagonal that always increases the precision, as illustrated in Figure 2.

4 On the implementation

4.1 Posterior mode of the latent variables

The mode of the latent variables, f̂, can be found with general optimization methods such as scaled conjugate gradients. The most robust and efficient method, however, proved to be the expectation-maximization (EM) algorithm that utilizes the scale mixture representation of the Student-t distribution,

y_i | f_i ~ N(f_i, V_i)   (10)
V_i ~ Inv-χ^2(ν, σ^2),   (11)

where each observation has its own noise variance V_i that is Inv-χ^2 distributed. Following Gelman et al. [13], p. 456, the E-step of the algorithm consists of evaluating the expectation

E[1/V_i | y_i, f_i^old, ν, σ] = (ν + 1) / (νσ^2 + (y_i − f_i^old)^2),   (12)

after which the latent variables are updated in the M-step as

f̂^new = (K_f,f^-1 + V^-1)^-1 V^-1 y,   (13)

where V^-1 is a diagonal matrix of the expectations in (12). In practice, we do not invert K_f,f and, thus, f̂ is updated using the Woodbury-Sherman-Morrison [e.g. 16] lemma

f̂^new = (K_f,f − K_f,f V^-1/2 B^-1 V^-1/2 K_f,f) V^-1 y,   (14)

where B = I + V^-1/2 K_f,f V^-1/2.
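The E-step (12) and M-step (14) can be sketched on a toy one-dimensional problem as follows; the data, hyperparameter values, and fixed iteration count are our own illustrative choices, not the authors' implementation:

```python
import numpy as np

# EM search for the posterior mode f-hat, eqs. (12)-(14), on synthetic data.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
y[5] += 4.0                              # inject one outlier

nu, sigma2 = 4.0, 0.1**2                 # Student-t hyperparameters
l, s2 = 0.2, 1.0                         # squared-exponential prior
K = s2 * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / l**2)
K += 1e-8 * np.eye(len(x))               # jitter for numerical stability

f = np.zeros_like(y)
for _ in range(200):
    # E-step, eq. (12): w_i = E[1/V_i], expected inverse noise variances
    w = (nu + 1.0) / (nu * sigma2 + (y - f) ** 2)
    # M-step, eq. (14): f = (K - K W^{1/2} B^{-1} W^{1/2} K) W y,
    # with B = I + W^{1/2} K W^{1/2} (here W^{1/2} = diag(sqrt(w)))
    sw = np.sqrt(w)
    B = np.eye(len(x)) + sw[:, None] * K * sw[None, :]
    f = (K - K * sw[None, :] @ np.linalg.solve(B, sw[:, None] * K)) @ (w * y)

# the mode largely ignores the outlier at index 5
print(abs(f[5] - np.sin(2 * np.pi * x[5])))
```

Note that each E-step shrinks the weight of a point with a large residual, so the outlier is progressively down-weighted exactly as section 2 suggests.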
This is numerically more stable than directly inverting the covariance matrix, and gives as an intermediate result the vector a = K_f,f^-1 f̂ (so that f̂ = K_f,f a) for later use.

4.2 Approximate marginal likelihood

Rasmussen and Williams [12] discuss a numerically stable formulation to evaluate the approximate marginal likelihood and its gradients with a classification model. Their approach relies on W being non-negative, for which reason it requires some modification for our setting. With the Student-t likelihood, we found that the most stable formulation for (6) is

log q(y |θ) = log p(y | f̂) − (1/2) f̂^T a − Σ_{i=1}^n log R_ii + Σ_{i=1}^n log L_ii,   (15)

where R and L are the Cholesky decompositions of K_f,f and Σ = (K_f,f^-1 + W)^-1, and a is obtained from the EM algorithm. The only problematic term is the last one, which is numerically unstable if evaluated directly. We could first evaluate Σ = K_f,f − K_f,f (W^-1 + K_f,f)^-1 K_f,f, but this is in many cases even worse than the direct evaluation, since W^-1 might have arbitrarily large negative values. For this reason, we evaluate LL^T = Σ using rank-one Cholesky updates in a specific order. After L is found, it can also be used in the predictive variance (8) and in the gradients of (6) with only minor modifications to the equations given in [12]. We first write the posterior covariance as

Σ = (K_f,f^-1 + W)^-1 = (K_f,f^-1 + e_1 e_1^T W_11 + e_2 e_2^T W_22 + ... + e_n e_n^T W_nn)^-1,   (16)

where e_i is the ith unit vector. The terms e_i e_i^T W_ii are added iteratively, and the Cholesky decomposition of Σ is updated accordingly.
At the beginning L = chol(K_f,f), and at iteration step i+1 we use the rank-one Cholesky update to find

L^(i+1) = chol( L^(i) (L^(i))^T − s_i s_i^T δ_i ),   (17)

where s_i is the ith column of Σ^(i) and δ_i = W_ii (Σ_ii^(i))^-1 / ((Σ_ii^(i))^-1 + W_ii). If W_ii is positive we conduct a Cholesky downdate, and if W_ii < 0 and (Σ_ii^(i))^-1 + W_ii > 0 we have a Cholesky update, which increases the covariance. The increase may be arbitrarily large if (Σ_ii^(i))^-1 ≈ −W_ii, but in practice it can be limited. Problems also arise if W_ii < 0 and (Σ_ii^(i))^-1 + W_ii ≤ 0, since then the resulting Cholesky downdate is not positive definite. This should not happen if f̂ is at a local maximum, but in practice f̂ may be at a saddle point, or this may happen because of numerical instability or the iterative framework used to update the Cholesky decomposition. The problem is prevented by adding the diagonals in decreasing order, that is, first the "normal" observations and last the outliers.

A single Cholesky update is analogous to the discussion in section 3.4 in that the posterior covariance is updated using the result of the previous iteration as a prior. If we added the negative W values at the beginning, Σ_ii (the prior variance) could be so large that either (Σ_ii^(i))^-1 + W_ii ≤ 0 or (Σ_ii^(i))^-1 ≈ −W_ii, in which case the posterior covariance Σ_ii^(i+1) could become singular or arbitrarily large and lead to problems in the later iterations (compare to the dashed black line in Figure 2(a)). By adding the largest W values first, we reduce Σ so that the negative values of W are less problematic (compare to the dashed black line in Figure 2(b)), and the updates are numerically more stable. During the Cholesky updates, we cross-check with the condition (Σ_ii^(i))^-1 + W_ii ≥ 0 that everything is fine.
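The update sequence can be sketched as follows. This is a simplified illustration: it applies the Sherman-Morrison identity to Σ itself and re-factorizes at every step to verify positive definiteness, instead of performing true rank-one Cholesky up/downdates, and the matrix K and the W values are synthetic:

```python
import numpy as np

# Iteratively build Sigma = (K^-1 + W)^-1 as in eqs. (16)-(17), adding one
# diagonal element of W at a time, largest first, outliers (W_ii < 0) last.
rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
K = A @ A.T + np.eye(n)                    # SPD stand-in for K_f,f

W = np.array([2.0, 1.5, 0.8, 0.3, 0.0, 0.0])
# two "outliers" with negative curvature, scaled so K^-1 + W stays PD
W[4] = W[5] = -0.5 / np.linalg.eigvalsh(K).max()

Sigma = K.copy()                           # Sigma^(1) = K_f,f
for i in np.argsort(-W):                   # decreasing order, as in the text
    s = Sigma[:, i].copy()                 # s_i: ith column of Sigma^(i)
    # delta_i = W_ii (S_ii^-1) / (S_ii^-1 + W_ii) = W_ii / (1 + W_ii S_ii)
    delta = W[i] / (1.0 + W[i] * Sigma[i, i])
    Sigma -= delta * np.outer(s, s)        # eq. (17) applied to Sigma
    np.linalg.cholesky(Sigma)              # every intermediate stays PD

direct = np.linalg.inv(np.linalg.inv(K) + np.diag(W))
assert np.allclose(Sigma, direct)          # agrees with the direct formula
```

Adding the positive W_ii first shrinks the relevant variances before the negative terms arrive, which is what keeps each intermediate factorization positive definite here.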
If the condition is not fulfilled, our code prints a warning and replaces W_ii with −1/(2Σ_ii^(i)). This ensures that the Cholesky update remains positive definite and instead doubles the marginal variance. However, in practice we never encountered any warnings in our experiments if the hyperparameters were initialized sensibly, so that the prior was tight compared to the likelihood.

5 Relation to other work

Neal [6] implemented the Student-t model for the Gaussian process via Markov chain Monte Carlo utilizing the scale mixture representation. However, the approaches most similar to the Laplace approximation are the VB approximation [7, 11] and the one in INLA [15]. Here we shortly summarize them.

The difference between INLA and the GP framework is that INLA utilizes Gaussian Markov random fields (GMRF) in place of the Gaussian process. The Gaussian approximation for p(f | y, θ) in INLA is the same as the Laplace approximation here, with the covariance function replaced by a precision matrix. Rue et al. [15] derive the approximation for the log marginal posterior, log p(θ| y), from

p(θ| y) ≈ q(θ| y) ∝ p(y, f, θ)/q(f |θ, y) = p(y | f) p(f |θ) p(θ) / q(f |θ, y) |_{f = f̂}.   (18)

The proportionality sign is due to the fact that the normalization constant of p(f, θ| y) is unknown. This is exactly the same as the approximation derived in section 3.2.
Taking the logarithm of (18), we end up with log q(θ| y) ∝ log q(y |θ) + log p(θ), where log q(y |θ) is given in (6).

In the variational approximation [7], the joint posterior of the latent variables and the scale parameters in the scale mixture representation (10)-(11) is approximated with a factorizing distribution p(f, V| y, θ) ≈ q(f)q(V), where q(f) = N(f |m, A) and q(V) = Π_{i=1}^n Inv-χ^2(V_ii | ν̃/2, σ̃^2/2), and θ̃ = {m, A, ν̃, σ̃^2} are the parameters of the variational approximation. The approximate distributions and the hyperparameters are updated in turns, so that θ̃ is updated with the current estimate of θ, and after that θ is updated with θ̃ fixed.

The variational approximation for the conditional posterior is p(f | y, θ̂, V̂) ≈ N(f |m, A). Here A = (K_f,f^-1 + V̂^-1)^-1, and the iterative search for the posterior parameters m and A is the same as the EM algorithm described in section 4, except that the update of E[V_i^-1] in (12) is replaced with E[V_i^-1] = (ν + 1)/(σ^2 + A_ii^old + (y_i − m_i^old)^2). Thus, the Laplace and the variational approximation are very similar. In practice, the posterior mode m is very close to the mode f̂, and the main difference between the approximations is in the covariance and the hyperparameter estimates.

In the variational approximation, θ̂ is found by maximizing the variational lower bound

V = E_{q(f,V| y,θ)}[ log( p(y, f, V, θ) / q(f, V| y, θ) ) ] = E_{q(f,V| y,θ)}[ log( p(y | f, V) p(f |θ) p(V|θ) p(θ) / (q(f | y, θ) q(V| y, θ)) ) ],   (19)

where we have made visible the implicit dependence of the approximations q(f) and q(V) on the data and hyperparameters, and included a prior for θ.
The variational lower bound (19) is similar to the approximate log marginal posterior (18). Only the point estimate f̂ is replaced with averaging over the approximating distribution q(f, V| y, θ). The other difference is that in the Laplace approximation the scale parameters V are marginalized out, and it approximates p(f | y, θ) directly.

Table 1: The RMSE and NLP statistics on the experiments.

                       RMSE                                    NLP
          Neal   Friedman  Housing  Concrete     Neal    Friedman  Housing  Concrete
G         0.393  0.324     0.324    0.230        0.254   0.0642    0.227    1.249
T-lapl    0.028  0.220     0.289    0.231       -2.181  -0.116    -0.16     0.080
T-vb      0.029  0.220     0.294    0.212       -2.228  -0.132    -0.049    0.091
T-mcmc    0.055  0.253     0.287    0.197       -1.907  -0.241    -0.106    0.029

6 Experiments

We studied four data sets: 1) Neal data [6] with 100 data points and one input, shown in Figure 1. 2) Friedman data with a nonlinear function of 10 inputs, from which we generated 10 data sets with 100 training points including 10 randomly selected outliers, as described by Kuss [7], p. 83. 3) The Boston housing data, which summarize median house prices in the Boston metropolitan area for 506 data points and 13 input variables [7]. 4) Concrete data, which summarize the quality of concrete casting as a function of 27 variables for 215 measurements [17]. In earlier experiments, the Student-t model has worked better than the Gaussian observation model on all of these data sets.

The predictive performance is measured with a root mean squared error (RMSE) and a negative log predictive density (NLP).
With simulated data these are evaluated for a test set of 1000 latent variables. With real data we use 10-fold cross-validation. The compared observation models are Gaussian (G) and Student-t (T). The Student-t model is inferred using the Laplace approximation (lapl), VB (vb) [7], and full MCMC (mcmc) [6]. The Gaussian observation model, the Laplace approximation, and VB are evaluated at θ̂, and in MCMC we sample θ. INLA is excluded from the experiments since a GMRF model cannot be constructed naturally for these non-regularly distributed data sets. The results are summarized in Table 1. The significance of the differences in performance is approximated using a Gaussian approximation for the distribution of the NLP and RMSE statistics [17]. The Student-t model is significantly better than the Gaussian with higher than 95% probability in all tests except the RMSE on the concrete data. There is no significant difference between the Laplace approximation, VB, and MCMC.

The inference time was shortest with the Gaussian observation model and longest with the Student-t model utilizing full MCMC. The Laplace approximation for the Student-t likelihood took on average 50% more time than the Gaussian model, and VB was on average 8-10 times slower than the Laplace approximation. The reason for this is that in VB two sets of parameters, θ and θ̃, are updated in turns, which slows down the convergence of the hyperparameters. In the Laplace approximation we have to optimize only θ. Figure 3 shows the mean and the variance of p(f |θ̂, y) for MCMC versus the Laplace approximation and VB. The means of the Laplace approximation and VB match the mean of the MCMC solution equally well, but VB underestimates the variance more than the Laplace approximation (see also Figure 2).
In the housing data, both approximations underestimate the variance remarkably for a few data points (40 of 506) that were located in clusters at places where the inputs x are truncated along one or more dimensions. At these locations the marginal posteriors were slightly skewed and their tails were rather heavy, and thus a Gaussian approximation presumably underestimates the variance.

The degrees of freedom of the Student-t likelihood were optimized only for the Neal data and the Boston housing data, using the Laplace approximation. In the other data sets there was not enough information to infer ν, and it was set to 4. Optimizing ν was more problematic for VB than for the Laplace approximation, probably because the factorized approximation makes it harder to identify ν. The MAP estimates θ̂ found by the Laplace approximation and VB were slightly different. This is reasonable, since the optimized functions (18) and (19) are also different.

Figure 3: Scatter plots of the posterior mean and variance of the latent variables for (a) Neal data, (b) Friedman data, (c) Boston housing data, and (d) Concrete data. The upper row shows the means and the lower row the variances. In each figure, the left plot is MCMC (x-axis) vs. the Laplace approximation (y-axis), and the right plot is MCMC (x-axis) vs. VB (y-axis).

7 Discussion

In our experiments we found that the predictive performance of both the Laplace approximation and the factorized VB is similar to that of the full MCMC. Compared to MCMC, the Laplace approximation and VB estimate the posterior mean E[f |θ̂, y] similarly, but VB underestimates the posterior variance Var[f |θ̂, y] more than the Laplace approximation. Optimizing the hyperparameters is clearly faster with the Laplace approximation than with VB.

Both the Laplace and the VB approximation estimate the posterior precision as a sum of a prior precision and a diagonal matrix.
In VB the diagonal is strictly positive, whereas in the Laplace approximation the diagonal elements corresponding to outlying observations are negative. The Laplace approximation is closer to reality in this respect, since the outlying observations have a negative effect on the (true) posterior precision. This happens because VB minimizes KL(q(f)q(V) || p(f, V| y)), which requires that q(f, V) be close to zero whenever p(f, V| y) is (see for example [18]). Since a posteriori f and V are correlated, the marginal q(f) underestimates the effect of marginalizing over the scale parameters. The Laplace approximation, on the other hand, tries to estimate the posterior p(f) of the latent variables directly. Recently, Opper and Archambeau [19] discussed the relation between the Laplace approximation and VB, and proposed a variational approximation directly for the latent variables, which they tried with a Cauchy likelihood (they did not perform extensive experiments, though). Presumably their implementation would give a better estimate of p(f) than the factorized approximation. However, experiments in that respect are left for future work.

The advantage of VB is that the objective function (19) is a rigorous lower bound for p(y |θ), whereas the Laplace approximation (18) is not.
However, the marginal posteriors p(f | y, θ) in our experiments (inferred with MCMC) were so close to Gaussian that the Laplace approximation q(f |θ, y) should be very accurate and, thus, the approximation for p(θ| y) in (18) should also be close to the truth (see also the justifications in [15]).

In recent years the expectation propagation (EP) algorithm [20] has been demonstrated to be a very accurate and efficient method for approximate inference in many models with factorizing likelihoods. However, the Student-t likelihood is problematic for EP since it is not log-concave, for which reason EP's estimate of the posterior covariance may become singular during the site updates [21]. The reason for this is that the variance parameters of the site approximations may become negative. As demonstrated with the Laplace approximation here, this reflects the behavior of the true posterior. We assume that the problem can be overcome, but we are not aware of any work that has solved it.

Acknowledgments

This research was funded by the Academy of Finland, and the Graduate School in Electronics, Telecommunications and Automation (GETA). The first and second authors also thank the Finnish Foundation for Economic and Technology Sciences - KAUTE, the Finnish Cultural Foundation, the Emil Aaltonen Foundation, and the Finnish Foundation for Technology Promotion for supporting their postgraduate studies.

References

[1] Bruno De Finetti. The Bayesian approach to the rejection of outliers. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 199-210. University of California Press, 1961.

[2] A. Philip Dawid. Posterior expectations for large observations. Biometrika, 60(3):664-667, December 1973.

[3] Anthony O'Hagan. On outlier rejection phenomena in Bayes inference. Journal of the Royal Statistical Society, Series B, 41(3):358-367, 1979.

[4] Mike West.
Outlier models and prior distributions in Bayesian linear regression. Journal of the Royal Statistical Society, Series B, 46(3):431-439, 1984.

[5] John Geweke. Bayesian treatment of the independent Student-t linear model. Journal of Applied Econometrics, 8:519-540, 1993.

[6] Radford M. Neal. Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification. Technical Report 9702, Dept. of Statistics and Dept. of Computer Science, University of Toronto, January 1997.

[7] Malte Kuss. Gaussian Process Models for Robust Regression, Classification, and Reinforcement Learning. PhD thesis, Technische Universität Darmstadt, 2006.

[8] Paul W. Goldberg, Christopher K. I. Williams, and Christopher M. Bishop. Regression with input-dependent noise: A Gaussian process treatment. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10. MIT Press, Cambridge, MA, 1998.

[9] Andrew Naish-Guzman and Sean Holden. Robust regression with twinned Gaussian processes. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1065-1072. MIT Press, Cambridge, MA, 2008.

[10] Oliver Stegle, Sebastian V. Fallert, David J. C. MacKay, and Søren Brage. Gaussian process robust regression for noisy heart rate data. IEEE Transactions on Biomedical Engineering, 55(9):2143-2151, September 2008.

[11] Michael E. Tipping and Neil D. Lawrence. Variational inference for Student-t models: Robust Bayesian interpolation and generalised component analysis. Neurocomputing, 69:123-141, 2005.

[12] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[13] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin.
Bayesian Data Analysis. Chapman & Hall/CRC, second edition, 2004.

[14] Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342-1351, 1998.

[15] Håvard Rue, Sara Martino, and Nicolas Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society, Series B, 71(2):1-35, 2009.

[16] David A. Harville. Matrix Algebra From a Statistician's Perspective. Springer-Verlag, 1997.

[17] Aki Vehtari and Jouko Lampinen. Bayesian model assessment and comparison using cross-validation predictive densities. Neural Computation, 14(10):2439-2468, 2002.

[18] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer Science+Business Media, LLC, 2006.

[19] Manfred Opper and Cédric Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786-792, March 2009.

[20] Thomas Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.

[21] Matthias Seeger. Bayesian inference and optimal design for the sparse linear model. Journal of Machine Learning Research, 9:759-813, 2008.
", "award": [], "sourceid": 224, "authors": [{"given_name": "Jarno", "family_name": "Vanhatalo", "institution": null}, {"given_name": "Pasi", "family_name": "Jylänki", "institution": null}, {"given_name": "Aki", "family_name": "Vehtari", "institution": null}]}