{"title": "Robust Regression with Twinned Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1065, "page_last": 1072, "abstract": "We propose a Gaussian process (GP) framework for robust inference in which a GP prior on the mixing weights of a two-component noise model augments the standard process over latent function values. This approach is a generalization of the mixture likelihood used in traditional robust GP regression, and a specialization of the GP mixture models suggested by Tresp (2000) and Rasmussen and Ghahramani (2002). The value of this restriction is in its tractable expectation propagation updates, which allow for faster inference and model selection, and better convergence than the standard mixture. An additional benefit over the latter method lies in our ability to incorporate knowledge of the noise domain to influence predictions, and to recover with the predictive distribution information about the outlier distribution via the gating process. The model has asymptotic complexity equal to that of conventional robust methods, but yields more confident predictions on benchmark problems than classical heavy-tailed models and exhibits improved stability for data with clustered corruptions, for which they fail altogether. We show further how our approach can be used without adjustment for more smoothly heteroscedastic data, and suggest how it could be extended to more general noise models. We also address similarities with the work of Goldberg et al. (1998), and the more recent contributions of Tresp, and Rasmussen and Ghahramani.", "full_text": "Robust Regression with Twinned Gaussian Processes\n\nAndrew Naish-Guzman & Sean Holden\nComputer Laboratory\nUniversity of Cambridge\nCambridge, CB3 0FD, United Kingdom\n{agpn2,sbh11}@cl.cam.ac.uk\n\nAbstract\n\nWe propose a Gaussian process (GP) framework for robust inference in which a GP prior on the mixing weights of a two-component noise model augments the standard process over latent function values. This approach is a generalization of the mixture likelihood used in traditional robust GP regression, and a specialization of the GP mixture models suggested by Tresp [1] and Rasmussen and Ghahramani [2]. The value of this restriction is in its tractable expectation propagation updates, which allow for faster inference and model selection, and better convergence than the standard mixture. An additional benefit over the latter method lies in our ability to incorporate knowledge of the noise domain to influence predictions, and to recover with the predictive distribution information about the outlier distribution via the gating process. The model has asymptotic complexity equal to that of conventional robust methods, but yields more confident predictions on benchmark problems than classical heavy-tailed models and exhibits improved stability for data with clustered corruptions, for which they fail altogether. We show further how our approach can be used without adjustment for more smoothly heteroscedastic data, and suggest how it could be extended to more general noise models. We also address similarities with the work of Goldberg et al. [3].\n\n1 Introduction\n\nRegression data are often modelled as noisy observations of an underlying process. The simplest assumption is that all noise is independent and identically distributed (i.i.d.) zero-mean Gaussian, such that a typical set of samples appears as a cloud around the latent function. The Bayesian framework of Gaussian processes [4] is well-suited to these conditions, for which all computations remain tractable (see figure 1a). 
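As a point of reference for the tractable computations just mentioned, the closed-form GP posterior can be sketched in a few lines of numpy. This is an illustrative sketch, not code from the paper; the squared-exponential covariance, unit signal variance, and the noise level are all assumed choices:

```python
import numpy as np

def rbf(a, b, ell=1.0, sf=1.0):
    # Squared-exponential covariance k(x, x') = sf^2 exp(-(x - x')^2 / (2 ell^2)).
    d = a[:, None] - b[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(X, y, Xs, noise=0.1):
    # Exact GP regression under i.i.d. Gaussian noise: posterior mean and
    # pointwise variance at test inputs Xs, all in closed form.
    K = rbf(X, X) + noise**2 * np.eye(len(X))
    Ks = rbf(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, Ks)
    mean = Ks.T @ alpha
    var = np.diag(rbf(Xs, Xs)) - np.sum(v * v, axis=0)
    return mean, var
```

Far from the data, the posterior reverts to the prior (zero mean, unit variance here), while near the data it interpolates; the next paragraphs explain why this breaks down once outliers appear.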
Furthermore, the Gaussian noise model enjoys the theoretical justification of the central limit theorem, which states that the sum of sufficiently many i.i.d. random variables of finite variance will be distributed normally. However, only rarely can perturbations affecting data in the real world be argued to have originated in the addition of many i.i.d. sources. The random component in the signal may be caused by human or measurement error, or it may be the manifestation of systematic variation invisible to a simplified model. In any case, if ever there is the possibility of encountering small quantities of highly implausible data, we require robustness, i.e. a model whose predictions are not greatly affected by outliers.\n\nSuch demands render the standard GP inappropriate: the light tails of the Gaussian distribution cannot explain large non-Gaussian deviations, which either skew the mean interpolant away from the majority of the data, or force us to infer an unreasonably large (global) noise variance (see figure 1b). Robust methods use a heavy-tailed likelihood to allow the interpolant effectively to favour smoothness and ignore such erroneous data. Figure 1c shows how this can be achieved using a two-component noise model\n\np(y_n | f_n) = (1 − ε) N(y_n; f_n, σ_R²) + ε N(y_n; f_n, σ_O²),   (1)\n\nin which observations y_n are Gaussian corruptions of f_n, being drawn with probability ε from a large-variance outlier distribution (σ_O² ≫ σ_R²). Inference in this model is tractable, but impractical for all but the smallest problems due to the exponential explosion of terms in products of (1).\n\nFigure 1: Black dots show noisy samples from the sinc function. In panels (a) and (b), the behaviour of a GP with a Gaussian noise assumption is illustrated; the shaded region shows 95% confidence intervals. The presence of a single outlier is highly influential in this model, but the heavy-tailed likelihood (1) in panel (c) is more resilient. Unfortunately, even this model fails for the cluster of outliers in panel (d). Here, grey lines show ten repeated runs of the EP inference algorithm, while the black line and shaded region are their averaged mean and confidence intervals respectively, grossly at odds with those of the latent generative model.\n\nIn this paper, we address the more fundamental GP assumption of i.i.d. noise. Our research is motivated by observing how the predictive distribution suffers for heavy-tailed models when outliers appear in bursts: figure 1d replicates figure 1c, but introduces an additional three outliers. All parameters were taken from the optimal solution to (c), but even without the challenge of hyperparameter optimization there is now considerable uncertainty in the posterior since the competing interpretations of the cluster as signal or noise have similar posterior mass. Viewed another way, the tails of the effective log likelihood of four clustered observations have approximately one-quarter the weight of a single outlier, so the magnitude of the posterior peak associated with the robust solution is comparably reduced. One simple remedy is to make the tails of the likelihood heavier. However, since the noise model is global, this has ramifications across the entire data space, potentially causing underfitting elsewhere when real data are relegated to the tails. We can establish an optimal choice
We can establish an optimal choice\nfor the parameters by gradient ascent on the marginal likelihood, but it is entirely possible that no\nsingle setting will be universally satisfactory.\n\nThe model introduced in this paper, which we call the twinned Gaussian process (TGP), generalizes\nthe noise model (1) by using a GP gating function to choose between the \u201creal\u201d and \u201coutlier dis-\ntributions\u201d: in regions of con\ufb01dence, the tails can be made very light, encouraging the interpolant\nto hug the data points tightly; more dubious observations can be treated appropriately by broaden-\ning the noise distribution in their vicinity. Our model is also a specialization of the GP mixtures\nproposed by Tresp [1] and Rasmussen and Ghahramani [2]; indeed, the latter automatically infers\nthe correct number of components to use. One may therefore wonder what can possibly be gained\nby restricting ourselves to a comparatively simple architecture. The answer is in the computational\noverhead required for the different approaches, since these more general models require inference\nby Monte Carlo methods. We argue that the two-component mixture is often a sensible distribution\nfor modelling real data, with a natural interpretation and the heavy tails required for robustness;\nits weaknesses are exposed primarily when the noise distribution is not homoscedastic. The TGP\nlargely solves this problem, and allows inference by an ef\ufb01cient expectation propagation (EP) [5]\nprocedure (rather than resorting to more heavy duty Monte Carlo methods). Hence, provided a two-\ncomponent mixture is likely to re\ufb02ect adequately the noise on our data, the TGP will give similar\nresults to the generalized mixtures mentioned above, but at a fraction of the cost.\n\nGoldberg et al. 
[3] suggest an approach to input-dependent noise in the spirit of the TGP, in which the log variance on observations is itself modelled as a GP (the logarithm since noise variance is a non-negative property). Inference is again analytically intractable, so Gibbs sampling is used to generate noise vectors from the posterior distribution by alternately fitting the signal process and fitting the noise process. A further stage of Gibbs sampling is required at each test point to estimate the predictive variance, making testing rather slow. Model selection is even slower, and the Metropolis-Hastings algorithm is suggested for updating hyperparameters.\n\n2 Twinned Gaussian processes\n\nGiven a domain X and covariance function K(·, ·) : X × X → R, a Gaussian process (GP) over the space of real-valued functions of X specifies the joint distribution at any finite set X ⊂ X:\n\np(f | X) = N(f; 0, K_f),\n\nwhere the f = {f_n}, n = 1, …, N, are (latent) values associated with each x_n ∈ X, and K_f is the Gram matrix, the evaluation of the covariance function at all pairs (x_i, x_j). We apply Bayes' rule to obtain the posterior distribution over the f, given the observed X and y, which with the assumption of i.i.d. Gaussian corrupted observations is also normally distributed. Predictions at X⋆ are made by marginalizing over f in the (Gaussian) joint p(f, f⋆ | X, y, X⋆). See [6] for a thorough introduction.\n\nRobust GP regression is achieved by using a leptokurtic likelihood distribution, i.e. one whose tails have more mass than the Gaussian. Common choices are the Laplace (or double exponential) distribution, Student's t distribution, and the mixture model (1). In product with the prior, a heavy-tailed likelihood over an outlying observation does not exert the strong pull on the posterior witnessed with a light-tailed noise model. 
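The effect of the heavy tails on the pull exerted by an outlier can be made concrete by comparing log-likelihood penalties under the Gaussian and under the mixture (1). A minimal sketch, not from the paper; the values ε = 0.05, σ_R = 0.1, σ_O = 3 are illustrative assumptions:

```python
import numpy as np

def gauss_logpdf(r, s):
    # log N(r; 0, s^2) for a residual r under noise standard deviation s.
    return -0.5 * np.log(2.0 * np.pi * s**2) - 0.5 * (r / s) ** 2

def mixture_logpdf(r, eps=0.05, s_r=0.1, s_o=3.0):
    # Two-component noise model in the spirit of eq. (1): with probability
    # 1-eps the tight "real" component, with probability eps the broad
    # "outlier" component; logaddexp keeps the computation stable.
    return np.logaddexp(np.log(1.0 - eps) + gauss_logpdf(r, s_r),
                        np.log(eps) + gauss_logpdf(r, s_o))
```

A residual of 5 costs the light-tailed model thousands of nats, dragging the interpolant towards the outlier, while the mixture's penalty is bounded by the outlier component, so the point can be cheaply "explained away".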
Kuss [7] describes how inference can be performed for all these likelihoods, and establishes that in many cases their performance is broadly comparable. Since it bears closest resemblance to the twinned GP, we are particularly interested in the mixture; however, in section 4, we include results for the Laplace model: it is the heaviest-tailed log-concave distribution, which guarantees a unimodal posterior and allows more reliable EP convergence. In any case, all such methods make a global assumption about the noise distribution, and it is where this is inappropriate that our model is most beneficial.\n\nThe graphical model for the TGP is shown in figure 2b. We augment the standard process over f with another GP over a set of variables u; this acts as a gating function, probabilistically dividing the domain between the real and outlier components of the noise model\n\np(y_n | f_n) = σ(u_n) N(y_n; f_n, σ_R²) + σ(−u_n) N(y_n; f_n, σ_O²),   where   σ(u_n) = ∫_{−∞}^{u_n} N(z; 0, 1) dz.   (2)\n\nIn the TGP likelihood, we therefore mix two forms of Gaussian corruption, one strongly peaked at the observation, the other a broader distribution which provides the heavy tails, in proportion determined by u(x). This makes intuitive sense; crucially to us, it retains the advantage of tractability with respect to EP updates. The two priors may have quite different covariance structure, reflecting our different beliefs about correlations in the signal and in the noise domain. 
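The likelihood (2) is straightforward to evaluate pointwise. A sketch with assumed placeholder noise widths (not values from the paper), which also exposes the limiting behaviour: a large positive gating value selects the "real" component outright, a large negative one the "outlier" component, and u_n = 0 recovers the fixed mixture (1) with ε = 1/2:

```python
import numpy as np
from math import erf, sqrt, pi

def gate(u):
    # sigma(u): the standard normal CDF, used as the probit gating function.
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

def npdf(r, s):
    # N(r; 0, s^2) density.
    return np.exp(-0.5 * (r / s) ** 2) / (s * sqrt(2.0 * pi))

def tgp_lik(y, f, u, s_r=0.1, s_o=3.0):
    # p(y | f, u) = sigma(u) N(y; f, s_r^2) + sigma(-u) N(y; f, s_o^2):
    # the local gating value u decides which noise component applies.
    return gate(u) * npdf(y - f, s_r) + gate(-u) * npdf(y - f, s_o)
```

In regions where the gating process is confidently positive the effective noise is tight, so the interpolant hugs the data; where it is confidently negative the broad component absorbs large deviations.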
In addition, we accommodate prior beliefs about the prevalence of outliers with a non-zero mean process on u,\n\np(u | X) = N(u; m_u, K_u),   p(f | X) = N(f; 0, K_f).\n\nOur model can be understood as lying between two extremes: observe that we recover the heavy-tailed (mixture of Gaussians) GP by forcing absolute correlation in u and adjusting the mean of the u-process to m_u = σ⁻¹(1 − ε); conversely, if we remove all correlations in u, we return to a standard mixture model where independently we must decide to which component an input belongs.\n\n3 Inference\n\nWe begin with a very brief account of EP; for more details, see [5, 8]. Suppose we have an intractable distribution over f whose unnormalized form factorizes into a product of terms, such as a dense Gaussian prior t_0(f, u) and a series of independent likelihoods {t_n(y_n | f_n, u_n)}, n = 1, …, N. EP constructs the approximate posterior as a product of scaled site functions t̃_n. For computational tractability, these sites are usually chosen from an exponential family with natural parameters θ, since in this case their product retains the same functional form as its components. The Gaussian (µ, Σ) has a natural parameterization (b, Π) = (Σ⁻¹µ, −½Σ⁻¹). If the prior is of this form, its site function is exact:\n\np(f, u | y) = (1/Z) t_0(f, u) ∏_{n=1}^{N} t_n(y_n | f_n, u_n) ≈ q(f; θ) = t_0(f, u) ∏_{n=1}^{N} z_n t̃_n(f_n, u_n; θ_n),   (3)\n\nFigure 2: In panel (a) we show a graphical model for the Gaussian process. The data ordinates are x, observations y, and the GP is over the latent f. 
The bold black lines indicate a fully-connected set. Panel (b) shows a graphical model for the twinned Gaussian process (TGP), in which an auxiliary set of hidden variables u describes the noisiness of the data.\n\nwhere Z is the marginal likelihood and the z_n are the scale parameters. Ideally, we would choose θ at the global minimum of some divergence measure d(p‖q), but the necessary optimization is usually intractable. EP is an iterative procedure that finds a minimizer of KL(p(f, u | y) ‖ q(f, u; θ)) on a pointwise basis: at each iteration, we select a new site n, and from the product of the cavity distribution formed by the current marginal with the omission of that site, and the true likelihood term t_n, we obtain the so-called tilted distribution q_n(f_n, u_n; θ\\n). A simpler optimization min_{θ_n} KL(q_n(f_n, u_n; θ\\n) ‖ q(f_n, u_n; θ)) then fits only the parameters θ_n: this is equivalent to moment matching between the two distributions, with scale z_n chosen to match the zeroth-order moments. After each site update, the moments at the remaining sites are liable to change, and several iterations may be required before convergence.\n\nThe priors over u and f are independent, but we expect correlations in the posterior after conditioning on observations. To understand this, consider a single observation (x_n, y_n); in principle, it admits two explanations corresponding to its classification as either "outlier" or as "real" data: in general terms, either u_n > 0 and f_n ≈ y_n, or u_n < 0 and f_n respects the global structure of the signal. A diagram to assist the visualization of the behaviour of the posterior is provided in figure 3.\n\nNow, recall that the prior over u and f is\n\np([u; f] | X) = N([u; f]; [m_u; 0], [[K_u, 0], [0, K_f]]),\n\nand the likelihood factorizes into a product of terms (2); our site approximations t̃_n are therefore Gaussian in (f_n, u_n). Of importance for EP are the moments of the tilted distribution which we seek to match. These are most easily obtained by differentiation of the zeroth moments Z_R and Z_O of each component. We find\n\nZ_R = ∫∫ σ(u) N(y; f, σ_R²) N([u; f]; µ, Σ) du df = ∫_0^∞ N([z; y]; µ, [[1, 0], [0, σ_R²]] + Σ) dz;\n\nwriting the inner Gaussian as N([z; y]; [µ_u; µ_f], [[A, C], [C, B_R]]), we obtain\n\nZ_R = N(y; µ_f, B_R) σ(q),   where   q = (µ_u + (C/B_R)(y − µ_f)) / √(A − C²/B_R).\n\nThe integral for the outlier component is similar; Z_O = N(y; µ_f, B_O) σ(−q). With partial derivatives ∂ log Z/∂µ and ∂² log Z/∂µ ∂µᵀ we are equipped for EP; algorithmic details appear in Seeger's note [8]. 
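The closed form Z_R = N(y; µ_f, B_R) σ(q) can be checked numerically against brute-force quadrature over (u, f). A sketch, not from the paper; the cavity parameters (µ_u, µ_f) and the covariance entries below are arbitrary assumed values for illustration:

```python
import numpy as np
from math import erf, sqrt

def gate(u):
    # sigma: standard normal CDF.
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

def zr_closed(y, mu_u, mu_f, s_uu, s_uf, s_ff, s_r):
    # Closed-form zeroth moment: A = 1 + s_uu, B_R = s_r^2 + s_ff, C = s_uf.
    A, B, C = 1.0 + s_uu, s_r**2 + s_ff, s_uf
    n_y = np.exp(-0.5 * (y - mu_f) ** 2 / B) / sqrt(2.0 * np.pi * B)
    q = (mu_u + (C / B) * (y - mu_f)) / sqrt(A - C**2 / B)
    return n_y * gate(q)

def zr_numeric(y, mu_u, mu_f, s_uu, s_uf, s_ff, s_r, n=400, w=8.0):
    # Grid quadrature of sigma(u) N(y; f, s_r^2) N([u, f]; mu, Sigma).
    us = np.linspace(mu_u - w, mu_u + w, n)
    fs = np.linspace(mu_f - w, mu_f + w, n)
    U, F = np.meshgrid(us, fs, indexing="ij")
    det = s_uu * s_ff - s_uf**2
    du, df = U - mu_u, F - mu_f
    quad = (s_ff * du**2 - 2.0 * s_uf * du * df + s_uu * df**2) / det
    prior = np.exp(-0.5 * quad) / (2.0 * np.pi * sqrt(det))
    lik = np.exp(-0.5 * (y - F) ** 2 / s_r**2) / sqrt(2.0 * np.pi * s_r**2)
    g = np.vectorize(gate)(U)
    return float(np.sum(g * prior * lik) * (us[1] - us[0]) * (fs[1] - fs[0]))
```

The agreement of the two routines is a useful sanity check when implementing the EP moment calculations, since sign or indexing errors in q show up immediately.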
For efficiency, we make rank-two updates of the full approximate covariance on (f, u) during the EP loop, and refresh the posterior at the end of each cycle to avoid loss of precision.\n\nFigure 3: Using the twinned Gaussian process provides a natural resilience against clustered noisy data. The left-hand column illustrates the behaviour of a fixed heavy-tailed likelihood for one, two, four and five repeated observations at f = 5. (Outliers in real data are not necessarily so tightly packed, but the symmetry of this approximation allows us to treat them as a single unit: by "posterior", for example, we mean the a posteriori belief in all the observations' (identical) latent f.) The context is provided by the prior, which gives 95% confidence to data around f = 0 ± 2. 
The top-left box illustrates how the influence of isolated outliers is mitigated by the standard mixture. However, a repeated observation (box two on the left) causes the EP solution to collapse onto the spike at the data (the log scale is deceptive: the second peak contributes only about 8% of the posterior mass). The twinned GP better preserves the marginal distribution of f by maintaining a joint distribution over both f and u: in the second and third columns respectively are contours of the true log joint (we use a broad zero-mean prior on u) and that inferred by EP, together with the marginal posterior over f. Only with a fifth observation (final box) is the context of f essentially overruled by the TGP approximation. The thick bar in the central column marks the cross-section corresponding to the unnormalized posterior from column one.\n\n3.1 Predictions\n\nIf the outlier component describes nuisance noise that should be eliminated, we require at test inputs x⋆ only the marginal distribution p(f⋆ | x⋆, X, y), obtained by marginalizing over u in the full (approximate) posterior N([u; f]; [µ̂_u; µ̂_f], [[Σ̂_uu, Σ̂_uf], [Σ̂_fu, Σ̂_ff]]):\n\np(f⋆ | x⋆, X, y) = ∫ p(f⋆ | x⋆, f) p(f | X, y) df ≈ N(f⋆; k_f⋆ᵀ K_f⁻¹ µ̂_f, k⋆⋆ − k_f⋆ᵀ K_f⁻¹ k_f⋆ + k_f⋆ᵀ K_f⁻¹ Σ̂_ff K_f⁻¹ k_f⋆).\n\nThe noise process may itself be of interest, in which case we need to marginalize over both u⋆ and f⋆ in\n\np(y⋆ | x⋆, X, y) = ∫∫ p(y⋆ | x⋆, [u; f]) p([u; f] | X, y) du df ≈ ∫∫∫∫ p(y⋆ | x⋆, [u⋆; f⋆]) p([u⋆; f⋆] | [u; f]) N([u; f]; µ̂, Σ̂) du⋆ df⋆ du df.\n\nThis distribution is no longer Gaussian, but its moments may be recovered easily by the same method used to obtain moments of the tilted distribution.\n\nEP provides in addition to the approximate moments of the posterior distribution an estimate of the marginal likelihood and its derivatives with respect to kernel hyperparameters. Again, we refer the interested reader to the algorithm presented in [8], adding here only that our implementation uses log noise values on (σ_R², σ_O²) to allow for their unconstrained optimization.\n\n3.2 Complexity\n\nThe EP loop is dominated by the rank-two updates of the covariance. Each such update is O((2N)²), making every N iterations O(4N³). The posterior refresh is O(8N³) since it requires the inverse of a 2N × 2N positive semi-definite matrix, most efficiently achieved through Cholesky factorization (this Cholesky factor can be retained for use in calculating the approximate log marginal likelihood). The total number of loops required for convergence of EP is typically independent of N, and can be upper-bounded by a small constant, say 10, making the entire inference process O(8N³) = O(N³). Thus, our algorithm has the same limiting time complexity as i.i.d. robust regression by EP, which admittedly masks the larger coefficient that appears in approximating both u and f simultaneously. Additionally, the body of the EP loop is slightly slower, since the precision matrix in a standard GP can be obtained with a single division, whereas our model requires the inversion of a 2 × 2 matrix.\n\n4 Experiments\n\nWe identify two general noise characteristics for which our model may be suitable. 
The first is when the outlying observations can appear in clusters: we saw in figure 1d how these occurrences affect the standard mixture model. In fact the problem is quite severe, since the multimodality of the posterior impedes the convergence of EP, while the possibility of conflicting gradient information at the optima hampers procedures for evidence maximization. In figure 4 we illustrate how the TGP succeeds where the mixture and Laplace models fail; note how the mean process on u falls sharply in the contaminated regions. This is a stable solution, and hyperparameters can be fit reliably.\n\nFigure 4: Behaviour of (a) the mixture noise model, (b) the Laplace noise model, and (c) the TGP. The corruptions are i.i.d. around x = −10, and highly correlated near x = 0.\n\nA data set which exhibits the superior predictive modelling of the TGP in a domain where robust methods can also expect to perform well is provided by Kuss [7] in a variation on a data set of Friedman [9]. The samples are drawn from a function of ten-dimensional vectors x which depends only on the first five components:\n\nf(x) = 10 sin(πx₁x₂) + 20(x₃ − 0.5)² + 10x₄ + 5x₅.\n\nWe generated ten sets of 90 training examples and 10000 test examples by sampling x uniformly in [0, 1]¹⁰, and adding to the training data noise N(0, 1). In our first experiment, we replicated the procedure of [7]: ten training points were added at random with outputs sampled from N(15, 9) (a value likely to lie in the same range as f). The results appear as Friedman (1) in figure 5. Observe that the r.m.s. error for the robust methods is similar, but the TGP is able to fit the variance far more accurately. In a second experiment, the training set was augmented with two Gaussian clusters each of five noisy observations. The cluster centres were drawn uniformly in [0, 1]¹⁰, with variance fixed at 10⁻³. 
Output values were then drawn from N(0, 1) for all ten points, to give highly correlated values distant from the underlying function (Friedman (2)). Now the TGP excels where the other methods offer no improvement on the standard GP; it also yields very confident predictions (cf. Friedman (1)), because once the outliers have been accounted for there are fewer corrupted regions; furthermore, estimates of where the data are corrupted can be recovered by considering the process on u. In both experiments, the training data were renormalized to zero mean and unit variance, and throughout, we used the anisotropic squared exponential for the f process (implementing so-called relevance determination), and an isotropic version for u. The approximate marginal likelihood was maximized on three to five randomly initialized models; we chose for testing the most favoured.\n\nThe second domain of application is when the noise on the data is believed a priori to be a function of the input (i.e. heteroscedastic). The twinned GP can simulate this changing variance by modulating the u process, allocating varying weight to the two components. By way of example, the behaviour for the one-dimensional motorcycle set [10] is shown in figure 5c. However, since the input-dependent noise is not modelled directly, there are two notable dangers associated with this approach: first, the predictive variance saturates when all weight has been apportioned to one or other component; second, the "outlier" component can dominate the variance estimates of the mixture. This is particularly problematic when variance on the data ranges over several orders of magnitude, such that the "outlier" width must be comparably broader than that of the "real" component. 
In such cases, only with extreme values of u can the smallest errors be predicted, but in consequence the process tends to sweep precipitately through the region of sensitivity where variance predictions can be made accurately. To circumvent these problems we might employ the warped GP [11] to rescale the process on u in a supervised manner, but we do not explore these ideas further here.\n\nFigure 5: Results for the Friedman data (test error and negative log probability of the GP, Laplace, mixture and TGP models on (a) Friedman (1) and (b) Friedman (2)), and (c) the predictions of the TGP on the motorcycle set.\n\n5 Extensions\n\nWith prior knowledge of the nature of corruptions affecting the signal, we can seek to model the noise distribution more accurately, for example by introducing a compound likelihood for the outlier component p_O(y_n | f_n) = Σ_j α_j N(y_n; µ_j(f_n), σ_j²), with Σ_j α_j = 1. This constrains the relative weight of outlier corruptions to be constant across the entire domain. A richer alternative is provided by extending the single u-process on noise to a series u⁽¹⁾, u⁽²⁾, …, u⁽ᵛ⁾ of noise processes, and broadening the likelihood function appropriately. For example, with ν = 2, we may write\n\np(y_n | f_n, u_n⁽¹⁾, u_n⁽²⁾) = σ(u_n⁽¹⁾) N(y_n; f_n, σ_R²) + σ(−u_n⁽¹⁾) σ(u_n⁽²⁾) N(y_n; f_n, σ_O1²) + σ(−u_n⁽¹⁾) σ(−u_n⁽²⁾) N(y_n; f_n, σ_O2²).   (4)\n\nIn the former case, the preceding analysis applies with small changes: each component of the outlier distribution contributes moments independently. The second model introduces significant computational difficulty: firstly, we must maintain a posterior distribution over f and all ν u-processes, yielding space requirements O(N(ν + 1)) and time complexity O(N³(ν + 1)³). More importantly, the requisite moments needed in the EP loop are now intractable, although an inner EP loop can be used to approximate them, since the product of σs behaves in essence like the standard model for GP classification. We omit details, and defer experiments with such a model to future work.\n\n6 Conclusions\n\nWe have presented a method for robust GP regression that improves upon classical approaches by allowing the noise variance to vary in the input space. We found improved convergence on problems which upset the standard mixture model, and have shown how predictive certainty can be improved by adopting the TGP even for problems which do not. The model also allows an arbitrary process on u, such that specialized prior knowledge could be used to drive the inference over f towards respecting regions which may otherwise be considered erroneous. A generalization of our ideas appears as the mixture of GPs [1], and the infinite mixture [2], but both involve a slow inference procedure. When faster solutions are required for robust inference, and a two-component mixture is an adequate model for the task, we believe the TGP is a very attractive option.\n\nReferences\n\n[1] Volker Tresp. Mixtures of Gaussian processes. In Advances in Neural Information Processing Systems, pages 654-660, 2000.\n\n[2] Carl Edward Rasmussen and Zoubin Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems, 2002.\n\n[3] Paul Goldberg, Christopher Williams, and Christopher Bishop. Regression with input-dependent noise: a Gaussian process treatment. 
In Advances in Neural Information Processing Systems. MIT Press, 1998.\n\n[4] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18. MIT Press, 2005.\n\n[5] Thomas Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.\n\n[6] Carl Rasmussen and Christopher Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n\n[7] Malte Kuss. Gaussian process models for robust regression, classification and reinforcement learning. PhD thesis, Technische Universität Darmstadt, 2006.\n\n[8] Matthias Seeger. Expectation propagation for exponential families, 2005. Available from http://www.cs.berkeley.edu/~mseeger/papers/epexpfam.ps.gz.\n\n[9] J. H. Friedman. Multivariate adaptive regression splines. Annals of Statistics, 19(1):1-67, 1991.\n\n[10] B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society B, 47:1-52, 1985.\n\n[11] Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. Warped Gaussian processes. In Advances in Neural Information Processing Systems 16, 2003.\n", "award": [], "sourceid": 978, "authors": [{"given_name": "Andrew", "family_name": "Naish-Guzman", "institution": null}, {"given_name": "Sean", "family_name": "Holden", "institution": null}]}