{"title": "Infinite Mixtures of Gaussian Process Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 881, "page_last": 888, "abstract": null, "full_text": "In\ufb01nite Mixtures of Gaussian Process Experts\n\nCarl Edward Rasmussen and Zoubin Ghahramani\n\nGatsby Computational Neuroscience Unit\n\nUniversity College London\n\n17 Queen Square, London WC1N 3AR, England\n\nedward,zoubin@gatsby.ucl.ac.uk\n\nhttp://www.gatsby.ucl.ac.uk\n\nAbstract\n\nWe present an extension to the Mixture of Experts (ME) model, where\nthe individual experts are Gaussian Process (GP) regression models. Us-\ning an input-dependent adaptation of the Dirichlet Process, we imple-\nment a gating network for an in\ufb01nite number of Experts. Inference in this\nmodel may be done ef\ufb01ciently using a Markov Chain relying on Gibbs\nsampling. The model allows the effective covariance function to vary\nwith the inputs, and may handle large datasets \u2013 thus potentially over-\ncoming two of the biggest hurdles with GP models. Simulations show\nthe viability of this approach.\n\n1 Introduction\n\nGaussian Processes [Williams & Rasmussen, 1996] have proven to be a powerful tool for\nregression. They combine the \ufb02exibility of being able to model arbitrary smooth functions\nif given enough data, with the simplicity of a Bayesian speci\ufb01cation that only requires in-\nference over a small number of readily interpretable hyperparameters \u2013 such as the length\nscales by which the function varies along different dimensions, the contributions of signal\nand noise to the variance in the data, etc. However, GPs suffer from two important limita-\ntions. First, because inference requires inversion of an \u0002\u0001\u0003 covariance matrix where \nis\nthe number of training data points, they are computationally impractical for large datasets.\nSecond, the covariance function is commonly assumed to be stationary, limiting the mod-\neling \ufb02exibility. 
For example, if the noise variance is different in different parts of the input space, or if the function has a discontinuity, a stationary covariance function will not be adequate. Goldberg et al [1998] discussed the case of input dependent noise variance.\n\nSeveral recent attempts have been aimed at approximate inference in GP models [Williams & Seeger 2001, Smola & Bartlett 2001]. These methods are based on selecting a projection of the covariance matrix onto a smaller subspace (e.g. a subset of the data points), reducing the overall computational complexity. There have also been attempts at deriving more complex covariance functions [Gibbs 1997], although it can be difficult to decide a priori on a covariance function of sufficient complexity which guarantees positive definiteness.\n\nIn this paper we simultaneously address both the problem of computational complexity and the deficiencies in covariance functions, using a divide and conquer strategy inspired by the Mixture of Experts (ME) architecture [Jacobs et al, 1991]. In this model the input space is (probabilistically) divided by a gating network into regions within which specific separate experts make predictions. Using GP models as experts we gain the double advantage that computation for each expert is cubic only in the number of data points in its region, rather than in the entire dataset, and that each GP expert may learn different characteristics of the function (such as length scales, noise variances, etc). Of course, as in the ME, the learning of the experts and the gating network are intimately coupled.\n\nUnfortunately, it may be (practically and statistically) difficult to infer the appropriate number of experts for a particular dataset. 
In the current paper we sidestep this difficult problem by using an infinite number of experts and employing a gating network related to the Dirichlet Process, modified to specify a spatially varying Dirichlet Process. An infinite number of experts may also in many cases be more faithful to our prior expectations about complex real-world datasets. Integrating over the posterior distribution for the parameters is carried out using a Markov Chain Monte Carlo approach.\n\nTresp [2001] presented an alternative approach to mixtures of GPs. In his approach both the experts and the gating network were implemented with GPs, the gating network being a softmax of GPs. Our new model avoids several limitations of the previous approach, which are covered in depth in the discussion.\n\n2 Infinite GP mixtures\n\nThe traditional ME likelihood does not apply when the experts are non-parametric. This is because in a normal ME model the data is assumed to be iid given the model parameters:\n\np(y | x, theta, phi) = prod_{i=1..n} sum_j p(z_i = j | x_i, phi) p(y_i | x_i, theta_j),\n\nwhere x and y are inputs and outputs (boldface denotes vectors), theta_j are the parameters of expert j, phi are the parameters of the gating network, and z are the discrete indicator variables assigning data points to experts.\n\nThis iid assumption is contrary to GP models, which solely model the dependencies in the joint distribution (given the hyperparameters). 
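For concreteness, the conventional iid ME likelihood above can be evaluated with a log-sum-exp over experts for each data point. This is a minimal sketch with Gaussian experts; the function name and array layout are our own, not from the paper:

```python
import numpy as np

def me_log_likelihood(y, gate_probs, expert_means, expert_vars):
    """Log likelihood of a conventional (parametric) mixture of experts.

    Under the iid assumption each observation contributes independently:
    p(y_i) = sum_j p(z_i = j | x_i) N(y_i; mu_j(x_i), s2_j(x_i)).

    gate_probs:   (n, k) gating probabilities p(z_i = j | x_i, phi)
    expert_means: (n, k) per-expert predictive means
    expert_vars:  (n, k) per-expert predictive variances
    """
    log_norm = -0.5 * (np.log(2 * np.pi * expert_vars)
                       + (y[:, None] - expert_means) ** 2 / expert_vars)
    a = np.log(gate_probs) + log_norm
    # log-sum-exp over experts for each point, then sum over points
    m = a.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(a - m).sum(axis=1))))
```

Note that this per-point mixture structure is exactly what breaks down for GP experts, where the likelihood does not factorize over data points.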
There is a joint distribution corresponding to every possible assignment of data points to experts; therefore the likelihood is a sum over (exponentially many) assignments:\n\np(y | x, theta, phi) = sum_z p(y | x, z, theta) p(z | x, phi).   (1)\n\nGiven the configuration z = (z_1, ..., z_n), the likelihood p(y | x, z, theta) is a product, over experts, of the joint Gaussian distribution of all data points assigned to each expert. Whereas the original ME formulation used expectations of assignment variables called responsibilities, this is inadequate for inference in the mixture of GP experts. Consequently, we directly Gibbs sample the indicator variables. The conditional density of a single observation, given the other data points currently assigned to the same expert, is the standard Gaussian GP predictive density:\n\np(y_i | x, y_{-i}, z_i = j, theta_j) = N( k' Q^{-1} y^{(j)}, Q(x_i, x_i) - k' Q^{-1} k ),   (2)\n\nwhere Q is the covariance matrix (eq. (3)) of the data currently assigned to expert j (excluding observation i), y^{(j)} are the corresponding targets, and k is the vector of covariances between x_i and those inputs. Each expert has a stationary Gaussian covariance function:\n\nQ(x_i, x_j) = v0 exp( -(1/2) sum_{d=1..D} (x_i^d - x_j^d)^2 / w_d^2 ) + delta(i, j) v1,   (3)\n\nwith signal variance v0, length scales w_d (one per input dimension) and noise variance v1.\n\n3 The gating network\n\nIn a standard Dirichlet Process mixture, the conditional prior on a single indicator given all the others is:\n\np(z_i = j | z_{-i}, alpha) = n_{-i,j} / (n - 1 + alpha),   p(z_i != z_l for all l != i | z_{-i}, alpha) = alpha / (n - 1 + alpha),   (4)\n\nwhere n_{-i,j} = sum_{l != i} delta(z_l, j) is the occupation number of expert j excluding observation i, and n is the total number of data points. This shows that the probabilities are proportional to the occupation numbers. To make the gating network input dependent, we will simply employ a local estimate^1 for this occupation number using a kernel classifier:\n\nn_{-i,j}(x_i) = (n - 1) sum_{l != i} K_phi(x_i, x_l) delta(z_l, j) / sum_{l != i} K_phi(x_i, x_l),   (5)\n\nwhere K_phi is the kernel function. As an example we use a Gaussian kernel function:\n\nK_phi(x_i, x_j) = exp( -(1/2) sum_{d=1..D} (x_i^d - x_j^d)^2 / phi_d^2 ),   (6)\n\nparameterized by length scales phi_d for each dimension. These length scales allow dimensions of x space to be more or less relevant to the gating network classification.\n\n1 this local estimate won't generally be an integer, but this doesn't have any adverse consequences\n\nWe Gibbs sample from the indicator variables by multiplying the input-dependent Dirichlet process prior, eq. (4) and (5), with the GP conditional density, eq. (2). Gibbs sampling in an infinite model requires that the indicator variables can take on values that no other indicator variable has already taken, thereby creating new experts. We use the auxiliary variable approach of Neal [1998] (algorithm 8 in that paper). In this approach hyperparameters for new experts are sampled from their prior and the likelihood is evaluated based on these. This requires finding the likelihood of a Gaussian process with no data. Fortunately, for the covariance function eq. 
(3) this likelihood is Gaussian with zero mean and variance v0 + v1.\n\nIf all data points are assigned to a single GP, the likelihood calculation will still be cubic in the number of data points (per Gibbs sweep over all indicators). We can reduce the computational complexity by introducing the constraint that no GP expert can have more than n_max data points assigned to it. This is easily implemented^2 by modifying the conditionals in the Gibbs sampler.\n\nThe hyperparameter alpha controls the prior probability of assigning a data point to a new expert, and therefore influences the total number of experts used to model the data. As in Rasmussen [2000], we give a vague inverse gamma prior to alpha, and sample from its posterior using Adaptive Rejection Sampling (ARS) [Gilks & Wild, 1992]. Allowing alpha to vary gives the model more freedom to infer the number of GPs to use for a particular dataset.\n\nFinally we need to do inference for the parameters of the gating function. Given a set of indicator variables one could use standard methods from kernel classification to optimize the kernel widths in different directions. These methods typically optimize the leave-one-out pseudo-likelihood (ie the product of the conditionals), since computing the likelihood in a model defined purely from conditional distributions as in eq. (4), (5) & (6) is generally difficult (and as pointed out in the discussion section there may not even be a single likelihood). In our model we multiply the pseudo-likelihood by a (vague) prior and sample from the resulting pseudo-posterior.\n\nThe individual GP experts are given a stationary Gaussian covariance function, with a single length scale per dimension, a signal variance and a noise variance, i.e. D + 2 (where D is the dimension of the input) hyperparameters per expert, eq. (3). The signal and noise variances are given inverse gamma priors with hyper-hypers (separately for the two variances). This serves to couple the hyperparameters between experts, and allows the priors on the two variances (which are used when evaluating auxiliary classes) to adapt. Finally we give vague independent log normal priors to the length scale parameters.\n\n4 The Algorithm\n\nThe algorithm for learning an infinite mixture of GP experts consists of the following steps:\n\n1. Initialize indicator variables z_i to a single value (or a few values if individual GPs are to be kept small for computational reasons).\n\n2. Do a Gibbs sampling sweep over all indicators.\n\n3. Do Hybrid Monte Carlo (HMC) [Duane et al, 1987] for the hyperparameters of the GP covariance function, v0, v1 and w_d, for each expert in turn. We used 10 leapfrog iterations with a stepsize small enough that rejections were rare.\n\n4. Optimize the hyper-hypers for each of the two variance parameters.\n\n5. Sample the Dirichlet process concentration parameter, alpha, using ARS.\n\n6. Sample the gating kernel widths, phi; we use the Metropolis method to sample from the pseudo-posterior with a Gaussian proposal fit at the current phi.^3\n\n7. Repeat from 2 until the Markov chain has adequately sampled the posterior.\n\n2 We simply set the conditional probability of joining a class which has been deemed full to zero.\n\nFigure 1: The left hand plot shows the motorcycle impact data (133 points) together with the median of the model's predictive distribution, and for comparison the mean of a stationary covariance GP model (with optimized hyperparameters). On the right hand plot we show 100 samples from the posterior distribution for the iMGPE of the (noise free) function, evaluated at intervals of 1 ms. We have jittered the points in the plot along the time dimension by adding a small amount of uniform noise, so that the density can be seen more easily. Also, the two standard error (95%) confidence interval for the (noise free) function predicted by a stationary GP is plotted (thin lines). [Axes: Time (ms) against Acceleration (g).]\n\n5 Simulations on a simple real-world data set\n\nTo illustrate our algorithm, we used the motorcycle dataset, fig. 1, discussed in Silverman [1985]. This dataset is obviously non-stationary and has input-dependent noise. We noticed that the raw data is discretized into bins of a fixed size (in g); accordingly we cut off the prior for the noise variance at a correspondingly small value.\n\nThe model is able to capture the general shape of the function and also the input-dependent nature of the noise (fig. 1). This can be seen from the right hand plot in fig. 1, where the uncertainty of the function is very low at small times, owing to a small inferred noise level in this region. For comparison, the predictions from a stationary GP have been superimposed in fig. 1. Whereas the medians of the predictive distributions agree to a large extent (left hand plot), we see a huge difference in the predictive distributions (right hand). The homoscedastic GP cannot capture the very tight distribution at small times offered by iMGPE. Also at large times, the iMGPE model predicts with fairly high probability that the signal could be very close to zero. Note that the predictive distribution of the function is multimodal, for example, around time 35 ms.\n\n
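The per-expert likelihood evaluations underlying steps 2-4 rest on the covariance function of eq. (3). A minimal sketch, assuming the squared-exponential form given earlier and using our own variable names rather than the authors' implementation:

```python
import numpy as np

def covariance(X1, X2, w, v0, v1, add_noise=False):
    """Covariance of eq. (3): squared exponential with length scales w,
    signal variance v0, plus noise variance v1 on the diagonal.
    add_noise is only meaningful when X1 and X2 are the same points."""
    d = (X1[:, None, :] - X2[None, :, :]) / w
    Q = v0 * np.exp(-0.5 * np.sum(d ** 2, axis=-1))
    if add_noise:
        Q = Q + v1 * np.eye(len(X1))
    return Q

def gp_log_marginal(X, y, w, v0, v1):
    """Log likelihood of the data assigned to one expert. For an empty
    expert a single candidate point is evaluated under a zero-mean
    Gaussian with variance v0 + v1, as noted in the text."""
    n = len(y)
    if n == 0:
        return 0.0  # log of an empty product
    K = covariance(X, X, w, v0, v1, add_noise=True)
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L, y)
    return -0.5 * (a @ a) - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)
```

The cubic cost of the Cholesky factorization is incurred only on the points assigned to each expert, which is the computational saving the divide and conquer strategy aims for.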
Multimodal predictive distributions could in principle be obtained from an ordinary GP by integrating over hyperparameters; however, in a mixture of GPs model they can arise naturally. The predictive distribution of the function also appears to have significant mass around 0 g, which seems somewhat artifactual. We explicitly did not normalize or center the data, which has a large range in output. The Gaussian processes had zero mean a priori, which coupled with the concentration of data around zero may explain the posterior mass at zero. It would be more natural to treat the GP means as separate hyperparameters controlled by a hyper-hyperparameter (centered at zero) and do inference on them, rather than fix them all at 0.\n\n3 The Gaussian fit uses the derivative and Hessian of the log posterior wrt the log length scales. Since this is an asymmetric proposal the acceptance probabilities must be modified accordingly. This scheme has the advantage of containing no tunable parameters; however when the dimension D is large, it may be computationally more efficient to use HMC, to avoid calculation of the Hessian.\n\nFigure 2: The left hand plot shows the number of times, out of 100 samples, that the indicator variables for two data points were equal. The data have been sorted from left-to-right according to the value of the time variable (since the data is not equally spaced in time the axis of this matrix cannot be aligned with the plot in fig. 1). The right hand plot shows a histogram over the 100 samples of the number of GP experts used to model the data. 
Although the scatter of data from the predictive distribution for iMGPE looks somewhat messy with multimodality etc, it is important to note that it assigns high density to regions that seem probable.\n\nThe motorcycle data appears to have roughly three regions: a flat low-noise region, followed by a curved region, and a flat high noise region. This intuition is borne out by the model. We can see this in two ways. Fig. 2 (left) shows the number of times two data points were assigned to the same expert. A clearly defined block captures the initial flat region and a few other points that lie near the 0 g line; the middle block captures the curved region, with a more gradual transition to the last flat region. A histogram of the number of GP experts used shows that the posterior distribution of the number of needed GPs has a broad peak, where less than 3 occupied experts is very unlikely, with larger numbers becoming progressively less likely. Note that it never uses just a single GP to model the data, which accords with the intuition that a single stationary covariance function would be inadequate. We should point out that the model is not trying to do model selection between finite GP mixtures, but rather always assumes that there are infinitely many available, most of which contribute with small mass^4 to a diffuse density in the background.\n\nIn figure 3 we assessed the convergence rate of the Markov Chain by plotting the auto-correlation function for several parameters. We conclude that the mixing time is around 100 iterations^5. Consequently, we run the chain for a total of 11,000 iterations, discarding the initial 1,000 (burn-in) and keeping every 100'th after that. The total computation time was around 1 hour (1 GHz Pentium).\n\n
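The convergence diagnostic used here (summing auto-correlation coefficients to estimate the mixing time, cf. footnote 5) can be computed along these lines; a rough sketch with our own normalization conventions:

```python
import numpy as np

def autocorr(x, max_lag):
    """Sample auto-correlation coefficients rho(1..max_lag) of a scalar MCMC trace."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = np.mean(x ** 2)
    return np.array([np.mean(x[:-t] * x[t:]) / var for t in range(1, max_lag + 1)])

def mixing_time(x, max_lag):
    """Sum of auto-correlation coefficients over lags -max_lag..max_lag,
    using rho(0) = 1 and the symmetry rho(-t) = rho(t)."""
    return 1.0 + 2.0 * np.sum(autocorr(x, max_lag))
```

For an iid (perfectly mixing) trace this estimate is close to 1; strong positive auto-correlation inflates it, motivating the thinning interval used above.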
Figure 3: The left hand plot shows the auto-correlation for various parameters of the model (the log number of occupied experts, the log gating kernel width, and the log Dirichlet concentration), based on the 11,000 iterations; the time lag is measured in iterations. The right hand plots show the distribution of the (log, base 10) gating function kernel width phi and the (log, base 10) Dirichlet process concentration parameter alpha, based on the 100 samples from the posterior.\n\n4 The total mass of the non-represented experts is alpha / (n - 1 + alpha), where the posterior for alpha in this experiment (see figure 3, bottom right panel) corresponds to a small fraction of the total mass.\n\n5 the sum of the auto-correlation coefficients (over negative and positive time lags) is an estimate of the mixing time.\n\nThe right hand panel of figure 3 shows the distribution of the gating function kernel width phi and the concentration parameter alpha of the Dirichlet process. The posterior^6 kernel width phi is concentrated on quite short distances compared to the scale of the inputs, corresponding to rapid transitions between experts (as opposed to lengthy intervals with multiple active experts). This corresponds well with our visual impression of the data.\n\n6 Discussion and Conclusions\n\nWe now return to Tresp [2001]. 
There are four ways in which the infinite mixture of GP experts differs from, and we believe improves upon, the model presented by Tresp. First, in his model, although a gating network divides up the input space, each GP expert predicts on the basis of all of the data. Data that was not assigned to a GP expert can therefore spill over into the predictions of a GP, which will lead to bias near region boundaries, especially for experts with long length scales. Second, if there are k experts, Tresp's model has 3k GPs (the experts, noise models, and separate gating functions), each of which requires O(n^3) computations over the entire dataset, resulting in O(k n^3) computations. In our model, since the experts divide up the data points, if there are k experts equally dividing the data, an iteration takes O(n^3 / k) computations (each of the n Gibbs updates requires a rank-one computation O((n/k)^2) for each of the k experts, and the Hybrid Monte Carlo takes O((n/k)^3) for each of the k experts). Even for modest k (e.g. 10) this can be a significant saving. Inference for the gating length scale parameters is O(n^2 D^2) if the full Hessian is used, but can be reduced to O(n^2 D) for a diagonal approximation, or Hybrid Monte Carlo if the input dimension is large. Third, by going to the Dirichlet process infinite limit, we allow the model to infer the number of components required to capture the data. Finally, in our model the GP hyperparameters are not fixed but are instead inferred from the data.\n\nWe have defined the gating network prior implicitly in terms of the conditional distribution of an indicator variable given all the other indicator variables. Specifically, the distribution of this indicator variable is an input-dependent Dirichlet process with counts given by local estimates of the data density in each class, eq. (5). 
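The input-dependent conditionals just described, eqs. (4)-(6), can be made concrete in a few lines. This sketch covers only the prior part of the Gibbs conditional (the GP density of eq. (2) would still multiply it); all names and array layouts are illustrative, not from the authors' code:

```python
import numpy as np

def gaussian_kernel(X, phi):
    """Gating kernel of eq. (6): one length scale phi_d per input dimension."""
    d = (X[:, None, :] - X[None, :, :]) / phi      # (n, n, D)
    return np.exp(-0.5 * np.sum(d ** 2, axis=-1))

def local_occupation(z, K, i, j):
    """Kernel-smoothed estimate n_{-i,j}(x_i) of eq. (5)."""
    mask = np.arange(len(z)) != i
    w = K[i, mask]
    return (len(z) - 1) * np.sum(w * (z[mask] == j)) / np.sum(w)

def gibbs_probs(z, K, i, alpha, experts):
    """Prior part of the conditional for indicator z_i, eq. (4):
    proportional to n_{-i,j} for existing experts and alpha for a new one."""
    n = len(z)
    p = np.array([local_occupation(z, K, i, j) for j in experts] + [alpha])
    return p / (n - 1 + alpha)
```

When all inputs are equivalent under the kernel, the local estimates reduce to the exact occupation counts, recovering the standard Dirichlet process conditional.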
We have not been able to prove that these conditional distributions are always consistent with a single joint distribution over the indicators. If indeed there does not exist a single consistent joint distribution, the Gibbs sampler may converge to different distributions depending on the order of sampling.\n\n6 for comparison the (vague) prior on the kernel width is log normal, with most of its mass ranging from very short (sub sample) distances up to distances comparable to the entire input range.\n\nWe are encouraged by the preliminary results obtained on the motorcycle data. Future work should also include empirical comparisons with other state-of-the-art regression methods on multidimensional benchmark datasets. We have argued here that single iterations of the MCMC inference are computationally tractable even for large data sets; experiments will show whether mixing is sufficiently rapid to allow practical application. We hope that the extra flexibility of the effective covariance function will turn out to improve performance. Also, the automatic choice of the number of experts may make the model advantageous for practical modeling tasks.\n\nFinally, we wish to come back to the modeling philosophy which underlies this paper. The computational problem in doing inference and prediction using Gaussian Processes arises out of the unrealistic assumption that a single covariance function captures the behavior of the data over its entire range. This leads to a cumbersome matrix inversion over the entire data set. 
Instead we find that by making a more realistic assumption, that the data can be modeled by an infinite mixture of local Gaussian processes, the computational problem also decomposes into smaller matrix inversions.\n\nReferences\n\nDuane, S., Kennedy, A. D., Pendleton, B. J. & Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, vol. 195, pp. 216-222.\n\nGibbs, M. N. (1997). Bayesian Gaussian Processes for Regression and Classification. PhD thesis, University of Cambridge.\n\nGilks, W. R. & Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. Applied Statistics 41, 337-348.\n\nGoldberg, P. W., Williams, C. K. I. & Bishop, C. M. (1998). Regression with Input-dependent Noise, NIPS 10.\n\nJacobs, R. A., Jordan, M. I., Nowlan, S. J. & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, vol. 3, pp. 79-87.\n\nNeal, R. M. (1998). Markov chain sampling methods for Dirichlet process mixture models. Technical Report 4915, Department of Statistics, University of Toronto. http://www.cs.toronto.edu/~radford/mixmc.abstract.html.\n\nRasmussen, C. E. (2000). The Infinite Gaussian Mixture Model, NIPS 12, S. A. Solla, T. K. Leen and K.-R. Müller (eds.), pp. 554-560, MIT Press.\n\nSilverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting. J. Royal Stat. Society, Ser. B, vol. 47, pp. 1-52.\n\nSmola, A. J. & Bartlett, P. (2001). Sparse Greedy Gaussian Process Regression, NIPS 13.\n\nTresp, V. (2001). Mixtures of Gaussian Processes, NIPS 13.\n\nWilliams, C. K. I. & Seeger, M. (2001). Using the Nyström Method to Speed Up Kernel Machines, NIPS 13.\n\nWilliams, C. K. I. & Rasmussen, C. E. (1996). Gaussian Processes for Regression, in D. S. Touretzky, M. C. Mozer and M. E. 
Hasselmo (editors), NIPS 8, MIT Press.\n", "award": [], "sourceid": 2055, "authors": [{"given_name": "Carl", "family_name": "Rasmussen", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}]}