{"title": "An Alternative Infinite Mixture Of Gaussian Process Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 883, "page_last": 890, "abstract": "", "full_text": "An Alternative In\ufb01nite Mixture Of Gaussian\n\nProcess Experts\n\nEdward Meeds and Simon Osindero\n\nDepartment of Computer Science\n\nUniversity of Toronto\nToronto, M5S 3G4\n\nfewm,osinderog@cs.toronto.edu\n\nAbstract\n\nWe present an in\ufb01nite mixture model in which each component com-\nprises a multivariate Gaussian distribution over an input space, and a\nGaussian Process model over an output space. Our model is neatly able\nto deal with non-stationary covariance functions, discontinuities, multi-\nmodality and overlapping output signals. The work is similar to that by\nRasmussen and Ghahramani [1]; however, we use a full generative model\nover input and output space rather than just a conditional model. This al-\nlows us to deal with incomplete data, to perform inference over inverse\nfunctional mappings as well as for regression, and also leads to a more\npowerful and consistent Bayesian speci\ufb01cation of the effective \u2018gating\nnetwork\u2019 for the different experts.\n\nIntroduction\n\n1\nGaussian process (GP) models are powerful tools for regression, function approximation,\nand predictive density estimation. However, despite their power and \ufb02exibility, they suffer\nfrom several limitations. The computational requirements scale cubically with the number\nof data points, thereby necessitating a range of approximations for large datasets. 
Another problem is that it can be difficult to specify priors and perform learning in GP models if we require non-stationary covariance functions, multi-modal output, or discontinuities.

There have been several attempts to circumvent some of these lacunae, for example [2, 1]. In particular, the Infinite Mixture of Gaussian Process Experts (IMoGPE) model proposed by Rasmussen and Ghahramani [1] neatly addresses the aforementioned key issues. In a single GP model, an n by n matrix must be inverted during inference. However, if we use a model composed of multiple GPs, each responsible only for a subset of the data, then the computational complexity of inverting an n by n matrix is replaced by several inversions of smaller matrices; for large datasets this can result in a substantial speed-up and may allow one to consider large-scale problems that would otherwise be unwieldy. Furthermore, by combining multiple stationary GP experts, we can easily accommodate non-stationary covariance and noise levels, as well as distinctly multi-modal outputs. Finally, by placing a Dirichlet process prior over the experts we can allow the data and our prior beliefs (which may be rather vague) to automatically determine the number of components to use.

In this work we present an alternative infinite model that is strongly inspired by the work in [1], but which uses a different formulation for the mixture of experts, in the style presented in, for example, [3, 4]. This alternative approach effectively uses posterior responsibilities from a mixture distribution as the gating network.

Figure 1: Left: Graphical model for the standard MoE model [6]. The expert indicators {z_(i)} are specified by a gating network applied to the inputs {x_(i)}. Right: An alternative view of the MoE model using a full generative model [4]. The distribution of input locations is now given by a mixture model, with components for each expert. Conditioned on the input locations, the posterior responsibilities for each mixture component behave like a gating network.

Even if the task at hand is simply output density estimation or regression, we suggest a full generative model over inputs and outputs might be preferable to a purely conditional model. The generative approach retains all the strengths of [1] and also has a number of potential advantages, such as being able to deal with partially specified data (e.g. missing input co-ordinates) and being able to infer inverse functional mappings (i.e. the input space given an output value). The generative approach also affords us a richer and more consistent way of specifying our prior beliefs about how the covariance structure of the outputs might vary as we move within input space.

An example of the type of generative model which we propose is shown in figure 2. We use a Dirichlet process prior over a countably infinite number of experts, and each expert comprises two parts: a density over input space describing the distribution of input points associated with that expert, and a Gaussian process model over the outputs associated with that expert. In this preliminary exposition, we restrict our attention to experts whose input space densities are given by a single full-covariance Gaussian. Even this simple approach demonstrates interesting performance and capabilities. However, in a more elaborate setup the input density associated with each expert might itself be an infinite mixture of simpler distributions (for instance, an infinite mixture of Gaussians [5]) to allow for the most flexible partitioning of input space amongst the experts.

The structure of the paper is as follows.
We begin in section 2 with a brief overview of two ways of thinking about mixtures of experts. Then, in section 3, we give the complete specification and graphical depiction of our generative model, and in section 4 we outline the steps required to perform Monte Carlo inference and prediction. In section 5 we present the results of several simple simulations that highlight some of the salient features of our proposal, and finally in section 6, we discuss our work and place it in relation to similar techniques.

2 Mixtures of Experts

In the standard mixture of experts (MoE) model [6], a gating network probabilistically mixes regression components. One subtlety in using GPs in a mixture of experts model is that IID assumptions on the data no longer hold and we must specify joint distributions for each possible assignment of experts to data. Let {x_(i)} be the set of d-dimensional input vectors, {y_(i)} be the set of scalar outputs, and {z_(i)} be the set of expert indicators which assign data points to experts.

The likelihood of the outputs, given the inputs, is specified in equation 1, where \theta_r^{GP} represents the GP parameters of the rth expert, \theta_g represents the parameters of the gating network, and the summation is over all possible configurations of indicator variables.

Figure 2: The graphical model representation of the alternative infinite mixture of GP experts (AiMoGPE) model proposed in this paper. (The figure shows the DP concentration parameter \alpha_0 with hyperparameters a_{\alpha_0}, b_{\alpha_0}; gate hyperparameters governed by \mu_x, \Sigma_x, \nu_0, f_0, \nu_S, f_S, a_{\nu_c}, b_{\nu_c}; GP hyperparameters governed by a_0, b_0, a_1, b_1, a_w, b_w; and plates over experts r = 1:K, input dimensions j = 1:D, and per-expert data points i = 1:N_r.) We have used x_{(i)}^r to represent the ith data point in the set of input data whose expert label is r, and Y_r to represent the set of all output data whose expert label is r.
In other words, input data are IID given their expert label, whereas the sets of output data are IID given their corresponding sets of input data. The lightly shaded boxes with rounded corners represent hyper-hyperparameters that are fixed (\Omega in the text). The DP concentration parameter \alpha_0, the expert indicator variables {z_(i)}, the gate hyperparameters \phi_x = \{\mu_0, \Sigma_0, \nu_c, S\}, the gate component parameters \psi_r^x = \{\mu_r, \Sigma_r\}, and the GP expert parameters \theta_r^{GP} = \{v_{0r}, v_{1r}, w_{jr}\} are all updated for all r and j.

P(\{y_{(i)}\} \mid \{x_{(i)}\}, \theta) = \sum_{\{z_{(i)}\}} P(\{z_{(i)}\} \mid \{x_{(i)}\}, \theta_g) \prod_r P(\{y_{(i)} : z_{(i)} = r\} \mid \{x_{(i)} : z_{(i)} = r\}, \theta_r^{GP})   (1)

There is an alternative view of the MoE model in which the experts also generate the inputs, rather than simply being conditioned on them [3, 4] (see figure 1). This alternative view employs a joint mixture model over input and output space, even though the objective is still primarily that of estimating conditional densities, i.e. outputs given inputs. The gating network effectively gets specified by the posterior responsibilities of each of the different components in the mixture. An advantage of this perspective is that it can easily accommodate partially observed inputs, and it also allows 'reverse-conditioning', should we wish to estimate where in input space a given output value is likely to have originated. For a mixture model using Gaussian process experts, the likelihood is given by

P(\{x_{(i)}\}, \{y_{(i)}\} \mid \theta) = \sum_{\{z_{(i)}\}} P(\{z_{(i)}\} \mid \theta_g) \prod_r P(\{y_{(i)} : z_{(i)} = r\} \mid \{x_{(i)} : z_{(i)} = r\}, \theta_r^{GP}) \, P(\{x_{(i)} : z_{(i)} = r\} \mid \theta_g)   (2)

where the description of the density over input space is encapsulated in \theta_g.

3 Infinite Mixture of Gaussian Processes: A Joint Generative Model

The graphical structure for our full generative model is shown in figure 2.
Our generative process does not produce IID data points and is therefore most simply formulated either as a joint distribution over a dataset of a given size, or as a set of conditionals in which we incrementally add data points. To construct a complete set of N sample points from the prior (specified by top-level hyper-parameters \Omega) we would perform the following operations:

1. Sample the Dirichlet process concentration variable \alpha_0 given the top-level hyper-parameters.
2. Construct a partition of N objects into at most N groups using a Dirichlet process. This assignment of objects is denoted using a set of indicator variables \{z_{(i)}\}_{i=1}^{N}.
3. Sample the gate hyperparameters \phi_x given the top-level hyperparameters.
4. For each grouping of indicators \{z_{(i)} : z_{(i)} = r\}, sample the input space parameters \psi_r^x conditioned on \phi_x. \psi_r^x defines the density in input space, in our case a full-covariance Gaussian.
5. Given the parameters \psi_r^x for each group, sample the locations of the input points X_r \equiv \{x_{(i)} : z_{(i)} = r\}.
6. For each group, sample the hyper-parameters for the GP expert associated with that group, \theta_r^{GP}.
7.
Using the input locations X_r and hyper-parameters \theta_r^{GP} for the individual groups, formulate the GP output covariance matrix and sample the set of output values Y_r \equiv \{y_{(i)} : z_{(i)} = r\} from this joint Gaussian distribution.

We write the full joint distribution of our model as follows:

P(\{x_{(i)}, y_{(i)}\}_{i=1}^{N}, \{z_{(i)}\}_{i=1}^{N}, \{\psi_r^x\}_{r=1}^{N}, \{\theta_r^{GP}\}_{r=1}^{N}, \alpha_0, \phi_x \mid N, \Omega) =
  \prod_{r=1}^{N} \left[ H_r^N \, P(\psi_r^x \mid \phi_x) P(X_r \mid \psi_r^x) P(\theta_r^{GP} \mid \Omega) P(Y_r \mid X_r, \theta_r^{GP}) + (1 - H_r^N) \, D_0(\psi_r^x, \theta_r^{GP}) \right]
  \times P(\{z_{(i)}\}_{i=1}^{N} \mid N, \alpha_0) \, P(\alpha_0 \mid \Omega) \, P(\phi_x \mid \Omega)   (3)

where we have used the supplementary notation: H_r^N = 0 if \{z_{(i)} : z_{(i)} = r\} is the empty set and H_r^N = 1 otherwise; and D_0(\psi_r^x, \theta_r^{GP}) is a delta function on an (irrelevant) dummy set of parameters to ensure proper normalisation.

For the GP components, we use a standard, stationary covariance function of the form

Q(x_{(i)}, x_{(h)}) = v_0 \exp\left( -\tfrac{1}{2} \sum_{j=1}^{D} \left( x_{(i)j} - x_{(h)j} \right)^2 / w_j^2 \right) + \delta(i, h) \, v_1   (4)

The individual distributions in equation 3 are defined as follows{1}:

P(\alpha_0 \mid \Omega) = \mathcal{G}(\alpha_0; a_{\alpha_0}, b_{\alpha_0})   (5)
P(\{z_{(i)}\}_{i=1}^{N} \mid N, \Omega) = \mathcal{PU}(\alpha_0, N)   (6)
P(\phi_x \mid \Omega) = \mathcal{N}(\mu_0; \mu_x, \Sigma_x / f_0) \, \mathcal{W}(\Sigma_0^{-1}; \nu_0, f_0 \Sigma_x^{-1} / \nu_0) \, \mathcal{G}(\nu_c; a_{\nu_c}, b_{\nu_c}) \, \mathcal{W}(S^{-1}; \nu_S, f_S \Sigma_x / \nu_S)   (7)
P(\psi_r^x \mid \Omega) = \mathcal{N}(\mu_r; \mu_0, \Sigma_0) \, \mathcal{W}(\Sigma_r^{-1}; \nu_c, S / \nu_c)   (8)
P(X_r \mid \psi_r^x) = \mathcal{N}(X_r; \mu_r, \Sigma_r)   (9)
P(\theta_r^{GP} \mid \Omega) = \mathcal{G}(v_{0r}; a_0, b_0) \, \mathcal{G}(v_{1r}; a_1, b_1) \prod_{j=1}^{D} \mathcal{LN}(w_{jr}; a_w, b_w)   (10)
P(Y_r \mid X_r, \theta_r^{GP}) = \mathcal{N}(Y_r; \mu_{Q_r}, \sigma^2_{Q_r})   (11)

{1} We use the notation \mathcal{N}, \mathcal{W}, \mathcal{G}, and \mathcal{LN} to represent the normal, the Wishart, the gamma, and the log-normal distributions, respectively; we use the parameterizations found in [7] (Appendix A). The notation \mathcal{PU} refers to the Polya urn distribution [8].

In an approach similar to Rasmussen [5], we use the input data mean \mu_x and covariance \Sigma_x to provide an automatic normalisation of our dataset. We also incorporate additional hyperparameters f_0 and f_S, which allow prior beliefs about the variation in location of \mu_r and size of \Sigma_r, relative to the data covariance.

4 Monte Carlo Updates

Almost all the integrals and summations required for inference and learning operations within our model are analytically intractable, and therefore necessitate Monte Carlo approximations. Fortunately, all the necessary updates are relatively straightforward to carry out using a Markov chain Monte Carlo (MCMC) scheme employing Gibbs sampling and hybrid Monte Carlo. We also note that in our model the predictive density depends on the entire set of test locations (in input space). This transductive behaviour follows from the non-IID nature of the model and the influence that test locations have on the posterior distribution over mixture parameters. Consequently, the marginal predictive distribution at a given location can depend on the other locations for which we are making simultaneous predictions. This may or may not be desired. In some situations the ability to incorporate the additional information about the input density at test time may be beneficial.
However, it is also straightforward to effectively 'ignore' this new information and simply compute a set of independent single-location predictions.

Given a set of test locations \{x^*_{(t)}\}, along with training data pairs \{x_{(i)}, y_{(i)}\} and top-level hyper-parameters \Omega, we iterate through the following conditional updates to produce our predictive distribution for the unknown outputs \{y^*_{(t)}\}. The parameter updates are all conjugate with the prior distributions, except where noted:

1. Update indicators \{z_{(i)}\} by cycling through the data and sampling one indicator variable at a time. We use algorithm 8 from [9] with m = 1 to explore new experts.
2. Update input space parameters.
3. Update GP hyper-parameters using hybrid Monte Carlo [10].
4. Update gate hyperparameters. Note that \nu_c is updated using slice sampling [11].
5. Update DP hyperparameter \alpha_0 using the data augmentation technique of Escobar and West [12].
6. Resample missing output values by cycling through the experts, and jointly sampling the missing outputs associated with that GP.

We perform some preliminary runs to estimate the longest auto-covariance time, \tau_{max}, for our posterior estimates, and then use a burn-in period that is about 10 times this timescale before taking samples every \tau_{max} iterations.{2} For our simulations the auto-covariance time was typically 40 complete update cycles, so we use a burn-in period of 500 iterations and collect samples every 50.

5 Experiments

5.1 Samples From The Prior

In figure 3 (A) we give an example of data drawn from our model which is multi-modal and non-stationary. We also use this artificial dataset to confirm that our MCMC algorithm performs well and is able to recover sensible posterior distributions.
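A much-simplified sketch of drawing such a dataset from a prior of this general form is given below. This is our own illustration, not the authors' code: for brevity we fix the number of experts and the hyperparameter values (rather than sampling \alpha_0, \phi_x and the partition from a Dirichlet process), but we keep the two-part structure of each expert, a Gaussian density over inputs plus a GP with a squared-exponential covariance over outputs.

```python
import numpy as np

rng = np.random.default_rng(1)

def se_kernel(x, v0=25.0, v1=1.0, w=2.0):
    """Squared-exponential covariance: signal variance v0, noise v1, lengthscale w."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return v0 * np.exp(-0.5 * d2 / w**2) + v1 * np.eye(len(x))

def sample_from_prior(n_experts=3, n_per_expert=40):
    """Draw (inputs, outputs) for each expert: Gaussian inputs, GP outputs."""
    data = []
    for _ in range(n_experts):
        mu_r = rng.normal(0.0, 5.0)    # expert's input-space mean (illustrative prior)
        sd_r = rng.gamma(2.0, 1.0)     # expert's input-space scale (illustrative prior)
        x_r = rng.normal(mu_r, sd_r, size=n_per_expert)
        # Outputs are one joint draw from the expert's GP at its input locations.
        y_r = rng.multivariate_normal(np.zeros(n_per_expert), se_kernel(x_r))
        data.append((x_r, y_r))
    return data

data = sample_from_prior()
```

Because each expert occupies its own region of input space and carries its own GP draw, samples of this kind exhibit exactly the multi-modality and non-stationarity described above.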
Posterior histograms for some of the inferred parameters are shown in figure 3 (B), and we see that they are well clustered around the 'true' values.

{2} This is primarily for convenience. It would also be valid to use all the samples after the burn-in period, and although they could not be considered independent, they could be used to obtain a more accurate estimator.

Figure 3: (A) A set of samples from our model prior. The different marker styles are used to indicate the sets of points from different experts. (B) The posterior distribution of log \alpha_0, with its true value indicated by the dashed line (top), and the distribution of the number of occupied experts k (bottom). We note that the posterior mass is located in the vicinity of the true values.

5.2 Inference On Toy Data

To illustrate some of the features of our model we constructed a toy dataset consisting of 4 continuous functions, to which we added different levels of noise. The functions used were:

f_1(a_1) = 0.25 a_1^2 - 40, \quad a_1 \in (0 \ldots 15), \quad noise SD: 7   (12)
f_2(a_2) = -0.0625 (a_2 - 18)^2 + 0.5 a_2 + 20, \quad a_2 \in (35 \ldots 60), \quad noise SD: 7   (13)
f_3(a_3) = 0.008 (a_3 - 60)^3 - 70, \quad a_3 \in (45 \ldots 80), \quad noise SD: 4   (14)
f_4(a_4) = -\sin(0.25 a_4) - 6, \quad a_4 \in (80 \ldots 100), \quad noise SD: 2   (15)

The resulting data has non-stationary noise levels, non-stationary covariance, discontinuities and significant multi-modality.
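The four noisy segments of equations 12-15 can be generated as follows. The per-segment sample counts and the uniform placement of inputs within each range are our own choices; the paper does not state them.

```python
import numpy as np

def sample_toy_data(n_per_segment=50, seed=0):
    """Draw the four-function toy dataset: (function, input range, noise SD)."""
    rng = np.random.default_rng(seed)
    segments = [
        (lambda a: 0.25 * a**2 - 40,                     (0.0, 15.0),   7.0),
        (lambda a: -0.0625 * (a - 18)**2 + 0.5 * a + 20, (35.0, 60.0),  7.0),
        (lambda a: 0.008 * (a - 60)**3 - 70,             (45.0, 80.0),  4.0),
        (lambda a: -np.sin(0.25 * a) - 6,                (80.0, 100.0), 2.0),
    ]
    xs, ys = [], []
    for f, (lo, hi), noise_sd in segments:
        a = rng.uniform(lo, hi, size=n_per_segment)
        xs.append(a)
        ys.append(f(a) + noise_sd * rng.normal(size=n_per_segment))
    return np.concatenate(xs), np.concatenate(ys)

x, y = sample_toy_data()
```

Note that the second and third ranges overlap (35...60 and 45...80), so over that interval the conditional density p(y | x) is genuinely bimodal, which is what defeats a single GP below.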
Figure 4 shows our results on this dataset along with those from a single GP for comparison.

We see that in order to account for the entire data set with a single GP, we are forced to infer an unnecessarily high level of noise in the function. Also, a single GP is unable to capture the multi-modality or non-stationarity of the data distribution. In contrast, our model seems much more able to deal with these challenges.

Since we have a full generative model over both input and output space, we are also able to use our model to infer likely input locations given a particular output value. There are a number of applications for which this might be relevant, for example if one wanted to sample candidate locations at which to evaluate a function we are trying to optimise. We provide a simple illustration of this in figure 4 (B). We choose three output levels and, conditioned on the output having these values, we sample for the input location. The inference seems plausible, and our model is able to suggest locations in input space for a maximal output value (+40) that was not seen in the training data.

5.3 Regression on a simple "real-world" dataset

We also apply our model and algorithm to the motorcycle dataset of [13]. This is a commonly used dataset in the GP community and therefore serves as a useful basis for comparison. In particular, it also makes it easy to see how our model compares with standard GPs and with the work of [1]. Figure 5 compares the performance of our model with that of a single GP.
In particular, we note that although the median of our model closely resembles the mean of the single GP, our model is able to more accurately model the low noise level on the left side of the dataset.

Figure 4: Results on a toy dataset. (A) The training data is shown along with the predictive mean of a stationary covariance GP and the median of the predictive distribution of our model. (B) The small dots are samples from the model (160 samples per location) evaluated at 80 equally spaced locations across the range (but plotted with a small amount of jitter to aid visualisation). These illustrate the predictive density from our model. The solid lines show the +/- 2 SD interval from a regular GP. The circular markers at ordinates of 40, 10 and -100 show samples from 'reverse-conditioning', where we sample likely abscissa locations given the test ordinate and the set of training data.

For the remainder of the dataset, the noise levels modeled by our model and by a single GP are very similar, although our model is better able to capture the behaviour of the data at around 30 ms. It is difficult to make an exact comparison to [1]; however, we can speculate that our model more realistically models the noise at the beginning of the dataset by not inferring an overly "flat" GP expert at that location. We can also report that our expert adjacency matrix closely resembles that of [1].

6 Discussion

We have presented an alternative framework for an infinite mixture of GP experts. We feel
We feel\nthat our proposed model carries over the strengths of [1] and augments these with the sev-\neral desirable additional features. The pseudo-likelihood objective function used to adapt\nthe gating network de\ufb01ned in [1] is not guaranteed to lead to a self-consistent distribution\nand therefore the results may depend on the order in which the updates are performed; our\nmodel incorporates a consistent Bayesian density formulation for both input and output\nspaces by de\ufb01nition. Furthermore, in our most general framework we are more naturally\nable to specify priors over the partitioning of space between different expert components.\nAlso, since we have a full joint model we can infer inverse functional mappings.\nThere should be considerable gains to be made by allowing the input density models be\nmore powerful. This would make it easier for arbitrary regions of space to share the same\ncovariance structures; at present the areas \u2018controlled\u2019 by a particular expert tend to be\nlocal. Consequently, a potentially undesirable aspect of the current model is that strong\nclustering in input space can lead us to infer several expert components even if a single GP\nwould do a good job of modelling the data. An elegant way of extending the model in this\nway might be to use a separate in\ufb01nite mixture distribution for the input density of each\nexpert, perhaps incorporating a hierarchical DP prior across the in\ufb01nite set of experts to\nallow information to be shared.\n\nWith regard to applications, it might be interesting to further explore our model\u2019s capability\nto infer inverse functional mappings; perhaps this could be useful in an optimisation or\nactive learning context. 
Finally, we note that although we have focused on rather small examples so far, it seems that the inference techniques should scale well to larger problems and more practical tasks.

Figure 5: (A) Motorcycle impact data (acceleration in g against time in ms), together with the median of our model's point-wise predictive distribution and the predictive mean of a stationary covariance GP model. (B) The small dots are samples from our model (160 samples per location) evaluated at 80 equally spaced locations across the range (but plotted with a small amount of jitter to aid visualisation). The solid lines show the +/- 2 SD interval from a regular GP.

Acknowledgments

Thanks to Ben Marlin for sharing slice sampling code and to Carl Rasmussen for making minimize.m available.

References

[1] C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems 14, pages 881-888. MIT Press, 2002.
[2] V. Tresp. Mixture of Gaussian processes. In Advances in Neural Information Processing Systems, volume 13. MIT Press, 2001.
[3] Z. Ghahramani and M. I. Jordan. Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems 6, pages 120-127. Morgan-Kaufmann, 1995.
[4] L. Xu, M. I. Jordan, and G. E. Hinton. An alternative model for mixtures of experts. In Advances in Neural Information Processing Systems 7, pages 633-640. MIT Press, 1995.
[5] C. E. Rasmussen. The infinite Gaussian mixture model.
In Advances in Neural Information Processing Systems, volume 12, pages 554-560. MIT Press, 2000.
[6] R. A. Jacobs, M. I. Jordan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3, 1991.
[7] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall, 2nd edition, 2004.
[8] D. Blackwell and J. B. MacQueen. Ferguson distributions via Polya urn schemes. The Annals of Statistics, 1(2):353-355, 1973.
[9] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249-265, 2000.
[10] R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, University of Toronto, 1993.
[11] R. M. Neal. Slice sampling (with discussion). Annals of Statistics, 31:705-767, 2003.
[12] M. Escobar and M. West. Computing Bayesian nonparametric hierarchical models. In Practical Nonparametric and Semiparametric Bayesian Statistics, number 133 in Lecture Notes in Statistics. Springer-Verlag, 1998.
[13] B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society, Series B, 47:1-52, 1985.