{"title": "Mixtures of Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 654, "page_last": 660, "abstract": null, "full_text": "Mixtures of Gaussian Processes \n\nVolker Tresp \n\nSiemens AG, Corporate Technology, Department of Neural Computation \n\nOtto-Hahn-Ring 6, 81730 München, Germany \n\nVolker.Tresp@mchp.siemens.de \n\nAbstract \n\nWe introduce the mixture of Gaussian processes (MGP) model, which is useful for applications in which the optimal bandwidth of a map is input dependent. The MGP is derived from the mixture of experts model and can also be used for modeling general conditional probability densities. We discuss how Gaussian processes - in particular in the form of Gaussian process classification, the support vector machine and the MGP model - can be used for quantifying the dependencies in graphical models. \n\n1 Introduction \n\nGaussian processes are typically used for regression, where it is assumed that the underlying function is generated by one infinite-dimensional Gaussian distribution (i.e. we assume a Gaussian prior distribution). In Gaussian process regression (GPR) we further assume that the output data are generated by additive Gaussian noise, i.e. we assume a Gaussian likelihood model. GPR can be generalized by using likelihood models from the exponential family of distributions, which is useful for classification and for the prediction of lifetimes or counts. The support vector machine (SVM) is a variant in which the likelihood model is not derived from the exponential family of distributions but rather uses functions with a discontinuous first derivative. In this paper we introduce another generalization of GPR in the form of the mixture of Gaussian processes (MGP) model, which is a variant of the well-known mixture of experts (ME) model of Jacobs et al. (1991). 
The MGP model allows Gaussian processes to model general conditional probability densities. An advantage of the MGP model is that it is fast to train compared to the neural network ME model. Even more interestingly, the MGP model is one possible approach to addressing the problem of input-dependent bandwidth requirements in GPR. Input-dependent bandwidth is useful if either the complexity of the map is input dependent - requiring a higher bandwidth in regions of high complexity - or the input data density is nonuniform. In the latter case, one would prefer Gaussian processes with a higher bandwidth in regions with many data points and a lower bandwidth in regions with lower data density. If GPR models with different bandwidths are used, the MGP approach allows the system to self-organize by locally selecting the GPR model with the appropriate optimal bandwidth. \n\nGaussian process classifiers, the support vector machine and the MGP can be used to model the local dependencies in graphical models. Here, we are mostly interested in the case that the dependencies of a set of variables y are modified via Gaussian processes by a set of exogenous variables x. As an example, consider a medical domain in which a Bayesian network of discrete variables y models the dependencies between diseases and symptoms and where these dependencies are modified by exogenous (often continuous) variables x representing quantities such as the patient's age, weight or blood pressure. Another example would be collaborative filtering, where y might represent a set of goods and the correlation between customer preferences is modeled by a dependency network (another example of a graphical model). Here, exogenous variables such as income, gender and social status might be useful quantities to modify those dependencies. \n\nThe paper is organized as follows. 
In the next section we briefly review Gaussian processes and their application to regression. In Section 3 we discuss generalizations of the simple GPR model. In Section 4 we introduce the MGP model and present experimental results. In Section 5 we discuss Gaussian processes in the context of graphical models. In Section 6 we present conclusions. \n\n2 Gaussian Processes \n\nIn Gaussian process regression (GPR) one assumes that a priori a function f(x) is generated from an infinite-dimensional Gaussian distribution with zero mean and covariance K(x, x_k) = cov(f(x), f(x_k)), where the K(x, x_k) are positive definite kernel functions. In this paper we will only use Gaussian kernel functions of the form \n\nK(x, x_k) = A exp( -||x - x_k||^2 / (2 s^2) ) \n\nwith scale parameter s and amplitude A. Furthermore, we assume a set of N training data D = {(x_k, y_k)}_{k=1}^N where targets are generated following a normal distribution with variance sigma^2 such that \n\nP(y|f(x)) ∝ exp( -(1/(2 sigma^2)) (f(x) - y)^2 ).   (1) \n\nThe expected value f_hat(x) at an input x given the training data is a superposition of the kernel functions of the form \n\nf_hat(x) = sum_{k=1}^N w_k K(x, x_k).   (2) \n\nHere, w_k is the weight on the k-th kernel. Let K be the N x N Gram matrix with (K)_{k,j} = cov(f(x_k), f(x_j)). Then we have the relation f^m = K w, where the components of f^m = (f(x_1), ..., f(x_N))' are the values of f at the locations of the training data and w = (w_1, ..., w_N)'. As a result of this relationship we can either calculate the optimal w, or we can calculate the optimal f^m and then deduce the corresponding w-vector by matrix inversion. The latter approach is taken in this paper. Following the assumptions, the optimal f^m minimizes the cost function \n\n(1/sigma^2) (y - f^m)'(y - f^m) + (f^m)' K^{-1} f^m   (3) \n\nsuch that \n\nf_hat^m = K (K + sigma^2 I)^{-1} y. \n\nHere y = (y_1, ..., y_N)' is the vector of targets and I is the N-dimensional unit matrix. 
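\n\nAs a concrete numerical illustration of Eqs. (2) and (3), the posterior mean can be computed with a few matrix operations. The following sketch is not the implementation used in the paper; the function names and the toy data are illustrative, and NumPy is assumed:

```python
import numpy as np

def gpr_fit_predict(X, y, X_test, s=0.5, A=1.0, noise_var=0.01):
    """Sketch of simple GPR: Gaussian kernel, posterior mean (Eqs. 2 and 3)."""
    def kernel(Xa, Xb):
        # K(x, x_k) = A exp(-||x - x_k||^2 / (2 s^2))
        d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
        return A * np.exp(-d2 / (2.0 * s ** 2))

    K = kernel(X, X)  # N x N Gram matrix
    # w = (K + sigma^2 I)^{-1} y, so that f_hat^m = K w = K (K + sigma^2 I)^{-1} y
    w = np.linalg.solve(K + noise_var * np.eye(len(X)), y)
    # f_hat(x) = sum_k w_k K(x, x_k)  -- Eq. (2)
    return kernel(X_test, X) @ w

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(50, 1))
y = np.sign(X[:, 0]) + 0.1 * rng.standard_normal(50)  # noisy step function
f_test = gpr_fit_predict(X, y, np.array([[-1.0], [1.0]]))
```

Note that the single global scale parameter s imposes the same bandwidth everywhere in input space; this is the limitation that Section 4 addresses.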
\n\n3 Generalized Gaussian Processes and the Support Vector Machine \n\nIn generalized Gaussian processes the Gaussian prior assumption is maintained but the likelihood model is now derived from the exponential family of distributions. The most important special cases are two-class classification, \n\nP(y = 1|f(x)) = 1 / (1 + exp(-f(x))), \n\nand multiple-class classification. Here, y is a discrete variable with C states and \n\nP(y = i|f_1(x), ..., f_C(x)) = exp(f_i(x)) / sum_{j=1}^C exp(f_j(x)).   (4) \n\nNote that for multiple-class classification, C Gaussian processes f_1(x), ..., f_C(x) are used. Generalized Gaussian processes are discussed in Tresp (2000). The special case of classification was discussed by Williams and Barber (1998) from a Bayesian perspective. The related smoothing splines approaches are discussed in Fahrmeir and Tutz (1994). For generalized Gaussian processes, the optimization of the cost function is based on an iterative Fisher scoring procedure. \n\nIncidentally, the support vector machine (SVM) can also be considered to be a generalized Gaussian process model with \n\nP(y|f(x)) ∝ exp( -const (1 - y f(x))_+ ). \n\nHere, y ∈ {-1, 1}, the operation (.)_+ sets all negative values equal to zero and const is a constant (Sollich, 2000).^1 The SVM cost function is particularly interesting since, due to its discontinuous first derivative, many components of the optimal weight vector w are zero, i.e. we obtain sparse solutions. \n\n4 Mixtures of Gaussian Processes \n\nGPR employs a global scale parameter s. In many applications it might be more desirable to permit an input-dependent scale parameter: the complexity of the map might be input dependent or the input data density might be nonuniform. In the latter case one might want to use a smaller scale parameter in regions with high data density. 
This is the main motivation for introducing another generalization of the simple GPR model, the mixture of Gaussian processes (MGP) model, which is a variant of the mixture of experts model of Jacobs et al. (1991). Here, a set of GPR models with different scale parameters is used and the system can autonomously decide which GPR model is appropriate for a particular region of input space. Let F^mu(x) = {f_1^mu(x), ..., f_M^mu(x)} denote this set of M GPR models. The state of a discrete M-state variable z determines which of the GPR models is active for a given input x. The state of z is estimated by an M-class classification Gaussian process model with \n\nP(z = i|F^z(x)) = exp(f_i^z(x)) / sum_{j=1}^M exp(f_j^z(x)), \n\nwhere F^z(x) = {f_1^z(x), ..., f_M^z(x)} denotes a second set of M Gaussian processes. Finally, we use a set of M Gaussian processes F^sigma(x) = {f_1^sigma(x), ..., f_M^sigma(x)} to model the input-dependent noise variance of the GPR models. The likelihood model given the state of z, \n\nP(y|z, F^mu(x), F^sigma(x)) = G( y; f_z^mu(x), exp(2 f_z^sigma(x)) ), \n\nis a Gaussian centered at f_z^mu(x) with variance exp(2 f_z^sigma(x)). The exponential is used to ensure positivity. Note that G(a; b, c) is our notation for a Gaussian density with mean b and variance c, evaluated at a. In the remaining parts of the paper we will not denote the dependency on the Gaussian processes explicitly, e.g. we will write P(y|z, x) instead of P(y|z, F^mu(x), F^sigma(x)). Since z is a latent variable we obtain \n\nP(y|x) = sum_{i=1}^M P(z = i|x) G( y; f_i^mu(x), exp(2 f_i^sigma(x)) ),   E(y|x) = sum_{i=1}^M P(z = i|x) f_i^mu(x), \n\nthe well-known mixture of experts network of Jacobs et al. (1991), where the f_i^mu(x) are the (Gaussian process) experts and P(z = i|x) is the gating network. Figure 2 (left) illustrates the dependencies in the MGP model. \n\n^1 Properly normalizing the conditional probability density is somewhat tricky and is discussed in detail in Sollich (2000). 
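\n\nOnce the 3M Gaussian process outputs at an input x are available, the predictive mixture above can be evaluated mechanically. The following sketch uses illustrative names and NumPy; the mixture variance at the end is a standard identity added for illustration, not a formula stated in the text:

```python
import numpy as np

def mgp_predict(f_z, f_mu, f_sigma):
    """Sketch of the MGP predictive mixture at a single input x.

    f_z:     gating GP outputs f_i^z(x), i = 1..M
    f_mu:    expert GP outputs f_i^mu(x)
    f_sigma: log-std GP outputs f_i^sigma(x)
    """
    f_z, f_mu, f_sigma = (np.asarray(a, dtype=float) for a in (f_z, f_mu, f_sigma))
    gate = np.exp(f_z - f_z.max())
    gate /= gate.sum()             # P(z = i | x): softmax over the gating GPs
    mean = float(gate @ f_mu)      # E(y|x) = sum_i P(z = i|x) f_i^mu(x)
    var_i = np.exp(2.0 * f_sigma)  # expert noise variances exp(2 f_i^sigma(x))
    # Total predictive variance via the law of total variance
    # (standard mixture identity; not spelled out in the paper).
    var = float(gate @ (var_i + f_mu ** 2) - mean ** 2)
    return gate, mean, var

# Two equally weighted experts at +1 and -1, unit noise variance each:
gate, mean, var = mgp_predict([0.0, 0.0], [1.0, -1.0], [0.0, 0.0])
```

In this toy call the gate is uniform, so the predictive mean is 0 while the predictive variance stays large because the experts disagree.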
\n\n4.1 EM Fisher Scoring Learning Rules \n\nAlthough a priori the functions f are Gaussian distributed, this is not necessarily true for the posterior distribution - in contrast to simple GPR in Section 2 - due to the nonlinear nature of the model. Therefore one is typically interested in the minimum of the negative logarithm of the posterior density \n\n- sum_{k=1}^N log sum_{i=1}^M P(z = i|x_k) G( y_k; f_i^mu(x_k), exp(2 f_i^sigma(x_k)) ) \n\n+ (1/2) sum_{i=1}^M (f_i^{mu,m})' (Sigma_i^{mu,m})^{-1} f_i^{mu,m} + (1/2) sum_{i=1}^M (f_i^{z,m})' (Sigma_i^{z,m})^{-1} f_i^{z,m} + (1/2) sum_{i=1}^M (f_i^{sigma,m})' (Sigma_i^{sigma,m})^{-1} f_i^{sigma,m}. \n\nThe superscript m denotes the vectors and matrices defined at the measurement points, e.g. f_i^{z,m} = (f_i^z(x_1), ..., f_i^z(x_N))'. In the E-step, based on the current estimates of the Gaussian processes at the data points, the state of the latent variable is estimated as \n\nP_hat(z = i|x_k, y_k) = P(z = i|x_k) G( y_k; f_i^mu(x_k), exp(2 f_i^sigma(x_k)) ) / sum_{j=1}^M P(z = j|x_k) G( y_k; f_j^mu(x_k), exp(2 f_j^sigma(x_k)) ). \n\nIn the M-step, based on the E-step, the Gaussian processes at the data points are updated. We obtain \n\nf_hat_i^{mu,m} = Sigma_i^{mu,m} ( Sigma_i^{mu,m} + W_i^{mu,m} )^{-1} y^m, \n\nwhere W_i^{mu,m} is a diagonal matrix with entries \n\n(W_i^{mu,m})_{kk} = exp(2 f_hat_i^sigma(x_k)) / P_hat(z = i|x_k, y_k). \n\nNote that data with a small P_hat(z = i|x_k, y_k) obtain a small weight. To update the other Gaussian processes, iterative Fisher scoring steps have to be used, as shown in the appendix. \n\nThere is a serious problem with overtraining in the MGP approach. The reason is that the GPR model with the highest bandwidth tends to obtain the highest weight in the E-step since it provides the best fit to the data. There is an easy fix for the MGP: for calculating the responses of the Gaussian processes at x_k in the E-step we use all training data except (x_k, y_k). Fortunately, this calculation is very cheap in the case of Gaussian processes since, for example, \n\ny_k - f_hat_i^{mu,(-k)}(x_k) = ( y_k - f_hat_i^mu(x_k) ) / ( 1 - S_{i,kk} ), \n\nwhere f_hat_i^{mu,(-k)}(x_k) denotes the estimate at the training data point x_k not using (x_k, y_k). 
Here, S_{i,kk} is the k-th diagonal element of S_i = Sigma_i^{mu,m} ( Sigma_i^{mu,m} + W_i^{mu,m} )^{-1}.^2 \n\n^2 See Hofmann (2000) for a discussion of the convergence of this type of algorithm. \n\n[Figure 1: four panels showing the three GPR maps, the GPR models after convergence, the gating probabilities P(z = i|x) and the final MGP fit.] \n\nFigure 1: The input data are generated from a Gaussian distribution with unit variance and mean 0. The output data are generated from a step function (o, bottom right). The top left plot shows the maps formed by three GPR models with different bandwidths. As can be seen, no individual model achieves a good map. Then an MGP model was trained using the three GPR models. The top right plot shows the GPR models after convergence. The bottom left plot shows P(z = i|x). The GPR model with the highest bandwidth models the transition at zero, the GPR model with an intermediate bandwidth models the intermediate region and the GPR model with the lowest bandwidth models the extreme regions. The bottom right plot shows the data (o) and the fit obtained by the complete MGP model, which is better than the map formed by any of the individual GPR models. \n\n4.2 Experiments \n\nFigure 1 illustrates how the MGP divides up a complex task into subtasks modeled by the individual GPR models (see caption). By dividing up the task, the MGP model can potentially achieve a performance which is better than the performance of any individual model. Table 1 shows results from artificial data sets and real-world data sets. 
In all cases, the performance of the MGP is better than the mean performance of the GPR models and also better than the performance of the mean (obtained by averaging the predictions of all GPR models). \n\nTable 1: Results using artificial and real data sets of size N = 100 using M = 10 GPR models. The data set ART is generated by adding Gaussian noise with a standard deviation of 0.2 to a map defined by 5 normalized Gaussian bumps. num. in. is the number of inputs. The bandwidth s was generated randomly between 0 and max. s. Furthermore, mean perf. is the mean squared test set error of all GPR networks and perf. of mean is the mean squared test set error achieved by simply averaging the predictions. The last column shows the performance of the MGP. \n\nData     | num. in. | max. s | mean perf. | perf. of mean | MGP \nART      |  1 |  1 | 0.0167 | 0.0080 | 0.0054 \nART      |  2 |  3 | 0.0573 | 0.0345 | 0.0239 \nART      |  5 |  6 | 0.1994 | 0.1383 | 0.0808 \nART      | 10 | 10 | 0.1670 | 0.1135 | 0.0739 \nART      | 20 | 20 | 0.1716 | 0.1203 | 0.0662 \nHOUSING  | 13 | 10 | 0.4677 | 0.3568 | 0.2634 \nBUPA     |  6 | 20 | 0.9654 | 0.9067 | 0.8804 \nDIABETES |  8 | 40 | 0.8230 | 0.7660 | 0.7275 \nWAVEFORM | 21 | 40 | 0.6295 | 0.5979 | 0.4453 \n\n5 Gaussian Processes for Graphical Models \n\nGaussian processes can be useful models for quantifying the dependencies in Bayesian networks and dependency networks (the latter were introduced in Hofmann and Tresp, 1998, and Heckerman et al., 2000), in particular when parent variables are continuous quantities. If the child variable is discrete, Gaussian process classification or the SVM are appropriate models, whereas when the child variable is continuous, the MGP model can be employed as a general conditional density estimator. Typically one would require that the continuous input variables x to the Gaussian process systems are known. It might therefore be useful to consider those as exogenous variables which modify the dependencies in a graphical model of y-variables, as shown in Figure 2 (right). As an example, consider a medical domain in which a Bayesian network of discrete variables y models the dependencies between diseases and symptoms and where these dependencies are modified by exogenous (often continuous) variables x representing quantities such as the patient's age, weight or blood pressure. Another example would be collaborative filtering, where y might represent a set of goods and the correlation between customer preferences is modeled by a dependency network as in Heckerman et al. (2000). Here, exogenous variables such as income, gender and social status might be useful quantities to modify those correlations. Note that the GPR model itself can also be considered to be a graphical model with dependencies modeled as Gaussian processes (compare Figure 2). \n\nReaders might also be interested in the related and independent paper by Friedman and Nachman (2000), in which those authors used GPR systems (not in the form of the MGP) to perform structural learning in Bayesian networks of continuous variables. \n\n6 Conclusions \n\nWe demonstrated that Gaussian processes can be useful building blocks for forming complex probabilistic models. In particular, we introduced the MGP model and demonstrated how Gaussian processes can model the dependencies in graphical models. 
\n\n7 Appendix \n\nFor f^z and f^sigma the mode estimates are found by iterating Newton-Raphson equations f^(l+1) = f^(l) - H^{-1}(l) J(l), where J(l) is the Jacobian and H(l) is the Hessian matrix, for which certain interactions are ignored. One obtains for l = 1, 2, ... the following update equations: \n\nf_i^{z,m,(l+1)} = Sigma_i^{z,m} ( Sigma_i^{z,m} + W_i^{z,m,(l)} )^{-1} ( W_i^{z,m,(l)} d_i^{z,m,(l)} + f_i^{z,m,(l)} ), where \n\nd_i^{z,m,(l)} = ( P_hat(z = i|x_k, y_k) - P^(l)(z = i|x_k) )_{k=1}^N, \n\nW_i^{z,m,(l)} = diag( [ P^(l)(z = i|x_k) (1 - P^(l)(z = i|x_k)) ]^{-1} )_{k=1}^N. \n\n[Figure 2: left, the graphical structure of the MGP model; right, a Bayesian network modified by Gaussian processes.] \n\nFigure 2: Left: The graphical structure of an MGP model consisting of the discrete latent variable z, the continuous variable y and the input variable x. The probability density of z is dependent on the Gaussian processes F^z. The probability distribution of y is dependent on the state of z and on the Gaussian processes F^mu, F^sigma. Right: An example of a Bayesian network which contains the variables y_1, y_2, y_3, y_4. Some of the dependencies are modified by x via Gaussian processes f_1, f_2, f_3. \n\nSimilarly, \n\nf_i^{sigma,m,(l+1)} = Sigma_i^{sigma,m} ( Sigma_i^{sigma,m} + W_i^{sigma,m,(l)} )^{-1} ( (v_i^{sigma,m,(l)} - e)/2 + f_i^{sigma,m,(l)} ), \n\nwhere e is an N-dimensional vector of ones, \n\nv_i^{sigma,m,(l)} = ( (y_k - f_hat_i^mu(x_k))^2 exp(-2 f_i^{sigma,(l)}(x_k)) )_{k=1}^N \n\nand \n\nW_i^{sigma,m,(l)} = diag( [ 2 P_hat(z = i|x_k, y_k) ]^{-1} )_{k=1}^N. \n\nReferences \n[1] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., Hinton, G. E. (1991). Adaptive Mixtures of Local Experts. Neural Computation, 3. \n[2] Tresp, V. (2000). The Generalized Bayesian Committee Machine. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD-2000. \n[3] Williams, C. K. I., Barber, D. (1998). Bayesian Classification with Gaussian Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12). \n[4] Fahrmeir, L., Tutz, G. (1994). Multivariate Statistical Modelling Based on Generalized Linear Models. Springer. \n[5] Sollich, P. (2000). Probabilistic Methods for Support Vector Machines. In Solla, S. A., Leen, T. K., Müller, K.-R. 
(Eds.), Advances in Neural Information Processing Systems 12, MIT Press. \n[6] Hofmann, R. (2000). Lernen der Struktur nichtlinearer Abhängigkeiten mit graphischen Modellen. PhD dissertation. \n[7] Hofmann, R., Tresp, V. (1998). Nonlinear Markov Networks for Continuous Variables. In Jordan, M. I., Kearns, M. S., Solla, S. A. (Eds.), Advances in Neural Information Processing Systems 10, MIT Press. \n[8] Heckerman, D., Chickering, D., Meek, C., Rounthwaite, R., Kadie, C. (2000). Dependency Networks for Inference, Collaborative Filtering, and Data Visualization. Journal of Machine Learning Research, 1. \n[9] Friedman, N., Nachman, I. (2000). Gaussian Process Networks. In Boutilier, C., Goldszmidt, M. (Eds.), Proc. Sixteenth Conf. on Uncertainty in Artificial Intelligence (UAI). \n", "award": [], "sourceid": 1900, "authors": [{"given_name": "Volker", "family_name": "Tresp", "institution": null}]}