{"title": "Sparse Convolved Gaussian Processes for Multi-output Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 57, "page_last": 64, "abstract": "We present a sparse approximation approach for dependent output Gaussian processes (GP). Employing a latent function framework, we apply the convolution process formalism to establish dependencies between output variables, where each latent function is represented as a GP. Based on these latent functions, we establish an approximation scheme using a conditional independence assumption between the output processes, leading to an approximation of the full covariance which is determined by the locations at which the latent functions are evaluated. We show results of the proposed methodology for synthetic data and real world applications on pollution prediction and a sensor network.", "full_text": "Sparse Convolved Gaussian Processes for\n\nMulti-output Regression\n\nMauricio Alvarez\n\nSchool of Computer Science\nUniversity of Manchester, U.K.\nalvarezm@cs.man.ac.uk\n\nNeil D. Lawrence\n\nSchool of Computer Science\nUniversity of Manchester, U.K.\n\nneill@cs.man.ac.uk\n\nAbstract\n\nWe present a sparse approximation approach for dependent output Gaussian pro-\ncesses (GP). Employing a latent function framework, we apply the convolution\nprocess formalism to establish dependencies between output variables, where each\nlatent function is represented as a GP. Based on these latent functions, we establish\nan approximation scheme using a conditional independence assumption between\nthe output processes, leading to an approximation of the full covariance which is\ndetermined by the locations at which the latent functions are evaluated. We show\nresults of the proposed methodology for synthetic data and real world applications\non pollution prediction and a sensor network.\n\n1 Introduction\n\nWe consider the problem of modeling correlated outputs from a single Gaussian process (GP). 
Applications of modeling multiple outputs include multi-task learning (see e.g. [1]) and jointly predicting the concentration of different heavy metal pollutants [5]. Modelling multiple output variables is a challenge as we are required to compute cross covariances between the different outputs. In geostatistics this is known as cokriging. Whilst cross covariances allow us to improve our predictions of one output given the others, because the correlations between outputs are modelled [6, 2, 15, 12], they also come with a computational and storage overhead. The main aim of this paper is to address these overheads in the context of convolution processes [6, 2].\n\nOne neat approach to account for non-trivial correlations between outputs employs convolution processes (CP). When using CPs each output can be expressed as the convolution between a smoothing kernel and a latent function [6, 2]. Let's assume that the latent function is drawn from a GP. If we also share the same latent function across several convolutions (each with a potentially different smoothing kernel) then, since a convolution is a linear operator on a function, the outputs of the convolutions can be expressed as a jointly distributed GP. It is this GP that is used to model the multi-output regression. This approach was proposed by [6, 2], who focussed on a white noise process for the latent function.\n\nEven though the CP framework is an elegant way for constructing dependent output processes, the fact that the full covariance function of the joint GP must be considered results in significant storage and computational demands. For Q output dimensions and N data points the covariance matrix is of size QN × QN, leading to O(Q^3 N^3) computational complexity and O(N^2 Q^2) storage. 
Whilst other approaches to modeling multiple output regression are typically more constraining in the types of cross covariance that can be expressed [1, 15], these constraints also lead to structured covariance functions for which inference and learning are typically more efficient (typically for N > Q these methods have O(N^3 Q) computation and O(N^2 Q) storage). We are interested in exploiting the richer class of covariance structures allowed by the CP framework, but without the additional computational overhead they imply.\n\nWe propose a sparse approximation for the full covariance matrix involved in the multiple output convolution process, exploiting the fact that each of the outputs is conditionally independent of all others given the input process. This leads to an approximation for the covariance matrix which keeps intact the covariances of each output and approximates the cross-covariance terms with a low rank matrix. Inference and learning can then be undertaken with the same computational complexity as a set of independent GPs. The approximation turns out to be strongly related to the partially independent training conditional (PITC) [10] approximation for a single output GP. This inspires us to consider a further conditional independence assumption across data points that leads to an approximation which shares the form of the fully independent training conditional (FITC) approximation [13, 10], reducing computational complexity to O(NQM^2) and storage to O(NQM), with M representing a user specified value.\n\nTo introduce our sparse approximation some review of the CP framework is required (Section 2). Then in Section 3, we present sparse approximations for the multi-output GP. We discuss relations with other approaches in Section 4. 
Finally, in Section 5, we demonstrate the approach on both synthetic and real datasets.\n\n2 Convolution Processes\n\nConsider a set of Q functions {f_q(x)}_{q=1}^Q, where each function is expressed as the convolution between a smoothing kernel {k_q(x)}_{q=1}^Q and a latent function u(z),\n\nf_q(x) = ∫_{−∞}^{∞} k_q(x − z) u(z) dz.\n\nMore generally, we can consider the influence of more than one latent function, {u_r(z)}_{r=1}^R, and corrupt each of the outputs of the convolutions with an independent process (which could also include a noise term), w_q(x), to obtain\n\ny_q(x) = f_q(x) + w_q(x) = ∑_{r=1}^R ∫_{−∞}^{∞} k_{qr}(x − z) u_r(z) dz + w_q(x).    (1)\n\nThe covariance between two different functions y_q(x) and y_s(x') is then recovered as\n\ncov[y_q(x), y_s(x')] = cov[f_q(x), f_s(x')] + cov[w_q(x), w_s(x')] δ_{qs},\n\nwhere\n\ncov[f_q(x), f_s(x')] = ∑_{r=1}^R ∑_{p=1}^R ∫_{−∞}^{∞} k_{qr}(x − z) ∫_{−∞}^{∞} k_{sp}(x' − z') cov[u_r(z), u_p(z')] dz' dz.    (2)\n\nThis equation is a general result; in [6, 2] the latent functions u_r(z) are assumed to be independent white Gaussian noise processes, i.e. cov[u_r(z), u_p(z')] = σ_{u_r}^2 δ_{rp} δ_{z,z'}, so expression (2) simplifies to\n\ncov[f_q(x), f_s(x')] = ∑_{r=1}^R σ_{u_r}^2 ∫_{−∞}^{∞} k_{qr}(x − z) k_{sr}(x' − z) dz.\n\nWe are going to relax this constraint on the latent processes: we assume that each inducing function is an independent GP, i.e. cov[u_r(z), u_p(z')] = k_{u_r u_p}(z, z') δ_{rp}, where k_{u_r u_r}(z, z') is the covariance function for u_r(z). 
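As a numerical sanity check of the white noise construction above: for Gaussian smoothing kernels the convolution integral has a closed form that quadrature should reproduce. This is only an illustrative sketch; the kernel k_q(τ) = exp(−a_q τ^2), the parameter values and the function names below are assumptions for the example, not the paper's parameterization.

```python
import numpy as np

def k_smooth(tau, a):
    """Illustrative Gaussian smoothing kernel exp(-a * tau^2)."""
    return np.exp(-a * tau**2)

def cov_f_quadrature(x, xp, a_q, a_s, sigma_u2=1.0):
    """cov[f_q(x), f_s(x')] for a white-noise latent process, i.e.
    sigma_u^2 * integral of k_q(x - z) k_s(x' - z) dz, evaluated
    as a Riemann sum on a fine grid over the effective support."""
    z = np.linspace(-20.0, 20.0, 20001)
    integrand = k_smooth(x - z, a_q) * k_smooth(xp - z, a_s)
    return sigma_u2 * np.sum(integrand) * (z[1] - z[0])

def cov_f_closed_form(x, xp, a_q, a_s, sigma_u2=1.0):
    """Closed form of the same Gaussian integral:
    sqrt(pi / (a_q + a_s)) * exp(-a_q a_s (x - x')^2 / (a_q + a_s))."""
    return sigma_u2 * np.sqrt(np.pi / (a_q + a_s)) * \
        np.exp(-a_q * a_s * (x - xp)**2 / (a_q + a_s))

num = cov_f_quadrature(0.3, -0.5, a_q=2.0, a_s=1.0)
exact = cov_f_closed_form(0.3, -0.5, a_q=2.0, a_s=1.0)
assert abs(num - exact) < 1e-6
```

The same check extends to the relaxed construction below: one extra quadrature over z', with k_{u_r u_r}(z, z') inside the double integral.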
With this simplification, (2) can be written as\n\ncov[f_q(x), f_s(x')] = ∑_{r=1}^R ∫_{−∞}^{∞} k_{qr}(x − z) ∫_{−∞}^{∞} k_{sr}(x' − z') k_{u_r u_r}(z, z') dz' dz.    (3)\n\nAs well as this correlation across outputs, the correlation between the latent function, u_r(z), and any given output, f_q(x), can be computed,\n\ncov[f_q(x), u_r(z)] = ∫_{−∞}^{∞} k_{qr}(x − z') k_{u_r u_r}(z', z) dz'.    (4)\n\n3 Sparse Approximation\n\nGiven the convolution formalism, we can construct a full GP over the set of outputs. The likelihood of the model is given by\n\np(y|X, θ) = N(0, K_{f,f} + Σ),    (5)\n\nwhere y = [y_1^⊤, . . . , y_Q^⊤]^⊤ is the set of output functions with y_q = [y_q(x_1), . . . , y_q(x_N)]^⊤; K_{f,f} is the QN × QN covariance matrix whose blocks are given by (3); Σ is the covariance of the independent processes w_q(x); and θ are the parameters of the kernels and covariance functions. If the latent functions u_r(z) were fully observed, the outputs would be conditionally independent of one another. Our key assumption is that this independence will hold even if we have only observed M samples from u_r(z) rather than the whole function. The observed values of these M samples are then marginalized (as they are for the exact case) to obtain the approximation to the likelihood. Our intuition is that the approximation should be more accurate for larger M and smoother latent functions, as in this domain the latent function could be very well characterized from only a few samples.\n\nWe define u = [u_1^⊤, . . . , u_R^⊤]^⊤ as the samples from the latent functions, with u_r = [u_r(z_1), . . . , u_r(z_M)]^⊤; K_{u,u} is then the covariance matrix between the samples from the latent functions u_r(z), with elements given by k_{u_r u_r}(z, z'); K_{f,u} = K_{u,f}^⊤ are the cross-covariance matrices between the latent functions u_r(z) and the outputs f_q(x), with elements cov[f_q(x), u_r(z)] in (4); and Z = {z1, . . .
, zM} is the set of input vectors at which the covariance K_{u,u} is evaluated.\n\nWe now make the conditional independence assumption given the samples from the latent functions,\n\np(y|u, Z, X, θ) = ∏_{q=1}^Q p(y_q|u, Z, X, θ) = ∏_{q=1}^Q N(K_{f_q,u} K_{u,u}^{−1} u, K_{f_q,f_q} − K_{f_q,u} K_{u,u}^{−1} K_{u,f_q} + σ_q^2 I).    (6)\n\nWe rewrite this product as a single Gaussian with a block diagonal covariance matrix,\n\np(y|u, Z, X, θ) = N(K_{f,u} K_{u,u}^{−1} u, D + Σ),\n\nwhere D = blockdiag[K_{f,f} − K_{f,u} K_{u,u}^{−1} K_{u,f}], and we have used the notation blockdiag[G] to indicate that the block associated with each output of the matrix G should be retained, but all other elements should be set to zero. We can also write this as D = [K_{f,f} − K_{f,u} K_{u,u}^{−1} K_{u,f}] ⊙ M, where ⊙ is the Hadamard product and M = I_Q ⊗ 1_N, 1_N being the N × N matrix of ones and ⊗ being the Kronecker product. We now marginalize the values of the samples from the latent functions by using their process priors, i.e. p(u|Z) = N(0, K_{u,u}). This leads to the following marginal likelihood,\n\np(y|Z, X, θ) = ∫ p(y|u, Z, X, θ) p(u|Z) du = N(0, D + K_{f,u} K_{u,u}^{−1} K_{u,f} + Σ).    (7)\n\nNotice that, compared to (5), the full covariance matrix K_{f,f} has been replaced by the low rank covariance K_{f,u} K_{u,u}^{−1} K_{u,f} in all entries except in the diagonal blocks corresponding to K_{f_q,f_q}. When using the marginal likelihood for learning, the computational load is associated with the calculation of the inverse of D. The complexity of this inversion is O(N^3 Q) + O(NQM^2), and storage of the matrix is O(N^2 Q) + O(NQM). Note that if we set M = N these reduce to O(N^3 Q) and O(N^2 Q) respectively, which matches the computational complexity of applying Q independent GPs to model the multiple outputs.\n\nCombining eq. 
(6) with p(u|Z) using Bayes' theorem, the posterior distribution over u is obtained as\n\np(u|y, X, Z, θ) = N(K_{u,u} A^{−1} K_{u,f} (D + Σ)^{−1} y, K_{u,u} A^{−1} K_{u,u}),    (8)\n\nwhere A = K_{u,u} + K_{u,f} (D + Σ)^{−1} K_{f,u}. The predictive distribution is expressed through the integration of (6), evaluated at X*, with (8), giving\n\np(y*|y, X, X*, Z, θ) = ∫ p(y*|u, Z, X*, θ) p(u|y, X, Z, θ) du = N(K_{f*,u} A^{−1} K_{u,f} (D + Σ)^{−1} y, D* + K_{f*,u} A^{−1} K_{u,f*} + Σ),    (9)\n\nwith D* = blockdiag[K_{f*,f*} − K_{f*,u} K_{u,u}^{−1} K_{u,f*}].\n\nThe functional form of (7) is almost identical to that of the PITC approximation [10], with the samples we retain from the latent function playing the same role as the inducing values in the partially independent training conditional (PITC) approximation. This is perhaps not surprising given that the nature of the conditional independence assumptions in PITC is similar to those we have made. A key difference is that in PITC it is not obvious which variables should be grouped together when making the conditional independence assumption; here it is clear from the structure of the model that each of the outputs should be grouped separately. However, the similarities are such that we find it convenient to follow the terminology of [10] and also refer to our approximation as a PITC approximation.\n\nWe have already noted that our sparse approximation reduces the computational complexity of multi-output regression with GPs to that of applying independent GPs to each output. For larger data sets the N^3 term in the computational complexity and the N^2 term in the storage are still likely to be prohibitive. However, we can be inspired by the analogy of our approach to the PITC approximation and consider a more radical factorization of the outputs. 
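The structure of the approximate covariance in (7), exact diagonal blocks plus a low-rank term elsewhere, can be made concrete in a few lines. For illustration, the matrices K_{f,f}, K_{f,u} and K_{u,u} below come from an SLFM-style rank-one construction with a single latent function rather than the convolution integrals (3) and (4); the sizes, the sensitivities s and the noise level are all assumptions for the example.

```python
import numpy as np

def se_kernel(X, Z, ell=1.0):
    """Squared exponential covariance between 1-D input sets X and Z."""
    d2 = (X[:, None] - Z[None, :])**2
    return np.exp(-0.5 * d2 / ell**2)

# Toy sizes: Q outputs, N points per output, M inducing inputs.
Q, N, M = 3, 50, 10
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1, 1, N))
Z = np.linspace(-1, 1, M)
s = np.array([1.0, 0.5, 2.0])          # per-output sensitivities (illustrative)

Kuu = se_kernel(Z, Z) + 1e-8 * np.eye(M)
Kfu = np.vstack([s[q] * se_kernel(X, Z) for q in range(Q)])   # (QN, M)
Kff = np.kron(np.outer(s, s), se_kernel(X, X))                # (QN, QN)

# Low-rank term Q_ff = K_fu K_uu^{-1} K_uf, then the PITC covariance:
Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
D = np.zeros_like(Kff)
for q in range(Q):                      # keep exact diagonal blocks per output
    blk = slice(q * N, (q + 1) * N)
    D[blk, blk] = Kff[blk, blk] - Qff[blk, blk]
K_pitc = D + Qff + 0.01 * np.eye(Q * N)  # plus isotropic noise Sigma

# By construction the diagonal blocks of K_pitc equal the exact ones.
blk = slice(0, N)
assert np.allclose(K_pitc[blk, blk], Kff[blk, blk] + 0.01 * np.eye(N))
```

The block diagonal correction D is what distinguishes PITC from a plain low-rank (Nyström-style) approximation: each output's covariance with itself is kept exact, and only the cross-covariances between different outputs are approximated.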
In the fully independent training conditional (FITC) approximation [13, 14] a factorization across the data points is assumed. For us that would lead to the following expression for the conditional distribution of the output functions given the inducing variables, p(y|u, Z, X, θ) = ∏_{q=1}^Q ∏_{n=1}^N p(y_{qn}|u, Z, X, θ), which can be briefly expressed through (6) with D = diag[K_{f,f} − K_{f,u} K_{u,u}^{−1} K_{u,f}] = [K_{f,f} − K_{f,u} K_{u,u}^{−1} K_{u,f}] ⊙ M, with M = I_Q ⊗ I_N. Similar equations are obtained for the posterior (8), predictive (9) and marginal likelihood (7) distributions, leading to the fully independent training conditional (FITC) approximation [13, 10]. Note that the marginal likelihood might be optimized both with respect to the parameters associated with the covariance matrices and with respect to Z. In the supplementary material we include the derivatives of the marginal likelihood with respect to the matrices K_{f,f}, K_{u,f} and K_{u,u}.\n\n4 Related work\n\nThere have been several suggestions for constructing multiple output GPs [2, 15, 1]. Under the convolution process framework, the semiparametric latent factor model (SLFM) proposed in [15] corresponds to a specific choice for the smoothing kernel function in (1), namely k_{qr}(x) = φ_{qr} δ(x). The latent functions are assumed to be independent GPs and in such a case, cov[f_q(x), f_s(x')] = ∑_r φ_{qr} φ_{sr} k_{u_r u_r}(x, x'). This can be written using matrix notation as K_{f,f} = (Φ ⊗ I) K_{u,u} (Φ^⊤ ⊗ I). For computational speed up the informative vector machine (IVM) is employed [8].\n\nIn the multi-task learning model (MTLM) proposed in [1], the covariance matrix is expressed as K_{f,f} = K^f ⊗ k(x, x'), with K^f being constrained positive semi-definite and k(x, x') a covariance function over inputs. The Nyström approximation is applied to k(x, x'). 
As stated in [1] with respect to SLFM, the convolution process is related to MTLM when the smoothing kernel function is given again by k_{qr}(x) = φ_{qr} δ(x) and there is only one latent function with covariance k_{uu}(x, x') = k(x, x'). In this way, cov[f_q(x), f_s(x')] = φ_q φ_s k(x, x') and in matrix notation K_{f,f} = ΦΦ^⊤ ⊗ k(x, x'). In [2], the latent processes correspond to white Gaussian noises and the covariance matrix is given by eq. (3). In that work, the complexity of the computational load is not discussed. Finally, [12] use a similar covariance function to the MTLM approach but use an IVM style approach to sparsification.\n\nNote that in each of the approaches detailed above a δ function is introduced into the integral. In the dependent GP model of [2] it is introduced in the covariance function. Our approach considers the more general case when neither kernel nor covariance function is given by the δ function.\n\n5 Results\n\nFor all our experiments we considered squared exponential covariance functions for the latent process of the form k_{u_r u_r}(x, x') = exp[−(1/2)(x − x')^⊤ L_r (x − x')], where L_r is a diagonal matrix which allows for different length-scales along each dimension. The smoothing kernel had the same form, k_{qr}(τ) = S_{qr} |L_{qr}|^{1/2} (2π)^{−p/2} exp[−(1/2) τ^⊤ L_{qr} τ], where S_{qr} ∈ R and L_{qr} is a symmetric positive definite matrix. For this kernel/covariance function combination the necessary integrals are tractable (see supplementary material).\n\nWe first set up a toy problem in which we evaluate the quality of the prediction and the speed of the approximation. The toy problem consists of Q = 4 outputs, one latent function, R = 1, and N = 200 observation points for each output. 
The training data was sampled from the full GP with the following parameters: S11 = S21 = 1, S31 = S41 = 5, L11 = L21 = 50, L31 = 300, L41 = 200 for the outputs and L1 = 100 for the latent function. For the independent processes, w_q(x), we simply added white noise with variances σ_1^2 = σ_2^2 = 0.0125, σ_3^2 = 1.2 and σ_4^2 = 1. For the sparse approximations we used M = 30 fixed inducing points equally spaced over the range of the input and R = 1. We sought the kernel parameters through maximizing the marginal likelihood using a scaled conjugate gradient algorithm. For test data we removed a portion of one output as shown in Figure 1 (points in the interval [−0.8, 0] were removed). The predictions shown correspond to the full GP (Figure 1(a)), an independent GP (Figure 1(b)), the FITC approximation (Figure 1(c)) and the PITC approximation (Figure 1(d)). Due to the strong dependencies between the signals, our model is able to capture the correlations and predicts the missing information accurately.\n\nTable 1 shows prediction results over an independent test set. We used 300 points to compute the standardized mean square error (SMSE) [11] and ten repetitions of the experiment, so that we also included one standard deviation for the ten repetitions. 
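The SMSE follows [11]: the test mean squared error divided by the variance of the test targets, so that the trivial predictor that always outputs the target mean scores approximately one. A minimal sketch (variable names are illustrative):

```python
import numpy as np

def smse(y_true, y_pred):
    """Standardized mean square error: MSE divided by the variance
    of the test targets, so a constant mean predictor scores ~1."""
    return np.mean((y_true - y_pred)**2) / np.var(y_true)

y = np.array([0.0, 1.0, 2.0, 3.0])
assert smse(y, y) == 0.0                                # perfect prediction
assert np.isclose(smse(y, np.full(4, y.mean())), 1.0)   # trivial mean predictor
```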
The training times per iteration of each model are 1.45 ± 0.23 secs for the full GP, 0.29 ± 0.02 secs for FITC and 0.48 ± 0.01 secs for PITC. Table 1 shows that the SMSE of the sparse approximations is similar to the one obtained with the full GP, with a considerable reduction of training times.\n\nMethod    Output 1       Output 2       Output 3       Output 4\nFull GP   1.07 ± 0.08    0.99 ± 0.03    1.12 ± 0.07    1.05 ± 0.07\nFITC      1.08 ± 0.09    1.00 ± 0.03    1.13 ± 0.07    1.04 ± 0.07\nPITC      1.07 ± 0.08    0.99 ± 0.03    1.12 ± 0.07    1.05 ± 0.07\n\nTable 1: Standardized mean square error (SMSE) for the toy problem over an independent test set. All numbers are to be multiplied by 10^{−2}. The experiment was repeated ten times; the table includes one standard deviation over the ten repetitions.\n\nWe now follow a similar analysis for a dataset consisting of weather data collected from a sensor network located on the south coast of England. The network includes four sensors (named Bramblemet, Sotonmet, Cambermet and Chimet), each of which measures several environmental variables [12]. We selected one of the sensor signals, tide height, and applied the PITC approximation scheme with an additional squared exponential independent kernel for each w_q(x) [11]. Here Q = 4 and we chose N = 1000 of the 4320 available points for the training set, leaving the remaining points for testing. For comparison we also trained a set of independent GP models. We followed [12] in simulating sensor failure by introducing some missing ranges for these signals. In particular, we have a missing range\n\n(a) Output 4 using the full GP    (b) Output 4 using an independent GP    (c) Output 4 using the FITC approximation    (d) Output 4 using the PITC approximation\n\nFigure 1: Predictive mean and variance using the full multi-output GP, the sparse approximation and an independent GP for output 4. 
The solid line corresponds to the predictive mean, the shaded region corresponds to 2 standard deviations away from the mean, and the dashed line is the actual value of the signal without noise. The dots are the noisy training points. There is a range of missing data in the interval [−0.8, 0.0]. The crosses in figures 1(c) and 1(d) correspond to the locations of the inducing inputs.\n\nof [0.6, 1.2] for the Bramblemet tide height sensor and [1.5, 2.1] for the Cambermet. For the other two sensors we used all 1000 training observations. For the sparse approximation we took M = 100 equally spaced inducing inputs. We see from Figure 2 that the PITC approximation captures the dependencies and closely predicts the behavior of the signal in the missing range. This contrasts with the behavior of the independent model, which is not able to follow the original signal.\n\nAs another example we employ the Jura dataset, which consists of measurements of concentrations of several heavy metals collected in the topsoil of a 14.5 km^2 region of the Swiss Jura. The data is divided into a prediction set (259 locations) and a validation set (100 locations)1. In a typical situation, referred to as the undersampled or heterotopic case, a few expensive measurements of the attribute of interest are supplemented by more abundant data on correlated attributes that are cheaper to sample. We follow the experiments described in [5, p. 248, 249] in which a primary variable (cadmium and copper) at prediction locations, in conjunction with some secondary variables (nickel and zinc for cadmium; lead, nickel and zinc for copper) at prediction and validation locations, are employed to predict the concentration of the primary variable at validation locations. We compare results of independent GP, the PITC approximation, the full GP and ordinary co-kriging. 
For the PITC experiments, a k-means procedure is employed first to find the initial locations of the inducing values, and then these locations are optimized in the same optimization procedure used for the parameters. Each experiment is repeated ten times. The results for ordinary co-kriging were obtained from [5, p. 248, 249]; in this case, no values for standard deviation are reported. Figure 3 shows results of prediction for cadmium (Cd) and copper (Cu). From figure 3(a), it can be noticed that using 50 inducing values, the approximation exhibits a similar performance to the co-kriging method. As more\n\n1This data is available at http://www.ai-geostats.org/\n\n(a) Bramblemet using an independent GP    (b) Bramblemet using PITC    (c) Cambermet using an independent GP    (d) Cambermet using PITC\n\nFigure 2: Predictive mean and variance using independent GPs and the PITC approximation for the tide height signal in the sensor dataset. The dots indicate the training observations while the dashes indicate the testing observations. We have emphasized the size of the training points to differentiate them from the testing points. The solid line corresponds to the predictive mean. The crosses in figures 2(b) and 2(d) correspond to the locations of the inducing inputs.\n\ninducing values are included, the approximation follows the performance of the full GP, as would be expected. From figure 3(b), it can be observed that, although the approximation is better than the independent GP, it does not obtain results similar to those of the full GP. Summary statistics of the prediction data ([5, p. 
15]) shows higher variability for the copper dataset than for the cadmium dataset, which explains to some extent the different behaviors.\n\n(a) Cadmium (Cd)    (b) Copper (Cu)\n\nFigure 3: Mean absolute error and standard deviation for ten repetitions of the experiment for the Jura dataset. At the bottom of each figure, IGP stands for independent GP, P(M) stands for PITC with M inducing values, FGP stands for full GP and CK stands for ordinary co-kriging (see [5] for a detailed description).\n\n6 Conclusions\n\nWe have presented a sparse approximation for multiple output GPs, capturing the correlated information among outputs and reducing the computational load for prediction and optimization purposes. The reduction in computational complexity for the PITC approximation is from O(N^3 Q^3) to O(N^3 Q). This matches the computational complexity of modeling with independent GPs. However, as we have seen, the predictive power of independent GPs is lower.\n\nLinear dynamical systems responses can be expressed as a convolution between the impulse response of the system and some input function. This convolution approach is an equivalent way of representing the behavior of the system through a linear differential equation. For systems involving large numbers of coupled differential equations [4], the approach presented here is a reasonable way of obtaining approximate solutions and incorporating prior domain knowledge into the model.\n\nOne could optimize with respect to the positions of the values of the latent functions. 
As the input dimension grows, it might be more difficult to obtain an acceptable response. Some solutions to this problem have already been proposed [14].\n\nAcknowledgments\n\nWe thank the authors of [12], who kindly made the sensor network database available.\n\nReferences\n\n[1] E. V. Bonilla, K. M. Chai, and C. K. I. Williams. Multi-task Gaussian process prediction. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, NIPS, volume 20, Cambridge, MA, 2008. MIT Press. In press.\n\n[2] P. Boyle and M. Frean. Dependent Gaussian processes. In L. Saul, Y. Weiss, and L. Bottou, editors, NIPS, volume 17, pages 217–224, Cambridge, MA, 2005. MIT Press.\n\n[3] M. Brookes. The matrix reference manual. Available on-line, 2005. http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html.\n\n[4] P. Gao, A. Honkela, M. Rattray, and N. D. Lawrence. Gaussian process modelling of latent chemical species: Applications to inferring transcription factor activities. Bioinformatics, 24(16):i70–i75, 2008.\n\n[5] P. Goovaerts. Geostatistics For Natural Resources Evaluation. Oxford University Press, 1997. ISBN 0-19-511538-4.\n\n[6] D. M. Higdon. Space and space-time modelling using process convolutions. In C. Anderson, V. Barnett, P. Chatwin, and A. El-Shaarawi, editors, Quantitative methods for current environmental issues, pages 37–56. Springer-Verlag, 2002.\n\n[7] N. D. Lawrence. Learning for larger datasets with the Gaussian process latent variable model. In Meila and Shen [9].\n\n[8] N. D. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: The informative vector machine. In S. Becker, S. Thrun, and K. Obermayer, editors, NIPS, volume 15, pages 625–632, Cambridge, MA, 2003. MIT Press.\n\n[9] M. Meila and X. Shen, editors. AISTATS, San Juan, Puerto Rico, 21-24 March 2007. Omnipress.\n\n[10] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. JMLR, 6:1939–1959, 2005.\n\n[11] C. E. 
Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006. ISBN 0-262-18253-X.\n\n[12] A. Rogers, M. A. Osborne, S. D. Ramchurn, S. J. Roberts, and N. R. Jennings. Towards real-time information processing of sensor network data using computationally efficient multi-output Gaussian processes. In Proceedings of the International Conference on Information Processing in Sensor Networks (IPSN 2008), 2008. In press.\n\n[13] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, NIPS, volume 18, Cambridge, MA, 2006. MIT Press.\n\n[14] E. Snelson and Z. Ghahramani. Local and global sparse Gaussian process approximations. In Meila and Shen [9].\n\n[15] Y. W. Teh, M. Seeger, and M. I. Jordan. Semiparametric latent factor models. In R. G. Cowell and Z. Ghahramani, editors, AISTATS 10, pages 333–340, Barbados, 6-8 January 2005. Society for Artificial Intelligence and Statistics.\n", "award": [], "sourceid": 170, "authors": [{"given_name": "Mauricio", "family_name": "Alvarez", "institution": null}, {"given_name": "Neil", "family_name": "Lawrence", "institution": null}]}