{"title": "Heterogeneous Multi-output Gaussian Process Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 6711, "page_last": 6720, "abstract": "We present a novel extension of multi-output Gaussian processes for handling heterogeneous outputs. We assume that each output has its own likelihood function and use a vector-valued Gaussian process prior to jointly model the parameters in all likelihoods as latent functions. Our multi-output Gaussian process uses a covariance function with a linear model of coregionalisation form. Assuming conditional independence across the underlying latent functions together with an inducing variable framework, we are able to obtain tractable variational bounds amenable to stochastic variational inference. We illustrate the performance of the model on synthetic data and two real datasets: a human behavioral study and a demographic high-dimensional dataset.", "full_text": "Heterogeneous Multi-output Gaussian Process\n\nPrediction\n\nPablo Moreno-Mu\u00f1oz1\nMauricio A. \u00c1lvarez2\n1Dept. of Signal Theory and Communications, Universidad Carlos III de Madrid, Spain\n\nAntonio Art\u00e9s-Rodr\u00edguez1\n\n2Dept. of Computer Science, University of Shef\ufb01eld, UK\n\n{pmoreno,antonio}@tsc.uc3m.es, mauricio.alvarez@sheffield.ac.uk\n\nAbstract\n\nWe present a novel extension of multi-output Gaussian processes for handling\nheterogeneous outputs. We assume that each output has its own likelihood function\nand use a vector-valued Gaussian process prior to jointly model the parameters\nin all likelihoods as latent functions. Our multi-output Gaussian process uses a\ncovariance function with a linear model of coregionalisation form. Assuming\nconditional independence across the underlying latent functions together with an\ninducing variable framework, we are able to obtain tractable variational bounds\namenable to stochastic variational inference. 
We illustrate the performance of the model on synthetic data and two real datasets: a human behavioral study and a demographic high-dimensional dataset.\n\n1 Introduction\n\nMulti-output Gaussian processes (MOGP) generalise the powerful Gaussian process (GP) predictive model to the vector-valued random field setup (Alvarez et al., 2012). It has been experimentally shown that by simultaneously exploiting correlations between multiple outputs and across the input space, it is possible to provide better predictions, particularly in scenarios with missing or noisy data (Bonilla et al., 2008; Dai et al., 2017).\nThe main focus in the literature for MOGP has been on the definition of a suitable cross-covariance function between the multiple outputs that allows for the treatment of the outputs as a single GP with a properly defined covariance function (Alvarez et al., 2012). The two classical alternatives to define such cross-covariance functions are the linear model of coregionalisation (LMC) (Journel and Huijbregts, 1978) and process convolutions (Higdon, 2002). In the former case, each output corresponds to a weighted sum of shared latent random functions. In the latter, each output is modelled as the convolution integral between a smoothing kernel and a latent random function common to all outputs. In both cases, the unknown latent functions follow Gaussian process priors, leading to straightforward expressions for the cross-covariance functions among different outputs. More recent alternatives to build valid covariance functions for MOGP include the work by Ulrich et al. (2015) and Parra and Tobar (2017), who build the cross-covariances in the spectral domain.\nRegarding the type of outputs that can be modelled, most alternatives focus on multiple-output regression for continuous variables. 
Traditionally, each output is assumed to follow a Gaussian likelihood, where the mean function is given by one of the outputs of the MOGP and the variance in that distribution is treated as an unknown parameter. Bayesian inference is tractable for these models. In this paper, we are interested in the heterogeneous case, for which the outputs are a mix of continuous, categorical, binary or discrete variables with different likelihood functions.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nThere have been few attempts to extend the MOGP to other types of likelihoods. For example, Skolidis and Sanguinetti (2011) use the outputs of a MOGP for jointly modelling several binary classification problems, each of which uses a probit likelihood. They use an intrinsic coregionalisation model (ICM), a particular case of the LMC. Posterior inference is performed using expectation propagation (EP) and variational mean field. Both Chai (2012) and Dezfouli and Bonilla (2015) have also used ICM for modelling a single categorical variable with a multinomial logistic likelihood. The outputs of the ICM model are used as replacements for the linear predictors in the softmax function. Chai (2012) derives a particular variational bound for the marginal likelihood and computes Gaussian posterior distributions; Dezfouli and Bonilla (2015) introduce a scalable inference procedure that uses a mixture of Gaussians to approximate the posterior distribution using automated variational inference (AVI) (Nguyen and Bonilla, 2014a), which requires sampling from univariate Gaussians.\nFor the single-output GP case, the usual practice for handling non-Gaussian likelihoods has been replacing the parameters or linear predictors of the non-Gaussian likelihood by one or more independent GP priors. Since computing posterior distributions becomes intractable, different alternatives have been offered for approximate inference. 
Examples include the Gaussian heteroscedastic regression model with variational inference (Lázaro-Gredilla and Titsias, 2011), the Laplace approximation (Vanhatalo et al., 2013), and stochastic variational inference (SVI) (Saul et al., 2016). This last reference uses the same idea for modulating the parameters of a Student-t likelihood, a log-logistic distribution, a beta distribution and a Poisson distribution. The generalised Wishart process (Wilson and Ghahramani, 2011) is another example, where the entries of the scale matrix of a Wishart distribution are modulated by independent GPs.\nOur main contribution in this paper is to provide an extension of multiple-output Gaussian processes for prediction in heterogeneous datasets. The key principle in our model is to use the outputs of a MOGP as the latent functions that modulate the parameters of several likelihood functions, one likelihood function per output. We tackle the model's intractability using variational inference. Furthermore, we use the inducing variable formalism for MOGP introduced by Alvarez and Lawrence (2009) and compute a variational bound suitable for stochastic optimisation, as in Hensman et al. (2013). We experimentally provide evidence of the benefits of simultaneously modelling heterogeneous outputs in different applied problems. Our model can be seen as a generalisation of Saul et al. (2016) to multiple correlated output functions of a heterogeneous nature. Our Python implementation follows the spirit of Hadfield et al. 
(2010), where the user only needs to specify a list of likelihood functions, e.g. likelihood_list = [Bernoulli(), Poisson(), HetGaussian()], where HetGaussian refers to the heteroscedastic Gaussian distribution, and the number of latent parameter functions per likelihood is assigned automatically.\n\n2 Heterogeneous Multi-output Gaussian process\n\nConsider a set of output functions Y = {y_d(x)}_{d=1}^{D}, with x ∈ R^p, that we want to jointly model using Gaussian processes. Traditionally, the literature has considered the case for which each y_d(x) is continuous and Gaussian distributed. In this paper, we are interested in the heterogeneous case, for which the outputs in Y are a mix of continuous, categorical, binary or discrete variables with several different distributions. In particular, we will assume that the distribution over y_d(x) is completely specified by a set of parameters θ_d(x) ∈ X^{J_d}, where we have a generic domain X for the parameters and J_d is the number of parameters that define the distribution. Each parameter θ_{d,j}(x) ∈ θ_d(x) is a non-linear transformation of a Gaussian process prior f_{d,j}(x), that is, θ_{d,j}(x) = g_{d,j}(f_{d,j}(x)), where g_{d,j}(·) is a deterministic function that maps the GP output to the appropriate domain for the parameter θ_{d,j}.\nTo make the notation concrete, let us assume a heterogeneous multiple-output problem for which D = 3. Assume that output y_1(x) is binary and that it will be modelled using a Bernoulli distribution. The Bernoulli distribution uses a single parameter (the probability of success), J_1 = 1, restricted to values in the range [0, 1]. This means that θ_1(x) = θ_{1,1}(x) = g_{1,1}(f_{1,1}(x)), where g_{1,1}(·) could be modelled using the logistic sigmoid function σ(z) = 1/(1 + exp(−z)), which maps σ : R → [0, 1]. Assume now that the second output y_2(x) corresponds to a count variable that can take values y_2(x) ∈ N ∪ {0}. 
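As a concrete illustration of these deterministic mappings g_{d,j}(·), the sketch below (our own minimal example, not the paper's released code) implements the logistic sigmoid link described above, together with the exponential link that is a common choice for strictly positive parameters such as a Poisson rate or a Gaussian variance:

```python
import numpy as np

def sigmoid(z):
    """Logistic link g(z) = 1 / (1 + exp(-z)): maps R -> (0, 1).

    Suitable for the success probability of a Bernoulli likelihood."""
    return 1.0 / (1.0 + np.exp(-z))

def exp_link(z):
    """Exponential link g(z) = exp(z): maps R -> (0, inf).

    A common choice for strictly positive parameters, e.g. a rate or a variance."""
    return np.exp(z)

# A latent parameter function value f_{1,1}(x) = 0 yields probability 0.5.
f = np.array([-2.0, 0.0, 2.0])   # hypothetical latent function values
probs = sigmoid(f)               # valid Bernoulli probabilities, in (0, 1)
rates = exp_link(f)              # valid Poisson rates, strictly positive
```

Any monotone map with the right range would serve; the sigmoid and exponential are simply the choices named in the text.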
The count variable can be modelled using a Poisson distribution with a single parameter (the rate), J_2 = 1, restricted to the positive reals. This means that θ_2(x) = θ_{2,1}(x) = g_{2,1}(f_{2,1}(x)), where g_{2,1}(·) could be modelled as an exponential function, g_{2,1}(·) = exp(·), to ensure strictly positive values for the parameter. Finally, y_3(x) is a continuous variable with heteroscedastic noise. It can be modelled using a Gaussian distribution where both the mean and the variance are functions of x. This means that θ_3(x) = [θ_{3,1}(x) θ_{3,2}(x)]^⊤ = [g_{3,1}(f_{3,1}(x)) g_{3,2}(f_{3,2}(x))]^⊤, where the first function is used to model the mean of the Gaussian and the second function is used to model the variance. Therefore, we can assume that g_{3,1}(·) is the identity function and that g_{3,2}(·) is a function that ensures that the variance takes strictly positive values, e.g. the exponential function.\nLet us define a vector-valued function y(x) = [y_1(x), y_2(x), ..., y_D(x)]^⊤. We assume that the outputs are conditionally independent given the vector of parameters θ(x) = [θ_1(x), θ_2(x), ..., θ_D(x)]^⊤, defined by specifying the vector of latent functions f(x) = [f_{1,1}(x), f_{1,2}(x), ..., f_{1,J_1}(x), f_{2,1}(x), f_{2,2}(x), ..., f_{D,J_D}(x)]^⊤ ∈ R^{J×1}, where J = Σ_{d=1}^{D} J_d:\n\n\[ p(\mathbf{y}(\mathbf{x})\,|\,\boldsymbol{\theta}(\mathbf{x})) = p(\mathbf{y}(\mathbf{x})\,|\,\mathbf{f}(\mathbf{x})) = \prod_{d=1}^{D} p(y_d(\mathbf{x})\,|\,\boldsymbol{\theta}_d(\mathbf{x})) = \prod_{d=1}^{D} p(y_d(\mathbf{x})\,|\,\tilde{\mathbf{f}}_d(\mathbf{x})), \qquad (1) \]\n\nwhere we have defined f̃_d(x) = [f_{d,1}(x), ..., f_{d,J_d}(x)]^⊤ ∈ R^{J_d×1}, the set of latent functions that specify the parameters in θ_d(x). Notice that J ≥ D; that is, there is not always a one-to-one map from f(x) to y(x). Most previous work has assumed that D = 1 and that the corresponding latent functions in f̃_1(x) = [f_{1,1}(x), ..., f_{1,J_1}(x)]^⊤ are drawn from independent Gaussian processes. Important exceptions are Chai (2012) and Dezfouli and Bonilla (2015), who assumed a categorical variable y_1(x) where the elements in f̃_1(x) were drawn from an intrinsic coregionalisation model. In what follows, we generalise these models for D > 1 and potentially heterogeneous outputs y_d(x). We will use the word “output” to refer to the elements y_d(x), and “latent parameter function” (LPF) or “parameter function” (PF) to refer to f_{d,j}(x).\n\n2.1 A multi-parameter GP prior\n\nOur main departure from previous work is in the modelling of f(x) using a multi-parameter Gaussian process that allows correlations between the parameter functions f_{d,j}(x). We will use a linear model of coregionalisation type of covariance function for expressing correlations between functions f_{d,j}(x) and f_{d',j'}(x'). The particular construction is as follows. Consider an additional set of independent latent functions U = {u_q(x)}_{q=1}^{Q} that will be linearly combined to produce the J LPFs {f_{d,j}(x)}_{j=1,d=1}^{J_d,D}. Each latent function u_q(x) is assumed to be drawn from an independent GP prior such that u_q(·) ∼ GP(0, k_q(·,·)), where k_q can be any valid covariance function, and the zero mean is assumed for simplicity. Each latent parameter function f_{d,j}(x) is then given as\n\n\[ f_{d,j}(\mathbf{x}) = \sum_{q=1}^{Q} \sum_{i=1}^{R_q} a^{i}_{d,j,q}\, u^{i}_{q}(\mathbf{x}), \qquad (2) \]\n\nwhere the u^i_q(x) are IID samples from u_q(·) ∼ GP(0, k_q(·,·)) and a^i_{d,j,q} ∈ R. The mean function for f_{d,j}(x) is zero, and the cross-covariance function k_{f_{d,j} f_{d',j'}}(x, x') = cov[f_{d,j}(x), f_{d',j'}(x')] is equal to Σ_{q=1}^{Q} b^q_{(d,j),(d',j')} k_q(x, x'), where b^q_{(d,j),(d',j')} = Σ_{i=1}^{R_q} a^i_{d,j,q} a^i_{d',j',q}. Let us define X = {x_n}_{n=1}^{N} ∈ R^{N×p} as a set of common input vectors for all outputs y_d(x), although the presentation could be extended to the case of a different set of inputs per output. Let us also define f_{d,j} = [f_{d,j}(x_1), ..., f_{d,j}(x_N)]^⊤ ∈ R^{N×1}; f̃_d = [f_{d,1}^⊤ ... f_{d,J_d}^⊤]^⊤ ∈ R^{J_d N×1}; and f = [f̃_1^⊤ ... f̃_D^⊤]^⊤ ∈ R^{JN×1}. The generative model for the heterogeneous MOGP is as follows. We sample f ∼ N(0, K), where K is a block-wise matrix with blocks given by {K_{f_{d,j} f_{d',j'}}}_{d=1,d'=1,j=1,j'=1}^{D,D,J_d,J_{d'}}. In turn, the elements in K_{f_{d,j} f_{d',j'}} are given by k_{f_{d,j} f_{d',j'}}(x_n, x_m), with x_n, x_m ∈ X. For the particular case of equal inputs X for all LPFs, K can also be expressed as a sum of Kronecker products,\n\n\[ \mathbf{K} = \sum_{q=1}^{Q} \mathbf{A}_q \mathbf{A}_q^{\top} \otimes \mathbf{K}_q = \sum_{q=1}^{Q} \mathbf{B}_q \otimes \mathbf{K}_q, \]\n\nwhere A_q ∈ R^{J×R_q} has entries {a^i_{d,j,q}}_{d=1,j=1,i=1}^{D,J_d,R_q} and B_q has entries {b^q_{(d,j),(d',j')}}_{d=1,d'=1,j=1,j'=1}^{D,D,J_d,J_{d'}}. The matrix K_q ∈ R^{N×N} has entries given by k_q(x_n, x_m) for x_n, x_m ∈ X. The matrices B_q ∈ R^{J×J} are known as the coregionalisation matrices. Once we obtain the sample for f, we evaluate the vector of parameters θ = [θ_1^⊤ ... θ_D^⊤]^⊤, where θ_d = f̃_d. Having specified θ, we can generate samples for the output vector y = [y_1^⊤ ... y_D^⊤]^⊤ ∈ X^{DN×1}, where the elements in y_d are obtained by sampling from the conditional distributions p(y_d(x)|θ_d(x)). To keep the notation uncluttered, we will assume from now on that R_q = 1, meaning that A_q = a_q ∈ R^{J×1} and that the coregionalisation matrices are rank one. In the literature, such a model is known as the semiparametric latent factor model (Teh et al., 2005).\n\n2.2 Scalable variational inference\n\nGiven a heterogeneous dataset D = {X, y}, we would like to compute the posterior distribution p(f|D), which is intractable in our model. In what follows, we use similar ideas to Alvarez and Lawrence (2009); Álvarez et al. (2010), who introduced the inducing variable formalism for computational efficiency in MOGP. However, instead of marginalising the latent functions U to obtain a variational lower bound, we keep their presence in a way that allows us to apply stochastic variational inference, as in Hensman et al. (2013); Saul et al. (2016).\n\n2.2.1 Inducing variables for MOGP\n\nA key idea to reduce computational complexity in Gaussian process models is to introduce auxiliary variables or inducing variables. These variables have already been used in the context of MOGP (Alvarez and Lawrence, 2009; Álvarez et al., 2010). A subtle difference from the single-output case is that the inducing variables are not taken from the same latent process, say f_1(x), but from the latent processes U also used to build the model for multiple outputs. We will follow the same formalism here. 
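To make the construction of Section 2.1 concrete, the following sketch (our own illustration under simplifying choices: RBF kernels, rank-one coregionalisation, shared inputs, and hypothetical sizes N, J, Q) builds the prior covariance K = Σ_q a_q a_q^⊤ ⊗ K_q and draws a joint sample of the latent parameter functions:

```python
import numpy as np

def rbf(X, Z, lengthscale=0.3):
    """RBF covariance k_q(x, z) evaluated between two sets of inputs."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def mogp_prior_cov(X, A, lengthscales):
    """K = sum_q a_q a_q^T (Kronecker) K_q, rank-one coregionalisation.

    A has shape (J, Q): column q holds the weights a_q that combine the
    q-th latent function into the J latent parameter functions."""
    J, Q = A.shape
    N = X.shape[0]
    K = np.zeros((J * N, J * N))
    for q in range(Q):
        B_q = np.outer(A[:, q], A[:, q])  # rank-one coregionalisation matrix
        K += np.kron(B_q, rbf(X, X, lengthscales[q]))
    return K

rng = np.random.default_rng(0)
N, J, Q = 50, 4, 3          # e.g. J = 4 LPFs, as in the D = 3 example above
X = rng.uniform(0, 1, size=(N, 1))
A = rng.standard_normal((J, Q))
K = mogp_prior_cov(X, A, lengthscales=[0.1, 0.2, 0.3])
# Jointly sample f ~ N(0, K); a small jitter keeps the Cholesky factor stable.
f = np.linalg.cholesky(K + 1e-8 * np.eye(J * N)) @ rng.standard_normal(J * N)
```

Reshaping `f` to `(J, N)` gives J correlated latent parameter functions evaluated at the shared inputs, which would then be pushed through the link functions g_{d,j}.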
We start by defining the set of M inducing variables per latent function u_q(x) as u_q = [u_q(z_1), ..., u_q(z_M)]^⊤, evaluated at a set of inducing inputs Z = {z_m}_{m=1}^{M} ∈ R^{M×p}. We also define u = [u_1^⊤, ..., u_Q^⊤]^⊤ ∈ R^{QM×1}. For simplicity in the exposition, we have assumed that all the inducing variables, for all q, have been evaluated at the same set of inputs Z. Instead of marginalising {u_q(x)}_{q=1}^{Q} from the model in (2), we explicitly use the joint Gaussian prior p(f, u) = p(f|u)p(u). Due to the assumed independence of the latent functions u_q(x), the distribution p(u) factorises as p(u) = ∏_{q=1}^{Q} p(u_q), with u_q ∼ N(0, K_q), where K_q ∈ R^{M×M} has entries k_q(z_i, z_j) with z_i, z_j ∈ Z. Notice that the dimensions of K_q here are different from the dimensions of K_q in Section 2.1. The LPFs f_{d,j} are conditionally independent given u, so we can write the conditional distribution p(f|u) as\n\n\[ p(\mathbf{f}\,|\,\mathbf{u}) = \prod_{d=1}^{D}\prod_{j=1}^{J_d} p(\mathbf{f}_{d,j}\,|\,\mathbf{u}) = \prod_{d=1}^{D}\prod_{j=1}^{J_d} \mathcal{N}\!\left(\mathbf{f}_{d,j}\,\middle|\,\mathbf{K}_{\mathbf{f}_{d,j}\mathbf{u}}\mathbf{K}_{\mathbf{u}\mathbf{u}}^{-1}\mathbf{u},\; \mathbf{K}_{\mathbf{f}_{d,j}\mathbf{f}_{d,j}} - \mathbf{K}_{\mathbf{f}_{d,j}\mathbf{u}}\mathbf{K}_{\mathbf{u}\mathbf{u}}^{-1}\mathbf{K}_{\mathbf{f}_{d,j}\mathbf{u}}^{\top}\right), \]\n\nwhere K_uu ∈ R^{QM×QM} is a block-diagonal matrix with blocks given by K_q, and K_{f_{d,j}u} ∈ R^{N×QM} is the cross-covariance matrix computed from the cross-covariances between f_{d,j}(x) and u_q(z). The expression for this cross-covariance function can be obtained from (2), leading to k_{f_{d,j}u_q}(x, z) = a_{d,j,q} k_q(x, z). This form for the cross-covariance between the LPF f_{d,j}(x) and u_q(z) is a key difference between the inducing variable methods for the single-output GP case and the MOGP case.\n\n2.2.2 Variational Bounds\n\nExact posterior inference is intractable in our model due to the presence of an arbitrary number of non-Gaussian likelihoods. 
We use variational inference to compute a lower bound L for the marginal log-likelihood log p(y), and to approximate the posterior distribution p(f, u|D). Following Álvarez et al. (2010), the posterior over the LPFs f and the latent functions u can be approximated as\n\n\[ p(\mathbf{f},\mathbf{u}\,|\,\mathbf{y},\mathbf{X}) \approx q(\mathbf{f},\mathbf{u}) = p(\mathbf{f}\,|\,\mathbf{u})\,q(\mathbf{u}) = \prod_{d=1}^{D}\prod_{j=1}^{J_d} p(\mathbf{f}_{d,j}\,|\,\mathbf{u}) \prod_{q=1}^{Q} q(\mathbf{u}_q), \]\n\nwhere q(u_q) = N(u_q|μ_{u_q}, S_{u_q}) are Gaussian variational distributions whose parameters {μ_{u_q}, S_{u_q}}_{q=1}^{Q} must be optimised. Building on previous work by Saul et al. (2016); Hensman et al. (2015), we derive a lower bound that accepts any log-likelihood function that can be modulated by the LPFs f. The lower bound L for log p(y) is obtained as follows:\n\n\[ \log p(\mathbf{y}) = \log \int p(\mathbf{y}|\mathbf{f})\,p(\mathbf{f}|\mathbf{u})\,p(\mathbf{u})\,d\mathbf{f}\,d\mathbf{u} \geq \int q(\mathbf{f},\mathbf{u}) \log \frac{p(\mathbf{y}|\mathbf{f})\,p(\mathbf{f}|\mathbf{u})\,p(\mathbf{u})}{q(\mathbf{f},\mathbf{u})}\,d\mathbf{f}\,d\mathbf{u} = \mathcal{L}. \]\n\nWe can further simplify L to obtain\n\n\[ \mathcal{L} = \int\!\!\int p(\mathbf{f}|\mathbf{u})\,q(\mathbf{u}) \log p(\mathbf{y}|\mathbf{f})\,d\mathbf{f}\,d\mathbf{u} - \sum_{q=1}^{Q} \mathrm{KL}\big(q(\mathbf{u}_q)\,\|\,p(\mathbf{u}_q)\big), \]\n\nwhere KL is the Kullback-Leibler divergence. Moreover, the approximate marginal posterior for f_{d,j} is q(f_{d,j}) = ∫ p(f_{d,j}|u)q(u)du, leading to\n\n\[ q(\mathbf{f}_{d,j}) = \mathcal{N}\!\left(\mathbf{f}_{d,j}\,\middle|\,\mathbf{K}_{\mathbf{f}_{d,j}\mathbf{u}}\mathbf{K}_{\mathbf{u}\mathbf{u}}^{-1}\boldsymbol{\mu}_{\mathbf{u}},\; \mathbf{K}_{\mathbf{f}_{d,j}\mathbf{f}_{d,j}} + \mathbf{K}_{\mathbf{f}_{d,j}\mathbf{u}}\mathbf{K}_{\mathbf{u}\mathbf{u}}^{-1}(\mathbf{S}_{\mathbf{u}} - \mathbf{K}_{\mathbf{u}\mathbf{u}})\mathbf{K}_{\mathbf{u}\mathbf{u}}^{-1}\mathbf{K}_{\mathbf{f}_{d,j}\mathbf{u}}^{\top}\right), \]\n\nwhere μ_u = [μ_{u_1}^⊤, ..., μ_{u_Q}^⊤]^⊤ and S_u is a block-diagonal matrix with blocks given by S_{u_q}. The expression for log p(y|f) factorises according to (1): log p(y|f) = Σ_{d=1}^{D} log p(y_d|f̃_d) = Σ_{d=1}^{D} log p(y_d|f_{d,1}, ..., f_{d,J_d}). 
Using this expression for log p(y|f) leads to the following expression for the bound:\n\n\[ \mathcal{L} = \sum_{d=1}^{D} \mathbb{E}_{q(\mathbf{f}_{d,1})\cdots q(\mathbf{f}_{d,J_d})}\big[\log p(\mathbf{y}_d\,|\,\mathbf{f}_{d,1},\ldots,\mathbf{f}_{d,J_d})\big] - \sum_{q=1}^{Q} \mathrm{KL}\big(q(\mathbf{u}_q)\,\|\,p(\mathbf{u}_q)\big). \]\n\nWhen D = 1 in the expression above, we recover the bound obtained in Saul et al. (2016). To maximise this lower bound, we need to find the optimal variational parameters {μ_{u_q}}_{q=1}^{Q} and {S_{u_q}}_{q=1}^{Q}. We represent each matrix S_{u_q} as S_{u_q} = L_{u_q} L_{u_q}^⊤ and, to ensure positive definiteness for S_{u_q}, we estimate L_{u_q} instead of S_{u_q}. Computation of the posterior distributions over f_{d,j} can be done analytically. There is still an intractability issue in the variational expectations of the log-likelihood functions. Since we construct these bounds so as to accept any possible data type, we need a general way to solve these integrals. One obvious solution is to apply Monte Carlo methods; however, maximising the lower bound and updating the variational parameters would be slow if we had to sample thousands of times (to approximate the expectations) at each iteration. Instead, we address this problem by using Gauss-Hermite quadrature, as in Hensman et al. (2015); Saul et al. (2016).\n\nStochastic Variational Inference. 
The conditional expectations in the bound above are also valid across data observations, so we can express the bound as\n\n\[ \mathcal{L} = \sum_{d=1}^{D}\sum_{n=1}^{N} \mathbb{E}_{q(f_{d,1}(\mathbf{x}_n))\cdots q(f_{d,J_d}(\mathbf{x}_n))}\big[\log p(y_d(\mathbf{x}_n)\,|\,f_{d,1}(\mathbf{x}_n),\ldots,f_{d,J_d}(\mathbf{x}_n))\big] - \sum_{q=1}^{Q} \mathrm{KL}\big(q(\mathbf{u}_q)\,\|\,p(\mathbf{u}_q)\big). \]\n\nThis functional form allows the use of mini-batches of smaller sets of training samples, performing the optimisation process using noisy estimates of the gradient of the global objective, in a similar fashion to Hoffman et al. (2013); Hensman et al. (2013, 2015); Saul et al. (2016). This scalable bound makes our multi-output model applicable to large heterogeneous datasets. Notice that the computational complexity is dominated by the inversion of K_uu, with a cost of O(QM³), and by products like K_fu, with a cost of O(JNQM²).\nHyperparameter learning. Hyperparameters in our model include Z, the coregionalisation matrices {B_q}_{q=1}^{Q}, and the hyperparameters {γ_q}_{q=1}^{Q} associated to the covariance functions {k_q(·,·)}_{q=1}^{Q}. Since the variational distribution q(u) is sensitive to changes in the hyperparameters, we maximise the variational parameters of q(u) and the hyperparameters using a variational EM algorithm (Beal, 2003) when employing the full dataset, or its stochastic version when using mini-batches (Hoffman et al., 2013).\n\n2.3 Predictive distribution\n\nConsider a set of test inputs X_*. Assuming that p(u|y) ≈ q(u), the predictive distribution p(y_*) can be approximated as p(y_*|y) ≈ ∫ p(y_*|f_*)q(f_*)df_*, where q(f_*) = ∫ p(f_*|u)q(u)du. Computing the expression q(f_*) = ∏_{d=1}^{D} ∏_{j=1}^{J_d} q(f_{d,j,*}) involves evaluating K_{f_{d,j,*}u} at X_*. As in the case of the lower bound, the integral above is intractable for the non-Gaussian likelihoods p(y_*|f_*). We can once again make use of Monte Carlo integration or quadrature to approximate the integral. 
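As an illustration of the quadrature step used both in the bound and at prediction time, the sketch below (our own minimal example, not the paper's code) approximates E_{q(f)}[log p(y|f)] for a Bernoulli likelihood with a logistic link, where q(f) = N(μ, σ²):

```python
import numpy as np

def bernoulli_loglik(y, f):
    """log p(y|f) for y in {0, 1} with logistic link p = sigmoid(f).

    Written as -log(1 + exp(-t)) with t = (2y - 1) f, a standard stable form."""
    t = (2.0 * y - 1.0) * f
    return -np.log1p(np.exp(-t))

def gauss_hermite_expectation(loglik, y, mu, sigma2, degree=20):
    """Approximate E_{N(f|mu, sigma2)}[loglik(y, f)] by Gauss-Hermite quadrature.

    The change of variables f = mu + sqrt(2 sigma2) t turns the Gaussian
    expectation into a weighted sum over the Hermite nodes t."""
    t, w = np.polynomial.hermite.hermgauss(degree)
    f = mu + np.sqrt(2.0 * sigma2) * t
    return (w * loglik(y, f)).sum() / np.sqrt(np.pi)

# With a very peaked q(f), the expectation collapses to loglik(y, mu).
val = gauss_hermite_expectation(bernoulli_loglik, y=1.0, mu=2.0, sigma2=1e-10)
```

Because the log-sigmoid is concave, widening q(f) can only lower this expectation (Jensen's inequality), which is a handy sanity check on any quadrature implementation.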
Simpler integration problems are obtained if we are only interested in the predictive mean, E[y_*], and the predictive variance, var[y_*].\n\n3 Related Work\n\nThe works most closely related to ours are Skolidis and Sanguinetti (2011), Chai (2012), Dezfouli and Bonilla (2015) and Saul et al. (2016). We differ from Skolidis and Sanguinetti (2011) in that we allow more general heterogeneous outputs beyond the specific case of several binary classification problems. Our inference method also scales to large datasets. The works by Chai (2012) and Dezfouli and Bonilla (2015) do use a MOGP, but they only handle a single categorical variable. Our inference approach scales when compared to the one in Chai (2012), and it is fundamentally different from the one in Dezfouli and Bonilla (2015), since we do not use AVI. Our model is also different from Saul et al. (2016), since we allow for several dependent outputs, D > 1, and our scalable approach is more akin to applying SVI to the inducing variable approach of Álvarez et al. (2010).\nMore recently, Vanhatalo et al. (2018) used additive multi-output GP models to account for interdependencies between counting and binary observations. They use the Laplace approximation for approximating the posterior distribution. Similarly, Pourmohamad and Lee (2016) perform combined regression and binary classification with a multi-output GP learned via sequential Monte Carlo. Nguyen and Bonilla (2014b) also use the same idea from Álvarez et al. (2010) to provide scalability for multiple-output GP models, conditioning the latent parameter functions f_{d,j}(x) on the inducing variables u, but they only consider the multivariate regression case.\nIt is also important to mention that multi-output Gaussian processes have been considered as alternative models for multi-task learning (Alvarez et al., 2012). 
Multi-task learning also addresses multiple prediction problems together within a single inference framework. Most previous work in this area has focused on problems where all tasks are exclusively regression or classification problems. When tasks are heterogeneous, the common practice is to introduce a regulariser per data type in a global cost function (Zhang et al., 2012; Han et al., 2017). Usually, these cost functions are composed of additive terms, each one referring to a single task, while the correlation assumption among heterogeneous likelihoods is addressed by mixing regularisers in a global penalty term (Li et al., 2014) or by forcing different tasks to share a common mean (Ngufor et al., 2015). Another natural way of treating both continuous and discrete tasks is to assume that all of them share a common input set that varies its influence on each output. Then, by sharing a joint sparsity pattern, it is possible to optimise a global cost function with a single regularisation parameter on the level of sparsity (Yang et al., 2009). There have also been efforts to model heterogeneous data outside the umbrella of multi-task learning, including mixed graphical models (Yang et al., 2014), where the various types of data are assumed to be combinations of exponential families, and latent feature models (Valera et al., 2017), with heterogeneous observations being mappings of a set of Gaussian distributed variables.\n\n4 Experiments\n\nIn this section, we evaluate our model in different heterogeneous scenarios.¹ To demonstrate its performance in terms of multi-output learning, prediction and scalability, we have explored several applications with both synthetic and real data. For all the experiments, we consider an RBF kernel for each covariance function k_q(·,·) and we set Q = 3. For standard optimisation we used the L-BFGS-B algorithm. 
When SVI was needed, we considered ADADELTA, included in the climin library, and a mini-batch size of 500 samples for every output. All performance metrics are given in terms of the negative log-predictive density (NLPD), calculated on a test subset and applicable to any type of likelihood. Further details about the experiments are included in the appendix.\nMissing Gap Prediction: In our first experiment, we evaluate whether our model is able to predict observations in one output using training information from another one. We set up a toy problem which consists of D = 2 heterogeneous outputs, where the first function y_1(x) is real and y_2(x) is binary. Assuming that the heterogeneous outputs do not share a common input set, we observe N_1 = 600 and N_2 = 500 samples, respectively. All inputs are uniformly distributed in the input range [0, 1], and we generate a gap only in the set of binary observations by removing N_test = 150 samples in the interval [0.7, 0.9]. Using the remaining points from both outputs for training, we fitted our MOGP model. In Figures 1(a,b) we can see how the uncertainty in the binary test predictions is reduced by learning from the first output.\n\n1The code is publicly available in the repository github.com/pmorenoz/HetMOGP/\n\nFigure 1: Comparison between multi-output and single-output performance for two heterogeneous sets of observations. (a) Fitted function and uncertainty for the first output. It represents the mean function parameter μ(x) for a Gaussian distribution with σ² = 1. (b) Predictive output function for binary inputs. The blue curve is the fitted function for the training data, and the red one corresponds to predictions from the test inputs (true test binary outputs also in red). (c) Same output as Figure 1(b), but training an independent Chained GP only on the single binary output (GP binary classification). 
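All reported results use the NLPD metric introduced above; a minimal sketch of how such a score can be computed for a binary (Bernoulli) test output (our own illustration; the paper's exact evaluation code is not shown) is:

```python
import numpy as np

def nlpd_bernoulli(y_test, p_test):
    """Negative log-predictive density for binary observations.

    y_test: array of test labels in {0, 1}.
    p_test: predictive probabilities p(y = 1 | x) from the model.
    Returns the average negative log-density over the test subset (lower is better)."""
    p = np.clip(p_test, 1e-12, 1.0 - 1e-12)  # guard against log(0)
    logdens = y_test * np.log(p) + (1.0 - y_test) * np.log1p(-p)
    return -np.mean(logdens)

# Hypothetical test labels and predictive probabilities, for illustration only.
y = np.array([1.0, 0.0, 1.0, 1.0])
p = np.array([0.9, 0.2, 0.8, 0.7])
score = nlpd_bernoulli(y, p)
```

The same recipe applies to any likelihood in the model: evaluate the predictive density of each held-out observation and average the negative logarithms, which is what makes the NLPD comparable across heterogeneous outputs.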
In contrast, Figure 1(c) shows a wider variance in the predicted parameter when it is trained independently. For the multi-output case we obtained an NLPD value on test data of 32.5 ± 0.2 × 10^{-2}, while in the single-output case the NLPD was 40.51 ± 0.08 × 10^{-2}.\nHuman Behavior Data: In this experiment, we are interested in modelling human behavior in psychiatric patients. Previous work by Soleimani et al. (2018) already explores the application of scalable MOGP models to healthcare for reliable predictions from multivariate time-series. Our data come from a medical study that asked patients to download a monitoring app (EB2)² on their smartphones. The system captures information about mobility, communication metadata and interactions in social media. The work has a particular interest in mental health, since shifts or misalignments in the circadian feature of human behavior (24h cycles) can be interpreted as early signs of crisis.\n\nTable 1: Behavior Dataset Test-NLPD (×10^{-2})\n            Bernoulli     Bernoulli     Heteroscedastic  Global\nHetMOGP     2.24 ± 0.21   5.41 ± 0.05   6.09 ± 0.21      13.74 ± 0.41\nChainedGP   5.19 ± 0.81   2.43 ± 0.30   7.29 ± 0.12      14.91 ± 1.05\n\nFigure 2: Results for multi-output modeling of human behavior. After training, all output predictions share a common (daily) periodic pattern.\n\nIn particular, we obtained a binary indicator variable of presence/absence at home by monitoring latitude-longitude and measuring its distance from the patient's home location within a 50 m radius range. Then, using the already measured distances, we generated a mobility sequence with all log-distance values. 
Our last output consists of binary samples representing use/non-use of the Whatsapp application on the smartphone. At each monitoring time instant, we used its differential data consumption to determine use or non-use of the application. We considered an entire week in seconds as the input domain, normalized to the range [0, 1].\n\n2This smartphone application can be found at https://www.eb2.tech/.\n\n[Figure 2 panels: Output 1: Binary Presence/Absence at Home; Output 2: Log-distance from Home (Km); Output 3: Binary Use/non-use of Whatsapp.]\n\nIn Figure (2), after training on N = 750 samples, we find that the circadian feature is mainly contained in the first output. During the learning process, this periodicity is transferred to the other outputs through the latent functions, improving the performance of the entire model. Experimentally, we found that this circadian pattern was not captured in the mobility and social data when training the outputs independently. In Table 1 we can see the prediction metrics for multi-output and independent prediction.\nLondon House Price Data: Based on the large scale experiments in Hensman et al. (2013), we obtained the complete register of properties sold in the Greater London County during 2017 (https://www.gov.uk/government/collections/price-paid-data). We preprocessed it to translate all property addresses to latitude-longitude points. For each spatial input, we considered two observations, one binary and one real. 
The binary observation indicates whether the property is or is not a flat (zero meaning detached, semi-detached, terraced, etc.), and the real one is the sale price of the house. Our goal is to predict features of houses given a certain location in the London area. We used a training set of N = 20,000 samples, 1,000 samples for test predictions and M = 100 inducing points.

Table 2: London Dataset Test-NLPD (×10−2)

            Bernoulli      Heteroscedastic   Global
HetMOGP     6.38 ± 0.46    10.05 ± 0.64      16.44 ± 0.01
ChainedGP   6.75 ± 0.25    10.56 ± 1.03      17.31 ± 1.06

Figure 3: Results for spatial modeling of heterogeneous data. (Top row) 10% of training samples for the Greater London County. Binary outputs are the type of property sold in 2017 and real ones are prices included in sale contracts. (Bottom row) Test prediction curves for Ntest = 2,500 inputs. [Panels over longitude-latitude: property type (flat/other), sale price (79K£–1.5M£), log-price variance.]

Results in Figure 3 show a portion of the entire heterogeneous dataset and its test prediction curves. We obtained a global NLPD score of 16.44 ± 0.01 using the MOGP and 17.31 ± 1.06 in the independent outputs setting (both ×10−2). There is an improvement in performance when training our multi-output model, even on large scale datasets. See Table 2 for scores per output.

High Dimensional Input Data: In our last experiment, we tested our MOGP model on the arrhythmia dataset from the UCI repository (http://archive.ics.uci.edu/ml/). We use a dataset of dimensionality p = 255 and 452 samples that we divide into training, validation and test sets (more details are in the appendix).
We use our model for predicting a binary output (gender) and a continuous output (logarithmic age), and we compare against independent Chained GPs per output. The binary output is modelled with a Bernoulli distribution and the continuous one with a Gaussian. We obtained an average NLPD value of 0.0191 for both the multi-output and independent-output models, with a slight difference in the standard deviation.

5 Conclusions

In this paper we have introduced a novel extension of multi-output Gaussian processes for handling heterogeneous observations. Our model is able to work on large scale datasets by using sparse approximations within stochastic variational inference. Experimental results show relevant improvements with respect to independent learning of heterogeneous data in different scenarios. In future work it would be interesting to employ convolution processes (CPs) as an alternative way to build the multi-output GP prior. Also, instead of hand-crafting definitions of heterogeneous likelihoods, we may consider discovering them automatically (Valera and Ghahramani, 2017) as an input block in a pipeline setup of our tool.

Acknowledgments

The authors want to thank Wil Ward for his constructive comments and Juan José Giraldo for his useful advice about SVI experiments and simulations. We also thank Alan Saul and David Ramírez for their recommendations about scalable inference and feedback on the equations. We are grateful to Eero Siivola and Marcelo Hartmann for sharing their Python module for heterogeneous likelihoods, and to Francisco J. R. Ruiz for his illuminating help with the stochastic version of the VEM algorithm. Also, we would like to thank Juan José Campaña for his assistance with the London House Price dataset.
Pablo Moreno-Muñoz acknowledges the support of his doctoral FPI grant BES2016-077626 and was also supported by the Ministerio de Economía of Spain under the project Macro-ADOBE (TEC2015-67719-P). Antonio Artés-Rodríguez acknowledges the support of projects ADVENTURE (TEC2015-69868-C2-1-R), AID (TEC2014-62194-EXP) and CASI-CAM-CM (S2013/ICE-2845). Mauricio A. Álvarez has been partially financed by the Engineering and Physical Sciences Research Council (EPSRC) Research Projects EP/N014162/1 and EP/R034303/1.

References

M. Alvarez and N. D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In NIPS 21, pages 57–64, 2009.

M. Álvarez, D. Luengo, M. Titsias, and N. Lawrence. Efficient multioutput Gaussian processes through variational inducing kernels. In AISTATS, pages 25–32, 2010.

M. A. Alvarez, L. Rosasco, and N. D. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.

M. J. Beal. Variational algorithms for approximate Bayesian inference. Ph.D. thesis, University College London, 2003.

E. V. Bonilla, K. M. Chai, and C. Williams. Multi-task Gaussian process prediction. In NIPS 20, pages 153–160, 2008.

K. M. A. Chai. Variational multinomial logit Gaussian process. Journal of Machine Learning Research, 13:1745–1808, 2012.

Z. Dai, M. A. Álvarez, and N. Lawrence. Efficient modeling of latent information in supervised learning using Gaussian processes. In NIPS 30, pages 5131–5139, 2017.

A. Dezfouli and E. V. Bonilla. Scalable inference for Gaussian process models with black-box likelihoods. In NIPS 28, pages 1414–1422, 2015.

J. D. Hadfield. MCMC methods for multi-response generalized linear mixed models: the MCMCglmm R package. Journal of Statistical Software, 33(2):1–22, 2010.

H. Han, A. K. Jain, S. Shan, and X. Chen.
Heterogeneous face attribute estimation: A deep multi-task learning approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, pages 282–290, 2013.

J. Hensman, A. G. d. G. Matthews, and Z. Ghahramani. Scalable variational Gaussian process classification. In AISTATS, pages 351–360, 2015.

D. M. Higdon. Space and space-time modelling using process convolutions. In Quantitative Methods for Current Environmental Issues, pages 37–56, 2002.

M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.

A. G. Journel and C. J. Huijbregts. Mining Geostatistics. Academic Press, London, 1978.

M. Lázaro-Gredilla and M. Titsias. Variational heteroscedastic Gaussian process regression. In ICML, pages 841–848, 2011.

S. Li, Z.-Q. Liu, and A. B. Chan. Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network. In CVPR, pages 482–489, 2014.

C. Ngufor, S. Upadhyaya, D. Murphree, N. Madde, D. Kor, and J. Pathak. A heterogeneous multi-task learning for predicting RBC transfusion and perioperative outcomes. In AIME, pages 287–297. Springer, 2015.

T. V. Nguyen and E. V. Bonilla. Automated variational inference for Gaussian process models. In NIPS 27, pages 1404–1412, 2014a.

T. V. Nguyen and E. V. Bonilla. Collaborative multi-output Gaussian processes. In UAI, 2014b.

G. Parra and F. Tobar. Spectral mixture kernels for multi-output Gaussian processes. In NIPS 30, 2017.

T. Pourmohamad and H. K. H. Lee. Multivariate stochastic process models for correlated responses of mixed type. Bayesian Analysis, 11(3):797–820, 2016.

A. D. Saul, J. Hensman, A. Vehtari, and N. D. Lawrence.
Chained Gaussian processes. In AISTATS, pages 1431–1440, 2016.

G. Skolidis and G. Sanguinetti. Bayesian multitask classification with Gaussian process priors. IEEE Transactions on Neural Networks, 22(12), 2011.

H. Soleimani, J. Hensman, and S. Saria. Scalable joint models for reliable uncertainty-aware event prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(8):1948–1963, 2018.

Y. Teh, M. Seeger, and M. Jordan. Semiparametric latent factor models. In AISTATS, pages 333–340, 2005.

K. R. Ulrich, D. E. Carlson, K. Dzirasa, and L. Carin. GP kernels for cross-spectrum analysis. In NIPS 28, 2015.

I. Valera and Z. Ghahramani. Automatic discovery of the statistical types of variables in a dataset. In ICML, pages 3521–3529, 2017.

I. Valera, M. F. Pradier, M. Lomeli, and Z. Ghahramani. General latent feature models for heterogeneous datasets. arXiv preprint arXiv:1706.03779, 2017.

J. Vanhatalo, J. Riihimäki, J. Hartikainen, P. Jylänki, V. Tolvanen, and A. Vehtari. GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research, 14(1):1175–1179, 2013.

J. Vanhatalo, M. Hartmann, and L. Veneranta. Joint species distribution modeling with additive multivariate Gaussian process priors and heterogeneous data. arXiv preprint arXiv:1809.02432, 2018.

A. G. Wilson and Z. Ghahramani. Generalised Wishart processes. In UAI, pages 736–744, 2011.

E. Yang, P. Ravikumar, G. I. Allen, Y. Baker, Y.-W. Wan, and Z. Liu. A general framework for mixed graphical models. arXiv preprint arXiv:1411.0288, 2014.

X. Yang, S. Kim, and E. P. Xing. Heterogeneous multitask learning with joint sparsity constraints. In NIPS 22, pages 2151–2159, 2009.

D. Zhang, D. Shen, A. D. N. Initiative, et al. Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer's disease.
NeuroImage, 59(2):895–907, 2012.