{"title": "It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals", "book": "Advances in Neural Information Processing Systems", "page_first": 1466, "page_last": 1474, "abstract": "Multi-task prediction models are widely being used to couple regressors or classification models by sharing information across related tasks. A common pitfall of these models is that they assume that the output tasks are independent conditioned on the inputs. Here, we propose a multi-task Gaussian process approach to model both the relatedness between regressors as well as the task correlations in the residuals, in order to more accurately identify true sharing between regressors. The resulting Gaussian model has a covariance term that is the sum of Kronecker products, for which efficient parameter inference and out of sample prediction are feasible. On both synthetic examples and applications to phenotype prediction in genetics, we find substantial benefits of modeling structured noise compared to established alternatives.", "full_text": "It is all in the noise: Ef\ufb01cient multi-task Gaussian\n\nprocess inference with structured residuals\n\nMachine Learning and Computational Biology\n\nBarbara Rakitsch\n\nResearch Group\n\nMax Planck Institutes T\u00a8ubingen, Germany\n\nrakitsch@tuebingen.mpg.de\n\nChristoph Lippert\nMicrosoft Research\nLos Angeles, USA\n\nlippert@microsoft.com\n\nKarsten Borgwardt1,2\n\nMachine Learning and Computational Biology\n\nResearch Group\n\nOliver Stegle2\n\nEuropean Molecular Biology Laboratory\n\nEuropean Bioinformatics Institute\n\nMax Planck Institutes T\u00a8ubingen, Germany\n\nCambridge, UK\n\nkarsten.borgwardt@tuebingen.mpg.de\n\noliver.stegle@ebi.ac.uk\n\nAbstract\n\nMulti-task prediction methods are widely used to couple regressors or classi\ufb01ca-\ntion models by sharing information across related tasks. 
We propose a multi-task\nGaussian process approach for modeling both the relatedness between regressors\nand the task correlations in the residuals, in order to more accurately identify true\nsharing between regressors. The resulting Gaussian model has a covariance term\nin the form of a sum of Kronecker products, for which efficient parameter inference\nand out-of-sample prediction are feasible. On both synthetic examples and applications\nto phenotype prediction in genetics, we find substantial benefits of modeling\nstructured noise compared to established alternatives.\n\n1 Introduction\n\nMulti-task Gaussian process (GP) models are widely used to couple related tasks or functions for\njoint regression. This coupling is achieved by designing a structured covariance function, yielding\na prior on vector-valued functions. An important class of structured covariance functions can\nbe derived from a product of a kernel function c relating the tasks (task covariance) and a kernel\nfunction r relating the samples (sample covariance):\n\n$$\\mathrm{cov}(f_{n,t}, f_{n',t'}) = \\underbrace{c(t, t')}_{\\text{task covariance}} \\cdot \\underbrace{r(n, n')}_{\\text{sample covariance}}, \\qquad (1)$$\n\nwhere $f_{n,t}$ are latent function values that induce the outputs $y_{n,t}$ by adding some Gaussian noise.\nIf the outputs $y_{n,t}$ are fully observed, with one training example per sample and task, the resulting\ncovariance matrix between the latent factors can be written as a Kronecker product between the\nsample covariance matrix and the task covariance matrix (e.g. [1]). More complex multi-task\ncovariance structures can be derived from generalizations of this product structure, for example via\nconvolution of multiple features, e.g. [2]. In [3], a parameterized covariance over the tasks is used,\nassuming that task-relevant features are observed. 
The authors of [4] couple the latent features over\nthe tasks exploiting a dependency in neural population activity over time.\n\n1 Also at Zentrum f\u00fcr Bioinformatik, Eberhard Karls Universit\u00e4t T\u00fcbingen, T\u00fcbingen, Germany\n2 Both authors contributed equally to this work.\n\nWork proposing this type of multi-task GP regression builds on Bonilla and Williams [1], who have\nemphasized that the power of Kronecker covariance models for GP models (Eqn. (1)) is linked to\nnon-zero observation noise. In fact, in the limit of noise-free training observations, the coupling of\ntasks for predictions is lost in the predictive model, reducing to ordinary GP regressors for each individual\ntask. Most multi-task GP models build on a simple independent noise model, an assumption\nthat is mainly rooted in computational convenience. For example, [5] show that this assumption renders\nthe evaluation of the model likelihood and parameter gradients tractable, avoiding the explicit\nevaluation of the Kronecker covariance.\nIn this paper, we account for residual noise structure by modeling the signal and the noise covariance\nmatrix as two separate Kronecker products. The structured noise covariance is independent\nof the inputs and instead captures residual correlation between tasks due to latent causes;\nmoreover, the model is simple and extends the widely used product covariance structure. Conceptually\nrelated noise models have been proposed in animal breeding [6, 7]. In geostatistics [8], linear\ncoregionalization models have been introduced to allow for more complicated covariance structures:\nthe signal covariance matrix is modeled as a sum of Kronecker products and the noise covariance\nas a single Kronecker product. In machine learning, Gaussian process regression networks [9]\nconsider an adaptive mixture of GPs to model related tasks. 
The mixing coefficients are dependent\non the input signal and control the signal and noise correlation simultaneously.\nThe remainder of this paper is structured as follows. First, we show that unobserved regressors or\ncausal processes inevitably lead to correlated residuals, motivating the need to account for structured\nnoise (Section 2). This extension of the multi-task GP model allows for more accurate estimation\nof the task-task relationships, thereby improving the performance for out-of-sample predictions. At\nthe same time, we show how an efficient inference scheme can be derived for this class of models.\nThe proposed implementation handles closed-form marginal likelihoods and parameter gradients for\nmatrix-variate normal models with a covariance structure represented by the sum of two Kronecker\nproducts. These operations can be implemented at marginal extra computational cost compared to\nmodels that ignore residual task correlations (Section 3). In contrast to existing work extending\nGaussian process multi-task models by defining more complex covariance structures [2, 9, 8], our\nmodel utilizes the gradient of the marginal likelihood for parameter estimation and does not require\nexpectation maximization, variational approximation or MCMC sampling. We apply the resulting\nmodel in simulations and real settings, showing that correlated residuals are a concern in important\napplications (Section 4).\n\n2 Multi-task Gaussian processes with structured noise\n\nLet $Y \\in \\mathbb{R}^{N \\times T}$ denote the $N \\times T$ output training matrix for N samples and T tasks. A column of\nthis matrix corresponds to a particular task t and is denoted as $y_t$, and $\\mathrm{vec}\\,Y = (y_1^\\top \\dots y_T^\\top)^\\top$ denotes\nthe vector obtained by vertical concatenation of all columns of Y. We indicate the dimensions of the\nmatrix as capital subscripts when needed for clarity. A more detailed derivation of all equations\ncan be found in the Supplementary Material.\n\n
Multivariate linear model equivalence. The multi-task Gaussian process regression model with\nstructured noise can be derived from the perspective of a linear multivariate generative model. For\na particular task t, the outputs are determined by a linear function of the training inputs across F\nfeatures $S = \\{s_1, \\dots, s_F\\}$,\n\n$$y_t = \\sum_{f=1}^{F} s_f w_{f,t} + \\psi_t. \\qquad (2)$$\n\nMulti-task sharing is achieved by specifying a multivariate normal prior across tasks, both for the\nregression weights $w_{f,t}$ and the residuals $\\psi_t$:\n\n$$p(W^\\top) = \\prod_{f=1}^{F} \\mathcal{N}(w_f \\mid 0, C_{TT}), \\qquad p(\\Psi^\\top) = \\prod_{n=1}^{N} \\mathcal{N}(\\psi_n \\mid 0, \\Sigma_{TT}).$$\n\nMarginalizing out the weights W and the residuals $\\Psi$ results in a matrix-variate normal model with\na sum of Kronecker products covariance structure\n\n$$p(\\mathrm{vec}\\,Y \\mid C, R, \\Sigma) = \\mathcal{N}\\Big(\\mathrm{vec}\\,Y_{NT} \\;\\Big|\\; 0,\\; \\underbrace{C_{TT} \\otimes R_{NN}}_{\\text{signal covariance}} + \\underbrace{\\Sigma_{TT} \\otimes I_{NN}}_{\\text{noise covariance}}\\Big), \\qquad (3)$$\n\nwhere $R_{NN} = S S^\\top$ is the sample covariance matrix that results from the marginalization over\nthe weights W in Eqn. (2). In the following, we will refer to a Gaussian process model with this\ntype of sum of Kronecker products covariance structure as GP-kronsum$^1$. As common to any kernel\nmethod, the linear covariance R can be replaced with any positive semi-definite covariance function.\n\nPredictive distribution. In a GP-kronsum model, predictions for unseen test instances can be carried\nout by using the standard Gaussian process framework [10]:\n\n$$p(\\mathrm{vec}\\,Y^* \\mid R^*, Y) = \\mathcal{N}(\\mathrm{vec}\\,Y^* \\mid \\mathrm{vec}\\,M^*, V^*). \\qquad (4)$$\n\nHere, $M^*$ denotes the mean prediction and $V^*$ is the predictive covariance. 
Analytical expressions\nfor both can be obtained by considering the joint distribution of observed and unobserved outputs\nand completing the square, yielding:\n\n$$\\mathrm{vec}\\,M^* = (C_{TT} \\otimes R^*_{N^*N}) (C_{TT} \\otimes R_{NN} + \\Sigma_{TT} \\otimes I_{NN})^{-1} \\mathrm{vec}\\,Y_{NT},$$\n$$V^* = (C_{TT} \\otimes R^*_{N^*N^*}) - (C_{TT} \\otimes R^*_{N^*N}) (C_{TT} \\otimes R_{NN} + \\Sigma_{TT} \\otimes I_{NN})^{-1} (C_{TT} \\otimes R^*_{NN^*}),$$\n\nwhere $R^*_{N^*N}$ is the covariance matrix between the test and training instances, and $R^*_{N^*N^*}$ is the\ncovariance matrix between the test samples.\n\nDesign of multi-task covariance function. In practice, neither the form of C nor the form of $\\Sigma$ is\nknown a priori and hence both need to be inferred from data, fitting a set of corresponding covariance\nparameters $\\theta_C$ and $\\theta_\\Sigma$. If the number of tasks T is large, learning a free-form covariance matrix is\nprone to overfitting, as the number of free parameters grows quadratically with T. In the experiments,\nwe consider a rank-K approximation of the form $\\sum_{k=1}^{K} x_k x_k^\\top + \\sigma^2 I$ for the task matrices.\n\nTask cancellation when the task covariance matrices are equal. A notable form of the predictive\ndistribution (4) arises for the special case $C = \\Sigma$, that is, when the task covariance matrices of signal\nand noise are identical. Similar to previous results for noise-free observations [1], maximizing the\nmarginal likelihood $p(\\mathrm{vec}\\,Y \\mid C, R, \\Sigma)$ with respect to the parameters $\\theta_R$ becomes independent of C\nand the predictions are decoupled across tasks, i.e. the benefits from joint modeling are lost:\n\n$$\\mathrm{vec}\\,M^* = \\mathrm{vec}\\big(R^*_{N^*N} (R_{NN} + I_{NN})^{-1} Y_{NT}\\big). \\qquad (5)$$\n\nIn this case, the predictions depend on the sample covariance, but not on the task covariance. 
Thus,\nthe GP-kronsum model is most useful when the task covariances on observed features and on noise\nreflect two independent sharing structures.\n\n3 Efficient Inference\n\nIn general, efficient inference can be carried out for Gaussian models with a sum covariance of two\narbitrary Kronecker products\n\n$$p(\\mathrm{vec}\\,Y \\mid C, R, \\Sigma) = \\mathcal{N}(\\mathrm{vec}\\,Y \\mid 0, C_{TT} \\otimes R_{NN} + \\Sigma_{TT} \\otimes \\Omega_{NN}). \\qquad (6)$$\n\nThe key idea is to first consider a suitable data transformation that leads to a diagonalization of all\ncovariance matrices and second to exploit Kronecker tricks whenever possible.\nLet $\\Sigma = U_\\Sigma S_\\Sigma U_\\Sigma^\\top$ be the eigenvalue decomposition of $\\Sigma$, and analogously for $\\Omega$. Borrowing\nideas from [11], we can first bring the covariance matrix into a more amenable form by factoring out\nthe structured noise:\n\n$$K = C \\otimes R + \\Sigma \\otimes \\Omega = \\big(U_\\Sigma S_\\Sigma^{1/2} \\otimes U_\\Omega S_\\Omega^{1/2}\\big) \\big(\\tilde{C} \\otimes \\tilde{R} + I \\otimes I\\big) \\big(S_\\Sigma^{1/2} U_\\Sigma^\\top \\otimes S_\\Omega^{1/2} U_\\Omega^\\top\\big), \\qquad (7)$$\n\nwhere $\\tilde{C} = S_\\Sigma^{-1/2} U_\\Sigma^\\top C U_\\Sigma S_\\Sigma^{-1/2}$ and $\\tilde{R} = S_\\Omega^{-1/2} U_\\Omega^\\top R U_\\Omega S_\\Omega^{-1/2}$. In the following, we use the definition\n$\\tilde{K} = \\tilde{C} \\otimes \\tilde{R} + I \\otimes I$ for this transformed covariance.\n\n1 The covariance is defined as the sum of two Kronecker products and not as the classical Kronecker sum\n$C \\oplus R = C \\otimes I + I \\otimes R$.\n\nEfficient log likelihood evaluation. The log model likelihood (Eqn. 
(6)) can be expressed in terms\nof the transformed covariance $\\tilde{K}$:\n\n$$L = -\\frac{NT}{2} \\ln(2\\pi) - \\frac{1}{2} \\ln|K| - \\frac{1}{2} \\mathrm{vec}\\,Y^\\top K^{-1} \\mathrm{vec}\\,Y$$\n$$= -\\frac{NT}{2} \\ln(2\\pi) - \\frac{1}{2} \\ln|\\tilde{K}| - \\frac{1}{2} \\ln|S_\\Sigma \\otimes S_\\Omega| - \\frac{1}{2} \\mathrm{vec}\\,\\tilde{Y}^\\top \\tilde{K}^{-1} \\mathrm{vec}\\,\\tilde{Y}, \\qquad (8)$$\n\nwhere $\\mathrm{vec}\\,\\tilde{Y} = \\big(S_\\Sigma^{-1/2} U_\\Sigma^\\top \\otimes S_\\Omega^{-1/2} U_\\Omega^\\top\\big) \\mathrm{vec}\\,Y = \\mathrm{vec}\\big(S_\\Omega^{-1/2} U_\\Omega^\\top Y U_\\Sigma S_\\Sigma^{-1/2}\\big)$ is the projected output.\nExcept for the additional term $\\ln|S_\\Sigma \\otimes S_\\Omega|$, resulting from the transformation, the log likelihood has\nexactly the same form as for multi-task GP regression with iid noise [1, 5]. Using an analogous\nderivation, we can now efficiently evaluate the log likelihood:\n\n$$L = -\\frac{NT}{2} \\ln(2\\pi) - \\frac{1}{2} \\ln|S_{\\tilde{C}} \\otimes S_{\\tilde{R}} + I \\otimes I| - \\frac{N}{2} \\ln|S_\\Sigma| - \\frac{T}{2} \\ln|S_\\Omega| - \\frac{1}{2} \\mathrm{vec}\\big(U_{\\tilde{R}}^\\top \\tilde{Y} U_{\\tilde{C}}\\big)^\\top (S_{\\tilde{C}} \\otimes S_{\\tilde{R}} + I \\otimes I)^{-1} \\mathrm{vec}\\big(U_{\\tilde{R}}^\\top \\tilde{Y} U_{\\tilde{C}}\\big), \\qquad (9)$$\n\nwhere we have defined the eigenvalue decomposition of $\\tilde{C}$ as $U_{\\tilde{C}} S_{\\tilde{C}} U_{\\tilde{C}}^\\top$, and similarly for $\\tilde{R}$.\n\nEfficient gradient evaluation. The derivative of the log marginal likelihood with respect to a covariance\nparameter $\\theta_R$ can be expressed as:\n\n$$\\frac{\\partial}{\\partial \\theta_R} L = -\\frac{1}{2} \\frac{\\partial}{\\partial \\theta_R} \\ln|\\tilde{K}| - \\frac{1}{2} \\mathrm{vec}(\\tilde{Y})^\\top \\Big(\\frac{\\partial}{\\partial \\theta_R} \\tilde{K}^{-1}\\Big) \\mathrm{vec}(\\tilde{Y})$$\n$$= -\\frac{1}{2} \\mathrm{diag}\\big((S_{\\tilde{C}} \\otimes S_{\\tilde{R}} + I \\otimes I)^{-1}\\big)^\\top \\mathrm{diag}\\Big(S_{\\tilde{C}} \\otimes U_{\\tilde{R}}^\\top \\Big(\\frac{\\partial}{\\partial \\theta_R} \\tilde{R}\\Big) U_{\\tilde{R}}\\Big) + \\frac{1}{2} \\mathrm{vec}(\\hat{Y})^\\top \\mathrm{vec}\\Big(U_{\\tilde{R}}^\\top \\Big(\\frac{\\partial}{\\partial \\theta_R} \\tilde{R}\\Big) U_{\\tilde{R}} \\hat{Y} S_{\\tilde{C}}\\Big), \\qquad (10)$$\n\n
where $\\mathrm{vec}(\\hat{Y}) = (S_{\\tilde{C}} \\otimes S_{\\tilde{R}} + I \\otimes I)^{-1} \\mathrm{vec}\\big(U_{\\tilde{R}}^\\top \\tilde{Y} U_{\\tilde{C}}\\big)$. Analogous gradients can be derived for\nthe task covariance parameters $\\theta_C$ and $\\theta_\\Sigma$. The proposed speed-ups also apply to the special cases\nwhere $\\Sigma$ is modeled as being diagonal as in [1], or for optimizing the parameters of a kernel function.\nSince the sum of Kronecker products generally cannot be written as a single Kronecker product, the\nspeed-ups cannot be generalized to larger sums of Kronecker products.\n\nEfficient prediction. Similarly, the mean predictor (Eqn. (4)) can be efficiently evaluated as\n\n$$\\mathrm{vec}\\,M^* = \\mathrm{vec}\\Big[\\big(R^* U_\\Omega S_\\Omega^{-1/2}\\big) \\big(U_{\\tilde{R}} \\hat{Y} U_{\\tilde{C}}^\\top\\big) \\big(S_\\Sigma^{-1/2} U_\\Sigma^\\top C^\\top\\big)\\Big]. \\qquad (11)$$\n\nGradient-based parameter inference. The closed-form expression of the marginal likelihood\n(Eqn. (9)) and gradients with respect to covariance parameters (Eqn. (10)) allow for use of gradient-based\nparameter inference. In the experiments, we employ a variant of L-BFGS-B [12].\n\nComputational cost. While the naive approach has a runtime of $O(N^3 T^3)$ and memory requirement\nof $O(N^2 T^2)$, as it explicitly computes and inverts the Kronecker products, our reformulation\nreduces the runtime to $O(N^3 + T^3)$ and the memory requirement to $O(N^2 + T^2)$, making it applicable\nto large numbers of samples and tasks. 
The empirical runtime savings over the naive approach\nare explored in Section 4.1.\n\nFigure 1: Runtime comparison on synthetic data. (a) Efficient implementation. (b) Naive implementation. We compare our efficient GP-kronsum\nimplementation (left) versus its naive counterpart (right). Shown is the runtime in seconds on a logarithmic scale as a\nfunction of the sample size and the number of tasks. The optimization was stopped prematurely\nif it did not complete after $10^4$ seconds.\n\n4 Experiments\n\nWe investigated the performance of the proposed GP-kronsum model in both simulated datasets and\nresponse prediction problems in statistical genetics. To investigate the benefits of structured residual\ncovariances, we compared the GP-kronsum model to a Gaussian process (GP-kronprod) with iid\nnoise [5] as well as independent modeling of tasks using a standard Gaussian process (GP-single),\nand joint modeling of all tasks using a standard Gaussian process on a pooled dataset, naively merging data\nfrom all tasks (GP-pool).\nThe predictive performance of individual models was assessed through 10-fold cross-validation.\nFor each fold, model parameters were fit on the training data only. To avoid local optima during\ntraining, parameter fitting was carried out using five random restarts of the parameters on 90% of\nthe training instances. The remaining 10% of the training instances were used for out-of-sample\nselection using the maximum log likelihood as criterion. Unless stated otherwise, in the multi-task\nmodels the relationship between tasks was parameterized as $x x^\\top + \\sigma^2 I$, the sum of a rank-1 matrix\nand a constant diagonal component. Both parameters, x and $\\sigma^2$, were learnt by optimizing the\nmarginal likelihood. 
Finally, we measured the predictive performance of the different methods via\nthe squared Pearson\u2019s correlation coefficient $r^2$ between the true and the predicted output,\naveraged over tasks. The squared correlation coefficient is commonly used in statistical genetics to\nevaluate the performance of different predictors [13].\n\n4.1 Simulations\n\nFirst, we considered simulated experiments to explore the runtime behavior and to determine whether there\nare settings in which GP-kronsum performs better than existing methods.\n\nRuntime evaluation. As a first experiment, we examined the runtime behavior of our method as\na function of the number of samples and of the number of tasks. Both parameters were varied in\nthe range {16, 32, 64, 128, 256}. The simulated dataset was drawn from the GP-kronsum model\n(Eqn. (3)) using a linear kernel for the sample covariance matrix R and rank-1 matrices for the task\ncovariances C and $\\Sigma$. The runtime of this model was assessed for a single likelihood optimization on\nan AMD Opteron Processor 6378 using a single core (2.4 GHz, 2,048 KB Cache, 512 GB Memory)\nand compared to a naive implementation. The optimization was stopped prematurely if it did not\nconverge within $10^4$ seconds.\nIn the experiments, we used a standard linear kernel on the features of the samples as sample covariance\nwhile learning the task covariances. This modeling choice results in a steeper runtime increase\nwith the number of tasks, due to the increasing number of model parameters to be estimated. Figure 1\ndemonstrates the significant speed-up. While our algorithm can handle 256 samples/256 tasks\nwith ease, the naive implementation failed to process more than 32 samples/32 tasks.\n\nUnobserved causal process induces structured noise. Unobserved causal processes that are not captured\nvia the inputs are a common source of structured residuals. 
To explore this setting, we\ngenerated simulated outputs from a sum of two different processes. For one of the processes, we\nassumed that the causal features Xobs were observed, whereas for the second process the causal\nfeatures Xhidden were hidden and independent of the observed measurements. Both processes were\nsimulated to have a linear effect on the output. The effect from the observed features was again\ndivided up into an independent effect, which is task-speci\ufb01c, and a common effect, which, up to\n\n5\n\n\frescaling rcommon, is shared over all tasks:\nYcommon = XobsWcommon, Wcommon = rcommon \u2297 wcommon, rcommon \u223c N (0, I), wcommon \u223c N (0, I)\nThe trade-off parameter \u00b5common determines the extent of relatedness between tasks:\n\nYobs = \u00b5commonYcommon + (1 \u2212 \u00b5common)Yind.\n\nThe effect of the hidden features was simulated analogously. A second trade-off parameter \u00b5hidden\nwas introduced, controlling the ratio between the observed and hidden effect:\n\nY = \u00b5signal [(1 \u2212 \u00b5hidden)Yobs + \u00b5hiddenYhidden] + (1 \u2212 \u00b5signal)Ynoise,\n\nwhere Ynoise is Gaussian observation noise, and \u00b5signal is a third trade-off parameter de\ufb01ning the\nratio between noise and signal.\ntrade-off parameters, we considered a series of\nTo investigate the impact of the different\ndatasets varying one of the parameters while keeping others \ufb01xed. We varied \u00b5signal in the\nrange {0.1, 0.3, 0.5, 0.7, 0.9, 1.0}, \u00b5common \u2208 {0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0} and \u00b5hidden \u2208\n{0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0}, with default values marked in bold. Note that the best possible\nexplained variance for the default setting is 45%, as the causal signal is split up equally between\nthe observed and the hidden process. For all simulation experiments, we created datasets with 200\nsamples and 10 tasks. The number of observed features was set to 200, as well as the number of\nhidden features. 
For each such simulation setting, we created 30 datasets.\nFirst, we considered the impact of variation in signal strength \u00b5signal (Figure 2a), where the overall\nsignal was divided up equally between the observed and hidden signal. Both GP-single and GP-\nkronsum performed better as the overall signal strength increased. The performance of GP-kronsum\nwas superior, as the model can exploit the relatedness between the different tasks.\nSecond, we explored the ability of the different methods to cope with an underlying hidden pro-\ncess (Figure 2b). In the absence of a hidden process (\u00b5hidden = 0), GP-kronprod and GP-kronsum\nhad very similar performances, as both methods leverage the shared signal of the observed pro-\ncess, thereby outperforming the single-task GPs. However, as the magnitude of the hidden signal\nincreases, GP-kronprod falsely explains the task correlation completely by the covariance term rep-\nresenting the observed process which leads to loss of predictive power.\nLast, we examined the ability of different methods to exploit the relatedness between the tasks (Fig-\nure 2c). Since GP-single assumed independent tasks, the model performed very similarly across\nthe full range of common signal. GP-kronprod suffered from the same limitations as previously de-\nscribed, because the correlation between tasks in the hidden process increases synchronously with\nthe correlation in the observed process as \u00b5common increases. 
In contrast, GP-kronsum could take\nadvantage of the shared component between the tasks, as knowledge is transferred between them.\nGP-pool was consistently outperformed by all competitors as two of its main assumptions are heav-\nily violated: samples of different tasks do not share the same signal and the residuals are neither\nindependent of each other, nor do they have the same noise level.\nIn summary, the proposed model is robust to a range of different settings and clearly outperforms its\ncompetitors when the tasks are related to each other and not all causal processes are observed.\n\n4.2 Applications to phenotype prediction\n\nAs a real world application we considered phenotype prediction in statistical genetics. The aim of\nthese experiments was to demonstrate the relevance of unobserved causes in real world prediction\nproblems and hence warrant greater attention.\n\nGene expression prediction in yeast We considered gene expression levels from a yeast genet-\nics study [14]. The dataset comprised of gene expression levels of 5, 493 genes and 2, 956 SNPs\n(features), measured for 109 yeast crosses. Expression levels for each cross were measured in two\nconditions (glucose and ethanol as carbon source), yielding a total of 218 samples. In this experi-\nment, we treated the condition information as a hidden factor instead of regressing it out, which is\nanalogous to the hidden process in the simulation experiments. The goal of this experiment was to\ninvestigate how alternative methods can deal and correct for this hidden covariate. We normalized\nall features and all tasks to zero mean and unit variance. Subsequently, we \ufb01ltered out all genes\nthat were not consistently expressed in at least 90% of the samples (z-score cutoff 1.5). We also\n\n6\n\n\f(a) Total Signal\n\n(b) Hidden Signal\n\n(c) Shared Signal\n\nFigure 2: Evaluation of alternative methods for different simulation settings. 
From left to right:\n(a) Evaluation for varying signal strength. (b) Evaluation for variable impact of the hidden signal.\n(c) Evaluation for different strength of relatedness between the tasks. In each simulation setting, all\nother parameters were kept constant at default parameters marked with the yellow star symbol.\n\n(a) Empirical\n\n(b) Signal\n\n(c) Noise\n\nFigure 3: Fitted task covariance matrices for gene expression levels in yeast. From left to right:\n(a) Empirical covariance matrix of the gene expression levels. (b) Signal covariance matrix learnt\nby GP-kronsum. (c) Noise covariance matrix learnt by GP-kronsum. The ordering of the tasks was\ndetermined using hierarchical clustering on the empirical covariance matrix.\n\ndiscarded genes with low signal (< 10% of the variance) or were close to noise free (> 90% of the\nvariance), reducing the number of genes to 123, which we considered as tasks in our experiment.\nThe signal strength was estimated by a univariate GP model. We used a linear kernel calculated on\nthe SNP features for the sample covariance.\nFigure 3 shows the empirical covariance and the learnt task covariances by GP-kronsum. Both learnt\ncovariances are highly structured, demonstrating that the assumption of iid noise in the GP-kronprod\nmodel is violated in this dataset. While the signal task covariance matrix re\ufb02ects genetic signals that\nare shared between the gene expression levels, the noise covariance matrix mainly captures the\nmean shift between the two conditions the gene expression levels were measured in (Figure 4). To\ninvestigate the robustness of the reconstructed latent factor, we repeated the training 10 times. 
The\nmean latent factors and its standard errors were 0.2103 \u00b1 0.0088 (averaged over factors, over the 10\nbest runs selected by out-of-sample likelihood), demonstrating robustness of the inference.\nWhen considering alternative methods for out of sample prediction, the proposed Kronecker Sum\nmodel (r2(GP-kronsum)=0.3322\u00b1 0.0014) performed signi\ufb01cantly better than previous approaches\n(r2(GP-pool)=0.0673 \u00b1 0.0004, r2(GP-single)=0.2594 \u00b1 0.0011, r2(GP-kronprod)=0.1820 \u00b1\n0.0020). The results are averages over 10 runs and \u00b1 denotes the corresponding standard errors.\n\nMulti-phenotype prediction in Arabidopsis thaliana. As a second dataset, we considered a\ngenome-wide association study in Arabidopsis thaliana [15] to assess the prediction of develop-\nmental phenotypes from genomic data. This dataset consisted of 147 samples and 216,130 single\nnucleotide polymorphisms (SNPs, here used as features). As different tasks, we considered the phe-\nnotypes \ufb02owering period duration, life cycle period, maturation period and reproduction period.\nTo avoid outliers and issues due to non-Gaussianity, we preprocessed the phenotypic data by \ufb01rst\nconverting it to ranks and squashing the ranks through the inverse cumulative Gaussian distribution.\nThe SNPs in Arabidopsis thaliana are binary and we discarded features with a frequency of less\n\n7\n\n\fFigure 4: Correlation between the mean\ndifference of the two conditions and the\nlatent factors on the yeast dataset. Shown\nis the strength of the latent factor of the sig-\nnal (left) and the noise (right) task covari-\nance matrix as a function of the mean dif-\nference between the two environmental con-\nditions. Each dot corresponds to one gene\nexpression level.\n\n(a) Signal\n\n(b) Noise\n\nthan 10% in all samples, resulting in 176,436 SNPs. Subsequently, we normalized the features to\nzero mean and unit variance. 
Again, we used a linear kernel on the SNPs as sample covariance.\nSince the causal processes in Arabidopsis thaliana are complex, we allowed the rank of the signal\nand noise matrix to vary between 1 and 3. The appropriate rank complexity was selected on the 10%\nhold out data of the training fold. We considered the average squared correlation coef\ufb01cient on the\nholdout fraction of the training data to select the model for prediction on the test dataset. Notably,\nfor GP-kronprod, the selected task complexity was rank(C) = 3, whereas GP-kronsum selected\na simpler structure for the signal task covariance (rank(C) = 1) and chose a more complex noise\ncovariance, rank(\u03a3) = 2.\nThe cross validation prediction performance of each model is shown in Table 1. For reproduction\nperiod, GP-single is outperformed by all other methods. For the phenotype life cycle period, the\nnoise estimates of the univariate GP model were close to zero, and hence all methods, except of\nGP-pool, performed equally well since the measurements of the other phenotypes do not provide\nadditional information. For maturation period, GP-kronsum and GP-kronprod showed improved\nperformance compared to GP-single and GP-pool. 
For flowering period duration, GP-kronsum\noutperformed its competitors.\n\nMethod | Flowering period duration | Life cycle period | Maturation period | Reproduction period\nGP-pool | 0.0502 \u00b1 0.0025 | 0.1038 \u00b1 0.0034 | 0.0460 \u00b1 0.0024 | 0.0478 \u00b1 0.0013\nGP-single | 0.0385 \u00b1 0.0017 | 0.3500 \u00b1 0.0069 | 0.1612 \u00b1 0.0027 | 0.0272 \u00b1 0.0024\nGP-kronprod | 0.0846 \u00b1 0.0021 | 0.3417 \u00b1 0.0062 | 0.1878 \u00b1 0.0042 | 0.0492 \u00b1 0.0032\nGP-kronsum | 0.1127 \u00b1 0.0049 | 0.3485 \u00b1 0.0068 | 0.1918 \u00b1 0.0041 | 0.0501 \u00b1 0.0033\n\nTable 1: Predictive performance of the different methods on the Arabidopsis thaliana dataset.\nShown is the squared correlation coefficient and its standard error (measured by repeating 10-fold\ncross-validation 10 times).\n\n5 Discussion and conclusions\n\nMulti-task Gaussian process models are a widely used tool in many application domains, ranging\nfrom the prediction of user preferences in collaborative filtering to the prediction of phenotypes in\ncomputational biology. Many of these prediction tasks are complex and important causal features\nmay remain unobserved or are not modeled. Nevertheless, most approaches in common usage assume\nthat the observation noise is independent between tasks. We here propose the GP-kronsum\nmodel, which allows efficient modeling of data where the noise is dependent between tasks by building\non a sum of Kronecker products covariance.\nIn applications to statistical genetics, we have\ndemonstrated (1) the advantages of the dependent noise model over an independent noise model, as\nwell as (2) the feasibility of applying the model to larger data sets thanks to the efficient learning algorithm.\n\nAcknowledgement\n\nWe thank Francesco Paolo Casale for helpful discussions. OS was supported by a Marie Curie\nFP7 fellowship. 
KB was supported by the Alfried Krupp Prize for Young University Teachers of the\nAlfried Krupp von Bohlen und Halbach-Stiftung.\n\nReferences\n\n[1] Edwin V. Bonilla, Kian Ming Adam Chai, and Christopher K. I. Williams. Multi-task Gaussian\nprocess prediction. In NIPS, 2007.\n\n[2] Mauricio A. \u00c1lvarez and Neil D. Lawrence. Sparse convolved Gaussian processes for\nmulti-output regression. In NIPS, pages 57\u201364, 2008.\n\n[3] Edwin V. Bonilla, Felix V. Agakov, and Christopher K. I. Williams. Kernel multi-task learning\nusing task-specific features. In AISTATS, 2007.\n\n[4] Byron M. Yu, John P. Cunningham, Gopal Santhanam, Stephen I. Ryu, Krishna V. Shenoy, and\nManeesh Sahani. Gaussian-process factor analysis for low-dimensional single-trial analysis of\nneural population activity. In NIPS, pages 1881\u20131888, 2008.\n\n[5] Oliver Stegle, Christoph Lippert, Joris M. Mooij, Neil D. Lawrence, and Karsten M. Borgwardt.\nEfficient inference in matrix-variate Gaussian models with iid observation noise. In\nNIPS, pages 630\u2013638, 2011.\n\n[6] Karin Meyer. Estimating variances and covariances for multivariate animal models by\nrestricted maximum likelihood. Genetics Selection Evolution, 23(1):67\u201383, 1991.\n\n[7] V Ducrocq and H Chapuis. Generalizing the use of the canonical transformation for the\nsolution of multivariate mixed model equations. Genetics Selection Evolution, 29(2):205\u2013224,\n1997.\n\n[8] Hao Zhang. Maximum-likelihood estimation for multivariate spatial linear coregionalization\nmodels. Environmetrics, 18(2):125\u2013139, 2007.\n\n[9] Andrew Gordon Wilson, David A. Knowles, and Zoubin Ghahramani. Gaussian process\nregression networks. In ICML, 2012.\n\n[10] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine\nLearning (Adaptive Computation and Machine Learning). The MIT Press, 2005.\n\n[11] Alfredo A. 
Kalaitzis and Neil D. Lawrence. Residual components analysis. In ICML, 2012.\n\n[12] Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B:\nFortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw.,\n23(4):550\u2013560, December 1997.\n\n[13] Ulrike Ober, Julien F. Ayroles, Eric A. Stone, Stephen Richards, et al. Using Whole-Genome\nSequence Data to Predict Quantitative Trait Phenotypes in Drosophila melanogaster.\nPLoS Genetics, 8(5):e1002685+, May 2012.\n\n[14] Erin N Smith and Leonid Kruglyak. Gene\u2013environment interaction in yeast gene expression.\nPLoS Biology, 6(4):e83, 2008.\n\n[15] S. Atwell, Y. S. Huang, B. J. Vilhjalmsson, Willems, et al. Genome-wide association study\nof 107 phenotypes in Arabidopsis thaliana inbred lines. Nature, 465(7298):627\u2013631, Jun 2010.\n", "award": [], "sourceid": 731, "authors": [{"given_name": "Barbara", "family_name": "Rakitsch", "institution": "MPI T\u00fcbingen"}, {"given_name": "Christoph", "family_name": "Lippert", "institution": "Microsoft Research"}, {"given_name": "Karsten", "family_name": "Borgwardt", "institution": "MPI T\u00fcbingen & University of T\u00fcbingen"}, {"given_name": "Oliver", "family_name": "Stegle", "institution": "EMBL-EBI"}]}