{"title": "Generalization Errors and Learning Curves for Regression with Multi-task Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 279, "page_last": 287, "abstract": "We provide some insights into how task correlations in multi-task Gaussian process (GP) regression affect the generalization error and the learning curve. We analyze the asymmetric two-task case, where a secondary task is to help the learning of a primary task. Within this setting, we give bounds on the generalization error and the learning curve of the primary task. Our approach admits intuitive understandings of the multi-task GP by relating it to single-task GPs. For the case of one-dimensional input-space under optimal sampling with data only for the secondary task, the limitations of multi-task GP can be quantified explicitly.", "full_text": "Generalization Errors and Learning Curves for\nRegression with Multi-task Gaussian Processes\n\nKian Ming A. Chai\n\nSchool of Informatics, University of Edinburgh,\n10 Crichton Street, Edinburgh EH8 9AB, UK\n\nk.m.a.chai@ed.ac.uk\n\nAbstract\n\nWe provide some insights into how task correlations in multi-task Gaussian pro-\ncess (GP) regression affect the generalization error and the learning curve. We\nanalyze the asymmetric two-tasks case, where a secondary task is to help the learn-\ning of a primary task. Within this setting, we give bounds on the generalization\nerror and the learning curve of the primary task. Our approach admits intuitive\nunderstandings of the multi-task GP by relating it to single-task GPs. For the\ncase of one-dimensional input-space under optimal sampling with data only for\nthe secondary task, the limitations of multi-task GP can be quanti\ufb01ed explicitly.\n\n1 Introduction\n\nGaussian processes (GPs) (see e.g., [1]) have been applied to many practical problems. 
In recent years, a number of models for multi-task learning with GPs have been proposed to allow different tasks to leverage one another [2-5]. While it is generally assumed that learning multiple tasks together is beneficial, we are not aware of any work that quantifies such benefits, other than PAC-based theoretical analyses for multi-task learning [6-8]. Following the tradition of theoretical work on GPs in machine learning, our goal is to quantify the benefits using average-case analysis. We concentrate on the asymmetric two-task case, where the secondary task is to help the learning of the primary task. Within this setting, the main parameters are (1) the degree of "relatedness" ρ between the two tasks, and (2) the proportion π_S of the total training data that belongs to the secondary task. While higher |ρ| and lower π_S are clearly more beneficial to the primary task, the extent and manner in which this is so have not been clear. To address this, we measure the benefits using the generalization error, the learning curve and the optimal error, and investigate the influence of ρ and π_S on these quantities.

We will give non-trivial lower and upper bounds on the generalization error and the learning curve. Both types of bounds are important in providing assurance on the quality of predictions: an upper bound provides an estimate of the amount of training data needed to attain a minimum performance level, while a lower bound provides an understanding of the limitations of the model [9]. Our approach relates multi-task GPs to single-task GPs and admits intuitive understandings of multi-task GPs. For a one-dimensional input-space under optimal sampling with data only for the secondary task, we show the limit to which the error for the primary task can be reduced. 
This dispels any misconception that abundant data for the secondary task can remedy the lack of data for the primary task.

2 Preliminaries and problem statement

2.1 Multi-task GP regression model and setup

The multi-task Gaussian process regression model in [5] learns M related functions {f_m}, m = 1…M, by placing a zero-mean GP prior which directly induces correlations between tasks. Let y_m be an observation of the mth function at x. Then the model is given by

  ⟨f_m(x) f_m′(x′)⟩ := K^f_mm′ k^x(x, x′),   y_m ∼ N(f_m(x), σ²_m),   (1)

where k^x is a covariance function over inputs, K^f is a positive semi-definite matrix of inter-task similarities, and σ²_m is the noise variance for the mth task.

The current focus is on the two-task case, where the secondary task S is to help improve the performance of the primary task T; this is the asymmetric multi-task learning as coined in [10]. We fix K^f to be a correlation matrix, and let the variance be explained fully by k^x (the converse has been done in [5]). Thus K^f is fully specified by the correlation ρ ∈ [−1, 1] between the two tasks. We further fix the noise variances of the two tasks to be the same, say σ²_n. For the training data, there are n_T (resp. n_S) observations at locations X_T (resp. X_S) for task T (resp. S). We use n := n_T + n_S for the total number of observations, π_S := n_S/n for the proportion of observations for task S, and also X := X_T ∪ X_S. The aim is to infer the noise-free response f_T* for task T at x*. See Figure 1.

The covariance matrix of the noisy training data is K(ρ) + σ²_n I, where

  K(ρ) := [ K^x_TT   ρK^x_TS
            ρK^x_ST   K^x_SS ];   (2)

K^x_TT (resp. K^x_SS) is the matrix of covariances (due to k^x) between locations in X_T (resp. X_S); K^x_TS is the matrix of cross-covariances from locations in X_T to locations in X_S; and K^x_ST is K^x_TS transposed. The posterior variance at x* for task T is

  σ²_T(x*, ρ, σ²_n, X_T, X_S) = k** − kᵀ_* (K(ρ) + σ²_n I)⁻¹ k_*,   where kᵀ_* := ( (k^x_T*)ᵀ, ρ(k^x_S*)ᵀ );   (3)

k** is the prior variance at x*, and k^x_T* (resp. k^x_S*) is the vector of covariances (due to k^x) between locations in X_T (resp. X_S) and x*. Where appropriate and clear from context, we will suppress some of the parameters in σ²_T(x*, ρ, σ²_n, X_T, X_S), or use X for (X_T, X_S). Note that σ²_T(ρ) = σ²_T(−ρ), so that σ²_T(1) is the same as σ²_T(−1); for brevity, we only write the former.

If the GP prior is correctly specified, then the posterior variance (3) is also the generalization error at x* [1, §7.3]. The latter is defined as ⟨(f*_T(x*) − f̄_T(x*))²⟩_{f*_T}, where f̄_T(x*) is the posterior mean at x* for task T, and the expectation is taken over the distribution from which the true function f*_T is drawn. In this paper, in order to distinguish succinctly from the generalization error introduced in the next section, we use posterior variance to mean the generalization error at x*. Note that the actual y-values observed at X do not affect the posterior variance at any test location.

Problem statement  Given the above setting, the aim is to investigate how training observations for task S can benefit the predictions for task T. 
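As a concrete illustration of (2) and (3), the posterior variance of the primary task can be computed directly from the block covariance K(ρ). The sketch below assumes a unit-variance squared exponential kernel for k^x and illustrative data locations; these choices are not prescribed by the text, only the block structure is.

```python
import numpy as np

def se_kernel(A, B, ls=0.2):
    """Assumed unit-variance squared exponential covariance k^x."""
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / ls) ** 2)

def posterior_var_T(x_star, rho, noise_var, XT, XS, ls=0.2):
    """Posterior variance (3) of task T at x_star, using K(rho) of (2)."""
    K = np.block([[se_kernel(XT, XT, ls), rho * se_kernel(XT, XS, ls)],
                  [rho * se_kernel(XS, XT, ls), se_kernel(XS, XS, ls)]])
    # k_* stacks covariances to X_T and rho-scaled covariances to X_S.
    k_star = np.concatenate([se_kernel(XT, np.array([x_star]), ls).ravel(),
                             rho * se_kernel(XS, np.array([x_star]), ls).ravel()])
    k_ss = 1.0  # prior variance k_** (unit signal variance assumed)
    return k_ss - k_star @ np.linalg.solve(K + noise_var * np.eye(len(K)), k_star)

XT = np.array([0.3, 0.7])       # primary-task inputs (illustrative)
XS = np.array([0.1, 0.5, 0.9])  # secondary-task inputs (illustrative)
v0 = posterior_var_T(0.5, 0.0, 0.05, XT, XS)  # rho = 0: X_S contributes nothing
v1 = posterior_var_T(0.5, 1.0, 0.05, XT, XS)  # rho = 1: X_S acts as task-T data
```

Here v1 ≤ v0, reflecting the ordering σ²_T(x*, 1) ≤ σ²_T(x*, ρ) ≤ σ²_T(x*, 0): the more correlated the secondary task, the more its observations reduce the primary task's posterior variance.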
We measure the benefits using the generalization error, the learning curve and the optimal error, and investigate how these quantities vary with ρ and π_S.

2.2 Generalization errors, learning curves and optimal errors

We outline the general approach to obtain the generalization error and the learning curve [1, §7.3] under our setting, where we have two tasks and are concerned with the primary task T. Let p(x) be the probability density, common to both tasks, from which test and training locations are drawn, and assume that the GP prior is correctly specified. The generalization error for task T is obtained by averaging the posterior variance for task T over x*, and the learning curve for task T is obtained by averaging the generalization error over training sets X:

  generalization error:  ε_T(ρ, σ²_n, X_T, X_S) := ∫ σ²_T(x*, ρ, σ²_n, X_T, X_S) p(x*) dx*,   (4)
  learning curve:        ε^avg_T(ρ, σ²_n, π_S, n) := ∫ ε_T(ρ, σ²_n, X_T, X_S) p(X) dX,   (5)

where the training locations in X are drawn i.i.d., that is, p(X) factorizes completely into a product of p(x)s. Besides averaging ε_T to obtain the learning curve, one may also use the optimal experimental design methodology and minimize ε_T over X to find the optimal generalization error [11, chap. II]:

  optimal error:  ε^opt_T(ρ, σ²_n, π_S, n) := min_X ε_T(ρ, σ²_n, X_T, X_S).   (6)

Both ε_T(0, σ²_n, X_T, X_S) and ε_T(1, σ²_n, X_T, X_S) reduce to single-task GP cases; the former discards the training observations at X_S, while the latter includes them. Similar analogues to single-task GP cases for ε^avg_T(0, σ²_n, π_S, n), ε^avg_T(1, σ²_n, π_S, n), ε^opt_T(0, σ²_n, π_S, n) and ε^opt_T(1, σ²_n, π_S, n) can be obtained. 
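Definitions (4) and (5) lend themselves to simple Monte Carlo estimation. The sketch below assumes p(x) uniform on [0, 1] and a unit-variance squared exponential k^x; all numerical choices are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def se(A, B, ls=0.2):
    """Assumed unit-variance squared exponential covariance k^x."""
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / ls) ** 2)

def gen_error(rho, noise_var, XT, XS, n_test=4000):
    """Monte Carlo estimate of the generalization error (4) for task T."""
    xs = rng.uniform(0.0, 1.0, n_test)  # test locations from p(x) = U[0, 1]
    K = np.block([[se(XT, XT), rho * se(XT, XS)],
                  [rho * se(XS, XT), se(XS, XS)]])
    A = np.linalg.inv(K + noise_var * np.eye(len(K)))
    Kx = np.vstack([se(XT, xs), rho * se(XS, xs)])  # column j is k_* for test j
    # Average the posterior variance (3) over the test locations.
    return float(np.mean(1.0 - np.einsum('ij,ik,kj->j', Kx, A, Kx)))

def learning_curve_point(rho, noise_var, pi_S, n, n_sets=30):
    """Monte Carlo estimate of the learning curve (5) at total size n."""
    nS = int(round(pi_S * n))
    return float(np.mean([gen_error(rho, noise_var,
                                    rng.uniform(0.0, 1.0, n - nS),
                                    rng.uniform(0.0, 1.0, nS))
                          for _ in range(n_sets)]))

XT_demo, XS_demo = np.array([1/3, 2/3]), np.array([1/5, 1/2, 4/5])
e_full = gen_error(1.0, 0.05, XT_demo, XS_demo)  # X_S fully usable
e_none = gen_error(0.0, 0.05, XT_demo, XS_demo)  # X_S discarded
```

With ρ = 1 the secondary observations act as primary ones, so e_full falls below e_none; averaging fresh draws of X, as in learning_curve_point, gives one point of (5).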
Note that ε^avg_T and ε^opt_T are well-defined since π_S n = n_S ∈ ℕ₀ by the definition of π_S.

Figure 1: The two tasks S and T have task correlation ρ. The data set X_T (resp. X_S) for task T (resp. S) consists of the •s (resp. ◦s). The test location x* for task T is denoted by a star.

Figure 2: The posterior variances at each test location within [0, 1] given data •s at 1/3 and 2/3 for task T, and ◦s at 1/5, 1/2 and 4/5 for task S.

2.3 Eigen-analysis

We now state known results of eigen-analysis used in this paper. Let κ̄ := κ₁ > κ₂ > … and φ₁(·), φ₂(·), … be the eigenvalues and eigenfunctions of the covariance function k^x under the measure p(x)dx: they satisfy the integral equation ∫ k^x(x, x′) φ_i(x) p(x) dx = κ_i φ_i(x′). Let λ̄ := λ₁ > λ₂ > … > λ_{n_S} =: λ̲ be the eigenvalues of K^x_SS. If the locations in X_S are sampled from p(x), then κ_i = lim_{n_S→∞} λ_i/n_S, i = 1 … n_S; see e.g., [1, §4.3.2] and [12, Theorem 3.4]. However, for finite n_S used in practice, the estimate λ_i/n_S for κ_i is better for the larger eigenvalues than for the smaller ones. Additionally, in one dimension with uniform p(x) on the unit interval, if k^x satisfies the Sacks-Ylvisaker conditions of order r, then κ_i ∝ (πi)^(−2r−2) in the limit i → ∞ [11, Proposition IV.10, Remark IV.2]. Broadly speaking, an order-r process is exactly r times mean square differentiable. For example, the stationary Ornstein-Uhlenbeck process is of order r = 0.

3 Generalization error

In this section, we derive expressions for the generalization error (and the bounds thereon) for the two-task case in terms of the single-task one. To illustrate and further motivate the problem, Figure 2 plots the posterior variance σ²_T(x*, ρ) as a function of x* given two observations for task T and three observations for task S. We roughly follow [13, Fig. 2], and use the squared exponential covariance function with length-scale 0.11 and noise variance σ²_n = 0.05. Six solid curves are plotted, corresponding, from top to bottom, to ρ² = 0, 1/8, 1/4, 1/2, 3/4 and 1. The two dashed curves enveloping each solid curve are the lower and upper bounds derived in this section; the dashed curves are hardly visible because the bounds are rather tight. The dotted line is the prior noise variance.

Similar to the case of single-task learning, each training point creates a depression on the σ²_T(x*, ρ) surface [9, 13]. However, while each training point for task T creates a "full" depression that reaches the prior noise variance (horizontal dotted line at 0.05), the depression created by each training point for task S depends on ρ, with "deeper" depressions for larger ρ². 
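This depression behaviour can be checked numerically with the setup reported for Figure 2 (length-scale 0.11, σ²_n = 0.05, task-T data at 1/3 and 2/3, task-S data at 1/5, 1/2 and 4/5); the unit prior variance of the kernel is an assumption on our part.

```python
import numpy as np

def se(A, B, ls=0.11):
    """Squared exponential covariance with the Figure 2 length-scale."""
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / ls) ** 2)

def post_var_T(x_star, rho, XT, XS, noise_var=0.05):
    """Posterior variance (3) of task T at a single test point x_star."""
    K = np.block([[se(XT, XT), rho * se(XT, XS)],
                  [rho * se(XS, XT), se(XS, XS)]])
    k = np.concatenate([se(XT, np.array([x_star])).ravel(),
                        rho * se(XS, np.array([x_star])).ravel()])
    return 1.0 - k @ np.linalg.solve(K + noise_var * np.eye(len(K)), k)

XT, XS = np.array([1/3, 2/3]), np.array([1/5, 1/2, 4/5])
# Depth of the depression at a task-S location, for the six values of rho^2.
depths = [post_var_T(0.5, np.sqrt(r2), XT, XS)
          for r2 in (0, 1/8, 1/4, 1/2, 3/4, 1)]
# The depression at a task-S point deepens as rho^2 grows.
assert all(a >= b - 1e-9 for a, b in zip(depths, depths[1:]))
```

At ρ² = 1 the task-S point at 1/2 behaves like a task-T observation and the variance there drops to near the noise floor; at ρ² = 0 it leaves the surface untouched.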
From the figure, and also from the definition, it is clear that the following trivial bounds on σ²_T(x*, ρ) hold:

Proposition 1. For all x*, σ²_T(x*, 1) ≤ σ²_T(x*, ρ) ≤ σ²_T(x*, 0).

Integrating wrt x* then gives the following corollary:

Corollary 2. ε_T(1, σ²_n, X_T, X_S) ≤ ε_T(ρ, σ²_n, X_T, X_S) ≤ ε_T(0, σ²_n, X_T, X_S).

Sections 3.2 and 3.3 derive lower and upper bounds that are tighter than the above trivial bounds. Prior to the bounds, we consider a degenerate case to illustrate the limitations of multi-task learning.

3.1 The degenerate case of no training data for primary task

It is clear that if there is no training data for the secondary task, that is, if X_S = ∅, then σ²_T(x*, 1) = σ²_T(x*, ρ) = σ²_T(x*, 0) for all x* and ρ. In the converse case where there is no training data for the primary task, that is, X_T = ∅, we instead have the following proposition:

Proposition 3. For all x*, σ²_T(x*, ρ, ∅, X_S) = ρ² σ²_T(x*, 1, ∅, X_S) + (1 − ρ²) k**.

Proof.
  σ²_T(x*, ρ, ∅, X_S) = k** − ρ² (k^x_S*)ᵀ (K^x_SS + σ²_n I)⁻¹ k^x_S*
                      = (1 − ρ²) k** + ρ² [ k** − (k^x_S*)ᵀ (K^x_SS + σ²_n I)⁻¹ k^x_S* ]
                      = (1 − ρ²) k** + ρ² σ²_T(x*, 1, ∅, X_S).  ∎

Hence the posterior variance is a weighted average of the prior variance k** and the posterior variance at perfect correlation. When the cardinality of X_S increases under infill asymptotics [14, §3.3],

  lim_{n_S→∞} σ²_T(x*, 1, ∅, X_S) = 0  ⟹  lim_{n_S→∞} σ²_T(x*, ρ, ∅, X_S) = (1 − ρ²) k**.   (7)

This is the limit for the posterior variance at any test location for task T, if one has training data only for the secondary task S. This is because a correlation of ρ between the tasks prevents any training location for task S from having correlation higher than ρ with a test location for task T. Suppose correlations in the input-space are given by an isotropic covariance function k^x(|x − x′|). If we translate correlations into distances between data locations, then any training location from task S is beyond a certain radius from any test location for task T. In contrast, a training location from task T may lie arbitrarily close to a test location for task T, subject to the constraints of noise.

We obtain the generalization error in this degenerate case by integrating Proposition 3 wrt p(x*)dx* and using the fact that the mean prior variance is given by the sum of the process eigenvalues.

Corollary 4. ε_T(ρ, σ²_n, ∅, X_S) = ρ² ε_T(1, σ²_n, ∅, X_S) + (1 − ρ²) Σ_{i=1}^∞ κ_i.

3.2 A lower bound

When X_T ≠ ∅, the correlations between locations in X_T and locations in X_S complicate the situation. However, since σ²_T(ρ) is a continuous and monotonically decreasing function of ρ, there exists an α ∈ [0, 1], which depends on ρ, x* and X, such that σ²_T(ρ) = α σ²_T(1) + (1 − α) σ²_T(0). That α depends on x* obstructs further analysis. The next proposition gives a lower bound σ̲²_T(ρ) of the same form satisfying σ²_T(1) ≤ σ̲²_T(ρ) ≤ σ²_T(ρ), where the mixing proportion is independent of x*.

Proposition 5. Let σ̲²_T(x*, ρ) := ρ² σ²_T(x*, 1) + (1 − ρ²) σ²_T(x*, 0). Then for all x*:
(a) σ̲²_T(x*, ρ) ≤ σ²_T(x*, ρ);
(b) σ²_T(x*, ρ) − σ̲²_T(x*, ρ) ≤ ρ² ( σ²_T(x*, 0) − σ²_T(x*, 1) );
(c) arg max_{ρ²} [ σ²_T(x*, ρ) − σ̲²_T(x*, ρ) ] ≥ 1/2.

The proofs are in supplementary material §S.2. The lower bound σ̲²_T(ρ) depends explicitly on ρ². It depends implicitly on π_S, the proportion of observations for task S, through the gap between σ²_T(1) and σ²_T(0). If there is no training data for the primary task, i.e., if π_S = 1, the bound reduces to Proposition 3, and becomes exact for all values of ρ. If π_S = 0, the bound is also exact. For π_S ∉ {0, 1}, the bound is exact when ρ ∈ {−1, 0, 1}. As Figure 2 and, later, our simulation results in section 5.3 show, this bound is rather tight. Part (b) of the proposition states the tightness of the bound: it is no more than a factor ρ² of the gap between the trivial bounds σ²_T(0) and σ²_T(1). Part (c) of the proposition says that the bound is least tight for a value of ρ² greater than 1/2.

We provide an intuition on Proposition 5a. Let f̄₁ (resp. f̄₀) be the posterior mean of the single-task GP when ρ = 1 (resp. ρ = 0). Contrasted with the multi-task predictor f̄_T, f̄₁ directly involves the noisy observations for task T at X_S, so it has more information on task T. 
Hence, predicting with f̄₁(x*) gives the trivial lower bound σ²_T(1) on σ²_T(ρ). The tighter bound σ̲²_T(ρ) is obtained by "throwing away" information and predicting with f̄₁(x*) with probability ρ² and with f̄₀(x*) with probability (1 − ρ²).

Finally, the next corollary is readily obtained from Proposition 5a by integrating wrt p(x*)dx*. This is possible because ρ is independent of x*.

Corollary 6. Let ε̲_T(ρ, σ²_n, X_T, X_S) := ρ² ε_T(1, σ²_n, X_T, X_S) + (1 − ρ²) ε_T(0, σ²_n, X_T, X_S). Then ε̲_T(ρ, σ²_n, X_T, X_S) ≤ ε_T(ρ, σ²_n, X_T, X_S).

3.3 An upper bound via equivalent isotropic noise at X_S

The following question motivates our upper bound: if the training locations in X_S had been observed for task T rather than for task S, what is the variance σ̃²_n of the equivalent isotropic noise at X_S such that the posterior variance remains the same? To answer this question, we first refine the definition of σ²_T(·) to include a different noise variance parameter s² for the X_S observations:

  σ²_T(x*, ρ, σ²_n, s², X_T, X_S) := k** − kᵀ_* [ K(ρ) + ( σ²_n I  0 ; 0  s² I ) ]⁻¹ k_*;   (8)

cf. (3). We may suppress the parameters x*, X_T and X_S when writing σ²_T(·). The variance σ̃²_n of the equivalent isotropic noise is a function of x* defined by the equation

  σ²_T(x*, ρ, σ²_n, σ²_n) = σ²_T(x*, 1, σ²_n, σ̃²_n).   (9)

For any x* there is always a σ̃²_n that satisfies the equation because the difference

  Δ(ρ, σ²_n, s²) := σ²_T(x*, ρ, σ²_n, σ²_n) − σ²_T(x*, 1, σ²_n, s²)   (10)

is a continuous and monotonically decreasing function of s². To make progress, we seek an upper bound σ̄²_n for σ̃²_n that is independent of the choice of x*: Δ(ρ, σ²_n, σ̄²_n) ≤ 0 for all test locations. Of interest is the tight upper bound σ̿²_n, which is the minimum possible σ̄²_n, given in the next proposition.

Proposition 7. Let λ̄ be the maximum eigenvalue of K^x_SS, β := ρ⁻² − 1 and σ̿²_n := β(λ̄ + σ²_n) + σ²_n. Then for all x*, σ²_T(x*, ρ, σ²_n, σ²_n) ≤ σ²_T(x*, 1, σ²_n, σ̿²_n). The bound is tight in this sense: for any σ̄²_n, if σ²_T(x*, ρ, σ²_n, σ²_n) ≤ σ²_T(x*, 1, σ²_n, σ̄²_n) for all x*, then σ²_T(x*, 1, σ²_n, σ̿²_n) ≤ σ²_T(x*, 1, σ²_n, σ̄²_n) for all x*.

Proof sketch. Matrix K(ρ) may be factorized as

  K(ρ) = ( I  0 ; 0  ρI ) ( K^x_TT  K^x_TS ; K^x_ST  ρ⁻²K^x_SS ) ( I  0 ; 0  ρI ).   (11)

By using this factorization in the posterior variance (8) and taking out the ( I 0 ; 0 ρI ) factors, we obtain

  σ²_T(x*, ρ, σ²_n, s²) = k** − (k^x_*)ᵀ [Σ(ρ, σ²_n, s²)]⁻¹ k^x_*,   where (k^x_*)ᵀ := ( (k^x_T*)ᵀ, (k^x_S*)ᵀ )   and

  Σ(ρ, σ²_n, s²) := ( K^x_TT  K^x_TS ; K^x_ST  ρ⁻²K^x_SS ) + ( σ²_n I  0 ; 0  ρ⁻²s² I ) = Σ(1, σ²_n, s²) + β ( 0  0 ; 0  K^x_SS + s²I ).   (12)

The second expression for Σ makes clear that, in the terms of σ²_T(1, σ²_n, s²), having data X_S for task S is equivalent to an additional correlated noise at these observations for task T. This expression motivates the question that began this section. Note that ρ⁻² ≥ 1, and hence β ≥ 0.

The increase in posterior variance due to having X_S at task S with noise variance σ²_n rather than having them at task T with noise variance s² is given by Δ(ρ, σ²_n, s²), which we may now write as

  Δ(ρ, σ²_n, s²) = (k^x_*)ᵀ [ (Σ(1, σ²_n, s²))⁻¹ − (Σ(ρ, σ²_n, σ²_n))⁻¹ ] k^x_*.   (13)

Recall that we seek an upper bound σ̄²_n such that Δ(ρ, σ²_n, σ̄²_n) ≤ 0 for all test locations. In general, this requires σ̿²_n := β(λ̄ + σ²_n) + σ²_n ≤ σ̄²_n; details can be found in supplementary material §S.3. The tightness of σ̿²_n is evident from the construction.  ∎

Intuitively, σ²_T(x*, 1, σ²_n, σ̿²_n) is the tight upper bound because it inflates the noise (co)variance at X_S just sufficiently, from βK^x_SS + σ²_n I/ρ² to σ̿²_n I. Analogously, the tight lower bound on σ̃²_n is σ̲²_n := β(λ̲ + σ²_n) + σ²_n, where λ̲ is the minimum eigenvalue of K^x_SS. In summary, ρ⁻²σ²_n ≤ σ̲²_n ≤ σ̃²_n ≤ σ̿²_n, where the first inequality is obtained by substituting zero for λ̲ in σ̲²_n. Hence observing X_S at S is at most as "noisy" as an additional β(λ̄ + σ²_n) noise variance, and at least as "noisy" as an additional β(λ̲ + σ²_n) noise variance. Since β decreases with |ρ|, the additional noise variances are smaller when |ρ| is larger, i.e., when task S is more correlated with task T.

We give a description of how the above bounds scale with n_S, using the results stated in section 2.3. For large enough n_S, we may write λ̄ ≈ n_S κ̄ and λ̲ ≈ n_S κ_{n_S}. Furthermore, for uniformly distributed inputs in the one-dimensional unit interval, if the covariance function satisfies Sacks-Ylvisaker conditions of order r, then κ_{n_S} = Θ((πn_S)^(−2r−2)), so that λ̲ = Θ((πn_S)^(−2r−1)). Since σ̿²_n and σ̲²_n are linear in λ̄ and λ̲, we have σ̿²_n = ρ⁻²σ²_n + β Θ(n_S) and σ̲²_n = ρ⁻²σ²_n + β Θ(n_S^(−2r−1)). For the upper bound σ̿²_n, note that although it scales linearly with n_S, the eigenvalues of K(1) scale with n, so that σ²_T(1, σ²_n, σ̿²_n) depends on π_S := n_S/n. In contrast, the lower bound σ̲²_n is dominated by ρ⁻²σ²_n even for moderate sizes n_S, so that σ²_T(1, σ²_n, σ̲²_n) does not depend on π_S. Therefore, the lower bound is not as useful as the upper bound.

Finally, if we refine ε_T as we have done for σ²_T in (8), we obtain the following corollary:

Corollary 8. Let ε̄_T(ρ, σ²_n, X_T, X_S) := ε_T(1, σ²_n, σ̿²_n, X_T, X_S). Then ε̄_T(ρ, σ²_n, X_T, X_S) ≥ ε_T(ρ, σ²_n, σ²_n, X_T, X_S).

3.4 Exact computation of generalization error

The factorization of σ²_T expressed by (12) allows the generalization error to be computed exactly in certain cases. 
We replace the quadratic form in (12) by a matrix trace and then integrate out x* to give

  ε_T(ρ, σ²_n, X_T, X_S) = ⟨k**⟩ − tr( Σ⁻¹ ⟨k^x_* (k^x_*)ᵀ⟩ ) = Σ_{i=1}^∞ κ_i − tr( Σ⁻¹ M ),

where Σ denotes Σ(ρ, σ²_n, σ²_n), the expectations are taken over x*, and M is an n-by-n matrix with M_pq := ∫ k^x(x_p, x*) k^x(x_q, x*) p(x*) dx* = Σ_{i=1}^∞ κ_i² φ_i(x_p) φ_i(x_q), where x_p, x_q ∈ X. When the eigenfunctions φ_i(·) are not bounded, the infinite-summation expression for M_pq is often difficult to use. Nevertheless, analytical results for M_pq are still possible in some cases using the integral expression. An example is the case of the squared exponential covariance function with normally distributed x, when the integrand is a product of three Gaussians.

4 Optimal error for the degenerate case of no training data for primary task

If training examples are provided only for task S, then task T has the following optimal performance.

Proposition 9. Under optimal sampling on a 1-d space, if the covariance function satisfies Sacks-Ylvisaker conditions of order r, then ε^opt_T(ρ, σ²_n, 1, n) = Θ(n^(−(2r+1)/(2r+2))) + (1 − ρ²) Σ_{i=1}^∞ κ_i.

Proof. We obtain ε^opt_T(ρ, σ²_n, 1, n) = ρ² ε^opt_T(1, σ²_n, 1, n) + (1 − ρ²) Σ_{i=1}^∞ κ_i by minimizing Corollary 4 wrt X_S. Under the same conditions as the proposition, the optimal generalization error using the single-task GP decays with training set size n as Θ(n^(−(2r+1)/(2r+2))) [11, Proposition V.3]. Thus ρ² ε^opt_T(1, σ²_n, 1, n) = ρ² Θ(n_S^(−(2r+1)/(2r+2))) = Θ(n_S^(−(2r+1)/(2r+2))).  ∎

A direct corollary of the above result is that one cannot expect to do better than (1 − ρ²) Σ κ_i on average. As this is a lower bound, the same can be said for incorrectly specified GP priors.

5 Theoretical bounds on learning curve

Using the results from section 3, lower and upper bounds on the learning curve may be computed by averaging over the choice of X using Monte Carlo approximation.¹ For example, using Corollary 2 and integrating wrt p(X)dX gives the following trivial bounds on the learning curve:

Corollary 10. ε^avg_T(1, σ²_n, π_S, n) ≤ ε^avg_T(ρ, σ²_n, π_S, n) ≤ ε^avg_T(0, σ²_n, π_S, n).

The gap between the trivial bounds can be analyzed as follows. Recall that π_S n ∈ ℕ₀ by definition, so that ε^avg_T(0, σ²_n, π_S, n) = ε^avg_T(1, σ²_n, π_S, (1 − π_S)n). Therefore ε^avg_T(0, σ²_n, π_S, n) is equivalent to ε^avg_T(1, σ²_n, π_S, n) scaled along the n-axis by the factor (1 − π_S) ∈ [0, 1], and hence the gap between the trivial bounds becomes wider with π_S.

In the rest of this section, we derive non-trivial theoretical bounds on the learning curve before providing simulation results. Theoretical bounds are particularly attractive for high-dimensional input-spaces, on which Monte Carlo approximation is harder.

5.1 Lower bound

For the single-task GP, a lower bound on its learning curve is σ²_n Σ_{i=1}^∞ κ_i/(σ²_n + nκ_i) [15]. We shall call this the single-task OV bound. This lower bound can be combined with Corollary 6.

Proposition 11.

  ε^avg_T(ρ, σ²_n, π_S, n) ≥ ρ² σ²_n Σ_{i=1}^∞ κ_i/(σ²_n + nκ_i) + (1 − ρ²) σ²_n Σ_{i=1}^∞ κ_i/(σ²_n + (1 − π_S)nκ_i),

or equivalently,

  ε^avg_T(ρ, σ²_n, π_S, n) ≥ σ²_n Σ_{i=1}^∞ b¹_i κ_i/(σ²_n + nκ_i),   with b¹_i := (σ²_n + (1 − ρ²π_S)nκ_i)/(σ²_n + (1 − π_S)nκ_i),

or equivalently,

  ε^avg_T(ρ, σ²_n, π_S, n) ≥ σ²_n Σ_{i=1}^∞ b⁰_i κ_i/(σ²_n + (1 − π_S)nκ_i),   with b⁰_i := (σ²_n + (1 − ρ²π_S)nκ_i)/(σ²_n + nκ_i).

Proof sketch. To obtain the first inequality, we integrate Corollary 6 wrt p(X)dX and apply the single-task OV bound twice. For the second inequality, its ith summand is obtained by combining the corresponding pair of ith summands in the first inequality. The third inequality is obtained from the second by swapping the denominator of b¹_i with that of κ_i/(σ²_n + nκ_i) for every i.  ∎

For fixed σ²_n, π_S and n, denote the above bound by OV_ρ. Then OV₀ and OV₁ are both single-task bounds. In particular, from Corollary 10, OV₁ is a lower bound on ε^avg_T(ρ, σ²_n, π_S, n). From the first expression of the above proposition, it is clear from the "mixture" nature of the bound that the two-task bound OV_ρ is always better than OV₁. As ρ² decreases, the two-task bound moves towards OV₀; and as π_S increases, the gap between OV₀ and OV₁ increases. In addition, the gap is also larger for rougher processes, which are harder to learn. 

¹Approximate lower bounds are also possible, by combining Corollary 6 and approximations in, e.g., [13].
Therefore, the relative tightness of OV_ρ over OV_1 is more noticeable for lower ρ², higher π_S and rougher processes.

The second expression in Proposition 11 is useful for comparing with OV_1. Each summand for the two-tasks case is a factor b¹_i of the corresponding summand for the single-task case. Since b¹_i ∈ [1, (1 − ρ²π_S)/(1 − π_S)[, OV_ρ is more than OV_1 by at most (1 − ρ²)π_S/(1 − π_S) times. Similarly, the third expression of the proposition is useful for comparing with OV_0: each summand for the two-tasks case is a factor b⁰_i ∈ ](1 − ρ²π_S), 1] of the corresponding single-task one. Hence, OV_ρ is less than OV_0 by up to ρ²π_S times. In terms of the lower bound, this is the limit to which multi-task learning can outperform the single-task learning that ignores the secondary task.

5.2 Upper bound using equivalent noise

An upper bound on the learning curve of a single-task GP is given in [16]. We shall refer to this as the single-task FWO bound and combine it with the approach in section 3.3 to obtain an upper bound on the learning curve of task T. Although the single-task FWO bound was derived for observations with isotropic noise, with some modifications (see supplementary material §S.4), the derivations are still valid for observations with heteroscedastic and correlated noise. Below is a version of the FWO bound that has yet to assume isotropic noise:

Theorem 12. ([16], modified second part of Theorem 6) Consider a zero-mean GP with covariance function k^x(·,·), and eigenvalues κ_i and eigenfunctions φ_i(·) under the measure p(x)dx; and suppose that the noise (co)variances of the observations are given by γ²(·,·). For n observations {x_i}_{i=1}^n, let H and Φ be matrices such that H_ij ≝ k^x(x_i, x_j) + γ²(x_i, x_j) and Φ_ij ≝ φ_j(x_i). Then the learning curve at n is upper-bounded by Σ_{i=1}^∞ κ_i − n Σ_{i=1}^∞ κ_i²/c_i, where c_i ≝ ⟨(ΦᵀHΦ)_ii⟩/n, and the expectation in c_i is taken over the set of n input locations drawn independently from p(x).

Unlike [16], we do not assume that the noise variance γ²(x_i, x_j) is of the form σ_n²δ_ij. Instead of proceeding from the upper bound with equivalent noise ¯̄σ_n², we proceed directly from the exact posterior variance given by (12). Thus we set the observation noise (co)variance γ²(x_i, x_j) to

γ²(x_i, x_j) = δ(x_i ∈ X_T)δ(x_j ∈ X_T)δ_ij σ_n² + δ(x_i ∈ X_S)δ(x_j ∈ X_S)[βk^x(x_i, x_j) + ρ⁻²δ_ij σ_n²],   (14)

so that, through the definition of c_i in Theorem 12, we obtain

c_i = (1 + βπ_S){[(1 + βπ_S²)n/(1 + βπ_S) − 1]κ_i + ∫ k^x(x, x)[φ_i(x)]² p(x)dx + σ_n²};   (15)

details are in the supplementary material §S.5. This leads to the following proposition:

Proposition 13. Let β ≝ ρ⁻² − 1. Then, using the c_i's defined in (15), we have ε_T^avg(ρ, σ_n², π_S, n) ≤ Σ_{i=1}^∞ κ_i − n Σ_{i=1}^∞ κ_i²/c_i.

Denote the above upper bound by FWO_ρ. When ρ = ±1 or π_S = 0, the single-task FWO upper bound is recovered. However, FWO_ρ with ρ = 0 gives the prior variance Σ_i κ_i instead. A trivial upper bound can be obtained using Corollary 10, by replacing n with (1 − π_S)n in the single-task FWO bound. The FWO_ρ bound is better than this trivial single-task bound for small n and high |ρ|.

5.3 Comparing bounds by simulations of learning curve

We compare our bounds with simulated learning curves.
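Proposition 13 can likewise be sketched numerically. The block below is illustrative, not the paper's code: it assumes a unit-variance kernel, so the integral term of (15), ∫k^x(x,x)[φ_i(x)]²p(x)dx, equals 1 by the orthonormality of the eigenfunctions under p(x)dx, and it uses a geometric spectrum as a stand-in for real process eigenvalues.

```python
import numpy as np

# Illustrative geometric spectrum (mimics the decay of SE-kernel eigenvalues);
# any summable positive sequence would do for this sketch.
kap = 0.04 * 0.96 ** np.arange(400)

def fwo_bound(rho2, sig2, pi_s, n, kap):
    """Sketch of FWO_rho (Proposition 13), with the integral term of (15)
    set to 1 under the unit-variance-kernel assumption."""
    beta = 1.0 / rho2 - 1.0                       # beta := rho^-2 - 1
    c = (1.0 + beta * pi_s) * (
        ((1.0 + beta * pi_s ** 2) * n / (1.0 + beta * pi_s) - 1.0) * kap
        + 1.0 + sig2)                             # eq. (15)
    return kap.sum() - n * np.sum(kap ** 2 / c)

def fwo_single(sig2, n, kap):
    """Single-task FWO bound of [16]: the beta = 0 special case of (15)."""
    return kap.sum() - n * np.sum(kap ** 2 / ((n - 1) * kap + 1.0 + sig2))

# rho = +-1, or pi_S = 0, recovers the single-task FWO bound:
assert np.isclose(fwo_bound(1.0, 0.05, 0.5, 100, kap), fwo_single(0.05, 100, kap))
assert np.isclose(fwo_bound(0.25, 0.05, 0.0, 100, kap), fwo_single(0.05, 100, kap))
```

As ρ → 0 the c_i blow up and the bound degrades towards the prior variance Σκ_i, in line with the remark above.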
We follow the third scenario in [13]: the input space is one-dimensional with Gaussian distribution N(0, 1/12), the covariance function is the unit variance squared exponential k^x(x, x′) = exp[−(x − x′)²/(2l²)] with length-scale l = 0.01, the observation noise variance is σ_n² = 0.05, and the learning curves are computed for up to n = 300 training data points. When required, the average over x∗ is computed analytically (see section 3.4). The empirical average over X ≝ X_T ∪ X_S, denoted by ⟨⟨·⟩⟩, is computed over 100 randomly sampled training sets.

Figure 3: Comparison of various bounds for two settings of (ρ, π_S): (a) ρ² = 1/2, π_S = 1/2; (b) ρ² = 3/4, π_S = 3/4. Each graph plots ε_T^avg against n and consists of the "true" multi-task learning curve ⟨⟨ε_T(ρ)⟩⟩ (middle), the theoretical lower/upper bounds OV_ρ/FWO_ρ of Propositions 11/13, the empirical trivial lower/upper bounds ⟨⟨ε_T(1)⟩⟩/⟨⟨ε_T(0)⟩⟩ using Corollary 10, and the empirical lower/upper bounds ⟨⟨ε̲_T(ρ)⟩⟩/⟨⟨ε̄_T(ρ)⟩⟩ using Corollaries 6/8 (×/△). The thickness of the "true" multi-task learning curve reflects the 95% confidence interval.
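One Monte Carlo draw of the simulated error can be sketched as follows, using the standard two-task GP predictive variance with inter-task covariance matrix K^f = [[1, ρ], [ρ, 1]] (cf. [5]); the function names and the choice of assigning the first π_S n inputs to the secondary task are illustrative, not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def se(a, b, l=0.01):
    """Unit-variance squared-exponential kernel matrix with length-scale l."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * l ** 2))

def eps_T(rho, n, pi_s, sig2=0.05, n_test=200):
    """One Monte Carlo draw of the task-T generalization error: the posterior
    variance at task-T test points, averaged over x* ~ N(0, 1/12), given n
    training inputs of which a fraction pi_s belong to the secondary task."""
    sd = np.sqrt(1.0 / 12.0)
    x = rng.normal(0.0, sd, n)
    task = np.zeros(n, dtype=int)                 # task 0 is T (primary)
    task[: int(pi_s * n)] = 1                     # task 1 is S (secondary)
    Kf = np.array([[1.0, rho], [rho, 1.0]])       # inter-task covariances
    K = Kf[task][:, task] * se(x, x) + sig2 * np.eye(n)
    xs = rng.normal(0.0, sd, n_test)
    ks = Kf[0, task][:, None] * se(x, xs)         # cov(data, task-T tests)
    # posterior variance: k(x*, x*) - ks^T K^{-1} ks, with k(x*, x*) = 1
    var = 1.0 - np.einsum("ij,ij->j", ks, np.linalg.solve(K, ks))
    return var.mean()
```

Averaging `eps_T` over many random draws of X gives one point of the empirical curve ⟨⟨ε_T(ρ)⟩⟩; with ρ = 1 the secondary data act as primary data, while with ρ = 0 they are uninformative.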
The process eigenvalues κ_i needed to compute the theoretical bounds are given in [17]. Supplementary material §S.6 gives further details.

Learning curves for pairwise combinations of ρ² ∈ {1/8, 1/4, 1/2, 3/4} and π_S ∈ {1/4, 1/2, 3/4} are computed. We compare the following: (a) the "true" multi-task learning curve ⟨⟨ε_T(ρ)⟩⟩ obtained by averaging σ_T²(ρ) over x∗ and X; (b) the theoretical bounds OV_ρ and FWO_ρ of Propositions 11 and 13; (c) the trivial upper and lower bounds that are the single-task learning curves ⟨⟨ε_T(0)⟩⟩ and ⟨⟨ε_T(1)⟩⟩ obtained by averaging σ_T²(0) and σ_T²(1); and (d) the empirical lower bound ⟨⟨ε̲_T(ρ)⟩⟩ and upper bound ⟨⟨ε̄_T(ρ)⟩⟩ using Corollaries 6 and 8. Figure 3 gives some indicative plots of the curves.

We summarize with the following observations: (a) The gap between the trivial bounds ⟨⟨ε_T(0)⟩⟩ and ⟨⟨ε_T(1)⟩⟩ increases with π_S, as described at the start of section 5. (b) We find the lower bound ⟨⟨ε̲_T(ρ)⟩⟩ to be a rather close approximation to the multi-task learning curve ⟨⟨ε_T(ρ)⟩⟩, as evidenced by the substantial overlap between the × lines and the middle lines in Figure 3. (c) The curve for the empirical upper bound ⟨⟨ε̄_T(ρ)⟩⟩ using the equivalent noise method has jumps, e.g., the △ lines in Figure 3, because the equivalent noise variance ¯̄σ_n² increases whenever a datum for X_S is sampled. (d) For small n, ⟨⟨ε_T(ρ)⟩⟩ is closer to FWO_ρ, but becomes closer to OV_ρ as n increases, as shown by the unmarked solid lines in Figure 3.
This is because the theoretical lower bound OV_ρ is based on the asymptotically exact single-task OV bound and the ε̲_T(ρ) bound, which is observed to approximate the multi-task learning curve rather closely (point (b)).

Conclusions  We have measured the influence of the secondary task on the primary task using the generalization error and the learning curve, parameterizing these with the correlation ρ between the two tasks, and the proportion π_S of observations for the secondary task. We have provided bounds on the generalization error and learning curves, and these bounds highlight the effects of ρ and π_S. This is a step towards understanding the role of the matrix K^f of inter-task similarities in multi-task GPs with more than two tasks. Analysis of the degenerate case of no training data for the primary task has uncovered an intrinsic limitation of multi-task GP. Our work contributes to an understanding of multi-task learning that is orthogonal to the existing PAC-based results in the literature.

Acknowledgments

I thank E Bonilla for motivating this problem, CKI Williams for helpful discussions and for proposing the equivalent isotropic noise approach, and DSO National Laboratories, Singapore, for financial support. This work is supported in part by the EU through the PASCAL2 Network of Excellence.

References

[1] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, Massachusetts, 2006.

[2] Yee Whye Teh, Matthias Seeger, and Michael I. Jordan. Semiparametric latent factor models. In Robert G. Cowell and Zoubin Ghahramani, editors, Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, pages 333–340.
Society for Artificial Intelligence and Statistics, January 2005.

[3] Edwin V. Bonilla, Felix V. Agakov, and Christopher K. I. Williams. Kernel Multi-task Learning using Task-specific Features. In Marina Meila and Xiaotong Shen, editors, Proceedings of the 11th International Conference on Artificial Intelligence and Statistics. Omni Press, March 2007.

[4] Kai Yu, Wei Chu, Shipeng Yu, Volker Tresp, and Zhao Xu. Stochastic Relational Models for Discriminative Link Prediction. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, Cambridge, MA, 2007. MIT Press.

[5] Edwin V. Bonilla, Kian Ming A. Chai, and Christopher K. I. Williams. Multi-task Gaussian process prediction. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.

[6] Jonathan Baxter. A Model of Inductive Bias Learning. Journal of Artificial Intelligence Research, 12:149–198, March 2000.

[7] Andreas Maurer. Bounds for linear multi-task learning. Journal of Machine Learning Research, 7:117–139, January 2006.

[8] Shai Ben-David and Reba Schuller Borbely. A notion of task relatedness yielding provable multiple-task learning guarantees. Machine Learning, 73(3):273–287, 2008.

[9] Christopher K. I. Williams and Francesco Vivarelli. Upper and lower bounds on the learning curve for Gaussian processes. Machine Learning, 40(1):77–102, 2000.

[10] Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram. Multi-task learning for classification with Dirichlet process prior. Journal of Machine Learning Research, 8:35–63, January 2007.

[11] Klaus Ritter. Average-Case Analysis of Numerical Problems, volume 1733 of Lecture Notes in Mathematics. Springer, 2000.

[12] Christopher T. H. Baker. The Numerical Treatment of Integral Equations. Clarendon Press, 1977.

[13] Peter Sollich and Anason Halees. Learning curves for Gaussian process regression: Approximations and bounds. Neural Computation, 14(6):1393–1428, 2002.

[14] Noel A. Cressie. Statistics for Spatial Data. Wiley, New York, 1993.

[15] Manfred Opper and Francesco Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In Kearns et al. [18], pages 302–308.

[16] Giancarlo Ferrari Trecate, Christopher K. I. Williams, and Manfred Opper. Finite-dimensional approximation of Gaussian processes. In Kearns et al. [18], pages 218–224.

[17] Huaiyu Zhu, Christopher K. I. Williams, Richard Rohwer, and Michal Morciniec. Gaussian regression and optimal finite dimensional linear models. In Christopher M. Bishop, editor, Neural Networks and Machine Learning, volume 168 of NATO ASI Series F: Computer and Systems Sciences, pages 167–184. Springer-Verlag, Berlin, 1998.

[18] Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors. Advances in Neural Information Processing Systems 11, 1999. The MIT Press.