{"title": "Learning curves for multi-task Gaussian process regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1781, "page_last": 1789, "abstract": "We study the average case performance of multi-task Gaussian process (GP) regression as captured in the learning curve, i.e.\\ the average Bayes error for a chosen task versus the total number of examples $n$ for all tasks. For GP covariances that are the product of an input-dependent covariance function and a free-form inter-task covariance matrix, we show that accurate approximations for the learning curve can be obtained for an arbitrary number of tasks $T$. We use these to study the asymptotic learning behaviour for large $n$. Surprisingly, multi-task learning can be asymptotically essentially useless: examples from other tasks only help when the degree of inter-task correlation, $\\rho$, is near its maximal value $\\rho=1$. This effect is most extreme for learning of smooth target functions as described by e.g.\\ squared exponential kernels. We also demonstrate that when learning {\\em many} tasks, the learning curves separate into an initial phase, where the Bayes error on each task is reduced down to a plateau value by ``collective learning'' even though most tasks have not seen examples, and a final decay that occurs only once the number of examples is proportional to the number of tasks.", "full_text": "Learning curves for multi-task Gaussian process\n\nregression\n\nSimon R F Ashton\n\nKing\u2019s College London\n\nDepartment of Mathematics\n\nStrand, London WC2R 2LS, U.K.\n\nPeter Sollich\n\nKing\u2019s College London\n\nDepartment of Mathematics\n\nStrand, London WC2R 2LS, U.K.\npeter.sollich@kcl.ac.uk\n\nAbstract\n\nWe study the average case performance of multi-task Gaussian process (GP) re-\ngression as captured in the learning curve, i.e. the average Bayes error for a chosen\ntask versus the total number of examples n for all tasks. 
For GP covariances that\nare the product of an input-dependent covariance function and a free-form inter-\ntask covariance matrix, we show that accurate approximations for the learning\ncurve can be obtained for an arbitrary number of tasks T . We use these to study\nthe asymptotic learning behaviour for large n. Surprisingly, multi-task learning\ncan be asymptotically essentially useless, in the sense that examples from other\ntasks help only when the degree of inter-task correlation, \u03c1, is near its maximal\nvalue \u03c1 = 1. This effect is most extreme for learning of smooth target functions\nas described by e.g. squared exponential kernels. We also demonstrate that when\nlearning many tasks, the learning curves separate into an initial phase, where the\nBayes error on each task is reduced down to a plateau value by \u201ccollective learn-\ning\u201d even though most tasks have not seen examples, and a \ufb01nal decay that occurs\nonce the number of examples is proportional to the number of tasks.\n\n1\n\nIntroduction and motivation\n\nGaussian processes (GPs) [1] have been popular in the NIPS community for a number of years\nnow, as one of the key non-parametric Bayesian inference approaches. In the simplest case one can\nuse a GP prior when learning a function from data. In line with growing interest in multi-task or\ntransfer learning, where relatedness between tasks is used to aid learning of the individual tasks (see\ne.g. [2, 3]), GPs have increasingly also been used in a multi-task setting. A number of different\nchoices of covariance functions have been proposed [4, 5, 6, 7, 8]. These differ e.g. in assumptions\non whether the functions to be learned are related to a smaller number of latent functions or have\nfree-form inter-task correlations; for a recent review see [9].\nGiven this interest in multi-task GPs, one would like to quantify the bene\ufb01ts that they bring compared\nto single-task learning. 
PAC-style bounds for classi\ufb01cation [2, 3, 10] in more general multi-task sce-\nnarios exist, but there has been little work on average case analysis. The basic question in this setting\nis: how does the Bayes error on a given task depend on the number of training examples for all tasks,\nwhen averaged over all data sets of the given size. For a single regression task, this learning curve\nhas become relatively well understood since the late 1990s, with a number of bounds and approxi-\nmations available [11, 12, 13, 14, 15, 16, 17, 18, 19] as well as some exact predictions [20]. Already\ntwo-task GP regression is much more dif\ufb01cult to analyse, and progress was made only very recently\nat NIPS 2009 [21], where upper and lower bounds for learning curves were derived. The tightest of\nthese bounds, however, either required evaluation by Monte Carlo sampling, or assumed knowledge\nof the corresponding single-task learning curves. Here our aim is to obtain accurate learning curve\napproximations that apply to an arbitrary number T of tasks, and that can be evaluated explicitly\nwithout recourse to sampling.\n\n1\n\n\fWe begin (Sec. 2) by expressing the Bayes error for any single task in a multi-task GP regression\nproblem in a convenient feature space form, where individual training examples enter additively.\nThis requires the introduction of a non-trivial tensor structure combining feature space components\nand tasks. Considering the change in error when adding an example for some task leads to partial\ndifferential equations linking the Bayes errors for all tasks. Solving these using the method of\ncharacteristics then gives, as our primary result, the desired learning curve approximation (Sec. 3). In\nSec. 4 we discuss some of its predictions. The approximation correctly delineates the limits of pure\ntransfer learning, when all examples are from tasks other than the one of interest. 
Next we compare\nwith numerical simulations for some two-task scenarios, \ufb01nding good qualitative agreement. These\nresults also highlight a surprising feature, namely that asymptotically the relatedness between tasks\ncan become much less useful. We analyse this effect in some detail, showing that it is most extreme\nfor learning of smooth functions. Finally we discuss the case of many tasks, where there is an\nunexpected separation of the learning curves into a fast initial error decay arising from \u201ccollective\nlearning\u201d, and a much slower \ufb01nal part where tasks are learned almost independently.\n\n2 GP regression and Bayes error\n\n\u03c4(cid:96). This setup allows the noise level \u03c32\n\nWe consider GP regression for T functions f\u03c4 (x), \u03c4 = 1, 2, . . . , T . These functions have to be\nlearned from n training examples (x(cid:96), \u03c4(cid:96), y(cid:96)), (cid:96) = 1, . . . , n. Here x(cid:96) is the training input, \u03c4(cid:96) \u2208\n{1, . . . , T} denotes which task the example relates to, and y(cid:96) is the corresponding training output.\nWe assume that the latter is given by the target function value f\u03c4(cid:96) (x(cid:96)) corrupted by i.i.d. additive\n\u03c4 to depend on\nGaussian noise with zero mean and variance \u03c32\nthe task.\nIn GP regression the prior over the functions f\u03c4 (x) is a Gaussian process. This means that for any\nset of inputs x(cid:96) and task labels \u03c4(cid:96), the function values {f\u03c4(cid:96)(x(cid:96))} have a joint Gaussian distribution.\nAs is common we assume this to have zero mean, so the multi-task GP is fully speci\ufb01ed by the\ncovariances (cid:104)f\u03c4 (x)f\u03c4(cid:48)(x(cid:48))(cid:105) = C(\u03c4, x, \u03c4(cid:48), x(cid:48)). For this covariance we take the \ufb02exible form from [5],\n(cid:104)f\u03c4 (x)f\u03c4(cid:48)(x(cid:48))(cid:105) = D\u03c4 \u03c4(cid:48)C(x, x(cid:48)). 
Here $C(x,x')$ determines the covariance between function values at different input points, encoding "spatial" behaviour such as smoothness and the lengthscale(s) over which the functions vary, while the matrix $D$ is a free-form inter-task covariance matrix.
One of the attractions of GPs for regression is that, even though they are non-parametric models with (in general) an infinite number of degrees of freedom, predictions can be made in closed form, see e.g. [1]. For a test point $x$ for task $\tau$, one would predict as output the mean of $f_\tau(x)$ over the (Gaussian) posterior, which is $\mathbf{y}^{\rm T} K^{-1} k_\tau(x)$. Here $K$ is the $n \times n$ Gram matrix with entries $K_{\ell m} = D_{\tau_\ell \tau_m} C(x_\ell, x_m) + \sigma^2_{\tau_\ell} \delta_{\ell m}$, while $k_\tau(x)$ is a vector with the $n$ entries $k_{\tau,\ell} = D_{\tau_\ell \tau} C(x_\ell, x)$. The error bar would be taken as the square root of the posterior variance $V_\tau(x)$ of $f_\tau(x)$, given in (1) below.
The learning curve for task $\tau$ is defined as the mean-squared prediction error, averaged over the location of the test input $x$ and over all data sets with a specified number of examples for each task, say $n_1$ for task 1 and so on. As is standard in learning curve analysis we consider a matched scenario where the training outputs $y_\ell$ are generated from the same prior and noise model that we use for inference. In this case the mean-squared prediction error $\hat\epsilon_\tau$ is the Bayes error, and is given by the average posterior variance [1], i.e. $\hat\epsilon_\tau = \langle V_\tau(x) \rangle_x$. To obtain the learning curve this is averaged over the location of the training inputs $x_\ell$: $\epsilon_\tau = \langle \hat\epsilon_\tau \rangle$. 
This average presents the main challenge\nfor learning curve prediction because the training inputs feature in a highly nonlinear way in V\u03c4 (x).\nNote that the training outputs, on the other hand, do not appear in the posterior variance V\u03c4 (x) and\nso do not need to be averaged over.\nWe now want to write the Bayes error \u02c6\u0001\u03c4 in a form convenient for performing, at least approxi-\nmately, the averages required for the learning curve. Assume that all training inputs x(cid:96), and also the\n(cid:80)\ntest input x, are drawn from the same distribution P (x). One can decompose the input-dependent\npart of the covariance function into eigenfunctions relative to P (x), according to C(x, x(cid:48)) =\ni \u03bbi\u03c6i(x)\u03c6i(x(cid:48)). The eigenfunctions are de\ufb01ned by the condition (cid:104)C(x, x(cid:48))\u03c6i(x(cid:48))(cid:105)x(cid:48) = \u03bbi\u03c6i(x)\nand can be chosen to be orthonormal with respect to P (x), (cid:104)\u03c6i(x)\u03c6j(x)(cid:105)x = \u03b4ij. The sum over i\nhere is in general in\ufb01nite (unless the covariance function is degenerate, as e.g. for the dot product\nkernel C(x, x(cid:48)) = x \u00b7 x(cid:48)). To make the algebra below as simple as possible, we let the eigenvalues\n\u03bbi be arranged in decreasing order and truncate the sum to the \ufb01nite range i = 1, . . . 
, $M$; $M$ is then some large effective feature space dimension and can be taken to infinity at the end. The posterior variance appearing in the error bar and the Bayes error is

$V_\tau(x) = D_{\tau\tau} C(x,x) - k_\tau^{\rm T}(x) K^{-1} k_\tau(x) \qquad (1)$

In terms of the above eigenfunction decomposition, the Gram matrix has elements

$K_{\ell m} = D_{\tau_\ell \tau_m} \sum_i \lambda_i \phi_i(x_\ell)\phi_i(x_m) + \sigma^2_{\tau_\ell}\delta_{\ell m} = \sum_{i,\tau,j,\tau'} \delta_{\tau_\ell,\tau}\, \phi_i(x_\ell)\, \lambda_i \delta_{ij} D_{\tau\tau'}\, \phi_j(x_m)\, \delta_{\tau',\tau_m} + \sigma^2_{\tau_\ell}\delta_{\ell m}$

or in matrix form $K = \Psi L \Psi^{\rm T} + \Sigma$, where $\Sigma$ is the diagonal matrix formed from the noise variances and

$\Psi_{\ell, i\tau} = \delta_{\tau_\ell, \tau}\, \phi_i(x_\ell), \qquad L_{i\tau, j\tau'} = \lambda_i \delta_{ij} D_{\tau\tau'} \qquad (2)$

Here $\Psi$ has its second index ranging over $M$ (number of kernel eigenvalues) times $T$ (number of tasks) values; $L$ is a square matrix of this size. In Kronecker (tensor) product notation, $L = D \otimes \Lambda$ if we define $\Lambda$ as the diagonal matrix with entries $\lambda_i \delta_{ij}$. The Kronecker product is convenient for the simplifications below; we will use that for generic square matrices, $(A \otimes B)(A' \otimes B') = (AA') \otimes (BB')$, $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$, and $\mathrm{tr}\,(A \otimes B) = (\mathrm{tr}\, A)(\mathrm{tr}\, B)$. In thinking about the mathematical expressions, it is often easier to picture Kronecker products over feature spaces and tasks as block matrices. For example, $L$ can then be viewed as consisting of $T \times T$ blocks, each of which is proportional to $\Lambda$.
To calculate the Bayes error, we need to average the posterior variance $V_\tau(x)$ over the test input $x$. The first term in (1) then becomes $D_{\tau\tau} \langle C(x,x) \rangle = D_{\tau\tau}\, \mathrm{tr}\,\Lambda$. 
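The Kronecker-product identities quoted above, and the block-matrix picture of $L = D \otimes \Lambda$, are easy to check numerically. A minimal numpy sketch; the toy sizes and the eigenvalue spectrum are our own illustrative choices, not taken from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
M, T = 4, 2                                # toy feature-space truncation and task count
D = np.array([[1.0, 0.6], [0.6, 1.0]])     # free-form inter-task covariance (assumed)
lam = 1.0 / np.arange(1, M + 1) ** 2       # illustrative eigenvalue spectrum (assumed)
Lam = np.diag(lam)

L = np.kron(D, Lam)                        # L = D (x) Lambda

# (A (x) B)(A' (x) B') = (A A') (x) (B B')
A, Ap = rng.standard_normal((2, T, T))
B, Bp = rng.standard_normal((2, M, M))
assert np.allclose(np.kron(A, B) @ np.kron(Ap, Bp), np.kron(A @ Ap, B @ Bp))

# (A (x) B)^{-1} = A^{-1} (x) B^{-1}, and tr(A (x) B) = tr(A) tr(B)
assert np.allclose(np.linalg.inv(L), np.kron(np.linalg.inv(D), np.diag(1.0 / lam)))
assert np.isclose(np.trace(L), np.trace(D) * np.trace(Lam))

# Block picture: the (tau, tau') block of L, each of size M x M, is D[tau, tau'] * Lambda
assert np.allclose(L[:M, M:], D[0, 1] * Lam)
print("Kronecker identities verified")
```

With `numpy.kron`, the first factor indexes the blocks, so `np.kron(D, Lam)` reproduces exactly the "T x T blocks proportional to $\Lambda$" picture used in the text.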
In the second one, we need to average

$\langle k_{\tau,\ell}(x)\, k_{\tau,m}(x) \rangle_x = D_{\tau \tau_\ell} \langle C(x_\ell, x) C(x, x_m) \rangle_x D_{\tau_m \tau} = D_{\tau \tau_\ell} \sum_{ij} \lambda_i \lambda_j \phi_i(x_\ell) \langle \phi_i(x)\phi_j(x) \rangle_x \phi_j(x_m) D_{\tau_m \tau} = \sum_{i,\tau',j,\tau''} D_{\tau\tau'}\, \Psi_{\ell, i\tau'}\, \lambda_i \lambda_j \delta_{ij}\, \Psi_{m, j\tau''}\, D_{\tau''\tau}$

In matrix form this is $\langle k_\tau(x) k_\tau^{\rm T}(x) \rangle_x = \Psi[(D e_\tau e_\tau^{\rm T} D) \otimes \Lambda^2]\Psi^{\rm T} = \Psi M_\tau \Psi^{\rm T}$. Here the last equality defines $M_\tau$, and we have denoted by $e_\tau$ the $T$-dimensional vector with $\tau$-th component equal to one and all others zero. Multiplying by the inverse Gram matrix $K^{-1}$ and taking the trace gives the average of the second term in (1); combining with the first gives the Bayes error on task $\tau$

$\hat\epsilon_\tau = \langle V_\tau(x) \rangle_x = D_{\tau\tau}\, \mathrm{tr}\,\Lambda - \mathrm{tr}\, \Psi M_\tau \Psi^{\rm T} (\Psi L \Psi^{\rm T} + \Sigma)^{-1}$

Applying the Woodbury identity and re-arranging yields

$\hat\epsilon_\tau = D_{\tau\tau}\, \mathrm{tr}\,\Lambda - \mathrm{tr}\, M_\tau \Psi^{\rm T}\Sigma^{-1}\Psi (I + L\Psi^{\rm T}\Sigma^{-1}\Psi)^{-1} = D_{\tau\tau}\, \mathrm{tr}\,\Lambda - \mathrm{tr}\, M_\tau L^{-1}[I - (I + L\Psi^{\rm T}\Sigma^{-1}\Psi)^{-1}]$

But

$\mathrm{tr}\, M_\tau L^{-1} = \mathrm{tr}\,\{[(D e_\tau e_\tau^{\rm T} D) \otimes \Lambda^2][D \otimes \Lambda]^{-1}\} = \mathrm{tr}\,\{[D e_\tau e_\tau^{\rm T}] \otimes \Lambda\} = e_\tau^{\rm T} D e_\tau\, \mathrm{tr}\,\Lambda = D_{\tau\tau}\, \mathrm{tr}\,\Lambda$

so the first and second terms in the expression for $\hat\epsilon_\tau$ cancel and one has

$\hat\epsilon_\tau = \mathrm{tr}\, M_\tau L^{-1}(I + L\Psi^{\rm T}\Sigma^{-1}\Psi)^{-1} = \mathrm{tr}\, L^{-1} M_\tau L^{-1} (L^{-1} + \Psi^{\rm T}\Sigma^{-1}\Psi)^{-1} = \mathrm{tr}\, [D \otimes \Lambda]^{-1}[(D e_\tau e_\tau^{\rm T} D) \otimes \Lambda^2][D \otimes \Lambda]^{-1}(L^{-1} + \Psi^{\rm T}\Sigma^{-1}\Psi)^{-1} = \mathrm{tr}\, [e_\tau e_\tau^{\rm T} \otimes I](L^{-1} + \Psi^{\rm T}\Sigma^{-1}\Psi)^{-1}$

The matrix in square brackets in the last line is just a projector $P_\tau$ onto task $\tau$; thought of as a matrix of $T \times T$ blocks (each of size $M \times M$), this has an identity matrix in the $(\tau, \tau)$ block while all other blocks are zero. We can therefore write, finally, for the Bayes error on task $\tau$,

$\hat\epsilon_\tau = \mathrm{tr}\, P_\tau (L^{-1} + \Psi^{\rm T}\Sigma^{-1}\Psi)^{-1} \qquad (3)$

Because $\Sigma$ is diagonal and given the definition (2) of $\Psi$, the matrix $\Psi^{\rm T}\Sigma^{-1}\Psi$ is a sum of contributions from the individual training examples $\ell = 1, \ldots, n$. This will be important for deriving the learning curve approximation below. We note in passing that, because $\sum_\tau P_\tau = I$, the sum of the Bayes errors on all tasks is $\sum_\tau \hat\epsilon_\tau = \mathrm{tr}\,(L^{-1} + \Psi^{\rm T}\Sigma^{-1}\Psi)^{-1}$, in close analogy to the corresponding expression for the single-task case [13].

3 Learning curve prediction

To obtain the learning curve $\epsilon_\tau = \langle \hat\epsilon_\tau \rangle$, we now need to carry out the average $\langle \ldots \rangle$ over the training inputs. To help with this, we can extend an approach for the single-task scenario [13] and define a response or resolvent matrix $\mathcal{G} = (L^{-1} + \Psi^{\rm T}\Sigma^{-1}\Psi + \sum_\tau v_\tau P_\tau)^{-1}$ with auxiliary parameters $v_\tau$ that will be set back to zero at the end. One can then ask how $G = \langle \mathcal{G} \rangle$ and hence $\epsilon_{\tau'} = \mathrm{tr}\, P_{\tau'} G$ changes with the number $n_\tau$ of training points for task $\tau$. 
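As a sanity check, the feature-space expression (3) can be compared with a direct evaluation of the $x$-averaged posterior variance (1) in a small toy model. A numpy sketch; the cosine eigenfunctions (orthonormal under a uniform input distribution on $[0,1]$), spectrum and sizes are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
M, T, n = 6, 2, 10                          # toy sizes: features, tasks, examples
sigma2 = np.array([0.05, 0.1])              # per-task noise variances (illustrative)
D = np.array([[1.0, 0.8], [0.8, 1.0]])      # inter-task covariance matrix (illustrative)
lam = np.exp(-0.5 * np.arange(M))           # toy eigenvalue spectrum

def phi(x):
    """Cosine eigenfunctions, orthonormal under uniform P(x) on [0, 1]."""
    i = np.arange(M)
    out = np.sqrt(2.0) * np.cos(np.pi * i * np.asarray(x)[..., None])
    out[..., 0] = 1.0
    return out

x = rng.uniform(size=n)
tau = rng.integers(0, T, size=n)            # task label of each example
F = phi(x)                                  # n x M matrix of phi_i(x_l)

# Direct route: Gram matrix K, then the exact x-average of eq. (1)
K = D[np.ix_(tau, tau)] * (F @ np.diag(lam) @ F.T) + np.diag(sigma2[tau])
Kinv = np.linalg.inv(K)
eps_direct = [D[t, t] * lam.sum()
              - np.trace(Kinv @ (np.outer(D[t, tau], D[tau, t])
                                 * (F @ np.diag(lam ** 2) @ F.T)))
              for t in range(T)]

# Feature-space route, eq. (3): eps_t = tr P_t (L^{-1} + Psi^T Sigma^{-1} Psi)^{-1}
Psi = np.zeros((n, M * T))
for ell in range(n):
    Psi[ell, tau[ell] * M:(tau[ell] + 1) * M] = F[ell]   # task-major column blocks
Linv = np.kron(np.linalg.inv(D), np.diag(1.0 / lam))     # L^{-1} = D^{-1} (x) Lambda^{-1}
G = np.linalg.inv(Linv + Psi.T @ (Psi / sigma2[tau, None]))
eps_feat = [np.trace(G[t * M:(t + 1) * M, t * M:(t + 1) * M]) for t in range(T)]

assert np.allclose(eps_direct, eps_feat)    # the two routes agree for any fixed dataset
print(eps_feat)
```

Because the Woodbury manipulations above are exact identities, the agreement holds for each fixed set of training inputs, not just on average.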
Adding an example at position $x$ for task $\tau$ increases $\Psi^{\rm T}\Sigma^{-1}\Psi$ by $\sigma_\tau^{-2} \phi_\tau \phi_\tau^{\rm T}$, where $\phi_\tau$ has elements $(\phi_\tau)_{i\tau'} = \phi_i(x)\delta_{\tau\tau'}$. Evaluating the difference $(\mathcal{G}^{-1} + \sigma_\tau^{-2}\phi_\tau\phi_\tau^{\rm T})^{-1} - \mathcal{G}$ with the help of the Woodbury identity and approximating it with a derivative gives

$\frac{\partial \mathcal{G}}{\partial n_\tau} = -\frac{\mathcal{G}\phi_\tau\phi_\tau^{\rm T}\mathcal{G}}{\sigma_\tau^2 + \phi_\tau^{\rm T}\mathcal{G}\phi_\tau}$

This needs to be averaged over the new example and all previous ones. If we approximate by averaging numerator and denominator separately we get

$\frac{\partial G}{\partial n_\tau} = \frac{1}{\sigma_\tau^2 + \mathrm{tr}\, P_\tau G}\, \frac{\partial G}{\partial v_\tau} \qquad (4)$

Here we have exploited for the average over $x$ that the matrix $\langle \phi_\tau \phi_\tau^{\rm T} \rangle_x$ has $(i,\tau'), (j,\tau'')$-entry $\langle \phi_i(x)\phi_j(x) \rangle_x \delta_{\tau\tau'}\delta_{\tau\tau''} = \delta_{ij}\delta_{\tau\tau'}\delta_{\tau\tau''}$, hence simply $\langle \phi_\tau\phi_\tau^{\rm T} \rangle_x = P_\tau$. We have also used the auxiliary parameters to rewrite $-\langle \mathcal{G} P_\tau \mathcal{G} \rangle = \partial \langle \mathcal{G} \rangle / \partial v_\tau = \partial G / \partial v_\tau$. Finally, multiplying (4) by $P_{\tau'}$ and taking the trace gives the set of quasi-linear partial differential equations

$\frac{\partial \epsilon_{\tau'}}{\partial n_\tau} = \frac{1}{\sigma_\tau^2 + \epsilon_\tau}\, \frac{\partial \epsilon_{\tau'}}{\partial v_\tau} \qquad (5)$

The remaining task is now to find the functions $\epsilon_\tau(n_1, \ldots, n_T, v_1, \ldots, v_T)$ by solving these differential equations. 
We initially attempted to do this by tracking the $\epsilon_\tau$ as examples are added one task at a time, but the derivation is laborious already for $T = 2$ and becomes prohibitive beyond. Far more elegant is to adapt the method of characteristics to the present case. We need to find a $2T$-dimensional surface in the $3T$-dimensional space $(n_1, \ldots, n_T, v_1, \ldots, v_T, \epsilon_1, \ldots, \epsilon_T)$, which is specified by the $T$ functions $\epsilon_\tau(\ldots)$. A small change $(\delta n_1, \ldots, \delta n_T, \delta v_1, \ldots, \delta v_T, \delta\epsilon_1, \ldots, \delta\epsilon_T)$ in all $3T$ coordinates is tangential to this surface if it obeys the $T$ constraints (one for each $\tau'$)

$\delta\epsilon_{\tau'} = \sum_\tau \Big( \frac{\partial \epsilon_{\tau'}}{\partial n_\tau}\, \delta n_\tau + \frac{\partial \epsilon_{\tau'}}{\partial v_\tau}\, \delta v_\tau \Big)$

From (5), one sees that this condition is satisfied whenever $\delta\epsilon_\tau = 0$ and $\delta n_\tau = -\delta v_\tau (\sigma_\tau^2 + \epsilon_\tau)$. It follows that all the characteristic curves given by $\epsilon_\tau(t) = \epsilon_{\tau,0} = \text{const.}$, $v_\tau(t) = v_{\tau,0}(1 - t)$, $n_\tau(t) = v_{\tau,0}(\sigma_\tau^2 + \epsilon_{\tau,0})\, t$ for $t \in [0, 1]$ are tangential to the solution surface for all $t$, so lie within this surface if the initial point at $t = 0$ does. 
Because at $t = 0$ there are no training examples ($n_\tau(0) = 0$), this initial condition is satisfied by setting

$\epsilon_{\tau,0} = \mathrm{tr}\, P_\tau \Big( L^{-1} + \sum_{\tau'} v_{\tau',0} P_{\tau'} \Big)^{-1}$

Because $\epsilon_\tau(t)$ is constant along the characteristic curve, we get by equating the values at $t = 0$ and $t = 1$

$\epsilon_{\tau,0} = \mathrm{tr}\, P_\tau \Big( L^{-1} + \sum_{\tau'} v_{\tau',0} P_{\tau'} \Big)^{-1} = \epsilon_\tau(\{n_{\tau'} = v_{\tau',0}(\sigma_{\tau'}^2 + \epsilon_{\tau',0})\}, \{v_{\tau'} = 0\})$

Expressing $v_{\tau',0}$ in terms of $n_{\tau'}$ then gives

$\epsilon_\tau = \mathrm{tr}\, P_\tau \Big( L^{-1} + \sum_{\tau'} \frac{n_{\tau'}}{\sigma_{\tau'}^2 + \epsilon_{\tau'}}\, P_{\tau'} \Big)^{-1} \qquad (6)$

This is our main result: a closed set of $T$ self-consistency equations for the average Bayes errors $\epsilon_\tau$. Given $L$ as defined by the eigenvalues $\lambda_i$ of the covariance function, the noise levels $\sigma_\tau^2$ and the number of examples $n_\tau$ for each task, it is straightforward to solve these equations numerically to find the average Bayes error $\epsilon_\tau$ for each task.
The r.h.s. of (6) is easiest to evaluate if we view the matrix inside the brackets as consisting of $M \times M$ blocks of size $T \times T$ (which is the reverse of the picture we have used so far). The matrix is then block diagonal, with the blocks corresponding to different eigenvalues $\lambda_i$. 
Explicitly, because $L^{-1} = D^{-1} \otimes \Lambda^{-1}$, one has

$\epsilon_\tau = \sum_i \Big( \lambda_i^{-1} D^{-1} + \mathrm{diag}\big(\big\{ \tfrac{n_{\tau'}}{\sigma_{\tau'}^2 + \epsilon_{\tau'}} \big\}\big) \Big)^{-1}_{\tau\tau} \qquad (7)$

4 Results and discussion

We now consider the consequences of the approximate prediction (7) for multi-task learning curves in GP regression. A trivial special case is the one of uncorrelated tasks, where $D$ is diagonal. Here one recovers $T$ separate equations for the individual tasks as expected, which have the same form as for single-task learning [13].

4.1 Pure transfer learning

Consider now the case of pure transfer learning, where one is learning a task of interest (say $\tau = 1$) purely from examples for other tasks. What is the lowest average Bayes error that can be obtained? Somewhat more generally, suppose we have no examples for the first $T_0$ tasks, $n_1 = \ldots = n_{T_0} = 0$, but a large number of examples for the remaining $T_1 = T - T_0$ tasks. Denote $E = D^{-1}$ and write this in block form as

$E = \begin{pmatrix} E_{00} & E_{01} \\ E_{01}^{\rm T} & E_{11} \end{pmatrix}$

Now multiply by $\lambda_i^{-1}$ and add in the lower right block a diagonal matrix $N = \mathrm{diag}(\{ n_\tau/(\sigma_\tau^2 + \epsilon_\tau) \}_{\tau = T_0+1, \ldots, T})$. The matrix inverse in (7) then has top left block $\lambda_i [E_{00}^{-1} + E_{00}^{-1} E_{01} (\lambda_i N + E_{11} - E_{01}^{\rm T} E_{00}^{-1} E_{01})^{-1} E_{01}^{\rm T} E_{00}^{-1}]$. As the number of examples for the last $T_1$ tasks grows, so do all (diagonal) elements of $N$. In the limit only the term $\lambda_i E_{00}^{-1}$ survives, and summing over $i$ gives $\epsilon_1 = \mathrm{tr}\,\Lambda\, (E_{00}^{-1})_{11} = \langle C(x,x) \rangle (E_{00}^{-1})_{11}$. The Bayes error on task 1 cannot become lower than this, placing a limit on the benefits of pure transfer learning. 
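The self-consistency equations (7) lend themselves to simple fixed-point iteration, and the pure-transfer floor $\mathrm{tr}\,\Lambda\,(E_{00}^{-1})_{11}$, which for two tasks with $D_{11} = D_{22} = 1$ and $D_{12} = \rho$ equals $(1-\rho^2)\,\mathrm{tr}\,\Lambda$, then emerges numerically. A sketch; the spectrum, noise level and iteration scheme are our own illustrative choices:

```python
import numpy as np

def bayes_errors(lam, D, n_per_task, sigma2, tol=1e-12, max_iter=1000):
    """Solve the self-consistency equations (7) by fixed-point iteration."""
    Dinv = np.linalg.inv(D)
    eps = np.diag(D) * lam.sum()            # exact n = 0 starting point
    for _ in range(max_iter):
        shift = np.diag(n_per_task / (sigma2 + eps))
        new = sum(np.diag(np.linalg.inv(Dinv / li + shift)) for li in lam)
        if np.max(np.abs(new - eps)) < tol:
            break
        eps = new
    return eps

lam = 1.0 / np.arange(1, 51) ** 2           # toy spectrum, tr(Lambda) = lam.sum()
rho = 0.8
D = np.array([[1.0, rho], [rho, 1.0]])

# pure transfer: no task-1 examples, very many task-2 examples
eps = bayes_errors(lam, D, n_per_task=np.array([0.0, 1e7]),
                   sigma2=np.array([0.05, 0.05]))
floor = (1 - rho ** 2) * lam.sum()          # tr(Lambda) (E_00^{-1})_{11}
print(eps[0], floor)                        # task-1 error approaches the floor
```

The per-eigenvalue $T \times T$ inverses make each iteration cheap even for long spectra, mirroring the block-diagonal picture described after (6).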
That this prediction of the approximation (7) for such a lower limit is correct can also be checked directly: once the last $T_1$ tasks $f_\tau(x)$ ($\tau = T_0 + 1, \ldots, T$) have been learned perfectly, the posterior over the first $T_0$ functions is, by standard Gaussian conditioning, a GP with covariance $C(x, x')(E_{00})^{-1}$. Averaging the posterior variance of $f_1(x)$ then gives the Bayes error on task 1 as $\epsilon_1 = \langle C(x,x) \rangle (E_{00}^{-1})_{11}$, as found earlier.
This analysis can be extended to the case where there are some examples available also for the first $T_0$ tasks. One finds for the generalization errors on these tasks the prediction (7) with $D^{-1}$ replaced by $E_{00}$. This is again in line with the above form of the GP posterior after perfect learning of the remaining $T_1$ tasks.

4.2 Two tasks

We next analyse how well the approximation (7) does in predicting multi-task learning curves for $T = 2$ tasks. Here we have the work of Chai [21] as a baseline, and as there we choose

$D = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$

The diagonal elements are fixed to unity, as in a practical application where one would scale both task functions $f_1(x)$ and $f_2(x)$ to unit variance; the degree of correlation of the tasks is controlled by $\rho$. We fix $\pi_2 = n_2/n$ and plot learning curves against $n$. In numerical simulations we ensure integer values of $n_1$ and $n_2$ by setting $n_2 = \lfloor n\pi_2 \rfloor$, $n_1 = n - n_2$; for evaluation of (7) we use directly $n_2 = n\pi_2$, $n_1 = n(1 - \pi_2)$. For simplicity we consider equal noise levels $\sigma_1^2 = \sigma_2^2 = \sigma^2$. As regards the covariance function and input distribution, we analyse first the scenario studied in [21]: a squared exponential (SE) kernel $C(x, x') = \exp[-(x - x')^2/(2l^2)]$ with lengthscale $l$, and one-dimensional inputs $x$ with a Gaussian distribution $N(0, 1/12)$. 
The kernel eigenvalues $\lambda_i$ are known explicitly from [22] and decay exponentially with $i$.

Figure 1: Average Bayes error for task 1 for two-task GP regression with kernel lengthscale $l = 0.01$, noise level $\sigma^2 = 0.05$ and a fraction $\pi_2 = 0.75$ of examples for task 2. Solid lines: numerical simulations; dashed lines: approximation (7). Task correlation $\rho^2 = 0, 0.25, 0.5, 0.75, 1$ from top to bottom. Left: SE covariance function, Gaussian input distribution. Middle: SE covariance, uniform inputs. Right: OU covariance, uniform inputs. Log-log plots (insets) show the tendency towards asymptotic uselessness, i.e. bunching of the $\rho < 1$ curves towards the one for $\rho = 0$; this effect is strongest for learning of smooth functions (left and middle).

Figure 1(left) compares numerically simulated learning curves with the predictions for $\epsilon_1$, the average Bayes error on task 1, from (7). Five pairs of curves are shown, for $\rho^2 = 0, 0.25, 0.5, 0.75, 1$. Note that the two extreme values represent single-task limits, where examples from task 2 are either ignored ($\rho = 0$) or effectively treated as being from task 1 ($\rho = 1$). Our predictions lie generally below the true learning curves, but qualitatively represent the trends well, in particular the variation with $\rho^2$. The curves for the different $\rho^2$ values are fairly evenly spaced vertically for small numbers of examples $n$, corresponding to a linear dependence on $\rho^2$. As $n$ increases, however, the learning curves for $\rho < 1$ start to bunch together and separate from the one for the fully correlated case ($\rho = 1$). The approximation (7) correctly captures this behaviour, which is discussed in more detail below.
Figure 1(middle) has analogous results for the case of inputs $x$ uniformly distributed on the interval $[0, 1]$; the $\lambda_i$ here decay exponentially with $i^2$ [17]. 
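The numerical simulations referenced in Figure 1 amount to averaging the posterior variance (1) over random data sets and test points. A minimal Monte Carlo sketch for the two-task SE scenario; for speed we use a longer lengthscale and far smaller sizes than the figure, so the parameter values here are our own, not those of the plots:

```python
import numpy as np

rng = np.random.default_rng(2)
l, sigma2, rho, pi2 = 0.1, 0.05, 0.5, 0.75   # lengthscale widened vs. the figure (assumption)
D = np.array([[1.0, rho], [rho, 1.0]])

def se(a, b):
    """Squared exponential kernel matrix C(a_i, b_j)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * l ** 2))

def bayes_error_task1(n, n_sets=20, n_test=400):
    """Monte Carlo estimate of eps_1: dataset- and x-averaged V_1(x) from eq. (1)."""
    n2 = int(n * pi2)
    tau = np.array([0] * (n - n2) + [1] * n2)            # task labels
    err = 0.0
    for _ in range(n_sets):
        x = rng.normal(0.0, np.sqrt(1 / 12), size=n)     # training inputs
        xt = rng.normal(0.0, np.sqrt(1 / 12), size=n_test)  # test inputs
        K = D[np.ix_(tau, tau)] * se(x, x) + sigma2 * np.eye(n)
        k = D[tau, 0][:, None] * se(x, xt)               # k_{1,l}(x_test)
        V = D[0, 0] - np.einsum('lt,lt->t', k, np.linalg.solve(K, k))
        err += V.mean() / n_sets
    return err

print(bayes_error_task1(8), bayes_error_task1(64))       # error decays with n
```

Note that, as remarked after (1), the training outputs never enter: only the inputs and task labels need to be sampled.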
Quantitative agreement between simulations and predictions is better for this case. The discussion in [17] suggests that this is because the approximation method we have used implicitly neglects spatial variation of the dataset-averaged posterior variance $\langle V_\tau(x) \rangle$; but for a uniform input distribution this variation will be weak except near the ends of the input range $[0, 1]$. Figure 1(right) displays similar results for an OU kernel $C(x, x') = \exp(-|x - x'|/l)$, showing that our predictions also work well when learning rough (nowhere differentiable) functions.

4.3 Asymptotic uselessness

The two-task results above suggest that multi-task learning is less useful asymptotically: when the number of training examples $n$ is large, the learning curves seem to bunch towards the curve for $\rho = 0$, where task 2 examples are ignored, except when the two tasks are fully correlated ($\rho = 1$). We now study this effect.
When the number of examples for all tasks becomes large, the Bayes errors $\epsilon_\tau$ will become small and eventually be negligible compared to the noise variances $\sigma_\tau^2$ in (7). One then has an explicit prediction for each $\epsilon_\tau$, without solving $T$ self-consistency equations. If we write, for $T$ tasks, $n_\tau = n\pi_\tau$ with $\pi_\tau$ the fraction of examples for task $\tau$, and set $\gamma_\tau = \pi_\tau/\sigma_\tau^2$, then for large $n$

$\epsilon_\tau = \sum_i \big( \lambda_i^{-1} D^{-1} + n\Gamma \big)^{-1}_{\tau\tau} = \sum_i \big( \Gamma^{-1/2} [\lambda_i^{-1} (\Gamma^{1/2} D \Gamma^{1/2})^{-1} + nI]^{-1} \Gamma^{-1/2} \big)_{\tau\tau} \qquad (8)$

where $\Gamma = \mathrm{diag}(\gamma_1, \ldots, \gamma_T)$. Using an eigendecomposition of the symmetric matrix $\Gamma^{1/2} D \Gamma^{1/2} = \sum_{a=1}^T \delta_a v_a v_a^{\rm T}$, one then shows in a few lines that (8) can be written as

$\epsilon_\tau \approx \gamma_\tau^{-1} \sum_a (v_{a,\tau})^2\, \delta_a\, g(n\delta_a) \qquad (9)$

where $g(h) = \mathrm{tr}\,(\Lambda^{-1} + h)^{-1} = \sum_i (\lambda_i^{-1} + h)^{-1}$ and $v_{a,\tau}$ is the $\tau$-th component of the $a$-th eigenvector $v_a$. This is the general asymptotic form of our prediction for the average Bayes error for task $\tau$.

Figure 2: Left: Bayes error (parameters as in Fig. 1(left), with $n = 500$) vs $\rho^2$. To focus on the error reduction with $\rho$, $r = [\epsilon_1(\rho) - \epsilon_1(1)]/[\epsilon_1(0) - \epsilon_1(1)]$ is shown. Circles: simulations; solid line: predictions from (7). Other lines: predictions for larger $n$, showing the approach to asymptotic uselessness in multi-task learning of smooth functions. Inset: analogous results for rough functions (parameters as in Fig. 1(right)). Right: Learning curve for many-task learning ($T = 200$, parameters otherwise as in Fig. 1(left) except $\rho^2 = 0.8$). Notice the bend around $\epsilon_1 = 1 - \rho = 0.106$. Solid line: simulations (steps arise because we chose to allocate examples to tasks in order $\tau = 1, \ldots, T$ rather than randomly); dashed line: predictions from (7). Inset: predictions for $T = 1000$, with asymptotic forms $\epsilon = 1 - \rho + \rho\tilde\epsilon$ and $\epsilon = (1 - \rho)\bar\epsilon$ for the two learning stages shown as solid lines.

To get a more explicit result, consider the case where sample functions from the GP prior have (mean-square) derivatives up to order $r$. 
The kernel eigenvalues $\lambda_i$ then decay as$^1$ $i^{-(2r+2)}$ for large $i$, and using arguments from [17] one deduces that $g(h) \sim h^{-\alpha}$ for large $h$, with $\alpha = (2r+1)/(2r+2)$. In (9) we can then write, for large $n$, $g(n\delta_a) \approx (\delta_a/\gamma_\tau)^{-\alpha}\, g(n\gamma_\tau)$ and hence

$\epsilon_\tau \approx g(n\gamma_\tau)\, \big\{ \textstyle\sum_a (v_{a,\tau})^2 (\delta_a/\gamma_\tau)^{1-\alpha} \big\} \qquad (10)$

When there is only a single task, $\delta_1 = \gamma_1$ and this expression reduces to $\epsilon_1 = g(n\gamma_1) = g(n_1/\sigma_1^2)$. Thus $g(n\gamma_\tau) = g(n_\tau/\sigma_\tau^2)$ is the error we would get by ignoring all examples from tasks other than $\tau$, and the term in $\{\ldots\}$ in (10) gives the "multi-task gain", i.e. the factor by which the error is reduced because of examples from other tasks. (The absolute error reduction always vanishes trivially for $n \to \infty$, along with the errors themselves.) One observation can be made directly. Learning of very smooth functions, as defined e.g. by the SE kernel, corresponds to $r \to \infty$ and hence $\alpha \to 1$, so the multi-task gain tends to unity: multi-task learning is asymptotically useless. The only exception occurs when some of the tasks are fully correlated, because one or more of the eigenvalues $\delta_a$ of $\Gamma^{1/2} D \Gamma^{1/2}$ will then be zero.
Fig. 2(left) shows this effect in action, plotting Bayes error against $\rho^2$ for the two-task setting of Fig. 1(left) with $n = 500$. Our predictions capture the nonlinear dependence on $\rho^2$ quite well, though the effect is somewhat weaker in the simulations. For larger $n$ the predictions approach a curve that is constant for $\rho < 1$, signifying negligible improvement from multi-task learning except at $\rho = 1$. It is worth contrasting this with the lower bound from [21], which is linear in $\rho^2$. 
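The multi-task gain factor in braces in (10) is spectrum-independent and needs only the eigendecomposition of $\Gamma^{1/2} D \Gamma^{1/2}$, so it is easy to explore numerically. A small sketch for two equicorrelated tasks; the parameter values are illustrative choices of ours:

```python
import numpy as np

def multitask_gain(rho, alpha, gammas):
    """Factor {sum_a (v_{a,tau})^2 (delta_a/gamma_tau)^(1-alpha)} from eq. (10), for task 0."""
    T = len(gammas)
    D = np.full((T, T), rho)
    np.fill_diagonal(D, 1.0)
    g = np.sqrt(gammas)
    delta, V = np.linalg.eigh(np.outer(g, g) * D)    # Gamma^{1/2} D Gamma^{1/2}
    return float(np.sum(V[0, :] ** 2 * (delta / gammas[0]) ** (1.0 - alpha)))

gammas = np.array([1.0, 1.0])        # equal example fractions and noise levels
for r in (0, 1, 100):                # alpha = (2r+1)/(2r+2) -> 1 for smooth kernels
    alpha = (2 * r + 1) / (2 * r + 2)
    print(r, multitask_gain(0.8, alpha, gammas))     # gain creeps up towards 1
```

For $\rho < 1$ all $\delta_a$ are positive, and as $r$ grows the printed gain approaches unity, which is the asymptotic-uselessness statement in numerical form.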
While this provides a very good approximation to the learning curves for moderate n [21], our results here show that asymptotically this bound can become very loose. When predicting rough functions, there is some asymptotic improvement to be had from multi-task learning, though again the multi-task gain is nonlinear in ρ²: see Fig. 2(left, inset) for the OU case (which has r = 0). A simple expression for the gain can be obtained in the limit of many tasks, to which we turn next.

¹See the discussion of Sacks–Ylvisaker conditions in e.g. [1]; we consider one-dimensional inputs here though the discussion can be generalized.

4.4 Many tasks

We assume as for the two-task case that all inter-task correlations, D_{τ,τ′} with τ ≠ τ′, are equal to ρ, while D_{τ,τ} = 1. This setup was used e.g. in [23], and can be interpreted as each task having a component proportional to √ρ of a shared latent function, with an independent task-specific signal in addition. We assume for simplicity that we have the same number n_τ = n/T of examples for each task, and that all noise levels are the same, σ_τ² = σ². Then all Bayes errors ε_τ = ε will also be the same.
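The spectrum of this uniform-correlation matrix drives the analysis that follows: D has one eigenvalue 1 + (T − 1)ρ, associated with the shared latent function, and T − 1 degenerate eigenvalues 1 − ρ for the task-specific parts. A quick numerical check (our own sketch):

```python
import numpy as np

T, rho = 200, 0.8
# Uniform-correlation inter-task matrix: D_tt = 1, D_tt' = rho for t != t'
D = (1 - rho) * np.eye(T) + rho * np.ones((T, T))
eig = np.sort(np.linalg.eigvalsh(D))
# smallest T-1 eigenvalues: 1 - rho; largest: 1 + (T-1) rho
```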
Carrying out the matrix inverses in (7) explicitly, one can then write this equation as

ε = g_T(n/(σ² + ε), ρ)      (11)

where g_T(h, ρ) is related to the single-task function g(h) from above by

g_T(h, ρ) = [(T − 1)/T] (1 − ρ) g(h(1 − ρ)/T) + [ρ + (1 − ρ)/T] g(h[ρ + (1 − ρ)/T])      (12)

Now consider the limit T → ∞ of many tasks. If n and hence h = n/(σ² + ε) is kept fixed, g_T(h, ρ) → (1 − ρ) + ρg(hρ); here we have taken g(0) = 1, which corresponds to tr Λ = ⟨C(x, x)⟩_x = 1 as in the examples above. One can then deduce from (11) that the Bayes error for any task will have the form ε = (1 − ρ) + ρε̃, where ε̃ decays from one to zero with increasing n as for a single task, but with an effective noise level σ̃² = (1 − ρ + σ²)/ρ. Remarkably, then, even though here n/T → 0, so that for most tasks no examples have been seen, the Bayes error for each task decreases by "collective learning" to a plateau of height 1 − ρ. The remaining decay of ε to zero happens only once n becomes of order T. Here one can show, by taking T → ∞ at fixed h/T in (12) and inserting into (11), that ε = (1 − ρ)ε̄, where ε̄ again decays as for a single task but with an effective number of examples n̄ = n/T and effective noise level σ̄² = σ²/(1 − ρ). This final stage of learning therefore happens only when each task has seen a considerable number of examples n/T.

Fig. 2(right) validates these predictions against simulations, for a number of tasks (T = 200) that is in the same ballpark as in the many-tasks application example of [24].
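The self-consistency equation (11) with (12) is easy to iterate numerically. The sketch below is our own illustration: it uses an i⁻² (OU-like) spectrum truncated at 2000 modes and normalised so that g(0) = 1, a noise level σ² = 0.05, and plain fixed-point iteration from the prior error ε = 1; since the map is increasing in ε, the iterates decrease monotonically to the self-consistent solution. All of these choices are ours, not taken from the paper.

```python
import numpy as np

i = np.arange(1, 2001)
lam = i ** -2.0 / np.sum(i ** -2.0)        # illustrative spectrum with tr Lambda = 1

def g(h):
    # single-task function g(h) = sum_i (1/lambda_i + h)^{-1}
    return np.sum(1.0 / (1.0 / lam + h))

def g_T(h, rho, T):
    # Eq. (12)
    c = rho + (1 - rho) / T
    return (T - 1) / T * (1 - rho) * g(h * (1 - rho) / T) + c * g(h * c)

def eps(n, rho, T, sigma2=0.05, iters=100):
    e = 1.0                                # start from the prior error g(0) = 1
    for _ in range(iters):                 # fixed-point iteration of Eq. (11)
        e = g_T(n / (sigma2 + e), rho, T)
    return float(e)
```

With T = 200 and ρ² = 0.8 (ρ ≈ 0.894), moderate n leaves the error on a plateau just above 1 − ρ ≈ 0.106 even though n/T < 1, while n of order 100T drives the error well below the plateau, reproducing the two learning stages described above.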
The inset for T = 1000 shows clearly how the two learning curve stages separate as T becomes larger.

Finally we can come back to the multi-task gain in the asymptotic stage of learning. For GP priors with sample functions with derivatives up to order r as before, the function ε̄ from above will decay as (n̄/σ̄²)^{−α}; since ε = (1 − ρ)ε̄ and σ̄² = σ²/(1 − ρ), the Bayes error is ε ∝ (1 − ρ)(n̄(1 − ρ)/σ²)^{−α} ∝ (1 − ρ)^{1−α}. This multi-task gain again approaches unity for ρ < 1 for smooth functions (α = (2r + 1)/(2r + 2) → 1). Interestingly, for rough functions (α < 1), the multi-task gain decreases for small ρ² as 1 − (1 − α)√ρ² and so always lies below a linear dependence on ρ² initially. This shows that a linear-in-ρ² lower error bound cannot generally apply to T > 2 tasks, and indeed one can verify that the derivation in [21] does not extend to this case.

5 Conclusion

We have derived an approximate prediction (7) for learning curves in multi-task GP regression, valid for arbitrary inter-task correlation matrices D. This can be evaluated explicitly knowing only the kernel eigenvalues, without sampling or recourse to single-task learning curves. The approximation shows that pure transfer learning has a simple lower error bound, and provides a good qualitative account of numerically simulated learning curves. Because it can be used to study the asymptotic behaviour for large training sets, it allowed us to show that multi-task learning can become asymptotically useless: when learning smooth functions it reduces the asymptotic Bayes error only if tasks are fully correlated. For the limit of many tasks we found that, remarkably, some initial "collective learning" is possible even when most tasks have not seen examples.
A much slower second learning stage then requires many examples per task. The asymptotic regime of this also showed explicitly that a lower error bound that is linear in ρ², the square of the inter-task correlation, is applicable only to the two-task setting T = 2.

In future work it would be interesting to use our general result to investigate in more detail the consequences of specific choices for the inter-task correlations D, e.g. to represent a lower-dimensional latent factor structure. One could also try to deploy similar approximation methods to study the case of model mismatch, where the inter-task correlations D would have to be learned from data. More challenging, but worthwhile, would be an extension to multi-task covariance functions where task and input-space correlations do not factorize.

References

[1] C E Rasmussen and C K I Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

[2] J Baxter. A model of inductive bias learning. J. Artif. Intell. Res., 12:149–198, 2000.

[3] S Ben-David and R S Borbely. A notion of task relatedness yielding provable multiple-task learning guarantees. Mach. Learn., 73(3):273–287, December 2008.

[4] Y W Teh, M Seeger, and M I Jordan. Semiparametric latent factor models. In Workshop on Artificial Intelligence and Statistics 10, pages 333–340. Society for Artificial Intelligence and Statistics, 2005.

[5] E V Bonilla, F V Agakov, and C K I Williams. Kernel multi-task learning using task-specific features. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS). Omni Press, 2007.

[6] E V Bonilla, K M A Chai, and C K I Williams. Multi-task Gaussian process prediction. In J C Platt, D Koller, Y Singer, and S Roweis, editors, NIPS 20, pages 153–160, Cambridge, MA, 2008.
MIT Press.

[7] M Alvarez and N D Lawrence. Sparse convolved Gaussian processes for multi-output regression. In D Koller, D Schuurmans, Y Bengio, and L Bottou, editors, NIPS 21, pages 57–64, Cambridge, MA, 2009. MIT Press.

[8] G Leen, J Peltonen, and S Kaski. Focused multi-task learning using Gaussian processes. In Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, editors, Machine Learning and Knowledge Discovery in Databases, volume 6912 of Lecture Notes in Computer Science, pages 310–325. Springer Berlin, Heidelberg, 2011.

[9] M A Álvarez, L Rosasco, and N D Lawrence. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4:195–266, 2012.

[10] A Maurer. Bounds for linear multi-task learning. J. Mach. Learn. Res., 7:117–139, 2006.

[11] M Opper and F Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, NIPS 11, pages 302–308, Cambridge, MA, 1999. MIT Press.

[12] G F Trecate, C K I Williams, and M Opper. Finite-dimensional approximation of Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, NIPS 11, pages 218–224, Cambridge, MA, 1999. MIT Press.

[13] P Sollich. Learning curves for Gaussian processes. In M S Kearns, S A Solla, and D A Cohn, editors, NIPS 11, pages 344–350, Cambridge, MA, 1999. MIT Press.

[14] D Malzahn and M Opper. Learning curves for Gaussian processes regression: A framework for good approximations. In T K Leen, T G Dietterich, and V Tresp, editors, NIPS 13, pages 273–279, Cambridge, MA, 2001. MIT Press.

[15] D Malzahn and M Opper. A variational approach to learning curves. In T G Dietterich, S Becker, and Z Ghahramani, editors, NIPS 14, pages 463–469, Cambridge, MA, 2002. MIT Press.

[16] D Malzahn and M Opper. Statistical mechanics of learning: a variational approach for real data. Phys. Rev.
Lett., 89:108302, 2002.

[17] P Sollich and A Halees. Learning curves for Gaussian process regression: approximations and bounds. Neural Comput., 14(6):1393–1428, 2002.

[18] P Sollich. Gaussian process regression with mismatched models. In T G Dietterich, S Becker, and Z Ghahramani, editors, NIPS 14, pages 519–526, Cambridge, MA, 2002. MIT Press.

[19] P Sollich. Can Gaussian process regression be made robust against model mismatch? In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Artificial Intelligence, pages 199–210. Springer Berlin, Heidelberg, 2005.

[20] M Urry and P Sollich. Exact learning curves for Gaussian process regression on large random graphs. In J Lafferty, C K I Williams, J Shawe-Taylor, R S Zemel, and A Culotta, editors, NIPS 23, pages 2316–2324, Cambridge, MA, 2010. MIT Press.

[21] K M A Chai. Generalization errors and learning curves for regression with multi-task Gaussian processes. In Y Bengio, D Schuurmans, J Lafferty, C K I Williams, and A Culotta, editors, NIPS 22, pages 279–287, 2009.

[22] H Zhu, C K I Williams, R J Rohwer, and M Morciniec. Gaussian regression and optimal finite dimensional linear models. In C M Bishop, editor, Neural Networks and Machine Learning. Springer, 1998.

[23] E Rodner and J Denzler. One-shot learning of object categories using dependent Gaussian processes. In Michael Goesele, Stefan Roth, Arjan Kuijper, Bernt Schiele, and Konrad Schindler, editors, Pattern Recognition, volume 6376 of Lecture Notes in Computer Science, pages 232–241. Springer Berlin, Heidelberg, 2010.

[24] T Heskes. Solving a huge number of similar tasks: a combination of multi-task learning and a hierarchical Bayesian approach. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML'98), pages 233–241.
Morgan Kaufmann, 1998.