{"title": "Exact learning curves for Gaussian process regression on large random graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 2316, "page_last": 2324, "abstract": "We study learning curves for Gaussian process regression which characterise performance in terms of the Bayes error averaged over datasets of a given size. Whilst learning curves are in general very difficult to calculate we show that for discrete input domains, where similarity between input points is characterised in terms of a graph, accurate predictions can be obtained. These should in fact become exact for large graphs drawn from a broad range of random graph ensembles with arbitrary degree distributions where each input (node) is connected only to a finite number of others. The method is based on translating the appropriate belief propagation equations to the graph ensemble. We demonstrate the accuracy of the predictions for Poisson (Erdos-Renyi) and regular random graphs, and discuss when and why previous approximations to the learning curve fail.", "full_text": "Exact learning curves for Gaussian process regression\n\non large random graphs\n\nMatthew J. Urry\n\nDepartment of Mathematics\n\nKing\u2019s College London\n\nLondon, WC2R 2LS, U.K.\n\nmatthew.urry@kcl.ac.uk\n\nPeter Sollich\n\nDepartment of Mathematics\n\nKing\u2019s College London\n\nLondon, WC2R 2LS, U.K.\n\npeter.sollich@kcl.ac.uk\n\nAbstract\n\nWe study learning curves for Gaussian process regression which characterise per-\nformance in terms of the Bayes error averaged over datasets of a given size. Whilst\nlearning curves are in general very dif\ufb01cult to calculate we show that for discrete\ninput domains, where similarity between input points is characterised in terms of a\ngraph, accurate predictions can be obtained. 
These should in fact become exact for large graphs drawn from a broad range of random graph ensembles with arbitrary degree distributions, where each input (node) is connected only to a finite number of others. Our approach is based on translating the appropriate belief propagation equations to the graph ensemble. We demonstrate the accuracy of the predictions for Poisson (Erdos-Renyi) and regular random graphs, and discuss when and why previous approximations of the learning curve fail.\n\n1 Introduction\n\nLearning curves are a convenient way of characterising the performance that can be achieved with machine learning algorithms: they give the generalisation error ε as a function of the number of training examples n, averaged over all datasets of size n under appropriate assumptions about the data-generating process. Such a characterisation is particularly useful in the case of non-parametric approaches such as Gaussian processes (GPs) [1], where in contrast to the parametric case [2] there is no generic classification of possible learning curves.\nHere we study GP regression, where a real-valued output function f(x) is to be learned. Qualitatively, GP learning curves are relatively well understood for the scenario where the inputs x come from a continuous space, typically R^n [3, 4, 5, 6, 7, 8, 9, 10, 11]. However, except in the limit of large n, or for very specific situations like one-dimensional inputs [3], the learning curves cannot be calculated exactly. 
Here we show that this is possible for discrete input spaces where similarity between input points can be represented as a graph whose edges connect similar points, inspired by work at last year's NIPS that developed simple approximations for this scenario [12].\nThere are many potential application domains where learning of such functions of discrete inputs x could be relevant, for example if x is a research paper whose impact f(x) we would like to predict; the similarity graph could then be constructed on the basis of shared authorship. Or we could be trying to learn functions on generic symbol strings x, for example ones characterising protein amino acid sequences, and the similarity graph would have edges between homologous molecules.\nOur aim is to find out how well GP regression can perform in such discrete domains; alternative inference approaches including online algorithms [13, 14, 15, 16] would also be interesting to study but are outside the scope of the present paper. We focus on large sparse random graphs, where each node is connected only to a finite number of other nodes even though the overall number of nodes in the graph is large.\nIn section 2 we give a brief overview of GP regression and summarise the approximation for the learning curves used in previous work [4, 8, 12]. Section 3 then explains our method: following an approach similar to that of [17] for random matrix spectra, we write down the belief propagation equations for a given graph in the form normally used in the cavity method [18] of statistical mechanics, and then translate them to graphs drawn from a random graph ensemble. 
Because for sparse random graphs typical loop lengths grow with the graph size, the belief propagation equations and hence our learning curve predictions should become exact for large graphs.\nSection 4 compares the predictions with simulation results for Poisson (Erdos-Renyi) graphs, where each edge is independently present with some small probability, and random regular graphs, where each node has the same degree (number of neighbours). The new predictions are indeed very accurate, and substantially more so than previous approximations. In section 4.1 we discuss in more detail the relationship between our work and these approximations, to rationalise where the strongest deviations occur. Finally, section 5 summarises our results and discusses open questions and directions for future work.\n\n2 GP regression and approximate learning curves\n\nGaussian processes have become a well known machine learning technique used in a wide range of areas, see e.g. [19, 20, 21]. One reason for their success is the intuitive way that a priori information about the function to be learned is transparently encoded by the covariance and mean functions of the GP.\nA GP is a Gaussian prior over functions f with a fixed covariance function (kernel) C and mean function (assumed to be 0)¹. In the simplest case the likelihood is also Gaussian, i.e. we assume that the outputs y_μ in a set of examples D = {(i_1, y_1), . . . , (i_N, y_N)} are obtained by corrupting the clean function values f_{i_μ} with i.i.d. Gaussian noise of variance σ². Then the posterior distribution over functions is, from Bayes' theorem P(f|D) ∝ P(f)P(D|f):\n\nP(f|D) ∝ exp(−(1/2) fᵀC⁻¹f − (1/2σ²) Σ_{μ=1}^{N} (y_μ − f_{i_μ})²)   (1)\n\nWe consider GPs in discrete spaces, where each input is a node of a graph and can therefore be given a discrete label i as anticipated above; f_i is the associated function value. 
If the graph has V nodes, the covariance function is then just a V × V matrix.\nA number of possible forms for covariance functions on graphs have been proposed. We will focus on the relatively flexible random walk covariance function [22],\n\nC = (1/κ) ((1 − a⁻¹)I + a⁻¹ D^{−1/2} A D^{−1/2})^p,   a ≥ 2,   p ≥ 0   (2)\n\nHere A is the adjacency matrix of the graph, with A_ij = 1 if nodes i and j are connected by an edge, and 0 otherwise; D = diag{d_1, . . . , d_V} is a diagonal matrix containing the degrees of the nodes in the graph (d_i = Σ_j A_ij). One can easily see the relationship to a random walk: the unnormalised covariance function is a (symmetrised) p-step 'lazy' random walk, with probability a⁻¹ of moving to a neighbouring node at each step. The prior thus assumes that function values up to a distance p along the graph are correlated with each other, to an extent determined by the hyperparameter a⁻¹. The constant κ will be chosen throughout to normalise C so that (1/V) Σ_i C_ii = 1, which corresponds to setting the average prior variance of the function values to unity.\nOur main concern in this paper is GP learning curves in discrete input spaces. The learning curve describes how the average generalisation error (mean square error) ε decreases with the number of examples N. Qualitatively, it gives the rate at which one would expect a GP to learn a function in the average case. The generalisation error on an ensemble of graphs is given by\n\nε = ⟨ (1/V) Σ_i (f̄_i − f_i)² ⟩_{f|D, D, graphs}   (3)\n\n¹We focus on the zero prior mean case throughout. 
All results translate fairly straightforwardly to the non-zero mean case, but this complicates the algebra without leading to substantially new insights.\n\nwhere f is the uncorrupted (clean) teacher or target function, and f̄ is the posterior mean function of the GP, which gives the function values we predict on the basis of the data D. It is worth noting that the generalisation error for a graph ensemble contains an additional average over this ensemble. As is standard in the study of learning curves, we have assumed a matched scenario where the posterior P(f|D) for our predictions is also the posterior over the underlying target functions. The generalisation error is then the Bayes error, and is given by the average posterior variance.\nSollich [4] and later Opper [7], with a more general replica approach, showed that for continuous input spaces a reasonable approximation to the learning curve could be expressed as the solution of the following self-consistent equation:\n\nε = g(N/(ε + σ²)),   g(h) = Σ_{α=1}^{V} (λ_α⁻¹ + h)⁻¹   (4)\n\nHere the λ_α are appropriately defined eigenvalues of the covariance function. The motivation for our study is work presented at NIPS 2009 [12], which demonstrated that this approximation can also be used in discrete domains, but is not always accurate. Studying random walk and diffusion kernels [22] on random regular graphs, the authors showed that although the eigenvalue-based approximation is reasonable for both the large and the small N limits, it fails to accurately predict the learning curve in the important transition region between these two extremes, drastically so for low noise variances σ².\nIn the next section we will show that this shortcoming can be overcome by the cavity method (belief propagation), which explicitly takes advantage of the sparse structure of the underlying graph. 
This will give an accurate approximation for the learning curves in a broad range of ensembles of sparse random graphs.\n\n3 Accurate predictions with the cavity method\n\nThe cavity method was developed in statistical physics [18] but is closely related to belief propagation; for a good overview of these and other mean field methods, see e.g. [23]. We begin with equation (3). Because we only need the posterior variance in the matched case considered here, we can shift f so that f̄ = 0; f_i is then the deviation of the function value at node i from the posterior mean. In this notation, the Bayes error is\n\nε = ⟨ (1/V) Σ_i ∫ df f_i² P(f|D) ⟩_{D, graphs}   (5)\n\nwhere P(f|D) now contains in the exponent only the terms from (1) that are quadratic in f.\nTo set up the cavity method, we begin by defining a generating or partition function Z, for a fixed graph, as\n\nZ = ∫ df exp(−(1/2) fᵀC⁻¹f − (1/2σ²) Σ_μ f_{i_μ}² − (λ/2) Σ_i f_i²)   (6)\n\nAn auxiliary parameter λ has been added here to allow us to represent the Bayes error as ε = −lim_{λ→0} (2/V) ∂/∂λ ⟨log Z⟩_{D, graphs}. The dependence on the dataset D appears in Z only through the sum over μ. It will be more useful to write this as a sum over all nodes: if n_i counts the number of examples seen at node i, then Σ_μ f_{i_μ}² = Σ_i n_i f_i². Even with this replacement, the partition function in equation (6) is not yet suitable for an application of the cavity method, since the inverse covariance function cannot be written explicitly and generates interaction terms f_i f_j between nodes that can be far away from each other along the graph. To eliminate the inverse of the covariance function we therefore perform a Fourier transform on the first term in the exponent, exp(−(1/2) fᵀC⁻¹f) ∝ ∫ dh exp(−(1/2) hᵀCh + i Σ_i h_i f_i). 
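In more detail, after this Fourier transform the integral over each f_i is a standard one-dimensional Gaussian integral (with n_i examples at node i and the auxiliary parameter λ):

```latex
\int \mathrm{d}f_i\,
\exp\!\left( -\tfrac{1}{2}\left(\frac{n_i}{\sigma^2} + \lambda\right) f_i^2
             + i h_i f_i \right)
\;\propto\;
\exp\!\left( -\frac{h_i^2}{2\,(n_i/\sigma^2 + \lambda)} \right)
```

Carrying this out for every node i produces the diagonal second term in the following equation.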
The integral over f then factorizes over the f_i, and one finds\n\nZ ∝ ∫ dh exp(−(1/2) hᵀCh − (1/2) hᵀ diag{(n_i/σ² + λ)⁻¹} h)   (7)\n\nSubstituting the explicit form of the covariance function (2) into equation (7) we have\n\nZ ∝ ∫ dh exp(−(1/2) hᵀ Σ_{q=0}^{p} c_q (D^{−1/2}AD^{−1/2})^q h − (1/2) hᵀ diag{(n_i/σ² + λ)⁻¹} h)   (8)\n\nwhere we have written the power in equation (2) as a binomial sum and defined c_q = p!/[q!(p − q)!] a^{−q} (1 − a⁻¹)^{p−q}/κ.\nFor p > 1, equation (8) still has interactions with more than the immediate neighbours. To solve this we introduce additional variables h^q, defined recursively via h^q = (D^{−1/2}AD^{−1/2}) h^{q−1} for q ≥ 1, with h^0 = h. These definitions are enforced via Dirac delta functions, each i and q ≥ 1 giving a factor\n\nδ(h_i^q − d_i^{−1/2} Σ_j A_ij d_j^{−1/2} h_j^{q−1}) ∝ ∫ dĥ_i^q exp[i ĥ_i^q (h_i^q − d_i^{−1/2} Σ_j A_ij d_j^{−1/2} h_j^{q−1})]\n\nSubstituting this into equation (8) gives the key advantage that now the adjacency matrix appears only linearly in the exponent, so that we have interactions only across edges of the graph. 
Rescaling the h_i^q to d_i^{1/2} h_i^q, and similarly for the ĥ_i^q, and explicitly separating off the local terms from the interactions finally yields\n\nZ ∝ ∫ Π_{q=0}^{p} dh^q Π_{q=1}^{p} dĥ^q Π_i exp(−(1/2) Σ_{q=0}^{p} c_q d_i h_i^0 h_i^q − (1/2) d_i (h_i^0)²/(n_i/σ² + λ) + i d_i Σ_{q=1}^{p} ĥ_i^q h_i^q) × Π_{(ij)} exp(−i Σ_{q=1}^{p} (ĥ_i^q h_j^{q−1} + ĥ_j^q h_i^{q−1}))   (9)\n\nWe now have the partition function of a (complex-valued) Gaussian graphical model. By differentiating log Z with respect to λ, keeping track of λ-dependent prefactors not written above, one finds that the Bayes error is\n\nε = lim_{λ→0} (1/V) Σ_i [1/(n_i/σ² + λ)] (1 − d_i ⟨(h_i^0)²⟩/(n_i/σ² + λ))   (10)\n\nand so we need the marginal distributions of the h_i^0. 
This is where the cavity method enters: for a large random graph the structure is locally treelike, so that if node i were eliminated the corresponding subgraphs (locally trees) rooted at the neighbours j ∈ N(i) of i would become independent [17]. The resulting cavity marginals P_j^{(i)}(h_j, ĥ_j|D) can then be calculated iteratively within these subgraphs, giving the cavity update equations\n\nP_j^{(i)}(h_j, ĥ_j|D) ∝ exp(−(1/2) Σ_{q=0}^{p} c_q d_j h_j^0 h_j^q − (1/2) d_j (h_j^0)²/(n_j/σ² + λ) + i d_j Σ_{q=1}^{p} ĥ_j^q h_j^q) × ∫ Π_{k∈N(j)\\i} dh_k dĥ_k exp(−i Σ_{q=1}^{p} (ĥ_j^q h_k^{q−1} + ĥ_k^q h_j^{q−1})) P_k^{(j)}(h_k, ĥ_k|D)   (11)\n\nOne sees that these equations are solved self-consistently by complex-valued Gaussian distributions with mean zero and covariance matrices V_j^{(i)}. By performing the Gaussian integrals in the cavity update equations (11) explicitly, these equations then take the rather simple form\n\nV_j^{(i)} = (O_j − Σ_{k∈N(j)\\i} X V_k^{(j)} X)⁻¹   (12)\n\nwhere we have defined the (2p + 1) × (2p + 1) matrices O_i and X as follows. Ordering the variables at each node as (h^0, h^1, . . . , h^p, ĥ^1, . . . , ĥ^p), the only non-zero entries of O_i are (O_i)_{h^0 h^0} = d_i (c_0 + 1/(n_i/σ² + λ)), (O_i)_{h^0 h^q} = (O_i)_{h^q h^0} = d_i c_q/2 and (O_i)_{h^q ĥ^q} = (O_i)_{ĥ^q h^q} = −i d_i for q = 1, . . . , p; the only non-zero entries of X are (X)_{h^{q−1} ĥ^q} = (X)_{ĥ^q h^{q−1}} = i, again for q = 1, . . . , p.\nFinally we need to translate these equations to an ensemble of large sparse graphs. 
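As an aside, sample graphs from such a sparse ensemble with a prescribed degree sequence can be drawn with the standard configuration model; the sketch below is our own illustration (the function name is ours, not the paper's), assuming NumPy:

```python
import numpy as np

def configuration_model(degrees, rng):
    """Pair up 'stubs' (half-edges) uniformly at random to obtain a graph
    with the prescribed degree sequence (total degree assumed even).
    Self-loops and multi-edges are simply discarded, which is a vanishing
    correction for large sparse graphs."""
    stubs = np.repeat(np.arange(len(degrees)), degrees)
    rng.shuffle(stubs)
    A = np.zeros((len(degrees), len(degrees)), dtype=int)
    for i, j in stubs.reshape(-1, 2):
        if i != j:                      # drop self-loops; repeats just overwrite
            A[i, j] = A[j, i] = 1
    return A

rng = np.random.default_rng(0)
A = configuration_model([3] * 10, rng)  # a small (approximately) 3-regular graph
```

Because colliding stubs are discarded, a few nodes may end up with degree below their target; for large V this happens with vanishing probability per node.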
Each ensemble is characterised by the distribution p(d) of the degrees d_i, with every graph that has the desired degree distribution being assigned the same probability. Instead of individual cavity covariance matrices V_j^{(i)}, we need to consider their probability distribution W(V) across all edges of the graph. Picking at random an edge (i, j) of a graph, the probability that node j will have degree d_j is then p(d_j) d_j / d̄, because such a node has d_j “chances” of being picked. (The normalisation factor is the average degree d̄.) Using again the locally treelike structure, the incoming (to node j) cavity covariances V_k^{(j)} will be i.i.d. samples from W(V). Thus a fixed point of the cavity update equations corresponds to a fixed point of an update equation for W(V):\n\nW(V) = Σ_d [p(d) d / d̄] ⟨ ∫ Π_{k=1}^{d−1} dV_k W(V_k) δ(V − (O − Σ_{k=1}^{d−1} X V_k X)⁻¹) ⟩_n   (13)\n\nBecause the node label is now arbitrary, we have abbreviated V_j^{(i)} to V, d_j to d, O_j to O and V_k^{(j)} to V_k. The average is over the distribution of the number of examples n ≡ n_j at node j in the dataset D. Assuming for simplicity that examples are drawn with uniform input probability across all nodes, this distribution is simply n ∼ Poisson(ν) in the limit of large N and V at fixed ν = N/V.\nIn general equation (13) – which can also be formally derived using the replica approach [24] – cannot be solved analytically, but we can solve it numerically using a standard population dynamics method [25]. 
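In population dynamics, W(V) is represented by a finite population of matrices, and members are repeatedly replaced by the right-hand side of (13) evaluated on randomly drawn members. The sketch below is our own illustrative implementation (not the authors' code), specialised to the simplest kernel p = 1, so that the matrices are 3 × 3 with variables (h^0, h^1, ĥ^1); for simplicity it sets κ = 1 instead of normalising the average prior variance to unity, so here the prior variance is c_0 = 0.5:

```python
import numpy as np

# Illustrative population dynamics for the cavity update (13) and the Bayes
# error (16), for p = 1. All names and parameter choices are ours.
SIGMA2, LAM = 0.1, 1e-6                 # noise level sigma^2, auxiliary lambda
A_HYP, KAPPA = 2.0, 1.0                 # kernel hyperparameters a and kappa
C0, C1 = (1 - 1 / A_HYP) / KAPPA, (1 / A_HYP) / KAPPA   # c_q for p = 1

X = np.zeros((3, 3), dtype=complex)     # edge matrix: couples hhat^1 to h^0
X[0, 2] = X[2, 0] = 1j

def O_matrix(d, n):
    """Local matrix O for a node of degree d holding n examples."""
    O = np.zeros((3, 3), dtype=complex)
    O[0, 0] = d * (C0 + 1.0 / (n / SIGMA2 + LAM))
    O[0, 1] = O[1, 0] = d * C1 / 2
    O[1, 2] = O[2, 1] = -1j * d
    return O

def solve_W(rng, p_deg, nu, pop_size=500, steps=10000):
    """Represent W(V) by a population; resample members via equation (13)."""
    degs = np.array(list(p_deg))
    edge_w = np.array([p_deg[d] * d for d in degs], dtype=float)
    edge_w /= edge_w.sum()              # edge-degree distribution p(d) d / dbar
    pop = [np.eye(3, dtype=complex) for _ in range(pop_size)]
    for _ in range(steps):
        d = rng.choice(degs, p=edge_w)
        n = rng.poisson(nu)
        inc = [pop[i] for i in rng.integers(pop_size, size=d - 1)]
        pop[rng.integers(pop_size)] = np.linalg.inv(
            O_matrix(d, n) - sum(X @ Vk @ X for Vk in inc))
    return pop

def bayes_error(rng, pop, p_deg, nu, samples=4000):
    """Monte Carlo estimate of equation (16), with M as in the Woodbury split."""
    degs = np.array(list(p_deg))
    node_w = np.array([p_deg[d] for d in degs], dtype=float)
    node_w /= node_w.sum()
    total = 0.0
    for _ in range(samples):
        d = rng.choice(degs, p=node_w)
        n = rng.poisson(nu)
        inc = [pop[i] for i in rng.integers(len(pop), size=d)]
        M = O_matrix(d, 0)
        M[0, 0] = d * C0                # strip the n- and lambda-dependent term
        M = M - sum(X @ Vk @ X for Vk in inc)
        total += 1.0 / (n / SIGMA2 + d * np.linalg.inv(M)[0, 0].real)
    return total / samples

rng = np.random.default_rng(0)
p_deg = {3: 1.0}                        # random 3-regular ensemble
eps = bayes_error(rng, solve_W(rng, p_deg, nu=2.0), p_deg, nu=2.0)
```

A useful sanity check under these assumptions: at ν = 0 the predicted ε should reproduce the average prior variance, here c_0 = 0.5.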
Once we have W(V), the Bayes error can be found from the graph ensemble version of equation (10), which is obtained by inserting the explicit expression for ⟨(h_i^0)²⟩ in terms of the cavity marginals of the neighbouring nodes, and replacing the average over nodes with an average over p(d):\n\nε = lim_{λ→0} Σ_d p(d) ⟨ [1/(n/σ² + λ)] (1 − [d/(n/σ² + λ)] ∫ Π_{k=1}^{d} dV_k W(V_k) [(O − Σ_{k=1}^{d} X V_k X)⁻¹]_{00} ) ⟩_n   (14)\n\nThe number of examples at the node is again to be averaged over n ∼ Poisson(ν). The subscript “00” indicates the top left element of the matrix, which determines the variance of h^0.\nTo be able to use equation (14), it needs to be rewritten in a form that remains explicitly non-singular when n = 0 and λ → 0. We split off the n-dependence of the matrix inverse by writing O − Σ_{k=1}^{d} X V_k X = M + [d/(n/σ² + λ)] e_0 e_0ᵀ, where e_0ᵀ = (1, 0, . . . , 0). The matrix inverse appearing above can then be expressed using the Woodbury formula as\n\nM⁻¹ − M⁻¹ e_0 e_0ᵀ M⁻¹ / [(n/σ² + λ)/d + e_0ᵀ M⁻¹ e_0]   (15)\n\nTo extract the (0,0)-element (top left) as required we multiply by e_0ᵀ · · · e_0. After some simplification the λ → 0 limit can then be taken, with the result\n\nε = Σ_d p(d) ⟨ ∫ Π_{k=1}^{d} dV_k W(V_k) [1/(n/σ² + d(M⁻¹)_{00})] ⟩_n   (16)\n\nThis has a simple interpretation: the cavity marginals of the neighbours provide an effective Gaussian prior for each node, whose inverse variance is d(M⁻¹)_{00}.\nThe self-consistency equation (13) for W(V) and the expression (16) for the resulting Bayes error are our main results. 
They allow us to predict learning curves as a function of the number of examples per node, ν, for arbitrary degree distributions p(d) of our random graph ensemble provided the graphs are sparse, and for arbitrary noise level σ² and covariance function hyperparameters p and a.\nWe note briefly that in graphs with isolated nodes (d = 0), one has to be slightly careful: already in the definition of the covariance function (2) one should replace D → D + δI to avoid division by zero, taking δ → 0 at the end. For d = 0 one then finds in the expression (16) that (M⁻¹)_{00} = 1/(c_0 δ), so that (δ + d)(M⁻¹)_{00} = δ(M⁻¹)_{00} = 1/c_0. This is to be expected since isolated nodes each have a separate Gaussian prior with variance c_0.\n\n4 Results\n\nWe will begin by comparing the performance of our new cavity prediction (equation (16)) against the eigenvalue approximation (equation (4)) from [4, 7], for random regular graphs with degree 3 (so that p(d) = δ_{d,3}). In this way we can exploit the work of [12], where the quality of the approximation (4) for this case was studied in some detail.\n\nFigure 1: (Left) A comparison of the cavity prediction (solid line with triangles) against the eigenvalue approximation (dashed line) for the learning curves for random regular graphs of degree 3, and against simulation results for graphs with V = 500 nodes (solid line with circles). Random walk kernel with p = 1, a = 2; noise level as shown. (Right) As before with p = 10, a = 2. (Bottom) Similarly for Poisson (Erdos-Renyi) graphs with c = 3.\n\nAs can be seen in figure 1 (left) & (right), the cavity approach is accurate along the entire learning curve, to the point where the prediction is visually almost indistinguishable from the numerical simulation results. 
Importantly, the cavity approach predicts even the midsection of the learning curve for intermediate values of ν, where the eigenvalue prediction clearly fails. The deviations between the cavity theory and the eigenvalue predictions are largest in this central part because at this point fluctuations in the number of examples seen at each node have the greatest effect. Indeed, for much smaller ν, the dataset does not contain any examples from many of the nodes, i.e. n = 0 is dominant and fluctuations towards larger n have low probability. For large ν, the dataset typically contains many examples for each node and Poisson fluctuations around the average value n = ν are small. The fluctuation effects for intermediate ν are suppressed when the noise level σ² is large, because then the generalisation error in the range of intermediate ν is still fairly close to its initial value (at ν = 0). But for smaller noise levels, fluctuations in the number of examples for each node can have a large effect, and correspondingly the eigenvalue prediction becomes very poor for intermediate ν. We discuss this further in section 4.1.\nComparing figure 1 (left) and (right), it can also be seen that, unlike the eigenvalue-based approximation, the cavity prediction for the learning curve does not deteriorate as p is varied towards lower values. Similar conclusions apply with regard to changes of a (results not shown).\n\nNext we consider Poisson (Erdos-Renyi) graphs, where each edge is present independently with probability c/V [26]. This leads to a Poisson distribution of degrees, p(d) = e^{−c} c^d / d!. Figure 1 (bottom) shows the performance of our cavity prediction for this graph ensemble with c = 3, for a GP with p = 10, a = 2, in comparison to simulation results for V = 500. 
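The role of the per-node example-count fluctuations discussed above can be quantified directly from the Poisson distribution: a quick illustration (ours, not from the paper):

```python
from math import exp, sqrt

# For n ~ Poisson(nu): P(n = 0) = e^(-nu) and std(n)/mean(n) = 1/sqrt(nu).
# Both are of order one precisely at intermediate nu, which is where the
# eigenvalue approximation is least accurate.
for nu in [0.1, 1.0, 4.0, 100.0]:
    print(f"nu={nu:6.1f}  P(n=0)={exp(-nu):.3f}  std/mean={1 / sqrt(nu):.2f}")
```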
The cavity prediction clearly outperforms the eigenvalue-based approximation and again remains accurate even in the central part of the learning curve. Taken together, the results for random regular and Poisson graphs clearly confirm our expectation that the cavity prediction for the learning curve that we have derived should be exact for large graphs. It is worth noting that our new cavity prediction will work for arbitrary degree distributions and is limited only by the assumption of graph sparsity.\n\n4.1 Why the eigenvalue approximation fails\n\nThe derivation of the eigenvalue approximation (4) by Opper in [8] gives some insight into when and how this approximation breaks down. Opper takes equation (6) and uses the replica trick to write ⟨log Z⟩_D = lim_{n→0} (1/n) log⟨Z^n⟩_D. The average of Z^n is calculated for integer n and then appropriately continued to n → 0. The required nth power of equation (6) is in our case\n\n⟨Z^n⟩_D = ∫ Π_{a=1}^{n} df^a ⟨exp(−(1/2) Σ_a f^{aᵀ}C⁻¹f^a − (1/2σ²) Σ_{i,a} n_i (f_i^a)² − (λ/2) Σ_{i,a} (f_i^a)²)⟩_D   (17)\n\nThe dataset average, over n_i ∼ Poisson(ν), then gives\n\n⟨Z^n⟩_D = ∫ Π_{a=1}^{n} df^a exp(−(1/2) Σ_a f^{aᵀ}C⁻¹f^a + ν Σ_i (e^{−Σ_a (f_i^a)²/2σ²} − 1) − (λ/2) Σ_{i,a} (f_i^a)²)   (18)\n\nIf one now wants to proceed without explicitly exploiting the sparse graph structure, one has to approximate the exponential term in the exponent. Opper does this using a variational approximation for the distribution of the f^a, of Gaussian form, and this eventually leads to the approximation (4) for the learning curve. 
This approach is evidently justified for large σ², where a Taylor expansion of the exponential term in (18) can be truncated after the quadratic term. For small noise levels, on the other hand, the Gaussian variational approach clearly does not capture all the details of the fluctuations in the numbers of examples n_i. By comparison, using the cavity method we are able to retain the average over D explicitly, without the need to approximate the distribution of the n_i. The result is that the section of the learning curve where fluctuations in the numbers of examples play a large role is captured accurately, while the Gaussian variational (eigenvalue) approach can give wildly inaccurate results there.\n\n5 Conclusions and further work\n\nIn this paper we have studied the learning curves of GP regression on large random graphs. In a significant advance on the work of [12], we showed that the approximations for learning curves proposed by Sollich [4] and Opper [7] for continuous input spaces can be greatly improved upon in the graph case, by using the cavity method. We argued that the resulting predictions should in fact become exact in the limit of large random graphs.\nSection 3 derived the learning curve approximation using the cavity method for arbitrary degree distributions. We defined a generating function Z (equation (6)) from which the generalisation error ε can be obtained by differentiation. We then rewrote this using Fourier transforms (equation (7)) and introduced additional variables (equation (9)) to get Z into the required form for a cavity approach: the partition function of a complex-valued Gaussian graphical model. By standard arguments we then derived the cavity update equations for a fixed graph (equation (12)). Finally we generalised from these to graph ensembles (equation (13)), taking the limit of large graph size. 
The resulting prediction for the generalisation error (equation (16)) has an intuitively appealing interpretation, where each node in the graph learns subject to an effective (and data-dependent) Gaussian prior provided by its neighbours.\nIn section 4 we compared our new prediction to the eigenvalue approximation results in [12]. We showed that our new method is far more accurate in the challenging midsection of the learning curves than the eigenvalue version, both for random regular and Poisson graph ensembles (figure 1).\nSubsection 4.1 discussed why the older approximation, derived from a replica perspective in [7], is inaccurate compared to the cavity method. To retain tractable averages in continuous input spaces, it has to approximate the fluctuations in the number of examples for each node in the dataset, resulting in the inaccurate predictions seen in figure 1. On graphs one is able to perform this average explicitly when calculating the cavity updates and the resulting Bayes error, giving a far more accurate prediction of the learning curves.\nAlthough the learning curves predicted using the cavity method cover a broad range of graph ensembles, because they apply for arbitrary p(d), there do remain some interesting types of graph ensembles (for instance graphs with community structure) that cannot be generated by imposing only the degree distribution. Indeed, an important assumption in the current work is that small loops are rare, whilst in community graphs, where nodes exhibit preferential attachment, there can be many small loops. We are in the process of analysing GP learning on such graphs using the approach of Rogers et al. [27], where community graphs are modelled as having a sparse superstructure joining clusters of densely connected nodes.\nFollowing previous studies [12], we have in this paper set the scale of the covariance function by normalising the average prior covariance over all nodes. 
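This normalisation fixes only the average of the local prior variances. A quick numerical check (our own illustration, assuming NumPy; isolated nodes are simply dropped here) on a sparse Poisson graph:

```python
import numpy as np

rng = np.random.default_rng(0)
V, c, p, a = 500, 3.0, 10, 2.0

# Sparse Poisson (Erdos-Renyi) graph: each edge present with probability c/V.
upper = np.triu(rng.random((V, V)) < c / V, 1)
A = (upper | upper.T).astype(float)
deg = A.sum(axis=1)
keep = deg > 0                          # drop isolated nodes for simplicity
A, deg = A[np.ix_(keep, keep)], deg[keep]

# Random walk kernel (2), normalised so the *average* prior variance is 1.
Dm = np.diag(deg ** -0.5)
C = np.linalg.matrix_power(
    (1 - 1 / a) * np.eye(len(A)) + (1 / a) * (Dm @ A @ Dm), p)
C /= np.diag(C).mean()

# The average is pinned to 1, but individual C_ii still scatter around it.
spread = np.diag(C).std()
```

The non-zero spread of the C_ii is exactly the effect discussed next; rescaling C → SCS with S = diag(C_ii^{−1/2}) would force all C_ii = 1.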
For the Poisson graph case our learning curve simulations then show, however, that there can be large variations in the local prior variances C_ii, while from the Bayesian modelling point of view it would seem more plausible to use covariance functions where all C_ii = 1. This could be achieved by pre- and post-multiplying the random walk covariance matrix by an appropriate diagonal matrix. We hope to study this modified covariance function in future, and to extend the cavity prediction for the learning curves to this case.\nIt would also be interesting to extend our approach to model mismatch, where we assume the data-generating process is a GP with hyperparameters that differ from those of the GP being used for inference. This was studied for continuous input spaces in [10]; equally interesting would be a study of mismatch with a fixed target function, as analysed by Opper et al. [8]. It should further be useful to study the case of mismatched graphs, rather than hyperparameters. This is relevant because frequently in real-world learning one will have only partial knowledge of the graph structure, for instance in metabolic networks when not all of the pathways have been discovered, or social networks where friendships are continuously being made and broken.\nAnother interesting avenue for further research would be to look at multiple output (multi-task) GPs on graphs, to see if the work of Chai [28] can be extended to this scenario. One would hope that, as seen with the learning curves for single output GPs in this paper, input domains defined by graphs might allow simplifications in the analysis and provide more accurate bounds or even exact predictions.\nFinally, it would be worth extending the study of graph mismatch to the case of evolving graphs and functions. 
Here spatio-temporal GP regression could be employed to predict functions changing over time, perhaps including a model-based approach as in [29] to account for the evolving graph structure.

References

[1] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). MIT Press, December 2005.

[2] Shun-ichi Amari, Naotake Fujita, and Shigeru Shinomoto. Four types of learning curves. Neural Computation, 4(4):605-618, 1992.

[3] M. Opper. Regression with Gaussian processes: Average case performance. In Theoretical Aspects of Neural Computation: A Multidisciplinary Perspective, pages 17-23. Springer-Verlag, 1997.

[4] P. Sollich. Learning curves for Gaussian processes. In Advances in Neural Information Processing Systems 11, pages 344-350. MIT Press, 1999.

[5] F. Vivarelli and M. Opper. General bounds on Bayes errors for regression with Gaussian processes. In Advances in Neural Information Processing Systems 11, pages 302-308. MIT Press, 1999.

[6] C. K. I. Williams and F. Vivarelli. Upper and lower bounds on the learning curve for Gaussian processes. Machine Learning, 40(1):77-102, 2000.

[7] M. Opper and D. Malzahn. Learning curves for Gaussian processes regression: A framework for good approximations. In Advances in Neural Information Processing Systems 14, pages 273-279. MIT Press, 2001.

[8] M. Opper and D. Malzahn. A variational approach to learning curves. In Advances in Neural Information Processing Systems 14, pages 463-469. MIT Press, 2002.

[9] P. Sollich and A. Halees. Learning curves for Gaussian process regression: Approximations and bounds. Neural Computation, 14(6):1393-1428, 2002.

[10] P. Sollich. Gaussian process regression with mismatched models. In Advances in Neural Information Processing Systems 14, pages 519-526. MIT Press, 2002.

[11] P. Sollich.
Can Gaussian process regression be made robust against model mismatch? In J. Winkler, M. Niranjan, and N. Lawrence, editors, Deterministic and Statistical Methods in Machine Learning, pages 211-228, Berlin, 2005. Springer.

[12] P. Sollich, M. J. Urry, and C. Coti. Kernels and learning curves for Gaussian process regression on random graphs. In Advances in Neural Information Processing Systems 22, pages 1723-1731. Curran Associates, Inc., 2009.

[13] M. Herbster, M. Pontil, and L. Wainer. Online learning over graphs. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pages 305-312, New York, NY, USA, 2005. ACM.

[14] M. Herbster and M. Pontil. Prediction on a graph with a perceptron. In Advances in Neural Information Processing Systems 19, pages 577-584. MIT Press, 2007.

[15] M. Herbster. Exploiting cluster-structure to predict the labeling of a graph. In Proceedings of the 19th International Conference on Algorithmic Learning Theory, pages 54-69. Springer, 2008.

[16] M. Belkin, I. Matveeva, and P. Niyogi. Regularization and semi-supervised learning on large graphs. Learning Theory, 3120:624-638, 2004.

[17] Tim Rogers, Koujin Takeda, Isaac Pérez Castillo, and Reimer Kühn. Cavity approach to the spectral density of sparse symmetric random matrices. Physical Review E, 78(3):031116, 2008.

[18] M. Mézard, G. Parisi, and M. A. Virasoro. Random free energies in spin glasses. Journal de Physique Lettres, 46(6):217-222, 1985.

[19] M. T. Farrell and A. Correa. Gaussian process regression models for predicting stock trends. Relation, 10:3414, 2007.

[20] B. Ferris, D. Haehnel, and D. Fox. Gaussian processes for signal strength-based location estimation. In Proceedings of Robotics: Science and Systems, Philadelphia, USA, August 2006.

[21] Sunho Park and Seungjin Choi.
Gaussian process regression for voice activity detection and speech enhancement. In International Joint Conference on Neural Networks, pages 2879-2882, Hong Kong, China, 2008. Institute of Electrical and Electronics Engineers (IEEE).

[22] A. J. Smola and R. Kondor. Kernels and regularization on graphs. In M. Warmuth and B. Schölkopf, editors, Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop (COLT), pages 144-158, Heidelberg, 2003. Springer.

[23] M. Opper and D. Saad. Advanced Mean Field Methods: Theory and Practice. MIT Press, 2001.

[24] Reimer Kühn. Finitely coordinated models for low-temperature phases of amorphous systems. Journal of Physics A, 40(31):9227, 2007.

[25] M. Mézard and G. Parisi. The Bethe lattice spin glass revisited. The European Physical Journal B, 20(2):217-233, 2001.

[26] P. Erdős and A. Rényi. On random graphs, I. Publicationes Mathematicae (Debrecen), 6:290-297, 1959.

[27] Tim Rogers, Conrad Pérez Vicente, Koujin Takeda, and Isaac Pérez Castillo. Spectral density of random graphs with topological constraints. Journal of Physics A, 43(19):195002, 2010.

[28] Kian Ming Chai. Generalization errors and learning curves for regression with multi-task Gaussian processes. In Advances in Neural Information Processing Systems 22, pages 279-287. Curran Associates, Inc., 2009.

[29] M. Alvarez, D. Luengo, and N. D. Lawrence. Latent force models. In D. van Dyk and M. Welling, editors, Proceedings of the Twelfth International Workshop on Artificial Intelligence and Statistics, pages 9-16, Clearwater Beach, FL, USA, 2009. MIT Press.