{"title": "Kernels and learning curves for Gaussian process regression on random graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 1723, "page_last": 1731, "abstract": "We investigate how well Gaussian process regression can learn functions defined on graphs, using large regular random graphs as a paradigmatic example. Random-walk based kernels are shown to have some surprising properties: within the standard approximation of a locally tree-like graph structure, the kernel does not become constant, i.e.neighbouring function values do not become fully correlated, when the lengthscale $\\sigma$ of the kernel is made large. Instead the kernel attains a non-trivial limiting form, which we calculate. The fully correlated limit is reached only once loops become relevant, and we estimate where the crossover to this regime occurs. Our main subject are learning curves of Bayes error versus training set size. We show that these are qualitatively well predicted by a simple approximation using only the spectrum of a large tree as input, and generically scale with $n/V$, the number of training examples per vertex. We also explore how this behaviour changes once kernel lengthscales are large enough for loops to become important.", "full_text": "Kernels and learning curves for Gaussian process\n\nregression on random graphs\n\nPeter Sollich, Matthew J Urry\n\nKing\u2019s College London, Department of Mathematics\n{peter.sollich,matthew.urry}@kcl.ac.uk\n\nLondon WC2R 2LS, U.K.\n\nINRIA Saclay \u02c6Ile de France, F-91893 Orsay, France\n\nCamille Coti\n\nAbstract\n\nWe investigate how well Gaussian process regression can learn functions de-\n\ufb01ned on graphs, using large regular random graphs as a paradigmatic example.\nRandom-walk based kernels are shown to have some non-trivial properties: within\nthe standard approximation of a locally tree-like graph structure, the kernel does\nnot become constant, i.e. 
neighbouring function values do not become fully cor-\nrelated, when the lengthscale \u03c3 of the kernel is made large. Instead the kernel\nattains a non-trivial limiting form, which we calculate. The fully correlated limit\nis reached only once loops become relevant, and we estimate where the crossover\nto this regime occurs. Our main subject are learning curves of Bayes error versus\ntraining set size. We show that these are qualitatively well predicted by a simple\napproximation using only the spectrum of a large tree as input, and generically\nscale with n/V , the number of training examples per vertex. We also explore how\nthis behaviour changes for kernel lengthscales that are large enough for loops to\nbecome important.\n\n1 Motivation and Outline\n\nGaussian processes (GPs) have become a standard part of the machine learning toolbox [1]. Learning\ncurves are a convenient way of characterizing their capabilities: they give the generalization error\n\u0001 as a function of the number of training examples n, averaged over all datasets of size n under\nappropriate assumptions about the process generating the data. We focus here on the case of GP\nregression, where a real-valued output function f(x) is to be learned. The general behaviour of GP\nlearning curves is then relatively well understood for the scenario where the inputs x come from\na continuous space, typically Rn [2, 3, 4, 5, 6, 7, 8, 9, 10]. 
For large n, the learning curves then typically decay as a power law ε ∝ n^{−α} with an exponent α ≤ 1 that depends on the dimensionality of the input space as well as the smoothness properties of the function f(x) as encoded in the covariance function.

But there are many interesting application domains that involve discrete input spaces, where x could be a string, an amino acid sequence (with f(x) some measure of secondary structure or biological function), a research paper (with f(x) related to impact), a web page (with f(x) giving a score used to rank pages), etc. In many such situations, similarity between different inputs – which will govern our prior beliefs about how closely related the corresponding function values are – can be represented by edges in a graph. One would then like to know how well GP regression can work in such problem domains; see also [11] for a related online regression algorithm. We study this problem here theoretically by focussing on the paradigmatic example of random regular graphs, where every node has the same connectivity.

Sec. 2 discusses the properties of random-walk inspired kernels [12] on such random graphs. These are analogous to the standard radial basis function kernels exp[−(x − x′)²/(2σ²)], but we find that they have surprising properties on large graphs. In particular, while loops in large random graphs are long and can be neglected for many purposes by approximating the graph structure as locally tree-like, here this approximation leads to a non-trivial limiting form of the kernel for σ → ∞ that is not constant. The fully correlated limit, where the kernel is constant, is obtained only because of the presence of loops, and we estimate when the crossover to this regime takes place.

In Sec. 3 we move on to the learning curves themselves.
A simple approximation based on the graph eigenvalues, using only the known spectrum of a large tree as input, works well qualitatively and predicts the exact asymptotics for large numbers of training examples. When the kernel lengthscale is not too large, below the crossover discussed in Sec. 2 for the covariance kernel, the learning curves depend on the number of examples per vertex. We also explore how this behaviour changes as the kernel lengthscale is made larger. Sec. 4 summarizes the results and discusses some open questions.

2 Kernels on graphs and trees

We assume that we are trying to learn a function defined on the vertices of a graph. Vertices are labelled by i = 1 . . . V, instead of the generic input label x we used in the introduction, and the associated function values are denoted f_i ∈ R. By taking the prior P(f) over these functions f = (f_1, . . . , f_V) as a (zero mean) Gaussian process we are saying that P(f) ∝ exp(−½ f^T C^{−1} f). The covariance function or kernel C is then, in our graph setting, just a positive definite V × V matrix.

The graph structure is characterized by a V × V adjacency matrix, with A_{ij} = 1 if nodes i and j are connected by an edge, and 0 otherwise. All links are assumed to be undirected, so that A_{ij} = A_{ji}, and there are no self-loops (A_{ii} = 0). The degree of each node is then defined as d_i = Σ_{j=1}^V A_{ij}.

The covariance kernels we discuss in this paper are the natural generalizations of the squared-exponential kernel in Euclidean space [12]. They can be expressed in terms of the normalized graph Laplacian, defined as L = 1 − D^{−1/2} A D^{−1/2}, where D is a diagonal matrix with entries d_1, . . . , d_V and 1 is the V × V identity matrix. An advantage of L over the unnormalized Laplacian D − A, which was used in the earlier paper [13], is that the eigenvalues of L (again a V × V matrix) lie in the interval [0, 2] (see e.g.
[14]).

From the graph Laplacian, the covariance kernels we consider here are constructed as follows. The p-step random walk kernel is (for a ≥ 2)

C ∝ (1 − a^{−1} L)^p = [(1 − a^{−1}) 1 + a^{−1} D^{−1/2} A D^{−1/2}]^p    (1)

while the diffusion kernel is given by

C ∝ exp(−½ σ² L) ∝ exp(½ σ² D^{−1/2} A D^{−1/2})    (2)

We will always normalize these so that (1/V) Σ_i C_{ii} = 1, which corresponds to setting the average (over vertices) prior variance of the function to be learned to unity.

To see the connection of the above kernels to random walks, assume we have a walker on the graph who at each time step selects randomly one of the neighbouring vertices and moves to it. The probability for a move from vertex j to i is then A_{ij}/d_j. The transition matrix after s steps follows as (AD^{−1})^s: its ij-element gives the probability of being on vertex i, having started at j. We can now compare this with the p-step kernel by expanding the p-th power in (1):

C ∝ Σ_{s=0}^p (p choose s) a^{−s} (1 − a^{−1})^{p−s} (D^{−1/2} A D^{−1/2})^s = D^{−1/2} [ Σ_{s=0}^p (p choose s) a^{−s} (1 − a^{−1})^{p−s} (AD^{−1})^s ] D^{1/2}    (3)

Thus C is essentially a random walk transition matrix, averaged over the number of steps s with

s ∼ Binomial(p, 1/a)    (4)

Figure 1: (Left) Random walk kernel C_{ℓ,p} plotted vs distance ℓ along graph, for increasing number of steps p and a = 2, d = 3. Note the convergence to a limiting shape for large p that is not the naive fully correlated limit C_{ℓ,p→∞} = 1. (Right) Numerical results for average covariance K_1 between neighbouring nodes, averaged over neighbours and over randomly generated regular graphs.

This shows that 1/a can be interpreted as the probability of actually taking a step at each of p “attempts”.
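In code, building the p-step kernel (1) with the normalization (1/V) Σ_i C_{ii} = 1 is only a few lines; a minimal numpy sketch (the complete-graph example at the end is ours, purely for illustration):

```python
import numpy as np

def p_step_kernel(A, p, a=2.0):
    """p-step random walk kernel of Eq. (1), C ∝ (1 - L/a)^p,
    normalized so that the average prior variance (1/V) tr C is 1."""
    V = len(A)
    d = A.sum(axis=1)                       # vertex degrees d_i
    Dmh = np.diag(1.0 / np.sqrt(d))         # D^{-1/2}
    L = np.eye(V) - Dmh @ A @ Dmh           # normalized graph Laplacian
    C = np.linalg.matrix_power(np.eye(V) - L / a, p)
    return C * V / np.trace(C)              # enforce (1/V) sum_i C_ii = 1

# toy check on the complete graph K4 (which is 3-regular)
A = np.ones((4, 4)) - np.eye(4)
C = p_step_kernel(A, p=5)
```

For a ≥ 2 all eigenvalues of 1 − L/a lie in [0, 1], so the resulting C is automatically positive semi-definite, as a covariance must be.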
To obtain the actual C the resulting averaged transition matrix is premultiplied by\nD\u22121/2 and postmultiplied by D1/2, which ensures that the kernel C is symmetric. For the diffusion\nkernel, one \ufb01nds an analogous result but the number of random walk steps is now distributed as\ns \u223c Poisson(\u03c32/2). This implies in particular that the diffusion kernel is the limit of the p-step\nkernel for p, a \u2192 \u221e at constant p/a = \u03c32/2. Accordingly, we discuss mainly the p-step kernel in\nthis paper because results for the diffusion kernel can be retrieved as limiting cases.\nIn the limit of a large number of steps s, the random walk on a graph will reach its stationary distribu-\ntion p\u221e \u221d De where e = (1, . . . , 1). (This form of p\u221e can be veri\ufb01ed by checking that it remains\nunchanged after multiplication with the transition matrix AD\u22121.) The s-step transition matrix for\nlarge s is then p\u221eeT = DeeT because we converge from any starting vertex to the stationary dis-\ntribution. It follows that for large p or \u03c32 the covariance kernel becomes C \u221d D1/2eeTD1/2, i.e.\nCij \u221d (didj)1/2. This is consistent with the interpretation of \u03c3 or (p/a)1/2 as a lengthscale over\nwhich the random walk can diffuse along the graph: once this lengthscale becomes large, the covari-\nance kernel Cij is essentially independent of the distance (along the graph) between the vertices i\nand j, and the function f becomes fully correlated across the graph. (Explicitly f = vD1/2e under\nthe prior, with v a single Gaussian random variable.) As we next show, however, the approach to\nthis fully correlated limit as p or \u03c3 are increased is non-trivial.\nWe focus in this paper on kernels on random regular graphs. This means we consider adjacency\nmatrices A which are regular in the sense that they give for each vertex the same degree, di = d. 
A uniform probability distribution is then taken across all A that obey this constraint [15]. What will the above kernels look like on typical samples drawn from this distribution? Such random regular graphs will have long loops, of length of order ln(V) or larger if V is large. Their local structure is then that of a regular tree of degree d, which suggests that it should be possible to calculate the kernel accurately within a tree approximation. In a regular tree all nodes are equivalent, so the kernel can only depend on the distance ℓ between two nodes i and j. Denoting this kernel value C_{ℓ,p} for a p-step random walk kernel, one has then C_{ℓ,p=0} = δ_{ℓ,0} and

γ_{p+1} C_{0,p+1} = (1 − 1/a) C_{0,p} + (1/a) C_{1,p}    (5)

γ_{p+1} C_{ℓ,p+1} = (1/(ad)) C_{ℓ−1,p} + (1 − 1/a) C_{ℓ,p} + ((d − 1)/(ad)) C_{ℓ+1,p}   for ℓ ≥ 1    (6)

where γ_p is chosen to achieve the desired normalization C_{0,p} = 1 of the prior variance for every p. Fig. 1(left) shows results obtained by iterating this recursion numerically, for a regular graph (in the tree approximation) with degree d = 3, and a = 2. As expected the kernel becomes more long-ranged initially as p increases, but eventually it is seen to approach a non-trivial limiting form. This can be calculated as

C_{ℓ,p→∞} = [1 + ℓ(d − 2)/d] (d − 1)^{−ℓ/2}    (7)

and is also plotted in the figure, showing good agreement with the numerical iteration. (The prefactor must involve (d − 2)/d rather than (d − 1)/d: it is fixed by the boundary condition (5) at the steady state of the recursion, and only with (d − 2)/d does C_{1,p→∞} = 2(d − 1)^{1/2}/d stay below the Cauchy–Schwarz bound C_{ℓ} ≤ 1.) There are (at least) two ways of obtaining the result (7).
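The recursion (5)-(6) is straightforward to iterate directly; a minimal numpy sketch (parameters d = 3, a = 2 as in Fig. 1(left); the truncation at ℓ = p is exact because C_{ℓ,p} vanishes for ℓ > p):

```python
import numpy as np

def tree_kernel(p, d=3, a=2.0):
    """Iterate the tree recursion (5)-(6) for C_{l,p}, renormalizing to
    C_{0,p} = 1 at every step (which is what the factors gamma_p do)."""
    C = np.zeros(p + 2)          # C_{l,p} = 0 for l > p, so this is exact
    C[0] = 1.0                   # C_{l,0} = delta_{l,0}
    for _ in range(p):
        new = np.empty_like(C)
        new[-1] = 0.0
        new[0] = (1 - 1/a) * C[0] + C[1] / a
        l = np.arange(1, p + 1)
        new[l] = C[l-1]/(a*d) + (1 - 1/a)*C[l] + (d-1)/(a*d)*C[l+1]
        C = new / new[0]         # normalization C_{0,p} = 1
    return C

def limit_form(l, d=3):
    """Steady-state shape [1 + l(d-2)/d](d-1)^{-l/2} of the recursion."""
    return (1 + l * (d - 2) / d) * (d - 1.0) ** (-l / 2)

C = tree_kernel(p=3000)
```

One can check algebraically that limit_form is an exact fixed point of (5)-(6) with γ = 1 − 1/a + 2(d − 1)^{1/2}/(ad), and numerically that the iterated kernel approaches it at small ℓ for large p.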
One is to take the limit σ → ∞ of the integral representation of the diffusion kernel on regular trees given in [16] (which is also quoted in [13] but with a typographical error that effectively removes the factor (d − 1)^{−ℓ/2}). Another route is to find the steady state of the recursion for C_{ℓ,p}. This is easy to do but requires as input the unknown steady state value of γ_p. To determine this, one can map from C_{ℓ,p} to the total random walk probability S_{ℓ,p} in each “shell” of vertices at distance ℓ from the starting vertex, changing variables to S_{0,p} = C_{0,p} and S_{ℓ,p} = d(d − 1)^{ℓ−1} C_{ℓ,p} (ℓ ≥ 1). Omitting the factors γ_p, this results in a recursion for S_{ℓ,p} that simply describes a biased random walk on ℓ = 0, 1, 2, . . ., with a probability of 1 − 1/a of remaining at the current ℓ, probability 1/(ad) of moving to the left and probability (d − 1)/(ad) of moving to the right. The point ℓ = 0 is a reflecting barrier where only moves to the right are allowed, with probability 1/a. The time evolution of this random walk starting from ℓ = 0 can now be analysed as in [17]. As expected from the balance of moves to the left and right, S_{ℓ,p} for large p is peaked around the average position of the walk, ℓ = p(d − 2)/(ad).
For ℓ smaller than this, S_{ℓ,p} has a tail behaving as ∝ (d − 1)^{ℓ/2}, and converting back to C_{ℓ,p} gives the large-ℓ scaling of C_{ℓ,p→∞} ∝ (d − 1)^{−ℓ/2}; this in turn fixes the value of γ_{p→∞} and so eventually gives (7).

The above analysis shows that for large p the random walk kernel, calculated in the absence of loops, does not approach the expected fully correlated limit; given that all vertices have the same degree, the latter would correspond to C_{ℓ,p→∞} = 1. This implies, conversely, that the fully correlated limit is reached only because of the presence of loops in the graph. It is then interesting to ask at what point, as p is increased, the tree approximation for the kernel breaks down. To estimate this, we note that a regular tree of depth ℓ has V = 1 + d[(d − 1)^ℓ − 1]/(d − 2) nodes, i.e. of order (d − 1)^ℓ. So a regular graph can be tree-like at most out to ℓ ≈ ln(V)/ln(d − 1). Comparing with the typical number of steps our random walk takes, which is p/a from (4), we then expect loop effects to appear in the covariance kernel when

p/a ≈ ln(V)/ln(d − 1)    (8)

To check this prediction, we measure the analogue of C_{1,p} on randomly generated [15] regular graphs. Because of the presence of loops, the local kernel values are not all identical, so the appropriate estimate of what would be C_{1,p} on a tree is K_1 = C_{ij}/√(C_{ii} C_{jj}) for neighbouring nodes i and j. Averaging over all pairs of such neighbours, and then over a number of randomly generated graphs, we find the results in Fig. 1(right). The results for K_1 (symbols) accurately track the tree predictions (lines) for small p/a, and start to deviate just around the values of p/a expected from (8), as marked by the arrow.
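The effect of loops on K_1 can be illustrated on any regular graph that has loops; a minimal sketch, using a deterministic 3-regular graph (a ring plus antipodal chords, our stand-in for the random regular graphs of [15]):

```python
import numpy as np

# 3-regular graph with loops: ring of V vertices plus antipodal chords.
V = 50
A = np.zeros((V, V))
for i in range(V):
    for j in (i - 1, i + 1, i + V // 2):
        A[i, j % V] = 1.0

def K1(p, a=2.0):
    """Average C_ij / sqrt(C_ii C_jj) over neighbouring pairs (i, j)."""
    w, U = np.linalg.eigh(A / 3.0)          # for regular graphs L = 1 - A/d
    lamL = 1.0 - w                          # Laplacian eigenvalues
    C = (U * (1.0 - lamL / a) ** p) @ U.T   # p-step kernel, Eq. (1)
    C *= V / np.trace(C)                    # normalization (1/V) tr C = 1
    i, j = np.nonzero(A)
    return np.mean(C[i, j] / np.sqrt(C[i, i] * C[j, j]))
```

For small p, K_1 stays well below 1, close to the tree behaviour (which saturates at 2(d − 1)^{1/2}/d ≈ 0.943 for d = 3, by iterating (5)-(6)); once p/a is large enough for the kernel to feel the loops, K_1 approaches the fully correlated limit 1.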
The deviations manifest themselves in larger values of K_1, which eventually – now that p/a is large enough for the kernel to “notice” the loops – approach the fully correlated limit K_1 = 1.

3 Learning curves

We now turn to the analysis of learning curves for GP regression on random regular graphs. We assume that the target function f∗ is drawn from a GP prior with a p-step random walk covariance kernel C. Training examples are input-output pairs (i_µ, f∗_{i_µ} + ξ_µ) where ξ_µ is i.i.d. Gaussian noise of variance σ²; the distribution of training inputs i_µ is taken to be uniform across vertices. Inference from a data set D of n such examples µ = 1, . . . , n takes place using the prior defined by C and a Gaussian likelihood with noise variance σ². We thus assume an inference model that is matched to the data generating process. This is obviously an over-simplification but is appropriate for the present first exploration of learning curves on random graphs.
We emphasize that as n is increased we see more and more function values from the same graph, which is fixed by the problem domain; the graph does not grow.

The generalization error ε is the squared difference between the estimated function f̂_i and the target f∗_i, averaged across the (uniform) input distribution, the posterior distribution of f∗ given D, the distribution of datasets D, and finally – in our non-Euclidean setting – the random graph ensemble. Given the assumption of a matched inference model, this is just the average Bayes error, or the average posterior variance, which can be expressed explicitly as [1]

ε(n) = V^{−1} Σ_i ⟨C_{ii} − k(i)^T K^{−1} k(i)⟩_{D,graphs}    (9)

where the average is over data sets and over graphs, K is an n × n matrix with elements K_{µµ′} = C_{i_µ,i_µ′} + σ² δ_{µµ′} and k(i) is a vector with entries k_µ(i) = C_{i,i_µ}. The resulting learning curve depends, in addition to n, on the graph structure as determined by V and d, and the kernel and noise level as specified by p, a and σ².
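Eq. (9) translates directly into a Monte Carlo estimate for a single graph; a self-contained numpy sketch (the deterministic 3-regular ring-plus-chords graph is our illustrative substitute for averaging over random regular graphs):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_step_kernel(A, p, a=2.0):
    V = len(A)
    L = np.eye(V) - A / A.sum(axis=1)[0]   # regular graph: D = d * 1
    C = np.linalg.matrix_power(np.eye(V) - L / a, p)
    return C * V / np.trace(C)

def bayes_error(C, n, sigma2, samples=40):
    """Monte Carlo estimate of the Bayes error (9) on one graph: posterior
    variance averaged over vertices and random draws of n training inputs."""
    V = len(C)
    acc = 0.0
    for _ in range(samples):
        mu = rng.integers(0, V, size=n)                  # training inputs i_mu
        K = C[np.ix_(mu, mu)] + sigma2 * np.eye(n)
        k = C[:, mu]                                     # rows are k(i)^T
        Kinv = np.linalg.inv(K)
        acc += np.mean(np.diag(C) - np.einsum('iv,vw,iw->i', k, Kinv, k))
    return acc / samples

# 3-regular test graph: ring plus antipodal chords (illustrative, not random)
V = 50
A = np.zeros((V, V))
for i in range(V):
    for j in (i - 1, i + 1, i + V // 2):
        A[i, j % V] = 1.0
C = p_step_kernel(A, p=10)
```

Since the prior variance is normalized to 1, the estimated error starts near 1 at small n and decreases as n grows.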
We fix d = 3 throughout to avoid having too many parameters to vary, although similar results are obtained for larger d.

Exact prediction of learning curves by analytical calculation is very difficult due to the complicated way in which the random selection of training inputs enters the matrix K and vector k in (9). However, by first expressing these quantities in terms of kernel eigenvalues (see below) and then approximating the average over datasets, one can derive the approximation [3, 6]

ε = g(n/(ε + σ²)),   g(h) = Σ_{α=1}^V (λ_α^{−1} + h)^{−1}    (10)

This equation for ε has to be solved self-consistently because ε also appears on the r.h.s. In the Euclidean case the resulting predictions approximate the true learning curves quite reliably. The derivation of (10) for inputs on a fixed graph is unchanged from [3], provided the kernel eigenvalues λ_α appearing in the function g(h) are defined appropriately, by the eigenfunction condition ⟨C_{ij} φ_j⟩ = λ φ_i; the average here is over the input distribution, i.e. ⟨. . .⟩ = V^{−1} Σ_j. From the definition (1) of the p-step kernel, we see that then λ_α = κV^{−1}(1 − λ^L_α/a)^p in terms of the corresponding eigenvalue λ^L_α of the graph Laplacian L. The constant κ has to be chosen to enforce our normalization convention Σ_α λ_α = ⟨C_{jj}⟩ = 1.

Fortunately, for large V the spectrum of the Laplacian of a random regular graph can be approximated by that of the corresponding large regular tree, which has spectral density [14]

ρ(λ^L) = [4(d − 1)/d² − (λ^L − 1)²]^{1/2} / [(2π/d) λ^L (2 − λ^L)]    (11)

in the range λ^L ∈ [λ^L_−, λ^L_+], λ^L_± = 1 ± 2d^{−1}(d − 1)^{1/2}, where the term under the square root is positive. (There are also two isolated eigenvalues λ^L = 0, 2 but these have weight 1/V each and so can be ignored for large V.) Rewriting (10) as ε = V^{−1} Σ_α [(Vλ_α)^{−1} + (n/V)(ε + σ²)^{−1}]^{−1} and then replacing the average over kernel eigenvalues by an integral over the spectral density leads to the following prediction for the learning curve:

ε = ∫ dλ^L ρ(λ^L) [κ^{−1}(1 − λ^L/a)^{−p} + ν/(ε + σ²)]^{−1}    (12)

with κ determined from κ ∫ dλ^L ρ(λ^L)(1 − λ^L/a)^p = 1. A general consequence of the form of this result is that the learning curve depends on n and V only through the ratio ν = n/V, i.e. the number of training examples per vertex. The approximation (12) also predicts that the learning curve will have two regimes, one for small ν where ε ≫ σ² and the generalization error will be essentially independent of σ²; and another for large ν where ε ≪ σ² so that ε can be neglected on the r.h.s. and one has a fully explicit expression for ε.

We compare the above prediction in Fig. 2(left) to the results of numerical simulations of the learning curves, averaged over datasets and random regular graphs. The two regimes predicted by the approximation are clearly visible; the approximation works well inside each regime but less well in the crossover between the two. One striking observation is that the approximation seems to predict the asymptotic large-n behaviour exactly; this is distinct from the Euclidean case, where generally only the power-law of the n-dependence but not its prefactor come out accurately.
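The self-consistent equation (12) is easy to solve numerically by fixed-point iteration, discretizing the tree spectral density (11) on a grid; a minimal numpy sketch (grid size and iteration count are our illustrative choices):

```python
import numpy as np

def learning_curve(nu, d=3, a=2.0, p=10, sigma2=0.1, grid=20000, iters=500):
    """Self-consistent solution of the approximation (12), using the tree
    spectral density (11) for rho(lambda_L)."""
    half = 2.0 * np.sqrt(d - 1.0) / d
    # open interval (lambda_-, lambda_+) to avoid the endpoint zeros
    lam = np.linspace(1.0 - half, 1.0 + half, grid + 2)[1:-1]
    rho = np.sqrt(4.0*(d - 1.0)/d**2 - (lam - 1.0)**2) \
        / ((2.0*np.pi/d) * lam * (2.0 - lam))
    w = rho * (lam[1] - lam[0])              # quadrature weights
    kern = (1.0 - lam / a) ** p
    kappa = 1.0 / np.sum(w * kern)           # normalization sum_alpha lambda_alpha = 1
    eps = 1.0                                # start from the prior variance
    for _ in range(iters):                   # fixed-point iteration of (12)
        eps = np.sum(w / (1.0/(kappa*kern) + nu/(eps + sigma2)))
    return eps
```

As ν → 0 the prediction returns the prior variance 1; it decreases monotonically with ν and, for large ν, falls well below σ²·(order 1), reproducing the two regimes described above.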
To see why, we exploit\nthat for large n (where \u0001 (cid:28) \u03c32) the approximation (9) effectively neglects \ufb02uctuations in the training\ninput \u201cdensity\u201d of a randomly drawn set of training inputs [3, 6]. This is justi\ufb01ed in the graph case\nfor large \u03bd = n/V , because the number of training inputs each vertex receives, Binomial(n, 1/V ),\nhas negligible relative \ufb02uctuations away from its mean \u03bd. In the Euclidean case there is no similar\nresult, because all training inputs are different with probability one even for large n.\nFig. 2(right) illustrates that for larger a the difference in the crossover region between the true (nu-\nmerically simulated) learning curves and our approximation becomes larger. This is because the\naverage number of steps p/a of the random walk kernel then decreases: we get closer to the limit\nof uncorrelated function values (a \u2192 \u221e, Cij = \u03b4ij). In that limit and for low \u03c32 and large V the\n\n5\n\n\fFigure 2: (Left) Learning curves for GP regression on random regular graphs with degree d = 3 and\nV = 500 (small \ufb01lled circles) and V = 1000 (empty circles) vertices. Plotting generalization error\nversus \u03bd = n/V superimposes the results for both values of V , as expected from the approximation\n(12). The lines are the quantitative predictions of this approximation. Noise level as shown, kernel\nparameters a = 2, p = 10. (Right) As on the left but with V = 500 only and for larger a = 4.\n\nFigure 3: (Left) Learning curves for GP regression on random regular graphs with degree d = 3\nand V = 500, and kernel parameters a = 2, p = 20; noise level \u03c32 as shown. Circles: numerical\nsimulations; lines: approximation (12). (Right) As on the left but for much larger p = 200 and for\na single random graph, with \u03c32 = 0.1. Dotted line: naive estimate \u0001 = 1/(1 + n/\u03c32). Dashed\nline: approximation (10) using the tree spectrum and the large p-limit, see (17). 
Solid line: (10) with numerically determined graph eigenvalues λ^L_α as input.

true learning curve is ε = exp(−ν), reflecting the probability of a training input set not containing a particular vertex, while the approximation can be shown to predict ε = max{1 − ν, 0}, i.e. a decay of the error to zero at ν = 1. Plotting these two curves (not displayed here) indeed shows the same “shape” of disagreement as in Fig. 2(right), with the approximation underestimating the true generalization error.

Increasing p has the effect of making the kernel longer ranged, giving an effect opposite to that of increasing a. In line with this, larger values of p improve the accuracy of the approximation (12): see Fig. 3(left).

One may ask about the shape of the learning curves for large numbers of training examples (per vertex) ν. The roughly straight lines on the right of the log-log plots discussed so far suggest that ε ∝ 1/ν in this regime.
This is correct in the mathematical limit ν → ∞ because the graph kernel has a nonzero minimal eigenvalue λ_− = κV^{−1}(1 − λ^L_+/a)^p: for ν ≫ σ²/(Vλ_−), the square bracket in (12) can then be approximated by ν/(ε + σ²) and one gets (because also ε ≪ σ² in the asymptotic regime) ε ≈ σ²/ν.

However, once p becomes reasonably large, Vλ_− can be shown – by analysing the scaling of κ, see Appendix – to be extremely (exponentially in p) small; for the parameter values in Fig. 3(left) it is around 4 × 10⁻³⁰. The “terminal” asymptotic regime ε ≈ σ²/ν is then essentially unreachable. A more detailed analysis of (12) for large p and large (but not exponentially large) ν, as sketched in the Appendix, yields

ε ∝ (cσ²/ν) ln^{3/2}(ν/(cσ²)),   c ∝ p^{−3/2}    (13)

This shows that there are logarithmic corrections to the naive σ²/ν scaling that would apply in the true terminal regime. More intriguing is the scaling of the coefficient c with p, which implies that to reach a specified (low) generalization error one needs a number of training examples per vertex of order ν ∝ cσ² ∝ p^{−3/2}σ².
Even though the covariance kernel C_{ℓ,p} – in the same tree approximation that also went into (12) – approaches a limiting form for large p as discussed in Sec. 2, generalization performance thus continues to improve with increasing p. The explanation for this must presumably be that C_{ℓ,p} converges to the limit (7) only at fixed ℓ, while in the tail ℓ ∝ p, it continues to change.

For finite graph sizes V we know of course that loops will eventually become important as p increases, around the crossover point estimated in (8). The approximation for the learning curve in (12) should then break down. The most naive estimate beyond this point would be to say that the kernel becomes nearly fully correlated, C_{ij} ∝ (d_i d_j)^{1/2}, which in the regular case simplifies to C_{ij} = 1. With only one function value to learn, and correspondingly only one nonzero kernel eigenvalue λ_{α=1} = 1, one would predict ε = 1/(1 + n/σ²). Fig. 3(right) shows, however, that this significantly underestimates the actual generalization error, even though for this graph λ_{α=1} = 0.994 is very close to unity so that the other eigenvalues sum to no more than 0.006. An almost perfect prediction is obtained, on the other hand, from the approximation (10) with the numerically calculated values of the Laplacian – and hence kernel – eigenvalues. The presence of the small kernel eigenvalues is again seen to cause logarithmic corrections to the naive ε ∝ 1/n scaling. Using the tree spectrum as an approximation and exploiting the large-p limit, one finds indeed (see Appendix, Eq. (17)) that ε ∝ (c′σ²/n) ln^{3/2}(n/(c′σ²)) where now n enters rather than ν = n/V, c′ being a constant dependent only on p and a: informally, the function to be learned only has a finite (rather than ∝ V) number of degrees of freedom.
The approximation (17) in fact provides a qualitatively\naccurate description of the data Fig. 3(right), as the dashed line in the \ufb01gure shows. We thus have the\nsomewhat unusual situation that the tree spectrum is enough to give a good description of the learn-\ning curves even when loops are important, while (see Sec. 2) this is not so as far as the evaluation of\nthe covariance kernel itself is concerned.\n\n4 Summary and Outlook\n\nWe have studied theoretically the generalization performance of GP regression on graphs, focussing\non the paradigmatic case of random regular graphs where every vertex has the same degree d. Our\ninitial concern was with the behaviour of p-step random walk kernels on such graphs. If these are\ncalculated within the usual approximation of a locally tree-like structure, then they converge to a\nnon-trivial limiting form (7) when p \u2013 or the corresponding lengthscale \u03c3 in the closely related\ndiffusion kernel \u2013 becomes large. The limit of full correlation between all function values on the\ngraph is only reached because of the presence of loops, and we have estimated in (8) the values of\np around which the crossover to this loop-dominated regime occurs; numerical data for correlations\nof function values on neighbouring vertices support this result.\nIn the second part of the paper we concentrated on the learning curves themselves. We assumed\nthat inference is performed with the correct parameters describing the data generating process; the\ngeneralization error is then just the Bayes error. The approximation (12) gives a good qualitative\ndescription of the learning curve using only the known spectrum of a large regular tree as input. It\npredicts in particular that the key parameter that determines the generalization error is \u03bd = n/V ,\nthe number of training examples per vertex. 
We demonstrated also that the approximation is in fact more useful than in the Euclidean case because it gives exact asymptotics for the limit ν ≫ 1. Quantitatively, we found that the learning curves decay as ε ∝ σ²/ν with non-trivial logarithmic correction terms. Slower power laws ∝ ν^{−α} with α < 1, as in the Euclidean case, do not appear.

We attribute this to the fact that on a graph there is no analogue of the local roughness of a target function because there is a minimum distance (one step along the graph) between different input points. Finally we looked at the learning curves for larger p, where loops become important. These can still be predicted quite accurately by using the tree eigenvalue spectrum as an approximation, if one keeps track of the zero graph Laplacian eigenvalue which we were able to ignore previously; the approximation shows that the generalization error scales as σ²/n with again logarithmic corrections.

In future work we plan to extend our analysis to graphs that are not regular, including ones from application domains as well as artificial ones with power-law tails in the distribution of degrees d, where qualitatively new effects are to be expected. It would also be desirable to improve the predictions for the learning curve in the crossover region ε ≈ σ², which should be achievable using iterative approaches based on belief propagation that have already been shown to give accurate approximations for graph eigenvalue spectra [18]. These tools could then be further extended to study e.g. the effects of model mismatch in GP regression on random graphs, and how these are mitigated by tuning appropriate hyperparameters.

Appendix

We sketch here how to derive (13) from (12) for large p. Eq.
(12) writes ε = g(νV/(ε + σ²)) with

    g(h) = \int_{\lambda^L_-}^{\lambda^L_+} d\lambda^L \, \rho(\lambda^L) \, [\kappa^{-1}(1 - \lambda^L/a)^{-p} + hV^{-1}]^{-1}    (14)

and κ determined from the condition g(0) = 1. (This g(h) is the tree spectrum approximation to the g(h) of (10).) Turning first to g(0), the factor (1 - λ^L/a)^p decays quickly to zero as λ^L increases above λ^L_-. One can then approximate this factor according to (1 - λ^L_-/a)^p [(a - λ^L)/(a - λ^L_-)]^p ≈ (1 - λ^L_-/a)^p \exp[-(λ^L - λ^L_-)p/(a - λ^L_-)]. In the regime near λ^L_- one can also approximate the spectral density (11) by its leading square-root increase, ρ(λ^L) = r(λ^L - λ^L_-)^{1/2}, with r = (d - 1)^{1/4} d^{5/2}/[π(d - 2)²]. Switching then to a new integration variable y = (λ^L - λ^L_-)p/(a - λ^L_-) and extending the integration limit to ∞ gives

    1 = g(0) = \kappa r (1 - \lambda^L_-/a)^p [p/(a - \lambda^L_-)]^{-3/2} \int_0^\infty dy \, \sqrt{y} \, e^{-y}    (15)

and this fixes κ. Proceeding similarly for h > 0 gives

    g(h) = \kappa r (1 - \lambda^L_-/a)^p [p/(a - \lambda^L_-)]^{-3/2} F(h\kappa V^{-1}(1 - \lambda^L_-/a)^p),
    F(z) = \int_0^\infty dy \, \sqrt{y} \, (e^y + z)^{-1}    (16)

Dividing by g(0) = 1 shows that simply g(h) = F(hV^{-1}c^{-1})/F(0), where c = 1/[κ(1 - λ^L_-/a)^p] = rF(0)[p/(a - λ^L_-)]^{-3/2}, which scales as p^{-3/2}. In the asymptotic regime ε ≪ σ² we then have ε = g(νV/σ²) = F(ν/(cσ²))/F(0), and the desired result (13) follows from the large-z behaviour F(z) ≈ z^{-1} \ln^{3/2}(z).
One can proceed similarly for the regime where loops become important.
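Before turning to the loop-dominated regime, the asymptotics just used can be sanity-checked numerically. The sketch below is not part of the paper (the quadrature cutoff and step size are ad hoc choices): it approximates F(z) = ∫_0^∞ dy √y (e^y + z)^{-1} by the trapezoidal rule, recovering F(0) = √π/2 and the z^{-1} ln^{3/2}(z) tail, whose prefactor works out to 2/3 from ∫_0^{ln z} √y dy = (2/3)(ln z)^{3/2}.

```python
import math

def F(z, y_max=60.0, n=60000):
    """Trapezoidal-rule approximation of F(z) = int_0^inf dy sqrt(y)/(e^y + z).
    The integrand decays like e^{-y} once y >> ln(z), so a finite cutoff y_max
    well above ln(z) suffices for the z values used here."""
    h = y_max / n
    total = 0.0
    for i in range(n + 1):
        y = i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * math.sqrt(y) / (math.exp(y) + z)
    return total * h

# F(0) equals Gamma(3/2) = sqrt(pi)/2, the normalization in g(h) = F(...)/F(0)
print(F(0.0), math.sqrt(math.pi) / 2)

# Large-z behaviour: the ratio F(z) z / (ln z)^{3/2} approaches 2/3 from above
for z in (1e3, 1e6, 1e9):
    print(z, F(z) * z / math.log(z) ** 1.5)
```

The slow approach of the ratio to 2/3 reflects the logarithmic correction terms that carry over into the learning-curve asymptotics (13).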
Clearly the zero Laplacian eigenvalue with weight 1/V then has to be taken into account. If we assume that the remainder of the Laplacian spectrum can still be approximated by that of a tree [18], we get

    g(h) = \frac{(V + h\kappa)^{-1} + r(1 - \lambda^L_-/a)^p [p/(a - \lambda^L_-)]^{-3/2} F(h\kappa V^{-1}(1 - \lambda^L_-/a)^p)}{V^{-1} + r(1 - \lambda^L_-/a)^p [p/(a - \lambda^L_-)]^{-3/2} F(0)}    (17)

The denominator here is κ^{-1}, and the two terms are proportional respectively to the covariance kernel eigenvalue λ₁, corresponding to λ^L_1 = 0 and the constant eigenfunction, and to 1 - λ₁. Dropping the first terms in the numerator and denominator of (17) by taking V → ∞ leads back to the previous analysis, as it should. For a situation as in Fig. 3(right), on the other hand, where λ₁ is close to unity, we have κ ≈ V and so

    g(h) ≈ (1 + h)^{-1} + rV(1 - \lambda^L_-/a)^p [p/(a - \lambda^L_-)]^{-3/2} F(h(1 - \lambda^L_-/a)^p)    (18)

The second term, coming from the small kernel eigenvalues, is the more slowly decaying one because it corresponds to fine detail of the target function that needs many training examples to learn accurately. It will therefore dominate the asymptotic behaviour of the learning curve: ε = g(n/σ²) ∝ F(n/(c′σ²)) with c′ = (1 - λ^L_-/a)^{-p} independent of V. The large-n tail of the learning curve in Fig. 3(right) is consistent with this form.

References

[1] C E Rasmussen and C K I Williams. Gaussian processes for regression. In D S Touretzky, M C Mozer, and M E Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520, Cambridge, MA, 1996. MIT Press.

[2] M Opper. Regression with Gaussian processes: Average case performance.
In K-Y M Wong, I King, and D-Y Yeung, editors, Theoretical Aspects of Neural Computation: A Multidisciplinary Perspective, pages 17–23. Springer, 1997.

[3] P Sollich. Learning curves for Gaussian processes. In M S Kearns, S A Solla, and D A Cohn, editors, Advances in Neural Information Processing Systems 11, pages 344–350, Cambridge, MA, 1999. MIT Press.

[4] M Opper and F Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, Advances in Neural Information Processing Systems 11, pages 302–308, Cambridge, MA, 1999. MIT Press.

[5] C K I Williams and F Vivarelli. Upper and lower bounds on the learning curve for Gaussian processes. Mach. Learn., 40(1):77–102, 2000.

[6] D Malzahn and M Opper. Learning curves for Gaussian processes regression: A framework for good approximations. In T K Leen, T G Dietterich, and V Tresp, editors, Advances in Neural Information Processing Systems 13, pages 273–279, Cambridge, MA, 2001. MIT Press.

[7] D Malzahn and M Opper. A variational approach to learning curves. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 463–469, Cambridge, MA, 2002. MIT Press.

[8] P Sollich and A Halees. Learning curves for Gaussian process regression: approximations and bounds. Neural Comput., 14(6):1393–1428, 2002.

[9] P Sollich. Gaussian process regression with mismatched models. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 519–526, Cambridge, MA, 2002. MIT Press.

[10] P Sollich. Can Gaussian process regression be made robust against model mismatch? In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Artificial Intelligence, pages 199–210. 2005.

[11] M Herbster, M Pontil, and L Wainer.
Online learning over graphs. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pages 305–312, New York, NY, USA, 2005. ACM.

[12] A J Smola and R Kondor. Kernels and regularization on graphs. In M Warmuth and B Schölkopf, editors, Proc. Conference on Learning Theory (COLT), Lect. Notes Comp. Sci., pages 144–158. Springer, Heidelberg, 2003.

[13] R I Kondor and J D Lafferty. Diffusion kernels on graphs and other discrete input spaces. In ICML '02: Proceedings of the Nineteenth International Conference on Machine Learning, pages 315–322, San Francisco, CA, USA, 2002. Morgan Kaufmann.

[14] F R K Chung. Spectral graph theory. Number 92 in Regional Conference Series in Mathematics. American Mathematical Society, 1997.

[15] A Steger and N C Wormald. Generating random regular graphs quickly. Combinator. Probab. Comput., 8(4):377–396, 1999.

[16] F Chung and S-T Yau. Coverings, heat kernels and spanning trees. The Electronic Journal of Combinatorics, 6(1):R12, 1999.

[17] C Monthus and C Texier. Random walk on the Bethe lattice and hyperbolic Brownian motion. J. Phys. A, 29(10):2399–2409, 1996.

[18] T Rogers, I Perez Castillo, R Kuehn, and K Takeda. Cavity approach to the spectral density of sparse symmetric random matrices. Phys. Rev. E, 78(3):031116, 2008.