Computing with Finite and Infinite Networks

Ole Winther
Theoretical Physics, Lund University
Sölvegatan 14 A, S-223 62 Lund, Sweden
winther@nimis.thep.lu.se

Advances in Neural Information Processing Systems, pp. 336-342.

Abstract

Using statistical mechanics results, I calculate learning curves (average generalization error) for Gaussian processes (GPs) and Bayesian neural networks (NNs) used for regression. Applying the results to learning a teacher defined by a two-layer network, I can directly compare GP and Bayesian NN learning. I find that a GP in general requires O(d^s) training examples to learn input features of order s (d is the input dimension), whereas a NN can learn the task with a number of training examples of the order of the number of adjustable weights. Since a GP can be considered as an infinite NN, the results show that even in the Bayesian approach it is important to limit the complexity of the learning machine. The theoretical findings are confirmed in simulations with analytical GP learning and a NN mean field algorithm.

1 Introduction

Non-parametric kernel methods such as Gaussian processes (GPs) and support vector machines (SVMs) are closely related to neural networks (NNs): they may be considered as single-layer networks in a possibly infinite-dimensional feature space. Both the Bayesian GP approach and SVMs regularize the learning problem so that only a finite number of the features (dependent on the amount of data) is used.
Neal [1] has shown that Bayesian NNs converge to GPs in the limit of an infinite number of hidden units, and has furthermore argued that (1) there is no reason to believe that real-world problems should require only a 'small' number of hidden units, and (2) in the Bayesian approach there are no reasons (besides computational ones) to limit the size of the network. Williams [2] has derived kernels allowing for efficient computation with both infinite feed-forward and radial basis networks.

In this paper, I show that learning with a finite rather than an infinite network can make a profound difference, by studying the case where the task to be learned is defined by a large but finite two-layer NN. A theoretical analysis of the Bayesian approach to learning this task shows that the Bayesian student makes a learning transition from a linear model to a specialized non-linear one when the number of examples is of the order of the number of adjustable weights in the network. This effect, which is also seen in the simulations, is a consequence of the finite complexity of the network. In an infinite network, i.e. a GP, on the other hand, such a transition will not occur. It will eventually learn the task, but it requires O(d^s) training examples to learn features of order s, where d is the input dimension. Here, I focus entirely on regression. However, the basic conclusions regarding learning with kernel methods and NNs turn out to be valid more generally, e.g. for classification (unpublished results and [3]).

*http://www.thep.lu.se/tf2/staff/winther/

I consider the usual Bayesian setup of supervised learning: a training set $D_N = \{(\mathbf{x}_i, y_i)\,|\,i = 1, \ldots, N\}$ ($\mathbf{x} \in \mathbf{R}^d$ and $y \in \mathbf{R}$) is known, and the output for a new input $\mathbf{x}$ is predicted by the function $f(\mathbf{x})$, which is sampled from the prior distribution of model outputs.
I will consider both a Gaussian process prior and the prior implied by a large (but finite) two-layer network. The output noise is taken to be Gaussian, so the likelihood becomes $p(y|f(\mathbf{x})) = e^{-(y - f(\mathbf{x}))^2/2\sigma^2}/\sqrt{2\pi\sigma^2}$. The error measure is minus the log-likelihood, and the Bayes regressor (which minimizes the expected error) is the posterior mean prediction

$$\langle f(\mathbf{x}) \rangle = \frac{E_f\, f(\mathbf{x}) \prod_i p(y_i|f(\mathbf{x}_i))}{E_f \prod_i p(y_i|f(\mathbf{x}_i))} , \qquad (1)$$

where I have introduced $E_f$, $f = f(\mathbf{x}_1), \ldots, f(\mathbf{x}_N), f(\mathbf{x})$, to denote an average with respect to the model output prior.

Gaussian processes. In this case, the model output prior is by definition Gaussian,

$$p(f) = \frac{1}{\sqrt{\det 2\pi C}} \exp\left( -\frac{1}{2} f^{\mathrm{T}} C^{-1} f \right) , \qquad (2)$$

where C is the covariance matrix. The covariance matrix is computed from the kernel (covariance function) $C(\mathbf{x}, \mathbf{x}')$. Below I give an explicit example corresponding to an infinite two-layer network.

Bayesian neural networks. The output of the two-layer NN is given by $f(\mathbf{x}, \mathbf{w}, W) = \frac{1}{\sqrt{K}} \sum_k W_k \phi(\mathbf{w}_k \cdot \mathbf{x})$, where an especially convenient choice of transfer function in what follows is $\phi(z) = \int_{-z}^{z} dt\, e^{-t^2/2}/\sqrt{2\pi} = \mathrm{erf}(z/\sqrt{2})$. I consider a Bayesian framework (with fixed, known hyperparameters) with a weight prior that factorizes over hidden units, $p(\mathbf{w}, W) = \prod_k [p(\mathbf{w}_k) p(W_k)]$, and Gaussian input-to-hidden weights $\mathbf{w}_k \sim \mathcal{N}(0, \Sigma)$.

From Bayesian NNs to GPs. The prior over outputs for the Bayesian neural network is $p(f) = \int d\mathbf{w}\, dW\, p(\mathbf{w}, W) \prod_i \delta(f(\mathbf{x}_i) - f(\mathbf{x}_i, \mathbf{w}, W))$. In the infinite hidden unit limit, $K \to \infty$, when $p(W_k)$ has zero mean and finite (say unit) variance, it follows from the central limit theorem (CLT) that the prior distribution converges to a Gaussian process $f \sim \mathcal{N}(0, C)$ with kernel [1, 2]

$$C(\mathbf{x}, \mathbf{x}') = \int d\mathbf{w}\, p(\mathbf{w})\, \phi(\mathbf{w} \cdot \mathbf{x})\, \phi(\mathbf{w} \cdot \mathbf{x}') = \frac{2}{\pi} \arcsin\left( \frac{\mathbf{x}^{\mathrm{T}} \Sigma \mathbf{x}'}{\sqrt{(1 + \mathbf{x}^{\mathrm{T}} \Sigma \mathbf{x})(1 + \mathbf{x}'^{\mathrm{T}} \Sigma \mathbf{x}')}} \right) . \qquad (3)$$

The rest of the paper deals with a theoretical statistical mechanics analysis and simulations for GPs and Bayesian NNs learning tasks defined by either a NN or a GP.
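As a concrete aside (not from the paper), the closed form in eq. (3) can be checked by Monte Carlo: sample $\mathbf{w} \sim \mathcal{N}(0, \Sigma)$ and average $\phi(\mathbf{w} \cdot \mathbf{x})\,\phi(\mathbf{w} \cdot \mathbf{x}')$ with the erf-type transfer above. The sketch below assumes $\Sigma = \tau I/d$ as used later in eq. (12); the function names `net_kernel` and `mc_kernel` are my own, not the paper's.

```python
import math
import numpy as np

def net_kernel(x, xp, Sigma):
    """Infinite-network kernel, eq. (3), for phi(z) = erf(z/sqrt(2))."""
    num = x @ Sigma @ xp
    den = math.sqrt((1.0 + x @ Sigma @ x) * (1.0 + xp @ Sigma @ xp))
    return (2.0 / math.pi) * math.asin(num / den)

def mc_kernel(x, xp, Sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E_w[phi(w.x) phi(w.x')] with w ~ N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    w = rng.multivariate_normal(np.zeros(len(x)), Sigma, size=n_samples)
    erf = np.vectorize(math.erf)
    phi = lambda z: erf(z / math.sqrt(2.0))  # transfer phi(z) = erf(z/sqrt(2))
    return float(np.mean(phi(w @ x) * phi(w @ xp)))

d, tau = 5, 2.0
Sigma = (tau / d) * np.eye(d)  # Sigma = tau I / d, the choice behind eq. (12)
rng = np.random.default_rng(1)
x, xp = rng.standard_normal(d), rng.standard_normal(d)
print("closed form:", net_kernel(x, xp, Sigma))
print("Monte Carlo:", mc_kernel(x, xp, Sigma))
```

With 2 x 10^5 samples the Monte Carlo error is of order 10^-3, so the two numbers should agree to about two decimal places; by Cauchy-Schwarz the arcsin argument is always inside (-1, 1).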
For the simulations, I use analytical GP learning (scaling like O(N^3)) [4] and a TAP mean field algorithm for Bayesian NNs.

2 Statistical mechanics of learning

The aim of the average-case statistical mechanics analysis is to derive learning curves, i.e. the expected generalization error as a function of the number of training examples. The generalization error of the Bayes regressor $\langle f(\mathbf{x}) \rangle$, eq. (1), is

$$\epsilon_g = \langle\langle (y - \langle f(\mathbf{x}) \rangle)^2 \rangle\rangle , \qquad (4)$$

where double brackets $\langle\langle \ldots \rangle\rangle = \int \prod_i [d\mathbf{x}_i\, dy_i\, p(\mathbf{x}_i, y_i)] \ldots$ denote an average over both training examples and the test example $(\mathbf{x}, y)$. Rather than using eq. (4) directly, $\epsilon_g$ will, as usually done, be derived from the average of the free energy $-\langle\langle \ln Z \rangle\rangle$, where the partition function is given by

$$Z = E_f\, \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left( -\frac{1}{2\sigma^2} \sum_i (y_i - f(\mathbf{x}_i))^2 \right) . \qquad (5)$$

I will not give many details of the actual calculations here, since they are beyond the scope of the paper, but only outline some of the basic assumptions.

2.1 Gaussian processes

The calculation for Gaussian processes is given in another NIPS contribution [5]. The basic assumption made is that $y - f(\mathbf{x})$ becomes Gaussian with zero mean¹ under an average over the training examples, $y - f(\mathbf{x}) \sim \mathcal{N}(0, \langle\langle (y - f(\mathbf{x}))^2 \rangle\rangle)$. This assumption can be justified by the CLT when $f(\mathbf{x})$ is a sum of many random parts contributing on the same scale. Corrections to the Gaussian assumption may also be calculated [5]. The free energy may be written in terms of a set of order parameters which are found by saddlepoint integration. Assuming that the teacher is noisy, $y = f_*(\mathbf{x}) + \eta$, $\langle\langle \eta^2 \rangle\rangle = \sigma_*^2$, the generalization error is given by the following equation, which depends upon an order parameter $v$:

$$\epsilon_g = \frac{\sigma_*^2 + \langle\langle f_*^2(\mathbf{x}) \rangle\rangle - \partial_v \left( v^2\, \tilde{E}_f \langle\langle f(\mathbf{x}) f_*(\mathbf{x}) \rangle\rangle^2 \right)}{1 + v^2\, \partial_v \tilde{E}_f \langle\langle f^2(\mathbf{x}) \rangle\rangle / N} \qquad (6)$$

$$v = \frac{N}{\sigma^2 + \tilde{E}_f \langle\langle f^2(\mathbf{x}) \rangle\rangle} , \qquad (7)$$

where the new normalized measure $\tilde{E}_f \ldots \propto E_f \exp\left( -v \langle\langle f^2(\mathbf{x}) \rangle\rangle / 2 \right) \ldots$ has been introduced.

Kernels in feature space.
By performing a Karhunen-Loève expansion, $f(\mathbf{x})$ can be written as a linear perceptron with weights $w_p$ in a possibly infinite feature space,

$$f(\mathbf{x}) = \sum_p w_p \sqrt{\lambda_p}\, \phi_p(\mathbf{x}) , \qquad (8)$$

where the features $\phi_p(\mathbf{x})$ are orthonormal eigenvectors of the covariance function with eigenvalues $\lambda_p$: $\int d\mathbf{x}\, p(\mathbf{x})\, C(\mathbf{x}', \mathbf{x})\, \phi_p(\mathbf{x}) = \lambda_p \phi_p(\mathbf{x}')$ and $\int d\mathbf{x}\, p(\mathbf{x})\, \phi_{p'}(\mathbf{x}) \phi_p(\mathbf{x}) = \delta_{pp'}$. The teacher $f_*(\mathbf{x})$ may also be expanded in terms of the features: $f_*(\mathbf{x}) = \sum_p a_p \sqrt{\lambda_p}\, \phi_p(\mathbf{x})$. Using the orthonormality, the averages may be found: $\langle\langle f^2(\mathbf{x}) \rangle\rangle = \sum_p \lambda_p w_p^2$, $\langle\langle f(\mathbf{x}) f_*(\mathbf{x}) \rangle\rangle = \sum_p \lambda_p w_p a_p$ and $\langle\langle f_*^2(\mathbf{x}) \rangle\rangle = \sum_p \lambda_p a_p^2$. For a Gaussian process prior, the prior over the weights is a spherical Gaussian, $\mathbf{w} \sim \mathcal{N}(0, I)$. Averaging over $\mathbf{w}$, the saddlepoint equations can be written in terms of the number of examples N, the noise levels $\sigma^2$ and $\sigma_*^2$, the eigenvalues of the covariance function $\lambda_p$ and the teacher projections $a_p$:

$$\epsilon_g = \left( \sigma_*^2 + \sum_p \frac{\lambda_p a_p^2}{(1 + v\lambda_p)^2} \right) \left( 1 - \frac{1}{N} \sum_p \frac{v^2 \lambda_p^2}{(1 + v\lambda_p)^2} \right)^{-1} \qquad (9)$$

$$v = N \left( \sigma^2 + \sum_p \frac{\lambda_p}{1 + v\lambda_p} \right)^{-1} \qquad (10)$$

¹Generalization to non-zero mean is straightforward.

These equations are valid for a fixed teacher. However, eq. (9) may also be averaged over the distribution of teachers. In the Bayes optimal scenario, the teacher is sampled from the same prior as the student and $\sigma^2 = \sigma_*^2$. Thus $a_p \sim \mathcal{N}(0, 1)$, implying $\overline{a_p^2} = 1$, where the average over the teacher is denoted by an overline. In this case the equations reduce to the Bayes optimal result first derived by Sollich [6]: $\epsilon_g = \epsilon_g^{\mathrm{Bayes}} = N/v$.

Learning finite nets. Next, I consider the case where the teacher is the two-layer network $f_*(\mathbf{x}) = f(\mathbf{x}, \mathbf{w}, W)$ and the GP student uses the infinite net kernel, eq. (3). The average over the teacher corresponds to an average over the weight prior, and since $\overline{f_*(\mathbf{x}) f_*(\mathbf{x}')} = C(\mathbf{x}, \mathbf{x}')$, I get

$$\overline{a_p^2}\, \lambda_p = \int d\mathbf{x}\, d\mathbf{x}'\, p(\mathbf{x}) p(\mathbf{x}')\, C(\mathbf{x}, \mathbf{x}')\, \phi_p(\mathbf{x}) \phi_p(\mathbf{x}') = \lambda_p , \qquad (11)$$

where the eigenvalue equation and the orthonormality have been used. The theory therefore predicts that a GP student (with the infinite network kernel) will have the same learning curve irrespective of the number of hidden units of the NN teacher. This result is a direct consequence of the Gaussian assumption made for the average over examples. However, what is more surprising is that it is found to be a very good approximation in simulations down to K = 1, i.e. a simple perceptron with a sigmoid non-linearity.

Inner product kernels. I specialize to inner product kernels $C(\mathbf{x}, \mathbf{x}') = c(\mathbf{x} \cdot \mathbf{x}'/d)$ and consider large input dimensionality d and input components which are iid with zero mean and unit variance. The eigenvectors are products of the input components, $\phi_p(\mathbf{x}) = \prod_{m \in p} x_m$, indexed by subsets of input indices, e.g. $p = \{1, 2, 42\}$ [3]. The eigenvalues are $\lambda_p = c^{(|p|)}(0)/d^{|p|}$ with degeneracy $n_{|p|} = \binom{d}{|p|} \approx d^{|p|}/|p|!$, where $|p|$ is the cardinality (in the example above $|p| = 3$). Plugging these results into eqs. (9) and (10), it follows that to learn features that are of order s in the inputs, O(d^s) examples are needed. The same behavior has been predicted for learning in SVMs [3].

The infinite net kernel, eq. (3), reduces to an inner product covariance function for $\Sigma = \tau I/d$ ($\tau$ controls the degree of non-linearity of the rule) and large d, $\mathbf{x} \cdot \mathbf{x} \approx d$:

$$C(\mathbf{x}, \mathbf{x}') = c\left( \frac{\mathbf{x} \cdot \mathbf{x}'}{d} \right) = \frac{2}{\pi} \arcsin\left( \frac{\tau\, \mathbf{x} \cdot \mathbf{x}'}{d(1 + \tau)} \right) . \qquad (12)$$

Figure 1 shows learning curves for GPs with the infinite network kernel. The mismatch between theory and simulations is expected to be due to O(1/d) corrections to the eigenvalues $\lambda_p$. The figure clearly shows that learning of the different order features takes place on different scales. The stars on the $\epsilon_g$-axis show the theoretical prediction of the asymptotic error for N = O(d), O(d³), ...
(the teacher is an odd function).

Figure 1: Learning curves for Gaussian processes with the infinite network kernel (d = 10, τ = 10 and σ² = 0.01) for two scales of training examples, N = O(d) and N = O(d³). The full line is the theoretical prediction for the Bayes optimal GP scenario. The two other curves (almost on top of each other, as predicted by theory) are simulations for the Bayes optimal scenario (dotted line) and for a GP learning a neural network with K = 30 hidden units (dash-dotted line).

2.2 Bayesian neural networks

The limit of large but finite NNs allows for efficient computation, since the prior over functions can be approximated by a Gaussian. The hidden-to-output weights are for simplicity set to one, and we introduce the 'fields' $h_k(\mathbf{x}) = \mathbf{w}_k \cdot \mathbf{x}$ and write the output as $f(\mathbf{x}, \mathbf{w}) = f(\mathbf{h}(\mathbf{x})) = \frac{1}{\sqrt{K}} \sum_k \phi(h_k(\mathbf{x}))$, $\mathbf{h}(\mathbf{x}) = h_1(\mathbf{x}), \ldots, h_K(\mathbf{x})$. In the following, I discuss the TAP mean field algorithm used to find an approximation to the Bayes regressor, and briefly the theoretical statistical mechanics analysis for the NN task.

Mean field algorithm. The derivation sketched here is a straightforward generalization of previous results for neural networks [7]. The basic cavity assumption [7, 8] is that for large d, K and for a suitable input distribution, the predictive distribution $p(f(\mathbf{x})|D_N)$ is Gaussian:

$$p(f(\mathbf{x})|D_N) \approx \mathcal{N}\left( \langle f(\mathbf{x}) \rangle,\; \langle f^2(\mathbf{x}) \rangle - \langle f(\mathbf{x}) \rangle^2 \right) .$$

The predictive distribution for the fields $\mathbf{h}(\mathbf{x})$ is also assumed to be Gaussian,

$$p(\mathbf{h}(\mathbf{x})|D_N) \approx \mathcal{N}(\langle \mathbf{h}(\mathbf{x}) \rangle, V) ,$$

where $V = \langle \mathbf{h}(\mathbf{x}) \mathbf{h}(\mathbf{x})^{\mathrm{T}} \rangle - \langle \mathbf{h}(\mathbf{x}) \rangle \langle \mathbf{h}(\mathbf{x}) \rangle^{\mathrm{T}}$. Using these assumptions, I get an approximate Bayes regressor

$$\langle f(\mathbf{x}) \rangle \approx \frac{1}{\sqrt{K}} \sum_k \left\langle \phi(h_k) \right\rangle_{\mathbf{h} \sim \mathcal{N}(\langle \mathbf{h}(\mathbf{x}) \rangle, V)} . \qquad (13)$$

To make predictions, we therefore need the first two moments of the weights, since $\langle h_k(\mathbf{x}) \rangle = \langle \mathbf{w}_k \rangle \cdot \mathbf{x}$ and $V_{kl} = \sum_{mn} x_m x_n \left( \langle w_{mk} w_{nl} \rangle - \langle w_{mk} \rangle \langle w_{nl} \rangle \right)$.
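Under the Gaussian field assumption, the output average in eq. (13) only involves the marginals $h_k \sim \mathcal{N}(\langle h_k \rangle, V_{kk})$, and for the erf transfer it has the standard closed form $E[\mathrm{erf}(h/\sqrt{2})] = \mathrm{erf}(\mu/\sqrt{2(1+v)})$ for $h \sim \mathcal{N}(\mu, v)$. The sketch below (not from the paper) checks this identity by Monte Carlo and evaluates the regressor for made-up predictive moments `h_mean` and `V`:

```python
import math
import numpy as np

def avg_erf_transfer(mu, var):
    """E[phi(h)] for h ~ N(mu, var) with phi(z) = erf(z/sqrt(2)):
    closed form erf(mu / sqrt(2 (1 + var)))."""
    return math.erf(mu / math.sqrt(2.0 * (1.0 + var)))

def bayes_regressor(h_mean, V):
    """Approximate Bayes regressor in the spirit of eq. (13):
    (1/sqrt(K)) sum_k E[phi(h_k)]; the mean needs only the marginals,
    i.e. the diagonal of V."""
    K = len(h_mean)
    return sum(avg_erf_transfer(m, V[k, k]) for k, m in enumerate(h_mean)) / math.sqrt(K)

# hypothetical predictive moments for K = 3 hidden units
h_mean = np.array([0.5, -1.0, 0.2])
V = np.diag([0.3, 0.8, 0.1])
print("regressor:", bayes_regressor(h_mean, V))

# Monte Carlo check of the closed-form Gaussian average
rng = np.random.default_rng(0)
h = rng.normal(0.5, math.sqrt(0.3), size=400_000)
erf_v = np.vectorize(math.erf)
mc = float(np.mean(erf_v(h / math.sqrt(2.0))))
print("MC:", mc, "closed form:", avg_erf_transfer(0.5, 0.3))
```

Note the qualitative behavior: as the field variance grows, the averaged transfer flattens toward zero, which is how the posterior uncertainty in the weights softens the regressor's predictions.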
We can simplify this in the large d limit by taking the inputs to be iid with zero mean and unit variance: $V_{kl} \approx \langle \mathbf{w}_k \cdot \mathbf{w}_l \rangle - \langle \mathbf{w}_k \rangle \cdot \langle \mathbf{w}_l \rangle$. This approximation can be avoided at a substantial computational cost [8]. Furthermore, $\langle \mathbf{w}_k \cdot \mathbf{w}_l \rangle$ turns out to be equal to the prior covariance