{"title": "Computing with Infinite Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 295, "page_last": 301, "abstract": null, "full_text": "Computing with infinite networks \n\nChristopher K. I. Williams \n\nNeural Computing Research Group \n\nDepartment of Computer Science and Applied Mathematics \n\nAston University, Birmingham B4 7ET, UK \n\nc.k.i.williamsGaston.ac.nk \n\nAbstract \n\nFor neural networks with a wide class of weight-priors, it can be \nshown that in the limit of an infinite number of hidden units the \nprior over functions tends to a Gaussian process. In this paper an(cid:173)\nalytic forms are derived for the covariance function of the Gaussian \nprocesses corresponding to networks with sigmoidal and Gaussian \nhidden units. This allows predictions to be made efficiently using \nnetworks with an infinite number of hidden units, and shows that, \nsomewhat paradoxically, it may be easier to compute with infinite \nnetworks than finite ones. \n\n1 \n\nIntroduction \n\nTo someone training a neural network by maximizing the likelihood of a finite \namount of data it makes no sense to use a network with an infinite number of hidden \nunits; the network will \"overfit\" the data and so will be expected to generalize \npoorly. However, the idea of selecting the network size depending on the amount \nof training data makes little sense to a Bayesian; a model should be chosen that \nreflects the understanding of the problem, and then application of Bayes' theorem \nallows inference to be carried out (at least in theory) after the data is observed. \n\nIn the Bayesian treatment of neural networks, a question immediately arises as to \nhow many hidden units are believed to be appropriate for a task. Neal (1996) has \nargued compellingly that for real-world problems, there is no reason to believe that \nneural network models should be limited to nets containing only a \"small\" number \nof hidden units. 
He has shown that it is sensible to consider a limit where the number of hidden units in a net tends to infinity, and that good predictions can be obtained from such models using the Bayesian machinery. He has also shown that for fixed hyperparameters, a large class of neural network models will converge to a Gaussian process prior over functions in the limit of an infinite number of hidden units. \n\nNeal's argument is an existence proof: it states that an infinite neural net will converge to a Gaussian process, but does not give the covariance function needed to actually specify the particular Gaussian process. In this paper I show that for certain weight priors and transfer functions in the neural network model, the covariance function which describes the behaviour of the corresponding Gaussian process can be calculated analytically. This allows predictions to be made using neural networks with an infinite number of hidden units in time O(n^3), where n is the number of training examples^1. The only alternative currently available is to use Markov Chain Monte Carlo (MCMC) methods (e.g. Neal, 1996) for networks with a large (but finite) number of hidden units. However, this is likely to be computationally expensive, and we note possible concerns over the time needed for the Markov chain to reach equilibrium. The availability of an analytic form for the covariance function also facilitates the comparison of the properties of neural networks with an infinite number of hidden units as compared to other Gaussian process priors that may be considered. \n\nThe Gaussian process analysis applies for fixed hyperparameters θ. If it were desired to make predictions based on a hyperprior P(θ) then the necessary θ-space integration could be achieved by MCMC methods. 
The great advantage of integrating out the weights analytically is that it dramatically reduces the dimensionality of the MCMC integrals, and thus improves their speed of convergence. \n\n1.1 From priors on weights to priors on functions \n\nBayesian neural networks are usually specified in a hierarchical manner, so that the weights w are regarded as being drawn from a distribution P(w|θ). For example, the weights might be drawn from a zero-mean Gaussian distribution, where θ specifies the variance of groups of weights. A full description of the prior is given by specifying P(θ) as well as P(w|θ). The hyperprior can be integrated out to give P(w) = ∫ P(w|θ)P(θ) dθ, but in our case it will be advantageous not to do this as it introduces weight correlations which prevent convergence to a Gaussian process. \n\nIn the Bayesian view of neural networks, predictions for the output value y_* corresponding to a new input value x_* are made by integrating over the posterior in weight space. Let D = ((x_1,t_1), (x_2,t_2), ..., (x_n,t_n)) denote the n training data pairs, t = (t_1, ..., t_n)^T, and let f_*(w) denote the mapping carried out by the network on input x_* given weights w. P(w|t,θ) is the weight posterior given the training data^2. Then the predictive distribution for y_* given the training data and hyperparameters θ is \n\nP(y_*|t,θ) = ∫ δ(y_* - f_*(w)) P(w|t,θ) dw    (1) \n\nWe will now show how this can also be viewed as making the prediction using priors over functions rather than weights. Let f(w) denote the vector of outputs corresponding to inputs (x_1, ..., x_n) given weights w. Then, using Bayes' theorem we have P(w|t,θ) = P(t|w)P(w|θ)/P(t|θ), and P(t|w) = ∫ P(t|y) δ(y - f(w)) dy. Hence equation 1 can be rewritten as \n\nP(y_*|t,θ) = (1/P(t|θ)) ∫∫ P(t|y) δ(y_* - f_*(w)) δ(y - f(w)) P(w|θ) dw dy    (2) \n\nHowever, the prior over (y_*, y_1, ..., y_n) is given by P(y_*, y|θ) = P(y_*|y,θ)P(y|θ) = ∫ δ(y_* - f_*(w)) δ(y - f(w)) P(w|θ) dw, and thus the predictive distribution can be written as \n\nP(y_*|t,θ) = (1/P(t|θ)) ∫ P(t|y) P(y_*|y,θ) P(y|θ) dy = ∫ P(y_*|y,θ) P(y|t,θ) dy    (3) \n\nHence in a Bayesian view it is the prior over function values P(y_*, y|θ) which is important; specifying this prior by using weight distributions is one valid way to achieve this goal. In general we can use the weight space or function space view, whichever is more convenient, and for infinite neural networks the function space view is more useful. \n\n^1 For large n, various approximations to the exact solution which avoid the inversion of an n x n matrix are available. \n\n^2 For notational convenience we suppress the x-dependence of the posterior. \n\n2 Gaussian processes \n\nA stochastic process is a collection of random variables {Y(x) | x ∈ X} indexed by a set X. In our case X will be R^d, where d is the number of inputs. The stochastic process is specified by giving the probability distribution for every finite subset of variables Y(x_1), ..., Y(x_k) in a consistent manner. A Gaussian process (GP) is a stochastic process which can be fully specified by its mean function μ(x) = E[Y(x)] and its covariance function C(x,x') = E[(Y(x) - μ(x))(Y(x') - μ(x'))]; any finite set of Y-variables will have a joint multivariate Gaussian distribution. For a multidimensional input space a Gaussian process may also be called a Gaussian random field. \n\nBelow we consider Gaussian processes which have μ(x) = 0, as is the case for the neural network priors discussed in section 3. A non-zero μ(x) can be incorporated into the framework at the expense of a little extra complexity. \n\nA widely used class of covariance functions is the stationary covariance functions, whereby C(x,x') = C(x - x'). 
These are related to the spectral density (or power spectrum) of the process by the Wiener-Khinchine theorem, and are particularly amenable to Fourier analysis as the eigenfunctions of a stationary covariance kernel are exp(ik·x). Many commonly used covariance functions are also isotropic, so that C(h) = C(h) where h = x - x' and h = |h|. For example C(h) = exp(-(h/σ)^ν) is a valid covariance function for all d and for 0 < ν ≤ 2. Note that in this case σ sets the correlation length-scale of the random field, although other covariance functions (e.g. those corresponding to power-law spectral densities) may have no preferred length scale. \n\n2.1 Prediction with Gaussian processes \n\nThe model for the observed data is that it was generated from the prior stochastic process, and that independent Gaussian noise (of variance σ_ν^2) was then added. Given a prior covariance function C_P(x_i,x_j), a noise process C_N(x_i,x_j) = σ_ν^2 δ_ij (i.e. independent noise of variance σ_ν^2 at each data point) and the training data, the prediction for the distribution of y_* corresponding to a test point x_* is obtained simply by applying equation 3. As the prior and noise model are both Gaussian the integral can be done analytically and P(y_*|t,θ) is Gaussian with mean and variance \n\nŷ(x_*) = k_P^T(x_*) (K_P + K_N)^{-1} t    (4) \nσ^2(x_*) = C_P(x_*,x_*) - k_P^T(x_*) (K_P + K_N)^{-1} k_P(x_*)    (5) \n\nwhere [K_α]_ij = C_α(x_i,x_j) for α = P, N and k_P(x_*) = (C_P(x_*,x_1), ..., C_P(x_*,x_n))^T. σ^2(x_*) gives the \"error bars\" of the prediction. \n\nEquations 4 and 5 are the analogue for spatial processes of Wiener-Kolmogorov prediction theory. They have appeared in a wide variety of contexts including 
geostatistics, where the method is known as \"kriging\" (Journel and Huijbregts, 1978; Cressie, 1993), multidimensional spline smoothing (Wahba, 1990), in the derivation of radial basis function neural networks (Poggio and Girosi, 1990) and in the work of Whittle (1963). \n\n3 Covariance functions for Neural Networks \n\nConsider a network which takes an input x, has one hidden layer with H units and then linearly combines the outputs of the hidden units with a bias to obtain f(x). The mapping can be written \n\nf(x) = b + Σ_{j=1}^H v_j h(x; u_j)    (6) \n\nwhere h(x; u) is the hidden unit transfer function (which we shall assume is bounded) which depends on the input-to-hidden weights u. This architecture is important because it has been shown by Hornik (1993) that networks with one hidden layer are universal approximators as the number of hidden units tends to infinity, for a wide class of transfer functions (but excluding polynomials). Let b and the v's have independent zero-mean distributions of variance σ_b^2 and σ_v^2 respectively, and let the weights u_j for each hidden unit be independently and identically distributed. Denoting all weights by w, we obtain (following Neal, 1996) \n\nE_w[f(x)] = 0    (7) \nE_w[f(x)f(x')] = σ_b^2 + Σ_j σ_v^2 E_u[h_j(x; u) h_j(x'; u)]    (8) \n= σ_b^2 + H σ_v^2 E_u[h(x; u) h(x'; u)]    (9) \n\nwhere equation 9 follows because all of the hidden units are identically distributed. The final term in equation 9 becomes ω^2 E_u[h(x; u) h(x'; u)] by letting σ_v^2 scale as ω^2/H. \n\nAs the transfer function is bounded, all moments of the distribution will be bounded and hence the Central Limit Theorem can be applied, showing that the stochastic process will become a Gaussian process in the limit as H → ∞. \n\nBy evaluating E_u[h(x)h(x')] for all x and x' in the training and testing sets we can obtain the covariance function needed to describe the neural network as a Gaussian process. 
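Once a covariance function is available, the predictions of equations 4 and 5 are a few lines of linear algebra. The following is a minimal sketch; the squared-exponential covariance and all numerical values are illustrative placeholders, not choices made in the paper.

```python
import numpy as np

def gp_predict(x_star, X, t, cov, noise_var):
    """GP predictive mean and variance (equations 4 and 5):
    mean = k_P^T (K_P + K_N)^{-1} t
    var  = C_P(x*, x*) - k_P^T (K_P + K_N)^{-1} k_P."""
    n = len(X)
    K = np.array([[cov(X[i], X[j]) for j in range(n)] for i in range(n)])
    K += noise_var * np.eye(n)            # K_P + K_N, with K_N = noise_var * I
    k = np.array([cov(x_star, X[i]) for i in range(n)])
    mean = k @ np.linalg.solve(K, t)      # the O(n^3) step noted in the text
    var = cov(x_star, x_star) - k @ np.linalg.solve(K, k)
    return mean, var

# Illustrative stationary covariance (not one of the network-derived kernels):
cov = lambda x, xp: float(np.exp(-0.5 * (x - xp) ** 2))
X = [0.0, 1.0, 2.0]
t = np.array([0.0, 1.0, 0.0])
mean, var = gp_predict(1.0, X, t, cov, noise_var=0.1)
```

Substituting a network-derived covariance function such as V_erf or V_G for `cov` gives predictions from the corresponding infinite network.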
These expectations are, of course, integrals over the relevant probability distributions of the biases and input weights. In the following sections two specific choices for the transfer functions are considered, (1) a sigmoidal function and (2) a Gaussian. Gaussian weight priors are used in both cases. \n\nIt is interesting to note why this analysis cannot be taken a stage further to integrate out any hyperparameters as well. For example, the variance σ_v^2 of the v weights might be drawn from an inverse Gamma distribution. In this case the distribution P(v) = ∫ P(v|σ_v^2)P(σ_v^2) dσ_v^2 is no longer the product of the marginal distributions for each v weight (in fact it will be a multivariate t-distribution). A similar analysis can be applied to the u weights with a hyperprior. The effect is to make the hidden units non-independent, so that the Central Limit Theorem can no longer be applied. \n\n3.1 Sigmoidal transfer function \n\nA sigmoidal transfer function is a very common choice in neural networks research; nets with this architecture are usually called multi-layer perceptrons. Below we consider the transfer function h(x; u) = Φ(u_0 + Σ_{i=1}^d u_i x_i), where Φ(z) = (2/√π) ∫_0^z e^{-t^2} dt is the error function, closely related to the cumulative distribution function for the Gaussian distribution. Appropriately scaled, the graph of this function is very similar to the tanh function which is more commonly used in the neural networks literature. \n\nIn calculating V(x,x') = E_u[h(x; u)h(x'; u)] we make the usual assumptions (e.g. MacKay, 1992) that u is drawn from a zero-mean Gaussian distribution with covariance matrix Σ, i.e. u ~ N(0, Σ). Let x̃ = (1, x_1, ..., x_d) be an augmented input vector whose first entry corresponds to the bias. 
Then V_erf(x,x') can be written as \n\nV_erf(x,x') = (2π)^{-(d+1)/2} |Σ|^{-1/2} ∫ Φ(u^T x̃) Φ(u^T x̃') exp(-u^T Σ^{-1} u / 2) du    (10) \n\nThis integral can be evaluated analytically^3 to give \n\nV_erf(x,x') = (2/π) sin^{-1} ( 2 x̃^T Σ x̃' / √((1 + 2 x̃^T Σ x̃)(1 + 2 x̃'^T Σ x̃')) )    (11) \n\nWe observe that this covariance function is not stationary, which makes sense as the distributions for the weights are centered about zero, and hence translational symmetry is not present. \n\nConsider a diagonal weight prior Σ = diag(σ_0^2, σ_1^2, ..., σ_1^2), so that the inputs i = 1, ..., d have a different weight variance σ_1^2 to the bias variance σ_0^2. Then for |x|^2, |x'|^2 ≫ (1 + 2σ_0^2)/2σ_1^2, we find that V_erf(x,x') ≈ 1 - 2θ/π, where θ is the angle between x and x'. Again this makes sense intuitively; if the model is made up of a large number of sigmoidal functions in random directions (in x space), then we would expect points that lie diametrically opposite (i.e. at x and -x) to be anti-correlated, because they will lie in the +1 and -1 regions of the sigmoid function for most directions. \n\n3.2 Gaussian transfer function \n\nOne other very common transfer function used in neural networks research is the Gaussian, so that h(x; u) = exp[-(x - u)^T(x - u)/2σ_g^2], where σ_g^2 is the width parameter of the Gaussian. Gaussian basis functions are often used in Radial Basis Function (RBF) networks (e.g. Poggio and Girosi, 1990). 
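Equation 11 is easy to check numerically. The sketch below implements the arcsin form and compares it with a Monte Carlo estimate of E_u[h(x; u)h(x'; u)] for erf hidden units; the variance settings and test inputs are illustrative choices, not values from the paper.

```python
import math
import numpy as np

def v_erf(x, xp, Sigma):
    """Arcsin covariance of equation 11 for the erf transfer function."""
    xt, xtp = np.append(1.0, x), np.append(1.0, xp)   # augmented inputs, bias entry first
    num = 2.0 * xt @ Sigma @ xtp
    den = math.sqrt((1 + 2 * xt @ Sigma @ xt) * (1 + 2 * xtp @ Sigma @ xtp))
    return (2.0 / math.pi) * math.asin(num / den)

# Monte Carlo estimate of E_u[h(x;u) h(x';u)] with u ~ N(0, Sigma), h = erf(u^T xtilde):
rng = np.random.default_rng(0)
Sigma = np.diag([2.0, 1.0])                 # diag(sigma_0^2, sigma_1^2) for d = 1
x, xp = np.array([0.5]), np.array([-0.3])
u = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)
erf = np.vectorize(math.erf)
mc = float(np.mean(erf(u @ np.append(1.0, x)) * erf(u @ np.append(1.0, xp))))
analytic = v_erf(x, xp, Sigma)
```

The two estimates agree to within Monte Carlo error, which is one way to gain confidence in a reconstructed covariance formula before using it for prediction.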
For a Gaussian prior over the distribution of u so that u ~ N(0, σ_u^2 I), \n\nV_G(x,x') = (2πσ_u^2)^{-d/2} ∫ exp(-(x - u)^T(x - u)/2σ_g^2) exp(-(x' - u)^T(x' - u)/2σ_g^2) exp(-u^T u/2σ_u^2) du    (12) \n\nBy completing the square and integrating out u we obtain \n\nV_G(x,x') = (σ_e/σ_u)^d exp(-x^T x/2σ_m^2) exp(-(x - x')^T(x - x')/2σ_s^2) exp(-x'^T x'/2σ_m^2)    (13) \n\nwhere 1/σ_e^2 = 2/σ_g^2 + 1/σ_u^2, σ_s^2 = 2σ_g^2 + σ_g^4/σ_u^2 and σ_m^2 = 2σ_u^2 + σ_g^2. This formula can be generalized by allowing covariance matrices Σ_g and Σ_u in place of σ_g^2 I and σ_u^2 I; rescaling each input variable x_i independently is a simple example. \n\n^3 Introduce a dummy parameter λ to make the first term in the integrand Φ(λu^T x̃). Differentiate the integral with respect to λ and then use integration by parts. Finally recognize that dV_erf/dλ is of the form (1 - θ^2)^{-1/2} dθ/dλ and hence obtain the sin^{-1} form of the result, and evaluate it at λ = 1. \n\nAgain this is a non-stationary covariance function, although it is interesting to note that if σ_u^2 → ∞ (while scaling ω^2 appropriately) we find that V_G(x,x') ∝ exp{-(x - x')^T(x - x')/4σ_g^2}^4. For a finite value of σ_u^2, V_G(x,x') is a stationary covariance function \"modulated\" by the Gaussian decay function exp(-x^T x/2σ_m^2) exp(-x'^T x'/2σ_m^2). Clearly if σ_m^2 is much larger than the largest distance in x-space then the predictions made with V_G and a Gaussian process with only the stationary part of V_G will be very similar. \n\nIt is also possible to view the infinite network with Gaussian transfer functions as an example of a shot-noise process based on an inhomogeneous Poisson process (see Parzen (1962) §4.5 for details). 
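As with the sigmoidal case, the closed form of equation 13 (with its σ_e, σ_s and σ_m definitions) can be sanity-checked against a Monte Carlo average over u. The parameter values below are illustrative assumptions, not taken from the paper.

```python
import math
import numpy as np

def v_gauss(x, xp, sg2, su2):
    """Closed-form covariance of equation 13 for Gaussian hidden units,
    with u ~ N(0, su2 * I); sg2 is the squared width of the basis functions."""
    d = len(x)
    se2 = 1.0 / (2.0 / sg2 + 1.0 / su2)   # 1/sigma_e^2 = 2/sigma_g^2 + 1/sigma_u^2
    ss2 = 2.0 * sg2 + sg2 ** 2 / su2      # sigma_s^2
    sm2 = 2.0 * su2 + sg2                 # sigma_m^2
    return (math.sqrt(se2 / su2) ** d
            * math.exp(-float(x @ x) / (2 * sm2))
            * math.exp(-float((x - xp) @ (x - xp)) / (2 * ss2))
            * math.exp(-float(xp @ xp) / (2 * sm2)))

# Monte Carlo estimate of E_u[h(x;u) h(x';u)] for h(x;u) = exp(-|x - u|^2 / 2 sg2):
rng = np.random.default_rng(1)
sg2, su2 = 1.0, 4.0
x, xp = np.array([0.2]), np.array([-0.4])
u = rng.normal(0.0, math.sqrt(su2), size=(500_000, 1))
h = np.exp(-np.sum((x - u) ** 2, axis=1) / (2 * sg2))
hp = np.exp(-np.sum((xp - u) ** 2, axis=1) / (2 * sg2))
mc = float(np.mean(h * hp))
analytic = v_gauss(x, xp, sg2, su2)
```

The agreement of `mc` and `analytic` also makes the "modulation" structure visible: the first and last factors of equation 13 decay the covariance away from the origin, exactly as described above.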
Points are generated from an inhomogeneous Poisson process with rate function ∝ exp(-x^T x/2σ_u^2), and Gaussian kernels of height v are centered on each of the points, where v is chosen iid from a distribution with mean zero and variance σ_v^2. \n\n3.3 Comparing covariance functions \n\nThe priors over functions specified by sigmoidal and Gaussian neural networks differ from covariance functions that are usually employed in the literature, e.g. splines (Wahba, 1990). How might we characterize the different covariance functions and compare the kinds of priors that they imply? \n\nThe complex exponential exp(ik·x) is an eigenfunction of a stationary and isotropic covariance function, and hence the spectral density (or power spectrum) S(k) (k = |k|) nicely characterizes the corresponding stochastic process. Roughly speaking the spectral density describes the \"power\" at a given spatial frequency k; for example, splines have S(k) ∝ k^{-β}. The decay of S(k) as k increases is essential, as it provides a smoothing or damping out of high frequencies. Unfortunately non-stationary processes cannot be analyzed in exactly this fashion because the complex exponentials are not (in general) eigenfunctions of a non-stationary kernel. Instead, we must consider the eigenfunctions defined by ∫ C(x,x')φ(x') dx' = λφ(x). However, it may be possible to get some feel for the effect of a non-stationary covariance function by looking at the diagonal elements in its 2d-dimensional Fourier transform, which correspond to the entries in the power spectrum for stationary covariance functions. \n\n3.4 Convergence of finite network priors to GPs \n\nFrom general Central Limit Theorem results one would expect a rate of convergence of H^{-1/2} towards a Gaussian process prior. How many units will be required in practice would seem to depend on the particular values of the weight-variance parameters. 
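One cheap way to watch this convergence, short of full MCMC, is to sample finite networks from the prior of equation 6 (with σ_v^2 = ω^2/H) and track a simple non-Gaussianity statistic such as the excess kurtosis of f(x), which vanishes in the GP limit. All variance settings below are illustrative assumptions, not the experimental values used in the paper.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
omega2, sb2 = 1.0, 0.01   # omega^2 and sigma_b^2 (illustrative values)
s0, s1 = 0.3, 0.3         # bias and input weight variances (illustrative values)
x = 0.5                   # a single test input, d = 1

def prior_samples(H, n_nets=20000):
    """Draw f(x) from the finite-network prior of equation 6,
    with sigma_v^2 = omega2 / H so the prior variance is fixed as H grows."""
    u0 = rng.normal(0.0, math.sqrt(s0), size=(n_nets, H))
    u1 = rng.normal(0.0, math.sqrt(s1), size=(n_nets, H))
    h = np.vectorize(math.erf)(u0 + u1 * x)                  # erf hidden units
    v = rng.normal(0.0, math.sqrt(omega2 / H), size=(n_nets, H))
    b = rng.normal(0.0, math.sqrt(sb2), size=n_nets)
    return b + np.sum(v * h, axis=1)

def excess_kurtosis(f):
    f = f - f.mean()
    return float(np.mean(f ** 4) / np.mean(f ** 2) ** 2 - 3.0)

k1 = excess_kurtosis(prior_samples(1))      # markedly non-Gaussian
k100 = excess_kurtosis(prior_samples(100))  # close to the Gaussian value of zero
```

The excess kurtosis falls off roughly like 1/H, while convergence in distribution proceeds at the slower H^{-1/2} rate of the Central Limit Theorem.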
For example, for Gaussian transfer functions, σ_m defines the radius over which we expect the process to be significantly different from zero. If this radius is increased (while keeping the variance of the basis functions σ_g^2 fixed) then naturally one would expect to need more hidden units in order to achieve the same level of approximation as before. Similar comments can be made for the sigmoidal case, depending on (1 + 2σ_0^2)/2σ_1^2. \n\nI have conducted some experiments for the sigmoidal transfer function, comparing the predictive performance of a finite neural network with one input unit to the equivalent Gaussian process on data generated from the GP. The finite network simulations were carried out using a slightly modified version of Neal's MCMC Bayesian neural networks code (Neal, 1996) and the inputs were drawn from a N(0,1) distribution. The hyperparameter settings were σ_1 = 10.0, σ_0 = 2.0, σ_v = 1.189 and σ_b = 1.0. Roughly speaking the results are that 100's of hidden units are required before similar performance is achieved by the two methods, although there is considerable variability depending on the particular sample drawn from the prior; sometimes 10 hidden units appears sufficient for good agreement. \n\n^4 Note that this would require ω^2 → ∞ and hence the Central Limit Theorem would no longer hold, i.e. the process would be non-Gaussian. \n\n4 Discussion \n\nThe work described above shows how to calculate the covariance function for sigmoidal and Gaussian basis function networks. It is probable that similar techniques will allow covariance functions to be derived analytically for networks with other kinds of basis functions as well; these may turn out to be similar in form to covariance functions already used in the Gaussian process literature. \n\nIn the derivations above the hyperparameters θ were fixed. 
However, in a real data analysis problem it would be unlikely that appropriate values of these parameters would be known. Given a prior distribution P(θ), predictions should be made by integrating over the posterior distribution P(θ|t) ∝ P(θ)P(t|θ), where P(t|θ) is the likelihood of the training data t under the model; P(t|θ) is easily computed for a Gaussian process. The prediction ŷ(x) for test input x is then given by \n\nŷ(x) = ∫ ŷ_θ(x) P(θ|D) dθ    (14) \n\nwhere ŷ_θ(x) is the predicted mean (as given by equation 4) for a particular value of θ. This integration is not tractable analytically but Markov Chain Monte Carlo methods such as Hybrid Monte Carlo can be used to approximate it. This strategy was used in Williams and Rasmussen (1996), but for stationary covariance functions, not ones derived from neural networks; it would be interesting to compare results. \n\nAcknowledgements \n\nI thank David Saad and David Barber for help in obtaining the result in equation 11, and Chris Bishop, Peter Dayan, Ian Nabney, Radford Neal, David Saad and Huaiyu Zhu for comments on an earlier draft of the paper. This work was partially supported by EPSRC grant GR/J75425, \"Novel Developments in Learning Theory for Neural Networks\". \n\nReferences \n\nCressie, N. A. C. (1993). Statistics for Spatial Data. Wiley. \nHornik, K. (1993). Some new results on neural network approximation. Neural Networks 6(8), 1069-1072. \nJournel, A. G. and C. J. Huijbregts (1978). Mining Geostatistics. Academic Press. \nMacKay, D. J. C. (1992). A Practical Bayesian Framework for Backpropagation Networks. Neural Computation 4(3), 448-472. \nNeal, R. M. (1996). Bayesian Learning for Neural Networks. Springer. Lecture Notes in Statistics 118. \nParzen, E. (1962). Stochastic Processes. Holden-Day. \nPoggio, T. and F. Girosi (1990). Networks for approximation and learning. Proceedings of the IEEE 78, 1481-1497. \nWahba, G. (1990). 
Spline Models for Observational Data. Society for Industrial and Applied Mathematics. CBMS-NSF Regional Conference Series in Applied Mathematics. \nWhittle, P. (1963). Prediction and Regulation by Linear Least-Square Methods. English Universities Press. \nWilliams, C. K. I. and C. E. Rasmussen (1996). Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8, pp. 514-520. MIT Press. \n", "award": [], "sourceid": 1197, "authors": [{"given_name": "Christopher", "family_name": "Williams", "institution": null}]}