{"title": "The Generalisation Cost of RAMnets", "book": "Advances in Neural Information Processing Systems", "page_first": 253, "page_last": 259, "abstract": null, "full_text": "The Generalisation Cost of RAMnets \n\nRichard Rohwer and Michal Morciniec \n\nrohwerrj~cs.aston.ac.uk morcinim~cs.aston.ac.uk \n\nNeural Computing Research Group \n\nAston University \n\nAston Triangle, Birmingham B4 7ET, UK. \n\nAbstract \n\nGiven unlimited computational resources, it is best to use a crite(cid:173)\nrion of minimal expected generalisation error to select a model and \ndetermine its parameters. However, it may be worthwhile to sac(cid:173)\nrifice some generalisation performance for higher learning speed. \nA method for quantifying sub-optimality is set out here, so that \nthis choice can be made intelligently. Furthermore, the method \nis applicable to a broad class of models, including the ultra-fast \nmemory-based methods such as RAMnets. This brings the added \nbenefit of providing, for the first time, the means to analyse the \ngeneralisation properties of such models in a Bayesian framework . \n\n1 \n\nIntroduction \n\nIn order to quantitatively predict the performance of methods such as the ultra-fast \nRAMnet, which are not trained by minimising a cost function, we develop a Bayesian \nformalism for estimating the generalisation cost of a wide class of algorithms. \nWe consider the noisy interpolation problem, in which each output data point if \nresults from adding noise to the result y = f(x) of applying unknown function f \nto input data point x, which is generated from a distribution P (x). We follow a \nsimilar approach to (Zhu & Rohwer, to appear 1996) in using a Gaussian process to \ndefine a prior over the space of functions, so that the expected generalisation cost \nunder the posterior can be determined. The optimal model is defined in terms of \nthe restriction of this posterior to the subspace defined by the model. 
The optimum is easily determined for linear models over a set of basis functions. We go on to compute the generalisation cost (with an error bar) for all models of this class, which we demonstrate to include the RAMnets. \n\n\f254 \n\nR. Rohwer and M. Morciniec \n\nSection 2 gives a brief overview of RAMnets. Sections 3 and 4 supply the formalism for computing expected generalisation costs under Gaussian process priors. Numerical experiments with this formalism are presented in Section 5. Finally, we discuss the current limitations of this technique and future research directions in Section 6. \n\n2 RAMnets \n\nThe RAMnet, or n-tuple network, is a very fast 1-pass learning system that often gives excellent results competitive with slower methods such as Radial Basis Function networks or Multi-layer Perceptrons (Rohwer & Morciniec, 1996). Although a semi-quantitative theory explains how these systems generalise, no formal framework has previously been given to precisely predict the accuracy of n-tuple networks. \n\nEssentially, a RAMnet defines a set of \"features\" which can be regarded as Boolean functions of the input variables. Let the ath feature of x be given by a {0,1}-valued function \\phi_a(x). We will focus on the n-tuple regression network (Allinson & Kolcz, 1995), which outputs \n\nf^m_x = \\sum_i U(x, x^i) y^i / \\sum_j U(x, x^j) \\qquad (1) \n\nin response to input x, if trained on the set of N samples {X_{(N)}, Y_{(N)}} = {(x^i, y^i)}_{i=1}^N. Here U(x, x') = \\sum_a \\phi_a(x) \\phi_a(x') can be seen to play the role of a smoothing kernel, provided that it turns out to have a suitable shape. It is well-known that it does, for appropriate choices of feature sets. The strength of this method is that the sums over training data can be done in one pass, producing a table containing two totals for each feature. Only this table is required for recognition. 
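The one-pass training scheme described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: the class and helper names (`NTupleRegressor`, `thermometer`) are ours, and the thermometer input encoding anticipates the coding used in Section 5.

```python
import random
from collections import defaultdict

class NTupleRegressor:
    """One-pass n-tuple (RAMnet) regression, a hypothetical sketch of eq. (1).

    Each feature is the event "this n-tuple of input bits shows this exact
    bit pattern"; U(x, x') counts the features that x and x' share.
    """
    def __init__(self, n_bits, n_tuples, tuple_size, seed=0):
        rng = random.Random(seed)
        self.tuples = [rng.sample(range(n_bits), tuple_size)
                       for _ in range(n_tuples)]
        # Two totals per feature, as in the text: a sum of y's and a count.
        self.sum_y = defaultdict(float)
        self.count = defaultdict(int)

    def _features(self, bits):
        # One feature (table address) per tuple: (tuple index, bit pattern).
        return [(t, tuple(bits[j] for j in idx))
                for t, idx in enumerate(self.tuples)]

    def train(self, X, y):          # a single pass over the data
        for bits, target in zip(X, y):
            for f in self._features(bits):
                self.sum_y[f] += target
                self.count[f] += 1

    def predict(self, bits):        # eq. (1): ratio of the two totals
        feats = self._features(bits)
        num = sum(self.sum_y[f] for f in feats)
        den = sum(self.count[f] for f in feats)
        return num / den if den else 0.0

def thermometer(x, n_bits, lo=-1.0, hi=1.0):
    """Thermometer code: the first k bits are 1, the rest 0."""
    k = int(round((x - lo) / (hi - lo) * n_bits))
    k = max(0, min(n_bits, k))
    return [1] * k + [0] * (n_bits - k)
```

Only the two tables `sum_y` and `count` are needed at recognition time, which is what makes the method 1-pass and fast.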
\n\nIt is interesting to note that there is a familiar way to expand a kernel into the form U(x, x') = \\sum_a \\phi_a(x) \\phi_a(x'), at least when U(x, x') = U(x - x'), if the range of \\phi is not restricted to {0,1}: an eigenfunction expansion^1. Indeed, principal component analysis^2 applied to a Gaussian with variance V shows that the smallest feature set for a given generalisation cost consists of the (real-valued) projections onto the leading eigenfunctions of V. Be that as it may, the treatment here applies to arbitrary feature sets. \n\n3 Bayesian inference with Gaussian priors \n\nGaussian processes provide a diverse set of priors over function spaces. To avoid mathematical details of peripheral interest, let us approximate the infinite-dimensional space of functions by a finite-dimensional space of discretised functions, so that function f is replaced by a high-dimensional vector f, and f(x) is replaced by f_x, with f(x) \\approx f_x within a volume \\Delta x around x. We develop the case of scalar functions f, but the generalisation to vector-valued functions is straightforward. \n\n^1 In physics, this is essentially the mode function expansion of U^{-1}, the differential operator with Green's function U. \n\n^2 V^{-1} needs to be a compact operator for this to work in the infinite-dimensional limit. \n\nWe assume a Gaussian prior on f, with zero mean and covariance V/\\alpha: \n\nP(f) = (1/Z_\\alpha) \\exp(-\\frac{\\alpha}{2} f^T V^{-1} f) \\qquad (2) \n\nwhere Z_\\alpha = \\det(\\frac{2\\pi}{\\alpha} V)^{1/2}. The overall scale of variation of f is controlled by \\alpha. Illustrative samples of the functions generated from various choices of covariance are given in (Zhu & Rohwer, to appear 1996). With q_x/\\beta denoting the (possibly position-dependent) variance of the Gaussian output noise, the likelihood of outputs Y_{(N)} given function f and inputs X_{(N)} is \n\nP(Y_{(N)}|X_{(N)}, f) = (1/Z_\\beta) \\exp[-\\frac{\\beta}{2} \\sum_i (f_{x^i} - y^i) q_{x^i}^{-1} (f_{x^i} - y^i)] \\qquad (3) \n\nwhere Z_\\beta = \\det[\\frac{2\\pi}{\\beta} Q]^{1/2} with Q_{ij} = q_{x^i} \\delta_{ij}. 
\n\nBecause f and X_{(N)} are independent, the joint distribution is \n\nP(Y_{(N)}, f | X_{(N)}) = P(Y_{(N)} | f, X_{(N)}) P(f) = (e^{\\frac{1}{2} b^T A b + C} / (Z_\\alpha Z_\\beta)) \\, e^{-\\frac{1}{2}(f - Ab)^T A^{-1}(f - Ab)} \\qquad (4) \n\nwhere \\delta_{x,x^i} is understood to be 1 whenever x^i is in the same cell of the discretisation \\Delta x as x, and A^{-1}_{xx'} = \\alpha V^{-1}_{xx'} + \\beta \\sum_i q_{x^i}^{-1} \\delta_{x,x^i} \\delta_{x',x^i}, b_x = \\beta \\sum_i y^i q_{x^i}^{-1} \\delta_{x,x^i}, and C = -\\frac{\\beta}{2} \\sum_i y^i q_{x^i}^{-1} y^i. One can readily verify that \n\nA_{xx'} = (1/\\alpha) V_{xx'} + \\sum_{tu} V_{x x^t} K_{tu} V_{x^u x'} \\qquad (5) \n\nwhere K is the N \\times N matrix defined by \n\n(6) \n\nThe posterior is readily determined to be \n\nP(f | X_{(N)}, Y_{(N)}) = \\det(2\\pi A)^{-1/2} \\exp[-\\frac{1}{2}(f - f^*)^T A^{-1}(f - f^*)] \\qquad (7) \n\nwhere f^* = Ab is the posterior mean estimate of the true function f. \n\n4 Calculation of the expected cost and its variance \n\nLet us define the cost of associating an output f^m_x of the model with an input x that actually produced an output y as \n\nC(f^m_x, y) = \\frac{1}{2}(f^m_x - y)^2 r_x \n\nwhere r_x is a position-dependent cost weight. \n\nThe average of this cost defines a cost functional, given input data X_{(N)}: \n\nC(f, f^m | X_{(N)}) = \\int C(f^m_x, y) P(x | X_{(N)}) P(y | x, f) \\, dx \\, dy. \\qquad (8) \n\nThis form is obtained by noting that the function f carries no information about the input point x, and the input data X_{(N)} supplies no information about y beyond that supplied by f. The distributions in (8) are unchanged by further conditioning on Y_{(N)}, so we could write C(f, f^m | X_{(N)}) = C(f, f^m | X_{(N)}, Y_{(N)}). This cost functional therefore has the posterior expectation value \n\n\\langle C | X_{(N)}, Y_{(N)} \\rangle = \\int C(f, f^m | X_{(N)}) \\, P(f | X_{(N)}, Y_{(N)}) \\, df \\qquad (9) \n\nand variance \n\n\\langle (C - \\langle C \\rangle)^2 | X_{(N)}, Y_{(N)} \\rangle. \\qquad (10) \n\nPlugging in the distributions (2) (applied to a single sample), (3) and (7) leads to: \n\n\\langle C | X_{(N)}, Y_{(N)} \\rangle = \\frac{1}{2} tr[AR] + \\frac{1}{2}(f^* - f^m)^T R (f^* - f^m) + \\frac{tr[QR]}{2\\beta} \\qquad (11) \n\nwhere the diagonal matrices R and Q have the elements R_{xx'} = P(x|X_{(N)}) r_x \\Delta x \\, \\delta_{x,x'} and Q_{xx'} = q_x \\delta_{x,x'}. 
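On a discretised grid, the posterior mean f^* = Ab and the expected cost (11) reduce to a few lines of linear algebra. The following sketch is our own illustration of these formulas, assuming uniform noise variance q and cost weight r; the function name and argument layout are not from the paper.

```python
import numpy as np

def posterior_and_cost(grid, V, alpha, beta, q, idx, y, fm, P_x, r=1.0):
    """Posterior mean f* = A b and expected cost, eq. (11) (our sketch).

    `idx[i]` is the grid cell containing training input x^i; `fm` is the
    model's output on the grid; `P_x` is the input density on the grid.
    """
    n = len(grid)
    dx = grid[1] - grid[0]
    Ainv = alpha * np.linalg.inv(V)          # alpha V^{-1}
    b = np.zeros(n)
    for i, c in enumerate(idx):              # data terms of A^{-1} and b
        Ainv[c, c] += beta / q
        b[c] += beta * y[i] / q
    A = np.linalg.inv(Ainv)
    f_star = A @ b                           # posterior mean, eq. (7)
    R = np.diag(P_x * r * dx)                # R_{xx'} = P(x|X) r_x dx delta
    Q = np.eye(n) * q
    d = f_star - fm
    cost = 0.5 * np.trace(A @ R) + 0.5 * d @ R @ d + np.trace(Q @ R) / (2 * beta)
    return f_star, cost
```

The three terms of `cost` are, respectively, the posterior uncertainty about f, the model's mismatch with the posterior mean, and the irreducible output-noise floor.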
\nSimilar calculations lead to the expression for the variance \n\n(12) \n\nwhere the elements of F are F_{xx'} = (f^m_x - f^*_x) \\delta_{x,x'}. \n\nNote that the RAMnet (1) has the form f^m_x = \\sum_i J_{x x^i} y^i, linear in the output data Y_{(N)}, with J_{x x^i} = U(x, x^i) / \\sum_j U(x, x^j). Let us take V to have the form V(x, x') = \\rho(x) G(x - x') \\rho(x'), combining translation-invariant and non-invariant factors in a plausible way. Then with the sums over x replaced by integrals, (11) becomes explicitly \n\n2 \\langle C | X_{(N)}, Y_{(N)} \\rangle = \\frac{1}{\\beta} \\int dx \\, P(x|X_{(N)}) q_x r_x + \\frac{1}{\\alpha} \\int dx \\, P(x|X_{(N)}) r_x \\rho_x^2 G_{xx} \n+ \\frac{1}{\\alpha} \\sum_{tu} \\rho_{x^t} K_{tu} \\rho_{x^u} \\int dx \\, P(x|X_{(N)}) r_x \\rho_x^2 G_{x^u x} G_{x x^t} \n+ \\alpha^2 \\sum_{tuvs} y^u K_{ut} \\rho_{x^t} \\int dx \\, P(x|X_{(N)}) r_x \\rho_x^2 G_{x^t x} G_{x x^s} \\rho_{x^s} K_{sv} y^v \n+ 2\\alpha \\sum_{tuv} y^u K_{ut} \\rho_{x^t} \\int dx \\, P(x|X_{(N)}) r_x \\rho_x G_{x^t x} J_{x x^v} y^v \n+ \\sum_{uv} y^u \\int dx \\, P(x|X_{(N)}) r_x J_{x x^u} J_{x x^v} y^v. \\qquad (13) \n\nFigure 1: a) The lower figure shows the input distribution. The upper figure shows the true function f generated from a Gaussian prior with covariance matrix V (dotted line), the optimal function f^* = Ab (solid line) and the suboptimal solution f^m (dashed line). b) The distribution of the cost function obtained by generating functions from the posterior Gaussian with covariance matrix A and calculating the cost according to equation 14. The mean and one standard deviation calculated analytically and numerically are shown by the lower and upper error bars respectively. 
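The cost histogram of figure 1b can be reproduced by direct sampling: draw functions from the posterior N(f^*, A) and score each against the model. This is our own sketch, assuming the discretised quadratic cost of equation (14); the helper name is hypothetical.

```python
import numpy as np

def cost_histogram(f_star, A, fm, P_x, dx, r=1.0, n_draw=10000, seed=0):
    """Monte Carlo version of Fig. 1b (our sketch): draw functions from the
    posterior N(f*, A) and score each against the model output fm with the
    discretised cost (1/2) sum_x P(x|X) r_x (f - fm)^2 dx."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(A + 1e-12 * np.eye(len(f_star)))
    draws = f_star[None, :] + rng.standard_normal((n_draw, len(f_star))) @ L.T
    w = 0.5 * P_x * r * dx                       # per-cell cost weights
    return ((draws - fm[None, :]) ** 2 * w).sum(axis=1)
```

The mean of the returned array approximates the analytic expectation (11), and its spread approximates the variance (12), which is how the agreement in figure 1b can be checked numerically.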
\n\nTaking P(x|X_{(N)}) to be Gaussian (the maximum likelihood estimate would be reasonable) and \\rho, r, and q uniform, the first four integrals are straightforward. The latter two involve the model J, and were evaluated numerically in the work reported below. \n\n5 Numerical results \n\nWe present one numerical example to illustrate the formalism, and another to illustrate its application. \n\nFor the first illustration, let the input and output variables be one-dimensional real numbers. Let the input distribution P(x) be a Gaussian with mean \\mu_x = 0 and standard deviation \\sigma_x = 0.2. Nearly all inputs then fall within the range [-1, 1], which we uniformly quantise into 41 bins. The true function f is generated from a Gaussian distribution with zero mean and 41 \\times 41 covariance matrix V with elements V_{xx'} = e^{-|x-x'|}. 50 training inputs x were generated from the input distribution and assigned corresponding outputs y = f_x + \\zeta, where \\zeta is Gaussian noise with zero mean and standard deviation \\sqrt{q_x/\\beta} = 0.01. The cost weight r_x = 1. \n\nThe inputs were thermometer coded^3 over 256 bits, from which 100 subsets of 30 bits were randomly selected. Each of the 100 \\times 2^{30} patterns formed over these bits defines a RAMnet feature which evaluates to 1 when that pattern is present in the input x. (Only those features which actually appear in the data need to be tabulated.) The 50 training data points were used in this way to train an n-tuple regression network. The input distribution and the functions f, f^*, f^m are plotted in figure 1a. \n\n^3 The first 256(x + 1)/2 bits are set to 1, and the remaining bits to 0. \n\nFigure 2: a) Neal's regression problem. The true function f is indicated by a dotted line, the optimal function f^* is denoted by a solid line and the suboptimal solution f^m is indicated by a dashed line. Circles indicate the training data. b) Dependence of the cost prediction on the values of the parameters \\alpha and \\sigma_f. The cost evaluated from the test set is plotted as a dashed line; the predicted cost is shown as a solid line with one standard deviation indicated by a dotted line. \n\nA Gaussian distribution with mean f^* and posterior covariance matrix A was then used to generate 10^4 functions. For each such function f^p, the generalisation cost \n\nC(f^p, f^m | X_{(N)}) = \\frac{1}{2} \\sum_x P(x|X_{(N)}) \\Delta x \\, r_x (f^p_x - f^m_x)^2 \\qquad (14) \n\nwas computed. A histogram of these costs appears in figure 1b, together with the theoretical and numerically computed average generalisation cost and its variance. Good agreement is evident. \n\nAnother one-dimensional problem illustrates the use of this formalism for predicting the generalisation performance of a RAMnet when the prior over functions can only be guessed. The true function, taken from (Neal, 1995), is given by \n\nf_x = 0.3 + 0.4x + 0.5 \\sin(2.7x) + 1.1/(1 + x^2) + \\zeta \\qquad (15) \n\nwhere the Gaussian noise variable \\zeta has zero mean and standard deviation \\sqrt{q_x/\\beta} = 0.1. The cost weight r_x = 1. The training and test set each comprised 100 data-points. The inputs were generated by the standard normal distribution (\\mu_x = 0, \\sigma_x = 1) and converted into binary strings using a thermometer code. The input range [-3, 3] was quantised into 61 uniform bins. 
\n\nThe training set and the functions f, f^*, f^m are shown in figure 2a for \\alpha = 0.1. The function space covariance matrix was defined to have the Gaussian form V_{xx'} = e^{-(x-x')^2 / (2\\sigma_f^2)}, where \\sigma_f = 1.0. \n\n\\sigma_f is the correlation length of the functions, which is of order 1, judging from figure 2a. The overall scale of variation is 1/\\sqrt{\\alpha}, which appears to be about 3, so \\alpha should be about 1/9. Figure 2b shows the expected cost as a function of \\alpha for various choices of \\sigma_f, with error bars on the \\sigma_f = 1.0 curve. The actual cost computed from the test set according to C = \\frac{1}{2} \\sum_i (y^i - f^m_{x^i})^2 is plotted with a dashed line. There is good agreement around the sensible values of \\alpha and \\sigma_f. \n\n6 Conclusions \n\nThis paper demonstrates that unusual models, such as the ultra-fast RAMnets which are not trained by directly optimising a cost function, can be analysed in a Bayesian framework to determine their generalisation cost. Because the formalism is constructed in terms of distributions over function space rather than distributions over model parameters, it can be used for model comparison, and in particular to select RAMnet parameters. \n\nThe main drawback with this technique, as it stands, is the need to numerically integrate two expressions which involve the model. This difficulty intensifies rapidly as the input dimension increases. Therefore, it is now a research priority to search for RAMnet feature sets which allow these integrals to be performed analytically. \n\nIt would also be interesting to average the expected costs over the training data, producing an expected generalisation cost for an algorithm. The Y_{(N)} integral is straightforward, but the X_{(N)} integral is difficult. 
However, similar integrals have been carried out in the thermodynamic limit (high input dimension) (Sollich, 1994), so the investigation of these techniques in the current setting is another promising research direction. \n\n7 Acknowledgements \n\nWe would like to thank the Aston Neural Computing Group, and especially Huaiyu Zhu, Chris Williams, and David Saad for helpful discussions. \n\nReferences \n\nAllinson, N.M., & Kolcz, A. 1995. N-tuple Regression Network. To be published in Neural Networks. \n\nNeal, R. 1995. Introductory documentation for software implementing Bayesian learning for neural networks using Markov chain Monte Carlo techniques. Tech. rept. Dept of Computer Science, University of Toronto. \n\nRohwer, R., & Morciniec, M. 1996. A theoretical and experimental account of the n-tuple classifier performance. Neural Computation, 8(3), 657-670. \n\nSollich, Peter. 1994. Finite-size effects in learning and generalization in linear perceptrons. J. Phys. A, 27, 7771-7784. \n\nZhu, H., & Rohwer, R. to appear 1996. Bayesian regression filters and the issue of priors. Neural Computing and Applications. ftp://cs.aston.ac.uk/neural/zhuh/reg~il-prior.ps.Z. \n", "award": [], "sourceid": 1220, "authors": [{"given_name": "Richard", "family_name": "Rohwer", "institution": null}, {"given_name": "Michal", "family_name": "Morciniec", "institution": null}]}