{"title": "Fast Parameter Estimation Using Green's Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 535, "page_last": 542, "abstract": null, "full_text": "Fast Parameter Estimation \n\nUsing Green's Functions \n\nK. Y. Michael Wong \nDepartment of Physics \nHong Kong University of Science and Technology \nClear Water Bay, Hong Kong \nphkywong@ust.hk \n\nFuli Li \nDepartment of Applied Physics \nXian Jiaotong University \nXian, China 710049 \nflli@xjtu.edu.cn \n\nAbstract \n\nWe propose a method for the fast estimation of hyperparameters in large networks, based on the linear response relation in the cavity method, and an empirical measurement of the Green's function. Simulation results show that it is efficient and precise, when compared with cross-validation and other techniques which require matrix inversion. \n\n1 Introduction \n\nIt is well known that correct choices of hyperparameters in classification and regression tasks can optimize the complexity of the data model, and hence achieve the best generalization [1]. In recent years various methods have been proposed to estimate the optimal hyperparameters in different contexts, such as neural networks [2], support vector machines [3, 4, 5] and Gaussian processes [5]. Most of these methods are inspired by the technique of cross-validation or its variant, leave-one-out validation. While the leave-one-out procedure gives an almost unbiased estimate of the generalization error, it is nevertheless very tedious. Many of the attempts mentioned above aim at approximating this tedious procedure without really having to sweat through it. They often rely on theoretical bounds, inverses of large matrices, or iterative optimizations. \n\nIn this paper, we propose a new approach to hyperparameter estimation in large \nsystems. 
It is known that large networks are mean-field systems, so that when one example is removed by the leave-one-out procedure, the background adjustment can be analyzed by a self-consistent perturbation approach. Similar techniques have been applied to neural networks [6], Bayesian learning [7] and the support vector machine [5]. They usually involve a macroscopic number of unknown variables, whose solution is obtained through the inversion of a matrix of macroscopic size, or through iteration. Here we take a further step and replace this by a direct measurement of the Green's function via a small number of learning processes. The proposed procedure is fast since it does not require repetitive cross-validations, matrix inversions, or iterative optimizations for each set of hyperparameters. We will also present simulation results which show that it is an excellent approximation. \n\nThe proposed technique is based on the cavity method, which was adapted from disordered systems in many-body physics. The basis of the cavity method is a self-consistent argument addressing the situation of removing an example from the system. The change on removing an example is described by the Green's function, an extremely general technique used in a wide range of quantum and classical problems in many-body physics [8]. This provides an excellent framework for the leave-one-out procedure. In this paper, we consider two applications of the cavity method to hyperparameter estimation, namely the optimal weight decay and the optimal learning time in feedforward networks. In the latter application, the cavity method provides, as far as we are aware, the only estimate of the hyperparameter beyond empirical stopping criteria and brute-force cross-validation. \n\n2 Steady-State Hyperparameter Estimation \n\nConsider the network with adjustable parameters w. 
An energy function E is defined with respect to a set of p examples with inputs and outputs respectively given by $\\xi^\\mu$ and $y^\\mu$, $\\mu = 1, \\ldots, p$, where $\\xi^\\mu$ is an N-dimensional input vector with components $\\xi_j^\\mu$, $j = 1, \\ldots, N$, and $N \\gg 1$ is macroscopic. We will first focus on the dynamics of a single-layer feedforward network and generalize the results to multilayer networks later. In single-layer networks, E has the form \n\nE = \\sum_\\mu \\epsilon(x^\\mu, y^\\mu) + R(w).   (1) \n\nHere $\\epsilon(x^\\mu, y^\\mu)$ represents the error function with respect to example $\\mu$. It is expressed in terms of the activation $x^\\mu \\equiv w \\cdot \\xi^\\mu$. $R(w)$ represents a regularization term which is introduced to limit the complexity of the network and hence enhance the generalization ability. Learning is achieved by the gradient descent dynamics \n\ndw_j(t)/dt = -(1/N) \\partial E/\\partial w_j.   (2) \n\nThe time-dependent Green's function $G_{jk}(t, s)$ is defined as the response of the weight $w_j$ at time t due to a unit stimulus added at time s to the gradient term with respect to weight $w_k$, in the limit of a vanishing magnitude of the stimulus. Hence if we compare the evolution of $w_j(t)$ with another system $\\tilde{w}_j(t)$ subject to a continuous perturbative stimulus $\\delta h_j(t)$, we would have \n\nd\\tilde{w}_j(t)/dt = -(1/N) \\partial E/\\partial \\tilde{w}_j + \\delta h_j(t),   (3) \n\nand the linear response relation \n\n\\tilde{w}_j(t) = w_j(t) + \\sum_k \\int ds G_{jk}(t, s) \\delta h_k(s).   (4) \n\nNow we consider the evolution of the network $w_j^{\\backslash\\mu}(t)$ in which example $\\mu$ is omitted from the training set. For a system learning a macroscopic number of examples, the changes induced by the omission of an example are perturbative, and we can assume that the system has a linear response. Compared with the original network $w_j(t)$, the gradient of the error of example $\\mu$ now plays the role of the stimulus in (3). 
\nHence we have \n\nw_j^{\\backslash\\mu}(t) = w_j(t) + \\sum_k \\int ds G_{jk}(t, s) (1/N) \\xi_k^\\mu \\partial\\epsilon(x^\\mu(s), y^\\mu)/\\partial x^\\mu(s).   (5) \n\nMultiplying both sides by $\\xi_j^\\mu$ and summing over j, we obtain \n\nh^\\mu(t) = x^\\mu(t) + \\int ds [(1/N) \\sum_{jk} \\xi_j^\\mu G_{jk}(t, s) \\xi_k^\\mu] \\partial\\epsilon(x^\\mu(s), y^\\mu)/\\partial x^\\mu(s).   (6) \n\nHere $h^\\mu(t) \\equiv w^{\\backslash\\mu}(t) \\cdot \\xi^\\mu$ is called the cavity activation of example $\\mu$. When the dynamics has reached the steady state, we arrive at \n\nh^\\mu = x^\\mu + \\gamma \\partial\\epsilon(x^\\mu, y^\\mu)/\\partial x^\\mu,   (7) \n\nwhere $\\gamma = \\lim_{t\\to\\infty} \\int ds [\\sum_{jk} \\xi_j^\\mu G_{jk}(t, s) \\xi_k^\\mu]/N$ is the susceptibility. \n\nAt time t, the generalization error is defined as the error function averaged over the distribution of inputs $\\xi$ and their corresponding outputs y, i.e., \n\n\\epsilon_g = \\langle \\epsilon(x, y) \\rangle,   (8) \n\nwhere $x \\equiv w \\cdot \\xi$ is the network activation. The leave-one-out generalization error is an estimate of $\\epsilon_g$ given in terms of the cavity activations $h^\\mu$ by $\\epsilon_g = \\sum_\\mu \\epsilon(h^\\mu, y^\\mu)/p$. Hence if we can estimate the Green's function, the cavity activation in (7) provides a convenient way to estimate the leave-one-out generalization error without really having to undergo the validation process. \n\nWhile self-consistent equations for the Green's function have been derived using diagrammatic methods [9], their solutions cannot be computed except for the specific case of time-translationally invariant Green's functions, such as those in Adaline learning or linear regression. However, the linear response relation (4) provides a convenient way to measure the Green's function in the general case. The basic idea is to perform two learning processes in parallel, one following the original process (2) and the other having a constant stimulus as in (3) with $\\delta h_j(t) = \\eta \\delta_{jk}$, where $\\delta_{jk}$ is the Kronecker delta. When the dynamics has reached the steady state, the measurement $\\tilde{w}_j - w_j$ yields the quantity $\\eta \\int ds G_{jk}(t, s)$. 
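The two-process measurement above can be sketched numerically. The following is a minimal illustration rather than the authors' code: it assumes a quadratic error eps(x, y) = (y - x)^2/2 and a weight-decay regularizer R(w) = N*lam*|w|^2/2, with all sizes, rates and hyperparameter values chosen arbitrarily for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, sigma, lam = 200, 150, 0.5, 0.1   # illustrative size, sample count, noise, weight decay

# Teacher and training examples: independent, normalized inputs as assumed in the text.
B = rng.standard_normal(N) / np.sqrt(N)
xi = rng.standard_normal((p, N))
y = xi @ B + sigma * rng.standard_normal(p)

def grad(w):
    # Gradient of E = sum_mu (y - x)^2/2 + N*lam*|w|^2/2, divided by N as in eq. (2).
    x = xi @ w
    return (xi.T @ (x - y) + N * lam * w) / N

eta, dt, steps = 0.01, 0.05, 4000
w = np.zeros(N)        # original process, eq. (2)
w_tilde = np.zeros(N)  # stimulated process, eq. (3), with a constant stimulus eta on every weight
for _ in range(steps):
    w, w_tilde = w - dt * grad(w), w_tilde + dt * (eta - grad(w_tilde))

gamma = np.mean(w_tilde - w) / eta     # susceptibility: <w_tilde - w> = eta * gamma at steady state
x_train = xi @ w
h = x_train + gamma * (x_train - y)    # cavity activations, eq. (7), with de/dx = x - y
eps_loo = np.mean((y - h) ** 2) / 2    # leave-one-out estimate of the generalization error
```

Because the error here is quadratic the dynamics is linear and the steady state is guaranteed for any positive weight decay; the same two-run recipe applies unchanged to other differentiable error functions.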
\nA simple averaging procedure, which replaces all the pairwise measurements between stimulation nodes k and observation nodes j, can be applied in the limit of large N. We first consider the case in which the inputs are independent and normalized, i.e., $\\langle \\xi_j \\rangle = 0$ and $\\langle \\xi_j \\xi_k \\rangle = \\delta_{jk}$. In this case, it has been shown that the off-diagonal Green's functions can be neglected, and the diagonal Green's functions become self-averaging, i.e., $G_{jk}(t, s) = G(t, s) \\delta_{jk}$, independent of the node labels [9], rendering $\\gamma = \\lim_{t\\to\\infty} \\int ds G(t, s)$. \n\nIn the case that the inputs are correlated and not normalized, we can apply a standard whitening transformation to make them independent and normalized [1]. In large networks, one can use the diagrammatic analysis in [9] to show that the (unknown) distribution of inputs does not change the self-averaging property of the Green's functions after the whitening transformation. Thereafter, the measurement of the Green's functions proceeds as described in the simpler case of independent and normalized inputs. Since hyperparameter estimation usually involves a series of computations of $\\epsilon_g$ at various hyperparameters, the one-time preprocessing does not increase the computational load significantly. \n\nThus the susceptibility $\\gamma$ can be measured by comparing the evolution of two processes: one following the original process (2), and the other having a constant stimulus as in (3) with $\\delta h_j(t) = \\eta$ for all j. When the dynamics has reached the steady state, the measurement $\\langle \\tilde{w}_j - w_j \\rangle$ yields the quantity $\\eta\\gamma$. \n\nWe illustrate the extension to two-layer networks by considering the committee machine, in which the error function takes the form $\\epsilon(\\sum_a f(x_a), y)$, where $a = 1, \\ldots, n_h$ is the label of a hidden node, $x_a \\equiv w_a \\cdot \\xi$ is the activation at the hidden node a, and f represents the activation function. 
The generalization error is thus a function of the cavity activations of the hidden nodes, namely $\\epsilon_g = \\sum_\\mu \\epsilon(\\sum_a f(h_a^\\mu), y^\\mu)/p$, where $h_a^\\mu = w_a^{\\backslash\\mu} \\cdot \\xi^\\mu$. When the inputs are independent and normalized, they are related to the generic activations by \n\nh_a^\\mu = x_a^\\mu + \\sum_b \\gamma_{ab} \\partial\\epsilon(\\sum_c f(x_c^\\mu), y^\\mu)/\\partial x_b^\\mu,   (9) \n\nwhere $\\gamma_{ab} = \\lim_{t\\to\\infty} \\int ds G_{ab}(t, s)$ is the susceptibility tensor. The Green's function $G_{ab}(t, s)$ represents the response of a weight feeding hidden node a due to a stimulus applied at the gradient with respect to a weight feeding node b. It is obtained by monitoring $n_h + 1$ learning processes, one being the original and each of the other $n_h$ processes having constant stimuli at the gradients with respect to one of the hidden nodes, viz., \n\ndw_{aj}^{(b)}(t)/dt = -(1/N) \\partial E/\\partial w_{aj}^{(b)} + \\eta \\delta_{ab}, \\quad b = 1, \\ldots, n_h.   (10) \n\nWhen the dynamics has reached the steady state, the measurement $\\langle w_{aj}^{(b)} - w_{aj} \\rangle$ yields the quantity $\\eta \\gamma_{ab}$. \n\nWe will also compare the results with those obtained by extending the analysis of linear unlearning leave-one-out (LULOO) validation [6]. Consider the case that the regularization R(w) takes the form of a weight decay term, $R(w) = N \\sum_{ab} \\lambda_{ab} w_a \\cdot w_b / 2$. The cavity activations will be given by \n\nh_a^\\mu = x_a^\\mu + \\sum_b [ (1/N) \\sum_{jk} \\xi_j^\\mu (A + Q)^{-1}_{aj,bk} \\xi_k^\\mu ] / [ 1 - (\\epsilon_\\mu''/N) \\sum_{cj,dk} \\xi_j^\\mu f'(x_c^\\mu) (A + Q)^{-1}_{cj,dk} f'(x_d^\\mu) \\xi_k^\\mu ] \\partial\\epsilon(\\sum_c f(x_c^\\mu), y^\\mu)/\\partial x_b^\\mu,   (11) \n\nwhere $\\epsilon_\\mu''$ represents the second derivative of $\\epsilon$ with respect to the student output for example $\\mu$, the matrix $A_{aj,bk} = \\lambda_{ab} \\delta_{jk}$, and Q is given by \n\nQ_{aj,bk} = (1/N) \\sum_\\mu \\xi_j^\\mu f'(x_a^\\mu) f'(x_b^\\mu) \\xi_k^\\mu.   (12) \n\nThe LULOO result of (11) differs from the cavity result of (9) in that the susceptibility $\\gamma_{ab}$ now depends on the example label $\\mu$, and needs to be computed by inverting the matrix A + Q. Note also that second derivatives of the error term have been neglected. 
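The n_h + 1 monitored processes of (10) can be sketched as follows. This is an illustrative stand-in rather than the simulation used in the paper: tanh is assumed as the activation f, and all sizes, rates and the weight decay are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, nh, lam = 100, 200, 3, 0.2        # illustrative sizes and weight decay

f = np.tanh                              # assumed activation function
fprime = lambda x: 1.0 - np.tanh(x) ** 2

# Teacher committee machine and data.
B = rng.standard_normal((nh, N)) / np.sqrt(N)
xi = rng.standard_normal((p, N))
y = f(xi @ B.T).sum(axis=1) + 0.3 * rng.standard_normal(p)

def grad(W):
    # Gradient of E = sum_mu (sum_a f(x_a) - y)^2/2 + N*lam*sum_a |w_a|^2/2, divided by N.
    X = xi @ W.T                         # hidden activations x_a^mu, shape (p, nh)
    err = f(X).sum(axis=1) - y
    return ((err[:, None] * fprime(X)).T @ xi + N * lam * W) / N

eta, dt, steps = 0.01, 0.05, 4000
W0 = 0.1 * rng.standard_normal((nh, N))
Ws = [W0.copy() for _ in range(nh + 1)]  # process 0 is the original; process b stimulates node b
for _ in range(steps):
    for b in range(nh + 1):
        g = -grad(Ws[b])
        if b > 0:
            g[b - 1] += eta              # constant stimulus at the gradients of node b, eq. (10)
        Ws[b] += dt * g

# gamma_ab: mean steady-state response of node-a weights to the stimulus at node b.
gamma = np.array([[np.mean(Ws[b + 1][a] - Ws[0][a]) / eta
                   for b in range(nh)] for a in range(nh)])
```

The cost grows only linearly with the number of hidden nodes, since one extra network is monitored per node instead of one per left-out example.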
\n\nTo verify the proposed method by simulations, we generate examples from a noisy \nteacher network which is a committee machine \n\nyJL = ~ erf yf2Ba \u00b7 f + (Jzw \n\nnh \n\n(1 \n\n) \n\n(13) \n\nHere Ba is the teacher vector at the hidden node a. \n(J is the noise level. ~'j and \nzJL are Gaussian variables with zero means and unit variances. Learning is done by \nthe gradient descent of the energy function \n\n(14) \n\n\fand the weight decay parameter ,X is the hyperparameter to be optimized. The \ngeneralization error fg is given by \n\nwhere the averaging is performed over the distribution of input { and noise z. It can \nbe computed analytically in terms of the inner products Q ab = wa . Wb, Tab = Ba . Bb \nand Rab = Ba . Wb [10]. However, this target result is only known by the teacher, \nsince Tab and Rab are not accessible by the student. \nFigure 1 shows the simulation results of 4 randomly generated samples. Four results \nare compared: the target generalization error observed by the teacher, and those \nestimated by the cavity method, cross-validation and extended LULOO. It can \nbe seen that the cavity method yields estimates of the optimal weight decay with \ncomparable precision as the other methods. \n\nFor a more systematic comparison, we search for the optimal weight decay in 10 sam(cid:173)\nples using golden section search [11] for the same parameters as in Fig. 1. Compared \nwith the target results, the standard deviations of the estimated optimal weight de(cid:173)\ncays are 0.3, 0.25 and 0.24 for the cavity method, sevenfold cross-validation and \nextended LULOO respectively. In another simulation of 80 samples of the single(cid:173)\nlayer perceptron, the estimated optimal weight decays have standard deviations of \n1.2, 1.5 and 1.6 for the cavity method, tenfold cross-validation and extended LU(cid:173)\nLOO respectively (the parameters in the simulations are N = 500, p = 400 and a \nranging from 0.98 to 2.56). 
\n\nTo put these results in perspective, we mention that the computational resources \nneeded by the cavity method is much less than the other estimations. For example, \nin the single-layer perceptrons, the CPU time needed to estimate the optimal weight \ndecay using the golden section search by the teacher, the cavity method, tenfold \ncross-validation and extended LULOO are in the ratio of 1 : 1.5 : 3.0 : 4.6. \n\nBefore concluding this section, we mention that it is possible to derive an expression \nof the gradient dEg I d,X of the estimated generalization error with respect to the \nweight decay. This provides us an even more powerful tool for hyperparameter \nestimation. In the case of the search for one hyperparameter, the gradient enables \nus to use the binary search for the zero of the gradient, which converges faster \nthan the golden section search. In the single-layer experiment we mentioned, its \nprecision is comparable to fivefold cross-validations, and its CPU time is only 4% \nmore than the teacher's search. Details will be presented elsewhere. In the case of \nmore than one hyperparameters, the gradient information will save us the need for \nan exhaustive search over a multidimensional hyperparameter space. \n\n3 Dynamical Hyperparameter Estimation \n\nThe second example concerns the estimation of a dynamical hyperparameter, \nnamely the optimal early stopping time, in cases where overtraining may plague \nthe generalization ability at the steady state. In perceptrons, when the examples \nare noisy and the weight decay is weak, the generalization error decreases in the \nearly stage of learning, reaches a minimum and then increases towards its asymp(cid:173)\ntotic value [12, 9]. Since the early stopping point sets in before the system reaches \nthe steady state, most analyses based on the equilibrium state are not applicable. \nCross-validation stopping has been proposed as an empirical method to control \novertraining [13]. 
Here we propose the cavity method as a convenient alternative. \n\nFigure 1: (a-d) The dependence of the generalization error of the multilayer perceptron on the weight decay $\\lambda$ for N = 200, p = 700, $n_h$ = 3, $\\sigma$ = 0.8 in 4 samples. The solid symbols locate the optimal weight decays estimated by the teacher (circle), the cavity method (square), extended LULOO (diamond) and sevenfold cross-validation (triangle). \n\nIn single-layer perceptrons, the cavity activations of the examples evolve according to (6), enabling us to estimate the dynamical evolution of the estimated generalization error while learning proceeds. The remaining issue is the measurement of the time-dependent Green's function. We propose to introduce an initial homogeneous stimulus, that is, $\\delta h_j(t) = \\eta \\delta(t)$ for all j. Again, assuming normalized and independent inputs with $\\langle \\xi_j \\rangle = 0$ and $\\langle \\xi_j \\xi_k \\rangle = \\delta_{jk}$, we can see from (4) that the measurement $\\langle \\tilde{w}_j(t) - w_j(t) \\rangle$ yields the quantity $\\eta G(t, 0)$. \n\nWe will first consider systems that are time-translationally invariant, i.e., G(t, s) = G(t - s, 0). Such are the cases of Adaline learning and linear regression [9], where the cavity activation can be written as \n\nh^\\mu(t) = x^\\mu(t) + \\int ds G(t - s, 0) \\partial\\epsilon(x^\\mu(s), y^\\mu)/\\partial x^\\mu(s).   (16) \n\nThis allows us to estimate the generalization error $\\epsilon_g(t)$ via $\\epsilon_g(t) = \\sum_\\mu \\epsilon(h^\\mu(t), y^\\mu)/p$, whose minimum in time determines the early stopping point. \n\nTo verify the proposed method in linear regression, we randomly generate examples from a noisy teacher with $y^\\mu = B \\cdot \\xi^\\mu + \\sigma z^\\mu$. Here B is the teacher vector with $B^2 = 1$. 
$\\xi_j^\\mu$ and $z^\\mu$ are independently generated with zero means and unit variances. Learning is done by the gradient descent of the energy function $E = \\sum_\\mu (y^\\mu - w(t) \\cdot \\xi^\\mu)^2 / 2$. The generalization error $\\epsilon_g(t)$ is the error averaged over the distribution of inputs $\\xi$ and their corresponding outputs y, i.e., $\\epsilon_g(t) = \\langle (B \\cdot \\xi + \\sigma z - w \\cdot \\xi)^2 / 2 \\rangle$. As far as the teacher is concerned, $\\epsilon_g(t)$ can be computed as $\\epsilon_g(t) = [1 - 2R(t) + Q(t) + \\sigma^2]/2$, where $R(t) = w(t) \\cdot B$ and $Q(t) = w(t)^2$. \n\nFigure 2 shows the simulation results of 6 randomly generated samples. Three results are compared: the teacher's estimate, the cavity method and cross-validation. Since LULOO is based on the equilibrium state, it cannot be used in the present context. Again, we see that the cavity method yields estimates of the early stopping time with precision comparable to cross-validation. The ratio of the CPU times between the cavity method and fivefold cross-validation is 1 : 1.4. \n\nFor nonlinear regression and multilayer networks, the Green's functions are not time-translationally invariant. To estimate the Green's functions in this case, we have devised another scheme of stimuli. Preliminary results for the determination of the early stopping point are satisfactory, and final results will be presented elsewhere. \n\nFigure 2: (a-f) The evolution of the generalization error of linear regression for N = 500, p = 600 and $\\sigma$ = 1. The solid symbols locate the early stopping points estimated by the teacher (circle), the cavity method (square) and fivefold cross-validation (diamond). 
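For this time-translationally invariant case, the impulse measurement of G(t, 0) and the convolution in (16) can be sketched as follows. All sizes and the learning schedule are illustrative; the impulse stimulus eta*delta(t) is implemented by shifting the stimulated weights by eta at t = 0.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, sigma = 200, 240, 1.0       # illustrative sizes; sigma is the teacher noise level

B = rng.standard_normal(N)
B /= np.linalg.norm(B)            # teacher vector with B^2 = 1
xi = rng.standard_normal((p, N))
y = xi @ B + sigma * rng.standard_normal(p)

def grad(w):
    # Gradient of E = sum_mu (y - x)^2/2, divided by N; no weight decay (overtraining regime).
    return xi.T @ (xi @ w - y) / N

eta, dt, steps = 0.1, 0.02, 1500
w = np.zeros(N)
w_imp = np.zeros(N) + eta         # impulse at t = 0 shifts every weight by eta

G = np.empty(steps)               # G[t] approximates G(t*dt, 0)
grads_x = np.empty((steps, p))    # stored de/dx^mu(s) = x^mu(s) - y^mu
eps_loo = np.empty(steps)
for t in range(steps):
    x = xi @ w
    G[t] = np.mean(w_imp - w) / eta
    grads_x[t] = x - y
    # Cavity activations by discrete convolution, eq. (16).
    h = x + dt * (G[t::-1] @ grads_x[: t + 1])
    eps_loo[t] = np.mean((y - h) ** 2) / 2
    w -= dt * grad(w)
    w_imp -= dt * grad(w_imp)

t_stop = np.argmin(eps_loo) * dt  # estimated early stopping time
```

Only one extra network is evolved alongside the original, and the full leave-one-out curve eps_loo(t) is obtained in a single pass rather than by repeated validation runs.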
\n\n4 Conclusion \n\nWe have proposed a method for the fast estimation of hyperparameters in large \nnetworks, based on the linear response relation in the cavity method, combined \nwith an empirical method of measuring the Green's function. Its efficiency depends \non the independent and identical distribution of the inputs, greatly reducing the \nnumber of networks to be monitored. It does not require the validation process \nor the inversion of matrices of macroscopic size, and hence its speed compares \nfavorably with cross-validation and other perturbative approaches such as extended \nLULOO. For multilayer networks, we will explore further speedup of the Green's \nfunction measurement by multiplexing the stimuli to the different hidden units into \na single network, to be compared with a reference network. We will also extend the \ntechnique to other benchmark data to study its applicability. \n\nOur initial success indicates that it is possible to generalize the method to more \ncomplicated systems in the future. The concept of Green's functions is very general, \nand its measurement by comparing the states of a stimulated system with a reference \none can be adopted to general cases with suitable adaptation. Recently, much \nattention is paid to the issue of model selection in support vector machines [3, 4, 5]. \nIt would be interesting to consider how the proposed method can contribute to these \ncases. \n\n\fAcknowledgements \n\nWe thank C. Campbell for interesting discussions and H. Nishimori for encourage(cid:173)\nment. This work was supported by the grant HKUST6157/99P from the Research \nGrant Council of Hong Kong. \n\nReferences \n\n[1] C. M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford \n\n(1995). \n\n[2] G. B. Orr and K-R. Muller, eds., Neural Networks: Tricks of the Trade, Springer, \n\nBerlin (1998). \n\n[3] O. Chapelle and V. N. Vapnik, Advances in Neural Information Processing Systems \n12, S. A. Solla, T. 
K. Leen and K.-R. Muller, eds., MIT Press, Cambridge, 230 (2000). \n\n[4] S. S. Keerthi, Technical Report CD-01-02, http://guppy.mpe.nus.edu.sg/mpessk/nparm.html (2001). \n\n[5] M. Opper and O. Winther, Advances in Large Margin Classifiers, A. J. Smola, P. Bartlett, B. Sch\u00f6lkopf and D. Schuurmans, eds., MIT Press, Cambridge, 43 (1999). \n\n[6] J. Larsen and L. K. Hansen, Advances in Computational Mathematics 5, 269 (1996). \n\n[7] M. Opper and O. Winther, Phys. Rev. Lett. 76, 1964 (1996). \n\n[8] A. L. Fetter and J. D. Walecka, Quantum Theory of Many-Particle Systems, McGraw-Hill, New York (1971). \n\n[9] K. Y. M. Wong, S. Li and Y. W. Tong, Phys. Rev. E 62, 4036 (2000). \n\n[10] D. Saad and S. A. Solla, Phys. Rev. Lett. 74, 4337 (1995). \n\n[11] W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, Cambridge (1990). \n\n[12] A. Krogh and J. A. Hertz, J. Phys. A 25, 1135 (1992). \n\n[13] S. Amari, N. Murata, K.-R. Muller, M. Finke and H. H. Yang, IEEE Trans. on Neural Networks 8, 985 (1997).", "award": [], "sourceid": 2104, "authors": [{"given_name": "K.", "family_name": "Wong", "institution": null}, {"given_name": "F.", "family_name": "Li", "institution": null}]}