{"title": "The Learning Dynamics of a Universal Approximator", "book": "Advances in Neural Information Processing Systems", "page_first": 288, "page_last": 294, "abstract": null, "full_text": "The Learning Dynamics of \na Universal Approximator \n\nAnsgar H. L. West1,2 \nA.H.L.West@aston.ac.uk \n\nDavid Saad1 \nD.Saad@aston.ac.uk \n\nIan T. Nabney1 \nI.T.Nabney@aston.ac.uk \n\n1 Neural Computing Research Group, University of Aston \nBirmingham B4 7ET, U.K. \nhttp://www.ncrg.aston.ac.uk/ \n\n2 Department of Physics, University of Edinburgh \nEdinburgh EH9 3JZ, U.K. \n\nAbstract \n\nThe learning properties of a universal approximator, a normalized \ncommittee machine with adjustable biases, are studied for on-line \nback-propagation learning. Within a statistical mechanics framework, \nnumerical studies show that this model has features which \ndo not exist in previously studied two-layer network models without \nadjustable biases, e.g., attractive suboptimal symmetric phases \neven for realizable cases and noiseless data. \n\n1 INTRODUCTION \n\nRecently there has been much interest in the theoretical breakthrough in the \nunderstanding of the on-line learning dynamics of multi-layer feedforward perceptrons \n(MLPs) using a statistical mechanics framework. In the seminal paper (Saad & \nSolla, 1995), a two-layer network with an arbitrary number of hidden units was \nstudied, allowing insight into the learning behaviour of neural network models whose \ncomplexity is of the same order as those used in real world applications. \nThe model studied, a soft committee machine (Biehl & Schwarze, 1995), consists of \na single hidden layer with adjustable input-hidden, but fixed hidden-output weights. 
\nThe average learning dynamics of these networks are studied in the thermodynamic \nlimit of infinite input dimensions in a student-teacher scenario, where a student \nnetwork is presented serially with training examples $(\\xi^\\mu, \\zeta^\\mu)$ labelled by a teacher \nnetwork of the same architecture but a possibly different number of hidden units. \nThe student updates its parameters on-line, i.e., after the presentation of each \nexample, along the gradient of the squared error on that example, an algorithm \nusually referred to as back-propagation. \nAlthough the above model is already quite similar to real world networks, the \napproach suffers from several drawbacks. First, the analysis of the mean learning \ndynamics employs the thermodynamic limit of infinite input dimension, a problem \nwhich has been addressed in (Barber et al., 1996), where finite size effects have \nbeen studied and it was shown that the thermodynamic limit is relevant in most \ncases. Second, the hidden-output weights are kept fixed, a constraint which has \nbeen removed in (Riegler & Biehl, 1995), where it was shown that the learning \ndynamics are usually dominated by the input-hidden weights. Third, the biases of \nthe hidden units were fixed to zero, a constraint which is actually more severe than \nfixing the hidden-output weights. We show in Appendix A that soft committee \nmachines are universal approximators provided one allows for adjustable biases in \nthe hidden layer. \nIn this paper, we therefore study the model of a normalized soft committee machine \nwith variable biases following the framework set out in (Saad & Solla, 1995). We \npresent numerical studies of a variety of learning scenarios which lead to remarkable \neffects not present for the model with fixed biases. 
\n\n2 DERIVATION OF THE DYNAMICAL EQUATIONS \n\nThe student network we consider is a normalized soft committee machine of $K$ \nhidden units with adjustable biases. Each hidden unit $i$ consists of a bias $\\theta_i$ and a \nweight vector $W_i$ which is connected to the $N$-dimensional inputs $\\xi$. All hidden units \nare connected to a linear output unit with arbitrary but fixed gain $\\gamma$ by couplings \nof fixed strength. The activation of any unit is normalized by the inverse square \nroot of the number of weight connections into the unit, which allows all weights to \nbe of $O(1)$ magnitude, independent of the input dimension or the number of hidden \nunits. The implemented mapping is therefore $f_w(\\xi) = (\\gamma/\\sqrt{K}) \\sum_{i=1}^{K} g(u_i - \\theta_i)$, \nwhere $u_i = W_i \\cdot \\xi/\\sqrt{N}$ and $g(\\cdot)$ is a sigmoidal transfer function. The teacher network \nto be learned is of the same architecture except for a possible difference in \nthe number of hidden units $M$ and is defined by the weight vectors $B_n$ and biases \n$\\rho_n$ ($n = 1, \\ldots, M$). Training examples are of the form $(\\xi^\\mu, \\zeta^\\mu)$, where the \ninput vectors $\\xi^\\mu$ are drawn from the normal distribution and the outputs are \n$\\zeta^\\mu = (\\gamma/\\sqrt{M}) \\sum_{n=1}^{M} g(v_n^\\mu - \\rho_n)$, where $v_n^\\mu = B_n \\cdot \\xi^\\mu/\\sqrt{N}$. \nThe weights and biases are updated in response to the presentation of an example \n$(\\xi^\\mu, \\zeta^\\mu)$, along the gradient of the squared error measure $\\epsilon = \\frac{1}{2}[\\zeta^\\mu - f_w(\\xi^\\mu)]^2$ \n\n$$\\theta_i^{\\mu+1} - \\theta_i^\\mu = -\\frac{\\eta_\\theta}{N}\\,\\delta_i^\\mu \\quad \\text{and} \\quad W_i^{\\mu+1} - W_i^\\mu = \\frac{\\eta_w}{\\sqrt{N}}\\,\\delta_i^\\mu\\,\\xi^\\mu \\qquad (1)$$ \n\nwith $\\delta_i^\\mu \\equiv [\\zeta^\\mu - f_w(\\xi^\\mu)]\\,g'(u_i^\\mu - \\theta_i)$. The two learning rates are $\\eta_w$ for the weights \nand $\\eta_\\theta$ for the biases. In order to analyse the mean learning dynamics resulting \nfrom the above update equations, we follow the statistical mechanics framework in \n(Saad & Solla, 1995). Here we will only outline the main ideas and concentrate on \nthe results of the calculation. 
\nAs we are interested in the typical behaviour of our training algorithm we average \nover all possible instances of the examples $\\xi$. We rewrite the update equations (1) \nin $W_i$ as equations in the order parameters describing the overlaps between pairs \nof student nodes $Q_{ij} = W_i \\cdot W_j/N$, student and teacher nodes $R_{in} = W_i \\cdot B_n/N$, \nand teacher nodes $T_{nm} = B_n \\cdot B_m/N$. The generalization error $\\epsilon_g$, measuring the \ntypical performance, can be expressed solely in these variables and the biases $\\theta_i$ and \n$\\rho_n$. The order parameters $Q_{ij}$, $R_{in}$ and the biases $\\theta_i$ are the dynamical variables. \nThese quantities need to be self-averaging with respect to the randomness in the \ntraining data in the thermodynamic limit ($N \\to \\infty$), which enforces two necessary \nconstraints on our calculation. First, the number of hidden units must satisfy $K \\ll N$, whereas \none needs $K \\sim O(N)$ for the universal approximation proof to hold. Second, one \ncan show that the updates of the biases have to be of $O(1/N)$, i.e., the bias learning \nrate has to be scaled by $1/N$, in order to make the biases self-averaging quantities, \na fact that is confirmed by simulations [see Fig. 1]. If we interpret the normalized \nexample number $\\alpha = \\mu/N$ as a continuous time variable, the update equations for \nthe order parameters and the biases become first order coupled differential equations \n\n$$\\frac{dQ_{ij}}{d\\alpha} = \\eta_w \\langle \\delta_i u_j + \\delta_j u_i \\rangle_\\xi + \\eta_w^2 \\langle \\delta_i \\delta_j \\rangle_\\xi, \\quad \\frac{dR_{in}}{d\\alpha} = \\eta_w \\langle \\delta_i v_n \\rangle_\\xi, \\quad \\text{and} \\quad \\frac{d\\theta_i}{d\\alpha} = -\\eta_\\theta \\langle \\delta_i \\rangle_\\xi . \\qquad (2)$$ \n\nChoosing $g(x) = \\mathrm{erf}(x/\\sqrt{2})$ as the sigmoidal transfer function, most integrations in Eqs. (2) \ncan be performed analytically, except for single Gaussian integrals that remain for the $\\eta_w^2$ \nterms and the generalization error. The exact form of the resulting dynamical \nequations is quite complicated and will be presented elsewhere. 
Here we only remark \nthat the gain $\\gamma$ of the linear output unit, which determines the output scale, \nmerely rescales the learning rates with $\\gamma^2$ and can therefore be set to one without \nloss of generality. Due to the numerical integrations required, the differential equations \ncan only be solved accurately in moderate times for smaller student networks \n($K \\le 5$) but any teacher size $M$. \n\n3 ANALYSIS OF THE DYNAMICAL EQUATIONS \n\nThe dynamical evolution of the overlaps $Q_{ij}$, $R_{in}$ and the biases $\\theta_i$ follows from \nintegrating the equations of motion (2) from initial conditions determined by the \n(random) initialization of the student weights $W_i$ and biases $\\theta_i$. For random \ninitialization the resulting norms $Q_{ii}$ of the student vectors will be of order $O(1)$, while \nthe overlaps $Q_{ij}$ between different student vectors, and student-teacher vectors $R_{in}$ \nwill be only of order $O(1/\\sqrt{N})$. A random initialization of the weights and biases can \ntherefore be simulated by initializing the norms $Q_{ii}$, the biases $\\theta_i$ and the normalized \noverlaps $\\hat{Q}_{ij} = Q_{ij}/\\sqrt{Q_{ii}Q_{jj}}$ and $\\hat{R}_{in} = R_{in}/\\sqrt{Q_{ii}T_{nn}}$ from uniform distributions \nin the $[0,1]$, $[-1,1]$, and $[-10^{-12},10^{-12}]$ intervals respectively. \nWe find that the results of the numerical integration are sensitive to these random \ninitial values, which has not been the case to this extent for fixed biases. \nFurthermore, the dynamical behaviour can become very complex even for realizable \ncases ($K = M$) and networks with three or four hidden units. For the sake of \nsimplicity, we will therefore restrict our presentation to networks with two hidden \nunits ($K = M = 2$) and uncorrelated isotropic teachers, defined by $T_{nm} = \\delta_{nm}$, although \nlarger networks and graded teacher scenarios were investigated extensively \nas well. We have further limited our scope by investigating a common learning \nrate ($\\eta_0 = \\eta_\\theta = \\eta_w$) for biases and weights. 
To study the effect of different weight \ninitialization, we have fixed the initial values of the student-student overlaps $Q_{ij}$ \nand biases $\\theta_i$, as these can be manipulated freely in any learning scenario. Only the \ninitial student-teacher overlaps $R_{in}$ are randomized as suggested above. \nIn Fig. 1 we compare the evolution of the overlaps, the biases and the generalization \nerror for the soft committee machine with and without adjustable biases learning a \nsimilar realizable teacher task. The student denoted by * lacks biases, i.e., $\\theta_i = 0$, \nand learns to imitate an isotropic teacher with zero biases ($\\rho_n = 0$). The other \nstudent features adjustable biases, trained from an isotropic teacher with small \nbiases ($\\rho_{1,2} = \\mp 0.1$). For both scenarios, the learning rate and the initial conditions \nwere judiciously chosen to be $\\eta_0 = 2.0$, $Q_{11} = 0.1$, $Q_{22} = 0.2$, \n$R_{in} = Q_{12} = U[-10^{-12},10^{-12}]$ with $\\theta_1 = 0.0$ and $\\theta_2 = 0.5$ for the student with adjustable biases. \nIn both cases, the student weight vectors (Fig. 1a) are drawn quickly from their \ninitial values into a suboptimal symmetric phase, characterized by the lack of specialization \nof the student hidden units on a particular teacher hidden unit, as can \nbe seen from the similar values of $R_{in}$ in Fig. 1b. This symmetry is broken 
\nalmost immediately in the learning scenario with adjustable biases and the student \nconverges quickly to the optimal solution, characterized by the evolution of the \noverlap matrices $Q$, $R$ and biases $\\theta_i$ (see Fig. 1c) to their optimal values $T$ and \n$\\rho_n$ (up to the permutation symmetry due to the arbitrary labeling of the student \nnodes). Likewise, the generalization error $\\epsilon_g$ decays to zero in Fig. 1d. The student \nwith fixed biases is trapped for most of its training time in the symmetric phase \nbefore it eventually converges. \n\n[Figure 1, panels (a)-(d): $Q_{ij}$, $R_{in}$, $\\theta_i$ and $\\epsilon_g$ plotted against $\\alpha$.] \n\nFigure 1: The dynamical evolution of the student-student overlaps $Q_{ij}$ (a), and the \nstudent-teacher overlaps $R_{in}$ (b) as a function of the normalized example number $\\alpha$ \nis compared for two student-teacher scenarios: One student (denoted by *) has fixed \nzero biases, the other has adjustable biases. The influence of the symmetry in the \ninitialization of the biases on the dynamics is shown for the student biases $\\theta_i$ (c), \nand the generalization error $\\epsilon_g$ (d): $\\theta_1 = 0$ is kept for all runs, but the initial value \nof $\\theta_2$ varies and is given in brackets in the legends. Finite size simulations for input \ndimensions $N = 10 \\ldots 500$ show that the dynamical variables are self-averaging. \n\nExtensive simulations for input dimensions $N = 10 \\ldots 500$ confirm that the dynamic \nvariables are self-averaging and show that variances decrease with $1/N$. The mean \ntrajectories are in good agreement with the theoretical predictions even for very \nsmall input dimensions ($N = 10$) and are virtually indistinguishable for $N = 500$. 
\n\n[Figure 2, panels (a) and (b): (a) $\\theta_i$ plotted against $\\alpha$; (b) normalized convergence time plotted against the initial value of $\\theta_2$.] \n\nFigure 2: (a) The dynamical evolution of the biases $\\theta_i$ for a student imitating an \nisotropic teacher with zero biases reveals symmetric dynamics for $\\theta_1$ and $\\theta_2$. The \nstudent was randomly initialized identically for the different runs, except for a change \nin the range of the random initialization of the biases ($U[-b,b]$), with the value of \n$b$ given in the legend. Above a critical value of $b$ the student remains stuck in a \nsuboptimal phase. (b) The normalized convergence time $\\eta_0 \\alpha_c$ is shown as a \nfunction of the initialization of $\\theta_2$ for various learning rates $\\eta_0$ (see legend; $\\eta_0^2 = 0$ \nsymbolizes the dynamics neglecting $\\eta_0^2$ terms). \n\nThe length of the symmetric phase for the isotropic teacher scenario is dominated \nby the learning rate (it is linearly dependent on $1/\\eta_0$ for small learning rates), \nbut also exhibits a logarithmic dependence on the typical 
\ndifferences in the initial student-teacher overlaps $R_{in}$ (Biehl et al., 1996) which are \ntypically of order $O(1/\\sqrt{N})$ and cannot be influenced in real scenarios without a \npriori knowledge. The initialization of the biases, however, can be controlled by \nthe user and its influence on the learning dynamics is shown in Figs. 1c and 1d for \nthe biases and the generalization error respectively. For initially identical biases \n($\\theta_1 = \\theta_2 = 0$), the evolution of the order parameters and hence the generalization \nerror is almost indistinguishable from the fixed biases case. A breaking of this \nsymmetry leads to a decrease of the length of the symmetric phase that is linear in $\\log(|\\theta_1 - \\theta_2|)$ until \nit has all but disappeared. The dynamics are again slowed down for very large \ninitialization of the biases (see Fig. 1d), where the biases have to travel a long way to \ntheir optimal values. \nThis suggests that for a given learning rate the biases have a dominant effect in \nthe learning process and strongly break existing symmetries in weight space. This \nis arguably due to a steep minimum in the generalization error surface along the \ndirection of the biases. To confirm this, we have studied a range of other learning \nscenarios including larger networks and non-isotropic teachers, e.g., graded teachers \nwith $T_{nm} = n\\delta_{nm}$. Even when the norms of the teacher weight vectors are strongly \ngraded, which also breaks the weight symmetry and reduces the symmetric phase \nsignificantly in the case of fixed biases, we have found that the biases usually have \nthe stronger symmetry breaking effect: the trajectories of the biases never cross, \nprovided that they were not initialized too symmetrically. \nThis would seem to promote initializing the biases of the student hidden units evenly \nacross the input domain, which has been suggested previously on a heuristic basis \n(Nguyen & Widrow, 1990). However, this can lead to the student being stuck in a \nsuboptimal configuration. In Fig. 
2a, we show the dynamics of the student biases $\\theta_i$ \nwhen the teacher biases are symmetric ($\\rho_n = 0$). We find that the student's progress \nis inversely related to the magnitude of the bias initialization and that the student finally fails to \nconverge at all. It remains in a suboptimal phase characterized by biases of the same \nlarge magnitude but opposite sign and highly correlated weight vectors. In effect, \nthe outputs of the two student nodes cancel out over most of the input domain. In \nFig. 2b, the influence of the learning rate in combination with the bias initialization \nin determining convergence is illustrated. The convergence time $\\alpha_c$, defined as the \nexample number at which the generalization error has decayed to a small value, \nhere judiciously chosen to be $10^{-8}$, is shown as a function of the initial value of $\\theta_2$ \nfor various learning rates $\\eta_0$. For convenience, we have normalized the convergence \ntime with $1/\\eta_0$. The initialization of the other order parameters is identical to \nFig. 1a. One finds that the convergence time diverges for all learning rates above \na critical initial value of $\\theta_2$. For increasing learning rates, this transition becomes \nsharper and occurs at smaller $\\theta_2$, i.e., the dynamics become more sensitive to the \nbias initialization. \n\n4 SUMMARY AND DISCUSSION \n\nThis research has been motivated by recent progress in the theoretical study of \non-line learning in realistic two-layer neural network models, specifically the soft committee \nmachine trained with back-propagation (Saad & Solla, 1995). The studies so far \nhave excluded biases to the hidden layers, a constraint which has been removed in \nthis paper, which makes the model a universal approximator. The dynamics of the \nextended model turn out to be very rich and more complex than those of the original model. \nIn this paper, we have concentrated on the effect of initialization of student weights \nand biases. 
We have further restricted our presentation for simplicity to realizable \ncases and small networks with two hidden units, although larger networks were \nstudied for comparison. Even in these simple learning scenarios, we find surprising \ndynamical effects due to the adjustable biases. In the case where the teacher \nnetwork exhibits distinct biases, asymmetric initial values of the student biases \nbreak the node symmetry in weight space effectively and can speed up the learning \nprocess considerably, suggesting that student biases should in practice be initially \nspread evenly across the input domain if there is no a priori knowledge of the function \nto be learned. For degenerate teacher biases, however, such a scheme can be \ncounterproductive as different initial student bias values slow down the learning \ndynamics and can even lead to the student being stuck in suboptimal fixed points, \ncharacterized by student biases being grouped symmetrically around the degenerate \nteacher biases and strong correlations between the associated weight vectors. \nIn fact, these attractive suboptimal fixed points exist even for non-degenerate \nteacher biases, but the range of initial conditions attracted to these suboptimal \nnetwork configurations decreases in size. Furthermore, this domain is shifted to \nvery large initial student biases as the difference in the values of the teacher biases \nis increased. We have found these effects also for larger network sizes, where the \ncomplexity of the dynamics and the number of attractive suboptimal fixed points with different internal \nsymmetries increase. Although attractive suboptimal fixed points were also found \nin the original model (Biehl et al., 1996), the basins of attraction of initial values \nare in general very small and are therefore only of academic interest. 
\nHowever, our numerical work suggests that a simple rule of thumb to avoid being \nattracted to suboptimal fixed points is to always initialize the squared norm of a \nweight vector larger than the magnitude of the corresponding bias. This scheme \nwill still support spreading of the biases across the main input domain in order to \nencourage node symmetry breaking. This is somewhat similar to previous findings \n(Nguyen & Widrow, 1990; Kim & Ra, 1991), the former suggesting spreading the \nbiases across the input domain, the latter relating the minimal initial size of each \nweight with the learning rate. This work provides a more theoretical motivation for \nthese results and also distinguishes between the different roles of biases and weights. \nIn this paper we have addressed mainly one important issue for theoreticians and \npractitioners alike: the initialization of the student network weights and biases. \nOther important issues, notably the question of optimal and maximal learning rates \nfor different network sizes during convergence, will be reported elsewhere. \n\nA THEOREM \n\nLet $S_g$ denote the class of neural networks defined by sums of the form $\\sum_{i=1}^{K} n_i g(u_i - \\theta_i)$ \nwhere $K$ is arbitrary (representing an arbitrary number of hidden units), $\\theta_i \\in \\mathbb{R}$ and $n_i \\in \\mathbb{Z}$ \n(i.e. integer weights). Let $\\psi(x) \\equiv \\partial g(x)/\\partial x$ and let $\\Phi_\\psi$ denote the class of networks defined \nby sums of the form $\\sum_{i=1}^{K} w_i \\psi(u_i - \\theta_i)$ where $w_i \\in \\mathbb{R}$. If $g$ is continuously differentiable and \nif the class $\\Phi_\\psi$ are universal approximators, then $S_g$ is a class of universal approximators; \nthat is, such functions are dense in the space of continuous functions with the $L_\\infty$ norm. 
\nAs a corollary, the normalized soft committee machine forms a class of universal approximators \nwith both sigmoid and error transfer functions [since radial basis function networks \nare universal (Park & Sandberg, 1993) and we need consider only the one-dimensional input \ncase as noted in the proof below]. Note that some restriction on $g$ is necessary: if $g$ is \nthe step function, then with arbitrary hidden-output weights, the network is a universal \napproximator, while with fixed hidden-output weights it is not. \n\nA.1 Proof \n\nBy the arguments of (Hornik et al., 1990) which use the properties of trigonometric polynomials, \nit is sufficient to consider the case of one-dimensional input and output spaces. \nLet $I$ denote a compact interval in $\\mathbb{R}$ and let $f$ be a continuous function defined on $I$. \nBecause $\\Phi_\\psi$ is universal, given any $\\epsilon > 0$ we can find weights $w_i$ and biases $\\theta_i$ such that \n\n$$\\left\\| f - \\sum_{i=1}^{K} w_i \\psi(u - \\theta_i) \\right\\|_\\infty < \\frac{\\epsilon}{2} \\qquad (i)$$ \n\nBecause the rationals are dense in the reals, without loss of generality we can assume \nthat the weights $w_i \\in \\mathbb{Q}$. Since $\\psi(x)$ is continuous and $I$ is compact, the convergence of \n$[g(x + h) - g(x)]/h$ to $\\partial g(x)/\\partial x$ is uniform and hence for all $n > n(\\epsilon/(2K|w_i|))$ the following \ninequality holds: \n\n$$\\left\\| n \\left[ g\\left(u + \\frac{1}{n} - \\theta_i\\right) - g(u - \\theta_i) \\right] - \\psi(u - \\theta_i) \\right\\|_\\infty < \\frac{\\epsilon}{2K|w_i|} \\qquad (ii)$$ \n\nAlso note that for suitable $n_i > n(\\epsilon/(2K|w_i|))$, $m_i = n_i w_i \\in \\mathbb{Z}$, as $w_i$ is a rational number. \nThus, by the triangle inequality, \n\n$$\\left\\| \\sum_{i=1}^{K} m_i \\left[ g\\left(u + \\frac{1}{n_i} - \\theta_i\\right) - g(u - \\theta_i) \\right] - \\sum_{i=1}^{K} w_i \\psi(u - \\theta_i) \\right\\|_\\infty < \\frac{\\epsilon}{2} \\qquad (iii)$$ \n\nThe result now follows from equations (i) and (iii) and the triangle inequality. \n\nReferences \nBarber, D., Saad, D., & Sollich, P. 1996. Europhys. Lett., 34, 151-156. \nBiehl, M., & Schwarze, H. 1995. J. Phys. A, 28, 643-656. \nBiehl, M., Riegler, P., & W\u00f6hler, C. 1996. University of W\u00fcrzburg Preprint WUE-ITP-96-003. \nHornik, K., Stinchcombe, M., & White, H. 1990. Neural Networks, 3, 551-560. \nKim, Y. K., & Ra, J. B. 1991. 
Pages 2396-2401 of: International Joint Conference on Neural Networks 91. \nNguyen, D., & Widrow, B. 1990. Pages C21-C26 of: IJCNN International Conference on Neural Networks 90. \nPark, J., & Sandberg, I. W. 1993. Neural Computation, 5, 305-316. \nRiegler, P., & Biehl, M. 1995. J. Phys. A, 28, L507-L513. \nSaad, D., & Solla, S. A. 1995. Phys. Rev. E, 52, 4225-4243. \n", "award": [], "sourceid": 1256, "authors": [{"given_name": "Ansgar", "family_name": "West", "institution": null}, {"given_name": "David", "family_name": "Saad", "institution": null}, {"given_name": "Ian", "family_name": "Nabney", "institution": null}]}