{"title": "A Fast Stochastic Error-Descent Algorithm for Supervised Learning and Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 244, "page_last": 251, "abstract": null, "full_text": "A Fast Stochastic Error-Descent \n\nAlgorithm for Supervised Learning and \n\nOptimization \n\nGert Cauwenberghs \n\nCalifornia Institute of Technology \n\nMail-Code 128-95 \n\nPasadena, CA 91125 \n\nE-mail: gert(Qcco. cal tech. edu \n\nAbstract \n\nA parallel stochastic algorithm is investigated for error-descent \nlearning and optimization in deterministic networks of arbitrary \ntopology. No explicit information about internal network struc(cid:173)\nture is needed. The method is based on the model-free distributed \nlearning mechanism of Dembo and Kailath. A modified parameter \nupdate rule is proposed by which each individual parameter vector \nperturbation contributes a decrease in error. A substantially faster \nlearning speed is hence allowed. Furthermore, the modified algo(cid:173)\nrithm supports learning time-varying features in dynamical net(cid:173)\nworks. We analyze the convergence and scaling properties of the \nalgorithm, and present simulation results for dynamic trajectory \nlearning in recurrent networks. \n\n1 Background and Motivation \n\nWe address general optimization tasks that require finding a set of constant param(cid:173)\neter values Pi that minimize a given error functional \u00a3(p). For supervised learning, \nthe error functional consists of some quantitative measure of the deviation between \na desired state x T and the actual state of a network x, resulting from an input y \nand the parameters p. In such context the components of p consist of the con(cid:173)\nnection strengths, thresholds and other adjustable parameters in the network. 
A typical specification for the error in learning a discrete set of pattern associations (y^(a), x^T(a)) for a steady-state network is the Mean Square Error (MSE) \n\nE(p) = (1/2) Σ_a |x^T(a) - x(a)|²   (1) \n\nand similarly, for learning a desired response (y(t), x^T(t)) in a dynamic network \n\nE(p) = (1/2) ∫ |x^T(t) - x(t)|² dt   (2) \n\nFor E(p) to be uniquely defined in the latter dynamic case, initial conditions x(t_init) need to be specified. \n\nA popular method for minimizing the error functional is steepest error descent (gradient descent) [1]-[6] \n\nΔp = -η ∂E/∂p   (3) \n\nIteration of (3) leads asymptotically to a local minimum of E(p), provided η is strictly positive and small. The computation of the gradient is often cumbersome, especially for time-dependent problems [2]-[5], and is even ill-posed for analog hardware learning systems that unavoidably contain unknown process impurities. This calls for error-descent methods that avoid calculation of the gradients and instead probe the dependence of the error on the parameters directly. Methods that use some degree of explicit internal information other than the adjustable parameters, such as Madaline III [6], which assumes a specific feedforward multi-perceptron network structure and requires access to internal nodes, are therefore excluded. Two typical methods which satisfy the above condition are illustrated below: \n\n• Weight Perturbation [7], a simple sequential parameter perturbation technique. The method updates the individual parameters in sequence, by measuring the change in error resulting from a perturbation of a single parameter and adjusting that parameter accordingly. 
This technique effectively measures the components of the gradient sequentially, which for a complete knowledge of the gradient requires as many computation cycles as there are parameters in the system. \n\n• Model-Free Distributed Learning [8], which is based on the \"M.I.T.\" rule in adaptive control [9]. Inspired by analog hardware, the distributed algorithm makes use of time-varying perturbation signals π_i(t) supplied in parallel to the parameters p_i, and correlates these π_i(t) with the instantaneous network response E(p + π) to form an incremental update Δp_i. Unfortunately, the distributed model-free algorithm does not support learning of dynamic features (2) in networks with delays, and the learning speed degrades sensibly with increasing number of parameters [8]. \n\n2 Stochastic Error-Descent: Formulation and Properties \n\nThe algorithm we investigate here combines both above methods, yielding a significant improvement in performance over both. Effectively, at every epoch the constructed algorithm decreases the error along a single randomly selected direction in the parameter space. Each such decrement is performed using a single synchronous parallel parameter perturbation per epoch. Let p̂ = p + π with parallel perturbations π_i selected from a random distribution. The perturbations π_i are assumed reasonably small, but not necessarily mutually orthogonal. For a given single random instance of the perturbation π, we update the parameters with the rule \n\nΔp = -μ Ê π   (4) \n\nwhere the scalar \n\nÊ = E(p̂) - E(p)   (5) \n\nis the error contribution due to the perturbation π, and μ is a small strictly positive constant. Obviously, for a sequential activation of the π_i, the algorithm reduces to the weight perturbation method [7]. 
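To make the update rule concrete, here is a minimal sketch (not from the paper; it assumes a user-supplied scalar error function `E` and NumPy) of one epoch of the update (4)-(5):

```python
import numpy as np

def sed_step(E, p, mu=0.1, sigma=0.01, rng=None):
    """One epoch of stochastic error descent, Eqs. (4)-(5).

    E     -- callable mapping a parameter vector to the scalar error E(p)
    p     -- current parameter vector
    mu    -- small, strictly positive learning constant
    sigma -- perturbation strength
    """
    rng = rng or np.random.default_rng()
    # Parallel random perturbation, held fixed for the whole epoch.
    pi = rng.normal(0.0, sigma, size=p.shape)
    # Eq. (5): incremental error due to the perturbation alone.
    e_hat = E(p + pi) - E(p)
    # Eq. (4): descend along the perturbation direction.
    return p - mu * e_hat * pi
```

Iterating `sed_step` on, e.g., a quadratic error E(p) = |p|² drives the error down without ever evaluating the gradient; only two error measurements per epoch are needed.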
On the other hand, by omitting E(p) in (5) the original distributed model-free method [8] is obtained. The subtraction of the unperturbed reference term E(p) in (5) contributes a significant increase in speed over the original method. Intuitively, the incremental error Ê specified in (5) isolates the specific contribution due to the perturbation, which is obviously more relevant than the total error, which includes a bias E(p) unrelated to the perturbation π. This bias necessitates stringent zero-mean and orthogonality conditions on the π_i and requires many perturbation cycles in order to effect a consistent decrease in the error [8].¹ An additional difference concerns the assumption on the dynamics of the perturbations π_i. By fixing the perturbation π during every epoch in the present method, the dynamics of the π_i no longer interfere with the time delays of the network, and dynamic optimization tasks as in (2) come within reach. \n\nThe rather simple and intuitive structure (4) and (5) of the algorithm is somewhat reminiscent of related models for reinforcement learning, and likely finds parallels in other fields as well. Random-direction and line-search error-descent algorithms for trajectory learning have been suggested and analyzed by P. Baldi [12]. As a matter of coincidence, independent derivations of basically the same algorithm, from different approaches, are presented in this volume as well [13],[14]. Rather than focusing on issues of originality, we proceed by analyzing the virtues and scaling properties of this method. We directly present the results below, and defer the formal derivations to the appendix. \n\n2.1 The algorithm performs gradient descent on average, provided that the perturbations π_i are mutually uncorrelated with uniform auto-variance, that is E(π_i π_j) = σ² δ_ij with σ the perturbation strength. 
The effective gradient descent learning rate corresponding to (3) equals η_eff = μσ². \n\nHence on average the learning trajectory follows the steepest path of error descent. The stochasticity of the parameter perturbations gives rise to fluctuations around the mean path of descent, injecting diffusion in the learning process. However, the individual fluctuations satisfy the following desirable regularity: \n\n¹ An interesting noise-injection variant on the model-free distributed learning paradigm of [8], presented in [10], avoids the bias due to the offset level E(p) as well, by differentiating the perturbation and error signals prior to correlating them to construct the parameter increments. A complete demonstration of an analog VLSI system based on this approach is presented in this volume [11]. As a matter of fact, the modified noise-injection algorithm corresponds to a continuous-time version of the algorithm presented here, for networks and error functionals free of time-varying features. \n\n2.2 The error E(p) always decreases under an update (4) for any π, provided that |π|² is \"small\", and μ is strictly positive and \"small\". \n\nTherefore, the algorithm is guaranteed to converge towards local error minima just like gradient descent, as long as the perturbation vector π statistically explores all directions of the parameter space, provided the perturbation strength and learning rate are sufficiently small. This property holds only for methods which bypass the bias due to the offset error term E(p) for the calculation of the updates, as is performed here by subtraction of the offset in (5). 
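Property 2.1 is easy to verify numerically. The sketch below (illustrative only; a simple quadratic error surface is assumed, not a network from the paper) averages the update (4) over many perturbation instances and compares it against an exact gradient-descent step with η_eff = μσ²:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.diag([1.0, 2.0, 3.0])
E = lambda q: 0.5 * q @ A @ q        # quadratic error; exact gradient is A @ q
p = np.array([1.0, -1.0, 0.5])
mu, sigma, trials = 0.1, 0.01, 50000

total = np.zeros_like(p)
for _ in range(trials):
    # Mutually uncorrelated binary perturbations with variance sigma**2.
    pi = rng.choice([-sigma, sigma], size=p.shape)
    # Update rule (4)-(5), accumulated over independent trials.
    total += -mu * (E(p + pi) - E(p)) * pi
mean_update = total / trials

# Pure gradient-descent step (3) with effective rate eta_eff = mu * sigma**2.
grad_step = -mu * sigma**2 * (A @ p)
# mean_update and grad_step agree to within sampling error.
```

The agreement improves as 1/sqrt(trials), consistent with the diffusion picture: each single update is noisy, but its expectation is the gradient step.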
\n\nThe guaranteed decrease in error of the update (4) under any small, single instance of the perturbation π removes the need of averaging multiple trials obtained by different instances of π in order to reduce turbulence in the learning dynamics. We intentionally omit any smoothing operation on the constructed increments (4) prior to effecting the updates Δp, unlike the estimation of the true gradient in [8],[10],[13] by essentially accumulating and averaging contributions (4) over a large set of random perturbations. Such averaging is unnecessary here (and in [13]) since each individual increment (4) contributes a decrease in error, and since the smoothing of the ragged downward trajectory on the error surface is effectively performed by the integration of the incremental updates (4) anyway. Furthermore, from a simple analysis it follows that such averaging is actually detrimental to the effective speed of convergence.² For a correct measure of the convergence speed of the algorithm relative to that of other methods, we studied the boundaries of learning stability regions specifying maximum learning rates for the different methods. The analysis reveals the following scaling properties with respect to the size of the trained network, characterized by the number of adjustable parameters P: \n\n2.3 The maximum attainable average speed of the algorithm is a factor P^(1/2) slower than that of pure gradient descent, as opposed to the maximum average speed of sequential weight perturbation, which is a factor P slower than gradient descent. \n\nThe reduction in speed of the algorithm vs. gradient descent by the square root of the number of parameters can be understood as well from an information-theoretical point of view using physical arguments. At each epoch, the stochastic algorithm applies perturbations in all P dimensions, injecting information in P different \"channels\". 
However, only scalar information about the global response of the network to the perturbations is available at the outside, through a single \"channel\". On average, such an algorithm can extract knowledge about the response of the network in at most P^(1/2) effective dimensions, where the upper limit is reached only if the perturbations are truly statistically independent, exploiting the full channel capacity. In the worst case the algorithm only retains scalar information through a single, low-bandwidth channel, which is e.g. the case for the sequential weight perturbation algorithm. Hence, the stochastic algorithm achieves a speed-up of a factor P^(1/2) over the technique of sequential weight perturbation, by using parallel statistically independent perturbations as opposed to serial single perturbations. The original model-free algorithm by Dembo and Kailath [8] does not achieve this P^(1/2) speed-up over the sequential perturbation method (and may even do worse), partly because the information about the specific error contribution by the perturbations is contaminated due to the constant error bias signal E(p). \n\n²Sure enough, averaging say M instances of (4) for different random perturbations will improve the estimate of the gradient by decreasing its variance. However, the variance of the update Δp decreases by a factor of M, allowing an increase in learning rate by only a factor of M^(1/2), while to that purpose M network evaluations are required. In terms of total computation effort, the averaged method is hence a factor M^(1/2) slower. \n\nNote that up to here the term \"speed\" was defined in terms of the number of epochs, which does not necessarily directly relate to the physical speed, in terms of the total number of operations. An equally important factor in speed is the amount of computation involved per epoch to obtain values for the updates (3) and (4). 
For the stochastic algorithm, the most intensive part of the computation involved at every epoch is the evaluation of E(p) for two instances of p in (5), which typically scales as O(P) for neural networks. The remaining operations relate to the generation of random perturbations π_i and the calculation of the correlations in (4), scaling as O(P) as well. Hence, for an accurate comparison of the learning speed, the scaling of the computations involved in a single gradient descent step needs to be balanced against the computation effort by the stochastic method corresponding to an equivalent error descent rate, which combining both factors scales as O(P^(3/2)). An example where the scaling for this computation balances in favor of the stochastic error-descent method, due to the expensive calculation of the full gradient, will be demonstrated below for dynamic trajectory learning. \n\nMore importantly, the intrinsic parallelism, fault tolerance and computational simplicity of the stochastic algorithm are especially attractive with hardware implementations in mind. The complexity of the computations can furthermore be reduced by picking a binary random distribution for the parallel perturbations, π_i = ±σ with equal probability for both polarities, simplifying the multiply operations in the parameter updates. In addition, powerful techniques exist to generate large-scale streams of pseudo-random bits in VLSI [15]. \n\n3 Numerical Simulations \n\nFor a test of the learning algorithm on time-dependent problems, we selected dynamic trajectory learning (a \"Figure 8\") as a representative example [2]. 
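The binary simplification can be sketched as follows (an illustrative sketch, not circuitry from the paper): with π_i = ±σ, the product μ Ê π_i in (4) collapses to adding or subtracting the single common increment μσÊ, so no per-parameter multiplier is needed:

```python
import numpy as np

def binary_sed_step(E, p, mu, sigma, rng):
    """Stochastic error-descent epoch with binary perturbations pi_i = +/- sigma.

    Because every |pi_i| equals sigma, the update -mu * e_hat * pi_i reduces
    to a sign-conditioned add/subtract of the common scalar mu * sigma * e_hat.
    """
    signs = rng.choice([-1.0, 1.0], size=p.shape)  # one pseudo-random bit per parameter
    e_hat = E(p + sigma * signs) - E(p)            # Eq. (5)
    step = mu * sigma * e_hat                      # single scalar, shared by all parameters
    return p - step * signs                        # add or subtract, no multiplies per parameter
```

In hardware terms, each parameter cell needs only a random bit and a common analog increment; the convergence behavior is unchanged from the Gaussian-perturbation case.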
Several exact gradient methods based on an error functional of the form (2) exist [2]-[5],³ with a computational complexity scaling as either O(P) per epoch for an off-line method [2] (requiring history storage over the complete time interval of the error functional), or as O(P²) [3] and recently as O(P^(3/2)) [4]-[5] per epoch for an on-line method (with only most current history storage). The stochastic error-descent algorithm provides an on-line alternative with an O(P) per epoch complexity. As a consequence, including the extra P^(1/2) factor for the convergence speed relative to gradient descent, the overall computation complexity of the stochastic error-descent still scales like the best on-line exact gradient method currently available. \n\nFor the simulations, we compared several runs of the stochastic method with a single run of an exact gradient-descent method, all runs starting from the same initial conditions. For a meaningful comparison, the equivalent learning rate for stochastic descent, η_eff = μσ², was set to η, resulting in equal average speeds. We implemented binary random perturbations π_i = ±σ with σ = 1 × 10⁻³. We used the network topology, the teacher forcing mechanism, the values for the learning parameters and the values for the initial conditions from [4], case 4, except for η (and η_eff) which we reduced from 0.1 to 0.05 to avoid strong instabilities in the stochastic sessions. \n\n³The distinction between on-line and off-line methods here refers to issues of time reversal in the computation. On-line methods process incoming data strictly in the order it is received, while off-line methods require extensive access to previously processed data. On-line methods are therefore more desirable for real-time learning applications. 
Each epoch represents one complete period of the figure eight. We found no local minima for the learning problem, and all sessions converged successfully within 4000 epochs as shown in Fig. 1 (a). The occasional upward transitions in the stochastic error are caused by temporary instabilities due to the elevated value of the learning rate. At lower values of the learning rate, we observed significantly less frequent and articulate upward transitions. The measured distribution for the decrements in error at η_eff = 0.01 is given in Fig. 1 (b). The values of the stochastic error decrements in the histogram are normalized to the mean of the distribution, i.e. the error decrements by gradient descent (8). As expected, the error decreases at practically all times with an average rate equal to that of gradient descent, but the largest fraction of the updates cause little change in error. \n\n[Figure 1: Exact Gradient and Stochastic Error-Descent Methods for the Figure \"8\" Trajectory. (a) Convergence Dynamics (η = 0.05): error vs. number of epochs for exact gradient descent and stochastic error-descent. (b) Distribution of the Error Decrements (η = 0.01): frequency vs. normalized error decrement.] \n\n4 Conclusion \n\nThe above analysis and examples serve to demonstrate the solid performance of the error-descent algorithm, in spite of its simplicity and the minimal requirements on explicit knowledge of internal structure. While the functional simplicity and fault-tolerance of the algorithm is particularly suited for hardware implementations, on conventional digital computers its efficiency compares favorably with pure gradient descent methods for certain classes of networks and optimization problems, owing to the involved effort to obtain full gradient information. 
The latter is particularly true for complex optimization problems, such as for trajectory learning and adaptive control, with expensive scaling properties for the calculation of the gradient. In particular, the discrete formulation of the learning dynamics, decoupled from the dynamics of the network, enables the stochastic error-descent algorithm to handle dynamic networks and time-dependent optimization functionals gracefully. \n\nAppendix: Formal Analysis \n\nWe analyze the algorithm for small perturbations π_i, by expanding (5) into a Taylor series around p: \n\nÊ = Σ_j (∂E/∂p_j) π_j + O(|π|²)   (6) \n\nwhere the ∂E/∂p_j represent the components of the true error gradient, reflecting the physical structure of the network. Substituting (6) in (4) yields: \n\nΔp_i = -μ Σ_j (∂E/∂p_j) π_i π_j + O(|π|²) π_i   (7) \n\nFor mutually uncorrelated perturbations π_i with uniform variance σ², E(π_i π_j) = σ² δ_ij, the parameter vector on average changes as \n\nE(Δp) = -μσ² ∂E/∂p + O(σ³)   (8) \n\nHence, on average the algorithm performs pure gradient descent as in (3), with an effective learning rate η_eff = μσ². The fluctuations of the parameter updates (7) with respect to their average (8) give rise to diffusion in the error-descent process. Nevertheless, regardless of these fluctuations the error will always decrease under the updates (4), provided that the increments Δp_i are sufficiently small (μ small): \n\nΔE = Σ_i (∂E/∂p_i) Δp_i + O(|Δp|²) ≈ -μ Σ_i Σ_j (∂E/∂p_i) π_i (∂E/∂p_j) π_j ≈ -μ Ê² ≤ 0   (9) \n\nNote that this is a direct consequence of the offset bias subtraction in (5), and (9) is no longer valid when the compensating reference term E(p) in (5) is omitted. 
The algorithm will converge towards local error minima just like gradient descent, as long as the perturbation vector π statistically explores all directions of the parameter space. In principle, statistical independence of the π_i is not required to ensure convergence, though in the case of cross-correlated perturbations the learning trajectory (7) does not on average follow the steepest path (8) towards the optima, resulting in slower learning. \n\nThe constant μ cannot be increased arbitrarily to boost the speed of learning. The value of μ is constrained by the allowable range for |Δp| in (9). The maximum level for |Δp| depends on the steepness and nonlinearity of the error functional E, but is largely independent of which algorithm is being used. A value of |Δp| exceeding the limit will likely cause instability in the learning process, just as it would for an exact gradient descent method. The constraint on |Δp| allows us to formulate the maximum attainable speed of the stochastic algorithm, relative to that of other methods. From (4), \n\n|Δp|² = μ² |π|² Ê² ≈ P μ² σ² Ê²   (10) \n\nwhere P is the number of parameters. The approximate equality at the end of (10) holds for large P, and results from the central limit theorem for |π|² with E(π_i π_j) = σ² δ_ij. From (6), the expected value of (10) is \n\nE(|Δp|²) = P (μσ²)² |∂E/∂p|²   (11) \n\nThe maximum attainable value for μ can be expressed in terms of the maximum value of η for gradient descent learning. Indeed, from a worst-case analysis of (3) \n\n|Δp|²_max = η²_max |∂E/∂p|²_max   (12) \n\nand from a similar worst-case analysis of (11), we obtain P^(1/2) μ_max σ² ≈ η_max to a first order approximation. 
With the derived value for μ_max, the maximum effective learning rate η_eff associated with the mean-field equation (8) becomes η_eff = P^(-1/2) η_max for the stochastic method, as opposed to η_max for the exact gradient method. This implies that on average and under optimal conditions the learning process for the stochastic error-descent method is a factor P^(1/2) slower than optimal gradient descent. From similar arguments, it can be shown that for sequential perturbations π_i the effective learning rate for the mean-field gradient descent satisfies η_eff = P^(-1) η_max. Hence under optimal conditions the sequential weight perturbation technique is a factor P slower than optimal gradient descent. \n\nAcknowledgements \n\nWe thank J. Alspector, P. Baldi, B. Flower, D. Kirk, M. van Putten, A. Yariv, and many other individuals for valuable suggestions and comments on the work presented here. \n\nReferences \n\n[1] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, \"Learning Internal Representations by Error Propagation,\" in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, D.E. Rumelhart and J.L. McClelland, eds., Cambridge, MA: MIT Press, 1986. \n[2] B.A. Pearlmutter, \"Learning State Space Trajectories in Recurrent Neural Networks,\" Neural Computation, vol. 1 (2), pp 263-269, 1989. \n[3] R.J. Williams and D. Zipser, \"A Learning Algorithm for Continually Running Fully Recurrent Neural Networks,\" Neural Computation, vol. 1 (2), pp 270-280, 1989. \n[4] N.B. Toomarian and J. Barhen, \"Learning a Trajectory using Adjoint Functions and Teacher Forcing,\" Neural Networks, vol. 5 (3), pp 473-484, 1992. \n[5] J. Schmidhuber, \"A Fixed Size Storage O(n³) Time Complexity Learning Algorithm for Fully Recurrent Continually Running Networks,\" Neural Computation, vol. 4 (2), pp 243-248, 1992. \n[6] B. Widrow and M.A. Lehr, \"30 years of Adaptive Neural Networks: 
Perceptron, Madaline, and Backpropagation,\" Proc. IEEE, vol. 78 (9), pp 1415-1442, 1990. \n[7] M. Jabri and B. Flower, \"Weight Perturbation: An Optimal Architecture and Learning Technique for Analog VLSI Feedforward and Recurrent Multilayered Networks,\" IEEE Trans. Neural Networks, vol. 3 (1), pp 154-157, 1992. \n[8] A. Dembo and T. Kailath, \"Model-Free Distributed Learning,\" IEEE Trans. Neural Networks, vol. 1 (1), pp 58-70, 1990. \n[9] H.P. Whitaker, \"An Adaptive System for the Control of Aircraft and Spacecraft,\" in Institute for Aeronautical Sciences, pap. 59-100, 1959. \n[10] B.P. Anderson and D.A. Kerns, \"Using Noise Injection and Correlation in Analog Hardware to Estimate Gradients,\" submitted, 1992. \n[11] D. Kirk, D. Kerns, K. Fleischer, and A. Barr, \"Analog VLSI Implementation of Gradient Descent,\" in Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufmann Publishers, vol. 5, 1993. \n[12] P. Baldi, \"Learning in Dynamical Systems: Gradient Descent, Random Descent and Modular Approaches,\" JPL Technical Report, California Institute of Technology, 1992. \n[13] J. Alspector, R. Meir, B. Yuhas, and A. Jayakumar, \"A Parallel Gradient Descent Method for Learning in Analog VLSI Neural Networks,\" in Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufmann Publishers, vol. 5, 1993. \n[14] B. Flower and M. Jabri, \"Summed Weight Neuron Perturbation: An O(n) Improvement over Weight Perturbation,\" in Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufmann Publishers, vol. 5, 1993. \n[15] J. Alspector, J.W. Gannett, S. Haber, M.B. Parker, and R. Chu, \"A VLSI-Efficient Technique for Generating Multiple Uncorrelated Noise Sources and Its Application to Stochastic Neural Networks,\" IEEE Trans. Circuits and Systems, vol. 38 (1), pp 109-123, 1991. 
", "award": [], "sourceid": 690, "authors": [{"given_name": "Gert", "family_name": "Cauwenberghs", "institution": null}]}