{"title": "A Learning Analog Neural Network Chip with Continuous-Time Recurrent Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 858, "page_last": 865, "abstract": null, "full_text": "A Learning Analog Neural Network Chip with Continuous-Time Recurrent Dynamics\n\nGert Cauwenberghs*\n\nCalifornia Institute of Technology\nDepartment of Electrical Engineering\n128-95 Caltech, Pasadena, CA 91125\nE-mail: gert@cco.caltech.edu\n\n*Present address: Johns Hopkins University, ECE Dept., Baltimore MD 21218-2686.\n\nAbstract\n\nWe present experimental results on supervised learning of dynamical features in an analog VLSI neural network chip. The recurrent network, containing six continuous-time analog neurons and 42 free parameters (connection strengths and thresholds), is trained to generate time-varying outputs approximating given periodic signals presented to the network. The chip implements a stochastic perturbative algorithm, which observes the error gradient along random directions in the parameter space for error-descent learning. In addition to the integrated learning functions and the generation of pseudo-random perturbations, the chip provides for teacher forcing and long-term storage of the volatile parameters. The network learns a 1 kHz circular trajectory in 100 sec. The chip occupies 2 mm x 2 mm in a 2 \u00b5m CMOS process, and dissipates 1.2 mW.\n\n1 Introduction\n\nExact gradient-descent algorithms for supervised learning in dynamic recurrent networks [1-3] are fairly complex and do not provide for a scalable implementation in a standard 2-D VLSI process. We have implemented a fairly simple and scalable 
\n\n858 \n\n\fA Learning Analog Neural Network Chip with Continuous-Time Recurrent Dynamics \n\n859 \n\nlearning architecture in an analog VLSI recurrent network, based on a stochastic \nperturbative algorithm which avoids calculation of the gradient based on an explicit \nmodel of the network, but instead probes the dependence of the network error on \nthe parameters directly [4]. As a demonstration of principle, we have trained a \nsmall network, integrated with the learning circuitry on a CMOS chip, to gener(cid:173)\nate outputs following a prescribed periodic trajectory. The chip can be extended, \nwith minor modifications to the internal structure of the cells, to accommodate \napplications with larger size recurrent networks. \n\n2 System Architecture \n\nThe network contains six fully interconnected recurrent neurons with continuous(cid:173)\ntime dynamics, \n\nd \n\nT dtXi = -Xi + L Wij U(Xj - (Jj) + Yi \n\n6 \n\n, \n\n(1) \n\nj=l \n\nwith Xi(t) the neuron states representing the outputs of the network, Yi(t) the \nexternal inputs to the network, and u(.) a sigmoidal activation function. The 36 \nconnection strengths Wij and 6 thresholds (Jj constitute the free parameters to be \nlearned, and the time constant T is kept fixed and identical for all neurons. Below, \nthe parameters Wij and (Jj are denoted as components of a single vector p. \n\nThe network is trained with target output signals x[(t) and xf(t) for the first two \nneuron outputs. Learning consists of minimizing the time-averaged error \n\n\u00a3(p) = lim 2T L Ixf(t) - Xk(t)IVdt \n\n, \n\n(2) \n\n1 jT 2 \n\nT-+oo \n\n-T k=l \n\nusing a distance metric with norm v. 
The learning algorithm [4] iteratively specifies incremental updates in the parameter vector $\\mathbf{p}$ as\n\n$$\\mathbf{p}^{(k+1)} = \\mathbf{p}^{(k)} - \\mu\\, \\hat{\\mathcal{E}}^{(k)} \\boldsymbol{\\pi}^{(k)} \\qquad (3)$$\n\nwith the perturbed error\n\n$$\\hat{\\mathcal{E}}^{(k)} = \\frac{1}{2} \\left( \\mathcal{E}(\\mathbf{p}^{(k)} + \\boldsymbol{\\pi}^{(k)}) - \\mathcal{E}(\\mathbf{p}^{(k)} - \\boldsymbol{\\pi}^{(k)}) \\right) \\qquad (4)$$\n\nobtained from a two-sided parallel activation of fixed-amplitude random perturbations $\\pi_i^{(k)}$ onto the parameters $p_i^{(k)}$; $\\pi_i^{(k)} = \\pm\\sigma$ with equal probabilities for both polarities (here $\\sigma$ denotes the perturbation amplitude, distinct from the activation function $\\sigma(\\cdot)$ in (1)). The algorithm basically performs random-direction descent of the error as a multi-dimensional extension to the Kiefer-Wolfowitz stochastic approximation method [5], and several related variants have recently been proposed for optimization [6,7] and hardware learning [8-10].\n\nTo facilitate learning, a teacher forcing signal is initially applied to the external input $y$ according to\n\n$$y_i(t) = \\lambda\\, \\gamma(x_i^T(t) - x_i(t)) , \\quad i = 1, 2 \\qquad (5)$$\n\nproviding a feedback mechanism that forces the network outputs towards the targets [3]. A symmetrical and monotonically increasing \"squashing\" function for $\\gamma(\\cdot)$ serves this purpose. The teacher forcing amplitude $\\lambda$ needs to be attenuated along the learning process, so as to suppress the bias in the network outputs at convergence that might result from residual errors.\n\n3 Analog VLSI Implementation\n\nThe network and learning circuitry are implemented on a single analog CMOS chip, which uses a transconductance current-mode approach for continuous-time operation. Through dedicated transconductance circuitry, a wide linear dynamic range for the voltages is achieved at relatively low levels of power dissipation (experimentally 1.2 mW while either learning or refreshing). 
While most learning functions, including generation of the pseudo-random perturbations, are integrated on-chip in conjunction with the network, some global and higher-level learning functions of low dimensionality, such as the evaluation of the error (2) and construction of the perturbed error (4), are performed outside the chip for greater flexibility in tailoring the learning process. The structure and functionality of the implemented circuitry are illustrated in Figures 1 to 3, and a more detailed description follows below.\n\n3.1 Network Circuitry\n\nFigure 1 shows the schematics of the synapse and neuron circuitry. A synapse cell of single polarity is shown in Figure 1 (a). A high output impedance triode multiplier, using an adjustable regulated cascode [11], provides a constant current $I_{ij}$ linear in the voltage $W_{ij}$ over a wide range. The synaptic current $I_{ij}$ feeds into a differential pair, injecting a differential current $I_{ij}\\, \\sigma(x_j - \\theta_j)$ into the diode-connected $I^+_{out}$ and $I^-_{out}$ output lines. The double-stack transistor configuration of the differential pair offers an expanded linear sigmoid range. The summed output currents $I^+_{out}$ and $I^-_{out}$ of a row of synapses are collected in the output cell, Figure 1 (b), which also subtracts the reference currents $I^+_{ref}$ and $I^-_{ref}$ obtained from a reference row of \"dummy\" synapses defining the \"zero-point\" synaptic strength $W_{off}$ for bipolar operation. The thus established current corresponds to the summed synaptic contributions in (1). Wherever appropriate ($i = 1, 2$), a differential transconductance element with inputs $x_i$ and $x_i^T$ is added to supply an external input current for forced teacher action in accordance with (5).\n\nFigure 1 Schematics of synapse and neuron circuitry. (a) Synapse of single polarity. (b) Output cell with current-to-voltage converter. 
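As a rough numerical companion to the network dynamics (1) that this circuitry realizes, the following Python sketch integrates the six-neuron equations with a forward-Euler step. It is not a model of the chip: the function name simulate_network, the use of tanh as the sigmoid, the step size, and the time constant of 100 microseconds are all assumptions made for illustration.

```python
import math

def simulate_network(W, theta, y, x0, tau=100e-6, dt=1e-6, steps=2000):
    # Forward-Euler integration of equation (1):
    #   tau * dx_i/dt = -x_i + sum_j W[i][j] * s(x_j - theta[j]) + y[i]
    # tanh stands in for the sigmoid s(.) (an assumption of this sketch).
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        s = [math.tanh(x[j] - theta[j]) for j in range(n)]
        dxdt = [(-x[i] + sum(W[i][j] * s[j] for j in range(n)) + y[i]) / tau
                for i in range(n)]
        x = [x[i] + dt * dxdt[i] for i in range(n)]
    return x

# Six neurons, no recurrent coupling, constant external input (hypothetical values).
n = 6
W = [[0.0] * n for _ in range(n)]
theta = [0.0] * n
y = [0.5] * n
x_final = simulate_network(W, theta, y, x0=[0.0] * n)
```

With every connection strength set to zero, each state simply relaxes towards its external input with time constant tau, which makes a convenient sanity check before recurrent coupling is added.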
\n\nThe output current is converted to the neuron output voltage $x_i$ through an active resistive element using the same regulated high output impedance triode circuitry as used in the synaptic current source. The feedback delay parameter $\\tau$ in (1) corresponds to the RC product of the regulated triode active resistance value and the capacitance $C_{out}$. With $C_{out} = 5$ pF, the delay ranges between 20 and 200 \u00b5sec, adjustable by the control voltage of the regulated cascode. Figure 2 shows the measured static characteristics of the synapse and neuron functions for different values of $W_{ij}$ and $\\theta_j$ ($i = j = 1$), obtained by disabling the neuron feedback and driving the synapse inputs externally.\n\nFigure 2 Measured static synapse and neuron characteristics, for various values of (a) the connection strength $W_{ij}$, and (b) the threshold $\\theta_j$. Horizontal axis in both panels: input voltage $x_j$ (V).\n\n3.2 Learning Circuitry\n\nFigure 3 (a) shows the simplified schematics of the learning and storage circuitry, replicated locally for every parameter (connection strength or threshold) in the network. Most of the variables relating to the operation of the cells are local, with the exception of a few global signals communicating to all cells. Global signals include the sign and the amplitude of the perturbed error $\\hat{\\mathcal{E}}$ and predefined control signals. 
The stored parameter and its binary perturbation are strictly local to the cell, in that they do not need to communicate explicitly to outside circuitry (except trivially through the neural network it drives), which simplifies the structural organization and interconnection of the learning cells.\n\nThe parameter voltage $p_i$ is stored on the capacitor $C_{store}$, which furthermore couples to capacitor $C_{pert}$ for activation of the perturbation. The perturbation bit $\\pi_i$ selects either of two complementary signals $V_{+\\sigma}$ and $V_{-\\sigma}$\n\nFigure 3 Learning cell circuitry. (a) Simplified schematics. (b) Waveform and timing diagram, with error evaluation phases $\\mathcal{E}(\\mathbf{p})$, $\\mathcal{E}(\\mathbf{p} + \\boldsymbol{\\pi})$ and $\\mathcal{E}(\\mathbf{p} - \\boldsymbol{\\pi})$.\n\nThe random bit stream $\\pi_i^{(k)}$ is generated on-chip by means of a set of linear feedback shift registers [12]. For optimal performance, the perturbations need to satisfy certain statistical orthogonality conditions, and a rigorous but elaborate method to generate a set of uncorrelated bit streams in VLSI has been derived [13]. To preserve the scalability of the learning architecture and the local nature of the perturbations, we have chosen a simplified scheme which does not affect the learning performance to first order, as verified experimentally. The array of perturbation bits, configured in a two-dimensional arrangement as prompted by the location of the parameters in the network, is constructed by an outer-product exclusive-or operation from two generating linear sets of uncorrelated row and column bits on lines running horizontally and vertically across the network array. 
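The outer-product exclusive-or construction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the on-chip generator: the register width, tap positions, seeds, and the 6 x 7 array layout are hypothetical, and for brevity the row and column bits are taken as successive outputs of two shift registers rather than from one register per line.

```python
def lfsr_bits(seed, n_bits, taps=(16, 14, 13, 11), width=16):
    # Fibonacci linear feedback shift register; returns n_bits output bits.
    # The 16-bit width and tap positions are illustrative, not the chip's values.
    state = seed & ((1 << width) - 1)
    if state == 0:
        raise ValueError('seed must be non-zero')
    bits = []
    for _ in range(n_bits):
        bits.append(state & 1)
        feedback = 0
        for t in taps:
            feedback ^= (state >> (width - t)) & 1
        state = (state >> 1) | (feedback << (width - 1))
    return bits

def perturbation_array(row_bits, col_bits, sigma):
    # Outer-product exclusive-or: entry (i, j) uses row_bits[i] XOR col_bits[j],
    # mapped to +sigma (bit 1) or -sigma (bit 0).
    return [[sigma if (r ^ c) else -sigma for c in col_bits] for r in row_bits]

# One perturbation pattern for a hypothetical 6 x 7 layout (36 weights + 6 thresholds).
row_bits = lfsr_bits(seed=0xACE1, n_bits=6)
col_bits = lfsr_bits(seed=0x1D2B, n_bits=7)
pi = perturbation_array(row_bits, col_bits, sigma=0.0125)
```

Only 6 + 7 generating bits are needed for the 42 entries. If the generating bits are independent and unbiased, any two entries are pairwise uncorrelated, but higher-order dependencies remain: the signs at the four corners of any rectangle in the array always multiply to a positive value.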
\n\nIn the present implementation the evaluation of the error functional (2) is performed externally with discrete analog components, leaving some flexibility to experiment with different formulations of error functionals that otherwise would have been hardwired. A mean absolute difference ($\\nu = 1$) norm is used for the metric distance, and the time-averaging of the error is achieved by a fourth-order Butterworth low-pass filter. The cut-off frequency is tuned to accommodate an AC ripple smaller than 0.1%, giving rise to a filter settling time extending over 20 periods of the training signal.\n\n3.3 Long-Term Volatile Storage\n\nAfter learning, it is desirable to retain (\"freeze\") the learned information, in principle for an infinite period of time. The volatile storage of the parameter values on capacitors undergoes a spontaneous decay due to junction leakage and other drift phenomena, and needs to be refreshed periodically. For eight effective bits of resolution, a refresh rate of 10 Hz is sufficient. Incidentally, the charge pump used for the learning updates provides for refresh of the parameter values as well. To that purpose, probing and multiplexing circuitry (not shown) are added to the learning cell of Figure 3 (a) for sequential refresh. In the experiment conducted here, the parameters are stored externally and refreshed sequentially by activating the corresponding charge pump with a DECR/INCR bit defined by the polarity of the observed deviation between internally probed and externally stored values. The parameter refresh is performed in the background with a 100 msec cycle, and does not interfere with the continuous-time network operation. A simple internal analog storage method obviating the need for external storage is described in [14], and is supported by the chip architecture. 
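The update rule (3) and (4), which the learning cells and the external error evaluation carry out jointly, can be sketched in software as follows. This is a minimal illustration only: on the chip the error is measured from the physical network and averaged by the low-pass filter, whereas here a toy quadratic function stands in for it. The names stochastic_error_descent and toy_error, the target vector, and the values of mu and sigma are assumptions chosen for the toy problem, not the experimental settings.

```python
import random

def stochastic_error_descent(error, p0, mu, sigma, iterations, rng):
    # Iterates equations (3) and (4): perturb all parameters in parallel by
    # +/- sigma, measure the two-sided error difference, and step against it.
    p = list(p0)
    for _ in range(iterations):
        pi = [sigma if rng.random() < 0.5 else -sigma for _ in p]
        e_plus = error([a + b for a, b in zip(p, pi)])
        e_minus = error([a - b for a, b in zip(p, pi)])
        e_hat = 0.5 * (e_plus - e_minus)                  # equation (4)
        p = [a - mu * e_hat * b for a, b in zip(p, pi)]    # equation (3)
    return p

# Toy stand-in for the measured network error: a quadratic bowl (hypothetical).
target = [0.3, -0.2, 0.5]

def toy_error(p):
    return sum((a - t) ** 2 for a, t in zip(p, target))

p_start = [0.0, 0.0, 0.0]
p_final = stochastic_error_descent(toy_error, p_start, mu=10.0, sigma=0.1,
                                   iterations=500, rng=random.Random(0))
```

Each iteration needs only two scalar error measurements regardless of the number of parameters, which is what keeps the per-parameter circuitry local. For this particular quadratic toy, mu and sigma were picked so that mu times the number of parameters times sigma squared stays below one, which prevents an update from overshooting.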
\n\n4 Learning Experiment\n\nAs a proof of principle, the network is trained with a circular target trajectory defined by the quadrature-phase oscillator\n\n$$\\begin{cases} x_1^T(t) = A \\cos(2 \\pi f t) \\\\ x_2^T(t) = A \\sin(2 \\pi f t) \\end{cases} \\qquad (6)$$\n\nwith $A = 0.8$ V and $f = 1$ kHz. In principle a recurrent network of two neurons suffices to generate quadrature-phase oscillations, and the extra neurons in the network serve to accommodate the particular amplitude and frequency requirements and assist in reducing the nonlinear harmonic distortion.\n\nClearly the initial conditions for the parameter values distinguish a trivial learning problem from a hard one, and training an arbitrarily initialized network may lead to unpredictable results of poor generality. Incidentally, we found that the majority of randomly initialized learning sessions fail to generate oscillatory behavior at convergence, the network being trapped in a local minimum defined by a strong point attractor. Even with strong teacher forcing these local minima persist. In contrast, we obtained consistent and satisfactory results with the following initialization of network parameters: strong positive diagonal connection strengths $W_{ii} = 1$, zero off-diagonal terms $W_{ij} = 0$; $i \\neq j$, and zero thresholds $\\theta_i = 0$. The positive diagonal connections $W_{ii}$ repel the neuron outputs from the point attractor at the origin, counteracting the spontaneous decay term $-x_i$ in (1). Applying non-zero initial values for the cross connections $W_{ij}$; $i \\neq j$ would introduce a bias in the dynamics due to coupling between neurons. With zero initial cross coupling, and under strong initial teacher forcing, fairly fast and robust learning is achieved.\n\nFigure 4 shows recorded error sequences under training of the network with the target oscillator (6), for five different sessions of 1,500 learning iterations each starting from the above initial conditions. 
The learning iterations span 60 msec each, for a total of 100 sec per session. The teacher forcing amplitude $\\lambda$ is set initially to 3 V, and thereafter decays logarithmically over one order of magnitude towards the end of the sessions. Fixed values of the learning rate and the perturbation amplitude are used throughout the sessions, with $\\mu = 25.6$ V$^{-1}$ and $\\sigma = 12.5$ mV. All five sessions show a rapid initial decrease in the error under stimulus of the strong teacher forcing, and thereafter undergo a region of persistent flat error slowly tapering off towards convergence as the teacher forcing is gradually released. Notice that this flat region does not imply slow learning; instead the learning constantly removes error as additional error is adiabatically injected by the relaxation of the teacher forcing.\n\nFigure 4 Recorded evolution of the error during learning, for five different sessions on the network. Horizontal axis: time (sec), 0 to 100. Plot annotation: $\\mu = 25.6$ V$^{-1}$, $\\sigma = 12.5$ mV.\n\nNear convergence, the bias in the network error due to the residual teacher forcing becomes negligible. Figure 5 shows the network outputs and target signals at convergence, with the learning halted and the parameter refresh activated, illustrating the minor effect of the residual teacher forcing signal on the network dynamics. The oscillogram of Figure 5 (a) is obtained under a weak teacher forcing signal, and that of Figure 5 (b) is obtained with the same network parameters but with the teacher forcing signal disabled. In both cases the oscilloscope is triggered on the network output signals. Obviously, in the absence of teacher forcing the network no longer runs synchronously with the target signal. 
However, the discrepancy in frequency, amplitude and shape between either of the free-running and forced oscillatory output waveforms and the target signal waveforms is evidently small.\n\nFigure 5 Oscillograms of the network outputs and target signals after learning, (a) under weak teacher forcing, and (b) with teacher forcing disabled. Top traces: $x_1(t)$ and $x_1^T(t)$. Bottom traces: $x_2(t)$ and $x_2^T(t)$.\n\n5 Conclusion\n\nWe implemented a small-size learning recurrent neural network in an analog VLSI chip, and verified its learning performance in a continuous-time setting with a simple dynamic test (learning of a quadrature-phase oscillator). By virtue of its scalable architecture, with constant requirements on interconnectivity and limited global communication, the network structure with embedded learning functions can be freely expanded in a two-dimensional arrangement to accommodate applications of recurrent dynamical networks requiring larger dimensionality. A present limitation of the implemented learning model is the requirement of periodicity on the input and target signals during the learning process, which is needed to allow a repetitive and consistent evaluation of the network error for the parameter updates.\n\nAcknowledgments\n\nFabrication of the CMOS chip was provided through the DARPA/NSF MOSIS service. Financial support by the NIPS Foundation largely covered the expenses of attending the conference.\n\nReferences\n\n[1] B.A. Pearlmutter, \"Learning State Space Trajectories in Recurrent Neural Networks,\" Neural Computation, vol. 1 (2), pp 263-269, 1989.\n[2] R.J. Williams and D. Zipser, \"A Learning Algorithm for Continually Running Fully Recurrent Neural Networks,\" Neural Computation, vol. 1 (2), pp 270-280, 1989.\n[3] N.B. Toomarian and J. 
Barhen, \"Learning a Trajectory using Adjoint Functions and Teacher Forcing,\" Neural Networks, vol. 5 (3), pp 473-484, 1992.\n[4] G. Cauwenberghs, \"A Fast Stochastic Error-Descent Algorithm for Supervised Learning and Optimization,\" in Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufmann, vol. 5, pp 244-251, 1993.\n[5] H.J. Kushner and D.S. Clark, \"Stochastic Approximation Methods for Constrained and Unconstrained Systems,\" New York, NY: Springer-Verlag, 1978.\n[6] M.A. Styblinski and T.-S. Tang, \"Experiments in Nonconvex Optimization: Stochastic Approximation with Function Smoothing and Simulated Annealing,\" Neural Networks, vol. 3 (4), pp 467-483, 1990.\n[7] J.C. Spall, \"Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation,\" IEEE Trans. Automatic Control, vol. 37 (3), pp 332-341, 1992.\n[8] J. Alspector, R. Meir, B. Yuhas, and A. Jayakumar, \"A Parallel Gradient Descent Method for Learning in Analog VLSI Neural Networks,\" in Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufmann, vol. 5, pp 836-844, 1993.\n[9] B. Flower and M. Jabri, \"Summed Weight Neuron Perturbation: An O(n) Improvement over Weight Perturbation,\" in Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufmann, vol. 5, pp 212-219, 1993.\n[10] D. Kirk, D. Kerns, K. Fleischer, and A. Barr, \"Analog VLSI Implementation of Gradient Descent,\" in Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufmann, vol. 5, pp 789-796, 1993.\n[11] J.W. Fattaruso, S. Kiriaki, G. Warwar, and M. de Wit, \"Self-Calibration Techniques for a Second-Order Multibit Sigma-Delta Modulator,\" in ISSCC Technical Digest, IEEE Press, vol. 36, pp 228-229, 1993.\n[12] S.W. Golomb, \"Shift Register Sequences,\" San Francisco, CA: Holden-Day, 1967.\n[13] J. Alspector, J.W. Gannett, S. Haber, M.B. Parker, and R. 
Chu, \"A VLSI-Efficient Technique for Generating Multiple Uncorrelated Noise Sources and Its Application to Stochastic Neural Networks,\" IEEE Trans. Circuits and Systems, vol. 38 (1), pp 109-123, 1991.\n[14] G. Cauwenberghs and A. Yariv, \"Method and Apparatus for Long-Term Multi-Valued Storage in Dynamic Analog Memory,\" U.S. Patent pending, filed 1993.\n", "award": [], "sourceid": 778, "authors": [{"given_name": "Gert", "family_name": "Cauwenberghs", "institution": null}]}