{"title": "Experimental Evaluation of Learning in a Neural Microsystem", "book": "Advances in Neural Information Processing Systems", "page_first": 871, "page_last": 878, "abstract": null, "full_text": "Experimental Evaluation of Learning in a Neural Microsystem \n\nJoshua Alspector, Anthony Jayakumar, Stephan Luna\u2020 \n\nBellcore \n\nMorristown, NJ 07962-1910 \n\nAbstract \n\nWe report learning measurements from a system composed of a cascadable learning chip, data generators and analyzers for training pattern presentation, and an X-windows based software interface. The 32-neuron learning chip has 496 adaptive synapses and can perform Boltzmann and mean-field learning using separate noise and gain controls. We have used this system to do learning experiments on the parity and replication problems. The system settling time limits the learning speed to about 100,000 patterns per second, roughly independent of system size. \n\n1. INTRODUCTION \n\nWe have implemented a model of learning in neural networks using feedback connections and a local learning rule. Even though back-propagation[1] (Rumelhart, 1986) networks are feedforward in processing, they have separate, implicit feedback paths during learning for error propagation. Networks with explicit, full-time feedback paths can perform pattern completion[2] (Hopfield, 1982), can learn many-to-one mappings, can learn probability distributions, and can have interesting temporal and dynamical properties, in contrast to the single forward pass processing of multilayer perceptrons trained with back-propagation or other means. Because of the potential for complex dynamics, feedback networks require a reliable method of relaxation for learning and retrieval of static patterns. The Boltzmann machine[3] (Ackley, 1985) uses stochastic settling while the mean-field theory version[4] (Peterson, 1987) uses a more computationally efficient deterministic technique. 
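The deterministic mean-field alternative mentioned above replaces stochastic settling with a fixed-point iteration whose gain is gradually sharpened. A minimal Python sketch of that relaxation (our own illustration, not the authors' circuit or code; the gain schedule and sweep count are arbitrary choices):

```python
import math

def mean_field_settle(W, s, clamp, betas=(0.2, 0.5, 1.0, 2.0, 5.0), sweeps=20):
    # Mean-field relaxation: each unit takes the deterministic value
    # s_i = tanh(beta * u_i), where u_i is the net input, and the gain
    # beta is raised step by step ("annealed") so states sharpen toward
    # +/-1.  Units listed in `clamp` are held at their target values.
    n = len(s)
    for beta in betas:
        for _ in range(sweeps):
            for i in range(n):
                if i in clamp:
                    s[i] = clamp[i]
                else:
                    u = sum(W[i][j] * s[j] for j in range(n))  # net input
                    s[i] = math.tanh(beta * u)
    return s
```

With a single excitatory pair of weights and one unit clamped high, the free unit settles near +1 as the gain sharpens.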
\nWe have previously shown that Boltzmann learning can be implemented in VLSI[5] (Alspector, 1989). We have also shown, by simulation,[6] (Alspector, 1991a) that Boltzmann and mean-field networks can have powerful learning and representation properties just like the more thoroughly studied back-propagation methods. In this paper, we demonstrate these properties using new, expandable parallel hardware for on-chip learning. \n\n\u2020 Permanent address: University of California, Berkeley; EECS Dep't, Cory Hall; Berkeley, CA 94720 \n\n2. VLSI IMPLEMENTATION \n\n2.1 Electronic Model \n\nWe have implemented these feedback networks in VLSI, which speeds up learning by many orders of magnitude due to the parallel nature of weight adjustment and neuron state update. Our choice of learning technique for implementation is due mainly to the local learning rule, which makes it much easier to cast these networks into electronics than back-propagation. \n\nIndividual neurons in the Boltzmann machine have a probabilistic decision rule such that neuron i is in state s_i = 1 with probability \n\nPr(s_i = 1) = 1 / (1 + e^(\u2212u_i/T))     (1) \n\nwhere u_i = \u03a3_j w_ij s_j is the net input to each neuron, calculated by current summing, and T is a parameter that acts like temperature in a physical system and is represented by the noise and gain terms in Eq. (2), which follows. In the electronic model we use, each neuron performs the activation computation \n\ns_i = f(\u03b2 (u_i + \u03bd_i))     (2) \n\nwhere f is a monotonic non-linear function such as tanh. The noise, \u03bd, is chosen from a zero-mean Gaussian distribution whose width is proportional to the temperature. This closely approximates the distribution in Eq. (1) and comes from our hardware implementation, which supplies uncorrelated noise in the form of a binomial distribution[7] (Alspector, 1991b) to each neuron. The noise is slowly reduced as annealing proceeds. 
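As a software illustration of Eqs. (1) and (2) (a sketch, not the chip's binomial noise generator), a unit that thresholds its net input plus Gaussian noise reproduces the Boltzmann probability closely when the noise width is matched to the logistic distribution's:

```python
import math
import random

def noisy_unit(u, T, rng):
    # s = sign(u + nu): zero-mean Gaussian noise nu whose width is
    # proportional to temperature T, followed by a hard threshold.
    # Choosing sigma = pi*T/sqrt(3) matches the logistic variance, so
    # Pr(s = 1) is close to 1 / (1 + exp(-u/T)) of Eq. (1).
    sigma = math.pi * T / math.sqrt(3.0)
    return 1 if u + rng.gauss(0.0, sigma) > 0 else -1

def empirical_prob(u, T, trials=40000, seed=1):
    # Monte Carlo estimate of Pr(s = 1) for net input u at temperature T.
    rng = random.Random(seed)
    return sum(noisy_unit(u, T, rng) == 1 for _ in range(trials)) / trials
```

Annealing then corresponds to shrinking T (and hence the noise width) toward zero over repeated updates.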
For mean-field learning, the noise is zero but the gain, \u03b2, has a finite value proportional to 1/T taken from the annealing schedule. Thus the non-linearity sharpens as 'annealing' proceeds. \n\nThe network is annealed in two phases, + and \u2212, corresponding to clamping the outputs in the desired state (teacher phase) and allowing them to run free (student phase) at each pattern presentation. The learning rule which adjusts the weights w_ij from neuron j to neuron i is \n\n\u0394w_ij = sgn((s_i s_j)\u207a \u2212 (s_i s_j)\u207b)     (3) \n\nNote that this measures the instantaneous correlations after annealing. For both phases each synapse memorizes the correlations measured at the end of the annealing cycle, and then weight adjustment is made (i.e., online). The sgn matches our hardware implementation, which changes weights by one each time. \n\n2.2 Learning Microchip \n\nFig. 1 shows the learning microchip which has been fabricated. It contains 32 neurons and 992 connections (496 bidirectional synapses). On the extreme right is a noise generator which supplies 32 uncorrelated pseudo-random noise sources[7] (Alspector, 1991b) to the neurons to their left. These noise sources are summed in the form of current along with the weighted post-synaptic signals from other neurons at the input to each neuron in order to implement the simulated annealing process of the stochastic Boltzmann machine. The neuron amplifiers implement a non-linear activation \n\nFigure 1. 
Photo of 32-Neuron Cascadable Learning Chip \n\nfunction which has variable gain to provide for the gain-sharpening function of the mean-field technique. The range of neuron gain can also be adjusted to allow for scaling in summing currents due to adjustable network size. \n\nMost of the area is occupied by the synapse array. Each synapse digitally stores a weight ranging from \u221215 to +15 as 4 bits plus a sign. It multiplies the voltage input from the presynaptic neuron by this weight to output a current. One conductance direction can be disconnected so that we can experiment with asymmetric networks[8] (Allen, 1990). Although the synapses can have their weights set externally, they are designed to be adaptive. They store correlations, in parallel, using the local learning rule of Eq. (3) and adjust their weights accordingly. A neuron state range of \u22121 to 1 is assumed by the digital learning processor in each synapse on the chip. \n\nFig. 2a shows a family of transfer functions of a neuron, showing how the gain is continually adjustable by varying a control voltage. Fig. 2b shows the transfer function of a synapse as different weights are loaded. The input linear range is about 2 volts. \n\nFig. 3 shows waveforms during exclusive-OR learning using the noise annealing of the Boltzmann machine. The top three traces are hidden neurons while the bottom trace is the output neuron, which is clamped during the + phase. There are two input patterns presented during the time interval displayed, (\u22121,+1) and (+1,\u22121), both of which should output a +1 (note the state clamped to high voltage on the output neuron). Note the sequence of steps involved in each pattern presentation. 1) Outputs from the previous pattern are unclamped. 2) The new pattern is presented to the input neurons. 3) Noise is presented to the network and annealed. 
4) The student phase latch captures the correlations. 5) Data from the neuron states is read into the data analyzer. 6) The output neurons are clamped (no annealing is necessary for a three-layer network). 7) The teacher phase latch captures the correlations. 8) Weights are adjusted (go to step 1). \n\nFigure 2. Transfer Functions of Electronic Neuron (2a) and Synapse (2b) \n\nFigure 3. Neuron Signals during Learning (see text for steps involved) \n\nFig. 4a shows an expanded view of 4 neuron waveforms during the noise annealing portion of the chip operation during Boltzmann learning. Fig. 4b shows a similar portion during gain annealing. Note that, at low gain, the neuron states start at 2.5 volts and settle to an analog value between 0 and 5 volts. 
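The presentation cycle ends in the one-step weight change of Eq. (3). A minimal Python sketch of that synapse update (illustrative only; on the chip this runs in parallel in each synapse's digital logic), with states coded \u00b11 and weights confined to the 4-bit-plus-sign range:

```python
def sgn(x):
    # three-valued sign: -1, 0, or +1
    return (x > 0) - (x < 0)

def adjust_weights(W, s_plus, s_minus, wmax=15):
    # Eq. (3): compare the correlations latched at the end of the
    # teacher (+) and student (-) anneals and move each weight by at
    # most one step, clipped to -wmax..+wmax.
    n = len(s_plus)
    for i in range(n):
        for j in range(i):
            delta = sgn(s_plus[i] * s_plus[j] - s_minus[i] * s_minus[j])
            w = max(-wmax, min(wmax, W[i][j] + delta))
            W[i][j] = W[j][i] = w  # bidirectional (symmetric) synapse
    return W
```

For example, if neurons i and j agree in the teacher phase but disagree in the student phase, their weight moves up by one unit, saturating at +15.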
For the purposes of classification for the digital problems we investigated, neurons are either +1 or \u22121 depending on whether their voltage is above or below 2.5 volts. This isn't clear until after settling. There are several instances in Figs. 3 and 4 where the neuron state changes after noise or gain annealing. \n\nFigure 4. Neuron Signals during Annealing with Noise (4a) and Gain (4b) \n\nThe speed of pattern presentations is limited by the length of the annealing signal for system settling (100 \u00b5s in Fig. 3). The rest of the operations can be made negligibly short in comparison. The annealing time could be reduced to 10 \u00b5s or so, leading to a rate of about 100,000 patterns/sec. In comparison, a 10-10-10 replication problem, which fits on a single chip, takes about a second per pattern on a SPARCstation 2. This time scales roughly with the number of weights on a sequential machine, but is almost constant on the learning chip due to its parallel nature. \n\nWe can do even larger problems in a multiple-chip system because the chip is designed to be cascaded with other similar chips in a board-level system which can be accessed by a computer. 
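The throughput figures quoted here follow from simple arithmetic, which can be checked directly (the numbers below come from the text; treating each of the 992 connections as one update per pattern is our simplification):

```python
anneal_time = 10e-6                    # s: projected annealing time per pattern
patterns_per_sec = 1.0 / anneal_time   # settling-limited rate, ~100,000/s,
                                       # independent of network size
connections = 992                      # 496 bidirectional synapses on one chip,
                                       # counting both directions
cps = patterns_per_sec * connections   # ~1e8 connection-updates/s per chip
```

Because every synapse updates in parallel, the per-pattern time does not grow with the number of weights, unlike the sequential SPARCstation timing.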
The nodes which sum current from synapses for net input into a neuron are available externally for connection to other chips and for external clamping of neurons or other external input. We are currently building such a system with a VME bus interface for tighter coupling to our software than is allowed by the GPIB instrument bus we are using at the time of this writing. \n\n2.3 Learning Experiments \n\nTo study learning as a function of problem size, we chose the parity and replication (identity) problems. This facilitates comparisons with our previous simulations[6] (Alspector, 1991a). The parity problem is the generalization of exclusive-OR for arbitrary input size. It is difficult because the classification regions are disjoint with every change of input bit, but it has only one output. The goal of the replication problem is for the output to duplicate the bit pattern found on the input after being encoded by the hidden layer. Note that the output bits can be shifted or scrambled in any order without affecting the difficulty of the problem. There are as many output neurons as inputs. For the replication problem, we chose the hidden layer to have the same number of neurons as the input layer, while for parity we chose the hidden layer to have twice the number as the input layer. \n\nFigure 5. X-window Display for Learning on Chip (5a) and in Software (5b) \n\nFig. 5 shows the X-window display for 5 mean-field runs for learning the 4 input, 
4 hidden, 4 output (4-4-4) replication problem on the chip (5a) and in the simulator (5b). The user specification is the same for both. Only the learning calculation module is different. Both have displays of the network topology, the neuron states (color and pie-shaped arc of circles), and the network weights (color and size of squares). There are also graphs of percent correct and error (Hamming distance for replication) and one of volatility of neuron states[9] (Alspector, 1992) as a measure of the system temperature. The learning curves look quite similar. In both cases, one of the 5 runs failed to learn to 100%. The boxes representing weights are signed currents (about 4 \u00b5A per unit weight) in 5a and integers from \u221215 to +15 in 5b. Volatility is plotted as a function of time (\u00b5sec) in 5a and shows that, in hardware (see Fig. 4), time is needed for a gain decrease at the start of the annealing as well as for the gain increase of the annealing proper. The volatility in 5b is plotted as a function of gain (BETA), which increases logarithmically in the simulator at each anneal step. \n\nFigure 6. On-chip Learning for 6 Input Replication (6a) and Parity (6b) \n\nFig. 6a displays data from the average of 10 runs of 6-6-6 replication for both Boltzmann (BZ) and mean-field (MFT) learning. While the percent correct saturates at 90% (70% for Boltzmann), the output error as measured by the Hamming distance between input and output is less than 1 bit out of 6. 
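The two benchmark tasks are easy to generate in software. A sketch (our illustration; the \u00b11 coding matches the neuron state range used on the chip) of the training patterns and the Hamming-distance error measure:

```python
from itertools import product

def parity_patterns(n):
    # Generalized XOR: target is +1 when an odd number of inputs are +1.
    pats = []
    for bits in product((-1, 1), repeat=n):
        target = 1 if bits.count(1) % 2 == 1 else -1
        pats.append((list(bits), [target]))
    return pats

def replication_patterns(n):
    # Identity mapping: the n outputs should duplicate the n input bits.
    return [(list(bits), list(bits)) for bits in product((-1, 1), repeat=n)]

def hamming(out, target):
    # Output error for replication: number of bits that disagree.
    return sum(o != t for o, t in zip(out, target))
```

Note the exponential training-set size: n inputs give 2^n patterns, which is why the patterns-to-saturation counts below grow exponentially with input size.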
Boltzmann learning is somewhat poorer in this experiment, probably because circuit parameters have not yet been optimized. We expect that a combination of noise and gain annealing will yield the best results but have not tested this possibility at this writing. Fig. 6b is a similar plot for 6-12-1 parity. \n\nWe have done on-chip learning experiments using noise and gain annealing for parity and replication up to 8 input bits, nearly utilizing all the neurons on a single chip. To judge scaling behavior in these early experiments, we note the number of patterns required until no further improvement in percent correct is visible by eye. Fig. 7a plots, for an average of 10 runs of the parity problem, the number of patterns required to learn up to the saturation value for percent correct for both Boltzmann and mean-field learning. This scales roughly as an exponential in the number of inputs for learning on chip, just as it did in simulation[6] (Alspector, 1991a), since the training set size is exponential. The final percent correct is indicated on the plot. Fig. 7b plots the equivalent data for the replication problem. Outliers are due to low saturation values. Overall, the training time per pattern on-chip is quite similar to our simulations. However, in real time, it can be about 100,000 times as fast for a single chip and will be even faster for multiple-chip systems. The speed for either learning or evaluation is roughly 10^8 connections per second per chip. \n\nFigure 7. Scaling of Parity (7a) and Replication (7b) Problem with Input Size \n\n3. 
CONCLUSION \n\nWe have shown that Boltzmann and mean-field learning networks can be implemented in a parallel, analog VLSI system. While we report early experiments on a single-chip digital system, a multiple-chip VME-based electronic system with analog I/O is being constructed for use on larger problems. \n\nACKNOWLEDGMENT \n\nThis work has been partially supported by AFOSR contract F49620-90-C-0042, DEF. \n\nREFERENCES \n\n1. D.E. Rumelhart, G.E. Hinton, & R.J. Williams, \"Learning Internal Representations by Error Propagation\", in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, D.E. Rumelhart & J.L. McClelland (eds.), MIT Press, Cambridge, MA (1986), p. 318. \n\n2. J.J. Hopfield, \"Neural Networks and Physical Systems with Emergent Collective Computational Abilities\", Proc. Natl. Acad. Sci. USA, 79, 2554-2558 (1982). \n\n3. D.H. Ackley, G.E. Hinton, & T.J. Sejnowski, \"A Learning Algorithm for Boltzmann Machines\", Cognitive Science 9 (1985), pp. 147-169. \n\n4. C. Peterson & J.R. Anderson, \"A Mean Field Theory Learning Algorithm for Neural Networks\", Complex Systems, 1:5, 995-1019, (1987). \n\n5. J. Alspector, B. Gupta, & R.B. Allen, \"Performance of a Stochastic Learning Microchip\", in Advances in Neural Information Processing Systems 1, D. Touretzky (ed.), Morgan-Kaufmann, Palo Alto, (1989), pp. 748-760. \n\n6. J. Alspector, R.B. Allen, A. Jayakumar, T. Zeppenfeld, & R. Meir, \"Relaxation Networks for Large Supervised Learning Problems\", in Advances in Neural Information Processing Systems 3, R.P. Lippmann, J.E. Moody, & D.S. Touretzky (eds.), Morgan-Kaufmann, Palo Alto, (1991), pp. 1015-1021. \n\n7. J. Alspector, J.W. Gannett, S. Haber, M.B. Parker, & R. Chu, \"A VLSI-Efficient Technique for Generating Multiple Uncorrelated Noise Sources and Its Application to Stochastic Neural Networks\", IEEE Trans. Circuits & Systems, 38, 109, (Jan., 1991). \n\n8. R.B. 
Allen & J. Alspector, \"Learning of Stable States in Stochastic Asymmetric Networks\", IEEE Trans. Neural Networks, 1, 233-238, (1990). \n\n9. J. Alspector, T. Zeppenfeld, & S. Luna, \"A Volatility Measure for Annealing in Feedback Neural Networks\", to appear in Neural Computation, (1992). \n", "award": [], "sourceid": 453, "authors": [{"given_name": "Joshua", "family_name": "Alspector", "institution": null}, {"given_name": "Anthony", "family_name": "Jayakumar", "institution": null}, {"given_name": "Stephan", "family_name": "Luna", "institution": null}]}