{"title": "Stochastic Learning Networks and their Electronic Implementation", "book": "Neural Information Processing Systems", "page_first": 9, "page_last": 21, "abstract": null, "full_text": "9 \n\nStochastic Learning Networks and their Electronic Implementation \n\nJoshua Alspector*. Robert B. Allen. Victor Hut. and Srinagesh Satyanarayanat \n\nBell Communications Research. Morristown. NJ 01960 \n\nWe describe a family of learning algorithms that operate on a recurrent, symmetrically \nconnected. neuromorphic network that. like the Boltzmann machine, settles in the \npresence of noise. These networks learn by modifying synaptic connection strengths on \nthe basis of correlations seen locally by each synapse. We describe a version of the \nsupervised learning algorithm for a network with analog activation functions. We also \ndemonstrate unsupervised competitive learning with this approach. where weight \nsaturation and decay play an important role. and describe preliminary experiments in \nreinforcement learning. where noise is used in the search procedure. We identify the \nabove described phenomena as elements that can unify learning techniques at a physical \nmicroscopic level. \nThese algorithms were chosen for ease of implementation in vlsi. We have designed a \nCMOS test chip in 2 micron rules that can speed up the learning about a millionfold \nover an equivalent simulation on a VAX lln80. The speedup is due to parallel analog \ncomputation for snmming and multiplying weights and activations. and the use of \nphysical processes for generating random noise. The components of the test chip are a \nnoise amplifier. a neuron amplifier. and a 300 transistor adaptive synapse. each of which \nis separately testable. These components are also integrated into a 6 neuron and 15 \nsynapse network. Finally. 
we point out techniques for reducing the area of the electronic correlational synapse both in technology and design and show how the algorithms we study can be implemented naturally in electronic systems. \n\n1. INTRODUCTION \n\nThere has been significant progress, in recent years, in modeling brain function as the collective behavior of highly interconnected networks of simple model neurons. This paper focuses on the issue of learning in these networks, especially with regard to their implementation in an electronic system. Learning phenomena that have been studied include associative memory[1], supervised learning by error correction[2] and by stochastic search[3], competitive learning[4][5], reinforcement learning[6], and other forms of unsupervised learning[7]. From the point of view of neural plausibility as well as electronic implementation, we particularly like learning algorithms that change synaptic connection strengths asynchronously and are based only on information available locally at the synapse. This is illustrated in Fig. 1, where a model synapse uses only the correlations of the neurons it connects and perhaps some weak global evaluation signal not specific to individual neurons to decide how to adjust its conductance. \n\n* Address for correspondence: J. Alspector, Bell Communications Research, 2E-378, 435 South St., Morristown, NJ 07960 / (201) 829-4342 / josh@bellcore.com \n\n† Permanent address: University of California, Berkeley, EE Department, Cory Hall, Berkeley, CA 94720 \n\n‡ Permanent address: Columbia University, EE Department, S.W. Mudd Bldg., New York, NY 10027 \n\n© American Institute of Physics 1988 \n\n[Fig. 1 shows two neurons s_i and s_j connected by a synapse that sees their correlation C_ij and a global scalar evaluation signal r.] \n\nHebb-type learning rule: if C_ij increases (perhaps in the presence of r), increment w_ij. \n\nFig. 1. A local correlational synapse. 
\n\nWe believe that a stochastic search procedure is most compatible with this viewpoint. Statistical procedures based on noise form the communication pathways by which global optimization can take place based only on the interaction of neurons. Search is a necessary part of any learning procedure as the network attempts to find a connection strength matrix that solves a particular problem. Some learning procedures attack the search directly by gradient following through error correction[8][9], but electronic implementation requires specifying which neurons are input, hidden, and output in advance and necessitates global control of the error correction[2] procedure in a way that requires specific connectivity and synchrony at the neural level. There is also the question of how such procedures would work with unsupervised methods and whether they might get stuck in local minima. Stochastic processes can also do gradient following but they are better at avoiding minima, are compatible with asynchronous updates and local weight adjustments, and, as we show in this paper, can generalize well to less supervised learning. \n\nThe phenomena we studied are 1) analog activation, 2) noise, 3) semi-local Hebbian synaptic modification, and 4) weight decay and saturation. These techniques were applied to problems in supervised, unsupervised, and reinforcement learning. The goal of the study was to see if these diverse learning styles can be unified at the microscopic level with a small set of physically plausible and electronically implementable phenomena. The hope is to point the way for powerful electronic learning systems in the future by elucidating the conditions and the types of circuits that may be necessary. It may also be true that the conditions for electronic learning may have some bearing on the general principles of biological learning. \n\n2. 
LOCAL LEARNING AND STOCHASTIC SEARCH \n\n2.1 Supervised Learning in Recurrent Networks with Analog Activations \n\nWe have previously shown[10] how the supervised learning procedure of the Boltzmann machine[3] can be implemented in an electronic system. This system works on a recurrent, symmetrically connected network which can be characterized as settling to a minimum in its Liapunov function[1][11]. While this architecture may stretch our criterion of neural plausibility, it does provide for stability and analyzability. The feedback connectivity provides a way for a supervised learning procedure to propagate information back through the network as the stochastic search proceeds. More plausible would be a randomly connected network where symmetry is a statistical approximation and inhibition damps oscillations, but symmetry is more efficient and well matched to our choice of learning rule and search procedure. \n\nWe have extended our electronic model of the Boltzmann machine to include analog activations. Fig. 2 shows the model of the neuron we used and its tanh or sigmoid transfer function. The net input consists of the usual weighted sum of activations from other neurons but, in the case of Boltzmann machine learning, these are added to a noise signal chosen from a variety of distributions so that the neuron performs the physical computation: \n\nactivation = f(net_i) = f(sum_j w_ij*s_j + noise) = tanh(gain * net_i) \n\nInstead of counting the number of on-on and off-off cooccurrences of neurons which a synapse connects, the correlation rule now defines the value of a cooccurrence as: \n\nC_ij = f_i * f_j \n\nwhere f_i is the activation of neuron i, which is a real value from -1 to 1. Note that this rule effectively counts both on-on and off-off cooccurrences in the high gain limit. In this limit, for Gaussian noise, the cumulative probability distribution for the neuron to have activation +1 (on) is close to sigmoidal. 
The effect of noise \"jitter\" is illustrated at the bottom of the figure. The \nweight change rule is still: \n\nif Cij+ > Cij- then increment Wij .... else decrement \n\nwhere the plus phase clamps the output neurons in their desired states while the minus phase \nallows them to run free. \n\nAs\u00b7 mentioned, we have studied a variety of noise distributions other than those based on the \nBoltzmann distribution. The 2-2-1 XOR problem was selected as a test case since it has been \nshown! 10] to be easily caught in local minima. The gain was manipulated in conditions with no \nnoise or with noise sampled from one of three distributions. The Gaussian distribution is closest \nto true electronic thermal noise such as used in our implementation, but we also considered a \ncut-off uniform distribution and a Cauchy distribution with long noise tails for comparison. The \ninset to Fig. 3 shows a histogram of samples from the noise distributions used. The noise was \nmultiplied by the temperature to 'jitter' the transfer function. Hence. the jitter decreased as the \nannealing schedule proceeded. \n\n\f12 \n\n1;. Vnolse \n\n1;. vout or \nf. (r. W II II + noise) \n\nI \n\nJ \n\n1;. Vln+ 1;. Vnolsl \nor r. WIJIJ + noise = ne~ \n\nhigh IIIln \n\ntr8nl'.. function \nwUh noll. 'line\" \n\nFig. 2. Electronic analog neuron. \n\nFig. 3 shows average performance across 100 runs for the last 100 patterns of 2000 training \npattern presentations. It can be seen that reducing the gain from a sharp step can improve \nlearning in a small region of gain, even without noise. There seems to be an optimal gain level. \nHowever, the addition of noise for any distribution can substantially improve learning at all levels \nof gain. \n\n~ \n\ntl CLI \n~ u \nc \n0 \n.,.j \n~ \n\n~ 8. 0 \n&: \n\n1 \n\n0 . 9 \n\n0.8 \n\n0.7 \n\n0.6-\n\n0.5 \n\nGaussian \nUnifona \nCauchy \nHO Hoise \n\n~ -----~ -\n\n......-, .'.' .... u __ . . . , .. \n\n... 
\n\n., \n\n-3 \n\n10 \n\n-2 \n\n10 \n\n-1 \n\n10 \n\nInverse Gain \n\n1 \n\n1 \n\n10 \n\nFig. 3. Proportion correct vs. inverse gain. \n\n\f13 \n\n2.2 Stochastic Competitive Learning \n\nWe have studied how competitive leaming(4J[~) can be accomplished with stochastic local units. \nMter the presentation of the input pattern. the network is annealed and the weight is increased \nbetween the winning cluster unit and the input units which are on. As shown in Fig. 4 this \napproach was applied to the dipole problem of Rumelhart and Zipser. A 4x4 pixel array input \nlayer connects to a 2 unit competitive layer with recurrent inhibitory connections that are not \nadjusted. The inhibitory connections provide the competition by means of a winner-lake-all \nprocess as the network settles. The input patterns are dipoles -\nonly two input units are turned \nOIl at each pattern presentatiOll and they must be physically adjacent. either vertically or \nhorizontally. In this way, the network learns about the connectedness of the space and eventually \ndivides it into two equal spatial regions with each of the cluster units responding only to dipoles \nfrom one of the halves. Rumelhart and Zipser renormalized the weights after each pattern and \npicked the winning unit as the one with the highest activation. Instead of explicit nonnalization \nof the weights. we include a decay term proportional to the weight. The weights between the \ninput layer and cluster layer are incremented for on-on correlations, but here there are no \nalternating phases so that even this gross synchrony is not necessary. Indeed. if small time \nconstants are introduced to the weight updates. no external timing should be needed. \n\nwinner-lake-all \ncluster layer \n\ninput/ayer \n\nPig. 4. Competitive learning network for the dipole problem. \n\nFig. S shows the results of several runs. 
A 1 at the position of an input unit means that unit 1 of the cluster layer has the larger weight leading to it from that position. A + between two units means the dipole from these two units excites unit 1. A 0 and - means that unit 0 is the winner in the complementary case. Note that adjacent 1's should always have a + between them since both weights to unit 1 are stronger. If, however, there is a 1 next to a 0, then there is a tension in the dipole and a competition for dominance in the cluster layer. We define a figure of merit called \"surface tension\" which is the number of such dipoles in dispute. The smaller the number, the better. Note in Runs A and B, the number is reduced to 4, the minimum possible value, after 2000 pattern presentations. The space is divided vertically and horizontally, respectively. Run C has adopted a less favorable diagonal division with a surface tension of 6. \n\nNumber of dipole pattern presentations \n\n0 \n\n200 \n\n800 \n\n1400 \n\n0-0-0-0 \n\n0-0-0-0 \n\n0-0-0-0 \n\n2000 \n\n1+1+1+1 \n+ + + + \n1+1+1+1 \n- - - + \n0-0-0-0 \n\n0-0-1+1 \n\n0-0-1+1 \n\n- - + + \n- - + + \n- - + + \n\n0-0-1+1 \n\n0-0+1+1 \n\n1+1+1+1 \n- + + + \n0-0+1+1 \n- - + + \n- - - + \n\n0-0-0-1 \n\n0-0-0-1 \n\n0-0-0-0 \n\n0-0-0-0 \n\n0-0-0-0 \n\n0-0-0-0 \n\n0-0-0-0 \n\n0-0-0-0 \n\n0-0-0-0 \n\n0-0-0-0 \n\n0-0-0-0 \n\n0-0-0-0 \n\n0-0-0-0 \n\nRun A \n\nRun B \n\nRun C \n\n1+0-0+1 \n+ + + + \n1+1+1+1 \n+ + - -\n1+1-0-0 \n+ - - -\n0-0-0-0 \n\n0-0-0-0 \n\n- - - + \n- - - + \n\n0-0-0+1 \n\n1-0-1+1 \n+ - + + \n1+0+1+1 \n\n1+1+1+1 \n\n+ + + -\n+ + - -\n\n1+1+1-0 \n\n1-0-0-0 \n\n1+1+1+1 \n+ + + + \n1+1+1+1 \n+ - + -\n0-0-0-0 \n\n0-0-0+1 \n\n0-0-1+1 \n\n- - + + \n- - + + \n- - + + \n\n0-0-1+1 \n\n0-0+1+1 \n\n0-0-0-1 \n\n0-0-1+1 \n\n- - - + \n- - + + \n0-0-1+1 \n- - + + \n0-0+1+1 \n\n0+1+1+1 \n- + + + \n0-1+1+1 \n- + + + \n0-1+1+1 \n\n0+1+1+1 \n- + + + \n0+1+1+1 \n- + + + \n\n0-0-0-0 \n\n1+1+1+1 \n+ + + + \n0+1+1+1 \n- - + + \n0-0-0-0 \n\n0- 
0-0-0 \n\n0-0-0-0 \n\n0-0-0-0 \n\n0-0-0-0 \n\nFig. 5. Results of competitive learning runs on the dipole problem. \n\nTable 1 shows the result of several competitive algorithms compared when averaged over 100 such runs. The deterministic algorithm of Rumelhart and Zipser gives an average surface tension of 4.6 while the stochastic procedure is almost as good. Note that noise is essential in helping the competitive layer settle. Without noise the surface tension is 9.8, showing that the winner-take-all procedure is not working properly. \n\nCompetitive learning algorithm / \"surface tension\" \n\nStochastic net with decay, anneal T=3 to T=1.0: 4.8 \nStochastic net with decay, no anneal (70 @ T=1.0): 9.8 \nStochastic net with renormalization: 5.6 \nDeterministic, winner-take-all (Rumelhart & Zipser): 4.6 \n\nTable 1. Performance of competitive learning algorithms across 100 runs. \n\nWe also tried a procedure where, instead of decay, weights were renormalized. The model is that each neuron can support a maximum amount of weight leading into it. Biologically, this might be the area that other neurons can form synapses on, so that one synapse cannot increase its strength except at the expense of some of the others. Electronically, this can be implemented as current emanating from a fixed current source per neuron. As shown in Table 1, this works nearly as well as decay. Moreover, preliminary results show that renormalization is especially effective when more than two cluster units are employed. \n\nBoth of the stochastic algorithms, which can be implemented in an electronic synapse in nearly the same way as the supervised learning algorithm, divide the space just as the deterministic normalization procedure[4] does. This suggests that our chip can do both styles of learning, supervised if one includes both phases and unsupervised if only the procedure of the minus phase is used. 
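The stochastic competitive rule just described can be sketched as follows. This is a toy illustration, not the simulation code: the learning rate, decay constant, and noise level are assumed values, and a single noisy winner-take-all draw stands in for the annealed settling of the recurrent cluster layer.

```python
import random

def settle_winner(weights, pattern, temperature):
    """Noisy winner-take-all: the cluster unit with the larger noisy net
    input wins, standing in for settling under recurrent inhibition."""
    nets = [sum(w * s for w, s in zip(row, pattern)) + random.gauss(0.0, temperature)
            for row in weights]
    return max(range(len(nets)), key=nets.__getitem__)

def present_pattern(weights, pattern, lr=0.1, decay=0.02, temperature=0.5):
    """One presentation: increment weights from active inputs to the winner
    (on-on correlation), then apply decay proportional to each weight in
    place of explicit renormalization."""
    winner = settle_winner(weights, pattern, temperature)
    for j, s in enumerate(pattern):
        if s > 0:
            weights[winner][j] += lr
    for row in weights:
        for j in range(len(row)):
            row[j] *= (1.0 - decay)
    return winner
```

With these values the decay bounds any weight near lr*(1-decay)/decay (about 4.9), so saturation emerges from the dynamics rather than from an explicit renormalization step.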
\n\n2.3 Reinforcement Learning \n\nWe have tried several approaches to reinforcement learning using the synaptic model of Fig. 1 where the evaluation signal is a scalar value available globally that represents how well the system performed on each trial. We applied this model to an XOR problem with only one output unit. The reinforcement was r = 1 for the correct output and r = -1 otherwise. To the network, this was similar to supervised learning since for a single unit, the output state is fully specified by a scalar value. A major difference, however, is that we do not clamp the output unit in the desired state in order to compare plus and minus phases. This feature of supervised learning has the effect of adjusting weights to follow a gradient to the desired state. In the reinforcement learning described here, there is no plus phase. This has a satisfying aspect in that no overall synchrony is necessary to compare phases, but is also much slower at converging to a solution because the network has to search the solution space without the guidance of a teacher clamping the output units. This situation becomes much worse when there is more than one output unit. In that case, the probability of reinforcement goes down exponentially with the number of outputs. To test multiple outputs, we chose the simple replication problem whereby the output simply has to replicate the input. We chose the number of hidden units equal to the input (or output). \n\nIn the absence of a teacher to clamp the outputs, the network has to find the answer by chance, guided only by a \"critic\" which rates its effort as \"better\" or \"worse\". This means the units must somehow search the space. We use the same stochastic units as in the supervised or unsupervised techniques, but now it is important to have the noise or the annealing temperature set to a proper level. 
If it is too high, the reinforcement received is random rather than directed by the weights in the network. If it is too low, the available states searched become too small and the probability of finding the right solution decreases. We tuned our annealing schedule by looking at a volatility measure defined at each neuron which is simply the fraction of the time the neuron activation is above zero. We then adjust the final anneal temperature so that this number is neither 0 nor 1 (noise too low) nor 0.5 (noise too high). We used both a fixed annealing schedule for all neurons and a unit-specific schedule where the noise was proportional to the sum of weight magnitudes into the unit. A characteristic of reinforcement learning is that the percent correct initially increases but then decreases and often oscillates widely. To avoid this, we added a factor of (1 - ...) to damp the weight changes. If R = r*s_i*s_j > 0, then w_ij is incremented, and it is decremented if R < 0. We later refined this procedure by insisting that the reinforcement be greater than a recent average, so that R = (r - <r>)*s_i*s_j. This type of procedure appears in previous work in a number of forms.[12][13] For r = ±1 only, this \"excess reinforcement\" is the same as our previous algorithm but differs if we make a comparison between short term and long term averages or use a graded reinforcement such as the negative of the sum squared error. Following a suggestion by G. Hinton, we also investigated a more complex technique whereby each synapse must store a time average of three quantities: <r*s_i*s_j>, <r>, and <s_i*s_j>. The definition now is R = <r*s_i*s_j> - <r>*<s_i*s_j> and the rule is the same as before. Statistically, this is the same as \"excess reinforcement\" if the latter is averaged over trials. For the results reported below the values were collected across 10 pattern presentations. A variation, which employed a continuous moving average, gave similar results. 
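The three reinforcement measures just described can be sketched directly. This is an illustrative sketch under the stated definitions; the function names and the averaging window are assumptions, not the simulation's implementation.

```python
def plain_r(r, si, sj):
    """R = r * s_i * s_j: credit the active correlation with the reward."""
    return r * si * sj

def excess_r(r, si, sj, recent_rewards):
    """'Excess reinforcement': R = (r - <r>) * s_i * s_j, where <r> is a
    recent average of the scalar reinforcement."""
    avg = sum(recent_rewards) / len(recent_rewards)
    return (r - avg) * si * sj

def covariance_r(trials):
    """Covariance form over a window of (r, s_i, s_j) trials:
    R = <r*s_i*s_j> - <r>*<s_i*s_j>."""
    n = len(trials)
    rss = sum(r * si * sj for r, si, sj in trials) / n
    rbar = sum(r for r, _, _ in trials) / n
    ss = sum(si * sj for _, si, sj in trials) / n
    return rss - rbar * ss

def step_weight(w, R):
    """Increment w_ij when R > 0, decrement when R < 0."""
    return w + 1 if R > 0 else w - 1 if R < 0 else w
```

Note that for r = ±1 and <r> computed over the same window, the excess and covariance forms agree on average, matching the statistical equivalence claimed in the text.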
\n\nTable 2 summarizes the performance on the XOR and the replication task of these reinforcement learning techniques. As the table shows, a variety of increasingly sophisticated weight adjustment rules were explored; nevertheless we were unable to obtain good results with the techniques described for more than 5 output units. In the third column, a small threshold had to be exceeded prior to weight adjustment. In the fourth column, unit-specific temperatures, dependent on the sum of weights, were employed. The last column in the table refers to frequency dependent learning where we trained on a single pattern until the network produced a correct answer and then moved on to another pattern. This final procedure is one of several possible techniques related to 'shaping' in operant learning theory in which difficult patterns are presented more often to the network. \n\nnetwork / t=1 / time-averaged / +eps=0.1 / +T ~ sum(W) / +freq \n\nxor 2-4-1: (0.60) 0.64 / (0.70) 0.88 / (0.76) 0.88 / (0.92) 0.99 / (0.98) 1.00 \nxor 2-2-1: (0.58) 0.57 / (0.69) 0.74 / (0.96) 1.00 / (0.85) 1.00 / (0.78) 0.88 \nreplication 2-2-2: (0.94) 0.94 / (0.46) 0.46 / (0.91) 0.97 / (0.87) 0.99 / (0.97) 1.00 \nreplication 3-3-3: (0.15) 0.21 / (0.31) 0.33 / (0.31) 0.62 / (0.37) 0.37 / (0.97) 1.00 \nreplication 4-4-4: - / - / - / - / (0.75) 1.00 \nreplication 5-5-5: - / - / - / - / (0.13) 0.87 \nreplication 6-6-6: - / - / - / - / (0.02) 0.03 \n\nTable 2. Proportion correct performance of reinforcement learning after (2K) and 10K patterns. \n\nOur experiments, while incomplete, hint that reinforcement learning can also be implemented by the same type of local-global synapse that characterizes the other learning paradigms. Noise is also necessary here for the random search procedure. \n\n2.4 Summary of Study of Fundamental Learning Parameters \n\nIn summary, we see that the use of noise and our model of a local correlational synapse with a non-specific global evaluation signal are two important features in all the learning paradigms. Graded activation is somewhat less important. Weight decay seems to be quite important although saturation can substitute for it in unsupervised learning. Most interesting from our point of view is that all these phenomena are electronically implementable and therefore physically plausible. Hopefully this means they are also related to true neural phenomena and therefore provide a basis for unifying the various approaches of learning at a microscopic level. \n\n3. ELECTRONIC IMPLEMENTATION \n\n3.1 The Supervised Learning Chip \n\nWe have completed the design of the chip previously proposed.[10] Its physical style of computation speeds up learning a millionfold over a computer simulation. Fig. 6 shows a block diagram of the neuron. It is a double differential amplifier. One branch forms a sum of the inputs from the differential outputs of all other neurons with connections to it. The other adds noise from the noise amplifier. This first stage has low gain to preserve dynamic range at the summing nodes. The second stage has high gain and converts to a single ended output. This is fed to a switching arrangement whereby either this output state or some externally applied desired state is fed into the final set of inverter stages which provide for more gain and guaranteed digital complementarity. \n\nFig. 6. Block diagram of neuron. \n\nThe noise amplifier is shown schematically in Fig. 7. Thermal noise, with an rms level of tens of microvolts, from the channel of an FET is fed into a 3 stage amplifier. Each stage provides a potential gain of 100 over the noise bandwidth. 
Low pass feedback in each stage stabilizes the DC output as well as controls gain and bandwidth by means of an externally controlled variable resistance for tuning the annealing cycle. \n\nFig. 7. Block diagram of noise amplifier. \n\nFig. 8 shows a block diagram of the synapse. The weight is stored in 5 flip-flops as a sign and magnitude binary number. These flip-flops control the conductance from the outputs of neuron i to the inputs of neuron j and vice-versa as shown in the figure. The conductances of the FETs are in the ratio 1:2:4:8 to correspond to the value of the binary number while the sign bit determines whether the true or complementary lines connect. The flip-flops are arranged in a counter which is controlled by the correlation logic. If the plus phase correlations are greater than the minus phase, then the counter is incremented by a single unit. If less, it is decremented. \n\nFig. 8. Block diagram of synapse. \n\nFig. 9 shows the layout of a test chip. A 6 neuron, 15 synapse network may be seen in the lower left corner. Each neuron has attached to it a noise amplifier to assure that the noise is uncorrelated. The network occupies an area about 2.5 mm on a side in 2 micron design rules. Each 300 transistor synapse occupies 400 by 600 microns. In contrast, a biological synapse occupies only about one square micron. The real miracle of biological learning is in the synapse, where plasticity operates on a molecular level, not in the neuron. We can't hope to compete using transistors, however small, especially in the digital domain. Aside from this small network, the rest of the chip is occupied with test structures of the various components. 
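The digital synapse's weight storage and update logic can be modeled behaviorally as follows. This is a software sketch of the Fig. 8 behavior, not the circuit itself; the class name and the unit-conductance parameter are assumptions for the example.

```python
class DigitalSynapse:
    """Sign bit plus 4-bit magnitude counter; the magnitude bits gate
    parallel FETs with conductances in the ratio 1:2:4:8, giving signed
    weight values from -15 to +15 units."""

    def __init__(self, weight=0):
        assert -15 <= weight <= 15
        self.weight = weight

    def magnitude_bits(self):
        # which of the 1x, 2x, 4x, 8x conductance FETs are switched on
        m = abs(self.weight)
        return [(m >> k) & 1 for k in range(4)]

    def conductance(self, unit=1.0):
        """Signed conductance: the sign bit selects true or complementary
        lines; the magnitude bits select the binary-weighted ladder."""
        sign = -1 if self.weight < 0 else 1
        return sign * unit * sum(bit << k for k, bit in enumerate(self.magnitude_bits()))

    def update(self, corr_plus, corr_minus):
        """Correlation logic: count up by one unit if plus-phase correlations
        exceed minus-phase, down if less, saturating at the counter limits."""
        if corr_plus > corr_minus and self.weight < 15:
            self.weight += 1
        elif corr_plus < corr_minus and self.weight > -15:
            self.weight -= 1
```

Saturation at ±15 is the electronic analog of the weight saturation that the text identifies as one of the unifying learning phenomena.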
\n\n3.2 Analog Synapse \n\nAnalog circuit techniques can reduce the size of the synapse and increase its functionality. Several recent papers[14][15] have shown how to make a voltage controlled resistor in MOS technology. The voltage controlling the conductance representing the synaptic weight can be obtained by an analog charge integrator from the correlated activation of the neurons which the synapse in question connects. A charge integrator with a \"leaky capacitor\" has a time constant which can be used to make comparisons as a continuous time average over the last several trials, thereby adding temporal information. One can envision this time constant as being adaptive as well. The charge integrator directly implements the analog Hebb-type[16] correlation rules of section 2. \n\nFig. 9. Chip layout. \n\n3.3 Technological Improvements for Electronic Neural Networks \n\nIt is still necessary to store the voltage which controls the analog conductance and we propose the EPROM[17] or EEPROM device for this. Such a device can hold the value of the weight in the same way that flip-flops do in the digital implementation of the synapse[10]. The process which creates this device has two polysilicon layers which are useful for making high valued capacitances in analog circuitry. In addition, the second polysilicon layer could be used to make CCD devices for charge storage and transport. Coupled with the charge storage on a floating gate[18], this forms a compact, low power representation for weight values that approach biological values. Another useful addition would be a high valued stable resistive layer[19]. One could thereby avoid space-wasting long-channel MOSFETs which are currently the only reasonable way to achieve high resistance in MOS technology. Lastly, the addition of a diffusion step or two creates a Bi-CMOS process which adds high quality bipolar transistors useful in analog design. Furthermore, one gets the logarithmic dependence of voltage on current in bipolar technology in a natural, robust way, that is not subject to the variations inherent in using MOSFETs in the subthreshold region. This is especially useful in compressing the dynamic range in sensory processing[20]. \n\n4. CONCLUSION \n\nWe have shown how a simple adaptive synapse which measures correlations can account for a variety of learning styles in stochastic networks. By embellishing the standard CMOS process and using analog design techniques, a technology suitable for implementing such a synapse electronically can be developed. Noise is an important element in our formulation of learning. 
It can help a network settle, interpolate between discrete values of conductance during learning, and search a large solution space. Weight decay (\"forgetting\") and saturation are also important for stability. These phenomena not only unify diverse learning styles but are electronically implementable. \n\nACKNOWLEDGMENT \n\nThis work has been influenced by many researchers. We would especially like to thank Andy Barto and Geoffrey Hinton for valuable discussions on reinforcement learning, Yannis Tsividis for contributing many ideas in analog circuit design, and Joel Gannett for timely releases of his VLSI verification software. \n\nReferences \n\n1. J.J. Hopfield, \"Neural networks and physical systems with emergent collective computational abilities\", Proc. Natl. Acad. Sci. USA 79, 2554-2558 (1982). \n\n2. D.E. Rumelhart, G.E. Hinton, and R.J. Williams, \"Learning internal representations by error propagation\", in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations, edited by D.E. Rumelhart and J.L. McClelland (MIT Press, Cambridge, MA, 1986), p. 318. \n\n3. D.H. Ackley, G.E. Hinton, and T.J. Sejnowski, \"A learning algorithm for Boltzmann machines\", Cognitive Science 9, 147-169 (1985). \n\n4. D.E. Rumelhart and D. Zipser, \"Feature discovery by competitive learning\", Cognitive Science 9, 75-112 (1985). \n\n5. S. Grossberg, \"Adaptive pattern classification and universal recoding: Part I. Parallel development and coding of neural feature detectors\", Biological Cybernetics 23, 121-134 (1976). \n\n6. A.G. Barto, R.S. Sutton, and C.W. Anderson, \"Neuronlike adaptive elements that can solve difficult learning control problems\", IEEE Trans. Sys. Man Cyber. 13, 835 (1983). \n\n7. B.A. Pearlmutter and G.E. Hinton, \"G-Maximization: An unsupervised learning procedure for discovering regularities\", in Neural Networks for Computing, edited by J.S. 
Denker, AIP Conference Proceedings 151, American Inst. of Physics, New York (1986), p. 333. \n\n8. F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms (Spartan Books, Washington, D.C., 1961). \n\n9. B. Widrow and M.E. Hoff, \"Adaptive switching circuits\", Inst. of Radio Engineers, Western Electric Show and Convention, Convention Record, Part 4, 96-104 (1960). \n\n10. J. Alspector and R.B. Allen, \"A neuromorphic vlsi learning system\", in Advanced Research in VLSI: Proceedings of the 1987 Stanford Conference, edited by P. Losleben (MIT Press, Cambridge, MA, 1987), pp. 313-349. \n\n11. M.A. Cohen and S. Grossberg, \"Absolute stability of global pattern formation and parallel memory storage by competitive neural networks\", Trans. IEEE 13, 815 (1983). \n\n12. B. Widrow, N.K. Gupta, and S. Maitra, \"Punish/Reward: Learning with a critic in adaptive threshold systems\", IEEE Trans. on Sys. Man & Cyber., SMC-3, 455 (1973). \n\n13. R.S. Sutton, \"Temporal credit assignment in reinforcement learning\", unpublished doctoral dissertation, U. Mass. Amherst, technical report COINS 84-02 (1984). \n\n14. Z. Czarnul, \"Design of voltage-controlled linear transconductance elements with a matched pair of FET transistors\", IEEE Trans. Circ. Sys. 33, 1012 (1986). \n\n15. M. Banu and Y. Tsividis, \"Floating voltage-controlled resistors in CMOS technology\", Electron. Lett. 18, 678-679 (1982). \n\n16. D.O. Hebb, The Organization of Behavior (Wiley, NY, 1949). \n\n17. D. Frohman-Bentchkowsky, \"FAMOS - a new semiconductor charge storage device\", Solid-State Electronics 17, 517 (1974). \n\n18. J.P. Sage, K. Thompson, and R.S. Withers, \"An artificial neural network integrated circuit based on MNOS/CCD principles\", in Neural Networks for Computing, edited by J.S. Denker, AIP Conference Proceedings 151, American Inst. of Physics, New York (1986), p. 381. \n\n19. A.P. Thakoor, J.L. Lamb, A. Moopenn, and J. 
Lambe, \"Binary synaptic connections ba!ICd on memory switching \nin a-Si:H\". in Neural N~\"\"\"orks for Computing. edited by J.S. Denker, AIP Conference Proceedings 151. American \nInst. of Physics, New York (1986), p.426. \n\n20. M.A. Sivilotti, M.A. Mahowald, and C.A. Mead, ~ReaJ-Time visual computations using analog CMOS processing \narrays\", in Advanud R~S('arch in VLSl: Prou~dings of thr 1987 Stanford Corrf~r~nu. edited by P. Losleben \n(MIT Press, Cambridge, MA, 1987), pp. 295-312. \n\n\f", "award": [], "sourceid": 80, "authors": [{"given_name": "Joshua", "family_name": "Alspector", "institution": null}, {"given_name": "Robert", "family_name": "Allen", "institution": null}, {"given_name": "Victor", "family_name": "Hu", "institution": null}, {"given_name": "Srinagesh", "family_name": "Satyanarayana", "institution": null}]}~~