{"title": "Performance of a Stochastic Learning Microchip", "book": "Advances in Neural Information Processing Systems", "page_first": 748, "page_last": 760, "abstract": null, "full_text": "Performance of a Stochastic Learning Microchip \n\nJoshua Alspector, Bhusan Gupta, and Robert B. Allen \n\n\u2022 \n\nBellcore, Morristown, NJ 07960 \n\nWe have fabricated a test chip in 2 micron CMOS that can perform supervised \nlearning in a manner similar to the Boltzmann machine. Patterns can be \npresented to it at 100,000 per second. The chip learns to solve the XOR \nproblem in a few milliseconds. We also have demonstrated the capability to \ndo unsupervised competitive learning with it. The functions of the chip \ncomponents are examined and the performance is assessed. \n\n1. INTRODUCTION \n\nIn previous work,[1] [2] we have pointed out the importance of a local learning rule, \nfeedback connections, and stochastic elements[3] for making learning models that are \nelectronically implementable. We have fabricated a test chip in 2 micron CMOS \ntechnology that embodies these ideas and we report our evaluation of the microchip and \nour plans for improvements. \n\nKnowledge is encoded in the test chip by presenting digital patterns to it that are \nexamples of a desired input-output Boolean mapping. This knowledge is learned and \nstored entirely on chip in a digitally controlled synapse-like element in the form of \nconnection strengths between neuron-like elements. The only portion of this learning \nsystem which is off chip is the VLSI test equipment used to present the patterns. \n\nThis learning system uses a modified Boltzmann machine algorithm[3] which, if \nsimulated on a serial digital computer, takes enormous amounts of computer time. Our \nphysical implementation is about 100,000 times faster. 
The test chip, if expanded to a \nboard-level system of thousands of neurons, would be an appropriate architecture for \nsolving artificial intelligence problems whose solutions are hard to specify using a \nconventional rule-based approach. Examples include speech and pattern recognition and \nencoding some types of expert knowledge. \n\n2. CHIP COMPONENTS \n\nFig. 1 is a photograph of the silicon chip. It contains various test structures, the largest of \nwhich, in the lower left, is a neural-style learning network composed of 6 neurons, each \nwith its own noise amplifier, and 15 bidirectional synapses which potentially allow the \nnetwork to be fully connected. In order to study these components separately, there is \nalso a noise amplifier in the upper left corner of the chip, a neuron in the upper right, and \n2 synapses in the lower right. \n\n\u2022 Permanent address: University of California, Berkeley; EE Dep't, Cory Hall; Berkeley, CA 94720 \n\nFigure 1. Photograph of Test Chip Containing a Learning Network in Lower Left. \n\n2.1 Neuron \n\nThe electronic neuron performs the physical computation: \n\nactivation = f(\u03a3_j w_ij s_j + noise) = f(gain * net_i) \n\nwhere f is a monotonic non-linear function such as tanh. In some of our computer \nsimulations this is a step function corresponding to a high value of gain. The signal from \nother neurons to neuron i is the sum of neural states s_j giving input weighted by the \nconnection strengths w_ij, while the noise simulates a temperature in a physical \nthermodynamic system. Their sum is the effective net input net_i. \n\nThe model neuron is a double differential amplifier as shown in Fig. 2. Noise and signal \nhave separate differential inputs and are summed at low gain. 
The differential outputs of \nthis summing stage are converted to a single output by a high gain stage before being fed \ninto a switching arrangement. This selects either the net input or an external clamping \nsignal which forces the neuron into a desired state. The output of the switch is then \nfurther amplified before driving the network. The final output approximates a two-state \nbinary neuron. \n\nFigure 2. Circuitry of Electronic Analog Neuron. \n\n2.2 Noise amplifier \n\nFigure 3. Block Diagram of Noise Amplifier. \n\nFig. 3 is a block diagram of the noise amplifier. The original idea was to amplify the \nthermal noise in the channel of a transistor with a gain of nearly a million but to stabilize \nthe dc output using low pass negative feedback in 3 stages. By controlling the feedback, \none could control both the bandpass of the noise signal and the gain to provide for \nannealing the temperature (amount of noise) as required by the Boltzmann machine \nalgorithm.[3] Unfortunately, this amplifier proved unstable at high gain values, leading to \noscillations of a few MHz which were highly correlated among all the noise amplifiers in \nthe network. In spite of this undesirable correlation in the noise signals, the network was \nstill able to learn (see section 3). Rather than a slow \"annealing\", we used a rapid \n\"heating\" and \"flash freezing\" of the network to randomize it. This was done by \nmomentarily opening a \"noise on\" switch during the time allotted for annealing. \nLearning was also demonstrated by clamping the free running neurons momentarily to a \npseudo-random state and then releasing them to allow the network to settle. \n\n2.3 Synapse \n\nFig. 4 is a block diagram of the digitally controlled electronic synapse. 
The weights are \nstored as a sign and four bits of magnitude in five flip-flops arranged as an up-down \ncounter. The correlation logic tests whether the two neurons that the synapse connects \nhave the same binary state (correlated) or not at the end of the anneal cycle. If the \nneurons are correlated in the \"teacher\" phase (when the teacher is clamping the output \nneurons in the correct state) and not in the \"student\" phase (when the output neurons are \nrunning free), then a signal to the counter increments the weight by one. If the reverse is \ntrue, the counter is decremented. If the \"teacher\" and \"student\" phases have the same \ncorrelation, no change is made. \n\nFigure 4. Block Diagram of Synapse. \n\nThe digital weight is converted to an analog conductance by a set of pass transistors with \ngraduated binary conductance ratios. Measurements confirmed that the synapse \nconductance increased monotonically from a value of -15 through +15 as the counter was \nincremented. The -0 value, when loaded into the synapse, disconnected that link. We \nusually initialized all the weights to +0 before learning. \n\n3. PERFORMANCE EVALUATION OF NETWORK \n\n3.1 XOR tests \n\nThe most difficult test for our 6 neuron network was to have it learn the exclusive-OR \nfunction. The network was arranged with 2 input neurons, 2 hidden neurons, and 1 \noutput neuron as shown in Fig. 5. There is also a so-called 'true' neuron which is always \nclamped on. The negative of the weights from that neuron provides the threshold for the \nother neurons. The exclusive-OR function is of historical interest because the neural \nmodels of the 1960's could not learn it.[4] [5] This is because those learning algorithms did \nnot work when there was a layer of hidden neurons. 
Networks with only a single layer of \nmodifiable weights could learn the logical OR function but not the exclusive-OR (XOR). \nThe truth table in Fig. 5 shows that the XOR is 1 (on or true) when either one of the two \ninputs is 1, but not when both are 1. However, recent algorithms such as the Boltzmann \nmachine are able to learn with a hidden layer and hence can solve the XOR. \n\nXOR truth table (hidden states are left for the network to learn): \n\nin   out \n00   0 \n01   1 \n10   1 \n11   0 \n\nFigure 5. 2-2-1 Network to Learn XOR. \n\nTo teach a network to be an XOR, we start with a blank slate where all the weights are \nzero and then present the patterns of 1's and 0's in the figure with the teacher alternately \nclamping the output to the correct state and letting it run free. On each presentation, the \nnetwork is jittered by noise and correlations are counted by each synapse. At the end of \neach teacher-student cycle, weights are adjusted. \n\nTests of the chip were conducted using an HP 8180A data generator to present digital \npatterns to the chip, an HP 8182 data analyzer to capture the chip's digital outputs, and \nan HP 54112A digitizing oscilloscope to capture waveforms. Analog waveforms were \ngenerated using an HP 8770A arbitrary waveform synthesizer feeding a Comlinear E20 1 I \namplifier. These instruments were controlled by an HP 9836 computer running UNIX \nwith test programs written in C. \n\nA pattern presentation phase consisted of five subphases and hence five clock cycles of \nthe data generator. The input and/or output pattern to be presented to the clamped \nneurons is present during all five cycles. The first cycle presents noise or an annealing \nwaveform to the network. The second cycle sends a signal to each synapse to count \ncorrelations. The fourth cycle can be used to send a signal to each synapse to adjust \nweights. 
This is usually done only after two 5 cycle phases, one for the \"teacher\" phase \nand one for the \"student\" phase. Thus, during learning, ten digital words were used in the \ndata generator for each pattern presentation. \n\nIn addition to presenting patterns, digital weights can also be read into the chip with a \nsimilar 5 cycle phase. This uses the flip-flop storage arranged as a shift register for \nweight storage and readout. Because the memory of the data generator was only 1024 \nbits deep, we would present only 66 patterns (660 words) each time the data generator \nwas loaded by the control computer. The remaining memory was used to initialize the \nnetwork to its previous value after the destructive readout of weights. In this way, \nperformance of the network was monitored after sets of 66 pseudo-randomly selected \npatterns. 100 test patterns could also be presented, without learning, to see what \nperformance the network achieved at that point. \n\nFor the XOR, we organized the connectivity as in Fig. 5. For example, the connections \nbetween input and output neurons were fixed at zero. In order to test the settling of the \nnetwork, we loaded a set of synapse weights that were learned in one of the computer \nsimulations. We then checked the settling times of the network for various transitions of \ninput states. These varied from 130 to 1700 nanoseconds, with most transitions in the \n250 to 600 nanosecond range. The shortest time is a simple settling of the neuron \namplifier while the longest time represents several loops of settling of the network before \na stable state is found. \n\nFor the learning trials, we initialized all weights to zero. Fig. 6 shows three learning \ncurves for a 2-2-1 XOR network (Fig. 5). At first the network performs at chance but it \nsoon learns all the patterns. The values of the weights (which have an accuracy of 4 bits \nplus a sign) after learning are also shown for one of the trials. 
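The weight adjustment made at the end of each teacher-student cycle (the rule given in Section 2.3) can be illustrated with a minimal sketch. This is not the chip's logic design: the function names, the tuple representation of neuron states, and the choice to saturate at the counter limits instead of wrapping around are all assumptions made for illustration, and the distinction between the +0 and -0 codes is ignored.

```python
W_MAX = 15  # four bits of magnitude plus a sign, as in the synapse description


def correlated(s_i, s_j):
    # Two neurons count as correlated when they are in the same binary state.
    return s_i == s_j


def update_weight(w, teacher_pair, student_pair):
    # teacher_pair and student_pair are hypothetical (s_i, s_j) tuples holding
    # the states of the two connected neurons at the end of each phase.
    c_teacher = correlated(*teacher_pair)
    c_student = correlated(*student_pair)
    if c_teacher and not c_student:
        w += 1  # correlated only while the teacher clamps the outputs
    elif c_student and not c_teacher:
        w -= 1  # correlated only while the outputs run free
    # Assumption: the counter saturates at its limits rather than wrapping.
    return max(-W_MAX, min(W_MAX, w))
```

When the two phases agree the weight is returned unchanged, so a network that already reproduces the teacher's correlations stops adjusting that synapse.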
\n\nThe chip had an easier time learning the XOR function in a network with only one \nhidden unit provided there were also direct connections from input to output as shown in \nthe inset of Fig. 7. This also demonstrates the flexibility of the connectivity on the chip \nwhich would not be possible if we organized it as a strictly layered network. The figure \nshows the learning curves at various speeds of pattern presentation from 500 to 256,000 \npatterns per second. The clock rate of the data generator at the highest speed was 2.56 \nMHz so that the time during which noise was applied was only 400 nanoseconds. The \nnoise amplifier often did not produce an excursion of neural states at these frequencies, \neffectively limiting learning above this rate. We could have increased the rate by \ncompressing the five cycle phase to three or by random clamping of free running \nneurons, but probably not by an order of magnitude. Note that noise is necessary for \nlearning by this system as shown by the curve at 500 Hz without noise. \n\nFigure 6. Proportion Correct for On-chip Learning vs. Patterns Presented. \n\nFigure 7. Learning Curves for 2-1-1 XOR at Various Speeds. \n\nFig. 8 is an oscilloscope trace of the 4 neural states as a function of time during the \npattern presentations. \n\n
Figure 8. Neural States during Learning. (Traces: output unit, hidden unit, input unit A, input unit B; timing marks show when noise is applied, when weights are adjusted, and the teacher and student phases for each pattern.) \n\nThe time during which noise is applied is apparent from the rapid changes of state in the \nhidden neuron and also in the output neuron when it is not clamped. Since each pattern \npresentation can take as little as 5 microseconds, the XOR function can be learned in a \nfew milliseconds. A pattern presentation on a 1 MIP serial computer such as a VAX \n11/780 takes about 0.5 seconds with our simulation software. \n\n3.2 Unsupervised Learning \n\nSo far, we have described only supervised learning procedures, but the chip can also do \nunsupervised learning, which has no teacher. Even without a teacher, the network can learn to \nclassify input patterns according to their similarity to one another. We set the chip \nconnectivity as in Fig. 9 with 4 input neurons and 2 output neurons arranged so that they \nstrongly inhibit each other to form a 'competitive' layer. With noise, this output layer \nperforms a 'winner-take-all' function in that the output neuron which has the strongest \nnet input is on and the other is off. This is because they inhibit each other strongly (are \nconnected to each other with a large negative weight) so that only one can be on. The \nusual supervised learning rule was effectively simplified by removing the teacher \nrequirement so that correlations always increment weights. 
Specifically, we stored a \ncomparison pattern in the student phase which consisted of the 'on' state for the two \ncompetitive neurons and 'off' for all the input neurons. We then presented patterns to the \nchip with the \"teacher\" phase signal on. This has the effect of always decrementing the \ncompetitive connections which therefore remain at the lower limit of -15 since it is not \npossible to have more correlations than the stored \"student\" phase correlation. On the \nother hand, the stored \"student\" phase correlation for the weights leading from the input \nto the competitive layer is zero. Then, the winning output neuron will always be \ncorrelated with those input neurons which are 'on' and hence these weights will be \nincremented. A decay signal decremented weights occasionally to keep them from \ngrowing too large. The net effect of such a procedure is for the output neurons to classify \nthe input space among themselves, such that each responds to a particular neighborhood \nof similar patterns.[2] \n\nFigure 9. A Competitive Learning Network. \n\nTo demonstrate competitive learning, an input set was prepared such that the four input \nbits were not quite random. We picked two input neurons to represent 'left' and the other \ntwo to represent 'right'. Patterns were never used with an equal number of left and right \nneurons on. Eventually one of the two output neurons responded to left-weighted \npatterns and the other to right-weighted patterns. Fig. 9 shows one set of weights which \nwere obtained. Therefore the chip learned left from right although nothing in its wiring \npredisposed it in any way. \n\n3.3 Computer Simulations of Chip Test Conditions \n\nComputer tests were conducted which simulated limitations of the operating chip such as \ncorrelated noise. Table 1 presents summaries of 10 replications of 2000 pattern \npresentations across 5 testing conditions. 
The Table reports the mean percent correct on \nthe last 100 patterns and, in parentheses, the number of networks which reached 100% \nperformance during at least one block of 100 pattern presentations. The first line of the \ntable shows the performance of the network with no noise. In the next four lines, two \nparameters of the noise were varied yielding 4 conditions. Specifically, noise was either \ncorrelated or uncorrelated across neurons and it was either presented as a single pulse in \na \"flash freeze\" schedule or following a broad annealing schedule. \n\nThe 2-1-1 XOR, in which the inputs are directly connected to the outputs, demonstrated \nvery good performance across conditions. Indeed, additional tests of the 2-1-1 in the \nno-noise condition showed that within 10k patterns all networks reached 100%. This \nsuggests there are deterministic solutions for the 2-1-1. \n\nTABLE 1. Results of Computer Simulations. \n\nnoise         schedule            2-1-1 XOR  2-2-1 XOR  4-4-1 parity \nno noise      -                   92(9)      67(0)      72(0) \ncorrelated    flash freeze        95(9)      83(5)      71(0) \ncorrelated    anneal temperature  99(10)     78(2)      74(0) \nuncorrelated  flash freeze        99(10)     84(4)      67(0) \nuncorrelated  anneal temperature  99(10)     85(5)      79(0) \nno noise      anneal gain         99(9)      81(4)      85(2) \n\nThe 2-2-1 networks learned to only 67% correct without noise. Learning with correlated \nnoise degraded performance compared to learning with uncorrelated noise. While the \nchip contained only 6 neurons, it was of interest to consider how limitations such as those \nstudied here might affect solutions to larger problems. Thus, solutions to parity \nproblems were considered and are included in the table. \n\nIt is worth noting that the full complexity of the chip's settling and noise distribution is \nnot captured in the discrete time simulations on the computer. 
The fact that we do not \nuse a circuit simulation may account for some of the differences between the simulations \nand chip performance. It is interesting to note that learning by the chip was generally \nfaster than learning by the simulation program and that the chip seemed to require noise \nfor learning more than the simulator. \n\nWe also considered a system without random noise in which we annealed the inverse \ngain of the neurons like a temperature through a broad annealing schedule covering the \nvalues previously examined.[2] As shown in the last line of the Table, this performed \ncomparably to temperature annealing reported above. 10 runs of a 2-2-1 XOR gave a \nmean performance of 81% with 4 networks reaching 100%. On the 4-4-1 parity problem \nthe mean performance was better than the results of annealing temperature. The mean \nperformance was 85% and 2 networks reached 100%. For still larger problems, such as \n6-8-1 parity, performance was comparable to annealing with noise. \n\n4. FUTURE DIRECTIONS \n\n4.1 Applications of Learning Systems \n\nLearning systems give us a way to encode knowledge as a set of training examples rather \nthan as a set of rules. Learned behavior emerges from the training set in ways that \ndepend on the input representation, the network architecture, and the learning procedure. \nThis technique is suitable for problem domains where there are too many rules or where \nthe rules are not known. Two general categories of problems suitable for learning \nsystems are pattern recognition and some types of expert systems. \n\nPattern recognition of something like an oak leaf is difficult because of the many \nvariations a rule-based system would have to consider even when variations of scale, \nrotation, and translation are accounted for. Yet, it is quite easy to give a learning system \nmany training examples of oak leaves. 
Scale, rotation, and translation invariance can be \nbuilt into the network structure. Similarly, recognition of speech sounds is difficult, but \nmany training examples exist. Here also, pre-processing of the auditory data is important \nto obtain a useful representation. Another pattern learning task useful in \ntelecommunications is learning the codebook for vector quantization in a real-time visual \ndata compression system.[6] \n\nExpert knowledge is often easier to encode by training examples as well. Experts often \ndo not know the rules they use to troubleshoot equipment or give advice. Again, it is \nquite easy, by taking a history of such advice, to build a large database of training \nexamples. As knowledge changes, training is a more graceful way of updating a \nknowledge base than changing the rules. In telephone networks, fault handling and traffic \nrouting are examples of problems for which training is a suitable way of encoding \nknowledge. \n\n4.2 Future Large-Scale Learning Systems \n\nBecause training takes too much computer time in a simulation, physical \nimplementations of learning systems such as ours are necessary for speed. It takes \nseveral hours to train a network to recognize a few milliseconds of speech.[7] If we could \nexpand our system to the thousand-neuron level, it would be possible to learn simple \nspeech recognition in real time. \n\nBecause the chip uses Ohm's law to multiply, charge conservation to add, device physics \nto create a threshold step, and a physical noise mechanism for random number \ngeneration, we can present training patterns to this chip about 100,000 times faster than \nthe computer simulator. This factor, mostly due to the physical analog computation at \nthis small network size, will increase with the size of the system due to its inherently \nparallel nature. 
It would also be possible to build fast special-purpose digital hardware to \nperform the multiply-accumulate calculations and do fast compares in parallel. Such \nhardware would take up considerably more silicon area but may be a good way to \nintegrate neural network calculations into existing computer systems. If we could build a \nlarge VLSI learning system of, say, 10,000 neurons and 1,000,000 synapses, it would be \nabout a billion times faster than a simulator on a 1 MIP machine. Presumably, such a \nsystem will be able to learn things beyond the capability of simulations even if they are \nrun on supercomputers. However, there are several challenges to building these systems. \n\nAn algorithmic problem divorced from implementation is the effect of scaling to large \nsize in highly connected networks. The learning time of such a system scales \nexponentially with the size of the problem.[8] The traditional way of handling complexity \nin large problems is to break them into smaller subpieces. An effective algorithm is yet \nto be discovered for doing learning in the modular, hierarchical networks which would be \nrequired to handle large problems. \n\nEven from a technological viewpoint, modularity is necessary to manage the connectivity \nin a typical multiple chip system. A highly connected system, even if it could be built, \nwould take too long to settle even considering the technology and parallel speedups \navailable. Constraints such as power dissipation, capacitive loading across chips, and \ninterchip communication are difficult to solve. If we succeed in these challenges, we will \nhave the problem of presenting data to the system at extremely high rates amounting to \nseveral thousand (or more) bits every few microseconds. Biology solves these problems \nin the visual system, for example, by highly parallel communication via the optic nerve. 
\nIt is unlikely that we will be able to use a million bit wide bus in our electronic system, \nhowever. \n\nCan one take the weights learned by a learning system and simply load them onto a much \nsimpler system with programmable rather than adaptive synapses? This is perhaps \npossible for smaller systems where analog inaccuracies and defects can be controlled. \nModular networks provide a way of handling inaccuracies. However, for large analog \nsystems, adaptation mechanisms are needed to maintain accuracy. Even if the accuracy \nwere a few percent, a system of only a hundred neurons would be inaccurate across \nchips. In biological systems, if one were to place the connection strengths found in brain \nA onto the structures of brain B, the result would be chaos rather than a brain transplant. \nThe robustness of neural systems depends on having the neurons and synapses adapt to \nthe particular environment they find themselves in. Nevertheless, some amount of \nhard-wiring is probably possible in modular systems if it is modifiable by a trainable portion of \nthe network. A speech recognition system may, for example, adapt in real time to the \naccents and timbre of a particular speaker. It is also likely that the system would require \nat least partial training beforehand for robustness. \n\nWe plan to design a larger version of our test chip containing both neurons and synapses \nwhich can form part of a still larger multiple chip network with the addition of chips \ncontaining only synapses. This next chip will have self-powered synapses so that each \nneuron need only signal its state rather than drive an unknown number of neurons from \nother chips. In addition, the noise generator will be improved so that true annealing is \npossible. We may also go further toward a fully analog chip[2] by having a variable gain \nneuron. 
Analog charge domain storage of weights and transport of states would further \nreduce the silicon area necessary but the technology required is not standard. \n\nThere are many challenges in scaling learning networks up to the 10^4 neuron and 10^6 \nsynapse range although these large electronic learning networks will have on the order of \na billionfold speed advantage over simulations based on serial computers. Thus they may \nbe able to address many longstanding problems in artificial intelligence which have \nresisted attack by more conventional methods. \n\nReferences \n\n1. J. Alspector & R.B. Allen, \"A neuromorphic VLSI learning system\", in Advanced \nResearch in VLSI: Proceedings of the 1987 Stanford Conference, edited by P. \nLosleben (MIT Press, Cambridge, MA, 1987) pp. 313-349. \n\n2. J. Alspector, R.B. Allen, V. Hu, & S. Satyanarayana, \"Stochastic learning networks \nand their electronic implementation\", Neural Information Processing Systems \n(Denver, Nov. 1987) pp. 9-21. \n\n3. D.H. Ackley, G.E. Hinton, & T.J. Sejnowski, \"A learning algorithm for Boltzmann \nmachines\", Cognitive Science 9 (1985) pp. 147-169. \n\n4. B. Widrow & M.E. Hoff, \"Adaptive switching circuits\", IRE WESCON Convention \nRecord Part 4, (1960) pp. 96-104. \n\n5. F. Rosenblatt, Principles of neurodynamics: Perceptrons and the theory of brain \nmechanisms, Spartan Books, Washington, D.C. (1961). \n\n6. J. Alspector, \"A VLSI approach to neural-style information processing\", in VLSI \nSignal Processing III, edited by R.W. Brodersen and H.S. Moscovitz (IEEE Press, \nNew York, 1988) pp. 232-243. \n\n7. T.K. Landauer, C. Kamm, & S. Singhal, \"Teaching a minimally structured \nback-propagation network to recognize speech sounds\", Proceedings of the Cognitive \nScience Society (Seattle, Aug. 1987) pp. 531-536. \n\n8. G. Tesauro & B. Janssens, \"Scaling relationships in back-propagation learning\", 
Complex Systems 2 (1988) pp. 39-44.", "award": [], "sourceid": 159, "authors": [{"given_name": "Joshua", "family_name": "Alspector", "institution": null}, {"given_name": "Bhusan", "family_name": "Gupta", "institution": null}, {"given_name": "Robert", "family_name": "Allen", "institution": null}]}