{"title": "A Self-Learning Neural Network", "book": "Advances in Neural Information Processing Systems", "page_first": 769, "page_last": 776, "abstract": null, "full_text": "769 \n\nA SELF-LEARNING NEURAL NETWORK \n\nA. Hartstein and R. H. Koch \n\nIBM - Thomas J. Watson Research Center \n\nYorktown Heights, New York \n\nABSTRACf \n\nWe propose a new neural network structure that is compatible \nwith silicon technology and has built-in learning capability. The \nthrust of this network work is a new synapse function. The \nsynapses have the feature that the learning parameter is em(cid:173)\nbodied in the thresholds of MOSFET devices and is local in char(cid:173)\nacter. The network is shown to be capable of learning by \nexample as well as exhibiting the desirable features of the \nHopfield type networks. \n\nThe thrust of what we want to discuss is a new synapse function for an artificial \nneuron to be used in a neural network. We choose the synapse function to be \nreadily implementable in VLSI technology, rather than choosing a function which \nis either our best guess for the function used by real synapses or mathematically \nthe most tractable. In order to demonstrate that this type of synapse function \nprovides interesting behavior in a neural network, we imbed this type of function \nin a Hopfield {Hopfield, 1982} type network and provide the synapses with a \nHebbian {Hebb, 1949} learning capability. We then show that this type of net(cid:173)\nwork functions in much the same way as a Hopfield network and also learns by \nexample. Some of this work has been discussed previously {Hartstein, 1988}. \n\nMost neural networks, which have been described, use a multiplicative function \nfor the synapses. The inputs to the neuron are multiplied by weighting factors \nand then the results are summed in the neuron. The result of the sum is then put \ninto a hard threshold device or a device with a sigmoid output. 
This is not the easiest function for a MOSFET to perform, although it can be done. Over a large range of parameters, a MOSFET is a linear device, with the output current a linear function of the input voltage relative to a threshold voltage. If one could directly utilize these characteristics, one would be able to design a neural network more compactly.

We propose that we directly use MOSFETs as the input devices for the neurons in the network, utilizing their natural characteristics. We assume the following form for the input of each neuron in our network:

    V_i = σ( Σ_j | V_j - T_ij | )    (1)

where V_i is the output, the V_j are the inputs, and the T_ij are the learned threshold voltages. In this network we use a representation in which both the V's and the T's range from 0 to +1. The result of the summation is fed into a non-linear sigmoid function (σ). All of the neurons in the network are interconnected, the outputs of each neuron feeding the inputs of every other neuron. The functional form of Eq. 1 might, for instance, represent several n-channel and p-channel MOSFETs in parallel.

The memories in this network are contained in the threshold voltages, T_ij. We implement learning in this network using a simple linear Hebbian {Hebb, 1949} learning rule. We use a rule which locally reinforces the state of each input node in a neuron relative to the output of that neuron. The equation governing this learning algorithm is:

    T'_ij = T_ij + η ( T_ij - V_j ) ( V_i - 0.5 )    (2)

where T_ij are the initial threshold voltages and T'_ij are the new threshold voltages after a time Δt. Here η is a small learning parameter related to this time period, and the offset factor 0.5 is needed for symmetry. Additional saturation constraints are imposed to ensure that the T_ij remain in the interval 0 to +1.

This learning rule is linear in the difference between each input and output of a neuron.
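As a concrete illustration, the neuron function of Eq. 1 and a threshold-learning step can be sketched in a few lines of NumPy. This is a minimal toy, not the authors' implementation: the sigmoid gain and centering, the exact algebraic form of the update (ΔT_ij = η(T_ij − V_j)(V_i − 0.5), chosen to match the stated properties: local, linear, 0.5 offset, requiring saturation clipping), and all names and parameter values are our assumptions.

```python
import numpy as np

# Toy sketch of the threshold-synapse network. The sigmoid centering, the
# gain, and the exact algebraic form of the threshold update are our
# assumptions; only the general structure follows the text.

def neuron_outputs(V, T, gain=1.0):
    """Eq. 1: V_i = sigma( sum_{j != i} |V_j - T_ij| ), with V, T in [0, 1]."""
    N = len(V)
    off_diag = ~np.eye(N, dtype=bool)
    drive = np.where(off_diag, np.abs(V[None, :] - T), 0.0).sum(axis=1)
    # Logistic sigmoid centered at half the maximum possible drive (assumed).
    return 1.0 / (1.0 + np.exp(-gain * (drive - (N - 1) / 2.0)))

def learn_step(T, V, eta=0.1):
    """One local Hebbian threshold update: T_ij moves away from the input
    V_j when the output V_i is 'on' (V_i > 0.5) and toward it when 'off',
    with saturation constraints keeping T in [0, 1]."""
    T = T + eta * (T - V[None, :]) * (V[:, None] - 0.5)
    return np.clip(T, 0.0, 1.0)
```

Clamping the outputs to a training pattern and iterating `learn_step` drives the thresholds to saturation, after which the pattern is a fixed point of the network dynamics.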
This is an enhancing/inhibiting rule. The thresholds are adjusted in such a way that the output of the neuron is either pushed in the same direction as the input (enhancing) or pushed in the opposite direction (inhibiting). For our simple simulations we started the network with all thresholds at 0.5 and let learning proceed until some saturation occurred. The somewhat more sophisticated method of including a relaxation term in Eq. 2 to slowly push the values toward 0.5 over time was also explored. The results are essentially the same as for our simple simulations.

The interesting question is: if we form a network using this type of neuron, what will the overall network response be like? Will the network learn multiple states, or will it learn a simple average over all of the states it sees? In order to probe the functioning of this network, we have performed simulations of this network on a digital computer. Each simulation was divided into two phases. The first was a learning phase, in which a fixed number of random patterns were presented to the network sequentially for some period of time. During this phase the threshold voltages were allowed to change using the rule in Eq. 2. The second was a testing phase, in which learning was turned off and the memories established in the network were probed to determine the essential features of these learned memories. In this way we could test how well the network was able to learn the initial test patterns, how well the network could reconstruct the learned patterns when presented with test patterns containing errors, and how the network responded to random input patterns.

We have simulated this network using N fully interconnected neurons, with N in the range of 10 to 200. M random patterns were chosen and sequentially presented to the network for learning.
M typically ranged up to N/3. After the learning phase, the nature of the stable states in the network was tested. In general we found that the network is capable of learning all of the input patterns as long as M is not too large. The network also learns the inverse patterns (1's and 0's interchanged) due to the inherent symmetry of the network. Additional extraneous patterns are learned which have no obvious connection to the intended learned states. These may be analogous to either the spin glass states or the mixed pattern states discussed for the multiplicative network {Amit, 1985}.

Fig. 1 shows the capacity of a 100 neuron network. We attempted to teach the network M states and then probed the network to see how many of the states were successfully learned. This process was repeated many times until we achieved good statistics. We have defined successful learning as 100% accuracy. A more relaxed definition would yield a qualitatively similar curve with larger capacity.

The functional form of the learning is peaked at a fixed value of the number of input patterns. For a small number of input patterns, the network essentially learns all of the patterns. Deviations from perfect learning here generally mean 1 bit of information was learned incorrectly. Near the peak the results become more noisy for different learning attempts. Most errors are still only 1 or 2 bits, but the learning in this region becomes marginal as the capacity of the network is approached. For larger values of the number of input patterns the network becomes overloaded and is incapable of learning most of the input states. Some small number of patterns are still learned, but the network is clearly not functioning well. Many of the errors in this region are large, showing little correlation with the intended learned states.
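The two-phase experiment behind this capacity measurement can be sketched as follows. This is a self-contained toy reconstruction, not the authors' code: the hard-threshold recall dynamics, the clamped-output learning rule, and every parameter value are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def recall(V0, T, steps=30):
    """Testing phase: iterate a hard-threshold version of Eq. 1 (assumed)."""
    V, N = V0.copy(), len(V0)
    off_diag = ~np.eye(N, dtype=bool)
    for _ in range(steps):
        drive = np.where(off_diag, np.abs(V[None, :] - T), 0.0).sum(axis=1)
        V_new = (drive > (N - 1) / 2.0).astype(float)
        if np.array_equal(V_new, V):
            break  # reached a fixed point
        V = V_new
    return V

def teach(patterns, N, eta=0.05, sweeps=400):
    """Learning phase: present each pattern sequentially with the outputs
    clamped to it, updating thresholds and clipping them to [0, 1]."""
    T = np.full((N, N), 0.5)
    for _ in range(sweeps):
        for p in patterns:
            T = np.clip(T + eta * (T - p[None, :]) * (p[:, None] - 0.5),
                        0.0, 1.0)
    return T

def learned_count(N, M):
    """Teach M random patterns; count those recalled with 100% accuracy
    (a pattern's inverse also counts, since inverses are learned too)."""
    patterns = rng.integers(0, 2, size=(M, N)).astype(float)
    T = teach(patterns, N)
    return sum(
        int(np.array_equal(recall(p, T), p)
            or np.array_equal(recall(p, T), 1 - p))
        for p in patterns
    )
```

Sweeping M for fixed N and averaging `learned_count` over many trials yields a curve of the Fig. 1 type: near-perfect learning for small M, a peak, and degradation beyond capacity.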
This functional form for the learning in the network is the same for all of the network sizes tested. We define the capacity of the network as the average value of the peak number of patterns which can be successfully learned. The inset to Fig. 1 shows the memory capacity of a number of tested networks as a function of the size of the network. The network capacity is seen to be a linear function of the network size. The capacity is proportional to the number of T_ij's specified. In this example the network capacity was found to be about 8% of the maximum possible for binary information. This rather low figure results from a trade-off of capacity for the particular types of functions that a neural network can perform. It is possible to construct simple memories with 100% capacity.

Figure 1. The number of successfully learned patterns as a function of the number of input patterns for a 100 neuron network. The dashed curve is for perfect learning. The inset shows the memory capacity of a threshold neural network as a function of the size of the network.

Some important measures of learning in the network are the distribution of stable states in the network after learning has taken place, and the basin of attraction for each stable point. One can gain a handle on these parameters by probing the network with random test patterns after the network has learned M states. Fig. 2 shows the averaged results of such tests for a 100 neuron network and varying numbers of learned states. The figure shows the probability of finding particular states, both learned and extraneous.
The states are ordered first by decreasing probability for the learned states, followed by decreasing probability for the extraneous states. It is clear from the figure that both types of stable states are present in the network. It is also clear that the probabilities of finding different patterns are not equal. Some learned states are more robust than others; that is, they have larger basins of attraction. This network model does not partition the available memory space equally among the input patterns. It also provides a large amount of memory space for the extraneous states. Clearly, this is not the optimum situation.

Figure 2. The probability of the network finding a specific pattern. Both learned states and extraneous states are found. The figure was obtained for a 100 neuron network. Fig. 2a is for 5 learned patterns and 2b is for 10 learned patterns.

Some of the learned states appear to have 0 probability of being found in this simulation. Some of these states are not stable states of the network and will never be found. This is particularly true when the number of learned states is close to or exceeds the capacity of the network. Others of these states simply have an extremely small probability of being found in a random search because they have small basins of attraction. However, as discussed below, these are still viable states. When the network learns fewer states than its capacity (Fig. 2a), most of the stable states are the learned states. As the capacity is approached or exceeded, most of the stable states are extraneous states.
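The random-probe analysis behind a Fig. 2 style histogram can be sketched like this. Again a self-contained toy: the recall and teach procedures and all parameter values are our assumptions, not the paper's.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)

def recall(V0, T, steps=30):
    """Iterate a hard-threshold version of Eq. 1 (assumed dynamics)."""
    V, N = V0.copy(), len(V0)
    off_diag = ~np.eye(N, dtype=bool)
    for _ in range(steps):
        drive = np.where(off_diag, np.abs(V[None, :] - T), 0.0).sum(axis=1)
        V_new = (drive > (N - 1) / 2.0).astype(float)
        if np.array_equal(V_new, V):
            break
        V = V_new
    return V

def teach(patterns, N, eta=0.05, sweeps=400):
    """Sequential clamped-output learning with saturation clipping."""
    T = np.full((N, N), 0.5)
    for _ in range(sweeps):
        for p in patterns:
            T = np.clip(T + eta * (T - p[None, :]) * (p[:, None] - 0.5),
                        0.0, 1.0)
    return T

def probe_distribution(N=30, M=2, probes=200):
    """Teach M random patterns, then start the network from random states
    and tally whether each final state is a learned pattern (or its
    inverse) or an extraneous stable state."""
    patterns = rng.integers(0, 2, size=(M, N)).astype(float)
    T = teach(patterns, N)
    learned = {tuple(p) for p in patterns} | {tuple(1 - p) for p in patterns}
    counts = Counter()
    for _ in range(probes):
        final = recall(rng.integers(0, 2, size=N).astype(float), T)
        counts["learned" if tuple(final) in learned else "extraneous"] += 1
    return counts
```

The relative frequency with which random probes land on a given state is a proxy for the size of its basin of attraction, which is the quantity Fig. 2 visualizes.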
The results shown in Fig. 2 address the question of the network's tolerance to errors. A pattern which has a large basin of attraction will be relatively tolerant to errors when being retrieved, whereas a pattern which has a small basin of attraction will be less tolerant of errors. The immunity of the learned patterns to errors in being retrieved can also be tested in a more direct way. One can probe the network with test patterns which start out as the learned patterns, but have a certain number of bits changed randomly. One then monitors the final pattern which the network finds and compares it to the known learned pattern.

Figure 3. Probability of the network finding a specific learned state when the input pattern has a certain Hamming distance. This figure was obtained for a 100 neuron network which was taught 10 random patterns.

Fig. 3 shows typical results of such a calculation. The probability of successfully retrieving a pattern is shown as a function of the Hamming distance, the number of bits which were randomly changed in the test pattern. For this simulation a 100 neuron network was used and it was taught 10 patterns. For small Hamming distances the patterns are successfully found 100% of the time. As the Hamming distance gets larger the network is no longer capable of finding the desired pattern, but rather finds one of the other fixed points. This result is a statistical average over all of the states and therefore tends to emphasize patterns with small basins of attraction. This is just the opposite of the types of states emphasized in the analysis shown in Fig. 2.
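The Hamming-distance probe behind a Fig. 3 style curve can be sketched as follows, once more as a self-contained toy with assumed dynamics and parameters.

```python
import numpy as np

rng = np.random.default_rng(3)

def recall(V0, T, steps=30):
    """Iterate a hard-threshold version of Eq. 1 (assumed dynamics)."""
    V, N = V0.copy(), len(V0)
    off_diag = ~np.eye(N, dtype=bool)
    for _ in range(steps):
        drive = np.where(off_diag, np.abs(V[None, :] - T), 0.0).sum(axis=1)
        V_new = (drive > (N - 1) / 2.0).astype(float)
        if np.array_equal(V_new, V):
            break
        V = V_new
    return V

def teach(patterns, N, eta=0.05, sweeps=400):
    """Sequential clamped-output learning with saturation clipping."""
    T = np.full((N, N), 0.5)
    for _ in range(sweeps):
        for p in patterns:
            T = np.clip(T + eta * (T - p[None, :]) * (p[:, None] - 0.5),
                        0.0, 1.0)
    return T

def retrieval_prob(p, T, k, trials=40):
    """Probability that a probe at Hamming distance k from the learned
    pattern p is mapped back to p exactly."""
    N, hits = len(p), 0
    for _ in range(trials):
        probe = p.copy()
        flip = rng.choice(N, size=k, replace=False)  # k random bit flips
        probe[flip] = 1.0 - probe[flip]
        hits += int(np.array_equal(recall(probe, T), p))
    return hits / trials
```

Sweeping k from 0 to N traces out a retrieval-probability curve of the Fig. 3 type; the k at which it drops through 50% is the maximum Hamming distance discussed next.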
We can define the maximum Hamming distance as the Hamming distance at which the probability of finding the learned state has dropped to 50%. Fig. 4 shows the maximum Hamming distance as a function of the number of learned states in our 100 neuron network. As one expects, the maximum Hamming distance gets smaller as the number of learned states increases. Perhaps surprisingly, the relationship is linear. These results are important since one requires a reasonable maximum Hamming distance for any real system to function. These considerations also shed some light on the nature of the functioning of the network and its ability to learn.

Figure 4. The maximum Hamming distance for a given number of learned states. Results are for a 100 neuron network.

This simulation gives us a picture of the way in which the network utilizes its phase space to store information. When only a few patterns are stored in the network, the network divides up the available space among these memories. The learning process is almost always successful. When a larger number of learned patterns are attempted, the available space is divided among more memories. The maximum Hamming distance decreases and more space is taken up by extraneous states. When the memory capacity is exceeded, the phase space allocated to any successful memory is very small and most of the space is taken up by extraneous states.

The types of behavior we have described are similar to those found in the Hopfield type memory utilizing multiplicative synapses. In fact our central point is that by using a completely different type of synapse function, we can obtain the same behavior.
At the same time, we argue that since this network was proposed using a synapse function which mirrors the operating characteristics of MOSFETs, it will be much easier to realize in hardware. Therefore, we should be able to construct a smaller, more tolerant network with the same operating characteristics.

We do not mean to imply that the type of synapse function we have explored can only be used in a Hopfield type network. In fact, we feel that this type of neuron is quite general and can successfully be utilized in any type of network. This is at present just a conjecture which needs to be explored more fully. Perhaps the most important message from our work is the realization that one need not be constrained to the multiplicative type of synapse, and that other forms of synapses can perform similar functions in neural networks. This may open up many new avenues of investigation.

REFERENCES

D.J. Amit, H. Gutfreund and H. Sompolinsky, Phys. Rev. A32, 1007 (1985).

A. Hartstein and R.H. Koch, IEEE Int. Conf. on Neural Networks (SOS Printing, San Diego, 1988), Vol. I, p. 425.

D.O. Hebb, The Organization of Behavior (Wiley, New York, 1949).

J.J. Hopfield, Proc. Natl. Acad. Sci. USA 79, 2554 (1982).