{"title": "Capacity for Patterns and Sequences in Kanerva's SDM as Compared to Other Associative Memory Models", "book": "Neural Information Processing Systems", "page_first": 412, "page_last": 421, "abstract": null, "full_text": "412 \n\nCAPACITY FOR PATTERNS AND SEQUENCES IN KANERVA'S SDM \n\nAS COMPARED TO OTHER ASSOCIATIVE MEMORY MODELS \n\nJames O. Keeler \n\nChemistry Department, Stanford University, Stanford, CA 94305 \n\nand RIACS, NASA-AMES 230-5 Moffett Field, CA 94035. \n\ne-mail: jdk@hydra.riacs.edu \n\nABSTRACT \n\nThe information capacity of Kanerva's Sparse, Distributed Memory (SDM) and Hopfield-type \nneural networks is investigated. Under the approximations used here, it is shown that the to(cid:173)\ntal information stored in these systems is proportional to the number connections in the net(cid:173)\nwork. The proportionality constant is the same for the SDM and HopJreld-type models in(cid:173)\ndependent of the particular model, or the order of the model. The approximations are \nchecked numerically. This same analysis can be used to show that the SDM can store se(cid:173)\nquences of spatiotemporal patterns, and the addition of time-delayed connections allows the \nretrieval of context dependent temporal patterns. A minor modification of the SDM can be \nused to store correlated patterns. \n\nINTRODUCTION \n\nMany different models of memory and thought have been proposed by scientists over the \nyears. In (1943) McCulloch and Pitts prorosed a simple model neuron with two states of activity \n(on and off) and a large number of inputs. Hebb (1949) considered a network of such neurons and \nto learn memories. The learning rule \npostulated mechanisms for changing synaptic strengths 2 \nconsidered here uses the outer-product of patterns of +Is and -Is. Anderson (1977) discussed the \neffect of iterative feedback in such a system.) 
Hopfield (1982) showed that for symmetric connections,4 the dynamics of such a network is governed by an energy function that is analogous to the energy function of a spin glass.5 Numerous investigations have been carried out on similar models.6-8 \n\nSeveral limitations of these binary-interaction, outer-product models have been pointed out. For example, the number of patterns that can be stored in the system (its capacity) is limited to a fraction of the length of the pattern vectors. Also, these models are not very successful at storing correlated patterns or temporal sequences. \n\nOther models have been proposed to overcome these limitations. For example, one can allow higher-order interactions among the neurons.9,10 In the following, I focus on a model developed by Kanerva (1984) called the Sparse, Distributed Memory (SDM) model.11 The SDM can be viewed as a three-layer network that uses outer-product learning between the second and third layers. As discussed below, the SDM is more versatile than the above-mentioned networks because the number of stored patterns can be increased independent of the length of the pattern, the SDM can be used to store spatiotemporal patterns with context retrieval, and it can store correlated patterns. \n\nThe capacity limitations of outer-product models can be alleviated by using higher-order interaction models or the SDM, but a price must be paid for this added capacity in terms of an increase in the number of connections. How much information is gained per connection? It is shown in the following that the total information stored in each system is proportional to the number of connections in the network, and that the proportionality constant is independent of the particular model or the order of the model. This result also holds if the connections are limited to one bit of precision (clipped weights). The analysis presented here requires certain simplifying assumptions. 
The approximate results are compared numerically to an exact calculation developed by Chou.12 \n\n© American Institute of Physics 1988 \n\nSIMPLE OUTER-PRODUCT NEURAL NETWORK MODEL \n\nAs an example of a simple first-order neural network model, I consider in detail the model developed by Hopfield.4 This model will be used to introduce the mathematics and the concepts that will be generalized for the analysis of the SDM. The \"neurons\" are simple two-state threshold devices: the state of the i-th neuron, u_i, is either +1 (on) or -1 (off). Consider a set of n such neurons with net input (local field) h_i to the i-th neuron given by \n\nh_i = Σ_j T_ij u_j,   (1) \n\nwhere T_ij represents the interaction strength between the i-th neuron and the j-th. The state of each neuron is updated asynchronously (at random) according to the rule \n\nu_i <- g(h_i),   (2) \n\nwhere the function g is a simple threshold function, g(x) = sign(x). \n\nSuppose we are given M randomly chosen patterns (strings of length n of ±1s) which we wish to store in this system. Denote these M memory patterns as pattern vectors p^α = (p_1^α, p_2^α, ..., p_n^α), α = 1,2,3,...,M. For example, p^1 might look like (+1,-1,+1,-1,-1,...,+1). One method of storing these patterns is the outer-product (Hebbian) learning rule: start with T = 0, and accumulate the outer-products of the pattern vectors. The resulting connection matrix is given by \n\nT_ij = Σ_{α=1}^{M} p_i^α p_j^α,   T_ii = 0.   (3) \n\nThe system described above is a dynamical system with attracting fixed points. To obtain an approximate upper bound on the total information stored in this network, we sidestep the issue of the basins of attraction, and we check to see if each of the patterns stored by Eq. (3) is actually a fixed point of (2). 
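As a concrete illustration of Eqs. (1)-(3), the storage rule and the fixed-point check can be sketched in a few lines. This is a minimal sketch; the sizes n and M below are arbitrary illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 100, 5                          # illustrative sizes: n neurons, M patterns
P = rng.choice([-1, 1], size=(M, n))   # rows are the random pattern vectors p^a

# Outer-product (Hebbian) storage, Eq. (3): T_ij = sum_a p_i^a p_j^a, with T_ii = 0
T = P.T @ P
np.fill_diagonal(T, 0)

# Check that each stored pattern is a fixed point of the update rule (2):
# u_i <- sign(h_i), with h_i = sum_j T_ij u_j from Eq. (1)
stable = (np.sign(P @ T) == P).mean()
print(f"fraction of stable bits: {stable:.3f}")
```

For M well below n, every pattern bit is recovered exactly; as M grows toward a fraction of n, crosstalk noise begins to flip bits, which is the capacity limit analyzed next.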
Suppose we are given one of the patterns, p^β say, as the initial configuration of the neurons. I will show that p^β is expected to be a fixed point of Eq. (2). After inserting (3) for T into (1), the net input to the i-th neuron becomes \n\nh_i = Σ_j Σ_{α=1}^{M} p_i^α p_j^α p_j^β.   (4) \n\nThe important term in the sum on α is the one for which α = β. This term represents the \"signal\" between the input p^β and the desired output. The rest of the sum represents \"noise\" resulting from crosstalk with all of the other stored patterns. The expression for the net input becomes h_i = signal_i + noise_i, where \n\nsignal_i = p_i^β Σ_j p_j^β p_j^β,   (5) \n\nnoise_i = Σ_{α≠β} p_i^α Σ_j p_j^α p_j^β.   (6) \n\nSumming on all of the j in (5) yields signal_i = (n-1)p_i^β. Since n is positive, the sign of the signal term and p_i^β will be the same. Thus, if the noise term were exactly zero, the signal would give the same sign as p_i^β with a magnitude of about n, and p^β would be a fixed point of (2). Moreover, patterns close to p^β would give nearly the same signal, so that p^β should be an attracting fixed point. \n\nFor randomly chosen patterns, <noise_i> = 0, where < > indicates statistical expectation, and its variance will be σ² = (n-1)(M-1). The probability that there will be an error on recall of p_i^β is given by the probability that the noise is greater than the signal. For n large, the noise distribution is approximately Gaussian, and the probability that there is an error in the i-th bit is \n\np_e = P(noise_i > signal_i) ≈ 1 - Φ((n-1)/σ),   (7) \n\nwhere Φ is the Gaussian cumulative distribution defined in (9). \n\nINFORMATION CAPACITY \n\nThe number of patterns that can be stored in the network is known as its capacity.13,14 However, for a fair comparison between all of the models discussed here, it is more relevant to compare the total number of bits (total information) stored in each model rather than the number of patterns. 
This allows comparison of information storage in models with different lengths of the pattern vectors. If we view the memory model as a black box which receives input bit strings and outputs them with some small probability of error in each bit, then the definition of bit-capacity used here is exactly the definition of channel capacity used by Shannon.15 \n\nDefine the bit-capacity as the number of bits that can be stored in a network with fixed probability of getting an error in a recalled bit, i.e., p_e = constant in (7). Explicitly, the bit-capacity is given by16 \n\nB = bit capacity = nMη,   (8) \n\nwhere η = 1 + p_e log2 p_e + (1-p_e) log2(1-p_e). Note that η = 1 for p_e = 0. Setting p_e to a constant is tantamount to keeping the signal-to-noise ratio (fidelity) constant, where the fidelity R is given by R = |signal|/σ. Explicitly, the relation between (constant) p_e and R is just R = Φ^{-1}(1 - p_e), where \n\nΦ(R) = (2π)^{-1/2} ∫_{-∞}^{R} e^{-t²/2} dt.   (9) \n\nHence, the bit-capacity of these networks can be investigated by examining the fidelity of the models as a function of n, M, and R. From (8) and (9), the fidelity of the Hopfield model is R² = n²/(n(M-1)), (n >> 1). Solving for M in terms of (fixed) R and η, the bit-capacity becomes B = η[(n²/R²) + n]. \n\nThe results above can be generalized to models with d-th order interactions.17,18 The resulting expression for the bit-capacity for d-th order interaction models is just \n\nB = η[ n^{d+1}/R² + n ].   (10) \n\nHence, we see that the number of bits stored in the system increases with the order d. However, to store these bits, one must pay a price by including more connections in the connection tensor. 
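The relation between p_e, the fidelity R of Eq. (9), and the bit-capacity of Eq. (10) can be checked numerically. This is a sketch using only the standard library; the values of n and d are illustrative.

```python
import math

def eta(pe):
    # Shannon factor of Eq. (8): eta = 1 + pe*log2(pe) + (1-pe)*log2(1-pe)
    return 1 + pe * math.log2(pe) + (1 - pe) * math.log2(1 - pe)

def fidelity(pe):
    # R = Phi^{-1}(1 - pe) for the Gaussian Phi of Eq. (9), found by bisection
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        tail = 0.5 * math.erfc(mid / math.sqrt(2))  # P(standard normal > mid)
        lo, hi = (mid, hi) if tail > pe else (lo, mid)
    return (lo + hi) / 2

pe = 1e-3
R = fidelity(pe)                            # about 3.1, as quoted in the text
n, d = 100, 1
B = eta(pe) * (n ** (d + 1) / R ** 2 + n)   # bit-capacity, Eq. (10)
print(R, B)
```

With p_e fixed at 1/1000, R is pinned near 3.1 regardless of the model, which is what makes the per-connection efficiency comparison below model-independent.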
\nTo demonstrate the relationship between the number of connections and the information stored, define the information capacity, γ, to be the total information stored in the network divided by the number of bits in the connection tensor (note that this is different from the definition used by Abu-Mostafa et al.19). Thus γ is just the bit-capacity divided by the number of bits in the tensor T, and represents the efficiency with which information is stored in the network. Since T has n^{d+1} elements, the information capacity is found to be \n\nγ = η/(R²b),   (11) \n\nwhere b is the number of bits of precision used per tensor element (b ≈ log2 M for no clipping of the weights). For large n, the information stored per neuronal connection is γ = η/(R²b), independent of the order of the model (compare this result to that of Peretto et al.20). To illustrate this point, suppose one decides that the maximum allowed probability of getting an error in a recalled bit is p_e = 1/1000; this fixes the minimum value of R at 3.1. Thus, to store 10,000 bits with a probability 0.001 of getting an error in a recalled bit, equation (11) states that it would take ≈96,000b bits, independent of the order of the model; equivalently, ≈0.1n patterns can be stored with probability 1/1000 of getting an error in a recalled bit. \n\nKANERVA'S SDM \n\nNow we focus our attention on Kanerva's Sparse, Distributed Memory model (SDM).11 The SDM can be viewed as a 3-layer network with the middle layer playing the role of hidden units. To get an autoassociative network, the output layer can be fed back into the input layer, effectively making this a two-layer network. The first layer of the SDM is a layer of n ±1 input units (the input address, a), the middle layer is a layer of m hidden units, s, and the third layer consists of the n ±1 output units (the data, d). 
The connections between the input units and the hidden units are random weights of ±1 and are given by the m x n matrix A. The connections between the hidden units and the output units are given by the n x m connection matrix C, and these matrix elements are modified by an outer-product learning rule (C is analogous to the matrix T of the Hopfield model). \n\nGiven an input pattern a, the hidden-unit activations are determined by \n\ns = θ_r(A a),   (12) \n\nwhere θ_r is the Hamming-distance threshold function: the k-th element is 1 if the input a is at most r Hamming units away from the k-th row of A, and 0 if it is further than r units away, i.e., \n\nθ_r(x)_k = 1 if (n - x_k)/2 ≤ r;  0 if (n - x_k)/2 > r.   (13) \n\nThe hidden-units vector, or select vector, s, is mostly 0s with an average of δm 1s, where δ is some small number dependent on r, δ << 1. The data are written into C by accumulating the outer-product of the data vector with the select vector for each stored (address, data) pair, and the data are read out by thresholding, d = g(Cs). Neglecting the variance of the signal term, we can easily calculate the variance of the noise term, because each of the select vectors is an i.i.d. vector of length m with mostly 0s and about δm 1s. With these assumptions, the fidelity is given by \n\nR² = m/[(M-1)(1 + δ²(m-1))].   (17) \n\nIn the limit of large m, with δm ≈ constant, the number of stored bits scales as \n\nB ≈ η[ mn/(R²(1 + δ²m)) + n ].   (18) \n\nIf we divide this by the number of elements in C, we find the information capacity γ = η/(R²b), just as before, so the information capacity is the same for the two models. (If we divide the bit capacity by the number of elements in C and A then we get γ = η/(R²(b+1)), which is about the same for large M.) \n\nA few comments before we continue. First, it should be pointed out that the assumption made by Kanerva11 and Keeler17,18 that the variance of the signal term is much less than that of the noise is not valid over the entire range. If we took this into account, then the magnitude of the denominator would be increased by the variance of the signal term. 
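The select/write/read cycle of Eqs. (12)-(13) and the outer-product storage rule can be sketched as follows. This is a minimal sketch; the sizes n, m, M and the radius r are illustrative, and the constant-select-count variant mentioned below is not implemented.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, M, r = 100, 1000, 10, 42          # illustrative sizes; r is the Hamming radius

A = rng.choice([-1, 1], size=(m, n))    # fixed random address matrix

def select(a):
    # Eqs. (12)-(13): hidden unit k fires iff Hamming(a, A_k) = (n - A_k . a)/2 <= r
    return ((n - A @ a) / 2 <= r).astype(int)

# Write M random (address, data) pairs with the outer-product rule on C
addresses = rng.choice([-1, 1], size=(M, n))
data = rng.choice([-1, 1], size=(M, n))
C = np.zeros((n, m))
for a, d in zip(addresses, data):
    C += np.outer(d, select(a))

# Read: threshold the summed contents of the selected locations
recalled = np.sign(C @ select(addresses[0]))
match = (recalled == data[0]).mean()
print(f"fraction of correctly recalled bits: {match:.3f}")
```

Because m can be grown independently of n, the number of stored patterns M is decoupled from the pattern length, which is the property of the SDM emphasized in the text.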
Further, if we read at a distance l away from the write address, then it is easy to see that the signal changes to mδ(l), where δ(l) is the overlap of two spheres of radius r whose centers are a distance l apart in the binomial space of n dimensions (δ = δ(0)). The fidelity for reading at a distance l away from the write address is \n\nR² = m²δ²(l) / [ mδ(l)(1-δ(l)) + (M-1)mδ² + (M-1)δ⁴m²(1-1/m) ].   (19) \n\nCompare this to the formula derived by Chou12 for the exact signal-to-noise ratio: \n\nR² = m²δ²(l) / [ mδ(l)(1-δ(l)) + (M-1)m μ_ov + (M-1)σ²_ov m²(1-1/m) ],   (20) \n\nwhere μ_ov is the average overlap of the spheres of radius r binomially distributed with parameters (n, 1/2) and σ²_ov is the mean square of this overlap. The difference in these two formulas lies in the denominator, in the terms δ² versus μ_ov and δ⁴ versus σ²_ov. The difference comes from the fact that Chou correctly calculates the overlap of the spheres without using the independence assumption. \n\nHow do these formulas differ? First of all, it is found numerically that δ² is identical with μ_ov. Hence, the only difference comes from δ⁴ versus σ²_ov. For mδ² < 1, the δ⁴ term is negligible compared to the other terms in the denominator. In addition, δ⁴ and σ²_ov are approximately equal for large n and r = n/2. Hence, in the limit n -> ∞ the two formulas agree over most of the range if M = 0.1m, m < 2^n. However, for finite n, the two formulas can disagree when mδ² = 1 (see Figure 1). \n\nFigure 1 (signal-to-noise ratio versus Hamming radius; + Eq. (17), o Eq. (19), * Eq. (20)): A comparison of the fidelity calculations of the SDM for typical n, M, and m values. Equation (17) was derived assuming no variance of the signal term, and is shown by the + line. Equation (19) uses the approximation that all of the select vectors are independent (o line). 
Equation (20) (* line) is the exact derivation done by Chou.12 The values used here were n = 150, m = 2000, M = 100. \n\nEquation (20) suggests that there is a best read-write Hamming radius for the SDM. By setting l = 0 in (19) and setting the derivative of R² with respect to δ to zero, we get an approximate expression for the best Hamming radius: δ_best ≈ (2Mm)^{-1/3}. This trend is shown qualitatively in Figure 2. \n\nFigure 2: Numerical investigation of the capacity of the SDM. The vertical axis is the percent of recovered patterns with no errors. The x-axis (left to right) is the Hamming distance used for reading and writing. The y-axis (back to forward) is the number of patterns that were written into the memory. For this investigation, n = 128, m = 1024, and M ranges from 1 to 501. Note the similarity of a cross-section of this graph at constant M with Figure 1. This calculation was performed by David Cohn at RIACS, NASA-Ames. \n\nFigure 1 indicates that the formula (17) that neglected the variance of the signal term is incorrect over much of the range. However, a variant of the SDM is to constrain the number of selected locations to be constant; circuitry for doing this is easily built.21 The variance of the signal term would be zero in that case, and the approximate expression for the fidelity is given by Eq. (17). There are certain problems where it would be better to keep δ = constant, as in the case of correlated patterns (see below). \n\nThe above analysis was done assuming that the elements (weights) in the outer-product matrix are not clipped, i.e., that there are enough bits to store the largest value of any matrix element. It is interesting to consider what happens if we allow these values to be represented by only a few bits. If we consider the case b = 1, i.e., 
\nthe weights are clipped at one bit, it is easy to show17 that γ ≈ 2η/(πR²) for the d-th order models and for the SDM, which yields γ ≈ 0.07 for reasonable R (this is substantially less than Willshaw's 0.69). \n\nSEQUENCES \n\nIn an autoassociative memory, the system relaxes to one of the stored patterns and stays fixed in time until a new input is presented. However, there are many problems where the recalled patterns must change sequentially in time. For example, a song can be remembered as a string of notes played in the correct sequence; cyclic patterns of muscle contractions are essential for walking, riding a bicycle, or dribbling a basketball. As a first step we consider the very simplistic sequence production put forth by Hopfield (1982) and Kanerva (1984). \n\nSuppose that we wished to store a sequence of patterns in the SDM. Let the pattern vectors be given by (p^1, p^2, ..., p^M). This sequence of patterns could be stored by having each pattern point to the next pattern in the sequence. Thus, for the SDM, the patterns would be stored as input-output pairs (a^α, d^α), where a^α = p^α and d^α = p^{α+1} for α = 1,2,3,...,M-1. Convergence to this sequence works as follows: if the SDM is presented with an address that is close to p^1, the read data will be close to p^2. Iterating the system with p^2 as the new input address, the read data will be even closer to p^3. As this iterative process continues, the read data will converge to the stored sequence, with the next pattern in the sequence being presented at each time step. \n\nThe convergence statistics are essentially the same for sequential patterns as those shown above for autoassociative patterns. Presented with p^α as an input address, the signal for the stored sequence is found as before: \n\n<signal> = δm p^{α+1}.   (21) \n\nThus, given p^α, the read data is expected to be p^{α+1}. 
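The pointer chain p^a -> p^{a+1} described above can be sketched with the same SDM machinery (illustrative sizes; `select` implements Eqs. (12)-(13)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, r, M = 100, 1000, 42, 6      # illustrative sizes; M patterns in the sequence
A = rng.choice([-1, 1], size=(m, n))

def select(a):
    # Eqs. (12)-(13): 1 for each row of A within Hamming radius r of the address
    return ((n - A @ a) / 2 <= r).astype(int)

# Store the sequence as input-output pairs (p^a, p^{a+1})
seq = rng.choice([-1, 1], size=(M, n))
C = np.zeros((n, m))
for a in range(M - 1):
    C += np.outer(seq[a + 1], select(seq[a]))

# Iterated reading steps through the stored sequence
x = seq[0]
for _ in range(M - 1):
    x = np.sign(C @ select(x))
match = (x == seq[-1]).mean()
print(f"agreement with the final pattern: {match:.3f}")
```

Each read hands its output back as the next address, so the memory emits the next stored pattern at every time step, as described in the text.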
Assuming that the patterns in the sequence are randomly chosen, the mean value of the noise is zero, with variance \n\n<noise²> = (M-1)δ²m(1 + δ²(m-1)).   (22) \n\nHence, the length of a sequence that can be stored in the SDM increases linearly with m for large m. \n\nAttempting to store sequences like this in the Hopfield model is not very successful due to the asynchronous updating used in the Hopfield model. A synchronously updated outer-product model (for example [6]) would work just as described for the SDM, but it would still be limited to storing a fraction of the word size as the maximum sequence length. \n\nAnother method for storing sequences in Hopfield-like networks has been proposed independently by Kleinfeld22 and Sompolinsky and Kanter.23 These models relieve the problem created by asynchronous updating by using a time-delayed sequential term. This time-delay storage algorithm has different dynamics than the synchronous SDM model. In the time-delay algorithm, the system allows time for the units to relax to the first pattern before proceeding on to the next pattern, whereas in the synchronous algorithms, the sequence is recalled imprecisely from imprecise input for the first few iterations and then correctly after that. In other words, convergence to the sequence takes place \"on the fly\" in the synchronous models - the system does not wait to zero in on the first pattern before proceeding on to recover the following patterns. This allows the synchronous algorithms to proceed k times as fast as the asynchronous time-delay algorithms, with half as many (variable) matrix elements. This difference should be able to be detected in biological systems. \n\nTIME DELAYS AND HYSTERESIS: FOLDS \n\nThe above scenario for storing sequences is inadequate to explain speech recognition or pattern generation. 
For example, the above algorithm cannot store sequences of the form ABAC, or overlapping sequences. In Kanerva's original work, he included the concept of time delays as a general way of storing sequences with hysteresis. The problem addressed by this is the following: suppose we wish to store two sequences of patterns that overlap. For example, the two pattern sequences (a,b,c,d,e,f,...) and (x,y,z,d,w,v,...) overlap at the pattern d. If the system only has knowledge of the present state, then when given the input d, it cannot decide whether to output w or e. To store two such sequences, the system must have some knowledge of the immediate past. Kanerva incorporates this idea into the SDM by using \"folds.\" A system with F+1 folds has a time history of F past states. These F states may be over the past F time steps or they may go even further back in time, skipping some time steps. The algorithm for reading from the SDM with folds becomes \n\nd(t+1) = g( C_0 s(t) + C_1 s(t-τ_1) + ... + C_F s(t-τ_F) ),   (23) \n\nwhere s(t-τ_β) = θ_r(A a(t-τ_β)). To store the Q pattern sequences (p_1^1, ..., p_1^{M_1}), (p_2^1, ..., p_2^{M_2}), ..., (p_Q^1, ..., p_Q^{M_Q}), construct the matrix of the β-th fold as follows: \n\nC_β = w_β Σ_{q=1}^{Q} Σ_{α=1}^{M_q} p_q^{α+1} x (s_q^{α-β})^T,   (24) \n\nwhere any vector with a superscript less than 1 is taken to be zero, s_q^{α-β} = θ_r(A p_q^{α-β}), and w_β is a weighting factor that would normally decrease with increasing β. \n\nWhy do these folds work? Suppose that the system is presented with the pattern sequence (p_1^1, p_1^2, ..., p_1^{M_1}), with each pattern presented sequentially as input until the τ_F time step is reached. For simplicity, assume that w_β = 1 for all β. Each term in Eq. (23) will contribute a signal similar to the signal for the single-fold system. Thus, on the t-th time step, the signal term coming from Eq. (23) is F δm p_1^{t+1}, and the mean of the noise terms is zero, with variance F(M-1)δ²m(1+δ²(m-1)). Hence, 
the signal-to-noise ratio is √F times as strong as it is for the SDM without folds. \n\nSuppose further that the second stored pattern sequence happens to match the first stored sequence at t = τ. The signal term would then be \n\nsignal(t+1) = Fδm p_1^{t+1} + δm p_2^{t+1}.   (25) \n\nWith no history of the past (F = 1) the signal is split between p_1^{t+1} and p_2^{t+1}, and the output is ambiguous. However, for F > 1, the signal for the first pattern sequence dominates and allows retrieval of the remainder of the correct sequence. This formulation allows context to aid in the retrieval of stored sequences, and can differentiate between overlapping sequences by using time delays. \n\nThe above formulation is still too simplistic in terms of being able to do real recognition problems such as speech recognition. First, the above algorithm can only recall sequences at a fixed time rate, whereas speech recognition occurs at widely varying rates. Second, the above algorithm does not allow for deletions in the incoming data. For example, \"seqnce\" can be recognized as \"sequence\" even though some letters are missing. Third, as pointed out by Lashley,24 speech processing relies on hierarchical structures. \n\nAlthough Kanerva's original algorithm is too simplistic, a straightforward modification allows retrieval at different rates with deletions. To achieve this, we can add on the time-delay terms with weights which are smeared out in time. Kanerva's (1984) formulation can thus be viewed as a discrete-time formulation of that put forth by Hopfield and Tank (1987).25 Explicitly, we could write \n\nh = Σ_{β=1}^{F} W_β C_β s(t-τ_β),   (26) \n\nwhere the coefficients W_β are a discrete approximation to a smooth function which spreads the delayed signal out over time. 
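Reading with folds, Eqs. (23)-(24) with the weighted variant of Eq. (26), can be sketched as follows. This is an illustrative sketch: the sizes are arbitrary, a single sequence is stored, the delays are taken as tau_b = b, and the weights w_b simply decrease with delay.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, r, F = 100, 1000, 42, 3      # illustrative sizes; F + 1 folds
A = rng.choice([-1, 1], size=(m, n))

def select(a):
    # Eqs. (12)-(13): 1 for each row of A within Hamming radius r of the address
    return ((n - A @ a) / 2 <= r).astype(int)

# Store one sequence in every fold, Eq. (24): C_b accumulates p^{a+1} x s(p^{a-b})^T
M = 8
seq = rng.choice([-1, 1], size=(M, n))
C = [np.zeros((n, m)) for _ in range(F + 1)]
for a in range(M - 1):
    for b in range(F + 1):
        if a - b >= 0:
            C[b] += np.outer(seq[a + 1], select(seq[a - b]))

# Read with delays tau_b = b and weights w_b, Eqs. (23)/(26)
w = [1.0 / (b + 1) for b in range(F + 1)]

def read(history):
    # history[-1] is the present address, history[-1 - b] the state b steps back
    h = sum(w[b] * (C[b] @ select(history[-1 - b])) for b in range(F + 1))
    return np.sign(h)

nxt = read(list(seq[:F + 1]))      # present the first F + 1 patterns in order
match = (nxt == seq[F + 1]).mean()
print(f"agreement with the next pattern in the sequence: {match:.3f}")
```

Because every fold votes for the continuation consistent with its delayed input, overlapping sequences are disambiguated by the accumulated context, as argued in the text.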
As a further step, we could modify these weights dynamically to optimize the signal coming out. The time-delay patterns could also be placed in a hierarchical structure, as in the matched-filter avalanche structure put forth by Grossberg et al. (1986).26 \n\nCORRELATED PATTERNS \n\nIn the above associative memories, all of the patterns were taken to be randomly chosen, uniformly distributed binary vectors of length n. However, there are many applications where the set of input patterns is not uniformly distributed; the input patterns are correlated. In mathematical terms, the set K of input patterns would not be uniformly distributed over the entire space of 2^n possible patterns. Let the probability distribution function for the Hamming distance between two randomly chosen vectors p^α and p^β from the distribution K be given by the function ρ(d(p^α - p^β)), where d(x-y) is the Hamming distance between x and y. \n\nThe SDM can be generalized from Kanerva's original formulation so that correlated input patterns can be associated with output patterns. For the moment, assume that the distribution set K and the probability density function ρ(x) are known a priori. Instead of constructing the rows of the matrix A from the entire space of 2^n patterns, construct the rows of A from the distribution K. Adjust the Hamming distance r so that δm = a constant number of locations are selected. In other words, adjust r so that the value of δ is the same as given above, where δ is determined by \n\nδ = (1/2^n) ∫_0^r ρ(x) dx.   (27) \n\nThis implies that r would have to be adjusted dynamically. This could be done, for example, by a feedback loop. Circuitry for doing this is easily built,21 and a similar structure appears in the Golgi cells in the cerebellum.27 \n\nUsing the same distribution for the rows of A as the distribution of the patterns in K, 
and using (27) to specify the choice of r, all of the above analysis is applicable (assuming randomly chosen output patterns). If the outputs do not have equal numbers of 1s and -1s, the mean of the noise is not 0. However, if the distribution of outputs is also known, the system can still be made to work by storing 1/p_+ and 1/p_- for 1s and -1s respectively, where p_± is the probability of getting a 1 or a -1 respectively. Using this storage algorithm, all of the above formulas hold (as long as the distribution is smooth enough and not extremely dense). The SDM will be able to recover data stored with correlated inputs with a fidelity given by Equation (17). \n\nWhat if the distribution function K is not known a priori? In that case, we would need to have the matrix A learn the distribution ρ(x). There are many ways to build A to mimic ρ. One such way is to start with a random A matrix and modify the entries of randomly chosen rows of A at each step according to the statistics of the most recent input patterns. Another method is to use competitive learning28-30 to achieve the proper distribution of the A_i. \n\nThe competitive learning algorithm is a method for adjusting the weights A_ij between the first and second layer to match this probability density function, ρ(x). The i-th row of the address matrix A can be viewed as a vector A_i. The competitive learning algorithm holds a competition between these vectors, and the few vectors that are the closest (within the Hamming sphere r) to the input pattern x are the winners. Each of these winners is then modified slightly in the direction of x. For large enough m, this algorithm almost always converges to a distribution of the A_i that is the same as ρ(x). \n\nThe updating equation for the selected addresses is \n\nA_i^new = A_i^old - λ(A_i^old - x),   (28) \n\nand for λ = 1 this reduces to the so-called unary representation of Baum et al.,31 which gives the maximum efficiency in terms of capacity. 
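The update of Eq. (28) can be sketched directly. This is a minimal sketch: the sizes, radius, and learning rate lambda are illustrative, the correlated input distribution is an invented example (first half of every bit string fixed at +1), and the rows of A become real-valued once they start moving toward the inputs.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, r, lam = 100, 500, 42, 0.1   # illustrative sizes; lam is the learning rate
A = rng.choice([-1, 1], size=(m, n)).astype(float)

def competitive_step(x):
    # Winners: rows of A within Hamming radius r of x, distance (n - A_i . x)/2
    winners = (n - A @ x) / 2 <= r
    # Eq. (28): A_i^new = A_i^old - lam * (A_i^old - x) for each winner
    A[winners] += lam * (x - A[winners])

# Repeatedly presenting correlated inputs pulls the selected rows of A
# toward the input distribution, so the addresses mimic rho(x)
for _ in range(200):
    x = np.concatenate([np.ones(n // 2), rng.choice([-1.0, 1.0], n - n // 2)])
    competitive_step(x)
```

A row already equal to the input is a fixed point of the update, and rows outside the Hamming sphere are untouched, so only the winning addresses drift toward the cluster of inputs.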
\n\nA;'''\"''' = Arid -\n\n'A.{Arld - x) \n\nDISCUSSION \n\nThe above analysis said nothing about the basins of attraction of these memory states. A \nmeasure of the perfonnance of a content addressable memory shoUld also say something about the \navera~e radius of convergence of the basin of attraction. The basins are in general quite compli(cid:173)\nand have been investigated numerically for the unclipped models and values of n and m \ncated \nranging in the 100S.21 The basins of attraction for the SOM and the d=1 model are very similar in \ntheir characteristics and their average radius of convergence. However, the above results give an \nupper bound on the capacity by looking at the fixed points of the system (if there is no fixed point, \nthere is no basin). \n\nIn summary, the above arguments show that the total information stored in outer-product \nneural networks is a constant times the number of connections between the neurons. This constant \nis independent of the order of the model and is the same (1lIR2b) for the SOM as well as higher(cid:173)\norder Hopfield-type networks. The advantage of going to an architecture like the SOM is that the \nnumber of patterns that can be stored in the network is independent of the size of the pattern, \nwhereas the number of stored patterns is limited to a fraction of the word size for the Wills haw or \nHopfield architecture. The point of the above analysis is that the efficiency of the SOM in terms \nof information stored per bit is the same as for Hopfield-type models. \n\nIt was also demonstrated how sequences of patterns can be stored in the SOM, and how time \ndelays can be used to recover contextual information. A minor modification of the SOM could be \nused to recover time sequences at slightly different rates of presentation. Moreover, another minor \nmodification allows the storage of correlated patterns in the SOM. 
With these modifications, the SDM presents a versatile and efficient tool for investigating properties of associative memory. \n\nAcknowledgements: Discussions with John Hopfield and Pentti Kanerva are gratefully acknowledged. This work was supported by DARPA contract # 86-A227500-000. \n\nREFERENCES \n\n[1] McCulloch, W. S. & Pitts, W. (1943), Bull. Math. Biophys. 5, 115-133. \n[2] Hebb, D. O. (1949) The Organization of Behavior. John Wiley, New York. \n[3] Anderson, J. A., Silverstein, J. W., Ritz, S. A. & Jones, R. S. (1977) Psych. Rev. 84, 412-451. \n[4] Hopfield, J. J. (1982) Proc. Natn'l. Acad. Sci. USA 79, 2554-2558. \n[5] Kirkpatrick, S. & Sherrington, D. (1978) Phys. Rev. B 17, 4384-4405. \n[6] Little, W. A. & Shaw, G. L. (1978) Math. Biosci. 39, 281-289. \n[7] Nakano, K. (1972), Association - a model of associative memory. IEEE Trans. Sys. Man Cyber. 2. \n[8] Willshaw, D. J., Buneman, O. P. & Longuet-Higgins, H. C. (1969) Nature 222, 960-962. \n[9] Lee, Y. C., Doolen, G., Chen, H. H., Sun, G. Z., Maxwell, T., Lee, H. Y. & Giles, L. (1985) Physica 22D, 276-306. \n[10] Baldi, P. & Venkatesh, S. S. (1987) Phys. Rev. Lett. 58, 913-916. \n[11] Kanerva, P. (1984) Self-propagating Search: A Unified Theory of Memory, Stanford University Ph.D. Thesis, and Bradford Books (MIT Press), in press (1987 est.). \n[12] Chou, P. A., The capacity of Kanerva's associative memory, these proceedings. \n[13] McEliece, R. J., Posner, E. C., Rodemich, E. R. & Venkatesh, S. S. (1986), IEEE Trans. on Information Theory. \n[14] Amit, D. J., Gutfreund, H. & Sompolinsky, H. (1985) Phys. Rev. Lett. 55, 1530-1533. \n[15] Shannon, C. E. (1948), Bell Syst. Tech. J. 27, 379, 623 (reprinted in Shannon and Weaver, 1949). \n[16] Kleinfeld, D. & Pendergraft, D. B. (1987) Biophys. J. 51, 47-53. \n[17] Keeler, J. D. (1986), Comparison of Sparsely Distributed Memory and Hopfield-type Neural Network Models, 
RIACS Technical Report 86-31, also submitted to J. Cog. Sci. \n[18] Keeler, J. D. (1987) Physics Letters 124A, 53-58. \n[19] Abu-Mostafa, Y. & St. Jacques, J. (1985), IEEE Trans. on Info. Theory 31, 461. \n[18] Keeler, J. D., Basins of Attraction of Neural Network Models, AIP Conf. Proc. #151, Ed: John Denker, Neural Networks for Computing, Snowbird, Utah (1986). \n[20] Peretto, P. & Niez, J. J. (1986) Biol. Cybern. 54, 53-63. \n[21] Keeler, J. D., Ph.D. Dissertation, Collective phenomena of coupled lattice maps: Reaction-diffusion systems and neural networks. Department of Physics, University of California, San Diego (1987). \n[22] Kleinfeld, D. (1986). Proc. Nat. Acad. Sci. 83, 9469-9473. \n[23] Sompolinsky, H. & Kanter, I. (1986). Physical Review Letters. \n[24] Lashley, K. S. (1951). Cerebral Mechanisms in Behavior. Edited by Jeffress, L. A. Wiley, New York, 112-136. \n[25] Hopfield, J. J. & Tank, D. W. (1987). ICNN San Diego preprint. \n[26] Grossberg, S. & Stone, G. (1986). Psychological Review 93, 46-74. \n[27] Marr, D. (1969). J. Physiology 202, 437-470. \n[28] Grossberg, S. (1976). Biological Cybernetics 23, 121-134. \n[29] Kohonen, T. (1984) Self-Organization and Associative Memory. Springer-Verlag, Berlin. \n[30] Rumelhart, D. E. & Zipser, D. (1985) Cognitive Sci. 9, 75. \n[31] Baum, E., Moody, J. & Wilczek, F. (1987). Preprint for Biological Cybernetics. \n", "award": [], "sourceid": 39, "authors": [{"given_name": "James", "family_name": "Keeler", "institution": null}]}