{"title": "Performance Measures for Associative Memories that Learn and Forget", "book": "Neural Information Processing Systems", "page_first": 432, "page_last": 441, "abstract": null, "full_text": "Performance Measures for Associative Memories that Learn and Forget \n\nAnthony Kuh \n\nDepartment of Electrical Engineering \n\nUniversity of Hawaii at Manoa \n\nHonolulu HI, 96822 \n\nABSTRACT \n\nRecently, many modifications to the McCulloch/Pitts model have been proposed in which both learning and forgetting occur. Provided the network never saturates (ceases to function effectively due to an overload of information), the learning updates can continue indefinitely. For these networks, we need to introduce performance measures in addition to the information capacity to evaluate the different networks. We mathematically define quantities such as the plasticity of a network, the efficacy of an information vector, and the probability of network saturation. From these quantities we analytically compare different networks. \n\n1. Introduction \n\nWork has recently been undertaken to quantitatively measure the computational aspects of network models that exhibit some of the attributes of neural networks. The McCulloch/Pitts model discussed in [1] was one of the earliest neural network models to be analyzed. Some computational properties of what we call a Hopfield Associative Memory Network (HAMN), similar to the McCulloch/Pitts model, were discussed by Hopfield in [2]. The HAMN can be measured quantitatively by defining and evaluating the information capacity, as [2-6] have shown, but because of its simplified structure this network fails to exhibit the more complex computational capabilities that neural networks have. The HAMN belongs to a class of networks which we call static. In static networks the learning and recall procedures are separate. 
The network first learns a set of data, and after learning is complete, recall occurs. In dynamic networks, as opposed to static networks, updated learning and associative recall are intermingled and continual. In many applications, such as adaptive communications systems, image processing, and speech recognition, dynamic networks are needed to adaptively learn the changing information data. This paper formally develops and analyzes some dynamic models for neural networks. Some existing models [7-10] are analyzed, new models are developed, and measures are formulated for evaluating the performance of different dynamic networks. \n\nIn [2-6], the asymptotic information capacity of the HAMN is defined and evaluated. In [4-5], this capacity is found by first assuming that the information vectors (IVs) to be stored have components that are chosen randomly and independently of all other components in all IVs. The information capacity then gives the maximum number of IVs that can be stored in the HAMN such that IVs can be recovered with high probability during retrieval. At or below capacity, the network with high probability successfully recovers the desired IVs. Above capacity, the network quickly degrades and eventually fails to recover any of the desired IVs. This phenomenon is sometimes referred to as the \"forgetting catastrophe\" [10]. In this paper we will refer to this phenomenon as network saturation. \n\nThere are two ways to avoid this phenomenon. The first method involves learning a limited number of IVs such that this number is below capacity. After this learning takes place, no more learning is allowed. Once learning has stopped, the network does not change (defined as static) and therefore lacks many of the interesting computational capabilities that adaptive learning and neural network models have. \n\n@ American Institute of Physics 1988 \n\n
The second method is to incorporate some type of forgetting mechanism in the learning structure so that the information stored in the network can never exceed capacity. This type of network would be able to adapt to the changing statistics of the IVs, and the network would only be able to recall the most recently learned IVs. This paper focuses on analyzing dynamic networks that adaptively learn new information and do not exhibit network saturation phenomena by selectively forgetting old data. The emphasis is on developing simple models, and much of the analysis is performed on a dynamic network that uses a modified Hebbian learning rule. \n\nSection 2 introduces and qualitatively discusses a number of network models that are classified as dynamic networks. This section also defines some pertinent measures for evaluating dynamic network models. These measures include the plasticity of a network, the probability of network saturation, and the efficacy of stored IVs. A network with no plasticity cannot learn, and a network with high plasticity has interconnection weights that exhibit large changes. The efficacy of a stored IV as a function of time is another important parameter, as it is used in determining the rate at which a network forgets information. \n\nIn section 3, we mathematically analyze a simple dynamic network referred to as the Attenuated Linear Updated Learning (ALUL) network, which uses linear updating and a modified Hebbian rule. Quantities introduced in section 2 are analytically determined for the ALUL network. By adjusting the attenuation parameter of the ALUL network, the forgetting factor is adjusted. It is shown that the optimal capacity for a large ALUL network in steady state, defined by (2.13, 3.1), is a factor of e less than the capacity of a HAMN. This is the price that must be paid for having dynamic capabilities. 
We also conjecture that no other network can perform better than this network when a worst case criterion is used. Finally, section 4 discusses further directions for this work along with possible applications in adaptive signal processing. \n\n2. Dynamic Associative Memory Networks \n\nThe network models discussed in this paper are based on the concept of associative memory. Associative memories are composed of a collection of interconnected elements that have data storage capabilities. Like other memory structures, there are two operations that occur in associative memories. In the learning operation (referred to as a write operation for conventional memories), information is stored in the network structure. In the recall operation (referred to as a read operation for conventional memories), information is retrieved from the memory structure. Associative memories recall information on the basis of data content rather than by a specific address. The models that we consider will have learning and recall operations that are updated in discrete time, with the activation state X(j) consisting of N cells that take on the values {-1,1}. \n\n2.1. Dynamic Network Measures \n\nGeneral associative memory networks are described by two sets of equations. If we let X(j) represent the activation state at time j and W(k) represent the weight matrix or interconnection state at time k, then the activation or recall equation is described by \n\nX(j+1) = f(X(j), W(k)),  j ≥ 0,  k ≥ 0,  X(0) = X,  (2.1) \n\nwhere X is the data probe vector used for recall. The learning algorithm or interconnection equation is described by \n\nW(k+1) = g(V(i), 0 ≤ i ≤ k, W(0)),  (2.2) \n\nwhere {V(i)} are the information vectors (IVs) to be stored and W(0) is the initial state of the interconnection matrix. 
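As a concrete illustration of the recall equation (2.1), the sketch below stores a single IV and applies synchronous threshold updates. The paper leaves f and the learning rule abstract at this point, so the outer-product matrix V V^T - I and the sign-threshold update are standard assumptions, not the paper's definitions.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 64
V = rng.choice([-1, 1], size=N)            # one stored IV with Bernoulli(1/2) components
W = np.outer(V, V) - np.eye(N)             # assumed Hebbian interconnection matrix

def recall(W, probe, steps=3):
    # Synchronous activation updates X(j+1) = f(X(j), W), with f = sgn(W X)
    X = probe.copy()
    for _ in range(steps):
        X = np.where(W @ X >= 0, 1, -1)
    return X

probe = V.copy()
probe[:5] *= -1                            # data probe at Hamming distance i = 5 from V
assert np.array_equal(recall(W, probe), V) # the probe converges back to the stored IV
```

With a single stored pattern the probe is corrected in one update, since (V V^T - I) X points in the direction of V whenever X is closer to V than to -V.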
Usually the learning algorithm time scale is much longer than the recall equation time scale, so that W in (2.1) can be considered time invariant. Often (2.1) is viewed as the equation governing short term memory and (2.2) as the equation governing long term memory. From the Hebbian hypothesis we note that the data probe vectors should have an effect on the interconnection matrix W. If a number of data probe vectors recall an IV V(i), the strength of recall of the IV V(i) should be increased by appropriate modification of W. If another IV is never recalled, it should gradually be forgotten by again adjusting terms of W. Following the analysis in [4,5] we assume that all components of IVs introduced are independent and identically distributed Bernoulli random variables with the probability of a 1 or -1 being chosen equal to 1/2. \n\nOur analysis focuses on learning algorithms. Before describing some dynamic learning algorithms we present some definitions. A network is defined as dynamic if, given any period of time, the rate of change of W is not identically zero. In addition we will primarily discuss networks where learning is gradual and updated at discrete times as shown in (2.2). By gradual, we mean networks where each update usually consists of one IV being learned and/or forgotten. IVs that have been introduced recently should have a high probability of recovery. The probability of recall for one IV should also be a monotonic decreasing function of time, given that the IV is not repeated. The networks that we consider should also have a relatively low probability of network saturation. \n\nQuantitatively, we let e(k,l,i) be the event that an IV introduced at time l can be recovered at time k with a data probe vector which is of Hamming distance i from the desired IV. The efficacy of network recovery is then given as p(k,l,i) = Pr(e(k,l,i)). 
In the analysis performed we say a vector V can recover V(l) if V(l) = Φ(V), where Φ(·) denotes one synchronous activation update of all cells in the network. The capacity for dynamic networks is then given by \n\nC(k,i,ε) = max m such that Pr( ∩_{l=k-m}^{k-1} e(k,l,i) ) ≥ 1 - ε.  (2.3) \n\nThe network saturation probability is defined by S(k,m), where S describes the probability that the network cannot recover m IVs. \n\nAnother important measure in analyzing the performance of dynamic networks is the plasticity of the interconnections of the weight matrix W. Following definitions that are similar to [10], define \n\nh(k) = (1/(N(N-1))) Σ_{i≠j} VAR[ W_{i,j}(k) - W_{i,j}(k-1) ]  (2.4) \n\nas the incremental synaptic intensity and \n\nH(k) = (1/(N(N-1))) Σ_{i≠j} VAR[ W_{i,j}(k) ]  (2.5) \n\nas the cumulative synaptic intensity. From these definitions we can define the plasticity of the network as \n\nP(k) = h(k) / H(k).  (2.6) \n\nWhen network plasticity is zero, the network does not change and no learning takes place. When plasticity is high, the network interconnections exhibit large changes. \n\nWhen analyzing dynamic networks we are often interested in whether the network reaches a steady state. We say a dynamic network reaches steady state if \n\nlim_{k→∞} H(k) = H,  (2.7) \n\nwhere H is a finite nonzero constant. If the IVs have stationary statistics and the learning operations are time invariant, then if a network reaches steady state, we have that \n\nlim_{k→∞} P(k) = P,  (2.8) \n\nwhere P is a finite constant. It is also easily verified from (2.6) that if the plasticity converges to a nonzero constant in a dynamic network, then given the above conditions on the IVs and the learning operations the network will eventually reach steady state. 
\n\nLet us also define the synaptic state at time k for activation state V as \n\ns(k,V) = W(k)V.  (2.9) \n\nFrom the synaptic state, we can define the SNR of V, which we show in section 3 is closely related to the efficacy of an IV and the capacity of the network: \n\nSNR(k,V,i) = (E[s_i(k,V)])^2 / VAR[s_i(k,V)].  (2.10) \n\nAnother quantity that is important in measuring dynamic networks is the complexity of implementation. Quantities dealing with network complexity are discussed in [12], and this paper focuses on networks that are memoryless. A network is memoryless if (2.2) can be expressed in the following form: \n\nW(k+1) = g#(W(k), V(k)).  (2.11) \n\nNetworks that are not memoryless have the disadvantage that all IVs need to be saved during all learning updates. The complexity of implementation is greatly increased in terms of space complexity and very likely increased in terms of time complexity. \n\n2.2. Examples of Dynamic Associative Memory Networks \n\nThe previous subsection discussed some quantities to measure dynamic networks. This subsection discusses some examples of dynamic associative memory networks and qualitatively discusses advantages and disadvantages of different networks. All the networks considered have the memoryless property. \n\nThe first network that we discuss is described by the following difference equation \n\nW(k+1) = a(k)W(k) + b(k)L(V(k)),  (2.12) \n\nwith W(0) being the initial value of the weights before any learning has taken place. Networks with these learning rules will be labeled as Linear Updated Learning (LUL) networks and, in addition, if 0 < a(k) < 1 … k(A) = min l such that F(W(l)) ≥ A. Therefore the efficacy of network recovery p(k,l,0) ≈ 0 for all k ≥ l ≥ k(A). 
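The plasticity definitions (2.4)-(2.6) can be checked empirically for the linear update (2.12) with constant a and b. The sketch below runs an ensemble of networks and estimates h, H, and P across the ensemble; the Hebbian choice L(V) = V V^T - I is an assumption here, and the steady-state value P = 2(1-a) used in the final check is the section 3 result for this rule.

```python
import numpy as np

rng = np.random.default_rng(2)
N, runs, K, a, b = 10, 1000, 80, 0.8, 1.0

W_prev = np.zeros((runs, N, N))
W = np.zeros((runs, N, N))
for k in range(K):
    V = rng.choice([-1, 1], size=(runs, N))              # iid Bernoulli(1/2) IVs
    L = np.einsum('ri,rj->rij', V, V) - np.eye(N)        # assumed Hebbian L(V)
    W_prev, W = W, a * W + b * L                         # linear update (2.12)

off = ~np.eye(N, dtype=bool)                             # i != j entries only
h = np.var(W - W_prev, axis=0)[off].mean()               # incremental synaptic intensity (2.4)
H = np.var(W, axis=0)[off].mean()                        # cumulative synaptic intensity (2.5)
P = h / H                                                # plasticity (2.6)
assert abs(P - 2 * (1 - a)) < 0.1                        # P = 2(1-a) = 0.4 for a = 0.8
```

K = 80 updates with a = 0.8 puts the ensemble well into steady state, since the a^{2k} transient in H(k) is negligible by then.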
\n\ncharacteristic \n\nlearning \n\nof \n\nnot \n\nalmost \n\nall \n\nhas \n\nthe \n\nvalues \n\nfor \n\nIn order for the (BL) scheme to be classified as dynamic learning, the attenuation \n\nparameter a must have values between 0 and 1. This learning scheme is just a more com(cid:173)\nplex version of the learning scheme derived from (2.10,2 .11). Let us qualitatively analyze \nthe learning scheme when a and b are constant. There are two cases to consider. When \nA> H, then the network is not affected by the bounds and the network behaves as the \nAL UL network. When A O \n\ni \n\nj \n\n(2 .19) \n\nbut aggregately analyzing all local minimum energy activation states is complex. Through \ncomputer simulations and simplified assumptions [7,8] have come up with a qualitative \nexplanation of the RSF algorithm based on an eigenvalue approach. \n\n3. Analysis of the ALUL Network \n\nSection 2 focused on defining properties and analytical measures for dynamic AMN \nalong with presenting some examples of some learning algorithms for dynamic AMN. This \nsection will focus on the analysis of one of the simpler algorithms, the ALUL network. \nFrom (2.12) we have that the time invariant ALUL network can be described by the fol(cid:173)\nlowing interconnection state equation. \n\nW(k+ 1) = aW(k) + bL(V(k)) \n\n(3.1 ) \n\nwhere a and b are nonnegative real numbers . Many of the measures introduced in section \n2 can easily be determined for the AL UL network. \n\nTo calculate the incremental synaptic intensity h (k) and the cumulative synaptic \nintensity H(k) let the initial condition of the interconnection state W\",i(O) be independent \nIf E W\",i(O) = 0 and \nof all other interconnections states and independent of all IVs. \nV AR W .. 
,j(O) = \n\n\"Y then \n\nand \n\nIn steady state when a < 1 we have that \n\np = 2(1~) \n\n(3.2) \n\n(3.3) \n\n(3.4) \n\nFrom this simple relationship between the attenuation parameter a and the plasticity \nmeasure P, we can directly relate plasticity to other measures such as the capacity of the \nnetwork. \n\nWe define the steady state capacity as C(i,i)= lim C(k,i,i) for networks where \n\nk--o.o \n\nsteady state exists. To analytically determine \nthat \nS(k, V(j)) = S(k-i) is a jointly Gaussian random vector. Further assume that Si(l) for \n1~ i< N, 1~ 1< m are all independent and identically distributed. Then for N sufficiently \nlarge, f(a) = a2(k...,.-,l}(1~2), and \n\nthe capacity \n\nfirst assume \n\n\f438 \n\nwe have that \n\nSNR(k, VU)) = SNR(k-n = (N-l)f(a) \nI-f{a) \n\n= c{a )logN \u00bb 1 \n\njO where \n\nthe \n\nlim p(k,j,O) ~ 1. Note that \nN-oo \n\nis given when \n\nf(a) \nI-f (a) = 2logN \n\nN \n\nSolving for m we get that \n\nm = \n\n[ \n\nI \nog (N + 21ogN)(1-a2) \n\n210gN \n\n1 -.......::.-------~ + 1 \n2 \n\nloga \n\n1 \n\nIt is also possible to find the value of a that maximizes m. If we let f = 1 - a2, then \n\n2logN 1 \n\nI \nog (N+ 2logN)f \n\n[ \n\nf \n\nm ~ \n\n(3.7) \n\n(3.8) \n\n(3.9) \n\n. \n\nNTh\u00b7 \n\nd \n\n. \n\nI \n\nh \n\n2elogN \n\nh \n\n2m \n\nor w en m ~ \n\nm IS at a maximum va ue w en f ~ \nIS correspon s to \na ~ 2m -l. Note that this is a factor of e less than the maximum number of Ns allowable \nin a static HAMN [4,5], such that one of the Ns is recoverable. By following the analysis \nin [5], the independence assumption and the Gaussian assumptions used earlier can be \nremoved. The arguments involve using results from exchangeability theory and normal \napproximation theory. \n\n. 
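The maximization behind (3.9) can be checked numerically. The sketch below evaluates m(ε) = log((N + 2logN)ε / (2logN)) / ε on a grid and confirms that the maximizer sits at ε* = 2e logN / (N + 2logN) with m* = 1/ε* ≈ N/(2e logN); the grid search itself is only an illustrative device.

```python
import numpy as np

N = 1_000_000
logN = np.log(N)
eps = np.logspace(-6, -2, 4000)                          # candidate values of eps = 1 - a^2
m = np.log((N + 2 * logN) * eps / (2 * logN)) / eps      # approximate capacity m(eps)

eps_star = eps[np.argmax(m)]
pred = 2 * np.e * logN / (N + 2 * logN)                  # stationary point: log(c eps) = 1
assert abs(eps_star - pred) / pred < 0.02                # maximizer matches eps ~ 2e logN / N
assert abs(m.max() - 1 / pred) / (1 / pred) < 0.02       # m* = 1/eps* ~ N/(2e logN)
```

Setting the derivative of log(cε)/ε to zero gives log(cε) = 1, i.e. ε* = e/c with c = (N + 2logN)/(2logN), which is the closed form the assertions compare against.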
\n\nA similar and somewhat more cumbersome analysis can be performed to show that in steady state the maximum capacity achievable is when a ≈ (2m-1)/(2m), and it is given by \n\nlim_{N→∞} C(k,0,ε) = N / (4elogN).  (3.10) \n\nThis is again a factor of e less than the maximum number of IVs allowable in a static HAMN [4,5] such that all IVs are recoverable. Fig. 2 shows a Monte Carlo simulation of the number of IVs recoverable in a 64 cell network versus the learning time scale, for a varying between .5 and .99. We can see that the network reaches approximate steady state when k ≥ 35. The maximum capacity is achieved when a ≈ .9, and the capacity is around 5. This is slightly more than the theoretical value predicted by the analysis just shown when we compare to Fig. 1. For smaller simulations conducted with larger networks, the simulated capacity was closer to the predicted value. From the simulations and the analysis we observe that when a is too small IVs are forgotten at too high a rate, and when a is too high network saturation occurs. \n\nUsing the same arguments, it is possible to analyze the capacity of the network and the efficacy of IVs when k is small. Assuming zero initial conditions and a ≈ (2m-1)/(2m), we can summarize the learning behavior of the ALUL network. The learning behavior can be divided into three phases. In the first phase, for k < N/(4elogN), all IVs are remembered and the characteristics of the network are similar to the HAMN below saturation. In the second phase some IVs are forgotten, as the rate of forgetting becomes nonzero. During this phase the maximum capacity is reached, as shown in Fig. 2. At this capacity the network cannot dynamically recall all IVs, so the network starts to forget more information than it receives. This continues until steady state is reached, where the learning and forgetting rates are equal. 
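A reduced version of the Fig. 2 experiment fits in a few lines. The sketch below runs a 32-cell ALUL network (3.1) into steady state and compares recovery of a just-stored IV against one stored 30 updates earlier; L(V) = V V^T - I and one-step synchronous recovery are assumptions, and N, a, and the trial count are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(3)
N, a, b, K, trials = 32, 0.9, 1.0, 100, 50
recent_hits = old_hits = 0

for _ in range(trials):
    W = np.zeros((N, N))
    ivs = []
    for _ in range(K):                                   # drive the network into steady state
        V = rng.choice([-1, 1], size=N)
        ivs.append(V)
        W = a * W + b * (np.outer(V, V) - np.eye(N))     # ALUL update (3.1)

    def recovered(V):
        # e(k,l,0): one synchronous update of the exact IV reproduces it
        return np.array_equal(np.where(W @ V >= 0, 1, -1), V)

    recent_hits += recovered(ivs[-1])                    # stored 1 update ago
    old_hits += recovered(ivs[-30])                      # stored 30 updates ago

assert recent_hits > old_hits                            # recent IVs survive, old ones are forgotten
assert recent_hits >= trials // 2
```

This reproduces the qualitative shape of Fig. 2: with a = .9 the freshest IVs have SNR well above threshold while IVs roughly 30 updates old have SNR near zero and are essentially never recovered.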
If initial conditions are nonzero, the network starts in phase 1 or the beginning of phase 2 if H(k) is below the value corresponding to the maximum capacity, and at the end of phase 2 for larger H(k). \n\nThe calculation of the network saturation probabilities S(k,m) is trivial for large networks once the capacity curves have been found. When m ≤ C(k,0,ε), S(k,m) ≈ 0; otherwise S(k,m) ≈ 1. \n\nBefore leaving this section let us briefly examine ALUL networks where a(k) and b(k) are time varying. An example of a time varying network is the marginalist learning scheme introduced in [10]. The network is defined by fixing the value of SNR(k,k-1,i) = D(N) for all k. This value is fixed by setting a = 1 and varying b. Since VAR[s_i(k,V(k-1))] is a monotonic increasing function of k, b(k) must also be a monotonic increasing function of k. It is not too difficult to show that when k is large, the marginalist learning scheme is equivalent to the steady state ALUL defined by (3.1). The argument is based on noting that the steady state SNR depends not on the update time, but on the difference between the update time and when the IV was stored, as is the case with the marginalist learning scheme. The optimal value of D(N) giving the highest capacity is D(N) = 4elogN, with \n\nb(k+1) = (2m/(2m-1)) b(k),  (3.11) \n\nwhere m = N/(4elogN). \n\nIf performance is defined by a worst case criterion, with the criterion being \n\nJ(l,N) = min( C(k,0,ε), k ≥ l ),  (3.12) \n\nthen we conjecture that for l large, no ALUL as defined in (2.12, 2.13) can have larger J(l,N) than the optimal ALUL defined by (3.1). If we consider average capacity, we note that the RL network has an average capacity of N/(8logN), which is larger than that of the optimal ALUL network defined in (3.1). 
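The claimed equivalence between marginalist learning (a = 1 with the geometrically growing b(k) of (3.11)) and the steady-state ALUL rule (3.1) can be sketched numerically: normalizing the marginalist weights by the last gain b(k-1) recovers the ALUL weights exactly. The Hebbian L(V) = V V^T - I and the small N, m, K below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, m, K = 16, 8, 40
a = (2 * m - 1) / (2 * m)                  # ALUL attenuation matched to (3.11)

W_alul = np.zeros((N, N))
W_marg = np.zeros((N, N))
bk = 1.0
for _ in range(K):
    V = rng.choice([-1, 1], size=N)
    L = np.outer(V, V) - np.eye(N)         # assumed Hebbian L(V)
    W_alul = a * W_alul + L                # steady-state ALUL rule (3.1), b = 1
    W_marg = W_marg + bk * L               # marginalist update: a = 1, growing b(k)
    bk *= 2 * m / (2 * m - 1)              # b(k+1) = (2m/(2m-1)) b(k)

b_last = bk * (2 * m - 1) / (2 * m)        # gain used on the final update
assert np.allclose(W_marg / b_last, W_alul)
```

Both rules weight the IV stored l updates ago by a^l (up to an overall scale), which is the sense in which the steady-state SNR depends only on the age of the IV.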
However, for most envisioned applications a worst case criterion is a more accurate measure of performance than a criterion based on average capacity. \n\n4. Summary \n\nThis paper has introduced a number of simple dynamic neural network models and defined several measures to evaluate the performance of these models. All parameters for the steady state ALUL network described by (3.1) were evaluated, and the attenuation parameter a giving the largest capacity was found. This capacity was found to be a factor of e less than the static HAMN capacity. Furthermore, we conjectured that under a worst case performance criterion no ALUL network could perform better than the optimal ALUL network defined by (3.1). Finally, a number of other dynamic models, including BL, RL, and marginalist learning, were stated to be equivalent to ALUL networks under certain conditions. \n\nThe network models considered in this paper all have binary vector valued activation states and may be too simplistic to be considered in many signal processing applications. By generalizing the analysis to more complicated models with analog vector valued activation states and continuous time updating, it may be possible to use these generalized models in speech and image processing. A specific example would be a controller for a moving robot. The generalized network models would learn the input data by adaptively changing the interconnections of the network. Old data would be forgotten, and data that was repeatedly recalled would be reinforced. These network models could also be used when the input data statistics are nonstationary. \n\nReferences \n\n[1] W. S. McCulloch and W. Pitts, \"A Logical Calculus of the Ideas Immanent in Nervous Activity\", Bulletin of Mathematical Biophysics, 5, 115-133, 1943. \n\n[2] J. J. 
Hopfield, \"Neural Networks and Physical Systems with Emergent Collective Computational Abilities\", Proc. Natl. Acad. Sci. USA, 79, 2554-2558, 1982. \n\n[3] Y. S. Abu-Mostafa and J. M. St. Jacques, \"The Information Capacity of the Hopfield Model\", IEEE Trans. Inform. Theory, vol. IT-31, 461-464, 1985. \n\n[4] R. J. McEliece, E. C. Posner, E. R. Rodemich, and S. S. Venkatesh, \"The Capacity of the Hopfield Associative Memory\", IEEE Trans. Inform. Theory, vol. IT-33, 461-482, 1987. \n\n[5] A. Kuh and B. W. Dickinson, \"Information Capacity of Associative Memories\", to be published, IEEE Trans. Inform. Theory. \n\n[6] D. J. Amit, H. Gutfreund, and H. Sompolinsky, \"Spin-Glass Models of Neural Networks\", Phys. Rev. A, vol. 32, 1007-1018, 1985. \n\n[7] J. J. Hopfield, D. I. Feinstein, and R. G. Palmer, \"'Unlearning' has a Stabilizing Effect in Collective Memories\", Nature, vol. 304, 158-159, 1983. \n\n[8] R. J. Sasiela, \"Forgetting as a Way to Improve Neural-Net Behavior\", AIP Conference Proceedings 151, 386-392, 1986. \n\n[9] J. D. Keeler, \"Basins of Attraction of Neural Network Models\", AIP Conference Proceedings 151, 259-265, 1986. \n\n[10] J. P. Nadal, G. Toulouse, J. P. Changeux, and S. Dehaene, \"Networks of Formal Neurons and Memory Palimpsests\", Europhysics Letters, vol. 1, 535-542, 1986. \n\n[11] S. Grossberg, \"Nonlinear Neural Networks: Principles, Mechanisms, and Architectures\", Neural Networks, in press. \n\n[12] S. S. Venkatesh and D. Psaltis, \"Information Storage and Retrieval in Two Associative Nets\", California Institute of Technology, Pasadena, Dept. of Elect. Eng., preprint, 1986. \n\nFig. 1: \"HAMN Capacity\" (N = 64, 1024 trials): average number of recoverable IVs versus update time. 
\n\nFig. 2: \"ALUL Capacity\" (N = 64, 1024 trials): average number of recoverable IVs versus update time, for a = .5, .7, .90, .95, .99. ", "award": [], "sourceid": 46, "authors": [{"given_name": "Anthony", "family_name": "Kuh", "institution": null}]}