{"title": "A Generative Model for Attractor Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 80, "page_last": 88, "abstract": null, "full_text": "A generative model for attractor dynamics \n\n
Richard S. Zemel \nDepartment of Psychology \nUniversity of Arizona \nTucson, AZ 85721 \nzemel@u.arizona.edu \n\n
Michael C. Mozer \nDepartment of Computer Science \nUniversity of Colorado \nBoulder, CO 80309-0430 \nmozer@colorado.edu \n\n
Abstract \n\n
Attractor networks, which map an input space to a discrete output space, are useful for pattern completion. However, designing a net to have a given set of attractors is notoriously tricky; training procedures are CPU intensive and often produce spurious attractors and ill-conditioned attractor basins. These difficulties occur because each connection in the network participates in the encoding of multiple attractors. We describe an alternative formulation of attractor networks in which the encoding of knowledge is local, not distributed. Although localist attractor networks have similar dynamics to their distributed counterparts, they are much easier to work with and interpret. We propose a statistical formulation of localist attractor net dynamics, which yields a convergence proof and a mathematical interpretation of model parameters. \n\n
Attractor networks map an input space, usually continuous, to a sparse output space composed of a discrete set of alternatives. Attractor networks have a long history in neural network research. \n
Attractor networks are often used for pattern completion, which involves filling in missing, noisy, or incorrect features in an input pattern. The initial state of the attractor net is typically determined by the input pattern. Over time, the state is drawn to one of a predefined set of states, the attractors. Attractor net dynamics can be described by a state trajectory (Figure 1a). 
An attractor net is generally implemented by a set of visible units whose activity represents the instantaneous state, and optionally, a set of hidden units that assist in the computation. Attractor dynamics arise from interactions among the units. In most formulations of attractor nets [2,3], the dynamics can be characterized by gradient descent in an energy landscape, allowing one to partition the output space into attractor basins. Instead of homogeneous attractor basins, it is often desirable to sculpt basins that depend on the recent history of the network and the arrangement of attractors in the space. In psychological models of human cognition, for example, priming is fundamental: after the model visits an attractor, it should be faster to fall into the same attractor in the near future, i.e., the attractor basin should be broadened [1,6]. \n
Another property of attractor nets is key to explaining behavioral data in psychological and neurobiological models: the gang effect, in which the strength of an attractor is influenced by other attractors in its neighborhood. Figure 1b illustrates the gang effect: the proximity of the two rightmost attractors creates a deeper attractor basin, so that if the input starts at the origin it will get pulled to the right. \n\n
Figure 1: (a) A two-dimensional space can be carved into three regions (dashed lines) by an attractor net. The dynamics of the net cause an input pattern (the X) to be mapped to one of the attractors (the O's). The solid line shows the temporal trajectory of the network state. (b) The actual energy landscape for a localist attractor net as a function of y, when the input is fixed at the origin and there are three attractors, W = ((-1,0), (1,0), (1, -A)), with a uniform prior. 
The shapes of attractor basins are influenced by the proximity of attractors to one another (the gang effect). The origin of the space (depicted by a point) is equidistant from the attractor on the left and the attractor on the upper right, yet the origin clearly lies in the basin of the right attractors. \n\n
This effect is an emergent property of the distribution of attractors, and is the basis for interesting dynamics; it produces the mutually reinforcing or inhibitory influence of similar items in domains such as semantics [9], memory [10,12], and olfaction [4]. \n\n
Training an attractor net is notoriously tricky. Training procedures are CPU intensive and often produce spurious attractors and ill-conditioned attractor basins [5,11]. Indeed, we are aware of no existing procedure that can robustly translate an arbitrary specification of an attractor landscape into a set of weights. These difficulties are due to the fact that each connection participates in the specification of multiple attractors; thus, knowledge in the net is distributed over connections. \n
We describe an alternative attractor network model in which knowledge is localized, hence the name localist attractor network. The model has many virtues, including: a trivial procedure for wiring up the architecture given an attractor landscape; eliminating spurious attractors; achieving gang effects; providing a clear mathematical interpretation of the model parameters, which clarifies how the parameters control the qualitative behavior of the model (e.g., the magnitude of gang effects); and proofs of convergence and stability. \n\n
A localist attractor net consists of a set of n state units and m attractor units. Parameters associated with an attractor unit i encode the location of the attractor, denoted w_i, and its \"pull\" or strength, denoted π_i, which influence the shape of the attractor basin. 
Its activity at time t, q_i(t), reflects the normalized distance from the attractor center to the current state, y(t), weighted by the attractor strength: \n\n
q_i(t) = π_i g(y(t), w_i, σ(t)) / Σ_j π_j g(y(t), w_j, σ(t))    (1) \n\n
g(y, w, σ) = exp(-|y - w|² / 2σ²)    (2) \n\n
Thus, the attractors form a layer of normalized radial-basis-function units. \n\n
The input to the net, ξ, serves as the initial value of the state, and thereafter the state is pulled toward attractors in proportion to their activity. A straightforward expression of this behavior is: \n\n
y(t+1) = α(t) ξ + (1 - α(t)) Σ_i q_i(t) w_i    (3) \n\n
where α(1) = 1 on the first update and α(t) = 0 for t > 1. More generally, however, one might want to gradually reduce α over time, allowing for a persistent effect of the external input on the asymptotic state. The variables σ(t) and α(t) are not free parameters of the model, but can be derived from the formalism we present below. \n
The localist attractor net is motivated by a generative model of the input based on the attractor distribution, and the network dynamics corresponds to a search for a maximum likelihood interpretation of the observation. In the following section, we derive this result, and then present simulation studies of the architecture. \n\n
1 A MAXIMUM LIKELIHOOD FORMULATION \n\n
The starting point for the statistical formulation of a localist attractor network is a mixture of Gaussians model. A standard mixture of Gaussians consists of m Gaussian density functions in n dimensions. Each Gaussian is parameterized by a mean, a covariance matrix, and a mixture coefficient. The mixture model is generative, i.e., it is considered to have produced a set of observations. Each observation is generated by selecting a Gaussian based on the mixture coefficients and then stochastically selecting a point from the corresponding density function. 
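As a concrete illustration of this generative process, the following sketch samples observations from a standard mixture of Gaussians. The component means, mixture coefficients, and spread below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([[-1.0, 0.0], [1.0, 0.0], [1.0, -0.5]])  # component means (assumed)
priors = np.array([1.0, 1.0, 1.0]) / 3.0                  # mixture coefficients
sigma = 0.1                                               # shared spherical spread (assumed)

def sample(n_samples):
    """Pick a Gaussian by its mixture coefficient, then sample a point from it."""
    ks = rng.choice(len(priors), size=n_samples, p=priors)
    return means[ks] + sigma * rng.standard_normal((n_samples, 2))

obs = sample(500)
```

With a small spread, each observation falls in a tight cloud around one of the three means; fitting such a model by maximum likelihood is the standard EM setting described next.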
The model parameters are adjusted to maximize the likelihood of a set of observations. The Expectation-Maximization (EM) algorithm provides an efficient procedure for estimating the parameters. The Expectation step calculates the posterior probability q_i of each Gaussian for each observation, and the Maximization step calculates the new parameters based on the previous values and the set of q_i. \n
The mixture of Gaussians model can provide an interpretation for a localist attractor network, in an unorthodox way. Each Gaussian corresponds to an attractor, and an observation corresponds to the state. Now, however, instead of fixing the observation and adjusting the Gaussians, we fix the Gaussians and adjust the observation. If there is a single observation, and α = 0 and all Gaussians have uniform spread σ, then Equation 1 corresponds to the Expectation step, and Equation 3 to the Maximization step in this unusual mixture model. \n
Unfortunately, this simple characterization of the localist attractor network does not produce the desired behavior. Many situations produce partial solutions, in which the observation does not end up at an attractor. For example, if two unidimensional Gaussians overlap significantly, the most likely value for the observation is midway between them rather than at the mean of either Gaussian. \n
We therefore extend this mixture-of-Gaussians formulation to better characterize the localist attractor network. As in the simple model, each of the m attractors is a Gaussian generator, the mean of which is a location in the n-dimensional state space. The input to the net, ξ, is considered to have been generated by a stochastic selection of one of the attractors, followed by the addition of zero-mean Gaussian noise with variance specified by the attractor. 
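The overlap failure described above is easy to reproduce: iterating the Expectation step (Equation 1) against the α = 0 update (Equation 3) with two overlapping unidimensional Gaussians stalls midway between them. The attractor locations, spreads, and starting point below are illustrative assumptions.

```python
import numpy as np

def settle_simple(y, w, sigma, steps=100):
    """Iterate Eq. 1 (q_i proportional to exp(-|y - w_i|^2 / 2 sigma^2))
    and Eq. 3 with alpha = 0 (y = sum_i q_i w_i)."""
    for _ in range(steps):
        q = np.exp(-(y - w) ** 2 / (2 * sigma ** 2))
        q /= q.sum()          # normalized responsibilities (uniform priors assumed)
        y = q @ w             # most likely observation given the responsibilities
    return y

w = np.array([-1.0, 1.0])     # two unidimensional Gaussian attractors
print(settle_simple(0.3, w, sigma=0.5))  # well separated: settles near +1
print(settle_simple(0.3, w, sigma=2.0))  # heavy overlap: stalls near the midpoint, 0
```

With σ = 2 the update map contracts toward the midpoint of the two means, which is exactly the kind of partial solution the extended formulation is designed to eliminate.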
Given a particular observation ξ, an attractor's posterior probability is the normalized Gaussian probability of ξ, weighted by its mixing proportion. This posterior distribution for the attractors corresponds to a distribution in state space that is a weighted sum of Gaussians. \n
We then consider the attractor network as encoding this distribution over states implied by the attractor posterior probabilities. At any one time, however, the attractor network can only represent a single position in state space, rather than the entire distribution over states. This restriction is appropriate when the state is an n-dimensional point represented by the pattern of activity over n state units. \n
To accommodate this restriction, we change the standard mixture of Gaussians generative model by interjecting an intermediate level between the attractors and the observation. The first generative level consists of the discrete attractors, the second is the state space, and the third is the observation. Each observation is considered to have been generated by moving down this hierarchy: \n\n
1. select an attractor X = i from the set of attractors \n
2. select a state (i.e., a pattern of activity across the state units) based on the preferred location of that attractor: y = w_i + N_y \n
3. select an observation z = yG + N_z \n\n
The observation z produced by a particular state y depends on the generative weight matrix G. In the networks we consider here, the observation and state spaces are identical, so G is the identity matrix, but the formulation allows for z to lie in some other space. N_y and N_z describe the zero-mean, spherical Gaussian noise introduced at the two levels, with deviations σ_y and σ_z, respectively. 
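The three-step hierarchy above can be sketched directly, with G the identity as in the networks considered here; the attractor locations, strengths, and noise deviations below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
W = np.array([[-1.0, 0.0], [1.0, 0.0], [1.0, -0.5]])  # attractor locations w_i (assumed)
pi = np.array([0.4, 0.3, 0.3])                        # attractor strengths pi_i (assumed)
sigma_y, sigma_z = 0.1, 0.2                           # state and observation noise deviations

def generate():
    """Draw one observation by descending the three-level hierarchy."""
    i = rng.choice(len(pi), p=pi)                  # 1. select an attractor X = i
    y = W[i] + sigma_y * rng.standard_normal(2)    # 2. select a state: y = w_i + N_y
    z = y + sigma_z * rng.standard_normal(2)       # 3. select an observation: z = yG + N_z, G = I
    return i, y, z

i, y, z = generate()
```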
\nIn comparison with the 2-level Gaussian mixture model described above, this 3-level model is more complicated but more standard: the observation ξ is preserved as stable data, and rather than the model manipulating the data, here it can be viewed as iteratively manipulating an internal representation that fits the observation and attractor structure. The attractor dynamics correspond to an iterative search through state space to find the most likely single state that: (a) was generated by the mixture of Gaussian attractors, and (b) in turn generated the observation. \n
Under this model, one could fit an observation ξ by finding the posterior distribution over the hidden states (X and Y) given the observation: \n\n
p(X = i, Y = y | Z = ξ) = p(ξ | y, i) p(y, i) / p(ξ) = p(ξ | y) π_i p(y | i) / ∫_y p(ξ | y) Σ_j π_j p(y | j) dy    (4) \n\n
where the conditional distributions are Gaussian: p(Y = y | X = i) = g(y; w_i, σ_y) and p(ξ | Y = y) = g(ξ; y, σ_z). Evaluating the distribution in Equation 4 is tractable, because the partition function is a sum of a set of Gaussian integrals. Due to the restriction that the network cannot represent the entire distribution, we do not directly evaluate this distribution but instead adopt a mean-field approach, in which we approximate the posterior by another distribution Q(X, Y | ξ). Based on this approximation, the network dynamics can be seen as minimizing an objective function that describes an upper bound on the negative log probability of the observation given the model and mean-field parameters. 
\nIn this approach, one can choose any form of Q to estimate the posterior distribution, but a better estimate allows the network to approach a maximum likelihood solution [13]. We select a simple posterior: Q(X, Y) = q_i δ(Y = y), where q_i = Q(X = i) is the responsibility assigned to attractor i, and y is the estimate of the state that accounts for the observation. The delta function over Y is motivated by the restriction that the explanation of an input consists of a single state. \n
Given this posterior distribution, the objective for the network is to minimize the free energy F, described here for a particular input example ξ: \n\n
F(q, y | ξ) = Σ_i ∫ Q(X = i, Y = y') ln [ Q(X = i, Y = y') / p(ξ, X = i, Y = y') ] dy' = Σ_i q_i ln(q_i / π_i) - ln p(ξ | y) - Σ_i q_i ln p(y | i) \n\n
where π_i is the prior probability (mixture coefficient) associated with attractor i. These priors are parameters of the generative model, as are σ_y, σ_z, and w. Substituting the Gaussian conditional distributions gives, up to an additive constant: \n\n
F(q, y | ξ) = Σ_i q_i ln(q_i / π_i) + |ξ - y|² / 2σ_z² + (1 / 2σ_y²) Σ_i q_i |y - w_i|² + n ln(σ_y σ_z)    (5) \n\n
Given an observation, a good set of mean-field parameters can be determined by alternating between updating the generative parameters and the mean-field parameters. The update procedure is guaranteed to converge to a minimum of F, as long as the updates are done asynchronously and each update minimizes F with respect to a parameter [8]. The update equations for the mean-field parameters are: \n\n
y = (σ_y² ξ + σ_z² Σ_i q_i w_i) / (σ_y² + σ_z²)    (6) \n\n
q_i = π_i p(y | i) / Σ_j π_j p(y | j)    (7) \n\n
In our simulations, we hold most of the parameters of the generative model constant, such as the priors π, the weights w, and the generative noise in the observation, σ_z. 
The only aspect that changes is the generative noise in the state, σ_y, which is a single parameter shared by all attractors: \n\n
σ_y² = (1/n) Σ_i q_i |y - w_i|²    (8) \n\n
The updates of Equations 6-8 can be applied in any order. We typically initialize the state y to ξ at time 0, and then cyclically update the q_i, σ_y, then y. \n
This generative model avoids the problem of spurious attractors described above for the standard Gaussian mixture model. Intuition into how the model avoids spurious attractors can be gained by inspecting the update equations. These equations effectively tie together two processes: moving y closer to some w_i than the others, and increasing the corresponding responsibility q_i. As these two processes evolve together, they act to decrease the noise σ_y, which accentuates the pull of the attractor. Thus stable points that do not correspond to the attractors are rare. \n\n
2 SIMULATION STUDIES \n\n
To create an attractor net, we specify the parameters (π_i, w_i) associated with the attractors based on the desired structure of the energy landscape (e.g., Figure 1b). The only remaining free parameter, σ_z, plays an important role in determining how responsive the system is to the external input. \n
We have conducted several simulation studies to explore properties of localist attractor networks. Systematic investigations with a 200-dimensional state space and 200 attractors, randomly placed at corners of the 200-D hypercube, have demonstrated that spurious responses are exceedingly rare unless more than 85% of an input's features are distorted (Figure 2), and that manipulating parameters such as noise and prior probabilities has the predicted effects. We have also conducted studies of localist attractor networks in the domain of visual images of faces. These simulations have shown that gang effects arise when there is structure among the attractors. 
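Putting Equations 6-8 together with the responsibilities of Equation 7 gives the full settling procedure. The sketch below runs it on a three-attractor layout in the spirit of Figure 1b; the σ_z value, the floor on σ_y, and the exact attractor coordinates are illustrative assumptions.

```python
import numpy as np

def settle(xi, W, pi, sigma_z=1.0, steps=50, min_sigma=1e-3):
    """Cyclically update q_i (Eq. 7), sigma_y (Eq. 8), then y (Eq. 6), starting from y = xi."""
    n = len(xi)
    y, sigma_y = xi.astype(float).copy(), 1.0
    for _ in range(steps):
        d2 = ((y - W) ** 2).sum(axis=1)                # |y - w_i|^2
        q = pi * np.exp(-d2 / (2 * sigma_y ** 2))      # proportional to pi_i p(y|i)
        q /= q.sum()                                   # Eq. 7 (normalizers cancel)
        sigma_y = max(np.sqrt(q @ d2 / n), min_sigma)  # Eq. 8, floored for numerical stability
        y = (sigma_y**2 * xi + sigma_z**2 * (q @ W)) / (sigma_y**2 + sigma_z**2)  # Eq. 6
    return y, q

W = np.array([[-1.0, 0.0], [1.0, 0.0], [1.0, -0.5]])  # layout in the spirit of Figure 1b
pi = np.ones(3) / 3.0                                  # uniform prior
y, q = settle(np.array([0.0, 0.0]), W, pi)
# The origin is equidistant from the left attractor and (1, 0), yet the right-hand
# pair forms a gang: the state is pulled right and settles at an attractor.
```

As σ_y shrinks over iterations, the responsibilities collapse onto a single attractor and the state is drawn all the way to it, illustrating why stable non-attractor points are rare.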
For example, when the attractor set consists of a single view of several different faces, and multiple views of one face, then an input that is a morphed face, a linear combination of one of the single-view faces and one view of the gang face, will end up in the gang attractor even when the initial weighting assigned to the gang face was less than 40%. \n\n
Figure 2: The input must be severely corrupted before the net makes spurious (final state not at an attractor) or adulterous (final state at a neighbor of the generating attractor) responses. (a) The percentage of spurious responses increases as σ_z is increased. (b) The percentage of adulterous responses increases as σ_z is decreased. \n\n
To test the architecture on a larger, structured problem, we modeled the domain of three-letter English words. The idea is to use the attractor network as a content addressable memory which might, for example, be queried to retrieve a word with P in the third position and any letter but A in the second position, a word such as HIP. The attractors consist of the 423 three-letter English words, from ACE to ZOO. The state space of the attractor network has one dimension for each of the 26 letters of the English alphabet in each of the 3 positions, for a total of 78 dimensions. We can refer to a given dimension by the letter and position it encodes, e.g., P3 denotes the dimension corresponding to the letter P in the third position of the word. The attractors are at the corners of a [-1, +1]^78 hypercube. The attractor for a word such as HIP is located at the state having value -1 on all dimensions except for H1, I2, and P3, which have value +1. The external input specifies a state that constrains the solution. 
For example, one might specify \"P in the third position\" by setting the external input to +1 on dimension P3 and to -1 on dimensions α3, for all letters α other than P. One might specify the absence of a constraint in a particular letter position, p, by setting the external input to 0 on dimensions αp, for all letters α. \n
The network's task is to settle on a state corresponding to one of the words, given soft constraints on the letters. The interactive-activation model of word perception [7] performs a similar computation, and our implementation exhibits the key qualitative properties of their model. If the external input specifies a word, of course the attractor net will select that word. Interesting queries are those in which the external input underconstrains or overconstrains the solution. We illustrate with one example of the network's behavior, in which the external input specifies D1, E2, and G3. Because DEG is a nonword, no attractor exists for that state. The closest attractors share two letters with DEG, e.g., PEG, BEG, DEN, and DOG. Figure 3 shows the effect of gangs on the selection of a response, BEG. \n\n
3 CONCLUSION \n\n
Localist attractor networks offer an attractive alternative to standard attractor networks, in that their dynamics are easy to control and adapt. We described a statistical formulation of a type of localist attractor, and showed that it provides a Lyapunov function for the system as well as a mathematical interpretation for the network parameters. The dynamics of this system are derived not from intuitive arguments but from this formal mathematical model. Simulation studies show that the architecture achieves gang effects, and spurious attractors are rare. 
This approach is inefficient if the attractors have compositional structure, but for many applications of pattern recognition or associative memory, the number of items being stored is small. The approach is especially useful in cases where attractor locations are known, and the key focus of the network is the mutual influence of the attractors, as in many cognitive modelling studies. \n\n
Figure 3: Simulation of the 3-letter word attractor network, queried with DEG. Each frame shows the relative activity of attractor units at various points in processing (iterations 2 through 5). Activity in each frame is normalized such that the most active unit is printed in black ink; the lighter the ink color, the less active the unit. Only attractor units sharing at least one letter with DEG are shown. The selection, BEG, is a product of a gang effect. The gangs in this example are formed by words sharing two letters. The most common word beginnings are PE- (7 instances) and DI- (6); the most common word endings are -AG (10) and -ET (10); the most common first-last pairings are B-G (5) and D-G (3). One of these gangs supports B1, two support E2, and three support G3, hence BEG is selected. \n\n
References \n\n
[1] Becker, S., Moscovitch, M., Behrmann, M., & Joordens, S. (1997). 
Long-term semantic priming: A computational account and empirical evidence. Journal of Experimental Psychology: Learning, Memory, & Cognition, 23(5), 1059-1082. \n
[2] Golden, R. (1988). Probabilistic characterization of neural model computations. In D. Z. Anderson (Ed.), Neural Information Processing Systems (pp. 310-316). American Institute of Physics. \n
[3] Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558. \n
[4] Kay, L. M., Lancaster, L. R., & Freeman, W. J. (1996). Reafference and attractors in the olfactory system during odor recognition. International Journal of Neural Systems, 7(4), 489-495. \n
[5] Mathis, D. (1997). A computational theory of consciousness in cognition. Unpublished doctoral dissertation. Boulder, CO: Department of Computer Science, University of Colorado. \n
[6] Mathis, D., & Mozer, M. C. (1996). Conscious and unconscious perception: A computational theory. In G. Cottrell (Ed.), Proceedings of the Eighteenth Annual Conference of the Cognitive Science Society (pp. 324-328). Erlbaum. \n
[7] McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: Part I. An account of basic findings. Psychological Review, 88, 375-407. \n
[8] Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in Graphical Models. Kluwer Academic Press. \n
[9] McRae, K., de Sa, V. R., & Seidenberg, M. S. (1997). On the nature and scope of featural representations of word meaning. Journal of Experimental Psychology: General, 126(2), 99-130. \n
[10] Redish, A. D., & Touretzky, D. S. (1998). The role of the hippocampus in solving the Morris water maze. Neural Computation, 10(1), 73-111. \n
[11] Rodrigues, N. C., & Fontanari, J. F. (1997). 
Multivalley structure of attractor neural networks. Journal of Physics A (Mathematical and General), 30, 7945-7951. \n
[12] Samsonovich, A., & McNaughton, B. L. (1997). Path integration and cognitive mapping in a continuous attractor neural network model. Journal of Neuroscience, 17(15), 5900-5920. \n
[13] Saul, L. K., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61-76. \n", "award": [], "sourceid": 1695, "authors": [{"given_name": "Richard", "family_name": "Zemel", "institution": null}, {"given_name": "Michael", "family_name": "Mozer", "institution": null}]}