{"title": "SARDNET: A Self-Organizing Feature Map for Sequences", "book": "Advances in Neural Information Processing Systems", "page_first": 577, "page_last": 584, "abstract": "", "full_text": "SARDNET: A Self-Organizing Feature \n\nMap for Sequences \n\nDaniel L. James and Risto Miikkulainen \n\nDepartment of Computer Sciences \nThe University of Texas at Austin \n\nAustin, TX 78712 \n\ndljames,risto~cs.utexas.edu \n\nAbstract \n\nA self-organizing neural network for sequence classification called \nSARDNET is described and analyzed experimentally. SARDNET \nextends the Kohonen Feature Map architecture with activation re(cid:173)\ntention and decay in order to create unique distributed response \npatterns for different sequences. SARDNET yields extremely dense \nyet descriptive representations of sequential input in very few train(cid:173)\ning iterations. The network has proven successful on mapping ar(cid:173)\nbitrary sequences of binary and real numbers, as well as phonemic \nrepresentations of English words. Potential applications include \nisolated spoken word recognition and cognitive science models of \nsequence processing. \n\n1 \n\nINTRODUCTION \n\nWhile neural networks have proved a good tool for processing static patterns, classi(cid:173)\nfying sequential information has remained a challenging task. The problem involves \nrecognizing patterns in a time series of vectors, which requires forming a good inter(cid:173)\nnal representation for the sequences. Several researchers have proposed extending \nthe self-organizing feature map (Kohonen 1989, 1990), a highly successful static \npattern classification method, to sequential information (Kangas 1991; Samara(cid:173)\nbandu and Jakubowicz 1990; Scholtes 1991). Below, three of the most recent of \nthese networks are briefly described. The remainder of the paper focuses on a new \narchitecture designed to overcome the shortcomings of these approaches. \n\n\f578 \n\nDaniel L. 
James, Risto Miikkulainen \n\nRecently, Chappel and Taylor (1993) proposed the Temporal Kohonen Map (TKM) \narchitecture for classifying sequences. The TKM keeps track of the activation his(cid:173)\ntory of each node by updating a value called leaky integrator potential, inspired by \nthe membrane potential in biological neural systems. The activity of a node depends \nboth on the current input vector and the previous input vectors, represented by the \nnode's potential. A given sequence is processed by mapping one vector at a time, \nand the last winning node serves to represent the entire sequence. This way, there \nneeds to be a separate node for every possible sequence, which is a disadvantage \nwhen the number of sequences to be classified is large. The TKM also suffers from \nloss of context. Which node wins depends almost entirely upon the most recent \ninput vectors. For example, the string baaaa would most likely map to the same \nnode as aaaaa, making the approach applicable only to short sequences. \n\nThe SOFM-S network proposed by van Harmelen (1993) extends TKM such that \nthe activity of each map node depends on the current input vector and the past \nactivation of all map nodes. The SOFM-S is an improvement of TKM in that con(cid:173)\ntextual information is not lost as quickly, but it still uses a single node to represent \na sequence. \n\nThe TRACE feature map (Zandhuis 1992) has two feature map layers. The first \nlayer is a topological map of the individual input vectors, and is used to generate \na trace (i.e. path) of the input sequence on the map. The second layer then maps \nthe trace pattern to a single node. In TRACE, the sequences are represented by \ndistributed patterns on the first layer, potentially allowing for larger capacity, but \nit is difficult to encode sequences where the same vectors repeat, such as baaaa. 
All a-vectors would be mapped on the same unit in the first layer, and any number of a-vectors would be indistinguishable. \n\nThe architecture described in this paper, SARDNET (Sequential Activation Retention and Decay NETwork), also uses a subset of map nodes to represent the sequence of vectors. Such a distributed approach allows a large number of representations to be \"packed\" into a small map, like sardines. In the following sections, we will examine how SARDNET differs from conventional self-organizing maps and how it can be used to represent and classify a large number of complex sequences. \n\n2 THE SARDNET ARCHITECTURE \n\nInput to SARDNET consists of a sequence of n-dimensional vectors S = V_1, V_2, V_3, ..., V_l (figure 1). The components of each vector are real values in the interval [0,1]. For example, each vector might represent a sample of a speech signal in n different frequencies, and the entire sequence might constitute a spoken word. The SARDNET input layer consists of n nodes, one for each component in the input vector, and their values are denoted as A = (a_1, a_2, a_3, ..., a_n). The map consists of m x m nodes with activation eta_jk, 1 <= j, k <= m. Each node has an n-dimensional input weight vector w_jk, which determines the node's response to the input activation. \n\nIn a conventional feature map network as well as in SARDNET, each input vector is mapped on a particular unit on the map, called the winner or the maximally responding unit. In SARDNET, however, once a node wins an input, it is made \n\nFigure 1: The SARDNET architecture. A sequence of input vectors activates units on the map one at a time. 
The past winners are excluded from further competition, and their activation is decayed gradually to indicate position in the sequence. \n\nINITIALIZATION: Clear all map nodes to zero. \nMAIN LOOP: While not end of sequence: \n1. Find the unactivated weight vector that best matches the input. \n2. Assign 1.0 activation to that unit. \n3. Adjust the weight vectors of the nodes in the neighborhood. \n4. Exclude the winning unit from subsequent competition. \n5. Decrement activation values for all other active nodes. \nRESULT: Sequence representation = activated nodes ordered by activation values. \n\nTable 1: The SARDNET training algorithm. \n\nineligible to respond to the subsequent inputs in the sequence. This way a different map node is allocated for every vector in the sequence. As more vectors come in, the activation of the previous winners decays. In other words, each sequence of length l is represented by l active nodes on the map, with their activity indicating the order in which they were activated. The algorithm is summarized in table 1. \n\nAssume the maximum length of the sequences we wish to classify is l, and each input vector component can take on p possible values. Since there are p^n possible input vectors, l p^n map nodes are needed to represent all possible vectors in all possible positions in the sequence, and a distributed pattern over the l p^n nodes can be used to represent all p^(nl) different sequences. This approach offers a significant advantage over methods in which p^(nl) nodes would be required for the p^(nl) sequences. \n\nThe specific computations of the SARDNET algorithm are as follows. The winning node (j, k) in each iteration is determined by the Euclidean distance D_jk between the input vector A and the node's weight vector w_jk: \n\nD_jk = sum_{i=1}^{n} (w_{jk,i} - a_i)^2.   (1) \n\nThe unit with the smallest distance is selected as the winner and activated with 1.0. 
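As a concrete illustration, the winner search of equation (1) restricted to unactivated nodes can be sketched in a few lines of Python. This is a minimal sketch, not code from the paper; the nested-list representation and the function name are our own choices.

```python
import math

def find_winner(weights, a, active):
    """Best-matching unactivated node, per equation (1).

    weights: m x m grid of n-dimensional weight vectors w_jk (nested lists)
    a:       input vector A of length n
    active:  m x m grid of booleans marking past winners (excluded)
    """
    best, best_d = None, math.inf
    for j, row in enumerate(weights):
        for k, w in enumerate(row):
            if active[j][k]:
                continue  # past winners cannot compete again
            # squared Euclidean distance D_jk between A and w_jk
            d = sum((wi - ai) ** 2 for wi, ai in zip(w, a))
            if d < best_d:
                best, best_d = (j, k), d
    return best
```

Marking the winner in `active` after each input is what forces a fresh node to be allocated for every vector in the sequence.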
\nThe weights of this node and all nodes in its neighborhood are changed according to the standard feature map adaptation rule: \n\nw_{jk,i} <- w_{jk,i} + alpha (a_i - w_{jk,i}),   (2) \n\nwhere alpha denotes the learning rate. As usual, the neighborhood starts out large and is gradually decreased as the map becomes more ordered. As the last step in processing an input vector, the activation eta_jk of all active units in the map is decayed in proportion to the decay parameter d: \n\nO
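Putting the pieces together, one full SARDNET iteration (steps 1-5 of table 1) might look as follows. This is an illustrative sketch under stated assumptions, not the authors' code: we assume the decay in step 5 is multiplicative (eta <- d * eta), and we use a simple square neighborhood of fixed radius for the adaptation rule (2); the paper's neighborhood shrinks over training.

```python
import math

def sardnet_step(weights, eta, active, a, alpha=0.1, d=0.9, radius=1):
    """One SARDNET iteration for input vector a (steps 1-5 of table 1).

    Assumes multiplicative decay and a square neighborhood of the given
    radius -- both illustrative choices, not specified exactly this way
    in the paper.
    """
    m = len(weights)
    # Step 1: best-matching unactivated node, equation (1)
    best, best_d = None, math.inf
    for j in range(m):
        for k in range(m):
            if active[j][k]:
                continue
            dist = sum((wi - ai) ** 2 for wi, ai in zip(weights[j][k], a))
            if dist < best_d:
                best, best_d = (j, k), dist
    j, k = best
    # Step 5: decay all previously active nodes
    for jj in range(m):
        for kk in range(m):
            if active[jj][kk]:
                eta[jj][kk] *= d
    # Steps 2 and 4: activate the winner and exclude it from competition
    eta[j][k] = 1.0
    active[j][k] = True
    # Step 3: adaptation rule (2) in the winner's neighborhood
    for jj in range(max(0, j - radius), min(m, j + radius + 1)):
        for kk in range(max(0, k - radius), min(m, k + radius + 1)):
            weights[jj][kk] = [w + alpha * (ai - w)
                               for w, ai in zip(weights[jj][kk], a)]
    return j, k
```

Feeding a sequence through `sardnet_step` one vector at a time leaves the representation described above: one active node per input, with activation levels ordered by recency.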