{"title": "Recurrent Neural Networks Can Learn to Implement Symbol-Sensitive Counting", "book": "Advances in Neural Information Processing Systems", "page_first": 87, "page_last": 93, "abstract": null, "full_text": "Recurrent Neural Networks Can Learn to \n\nImplement Symbol-Sensitive Counting \n\nPaul Rodriguez \n\nDepartment of Cognitive Science \nUniversity of California, San Diego \n\nLa Jolla, CA. 92093 \n\nprodrigu@cogsci.ucsd.edu \n\nJanet Wiles \n\nSchool of Information Technology and \n\nDepartment of Psychology \nUniversity of Queensland \n\nBrisbane, Queensland 4072 Australia \n\njanetw@it.uq.edu.au \n\nAbstract \n\nRecently researchers have derived formal complexity analysis of analog \ncomputation in the setting of discrete-time dynamical systems. As an \nempirical constrast, training recurrent neural networks (RNNs) produces \nself -organized systems that are realizations of analog mechanisms. Pre(cid:173)\nvious work showed that a RNN can learn to process a simple context-free \nlanguage (CFL) by counting. Herein, we extend that work to show that a \nRNN can learn a harder CFL, a simple palindrome, by organizing its re(cid:173)\nsources into a symbol-sensitive counting solution, and we provide a dy(cid:173)\nnamical systems analysis which demonstrates how the network: can not \nonly count, but also copy and store counting infonnation. \n\n1 \n\nINTRODUCTION \n\nSeveral researchers have recently derived results in analog computation theory in the set(cid:173)\nting of discrete-time dynamical systems(Siegelmann, 1994; Maass & Opren, 1997; Moore, \n1996; Casey, 1996). For example, a dynamical recognizer (DR) is a discrete-time continu(cid:173)\nous dynamical system with a given initial starting point and a finite set of Boolean output \ndecision functions(pollack. 1991; Moore, 1996; see also Siegelmann, 1993). 
The dynamical system is composed of a space R^n, an alphabet A, a set of functions (1 per element of A) that each map R^n -> R^n, and an accepting region H in R^n. With enough precision and appropriate differential equations, DRs can use real-valued variables to encode the contents of a stack or counter (for details see Siegelmann, 1994; Moore, 1996). \n\nAs an empirical contrast, training recurrent neural networks (RNNs) produces self-organized implementations of analog mechanisms. In previous work we showed that an RNN can learn to process a simple context-free language, a^n b^n, by organizing its resources into a counter which is similar to hand-coded dynamical recognizers but also exhibits some novelties (Wiles & Elman, 1995). In particular, similar to hand-coded counters, the network developed proportional contracting and expanding rates, and precision matters - but unexpectedly the network distributed the contraction/expansion axis among hidden units, developed a saddle point to transition between the first half and second half of a string, and used oscillating dynamics as a way to visit regions of the phase space around the fixed points. \n\nIn this work we show that an RNN can implement a solution for a harder CFL, a simple palindrome language (described below), which requires a symbol-sensitive counting solution. We provide a dynamical systems analysis which demonstrates how the network can not only count, but also copy and store counting information implicitly in space around a fixed point. \n\n2 TRAINING AN RNN TO PROCESS CFLs \n\nWe use a discrete-time RNN that has 1 hidden layer with recurrent connections, and 1 output layer without recurrent connections, so that the accepting regions are determined by the output units. 
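A minimal sketch of such a discrete-time transducer, assuming sigmoid activations and random Gaussian weight initialization (neither is specified at this point in the text; `SRNTransducer` and all parameter names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SRNTransducer:
    """Discrete-time RNN: one hidden layer with recurrent connections,
    one non-recurrent output layer whose units define prediction regions."""
    def __init__(self, n_in, n_hid, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0, 0.5, (n_hid, n_in))    # input -> hidden
        self.W_rec = rng.normal(0, 0.5, (n_hid, n_hid))  # hidden -> hidden (recurrent)
        self.W_out = rng.normal(0, 0.5, (n_out, n_hid))  # hidden -> output
        self.b_hid = np.zeros(n_hid)                     # bias node weights
        self.b_out = np.zeros(n_out)

    def step(self, x, h_prev):
        # Hidden units combine the current input with the previous hidden state;
        # the recurrent activations are fed back on the next time step.
        h = sigmoid(self.W_in @ x + self.W_rec @ h_prev + self.b_hid)
        y = sigmoid(self.W_out @ h + self.b_out)         # next-symbol predictions
        return h, y

    def run(self, inputs):
        h = np.zeros(self.W_rec.shape[0])
        outputs = []
        for x in inputs:
            h, y = self.step(x, h)
            outputs.append(y)
        return np.array(outputs)
```

With 1-hot input vectors, an output unit above a fixed threshold would be read as predicting the corresponding symbol.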
The RNN produces output in Time(n), where n is the length of the input, and it can recognize languages that are a proper subset of the context-sensitive languages and a proper superset of the regular languages (Moore, 1996). Consequently, the RNN we investigate can in principle embody the computational power needed to process self-recursion. \n\nFurthermore, many connectionist models of language processing have used a prediction task (e.g. Elman, 1990). Hence, we trained an RNN to be a real-time transducer version of a dynamical recognizer that predicts the next input in a sequence. Although the network does not explicitly accept or reject strings, if our network makes all the right predictions possible then performing the prediction task subsumes the accept task, and in principle one could simply reject unmatched predictions. We used a threshold criterion of .5 such that if an output node has a value greater than .5 then the network is considered to be making that prediction. If the network makes all the right predictions possible for some input string, then it is correctly processing that string. Although a finite-dimensional RNN cannot process CFLs robustly with a margin for error (e.g. Casey, 1996; Maass and Orponen, 1997), we will show that it can acquire the right kind of trajectory to process the language in a way that generalizes to longer strings. \n\n2.1 A SIMPLE PALINDROME LANGUAGE \n\nA palindrome language (mirror language) consists of a set of strings, S, such that each string, s ∈ S, s = w w^r, is a concatenation of a substring, w, and its reverse, w^r. The relevant aspect of this language is that a mechanism cannot use a simple counter to process the string but must use the functional equivalent of a stack that enables it to match the symbols in the second half of the string with the first half. 
\nWe investigated a palindrome language that uses only two symbols for w and two other symbols for w^r, such that the second half of the string is fully predictable once the change in symbols occurs. The language we used is a simple version restricted such that one symbol is always present and precedes the other, for example: w = a^n b^m, w^r = B^m A^n, e.g. aaaabbbBBBAAAA (where n > 0, m >= 0). Note that the embedded subsequence b^m B^m is just the simple CFL used in Wiles & Elman (1995) as mentioned above; hence, one can reasonably expect that a solution to this task has an embedded counter for the subsequence b...B. \n\n2.2 LINEAR SYSTEM COUNTERS \n\nA basic counter in analog computation theory uses real-valued precision (e.g. Siegelmann, 1994; Moore, 1996). For example, a 1-dimensional up/down counter for two symbols {a, b} is the system f(z) = .5z + .5a, f(z) = 2z - b, where z is the state variable, a is the input variable to count up (push), and b is the input variable to count down (pop). A sequence of input aaabbb has state values (starting at 0): .5, .75, .875, .75, .5, 0. \n\nSimilarly, for our transducer version one can develop piecewise linear system equations in which counting takes place along different dimensions so that different predictions can be made at appropriate time steps¹. The linear system serves as a hypothesis, before running any simulations, for understanding the implementation issues for an RNN. For example, using the function f(z) = z for z in [0,1], 0 for z < 0, 1 for z > 1, then for the simple palindrome task one can explicitly encode a mechanism to copy and store the count for a across the b...B subsequence. 
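The up/down counter can be checked directly. A minimal script (with the pop map taken as z -> 2z - 1, i.e. with input value b = 1, which reproduces the example trace):

```python
def push(z):
    # input a: count up, contracting toward 1
    return 0.5 * z + 0.5

def pop(z):
    # input b: count down, expanding away (2z - 1 reproduces the trace)
    return 2.0 * z - 1.0

def trace(string, z=0.0):
    """Return the counter's state after each symbol of an a/b string."""
    states = []
    for sym in string:
        z = push(z) if sym == 'a' else pop(z)
        states.append(round(z, 4))
    return states

print(trace("aaabbb"))   # [0.5, 0.75, 0.875, 0.75, 0.5, 0.0]
```

Each push halves the distance to 1 and each pop doubles it back, which is why the count is recoverable with enough precision.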
If we assign dimension 1 to a, dimension 2 to b, dimension 3 to A, dimension 4 to B, and dimension 5 to store the a value, we can build a system so that for a sequence aaabbBBAAA we get state variable values: initial (0,0,0,0,0), (.5,0,0,0,0), (.75,0,0,0,0), (.875,0,0,0,0), (0,.5,0,0,.875), (0,.75,0,0,.875), (0,0,0,.5,.875), (0,0,0,0,.875), (0,0,.75,0,0), (0,0,.5,0,0), (0,0,0,0,0). The matrix equations for such a system could be: \n\nX_t = f( W * X_{t-1} + U * I_t ) \n\nW = [ .5  0  0  0  0 \n       0 .5  0  0  0 \n       0  0  2  0  2 \n       0  2  0  2  0 \n       1  0  0  0  1 ]    U = [ .5  -1  -1  -1 \n      -1  .5  -1  -1 \n      -5  -5  -1  -5 \n      -5  -5  -5  -1 \n      -2   0  -2   0 ] \n\nwhere t is time, X_t is the 5-dimensional state vector, and I_t is the 4-dimensional input vector using 1-hot encoding of a = [1,0,0,0], b = [0,1,0,0], A = [0,0,1,0], B = [0,0,0,1]. The simple trick is to use the input weights to turn the counting on or off. For example, the dimension-5 state variable is turned off when the input is a or A, but turned on when b is input, at which time it copies the last a value and holds on to it. It is then easy to add Boolean output decision functions that keep predictions linearly separable. \n\nHowever, other solutions are possible. Rather than store the a count, one could keep counting up in dimension 1 for b input and then cancel it by counting down for B input. The questions that arise are: Can an RNN implement a solution that generalizes? What kind of store and copy mechanism does an RNN discover? \n\n2.3 TRAINING DATA & RESULTS \n\nThe training set consists of 68 possible strings of total length <= 25, which means a maximum of n + m = 12: 12 symbols in the first half, 12 symbols in the second half, and 1 end symbol². The complete training set has more short strings so that the network does not disregard the transitions at the end of the string or at the end of the b...B subsequence. The network consists of 5 input, 5 hidden, and 5 output units, with a bias node. 
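A piecewise-linear system of this kind can be written out in a few lines; the weight entries below are one concrete choice of "on/off" biases (an illustrative reconstruction, not the network's learned weights) that reproduces the aaabbBBAAA state trace:

```python
import numpy as np

def f(z):
    # piecewise-linear squashing: identity on [0,1], clipped outside
    return np.clip(z, 0.0, 1.0)

# State dims: 1 = a-count, 2 = b-count, 3 = A-count, 4 = B-count, 5 = stored a-count.
W = np.array([[0.5, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.5, 0.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 0.0, 2.0],
              [0.0, 2.0, 0.0, 2.0, 0.0],
              [1.0, 0.0, 0.0, 0.0, 1.0]])
# Input columns a, b, A, B: per-symbol biases turn each counter on or off.
U = np.array([[ 0.5, -1.0, -1.0, -1.0],
              [-1.0,  0.5, -1.0, -1.0],
              [-5.0, -5.0, -1.0, -5.0],
              [-5.0, -5.0, -5.0, -1.0],
              [-2.0,  0.0, -2.0,  0.0]])
onehot = {'a': 0, 'b': 1, 'A': 2, 'B': 3}

x = np.zeros(5)
for sym in "aaabbBBAAA":
    i = np.zeros(4)
    i[onehot[sym]] = 1.0
    x = f(W @ x + U @ i)   # dimension 5 copies and holds the a count during b...B
    print(sym, x)
```

On the b steps, row 5 (x1 + x5 with zero bias) copies the a count into dimension 5 and holds it there; on a and A steps its large negative bias clamps it back to 0.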
The hidden and recurrent units are updated in the same time step as the input is presented. The recurrent layer activations are input on the next time step. The weight updates are performed using back-propagation through time, with error injected at each time step, backward for 24 time steps for each input. \n\nWe found that about half our simulations learn to make predictions for transitions, and most will have few generalizations on longer strings not seen in the training set. However, no network learned the complete training set perfectly. The best network was trained for 250K sweeps (1 per character) with a learning parameter of .001, and 136K more sweeps with .0001, for a total of about 51K strings. The network made 28 total prediction errors on 28 different strings in the test set of 68 possible strings seen in training. All of these errors were isolated to 3 situations: when the number of a inputs was 2 or 4, the error occurred at the B-to-A transition; when the number of a inputs was 1, for m > 2, the error occurred as an early A-to-end transition. \n\nImportantly, the network made correct predictions on many strings longer than seen in training, e.g. strings that have total length > 25 (or n + m > 12). It counted longer strings of a...As with or without embedded b...Bs, such as: w = a^13; w = a^13 b^2; w = a^n b^7, n = 6, 7, or 8 (recall that w is the first half of the string). It also generalized to count longer subsequences of b...Bs with or without more a...As, such as w = a^5 b^n where n = 8, 9, 10, 11, 12. The longest string it processed correctly was w = a^9 b^9, which is 12 more characters than seen during training. The network learned to store the count for a^9 for up to 9 b's, even though the longest example it had seen in training had only 3 b's - clearly it's doing something right. \n\n¹These can be expanded relatively easily to include more symbols, different symbol representations, harder palindrome sequences, or different kinds of decision planes. \n\n²We removed training strings w = a^n b, for n > 1; it turns out that the network interpolates on the B-to-A transition for these. Also, we added an end symbol to help reset the system to a consistent starting value. \n\n2.4 NETWORK EVALUATION \n\nOur evaluation will focus on how the best network counts, copies, and stores information. We use a mix of graphical analysis and linear system analysis to piece together a global picture of how phase space trajectories hold informational states. The linear system analysis consists of investigating the local behaviour of the Jacobian at fixed points under each input condition separately. We refer to Fa as the autonomous system under the a input condition, and similarly for Fb, FA, and FB. \n\nThe most salient aspect of the solution is that the network divides up the processing along different dimensions in space. By inspection we note that hidden unit 1 (HU1) takes on low values for the first half of the string and high values for the second half, which helps keep the processing linearly separable. Therefore in the graphical analysis of the RNN we can set HU1 to a constant. \n\nFirst, we can evaluate how the network counts the b...B subsequences. Again by inspection, the network uses dimensions HU3 and HU4. The graphical analysis in Figure 1a and Figure 1b plots the activity of HU3xHU4. It shows how the network counts the right number of Bs and then makes a transition to predict the first A. The dominant eigenvalues at the Fb attracting point and the FB saddle point are inversely proportional, which indicates that the contraction rate to and expansion rate away from the fixed points are inversely matched. 
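The matched-rate idea can be illustrated in one dimension (with an illustrative rate, not the network's actual eigenvalues): contracting toward an attracting point at rate λ for n steps leaves the state at distance λ^n, and expanding away from a saddle at the inverse rate 1/λ then takes exactly n steps to escape a fixed neighborhood, so the count is read back out.

```python
LAM = 0.5   # contraction rate toward the attracting point (illustrative)

def count_up(n, x0=1.0):
    """Contract for n steps: distance to the fixed point shrinks by LAM each step."""
    x = x0
    for _ in range(n):
        x *= LAM
    return x

def steps_to_escape(x, threshold=1.0):
    """Expand at the inverse rate until the state leaves the neighborhood."""
    steps = 0
    while x < threshold:
        x /= LAM
        steps += 1
    return steps

for n in range(1, 10):
    # expansion steps match contraction steps: the count is copied, not stored
    assert steps_to_escape(count_up(n)) == n
```

This is the sense in which the b count is "copied" into a different region of phase space: it is carried by the distance to the saddle point rather than by any single latched unit.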
The FB system expands out to a periodic-2 fixed point in the HU3xHU4 subspace, and the unstable eigenvector corresponding to the one unstable eigenvalue has components only in HU3 and HU4. In Figure 2 we plot the vector field that describes the flow in phase space for the composite FB^2, which shows the direction in which the system contracts along the stable manifold and expands along the unstable manifold. One can see that the nature of the transition after the last b to the first B is to place the state vector close to the saddle point for FB, so that the number of expansion steps matches the number of Fb contraction steps. In this way the b count is copied over to a different region of phase space. \n\nNow we evaluate how the network counts a...A, first without any b...B embedding. Since the output unit for the end symbol has very high weight values for HU2, and the Fa system has little activity in HU4, we note that a is processed in HU2xHU3xHU5. The trajectories in Figure 3 show a plot of a^13 A^13 that properly predicts all As as well as the transition at the end. Furthermore, the dominant eigenvalues for the Fa attracting point and the FA saddle point are nearly inversely proportional, and the FA system expands to a periodic-2 fixed point in 4 dimensions (HU1 is constant, whereas the other HU values are periodic). The Fa eigenvectors have strong-to-moderate components in dimensions HU2, HU3, and HU5; and likewise in HU2, HU3, HU4, and HU5 for FA. \n\nThe much harder question is: How does the network maintain the information about the count of as that were input while it is processing the b...B subsequence? Inspection shows that after processing a^n the activation values are not directly copied over any HU values, nor do they latch any HU values that indicate how many as were processed. Instead, the last state value after the last a affects the dynamics for b...B in such a way that it clusters the last state value after the last B, but only in HU3xHU4 space (since the other HU dimensions were unchanging throughout b...B processing). \n\nWe show in Figure 4 the clusters for state variables in HU3xHU4 space after processing a^n b^m B^m, where n = 2, 3, 4, 5, or 6; m = 1..10. The graph shows that the information about how many a's occurred is \"stored\" in the HU3xHU4 region where points are clustered. Figure 4 includes the dividing line from Figure 1b for the predict-A region. The network does not predict the B-to-A transition after a^4 or a^2 because it ends up on the wrong side of the dividing line of Figure 1b, but the network in these cases still predicts the A-to-end transition. We see that if the network did not oscillate around the FB saddle point while expanding then the trajectory would end up correctly on one side of the decision plane. \n\nIt is important to see that the clusters themselves in Figure 4 are on a contracting trajectory toward a fixed point, which stores information about an increasing number of as when matched by an expansion of the FA system. For example, the state values after a^5 AA and a^5 b^m B^m AA, m = 2..10, have a total Hamming distance over all 5 dimensions that ranged from .070 to .079. Also, the fixed point for the Fa system, the estimated fixed point for the composite FB ∘ Fb ∘ Fa, and the saddle point of the FA system are collinear³ in all the relevant counting dimensions: 2, 3, 4, and 5. In other words, the FA system contracts the different coordinate points, one for a^n and one for a^n b^m B^m, towards the saddle point to nearly the same location in phase space, treating those points as having the same information. Unfortunately, this contraction occurs through a 4-dimensional subspace which we cannot easily show graphically. 
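The kind of local fixed-point analysis used here can be reproduced numerically: estimate the Jacobian of the map under a fixed input condition by finite differences and inspect its eigenvalues. A sketch with a hypothetical 2-D saddle map standing in for the network's Fa, Fb, FA, FB:

```python
import numpy as np

def jacobian(F, x, eps=1e-6):
    """Finite-difference Jacobian of a map F: R^n -> R^n at the point x."""
    n = len(x)
    J = np.zeros((n, n))
    fx = F(x)
    for j in range(n):
        dx = np.zeros(n)
        dx[j] = eps
        J[:, j] = (F(x + dx) - fx) / eps
    return J

def F(x):
    # Hypothetical saddle-like map about (0, 0): contracts dim 0, expands dim 1.
    return np.array([0.5 * x[0], 2.0 * x[1]])

J = jacobian(F, np.zeros(2))
eigvals = np.linalg.eigvals(J)
# One stable (|lambda| < 1) and one unstable (|lambda| > 1) direction -> saddle point;
# the eigenvectors give the stable and unstable manifold directions at the fixed point.
```

For a trained network, F would be the hidden-state update with the input clamped to one symbol, and the fixed point would first be located by iterating the map.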
\n\n3 CONCLUSION \n\nIn conclusion, we have shown that an RNN can develop a symbol-sensitive counting solution for a simple palindrome. In fact, this solution is not a stack but consists of non-independent counters that use dynamics to visit different regions at appropriate times. Furthermore, an RNN can implement counting solutions for a prediction task that are functionally similar to those prescribed by analog computation theory, but the store and copy functions rely on distance in phase space to implicitly affect other trajectories. \n\nAcknowledgements \n\nThis research was funded by the UCSD Center for Research in Language Training Grant to Paul Rodriguez, and a grant from the Australian Research Council to Janet Wiles. \n\nReferences \n\nCasey, M. (1996) The Dynamics of Discrete-Time Computation, With Application to Recurrent Neural Networks and Finite State Machine Extraction. Neural Computation, 8. \n\nElman, J.L. (1990) Finding Structure in Time. Cognitive Science, 14, 179-211. \n\nMaass, W., Orponen, P. (1997) On the Effect of Analog Noise in Discrete-Time Analog Computations. Proceedings Neural Information Processing Systems, 1996. \n\nMoore, C. (1996) Dynamical Recognizers: Real-Time Language Recognition by Analog Computation. Santa Fe Institute Working Paper 96-05-023. \n\n³Relative to the saddle point, the vector for one fixed point, multiplied by a constant, had the same value (to within .05) in each of 4 dimensions as the vector for the other fixed point. \n\nFigure 1: 1a) Trajectory of b^10 (after a^5) in HU3xHU4. 
Each arrow represents a trajectory step: the base is a state vector at time t, the head is a state at time t+1. The first b trajectory step has a base near (.9, .05), which is the previous state from the last a. The output node b is > .5 above the dividing line. 1b) Trajectory of B^10 (after a^5 b^10) in HU3xHU4. The output node B is > .5 above the dashed dividing line, and the output node A is > .5 below the solid dividing line. The system crosses the line on the last B step, hence it predicts the B-to-A transition. \n\nPollack, J.B. (1991) The Induction of Dynamical Recognizers. Machine Learning, 7, 227-252. \n\nSiegelmann, H. (1993) Foundations of Recurrent Neural Networks. PhD dissertation, unpublished. New Brunswick: Rutgers, The State University of New Jersey. \n\nWiles, J., Elman, J. (1995) Counting Without a Counter: A Case Study in Activation Dynamics. Proceedings of the Seventeenth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum Associates. \n\nFigure 2: Vector field that describes the flow of FB^2 projected onto HU3xHU4. The graph shows a saddle point near (.5, .5) and a periodic-2 attracting point. \n\nFigure 3: 3a) Trajectory of a^13 projected onto HU2xHU3. The output node a is > .5 below and right of the dividing line. The projection for HU2xHU5 is very similar. 3b) Trajectory of A^13 (after a^13) projected onto HU2xHU3. 
The output node for the end symbol is > .5 on the 13th trajectory step, left of the solid dividing line, and it is > .5 on the 11th step, left of the dashed dividing line (the hyperplane projection must use values at the appropriate time steps); hence the system predicts the A-to-end transition. The graphs for HU2xHU5 and HU2xHU4 are very similar. \n\nFigure 4: Clusters of last state values a^n b^m B^m, m > 1, projected onto HU3xHU4. Notice that for increasing n the system oscillates toward an attracting point of the system FB ∘ Fb ∘ Fa. \n", "award": [], "sourceid": 1361, "authors": [{"given_name": "Paul", "family_name": "Rodriguez", "institution": null}, {"given_name": "Janet", "family_name": "Wiles", "institution": null}]}