{"title": "Induction of Finite-State Automata Using Second-Order Recurrent Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 309, "page_last": 317, "abstract": null, "full_text": "Induction of Finite-State Automata Using \n\nSecond-Order Recurrent Networks \n\nRaymond L. Watrous \n\nSiemens Corporate Research \n\n755 College Road East, Princeton, NJ 08540 \n\nGary M. Kuhn \n\nCenter for Communications Research, IDA \n\nThanet Road, Princeton, NJ 08540 \n\nAbstract \n\nSecond-order recurrent networks that recognize simple finite state lan(cid:173)\nguages over {0,1}* are induced from positive and negative examples. Us(cid:173)\ning the complete gradient of the recurrent network and sufficient training \nexamples to constrain the definition of the language to be induced, solu(cid:173)\ntions are obtained that correctly recognize strings of arbitrary length. A \nmethod for extracting a finite state automaton corresponding to an opti(cid:173)\nmized network is demonstrated. \n\n1 \n\nIntroduction \n\nWe address the problem of inducing languages from examples by considering a set of \nfinite state languages over {O, 1}* that were selected for study by Tomita (Tomita, \n1982): \n\nL1. 1* \nL2. (10)* \nL3. no odd-length O-string anywhere after an odd-length I-string \nL4. not more than 20's in a row \nL5. bit pairs, #01 's + #10's = 0 mod 2 \n\n309 \n\n\f310 Watrous and Kuhn \n\nL6. abs(#l's - #O's) = 0 mod 3 \n\nL 7. 0*1*0*1* \n\nTomita also selected for each language a set of positive and negative examples \n(summarized in Table 1) to be used as a training set. By a method of heuristic \nsearch over the space of finite state automata with up to eight states, he was able \nto induce a recognizer for each of these languages (Tomita, 1982). 
\nRecognizers of finite-state languages have also been induced using first-order re(cid:173)\ncurrent connectionist networks (Elman, 1990; Williams and Zipser, 1988; Cleere(cid:173)\nmans, Servan-Schreiber and McClelland, 1989). Generally speaking, these results \nwere obtained by training the network to predict the next symbol (Cleeremans, \nServan-Schreiber and McClelland, 1989; Williams and Zipser, 1988), rather than \nby training the network to accept or reject strings of different .lengths. Several \ntraining algorithms used an approximation to the gradient (Elman, 1990; Cleere(cid:173)\nmans, Servan-Schreiber and McClelland, 1989) by truncating the computation of \nthe backward recurrence. \nThe problem of inducing languages from examples has also been approached using \nsecond-order recurrent networks (Pollack, 1990; Giles et al., 1990). Using a trun(cid:173)\ncated approximation to the gradient, and Tomita's training sets, Pollack reported \nthat \"none of the ideal languages were induced\" (Pollack, 1990). On the other hand, \na Tomita language has been induced using the complete gradient (Giles et al., 1991). \nThis paper reports the induction of several Tomita languages and the extraction of \nthe corresponding automata with certain differences in method from (Giles et al., \n1991). \n\n2 Method \n\n2.1 Architecture \n\nThe network model consists of one input unit, one threshold unit, N state units and \none output unit. The output unit and each state unit receive a first order connection \nfrom the input unit and the threshold unit. In addition, each of the output and state \nunits receives a second-order connection for each pairing of the input and threshold \nunit with each of the state units. For N = 3, the model is mathematically identical \nto that used by Pollack (Pollack, 1990); it has 32 free parameters. 
\n\n2.2 Data Representation \n\nThe symbols of the language are represented by byte values, that are mapped into \nreal values between 0 and 1 by dividing by 255. Thus, the ZERO symbol is repre(cid:173)\nsented by octal 040 (0.1255). This value was chosen to be different from 0.0, which \nis used as the initial condition for all units except the threshold unit, which is set to \n1.0. The ONE symbol was chosen as octal 370 (0.97255). All strings are terminated \nby two occurrences of a termination symbol that has the value 0.0. \n\n\fInduction of Finite-State Automata Using Second-Order Recurrent Networks \n\n311 \n\nGrammatical Strings \n\nUngrammatical Strings \n\nI Longer Strmgs \nIn Training Set \n\nI Longer Strmgs \nIn Training Set \n\nLength < 10 \n\nLanguage \n\nTotal Training \n\n1 \n2 \n3 \n4 \n5 \n6 \n7 \n\n11 \n6 \n652 \n1103 \n683 \n683 \n561 \n\n9 \n5 \n11 \n10 \n9 \n10 \n11 \n\n1 \n2 \n1 \n\n2 \n\nLength ::; 10 \n\nTotal Training \n2036 \n2041 \n1395 \n944 \n1364 \n1364 \n1486 \n\n8 \n10 \n11 \n7 \n11 \n11 \n6 \n\n1 \n2 \n1 \n1 \n2 \n\nTable 1; Number of grammatical and ungrammatical strings oflength 10 or less for \nTomita languages and number of those included in the Tomita training sets. \n\n2.3 Training \n\nThe Tomita languages are characterized in Table 1 by the number of grammatical \nstrings of length 10 or less (out of a total of 2047 strings). The Tomita training \nsets are also characterized by the number of grammatical strings of length 10 or \nless included in the training data. For completeness, the Table also shows the \nnumber of grammatical strings in the training set of length greater than 10. A \ncomparison of the number of grammatical strings with the number included in the \ntraining set shows that while Languages 1 and 2 are very sparse, they are almost \ncompletely covered by the training data, whereas Languages 3-7 are more dense, and \nare sparsely covered by the training sets. 
Possible consequences of these differences are considered in discussing the experimental results. \n\nA mean-squared error measure was defined with target values of 0.9 and 0.1 for accept and reject, respectively. The error function was weighted so that error was injected only at the end of the string. \n\nThe complete gradient of this error measure for the recurrent network was computed by a method of accumulating the weight dependencies backward in time (Watrous, Ladendorf and Kuhn, 1990). This is in contrast to the truncated gradient used by Pollack (Pollack, 1990) and to the forward-propagation algorithm used by Giles (Giles et al., 1991). \n\nThe networks were optimized by gradient descent using the BFGS algorithm. A termination criterion of 10^-10 was set; it was believed that such a strict tolerance might lead to smaller loss of accuracy on very long strings. No constraints were set on the number of iterations. \n\nFive networks with different sets of random initial weights were trained separately on each of the seven languages described by Tomita, using exactly his training sets (Tomita, 1982), including the null string. The training set used by Pollack (Pollack, 1990) differs only in not including the null string. \n\n2.4 Testing \n\nThe networks were tested on the complete set of strings up to length 10. Acceptance of a string was defined as the network having a final output value of greater than 0.9 - T, and rejection as a final output value of less than 0.1 + T, where 0 ≤ T < 0.4 is the tolerance. The decision was considered ambiguous otherwise. \n\n3 Results \n\nThe results of the first experiment are summarized in Table 2. For each language, each network is listed by the seed value used to initialize the random weights. For each network, the number of iterations to termination is listed, followed by the minimum MSE value reached. 
Also listed are the percentage of strings of length 10 or less that were correctly recognized by the network, and the percentage of strings for which the decision was uncertain at a tolerance of 0.0. \n\nThe number of iterations until termination varied widely, from 28 to 37909. There is no obvious correlation between the number of iterations and the minimum MSE. \n\n3.1 Language 1 \n\nIt may be observed that Language 1 is recognized correctly by two of the networks (seeds 72 and 987235) and nearly correctly by a third (seed 239). This latter network failed on the strings 1^9 and 1^10, neither of which was in the training set. \n\nThe network of seed 72 was further tested on all strings of length 15 or less and made no errors. This network was also tested on a string of 100 ones and showed no diminution of output value over the length of the string. When tested on strings of 99 ones plus either an initial zero or a final zero, the network also made no errors. Another network, seed 987235, made no errors on strings of length 15 or less but failed on the string of 100 ones. The hidden units broke into oscillation after about the 30th input symbol, and the output fell into a low-amplitude oscillation near zero. \n\n3.2 Language 2 \n\nSimilarly, Language 2 was recognized correctly by two networks (seeds 89340 and 987235) and nearly correctly by a third network (seed 104). The latter network failed only on strings of the form (10)*010, none of which were included in the training data. \n\nThe networks that performed perfectly on strings up to length 10 were tested further on all strings up to length 15 and made no errors. These networks were also tested on a string of 100 alternations of 1 and 0, and responded correctly. Changing the first or final zero to a one caused both networks to correctly reject the string. \n\n3.3 The Other Languages \n\nFor most of the other languages, at least one network converged to a very low MSE value. 
However, networks that performed perfectly on the training set did not necessarily generalize to a correct definition of the language. For example, for Language 3, the network with seed 104 reached an MSE of 8 x 10^-10 at termination, yet its performance on the test set was only 78.31%. One interpretation of this outcome is that the intended language was not sufficiently constrained by the training set. \n\n
Language  Seed    Iterations  MSE           Accuracy  Uncertainty \n
1         72              28  0.0012500000    100.00         0.00 \n
1         104             95  0.0215882357     78.07        20.76 \n
1         239           8707  0.0005882353     99.90         0.00 \n
1         89340         5345  0.0266176471     66.93         0.00 \n
1         987235         994  0.0000000001    100.00         0.00 \n
2         72            5935  0.0005468750     93.36         4.93 \n
2         104           4081  0.0003906250     99.80         0.20 \n
2         239            807  0.0476171875     62.73        37.27 \n
2         89340         1084  0.0005468750    100.00         0.00 \n
2         987235       1(\"06  0.0001562500    100.00         0.00 \n
3         72             442  0.0149000000     47.09        33.27 \n
3         104          37909  0.0000000008     78.31         0.15 \n
3         239           9264  0.0087000000     74.60        11.87 \n
3         89340         8250  0.0005000000     73.57         0.00 \n
3         987235        5769  0.0136136712     50.76        23.94 \n
4         72            8630  0.0004375001     52.71         6.45 \n
4         104             60  0.0624326924     20.86        50.02 \n
4         239           2272  0.0005000004     55.40         9.38 \n
4         89340        10680  0.0003750001     60.92        15.53 \n
4         987235         324  0.0459375000     22.62        77.38 \n
5         72             890  0.0526912920     34.39        63.80 \n
5         104            368  0.0464772727     45.92        41.62 \n
5         239           1422  0.0487500000     31.46        36.93 \n
5         89340         2775  0.0271525856     46.12        22.52 \n
5         987235        2481  0.0209090867     66.83         2.49 \n
6         72             524  0.0788760972      0.05        99.95 \n
6         104            332  0.0789530751      0.05        99.95 \n
6         239           1355  0.0229551248     31.95        47.04 \n
6         89340         8171  0.0001733280     46.21         5.32 \n
6         987235         306  0.0577867426     37.71        24.87 \n
7         72             373  0.0588385157      9.38        86.08 \n
7         104           8578  0.0104224185     55.74        17.00 \n
7         239            969  0.0211073814     52.76        26.58 \n
7         89340         4259  0.0007684520     54.42         0.49 \n
7         987235         666  0.0688690476     12.55        74.94 \n\n
Table 2: Results of Training Three State-Unit Network from 5 Random Starts on Tomita Languages Using Tomita Training Data \n\n
In the case of Language 5, in no case was the MSE reduced below 0.02. We believe that the model is sufficiently powerful to compute the language. It is possible, however, that the power of the model is only marginally sufficient, so that finding a solution depends critically upon the initial conditions. \n\n4 Further Experiments \n\nThe effect of additional training data was investigated by creating training sets in which each string of length 10 or less is randomly included with a fixed probability p. Thus, for p = 0.1, approximately 10% of the 2047 strings are included in the training set. A flat random sampling of the lexicographic domain may not be the best approach, however, since grammaticality can vary non-uniformly. \n\nThe same networks as before were trained on the larger training set for Language 4, with the results listed in Table 3. \n\n
Seed    Iterations  MSE           Accuracy  Uncertainty \n
72             215  0.0000001022    100.00         0.00 \n
104            665  0.0000000001     99.85         0.05 \n
239            205  0.0000000001     99.90         0.10 \n
89340         5244  0.0005731708     99.32         0.10 \n
987235        2589  0.0004624581     92.13         6.55 \n\n
Table 3: Results of Training Three State-Unit Network from 5 Random Starts on Tomita Language 4 Using Probabilistic Training Data (p = 0.1) \n\n
Under these conditions, a network solution was obtained that generalizes perfectly to the test set (seed 72). This network also made no errors on strings up to length 15. However, very low MSE values were again obtained for networks that do not perform perfectly on the test data (seeds 104 and 239). Network 239 made two ambiguous decisions that would have been correct at a tolerance value of 0.23. 
Network 104 incorrectly accepted the strings 000 and 1000, and would have correctly accepted the string 0100 at a tolerance of 0.25. Neither network made any additional errors on strings up to length 15. The training data may still be slightly indeterminate. Moreover, the few errors made were on short strings that are not included in the training data. \n\nSince this network model is continuous, and thus potentially infinite-state, it is perhaps not surprising that the successful induction of a finite-state language seems to require more training data than was needed for Tomita's finite-state model (Tomita, 1982). \n\nThe effect of more complex models was investigated for Language 5 using a network with 11 state units; this increases the number of weights from 32 to 288. Networks of this type were optimized from 5 random initial conditions on the original training data. The results of this experiment are summarized in Table 4. \n\n
Seed    Iterations  MSE           Accuracy  Uncertainty \n
72            1327  0.0002840909     53.00        11.87 \n
104            680  0.0001136364     39.47        16.32 \n
239            357  0.0006818145     61.31         3.32 \n
89340          122  0.0068189264     63.36         6.64 \n
987235        4502  0.0001704545     48.41        16.95 \n\n
Table 4: Results of Training Network with 11 State-Units from 5 Random Starts on Tomita Language 5 Using Tomita Training Data \n\n
By increasing the complexity of the model, convergence to low MSE values was obtained in every case, although none of these networks generalized to the desired language. Once again, it is possible that more data is required to constrain the language sufficiently. \n\n5 FSA Extraction \n\nThe following method for extracting a deterministic finite-state automaton corresponding to an optimized network was developed: \n\n1. Record the response of the network to a set of strings. \n2. 
Compute a zero bin-width histogram for each hidden unit, and partition each histogram so that the intervals between adjacent peaks are bisected. \n\n3. Initialize a state-transition table which is indexed by the current state and input symbol; then, for each string: \n\n(a) Starting from the NULL state, for each hidden-unit activation vector: \n\ni. Obtain the next state label from the concatenation of the histogram interval number of each hidden-unit value. \nii. Record the next state in the state-transition table. If a transition is recorded from the same state on the same input symbol to two different states, move or remove hidden-unit histogram partitions so that the two states are collapsed, and go to 3; otherwise, update the current state. \n\n(b) At the end of the string, mark the current state as accept, reject or uncertain according to whether the output unit is ≥ 0.9, ≤ 0.1 or otherwise. If the current state has already received a different marking, move or insert histogram partitions so that the offending state is subdivided, and go to 3. \n\nIf the recorded strings are processed successfully, then the resulting state-transition table may be taken as an FSA interpretation of the optimized network. The FSA may then be minimized by standard methods (Giles et al., 1991). If no histogram partition can be found such that the process succeeds, the network may not have a finite-state interpretation. \n\nAs an approximation to Step 3, the hidden-unit vector was labeled by the index of that vector in an initially empty set of reference vectors for which each component value was within some global threshold (θ) of the hidden-unit value. If no such reference vector was found, the observed vector was added to the reference set. The threshold θ could be raised or lowered as states needed to be collapsed or subdivided. 
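This approximate labeling procedure can be sketched as follows. The sketch is a simplified reconstruction under stated assumptions: the trace format, the NULL-state encoding, and the conflict handling are our choices; where the paper adjusts θ and retries, the sketch simply reports failure so that the caller can change θ.

```python
import numpy as np

def state_label(vec, refs, theta):
    """Return the index of the first reference vector whose components all
    lie within theta of vec; add vec as a new reference if none matches."""
    for i, r in enumerate(refs):
        if np.all(np.abs(vec - r) <= theta):
            return i
    refs.append(vec)
    return len(refs) - 1

def extract_fsa(traces, theta):
    """traces: list of (sequence, mark) pairs, where sequence lists
    (input_symbol, hidden_activation_vector) after each symbol and mark is
    the final 'accept'/'reject' label.  Returns (transitions, markings), or
    None when the quantization is inconsistent and theta must be adjusted
    (the collapse/subdivide step of the exact procedure)."""
    refs, transitions, markings = [], {}, {}
    NULL = -1  # start state preceding any input
    for seq, mark in traces:
        state = NULL
        for symbol, vec in seq:
            nxt = state_label(np.asarray(vec, float), refs, theta)
            # Conflicting transition from the same state on the same symbol:
            if transitions.setdefault((state, symbol), nxt) != nxt:
                return None
            state = nxt
        # Same state marked both accept and reject:
        if markings.setdefault(state, mark) != mark:
            return None
    return transitions, markings
```

With a suitable θ the returned transition table is the FSA interpretation of the network; a None result corresponds to the cases where states must be collapsed (θ raised) or subdivided (θ lowered).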
\n\nUsing the approximate method, for Language 1 the correct and minimal FSA was extracted from one network (seed 72, θ = 0.1). The correct FSA was also extracted from another network (seed 987235, θ = 0.06), although for no partition of the hidden-unit activation values could the minimal FSA be extracted. Interestingly, the FSA extracted from the network with seed 239 corresponded to 1^n for n < 8. Also, the FSA for another network (seed 89340, θ = 0.0003) was nearly correct, although the string accuracy was only 67%; one state was wrongly labeled \"accept\". \n\nFor Language 2, the correct and minimal FSA was extracted from one network (seed 987235, θ = 0.00001). A correct FSA was also extracted from another network (seed 89340, θ = 0.0022), although this FSA was not minimal. \n\nFor Language 4, a histogram partition was found for one network (seed 72) that led to the correct and minimal FSA; for the zero-width histogram, the FSA was correct, but not minimal. \n\nThus, a correct FSA was extracted from every optimized network that correctly recognized strings of length 10 or less from the language for which it was trained. However, in some cases, no histogram partition was found for which the extracted FSA was minimal. It also appears that an almost-correct FSA can be extracted, which might perhaps be corrected externally. And, finally, the extracted FSA may be correct even though the network might fail on very long strings. \n\n6 Conclusions \n\nWe have succeeded in recognizing several simple finite-state languages using second-order recurrent networks and in extracting the corresponding finite-state automata. We consider the computation of the complete gradient a key element in this result. \n\nAcknowledgements \n\nWe thank Lee Giles for sharing with us their results (Giles et al., 1991). \n\nReferences \n\nCleeremans, A., Servan-Schreiber, D., and McClelland, J. (1989). 
Finite state automata and simple recurrent networks. Neural Computation, 1(3):372-381. \nElman, J. L. (1990). Finding structure in time. Cognitive Science, 14:179-211. \nGiles, C. L., Chen, D., Miller, C. B., Chen, H. H., Sun, G. Z., and Lee, Y. C. (1991). Second-order recurrent neural networks for grammatical inference. In Proceedings of the International Joint Conference on Neural Networks, volume II, pages 273-281. \nGiles, C. L., Sun, G. Z., Chen, H. H., Lee, Y. C., and Chen, D. (1990). Higher order recurrent networks and grammatical inference. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 2, pages 380-387. Morgan Kaufmann. \nPollack, J. B. (1990). The induction of dynamical recognizers. Technical Report 90-JP-AUTOMATA, Ohio State University. \nTomita, M. (1982). Dynamic construction of finite automata from examples using hill-climbing. In Proceedings of the Fourth International Cognitive Science Conference, pages 105-108. \nWatrous, R. L., Ladendorf, B., and Kuhn, G. M. (1990). Complete gradient optimization of a recurrent network applied to /b/, /d/, /g/ discrimination. Journal of the Acoustical Society of America, 87(3):1301-1309. \nWilliams, R. J. and Zipser, D. (1988). A learning algorithm for continually running fully recurrent neural networks. Technical Report ICS Report 8805, UCSD Institute for Cognitive Science. \n", "award": [], "sourceid": 560, "authors": [{"given_name": "Raymond", "family_name": "Watrous", "institution": null}, {"given_name": "Gary", "family_name": "Kuhn", "institution": null}]}