{"title": "A Unified Gradient-Descent/Clustering Architecture for Finite State Machine Induction", "book": "Advances in Neural Information Processing Systems", "page_first": 19, "page_last": 26, "abstract": null, "full_text": "A Unified Gradient-Descent/Clustering \n\nArchitecture for \n\nFinite State Machine Induction \n\nSreerupa Das and Michael C. Mozer \n\nDepartment of Computer Science \n\nUniversity of Colorado \nBoulder, CO 80309-0430 \n\nAbstract \n\nAlthough recurrent neural nets have been moderately successful \nin learning to emulate finite-state machines (FSMs), the continu(cid:173)\nous internal state dynamics of a neural net are not well matched \nto the discrete behavior of an FSM. We describe an architecture, \ncalled DOLCE, that allows discrete states to evolve in a net as learn(cid:173)\ning progresses. DOLCE consists of a standard recurrent neural net \ntrained by gradient descent and an adaptive clustering technique \nthat quantizes the state space. DOLCE is based on the assumption \nthat a finite set of discrete internal states is required for the task, \nand that the actual network state belongs to this set but has been \ncorrupted by noise due to inaccuracy in the weights. DOLCE learns \nto recover the discrete state with maximum a posteriori probabil(cid:173)\nity from the noisy state. Simulations show that DOLCE leads to a \nsignificant improvement in generalization performance over earlier \nneural net approaches to FSM induction. \n\n1 \n\nINTRODUCTION \n\nResearchers often try to understand-post hoc-representations that emerge in the \nhidden layers of a neural net following training. Interpretation is difficult because \nthese representations are typically highly distributed and continuous. 
By \"contin(cid:173)\nuous,\" we mean that if one constructed a scatterplot over the hidden unit activity \nspace of patterns obtained in response to various inputs, examination at any scale \nwould reveal the patterns to be broadly distributed over the space. \n\nContinuous representations aren't always appropriate. Many task domains seem to \nrequire discrete representations-representations selected from a finite set of alter(cid:173)\nnatives. If a neural net learned a discrete representation, the scatterplot over hidden \nactivity space would show points to be superimposed at fine scales of analysis. Some \n\n19 \n\n\f20 \n\nDas and Mozer \n\nexamples of domains in which discrete representations might be desirable include: \nfinite-state machine emulation, data compression, language and higher cognition \n(involving discrete symbol processing), and categorization in the context of decision \nmaking. In such domains, standard neural net learning procedures, which have \na propensity to produce continuous representations, may not be appropriate. The \nwork we report here involves designing an inductive bias into the learning procedure \nin order to encourage the formation of discrete internal representations. \n\nIn the recent years, various approaches have been explored for learning discrete \nrepresentations using neural networks (McMillan, Mozer, & Smolensky, 1992; Mozer \n& Bachrach, 1990; Mozer & Das, 1993; Schiitze, 1993; Towell & Shavlik, 1992). \nHowever, these approaches are domain specific, making strong assumptions about \nthe nature of the task. In our work, we describe a general methodology that makes \nno assumption about the domain to which it is applied, beyond the fact that discrete \nrepresentations are desireable. \n\n2 FINITE STATE MACHINE INDUCTION \n\nWe illustrate the methodology using the domain of finite-state machine (FSM) \ninduction. An FSM defines a class of symbol strings. 
For example, the class (10)+ consists of all strings with one or more repetitions of 10; 101010 is a positive example of the class, 111 is a negative example. An FSM consists principally of a finite set of states and a function that maps the current state and the current symbol of the string into a new state. Certain states of the FSM are designated "accept" states, meaning that if the FSM ends up in these states, the string is a member of the class. The induction problem is to infer an FSM that parsimoniously characterizes the positive and negative exemplars, and hence characterizes the underlying class.

A generic recurrent net architecture that could be used for FSM emulation and induction is shown on the left side of Figure 1. A string is presented to the input layer of the net, one symbol at a time. Following the end of the string, the net should output whether or not the string is a member of the class. The hidden unit activity pattern at any point during presentation of a string corresponds to the internal state of an FSM.

Such a net, trained by a gradient descent procedure, is able to learn to perform this or related tasks (Elman, 1990; Giles et al., 1992; Pollack, 1991; Servan-Schreiber, Cleeremans, & McClelland, 1991; Watrous & Kuhn, 1992). Although these models have been relatively successful in learning to emulate FSMs, the continuous internal state dynamics of a neural net are not well matched to the discrete behavior of FSMs. Roughly, regions of hidden unit activity space can be identified with states in an FSM, but because the activities are continuous, one often observes the network drifting from one state to another. This occurs especially with input strings longer than those on which the network was trained.

To achieve more robust dynamics, one might consider quantizing the hidden state. Two approaches to quantization have been explored previously.
In the first, a net is trained in the manner described above. After training, the hidden state space is partitioned into disjoint regions and each hidden activity pattern is then discretized by mapping it to the center of its corresponding region (Das & Das, 1991; Giles et al., 1992). In a second approach, quantization is enforced during training by mapping the hidden state at each time step to the nearest corner of a [0,1]^H hypercube (Zeng, Goodman, & Smyth, 1993).

Figure 1: On the left is a generic recurrent architecture that could be used for FSM induction. Each box corresponds to a layer of units, and arrows depict complete connectivity between layers. At each time step, a new symbol is presented on the input and the input and hidden representations are integrated to form a new hidden representation. On the right is the general architecture of DOLCE.

Each of these approaches has its limitations. In the first approach, because learning does not take the subsequent quantization into account, the hidden activity patterns that result from learning may not lie in natural clusters. Consequently, the quantization step may not group together activity patterns that correspond to the same state. In the second approach, the quantization process causes the error surface to have discontinuities and to be flat in local neighborhoods of the weight space. Hence, gradient descent learning algorithms cannot be used; instead, even more heuristic approaches are required. To overcome the limitations of these approaches, we have pursued an approach in which quantization is an integral part of the learning process.

3 DOLCE

Our approach incorporates a clustering module into the recurrent net architecture, as shown on the right side of Figure 1.
The hidden layer activities are processed by the clustering module before being passed on to other layers. The clustering module maps regions in hidden state space to a single point in the same space, effectively partitioning or clustering the hidden state space. Each cluster corresponds to a discrete internal state. The clusters are adaptive and dynamic, changing over the course of learning. We call this architecture DOLCE, for dynamic on-line clustering and state extraction.

The DOLCE architecture may be explored along two dimensions: (1) the clustering algorithm used (e.g., a Gaussian mixture model, ISODATA, the Forgy algorithm, vector quantization schemes), and (2) whether supervised or unsupervised training is used to identify the clusters. In unsupervised mode, the performance error on the FSM induction task has no effect on the operation of the clustering algorithm; instead, an internal criterion characterizes the goodness of clusters. In supervised mode, the primary measure that affects the goodness of a cluster is the performance error.

Figure 2: Two dimensions of a typical state space. The true states needed to perform the task are c1, c2, and c3, while the observed hidden states, assumed to be corrupted by noise, are distributed about the ci.

Regardless of the training mode, all clustering algorithms incorporate a pressure to produce a small number of clusters. Additionally, as we elaborate more specifically below, the algorithms must allow for a soft or continuous clustering during training, in order to be integrated into a gradient-based learning procedure.

We have explored two possibilities for the clustering module. The first involves the use of Forgy's algorithm in an unsupervised mode. Forgy's (1965) algorithm determines both the number of clusters and the partitioning of the space.
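At its core, Forgy's procedure is the familiar alternating assign/update iteration over cluster centers. The following minimal Python sketch (with our own, hypothetical function and variable names) shows that fixed-k iteration; the mechanism for adapting the number of clusters, which the unsupervised mode of DOLCE relies on, is not shown here.

```python
import numpy as np

def forgy_cluster(points, k, n_iters=50, seed=0):
    """Forgy-style clustering sketch: initialize centers as k randomly
    chosen data points, then alternate (1) assigning each point to its
    nearest center and (2) moving each center to the mean of its points.
    Note: k is fixed here; the variant described in the text also
    determines the number of clusters, which this sketch omits."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest center for every point.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        for i in range(k):
            if np.any(labels == i):
                centers[i] = points[labels == i].mean(axis=0)
    return centers, labels
```

The partitioning this produces plays the role of the discrete internal states: every hidden activity pattern falling in a region is identified with that region's center.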
The second uses a Gaussian mixture model in a supervised mode, where the mixture model parameters are adjusted so as to minimize the performance error. Both approaches were successful, but as the latter approach obtained better results, we describe it in the next section.

4 CLUSTERING USING A MIXTURE MODEL

Here we motivate the incorporation of a Gaussian mixture model into DOLCE, using an argument that gives the approach a solid theoretical foundation. Several assumptions underlie the approach. First, we assume that the task faced by DOLCE is such that it requires a finite set of internal or true states, C = {c1, c2, ..., cT}. This is simply the premise that motivates this line of work. Second, we assume that any observed hidden state (i.e., a hidden activity pattern that results from presentation of a symbol sequence) belongs to C but has been corrupted by noise due to inaccuracy in the network weights. Third, we assume that this noise is Gaussian and decreases as learning progresses (i.e., as the weights are adjusted to better perform the task). These assumptions are depicted in Figure 2.

Based on these assumptions, we construct a Gaussian mixture distribution that models the observed hidden states:

p(h | C, σ, q) = Σ_{i=1}^{T} q_i (2π σ_i²)^{−H/2} exp(−|h − c_i|² / 2σ_i²)

where h denotes an observed hidden state, σ_i² the variance of the noise that corrupts state c_i, q_i is the prior probability that the true state is c_i, and H is the dimensionality of the hidden state space. For pedagogical purposes, assume for the time being that the parameters of the mixture distribution (T, C, σ, and q) are all known; in a later section we discuss how these parameters are determined.
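With the parameters taken as known, the mixture density above is straightforward to evaluate; a minimal sketch in Python (function and variable names are our own, not from the original implementation):

```python
import numpy as np

def mixture_density(h, centers, sigma2, q):
    """Evaluate p(h | C, sigma, q): a mixture of T isotropic Gaussians
    over the H-dimensional hidden state space. centers has shape (T, H),
    sigma2[i] is the noise variance of true state i, q[i] its prior.
    (A sketch of the model in the text; names are our own.)"""
    H = h.shape[0]
    sq_dist = ((h - centers) ** 2).sum(axis=1)            # |h - c_i|^2, shape (T,)
    comp = np.exp(-sq_dist / (2 * sigma2)) / (2 * np.pi * sigma2) ** (H / 2)
    return float((q * comp).sum())
```

Each term of the sum is one Gaussian "bump" of Figure 3, weighted by its prior.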
Figure 3: A schematic depiction of the hidden state space before training (left) and after successful training (right). The horizontal plane represents the state space. The bumps indicate the probability density under the mixture model. Observed hidden states are represented by small open circles.

Given a noisy observed hidden state, h, DOLCE computes the maximum a posteriori (MAP) estimator of h in C. This estimator then replaces the noisy state and is used in all subsequent computation. The MAP estimator, ĥ, is computed as follows. The probability of an observed state h being generated by a given true state i is

p(h | true state i) = (2π σ_i²)^{−H/2} exp(−|h − c_i|² / 2σ_i²).

Using Bayes' rule, one can compute the posterior probability of true state i, given that h has been observed:

p(true state i | h) = p(h | true state i) q_i / Σ_j p(h | true state j) q_j.

Finally, the MAP estimator is given by ĥ = c_{argmax_i p(true state i | h)}. However, because learning requires that DOLCE's dynamics be differentiable, we use a soft version of MAP, which involves using h̃ = Σ_i c_i p(true state i | h) instead of ĥ and incorporating a "temperature" parameter into σ_i as described below.

An important parameter in the mixture model is T, the number of true states (Gaussian bumps). Because T directly corresponds to the number of states in the target FSM, if T is chosen too small, DOLCE could not emulate the FSM. Consequently, we set T to a large value, and the training procedure includes a technique for eliminating unnecessary true states.
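The soft MAP transformation is just a posterior-weighted average of the T centers, and so is differentiable in h and in the mixture parameters. A minimal sketch, computed in log space for numerical stability (function and variable names are our own):

```python
import numpy as np

def soft_map_state(h, centers, sigma2, q):
    """Differentiable 'soft MAP' replacement for a noisy hidden state:
    h_tilde = sum_i c_i * p(true state i | h), under the isotropic
    Gaussian noise model in the text. (A sketch; names are our own.)"""
    H = h.shape[0]
    sq_dist = ((h - centers) ** 2).sum(axis=1)
    # log of the unnormalized posterior q_i * p(h | true state i)
    log_joint = np.log(q) - sq_dist / (2 * sigma2) \
                - (H / 2) * np.log(2 * np.pi * sigma2)
    post = np.exp(log_joint - log_joint.max())
    post /= post.sum()                 # posterior p(true state i | h)
    return post @ centers              # convex combination of the centers
```

As the noise variances shrink over training, the posterior saturates onto a single component, so h̃ approaches the hard MAP center ĥ and the transformed state becomes effectively discrete.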
(If the initially selected T is not large enough, the training procedure will not converge to zero error on the training set, and the procedure can be restarted with a larger value of T.)

At the start of training, each Gaussian center, c_i, is initialized to a random location in the hidden state space. The standard deviations of each Gaussian, σ_i, are initially set to a large value. The priors, q_i, are set to 1/T. The weights are set to initial values chosen from the uniform distribution in [−.25, .25]. All connection weights feeding into the hidden layer are second order.

Figure 4: Each graph depicts generalization performance on one of the Tomita languages for five alternative neural net approaches: no clustering [NC], rigid quantization [RQ], learn then quantize [LQ], DOLCE in unsupervised mode using Forgy's algorithm [DF], and DOLCE in supervised mode using the mixture model [DG]. The vertical axis shows the number of misclassifications of 3000 test strings. Each bar is the average result across 10 replications with different initial weights.

The network weights and mixture model parameters (C, σ, and q) are adjusted by gradient descent in a cost measure, C. This cost includes two components: (a) the performance error, ℰ, which is a squared difference between the actual and target network output following presentation of a training string, and (b) a complexity cost, which is the entropy of the prior distribution, q:

C = ℰ − λ Σ_i q_i log q_i,

where λ is a regularization parameter. The complexity cost is minimal when only one Gaussian has a nonzero prior, and maximal when all priors are equal.
Hence, the cost encourages unnecessary Gaussians to drop out of the mixture model.

The particular gradient descent procedure used is a generalization of back propagation through time (Rumelhart, Hinton, & Williams, 1986) that incorporates the mixture model. To better condition the search space and to avoid a constrained search, optimization is performed not over σ and q directly but rather over hyperparameters a and b, where σ_i² = exp(a_i)/β and q_i = exp(−b_i) / Σ_j exp(−b_j). The global parameter β scales the overall spread of the Gaussians, which corresponds to the level of noise in the model. As performance on the training set improves, we assume that the network weights are coming to better approximate the target weights, and hence that the level of noise is decreasing. Thus, we tie β to the performance error ℰ. We have used various annealing schedules and DOLCE appears robust to this variation; we currently use β ∝ 1/ℰ. Note that as ℰ → 0, β → ∞ and the probability density under one Gaussian at h will become infinitely greater than the density under any other; consequently, the soft MAP estimator, h̃, becomes equivalent to the MAP estimator ĥ, and the transformed hidden state becomes discrete. A schematic depiction of the probability landscape both before and after training is given in Figure 3.

5 SIMULATION STUDIES

The network was trained on a set of regular languages first studied by Tomita (1982).
The languages, which use only the symbols 0 and 1, are: (1) 1*; (2) (10)*; (3) no odd number of consecutive 1's is directly followed by an odd number of consecutive 0's; (4) any string not containing the substring "000"; (5) [(01|10)(01|10)]*; (6) the difference between the number of ones and number of zeros in the string is a multiple of three; and (7) 0*1*0*1*.

A fixed training corpus of strings was generated for each language, with an equal number of positive and negative examples. The maximum string length varied from 5 to 10 symbols, and the total number of examples varied from 50 to 150, depending on the difficulty of the induction task.

Each string was presented one symbol at a time, after which DOLCE was given a target output that specified whether the string was a positive or negative example of the language. Training continued until DOLCE converged on a set of weights and mixture model parameters. Because we assume that the training examples are correctly classified, the error ℰ on the training set should go to zero when DOLCE has learned. If this did not happen on a given training run, we restarted the simulation with different initial random weights.

For each language, ten replications of DOLCE (with the supervised mixture model) were trained, each with different random initial weights. The learning rate and regularization parameter λ were chosen for each language by quick experimentation with the aim of maximizing the likelihood of convergence on the training set.
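To make the training setup concrete, the sketch below shows a membership test for Tomita language 4 (no substring "000") and a balanced positive/negative corpus drawn from all strings up to a maximum length. This is an illustrative reconstruction, not the authors' exact corpus-generation procedure; the function names are our own.

```python
import itertools
import random

def in_language_4(s):
    """Membership test for Tomita language 4: accept any binary
    string that does not contain the substring '000'."""
    return "000" not in s

def make_corpus(max_len=10, n_per_class=50, seed=0):
    """Draw a balanced corpus of positive and negative example strings
    (illustrative only; not the authors' exact procedure)."""
    rng = random.Random(seed)
    all_strings = ["".join(bits)
                   for n in range(1, max_len + 1)
                   for bits in itertools.product("01", repeat=n)]
    pos = [s for s in all_strings if in_language_4(s)]
    neg = [s for s in all_strings if not in_language_4(s)]
    return rng.sample(pos, n_per_class), rng.sample(neg, n_per_class)
```

Analogous checkers for the other six languages are equally short (e.g., language 6 reduces to comparing counts of ones and zeros modulo three).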
We also trained a version of DOLCE that clustered using the unsupervised Forgy algorithm, as well as several alternative neural net approaches: a generic recurrent net, as shown on the left side of Figure 1, which used no clustering [NC]; a version with rigid quantization during training [RQ], comparable to the earlier work of Zeng, Goodman, and Smyth (1993); and a version in which the unsupervised Forgy algorithm was used to quantize the hidden state following training [LQ], comparable to the earlier work of Das and Das (1991). In these alternative approaches, we used the same architecture as DOLCE except for the clustering procedure. We selected learning parameters to optimize performance on the training set, ran ten replications for each language, replaced runs that did not converge, and used the same training sets.

6 RESULTS AND CONCLUSION

In Figure 4, we compare the generalization performance of DOLCE (both the unsupervised Forgy [DF] and supervised mixture model [DG] versions) to the NC, RQ, and LQ approaches. Generalization performance was tested using 3000 strings not in the training set, half positive examples and half negative. The two versions of DOLCE outperformed the alternative neural net approaches, and the DG version of DOLCE consistently outperformed the DF version.

To summarize, we have described an approach that incorporates an inductive bias into a learning algorithm in order to encourage the evolution of discrete representations during training. The approach is quite general and can be applied to domains other than grammaticality judgment where discrete representations might be desirable. Also, the approach is not specific to recurrent networks and may be applied to feedforward networks. We are now in the process of applying DOLCE to a much larger, real-world problem that involves predicting the next symbol in a string.
The database comes from a case study in software engineering, where each symbol represents an operation in the software development process. This data is quite noisy, and it is unlikely that it can be parsimoniously described by an FSM. Nonetheless, our initial results are encouraging: DOLCE produces predictions at least three times more accurate than a standard recurrent net without clustering.

Acknowledgements

This research was supported by NSF Presidential Young Investigator award IRI-9058450 and grant 90-21 from the James S. McDonnell Foundation.

References

S. Das & R. Das. (1991) Induction of discrete state-machine by stabilizing a continuous recurrent network using clustering. Computer Science and Informatics 21(2):35-40. Special Issue on Neural Computing.

J.L. Elman. (1990) Finding structure in time. Cognitive Science 14:179-212.

E. Forgy. (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21:768-780.

M.C. Mozer & J.D. Bachrach. (1990) Discovering the structure of a reactive environment by exploration. Neural Computation 2(4):447-457.

C. McMillan, M.C. Mozer, & P. Smolensky. (1992) Rule induction through integrated symbolic and subsymbolic processing. In J.E. Moody, S.J. Hanson, & R.P. Lippmann (eds.), Advances in Neural Information Processing Systems 4, 969-976. San Mateo, CA: Morgan Kaufmann.

C.L. Giles, D. Chen, C.B. Miller, H.H. Chen, G.Z. Sun, & Y.C. Lee. (1992) Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation 4(3):393-405.

H. Schütze. (1993) Word space. In S.J. Hanson, J.D. Cowan, & C.L. Giles (eds.), Advances in Neural Information Processing Systems 5, 895-902. San Mateo, CA: Morgan Kaufmann.

M. Tomita. (1982) Dynamic construction of finite-state automata from examples using hill-climbing.
Proceedings of the Fourth Annual Conference of the Cognitive Science Society, 105-108.

G. Towell & J. Shavlik. (1992) Interpretation of artificial neural networks: mapping knowledge-based neural networks into rules. In J.E. Moody, S.J. Hanson, & R.P. Lippmann (eds.), Advances in Neural Information Processing Systems 4, 977-984. San Mateo, CA: Morgan Kaufmann.

R.L. Watrous & G.M. Kuhn. (1992) Induction of finite state languages using second-order recurrent networks. In J.E. Moody, S.J. Hanson, & R.P. Lippmann (eds.), Advances in Neural Information Processing Systems 4, 969-976. San Mateo, CA: Morgan Kaufmann.

Z. Zeng, R. Goodman, & P. Smyth. (1993) Learning finite state machines with self-clustering recurrent networks. Neural Computation 5(6):976-990.
", "award": [], "sourceid": 846, "authors": [{"given_name": "Sreerupa", "family_name": "Das", "institution": null}, {"given_name": "Michael", "family_name": "Mozer", "institution": null}]}