LEARNING UNAMBIGUOUS REDUCED SEQUENCE DESCRIPTIONS

Jürgen Schmidhuber
Dept. of Computer Science
University of Colorado
Campus Box 430
Boulder, CO 80309, USA
yirgan@cs.colorado.edu

In: Advances in Neural Information Processing Systems, pp. 291-298.

Abstract

Do you want your neural net algorithm to learn sequences? Do not limit yourself to conventional gradient descent (or approximations thereof). Instead, use your sequence learning algorithm (any will do) to implement the following method for history compression. No matter what your final goals are, train a network to predict its next input from the previous ones. Since only unpredictable inputs convey new information, ignore all predictable inputs but let all unexpected inputs (plus information about the time step at which they occurred) become inputs to a higher-level network of the same kind (working on a slower, self-adjusting time scale). Go on building a hierarchy of such networks. This principle reduces the descriptions of event sequences without loss of information, thus easing supervised or reinforcement learning tasks. Alternatively, you may use two recurrent networks to collapse a multi-level predictor hierarchy into a single recurrent net. Experiments show that systems based on these principles can require less computation per time step and many fewer training sequences than conventional training algorithms for recurrent nets. Finally, you can modify the above method such that predictability is not defined in a yes-or-no fashion but in a continuous fashion.

1 INTRODUCTION

The following methods for supervised sequence learning have been proposed: simple recurrent nets [7][3], time-delay nets (e.g.
[2]), sequential recursive auto-associative memories [16], back-propagation through time or BPTT [21][30][33], Mozer's 'focused back-prop' algorithm [10], the IID- or RTRL-algorithm [19][1][34], its accelerated versions [32][35][25], the recent fast-weight algorithm [27], higher-order networks [5], as well as continuous time methods equivalent to some of the above [14][15][4]. The following methods for sequence learning by reinforcement learning have been proposed: extended REINFORCE algorithms [31], the neural bucket brigade algorithm [22], recurrent networks adjusted by adaptive critics [23] (see also [8]), buffer-based systems [13], and networks of hierarchically organized neuron-like "bions" [18].

With the exception of [18] and [13], these approaches waste resources and limit efficiency by focusing on every input instead of focusing only on relevant inputs. Many of these methods have a second drawback as well: the longer the time lag between an event and the occurrence of a related error, the less information is carried by the corresponding error information wandering 'back into time' (see [6] for a more detailed analysis). [11], [12] and [20] have addressed the latter problem but not the former. The system described by [18], on the other hand, addresses both problems, but in a manner much different from that presented here.

2 HISTORY COMPRESSION

A major contribution of this work is an adaptive method for removing redundant information from sequences. This principle can be implemented with the help of any of the methods mentioned in the introduction.

Consider a deterministic discrete time predictor (not necessarily a neural network) whose state at time t of sequence p is described by an environmental input vector x^p(t), an internal state vector h^p(t), and an output vector z^p(t). The environment may be non-deterministic.
At time 0, the predictor starts with x^p(0) and an internal start state h^p(0). At time t ≥ 0, the predictor computes

    z^p(t) = f(x^p(t), h^p(t)).

At time t > 0, the predictor furthermore computes

    h^p(t) = g(x^p(t-1), h^p(t-1)).

All information about the input at a given time t_x can be reconstructed from t_x, f, g, x^p(0), h^p(0), and the pairs (t_s, x^p(t_s)) for which 0 < t_s ≤ t_x and z^p(t_s - 1) ≠ x^p(t_s). This is because if z^p(t) = x^p(t+1) at a given time t, then the predictor is able to predict the next input from the previous ones. The new input is derivable by means of f and g.

Information about the observed input sequence can be even further compressed beyond just the unpredicted input vectors x^p(t_s). It suffices to know only those elements of the vectors x^p(t_s) that were not correctly predicted.

This observation implies that we can discriminate one sequence from another by knowing just the unpredicted inputs and the corresponding time steps at which they occurred. No information is lost if we ignore the expected inputs. We do not even have to know f and g. I call this the principle of history compression.

From a theoretical point of view it is important to know at what time an unexpected input occurs; otherwise there will be a potential for ambiguities: two different input sequences may lead to the same shorter sequence of unpredicted inputs. With many practical tasks, however, there is no need for knowing the critical time steps (see section 5).

3 SELF-ORGANIZING PREDICTOR HIERARCHY

Using the principle of history compression we can build a self-organizing hierarchical neural 'chunking' system¹. The basic task can be formulated as a prediction task. At a given time step the goal is to predict the next input from previous inputs.
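As a minimal, hedged illustration of the reconstruction argument of section 2 (a frozen table-based predictor stands in for the pair (f, g); all names and the toy predictor are illustrative assumptions, not the paper's neural implementation):

```python
class TablePredictor:
    """Toy stand-in for (f, g): predicts the next symbol from the current one."""
    def __init__(self):
        self.table = {}

    def train(self, seq):
        for cur, nxt in zip(seq, seq[1:]):
            self.table[cur] = nxt

    def predict(self, x):
        return self.table.get(x)

def compress(seq, predictor):
    """Keep the first input plus every (time step, input) the frozen predictor got wrong."""
    kept = [(0, seq[0])]
    for t in range(1, len(seq)):
        if predictor.predict(seq[t - 1]) != seq[t]:
            kept.append((t, seq[t]))
    return kept

def reconstruct(kept, length, predictor):
    """The expected inputs are redundant: the predictor fills them back in."""
    stored = dict(kept)
    seq = [stored[0]]
    for t in range(1, length):
        seq.append(stored.get(t, predictor.predict(seq[-1])))
    return seq

seq = list("ababcabab")
p = TablePredictor()
p.train(seq)
kept = compress(seq, p)                       # only the unpredicted inputs survive
assert reconstruct(kept, len(seq), p) == seq  # no information was lost
```

Only the unpredicted pairs need to be stored together with their time steps; the expected inputs are recovered by re-running the predictor, exactly as in the argument above.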
If there are external target vectors at certain time steps, then they are simply treated as another part of the input to be predicted.

The architecture is a hierarchy of predictors; the input to each level of the hierarchy comes from the previous level. P_i denotes the i-th level network, which is trained to predict its own next input from its previous inputs². We take P_i to be one of the conventional dynamic recurrent neural networks mentioned in the introduction; however, it might be some other adaptive sequence processing device as well³.

At each time step the input of the lowest-level recurrent predictor P_0 is the current external input. We create a new higher-level adaptive predictor P_{s+1} whenever the adaptive predictor at the previous level, P_s, stops improving its predictions. When this happens, the weight-changing mechanism of P_s is switched off (to exclude potential instabilities caused by ongoing modifications of the lower-level predictors). If at a given time step P_s (s > 0) fails to predict its next input (or if we are at the beginning of a training sequence, which usually is not predictable either), then P_{s+1} will receive as input the concatenation of this next input of P_s plus a unique representation of the corresponding time step⁴; the activations of P_{s+1}'s hidden and output units will be updated. Otherwise P_{s+1} will not perform an activation update. This procedure ensures that P_{s+1} is fed with an unambiguous reduced description⁵ of the input sequence observed by P_s. This is theoretically justified by the principle of history compression.

In general, P_{s+1} will receive fewer inputs over time than P_s. With existing learning algorithms, the higher-level predictor should have less difficulty in learning to predict the critical inputs than the lower-level predictor. This is because P_{s+1}'s 'credit assignment paths' will often be short compared to those of P_s. This will happen if the incoming inputs carry global temporal structure which has not yet been discovered by P_s. (See also [18] for a related approach to the problem of credit assignment in reinforcement learning.)

This method is a simplification and an improvement of the recent chunking method described by [24].

A multi-level predictor hierarchy is a rather safe way of learning to deal with sequences with multi-level temporal structure (e.g. speech). Experiments have shown that multi-level predictors can quickly learn tasks which are practically unlearnable by conventional recurrent networks, e.g. [6].

¹ See also [18] for a different hierarchical connectionist chunking system based on similar principles.
² Recently I became aware that Don Mathis had some related ideas (personal communication). A hierarchical approach to sequence generation was pursued by [9].
³ For instance, we might employ the more limited feed-forward networks and a 'time window' approach. In this case, the number of previous inputs to be considered as a basis for the next prediction will remain fixed.
⁴ A unique time representation is theoretically necessary to provide P_{s+1} with unambiguous information about when the failure occurred (see also the last paragraph of section 2). A unique representation of the time that went by since the last unpredicted input occurred will do as well.
⁵ In contrast, the reduced descriptions referred to by [11] are not unambiguous.

4 COLLAPSING THE HIERARCHY

One disadvantage of a predictor hierarchy as above is that it is not known in advance how many levels will be needed. Another disadvantage is that levels are explicitly separated from each other. It may be possible, however, to collapse the hierarchy into a single network, as outlined in this section. See details in [26].
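Before turning to the collapsed version, the incremental scheme of section 3 can be sketched with a toy table-based predictor in place of the recurrent networks (the predictor, the fixed level count, and the level-by-level training schedule are illustrative assumptions; the paper creates new levels adaptively, whenever the level below stops improving):

```python
class OnlinePredictor:
    """Hypothetical one-step predictor: remembers the last successor of each context."""
    def __init__(self):
        self.table = {}

    def predict(self, context):
        return self.table.get(context)

    def update(self, context, nxt):
        self.table[context] = nxt

def build_hierarchy(seq, n_levels=2):
    """Each level is fed only the (time step, input) pairs its lower neighbour
    failed to predict; the sequence start counts as unpredictable."""
    stream = list(enumerate(seq))
    levels = []
    for _ in range(n_levels):
        P = OnlinePredictor()
        reduced, context = [], None
        for t, x in stream:
            if P.predict(context) != x:   # surprise: promote to the next level
                reduced.append((t, x))
            P.update(context, x)          # online learning at this level
            context = x
        levels.append(P)
        stream = reduced                  # higher level runs on a slower time scale
    return levels, stream

levels, reduced = build_hierarchy(list("abababab"))
assert len(reduced) < 8   # the top level sees a reduced description
```

Here the time step is simply carried alongside the promoted input; in the paper it is concatenated into the higher-level input vector as a unique time representation.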
We need two conventional recurrent networks: the automatizer A and the chunker C, which correspond to a distinction between automatic and attended events. (See also [13] and [17], which describe a similar distinction in the context of reinforcement learning.) At each time step A receives the current external input. A's error function is threefold: one term forces it to emit certain desired target outputs at certain times. If there is a target, then it becomes part of the next input. The second term forces A at every time step to predict its own next non-target input. The third (crucial) term will be explained below.

If and only if A makes an error concerning the first and second term of its error function, the unpredicted input (including a potentially available teaching vector) along with a unique representation of the current time step will become the new input to C. Before this new input can be processed, C (whose last input may have occurred many time steps earlier) is trained to predict this higher-level input from its current internal state and its last input (employing a conventional recurrent net algorithm). After this, C performs an activation update which contributes to a higher-level internal representation of the input history. Note that according to the principle of history compression C is fed with an unambiguous reduced description of the input history. The information deducible by means of A's predictions can be considered as redundant. (The beginning of an episode usually is not predictable, therefore it has to be fed to the chunking level, too.)

Since C's 'credit assignment paths' will often be short compared to those of A, C will often be able to develop useful internal representations of previous unexpected input events. Due to the final term of its error function, A will be forced to reproduce these internal representations, by predicting C's state. Therefore A will be able to create useful internal representations by itself in an early stage of processing a given sequence; it will often receive meaningful error signals long before errors of the first or second kind occur. These internal representations in turn must carry the discriminating information for enabling A to improve its low-level predictions. Therefore the chunker will receive fewer and fewer inputs, since more and more inputs become predictable by the automatizer. This is the collapsing operation. Ideally, the chunker will become obsolete after some time.

It must be emphasized that unlike with the incremental creation of a multi-level predictor hierarchy described in section 3, there is no formal proof that the 2-net on-line version is free of instabilities. One can imagine situations where A unlearns previously learned predictions because of the third term of its error function. Relative weighting of the different terms in A's error function represents an ad-hoc remedy for this potential problem. In the experiments below, relative weighting was not necessary.

5 EXPERIMENTS

One experiment with a multi-level chunking architecture involved a grammar which produced strings of many a's and b's such that there was local temporal structure within the training strings (see [6] for details). The task was to differentiate between strings with long overlapping suffixes. The conventional algorithm completely failed to solve the task; it became confused by the great number of input sequences with similar endings. Not so the chunking system: it soon discovered certain hierarchical temporal structures in the input sequences and decomposed the problem such that it was able to solve it within a few hundred-thousand training sequences.
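The automatizer's threefold error function (section 4), including the ad-hoc relative weighting, might be sketched as follows; the squared-error form, the vector arguments, and the weight parameters are assumptions for illustration, since the paper leaves the error measure open:

```python
def sq_err(u, v):
    """Summed squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def automatizer_error(target_out, target, pred_out, next_input,
                      state_out, chunker_state, w1=1.0, w2=1.0, w3=1.0):
    """Threefold error of A: (1) emit desired targets, (2) predict the next
    non-target input, (3) reproduce (predict) the chunker's internal state."""
    return (w1 * sq_err(target_out, target)
            + w2 * sq_err(pred_out, next_input)
            + w3 * sq_err(state_out, chunker_state))
```

With equal weights, only the terms on which A actually errs contribute; raising or lowering w3 implements the relative weighting mentioned as a remedy against instabilities.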
The 2-net chunking system (the one with the potential for collapsing levels) was also tested against the conventional recurrent net algorithms. (See details in [26].) With the conventional algorithms, with various learning rates, and with more than 1,000,000 training sequences, performance did not improve in prediction tasks involving even as few as 20 time steps between relevant events.

But the 2-net chunking system was able to solve the task rather quickly. An efficient approximation of the BPTT method was applied to both the chunker and the automatizer: only 3 iterations of error propagation 'back into the past' were performed at each time step. Most of the test runs required fewer than 5000 training sequences. Still, the final weight matrix of the automatizer often resembled what one would hope to get from the conventional algorithm: there were hidden units which learned to bridge the 20-step time lags by means of strong self-connections. The chunking system needed less computation per time step than the conventional method and required many fewer training sequences.

6 CONTINUOUS HISTORY COMPRESSION

The history compression technique formulated above defines expectation-mismatches in a yes-or-no fashion: each input unit whose activation is not predictable at a certain time gives rise to an unexpected event. Each unexpected event provokes an update of the internal state of a higher-level predictor. The updates always take place according to the conventional activation spreading rules for recurrent neural nets. There is no concept of a partial mismatch or of a 'near-miss'. There is no possibility of updating the higher-level net 'just a little bit' in response to a 'nearly expected input'. In practical applications, some 'epsilon' has to be used to define an acceptable mismatch.
In reply to the above criticism, continuous history compression is based on the following ideas. In what follows, v_i(t) denotes the i-th component of vector v(t).

We use a local input representation. The components of z^p(t) are forced to sum up to 1 and are interpreted as a prediction of the probability distribution of the possible x^p(t+1): z_j^p(t) is interpreted as the prediction of the probability that x_j^p(t+1) is 1.

The output entropy

    - Σ_j z_j^p(t) log z_j^p(t)

can be interpreted as a measure of the predictor's confidence. In the worst case, the predictor will expect every possible event with equal probability.

How much information (relative to the current predictor) is conveyed by the event x_j^p(t+1) = 1, once it is observed? According to [29] it is

    - log z_j^p(t).

[28] defines update procedures based on Mozer's recent update function [12] that let highly informative events have a stronger influence on the history representation than less informative (more likely) events. The 'strength' of an update in response to a more or less unexpected event is a monotonically increasing function of the information the event conveys. One of the update procedures uses Pollack's recursive auto-associative memories [16] for storing unexpected events, thus yielding an entirely local learning algorithm for learning extended sequences.

7 ACKNOWLEDGEMENTS

Thanks to Josef Hochreiter for conducting the experiments. Thanks to Mike Mozer and Mark Ring for useful comments on an earlier draft of this paper. This research was supported in part by NSF PYI award IRI-9058450, grant 90-21 from the James S. McDonnell Foundation, and DEC external research grant 1250 to Michael C. Mozer.

References

[1] J. Bachrach. Learning to represent state. Unpublished master's thesis, University of Massachusetts, Amherst, 1988.

[2] U. Bodenhausen and A. Waibel.
The tempo 2 algorithm: Adjusting time-delays by supervised learning. In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 155-161. San Mateo, CA: Morgan Kaufmann, 1991.

[3] J. L. Elman. Finding structure in time. Technical Report CRL 8801, Center for Research in Language, University of California, San Diego, 1988.

[4] M. Gherrity. A learning algorithm for analog fully recurrent neural networks. In IEEE/INNS International Joint Conference on Neural Networks, San Diego, volume 1, pages 643-644, 1989.

[5] C. L. Giles and C. B. Miller. Learning and extracting finite state automata. Accepted for publication in Neural Computation, 1992.

[6] J. Hochreiter. Diploma thesis, Institut für Informatik, Technische Universität München, 1991.

[7] M. I. Jordan. Serial order: A parallel distributed processing approach. Technical Report ICS 8604, Institute for Cognitive Science, University of California, San Diego, 1986.

[8] G. Lukes. Review of Schmidhuber's paper 'Recurrent networks adjusted by adaptive critics'. Neural Network Reviews, 4(1):41-42, 1990.

[9] Y. Miyata. An unsupervised PDP learning model for action planning. In Proc. of the Tenth Annual Conference of the Cognitive Science Society, Hillsdale, NJ, pages 223-229. Erlbaum, 1988.

[10] M. C. Mozer. A focused back-propagation algorithm for temporal sequence recognition. Complex Systems, 3:349-381, 1989.

[11] M. C. Mozer. Connectionist music composition based on melodic, stylistic, and psychophysical constraints. Technical Report CU-CS-495-90, University of Colorado at Boulder, 1990.

[12] M. C. Mozer. Induction of multiscale temporal structure. In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 4, to appear. San Mateo, CA: Morgan Kaufmann, 1992.

[13] C. Myers. Learning with delayed reinforcement through attention-driven buffering. Technical report, Imperial College of Science, Technology and Medicine, 1990.

[14] B. A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1:263-269, 1989.

[15] F. J. Pineda. Time dependent adaptive neural networks. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 710-718. San Mateo, CA: Morgan Kaufmann, 1990.

[16] J. B. Pollack. Recursive distributed representation. Artificial Intelligence, 46:77-105, 1990.

[17] M. A. Ring. PhD proposal: Autonomous construction of sensorimotor hierarchies in neural networks. Technical report, Univ. of Texas at Austin, 1990.

[18] M. A. Ring. Incremental development of complex behaviors through automatic construction of sensory-motor hierarchies. In L. Birnbaum and G. Collins, editors, Machine Learning: Proceedings of the Eighth International Workshop, pages 343-347. Morgan Kaufmann, 1991.

[19] A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.

[20] R. Rohwer. The 'moving targets' training method. In J. Kindermann and A. Linden, editors, Proceedings of 'Distributed Adaptive Neural Information Processing', St. Augustin, 24.-25.5. Oldenbourg, 1989.

[21] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, pages 318-362. MIT Press, 1986.

[22] J. H. Schmidhuber. A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4):403-412, 1989.

[23] J. H. Schmidhuber. Recurrent networks adjusted by adaptive critics. In Proc. IEEE/INNS International Joint Conference on Neural Networks, Washington, D. C., volume 1, pages 719-722, 1990.

[24] J. H. Schmidhuber. Adaptive decomposition of time. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 909-914. Elsevier Science Publishers B.V., North-Holland, 1991.

[25] J. H. Schmidhuber. A fixed size storage O(n^3) time complexity learning algorithm for fully recurrent continually running networks. Accepted for publication in Neural Computation, 1992.

[26] J. H. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Accepted for publication in Neural Computation, 1992.

[27] J. H. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Accepted for publication in Neural Computation, 1992.

[28] J. H. Schmidhuber, M. C. Mozer, and D. Prelinger. Continuous history compression. Technical report, Dept. of Comp. Sci., University of Colorado at Boulder, 1992.

[29] C. E. Shannon. A mathematical theory of communication (parts I and II). Bell System Technical Journal, XXVII:379-423, 1948.

[30] P. J. Werbos. Generalization of back propagation with application to a recurrent gas market model. Neural Networks, 1, 1988.

[31] R. J. Williams. Toward a theory of reinforcement-learning connectionist systems. Technical Report NU-CCS-88-3, College of Comp. Sci., Northeastern University, Boston, MA, 1988.

[32] R. J. Williams. Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, College of Computer Science, Northeastern University, Boston, MA, 1989.

[33] R. J. Williams and J. Peng. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 4:491-501, 1990.

[34] R. J. Williams and D. Zipser. Experimental analysis of the real-time recurrent learning algorithm. Connection Science, 1(1):87-111, 1989.

[35] R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1992, in press.