{"title": "Hierarchical Recurrent Neural Networks for Long-Term Dependencies", "book": "Advances in Neural Information Processing Systems", "page_first": 493, "page_last": 499, "abstract": null, "full_text": "Hierarchical Recurrent Neural Networks for \n\nLong-Term Dependencies \n\nSalah El Hihi \n\nDept. Informatique et \n\nRecherche Operationnelle \nUniversite de Montreal \nMontreal, Qc H3C-3J7 \n\nelhihiGiro.umontreal.ca \n\nYoshua Bengio\u00b7 \nDept. Informatique et \n\nRecherche Operationnelle \nUniversite de Montreal \nMontreal, Qc H3C-3J7 \n\nbengioyGiro.umontreal.ca \n\nAbstract \n\nWe have already shown that extracting long-term dependencies from se(cid:173)\nquential data is difficult, both for determimstic dynamical systems such \nas recurrent networks, and probabilistic models such as hidden Markov \nmodels (HMMs) or input/output hidden Markov models (IOHMMs). In \npractice, to avoid this problem, researchers have used domain specific \na-priori knowledge to give meaning to the hidden or state variables rep(cid:173)\nresenting past context. In this paper, we propose to use a more general \ntype of a-priori knowledge, namely that the temporal dependencIes are \nstructured hierarchically. This implies that long-term dependencies are \nrepresented by variables with a long time scale. This principle is applied \nto a recurrent network which includes delays and multiple time scales. Ex(cid:173)\nperiments confirm the advantages of such structures. A similar approach \nis proposed for HMMs and IOHMMs. \n\n1 \n\nIntroduction \n\nLearning from examples basically amounts to identifying the relations between random \nvariables of interest. Several learning problems involve sequential data, in which the vari(cid:173)\nables are ordered (e.g., time series). 
Many learning algorithms take advantage of this sequential structure by assuming some kind of homogeneity or continuity of the model over time, e.g., by sharing parameters for different times, as in Time-Delay Neural Networks (TDNNs) (Lang, Waibel and Hinton, 1990), recurrent neural networks (Rumelhart, Hinton and Williams, 1986), or hidden Markov models (Rabiner and Juang, 1986). This general a-priori assumption considerably simplifies the learning problem.\n\nIn previous papers (Bengio, Simard and Frasconi, 1994; Bengio and Frasconi, 1995a), we have shown for recurrent networks and Markovian models that, even with this assumption, dependencies that span longer intervals are significantly harder to learn. In all of the systems we have considered for learning from sequential data, some form of representation of context (or state) is required (to summarize all \"useful\" past information). The \"hard learning\" problem is to learn to represent context, which involves performing the proper credit assignment through time. Indeed, in practice, recurrent networks (e.g., injecting prior knowledge for grammar inference (Giles and Omlin, 1992; Frasconi et al., 1993)) and HMMs (e.g., for speech recognition (Levinson, Rabiner and Sondhi, 1983; Rabiner and Juang, 1986)) work quite well when the representation of context (the meaning of the state variable) is decided a-priori. The hidden variable is then no longer completely hidden, and learning becomes much easier. Unfortunately, this requires a very precise knowledge of the appropriate state variables, which is not available in many applications.\n\n* also, AT&T Bell Labs, Holmdel, NJ 07733\n\nWe have seen that the successes of TDNNs, recurrent networks and HMMs are based on a general assumption on the sequential nature of the data. 
In this paper, we propose another simple a-priori assumption on the sequences to be analyzed: the temporal dependencies have a hierarchical structure. This implies that dependencies spanning long intervals are \"robust\" to small local changes in the timing of events, whereas dependencies spanning short intervals are allowed to be more sensitive to the precise timing of events. This yields a multi-resolution representation of state information. This general idea is not new and can be found in various approaches to learning and artificial intelligence. For example, in convolutional neural networks, both for sequential data with TDNNs (Lang, Waibel and Hinton, 1990) and for 2-dimensional data with MLCNNs (LeCun et al., 1989; Bengio, LeCun and Henderson, 1994), the network is organized in layers representing features of increasing temporal or spatial coarseness. Similarly, mostly as a tool for analyzing and preprocessing sequential or spatial data, wavelet transforms (Daubechies, 1990) also represent such information at multiple resolutions. Multi-scale representations have also been proposed to improve reinforcement learning systems (Singh, 1992; Dayan and Hinton, 1993; Sutton, 1995) and path planning systems. However, with these algorithms, one generally assumes that the state of the system is observed, whereas in this paper we concentrate on the difficulty of learning what the state variable should represent. A related idea using a hierarchical structure was presented in (Schmidhuber, 1992).\n\nOn the HMM side, several researchers (Brugnara et al., 1992; Suaudeau, 1994) have attempted to improve HMMs for speech recognition to better model the different types of variables intrinsically varying at different time scales in speech. In those papers, the focus was on setting an a-priori representation, not on learning how to represent context. 
\nIn section 2, we attempt to draw a common conclusion from the analyses performed on recurrent networks and HMMs to learn to represent long-term dependencies. This will justify the proposed approach, presented in section 3. In section 4 a specific hierarchical model is proposed for recurrent networks, using different time scales for different layers of the network. Experiments performed with this model are also described in section 4. Finally, we discuss a similar scheme for HMMs and IOHMMs in section 5.\n\n2 Too Many Products\n\nIn this section, we take another look at the analyses of (Bengio, Simard and Frasconi, 1994) and (Bengio and Frasconi, 1995a), for recurrent networks and HMMs respectively. The objective is to draw a parallel between the problems encountered with the two approaches, in order to guide us towards some form of solution, and to justify the proposals made here. First, let us consider deterministic dynamical systems (Bengio, Simard and Frasconi, 1994) (such as recurrent networks), which map an input sequence u_1, ..., u_T to an output sequence y_1, ..., y_T. The state or context information is represented at each time t by a variable x_t, for example the activities of all the hidden units of a recurrent network:\n\nx_t = f(x_{t-1}, u_t)    (1)\n\nwhere u_t is the system input at time t and f is a differentiable function (such as tanh(W x_{t-1} + u_t)). When the sequence of inputs u_1, u_2, ..., u_T is given, we can write x_t = f_t(x_{t-1}) = f_t(f_{t-1}(... f_1(x_0)) ...). A learning criterion C_t yields gradients on outputs, and therefore on the state variables x_t. 
Since parameters are shared across time, learning using a gradient-based algorithm depends on the influence of the parameters W on C_t through all time steps before t:\n\n∂C_t/∂W = Σ_τ (∂C_t/∂x_t) (∂x_t/∂x_τ) (∂x_τ/∂W)    (2)\n\nThe Jacobian matrix of derivatives ∂x_t/∂x_τ can further be factored as follows:\n\n∂x_t/∂x_τ = (∂x_t/∂x_{t-1}) (∂x_{t-1}/∂x_{t-2}) ... (∂x_{τ+1}/∂x_τ) = f'_t f'_{t-1} ... f'_{τ+1}    (3)\n\nOur earlier analysis (Bengio, Simard and Frasconi, 1994) shows that the difficulty revolves around the matrix product in equation 3. In order to reliably \"store\" information in the dynamics of the network, the state variable x_t must remain in regions where |f'_t| < 1 (i.e., near enough to a stable attractor representing the stored information). However, the above products then rapidly converge to 0 when t - τ increases. Consequently, the sum in 2 is dominated by terms corresponding to short-term dependencies (t - τ is small).\n\nLet us now consider the case of Markovian models (including HMMs and IOHMMs (Bengio and Frasconi, 1995b)). These are probabilistic models, either of an \"output\" sequence P(y_1 ... y_T) (HMMs) or of an output sequence given an input sequence P(y_1 ... y_T | u_1 ... u_T) (IOHMMs). Introducing a discrete state variable z_t and using Markovian assumptions of independence, this probability can be factored in terms of transition probabilities P(z_t | z_{t-1}) (or P(z_t | z_{t-1}, u_t)) and output probabilities P(y_t | z_t) (or P(y_t | z_t, u_t)). According to the model, the distribution of the state z_t at time t given the state z_τ at an earlier time τ is given by the matrix\n\nP(z_t | z_τ) = P(z_t | z_{t-1}) P(z_{t-1} | z_{t-2}) ... P(z_{τ+1} | z_τ)    (4)\n\nwhere each of the factors is a matrix of transition probabilities (conditioned on inputs in the case of IOHMMs). 
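Both shrinking products can be illustrated numerically. The following sketch is ours, not part of the paper's experiments; the dimensions, seeds, and the particular 2-state transition matrix are arbitrary assumptions. It shows the product of Jacobians of equation 3 decaying for a contractive tanh recurrence, and the product of transition matrices of equation 4 converging to a rank-1 matrix.

```python
import numpy as np

# Recurrent-net side (eq. 3): for x_t = tanh(W x_{t-1} + u_t), the Jacobian
# dx_t/dx_{t-1} = diag(1 - x_t^2) W. With a contractive W, the accumulated
# product shrinks rapidly as t - tau grows.
rng = np.random.default_rng(0)
n = 10
W = rng.standard_normal((n, n))
W *= 0.9 / np.linalg.norm(W, 2)      # assumed: spectral norm 0.9 (contractive)
x = np.zeros(n)
J = np.eye(n)
for t in range(50):
    x = np.tanh(W @ x + 0.1 * rng.standard_normal(n))
    J = np.diag(1.0 - x**2) @ W @ J  # accumulate the product in eq. (3)
print(np.linalg.norm(J, 2))          # tiny: short-term terms dominate eq. (2)

# Markovian side (eq. 4): powers of a non-deterministic row-stochastic matrix
# converge to a rank-1 matrix, so P(z_t | z_tau) forgets z_tau.
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])           # assumed 2-state transition probabilities
P = np.linalg.matrix_power(A, 50)    # the product in eq. (4) over 50 steps
print(P)                             # both rows nearly identical
```

After 50 steps the Jacobian product is bounded by 0.9^50 ≈ 5e-3, and the two rows of the matrix power agree to many decimal places: the state distribution no longer depends on the starting state.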
Our earlier analysis (Bengio and Frasconi, 1995a) shows that the difficulty in representing and learning to represent context (i.e., learning what z_t should represent) revolves around equation 4. The matrices in the above equations have one eigenvalue equal to 1 (because of the normalization constraint) and the others ≤ 1. In the case in which all eigenvalues are 1, the matrices contain only 1's and 0's, i.e., we obtain deterministic dynamics for IOHMMs or pure cycles for HMMs (which cannot be used to model most interesting sequences). Otherwise the above product converges to a lower-rank matrix (some or most of the eigenvalues converge toward 0). Consequently, P(z_t | z_τ) becomes more and more independent of z_τ as t - τ increases. Therefore, both representing and learning context become more difficult as the span of dependencies increases or when the Markov model is more non-deterministic (transition probabilities not close to 0 or 1).\n\nClearly, a common trait of both analyses lies in taking too many products, too many time steps, or too many transformations to relate the state variable at time τ with the state variable at time t > τ, as in equations 3 and 4. Therefore the idea presented in the next section is centered on allowing several paths between z_τ and z_t, some with few \"transformations\" and some with many. At least through those with few transformations, we expect context information (forward) and credit assignment (backward) to propagate more easily over longer time spans than through \"paths\" involving many transformations.\n\n3 Hierarchical Sequential Models\n\nInspired by the above analysis, we introduce an assumption about the sequential data to be modeled, although it will be a very simple and general a-priori on the structure of the data. 
Basically, we will assume that the sequential structure of the data can be described hierarchically: long-term dependencies (e.g., between two events remote from each other in time) do not depend on a precise time scale (i.e., on the precise timing of these events). Consequently, in order to represent a context variable taking these long-term dependencies into account, we will be able to use a coarse time scale (or a slowly changing state variable). Therefore, instead of a single homogeneous state variable, we will introduce several levels of state variables, each \"working\" at a different time scale. To implement such a multi-resolution representation of context in a discrete-time system, two basic approaches can be considered: either the higher-level state variables change value less often, or they are constrained to change more slowly at each time step. In our experiments, we have considered input and output variables both at the shortest time scale (highest frequency), but one of the potential advantages of the approach presented here is that it becomes very simple to incorporate input and output variables that operate at different time scales. For example, in speech recognition and synthesis, the variables of interest are not only the speech signal itself (fast) but also slower-varying variables such as prosodic (average energy, pitch, etc.) and phonemic (place of articulation, phoneme duration) variables. Another example is in the application of learning algorithms to financial and economic forecasting and decision taking: some of the variables of interest are given daily, others weekly, monthly, etc.\n\nFigure 1: Four multi-resolution recurrent architectures used in the experiments. Small squares represent a discrete delay, and numbers near each neuron represent its time scale. The architectures B to E have respectively 2, 3, 4, and 6 time scales.
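The first of the two approaches above (higher-level state variables changing value less often) can be sketched as a two-level recurrent state in which the slow level is updated only once every few steps. This is our minimal illustration, not the paper's implementation; the class name, sizes, update period, and weight scales are all assumptions.

```python
import numpy as np

# Hedged sketch of a two-time-scale recurrent state: the slow level updates
# only every `period` steps, so unfolded paths through it traverse far fewer
# transformations (fewer factors in the product of eq. 3).
class TwoScaleRNN:
    def __init__(self, n_fast=8, n_slow=4, n_in=2, period=4, seed=0):
        rng = np.random.default_rng(seed)
        self.period = period
        self.Wff = rng.standard_normal((n_fast, n_fast)) * 0.1
        self.Wfs = rng.standard_normal((n_fast, n_slow)) * 0.1  # slow -> fast
        self.Wfu = rng.standard_normal((n_fast, n_in)) * 0.1
        self.Wss = rng.standard_normal((n_slow, n_slow)) * 0.1
        self.Wsf = rng.standard_normal((n_slow, n_fast)) * 0.1  # fast -> slow
        self.fast = np.zeros(n_fast)
        self.slow = np.zeros(n_slow)

    def step(self, u, t):
        # fast state: updated at every time step, conditioned on slow context
        self.fast = np.tanh(self.Wff @ self.fast + self.Wfs @ self.slow
                            + self.Wfu @ u)
        # slow state: updated only once per `period` steps (coarse time scale)
        if t % self.period == 0:
            self.slow = np.tanh(self.Wss @ self.slow + self.Wsf @ self.fast)
        return self.fast, self.slow

rnn = TwoScaleRNN()
slow_states = []
for t in range(12):
    _, s = rnn.step(np.ones(2), t)
    slow_states.append(s.copy())
# the slow state changes value only at t = 0, 4, 8
n_changes = sum(not np.allclose(slow_states[t], slow_states[t - 1])
                for t in range(1, 12))
print(n_changes)
```

In the time-unfolded graph of this sketch, a path from t = 0 to t = 11 through the slow units crosses only 3 transformations instead of 12, which is the mechanism the section above relies on.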
\n\n4 Hierarchical Recurrent Neural Network: Experiments\n\nAs in TDNNs (Lang, Waibel and Hinton, 1990) and reverse-TDNNs (Simard and LeCun, 1992), we use discrete time delays and subsampling (or oversampling) in order to implement the multiple time scales. In the time-unfolded network, paths going through the recurrences in the slowly varying units (long time scale) will carry context farther, while paths going through faster-varying units (short time scale) will respond faster to changes in input or desired changes in output. Examples of such multi-resolution recurrent neural networks are shown in Figure 1. Two sets of simple experiments were performed to validate some of the ideas presented in this paper. In both cases, we compare a hierarchical recurrent network with a single-scale fully-connected recurrent network.\n\nIn the first set of experiments, we want to evaluate the performance of a hierarchical recurrent network on a problem already used for studying the difficulty of learning long-term dependencies (Bengio, Simard and Frasconi, 1994; Bengio and Frasconi, 1994). In this 2-class problem, the network has to detect a pattern at the beginning of the sequence, keeping a bit of information in \"memory\" (while the inputs are noisy) until the end of the sequence (supervision is only at the end of the sequence). As in (Bengio, Simard and Frasconi, 1994; Bengio and Frasconi, 1994), only the first 3 time steps contain information about the class (a 3-number pattern was randomly chosen for each class within [-1,1]^3). The length of the sequences is varied to evaluate the effect of the span of input/output dependencies. Uniformly distributed noisy inputs between -.1 and .1 are added to the initial patterns as well as to the remainder of the sequence. For each sequence length, 10 trials were run with different initial weights and noise patterns, with 30 training sequences. 
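The 2-class problem above can be reconstructed as a small data generator. This is our hedged reading of the task description, not the authors' code; the function name and array layout are assumptions, and only the stated facts (3-number class patterns in [-1,1]^3, uniform noise in [-0.1, 0.1], supervision at the end) are taken from the text.

```python
import numpy as np

# Hedged reconstruction of the 2-sequence latching task: the class is encoded
# only in the first 3 time steps; the rest of each sequence is pure noise.
def make_two_class_data(n_seq=30, length=100, seed=0):
    rng = np.random.default_rng(seed)
    patterns = rng.uniform(-1, 1, size=(2, 3))        # one 3-number pattern per class
    X = rng.uniform(-0.1, 0.1, size=(n_seq, length))  # uniform noise everywhere
    y = rng.integers(0, 2, size=n_seq)                # target, given at sequence end
    X[:, :3] += patterns[y]                           # class info in first 3 steps only
    return X, y, patterns

X, y, patterns = make_two_class_data()
print(X.shape, y.shape)  # (30, 100) (30,)
```

A learner must carry the information from the first 3 steps across the remaining 97 noisy steps, which is exactly the long-term credit-assignment difficulty the hierarchy is meant to ease.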
\nExperiments were performed with sequences of lengths 10, 20, 40 and 100. Several recurrent network architectures were compared. All were trained with the same algorithm (back-propagation through time) to minimize the sum of squared differences between the final output and a desired value. The simplest architecture (A) is similar to architecture B in Figure 1 but it is not hierarchical: it has a single time scale. Like the other networks, it has however a theoretically \"sufficient\" architecture, i.e., there exists a set of weights for which it classifies the training sequences perfectly. Four of the five architectures that we compared are shown in Figure 1, with an increasing number of levels in the hierarchy. The performance of these four architectures (B to E) as well as that of the architecture with a single time scale (A) are compared in Figure 2 (left, for the 2-sequence problem). Clearly, adding more levels to the hierarchy has significantly helped to reduce the difficulty of learning long-term dependencies.\n\nFigure 2: Average classification error after training for the 2-sequence problem (left, classification error) and network-generated data (right, mean squared error), for varying sequence lengths and architectures. Each set of 5 consecutive bars represents the performance of the 5 architectures A to E, with respectively 1, 2, 3, 4 and 6 time scales (the architectures B to E are shown in Figure 1). Error bars show the standard deviation over 10 trials. 
\nIn a second set of experiments, a hierarchical recurrent network with 4 time scales was initialized with random (but large) weights and used to generate a data set. To generate the inputs as well as the outputs, the network has feedback links from hidden to input units. At the initial time step, as well as at 5% of the time steps (chosen randomly), the input was clamped with random values to introduce some further variability. It is a regression task, and the mean squared error is shown in Figure 2. Because of the network structure, we expect the data to contain long-term dependencies that can be modeled with a hierarchical structure. 100 training sequences of lengths 10, 20, 40 and 100 were generated by this network. The same 5 network architectures as in the previous experiments were compared (see Figure 1 for architectures B to E), with 10 training trials per network and per sequence length. The results are summarized in Figure 2 (right). More high-level hierarchical structure appears to have improved performance for long-term dependencies. The fact that the simpler 1-level network does not achieve a good performance suggests that there were some difficult long-term dependencies in the artificially generated data set. It is interesting to compare these results with those reported in (Lin et al., 1995), which show that using longer delays in certain recurrent connections helps learning longer-term dependencies. In both cases we find that introducing longer time scales allows learning dependencies whose span is proportionally longer.\n\n5 Hierarchical HMMs\n\nHow do we represent multiple time scales with an HMM? Some solutions have already been proposed in the speech recognition literature, motivated by the obvious presence of different time scales in speech phenomena. In (Brugnara et al., 1992) two Markov chains are coupled in a \"master/slave\" configuration. 
For the \"master\" HMM, the observations \nare slowly varying features (such as the signal energy), whereas for the \"slave\" HMM the \nobservations are t.he speech spectra themselves. The two chains are synchronous and op(cid:173)\nerate at the same time scale, therefore the problem of diffusion of credit in HMMs would \nprobably also make difficult the learning of long-term dependencies. Note on the other \n\n\f498 \n\nS. E. HIHI, Y. BENOIO \n\nhand that in most applications of HMMs to speech recognition the meaning of states is \nfixed a-priori rather than learned from the data (see (Bengio and Frasconi, 1995a) for a \ndiscussion). In a more recent contribution, Nelly Suaudeau (Suaudeau, 1994) proposes a \n\"two-level HMM\" in which the higher level HMM represents \"segmental\" variables (such \nas phoneme duration). The two levels operate at different scales: the higher level state \nvarIable represents the phonetic identity and models the distributions of the average energy \nand the duration within each phoneme. Again, this work is not geared towards learning a \nrepresentation of context, but rather, given the traditional (phoneme-based) representa(cid:173)\ntion of context in speech recognition, towards building a better model of the distribution \nof \"slow\" segmental variables such as phoneme duration and energy. Another promising \napproach was recently proposed in (Saul and Jordan, 1995). Using decimation techniques \nfrom statistical mechanics, a polynomial-time algorithm is derived for parallel Boltzmann \nchains (which are similar to parallel HMMs), which can operate at different time scales. \nThe ideas presented here point toward a HMM or IOHMM in which the (hidden) state \nvariable Xt is represented by the Cartesian product of several state variables Xt, each \n\"working\" at a different time scale: Xt = (x;, x~, ... I xf).. 
To take advantage of this decomposition, we propose to consider that the state distributions at the different levels are conditionally independent (given the state at the previous time step and at the current and previous levels). Transition probabilities are therefore factored as follows:\n\nP(x_t | x_{t-1}) = Π_{l=1}^{S} P(x_t^l | x_{t-1}^l, x_t^{l-1})    (5)\n\nTo force the state variable at each level to effectively work at a given time scale, self-transition probabilities are constrained as follows (using the above independence assumptions):\n\nP(x_t^l = i_l | x_{t-1}^1 = i_1, ..., x_{t-1}^l = i_l, ..., x_{t-1}^S = i_S) = P(x_t^l = i_l | x_{t-1}^l = i_l, x_t^{l-1} = i_{l-1}) = w_l\n\n6 Conclusion\n\nMotivated by the analysis of the problem of learning long-term dependencies in sequential data, i.e., of learning to represent context, we have proposed to use a very general assumption on the structure of sequential data to reduce the difficulty of these learning tasks. Following much previous work in artificial intelligence, we assume that context can be represented with a hierarchical structure. More precisely, here, it means that long-term dependencies are insensitive to small timing variations, i.e., they can be represented with a coarse temporal scale. This scheme allows context information and credit information to be respectively propagated forward and backward more easily.\n\nFollowing this intuitive idea, we have proposed to use hierarchical recurrent networks for sequence processing. These networks use multiple time scales to achieve a multi-resolution representation of context. Series of experiments on artificial data have confirmed the advantages of imposing such structures on the network architecture. Finally, we have proposed a similar application of this concept to hidden Markov models (for density estimation) and input/output hidden Markov models (for classification and regression).\n\nReferences\n\nBengio, Y. and Frasconi, P. (1994). 
Credit assignment through time: Alternatives to backpropagation. In Cowan, J., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6. Morgan Kaufmann.\n\nBengio, Y. and Frasconi, P. (1995a). Diffusion of context and credit information in Markovian models. Journal of Artificial Intelligence Research, 3:223-244.\n\nBengio, Y. and Frasconi, P. (1995b). An input/output HMM architecture. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems 7, pages 427-434. MIT Press, Cambridge, MA.\n\nBengio, Y., LeCun, Y., and Henderson, D. (1994). Globally trained handwritten word recognizer using spatial representation, space displacement neural networks and hidden Markov models. In Cowan, J., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 937-944.\n\nBengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166.\n\nBrugnara, F., DeMori, R., Giuliani, D., and Omologo, M. (1992). A family of parallel hidden Markov models. In International Conference on Acoustics, Speech and Signal Processing, pages 377-380, New York, NY, USA. IEEE.\n\nDaubechies, I. (1990). The wavelet transform, time-frequency localization and signal analysis. IEEE Transactions on Information Theory, 36(5):961-1005.\n\nDayan, P. and Hinton, G. (1993). Feudal reinforcement learning. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, San Mateo, CA. Morgan Kaufmann.\n\nFrasconi, P., Gori, M., Maggini, M., and Soda, G. (1993). Unified integration of explicit rules and learning by example in recurrent networks. IEEE Transactions on Knowledge and Data Engineering. 
(in press).\n\nGiles, C. L. and Omlin, C. W. (1992). Inserting rules into recurrent neural networks. In Kung, Fallside, Sorenson, and Kamm, editors, Neural Networks for Signal Processing II, Proceedings of the 1992 IEEE Workshop, pages 13-22. IEEE Press.\n\nLang, K. J., Waibel, A. H., and Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3:23-43.\n\nLeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., and Jackel, L. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541-551.\n\nLevinson, S., Rabiner, L., and Sondhi, M. (1983). An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell System Technical Journal, 64(4):1035-1074.\n\nLin, T., Horne, B., Tino, P., and Giles, C. (1995). Learning long-term dependencies is not as difficult with NARX recurrent neural networks. Technical Report UMIACS-TR-95-78, Institute for Advanced Computer Studies, University of Maryland.\n\nRabiner, L. and Juang, B. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, pages 257-285.\n\nRumelhart, D., Hinton, G., and Williams, R. (1986). Learning internal representations by error propagation. In Rumelhart, D. and McClelland, J., editors, Parallel Distributed Processing, volume 1, chapter 8, pages 318-362. MIT Press, Cambridge.\n\nSaul, L. and Jordan, M. (1995). Boltzmann chains and hidden Markov models. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems 7, pages 435-442. MIT Press, Cambridge, MA.\n\nSchmidhuber, J. (1992). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242.\n\nSimard, P. and LeCun, Y. (1992). Reverse TDNN: An architecture for trajectory generation. 
In Moody, J., Hanson, S., and Lippmann, R., editors, Advances in Neural Information Processing Systems 4, pages 579-588, Denver, CO. Morgan Kaufmann, San Mateo.\n\nSingh, S. (1992). Reinforcement learning with a hierarchy of abstract models. In Proceedings of the 10th National Conference on Artificial Intelligence, pages 202-207. MIT/AAAI Press.\n\nSuaudeau, N. (1994). Un modele probabiliste pour integrer la dimension temporelle dans un systeme de reconnaissance automatique de la parole. PhD thesis, Universite de Rennes I, France.\n\nSutton, R. (1995). TD models: modeling the world at a mixture of time scales. In Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann.\n", "award": [], "sourceid": 1102, "authors": [{"given_name": "Salah", "family_name": "Hihi", "institution": null}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": null}]}