{"title": "An Input Output HMM Architecture", "book": "Advances in Neural Information Processing Systems", "page_first": 427, "page_last": 434, "abstract": null, "full_text": "An Input Output HMM Architecture \n\nDept. Informatique et Recherche \n\nDipartimento di Sistemi e Informatica \n\nPaolo Frasconi \n\nUniversita di Firenze (Italy) \n\npaoloOmcculloch.ing.unifi.it \n\nYoshua Bengio* \n\nOperationnelle \n\nUniversite de Montreal, Qc H3C-3J7 \n\nbengioyOIRO.UMontreal.CA \n\nAbstract \n\nWe introduce a recurrent architecture having a modular structure \nand we formulate a training procedure based on the EM algorithm. \nThe resulting model has similarities to hidden Markov models, but \nsupports recurrent networks processing style and allows to exploit \nthe supervised learning paradigm while using maximum likelihood \nestimation. \n\nINTRODUCTION \n\n1 \nLearning problems involving sequentially structured data cannot be effectively dealt \nwith static models such as feedforward networks. Recurrent networks allow to model \ncomplex dynamical systems and can store and retrieve contextual information in \na flexible way. Up until the present time, research efforts of supervised learning \nfor recurrent networks have almost exclusively focused on error minimization by \ngradient descent methods. Although effective for learning short term memories, \npractical difficulties have been reported in training recurrent neural networks to \nperform tasks in which the temporal contingencies present in the input/output \nsequences span long intervals (Bengio et al., 1994; Mozer, 1992). \nPrevious work on alternative training algorithms (Bengio et al., 1994) could suggest \nthat the root of the problem lies in the essentially discrete nature of the process \nof storing information for an indefinite amount of time. Thus, a potential solution \nis to propagate, backward in time, targets in a discrete state space rather than \ndifferential error information. 
Extending previous work (Bengio & Frasconi, 1994a), in this paper we propose a statistical approach to target propagation, based on the EM algorithm. We consider a parametric dynamical system with discrete states and introduce a modular architecture, with subnetworks associated with the discrete states. The architecture can be interpreted as a statistical model and can be trained by the EM or generalized EM (GEM) algorithms (Dempster et al., 1977), treating the internal state trajectories as missing data. In this way learning is decoupled into a temporal credit assignment subproblem and a static learning subproblem that consists of fitting parameters to the next-state and output mappings defined by the estimated trajectories. In order to iteratively tune parameters with the EM or GEM algorithms, the system propagates forward and backward a discrete distribution over the n states, resulting in a procedure similar to the Baum-Welch algorithm used to train standard hidden Markov models (HMMs) (Levinson et al., 1983). HMMs, however, adjust their parameters using unsupervised learning, whereas we use EM in a supervised fashion. Furthermore, the model presented here could be called an Input/Output HMM, or IOHMM, because it can be used to learn to map input sequences to output sequences (unlike standard HMMs, which learn the output sequence distribution). This model can also be seen as a recurrent version of the Mixture of Experts architecture (Jacobs et al., 1991), related to the model already proposed in (Cacciatore and Nowlan, 1994). Experiments on artificial tasks (Bengio & Frasconi, 1994a) have shown that EM recurrent learning can deal with long-term dependencies more effectively than backpropagation through time and other alternative algorithms.

* also, AT&T Bell Labs, Holmdel, NJ 07733
However, the model used in (Bengio & Frasconi, 1994a) has very limited representational capabilities and can only map an input sequence to a final discrete state. In the present paper we describe an extended architecture that makes it possible to fully exploit both the input and output portions of the data, as required by the supervised learning paradigm. In this way, general sequence processing tasks, such as production, classification, or prediction, can be dealt with.

2 THE PROPOSED ARCHITECTURE

We consider a discrete state dynamical system based on the following state space description:

    x_t = f(x_{t-1}, u_t)
    y_t = g(x_t, u_t)                                                          (1)

where u_t ∈ R^m is the input vector at time t, y_t ∈ R^r is the output vector, and x_t ∈ {1, 2, ..., n} is a discrete state. These equations define a generalized Mealy finite state machine, in which inputs and outputs may take on continuous values. In this paper, we consider a probabilistic version of these dynamics, where the current inputs and the current state distribution are used to estimate the state distribution and the output distribution for the next time step. Admissible state transitions are specified by a directed graph G whose vertices correspond to the model's states; the set of successors for state j is S_j.

The system defined by equations (1) can be modeled by the recurrent architecture depicted in Figure 1(a). The architecture is composed of a set of state networks N_j, j = 1...n and a set of output networks O_j, j = 1...n. Each of the state and output networks is uniquely associated with one of the states, and all networks share the same input u_t. Each state network N_j has the task of predicting the next state distribution, based on the current input and given that x_{t-1} = j. Similarly, each output network O_j predicts the output of the system, given the current state and input.
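To make the state-space description (1) concrete, the following is a minimal sketch of a generalized Mealy machine with a two-valued discrete state and continuous inputs and outputs. The particular next-state function f and output function g are hypothetical choices for illustration only, not taken from the paper.

```python
# Sketch of the generalized Mealy machine of equations (1):
#   x_t = f(x_{t-1}, u_t),  y_t = g(x_t, u_t)
# with discrete state x_t in {0, 1} and scalar continuous inputs/outputs.

def f(x_prev, u):
    # hypothetical next-state function: flip the state when |u| exceeds 1
    return 1 - x_prev if abs(u) > 1.0 else x_prev

def g(x, u):
    # hypothetical output function: a different linear map for each state
    return 2.0 * u if x == 0 else -u

def run(u_seq, x0=0):
    """Iterate equations (1) over an input sequence, returning the outputs."""
    x, ys = x0, []
    for u in u_seq:
        x = f(x, u)          # x_t = f(x_{t-1}, u_t)
        ys.append(g(x, u))   # y_t = g(x_t, u_t)
    return ys
```

Because the output depends on both the current state and the current input, the same input value can produce different outputs depending on the history of the sequence, which is the behavior the probabilistic version of these dynamics generalizes.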
All the subnetworks are assumed to be static, and they are defined by smooth mappings N_j(u_t; θ_j) and O_j(u_t; ϑ_j), where θ_j and ϑ_j are vectors of adjustable parameters (e.g., connection weights). The ranges of the functions N_j(·) may be constrained in order to account for the underlying transition graph G. Each output φ_ij,t of the state subnetwork N_j (at time t) is associated with one of the successors i of state j. Thus the last layer of N_j has as many units as the cardinality of S_j. For convenience of notation, we suppose that φ_ij,t is defined for each i, j = 1, ..., n and we impose the condition φ_ij,t = 0 for each i not belonging to S_j. The softmax function is used in the last layer:

    φ_ij,t = e^{a_ij,t} / Σ_{l ∈ S_j} e^{a_lj,t},    j = 1, ..., n,  i ∈ S_j

where a_ij,t are intermediate variables that can be thought of as the activations of the output units of subnetwork N_j. In this way Σ_{i=1}^{n} φ_ij,t = 1 for all j, t.

Figure 1: (a): The proposed IOHMM architecture. (b): Bottom: Bayesian network expressing conditional dependencies for an IOHMM; top: Bayesian network for a standard HMM.

The vector ζ_t ∈ R^n represents the internal state of the model, and it is computed as a linear combination of the outputs of the state networks, gated by the previously computed internal state:

    ζ_t = Σ_{j=1}^{n} ζ_{j,t-1} φ_{j,t}                                        (2)

where φ_{j,t} = [φ_{1j,t}, ..., φ_{nj,t}]'.
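The constrained softmax above can be sketched as follows. The function name `successor_softmax`, the successor set, and the activation values are illustrative assumptions; in the architecture the activations would come from the last layer of a state subnetwork N_j.

```python
import numpy as np

def successor_softmax(a, successors, n):
    """Constrained softmax over a successor set S_j:
    phi_i = exp(a_i) / sum_{l in S_j} exp(a_l) for i in S_j,
    and phi_i = 0 for states i outside S_j."""
    phi = np.zeros(n)
    s = np.asarray(successors)
    z = np.exp(a[s] - a[s].max())   # subtract the max for numerical stability
    phi[s] = z / z.sum()
    return phi

# Illustrative activations for a model with n = 4 states and S_j = {0, 1, 3};
# state 2 is not a successor, so its probability is forced to zero.
phi = successor_softmax(np.array([0.2, 1.5, -0.3, 0.8]), [0, 1, 3], n=4)
```

Masking in this way keeps each column φ_{j,t} a proper probability distribution over next states while respecting the admissible transitions of the graph G.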
Output networks compete to predict the global output of the system η_t ∈ R^r:

    η_t = Σ_{j=1}^{n} ζ_{j,t} η_{j,t}                                          (3)

where η_{j,t} ∈ R^r is the output of subnetwork O_j. At this level, we do not need to further specify the internal architecture of the state and output subnetworks. Depending on the task, the designer may decide whether to include hidden layers and what activation rule to use for the hidden units.

This connectionist architecture can also be interpreted as a probability model. Let us assume a multinomial distribution for the state variable x_t and consider ζ_t, the main variable of the temporal recurrence (2). If we initialize the vector ζ_0 to positive numbers summing to 1, it can be interpreted as a vector of initial state probabilities. In general, we obtain the relation ζ_{i,t} = P(x_t = i | u_1^t), where u_1^t denotes the subsequence of inputs from time 1 to t, inclusively. Equation (2) then has the following probabilistic interpretation:

    P(x_t = i | u_1^t) = Σ_{j=1}^{n} P(x_t = i | x_{t-1} = j, u_t) P(x_{t-1} = j | u_1^{t-1})        (4)

i.e., the subnetworks N_j compute transition probabilities conditioned on the input u_t:

    φ_{ij,t} = P(x_t = i | x_{t-1} = j, u_t)                                   (5)

As in neural networks trained to minimize the output squared error, the output η_t of this architecture can be interpreted as an expected "position parameter" for the probability distribution of the output y_t. However, in addition to being conditional on an input u_t, this expectation is also conditional on the state x_t, i.e. η_t = E[y_t | x_t, u_t]. The actual form of the output density, denoted f_Y(y_t; η_t), will be chosen according to the task. For example, a multinomial distribution is suitable for sequence classification, or for symbolic mutually exclusive outputs.
In contrast, a Gaussian distribution is adequate for producing continuous outputs. In the first case we use a softmax function at the output of the subnetworks O_j; in the second case we use linear output units for the subnetworks O_j.

In order to reduce the amount of computation, we introduce an independency model among the variables involved in the probabilistic interpretation of the architecture. We use a Bayesian network to characterize the probabilistic dependencies among these variables. Specifically, we suppose that the directed acyclic graph G depicted at the bottom of Figure 1b is a Bayesian network for the dependency model associated with the variables u_1^T, x_1^T, y_1^T. One of the most evident consequences of this independency model is that only the previous state and the current input are relevant for determining the next state. This one-step memory property is analogous to the Markov assumption in hidden Markov models (HMMs). In fact, the Bayesian network for HMMs can be obtained by simply removing the u_t nodes and their arcs (see top of Figure 1b).

3 A SUPERVISED LEARNING ALGORITHM

The learning algorithm for the proposed architecture is derived from the maximum likelihood principle. The training data are a set of P pairs of input/output sequences (of length T_p): D = {(u_1^{T_p}(p), y_1^{T_p}(p)); p = 1, ..., P}. Let Θ denote the vector of parameters obtained by collecting all the parameters θ_j and ϑ_j of the architecture. The likelihood function is then given by

    L(Θ; D) = Π_{p=1}^{P} P(y_1^{T_p}(p) | u_1^{T_p}(p); Θ)                    (6)

The output values (used here as targets) may also be specified intermittently. For example, in sequence classification tasks, one may only be interested in the output y_T at the end of each sequence. The modification of the likelihood to account for intermittent targets is straightforward.
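Following the HMM analogy mentioned earlier, one way the per-sequence factor of the likelihood (6) can be evaluated is with a forward recursion that propagates α_{i,t} = P(y_1^t, x_t = i | u_1^t): each step gates the previous α by the transition probabilities φ_{ij,t} of equation (5) and weights each state by its output density f_Y(y_t; η_{i,t}). The sketch below assumes a fixed-variance Gaussian output model and treats the per-step transition matrices and per-state output means as given; in the actual architecture they would be produced by the subnetworks N_j and O_j, and the function names are hypothetical.

```python
import numpy as np

def gaussian_density(y, mean, var=1.0):
    # f_Y(y; eta) for an assumed fixed-variance Gaussian output model
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def sequence_likelihood(Phis, Etas, ys, zeta0):
    """Forward recursion for P(y_1^T | u_1^T).
    Phis:  per-step n x n matrices with entries phi_ij,t (column j is the
           next-state distribution given x_{t-1} = j)
    Etas:  per-step length-n vectors of per-state output means eta_{i,t}
    ys:    observed scalar outputs y_1, ..., y_T
    zeta0: initial state distribution zeta_0"""
    alpha = zeta0
    for Phi_t, eta_t, y_t in zip(Phis, Etas, ys):
        # gate by transitions, then weight each state by how well its
        # output model explains the observed y_t (elementwise product)
        alpha = (Phi_t @ alpha) * gaussian_density(y_t, eta_t)
    return alpha.sum()   # sum over final states x_T
```

Summing the final α over states marginalizes out the hidden state trajectory; taking the log of this quantity and summing over the P training sequences gives the log of (6).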
According to the maximum likelihood principle, the optimal parameters are obtained by maximizing (6). In order to apply EM to our case we begin by noting that the state variables x_t are not observed. Knowledge of the model's state trajectories would allow one to decompose the temporal learning problem into 2n static learning subproblems. Indeed, if x_t were known, the probabilities ζ_{i,t} would be either 0 or 1 and it would be possible to train each subnetwork separately, without taking into account any temporal dependency. This observation links EM learning to the target propagation approach discussed in the introduction. Note that if we used a Viterbi-like approximation (i.e., considering only the most likely path), we would indeed have 2n static learning problems at each epoch. In order to derive the learning equations, let us define the complete data as D_c = {(u_1^{T_p}(p), y_1^{T_p}(p), x_1^{T_p}(p)); p = 1, ..., P}. The corresponding complete data log-likelihood is

    l_c(Θ; D_c) = Σ_{p=1}^{P} log P(y_1^{T_p}(p), x_1^{T_p}(p) | u_1^{T_p}(p); Θ)          (7)

Since l_c(Θ; D_c) depends on the hidden state variables it cannot be maximized directly. The MLE optimization is then solved by introducing the auxiliary function Q(Θ; Θ̂) and iterating the following two steps for k = 1, 2, ...:

    Estimation:   Compute Q(Θ; Θ̂) = E[l_c(Θ; D_c) | D, Θ̂]
    Maximization: Update the parameters as Θ̂ ← argmax_Θ Q(Θ; Θ̂)                (8)

The expectation of (7) can be expressed as

    Q(Θ; Θ̂) = Σ_{p=1}^{P} Σ_{t=1}^{T_p} Σ_{i=1}^{n} ζ̂_{i,t} log P(y_t | x_t = i, u_t; Θ) + Σ ĥ_{ij,t} log