{"title": "Learning Nonlinear Dynamical Systems Using an EM Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 431, "page_last": 437, "abstract": null, "full_text": "Learning Nonlinear Dynamical Systems \n\nusing an EM Algorithm \n\nZoubin Ghahramani and Sam T. Roweis \n\nGatsby Computational Neuroscience Unit \n\nUniversity College London \nLondon WC1N 3AR, U.K. \n\nhttp://www.gatsby.ucl.ac.uk/ \n\nAbstract \n\nThe Expectation-Maximization (EM) algorithm is an iterative pro(cid:173)\ncedure for maximum likelihood parameter estimation from data \nsets with missing or hidden variables [2]. It has been applied to \nsystem identification in linear stochastic state-space models, where \nthe state variables are hidden from the observer and both the state \nand the parameters of the model have to be estimated simulta(cid:173)\nneously [9]. We present a generalization of the EM algorithm for \nparameter estimation in nonlinear dynamical systems. The \"expec(cid:173)\ntation\" step makes use of Extended Kalman Smoothing to estimate \nthe state, while the \"maximization\" step re-estimates the parame(cid:173)\nters using these uncertain state estimates. In general, the nonlinear \nmaximization step is difficult because it requires integrating out the \nuncertainty in the states. However, if Gaussian radial basis func(cid:173)\ntion (RBF) approximators are used to model the nonlinearities, \nthe integrals become tractable and the maximization step can be \nsolved via systems of linear equations. \n\n1 Stochastic Nonlinear Dynamical Systems \n\nWe examine inference and learning in discrete-time dynamical systems with hidden \nstate Xt, inputs Ut, and outputs Yt. 1 The state evolves according to stationary \nnonlinear dynamics driven by the inputs and by additive noise \n\n(1) \n\n1 All lowercase characters (except indices) denote vectors. Matrices are represented by \n\nuppercase characters. \n\n\f432 \n\nZ. Ghahramani and S. 
T Roweis \n\nwhere w is zero-mean Gaussian noise with covariance Q. 2 The outputs are non(cid:173)\nlinearly related to the states and inputs by \n\nYt = g(Xt, Ut) + v \n\n(2) \n\nwhere v is zero-mean Gaussian noise with covariance R. The vector-valued non lin(cid:173)\nearities f and 9 are assumed to be differentiable, but otherwise arbitrary. \nModels of this kind have been examined for decades in various communities. Most \nnotably, nonlinear state-space models form one of the cornerstones of modern sys(cid:173)\ntems and control engineering. In this paper, we examine these models within the \nframework of probabilistic graphical models and derive a novel learning algorithm \nfor them based on EM. With one exception,3 this is to the best of our knowledge \nthe first paper addressing learning of stochastic nonlinear dynamical systems of the \nkind we have described within the framework of the EM algorithm. \nThe classical approach to system identification treats the parameters as hidden vari(cid:173)\nables, and applies the Extended Kalman Filtering algorithm (described in section 2) \nto the nonlinear system with the state vector augmented by the parameters [5]. 4 \nThis approach is inherently on-line, which may be important in certain applications. \nFurthermore, it provides an estimate of the covariance of the parameters at each \ntime step. In contrast, the EM algorithm we present is a batch algorithm and does \nnot attempt to estimate the covariance of the parameters. \n\nThere are three important advantages the EM algorithm has over the classical ap(cid:173)\nproach. First, the EM algorithm provides a straightforward and principled method \nfor handing missing inputs or outputs. Second, EM generalizes readily to more \ncomplex models with combinations of discrete and real-valued hidden variables. \nFor example, one can formulate EM for a mixture of nonlinear dynamical systems. 
\nThird, whereas it is often very difficult to prove or analyze stability within the classical on-line approach, the EM algorithm is always attempting to maximize the likelihood, which acts as a Lyapunov function for stable learning. \n\nIn the next sections we will describe the basic components of the learning algorithm. For the expectation step of the algorithm, we infer the conditional distribution of the hidden states using Extended Kalman Smoothing (section 2). For the maximization step we first discuss the general case (section 3) and then describe the particular case where the nonlinearities are represented using Gaussian radial basis function (RBF; [6]) networks (section 4). \n\n2 Extended Kalman Smoothing \n\nGiven a system described by equations (1) and (2), we need to infer the hidden states from a history of observed inputs and outputs. The quantity at the heart of this inference problem is the conditional density P(x_t | u_1, ..., u_T, y_1, ..., y_T), for 1 <= t <= T, which captures the fact that the system is stochastic and therefore our inferences about x will be uncertain. \n\n^2 The Gaussian noise assumption is less restrictive for nonlinear systems than for linear systems since the nonlinearity can be used to generate non-Gaussian state noise. \n\n^3 The authors have just become aware that Briegel and Tresp (this volume) have applied EM to essentially the same model. Briegel and Tresp's method uses multilayer perceptrons (MLPs) to approximate the nonlinearities, and requires sampling from the hidden states to fit the MLPs. We use Gaussian radial basis functions (RBFs) to model the nonlinearities, which can be fit analytically without sampling (see section 4). \n\n^4 It is important not to confuse this use of the Extended Kalman algorithm, to simultaneously estimate parameters and hidden states, with our use of EKS, to estimate just the hidden state as part of the E step of EM. 
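\n\nAs a concrete illustration, the generative model of equations (1) and (2) can be simulated directly. The tanh state transition and linear output map below are illustrative choices (the experiments of section 5 also use a tanh transition), not part of the model definition: \n\n```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the generative model:
#   x_{t+1} = f(x_t, u_t) + w,  w ~ N(0, Q)   (equation 1)
#   y_t     = g(x_t, u_t) + v,  v ~ N(0, R)   (equation 2)
# f and g below are hypothetical instances chosen for illustration.

def f(x, u):
    return np.tanh(2.0 * x + u)      # assumed nonlinear dynamics

def g(x, u):
    return x + 0.5 * u               # assumed output map

T, Q, R = 100, 0.01, 0.1             # horizon and (scalar) noise variances
u = rng.standard_normal(T)           # white-noise inputs
x = np.zeros(T)
y = np.zeros(T)
for t in range(T):
    y[t] = g(x[t], u[t]) + rng.normal(0.0, np.sqrt(R))
    if t + 1 < T:
        x[t + 1] = f(x[t], u[t]) + rng.normal(0.0, np.sqrt(Q))
```
\n\nThe learning problem of this paper is then to recover f, g, Q, and R from the input/output pairs (u, y) alone, with x hidden. \n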
\n\nFor linear dynamical systems with Gaussian state evolution and observation noises, this conditional density is Gaussian, and the recursive algorithm for computing its mean and covariance is known as Kalman smoothing [4, 8]. Kalman smoothing is directly analogous to the forward-backward algorithm for computing the conditional hidden state distribution in a hidden Markov model, and is also a special case of the belief propagation algorithm.^5 \n\nFor nonlinear systems this conditional density is in general non-Gaussian and can in fact be quite complex. Multiple approaches exist for inferring the hidden state distribution of such nonlinear systems, including sampling methods [7] and variational approximations [3]. We focus instead in this paper on a classic approach from engineering, Extended Kalman Smoothing (EKS). \n\nExtended Kalman Smoothing simply applies Kalman smoothing to a local linearization of the nonlinear system. At every point x' in x-space, the derivatives of the vector-valued functions f and g define the matrices A_{x'} = \\partial f/\\partial x |_{x=x'} and C_{x'} = \\partial g/\\partial x |_{x=x'}, respectively. The dynamics are linearized about \\hat{x}_t, the mean of the Kalman filter state estimate at time t: \n\nx_{t+1} = f(\\hat{x}_t, u_t) + A_{\\hat{x}_t} (x_t - \\hat{x}_t) + w.   (3) \n\nThe output equation (2) can be similarly linearized. If the prior distribution of the hidden state at t = 1 was Gaussian, then, in this linearized system, the conditional distribution of the hidden state at any time t given the history of inputs and outputs will also be Gaussian. Thus, Kalman smoothing can be used on the linearized system to infer this conditional distribution (see figure 1, left panel). \n\n3 Learning \n\nThe M step of the EM algorithm re-estimates the parameters given the observed inputs, outputs, and the conditional distributions over the hidden states. 
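\n\nThe local linearization underlying EKS can be sketched numerically: Jacobians of f and g are approximated by finite differences and plugged into one standard Kalman predict/update cycle. This is a minimal sketch, not the full forward-backward smoother; the function names are our own. \n\n```python
import numpy as np

def jacobian(fn, x, eps=1e-6):
    """Finite-difference Jacobian of fn at x (fn: R^n -> R^m)."""
    fx = fn(x)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros(x.size)
        dx[i] = eps
        J[:, i] = (fn(x + dx) - fx) / eps
    return J

def ekf_step(x_hat, P, y, u, f, g, Q, R):
    """One predict/update cycle of the extended Kalman filter."""
    A = jacobian(lambda x: f(x, u), x_hat)     # dynamics Jacobian A_x (eq. 3)
    x_pred = f(x_hat, u)                       # predicted mean
    P_pred = A @ P @ A.T + Q                   # predicted covariance
    C = jacobian(lambda x: g(x, u), x_pred)    # output Jacobian C_x
    S = C @ P_pred @ C.T + R                   # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)        # Kalman gain
    x_new = x_pred + K @ (y - g(x_pred, u))    # corrected mean
    P_new = (np.eye(P.shape[0]) - K @ C) @ P_pred
    return x_new, P_new
```
\n\nRunning the filter forward and then a backward (Rauch) pass over the linearized system yields the Gaussian posteriors over x_t used in the E step. \n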
For the model we have described, the parameters define the nonlinearities f and g, and the noise covariances Q and R. \n\nTwo complications arise in the M step. First, it may not be computationally feasible to fully re-estimate f and g. For example, if they are represented by neural network regressors, a single full M step would be a lengthy training procedure using backpropagation, conjugate gradients, or some other optimization method. Alternatively, one could use partial M steps, for example, each consisting of one or a few gradient steps. \n\nThe second complication is that f and g have to be trained using the uncertain state estimates output by the EKS algorithm. Consider fitting f, which takes as inputs x_t and u_t and outputs x_{t+1}. For each t, the conditional density estimated by EKS is a full-covariance Gaussian in (x_t, x_{t+1})-space. So f has to be fit not to a set of data points but instead to a mixture of full-covariance Gaussians in input-output space (Gaussian \"clouds\" of data). Integrating over this type of noise is non-trivial for almost any form of f. One simple but inefficient approach to bypass this problem is to draw a large sample from these Gaussian clouds of uncertain data and then fit f to these samples in the usual way. A similar situation occurs with g. \n\nIn the next section we show how, by choosing Gaussian radial basis functions to model f and g, both of these complications vanish. \n\n^5 The forward part of the Kalman smoother is the Kalman filter. \n\n4 Fitting Radial Basis Functions to Gaussian Clouds \n\nWe will present a general formulation of an RBF network from which it should be clear how to fit special forms for f and g. Consider the following nonlinear mapping from input vectors x and u to an output vector z: \n\nz = \\sum_{i=1}^{I} h_i \\rho_i(x) + A x + B u + b + w,   (4) \n\nwhere w is a zero-mean Gaussian noise variable with covariance Q. 
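\n\nThe deterministic part of mapping (4), with the normalized Gaussian kernels of equation (5), can be sketched as follows (all dimensions and parameter values are illustrative): \n\n```python
import numpy as np

def rbf_kernels(x, centers, S):
    """rho_i(x) = |2*pi*S_i|^(-1/2) exp(-0.5 (x-c_i)^T S_i^{-1} (x-c_i))."""
    out = np.empty(len(centers))
    for i, (c, Si) in enumerate(zip(centers, S)):
        d = x - c
        norm = np.sqrt(np.linalg.det(2.0 * np.pi * Si))
        out[i] = np.exp(-0.5 * d @ np.linalg.solve(Si, d)) / norm
    return out

def rbf_net(x, u, h, centers, S, A, B, b):
    """Mean prediction sum_i h_i rho_i(x) + A x + B u + b (noise w omitted).

    h has shape (I, dz); A, B, b are the linear/bias terms of equation (4).
    """
    rho = rbf_kernels(x, centers, S)
    return h.T @ rho + A @ x + B @ u + b
```
\n\nThe same function serves for either form of f (or for g) once the appropriate substitutions for x, u, and z are made. \n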
For example, one form of f can be represented using (4) with the substitutions x <- x_t, u <- u_t, and z <- x_{t+1}; another with x <- (x_t, u_t), u <- 0, and z <- x_{t+1}. The parameters are: the coefficients of the I RBFs, h_i; the matrices A and B multiplying inputs x and u, respectively; and an output bias vector b. Each RBF is assumed to be a Gaussian in x-space, with center c_i and width given by the covariance matrix S_i: \n\n\\rho_i(x) = |2\\pi S_i|^{-1/2} \\exp\\{-\\frac{1}{2}(x - c_i)^T S_i^{-1} (x - c_i)\\}.   (5) \n\nThe goal is to fit this model to data (u, x, z). The complication is that the data set comes in the form of a mixture of Gaussian distributions. Here we show how to analytically integrate over this mixture distribution to fit the RBF model. \n\nAssume the data set is: \n\nP(x, z, u) = \\frac{1}{J} \\sum_j N_j(x, z) \\, \\delta(u - u_j).   (6) \n\nThat is, we observe samples from the u variables, each paired with a Gaussian \"cloud\" of data, N_j, over (x, z). The Gaussian N_j has mean \\mu_j and covariance matrix C_j. \n\nLet \\hat{z}_\\theta(x, u) = \\sum_{i=1}^{I} h_i \\rho_i(x) + A x + B u + b, where \\theta is the set of parameters \\theta = \\{h_1, ..., h_I, A, B, b\\}. The log likelihood of a single data point under the model is: \n\n-\\frac{1}{2} [z - \\hat{z}_\\theta(x, u)]^T Q^{-1} [z - \\hat{z}_\\theta(x, u)] - \\frac{1}{2} \\ln |Q| + const. \n\nThe maximum likelihood RBF fit to the mixture of Gaussian data is obtained by minimizing the following integrated quadratic form: \n\n\\min_{\\theta, Q} \\{ \\sum_j \\int_x \\int_z N_j(x, z) [z - \\hat{z}_\\theta(x, u_j)]^T Q^{-1} [z - \\hat{z}_\\theta(x, u_j)] \\, dx \\, dz + J \\ln |Q| \\}.   (7) \n\nWe rewrite this in a slightly different notation, using angled brackets \\langle \\cdot \\rangle_j to denote expectation over N_j, and defining \n\n\\theta = [h_1 \\; h_2 \\; ... \\; h_I \\; A \\; B \\; b], \\quad \\Phi = [\\rho_1(x) \\; \\rho_2(x) \\; ... \\; \\rho_I(x) \\; x^T \\; u^T \\; 1]^T. \n\nThen, the objective can be written \n\n\\min_{\\theta, Q} \\{ \\sum_j \\langle (z - \\theta\\Phi)^T Q^{-1} (z - \\theta\\Phi) \\rangle_j + J \\ln |Q| \\}.   (8) \n\nTaking derivatives with respect to \\theta, premultiplying by -Q^{-1}, and setting to zero gives the linear equations \\sum_j \\langle (z - \\theta\\Phi)\\Phi^T \\rangle_j = 0, which we can solve for \\theta and Q: \n\n\\hat{\\theta} = ( \\sum_j \\langle z \\Phi^T \\rangle_j ) ( \\sum_j \\langle \\Phi\\Phi^T \\rangle_j )^{-1}, \\quad \\hat{Q} = \\frac{1}{J} \\sum_j \\langle (z - \\hat{\\theta}\\Phi)(z - \\hat{\\theta}\\Phi)^T \\rangle_j.   (9) \n\nIn other words, given the expectations in the angled brackets, the optimal parameters can be solved for via a set of linear equations. In appendix A we show that these expectations can be computed analytically. The derivation is somewhat laborious, but the intuition is very simple: the Gaussian RBFs multiply with the Gaussian densities N_j to form new unnormalized Gaussians in (x, z)-space. Expectations under these new Gaussians are easy to compute. This fitting algorithm is illustrated in the right panel of figure 1. \n\nFigure 1: Illustrations of the E and M steps of the algorithm. The left panel shows the information used in Extended Kalman Smoothing (EKS), which infers the hidden state distribution during the E-step. The right panel illustrates the regression technique employed during the M-step. A fit to a mixture of Gaussian densities is required; if Gaussian RBF networks are used then this fit can be solved analytically. The dashed line shows a regular RBF fit to the centres of the four Gaussian densities while the solid line shows the analytic RBF fit using the covariance information. The dotted lines below show the support of the RBF kernels. \n\n5 Results \n\nWe tested how well our algorithm could learn the dynamics of a nonlinear system by observing only its inputs and outputs. 
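\n\nBefore turning to the experiments, the M-step regression of equation (9) can be sketched in code. Here the per-cloud expectations are estimated by sampling from each Gaussian N_j, i.e. the \"simple but inefficient\" alternative of section 3 rather than the analytic integrals of appendix A; the kernel placement, isotropic widths, and function names are our own illustrative assumptions. \n\n```python
import numpy as np

def features(x, u, centers, widths):
    """Phi = [rho_1(x) ... rho_I(x)  x  u  1], with unnormalized isotropic kernels."""
    rho = [np.exp(-0.5 * np.sum((x - c) ** 2) / w) for c, w in zip(centers, widths)]
    return np.concatenate([rho, x, u, [1.0]])

def fit_theta(mus, Cs, us, centers, widths, dz, n_samp=2000, seed=0):
    """Solve theta = (sum_j <z Phi^T>_j)(sum_j <Phi Phi^T>_j)^{-1} (equation 9),
    estimating the expectations by Monte Carlo over each cloud N_j."""
    rng = np.random.default_rng(seed)
    dx = mus[0].size - dz
    d_phi = len(centers) + dx + us[0].size + 1
    zPhi = np.zeros((dz, d_phi))
    PhiPhi = np.zeros((d_phi, d_phi))
    for mu, C, u in zip(mus, Cs, us):
        for row in rng.multivariate_normal(mu, C, size=n_samp):
            x, z = row[:dx], row[dx:]
            phi = features(x, u, centers, widths)
            zPhi += np.outer(z, phi) / n_samp
            PhiPhi += np.outer(phi, phi) / n_samp
    return zPhi @ np.linalg.pinv(PhiPhi)   # pinv guards against ill-conditioning
```
\n\nThe noise covariance Q then follows from the residual covariance, as in equation (9); replacing the sampling with the analytic expectations of appendix A gives the exact M step. \n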
The system consisted of a single input, state, and output variable at each time, where the relation of the state from one time step to the next was given by a tanh nonlinearity. Sample outputs of this system in response to white noise are shown in figure 2 (left panel). \n\nWe initialized the nonlinear model with a linear dynamical model trained with EM, which in turn we initialized with a variant of factor analysis. The model was given 11 RBFs in x_t-space, which were uniformly spaced within a range which was automatically determined from the density of points in x_t-space. After the initialization was over, the algorithm discovered the sigmoid nonlinearity in the dynamics within less than 10 iterations of EM (figure 2, middle and right panels). \n\nFurther experiments need to be done to determine how practical this method will be in real domains. \n\nFigure 2: (left): Data set used for training (first half) and testing (rest), which consists of a time series of inputs, u_t (a), and outputs y_t (b). (middle): Representative plots of log likelihood vs iterations of EM for linear dynamical systems (dashed line) and nonlinear dynamical systems trained as described in this paper (solid line). Note that the actual likelihood for nonlinear dynamical systems cannot generally be computed analytically; what is shown here is the approximate likelihood computed by EKS. The kink in the solid curve comes when initialization with linear dynamics ends and the nonlinearity starts to be learned. (right): Means of (x_t, x_{t+1}) Gaussian posteriors computed by EKS (dots), along with the sigmoid nonlinearity (dashed line) and the RBF nonlinearity learned by the algorithm. 
At no point does the algorithm actually observe (x_t, x_{t+1}) pairs; these are inferred from inputs, outputs, and the current model parameters. \n\n6 Discussion \n\nThis paper brings together two classic algorithms, one from statistics and another from systems engineering, to address the learning of stochastic nonlinear dynamical systems. We have shown that by pairing the Extended Kalman Smoothing algorithm for state estimation in the E-step with a radial basis function learning model that permits analytic solution of the M-step, the EM algorithm is capable of learning a nonlinear dynamical model from data. As a side effect we have derived an algorithm for training a radial basis function network to fit data in the form of a mixture of Gaussians. \n\nOur initial approach has three potential limitations. First, the M-step presented does not modify the centres or widths of the RBF kernels. It is possible to compute the expectations required to change the centres and widths, but it requires resorting to a partial M-step. For low-dimensional state spaces, filling the space with pre-fixed kernels is feasible, but this strategy needs exponentially many RBFs in high dimensions. Second, EM training can be slow, especially if initialized poorly. Understanding how different hidden variable models are related can help devise sensible initialization heuristics. For example, for this model we used a nested initialization which first learned a simple linear dynamical system, which in turn was initialized with a variant of factor analysis. Third, the method presented here learns from batches of data and assumes stationary dynamics. We have recently extended it to handle online learning of nonstationary dynamics. \n\nThe belief network literature has recently been dominated by two methods for approximate inference, Markov chain Monte Carlo [7] and variational approximations [3]. 
To our knowledge this paper is the first instance where extended Kalman smoothing has been used to perform approximate inference in the E step of EM. While EKS does not have the theoretical guarantees of variational methods, its simplicity has gained it wide acceptance in the estimation and control literatures as a method for doing inference in nonlinear dynamical systems. We are now exploring generalizations of this method to learning nonlinear multilayer belief networks. \n\nAcknowledgements \n\nZG would like to acknowledge the support of the CITO (Ontario) and the Gatsby Charitable Fund. STR was supported in part by the NSF Center for Neuromorphic Systems Engineering and by an NSERC of Canada 1967 Award. \n\nA Expectations Required to Fit the RBFs \n\nThe expectations we need to compute for equation 9 are \\langle x \\rangle_j, \\langle z \\rangle_j, \\langle xx^T \\rangle_j, \\langle xz^T \\rangle_j, \\langle zz^T \\rangle_j, \\langle \\rho_i(x) \\rangle_j, \\langle x\\,\\rho_i(x) \\rangle_j, \\langle z\\,\\rho_i(x) \\rangle_j, and \\langle \\rho_i(x)\\,\\rho_l(x) \\rangle_j. \n\nStarting with some of the easier ones that do not depend on the RBF kernel \\rho: \n\n\\langle x \\rangle_j = \\mu_j^x, \\quad \\langle z \\rangle_j = \\mu_j^z, \\quad \\langle xx^T \\rangle_j = \\mu_j^x \\mu_j^{x\\,T} + C_j^{xx}, \\quad \\langle zz^T \\rangle_j = \\mu_j^z \\mu_j^{z\\,T} + C_j^{zz}, \\quad \\langle xz^T \\rangle_j = \\mu_j^x \\mu_j^{z\\,T} + C_j^{xz}. \n\nObserve that when we multiply the Gaussian RBF kernel \\rho_i(x) (equation 5) and N_j we get a Gaussian density over (x, z) with mean and covariance \n\nC_{ij} = ( C_j^{-1} + \\begin{bmatrix} S_i^{-1} & 0 \\\\ 0 & 0 \\end{bmatrix} )^{-1}, \\quad \\mu_{ij} = C_{ij} ( C_j^{-1}\\mu_j + \\begin{bmatrix} S_i^{-1} c_i \\\\ 0 \\end{bmatrix} ), \n\nand an extra constant (due to lack of normalization), \n\n\\beta_{ij} = (2\\pi)^{-d_x/2} |S_i|^{-1/2} |C_j|^{-1/2} |C_{ij}|^{1/2} \\exp\\{-\\delta_{ij}/2\\}, \n\nwhere \\delta_{ij} = c_i^T S_i^{-1} c_i + \\mu_j^T C_j^{-1} \\mu_j - \\mu_{ij}^T C_{ij}^{-1} \\mu_{ij}. Using \\beta_{ij} and \\mu_{ij}, we can evaluate the other expectations: \n\n\\langle \\rho_i(x) \\rangle_j = \\beta_{ij}, \\quad \\langle x\\,\\rho_i(x) \\rangle_j = \\beta_{ij}\\,\\mu_{ij}^x, \\quad \\langle z\\,\\rho_i(x) \\rangle_j = \\beta_{ij}\\,\\mu_{ij}^z. 
\n\nFinally, (pi(X) Pl(X))j = (21T)-d\", ICj 1-1/2IS;j-1/2IS11-1/2ICilj 11/ 2 exp{ -,ifj/2}, where \nC,'l)\" = (C):-l + [ Si- 1 +0 Sll 0]) -1 d C (C- 1 \n\n[ Si-1Ci + Sll Cl ]) \n\nJLilj = \n\nilj \n\nan \n\n0 \n\n) \n\nJLj + \n\no \n\n' \n\nd \nan \n\n,iij = Ci \n\nTS-1 \n\ni \n\nci + Cl \n\nTS-l \n\nl Cl + JLj \n\nTC- l \n\nj \n\nJLj - JLilj \n\nT C- 1 \n\nilj JLiij . \n\nReferences \n[1] T. Briegel and V. Tresp. Fisher Scoring and a Mixture of Modes Approach for Ap(cid:173)\n\nproximate Inference and Learning in Nonlinear State Space Models. In This Volume. \nMIT Press, 1999. \n\n[2] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete \n\ndata via the EM algorithm. J. Royal Statistical Society Series B, 39:1- 38, 1977. \n\n[3] M. I. Jordan, Z. Ghahramani, T . S. Jaakkola, and L. K. Saul. An Introduction to \n\nvariational methods in graphical models. Machine Learning, 1999. \n\n[4] R. E. Kalman and R. S. Bucy. New results in linear filtering and prediction. Journal \n\nof Basic Engineering (A SME) , 83D:95-108, 1961. \n\n[5] L. Ljung and T. Soderstrom. Theory and Practice of Recursive Identification. MIT \n\nPress, Cambridge, MA, 1983. \n\n[6] J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. \n\nNeural Computation, 1(2):281-294, 1989. \n\n[7] R. M. Neal. Probabilistic inference using Markov chain monte carlo methods. Technical \n\nReport CRG-TR-93-1, 1993. \n\n[8] H. E. Rauch. Solutions -to the linear smoothing problem. \n\nAutomatic Control, 8:371-372, 1963. \n\nIEEE Transactions on \n\n[9] R . H. Shumway and D. S. Stoffer. An approach to time series smoothing and forecasting \n\nusing the EM algorithm. J. Time Series Analysis, 3(4):253- 264, 1982. \n\n\f", "award": [], "sourceid": 1594, "authors": [{"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Sam", "family_name": "Roweis", "institution": null}]}