{"title": "Learning Nonlinear Dynamical Systems Using an EM Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 431, "page_last": 437, "abstract": null, "full_text": "Learning Nonlinear Dynamical Systems Using an EM Algorithm\n\nZoubin Ghahramani and Sam T. Roweis\n\nGatsby Computational Neuroscience Unit\nUniversity College London\nLondon WC1N 3AR, U.K.\nhttp://www.gatsby.ucl.ac.uk/\n\nAbstract\n\nThe Expectation-Maximization (EM) algorithm is an iterative procedure for maximum likelihood parameter estimation from data sets with missing or hidden variables [2]. It has been applied to system identification in linear stochastic state-space models, where the state variables are hidden from the observer and both the state and the parameters of the model have to be estimated simultaneously [9]. We present a generalization of the EM algorithm for parameter estimation in nonlinear dynamical systems. The \"expectation\" step makes use of Extended Kalman Smoothing to estimate the state, while the \"maximization\" step re-estimates the parameters using these uncertain state estimates. In general, the nonlinear maximization step is difficult because it requires integrating out the uncertainty in the states. However, if Gaussian radial basis function (RBF) approximators are used to model the nonlinearities, the integrals become tractable and the maximization step can be solved via systems of linear equations.\n\n1 Stochastic Nonlinear Dynamical Systems\n\nWe examine inference and learning in discrete-time dynamical systems with hidden state $x_t$, inputs $u_t$, and outputs $y_t$.^1 The state evolves according to stationary nonlinear dynamics driven by the inputs and by additive noise\n\n$$x_{t+1} = f(x_t, u_t) + w \\qquad (1)$$\n\n^1 All lowercase characters (except indices) denote vectors.  
Matrices are represented by uppercase characters.\n\nwhere w is zero-mean Gaussian noise with covariance Q.^2 The outputs are nonlinearly related to the states and inputs by\n\n$$y_t = g(x_t, u_t) + v \\qquad (2)$$\n\nwhere v is zero-mean Gaussian noise with covariance R. The vector-valued nonlinearities f and g are assumed to be differentiable, but otherwise arbitrary.\n\nModels of this kind have been examined for decades in various communities. Most notably, nonlinear state-space models form one of the cornerstones of modern systems and control engineering. In this paper, we examine these models within the framework of probabilistic graphical models and derive a novel learning algorithm for them based on EM. With one exception,^3 this is to the best of our knowledge the first paper addressing learning of stochastic nonlinear dynamical systems of the kind we have described within the framework of the EM algorithm.\n\nThe classical approach to system identification treats the parameters as hidden variables, and applies the Extended Kalman Filtering algorithm (described in section 2) to the nonlinear system with the state vector augmented by the parameters [5].^4 This approach is inherently on-line, which may be important in certain applications. Furthermore, it provides an estimate of the covariance of the parameters at each time step. In contrast, the EM algorithm we present is a batch algorithm and does not attempt to estimate the covariance of the parameters.\n\nThere are three important advantages the EM algorithm has over the classical approach. First, the EM algorithm provides a straightforward and principled method for handling missing inputs or outputs.  
Second, EM generalizes readily to more complex models with combinations of discrete and real-valued hidden variables. For example, one can formulate EM for a mixture of nonlinear dynamical systems. Third, whereas it is often very difficult to prove or analyze stability within the classical on-line approach, the EM algorithm is always attempting to maximize the likelihood, which acts as a Lyapunov function for stable learning.\n\nIn the next sections we will describe the basic components of the learning algorithm. For the expectation step of the algorithm, we infer the conditional distribution of the hidden states using Extended Kalman Smoothing (section 2). For the maximization step we first discuss the general case (section 3) and then describe the particular case where the nonlinearities are represented using Gaussian radial basis function (RBF; [6]) networks (section 4).\n\n2 Extended Kalman Smoothing\n\nGiven a system described by equations (1) and (2), we need to infer the hidden states from a history of observed inputs and outputs. The quantity at the heart of this inference problem is the conditional density $P(x_t | u_1, \\ldots, u_T, y_1, \\ldots, y_T)$, for $1 \\le t \\le T$, which captures the fact that the system is stochastic and therefore our inferences about x will be uncertain.\n\n^2 The Gaussian noise assumption is less restrictive for nonlinear systems than for linear systems since the nonlinearity can be used to generate non-Gaussian state noise.\n\n^3 The authors have just become aware that Briegel and Tresp (this volume) have applied EM to essentially the same model. Briegel and Tresp's method uses multilayer perceptrons (MLP) to approximate the nonlinearities, and requires sampling from the hidden states to fit the MLP.  
We use Gaussian radial basis functions (RBFs) to model the nonlinearities, which can be fit analytically without sampling (see section 4).\n\n^4 It is important not to confuse this use of the Extended Kalman algorithm, to simultaneously estimate parameters and hidden states, with our use of EKS, to estimate just the hidden state as part of the E step of EM.\n\nFor linear dynamical systems with Gaussian state evolution and observation noises, this conditional density is Gaussian and the recursive algorithm for computing its mean and covariance is known as Kalman smoothing [4, 8]. Kalman smoothing is directly analogous to the forward-backward algorithm for computing the conditional hidden state distribution in a hidden Markov model, and is also a special case of the belief propagation algorithm.^5\n\nFor nonlinear systems this conditional density is in general non-Gaussian and can in fact be quite complex. Multiple approaches exist for inferring the hidden state distribution of such nonlinear systems, including sampling methods [7] and variational approximations [3]. We focus instead in this paper on a classic approach from engineering, Extended Kalman Smoothing (EKS).\n\nExtended Kalman Smoothing simply applies Kalman smoothing to a local linearization of the nonlinear system. At every point $\\tilde{x}$ in x-space, the derivatives of the vector-valued functions f and g define the matrices $A_{\\tilde{x}} \\equiv \\partial f / \\partial x |_{x=\\tilde{x}}$ and $C_{\\tilde{x}} \\equiv \\partial g / \\partial x |_{x=\\tilde{x}}$, respectively. The dynamics are linearized about $\\hat{x}_t$, the mean of the Kalman filter state estimate at time t:\n\n$$x_{t+1} = f(\\hat{x}_t, u_t) + A_{\\hat{x}_t} (x_t - \\hat{x}_t) + w \\qquad (3)$$\n\nThe output equation (2) can be similarly linearized. If the prior distribution of the hidden state at t = 1 is Gaussian, then, in this linearized system, the conditional distribution of the hidden state at any time t given the history of inputs and outputs will also be Gaussian. Thus, Kalman smoothing can be used on the linearized system to infer this conditional distribution (see figure 1, left panel).\n\n3 Learning\n\nThe M step of the EM algorithm re-estimates the parameters given the observed inputs, outputs, and the conditional distributions over the hidden states. For the model we have described, the parameters define the nonlinearities f and g, and the noise covariances Q and R.\n\nTwo complications arise in the M step. First, it may not be computationally feasible to fully re-estimate f and g. For example, if they are represented by neural network regressors, a single full M step would be a lengthy training procedure using backpropagation, conjugate gradients, or some other optimization method. Alternatively, one could use partial M steps, for example, each consisting of one or a few gradient steps.\n\nThe second complication is that f and g have to be trained using the uncertain state estimates output by the EKS algorithm. Consider fitting f, which takes as inputs $x_t$ and $u_t$ and outputs $x_{t+1}$. For each t, the conditional density estimated by EKS is a full-covariance Gaussian in $(x_t, x_{t+1})$-space. So f has to be fit not to a set of data points but instead to a mixture of full-covariance Gaussians in input-output space (Gaussian \"clouds\" of data). Integrating over this type of noise is non-trivial for almost any form of f. One simple but inefficient approach to bypass this problem is to draw a large sample from these Gaussian clouds of uncertain data and then fit f to these samples in the usual way. A similar situation occurs with g. 
\nIn the next section we show how, by choosing Gaussian radial basis functions to model f and g, both of these complications vanish.\n\n^5 The forward part of the Kalman smoother is the Kalman filter.\n\n4 Fitting Radial Basis Functions to Gaussian Clouds\n\nWe will present a general formulation of an RBF network from which it should be clear how to fit special forms for f and g. Consider the following nonlinear mapping from input vectors x and u to an output vector z:\n\n$$z = \\sum_{i=1}^{I} h_i \\rho_i(x) + A x + B u + b + w, \\qquad (4)$$\n\nwhere w is a zero-mean Gaussian noise variable with covariance Q. For example, one form of f can be represented using (4) with the substitutions $x \\leftarrow x_t$, $u \\leftarrow u_t$, and $z \\leftarrow x_{t+1}$; another with $x \\leftarrow (x_t, u_t)$, $u \\leftarrow \\emptyset$, and $z \\leftarrow x_{t+1}$. The parameters are: the coefficients of the I RBFs, $h_i$; the matrices A and B multiplying inputs x and u, respectively; and an output bias vector b. Each RBF is assumed to be a Gaussian in x-space, with center $c_i$ and width given by the covariance matrix $S_i$:\n\n$$\\rho_i(x) = |2 \\pi S_i|^{-1/2} \\exp\\{ -\\tfrac{1}{2} (x - c_i)^T S_i^{-1} (x - c_i) \\}. \\qquad (5)$$\n\nThe goal is to fit this model to data (u, x, z). The complication is that the data set comes in the form of a mixture of Gaussian distributions. Here we show how to analytically integrate over this mixture distribution to fit the RBF model.\n\nAssume the data set is:\n\n$$P(x, z, u) = \\frac{1}{J} \\sum_{j} N_j(x, z) \\, \\delta(u - u_j). \\qquad (6)$$\n\nThat is, we observe samples from the u variables, each paired with a Gaussian \"cloud\" of data, $N_j$, over (x, z). The Gaussian $N_j$ has mean $\\mu_j$ and covariance matrix $C_j$.\n\nLet $z_\\theta(x, u) = \\sum_{i=1}^{I} h_i \\rho_i(x) + A x + B u + b$, where $\\theta$ is the set of parameters $\\theta = \\{h_1 \\ldots h_I, A, B, b\\}$.  
The log likelihood of a single data point under the model is:\n\n$$-\\tfrac{1}{2} [z - z_\\theta(x, u)]^T Q^{-1} [z - z_\\theta(x, u)] - \\tfrac{1}{2} \\ln |Q| + \\mathrm{const}.$$\n\nThe maximum likelihood RBF fit to the mixture of Gaussian data is obtained by minimizing the following integrated quadratic form:\n\n$$\\min_{\\theta, Q} \\Big\\{ \\sum_j \\int_x \\int_z N_j(x, z) [z - z_\\theta(x, u_j)]^T Q^{-1} [z - z_\\theta(x, u_j)] \\, dx \\, dz + J \\ln |Q| \\Big\\}. \\qquad (7)$$\n\nWe rewrite this in a slightly different notation, using angled brackets $\\langle \\cdot \\rangle_j$ to denote expectation over $N_j$, and defining\n\n$$\\theta \\equiv [h_1 \\; h_2 \\; \\ldots \\; h_I \\; A \\; B \\; b], \\qquad \\Phi \\equiv [\\rho_1(x) \\; \\rho_2(x) \\; \\ldots \\; \\rho_I(x) \\; x^T \\; u^T \\; 1]^T.$$\n\nThen, the objective can be written\n\n$$\\min_{\\theta, Q} \\Big\\{ \\sum_j \\langle (z - \\theta \\Phi)^T Q^{-1} (z - \\theta \\Phi) \\rangle_j + J \\ln |Q| \\Big\\}. \\qquad (8)$$\n\nTaking derivatives with respect to $\\theta$, premultiplying by $-Q^{-1}$, and setting to zero gives the linear equations $\\sum_j \\langle (z - \\theta \\Phi) \\Phi^T \\rangle_j = 0$, which we can solve for $\\theta$ and Q:\n\n$$\\hat{\\theta} = \\Big( \\sum_j \\langle z \\Phi^T \\rangle_j \\Big) \\Big( \\sum_j \\langle \\Phi \\Phi^T \\rangle_j \\Big)^{-1}, \\qquad \\hat{Q} = \\frac{1}{J} \\Big( \\sum_j \\langle z z^T \\rangle_j - \\hat{\\theta} \\sum_j \\langle \\Phi z^T \\rangle_j \\Big). \\qquad (9)$$\n\nIn other words, given the expectations in the angled brackets, the optimal parameters can be solved for via a set of linear equations. In appendix A we show that these expectations can be computed analytically. The derivation is somewhat laborious, but the intuition is very simple: the Gaussian RBFs multiply with the Gaussian densities $N_j$ to form new unnormalized Gaussians in (x, z)-space. Expectations under these new Gaussians are easy to compute. This fitting algorithm is illustrated in the right panel of figure 1.\n\n[Figure 1 image omitted; recoverable labels: \"Gaussian evidence from t-1\", \"from t+1\", and \"input dimension\".]\n\nFigure 1: Illustrations of the E and M steps of the algorithm.  
The left panel shows the information used in Extended Kalman Smoothing (EKS), which infers the hidden state distribution during the E-step. The right panel illustrates the regression technique employed during the M-step. A fit to a mixture of Gaussian densities is required; if Gaussian RBF networks are used then this fit can be solved analytically. The dashed line shows a regular RBF fit to the centres of the four Gaussian densities while the solid line shows the analytic RBF fit using the covariance information. The dotted lines below show the support of the RBF kernels.\n\n5 Results\n\nWe tested how well our algorithm could learn the dynamics of a nonlinear system by observing only its inputs and outputs. The system consisted of a single input, state and output variable at each time, where the relation of the state from one time step to the next was given by a tanh nonlinearity. Sample outputs of this system in response to white noise are shown in figure 2 (left panel).\n\nWe initialized the nonlinear model with a linear dynamical model trained with EM, which in turn we initialized with a variant of factor analysis. The model was given 11 RBFs in $x_t$-space, which were uniformly spaced within a range which was automatically determined from the density of points in $x_t$-space. After the initialization was over, the algorithm discovered the sigmoid nonlinearity in the dynamics in fewer than 10 iterations of EM (figure 2, middle and right panels).\n\nFurther experiments need to be done to determine how practical this method will be in real domains.\n\n[Figure 2 image omitted; recoverable axis labels: \"Iterations of EM\" and \"x(t)\".]\n\nFigure 2: (left): Data set used for training (first half) and testing (rest), which consists of a time series of inputs, $u_t$ (a), and outputs $y_t$ (b). (middle): Representative plots of log likelihood vs iterations of EM for linear dynamical systems (dashed line) and nonlinear dynamical systems trained as described in this paper (solid line). Note that the actual likelihood for nonlinear dynamical systems cannot generally be computed analytically; what is shown here is the approximate likelihood computed by EKS. The kink in the solid curve comes when initialization with linear dynamics ends and the nonlinearity starts to be learned. (right): Means of $(x_t, x_{t+1})$ Gaussian posteriors computed by EKS (dots), along with the sigmoid nonlinearity (dashed line) and the RBF nonlinearity learned by the algorithm. At no point does the algorithm actually observe $(x_t, x_{t+1})$ pairs; these are inferred from inputs, outputs, and the current model parameters.\n\n6 Discussion\n\nThis paper brings together two classic algorithms, one from statistics and another from systems engineering, to address the learning of stochastic nonlinear dynamical systems. We have shown that by pairing the Extended Kalman Smoothing algorithm for state estimation in the E-step, with a radial basis function learning model that permits analytic solution of the M-step, the EM algorithm is capable of learning a nonlinear dynamical model from data. As a side effect we have derived an algorithm for training a radial basis function network to fit data in the form of a mixture of Gaussians.\n\nOur initial approach has three potential limitations.  
First, the M-step presented does not modify the centres or widths of the RBF kernels. It is possible to compute the expectations required to change the centres and widths, but it requires resorting to a partial M-step. For low dimensional state spaces, filling the space with pre-fixed kernels is feasible, but this strategy needs exponentially many RBFs in high dimensions. Second, EM training can be slow, especially if initialized poorly. Understanding how different hidden variable models are related can help devise sensible initialization heuristics. For example, for this model we used a nested initialization which first learned a simple linear dynamical system, which in turn was initialized with a variant of factor analysis. Third, the method presented here learns from batches of data and assumes stationary dynamics. We have recently extended it to handle online learning of nonstationary dynamics.\n\nThe belief network literature has recently been dominated by two methods for approximate inference, Markov chain Monte Carlo [7] and variational approximations [3]. To our knowledge this paper is the first instance where extended Kalman smoothing has been used to perform approximate inference in the E step of EM. While EKS does not have the theoretical guarantees of variational methods, its simplicity has gained it wide acceptance in the estimation and control literatures as a method for doing inference in nonlinear dynamical systems. We are now exploring generalizations of this method to learning nonlinear multilayer belief networks.\n\nAcknowledgements\n\nZG would like to acknowledge the support of the CITO (Ontario) and the Gatsby Charitable Fund.  
STR was supported in part by the NSF Center for Neuromorphic Systems Engineering and by an NSERC of Canada 1967 Award.\n\nA Expectations Required to Fit the RBFs\n\nThe expectations we need to compute for equation 9 are $\\langle x \\rangle_j$, $\\langle z \\rangle_j$, $\\langle x x^T \\rangle_j$, $\\langle x z^T \\rangle_j$, $\\langle z z^T \\rangle_j$, $\\langle \\rho_i(x) \\rangle_j$, $\\langle x \\rho_i(x) \\rangle_j$, $\\langle z \\rho_i(x) \\rangle_j$, $\\langle \\rho_i(x) \\rho_l(x) \\rangle_j$.\n\nStarting with some of the easier ones that do not depend on the RBF kernel $\\rho$:\n\n$$\\langle x \\rangle_j = \\mu_j^x, \\qquad \\langle z \\rangle_j = \\mu_j^z,$$\n\n$$\\langle x x^T \\rangle_j = \\mu_j^x \\mu_j^{x,T} + C_j^{xx}, \\qquad \\langle x z^T \\rangle_j = \\mu_j^x \\mu_j^{z,T} + C_j^{xz}, \\qquad \\langle z z^T \\rangle_j = \\mu_j^z \\mu_j^{z,T} + C_j^{zz}.$$\n\nObserve that when we multiply the Gaussian RBF kernel $\\rho_i(x)$ (equation 5) and $N_j$ we get a Gaussian density over (x, z) with mean and covariance\n\n$$\\mu_{ij} = C_{ij} \\Big( C_j^{-1} \\mu_j + \\begin{bmatrix} S_i^{-1} c_i \\\\ 0 \\end{bmatrix} \\Big), \\qquad C_{ij} = \\Big( C_j^{-1} + \\begin{bmatrix} S_i^{-1} & 0 \\\\ 0 & 0 \\end{bmatrix} \\Big)^{-1},$$\n\nand an extra constant (due to lack of normalization),\n\n$$\\beta_{ij} = (2\\pi)^{-d_x/2} |S_i|^{-1/2} |C_j|^{-1/2} |C_{ij}|^{1/2} \\exp\\{ -\\delta_{ij}/2 \\},$$\n\nwhere $\\delta_{ij} = c_i^T S_i^{-1} c_i + \\mu_j^T C_j^{-1} \\mu_j - \\mu_{ij}^T C_{ij}^{-1} \\mu_{ij}$. Using $\\beta_{ij}$ and $\\mu_{ij}$, we can evaluate the other expectations:\n\n$$\\langle \\rho_i(x) \\rangle_j = \\beta_{ij}, \\qquad \\langle x \\rho_i(x) \\rangle_j = \\beta_{ij} \\mu_{ij}^x, \\qquad \\langle z \\rho_i(x) \\rangle_j = \\beta_{ij} \\mu_{ij}^z.$$\n\nFinally, $\\langle \\rho_i(x) \\rho_l(x) \\rangle_j = (2\\pi)^{-d_x} |C_j|^{-1/2} |S_i|^{-1/2} |S_l|^{-1/2} |C_{ilj}|^{1/2} \\exp\\{ -\\gamma_{ilj}/2 \\}$, where\n\n$$C_{ilj} = \\Big( C_j^{-1} + \\begin{bmatrix} S_i^{-1} + S_l^{-1} & 0 \\\\ 0 & 0 \\end{bmatrix} \\Big)^{-1}, \\qquad \\mu_{ilj} = C_{ilj} \\Big( C_j^{-1} \\mu_j + \\begin{bmatrix} S_i^{-1} c_i + S_l^{-1} c_l \\\\ 0 \\end{bmatrix} \\Big),$$\n\nand\n\n$$\\gamma_{ilj} = c_i^T S_i^{-1} c_i + c_l^T S_l^{-1} c_l + \\mu_j^T C_j^{-1} \\mu_j - \\mu_{ilj}^T C_{ilj}^{-1} \\mu_{ilj}.$$\n\nReferences\n\n[1] T. Briegel and V. Tresp. Fisher Scoring and a Mixture of Modes Approach for Approximate Inference and Learning in Nonlinear State Space Models. In This Volume. MIT Press, 1999.\n\n[2] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B, 39:1-38, 1977.\n\n[3] M. I. Jordan, Z.  
Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods in graphical models. Machine Learning, 1999.\n\n[4] R. E. Kalman and R. S. Bucy. New results in linear filtering and prediction. Journal of Basic Engineering (ASME), 83D:95-108, 1961.\n\n[5] L. Ljung and T. Soderstrom. Theory and Practice of Recursive Identification. MIT Press, Cambridge, MA, 1983.\n\n[6] J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281-294, 1989.\n\n[7] R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, 1993.\n\n[8] H. E. Rauch. Solutions to the linear smoothing problem. IEEE Transactions on Automatic Control, 8:371-372, 1963.\n\n[9] R. H. Shumway and D. S. Stoffer. An approach to time series smoothing and forecasting using the EM algorithm. J. Time Series Analysis, 3(4):253-264, 1982.", "award": [], "sourceid": 1594, "authors": [{"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Sam", "family_name": "Roweis", "institution": null}]}