Zoubin Ghahramani, Sam Roweis
The Expectation-Maximization (EM) algorithm is an iterative pro(cid:173) cedure for maximum likelihood parameter estimation from data sets with missing or hidden variables . It has been applied to system identification in linear stochastic state-space models, where the state variables are hidden from the observer and both the state and the parameters of the model have to be estimated simulta(cid:173) neously . We present a generalization of the EM algorithm for parameter estimation in nonlinear dynamical systems. The "expec(cid:173) tation" step makes use of Extended Kalman Smoothing to estimate the state, while the "maximization" step re-estimates the parame(cid:173) ters using these uncertain state estimates. In general, the nonlinear maximization step is difficult because it requires integrating out the uncertainty in the states. However, if Gaussian radial basis func(cid:173) tion (RBF) approximators are used to model the nonlinearities, the integrals become tractable and the maximization step can be solved via systems of linear equations.
1 Stochastic Nonlinear Dynamical Systems
We examine inference and learning in discrete-time dynamical systems with hidden state Xt, inputs Ut, and outputs Yt. 1 The state evolves according to stationary nonlinear dynamics driven by the inputs and by additive noise
1 All lowercase characters (except indices) denote vectors. Matrices are represented by
Z. Ghahramani and S. T Roweis
where w is zero-mean Gaussian noise with covariance Q. 2 The outputs are non(cid:173) linearly related to the states and inputs by
Yt = g(Xt, Ut) + v
where v is zero-mean Gaussian noise with covariance R. The vector-valued non lin(cid:173) earities f and 9 are assumed to be differentiable, but otherwise arbitrary. Models of this kind have been examined for decades in various communities. Most notably, nonlinear state-space models form one of the cornerstones of modern sys(cid:173) tems and control engineering. In this paper, we examine these models within the framework of probabilistic graphical models and derive a novel learning algorithm for them based on EM. With one exception,3 this is to the best of our knowledge the first paper addressing learning of stochastic nonlinear dynamical systems of the kind we have described within the framework of the EM algorithm. The classical approach to system identification treats the parameters as hidden vari(cid:173) ables, and applies the Extended Kalman Filtering algorithm (described in section 2) to the nonlinear system with the state vector augmented by the parameters . 4 This approach is inherently on-line, which may be important in certain applications. Furthermore, it provides an estimate of the covariance of the parameters at each time step. In contrast, the EM algorithm we present is a batch algorithm and does not attempt to estimate the covariance of the parameters.
There are three important advantages the EM algorithm has over the classical ap(cid:173) proach. First, the EM algorithm provides a straightforward and principled method for handing missing inputs or outputs. Second, EM generalizes readily to more complex models with combinations of discrete and real-valued hidden variables. For example, one can formulate EM for a mixture of nonlinear dynamical systems. Third, whereas it is often very difficult to prove or analyze stability within the classical on-line approach, the EM algorithm is always attempting to maximize the likelihood, which acts as a Lyapunov function for stable learning.
In the next sections we will describe the basic components of the learning algorithm. For the expectation step of the algorithm, we infer the conditional distribution of the hidden states using Extended Kalman Smoothing (section 2). For the maximization step we first discuss the general case (section 3) and then describe the particular case where the nonlinearities are represented using Gaussian radial basis function (RBF; ) networks (section 4).
2 Extended Kalman Smoothing
Given a system described by equations (1) and (2), we need to infer the hidden states from a history of observed inputs and outputs. The quantity at the heart of this inference problem is the conditional density P(XtIUl,"" UT, Yl,.' " YT), for 1 ::; t ::; T, which captures the fact that the system is stochastic and therefore our inferences about x will be uncertain.
2The Gaussian noise assumption is less restrictive for nonlinear systems than for linear
systems since the nonlinearity can be used to generate non-Gaussian state noise.
3The authors have just become aware that Briegel and Tresp (this volume) have applied EM to essentially the same model. Briegel and Tresp's method uses multilayer perceptrons (MLP) to approximate the nonlinearities, and requires sampling from the hidden states to fit the MLP. We use Gaussian radial basis functions (RBFs) to model the nonlinearities, which can be fit analytically without sampling (see section 4) .
41t is important not to confuse this use of the Extended Kalman algorithm, to simul(cid:173)
taneously estimate parameters and hidden states, with our use of EKS, to estimate just the hidden state as part of the E step of EM.
Learning Nonlinear Dynamics Using EM
For linear dynamical systems with Gaussian state evolution and observation noises, this conditional density is Gaussian and the recursive algorithm for computing its mean and covariance is known as Kalman smoothing [4, 8]. Kalman smoothing is directly analogous to the forward-backward algorithm for computing the conditional hidden state distribution in a hidden Markov model, and is also a special case of the belief propagation algorithm. 5 For nonlinear systems this conditional density is in general non-Gaussian and can in fact be quite complex. Multiple approaches exist for inferring the hidden state distribution of such nonlinear systems, including sampling methods  and varia(cid:173) tional approximations . We focus instead in this paper on a classic approach from engineering, Extended Kalman Smoothing (EKS).
Extended Kalman Smoothing simply applies Kalman smoothing to a local lineariza(cid:173) tion of the nonlinear system. At every point x in x-space, the derivatives of the vector-valued functions f and 9 define the matrices, Ax == M I x=x and ex == ~ I x=x' respectively. The dynamics are linearized about Xt, the mean of the Kalman filter state estimate at time t:
The output equation (2) can be similarly linearized. If the prior distribution of the hidden state at t = 1 was Gaussian, then, in this linearized system, the conditional distribution of the hidden state at any time t given the history of inputs and outputs will also be Gaussian. Thus, Kalman smoothing can be used on the linearized system to infer this conditional distribution (see figure 1, left panel).