{"title": "Dual Kalman Filtering Methods for Nonlinear Prediction, Smoothing and Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 793, "page_last": 799, "abstract": null, "full_text": "Dual Kalman Filtering Methods for \nNonlinear Prediction, Smoothing, and \n\nEstimation \n\nEric A. Wan \n\nericwan@ee.ogi.edu \n\nAlex T. Nelson \natnelson@ee.ogi.edu \n\nDepartment of Electrical Engineering \n\nOregon Graduate Institute \n\nP.O. Box 91000 Portland, OR 97291 \n\nAbstract \n\nPrediction, estimation, and smoothing are fundamental to signal \nprocessing. To perform these interrelated tasks given noisy data, \nwe form a time series model of the process that generates the \ndata. Taking noise in the system explicitly into account, maximum(cid:173)\nlikelihood and Kalman frameworks are discussed which involve the \ndual process of estimating both the model parameters and the un(cid:173)\nderlying state of the system. We review several established meth(cid:173)\nods in the linear case, and propose severa! extensions utilizing dual \nKalman filters (DKF) and forward-backward (FB) filters that are \napplicable to neural networks. Methods are compared on several \nsimulations of noisy time series. We also include an example of \nnonlinear noise reduction in speech. \n\n1 \n\nINTRODUCTION \n\nConsider the general autoregressive model of a noisy time series with both process \nand additive observation noise: \n\nx(k) \ny(k) \n\nI(x(k - 1), ... x(k - M), w) + v(k - 1) \nx(k) + r(k), \n\n(1) \n(2) \n\nwhere x(k) corresponds to the true underlying time series driven by process noise \nv(k), and 10 is a nonlinear function of past values of x(k) parameterized by w. \n\n\f794 \n\nE. A. Wan and A. T. Nelson \n\nThe only available observation is y(k) which contains additional additive noise r(k) . \nPrediction refers to estimating an x(k) given past observations. (For purposes of \nthis paper we will restrict ourselves to univariate time series.) 
In estimation, x(k) is determined given observations up to and including time k. Finally, smoothing refers to estimating x(k) given all observations, past and future.

The minimum mean square nonlinear prediction of x(k) (or of y(k)) can be written as the conditional expectation E[x(k)|x(k-1)], where x(k) = [x(k), x(k-1), ..., x(0)]. If the time series x(k) were directly available, we could use this data to generate an approximation of the optimal predictor. However, when x(k) is not available (as is generally the case), the common approach is to use the noisy data directly, leading to an approximation of E[y(k)|y(k-1)]. However, this results in a biased predictor: E[y(k)|y(k-1)] = E[x(k)|x(k-1) + r(k-1)] ≠ E[x(k)|x(k-1)].

We may reduce the above bias in the predictor by exploiting the knowledge that the observations y(k) are measurements arising from a time series. Estimates x̂(k) are found (either through estimation or smoothing) such that ||x̂(k) - x(k)|| < ||y(k) - x(k)||. These estimates are then used to form a predictor that approximates E[x(k)|x(k-1)].¹

In the remainder of this paper, we will develop methods for the dual estimation of both states x and weights w. We show how a maximum-likelihood framework can be used to relate several existing algorithms and how established linear methods can be extended to a nonlinear framework. New methods involving the use of dual Kalman filters are also proposed, and experiments are provided to compare results.

2 DUAL ESTIMATION

Given only noisy observations y(k), the dual estimation problem requires consideration of both the standard prediction (or output) errors e_p(k) = y(k) - f(x̂(k-1), w) as well as the observation (or input) errors e_o(k) = y(k) - x̂(k). The minimum observation error variance equals the noise variance σ_r².
The prediction error, however, is correlated with the observation error, since y(k) - f(x(k-1), ..., x(k-M), w) = r(k) + v(k-1), and thus has a minimum variance of σ_r² + σ_v². Assuming the errors are Gaussian, we may construct a log-likelihood function which is proportional to e^T Σ⁻¹ e, where e^T = [e_o(0), e_o(1), ..., e_o(N), e_p(M), e_p(M+1), ..., e_p(N)], a vector of all errors up to time N, and

Σ = E[e e^T].    (3)

Minimization of the log-likelihood function leads to the maximum-likelihood estimates for both x(k) and w. (Although we may also estimate the noise variances σ_v² and σ_r², we will assume in this paper that they are known.) Two general frameworks for optimization are available:

¹Because models are trained on estimated data x̂(k), it is important that estimated data still be used for prediction of out-of-training-set (on-line) data. In other words, if our model was formed as an approximation of E[x(k)|x(k-1)], then we should not provide it with y(k-1) as an input, in order to avoid a model mismatch.

2.1 Errors-In-Variables (EIV) Methods

This method comes from the statistics literature for nonlinear regression (see Seber and Wild, 1989), and involves batch optimization of the cost function in Equation 3. Only minor modifications are made to account for the time series model. These methods, however, are memory intensive (Σ is approximately 2N × 2N) and also do not accommodate new data in an efficient manner. Retraining is necessary on all the data in order to produce estimates for the new data points.

If we ignore the cross-correlation between the prediction and observation errors, then Σ becomes a diagonal matrix and the cost function may be expressed as simply ∑_{k=1}^{N} γ e_p²(k) + e_o²(k), with γ = σ_r²/(σ_r² + σ_v²).
This is equivalent to the clearning (CLRN) cost function (Weigend, 1995), developed as a heuristic method for cleaning the inputs in neural network modelling problems. While this allows for stochastic optimization, the assumption in the time series formulation may lead to severely biased results. Note also that no estimate is provided for the last point x(N).

When the model f = w^T x is known and linear, EIV reduces to a standard (batch) weighted least squares procedure which can be solved in closed form to generate a maximum-likelihood estimate of the noise-free time series. However, when the linear model is unknown, the problem is far more complicated. The inner product of the parameter vector w with the vector x(k-1) indicates a bilinear relationship between these unknown quantities. Solving for x(k) requires knowledge of w, while solving for w requires x(k). Iterative methods are necessary to solve the nonlinear optimization, and a Newton-type batch method is typically employed. An EIV method for nonlinear models is also readily developed, but the computational expense makes it less practical in the context of neural networks.

2.2 Kalman Methods

Kalman methods involve reformulation of the problem into a state-space framework in order to efficiently optimize the cost function in a recursive manner. At each time point, an optimal estimate is achieved by combining both a prior prediction and a new observation. Connor (1994) proposed using an extended Kalman filter with a neural network to perform state estimation alone. Puskorius and Feldkamp (1994) and others have posed the weight estimation in a state-space framework to allow Kalman training of a neural network. Here we extend these ideas to include the dual Kalman estimation of both states and weights for efficient maximum-likelihood optimization.
We also introduce the use of forward-backward information filters and further explicate relationships to the EIV methods.

A state-space formulation of Equations 1 and 2 is as follows:

x(k) = F[x(k-1)] + Bv(k-1)    (4)
y(k) = Cx(k) + r(k)    (5)

where

x(k) = [x(k), x(k-1), ..., x(k-M+1)]^T,
F[x(k)] = [f(x(k), ..., x(k-M+1), w), x(k), ..., x(k-M+2)]^T,
B = [1, 0, ..., 0]^T,    (6)

and C = B^T. If the model is linear, then f(x(k)) takes the form w^T x(k), and F[x(k)] can be written as Ax(k), where A is in controllable canonical form.

If the model is linear and the parameters w are known, the Kalman filter (KF) algorithm can be readily used to estimate the states (see Lewis, 1986). At each time step, the filter computes the linear least squares estimate x̂(k) and prediction x̂⁻(k), as well as their error covariances, P_x(k) and P_x⁻(k). In the linear case with Gaussian statistics, the estimates are the minimum mean square estimates. With no prior information on x, they reduce to the maximum-likelihood estimates.

Note, however, that while the Kalman filter provides the maximum-likelihood estimate at each instant in time given all past data, the EIV approach is a batch method that gives a smoothed estimate given all data. Hence, only the estimates x̂(N) at the final time step will match. An exact equivalence for all time is achieved by combining the Kalman filter with a backwards information filter to produce a forward-backward (FB) smoothing filter (Lewis, 1986).² Effectively, an inverse covariance is propagated backwards in time to form backwards state estimates that are combined with the forward estimates. When the data set is large, the FB filter offers significant computational advantages over the batch form.
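As a concrete illustration, the state estimation recursion just described can be sketched for a known linear AR model as follows. This is a minimal NumPy sketch, not the authors' implementation; the function name and the argument names (`w`, `var_v`, `var_r`) are our own illustrative choices.

```python
import numpy as np

def kalman_ar_estimate(y, w, var_v, var_r):
    """Kalman state estimation for a known linear AR(M) model:
    x(k) = w^T x(k-1) + v(k-1),  y(k) = x(k) + r(k)  (Eqs. 4-5).
    Returns the filtered estimates xhat(k) of the clean series."""
    M = len(w)
    A = np.zeros((M, M))            # controllable canonical form
    A[0, :] = w
    A[1:, :-1] = np.eye(M - 1)      # shift delayed samples down
    B = np.zeros((M, 1)); B[0, 0] = 1.0
    C = B.T
    x = np.zeros((M, 1))            # state estimate
    P = np.eye(M)                   # state error covariance
    out = []
    for yk in y:
        # time update (prediction)
        x = A @ x
        P = A @ P @ A.T + var_v * (B @ B.T)
        # measurement update
        S = (C @ P @ C.T).item() + var_r     # innovation variance
        K = P @ C.T / S                      # Kalman gain
        x = x + K * (yk - x[0, 0])
        P = (np.eye(M) - K @ C) @ P
        out.append(x[0, 0])
    return np.array(out)
```

Run on a synthetic AR series, the filtered estimates x̂(k) should incur a smaller error than the raw observations y(k), in keeping with the requirement ||x̂(k) - x(k)|| < ||y(k) - x(k)|| from Section 1.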
When the model is nonlinear, the Kalman filter cannot be applied directly, but requires a linearization of the nonlinear model at each time step. The resulting algorithm is known as the extended Kalman filter (EKF) and effectively approximates the nonlinear function with a time-varying linear one.

2.2.1 Batch Iteration for Unknown Models

Again, when the linear model is unknown, the bilinear relationship between the time series estimates, x̂, and the weight estimates, ŵ, requires an iterative optimization. One approach (referred to as LS-KF) is to use a Kalman filter to estimate x̂(k) with ŵ fixed, followed by least-squares optimization to find ŵ using the current x̂(k). Specifically, the parameters are estimated as ŵ = (X_KF^T X_KF)^{-1} X_KF^T Y, where X_KF is a matrix of KF state estimates, and Y is an N × 1 vector of observations.

For nonlinear models, we use a feedforward neural network to approximate f(·), and replace the LS and KF procedures by backpropagation and extended Kalman filtering, respectively (referred to here as BP-EKF; see Connor, 1994). A disadvantage of this approach is slow convergence, due to keeping a set of inaccurate estimates fixed at each batch optimization stage.

2.2.2 Dual Kalman Filter

Another approach for unknown models is to concatenate both w and x into a joint state vector. The model and time series are then estimated simultaneously by applying an EKF to the nonlinear joint state equations (see Goodwin and Sin, 1994 for the linear case). This algorithm, however, has been known to have convergence problems.

An alternative is to construct a separate state-space formulation for the underlying weights as follows:

w(k) = w(k-1)    (7)
y(k) = f(x̂(k-1), w(k)) + n(k),    (8)
where the state transition is simply an identity matrix, and f(x̂(k-1), w(k)) plays the role of a time-varying nonlinear observation on w.

When the unknown model is linear, the observation takes the form x̂(k-1)^T w(k). Then a pair of dual Kalman filters (DKF) can be run in parallel, one for state estimation and one for weight estimation (see Nelson, 1976). At each time step, all current estimates are used. The dual approach essentially allows us to separate the nonlinear optimization into two linear ones. Assumptions are that x and w remain uncorrelated and that statistics remain Gaussian. Note, however, that the error in each filter should be accounted for by the other. We have developed several approaches to address this coupling, but only present one here for the sake of brevity. In short, we write the variance of the noise n(k) in Equation 8 as C P_x⁻(k) C^T + σ_r², and replace v(k-1) by v(k-1) + (ŵ(k) - w(k))^T x(k-1) in Equation 4 for estimation of x(k). Note that the ability to couple statistics in this manner is not possible in the batch approaches.

We further extend the DKF method to nonlinear neural network models by introducing a dual extended Kalman filtering method (DEKF). This simply requires that Jacobians of the neural network be computed for both filters at each time step. Note that, by feeding x̂(k) into the network, we are implicitly using a recurrent network.

2.2.3 Forward-Backward Methods

All of the Kalman methods can be reformulated by using forward-backward (FB) Kalman filtering to further improve state smoothing. However, the dual Kalman methods require an interleaving of the forward and backward state estimates in order to generate a smooth update at each time step. In addition, using the FB estimates requires caution because their noncausal nature can lead to a biased ŵ if they are used improperly.
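Before turning to the forward-backward variants, the causal dual filter of Section 2.2.2 can be sketched for the linear case as two Kalman filters run in parallel. This is again an illustrative sketch, not the authors' code: for simplicity, the variance of n(k) is approximated here by the constant σ_v² + σ_r² rather than the coupled form described above, and all names are our own.

```python
import numpy as np

def dual_kalman(y, M, var_v, var_r):
    """Dual Kalman filter for a linear AR(M) model with unknown weights.
    A weight filter (Eqs. 7-8: identity transition, observation
    y(k) = xhat(k-1)^T w(k) + n(k)) runs in parallel with a state
    filter (Eqs. 4-5) that uses the current weight estimate."""
    I = np.eye(M)
    B = np.zeros((M, 1)); B[0, 0] = 1.0
    C = B.T
    w = np.zeros((M, 1)); Pw = 10.0 * I   # weight estimate and covariance
    x = np.zeros((M, 1)); Px = I.copy()   # state estimate and covariance
    xhat = []
    for yk in y:
        # weight filter: the previous state estimate acts as the
        # time-varying observation matrix on w
        H = x.T                                       # 1 x M
        Sw = (H @ Pw @ H.T).item() + var_v + var_r    # simplified var of n(k)
        Kw = Pw @ H.T / Sw
        w = w + Kw * (yk - (H @ w).item())
        Pw = (I - Kw @ H) @ Pw
        # state filter: current weights in controllable canonical form
        A = np.zeros((M, M))
        A[0, :] = w.ravel()
        A[1:, :-1] = np.eye(M - 1)
        x = A @ x
        Px = A @ Px @ A.T + var_v * (B @ B.T)
        Sx = (C @ Px @ C.T).item() + var_r
        Kx = Px @ C.T / Sx
        x = x + Kx * (yk - x[0, 0])
        Px = (I - Kx @ C) @ Px
        xhat.append(x[0, 0])
    return w.ravel(), np.array(xhat)
```

Each pass through the loop performs one weight-filter update, using the previous state estimate as the observation matrix H, followed by one state-filter update with the refreshed weights, so that all current estimates are used at each time step.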
Specifically, for LS-FB the weights are computed as ŵ = (X_FB^T X_FB)^{-1} X_FB^T Y, where X_FB is a matrix of FB (smoothed) state estimates. Equivalent adjustments are made to the dual Kalman methods. Furthermore, a model of the time-reversed system is required for the nonlinear case. The explication and results of these algorithms will appear in a future publication.

3 EXPERIMENTS

Table 1 compares the different approaches on two linear time series, both when the linear model is known and when it is unknown. The least squares (LS) estimation of the weights in the bottom row represents a baseline performance wherein no noise model is used. In-sample training set predictions must be interpreted carefully, as all training set data is being used to optimize the weights. We see that the Kalman-based methods perform better out of training set (recall the model-mismatch issue¹). Further, only the Kalman methods allow for on-line estimation (on the test set, the state-estimation Kalman filters continue to operate with the weight estimates fixed). The forward-backward method further improves performance over the KF methods. Meanwhile, the clearning-equivalent cost function sacrifices both state and weight estimation MSE for improved in-sample prediction; the resulting test set performance is significantly worse.

Table 1: Comparison of methods for two linear models

Model Known
          Train 1          Test 1            Train 2          Test 2
          Est.   Pred.    Est.   Pred.      Est.   Pred.    Est.   Pred.
MLE       .094   .322      -     1.09       .165   .558      -     1.32
CLRN      .203   .134      -     1.08       .343   .342      -     1.32
KF        .134   .559     .132   0.59       .197   .778     .221   0.85
FB        .094   .559     .132   0.59       .165   .778     .221   0.85

Model Unknown
          Train 1          Test 1          w        Train 2          Test 2          w
          Est.   Pred.    Est.   Pred.              Est.   Pred.    Est.   Pred.
EIV        -      -        -      -        -        .172   .545      -     1.81     .122
CLRN       -      -        -      -        -        .278   .049      -     14.1     11.28
LS-KF     .138   .563     .139   .605     .134      .197   .778     .226   0.85     .325
LS-FB     .099   .347     .136   .603     .281      .169   .612     .229   0.89     .369
DKF       .135   .557     .133   .595     .212      .198   .779     .221   .863     .149
DFB       .096   .329     .134   .596     .187      .165   .587     .221   .859     .065
LS         -     .886      -     1.09     .612       -     1.08      -     1.32     0.590

MSE values for estimation (Est.), prediction (Pred.), and weights (w), normalized to the signal variance. Series 1: AR(11) model, σ_v² = 4, σ_r² = 1; 2000 training samples, 1000 testing samples. EIV and CLRN were not computed for the unknown model due to memory constraints. Series 2: AR(5) model, σ_v² = .7, σ_r² = .5; 375 training samples, 125 testing samples.

Table 2: Comparison of methods on nonlinear time series

            NNet 1                 NNet 2                 NNet 3
          Train       Test       Train       Test       Train       Test
          Est. Pred.  Est. Pred. Est. Pred.  Est. Pred. Est. Pred.  Est. Pred.
BP-EKF    .17  .58    .15  .63   .08  .31    .08  .33   .16  .59    .17  .59
DEKF      .14  .57    .13  .59   .07  .30    .06  .32   .14  .56    .14  .55
BP        .95  .57    .95  .69   .22  .30    .29  .36   .92  .68    .92  .68

The series NNet 1, 2, 3 are generated by autoregressive neural networks which exhibit limit cycle and chaotic behavior. σ_v² = .16, σ_r² = .81; 2700 training samples, 1300 testing samples. All network models were fit using 10 inputs and 5 hidden units. Cross-validation was not used in any of the methods.

Several time series were used to compare the nonlinear methods, with the results summarized in Table 2. Conclusions parallel those for the linear case. Note that the DEKF method performed better than the baseline provided by standard backpropagation (wherein no model of the noise is used). The DEKF method exhibited fast convergence, requiring only 10-20 epochs for training.
A DEFB method is under development.

The DEKF was tested on a speech signal corrupted with simulated bursting white noise (Figure 1). The method was applied to successive 64 ms (512 point) windows of the signal, with a new window starting every 8 ms (64 points). The results in the figure were computed assuming both σ_v² and σ_r² were known. The average SNR is improved by 9.94 dB. We also ran the experiment when σ_v² and σ_r² were estimated using only the noisy signal (Nelson and Wan, 1997), and achieved an SNR improvement of 8.50 dB. In comparison, available "state-of-the-art" techniques of spectral subtraction (Boll, 1979) and RASTA processing (Hermansky et al., 1995) achieve SNR improvements of only .65 and 1.26 dB, respectively. We extend the algorithms to the colored noise case in a second paper (Nelson and Wan, 1997).

Figure 1: Cleaning Noisy Speech With The DEKF. 33,000 pts (5 sec.) shown. (Panels: Clean Speech, Noise, Noisy Speech, Cleaned Speech.)

4 CONCLUSIONS

We have described various methods under a Kalman framework for the dual estimation of both states and weights of a noisy time series. These methods utilize both process and observation noise models to improve estimation performance. Work in progress includes extensions for colored noise, blind signal separation, forward-backward filtering, and noise estimation. While further study is needed, the dual extended Kalman filter methods for neural network prediction, estimation, and smoothing offer potentially powerful new tools for signal processing applications.

Acknowledgements

This work was sponsored in part by NSF under grant ECS-9410823 and by ARPA/AASERT Grant DAAH04-95-1-0485.

References

S.F. Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. ASSP-27, pp. 113-120,
April 1979.

J. Connor, R. Martin, L. Atlas. Recurrent neural networks and robust time series prediction. IEEE Trans. on Neural Networks. March 1994.

F. Lewis. Optimal Estimation. John Wiley & Sons, Inc., New York. 1986.

G. Goodwin, K.S. Sin. Adaptive Filtering Prediction and Control. Prentice-Hall, Inc., Englewood Cliffs, NJ. 1994.

H. Hermansky, E. Wan, C. Avendano. Speech enhancement based on temporal processing. ICASSP Proceedings. 1995.

A. Nelson, E. Wan. Neural speech enhancement using dual extended Kalman filtering. Submitted to ICNN'97.

L. Nelson, E. Stear. The simultaneous on-line estimation of parameters and states in linear systems. IEEE Trans. on Automatic Control. February 1976.

G. Puskorius, L. Feldkamp. Neural control of nonlinear dynamic systems with Kalman filter trained recurrent networks. IEEE Trans. on Neural Networks, vol. 5, no. 2. 1994.

G. Seber, C. Wild. Nonlinear Regression. John Wiley & Sons. 1989.

A. Weigend, H.G. Zimmermann. Clearning. University of Colorado Computer Science Technical Report CU-CS-772-95. May 1995.
", "award": [], "sourceid": 1202, "authors": [{"given_name": "Eric", "family_name": "Wan", "institution": null}, {"given_name": "Alex", "family_name": "Nelson", "institution": null}]}