{"title": "Unified Inference for Variational Bayesian Linear Gaussian State-Space Models", "book": "Advances in Neural Information Processing Systems", "page_first": 81, "page_last": 88, "abstract": null, "full_text": "Unified Inference for Variational Bayesian Linear Gaussian State-Space Models\n\nDavid Barber, IDIAP Research Institute, rue du Simplon 4, Martigny, Switzerland, david.barber@idiap.ch\n\nSilvia Chiappa, IDIAP Research Institute, rue du Simplon 4, Martigny, Switzerland, silvia.chiappa@idiap.ch\n\nAbstract\n\nLinear Gaussian State-Space Models are widely used and a Bayesian treatment of parameters is therefore of considerable interest. The approximate Variational Bayesian method applied to these models is an attractive approach, used successfully in applications ranging from acoustics to bioinformatics. The most challenging aspect of implementing the method is in performing inference on the hidden state sequence of the model. We show how to convert the inference problem so that standard Kalman Filtering/Smoothing recursions from the literature may be applied. This is in contrast to previously published approaches based on Belief Propagation. Our framework both simplifies and unifies the inference problem, so that future applications may be more easily developed. We demonstrate the elegance of the approach on Bayesian temporal ICA, with an application to finding independent dynamical processes underlying noisy EEG signals.\n\n1 Linear Gaussian State-Space Models\n\nLinear Gaussian State-Space Models (LGSSMs)^1 are fundamental in time-series analysis [1, 2, 3]. In these models the observations v_{1:T}^2 are generated from an underlying dynamical system on h_{1:T} according to:\n\nv_t = B h_t + η^v_t, η^v_t ~ N(0_V, Σ_V), h_t = A h_{t-1} + η^h_t, η^h_t ~ N(0_H, Σ_H),\n\nwhere N(μ, Σ) denotes a Gaussian with mean μ and covariance Σ, and 0_X denotes an X-dimensional zero vector. The observation v_t has dimension V and the hidden state h_t has dimension H. 
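To make the generative process concrete, it can be sketched in NumPy as follows (a minimal illustration, not part of the original model specification; the dimensions and parameter values are arbitrary example choices, with μ = 0 and Σ = I for the first state):

```python
import numpy as np

rng = np.random.default_rng(0)
H, V, T = 2, 3, 100                      # hidden dim, observation dim, sequence length

A = 0.9 * np.eye(H)                      # transition matrix (example choice)
B = rng.standard_normal((V, H))          # emission matrix
Sigma_H = 0.1 * np.eye(H)                # state noise covariance
Sigma_V = 0.5 * np.eye(V)                # observation noise covariance

h = np.zeros((T, H))
v = np.zeros((T, V))
h[0] = rng.multivariate_normal(np.zeros(H), np.eye(H))   # h_1 ~ N(mu, Sigma)
v[0] = rng.multivariate_normal(B @ h[0], Sigma_V)
for t in range(1, T):
    h[t] = rng.multivariate_normal(A @ h[t-1], Sigma_H)  # h_t = A h_{t-1} + eta^h_t
    v[t] = rng.multivariate_normal(B @ h[t], Sigma_V)    # v_t = B h_t + eta^v_t
```
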
Probabilistically, the LGSSM is defined by:\n\np(v_{1:T}, h_{1:T}|θ) = p(v_1|h_1)p(h_1) ∏_{t=2}^T p(v_t|h_t)p(h_t|h_{t-1}),\n\nwith p(v_t|h_t) = N(B h_t, Σ_V), p(h_t|h_{t-1}) = N(A h_{t-1}, Σ_H), p(h_1) = N(μ, Σ), and where θ = {A, B, Σ_H, Σ_V, μ, Σ} denotes the model parameters. Because of the widespread use of these models, a Bayesian treatment of parameters is of considerable interest [4, 5, 6, 7, 8]. An exact implementation of the Bayesian LGSSM is formally intractable [8], and recently a Variational Bayesian (VB) approximation has been studied [4, 5, 6, 7, 9]. The most challenging part of implementing the VB method is performing inference over h_{1:T}, and previous authors have developed their own specialized routines, based on Belief Propagation, since standard LGSSM inference routines appear, at first sight, not to be applicable.\n\n^1 Also called Kalman Filters/Smoothers, Linear Dynamical Systems. ^2 v_{1:T} denotes v_1, . . . , v_T.\n\nA key contribution of this paper is to show how the Variational Bayesian treatment of the LGSSM can be implemented using standard LGSSM inference routines. Based on the insight we provide, any standard inference method may be applied, including those specifically addressed to improve numerical stability [2, 10, 11]. In this article, we describe the predictor-corrector and Rauch-Tung-Striebel recursions [2], and also suggest a small modification that reduces computational cost. The Bayesian LGSSM is of particular interest when strong prior constraints are needed to find adequate solutions. One such case is in EEG signal analysis, where we wish to extract sources that evolve independently through time. Since EEG is particularly noisy [12], a prior that encourages sources to have preferential dynamics is advantageous. 
This application is discussed in Section 4, and demonstrates the ease of applying our VB framework.\n\n2 Bayesian Linear Gaussian State-Space Models\n\nIn the Bayesian treatment of the LGSSM, instead of considering the model parameters θ as fixed, we define a prior distribution p(θ|θ̂), where θ̂ is a set of hyperparameters. Then:\n\np(v_{1:T}|θ̂) = ∫ p(v_{1:T}|θ) p(θ|θ̂) dθ. (1)\n\nIn a full Bayesian treatment we would define additional prior distributions over the hyperparameters θ̂. Here we take instead the ML-II (`evidence') framework, in which the optimal set of hyperparameters is found by maximizing p(v_{1:T}|θ̂) with respect to θ̂ [6, 7, 9]. For the parameter priors, here we define Gaussians on the columns of A and B^3:\n\np(A|α, Σ_H) ∝ ∏_{j=1}^H exp( -(α_j/2) (A_j - Â_j)^T Σ_H^{-1} (A_j - Â_j) ), p(B|β, Σ_V) ∝ ∏_{j=1}^H exp( -(β_j/2) (B_j - B̂_j)^T Σ_V^{-1} (B_j - B̂_j) ),\n\nwhich has the effect of biasing the transition and emission matrices towards desired forms Â and B̂. The conjugate priors for general inverse covariances Σ_H^{-1} and Σ_V^{-1} are Wishart distributions [7]^4. In the simpler case assumed here of diagonal covariances these become Gamma distributions [5, 7]. The hyperparameters are then θ̂ = {α, β}^5.\n\nVariational Bayes\n\nOptimizing Eq. (1) with respect to θ̂ is difficult due to the intractability of the integrals. Instead, in VB, one considers the lower bound [6, 7, 9]^6:\n\nL = log p(v_{1:T}|θ̂) ≥ H_q(θ, h_{1:T}) + ⟨log p(θ|θ̂)⟩_{q(θ)} + ⟨E(h_{1:T}, θ)⟩_{q(θ,h_{1:T})} ≡ F,\n\nwhere E(h_{1:T}, θ) ≡ log p(v_{1:T}, h_{1:T}|θ), H_d(x) signifies the entropy of the distribution d(x), and ⟨·⟩_{d(x)} denotes the expectation operator.\n\nThe key approximation in VB is q(θ, h_{1:T}) ≈ q(θ)q(h_{1:T}), from which one may show that, for optimality of F,\n\nq(h_{1:T}) ∝ e^{⟨E(h_{1:T},θ)⟩_{q(θ)}}, q(θ) ∝ p(θ|θ̂) e^{⟨E(h_{1:T},θ)⟩_{q(h_{1:T})}}.\n\nThese coupled equations need to be iterated to convergence. The updates for the parameters q(θ) are straightforward and are given in Appendices A and B. 
Once converged, the hyperparameters are updated by maximizing F with respect to θ̂, which leads to simple update formulae [7]. Our main concern is with the update for q(h_{1:T}), for which this paper makes a departure from treatments previously presented.\n\n^3 More general Gaussian priors may be more suitable depending on the application. ^4 For expositional simplicity, we do not put priors on μ and Σ. ^5 For simplicity, we keep the parameters of the Gamma priors fixed. ^6 Strictly we should write throughout q(·|v_{1:T}). We omit the dependence on v_{1:T} for notational convenience.\n\n3 Unified Inference on q(h_{1:T})\n\nOptimally, q(h_{1:T}) is Gaussian since, up to a constant, ⟨E(h_{1:T}, θ)⟩_{q(θ)} is quadratic in h_{1:T}^7:\n\n-(1/2) ∑_{t=1}^T [ ⟨(v_t - B h_t)^T Σ_V^{-1} (v_t - B h_t)⟩_{q(B,Σ_V)} + ⟨(h_t - A h_{t-1})^T Σ_H^{-1} (h_t - A h_{t-1})⟩_{q(A,Σ_H)} ]. (2)\n\nIn addition, optimally, q(A|Σ_H) and q(B|Σ_V) are Gaussians (see Appendix A), so we can easily carry out the averages in Eq. (2). The further averages over q(Σ_H) and q(Σ_V) are also easy due to conjugacy. Whilst this defines the distribution q(h_{1:T}), quantities such as q(h_t), required for example for the parameter updates (see the Appendices), need to be inferred from this distribution. Clearly, in the non-Bayesian case, the averages over the parameters are not present, and the above simply represents the posterior distribution of an LGSSM whose visible variables have been clamped into their evidential states. In that case, inference can be performed using any standard LGSSM routine. Our aim, therefore, is to try to represent the averaged Eq. (2) directly as the posterior distribution q̃(h_{1:T}|ṽ_{1:T}) of an LGSSM, for some suitable parameter settings. 
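The obstacle here can be seen numerically: averaging a quadratic form over a distribution on B adds a term quadratic in h_t beyond the "clamped" form with averaged parameters. The following self-contained check (ours, using a toy isotropic q(B) rather than the matrix-variate posterior of Appendix A) compares a Monte Carlo estimate of the average against the exact split into a mean quadratic form plus a fluctuation term:

```python
import numpy as np

rng = np.random.default_rng(1)
V, H = 3, 2
Lam = np.diag(rng.uniform(0.5, 2.0, V))   # plays the role of <Sigma_V^{-1}>
B0 = rng.standard_normal((V, H))          # plays the role of <B>
sigma2 = 0.3                              # toy isotropic posterior variance of B's entries
v = rng.standard_normal(V)
h = rng.standard_normal(H)

# Monte Carlo estimate of <(v - B h)^T Lam (v - B h)> under the toy q(B)
E = rng.standard_normal((100_000, V, H))
Bs = B0 + np.sqrt(sigma2) * E
r = v - Bs @ h
mc = np.einsum('ni,ij,nj->n', r, Lam, r).mean()

# For this toy q(B), S_B = <B^T Lam B> - B0^T Lam B0 = sigma2 * trace(Lam) * I,
# so the average splits into a 'mean' quadratic form plus a fluctuation term in h
r0 = v - B0 @ h
analytic = r0 @ Lam @ r0 + sigma2 * np.trace(Lam) * (h @ h)
```
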
Mean + Fluctuation Decomposition. A useful decomposition is to write\n\n⟨(v_t - B h_t)^T Σ_V^{-1} (v_t - B h_t)⟩_{q(B,Σ_V)} = (v_t - ⟨B⟩ h_t)^T ⟨Σ_V^{-1}⟩ (v_t - ⟨B⟩ h_t) [mean] + h_t^T S_B h_t [fluctuation],\n\nand similarly\n\n⟨(h_t - A h_{t-1})^T Σ_H^{-1} (h_t - A h_{t-1})⟩_{q(A,Σ_H)} = (h_t - ⟨A⟩ h_{t-1})^T ⟨Σ_H^{-1}⟩ (h_t - ⟨A⟩ h_{t-1}) [mean] + h_{t-1}^T S_A h_{t-1} [fluctuation],\n\nwhere the parameter covariances are S_B ≡ ⟨B^T Σ_V^{-1} B⟩ - ⟨B⟩^T ⟨Σ_V^{-1}⟩ ⟨B⟩ = V H_B^{-1} and S_A ≡ ⟨A^T Σ_H^{-1} A⟩ - ⟨A⟩^T ⟨Σ_H^{-1}⟩ ⟨A⟩ = H H_A^{-1} (for H_A and H_B defined in Appendix A). The mean terms simply represent a clamped LGSSM with averaged parameters. However, the extra contributions from the fluctuations mean that Eq. (2) cannot be written as a clamped LGSSM with averaged parameters. In order to deal with these extra terms, our idea is to treat the fluctuations as arising from an augmented visible variable, for which Eq. (2) can then be considered as a clamped LGSSM.\n\nInference Using an Augmented LGSSM. To represent Eq. (2) as an LGSSM q̃(h_{1:T}|ṽ_{1:T}), we may augment v_t and B as^8:\n\nṽ_t = vert(v_t, 0_H, 0_H), B̃ = vert(⟨B⟩, U_A, U_B),\n\nwhere U_A is the Cholesky decomposition of S_A, so that U_A^T U_A = S_A. Similarly, U_B is the Cholesky decomposition of S_B. The equivalent LGSSM q̃(h_{1:T}|ṽ_{1:T}) is then completed by specifying^9\n\nÃ ≡ ⟨A⟩, Σ̃_H ≡ ⟨Σ_H^{-1}⟩^{-1}, Σ̃_V ≡ diag(⟨Σ_V^{-1}⟩^{-1}, I_H, I_H), μ̃ ≡ μ, Σ̃ ≡ Σ.\n\nThe validity of this parameter assignment can be checked by showing that, up to negligible constants, the exponent of this augmented LGSSM has the same form as Eq. (2)^10. Now that this has been written as an LGSSM q̃(h_{1:T}|ṽ_{1:T}), standard inference routines in the literature may be applied to compute q(h_t|v_{1:T}) = q̃(h_t|ṽ_{1:T}) [1, 2, 11]^11.\n\n^7 For simplicity of exposition, we ignore the first time-point here. ^8 The notation vert(x_1, . . . , x_n) stands for vertically concatenating the arguments x_1, . . . , x_n. ^9 Strictly, we need a time-dependent emission B̃_t = B̃, for t = 1, . . . , T - 1. 
For time T, B̃_T has the Cholesky factor U_A replaced by 0_{H,H}. ^10 There are several ways of achieving a similar augmentation. We chose this one since, in the non-Bayesian limit U_A = U_B = 0_{H,H}, no numerical instabilities would be introduced. ^11 Note that, since the augmented LGSSM q̃(h_{1:T}|ṽ_{1:T}) is designed to match the fully clamped distribution q(h_{1:T}|v_{1:T}), the filtered posterior q̃(h_t|ṽ_{1:t}) does not correspond to q(h_t|v_{1:t}).\n\nAlgorithm 1 LGSSM: Forward and backward recursive updates. The smoothed posterior p(h_t|v_{1:T}) is returned in the mean ĥ_t^T and covariance P_t^T.\n\nprocedure FORWARD\n  1a: P ← Σ\n  1b: P ← DΣ, where D ≡ I - Σ U_AB (I + U_AB^T Σ U_AB)^{-1} U_AB^T\n  2a: ĥ_1^0 ← μ\n  2b: ĥ_1^0 ← Dμ\n  3: K ← P B^T (B P B^T + Σ_V)^{-1}, P_1^1 ← (I - K B) P, ĥ_1^1 ← ĥ_1^0 + K (v_1 - B ĥ_1^0)\n  for t ← 2, T do\n    4: P_t^{t-1} ← A P_{t-1}^{t-1} A^T + Σ_H\n    5a: P ← P_t^{t-1}\n    5b: P ← D_t P_t^{t-1}, where D_t ≡ I - P_t^{t-1} U_AB (I + U_AB^T P_t^{t-1} U_AB)^{-1} U_AB^T\n    6a: ĥ_t^{t-1} ← A ĥ_{t-1}^{t-1}\n    6b: ĥ_t^{t-1} ← D_t A ĥ_{t-1}^{t-1}\n    7: K ← P B^T (B P B^T + Σ_V)^{-1}, P_t^t ← (I - K B) P, ĥ_t^t ← ĥ_t^{t-1} + K (v_t - B ĥ_t^{t-1})\n  end for\nend procedure\n\nprocedure BACKWARD\n  for t ← T - 1, 1 do\n    A_t ← P_t^t A^T (P_{t+1}^t)^{-1}\n    P_t^T ← P_t^t + A_t (P_{t+1}^T - P_{t+1}^t) A_t^T\n    ĥ_t^T ← ĥ_t^t + A_t (ĥ_{t+1}^T - A ĥ_t^t)\n  end for\nend procedure\n\nFor completeness, we describe the standard predictor-corrector form of the Kalman Filter, together with the Rauch-Tung-Striebel Smoother [2]. These are given in Algorithm 1, where q(h_t|v_{1:T}) is computed by calling the FORWARD and BACKWARD procedures.\n\nWe present two variants of the FORWARD pass. Either we may call procedure FORWARD in Algorithm 1 with parameters Ã, B̃, Σ̃_H, Σ̃_V, μ̃, Σ̃ and the augmented visible variables ṽ_t, in which case we use steps 1a, 2a, 5a and 6a. This is exactly the predictor-corrector form of a Kalman Filter [2]. 
Otherwise, in order to reduce the computational cost, we may call procedure FORWARD with the parameters ⟨A⟩, ⟨B⟩, ⟨Σ_H^{-1}⟩^{-1}, ⟨Σ_V^{-1}⟩^{-1}, μ, Σ and the original visible variables v_t, in which case we use steps 1b (where U_AB U_AB^T ≡ S_A + S_B), 2b, 5b and 6b. The two algorithms are mathematically equivalent. Computing q(h_t|v_{1:T}) = q̃(h_t|ṽ_{1:T}) is then completed by calling the common BACKWARD pass.\n\nThe important point here is that the reader may supply any standard Kalman Filtering/Smoothing routine, and simply call it with the appropriate parameters. In some parameter regimes, or in very long time-series, numerical stability may be a serious concern, for which several stabilized algorithms have been developed over the years, for example the square-root forms [2, 10, 11]. By converting the problem to a standard form, we have therefore unified and simplified inference, so that future applications may be more readily developed^12.\n\n^12 The computation of the log-likelihood bound does not require any augmentation.\n\n3.1 Relation to Previous Approaches\n\nAn alternative approach to the one above, taken in [5, 7], is to write the posterior as\n\nlog q(h_{1:T}) = ∑_{t=2}^T φ_t(h_{t-1}, h_t) + const.\n\nfor suitably defined quadratic forms φ_t(h_{t-1}, h_t). Here the potentials φ_t(h_{t-1}, h_t) encode the averaging over the parameters A, B, Σ_H, Σ_V. The approach taken in [7] is to recognize this as a pairwise Markov chain, for which the Belief Propagation recursions may be applied. The approach in [5] is based on a Kullback-Leibler minimization of the posterior with a chain structure, which is algorithmically equivalent to Belief Propagation. 
Whilst these are mathematically valid procedures, the resulting algorithms do not correspond to any of the standard forms in the Kalman Filtering/Smoothing literature, whose properties have been well studied [14].\n\n4 An Application to Bayesian ICA\n\nA particular case for which the Bayesian LGSSM is of interest is in extracting independent source signals underlying a multivariate time-series [5, 15]. This will demonstrate how the approach developed in Section 3 makes VB easy to apply. The sources s^i are modeled as independent in the following sense:\n\np(s^i_{1:T}, s^j_{1:T}) = p(s^i_{1:T}) p(s^j_{1:T}), for i ≠ j, i, j = 1, . . . , C.\n\nIndependence implies block diagonal transition and state noise matrices A, Σ_H and Σ, where each block c has dimension H_c. A one-dimensional source s^c_t for each independent dynamical subsystem is then formed from s^c_t = 1_c^T h^c_t, where 1_c is a unit vector and h^c_t is the state of dynamical system c. Combining the sources, we can write s_t = P h_t, where P = diag(1_1^T, . . . , 1_C^T) and h_t = vert(h^1_t, . . . , h^C_t). The resulting emission matrix is constrained to be of the form B = W P, where W is the V × C mixing matrix. This means that the observations are formed from linearly mixing the sources, v_t = W s_t + η^v_t. The graphical structure of this model is presented in Fig. 1. To encourage redundant components to be removed, we place a zero mean Gaussian prior on W. In this case, we do not define a prior for the parameters Σ_H and Σ_V, which are instead considered as hyperparameters. More details of the model are given in [15]. The constraint B = W P requires a minor modification from Section 3, as we discuss below.\n\nFigure 1: The structure of the LGSSM for ICA.\n\nInference on q(h_{1:T})\n\nA small modification of the mean + fluctuation decomposition for B occurs, namely:\n\n⟨(v_t - B h_t)^T Σ_V^{-1} (v_t - B h_t)⟩_{q(W)} = (v_t - ⟨B⟩ h_t)^T Σ_V^{-1} (v_t - ⟨B⟩ h_t) + h_t^T P^T S_W P h_t,\n\nwhere ⟨B⟩ ≡ ⟨W⟩ P and S_W = V H_W^{-1}. 
The quantities ⟨W⟩ and H_W are obtained as in Appendix A.1 with the replacement h_t → P h_t. To represent the above as an LGSSM, we augment v_t and B as\n\nṽ_t = vert(v_t, 0_H, 0_C), B̃ = vert(⟨B⟩, U_A, U_W P),\n\nwhere U_W is the Cholesky decomposition of S_W. The equivalent LGSSM is then completed by specifying Ã ≡ ⟨A⟩, Σ̃_H ≡ ⟨Σ_H^{-1}⟩^{-1}, Σ̃_V ≡ diag(Σ_V, I_H, I_C), μ̃ ≡ μ, Σ̃ ≡ Σ, and inference for q(h_{1:T}) performed using Algorithm 1. This demonstrates the elegance and unity of the approach in Section 3, since no new algorithm needs to be developed to perform inference, even in this special constrained parameter case.\n\n4.1 Demonstration\n\nAs a simple demonstration, we used an LGSSM to generate 3 sources s^c_t with random 5 × 5 transition matrices A^c, μ = 0_H and Σ ≡ Σ_H ≡ I_H. The sources were mixed into three observations v_t = W s_t + η^v_t, for W chosen with elements from a zero mean unit variance Gaussian distribution, and Σ_V = I_V. We then trained a Bayesian LGSSM with 5 sources and 7 × 7 transition matrices A^c. To bias the model to find the simplest sources, we used Â^c ≡ 0_{H_c,H_c} for all sources. In Fig. 2a and Fig. 2b we see the original sources and the noisy observations respectively. In Fig. 2c we see the estimated sources from our method after convergence of the hyperparameter updates. Two of the 5 sources have been removed, and the remaining three are a reasonable estimation of the original sources. Another possible approach for introducing prior knowledge is to use a Maximum a Posteriori (MAP)\n\nFigure 2: (a) Original sources s_t. (b) Observations resulting from mixing the original sources, v_t = W s_t + η^v_t, η^v_t ~ N(0, I). (c) Recovered sources using the Bayesian LGSSM. 
(d) Sources found with MAP LGSSM.\n\nFigure 3: (a) Original raw EEG recordings from 4 channels. (b-e) 16 sources s_t estimated by the Bayesian LGSSM.\n\nprocedure by adding a prior term to the original log-likelihood: log p(v_{1:T}|A, W, Σ_H, Σ_V, μ, Σ) + log p(A|α) + log p(W|β). However, it is not clear how to reliably find the hyperparameters α and β in this case. One solution is to estimate them by optimizing the new objective function jointly with respect to the parameters and hyperparameters (this is the so-called joint MAP estimation; see for example [16]). A typical result of using this joint MAP approach on the artificial data is presented in Fig. 2d. The joint MAP does not estimate the hyperparameters well, and the incorrect number of sources is identified.\n\n4.2 Application to EEG Analysis\n\nIn Fig. 3a we plot three seconds of EEG data recorded from 4 channels (located in the right hemisphere) while a person is performing imagined movement of the right hand. As is typical in EEG, each channel shows drift terms below 1 Hz, which correspond to artifacts of the instrumentation, together with 50 Hz mains contamination; these mask the rhythmical activity related to the mental task, mainly centered at 10 and 20 Hz [17]. We would therefore like a method which enables us to extract components in these information-rich 10 and 20 Hz frequency bands. Standard ICA methods such as FastICA do not find satisfactory sources based on raw `noisy' data, and preprocessing with band-pass filters is usually required. Additionally, in EEG research, flexibility in the number of recovered sources is important since there may be many independent oscillators of interest underlying the observations and we would like some way to automatically determine their effective number. 
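One natural way to encode a preference for oscillation at a chosen frequency is a 2 × 2 rotation-matrix transition, which produces a pure sinusoid when iterated. A short NumPy illustration of ours (the sampling rate and frequency values are arbitrary example choices):

```python
import numpy as np

fs = 256.0                           # sampling rate in Hz (illustrative)
freq = 10.0                          # desired source frequency in Hz
theta = 2 * np.pi * freq / fs        # rotation angle per time step
A_hat_c = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])

# iterating h_t = A_hat_c h_{t-1} produces a pure oscillation at `freq`
T = 256
h = np.zeros((T, 2))
h[0] = [1.0, 0.0]
for t in range(1, T):
    h[t] = A_hat_c @ h[t - 1]
```
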
To preferentially find sources at particular frequencies, we specified a block diagonal matrix Â^c for each source c, where each block is a 2 × 2 rotation matrix at the desired frequency. We defined the following 16 groups of frequencies: [0.5], [0.5], [0.5], [0.5]; [10,11], [10,11], [10,11], [10,11]; [20,21], [20,21], [20,21], [20,21]; [50], [50], [50], [50]. The temporal evolution of the sources obtained after training the Bayesian LGSSM is given in Fig. 3(b,c,d,e) (grouped by frequency range). The Bayesian LGSSM removed 4 unnecessary sources from the mixing matrix W, that is, one [10,11] Hz and three [20,21] Hz sources. The first 4 sources contain dominant low frequency drift; sources 5, 6 and 8 contain [10,11] Hz activity, while source 10 contains [20,21] Hz centered activity. Of the 4 sources initialized to 50 Hz, only 2 retained 50 Hz activity, while the A^c of the other two changed to model other frequencies present in the EEG. This method demonstrates the usefulness and applicability of the VB method in a real-world situation.\n\n5 Conclusion\n\nWe considered the application of Variational Bayesian learning to Linear Gaussian State-Space Models. This is an important class of models with widespread application, and finding a simple way to implement this approximate Bayesian procedure is of considerable interest. The most demanding part of the procedure is inference of the hidden states of the model. Previously, this has been achieved using Belief Propagation, which differs from inference in the Kalman Filtering/Smoothing literature, for which highly efficient and stabilized procedures exist. A central contribution of this paper is to show how inference can be written using the standard Kalman Filtering/Smoothing recursions by augmenting the original model. Additionally, a minor modification to the standard Kalman Filtering routine may be applied for computational efficiency. 
We demonstrated the elegance and unity of our approach by showing how to easily apply a Variational Bayes analysis to temporal ICA. Specifically, our Bayesian ICA approach successfully extracts independent processes underlying EEG signals, biased towards preferred frequency ranges. We hope that this simple and unifying interpretation of Variational Bayesian LGSSMs may therefore facilitate the further application to related models.\n\nA Parameter Updates for A and B\n\nA.1 Determining q(B|Σ_V)\n\nBy examining F, the contribution of q(B|Σ_V) can be interpreted as the negative KL divergence between q(B|Σ_V) and a Gaussian. Hence, optimally, q(B|Σ_V) is a Gaussian. The covariance [Σ_B]_{ij,kl} ≡ ⟨B_{ij} B_{kl}⟩ - ⟨B_{ij}⟩⟨B_{kl}⟩ (averages wrt q(B|Σ_V)) is given by:\n\n[Σ_B]_{ij,kl} = [H_B^{-1}]_{jl} [Σ_V]_{ik}, where [H_B]_{jl} ≡ ∑_{t=1}^T ⟨h_t^j h_t^l⟩_{q(h_t)} + β_j δ_{jl}.\n\nThe mean is given by ⟨B⟩ = N_B H_B^{-1}, where [N_B]_{ij} ≡ ∑_{t=1}^T ⟨h_t^j⟩_{q(h_t)} v_t^i + β_j B̂_{ij}.\n\nDetermining q(A|Σ_H)\n\nOptimally, q(A|Σ_H) is a Gaussian with covariance\n\n[Σ_A]_{ij,kl} = [H_A^{-1}]_{jl} [Σ_H]_{ik}, where [H_A]_{jl} ≡ ∑_{t=1}^{T-1} ⟨h_t^j h_t^l⟩_{q(h_t)} + α_j δ_{jl}.\n\nThe mean is given by ⟨A⟩ = N_A H_A^{-1}, where [N_A]_{ij} ≡ ∑_{t=2}^T ⟨h_{t-1}^j h_t^i⟩_{q(h_{t-1:t})} + α_j Â_{ij}.\n\nB Covariance Updates\n\nBy specifying a Wishart prior for the inverse of the covariances, conjugate update formulae are possible. In practice, it is more common to specify diagonal inverse covariances, for which the corresponding priors are simply Gamma distributions [7, 5]. For this simple diagonal case, the explicit updates are given below. 
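The sufficient statistics of the q(B|Σ_V) update in Appendix A.1 can be sketched in NumPy (an illustration of ours, with synthetic smoothed moments standing in for those returned by Algorithm 1; in this toy setting the state is treated as known exactly and the prior is vanishingly weak, so the update for ⟨B⟩ reduces to least-squares recovery of the true emission matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
T, H, V = 50, 2, 3
beta = np.full(H, 1e-6)                   # prior precisions on the columns of B (weak)
B_hat = np.zeros((V, H))                  # prior mean for B
B_true = rng.standard_normal((V, H))

# stand-ins for the smoothed moments <h_t> and <h_t h_t^T> from Algorithm 1;
# the state is treated as known exactly, so <h_t h_t^T> = <h_t><h_t>^T
h_mean = rng.standard_normal((T, H))
h_outer = np.einsum('ti,tj->tij', h_mean, h_mean)
v = h_mean @ B_true.T                     # noise-free observations v_t = B h_t

# [H_B]_{jl} = sum_t <h_t^j h_t^l> + beta_j * delta_{jl}
H_B = h_outer.sum(axis=0) + np.diag(beta)
# [N_B]_{ij} = sum_t <h_t^j> v_t^i + beta_j * B_hat_{ij}
N_B = np.einsum('ti,tj->ij', v, h_mean) + beta * B_hat
B_mean = N_B @ np.linalg.inv(H_B)         # <B> = N_B H_B^{-1}
```
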
Determining q(ρ)\n\nFor the constraint Σ_V^{-1} = diag(ρ), where each diagonal element follows a Gamma prior ρ_i ~ Ga(b_1, b_2) [7], q(ρ) factorizes and the optimal updates are\n\nq(ρ_i) = Ga( b_1 + T/2, b_2 + (1/2) [ ∑_{t=1}^T (v_t^i)^2 - [G_B]_{ii} + ∑_j β_j B̂_{ij}^2 ] ), where G_B ≡ N_B H_B^{-1} N_B^T.\n\nDetermining q(τ)\n\nAnalogously, for Σ_H^{-1} = diag(τ) with prior τ_i ~ Ga(a_1, a_2) [5], the updates are\n\nq(τ_i) = Ga( a_1 + (T-1)/2, a_2 + (1/2) [ ∑_{t=2}^T ⟨(h_t^i)^2⟩ - [G_A]_{ii} + ∑_j α_j Â_{ij}^2 ] ), where G_A ≡ N_A H_A^{-1} N_A^T.\n\nAcknowledgments\n\nThis work is supported by the European DIRAC Project FP6-0027787. This paper only reflects the authors' views and funding agencies are not liable for any use that may be made of the information contained herein.\n\nReferences\n\n[1] Y. Bar-Shalom and X.-R. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, 1998.\n[2] M. S. Grewal and A. P. Andrews. Kalman Filtering: Theory and Practice Using MATLAB. John Wiley and Sons, Inc., 2001.\n[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications. Springer, 2000.\n[4] M. J. Beal, F. Falciani, Z. Ghahramani, C. Rangel, and D. L. Wild. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21:349-356, 2005.\n[5] A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing values in audio signals. In 13th European Signal Processing Conference, 2005.\n[6] H. Valpola and J. Karhunen. An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation, 14:2647-2692, 2002.\n[7] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.\n[8] M. Davy and S. J. Godsill. Bayesian harmonic models for musical signal analysis (with discussion). In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics VII. 
Oxford University Press, 2003.\n[9] D. J. C. MacKay. Ensemble learning and evidence maximisation. Unpublished manuscript: www.variational-bayes.org, 1995.\n[10] M. Morf and T. Kailath. Square-root algorithms for least-squares estimation. IEEE Transactions on Automatic Control, 20:487-497, 1975.\n[11] P. Park and T. Kailath. New square-root smoothing algorithms. IEEE Transactions on Automatic Control, 41:727-732, 1996.\n[12] E. Niedermeyer and F. Lopes Da Silva. Electroencephalography: basic principles, clinical applications and related fields. Lippincott Williams and Wilkins, 1999.\n[13] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305-345, 1999.\n[14] M. Verhaegen and P. Van Dooren. Numerical aspects of different Kalman filter implementations. IEEE Transactions on Automatic Control, 31:907-917, 1986.\n[15] S. Chiappa and D. Barber. Bayesian linear Gaussian state-space models for biosignal decomposition. Signal Processing Letters, 14, 2007.\n[16] S. S. Saquib, C. A. Bouman, and K. Sauer. ML parameter estimation for Markov random fields with applications to Bayesian tomography. IEEE Transactions on Image Processing, 7:1029-1044, 1998.\n[17] G. Pfurtscheller and F. H. Lopes da Silva. Event-related EEG/MEG synchronization and desynchronization: basic principles. Clinical Neurophysiology, pages 1842-1857, 1999.\n", "award": [], "sourceid": 3023, "authors": [{"given_name": "David", "family_name": "Barber", "institution": null}, {"given_name": "Silvia", "family_name": "Chiappa", "institution": null}]}