{"title": "Rao-Blackwellised Particle Filtering via Data Augmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 561, "page_last": 567, "abstract": null, "full_text": "Rao-Blackwellised Particle Filtering via Data Augmentation

Christophe Andrieu, Statistics Group, University of Bristol, University Walk, Bristol BS8 1TW, UK. C.Andrieu@bristol.ac.uk

Nando de Freitas, Computer Science, UC Berkeley, 387 Soda Hall, Berkeley, CA 94720-1776, USA. jfgf@cs.berkeley.edu

Arnaud Doucet, EE Engineering, University of Melbourne, Parkville, Victoria 3052, Australia. doucet@ee.mu.oz.au

Abstract

In this paper, we extend the Rao-Blackwellised particle filtering method to more complex hybrid models consisting of Gaussian latent variables and discrete observations. This is accomplished by augmenting the models with artificial variables that enable us to apply Rao-Blackwellisation. Other improvements include the design of an optimal importance proposal distribution and being able to swap the sampling and selection steps to handle outliers. We focus on sequential binary classifiers that consist of linear combinations of basis functions, whose coefficients evolve according to a Gaussian smoothness prior. Our results show significant improvements.

1 Introduction

Sequential Monte Carlo (SMC) particle methods go back to the first publicly available paper in the modern field of Monte Carlo simulation (Metropolis and Ulam 1949); see (Doucet, de Freitas and Gordon 2001) for a comprehensive review. SMC is often referred to as particle filtering (PF) in the context of computing filtering distributions for statistical inference and learning. It is known that the performance of PF often deteriorates in high-dimensional state spaces.
In the past, we have shown that if a model admits partial analytical tractability, it is possible to combine PF with exact algorithms (Kalman filters, HMM filters, the junction tree algorithm) to obtain efficient high-dimensional filters (Doucet, de Freitas, Murphy and Russell 2000, Doucet, Godsill and Andrieu 2000). In particular, we exploited a marginalisation technique known as Rao-Blackwellisation (RB).

Here, we attack a more complex model that does not admit immediate analytical tractability. This probabilistic model consists of Gaussian latent variables and binary observations. We show that by augmenting the model with artificial variables, it becomes possible to apply Rao-Blackwellisation and optimal sampling strategies. We focus on the problem of sequential binary classification (that is, when the data arrives one-at-a-time) using generic classifiers that consist of linear combinations of basis functions, whose coefficients evolve according to a Gaussian smoothness prior (Kitagawa and Gersch 1996). We have previously addressed this problem in the context of sequential fault detection in marine diesel engines (Højen-Sørensen, de Freitas and Fog 2000). This application is of great importance, as early detection of incipient faults can improve safety and efficiency, as well as help to reduce downtime and plant maintenance in many industrial and transportation environments.

2 Model Specification and Estimation Objectives

Let us consider the following binary classification model. Given at time t = 1, 2, ... an input x_t, we observe z_t \in \{0, 1\} such that

Pr(z_t = 1 | x_t, \beta_t) = \Phi(f(x_t, \beta_t)),   (1)

where \Phi(u) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{u} \exp(-a^2/2)\, da is the cumulative distribution function of the standard normal distribution. This is the so-called probit link. By convention, researchers tend to adopt a logistic (sigmoidal) link function \varphi(u) = (1 + \exp(-u))^{-1}.
However, from a Bayesian computational point of view, the probit link has many advantages and is equally valid. The unknown function is modelled as

f(x_t, \beta_t) = \sum_{k=1}^{K} \beta_{t,k} \Psi_k(x_t) = \Psi^T(x_t) \beta_t,

where we have assumed that the basis functions \Psi(x_t) \triangleq (\Psi_1(x_t), ..., \Psi_K(x_t))^T do not depend on unknown parameters; see (Andrieu, de Freitas and Doucet 1999) for the more general case. \beta_t \triangleq (\beta_{t,1}, ..., \beta_{t,K})^T \in \mathbb{R}^K is a set of unknown time-varying regression coefficients. To complete the model, we assume that they satisfy

\beta_t = A_t \beta_{t-1} + B_t v_t,   \beta_0 ~ N(m_0, P_0),   (2)

where v_t ~ i.i.d. N(0, I_{n_v}) and A and B control model correlations and smoothing (regularisation). Typically K is rather large, say 10 or 100, and the bases \Psi_k(.) are multivariate splines, wavelets or radial basis functions (Holmes and Mallick 1998).

2.1 Augmented Statistical Model

We augment the probabilistic model artificially to obtain more efficient sampling algorithms, as will be detailed in the next section. In particular, we introduce the set of independent variables y_t, such that

y_t = f(x_t, \beta_t) + n_t,   (3)

where n_t ~ i.i.d. N(0, 1), and define z_t = 1 if y_t > 0 and z_t = 0 otherwise. It is then easy to check that one has Pr(z_t = 1 | x_t, \beta_t) = \Phi(f(x_t, \beta_t)).

This data augmentation strategy was first introduced in econometrics by economics Nobel laureate Daniel McFadden (McFadden 1989). In the MCMC context, it has been used to design efficient samplers (Albert and Chib 1993). Here, we will show how to take advantage of it in an SMC setting.
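The identity behind the augmentation is easy to verify numerically: drawing y = f + n with n ~ N(0, 1) and thresholding at zero reproduces the probit probability \Phi(f). A minimal Python sketch (the names probit and augmented_class_prob are ours, purely illustrative):

```python
import math
import random

def probit(u):
    # Standard normal CDF: Phi(u) = 0.5 * (1 + erf(u / sqrt(2))).
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def augmented_class_prob(f, n_samples=200_000, seed=0):
    # Simulate the artificial variable y = f + n, n ~ N(0, 1), and set
    # z = 1 whenever y > 0; the fraction of z = 1 draws estimates Phi(f).
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if f + rng.gauss(0.0, 1.0) > 0.0)
    return hits / n_samples

f = 0.7
print(probit(f))                # exact Pr(z = 1 | f)
print(augmented_class_prob(f))  # Monte Carlo estimate via the augmentation
```

With a few hundred thousand draws the two numbers agree to about two decimal places.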
2.2 Estimation Objectives

Given, at time t, the observations o_{1:t} \triangleq (x_{1:t}, z_{1:t}), any Bayesian inference is based on the posterior distribution P(d\beta_{0:t} | o_{1:t}).[1] We are, therefore, interested in estimating this distribution sequentially in time, together with some of its features, such as E(f(x_t, \beta_t) | o_{1:t}) or the marginal predictive distribution at time t for new input data x_{t+1}, that is Pr(z_{t+1} = 1 | o_{1:t}, x_{t+1}). The posterior density satisfies a time recursion according to Bayes' rule, but it does not admit an analytical expression and, consequently, we need to resort to numerical methods to approximate it.

[1] For any \theta, we use P(d\theta_{0:t} | o_{1:t}) to denote the distribution and p(\theta_{0:t} | o_{1:t}) to denote the density, where P(d\theta_{0:t} | o_{1:t}) = p(\theta_{0:t} | o_{1:t}) d\theta_{0:t}. Also, \theta_{0:t} \triangleq \{\theta_0, \theta_1, ..., \theta_t\}.

3 Sequential Bayesian Estimation via Particle Filtering

A straightforward application of SMC methods to the model (1)-(2) would focus on sampling from the high-dimensional distribution P(d\beta_{0:t} | o_{1:t}) (Højen-Sørensen et al. 2000). A substantially more efficient strategy is to exploit the augmentation of the model to sample only from the low-dimensional distribution P(dy_{1:t} | o_{1:t}). The low-dimensional samples then allow us to compute the remaining estimates analytically, as shown in the following subsection.

3.1 Augmentation and Rao-Blackwellisation

Consider the extended model defined by equations (1)-(2)-(3). One has

p(\beta_{0:t} | o_{1:t}) = \int p(\beta_{0:t} | x_{1:t}, y_{1:t}) p(y_{1:t} | o_{1:t}) dy_{1:t}.

Thus, if we have a Monte Carlo approximation of P(dy_{1:t} | o_{1:t}) of the form

P_N(dy_{1:t} | o_{1:t}) = \sum_{i=1}^{N} w_t^{(i)} \delta_{y_{1:t}^{(i)}}(dy_{1:t}),

then p(\beta_{0:t} | o_{1:t}) can be approximated via

p_N(\beta_{0:t} | o_{1:t}) = \sum_{i=1}^{N} w_t^{(i)} p(\beta_{0:t} | x_{1:t}, y_{1:t}^{(i)}),

that is, a mixture of Gaussians. From this approximation, one can estimate E(\beta_t | x_{1:t}, y_{1:t}) and E(\beta_{t-1} | x_{1:t}, y_{1:t}).
For example, an estimate of the predictive distribution is given by

Pr_N(z_{t+1} = 1 | o_{1:t}, x_{t+1}) = \int Pr(z_{t+1} = 1 | y_{t+1}) P_N(dy_{1:t+1} | o_{1:t}, x_{t+1}) = \sum_{i=1}^{N} w_t^{(i)} I_{(0,+\infty)}(y_{t+1}^{(i)}),   (4)

where y_{t+1}^{(i)} ~ P(dy_{t+1} | x_{1:t+1}, y_{1:t}^{(i)}). This shows that we can restrict ourselves to the estimation of P(y_{1:t} | o_{1:t}) for inference purposes.

In the SMC framework, we must estimate the \"target\" density p(y_{1:t} | o_{1:t}) pointwise up to a normalising constant. By standard factorisation, one has p(y_{1:t} | o_{1:t}) \propto \prod_{k=1}^{t} Pr(z_k | y_k) p(y_k | x_{1:k}, y_{1:k-1}), where p(y_1 | y_{1:0}, x_{1:0}) \triangleq p(y_1 | x_1). Since Pr(z_k | y_k) is known, we only need to estimate p(y_k | x_{1:k}, y_{1:k-1}) up to a normalising constant. This predictive density can be computed using the Kalman filter. Given (x_{1:k}, y_{1:k-1}), the Kalman filter equations are the following. Set \beta_{0|0} = m_0 and \Sigma_{0|0} = \Sigma_0; then, for t = 1, ..., k-1, compute

\beta_{t|t-1} = A_t \beta_{t-1|t-1}
\Sigma_{t|t-1} = A_t \Sigma_{t-1|t-1} A_t^T + B_t B_t^T
S_t = \Psi^T(x_t) \Sigma_{t|t-1} \Psi(x_t) + 1
y_{t|t-1} = \Psi^T(x_t) \beta_{t|t-1}
\beta_{t|t} = \beta_{t|t-1} + \Sigma_{t|t-1} \Psi(x_t) S_t^{-1} (y_t - y_{t|t-1})
\Sigma_{t|t} = \Sigma_{t|t-1} - \Sigma_{t|t-1} \Psi(x_t) S_t^{-1} \Psi^T(x_t) \Sigma_{t|t-1},   (5)

where \beta_{t|t-1} \triangleq E(\beta_t | x_{1:t-1}, y_{1:t-1}), \beta_{t|t} \triangleq E(\beta_t | x_{1:t}, y_{1:t}), y_{t|t-1} \triangleq E(y_t | x_{1:t}, y_{1:t-1}), \Sigma_{t|t-1} \triangleq cov(\beta_t | x_{1:t-1}, y_{1:t-1}), \Sigma_{t|t} \triangleq cov(\beta_t | x_{1:t}, y_{1:t}) and S_t \triangleq cov(y_t | x_{1:t}, y_{1:t-1}). One obtains

p(y_k | x_{1:k}, y_{1:k-1}) = N(y_k; y_{k|k-1}, S_k).   (6)

3.2 Sampling Algorithm

In this section, we briefly outline the PF algorithm for generating samples from P(dy_{1:t} | o_{1:t}). (For details, please refer to our extended technical report at http://www.cs.berkeley.edu/~jfgf/publications.html.) Assume that at time t-1 we have N particles {y_{1:t-1}^{(i)}}_{i=1}^{N} distributed according to P(dy_{1:t-1} | o_{1:t-1}), from which one can get the following empirical distribution approximation

P_N(dy_{1:t-1} | o_{1:t-1}) = \frac{1}{N} \sum_{i=1}^{N} \delta_{y_{1:t-1}^{(i)}}(dy_{1:t-1}).
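For a single basis function (K = 1), the recursion (5) collapses to scalar arithmetic. The following Python sketch of one filtering step is our own paraphrase of (5)-(6), not code from the paper; kalman_step and its argument names are illustrative:

```python
import math

def kalman_step(beta, sigma, a, b, psi, y):
    # One step of recursion (5) for a single basis coefficient (K = 1).
    # beta, sigma: previous filtered mean and variance, beta_{t-1|t-1} and Sigma_{t-1|t-1}.
    beta_pred = a * beta                # beta_{t|t-1} = A beta_{t-1|t-1}
    sigma_pred = a * sigma * a + b * b  # Sigma_{t|t-1} = A Sigma A' + B B'
    s = psi * sigma_pred * psi + 1.0    # S_t = Psi' Sigma_{t|t-1} Psi + 1
    y_pred = psi * beta_pred            # y_{t|t-1} = Psi' beta_{t|t-1}
    gain = sigma_pred * psi / s
    beta_filt = beta_pred + gain * (y - y_pred)
    sigma_filt = sigma_pred - gain * psi * sigma_pred
    # The predictive density of y_t is N(y_pred, s), i.e. equation (6).
    return beta_filt, sigma_filt, y_pred, s

# One step with A = 1, B = sqrt(0.1) (the paper's delta^2 = 0.1) and prior variance 5:
print(kalman_step(0.0, 5.0, 1.0, math.sqrt(0.1), 1.0, 1.0))
```

Note that the variance quantities sigma_pred, s and sigma_filt do not depend on the observed y, which is what makes Remark 2 below possible.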
Various SMC methods can be used to obtain N new paths {y_{1:t}^{(i)}}_{i=1}^{N} distributed approximately according to P(dy_{1:t} | o_{1:t}). The most successful of these methods typically combine importance sampling and a selection scheme. Their asymptotic convergence (N \to \infty) is satisfied under mild conditions (Crisan and Doucet 2000). Since the selection step is standard (Doucet et al. 2001), we shall concentrate on describing the importance sampling step. To obtain samples from P(dy_{1:t} | o_{1:t}), we can sample from a proposal distribution Q(dy_{1:t}) and weight the samples appropriately. Typically, researchers use the transition prior as the proposal distribution (Isard and Blake 1996). Here, we implement an optimal proposal distribution, that is, one that minimises the variance of the importance weights w(y_{1:t}) conditional upon not modifying the path y_{1:t-1}. In our case, we have

p(y_t | x_{1:t}, y_{1:t-1}, z_t) \propto p(y_t | x_{1:t}, y_{1:t-1}) I_{[0,+\infty)}(y_t) if z_t = 1, and p(y_t | x_{1:t}, y_{1:t-1}) I_{(-\infty,0)}(y_t) if z_t = 0,

which is a truncated Gaussian version of (6), and consequently

w(y_{1:t}) \propto Pr(z_t | x_{1:t}, y_{1:t-1}) = (1 - \Phi(-y_{t|t-1}/\sqrt{S_t}))^{z_t} \Phi(-y_{t|t-1}/\sqrt{S_t})^{1-z_t}.   (7)

The algorithm is shown in Figure 1. (Please refer to our technical report for convergence details.)

Remark 1 When we adopt the optimal proposal distribution, the importance weight w_t \propto Pr(z_t | x_{1:t}, y_{1:t-1}) does not depend on y_t. It is thus possible to carry out the selection step before the sampling step. The algorithm is then similar to the auxiliary variable particle filter of (Pitt and Shephard 1999). This modification to the original algorithm has important implications. It enables us to search for more
\n\n-(i) \n\n:::{i) \n\n(i) \n\n(i)\n\nh. \n\n( \n\n. \n\nSelection step \n\n\u2022 Multiply/Discard particles {~i),,B~i~_l}~l with respect to high/low impor-\n\u00b7 \n\n(i) b' N \n\npartlc es Yt \n\ntance welg ts W t \n\nt lt- 1 i=l ' \n\nto 0 tam \n\n(i) (3(i) \n\n} N \n\n. h \n\n. I \n\n{\n\nUpdatmg step \n\n, \n\n\u2022 Compute ~t+1 I t given ~t l t - 1' \n\u2022 For i = 1, ... , N, use one step of the Kalman recursion (5) to compute {,B~i~ l l t } \n\ngiven {y/ ,(3 ti t-1 } and ~t l t-1' \n\nC) -C) \n\nFigure 1: RBPF for semiparametric binary classification. \n\nlikely regions of the posterior at time t-1 using the information at time t to generate \nbetter samples at time t. In practice, this increases the robustness of the algorithm \nto outliers and allows us to apply it in situations where the distributions are very \npeaked (e.g., econometrics and almost deterministic sensors and actuators). \n\nRemark 2 Th e covariance updates of the Kalman jilter are outside the loop over \nparticles. This results in substantial computational savings. \n\n4 Simulations \n\nTo compare our model, using the RBPF algorithm, to standard logistic and probit \nclassification with PF, we generated data from clusters that change with time as \nshown in Figure 2. This data set captures the characteristics of a fault detection \nproblem that we are currently studying. (For some results of applying PF to fault \ndetection in marine diesel engines, please refer to (H0jen-S0rensen et al. 2000). \nMore results will become available once permission is granted.) This data cannot \nbe easily separated with an algorithm based on a time-invariant model. \n\n'\" N(O , 51) \nFor the results presented here, we set the initial distributions to: (30 \nand Yo '\" N(O, 51). The process matrices were set to A = I and B = JI, where \n82 = 0.1 is a smoothing parameter. The number of bases (cubic splines with random \nlocations) was set to 10. 
(It is of course possible, when we have some data already, to initialise the basis locations so that they correspond to the input data. This trick for efficient classification in high-dimensional input spaces is used in the support vector machine setting (Vapnik 1995).) The experiment was repeated with the number of particles varying between 10 and 400. Figure 3 shows the \"value for money\" summary plot. The new algorithm has a lower computational cost and shows a significant reduction in estimation variance. Note that the computational cost of the RBPF stays consistently low, even for small numbers of particles. This has enabled us to apply the technique to large models consisting of hundreds of bases using a suitable regulariser. Another advantage of PF algorithms for classification is that they yield entire probability estimates of class membership, as shown in Figure 4.

Figure 2: Time-varying data. (Four panels: data from t=1 to t=100, from t=100 to t=200, from t=200 to t=300, and from t=1 to t=300.)

Figure 3: Number of classification errors as the number of particles varies between 10 and 400 (different computational costs; x-axis: computation in flops). The algorithm with the augmentation trick (RBPF) is more efficient than standard PF algorithms.

5 Conclusions

In this paper, we proposed a dynamic Bayesian model for time-varying binary classification and an efficient particle filtering algorithm to perform the required computations.
The efficiency of our algorithm is a result of data augmentation, Rao-Blackwellisation, adopting the optimal importance distribution, being able to swap the sampling and selection steps, and only needing to update the Kalman filter means in the particles loop. This extends the realm of efficient particle filtering to the ubiquitous setting of Gaussian latent variables and binary observations. Extensions to n-ary observations, different link functions and estimation of the hyper-parameters can be carried out in the same framework.

Figure 4: Predictive density.

References

Albert, J. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data, Journal of the American Statistical Association 88(422): 669-679.

Andrieu, C., de Freitas, N. and Doucet, A. (1999). Sequential Bayesian estimation and model selection applied to neural networks, Technical Report CUED/F-INFENG/TR 341, Cambridge University Engineering Department.

Crisan, D. and Doucet, A. (2000). Convergence of sequential Monte Carlo methods, Technical Report CUED/F-INFENG/TR 381, Cambridge University Engineering Department.

Doucet, A., de Freitas, N. and Gordon, N. J. (eds) (2001). Sequential Monte Carlo Methods in Practice, Springer-Verlag.

Doucet, A., de Freitas, N., Murphy, K. and Russell, S. (2000). Rao-Blackwellised particle filtering for dynamic Bayesian networks, in C. Boutilier and M. Goldszmidt (eds), Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers, pp. 176-183.

Doucet, A., Godsill, S. and Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering, Statistics and Computing 10(3): 197-208.

Højen-Sørensen, P., de Freitas, N. and Fog, T. (2000). On-line probabilistic classification with particle filters, IEEE Neural Networks for Signal Processing, Sydney, Australia.

Holmes, C. C. and Mallick, B. K.
(1998). Bayesian radial basis functions of variable dimension, Neural Computation 10(5): 1217-1233.

Isard, M. and Blake, A. (1996). Contour tracking by stochastic propagation of conditional density, European Conference on Computer Vision, Cambridge, UK, pp. 343-356.

Kitagawa, G. and Gersch, W. (1996). Smoothness Priors Analysis of Time Series, Vol. 116 of Lecture Notes in Statistics, Springer-Verlag.

McFadden, D. (1989). A method of simulated moments for estimation of discrete response models without numerical integration, Econometrica 57: 995-1026.

Metropolis, N. and Ulam, S. (1949). The Monte Carlo method, Journal of the American Statistical Association 44(247): 335-341.

Pitt, M. K. and Shephard, N. (1999). Filtering via simulation: Auxiliary particle filters, Journal of the American Statistical Association 94(446): 590-599.

Vapnik, V. (1995). The Nature of Statistical Learning Theory, Springer-Verlag, New York.
", "award": [], "sourceid": 2066, "authors": [{"given_name": "Christophe", "family_name": "Andrieu", "institution": null}, {"given_name": "Nando", "family_name": "Freitas", "institution": null}, {"given_name": "Arnaud", "family_name": "Doucet", "institution": null}]}