{"title": "Learning Multi-Class Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 389, "page_last": 395, "abstract": null, "full_text": "Learning multi-class dynamics \n\nA. Blake, B. North and M. Isard \n\nDepartment of Engineering Science, University of Oxford, Oxford OXl 3P J, UK. \n\nWeb: http://www.robots.ox.ac.uk/ ... vdg/ \n\nAbstract \n\nStandard techniques (eg. Yule-Walker) are available for learning \nAuto-Regressive process models of simple, directly observable, dy(cid:173)\nnamical processes. When sensor noise means that dynamics are \nobserved only approximately, learning can still been achieved via \nExpectation-Maximisation (EM) together with Kalman Filtering. \nHowever, this does not handle more complex dynamics, involving \nmultiple classes of motion. For that problem, we show here how \nEM can be combined with the CONDENSATION algorithm, which \nis based on propagation of random sample-sets. Experiments have \nbeen performed with visually observed juggling, and plausible dy(cid:173)\nnamical models are found to emerge from the learning process. \n\n1 \n\nIntroduction \n\nThe paper presents a probabilistic framework for estimation (perception) and classi(cid:173)\nfication of complex time-varying signals, represented as temporal streams of states. \nAutomated learning of dynamics is of crucial importance as practical models may \nbe too complex for parameters to be set by hand. The framework is particularly \ngeneral, in several respects, as follows. \n\n1. Mixed states: each state comprises a continuous and a discrete component. \nThe continuous component can be thought of as representing the instantaneous \nposition of some object in a continuum. The discrete state represents the current \nclass of the motion, and acts as a label, selecting the current member from a set of \ndynamical models. \n\n2. Multi-dimensionality: \nallowed to be multi-dimensional. 
This could represent motion in a higher dimen(cid:173)\nsional continuum, for example, two-dimensional translation as in figure 1. Other \nexamples include multi-spectral acoustic or image signals, or multi-channel sensors \nsuch as an electro-encephalograph. \n\nthe continuous component of a state is, in general, \n\n\f390 \n\nA. Blake. B. North and M Isard \n\nFigure 1: Learning the dynamics of juggling. Three motion classes, emerging \nfrom dynamical learning, turn out to correspond accurately to ballistic motion (mid \ngrey), catch/throw (light grey) and carry (dark grey). \n\n3. Arbitrary order: each dynamical system is modelled as an Auto-Regressive \nProcess (ARP) and allowed to have arbitrary order (the number of time-steps of \n\"memory\" that it carries.) \n\n4. Stochastic observations: \nnot \nobservable directly, but only via observations, which may be multi-dimensional, \nand are stochastically related to the continuous component of states. This aspect is \nessential to represent the inherent variability of response of any real signal sensing \nsystem. \n\nthe sequence of mixed states is \"hidden\" -\n\nEstimation for processes with properties 2,3,4 has been widely discussed both in \nthe control-theory literature as \"estimation\" and \"Kalman filtering\" (Gelb, 1974) \nand in statistics as ''forecasting'' (Brockwell and Davis, 1996). Learning of models \nwith properties 2,3 is well understood (Gelb, 1974) and once learned can be used \nto drive pattern classification procedures, as in Linear Predictive Coding (LPC) in \nspeech analysis (Rabiner and Bing-Hwang, 1993), or in classification of EEG signals \n(Pardey et al., 1995). When property 4 is added, the learning problem becomes \nharder (Ljung, 1987) because the training sets are no longer observed directly. \n\nMixed states (property 1) allow for combining perception with classification. 
Allowing properties 2, 4, but restricted to a 0th order ARP (in breach of property 3), gives Hidden Markov Models (HMMs) (Rabiner and Bing-Hwang, 1993), which have been used effectively for visual classification (Bregler, 1997). Learning HMMs is accomplished by the \"Baum-Welch\" algorithm, a form of Expectation-Maximisation (EM) (Dempster et al., 1977). Baum-Welch learning has been extended to \"graphical models\" of quite general topology (Lauritzen, 1996). In this paper, graph topology is a simple chain-pair as in standard HMMs, and the complexity of the problem lies elsewhere - in the generality of the dynamical model. \n\nGenerally then, restoring non-zero order to the ARPs (property 3), there is no exact algorithm for estimation. However, the estimation problem can be solved by random sampling algorithms, known variously as bootstrap filters (Gordon et al., 1993), particle filters (Kitagawa, 1996), and CONDENSATION (Blake and Isard, 1997). Here we show how such algorithms can be used, with EM, in dynamical learning theory and experiments (figure 1). \n\n2 Multi-class dynamics \n\nContinuous dynamical systems can be specified in terms of a continuous state vector x_t ∈ R^{N_x}. In machine vision, for example, x_t represents the parameters of a time-varying shape at time t. Multi-class dynamics are represented by appending to the continuous state vector x_t a discrete state component y_t, to make a \"mixed\" state \n\nX_t = (x_t, y_t), \n\nwhere y_t ∈ Y = {1, ..., N_y} is the discrete component of the state, drawn from a finite set of integer labels. Each discrete state represents a class of motion, for example \"stroke\", \"rest\" and \"shade\" for a hand engaged in drawing. \n\nCorresponding to each state y_t = y there is a dynamical model, taken to be a Markov model of order K^y that specifies p^y(x_t | x_{t-1}, ..., x_{t-K^y}). 
A linear-Gaussian Markov model of order K is an Auto-Regressive Process (ARP) defined by \n\nx_t = Σ_{k=1}^{K} A_k x_{t-k} + d + B w_t, \n\nin which each w_t is a vector of N_x independent random N(0, 1) variables, and w_t, w_{t'} are independent for t ≠ t'. The dynamical parameters of the model are \n\n\u2022 deterministic parameters A_1, A_2, ..., A_K \n\n\u2022 stochastic parameters B, which are multipliers for the stochastic process w_t, and determine the \"coupling\" of noise w_t into the vector-valued process x_t. \n\nFor convenience of notation, let A = (A_1, ..., A_K). \n\nEach state y ∈ Y has a set {A^y, B^y, d^y} of dynamical parameters, and the goal is to learn these from example trajectories. Note that the stochastic parameter B^y is a first-class part of a dynamical model, representing the degree and the shape of uncertainty in motion, allowing the representation of an entire distribution of possible motions for each state y. In addition, and independently, state transitions are governed by the transition matrix of a first-order Markov chain: \n\nP(y_t = y' | y_{t-1} = y) = M_{y,y'}. \n\nObservations z_t are assumed to be conditioned purely on the continuous part x_t of the mixed state, independent of y_t; this maintains a healthy separation between the modelling of dynamics and of observations. Observations are also assumed to be independent, both mutually and with respect to the dynamical process. The observation process is defined by specifying, at each time t, the conditional density p(z_t | x_t), which is taken to be Gaussian in the experiments here. \n\n3 Maximum Likelihood learning \n\nWhen observations are exact, maximum likelihood estimates (MLE) for dynamical parameters can be obtained from a training sequence X_1, ..., X_T of mixed states. 
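Such a fully observed training sequence can be generated, for illustration, by simulating the section-2 model directly. The following is a minimal sketch (not the authors' code; all parameter values, and the restriction to a scalar, first-order ARP per class with two classes, are invented for illustration):

```python
import random

# Toy mixed-state model: per-class scalar first-order ARP
# x_t = a^y x_{t-1} + d^y + b^y w_t, with class transitions from M.
PARAMS = {
    0: {"a": 0.9, "d": 0.0, "b": 0.1},   # class 0: slow decay toward 0
    1: {"a": 0.5, "d": 1.0, "b": 0.3},   # class 1: pulled toward x = 2
}
M = [[0.95, 0.05],                        # M[y][y'] = P(y_t = y' | y_{t-1} = y)
     [0.10, 0.90]]

def simulate(T, x0=0.0, y0=0, seed=0):
    """Generate a fully observed training sequence X_0, ..., X_T."""
    rng = random.Random(seed)
    xs, ys = [x0], [y0]
    for _ in range(T):
        y = rng.choices((0, 1), weights=M[ys[-1]])[0]   # discrete transition
        p = PARAMS[y]
        # continuous step: x_t = a^y x_{t-1} + d^y + b^y w_t, w_t ~ N(0, 1)
        xs.append(p["a"] * xs[-1] + p["d"] + p["b"] * rng.gauss(0.0, 1.0))
        ys.append(y)
    return xs, ys

xs, ys = simulate(200)
```

Each step first samples the discrete label from the transition matrix and then applies that class's ARP, mirroring the two-level structure of the mixed state X_t = (x_t, y_t).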
\nThe well-known Yule-Walker formula approximates the MLE (Gelb, 1974; Ljung, 1987), but generalisations are needed to allow for short training sets (small T), to include the stochastic parameters B, to allow a non-zero offset d (this proves essential in experiments later) and to encompass multiple dynamical classes. \n\nThe resulting MLE learning rule is as follows: \n\nA^y R'^y = R'^y_0, d^y = (1 / (T^y - K^y)) (R^y_0 - A^y R^y), C^y = (1 / (T^y - K^y)) (R'^y_{0,0} - A^y (R'^y_0)^T), \n\nwhere (omitting the y superscripts for clarity) C = B B^T, R = (R_1; ...; R_K), R' is the K × K block matrix (R'_{i,j})_{i,j=1,...,K} and R'_0 = (R'_{0,1}, ..., R'_{0,K}), and the first-order moments R_i and (offset-invariant) autocorrelations R'_{i,j}, for each class y, are given by \n\nR^y_i = Σ_{t: y_t = y} x_{t-i} and R'^y_{i,j} = R^y_{i,j} - (1 / (T^y - K^y)) R^y_i (R^y_j)^T, \n\nwhere \n\nR^y_{i,j} = Σ_{t: y_t = y} x_{t-i} x_{t-j}^T and T^y = #{t : y_t = y} = Σ_{t: y_t = y} 1. \n\nThe MLE for the transition matrix M is constructed from relative frequencies as \n\nM_{y,y'} = T_{y,y'} / Σ_{y'' ∈ Y} T_{y,y''}, where T_{y,y'} = #{t : y_{t-1} = y, y_t = y'}. \n\n4 Learning with stochastic observations \n\nTo allow for stochastic observations, direct MLE is no longer possible, but an EM learning algorithm can be formulated. Its M-step is simply the MLE estimate of the previous section. It might be thought that the E-step should consist simply of computing expectations, for instance E[x_t | Z_1^T] (where Z_1^t = (z_1, ..., z_t) denotes a sequence of observations), and treating them as training values x_t. This would be incorrect, however, because the log-likelihood function L for the problem is not linear in the x_t but quadratic. Instead we need expectations of the moments themselves, such as E[R^y_i | Z_1^T] and E[R'^y_{i,j} | Z_1^T], conditioned on the entire training set Z_1^T of observations, given that L is linear in the R_i, R'_{i,j} etc. (Shumway and Stoffer, 1982). 
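As a concrete illustration of the exact-observation learning rule of section 3 (which serves as the M-step), here is a sketch specialised to scalar x and K = 1. It is a toy instance, not the authors' implementation; the demo data generator and all numeric values are invented:

```python
import random
from collections import Counter

# Toy instance of the section-3 MLE for scalar x and K = 1:
# x_t = a^y x_{t-1} + d^y + b^y w_t, labels ys fully observed.
def mle_m_step(xs, ys, n_classes):
    models = {}
    for y in range(n_classes):
        ts = [t for t in range(1, len(xs)) if ys[t] == y]
        n = len(ts) - 1                          # divisor T^y - K^y
        R0 = sum(xs[t] for t in ts)              # first-order moments R_0, R_1
        R1 = sum(xs[t - 1] for t in ts)
        R00 = sum(xs[t] ** 2 for t in ts)        # raw autocorrelations R_{i,j}
        R01 = sum(xs[t] * xs[t - 1] for t in ts)
        R11 = sum(xs[t - 1] ** 2 for t in ts)
        Rp01 = R01 - R0 * R1 / n                 # offset-invariant R'_{i,j}
        Rp11 = R11 - R1 * R1 / n
        Rp00 = R00 - R0 * R0 / n
        a = Rp01 / Rp11                          # solves a R'_{1,1} = R'_{0,1}
        d = (R0 - a * R1) / n
        c = (Rp00 - a * Rp01) / n                # c = b^2
        models[y] = (a, d, c)
    pairs = Counter(zip(ys[:-1], ys[1:]))        # transition counts T_{y,y'}
    M_hat = [[pairs[(y, y2)] for y2 in range(n_classes)]
             for y in range(n_classes)]
    return models, [[v / max(1, sum(row)) for v in row] for row in M_hat]

# demo: fit on a trajectory simulated from known two-class dynamics
rng = random.Random(1)
xs, ys, y = [0.0], [0], 0
for _ in range(20000):
    y = rng.choices((0, 1), weights=[0.98, 0.02] if y == 0 else [0.02, 0.98])[0]
    a_t, d_t, b_t = (0.8, 0.5, 0.2) if y == 0 else (0.3, -0.4, 0.1)
    xs.append(a_t * xs[-1] + d_t + b_t * rng.gauss(0.0, 1.0))
    ys.append(y)
models, M_hat = mle_m_step(xs, ys, 2)
```

With enough data per class, the recovered (a^y, d^y, c^y) and transition matrix approach the generating values; the E-step discussed next replaces the exact moments above with their expectations under the observation model.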
These expected values of autocorrelations and frequencies are to be used in place of the actual autocorrelations and frequencies in the learning formulae of section 3. The question is how to compute them. In the special case Y = {1} of single-class dynamics, and assuming a Gaussian observation density, exact methods are available for computing expected moments, using Kalman and smoothing filters (Gelb, 1974) in an \"augmented state\" filter (North and Blake, 1998). For multi-class dynamics, exact computation is infeasible, but good approximations can be achieved based on propagation of sample sets, using CONDENSATION. \n\nForward sampling with backward chaining \n\nFor the purposes of learning, an extended and generalised form of the CONDENSATION algorithm is required. The generalisations allow for mixed states, arbitrary order for the ARP, and backward chaining of samples. In backward chaining, sample-sets for successive times are built up and stored together with a complete state history back to time t = 0. The extended CONDENSATION algorithm is given in figure 2. Note that the algorithm needs to be initialised. This requires that y_0 and (x^(n)_{-k|0}, k = 0, ..., K^{y_0} - 1) be drawn from a suitable (joint) prior for the multi-class process. One way to do this is to ensure that the training set starts in a known state and to fix the initial sample values accordingly. Normally, the choice of prior is not too important, as it is dominated by data. \n\nAt time t = T, when the entire training sequence has been processed, the final sample set \n\n{(X^(n)_{T|T}, ..., X^(n)_{0|T}, π^(n)_T), n = 1, ..., N} \n\nrepresents fairly (in the limit, weakly, as N → ∞) the posterior distribution for the entire state sequence X_0, ..., X_T, conditioned on the entire training set Z_1^T of observations. 
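The forward-sampling-with-backward-chaining idea can be sketched in the simplest setting: a single motion class, a scalar first-order ARP, and Gaussian observations. This is a toy stand-in for the full figure-2 algorithm, not the authors' code, and all parameter values are invented:

```python
import math
import random

# Toy forward sampling with backward chaining: each sample carries its
# whole state history, so the final set approximates the posterior over
# the entire trajectory x_0, ..., x_T. Model (assumed for illustration):
# x_t = a x_{t-1} + d + b w_t, observations z_t = x_t + v_t, v_t Gaussian.
def condensation(zs, N=300, a=0.9, d=0.1, b=0.3, obs_sd=0.5, seed=0):
    rng = random.Random(seed)
    histories = [[rng.gauss(0.0, 1.0)] for _ in range(N)]  # priors for x_0
    weights = [1.0 / N] * N
    for z in zs:
        # step 1: choose ancestors with probability proportional to weight
        ancestors = rng.choices(histories, weights=weights, k=N)
        # step 2 (predict from the dynamics) and step 4 (chain the history):
        histories = [h + [a * h[-1] + d + b * rng.gauss(0.0, 1.0)]
                     for h in ancestors]
        # step 3: weight by the observation density p(z_t | x_t), normalised
        weights = [math.exp(-0.5 * ((h[-1] - z) / obs_sd) ** 2)
                   for h in histories]
        total = sum(weights)
        weights = [w / total for w in weights]
    return histories, weights      # weighted samples of whole trajectories

def smoothed_mean(histories, weights, t):
    """Estimate E[x_t | Z_1^T] from the final sample set."""
    return sum(w * h[t] for h, w in zip(histories, weights))

hs, ws = condensation([1.0] * 30, seed=1)
```

Because every particle stores its history back to t = 0, smoothed moments at any earlier time can be read off the final weighted set, which is exactly what the E-step needs.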
The expectations of the autocorrelation and frequency measures required for learning can be estimated from the sample set, for example \n\nE[R^y_{i,j} | Z_1^T] ≈ Σ_n π^(n)_T Σ_{t: y^(n)_t = y} x^(n)_{t-i|T} (x^(n)_{t-j|T})^T. \n\nAn alternative algorithm is a sample-set version of forward-backward propagation (Kitagawa, 1996). Experiments have suggested that the probability densities generated by this form of smoothing converge far more quickly with respect to sample-set size N, but at the expense of computational complexity - O(N^2) as opposed to O(N log N) for the algorithm above. \n\n5 Practical applications \n\nExperiments are reported briefly here on learning the dynamics of juggling using the EM-CONDENSATION algorithm, as in figure 1. An offset d^y is learned for each class in Y = {1, 2, 3}; other dynamical parameters are fixed, so that learning d^y amounts to learning a mean acceleration a^y for each class. The transition matrix is also learned. From a more or less neutral starting point, learned structure emerges as in figure 3. Around 60 iterations of EM suffice, with N = 2048, to learn the dynamics in this case. It is clear from the figure that the learned structure is an altogether plausible model for the juggling process. \n\nIterate for t = 1, ..., T. Construct the sample-set {(X^(n)_{1|t}, ..., X^(n)_{t|t}, π^(n)_t), n = 1, ..., N} for time t. For each n: \n\n1. Choose (with replacement) m ∈ {1, ..., N} with probability π^(m)_{t-1}. \n\n2. Predict by sampling from \n\np(X_t | X_{t-1} = (X^(m)_{1|t-1}, ..., X^(m)_{t-1|t-1})) \n\nto choose X^(n)_{t|t}. For multi-class ARPs this is done in two steps. \n\nDiscrete: choose y^(n)_t = y' ∈ Y with probability M_{y,y'}, where y = y^(m)_{t-1}. \n\nContinuous: compute \n\nx^(n)_{t|t} = Σ_{k=1}^{K} A^y_k x^(m)_{t-k|t-1} + d^y + B^y w^(n)_t, \n\nwhere y = y^(n)_t and w^(n)_t is a vector of standard normal random variables. \n\n3. 
Observation weights π^(n)_t are computed from the observation density, evaluated for the current observation z_t: \n\nπ^(n)_t = p(z_t | x_t = x^(n)_{t|t}), \n\nthen normalised multiplicatively so that Σ_n π^(n)_t = 1. \n\n4. Update the sample history: \n\nX^(n)_{t'|t} = X^(m)_{t'|t-1}, t' = 1, ..., t - 1. \n\nFigure 2: The CONDENSATION algorithm for forward propagation with backward chaining. \n\nAcknowledgements \n\nWe are grateful for the support of the EPSRC (AB, BN) and Magdalen College Oxford (MI). \n\nReferences \n\nBlake, A. and Isard, M. (1997). The Condensation algorithm - conditional density propagation and applications to visual tracking. In Advances in Neural Information Processing Systems 9, pages 361-368. MIT Press. \n\n[Figure 3: state-transition diagram showing the three learned motion classes - Ballistic (a = (0.0, -9.7)), Catch/throw and Carry - with their mean accelerations and transition probabilities.] \n\nFigure 3: Learned dynamical model for juggling. The three motion classes allowed in this experiment organise themselves into: ballistic motion (acceleration a ≈ -g); catch/throw; carry. As expected, lifetime in the ballistic state is longest, the transition probability of 0.95 corresponding to 20 time-steps or about 0.7 seconds. Transitions tend to be directed, as expected; for example, ballistic motion is more likely to be followed by a catch/throw (p = 0.04) than by a carry (p = 0.01). (Acceleration a is shown here in units of m/s^2.) \n\nBregler, C. (1997). Learning and recognising human dynamics in video sequences. In Proc. Conf. Computer Vision and Pattern Recognition. \n\nBrockwell, P. and Davis, R. (1996). Introduction to Time Series and Forecasting. Springer-Verlag. \n\nDempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B, 39:1-38. \n\nGelb, A., editor (1974). Applied Optimal Estimation. 
MIT Press, Cambridge, MA. \n\nGordon, N., Salmond, D., and Smith, A. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. F, 140(2):107-113. \n\nKitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5(1):1-25. \n\nLauritzen, S. (1996). Graphical Models. Oxford University Press. \n\nLjung, L. (1987). System Identification: Theory for the User. Prentice-Hall. \n\nNorth, B. and Blake, A. (1998). Learning dynamical models using expectation-maximisation. In Proc. 6th Int. Conf. on Computer Vision, pages 384-389. \n\nPardey, J., Roberts, S., and Tarassenko, L. (1995). A review of parametric modelling techniques for EEG analysis. Medical Engineering Physics, 18(1):2-11. \n\nRabiner, L. and Bing-Hwang, J. (1993). Fundamentals of Speech Recognition. Prentice-Hall. \n\nShumway, R. and Stoffer, D. (1982). An approach to time series smoothing and forecasting using the EM algorithm. J. Time Series Analysis, 3:253-264. \n", "award": [], "sourceid": 1511, "authors": [{"given_name": "Andrew", "family_name": "Blake", "institution": null}, {"given_name": "Ben", "family_name": "North", "institution": null}, {"given_name": "Michael", "family_name": "Isard", "institution": null}]}