{"title": "Partially Observable SDE Models for Image Sequence Recognition Tasks", "book": "Advances in Neural Information Processing Systems", "page_first": 880, "page_last": 886, "abstract": null, "full_text": "Partially Observable SDE Models for \nImage Sequence Recognition Tasks \n\nJavier R. Movellan \n\nInstitute for Neural Computation \nUniversity of California San Diego \n\nPaul Mineiro \n\nDepartment of Cognitive Science \nUniversity of California San Diego \n\nR. J. Williams \n\nDepartment of Mathematics \n\nUniversity of California San Diego \n\nAbstract \n\nThis paper explores a framework for recognition of image sequences \nusing partially observable stochastic differential equation (SDE) \nmodels. Monte-Carlo importance sampling techniques are used for \nefficient estimation of sequence likelihoods and sequence likelihood \ngradients. Once the network dynamics are learned, we apply the \nSDE models to sequence recognition tasks in a manner similar to \nthe way Hidden Markov models (HMMs) are commonly applied. \nThe potential advantage of SDEs over HMMS is the use of contin(cid:173)\nuous state dynamics. We present encouraging results for a video \nsequence recognition task in which SDE models provided excellent \nperformance when compared to hidden Markov models. \n\n1 \n\nIntroduction \n\nThis paper explores a framework for recognition of image sequences using partially \nobservable stochastic differential equations (SDEs). In particular we use SDE mod(cid:173)\nels of low-power non-linear RC circuits with a significant thermal noise component. \nWe call them diffusion networks. A diffusion network consists of a set of n nodes \ncoupled via a vector of adaptive impedance parameters>' which are tuned to op(cid:173)\ntimize the network's behavior. 
The temporal evolution of the n nodes defines a continuous stochastic process X that satisfies the following Ito SDE: \n\ndX(t) = \mu(X(t), \lambda) \, dt + \sigma \, dB(t),   (1) \nX(0) \sim \nu,   (2) \n\nwhere \nu represents the (stochastic) initial conditions and B is standard Brownian motion. The drift is defined by a non-linear RC charging equation \n\n\mu_j(X(t), \lambda) = \frac{1}{\kappa_j} \left( \hat{x}_j(t) + \xi_j - \frac{1}{\rho_j} X_j(t) \right), \quad for j = 1, \ldots, n,   (3) \n\nwhere \mu_j is the drift of unit j, i.e., the j-th component of \mu. Here X_j is the internal potential at node j, \kappa_j > 0 is the input capacitance, \rho_j the node resistance, \xi_j a constant input current to the unit, and \hat{x}_j the net electrical current input to the node, \n\n\hat{x}_j(t) = \sum_{m=1}^{n} w_{j,m} \varphi(X_m(t)), \quad for j = 1, \ldots, n,   (4) \n\varphi(x) = \frac{1}{1 + e^{-x}}, \quad for all x \in \mathbb{R},   (5) \n\nwhere \varphi is the input-output amplification characteristic and 1/w_{j,m} is the impedance between the output X_m and the node j. \n\n[Figure 1 panels: the distribution of states at time t+dt for SDE models, ODE models, and hidden Markov models.] \n\nFigure 1: An illustration of the differences between stochastic differential equation models (SDE), ordinary differential equation models (ODE) and Hidden Markov Models (HMM). In ODEs the state dynamics are continuous and deterministic. In SDEs the state dynamics are continuous and stochastic. In HMMs the state dynamics are discrete and probabilistic. \n\nIntuition for equation (1) can be achieved by thinking of it as the limit of a discrete time stochastic difference equation, \n\nX(t + \Delta t) = X(t) + \mu(X(t), \lambda) \Delta t + \sigma \sqrt{\Delta t} \, Z(t),   (6) \n\nwhere Z(t) is an n-dimensional vector of independent standard Gaussian random variables. For a fixed state at time t there are two forces controlling the change in activation: the drift, which is deterministic, and the dispersion, which is stochastic (see Figure 1). This results in a distribution of states at time t + \Delta t. 
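As an illustration of equations (3)-(6), the discrete-time approximation (6) can be simulated directly. The following is a minimal sketch, not from the paper: the function names, parameter values, and the 2-unit coupling matrix are all hypothetical.

```python
import numpy as np

def drift(x, w, xi, kappa=1.0, rho=1.0):
    # Equation (5): logistic input-output amplification characteristic
    phi = 1.0 / (1.0 + np.exp(-x))
    # Equation (4): net electrical current into each node
    net = w @ phi
    # Equation (3): non-linear RC charging drift
    return (net + xi - x / rho) / kappa

def simulate(w, xi, x0, sigma=0.5, dt=0.01, steps=1000, seed=0):
    # Equation (6): forward-Euler (Euler-Maruyama) discretization of (1)
    rng = np.random.default_rng(seed)
    path = np.empty((steps + 1, len(x0)))
    path[0] = x0
    for t in range(steps):
        z = rng.standard_normal(len(x0))  # independent standard Gaussians Z(t)
        path[t + 1] = path[t] + drift(path[t], w, xi) * dt + sigma * np.sqrt(dt) * z
    return path

# Hypothetical 2-unit network with antisymmetric coupling
w = np.array([[0.0, 6.0], [-6.0, 0.0]])
path = simulate(w, xi=np.zeros(2), x0=np.zeros(2))
```

Setting sigma=0 in this sketch recovers a deterministic recurrent network, as noted in the comparison below.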
As \Delta t goes to zero, the solution to the difference equation (6) converges to the diffusion process defined in (1). \n\nFigures 1 and 2 show the relationship between SDE models and other approaches in the neural network and the stochastic filtering literature. The main difference between ODE models, like standard recurrent neural networks, and SDE models is that the first has deterministic dynamics while the second has probabilistic dynamics. The two approaches are similar in that the states are continuous. The main difference between HMMs and SDEs is that the first have discrete state dynamics while the second have continuous state dynamics. The main similarity is that both are probabilistic. Kalman filters are linear SDE models. If the impedance matrix is symmetric and the network is given enough time to approximate stochastic equilibrium, diffusion networks behave like continuous Boltzmann machines (Ackley, Hinton & Sejnowski, 1985). If the network is discretized in state and time it becomes a standard HMM. Finally, if the dispersion constant is set to zero the network behaves like a deterministic recurrent neural network. \n\nIn order to use SDE models we need a method for finding the likelihood and the likelihood gradient of observed sequences. \n\n[Figure 2 nodes and edge labels: Kalman-Bucy filters (linear dynamics), Boltzmann machines (stochastic equilibrium), recurrent neural networks (zero noise), and hidden Markov models (discrete space and time), each linked to diffusion networks.] \n\nFigure 2: Relationship between diffusion filters and other approaches in the neural network and stochastic filtering literature. \n\n2 Observed sequence likelihoods \n\nWe regard the first d components of an SDE model as observable and denote them by O. The last n - d components are denoted by H and named unobservable or hidden. 
\nHidden components are included for modeling non-Markovian dependencies in the observable components. Let \Omega_o, \Omega_h be the outcome spaces for the observable and hidden processes, and let \Omega = \Omega_o \times \Omega_h be the joint outcome space. Here each outcome \omega is a continuous path \omega : [0, T] \to \mathbb{R}^n. For each \omega \in \Omega, we write \omega = (\omega_o, \omega_h), where \omega_o represents the observable dimensions of the path and \omega_h the hidden dimensions. Let Q^\lambda(A) represent the probability that a network with parameter \lambda generates paths in the set A, Q_o^\lambda(A_o) the probability that the observable components generate paths in A_o, and Q_h^\lambda(A_h) the probability that the hidden components generate paths in A_h. To apply the familiar techniques of maximum likelihood and Bayesian estimation we use as reference the probability distribution of a diffusion network with zero drift, i.e., the paths generated by this network are Brownian motion scaled by \sigma. We denote such reference distribution as R, and its observable and hidden components as R_o, R_h. Using Girsanov's theorem (Karatzas & Shreve, 1991, p. 303) we have that \n\nL_o^\lambda(\omega_o) = \frac{dQ_o^\lambda}{dR_o}(\omega_o) = \int_{\Omega_h} L_{o,h}^\lambda(\omega_o, \omega_h) \, dR_h(\omega_h), \quad \omega_o \in \Omega_o,   (7) \n\nwhere \n\nL_{o,h}^\lambda(\omega) = \frac{dQ^\lambda}{dR}(\omega) = \exp \left\{ \frac{1}{\sigma^2} \int_0^T \mu(\omega(t), \lambda) \cdot d\omega(t) - \frac{1}{2\sigma^2} \int_0^T |\mu(\omega(t), \lambda)|^2 \, dt \right\}.   (8) \n\nThe first integral in (8) is an Ito stochastic integral, the second is a standard Lebesgue integral. The term L_o^\lambda is a Radon-Nikodym derivative that represents the probability density of Q_o^\lambda with respect to R_o. For a fixed path \omega_o the term L_o^\lambda(\omega_o) is a likelihood function of \lambda that can be used for maximum likelihood estimation. 
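On a time-discretized path, the density (8) can be approximated by replacing the Ito integral with a sum over left-endpoint increments and the Lebesgue integral with a Riemann sum. A minimal sketch in log-space (function and argument names are our own, not from the paper):

```python
import numpy as np

def log_density(path, mu, lam, sigma, dt):
    # path : (T+1, n) array of states omega(t) sampled every dt seconds
    # mu   : callable mu(x, lam) returning the drift vector at state x
    # Drift evaluated at left endpoints, as in an Ito sum
    drifts = np.array([mu(x, lam) for x in path[:-1]])
    increments = np.diff(path, axis=0)  # d omega(t)
    # First term of (8): Ito stochastic integral
    ito_term = np.sum(drifts * increments) / sigma**2
    # Second term of (8): Lebesgue integral of |mu|^2
    lebesgue_term = np.sum(drifts**2) * dt / (2.0 * sigma**2)
    return ito_term - lebesgue_term
```

With zero drift the log-density is zero, consistent with R itself being the zero-drift reference distribution.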
To obtain the likelihood gradient, we differentiate (7), which yields \n\n\nabla_\lambda \log L_o^\lambda(\omega_o) = \int_{\Omega_h} L_{h|o}^\lambda(\omega_h | \omega_o) \, \nabla_\lambda \log L_{o,h}^\lambda(\omega_o, \omega_h) \, dR_h(\omega_h),   (9) \n\nwhere the gradient \nabla_\lambda \log L_{o,h}^\lambda can be written in terms of the joint innovation process \n\n\epsilon^\lambda(t, \omega) = \omega(t) - \omega(0) - \int_0^t \mu(\omega(u), \lambda) \, du.   (10) \n\n2.1 Importance sampling \n\nThe likelihood of observed paths (7) and the gradient of the likelihood (9) require averaging with respect to the distribution of hidden paths R_h. We estimate these averages using importance sampling in the space of sample paths. Instead of sampling from R_h we sample from a distribution that weights more heavily regions where L_{o,h}^\lambda is large. Each sample is then weighted by the density of the sampling distribution with respect to R_h. This weighting function is commonly known as the importance function in the Monte-Carlo literature (Fishman, 1996, p. 257). In particular, for each observable path \omega_o we let the sampling distribution S_{\omega_o}^\lambda be the probability distribution generated by a diffusion network with parameter \lambda which has been forced to exhibit the path \omega_o over the observable units. The approach is reminiscent of the technique of teacher forcing from deterministic neural networks. In practice, we generate i.i.d. sampled hidden paths \{h^{(i)}\}_{i=1}^m from S_{\omega_o}^\lambda by numerically simulating a diffusion network with the observable units forced to exhibit the path \omega_o; these hidden paths are then weighted by the density of S_{\omega_o}^\lambda with respect to R_h, which acts as a Monte-Carlo importance function. \n\nIn practice we have obtained good results with m on the order of 20, i.e., we sample 20 hidden sequences per observed sequence. One interesting property of this approach is that the sampling distributions S_{\omega_o}^\lambda change as learning progresses, since they depend on \lambda. \n\nFigure 3 shows results of a computer simulation in which a 2 unit network was trained to oscillate. 
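The importance-sampling scheme above amounts to a weighted average of L^\lambda_{o,h} over the m sampled hidden paths. A minimal sketch in log-space (names hypothetical), assuming we can evaluate the log of L^\lambda_{o,h} and the log importance density for each sampled hidden path:

```python
import numpy as np

def log_likelihood_estimate(hidden_paths, log_L_oh, log_importance):
    # log_L_oh       : callable giving log L^lambda_{o,h}(omega_o, h)
    # log_importance : callable giving the log density of the sampling
    #                  distribution S with respect to R_h at hidden path h
    # Each sampled path contributes L_{o,h} divided by its importance weight
    log_terms = np.array([log_L_oh(h) - log_importance(h) for h in hidden_paths])
    # Average in a numerically stable way (log-mean-exp)
    shift = log_terms.max()
    return shift + np.log(np.mean(np.exp(log_terms - shift)))
```

In the paper's experiments `hidden_paths` would hold the m = 20 hidden sequences simulated with the observable units clamped to the observed path.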
We tried an oscillation pattern because of its relevance for the application we explore in a later section, which involves recognizing sequences of lip movements. The figure shows the \"training\" path and a couple of sample paths, one obtained with the \sigma parameter set to 0, and one with the parameter set to 0.5. \n\n[Figure 3: three time-series panels over the interval 0 to 1.] \n\nFigure 3: Training a 2 unit network to maximize the likelihood of a sinusoidal path. The top graph shows the training path. It consists of two sinusoids out of phase, each representing the activation of the two units in the network. The center graph shows a sample path obtained after training the network and setting \sigma = 0, i.e., no noise. The bottom graph shows a sample path obtained with \sigma = 0.5. \n\n3 Recognizing video sequences \n\nIn this section we illustrate the use of SDE models on a sequence classification task of reasonable difficulty with a body of realistic data. We chose this task since we know of SDE models used for tracking problems but know of no SDE models used for sequence recognition tasks. The potential advantage of SDEs over more established approaches such as HMMs is that they enforce continuity constraints, an aspect that may be beneficial when the actual signals are better described using continuous state dynamics. We compared a diffusion network approach with classic hidden Markov model approaches. \n\nWe used Tulips1 (Movellan, 1995), a database consisting of 96 movies of 9 male and 3 female undergraduate students from the Cognitive Science Department at the University of California, San Diego. 
For each student two sample utterances were taken for each of the digits \"one\" through \"four\". The database is available at http://cogsci.ucsd.edu. We compared the performance of diffusion networks and HMMs using two different image processing techniques (contours and contours plus intensity) in combination with 2 different recognition engines (HMMs and diffusion networks). The image processing was performed by Luettin and colleagues (Luettin, 1997). They employ point density models, where each lip contour is represented by a set of points; in this case both the inner and outer lip contour are represented, corresponding to Luettin's double contour model. The dimensionality of the representation of the contours is reduced using principal component analysis. For the work presented here 10 principal components were used to approximate the contour, along with a scale parameter which measured the pixel distance between the mouth corners; associated with each of these 11 parameters was a corresponding \"delta component\", the left-hand temporal difference of the component (defined to be zero for the first frame). In this manner a total of 22 parameters were used to represent lip contour information for each still frame. These 22 parameters were represented using diffusion networks with 22 observation units, one per parameter value. We also tested the performance of a representation that used intensity information in addition to contour shape information. This approach used 62 parameters, which were represented using diffusion networks with 62 observation units. 
\n\nApproach / Correct Generalization \nBest HMM, shape information only: 82.3% \nBest diffusion network, shape information only: 85.4% \nUntrained human subjects: 89.9% \nBest HMM, shape and intensity information: 90.6% \nBest diffusion network, shape and intensity information: 91.7% \nTrained human subjects: 95.5% \n\nTable 1: Average generalization performance on the Tulips1 database. Shown in order are the performance of the best performing HMM from (Luettin et al., 1996), which uses only shape information, the best diffusion network obtained using only shape information, the performance of untrained human subjects (Movellan, 1995), the HMM from Luettin's thesis (Luettin, 1997) which uses both shape and intensity information, the best diffusion network obtained using both shape and intensity information, and the performance of trained human lipreaders (Movellan, 1995). \n\nWe independently trained 4 diffusion networks to approximate the distributions of lip-contour trajectories of each of the four words to be recognized, i.e., the first network was trained with examples of the word \"one\", and the last network with examples of the word \"four\". Each network had the same number of nodes, and the drift of each network was given by (3) with \kappa_j = 1 and 1/\rho_j = 0 for all units, and with \xi_j being part of the adaptive vector \lambda. Thus, \lambda = (\xi_1, \ldots, \xi_n, w_{1,1}, w_{1,2}, \ldots, w_{n,n})'. The number of hidden units was varied from one to 5. We obtained optimal results with 4 hidden units. The initial state of the hidden units was set to (1, \ldots, 1) with probability 1, and \sigma was set to 1 for all networks. The diffusion network dynamics were simulated using a forward-Euler technique, i.e., equation (1) is approximated in discrete time using (6). In our simulations we set \Delta t = 1/30 seconds, the time between video frame samples. 
Each diffusion network was trained with examples of one of the 4 digits using the cost function \n\n\Phi(\lambda) = \sum_i \log L_o^\lambda(y^{(i)}) - \frac{1}{2} \alpha |\lambda|^2,   (16) \n\nwhere \{y^{(i)}\} are samples from the desired empirical distribution P_o and \alpha is the strength of a Gaussian prior on the network parameters. Best results were obtained with diffusion networks with 4 hidden units. The log-likelihood gradients were estimated using the importance sampling approach with m = 20, i.e., we generated 20 hidden sample paths per observed path. With this number of samples training took about 10 times longer with diffusion networks than with HMMs. At test time, computation of the likelihood estimates was very fast and could have been done in real time using a fast Pentium II. \n\nThe generalization performance was estimated using a jackknife (leave-one-out) technique: we trained on all subjects but one, which was used for testing. The process was repeated leaving a different subject out every time. Results are shown in Table 1. The table includes HMM results reported by Luettin (1997), who tried a variety of HMM architectures and reported the best results obtained with them. The only difference between Luettin's approach and our approach is the recognition engine, which was a bank of HMMs in his case and a bank of diffusion networks in our case. If anything we were at a disadvantage since the image representations mentioned above were optimized by Luettin to work best with HMMs. \n\nIn all cases the best diffusion networks outperformed the best HMMs reported in the literature using exactly the same visual preprocessing. The difference in performance was not large; however, obtaining even a 1% increment in performance on this database is very difficult. 
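At test time the four trained networks are used like a bank of HMMs: an observed sequence is assigned to the word whose network gives the highest estimated log-likelihood. A minimal sketch of this decision rule (names hypothetical):

```python
import numpy as np

WORDS = ["one", "two", "three", "four"]

def classify(observed_path, log_likelihood_fns):
    # log_likelihood_fns: one estimated log-likelihood function per trained
    # diffusion network, in the same order as WORDS
    scores = [f(observed_path) for f in log_likelihood_fns]
    return WORDS[int(np.argmax(scores))]
```

Usage: with four scorers returning, say, -10.0, -3.0, -7.0 and -20.0 for a given path, `classify` returns \"two\".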
\n\n4 Discussion \n\nWhile we presented results for a video sequence recognition task, the same framework can be used for tasks such as object tracking and sequence generation. Our work was inspired by the rich literature on continuous stochastic filtering and stochastic neural networks. The idea was to combine the versatility of recurrent neural networks and the well known advantages of stochastic modeling approaches. The continuous-time nature of the networks is convenient for data with dropouts or variable sample rates, since the models we use define all the finite dimensional distributions. The continuous-state representation is well suited to problems involving inference about continuous unobservable quantities, as in visual tracking tasks. Since these networks enforce continuity constraints on the observable paths they may not have the well known problems encountered when HMMs are used as generative models of continuous sequences. \n\nWe have presented encouraging results on a realistic sequence recognition task. However, more work needs to be done, since the database we used is relatively small. At this point the main disadvantage of diffusion networks relative to conventional hidden Markov models is training speed. The diffusion networks used here were approximately 10 times slower to train than HMMs. Fortunately the Monte Carlo approximations employed herein, which represent the bulk of the computational burden, lend themselves to parallel and hardware implementations. Moreover, once a network is trained, the computation of the density functions needed in recognition tasks can be done in real time. \n\nWe are exploring applications of diffusion networks to stochastic filtering problems (e.g., contour tracking) and sequence generation problems, not just sequence recognition problems. 
Our work shows that diffusion networks may be a feasible alternative to HMMs for problems in which state continuity is advantageous. The results obtained for the visual speech recognition task are encouraging, and reinforce the possibility that diffusion networks may become a versatile tool for a very wide variety of continuous signal processing tasks. \n\nReferences \n\nAckley, D. H., Hinton, G. E., & Sejnowski, T. (1985). A Learning Algorithm for Boltzmann Machines. Cognitive Science, 9(2), 147-169. \n\nFishman, G. S. (1996). Monte Carlo Sampling: Concepts, Algorithms and Applications. New York: Springer-Verlag. \n\nKaratzas, I. & Shreve, S. (1991). Brownian Motion and Stochastic Calculus. Springer. \n\nLuettin, J. (1997). Visual Speech and Speaker Recognition. PhD thesis, University of Sheffield. \n\nMovellan, J. (1995). Visual Speech Recognition with Stochastic Neural Networks. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in Neural Information Processing Systems, volume 7. MIT Press. \n\nOksendal, B. (1992). Stochastic Differential Equations. Berlin: Springer-Verlag. \n", "award": [], "sourceid": 1836, "authors": [{"given_name": "Javier", "family_name": "Movellan", "institution": null}, {"given_name": "Paul", "family_name": "Mineiro", "institution": null}, {"given_name": "Ruth", "family_name": "Williams", "institution": null}]}