{"title": "Layered Dynamic Textures", "book": "Advances in Neural Information Processing Systems", "page_first": 203, "page_last": 210, "abstract": "", "full_text": "Layered Dynamic Textures\n\nAntoni B. Chan\n\nNuno Vasconcelos\n\nDepartment of Electrical and Computer Engineering\n\nUniversity of California, San Diego\n\nabchan@ucsd.edu, nuno@ece.ucsd.edu\n\nAbstract\n\nA dynamic texture is a video model that treats a video as a sample from\na spatio-temporal stochastic process, speci\ufb01cally a linear dynamical sys-\ntem. One problem associated with the dynamic texture is that it cannot\nmodel video where there are multiple regions of distinct motion. In this\nwork, we introduce the layered dynamic texture model, which addresses\nthis problem. We also introduce a variant of the model, and present the\nEM algorithm for learning each of the models. Finally, we demonstrate\nthe ef\ufb01cacy of the proposed model for the tasks of segmentation and syn-\nthesis of video.\n\n1 Introduction\n\nTraditional motion representations, based on optical \ufb02ow, are inherently local and have sig-\nni\ufb01cant dif\ufb01culties when faced with aperture problems and noise. The classical solution to\nthis problem is to regularize the optical \ufb02ow \ufb01eld [1, 2, 3, 4], but this introduces undesirable\nsmoothing across motion edges or regions where the motion is, by de\ufb01nition, not smooth\n(e.g. vegetation in outdoors scenes). More recently, there have been various attempts to\nmodel video as a superposition of layers subject to homogeneous motion. While layered\nrepresentations exhibited signi\ufb01cant promise in terms of combining the advantages of reg-\nularization (use of global cues to determine local motion) with the \ufb02exibility of local repre-\nsentations (little undue smoothing), this potential has so far not fully materialized. 
One of the main limitations is their dependence on parametric motion models, such as affine transforms, which assume a piecewise planar world that rarely holds in practice [5, 6]. In fact, layers are usually formulated as \u201ccardboard\u201d models of the world that are warped by such transformations and then stitched to form the frames in a video stream [5]. This severely limits the types of video that can be synthesized: while layers showed most promise as models for scenes composed of ensembles of objects subject to homogeneous motion (e.g. leaves blowing in the wind, a flock of birds, a picket fence, or highway traffic), very little progress has so far been demonstrated in actually modeling such scenes.\n\nRecently, there has been more success in modeling complex scenes as dynamic textures or, more precisely, as samples from stochastic processes defined over space and time [7, 8, 9, 10]. This work has demonstrated that modeling both the dynamics and appearance of video as stochastic quantities leads to a much more powerful generative model for video than that of a \u201ccardboard\u201d figure subject to parametric motion. In fact, the dynamic texture model has shown a surprising ability to abstract a wide variety of complex patterns of motion and appearance into a simple spatio-temporal model. One major current limitation of the dynamic texture framework, however, is its inability to account for visual processes consisting of multiple, co-occurring dynamic textures: for example, a flock of birds flying in front of a water fountain, highway traffic moving at different speeds, or video containing both trees in the background and people in the foreground. 
In such cases, the existing dynamic texture model is inherently incorrect, since it must represent multiple motion fields with a single dynamic process.\n\nIn this work, we address this limitation by introducing a new generative model for video, which we denote the layered dynamic texture (LDT). It consists of augmenting the dynamic texture with a discrete hidden variable that enables the assignment of different dynamics to different regions of the video. Conditioned on the state of this hidden variable, the video is modeled as a simple dynamic texture. By introducing a shared dynamic representation for all the pixels in the same region, the new model is a layered representation. When compared with traditional layered models, it replaces the process of layer formation based on \u201cwarping of cardboard figures\u201d with one based on sampling from the generative model (for both dynamics and appearance) provided by the dynamic texture. This enables a much richer video representation. Since each layer is a dynamic texture, the model can also be seen as a multi-state dynamic texture, capable of assigning different dynamics and appearance to different image regions.\n\nWe consider two models for the LDT, which differ in the way they enforce consistency of layer dynamics. One model enforces stronger consistency but has no closed-form solution for the parameter estimates (which require sampling), while the second enforces weaker consistency but is simpler to learn. The models are applied to the segmentation and synthesis of sequences that are challenging for traditional vision representations. It is shown that stronger consistency leads to superior performance, demonstrating the benefits of sophisticated layered representations. The paper is organized as follows. In Section 2, we introduce the two layered dynamic texture models. In Section 3 we present the EM algorithm for learning both models from training data. 
Finally, in Section 4 we present an experimental evaluation in the context of segmentation and synthesis.\n\n2 Layered dynamic textures\n\nWe start with a brief review of dynamic textures, and then introduce the layered dynamic texture model.\n\n2.1 Dynamic texture\n\nA dynamic texture [7] is a generative model for video, based on a linear dynamical system. The basic idea is to separate the visual component and the underlying dynamics into two processes. While the dynamics are represented as a time-evolving state process x_t \in R^n, the appearance of frame y_t \in R^N is a linear function of the current state vector, plus some observation noise. Formally, the system is described by\n\nx_t = A x_{t-1} + B v_t\ny_t = C x_t + \sqrt{r} w_t\n\n(1)\n\nwhere A \in R^{n \times n} is a transition matrix, C \in R^{N \times n} a transformation matrix, B v_t \sim_{iid} N(0, Q) and \sqrt{r} w_t \sim_{iid} N(0, r I_N) the state and observation noise processes parameterized by B \in R^{n \times n} and r \in R, and the initial state x_0 \in R^n is a constant. One interpretation of the dynamic texture model is that the columns of C are the principal components of the video frames, and the state vectors the PCA coefficients of each video frame. This is the case when the model is learned with the method of [7].\n\nFigure 1: The layered dynamic texture (left), and the approximate layered dynamic texture (right). y_i is an observed pixel over time, x^{(j)} is a hidden state process, and Z is the collection of layer assignment variables z_i that assign each pixel to one of the state processes.\n\nAn alternative interpretation considers a single pixel as it evolves over time. Each coordinate of the state vector x_t defines a one-dimensional random trajectory in time. A pixel is then represented as a weighted sum of random trajectories, where the weighting coefficients are contained in the corresponding row of C. 
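As a concrete illustration, the generative equations in (1) are easy to simulate. The sketch below draws frames from a dynamic texture; the dimensions and the values of A, B, C, and r are illustrative toy assumptions, not parameters learned from real video:

```python
import numpy as np

def sample_dynamic_texture(A, B, C, r, x0, T, seed=None):
    """Draw T frames from equation (1): x_t = A x_{t-1} + B v_t,
    y_t = C x_t + sqrt(r) w_t, with v_t ~ N(0, I_n), w_t ~ N(0, I_N)."""
    rng = np.random.default_rng(seed)
    n, N = A.shape[0], C.shape[0]
    x = np.asarray(x0, dtype=float)
    frames = np.zeros((T, N))
    for t in range(T):
        x = A @ x + B @ rng.standard_normal(n)                    # state update
        frames[t] = C @ x + np.sqrt(r) * rng.standard_normal(N)   # observation
    return frames

# toy instantiation: 2-dimensional state, 4-pixel "frames", 50 time steps
n, N, T = 2, 4, 50
A = 0.9 * np.eye(n)     # stable dynamics (spectral radius < 1)
B = 0.1 * np.eye(n)
C = np.ones((N, n))     # each pixel is a weighted sum of state trajectories
frames = sample_dynamic_texture(A, B, C, r=0.01, x0=np.zeros(n), T=T)
```

Each row of C weights the state trajectories for one pixel, matching the per-pixel interpretation above.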
This is analogous to the discrete Fourier transform in signal processing, where a signal is represented as a weighted sum of complex exponentials, although, for the dynamic texture, the trajectories are not necessarily orthogonal. This interpretation illustrates the ability of the dynamic texture to model the same motion under different intensity levels (e.g. cars moving from the shade into sunlight) by simply scaling the rows of C. Regardless of interpretation, the simple dynamic texture model has only one state process, which restricts the efficacy of the model to video where the motion is homogeneous.\n\n2.2 Layered dynamic textures\n\nWe now introduce the layered dynamic texture (LDT), shown in Figure 1 (left). The model addresses the limitations of the dynamic texture by relying on a set of state processes X = {x^{(j)}}_{j=1}^{K} to model different video dynamics. The layer assignment variable z_i assigns pixel y_i to one of the state processes (layers), and conditioned on the layer assignments, the pixels in the same layer are modeled as a dynamic texture. In addition, the collection of layer assignments Z = {z_i}_{i=1}^{N} is modeled as a Markov random field (MRF) to ensure spatial layer consistency. The linear system equations for the layered dynamic texture are\n\nx^{(j)}_t = A^{(j)} x^{(j)}_{t-1} + B^{(j)} v^{(j)}_t,   j \in {1, ..., K}\ny_{i,t} = C^{(z_i)}_i x^{(z_i)}_t + \sqrt{r^{(z_i)}} w_{i,t},   i \in {1, ..., N}\n\n(2)\n\nwhere C^{(j)}_i \in R^{1 \times n} is the transformation from the hidden state to the observed pixel domain for each pixel y_i and each layer j, the noise parameters are B^{(j)} \in R^{n \times n} and r^{(j)} \in R, the iid noise processes are w_{i,t} \sim_{iid} N(0, 1) and v^{(j)}_t \sim_{iid} N(0, I_n), and the initial states are drawn from x^{(j)}_1 \sim N(\mu^{(j)}, S^{(j)}). 
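A generative sketch of these equations follows: one shared state trajectory is sampled per layer, and each pixel reads it out through its own row C^{(z_i)}_i. All parameter values, dimensions, and the dict layout below are illustrative assumptions:

```python
import numpy as np

def sample_ldt(layers, z, T, seed=None):
    """Sample video from the LDT of equation (2). `layers` is a list of K
    parameter dicts (A, B, C, r, mu, S); z[i] assigns pixel i to a layer.
    One state trajectory is shared by all pixels of the same layer."""
    rng = np.random.default_rng(seed)
    K, N = len(layers), len(z)
    n = layers[0]["A"].shape[0]
    X = np.zeros((K, T, n))
    for j, p in enumerate(layers):
        x = rng.multivariate_normal(p["mu"], p["S"])   # x_1^(j) ~ N(mu, S)
        for t in range(T):
            X[j, t] = x
            x = p["A"] @ x + p["B"] @ rng.standard_normal(n)
    Y = np.zeros((T, N))
    for i in range(N):
        p = layers[z[i]]
        # y_{i,t} = C_i^(z_i) x_t^(z_i) + sqrt(r^(z_i)) w_{i,t}
        Y[:, i] = X[z[i]] @ p["C"][i] + np.sqrt(p["r"]) * rng.standard_normal(T)
    return Y

rng = np.random.default_rng(0)
n, N, T = 2, 6, 30
layers = [dict(A=0.9 * np.eye(n), B=0.1 * np.eye(n),
               C=rng.standard_normal((N, n)), r=0.01,
               mu=np.zeros(n), S=np.eye(n)) for _ in range(2)]
z = [0, 0, 0, 1, 1, 1]   # first three pixels in layer 0, rest in layer 1
Y = sample_ldt(layers, z, T, seed=1)
```

Pixels assigned to the same layer fluctuate together because they share the trajectory X[j], which is exactly the strong consistency discussed below.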
As a generative model, the layered dynamic texture assumes that the state processes X and the layer assignments Z are independent, i.e. layer motion is independent of layer location, and vice versa. As will be seen in Section 3, this makes the expectation step of the EM algorithm intractable in closed form. To address this issue, we also consider a slightly different model.\n\n2.3 Approximate layered dynamic texture\n\nWe now consider a different model, the approximate layered dynamic texture (ALDT), shown in Figure 1 (right). Each pixel y_i is associated with its own state process x_i, and a different dynamic texture is defined for each pixel. However, dynamic textures associated with the same layer share the same set of dynamic parameters, which are assigned by the layer assignment variable z_i. Again, the collection of layer assignments Z is modeled as an MRF but, unlike the first model, conditioning on the layer assignments makes all the pixels independent. The model is described by the following linear system equations\n\nx_{i,t} = A^{(z_i)} x_{i,t-1} + B^{(z_i)} v_{i,t}\ny_{i,t} = C^{(z_i)}_i x_{i,t} + \sqrt{r^{(z_i)}} w_{i,t},   i \in {1, ..., N}\n\n(3)\n\nwhere the noise processes are w_{i,t} \sim_{iid} N(0, 1) and v_{i,t} \sim_{iid} N(0, I_n), and the initial states are given by x_{i,1} \sim N(\mu^{(z_i)}, S^{(z_i)}). This model can also be seen as a video extension of the popular image MRF models [11], where the class variables for each pixel form an MRF grid and each class (e.g. the pixels in the same segment) has some class-conditional distribution (in our case a linear dynamical system).\n\nThe main difference between the two proposed models is in the enforcement of consistency of dynamics within a layer. With the LDT, consistency of dynamics is strongly enforced by requiring each pixel in the layer to be associated with the same state process. 
On the other hand, for the ALDT, consistency within a layer is weakly enforced by allowing the pixels to be associated with many instantiations of the state process (instantiations associated with the same layer sharing the same dynamic parameters). This weaker dependency structure enables a more efficient learning algorithm.\n\n2.4 Modeling layer assignments\n\nThe MRF which determines layer assignments has the following distribution\n\np(Z) = (1 / \mathcal{Z}) \prod_i \psi_i(z_i) \prod_{(i,j) \in E} \psi_{i,j}(z_i, z_j)\n\n(4)\n\nwhere E is the set of edges in the MRF grid, \mathcal{Z} a normalization constant (partition function), and \psi_i and \psi_{i,j} potential functions of the form\n\n\psi_i(z_i) = \alpha_j  if z_i = j, for j \in {1, ..., K}\n\n\psi_{i,j}(z_i, z_j) = \gamma_1 if z_i = z_j, and \gamma_2 if z_i \neq z_j\n\n(5)\n\nThe potential function \psi_i defines a prior likelihood for each layer, while \psi_{i,j} attributes higher probability to configurations where neighboring pixels are in the same layer. While the parameters of the potential functions could be learned for each model, we instead treat them as constants that can be estimated from a database of manually segmented training video.\n\n3 Parameter estimation\n\nThe parameters of the model are learned using the Expectation-Maximization (EM) algorithm [12], which iterates between estimating the hidden state variables X and hidden layer assignments Z from the current parameters, and updating the parameters given the current hidden variable estimates. One iteration of the EM algorithm contains the following two steps:\n\nE-step: Q(\Theta; \hat{\Theta}) = E_{X,Z|Y;\hat{\Theta}}[\log p(X, Y, Z; \Theta)]\n\nM-step: \hat{\Theta}^* = \arg\max_\Theta Q(\Theta; \hat{\Theta})\n\nIn the remainder of this section, we briefly describe the EM algorithm for the two proposed models. 
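The layer-assignment prior of Section 2.4 is simple to evaluate up to its partition function. The sketch below scores a rectangular label grid under equations (4)-(5), using a 4-neighbor edge set for brevity (the experiments in Section 4 use an 8-neighbor system); the grids and parameter values are illustrative:

```python
import numpy as np

def mrf_log_score(Z, alpha, gamma1, gamma2):
    """Unnormalized log p(Z) for the MRF of equations (4)-(5).
    Z: 2-D integer array of layer labels; alpha[j] is the prior potential
    for layer j; gamma1/gamma2 reward/penalize neighbor (dis)agreement."""
    score = np.log(alpha[Z]).sum()   # node potentials psi_i
    # horizontal and vertical edges of a 4-neighbor grid
    for a, b in ((Z[:, :-1], Z[:, 1:]), (Z[:-1, :], Z[1:, :])):
        same = (a == b)
        score += same.sum() * np.log(gamma1) + (~same).sum() * np.log(gamma2)
    return score

alpha = np.array([0.5, 0.5])                 # uniform layer prior, K = 2
smooth = np.zeros((4, 4), dtype=int)         # one coherent layer
noisy = np.indices((4, 4)).sum(axis=0) % 2   # checkerboard labels
s_smooth = mrf_log_score(smooth, alpha, 0.99, 0.01)
s_noisy = mrf_log_score(noisy, alpha, 0.99, 0.01)
```

With gamma1 >> gamma2, spatially coherent labelings score much higher than speckled ones, which is exactly the smoothing effect the prior is meant to provide.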
Due to the limited space available, we refer the reader to the companion technical report [13] for further details.\n\n3.1 EM for the layered dynamic texture\n\nThe E-step for the layered dynamic texture computes the conditional mean and covariance of x^{(j)}_t given the observed video Y. These expectations are intractable to compute in closed form since it is not known to which state process each pixel y_i is assigned, and it is therefore necessary to marginalize over all configurations of Z. The same problem appears in the computation of the posterior layer assignment probability p(z_i = j | Y). The method we currently adopt to approximate these expectations is to simply average over draws from the posterior p(X, Z | Y) using a Gibbs sampler. Other approximations, e.g. variational methods or belief propagation, could be used as well, and we plan to consider them in the future. Once the expectations are known, the M-step parameter updates are analogous to those required to learn a regular linear dynamical system [15, 16], with a minor modification of the updates for the transformation matrices C^{(j)}_i, which differ for each pixel. See [13] for details.\n\n3.2 EM for the approximate layered dynamic texture\n\nThe ALDT model is similar to the mixture of dynamic textures [14], a video clustering model that treats a collection of videos as a sample from a collection of dynamic textures. Since, for the ALDT model, each pixel is sampled from a set of one-dimensional dynamic textures, the EM algorithm is similar to that of the mixture of dynamic textures. There are only two differences. First, the E-step computes the posterior assignment probability p(z_i | Y) given all the observed data, rather than conditioned on a single data point, p(z_i | y_i). The posterior p(z_i | Y) can be approximated by sampling from the full posterior p(Z | Y) using Markov-chain Monte Carlo [11], or with other methods, such as loopy belief propagation. 
Second, the transformation matrix C^{(j)}_i is different for each pixel, and the E and M steps must be modified accordingly. Once again, the details are available in [13].\n\n4 Experiments\n\nIn this section, we show the efficacy of the proposed models for the segmentation and synthesis of several videos with multiple regions of distinct motion. Figure 2 shows the three video sequences used in testing. The first (top) is a composite of three distinct video textures of water, smoke, and fire. The second (middle) is of laundry spinning in a dryer: the laundry in the bottom left of the video spins in place with a circular motion, while the laundry around the outside spins faster. The final video (bottom) is of a highway [17] where the traffic in each lane travels at a different speed; the first, second, and fourth lanes (from left to right) move faster than the third and fifth. All three videos have multiple regions of motion and are therefore properly modeled by the models proposed in this paper, but not by a regular dynamic texture.\n\nFour variations of the video models were fit to each of the three videos: the layered dynamic texture and the approximate layered dynamic texture (LDT and ALDT), and those two models without the MRF layer assignment (LDT-iid and ALDT-iid). In the latter two cases, the layer assignments z_i are distributed as iid multinomials. In all the experiments, the dimension of the state space was n = 10. The MRF grid was based on the eight-neighbor system (with cliques of size 2), and the parameters of the potential functions were \gamma_1 = 0.99, \gamma_2 = 0.01, and \alpha_j = 1/K. The expectations required by the EM algorithm were approximated using Gibbs sampling for the LDT and LDT-iid models and MCMC for the ALDT model. 
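To illustrate how these sampling approximations combine the MRF prior with the data likelihood, the sketch below runs a Gibbs sampler over the layer assignments Z and accumulates posterior marginals, from which the argmax segmentation of Section 4.1 follows. The per-pixel table loglik[i, j] is a hypothetical stand-in for log p(y_i | z_i = j), which in the actual models would come from the conditional linear dynamical systems:

```python
import numpy as np

def gibbs_layer_marginals(loglik, edges, alpha, gamma1, gamma2,
                          n_sweeps=60, burn=20, seed=None):
    """Approximate p(z_i = j | Y) by Gibbs sampling over the MRF of (4)-(5).
    loglik[i, j] stands in for log p(y_i | z_i = j); edges lists neighbor
    pairs (a, b). Returns posterior marginals and the argmax segmentation."""
    rng = np.random.default_rng(seed)
    N, K = loglik.shape
    nbrs = [[] for _ in range(N)]
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    z = rng.integers(K, size=N)
    counts = np.zeros((N, K))
    for sweep in range(n_sweeps):
        for i in range(N):   # resample z_i given its neighbors and the data
            logp = np.log(alpha) + loglik[i]
            for k in nbrs[i]:
                logp += np.where(np.arange(K) == z[k],
                                 np.log(gamma1), np.log(gamma2))
            p = np.exp(logp - logp.max())
            z[i] = rng.choice(K, p=p / p.sum())
        if sweep >= burn:   # discard burn-in sweeps
            counts[np.arange(N), z] += 1
    marginals = counts / counts.sum(axis=1, keepdims=True)
    return marginals, marginals.argmax(axis=1)

# tiny 1-D "video": 4 pixels in a chain, evidence strongly favoring layer 0
loglik = np.array([[10.0, 0.0]] * 4)
edges = [(0, 1), (1, 2), (2, 3)]
marginals, seg = gibbs_layer_marginals(loglik, edges,
                                       alpha=np.array([0.5, 0.5]),
                                       gamma1=0.99, gamma2=0.01, seed=0)
```

The neighbor term pulls each z_i toward agreement with its neighbors, while loglik pulls it toward the data; the accumulated counts approximate the marginals p(z_i = j | Y) needed in the E-step.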
We first present segmentation results, to show that the models can effectively separate layers with different dynamics, and then discuss results relative to video synthesis from the learned models.\n\n4.1 Segmentation\n\nThe videos were segmented by assigning each pixel to the most probable layer conditioned on the observed video, i.e.\n\nz^*_i = \arg\max_j p(z_i = j | Y)\n\n(6)\n\nAnother possibility would be to assign the pixels by maximizing the posterior of all the pixels, p(Z | Y). While the latter maximizes the true posterior, in practice we obtained similar results with the two methods. The former method was chosen because the individual posterior distributions are already computed during the E-step of EM.\n\nThe columns of Figure 3 show the segmentation results obtained for the four models: LDT and LDT-iid in columns (a) and (b), and ALDT and ALDT-iid in columns (c) and (d). The segmented video is also available at [18]. From the segmentations produced by the iid models, it can be concluded that the composite and laundry videos can be reasonably well segmented without the MRF prior. This confirms the intuition that these video regions contain very distinct dynamics, which can only be modeled with separate state processes; were this not the case, the pixels would be either randomly assigned among the various layers or uniformly assigned to one of them. The segmentations of the traffic video using the iid models are poor: while the dynamics of the lanes are different, the differences are significantly more subtle, and segmentation requires stronger enforcement of layer consistency. In general, the segmentations using LDT-iid are better than those of ALDT-iid, due to the weaker form of layer consistency imposed by the ALDT model. While this deficiency is offset by the introduction of the MRF prior, the stronger consistency enforced by the LDT model always results in better segmentations. 
This illustrates the need for the design of sophisticated layered representations when the goal is to model video with subtle inter-layer variations. As expected, the introduction of the MRF prior improves the segmentations produced by both models. For example, in the composite sequence all erroneous segments in the water region are removed and, in the traffic sequence, most of the speckled segmentation also disappears.\n\nIn terms of overall segmentation quality, both LDT and ALDT segment the composite video perfectly. The segmentation of the laundry video by both models is plausible, as the laundry tumbling around the edge of the dryer moves faster than that spinning in place. The two models also produce reasonable segmentations of the traffic video, with the segments roughly corresponding to the different lanes of traffic. Many of the errors correspond to regions that either contain intermittent motion (e.g. the region between the lanes) or almost no motion (e.g. the truck in the upper-right corner and the flat-bed truck in the third lane). Some of these errors could be eliminated by filtering the video before segmentation, but we have attempted no pre- or post-processing. Finally, we note that the laundry and traffic videos are not trivial to segment with standard computer vision techniques, namely methods based on optical flow. This is particularly true for the traffic video, where the abundance of straight lines and flat regions makes computing the correct optical flow difficult, due to the aperture problem.\n\n4.2 Synthesis\n\nThe layered dynamic texture is a generative model, and hence a video can be synthesized by drawing a sample from the learned model. A synthesized composite video using the LDT, the ALDT, and the regular dynamic texture can be found at [18]. 
When modeling a video with multiple motions, the regular dynamic texture averages the different dynamics.\n\nFigure 2: Frames from the test video sequences: (top) composite of water, smoke, and fire video textures; (middle) spinning laundry in a dryer; and (bottom) highway traffic with lanes traveling at different speeds.\n\nFigure 3: Segmentation results for each of the test videos using: (a) the layered dynamic texture, and (b) the layered dynamic texture without MRF; (c) the approximate layered dynamic texture, and (d) the approximate LDT without MRF.\n\nThis averaging is noticeable in the synthesized video, where the fire region does not flicker at the same speed as in the original video. Furthermore, the motions in different regions are coupled, e.g. when the fire begins to flicker faster, the water region ceases to move smoothly. In contrast, the video synthesized from the layered dynamic texture is more realistic, as the fire region flickers at the correct speed, and the different regions follow their own motion patterns. The video synthesized from the ALDT appears noisy because the pixels evolve from different instantiations of the state process. Once again, this illustrates the need for sophisticated layered models.\n\nReferences\n\n[1] B. K. P. Horn. Robot Vision. McGraw-Hill Book Company, New York, 1986.\n\n[2] B. Horn and B. Schunck. Determining optical flow. Artificial Intelligence, vol. 17, 1981.\n\n[3] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. Proc. DARPA Image Understanding Workshop, 1981.\n\n[4] J. Barron, D. Fleet, and S. Beauchemin. Performance of optical flow techniques. International Journal of Computer Vision, vol. 12, 1994.\n\n[5] J. Wang and E. Adelson. Representing moving images with layers. IEEE Trans. on Image Processing, vol. 3, September 1994.\n\n[6] B. Frey and N. Jojic. 
Estimating mixture models of images and inferring spatial transformations using the EM algorithm. In IEEE Conference on Computer Vision and Pattern Recognition, 1999.\n\n[7] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto. Dynamic textures. International Journal of Computer Vision, vol. 2, pp. 91-109, 2003.\n\n[8] G. Doretto, D. Cremers, P. Favaro, and S. Soatto. Dynamic texture segmentation. In IEEE International Conference on Computer Vision, vol. 2, pp. 1236-42, 2003.\n\n[9] P. Saisan, G. Doretto, Y. Wu, and S. Soatto. Dynamic texture recognition. In IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 58-63, 2001.\n\n[10] A. B. Chan and N. Vasconcelos. Probabilistic kernels for the classification of auto-regressive visual processes. In IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 846-51, 2005.\n\n[11] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6(6), pp. 721-41, 1984.\n\n[12] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, vol. 39, pp. 1-38, 1977.\n\n[13] A. B. Chan and N. Vasconcelos. The EM algorithm for layered dynamic textures. Technical Report SVCL-TR-2005-03, June 2005. http://www.svcl.ucsd.edu/.\n\n[14] A. B. Chan and N. Vasconcelos. Mixtures of dynamic textures. In IEEE International Conference on Computer Vision, vol. 1, pp. 641-47, 2005.\n\n[15] R. H. Shumway and D. S. Stoffer. An approach to time series smoothing and forecasting using the EM algorithm. Journal of Time Series Analysis, vol. 3(4), pp. 253-64, 1982.\n\n[16] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, vol. 11, pp. 
305-45, 1999.\n\n[17] http://www.wsdot.wa.gov\n\n[18] http://www.svcl.ucsd.edu/~abc/nips05/", "award": [], "sourceid": 2832, "authors": [{"given_name": "Antoni", "family_name": "Chan", "institution": null}, {"given_name": "Nuno", "family_name": "Vasconcelos", "institution": null}]}