{"title": "A Disentangled Recognition and Nonlinear Dynamics Model for Unsupervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3601, "page_last": 3610, "abstract": "This paper takes a step towards temporal reasoning in a dynamically changing video, not in the pixel space that constitutes its frames, but in a latent space that describes the non-linear dynamics of the objects in its world. We introduce the Kalman variational auto-encoder, a framework for unsupervised learning of sequential data that disentangles two latent representations: an object's representation, coming from a recognition model, and a latent state describing its dynamics. As a result, the evolution of the world can be imagined and missing data imputed, both without the need to generate high dimensional frames at each time step. The model is trained end-to-end on videos of a variety of simulated physical systems, and outperforms competing methods in generative and missing data imputation tasks.", "full_text": "A Disentangled Recognition and Nonlinear Dynamics\n\nModel for Unsupervised Learning\n\nMarco Fraccaro\u2020\u2217\n\nSimon Kamronn \u2020\u2217\n\u2020 Technical University of Denmark\n\nUlrich Paquet\u2021\n\n\u2021 DeepMind\n\nOle Winther\u2020\n\nAbstract\n\nThis paper takes a step towards temporal reasoning in a dynamically changing video,\nnot in the pixel space that constitutes its frames, but in a latent space that describes\nthe non-linear dynamics of the objects in its world. We introduce the Kalman\nvariational auto-encoder, a framework for unsupervised learning of sequential data\nthat disentangles two latent representations: an object\u2019s representation, coming\nfrom a recognition model, and a latent state describing its dynamics. As a result, the\nevolution of the world can be imagined and missing data imputed, both without the\nneed to generate high dimensional frames at each time step. 
The model is trained\nend-to-end on videos of a variety of simulated physical systems, and outperforms\ncompeting methods in generative and missing data imputation tasks.\n\n1\n\nIntroduction\n\nFrom the earliest stages of childhood, humans learn to represent high-dimensional sensory input\nto make temporal predictions. From the visual image of a moving tennis ball, we can imagine\nits trajectory, and prepare ourselves in advance to catch it. Although the act of recognising the\ntennis ball is seemingly independent of our intuition of Newtonian dynamics [31], very little of this\nassumption has yet been captured in the end-to-end models that presently mark the path towards\narti\ufb01cial general intelligence. Instead of basing inference on any abstract grasp of dynamics that is\nlearned from experience, current successes are autoregressive: to imagine the tennis ball\u2019s trajectory,\none forward-generates a frame-by-frame rendering of the full sensory input [5, 7, 23, 24, 29, 30].\nTo disentangle two latent representations, an object\u2019s, and that of its dynamics, this paper introduces\nKalman variational auto-encoders (KVAEs), a model that separates an intuition of dynamics from\nan object recognition network (section 3). At each time step t, a variational auto-encoder [18, 25]\ncompresses high-dimensional visual stimuli xt into latent encodings at. The temporal dynamics in\nthe learned at-manifold are modelled with a linear Gaussian state space model that is adapted to\nhandle complex dynamics (despite the linear relations among its states zt). The parameters of the\nstate space model are adapted at each time step, and non-linearly depend on past at\u2019s via a recurrent\nneural network. 
Exact posterior inference for the linear Gaussian state space model can be performed with the Kalman filtering and smoothing algorithms, and is used for imputing missing data, for instance when we imagine the trajectory of a bouncing ball after observing it in initial and final video frames (section 4). The separation between recognition and dynamics model allows missing data imputation to be done via a combination of the latent states zt of the model and its encodings at only, without having to forward-sample high-dimensional images xt in an autoregressive way. KVAEs are tested on videos of a variety of simulated physical systems in section 5: from raw visual stimuli, the model learns the interplay between the recognition and dynamics components “end-to-end”. As KVAEs can do smoothing, they outperform an array of methods in generative and missing data imputation tasks (section 5).\n\n∗Equal contribution.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n2 Background\n\nLinear Gaussian state space models. Linear Gaussian state space models (LGSSMs) are widely used to model sequences of vectors a = a_{1:T} = [a_1, .., a_T]. LGSSMs model temporal correlations through a first-order Markov process on latent states z = [z_1, .., z_T], which are potentially further controlled with external inputs u = [u_1, .., u_T], through the Gaussian distributions\n\np_{γt}(z_t | z_{t−1}, u_t) = N(z_t; A_t z_{t−1} + B_t u_t, Q),   p_{γt}(a_t | z_t) = N(a_t; C_t z_t, R).   (1)\n\nMatrices γ_t = [A_t, B_t, C_t] are the state transition, control and emission matrices at time t. Q and R are the covariance matrices of the process and measurement noise respectively.
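To make (1) concrete, ancestral sampling through these two Gaussians can be sketched in a few lines of numpy. This is a minimal sketch under simplifying assumptions (time-invariant A, B, C and a standard normal starting state); the function name and shapes are illustrative and not taken from the released code.

```python
import numpy as np

def sample_lgssm(A, B, C, Q, R, u, rng):
    """Ancestral sampling through the Gaussian transition and emission in (1).

    A, B, C: state transition, control and emission matrices (assumed
    time-invariant here for simplicity); Q, R: process and measurement
    noise covariances; u: control inputs of shape (T, dim_u).
    """
    T, dim_z, dim_a = u.shape[0], A.shape[0], C.shape[0]
    z = np.zeros((T, dim_z))
    a = np.zeros((T, dim_a))
    # starting state z_1 (drawn here from a standard normal prior)
    z[0] = rng.multivariate_normal(np.zeros(dim_z), np.eye(dim_z))
    a[0] = rng.multivariate_normal(C @ z[0], R)
    for t in range(1, T):
        # z_t ~ N(z_t; A z_{t-1} + B u_t, Q): first-order Markov transition
        z[t] = rng.multivariate_normal(A @ z[t - 1] + B @ u[t], Q)
        # a_t ~ N(a_t; C z_t, R): emission of the pseudo-observation
        a[t] = rng.multivariate_normal(C @ z[t], R)
    return a, z
```

Running the loop forward in this way is exactly the generative direction of the model; filtering and smoothing invert it.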
With a starting state z_1 ∼ N(z_1; 0, Σ), the joint probability distribution of the LGSSM is given by\n\np_γ(a, z | u) = p_γ(a | z) p_γ(z | u) = ∏_{t=1}^{T} p_{γt}(a_t | z_t) · p(z_1) ∏_{t=2}^{T} p_{γt}(z_t | z_{t−1}, u_t),   (2)\n\nwhere γ = [γ_1, .., γ_T]. LGSSMs have very appealing properties that we wish to exploit: the filtered and smoothed posteriors p(z_t | a_{1:t}, u_{1:t}) and p(z_t | a, u) can be computed exactly with the classical Kalman filter and smoother algorithms, and provide a natural way to handle missing data.\n\nVariational auto-encoders. A variational auto-encoder (VAE) [18, 25] defines a deep generative model p_θ(x_t, a_t) = p_θ(x_t | a_t) p(a_t) for data x_t by introducing a latent encoding a_t. Given a likelihood p_θ(x_t | a_t) and a typically Gaussian prior p(a_t), the posterior p_θ(a_t | x_t) represents a stochastic map from x_t to a_t's manifold. As this posterior is commonly analytically intractable, VAEs approximate it with a variational distribution q_φ(a_t | x_t) that is parameterized by φ. The approximation q_φ is commonly called the recognition, encoding, or inference network.\n\n3 Kalman Variational Auto-Encoders\n\nThe useful information that describes the movement and interplay of objects in a video typically lies in a manifold that has a smaller dimension than the number of pixels in each frame. In a video of a ball bouncing in a box, like Atari's game Pong, one could define a one-to-one mapping from each of the high-dimensional frames x = [x_1, .., x_T] into a two-dimensional latent space that represents the position of the ball on the screen. If the position was known for consecutive time steps, for a set of videos, we could learn the temporal dynamics that govern the environment.
From a few new positions one might then infer where the ball will be on the screen in the future, and then imagine the environment with the ball in that position.\n\nThe Kalman variational auto-encoder (KVAE) is based on the notion described above. To disentangle recognition and spatial representation, a sensory input x_t is mapped to a_t (VAE), a variable on a low-dimensional manifold that encodes an object's position and other visual properties. In turn, a_t is used as a pseudo-observation for the dynamics model (LGSSM). x_t represents a frame of a video² x = [x_1, .., x_T] of length T. Each frame is encoded into a point a_t on a low-dimensional manifold, so that the KVAE contains T separate VAEs that share the same decoder p_θ(x_t | a_t) and encoder q_φ(a_t | x_t), and depend on each other through a time-dependent prior over a = [a_1, .., a_T]. This is illustrated in figure 1.\n\nFigure 1: A KVAE is formed by stacking a LGSSM (dashed blue) and a VAE (dashed red). Shaded nodes denote observed variables. Solid arrows represent the generative model (with parameters θ) while dashed arrows represent the VAE inference network (with parameters φ).\n\n²While our main focus in this paper is videos, the same ideas could be applied more generally to any sequence of high dimensional data.\n\n3.1 Generative model\n\nWe assume that a acts as a latent representation of the whole video, so that the generative model of a sequence factorizes as p_θ(x | a) = ∏_{t=1}^{T} p_θ(x_t | a_t). In this paper p_θ(x_t | a_t) is a deep neural network parameterized by θ, that emits either a factorized Gaussian or Bernoulli probability vector depending on the data type of x_t.
We model a with a LGSSM, and following (2), its prior distribution is\n\np_γ(a | u) = ∫ p_γ(a | z) p_γ(z | u) dz,   (3)\n\nso that the joint density for the KVAE factorizes as p(x, a, z | u) = p_θ(x | a) p_γ(a | z) p_γ(z | u). A LGSSM forms a convenient backbone to a model, as the filtered and smoothed distributions p_γ(z_t | a_{1:t}, u_{1:t}) and p_γ(z_t | a, u) can be obtained exactly. Temporal reasoning can be done in the latent space of z_t's and via the latent encodings a, and we can do long-term predictions without having to auto-regressively generate high-dimensional images x_t. Given a few frames, and hence their encodings, one could “remain in latent space” and use the smoothed distributions to impute missing frames. Another advantage of using a to separate the dynamics model from x can be seen by considering the emission matrix C_t. Inference in the LGSSM requires matrix inverses, and using it as a model for the prior dynamics of a_t allows the size of C_t to remain small, and not scale with the number of pixels in x_t. While the LGSSM's process and measurement noise in (1) are typically formulated with full covariance matrices [26], we will consider them as isotropic in a KVAE, as a_t acts as a prior in a generative model that includes these extra degrees of freedom.\n\nWhat happens when a ball bounces against a wall, and the dynamics on a_t are not linear any more? Can we still retain a LGSSM backbone? We will incorporate nonlinearities into the LGSSM by regulating γ_t from outside the exact forward-backward inference chain. We revisit this central idea at length in section 3.3.\n\n3.2 Learning and inference for the KVAE\n\nWe learn θ and γ from a set of example sequences {x^(n)} by maximizing the sum of their respective log likelihoods L = ∑_n log p_{θγ}(x^(n) | u^(n)) as a function of θ and γ.
For simplicity in the exposition we restrict our discussion below to one sequence, and omit the sequence index n. The log likelihood or evidence is an intractable average over all plausible settings of a and z, and exists as the denominator in Bayes' theorem when inferring the posterior p(a, z|x, u). A more tractable approach to both learning and inference is to introduce a variational distribution q(a, z|x, u) that approximates the posterior. The evidence lower bound (ELBO) F is\n\nlog p(x | u) = log ∫ p(x, a, z | u) da dz ≥ E_{q(a,z|x,u)}[ log ( p_θ(x | a) p_γ(a | z) p_γ(z | u) / q(a, z | x, u) ) ] = F(θ, γ, φ),   (4)\n\nand a sum of F's is maximized instead of a sum of log likelihoods. The variational distribution q depends on φ, but for the bound to be tight we should specify q to be equal to the posterior distribution that only depends on θ and γ. Towards this aim we structure q so that it incorporates the exact conditional posterior p_γ(z | a, u), that we obtain with Kalman smoothing, as a factor of γ:\n\nq(a, z | x, u) = q_φ(a | x) p_γ(z | a, u) = ∏_{t=1}^{T} q_φ(a_t | x_t) p_γ(z | a, u).   (5)\n\nThe benefit of the LGSSM backbone is now apparent. We use a “recognition model” to encode each x_t using a non-linear function, after which exact smoothing is possible. In this paper q_φ(a_t | x_t) is a deep neural network that maps x_t to the mean and the diagonal covariance of a Gaussian distribution. As explained in section 4, this factorization allows us to deal with missing data in a principled way. Using (5), the ELBO in (4) becomes\n\nF(θ, γ, φ) = E_{q_φ(a|x)}[ log ( p_θ(x | a) / q_φ(a | x) ) + E_{p_γ(z|a,u)}[ log ( p_γ(a | z) p_γ(z | u) / p_γ(z | a, u) ) ] ].   (6)\n\nThe lower bound in (6) can be estimated using Monte Carlo integration with samples {(ã^(i), z̃^(i))}_{i=1}^{I} drawn from q,\n\nF̂(θ, γ, φ) = (1/I) ∑_i [ log p_θ(x | ã^(i)) + log p_γ(ã^(i), z̃^(i) | u) − log q_φ(ã^(i) | x) − log p_γ(z̃^(i) | ã^(i), u) ].   (7)\n\nNote that the ratio p_γ(ã^(i), z̃^(i) | u)/p_γ(z̃^(i) | ã^(i), u) in (7) gives p_γ(ã^(i) | u), but the formulation with {z̃^(i)} allows stochastic gradients on γ to also be computed. A sample from q can be obtained by first sampling ã ∼ q_φ(a | x), and using ã as an observation for the LGSSM. The posterior p_γ(z | ã, u) can be tractably obtained with a Kalman smoother, and a sample z̃ ∼ p_γ(z | ã, u) obtained from it. Parameter learning is done by jointly updating θ, φ, and γ by maximising the ELBO on L, which decomposes as a sum of ELBOs in (6), using stochastic gradient ascent and a single sample to approximate the intractable expectations.\n\n3.3 Dynamics parameter network\n\nThe LGSSM provides a tractable way to structure p_γ(z | a, u) into the variational approximation in (5). However, even in the simple case of a ball bouncing against a wall, the dynamics on a_t are not linear anymore.
We can deal with these situations while preserving the linear dependency between consecutive states in the LGSSM, by non-linearly changing the parameters γ_t of the model over time as a function of the latent encodings up to time t − 1 (so that we can still define a generative model). Smoothing is still possible as the state transition matrix A_t and others in γ_t do not have to be constant in order to obtain the exact posterior p_γ(z_t | a, u).\n\nRecall that γ_t describes how the latent state z_{t−1} changes from time t − 1 to time t. In the more general setting, the changes in dynamics at time t may depend on the history of the system, encoded in a_{1:t−1} and possibly a starting code a_0 that can be learned from data. If, for instance, we see the ball colliding with a wall at time t − 1, then we know that it will bounce at time t and change direction. We then let γ_t be a learnable function of a_{0:t−1}, so that the prior in (2) becomes\n\np_γ(a, z | u) = ∏_{t=1}^{T} p_{γ_t(a_{0:t−1})}(a_t | z_t) · p(z_1) ∏_{t=2}^{T} p_{γ_t(a_{0:t−1})}(z_t | z_{t−1}, u_t).   (8)\n\nDuring inference, after all the frames are encoded in a, the dynamics parameter network returns γ = γ(a), the parameters of the LGSSM at all time steps. We can now use the Kalman smoothing algorithm to find the exact conditional posterior over z, that will be used when computing the gradients of the ELBO.\n\nIn our experiments the dependence of γ_t on a_{0:t−1} is modulated by a dynamics parameter network α_t = α_t(a_{0:t−1}), that is implemented with a recurrent neural network with LSTM cells that takes at each time step the encoded state as input and recurses d_t = LSTM(a_{t−1}, d_{t−1}) and α_t = softmax(d_t), as illustrated in figure 2.
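As an illustration of this mechanism, the sketch below blends K learned matrices with softmax weights, as in equation (9) below; a plain vector stands in for the LSTM output d_t, and all names are illustrative rather than taken from the released code.

```python
import numpy as np

def softmax(d):
    """Numerically stable softmax, so the K weights sum to one."""
    e = np.exp(d - d.max())
    return e / e.sum()

def mix_lgssm_params(d_t, A_bank, B_bank, C_bank):
    """Blend K learned operating modes into the LGSSM matrices at time t.

    d_t stands in for the LSTM output at time t; A_bank, B_bank, C_bank
    hold the K mode matrices stacked along the first axis (shape (K, ...)).
    """
    alpha = softmax(d_t)                       # alpha_t^(k), sums to one
    A_t = np.tensordot(alpha, A_bank, axes=1)  # sum_k alpha^(k) A^(k)
    B_t = np.tensordot(alpha, B_bank, axes=1)
    C_t = np.tensordot(alpha, C_bank, axes=1)
    return alpha, A_t, B_t, C_t
```

When one weight saturates, the dynamics collapse onto a single mode; in between, the model interpolates smoothly between modes.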
The output of the dynamics parameter network is a set of weights that sum to one, ∑_{k=1}^{K} α_t^(k)(a_{0:t−1}) = 1. These weights choose and interpolate between K different operating modes:\n\nA_t = ∑_{k=1}^{K} α_t^(k)(a_{0:t−1}) A^(k),   B_t = ∑_{k=1}^{K} α_t^(k)(a_{0:t−1}) B^(k),   C_t = ∑_{k=1}^{K} α_t^(k)(a_{0:t−1}) C^(k).   (9)\n\nFigure 2: Dynamics parameter network for the KVAE.\n\nWe globally learn K basic state transition, control and emission matrices A^(k), B^(k) and C^(k), and interpolate them based on information from the VAE encodings. The weighted sum can be interpreted as a soft mixture of K different LGSSMs whose time-invariant matrices are combined using the time-varying weights α_t. In practice, each of the K sets {A^(k), B^(k), C^(k)} models different dynamics, that will dominate when the corresponding α_t^(k) is high. The dynamics parameter network resembles the locally-linear transitions of [16, 33]; see section 6 for an in-depth discussion of the differences.\n\n4 Missing data imputation\n\nLet x_obs be an observed subset of frames in a video sequence, for instance depicting the initial movement and final positions of a ball in a scene. From its start and end, can we imagine how the ball reaches its final position? Autoregressive models like recurrent neural networks can only forward-generate x_t frame by frame, and cannot make use of the information coming from the final frames in the sequence. To impute the unobserved frames x_un in the middle of the sequence, we need to do inference, not prediction.\n\nThe KVAE exploits the smoothing abilities of its LGSSM to use both the information from the past and the future when imputing missing data.
In general, if x = {x_obs, x_un}, the unobserved frames in x_un could also appear at non-contiguous time steps, e.g. missing at random. Data can be imputed by sampling from the joint density p(a_un, a_obs, z | x_obs, u), and then generating x_un from a_un. We factorize this distribution as\n\np(a_un, a_obs, z | x_obs, u) = p_γ(a_un | z) p_γ(z | a_obs, u) p(a_obs | x_obs),   (10)\n\nand we sample from it with ancestral sampling starting from x_obs. Reading (10) from right to left, a sample from p(a_obs | x_obs) can be approximated with the variational distribution q_φ(a_obs | x_obs). Then, if γ is fully known, p_γ(z | a_obs, u) is computed with an extension to the Kalman smoothing algorithm to sequences with missing data, after which samples from p_γ(a_un | z) could be readily drawn.\n\nHowever, when doing missing data imputation the parameters γ of the LGSSM are not known at all time steps. In the KVAE, each γ_t depends on all the previous encoded states, including a_un, and these need to be estimated before γ can be computed. In this paper we recursively estimate γ in the following way. Assume that x_{1:t−1} is known, but not x_t. We sample a_{1:t−1} from q_φ(a_{1:t−1} | x_{1:t−1}) using the VAE, and use it to compute γ_{1:t}. The computation of γ_{t+1} depends on a_t, which is missing, and an estimate â_t will be used. Such an estimate can be arrived at in two steps. The filtered posterior distribution p_γ(z_{t−1} | a_{1:t−1}, u_{1:t−1}) can be computed as it depends only on γ_{1:t−1}, and from it, we sample\n\nẑ_t ∼ p_γ(z_t | a_{1:t−1}, u_{1:t}) = ∫ p_{γ_t}(z_t | z_{t−1}, u_t) p_γ(z_{t−1} | a_{1:t−1}, u_{1:t−1}) dz_{t−1}   (11)\n\nand sample â_t from the predictive distribution of a_t,\n\nâ_t ∼ p_γ(a_t | a_{1:t−1}, u_{1:t}) = ∫ p_{γ_t}(a_t | z_t) p_γ(z_t | a_{1:t−1}, u_{1:t}) dz_t ≈ p_{γ_t}(a_t | ẑ_t).   (12)\n\nThe parameters of the LGSSM at time t + 1 are then estimated as γ_{t+1}([a_{0:t−1}, â_t]).
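The two-step estimate of (11)-(12) amounts to a Kalman predict step followed by an emission. The minimal sketch below propagates the filtered mean and covariance of z_{t−1} and, as a simplification of the sampling described above, uses the predictive mean directly as the point estimate of the missing encoding; names are illustrative and not from the released code.

```python
import numpy as np

def estimate_missing_encoding(z_mean, z_cov, A_t, B_t, C_t, Q, u_t):
    """Point estimate of a missing encoding a_t via one Kalman predict step.

    z_mean, z_cov: mean and covariance of the filtered posterior of z_{t-1};
    A_t, B_t, C_t, Q: current LGSSM parameters. Instead of sampling as in
    (11)-(12), the predictive means are used directly.
    """
    # (11): the predictive distribution of z_t is Gaussian under the LGSSM
    z_pred_mean = A_t @ z_mean + B_t @ u_t
    z_pred_cov = A_t @ z_cov @ A_t.T + Q
    # (12): map the predicted state through the emission, a_hat_t = C_t z_hat_t
    a_hat = C_t @ z_pred_mean
    return a_hat, z_pred_mean, z_pred_cov
```

The returned predictive mean and covariance can then seed the next filtering step when the following frame is also missing.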
The same\n\np\u03b3t(zt|zt\u22121, ut) p\u03b3(zt\u22121|a1:t\u22121, u1:t\u22121) dzt\u22121\np\u03b3t(at|zt) p\u03b3(zt|a1:t\u22121, u1:t) dzt \u2248 p\u03b3t(at|(cid:98)zt) .\n\nprocedure is repeated at the next time step if xt+1 is missing, otherwise at+1 is drawn from the VAE.\nAfter the forward pass through the sequence, where we estimate \u03b3 and compute the \ufb01ltered posterior\nfor z, the Kalman smoother\u2019s backwards pass computes the smoothed posterior. While the smoothed\nposterior distribution is not exact, as it relies on the estimate of \u03b3 obtained during the forward pass, it\nimproves data imputation by using information coming from the whole sequence; see section 5 for an\nexperimental illustration.\n\n5 Experiments\n\nWe motivated the KVAE with an example of a bouncing ball, and use it here to demonstrate the\nmodel\u2019s ability to separately learn a recognition and dynamics model from video, and use it to impute\nmissing data. To draw a comparison with deep variational Bayes \ufb01lters (DVBFs) [16], we apply\nthe KVAE to [16]\u2019s pendulum example. We further apply the model to a number of environments\nwith different properties to demonstrate its generalizability. All models are trained end-to-end with\nstochastic gradient descent. Using the control input ut in (1) we can inform the model of known\nquantities such as external forces, as will be done in the pendulum experiment. In all the other\nexperiments, we omit such information and train the models fully unsupervised from the videos only.\nFurther implementation details can be found in the supplementary material (appendix A) and in the\nTensor\ufb02ow [1] code released at github.com/simonkamronn/kvae.\n\n5.1 Bouncing ball\n\nWe simulate 5000 sequences of 20 time steps each of a ball moving in a two-dimensional box, where\neach video frame is a 32x32 binary image. 
A video sequence is visualised as a single image in figure 4d, with the ball's darkening color reflecting the incremental frame index. In this set-up the initial position and velocity are randomly sampled. No forces are applied to the ball, except for the fully elastic collisions with the walls. The minimum number of latent dimensions that the KVAE requires to model the ball's dynamics is a_t ∈ R^2 and z_t ∈ R^4, as at the very least the ball's position in the box's 2d plane has to be encoded in a_t, and z_t has to encode the ball's position and velocity. The model's flexibility increases with more latent dimensions, but we choose these settings for the sake of interpretable visualisations. The dynamics parameter network uses K = 3 to interpolate three modes: a constant velocity, and two non-linear interactions with the horizontal and vertical walls.\n\nWe compare the generation and imputation performance of the KVAE with two recurrent neural network (RNN) models that are based on the same auto-encoding (AE) architecture as the KVAE and are modifications of methods from the literature to be better suited to the bouncing ball experiments.³\n\n³We also experimented with the SRNN model from [8] as it can do smoothing. However, the model is probably too complex for the task at hand, and we could not make it learn good dynamics.\n\n(a) Frames x_t missing completely at random.\n\n(b) Frames x_t missing in the middle of the sequence.\n\n(c) Comparison of encoded (ground truth), generated and smoothed trajectories of a KVAE in the latent space a. The black squares illustrate observed samples and the hexagons indicate the initial state.
Notice that the a_t's lie on a manifold that can be rotated and stretched to align with the frames of the video.\n\nFigure 3: Missing data imputation results.\n\nIn the AE-RNN, inspired by the architecture from [29], a pretrained convolutional auto-encoder, identical to the one used for the KVAE, feeds the encodings to an LSTM network [13]. During training the LSTM predicts the next encoding in the sequence, and during generation we use the previous output as input to the current step. For data imputation the LSTM either receives the previous output or, if available, the encoding of the observed frame (similarly to filtering in the KVAE). The VAE-RNN is identical to the AE-RNN except that it uses a VAE instead of an AE, similarly to the model from [6].\n\nFigure 3a shows how well missing frames are imputed in terms of the average fraction of incorrectly guessed pixels. In it, the first 4 frames are observed (to initialize the models) after which the next 16 frames are dropped at random with varying probabilities. We then impute the missing frames by doing filtering and smoothing with the KVAE. We see in figure 3a that it is beneficial to utilize information from the whole sequence (even the future observed frames), and a KVAE with smoothing outperforms all competing methods. Notice that dropout probability 1 corresponds to pure generation from the models. Figure 3b repeats this experiment, but makes it more challenging by removing an increasing number of consecutive frames from the middle of the sequence (T = 20). In this case the ability to encode information coming from the future into the posterior distribution is highly beneficial, and smoothing imputes frames much better than the other methods. Figure 3c graphically illustrates figure 3b. We plot three trajectories over a_t-encodings.
The generated trajectories were obtained after initializing the KVAE model with 4 initial frames, while the smoothed trajectories also incorporated encodings from the last 4 frames of the sequence. The encoded trajectories were obtained with no missing data, and are therefore considered as ground truth. In the first three plots in figure 3c, we see that the backwards recursion of the Kalman smoother corrects the trajectory obtained with generation in the forward pass. However, in the fourth plot, the poor trajectory obtained during the forward generation step makes smoothing unable to follow the ground truth.\n\nThe smoothing capabilities of KVAEs also make it possible to train them with up to 40% of missing data with only minor losses in performance (appendix C in the supplementary material). Links to videos of the imputation results and long-term generation from the models can be found in appendix B and at sites.google.com/view/kvae.\n\nUnderstanding the dynamics parameter network. In our experiments the dynamics parameter network α_t = α_t(a_{0:t−1}) is an LSTM network, but we could also parameterize it with any differentiable function of a_{0:t−1} (see appendix D in the supplementary material for a comparison of various architectures).\n\n(a) k = 1   (b) k = 2   (c) k = 3   (d) Reconstruction of x\n\nFigure 4: A visualisation of the dynamics parameter network α_t^(k)(a_{t−1}) for K = 3, as a function of a_{t−1}. The three α_t^(k)'s sum to one at every point in the encoded space. The greyscale backgrounds in a) to c) correspond to the intensity of the weights α_t^(k), with white indicating a weight of one in the dynamics parameter network's output. Overlaid on them is the full latent encoding a. d) shows the reconstructed frames of the video as one image.\n\nWhen using a multi-layer perceptron (MLP) that depends on the previous encoding as mixture network, i.e. 
α_t = α_t(a_{t−1}), figure 4 illustrates how the network chooses the mixture of learned dynamics. We see that the model has correctly learned to choose a transition that maintains a constant velocity in the center (k = 1), reverses the horizontal velocity when in proximity of the left and right walls (k = 2), and reverses the vertical velocity when close to the top and bottom (k = 3).\n\n5.2 Pendulum experiment\n\nWe test the KVAE on the experiment of a dynamic torque-controlled pendulum used in [16]. Training, validation and test set are formed by 500 sequences of 15 frames of 16x16 pixels. We use a KVAE with a_t ∈ R^2, z_t ∈ R^3 and K = 2, and try two different encoder-decoder architectures for the VAE, one using a MLP and one using a convolutional neural network (CNN). We compare the performances of the KVAE to DVBFs [16] and deep Markov models⁴ (DMM) [19], non-linear SSMs parameterized by deep neural networks whose intractable posterior distribution is approximated with an inference network. In table 1 we see that the KVAE outperforms both models in terms of ELBO on a test set, showing that for the task at hand it is preferable to use a model with simpler dynamics but exact posterior inference.\n\nTable 1: Pendulum experiment.\n\nModel | Test ELBO\nKVAE (CNN) | 810.08\nKVAE (MLP) | 807.02\nDVBF | 798.56\nDMM | 784.70\n\n5.3 Other environments\n\nTo test how well the KVAE adapts to different environments, we trained it end-to-end on videos of (i) a ball bouncing between walls that form an irregular polygon, (ii) a ball bouncing in a box and subject to gravity, (iii) a Pong-like environment where the paddles follow the vertical position of the ball to make it stay in the frame at all times. Figure 5 shows that the KVAE learns the dynamics of all three environments, and generates realistic-looking trajectories. 
We repeat the imputation experiments of\n\ufb01gures 3a and 3b for these environments in the supplementary material (appendix E), where we see\nthat KVAEs outperform alternative models.\n\n6 Related work\n\nRecent progress in unsupervised learning of high dimensional sequences is found in a plethora of\nboth deterministic and probabilistic generative models. The VAE framework is a common work-\nhorse in the stable of probabilistic inference methods, and it is extended to the temporal setting by\n[2, 6, 8, 16, 19]. In particular, deep neural networks can parameterize the transition and emission\ndistributions of different variants of deep state-space models [8, 16, 19]. In these extensions, inference\n\n4Deep Markov models were previously referred to as deep Kalman \ufb01lters.\n\n7\n\n\f(a) Irregular polygon.\n\n(b) Box with gravity.\n\n(c) Pong-like environment.\n\nFigure 5: Generations from the KVAE trained on different environments. The videos are shown as\nsingle images, with color intensity representing the incremental sequence index t. In the simulation\nthat resembles Atari\u2019s Pong game, the movement of the two paddles (left and right) is also visible.\n\nnetworks de\ufb01ne a variational approximation to the intractable posterior distribution of the latent states\nat each time step. For the tasks in section 5, it is preferable to use the KVAE\u2019s simpler temporal model\nwith an exact (conditional) posterior distribution than a highly non-linear model where the posterior\nneeds to be approximated. 
A different combination of VAEs and probabilistic graphical models has been explored in [15], which defines a general class of models where inference is performed with message passing algorithms that use deep neural networks to map the observations to conjugate graphical model potentials.\n\nIn classical non-linear extensions of the LGSSM like the extended Kalman filter and in the locally-linear dynamics of [16, 33], the transition matrices at time t have a non-linear dependence on z_{t−1}. The KVAE's approach is different: by introducing the latent encodings a_t and making γ_t depend on a_{1:t−1}, the linear dependency between consecutive states of z is preserved, so that the exact smoothed posterior can be computed given a, and used to perform missing data imputation. LGSSMs with dynamic parameterization have been used for large-scale demand forecasting in [27]. [20] introduces recurrent switching linear dynamical systems, that combine deep learning techniques and switching Kalman filters [22] to model low-dimensional time series. [11] introduces a discriminative approach to estimate the low-dimensional state of a LGSSM from input images. The resulting model is reminiscent of a KVAE with no decoding step, and is therefore not suited for unsupervised learning and video generation. Recent work in the non-sequential setting has focused on disentangling basic visual concepts in an image [12]. [10] models neural activity by finding a non-linear embedding of a neural time series into a LGSSM.\n\nGreat strides have been made in the reinforcement learning community to model how environments evolve in response to action [5, 23, 24, 30, 32]. In a similar spirit to this paper, [32] extracts a latent representation from a PCA representation of the frames where controls can be applied. 
[5] introduces action-conditional dynamics parameterized with LSTMs and, as for the KVAE, a computationally efficient procedure to make long-term predictions without generating high dimensional images at each time step. Among autoregressive models, [29] develops a sequence-to-sequence model of video representations that uses LSTMs to define both the encoder and the decoder, and [7] develops an action-conditioned video prediction model of the motion of a robot arm, using convolutional LSTMs that model the change in pixel values between two consecutive frames.

While the focus in this work is to define a generative model for high dimensional videos of simple physical systems, several recent works have combined physical models of the world with deep learning to learn the dynamics of objects in more complex but low-dimensional environments [3, 4, 9, 34].

7 Conclusion

The KVAE, a model for unsupervised learning of high-dimensional videos, was introduced in this paper. It disentangles an object's latent representation at from a latent state zt that describes its dynamics, and can be learned end-to-end from raw video. Because the exact (conditional) smoothed posterior distribution over the states of the LGSSM can be computed, one generally sees a marked improvement in inference and missing data imputation over methods that do not have this property. A desirable property of disentangling the two latent representations is that temporal reasoning, and possibly planning, could be done in the latent space. As a proof of concept, we have been deliberate in focusing our exposition on videos of static worlds that contain a few moving objects, and leave extensions of the model to real-world videos, or to sequences coming from an agent exploring its environment, to future work.

Acknowledgements

We would like to thank Lars Kai Hansen for helpful discussions on the model design.
Marco Fraccaro is supported by Microsoft Research through its PhD Scholarship Programme. We thank NVIDIA Corporation for the donation of TITAN X GPUs.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] E. Archer, I. M. Park, L. Buesing, J. Cunningham, and L. Paninski. Black box variational inference for state space models. arXiv:1511.07367, 2015.

[3] P. W. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, and K. Kavukcuoglu. Interaction networks for learning about objects, relations and physics. In NIPS, 2016.

[4] M. B. Chang, T. Ullman, A. Torralba, and J. B. Tenenbaum. A compositional object-based approach to learning physical dynamics. In ICLR, 2017.

[5] S. Chiappa, S. Racanière, D. Wierstra, and S. Mohamed. Recurrent environment simulators. In ICLR, 2017.

[6] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In NIPS, 2015.

[7] C. Finn, I. J. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.

[8] M. Fraccaro, S. K. Sønderby, U. Paquet, and O. Winther. Sequential neural models with stochastic layers. In NIPS, 2016.

[9] K. Fragkiadaki, P. Agrawal, S. Levine, and J. Malik. Learning visual predictive models of physics for playing billiards. In ICLR, 2016.

[10] Y. Gao, E. W. Archer, L. Paninski, and J. P. Cunningham.
Linear dynamical neural population models through nonlinear embeddings. In NIPS, 2016.

[11] T. Haarnoja, A. Ajay, S. Levine, and P. Abbeel. Backprop KF: Learning discriminative deterministic state estimators. In NIPS, 2016.

[12] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.

[13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, Nov. 1997.

[14] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. arXiv:1611.01144, 2016.

[15] M. J. Johnson, D. Duvenaud, A. B. Wiltschko, S. R. Datta, and R. P. Adams. Composing graphical models with neural networks for structured representations and fast inference. In NIPS, 2016.

[16] M. Karl, M. Soelch, J. Bayer, and P. van der Smagt. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. In ICLR, 2017.

[17] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.

[18] D. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[19] R. Krishnan, U. Shalit, and D. Sontag. Structured inference networks for nonlinear state space models. In AAAI, 2017.

[20] S. Linderman, M. Johnson, A. Miller, R. Adams, D. Blei, and L. Paninski. Bayesian learning and inference in recurrent switching linear dynamical systems. In AISTATS, 2017.

[21] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.

[22] K. P. Murphy. Switching Kalman filters. Technical report, 1998.

[23] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In NIPS, 2015.

[24] V. Patraucean, A. Handa, and R. Cipolla.
Spatio-temporal video autoencoder with differentiable memory. arXiv:1511.06309, 2015.

[25] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.

[26] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305–345, 1999.

[27] M. W. Seeger, D. Salinas, and V. Flunkert. Bayesian intermittent demand forecasting for large inventories. In NIPS, 2016.

[28] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016.

[29] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.

[30] W. Sun, A. Venkatraman, B. Boots, and J. A. Bagnell. Learning to filter with predictive state inference machines. In ICML, 2016.

[31] L. G. Ungerleider and J. V. Haxby. "What" and "where" in the human brain. Curr. Opin. Neurobiol., 4:157–165, 1994.

[32] N. Wahlström, T. B. Schön, and M. P. Deisenroth. From pixels to torques: Policy learning with deep dynamical models. arXiv:1502.02251, 2015.

[33] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, 2015.

[34] J. Wu, I. Yildirim, J. J. Lim, W. T. Freeman, and J. B. Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning.
In NIPS, 2015.