{"title": "Deep Generative Video Compression", "book": "Advances in Neural Information Processing Systems", "page_first": 9287, "page_last": 9298, "abstract": "The usage of deep generative models for image compression has led to impressive\nperformance gains over classical codecs while neural video compression is still in its infancy. Here, we propose an end-to-end, deep generative modeling approach to compress temporal sequences with a focus on video. Our approach builds upon variational autoencoder (VAE) models for sequential data and combines them with recent work on neural image compression. The approach jointly learns to transform the original sequence into a lower-dimensional representation as well as to discretize and entropy code this representation according to predictions of the sequential VAE.  Rate-distortion evaluations on small videos from public data sets with varying complexity and diversity show that our model yields competitive results when trained on generic video content. Extreme compression performance is achieved when training the model on specialized content.", "full_text": "Deep Generative Video Compression\n\nJun Han\u2217\n\nDartmouth College\n\njunhan@cs.dartmouth.edu\n\nSalvator Lombardo\u2217\nDisney Research LA\n\nsalvator.d.lombardo@disney.com\n\nChristopher Schroers\nDisneyResearch|Studios\n\nchristopher.schroers@disney.com\n\nStephan Mandt\n\nUniversity of California, Irvine\n\nmandt@uci.edu\n\nAbstract\n\nThe usage of deep generative models for image compression has led to impressive\nperformance gains over classical codecs while neural video compression is still in\nits infancy. Here, we propose an end-to-end, deep generative modeling approach to\ncompress temporal sequences with a focus on video. Our approach builds upon\nvariational autoencoder (VAE) models for sequential data and combines them\nwith recent work on neural image compression. 
The approach jointly learns to\ntransform the original sequence into a lower-dimensional representation as well\nas to discretize and entropy code this representation according to predictions of\nthe sequential VAE. Rate-distortion evaluations on small videos from public data\nsets with varying complexity and diversity show that our model yields competitive\nresults when trained on generic video content. Extreme compression performance\nis achieved when training the model on specialized content.\n\n1 Introduction\n\nThe transmission of video content is responsible for up to 80% of the consumer internet traffic, and\nboth the overall internet traffic and the share of video data are expected to increase even further\nin the future (Cisco, 2017). Improving compression efficiency is more crucial than ever. The most\ncommonly used standard is H.264 (Wiegand et al., 2003); more recent codecs include H.265 (Sullivan\net al., 2012) and VP9 (Mukherjee et al., 2015). All of these existing codecs follow the same\nblock-based hybrid structure (Musmann et al., 1985), which essentially emerged from engineering and\nrefining this concept over decades. From a high-level perspective, they differ in a huge number of\nsmaller design choices and have grown to become more and more complex systems.\nWhile there is room for improving the block-based hybrid approach even further (Fraunhofer,\n2018), the question remains as to how much longer significant improvements can be obtained while\nfollowing the same paradigm. In the context of image compression, deep learning approaches that\nare fundamentally different from existing codecs have already shown promising results (Ball\u00e9 et al.,\n2018, 2016; Theis et al., 2017; Agustsson et al., 2017; Minnen et al., 2018). Motivated by these\nsuccesses for images, we propose a first step towards innovating beyond block-based hybrid codecs\nby framing video compression in a deep generative modeling context. 
To this end, we propose an\nunsupervised deep learning approach to encoding video. The approach simultaneously learns the\noptimal transformation of the video to a lower-dimensional representation and a powerful predictive\nmodel that assigns probabilities to video segments, allowing us to efficiently entropy-code the\ndiscretized latent representation into a short code length.\n\n\u2217 Shared first authorship.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nH.265 (21.1 dB @ 0.86 bpp)\n\nVP9 (26.0 dB @ 0.57 bpp)\n\nOurs (44.6 dB @ 0.06 bpp)\n\nFigure 1: Reconstructed video frames using the established codecs H.265 (left), VP9 (middle),\nand ours (right), with videos taken from the Sprites data set (Section 4). On specialized content as\nshown here, higher PSNR values in dB (corresponding to lower distortion) can be achieved at almost\nan order of magnitude smaller bits per pixel (bpp) rates. Compared to the classical codecs, fewer\ngeometrical artifacts are apparent in our approach.\n\nOur end-to-end neural video compression scheme is based on sequential variational autoencoders\n(Bayer & Osendorfer, 2014; Chung et al., 2015; Li & Mandt, 2018). The transformations to\nand from the latent representation (the encoder and decoder) are parametrized by deep neural networks\nand are learned by unsupervised training on videos. These latent states have to be discretized before\nthey can be compressed into binary. Ball\u00e9 et al. (2016) address this problem by using a box-shaped\nvariational distribution with a fixed width, forcing the VAE to \u2018forget\u2019 all information stored on\nsmaller length scales due to the insertion of noise during training. This paper follows the same\nparadigm for temporally-conditioned distributions. A sequence of quantized latent representations\nstill contains redundant information as the latents are highly correlated. 
(Lossless) entropy encoding\nexploits this fact to further reduce the expected file size by expressing likely data in fewer bits and\nunlikely data in more bits. This requires knowledge of the probability distribution over the discretized\ndata that is to be compressed, which our approach obtains from the sequential prior.\nAmong the many architectural choices that our approach enables, we empirically investigate a model\nthat is well suited for the regime of extreme compression. This model uses a combination of both\nlocal latent variables, which are inferred from a single frame, and a global state, inferred from a\nmulti-frame segment, to efficiently store a video sequence. The dynamics of the local latent variables are\nmodeled stochastically by a deep generative model. After training, the context-dependent predictive\nmodel is used to entropy code the latent variables into binary with an arithmetic coder.\nIn this paper, we focus on low-resolution video (64 \u00d7 64) as the first step towards deep generative\nvideo compression. Figure 1 shows a test example of the possible performance improvements using\nour approach if the model is trained on restricted content (video game characters). The plots show\ntwo frames of a video, compressed and reconstructed by our approach and by classical video codecs.\nOne sees that fine-grained details, such as the hands of the cartoon character, are lost in the classical\napproach due to artifacts from block motion estimation (low bitrate regime), whereas our deep\nlearning approach successfully captures these details with less than 10% of the file length.\nOur contributions are as follows:\n1) A general paradigm for generative compression of sequential data. 
We propose a general\nframework for compressing sequential data by employing a sequential variational autoencoder (VAE)\nin conjunction with discretization and entropy coding to build an end-to-end trainable codec.\n2) A new neural codec for video compression. We employ the above paradigm towards building\nan end-to-end trainable codec. To the best of our knowledge, this is the first work to utilize a deep\ngenerative video model together with discretization and entropy coding to perform video compression.\n3) High compression ratios. We perform experiments on three public data sets of varying complexity\nand diversity. Performance is evaluated in terms of rate-distortion curves. For the low-resolution\nvideos considered in this paper, our method is competitive with traditional codecs after training and\ntesting on a diverse set of videos. Extreme compression performance can be achieved on a restricted\nset of videos containing specialized content if the model is trained on similar videos.\n4) Efficient compression from a global state. While a deep latent time series model takes temporal\nredundancies in the video into account, one optional variation of our model architecture tries to\ncompress static information into a separate global variable (Li & Mandt, 2018) which acts similarly\nto a key frame in traditional methods. We show that this decomposition can be beneficial.\n\nOur paper is organized as follows. In Section 2, we summarize related work before describing\nour method in Section 3. Section 4 discusses our experimental results. We give our conclusions in\nSection 5.\n\n2 Related Work\n\nThe approaches related to our method fall into three categories: deep generative video models, neural\nimage compression, and neural video compression.\n\nDeep generative video models. 
Several works have applied the variational autoencoder\n(VAE) (Kingma & Welling, 2014; Rezende et al., 2014) to stochastically model sequences (Bayer &\nOsendorfer, 2014; Chung et al., 2015). Babaeizadeh et al. (2018) and Xu et al. (2020) use a VAE for\nstochastic video generation. He et al. (2018) and Denton & Fergus (2018) apply a long short-term\nmemory (LSTM) in conjunction with a sequential VAE to model the evolution of the latent space\nacross many video frames. Li & Mandt (2018) separate latent variables of a sequential VAE into local\nand global variables in order to learn a disentangled representation for video generation. Vondrick\net al. (2016) generate realistic videos by using a generative adversarial network (Goodfellow et al.,\n2014) to learn to separate foreground and background, and Lee et al. (2018) combine variational and\nadversarial methods to generate realistic videos. This paper also employs a deep generative model to\nmodel the sequential probability distribution of frames from a video source. In contrast to other work,\nour method learns a continuous latent representation that can be discretized with minimal information\nloss, as required for further compression into binary. Furthermore, our objective is to convert the original\nvideo into a short binary description rather than to generate new videos.\n\nNeural image compression. There has been significant work on applying deep learning to image\ncompression. In Toderici et al. (2016, 2017) and Johnston et al. (2018), an LSTM-based codec is used\nto model spatial correlations of pixel values and can achieve different bit-rates without having to\nretrain the model. Ball\u00e9 et al. (2016) perform image compression with a VAE and demonstrate how\nto approximately discretize the VAE latent space by introducing noise during training. 
This work\nis refined by Ball\u00e9 et al. (2018), who improve the prior model (used for entropy coding) beyond\nthe mean-field approximation by transmitting side information in the form of a hierarchical model.\nMinnen et al. (2018) consider an autoregressive model to achieve a similar effect. Santurkar et al.\n(2018) study the performance of generative compression on images and suggest it may be more\nresilient to bit errors. These image codecs encode each image independently and therefore\ntheir probabilistic models are stationary with respect to time. In contrast, our method performs\ncompression according to a non-stationary, time-dependent probability model which typically has\nlower entropy per pixel.\n\nNeural video compression. The use of deep neural networks for video compression is relatively\nnew. Wu et al. (2018) perform video compression through image interpolation between reference\nframes using a predictive model based on a deep neural network. Chen et al. (2017) and Chen et al.\n(2019) use a deep neural architecture to predict the most likely frame with a modified form of block\nmotion prediction and store residuals in a lossy representation. Since these works are based on motion\nestimation and residuals, they are somewhat similar in function and performance to existing codecs.\nLu et al. (2019) and Djelouah et al. (2019) also follow a pipeline based on motion estimation and\nresidual computation as in existing codecs. In contrast, our method is not based on motion estimation,\nand the full inferred probability distribution over the space of plausible subsequent frames is used\nfor entropy coding the frame sequence (rather than residuals). In a concurrent publication, Habibian\net al. (2019) perform video compression by utilizing a 3D variational autoencoder. 
In this case, the\n3D encoder removes temporal redundancy by decorrelating latents, whereas our method uses entropy\ncoding (with time-dependent probabilities) to remove temporal redundancy.\n\n3 Deep Generative Video Compression\n\nOur end-to-end approach simultaneously learns to transform a video into a lower-dimensional latent\nrepresentation and to remove the remaining redundancy in the latents through model-based entropy\ncoding. Section 3.1 gives an overview of the deep generative video coding approach as a whole\nbefore Sections 3.2 and 3.3 detail the model-based entropy coding and the lower-dimensional\nrepresentation, respectively.\n\nFigure 2: High-level operational diagram of our compression codec (see Section 3). A video segment\nis encoded into per-frame latent variables zt and (optionally) also into a per-segment global state f\nusing a VAE architecture. Both latent variables are then quantized and arithmetically encoded into\nbinary according to the respective prior models. To recover an approximation to the original video,\nthe latent variables are arithmetically decoded from the binary and passed through the neural decoder.\n\n3.1 Overview\n\nLossy video compression is a constrained optimization problem that can be approached from two\ndifferent angles: either 1) as finding the shortest description of a video without exceeding a certain\nlevel of information loss or 2) as finding the minimal level of information loss without exceeding\na certain description length. The two problems are equivalent and differ only in whether the\nconstraint is placed on description length (rate) or on information loss (distortion). The distortion is a measure of\nhow much error encoding and subsequent decoding incurs while the rate quantifies the number of bits\nthe encoded representation occupies. 
When denoting distortion by D, rate by R, and the maximal\nrate constraint by Rc, the compression problem can be expressed as\n\nmin D subject to R \u2264 Rc.\n\nSuch a constrained formulation is often cumbersome but can be solved in a Lagrange multiplier\nformulation, where the rate and distortion terms are weighted against each other by a Lagrange\nmultiplier \u03b2:\n\nmin D + \u03b2R.    (1)\n\nIn existing video codecs, encoders and decoders have been meticulously engineered to improve\ncoding efficiency.\nInstead of engineering encoding and decoding functions, in our end-to-end\nmachine learning approach we aim to learn these mappings by parametrizing them by deep neural\nnetworks and then optimizing Eq. 1 accordingly.\nThere is a well-known equivalence (Ball\u00e9 et al., 2018; Alemi et al., 2018) between the evidence lower\nbound in amortized variational inference (Gershman & Goodman, 2014; Zhang et al., 2018), and the\nLagrangian formulation of lossy coding of Eq. 1. Variational inference involves a probabilistic model\np(x, z) = p(x|z)p(z) over data x and latent variables z. The goal is to lower-bound the marginal\nlikelihood p(x) using a variational distribution q(z|x). When the variational distribution q has a\nfixed entropy (e.g., by fixing its variance), this bound is, up to a constant,\n\nE_q[log p(x|z)] \u2212 \u03b2 H[q(z|x), p(z)],    (2)\n\nwhere H is the cross entropy between the approximate posterior and the prior. When allowing\nfor arbitrary \u03b2, Ball\u00e9 et al. (2016) showed in the context of image compression with variational\nautoencoders that the negative of Eq. 2 becomes a manifestation of Eq. 1. While the first term\nmeasures the expected reconstruction error of the encoded images, the cross entropy term becomes\nthe expected code length as the (learned) prior p(z) is used to inform a lossless entropy coder about\nthe probabilities of the discretized encoded images. 
In this paper we generalize this approach to\nvideos by employing probabilistic deep sequential latent state models.\n\n[Figure 2 diagram labels: Input Video, Quantization, Noise, Round, Training, After training, Encoding, Binarization, Decoding, Reconstructed Video; symbols x1:T, xt, f, zt, \u02dczt, \u02dcxt, \u02dcf, p\u03b8(\u02dczt|\u02dcz<t).]\n\nFig. 2 summarizes our overall design. Given a sequence of frames x1:T = (x1, ..., xT), we\ntransform them into a sequence of latent states z1:T and optionally also a global state f. Although\nthis transformation into a latent representation is lossy, the video is not yet optimally compressed\nas there are still correlations in the latent space variables. To remove this redundancy, the latent\n\n<latexit 
sha1_base64=\"aobgVshmD4k+WfMer3PzPpROslE=\">AAAB/3icbVDLSsNAFJ3UV62vqODGzWARXJVEBF0W3bisYB/QlDKZ3LRDJ5MwMxFKzMJfceNCEbf+hjv/xkmbhbYeGOZwzr3MmeMnnCntON9WZWV1bX2julnb2t7Z3bP3DzoqTiWFNo15LHs+UcCZgLZmmkMvkUAin0PXn9wUfvcBpGKxuNfTBAYRGQkWMkq0kYb2kacZDyDz/JgHahqZKwvzfGjXnYYzA14mbknqqERraH95QUzTCISmnCjVd51EDzIiNaMc8pqXKkgInZAR9A0VJAI1yGb5c3xqlACHsTRHaDxTf29kJFJFNjMZET1Wi14h/uf1Ux1eDTImklSDoPOHwpRjHeOiDBwwCVTzqSGESmayYjomklBtKquZEtzFLy+TznnDdRru3UW9eV3WUUXH6ASdIRddoia6RS3URhQ9omf0it6sJ+vFerc+5qMVq9w5RH9gff4AOcKW3g==</latexit>p\u2713(\u02dcf)<latexit sha1_base64=\"XnrC0n+r9Qq+oX8IWEEEUcLDRzw=\">AAACGXicbVDJSgNBEO2JW4xb1KOXxiDES5gRQY9BLx4jmAUyIfT01CRNeha6a4QwzG948Ve8eFDEo578GzvLwSQ+aPrxXhVV9bxECo22/WMV1tY3NreK26Wd3b39g/LhUUvHqeLQ5LGMVcdjGqSIoIkCJXQSBSz0JLS90e3Ebz+C0iKOHnCcQC9kg0gEgjM0Ur9sJ/3M9WLp63FovszFISDLc1p1UUgfFswgz8/75Ypds6egq8SZkwqZo9Evf7l+zNMQIuSSad117AR7GVMouIS85KYaEsZHbABdQyMWgu5l08tyemYUnwaxMi9COlX/dmQs1JPlTGXIcKiXvYn4n9dNMbjuZSJKUoSIzwYFqaQY00lM1BcKOMqxIYwrYXalfMgU42jCLJkQnOWTV0nroubYNef+slK/mcdRJCfklFSJQ65IndyRBmkSTp7IC3kj79az9Wp9WJ+z0oI17zkmC7C+fwHqWaIX</latexit><latexit sha1_base64=\"XnrC0n+r9Qq+oX8IWEEEUcLDRzw=\">AAACGXicbVDJSgNBEO2JW4xb1KOXxiDES5gRQY9BLx4jmAUyIfT01CRNeha6a4QwzG948Ve8eFDEo578GzvLwSQ+aPrxXhVV9bxECo22/WMV1tY3NreK26Wd3b39g/LhUUvHqeLQ5LGMVcdjGqSIoIkCJXQSBSz0JLS90e3Ebz+C0iKOHnCcQC9kg0gEgjM0Ur9sJ/3M9WLp63FovszFISDLc1p1UUgfFswgz8/75Ypds6egq8SZkwqZo9Evf7l+zNMQIuSSad117AR7GVMouIS85KYaEsZHbABdQyMWgu5l08tyemYUnwaxMi9COlX/dmQs1JPlTGXIcKiXvYn4n9dNMbjuZSJKUoSIzwYFqaQY00lM1BcKOMqxIYwrYXalfMgU42jCLJkQnOWTV0nroubYNef+slK/mcdRJCfklFSJQ65IndyRBmkSTp7IC3kj79az9Wp9WJ+z0oI17zkmC7C+fwHqWaIX</latexit><latexit 
sha1_base64=\"XnrC0n+r9Qq+oX8IWEEEUcLDRzw=\">AAACGXicbVDJSgNBEO2JW4xb1KOXxiDES5gRQY9BLx4jmAUyIfT01CRNeha6a4QwzG948Ve8eFDEo578GzvLwSQ+aPrxXhVV9bxECo22/WMV1tY3NreK26Wd3b39g/LhUUvHqeLQ5LGMVcdjGqSIoIkCJXQSBSz0JLS90e3Ebz+C0iKOHnCcQC9kg0gEgjM0Ur9sJ/3M9WLp63FovszFISDLc1p1UUgfFswgz8/75Ypds6egq8SZkwqZo9Evf7l+zNMQIuSSad117AR7GVMouIS85KYaEsZHbABdQyMWgu5l08tyemYUnwaxMi9COlX/dmQs1JPlTGXIcKiXvYn4n9dNMbjuZSJKUoSIzwYFqaQY00lM1BcKOMqxIYwrYXalfMgU42jCLJkQnOWTV0nroubYNef+slK/mcdRJCfklFSJQ65IndyRBmkSTp7IC3kj79az9Wp9WJ+z0oI17zkmC7C+fwHqWaIX</latexit><latexit sha1_base64=\"XnrC0n+r9Qq+oX8IWEEEUcLDRzw=\">AAACGXicbVDJSgNBEO2JW4xb1KOXxiDES5gRQY9BLx4jmAUyIfT01CRNeha6a4QwzG948Ve8eFDEo578GzvLwSQ+aPrxXhVV9bxECo22/WMV1tY3NreK26Wd3b39g/LhUUvHqeLQ5LGMVcdjGqSIoIkCJXQSBSz0JLS90e3Ebz+C0iKOHnCcQC9kg0gEgjM0Ur9sJ/3M9WLp63FovszFISDLc1p1UUgfFswgz8/75Ypds6egq8SZkwqZo9Evf7l+zNMQIuSSad117AR7GVMouIS85KYaEsZHbABdQyMWgu5l08tyemYUnwaxMi9COlX/dmQs1JPlTGXIcKiXvYn4n9dNMbjuZSJKUoSIzwYFqaQY00lM1BcKOMqxIYwrYXalfMgU42jCLJkQnOWTV0nroubYNef+slK/mcdRJCfklFSJQ65IndyRBmkSTp7IC3kj79az9Wp9WJ+z0oI17zkmC7C+fwHqWaIX</latexit>\u02dcf<latexit sha1_base64=\"aobgVshmD4k+WfMer3PzPpROslE=\">AAAB/3icbVDLSsNAFJ3UV62vqODGzWARXJVEBF0W3bisYB/QlDKZ3LRDJ5MwMxFKzMJfceNCEbf+hjv/xkmbhbYeGOZwzr3MmeMnnCntON9WZWV1bX2julnb2t7Z3bP3DzoqTiWFNo15LHs+UcCZgLZmmkMvkUAin0PXn9wUfvcBpGKxuNfTBAYRGQkWMkq0kYb2kacZDyDz/JgHahqZKwvzfGjXnYYzA14mbknqqERraH95QUzTCISmnCjVd51EDzIiNaMc8pqXKkgInZAR9A0VJAI1yGb5c3xqlACHsTRHaDxTf29kJFJFNjMZET1Wi14h/uf1Ux1eDTImklSDoPOHwpRjHeOiDBwwCVTzqSGESmayYjomklBtKquZEtzFLy+TznnDdRru3UW9eV3WUUXH6ASdIRddoia6RS3URhQ9omf0it6sJ+vFerc+5qMVq9w5RH9gff4AOcKW3g==</latexit><latexit 
sha1_base64=\"aobgVshmD4k+WfMer3PzPpROslE=\">AAAB/3icbVDLSsNAFJ3UV62vqODGzWARXJVEBF0W3bisYB/QlDKZ3LRDJ5MwMxFKzMJfceNCEbf+hjv/xkmbhbYeGOZwzr3MmeMnnCntON9WZWV1bX2julnb2t7Z3bP3DzoqTiWFNo15LHs+UcCZgLZmmkMvkUAin0PXn9wUfvcBpGKxuNfTBAYRGQkWMkq0kYb2kacZDyDz/JgHahqZKwvzfGjXnYYzA14mbknqqERraH95QUzTCISmnCjVd51EDzIiNaMc8pqXKkgInZAR9A0VJAI1yGb5c3xqlACHsTRHaDxTf29kJFJFNjMZET1Wi14h/uf1Ux1eDTImklSDoPOHwpRjHeOiDBwwCVTzqSGESmayYjomklBtKquZEtzFLy+TznnDdRru3UW9eV3WUUXH6ASdIRddoia6RS3URhQ9omf0it6sJ+vFerc+5qMVq9w5RH9gff4AOcKW3g==</latexit><latexit sha1_base64=\"aobgVshmD4k+WfMer3PzPpROslE=\">AAAB/3icbVDLSsNAFJ3UV62vqODGzWARXJVEBF0W3bisYB/QlDKZ3LRDJ5MwMxFKzMJfceNCEbf+hjv/xkmbhbYeGOZwzr3MmeMnnCntON9WZWV1bX2julnb2t7Z3bP3DzoqTiWFNo15LHs+UcCZgLZmmkMvkUAin0PXn9wUfvcBpGKxuNfTBAYRGQkWMkq0kYb2kacZDyDz/JgHahqZKwvzfGjXnYYzA14mbknqqERraH95QUzTCISmnCjVd51EDzIiNaMc8pqXKkgInZAR9A0VJAI1yGb5c3xqlACHsTRHaDxTf29kJFJFNjMZET1Wi14h/uf1Ux1eDTImklSDoPOHwpRjHeOiDBwwCVTzqSGESmayYjomklBtKquZEtzFLy+TznnDdRru3UW9eV3WUUXH6ASdIRddoia6RS3URhQ9omf0it6sJ+vFerc+5qMVq9w5RH9gff4AOcKW3g==</latexit><latexit sha1_base64=\"aobgVshmD4k+WfMer3PzPpROslE=\">AAAB/3icbVDLSsNAFJ3UV62vqODGzWARXJVEBF0W3bisYB/QlDKZ3LRDJ5MwMxFKzMJfceNCEbf+hjv/xkmbhbYeGOZwzr3MmeMnnCntON9WZWV1bX2julnb2t7Z3bP3DzoqTiWFNo15LHs+UcCZgLZmmkMvkUAin0PXn9wUfvcBpGKxuNfTBAYRGQkWMkq0kYb2kacZDyDz/JgHahqZKwvzfGjXnYYzA14mbknqqERraH95QUzTCISmnCjVd51EDzIiNaMc8pqXKkgInZAR9A0VJAI1yGb5c3xqlACHsTRHaDxTf29kJFJFNjMZET1Wi14h/uf1Ux1eDTImklSDoPOHwpRjHeOiDBwwCVTzqSGESmayYjomklBtKquZEtzFLy+TznnDdRru3UW9eV3WUUXH6ASdIRddoia6RS3URhQ9omf0it6sJ+vFerc+5qMVq9w5RH9gff4AOcKW3g==</latexit>NoiseRoundLocal state neural encoderDecoderArithmetic encoderArithmetic encoderArithmetic decoderArithmetic decoderGlobal state neural encoder\fspace must be entropy coded into binary. This is the distinguishing element between variational\nautoencoders and full compression algorithms. The bit stream can then be sent to a receiver where it\nis decoded into video frames. 
Our end-to-end machine learning approach simultaneously learns the\npredictive model required for entropy coding and the optimal lossy transformation into the latent\nspace. Both components are described in detail in the following two sections.\n\n3.2 Entropy Coding via a Deep Sequential Model\n\nPredictive modeling is crucial at the entropy coding stage. A better model, one that more accurately\ncaptures the true certainty about the next symbol, has a smaller cross entropy with the data distribution\nand thus produces a bit rate that is closer to the theoretical lower bound for long sequences (Shannon,\n2001). For videos, temporal modeling is most important, making a learned temporal model an integral\npart of our model design. We now discuss a preliminary version of our model which does not yet\ninclude the global state, saving the specific details and encoder of our proposed model for Section 3.3.\n\nGeneral model design. When designing a generative model for video, the added challenge relative to image\ncompression is that videos exhibit strong temporal correlations in addition to the spatial correlations\npresent in images. Treating a video segment as an independent data point in the latent representation\n(as a 3D autoencoder would) leads to data sparseness and poor generalization performance. Therefore,\nwe propose to learn a temporally-conditioned prior distribution parametrized by a deep generative\nmodel to efficiently code the latent variables associated with each frame. Let x_{1:T} = (x_1, ..., x_T)\nbe the video sequence and z_{1:T} be the associated latent variables. A generic generative model of this\ntype takes the form:\n\np_\u03b8(x_{1:T}, z_{1:T}) = \u220f_{t=1}^{T} p_\u03b8(z_t | z_{<t}) p_\u03b8(x_t | z_t).    (3)\n\nAbove, \u03b8 is shorthand for parameters of the model. 
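As a toy illustration of the cross-entropy argument made at the start of this section, the following sketch compares the ideal code length, the sum of -log2 p(symbol | context), of a short binary sequence under a context-free model and under a model conditioned on the previous symbol. It is not the model of the paper: the sequence, the function names, and the "stay" probability of 0.9 are assumptions chosen only for illustration.

```python
import math


def sequence_code_length(seq, prob_fn):
    """Ideal code length in bits: sum over t of -log2 p(seq[t] | seq[:t])."""
    total = 0.0
    for t, symbol in enumerate(seq):
        total += -math.log2(prob_fn(symbol, seq[:t]))
    return total


def independent_prior(symbol, context):
    # Hypothetical context-free model: both binary symbols are equally likely.
    return 0.5


def markov_prior(symbol, context, stay=0.9):
    # Hypothetical temporally conditioned model: the previous symbol is
    # expected to repeat with probability `stay` (an assumed value).
    if not context:
        return 0.5
    return stay if symbol == context[-1] else 1.0 - stay


# A toy sequence with strong temporal correlation (one switch in ten steps).
seq = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
bits_independent = sequence_code_length(seq, independent_prior)
bits_markov = sequence_code_length(seq, markov_prior)
```

For this sequence the context-free model needs 10 bits, while the conditioned model needs roughly 5.5 bits, since eight of the nine transitions are cheap "stay" events and only the single switch is expensive.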
By conditioning on previous frame latents in the\nsequence, the prior model can be more certain about the next z_t, thus achieving a smaller entropy\nand code length (after entropy coding).\n\nArithmetic coding. Entropy coding schemes require a discrete vocabulary, which is obtained in\nour case by rounding the latent states to the nearest integer after training. Care must be taken that\nthe quantization at inference time is approximated in a differentiable way during training. In practice,\nthis is handled by introducing noise in the inference process. Besides dealing with quantization, we\nalso need an accurate estimate of the probability density over the latent atoms for efficient coding.\nKnowledge of the sequential probability distribution of latents allows the entropy coder to decorrelate\nthe bitstream so that the maximal amount of information per bit is stored (MacKay, 2003). We obtain\nthis probability estimate from the learned prior.\nWe employ an arithmetic coder (Rissanen & Langdon, 1979; Langdon, 1984) to losslessly code the\nrounded latent variables into binary. In contrast to other forms of entropy encoding, such as Huffman\ncoding, arithmetic coding encodes the entire sequence of discretized latent states z_{1:T} into a single\nnumber. During encoding, the approach uses the conditional probabilities p(z_t | z_{<t}) to iteratively\nrefine the real number interval [0, 1) into a progressively smaller interval. After the sequence has\nbeen processed and a final (very small) interval is obtained, a binarized floating point number from\nthe final interval is stored to encode the entire sequence of latents. Decoding the stored number can similarly\nbe performed iteratively by undoing the sequence of interval refinements to recover the original latent\nsequence. The fact that decoding happens in the same temporal order as encoding guarantees access\nto all conditional probabilities p(z_t | z_{<t}). 
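The interval refinement described above can be sketched with exact rational arithmetic. This is a minimal illustration under stated assumptions, not the coder used in the paper: the three-symbol alphabet and the toy conditional model `cond_probs` (a stand-in for the learned prior p(z_t | z_{<t})) are hypothetical, the sequence length is assumed to be known to the decoder, and practical coders use finite-precision integer arithmetic instead of `Fraction`.

```python
import math
from fractions import Fraction

SYMBOLS = (-1, 0, 1)  # hypothetical alphabet of rounded latent values


def cond_probs(prev):
    """Toy stand-in for p(z_t | z_<t): favors repeating the previous symbol."""
    if not prev:
        return {s: Fraction(1, 3) for s in SYMBOLS}
    last = prev[-1]
    return {s: Fraction(3, 5) if s == last else Fraction(1, 5) for s in SYMBOLS}


def encode(seq):
    """Narrow [0, 1) once per symbol, then pick a short binary fraction inside."""
    low, width = Fraction(0), Fraction(1)
    for t, symbol in enumerate(seq):
        probs = cond_probs(seq[:t])
        for s in SYMBOLS:
            if s == symbol:
                width *= probs[s]          # keep the sub-interval of `symbol`
                break
            low += width * probs[s]        # skip sub-intervals of earlier symbols
    bits = 0
    while True:
        bits += 1
        k = math.ceil(low * 2 ** bits)     # smallest k with k / 2^bits >= low
        if Fraction(k, 2 ** bits) < low + width:
            return k, bits                 # the code is k written with `bits` bits


def decode(k, bits, length):
    """Replay the same refinements, using the already decoded context."""
    x = Fraction(k, 2 ** bits)
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        probs = cond_probs(out)
        for s in SYMBOLS:
            w = width * probs[s]
            if low <= x < low + w:
                out.append(s)
                width = w
                break
            low += w
    return out


message = [0, 0, 1, 1, 1, -1, 0, 0]
code, num_bits = encode(message)
recovered = decode(code, num_bits, len(message))
```

Because the decoder rebuilds each conditional distribution from the symbols it has already recovered, it sees exactly the same probabilities as the encoder, which is the property the text relies on.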
Since z_t was quantized, all probabilities for encoding and\ndecoding exactly match. In practice, besides iterating over time steps t, we also iterate over the\ndimensions of z_t during arithmetic coding.\n\n3.3 Proposed Generative and Inference Model\n\nIn this section, we describe the modeling aspects of our approach in more detail. We refine the\ngenerative model to also include a global state which can be omitted to capture the base case outlined\nbefore. Besides the local state, the global state may help the model capture long-term information.\n\nDecoder. The decoder is a probabilistic neural network that models the data as a function of their\nunderlying latent codes. We use a stochastic recurrent variational autoencoder that transforms a\nsequence of local latent variables z_{1:T} and a global state f into the frame sequence x_{1:T}, expressed\nby the following joint distribution:\n\np_\u03b8(x_{1:T}, z_{1:T}, f) = p_\u03b8(f) p_\u03b8(z_{1:T}) \u220f_{t=1}^{T} p_\u03b8(x_t | z_t, f).    (4)\n\nWe discuss the prior distributions p_\u03b8(f) and p_\u03b8(z_{1:T}) separately below. Each reconstructed frame\nx\u0303_t, sampled from the frame likelihood p_\u03b8(x_t | f, z_t), depends on the corresponding latent variables\nz_t and (optionally) global variables f. We use a Laplace distribution for the frame likelihood,\np_\u03b8(x_t | z_t, f) = Laplace(\u03bc_\u03b8(z_t, f), \u03bb^{\u22121} 1), whose logarithm results in an \u2113_1 loss, which we\nobserve produces sharper images than the \u2113_2 loss (Isola et al., 2017; Zhao et al., 2016).\nThe decoder mean, \u03bc_\u03b8(\u00b7), is a function parametrized by neural networks. Crucially, the decoder\nis conditioned both on global code f and time-local code z_t. In detail, (f, z_t) are combined by a\nmultilayer perceptron (MLP) which is then followed by upsampling transpose convolutional layers to\nform the mean. More details on the architecture can be found in the supplementary material. 
After\ntraining, the reconstructed frame in image space is obtained from the mean, x\u0303_t = \u03bc_\u03b8(z_t, f).\n\nEncoder. As the inverse of the decoder, the optimal encoder would be the Bayesian posterior\np(z_{1:T}, f | x_{1:T}) of the generative model above, which is analytically intractable. Therefore, we\nemploy amortized variational inference (Blei et al., 2017; Zhang et al., 2018; Marino et al., 2018) to\npredict a distribution over latent codes given the input video,\n\nq_\u03c6(z_{1:T}, f | x_{1:T}) = q_\u03c6(f | x_{1:T}) \u220f_{t=1}^{T} q_\u03c6(z_t | x_t).    (5)\n\nThe global variables f are inferred from all video frames in a sequence and may thus contain global\ninformation, while z_t is only inferred from a single frame x_t.\nAs explained above in Section 3.2, modifications to standard variational inference are required for\nfurther lossless compression into binary. Instead of sampling from Gaussian distributions with learned\nvariances, here we employ fixed-width uniform distributions centered at their means:\n\nf\u0303 \u223c q_\u03c6(f | x_{1:T}) = U(f\u0302 \u2212 1/2, f\u0302 + 1/2),    z\u0303_t \u223c q_\u03c6(z_t | x_t) = U(z\u0302_t \u2212 1/2, z\u0302_t + 1/2).\n\nThe means are predicted by additional encoder neural networks f\u0302 = \u03bc_\u03c6(x_{1:T}), z\u0302_t = \u03bc_\u03c6(x_t) with\nparameters \u03c6. This choice of inference distribution leads exactly to injection of noise with width\none centered at the most likely values for the latent variables, as described in Section 3.2. The\nmean for the global state is parametrized by convolutions over x_{1:T}, followed by a bi-directional\nLSTM which is then processed by an MLP. The encoder mean for the local state is simpler, consisting\nof convolutions over each frame followed by an MLP. More details on the architecture are\nprovided in the supplementary material.\n\nPrior Models. 
The models parametrizing the learned prior distributions are ultimately used as the\nprobability models for entropy coding. The global prior p_\u03b8(f) is assumed to be stationary, while\np_\u03b8(z_{1:T}) consists of a time series model. Each dimension of the latent space has its own density\nmodel:\n\np_\u03b8(f) = \u220f_{i}^{dim(f)} p_\u03b8(f^i) \u2217 U(\u22121/2, 1/2);    p_\u03b8(z_{1:T}) = \u220f_{t}^{T} \u220f_{i}^{dim(z)} p_\u03b8(z_t^i | z_{<t}) \u2217 U(\u22121/2, 1/2).    (6)\n\nAbove, superscript indices refer to the dimension index of the latent variable. The convolution with uniform\nnoise is to allow the priors to better match the true marginal distribution when working with the\nbox-shaped approximate posterior (see Ball\u00e9 et al. (2018) Appendix 6.2). This convolution has an\nanalytic form in terms of the cumulative distribution function.\nThe stationary density p_\u03b8(f^i) is adopted from (Ball\u00e9 et al., 2018); it is a flexible non-parametric,\nfully-factorized model that leads to a good match between prior and latent code distribution. The\ndensity is defined through its cumulative distribution function and is built out of compositions of nonlinear probability densities,\nsimilar to the construction of a normalizing flow (Rezende & Mohamed, 2015).\n\nFigure 3: Rate-distortion curves on three datasets, (a) Sprites, (b) BAIR, (c) Kinetics, measured in PSNR (higher corresponds to lower\ndistortion). Legend shared. Solid lines correspond to our models, with LSTMP-LG proposed.\n\nTwo dynamical models are considered to model the sequence z_{1:T}. We propose a recurrent LSTM\nprior architecture for p_\u03b8(z_t^i | z_{<t}) which conditions on all previous frames in a segment. The\ndistribution p_\u03b8 is taken to be normal with mean and variance predicted by the LSTM. 
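The analytic form of the convolution with uniform noise can be made concrete for the normal case. The density p convolved with U(-1/2, 1/2) and evaluated at z equals CDF(z + 1/2) - CDF(z - 1/2), so at an integer z it is exactly the probability mass that an entropy coder needs. The sketch below is an illustration under assumptions: the function names are hypothetical, and `mean` and `std` are placeholder numbers standing in for the values that the LSTM would predict.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of Normal(mean, std^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def integer_prob(k, mean, std):
    """Normal density convolved with U(-1/2, 1/2), evaluated at k.

    Equals the probability that a Normal(mean, std^2) sample rounds to k.
    """
    return normal_cdf(k + 0.5, mean, std) - normal_cdf(k - 0.5, mean, std)


def integer_code_length(k, mean, std):
    """Ideal code length in bits for the rounded latent value k."""
    return -math.log2(integer_prob(k, mean, std))


# Placeholder predictions (assumed values, not taken from the paper).
bits_confident = integer_code_length(2, mean=2.0, std=0.5)
bits_uncertain = integer_code_length(2, mean=2.0, std=3.0)
total_mass = sum(integer_prob(k, 2.0, 3.0) for k in range(-60, 65))
```

A sharper prediction (smaller `std`) around the correct value yields a shorter code, which is how a more certain temporal prior translates into a lower bit rate. The masses over all integers sum to one, so they form a valid distribution for arithmetic coding.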
We also\nconsidered a simpler model, which we compare against, with a single frame context, p_\u03b8(z_t^i | z_{<t}) =\np_\u03b8(z_t^i | z_{t\u22121}), which is essentially a deep Kalman filter (Krishnan et al., 2015).\n\nVariational Objective. The encoder (variational model) and decoder (generative model) are learned\njointly by maximizing the \u03b2-VAE objective (Higgins et al., 2017; Mandt et al., 2016),\n\nL(\u03c6, \u03b8) = E_{f\u0303, z\u0303_{1:T} \u223c q_\u03c6}[log p_\u03b8(x_{1:T} | f\u0303, z\u0303_{1:T})] + \u03b2 E_{f\u0303, z\u0303_{1:T} \u223c q_\u03c6}[log p_\u03b8(f\u0303, z\u0303_{1:T})].    (7)\n\nThe first term corresponds to the distortion, while the second term is, up to sign, the cross entropy between the\napproximate posterior and the prior. This cross entropy has the interpretation of the expected code length when\nusing the prior distribution p(f, z_{1:T}) to entropy code the latent variables. It is known (Hoffman &\nJohnson, 2016) that this term encourages the prior model to approximate the empirical distribution of\ncodes, E_{x_{1:T}}[q(f, z_{1:T} | x_{1:T})]. For our choice of generative model, the cross entropy separates into\ntwo independent terms H[q_\u03c6(f | x_{1:T}), p_\u03b8(f)] and H[q_\u03c6(z_{1:T} | x_{1:T}), p_\u03b8(z_{1:T})]. Note that for our\nchoice of variational distribution, the entropy contribution of q_\u03c6 is constant and is therefore omitted.\n\n4 Experiments\n\nIn this section, we present the experimental results of our work. We first describe the datasets,\nperformance metrics, and baseline methods in Section 4.1. This is followed by a quantitative analysis\nin terms of rate-distortion curves in Section 4.2 and by qualitative results in Section 4.3.\n\n4.1 Datasets, Metrics, and Methods\n\nIn this work, we train separately on three video datasets of increasing complexity with frame size\n64 \u00d7 64. 1) Sprites. 
The simplest dataset consists of videos of Sprites characters from an open-source\nvideo game project, which is also used in (Reed et al., 2015; Mathieu et al., 2016; Li & Mandt, 2018).\nThe videos are generated from a script that samples the character action, skin color, clothing, and\neyes from a collection of choices and have an inherently low-dimensional description (i.e., the script\nthat generated them). 2) BAIR. The BAIR robot pushing dataset (Ebert et al., 2017) consists of a robot\npushing objects on a table, which is also used in (Babaeizadeh et al., 2018; Denton & Fergus, 2018;\nLee et al., 2018). The video is more realistic and less sparse, but the content is specialized since all\nscenes contain the same background and robot, and the depicted action is simple since the motion is\ndescribed by a limited set of commands sent to the robot. The first two datasets are uncompressed and\nno preprocessing is performed. 3) Kinetics600. The last dataset is the Kinetics600 dataset (Kay et al.,\n2017), which is a diverse set of YouTube videos depicting human actions. The videos are cropped and\ndownsampled to 64 \u00d7 64, which removes compression artifacts.\nMetrics. Evaluation is based on bit rate in bits per pixel (bpp) and distortion measured in average\nframe peak signal-to-noise ratio (PSNR), which is related to the frame mean square error. In the\nsupplementary material, we also report on multi-scale structural similarity (MS-SSIM) (Wang et al.,\n2004), which is a perception-based metric that approximates the change in structural information.\n\n[Rate-distortion plot residue. Axes: bit rate [bits/pixel] (0.0 to 1.4) vs. PSNR [dB] (15 to 50); legend: H.264, H.265, VP9, KFP-LG, LSTMP-LG, LSTMP-L.]\n\nPanels: (a) Sprites, (b) BAIR, (c) Kinetics.\n\nFigure 4: Average bits of information stored in f and z_{1:T} with PSNR 43.2, 37.1, 30.3 dB for\ndifferent models in (a, b, c). 
Entropy drops with the frame index as the models adapt to the sequence.\n\nComparisons. We wish to study the performance of our proposed local-global architecture with\nLSTM prior (LSTMP-LG) by comparing to other approaches. To study the effectiveness of the\nglobal state, we introduce our baseline model LSTMP-L which has only local states with LSTM\nprior p_\u03b8(z_t | z_{<t}). To study the efficiency of the predictive model, we show our baseline model\nKFP-LG which has both global and local states but with a weak predictive model p_\u03b8(z_t | z_{t\u22121}),\na deep Kalman filter (Krishnan et al., 2015). We also provide the performance of H.264, H.265,\nand VP9 codecs. Traditional codecs are not optimized for low-resolution videos. However, their\nperformance is far superior to neural or classical image compression methods (applied to compress\nvideo frame by frame), so their performance is presented for comparison. Codec performance is\nevaluated using the open source FFMPEG implementation in constant rate mode, and distortion is\nvaried by adjusting the constant rate factor. Unless otherwise stated, performance is tested on videos\nwith 4:4:4 chroma sampling and on test videos with T = 10 frames. A comparison with classical\ncodec performance on longer videos is shown in the supplementary material.\n\n4.2 Quantitative Analysis: Rate-Distortion Tradeoff\n\nQuantitative performance is evaluated in terms of rate-distortion curves. For a fixed quality setting, a\nvideo codec produces an average bit rate on a given dataset. By varying the quality setting, a curve is\ntraced out in the rate-distortion plane. Our curves are generated by varying \u03b2 (Eq. 7).\nThe rate-distortion curves for our method, trained on three datasets and measured in PSNR, are\nshown in Fig. 3. Higher curves indicate better performance. 
From the Sprites and BAIR results, one\nsees that our method has the ability to dramatically outperform traditional codecs when focusing on\nspecialized content. By training on videos with fixed content, the model is able to learn an efficient\nrepresentation for such content, and the learned priors capture the empirical data distribution well.\nThe results from training on the more diverse Kinetics videos also outperform or are competitive with\nstandard codecs and better demonstrate the performance of our method on general content videos.\nSimilar results are obtained with respect to MS-SSIM (supplementary material).\nA further observation is that the LSTM prior outperforms the deep Kalman filter prior in all cases.\nThis is because the LSTM model has more context, allowing the predictive model to be more certain\nabout the trajectory of the local latent variables, which in turn results in shorter code lengths. We also\nobserve that the local-global architecture (LSTMP-LG) outperforms the local architecture (LSTMP-L)\non all datasets. The VAE encoder has the option to store information in local or global variables. The\nlocal variables are modeled by a temporal prior and can be efficiently stored in binary if the sequence\nz_{1:T} can be sequentially predicted from the context. The global variables, on the other hand, provide\nan architectural approach to removing temporal redundancy since the entire segment is stored in one\nglobal state without temporal structure.\nDuring training, the VAE learns how to divide information between the global and local variables. The\nutilization of each variable can be visualized by plotting the average code length of each latent state,\nwhich is shown in Fig. 4. The VAE learns to make substantial use of the global variables even though\ndim(z) is sufficiently large to store the entire content of each individual frame. 
This provides further\nevidence that it is more efficient to incorporate global inference over several frames. The entropy\nin the local variables initially tends to decrease as a function of time, which supports the benefits\nfrom our predictive models. Note that our approach relies on sequential decoding, prohibiting a\nbi-directional LSTM prior model for the local state.\n\n[Figure 4 plot residue. Horizontal axis: latent variables f, z1, ..., z10 (f, z1, ..., z8 in the first panel); vertical axis: Log10 Entropy [bits]; legend: LSTMP-L, KFP-LG, LSTMP-LG.]\n\n[Figure 5 image labels: Ours (38.1 dB @ 0.29 bpp); VP9 (25.7 dB @ 0.44 bpp); Original; Ours (0.39 bpp); VP9 (0.39 bpp); frames t=1, t=5, t=10; per-frame values 32.0 dB, 29.3 dB, 30.1 dB, 30.8 dB.]\n\nFigure 5: Compressed videos by our LSTMP-LG model and VP9 in the low bit rate regime (measured\nin bpp). Our approach achieves better quality (measured in dB) on specialized content (BAIR, left)\nand comparable visual quality on generic video content (Kinetics, right) compared to VP9.\n\n4.3 Qualitative Results\n\nWe have shown that a deep neural approach (LSTMP-LG architecture) can achieve competitive\nresults with traditional codecs with respect to PSNR or MS-SSIM (see supplementary material)\nmetrics overall on low-resolution videos. Test videos from the Sprites and BAIR datasets after\ncompression with our method are shown in Fig. 1 and Fig. 5 (left), respectively, and compared to\nmodern codec performance. Our method achieves a superior image quality at a significantly lower\nbit rate than H.264/H.265 and VP9 on these specialized content datasets. This is perhaps expected\nsince traditional codecs cannot learn efficient representations for specialized content. Furthermore,\nfine-grained motion is not accurately predicted with block motion estimation. The artifacts from our\nmethod are displayed in Fig. 5 (right). 
Our method tends to produce blurry video in the low bit-rate\nregime but does not suffer from the block artifacts present in the H.265/VP9 compressed video.\n\n5 Conclusions\n\nWe have proposed a deep generative modeling approach to video compression. Our method simultaneously\nlearns to transform the original video into a lower-dimensional representation as well as\nthe temporally-conditioned probabilistic model for entropy coding. The best performing proposed\narchitecture splits up the latent code into global and local variables and yields competitive results\non low-resolution videos. For video sources with specialized content, deep generative video coding\nallows for a significant increase in coding performance, as our experiment on BAIR suggests. This\ncould be interesting for transmitting specialized content such as teleconferencing.\nOur experimental analysis focused on small-scale videos. One future avenue is to design alternative\npriors that better scale to full-resolution videos, where the dimension of the latent representation\nmust scale with the resolution of the video in order to achieve high quality reconstruction. For the\nlocal/global architecture that we investigated experimentally, the GPU memory limits the maximum\nsize of the latent dimension due to the presence of fully-connected layers to infer global and local\nstates. While being efficient for small videos in the strongly compressed regime, this effectively limits\nthe maximum achievable image quality. Future architectures may focus more on fully convolutional\ncomponents. Apart from a different temporal prior, the proposed coding scheme will remain the same.\nSince our approach uses a learned prior for entropy coding, this suggests that improved compression\nperformance can be achieved by improving video prediction. In future work, it will be interesting to\nsee how our model will work with more efficient predictive models for full-resolution videos. 
It is\nalso interesting to think about comparisons between deterministic and stochastic approaches to neural\ncompression. We argue that by modeling the full data distribution of each frame, a probabilistic\napproach should be able to achieve shorter code lengths for fat-tailed and skewed data distributions\nthan maximum-likelihood based compression methods. Thus we think that our work is a first step\ninto a new direction for video coding which opens up several exciting avenues for future work.\n\nReferences\n\nEirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, Luca\nBenini, and Luc Van Gool. Soft-to-hard vector quantization for end-to-end learning compressible\nrepresentations. In Advances in Neural Information Processing Systems, 2017.\n\nAlexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. Fixing a\nbroken ELBO. In International Conference on Machine Learning, 2018.\n\nMohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine.\nStochastic variational video prediction. International Conference on Learning Representations,\n2018.\n\nJohannes Ball\u00e9, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression.\nInternational Conference on Learning Representations, 2016.\n\nJohannes Ball\u00e9, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational\nimage compression with a scale hyperprior. International Conference on Learning Representations,\n2018.\n\nJustin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. Workshop in Advances\nin Approximate Bayesian Inference at NIPS, 2014.\n\nDavid M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians.\nJournal of the American Statistical Association, 2017.\n\nTong Chen, Haojie Liu, Qiu Shen, Tao Yue, Xun Cao, and Zhan Ma. Deepcoder: A deep neural\nnetwork based video compression. 
In 2017 IEEE Visual Communications and Image Processing\n(VCIP), pp. 1\u20134. IEEE, 2017.\n\nZhibo Chen, Tianyu He, Xin Jin, and Feng Wu. Learning for video compression. IEEE Transactions\non Circuits and Systems for Video Technology, 2019.\n\nJunyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio.\nA recurrent latent variable model for sequential data. In Advances in Neural Information Processing\nSystems, 2015.\n\nCisco. Visual Network Index: Forecast and methodology, 2016-2021. White Paper, 2017.\n\nEmily Denton and Rob Fergus. Stochastic video generation with a learned prior. International\nConference on Machine Learning, 2018.\n\nAziz Djelouah, Joaquim Campos, Simone Schaub, and Christopher Schroers. Neural inter-frame\ncompression for video coding. International Conference on Computer Vision, 2019.\n\nFrederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Self-supervised visual planning with\ntemporal skip connections. Conference on Robot Learning, 2017.\n\nFraunhofer. Institute website: versatile video coding. https://jvet.hhi.fraunhofer.de, 2018.\n\nSamuel Gershman and Noah Goodman. Amortized inference in probabilistic reasoning. In Proceedings\nof the annual meeting of the cognitive science society, volume 36, 2014.\n\nIan Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,\nAaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural\nInformation Processing Systems, 2014.\n\nAmirhossein Habibian, Ties van Rozendaal, Jakub M Tomczak, and Taco S Cohen. Video compression\nwith rate-distortion autoencoders. arXiv preprint arXiv:1908.05717, 2019.\n\nJiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori, and Leonid Sigal. Probabilistic video\ngeneration using holistic attribute control. 
European Conference on Computer Vision, 2018.\n\nIrina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick,\nShakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a\nconstrained variational framework. International Conference on Learning Representations, 2017.\n\nMatthew D Hoffman and Matthew J Johnson. ELBO surgery: yet another way to carve up the\nvariational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference at\nNIPS, 2016.\n\nPhillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with\nconditional adversarial networks. In Conference on Computer Vision and Pattern Recognition,\n2017.\n\nNick Johnston, Damien Vincent, David Minnen, Michele Covell, Saurabh Singh, Troy Chinen, Sung\nJin Hwang, Joel Shor, and George Toderici. Improved lossy image compression with priming\nand spatially adaptive bit rates for recurrent networks. In Proceedings of the IEEE Conference on\nComputer Vision and Pattern Recognition, pp. 4385\u20134393, 2018.\n\nWill Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan,\nFabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset.\narXiv preprint arXiv:1705.06950, 2017.\n\nDiederik P Kingma and Max Welling. Auto-encoding variational Bayes. Advances in Neural\nInformation Processing Systems, 2014.\n\nRahul G Krishnan, Uri Shalit, and David Sontag. Deep Kalman filters. Advances in Approximate\nBayesian Inference & Black Box Inference Workshop at NIPS, 2015.\n\nGlen G Langdon. An introduction to arithmetic coding. IBM Journal of Research and Development,\n1984.\n\nAlex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine.\nStochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.\n\nYingzhen Li and Stephan Mandt. Disentangled sequential autoencoder.
The International Conference\non Machine Learning, 2018.\n\nGuo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, and Zhiyong Gao. DVC: An\nend-to-end deep video compression framework. In Conference on Computer Vision and Pattern\nRecognition, 2019.\n\nD.J.C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University\nPress, 2003.\n\nStephan Mandt, James McInerney, Farhan Abrol, Rajesh Ranganath, and David Blei. Variational\ntempering. In Artificial Intelligence and Statistics, 2016.\n\nJoseph Marino, Yisong Yue, and Stephan Mandt. Iterative amortized inference. In International\nConference on Machine Learning, 2018.\n\nMichael F Mathieu, Junbo Jake Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann\nLeCun. Disentangling factors of variation in deep representation using adversarial training. In\nAdvances in Neural Information Processing Systems, pp. 5040\u20135048, 2016.\n\nDavid Minnen, Johannes Ball\u00e9, and George Toderici. Joint autoregressive and hierarchical priors for\nlearned image compression. Advances in Neural Information Processing Systems, 2018.\n\nDebargha Mukherjee, Jingning Han, Jim Bankoski, Ronald Bultje, Adrian Grange, John Koleszar,\nPaul Wilkins, and Yaowu Xu. A technical overview of VP9\u2014the latest open-source video codec.\nSMPTE Motion Imaging Journal, 2015.\n\nHans Georg Musmann, Peter Pirsch, and H-J Grallert. Advances in picture coding. Proceedings of\nthe IEEE, 73(4), 1985.\n\nScott E Reed, Yi Zhang, Yuting Zhang, and Honglak Lee. Deep visual analogy-making. In Advances\nin Neural Information Processing Systems, pp. 1252\u20131260, 2015.\n\nDanilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows.\nInternational Conference on Machine Learning, 2015.\n\nDanilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and\napproximate inference in deep generative models.
The International Conference on Machine\nLearning, 2014.\n\nJorma Rissanen and Glen G Langdon. Arithmetic coding. IBM Journal of Research and Development,\n23(2):149\u2013162, 1979.\n\nShibani Santurkar, David Budden, and Nir Shavit. Generative compression. In 2018 Picture Coding\nSymposium (PCS), pp. 258\u2013262. IEEE, 2018.\n\nCE Shannon. A mathematical theory of communication. The Bell System Technical Journal, 2001.\n\nGary J Sullivan, Jens-Rainer Ohm, Woo-Jin Han, Thomas Wiegand, et al. Overview of the high\nefficiency video coding (HEVC) standard. IEEE Transactions, 2012.\n\nLucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Husz\u00e1r. Lossy image compression with\ncompressive autoencoders. International Conference on Learning Representations, 2017.\n\nGeorge Toderici, Sean M O\u2019Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet\nBaluja, Michele Covell, and Rahul Sukthankar. Variable rate image compression with recurrent\nneural networks. International Conference on Learning Representations, 2016.\n\nGeorge Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and\nMichele Covell. Full resolution image compression with recurrent neural networks. 2017.\n\nCarl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In\nAdvances in Neural Information Processing Systems, 2016.\n\nZhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from\nerror visibility to structural similarity. IEEE Transactions on Image Processing, 2004.\n\nThomas Wiegand, Gary J Sullivan, Gisle Bjontegaard, and Ajay Luthra. Overview of the H.264/AVC\nvideo coding standard. IEEE Transactions, 2003.\n\nChao-Yuan Wu, Nayan Singhal, and Philipp Kr\u00e4henb\u00fchl. Video compression through image\ninterpolation. European Conference on Computer Vision, 2018.\n\nQiangeng Xu, Hanwang Zhang, Peter N Belhumeur, and Ulrich Neumann.
Stochastic dynamics for\nvideo infilling. Winter Conference on Applications of Computer Vision, 2020.\n\nCheng Zhang, Judith Butepage, Hedvig Kjellstrom, and Stephan Mandt. Advances in variational\ninference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.\n\nHang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz. Loss functions for image restoration with\nneural networks. IEEE Transactions on Computational Imaging, 3(1):47\u201357, 2016.", "award": [], "sourceid": 4970, "authors": [{"given_name": "Salvator", "family_name": "Lombardo", "institution": "Disney Research"}, {"given_name": "JUN", "family_name": "HAN", "institution": "Dartmouth College"}, {"given_name": "Christopher", "family_name": "Schroers", "institution": "Disney Research|Studios"}, {"given_name": "Stephan", "family_name": "Mandt", "institution": "Disney Research"}]}