{"title": "Compression with Flows via Local Bits-Back Coding", "book": "Advances in Neural Information Processing Systems", "page_first": 3879, "page_last": 3888, "abstract": "Likelihood-based generative models are the backbones of lossless compression due to the guaranteed existence of codes with lengths close to negative log likelihood. However, there is no guaranteed existence of computationally efficient codes that achieve these lengths, and coding algorithms must be hand-tailored to specific types of generative models to ensure computational efficiency. Such coding algorithms are known for autoregressive models and variational autoencoders, but not for general types of flow models. To fill in this gap, we introduce local bits-back coding, a new compression technique for flow models. We present efficient algorithms that instantiate our technique for many popular types of flows, and we demonstrate that our algorithms closely achieve theoretical codelengths for state-of-the-art flow models on high-dimensional data.", "full_text": "Compression with Flows via Local Bits-Back Coding\n\nJonathan Ho\nUC Berkeley\n\njonathanho@berkeley.edu\n\nEvan Lohn\nUC Berkeley\n\nevan.lohn@berkeley.edu\n\nPieter Abbeel\n\nUC Berkeley, covariant.ai\n\npabbeel@cs.berkeley.edu\n\nAbstract\n\nLikelihood-based generative models are the backbones of lossless compression due\nto the guaranteed existence of codes with lengths close to negative log likelihood.\nHowever, there is no guaranteed existence of computationally efficient codes that\nachieve these lengths, and coding algorithms must be hand-tailored to specific types\nof generative models to ensure computational efficiency. Such coding algorithms\nare known for autoregressive models and variational autoencoders, but not for\ngeneral types of flow models. To fill in this gap, we introduce local bits-back coding,\na new compression technique for flow models. 
We present efficient algorithms\nthat instantiate our technique for many popular types of flows, and we demonstrate\nthat our algorithms closely achieve theoretical codelengths for state-of-the-art flow\nmodels on high-dimensional data.\n\n1 Introduction\n\nTo devise a lossless compression algorithm means to devise a uniquely decodable code whose\nexpected length is as close as possible to the entropy of the data. A general recipe for this is to first\ntrain a generative model by minimizing cross entropy to the data distribution, and then construct a\ncode that achieves lengths close to the negative log likelihood of the model. This recipe is justified\nby classic results in information theory that ensure that the second step is possible\u2014in other words,\noptimizing cross entropy optimizes the performance of some hypothetical compressor. And, thanks\nto recent advances in deep likelihood-based generative models, these hypothetical compressors are\nquite good. Deep autoregressive models, latent variable models, and flow models are now achieving\nstate-of-the-art cross entropy scores on a wide variety of real-world datasets in speech, videos, text,\nimages, and other domains [45, 46, 39, 34, 5, 31, 6, 22, 10, 11, 23, 35, 19, 44, 25, 29].\nBut we are not interested in hypothetical compressors. We are interested in practical, computationally\nefficient compressors that scale to high-dimensional data and harness the excellent cross entropy\nscores of modern deep generative models. Unfortunately, naively applying existing codes, like Huffman\ncoding [21], requires computing the model likelihood for all possible values of the data, which\nexpends computational resources scaling exponentially with the data dimension. This inefficiency\nstems from the lack of assumptions about the generative model\u2019s structure.\nCoding algorithms must be tailored to specific types of generative models if we want them to be\nefficient. 
There is already a rich literature of tailored coding algorithms for autoregressive models\nand variational autoencoders, which are built from conditional distributions that are already tractable for\ncoding [38, 12, 18, 13, 42], but there are currently no such algorithms for general types of flow\nmodels [32]. This lack of efficient coding algorithms is a drawback of flow models that\nstands at odds with their many advantages, like fast and realistic sampling, interpretable latent spaces,\nfast likelihood evaluation, competitive cross entropy scores, and ease of training with unbiased log\nlikelihood gradients [10, 11, 23, 19].\nTo rectify this situation, we introduce local bits-back coding, a new technique for turning a general,\npretrained, off-the-shelf flow model into an efficient coding algorithm suitable for continuous data\ndiscretized to high precision. We show how to implement local bits-back coding without assumptions\non the flow structure, leading to an algorithm that runs in polynomial time and space with respect\nto the data dimension. Going further, we show how to tailor our implementation to various specific\ntypes of flows, culminating in an algorithm for RealNVP-type flows that runs\nin linear time and space with respect to the data dimension and is fully parallelizable for both\nencoding and decoding. We then show how to adapt local bits-back coding to losslessly code data\ndiscretized to arbitrarily low precision, and in doing so, we obtain a new compression interpretation of\ndequantization, a method commonly used to train flow models on discrete data.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n
We test our algorithms\non recently proposed flow models trained on real-world image datasets, and we find that they are\ncomputationally efficient and attain codelengths in close agreement with theoretical predictions.\nOpen-source code is available at https://github.com/hojonathanho/localbitsback.\n\n2 Preliminaries\n\nLossless compression We begin by defining lossless compression of d-dimensional discrete data\nx\u25e6 using a probability mass function p(x\u25e6) represented by a generative model. It means to construct\na uniquely decodable code C, which is an injective map from data sequences to binary strings,\nwhose lengths |C(x\u25e6)| are close to \u2212 log p(x\u25e6) [7].1 The rationale is that if the generative model\nis expressive and trained well, its cross entropy will be close to the entropy of the data distribution.\nSo, if the lengths of C match the model\u2019s negative log probabilities, the expected length of C will\nbe small, and hence C will be a good compression algorithm. Constructing such a code is always\npossible in theory because the Kraft-McMillan inequality [27, 30] ensures that there always exists\nsome code with lengths |C(x\u25e6)| = \u2308\u2212 log p(x\u25e6)\u2309 \u2248 \u2212 log p(x\u25e6).\n\nFlow models We wish to construct a computationally efficient code specialized to a flow model f,\nwhich is a differentiable bijection between continuous data x \u2208 Rd and latents z = f (x) \u2208 Rd [9\u201311].\nA flow model comes with a density p(z) on the latent space and thus has an associated sampling\nprocess\u2014x = f\u22121(z) for z \u223c p(z)\u2014under which it defines a probability density function via the\nchange-of-variables formula for densities:\n\n\u2212 log p(x) = \u2212 log p(z) \u2212 log |det J(x)|\n\n(1)\n\nwhere J(x) denotes the Jacobian of f at x. Flow models are straightforward to train with maximum\nlikelihood, as Eq. 
(1) allows unbiased exact log likelihood gradients to be computed efficiently.\n\nDequantization Standard datasets such as CIFAR10 and ImageNet consist of discrete data x\u25e6 \u2208 Zd.\nTo make a flow model suitable for such discrete data, it is standard practice to define a derived discrete\nmodel P (x\u25e6) := \u222b[0,1)d p(x\u25e6 + u) du, to be trained by minimizing a dequantization objective, which\nis a variational bound on the codelength of P (x\u25e6):\n\nEu\u223cq(u|x\u25e6) [\u2212 log (p(x\u25e6 + u)/q(u|x\u25e6))] \u2265 \u2212 log \u222b[0,1)d p(x\u25e6 + u) du = \u2212 log P (x\u25e6)\n\n(2)\n\nHere, q(u|x\u25e6) proposes dequantization noise u \u2208 [0, 1)d that transforms discrete data x\u25e6 into\ncontinuous data x\u25e6 + u; it can be fixed to either a uniform distribution [43, 41, 39] or to another\nparameterized flow to be trained jointly with f [19]. This dequantization objective serves as a\ntheoretical codelength for flow models trained on discrete data, just like negative log probability mass\nserves as a theoretical codelength for discrete generative models [41].\n\n3 Local bits-back coding\n\nOur goal is to develop computationally efficient coding algorithms for flows trained with dequantization (2).\nIn Sections 3.1 to 3.4, we develop algorithms that use flows to code continuous data\ndiscretized to high precision. In Section 3.5, we adapt these algorithms to losslessly code data\ndiscretized to low precision, attaining our desired codelength (2) for discrete data.\n\n3.1 Coding continuous data using discretization\n\nWe first address the problem of developing coding algorithms that attain codelengths given by negative\nlog densities of flow models, such as Eq. (1). 
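As a quick numerical illustration of the change-of-variables formula (1), the following sketch (our own toy example, not the paper\u2019s released code) uses a 1-D flow f(x) = atanh(x) with a standard normal prior, samples x = f\u22121(z) = tanh(z), and checks that the empirical density of x near a point matches p(f(x)) |det J(x)| with J(x) = 1/(1 \u2212 x\u00b2):

```python
import math, random

# Empirical check of the change-of-variables formula (Eq. 1), a made-up
# 1-D flow: f(x) = atanh(x), prior p(z) = N(0, 1), inverse f^{-1} = tanh.
random.seed(0)
x0, h, n = 0.4, 0.02, 500000
# Histogram density estimate of x = tanh(z) near x0
hits = sum(1 for _ in range(n) if abs(math.tanh(random.gauss(0, 1)) - x0) < h)
empirical = hits / n / (2 * h)

z0 = math.atanh(x0)
pz = math.exp(-z0 * z0 / 2) / math.sqrt(2 * math.pi)  # prior density at f(x0)
model = pz / (1 - x0 ** 2)                            # p(x) = p(f(x)) |det J(x)|

assert abs(empirical - model) / model < 0.05
```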
Probability density functions do not directly map to\ncodelength, unlike probability mass functions which enjoy the result of the Kraft-McMillan inequality.\nSo, following standard procedure [7, section 8.3], we discretize the data to a high precision k\nand code this discretized data with a certain probability mass function derived from the density\nmodel.\n\n1We always use base 2 logarithms.\n\nSpecifically, we tile Rd with hypercubes of volume \u03b4x := 2\u2212kd; we call each hypercube\na bin. For x \u2208 Rd, let B(x) be the unique bin that contains x, and let \u00afx be the center of the bin\nB(x). We call \u00afx the discretized version of x. For a sufficiently smooth probability density function\np(x), such as a density coming from a neural network flow model, the probability mass function\nP (\u00afx) := \u222bB(\u00afx) p(x) dx takes on the pleasingly simple form P (\u00afx) \u2248 p(\u00afx)\u03b4x when the precision k is\nlarge. Now we invoke the Kraft-McMillan inequality, so the theoretical codelength for \u00afx using P is\n\n\u2212 log P (\u00afx) \u2248 \u2212 log p(\u00afx)\u03b4x\n\n(3)\n\nbits. This is the compression interpretation of the negative log density: it is a codelength for data\ndiscretized to high precision, when added to the total number of bits of discretization precision. It is\nthis codelength, Eq. (3), that we will try to achieve with an efficient algorithm for flow models. 
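This discretization is easy to check numerically. The sketch below (our own example, with a standard normal density standing in for a flow\u2019s density) compares the exact bin mass to p(\u00afx)\u03b4x at precision k = 8 in one dimension, and recovers the codelength of Eq. (3) as \u2212log\u2082 p(\u00afx) + k bits:

```python
import math

# Checking P(xbar) ~= p(xbar) * delta_x (Sec. 3.1) for a 1-D standard normal.
def p(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(t):  # standard normal CDF
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

k = 8
dx = 2.0 ** -k                  # bin volume delta_x = 2^(-k d) with d = 1
i = math.floor(0.3 / dx)        # index of the bin containing x = 0.3
a, b = i * dx, (i + 1) * dx
xbar = a + dx / 2               # bin center: the discretized data point

P_exact = Phi(b) - Phi(a)       # true bin mass, integral of p over the bin
P_approx = p(xbar) * dx         # high-precision approximation
codelen = -math.log2(P_approx)  # theoretical codelength (Eq. 3), in bits

assert abs(P_exact - P_approx) / P_exact < 1e-4
assert abs(codelen - (-math.log2(p(xbar)) + k)) < 1e-9
```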
We\ndefer the problem of coding data discretized to low precision to Section 3.5.\n\n3.2 Background on bits-back coding\n\nThe main tool we will employ is bits-back coding [47, 18, 13, 20], a coding technique originally\ndesigned for latent variable models (the connection to flow models is presented in Section 3.3 and\nis new to our work). Bits-back coding codes x using a distribution of the form p(x) = \u2211z p(x, z),\nwhere p(x, z) = p(x|z)p(z) includes a latent variable z; it is relevant when z ranges over an\nexponentially large set, which makes it intractable to code with p(x) even though coding with p(x|z)\nand p(z) may be tractable individually. Bits-back coding introduces a new distribution q(z|x) with\ntractable coding, and the encoder jointly encodes x along with z \u223c q(z|x) via these steps:\n\n1. Decode z \u223c q(z|x) from an auxiliary source of random bits\n2. Encode x using p(x|z)\n3. Encode z using p(z)\n\nThe first step, which decodes z from random bits, produces a sample z \u223c q(z|x). The second\nand third steps transmit z along with x. At decoding time, the decoder recovers (x, z), then\nrecovers the bits the encoder used to sample z using q. So, the encoder will have transmitted extra\ninformation in addition to x\u2014precisely Ez\u223cq(z|x) [\u2212 log q(z|x)] bits on average. Consequently, the\nnet number of bits transmitted regarding x only will be Ez\u223cq(z|x) [log q(z|x) \u2212 log p(x, z)], which\nis redundant compared to the desired length \u2212 log p(x) by an amount equal to the KL divergence\nDKL (q(z|x) \u2016 p(z|x)) from q to the true posterior.\nBits-back coding also works with continuous z discretized to high precision with negligible change\nin codelength [18, 42]. In this case, q(z|x) and p(z) are probability density functions. 
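The net-codelength accounting above can be verified exactly on a tiny discrete toy model (the distributions below are made up for illustration): the net bits equal \u2212log\u2082 p(x) plus the KL divergence from q to the true posterior.

```python
import math

# Bits-back accounting (Sec. 3.2) on a toy model with a binary latent z:
# net bits = E_{z~q}[log2 q(z|x) - log2 p(x,z)] = -log2 p(x) + KL(q || p(z|x)).
pz = [0.25, 0.75]         # prior p(z)
px_given_z = [0.9, 0.4]   # p(x|z) for one fixed observed x
q = [0.5, 0.5]            # any tractable q(z|x) codes losslessly

p_joint = [pz[z] * px_given_z[z] for z in range(2)]
px = sum(p_joint)

net_bits = sum(q[z] * (math.log2(q[z]) - math.log2(p_joint[z])) for z in range(2))
post = [pj / px for pj in p_joint]  # true posterior p(z|x)
kl = sum(q[z] * math.log2(q[z] / post[z]) for z in range(2))

assert abs(net_bits - (-math.log2(px) + kl)) < 1e-12
assert kl >= 0  # redundancy relative to the ideal -log2 p(x)
```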
Discretizing z\nto bins \u00afz of small volume \u03b4z and defining the probability mass functions Q(\u00afz|x) and P (\u00afz) by the\nmethod in Section 3.1, we see that the bits-back codelength remains approximately unchanged:\n\nE\u00afz\u223cQ(\u00afz|x) [\u2212 log (p(x|\u00afz)P (\u00afz)/Q(\u00afz|x))] \u2248 E\u00afz\u223cQ(\u00afz|x) [\u2212 log (p(x|\u00afz)p(\u00afz)\u03b4z/(q(\u00afz|x)\u03b4z))] \u2248 Ez\u223cq(z|x) [\u2212 log (p(x|z)p(z)/q(z|x))]\n\n(4)\n\nWhen bits-back coding is applied to a particular latent variable model, such as a VAE, the distributions\ninvolved may take on a certain meaning: p(z) would be the prior, p(x|z) would be the decoder\nnetwork, and q(z|x) would be the encoder network [25, 36, 8, 4, 13, 42, 26]. However, it is important\nto note that these distributions do not need to correspond explicitly to parts of the model at hand.\nAny will do for coding data losslessly, though some choices result in better codelengths. We exploit\nthis fact in Section 3.3, where we apply bits-back coding to flow models by constructing artificial\ndistributions p(x|z) and q(z|x), which do not come with a flow model by default.\n\n3.3 Local bits-back coding\n\nWe now present local bits-back coding, our new high-level principle for using a flow model f to code\ndata discretized to high precision. Following Section 3.1, we discretize continuous data x into \u00afx,\nwhich is the center of a bin of volume \u03b4x. The codelength we desire for \u00afx is the negative log density\nof f (1), plus a constant depending on the discretization precision:\n\n\u2212 log p(\u00afx)\u03b4x = \u2212 log p(f (\u00afx)) \u2212 log |det J(\u00afx)| \u2212 log \u03b4x\n\n(5)\n\nwhere J(\u00afx) is the Jacobian of f at \u00afx. We will construct two densities \u02dcp(z|x) and \u02dcp(x|z) such that\nbits-back coding attains Eq. (5). 
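For intuition, the continuous bits-back codelength of Eq. (4) can be checked on a toy 1-D Gaussian pair (our own example, not from the paper): taking q(z|x) equal to the exact posterior makes the net length equal \u2212log\u2082 p(x) for every decoded z, and the volume \u03b4z cancels between the decode step (+log\u2082 q\u00b7\u03b4z) and the encode step (\u2212log\u2082 p\u00b7\u03b4z).

```python
import math, random

# Continuous bits-back (Eq. 4) on a toy pair: z ~ N(0,1), x|z ~ N(z,1),
# so p(x) = N(0,2) and the true posterior is q(z|x) = N(x/2, 1/2).
def log2_normal(v, mean, var):
    return (-0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)) / math.log(2)

x = 0.7
random.seed(0)
nets = []
for _ in range(1000):
    z = random.gauss(x / 2, math.sqrt(0.5))      # decode z ~ q(z|x)
    nets.append(log2_normal(z, x / 2, 0.5)       # bits recovered for z
                - log2_normal(x, z, 1.0)         # bits spent encoding x given z
                - log2_normal(z, 0.0, 1.0))      # bits spent encoding z
avg = sum(nets) / len(nets)

# With q equal to the true posterior, the net length is -log2 p(x) exactly.
assert abs(avg - (-log2_normal(x, 0.0, 2.0))) < 1e-9
```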
We need a small scalar parameter \u03c3 > 0, with which we define\n\n\u02dcp(z|x) := N (z; f (x), \u03c32J(x)J(x)\u22a4) and \u02dcp(x|z) := N (x; f\u22121(z), \u03c32I)\n\n(6)\n\nTo encode \u00afx, local bits-back coding follows the method described in Section 3.2 with continuous z:\n\n1. Decode \u00afz \u223c \u02dcP (\u00afz|x) = \u222bB(\u00afz) \u02dcp(z|x) dz \u2248 \u02dcp(\u00afz|x)\u03b4z from an auxiliary source of random bits\n2. Encode \u00afx using \u02dcP (\u00afx|\u00afz) = \u222bB(\u00afx) \u02dcp(x|\u00afz) dx \u2248 \u02dcp(\u00afx|\u00afz)\u03b4x\n3. Encode \u00afz using P (\u00afz) = \u222bB(\u00afz) p(z) dz \u2248 p(\u00afz)\u03b4z\n\nThe conditional density \u02dcp(z|x) (6) is artificially injected noise, scaled by \u03c3 (the flow model f remains\nunmodified). It describes how a local linear approximation of f would behave if it were to act on a\nsmall Gaussian around \u00afx.\nTo justify local bits-back coding, we simply calculate its expected codelength. First, our choices of\n\u02dcp(z|x) and \u02dcp(x|z) (6) satisfy the following equation:\n\nEz\u223c \u02dcp(z|x) [log \u02dcp(z|x) \u2212 log \u02dcp(x|z)] = \u2212 log |det J(x)| + O(\u03c32)\n\n(7)\n\nNext, just like standard bits-back coding (4), local bits-back coding attains an expected codelength\nclose to Ez\u223c \u02dcp(z|\u00afx) L(\u00afx, z), where\n\nL(x, z) := log \u02dcp(z|x)\u03b4z \u2212 log \u02dcp(x|z)\u03b4x \u2212 log p(z)\u03b4z\n\n(8)\n\nEquations (6) to (8) imply that the expected codelength matches our desired codelength (5), up to\nfirst order in \u03c3 (see Appendix A for details):\n\nEzL(x, z) = \u2212 log p(x)\u03b4x + O(\u03c32)\n\n(9)\n\nNote that local bits-back coding exactly achieves the desired codelength for flows (5), up to first order\nin \u03c3. 
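Eq. (7) can be checked by Monte Carlo on a toy 1-D flow. The example below is ours, choosing f(x) = exp(x) so that J = exp(x) and f\u22121(z) = log z are available in closed form; the expectation recovers \u2212log |det J| up to O(\u03c3\u00b2):

```python
import math, random

# Monte Carlo check of Eq. (7) for a toy 1-D flow f(x) = exp(x):
# ~p(z|x) = N(f(x), s^2 J^2) and ~p(x|z) = N(f^{-1}(z), s^2), J = exp(x).
def ln_normal(v, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)

x, s = 0.5, 1e-3
J = math.exp(x)
random.seed(1)
n, total = 50000, 0.0
for _ in range(n):
    z = random.gauss(math.exp(x), s * J)              # z ~ ~p(z|x)
    total += ln_normal(z, math.exp(x), (s * J) ** 2)  # log ~p(z|x)
    total -= ln_normal(x, math.log(z), s ** 2)        # log ~p(x|z)
avg = total / n

assert abs(avg - (-math.log(J))) < 1e-3  # matches -log|det J| up to O(s^2)
```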
This is in stark contrast to bits-back coding with latent variable models like VAEs, for which the\nbits-back codelength is the negative evidence lower bound, which is redundant by an amount equal to\nthe KL divergence from the approximate posterior to the true posterior [25].\nLocal bits-back coding always codes \u00afx losslessly, no matter the setting of \u03c3, \u03b4x, and \u03b4z. However,\n\u03c3 must be small for the O(\u03c32) inaccuracy in Eq. (9) to be negligible. But for \u03c3 to be small, the\ndiscretization volumes \u03b4z and \u03b4x must be small too, otherwise the discretized Gaussians \u02dcp(\u00afz|x)\u03b4z\nand \u02dcp(\u00afx|z)\u03b4x will be poor approximations of the original Gaussians \u02dcp(z|x) and \u02dcp(x|z). So, because\n\u03b4x must be small, the data x must be discretized to high precision. And, because \u03b4z must be small, a\nrelatively large number of auxiliary bits must be available to decode \u00afz \u223c \u02dcp(\u00afz|x)\u03b4z. We will resolve\nthe high precision requirement for the data with another application of bits-back coding in Section 3.5,\nand we will explore the impact of varying \u03c3, \u03b4x, and \u03b4z on real-world data in experiments in Section 4.\n\n3.4 Concrete local bits-back coding algorithms\n\nWe have shown that local bits-back coding attains the desired codelength (5) for data discretized to\nhigh precision. Now, we instantiate local bits-back coding with concrete algorithms.\n\n3.4.1 Black box flows\n\nAlgorithm 1 is the most straightforward implementation of local bits-back coding. It directly implements\nthe steps in Section 3.3 by invoking an external procedure, such as automatic differentiation, to\nexplicitly compute the Jacobian of the flow. 
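As a sketch of this black-box ingredient (our own minimal example; the paper\u2019s released code may differ), the Jacobian of an arbitrary flow can be obtained numerically and then Cholesky-factored so that the decode-time Gaussian N(f(\u00afx), \u03c3\u00b2JJ\u22a4) of Algorithm 1 can be handled one coordinate at a time. The 2-D bijection below is made up for illustration, and finite differences stand in for automatic differentiation:

```python
import numpy as np

# Black-box Jacobian computation for a made-up 2-D flow, plus the Cholesky
# factor of sigma^2 J J^T used when decoding z ~ N(f(x), sigma^2 J J^T).
def f(x):
    return np.array([np.exp(x[0]), x[1] + x[0] ** 2])

def jacobian(f, x, eps=1e-6):
    # Central finite differences, a stand-in for automatic differentiation.
    d = len(x)
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

x = np.array([0.5, -0.3])
J = jacobian(f, x)
sigma = 0.01
L = np.linalg.cholesky(sigma ** 2 * J @ J.T)  # L L^T = sigma^2 J J^T

# Sampling z = f(x) + L @ eps with eps ~ N(0, I), coordinate by coordinate,
# reproduces the Gaussian decoded in Algorithm 1.
analytic = np.array([[np.exp(x[0]), 0.0], [2 * x[0], 1.0]])
assert np.allclose(J, analytic, atol=1e-4)
assert np.allclose(L @ L.T, sigma ** 2 * J @ J.T, atol=1e-12)
```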
It therefore makes no assumptions on the structure of the\nflow, and hence we call it the black box algorithm.\n\nAlgorithm 1 Local bits-back encoding: for black box flows (decoding in Appendix B)\nRequire: data \u00afx, flow f, discretization volumes \u03b4x, \u03b4z, noise level \u03c3\n1: J \u2190 Jf (\u00afx) \u25b7 Compute the Jacobian of f at \u00afx\n2: Decode \u00afz \u223c N (f (\u00afx), \u03c32JJ\u22a4) \u03b4z \u25b7 By converting to an AR model (Section 3.4.1)\n3: Encode \u00afx using N (f\u22121(\u00afz), \u03c32I) \u03b4x\n4: Encode \u00afz using p(\u00afz) \u03b4z\n\nCoding with \u02dcp(x|z) (6) is efficient because its coordinates are independent [42]. The same applies\nto the prior p(z) if its coordinates are independent or if another efficient coding algorithm already\nexists for it (see Section 3.4.3). Coding efficiently with \u02dcp(z|x) relies on the fact that any multivariate\nGaussian can be converted into a linear autoregressive model, which can be coded efficiently, one\ncoordinate at a time, using arithmetic coding or asymmetric numeral systems. To see how, suppose\ny = J\u03b5, where \u03b5 \u223c N (0, I) and J is a full-rank matrix (such as a Jacobian of a flow model). Let L\nbe the Cholesky decomposition of JJ\u22a4. Since LL\u22a4 = JJ\u22a4, the distribution of L\u03b5 is equal to the\ndistribution of J\u03b5 = y, and so solutions \u02dcy to the linear system L\u22121 \u02dcy = \u03b5 have the same distribution\nas y. Because L is triangular, L\u22121 is easily computable and also triangular, and thus \u02dcy can be\ndetermined with back substitution: \u02dcyi = (\u03b5i \u2212 \u2211j<i (L\u22121)ij \u02dcyj)/(L\u22121)ii.\nCoupling layers, as in RealNVP-type flows, split the coordinates of x into two halves x\u2264d/2 and x>d/2.\nThe first half is passed through unchanged as z\u2264d/2 = x\u2264d/2,\nand the second half is passed through an elementwise transformation z>d/2 = f (x>d/2; x\u2264d/2) which is conditioned on the first half.\nSpecializing Algorithm 2 to this kind of flow allows both encoding and decoding to be parallelized\nover coordinates, resembling how the forward and inverse directions for inference and sampling can\nbe parallelized for these flows [10, 11]. See Appendix B for the complete algorithm listing.\nAlgorithm 2 is not the only known efficient coding algorithm for autoregressive flows. For example, if\nthe prior p(z) = \u220fi pi(zi) is independent over coordinates, then f can\nbe rewritten as a continuous autoregressive model p(xi|x*