Integer Discrete Flows and Lossless Compression

Advances in Neural Information Processing Systems, pages 12134–12144

Emiel Hoogeboom* (UvA-Bosch Delta Lab, University of Amsterdam, Netherlands), e.hoogeboom@uva.nl
Jorn W.T. Peters* (UvA-Bosch Delta Lab, University of Amsterdam, Netherlands), j.peters@uva.nl
Rianne van den Berg† (University of Amsterdam, Netherlands), riannevdberg@gmail.com
Max Welling (UvA-Bosch Delta Lab, University of Amsterdam, Netherlands), m.welling@uva.nl

Abstract

Lossless compression methods shorten the expected representation size of data without loss of information, using a statistical model. 
Flow-based models are attractive in this setting because they admit exact likelihood optimization, which is equivalent to minimizing the expected number of bits per message. However, conventional flows assume continuous data, which may lead to reconstruction errors when quantized for compression. For that reason, we introduce a flow-based generative model for ordinal discrete data called Integer Discrete Flow (IDF): a bijective integer map that can learn rich transformations on high-dimensional data. As building blocks for IDFs, we introduce a flexible transformation layer called integer discrete coupling. Our experiments show that IDFs are competitive with other flow-based generative models. Furthermore, we demonstrate that IDF-based compression achieves state-of-the-art lossless compression rates on CIFAR10, ImageNet32, and ImageNet64. To the best of our knowledge, this is the first lossless compression method that uses invertible neural networks.

1 Introduction

Every day, 2500 petabytes of data are generated. Clearly, there is a need for compression to enable efficient transmission and storage of this data. Compression algorithms aim to decrease the size of representations by exploiting patterns and structure in data. In particular, lossless compression methods preserve information perfectly, which is essential in domains such as medical imaging, astronomy, photography, text and archiving. Lossless compression and likelihood maximization are inherently connected through Shannon's source coding theorem [34], i.e., the expected message length of an optimal entropy encoder is equal to the negative log-likelihood of the statistical model. In other words, maximizing the log-likelihood (of data) is equivalent to minimizing the expected number of bits required per message.

In practice, data is usually high-dimensional, which introduces challenges when building statistical models for compression. 
In other words, designing the likelihood and optimizing it for high-dimensional data is often difficult. Deep generative models permit learning these complicated statistical models from data, and have demonstrated their effectiveness in image, video, and audio modeling [22, 24, 29].

*Equal contribution
†Now at Google

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Overview of IDF based lossless compression. An image x is transformed to a latent representation z with a tractable distribution pZ(·). An entropy encoder takes z and pZ(·) as input, and produces a bitstream c. To obtain x, the decoder uses pZ(·) and c to reconstruct z. Subsequently, z is mapped to x using the inverse of the IDF.

Flow-based generative models [7, 8, 27, 22, 14, 16] are advantageous over other generative models: i) they admit exact log-likelihood optimization, in contrast with Variational AutoEncoders (VAEs) [21], and ii) drawing samples (and decoding) is comparable to inference in terms of computational cost, as opposed to PixelCNNs [41]. However, flow-based models are generally defined for continuous probability distributions, disregarding the fact that digital media is stored discretely; for example, pixels from 8-bit images have 256 distinct values. In order to utilize continuous flow models for compression, the latent space must be quantized. This produces reconstruction errors in image space, and is therefore not suited for lossless compression.

To circumvent the (de)quantization issues, we propose Integer Discrete Flows (IDFs), which are invertible transformations for ordinal discrete data, such as images, video and audio. We demonstrate the effectiveness of IDFs by attaining state-of-the-art lossless compression performance on CIFAR10, ImageNet32 and ImageNet64. For a graphical illustration of the coding steps, see Figure 1. 
In addition, we show that IDFs achieve generative modelling results competitive with other flow-based methods.

The main contributions of this paper are summarized as follows: 1) We introduce a generative flow for ordinal discrete data (Integer Discrete Flow), circumventing the problem of (de)quantization; 2) As building blocks for IDFs, we introduce a flexible transformation layer called integer discrete coupling; 3) We propose a neural network based compression method that leverages IDFs; and 4) We empirically show that our image compression method allows for progressive decoding that maintains the global structure of the encoded image. Code to reproduce the experiments is available at https://github.com/jornpeters/integer_discrete_flows.

2 Background

The continuous change of variables formula lies at the foundation of flow-based generative models. It admits exact optimization of a (data) distribution using a simple distribution and a learnable bijective map. Let f : X → Z be a bijective map, and pZ(·) a prior distribution on Z. The model distribution pX(·) can then be expressed as:

pX(x) = pZ(z) · |dz/dx|,   for z = f(x).   (1)

That is, for a given observation x, the likelihood is given by pZ(·) evaluated at f(x), normalized by the Jacobian determinant. A composition of invertible functions, which can be viewed as a repeated application of the change of variables formula, is generally referred to as a normalizing flow in the deep learning literature [5, 37, 36, 30].

2.1 Flow Layers

The design of invertible transformations is integral to the construction of normalizing flows. In this section two important layers for flow-based generative modelling are discussed.

Coupling layers are tractable bijective mappings that are extremely flexible when combined into a flow [8, 7]. 
Specifically, they have an analytical inverse, which is similar to a forward pass in terms of computational cost, and the Jacobian determinant is easily computed, which makes coupling layers attractive for flow models. Given an input tensor x ∈ R^d, the input to a coupling layer is partitioned into two sets such that x = [xa, xb]. The transformation, denoted f(·), is then defined by:

z = [za, zb] = f(x) = [xa, xb ⊙ s(xa) + t(xa)],   (2)

where ⊙ denotes element-wise multiplication and s and t may be modelled using neural networks. Given this, the inverse is easily computed, i.e., xa = za and xb = (zb − t(xa)) ⊘ s(xa), where ⊘ denotes element-wise division. For f(·) to be invertible, s(xa) must not be zero, and it is often constrained to have strictly positive values.

Factor-out layers allow for more efficient inference and hierarchical modelling. A general flow, following the change of variables formula, is described as a single map X → Z. This implies that a d-dimensional vector is propagated throughout the whole flow model. Alternatively, a part of the dimensions can already be factored out at regular intervals [8], such that the remainder of the flow network operates on lower-dimensional data. We give an example for two levels (L = 2), although this principle can be applied to an arbitrary number of levels:

[z1, y1] = f1(x),   z2 = f2(y1),   z = [z1, z2],   (3)

where x ∈ R^d and y1, z1, z2 ∈ R^{d/2}. The likelihood of x is then given by:

p(x) = p(z2) · |∂f2(y1)/∂y1| · p(z1|y1) · |∂f1(x)/∂x|.   (4)

This approach has two clear advantages. First, it admits a factored model for z, p(z) = p(zL) p(zL−1|zL) ··· p(z1|z2, . . . , zL), which allows for conditional dependence between parts of z. This holds because the flow defines a bijective map between yl and [zl+1, . . . , zL]. Second, the lower-dimensional flows are computationally more efficient.

2.2 Entropy Encoding

Lossless compression algorithms map every input to a unique output and are designed to make probable inputs shorter and improbable inputs longer. Shannon's source coding theorem [34] states that the optimal code length for a symbol x is −log D(x), and the minimum expected code length is lower-bounded by the entropy:

E_{x∼D}[|c(x)|] ≈ E_{x∼D}[−log pX(x)] ≥ H(D),   (5)

where c(x) denotes the encoded message, which is chosen such that |c(x)| ≈ −log pX(x), |·| denotes length, H denotes entropy, D is the data distribution, and pX(·) is the statistical model that is used by the encoder. Therefore, maximizing the model log-likelihood is equivalent to minimizing the expected number of bits required per message, when the encoder is optimal.

Stream coders encode sequences of random variables with different probability distributions. They have near-optimal performance, and they can closely approach the entropy-based lower bound of Shannon [32, 26]. In our experiments, the recently introduced and increasingly popular stream coder rANS [10] is used. It has gained popularity due to its computational and coding efficiency. See Appendix A.1 for an introduction to rANS.

3 Integer Discrete Flows

We introduce Integer Discrete Flows (IDFs): a bijective integer map that can represent rich transformations. IDFs can be used to learn the probability mass function on (high-dimensional) ordinal discrete data. Consider an integer-valued observation x ∈ X = Z^d, a prior distribution pZ(·) with support on Z^d, and a bijective map f : Z^d → Z^d defined by an IDF. 
The model distribution pX(·) can then be expressed as:

pX(x) = pZ(z),   for z = f(x).   (6)

Note that in contrast to Equation 1, there is no need for re-normalization using the Jacobian determinant. Deep IDFs are obtained by stacking multiple IDF layers {fl}, l = 1, . . . , L, which are guaranteed to be bijective if the individual maps fl are all bijective. For an individual map to be bijective, it must be one-to-one and onto. Consider the bijective map f : Z → 2Z, x ↦ 2x. Although this map is a bijection, it requires us to keep track of the codomain of f, which is impracticable in the case of many dimensions and multiple layers. Instead, we design layers to be bijective maps from Z^d to Z^d, which ensures that the composition of layers and its inverse is closed on Z^d.

3.1 Integer Discrete Coupling

As a building block for IDFs, we introduce integer discrete coupling layers. These are invertible, and the set Z^d is closed under their transformations. Let [xa, xb] = x ∈ Z^d be an input of the layer. The output z = [za, zb] is defined as a copy za = xa, and a transformation zb = xb + ⌊t(xa)⌉, where ⌊·⌉ denotes a nearest rounding operation and t is a neural network (Figure 2).

Notice that the multiplication operation of standard coupling is not used in integer discrete coupling, because it does not meet our requirement that the image of the transformations is equal to Z. 
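To make the exact invertibility concrete, here is a minimal numpy sketch (ours, not the authors' code) of an additive integer coupling layer; the function t stands in for the neural network:

```python
import numpy as np

def integer_coupling(x_a, x_b, t):
    """Forward pass: z_a is a copy of x_a; x_b is shifted by round(t(x_a))."""
    z_a = x_a
    z_b = x_b + np.round(t(x_a)).astype(np.int64)
    return z_a, z_b

def integer_coupling_inverse(z_a, z_b, t):
    """Inverse pass: recompute the same rounded translation from z_a and subtract it."""
    x_a = z_a
    x_b = z_b - np.round(t(x_a)).astype(np.int64)
    return x_a, x_b

# Any real-valued function works for t, since its output is rounded.
t = lambda v: 0.3 * v + 1.7

x_a = np.array([4, -2, 7], dtype=np.int64)
x_b = np.array([1, 0, -5], dtype=np.int64)
z_a, z_b = integer_coupling(x_a, x_b, t)
r_a, r_b = integer_coupling_inverse(z_a, z_b, t)
assert np.array_equal(r_a, x_a) and np.array_equal(r_b, x_b)  # exact reconstruction
```

Because the inverse recomputes the identical rounded translation from za = xa, the map is exactly invertible on Z^d for any t, which is what makes quantization-free lossless coding possible.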
It may seem disadvantageous that our model only uses translation, also known as additive coupling; however, large-scale continuous flow models in the literature tend to use additive coupling instead of affine coupling [22].

In contrast to existing coupling layers, the input is split into 75%–25% parts for xa and xb, respectively. As a consequence, rounding is applied to fewer dimensions, which results in less gradient bias. In addition, the transformation is richer, because it is conditioned on more dimensions. Empirically, this results in better performance.

Figure 2: Forward computation of an integer discrete coupling layer. The input is split in two parts. The output consists of a copy of the first part, and a conditional transformation of the second part. The inverse of the coupling layer is computed by inverting the conditional transformation.

Backpropagation through Rounding Operation. As shown in Figure 2, a coupling layer in IDF requires a rounding operation (⌊·⌉) on the predicted translation. Since the rounding operation is effectively a step function, its gradient is zero almost everywhere. As a consequence, the rounding operation is inherently incompatible with gradient-based learning methods. In order to backpropagate through the rounding operations, we make use of the Straight-Through Estimator (STE) [2]. In short, the STE ignores the rounding operation during back-propagation, which is equivalent to redefining the gradient of the rounding operation as follows:

∇x⌊x⌉ ≜ I.   (7)

Lower Triangular Coupling. There exists a trade-off between the number of integer discrete coupling layers and the complexity of the layers in IDF architectures, due to the gradient bias that is introduced by the rounding operation (see Section 4.1). 
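The effect of Eq. (7) can be demonstrated with a toy optimization, a sketch of ours rather than the paper's training code: the forward pass uses the true rounding, while the backward pass lets the gradient through unchanged.

```python
import numpy as np

def round_ste_grad(upstream):
    # Straight-Through Estimator: the backward pass treats rounding as the
    # identity map, so the upstream gradient is returned unchanged (Eq. 7).
    return upstream

# Toy problem: learn a translation t so that x_b + round(t) hits an integer target.
x_b, target = 3.0, 10.0
t, lr = 0.0, 0.1
for _ in range(200):
    z_b = x_b + np.round(t)            # forward pass uses the real rounding
    grad_z = 2.0 * (z_b - target)      # gradient of the loss (z_b - target)^2
    t -= lr * round_ste_grad(grad_z)   # backward pass pretends round(.) is identity
assert x_b + np.round(t) == target     # converges despite the zero true gradient
```

Without the STE, the true gradient of this loss with respect to t is zero almost everywhere, and no update would ever be made.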
We introduce a multivariate coupling transformation called Lower Triangular Coupling, which is specifically designed such that the number of rounding operations remains unchanged. For more details, see Appendix B.

3.2 Tractable Discrete Distribution

As discussed in Section 2, a simple distribution pZ(·) is posed on Z in flow-based models. In IDFs, the prior pZ(·) is a factored discretized logistic distribution (DLogistic) [20, 33]. The discretized logistic captures the inductive bias that values close together are related, which is well-suited for ordinal data.

Figure 3: The discretized logistic distribution. The shaded area shows the probability density.

The probability mass DLogistic(z|µ, s) for an integer z ∈ Z, mean µ, and scale s is defined as the density assigned to the interval [z − 1/2, z + 1/2] by the probability density function of Logistic(µ, s) (see Figure 3). This can be efficiently computed by evaluating the cumulative distribution function twice:

DLogistic(z|µ, s) = ∫ from z−1/2 to z+1/2 of Logistic(z′|µ, s) dz′ = σ((z + 1/2 − µ)/s) − σ((z − 1/2 − µ)/s),   (8)

where σ(·) denotes the sigmoid, the cumulative distribution function of a standard Logistic. In the context of a factor-out layer, the mean µ and scale s are conditioned on the subset of the data that is not factored out. That is, the input to the lth factor-out layer is split into zl and yl. The conditional distribution on zl,i is then given as DLogistic(zl,i|µ(yl)i, s(yl)i), where µ(·) and s(·) are parametrized as neural networks.

Discrete Mixture distributions. The discretized logistic distribution is unimodal and therefore limited in complexity. 
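For illustration, Eq. (8) can be evaluated with two sigmoid calls; the following numpy sketch (ours, with arbitrary example parameters) also checks that the masses sum to one over the integers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dlogistic_pmf(z, mu, s):
    """Probability mass of integer z: logistic CDF difference over [z - 1/2, z + 1/2]."""
    return sigmoid((z + 0.5 - mu) / s) - sigmoid((z - 0.5 - mu) / s)

z = np.arange(-200, 201)               # wide integer range around the mean
pmf = dlogistic_pmf(z, mu=1.3, s=2.0)  # example values; normally predicted by a network
assert abs(pmf.sum() - 1.0) < 1e-6     # total mass is 1 up to negligible tail mass
assert z[np.argmax(pmf)] == 1          # the mode is the integer bin containing mu
```

In practice the CDF differences are computed in a numerically careful way for extreme bins, but the two-sigmoid form above is the essential computation.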
With a marginal increase in computational cost, we increase the flexibility of the latent prior on zL by extending it to a mixture of K logistic distributions [33]:

p(z|µ, s, π) = Σ from k=1 to K of πk · p(z|µk, sk).   (9)

Note that as K → ∞, the mixture distribution can model arbitrary univariate discrete distributions. In practice, we find that a limited number of mixtures (K = 5) is usually sufficient for image density modelling tasks.

3.3 Lossless Source Compression

Figure 4: Example of a 2-level flow architecture. The squeeze layer reduces the spatial dimensions by two, and increases the number of channels by four. A single integer flow layer consists of a channel permutation and an integer discrete coupling layer. Each level consists of D flow layers.

Lossless compression is an essential technique to limit the size of representations without destroying information. Methods for lossless compression require i) a statistical model of the source, and ii) a mapping from source symbols to bit streams.

IDFs are a natural statistical model for lossless compression of ordinal discrete data, such as images, video and audio. They are capable of modelling complicated high-dimensional distributions, and they provide error-free reconstructions when inverting latent representations. The mapping between symbols and bit streams may be provided by any entropy encoder. Specifically, stream coders can get arbitrarily close to the entropy regardless of the symbol distributions, because they encode entire sequences instead of a single symbol at a time.

In the case of compression using an IDF, the mapping f : x ↦ z is defined by the IDF. Subsequently, z is encoded under the distribution pZ(z) to a bitstream c using an entropy encoder. Note that, when using factor-out layers, pZ(z) is also defined using the IDF. 
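Returning to the prior of Section 3.2 for a moment: the mixture in Eq. (9) is just a convex combination of discretized-logistic masses. A short numpy sketch of ours (the parameter values are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dlogistic_pmf(z, mu, s):
    return sigmoid((z + 0.5 - mu) / s) - sigmoid((z - 0.5 - mu) / s)

def mixture_dlogistic_pmf(z, mus, scales, pis):
    """Eq. (9): sum over k of pi_k * DLogistic(z | mu_k, s_k)."""
    z = np.asarray(z, dtype=float)[:, None]                        # shape (n, 1)
    comps = dlogistic_pmf(z, np.asarray(mus), np.asarray(scales))  # shape (n, K)
    return comps @ np.asarray(pis)                                 # mix over K

z = np.arange(-300, 301)
pmf = mixture_dlogistic_pmf(z, mus=[-4.0, 0.0, 6.0],
                            scales=[1.0, 2.0, 0.5],
                            pis=[0.2, 0.5, 0.3])  # mixture weights sum to one
assert abs(pmf.sum() - 1.0) < 1e-6  # a valid pmf, since each component normalizes
```

With K = 5 components per dimension, as used in the paper, the added cost over a single discretized logistic is small.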
Finally, in order to decode a bitstream c, an entropy encoder uses pZ(z) to obtain z, and the original image is obtained by using the map f−1 : z ↦ x, i.e., the inverse IDF. See Figure 1 for a graphical depiction of this process.

In rare cases, the compressed file may be larger than the original. Therefore, following established practice in compression algorithms, we utilize an escape bit. That is, the encoder decides whether to encode the message or to store it in raw format, and encodes that decision in the first bit.

Figure 5: Performance of flow models for different depths (i.e. coupling layers per level). The networks in the coupling layers contain 3 convolution layers. Although performance increases with depth for continuous flows, this is not the case for discrete flows.

4 Architecture

The IDF architecture is split up into one or more levels. Each level consists of a squeeze operation [8], D integer flow layers, and a factor-out layer. Hence, each level defines a mapping from yl−1 to [zl, yl], except for the final level L, which defines a mapping yL−1 ↦ zL. Each of the D integer flow layers per level consists of a permutation layer followed by an integer discrete coupling layer. Following [8], the permutation layers are initialized once and kept fixed throughout training and evaluation. Figure 4 shows a graphical illustration of a two-level IDF. The specific architecture details for each experiment are presented in Appendix D.1. 
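As an aside, the escape-bit decision described in Section 3.3 can be sketched as follows (our illustration; zlib stands in for the IDF-plus-rANS coder, and a whole flag byte is used instead of a single bit for simplicity):

```python
import zlib

def encode_with_escape(raw: bytes, compress) -> bytes:
    """Prefix a flag: 0 = compressed payload follows, 1 = raw payload follows."""
    compressed = compress(raw)
    if len(compressed) < len(raw):
        return b"\x00" + compressed
    return b"\x01" + raw  # compression did not help, store the raw message

def decode_with_escape(stream: bytes, decompress) -> bytes:
    flag, payload = stream[:1], stream[1:]
    return decompress(payload) if flag == b"\x00" else payload

data = b"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"  # highly compressible example
encoded = encode_with_escape(data, zlib.compress)
assert decode_with_escape(encoded, zlib.decompress) == data
```

This guarantees the output is never more than one flag larger than the input, which is the point of the escape mechanism.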
In the remainder of this section, we discuss the trade-off between network depth and performance when rounding operations are used.

4.1 Flow Depth and Network Depth

The performance of IDFs depends on a trade-off between complexity and gradient bias, influenced by the number of rounding functions. Increasing the performance of standard normalizing flows is often achieved by increasing the depth, i.e., the number of flow-modules. However, for IDFs each flow-module results in additional rounding operations that introduce gradient bias. As a consequence, adding more flow layers hurts performance after some point, as is depicted in Figure 5. We found that the limitation of using fewer coupling layers in an IDF can be negated by increasing the complexity of the neural networks that are part of the coupling and factor-out layers. That is, we use DenseNets [17] in order to predict the translation t in the integer discrete coupling layers, and µ and s in the factor-out layers.

5 Related Work

There exist several deep generative modelling frameworks. This work builds mainly upon flow-based generative models, described in [31, 7, 8]. In these works, invertible functions for continuous random variables are developed. However, quantizing a latent representation, and subsequently inverting back to image space, may lead to reconstruction errors [6, 3, 4].

Other likelihood-based models such as PixelCNNs [41] utilize a decomposition of conditional probability distributions. However, this decomposition assumes an order on pixels which may not reflect the actual generative process. Furthermore, drawing samples (and decoding) is generally computationally expensive. VAEs [21] optimize a lower bound on the log-likelihood instead of the exact likelihood. They are used for lossless compression with deterministic encoders [25] and through bits-back coding. 
However, the performance of this approach is limited by the tightness of the lower bound. Moreover, in bits-back coding a single data example can be inefficient to compress, and the extra bits should be random, which is not the case in practice and may also lead to coding inefficiencies [38].

Non-likelihood-based generative models tend to utilize Generative Adversarial Networks [13], and can generate high-quality images. However, since GANs do not optimize for likelihood, which is directly connected to the expected number of bits in a message, they are not suited for lossless compression.

In the lossless compression literature, numerous reversible integer-to-integer transforms have been proposed [1, 6, 3, 4]. Specifically, lossless JPEG2000 uses a reversible integer wavelet transform [11]. However, because these transformations are largely hand-designed, they are difficult to tune for real-world data, which may require complicated nonlinear transformations.

Around the time of submission, unpublished concurrent work appeared [39] that explores discrete flows. The main differences between our method and this work are: i) we propose discrete flows for ordinal discrete data (e.g. audio, video, images), whereas they focus on categorical data; ii) we provide a connection with the source coding theorem, and present a compression algorithm; iii) we present results on more large-scale image datasets.

6 Experiments

To test the compression performance of IDFs, we compare with a number of established lossless compression methods: PNG [12]; JPEG2000 [11]; FLIF [35], a recent format that uses machine learning to build decision trees for efficient coding; and Bit-Swap [23], a VAE-based lossless compression method. We show that IDFs outperform all these formats on CIFAR10, ImageNet32 and ImageNet64. In addition, we demonstrate that IDFs can be very easily tuned for specific domains, by compressing the ER + BCa histology dataset. 
For the exact treatment of datasets and optimization procedures, see Section D.4.

Table 1: Compression performance of IDFs on CIFAR10, ImageNet32 and ImageNet64 in bits per dimension, and compression rate (shown in parentheses). The Bit-Swap results are retrieved from [23]. The column marked IDF† denotes an IDF trained on ImageNet32 and evaluated on the other datasets.

Dataset      PNG           JPEG2000      FLIF [35]     Bit-Swap      IDF           IDF†
CIFAR10      5.89 (1.36×)  5.20 (1.54×)  4.37 (1.83×)  3.82 (2.09×)  3.34 (2.40×)  3.60 (2.22×)
ImageNet32   6.42 (1.25×)  6.48 (1.23×)  5.09 (1.57×)  4.50 (1.78×)  4.18 (1.91×)  4.18 (1.91×)
ImageNet64   5.74 (1.39×)  5.10 (1.56×)  4.55 (1.76×)  –             3.90 (2.05×)  3.94 (2.03×)

Figure 6: Left: An example from the ER + BCa histology dataset. Right: 625 IDF samples of size 80×80px.

Figure 7: 49 samples from the ImageNet 64×64 IDF.

6.1 Image Compression

The compression performance of IDFs is compared with competing methods on standard datasets, in bits per dimension and compression rate. The IDFs and Bit-Swap are trained on the train data, and compression performance of all methods is reported on the test data in Table 1. IDFs achieve state-of-the-art lossless compression performance on all datasets.

Even though one can argue that a compressor should be tuned for the source domain, the performance of IDFs is also examined on out-of-dataset examples, in order to evaluate compression generalization. We utilize the IDF trained on ImageNet32, and compress the CIFAR10 and ImageNet64 data. For the latter, a single image is split into four 32 × 32 patches. 
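As a quick sanity check on Table 1 (our arithmetic, not part of the paper), the compression rate for 8-bit image data is simply 8 divided by the bits per dimension:

```python
# CIFAR10 column of Table 1: bits per dimension for three of the methods.
bpd = {"PNG": 5.89, "Bit-Swap": 3.82, "IDF": 3.34}

# 8-bit pixels occupy 8 bits uncompressed, so rate = 8 / bpd.
rates = {name: round(8.0 / b, 2) for name, b in bpd.items()}
# Reproduces the parenthesized factors in Table 1: 1.36x, 2.09x and 2.40x.
```

The same relation converts any bits-per-dimension figure in the tables below into a compression factor.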
Surprisingly, the IDF trained on ImageNet32 (IDF†) still outperforms the competing methods, showing only a slight decrease in compression performance on CIFAR10 and ImageNet64 compared to its source-trained counterpart.

As an alternative method for lossless compression, one could quantize the distribution pZ(·) and the latent space Z of a continuous flow. This results in reconstruction errors that need to be stored in addition to the latent representation z, such that the original data can be recovered perfectly. We show that this scheme is ineffective for lossless compression. Results are presented in Appendix C.

6.2 Tuneable Compression

Thus far, IDFs have been tested on standard machine learning datasets. In this section, IDFs are tested on a specific domain, medical images. In particular, the ER + BCa histology dataset [18] is used, which contains 141 regions of interest scanned at 40×, where each image is 2000 × 2000 pixels (see Figure 6, left). Since current hardware does not support training on such large images directly, the model is trained on random 80 × 80px patches; see Figure 6, right, for samples from the model. Likewise, the compression is performed in a patch-based manner, i.e., each patch is compressed independently of all other patches. IDFs are again compared with FLIF and JPEG2000, and also with a modified version of JPEG2000 that has been optimized specifically for virtual microscopy, named JP2-WSI [15]. Although the IDF is at a disadvantage because it has to compress in patches, it considerably outperforms the established formats, as presented in Table 2.

Table 2: Compression performance on the ER + BCa histology dataset in bits per dimension and compression rate. 
JP2-WSI is a specialized format optimized for virtual microscopy.

Dataset      JP2-WSI       JPEG2000      FLIF [35]     IDF
Histology    3.04 (2.63×)  4.26 (1.88×)  4.00 (2.00×)  2.42 (3.19×)

Figure 8: Progressive display of the data stream for images taken from the test set of ImageNet64. From top to bottom row, each image uses approximately 15%, 30%, 60% and 100% of the stream, where the remaining dimensions are sampled. Best viewed electronically.

6.3 Progressive Image Rendering

In general, transferring data may take time because of slow internet connections or disk I/O. For this reason, it is desirable to progressively visualize data, i.e., to render the image with more detail as more data arrives. Several graphics formats support progressive loading. However, the encoded file size may increase by enabling this option, depending on the format [12], whereas IDFs support progressive rendering naturally. To partially render an image using IDFs, first the received variables are decoded. Next, using the hierarchical structure of the prior and ancestral sampling, the remaining dimensions are obtained. The progressive display of IDFs for ImageNet64 is presented in Figure 8, where the rows use approximately 15%, 30%, 60%, and 100% of the bitstream. The global structure is already captured by smaller fragments of the bitstream, even for fragments that contain only 15% of the stream.

6.4 Probability Mass Estimation

In addition to a statistical model for compression, IDFs can also be used for image generation and probability mass estimation. Samples are drawn from an ImageNet 32×32 IDF and presented in Figure 7. IDFs are compared with recent flow-based generative models, RealNVP [8], Glow [22], and Flow++, in analytical bits per dimension (negative log2-likelihood). To compare architectural changes, we modify the IDFs to continuous models by dequantizing, disabling rounding, and using a continuous prior. 
The continuous versions of IDFs tend to perform slightly better, which may be caused by the gradient bias introduced by the rounding operation. IDFs show competitive performance on CIFAR10, ImageNet32, and ImageNet64, as presented in Table 3. Note that in contrast with IDFs, RealNVP uses scale transformations, Glow has 1 × 1 convolutions and actnorm layers for stability, and Flow++ uses the aforementioned, and an additional flow for dequantization. Interestingly, IDFs have comparable performance even though the architecture is relatively simple.

Table 3: Generative modeling performance of IDFs and comparable flow-based methods in bits per dimension (negative log2-likelihood).

Dataset      IDF    Continuous   RealNVP   Glow   Flow++
CIFAR10      3.32   3.31         3.49      3.35   3.08
ImageNet32   4.15   4.13         4.28      4.09   3.86
ImageNet64   3.90   3.85         3.98      3.81   3.69

7 Conclusion

We have introduced Integer Discrete Flows, flows for ordinal discrete data that can be used for deep generative modelling and neural lossless compression. We show that IDFs are competitive with current flow-based models, and that we achieve state-of-the-art lossless compression performance on CIFAR10, ImageNet32 and ImageNet64. To the best of our knowledge, this is the first lossless compression method that uses invertible neural networks.

References

[1] Nasir Ahmed, T Natarajan, and Kamisetty R Rao. Discrete cosine transform. IEEE Transactions on Computers, 100(1):90–93, 1974.

[2] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

[3] A Robert Calderbank, Ingrid Daubechies, Wim Sweldens, and Boon-Lock Yeo. Lossless image compression using integer to integer wavelet transforms. In Proceedings of International Conference on Image Processing, volume 1, pages 596–599. 
IEEE, 1997.

[4] A. R. Calderbank, Ingrid Daubechies, Wim Sweldens, and Boon-Lock Yeo. Wavelet transforms that map integers to integers. Applied and Computational Harmonic Analysis, 5(3):332–369, 1998.

[5] Gustavo Deco and Wilfried Brauer. Higher order statistical decorrelation without information loss. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 247–254. MIT Press, 1995.

[6] Steven Dewitte and Jan Cornelis. Lossless integer wavelet transform. IEEE Signal Processing Letters, 4(6):158–160, 1997.

[7] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. 3rd International Conference on Learning Representations, ICLR, Workshop Track Proceedings, 2015.

[8] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. 5th International Conference on Learning Representations, ICLR, 2017.

[9] Jarek Duda. Asymmetric numeral systems. arXiv preprint arXiv:0902.0271, 2009.

[10] Jarek Duda. Asymmetric numeral systems: Entropy coding combining speed of Huffman coding with compression rate of arithmetic coding. arXiv preprint arXiv:1311.2540, 2013.

[11] International Organization for Standardization. JPEG 2000 image coding system. ISO Standard No. 15444-1:2016, 2003.

[12] International Organization for Standardization. Portable Network Graphics (PNG): Functional specification. ISO Standard No. 15948:2003, 2003.

[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[14] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models.
7th International Conference on Learning Representations, ICLR, 2019.

[15] Henrik Helin, Teemu Tolonen, Onni Ylinen, Petteri Tolonen, Juha Näpänkangas, and Jorma Isola. Optimized JPEG 2000 compression for efficient storage of histopathological whole-slide images. Journal of Pathology Informatics, 9, 2018.

[16] Emiel Hoogeboom, Rianne van den Berg, and Max Welling. Emerging convolutions for generative normalizing flows. Proceedings of the 36th International Conference on Machine Learning, 2019.

[17] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

[18] Andrew Janowczyk, Scott Doyle, Hannah Gilmore, and Anant Madabhushi. A resolution adaptive deep hierarchical (RADHicaL) learning scheme applied to nuclear segmentation of digital pathology images. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 6(3):270–276, 2018.

[19] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 3rd International Conference on Learning Representations, ICLR, 2015.

[20] Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

[21] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, 2014.

[22] Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.

[23] Friso H. Kingma, Pieter Abbeel, and Jonathan Ho.
Bit-swap: Recursive bits-back coding for lossless compression with hierarchical latent variables. 36th International Conference on Machine Learning, 2019.

[24] Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. VideoFlow: A flow-based generative model for video. arXiv preprint arXiv:1903.01434, 2019.

[25] Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. Practical full resolution learned lossless image compression. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pages 10629–10638, 2019.

[26] Alistair Moffat, Radford M. Neal, and Ian H. Witten. Arithmetic coding revisited. ACM Transactions on Information Systems (TOIS), 16(3):256–294, 1998.

[27] George Papamakarios, Iain Murray, and Theo Pavlakou. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.

[28] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

[29] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3617–3621. IEEE, 2019.

[30] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1530–1538. PMLR, 2015.

[31] Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep density models. arXiv preprint arXiv:1302.5125, 2013.

[32] Jorma Rissanen and Glen G. Langdon.
Arithmetic coding. IBM Journal of Research and Development, 23(2):149–162, 1979.

[33] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. 5th International Conference on Learning Representations, ICLR, 2017.

[34] Claude Elwood Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.

[35] Jon Sneyers and Pieter Wuille. FLIF: Free lossless image format based on MANIAC compression. In 2016 IEEE International Conference on Image Processing (ICIP), pages 66–70. IEEE, 2016.

[36] E. G. Tabak and Cristina V. Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.

[37] Esteban G. Tabak and Eric Vanden-Eijnden. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217–233, 2010.

[38] James Townsend, Tom Bird, and David Barber. Practical lossless compression with latent variables using bits back coding. 7th International Conference on Learning Representations, ICLR, 2019.

[39] Dustin Tran, Keyon Vafa, Kumar Agrawal, Laurent Dinh, and Ben Poole. Discrete flows: Invertible generative models of discrete data. ICLR 2019 Workshop DeepGenStruct, 2019.

[40] Rianne van den Berg, Leonard Hasenclever, Jakub M. Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI, 2018.

[41] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu.
Pixel recurrent neural networks. In International Conference on Machine Learning, pages 1747–1756, 2016.