{"title": "Joint Autoregressive and Hierarchical Priors for Learned Image Compression", "book": "Advances in Neural Information Processing Systems", "page_first": 10771, "page_last": 10780, "abstract": "Recent models for learned image compression are based on autoencoders that learn approximately invertible mappings from pixels to a quantized latent representation. The transforms are combined with an entropy model, which is a prior on the latent representation that can be used with standard arithmetic coding algorithms to generate a compressed bitstream. Recently, hierarchical entropy models were introduced as a way to exploit more structure in the latents than previous fully factorized priors, improving compression performance while maintaining end-to-end optimization. Inspired by the success of autoregressive priors in probabilistic generative models, we examine autoregressive, hierarchical, and combined priors as alternatives, weighing their costs and benefits in the context of image compression. While it is well known that autoregressive models can incur a significant computational penalty, we find that in terms of compression performance, autoregressive and hierarchical priors are complementary and can be combined to exploit the probabilistic structure in the latents better than all previous learned models. The combined model yields state-of-the-art rate-distortion performance and generates smaller files than existing methods: 15.8% rate reductions over the baseline hierarchical model and 59.8%, 35%, and 8.4% savings over JPEG, JPEG2000, and BPG, respectively. 
To the best of our knowledge, our model is the first learning-based method to outperform the top standard image codec (BPG) on both the PSNR and MS-SSIM distortion metrics.", "full_text": "Joint Autoregressive and Hierarchical Priors for Learned Image Compression\n\nDavid Minnen, Johannes Ballé, George Toderici\n\nGoogle Research\n\n{dminnen, jballe, gtoderici}@google.com\n\nAbstract\n\nRecent models for learned image compression are based on autoencoders that learn approximately invertible mappings from pixels to a quantized latent representation. The transforms are combined with an entropy model, which is a prior on the latent representation that can be used with standard arithmetic coding algorithms to generate a compressed bitstream. Recently, hierarchical entropy models were introduced as a way to exploit more structure in the latents than previous fully factorized priors, improving compression performance while maintaining end-to-end optimization. Inspired by the success of autoregressive priors in probabilistic generative models, we examine autoregressive, hierarchical, and combined priors as alternatives, weighing their costs and benefits in the context of image compression. While it is well known that autoregressive models can incur a significant computational penalty, we find that in terms of compression performance, autoregressive and hierarchical priors are complementary and can be combined to exploit the probabilistic structure in the latents better than all previous learned models. The combined model yields state-of-the-art rate–distortion performance and generates smaller files than existing methods: 15.8% rate reductions over the baseline hierarchical model and 59.8%, 35%, and 8.4% savings over JPEG, JPEG2000, and BPG, respectively. 
To the best of our knowledge, our model is the first learning-based method to outperform the top standard image codec (BPG) on both the PSNR and MS-SSIM distortion metrics.\n\n1 Introduction\n\nMost recent methods for learning-based, lossy image compression adopt an approach based on transform coding [1]. In this approach, image compression is achieved by first mapping pixel data into a quantized latent representation and then losslessly compressing the latents. Within the deep learning research community, the transforms typically take the form of convolutional neural networks (CNNs), which learn nonlinear functions with the potential to map pixels into a more compressible latent space than the linear transforms used by traditional image codecs. This nonlinear transform coding method resembles an autoencoder [2], [3], which consists of an encoder transform from the data (in this case, pixels) to a reduced-dimensionality latent space, and a decoder, an approximate inverse function that maps latents back to pixels. While dimensionality reduction can be seen as a simple form of compression, it is not equivalent, since the goal of compression is to reduce the entropy of the representation under a prior probability model shared between the sender and the receiver (the entropy model), not just to reduce the dimensionality. To improve compression performance, recent methods have focused on better encoder/decoder transforms and on more sophisticated entropy models [4]–[14]. 
Finally, the entropy model is used in conjunction with standard entropy coding algorithms such as arithmetic, range, or Huffman coding [15]–[17] to generate a compressed bitstream.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nThe training goal is to minimize the expected length of the bitstream as well as the expected distortion of the reconstructed image with respect to the original, giving rise to a rate–distortion optimization problem:\n\nR + λ · D = E_{x∼p_x}[−log2 p_ŷ(⌊f(x)⌉)] + λ · E_{x∼p_x}[d(x, g(⌊f(x)⌉))],   (1)\n\nwhere the first expectation is the rate term and the second is the distortion term, λ is the Lagrange multiplier that determines the desired rate–distortion trade-off, p_x is the unknown distribution of natural images, ⌊·⌉ represents rounding to the nearest integer (quantization), y = f(x) is the encoder, ŷ = ⌊y⌉ are the quantized latents, p_ŷ is a discrete entropy model, and x̂ = g(ŷ) is the decoder with x̂ representing the reconstructed image. The rate term corresponds to the cross entropy between the marginal distribution of the latents and the learned entropy model, which is minimized when the two distributions are identical. The distortion term may correspond to a closed-form likelihood, such as when d(x, x̂) represents mean squared error (MSE), which allows the model to be interpreted as a variational autoencoder [5], [6]. When optimizing the model for other distortion metrics (e.g., SSIM or MS-SSIM), it is simply minimized as an energy function. The models we analyze in this paper build on the work of Ballé et al. [13], which uses a noise-based relaxation to apply gradient descent methods to the loss function in Eq.
(1) and which introduces a hierarchical prior to improve the entropy model. While most previous research uses a fixed, though potentially complex, entropy model, Ballé et al. use a Gaussian scale mixture (GSM) [18] where the scale parameters are conditioned on a hyperprior. Their model allows for end-to-end training, which includes joint optimization of a quantized representation of the hyperprior, the conditional entropy model, and the base autoencoder. The key insight of their work is that the compressed hyperprior can be added to the generated bitstream as side information, which allows the decoder to use the conditional entropy model. In this way, the entropy model itself is image-dependent and spatially adaptive, which allows for a richer and more accurate model. Ballé et al. show that standard optimization methods for deep neural networks are sufficient to learn a useful balance between the size of the side information and the savings gained from a more accurate entropy model. The resulting compression model provides state-of-the-art image compression results compared to earlier learning-based methods.\n\nWe extend this GSM-based entropy model in two ways: first, by generalizing the hierarchical GSM model to a Gaussian mixture model (GMM), and second, inspired by recent work on generative models, by adding an autoregressive component. We assess the compression performance of both approaches, including variations in the network architectures, and discuss benefits and potential drawbacks of both extensions. For the results in this paper, we did not investigate the effect of reducing the capacity (i.e., the number of layers and number of channels) of the deep networks to optimize computational complexity, since we are interested in determining the potential of different forms of priors rather than trading off complexity against performance.
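The training objective in Eq. (1) with the additive-uniform-noise relaxation of rounding can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `encoder`, `decoder`, and `bits_fn` (which returns −log2 p per latent element under the entropy model) are placeholder arguments.

```python
import numpy as np

def rd_loss(x, encoder, decoder, bits_fn, lam, rng):
    """Monte-Carlo estimate of Eq. (1): rate + lam * distortion.

    During training, rounding is relaxed by adding uniform noise in
    [-0.5, 0.5) to the latents so that gradients can pass through the
    quantizer. All function arguments here are illustrative placeholders.
    """
    y = encoder(x)                                  # analysis transform f(x)
    y_tilde = y + rng.uniform(-0.5, 0.5, y.shape)   # noise relaxation of rounding
    rate = bits_fn(y_tilde).sum()                   # code length in bits under the prior
    x_hat = decoder(y_tilde)                        # synthesis transform g(.)
    distortion = np.mean((x - x_hat) ** 2)          # MSE distortion d(x, x_hat)
    return rate + lam * distortion
```

At evaluation time, the noise is replaced by actual rounding, and an arithmetic coder driven by the same entropy model produces the bitstream.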
Note that increasing capacity alone is not sufficient to obtain arbitrarily good compression performance [13, appendix 6.3].\n\n2 Architecture Details\n\nFigure 1 provides a high-level overview of our generalized compression model, which contains two main sub-networks1. The first is the core autoencoder, which learns a quantized latent representation of images (Encoder and Decoder blocks). The second sub-network is responsible for learning a probabilistic model over the quantized latents used for entropy coding. It combines the Context Model, an autoregressive model over latents, with the hyper-network (Hyper Encoder and Hyper Decoder blocks), which learns to represent information useful for correcting the context-based predictions. The data from these two sources is combined by the Entropy Parameters network, which generates the mean and scale parameters for a conditional Gaussian entropy model.\n\nOnce training is complete, a valid compression model must prevent any information from passing from the encoder to the decoder unless that information is available in the compressed file. In Figure 1, the arithmetic encoding (AE) blocks produce the compressed representation of the symbols coming from the quantizer, which is stored in a file. Therefore, at decoding time, any information that depends on the quantized latents may be used by the decoder once it has been decoded. In order for\n\n1See Section 4 in the supplemental materials for an in-depth visual comparison between our architecture variants and previous learning-based methods.\n\nComponent | Symbol\nInput Image | x\nEncoder | f(x; θe)\nLatents | y\nLatents (quantized) | ŷ\nDecoder | g(ŷ; θd)\nHyper Encoder | fh(y; θhe)\nHyper-latents | z\nHyper-latents (quant.) | ẑ\nHyper Decoder | \nContext Model | gcm(y
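The conditional Gaussian entropy model whose mean and scale the Entropy Parameters network produces can be sketched as follows: the probability of a quantized latent is the Gaussian mass on its quantization bin. This is an illustrative sketch; the function name and the clamping constant are our choices, not taken from the paper.

```python
import math

def gaussian_bits(y_hat, mu, sigma):
    """Bits needed to code one quantized latent y_hat under a conditional
    Gaussian entropy model with mean mu and scale sigma.

    The probability of y_hat is the Gaussian mass on its quantization bin
    [y_hat - 0.5, y_hat + 0.5).
    """
    def cdf(v):  # standard normal CDF via the error function
        return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

    p = cdf((y_hat + 0.5 - mu) / sigma) - cdf((y_hat - 0.5 - mu) / sigma)
    return -math.log2(max(p, 1e-12))  # clamp to avoid log(0) in the tails
```

A confident, accurate prediction (mean close to the latent, small scale) concentrates probability mass on the correct bin and drives the bit cost toward zero, which is what the Context Model and the hyper-network jointly aim to provide.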