{"title": "Towards Conceptual Compression", "book": "Advances in Neural Information Processing Systems", "page_first": 3549, "page_last": 3557, "abstract": "We introduce convolutional DRAW, a homogeneous deep generative model achieving state-of-the-art performance in latent variable image modeling. The algorithm naturally stratifies information into higher and lower level details, creating abstract features and as such addressing one of the fundamentally desired properties of representation learning. Furthermore, the hierarchical ordering of its latents creates the opportunity to selectively store global information about an image, yielding a high quality 'conceptual compression' framework.", "full_text": "Towards Conceptual Compression\n\nKarol Gregor\n\nGoogle DeepMind\n\nkarolg@google.com\n\nFrederic Besse\nGoogle DeepMind\n\nfbesse@google.com\n\nDanilo Jimenez Rezende\n\nGoogle DeepMind\n\ndanilor@google.com\n\nIvo Danihelka\n\nGoogle DeepMind\n\ndanihelka@google.com\n\nDaan Wierstra\nGoogle DeepMind\n\nwierstra@google.com\n\nAbstract\n\nWe introduce convolutional DRAW, a homogeneous deep generative model achiev-\ning state-of-the-art performance in latent variable image modeling. The algorithm\nnaturally strati\ufb01es information into higher and lower level details, creating abstract\nfeatures and as such addressing one of the fundamentally desired properties of\nrepresentation learning. Furthermore, the hierarchical ordering of its latents creates\nthe opportunity to selectively store global information about an image, yielding a\nhigh quality \u2018conceptual compression\u2019 framework.\n\n1\n\nIntroduction\n\nDeep generative models with latent variables can capture image information in a probabilistic manner\nto answer questions about structure and uncertainty. 
Such models can also be used for representation\nlearning, and the associated procedures for inferring latent variables are vital to important application\nareas such as (semi-supervised) classi\ufb01cation and compression.\nIn this paper we introduce convolutional DRAW, a new model in this class that is able to transform\nan image into a progression of increasingly detailed representations, ranging from global conceptual\naspects to low level details (see Figure 1). It signi\ufb01cantly improves upon earlier variational latent\nvariable models (Kingma & Welling, 2014; Rezende et al., 2014; Gregor et al., 2014). Furthermore, it\nis simple and fully convolutional, and does not require complex design choices, just like the recently\nintroduced DRAW architecture (Gregor et al., 2015). It provides an important insight into building\ngood variational auto-encoder models of images: positioning multiple layers of stochastic variables\n\u2018close\u2019 to the pixels (in terms of nonlinear steps in the computational graph) can signi\ufb01cantly improve\ngenerative performance. Lastly, the system\u2019s ability to stratify information has the side bene\ufb01t of\nallowing it to perform high quality lossy compression, by selectively storing a higher level subset of\ninferred latent variables, while (re)generating the remainder during decompression (see Figure 3).\nIn the following we will \ufb01rst discuss variational auto-encoders and compression. 
The subsequent\nsections then describe the algorithm and present results both on generation quality and compression.\n\n1.1 Variational Auto-Encoders\n\nNumerous deep generative models have been developed recently, ranging from restricted and deep\nBoltzmann machines (Hinton & Salakhutdinov, 2006; Salakhutdinov & Hinton, 2009), generative\nadversarial networks (Goodfellow et al., 2014), autoregressive models (Larochelle & Murray, 2011;\nGregor & LeCun, 2011; van den Oord et al., 2016) to variational auto-encoders (Kingma & Welling,\n2014; Rezende et al., 2014; Gregor et al., 2014). In this paper we focus on the class of models in the\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 1: Conceptual Compression. The top rows show full reconstructions from the model for\nOmniglot and ImageNet, respectively. The subsequent rows were obtained by storing the \ufb01rst t\niteratively obtained groups of latent variables and then generating the remaining latents and visibles\nusing the model (only a subset of all possible t values are shown, in increasing order). Left: Omniglot\nreconstructions. Each group of four columns shows different samples at a given compression level.\nWe see that the variations in the latter samples concentrate on small details, such as the precise\nplacement of strokes. Reducing the number of stored bits tends to preserve the overall shape, but\nincreases the symbol variation. Eventually a varied set of symbols is generated. Nevertheless even\nin the \ufb01rst row there is a clear difference between variations produced from a given symbol and\nthose between different symbols. Right: ImageNet reconstructions. Here the latent variables were\ngenerated with zero variance (ie. the mean of the latent prior is used). Again the global structure is\ncaptured \ufb01rst and the details are \ufb01lled in later on.\n\nvariational auto-encoding framework. 
Since we are also interested in compression, we present them\nfrom an information-theoretic perspective.\nVariational auto-encoders consist of two neural networks: one that generates samples from latent\nvariables (\u2018imagination\u2019), and one that infers latent variables from observations (\u2018recognition\u2019). The\ntwo networks share the latent variables. Intuitively speaking one might think of these variables as\nspecifying, for a given image, at different levels of abstraction, whether a particular object such as\na cat or a dog is present in the input, or perhaps what the exact position and intensity of an edge\nat a given location might be. During the recognition phase the network acquires information about\nthe input and stores it in the latent variables, reducing their uncertainty. For example, at \ufb01rst not\nknowing whether a cat or a dog is present in the image, the network observes the input and becomes\nnearly certain that it is a cat. The reduction in uncertainty is quantitatively equal to the amount of\ninformation that the network acquired about the input. During generation the network starts with\nuncertain latent variables and samples their values from a prior distribution. Different choices will\nproduce different visibles.\nVariational auto-encoders provide a natural framework for unsupervised learning \u2013 we can build\nhierarchical networks with multiple layers of stochastic variables and expect that, after learning, the\nrepresentations become more and more abstract for higher levels of the hierarchy. 
The pertinent\nquestions then are: can such a framework indeed discover such representations both in principle and\nin practice, and what techniques are required for its satisfactory performance.\n\n1.2 Conceptual Compression\n\nVariational auto-encoders can not only be used for representation learning but also for compression.\nThe training objective of variational auto-encoders is to compress the total amount of information\nneeded to encode the input. They achieve this by using information-carrying latent variables that\nexpress what, before compression, was encoded using a larger amount of information in the input.\nThe information in the layers and the remaining information in the input can be encoded in practice\nas explained later in this paper.\nThe achievable amount of lossless compression is bounded by the underlying entropy of the image\ndistribution. Most image information as measured in bits is contained in the \ufb01ne details of the image.\n\n2\n\n\fFigure 2: Two-layer convolutional DRAW. A schematic depiction of one time slice is shown\non the left. X and R denote input and reconstruction, respectively. On the right, the amount of\ninformation at different layers and time steps is shown. A two-layer convolutional DRAW was trained\non ImageNet, with a convolutional \ufb01rst layer and a fully connected second layer. The amount of\ninformation at a given layer and iteration is measured by the KL-divergence between the prior and\nthe posterior (5). When presented with an image, \ufb01rst the top layer acquires information and then the\nsecond slowly increases, suggesting that the network \ufb01rst acquires \u2018conceptual\u2019 information about\nthe image and only then encodes the remaining details. 
Note that this is an illustration of a two-layer\nsystem, whereas most experiments in this paper, unless otherwise stated, were performed with a\none-layer version.\n\nThus we might reasonably expect that future improvements in lossless compression technology will\nbe bounded in scope.\nLossy compression, on the other hand, holds much more potential for improvement. In this case the\nobjective is to best compress an image in terms of quality of similarity to the original image, whilst\nallowing for some information loss. As an example, at a low level of compression (close to lossless\ncompression), we could start by reducing pixel precision, e.g. from 8 bits to 7 bits. Then, as in JPEG,\nwe could express a local 8x8 neighborhood in a discrete cosine transform basis and store only the\nmost signi\ufb01cant components. This way, instead of introducing quantization artefacts in the image\nthat would appear if we kept decreasing pixel precision, we preserve higher level structures but to a\nlower level of precision. Nevertheless, if we want to improve upon this and push the limits of what is\npossible in compression, we need to be able to identify what the most salient \u2018aspects\u2019 of an image\nare.\nIf we wanted to compress images of cats and dogs down to one bit, what would that bit ideally\nrepresent? It is natural to argue that it should represent whether the image contains either a cat or\na dog. How would we then produce an image from this single bit? If we have a good generative\nmodel, we can simply generate the entire image from this single latent variable by ancestral sampling,\nyielding an image of a cat if the bit corresponds to \u2018cat\u2019, and an image of a dog otherwise. Now let us\nimagine that instead of compressing down to one bit we wanted to compress down to ten bits. We can\nthen store some other important properties of the animal as well \u2013 e.g. 
its type, color, and basic pose.\nConditioned on this information, everything else can be probabilistically 'filled in' by the generative model during decompression. Increasing the number of stored bits further, we can preserve more and more about the image, still filling in the fine pixel-level details such as precise hair structure, or the exact pattern of the floor, etc. Most bits indeed concern such low level details. We refer to this type of compression – compressing by preferentially storing the higher levels of representation while generating/filling-in the remainder – as 'conceptual compression'.\nImportantly, if we solve deep representation learning with latent variable generative models that generate high quality samples, we simultaneously achieve the objective of lossy compression mentioned above. We can see this as follows. Assume that the network has learned a hierarchy of progressively more abstract representations. Then, to get different levels of compression, we can store only the corresponding number of topmost layers and generate the rest. By solving unsupervised deep learning, the network would order information according to its importance and store it with that priority.\n\n2 Convolutional DRAW\n\nBelow we present the equations for a one layer system (for a two layer system the reader is referred to the supplementary material):\nFor t = 1, . . . 
, T:\n\nε_t = x − µ(r_{t−1})  (1)\nh^e_t = RNN(x, ε_t, h^e_{t−1}, h^d_{t−1})  (2)\nz_t ∼ q_t = q(z_t | h^e_t)  (3)\np_t = p(z_t | h^d_{t−1})  (4)\nL^z_t = KL(q_t || p_t)  (5)\nh^d_t = RNN(z_t, h^d_{t−1}, r_{t−1})  (6)\nr_t = r_{t−1} + W h^d_t  (7)\n\nAt the end, at time T:\n\nµ, α = split(r_T)  (8)\np_x = N(µ, exp(α))  (9)\nq_x = U(x − s/2, x + s/2)  (10)\nL^x = log(q_x/p_x)  (11)\nL = βL^x + Σ_{t=1}^T L^z_t  (12)\n\nLong Short-Term Memory networks (LSTM; Hochreiter & Schmidhuber, 1997) are used as the recurrent modules (RNN) and convolutions are used for all linear operations. We follow the computations and explain the variables as we go along. The input image is x. The canvas variable r_{t−1}, initialized to a bias, carries information about the current reconstruction of the image: a mean µ(r_{t−1}) and a log standard deviation α(r_{t−1}). We compute the reconstruction error ε_t. This, together with x, is fed to the encoder RNN (E in the diagram), which updates its internal state and produces an output vector h^e_t. This goes into the approximate posterior distribution q_t, from which z_t is sampled. The prior distribution p_t and the latent loss L^z_t are calculated. z_t is passed to the decoder, and L^z_t measures the amount of information about x that is transmitted to the decoder via z_t at this time step. The decoder (D in the diagram) updates its state and outputs the vector h^d_t, which is then used to update the canvas r_t. At the end of the recurrence, the canvas consists of the values µ and α = log σ of the Gaussian distribution p(x|z_1, . . . , z_T) (or analogous parameters for other distributions). This probability is computed for the input x as p_x. 
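To make the bookkeeping of equations (1)-(12) concrete, here is a minimal NumPy sketch of the recurrence. It is an illustration only: the convolutional LSTMs are replaced by single fully connected tanh updates, and all dimensions, parameter names and initializations are assumptions rather than the architecture used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, Z, T = 8, 16, 4, 8          # input dim, hidden dim, latent dim, DRAW steps
s = 1.0 / 255.0                   # input discretization width
beta = 1.0                        # input cost scale from eq. (12)

def rnn(params, *inputs):
    # Stand-in for the LSTM update: one affine map + tanh over concatenated inputs.
    W, b = params
    return np.tanh(W @ np.concatenate(inputs) + b)

def gaussian_kl(mu_q, logsig_q, mu_p, logsig_p):
    # KL(q || p) for diagonal Gaussians, in nats -- the latent cost of eq. (5).
    vq, vp = np.exp(2 * logsig_q), np.exp(2 * logsig_p)
    return float(np.sum(logsig_p - logsig_q + (vq + (mu_q - mu_p) ** 2) / (2 * vp) - 0.5))

params = {
    'enc': (0.1 * rng.standard_normal((H, 2 * D + 2 * H)), np.zeros(H)),
    'dec': (0.1 * rng.standard_normal((H, Z + H + 2 * D)), np.zeros(H)),
    'Wq': 0.1 * rng.standard_normal((2 * Z, H)),   # posterior stats from h^e_t
    'Wp': 0.1 * rng.standard_normal((2 * Z, H)),   # prior stats from h^d_{t-1}
    'Wr': 0.1 * rng.standard_normal((2 * D, H)),   # canvas update, eq. (7)
}

x = rng.uniform(size=D)
r, he, hd = np.zeros(2 * D), np.zeros(H), np.zeros(H)   # canvas holds (mu, alpha)
latent_cost = 0.0
for t in range(T):
    mu_r, _ = np.split(r, 2)
    eps = x - mu_r                                      # (1) reconstruction error
    he = rnn(params['enc'], x, eps, he, hd)             # (2) encoder update
    mu_q, logsig_q = np.split(params['Wq'] @ he, 2)     # (3) posterior q_t
    z = mu_q + np.exp(logsig_q) * rng.standard_normal(Z)
    mu_p, logsig_p = np.split(params['Wp'] @ hd, 2)     # (4) prior p_t
    latent_cost += gaussian_kl(mu_q, logsig_q, mu_p, logsig_p)  # (5) latent cost
    hd = rnn(params['dec'], z, hd, r)                   # (6) decoder update
    r = r + params['Wr'] @ hd                           # (7) canvas update

mu, alpha = np.split(r, 2)                              # (8) final canvas
log_px = np.sum(-alpha - 0.5 * np.log(2 * np.pi)
                - (x - mu) ** 2 / (2 * np.exp(2 * alpha)))   # (9) Gaussian log p_x
input_cost = -D * np.log(s) - log_px                    # (10)-(11): log(q_x / p_x)
total_loss = beta * input_cost + latent_cost            # (12) training objective
```

The accumulated `latent_cost` is the sum of per-step KL terms, i.e. the total number of nats carried by the latents, and `input_cost` is the residual cost of the pixels given all latents.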
Because we use a real-valued distribution, but the original data has 256 values per color channel for a typical image, we encode this discretization as a uniform distribution U(x − s/2, x + s/2) of width equal to the discretization step s (typically 1/255) around x. The input cost L^x = log(q_x/p_x) is then always non-negative and measures the number of nats needed to describe x knowing (z_1, . . . , z_T). The final cost is the sum of the two costs, L = L^x + Σ_{t=1}^T L^z_t, and equals the amount of information that the model uses to compress x losslessly. This is the loss we use to report the likelihood bounds and is the standard loss for variational auto-encoders. However, we also include a constant β and train models with β ≠ 1 to observe the visual effect on generated data and to perform lossy compression as explained in section 3. Values β < 1 put less pressure on the network to reconstruct exact pixel details and increase its capacity to learn a better latent representation.\nThe general multi-layer architecture is summarized in Figure 2 (left). The algorithm is loosely inspired by the architecture of the visual cortex (Carlson et al., 2013). We describe known cortical properties below, with the correspondences in our diagram in brackets. The visual cortex consists of hierarchically organized areas such as V1, V2, V4 and IT (in our case: layers 1, 2, . . .). Each area such as V1 is a composite structure consisting of six sublayers, each most likely performing different functions (in our case: E for encoding, Z for sampling and information measuring, D and R for decoding). Eyes saccade around three times per second with blank periods in between, so the cortex has about 250 ms to consider each input. When an input is received, there is a feed-forward computation that progresses to high levels of the hierarchy such as IT in about 100 ms (in our case: the input is passed through the E layers). 
The architecture is recurrent (as is ours), with a large amount of feedback from higher to lower layers (in our case: each D feeds into the E, Z, D, R layers of the next step), and can still perform significant computations before the next input is processed (in our case: the iterations of DRAW).\n\n3 Compression Methodology\n\nIn this section we show how instances of the variational auto-encoder paradigm (including convolutional DRAW) can be turned into compression algorithms. Note however that storing subsets of latents as described above results in good compression only if the network separates high level from low level information. It is not obvious whether this should occur to a satisfactory extent, or at all. In the following sections we will show that convolutional DRAW does in fact have this desirable property. It stratifies information into a progression of increasingly abstract features, allowing the resulting compression algorithm to select a degree of compression. What is appealing here is that this occurs naturally in such a simple homogeneous architecture.\n\nFigure 3: Lossy Compression. Example images for various methods and levels of compression. Top row: original images. Each subsequent block has four rows corresponding to four methods of compression: (a) JPEG, (b) JPEG2000, (c) convolutional DRAW with full prior variance for generation and (d) convolutional DRAW with zero prior variance. Each block corresponds to a different compression level; in order, the average numbers of bits per input dimension are: 0.05, 0.1, 0.15, 0.2, 0.4, 0.8 (bits per image: 153, 307, 460, 614, 1228, 2457). In the first block, JPEG was left gray because it does not compress to this level. Images are of size 32 × 32. See appendix for 64 × 64 images.\n\nThe underlying compression mechanism is arithmetic coding (Witten et al., 1987). 
Arithmetic coding takes as input a sequence of discrete variables x_1, . . . , x_t and a set of probabilities p(x_t|x_1, . . . , x_{t−1}) that predict the variable at time t from the previous ones. It then compresses this sequence to\n\nL = −Σ_t log2 p(x_t|x_1, . . . , x_{t−1})\n\nbits, plus a constant of order one.\n\nWe can use variational auto-encoders for compression as follows. First, train the model with an approximate posterior q whose variance is independent of the input. After training, discretize the latent variables z to the size of the variance of q. When compressing an input, assign z to the discretized point nearest to the mean of q instead of sampling from q. Calculate the discrete probabilities p over the values of z. Retrain the decoder and p to perform well with the discretized values. Now we can use arithmetic coding directly, since we have the probabilities over the discrete values of z. This procedure might require tuning to achieve the best performance. However, such a process is likely to work, since there is another, less practical way to compress that is guaranteed to achieve the theoretical value.\nThis second approach uses bits-back coding (Hinton & Van Camp, 1993). We explain only the basic idea here. First, discretize the latents down to a very high level of precision and use p to transmit the information. Because the discretization precision is high, the probabilities for discrete values are easily assigned. That will preserve the information, but it will cost many bits, namely −log2 p_d(z), where p_d is the prior under that discretization. Now, instead of choosing a random sample z from the approximate posterior q_d under the discretization when encoding, use another stream of bits that needs to be transmitted to choose z, in effect encoding these bits into the choice of z. The encoded amount is −log2 q_d(z) bits. 
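The accounting of this bits-back scheme can be checked numerically. The sketch below is a hypothetical one-dimensional example, not the discretized scheme itself: drawing z from q, paying −log2 p(z) to transmit it, and recovering log2 q(z) auxiliary bits costs, on average, exactly the KL-divergence of (5), here evaluated in bits for a pair of Gaussians.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical one-dimensional prior p and approximate posterior q.
mu_p, sig_p = 0.0, 1.0
mu_q, sig_q = 1.5, 0.5

def log2_normal(z, mu, sig):
    # log2 of the Gaussian density at z.
    return (-0.5 * np.log(2 * np.pi * sig ** 2) - (z - mu) ** 2 / (2 * sig ** 2)) / np.log(2)

# Bits-back: transmitting z under the prior costs -log2 p(z) bits; choosing z
# via q encodes log2 q(z) bits of auxiliary data that the receiver recovers.
z = rng.normal(mu_q, sig_q, size=200_000)
net_bits = float(np.mean(-log2_normal(z, mu_p, sig_p) + log2_normal(z, mu_q, sig_q)))

# Closed-form KL(q || p) for Gaussians, converted from nats to bits.
kl_bits = (np.log(sig_p / sig_q)
           + (sig_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sig_p ** 2) - 0.5) / np.log(2)
```

With these numbers the Monte Carlo estimate of the net cost lands within a few hundredths of a bit of the analytic KL of roughly 2.08 bits.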
When z is recovered at the receiving end, both the information about the current input and the other information is recovered, and thus the information needed to encode the current input is −log2 p_d(z) + log2 q_d(z) = −log2(p_d(z)/q_d(z)). The expectation of this quantity is the KL-divergence in (5), which therefore measures the amount of information stored in a given latent layer. The disadvantage of this approach is that we need this extra data to encode a given input. However, this coding scheme works even if the variance of the approximate posterior is dependent on the input.\n\nFigure 4: Generated samples on Omniglot.\n\nFigure 5: Generated samples on ImageNet for different input cost scales. On the left, 32 × 32 samples are shown with input cost β in (12) equal to {0.2, 0.4, 0.6, 0.8, 1} for each respective block of two rows. On the right, 64 × 64 samples are shown with input cost scale β equal to {0.4, 0.5, 0.6, 0.8, 1} for each row respectively. For smaller values of β the network is less compelled to explain finer details of images, and produces 'cleaner' larger structures.\n\n4 Results\n\nAll models (unless otherwise specified) were single-layer, with the number of DRAW time steps n_t = 32, a kernel size of 5 × 5, and stride-2 convolutions between input layers and hidden layers with 12 latent feature maps. We trained the models on Cifar-10, Omniglot and ImageNet with 320, 160 and 160 LSTM feature maps, respectively. We use the version of ImageNet presented in (van den Oord et al., 2016). We train the network with Adam optimization (Kingma & Ba, 2014) with learning rate 5 × 10^−4. We found that the cost occasionally increased dramatically during training. This is probably due to the Gaussian nature of the distribution, when a given variable is produced too far from the mean relative to sigma. We observed this happening approximately once per run. 
To be able to keep training, we store older parameters, detect such jumps, and revert to the old parameters when they occur. In these instances training always continued unperturbed.\n\n4.1 Modeling Quality\n\nOmniglot The recently introduced Omniglot dataset (Lake et al., 2015) is comprised of 1628 character classes drawn from multiple alphabets, with just 20 samples per class. Referred to by some as the 'transpose of MNIST', it was designed to study conceptual representations and generative models in a low-data regime. Table 1 shows likelihoods of different models compared to ours. For our model, we only calculate the upper bound (variational bound) and therefore underestimate its quality. Samples generated by the model are shown in Figure 4.\nCifar-10 Table 1 also shows reported likelihoods of different models on Cifar-10. Convolutional DRAW outperforms most previous models. The recently introduced Pixel RNN model (van den Oord et al., 2016) yields better likelihoods, but as it is not a latent variable model, it does not build representations, cannot be used for lossy compression, and is slow to sample from due to its autoregressive nature. At the same time, we must emphasize that the two approaches might be complementary, and could be combined by feeding the output of convolutional DRAW into the recurrent network of Pixel RNN.\nWe also show the likelihood for a (non-recurrent) variational auto-encoder that we obtained internally. We tested architectures with multiple layers, both deterministic and stochastic, but with standard functional forms, and report the best result that we were able to obtain. Convolutional DRAW performs significantly better.\nImageNet Additionally, we trained on the version of ImageNet prepared in (van den Oord et al., 2016), which was created with the aim of making a standardized dataset to test generative models. The results are in Table 1. 
Note that since this is a new dataset, few other methods have yet been applied to it.\nIn Figure 5 we show generations from the model. We trained networks with varying input cost scales as explained in the next section. The generations are sharp and contain many details, unlike those of previous variational auto-encoder variants, which tend to generate blurry images.\n\nTable 1: Test set performance of different models. Results on 28 × 28 Omniglot are shown in nats; results on CIFAR-10 and ImageNet are shown in bits/dim. Training losses are shown in brackets.\n\nOmniglot                       NLL\nVAE (2 layers, 5 samples)      106.31\nIWAE (2 layers, 50 samples)    103.38\nRBM (500 hidden)               100.46\nDRAW                           < 96.5\nConv DRAW                      < 92.0\n\nCIFAR-10                       NLL\nUniform Distribution           8.00\nMultivariate Gaussian          4.70\nNICE [1]                       4.48\nDeep Diffusion [2]             4.20\nDeep GMMs [3]                  4.00\nPixel RNN [4]                  3.00 (2.93)\nDeep VAE                       < 4.54\nDRAW                           < 4.13\nConv DRAW                      < 3.58 (3.57)\n\nImageNet                       NLL\nPixel RNN (32 × 32)            3.86 (3.83)\nPixel RNN (64 × 64)            3.63 (3.57)\nConv DRAW (32 × 32)            4.40 (4.35)\nConv DRAW (64 × 64)            4.10 (4.04)\n\n4.2 Reconstruction vs Latent Cost Scaling\n\nEach pixel (and color channel) of the data consists of 256 values, and as such, likelihood and lossless compression are well defined. When compressing the image there is much to be gained in capturing precise correlations between nearby pixels. There are many more bits in these low level details than in the higher level structure that we are actually interested in when learning higher level representations. The network might focus on these details, ignoring higher level structure.\nOne way to make it focus less on the details is to scale down the cost of the input relative to the latents, that is, setting β < 1 in (12). Generations for different cost scalings are shown in Figure 5, with the original objective being scale β = 1. 
Visually we can verify that lower scales indeed have a 'cleaner' high level structure. Scale 1 contains a lot of information at the precise pixel values and the network tries to capture that, while not being good enough to properly align details and produce real-looking patterns. Improving this might simply be a matter of network capacity and scaling: increasing layer size and depth, using more iterations, or using better functional forms.\n\n4.3 Information Distribution\n\nWe look at how much information is contained at different levels and time steps. This information is simply the KL-divergence in (5) during inference. For a two layer system with one convolutional and one fully connected layer, this is shown in Figure 2 (right).\nWe see that the higher level contains information mainly at the beginning of computation, whereas the lower layer starts with low information which then gradually increases. This is desirable from a conceptual point of view. It suggests that the network first captures the overall structure of the image, and only then proceeds to 'explain' the details contained within that structure. Understanding the overall structure rapidly is also convenient if the algorithm needs to respond to observations in a timely manner. For the single layer system used in all other experiments, the information distribution is similar to the blue curve of Figure 2 (right). Thus, while the variables in the last set of iterations contain the most bits, they don't seem to visually affect the quality of reconstructed images to a large extent, as shown in Figure 1. 
This demonstrates the separation of information into global aspects that\nhumans consider important from low level details.\n\n4.4 Lossy Compression Results\n\nWe can compress an image lossily by storing only the subset of the latent variables associated with the\nearlier iterations of convolutional DRAW, namely those that encode the more high-level information\nabout the image. The units not stored should be generated from the prior distribution (4). This\namounts to decompression.\nWe can also generate a more likely image by lowering the variance of the prior Gaussian. We show\ngenerations with full variance in row 3 of each block of Figure 3 and with zero variance in row 4.\nWe see that using the original variance, the network generates sharp details. Because the generative\nmodel is not perfect, the resulting images are less realistic looking as we lower the number of stored\ntime steps. For zero variance we see that the network starts with rough details making a smooth\nimage and then re\ufb01nes it with more time steps. All these generations are produced with a single-layer\nconvolutional DRAW, and thus, despite being single-layer, it achieves some level of \u2018conceptual\ncompression\u2019 by \ufb01rst capturing the global structure of the image and then focusing on details.\nThere is another dimension we can vary for lossy compression \u2013 the input scale introduced in\nsubsection 4.2. Even if we store all the latent variables (but not the input bits), the reconstructed\nimages will get less detailed as we scale down the input cost.\nTo build a high performing compressor, at each compression rate, we need to \ufb01nd which of the\nnetworks, input scales and number of time steps would produce visually good images. We have\ndone the following. For several compression levels, we have looked at images produced by different\nmethods and selected qualitatively which network gave the best looking images. 
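The underlying store-the-top, generate-the-rest mechanism can be reduced to a toy example. The sketch below is a deliberately simple linear-Gaussian stand-in (an assumption for illustration, not our network): one global latent carries the overall brightness, per-pixel latents carry the details, and lossy decompression keeps only the global latent while taking the details from a zero-variance prior, as in row (d) of Figure 3.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy two-level model (hypothetical): a global latent z1 sets overall
# brightness, per-pixel latents z2 carry the fine details; x = z1 + z2.
D = 64
x = 0.7 + 0.1 * rng.standard_normal(D)   # 'image': global level plus details

# 'Inference' is exact in this linear-Gaussian toy.
z1 = float(x.mean())                     # high-level latent (always stored)
z2 = x - z1                              # low-level latents (dropped when lossy)

recon_lossless = z1 + z2                 # store everything: exact reconstruction
recon_lossy = z1 + np.zeros(D)           # store z1 only; z2 from zero-variance prior

mse_lossy = float(np.mean((recon_lossy - x) ** 2))
mse_blank = float(np.mean(x ** 2))       # baseline: storing nothing at all
```

The lossy reconstruction preserves the global brightness exactly while its error collapses to the residual variance; in convolutional DRAW the early groups of latents play the role of z1, as the KL profile in Figure 2 (right) indicates.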
We have not done this per image, just per compression level. We then display compressed images that we have not seen with this selection.\nWe compare our results to JPEG and JPEG2000 compression, which we obtained using ImageMagick. We found, however, that these compressors were unable to produce reasonable results for small images (3 × 32 × 32) at high compression rates. Instead, we concatenated 100 images into one 3 × 320 × 320 image, compressed that, and extracted back the compressed small images. The number of bits per image reported is then the number of bits of this image divided by 100. This is actually unfair to our algorithm, since any correlations between nearby images can be exploited. Nevertheless we show the comparison in Figure 3. Our algorithm shows better quality than JPEG and JPEG2000 at all levels where a corruption is easily detectable. Note that even though our algorithm was trained on one specific image size, it can be used on arbitrarily sized images, as it contains only convolutional operators.\n\n5 Conclusion\n\nIn this paper we introduced convolutional DRAW, a state-of-the-art latent variable generative model which demonstrates the potential of sequential computation and recurrent neural networks in scaling up the performance of deep generative models. During inference, the algorithm arrives at a natural stratification of information, ranging from global aspects to low-level details. An interesting feature of the method is that, when we restrict ourselves to storing just the high level latent variables, we arrive at a 'conceptual compression' algorithm that rivals the quality of JPEG2000.\n\nReferences\n\nCarlson, Thomas, Tovar, David A, Alink, Arjen, and Kriegeskorte, Nikolaus. Representational dynamics of object vision: the first 1000 ms. 
Journal of Vision, 13(10):1–1, 2013.\n\nGoodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.\n\nGregor, Karol and LeCun, Yann. Learning representations by maximizing compression. arXiv preprint arXiv:1108.1169, 2011.\n\nGregor, Karol, Danihelka, Ivo, Mnih, Andriy, Blundell, Charles, and Wierstra, Daan. Deep autoregressive networks. In Proceedings of the 31st International Conference on Machine Learning, 2014.\n\nGregor, Karol, Danihelka, Ivo, Graves, Alex, Rezende, Danilo Jimenez, and Wierstra, Daan. DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, 2015.\n\nHinton, Geoffrey E and Salakhutdinov, Ruslan R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.\n\nHinton, Geoffrey E and Van Camp, Drew. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pp. 5–13. ACM, 1993.\n\nHochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.\n\nKingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\nKingma, Diederik P and Welling, Max. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.\n\nLake, Brenden M, Salakhutdinov, Ruslan, and Tenenbaum, Joshua B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.\n\nLarochelle, Hugo and Murray, Iain. The neural autoregressive distribution estimator. 
Journal of\n\nMachine Learning Research, 15:29\u201337, 2011.\n\nRezende, Danilo J, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approxi-\nmate inference in deep generative models. In Proceedings of the 31st International Conference on\nMachine Learning, pp. 1278\u20131286, 2014.\n\nSalakhutdinov, Ruslan and Hinton, Geoffrey E. Deep boltzmann machines. In International Confer-\n\nence on Arti\ufb01cial Intelligence and Statistics, pp. 448\u2013455, 2009.\n\nvan den Oord, Aaron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks.\n\narXiv preprint arXiv:1601.06759, 2016.\n\nWitten, Ian H, Neal, Radford M, and Cleary, John G. Arithmetic coding for data compression.\n\nCommunications of the ACM, 30(6):520\u2013540, 1987.\n\n9\n\n\f", "award": [], "sourceid": 1773, "authors": [{"given_name": "Karol", "family_name": "Gregor", "institution": "Google DeepMind"}, {"given_name": "Frederic", "family_name": "Besse", "institution": "Google DeepMind"}, {"given_name": "Danilo", "family_name": "Jimenez Rezende", "institution": "Google DeepMind"}, {"given_name": "Ivo", "family_name": "Danihelka", "institution": "Google DeepMind"}, {"given_name": "Daan", "family_name": "Wierstra", "institution": "Google DeepMind"}]}