{"title": "Conditional Image Generation with PixelCNN Decoders", "book": "Advances in Neural Information Processing Systems", "page_first": 4790, "page_last": 4798, "abstract": "This work explores conditional image generation with a new image density model based on the PixelCNN architecture. The model can be conditioned on any vector, including descriptive labels or tags, or latent embeddings created by other networks. When conditioned on class labels from the ImageNet database, the model is able to generate diverse, realistic scenes representing distinct animals, objects, landscapes and structures. When conditioned on an embedding produced by a convolutional network given a single image of an unseen face, it generates a variety of new portraits of the same person with different facial expressions, poses and lighting conditions. We also show that conditional PixelCNN can serve as a powerful decoder in an image autoencoder. Additionally, the gated convolutional layers in the proposed model improve the log-likelihood of PixelCNN to match the state-of-the-art performance of PixelRNN on ImageNet, with greatly reduced computational cost.", "full_text": "Conditional Image Generation with\n\nPixelCNN Decoders\n\nA\u00e4ron van den Oord\nGoogle DeepMind\n\navdnoord@google.com\n\nNal Kalchbrenner\nGoogle DeepMind\nnalk@google.com\n\nOriol Vinyals\n\nGoogle DeepMind\n\nvinyals@google.com\n\nLasse Espeholt\nGoogle DeepMind\n\nespeholt@google.com\n\nAlex Graves\n\nGoogle DeepMind\n\ngravesa@google.com\n\nKoray Kavukcuoglu\nGoogle DeepMind\n\nkorayk@google.com\n\nAbstract\n\nThis work explores conditional image generation with a new image density model\nbased on the PixelCNN architecture. 
The model can be conditioned on any vector,\nincluding descriptive labels or tags, or latent embeddings created by other networks.\nWhen conditioned on class labels from the ImageNet database, the model is able to\ngenerate diverse, realistic scenes representing distinct animals, objects, landscapes\nand structures. When conditioned on an embedding produced by a convolutional\nnetwork given a single image of an unseen face, it generates a variety of new\nportraits of the same person with different facial expressions, poses and lighting\nconditions. We also show that conditional PixelCNN can serve as a powerful\ndecoder in an image autoencoder. Additionally, the gated convolutional layers in\nthe proposed model improve the log-likelihood of PixelCNN to match the state-of-\nthe-art performance of PixelRNN on ImageNet, with greatly reduced computational\ncost.\n\n1\n\nIntroduction\n\nRecent advances in image modelling with neural networks [30, 26, 20, 10, 9, 28, 6] have made\nit feasible to generate diverse natural images that capture the high-level structure of the training\ndata. While such unconditional models are fascinating in their own right, many of the practical\napplications of image modelling require the model to be conditioned on prior information: for\nexample, an image model used for reinforcement learning planning in a visual environment would\nneed to predict future scenes given speci\ufb01c states and actions [17]. Similarly image processing\ntasks such as denoising, deblurring, inpainting, super-resolution and colorization rely on generating\nimproved images conditioned on noisy or incomplete data. Neural artwork [18, 5] and content\ngeneration represent potential future uses for conditional generation.\nThis paper explores the potential for conditional image modelling by adapting and improving a\nconvolutional variant of the PixelRNN architecture [30]. 
As well as providing excellent samples,\nthis network has the advantage of returning explicit probability densities (unlike alternatives such as\ngenerative adversarial networks [6, 3, 19]), making it straightforward to apply in domains such as\ncompression [32] and probabilistic planning and exploration [2]. The basic idea of the architecture\nis to use autoregressive connections to model images pixel by pixel, decomposing the joint image\ndistribution as a product of conditionals. Two variants were proposed in the original paper: PixelRNN,\nwhere the pixel distributions are modelled with a two-dimensional LSTM [7, 26], and PixelCNN, where\nthey are modelled with convolutional networks. PixelRNNs generally give better performance, but\nPixelCNNs are much faster to train because convolutions are inherently easier to parallelize; given\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 1: Left: A visualization of the PixelCNN that maps a neighborhood of pixels to a prediction for\nthe next pixel. To generate pixel xi the model can only condition on the previously generated pixels\nx1, ..., xi\u22121. Middle: an example matrix that is used to mask the 5x5 \ufb01lters to make sure the model\ncannot read pixels below (or strictly to the right of) the current pixel to make its predictions. Right:\nTop: PixelCNNs have a blind spot in the receptive \ufb01eld that cannot be used to make predictions.\nBottom: Two convolutional stacks (blue and purple) allow the model to capture the whole receptive \ufb01eld.\n\nthe vast number of pixels present in large image datasets this is an important advantage. 
We aim to\ncombine the strengths of both models by introducing a gated variant of PixelCNN (Gated PixelCNN)\nthat matches the log-likelihood of PixelRNN on both CIFAR and ImageNet, while requiring less than\nhalf the training time.\nWe also introduce a conditional variant of the Gated PixelCNN (Conditional PixelCNN) that allows\nus to model the complex conditional distributions of natural images given a latent vector embedding.\nWe show that a single Conditional PixelCNN model can be used to generate images from diverse\nclasses such as dogs, lawn mowers and coral reefs, by simply conditioning on a one-hot encoding\nof the class. Similarly one can use embeddings that capture high level information of an image to\ngenerate a large variety of images with similar features. This gives us insight into the invariances\nencoded in the embeddings \u2014 e.g., we can generate different poses of the same person based on a\nsingle image. The same framework can also be used to analyse and interpret different layers and\nactivations in deep neural networks.\n\n2 Gated PixelCNN\n\nPixelCNNs (and PixelRNNs) [30] model the joint distribution of pixels over an image x as the\nfollowing product of conditional distributions, where xi is a single pixel:\n\np(x) = \u220f_{i=1}^{n\u00b2} p(xi|x1, ..., xi\u22121).    (1)\n\nThe ordering of the pixel dependencies is in raster scan order: row by row and pixel by pixel within\nevery row. Every pixel therefore depends on all the pixels above and to the left of it, and not on any\nof the other pixels. The dependency \ufb01eld of a pixel is visualized in Figure 1 (left).\nA similar setup has been used by other autoregressive models such as NADE [14] and RIDE [26].\nThe difference lies in the way the conditional distributions p(xi|x1, ..., xi\u22121) are constructed. In\nPixelCNN every conditional distribution is modelled by a convolutional neural network. 
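The factorization in Equation 1 is just the chain rule applied in raster-scan order. A toy numerical check of that identity (a 2x2 binary image with a randomly chosen joint distribution, purely illustrative and not the paper's model):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution over a 2x2 binary image, flattened in raster-scan
# order (row by row): 2**4 = 16 outcomes in total.
joint = rng.random(16)
joint /= joint.sum()

def p_joint(bits):
    """Probability of a full assignment (x1, x2, x3, x4)."""
    return joint[int("".join(str(b) for b in bits), 2)]

def marginal(prefix):
    """p(x1..xk = prefix), by summing the joint over the remaining pixels."""
    rest = itertools.product((0, 1), repeat=4 - len(prefix))
    return sum(p_joint(tuple(prefix) + r) for r in rest)

def p_cond(i, bits):
    """p(x_{i+1} = bits[i] | x1..xi = bits[:i]), a factor of Equation 1."""
    return marginal(bits[: i + 1]) / marginal(bits[:i])

# Chain rule: the raster-order conditionals multiply back to the joint.
for bits in itertools.product((0, 1), repeat=4):
    product = np.prod([p_cond(i, bits) for i in range(4)])
    assert np.isclose(product, p_joint(bits))
```

PixelCNN replaces the table lookups above with a single network that outputs every conditional; the factorization itself is unchanged.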
To make\nsure the CNN can only use information about pixels above and to the left of the current pixel, the\n\ufb01lters of the convolution are masked as shown in Figure 1 (middle). For each pixel the three colour\nchannels (R, G, B) are modelled successively, with B conditioned on (R, G), and G conditioned on R.\nThis is achieved by splitting the feature maps at every layer of the network into three and adjusting the\ncentre values of the mask tensors. The 256 possible values for each colour channel are then modelled\nusing a softmax.\nPixelCNN typically consists of a stack of masked convolutional layers that takes an N \u00d7 N \u00d7 3 image\nas input and produces N \u00d7 N \u00d7 3 \u00d7 256 predictions as output. The use of convolutions allows the\npredictions for all the pixels to be made in parallel during training (all conditional distributions from\nEquation 1). During sampling the predictions are sequential: every time a pixel is predicted, it is\nfed back into the network to predict the next pixel. This sequentiality is essential to generating high\nquality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the\nprevious pixels.\n\n2.1 Gated Convolutional Layers\n\nPixelRNNs, which use spatial LSTM layers instead of convolutional stacks, have previously been\nshown to outperform PixelCNNs as generative models [30]. One possible reason for the advantage\nis that the recurrent connections in LSTM allow every layer in the network to access the entire\nneighbourhood of previous pixels, while the region of the neighbourhood available to PixelCNN\ngrows linearly with the depth of the convolutional stack. However, this shortcoming can largely be\nalleviated by using suf\ufb01ciently many layers. Another potential advantage is that PixelRNNs contain\nmultiplicative units (in the form of the LSTM gates), which may help them to model more complex\ninteractions. 
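The causal masking described at the start of this section (Figure 1, middle) can be sketched directly. A minimal version, assuming a single channel (the per-channel R/G/B masking is omitted) and using the usual "type A" first-layer vs. "type B" later-layer convention:

```python
import numpy as np

def causal_mask(n, mask_type="A"):
    """Spatial mask for an n x n PixelCNN filter (cf. Figure 1, middle).

    Ones mark taps the filter may read: all rows strictly above the centre,
    plus pixels to the left in the centre row. Type 'A' (first layer) also
    zeroes the centre tap so a pixel cannot see itself; type 'B' (later
    layers) keeps it.
    """
    mask = np.zeros((n, n), dtype=int)
    centre = n // 2
    mask[:centre, :] = 1        # rows strictly above the current pixel
    mask[centre, :centre] = 1   # pixels to the left in the current row
    if mask_type == "B":
        mask[centre, centre] = 1
    return mask

print(causal_mask(5))
# [[1 1 1 1 1]
#  [1 1 1 1 1]
#  [1 1 0 0 0]
#  [0 0 0 0 0]
#  [0 0 0 0 0]]
```

Multiplying a filter elementwise by this mask before every forward pass enforces the raster-scan dependency structure of Equation 1.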
To amend this we replaced the recti\ufb01ed linear units between the masked convolutions in\nthe original PixelCNN with the following gated activation unit:\n\ny = tanh(Wk,f \u2217 x) \u2299 \u03c3(Wk,g \u2217 x),    (2)\n\nwhere \u03c3 is the sigmoid non-linearity, k is the number of the layer, \u2299 is the element-wise product\nand \u2217 is the convolution operator. We call the resulting model the Gated PixelCNN. Feed-forward\nneural networks with gates have been explored in previous works, such as highway networks [25],\ngrid LSTM [13] and neural GPUs [12], and have generally proved bene\ufb01cial to performance.\n\n2.2 Blind spot in the receptive \ufb01eld\n\nIn Figure 1 (top right), we show the progressive growth of the effective receptive \ufb01eld of a 3 \u00d7 3\nmasked \ufb01lter over the input image. Note that a signi\ufb01cant portion of the input image is ignored by the\nmasked convolutional architecture. This \u2018blind spot\u2019 can cover as much as a quarter of the potential\nreceptive \ufb01eld (e.g., when using 3 \u00d7 3 \ufb01lters), meaning that none of the content to the right of the\ncurrent pixel would be taken into account.\nIn this work, we remove the blind spot by combining two convolutional network stacks: one that\nconditions on the current row so far (horizontal stack) and one that conditions on all rows above\n(vertical stack). The arrangement is illustrated in Figure 1 (bottom right). The vertical stack, which\ndoes not have any masking, allows the receptive \ufb01eld to grow in a rectangular fashion without any\nblind spot, and we combine the outputs of the two stacks after each layer. Every layer in the horizontal\nstack takes as input the output of the previous layer as well as that of the vertical stack. 
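The gated activation unit of Equation 2 amounts to a split-and-gate on the feature maps. A minimal NumPy sketch, assuming the two convolutions have already been fused into one masked convolution producing 2p maps (as the paper does for parallelization):

```python
import numpy as np

def gated_activation(features):
    """Equation 2: y = tanh(W_f * x) elementwise-times sigma(W_g * x).

    `features` stands for the output of the single fused masked convolution
    producing 2p maps; it is split into a tanh half and a sigmoid-gate half.
    """
    f, g = np.split(features, 2, axis=-1)
    sigma = 1.0 / (1.0 + np.exp(-g))
    return np.tanh(f) * sigma

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 8, 32))   # H x W x 2p, with p = 16
y = gated_activation(x)
assert y.shape == (8, 8, 16)
assert np.all(np.abs(y) < 1.0)    # tanh and sigma both bound the output
```

The sigmoid half acts as a multiplicative gate on the tanh half, giving the convolutional stack the kind of multiplicative interactions that LSTM gates provide in PixelRNN.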
If we had\nconnected the output of the horizontal stack into the vertical stack, it would be able to use information\nabout pixels that are below or to the right of the current pixel, which would break the conditional\ndistribution.\nFigure 2 shows a single layer block of a Gated PixelCNN. We combine Wf and Wg in a single\n(masked) convolution to increase parallelization. As proposed in [30] we also use a residual connection [11] in the horizontal stack. We have experimented with adding a residual connection in the\nvertical stack, but omitted it from the \ufb01nal model as it did not improve the results in our initial experiments. Note that the (n \u00d7 1) and (n \u00d7 n) masked convolutions in Figure 2 can also be implemented\nby (\u2308n/2\u2309 \u00d7 1) and (\u2308n/2\u2309 \u00d7 n) convolutions followed by a shift in pixels by padding and cropping.\n\n2.3 Conditional PixelCNN\n\nGiven a high-level image description represented as a latent vector h, we seek to model the conditional\ndistribution p(x|h) of images suiting this description. Formally the conditional PixelCNN models\nthe following distribution:\n\np(x|h) = \u220f_{i=1}^{n\u00b2} p(xi|x1, ..., xi\u22121, h).    (3)\n\nWe model the conditional distribution by adding terms that depend on h to the activations before the\nnonlinearities in Equation 2, which now becomes:\n\ny = tanh(Wk,f \u2217 x + V\u1d40k,f h) \u2299 \u03c3(Wk,g \u2217 x + V\u1d40k,g h),    (4)\n\nFigure 2: A single layer in the Gated PixelCNN architecture. Convolution operations are shown in\ngreen, element-wise multiplications and additions are shown in red. The convolutions with Wf and\nWg from Equation 2 are combined into a single operation shown in blue, which splits the 2p feature\nmaps into two groups of p.\n\nwhere k is the layer number. If h is a one-hot encoding that speci\ufb01es a class this is equivalent to\nadding a class-dependent bias at every layer. 
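The class-dependent bias of Equation 4 can be sketched in a few lines. All sizes below are hypothetical placeholders; the arrays stand in for masked-convolution outputs rather than being computed by a real network:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n_classes, H, W = 16, 10, 8, 8      # hypothetical sizes

# Stand-ins for the masked convolution outputs W_{k,f} * x and W_{k,g} * x.
conv_f = rng.normal(size=(H, W, p))
conv_g = rng.normal(size=(H, W, p))

# Projection matrices V_{k,f}, V_{k,g} for the conditioning vector h.
V_f = rng.normal(size=(n_classes, p))
V_g = rng.normal(size=(n_classes, p))

h = np.zeros(n_classes)
h[3] = 1.0                              # one-hot class label

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

# Equation 4: the projected h acts as a class-dependent bias, broadcast
# over every spatial position (the conditioning is location independent).
y = np.tanh(conv_f + h @ V_f) * sigma(conv_g + h @ V_g)
assert y.shape == (H, W, p)

# With a one-hot h, the added bias is just row 3 of each projection matrix.
assert np.allclose(h @ V_f, V_f[3])
```

The location-dependent variant of Equation 5 differs only in replacing the broadcast vector bias with an unmasked 1 x 1 convolution over a spatial map s = m(h).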
Notice that the conditioning does not depend on the\nlocation of the pixel in the image; this is appropriate as long as h only contains information about\nwhat should be in the image and not where. For example, we could specify that a certain animal or\nobject should appear, but may do so in different positions and poses and with different backgrounds.\nWe also developed a variant where the conditioning function was location dependent. This could\nbe useful for applications where we do have information about the location of certain structures\nin the image embedded in h. By mapping h to a spatial representation s = m(h) (which has the\nsame width and height as the image but may have an arbitrary number of feature maps) with a\ndeconvolutional neural network m(), we obtain a location-dependent bias as follows:\n\ny = tanh(Wk,f \u2217 x + Vk,f \u2217 s) \u2299 \u03c3(Wk,g \u2217 x + Vk,g \u2217 s),    (5)\n\nwhere Vk,g \u2217 s is an unmasked 1 \u00d7 1 convolution.\n\n2.4 PixelCNN Auto-Encoders\n\nBecause conditional PixelCNNs have the capacity to model diverse, multimodal image distributions\np(x|h), it is possible to apply them as image decoders in existing neural architectures such as auto-encoders. An auto-encoder consists of two parts: an encoder that takes an input image x and maps it\nto a (usually) low-dimensional representation h, and a decoder that tries to reconstruct the original\nimage.\nStarting with a traditional convolutional auto-encoder architecture [16], we replace the deconvolutional decoder with a conditional PixelCNN and train the complete network end-to-end. Since\nPixelCNN has proved to be a strong unconditional generative model, we would expect this change to\nimprove the reconstructions. 
Perhaps more interestingly, we also expect it to change the representations that the encoder will learn\nto extract from the data: since so much of the low level pixel statistics\ncan be handled by the PixelCNN, the encoder should be able to omit these from h and concentrate\ninstead on more high-level abstract information.\n\n3 Experiments\n\n3.1 Unconditional Modeling with Gated PixelCNN\n\nTable 1 compares Gated PixelCNN with published results on the CIFAR-10 dataset. These architectures were all optimized for the best possible validation score, meaning that models that get a lower\nscore actually generalize better. Gated PixelCNN outperforms the PixelCNN by 0.11 bits/dim, which\nhas a very signi\ufb01cant effect on the visual quality of the samples produced, and which is close to the\nperformance of PixelRNN.\n\nModel | NLL Test (Train)\nUniform Distribution [30] | 8.00\nMultivariate Gaussian [30] | 4.70\nNICE [4] | 4.48\nDeep Diffusion [24] | 4.20\nDRAW [9] | 4.13\nDeep GMMs [31, 29] | 4.00\nConv DRAW [8] | 3.58 (3.57)\nRIDE [26, 30] | 3.47\nPixelCNN [30] | 3.14 (3.08)\nPixelRNN [30] | 3.00 (2.93)\nGated PixelCNN | 3.03 (2.90)\n\nTable 1: Test set performance of different models on CIFAR-10 in bits/dim (lower is better), training\nperformance in brackets.\n\nIn Table 2 we compare the performance of Gated PixelCNN with other models on the ImageNet\ndataset. Here Gated PixelCNN outperforms PixelRNN; we believe this is because the models are\nunder\ufb01tting, larger models perform better and the simpler PixelCNN model scales better. We were\nable to achieve similar performance to the PixelRNN (Row LSTM [30]) in less than half the training\ntime (60 hours using 32 GPUs). For the results in Table 2 we trained a larger model with 20 layers\n(Figure 2), each having 384 hidden units and a \ufb01lter size of 5 \u00d7 5. 
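The bits/dim figures in Tables 1 and 2 are negative log-likelihoods per colour-channel value. A model trained with a natural-log softmax loss reports nats, so a quick conversion is needed; the loss value below is a made-up placeholder, chosen only to illustrate the arithmetic:

```python
import math

# A natural-log cross-entropy loss gives nats per dimension; dividing by
# ln(2) converts to the bits/dim reported in Tables 1 and 2.
nll_nats_per_dim = 2.10  # hypothetical training-loss value, not a result
bits_per_dim = nll_nats_per_dim / math.log(2)
print(round(bits_per_dim, 2))  # 3.03
```

The same conversion in reverse turns a reported bits/dim into the expected cross-entropy of the 256-way per-channel softmax.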
We used 200K synchronous updates\nover 32 GPUs in TensorFlow [1] using a total batch size of 128.\n\nModel | NLL Test (Train)\n\n32x32\nConv Draw [8] | 4.40 (4.35)\nPixelRNN [30] | 3.86 (3.83)\nGated PixelCNN | 3.83 (3.77)\n\n64x64\nConv Draw [8] | 4.10 (4.04)\nPixelRNN [30] | 3.63 (3.57)\nGated PixelCNN | 3.57 (3.48)\n\nTable 2: Performance of different models on ImageNet in bits/dim (lower is better), training performance in brackets.\n\n3.2 Conditioning on ImageNet Classes\n\nFor our second experiment we explore class-conditional modelling of ImageNet images using Gated\nPixelCNNs. Given a one-hot encoding hi for the i-th class we model p(x|hi). The amount of\ninformation that the model receives is only log2(1000) \u2248 10 bits per image, or about 0.003 bits per\ncolour-channel value for a 32x32 image. Still, one could expect that conditioning the image generation\non the class label could signi\ufb01cantly improve the log-likelihood results; however, we did not observe\nbig differences. On the other hand, as noted in [27], we observed great improvements in the visual\nquality of the generated samples.\nIn Figure 3 we show samples from a single class-conditional model for 8 different classes. We see that\nthe generated classes are very distinct from one another, and that the corresponding objects, animals\nand backgrounds are clearly produced. Furthermore the images of a single class are very diverse: for\nexample the model was able to generate similar scenes from different angles and lighting conditions.\nIt is encouraging to see that given roughly 1000 images from every animal or object the model is able\nto generalize and produce new renderings.\n\n3.3 Conditioning on Portrait Embeddings\n\nIn our next experiment we took the latent representations from the top layer of a convolutional\nnetwork trained on a large database of portraits automatically cropped from Flickr images using a\nface detector. 
The quality of images varied wildly, because a lot of the pictures were taken with\nmobile phones in bad lighting conditions.\nThe network was trained with a triplet loss function [23] that ensured that the embedding h produced\nfor an image x of a speci\ufb01c person was closer to the embeddings for all other images of the same\nperson than it was to any embedding of another person.\nAfter the supervised net was trained we took the (image=x, embedding=h) tuples and trained the\nConditional PixelCNN to model p(x|h). Given a new image of a person that was not in the training\nset we can compute h = f(x) and generate new portraits of the same person.\nSamples from the model are shown in Figure 4. We can see that the embeddings capture a lot of the\nfacial features of the source image and the generative model is able to produce a large variety of new\nfaces with these features in new poses, lighting conditions, etc.\nFinally, we experimented with reconstructions conditioned on linear interpolations between embeddings of pairs of images. The results are shown in Figure 5. Every image in a single row used the\nsame random seed in the sampling, which results in smooth transitions. The leftmost and rightmost\nimages are used to produce the end points of the interpolation.\n\n3.4 PixelCNN Auto-Encoder\n\nThis experiment explores the possibility of training both the encoder and decoder (PixelCNN) end-to-end as an auto-encoder. We trained a PixelCNN auto-encoder on 32x32 ImageNet patches and\ncompared the results with those from a convolutional auto-encoder trained to optimize MSE. Both\nmodels used a 10- or 100-dimensional bottleneck.\nFigure 6 shows the reconstructions from both models. For the PixelCNN we sample multiple\nconditional reconstructions. These images support our prediction in Section 2.4 that the information\nencoded in the bottleneck representation h will be qualitatively different with a PixelCNN decoder\nthan with a more conventional decoder. 
For example, in the lowest row we can see that the model\ngenerates different but similar looking indoor scenes with people, instead of trying to exactly\nreconstruct the input.\n\n4 Conclusion\n\nThis work introduced the Gated PixelCNN, an improvement over the original PixelCNN that is able to\nmatch or outperform PixelRNN [30], and is computationally more ef\ufb01cient. In our new architecture,\nwe use two stacks of CNNs to deal with \u201cblind spots\u201d in the receptive \ufb01eld, which limited the original\nPixelCNN. Additionally, we use a gating mechanism which improves performance and convergence\nspeed. We have shown that the architecture gets similar performance to PixelRNN on CIFAR-10 and\nis now state-of-the-art on the ImageNet 32x32 and 64x64 datasets.\nFurthermore, using the Conditional PixelCNN we explored the conditional modelling of natural\nimages in three different settings. In class-conditional generation we showed that a single model is\nable to generate diverse and realistic looking images corresponding to different classes. On human\nportraits the model is capable of generating new images of the same person in different poses and\nlighting conditions from a single image. Finally, we demonstrated that the PixelCNN can be used as\na powerful image decoder in an autoencoder. In addition to achieving state-of-the-art log-likelihood\nscores on all these datasets, the samples generated from our model are of very high visual quality,\nshowing that the model captures natural variations of objects and lighting conditions.\nIn the future it might be interesting to try to generate new images with a certain animal or object\nsolely from a single example image [21, 22]. Another exciting direction would be to combine\nConditional PixelCNNs with variational inference to create a variational auto-encoder. 
In existing\nwork p(x|h) is typically modelled with a Gaussian with diagonal covariance, and using a PixelCNN\ninstead could thus improve the decoder in VAEs. Another promising direction of this work would be\nto model images based on an image caption instead of a class label [15, 19].\n\nAfrican elephant\n\nCoral Reef\n\nSandbar\n\nSorrel horse\n\nLhasa Apso (dog)\n\nLawn mower\n\nBrown bear\n\nRobin (bird)\n\nFigure 3: Class-conditional samples from the Conditional PixelCNN.\n\nFigure 4: Left: source image. Right: new portraits generated from the high-level latent representation.\n\nFigure 5: Linear interpolations in the embedding space decoded by the PixelCNN. Embeddings from the\nleftmost and rightmost images are used for the endpoints of the interpolation.\n\nFigure 6: Left to right: original image, reconstruction by an auto-encoder trained with MSE,\nconditional samples from a PixelCNN auto-encoder. Both auto-encoders were trained end-to-end\nwith an m = 10-dimensional bottleneck and an m = 100-dimensional bottleneck.\n\nReferences\n\n[1] Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S\nCorrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensor\ufb02ow: Large-scale machine learning on\nheterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.\n\n[2] Marc G Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos.\n\nUnifying count-based exploration and intrinsic motivation. arXiv preprint arXiv:1606.01868, 2016.\n\n[3] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a laplacian\npyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486\u20131494,\n2015.\n\n[4] Laurent Dinh, David Krueger, and Yoshua Bengio. 
NICE: Non-linear independent components estimation.\n\narXiv preprint arXiv:1410.8516, 2014.\n\n[5] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint\n\narXiv:1508.06576, 2015.\n\n[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron\nCourville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing\nSystems, pages 2672\u20132680, 2014.\n\n[7] Alex Graves and J\u00fcrgen Schmidhuber. Of\ufb02ine handwriting recognition with multidimensional recurrent\n\nneural networks. In Advances in Neural Information Processing Systems, 2009.\n\n[8] Karol Gregor, Frederic Besse, Danilo J Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual\n\ncompression. arXiv preprint arXiv:1601.06759, 2016.\n\n[9] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for\n\nimage generation. Proceedings of the 32nd International Conference on Machine Learning, 2015.\n\n[10] Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive\n\nnetworks. In Proceedings of the 31st International Conference on Machine Learning, 2014.\n\n[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.\n\narXiv preprint arXiv:1512.03385, 2015.\n\n[12] \u0141ukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. arXiv preprint arXiv:1511.08228, 2015.\n\n[13] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. arXiv preprint\n\narXiv:1507.01526, 2015.\n\n[14] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. The Journal of Machine\n\nLearning Research, 2011.\n\n8\n\n\f[15] Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating images from\n\ncaptions with attention. 
arXiv preprint arXiv:1511.02793, 2015.\n\n[16] Jonathan Masci, Ueli Meier, Dan Cire\u00b8san, and J\u00fcrgen Schmidhuber. Stacked convolutional auto-encoders\nfor hierarchical feature extraction. In Arti\ufb01cial Neural Networks and Machine Learning\u2013ICANN 2011,\npages 52\u201359. Springer, 2011.\n\n[17] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video\nprediction using deep networks in atari games. In Advances in Neural Information Processing Systems,\npages 2845\u20132853, 2015.\n\n[18] Christopher Olah and Mike Tyka. Inceptionism: Going deeper into neural networks. 2015.\n\n[19] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee.\n\nGenerative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.\n\n[20] Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate\ninference in deep generative models. In Proceedings of the 31st International Conference on Machine\nLearning, 2014.\n\n[21] Danilo Jimenez Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. One-shot\n\ngeneralization in deep generative models. arXiv preprint arXiv:1603.05106, 2016.\n\n[22] Ruslan Salakhutdinov, Joshua B Tenenbaum, and Antonio Torralba. Learning with hierarchical-deep\n\nmodels. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1958\u20131971, 2013.\n\n[23] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A uni\ufb01ed embedding for face\nrecognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, pages 815\u2013823, 2015.\n\n[24] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised\nlearning using nonequilibrium thermodynamics. Proceedings of the 32nd International Conference on\nMachine Learning, 2015.\n\n[25] Rupesh K Srivastava, Klaus Greff, and J\u00fcrgen Schmidhuber. 
Training very deep networks. In Advances in\n\nNeural Information Processing Systems, pages 2368\u20132376, 2015.\n\n[26] Lucas Theis and Matthias Bethge. Generative image modeling using spatial LSTMs. In Advances in\n\nNeural Information Processing Systems, 2015.\n\n[27] Lucas Theis, Aaron van den Oord, and Matthias Bethge. A note on the evaluation of generative models.\n\narXiv preprint arXiv:1511.01844, 2015.\n\n[28] Benigno Uria, Marc-Alexandre C\u00f4t\u00e9, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregres-\n\nsive distribution estimation. arXiv preprint arXiv:1605.02226, 2016.\n\n[29] Aaron van den Oord and Joni Dambre. Locally-connected transformations for deep gmms. In International\n\nConference on Machine Learning (ICML) : Deep learning Workshop, Abstracts, pages 1\u20138, 2015.\n\n[30] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv\n\npreprint arXiv:1601.06759, 2016.\n\n[31] A\u00e4ron van den Oord and Benjamin Schrauwen. Factoring variations in natural images with deep gaussian\n\nmixture models. In Advances in Neural Information Processing Systems, 2014.\n\n[32] Aaron van den Oord and Benjamin Schrauwen. The student-t mixture as a natural image patch prior with\n\napplication to image compression. The Journal of Machine Learning Research, 2014.\n\n9\n\n\f", "award": [], "sourceid": 2435, "authors": [{"given_name": "Aaron", "family_name": "van den Oord", "institution": "Google Deepmind"}, {"given_name": "Nal", "family_name": "Kalchbrenner", "institution": "Google DeepMind"}, {"given_name": "Lasse", "family_name": "Espeholt", "institution": "Google DeepMind"}, {"given_name": "koray", "family_name": "kavukcuoglu", "institution": "Google DeepMind"}, {"given_name": "Oriol", "family_name": "Vinyals", "institution": "Google"}, {"given_name": "Alex", "family_name": "Graves", "institution": "Google DeepMind"}]}