{"title": "Glow: Generative Flow with Invertible 1x1 Convolutions", "book": "Advances in Neural Information Processing Systems", "page_first": 10215, "page_last": 10224, "abstract": "Flow-based generative models are conceptually attractive due to tractability of the exact log-likelihood, tractability of exact latent-variable inference, and parallelizability of both training and synthesis. In this paper we propose Glow, a simple type of generative flow using invertible 1x1 convolution. Using our method we demonstrate a significant improvement in log-likelihood and qualitative sample quality. Perhaps most strikingly, we demonstrate that a generative model optimized towards the plain log-likelihood objective is capable of efficient synthesis of large and subjectively realistic-looking images.", "full_text": "Glow: Generative Flow\n\nwith Invertible 1\u00d71 Convolutions\n\nDiederik P. Kingma*\u2020, Prafulla Dhariwal\u2217\n\n*OpenAI\n\u2020Google AI\n\nAbstract\n\nFlow-based generative models (Dinh et al., 2014) are conceptually attractive due to\ntractability of the exact log-likelihood, tractability of exact latent-variable inference,\nand parallelizability of both training and synthesis. In this paper we propose Glow,\na simple type of generative \ufb02ow using an invertible 1 \u00d7 1 convolution. Using our\nmethod we demonstrate a signi\ufb01cant improvement in log-likelihood on standard\nbenchmarks. Perhaps most strikingly, we demonstrate that a \ufb02ow-based generative\nmodel optimized towards the plain log-likelihood objective is capable of ef\ufb01cient\nrealistic-looking synthesis and manipulation of large images. The code for our\nmodel is available at https://github.com/openai/glow.\n\n1\n\nIntroduction\n\nTwo major unsolved problems in the \ufb01eld of machine learning are (1) data-ef\ufb01ciency: the ability to\nlearn from few datapoints, like humans; and (2) generalization: robustness to changes of the task or\nits context. 
AI systems, for example, often do not work at all when given inputs that are different\n\n\u2217Equal contribution.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: Synthetic celebrities sampled from our model; see Section 3 for architecture and method,\nand Section 5 for more results.\n\n\ffrom their training distribution. A promise of generative models, a major branch of machine learning,\nis to overcome these limitations by: (1) learning realistic world models, potentially allowing agents to\nplan in a world model before actual interaction with the world, and (2) learning meaningful features\nof the input while requiring little or no human supervision or labeling. Since such features can be\nlearned from large unlabeled datasets and are not necessarily task-speci\ufb01c, downstream solutions\nbased on those features could potentially be more robust and more data ef\ufb01cient. In this paper we\nwork towards this ultimate vision, in addition to intermediate applications, by aiming to improve\nupon the state-of-the-art of generative models.\nGenerative modeling is generally concerned with the extremely challenging task of modeling all\ndependencies within very high-dimensional input data, usually speci\ufb01ed in the form of a full joint\nprobability distribution. Since such joint models potentially capture all patterns that are present in the\ndata, the applications of accurate generative models are near endless. Immediate applications are as\ndiverse as speech synthesis, text analysis, semi-supervised learning and model-based control; see\nSection 4 for references.\nThe discipline of generative modeling has experienced enormous leaps in capabilities in recent years,\nmostly with likelihood-based methods (Graves, 2013; Kingma and Welling, 2013, 2018; Dinh et al.,\n2014; van den Oord et al., 2016a) and generative adversarial networks (GANs) (Goodfellow et al.,\n2014) (see Section 4). 
Likelihood-based methods can be divided into three categories:\n\n1. Autoregressive models (Hochreiter and Schmidhuber, 1997; Graves, 2013; van den Oord\net al., 2016a,b; Van Den Oord et al., 2016). Those have the advantage of simplicity, but have\nas disadvantage that synthesis has limited parallelizability, since the computational length of\nsynthesis is proportional to the dimensionality of the data; this is especially troublesome for\nlarge images or video.\n\n2. Variational autoencoders (VAEs) (Kingma and Welling, 2013, 2018), which optimize a\nlower bound on the log-likelihood of the data. Variational autoencoders have the advantage\nof parallelizability of training and synthesis, but can be comparatively challenging to\noptimize (Kingma et al., 2016).\n\n3. Flow-based generative models, \ufb01rst described in NICE (Dinh et al., 2014) and extended in\nRealNVP (Dinh et al., 2016). We explain the key ideas behind this class of model in the\nfollowing sections.\n\nFlow-based generative models have so far gained little attention in the research community compared\nto GANs (Goodfellow et al., 2014) and VAEs (Kingma and Welling, 2013). Some of the merits of\n\ufb02ow-based generative models include:\n\n\u2022 Exact latent-variable inference and log-likelihood evaluation. In VAEs, one is able to infer\nonly approximately the value of the latent variables that correspond to a datapoint. GAN\u2019s\nhave no encoder at all to infer the latents. In reversible generative models, this can be done\nexactly without approximation. Not only does this lead to accurate inference, it also enables\noptimization of the exact log-likelihood of the data, instead of a lower bound of it.\n\n\u2022 Ef\ufb01cient inference and ef\ufb01cient synthesis. Autoregressive models, such as the Pixel-\nCNN (van den Oord et al., 2016b), are also reversible, however synthesis from such models\nis dif\ufb01cult to parallelize, and typically inef\ufb01cient on parallel hardware. 
Flow-based generative models like Glow (and RealNVP) are efficient to parallelize for both inference and synthesis.\n\n\u2022 Useful latent space for downstream tasks. The hidden layers of autoregressive models have unknown marginal distributions, making it much more difficult to perform valid manipulation of data. In GANs, datapoints can usually not be directly represented in a latent space, as they have no encoder and might not have full support over the data distribution (Grover et al., 2018). This is not the case for reversible generative models and VAEs, which allow for various applications such as interpolations between datapoints and meaningful modifications of existing datapoints.\n\n\u2022 Significant potential for memory savings. Computing gradients in reversible neural networks requires an amount of memory that is constant instead of linear in their depth, as explained in the RevNet paper (Gomez et al., 2017).\n\nIn this paper we propose a new generative flow coined Glow, with various new elements as described in Section 3. In Section 5, we compare our model quantitatively with previous flows, and in Section 6, we study the qualitative aspects of our model on high-resolution datasets.\n\n2 Background: Flow-based Generative Models\nLet x be a high-dimensional random vector with unknown true distribution x \u223c p\u2217(x). We collect an i.i.d. dataset D, and choose a model p\u03b8(x) with parameters \u03b8. In case of discrete data x, the log-likelihood objective is then equivalent to minimizing:\n\nL(D) = (1/N) \u2211_{i=1}^N \u2212log p\u03b8(x^(i))    (1)\n\nIn case of continuous data x, we minimize the following:\n\nL(D) \u2248 (1/N) \u2211_{i=1}^N \u2212log p\u03b8(\u02dcx^(i)) + c    (2)\n\nwhere \u02dcx^(i) = x^(i) + u with u \u223c U(0, a), and c = \u2212M \u00b7 log a, where a is determined by the discretization level of the data and M is the dimensionality of x. 
Both objectives (eqs. (1) and (2)) measure the expected compression cost in nats or bits; see (Dinh et al., 2016). Optimization is done through stochastic gradient descent using minibatches of data (Kingma and Ba, 2015).\nIn most flow-based generative models (Dinh et al., 2014, 2016), the generative process is defined as:\n\nz \u223c p\u03b8(z)    (3)\nx = g\u03b8(z)    (4)\n\nwhere z is the latent variable and p\u03b8(z) has a (typically simple) tractable density, such as a spherical multivariate Gaussian distribution: p\u03b8(z) = N(z; 0, I). The function g\u03b8(\u00b7) is invertible, also called bijective, such that given a datapoint x, latent-variable inference is done by z = f\u03b8(x) = g\u03b8^{\u22121}(x). For brevity, we will omit subscript \u03b8 from f\u03b8 and g\u03b8.\nWe focus on functions where f (and, likewise, g) is composed of a sequence of transformations: f = f1 \u25e6 f2 \u25e6 \u00b7\u00b7\u00b7 \u25e6 fK, such that the relationship between x and z can be written as:\n\nx \u2190f1\u2192 h_1 \u2190f2\u2192 h_2 \u00b7\u00b7\u00b7 \u2190fK\u2192 z    (5)\n\nSuch a sequence of invertible transformations is also called a (normalizing) flow (Rezende and Mohamed, 2015). Under the change of variables of eq. (4), the probability density function (pdf) of the model given a datapoint can be written as:\n\nlog p\u03b8(x) = log p\u03b8(z) + log |det(dz/dx)|    (6)\n          = log p\u03b8(z) + \u2211_{i=1}^K log |det(dh_i/dh_{i\u22121})|    (7)\n\nwhere we define h_0 \u225c x and h_K \u225c z for conciseness. The scalar value log |det(dh_i/dh_{i\u22121})| is the logarithm of the absolute value of the determinant of the Jacobian matrix (dh_i/dh_{i\u22121}), also called the log-determinant. This value is the change in log-density when going from h_{i\u22121} to h_i under transformation f_i. 
While it may look intimidating, its value can be surprisingly simple to compute for certain choices of transformations, as previously explored in (Deco and Brauer, 1995; Dinh et al., 2014; Rezende and Mohamed, 2015; Kingma et al., 2016). The basic idea is to choose transformations whose Jacobian dh_i/dh_{i\u22121} is a triangular matrix. For those transformations, the log-determinant is simple:\n\nlog |det(dh_i/dh_{i\u22121})| = sum(log |diag(dh_i/dh_{i\u22121})|)    (8)\n\nwhere sum() takes the sum over all vector elements, log() takes the element-wise logarithm, and diag() takes the diagonal of the Jacobian matrix.\n\n(a) One step of our flow. (b) Multi-scale architecture (Dinh et al., 2016).\n\nFigure 2: We propose a generative flow where each step (left) consists of an actnorm step, followed by an invertible 1 \u00d7 1 convolution, followed by an affine transformation (Dinh et al., 2014). This flow is combined with a multi-scale architecture (right). See Section 3 and Table 1.\n\nTable 1: The three main components of our proposed flow, their reverses, and their log-determinants. Here, x signifies the input of the layer, and y signifies its output. Both x and y are tensors of shape [h \u00d7 w \u00d7 c] with spatial dimensions (h, w) and channel dimension c. With (i, j) we denote spatial indices into tensors x and y. The function NN() is a nonlinear mapping, such as a (shallow) convolutional neural network like in ResNets (He et al., 2016) and RealNVP (Dinh et al., 2016).\n\nActnorm (see Section 3.1):\n  Function: \u2200i, j : y_{i,j} = s \u2299 x_{i,j} + b\n  Reverse: \u2200i, j : x_{i,j} = (y_{i,j} \u2212 b)/s\n  Log-determinant: h \u00b7 w \u00b7 sum(log |s|)\n\nInvertible 1 \u00d7 1 convolution, W : [c \u00d7 c] (see Section 3.2):\n  Function: \u2200i, j : y_{i,j} = W x_{i,j}\n  Reverse: \u2200i, j : x_{i,j} = W^{\u22121} y_{i,j}\n  Log-determinant: h \u00b7 w \u00b7 log |det(W)|, or h \u00b7 w \u00b7 sum(log |s|) (see eq. (10))\n\nAffine coupling layer (see Section 3.3 and (Dinh et al., 2014)):\n  Function: x_a, x_b = split(x); (log s, t) = NN(x_b); s = exp(log s); y_a = s \u2299 x_a + t; y_b = x_b; y = concat(y_a, y_b)\n  Reverse: y_a, y_b = split(y); (log s, t) = NN(y_b); s = exp(log s); x_a = (y_a \u2212 t)/s; x_b = y_b; x = concat(x_a, x_b)\n  Log-determinant: sum(log |s|)\n\n3 Proposed Generative Flow\n\nWe propose a new flow, building on the NICE and RealNVP flows proposed in (Dinh et al., 2014, 2016). It consists of a series of steps of flow, combined in a multi-scale architecture; see Figure 2. Each step of flow consists of actnorm (Section 3.1) followed by an invertible 1 \u00d7 1 convolution (Section 3.2), followed by a coupling layer (Section 3.3).\nThis flow is combined with a multi-scale architecture; due to space constraints we refer to (Dinh et al., 2016) for more details. This architecture has a depth of flow K, and number of levels L (Figure 2).\n\n3.1 Actnorm: scale and bias layer with data-dependent initialization\n\nIn Dinh et al. (2016), the authors propose the use of batch normalization (Ioffe and Szegedy, 2015) to alleviate the problems encountered when training deep models. 
However, since the variance of the activation noise added by batch normalization is inversely proportional to minibatch size per GPU or other processing unit (PU), performance is known to degrade for small per-PU minibatch size. For large images, due to memory constraints, we learn with minibatch size 1 per PU. We propose an actnorm layer (for activation normalization), that performs an affine transformation of the activations using a scale and bias parameter per channel, similar to batch normalization. These parameters are initialized such that the post-actnorm activations per-channel have zero mean and unit variance given an initial minibatch of data. This is a form of data-dependent initialization (Salimans and Kingma, 2016). After initialization, the scale and bias are treated as regular trainable parameters that are independent of the data.\n\n3.2 Invertible 1 \u00d7 1 convolution\n\n(Dinh et al., 2014, 2016) proposed a flow containing the equivalent of a permutation that reverses the ordering of the channels. We propose to replace this fixed permutation with a (learned) invertible 1 \u00d7 1 convolution, where the weight matrix is initialized as a random rotation matrix. 
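As an illustration, here is a minimal NumPy sketch (not the paper's actual implementation; the helper names are my own): the invertible 1 × 1 convolution multiplies the channel vector at every spatial position by one shared c × c matrix W, and the random-rotation initialization makes the initial log-determinant (eq. (9)) exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 4  # number of channels

# Initialize W as a random rotation matrix, so log|det W| = 0 at init.
q, _ = np.linalg.qr(rng.normal(size=(c, c)))
if np.linalg.det(q) < 0:
    q[:, 0] = -q[:, 0]  # flip one column to force det = +1
W = q

def conv_1x1(x, W):
    """Apply the same c x c matrix W to the channel vector at every pixel."""
    h, w, c = x.shape
    y = x.reshape(-1, c) @ W.T                        # y_{i,j} = W x_{i,j}
    logdet = h * w * np.log(abs(np.linalg.det(W)))    # eq. (9)
    return y.reshape(h, w, c), logdet

def conv_1x1_inverse(y, W):
    h, w, c = y.shape
    return (y.reshape(-1, c) @ np.linalg.inv(W).T).reshape(h, w, c)

x = rng.normal(size=(8, 8, c))
y, logdet = conv_1x1(x, W)
assert np.allclose(conv_1x1_inverse(y, W), x)
assert abs(logdet) < 1e-8  # rotation matrix: log-determinant is zero
```

Because a rotation matrix has |det W| = 1, the layer initially contributes nothing to the objective's log-determinant term; the weights only drift away from a rotation as training proceeds.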
Note that a 1 \u00d7 1 convolution with equal number of input and output channels is a generalization of a permutation operation.\nThe log-determinant of an invertible 1 \u00d7 1 convolution of a h \u00d7 w \u00d7 c tensor h with c \u00d7 c weight matrix W is straightforward to compute:\n\nlog |det(d conv2D(h; W)/dh)| = h \u00b7 w \u00b7 log |det(W)|    (9)\n\nThe cost of computing or differentiating det(W) is O(c^3), which is often comparable to the cost of computing conv2D(h; W), which is O(h \u00b7 w \u00b7 c^2). We initialize the weights W as a random rotation matrix, having a log-determinant of 0; after one SGD step these values start to diverge from 0.\nLU Decomposition. The cost of computing det(W) can be reduced from O(c^3) to O(c) by parameterizing W directly in its LU decomposition:\n\nW = PL(U + diag(s))    (10)\n\nwhere P is a permutation matrix, L is a lower triangular matrix with ones on the diagonal, U is an upper triangular matrix with zeros on the diagonal, and s is a vector. The log-determinant is then simply:\n\nlog |det(W)| = sum(log |s|)    (11)\n\nThe difference in computational cost will become significant for large c, although for the networks in our experiments we did not measure a large difference in wallclock computation time.\nIn this parameterization, we initialize the parameters by first sampling a random rotation matrix W, then computing the corresponding value of P (which remains fixed) and the corresponding initial values of L, U, and s (which are optimized).\n\n3.3 Affine Coupling Layers\n\nA powerful reversible transformation where the forward function, the reverse function and the log-determinant are computationally efficient, is the affine coupling layer introduced in (Dinh et al., 2014, 2016). See Table 1. 
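The coupling transform of Table 1 can be sketched in a few lines of NumPy (a hedged sketch, not the paper's code; the stand-in nn() below is an arbitrary placeholder for the real convolutional NN(), which never needs to be inverted):

```python
import numpy as np

def nn(x_b):
    # Stand-in for NN(): any function of x_b works, because the inverse
    # pass re-computes it from the half that passes through unchanged.
    log_s = np.tanh(x_b)   # kept small so s = exp(log_s) stays stable
    t = x_b ** 2
    return log_s, t

def coupling_forward(x):
    x_a, x_b = np.split(x, 2, axis=-1)
    log_s, t = nn(x_b)
    y_a = np.exp(log_s) * x_a + t   # affine transform of one half
    y_b = x_b                        # other half passes through
    logdet = np.sum(log_s)           # log-determinant: sum(log|s|)
    return np.concatenate([y_a, y_b], axis=-1), logdet

def coupling_inverse(y):
    y_a, y_b = np.split(y, 2, axis=-1)
    log_s, t = nn(y_b)               # recomputed from the unchanged half
    x_a = (y_a - t) / np.exp(log_s)
    return np.concatenate([x_a, y_b], axis=-1)

x = np.random.default_rng(1).normal(size=(8, 8, 4))
y, logdet = coupling_forward(x)
assert np.allclose(coupling_inverse(y), x)
```

The Jacobian of this map is triangular with exp(log s) on the diagonal of the transformed block, which is why the log-determinant reduces to sum(log |s|).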
An additive coupling layer is a special case with s = 1 and a log-determinant of 0.\n\nZero initialization. We initialize the last convolution of each NN() with zeros, such that each affine coupling layer initially performs an identity function; we found that this helps training very deep networks.\n\nSplit and concatenation. As in (Dinh et al., 2014), the split() function splits the input tensor into two halves along the channel dimension, while the concat() operation performs the corresponding reverse operation: concatenation into a single tensor. In (Dinh et al., 2016), another type of split was introduced: along the spatial dimensions using a checkerboard pattern. In this work we only perform splits along the channel dimension, simplifying the overall architecture.\n\nPermutation. Each step of flow above should be preceded by some kind of permutation of the variables that ensures that, after sufficient steps of flow, each dimension can affect every other dimension. The type of permutation specifically done in (Dinh et al., 2014, 2016) is equivalent to simply reversing the ordering of the channels (features) before performing an additive coupling layer. An alternative is to perform a (fixed) random permutation. Our invertible 1x1 convolution is a generalization of such permutations. In experiments we compare these three choices.\n\n4 Related Work\n\nThis work builds upon the ideas and flows proposed in (Dinh et al., 2014) (NICE) and (Dinh et al., 2016) (RealNVP); comparisons with this work are made throughout this paper. In (Papamakarios et al., 2017) (MAF), the authors propose a generative flow based on IAF (Kingma et al., 2016); however, since synthesis from MAF is non-parallelizable and therefore inefficient, we omit it from comparisons. 
Synthesis from autoregressive (AR) models (Hochreiter and Schmidhuber, 1997;\nGraves, 2013; van den Oord et al., 2016a,b; Van Den Oord et al., 2016) is similarly non-parallelizable.\nSynthesis of high-dimensional data typically takes multiple orders of magnitude longer with AR\nmodels; see (Kingma et al., 2016; Oord et al., 2017) for evidence. Sampling 256 \u00d7 256 images with\nour largest models takes less than one second on current hardware. 2 (Reed et al., 2017) explores\ntechniques for speeding up synthesis in AR models considerably; we leave the comparison to this\nline of work to future work.\nGANs (Goodfellow et al., 2014) are arguably best known for their ability to synthesize large and\nrealistic images (Karras et al., 2017), in contrast with likelihood-based methods. Downsides of\nGANs are their general lack of latent-space encoders, their general lack of full support over the\ndata (Grover et al., 2018), their dif\ufb01culty of optimization, and their dif\ufb01culty of assessing over\ufb01tting\nand generalization.\n\n5 Quantitative Experiments\n\nWe begin our experiments by comparing how our new \ufb02ow compares against RealNVP (Dinh et al.,\n2016). We then apply our model on other standard datasets and compare log-likelihoods against\nprevious generative models. See the appendix for optimization details. In our experiments, we\nlet each NN() have three convolutional layers, where the two hidden layers have ReLU activation\nfunctions and 512 channels. The \ufb01rst and last convolutions are 3 \u00d7 3, while the center convolution is\n1 \u00d7 1, since both its input and output have a large number of channels, in contrast with the \ufb01rst and\nlast convolution.\nGains using invertible 1 \u00d7 1 Convolution. 
We choose the architecture described in Section 3,\nand consider three variations for the permutation of the channel variables - a reversing operation\nas described in the RealNVP, a \ufb01xed random permutation, and our invertible 1 \u00d7 1 convolution.\nWe compare for models with only additive coupling layers, and models with af\ufb01ne coupling. As\ndescribed earlier, we initialize all models with a data-dependent initialization which normalizes the\nactivations of each layer. All models were trained with K = 32 and L = 3. The model with 1 \u00d7 1\nconvolution has a negligible 0.2% larger amount of parameters.\nWe compare the average negative log-likelihood (bits per dimension) on the CIFAR-10 (Krizhevsky,\n2009) dataset, keeping all training conditions constant and averaging across three random seeds.\nThe results are in Figure 3. As we see, for both additive and af\ufb01ne couplings, the invertible 1 \u00d7 1\nconvolution achieves a lower negative log likelihood and converges faster. The af\ufb01ne coupling models\nalso converge faster than the additive coupling models. We noted that the increase in wallclock time\nfor the invertible 1 \u00d7 1 convolution model was only \u2248 7%, thus the operation is computationally\nef\ufb01cient as well.\n\nComparison with RealNVP on standard benchmarks. Besides the permutation operation, the\nRealNVP architecture has other differences such as the spatial coupling layers. 
In order to verify that our proposed architecture is overall competitive with the RealNVP architecture, we compare our models on various natural image datasets. In particular, we compare on the CIFAR-10, ImageNet (Russakovsky et al., 2015) and LSUN (Yu et al., 2015) datasets. We follow the same preprocessing as in (Dinh et al., 2016). For ImageNet, we use the 32 \u00d7 32 and 64 \u00d7 64 downsampled versions of ImageNet (Oord et al., 2016), and for LSUN we downsample to 96 \u00d7 96 and take random crops of 64 \u00d7 64. We also include the bits/dimension for our model trained on 256 \u00d7 256 CelebA HQ used in our qualitative experiments.3 As we see in Table 2, our model achieves a significant improvement on all the datasets.\n\n2More specifically, generating a 256 \u00d7 256 image at batch size 1 takes about 130ms on a single NVIDIA GTX 1080 Ti, and about 550ms on a NVIDIA Tesla K80.\n\n(a) Additive coupling. (b) Affine coupling.\n\nFigure 3: Comparison of the three variants - a reversing operation as described in the RealNVP, a fixed random permutation, and our proposed invertible 1 \u00d7 1 convolution, with additive (left) versus affine (right) coupling layers. We plot the mean and standard deviation across three runs with different random seeds.\n\nTable 2: Best results in bits per dimension of our model compared to RealNVP.\n\nModel   | CIFAR-10 | ImageNet 32x32 | ImageNet 64x64 | LSUN (bedroom) | LSUN (tower) | LSUN (church outdoor)\nRealNVP | 3.49     | 4.28           | 3.98           | 2.72           | 2.81         | 3.08\nGlow    | 3.35     | 4.09           | 3.81           | 2.38           | 2.46         | 2.67\n\n6 Qualitative Experiments\n\nWe now study the qualitative aspects of the model on high-resolution datasets. We choose the CelebA-HQ dataset (Karras et al., 2017), which consists of 30000 high resolution images from the CelebA dataset, and train the same architecture as above but now for images at a resolution of 256\u00b2, with K = 32 and L = 6. 
To improve visual quality at the cost of a slight decrease in color fidelity, we train our models on 5-bit images. We aim to study whether our model can scale to high resolutions, produce realistic samples, and produce a meaningful latent space. Due to device memory constraints, at these resolutions we work with minibatch size 1 per PU, and use gradient checkpointing (Salimans and Bulatov, 2017). In the future, we could use a constant amount of memory independent of depth by utilizing the reversibility of the model (Gomez et al., 2017).\nConsistent with earlier work on likelihood-based generative models, we found that sampling from a reduced-temperature model (Parmar et al., 2018) often results in higher-quality samples. When sampling with temperature T, we sample from the distribution p\u03b8,T(x) \u221d (p\u03b8(x))^{T\u00b2}. In case of additive coupling layers, this can be achieved simply by multiplying the standard deviation of p\u03b8(z) by a factor of T.\n\n3Since the original CelebA HQ dataset didn't have a validation set, we separated it into a training set of 27000 images and a validation set of 3000 images.\n\nFigure 4: Random samples from the model, with temperature 0.7.\n\nFigure 5: Linear interpolation in latent space between real images.\n\nSynthesis and Interpolation. Figure 4 shows the random samples obtained from our model. The images are of high quality for a non-autoregressive likelihood-based model. To see how well we can interpolate, we take a pair of real images, encode them with the encoder, and linearly interpolate between the latents to obtain samples. 
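To make the procedure concrete, here is a schematic NumPy sketch (the encode/decode pair below is a toy stand-in for the trained flow f and its inverse g, not the real model): interpolation is a linear blend in z-space, and reduced-temperature sampling only rescales the latent's standard deviation by T.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the trained flow: any invertible map works; here a
# fixed random rotation of a D-dimensional "image" vector.
D = 12
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
encode = lambda x: Q @ x     # z = f(x)
decode = lambda z: Q.T @ z   # x = g(z) = f^{-1}(z)

def interpolate(x1, x2, n=5):
    """Encode two datapoints and decode evenly spaced latent blends."""
    z1, z2 = encode(x1), encode(x2)
    return [decode((1 - a) * z1 + a * z2) for a in np.linspace(0, 1, n)]

def sample(temperature=0.7):
    """Reduced-temperature sampling: scale the std of p(z) by T."""
    z = temperature * rng.normal(size=D)
    return decode(z)

x1, x2 = rng.normal(size=D), rng.normal(size=D)
path = interpolate(x1, x2)
assert np.allclose(path[0], x1) and np.allclose(path[-1], x2)
```

Because the flow is exactly invertible, the endpoints of the interpolation reconstruct the original datapoints exactly, unlike the approximate encodings of a VAE.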
The results in Figure 5 show that the image manifold of the\ngenerator distribution is smooth and almost all intermediate samples look like realistic faces.\n\nSemantic Manipulation. We now consider modifying attributes of an image. To do so, we use the\nlabels in the CelebA dataset. Each image has a binary label corresponding to presence or absence of\nattributes like smiling, blond hair, young, etc. This gives us 30000 binary labels for each attribute.\nWe then calculate the average latent vector zpos for images with the attribute and zneg for images\nwithout, and then use the difference (zpos \u2212 zneg) as a direction for manipulating. Note that this is a\nrelatively small amount of supervision, and is done after the model is trained (no labels were used\nwhile training), making it extremely easy to do for a variety of different target attributes. The results\nare shown in Figure 6 (appendix).\n\nEffect of temperature and model depth. Figure 8 (appendix) shows how the sample quality and\ndiversity varies with temperature. The highest temperatures have noisy images, possibly due to\noverestimating the entropy of the data distribution; we choose a temperature of 0.7 as a sweet spot\nfor diversity and quality of samples. Figure 9 (appendix) shows how model depth affects the ability\nof the model to learn long-range dependencies.\n\n7 Conclusion\n\nWe propose a new type of generative \ufb02ow and demonstrate improved quantitative performance in\nterms of log-likelihood on standard image modeling benchmarks. In addition, we demonstrate that\nwhen trained on high-resolution faces, our model is able to synthesize realistic images.\n\nReferences\nDeco, G. and Brauer, W. (1995). Higher order statistical decorrelation without information loss.\n\nAdvances in Neural Information Processing Systems, pages 247\u2013254.\n3For 128 \u00d7 128 and 96 \u00d7 96 versions, we centre cropped the original image, and downsampled. 
For 64 \u00d7 64\n\nversion, we took random crops from the 96 \u00d7 96 downsampled image as done in Dinh et al. (2016)\n\n8\n\n\fDinh, L., Krueger, D., and Bengio, Y. (2014). Nice: non-linear independent components estimation.\n\narXiv preprint arXiv:1410.8516.\n\nDinh, L., Sohl-Dickstein, J., and Bengio, S. (2016). Density estimation using Real NVP. arXiv\n\npreprint arXiv:1605.08803.\n\nGomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. (2017). The reversible residual network:\nIn Advances in Neural Information Processing\n\nBackpropagation without storing activations.\nSystems, pages 2211\u20132221.\n\nGoodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and\nBengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing\nSystems, pages 2672\u20132680.\n\nGraves, A. (2013). Generating sequences with recurrent neural networks.\n\narXiv:1308.0850.\n\narXiv preprint\n\nGrover, A., Dhar, M., and Ermon, S. (2018). Flow-gan: Combining maximum likelihood and\n\nadversarial learning in generative models. In AAAI Conference on Arti\ufb01cial Intelligence.\n\nHe, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity mappings in deep residual networks. arXiv\n\npreprint arXiv:1603.05027.\n\nHochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural computation,\n\n9(8):1735\u20131780.\n\nIoffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by\n\nreducing internal covariate shift. arXiv preprint arXiv:1502.03167.\n\nKarras, T., Aila, T., Laine, S., and Lehtinen, J. (2017). Progressive growing of gans for improved\n\nquality, stability, and variation. arXiv preprint arXiv:1710.10196.\n\nKingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. Proceedings of the\n\nInternational Conference on Learning Representations 2015.\n\nKingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. 
(2016).\nImproved variational inference with inverse autoregressive \ufb02ow. In Advances in Neural Information\nProcessing Systems, pages 4743\u20134751.\n\nKingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. Proceedings of the 2nd\n\nInternational Conference on Learning Representations.\n\nKingma, D. P. and Welling, M. (2018). Variational autoencoders. Under Review.\n\nKrizhevsky, A. (2009). Learning multiple layers of features from tiny images.\n\nOord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. (2016). Pixel recurrent neural networks. arXiv\n\npreprint arXiv:1601.06759.\n\nOord, A. v. d., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., Driessche, G.\nv. d., Lockhart, E., Cobo, L. C., Stimberg, F., et al. (2017). Parallel wavenet: Fast high-\ufb01delity\nspeech synthesis. arXiv preprint arXiv:1711.10433.\n\nPapamakarios, G., Murray, I., and Pavlakou, T. (2017). Masked autoregressive \ufb02ow for density\n\nestimation. In Advances in Neural Information Processing Systems, pages 2335\u20132344.\n\nParmar, N., Vaswani, A., Uszkoreit, J., Kaiser, \u0141., Shazeer, N., and Ku, A. (2018). Image transformer.\n\narXiv preprint arXiv:1802.05751.\n\nReed, S., Oord, A. v. d., Kalchbrenner, N., Colmenarejo, S. G., Wang, Z., Belov, D., and de Freitas,\nN. (2017). Parallel multiscale autoregressive density estimation. arXiv preprint arXiv:1703.03664.\n\nRezende, D. and Mohamed, S. (2015). Variational inference with normalizing \ufb02ows. In Proceedings\n\nof The 32nd International Conference on Machine Learning, pages 1530\u20131538.\n\n9\n\n\fRussakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla,\nA., Bernstein, M., et al. (2015). Imagenet large scale visual recognition challenge. International\nJournal of Computer Vision, 115(3):211\u2013252.\n\nSalimans, T. and Bulatov, Y. (2017). Gradient checkpointing. https://github.com/openai/\n\ngradient-checkpointing.\n\nSalimans, T. 
and Kingma, D. P. (2016). Weight normalization: A simple reparameterization to\n\naccelerate training of deep neural networks. arXiv preprint arXiv:1602.07868.\n\nVan Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner,\nN., Senior, A., and Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. arXiv\npreprint arXiv:1609.03499.\n\nvan den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. (2016a). Pixel recurrent neural networks.\n\narXiv preprint arXiv:1601.06759.\n\nvan den Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., and Kavukcuoglu, K.\n(2016b). Conditional image generation with PixelCNN decoders. arXiv preprint arXiv:1606.05328.\n\nYu, F., Zhang, Y., Song, S., Seff, A., and Xiao, J. (2015). Lsun: Construction of a large-scale image\n\ndataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365.\n\n10\n\n\f", "award": [], "sourceid": 6558, "authors": [{"given_name": "Durk", "family_name": "Kingma", "institution": "Google"}, {"given_name": "Prafulla", "family_name": "Dhariwal", "institution": "OpenAI"}]}