{"title": "MaCow: Masked Convolutional Generative Flow", "book": "Advances in Neural Information Processing Systems", "page_first": 5893, "page_last": 5902, "abstract": "Flow-based generative models, conceptually attractive due to the tractability of both exact log-likelihood computation and latent-variable inference, and the efficiency of both training and sampling, have led to a number of impressive empirical successes and spawned many advanced variants and theoretical investigations. Despite their computational efficiency, the density estimation performance of flow-based generative models significantly falls behind that of state-of-the-art autoregressive models. In this work, we introduce masked convolutional generative flow (MaCow), a simple yet effective architecture of generative flow using masked convolution. By restricting the local connectivity to a small kernel, MaCow enjoys fast and stable training and efficient sampling, while achieving significant improvements over Glow for density estimation on standard image benchmarks, considerably narrowing the gap to autoregressive models.", "full_text": "MaCow: Masked Convolutional Generative Flow

Xuezhe Ma, Xiang Kong, Shanghang Zhang, Eduard Hovy

xuezhem,xiangk@cs.cmu.edu, shanghaz@andrew.cmu.edu, hovy@cmu.edu

Carnegie Mellon University

Pittsburgh, PA, USA

Abstract

Flow-based generative models, conceptually attractive due to the tractability of exact log-likelihood computation and latent-variable inference as well as efficiency in training and sampling, have led to a number of impressive empirical successes and spawned many advanced variants and theoretical investigations.
Despite their computational efficiency, the density estimation performance of flow-based generative models significantly falls behind that of state-of-the-art autoregressive models. In this work, we introduce masked convolutional generative flow (MACOW), a simple yet effective architecture for generative flow using masked convolution. By restricting the local connectivity to a small kernel, MACOW features fast and stable training along with efficient sampling, while achieving significant improvements over Glow for density estimation on standard image benchmarks, considerably narrowing the gap with autoregressive models.

1 Introduction

Unsupervised learning of probabilistic models is a central yet challenging problem. Deep generative models have shown promising results in modeling complex distributions such as natural images (Radford et al., 2015), audio (Van Den Oord et al., 2016), and text (Bowman et al., 2015). Multiple approaches have emerged in recent years, including Variational Autoencoders (VAEs) (Kingma and Welling, 2014), Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), autoregressive neural networks (Larochelle and Murray, 2011; Oord et al., 2016), and flow-based generative models (Dinh et al., 2014, 2016; Kingma and Dhariwal, 2018). Among these, flow-based generative models have gained popularity for their capability of estimating densities of complex distributions, efficiently generating high-fidelity syntheses, and automatically learning useful latent spaces.

Flow-based generative models typically warp a simple distribution into a complex one by mapping points from the simple distribution to the complex data distribution through a chain of invertible transformations whose Jacobian determinants are efficient to compute. This design guarantees that the density of the transformed distribution can be computed analytically, making maximum likelihood learning feasible.
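The change-of-variables idea in the preceding paragraph can be made concrete with a minimal one-dimensional sketch (a hypothetical illustration, not the paper's model; the function names and the affine transforms are invented for exposition): a chain of invertible maps warps data toward a standard Gaussian, and the exact log-density of a data point is the Gaussian log-density of its latent image plus the sum of the per-step log-Jacobian corrections.

```python
import math

# Hypothetical sketch of a 1-D normalizing flow: each step is an invertible
# affine map h -> (h - shift) / scale, so its log |det Jacobian| is
# -log|scale|.  The exact log-density of x is the standard-Gaussian
# log-density of the final latent plus the accumulated log-det terms.

def standard_normal_logpdf(z):
    """Log-density of the standard Gaussian prior p_Z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def flow_log_prob(x, steps):
    """steps: list of (scale, shift) pairs applied in order (data -> latent)."""
    log_det = 0.0
    h = x
    for scale, shift in steps:
        h = (h - shift) / scale           # forward transform f_k
        log_det += -math.log(abs(scale))  # log |d f_k / d h|
    return standard_normal_logpdf(h) + log_det
```

With a single step `(2.0, 1.0)`, `flow_log_prob` recovers exactly the log-density of a Gaussian with mean 1 and standard deviation 2, illustrating why tractable Jacobians make maximum likelihood training feasible.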
Flow-based generative models have spawned significant interest in improving and analyzing their algorithms both theoretically and practically, and in applying them to a wide range of tasks and domains.

In their pioneering work, Dinh et al. (2014) first proposed Non-linear Independent Component Estimation (NICE), applying flow-based models to complex high-dimensional densities. RealNVP (Dinh et al., 2016) extended NICE with a more flexible invertible transformation and experimented with natural images. However, these flow-based generative models yielded worse density estimation performance than state-of-the-art autoregressive models, and were incapable of realistic synthesis of large images compared to GANs (Karras et al., 2018; Brock et al., 2019). Recently, Kingma and Dhariwal (2018) proposed Glow, a generative flow with invertible 1 × 1 convolutions, which significantly improved density estimation performance on natural images. Importantly, they demonstrated that flow-based generative models optimized towards the plain likelihood-based objective are capable of efficiently generating realistic high-resolution natural images.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Prenger et al. (2018) investigated applying flow-based generative models to speech synthesis by combining Glow with WaveNet (Van Den Oord et al., 2016). Ziegler and Rush (2019) adopted variational inference to apply generative flows to discrete sequential data. Unfortunately, the density estimation performance of Glow on natural images remains behind that of autoregressive models, such as PixelRNN/CNN (Oord et al., 2016; Salimans et al., 2017), Image Transformer (Parmar et al., 2018), PixelSNAIL (Chen et al., 2017), and SPN (Menick and Kalchbrenner, 2019).
There is also some work (Rezende and Mohamed, 2015; Kingma et al., 2016; Zheng et al., 2017) on applying flows to variational inference.

In this paper, we propose a novel architecture of generative flow, masked convolutional generative flow (MACOW), which leverages masked convolutional neural networks (Oord et al., 2016). The bijective mapping between input and output variables is easily established, while the computation of the determinant of the Jacobian remains efficient. Compared to inverse autoregressive flow (IAF) (Kingma et al., 2016), MACOW offers stable training and efficient inference and synthesis by restricting the local connectivity to a small "masked" kernel, while retaining large receptive fields by stacking multiple layers of convolutional flows and using rotational ordering masks (§3.1). We also propose a fine-grained version of the multi-scale architecture adopted in previous flow-based generative models to further improve performance (§3.2). Experimenting with four benchmark image datasets, CIFAR-10, ImageNet, LSUN, and CelebA-HQ, we demonstrate the effectiveness of MACOW as a density estimator by consistently achieving significant improvements over Glow on all of these datasets. When equipped with the variational dequantization mechanism (Ho et al., 2019), MACOW considerably narrows the density estimation gap with autoregressive models (§4).

2 Flow-based Generative Models

In this section, we first set up notation, describe flow-based generative models, and review Glow (Kingma and Dhariwal, 2018), as it is the foundation for MACOW.

2.1 Notations

Throughout the paper, uppercase letters represent random variables and lowercase letters denote realizations of their corresponding random variables.
Let X ∈ 𝒳 be the random variable of the observed data, e.g., X is an image or a sentence for image and text generation, respectively. Let P denote the true distribution of the data, i.e., X ∼ P, and D = {x1, . . . , xN} be our training sample, where xi, i = 1, . . . , N, are usually i.i.d. samples of X. Let 𝒫 = {Pθ : θ ∈ Θ} denote a parametric statistical model indexed by the parameter θ ∈ Θ, where Θ is the parameter space. p denotes the density of the corresponding distribution P. In the deep generative model literature, deep neural networks are the most widely used parametric models. The goal of generative models is to learn the parameter θ such that Pθ best approximates the true distribution P. In the context of maximum likelihood estimation, we minimize the negative log-likelihood of the parameters:

\min_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} -\log p_\theta(x_i) = \min_{\theta \in \Theta} \mathbb{E}_{\tilde{P}(X)}\left[ -\log p_\theta(X) \right],  (1)

where P̃(X) is the empirical distribution derived from the training data D.

2.2 Flow-based Models

In the framework of flow-based generative models, a set of latent variables Z ∈ 𝒵 is introduced with a prior distribution pZ(z), which is typically a simple distribution like a multivariate Gaussian. For a bijective function f : 𝒳 → 𝒵 (with g = f⁻¹), the change-of-variables formula defines the model distribution on 𝒳 by

p_\theta(x) = p_Z\big(f_\theta(x)\big) \left| \det\left( \frac{\partial f_\theta(x)}{\partial x} \right) \right|,  (2)

where ∂fθ(x)/∂x is the Jacobian of fθ at x.

The generative process is defined straightforwardly as:

z \sim p_Z(z), \qquad x = g_\theta(z).  (3)

Flow-based generative models focus on certain types of transformations fθ that allow the inverse functions gθ
and Jacobian determinants to be tractable to compute. By stacking multiple such invertible transformations in a sequence, also called a (normalizing) flow (Rezende and Mohamed, 2015), the flow becomes capable of warping a simple distribution pZ(z) into a complex one p(x) through:

X \overset{f_1}{\underset{g_1}{\longleftrightarrow}} H_1 \overset{f_2}{\underset{g_2}{\longleftrightarrow}} H_2 \overset{f_3}{\underset{g_3}{\longleftrightarrow}} \cdots \overset{f_K}{\underset{g_K}{\longleftrightarrow}} Z,

where f = f1 ◦ f2 ◦ · · · ◦ fK is a flow of K transformations. For brevity, we omit the parameter θ from fθ and gθ.

2.3 Glow

Recently, several types of invertible transformations have emerged to enhance the expressiveness of flows, among which Glow (Kingma and Dhariwal, 2018) stands out for its simplicity and effectiveness in both density estimation and high-fidelity synthesis. The following briefly describes the three types of transformations that comprise Glow.

Actnorm. Kingma and Dhariwal (2018) proposed an activation normalization layer (Actnorm) as an alternative to batch normalization (Ioffe and Szegedy, 2015) to alleviate challenges in model training. Similar to batch normalization, Actnorm performs an affine transformation of the activations using a scale and a bias parameter per channel for 2D images, such that

y_{i,j} = s \odot x_{i,j} + b,

where both x and y are tensors of shape [h × w × c] with spatial dimensions (h, w) and channel dimension c.

Invertible 1 × 1 convolution. To incorporate a permutation along the channel dimension, Glow includes a trainable invertible 1 × 1 convolution layer that generalizes the permutation operation as:

y_{i,j} = W x_{i,j},

where W is the weight matrix with shape c × c.

Affine Coupling Layers. Following Dinh et al.
(2016), Glow includes affine coupling layers in its architecture:

x_a, x_b = \mathrm{split}(x)
y_a = x_a
y_b = s(x_a) \odot x_b + b(x_a)
y = \mathrm{concat}(y_a, y_b),

where s(xa) and b(xa) are outputs of two neural networks with xa as input. The split() and concat() functions perform operations along the channel dimension.

From this designed architecture of Glow, we see that interactions between spatial dimensions are incorporated only in the coupling layers. The coupling layer, however, is typically costly in memory, making it infeasible to stack a significant number of coupling layers into a single model, especially when processing high-resolution images. The main goal of this work is to design a new type of transformation that simultaneously models the dependencies in both the spatial and channel dimensions while maintaining a relatively small memory footprint, improving the capacity of the generative flow.

3 Masked Convolutional Generative Flows

In this section, we describe the architectural components of the masked convolutional generative flow (MACOW). First, we introduce the proposed flow transformation using masked convolutions in §3.1. Then, we present a fine-grained version of the multi-scale architecture adopted by previous generative flows (Dinh et al., 2016; Kingma and Dhariwal, 2018) in §3.2.

Figure 1: Visualization of the receptive field of four masked convolutions with rotational ordering.

3.1 Flow with Masked Convolutions

Applying autoregressive models to normalizing flows has been explored in previous studies (Kingma et al., 2016; Papamakarios et al., 2017), with the idea of sequentially modeling the input random variables in an autoregressive order to ensure the model cannot read input variables behind the current one:

yt = s(x