{"title": "Masked Autoregressive Flow for Density Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 2338, "page_last": 2347, "abstract": "Autoregressive models are among the best performing neural density estimators. We describe an approach for increasing the flexibility of an autoregressive model, based on modelling the random numbers that the model uses internally when generating data. By constructing a stack of autoregressive models, each modelling the random numbers of the next model in the stack, we obtain a type of normalizing flow suitable for density estimation, which we call Masked Autoregressive Flow. This type of flow is closely related to Inverse Autoregressive Flow and is a generalization of Real NVP. Masked Autoregressive Flow achieves state-of-the-art performance in a range of general-purpose density estimation tasks.", "full_text": "Masked Autoregressive Flow for Density Estimation\n\nGeorge Papamakarios\nUniversity of Edinburgh\n\ng.papamakarios@ed.ac.uk\n\nTheo Pavlakou\n\nUniversity of Edinburgh\n\ntheo.pavlakou@ed.ac.uk\n\nIain Murray\n\nUniversity of Edinburgh\ni.murray@ed.ac.uk\n\nAbstract\n\nAutoregressive models are among the best performing neural density estimators.\nWe describe an approach for increasing the \ufb02exibility of an autoregressive model,\nbased on modelling the random numbers that the model uses internally when gen-\nerating data. By constructing a stack of autoregressive models, each modelling the\nrandom numbers of the next model in the stack, we obtain a type of normalizing\n\ufb02ow suitable for density estimation, which we call Masked Autoregressive Flow.\nThis type of \ufb02ow is closely related to Inverse Autoregressive Flow and is a gen-\neralization of Real NVP. 
Masked Autoregressive Flow achieves state-of-the-art\nperformance in a range of general-purpose density estimation tasks.\n\n1\n\nIntroduction\n\nThe joint density p(x) of a set of variables x is a central object of interest in machine learning. Being\nable to access and manipulate p(x) enables a wide range of tasks to be performed, such as inference,\nprediction, data completion and data generation. As such, the problem of estimating p(x) from a set\nof examples {xn} is at the core of probabilistic unsupervised learning and generative modelling.\nIn recent years, using neural networks for density estimation has been particularly successful. Combin-\ning the \ufb02exibility and learning capacity of neural networks with prior knowledge about the structure\nof data to be modelled has led to impressive results in modelling natural images [4, 30, 37, 38] and\naudio data [34, 36]. State-of-the-art neural density estimators have also been used for likelihood-free\ninference from simulated data [21, 23], variational inference [13, 24], and as surrogates for maximum\nentropy models [19].\nNeural density estimators differ from other approaches to generative modelling\u2014such as variational\nautoencoders [12, 25] and generative adversarial networks [7]\u2014in that they readily provide exact\ndensity evaluations. As such, they are more suitable in applications where the focus is on explicitly\nevaluating densities, rather than generating synthetic data. For instance, density estimators can learn\nsuitable priors for data from large unlabelled datasets, for use in standard Bayesian inference [39].\nIn simulation-based likelihood-free inference, conditional density estimators can learn models for\nthe likelihood [5] or the posterior [23] from simulated data. Density estimators can learn effective\nproposals for importance sampling [22] or sequential Monte Carlo [8, 21]; such proposals can be\nused in probabilistic programming environments to speed up inference [15, 16]. 
Finally, conditional\ndensity estimators can be used as \ufb02exible inference networks for amortized variational inference and\nas part of variational autoencoders [12, 25].\nA challenge in neural density estimation is to construct models that are \ufb02exible enough to represent\ncomplex densities, but have tractable density functions and learning algorithms. There are mainly\ntwo families of neural density estimators that are both \ufb02exible and tractable: autoregressive models\n[35] and normalizing \ufb02ows [24]. Autoregressive models decompose the joint density as a product of\nconditionals, and model each conditional in turn. Normalizing \ufb02ows transform a base density (e.g. a\nstandard Gaussian) into the target density by an invertible transformation with tractable Jacobian.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fOur starting point is the realization (as pointed out by Kingma et al. [13]) that autoregressive models,\nwhen used to generate data, correspond to a differentiable transformation of an external source of\nrandomness (typically obtained by random number generators). This transformation has a tractable\nJacobian by design, and for certain autoregressive models it is also invertible, hence it precisely\ncorresponds to a normalizing \ufb02ow. Viewing an autoregressive model as a normalizing \ufb02ow opens\nthe possibility of increasing its \ufb02exibility by stacking multiple models of the same type, by having\neach model provide the source of randomness for the next model in the stack. The resulting stack of\nmodels is a normalizing \ufb02ow that is more \ufb02exible than the original model, and that remains tractable.\nIn this paper we present Masked Autoregressive Flow (MAF), which is a particular implementation of\nthe above normalizing \ufb02ow that uses the Masked Autoencoder for Distribution Estimation (MADE)\n[6] as a building block. 
The use of MADE enables density evaluations without the sequential loop that is typical of autoregressive models, and thus makes MAF fast to evaluate and train on parallel computing architectures such as Graphics Processing Units (GPUs). We show a close theoretical connection between MAF and Inverse Autoregressive Flow (IAF) [13], which has been designed for variational inference instead of density estimation, and show that both correspond to generalizations of the successful Real NVP [4]. We experimentally evaluate MAF on a wide range of datasets, and we demonstrate that (a) MAF outperforms Real NVP on general-purpose density estimation, and (b) a conditional version of MAF achieves close to state-of-the-art performance on conditional image modelling even with a general-purpose architecture.

2 Background

2.1 Autoregressive density estimation

Using the chain rule of probability, any joint density p(x) can be decomposed into a product of one-dimensional conditionals as p(x) = \prod_i p(x_i \mid x_{1:i-1}). Autoregressive density estimators [35] model each conditional p(x_i | x_{1:i-1}) as a parametric density, whose parameters are a function of a hidden state h_i. In recurrent architectures, h_i is a function of the previous hidden state h_{i-1} and the ith input variable x_i. The Real-valued Neural Autoregressive Density Estimator (RNADE) [32] uses mixtures of Gaussian or Laplace densities for modelling the conditionals, and a simple linear rule for updating the hidden state. More flexible approaches for updating the hidden state are based on Long Short-Term Memory recurrent neural networks [30, 38].

A drawback of autoregressive models is that they are sensitive to the order of the variables. For example, the order of the variables matters when learning the density of Figure 1a if we assume a model with Gaussian conditionals.
As Figure 1b shows, a model with order (x1, x2) cannot learn this density, even though the same model with order (x2, x1) can represent it perfectly. In practice, it is hard to know which of the factorially many orders is the most suitable for the task at hand. Autoregressive models that are trained to work with an order chosen at random have been developed, and the predictions from different orders can then be combined in an ensemble [6, 33]. Our approach (Section 3) can use a different order in each layer, and using random orders would also be possible.

Straightforward recurrent autoregressive models would update a hidden state sequentially for every variable, requiring D sequential computations to compute the probability p(x) of a D-dimensional vector, which is not well-suited for computation on parallel architectures such as GPUs. One way to enable parallel computation is to start with a fully-connected model with D inputs and D outputs, and drop out connections in order to ensure that output i will only be connected to inputs 1, 2, ..., i−1. Output i can then be interpreted as computing the parameters of the ith conditional p(x_i | x_{1:i-1}). By construction, the resulting model will satisfy the autoregressive property, and at the same time it will be able to calculate p(x) efficiently on a GPU. An example of this approach is the Masked Autoencoder for Distribution Estimation (MADE) [6], which drops out connections by multiplying the weight matrices of a fully-connected autoencoder with binary masks. Other mechanisms for dropping out connections include masked convolutions [38] and causal convolutions [36].

2.2 Normalizing flows

A normalizing flow [24] represents p(x) as an invertible differentiable transformation f of a base density \pi_u(u). That is, x = f(u) where u \sim \pi_u(u).
The base density \pi_u(u) is chosen such that it can be easily evaluated for any input u (a common choice for \pi_u(u) is a standard Gaussian).

Figure 1: (a) The density to be learnt, defined as p(x_1, x_2) = \mathcal{N}(x_2 \mid 0, 4)\,\mathcal{N}(x_1 \mid \tfrac{1}{4}x_2^2, 1). (b) The density learnt by a MADE with order (x1, x2) and Gaussian conditionals. Scatter plot shows the train data transformed into random numbers u; the non-Gaussian distribution indicates that the model is a poor fit. (c) Learnt density and transformed train data of a 5-layer MAF with the same order (x1, x2).

Under the invertibility assumption for f, the density p(x) can be calculated as

    p(x) = \pi_u\!\left(f^{-1}(x)\right) \left| \det\!\left( \frac{\partial f^{-1}}{\partial x} \right) \right|.    (1)

In order for Equation (1) to be tractable, the transformation f must be constructed such that (a) it is easy to invert, and (b) the determinant of its Jacobian is easy to compute. An important point is that if transformations f1 and f2 have the above properties, then their composition f1 ∘ f2 also has these properties. In other words, the transformation f can be made deeper by composing multiple instances of it, and the result will still be a valid normalizing flow.

There have been various approaches in developing normalizing flows. An early example is Gaussianization [2], which is based on successive application of independent component analysis. Enforcing invertibility with nonsingular weight matrices has been proposed [1, 26]; however, in such approaches calculating the determinant of the Jacobian scales cubically with data dimensionality in general. Planar/radial flows [24] and Inverse Autoregressive Flow (IAF) [13] are models whose Jacobian is tractable by design.
However, they were developed primarily for variational inference and are not well-suited for density estimation, as they can only efficiently calculate the density of their own samples and not of externally provided datapoints. The Non-linear Independent Components Estimator (NICE) [3] and its successor Real NVP [4] have a tractable Jacobian and are also suitable for density estimation. IAF, NICE and Real NVP are discussed in more detail in Section 3.

3 Masked Autoregressive Flow

3.1 Autoregressive models as normalizing flows

Consider an autoregressive model whose conditionals are parameterized as single Gaussians. That is, the ith conditional is given by

    p(x_i \mid x_{1:i-1}) = \mathcal{N}\!\left(x_i \mid \mu_i, (\exp \alpha_i)^2\right) \quad \text{where} \quad \mu_i = f_{\mu_i}(x_{1:i-1}) \text{ and } \alpha_i = f_{\alpha_i}(x_{1:i-1}).    (2)

In the above, f_{\mu_i} and f_{\alpha_i} are unconstrained scalar functions that compute the mean and log standard deviation of the ith conditional given all previous variables. We can generate data from the above model using the following recursion:

    x_i = u_i \exp(\alpha_i) + \mu_i \quad \text{where} \quad \mu_i = f_{\mu_i}(x_{1:i-1}),\ \alpha_i = f_{\alpha_i}(x_{1:i-1}) \text{ and } u_i \sim \mathcal{N}(0, 1).    (3)

In the above, u = (u_1, u_2, ..., u_I) is the vector of random numbers the model uses internally to generate data, typically by making calls to a random number generator often called randn(). Equation (3) provides an alternative characterization of the autoregressive model as a transformation f from the space of random numbers u to the space of data x. That is, we can express the model as x = f(u) where u \sim \mathcal{N}(0, I). By construction, f is easily invertible.
Given a datapoint x, the random numbers u that were used to generate it are obtained by the following recursion:

    u_i = (x_i - \mu_i) \exp(-\alpha_i) \quad \text{where} \quad \mu_i = f_{\mu_i}(x_{1:i-1}) \text{ and } \alpha_i = f_{\alpha_i}(x_{1:i-1}).    (4)

Due to the autoregressive structure, the Jacobian of f^{-1} is triangular by design, hence its absolute determinant can be easily obtained as follows:

    \left| \det\!\left( \frac{\partial f^{-1}}{\partial x} \right) \right| = \exp\!\left( -\sum_i \alpha_i \right) \quad \text{where} \quad \alpha_i = f_{\alpha_i}(x_{1:i-1}).    (5)

It follows that the autoregressive model can be equivalently interpreted as a normalizing flow, whose density p(x) can be obtained by substituting Equations (4) and (5) into Equation (1). This observation was first pointed out by Kingma et al. [13].

A useful diagnostic for assessing whether an autoregressive model of the above type fits the target density well is to transform the train data {x_n} into corresponding random numbers {u_n} using Equation (4), and assess whether the u_i's come from independent standard normals. If the u_i's do not seem to come from independent standard normals, this is evidence that the model is a bad fit. For instance, Figure 1b shows that the scatter plot of the random numbers associated with the train data can look significantly non-Gaussian if the model fits the target density poorly.

Here we interpret autoregressive models as a flow, and improve the model fit by stacking multiple instances of the model into a deeper flow. Given autoregressive models M1, M2, ..., MK, we model the density of the random numbers u1 of M1 with M2, model the random numbers u2 of M2 with M3 and so on, finally modelling the random numbers uK of MK with a standard Gaussian.
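As a concrete illustration, Equations (3)–(5) can be sketched in a few lines of NumPy. This is the naive sequential version of the density computation; a MADE produces all µ_i and α_i in a single forward pass instead. The function names and the toy constant conditioners below are illustrative, not the paper's implementation.

```python
import numpy as np

def log_prob_autoregressive(x, f_mu, f_alpha):
    """Log density of a Gaussian autoregressive model viewed as a flow:
    invert Equation (3) to recover u (Equation (4)), then apply the change
    of variables with the triangular-Jacobian log-determinant (Equation (5))."""
    D = len(x)
    u = np.empty(D)
    alphas = np.empty(D)
    for i in range(D):
        mu, alpha = f_mu(x[:i]), f_alpha(x[:i])  # conditioned on the prefix x_{1:i-1}
        alphas[i] = alpha
        u[i] = (x[i] - mu) * np.exp(-alpha)      # Equation (4)
    # log N(u | 0, I) plus the log |det| term, Equation (5)
    log_base = -0.5 * D * np.log(2 * np.pi) - 0.5 * np.sum(u ** 2)
    return log_base - np.sum(alphas)

# toy conditionals with constant mean 0 and log std 0, i.e. p(x) = N(x | 0, I)
lp = log_prob_autoregressive(np.zeros(2), lambda prefix: 0.0, lambda prefix: 0.0)
```

With these constant conditioners the model reduces to a standard Gaussian, so the returned value matches log N(0 | 0, I) in two dimensions.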
This stacking\nadds \ufb02exibility: for example, Figure 1c demonstrates that a \ufb02ow of 5 autoregressive models is able\nto learn multimodal conditionals, even though each model has unimodal conditionals. Stacking has\npreviously been used in a similar way to improve model \ufb01t of deep belief nets [9] and deep mixtures\nof factor analyzers [28].\nWe choose to implement the set of functions {f\u00b5i, f\u03b1i} with masking, following the approach used\nby MADE [6]. MADE is a feedforward network that takes x as input and outputs \u00b5i and \u03b1i for\nall i with a single forward pass. The autoregressive property is enforced by multiplying the weight\nmatrices of MADE with suitably constructed binary masks. In other words, we use MADE with\nGaussian conditionals as the building layer of our \ufb02ow. The bene\ufb01t of using masking is that it\nenables transforming from data x to random numbers u and thus calculating p(x) in one forward\npass through the \ufb02ow, thus eliminating the need for sequential recursion as in Equation (4). We call\nthis implementation of stacking MADEs into a \ufb02ow Masked Autoregressive Flow (MAF).\n\n3.2 Relationship with Inverse Autoregressive Flow\n\nLike MAF, Inverse Autoregressive Flow (IAF) [13] is a normalizing \ufb02ow which uses MADE as its\ncomponent layer. Each layer of IAF is de\ufb01ned by the following recursion:\n\nxi = ui exp \u03b1i + \u00b5i where \u00b5i = f\u00b5i(u1:i\u22121) and \u03b1i = f\u03b1i (u1:i\u22121).\n\n(6)\nSimilarly to MAF, functions {f\u00b5i, f\u03b1i} are computed using a MADE with Gaussian conditionals.\nThe difference is architectural: in MAF \u00b5i and \u03b1i are directly computed from previous data variables\nx1:i\u22121, whereas in IAF \u00b5i and \u03b1i are directly computed from previous random numbers u1:i\u22121.\nThe consequence of the above is that MAF and IAF are different models with different computational\ntrade-offs. 
MAF is capable of calculating the density p(x) of any datapoint x in one pass through the model; however, sampling from it requires performing D sequential passes (where D is the dimensionality of x). In contrast, IAF can generate samples and calculate their density with one pass; however, calculating the density p(x) of an externally provided datapoint x requires D passes to find the random numbers u associated with x. Hence, the design choice of whether to connect µ_i and α_i directly to x_{1:i-1} (obtaining MAF) or to u_{1:i-1} (obtaining IAF) depends on the intended usage. IAF is suitable as a recognition model for stochastic variational inference [12, 25], where it only ever needs to calculate the density of its own samples. In contrast, MAF is more suitable for density estimation, because each example requires only one pass through the model whereas IAF requires D.

A theoretical equivalence between MAF and IAF is that training a MAF with maximum likelihood corresponds to fitting an implicit IAF to the base density with stochastic variational inference. Let \pi_x(x) be the data density we wish to learn, \pi_u(u) be the base density, and f be the transformation from u to x as implemented by MAF. The density defined by MAF (with added subscript x for disambiguation) is

    p_x(x) = \pi_u\!\left(f^{-1}(x)\right) \left| \det\!\left( \frac{\partial f^{-1}}{\partial x} \right) \right|.    (7)

The inverse transformation f^{-1} from x to u can be seen as describing an implicit IAF with base density \pi_x(x), which defines the following implicit density over the u space:

    p_u(u) = \pi_x(f(u)) \left| \det\!\left( \frac{\partial f}{\partial u} \right) \right|.    (8)

Training MAF by maximizing the total log likelihood \sum_n \log p(x_n) on train data {x_n} corresponds to fitting p_x(x) to \pi_x(x) by stochastically minimizing D_{KL}(\pi_x(x) \,\|\, p_x(x)).
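This equivalence of KL objectives can be checked directly by the change of variables x = f(u): if u \sim p_u(u) as defined in Equation (8), then x = f(u) \sim \pi_x(x), and the Jacobian terms cancel. A short derivation (using Equations (7) and (8)):

```latex
\begin{aligned}
D_{\mathrm{KL}}(\pi_x(\mathbf{x}) \,\|\, p_x(\mathbf{x}))
  &= \mathbb{E}_{\pi_x(\mathbf{x})}\left[\log \pi_x(\mathbf{x}) - \log p_x(\mathbf{x})\right] \\
  &= \mathbb{E}_{p_u(\mathbf{u})}\left[\log \pi_x(f(\mathbf{u})) - \log p_x(f(\mathbf{u}))\right] \\
  &= \mathbb{E}_{p_u(\mathbf{u})}\left[\left(\log p_u(\mathbf{u}) - \log\left|\det\tfrac{\partial f}{\partial \mathbf{u}}\right|\right)
      - \left(\log \pi_u(\mathbf{u}) - \log\left|\det\tfrac{\partial f}{\partial \mathbf{u}}\right|\right)\right] \\
  &= D_{\mathrm{KL}}(p_u(\mathbf{u}) \,\|\, \pi_u(\mathbf{u})),
\end{aligned}
```

where the third line rewrites the first term with Equation (8) and the second term with Equation (7) evaluated at x = f(u), i.e. \log p_x(f(\mathbf{u})) = \log \pi_u(\mathbf{u}) - \log\left|\det \partial f / \partial \mathbf{u}\right|.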
In Section A of the supplementary material, we show that

    D_{\mathrm{KL}}(\pi_x(x) \,\|\, p_x(x)) = D_{\mathrm{KL}}(p_u(u) \,\|\, \pi_u(u)).    (9)

Hence, stochastically minimizing D_{KL}(\pi_x(x) \,\|\, p_x(x)) is equivalent to fitting p_u(u) to \pi_u(u) by minimizing D_{KL}(p_u(u) \,\|\, \pi_u(u)). Since the latter is the loss function used in variational inference, and p_u(u) can be seen as an IAF with base density \pi_x(x) and transformation f^{-1}, it follows that training MAF as a density estimator of \pi_x(x) is equivalent to performing stochastic variational inference with an implicit IAF, where the posterior is taken to be the base density \pi_u(u) and the transformation f^{-1} implements the reparameterization trick [12, 25]. This argument is presented in more detail in Section A of the supplementary material.

3.3 Relationship with Real NVP

Real NVP [4] (NVP stands for Non Volume Preserving) is a normalizing flow obtained by stacking coupling layers. A coupling layer is an invertible transformation f from random numbers u to data x with a tractable Jacobian, defined by

    x_{1:d} = u_{1:d}
    x_{d+1:D} = u_{d+1:D} \odot \exp(\alpha) + \mu \quad \text{where} \quad \mu = f_\mu(u_{1:d}) \text{ and } \alpha = f_\alpha(u_{1:d}).    (10)

In the above, \odot denotes elementwise multiplication, and the exp is applied to each element of \alpha. The transformation copies the first d elements, and scales and shifts the remaining D−d elements, with the amount of scaling and shifting being a function of the first d elements. When stacking coupling layers into a flow, the elements are permuted across layers so that a different set of elements is copied each time.
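The coupling layer of Equation (10) can be sketched directly: because the first d elements pass through unchanged, the Jacobian is triangular with diagonal (1, ..., 1, e^{α_1}, ..., e^{α_{D−d}}), so the log-determinant is just the sum of α. The linear conditioners below are hypothetical stand-ins for the two feedforward networks.

```python
import numpy as np

def coupling_forward(u, f_mu, f_alpha, d):
    """Real NVP coupling layer (Equation 10): copy the first d elements,
    scale-and-shift the remaining D-d as a function of the first d."""
    mu, alpha = f_mu(u[:d]), f_alpha(u[:d])
    x = np.concatenate([u[:d], u[d:] * np.exp(alpha) + mu])
    log_det = np.sum(alpha)  # triangular Jacobian with unit block and diag exp(alpha)
    return x, log_det

def coupling_inverse(x, f_mu, f_alpha, d):
    """Inverse of the coupling layer; also a single pass, since x_{1:d} = u_{1:d}."""
    mu, alpha = f_mu(x[:d]), f_alpha(x[:d])
    return np.concatenate([x[:d], (x[d:] - mu) * np.exp(-alpha)])

# toy check with hypothetical linear conditioners (D = 3, d = 1)
f_mu = lambda h: np.array([2.0 * h[0], -1.0 * h[0]])
f_alpha = lambda h: np.array([0.5 * h[0], 0.1 * h[0]])
u0 = np.array([1.0, -0.3, 0.7])
x0, ld = coupling_forward(u0, f_mu, f_alpha, d=1)
u1 = coupling_inverse(x0, f_mu, f_alpha, d=1)
```

Note that both directions need only one pass through the conditioners, which is the computational advantage of Real NVP discussed next.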
A special case of the coupling layer where \alpha = 0 is used by NICE [3].

We can see that the coupling layer is a special case of both the autoregressive transformation used by MAF in Equation (3), and the autoregressive transformation used by IAF in Equation (6). Indeed, we can recover the coupling layer from the autoregressive transformation of MAF by setting \mu_i = \alpha_i = 0 for i \le d and making \mu_i and \alpha_i functions of only x_{1:d} for i > d (for IAF we need to make \mu_i and \alpha_i functions of u_{1:d} instead for i > d). In other words, both MAF and IAF can be seen as more flexible (but different) generalizations of Real NVP, where each element is individually scaled and shifted as a function of all previous elements. The advantage of Real NVP compared to MAF and IAF is that it can both generate data and estimate densities with one forward pass only, whereas MAF would need D passes to generate data and IAF would need D passes to estimate densities.

3.4 Conditional MAF

Given a set of example pairs {(x_n, y_n)}, conditional density estimation is the task of estimating the conditional density p(x | y). Autoregressive modelling extends naturally to conditional density estimation. Each term in the chain rule of probability can be conditioned on side-information y, decomposing any conditional density as p(x \mid y) = \prod_i p(x_i \mid x_{1:i-1}, y). Therefore, we can turn any unconditional autoregressive model into a conditional one by augmenting its set of input variables with y and only modelling the conditionals that correspond to x. Any order of the variables can be chosen, as long as y comes before x. In masked autoregressive models, no connections need to be dropped from the y inputs to the rest of the network.

We can implement a conditional version of MAF by stacking MADEs that were made conditional using the above strategy. That is, in a conditional MAF, the vector y becomes an additional input for every layer.
As a special case of MAF, Real NVP can be made conditional in the same way.\nIn Section 4, we show that conditional MAF signi\ufb01cantly outperforms unconditional MAF when\nconditional information (such as data labels) is available. In our experiments, MAF was able to\nbene\ufb01t from conditioning considerably more than MADE and Real NVP.\n\n5\n\n\f4 Experiments\n\n4.1\n\nImplementation and setup\n\nWe systematically evaluate three types of density estimator (MADE, Real NVP and MAF) in terms\nof density estimation performance on a variety of datasets. Code for reproducing our experiments\n(which uses Theano [29]) can be found at https://github.com/gpapamak/maf.\nMADE. We consider two versions: (a) a MADE with Gaussian conditionals, denoted simply by\nMADE, and (b) a MADE whose conditionals are each parameterized as a mixture of C Gaussians,\ndenoted by MADE MoG. We used C = 10 in all our experiments. MADE can be seen either as a\nMADE MoG with C = 1, or as a MAF with only one autoregressive layer. Adding more Gaussian\ncomponents per conditional or stacking MADEs to form a MAF are two alternative ways of increasing\nthe \ufb02exibility of MADE, which we are interested in comparing.\nReal NVP. We consider a general-purpose implementation of the coupling layer, which uses two\nfeedforward neural networks, implementing the scaling function f\u03b1 and the shifting function f\u00b5\nrespectively. Both networks have the same architecture, except that f\u03b1 has hyperbolic tangent hidden\nunits, whereas f\u00b5 has recti\ufb01ed linear hidden units (we found this combination to perform best). Both\nnetworks have a linear output. We consider Real NVPs with either 5 or 10 coupling layers, denoted\nby Real NVP (5) and Real NVP (10) respectively, and in both cases the base density is a standard\nGaussian. 
Successive coupling layers alternate between (a) copying the odd-indexed variables and\ntransforming the even-indexed variables, and (b) copying the even-indexed variables and transforming\nthe odd-indexed variables. It is important to clarify that this is a general-purpose implementation of\nReal NVP which is different and thus not comparable to its original version [4], which was designed\nspeci\ufb01cally for image data. Here we are interested in comparing coupling layers with autoregressive\nlayers as building blocks of normalizing \ufb02ows for general-purpose density estimation tasks, and our\ndesign of Real NVP is such that a fair comparison between the two can be made.\nMAF. We consider three versions: (a) a MAF with 5 autoregressive layers and a standard Gaussian as\na base density \u03c0u(u), denoted by MAF (5), (b) a MAF with 10 autoregressive layers and a standard\nGaussian as a base density, denoted by MAF (10), and (c) a MAF with 5 autoregressive layers and a\nMADE MoG with C = 10 Gaussian components as a base density, denoted by MAF MoG (5). MAF\nMoG (5) can be thought of as a MAF (5) stacked on top of a MADE MoG and trained jointly with it.\nIn all experiments, MADE and MADE MoG order the inputs using the order that comes with the\ndataset by default; no alternative orders were considered. MAF uses the default order for the \ufb01rst\nautoregressive layer (i.e. the layer that directly models the data) and reverses the order for each\nsuccessive layer (the same was done for IAF by Kingma et al. [13]).\nMADE, MADE MoG and each layer in MAF is a feedforward neural network with masked weight\nmatrices, such that the autoregressive property holds. The procedure for designing the masks (due to\nGermain et al. [6]) is as follows. Each input or hidden unit is assigned a degree, which is an integer\nranging from 1 to D, where D is the data dimensionality. The degree of an input is taken to be its\nindex in the order. 
The D outputs have degrees that sequentially range from 0 to D−1. A unit is allowed to receive input only from units with lower or equal degree, which enforces the autoregressive property. In order for output i to be connected to all inputs with degree less than i, and thus make sure that no conditional independences are introduced, it is both necessary and sufficient that every hidden layer contains every degree. In all experiments except for CIFAR-10, we sequentially assign degrees within each hidden layer and use enough hidden units to make sure that all degrees appear. Because CIFAR-10 is high-dimensional, we used fewer hidden units than inputs and assigned degrees to hidden units uniformly at random (as was done by Germain et al. [6]).

We added batch normalization [10] after each coupling layer in Real NVP and after each autoregressive layer in MAF. Batch normalization is an elementwise scaling and shifting, which is easily invertible and has a tractable Jacobian, and thus it is suitable for use in a normalizing flow. We found that batch normalization in Real NVP and MAF reduces training time, increases stability during training and improves performance (as observed by Dinh et al. [4] for Real NVP). Section B of the supplementary material discusses our implementation of batch normalization and its use in normalizing flows.

All models were trained with the Adam optimizer [11], using a minibatch size of 100, and a step size of 10^{-3} for MADE and MADE MoG, and of 10^{-4} for Real NVP and MAF. A small amount of \ell_2

Table 1: Average test log likelihood (in nats) for unconditional density estimation. The best performing model for each dataset is shown in bold (multiple models are highlighted if the difference is not statistically significant according to a paired t-test).
Error bars correspond to 2 standard deviations.

| Model         | POWER        | GAS           | HEPMASS        | MINIBOONE      | BSDS300        |
|---------------|--------------|---------------|----------------|----------------|----------------|
| Gaussian      | −7.74 ± 0.02 | −3.58 ± 0.75  | −27.93 ± 0.02  | −37.24 ± 1.07  | 96.67 ± 0.25   |
| MADE          | −3.08 ± 0.03 | 3.56 ± 0.04   | −20.98 ± 0.02  | −15.59 ± 0.50  | 148.85 ± 0.28  |
| MADE MoG      | 0.40 ± 0.01  | 8.47 ± 0.02   | −15.15 ± 0.02  | −12.27 ± 0.47  | 153.71 ± 0.28  |
| Real NVP (5)  | −0.02 ± 0.01 | 4.78 ± 1.80   | −19.62 ± 0.02  | −13.55 ± 0.49  | 152.97 ± 0.28  |
| Real NVP (10) | 0.17 ± 0.01  | 8.33 ± 0.14   | −18.71 ± 0.02  | −13.84 ± 0.52  | 153.28 ± 1.78  |
| MAF (5)       | 0.14 ± 0.01  | 9.07 ± 0.02   | −17.70 ± 0.02  | −11.75 ± 0.44  | 155.69 ± 0.28  |
| MAF (10)      | 0.24 ± 0.01  | 10.08 ± 0.02  | −17.73 ± 0.02  | −12.24 ± 0.45  | 154.93 ± 0.28  |
| MAF MoG (5)   | 0.30 ± 0.01  | 9.59 ± 0.02   | −17.39 ± 0.02  | −11.68 ± 0.44  | 156.36 ± 0.28  |

regularization was added, with coefficient 10^{-6}. Each model was trained with early stopping until no improvement occurred for 30 consecutive epochs on the validation set. For each model, we selected the number of hidden layers and number of hidden units based on validation performance (we gave the same options to all models), as described in Section D of the supplementary material.

4.2 Unconditional density estimation

Following Uria et al. [32], we perform unconditional density estimation on four UCI datasets (POWER, GAS, HEPMASS, MINIBOONE) and on a dataset of natural image patches (BSDS300).

UCI datasets. These datasets were taken from the UCI machine learning repository [18]. We selected different datasets than Uria et al.
[32], because the ones they used were much smaller, resulting in\nan expensive cross-validation procedure involving a separate hyperparameter search for each fold.\nHowever, our data preprocessing follows Uria et al. [32]. The sample mean was subtracted from the\ndata and each feature was divided by its sample standard deviation. Discrete-valued attributes were\neliminated, as well as every attribute with a Pearson correlation coef\ufb01cient greater than 0.98. These\nprocedures are meant to avoid trivial high densities, which would make the comparison between\napproaches hard to interpret. Section D of the supplementary material gives more details about the\nUCI datasets and the individual preprocessing done on each of them.\nImage patches. This dataset was obtained by extracting random 8\u00d78 monochrome patches from\nthe BSDS300 dataset of natural images [20]. We used the same preprocessing as by Uria et al. [32].\nUniform noise was added to dequantize pixel values, which was then rescaled to be in the range [0, 1].\nThe mean pixel value was subtracted from each patch, and the bottom-right pixel was discarded.\nTable 1 shows the performance of each model on each dataset. A Gaussian \ufb01tted to the train data is\nreported as a baseline. We can see that on 3 out of 5 datasets MAF is the best performing model, with\nMADE MoG being the best performing model on the other 2. On all datasets, MAF outperforms\nReal NVP. For the MINIBOONE dataset, due to overlapping error bars, a pairwise comparison was\ndone to determine which model performs the best, the results of which are reported in Section E\nof the supplementary material. MAF MoG (5) achieves the best reported result on BSDS300 for a\nsingle model with 156.36 nats, followed by Deep RNADE [33] with 155.2. An ensemble of 32 Deep\nRNADEs was reported to achieve 157.0 nats [33]. 
The UCI datasets are used here for density estimation for the first time in the literature, so no comparison with existing work can be made yet.

4.3 Conditional density estimation

For conditional density estimation, we used the MNIST dataset of handwritten digits [17] and the CIFAR-10 dataset of natural images [14]. In both datasets, each datapoint comes from one of 10 distinct classes. We represent the class label as a 10-dimensional, one-hot encoded vector y, and we model the density p(x | y), where x represents an image. At test time, we evaluate the probability of a test image x by p(x) = ∑_y p(x | y) p(y), where p(y) = 1/10 is a uniform prior over the labels. For comparison, we also train every model as an unconditional density estimator and report both results.

Table 2: Average test log likelihood (in nats) for conditional density estimation. The best performing model for each dataset is shown in bold. Error bars correspond to 2 standard deviations.

                 MNIST                               CIFAR-10
                 unconditional    conditional        unconditional    conditional
Gaussian         −1366.9 ± 1.4    −1344.7 ± 1.8      2367 ± 29        2030 ± 41
MADE             −1380.8 ± 4.8    −1361.9 ± 1.9       147 ± 20         187 ± 20
MADE MoG         −1038.5 ± 1.8    −1030.3 ± 1.7      −397 ± 21        −119 ± 20
Real NVP (5)     −1323.2 ± 6.6    −1326.3 ± 5.8      2576 ± 27        2642 ± 26
Real NVP (10)    −1370.7 ± 10.1   −1371.3 ± 43.9     2568 ± 26        2475 ± 25
MAF (5)          −1300.5 ± 1.7     −591.7 ± 1.7      2936 ± 27        5797 ± 26
MAF (10)         −1313.1 ± 2.0     −605.6 ± 1.8      3049 ± 26        5872 ± 26
MAF MoG (5)      −1100.3 ± 1.6    −1092.3 ± 1.7      2911 ± 26        2936 ± 26

For both MNIST and CIFAR-10, we use the same preprocessing as by Dinh et al. [4].
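The test-time evaluation above, p(x) = ∑_y p(x | y) p(y) with a uniform prior p(y) = 1/10, is typically computed in log space with the log-sum-exp trick, since the conditional log densities can be very negative. A minimal sketch (the function name is ours):

```python
import numpy as np

def marginal_log_prob(cond_log_probs):
    """Given log p(x | y) for each of the K class labels y (last axis),
    return log p(x) = log sum_y p(x | y) p(y) under a uniform prior
    p(y) = 1/K, using log-sum-exp for numerical stability."""
    cond_log_probs = np.asarray(cond_log_probs)
    m = cond_log_probs.max(axis=-1, keepdims=True)
    # log((1/K) * sum_y exp(l_y)) = m + log(mean_y(exp(l_y - m)))
    return m.squeeze(-1) + np.log(np.exp(cond_log_probs - m).mean(axis=-1))
```

For example, with two classes and conditional densities 0.1 and 0.3 the marginal is (0.1 + 0.3)/2 = 0.2, and the function returns log 0.2.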
We dequantize pixel values by adding uniform noise, and then rescale them to [0, 1]. We transform the rescaled pixel values into logit space by x ↦ logit(λ + (1 − 2λ)x), where λ = 10^−6 for MNIST and λ = 0.05 for CIFAR-10, and perform density estimation in that space. In the case of CIFAR-10, we also augment the training set with horizontal flips of all training examples (as also done by Dinh et al. [4]).

Table 2 shows the results on MNIST and CIFAR-10. The performance of a class-conditional Gaussian is reported as a baseline for the conditional case. Log likelihoods are calculated in logit space. For unconditional density estimation, MADE MoG is the best performing model on MNIST, whereas MAF is the best performing model on CIFAR-10. For conditional density estimation, MAF is by far the best performing model on both datasets. On CIFAR-10, both MADE and MADE MoG performed significantly worse than the Gaussian baseline. MAF outperforms Real NVP in all cases.

The conditional performance of MAF is particularly impressive: MAF performs almost twice as well as its unconditional version and as every other model's conditional version. To facilitate comparison with the literature, Section E of the supplementary material reports results in bits/pixel. MAF (5) and MAF (10), the two best performing conditional models, achieve 3.02 and 2.98 bits/pixel respectively on CIFAR-10.
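The dequantization and logit transform described at the start of this section can be sketched as below. The division by 256 for rescaling 8-bit pixel values is our assumption, consistent with common practice for this preprocessing; function names are ours:

```python
import numpy as np

def to_logit_space(pixels, lam, rng):
    """Dequantize 8-bit pixel values with uniform noise, rescale to [0, 1],
    then map into logit space via x -> logit(lam + (1 - 2*lam) * x),
    where density estimation is performed."""
    x = (pixels + rng.uniform(size=np.shape(pixels))) / 256.0  # dequantize, rescale
    x = lam + (1 - 2 * lam) * x       # squeeze values away from 0 and 1
    return np.log(x) - np.log1p(-x)   # logit(x)

def from_logit_space(z, lam):
    """Inverse map, e.g. for viewing model samples as images."""
    x = 1.0 / (1.0 + np.exp(-z))      # sigmoid
    return (x - lam) / (1 - 2 * lam)
```

Because the transform is invertible with a tractable Jacobian, log likelihoods computed in logit space can be related back to the original pixel space, which is how the bits/pixel figures above are obtained.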
These figures are very close to the state-of-the-art 2.94 bits/pixel achieved by a conditional PixelCNN++ [27], even though, unlike PixelCNN++, our version of MAF does not incorporate prior image knowledge, and it pays a price for doing density estimation in a transformed real-valued space (PixelCNN++ directly models discrete pixel values).

5 Discussion

We showed that we can improve MADE by modelling the density of its internal random numbers. Alternatively, MADE can be improved by increasing the flexibility of its conditionals. The comparison between MAF and MADE MoG showed that the best approach is dataset specific; in our experiments MAF outperformed MADE MoG in 6 out of 9 cases, which is strong evidence of its competitiveness. MADE MoG is a universal density approximator; with sufficiently many hidden units and Gaussian components, it can approximate any continuous density arbitrarily well. It is an open question whether MAF with a Gaussian base density has a similar property (MAF MoG clearly does).

We also showed that the coupling layer used in Real NVP is a special case of the autoregressive layer used in MAF. In fact, MAF outperformed Real NVP in all our experiments. Real NVP has achieved impressive performance in image modelling by incorporating knowledge about image structure. Our results suggest that replacing coupling layers with autoregressive layers in the original version of Real NVP is a promising direction for further improving its performance. However, Real NVP maintains the advantage over MAF (and autoregressive models in general) that samples from the model can be generated efficiently in parallel.

MAF achieved impressive results in conditional density estimation.
Whereas almost all models we considered benefited from the additional information supplied by the labels, MAF nearly doubled its performance, coming close to state-of-the-art models for image modelling without incorporating any prior image knowledge. The ability of MAF to benefit significantly from conditional knowledge suggests that automatic discovery of conditional structure (e.g. finding labels by clustering) could be a promising direction for improving unconditional density estimation in general.

Density estimation is one of several types of generative modelling, with the focus on obtaining accurate densities. However, we know that accurate densities do not necessarily imply good performance in other tasks, such as data generation [31]. Alternative approaches to generative modelling include variational autoencoders [12, 25], which are capable of efficient inference of their (potentially interpretable) latent space, and generative adversarial networks [7], which are capable of high quality data generation. Choice of method should be informed by whether the application at hand calls for accurate densities, latent space inference or high quality samples. Masked Autoregressive Flow is a contribution towards the first of these goals.

Acknowledgments

We thank Maria Gorinova for useful comments. George Papamakarios and Theo Pavlakou were supported by the Centre for Doctoral Training in Data Science, funded by EPSRC (grant EP/L016427/1) and the University of Edinburgh. George Papamakarios was also supported by Microsoft Research through its PhD Scholarship Programme.

References

[1] J. Ballé, V. Laparra, and E. P. Simoncelli. Density modeling of images using a generalized normalization transformation. Proceedings of the 4th International Conference on Learning Representations, 2016.

[2] S. S. Chen and R. A. Gopinath. Gaussianization.
Advances in Neural Information Processing Systems 13,\n\npages 423\u2013429, 2001.\n\n[3] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear Independent Components Estimation.\n\narXiv:1410.8516, 2014.\n\n[4] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. Proceedings of the 5th\n\nInternational Conference on Learning Representations, 2017.\n\n[5] Y. Fan, D. J. Nott, and S. A. Sisson. Approximate Bayesian computation via regression density estimation.\n\nStat, 2(1):34\u201348, 2013.\n\n[6] M. Germain, K. Gregor, I. Murray, and H. Larochelle. MADE: Masked Autoencoder for Distribution\nEstimation. Proceedings of the 32nd International Conference on Machine Learning, pages 881\u2013889,\n2015.\n\n[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio.\nGenerative adversarial nets. Advances in Neural Information Processing Systems 27, pages 2672\u20132680,\n2014.\n\n[8] S. Gu, Z. Ghahramani, and R. E. Turner. Neural adaptive sequential Monte Carlo. Advances in Neural\n\nInformation Processing Systems 28, pages 2629\u20132637, 2015.\n\n[9] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation,\n\n18(7):1527\u20131554, 2006.\n\n[10] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal\ncovariate shift. Proceedings of the 32nd International Conference on Machine Learning, pages 448\u2013456,\n2015.\n\n[11] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. Proceedings of the 3rd International\n\nConference on Learning Representations, 2015.\n\n[12] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. Proceedings of the 2nd International\n\nConference on Learning Representations, 2014.\n\n[13] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational\ninference with Inverse Autoregressive Flow. 
Advances in Neural Information Processing Systems 29, pages 4743–4751, 2016.

[14] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[15] T. D. Kulkarni, P. Kohli, J. B. Tenenbaum, and V. Mansinghka. Picture: A probabilistic programming language for scene perception. IEEE Conference on Computer Vision and Pattern Recognition, pages 4390–4399, 2015.

[16] T. A. Le, A. G. Baydin, and F. Wood. Inference compilation and universal probabilistic programming. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.

[17] Y. LeCun, C. Cortes, and C. J. C. Burges. The MNIST database of handwritten digits. URL http://yann.lecun.com/exdb/mnist/.

[18] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

[19] G. Loaiza-Ganem, Y. Gao, and J. P. Cunningham. Maximum entropy flow networks. Proceedings of the 5th International Conference on Learning Representations, 2017.

[20] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. pages 416–423, 2001.

[21] B. Paige and F. Wood. Inference networks for sequential Monte Carlo in graphical models. Proceedings of the 33rd International Conference on Machine Learning, 2016.

[22] G. Papamakarios and I. Murray. Distilling intractable generative models, 2015. Probabilistic Integration Workshop at Neural Information Processing Systems 28.

[23] G. Papamakarios and I. Murray. Fast ε-free inference of simulation models with Bayesian conditional density estimation. Advances in Neural Information Processing Systems 29, 2016.

[24] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows.
Proceedings of the 32nd International Conference on Machine Learning, pages 1530–1538, 2015.

[25] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286, 2014.

[26] O. Rippel and R. P. Adams. High-dimensional probability estimation with deep density models. arXiv:1302.5125, 2013.

[27] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv:1701.05517, 2017.

[28] Y. Tang, R. Salakhutdinov, and G. Hinton. Deep mixtures of factor analysers. Proceedings of the 29th International Conference on Machine Learning, pages 505–512, 2012.

[29] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv:1605.02688, 2016.

[30] L. Theis and M. Bethge. Generative image modeling using spatial LSTMs. Advances in Neural Information Processing Systems 28, pages 1927–1935, 2015.

[31] L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. Proceedings of the 4th International Conference on Learning Representations, 2016.

[32] B. Uria, I. Murray, and H. Larochelle. RNADE: The real-valued neural autoregressive density-estimator. Advances in Neural Information Processing Systems 26, pages 2175–2183, 2013.

[33] B. Uria, I. Murray, and H. Larochelle. A deep and tractable density estimator. Proceedings of the 31st International Conference on Machine Learning, pages 467–475, 2014.

[34] B. Uria, I. Murray, S. Renals, C. Valentini-Botinhao, and J. Bridle. Modelling acoustic feature dependencies with artificial neural networks: Trajectory-RNADE. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4465–4469, 2015.

[35] B.
Uria, M.-A. Côté, K. Gregor, I. Murray, and H. Larochelle. Neural autoregressive distribution estimation. Journal of Machine Learning Research, 17(205):1–37, 2016.

[36] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.

[37] A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves. Conditional image generation with PixelCNN decoders. Advances in Neural Information Processing Systems 29, pages 4790–4798, 2016.

[38] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. Proceedings of the 33rd International Conference on Machine Learning, pages 1747–1756, 2016.

[39] D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. Proceedings of the 13th International Conference on Computer Vision, pages 479–486, 2011.