{"title": "Neural Spline Flows", "book": "Advances in Neural Information Processing Systems", "page_first": 7511, "page_last": 7522, "abstract": "A normalizing flow models a complex probability density as an invertible transformation of a simple base density. Flows based on either coupling or autoregressive transforms both offer exact density evaluation and sampling, but rely on the parameterization of an easily invertible elementwise transformation, whose choice determines the flexibility of these models. Building upon recent work, we propose a fully-differentiable module based on monotonic rational-quadratic splines, which enhances the flexibility of both coupling and autoregressive transforms while retaining analytic invertibility. We demonstrate that neural spline flows improve density estimation, variational inference, and generative modeling of images.", "full_text": "Neural Spline Flows\n\nConor Durkan\u2217 Artur Bekasov\u2217\n\nIain Murray George Papamakarios\n\nSchool of Informatics, University of Edinburgh\n\n{conor.durkan, artur.bekasov, i.murray, g.papamakarios}@ed.ac.uk\n\nAbstract\n\nA normalizing \ufb02ow models a complex probability density as an invertible transfor-\nmation of a simple base density. Flows based on either coupling or autoregressive\ntransforms both offer exact density evaluation and sampling, but rely on the pa-\nrameterization of an easily invertible elementwise transformation, whose choice\ndetermines the \ufb02exibility of these models. Building upon recent work, we pro-\npose a fully-differentiable module based on monotonic rational-quadratic splines,\nwhich enhances the \ufb02exibility of both coupling and autoregressive transforms while\nretaining analytic invertibility. 
We demonstrate that neural spline flows improve density estimation, variational inference, and generative modeling of images.

1 Introduction

Models that can reason about the joint distribution of high-dimensional random variables are central to modern unsupervised machine learning. Explicit density evaluation is required in many statistical procedures, while synthesis of novel examples can enable agents to imagine and plan in an environment prior to choosing an action. In recent years, the variational autoencoder [VAE, 29, 48] and generative adversarial network [GAN, 15] have received particular attention in the generative-modeling community, and both are capable of sampling with a single forward pass of a neural network. However, these models do not offer exact density evaluation, and can be difficult to train. On the other hand, autoregressive density estimators [13, 50, 56, 58, 59, 60] can be trained by maximum likelihood, but sampling requires a sequential loop over the output dimensions.

Flow-based models present an alternative approach to the above methods, and in some cases provide both exact density evaluation and sampling in a single neural-network pass. A normalizing flow models data x as the output of an invertible, differentiable transformation f of noise u:

    x = f(u) where u ∼ π(u).    (1)

The probability density of x under the flow is obtained by a change of variables:

    p(x) = π(f⁻¹(x)) |det(∂f⁻¹/∂x)|.    (2)

Intuitively, the function f compresses and expands the density of the noise distribution π(u), and this change is quantified by the determinant of the Jacobian of the transformation. 
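As a concrete illustration of eq. (2), here is a minimal one-dimensional sketch in NumPy. This is not the authors' code; the helper names and the toy affine flow are our own, chosen so the change-of-variables density can be checked against a closed form.

```python
import numpy as np

def flow_log_density(x, f_inv, f_inv_grad, log_pi):
    """1-D case of eq. (2): log p(x) = log pi(f^{-1}(x)) + log |d f^{-1}/dx|."""
    u = f_inv(x)
    return log_pi(u) + np.log(np.abs(f_inv_grad(x)))

# Toy flow x = f(u) = 2u + 1 with standard-normal noise pi(u),
# so f^{-1}(x) = (x - 1)/2 and d f^{-1}/dx = 1/2.
log_pi = lambda u: -0.5 * u**2 - 0.5 * np.log(2 * np.pi)
log_p = flow_log_density(3.0, lambda x: (x - 1) / 2, lambda x: 0.5, log_pi)
# For this affine f, p(x) is exactly the N(1, 2^2) density.
```

In higher dimensions the same bookkeeping applies, with the scalar derivative replaced by the log-determinant of the Jacobian of f⁻¹.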
The noise distribution π(u) is typically chosen to be simple, such as a standard normal, whereas the transformation f and its inverse f⁻¹ are often implemented by composing a series of invertible neural-network modules. Given a dataset D = {x^(n)}_{n=1}^N, the flow is trained by maximizing the total log likelihood Σ_n log p(x^(n)) with respect to the parameters of the transformation f. In recent years, normalizing flows have received widespread attention in the machine-learning literature, seeing successful use in density estimation [10, 43], variational inference [30, 36, 46, 57], image, audio and video generation [26, 28, 32, 45], likelihood-free inference [44], and learning maximum-entropy distributions [34].

∗Equal contribution

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Monotonic rational-quadratic transforms are drop-in replacements for additive or affine transformations in coupling or autoregressive layers, greatly enhancing their flexibility while retaining exact invertibility. Left: A random monotonic rational-quadratic transform with K = 10 bins and linear tails is parameterized by a series of K + 1 'knot' points in the plane, and the K − 1 derivatives at the internal knots. Right: Derivative of the transform on the left with respect to x. Monotonic rational-quadratic splines naturally induce multi-modality when used to transform random variables.

A flow is defined by specifying the bijective function f or its inverse f⁻¹, usually with a neural network. Depending on the flow's intended use cases, there are practical constraints in addition to formal invertibility:

• To train a density estimator, we need to be able to evaluate the Jacobian determinant and the inverse function f⁻¹ quickly. We don't evaluate f, so the flow is usually defined by specifying f⁻¹.
• If we wish to draw samples using eq. (1), we would like f to be available analytically, rather than having to invert f⁻¹ with iterative or approximate methods.
• Ideally, we would like both f and f⁻¹ to require only a single pass of a neural network to compute, so that both density evaluation and sampling can be performed quickly.

Autoregressive flows such as inverse autoregressive flow [IAF, 30] or masked autoregressive flow [MAF, 43] are D times slower to invert than to evaluate, where D is the dimensionality of x. Subsequent work which enhances their flexibility has resulted in models which do not have an analytic inverse, and require numerical optimization to invert [22]. Flows based on coupling layers [NICE, RealNVP, 9, 10] have an analytic one-pass inverse, but are often less flexible than their autoregressive counterparts.

In this work, we propose a fully-differentiable module based on monotonic rational-quadratic splines which has an analytic inverse. The module acts as a drop-in replacement for the affine or additive transformations commonly found in coupling and autoregressive transforms. We demonstrate that this module significantly enhances the flexibility of both classes of flows, and in some cases brings the performance of coupling transforms on par with the best-known autoregressive flows. An illustration of our proposed transform is shown in fig. 1.

2 Background

2.1 Coupling transforms

A coupling transform φ [9] maps an input x to an output y in the following way:

1. Split the input x into two parts, x = [x1:d−1, xd:D].
2. Compute parameters θ = NN(x1:d−1), where NN is an arbitrary neural network.
3. 
Compute yi = gθi(xi) for i = d, …, D in parallel, where gθi is an invertible function parameterized by θi.

4. Set y1:d−1 = x1:d−1, and return y = [y1:d−1, yd:D].

The Jacobian matrix of a coupling transform is lower triangular, since yd:D is given by transforming xd:D elementwise as a function of x1:d−1, and y1:d−1 is equal to x1:d−1. Thus, the Jacobian determinant of the coupling transform φ is given by det(∂φ/∂x) = ∏_{i=d}^D ∂gθi/∂xi, the product of the diagonal elements of the Jacobian.

Coupling transforms solve two important problems for normalizing flows: they have a tractable Jacobian determinant, and they can be inverted exactly in a single pass. The inverse of a coupling transform can be easily computed by running steps 1–4 above, this time inputting y, and using g⁻¹θi to compute xd:D in step 3. Multiple coupling layers can also be composed in a natural way to construct a normalizing flow with increased flexibility. A coupling transform can also be viewed as a special case of an autoregressive transform where we perform two splits of the input data instead of D, as noted by Papamakarios et al. [43]. In this way, advances in flows based on coupling transforms can be applied to autoregressive flows, and vice versa.

2.2 Invertible elementwise transformations

Affine/additive Typically, the function gθi takes the form of an additive [9] or affine [10] transformation for computational ease. The affine transformation is given by:

    gθi(xi) = αi xi + βi, where θi = {αi, βi},    (3)

and αi is usually constrained to be positive. The additive transformation corresponds to the special case αi = 1. 
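Steps 1–4 with the affine transformation of eq. (3) can be sketched as follows. This is our own minimal NumPy illustration, not the authors' implementation; `nn` stands in for the arbitrary network NN, here just a fixed toy function, and the scale is kept positive by exponentiating an unconstrained log-scale.

```python
import numpy as np

def affine_coupling(x, d, nn, inverse=False):
    """Affine coupling transform (steps 1-4) and its log |Jacobian determinant|."""
    x1, x2 = x[:d], x[d:]                   # step 1: split the input
    log_alpha, beta = nn(x1)                # step 2: theta = NN(x_{1:d-1})
    if not inverse:
        y2 = np.exp(log_alpha) * x2 + beta  # step 3: elementwise affine map, in parallel
        logdet = np.sum(log_alpha)          # product of diagonal Jacobian entries
    else:
        y2 = (x2 - beta) * np.exp(-log_alpha)
        logdet = -np.sum(log_alpha)
    return np.concatenate([x1, y2]), logdet  # step 4: identity on x_{1:d-1}

# Toy "network" producing log-scales and shifts from the untransformed half.
rng = np.random.default_rng(0)
W, V = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))
nn = lambda x1: (np.tanh(W @ x1), V @ x1)

x = rng.normal(size=5)
y, logdet = affine_coupling(x, 2, nn)
x_back, inv_logdet = affine_coupling(y, 2, nn, inverse=True)  # exact one-pass inverse
```

Because y1:d−1 equals x1:d−1, the inverse pass can recompute the same parameters from y and undo the affine map exactly, which is the one-pass invertibility discussed above.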
Both the affine and additive transformations are easy to invert, but they lack flexibility. Recalling that the base distribution of a flow is typically simple, flow-based models may struggle to model multi-modal or discontinuous densities using just affine or additive transformations, since they may find it difficult to compress and expand the density in a suitably nonlinear fashion (for an illustration, see appendix C.1). We aim to choose a more flexible gθi that is still differentiable and easy to invert.

Polynomial splines Recently, Müller et al. [39] proposed a powerful generalization of the above affine transformations, based on monotonic piecewise polynomials. The idea is to restrict the input domain of gθi to the interval [0, 1], partition the input domain into K bins, and define gθi to be a simple polynomial segment within each bin. Müller et al. [39] restrict themselves to monotonically-increasing linear and quadratic polynomial segments, whose coefficients are parameterized by θi. Moreover, the polynomial segments are restricted to match at the bin boundaries so that gθi is continuous. Functions of this form, which interpolate between data using piecewise polynomials, are known as polynomial splines.

Cubic splines In a previous iteration of this work [11], we explored the cubic-spline flow, a natural extension to the framework of Müller et al. [39]. We proposed to implement gθi as a monotonic cubic spline [54], where gθi is defined to be a monotonically-increasing cubic polynomial in each bin. 
By\ncomposing coupling layers featuring elementwise monotonic cubic-spline transforms with invertible\nlinear transformations, we found \ufb02ows of this type to be much more \ufb02exible than the standard\ncoupling-layer models in the style of RealNVP [10], achieving similar results to autoregressive\nmodels on a suite of density-estimation tasks.\nLike M\u00fcller et al. [39], our spline transform and its inverse were de\ufb01ned only on the interval [0, 1].\nTo ensure that the input is always between 0 and 1, we placed a sigmoid transformation before each\ncoupling layer, and a logit transformation after each coupling layer. These transformations allow the\nspline transform to be composed with linear layers, which have an unconstrained domain. However,\nthe limitations of 32-bit \ufb02oating point precision mean that in practice the sigmoid saturates for inputs\noutside the approximate range of [\u221213, 13], which results in numerical dif\ufb01culties. In addition,\ncomputing the inverse of the transform requires inverting a cubic polynomial, which is prone to\nnumerical instability if not carefully treated [1]. In section 3.1 we propose a modi\ufb01ed method based\non rational-quadratic splines which overcomes these dif\ufb01culties.\n\n2.3\n\nInvertible linear transformations\n\nTo ensure all input variables can interact with each other, it is common to randomly permute\nthe dimensions of intermediate layers in a normalizing \ufb02ow. Permutation is an invertible linear\ntransformation, with absolute determinant equal to 1. Oliva et al. [41] generalized permutations\nto a more general class of linear transformations, and Kingma and Dhariwal [28] demonstrated\nimprovements on a range of image tasks. In particular, a linear transformation with matrix W is\nparameterized in terms of its LU-decomposition W = PLU, where P is a \ufb01xed permutation matrix,\nL is lower triangular with ones on the diagonal, and U is upper triangular. 
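The LU-parameterized linear transformation just described can be sketched as follows. This is our own minimal NumPy illustration under the stated parameterization (fixed permutation P, unit lower-triangular L, upper-triangular U with positive diagonal); the function names are assumptions, not the paper's API.

```python
import numpy as np

def lu_forward(z, P, L, U):
    """Apply W = P L U; log|det W| reads off the diagonal of U (det P = +/-1, det L = 1)."""
    return P @ (L @ (U @ z)), np.sum(np.log(np.abs(np.diag(U))))

def lu_inverse(x, P, L, U):
    """Invert x = P L U z with two O(D^2) triangular solves, no explicit W^{-1}."""
    b = P.T @ x                                # permutation matrices are orthogonal
    y = np.zeros_like(b)
    for i in range(len(b)):                    # forward substitution (unit diagonal L)
        y[i] = b[i] - L[i, :i] @ y[:i]
    z = np.zeros_like(b)
    for i in reversed(range(len(b))):          # back substitution
        z[i] = (y[i] - U[i, i + 1:] @ z[i + 1:]) / U[i, i]
    return z

rng = np.random.default_rng(0)
D = 4
P = np.eye(D)[rng.permutation(D)]                     # fixed permutation matrix
L = np.tril(rng.normal(size=(D, D)), -1) + np.eye(D)  # ones on the diagonal
U = np.triu(rng.normal(size=(D, D)), 1) + np.diag(np.exp(rng.normal(size=D)))

z = rng.normal(size=D)
x, logdet = lu_forward(z, P, L, U)
```

The exponentiated diagonal of U keeps W invertible, and the determinant costs O(D) rather than O(D³).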
By restricting the diagonal elements of U to be positive, W is guaranteed to be invertible.

By making use of the LU-decomposition, both the determinant and the inverse of the linear transformation can be computed efficiently. First, the determinant of W can be calculated in O(D) time as the product of the diagonal elements of U. Second, inverting the linear transformation can be done by solving two triangular systems, one for U and one for L, each of which costs O(D²M) time where M is the batch size. Alternatively, we can pay a one-time cost of O(D³) to explicitly compute W⁻¹, which can then be cached for re-use.

3 Method

3.1 Monotonic rational-quadratic transforms

We propose to implement the function gθi using monotonic rational-quadratic splines as a building block, where each bin is defined by a monotonically-increasing rational-quadratic function. A rational-quadratic function takes the form of a quotient of two quadratic polynomials. Rational-quadratic functions are easily differentiable, and since we consider only monotonic segments of these functions, they are also analytically invertible. Nevertheless, they are strictly more flexible than quadratic functions, and allow direct parameterization of the derivatives and heights at each knot. In our implementation, we use the method of Gregory and Delbourgo [17] to parameterize a monotonic rational-quadratic spline. The spline itself maps an interval [−B, B] to [−B, B]. We define the transformation outside this range as the identity, resulting in linear 'tails', so that the overall transformation can take unconstrained inputs.

The spline uses K different rational-quadratic functions, with boundaries set by K + 1 coordinates {(x^(k), y^(k))}_{k=0}^K known as knots. The knots monotonically increase between (x^(0), y^(0)) = (−B, −B) and (x^(K), y^(K)) = (B, B). 
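To make the construction concrete, the following sketch (our own minimal NumPy code, not the reference implementation) builds the knots from unconstrained parameters in the way described later in this section (softmax-normalized widths and heights, softplus-positive internal derivatives), evaluates the Gregory–Delbourgo rational-quadratic of eq. (4), and inverts it with the quadratic-formula solution of eqs. (6)–(8):

```python
import numpy as np

def make_knots(theta_w, theta_h, theta_d, B=3.0):
    """Unconstrained parameters -> monotone knots on [-B, B] and positive derivatives."""
    softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
    widths, heights = 2 * B * softmax(theta_w), 2 * B * softmax(theta_h)
    xk = np.concatenate([[-B], -B + np.cumsum(widths)])   # K+1 knot x-coordinates
    yk = np.concatenate([[-B], -B + np.cumsum(heights)])  # K+1 knot y-coordinates
    # softplus internal derivatives; boundary derivatives fixed to 1 for linear tails
    deltas = np.concatenate([[1.0], np.log1p(np.exp(theta_d)), [1.0]])
    return xk, yk, deltas

def rq_forward(x, xk, yk, d):
    """Evaluate the monotonic rational-quadratic spline (eq. 4) at a scalar x."""
    k = np.clip(np.searchsorted(xk, x) - 1, 0, len(xk) - 2)  # binary search for the bin
    s = (yk[k + 1] - yk[k]) / (xk[k + 1] - xk[k])
    xi = (x - xk[k]) / (xk[k + 1] - xk[k])
    num = (yk[k + 1] - yk[k]) * (s * xi**2 + d[k] * xi * (1 - xi))
    den = s + (d[k + 1] + d[k] - 2 * s) * xi * (1 - xi)
    return yk[k] + num / den

def rq_inverse(y, xk, yk, d):
    """Invert eq. (4): solve a*xi^2 + b*xi + c = 0 with the stable root 2c/(-b - sqrt(...))."""
    k = np.clip(np.searchsorted(yk, y) - 1, 0, len(yk) - 2)
    s = (yk[k + 1] - yk[k]) / (xk[k + 1] - xk[k])
    dy = y - yk[k]
    a = (yk[k + 1] - yk[k]) * (s - d[k]) + dy * (d[k + 1] + d[k] - 2 * s)
    b = (yk[k + 1] - yk[k]) * d[k] - dy * (d[k + 1] + d[k] - 2 * s)
    c = -s * dy
    xi = 2 * c / (-b - np.sqrt(b**2 - 4 * a * c))
    return xk[k] + xi * (xk[k + 1] - xk[k])

rng = np.random.default_rng(0)
K, B = 8, 3.0
xk, yk, d = make_knots(rng.normal(size=K), rng.normal(size=K), rng.normal(size=K - 1), B)
```

Monotonicity guarantees the discriminant is non-negative and picks out a single valid root, which is why the inverse is analytic rather than iterative.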
We give the spline K − 1 arbitrary positive values {δ^(k)}_{k=1}^{K−1} for the derivatives at the internal points, and set the boundary derivatives δ^(0) = δ^(K) = 1 to match the linear tails. If the derivatives are not matched in this way, the transformation is still continuous, but its derivative can have jump discontinuities at the boundary points. This in turn makes the log-likelihood training objective discontinuous, which in our experience manifested itself in numerical issues and failed optimization.

The method constructs a monotonic, continuously-differentiable, rational-quadratic spline which passes through the knots, with the given derivatives at the knots. Defining s^(k) = (y^(k+1) − y^(k))/(x^(k+1) − x^(k)) and ξ(x) = (x − x^(k))/(x^(k+1) − x^(k)), the expression for the rational-quadratic α^(k)(ξ)/β^(k)(ξ) in the kth bin can be written

    α^(k)(ξ)/β^(k)(ξ) = y^(k) + [ (y^(k+1) − y^(k)) (s^(k) ξ² + δ^(k) ξ(1 − ξ)) ] / [ s^(k) + (δ^(k+1) + δ^(k) − 2s^(k)) ξ(1 − ξ) ].    (4)

Since the rational-quadratic transformation acts elementwise on an input vector and is monotonic, the logarithm of the absolute value of the determinant of its Jacobian can be computed as the sum of the logarithm of the derivatives of eq. (4) with respect to each of the transformed x values in the input vector. It can be shown that

    d/dx [α^(k)(ξ)/β^(k)(ξ)] = (s^(k))² [ δ^(k+1) ξ² + 2s^(k) ξ(1 − ξ) + δ^(k) (1 − ξ)² ] / [ s^(k) + (δ^(k+1) + δ^(k) − 2s^(k)) ξ(1 − ξ) ]².    (5)

Finally, the inverse of a rational-quadratic function can be computed analytically by inverting eq. (4), which amounts to solving for the roots of a quadratic equation. Because the transformation is monotonic, we can always determine which of the two quadratic roots is correct, and that the solution is given by ξ(x) = 2c / (−b − √(b² − 4ac)), where

    a = (y^(k+1) − y^(k)) [s^(k) − δ^(k)] + (y − y^(k)) [δ^(k+1) + δ^(k) − 2s^(k)],    (6)
    b = (y^(k+1) − y^(k)) δ^(k) − (y − y^(k)) [δ^(k+1) + δ^(k) − 2s^(k)],    (7)
    c = −s^(k) (y − y^(k)),    (8)

which can then be used to determine x. An instance of the rational-quadratic transform is illustrated in fig. 1, and appendix A.1 gives full details of the above expressions.

Implementation The practical implementation of the monotonic rational-quadratic coupling transform is as follows:

1. A neural network NN takes x1:d−1 as input and outputs an unconstrained parameter vector θi of length 3K − 1 for each i = d, …, D.
2. Vector θi is partitioned as θi = [θi^w, θi^h, θi^d], where θi^w and θi^h have length K, and θi^d has length K − 1.
3. Vectors θi^w and θi^h are each passed through a softmax and multiplied by 2B; the outputs are interpreted as the widths and heights of the K bins, which must be positive and span the [−B, B] interval. Cumulative sums of the K bin widths and heights, each starting at −B, yield the K + 1 knots {(x^(k), y^(k))}_{k=0}^K.
4. The vector θi^d is passed through a softplus function and is interpreted as the values of the derivatives {δ^(k)}_{k=1}^{K−1} at the internal knots.

Evaluating a rational-quadratic spline transform at location x requires finding the bin in which x lies, which can be done efficiently with binary search, since the bins are sorted. The Jacobian determinant can be computed in closed-form as a product of quotient derivatives, while the inverse requires solving a quadratic equation whose coefficients depend on the value to invert; we provide details of these procedures in appendix A.2 and appendix A.3. Unlike the additive and affine transformations, which have limited flexibility, a differentiable monotonic spline with sufficiently many bins can approximate any differentiable monotonic function on the specified interval [−B, B],² yet has a closed-form, tractable Jacobian determinant, and can be inverted analytically. Finally, our parameterization is fully-differentiable, which allows for training by gradient methods.

The above formulation can also easily be adapted for autoregressive transforms; each θi can be computed as a function of x1:i−1 using an autoregressive neural network, and then all elements of x can be transformed at once. Inspired by this, we also introduce a set of splines for our coupling layers
Inspired by this, we also introduce a set of splines for our coupling layers\nwhich act elementwise on x1:d\u22121 (the typically non-transformed variables), and whose parameters\nare optimized directly by stochastic gradient descent. This means that our coupling layer transforms\nall elements of x at once as follows:\n\n\u03b81:d\u22121 = Trainable parameters\n\n\u03b8d:D = NN(x1:d\u22121)\n\nyi = g\u03b8i(xi)\n\nfor i = 1, . . . , D.\n\n(9)\n(10)\n(11)\n\nFigure 2 demonstrates the \ufb02exibility of our\nrational-quadratic coupling transform on syn-\nthetic two-dimensional datasets. Using just two\ncoupling layers, each with K = 128 bins, the\nmonotonic rational-quadratic spline transforms\nhave no issue \ufb01tting complex, discontinuous\ndensities with potentially hundreds of modes.\nIn contrast, a coupling layer with af\ufb01ne trans-\nformations has signi\ufb01cant dif\ufb01culty with these\ntasks (see appendix C.1).\n\n3.2 Neural spline \ufb02ows\n\nTraining data\n\nFlow density Flow samples\n\nFigure 2: Qualitative results for two-dimensional\nsynthetic datasets using RQ-NSF with two cou-\npling layers.\n\nThe monotonic rational-quadratic spline transforms described in the previous section act as drop-in\nreplacements for af\ufb01ne or additive transformations in both coupling and autoregressive transforms.\nWhen combined with alternating invertible linear transformations, we refer to the resulting class of\nnormalizing \ufb02ows as rational-quadratic neural spline \ufb02ows (RQ-NSF), which may feature coupling\nlayers, RQ-NSF (C), or autoregressive layers, RQ-NSF (AR). 
RQ-NSF (C) corresponds to Glow [28]\nwith af\ufb01ne or additive transformations replaced with monotonic rational-quadratic transforms, where\nGlow itself is exactly RealNVP with permutations replaced by invertible linear transformations.\n\n2By de\ufb01nition of the derivative, a differentiable monotonic function is locally linear everywhere, and can\nthus be approximated by a piecewise linear function arbitrarily well given suf\ufb01ciently many bins. For a \ufb01xed and\n\ufb01nite number of bins such universality does not hold, but this limit argument is similar in spirit to the universality\nproof of Huang et al. [22], and the universal approximation capabilities of neural networks in general.\n\n5\n\n\fRQ-NSF (AR) corresponds to either IAF or MAF, depending on whether the \ufb02ow parameterizes\nf or f\u22121, again with af\ufb01ne transformations replaced by monotonic rational-quadratic transforms,\nand also with permutations replaced with invertible linear layers. Overall, RQ-NSF resembles a\ntraditional feed-forward neural network architecture, alternating between linear transformations and\nelementwise non-linearities, while retaining an exact, analytic inverse. In the case of RQ-NSF (C),\nthe inverse is available in a single neural-network pass.\n\n4 Related Work\n\nInvertible linear transformations\nInvertible linear transformations have found signi\ufb01cant use\nin normalizing \ufb02ows. Glow [28] replaces the permutation operation of RealNVP with an LU-\ndecomposed linear transformation interpreted as a 1\u00d71 convolution, yielding superior performance\nfor image modeling. WaveGlow [45] and FloWaveNet [26] have also successfully adapted Glow\nfor generative modeling of audio. Expanding on the invertible 1\u00d7 1 convolution presented in\nGlow, Hoogeboom et al. 
[21] propose the emerging convolution, based on composing autoregressive\nconvolutions in a manner analogous to an LU-decomposition, and the periodic convolution, which\nuses multiplication in the Fourier domain to perform convolution. Hoogeboom et al. [21] also\nintroduce linear transformations based on the QR-decomposition, where the orthogonal matrix is\nparameterized by a sequence of Householder transformations [55].\n\nInvertible elementwise transformations Outside of those discussed in section 2.2, there has\nbeen much recent work in developing more \ufb02exible invertible elementwise transformations for\nnormalizing \ufb02ows. Flow++ [20] uses the CDF of a mixture of logistic distributions as a monotonic\ntransformation in coupling layers, but requires bisection search to compute an inverse, since a closed\nform is not available. Non-linear squared \ufb02ow [61] adds an inverse-quadratic perturbation to an\naf\ufb01ne transformation in an autoregressive \ufb02ow, which is invertible under certain restrictions of\nthe parameterization. Computing this inverse requires solving a cubic polynomial, and the overall\ntransform is less \ufb02exible than a monotonic rational-quadratic spline. Sum-of-squares polynomial \ufb02ow\n[SOS, 25] parameterizes a monotonic transformation by specifying the coef\ufb01cients of a polynomial\nof some chosen degree which can be written as a sum of squares. For low-degree polynomials, an\nanalytic inverse may be available, but the method would require an iterative solution in general.\nNeural autoregressive \ufb02ow [NAF, 22] replaces the af\ufb01ne transformation in MAF by parameterizing\na monotonic neural network for each dimension. This greatly enhances the \ufb02exibility of the trans-\nformation, but the resulting model is again not analytically invertible. 
Block neural autoregressive\n\ufb02ow [Block-NAF, 6] directly \ufb01ts an autoregressive monotonic neural network end-to-end rather than\nparameterizing a sequence for each dimension as in NAF, but is also not analytically invertible.\n\nContinuous-time \ufb02ows Rather than constructing a normalizing \ufb02ow as a series of discrete steps,\nit is also possible to use a continuous-time \ufb02ow, where the transformation from noise u to data\nx is described by an ordinary differential equation. Deep diffeomorphic \ufb02ow [51] is one such\ninstance, where the model is trained by backpropagation through an Euler integrator, and the Jacobian\nis computed approximately using a truncated power series and Hutchinson\u2019s trace estimator [23].\nNeural ordinary differential equations [Neural ODEs, 3] de\ufb01ne an additional ODE which describes\nthe trajectory of the \ufb02ow\u2019s gradient, avoiding the need to backpropagate through an ODE solver. A\nthird ODE can be used to track the evolution of the log density, and the entire system can be solved\nwith a suitable integrator. The resulting continuous-time \ufb02ow is known as FFJORD [16]. Like \ufb02ows\nbased on coupling layers, FFJORD is also invertible in \u2018one pass\u2019, but here this term refers to solving\na system of ODEs, rather than performing a single neural-network pass.\n\n5 Experiments\n\nIn our experiments, the neural network NN which computes the parameters of the elementwise\ntransformations is a residual network [18] with pre-activation residual blocks [19]. For autoregressive\ntransformations, the layers must be masked so as to preserve autoregressive structure, and so we use\nthe ResMADE architecture outlined by Nash and Durkan [40]. Preliminary results indicated only\nminor differences in setting the tail bound B within the range [1, 5], and so we \ufb01x a value B = 3\nacross experiments, and \ufb01nd this to work robustly. 
We also fix the number of bins K = 8 across our experiments, unless otherwise noted.

Table 1: Test log likelihood (in nats) for UCI datasets and BSDS300, with error bars corresponding to two standard deviations. FFJORD†, NAF†, Block-NAF†, and SOS† report error bars across repeated runs rather than across the test set. Superscript ⋆ indicates results are taken from the existing literature. For validation results which can be used for comparison during model development, see table 6 in appendix B.1.

MODEL          POWER         GAS           HEPMASS        MINIBOONE      BSDS300
FFJORD⋆†       0.46 ± 0.01    8.59 ± 0.12  −14.92 ± 0.08  −10.43 ± 0.04  157.40 ± 0.19
GLOW           0.42 ± 0.01   12.24 ± 0.03  −16.99 ± 0.02  −10.55 ± 0.45  156.95 ± 0.28
Q-NSF (C)      0.64 ± 0.01   12.80 ± 0.02  −15.35 ± 0.02   −9.35 ± 0.44  157.65 ± 0.28
RQ-NSF (C)     0.64 ± 0.01   13.09 ± 0.02  −14.75 ± 0.03   −9.67 ± 0.47  157.54 ± 0.28
MAF            0.45 ± 0.01   12.35 ± 0.02  −17.03 ± 0.02  −10.92 ± 0.46  156.95 ± 0.28
Q-NSF (AR)     0.66 ± 0.01   12.91 ± 0.02  −14.67 ± 0.03   −9.72 ± 0.47  157.42 ± 0.28
NAF⋆†          0.62 ± 0.01   11.96 ± 0.33  −15.09 ± 0.40   −8.86 ± 0.15  157.73 ± 0.04
BLOCK-NAF⋆†    0.61 ± 0.01   12.06 ± 0.09  −14.71 ± 0.38   −8.95 ± 0.07  157.36 ± 0.03
SOS⋆†          0.60 ± 0.01   11.99 ± 0.41  −15.15 ± 0.10   −8.90 ± 0.11  157.48 ± 0.41
RQ-NSF (AR)    0.66 ± 0.01   13.09 ± 0.02  −14.01 ± 0.03   −9.22 ± 0.48  157.31 ± 0.28
We implement all invertible linear transformations using\nthe LU-decomposition, where the permutation matrix P is \ufb01xed at the beginning of training, and\nthe product LU is initialized to the identity. For all non-image experiments, we de\ufb01ne a \ufb02ow \u2018step\u2019\nas the composition of an invertible linear transformation with either a coupling or autoregressive\ntransform, and we use 10 steps per \ufb02ow in all our experiments, unless otherwise noted. All \ufb02ows use\na standard-normal noise distribution. We use the Adam optimizer [27], and anneal the learning rate\naccording to a cosine schedule [35]. In some cases, we \ufb01nd applying dropout [53] in the residual\nblocks bene\ufb01cial for regularization. Full experimental details are provided in appendix B. Code is\navailable online at https://github.com/bayesiains/nsf.\n\n5.1 Density estimation of tabular data\n\nWe \ufb01rst evaluate our proposed \ufb02ows using a selection of datasets from the UCI machine-learning\nrepository [7] and BSDS300 collection of natural images [38]. We follow the experimental setup\nand pre-processing of Papamakarios et al. [43], who make their data available online [42]. We also\nupdate their MAF results using our codebase with ResMADE and invertible linear layers instead\nof permutations, providing a stronger baseline. For comparison, we modify the quadratic splines\nof M\u00fcller et al. [39] to match the rational-quadratic transforms, by de\ufb01ning them on the range\n[\u2212B, B] instead of [0, 1], and adding linear tails, also matching the boundary derivatives as in the\nrational-quadratic case. We denote this model Q-NSF. Our results are shown in table 1, where the\nmid-rule separates \ufb02ows with one-pass inverse from autoregressive \ufb02ows. 
We also include validation results for comparison during model development in table 6 in appendix B.1.

Both RQ-NSF (C) and RQ-NSF (AR) achieve state-of-the-art results for a normalizing flow on the Power, Gas, and Hepmass datasets, tied with Q-NSF (C) and Q-NSF (AR) on the Power dataset. Moreover, RQ-NSF (C) significantly outperforms both Glow and FFJORD, achieving scores competitive with the best autoregressive models. These results close the gap between autoregressive flows and flows based on coupling layers, and demonstrate that, in some cases, it may not be necessary to sacrifice one-pass sampling for density-estimation performance.

5.2 Improving the variational autoencoder

Next, we examine our proposed flows in the context of the variational autoencoder [VAE, 29, 48], where they can act as both flexible prior and approximate posterior distributions. For our experiments, we use dynamically binarized versions of the MNIST dataset of handwritten digits [33], and the EMNIST dataset variant featuring handwritten letters [5]. We measure the capacity of our flows to improve over the commonly used baseline of a standard-normal prior and diagonal-normal approximate posterior, as well as over either coupling or autoregressive distributions with affine transformations. Quantitative results are shown in table 2, and image samples in appendix C.

Table 2: Variational autoencoder test-set results (in nats) for the evidence lower bound (ELBO) and importance-weighted estimate of the log likelihood (computed as by Burda et al. [2] using 1000 importance samples). 
Error bars correspond to two standard deviations.

POSTERIOR/PRIOR    MNIST ELBO       MNIST log p(x)   EMNIST ELBO       EMNIST log p(x)
BASELINE           −85.61 ± 0.51    −81.31 ± 0.43    −125.89 ± 0.41    −120.88 ± 0.38
GLOW               −82.25 ± 0.46    −79.72 ± 0.42    −120.04 ± 0.40    −117.54 ± 0.38
RQ-NSF (C)         −82.08 ± 0.46    −79.63 ± 0.42    −119.74 ± 0.40    −117.35 ± 0.38
IAF/MAF            −82.56 ± 0.48    −79.95 ± 0.43    −119.85 ± 0.40    −117.47 ± 0.38
RQ-NSF (AR)        −82.14 ± 0.47    −79.71 ± 0.43    −119.49 ± 0.40    −117.28 ± 0.38

All models improve significantly over the baseline, but perform very similarly otherwise, with most featuring overlapping error bars. Considering the disparity in density-estimation performance in the previous section, this is likely due to flows with affine transformations being sufficient to model the latent space for these datasets, with little scope for RQ-NSF flows to demonstrate their increased flexibility. Nevertheless, it is worthwhile to highlight that RQ-NSF (C) is the first class of model which can potentially match the flexibility of autoregressive models, and which requires no modification for use as either a prior or approximate posterior, due to its one-pass invertibility.

5.3 Generative modeling of images

Finally, we evaluate neural spline flows as generative models of images, measuring their capacity to improve upon baseline models with affine transforms. In this section, we focus solely on flows with a one-pass inverse in the style of RealNVP [10] and Glow [28]. We use the CIFAR-10 [31] and downsampled 64 × 64 ImageNet [49, 60] datasets, with original 8-bit colour depth and with reduced 5-bit colour depth.
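The importance-weighted log-likelihood estimate reported in table 2 is, following Burda et al. [2], a log-mean-exp of K importance weights under the approximate posterior. Below is a minimal NumPy sketch; the conjugate-Gaussian toy model and helper names are ours, chosen only because the exact marginal is known there, so the estimator can be checked.

```python
import numpy as np

def norm_logpdf(x, mean, std):
    """Log-density of a univariate normal."""
    return -0.5 * np.log(2 * np.pi * std**2) - (x - mean)**2 / (2 * std**2)

def importance_weighted_log_likelihood(log_p_xz, log_q_z):
    """log p(x) ~ log (1/K) sum_k p(x, z_k) / q(z_k | x), for z_k ~ q(z | x).

    Both arguments have shape (K,): the joint log-density log p(x, z_k)
    and the proposal log-density log q(z_k | x) at the K posterior samples.
    """
    log_w = log_p_xz - log_q_z
    m = log_w.max()                       # log-sum-exp for numerical stability
    return m + np.log(np.exp(log_w - m).mean())

# Toy check: z ~ N(0, 1), x | z ~ N(z, 1), so marginally x ~ N(0, 2).
rng = np.random.default_rng(0)
x, K = 1.3, 1000
z = rng.normal(x / 2, np.sqrt(0.5), size=K)   # exact posterior as proposal
log_p_xz = norm_logpdf(z, 0.0, 1.0) + norm_logpdf(x, z, 1.0)
log_q_z = norm_logpdf(z, x / 2, np.sqrt(0.5))
estimate = importance_weighted_log_likelihood(log_p_xz, log_q_z)
```

With the exact posterior as proposal, every importance weight equals p(x) and the estimate is exact; with a learned approximate posterior, as in table 2, the estimate is a lower bound that tightens as K grows.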
We use Glow-like architectures with either affine (in the baseline model) or rational-quadratic coupling transforms, and provide full experimental details in appendix B. Quantitative results are shown in table 3, and samples are shown in fig. 3 and appendix C.

RQ-NSF (C) improves upon the affine baseline in three out of four tasks, and the improvement is most significant on the 8-bit version of ImageNet64. At the same time, RQ-NSF (C) achieves scores that are competitive with the original Glow model, while significantly reducing the number of parameters required, in some cases by almost an order of magnitude. Figure 3 demonstrates that the model is capable of producing diverse, globally coherent samples which closely resemble real data. There is potential to further improve our results by replacing the uniform dequantization used in Glow with variational dequantization, and by using more powerful networks with gating and self-attention mechanisms to parameterize the coupling transforms, both of which are explored by Ho et al. [20].

6 Discussion

Long-standing probabilistic models such as copulas [12] and Gaussianization [4] can represent complex marginal distributions directly, where flow-based models like RealNVP and Glow would require many layers of transformations. Differentiable spline-based coupling layers allow these flows, which are powerful ways to represent high-dimensional dependencies, to model distributions with complex shapes more quickly.
Our results show that when we have enough data, the extra flexibility of spline-based layers leads to better generalization.

For tabular density estimation, both RQ-NSF (C) and RQ-NSF (AR) excel on Power, Gas, and Hepmass, the datasets with the highest ratio of data points to dimensionality from the five considered. In image experiments, RQ-NSF (C) achieves the best results on the ImageNet dataset, which has over an order of magnitude more data points than CIFAR-10. When the dimension is increased without a corresponding increase in dataset size, RQ-NSF still performs competitively with other approaches, but does not outperform them.

Overall, neural spline flows demonstrate that there is significant performance to be gained by upgrading the commonly-used affine transformations in coupling and autoregressive layers, without the need to sacrifice analytic invertibility. Monotonic spline transforms enable models based on coupling layers to achieve density-estimation performance on par with the best autoregressive flows, while retaining exact one-pass sampling.

Table 3: Test-set bits per dimension (BPD, lower is better) and parameter count for CIFAR-10 and ImageNet64 models. Superscript ⋆ indicates results are taken from the existing literature.

MODEL        CIFAR-10 5-BIT     CIFAR-10 8-BIT     IMAGENET64 5-BIT     IMAGENET64 8-BIT
             BPD    PARAMS      BPD    PARAMS      BPD    PARAMS        BPD    PARAMS
BASELINE     1.70   5.2M        3.41   11.1M       1.81   14.3M         3.91   14.3M
RQ-NSF (C)   1.70   5.3M        3.38   11.8M       1.77   15.6M         3.82   15.6M
GLOW⋆        1.67   44.0M       3.35   44.0M       1.76   110.9M       3.81   110.9M

Figure 3: Samples from image models for 5-bit (top) and 8-bit (bottom) datasets. Left: CIFAR-10. Right: ImageNet64.
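For reference, the BPD figures in table 3 are a unit conversion of the test log-likelihood (in nats, under the standard dequantization) to base-2 bits divided by the data dimensionality. A minimal sketch, where the numeric example is hypothetical and chosen only to land near the table's values:

```python
import numpy as np

def bits_per_dim(log_likelihood_nats, num_dims):
    """Convert a per-image log-likelihood in nats to bits per dimension."""
    return -log_likelihood_nats / (num_dims * np.log(2))

# A 32 x 32 x 3 CIFAR-10 image has 3072 dimensions; a (hypothetical)
# log-likelihood of -7190 nats corresponds to roughly 3.38 bits/dim.
bpd = bits_per_dim(-7190.0, 32 * 32 * 3)
```

Dividing by the dimension makes models comparable across image resolutions, which is why BPD rather than raw log-likelihood is the standard metric here.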
These models strike a novel middle ground between flexibility and practicality, providing a useful off-the-shelf tool for the enhancement of architectures like the variational autoencoder, while also improving parameter efficiency in generative modeling.

The proposed transforms scale to high-dimensional problems, as demonstrated empirically. The only non-constant operation added is the binning of the inputs according to the knot locations, which can be performed efficiently in O(log₂ K) time for K bins with binary search, since the knot locations are sorted. Moreover, due to the increased flexibility of the spline transforms, we find that we require fewer steps to build flexible flows, reducing the computational cost. In our experiments, which employ a linear O(K) search, we found rational-quadratic splines added approximately 30–40% to the wall-clock time for a single training update compared to the same model with affine transformations. A potential drawback of the proposed method is a more involved implementation; we alleviate this by providing an extensive appendix with technical details, and a reference implementation in PyTorch. A third-party implementation has also been added to TensorFlow Probability [8].

Rational-quadratic transforms are also a useful differentiable and invertible module in their own right, which could be included in many models that can be trained end-to-end. For instance, monotonic warping functions with a tractable Jacobian determinant are useful for supervised learning [52]. More generally, invertibility can be useful for training very large networks, since activations can be recomputed on-the-fly for backpropagation, meaning gradient computation requires memory which is constant instead of linear in the depth of the network [14, 37]. Monotonic splines are one way of constructing invertible elementwise transformations, but there may be others.
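The binning step described above can be sketched with a vectorized binary search over the sorted knot locations. This is an illustrative NumPy sketch; the knot values below are placeholders, not trained parameters.

```python
import numpy as np

# Sorted knot locations (bin edges) for a monotonic spline with K bins on [-B, B].
B, K = 3.0, 8
knots = np.linspace(-B, B, K + 1)

def bin_indices(x, knots):
    """Locate each input's bin with binary search, O(log K) per input.

    Inputs outside [-B, B] would be handled by the linear tails of the
    transform; here we simply clip them into the boundary bins.
    """
    idx = np.searchsorted(knots, x, side="right") - 1
    return np.clip(idx, 0, len(knots) - 2)

idx = bin_indices(np.array([-2.9, 0.0, 1.4, 2.99]), knots)   # one bin index per input
```

Once each input's bin is known, the rational-quadratic transform is evaluated elementwise using that bin's knot and derivative parameters, so the lookup is the only search involved.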
The benefits of research in this direction are clear, and so we look forward to future work in this area.

Acknowledgements

This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh. George Papamakarios was also supported by Microsoft Research through its PhD Scholarship Programme.

References

[1] J. F. Blinn. How to solve a cubic equation, part 5: Back to numerics. IEEE Computer Graphics and Applications, 27(3):78–89, 2007.
[2] Y. Burda, R. B. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. International Conference on Learning Representations, 2016.
[3] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 2018.
[4] S. S. Chen and R. A. Gopinath. Gaussianization. Advances in Neural Information Processing Systems, 2001.
[5] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik. EMNIST: an extension of MNIST to handwritten letters. arXiv:1702.05373, 2017.
[6] N. De Cao, I. Titov, and W. Aziz. Block neural autoregressive flow. Conference on Uncertainty in Artificial Intelligence, 2019.
[7] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
[8] J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, and R. A. Saurous. TensorFlow Distributions. arXiv:1711.10604, 2017.
[9] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear independent components estimation. International Conference on Learning Representations, Workshop track, 2015.
[10] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. International Conference on Learning Representations, 2017.
[11] C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios. Cubic-spline flows. Workshop on Invertible Neural Networks and Normalizing Flows, International Conference on Machine Learning, 2019.
[12] G. Elidan. Copulas in machine learning. Copulae in Mathematical and Quantitative Finance, 2013.
[13] M. Germain, K. Gregor, I. Murray, and H. Larochelle. MADE: Masked autoencoder for distribution estimation. International Conference on Machine Learning, 2015.
[14] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse. The reversible residual network: Backpropagation without storing activations. Advances in Neural Information Processing Systems, 2017.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 2014.
[16] W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. K. Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. International Conference on Learning Representations, 2018.
[17] J. Gregory and R. Delbourgo. Piecewise rational quadratic interpolation to monotonic data. IMA Journal of Numerical Analysis, 2(2):123–130, 1982.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. European Conference on Computer Vision, 2016.
[20] J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. International Conference on Machine Learning, 2019.
[21] E. Hoogeboom, R. van den Berg, and M. Welling. Emerging convolutions for generative normalizing flows. International Conference on Machine Learning, 2019.
[22] C.-W. Huang, D. Krueger, A. Lacoste, and A. Courville. Neural autoregressive flows. International Conference on Machine Learning, 2018.
[23] M. F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 19(2):433–450, 1990.
[24] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning, 2015.
[25] P. Jaini, K. A. Selby, and Y. Yu. Sum-of-squares polynomial flow. International Conference on Machine Learning, 2019.
[26] S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. arXiv:1811.02155, 2018.
[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2014.
[28] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1 × 1 convolutions. Advances in Neural Information Processing Systems, 2018.
[29] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.
[30] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. Advances in Neural Information Processing Systems, 2016.
[31] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.
[32] M. Kumar, M. Babaeizadeh, D. Erhan, C. Finn, S. Levine, L. Dinh, and D. Kingma. VideoFlow: A flow-based generative model for video. arXiv:1903.01434, 2019.
[33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[34] G. Loaiza-Ganem, Y. Gao, and J. P. Cunningham. Maximum entropy flow networks. International Conference on Learning Representations, 2017.
[35] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. International Conference on Learning Representations, 2017.
[36] C. Louizos and M. Welling. Multiplicative normalizing flows for variational Bayesian neural networks. International Conference on Machine Learning, 2017.
[37] M. MacKay, P. Vicol, J. Ba, and R. B. Grosse. Reversible recurrent neural networks. Advances in Neural Information Processing Systems, 2018.
[38] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. International Conference on Computer Vision, 2001.
[39] T. Müller, B. McWilliams, F. Rousselle, M. Gross, and J. Novák. Neural importance sampling. arXiv:1808.03856, 2018.
[40] C. Nash and C. Durkan. Autoregressive energy machines. International Conference on Machine Learning, 2019.
[41] J. B. Oliva, A. Dubey, M. Zaheer, B. Póczos, R. Salakhutdinov, E. P. Xing, and J. Schneider. Transformation autoregressive networks. International Conference on Machine Learning, 2018.
[42] G. Papamakarios. Preprocessed datasets for MAF experiments, 2018. URL https://doi.org/10.5281/zenodo.1161203.
[43] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Advances in Neural Information Processing Systems, 2017.
[44] G. Papamakarios, D. C. Sterratt, and I. Murray. Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows. International Conference on Artificial Intelligence and Statistics, 2019.
[45] R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. arXiv:1811.00002, 2018.
[46] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. International Conference on Machine Learning, 2015.
[47] D. J. Rezende and F. Viola. Taming VAEs. arXiv:1810.00597, 2018.
[48] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning, 2014.
[49] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[50] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. International Conference on Learning Representations, 2017.
[51] H. Salman, P. Yadollahpour, T. Fletcher, and K. Batmanghelich. Deep diffeomorphic normalizing flows. arXiv:1810.03256, 2018.
[52] E. Snelson, C. E. Rasmussen, and Z. Ghahramani. Warped Gaussian processes. Advances in Neural Information Processing Systems, 2004.
[53] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[54] M. Steffen. A simple method for monotonic interpolation in one dimension. Astronomy and Astrophysics, 239:443, 1990.
[55] J. M. Tomczak and M. Welling. Improving variational auto-encoders using Householder flow. arXiv:1611.09630, 2016.
[56] B. Uria, I. Murray, and H. Larochelle. RNADE: The real-valued neural autoregressive density-estimator. Advances in Neural Information Processing Systems, 2013.
[57] R. van den Berg, L. Hasenclever, J. M. Tomczak, and M. Welling. Sylvester normalizing flows for variational inference. Conference on Uncertainty in Artificial Intelligence, 2018.
[58] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. ISCA Speech Synthesis Workshop, 2016.
[59] A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves. Conditional image generation with PixelCNN decoders. Advances in Neural Information Processing Systems, 2016.
[60] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. International Conference on Machine Learning, 2016.
[61] Z. M. Ziegler and A. M. Rush. Latent normalizing flows for discrete sequences. arXiv:1901.10548, 2019.