{"title": "Unconstrained Monotonic Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1545, "page_last": 1555, "abstract": "Monotonic neural networks have recently been proposed as a way to define invertible transformations. These transformations can be combined into powerful autoregressive flows that have been shown to be universal approximators of continuous probability distributions. Architectures that ensure monotonicity typically enforce constraints on weights and activation functions, which enables invertibility but leads to a cap on the expressiveness of the resulting transformations. In this work, we propose the Unconstrained Monotonic Neural Network (UMNN) architecture based on the insight that a function is monotonic as long as its derivative is strictly positive. In particular, this latter condition can be enforced with a free-form neural network whose only constraint is the positiveness of its output. We evaluate our new invertible building block within a new autoregressive flow (UMNN-MAF) and demonstrate its effectiveness on density estimation experiments. We also illustrate the ability of UMNNs to improve variational inference.", "full_text": "Unconstrained Monotonic Neural Networks\n\nAntoine Wehenkel\nUniversity of Li\u00e8ge\n\nGilles Louppe\n\nUniversity of Li\u00e8ge\n\nAbstract\n\nMonotonic neural networks have recently been proposed as a way to de\ufb01ne in-\nvertible transformations. These transformations can be combined into powerful\nautoregressive \ufb02ows that have been shown to be universal approximators of con-\ntinuous probability distributions. Architectures that ensure monotonicity typically\nenforce constraints on weights and activation functions, which enables invertibil-\nity but leads to a cap on the expressiveness of the resulting transformations. 
In this work, we propose the Unconstrained Monotonic Neural Network (UMNN) architecture based on the insight that a function is monotonic as long as its derivative is strictly positive. In particular, this latter condition can be enforced with a free-form neural network whose only constraint is the positiveness of its output. We evaluate our new invertible building block within a new autoregressive flow (UMNN-MAF) and demonstrate its effectiveness on density estimation experiments. We also illustrate the ability of UMNNs to improve variational inference.

1 Introduction

Monotonic neural networks have been known as powerful tools to build monotone models of a response variable with respect to individual explanatory variables [Archer and Wang, 1993, Sill, 1998, Daniels and Velikova, 2010, Gupta et al., 2016, You et al., 2017]. Recently, strictly monotonic neural networks have also been proposed as a way to define invertible transformations. These transformations can be combined into effective autoregressive flows that can be shown to be universal approximators of continuous probability distributions. Examples include Neural Autoregressive Flows [NAF, Huang et al., 2018] and Block Neural Autoregressive Flows [B-NAF, De Cao et al., 2019]. Architectures that ensure monotonicity typically enforce constraints on weights and activation functions, which enables invertibility but leads to a cap on the expressiveness of the resulting transformations.
For neural autoregressive \ufb02ows, this does not impede universal approximation but\ntypically requires either complex conditioners or a composition of multiple \ufb02ows.\n\nNevertheless, autoregressive \ufb02ows de\ufb01ned as stacks of reversible transformations have proven\nto be quite ef\ufb01cient for density estimation of empirical distributions [Papamakarios et al., 2019,\n2017, Huang et al., 2018], as well as to improve posterior modeling in Variational Auto-Encoders\n(VAE) [Germain et al., 2015, Kingma et al., 2016, Huang et al., 2018]. Practical successes of these\nmodels include speech synthesis [van den Oord et al., 2016, Oord et al., 2018], likelihood-free infer-\nence [Papamakarios et al., 2019], probabilistic programming [Tran et al., 2017] and image genera-\ntion [Kingma and Dhariwal, 2018]. While stacking multiple reversible transformations improves the\ncapacity of the full transformation to represent complex probability distributions, it remains unclear\nwhich class of reversible transformations should be used.\n\nIn this work, we propose a class of reversible transformations based on a new Unconstrained Mono-\ntonic Neural Network (UMNN) architecture. We base our contribution on the insight that a function\nis monotonic as long as its derivative is strictly positive. 
This latter condition can be enforced with a free-form neural network whose only constraint is for its output to remain strictly positive.

We summarize our contributions as follows:

• We introduce the Unconstrained Monotonic Neural Network (UMNN) architecture, a new reversible scalar transformation defined via a free-form neural network.

• We combine UMNN transformations into an autoregressive flow (UMNN-MAF) and we demonstrate competitive or state-of-the-art results on benchmarks for normalizing flows.

• We empirically illustrate the scalability of our approach by applying UMNN on high dimensional density estimation problems.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

2 Unconstrained monotonic neural networks

Our primary contribution consists in a neural network architecture that enables learning arbitrary monotonic functions. More specifically, we want to learn a strictly monotonic scalar function F(x; ψ) : R → R without imposing strong constraints on the expressiveness of the hypothesis class. In UMNNs, we achieve this by only imposing the derivative f(x; ψ) = ∂F(x; ψ)/∂x to remain of constant sign or, without loss of generality, to be strictly positive. As a result, we can parameterize the bijective mapping F(x; ψ) via its strictly positive derivative f(x; ψ) as

F(x; ψ) = ∫_0^x f(t; ψ) dt + β,   (1)

where f(t; ψ) : R → R+ is a strictly positive parametric function and β = F(0; ψ) ∈ R is a scalar. We make f arbitrarily complex using an unconstrained neural network whose output is forced to be strictly positive through an ELU activation unit increased by 1; ψ denotes the parameters of this neural network.

Forward integration The forward evaluation of F(x; ψ) requires solving the integral in Equation (1).
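As a concrete illustration of Equation (1), the following is a minimal NumPy sketch (not the authors' implementation): a strictly positive integrand, here a tiny tanh network with placeholder sizes and seed whose output passes through ELU + 1, is integrated with Clenshaw-Curtis quadrature to produce a monotonic F.

```python
import numpy as np

def cc_weights(N):
    """Clenshaw-Curtis nodes and weights on [-1, 1] (N must be even)."""
    k = np.arange(N + 1)
    nodes = np.cos(k * np.pi / N)
    j = np.arange(1, N // 2 + 1)
    b = np.where(j == N // 2, 1.0, 2.0)
    c = np.where((k == 0) | (k == N), 1.0, 2.0)
    # w_k = (c_k / N) * (1 - sum_j b_j / (4 j^2 - 1) * cos(2 j k pi / N))
    w = (c / N) * (1.0 - (b / (4.0 * j**2 - 1.0)) @ np.cos(2.0 * np.outer(j, k) * np.pi / N))
    return nodes, w

def cc_integrate(f, x, N=32):
    """Approximate int_0^x f(t) dt with an (N+1)-point Clenshaw-Curtis rule."""
    t, w = cc_weights(N)
    u = 0.5 * x * (t + 1.0)          # map [-1, 1] -> [0, x]
    return 0.5 * x * np.sum(w * f(u))

# A tiny fixed "integrand network"; ELU(.) + 1 keeps its output strictly positive.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 1)), rng.normal(size=16)
W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)

def elu(a):
    return np.where(a > 0, a, np.expm1(a))

def integrand(t):
    t = np.atleast_1d(t)
    h = np.tanh(W1 @ t[None, :] + b1[:, None])
    return (elu(W2 @ h + b2[:, None]) + 1.0).ravel()  # strictly positive

def F(x, beta=0.0):
    """Monotonic scalar transform of Equation (1): F(x) = int_0^x f(t) dt + beta."""
    return cc_integrate(integrand, x) + beta
```

Because the integrand is everywhere positive, F is strictly increasing regardless of how the inner network's weights are set.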
While this might appear daunting, such integrals can often be efficiently approximated numerically using Clenshaw-Curtis quadrature. The better known trapezoidal rule, which corresponds to the two-point Newton-Cotes quadrature rule, has an exponential convergence when the integrand is periodic and the range of integration corresponds to its period. Clenshaw-Curtis quadrature takes advantage of this property by using a change of variables followed by a cosine transform. This extends the exponential convergence of the trapezoidal rule for periodic functions to any Lipschitz continuous function. As a result, the number of evaluation points required to reach convergence grows with the Lipschitz constant of the function.

Backward integration Training the integrand neural network f requires evaluating the gradient of F with respect to its parameters. While this gradient could be obtained by backpropagating directly through the integral solver, this would also result in a memory footprint that grows linearly with the number of integration steps. Instead, the derivative of an integral with respect to a parameter ω can be expressed with the Leibniz integral rule:

d/dω ( ∫_{a(ω)}^{b(ω)} f(t; ω) dt ) = f(b(ω); ω) db(ω)/dω − f(a(ω); ω) da(ω)/dω + ∫_{a(ω)}^{b(ω)} ∂f(t; ω)/∂ω dt.   (2)

Applying Equation (2) to evaluate the derivative of Equation (1) with respect to the parameters ψ, we find

∇_ψ F(x; ψ) = f(x; ψ) ∇_ψ(x) − f(0; ψ) ∇_ψ(0) + ∫_0^x ∇_ψ f(t; ψ) dt + ∇_ψ β
            = ∫_0^x ∇_ψ f(t; ψ) dt + ∇_ψ β,   (3)

where the first two terms vanish since the integration bounds x and 0 do not depend on ψ.

When using a UMNN block in a neural architecture, it is also important to be able to compute its derivative with respect to its input x.
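Equation (3) can be checked on a toy problem (a hedged sketch: the two-parameter integrand f(t; ψ) = exp(ψ₁t + ψ₂) stands in for the integrand network and ψ for its weights). Integrating the per-parameter gradients reproduces a finite-difference derivative of the integral itself, without backpropagating through the quadrature.

```python
import numpy as np

# f(t; psi) = exp(psi[0]*t + psi[1]) is strictly positive, so its integral is monotonic.
def f(t, psi):
    return np.exp(psi[0] * t + psi[1])

def grad_f(t, psi):
    v = f(t, psi)
    return np.stack([t * v, np.ones_like(t) * v])   # d f / d psi, shape (2, n)

def quad(g, x, n=2001):
    """Composite trapezoidal rule for int_0^x g(t) dt (vectorized over leading axes)."""
    t = np.linspace(0.0, x, n)
    y = g(t)
    h = x / (n - 1)
    return h * (y[..., 0] / 2 + y[..., 1:-1].sum(axis=-1) + y[..., -1] / 2)

psi = np.array([0.3, -0.5])
x = 1.7

# Leibniz rule (Equation (3)): grad_psi F = int_0^x grad_psi f(t; psi) dt
g_leibniz = quad(lambda t: grad_f(t, psi), x)

# Finite-difference check against the integral F itself
eps = 1e-6
g_fd = np.array([
    (quad(lambda t: f(t, psi + eps * e), x) - quad(lambda t: f(t, psi - eps * e), x)) / (2 * eps)
    for e in np.eye(2)
])
```

The two gradient estimates agree to numerical precision, while the Leibniz route only ever stores per-point gradients.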
In this case, applying Equation (2) leads to

dF(x; ψ)/dx = f(x; ψ).   (4)

Equations (3) and (4) make the memory footprint for the backward pass independent from the number of integration steps, and therefore also from the desired accuracy. Indeed, instead of computing the gradient of the integral (which requires keeping track of all the integration steps), we integrate the gradient (which is memory efficient, as this corresponds to summing gradients at different evaluation points). We provide the pseudo-code of the forward and backward passes using Clenshaw-Curtis quadrature in Appendix B.

Numerical inversion In UMNNs, the modeled monotonic function F is arbitrary. As a result, computing its inverse cannot be done analytically. However, since F is strictly monotonic, it admits a unique inverse x for any point y = F(x; ψ) in its image; inversion can therefore be computed efficiently with common root-finding algorithms. In our experiments, search algorithms such as the bisection method proved to be fast enough.

3 UMNN autoregressive models

3.1 Normalizing flows

A Normalizing Flow [NF, Rezende and Mohamed, 2015] is defined as a sequence of invertible transformations ui : Rd → Rd (i = 1, ..., k) composed together to create an expressive invertible mapping u = u1 ◦ · · · ◦ uk : Rd → Rd. It is common for normalizing flows to stack the same parametric function ui (with different parameter values) and to reverse the variable ordering after each transformation. For this reason, we will focus on how to build one of these repeated transformations, which we further refer to as g : Rd → Rd.

Density estimation NFs are most commonly used for density estimation, mapping empirical samples to unstructured noise.
Using normalizing flows, we define a bijective mapping u(·; θ) : Rd → Rd from a sample x ∈ Rd to a latent vector z ∈ Rd equipped with a density pZ(z). The transformation u implicitly defines a density p(x; θ) as given by the change of variables formula,

p(x; θ) = pZ(u(x; θ)) |det Ju(x;θ)|,   (5)

where Ju(x;θ) is the Jacobian of u(x; θ) with respect to x. The resulting model is trained by maximizing the likelihood of the data {x1, ..., xN}.

Variational auto-encoders NFs are also used in VAE to improve posterior modeling. In this case, a normalizing flow transforms a distribution pZ into a complex distribution q which can better model the variational posterior. The change of variables formula yields

q(u(z; θ)) = pZ(z) |det Ju(z;θ)|^−1.   (6)

3.2 Autoregressive transformations

To be of practical use, NFs must be composed of transformations for which the determinant of the Jacobian can be computed efficiently; otherwise, its evaluation would run in O(d3). A common solution consists in making the transformation g autoregressive, i.e., such that g(x; θ) can be rewritten as a vector of d scalar functions,

g(x; θ) = [g1(x1; θ) ... gi(x1:i; θ) ... gd(x1:d; θ)],

where x1:i = [x1 ... xi]T is the vector containing the first i elements of the full vector x.
The Jacobian of this function is lower triangular, which makes the computation of its determinant O(d). Enforcing the bijectivity of each component gi is then sufficient to make g bijective as well.

For the multivariate density p(x; θ) induced by g(x; θ) and pZ(z), we can use the chain rule to express the joint probability of x as a product of d univariate conditional densities,

p(x; θ) = p(x1; θ) ∏_{i=1}^{d−1} p(xi+1|x1:i; θ).   (7)

Figure 1: (a) A normalizing flow made of repeated UMNN-MAF transformations g with identical architectures. (b) A UMNN-MAF which transforms a vector x ∈ R3. (c) The UMNN network used to map x3 to z3 conditioned on the embedding h3(x1:2).

When pZ(z) is a factored distribution pZ(z) = ∏_{i=1}^d p(zi), we identify that each component zi coupled with the corresponding function gi encodes the conditional p(xi|x1:i−1; θ). Autoregressive transformations strongly rely on the expressiveness of the scalar functions gi. In this work, we propose to use UMNNs to create powerful bijective scalar transformations.

3.3 UMNN autoregressive transformations (UMNN-MAF)

We now combine UMNNs with an embedding of the conditioning variables to build invertible autoregressive functions gi. Specifically, we define

gi(x1:i; θ) = F i(xi, hi(x1:i−1; φi); ψi) = ∫_0^{xi} f i(t, hi(x1:i−1; φi); ψi) dt + βi(hi(x1:i−1; φi)),   (8)

where hi(·; φi) : Ri−1 → Rq is a q-dimensional neural embedding of the conditioning variables x1:i−1 and βi(·) : Ri−1 → R. Both degenerate into constants for g1(x1).
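The triangular-Jacobian property behind this construction can be checked numerically on an affine autoregressive toy map (an illustration only; in UMNN-MAF each affine component is replaced by the integral transform of Equation (8)). The Jacobian's determinant reduces to the product of its diagonal terms, an O(d) computation instead of O(d3).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
a = np.exp(rng.normal(size=d))                # positive diagonal terms
M = np.tril(rng.normal(size=(d, d)), k=-1)    # strictly lower-triangular mixing

def g(x):
    # g_i depends only on x_{1:i}: an affine autoregressive toy transform
    return a * x + M @ x

x = rng.normal(size=d)

# Numerical Jacobian via central differences; column j is dg/dx_j
eps = 1e-6
J = np.stack([(g(x + eps * e) - g(x - eps * e)) / (2 * eps) for e in np.eye(d)], axis=1)

# Lower triangular => log|det J| is the sum of the log-diagonal, O(d)
logdet_fast = np.sum(np.log(a))
sign, logdet_full = np.linalg.slogdet(J)
```

Here `logdet_fast` agrees with the full O(d3) `slogdet` evaluation, and `J` is verifiably lower triangular.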
The parameters θ of the whole transformation g(·; θ) are the union of all parameters φi and ψi. For simplicity, we drop the parameters of the networks from the notation, rewriting f i(·; ψi) as f i(·) and hi(·; φi) as hi(·).

In our implementation, we use a Masked Autoregressive Network [Germain et al., 2015, Kingma et al., 2016, Papamakarios et al., 2017] to simultaneously parameterize the d embeddings. In what follows we refer to the resulting UMNN autoregressive transformation as UMNN-MAF. Figure 1 summarizes the complete architecture.

Log-density The change of variables formula applied to the UMNN autoregressive transformation results in the log-density

log p(x; θ) = log ( pZ(g(x; θ)) |det Jg(x;θ)| )
            = log pZ(g(x; θ)) + log | ∏_{i=1}^d ∂F i(xi, hi(x1:i−1))/∂xi |
            = log pZ(g(x; θ)) + ∑_{i=1}^d log f i(xi, hi(x1:i−1)).   (9)

Therefore, the transformation leads to a simple expression of (the determinant of) its Jacobian, which can be computed efficiently with a single forward pass. This is different from FFJORD [Grathwohl et al., 2018], which relies on numerical methods to compute both the Jacobian and the transformation between the data and the latent space. Our proposed method therefore makes the computation of the Jacobian exact and efficient at the same time.

Sampling Generating samples requires evaluating the inverse transformation g−1(z; θ). The components of the inverse vector xinv = g−1(z; θ) can be computed recursively by inverting each component of g(x; θ):

xinv1 = (g1)−1(z1; h1),   (10)
xinvi = (gi)−1(zi; hi(xinv1:i−1))   if i > 1,   (11)

where (gi)−1 is the inverse of gi.
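The recursion of Equations (10) and (11) can be sketched with bisection as the scalar root-finder (a toy monotonic gi stands in for the UMNN transform; the search bounds and iteration count are illustrative). Each coordinate is recovered in turn, conditioning on the already-inverted prefix.

```python
import numpy as np

def g_i(i, x_i, x_prev):
    # Toy component: strictly increasing in x_i (derivative >= 0.5) for any prefix
    return x_i + 0.5 * np.tanh(x_i) + np.sin(np.sum(x_prev))

def bisect_inverse(fun, y, lo=-50.0, hi=50.0, iters=80):
    """Solve fun(v) = y for a strictly increasing scalar fun by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if fun(mid) < y else (lo, mid)
    return 0.5 * (lo + hi)

def g(x):
    return np.array([g_i(i, x[i], x[:i]) for i in range(len(x))])

def g_inv(z):
    # Equations (10)-(11): invert one coordinate at a time, conditioning on
    # the already-recovered prefix x_{1:i-1}.
    x = np.zeros_like(z)
    for i in range(len(z)):
        x[i] = bisect_inverse(lambda v: g_i(i, v, x[:i]), z[i])
    return x

z = np.array([0.3, -1.2, 2.0])
x = g_inv(z)
```

Applying the forward map to the recovered `x` reproduces `z` up to the bisection tolerance.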
Another approach to invert an autoregressive model would be to approximate its inverse with another autoregressive network [Oord et al., 2018]. In this case, the evaluation of the approximated inverse model is as fast as the forward model.

Universality Since the proof is straightforward, we only sketch why UMNN-MAF is a universal density approximator of continuous random variables. We rely on the inverse sampling theorem: it suffices to show that UMNNs are universal approximators of continuously derivable (C1) monotonic functions. Indeed, if UMNNs can represent any C1 monotonic function, then they can also represent the (inverse) cumulative distribution function of any continuous random variable. Any continuously derivable monotonic function f : D → I can be expressed as the integral f(x) = ∫_a^x (df/dx) dx + f(a), ∀x, a ∈ D. The derivative df/dx is a continuous positive function, and the universal approximation theorem of NNs ensures it can be successfully approximated with a NN of sufficient capacity (such as those used in UMNNs).

4 Related work

The most similar works to UMNN-MAF are certainly Neural Autoregressive Flow [NAF, Huang et al., 2018] and Block Neural Autoregressive Flow [B-NAF, De Cao et al., 2019], which both rely on strictly monotonic transformations for building bijective mappings. In NAF, transformations are defined as neural networks whose activation functions are all constrained to be strictly monotonic and whose weights are the output of a strictly positive and autoregressive HyperNetwork [Ha et al., 2017]. Huang et al. [2018] show that NAFs are universal density approximators. In B-NAF, the authors improve on the scalability of the NAF architecture by making use of masking operations instead of HyperNetworks. They also present a proof of the universality of B-NAF, which extends to UMNN-MAF.
Our work differs from both NAF and B-NAF in the sense that the UMNN monotonic transformation is based on free-form neural networks for which no constraint, beyond positiveness of the output, is enforced on the hypothesis class. This leads to multiple advantages: it enables the use of any state-of-the-art neural architecture, simplifies weight initialization, and leads to a more lightweight evaluation of the Jacobian.

More generally, UMNN-MAF relates to works on normalizing flows built upon autoregressive networks and affine transformations. Germain et al. [2015] first introduced masking as an efficient way to build autoregressive networks, and proposed autoregressive networks for density estimation of high dimensional binary data. Masked Autoregressive Flows [Papamakarios et al., 2017] and Inverse Autoregressive Flows [Kingma et al., 2016] have generalized this approach to real data, respectively for density estimation and for latent posterior representation in variational auto-encoders. More recently, Oliva et al. [2018] proposed to stack various autoregressive architectures to create powerful reversible transformations. Meanwhile, Jaini et al. [2019] proposed a new Sum-of-Squares flow that is defined as the integral of a second-order polynomial parametrized by an autoregressive NN.

With NICE, Dinh et al. [2015] introduced coupling layers, which correspond to bijective transformations splitting the input vector into two parts. They are defined as

z1:k = x1:k   and   zk+1:d = e^{σ(x1:k)} ⊙ xk+1:d + µ(x1:k),   (12)

where σ and µ are two unconstrained functions Rd−k → Rd−k. The same authors introduced RealNVP [Dinh et al., 2017], which combines coupling layers with normalizing flows and multi-scale architectures for image generation. Glow [Kingma and Dhariwal, 2018] extends RealNVP by introducing invertible 1x1 convolutions between each step of the flow.
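For reference, the coupling layer of Equation (12) and its analytic inverse can be sketched as follows (a minimal NumPy illustration; σ and µ are linear stand-ins for the unconstrained networks, and the sizes are placeholders).

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 6, 3
# Stand-ins for the unconstrained networks sigma(.) and mu(.) of Equation (12)
A, B = rng.normal(size=(d - k, k)), rng.normal(size=(d - k, k))

def sigma(x1):
    return np.tanh(A @ x1)   # log-scale

def mu(x1):
    return B @ x1            # shift

def coupling_forward(x):
    # z_{1:k} = x_{1:k};  z_{k+1:d} = exp(sigma(x_{1:k})) * x_{k+1:d} + mu(x_{1:k})
    x1, x2 = x[:k], x[k:]
    return np.concatenate([x1, np.exp(sigma(x1)) * x2 + mu(x1)])

def coupling_inverse(z):
    # The first block passes through unchanged, so the inverse is analytic
    z1, z2 = z[:k], z[k:]
    return np.concatenate([z1, (z2 - mu(z1)) * np.exp(-sigma(z1))])

x = rng.normal(size=d)
z = coupling_forward(x)
```

Because the scale is an exponential, the layer is invertible for any output of σ, which is the property a UMNN-based coupling layer would have to preserve.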
In this work we have used UMNNs in the context of autoregressive architectures; however, UMNNs could also be applied to replace the linear transformation in coupling layers.

Finally, our architecture also shares a connection with Neural Ordinary Differential Equations [NODE, Chen et al., 2018]. The core idea of this architecture is to learn an ordinary differential equation whose dynamics are parameterized by a neural network. Training can be carried out by backpropagating efficiently through the ODE solver, with constant memory requirements. Among other applications, NODE can be used to model a continuous normalizing flow with a free-form Jacobian as in FFJORD [Grathwohl et al., 2018]. Similarly, a UMNN transformation can be seen as a structured neural ordinary differential equation in which the dynamics of the vector field are separable and can be solved efficiently by direct integration.

Figure 2: Density estimation and sampling with a UMNN-MAF network on 2D toy problems. Top: Samples from the empirical distribution p(x). Middle: Learned density p(x; θ). Bottom: Samples drawn by numerical inversion. UMNN-MAF manages to precisely capture multi-modal and/or discontinuous distributions. Sampling is possible even if the model is not invertible analytically.

5 Experiments

In this section, we evaluate the expressiveness of UMNN-MAF on a variety of density estimation benchmarks, as well as for approximate inference in variational auto-encoders. The source code to reproduce our experiments will be made available on Github at the end of the reviewing process.

Experiments were carried out using the same integrand neural network in the UMNN component – i.e., in Equation 8, f i = f with shared weights ψi = ψ for i ∈ {1, . . . , d}. The functions βi are taken to be equal to one of the outputs of the embedding network.
We observed in our experiments that sharing the same integrand function does not impact performance. Consequently, the neural embedding function hi must produce a fixed-size output for i ∈ {1, . . . , d}.

5.1 2D toy problems

We first train a UMNN-MAF on 2-dimensional toy distributions, as defined by Grathwohl et al. [2018]. To train the model, we minimize the negative log-likelihood of the observed data,

L(θ) = − ∑_{n=1}^N [ log pZ(g(x^n; θ)) + ∑_{i=1}^d log f(x^n_i, hi(x^n_{1:i−1})) ].   (13)

The flow used to solve these tasks is the same for all distributions and is composed of a single transformation. More details can be found in Appendix A.1.

Figure 2 demonstrates that our model is able to learn a change of variables that warps a simple isotropic Gaussian into multimodal and/or discontinuous distributions. We observe from the figure that our model precisely captures the density of the data. We also observe that numerical inversion for generating samples yields good results.

5.2 Density estimation

We further validate UMNN-MAF by comparing it to state-of-the-art normalizing flows. We carry out experiments on tabular datasets (POWER, GAS, HEPMASS, MINIBOONE, BSDS300) as well as on MNIST. We follow the experimental protocol of Papamakarios et al. [2017]. All training hyper-parameters and architectural details are given in Appendix A.1. For each dataset, we report

Table 1: Average negative log-likelihood on test data over 3 runs; error bars are equal to the standard deviation. Results are reported in nats for tabular data and bits/dim for MNIST; lower is better. The best performing architecture for each dataset is written in bold and the best performing architecture per category is underlined. (a) Non-autoregressive models, (b) Autoregressive models, (c) Monotonic and autoregressive models.
UMNN outperforms other monotonic transformations on 4 tasks over 6 and is the overall best performing model on 2 tasks over 6.

Dataset | POWER | GAS | HEPMASS | MINIBOONE | BSDS300 | MNIST
(a) RealNVP - Dinh et al. [2017] | −0.17±.01 | −8.33±.14 | 18.71±.02 | 13.55±.49 | −153.28±1.78 | -
(a) Glow - Kingma and Dhariwal [2018] | −0.17±.01 | −8.15±.40 | 19.92±.08 | 11.35±.07 | −155.07±.03 | -
(a) FFJORD - Grathwohl et al. [2018] | −0.46±.01 | −8.59±.12 | 14.92±.08 | 10.43±.04 | −157.40±.19 | -
(b) MADE - Germain et al. [2015] | 3.08±.03 | −3.56±.04 | 20.98±.02 | 15.59±.50 | −148.85±.28 | 2.04±.01
(b) MAF - Papamakarios et al. [2017] | −0.24±.01 | −10.08±.02 | 17.70±.02 | 11.75±.44 | −155.69±.28 | 1.89±.01
(b) TAN - Oliva et al. [2018] | −0.60±.01 | −12.06±.02 | 13.78±.02 | 11.01±.48 | −159.80±.07 | 1.19
(c) NAF - Huang et al. [2018] | −0.62±.01 | −11.96±.33 | 15.09±.40 | 8.86±.15 | −157.73±.30 | -
(c) B-NAF - De Cao et al. [2019] | −0.61±.01 | −12.06±.09 | 14.71±.38 | 8.95±.07 | −157.36±.03 | 1.81
(c) SOS - Jaini et al. [2019] | −0.60±.01 | −11.99±.41 | 15.15±.10 | 8.90±.11 | −157.48±.41 | -
(c) UMNN-MAF (ours) | −0.63±.01 | −10.89±.70 | 13.99±.21 | 9.67±.13 | −157.98±.01 | 1.13±.02

results on test data for our best performing model (selected on the validation data).
At testing time, we use a large number of integration steps (100) to compute the integral; this ensures its correctness and avoids misestimating the performance of UMNN-MAF.

Table 1 summarizes our results, where we can see that on tabular datasets, our method is competitive with other normalizing flows. For POWER, our architecture slightly outperforms all others. It is also better than the other monotonic networks (category (c)) on 3 of the 5 tabular datasets. From these results, we could conclude that Transformation Autoregressive Networks [TAN, Oliva et al., 2018] is overall the best method for density estimation. It is however important to note that TAN is a flow composed of many heterogeneous transformations (both autoregressive and non-autoregressive). For this reason, it should not be directly compared to the other models, whose respective results are specific to a single architecture. However, TAN provides the interesting insight that combining heterogeneous components into a flow leads to better results than a homogeneous flow.

Notably, we do not make use of a multi-scale architecture to train our model on MNIST. On this task, UMNN-MAF outperforms all other models by a reasonable margin. Samples generated by a conditional model are shown in Figure 3, for which it is worth noting that UMNN-MAF is the first monotonic architecture that has been inverted to generate samples. Indeed, MNIST can be considered a high dimensional dataset (d = 784) for standard feed-forward neural networks, of which autoregressive networks are part. NAF and B-NAF do not report any result for this benchmark, presumably because of memory explosion. In comparison, BSDS300, whose data dimension is one order of magnitude smaller than that of MNIST (63 ≪ 784), is the largest dataset they have tested on. Table 2 shows the number of parameters used by UMNN-MAF in comparison to B-NAF and NAF.
For bigger datasets, UMNN-MAF requires fewer parameters than NAF to reach similar or better performance. This could explain why NAF has never been used for density estimation on MNIST.

Table 2: Comparison of the number of parameters between NAF, B-NAF and UMNN-MAF. On high dimensional datasets, UMNN-MAF requires fewer parameters than NAF and a similar number to B-NAF.

Dataset | NAF | B-NAF | UMNN-MAF
POWER (d = 6) | 4.14e5 | 3.07e5 | 5.09e5
GAS (d = 8) | 4.02e5 | 5.44e5 | 8.15e5
HEPMASS (d = 21) | 9.27e6 | 3.72e6 | 3.62e6
MINIBOONE (d = 43) | 7.49e6 | 4.09e6 | 3.46e6
BSDS300 (d = 63) | 3.68e7 | 8.76e6 | 1.56e7

Figure 3: Samples generated by numerical inversion of a conditional UMNN-MAF trained on MNIST. Samples z are drawn from an isotropic Gaussian with σ = .75. See Appendix C for more details.

Table 3: Average negative evidence lower bound of VAEs over 3 runs; error bars are equal to the standard deviation. Results are reported in bits per dim for Freyfaces and in nats for the other datasets; lower is better. UMNN-MAF performs slightly better than IAF but is outperformed by B-NAF. We believe that the gap in performance between B-NAF and UMNN is due to the way the NF is conditioned by the encoder's output.

Dataset | MNIST | Freyfaces | Omniglot | Caltech 101
(a) VAE - Kingma and Welling [2013] | 86.65±.06 | 4.53±.02 | 104.28±.39 | 110.80±.46
(a) Planar - Rezende and Mohamed [2015] | 86.06±.32 | 4.40±.06 | 102.65±.42 | 109.66±.42
(a) IAF - Kingma et al. [2016] | 84.20±.17 | 4.47±.05 | 102.41±.04 | 111.58±.38
(a) Sylvester - Berg et al. [2018] | 83.32±.06 | 4.45±.04 | 99.00±.04 | 104.62±.29
(a) FFJORD - Grathwohl et al. [2018] | 82.82±.01 | 4.39±.01 | 98.33±.09 | 104.03±.43
(b) B-NAF - De Cao et al.
[2019] | 83.59±.15 | 4.42±.05 | 100.08±.07 | 105.42±.49
(b) UMNN-MAF (ours) | 84.11±.05 | 4.51±.01 | 100.98±.13 | 110.45±.69

5.3 Variational auto-encoders

To assess the performance of our model, we follow the experimental setting of Berg et al. [2018] for VAE. The encoder and decoder architectures can be found in the appendix of their paper. In VAE, it is usual to let the encoder output the parameters of the flow. For UMNN-MAF, this would cause the dimension of the encoder output to be too large. Instead, the encoder output is passed as additional entries of the UMNN-MAF. Like the other architectures, the UMNN-MAF also takes as input a vector of noise drawn from an isotropic Gaussian of dimension 64.

Table 3 presents our results. It shows that on MNIST and Omniglot, UMNN-MAF slightly outperforms the classical VAE as well as planar flows. Moreover, on these datasets and Freyfaces, IAF, B-NAF and UMNN-MAF achieve similar results. FFJORD is the best among all; however, it is worth noting that the roles of the encoder outputs in FFJORD, B-NAF, IAF and Sylvester are all different. We believe that the heterogeneity of the results could be, at least in part, due to the different amortizations.

6 Discussion and summary

Static integral quadrature can be inaccurate. Computing the integral with static Clenshaw-Curtis quadrature only requires the evaluation of the integrand at predefined points. As such, batches of points can be processed all at once, which makes static Clenshaw-Curtis quadrature well suited for neural networks. However, static quadratures do not account for the error made during the integration. As a consequence, the quadrature is inaccurate when the integrand is not smooth enough and the number of integration steps is too small. In this work, we have reduced the integration error by applying the normalization described by Gouk et al.
[2018] in order to control the Lipschitz\nconstant of the integrand and appropriately set the number of integration steps. We observed that as\nlong as the Lipschitz constant of the network does not increase dramatically (< 1000), a reasonable\nnumber of integration steps (< 100) is suf\ufb01cient to ensure the convergence of the quadrature. An\nalternative solution would be to use dynamic quadrature such as dynamic Clenshaw-Curtis.\n\nEf\ufb01ciency of numerical inversion. Architectures relying on linear transformations [Papamakar-\nios et al., 2017, Kingma et al., 2016, Dinh et al., 2017, Kingma and Dhariwal, 2018] are trivially\nexactly and ef\ufb01ciently invertible. In contrast, the UMNN transformation has no analytic inverse.\nNevertheless, it can be inverted numerically using root-\ufb01nding algorithms. Since most such algo-\nrithms rely on multiple nested evaluations of the function to be inverted, applying them naively to\na numerical integral would quickly become very inef\ufb01cient. However, the Clenshaw-Curtis quadra-\nture is part of the nested quadrature family, meaning that the evaluation of the integral at multiple\nnested points can take advantage of previous evaluations and thus be implemented ef\ufb01ciently. As an\nalternative, Oord et al. [2018] have shown that an invertible model can always be distilled to learn its\ninverse, and thus make the inversion ef\ufb01cient whatever the cost of inversion of the original model.\n\nScalability and complexity analysis. UMNN-MAF is particularly well suited for density estima-\ntion because the computation of the Jacobian only requires a single forward evaluation of a NN.\n\n8\n\n\fTogether with the Leibniz integral rule, they make the evaluation of the log-likelihood derivative\nas memory ef\ufb01cient as usual supervised learning, which is equivalent to a single backward pass on\nthe computation graph. 
By contrast, density estimation with previous monotonic transformations typically requires a backward evaluation of the computation graph of the transformer network to obtain the Jacobian. This pass must then be differentiated backward again in order to obtain the log-likelihood derivative. Both NAF and B-NAF provide a method to make this computation numerically stable, but both increase the size of the computation graph of the log-likelihood derivative, hence leading to a memory overhead. The memory saved by the Leibniz rule can serve to speed up the quadrature computation. In the case of static Clenshaw-Curtis, the function values at each evaluation point can be computed in parallel using batches of points. Consequently, when the GPU memory is large enough to store "meta-batches" of size d × N × B (with d the dimension of the data, N the number of integration steps and B the batch size), the computation is approximately as fast as a single forward evaluation of the integrand network.

Summary. We have introduced Unconstrained Monotonic Neural Networks, a new invertible transformation built upon free-form neural networks, allowing the use of any state-of-the-art architecture. Monotonicity is guaranteed without constraining the expressiveness of the hypothesis class, contrary to classical approaches. We have shown that the resulting integrated neural network can be evaluated efficiently using standard quadrature rules, while its inverse can be computed with numerical root-finding algorithms. We have also shown that our transformation can be composed into an autoregressive flow, with competitive or state-of-the-art results on density estimation and variational inference benchmarks.
Moreover, UMNN is the first monotonic transformation that has been successfully applied to density estimation on a high-dimensional data distribution (MNIST), showing better results than the classical approaches.

We identify several avenues for improvement and further research. First, we believe that numerical integration could be sped up during training by leveraging the fact that controlled numerical errors can actually help generalization. Moreover, the UMNN transformation would certainly benefit from a dynamic integration scheme, both in terms of accuracy and efficiency. Second, it would be worth comparing the newly introduced monotonic transformation with common approaches for modelling monotonic functions in machine learning. On a similar track, these common approaches could be combined into an autoregressive flow as shown in Section 3.3. Finally, our monotonic transformation could be used within neural architectures other than generative autoregressive networks, such as multi-scale architectures [Dinh et al., 2017] and learnable 1D convolutions [Kingma and Dhariwal, 2018].

Acknowledgments

The authors would like to acknowledge Matthia Sabatelli, Nicolas Vecoven, Antonio Sutera and Louis Wehenkel for useful feedback on the manuscript. They would also like to thank the anonymous reviewers for many relevant remarks. Antoine Wehenkel is a research fellow of the F.R.S.-FNRS (Belgium) and acknowledges its financial support.

References

N. P. Archer and S. Wang. Application of the back propagation neural network algorithm with monotonicity constraints for two-group classification problems. Decision Sciences, 24(1):60-75, 1993.

R. v. d. Berg, L. Hasenclever, J. M. Tomczak, and M. Welling. Sylvester normalizing flows for variational inference. In Conference on Uncertainty in Artificial Intelligence (UAI), 2018.

T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud.
Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571-6583, 2018.

H. Daniels and M. Velikova. Monotone and partially monotone neural networks. IEEE Transactions on Neural Networks, 21(6):906-917, 2010.

N. De Cao, I. Titov, and W. Aziz. Block neural autoregressive flow. arXiv preprint arXiv:1904.04676, 2019.

L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear independent components estimation. In International Conference on Learning Representations workshop track, 2015.

L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.

M. Germain, K. Gregor, I. Murray, and H. Larochelle. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881-889, 2015.

H. Gouk, E. Frank, B. Pfahringer, and M. Cree. Regularisation of neural networks by enforcing Lipschitz continuity. arXiv preprint arXiv:1804.04368, 2018.

W. Grathwohl, R. T. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Machine Learning, 2018.

M. Gupta, A. Cotter, J. Pfeifer, K. Voevodski, K. Canini, A. Mangylov, W. Moczydlowski, and A. Van Esbroeck. Monotonic calibrated interpolated look-up tables. The Journal of Machine Learning Research, 17(1):3790-3836, 2016.

D. Ha, A. M. Dai, and Q. V. Le. Hypernetworks. In 5th International Conference on Learning Representations (ICLR), 2017.

C.-W. Huang, D. Krueger, A. Lacoste, and A. Courville. Neural autoregressive flows. In International Conference on Machine Learning, pages 2083-2092, 2018.

P. Jaini, K. A. Selby, and Y. Yu. Sum-of-squares polynomial flow.
arXiv preprint arXiv:1905.02325, 2019.

D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236-10245, 2018.

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations (ICLR), 2013.

D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743-4751, 2016.

J. Oliva, A. Dubey, M. Zaheer, B. Poczos, R. Salakhutdinov, E. Xing, and J. Schneider. Transformation autoregressive networks. In International Conference on Machine Learning, pages 3895-3904, 2018.

A. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. Driessche, E. Lockhart, L. Cobo, F. Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, pages 3915-3923, 2018.

G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338-2347, 2017.

G. Papamakarios, D. C. Sterratt, and I. Murray. Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows. In 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530-1538, 2015.

J. Sill. Monotonic networks. In Advances in Neural Information Processing Systems, pages 661-667, 1998.

D. Tran, M. D. Hoffman, R. A. Saurous, E. Brevdo, K. Murphy, and D. M. Blei. Deep probabilistic programming. In 5th International Conference on Learning Representations (ICLR), 2017.

A.
van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. In 9th ISCA Speech Synthesis Workshop, page 125, 2016.

S. You, D. Ding, K. Canini, J. Pfeifer, and M. Gupta. Deep lattice networks and partial monotonic functions. In Advances in Neural Information Processing Systems, pages 2981-2989, 2017.