{"title": "Residual Flows for Invertible Generative Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 9916, "page_last": 9926, "abstract": "Flow-based generative models parameterize probability distributions through an invertible transformation and can be trained by maximum likelihood. Invertible residual networks provide a flexible family of transformations where only Lipschitz conditions rather than strict architectural constraints are needed for enforcing invertibility. However, prior work trained invertible residual networks for density estimation by relying on biased log-density estimates whose bias increased with the network's expressiveness. We give a tractable unbiased estimate of the log density, and reduce the memory required during training by a factor of ten. Furthermore, we improve invertible residual blocks by proposing the use of activation functions that avoid gradient saturation and generalizing the Lipschitz condition to induced mixed norms. The resulting approach, called Residual Flows, achieves state-of-the-art performance on density estimation amongst flow-based models, and outperforms networks that use coupling blocks at joint generative and discriminative modeling.", "full_text": "Residual Flows for Invertible Generative Modeling\n\nRicky T. Q. Chen1,3, Jens Behrmann2, David Duvenaud1,3, J\u00f6rn-Henrik Jacobsen1,3\n\nUniversity of Toronto1, University of Bremen2, Vector Institute3\n\nrtqichen@cs.toronto.edu, jensb@uni-bremen.de\n\nduvenaud@cs.toronto.edu, j.jacobsen@vectorinstitute.ai\n\nAbstract\n\nFlow-based generative models parameterize probability distributions through an\ninvertible transformation and can be trained by maximum likelihood. Invertible\nresidual networks provide a \ufb02exible family of transformations where only Lipschitz\nconditions rather than strict architectural constraints are needed for enforcing\ninvertibility. 
However, prior work trained invertible residual networks for density estimation by relying on biased log-density estimates whose bias increased with the network's expressiveness. We give a tractable unbiased estimate of the log density using a "Russian roulette" estimator, and reduce the memory required during training by using an alternative infinite series for the gradient. Furthermore, we improve invertible residual blocks by proposing the use of activation functions that avoid derivative saturation and by generalizing the Lipschitz condition to induced mixed norms. The resulting approach, called Residual Flows, achieves state-of-the-art performance on density estimation amongst flow-based models, and outperforms networks that use coupling blocks at joint generative and discriminative modeling.

1 Introduction

Maximum likelihood is a core machine learning paradigm that poses learning as a distribution alignment problem. However, it is often unclear what family of distributions should be used to fit high-dimensional continuous data. In this regard, the change of variables theorem offers an appealing way to construct flexible distributions that allow tractable exact sampling and efficient evaluation of their densities. This class of models is generally referred to as invertible or flow-based generative models (Deco and Brauer, 1995; Rezende and Mohamed, 2015).
With invertibility as its core design principle, flow-based models (also referred to as normalizing flows) have been shown to be capable of generating realistic images (Kingma and Dhariwal, 2018) and can achieve density estimation performance on par with competing state-of-the-art approaches (Ho et al., 2019).
They have been applied to study adversarial robustness (Jacobsen et al., 2019) and to train hybrid models with both generative and classification capabilities (Nalisnick et al., 2019) using a weighted maximum likelihood objective.
Existing flow-based models (Rezende and Mohamed, 2015; Kingma et al., 2016; Dinh et al., 2014; Chen et al., 2018) make use of restricted transformations with sparse or structured Jacobians (Figure 1).

Figure 1: Pathways to designing scalable normalizing flows and their enforced Jacobian structure: (a) Det. Identities (Low Rank); (b) Autoregressive (Lower Triangular); (c) Coupling (Structured Sparsity); (d) Unbiased Est. (Free-form). Residual Flows fall under unbiased estimation with free-form Jacobian.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

These allow efficient computation of the log probability under the model, but at the cost of architectural engineering. Transformations that scale to high-dimensional data rely on specialized architectures such as coupling blocks (Dinh et al., 2014, 2017) or solving an ordinary differential equation (Grathwohl et al., 2019). Such approaches have a strong inductive bias that can hinder their application in other tasks, such as learning representations that are suitable for both generative and discriminative tasks.
Recent work by Behrmann et al. (2019) showed that residual networks (He et al., 2016) can be made invertible by simply enforcing a Lipschitz constraint, allowing the use of a very successful discriminative deep network architecture for unsupervised flow-based modeling. Unfortunately, the density evaluation requires computing an infinite series. The choice of a fixed truncation estimator used by Behrmann et al.
(2019) leads to substantial bias that is tightly coupled with the expressiveness of the network, and cannot be said to be performing maximum likelihood, as bias is introduced in both the objective and its gradients.
In this work, we introduce Residual Flows, a flow-based generative model that produces an unbiased estimate of the log density and has memory-efficient backpropagation through the log density computation. This allows us to use expressive architectures and train via maximum likelihood. Furthermore, we propose and experiment with the use of activation functions that avoid derivative saturation, and with induced mixed norms for Lipschitz-constrained neural networks.

2 Background

Maximum likelihood estimation. To perform maximum likelihood with stochastic gradient descent, it is sufficient to have an unbiased estimator for the gradient, as

∇_θ D_KL(p_data || p_θ) = −∇_θ E_{x∼p_data(x)}[log p_θ(x)] = −E_{x∼p_data(x)}[∇_θ log p_θ(x)],    (1)

where p_data is the unknown data distribution, which can be sampled from, and p_θ is the model distribution. An unbiased estimator of the gradient also immediately follows from an unbiased estimator of the log density function, log p_θ(x).

Change of variables theorem. With an invertible transformation f, the change of variables

log p(x) = log p(f(x)) + log |det (df(x)/dx)|    (2)

captures the change in density of the transformed samples. A simple base distribution such as a standard normal is often used for log p(f(x)). Tractable evaluation of (2) allows flow-based models to be trained using the maximum likelihood objective (1). In contrast, variational autoencoders (Kingma and Welling, 2014) can only optimize a stochastic lower bound, and generative adversarial networks (Goodfellow et al., 2014) require an extra discriminator network for training.

Invertible residual networks (i-ResNets).
Residual networks are composed of simple transformations y = f(x) = x + g(x). Behrmann et al. (2019) noted that this transformation is invertible by the Banach fixed point theorem if g is contractive, i.e. with Lipschitz constant strictly less than unity, which was enforced using spectral normalization (Miyato et al., 2018; Gouk et al., 2018).
Applying i-ResNets to the change of variables (2), the identity

log p(x) = log p(f(x)) + tr( Σ_{k=1}^∞ ((−1)^{k+1} / k) [J_g(x)]^k )    (3)

was shown, where J_g(x) = dg(x)/dx. Furthermore, the Skilling-Hutchinson estimator (Skilling, 1989; Hutchinson, 1990) was used to estimate the trace in the power series. Behrmann et al. (2019) used a fixed truncation to approximate the infinite series in (3). However, this naïve approach has a bias that grows with the number of dimensions of x and the Lipschitz constant of g, as both affect the convergence rate of this power series. As such, the fixed truncation estimator requires a careful balance between bias and expressiveness, and cannot scale to higher dimensional data. Without decoupling the objective and estimation bias, i-ResNets end up optimizing for the bias without improving the actual maximum likelihood objective (see Figure 2).

3 Residual Flows

3.1 Unbiased Log Density Estimation for Maximum Likelihood Estimation

Evaluation of the exact log density function log p_θ(·) in (3) requires infinite time due to the power series.
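To make the truncation issue concrete, here is a minimal sketch (our own, not from the paper's code) that evaluates the fixed-truncation power series for a small, fixed contractive matrix standing in for J_g; for a matrix this small the trace is computed exactly rather than with the Skilling-Hutchinson estimator, and all function names are ours.

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

def logdet_truncated(Jg, n_terms):
    """Fixed-truncation estimate of log det(I + Jg) via the power series
    sum_{k=1}^{n_terms} (-1)^{k+1} / k * tr(Jg^k), valid when Lip(g) < 1."""
    Jk = [row[:] for row in Jg]  # Jg^1
    total = 0.0
    for k in range(1, n_terms + 1):
        total += (-1) ** (k + 1) / k * trace(Jk)
        Jk = matmul(Jk, Jg)
    return total

# Example: a contractive 2x2 "Jacobian" with eigenvalues 0.3 and 0.2,
# so the exact value is log det(I + Jg) = log(1.3) + log(1.2) = log(1.56).
Jg = [[0.3, 0.1],
      [0.0, 0.2]]
print(logdet_truncated(Jg, 2))   # truncation bias is visible here
print(logdet_truncated(Jg, 30))  # converges exponentially toward log(1.56)
```

A small truncation leaves a residual bias that grows with the Lipschitz constant, which is exactly the problem the unbiased estimator of this section addresses.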
Instead, we rely on randomization to derive an unbiased estimator that can be computed in finite time (with probability one), based on an existing concept (Kahn, 1955).
To illustrate the idea, let Δ_k denote the k-th term of an infinite series, and suppose we always evaluate the first term and then flip a coin b ∼ Bernoulli(q) to determine whether we stop or continue evaluating the remaining terms. By reweighting the remaining terms by 1/(1−q), we obtain an unbiased estimator

Δ_1 + E[ ((Σ_{k=2}^∞ Δ_k) / (1−q)) 1_{b=0} + (0) 1_{b=1} ] = Δ_1 + ((Σ_{k=2}^∞ Δ_k) / (1−q)) (1−q) = Σ_{k=1}^∞ Δ_k.    (4)

Interestingly, whereas naïve computation would always use infinite compute, this unbiased estimator has probability q of being evaluated in finite time. We can obtain an estimator that is evaluated in finite time with probability one by applying this process infinitely many times to the remaining terms. Directly sampling the number of evaluated terms, we obtain the appropriately named "Russian roulette" estimator (Kahn, 1955)

Σ_{k=1}^∞ Δ_k = E_{n∼p(N)}[ Σ_{k=1}^n Δ_k / P(N ≥ k) ].    (5)

We note that the explanation above is only meant to be an intuitive guide and not a formal derivation. The peculiarities of dealing with infinite quantities dictate that we must make assumptions on Δ_k, p(N), or both in order for the equality in (5) to hold. While many existing works have made different assumptions depending on specific applications of (5), we state our result as a theorem where the only condition is that p(N) must have support over all of the indices.
Theorem 1 (Unbiased log density estimator). Let f(x) = x + g(x) with Lip(g) < 1 and N be a random variable with support over the positive integers.
Then

log p(x) = log p(f(x)) + E_{n,v}[ Σ_{k=1}^n ((−1)^{k+1} / k) (v^T [J_g(x)]^k v) / P(N ≥ k) ],    (6)

where n ∼ p(N) and v ∼ N(0, I). A detailed proof is given in Appendix B.
Here we have used the Skilling-Hutchinson trace estimator (Skilling, 1989; Hutchinson, 1990) to estimate the trace of the matrices J_g^k.
Note that since J_g is constrained to have a spectral radius less than unity, the power series converges exponentially. The variance of the Russian roulette estimator is small when the infinite series exhibits fast convergence (Rhee and Glynn, 2015; Beatson and Adams, 2019), and in practice, we did not have to tune p(N) for variance reduction. Instead, in our experiments, we compute two terms exactly and then use the unbiased estimator on the remaining terms with a single sample from p(N) = Geom(0.5). This results in an expected compute cost of 4 terms, which is less than the 5 to 10 terms that Behrmann et al. (2019) used for their biased estimator.
Theorem 1 forms the core of Residual Flows, as we can now perform maximum likelihood training by backpropagating through (6) to obtain unbiased gradients. This allows us to train more expressive networks where a biased estimator would fail (Figure 2).
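The estimator in (6) can be sketched in a few lines for a fixed contractive matrix in place of a network Jacobian. The sketch below is our own construction under two stated simplifications: it uses Rademacher probe vectors for the Skilling-Hutchinson trace estimate (also unbiased, whereas the theorem states v ∼ N(0, I)), and it follows the paper's setting of two exactly computed terms plus a Geom(0.5)-truncated, reweighted tail.

```python
import random

def unbiased_logdet(Jg, q=0.5, exact_terms=2, rng=random):
    """One sample of an unbiased estimate of log det(I + Jg): the first
    `exact_terms` terms of the power series are computed outright, and the
    remaining tail is truncated at n ~ Geom(q) with each tail term
    reweighted by 1 / P(N >= k), which keeps the estimate unbiased."""
    d = len(Jg)
    v = [rng.choice((-1.0, 1.0)) for _ in range(d)]  # Rademacher probe
    n = 1
    while rng.random() > q:  # n ~ Geom(q) with support {1, 2, ...}
        n += 1
    w, total = v[:], 0.0
    for k in range(1, exact_terms + n + 1):
        # w = Jg^k v, built up one matrix-vector product at a time
        w = [sum(Jg[i][j] * w[j] for j in range(d)) for i in range(d)]
        term = (-1) ** (k + 1) / k * sum(vi * wi for vi, wi in zip(v, w))
        if k > exact_terms:
            # tail index j = k - exact_terms has P(N >= j) = (1-q)^(j-1)
            term /= (1 - q) ** (k - exact_terms - 1)
        total += term
    return total

random.seed(0)
Jg = [[0.3, 0.1], [0.0, 0.2]]  # eigenvalues 0.3, 0.2 => exact value log(1.56)
mean = sum(unbiased_logdet(Jg) for _ in range(20000)) / 20000
print(mean)  # fluctuates around log(1.56) ~ 0.4447
```

With q = 0.5 the tail uses 2 terms in expectation, so the expected total cost is 4 terms, matching the figure quoted above.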
The price we pay for the unbiased estimator is variable compute and memory, as each sample of the log density uses a random number of terms in the power series.

Figure 2: i-ResNets suffer from substantial bias when using expressive networks, whereas Residual Flows perform principled maximum likelihood with unbiased stochastic gradients. The plot tracks bits/dim on CIFAR-10 over training epochs for the i-ResNet (biased train estimate vs. actual test value) and the Residual Flow (unbiased train estimate vs. actual test value).

3.2 Memory-Efficient Backpropagation

Memory can be a scarce resource, and running out of memory due to a large sample from the unbiased estimator can halt training unexpectedly. To this end, we propose two methods to reduce the memory consumption during training.
To see how naïve backpropagation can be problematic, the gradient w.r.t. the parameters θ, obtained by directly differentiating through the power series (6), can be expressed as

∂/∂θ log det(I + J_g(x, θ)) = E_{n,v}[ Σ_{k=1}^n ((−1)^{k+1} / k) (∂(v^T [J_g(x, θ)]^k v) / ∂θ) / P(N ≥ k) ].    (7)

Unfortunately, this estimator requires each term to be stored in memory because ∂/∂θ needs to be applied to each term. The total memory cost is then O(n·m), where n is the number of computed terms and m is the number of residual blocks in the entire network. This is extremely memory-hungry during training, and a large random sample of n can occasionally result in running out of memory.

Neumann gradient series. Instead, we can specifically express the gradients as a power series derived from a Neumann series (see Appendix C). Applying the Russian roulette and trace estimators, we obtain the following theorem.
Theorem 2 (Unbiased log-determinant gradient estimator).
Let Lip(g) < 1 and N be a random variable with support over the positive integers. Then

∂/∂θ log det(I + J_g(x, θ)) = E_{n,v}[ ( Σ_{k=0}^n ((−1)^k / P(N ≥ k)) v^T [J_g(x, θ)]^k ) (∂(J_g(x, θ)) / ∂θ) v ],    (8)

where n ∼ p(N) and v ∼ N(0, I).

As the power series in (8) does not need to be differentiated through, using this estimator reduces the memory requirement by a factor of n. This is especially useful when using the unbiased estimator, as the memory will be constant regardless of the number of terms we draw from p(N).

Backward-in-forward: early computation of gradients. We can further reduce memory by partially performing backpropagation during the forward evaluation. By taking advantage of log det(I + J_g(x, θ)) being a scalar quantity, the partial derivative from the objective L is

∂L/∂θ = (∂L / ∂ log det(I + J_g(x, θ))) · (∂ log det(I + J_g(x, θ)) / ∂θ),    (9)

where the first factor is a scalar and the second a vector. For every residual block, we compute ∂ log det(I + J_g(x, θ))/∂θ along with the forward pass, release the memory for the computation graph, then simply multiply by ∂L/∂ log det(I + J_g(x, θ)) later during the main backprop. This reduces memory by another factor of m, to O(1), with negligible overhead.
Note that while these two tricks remove the memory cost from backpropagating through the log det terms, computing the path-wise derivatives from log p(f(x)) still requires the same amount of memory as a single evaluation of the residual network.
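As a sanity check on the Neumann-series form of the gradient, the sketch below (our own, with no trace estimator) differentiates log det(I + θA) for a scalar parameter θ at θ = 1: the derivative tr((I + A)⁻¹ A) is expanded with the Neumann series (I + A)⁻¹ = Σ_k (−A)^k, so nothing in the series itself needs to be differentiated through.

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

def neumann_grad(A, n_terms):
    """Approximate d/dtheta log det(I + theta*A) at theta = 1, i.e.
    tr((I + A)^{-1} A), via the truncated Neumann series
    sum_{k=0}^{n_terms} (-1)^k tr(A^k A)."""
    n = len(A)
    Ak = [[float(i == j) for j in range(n)] for i in range(n)]  # A^0 = I
    total = 0.0
    for k in range(n_terms + 1):
        total += (-1) ** k * trace(matmul(Ak, A))  # adds (-1)^k tr(A^{k+1})
        Ak = matmul(Ak, A)
    return total

A = [[0.3, 0.1], [0.0, 0.2]]  # contractive, eigenvalues 0.3 and 0.2
# Closed form: d/dtheta [log(1 + 0.3*theta) + log(1 + 0.2*theta)] at theta = 1.
exact = 0.3 / 1.3 + 0.2 / 1.2
print(neumann_grad(A, 40), exact)  # the two values agree to high precision
```

Because the series is only summed, not differentiated, an implementation along these lines can discard intermediate terms as it goes, which is the source of the factor-of-n memory saving.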
Figure 3 shows that the memory consumption can be enormous for naïve backpropagation, and using large networks would have been intractable.

Figure 3: Memory usage (GB) per minibatch of 64 samples when computing n=10 terms in the corresponding power series. CIFAR10-small uses immediate downsampling before any residual blocks.

                         MNIST   CIFAR10-small   CIFAR10-large
Naive Backprop           192.1    66.4           263.5
Neumann Series            31.2    11.3            40.8
Backward-in-forward       19.8     7.4            26.1
Both Combined             13.6     5.9            18.0

Figure 4: Common smooth Lipschitz activation functions φ (softplus, ELU, LipSwish) usually have vanishing φ″ when φ′ is maximal. LipSwish has a non-vanishing φ″ in the region where φ′ is close to one.

3.3 Avoiding Derivative Saturation with the LipSwish Activation Function

As the log density depends on the first derivatives through the Jacobian J_g, the gradients for training depend on second derivatives. Similar to the phenomenon of saturated activation functions, Lipschitz-constrained activation functions can have a derivative saturation problem. For instance, the ELU activation used by Behrmann et al. (2019) achieves the highest Lipschitz constant when ELU′(z) = 1, but this occurs when the second derivative is exactly zero in a very large region, implying there is a trade-off between a large Lipschitz constant and non-vanishing gradients.
We thus desire two properties from our activation functions φ(z):

1. The first derivatives must be bounded as |φ′(z)| ≤ 1 for all z.
2. The second derivatives should not asymptotically vanish when |φ′(z)| is close to one.

While many activation functions satisfy condition 1, most do not satisfy condition 2. We argue that the ELU and softplus activations are suboptimal due to derivative saturation.
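The saturation argument can be checked numerically. The sketch below (ours) evaluates softplus's first and second derivatives, which are σ(z) and σ(z)(1 − σ(z)) respectively, at increasingly large inputs: the first derivative approaches its Lipschitz bound of 1 exactly where the second derivative vanishes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# softplus(z) = log(1 + e^z): its first derivative is sigma(z) and its
# second derivative is sigma(z) * (1 - sigma(z)). As z grows, phi' -> 1
# (the unit-Lipschitz region) while phi'' -> 0 (derivative saturation).
for z in (0.0, 2.0, 5.0, 10.0):
    d1 = sigmoid(z)
    d2 = sigmoid(z) * (1.0 - sigmoid(z))
    print(f"z={z:5.1f}  phi'={d1:.6f}  phi''={d2:.6f}")
```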
Figure 4 shows that when softplus and ELU saturate at regions of unit Lipschitz, the second derivative goes to zero, which can lead to vanishing gradients during training.
We find that good activation functions satisfying condition 2 are smooth and non-monotonic functions, such as Swish (Ramachandran et al., 2017). However, Swish by default does not satisfy condition 1, as max_z |d/dz Swish(z)| ≤ 1.1 but exceeds one. Scaling via

LipSwish(z) := Swish(z)/1.1 = z · σ(βz)/1.1,    (10)

where σ is the sigmoid function, results in max_z |d/dz LipSwish(z)| ≤ 1 for all values of β. LipSwish is a simple modification to Swish that exhibits a less than unity Lipschitz property. In our experiments, we parameterize β to be strictly positive by passing it through softplus. Figure 4 shows that in the region of maximal Lipschitz, LipSwish does not saturate due to its non-monotonicity property.

4 Related Work

Estimation of Infinite Series. Our derivation of the unbiased estimator follows from the general approach of using a randomized truncation (Kahn, 1955). This paradigm of estimation has been repeatedly rediscovered and applied in many fields, including solving of stochastic differential equations (McLeish, 2011; Rhee and Glynn, 2012, 2015), ray tracing for rendering paths of light (Arvo and Kirk, 1990), and estimating limiting behavior of optimization problems (Tallec and Ollivier, 2017; Beatson and Adams, 2019), among many other applications. Some recent works use Chebyshev polynomials to estimate the spectral functions of symmetric matrices (Han et al., 2018; Adams et al., 2018; Ramesh and LeCun, 2018; Boutsidis et al., 2008). These works estimate quantities that are similar to those presented in this work, but a key difference is that the Jacobian in our power series is not symmetric.
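As an aside, the 1.1 bound that motivates the scaling in (10) can be verified numerically; the following check is our own. Substituting u = βz into d/dz[z σ(βz)] = σ(βz)(1 + βz(1 − σ(βz))) shows the maximum derivative does not depend on β, so scanning β = 1 suffices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish_prime(z, beta=1.0):
    """Closed-form derivative of Swish: d/dz [z * sigmoid(beta * z)]."""
    s = sigmoid(beta * z)
    return s + beta * z * s * (1.0 - s)

# Scan the derivative on a fine grid over [-10, 10].
grid = [i / 1000.0 for i in range(-10000, 10001)]
max_swish = max(abs(swish_prime(z)) for z in grid)
max_lipswish = max_swish / 1.1  # LipSwish(z) = Swish(z) / 1.1
print(max_swish)     # about 1.0998, i.e. bounded by 1.1
print(max_lipswish)  # at most 1, so LipSwish is 1-Lipschitz
```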
We also note that the works which rediscovered the random truncation approach (McLeish, 2011; Rhee and Glynn, 2015; Han et al., 2018) made assumptions on p(N) in order for it to be applicable to general infinite series. Fortunately, since the power series in Theorems 1 and 2 converge fast enough, we were able to make use of a different set of assumptions requiring only that p(N) has sufficient support, adapted from Bouchard-Côté (2018) (details in Appendix B).

Table 1: Results [bits/dim] on standard benchmark datasets for density estimation. In brackets are models that used "variational dequantization" (Ho et al., 2019), which we don't compare against.

Model                              MNIST   CIFAR-10      ImageNet 32   ImageNet 64   CelebA-HQ 256
Real NVP (Dinh et al., 2017)       1.06    3.49          4.28          3.98          —
Glow (Kingma and Dhariwal, 2018)   1.05    3.35          4.09          3.81          1.03
FFJORD (Grathwohl et al., 2019)    0.99    3.40          —             —             —
Flow++ (Ho et al., 2019)           —       3.29 (3.09)   — (3.86)      — (3.69)      —
i-ResNet (Behrmann et al., 2019)   1.05    3.45          —             —             —
Residual Flow (Ours)               0.970   3.280         4.010         3.757         0.992

Figure 5: Qualitative samples. Real (left) and random samples (right) from a model trained on 5bit 64×64 CelebA. The most visually appealing samples were picked out of 5 random batches.

Memory-efficient Backpropagation. The issue of computing gradients in a memory-efficient manner was explored by Gomez et al. (2017) and Chang et al. (2018) for residual networks with a coupling-based architecture devised by Dinh et al. (2014), and explored by Chen et al. (2018) for a continuous analogue of residual networks.
These works focus on the path-wise gradients from\nthe output of the network, whereas we focus on the gradients from the log-determinant term in the\nchange of variables equation speci\ufb01cally for generative modeling. On the other hand, our approach\nshares some similarities with Recurrent Backpropagation (Almeida, 1987; Pineda, 1987; Liao et al.,\n2018), since both approaches leverage convergent dynamics to modify the derivatives.\n\nInvertible Deep Networks. Flow-based generative models are a density estimation approach\nwhich has invertibility as its core design principle (Rezende and Mohamed, 2015; Deco and Brauer,\n1995). Most recent work on \ufb02ows focuses on designing maximally expressive architectures while\nmaintaining invertibility and tractable log determinant computation (Dinh et al., 2014, 2017; Kingma\nand Dhariwal, 2018). An alternative route has been taken by Continuous Normalizing Flows (Chen\net al., 2018) which make use of Jacobian traces instead of Jacobian determinants, provided that the\ntransformation is parameterized by an ordinary differential equation. Invertible architectures are\nalso of interest for discriminative problems, as their information-preservation properties make them\nsuitable candidates for analyzing and regularizing learned representations (Jacobsen et al., 2019).\n\n5 Experiments\n\n5.1 Density & Generative Modeling\n\nWe use a similar architecture as Behrmann et al. (2019), except without the immediate invertible\ndownsampling (Dinh et al., 2017) at the image pixel-level. Removing this substantially increases the\namount of memory required (shown in Figure 3) as there are more spatial dimensions at every layer,\nbut increases the overall performance. We also increase the bound on the Lipschitz constants of each\nweight matrix to 0.98, whereas Behrmann et al. (2019) used 0.90 to reduce the error of the biased\nestimator. 
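The Lipschitz bound on each weight matrix is enforced with spectral normalization (Miyato et al., 2018). A minimal power-iteration sketch (our own, for a dense matrix rather than a convolution, with the 0.98 coefficient used above) looks like:

```python
import math
import random

def spectral_normalize(W, coeff=0.98, n_iters=50, rng=random):
    """Scale W so its largest singular value is at most `coeff`,
    estimating the spectral norm with power iteration."""
    rows, cols = len(W), len(W[0])
    u = [rng.gauss(0.0, 1.0) for _ in range(rows)]
    for _ in range(n_iters):
        # v = normalize(W^T u); u = normalize(W v)
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm_v = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm_v for x in v]
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / norm_u for x in u]
    sigma = sum(u[i] * sum(W[i][j] * v[j] for j in range(cols))
                for i in range(rows))  # estimated spectral norm u^T W v
    if sigma > coeff:
        return [[w * coeff / sigma for w in row] for row in W]
    return [row[:] for row in W]

random.seed(0)
W = [[2.0, 0.0], [0.0, 1.0]]  # spectral norm 2
W_hat = spectral_normalize(W)
print(W_hat[0][0])  # rescaled so the spectral norm becomes ~0.98
```

Practical implementations (e.g. in deep learning frameworks) amortize the power iteration across training steps instead of iterating to convergence at every call; the fixed 50 iterations here are only for a self-contained demonstration.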
A more detailed description of the architectures is in Appendix E.
Unlike prior works that use multiple GPUs, large batch sizes, and a few hundred epochs, Residual Flow models are trained with the standard batch size of 64 and converge in roughly 300-350 epochs for MNIST and CIFAR-10. Most network settings can fit on a single GPU (see Figure 3), though we use 4 GPUs in our experiments to speed up training. On CelebA-HQ, Glow had to use a batch size of 1 per GPU with a budget of 40 GPUs, whereas we trained our model using a batch size of 3 per GPU and a budget of 4 GPUs, owing to the smaller model and memory-efficient backpropagation.

Figure 6: Random samples from Residual Flow are more globally coherent. Panels show CIFAR-10 real data, Residual Flow (3.29 bits/dim), PixelCNN (3.14 bits/dim), and variational dequantized Flow++ (3.08 bits/dim). PixelCNN (Oord et al., 2016) and Flow++ samples reprinted from Ho et al. (2019).

Table 1 reports the bits per dimension (−log_2 p(x)/d, where x ∈ R^d) on the standard benchmark datasets MNIST, CIFAR-10, downsampled ImageNet, and CelebA-HQ. We achieve performance competitive with state-of-the-art flow-based models on all datasets. For evaluation, we computed 20 terms of the power series (3) and use the unbiased estimator (6) to estimate the remaining terms. This reduces the standard deviation of the unbiased estimate of the test bits per dimension to a negligible level.
Furthermore, it is possible to generalize the Lipschitz condition of Residual Flows to arbitrary p-norms and even mixed matrix norms. By learning the norm orders jointly with the model, we achieved a small gain of 0.003 bits/dim on CIFAR-10 compared to spectral normalization.
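For reference, bits per dimension is just a per-dimension rescaling of the negative log-likelihood; a one-line conversion from a log-density in nats (our own helper, with an illustrative input value) is:

```python
import math

def bits_per_dim(log_px_nats, d):
    """Convert log p(x) in nats, for x in R^d, into bits/dim:
    -log2 p(x) / d = -log p(x) / (d * log 2)."""
    return -log_px_nats / (d * math.log(2))

# Example: a hypothetical log-likelihood of -6900 nats on a 32x32x3 image
# (d = 3072) corresponds to roughly 3.24 bits/dim.
print(bits_per_dim(-6900.0, 32 * 32 * 3))
```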
In addition, we show that other norms like p = ∞ yielded constraints more suited for lower dimensional data. See Appendix D for a discussion on how to generalize the Lipschitz condition and an exploration of different norm constraints for 2D problems and image data.

5.2 Sample Quality

We are also competitive with state-of-the-art flow-based models in regards to sample quality. Figure 5 shows random samples from the model trained on CelebA. Furthermore, samples from Residual Flow trained on CIFAR-10 are more globally coherent (Figure 6) than those of PixelCNN and variational dequantized Flow++, even though our likelihood is worse.
For quantitative comparison, we report FID scores (Heusel et al., 2017) in Table 2 (lower FID implies better sample quality; results marked * are taken from Ostrovski et al. (2018)). We see that Residual Flow significantly improves on i-ResNets and PixelCNN, and achieves slightly better sample quality than an official Glow model that has double the number of layers. It is well-known that visual fidelity and log-likelihood are not necessarily indicative of each other (Theis et al., 2015), but we believe residual blocks may have a better inductive bias than coupling blocks or autoregressive architectures as generative models. More samples are in Appendix A.
To generate visually appealing images, Kingma and Dhariwal (2018) used temperature annealing (i.e. sampling from [p(x)]^{T^2} with T < 1) to sample closer to the mode of the distribution, which helped remove artifacts from the samples and resulted in smoother looking images. However, this is done by reducing the entropy of p(z) during sampling, which is only equivalent to temperature annealing if the change in log-density does not depend on the sample itself. Intuitively, this assumption implies that the modes of p(x) and p(z) are the same.
As this assumption breaks for general flow-based models, including Residual Flows, we cannot use the same trick to sample efficiently from a temperature-annealed model. Figure 7 shows the results of reduced entropy sampling on CelebA-HQ 256, but the samples do not converge to the mode of the distribution.

Table 2: FID scores on CIFAR10 (lower is better; * taken from Ostrovski et al. (2018)).

Model           FID
PixelCNN*       65.93
PixelIQN*       49.46
i-ResNet        65.01
Glow            46.90
Residual Flow   46.37
DCGAN*          37.11
WGAN-GP*        36.40

Figure 7: Reduced entropy sampling (shown at T = 0.7, 0.8, 0.9, 1.0) does not equate with proper temperature annealing for general flow-based models. Naïvely reducing entropy results in samples that exhibit black hair and background, indicating that samples are not converging to the mode of the distribution.

5.3 Ablation Experiments

Table 3: Ablation results [bits/dim]. †Uses immediate downsampling before any residual blocks.

Training Setting            MNIST   CIFAR-10†   CIFAR-10
i-ResNet + ELU              1.05    3.45        3.66~4.78
Residual Flow + ELU         1.00    3.40        3.32
Residual Flow + LipSwish    0.97    3.39        3.28

Figure 8: Effect of activation functions (softplus, ELU, LipSwish) on CIFAR-10 bits/dim over training.

We report ablation experiments for the unbiased estimator and the LipSwish activation function in Table 3. Even in settings where the Lipschitz constant and bias are relatively low, we observe a significant improvement from using the unbiased estimator. Training the larger i-ResNet model on CIFAR-10 results in the biased estimator completely ignoring the actual likelihood objective altogether. In this setting, the biased estimate was lower than 0.8 bits/dim by 50 epochs, but the actual bits/dim wildly oscillates above 3.66 bits/dim and seems to never converge.
Using LipSwish not only converges much faster but also results in better performance compared to softplus or ELU, especially in the high Lipschitz settings (Figure 8 and Table 3).

5.4 Hybrid Modeling

Next, we experiment on joint training of continuous and discrete data. Of particular interest is the ability to learn both a generative model and a classifier, referred to as a hybrid model, which is useful for downstream applications such as semi-supervised learning and out-of-distribution detection (Nalisnick et al., 2019). Let x be the data and y be a categorical random variable. The maximum likelihood objective can be separated into log p(x, y) = log p(x) + log p(y|x), where log p(x) is modeled using a flow-based generative model and log p(y|x) is a classifier network that shares learned features from the generative model. However, it is often the case that accuracy is the metric of interest and log-likelihood is only used as a surrogate training objective. In this case, Nalisnick et al. (2019) suggest a weighted maximum likelihood objective,

E_{(x,y)∼p_data}[λ log p(x) + log p(y|x)],    (11)

where λ is a scaling constant. As y is much lower dimensional than x, setting λ < 1 emphasizes classification, and setting λ = 0 results in a classification-only model which can be compared against.
Since Nalisnick et al. (2019) perform approximate Bayesian inference and use a different architecture than ours, we perform our own ablation experiments to compare residual blocks to coupling blocks (Dinh et al., 2014) as well as 1×1 convolutions (Kingma and Dhariwal, 2018). We use the same architecture as in the density estimation experiments and append a classification branch that takes features at the final output of multiple scales (see details in Appendix E). This allows us to also use features from intermediate blocks, whereas Nalisnick et al. (2019) only used the final output of the entire network for classification. Our implementation of coupling blocks uses the same architecture for g(x), except we use ReLU activations and no longer constrain the Lipschitz constant.

Table 4: Comparison of residual vs. coupling blocks for the hybrid modeling task.

MNIST                      λ = 0     λ = 1/D            λ = 1
Block Type                 Acc↑      BPD↓    Acc↑       BPD↓    Acc↑
Nalisnick et al. (2019)    99.33%    1.26    97.78%     −       −
Coupling                   99.50%    1.18    98.45%     1.04    95.42%
+ 1×1 Conv                 99.56%    1.15    98.93%     1.03    94.22%
Residual                   99.53%    1.01    99.46%     0.99    98.69%

SVHN                       λ = 0     λ = 1/D            λ = 1
Block Type                 Acc↑      BPD↓    Acc↑       BPD↓    Acc↑
Nalisnick et al. (2019)    95.74%    2.40    94.77%     −       −
Coupling                   96.27%    2.73    95.15%     2.21    46.22%
+ 1×1 Conv                 96.72%    2.61    95.49%     2.17    46.58%
Residual                   96.72%    2.29    95.79%     2.06    58.52%

Table 5: Hybrid modeling results on CIFAR-10.

                 λ = 0     λ = 1/D            λ = 1
Block Type       Acc↑      BPD↓    Acc↑       BPD↓    Acc↑
Coupling         89.77%    4.30    87.58%     3.54    67.62%
+ 1×1 Conv       90.82%    4.09    87.96%     3.47    67.38%
Residual         91.78%    3.62    90.47%     3.39    70.32%

Tables 4 & 5 show our experiment results. Our architecture outperforms Nalisnick et al. (2019) on both pure classification and hybrid modeling. Furthermore, on MNIST we are able to jointly obtain a decent classifier and a strong density model over all settings.
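In code, the weighted objective (11) is a simple weighted combination of the two log-likelihood terms. The sketch below is schematic and ours alone: the log-probability inputs are placeholder numbers standing in for a flow's log p(x) and a classifier's log p(y|x).

```python
def hybrid_loss(log_px, log_py_given_x, lam):
    """Negative weighted maximum likelihood objective from (11):
    minimize -(lam * log p(x) + log p(y|x)).
    lam = 0 recovers a pure classifier; lam = 1 weights both terms fully;
    lam = 1/D downweights the much higher-dimensional generative term."""
    return -(lam * log_px + log_py_given_x)

# Toy numbers: a generative log-likelihood for a 784-dimensional input and
# a 10-class classifier's log-probability of the true label.
log_px = -1500.0
log_py_given_x = -0.05
D = 784
print(hybrid_loss(log_px, log_py_given_x, lam=0))      # classification-only
print(hybrid_loss(log_px, log_py_given_x, lam=1 / D))  # balanced emphasis
print(hybrid_loss(log_px, log_py_given_x, lam=1))      # full maximum likelihood
```

With λ = 1/D the generative term contributes on the same per-dimension scale as the classification term, which matches the intuition given above for why λ < 1 emphasizes classification.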
In general, we find that residual blocks perform much better than coupling blocks at learning representations for both generative and discriminative tasks. Coupling blocks have very high bits per dimension when λ = 1/D while performing worse at classification when λ = 1, suggesting that they have restricted flexibility and can only perform one task well at a time.

6 Conclusion

We have shown that invertible residual networks can be turned into powerful generative models. The proposed unbiased flow-based generative model, coined Residual Flow, achieves competitive or better performance compared to alternative flow-based models in density estimation, sample quality, and hybrid modeling. More generally, we gave a recipe for introducing stochasticity in order to construct tractable flow-based models with a different set of constraints on layer architectures than competing approaches, which rely on exact log-determinant computations. This opens up a new design space of expressive but Lipschitz-constrained architectures that has yet to be explored.

Acknowledgments

Jens Behrmann gratefully acknowledges the financial support from the German Science Foundation for RTG 2224 "π3: Parameter Identification - Analysis, Algorithms, Applications".

References

Ryan P Adams, Jeffrey Pennington, Matthew J Johnson, Jamie Smith, Yaniv Ovadia, Brian Patton, and James Saunderson. Estimating the spectral density of large implicit matrices. arXiv preprint arXiv:1802.03451, 2018.

L. B. Almeida. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proceedings of the IEEE First International Conference on Neural Networks, pages 609–618, 1987.

James Arvo and David Kirk. Particle transport and image synthesis. ACM SIGGRAPH Computer Graphics, 24(4):63–66, 1990.

Alex Beatson and Ryan P. Adams.
Efficient optimization of loops and limits with randomized telescoping sums. In International Conference on Machine Learning, 2019.

Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. In International Conference on Machine Learning, 2019.

Alexandre Bouchard-Côté. Topics in probability assignment 1. https://www.stat.ubc.ca/~bouchard/courses/stat547-fa2018-19//files/assignment1-solution.pdf, 2018. Accessed: 2019-05-22.

Christos Boutsidis, Michael W Mahoney, and Petros Drineas. Unsupervised feature selection for principal components analysis. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 61–69. ACM, 2008.

Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible architectures for arbitrarily deep residual neural networks. In AAAI Conference on Artificial Intelligence, 2018.

Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, 2018.

Gustavo Deco and Wilfried Brauer. Nonlinear higher-order statistical decorrelation by volume-conserving neural architectures. Neural Networks, 8(4):525–535, 1995.

Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.

Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations.
In Advances in Neural Information Processing Systems, pages 2214–2224, 2017.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

Henry Gouk, Eibe Frank, Bernhard Pfahringer, and Michael Cree. Regularisation of neural networks by enforcing Lipschitz continuity. arXiv preprint arXiv:1804.04368, 2018.

Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2019.

Insu Han, Haim Avron, and Jinwoo Shin. Stochastic Chebyshev gradient descent for spectral optimization. In Conference on Neural Information Processing Systems, 2018.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In International Conference on Machine Learning, 2019.

Roger A Horn and Charles R Johnson. Matrix Analysis. Cambridge University Press, 2012.

Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 19(2):433–450, 1990.

Jörn-Henrik Jacobsen, Jens Behrmann, Richard Zemel, and Matthias Bethge.
Excessive invariance causes adversarial vulnerability. In International Conference on Learning Representations, 2019.

Nathaniel Johnston. QETLAB: A MATLAB toolbox for quantum entanglement, version 0.9. http://qetlab.com, January 2016.

Herman Kahn. Use of different Monte Carlo sampling techniques. 1955.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.

Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

Renjie Liao, Yuwen Xiong, Ethan Fetaya, Lisa Zhang, KiJung Yoon, Xaq Pitkow, Raquel Urtasun, and Richard Zemel. Reviving and improving recurrent back-propagation. arXiv preprint arXiv:1803.06396, 2018.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

Don McLeish. A general method for debiasing a Monte Carlo estimator. Monte Carlo Methods and Applications, 17(4):301–315, 2011.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.

Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Hybrid models with deep and invertible features. In International Conference on Machine Learning, 2019.

Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu.
Pixel recurrent neural networks. In International Conference on Machine Learning, 2016.

Georg Ostrovski, Will Dabney, and Rémi Munos. Autoregressive quantile networks for generative modeling. arXiv preprint arXiv:1806.05575, 2018.

K. B. Petersen and M. S. Pedersen. The Matrix Cookbook, 2012.

Fernando J. Pineda. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59:2229–2232, 1987.

Boris T. Polyak and Anatoli Juditsky. Acceleration of stochastic approximation by averaging. 1992.

Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

Aditya Ramesh and Yann LeCun. Backpropagation for implicit spectral densities. arXiv preprint arXiv:1806.00499, 2018.

Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, pages 1530–1538, 2015.

Chang-han Rhee and Peter W Glynn. A new approach to unbiased estimation for SDEs. In Proceedings of the Winter Simulation Conference, page 17. Winter Simulation Conference, 2012.

Chang-han Rhee and Peter W Glynn. Unbiased estimation with square root convergence for SDE models. Operations Research, 63(5):1026–1043, 2015.

John Skilling. The eigenvalues of mega-dimensional matrices. In Maximum Entropy and Bayesian Methods, pages 455–466. Springer, 1989.

Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time. arXiv preprint arXiv:1705.08209, 2017.

Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.

Joel Aaron Tropp. Topics in sparse approximation. PhD thesis, University of Texas, 2004.

Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse.
Three mechanisms of weight decay regularization. In International Conference on Learning Representations, 2019.