{"title": "Invertible Convolutional Flow", "book": "Advances in Neural Information Processing Systems", "page_first": 5635, "page_last": 5645, "abstract": "Normalizing flows can be used to construct high quality generative probabilistic\nmodels, but training and sample generation require repeated evaluation of Jacobian determinants and function inverses. To make such computations feasible, current approaches employ highly constrained architectures that produce diagonal, triangular, or low rank Jacobian matrices. As an alternative, we investigate a set of novel normalizing flows based on the circular and symmetric convolutions. We show that these transforms admit efficient Jacobian determinant computation and inverse mapping (deconvolution) in O(N log N) time. Additionally, element-wise multiplication, widely used in normalizing flow architectures, can be combined with these transforms to increase modeling flexibility. We further propose an analytic approach to designing nonlinear elementwise bijectors that induce special properties in the intermediate layers, by implicitly introducing specific regularizers in the loss. We show that these transforms allow more effective normalizing flow models to be developed for generative image models.", "full_text": "Invertible Convolutional Flow

Mahdi Karami*, Jascha Sohl-Dickstein†, Dale Schuurmans†*, Laurent Dinh†, Daniel Duckworth†
* Department of Computer Science, University of Alberta, karami1@ualberta.ca
† Google Brain

Abstract

Normalizing flows can be used to construct high quality generative probabilistic models, but training and sample generation require repeated evaluation of Jacobian determinants and function inverses. To make such computations feasible, current approaches employ highly constrained architectures that produce diagonal, triangular, or low rank Jacobian matrices.
As an alternative, we investigate a set of novel normalizing flows based on the circular and symmetric convolutions. We show that these transforms admit efficient Jacobian determinant computation and inverse mapping (deconvolution) in O(N log N) time. Additionally, element-wise multiplication, widely used in normalizing flow architectures, can be combined with these transforms to increase modeling flexibility. We further propose an analytic approach to designing nonlinear elementwise bijectors that induce special properties in the intermediate layers, by implicitly introducing specific regularizers in the loss. We show that these transforms allow more effective normalizing flow models to be developed for generative image models.

1 Introduction

Flow-based generative networks have shown tremendous promise for modeling complex observations in high dimensional datasets. In flow-based models, a complex probability density is constructed by transforming a simple base density, such as a standard normal distribution, via a chain of smooth, invertible mappings (bijections), to yield a normalizing flow. Such models are employed in various contexts, including approximating a complex posterior distribution in variational inference [Rezende and Mohamed, 2015], or for density estimation with generative models [Dinh et al., 2016].

Using a complex transformation (bijective function) to define a normalized density requires the computation of a Jacobian determinant, which is generally impractical for arbitrary neural network transformations. To overcome this difficulty and enable fast computation, previous work has carefully designed architectures that produce simple Jacobian forms. For example, [Rezende and Mohamed, 2015, Berg et al., 2018] consider transformations with a Jacobian that corresponds to low rank perturbations of a diagonal matrix, enabling the use of Sylvester's determinant lemma.
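As a concrete illustration of why such low-rank structure helps (our own numeric sketch, not code from the paper), Sylvester's determinant lemma det(I_N + U Vᵀ) = det(I_r + Vᵀ U) turns an N × N determinant into an r × r one:

```python
# Numeric sketch of Sylvester's determinant lemma:
#   det(I_N + U V^T) = det(I_r + V^T U)
# so an N x N determinant can be evaluated as an r x r one (r << N).
# Pure-Python and illustrative only; U, V are made-up values.

def det(m):
    """Determinant by Laplace expansion (fine for tiny matrices)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += ((-1) ** j) * m[0][j] * det(minor)
    return total

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def eye_plus(m):
    return [[m[i][j] + (1.0 if i == j else 0.0) for j in range(len(m[0]))]
            for i in range(len(m))]

N, r = 4, 2
U = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]]   # N x r
V = [[0.2, 1.0], [1.0, 0.3], [-0.5, 2.0], [0.7, -0.4]]   # N x r
Vt = [list(col) for col in zip(*V)]                       # r x N

lhs = det(eye_plus(matmul(U, Vt)))   # N x N determinant
rhs = det(eye_plus(matmul(Vt, U)))   # r x r determinant
assert abs(lhs - rhs) < 1e-9
```

The determinant of the full N × N perturbed identity agrees with the cheap r × r determinant, which is exactly what makes low-rank-plus-diagonal Jacobians tractable.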
Other works, such as [Dinh et al., 2014, 2016, Kingma et al., 2016, Papamakarios et al., 2017], use a constrained transformation where the Jacobian has a triangular structure. The latter approach has proved particularly successful, since this constraint is easy to enforce without major sacrifices in expressiveness or computational efficiency. More recently, Kingma and Dhariwal [2018] propose the use of 1 × 1 convolutions for cross channel mixing in a multi-channel signal, achieving tractability via a block diagonal Jacobian. Nevertheless, these models have overlooked some opportunities for formulating tractable normalizing flows that can enhance expressiveness and better capture the structure of natural data, such as images and audio. Also, a new line of work based on ordinary differential equations has emerged recently that offers promising continuous dynamics based flows [Grathwohl et al., 2019].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this work, we propose an alternative nonlinear convolution layer, the nonlinear adaptive convolution filter, where expressiveness is increased by allowing a layer's kernel to adapt to the layer's input. The idea is to partition the input of a layer x into {x1, x2}, where the convolution updates x2 as w(x1) ∗ x2, while the kernel w(x1) is a function of x1 that can be expressed by a deep neural network. We present invertible convolution operators whose Jacobian can be computed efficiently, making this approach practical for normalizing flows. Unlike the causal convolution employed in [van den Oord et al., 2016] to generate audio waveforms, or in [Zheng et al., 2017] to approximate the posterior in a variational autoencoder, the proposed transformations are not constrained to depend only on the preceding input variables, and they also admit an efficient analytic inverse mapping, also known as deconvolution.
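The invertibility claim can be illustrated with a minimal sketch (our own code, not the paper's): a circular convolution is undone exactly by pointwise division in the Fourier domain, since the DFT diagonalizes its circulant Jacobian. A naive O(N²) DFT is used for clarity.

```python
# Sketch: deconvolution of a circular convolution y = w ⊛ x by dividing in
# the frequency domain. Naive O(N^2) DFT for clarity; values are made up.
import cmath

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def circ_conv(w, x):
    # y(i) = sum_n x(n) w((i - n) mod N)
    N = len(x)
    return [sum(x[n] * w[(i - n) % N] for n in range(N)) for i in range(N)]

w = [0.9, 0.3, -0.2, 0.1]          # kernel; all of its DFT coefficients are nonzero
x = [1.0, -2.0, 0.5, 3.0]
y = circ_conv(w, x)

# Deconvolution: divide by the kernel in the frequency domain.
wF, yF = dft(w), dft(y)
x_rec = [c.real for c in idft([yF[k] / wF[k] for k in range(len(w))])]
assert all(abs(a - b) < 1e-9 for a, b in zip(x, x_rec))
```

The recovered `x_rec` matches `x` to numerical precision whenever the kernel's DFT has no zeros, which is the invertibility condition discussed in Section 2.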
Also, recently, circular convolution has been adopted in [Karami et al., 2018] as a normalizing flow for density estimation and in [Hoogeboom et al., 2019] to design invertible periodic convolution for (almost) periodic data. Furthermore, we propose an analytic approach to add invertible pointwise nonlinearity in the flow that implicitly induces specific regularizers on the intermediate layers.

2 Background

Given a random variable z ~ p(z) and an invertible and differentiable mapping g : R^n → R^n, with inverse mapping f = g^{−1}, the probability density function of the transformed variable x = g(z) can be recovered by the change of variable rule as p(x) = p(z) |det J_g|^{−1} = p(f(x)) |det J_f|. Here J_g = ∂g/∂z^⊤ and J_f = ∂f/∂x^⊤ are the Jacobian matrices of the functions g and f, respectively. One can use these to build a complex mapping g by composing a chain of simple bijective maps, g = g^{(1)} ∘ g^{(2)} ∘ ... ∘ g^{(K)}, that preserve invertibility, with the inverse mapping being f = f^{(K)} ∘ f^{(K−1)} ∘ ... ∘ f^{(1)}. By applying the chain rule to the Jacobian of the composition, and using the fact that det AB = det A det B, the log-likelihood equality (LLE) can be written as

log p(x) = log p(z) + Σ_{k=1}^{K} log |det J_{f^{(k)}}| .   (1)

Evaluating the Jacobian determinant is the main computational bottleneck in (1) since, in general, its scaling is cubic in the size of the input. It is therefore natural to seek structured transformations that mitigate this cost while retaining useful modeling flexibility.1

2.1 Toeplitz structure and Circular Convolution

Although available methods have typically considered bijections whose Jacobians have block-diagonal or triangular forms, these are not the only useful possibilities.
In fact, various other transformations exist whose Jacobian has sufficient structure to allow computationally efficient determinant calculation. One such structure is the Toeplitz property, where all the elements along each diagonal of a square matrix are identical (Figure 1(a)). The calculation of the determinant can then be simplified significantly. Let JT be a Toeplitz matrix of size N × N; its determinant can be evaluated in O(N²) time in general [Monahan, 2011]. More specifically, if JT has a limited bandwidth size of K = r + s, as depicted in Figure 1(a), then the determinant computation can be reduced to O(K² log N + K³) time [Cinkir, 2011]. Moreover, Toeplitz matrices can be inverted efficiently [Martinsson et al., 2005]. The fact that the discrete convolution can be expressed as a product of a Toeplitz matrix and the input [Gray et al., 2006] highlights that the Toeplitz property is of particular interest in convolutional neural networks (CNNs).

1 Notation definition: Throughout the paper, invertible flows are denoted by f, while f(x) is used for unconditional flows, and conditional (data-parameterized) flows are identified by f(x2; x1) or f(x2; θ(x1)), where the flow warps x2 conditioned on x1. Subscripts are intended to specify the type of flow or its parameters, while superscripts enumerate the order of flows in the chain. For example, f∗ denotes the convolutional flow in general and σα is used to specify the pointwise nonlinear bijectors, with its inverse being φα. Also, in general, y and x indicate the output and input of a flow, respectively, and when referring to the kth flow in the chain, we use y(k) and x(k) where x(k) = y(k−1).
Moreover, circular convolution and symmetric convolution are denoted by ⊛ and ∗s, respectively, while ∗ denotes an invertible convolution in general, and xF, xC and xT denote the DFT, DCT and trigonometric transform of sample x, respectively.

Figure 1: (a) JT is a Toeplitz matrix with limited bandwidth size of K = r + s, (b) JC is the Jacobian of circular convolution, which is a circulant matrix, and (c) JS is the Jacobian of symmetric convolution, which can be expressed as the summation of a Toeplitz matrix and an upside-down Toeplitz matrix (also called a Hankel matrix, whose skew-diagonal elements are identical). [Matrix illustrations omitted.]

In this paper, we consider a particular transformation whose Jacobian is a circulant matrix, a special form of Toeplitz structure where the rows (columns) are cyclic permutations of the first row (column), i.e. J_{l,m} = J_{1,(l−m) mod N}. See Figure 1(b) for an illustration. This structure allows certain computationally expensive algebraic operations, such as determinant calculation, inversion and eigenvalue decomposition, to be performed efficiently in O(N log N) time by exploiting the fact that a square circulant matrix can be diagonalized by a discrete Fourier transform (DFT) [Gray et al., 2006]. Define the circular convolution as y := w ⊛ x where y(i) := Σ_{n=0}^{N−1} x(n) w((i − n) mod N), which is equivalent to the linear convolution of two sequences when one is padded cyclically, also known as periodic padding, as illustrated in Figure 2(a). The key property we exploit in developing an efficient normalizing layer is that the Jacobian of this convolution forms a circulant matrix, hence its determinant and inverse mapping (deconvolution) can be computed efficiently. Some useful properties of this operation are needed:

Proposition 1 Let y := w ⊛ x be a circular convolution on the input vector x with its DFT transform xF := F_DFT{x}.
Then:

a) The circular convolution operation can be expressed as a vector-matrix multiplication y = C_w x, where C_w is a circulant square matrix having the convolution kernel w as its first row.

b) The Jacobian of the mapping is J_y = C_w.

c) The matrix C_w can be diagonalized using the DFT basis, with its eigenvalues being equal to the DFT of w, hence log |det J_y| = Σ_{n=0}^{N−1} log |wF(n)|.

d) The circular convolution can be expressed by element-wise multiplication in the frequency domain, yF(k) = wF(k) xF(k), a.k.a. the circular convolution-multiplication property.

e) If wF(n) ≠ 0 ∀n, this linear operation is invertible with inverse xF(n) = wF^{−1}(n) yF(n). Moreover, its inverse mapping (deconvolution) is also a circular convolution operation with kernel w_inv := F_N^{−1}{wF^{−1}}. On the other hand, the log determinant Jacobian also acts as a log-barrier in the objective function, which in turn prevents the wF(n) from becoming zero, hence enforcing the invertibility of the convolution filter.

f) The circular convolution, its inverse, and Jacobian determinant can all be efficiently computed in O(N log N) time in the frequency domain, exploiting Fast Fourier Transform (FFT) algorithms.

2.2 Symmetric convolution

Circular convolution is not the unique operation with such properties; symmetric convolution is another form of structured filtering operation that can be adopted to achieve interesting desirable properties.

Figure 2: (a) Cyclic (periodic) extension and (b) even-symmetric extension of the base sequence, where the base sequence is specified by dark solid lines.
(c) Nonlinear gates corresponding to l1 and l2 regularizers.

A family of symmetric extension (padding) patterns and their corresponding discrete trigonometric transforms (DTT) are outlined in Martucci [1994], based on which alternative symmetric convolution filters can be defined that satisfy the convolution-multiplication property. Among this family, we choose an even-symmetric extension that can be readily interpreted. Define an even-symmetric extension of a base sequence of length N around N − 1/2 as

x̂(n) = ε{x(n)} := x(n) for n = 0, 1, ..., N − 1, and x(−n − 1) for n = −N, ..., −1.   (2)

This even-symmetric extension is illustrated in Figure 2(b). The symmetric convolution of two sequences, denoted by ∗s, can then be defined by the circular convolution of their corresponding even-symmetric extensions, as y = w ∗s x := R{ x̂ ⊛ ŵ }, where R{.} is a rectangular window operation that retains the base sequence of interest in an extended sequence; that is, it inverts the symmetric extension operation (2). Now, since the sequences are extended by an even-symmetric pattern, the cosine functions provide the appropriate basis for the Fourier transform, giving rise to the discrete cosine transform of type two (DCT-II):

x_C(k) = F_dct{x}_k = (√2 / √(1_{k=0} + 1)) (1/√N) Σ_{n=0}^{N−1} x(n) cos( (πk/N)(n + 1/2) ).   (3)

The convolution-multiplication property holds for this convolution, which implies that the symmetric convolution of two sequences in the spatial domain can be expressed as a pointwise multiplication in the transform domain, after a forward DCT of its operands, i.e. yC = wC ⊙ xC.
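Following the definition above literally, a minimal sketch (our own code and index conventions, not the paper's implementation) of the even-symmetric extension ε{·} and the resulting symmetric convolution y = R{x̂ ⊛ ŵ}:

```python
# Sketch of the even-symmetric extension (2) and the symmetric convolution
# defined through it: y = w *s x := R{ ε{x} ⊛ ε{w} }. Direct O(N^2) circular
# convolution for clarity; index conventions are our own.

def sym_extend(x):
    # ε{x}: x̂(n) = x(n) for 0 <= n < N and x̂(n) = x(-n-1) for -N <= n < 0,
    # stored as a length-2N list with list index i <-> n = i - N.
    return x[::-1] + x

def circ_conv(w, x):
    N = len(x)
    return [sum(x[n] * w[(i - n) % N] for n in range(N)) for i in range(N)]

def sym_conv(w, x):
    N = len(x)
    ext = circ_conv(sym_extend(w), sym_extend(x))  # period-2N circular conv
    return ext[:N]                                 # R{.}: keep base window n = 0..N-1

x = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 0.0, 0.0, 0.0]   # its extension has ŵ(-1) = ŵ(0) = 1: a two-tap sum
y = sym_conv(w, x)
# Interior samples are x̂(n) + x̂(n+1); the boundary sample reflects:
# y(N-1) = x(N-1) + x̂(N) = 2 x(N-1).
assert y == [3.0, 5.0, 7.0, 8.0]
```

The reflected boundary in the last output sample is exactly the effect of the even-symmetric padding, in contrast to the wrap-around boundary of the cyclic extension.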
This property also offers an alternative definition for the symmetric convolution: the inverse DCT of the pointwise multiplication of the forward DCT of its operands [Martucci, 1994].

One can also show that the symmetric convolution provides a structured Jacobian that can be specified by Toeplitz matrices; see Figure 1(c) for an illustration. Analogous to the results presented in Proposition 1 for circular convolution, the symmetric convolution-multiplication property implies that the Jacobian of the symmetric convolution can be diagonalized by a DCT basis, with eigenvalues being the DCT of the convolution kernel. Similarly, the inverse filter (deconvolution) can be obtained by inverting the kernel coefficients in the transform domain, i.e. w_inv := F_dct^{−1}{1./wC}, where, again, the invertibility of the convolution is guaranteed by the fact that its log determinant Jacobian in the objective function keeps the elements of wC away from zero (as a log-barrier). On the other hand, since the DCT can be defined in terms of a DFT of the symmetric extension of the original sequences, the symmetric convolution, its inverse, and Jacobian determinant can exploit available fast Fourier algorithms with O(N log N) complexity.2

3 Convolutional normalizing flow

3.1 Data adaptive convolution layer

The special convolutional forms introduced above appear to be particularly well suited to capturing structure in images and audio signals; therefore, we seek to design more expressive normalizing flows using the convolution bijections as building blocks.
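The building-block idea can be sketched as a coupling step that convolves x2 with a kernel computed from x1; the conditioning function below is a made-up toy stand-in (not the paper's architecture), and only the kernel's DFT is needed for the inverse and the log-determinant.

```python
# Sketch (our own minimal code) of a data-adaptive circular-convolution
# coupling: x2 is updated by w(x1) ⊛ x2, where the kernel is produced by an
# arbitrary (non-invertible) function of x1. Naive O(N^2) DFT for clarity.
import cmath, math

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def kernel_net(x1):
    # stand-in "conditioning network": any smooth function of x1 works,
    # as long as the resulting kernel has nonzero DFT coefficients
    s = sum(x1)
    return [1.0 + 0.1 * math.tanh(s), 0.1, 0.05 * math.tanh(s), 0.0]

def forward(x1, x2):
    w = kernel_net(x1)
    wF, x2F = dft(w), dft(x2)
    y2 = [c.real for c in idft([wF[k] * x2F[k] for k in range(len(w))])]
    logdet = sum(math.log(abs(c)) for c in wF)   # log|det J| = Σ log|w_F(k)|
    return y2, logdet

def inverse(x1, y2):
    w = kernel_net(x1)                            # x1 passes through unchanged,
    wF, y2F = dft(w), dft(y2)                     # so the kernel is recomputable
    return [c.real for c in idft([y2F[k] / wF[k] for k in range(len(w))])]

x1 = [0.3, -1.2, 0.7, 2.0]
x2 = [1.0, 0.5, -0.5, 2.5]
y2, logdet = forward(x1, x2)
x2_rec = inverse(x1, y2)
assert all(abs(a - b) < 1e-9 for a, b in zip(x2, x2_rec))
```

Because x1 is left untouched, inversion never needs to invert the conditioning function itself, which is what allows it to be an arbitrary deep network.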
To increase flexibility, we propose a data-adaptive convolution filter with a filter kernel that is a function of the input of the layer.

2 All bijective convolutions in experiments were performed in the transform domain using a fast Fourier transform algorithm.

Inspired by the idea of the coupling layer in [Dinh et al., 2016], a modular bijection can be formed by splitting the input x ∈ R^d into two disjoint parts {x1 ∈ R^d1, x2 ∈ R^d2 : d1 + d2 = d}, referred to as the base input and update input, respectively, and only updating x2 by an invertible convolution operation with a data-parameterized kernel that depends on x1. The data-adaptive convolution sub-flow can then be expressed as

f∗(x2; x1) = w(x1) ∗ x2.   (4)

In the above transformation, ∗ is an invertible convolution operation and can be one of the invertible convolutions introduced in the last section. Here, the kernel w(x1) can be any nonlinear function, which leads to a nonlinear adaptive convolution filtering scheme.

3.2 Pointwise nonlinear bijections

Adding pointwise nonlinear bijections in the chain of normalizing flows can further enhance expressiveness. More specifically, focusing on the Jacobian determinant introduced by the nonlinearities in the log-likelihood equation (1), one can observe that these terms can be interpreted as regularizers on the latent representation. In other words, specific structures on intermediate activations can be encouraged by designing customized pointwise nonlinear gates; these structures encode various prior knowledge into the design of the model. Let σ(k) denote the kth bijection in the chain of normalizing flows that is assumed to be a pointwise nonlinear operation, i.e. y(k)_i = σ(k)(x(k)_i). Dropping the indices, this mapping can be simply written as y = σ(x) with inverse x = φ(y) = σ^{−1}(y).
Since the nonlinearity operates elementwise, its Jacobian is diagonal, hence the log determinant reduces to log |det J_y| = Σ_{i=1}^{d} log |∂σ(x_i)/∂x_i|. An analytic approach to designing nonlinear invertible gates is derived in the following.

Proposition 2 Assume we want to induce a specific structure, formulated by a regularizer γ(y), on the intermediate activation y := y(k)_i. Then the elementwise bijection can be defined as the solution to the differential equation |∂σ^{−1}/∂y| = |∂φ/∂y| = e^{γ(y)}. In other words, the contribution to the −log |det Jσ| term in the negative log-likelihood from this unit then reduces to log |∂φ/∂y| = γ(y).

Solving the above equation and deriving the nonlinear bijection for the two well established l1 and l2 regularizers leads to the following.

• l1 regularization: γ(y) = α|y|, which corresponds to a Laplace distribution assumption on y:

σα(x) = sign(x)/α · ln(α|x| + 1),    φα(y) = sign(y)/α · (e^{α|y|} − 1).   (5)

Due to its symmetric logarithmic shape, we call the forward function σα(x) an S-Log gate, parameterized by positive-valued α.

• l2 regularization: γ(y) = αy², which corresponds to a Gaussian distribution assumption on y:

σα(x) = (1/√α) erfi^{−1}(√(4α/π) x),    φα(y) = √(π/(4α)) erfi(√α y).

The proposed nonlinear gates, plotted in Figure 2(c), are not only differentiable by construction but also have unbounded domain and range, making them suitable choices for designing normalizing flows in many settings such as density estimation.
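A minimal sketch of the S-Log gate (5) (our own code), checking both bijectivity and that the log-derivative of the inverse map reproduces the l1 penalty α|y|:

```python
# Sketch of the S-Log gate (5) and its inverse; the log-det contribution of
# the inverse map equals the l1 regularizer α|y|, matching Proposition 2.
import math

def slog(x, alpha):                 # σ_α(x) = sign(x)/α · ln(α|x| + 1)
    return math.copysign(math.log(alpha * abs(x) + 1.0) / alpha, x)

def slog_inv(y, alpha):             # φ_α(y) = sign(y)/α · (e^{α|y|} − 1)
    return math.copysign((math.exp(alpha * abs(y)) - 1.0) / alpha, y)

alpha = 0.5
for x in [-3.0, -0.1, 0.0, 0.2, 7.5]:
    y = slog(x, alpha)
    assert abs(slog_inv(y, alpha) - x) < 1e-12      # bijectivity
    # |dφ/dy| = e^{α|y|}, so log|dφ/dy| = α|y| (the induced l1 penalty),
    # checked here by a central finite difference.
    h = 1e-6
    dphi = (slog_inv(y + h, alpha) - slog_inv(y - h, alpha)) / (2 * h)
    assert abs(math.log(abs(dphi)) - alpha * abs(y)) < 1e-6
```

The gate compresses large activations logarithmically in the forward direction, while its inverse expands them exponentially, which is exactly the Laplace-prior behaviour stated above.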
Due to its simple analytical form and closed-form inversion, the S-Log gate (5) is adopted as the nonlinear bijection in our model architecture. For multichannel inputs, we assume that the gates share the same parameter α over all spatial locations of a channel (feature map).

3.3 Combined convolution multiplication layer

The convolution operation spatially slides a filter and applies the same weighted summation at every location of its input, resulting in location-invariant filtering. To achieve a more flexible and richer filtering scheme, we can combine an element-wise multiplication, indicated by f⊙, and an invertible convolution, indicated by f∗, so that the filtering scheme varies over space and frequency. The product of a diagonal matrix with a circulant matrix was also proposed in [Cheng et al., 2015] as a

Figure 3: The diagram of one step of flow (CONF) that is composed of M combined convolutional flows defined in (6).
In density estimation, the input to the conditioning neural network is the base input, x1, and the flow updates x2. In variational inference applications, the neural network is conditioned on the data points x while warping the latent random variable z.

structured approximation for dense (fully connected) linear layers, while [Moczulski et al., 2015] showed that any N × N linear operator can be approximated to arbitrary precision by composing order N of such products.

Overall, the aforementioned components can be deployed to compose a combined convolutional flow as

f_{w,s}(x2; x1) = (σα′ ∘ f⊙ ∘ σα ∘ f∗)(x2; x1) = σα′( s(x1) ⊙ σα( w(x1) ∗ x2 ) ).   (6)

We found that a more expressive network can be achieved by stacking M iterates of the combined convolutional flows and an additive coupling transform in each step of the network. Therefore, the convolutional coupling flow (CONF) can be written as

y1 = x1,
y2 = (f^{(M)}_{w,s} ∘ ... ∘ f^{(1)}_{w,s})(x2; x1) + t(x1).   (7)

The parameters of the flow {w1, s1, ..., wM, sM, t} can be any nonlinear function of the base input x1 and are not required to be invertible, hence they can be modeled by deep neural networks with an arbitrary number of hidden units, offering flexibility and rich representation capacity while preserving an efficient learning algorithm. These are also called conditioning networks in the context of normalizing flows. The model complexity can be significantly reduced by using one conditioning neural network for all parameters of a coupling flow, so that it shares all layers except the last one for generating the parameters of the flow.
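One step of the combined flow (6) can be sketched as follows; for illustration the parameters w, s, α, α′ are fixed constants of our own choosing rather than outputs of a conditioning network, and the inverse simply undoes each stage in reverse order.

```python
# Sketch of one combined convolutional flow (6), with fixed (unconditioned)
# parameters for illustration: y = σ_{α'}( s ⊙ σ_α( w ⊛ x ) ).
# All values are made up; O(N^2) DFT for clarity.
import cmath, math

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def circ(w, x, inv=False):
    # circular convolution (or deconvolution) via the frequency domain
    wF, xF = dft(w), dft(x)
    prod = [xF[k] / wF[k] if inv else xF[k] * wF[k] for k in range(len(w))]
    return [c.real for c in idft(prod)]

def slog(x, a):      # S-Log gate (5), applied elementwise
    return [math.copysign(math.log(a * abs(v) + 1.0) / a, v) for v in x]

def slog_inv(y, a):
    return [math.copysign((math.exp(a * abs(v)) - 1.0) / a, v) for v in y]

w = [1.0, 0.2, -0.1, 0.05]       # kernel with nonzero DFT coefficients
s = [0.9, 1.1, -1.3, 0.7]        # elementwise scale (all entries nonzero)
a1, a2 = 0.3, 0.6

def forward(x):
    u = slog(circ(w, x), a1)
    return slog([si * ui for si, ui in zip(s, u)], a2)

def inverse(y):
    u = slog_inv(y, a2)
    u = [ui / si for si, ui in zip(s, u)]
    return circ(w, slog_inv(u, a1), inv=True)

x = [0.4, -1.0, 2.0, 0.1]
x_rec = inverse(forward(x))
assert all(abs(p - q) < 1e-9 for p, q in zip(x, x_rec))
```

Each stage (convolution, gate, scale, gate) is individually invertible with a cheap log-determinant, so the composed step inherits both properties.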
Consequently, we achieve a more expressive flow with the stack of bijectors in (7) without introducing too many extra NN layers in the model.

The modular structure of the coupling CONF module (7) implies that its Jacobian determinant can be expressed in terms of its sub-flows. More details on the Jacobian determinant, invertibility condition and inverse of this transformation can be found in Appendix A.

Initialization of the parameters: Better data propagation is expected to be achieved for very deep normalizing flows if the combined flow (6) acts (approximately) as an identity mapping at initialization. Accordingly, the parameters of the nonlinear bijector pair, {σα, σα′}, are initialized sufficiently close to zero so that they behave approximately as linear functions at the outset. Furthermore, the conditioning networks are initialized such that the scaling filters, s, and the convolution kernels in the frequency domain, F{w}, are all initially identity filters.

Multi-dimensional extension: The multi-dimensional discrete Fourier transform can be expressed in separable form, meaning that the operations can be performed by successively applying 1-dimensional transforms along each dimension [Gonzalez and Woods, 1992]. The separability property ensures the results mentioned so far can be extended to multi-dimensional settings. In this work, we are particularly interested in 2-D operations for image data. Based on the 2-D circular convolution definition, its equivalent block-circulant matrix form, and the diagonalization method by 2-D DFT [Gonzalez and Woods, 1992, Ch.
5], the results of the circular convolution in Proposition 1 can be readily generalized to the 2-D case.3 The same properties apply to the 2-D symmetric convolution, since the symmetric convolution-multiplication property can be generalized naturally to the 2-D setting [Foltz and Welsh, 1998].

3 Due to the separability property, the 2-D DFT of matrices of size N1 × N2 can be computed in O(N1 N2 (log N1 + log N2)) time.

Table 1: Average test negative log-likelihood (in nats) for tabular datasets and (in bits/dim) for MNIST and CIFAR using fully connected conditioning networks (lower is better). C-CONF and S-CONF stand for the circular and symmetric convolutional coupling flow presented in (7), respectively. Error bars correspond to 2 standard deviations. The results of the benchmark methods are from Grathwohl et al. [2019].

          POWER         GAS            BSDS300           MNIST        CIFAR10
MADE      3.08 ± .03    -3.56 ± .04    -148.85 ± .28     2.04 ± .01   5.67 ± .01
MAF       -0.24 ± .01   -10.08 ± .02   -155.69 ± .28     1.89 ± .01   4.31 ± .01
Real NVP  -0.17 ± .01   -8.33 ± .14    -153.28 ± 1.78    1.93 ± .01   4.53 ± .01
Glow      -0.17 ± .01   -8.15 ± .40    -155.07 ± .03     -            -
FFJORD    -0.46 ± .01   -8.59 ± .12    -157.40 ± .19     -            -
S-CONF    -0.48 ± .01   -10.98 ± .13   -163.23 ± .13     1.26 ± .01   3.78 ± .03
C-CONF    -0.47 ± .01   -10.84 ± .06   -163.23 ± .34     1.25 ± .01   3.82 ± .00

Table 2: Results in bits per dimension for MNIST and CIFAR10 using CNN based conditioning networks.
The results of the benchmark methods are from [Kingma and Dhariwal, 2018] and [Grathwohl et al., 2019].

          Real NVP   Glow   FFJORD   S-CONF
MNIST     1.06       1.05   0.99     1.00
CIFAR10   3.49       3.35   3.40     3.34

4 Model architecture

A highly flexible and complex density approximation can be formed by composing a chain of the convolution coupling layers introduced in this work. As explained in Section 2, the determinant of the Jacobian and the inverse of the composition can then be obtained readily. In addition to the invertible transformations introduced in this work, we use the following bijections in the final architecture of the normalizing flow.

Cross-channel mapping (mixing): In the multi-channel setting, the invertible convolution operation is performed in a depthwise fashion, i.e. each input channel is filtered by a separate convolution kernel. Cross-channel information flow can then be complemented by channel shuffling or by using a 1 × 1 convolution. The latter offered significant improvement with small computational overhead in normalizing flows [Kingma and Dhariwal, 2018], hence it is applied after each convolutional coupling layer in our architecture. Also, for single-channel inputs, assuming equal size splits {x1, x2} (base input and update input), these can be treated as two separate channels of the input and the same technique can be applied to mix them after each coupling layer.

Multiscale architecture: To achieve latent representations at multiple scales and obtain more fine-grained features, a subset of latent variables can be factored out at the intermediate layers. This technique is very useful for large image datasets and can significantly reduce the computational cost in very deep models [Dinh et al., 2016].

Normalization: To improve training in very deep normalizing flows, batch normalization was employed as a bijection after each coupling layer in [Dinh et al., 2016].
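Normalization layers of this kind are themselves invertible affine bijections with a simple log-determinant; a per-feature sketch with data dependent initialization (our own toy illustration, with made-up statistics, not an implementation from the paper):

```python
# Sketch of a normalization bijection: a per-feature affine map
# y_i = g_i * (x_i - b_i), initialized from batch statistics so activations
# start roughly zero-mean / unit-variance; log|det J| = Σ log|g_i|.
import math

def init_from_batch(batch):
    dims = len(batch[0])
    means = [sum(row[d] for row in batch) / len(batch) for d in range(dims)]
    stds = [math.sqrt(sum((row[d] - means[d]) ** 2 for row in batch) / len(batch))
            for d in range(dims)]
    gains = [1.0 / s for s in stds]     # data dependent initialization
    return gains, means

def forward(x, gains, biases):
    y = [g * (v - b) for g, v, b in zip(gains, x, biases)]
    logdet = sum(math.log(abs(g)) for g in gains)
    return y, logdet

def inverse(y, gains, biases):
    return [v / g + b for g, v, b in zip(gains, y, biases)]

batch = [[1.0, 10.0], [3.0, 14.0], [2.0, 12.0], [2.0, 8.0]]
g, b = init_from_batch(batch)
y, logdet = forward(batch[0], g, b)
assert all(abs(p - q) < 1e-12 for p, q in zip(inverse(y, g, b), batch[0]))
```

After initialization, the gains and biases are ordinary trainable parameters, so the bijection stays exactly invertible throughout training.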
To overcome the adverse effect of small minibatch sizes in batch normalization, Kingma and Dhariwal [2018] proposed actnorm as the normalization, a bijection that applies an affine transformation and normalizes the activations per channel. Similar to batch normalization but with a larger minibatch size, its statistics are computed at initialization, while the parameters of the bijection are then freely updated during training with smaller minibatch sizes; this technique is called data dependent initialization. Thus, in the density estimation experiments, we employed actnorm layers as bijections in the chain of normalizing flows and also in the deep conditioning neural networks.

5 Experiments

5.1 Density estimation

We first conduct experiments to evaluate the benefits of the proposed flow model (CONF). As observed in [Huang et al., 2018], the expressiveness of affine coupling flows and affine autoregressive flows stems from the complexity of the conditioning neural network that models the flow parameters, and from successive application of the flows. Therefore, for a fair comparison we follow [Papamakarios et al., 2017] and use a general-purpose neural network composed of fully connected layers in the design of the conditioning networks. In this way we highlight the capacity of the flow itself, without relying on complex data dependent neural networks such as the deep residual convolutional networks used in [Dinh et al., 2016, Kingma and Dhariwal, 2018, Ho et al., 2019].

First we evaluate the proposed flow for density estimation on tabular datasets, considering two UCI datasets (POWER, GAS) and the natural image patches dataset (BSDS300) used in Papamakarios et al. [2017]. A description of these datasets and the preprocessing procedure applied can be found therein. We also perform unconditional density estimation on two image datasets: MNIST, consisting of handwritten digits [Y.
LeCun, 1998], and CIFAR-10, consisting of natural images [Krizhevsky, 2009]. In BSDS300, the value of the bottom-right pixel is replaced with the average of its immediate neighbors, resulting in monochrome patches of size 8 × 8. For image data, the 2D invertible convolution is used as the flow. All datasets are dequantized by adding uniformly distributed noise to each dimension, and then they are scaled to [0, 1] values. Variational dequantization has been proposed as an alternative method offering a better variational lower bound on the log-likelihood [Ho et al., 2019], but it is beyond the scope of this paper.

We compare the density estimation performance of CONF to the affine coupling flow models Real NVP [Dinh et al., 2016] and Glow [Kingma and Dhariwal, 2018], and to the recent continuous-time invertible generative model FFJORD [Grathwohl et al., 2019]. These reversible models admit efficient sampling with a single pass of the generative model. We also compare the density estimation capacity of the proposed model against the autoregressive-based methods MADE [Germain et al., 2015] and MAF [Papamakarios et al., 2017]. This family of autoregressive normalizing flows requires O(D) evaluations of the generative function to sample from the model, making them prohibitively expensive for high dimensional applications. The results, summarized in Table 1, highlight that the circular convolution-based (FFT-based) CONF (C-CONF) and the symmetric convolution-based (DCT-based) CONF (S-CONF) offer significant performance gains over the other models. Since S-CONF outperforms C-CONF in most of the experiments, we use it as the main convolutional flow in the next experiments, simply referring to it as CONF.
The significant performance improvement of CONF on image datasets suggests that the feedforward conditioning networks were able to capture 2D local structures.

To make a fair comparison, we used a feedforward neural network architecture similar to the one used for MAF [Papamakarios et al., 2017], except that we simplified the architecture by using a single network for all parameters of a flow layer, whereas MAF used separate networks for the scaling and shift parameters. Each coupling flow is composed of a maximum of M = 2 iterates of the combined convolution flow. The parameters of the network and the number of layers are selected to be comparable to those used in [Papamakarios et al., 2017]. Details of the model architecture and experimental setup, together with more empirical results, are presented in the appendix.

5.2 Density estimation using CNN-based conditioning networks

We further assess the performance of CONF when the conditioning networks are based on convolutional neural networks, which are specifically designed for image data. A shallow convolutional NN, similar to the one used in Glow, is employed to generate the parameters of the flow, except that we use one NN to generate all the parameters of a layer, reducing the number of model parameters. The results of the experiments on MNIST and CIFAR-10 data are presented in Table 2. The experimental setup and generated samples from the model can be found in Appendices C.1 and D, respectively.

5.3 Variational inference

We also evaluate the proposed normalizing flow as a flexible inference network for a variational auto-encoder (VAE) [Rezende and Mohamed, 2015]. Here flows are only conditioned on encoded data points, produced by the encoder, and transform the posterior distribution of the latent variable without a coupling connection, resulting in z^(t) = (f^(M)_{w,s} ∘ ... ∘ f^(1)_{w,s})(z^(t-1); x) + t(x).

Table 3: Average test negative log-likelihood (NLL, in nats) and negative evidence lower bound (-ELBO) on four benchmark datasets (lower is better). Reported error bars correspond to 2 standard deviations calculated over 3 trials. The combination of number of flow steps F and iterates M of each model is reported in the format (F-M).

              MNIST                  Omniglot                Caltech Silhouettes       Frey Faces
              -ELBO      NLL         -ELBO       NLL         -ELBO        NLL          -ELBO     NLL
VAE           86.55±.06  82.14±.07   104.28±.39  97.25±.23   110.80±.46   99.62±.74    4.53±.02  4.40±.03
IAF           84.20±.17  80.79±.12   102.41±.04  96.08±.16   111.58±.38   99.92±.30    4.47±.05  4.38±.04
Planar        86.06±.31  81.91±.22   102.65±.42  96.04±.28   109.66±.42   98.53±.68    4.40±.06  4.31±.06
CONF(16-1)    83.89±.03  80.86±.05    98.35±.27  94.54±.12   108.64±1.71  97.29±.91    4.43±.01  4.34±.02
O-SNF(4-8)    84.74      81.04±.15   101.41±.08  95.25±.09   109.37±.94   97.78±.47    4.50±.00  4.39±.01
CONF(4-8)     83.22±.05  80.64±.06    97.17±.08  94.19±.03   104.09±1.03  94.56±.29    4.41±.01  4.31±.00
O-SNF(16-32)  83.32±.06  80.22±.03    99.00±.29  93.82±.21   106.08±.39   94.61±.83    4.51±.04  4.39±.05
CONF(16-16)   –          –            96.35±.05  93.66±.03   101.10±.49   92.37±.40    4.39±.02  4.29±.00

We compare the performance of the trained VAE using this convolutional flow against other approaches, including a non flow-based VAE with factorized Gaussian distributions, and flow-based VAEs using inverse autoregressive flow (IAF), planar flow [Rezende and Mohamed, 2015, Kingma et al., 2016], and Sylvester normalizing
\ufb02ows (SNF) as the building blocks of the normalizing \ufb02ows. We used the\nencoder/decoder architecture of Berg et al. [2018] and the results of the available methods are adopted\nfrom this paper. The details of training procedure are summarized in Appendix C.2.\n\nAlthough the proposed \ufb02ow is slower than SNF of the same size, the results in Table 3 show that\nCONF outperforms Sylvester \ufb02ow in most cases, and even smaller CONF models show similar or\nbetter capacity than larger SNF. Also, we observe that CONF with M = 1 outperforms planar \ufb02ow\nby a wide margin on all datasets, except for FreyFaces which is a challenging dataset and prone to\nover\ufb01tting for large SNF; here large CONF (F = 16, M = 16) perform the best among all methods,\nso demonstrates less sensitivity to over\ufb01tting on the FreyFaces dataset.\n\nNumber of parameters: Let the stochastic latent variable be a D-dimensional vector z \u2208 RD and the\nencoder\u2019s output be e(x) \u2208 RE, then each step of CONF requires an additional E\u00d7(2M D+D)+2M\nparameters to produce the \ufb02ow parameters based on e(x), which is comparable to the number of\nparameters related to a step of planar \ufb02ow if M = 1. This is of the same order of the number of\nparameters of Sylvester \ufb02ow with a bottleneck of size M , which is E \u00d7 (2M D + 2M 2 + M ).\n\n6 Conclusion\n\nIn this work we showed that circular and symmetric convolutions can be used as invertible trans-\nformations with fast and ef\ufb01cient inversion, deconvolution, and Jacobian determinant evaluation.\nThese features make them well suited for designing \ufb02exible normalizing \ufb02ows. Using these invertible\nconvolutions, we introduced a family of data adaptive coupling layers, which consist of convolutions,\nwhere the kernel of the convolutions are themselves a function of the coupling layer input. 
We also analytically derived invertible pointwise nonlinearities that implicitly induce specific regularizers on intermediate activations in deep flow models. These results also help to better understand the role of nonlinear gates through the lens of their contribution to the latent variables' distributions. Using these new architectural components, we achieved state-of-the-art performance on several datasets for invertible normalizing flows with fast sampling.

References

Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.

Yu Cheng, Felix X Yu, Rogerio S Feris, Sanjiv Kumar, Alok Choudhary, and Shi-Fu Chang. An exploration of parameter redundancy in deep networks with circulant projections. In Proceedings of the IEEE International Conference on Computer Vision, pages 2857–2865, 2015.

Z. Cinkir. A fast elementary algorithm for computing the determinant of Toeplitz matrices. ArXiv e-prints, January 2011.

Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. arXiv preprint arXiv:1605.08803, 2016.

Thomas M Foltz and BM Welsh. Image reconstruction using symmetric convolution and discrete trigonometric transforms. JOSA A, 15(11):2827–2840, 1998.

Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.

Rafael C Gonzalez and Richard E Woods. Digital image processing, 1992.

Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models.
In International Conference on Learning Representations, 2019.

Robert M Gray et al. Toeplitz and circulant matrices: A review. Foundations and Trends® in Communications and Information Theory, 2(3):155–239, 2006.

Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design, 2019.

Emiel Hoogeboom, Rianne van den Berg, and Max Welling. Emerging convolutions for generative normalizing flows. arXiv preprint arXiv:1901.11137, 2019.

Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. In International Conference on Machine Learning, pages 2083–2092, 2018.

Mahdi Karami, Laurent Dinh, Daniel Duckworth, Jascha Sohl-Dickstein, and Dale Schuurmans. Generative convolutional flow for density estimation. In Workshop on Bayesian Deep Learning, NeurIPS 2018, 2018.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.

Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, 2016.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert. A fast algorithm for the inversion of general Toeplitz matrices. Computers & Mathematics with Applications, 50(5-6):741–752, 2005.

Stephen A Martucci. Symmetric convolution and the discrete sine and cosine transforms. IEEE Transactions on Signal Processing, 42(5):1038–1051, 1994.

Marcin Moczulski, Misha Denil, Jeremy Appleyard, and Nando de Freitas.
ACDC: A structured efficient linear layer, 2015.

John F Monahan. Numerical methods of statistics. Cambridge University Press, 2011.

George Papamakarios, Iain Murray, and Theo Pavlakou. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.

Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of The 32nd International Conference on Machine Learning, pages 1530–1538, 2015.

Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pages 3738–3746, 2016.

Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.

Y. LeCun and C. Cortes. The MNIST database of handwritten digits. 1998.

Guoqing Zheng, Yiming Yang, and Jaime Carbonell. Convolutional normalizing flows. arXiv preprint arXiv:1711.02255, 2017.