{"title": "Discrete Flows: Invertible Generative Models of Discrete Data", "book": "Advances in Neural Information Processing Systems", "page_first": 14719, "page_last": 14728, "abstract": "While normalizing flows have led to significant advances in modeling high-dimensional continuous distributions, their applicability to discrete distributions remains unknown. In this paper, we show that flows can in fact be extended to discrete events---and under a simple change-of-variables formula not requiring log-determinant-Jacobian computations. Discrete flows have numerous applications. We consider two flow architectures: discrete autoregressive flows that enable bidirectionality, allowing, for example, tokens in text to depend on both left-to-right and right-to-left contexts in an exact language model; and discrete bipartite flows that enable efficient non-autoregressive generation as in RealNVP. Empirically, we find that discrete autoregressive flows outperform autoregressive baselines on synthetic discrete distributions, an addition task, and Potts models; and bipartite flows can obtain competitive performance with autoregressive baselines on character-level language modeling for Penn Tree Bank and text8.", "full_text": "Discrete Flows: Invertible Generative\n\nModels of Discrete Data\n\nDustin Tran1 Keyon Vafa12\u2217 Kumar Krishna Agrawal1\u2020 Laurent Dinh1 Ben Poole1\n\n1Google Brain 2Columbia University\n\nAbstract\n\nWhile normalizing \ufb02ows have led to signi\ufb01cant advances in modeling high-\ndimensional continuous distributions, their applicability to discrete distributions\nremains unknown. In this paper, we show that \ufb02ows can in fact be extended to dis-\ncrete events\u2014and under a simple change-of-variables formula not requiring log-\ndeterminant-Jacobian computations. 
Discrete \ufb02ows have numerous applications.\nWe consider two \ufb02ow architectures: discrete autoregressive \ufb02ows that enable bidi-\nrectionality, allowing, for example, tokens in text to depend on both left-to-right\nand right-to-left contexts in an exact language model; and discrete bipartite \ufb02ows\nthat enable e\ufb03cient non-autoregressive generation as in RealNVP. Empirically, we\n\ufb01nd that discrete autoregressive \ufb02ows outperform autoregressive baselines on syn-\nthetic discrete distributions, an addition task, and Potts models; and bipartite \ufb02ows\ncan obtain competitive performance with autoregressive baselines on character-\nlevel language modeling for Penn Tree Bank and text8.\n\n1\n\nIntroduction\n\nThere have been many recent advances in normalizing \ufb02ows, a technique for constructing\nhigh-dimensional continuous distributions from invertible transformations of simple distributions\n(Rezende and Mohamed, 2015; Tabak and Turner, 2013; Rippel and Adams, 2013). Applications for\nhigh-dimensional continuous distributions are widespread: these include latent variable models with\nexpressive posterior approximations (Rezende and Mohamed, 2015; Ranganath et al., 2016; Kingma\net al., 2016a), parallel image generation (Dinh et al., 2017; Kingma and Dhariwal, 2018), parallel\nspeech synthesis (Oord et al., 2017; Ping et al., 2018; Prenger et al., 2018), and general-purpose\ndensity estimation (Papamakarios et al., 2017).\nNormalizing \ufb02ows are based on the change-of-variables formula, which derives a density given an\ninvertible function applied to continuous events. There have not been analogous advances for dis-\ncrete distributions, where \ufb02ows are typically thought to not be applicable. 
Instead, most research for\ndiscrete data has focused on building either latent-variable models with approximate inference (Bow-\nman et al., 2015), or increasingly sophisticated autoregressive models that assume a \ufb01xed ordering\nof the data (Bengio et al., 2003; Vaswani et al., 2017).\nIn this paper, we present an alternative for \ufb02exible modeling of discrete sequences by extending\ncontinuous normalizing \ufb02ows to the discrete setting. We construct discrete \ufb02ows with two architec-\ntures:\n\n1. Discrete autoregressive \ufb02ows enable multiple levels of autoregressivity. For example, one\ncan design a bidirectional language model of text where each token depends on both left-\nto-right and right-to-left contexts while maintaining an exact likelihood and sampling.\n\n\u2217Work done as an intern at Google Brain. Supported by NSF grant DGE-1644869.\n\u2020Work done as an AI resident.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f2. Discrete bipartite \ufb02ows enable \ufb02exible models with parallel generation by using coupling\nlayers similar to RealNVP (Dinh et al., 2017) . For example, one can design nonautoregres-\nsive text models which maintain an exact likelihood for training and evaluation.\n\nWe evaluate discrete \ufb02ows on a number of controlled problems: discretized mixture of Gaussians,\nfull-rank discrete distributions, an addition task, and Potts models. In all settings, we \ufb01nd that stack-\ning discrete autoregressive \ufb02ows yields improved performance over autoregressive baselines, and\nthat bipartite \ufb02ows can reach similar performance to autoregressive baselines while being fast to\ngenerate. 
Finally, we scale up discrete bipartite \ufb02ows to character-level language modeling where we\nreach 1.38 bits per character on Penn Tree Bank and 1.23 bits per character on text8 with generation\nspeed over 100x faster than state-of-the-art autoregressive models.\n1.1 Related Work\nBidirectional models. Classically, bidirectional language models such as log-linear models and\nMarkov random \ufb01elds have been pursued, but they require either approximate inference (Mnih and\nTeh, 2012; Jernite et al., 2015) or approximate sampling (Berglund et al., 2015). Unlike bidirectional\nmodels, autoregressive models must impose a speci\ufb01c ordering, and this has been shown to matter\nacross natural language processing tasks (Vinyals et al., 2015; Ford et al., 2018; Xia et al., 2017).\nBidirectionality such as in encoders have been shown to signi\ufb01cantly improve results in neural ma-\nchine translation (Britz et al., 2017). Most recently, BERT has shown bidirectional representations\ncan signi\ufb01cantly improve transfer tasks (Devlin et al., 2018). In this work, discrete autoregressive\n\ufb02ows enable bidirectionality while maintaining the bene\ufb01ts of a (tractable) generative model.\n\nNonautoregressive models. There have been several advances for \ufb02exible modeling with nonau-\ntoregressive dependencies, mostly for continuous distributions (Dinh et al., 2014, 2017; Kingma and\nDhariwal, 2018). For discrete distributions, Reed et al. (2017) and Stern et al. (2018) have consid-\nered retaining blockwise dependencies while factorizing the graphical model structure in order to\nsimulate hierarchically. Gu et al. (2018) and Kaiser et al. (2018) apply latent variable models for\nfast translation, where the prior is autoregressive and the decoder is conditionally independent. Lee\net al. (2018) adds an iterative re\ufb01nement stage to initial parallel generations. 
Ziegler and Rush (2019) investigate latent variable models with continuous non-autoregressive normalizing flows as the prior. Aitchison et al. (2018) leverage fixed-point iterations for faster decoding of autoregressive models. In this work, discrete bipartite flows enable nonautoregressive generation while maintaining an exact density, analogous to RealNVP advances for image generation (Dinh et al., 2017). Most recently, Hoogeboom et al. (2019) proposed integer discrete flows, a concurrent work with similar ideas as discrete flows but with a flow transformation for ordinal data and applications to image compression and image generation. We find their results complement ours in illustrating the advantages of discrete invertible functions which do not require log-determinant Jacobians.

2 Background

2.1 Normalizing Flows

Normalizing flows transform a probability distribution using an invertible function (Tabak and Turner, 2013; Rezende and Mohamed, 2015; Rippel and Adams, 2013). Let x be a D-dimensional continuous random variable whose density can be computed efficiently. Given an invertible function f: R^D -> R^D, the change-of-variables formula provides an explicit construction of the induced distribution on the function's output, y = f(x):

p(y) = p(f^{-1}(y)) |det(dx/dy)|.    (1)

The transformation f is referred to as a flow and p(x) is referred to as the base distribution. Composing multiple flows can induce increasingly complex distributions that increase the expressivity of p(y) (Rezende and Mohamed, 2015; Papamakarios et al., 2017).

2.2 Flow Transformation

For an arbitrary invertible f, computing the determinant of the Jacobian incurs an O(D^3) complexity, which is infeasible for high-dimensional datasets.
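To make the computational point concrete, the following sketch (illustrative only, not from the paper) contrasts a generic O(D^3) determinant with the O(D) shortcut available when the Jacobian is triangular, as in the flows reviewed next:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
J_full = rng.normal(size=(D, D))      # generic Jacobian: det costs O(D^3)
J_tri = np.tril(J_full)               # triangular Jacobian, as in autoregressive flows

det_full = np.linalg.det(J_full)      # general-purpose determinant
det_tri_fast = np.prod(np.diag(J_tri))  # O(D): product of the diagonal entries

# For a triangular matrix the two agree (up to floating-point error).
assert np.isclose(np.linalg.det(J_tri), det_tri_fast)
```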
Thus, normalizing flows are designed so that the determinant of the flow's Jacobian can be computed efficiently. Here, we review two popular flow transformations.

Autoregressive flows. Autoregressive functions such as recurrent neural networks and Transformers (Vaswani et al., 2017) have been shown to successfully model sequential data across many domains. Specifically, assume a base distribution x ~ p(x). With mu and sigma as autoregressive functions of y, i.e. mu_d, sigma_d = f(y_1, ..., y_{d-1}), and sigma_d > 0 for all d, the flow computes a location-scale transform (Papamakarios et al., 2017; Kingma et al., 2016b),

y_d = mu_d + sigma_d * x_d    for d in 1, ..., D.

The transformation is invertible, and the inverse can be vectorized and computed in parallel:

x_d = sigma_d^{-1} (y_d - mu_d)    for d in 1, ..., D.

In addition to a fast-to-compute inverse, the autoregressive flow's Jacobian is lower-triangular, so its determinant is the product of the diagonal elements, prod_{d=1}^{D} sigma_d. This enables autoregressive flow models to have efficient log-probabilities for training and evaluation, but generation is sequential and inefficient.

Bipartite flows. Real-valued non-volume preserving (RealNVP) models use another type of invertible transformation (Dinh et al., 2017) that nonlinearly transforms subsets of the input. For some d < D, a coupling layer follows a bipartite rather than autoregressive factorization:

y_{1:d} = x_{1:d}    (2)
y_{d+1:D} = mu_{(d+1):D} + sigma_{(d+1):D} * x_{(d+1):D},    (3)

where sigma_{(d+1):D} and mu_{(d+1):D} are functions of x_{1:d} with sigma_{(d+1):D} > 0 (we fix values for the lower indices, mu_{1:d} = 0, sigma_{1:d} = 1). Due to the bipartite nature of the transformation in coupling layers, we refer to them as bipartite flows.
By changing the ordering of variables between each flow, the composition of bipartite flows can learn highly flexible distributions. By design, their Jacobian is lower-triangular, with a determinant that is the product of diagonal elements, prod_{i=d+1}^{D} sigma_i.

Bipartite flows are not as expressive as autoregressive flows, as a subset of variables does not undergo a transformation. However, both their forward and inverse computations are fast to compute, making them suitable for generative modeling where fast generation is desired.

3 Discrete Flows

Normalizing flows depend on the change-of-variables formula (Equation 1) to compute the change in probability mass under the transformation. However, the change-of-variables formula applies only to continuous random variables. We extend normalizing flows to discrete events.

3.1 Discrete Change of Variables

Let x be a discrete random variable and y = f(x), where f is some function of x. The induced probability mass function of y is the sum over the pre-image of f:

p(y = y) = sum_{x in f^{-1}(y)} p(x = x),

where f^{-1}(y) is the set of all elements such that f(x) = y. For an invertible function f, this simplifies to

p(y = y) = p(x = f^{-1}(y)).    (4)

This change-of-variables formula for discrete variables is similar to the continuous change-of-variables formula (Equation 1), but without the log-determinant-Jacobian. Intuitively, the log-determinant-Jacobian corrects for changes to the volume of a continuous space; volume does not exist for discrete distributions, so there is no need to account for it. Computationally, Equation 4 is appealing as there are no restrictions on f such as fast Jacobian computations in the continuous case, or tradeoffs in how the log-determinant-Jacobian influences the output density compared to the base density.

Figure 1: Flow transformation when computing log-likelihoods.
(a) Discrete autoregressive flows stack multiple levels of autoregressivity. The receptive field of output unit 2 (red) includes both left and right contexts. (b) Discrete bipartite flows apply a binary mask (blue and green) which determines the subset of variables to transform. With 2 flows, the receptive field of output unit 2 is y_{1:3}.

3.2 Discrete Flow Transformation

Next we develop discrete invertible functions. To build intuition, first consider the binary case. Given a D-dimensional binary vector x, one natural function applies the XOR bitwise operator,

y_d = mu_d XOR x_d,    for d in 1, ..., D,

where mu_d is a function of previous outputs, y_1, ..., y_{d-1}, and XOR returns 0 if mu_d and x_d are equal and 1 otherwise. The inverse is x_d = mu_d XOR y_d. We provide an example next.

Example. Let D = 2, where p(y) is defined by the following probability table:

           y2 = 0   y2 = 1
y1 = 0      0.63     0.07
y1 = 1      0.03     0.27

The data distribution cannot be captured by a factorized one, p(y1)p(y2). However, it can with a flow: set f(x1, x2) = (x1, x1 XOR x2); p(x1) with probabilities [0.7, 0.3]; and p(x2) with probabilities [0.9, 0.1]. The flow captures correlations that cannot be captured by the base alone. More broadly, discrete flows perform a multi-dimensional relabeling of the data such that it is easier to model with the base. This is analogous to continuous flows, which whiten the data such that it is easier to model with the base distribution (typically, a spherical Gaussian).

Modulo location-scale transform. To extend XOR to the categorical setting, consider a D-dimensional vector x, each element of which takes on values in 0, 1, ..., K - 1.
One can perform location-scale transformations on the modulo integer space,

y_d = (mu_d + sigma_d * x_d) mod K.    (5)

Here, mu_d and sigma_d are autoregressive functions of y taking on values in 0, 1, ..., K - 1 and 1, ..., K - 1 respectively. For this transformation to be invertible, sigma and K must be coprime (an explicit solution for sigma^{-1} is given by the extended Euclidean algorithm). An easy way to ensure coprimality is to set K to be prime, to mask noninvertible sigma values for a given K, or to fix sigma = 1. Setting K = 2 and sigma = 1, it is easy to see that the modulo location-scale transform generalizes XOR. (We use sigma = 1 for all experiments except character-level language modeling.)

The idea also extends to the bipartite flow setting: the functions (mu, sigma) are set to (0, 1) for a subset of the data dimensions, and are functions of that subset otherwise. Invertible discrete functions are widely used in random number generation, and could provide inspiration for alternatives to the location-scale transformation for constructing flows (Salmon et al., 2011).

Example. Figure 2 illustrates an example of using flows to model correlated categorical data. Following Metz et al. (2016), the data is drawn from a mixture of Gaussians with 8 means evenly spaced around a circle of radius 2. The output variance is 0.01, with samples truncated to be between -2.25 and 2.25, and we discretize at the 0.05 level, resulting in two categorical variables (one for x and one for y), each with 90 states.

Figure 2 (panels: (a) Data; (b) Factorized Base; (c) 1 Flow): Learning a discretized mixture of Gaussians with maximum likelihood. Discrete flows help capture the modes, which a factorized distribution cannot. (Note that because the data is 2-D, discrete autoregressive flows and discrete bipartite flows are equivalent.)
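As a concrete sketch of Equation 5 (illustrative code, not the paper's implementation; the function names are ours), the transform and its inverse can be written as:

```python
def mod_inverse(sigma, K):
    # Modular inverse of sigma modulo K (exists iff gcd(sigma, K) == 1);
    # Python's three-argument pow computes it via the extended Euclidean algorithm.
    return pow(sigma, -1, K)

def forward(x, mu, sigma, K):
    # y = (mu + sigma * x) mod K  (Equation 5, one dimension)
    return (mu + sigma * x) % K

def inverse(y, mu, sigma, K):
    # x = sigma^{-1} * (y - mu) mod K
    return (mod_inverse(sigma, K) * (y - mu)) % K
```

With K = 2 and sigma = 1 this reduces to XOR: forward(x, mu, 1, 2) equals mu XOR x.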
A factorized base distribution cannot capture the data correlations, while a single discrete flow can. (Note the modulo location-scale transform does not make an ordinal assumption. We display ordinal data only for ease of visualization; other experiments use non-ordinal data.)

3.3 Training Discrete Flows

With discrete flow models, the maximum likelihood objective per datapoint is

log p(y) = log p(f^{-1}(y)),

where the flow f has free parameters according to its autoregressive or bipartite network, and the base distribution p has free parameters as a factorized (or itself an autoregressive) distribution. Gradient descent with respect to the base distribution parameters is straightforward. To perform gradient descent with respect to the flow parameters, one must backpropagate through the discrete-output functions mu and sigma. We use the straight-through gradient estimator (Bengio et al., 2013). In particular, the (autoregressive or bipartite) network outputs two vectors of K logits theta_d for each dimension d, one each for the location and the scale. For the scale, we add a mask whose elements are negative infinity on non-invertible values such as 0. On the forward pass, we take the argmax of the logits, where for the location,

mu_d = one_hot(argmax(theta_d)).    (6)

Because the argmax operation is not differentiable, we replace Equation 6 on the backward pass with the softmax-temperature function:

d mu_d / d theta_d ~= d/d theta_d softmax(theta_d / tau).

As the temperature tau -> 0, the softmax-temperature approaches the argmax and the bias of the gradient estimator disappears. However, when tau is too low, the gradients vanish, inhibiting optimization.
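A minimal sketch of this estimator's forward behavior (illustrative NumPy, not the paper's code; in an autodiff framework one writes `soft + stop_gradient(hard - soft)` so that gradients flow through the softmax term):

```python
import numpy as np

def softmax(logits, tau):
    # Temperature-scaled softmax, stabilized by subtracting the max logit.
    z = logits / tau
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def one_hot_argmax(logits):
    h = np.zeros_like(logits)
    h[np.argmax(logits)] = 1.0
    return h

def straight_through(logits, tau=0.1):
    soft = softmax(logits, tau)
    hard = one_hot_argmax(logits)
    # Forward value equals the hard one-hot; on the backward pass an autodiff
    # framework would differentiate only through `soft`.
    return soft + (hard - soft)
```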
Work with the Gumbel-softmax distribution indicates that this approximation works well when the number of classes K < 200 (Maddison et al., 2016; Jang et al., 2017), which aligns with our experimental settings; we also fix tau = 0.1.

Limitations. The extent to which the straight-through estimator works well can be unclear. The number of classes K makes a difference, but there are also other factors. In particular, depth (the number of flows) affects gradients, since the bias accumulates as each flow uses a gradient approximation. The same holds for dimensionality (sequence length). We have not found the complexity of the networks parameterizing the flow to make a difference empirically, but this requires further investigation. As a reviewer noted, "true gradients" are also not well-defined here, making gradient bias even less intuitive. Instead of examining gradient bias, one might formalize bias by comparing, for example, the minima obtained from optimizing the discrete loss to the minima obtained from optimizing the relaxed loss, or by moving to a setting where the parameters are stochastic. We leave a better understanding of these limitations to future work.

                 Autoregressive Base   Autoregressive Flow   Factorized Base   Bipartite Flow
D = 2,  K = 2           0.9                   0.9                 1.3               1.0
D = 5,  K = 5           7.7                   7.6                 8.0               7.9
D = 5,  K = 10         10.7                  10.3                11.5              10.7
D = 10, K = 5          15.9                  15.7                16.6              16.0

Table 1: Negative log-likelihoods for the full-rank discrete distribution (lower is better). Autoregressive flows improve over their autoregressive base. Bipartite flows improve over their factorized base and achieve nats close to an autoregressive distribution while remaining parallel.

                         AR Base   AR Flow
number of states = 3
D = 9,  J = 0.1            9.27     9.124
D = 9,  J = 0.5            3.79     3.79
D = 16, J = 0.1           16.66    11.23
D = 16, J = 0.5            6.30     5.62
number of states = 4
D = 9,  J = 0.1           11.64    10.45
D = 9,  J = 0.5            5.87     5.56
number of states = 5
D = 9,  J = 0.1           13.58    10.25
D = 9,  J = 0.5            7.94     7.07

Table 2: Negative log-likelihoods on the square-lattice Potts model (lower is better). D denotes dimensionality. Higher coupling strength J corresponds to more spatial correlations.

4 Experiments

We perform a series of synthetic tasks to better understand discrete flows, and also perform character-level language modeling tasks. For all experiments with discrete autoregressive flows, we used an autoregressive Categorical base distribution where the first flow is applied in reverse ordering. (This setup lets us compare its advantage of bidirectionality to the baseline of an autoregressive base with 0 flows.) For all experiments with discrete bipartite flows, we used a factorized Categorical base distribution where the bipartite flows alternate masking of even and odd dimensions.

4.1 Full-rank Discrete Distribution

To better understand the expressivity of discrete flows, we examined how well they could fit random full-rank discrete distributions. In particular, we sample a true set of probabilities for all D dimensions of K classes according to a Dirichlet distribution of size K^D - 1, alpha = 1. For the network for the autoregressive base distribution and location parameters of the flows, we used a Transformer with 64 hidden units.
We used a composition of 1 flow for the autoregressive flow models, and 4 flows for the bipartite flow models.

Table 1 displays negative log-likelihoods (nats) of trained models over data simulated from this distribution. Across the data dimension D and number of classes K, autoregressive flows gain several nats over the autoregressive base distribution, which has no flow on top. Bipartite flows improve over their factorized base and in fact obtain nats competitive with the autoregressive base while remaining fully parallel for generation.

                                      Test NLL (bpc)   Generation
3-layer LSTM (Merity et al., 2018)        1.181          3.8 min
Ziegler and Rush (2019) (AF/SCF)          1.46              -
Ziegler and Rush (2019) (IAF/SCF)         1.63              -
Bipartite flow                            1.38          0.17 sec

Table 3: Character-level language modeling results on Penn Tree Bank.

4.2 Addition

Following Zaremba and Sutskever (2014), we examine an addition task: there are two input numbers with D digits (each digit takes K = 10 values), and the output is their sum with D digits (we remove the (D + 1)-th digit if it appears). Addition naturally follows a right-to-left ordering: computing the leftmost digit requires carrying the remainder from the rightmost computations. Given an autoregressive base which poses a left-to-right ordering, we examine whether the bidirectionality that flows offer can adjust for wrong orderings. While the output is deterministic, the flexibility of discrete flows may enable more accurate outputs. We use an LSTM to encode both inputs, apply 0 or 1 flows on the output, and then apply an LSTM to parameterize the autoregressive base, where its initial state is set to the concatenation of the two encodings.
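For concreteness, training pairs for this task can be generated as follows (an illustrative sketch; the function name and digit encoding are ours, not the paper's):

```python
import random

def addition_example(D, rng):
    # Two D-digit inputs (digits in 0..9) and their D-digit sum,
    # dropping the (D + 1)-th carry digit if it appears.
    a = rng.randrange(10 ** D)
    b = rng.randrange(10 ** D)
    s = (a + b) % (10 ** D)
    to_digits = lambda n: [int(c) for c in str(n).zfill(D)]
    return to_digits(a), to_digits(b), to_digits(s)
```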
All LSTMs use 256 hidden units for D = 10, and 512 hidden units for D = 20.

For D = 10, an autoregressive base achieves 4.0 nats; an autoregressive flow achieves 0.2 nats (i.e., close to the true deterministic solution over all pairs of 10-digit numbers). A bipartite model with 1, 2, and 4 flows achieves 4.0, 3.17, and 2.58 nats respectively. For D = 20, an autoregressive base achieves 12.2 nats; an autoregressive flow achieves 4.8 nats. A bipartite model with 1, 2, 4, and 8 flows achieves 12.2, 8.8, 7.6, and 5.08 nats respectively.

4.3 Potts Model

Given the bidirectional dependency enabled by discrete flows, we examined how they could be used for distilling undirected models with tractable energies but intractable sampling and likelihoods. We sampled from Potts models (the Categorical generalization of Ising models), which are 2-D Markov random fields with pairwise interactions between neighbors (above/below, left/right, but not diagonally connected) (Wu, 1982). To generate data we ran 500 steps of Metropolis-Hastings, and evaluated the NLL of baselines and discrete flows as a function of the coupling strength, J. Low coupling strengths correspond to more independent states, while high coupling strengths result in more correlated states across space. For the base network, we used a single-layer LSTM with 32 hidden units. For the flow network, we used an embedding layer which returns a trainable location parameter for every unique combination of inputs.

Table 2 displays negative log-likelihoods (nats) of trained models over data simulated from Potts models with varying lattice size and coupling strength. As Potts models are undirected, the autoregressive base posits a poor inductive bias by fixing an ordering and sharing network parameters across the individual conditional distributions.
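To make the data-generation setup concrete, here is a sketch of a Potts energy and a single Metropolis-Hastings update (illustrative only; the sign convention and acceptance rule are standard textbook choices, not taken from the paper):

```python
import math
import random

def potts_energy(grid, J):
    # E(x) = -J * (number of equal horizontally/vertically adjacent pairs).
    L = len(grid)
    e = 0.0
    for i in range(L):
        for j in range(L):
            if i + 1 < L and grid[i][j] == grid[i + 1][j]:
                e -= J
            if j + 1 < L and grid[i][j] == grid[i][j + 1]:
                e -= J
    return e

def mh_step(grid, J, num_states, rng):
    # Propose resampling one site uniformly; accept with prob min(1, exp(-dE)).
    L = len(grid)
    i, j = rng.randrange(L), rng.randrange(L)
    old = grid[i][j]
    e_old = potts_energy(grid, J)
    grid[i][j] = rng.randrange(num_states)
    d_e = potts_energy(grid, J) - e_old
    if rng.random() >= math.exp(min(0.0, -d_e)):
        grid[i][j] = old  # reject the proposal
    return grid
```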
Over data dimension D and coupling J, autoregressive flows perform as well as, or improve upon, autoregressive base models. Appendix A includes samples from the model; they are visually indistinguishable from the data.

4.4 Character-Level Penn Tree Bank

We follow the setup of Ziegler and Rush (2019), which to the best of our knowledge is the only comparable work with nonautoregressive language modeling. We use Penn Tree Bank with minimal processing from Mikolov et al. (2012), consisting of roughly 5M characters and a vocabulary size of K = 51. We split the data into sentences and restrict to a maximum sequence length of 288. The LSTM baseline of Merity et al. (2018) uses 3 layers, truncated backpropagation with a sequence length of 200, an embedding size of 400, and a hidden size of 1850.3 Ziegler and Rush (2019)'s nonautoregressive models have two variants, in which they use a specific prior with a conditionally independent likelihood and fully factorized variational approximation: AF/SCF uses a prior z in R^{T x H} over latent time steps and hidden dimension that is autoregressive in T and nonautoregressive in H; and IAF/SCF is nonautoregressive in both T and H.

3The LSTM results are only approximately comparable, as they do not apply the extra preprocessing step of removing sentences with >288 tokens.

                                              bpc     Gen.
LSTM (Cooijmans+2016)                         1.43    19.8s
64-layer Transformer (Al-Rfou+2018)           1.13    35.5s
Bipartite flow (4 flows, w/ sigma)            1.60    0.15s
Bipartite flow (8 flows, w/o sigma)           1.29    0.16s
Bipartite flow (8 flows, w/ sigma)            1.23    0.16s

Figure 3: Character-level language modeling results on text8. The test bits per character decreases as the number of flows increases. More hidden units H and layers L in the Transformer per flow, and applying a scale transformation instead of only a location transformation, also improve performance.
For the bipartite flow, we use 8 flows, each with an embedding of size 400 and an LSTM with 915 hidden units.

Table 3 compares the test negative log-likelihood in bits per character as well as the time to generate a 288-dimensional sequence of tokens on an NVIDIA P100 GPU. The bipartite flow significantly outperforms Ziegler and Rush (2019), including their autoregressive/nonautoregressive hybrid. In addition, the generation time is over 1000x faster than the LSTM baseline. Intuitively, the use of bipartite flows means that we only have to perform one forward pass over the model, as opposed to the 288 forward passes for a typical autoregressive model.

4.5 Character-Level text8

We also evaluated on text8, using the preprocessing of Mikolov et al. (2012); Zhang et al. (2016), with 100M characters and a vocabulary size of K = 27. We split the data into 90M characters for train, 5M characters for dev, and 5M characters for test. For discrete bipartite flows, we use a batch size of 128, a sequence length of 256, a varying number of flows, and parameterize each flow with a Transformer with 2 or 3 layers, 512 hidden units, 2048 filter size, and 8 heads.

Figure 3 compares the test negative log-likelihood in bits per character as well as the time to generate one data point, i.e., a 256-dimensional sequence, on an NVIDIA P100 GPU. The bipartite flow reaches competitive performance: better than an LSTM baseline, though not as good as the state-of-the-art bpc from the 235M-parameter 64-layer Transformer (we are not aware of previous nonautoregressive results to compare to). We also find that having a learned scale ("w/ sigma") improves performance over fixing sigma = 1 and only learning the location transform mu.
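One step of a discrete bipartite coupling layer under the modulo transform can be sketched as follows (illustrative; the mask convention and function names are ours). Masked-in positions pass through unchanged, and the remaining positions are transformed with (mu, sigma) that, in a real model, are network outputs depending only on the masked-in subset; the "w/o sigma" variant simply fixes sigma = 1:

```python
def bipartite_coupling_forward(x, mask, mu, sigma, K):
    # x, mu, sigma: lists of ints; mask[d] == 1 means "pass through unchanged".
    return [x[d] if mask[d] else (mu[d] + sigma[d] * x[d]) % K
            for d in range(len(x))]

def bipartite_coupling_inverse(y, mask, mu, sigma, K):
    # pow(s, -1, K) is the modular inverse of s modulo K (requires coprimality).
    return [y[d] if mask[d] else (pow(sigma[d], -1, K) * (y[d] - mu[d])) % K
            for d in range(len(y))]
```

Because mu and sigma depend only on the untransformed subset, inversion needs no sequential loop, which is the source of the parallel generation speed reported above.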
The bipartite \ufb02ows\u2019 generation times are\nsigni\ufb01cantly faster than the baselines with upwards of a 100x speedup.\n\n5 Discussion\n\nWe describe discrete \ufb02ows, a class of invertible functions for \ufb02exible modeling of discrete data.\nDiscrete autoregressive \ufb02ows enable bidirectionality by stacking multiple levels of autoregressivity,\neach with varying order. Discrete bipartite \ufb02ows enable nonautoregressive generation by \ufb02exibly\nmodeling data with a sequence of bipartite-factorized \ufb02ow transformations. Our experiments across\na range of synthetic tasks and character-level text data show the promise of such approaches.\nAs future work, we\u2019re also investigating discrete inverse autoregressive \ufb02ows, which enable \ufb02exible\nvariational approximations for discrete latent variable models. An open question remains with scal-\ning discrete \ufb02ows to large numbers of classes: in particular, the straight-through gradient estimator\nworks well for small numbers of classes such as for character-level language modeling, but it may\nnot work for (sub)word-level modeling where the vocabulary size is greater than 5,000. In such set-\ntings, alternative gradient estimators or data representations may be fruitful. Lastly, there may be\nother invertible discrete functions used in cryptography and random number generation that could\nbe leveraged for building up alternative forms for discrete \ufb02ows (Salmon et al., 2011).\n\n8\n\n\fReferences\nAitchison, L., Adam, V., and Turaga, S. C. (2018). Discrete \ufb02ow posteriors for variational inference\n\nin discrete dynamical systems.\n\nBengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model.\n\nJournal of machine learning research, 3(Feb):1137\u20131155.\n\nBengio, Y., L\u00e9onard, N., and Courville, A. (2013). Estimating or propagating gradients through\n\nstochastic neurons for conditional computation. 
arXiv preprint arXiv:1308.3432.\n\nBerglund, M., Raiko, T., Honkala, M., K\u00e4rkk\u00e4inen, L., Vetek, A., and Karhunen, J. T. (2015). Bidi-\nIn Advances in Neural Information\n\nrectional recurrent neural networks as generative models.\nProcessing Systems, pages 856\u2013864.\n\nBowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. (2015). Generating\n\nsentences from a continuous space. arXiv preprint arXiv:1511.06349.\n\nBritz, D., Goldie, A., Luong, M.-T., and Le, Q. (2017). Massive exploration of neural machine\n\ntranslation architectures. arXiv preprint arXiv:1703.03906.\n\nDevlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirec-\n\ntional transformers for language understanding. arXiv preprint arXiv:1810.04805.\n\nDinh, L., Krueger, D., and Bengio, Y. (2014). NICE: Non-linear independent components estimation.\n\narXiv preprint arXiv:1410.8516.\n\nDinh, L., Sohl-Dickstein, J., and Bengio, S. (2017). Density estimation using real nvp. In Interna-\n\ntional Conference on Learning Representations.\n\nFord, N., Duckworth, D., Norouzi, M., and Dahl, G. E. (2018). The importance of generation order\n\nin language modeling. In Empirical Methods in Natural Language Processing.\n\nGu, J., Bradbury, J., Xiong, C., Li, V. O., and Socher, R. (2018). Non-autoregressive neural machine\n\ntranslation. In International Conference on Learning Representations.\n\nHoogeboom, E., Peters, J. W., van den Berg, R., and Welling, M. (2019). Integer discrete \ufb02ows and\n\nlossless compression. arXiv preprint arXiv:1905.07376.\n\nJang, E., Gu, S., and Poole, B. (2017). Categorical reparameterization with gumbel-softmax. In\n\nInternational Conference on Learning Representations.\n\nJernite, Y., Rush, A., and Sontag, D. (2015). A fast variational approach for learning markov random\n\n\ufb01eld language models. 
In International Conference on Machine Learning, pages 2209–2217.

Kaiser, Ł., Roy, A., Vaswani, A., Parmar, N., Bengio, S., Uszkoreit, J., and Shazeer, N. (2018). Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382.

Kingma, D. P. and Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016a). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751.

Kingma, D. P., Salimans, T., and Welling, M. (2016b). Improving variational inference with inverse autoregressive flow. In Neural Information Processing Systems.

Lee, J., Mansimov, E., and Cho, K. (2018). Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901.

Maddison, C. J., Mnih, A., and Teh, Y. W. (2016). The Concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.

Merity, S., Keskar, N. S., and Socher, R. (2018). An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240.

Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. (2016). Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163.

Mikolov, T., Sutskever, I., Deoras, A., Le, H.-S., Kombrink, S., and Cernocky, J. (2012). Subword language modeling with neural networks. Preprint (http://www.fit.vutbr.cz/imikolov/rnnlm/char.pdf), 8.

Mnih, A. and Teh, Y. W. (2012). A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426.

Oord, A. v. d., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., Driessche, G. v. d., Lockhart, E., Cobo, L. C., Stimberg, F., et al. (2017). Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433.

Papamakarios, G., Murray, I., and Pavlakou, T. (2017). Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2335–2344.

Ping, W., Peng, K., and Chen, J. (2018). ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281.

Prenger, R., Valle, R., and Catanzaro, B. (2018). WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002.

Ranganath, R., Tran, D., and Blei, D. (2016). Hierarchical variational models. In International Conference on Machine Learning, pages 324–333.

Reed, S., van den Oord, A., Kalchbrenner, N., Colmenarejo, S. G., Wang, Z., Chen, Y., Belov, D., and de Freitas, N. (2017). Parallel multiscale autoregressive density estimation. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2912–2921. JMLR.org.

Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. In International Conference on Machine Learning.

Rippel, O. and Adams, R. P. (2013). High-dimensional probability estimation with deep density models. arXiv preprint arXiv:1302.5125.

Salmon, J. K., Moraes, M. A., Dror, R. O., and Shaw, D. E. (2011). Parallel random numbers: as easy as 1, 2, 3. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, page 16. ACM.

Stern, M., Shazeer, N., and Uszkoreit, J. (2018). Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems, pages 10107–10116.

Tabak, E. and Turner, C. V. (2013). A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Vinyals, O., Bengio, S., and Kudlur, M. (2015). Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391.

Wu, F.-Y. (1982). The Potts model. Reviews of Modern Physics, 54(1):235.

Xia, Y., Tian, F., Wu, L., Lin, J., Qin, T., Yu, N., and Liu, T.-Y. (2017). Deliberation networks: Sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems, pages 1784–1794.

Zaremba, W. and Sutskever, I. (2014). Learning to execute. arXiv preprint arXiv:1410.4615.

Zhang, S., Wu, Y., Che, T., Lin, Z., Memisevic, R., Salakhutdinov, R. R., and Bengio, Y. (2016). Architectural complexity measures of recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1822–1830.

Ziegler, Z. M. and Rush, A. M. (2019). Latent normalizing flows for discrete sequences. arXiv preprint arXiv:1901.10548.