{"title": "Deep State Space Models for Unconditional Word Generation", "book": "Advances in Neural Information Processing Systems", "page_first": 6158, "page_last": 6168, "abstract": "Autoregressive feedback is considered a necessity for successful unconditional text generation using stochastic sequence models. However, such feedback is known to introduce systematic biases into the training process and it obscures a principle of generation: committing to global information and forgetting local nuances. We show that a non-autoregressive deep state space model with a clear separation of global and local uncertainty can be built from only two ingredients: An independent noise source and a deterministic transition function. Recent advances on flow-based variational inference can be used to train an evidence lower-bound without resorting to annealing, auxiliary losses or similar measures. The result is a highly interpretable generative model on par with comparable auto-regressive models on the task of word generation.", "full_text": "Deep State Space Models for\n\nUnconditional Word Generation\n\nFlorian Schmidt\n\nETH Z\u00fcrich\n\nThomas Hofmann\n\nETH Z\u00fcrich\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nflorian.schmidt@inf.ethz.ch\n\nthomas.hofmann@inf.ethz.ch\n\nAbstract\n\nAutoregressive feedback is considered a necessity for successful unconditional text\ngeneration using stochastic sequence models. However, such feedback is known\nto introduce systematic biases into the training process and it obscures a principle\nof generation: committing to global information and forgetting local nuances. We\nshow that a non-autoregressive deep state space model with a clear separation of\nglobal and local uncertainty can be built from only two ingredients: An independent\nnoise source and a deterministic transition function. 
Recent advances on flow-based variational inference can be used to train an evidence lower-bound without resorting to annealing, auxiliary losses or similar measures. The result is a highly interpretable generative model on par with comparable auto-regressive models on the task of word generation.

1 Introduction

Deep generative models for sequential data are an active field of research. Generation of text, in particular, remains a challenging and relevant area [HYX+17]. Recurrent neural networks (RNNs) are a common model class, and are typically trained via maximum likelihood [BVV+15] or adversarially [YZWY16, FGD18]. For conditional text generation, the sequence-to-sequence architecture of [SVL14] has proven to be an excellent starting point, leading to significant improvements across a range of tasks, including machine translation [BCB14, VSP+17], text summarization [RCW15], sentence compression [FAC+15] and dialogue systems [SSB+16]. Similarly, RNN language models have been used with success in speech recognition [MKB+10, GJ14]. In all these tasks, generation is conditioned on information that severely narrows down the set of likely sequences. The role of the model is then largely to distribute probability mass within relatively constrained sets of candidates.
Our interest is, by contrast, in unconditional or free generation of text via RNNs. We take as our point of departure the shortcomings of existing model architectures and training methodologies developed for conditional tasks. These arise from the increased challenges to both accuracy and coverage. Generating grammatical and coherent text is considerably more difficult without reliance on an acoustic signal or a source sentence, which may constrain, if not determine, much of the sentence structure.
Moreover, failure to sufficiently capture the variety and variability of data may not surface in conditional tasks, yet is a key desideratum in unconditional text generation.
The de facto standard model for text generation is based on the RNN architecture originally proposed by [Gra13] and incorporated as a decoder network in [SVL14]. It evolves a continuous state vector, emitting one symbol at a time, which is then fed back into the state evolution -- a property that characterizes the broader class of autoregressive models. However, even in a conditional setting, these RNNs are difficult to train without substitution of previously generated words by ground truth observations during training, a technique generally referred to as teacher forcing [WZ89]. This approach is known to cause biases [RCAZ15a, GLZ+16] that can be detrimental to test-time performance, where such nudging is not available and where state trajectories can go astray, requiring ad hoc fixes like beam search [WR16] or scheduled sampling [BVJS15]. Nevertheless, teacher forcing has been carried over to unconditional generation [BVV+15].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Another drawback of autoregressive feedback [Gra13] is the dual use of a single source of stochasticity. The probabilistic output selection has to account for the local variability in the next-token distribution. In addition, it also has to inject a sufficient amount of entropy into the evolution of the state space sequence, which is otherwise deterministic. Such noise injection is known to compete with the explanatory power of autoregressive feedback mechanisms and may result in degenerate, near-deterministic models [BVV+15]. As a consequence, there have been a variety of papers that propose deep stochastic state sequence models, which combine stochastic and deterministic dependencies, e.g.
[CKD+15, FSPW16], or which make use of auxiliary latent variables [GSC+17], auxiliary losses [SATB17], and annealing schedules [BVV+15]. No canonical architecture has emerged so far, and it remains unclear how the stochasticity in these models can be interpreted and measured.
In this paper, we propose a stochastic sequence model that preserves the Markov structure of standard state space models by cleanly separating the stochasticity in the state evolution, injected via a white noise process, from the randomness in the local token generation. We train our model using variational inference (VI) and build upon recent advances in normalizing flows [RM15, KSW16] to define rich enough stochastic state transition functions for both generation and inference. Our main goal is to investigate the fundamental question of how far one can push such an approach in text generation, and to more deeply understand the role of stochasticity. For that reason, we have used the most basic problem of text generation as our testbed: word morphology, i.e. the mechanisms underlying the formation of words from characters. This enables us to empirically compare our model to autoregressive RNNs on several metrics that are intractable in more complex tasks such as word sequence modeling.

2 Model

We argue that text generation is subject to two sorts of uncertainty: uncertainty about plausible long-term continuations and uncertainty about the emission of the current token. The first reflects the entropy of all things considered "natural language", the second reflects symbolic entropy at a fixed position that arises from ambiguity, (near-)analogies, or a lack of contextual constraints.
As a consequence, we cast the emission of a token as a fundamental trade-off between committing to and forgetting information.

2.1 State space model

Let us define a state space model with transition function

  F : ℝ^d × ℝ^d → ℝ^d,   (h_t, ξ_t) ↦ h_{t+1} = F(h_t, ξ_t),   ξ_t ~ N(0, I) i.i.d.    (1)

F is deterministic, yet driven by a white noise process ξ, and, starting from some h_0, defines a homogeneous stochastic process. A local observation model P(w_t|h_t) generates symbols w_t ∈ Σ and is typically realized by a softmax layer with symbol embeddings.
The marginal probability of a symbol sequence w = w_{1:T} is obtained by integrating out h = h_{1:T},

  P(w) = ∫ ∏_{t=1}^{T} p(h_t|h_{t-1}) P(w_t|h_t) dh.    (2)

Here p(h_t|h_{t-1}) is defined implicitly by driving F with noise, as we will explain in more detail below.¹ In contrast to common RNN architectures, we have defined F to not include an autoregressive input, such as w_{t-1}, making potential biases as in teacher forcing a non-issue. Furthermore, this implements our assumption about the role of entropy and information in generation. The information about the local outcome under P(w_t|h_t) is not considered in the transition to the next state, as there is no feedback. Thus, in this model, all entropy about possible sequence continuations must arise from the noise process ξ, which cannot be ignored in a successfully trained model.

¹For ease of exposition, we assume fixed-length sequences, although in practice one works with end-of-sequence tokens and variable-length sequences.

The implied generative procedure follows directly from the chain rule. To sample a sequence of observations, we (i) sample a white noise sequence ξ = ξ_{1:T}, (ii) deterministically compute h = h_{1:T} from h_0 and ξ via F, and (iii) sample from the observation model ∏_{t=1}^{T} P(w_t|h_t).
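As a minimal illustration of this three-step procedure, the following sketch rolls out such a model with random stand-in parameters; the tanh transition and the embedding matrix here are hypothetical choices, not the learned components of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, T = 8, 5, 4  # state size, vocabulary size, sequence length

# Random stand-in parameters: in the paper F and the softmax layer are
# learned; this particular tanh transition is only for illustration.
W = rng.normal(size=(d, d)) / np.sqrt(d)
U = rng.normal(size=(d, d)) / np.sqrt(d)
E = rng.normal(size=(V, d))  # symbol embeddings of the observation model

def F(h, xi):
    # Deterministic transition driven by white noise (Eq. 1).
    return np.tanh(W @ h + U @ xi)

def sample_word(h0):
    # (i) sample white noise, (ii) roll out states, (iii) sample symbols.
    h, word = h0, []
    for _ in range(T):
        xi = rng.standard_normal(d)           # xi_t ~ N(0, I)
        h = F(h, xi)                          # h_t = F(h_{t-1}, xi_t)
        logits = E @ h
        p = np.exp(logits - logits.max())
        p /= p.sum()                          # softmax P(w_t | h_t)
        word.append(int(rng.choice(V, p=p)))  # w_t is never fed back
    return word

print(sample_word(np.zeros(d)))
```

Note that the sampled symbol never enters the state update, which is exactly the non-autoregressive property discussed above.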
The remainder of this section focuses on how we can define a sufficiently powerful family of state evolution functions F and how variational inference can be used for training.

2.2 Variational inference

Model-based variational inference (VI) allows us to approximate the marginalization in Eq. (2) by posterior expectations with regard to an inference model q(h|w). It is easy to verify that the true posterior obeys the conditional independences h_t ⊥⊥ rest | h_{t-1}, w_{t:T}, which informs our design of the inference model, cf. [FSPW16]:

  q(h|w) = ∏_{t=1}^{T} q(h_t|h_{t-1}, w_{t:T}).    (3)

This is to say, the previous state is a sufficient summary of the past. Jensen's inequality then directly implies the evidence lower bound (ELBO)

  log P(w) ≥ E_q[ log P(w|h) + log ( p(h) / q(h|w) ) ] =: L = Σ_{t=1}^{T} L_t    (4)

  L_t := E_q[ log P(w_t|h_t) ] + E_q[ log ( p(h_t|h_{t-1}) / q(h_t|h_{t-1}, w_{t:T}) ) ]    (5)

This is a well-known form, which highlights the per-step balance between prediction quality and the discrepancy between the transition probabilities of the unconditioned generative and the data-conditioned inference models [FSnPW16, CKD+15]. Intuitively, the inference model breaks down the long-range dependencies and provides a local training signal to the generative model for a single step transition and a single output generation.
Using VI successfully for generating symbol sequences requires parametrizing powerful yet tractable next-state transitions. As a minimum requirement, forward sampling and log-likelihood computation need to be available.
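As a toy illustration of these two operations, the sketch below uses an elementwise affine map as the invertible function: forward sampling pushes white noise through f, and the exact log-likelihood follows from the change-of-variables formula. The scales and offsets are illustrative values, not anything learned:

```python
import numpy as np

# A minimal invertible flow h = f(xi): elementwise affine with nonzero
# scales (illustrative values; real models chain richer invertible maps).
a = np.array([0.5, 2.0, 1.5])
b = np.array([0.1, -0.3, 0.0])

f = lambda xi: a * xi + b              # forward sampling: push noise through f
f_inv = lambda h: (h - b) / a
log_det_J = np.sum(np.log(np.abs(a)))  # log |det J_f|, O(d) for this flow

def log_std_normal(x):
    return -0.5 * np.sum(x ** 2) - 0.5 * x.size * np.log(2.0 * np.pi)

def log_p_h(h):
    # Change of variables: log p(h) = log N(f^{-1}(h); 0, I) - log|det J_f|
    return log_std_normal(f_inv(h)) - log_det_J

xi = np.random.default_rng(1).standard_normal(3)
h = f(xi)  # a sample together with its exact density
print(h, log_p_h(h))
```

Here both operations cost O(d), which is the tractability requirement the next section builds on.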
Extensions of VAEs [RM15, KSW16] have shown that for non-sequential models, under certain conditions, an invertible function h = f(ξ) can shape moderately complex distributions over ξ into highly complex ones over h, while still providing the operations necessary for efficient VI. The authors show that a bound similar to Eq. (5) can be obtained by using the law of the unconscious statistician [RM15] and a density transformation to express the discrepancy between generative and inference model in terms of ξ instead of h:

  L = E_{q(ξ|w)}[ log P(w|f(ξ)) + log ( p(f(ξ)) / q(ξ|w) ) + log |det J_f(ξ)| ]    (6)

This allows the inference model to work with an implicit latent distribution at the price of computing the Jacobian determinant of f. Luckily, there are many choices such that this can be done in O(d) [RM15, DSB16].

2.3 Training through coupled transition functions

We propose to use two separate transition functions Fq and Fg for the inference and the generative model, respectively. Using results from flow-based VAEs, we derive an ELBO that reveals the intrinsic coupling of both and expresses the relation of the two as a part of the objective that is determined solely by the data. A shared transition model Fq = Fg constitutes a special case.

Two-Flow ELBO  For a transition function F as in Eq. (1), fix h = h* and define the restriction f(ξ) = F(h, ξ)|_{h=h*}. We require that for any h*, f is a diffeomorphism and thus has a differentiable inverse. In fact, as we work with (possibly) different Fg and Fq for generation and inference, we have restrictions fg and fq, respectively.
For better readability we will omit the conditioning variable h* in the sequel.
By combining the per-step decomposition in (5) with the flow-based ELBO from (6), we get (implicitly setting h* = h_{t-1}):

  L_t = E_{q(ξ|w)}[ log P(w_t|f_q(ξ_t)) + log ( p(f_q(ξ_t)|h_{t-1}) / q(ξ_t|h_{t-1}; w_{t:T}) ) + log |det J_{f_q}(ξ_t)| ].    (7)

As our generative model also uses a flow to transform ξ_t into a distribution on h_t, it is more natural to use the (simple) density in ξ-space. Performing another change of variables, this time on the density of the generative model, we get

  p(h_t|h_{t-1}) = p(ζ_t|h_{t-1}) · |det J_{f_g^{-1}}(f_q(ξ_t))| = r(ζ_t) / |det J_{f_g}(ζ_t)|,   ζ_t := (f_g^{-1} ∘ f_q)(ξ_t)    (8)

where r now is simply the (multivariate) standard normal density, as ξ_t does not depend on h_{t-1}, whereas h_t does. We have introduced the new noise variable ζ_t = s(ξ_t) to highlight the importance of the transformation s = f_g^{-1} ∘ f_q, which is a combined flow of the forward inference flow and the inverse generative flow. Essentially, it follows the suggested ξ-distribution of the inference model into the latent state space and back into the noise space of the generative model with its uninformative distribution. Putting this back into Eq. (7) and exploiting the fact that the Jacobians can be combined via det J_s = det J_{f_q} / det J_{f_g}, we finally get

  L_t = E_{q(ξ|w)}[ log P(w_t|f_q(ξ_t)) + log ( r(s(ξ_t)) / q(ξ_t|h_{t-1}; w_{t:T}) ) + log |det J_s(ξ_t)| ].    (9)

Interpretation  Naïvely employing the model-based ELBO approach, one has to learn two independently parametrized transition models p(h_t|h_{t-1}) and q(h_t|h_{t-1}, w_{t:T}), one informed about the future and one not.
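Before discussing this further, the combined flow s and its Jacobian identity can be sketched numerically. Both restrictions are replaced here by elementwise affine stand-ins at a fixed h* (hypothetical parameters, not the learned flows):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4

# Illustrative elementwise affine stand-ins for the restrictions f_q and
# f_g at a fixed h*; the learned flows in the paper are richer.
aq, bq = rng.uniform(0.5, 2.0, d), rng.normal(size=d)
ag, bg = rng.uniform(0.5, 2.0, d), rng.normal(size=d)

f_q = lambda xi: aq * xi + bq
f_g_inv = lambda h: (h - bg) / ag

def s(xi):
    # zeta = (f_g^{-1} o f_q)(xi): inference noise mapped into state space
    # and back into the generative model's noise space.
    return f_g_inv(f_q(xi))

# log |det J_s| = log |det J_{f_q}| - log |det J_{f_g}|, O(d) here.
log_det_J_s = np.sum(np.log(aq)) - np.sum(np.log(ag))

xi = rng.standard_normal(d)
print(s(xi), log_det_J_s)
```

The point of the sketch is only the composition: the Jacobian of s splits into the quotient of the two individual Jacobians, which is what couples the two transition functions in the objective.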
Matching the two then becomes an integral part of the objective. However, since the transition model encapsulates most of the model complexity, this introduces redundancy where the learning problem is most challenging. Nevertheless, the generative and inference models do address the transition problem from very different angles. Therefore, forcing both to use the exact same transition model might limit flexibility during training and result in an inferior generative model. Thus our model casts Fg and Fq as independently parametrized functions that are coupled through the objective by treating them as proper transformations of an underlying white noise process.²

Special cases  Additive Gaussian noise h_{t+1} = h_t + ξ_t can be seen as the simplest form of Fg or, alternatively, as a generative model without flow (as J_{f_g} = I). Of course, repeated addition of noise does not provide a meaningful latent trajectory. Finally, note that for Fg = Fq, s = id and the numerator in the second term becomes a simple prior probability r(ξ_t), whereas the determinant reduces to a constant. We now explore possible candidates for the flows in Fg and Fq.

2.4 Families of transition functions

Since the Jacobian of a composed function factorizes, a flow F is often composed of a chain of individual invertible functions F = F_k ∘ ··· ∘ F_1 [RM15]. We experiment with individual functions

  F(h_{t-1}, ξ_t) = g(h_{t-1}) + G(h_{t-1}) ξ_t    (10)

where g is a multilayer MLP ℝ^d → ℝ^d and G is a neural network ℝ^d → ℝ^d × ℝ^d mapping h_{t-1} to a lower-triangular d × d matrix with non-zero diagonal entries. Again, we use MLPs for this mapping and clip the diagonal away from [−δ, δ] for some hyperparameter 0 < δ < 0.5. The lower-triangular structure allows computing the determinant in O(d) and stable inversion of the mapping by substitution in O(d²).
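A minimal sketch of this conditional affine flow follows, with fixed affine maps standing in for the MLPs g and G (assumed stand-ins; only the triangular mechanics, the diagonal clipping, and the O(d) determinant are taken from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
d, delta = 4, 0.1  # state size; diagonal clipping threshold (0 < delta < 0.5)

# Fixed affine stand-ins for the MLPs g and G of Eq. (10); in the model
# both are learned networks conditioned on h_{t-1}.
Wg = rng.normal(size=(d, d)) / np.sqrt(d)
WG = rng.normal(size=(d * d, d)) / np.sqrt(d)

def g(h):
    return np.tanh(Wg @ h)

def G(h):
    # Lower-triangular matrix, diagonal clipped away from [-delta, delta].
    M = np.tril((WG @ h).reshape(d, d))
    diag = M.diagonal().copy()
    small = np.abs(diag) < delta
    diag[small] = np.where(diag[small] >= 0, delta, -delta)
    np.fill_diagonal(M, diag)
    return M

def F(h, xi):                 # Eq. (10): F(h, xi) = g(h) + G(h) xi
    return g(h) + G(h) @ xi

def log_abs_det(h):           # O(d): product of the diagonal entries
    return np.sum(np.log(np.abs(np.diag(G(h)))))

def F_inv(h, h_next):         # O(d^2) forward substitution
    M, r = G(h), h_next - g(h)
    xi = np.zeros(d)
    for i in range(d):
        xi[i] = (r[i] - M[i, :i] @ xi[:i]) / M[i, i]
    return xi
```

Because G(h) is triangular with a clipped diagonal, the restriction ξ ↦ F(h, ξ) stays invertible for every h, which is exactly the diffeomorphism requirement of Section 2.3.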
As a special case, we also consider G restricted to diagonal matrices. Finally, we experiment with a conditional variant of the Real NVP flow [DSB16]. Computing F_g^{-1} is central to our objective, and we found that, depending on the flow, parametrizing the inverse directly results in more stable and efficient training.

²Note that identifying s as an invertible function allows us to perform a backwards density transformation which cancels the regularizing terms. This is akin to any flow objective (e.g. see equation (15) in [RM15]) where applying the transformation additionally to the prior cancels out the Jacobian term. We can think of s as a stochastic bottleneck with the observation model P(w_t|h_t) attached to the middle layer. Removing the middle layer collapses the bottleneck and prohibits learning compression.

2.5 Inference network

So far we have only motivated the factorization of the inference network q(h|w) = ∏ q(h_t|h_{t-1}, w_{t:T}) but treated it as a black box otherwise. Remember that sampling from the inference network amounts to sampling ξ_t ~ q(·|h_{t-1}, w_{t:T}) and then performing the deterministic transition F_q(h_{t-1}, ξ_t). We observe much better training stability when conditioning q on the data w_{t:T} only and modeling interaction with h_{t-1} exclusively through F_q. This coincides with our intuition that the two inputs to a transition function provide semantically orthogonal contributions. We follow existing work [DSB16] and choose q as the density of a normal distribution with diagonal covariance matrix. We follow the idea of [FSPW16] and incorporate the variable-length sequence w_{t:T} by conditioning on the state of an RNN running backwards in time across w_{1:T}. We embed the symbols w_{1:T} in a vector space ℝ^{d_E} and use a GRU cell to produce a sequence of hidden states a_T, ..., a_1, where a_t has digested tokens w_{t:T}.
Together, h_{t-1} and a_t parametrize the mean and covariance matrix of q.

2.6 Optimization

Except in very specific and simple cases, for instance a Kalman filter, it will not be possible to efficiently compute the q-expectations in Eq. (5) exactly. Instead, we sample from q in every time-step, as is common practice for sequential ELBOs [FSnPW16, GSC+17]. The reparametrization trick allows pushing all necessary gradients through these expectations to optimize the bound via stochastic gradient-based optimization techniques such as Adam [KB14].

2.7 Extension: Importance-weighted ELBO for tracking the generative model

Conceptually, there are two ways we can imagine an inference network proposing ξ_{1:T} sequences for a given sentence w_{1:T}: either, as described above, by digesting w_{1:T} right-to-left and proposing ξ_{1:T} left-to-right; or by iteratively proposing ξ_t taking into account the last proposed state h_{t-1} and the deterministic generative mechanism F_g. The latter allows the inference network to peek at states h_t that F_g could generate from h_{t-1} before proposing an actual target h_t. This allows the inference model to track a multi-modal F_g without the need for F_q to match its expressiveness. As a consequence, this might offer the possibility to learn multi-modal generative models without the need to employ complex multi-modal distributions in the inference model.
Our extension is built on importance-weighted auto-encoders (IWAE) [BGS15]. The IWAE ELBO is derived by writing the log marginal as a Monte Carlo estimate before using Jensen's inequality.
The result is an ELBO and corresponding gradients of the form³

  L = E_{h^{(k)}}[ log (1/K) Σ_{k=1}^{K} ω^{(k)} ],   ω^{(k)} := p(w, h^{(k)}) / q(h^{(k)}|w),
  ∇L = E_{h^{(k)}}[ Σ_{k=1}^{K} ( ω^{(k)} / Σ_{k'} ω^{(k')} ) ∇ log ω^{(k)} ],   h^{(k)} ~ q(·|w)    (11)

The authors motivate (11) as a weighting mechanism relieving the inference model from explaining the data well with every sample. We will use the symmetry of this argument to let the inference model condition on potential next states h_{g,t} = F_g(h_{t-1}, ξ_t), ξ_t ~ N(0, I), from the generative model without requiring every h_{g,t} to allow q to make a good proposal. In other words, the K sampled outputs of F_g become a vectorized representation of F_g to condition on. In our sequential model, computing ω^{(k)} exactly is intractable, as it would require rolling out the network until time T. Instead, we limit the horizon to only one time-step. Although this biases the estimate of the weights and consequently the ELBO, longer horizons did not empirically show benefits. When proceeding to time-step t+1, we choose the new hidden state by sampling h^{(k)} with probability proportional to ω^{(k)}. Algorithm 1 summarizes the steps carried out at time t for a given h_{t-1} (to not overload notation, we drop t in h_{g,t}), and a more detailed derivation of the bound is given in Appendix A.

³Here we have tacitly assumed that h can be rewritten using the reparametrization trick so that the expectation can be expressed with respect to some parameter-free base-distribution.
See [BGS15] for a detailed derivation of the gradients in (11).

Algorithm 1  Detailed forward pass with importance weighting

  Simulate F_g:  h_g^{(k)} = F_g(h_{t-1}, ξ^{(k)}), where ξ^{(k)} ~ N(0, I), k = 1, ..., K
  Instantiate the inference family:  q_k(h) = q(h | h_g^{(k)}, h_{t-1}, w_{t:T})
  Sample inference:  h^{(k)} ~ q_k
  Compute gradients as in (11), where ω^{(k)} = P(w_t|h^{(k)}) p(h^{(k)}|h_{t-1}) / q_k(h^{(k)})
  Sample h^{(k)} according to ω^{(1)}, ..., ω^{(K)} for the next step.

3 Related Work

Our work intersects with work directly addressing teacher forcing, mostly on language modelling and translation (where models are mostly not state space models), and with stochastic state space models (which are typically autoregressive and do not address teacher forcing).
Early work on addressing teacher forcing focused on mitigating its biases by adapting the RNN training procedure to partly rely on the model's predictions during training [BVJS15, RCAZ15b]. Recently, the problem has been addressed for conditional generation within an adversarial framework [GLZ+16] and in various learning-to-search frameworks [WR16, LAOL17]. However, by design these models do not perform stochastic state transitions.
There have been proposals for hybrid architectures that augment the deterministic RNN state sequences by chains of random variables [CKD+15, FSPW16]. However, these approaches largely patch up the output feedback mechanism to allow for better modeling of local correlations, leaving the deterministic skeleton of the RNN state sequence untouched. A recent evolution of deep stochastic sequence models has developed models of ever increasing complexity, including intertwined stochastic and deterministic state sequences [CKD+15, FSPW16], additional auxiliary latent variables [GSC+17], auxiliary losses [SATB17], and annealing schedules [BVV+15].
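Returning to Section 2.7, one importance-weighted step of Algorithm 1 can be sketched as follows. All components here are toy stand-ins (a tanh transition, a Gaussian transition density and proposal, a random softmax observation model) rather than the learned flows; only the five steps of the algorithm are taken from the text:

```python
import numpy as np

rng = np.random.default_rng(4)
d, V, K = 4, 6, 5

A = rng.normal(size=(d, d)) * 0.3  # toy deterministic transition weights
E = rng.normal(size=(V, d))        # toy observation (softmax) weights

def F_g(h, xi):
    return np.tanh(A @ (h + xi))

def log_normal(x, mu, sig):
    return np.sum(-0.5 * ((x - mu) / sig) ** 2 - np.log(sig)
                  - 0.5 * np.log(2.0 * np.pi))

def log_softmax_p(w, h):
    z = E @ h
    return z[w] - np.logaddexp.reduce(z)

def iw_step(h_prev, w_t):
    # (1) Simulate F_g on K fresh white-noise samples.
    h_g = [F_g(h_prev, rng.standard_normal(d)) for _ in range(K)]
    # (2) Instantiate one proposal q_k per simulated state (toy Gaussian
    #     around h_g^(k); the real q also digests w_{t:T}).
    sig = 0.5
    # (3) Sample one candidate h^(k) from each proposal.
    h_s = [hg + sig * rng.standard_normal(d) for hg in h_g]
    # (4) Weights omega^(k) = P(w_t|h) p(h|h_prev) / q_k(h), log domain;
    #     p(h|h_prev) is a toy Gaussian around the noise-free transition.
    mu_p = F_g(h_prev, np.zeros(d))
    log_w = np.array([
        log_softmax_p(w_t, h) + log_normal(h, mu_p, 1.0)
        - log_normal(h, hg, sig)
        for h, hg in zip(h_s, h_g)])
    w = np.exp(log_w - np.logaddexp.reduce(log_w))  # normalized weights
    # (5) Resample the next state proportional to the weights.
    return h_s[rng.choice(K, p=w)], w

h_next, weights = iw_step(np.zeros(d), 2)
print(weights)
```

The resampling in the last step is what lets the model proceed with a single state while the K proposals act as a vectorized view of F_g.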
At the same time, it often remains unclear how the stochasticity in these models can be interpreted and measured.
Closest in spirit to our transition functions is the work of Karl et al. [KSBvdS17] on generation with external control inputs. In contrast to us, they use a simple mixture of linear transition functions and work around using density transformations, akin to [BO14]. In our unconditional regime we found that relating the stochasticity in ξ explicitly to the stochasticity in h is key to successful training. Finally, variational conditioning mechanisms similar in spirit to ours have seen great success in image generation [GDGW15].
Among generative unconditional sequential models, GANs are as of today the most prominent architecture [YZWY16, JKMHL16, FGD18, CLZ+17]. To the best of our knowledge, our model is the first non-autoregressive model for sequence generation in a maximum likelihood framework.

4 Evaluation

Naturally, the quality of a generative model must be measured in terms of the quality of its outputs. However, we also put special emphasis on investigating whether the stochasticity inherent in our model operates as advertised.

4.1 Data Inspection

Evaluating generative models of text is a field of ongoing research, and currently used methods range from simple data-space statistics to expensive human evaluation [FGD18]. We argue that for morphology, and in particular for non-autoregressive models, there is an interesting middle ground: compared to the space of all sentences, the space of all words still has moderate cardinality, which allows us to estimate the data distribution by unigram word frequencies.
As a consequence, we can reliably approximate the cross-entropy, which naturally generalizes data-space metrics to probabilistic models and addresses both over-generalization (assigning non-zero probability to non-existing words) and over-confidence (distributing high probability mass only among a few words).
This metric can be computed for all models which operate by first stochastically generating a sequence of hidden states and then defining a distribution over the data-space given the state sequence. For our model we approximate the marginal by a Monte Carlo estimate of (2):

  P(w) = ∫ P(w|h) p(h) dh ≈ (1/K) Σ_{k=1}^{K} P(w|h^{(k)}),   h^{(k)} ~ p(h)    (12)

Note that sampling from p(h) boils down to sampling ξ_{1:T} from independent standard normals and then applying F_g. In particular, the non-autoregressive property of our model allows us to estimate all words in some set S using K samples each, by using only K independent trajectories h overall.
Finally, we include two data-space metrics as an intuitive, yet less accurate measure. From a collection of generated words, we estimate (i) the fraction of words that are in the training vocabulary (w ∈ V) and (ii) the fraction of unique words that are in the training vocabulary (w ∈ V unique).⁴

4.2 Entropy Inspection

We want to go beyond the usual evaluation of existing work on stochastic sequence models and also assess the quality of our noise model. In particular, we are interested in how much of the information contained in a state h_t about the output P(w_t|h_t) is due to the corresponding noise vector ξ_t. This is quantified by the mutual information between the noise ξ_t and the observation w_t, given the noise ξ_{1:t-1} that defined the prefix up to time t.
Since h_{t-1} is a deterministic function of ξ_{1:t-1}, we write

  I(t) = I(w_t; ξ_t | h_{t-1}) = E_{h_{t-1}}[ H[w_t|h_{t-1}] − H[w_t|ξ_t, h_{t-1}] ] ≥ 0    (13)

to quantify the dependence between noise and observation at one time-step. For a model ignoring the noise variables, knowledge of ξ_t does not reduce the uncertainty about w_t, so that I(t) = 0. We can use Monte Carlo estimates for all expectations in (13).

5 Experiments

5.1 Dataset and baseline

For our experiments, we use the BooksCorpus [KZS+15, ZKZ+15], a freely available collection of novels comprising almost 1B tokens, of which 1.3M are unique. To filter out artefacts and some very uncommon words found in fiction, we restrict the vocabulary to words of length 2 ≤ l ≤ 12 with at least 10 occurrences that contain only letters, resulting in a 143K-word vocabulary. Besides the standard 10% test-train split at the word level, we also perform a second, alternative split at the vocabulary level. That means 10 percent of the words, chosen regardless of their frequency, are unique to the test set. This is motivated by the fact that even a small test set under the former regime will result in only very few, very unlikely words unique to the test set. However, generalization to unseen words is the essence of morphology. As an additional metric for measuring generalization in this scenario, we evaluate the generated output under Witten-Bell-discounted character n-gram models trained on either the whole corpus or the test data only.
Our baseline is a GRU cell and the standard RNN training procedure with teacher forcing⁵. Hidden state size and embedding size are identical to our model's.

5.2 Model parametrization

We stick to a standard softmax observation model and instead focus the model design on different combinations of flows for F_g and F_q.
We investigate the flow in Equation (10), denoted as TRIL, its diagonal version DIAG, and a simple identity ID. We denote repeated application of (independently parametrized) flows as in 2×TRIL. For the weighted version we use K ∈ {2, 5, 10} samples. In addition, for F_g we experiment with a sequence of Real NVPs with masking dimensions d = 2...7 (two internal hidden layers of size 8 each). Furthermore, we investigate deviating from the factorization (3) by using a bidirectional RNN conditioning on all of w_{1:T} in every timestep. Finally, for the best performing configuration, we also investigate state sizes d ∈ {16, 32}.

⁴Note that for both data-space metrics there is a trivial generation system that achieves a 'perfect' score. Hence, both must be taken into account at the same time to judge performance.
⁵It should be noted that despite the greatly reduced vocabulary in character-level generation, RNN training without teacher forcing still fails miserably on our data.

5.3 Results

Table 1 shows the results for the standard split. By ± we indicate mean and standard deviation across 5 (or 10 for IWAE) identical runs⁶. The data-space metrics require manually trading off precision and coverage. We observe that two layers of the TRIL flow improve performance. Furthermore, importance weighting significantly improves the results across all metrics, with diminishing returns at K = 10. Its effectiveness is also confirmed by an increase in variance across the weights ω_1, ..., ω_T during training, which can be attributed to the significance of the noise model (see 5.4 for more details). We found training with REAL-NVP to be very unstable.
We attribute the relatively poor performance of NVP to the sequential VI setting, which deviates heavily from what it was designed for, and leave adaptations for future work.

Model | H[P_train, P̂] | H[P_test, P̂] | w ∈ V unique | w ∈ V | Ī
TRIL | 12.13±.11 | 11.99±.11 | 0.18±.00 | 0.43±.03 | 0.95±.04
TRIL, K=2 | 11.76±.12 | 11.82±.12 | 0.16±.01 | 0.46±.02 | 1.06±.16
TRIL, K=5 | 11.46±.05 | 11.51±.05 | 0.16±.01 | 0.48±.02 | 1.08±.13
TRIL, K=10 | 11.43±.05 | 11.47±.05 | 0.16±.01 | 0.49±.02 | 1.12±.12
2×TRIL | 11.91±.08 | 11.86±.13 | 0.17±.01 | 0.45±.02 | 0.89±.07
2×TRIL, K=2 | 11.55±.09 | 11.61±.09 | 0.16±.00 | 0.47±.01 | 1.00±.13
2×TRIL, K=5 | 11.42±.07 | 11.46±.06 | 0.16±.00 | 0.49±.01 | 1.20±.12
2×TRIL, K=10 | 11.33±.05 | 11.38±.06 | 0.16±.00 | 0.49±.01 | 1.28±.13
2×TRIL, K=10, BIDI | 11.33±.09 | 11.39±.10 | 0.16±.01 | 0.48±.00 | 1.25±.16
d=16, 2×TRIL, K=10 | 11.21 | 11.43 | 0.15 | 0.48 | 1.43
d=32, 2×TRIL, K=10 | 11.27 | 11.13 | 0.15 | 0.50 | 1.31
REAL-NVP-[2,3,4,5,6,7] | 11.77 | 11.81 | 0.12 | 0.53 | 0.94
BASELINE-8D | 12.92 | 12.97 | 0.13 | 0.53 | –
BASELINE-16D | 12.55 | 12.60 | 0.14 | 0.62 | –
ORACLE-TRAIN | 7.0 | 7.02⁷ | 0.27 | 1.0 | –

Table 1: Results on generation. The cross entropy is computed w.r.t.
both training and test set. ORACLE-TRAIN is a model sampling from the training data.

Interestingly, our standard inference model is on par with the equivalently parametrized bidirectional inference model, suggesting that historic information can be sufficiently stored in the states and confirming d-separation as the right principle for inference design.

The poor cross entropy achieved by the baseline can partly be explained by the fact that autoregressive RNNs are trained on conditional next-word predictions. Estimating the real data-space distribution would require aggregating over all possible sequences w ∈ V^T. However, the data-space metrics clearly show that the performance cannot solely be attributed to this.

Table 2 shows that generalization for the alternative split is indeed harder, but cross-entropy results carry over from the standard setting. Here we sample trajectories and extract the argmax from the observation model, which resembles more closely the procedure of the baseline. Under n-gram perplexity both models are on par, with a slight advantage of the baseline on longer n-grams and slightly better generalization of our proposed model.

                                            | n-gram from train+test   | n-gram from test
Model         H[Ptrain, P̂]  H[Ptest, P̂]    |  P2    P3    P4    P5    |  P2    P3    P4    P5
2×TRIL, K=10  11.56         12.27           | 10.4  12.8  20.9  30.7   | 13.1  21.9  49.6  81.1
BASELINE-8D   12.90         13.67           | 11.4  12.1  17.5  24.8   | 14.5  22.7  48.3  80.5
ORACLE-TRAIN  –             –               | 10.1   6.7   4.8   4.1   | 13.2  15.7  21.4  26.4
ORACLE-TEST   –             –               |  9.5   6.0   4.5   3.9   |  7.9   4.1   2.9   2.6

Table 2: Results for the alternative data split: Cross entropy and perplexity under n = 2, 3, 4, 5-gram language models estimated on either the full corpus or the test set only.

To give more insight into how the transition functions influence the results, Table 3a presents an exhaustive overview for all combinations of our simple flows.
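The n-gram perplexity evaluation behind Table 2 can be sketched as follows: estimate a character n-gram language model on a word list, then score generated words under it. The padding symbols, the probability floor for unseen n-grams and all function names are illustrative assumptions, not the exact protocol used in the experiments.

```python
import math
from collections import Counter

def ngrams(word, n):
    """Character n-grams of a word with begin/end padding."""
    padded = "^" * (n - 1) + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train_ngram_lm(corpus, n, floor=1e-6):
    """Maximum-likelihood n-gram model with a floor for unseen events."""
    full, ctx = Counter(), Counter()
    for w in corpus:
        for g in ngrams(w, n):
            full[g] += 1
            ctx[g[:-1]] += 1
    def prob(g):
        c = ctx[g[:-1]]
        return max(full[g] / c, floor) if c else floor
    return prob

def perplexity(prob, words, n):
    """exp of the average negative log-probability per n-gram."""
    log_p, count = 0.0, 0
    for w in words:
        for g in ngrams(w, n):
            log_p += math.log(prob(g))
            count += 1
    return math.exp(-log_p / count)
```

Samples that reuse the corpus' character patterns score low; out-of-distribution samples are pushed toward 1/floor, which is the effect visible in the long-n-gram columns of Table 2.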
We observe that a powerful generative flow is essential for successful models, while the inference flow can remain relatively simple – yet simplistic choices, such as ID, degrade performance. Choosing Fg slightly more powerful than Fq emerges as a successful pattern.

⁶ Single best model with d = 8: 2×TRIL, K = 10 achieved H[Ptrain, P̂] = 11.26 and H[Ptest, P̂] = 11.28.

⁷ Note that the training-set oracle is not optimal for the test set. The entropy of the test set is 6.80.

Flow Fg \ Flow Fq   ID         DIAG       TRIL       2×TRIL
ID                  14.23±.00  14.23±.00  14.23±.00  –
DIAG                12.82±.37  12.35±.37  12.20±.25  –
TRIL                –          13.55±.01  11.99±.11  –
2×TRIL              –          –          11.86±.13  –

(a) Test cross entropy H[Ptest, P̂]

Flow Fg \ Flow Fq   ID         DIAG       TRIL       2×TRIL
ID                  0±.00      0±.00      0±.00      –
DIAG                0.93±.15   0.85±.16   0.92±.13   –
TRIL                –          0.65±.01   0.95±.04   –
2×TRIL              –          –          0.89±.07   –

(b) Average mutual information Ī

Table 3: Results for different combinations of flows driving generative and inference transitions. A bar indicates combinations that did not allow for stable training. We also report ID for Fg for completeness but note that it is by design unsuited for this contextual setting.

5.4 Noise Model Analysis

We use K = 20 samples to approximate the entropy terms in (13). In addition we denote by Ī the average mutual information across all time-steps. Figure 3 shows how Ī along with the symbolic entropy H[wt|ht] changes during training. Remember that in a non-autoregressive model, the latter corresponds to information that cannot be recovered in later timesteps. Over the course of training, more and more information is driven by ξt and absorbed into the states ht, where it can be stored. Tables 1 and 3b show Ī for all trained models.
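A Monte-Carlo reading of this K-sample approximation is sketched below. The decomposition of the mutual information into an entropy gap, and the stand-in observation model, are our assumptions for illustration; they are not the trained model or the exact form of (13).

```python
import numpy as np

rng = np.random.default_rng(1)
d, V, K = 8, 50, 20  # state size, vocabulary size, K = 20 samples as above

def entropy(p):
    """Entropy (in nats) of a categorical distribution p."""
    return float(-np.sum(p * np.log(np.maximum(p, 1e-12))))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Stand-in observation model: symbol logits depend on state h and noise xi.
W = rng.standard_normal((V, d))
U = rng.standard_normal((V, d))

def obs_probs(h, xi):
    return softmax(W @ h + U @ xi)

def mutual_information(h, K=K):
    """K-sample estimate of I = H[E_xi p(.|h,xi)] - E_xi H[p(.|h,xi)]."""
    xis = rng.standard_normal((K, d))
    ps = np.stack([obs_probs(h, xi) for xi in xis])
    return entropy(ps.mean(axis=0)) - float(np.mean([entropy(p) for p in ps]))
```

By concavity of entropy the estimate is non-negative for any fixed sample set, and it collapses to zero when the observation distribution ignores ξ, mirroring the Fg = ID row of Table 3b.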
In addition, Figure 2 shows a box-plot of I(t) for each t = 1 . . . T for the configuration 2×TRIL, K=10. As initial tokens are more important to remember, it should not come as a surprise that I(t) is largest first and decreases over time, yet with increased variance.

Figure 2: Noise mutual information I(t) over sequence position t = 1 . . . T.

Figure 3: Entropy analysis over training time. For reference the dashed line indicates the overall word entropy of the trained baseline.

6 Conclusion

In this paper we have shown how a deep state space model can be defined and trained with the help of variational flows. The recurrent mechanism is driven purely by a simple white noise process and does not require an autoregressive conditioning on previously generated symbols. In addition, we have shown how an importance-weighted conditioning mechanism integrated into the objective allows shifting stochastic complexity from the inference to the generative model. The result is a highly flexible framework for sequence generation with an extremely simple overall architecture, a measurable notion of latent information and no need for pre-training, annealing or auxiliary losses. We believe that pushing the boundaries of non-autoregressive modeling is key to understanding stochastic text generation and can open the door to related fields such as particle filtering [NLRB17, MLT+17].

References

[BCB14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.

[BGS15] Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. CoRR, abs/1509.00519, 2015.

[BO14] Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.

[BVJS15] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. CoRR, abs/1506.03099, 2015.

[BVV+15] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating sentences from a continuous space. CoRR, abs/1511.06349, 2015.

[CKD+15] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In NIPS, pages 2980–2988, 2015.

[CLZ+17] Tong Che, Yanran Li, Ruixiang Zhang, R. Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. Maximum-likelihood augmented discrete generative adversarial networks. CoRR, abs/1702.07983, 2017.

[DSB16] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. CoRR, abs/1605.08803, 2016.

[FAC+15] Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. Sentence compression by deletion with LSTMs. In EMNLP 2015, pages 360–368, 2015.

[FGD18] William Fedus, Ian J. Goodfellow, and Andrew M. Dai. MaskGAN: Better text generation via filling in the ______. CoRR, abs/1801.07736, 2018.

[FSnPW16] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In NIPS, pages 2199–2207. Curran Associates, Inc., 2016.

[FSPW16] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In NIPS, pages 2199–2207, 2016.

[GDGW15] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. CoRR, abs/1502.04623, 2015.

[GJ14] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In ICML 2014, pages 1764–1772, 2014.

[GLZ+16] Anirudh Goyal, Alex Lamb, Ying Zhang, Saizheng Zhang, Aaron C. Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In NIPS 2016, pages 4601–4609, 2016.

[Gra13] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[GSC+17] Anirudh Goyal, Alessandro Sordoni, Marc-Alexandre Côté, Nan Rosemary Ke, and Yoshua Bengio. Z-forcing: Training stochastic recurrent networks. In NIPS 2017, pages 6716–6726, 2017.

[HYX+17] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Toward controlled generation of text. In International Conference on Machine Learning (ICML), 2017.

[JKMHL16] Matt J. Kusner and José Miguel Hernández-Lobato. GANs for sequences of discrete elements with the Gumbel-softmax distribution. arXiv preprint, November 2016.

[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[KSBvdS17] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432, 2017 (ICLR).

[KSW16] Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016.

[KZS+15] Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Skip-thought vectors. arXiv preprint arXiv:1506.06726, 2015.

[LAOL17] Rémi Leblond, Jean-Baptiste Alayrac, Anton Osokin, and Simon Lacoste-Julien. SEARNN: Training RNNs with global-local losses. CoRR, abs/1706.04499, 2017.

[MKB+10] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In INTERSPEECH 2010, 2010.

[MLT+17] Chris J. Maddison, Dieterich Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Whye Teh. Filtering variational objectives. CoRR, abs/1705.09279, 2017.

[NLRB17] Christian A. Naesseth, Scott W. Linderman, Rajesh Ranganath, and David M. Blei. Variational sequential Monte Carlo. arXiv preprint arXiv:1705.11140, 2017.

[RCAZ15a] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.

[RCAZ15b] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. CoRR, abs/1511.06732, 2015.

[RCW15] Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. CoRR, abs/1509.00685, 2015.

[RM15] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In ICML 2015, pages 1530–1538, 2015.

[SATB17] Samira Shabanian, Devansh Arpit, Adam Trischler, and Yoshua Bengio. Variational bi-LSTMs. arXiv preprint, November 2017.

[SSB+16] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, volume 16, pages 3776–3784, 2016.

[SVL14] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS 2014, pages 3104–3112. MIT Press, 2014.

[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.

[WR16] Sam Wiseman and Alexander M. Rush. Sequence-to-sequence learning as beam-search optimization. CoRR, abs/1606.02960, 2016.

[WZ89] Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.

[YZWY16] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. CoRR, abs/1609.05473, 2016.

[ZKZ+15] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724, 2015.