{"title": "Z-Forcing: Training Stochastic Recurrent Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6713, "page_last": 6723, "abstract": "Many efforts have been devoted to training generative latent variable models with autoregressive decoders, such as recurrent neural networks (RNNs). Stochastic recurrent models have been successful in capturing the variability observed in natural sequential data such as speech. We unify successful ideas from recently proposed architectures into a stochastic recurrent model: each step in the sequence is associated with a latent variable that is used to condition the recurrent dynamics for future steps. Training is performed with amortised variational inference, where the approximate posterior is augmented with an RNN that runs backward through the sequence. In addition to maximizing the variational lower bound, we ease the training of the latent variables by adding an auxiliary cost which forces them to reconstruct the state of the backward recurrent network. This provides the latent variables with a task-independent objective that enhances the performance of the overall model. We found this strategy to perform better than alternative approaches such as KL annealing. Despite being conceptually simple, our model achieves state-of-the-art results on standard speech benchmarks such as TIMIT and Blizzard, and competitive performance on sequential MNIST. 
Finally, we apply our model to language modeling on the IMDB dataset, where the auxiliary cost helps in learning interpretable latent variables.", "full_text": "Z-Forcing: Training Stochastic Recurrent Networks\n\nAnirudh Goyal\n\nMILA, Université de Montréal\n\nAlessandro Sordoni\n\nMicrosoft Maluuba\n\nMarc-Alexandre Côté\n\nMicrosoft Maluuba\n\nNan Rosemary Ke\n\nMILA, Polytechnique Montréal\n\nYoshua Bengio\n\nMILA, Université de Montréal\n\nAbstract\n\nMany efforts have been devoted to training generative latent variable models with autoregressive decoders, such as recurrent neural networks (RNNs). Stochastic recurrent models have been successful in capturing the variability observed in natural sequential data such as speech. We unify successful ideas from recently proposed architectures into a stochastic recurrent model: each step in the sequence is associated with a latent variable that is used to condition the recurrent dynamics for future steps. Training is performed with amortised variational inference, where the approximate posterior is augmented with an RNN that runs backward through the sequence. In addition to maximizing the variational lower bound, we ease the training of the latent variables by adding an auxiliary cost which forces them to reconstruct the state of the backward recurrent network. This provides the latent variables with a task-independent objective that enhances the performance of the overall model. We found this strategy to perform better than alternative approaches such as KL annealing. Despite being conceptually simple, our model achieves state-of-the-art results on standard speech benchmarks such as TIMIT and Blizzard, and competitive performance on sequential MNIST. 
Finally, we apply our model to language modeling on the IMDB dataset, where the auxiliary cost helps in learning interpretable latent variables.\n\n1 Introduction\n\nDue to their ability to capture long-term dependencies, autoregressive models such as recurrent neural networks (RNNs) have become generative models of choice for dealing with sequential data. By leveraging weight sharing across timesteps, they can model variable-length sequences within a fixed parameter space. RNN dynamics involve a hidden state that is updated at each timestep to summarize all the information seen previously in the sequence. Given the hidden state at the current timestep, the network predicts the desired output, which in many cases corresponds to the next input in the sequence. Due to the deterministic evolution of the hidden state, RNNs capture the entropy in the observed sequences by shaping conditional output distributions for each step, which are usually of simple parametric form, i.e. unimodal or mixtures of unimodals. This may be insufficient for highly structured natural sequences, where there is correlation between output variables at the same step, i.e. simultaneities (Boulanger-Lewandowski et al., 2012), and complex dependencies between variables at different timesteps, i.e. long-term dependencies. For these reasons, recent efforts resort to highly multi-modal output distributions by augmenting the RNN with stochastic latent variables trained by amortised variational inference, i.e. the variational auto-encoding (VAE) framework (Kingma and Welling, 2014; Fraccaro et al., 2016). 
The VAE framework allows efficient approximate inference by parametrizing the approximate posterior and generative model with neural networks trainable end-to-end by backpropagation.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fAnother motivation for including stochastic latent variables in autoregressive models is to infer, from the observed variables in the sequence (e.g. pixels or sound-waves), higher-level abstractions (e.g. objects or speakers). Disentangling the factors of variation in this way is appealing as it would increase high-level control during generation, ease semi-supervised and transfer learning, and enhance interpretability of the trained model (Kingma et al., 2014; Hu et al., 2017).\n\nStochastic recurrent models proposed in the literature vary in the way they use the stochastic variables to perform output prediction and in how they parametrize the posterior approximation for variational inference. In this paper, we propose a stochastic recurrent generative model that incorporates into a single framework successful techniques from earlier models. We associate a latent variable with each timestep in the generation process. Similar to Fraccaro et al. (2016), we use a (deterministic) RNN that runs backwards through the sequence to form our approximate posterior, allowing it to capture the future of the sequence. However, akin to Chung et al. (2015); Bayer and Osendorfer (2014), the latent variables are used to condition the recurrent dynamics for future steps, thus injecting high-level decisions about the upcoming elements of the output sequence. Our architectural choices are motivated by interpreting the latent variables as encoding a “plan” for the future of the sequence. The latent plan is injected into the recurrent dynamics in order to shape the distribution of future hidden states. 
We show that mixing a stochastic forward pass, a conditional prior and a backward recognition network helps build effective stochastic recurrent models.\n\nThe recent surge in generative models suggests that extracting meaningful latent representations is difficult when using a powerful autoregressive decoder, i.e. the latter captures well enough most of the entropy in the data distribution (Bowman et al., 2015; Kingma et al., 2016; Chen et al., 2017; Gulrajani et al., 2017). We show that by using an auxiliary, task-agnostic loss, we ease the training of the latent variables which, in turn, helps achieve higher performance on the tasks at hand. The latent variables in our model are forced to contain useful information by predicting the state of the backward encoder, i.e. by predicting the future information in the sequence.\n\nOur work provides the following contributions:\n\n• We unify several successful architectural choices into one generative stochastic model for sequences: backward posterior, conditional prior and latent variables that condition the hidden dynamics of the network. Our model achieves state-of-the-art results in speech modeling.\n\n• We propose a simple way of improving model performance by providing the latent variables with an auxiliary, task-agnostic objective. In the explored tasks, the auxiliary cost yielded better performance than other strategies such as KL annealing. Finally, we show that the auxiliary signal helps the model learn interpretable representations in a language modeling task.\n\n2 Background\n\nWe operate in the well-known VAE framework (Kingma and Welling, 2014; Burda et al., 2015; Rezende and Mohamed, 2015), a neural-network-based approach for training generative latent variable models. Let x be an observation of a random variable, taking values in X. We assume that the generation of x involves a latent variable z, taking values in Z, by means of a joint density pθ(x, z), parametrized by θ. 
Given a set of observed datapoints D = {x1, . . . , xn}, the goal of maximum likelihood estimation (MLE) is to estimate the parameters θ that maximize the marginal log-likelihood L(θ; D):\n\nθ∗ = arg maxθ L(θ; D) = ∑_{i=1}^{n} log ∫_z pθ(xi, z) dz. (1)\n\nOptimizing the marginal log-likelihood is usually intractable, due to the integration over the latent variables. A common approach is to maximize a variational lower bound on the marginal log-likelihood. The evidence lower bound (ELBO) is obtained by introducing an approximate posterior qφ(z|x), yielding:\n\nlog pθ(x) ≥ E_{qφ(z|x)}[log (pθ(x, z) / qφ(z|x))] = log pθ(x) − DKL(qφ(z|x) ‖ pθ(z|x)) = F(x; θ, φ), (2)\n\nwhere KL denotes the Kullback-Leibler divergence.\n\n2\n\n\f(a) STORN (b) VRNN (c) SRNN (d) Our model\n\nFigure 1: Computation graphs for generative models of sequences that use latent variables: STORN (Bayer and Osendorfer, 2014), VRNN (Chung et al., 2015), SRNN (Fraccaro et al., 2016) and our model. In this picture, we consider that the task of the generative model consists in predicting the next observation in the sequence, given the previous ones. Diamonds represent deterministic states; zt and xt are respectively the latent variable and the sequence input at step t. Dashed lines represent the computation that is part of the inference model. Double lines indicate the auxiliary predictions implied by the proposed auxiliary cost. Differently from VRNN and SRNN, in STORN and our model the latent variable zt participates in the prediction of the next step xt+1.\n\nThe ELBO is particularly appealing because the bound is tight when the approximate posterior matches the true posterior, i.e. it reduces to the 
The ELBO can also be rewritten as a minimum description length loss\nfunction (Honkela and Valpola, 2004):\n\n(cid:104)\n\n(cid:105)\n\n(cid:0)q\u03c6(z|x)(cid:107) p\u03b8(z)(cid:1),\n\n(3)\n\nF(x; \u03b8, \u03c6) = E\n\nq\u03c6(z|x)\n\nlog p\u03b8(x|z)\n\n\u2212 DKL\n\nif\n\n(cid:0)q\u03c6(z|x)(cid:107) p\u03b8(z)(cid:1) is zero then z is independent of x. Usually, the parameters of the gener-\n\ni.e.\n\nwhere the second term measures the degree of dependence between x and z,\nDKL\native model p\u03b8(x|z), the prior p\u03b8(z) and the inference model q\u03c6(z|x) are computed using neural\nnetworks. In this case, the ELBO can be maximized by gradient ascent on a Monte Carlo approx-\nimation of the expectation. For particularly simple parametric forms of q\u03c6(z|x), e.g. multivariate\ndiagonal Gaussian or, more generally, for reparamatrizable distributions (Kingma and Welling, 2014),\none can backpropagate through the sampling process z \u223c q\u03c6(z|x) by applying the reparametrization\ntrick, which simulates sampling from q\u03c6(z|x) by \ufb01rst sampling from a \ufb01xed distribution u, \u0001 \u223c u(\u0001),\nand then by applying deterministic transformation z = f\u03c6(x, \u0001). This makes the approach appealing\nin comparison to other approximate inference approaches.\nIn order to have a better generative model overall, many efforts have been put in augmenting\nthe capacity of the approximate posteriors (Rezende and Mohamed, 2015; Kingma et al., 2016;\nLouizos and Welling, 2017), the prior distribution (Chen et al., 2017; Serban et al., 2017a) and the\ndecoder (Gulrajani et al., 2017; Oord et al., 2016). By having more powerful decoders p\u03b8(x|z),\none could model more complex distributions over X . This idea has been explored while applying\nVAEs to sequences x = (x1, . . . 
, xT ), where the decoding distribution pθ(x|z) is modeled by an autoregressive model, pθ(x|z) = ∏_t pθ(xt|z, x1:t−1) (Bayer and Osendorfer, 2014; Chung et al., 2015; Fraccaro et al., 2016). In these models, z typically decomposes as a sequence of latent variables, z = (z1, . . . , zT), yielding pθ(x|z) = ∏_t pθ(xt|z1:t−1, x1:t−1). We operate in this setting and, in the following section, we present our choices for parametrizing the generative model, the prior and the inference model.\n\n3 Proposed Approach\n\nIn Figure 1, we report the dependencies in the inference and the generative parts of our model, compared to existing models. From a broad perspective, we use a backward recurrent network for the approximate posterior (akin to SRNN (Fraccaro et al., 2016)), we condition the recurrent state of the forward auto-regressive model with the stochastic variables and use a conditional prior (akin to VRNN (Chung et al., 2015), STORN (Bayer and Osendorfer, 2014)).\n\n3\n\n\fIn order to make better use of the latent variables, we use auxiliary costs (double arrows) to force the latent variables to encode information about the future. In the following, we describe each of these components.\n\n3.1 Generative Model\n\nDecoder Given a sequence of observations x = (x1, . . . , xT), and a desired set of labels or predictions y = (y1, . . . , yT), we assume that there exists a corresponding set of stochastic latent variables z = (z1, . . . , zT). In the following, without loss of generality, we suppose that the set of predictions corresponds to a shifted version of the input sequence, i.e. the model tries to predict the next observation given the previous ones, a common setting in language and speech modeling (Fraccaro et al., 2016; Chung et al., 2015). 
The generative model couples observations and latent variables by using an autoregressive model, i.e. by exploiting an LSTM architecture (Hochreiter and Schmidhuber, 1997), that runs through the sequence:\n\nht = f→(xt, ht−1, zt). (4)\n\nThe parameters of the conditional probability distribution on the next observation pθ(xt+1|x1:t, z1:t) are computed by a multi-layered feed-forward network that conditions on ht, f(o)(ht). In the case of continuous-valued observations, f(o) may output the μ, log σ parameters of a Gaussian distribution, or the categorical proportions in the case of one-hot predictions. Note that, even if f(o) is a simple unimodal distribution, the marginal distribution pθ(xt+1|x1:t) may be highly multimodal, due to the integration over the sequence of latent variables z. Note that f(o) does not condition on zt, i.e. zt is not directly used in the computation of the output conditional probabilities. We observed better performance by preventing the latent variables from directly producing the next output.\n\nPrior The parameters of the prior distribution pθ(zt|x1:t, z1:t−1) over each latent variable are obtained by using a non-linear transformation of the previous hidden state of the forward network. A common choice in the VAE framework is to use Gaussian latent variables. Therefore, f(p) produces the parameters of a diagonal multivariate Gaussian distribution:\n\npθ(zt|x1:t, z1:t−1) = N(zt; μt(p), σt(p)), where [μt(p), log σt(p)] = f(p)(ht−1). (5)\n\nThis type of conditional prior has proven to be useful in previous work (Chung et al., 2015).\n\n3.2 Inference Model\n\nThe inference model is responsible for approximating the true posterior over the latent variables p(z1, . . . , zT|x) in order to provide a tractable lower-bound on the log-likelihood. 
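The decoder recurrence of Eq. 4 and the conditional prior of Eq. 5 can be sketched in a few lines of numpy. This is a toy illustration, not the paper's implementation: the dimensions are made up, a single tanh recurrence stands in for the LSTM f→, and the names `prior_params` and `generative_step` are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the paper uses LSTMs with up to 2048 units).
x_dim, h_dim, z_dim = 3, 5, 2

# A single tanh recurrence stands in for the LSTM f of Eq. 4.
W_x = rng.normal(0, 0.1, (h_dim, x_dim))
W_h = rng.normal(0, 0.1, (h_dim, h_dim))
W_z = rng.normal(0, 0.1, (h_dim, z_dim))

# f^(p): maps h_{t-1} to the prior parameters [mu, log sigma] (Eq. 5).
W_p = rng.normal(0, 0.1, (2 * z_dim, h_dim))

def prior_params(h_prev):
    out = W_p @ h_prev
    return out[:z_dim], out[z_dim:]          # mu, log sigma

def generative_step(x_t, h_prev):
    # Sample z_t from the conditional prior by reparametrization.
    mu, log_sigma = prior_params(h_prev)
    z_t = mu + np.exp(log_sigma) * rng.normal(size=z_dim)
    # Eq. 4: the latent z_t conditions the recurrent update.
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev + W_z @ z_t)
    return h_t, z_t

h = np.zeros(h_dim)
for t in range(4):                            # unroll a short sequence
    x_t = rng.normal(size=x_dim)
    h, z = generative_step(x_t, h)
```

In the paper, an output network f(o) would then map each h_t to the parameters of the next-observation distribution; that step is omitted here.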
Our posterior approximation uses an LSTM processing the sequence x backwards:\n\nbt = f←(xt+1, bt+1). (6)\n\nEach state bt contains information about the future of the sequence and can be used to shape the approximate posterior for the latent zt. As the forward LSTM uses zt to condition future predictions, the latent variable can directly inform the recurrent dynamics about the future states, acting as a “plan” of the future in the sequence. This information is channeled into the posterior distribution by a feed-forward neural network f(q) taking as input both the previous forward state ht−1 and the backward state bt:\n\nqφ(zt|x) = N(zt; μt(q), σt(q)), where [μt(q), log σt(q)] = f(q)(ht−1, bt). (7)\n\nBy injecting stochasticity into the hidden state of the forward recurrent model, the true posterior distribution for a given variable zt depends on all the variables zt+1:T after zt through the dependence on ht+1:T. In order to formulate an efficient posterior approximation, we drop the dependence on zt+1:T. This comes at the cost of introducing intrinsic bias in the posterior approximation, e.g. we may exclude the true posterior from the space of functions modelled by our function approximator. This is in contrast with SRNN (Fraccaro et al., 2016), in which the posterior distribution factorizes in a tractable manner at the cost of not including the latent variables in the forward autoregressive dynamics, i.e. the latent variables don’t condition the hidden state, but only help in shaping a multi-modal distribution for the current prediction.\n\n4\n\n\f3.3 Auxiliary Cost\n\nIn various domains, such as text and images, it has been empirically observed that it is difficult to make use of latent variables when coupled with a strong autoregressive decoder (Bowman et al., 2015; Gulrajani et al., 2017; Chen et al., 2017). 
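The backward recognition network of Eqs. 6–7 admits a similarly small sketch. As before, this is only an illustration under assumptions: a tanh cell replaces the LSTM f←, the indexing of which input each backward state consumes is simplified (Eq. 6 feeds x_{t+1} into b_t; here b_t simply consumes x[t]), and all names and dimensions are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
x_dim, h_dim, z_dim, T = 3, 4, 2, 5

x = rng.normal(size=(T, x_dim))

# Backward recurrence (Eq. 6), with a tanh cell standing in for the LSTM.
U_x = rng.normal(0, 0.1, (h_dim, x_dim))
U_b = rng.normal(0, 0.1, (h_dim, h_dim))
# f^(q): posterior parameters from the concatenation [h_{t-1}; b_t] (Eq. 7).
W_q = rng.normal(0, 0.1, (2 * z_dim, 2 * h_dim))

b = np.zeros((T + 1, h_dim))                  # b[T] is the initial backward state
for t in range(T - 1, -1, -1):
    # b[t] summarizes the future of the sequence from step t onward.
    b[t] = np.tanh(U_x @ x[t] + U_b @ b[t + 1])

def posterior_params(h_prev, b_t):
    out = W_q @ np.concatenate([h_prev, b_t])
    return out[:z_dim], out[z_dim:]           # mu_q, log sigma_q
```

During training, each z_t would be sampled by reparametrization from these posterior parameters and fed into the forward recurrence of Eq. 4.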
The difficulty in learning meaningful latent variables, in many cases of interest, is related to the fact that the abstractions underlying the observed data may be encoded with a smaller number of bits than the observed variables. For example, there are multiple ways of picturing a particular “cat” (e.g. different poses, colors or lighting) without varying the more abstract properties of the concept “cat”. In these cases, the maximum-likelihood training objective may not be sensitive to how well abstractions are encoded, causing the latent variables to “shut off”, i.e. the local correlations at the pixel level may be too strong and bias the learning process towards finding parameter solutions for which the latent variables are unused. In these cases, the posterior approximation tends to provide too weak or noisy a signal, due to the variance induced by the stochastic gradient approximation. As a result, the decoder may learn to ignore z and instead rely solely on the autoregressive properties of x, causing x and z to be independent, i.e. the KL term in Eq. 2 vanishes.\n\nRecent solutions to this problem generally propose to reduce the capacity of the autoregressive decoder (Bowman et al., 2015; Bachman, 2016; Chen et al., 2017; Semeniuta et al., 2017). The constraints on the decoder capacity inherently bias the learning towards finding parameter solutions for which z and x are dependent. One of the shortcomings of this approach is that, in general, it may be hard to achieve the desired solutions by architecture search. Instead, we investigate whether it is useful to keep the expressiveness of the autoregressive decoder but force the latent variables to encode useful information by adding an auxiliary training signal for the latent variables alone. 
In practice, our results show that this auxiliary cost, albeit simple, helps achieve better performance on the objective of interest.\n\nSpecifically, we consider training an additional conditional generative model of the backward states b = {b1, . . . , bT} given the forward states, pξ(b|h) = ∫_z pξ(b, z|h) dz ≥ E_{qξ(z|b,h)}[log pξ(b|z) + log pξ(z|h) − log qξ(z|b, h)]. This additional model is also trained through amortized variational inference. However, we share its prior pξ(z|h) and approximate posterior qξ(z|b, h) with those of the “primary” model (b is a deterministic function of x per Eq. 6 and the approximate posterior is conditioned on b). In practice, we solely learn additional parameters ξ for the decoding model pξ(b|z) = ∏_t pξ(bt|zt). The auxiliary reconstruction model trains zt to contain relevant information about the future of the sequence contained in the hidden state of the backward network bt:\n\npξ(bt|zt) = N(μt(a), σt(a)), where [μt(a), log σt(a)] = f(a)(zt). (8)\n\nBy means of the auxiliary reconstruction cost, the approximate posterior and prior of the primary model are trained with an additional signal that may help with escaping local minima due to short-term reconstructions appearing in the lower bound, similarly to what has been recently noted in Karl et al. (2016).\n\n3.4 Learning\n\nThe training objective is a regularized version of the lower-bound on the data log-likelihood based on the variational free-energy, where the regularization is imposed by the auxiliary cost:\n\nL(x; θ, φ, ξ) = ∑_t ( E_{qφ(zt|x)}[log pθ(xt+1|x1:t, z1:t) + α log pξ(bt|zt)] − DKL(qφ(zt|x1:T) ‖ pθ(zt|x1:t, z1:t−1)) ). (9)\n\nWe learn the parameters of our model by 
backpropagation through time (Rumelhart et al., 1988) and we approximate the expectation with one sample from the posterior qφ(z|x) by using reparametrization. When optimizing Eq. 9, we disconnect the gradients of the auxiliary prediction from affecting the backward network, i.e. we don’t use the gradients ∇φ log pξ(bt|zt) to train the parameters φ of the approximate posterior: intuitively, the backward network should be agnostic about the auxiliary task assigned to the latent variables. This choice also performed better empirically. As the approximate posterior is trained only with the gradient flowing through the ELBO, the backward states b may receive a weak training signal early in training, which may hamper the usefulness of the auxiliary generative cost, i.e. all the backward states may be concentrated around the zero vector. Therefore,\n\n5\n\n\fwe additionally train the backward network to predict the output variables in reverse (see Figure 1):\n\nL(x; θ, φ, ξ) = ∑_t ( E_{qφ(zt|x)}[log pθ(xt+1|x1:t, z1:t) + α log pξ(bt|zt) + β log pξ(xt|bt)] − DKL(qφ(zt|x1:T) ‖ pθ(zt|x1:t, z1:t−1)) ). (10)\n\n3.5 Connection to previous models\n\nOur model is similar to several previous stochastic recurrent models: similarly to STORN (Bayer and Osendorfer, 2014) and VRNN (Chung et al., 2015), the latent variables are provided as input to the autoregressive decoder. Differently from STORN, we use the conditional prior parametrization proposed in Chung et al. (2015). However, the generation process in the VRNN differs from our approach. In VRNN, zt are directly used, along with ht−1, to produce the next output xt. 
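For concreteness, the per-timestep summand of the objective in Eq. 9 can be sketched with closed-form diagonal-Gaussian densities. This is a sketch under assumptions, not the paper's code: the stop-gradient on the backward network, the β term of Eq. 10 and all network details are omitted, the default α is merely illustrative, and `objective_term` is our name:

```python
import numpy as np

def gauss_logpdf(v, mu, log_sigma):
    # Log-density of a diagonal Gaussian N(mu, sigma), summed over dimensions.
    return float(np.sum(-0.5 * np.log(2 * np.pi) - log_sigma
                        - 0.5 * ((v - mu) / np.exp(log_sigma)) ** 2))

def kl_diag_gauss(mu_q, ls_q, mu_p, ls_p):
    # Closed-form KL(q || p) between diagonal Gaussians, summed over dimensions.
    var_q, var_p = np.exp(2 * ls_q), np.exp(2 * ls_p)
    return float(np.sum(ls_p - ls_q + (var_q + (mu_q - mu_p) ** 2) / (2 * var_p) - 0.5))

def objective_term(x_next, b_t, dec, aux, post, prior, alpha=0.0025):
    """One-sample estimate of the summand of Eq. 9 for a single timestep.

    dec, aux, post and prior are (mu, log_sigma) pairs for
    p(x_{t+1}|x_{1:t}, z_{1:t}), p(b_t|z_t), q(z_t|x) and p(z_t|x_{1:t}, z_{1:t-1});
    in the paper each pair is produced by a neural network.
    """
    rec = gauss_logpdf(x_next, *dec)           # reconstruction term
    aux_b = alpha * gauss_logpdf(b_t, *aux)    # auxiliary term on the backward state
    kl = kl_diag_gauss(*post, *prior)
    return rec + aux_b - kl
```

Summing `objective_term` over t and ascending its gradient (with the stop-gradient described above) would correspond to maximizing Eq. 9.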
We found that the model performed better if we relieved the latent variables from producing the next output. VRNN has a “myopic” posterior, in the sense that the latent variables are not informed about the whole future of the sequence. SRNN (Fraccaro et al., 2016) addresses this issue by running a posterior backward in the sequence, thus providing future context for the current prediction. However, the autoregressive decoder is not informed about the future of the sequence through the latent variables.\n\nSeveral efforts have been made in order to bias the learning process towards parameter solutions for which the latent variables are used (Bowman et al., 2015; Karl et al., 2016; Kingma et al., 2016; Chen et al., 2017; Zhao et al., 2017). Bowman et al. (2015) tackle the problem in a language modeling setting by dropping words from the input at random in order to weaken the autoregressive decoder and by annealing the KL divergence term during training. We achieve similar latent interpolations by using our auxiliary cost. Similarly, Chen et al. (2017) propose to restrict the receptive field of the pixel-level decoder for image generation tasks. Kingma et al. (2016) propose to reserve some free bits of KL divergence. In parallel to our work, the idea of using a task-agnostic loss for the latent variables alone has also been considered in Zhao et al. (2017). The authors force the latent variables to predict a bag-of-words representation of a dialog utterance. Instead, we work in a sequential setting, in which we have a latent variable for each timestep in the sequence.\n\n4 Experiments\n\nIn this section, we evaluate our proposed model on diverse modeling tasks (speech, images and text). We show that our model can achieve state-of-the-art results on two speech modeling datasets: the Blizzard (King and Karaiskos, 2013) and TIMIT raw audio datasets (also used in Chung et al. 
(2015)). Our approach also gives competitive results on sequential MNIST generation (Salakhutdinov and Murray, 2008). For text, we show that the auxiliary cost helps the latent variables capture information about the latent structure of language (e.g. sequence length, sentiment). In all experiments, we used the ADAM optimizer (Kingma and Ba, 2014).\n\n4.1 Speech Modeling and Sequential MNIST\n\nBlizzard and TIMIT We test our model on two speech modeling datasets. Blizzard consists of 300 hours of English, spoken by a single female speaker. TIMIT has been widely used in speech recognition and consists of 6300 English sentences read by 630 speakers. We train the model directly on raw sequences represented as a sequence of 200 real-valued amplitudes normalized using the global mean and standard deviation of the training set. We adopt the same train, validation and test split as in Chung et al. (2015). For Blizzard, we report the average log-likelihood for half-second sequences (Fraccaro et al., 2016), while for TIMIT we report the average log-likelihood for the sequences in the test set.\n\nIn this setting, our models use a fully factorized multivariate Gaussian distribution as the output distribution for each timestep. In order to keep our model comparable with the state-of-the-art, we keep the number of parameters comparable to those of SRNN (Fraccaro et al., 2016). Our forward/backward networks are LSTMs with 2048 recurrent units for Blizzard and 1024 recurrent units for TIMIT. The dimensionality of the Gaussian latent variables is 256. The prior f(p), inference f(q) and auxiliary f(a) networks have a single hidden layer, with 1024 units for Blizzard and 512 units for TIMIT, and use leaky rectified nonlinearities with leakiness 1/3 and clipped at ±3 (Fraccaro et al., 2016). 
For Blizzard, we use a learning rate of 0.0003 and a batch size of 128; for TIMIT, they are 0.001 and 32 respectively.\n\n6\n\n\fModel | Blizzard | TIMIT\nRNN-Gauss | 3539 | -1900\nRNN-GMM | 7413 | 26643\nVRNN-I-Gauss | ≥ 8933 | ≥ 28340\nVRNN-Gauss | ≥ 9223 | ≥ 28805\nVRNN-GMM | ≥ 9392 | ≥ 28982\nSRNN (smooth+resq) | ≥ 11991 | ≥ 60550\nOurs | ≥ 14435 | ≥ 68132\nOurs + kla | ≥ 14226 | ≥ 68903\nOurs + aux | ≥ 15430 | ≥ 69530\nOurs + kla, aux | ≥ 15024 | ≥ 70469\n\nModels | MNIST\nDBN 2hl (Germain et al., 2015) | ≈ 84.55\nNADE (Uria et al., 2016) | 88.33\nEoNADE-5 2hl (Raiko et al., 2014) | 84.68\nDLGM 8 (Salimans et al., 2014) | ≈ 85.51\nDARN 1hl (Gregor et al., 2015) | ≈ 84.13\nDRAW (Gregor et al., 2015) | ≤ 80.97\nPixelVAE (Gulrajani et al., 2016) | ≈ 79.02 †\nP-Forcing (3-layer) (Goyal et al., 2016) | 79.58 †\nPixelRNN (1-layer) (Oord et al., 2016) | 80.75\nPixelRNN (7-layer) (Oord et al., 2016) | 79.20 †\nMatNets (Bachman, 2016) | 78.50 †\nOurs (1 layer) | ≤ 80.60\nOurs + aux (1 layer) | ≤ 80.09\n\nTable 1: On the left, we report the average log-likelihood per sequence on the test sets for the Blizzard and TIMIT datasets. “kla” and “aux” denote respectively KL annealing and the use of the proposed auxiliary costs. On the right, we report the test set negative log-likelihood for sequential MNIST, where † denotes lower performance of our model with respect to the baselines. For MNIST, we observed that KL annealing hurts overall performance.\n\nPrevious work reliably anneals the KL term in the ELBO via a temperature weight during training (KL annealing) (Fraccaro et al., 2016; Chung et al., 2015). We report the results obtained by our model when training both with and without KL annealing. 
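The KL-annealing baseline multiplies the KL term by a temperature that grows linearly during training; a minimal sketch of such a schedule (the 0.2 start and 5e-5 per-update increment follow the annealing setup described for these experiments):

```python
def kl_weight(update, start=0.2, increment=0.00005):
    """Linear KL-annealing temperature: starts at `start` and grows by
    `increment` after each parameter update until it saturates at 1."""
    return min(1.0, start + increment * update)
```

During training, the KL term of the ELBO would be scaled by `kl_weight(update)` at each step; with these defaults the temperature reaches 1 after 16,000 updates.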
When KL annealing is used, the temperature was linearly annealed from 0.2 to 1 after each update with increments of 0.00005 (Fraccaro et al., 2016).\n\nWe show our results in Table 1 (left), along with results that were obtained by models of comparable size to SRNN. Similar to (Fraccaro et al., 2016; Chung et al., 2015), we report the conservative evidence lower bound on the log-likelihood. In Blizzard, the KL annealing strategy (Ours + kla) is effective in the first training iterations, but eventually converges to a slightly lower log-likelihood than the model trained without KL annealing (Ours). We explored different annealing strategies but didn’t observe any improvements in performance. Models trained with the proposed auxiliary cost outperform models trained with the KL annealing strategy in both datasets. In TIMIT, there appears to be a slightly synergistic effect between KL annealing and the auxiliary cost. Even if not explicitly reported in the table, similar performance gains were observed on the training sets.\n\nSequential MNIST The task consists of pixel-by-pixel generation of binarized MNIST digits. We use the standard binarized MNIST dataset used in Larochelle and Murray (2011). Both the forward and backward networks are LSTMs with one layer of 1024 hidden units. We use a learning rate of 0.001 and a batch size of 32. We report the results in Table 1 (right). In this setting, we observed that KL annealing hurt the performance of the model. Despite being architecturally flat, our model is competitive with respect to strong baselines, e.g. DRAW (Gregor et al., 2015), and is outperformed by deeper versions of autoregressive models with latent variables, i.e. 
PixelVAE (gated) (Gulrajani et al., 2016), and by deep autoregressive models such as PixelRNN (Oord et al., 2016) and MatNets (Bachman, 2016).\n\n4.2 Language modeling\n\nA well-known result in language modeling tasks is that the generative model tends to fit the observed data without storing information in the latent variables, i.e. the KL divergence term in the ELBO becomes zero (Bowman et al., 2015; Zhao et al., 2017; Serban et al., 2017b). We test our proposed stochastic recurrent model trained with the auxiliary cost on a medium-sized IMDB text corpus containing 350K movie reviews (Diao et al., 2014). Following the setting described in Hu et al. (2017), we keep only sentences with fewer than 16 words and fix the vocabulary size to 16K words. We split the dataset into train/valid/test sets following these ratios respectively: 85%, 5%, 10%. Special delimiter tokens were added at the beginning and end of each sentence, but we only learned to\n\n7\n\n\fFigure 2: Evolution of the KL divergence term (measured in nats) in the ELBO with and without the auxiliary cost during training for Blizzard (left) and TIMIT (right). We plot curves for the models that performed best after hyper-parameter (KL annealing and auxiliary cost weights) selection on the validation set. The auxiliary cost puts pressure on the latent variables, resulting in higher KL divergence. Models trained with the auxiliary cost (Ours + aux) exhibit a more stable evolution of the KL divergence. 
Models trained with auxiliary cost alone achieve better performance than using\nKL annealing alone (Ours + kla) and similar, or better performance for Blizzard, compared to both\nusing KL annealing and auxiliary cost (Ours + kla, aux).\n\nModel\n\n\u03b1, \u03b2\n\nOurs\nOurs + aux\nOurs + aux\n\n0\n\n0.0025\n0.005\n\nKL\n\n0.12\n3.03\n9.82\n\nValid\n\nTest\n\nELBO IWAE ELBO IWAE\n53.93\n53.11\n53.37\n55.71\n65.03\n58.83\n\n54.67\n56.57\n65.84\n\n52.40\n52.54\n58.13\n\nTable 2: IMDB language modeling results for models trained by maximizing the standard evidence\nlower-bound. We report word perplexity as evaluated by both the ELBO and the IWAE bound and\nKL divergence between approximate posterior and prior distribution, for different values of auxiliary\ncost hyperparameters \u03b1, \u03b2. The gap in perplexity between the ELBO and IWAE (evaluated with 25\nsamples) increases with greater KL divergence values.\n\ngenerate the end of sentence token. We use a single layered LSTM with 500 hidden recurrent units,\n\ufb01x the dimensionality of word embeddings to 300 and use 64 dimensional latent variables. All the\nf (\u00b7) networks are single-layered with 500 hidden units and leaky relu activations. We used a learning\nrate of 0.001 and a batch size of 32.\nResults are shown in Table 2. As expected, it is hard to obtain better perplexity than a baseline model\nwhen latent variables are used in language models. We found that using the IWAE (Importance\nWeighted Autoencoder) (Burda et al., 2015) bound gave great improvements in perplexity. This\nobservation highlights the fact that, in the text domain, the ELBO may be severely underestimating the\nlikelihood of the model: the approximate posterior may loosely match the true posterior and the IWAE\nbound can correct for this mismatch by tightening the posterior approximation, i.e. the IWAE bound\ncan be interpreted as the standard VAE lower bound with an implicit posterior distribution (Bachman\nand Precup, 2015). 
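To make the relation between the two bounds concrete, the following sketch (our own NumPy illustration, not the paper's implementation; the function names are ours) computes the ELBO and the K-sample IWAE bound from the same log importance weights log w = log p(x, z) − log q(z | x). By Jensen's inequality, the IWAE estimate is always at least as large as the ELBO, and hence yields a perplexity that is at most as large.

```python
import numpy as np

def log_mean_exp(log_w, axis=0):
    """Numerically stable log(mean(exp(log_w))) along an axis."""
    m = log_w.max(axis=axis, keepdims=True)
    return (m + np.log(np.mean(np.exp(log_w - m), axis=axis, keepdims=True))).squeeze(axis)

def elbo_and_iwae(log_w):
    """log_w[k, i] = log p(x_i, z_{i,k}) - log q(z_{i,k} | x_i) for K posterior samples.
    Returns per-example ELBO and K-sample IWAE log-likelihood bounds (in nats)."""
    elbo = log_w.mean(axis=0)            # E_q[log w]: the standard variational bound
    iwae = log_mean_exp(log_w, axis=0)   # log (1/K) sum_k w_k: tighter for K > 1
    return elbo, iwae
```

With K = 1 the two estimates coincide; a perplexity as in Table 2 would then be obtained, roughly, by exponentiating the negative bound normalized by the number of words.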
On the basis of this observation, we attempted training our models with the IWAE bound, but observed no noticeable improvement in validation perplexity.

We analyze whether the latent variables capture characteristics of language by interpolating in the latent space (Bowman et al., 2015). Given a sentence, we first infer the latent variables at each step by running the approximate posterior, and then concatenate them to form a contiguous latent encoding for the input sentence. We then perform linear interpolation in the latent space between the latent encodings of two sentences. At each step of the interpolation, the latent encoding is run through the decoder network to generate a sentence. We show the results in Table 3.

Interpolating from "this movie is so terrible . never watch ever" to "this movie is great . i want to watch it again !":

a    Argmax                                                Sampling
0.0  it 's a movie that does n't work !                    this film is more of a " classic "
0.1  it 's a movie that does n't work !                    i give it a 5 out of 10
0.2  it 's a movie that does n't work !                    i felt that the movie did n't have any
0.3  it 's a very powerful piece of film !                 i do n't know what the film was about
0.4  it 's a very powerful story about it !                the acting is good and the acting is very good
0.5  it 's a very powerful story about a movie about life  the acting is great and the acting is good too
0.6  it 's a very dark part of the film , eh ?             i give it a 7 out of 10 , kids
0.7  it 's a very dark movie with a great ending ! !       the acting is pretty good and the story is great
0.8  it 's a very dark movie with a great message here !   the best thing i 've seen before is in the film
0.9  it 's a very dark one , but a great one !             funny movie , with some great performances
1.0  it 's a very dark movie , but a great one !           but the acting is good and the story is really interesting

Interpolating from "(1 / 10) violence : yes ." to "there was a lot of fun in this movie !":

a    Argmax                                       Sampling
0.0  greetings again from the darkness .          greetings again from the darkness .
0.1  " oh , and no .                              " let 's screenplay it .
0.2  " oh , and it is .                           rating : **** out of 5 .
0.3  well ... i do n't know .                     i do n't know what the film was about
0.4  so far , it 's watchable .                   ( pg-13 ) violence , no .
0.5  so many of the characters are likable .      just give this movie a chance .
0.6  so many of the characters were likable .     so far , but not for children
0.7  so many of the characters have been there .  so many actors were excellent as well .
0.8  so many of them have fun with it .           there are a lot of things to describe .
0.9  so many of the characters go to the house !  so where 's the title about the movie ?
1.0  so many of the characters go to the house !  as much though it 's going to be funny !

Table 3: Results of linear interpolation in the latent space. The left column reports greedy argmax decoding, obtained by selecting, at each step of the decoding, the word with maximum probability under the model distribution, while the right column reports random samples from the model.
a is the interpolation parameter. In general, the latent variables seem to capture the length of the sentences.

5 Conclusion

In this paper, we proposed a recurrent stochastic generative model that builds upon recent architectures that use latent variables to condition the recurrent dynamics of the network. We augmented the inference network with a recurrent network that runs backward through the input sequence, and added a new auxiliary cost that forces the latent variables to reconstruct the state of that backward network, thus explicitly encoding a summary of future observations. The model achieves state-of-the-art results on standard speech benchmarks such as TIMIT and Blizzard. The proposed auxiliary cost, albeit simple, appears to promote the use of latent variables more effectively than similar strategies such as KL annealing. In future work, it would be interesting to use a multitask learning setting, e.g. sentiment analysis as in Hu et al. (2017). It would also be interesting to combine the proposed approach with more powerful autoregressive models, e.g. PixelRNN/PixelCNN (Oord et al., 2016).

Acknowledgments

The authors would like to thank Phil Bachman, Alex Lamb and Adam Trischler for the useful discussions. AG and YB would also like to thank NSERC, CIFAR, Google, Samsung, IBM and Canada Research Chairs for funding, and Compute Canada and NVIDIA for computing resources. The authors would also like to express their debt of gratitude towards those who contributed to Theano over the years (as it is no longer maintained), making it such a great tool.

References

Bachman, P. (2016). An architecture for deep, hierarchical generative models. In Advances in Neural Information Processing Systems, pages 4826–4834.

Bachman, P. and Precup, D. (2015). Training deep generative models: Variations on a theme.

Bayer, J. and Osendorfer, C. (2014). Learning stochastic recurrent networks.
arXiv preprint arXiv:1411.7610.

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392.

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. (2015). Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.

Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.

Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., and Abbeel, P. (2017). Variational lossy autoencoder. In Proc. of ICLR.

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. (2015). A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pages 2980–2988.

Diao, Q., Qiu, M., Wu, C.-Y., Smola, A. J., Jiang, J., and Wang, C. (2014). Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 193–202.

Fraccaro, M., Sønderby, S. K., Paquet, U., and Winther, O. (2016). Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pages 2199–2207.

Germain, M., Gregor, K., Murray, I., and Larochelle, H. (2015). MADE: Masked autoencoder for distribution estimation. In ICML, pages 881–889.

Goyal, A., Lamb, A., Zhang, Y., Zhang, S., Courville, A. C., and Bengio, Y. (2016). Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, pages 4601–4609.

Gregor, K., Danihelka, I., Graves, A., Rezende, D.
J., and Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.

Gulrajani, I., Kumar, K., Ahmed, F., Taiga, A. A., Visin, F., Vazquez, D., and Courville, A. (2016). PixelVAE: A latent variable model for natural images. arXiv preprint arXiv:1611.05013.

Gulrajani, I., Kumar, K., Ahmed, F., Taiga, A. A., Visin, F., Vazquez, D., and Courville, A. (2017). PixelVAE: A latent variable model for natural images. In Proc. of ICLR.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Honkela, A. and Valpola, H. (2004). Variational learning and bits-back coding: an information-theoretic view to Bayesian learning. IEEE Transactions on Neural Networks, 15(4):800–810.

Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., and Xing, E. P. (2017). Controllable text generation. arXiv preprint arXiv:1703.00955.

Karl, M., Soelch, M., Bayer, J., and van der Smagt, P. (2016). Deep variational Bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432.

King, S. and Karaiskos, V. (2013). The Blizzard Challenge 2013. The Ninth Annual Blizzard Challenge, 2013.

Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014). Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751.

Kingma, D. P. and Welling, M. (2014).
Stochastic Gradient VB and the Variational Auto-Encoder. In 2nd International Conference on Learning Representations (ICLR), pages 1–14.

Larochelle, H. and Murray, I. (2011). The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 29–37.

Louizos, C. and Welling, M. (2017). Multiplicative normalizing flows for variational Bayesian neural networks. arXiv preprint arXiv:1703.01961.

Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. (2016). Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759.

Raiko, T., Li, Y., Cho, K., and Bengio, Y. (2014). Iterative neural autoregressive distribution estimator NADE-k. In Advances in Neural Information Processing Systems, pages 325–333.

Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1.

Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pages 872–879. ACM.

Salimans, T., Kingma, D. P., and Welling, M. (2014). Markov chain Monte Carlo and variational inference: Bridging the gap. arXiv preprint arXiv:1410.6460.

Semeniuta, S., Severyn, A., and Barth, E. (2017). A hybrid convolutional variational autoencoder for text generation. arXiv preprint arXiv:1702.02390.

Serban, I. V., II, A. G. O., Pineau, J., and Courville, A. C. (2017a). Piecewise latent variables for neural variational text processing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 422–432.

Serban, I.
V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A. C., and Bengio, Y. (2017b). A hierarchical latent variable encoder-decoder model for generating dialogues. In Proc. of AAAI.

Uria, B., Côté, M.-A., Gregor, K., Murray, I., and Larochelle, H. (2016). Neural autoregressive distribution estimation. Journal of Machine Learning Research, 17(205):1–37.

Zhao, T., Zhao, R., and Eskenazi, M. (2017). Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960.