{"title": "Sequential Neural Models with Stochastic Layers", "book": "Advances in Neural Information Processing Systems", "page_first": 2199, "page_last": 2207, "abstract": "How can we efficiently propagate uncertainty in a latent state representation with recurrent neural networks? This paper introduces stochastic recurrent neural networks which glue a deterministic recurrent neural network and a state space model together to form a stochastic and sequential neural generative model. The clear separation of deterministic and stochastic layers allows a structured variational inference network to track the factorization of the model\u2019s posterior distribution. By retaining both the nonlinear recursive structure of a recurrent neural network and averaging over the uncertainty in a latent path, like a state space model, we improve the state of the art results on the Blizzard and TIMIT speech modeling data sets by a large margin, while achieving comparable performances to competing methods on polyphonic music modeling.", "full_text": "Sequential Neural Models with Stochastic Layers\n\nMarco Fraccaro\u2020\n\nS\u00f8ren Kaae S\u00f8nderby\u2021\n\nUlrich Paquet*\n\n\u2020 Technical University of Denmark\n\n\u2021 University of Copenhagen\n\n* Google DeepMind\n\nOle Winther\u2020\u2021\n\nAbstract\n\nHow can we ef\ufb01ciently propagate uncertainty in a latent state representation with\nrecurrent neural networks? This paper introduces stochastic recurrent neural\nnetworks which glue a deterministic recurrent neural network and a state space\nmodel together to form a stochastic and sequential neural generative model. 
The\nclear separation of deterministic and stochastic layers allows a structured variational\ninference network to track the factorization of the model\u2019s posterior distribution.\nBy retaining both the nonlinear recursive structure of a recurrent neural network\nand averaging over the uncertainty in a latent path, like a state space model, we\nimprove the state of the art results on the Blizzard and TIMIT speech modeling data\nsets by a large margin, while achieving comparable performances to competing\nmethods on polyphonic music modeling.\n\n1\n\nIntroduction\n\nRecurrent neural networks (RNNs) are able to represent long-term dependencies in sequential data,\nby adapting and propagating a deterministic hidden (or latent) state [5, 16]. There is recent evidence\nthat when complex sequences such as speech and music are modeled, the performances of RNNs can\nbe dramatically improved when uncertainty is included in their hidden states [3, 4, 7, 11, 12, 15]. In\nthis paper we add a new direction to the explorer\u2019s map of treating the hidden RNN states as uncertain\npaths, by including the world of state space models (SSMs) as an RNN layer. By cleanly delineating\na SSM layer, certain independence properties of variables arise, which are bene\ufb01cial for making\nef\ufb01cient posterior inferences. The result is a generative model for sequential data, with a matching\ninference network that has its roots in variational auto-encoders (VAEs).\nSSMs can be viewed as a probabilistic extension of RNNs, where the hidden states are assumed to\nbe random variables. Although SSMs have an illustrious history [24], their stochasticity has limited\ntheir widespread use in the deep learning community, as inference can only be exact for two relatively\nsimple classes of SSMs, namely hidden Markov models and linear Gaussian models, neither of\nwhich are well-suited to modeling long-term dependencies and complex probability distributions\nover high-dimensional sequences. 
On the other hand, modern RNNs rely on gated nonlinearities such as long short-term memory (LSTM) [16] cells or gated recurrent units (GRUs) [6], that let the deterministic hidden state of the RNN act as an internal memory for the model. This internal memory seems fundamental to capturing complex relationships in the data through a statistical model.
This paper introduces the stochastic recurrent neural network (SRNN) in Section 3. SRNNs combine the gated activation mechanism of RNNs with the stochastic states of SSMs, and are formed by stacking a RNN and a nonlinear SSM. The state transitions of the SSM are nonlinear and are parameterized by a neural network that also depends on the corresponding RNN hidden state. The SSM can therefore utilize long-term information captured by the RNN.
We use recent advances in variational inference to efficiently approximate the intractable posterior distribution over the latent states with an inference network [19, 23]. The form of our variational

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

[Figure 1 shows two graphical models: (a) an RNN with deterministic states d_t driven by inputs u_t, and (b) an SSM with stochastic states z_t.]
Figure 1: Graphical models to generate x_{1:T} with a recurrent neural network (RNN) and a state space model (SSM). Diamond-shaped units are used for deterministic states, while circles are used for stochastic ones. For sequence generation, like in a language model, one can use u_t = x_{t−1}.

approximation is inspired by the independence properties of the true posterior distribution over the latent states of the model, and allows us to improve inference by conveniently using the information coming from the whole sequence at each time step. The posterior distribution over the latent states of the SRNN is highly non-stationary while we are learning the parameters of the model.
To further\nimprove the variational approximation, we show that we can construct the inference network so that\nit only needs to learn how to compute the mean of the variational approximation at each time step\ngiven the mean of the predictive prior distribution.\nIn Section 4 we test the performances of SRNN on speech and polyphonic music modeling tasks.\nSRNN improves the state of the art results on the Blizzard and TIMIT speech data sets by a large\nmargin, and performs comparably to competing models on polyphonic music modeling. Finally,\nother models that extend RNNs by adding stochastic units will be reviewed and compared to SRNN\nin Section 5.\n\n2 Recurrent Neural Networks and State Space Models\n\nRecurrent neural networks and state space models are widely used to model temporal sequences\nof vectors x1:T = (x1, x2, . . . , xT ) that possibly depend on inputs u1:T = (u1, u2, . . . , uT ). Both\nmodels rest on the assumption that the sequence x1:t of observations up to time t can be summarized\nby a latent state dt or zt, which is deterministically determined (dt in a RNN) or treated as a random\nvariable which is averaged away (zt in a SSM). The difference in treatment of the latent state has\ntraditionally led to vastly different models: RNNs recursively compute dt = f (dt\u22121, ut) using a\nparameterized nonlinear function f, like a LSTM cell or a GRU. The RNN observation probabilities\np(xt|dt) are equally modeled with nonlinear functions. SSMs, like linear Gaussian or hidden Markov\nmodels, explicitly model uncertainty in the latent process through z1:T . Parameter inference in a\nSSM requires z1:T to be averaged out, and hence p(zt|zt\u22121, ut) and p(xt|zt) are often restricted\nto the exponential family of distributions to make many existing approximate inference algorithms\napplicable. On the other hand, averaging a function over the deterministic path d1:T in a RNN is a\ntrivial operation. 
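The contrast drawn above between the two state recursions can be sketched in a few lines of numpy. This is an illustrative toy, not the paper's implementation: a plain tanh update stands in for a gated LSTM/GRU cell, and a linear Gaussian transition stands in for a general SSM; all weight matrices here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(d_prev, u_t, W_d, W_u):
    # Deterministic RNN recursion: d_t = f(d_{t-1}, u_t).
    # tanh stands in for a gated cell (LSTM/GRU) purely for illustration.
    return np.tanh(W_d @ d_prev + W_u @ u_t)

def ssm_step(z_prev, u_t, A, B, noise_std, rng):
    # Stochastic SSM transition: z_t ~ p(z_t | z_{t-1}, u_t),
    # here a linear Gaussian transition for simplicity.
    return A @ z_prev + B @ u_t + noise_std * rng.standard_normal(z_prev.shape)

dim = 3
W_d, W_u = rng.standard_normal((dim, dim)), rng.standard_normal((dim, dim))
A, B = 0.9 * np.eye(dim), 0.1 * np.eye(dim)

d = np.zeros(dim)  # deterministic path: a single trajectory
z = np.zeros(dim)  # stochastic path: one sample among many
for t in range(5):
    u_t = rng.standard_normal(dim)
    d = rnn_step(d, u_t, W_d, W_u)
    z = ssm_step(z, u_t, A, B, 0.1, rng)

# Rerunning the RNN recursion with the same inputs reproduces d exactly;
# rerunning the SSM recursion yields a different latent path each time.
```

Averaging a quantity over d_{1:T} needs only this single forward pass, whereas averaging over z_{1:T} requires integrating over all such sampled paths.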
The striking similarity in factorization between these models is illustrated in Figures 1a and 1b.
Can we combine the best of both worlds, and make the stochastic state transitions of SSMs nonlinear whilst keeping the gated activation mechanism of RNNs? Below, we show that a more expressive model can be created by stacking a SSM on top of a RNN, and that by keeping them layered, the functional form of the true posterior distribution over z_{1:T} guides the design of a backward-recursive structured variational approximation.

3 Stochastic Recurrent Neural Networks

We define a SRNN as a generative model p_θ by temporally interlocking a SSM with a RNN, as illustrated in Figure 2a. The joint probability of a single sequence and its latent states, assuming knowledge of the starting states z_0 = 0 and d_0 = 0, and inputs u_{1:T}, factorizes as

[Figure 2 shows (a) the generative model p_θ, in which deterministic states d_t feed stochastic states z_t and outputs x_t, and (b) the inference network q_φ with a backward-recurrent state a_t.]
Figure 2: A SRNN as a generative model p_θ for a sequence x_{1:T}. Posterior inference of z_{1:T} and d_{1:T} is done through an inference network q_φ, which uses a backward-recurrent state a_t to approximate the nonlinear dependence of z_t on future observations x_{t:T} and states d_{t:T}; see Equation (7).

p_θ(x_{1:T}, z_{1:T}, d_{1:T} | u_{1:T}, z_0, d_0) = p_{θx}(x_{1:T} | z_{1:T}, d_{1:T}) p_{θz}(z_{1:T} | d_{1:T}, z_0) p_{θd}(d_{1:T} | u_{1:T}, d_0)
    = ∏_{t=1}^{T} p_{θx}(x_t | z_t, d_t) p_{θz}(z_t | z_{t−1}, d_t) p_{θd}(d_t | d_{t−1}, u_t) .   (1)

The SSM and RNN are further tied with skip-connections from d_t to x_t.
The joint density in (1) is parameterized by θ = {θ_x, θ_z, θ_d}, which will be adapted together with parameters φ of a so-called "inference network" q_φ to best model N independently observed data sequences {x^i_{1:T_i}}_{i=1}^N that are described by the log marginal likelihood or evidence

L(θ) = log p_θ({x^i_{1:T_i}} | {u^i_{1:T_i}, z^i_0, d^i_0}_{i=1}^N) = ∑_i log p_θ(x^i_{1:T_i} | u^i_{1:T_i}, z^i_0, d^i_0) = ∑_i L_i(θ) .   (2)

Throughout the paper, we omit superscript i when only one sequence is referred to, or when it is clear from the context. In each log likelihood term L_i(θ) in (2), the latent states z_{1:T} and d_{1:T} were averaged out of (1). Integrating out d_{1:T} is done by simply substituting its deterministically obtained value, but z_{1:T} requires more care, and we return to it in Section 3.2. Following Figure 2a, the states d_{1:T} are determined from d_0 and u_{1:T} through the recursion d_t = f_{θd}(d_{t−1}, u_t). In our implementation f_{θd} is a GRU network with parameters θ_d. For later convenience we denote the value of d_{1:T}, as computed by application of f_{θd}, by d̃_{1:T}. Therefore p_{θd}(d_t | d_{t−1}, u_t) = δ(d_t − d̃_t), i.e. d_{1:T} follows a delta distribution centered at d̃_{1:T}.
Unlike the VRNN [7], z_t directly depends on z_{t−1}, as it does in a SSM, via p_{θz}(z_t | z_{t−1}, d_t). This split makes a clear separation between the deterministic and stochastic parts of p_θ; the RNN remains entirely deterministic and its recurrent units do not depend on noisy samples of z_t, while the prior over z_t follows the Markov structure of SSMs. The split allows us to later mimic the structure of the posterior distribution over z_{1:T} and d_{1:T} in its approximation q_φ.
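The factorization in (1) can be sampled ancestrally: update the deterministic state, sample the stochastic state given it, then emit an observation. Below is a minimal numpy sketch of that generative pass; the single linear maps are illustrative stand-ins for the GRU f_{θd} and the paper's two-layer networks, with random placeholder weights and a fixed noise scale.

```python
import numpy as np

rng = np.random.default_rng(1)
Dz, Dd, Dx, Du = 2, 3, 2, 2

# Toy parameters standing in for theta = {theta_x, theta_z, theta_d}.
Wd = 0.3 * rng.standard_normal((Dd, Dd + Du))  # deterministic recursion f_{theta_d}
Wp = 0.3 * rng.standard_normal((Dz, Dz + Dd))  # prior mean network NN_1^(p)
Wx = 0.3 * rng.standard_normal((Dx, Dz + Dd))  # emission mean network

def sample_sequence(u_seq, noise_std=0.1):
    """Ancestral sampling following p(x_t|z_t,d_t) p(z_t|z_{t-1},d_t) p(d_t|d_{t-1},u_t)."""
    d = np.zeros(Dd)  # d_0 = 0
    z = np.zeros(Dz)  # z_0 = 0
    xs = []
    for u in u_seq:
        d = np.tanh(Wd @ np.concatenate([d, u]))        # deterministic: delta(d_t - d~_t)
        mu_p = Wp @ np.concatenate([z, d])              # prior mean mu_t^(p)(z_{t-1}, d_t)
        z = mu_p + noise_std * rng.standard_normal(Dz)  # z_t ~ N(mu_p, v_p)
        x = Wx @ np.concatenate([z, d]) + noise_std * rng.standard_normal(Dx)
        xs.append(x)
    return np.stack(xs)

x_seq = sample_sequence(rng.standard_normal((5, Du)))
```

Note how z_t is conditioned on both z_{t−1} (the SSM Markov chain) and the freshly computed d_t (the RNN memory), mirroring the split described above.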
We let the prior transition distribution p_{θz}(z_t | z_{t−1}, d_t) = N(z_t; μ_t^(p), v_t^(p)) be a Gaussian with a diagonal covariance matrix, whose mean and log-variance are parameterized by neural networks that depend on z_{t−1} and d_t,

μ_t^(p) = NN_1^(p)(z_{t−1}, d_t) ,    log v_t^(p) = NN_2^(p)(z_{t−1}, d_t) ,   (3)

where NN denotes a neural network. Parameters θ_z denote all weights of NN_1^(p) and NN_2^(p), which are two-layer feed-forward networks in our implementation. Similarly, the parameters of the emission distribution p_{θx}(x_t | z_t, d_t) depend on z_t and d_t through a similar neural network that is parameterized by θ_x.

3.1 Variational inference for the SRNN

The stochastic variables z_{1:T} of the nonlinear SSM cannot be analytically integrated out to obtain L(θ) in (2). Instead of maximizing L with respect to θ, we maximize a variational evidence lower bound (ELBO) F(θ, φ) = ∑_i F_i(θ, φ) ≤ L(θ) with respect to both θ and the variational parameters φ [17]. The ELBO is a sum of lower bounds F_i(θ, φ) ≤ L_i(θ), one for each sequence i,

F_i(θ, φ) = ∫∫ q_φ(z_{1:T}, d_{1:T} | x_{1:T}, A) log [ p_θ(x_{1:T}, z_{1:T}, d_{1:T} | A) / q_φ(z_{1:T}, d_{1:T} | x_{1:T}, A) ] dz_{1:T} dd_{1:T} ,   (4)

where A = {u_{1:T}, z_0, d_0} is a notational shorthand. Each sequence's approximation q_φ shares parameters φ with all others, to form the auto-encoding variational Bayes inference network or variational auto-encoder (VAE) [19, 23] shown in Figure 2b.
Maximizing F(θ, φ) – which we call "training" the neural network architecture with parameters θ and φ – is done by stochastic gradient ascent, and in doing so, both the posterior and its approximation q_φ change simultaneously. All the intractable expectations in (4) would typically be approximated by sampling, using the reparameterization trick [19, 23] or control variates [22] to obtain low-variance estimators of its gradients. We use the reparameterization trick in our implementation. Iteratively maximizing F over θ and φ separately would yield an expectation maximization-type algorithm, which has formed a backbone of statistical modeling for many decades [8]. The tightness of the bound depends on how well we can approximate the i = 1, . . . , N factors p_θ(z^i_{1:T_i}, d^i_{1:T_i} | x^i_{1:T_i}, A^i) that constitute the true posterior over all latent variables with their corresponding factors q_φ(z^i_{1:T_i}, d^i_{1:T_i} | x^i_{1:T_i}, A^i). In what follows, we show how q_φ could be judiciously structured to match the posterior factors.
We add initial structure to q_φ by noticing that the prior p_{θd}(d_{1:T} | u_{1:T}, d_0) in the generative model is a delta function over d̃_{1:T}, and so is the posterior p_θ(d_{1:T} | x_{1:T}, u_{1:T}, d_0). Consequently, we let the inference network use exactly the same deterministic state setting d̃_{1:T} as that of the generative model, and we decompose it as

q_φ(z_{1:T}, d_{1:T} | x_{1:T}, u_{1:T}, z_0, d_0) = q_φ(z_{1:T} | d_{1:T}, x_{1:T}, z_0) · q(d_{1:T} | x_{1:T}, u_{1:T}, d_0) ,  with  q(d_{1:T} | x_{1:T}, u_{1:T}, d_0) = p_{θd}(d_{1:T} | u_{1:T}, d_0) .   (5)

This choice exactly approximates one delta-function by itself, and simplifies the ELBO by letting them cancel out. By further taking the outer average in (4), one obtains

F_i(θ, φ) = E_{q_φ}[ log p_θ(x_{1:T} | z_{1:T}, d̃_{1:T}) ] − KL( q_φ(z_{1:T} | d̃_{1:T}, x_{1:T}, z_0) ‖ p_θ(z_{1:T} | d̃_{1:T}, z_0) ) ,   (6)

which still depends on θ_d, u_{1:T} and d_0 via d̃_{1:T}.
The first term is an expected log likelihood under q_φ(z_{1:T} | d̃_{1:T}, x_{1:T}, z_0), while KL denotes the Kullback-Leibler divergence between two distributions. Having stated the second factor in (5), we now turn our attention to parameterizing the first factor in (5) to resemble its posterior equivalent, by exploiting the temporal structure of p_θ.

3.2 Exploiting the temporal structure

The true posterior distribution of the stochastic states z_{1:T}, given both the data and the deterministic states d_{1:T}, factorizes as p_θ(z_{1:T} | d_{1:T}, x_{1:T}, u_{1:T}, z_0) = ∏_t p_θ(z_t | z_{t−1}, d_{t:T}, x_{t:T}). This can be verified by considering the conditional independence properties of the graphical model in Figure 2a using d-separation [13]. This shows that, knowing z_{t−1}, the posterior distribution of z_t does not depend on the past outputs and deterministic states, but only on the present and future ones; this was also noted in [20]. Instead of factorizing q_φ as a mean-field approximation across time steps, we keep the structured form of the posterior factors, including z_t's dependence on z_{t−1}, in the variational approximation

q_φ(z_{1:T} | d_{1:T}, x_{1:T}, z_0) = ∏_t q_φ(z_t | z_{t−1}, d_{t:T}, x_{t:T}) = ∏_t q_{φz}(z_t | z_{t−1}, a_t = g_{φa}(a_{t+1}, [d_t, x_t])) ,   (7)

where [d_t, x_t] is the concatenation of the vectors d_t and x_t. The graphical model for the inference network is shown in Figure 2b. Apart from the direct dependence of the posterior approximation at time t on both d_{t:T} and x_{t:T}, the distribution also depends on d_{1:t−1} and x_{1:t−1} through z_{t−1}.
We mimic each posterior factor's nonlinear long-term dependence on d_{t:T} and x_{t:T} through a backward-recurrent function g_{φa}, shown in (7), which we will return to in greater detail in Section 3.3. The inference network in Figure 2b is therefore parameterized by φ = {φ_z, φ_a} and θ_d.
In (7) all time steps are taken into account when constructing the variational approximation at time t; this can therefore be seen as a smoothing problem. In our experiments we also consider filtering, where only the information up to time t is used to define q_φ(z_t | z_{t−1}, d_t, x_t). As the parameters φ are shared across time steps, we can easily handle sequences of variable length in both cases.
As both the generative model and inference network factorize over time steps in (1) and (7), the ELBO in (6) separates as a sum over the time steps

F_i(θ, φ) = ∑_t E_{q*_φ(z_{t−1})} [ E_{q_φ(z_t | z_{t−1}, d̃_{t:T}, x_{t:T})}[ log p_θ(x_t | z_t, d̃_t) ] − KL( q_φ(z_t | z_{t−1}, d̃_{t:T}, x_{t:T}) ‖ p_θ(z_t | z_{t−1}, d̃_t) ) ] ,   (8)

where q*_φ(z_{t−1}) denotes the marginal distribution of z_{t−1} in the variational approximation to the posterior q_φ(z_{1:t−1} | d̃_{1:T}, x_{1:T}, z_0), given by

q*_φ(z_{t−1}) = ∫ q_φ(z_{1:t−1} | d̃_{1:T}, x_{1:T}, z_0) dz_{1:t−2} = E_{q*_φ(z_{t−2})}[ q_φ(z_{t−1} | z_{t−2}, d̃_{t−1:T}, x_{t−1:T}) ] .   (9)

We can interpret (9) as having a VAE at each time step t, with the VAE being conditioned on the past through the stochastic variable z_{t−1}.
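Since both distributions in the KL term of (8) are diagonal Gaussians, that term has a closed form, while the expected log likelihood is estimated with a reparameterized sample. A small numpy sketch of these two ingredients, with illustrative placeholder values:

```python
import numpy as np

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ),
    computed analytically as for the per-time-step KL term in the ELBO."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def reparam_sample(mu, logvar, rng):
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
    # so gradients can flow through mu and logvar.
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

rng = np.random.default_rng(2)
mu_q, logvar_q = np.array([0.5, -0.2]), np.array([-1.0, -1.0])  # variational factor
mu_p, logvar_p = np.zeros(2), np.zeros(2)                       # prior factor

kl = kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p)
z = reparam_sample(mu_q, logvar_q, rng)
```

The KL is zero exactly when the variational factor matches the prior, which is also the degenerate solution discussed in Section 3.3.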
To compute (8), the dependence on z_{t−1} needs to be integrated out, using our posterior knowledge at time t − 1 which is given by q*_φ(z_{t−1}). We approximate the outer expectation in (8) using a Monte Carlo estimate, as samples from q*_φ(z_{t−1}) can be efficiently obtained by ancestral sampling. The sequential formulation of the inference model in (7) allows such samples to be drawn and reused, as given a sample z^(s)_{t−2} from q*_φ(z_{t−2}), a sample z^(s)_{t−1} from q_φ(z_{t−1} | z^(s)_{t−2}, d̃_{t−1:T}, x_{t−1:T}) will be distributed according to q*_φ(z_{t−1}).

3.3 Parameterization of the inference network

The variational distribution q_φ(z_t | z_{t−1}, d_{t:T}, x_{t:T}) needs to approximate the dependence of the true posterior p_θ(z_t | z_{t−1}, d_{t:T}, x_{t:T}) on d_{t:T} and x_{t:T}, and as alluded to in (7), this is done by running a RNN with inputs d̃_{t:T} and x_{t:T} backwards in time. Specifically, we initialize the hidden state of the backward-recursive RNN in Figure 2b as a_{T+1} = 0, and recursively compute a_t = g_{φa}(a_{t+1}, [d̃_t, x_t]). The function g_{φa} represents a recurrent neural network with, for example, LSTM or GRU units. Each sequence's variational approximation factorizes over time with q_φ(z_{1:T} | d_{1:T}, x_{1:T}, z_0) = ∏_t q_{φz}(z_t | z_{t−1}, a_t), as shown in (7). We let q_{φz}(z_t | z_{t−1}, a_t) be a Gaussian with diagonal covariance, whose mean and log-variance are parameterized with φ_z as

μ_t^(q) = NN_1^(q)(z_{t−1}, a_t) ,    log v_t^(q) = NN_2^(q)(z_{t−1}, a_t) .   (10)

Instead of smoothing, we can also do filtering by using a neural network to approximate the dependence of the true posterior p_θ(z_t | z_{t−1}, d_t, x_t) on d_t and x_t, through for instance a_t = NN^(a)(d_t, x_t).

Improving the posterior approximation. In our experiments we found that during training, the parameterization introduced in (10) can lead to small values of the KL term KL(q_φ(z_t | z_{t−1}, a_t) ‖ p_θ(z_t | z_{t−1}, d̃_t)) in the ELBO in (8). This happens when g_φ in the inference network does not rely on the information propagated back from future outputs in a_t, but it is mostly using the hidden state d̃_t to imitate the behavior of the prior. The inference network could therefore get stuck by trying to optimize the ELBO through sampling from the prior of the model, making the variational approximation to the posterior useless. To overcome this issue, we directly include some knowledge of the predictive prior dynamics in the parameterization of the inference network, using our approximation of the posterior distribution q*_φ(z_{t−1}) over the previous latent states. In the spirit of sequential Monte Carlo methods [10], we improve the parameterization of q_φ(z_t | z_{t−1}, a_t) by using q*_φ(z_{t−1}) from (9). As we are constructing the variational distribution sequentially, we approximate the predictive prior mean, i.e.
our "best guess" on the prior dynamics of z_t, as

μ̂_t^(p) = ∫ NN_1^(p)(z_{t−1}, d_t) p(z_{t−1} | x_{1:T}) dz_{t−1} ≈ ∫ NN_1^(p)(z_{t−1}, d_t) q*_φ(z_{t−1}) dz_{t−1} ,   (11)

where we used the parameterization of the prior distribution in (3). We estimate the integral required to compute μ̂_t^(p) by reusing the samples that were needed for the Monte Carlo estimate of the ELBO in (8). This predictive prior mean can then be used in the parameterization of the mean of the variational approximation q_φ(z_t | z_{t−1}, a_t),

μ_t^(q) = μ̂_t^(p) + NN_1^(q)(z_{t−1}, a_t) ,   (12)

and we refer to this parameterization as Resq in the results in Section 4. Rather than directly learning μ_t^(q), we learn the residual between μ̂_t^(p) and μ_t^(q). It is straightforward to show that with this parameterization the KL-term in (8) will not depend on μ̂_t^(p), but only on NN_1^(q)(z_{t−1}, a_t). Learning the residual improves inference, making it seemingly easier for the inference network to track changes in the generative model while the model is trained, as it will only have to learn how to "correct" the predictive prior dynamics by using the information coming from d̃_{t:T} and x_{t:T}. We did not see any improvement in results by parameterizing log v_t^(q) in a similar way. The inference procedure of SRNN with Resq parameterization for one sequence is summarized in Algorithm 1.

Algorithm 1 Inference of SRNN with Resq parameterization from (12).
1: inputs: d̃_{1:T} and a_{1:T}
2: initialize z_0
3: for t = 1 to T do
4:   μ̂_t^(p) = NN_1^(p)(z_{t−1}, d̃_t)
5:   μ_t^(q) = μ̂_t^(p) + NN_1^(q)(z_{t−1}, a_t)
6:   log v_t^(q) = NN_2^(q)(z_{t−1}, a_t)
7:   z_t ∼ N(z_t; μ_t^(q), v_t^(q))
8: end for
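The two-pass structure of this inference procedure (a backward recursion for a_t, then forward ancestral sampling with the residual mean of (12)) can be sketched as follows. This is a toy numpy illustration: the tanh updates and single linear maps are placeholders for g_{φa} and the networks NN_1^(p), NN_1^(q), NN_2^(q), with random untrained weights.

```python
import numpy as np

rng = np.random.default_rng(3)
Dz, Dd, Dx, Da = 2, 3, 2, 3

# Toy stand-ins for the networks of Algorithm 1 (not trained models).
Wa = 0.3 * rng.standard_normal((Da, Da + Dd + Dx))  # backward cell g_{phi_a}
Wp = 0.3 * rng.standard_normal((Dz, Dz + Dd))       # NN_1^(p): predictive prior mean
Wr = 0.3 * rng.standard_normal((Dz, Dz + Da))       # NN_1^(q): residual on the mean
Wv = 0.3 * rng.standard_normal((Dz, Dz + Da))       # NN_2^(q): log-variance

def infer(d_seq, x_seq):
    """Two-pass inference with the Resq parameterization."""
    T = len(d_seq)
    # Backward recursion: a_{T+1} = 0, then a_t = g(a_{t+1}, [d_t, x_t]).
    a = np.zeros(Da)
    a_seq = [None] * T
    for t in reversed(range(T)):
        a = np.tanh(Wa @ np.concatenate([a, d_seq[t], x_seq[t]]))
        a_seq[t] = a
    # Forward ancestral sampling with the residual mean parameterization (12).
    z = np.zeros(Dz)
    zs = []
    for t in range(T):
        mu_hat_p = Wp @ np.concatenate([z, d_seq[t]])         # "best guess" prior mean
        mu_q = mu_hat_p + Wr @ np.concatenate([z, a_seq[t]])  # mu_q = mu_hat_p + residual
        logvar_q = Wv @ np.concatenate([z, a_seq[t]])
        z = mu_q + np.exp(0.5 * logvar_q) * rng.standard_normal(Dz)
        zs.append(z)
    return np.stack(zs)

z_seq = infer(rng.standard_normal((4, Dd)), rng.standard_normal((4, Dx)))
```

Sampling z_t and feeding it into the next step is exactly the ancestral-sampling reuse of q*_φ described in Section 3.2.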
4 Results

In this section the SRNN is evaluated on the modeling of speech and polyphonic music data, as they have shown to be difficult to model without a good representation of the uncertainty in the latent states [3, 7, 11, 12, 15]. We test SRNN on the Blizzard [18] and TIMIT raw audio data sets (Table 1) used in [7]. The preprocessing of the data sets and the testing performance measures are identical to those reported in [7]. Blizzard is a dataset of 300 hours of English, spoken by a single female speaker. TIMIT is a dataset of 6300 English sentences read by 630 speakers. As done in [7], for Blizzard we report the average log-likelihood for half-second sequences and for TIMIT we report the average log-likelihood per sequence for the test set sequences. Note that the sequences in the TIMIT test set are on average 3.1s long, and therefore 6 times longer than those in Blizzard. For the raw audio datasets we use a fully factorized Gaussian output distribution. Additionally, we test SRNN for modeling sequences of polyphonic music (Table 2), using the four data sets of MIDI songs introduced in [4]. Each data set contains more than 7 hours of polyphonic music of varying complexity: folk tunes (Nottingham data set), the four-part chorales by J. S. Bach (JSB chorales), orchestral music (MuseData) and classical piano music (Piano-midi.de). For polyphonic music we use a Bernoulli output distribution to model the binary sequences of piano notes. In our experiments we set u_t = x_{t−1}, but u_t could also be used to represent additional input information to the model.
All models were implemented using Theano [2], Lasagne [9] and Parmesan^1. Training using a NVIDIA Titan X GPU took around 1.5 hours for TIMIT, 18 hours for Blizzard, less than 15 minutes for the JSB chorales and Piano-midi.de data sets, and around 30 minutes for the Nottingham and MuseData data sets.
To reduce the computational requirements we use only 1 sample to approximate\nall the intractable expectations in the ELBO (notice that the KL term can be computed analytically).\nFurther implementation and experimental details can be found in the Supplementary Material.\n\nBlizzard and TIMIT. Table 1 compares the average log-likelihood per test sequence of SRNN to\nthe results from [7]. For RNNs and VRNNs the authors of [7] test two different output distributions,\nnamely a Gaussian distribution (Gauss) and a Gaussian Mixture Model (GMM). VRNN-I differs\nfrom the VRNN in that the prior over the latent variables is independent across time steps, and it is\ntherefore similar to STORN [3]. For SRNN we compare the smoothing and \ufb01ltering performance\n(denoted as smooth and \ufb01lt in Table 1), both with the residual term from (12) and without it (10)\n(denoted as Resq if present). We prefer to only report the more conservative evidence lower bound\nfor SRNN, as the approximation of the log-likelihood using standard importance sampling is known\nto be dif\ufb01cult to compute accurately in the sequential setting [10]. We see from Table 1 that SRNN\noutperforms all the competing methods for speech modeling. As the test sequences in TIMIT are\non average more than 6 times longer than the ones for Blizzard, the results obtained with SRNN for\n\n1github.com/casperkaae/parmesan. 
The code for SRNN is available at github.com/marcofraccaro/srnn.

Models              | Blizzard         | TIMIT
SRNN (smooth+Resq)  | ≥ 11991          | ≥ 60550
SRNN (smooth)       | ≥ 10991          | ≥ 59269
SRNN (filt+Resq)    | ≥ 10572          | ≥ 52126
SRNN (filt)         | ≥ 10846          | ≥ 50524
VRNN-GMM            | ≥ 9107, ≈ 9392   | ≥ 28982, ≈ 29604
VRNN-Gauss          | ≥ 9223, ≈ 9516   | ≥ 28805, ≈ 30235
VRNN-I-Gauss        | ≥ 8933, ≈ 9188   | ≥ 28340, ≈ 29639
RNN-GMM             | 7413             | 26643
RNN-Gauss           | -1900            | 3539

Table 1: Average log-likelihood per sequence on the test sets. For TIMIT the average test set length is 3.1s, while the Blizzard sequences are all 0.5s long. The non-SRNN results are reported as in [7]. Smooth: g_{φa} is a GRU running backwards; filt: g_{φa} is a feed-forward network; Resq: parameterization with residual in (12).

Figure 3: Visualization of the average KL term and reconstructions of the output mean and log-variance for two examples from the Blizzard test set.

Models              | Nottingham | JSB chorales | MuseData | Piano-midi.de
SRNN (smooth+Resq)  | ≥ −2.94    | ≥ −4.74      | ≥ −6.28  | ≥ −8.20
TSBN                | ≥ −3.67    | ≥ −7.48      | ≥ −6.81  | ≥ −7.98
NASMC               | ≈ −2.72    | ≈ −3.99     | ≈ −6.89  | ≈ −7.61
STORN               | ≈ −2.85    | ≈ −6.91     | ≈ −6.16  | ≈ −7.13
RNN-NADE            | ≈ −2.31    | ≈ −5.19     | ≈ −5.60  | ≈ −7.05
RNN                 | ≈ −4.46    | ≈ −8.71     | ≈ −8.13  | ≈ −8.37

Table 2: Average log-likelihood on the test sets. The TSBN results are from [12], NASMC from [15], STORN from [3], RNN-NADE and RNN from [4].

TIMIT are in line with those obtained for Blizzard.
The VRNN, which performs well when the voice\nof the single speaker from Blizzard is modeled, seems to encounter dif\ufb01culties when modeling the\n630 speakers in the TIMIT data set. As expected, for SRNN the variational approximation that is\nobtained when future information is also used (smoothing) is better than the one obtained by \ufb01ltering.\nLearning the residual between the prior mean and the mean of the variational approximation, given in\n(12), further improves the performance in 3 out of 4 cases.\nIn the \ufb01rst two lines of Figure 3 we plot two raw signals from the Blizzard test set and the average\nKL term between the variational approximation and the prior distribution. We see that the KL\nterm increases whenever there is a transition in the raw audio signal, meaning that the inference\nnetwork is using the information coming from the output symbols to improve inference. Finally, the\nreconstructions of the output mean and log-variance in the last two lines of Figure 3 look consistent\nwith the original signal.\n\nPolyphonic music. Table 2 compares the average log-likelihood on the test sets obtained with\nSRNN and the models introduced in [3, 4, 12, 15]. As done for the speech data, we prefer to report the\nmore conservative estimate of the ELBO in Table 2, rather than approximating the log-likelihood with\nimportance sampling as some of the other methods do. We see that SRNN performs comparably to\nother state of the art methods in all four data sets. We report the results using smoothing and learning\nthe residual between the mean of the predictive prior and the mean of the variational approximation,\nbut the performances using \ufb01ltering and directly learning the mean of the variational approximation\nare now similar. 
We believe that this is due to the small amount of data and the fact that modeling MIDI music is much simpler than modeling raw speech signals.

5 Related work

A number of works have extended RNNs with stochastic units to model motion capture, speech and music data [3, 7, 11, 12, 15]. The performances of these models are highly dependent on how the dependence among stochastic units is modeled over time, on the type of interaction between stochastic units and deterministic ones, and on the procedure that is used to evaluate the typically intractable log likelihood. Figure 4 highlights how SRNN differs from some of these works.
In STORN [3] (Figure 4a) and DRAW [14] the stochastic units at each time step have an isotropic Gaussian prior and are independent between time steps. The stochastic units are used as an input to the deterministic units in a RNN. As in our work, the reparameterization trick [19, 23] is used to optimize an ELBO.

[Figure 4 shows the generative models of (a) STORN, (b) the VRNN and (c) the Deep Kalman Filter.]
Figure 4: Generative models of x_{1:T} that are related to SRNN. For sequence modeling it is typical to set u_t = x_{t−1}.

The authors of the VRNN [7] (Figure 4b) note that it is beneficial to add information coming from the past states to the prior over latent variables z_t. The VRNN lets the prior p_{θz}(z_t | d_t) over the stochastic units depend on the deterministic units d_t, which in turn depend on both the deterministic and the stochastic units at the previous time step through the recursion d_t = f(d_{t−1}, z_{t−1}, u_t). The SRNN differs by clearly separating the deterministic and stochastic part, as shown in Figure 2a.
The sepa-\nFor sequence modeling it is typical to set ut = xt\u22121.\nration of deterministic and stochastic\nunits allows us to improve the posterior approximation by doing smoothing, as the stochastic units\nstill depend on each other when we condition on d1:T . In the VRNN, on the other hand, the stochastic\nunits are conditionally independent given the states d1:T . Because the inference and generative\nnetworks in the VRNN share the deterministic units, the variational approximation would not improve\nby making it dependent on the future through at, when calculated with a backward GRU, as we\ndo in our model. Unlike STORN, DRAW and VRNN, the SRNN separates the \u201cnoisy\u201d stochastic\nunits from the deterministic ones, forming an entire layer of interconnected stochastic units. We\nfound in practice that this gave better performance and was easier to train. The works by [1, 20]\n(Figure 4c) show that it is possible to improve inference in SSMs by using ideas from VAEs, similar\nto what is done in the stochastic part (the top layer) of SRNN. Towards the periphery of related\nworks, [15] approximates the log likelihood of a SSM with sequential Monte Carlo, by learning\n\ufb02exible proposal distributions parameterized by deep networks, while [12] uses a recurrent model\nwith discrete stochastic units that is optimized using the NVIL algorithm [21].\n\n6 Conclusion\n\nThis work has shown how to extend the modeling capabilities of recurrent neural networks by\ncombining them with nonlinear state space models. Inspired by the independence properties of the\nintractable true posterior distribution over the latent states, we designed an inference network in a\nprincipled way. The variational approximation for the stochastic layer was improved by using the\ninformation coming from the whole sequence and by using the Resq parameterization to help the\ninference network to track the non-stationary posterior. 
SRNN achieves state-of-the-art performance on the Blizzard and TIMIT speech data sets, and performs comparably to competing methods for polyphonic music modeling.

Acknowledgements

We thank Casper Kaae Sønderby and Lars Maaløe for many fruitful discussions, and NVIDIA Corporation for the donation of TITAN X and Tesla K40 GPUs. Marco Fraccaro is supported by Microsoft Research through its PhD Scholarship Programme.

References

[1] E. Archer, I. M. Park, L. Buesing, J. Cunningham, and L. Paninski. Black box variational inference for state space models. arXiv:1511.07367, 2015.

[2] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio. Theano: new features and speed improvements. arXiv:1211.5590, 2012.

[3] J. Bayer and C. Osendorfer. Learning stochastic recurrent networks. arXiv:1411.7610, 2014.

[4] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv:1206.6392, 2012.

[5] K. Cho, B. Van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pages 1724–1734, 2014.

[6] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014.

[7] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In NIPS, pages 2962–2970, 2015.

[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1977.

[9] S. Dieleman, J. Schlüter, C. Raffel, E. Olson, S. K. Sønderby, D. Nouri, E.
Battenberg, and A. van den Oord. Lasagne: First release, 2015.

[10] A. Doucet, N. de Freitas, and N. Gordon. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice, Statistics for Engineering and Information Science. 2001.

[11] O. Fabius and J. R. van Amersfoort. Variational recurrent auto-encoders. arXiv:1412.6581, 2014.

[12] Z. Gan, C. Li, R. Henao, D. E. Carlson, and L. Carin. Deep temporal sigmoid belief networks for sequence modeling. In NIPS, pages 2458–2466, 2015.

[13] D. Geiger, T. Verma, and J. Pearl. Identifying independence in Bayesian networks. Networks, 20:507–534, 1990.

[14] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. In ICML, 2015.

[15] S. Gu, Z. Ghahramani, and R. E. Turner. Neural adaptive sequential Monte Carlo. In NIPS, pages 2611–2619, 2015.

[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, Nov. 1997.

[17] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

[18] S. King and V. Karaiskos. The Blizzard challenge 2013. In The Ninth Annual Blizzard Challenge, 2013.

[19] D. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[20] R. G. Krishnan, U. Shalit, and D. Sontag. Deep Kalman filters. arXiv:1511.05121, 2015.

[21] A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. arXiv:1402.0030, 2014.

[22] J. W. Paisley, D. M. Blei, and M. I. Jordan. Variational Bayesian inference with stochastic search. In ICML, 2012.

[23] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286, 2014.

[24] S. Roweis and Z. Ghahramani.
A unifying review of linear Gaussian models. Neural Computation, 11(2):305–345, 1999.