{"title": "A Recurrent Latent Variable Model for Sequential Data", "book": "Advances in Neural Information Processing Systems", "page_first": 2980, "page_last": 2988, "abstract": "In this paper, we explore the inclusion of latent random variables into the hidden state of a recurrent neural network (RNN) by combining the elements of the variational autoencoder. We argue that through the use of high-level latent random variables, the variational RNN (VRNN) can model the kind of variability observed in highly structured sequential data such as natural speech. We empirically evaluate the proposed model against other related sequential models on four speech datasets and one handwriting dataset. Our results show the important roles that latent random variables can play in the RNN dynamics.", "full_text": "A Recurrent Latent Variable Model\n\nfor Sequential Data\n\nJunyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel,\n\nAaron Courville, Yoshua Bengio\u2217\n\nDepartment of Computer Science and Operations Research\n\nUniversit\u00b4e de Montr\u00b4eal\n\u2217CIFAR Senior Fellow\n\n{firstname.lastname}@umontreal.ca\n\nAbstract\n\nIn this paper, we explore the inclusion of latent random variables into the hid-\nden state of a recurrent neural network (RNN) by combining the elements of the\nvariational autoencoder. We argue that through the use of high-level latent ran-\ndom variables, the variational RNN (VRNN)1 can model the kind of variability\nobserved in highly structured sequential data such as natural speech. We empiri-\ncally evaluate the proposed model against other related sequential models on four\nspeech datasets and one handwriting dataset. 
Our results show the important roles that latent random variables can play in the RNN dynamics.\n\n1 Introduction\n\nLearning generative models of sequences is a long-standing machine learning challenge and historically the domain of dynamic Bayesian networks (DBNs) such as hidden Markov models (HMMs) and Kalman filters. The dominance of DBN-based approaches has recently been overturned by a resurgence of interest in recurrent neural network (RNN) based approaches. An RNN is a special type of neural network that is able to handle both variable-length input and output. By training an RNN to predict the next output in a sequence, given all previous outputs, it can be used to model the joint probability distribution over sequences.\n\nBoth RNNs and DBNs consist of two parts: (1) a transition function that determines the evolution of the internal hidden state, and (2) a mapping from the state to the output. There are, however, a few important differences between RNNs and DBNs.\n\nDBNs have typically been limited either to relatively simple state transition structures (e.g., linear models in the case of the Kalman filter) or to relatively simple internal state structure (e.g., the HMM state space consists of a single set of mutually exclusive states). RNNs, on the other hand, typically possess both a richly distributed internal state representation and flexible non-linear transition functions. These differences give RNNs extra expressive power in comparison to DBNs. This expressive power and the ability to train via error backpropagation are the key reasons why RNNs have gained popularity as generative models for highly structured sequential data.\n\nIn this paper, we focus on another important difference between DBNs and RNNs. While the hidden state in DBNs is expressed in terms of random variables, the internal transition structure of the standard RNN is entirely deterministic. 
The only source of randomness or variability in the RNN is found in the conditional output probability model. We suggest that this can be an inappropriate way to model the kind of variability observed in highly structured data, such as natural speech, which is characterized by strong and complex dependencies among the output variables at different timesteps. We argue, as have others [4, 2], that these complex dependencies cannot be modelled efficiently by the output probability models used in standard RNNs, which include either a simple unimodal distribution or a mixture of unimodal distributions.\n\n1 Code is available at http://www.github.com/jych/nips2015_vrnn\n\nWe propose the use of high-level latent random variables to model the variability observed in the data. In the context of standard neural network models for non-sequential data, the variational autoencoder (VAE) [11, 17] offers an interesting combination of a highly flexible non-linear mapping between the latent random state and the observed output, and effective approximate inference. In this paper, we propose to extend the VAE into a recurrent framework for modelling high-dimensional sequences. The VAE can model complex multimodal distributions, which will help when the underlying true data distribution consists of multimodal conditional distributions. We call this model a variational RNN (VRNN).\n\nA natural question to ask is: how do we encode observed variability via latent random variables? The answer to this question depends on the nature of the data itself. In this work, we are mainly interested in highly structured data that often arises in AI applications. By highly structured, we mean that the data is characterized by two properties. Firstly, there is a relatively high signal-to-noise ratio, meaning that the vast majority of the variability observed in the data is due to the signal itself and cannot reasonably be considered as noise. 
Secondly, there exists a complex relationship between the underlying factors of variation and the observed data. For example, in speech, the vocal qualities of the speaker have a strong but complicated influence on the audio waveform, affecting the waveform in a consistent manner across frames.\n\nWith these considerations in mind, we suggest that our model variability should induce temporal dependencies across timesteps. Thus, like DBN models such as HMMs and Kalman filters, we model the dependencies between the latent random variables across timesteps. While we are not the first to propose integrating random variables into the RNN hidden state [4, 2, 6, 8], we believe we are the first to integrate the dependencies between the latent random variables at neighboring timesteps.\n\nWe evaluate the proposed VRNN model against other RNN-based models \u2013 including a VRNN model without temporal dependencies between the latent random variables \u2013 on two challenging sequential data types: natural speech and handwriting. We demonstrate that for the speech modelling tasks, the VRNN-based models significantly outperform both the RNN-based models and the VRNN model that does not integrate temporal dependencies between latent random variables.\n\n2 Background\n\n2.1 Sequence modelling with Recurrent Neural Networks\n\nAn RNN can take as input a variable-length sequence x = (x_1, x_2, . . . , x_T) by recursively processing each symbol while maintaining its internal hidden state h. At each timestep t, the RNN reads the symbol x_t \u2208 R^d and updates its hidden state h_t \u2208 R^p by:\n\nh_t = f_\u03b8(x_t, h_{t\u22121}),  (1)\n\nwhere f is a deterministic non-linear transition function, and \u03b8 is the parameter set of f. The transition function f can be implemented with gated activation functions such as long short-term memory [LSTM, 9] or gated recurrent unit [GRU, 5]. 
RNNs model sequences by parameterizing a factorization of the joint sequence probability distribution as a product of conditional probabilities such that:\n\np(x_1, x_2, . . . , x_T) = \u220f_{t=1}^{T} p(x_t | x
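As an illustrative sketch (not the paper's implementation) of this factorization, the snippet below pairs the deterministic transition h_t = f_θ(x_t, h_{t−1}) of Eq. (1) with a per-step conditional output distribution, and accumulates the sequence log-likelihood as a sum of conditional log-probabilities. The plain tanh cell, the categorical (softmax) output model, and all dimensions are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, K, T = 3, 8, 5, 6   # input dim, hidden dim, output classes, length (all illustrative)

# Parameters of the transition f_theta and of the output mapping.
W_x = rng.normal(scale=0.1, size=(p, d))
W_h = rng.normal(scale=0.1, size=(p, p))
b_h = np.zeros(p)
W_o = rng.normal(scale=0.1, size=(K, p))
b_o = np.zeros(K)

def f_theta(x_t, h_prev):
    """Deterministic non-linear transition (a plain tanh RNN cell, standing in
    for the gated LSTM/GRU cells the paper mentions)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b_h)

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

# A toy sequence: continuous inputs x_t with discrete next-symbol targets.
xs = rng.normal(size=(T, d))
targets = rng.integers(0, K, size=T)

h = np.zeros(p)  # h_0
log_lik = 0.0
for t in range(T):
    # p(x_t | x_<t) is read off the *previous* hidden state, so the model
    # only conditions on symbols seen so far.
    log_probs = log_softmax(W_o @ h + b_o)
    log_lik += log_probs[targets[t]]
    h = f_theta(xs[t], h)  # h_t = f_theta(x_t, h_{t-1}), Eq. (1)

print(float(log_lik))  # total log p(x_1, ..., x_T)
```

Training an RNN to maximize this quantity over a dataset is what "predicting the next output given all previous outputs" amounts to; the per-step output model here is a single softmax, i.e., exactly the kind of simple unimodal conditional the paper argues is insufficient for highly structured data.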