{"title": "Better Generative Models for Sequential Data Problems: Bidirectional Recurrent Mixture Density Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 589, "page_last": 595, "abstract": null, "full_text": "Better Generative Models for Sequential Data Problems: Bidirectional Recurrent Mixture Density Networks

Mike Schuster

ATR Interpreting Telecommunications Research Laboratories
2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, JAPAN
gustl@itl.atr.co.jp

Abstract

This paper describes bidirectional recurrent mixture density networks, which can model multi-modal distributions of the type P(x_t | y_1^T) and P(x_t | x_1, x_2, ..., x_{t-1}, y_1^T) without any explicit assumptions about the use of context. These expressions occur frequently in pattern recognition problems with sequential data, for example in speech recognition. Experiments show that the proposed generative models give a higher likelihood on test data compared to a traditional modeling approach, indicating that they can summarize the statistical properties of the data better.

1 Introduction

Many problems of engineering interest can be formulated as sequential data problems in an abstract sense as supervised learning from sequential data, where an input vector (dimensionality D) sequence X = x_1^T = {x_1, x_2, ..., x_{T-1}, x_T} living in space X has to be mapped to an output vector (dimensionality K) target sequence T = t_1^T = {t_1, t_2, ..., t_{T-1}, t_T} in space(1) Y, which often embodies correlations between neighboring vectors x_t, x_{t+1} and t_t, t_{t+1}. In general there are a number of training data sequence pairs (input and target), which are used to estimate the parameters of a given model structure, whose performance can then be evaluated on another set of test data pairs.
For many applications the problem becomes to predict the best sequence Y* given an arbitrary input sequence X, with 'best' meaning the sequence that minimizes an error using a suitable metric that is yet to be defined. Making use of the theory of pattern recognition [2], this problem is often simplified by treating any sequence as one pattern. This makes it possible to express the objective of sequence prediction with the well-known expression Y* = arg max_Y P(Y|X), with X being the input sequence, Y being any valid output sequence, and Y* being the predicted sequence with the highest probability(2) among all possible sequences.

(1) A sample sequence of the training target data is denoted as T, while an output sequence in general is denoted as Y; both live in the output space Y.
(2) To simplify notation, random variables and their values are not denoted as different symbols. This means P(x) = P(X = x).

Training of a sequence prediction system corresponds to estimating the distribution(3) P(Y|X) from a number of samples, which includes (a) defining an appropriate model representing this distribution and (b) estimating its parameters such that P(Y|X) for the training data is maximized. In practice the model consists of several modules, with each of them being responsible for a different part of P(Y|X).

Testing (usage) of the trained system, or recognition, for a given input sequence X corresponds principally to the evaluation of P(Y|X) for all possible output sequences to find the best one, Y*. This procedure is called the search, and its efficient implementation is important for many applications.

In order to build a model to predict sequences it is necessary to decompose the sequences such that modules responsible for smaller parts can be built.
An often used approach is the decomposition into a generative and a prior model part, using P(B|A) = P(A|B)P(B)/P(A) and P(A, B) = P(A)P(B|A), as:

Y* = arg max_Y P(Y|X) = arg max_Y P(X|Y)P(Y)
   = arg max_Y [prod_{t=1}^T P(x_t | x_1, x_2, ..., x_{t-1}, y_1^T)] [prod_{t=1}^T P(y_t | y_1, y_2, ..., y_{t-1})]   (1)
                (generative part)                                    (prior part)

For many applications (1) is approximated by simpler expressions, for example as a first-order Markov model

Y* ~ arg max_Y [prod_{t=1}^T P(x_t | y_t)] [prod_{t=1}^T P(y_t | y_{t-1})]   (2)

making some simplifying approximations. These are for this example:

* Every output y_t depends only on the previous output y_{t-1} and not on all previous outputs:
  P(y_t | y_1, y_2, ..., y_{t-1}) => P(y_t | y_{t-1})   (3)

* The inputs are assumed to be statistically independent in time:
  P(x_t | x_1, x_2, ..., x_{t-1}, y_1^T) => P(x_t | y_1^T)   (4)

* The likelihood of an input vector x_t given the complete output sequence y_1^T is assumed to depend only on the output found at t and not on any other ones:
  P(x_t | y_1^T) => P(x_t | y_t)   (5)

Assuming that the output sequences are categorical sequences (consisting of symbols), approximation (2) and derived expressions are the basis for many applications. For example, using Gaussian mixture distributions to model P(x_t | y_t = k) = p_k(x_t) for all K occurring symbols, approach (2) is used in a more sophisticated form in most state-of-the-art speech recognition systems.

The focus of this paper is to present some models for the generative part of (1) which need fewer assumptions. Ideally this means being able to model directly expressions of the form P(x_t | x_1, x_2, ..., x_{t-1}, y_1^T), the possibly multi-modal distribution of a vector conditioned on the previous vectors x_{t-1}, x_{t-2}, ..., x_1 and a complete sequence y_1^T, as shown in the next section.
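Under approximation (2), the joint likelihood factorizes into per-frame emission terms and first-order transition terms. The following sketch illustrates that factorization for the simplest case of one radial Gaussian per state (M = 1); the function and parameter names are hypothetical illustrations, not the models used in the paper.

```python
import numpy as np

def joint_log_likelihood(X, y, means, log_trans, var=1.0):
    """log P(X, Y) under the first-order approximation (2):
    sum_t log P(x_t | y_t) + sum_t log P(y_t | y_{t-1}),
    with one radial Gaussian (shared variance) per state."""
    D = X.shape[1]
    ll = 0.0
    prev = None
    for x, k in zip(X, y):
        diff = x - means[k]
        # radial Gaussian log-density: a single variance for all D components
        ll += -0.5 * (D * np.log(2 * np.pi * var) + diff @ diff / var)
        if prev is not None:
            ll += log_trans[prev, k]  # log P(y_t | y_{t-1})
        prev = k
    return ll
```

Dropping the transition term leaves the generative part alone; the models proposed below replace the emission term P(x_t | y_t) with less restrictive conditionals.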
(3) There is no distinction made between probability mass and density, usually denoted as P and p, respectively. If the quantity to model is categorical, a probability mass is assumed; if it is continuous, a probability density is assumed.

2 Mixture density recurrent neural networks

Assume we want to model a continuous vector sequence, conditioned on a sequence of categorical variables, as shown in Figure 1. One approach is to assume that the vector sequence can be modeled by a uni-modal Gaussian distribution with a constant variance, making it a uni-modal regression problem. There are many practical examples where this assumption doesn't hold, requiring a more complex output distribution to model multi-modal data. One example is the attempt to model the sounds of phonemes based on data from multiple speakers. A certain phoneme will sound completely different depending on its phonetic environment or on the speaker, and using a single Gaussian with a constant variance would lead to a crude averaging of all examples.

The traditional approach is to build generative models for each symbol separately, as suggested by (2). If conventional Gaussian mixtures are used to model the observed input vectors, then the parameters of the distribution (means, covariances, mixture weights) in general do not change with the temporal position of the vector to model within a given state segment of that symbol. This can be a bad representation for the data in some areas (shown in Figure 1 are the means of a very bi-modal looking distribution), as indicated by the two shown variances for the state 'E'. When used to model speech, a procedure often used to cope with this problem is to increase the number of symbols by grouping often-appearing symbol sub-strings into a new symbol and by subdividing each original symbol into a number of states.
[Figure 1: both panels plot an example symbol sequence of the form 'KKKEEE...EEEKKKOOO...OOO' against a TIME axis.]

Figure 1: Conventional Gaussian mixtures (left) and mixture density BRNNs (right) for multi-modal regression

Another alternative is explored here, where all parameters of a Gaussian mixture distribution modeling the continuous targets are predicted by one bidirectional recurrent neural network, extended to model mixture densities conditioned on a complete vector sequence, as shown on the right side of Figure 1. Another extension (section 2.1) to the architecture allows the estimation of time-varying mixture densities conditioned on a hypothesized output sequence and a continuous vector sequence, to model exactly the generative term in (1) without any explicit approximations about the use of context.

Basics of non-recurrent mixture density networks (MLP type) can be found in [1][2]. The extension from uni-modal to multi-modal regression is somewhat involved but straightforward for the two interesting cases of having a radial covariance matrix or a diagonal covariance matrix per mixture component. They are trained with gradient-descent procedures like regular uni-modal regression NNs. Suitable equations to calculate the error that is back-propagated can be found in [6] for the two cases mentioned, and a derivation for the simple case in [1][2].

Conventional recurrent neural networks (RNNs) can model expressions of the form P(x_t | y_1, y_2, ..., y_t), the distribution of a vector given an input vector plus its past input vectors. Bidirectional recurrent neural networks (BRNNs) [5][6] are a simple extension of conventional RNNs. The extension allows one to model expressions of the form P(x_t | y_1^T), the distribution of a vector given an input vector plus its past and following input vectors.
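A network that outputs M(D + 2) raw values per position can be turned into a valid mixture density by constraining the mixture weights to sum to one and the deviations to be positive, in the style of the mixture density networks of [1]. The sketch below shows one such output mapping; the layout of the raw output vector is an assumption for illustration, not the paper's exact parameterization.

```python
import numpy as np

def mdn_params(raw, M, D):
    """Map M*(D+2) raw network outputs to mixture parameters:
    softmax for the M mixture weights, exp for the M radial
    standard deviations (keeps them positive), means as-is."""
    a_pi, a_sigma, a_mu = np.split(raw, [M, 2 * M])
    e = np.exp(a_pi - a_pi.max())   # numerically stable softmax
    weights = e / e.sum()
    sigmas = np.exp(a_sigma)
    means = a_mu.reshape(M, D)
    return weights, sigmas, means
```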
2.1 Mixture density extension for BRNNs

Here two types of extensions of BRNNs to mixture density networks are considered:

I) An extension to model expressions of the type P(x_t | y_1^T), a multi-modal distribution of a continuous vector conditioned on a vector sequence y_1^T, here labeled as mixture density BRNN of Type I.

II) An extension to model expressions of the type P(x_t | x_1, x_2, ..., x_{t-1}, y_1^T), a probability distribution of a continuous vector conditioned on a vector sequence y_1^T and on its previous context in time x_1, x_2, ..., x_{t-1}. This architecture is labeled as mixture density BRNN of Type II.

The first extension of conventional uni-modal regression BRNNs to mixture density networks is not particularly difficult compared to the non-recurrent implementation, because the changes to model multi-modal distributions are completely independent of the structural changes that have to be made to form a BRNN.

The second extension involves a structural change to the basic BRNN structure to incorporate x_1, x_2, ..., x_{t-1} as additional inputs, as shown in Figure 2. For any t the neighboring x_{t-1}, x_{t-2}, ... are incorporated by adding an additional set of weights to feed the hidden forward states with the extended inputs (the targets for the outputs) from the time step before. This includes x_{t-1} directly and x_{t-2}, x_{t-3}, ..., x_1 indirectly through the hidden forward neurons. This architecture allows one to estimate the generative term in (1) without making the explicit assumptions (4) and (5), since all the information x_t is conditioned on is theoretically available.

Figure 2: BRNN mixture density extension (Type II), unfolded over time steps t-1, t, t+1 (inputs: striped, outputs: black, hidden neurons: grey, additional inputs: dark grey). Note that without the backward states and the additional inputs this structure is a conventional RNN, unfolded in time.
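The forward-state update of the Type II extension can be sketched as follows: in addition to the current input y_t and the previous forward state, the state receives x_{t-1} through an extra set of weights, so x_1, ..., x_{t-1} reach the output at t through the recurrence. Backward states and the output layer are omitted, and all weight names are hypothetical; this is a minimal sketch of the structural idea, not the paper's implementation.

```python
import numpy as np

def forward_states_type2(Y, X, Wy, Wh, Wx, b):
    """Forward hidden states of a simplified Type II network:
    h_t = tanh(Wy y_t + Wh h_{t-1} + Wx x_{t-1} + b),
    with the x_{t-1} term absent at the first time step."""
    h = np.zeros(Wh.shape[0])
    states = []
    for t in range(len(Y)):
        extra = Wx @ X[t - 1] if t > 0 else 0.0  # additional input x_{t-1}
        h = np.tanh(Wy @ Y[t] + Wh @ h + extra + b)
        states.append(h)
    return np.stack(states)
```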
Different from non-recurrent mixture density networks, the extended BRNNs can predict the parameters of a Gaussian mixture distribution conditioned on a vector sequence rather than a single vector; that is, at each (time) position t one parameter set (means, variances (actually standard deviations), mixture weights), conditioned on y_1^T for the BRNN of Type I and on x_1, x_2, ..., x_{t-1}, y_1^T for the BRNN of Type II.

3 Experiments and Results

The goal of the experiments is to show that the proposed models are more suitable to model speech data than traditional approaches, because they rely on fewer assumptions. The speech data used here has observation vector sequences representing the original waveform in a compressed form, where each vector is mapped to exactly one out of K phonemes. Here three approaches are compared, which allow the estimation of the likelihood P(X|Y) with various degrees of approximation:

Conventional Gaussian mixture model, P(X|Y) ~ prod_{t=1}^T P(x_t | y_t): According to (2) the likelihood of a phoneme class vector is approximated by a conventional Gaussian mixture distribution; that is, a separate mixture model is built to estimate P(x | y = k) = p_k(x) for each of the K possible categorical states in y. In this case the two assumptions (4) and (5) are necessary. For the variance a radial covariance matrix (a single diagonal variance for all vector components) is chosen to match the conditions for the BRNN cases below. The number of parameters for the complete model is K M (D + 2) for M > 1. Several models of different complexity were trained (Table 1).
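The parameter count K M (D + 2) stated above (M means of dimension D, plus one radial variance and one mixture weight per component, per class and state) reproduces the entries of Table 1 if one assumes K = 61 monophone classes and a feature dimension of D = 30; note that this value of D is inferred from the table, not stated in the text.

```python
def gmm_param_count(K, M, D, states=1):
    """Parameters of a conventional radial-covariance Gaussian
    mixture model: per class and state, M means of dimension D
    plus M single variances plus M mixture weights."""
    return K * states * M * (D + 2)
```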
\n\nMixture density BRNN I, P(XIY) ~ 0;=1 P(xtiy[): One mixture density \nBRNN of type I, with the same number of mixture components and a radial co(cid:173)\nvariance matrix for its output distribution as in the approach above , is trained \nby presenting complete sample sequences to it. Note that for type I all possible \ncontext-dependencies (assumption (5\u00bb are automatically taken care of, because the \nprobability is conditioned on complete sequences yi . The sequence yi contains for \nany t not only the information about neighboring phonemes, but also the position of \na frame within a phoneme. In conventional systems this can only be modeled crudely \nby introducing a certain number of states per phoneme . The number of outputs \nfor the network depends on the number of mixture components and is M(D + 2) . \nThe total number of parameters can be adjusted by changing the number of hidden \nforward and backward state neurons, and was set here to 64 each. \n\nMixture density BRNN II, P(XIY) = 0;-1 P(xtix l,X2 , ... , Xt-l , yf): \nOne mixture density BRNN of type II, again with the same number of mixture \ncomponents and a radial covariance matrix, is trained under the same conditions as \nabove. Note that in this case both assumptions (4) and (5) are taken care of, be(cid:173)\ncause exactly expressions of the required form can be modeled by a mixture density \nBRNN of type II. \n\n3.1 Experiments \n\nThe recommended training and test data of the TIMIT speech database [3] was \nused for the experiments. The TIMIT database comes with hand-aligned phonetic \ntranscriptions for all utterances , which were transformed to sequences of categorical \nclass numbers (training = 702438 , test = 256617 vec.). The number of possible \ncategorical classes is the number of phonemes, f{ = 61. The categorical data \n(input data for the BRNNs) is represented as f{-dimensional vectors with the kth \ncomponent being one and all others zero. 
The feature extraction for the waveforms, which resulted in the vector sequences x_1^T to model, was done as in most speech recognition systems [7]. The variances were normalized with respect to all training data, such that a radial variance for each mixture component in the model is a reasonable choice.

All three model types were trained with M = 1, 2, 3, 4 mixture components, the conventional Gaussian mixture model also with M = 8, 16. The number of resulting parameters, used as a rough complexity measure for the models, is shown in Table 1. The states of the triphone models were not clustered.

Table 1: Number of parameters for different types of models

mixture      mono61    mono61    tri571    BRNN I    BRNN II
components   1-state   3-state   3-state
1            1952      5856      54816     20256     22176
2            3904      11712     109632    24384     26304
3            5856      17568     164448    28512     30432
4            7808      23424     219264    32640     34560
8            15616     46848     438528    -         -
16           31232     93696     877056    -         -

Training for the conventional approach using M mixtures of Gaussians was done using the EM algorithm. For some classes with only a few samples M had to be reduced to reach a stationary point of the likelihood. Training of the BRNNs of both types must be done using a gradient descent algorithm. Here a modified version of RPROP [4] was used, which is described in more detail in [6].

The measure used in comparing the tested approaches is the log-likelihood of training and test data given the models built on the training data. In the absence of a search algorithm to perform recognition this is a valid measure to evaluate the models, since maximizing the log-likelihood on the training data is the objective for all model types.
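For the conventional mixture models, the per-frame log-likelihood under a fixed frame-to-class alignment can be computed with a log-sum-exp over the M components. A minimal sketch under the paper's radial-covariance setting; the parameter arrays indexed by class are hypothetical.

```python
import numpy as np

def avg_log_likelihood(X, y, weights, means, sigmas):
    """Average per-frame log-likelihood of frames X under per-class
    radial Gaussian mixtures, using the given alignment y (no search).
    Shapes per class k: means[k] (M, D), sigmas[k] (M,), weights[k] (M,)."""
    D = X.shape[1]
    total = 0.0
    for x, k in zip(X, y):
        d2 = ((x - means[k]) ** 2).sum(axis=1)        # squared distances, (M,)
        log_comp = (np.log(weights[k])
                    - 0.5 * D * np.log(2 * np.pi * sigmas[k] ** 2)
                    - 0.5 * d2 / sigmas[k] ** 2)
        m = log_comp.max()
        total += m + np.log(np.exp(log_comp - m).sum())  # log-sum-exp
    return total / len(X)
```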
Note that the given alignment of vectors to phoneme classes for the test data is used in calculating the log-likelihood on the test data - there is no search for the best alignment.

3.2 Results

Figure 3 shows the average log-likelihoods depending on the number of mixture components for all tested approaches on training (upper line) and test data (lower line). The baseline 1-state monophones give the lowest likelihood. The 3-state monophones are slightly better, but have a larger gap between training and test data likelihood. For comparison, on the training data a system with 571 distinct triphones with 3 states each was trained also. Note that this system has a lot more parameters than the BRNN systems it was compared to (see Table 1). The results for the traditional Gaussian mixture systems show how the models become better by building more detailed models for different (phonetic) contexts, i.e., by using more states and more context classes.

The mixture density BRNN of Type I gives a higher likelihood than the traditional Gaussian mixture models. This was expected because the BRNN Type I models are, in contrast to the traditional Gaussian mixture models, able to include all possible phonetic context effects by removing assumption (5) - i.e., a frame of a certain phoneme may be surrounded by frames of any other phonemes, with theoretically no restriction on the range of the contextual influence.

The mixture density BRNN of Type II, which in addition removes the independence assumption (4), gives a significantly higher likelihood than all other models. Note that the difference in likelihood on training and test data for this model is very small, indicating a useful model for the underlying distribution of the data.
", "award": [], "sourceid": 1778, "authors": [{"given_name": "Mike", "family_name": "Schuster", "institution": null}]}