{"title": "Predictive State Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6053, "page_last": 6064, "abstract": "We present a new model, Predictive State Recurrent Neural Networks (PSRNNs), for filtering and prediction in dynamical systems. PSRNNs draw on insights from both Recurrent Neural Networks (RNNs) and Predictive State Representations (PSRs), and inherit advantages from both types of models. Like many successful RNN architectures, PSRNNs use (potentially deeply composed) bilinear transfer functions to combine information from multiple sources. We show that such bilinear functions arise naturally from state updates in Bayes filters like PSRs, in which observations can be viewed as gating belief states. We also show that PSRNNs can be learned effectively by combining Backpropagation Through Time (BPTT) with an initialization derived from a statistically consistent learning algorithm for PSRs called two-stage regression (2SR). Finally, we show that PSRNNs can be factorized using tensor decomposition, reducing model size and suggesting interesting connections to existing multiplicative architectures such as LSTMs and GRUs. We apply PSRNNs to 4 datasets, and show that we outperform several popular alternative approaches to modeling dynamical systems in all cases.", "full_text": "Predictive State Recurrent Neural Networks\n\nCarlton Downey\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\ncmdowney@cs.cmu.edu\n\nAhmed Hefny\n\nBoyue Li\n\nCarnegie Mellon University\n\nCarnegie Mellon University\n\nPittsburgh, PA, 15213\nahefny@cs.cmu.edu\n\nPittsburgh, PA, 15213\nboyue@cs.cmu.edu\n\nByron Boots\nGeorgia Tech\n\nAtlanta, GA, 30332\n\nbboots@cc.gatech.edu\n\nGeoff Gordon\n\nCarnegie Mellon University\n\nPittsburgh, PA, 15213\nggordon@cs.cmu.edu\n\nAbstract\n\nWe present a new model, Predictive State Recurrent Neural Networks (PSRNNs),\nfor \ufb01ltering and prediction in dynamical systems. 
PSRNNs draw on insights from\nboth Recurrent Neural Networks (RNNs) and Predictive State Representations\n(PSRs), and inherit advantages from both types of models. Like many successful\nRNN architectures, PSRNNs use (potentially deeply composed) bilinear transfer\nfunctions to combine information from multiple sources. We show that such bilinear\nfunctions arise naturally from state updates in Bayes \ufb01lters like PSRs, in which\nobservations can be viewed as gating belief states. We also show that PSRNNs\ncan be learned effectively by combining Backpropagation Through Time (BPTT)\nwith an initialization derived from a statistically consistent learning algorithm\nfor PSRs called two-stage regression (2SR). Finally, we show that PSRNNs can\nbe factorized using tensor decomposition, reducing model size and suggesting\ninteresting connections to existing multiplicative architectures such as LSTMs and\nGRUs. We apply PSRNNs to 4 datasets, and show that we outperform several\npopular alternative approaches to modeling dynamical systems in all cases.\n\n1\n\nIntroduction\n\nLearning to predict temporal sequences of observations is a fundamental challenge in a range of\ndisciplines including machine learning, robotics, and natural language processing. While there are a\nwide variety of different approaches to modelling time series data, many of these approaches can be\ncategorized as either recursive Bayes Filtering or Recurrent Neural Networks.\nBayes Filters (BFs) [1] focus on modeling and maintaining a belief state: a set of statistics, which,\nif known at time t, are suf\ufb01cient to predict all future observations as accurately as if we knew the\nfull history. The belief state is generally interpreted as the statistics of a distribution over the latent\nstate of a data generating process, conditioned on history. BFs recursively update the belief state by\nconditioning on new observations using Bayes rule. 
Examples of common BFs include sequential\n\ufb01ltering in Hidden Markov Models (HMMs) [2] and Kalman Filters (KFs) [3].\nPredictive State Representations [4] (PSRs) are a variation on Bayes \ufb01lters that do not de\ufb01ne system\nstate explicitly, but proceed directly to a representation of state as the statistics of a distribution\nof features of future observations, conditioned on history. By de\ufb01ning the belief state in terms of\nobservables rather than latent states, PSRs can be easier to learn than other \ufb01ltering methods [5\u20137].\nPSRs also support rich functional forms through kernel mean map embeddings [8], and a natural\ninterpretation of model update behavior as a gating mechanism. This last property is not unique to\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fPSRs, as it is also possible to interpret the model updates of other BFs such as HMMs in terms of\ngating.\nDue to their probabilistic grounding, BFs and PSRs possess a strong statistical theory leading\nto ef\ufb01cient learning algorithms. In particular, method-of-moments algorithms provide consistent\nparameter estimates for a range of BFs including PSRs [5, 7, 9\u201311]. Unfortunately, current versions\nof method-of-moments initialization restrict BFs to relatively simple functional forms such as linear-Gaussian (KFs) or linear-multinomial (HMMs).\nRecurrent Neural Networks (RNNs) are an alternative to BFs that model sequential data via a\nparameterized internal state and update function. In contrast to BFs, RNNs are directly trained to\nminimize output prediction error, without adhering to any axiomatic probabilistic interpretation.\nExamples of popular RNN models include Long Short-Term Memory networks [12] (LSTMs), Gated\nRecurrent Units [13] (GRUs), and simple recurrent networks such as Elman networks [14].\nRNNs have several advantages over BFs. 
Their \ufb02exible functional form supports large, rich models, and they can be paired with simple gradient-based training procedures that achieve state-of-the-art\nperformance on many tasks [15]. RNNs also have drawbacks, however: unlike BFs, RNNs lack an\naxiomatic probabilistic interpretation, and are therefore dif\ufb01cult to analyze. Furthermore, despite\nstrong performance in some domains, RNNs are notoriously dif\ufb01cult to train; in particular, it is\ndif\ufb01cult to \ufb01nd good initializations.\nIn summary, RNNs and BFs offer complementary advantages and disadvantages: RNNs offer rich\nfunctional forms at the cost of statistical insight, while BFs possess a sophisticated statistical theory\nbut are restricted to simpler functional forms in order to maintain tractable training and inference. By\ndrawing insights from both Bayes Filters and RNNs we develop a novel hybrid model, Predictive\nState Recurrent Neural Networks (PSRNNs). Like many successful RNN architectures, PSRNNs\nuse (potentially deeply composed) bilinear transfer functions to combine information from multiple\nsources. We show that such bilinear functions arise naturally from state updates in Bayes \ufb01lters like\nPSRs, in which observations can be viewed as gating belief states. We show that PSRNNs directly\ngeneralize discrete PSRs, and can be learned effectively by combining Backpropagation Through\nTime (BPTT) with an approximately consistent method-of-moments initialization based on two-stage\nregression. We also show that PSRNNs can be factorized using tensor decomposition, reducing model\nsize and suggesting interesting connections to existing multiplicative architectures such as LSTMs.\n\n2 Related Work\n\nIt is well known that a principled initialization can greatly increase the effectiveness of local search\nheuristics. For example, Boots [16] and Zhang et al. 
[17] use subspace ID to initialize EM for linear\ndynamical systems, and Ko and Fox [18] use N4SID [19] to initialize GP-Bayes \ufb01lters.\nPasa et al. [20] propose an HMM-based pre-training algorithm for RNNs by \ufb01rst training an HMM,\nthen using this HMM to generate a new, simpli\ufb01ed dataset, and, \ufb01nally, initializing the RNN weights\nby training the RNN on this dataset.\nBelanger and Kakade [21] propose a two-stage algorithm for learning a KF on text data. Their\napproach consists of a spectral initialization, followed by \ufb01ne-tuning via EM using the ASOS method\nof Martens [22]. They show that this approach has clear advantages over either spectral learning or\nBPTT in isolation. Despite these advantages, KFs make restrictive linear-Gaussian assumptions that\npreclude their use on many interesting problems.\nDowney et al. [23] propose a two-stage algorithm for learning discrete PSRs, consisting of a spectral\ninitialization followed by BPTT. While that work is similar in spirit to the current paper, it is still an\nattempt to optimize a BF using BPTT rather than an attempt to construct a true hybrid model. This\nresults in several key differences: they focus on the discrete setting, and they optimize only a subset\nof the model parameters.\nHaarnoja et al. [24] also recognize the complementary advantages of Bayes Filters and RNNs, and\npropose a new network architecture attempting to combine some of the advantages of both. Their\napproach differs substantially from ours as they propose a network consisting of a Bayes Filter\nconcatenated with an RNN, which is then trained end-to-end via backprop. In contrast, our entire\nnetwork architecture has a dual interpretation as both a Bayes \ufb01lter and an RNN. Because of this,\n\n2\n\n\four entire network can be initialized via an approximately consistent method-of-moments algorithm,\nsomething not possible in [24].\nFinally, Kossai\ufb01 et al. 
[25] also apply tensor decomposition in the neural network setting. They\npropose a novel neural network layer, based on low rank tensor factorization, which can directly\nprocess tensor input. This is in contrast to a standard approach where the data is \ufb02attened to a vector.\nWhile they also recognize the strength of the multilinear structure implied by tensor weights, both\ntheir setting and their approach differ from ours: they focus on factorizing tensor input data, while\nwe focus on factorizing parameter tensors which arise naturally from a kernelized interpretation of\nBayes rule.\n\n3 Background\n\n3.1 Predictive State Representations\n\nPredictive state representations (PSRs) [4] are a class of models for \ufb01ltering, prediction, and simulation\nof discrete time dynamical systems. PSRs provide a compact representation of a dynamical system\nby representing state as a set of predictions of features of future observations.\nLet ft = f (ot:t+k\u22121) be a vector of features of future observations and let ht = h(o1:t\u22121) be a\nvector of features of historical observations. Then the predictive state is qt = qt|t\u22121 = E[ft | o1:t\u22121].\nThe features are selected such that qt determines the distribution of future observations P (ot:t+k\u22121 |\no1:t\u22121).1 Filtering is the process of mapping a predictive state qt to qt+1 conditioned on ot, while\nprediction maps a predictive state qt = qt|t\u22121 to qt+j|t\u22121 = E[ft+j | o1:t\u22121] without intervening\nobservations.\nPSRs were originally developed for discrete data as a generalization of existing Bayes Filters such as\nHMMs [4]. 
However, by leveraging the recent concept of Hilbert Space embeddings of distributions\n[26], we can embed a PSR in a Hilbert Space, and thereby handle continuous observations [8].\nHilbert Space Embeddings of PSRs (HSE-PSRs) [8] represent the state as one or more nonparametric\nconditional embedding operators in a Reproducing Kernel Hilbert Space (RKHS) [27] and use Kernel\nBayes Rule (KBR) [26] to estimate, predict, and update the state.\nFor a full treatment of HSE-PSRs see [8]. Let kf , kh, ko be translation-invariant kernels [28] de\ufb01ned\non ft, ht, and ot respectively. We use Random Fourier Features [28] (RFF) to de\ufb01ne projections\n\u03c6t = RFF (ft), \u03b7t = RFF (ht), and \u03c9t = RFF (ot) such that kf (fi, fj) \u2248 \u03c6i^T \u03c6j, kh(hi, hj) \u2248 \u03b7i^T \u03b7j, and ko(oi, oj) \u2248 \u03c9i^T \u03c9j. Using this notation, the HSE-PSR predictive state is qt = E[\u03c6t | o1:t\u22121].\nFormally, an HSE-PSR (hereafter simply referred to as a PSR) consists of an initial state b1, a 3-mode\nupdate tensor W , and a 3-mode normalization tensor Z. The PSR update equation is\n\nqt+1 = (W \u00d73 qt) (Z \u00d73 qt)^\u22121 \u00d72 ot    (1)\n\nwhere \u00d7i is tensor multiplication along the ith mode of the preceding tensor. In some settings (such\nas with discrete data) it is possible to read off the observation probability directly from W \u00d73 qt;\nhowever, in order to generalize to continuous observations with RFF features we include Z as a\nseparate parameter.\n\n3.2 Two-stage Regression\n\nHefny et al. [7] show that PSRs can be learned by solving a sequence of regression problems. This\napproach, referred to as Two-Stage Regression or 2SR, is fast, statistically consistent, and reduces to\nsimple linear algebra operations. 
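As a concrete illustration (not part of the original presentation), the update in Eq. (1) is just two mode-3 tensor contractions, a matrix inversion, and a mode-2 contraction. The NumPy sketch below uses toy dimensions of our own choosing, assumes a particular axis layout for W and Z, and substitutes a pseudo-inverse for the exact inverse purely for numerical safety:

```python
import numpy as np

def psr_update(W, Z, q, o):
    """One HSE-PSR filtering step in the spirit of Eq. (1).

    Assumed (illustrative) layout: W has shape (dq, do, dq) and
    Z has shape (do, do, dq), where dq is the state-feature dimension,
    do is the observation-feature dimension, and mode 3 is the last axis.
    A pseudo-inverse replaces the exact inverse for numerical safety.
    """
    Wq = np.einsum('ijk,k->ij', W, q)   # W x_3 q  -> (dq, do)
    Zq = np.einsum('ijk,k->ij', Z, q)   # Z x_3 q  -> (do, do)
    M = Wq @ np.linalg.pinv(Zq)         # KBR-style normalization
    return M @ o                        # x_2 o    -> new state, shape (dq,)

# toy example with random parameters
rng = np.random.default_rng(0)
dq, do = 5, 3
W = rng.standard_normal((dq, do, dq))
Z = rng.standard_normal((do, do, dq))
q = rng.standard_normal(dq)
o = rng.standard_normal(do)
q_next = psr_update(W, Z, q, o)
```

Intuitively, the first contraction plays the role of forming a joint over the next observation and next state, and the Z term renormalizes it into a conditional, which is the Bayes-rule reading of Eq. (1).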
In 2SR the PSR model parameters q1, W , and Z are learned using\nthe history features \u03b7t de\ufb01ned earlier via the following set of equations:\n\nq1 = (1/T) \u2211_{t=1}^T \u03c6t    (2)\n\nW = ( \u2211_{t=1}^T \u03c6t+1 \u2297 \u03c9t \u2297 \u03b7t ) \u00d73 ( \u2211_{t=1}^T \u03b7t \u2297 \u03c6t )^+    (3)\n\nZ = ( \u2211_{t=1}^T \u03c9t \u2297 \u03c9t \u2297 \u03b7t ) \u00d73 ( \u2211_{t=1}^T \u03b7t \u2297 \u03c6t )^+    (4)\n\nwhere ^+ denotes the Moore-Penrose pseudo-inverse. It\u2019s possible to view (2\u20134) as \ufb01rst estimating predictive\nstates by regression from history (stage 1) and then estimating parameters W and Z by regression\namong predictive states (stage 2), hence the name Two-Stage Regression; for details see [7]. Finally,\nin practice we use ridge regression in order to improve model stability, and minimize the destabilizing\neffect of rare events while preserving consistency. We could instead use nonlinear predictors in stage\n1, but with RFF features, linear regression has been suf\ufb01cient for our purposes.2 Once we learn model\nparameters, we can apply the \ufb01ltering equation (1) to obtain predictive states q1:T .\n\n1For convenience we assume that the system is k-observable: that is, the distribution of all future observations\nis determined by the distribution of the next k observations. (Note: not by the next k observations themselves.)\nAt the cost of additional notation, this restriction could easily be lifted.\n\n3\n\n\f3.3 Tensor Decomposition\n\nThe tensor Canonical Polyadic decomposition (CP decomposition) [29] can be viewed as a generalization of the Singular Value Decomposition (SVD) to tensors. If T \u2208 R^(d1\u00d7...\u00d7dk) is a tensor, then a\nCP decomposition of T is:\n\nT = \u2211_{i=1}^m a^1_i \u2297 a^2_i \u2297 ... \u2297 a^k_i\n\nwhere a^j_i \u2208 R^{dj} and \u2297 is the Kronecker product. 
The rank of T is the minimum m such that the\nabove equality holds. In other words, the CP decomposition represents T as a sum of rank-1 tensors.\n\n4 Predictive State Recurrent Neural Networks\n\nIn this section we introduce Predictive State Recurrent Neural Networks (PSRNNs), a new RNN\narchitecture inspired by PSRs. PSRNNs allow for a principled initialization and re\ufb01nement via BPTT.\nThe key contributions which led to the development of PSRNNs are: 1) a new normalization scheme\nfor PSRs which allows for effective re\ufb01nement via BPTT; 2) the extension of the 2SR algorithm to a\nmultilayered architecture; and 3) the optional use of a tensor decomposition to obtain a more scalable\nmodel.\n\n4.1 Architecture\n\nThe basic building block of a PSRNN is a 3-mode tensor, which can be used to compute a bilinear\ncombination of two input vectors. We note that, while bilinear operators are not a new development\n(e.g., they have been widely used in a variety of systems engineering and control applications for\nmany years [30]), the current paper shows how to chain these bilinear components together into a\npowerful new predictive model.\nLet qt and ot be the state and observation at time t. Let W be a 3-mode tensor, and let b be a vector.\nThe 1-layer state update for a PSRNN is de\ufb01ned as:\n\nqt+1 = (W \u00d72 ot \u00d73 qt + b) / \u2016W \u00d72 ot \u00d73 qt + b\u20162    (5)\n\nHere the 3-mode tensor of weights W and the bias vector b are the model parameters. This architecture\nis illustrated in Figure 1a. It is similar, but not identical, to the PSR update (Eq. 1); sec 3.1 gives more detail on the relationship.\n\n2Note that we can train a regression model to predict any quantity from the state. This is useful for general sequence-to-sequence mapping models. However, in this work we focus on predicting future observations.\n\n4\n\n\f
This model may appear simple, but crucially the tensor contraction\nW \u00d72 ot \u00d73 qt integrates information from qt and ot multiplicatively, and acts as a gating mechanism,\nas discussed in more detail in section 5.\nThe typical approach used to increase modeling capability for BFs (including PSRs) is to use an initial\n\ufb01xed nonlinearity to map inputs up into a higher-dimensional space [31, 30]. PSRNNs incorporate\nsuch a step, via RFFs. However, a multilayered architecture typically offers higher representation\npower for a given number of parameters [32].\nTo obtain a multilayer PSRNN, we stack the 1-layer blocks of Eq. (5) by providing the output of one\nlayer as the observation for the next layer. (The state input for each layer remains the same.) In this\nway we can obtain arbitrarily deep RNNs. This architecture is displayed in Figure 1b.\nWe choose to chain on the observation (as opposed to on the state) as this architecture leads to a\nnatural extension of 2SR to multilayered models (see Sec. 4.2). In addition, this architecture is\nconsistent with the typical approach for constructing multilayered LSTMs/GRUs [12]. Finally, this\narchitecture is suggested by the full normalized form of an HSE-PSR, where the observation is passed\nthrough two layers.\n\n(a) Single Layer PSRNN\n\n(b) Multilayer PSRNN\n\nFigure 1: PSRNN architecture: See equation 5 for details. We omit bias terms to avoid clutter.\n\n4.2 Learning PSRNNs\n\nThere are two components to learning PSRNNs: an initialization procedure followed by gradient-based re\ufb01nement. We \ufb01rst show how a statistically consistent 2SR algorithm derived for PSRs can\nbe used to initialize the PSRNN model; this model can then be re\ufb01ned via BPTT. 
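To make the update and the layer-chaining rule concrete, here is a small NumPy sketch (ours, not from the original text) of the block in Eq. (5) and of one stacked step; the axis ordering of W and the toy dimensions are illustrative assumptions:

```python
import numpy as np

def psrnn_cell(W, b, q, o):
    """One PSRNN block, Eq. (5): bilinear contraction plus bias,
    followed by two-norm normalization.
    Assumed layout: W[next_state, obs, state]."""
    v = np.einsum('ijk,j,k->i', W, o, q) + b
    return v / np.linalg.norm(v)

def multilayer_step(layers, states, o):
    """Stacked PSRNN step (Sec. 4.1): each layer keeps its own state,
    and the output state of one layer is fed as the 'observation'
    of the next layer."""
    new_states, inp = [], o
    for (W, b), q in zip(layers, states):
        q_new = psrnn_cell(W, b, q, inp)
        new_states.append(q_new)
        inp = q_new
    return new_states

# toy two-layer network with random parameters
rng = np.random.default_rng(0)
dq, do = 5, 3
layer1 = (rng.standard_normal((dq, do, dq)), rng.standard_normal(dq))
# layer 2 sees a dq-dimensional "observation" (the output of layer 1)
layer2 = (rng.standard_normal((dq, dq, dq)), rng.standard_normal(dq))
states = [rng.standard_normal(dq), rng.standard_normal(dq)]
states = multilayer_step([layer1, layer2], states, rng.standard_normal(do))
```

Note that after the update every layer's state is unit-norm by construction, which is exactly the two-norm normalization of Eq. (5).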
We omit the\nBPTT equations as they are similar to those in the existing literature, and can be easily obtained via automatic\ndifferentiation in a neural network library such as PyTorch or TensorFlow.\nThe Kernel Bayes Rule portion of the PSR update (equation 1) can be separated into two terms:\n(W \u00d73 qt) and (Z \u00d73 qt)^\u22121. The \ufb01rst term corresponds to calculating the joint distribution, while\nthe second term corresponds to normalizing the joint to obtain the conditional distribution. In the\ndiscrete case, this is equivalent to dividing the joint distribution of ft+1 and ot by the marginal of ot;\nsee [33] for details.\nIf we remove the normalization term, and replace it with two-norm normalization, the PSR update\nbecomes qt+1 = (W \u00d73 qt \u00d72 ot) / \u2016W \u00d73 qt \u00d72 ot\u2016, which corresponds to calculating the joint distribution (up to a scale\nfactor), and has the same functional form as our single-layer PSRNN update equation (up to bias).\nIt is not immediately clear that this modi\ufb01cation is reasonable. We show in appendix B that our\nalgorithm is consistent in the discrete (realizable) setting; however, to our current knowledge we\nlose the consistency guarantees of the 2SR algorithm in the full continuous setting. Despite this, we\ndetermined experimentally that replacing full normalization with two-norm normalization appears to\nhave a minimal effect on model performance prior to re\ufb01nement, and results in improved performance\nafter re\ufb01nement. Finally, we note that working with the (normalized) joint distribution in place of the\nconditional distribution is a commonly made simpli\ufb01cation in the systems literature, and has been\nshown to work well in practice [34].\nThe adaptation of the two-stage regression algorithm of Hefny et al. [7] described above allows us\nto initialize 1-layer PSRNNs; we now extend this approach to multilayered PSRNNs. Suppose we\nhave learned a 1-layer PSRNN P using two-stage regression. 
We can use P to perform \ufb01ltering\non a dataset to generate a sequence of estimated states \u02c6q1, ..., \u02c6qn. According to the architecture\ndescribed in Figure 1b, these states are treated as observations in the second layer. Therefore we\ncan initialize the second layer by an additional iteration of two-stage regression using our estimated\nstates \u02c6q1, ..., \u02c6qn in place of observations. This process can be repeated as many times as desired to\ninitialize an arbitrarily deep PSRNN. If the \ufb01rst layer were learned perfectly, the second layer would\nbe super\ufb02uous; however, in practice, we observe that the second layer is able to learn to improve on\nthe \ufb01rst layer\u2019s performance.\n\n5\n\n\fOnce we have obtained a PSRNN using the 2SR approach described above, we can use BPTT to\nre\ufb01ne the PSRNN. We note that we choose to use 2-norm divisive normalization because it is not\npractical to perform BPTT through the matrix inverse required in PSRs: the inverse operation is\nill-conditioned in the neighborhood of any singular matrix. We observe that 2SR provides us with an\ninitialization which converges to a good local optimum.\n\n4.3 Factorized PSRNNs\n\nIn this section we show how the PSRNN model can be factorized to reduce the number of parameters\nprior to applying BPTT.\nLet (W, b) be a PSRNN block. Suppose we decompose W using CP decomposition to obtain\n\nW = \u2211_{i=1}^n ai \u2297 bi \u2297 ci\n\nLet A (similarly B, C) be the matrix whose ith row is ai (respectively bi, ci). Then the PSRNN state\nupdate (equation (5)) becomes (up to normalization):\n\nqt+1 = W \u00d72 ot \u00d73 qt + b    (6)\n= (A \u2297 B \u2297 C) \u00d72 ot \u00d73 qt + b    (7)\n= A^T (B ot \u2299 C qt) + b    (8)\n\nwhere \u2299 is the Hadamard product. We call a PSRNN of this form a factorized PSRNN. This\nmodel architecture is illustrated in Figure 2. 
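Since Eqs. (6)–(8) assert an exact algebraic identity, it is easy to verify numerically. The following sketch (toy sizes of our own choosing, not from the original text) rebuilds W from random CP factors and checks the full tensor contraction against the Hadamard form:

```python
import numpy as np

rng = np.random.default_rng(1)
n, dq, do = 4, 5, 3                 # n = rank of the CP factorization
A = rng.standard_normal((n, dq))    # rows a_i
B = rng.standard_normal((n, do))    # rows b_i
C = rng.standard_normal((n, dq))    # rows c_i
b = rng.standard_normal(dq)
q = rng.standard_normal(dq)
o = rng.standard_normal(do)

# full tensor from the CP factors: W = sum_i a_i (x) b_i (x) c_i
W = np.einsum('ni,nj,nk->ijk', A, B, C)

# Eq. (6): contract the full 3-mode tensor
full = np.einsum('ijk,j,k->i', W, o, q) + b
# Eq. (8): Hadamard form, which never materializes W
fact = A.T @ ((B @ o) * (C @ q)) + b

assert np.allclose(full, fact)
```

The Hadamard form never builds the full tensor, which is where the parameter savings of a factorized PSRNN come from: storage drops from dq·do·dq entries to n·(2·dq + do).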
Using a factorized PSRNN provides us with complete\ncontrol over the size of our model via the rank of the factorization. Importantly, it decouples the\nnumber of model parameters from the number of states, allowing us to set these two hyperparameters\nindependently.\n\nFigure 2: Factorized PSRNN Architecture\n\nWe determined experimentally that factorized PSRNNs are poorly conditioned when compared with\nPSRNNs, due to very large and very small numbers often occurring in the CP decomposition. To\nalleviate this issue, we need to initialize the bias b in a factorized PSRNN to be a small multiple of\nthe mean state. This acts to stabilize the model, regularizing gradients and preventing us from moving\naway from the good local optimum provided by 2SR.\nWe note that a similar stabilization happens automatically in randomly initialized RNNs: after the\n\ufb01rst few iterations the gradient updates cause the biases to become non-zero, stabilizing the model\nand resulting in subsequent gradient descent updates being reasonable. Initialization of the biases is\nonly a concern for us because we do not want the original model to move away from our carefully\nprepared initialization due to extreme gradients during the \ufb01rst few steps of gradient descent.\nIn summary, we can learn factorized PSRNNs by \ufb01rst using 2SR to initialize a PSRNN, then using CP\ndecomposition to factorize the tensor model parameters to obtain a factorized PSRNN, then applying\nBPTT to re\ufb01ne the factorized PSRNN.\n\n6\n\n\f5 Discussion\n\nThe value of bilinear units in RNNs was the focus of recent work by Wu et al. [35]. They introduced the\nconcept of Multiplicative Integration (MI) units \u2014 components of the form Ax \u2299 By \u2014 and showed\nthat replacing additive units by multiplicative ones in a range of architectures leads to signi\ufb01cantly\nimproved performance. As Eq. 
(8) shows, factorizing W leads precisely to an architecture with MI\nunits.\nModern RNN architectures such as LSTMs and GRUs are known to outperform traditional RNN\narchitectures on many problems [12]. While the success of these methods is not fully understood,\nmuch of it is attributed to the fact that these architectures possess a gating mechanism which allows\nthem both to remember information for a long time, and also to forget it quickly. Crucially, we note\nthat PSRNNs also allow for a gating mechanism. To see this, consider a single entry in the factorized\nPSRNN update (omitting normalization):\n\n[qt+1]i = \u2211_j Aji ( \u2211_k Bjk [ot]k \u2299 \u2211_l Cjl [qt]l ) + [b]i    (9)\n\nThe current state qt will only contribute to the new state if the function \u2211_k Bjk [ot]k of ot is non-zero.\nOtherwise ot will cause the model to forget this information: the bilinear component of the PSRNN\narchitecture naturally achieves gating.\nWe note that similar bilinear forms occur as components of many successful models. For example,\nconsider the (one layer) GRU update equation:\n\nzt = \u03c3(Wz ot + Uz qt + cz)\nrt = \u03c3(Wr ot + Ur qt + cr)\nqt+1 = zt \u2299 qt + (1 \u2212 zt) \u2299 \u03c3(Wh ot + Uh(rt \u2299 qt) + ch)\n\nThe GRU update is a convex combination of the existing state qt and an update term Wh ot + Uh(rt \u2299 qt) + ch. We see that the core part of this update term, Uh(rt \u2299 qt) + ch, bears a striking similarity to\nour factorized PSRNN update. The PSRNN update is simpler, though, since it omits the nonlinearity\n\u03c3(\u00b7), and hence is able to combine pairs of linear updates inside and outside \u03c3(\u00b7) into a single matrix.\nFinally, we would like to highlight the fact that, as discussed in section 5, the bilinear form shared in\nsome form by these models (including PSRNNs) resembles the \ufb01rst component of the Kernel Bayes\nRule update function. 
This observation suggests that bilinear components are a natural structure to\nuse when constructing RNNs, and may help explain the success of the above methods over alternative\napproaches. This hypothesis is supported by the fact that there are no activation functions (other than\ndivisive normalization) present in our PSRNN architecture, yet it still manages to achieve strong\nperformance.\n\n6 Experimental Setup\n\nIn this section we describe the datasets, models, model initializations, model hyperparameters, and\nevaluation metrics used in our experiments.\nWe use the following datasets in our experiments:\n\u2022 Penn Tree Bank (PTB) This is a standard benchmark in the NLP community [36]. Due to\nhardware limitations we use a train/test split of 120780/124774 characters.\n\u2022 Swimmer We consider the 3-link simulated swimmer robot from the open-source package\nOpenAI gym.3 The observation model returns the angular position of the nose as well as the\nangles of the two joints. We collect 25 trajectories from a robot that is trained to swim forward\n(via the cross-entropy method with a linear policy), with a train/test split of 20/5.\n\u2022 Mocap This is a Human Motion Capture dataset consisting of 48 skeletal tracks from three human\nsubjects collected while they were walking. The tracks have 300 timesteps each, and are from\na Vicon motion capture system. We use a train/test split of 40/8. Features consist of the 3D\npositions of the skeletal parts (e.g., upper back, thorax, clavicle).\n\n3https://gym.openai.com/\n\n7\n\n\f\u2022 Handwriting This is a digit database available on the UCI repository [37, 38] created using a\npressure-sensitive tablet and a cordless stylus. Features are x and y tablet coordinates and pressure\nlevels of the pen at a sampling rate of 100 milliseconds. We use 25 trajectories with a train/test\nsplit of 20/5.\n\nModels compared are LSTMs [12], GRUs [13], basic RNNs [14], KFs [3], PSRNNs, and factorized\nPSRNNs. 
All models except KFs consist of a linear encoder, a recurrent module, and a linear decoder.\nThe encoder maps observations to a compressed representation; in the context of text data it can be\nviewed as a word embedding. The recurrent module maps a state and an observation to a new state\nand an output. The decoder maps an output to a predicted observation.4 We initialize the LSTMs and\nRNNs with random weights and zero biases according to the Xavier initialization scheme [39]. We\ninitialize the KF using the 2SR algorithm described in [7]. We initialize PSRNNs and factorized\nPSRNNs as described in section 4.2.\nIn two-stage regression we use a ridge parameter of 10^{\u22122} n where n is the number of training\nexamples (this is consistent with the values suggested in [8]). (Experiments show that our approach\nworks well for a wide variety of hyperparameter values.) We use a horizon of 1 in the PTB experiments,\nand a horizon of 10 in all continuous experiments. We use 2000 RFFs from a Gaussian kernel, selected\naccording to the method of [28], and with the kernel width selected as the median pairwise distance.\nWe use 20 hidden states, and a \ufb01xed learning rate of 1 in all experiments. We use a BPTT horizon of\n35 in the PTB experiments, and an in\ufb01nite BPTT horizon in all other experiments. All models are\nsingle layer unless stated otherwise.\nWe optimize models on the PTB using Bits Per Character (BPC) and evaluate them using both BPC\nand one-step prediction accuracy (OSPA). We optimize and evaluate all continuous experiments using\nthe Mean Squared Error (MSE).\n\n7 Results\n\nIn Figure 3a we compare performance of LSTMs, GRUs, and Factorized PSRNNs on PTB, where\nall models have the same number of states and approximately the same number of parameters. To\nachieve this we use a factorized PSRNN of rank 60. We see that the factorized PSRNN signi\ufb01cantly\noutperforms LSTMs and GRUs on both metrics. 
In Figure 3b we compare the performance of 1- and\n2-layer PSRNNs on PTB. We see that adding an additional layer signi\ufb01cantly improves performance.\n\n4This is a standard RNN architecture; e.g., a PyTorch implementation of this architecture for text prediction\n\ncan be found at https://github.com/pytorch/examples/tree/master/word_language_model.\n\n(a) BPC and OSPA on PTB. All\nmodels have the same number of\nstates and approximately the same\nnumber of parameters.\n\n(b) Comparison between 1- and 2-\nlayer PSRNNs on PTB.\n\n(c) Cross-entropy and prediction\naccuracy on Penn Tree Bank for\nPSRNNs and factorized PSRNNs\nof various rank.\n\nFigure 3: PTB Experiments\n\n8\n\n\fIn Figure 3c we compare PSRNNs with factorized PSRNNs on the PTB. We see that PSRNNs\noutperform factorized PSRNNs regardless of rank, even when the factorized PSRNN has signi\ufb01cantly\nmore model parameters. (In this experiment, factorized PSRNNs of rank 7 or greater have more\nmodel parameters than a plain PSRNN.) This observation makes sense, as the PSRNN provides a\nsimpler optimization surface: the tensor multiplication in each layer of a PSRNN is linear with respect\nto the model parameters, while the tensor multiplication in each layer of a Factorized PSRNN is\nbilinear. In addition, we see that higher-rank factorized models outperform lower-rank ones. However,\nit is worth noting that even models with low rank still perform well, as demonstrated by our rank 40\nmodel still outperforming GRUs and LSTMs, despite having fewer parameters.\n\n(a) MSE vs Epoch on the Swimmer, Mocap, and Handwriting datasets\n\n(b) Test Data vs Model Prediction on a single feature of Swimmer. The \ufb01rst row shows initial performance. The\nsecond row shows performance after training. 
The columns show, in order: KF, RNN, GRU, LSTM, and PSRNN.

Figure 4: Swimmer, Mocap, and Handwriting Experiments

In Figure 4a we compare model performance on the Swimmer, Mocap, and Handwriting datasets. We see that PSRNNs significantly outperform alternative approaches on all datasets. In Figure 4b we attempt to gain insight into why using 2SR to initialize our models is so beneficial. We visualize the one-step model predictions before and after BPTT. We see that the behavior of the initialization has a large impact on the behavior of the refined model. For example, the initial (incorrect) oscillatory behavior of the RNN in the second column is preserved even after gradient descent.

8 Conclusions

We present PSRNNs: a new approach for modelling time-series data that hybridizes Bayes filters with RNNs. PSRNNs have both a principled initialization procedure and a rich functional form. The basic PSRNN block consists of a 3-mode tensor, corresponding to a bilinear combination of the state and observation, followed by divisive normalization. These blocks can be arranged in layers to increase the expressive power of the model. We showed that tensor CP decomposition can be used to obtain factorized PSRNNs, which allow flexibly selecting the number of states and model parameters. We showed how factorized PSRNNs can be viewed as both an instance of Kernel Bayes Rule and a gated architecture, and discussed links to existing multiplicative architectures such as LSTMs. We applied PSRNNs to 4 datasets and showed that we outperform alternative approaches in all cases.

Acknowledgements The authors gratefully acknowledge support from ONR (grant number N000141512365) and DARPA (grant number FA87501720152).

References

[1] Sam Roweis and Zoubin Ghahramani. A unifying review of linear Gaussian models. Neural Comput., 11(2):305-345, February 1999. ISSN 0899-7667.
doi: 10.1162/089976699300016674. URL http://dx.doi.org/10.1162/089976699300016674.

[2] Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37:1554-1563, 1966.

[3] R. E. Kalman. A new approach to linear filtering and prediction problems. ASME Journal of Basic Engineering, 1960.

[4] Michael L. Littman, Richard S. Sutton, and Satinder Singh. Predictive representations of state. In Advances in Neural Information Processing Systems 14, pages 1555-1561. MIT Press, 2001.

[5] Byron Boots, Sajid Siddiqi, and Geoffrey Gordon. Closing the learning-planning loop with predictive state representations. International Journal of Robotics Research (IJRR), 30:954-956, 2011.

[6] Byron Boots and Geoffrey Gordon. An online spectral learning algorithm for partially observable nonlinear dynamical systems. In Proceedings of the 25th National Conference on Artificial Intelligence (AAAI), 2011.

[7] Ahmed Hefny, Carlton Downey, and Geoffrey J. Gordon. Supervised learning for dynamical system learning. In Advances in Neural Information Processing Systems, pages 1963-1971, 2015.

[8] Byron Boots, Geoffrey J. Gordon, and Arthur Gretton. Hilbert space embeddings of predictive state representations. CoRR, abs/1309.6819, 2013. URL http://arxiv.org/abs/1309.6819.

[9] Daniel J. Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. CoRR, abs/0811.4413, 2008.

[10] Amirreza Shaban, Mehrdad Farajtabar, Bo Xie, Le Song, and Byron Boots. Learning latent variable models by improving spectral solutions with exterior point methods. In Proceedings of The International Conference on Uncertainty in Artificial Intelligence (UAI), 2015.

[11] Peter Van Overschee and Bart De Moor. N4SID: numerical algorithms for state space subspace system identification. In Proc. of the World Congress of the International Federation of Automatic Control, IFAC, volume 7, pages 361-364, 1993.

[12] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735-1780, November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735.

[13] KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014.

[14] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179-211, 1990.

[15] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215, 2014. URL http://arxiv.org/abs/1409.3215.

[16] Byron Boots. Learning stable linear dynamical systems. Online. Available: https://www.ml.cmu.edu/research/dap-papers/dap_boots.pdf, 2009.

[17] Yuchen Zhang, Xi Chen, Denny Zhou, and Michael I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in Neural Information Processing Systems, pages 1260-1268, 2014.

[18] Jonathan Ko and Dieter Fox. Learning GP-BayesFilters via Gaussian process latent variable models. Autonomous Robots, 30(1):3-23, 2011.

[19] Peter Van Overschee and Bart De Moor. N4SID: subspace algorithms for the identification of combined deterministic-stochastic systems. Automatica, 30(1):75-93, January 1994. ISSN 0005-1098. doi: 10.1016/0005-1098(94)90230-5. URL http://dx.doi.org/10.1016/0005-1098(94)90230-5.

[20] Luca Pasa, Alberto Testolin, and Alessandro Sperduti. A HMM-based pre-training approach for sequential data. In 22nd European Symposium on Artificial Neural Networks, ESANN 2014, Bruges, Belgium, April 23-25, 2014. URL http://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2014-166.pdf.

[21] David Belanger and Sham Kakade.
A linear dynamical system model for text. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 833-842, Lille, France, 07-09 Jul 2015. PMLR.

[22] James Martens. Learning the linear dynamical system with ASOS. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 743-750, 2010.

[23] Carlton Downey, Ahmed Hefny, and Geoffrey Gordon. Practical learning of predictive state representations. Technical report, Carnegie Mellon University, 2017.

[24] Tuomas Haarnoja, Anurag Ajay, Sergey Levine, and Pieter Abbeel. Backprop KF: Learning discriminative deterministic state estimators. In Advances in Neural Information Processing Systems, pages 4376-4384, 2016.

[25] Jean Kossaifi, Zachary C. Lipton, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. Tensor regression networks. arXiv preprint arXiv:1707.08308, 2017.

[26] Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pages 13-31. Springer, 2007.

[27] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404, 1950.

[28] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177-1184, 2008.

[29] Frank L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. Studies in Applied Mathematics, 6(1-4):164-189, 1927.

[30] Lennart Ljung. System Identification. Wiley Online Library, 1999.

[31] Le Song, Byron Boots, Sajid M. Siddiqi, Geoffrey J. Gordon, and Alex J. Smola. Hilbert space embeddings of hidden Markov models.
In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 991-998. Omnipress, 2010. URL http://www.icml2010.org/papers/495.pdf.

[32] Yoshua Bengio et al. Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2(1):1-127, 2009.

[33] Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961-968. ACM, 2009.

[34] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics. MIT Press, 2005.

[35] Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan Salakhutdinov. On multiplicative integration with recurrent neural networks. CoRR, abs/1606.06630, 2016. URL http://arxiv.org/abs/1606.06630.

[36] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330, 1993.

[37] E. Alpaydin and Fevzi Alimoglu. Pen-Based Recognition of Handwritten Digits Data Set. https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits.

[38] E. Alpaydin and Fevzi Alimoglu. Pen-based recognition of handwritten digits data set. University of California, Irvine, Machine Learning Repository. Irvine: University of California, 1998.

[39] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10).
Society for Artificial Intelligence and Statistics, 2010.