{"title": "Predictive-State Decoders: Encoding the Future into Recurrent Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1172, "page_last": 1183, "abstract": "Recurrent neural networks (RNNs) are a vital modeling technique that rely on internal states learned indirectly by optimization of a supervised, unsupervised, or reinforcement training loss. RNNs are used to model dynamic processes that are characterized by underlying latent states whose form is often unknown, precluding its analytic representation inside an RNN. In the Predictive-State Representation (PSR) literature, latent state processes are modeled by an internal state representation that directly models the distribution of future observations, and most recent work in this area has relied on explicitly representing and targeting sufficient statistics of this probability distribution. We seek to combine the advantages of RNNs and PSRs by augmenting existing state-of-the-art recurrent neural networks with Predictive-State Decoders (PSDs), which add supervision to the network's internal state representation to target predicting future observations. PSDs are simple to implement and easily incorporated into existing training pipelines via additional loss regularization. We demonstrate the effectiveness of PSDs with experimental results in three different domains: probabilistic filtering, Imitation Learning, and Reinforcement Learning. In each, our method improves statistical performance of state-of-the-art recurrent baselines and does so with fewer iterations and less data.", "full_text": "Predictive-State Decoders:\n\nEncoding the Future into Recurrent Networks\n\nArun Venkatraman1\u2217, Nicholas Rhinehart1\u2217, Wen Sun1,\n\nLerrel Pinto1, Martial Hebert1, Byron Boots2, Kris M. Kitani1, J. 
Andrew Bagnell1\n\n1The Robotics Institute, Carnegie-Mellon University, Pittsburgh, PA\n\n2School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA\n\nAbstract\n\nRecurrent neural networks (RNNs) are a vital modeling technique that rely on\ninternal states learned indirectly by optimization of a supervised, unsupervised, or\nreinforcement training loss. RNNs are used to model dynamic processes that are\ncharacterized by underlying latent states whose form is often unknown, precluding\nits analytic representation inside an RNN. In the Predictive-State Representation\n(PSR) literature, latent state processes are modeled by an internal state representa-\ntion that directly models the distribution of future observations, and most recent\nwork in this area has relied on explicitly representing and targeting suf\ufb01cient statis-\ntics of this probability distribution. We seek to combine the advantages of RNNs\nand PSRs by augmenting existing state-of-the-art recurrent neural networks with\nPREDICTIVE-STATE DECODERS (PSDs), which add supervision to the network\u2019s\ninternal state representation to target predicting future observations. PSDs are\nsimple to implement and easily incorporated into existing training pipelines via\nadditional loss regularization. We demonstrate the effectiveness of PSDs with\nexperimental results in three different domains: probabilistic \ufb01ltering, Imitation\nLearning, and Reinforcement Learning. In each, our method improves statistical\nperformance of state-of-the-art recurrent baselines and does so with fewer iterations\nand less data.\n\n1\n\nIntroduction\n\nDespite their wide success in a variety of domains, recurrent neural networks (RNNs) are often\ninhibited by the dif\ufb01culty of learning an internal state representation. Internal state is a unifying\ncharacteristic of RNNs, as it serves as an RNN\u2019s memory. 
Learning these internal states is challenging\nbecause optimization is guided by the indirect signal of the RNN\u2019s target task, such as maximizing\nthe cost-to-go for reinforcement learning or maximizing the likelihood of a sequence of words. These\ntarget tasks have a latent state sequence that characterizes the underlying sequential data-generating\nprocess. Unfortunately, most settings do not afford a parametric model of latent state that is available\nto the learner.\nHowever, recent work has shown that in certain settings, latent states can be characterized by\nobservations alone [8, 24, 26] \u2013 which are almost always available to recurrent models. In such\npartially-observable problems (e.g. Fig. 1a), a single observation is not guaranteed to contain enough\ninformation to fully represent the system\u2019s latent state. For example, a single image of a robot is\ninsuf\ufb01cient to characterize its latent velocity and acceleration. While a latent state parametrization\nmay be known in some domains \u2013 e.g. a simple pendulum can be suf\ufb01ciently modeled by its angle\nand angular velocity (\u03b8, \u02d9\u03b8) \u2013 data from most domains cannot be explicitly parametrized.\n\n\u2217Contributed equally to this work. Direct correspondence to: {arunvenk,nrhineha}@cs.cmu.edu\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\f(a) The process generating sequential data has latent\nstate st which generates the next latent state st+1. st\nis usually unknown but generates the observations xt\nwhich are used to learn a model for the system.\n\n(b) An overview of our approach for modelling the pro-\ncess from Fig. 1a. 
We attach a decoder to the internal\nstate of an RNN to predict statistics of future observa-\ntions xt to xt+k observed at training time.\n\nFigure 1: Data generation process and proposed model\n\nIn lieu of ground truth access to latent states, recurrent neural networks [32, 47] employ internal states\nto summarize previous data, serving as a learner\u2019s memory. We avoid the terminology \u201chidden state\"\nas it refers to the internal state in the RNN literature but refers to the latent state in the HMM, PSR,\nand related literature. Internal states are modi\ufb01ed towards minimizing the target application\u2019s loss,\ne.g., minimizing observation loss in \ufb01ltering or cumulative reward in reinforcement learning. The\ntarget application\u2019s loss is not directly de\ufb01ned over the internal states: they are updated via the chain\nrule (backpropagation) through the global loss. Although this modeling is indirect, recurrent networks\nnonetheless can achieve state-of-the-art results on many robotics [18, 23], vision [34, 50], and natural\nlanguage tasks [15, 20, 38] when training succeeds. However, recurrent model optimization is\nhampered by two main dif\ufb01culties: 1) non-convexity, and 2) the loss does not directly encourage\nthe internal state to model the latent state. A poor internal state representation can yield poor task\nperformance, but rarely does the task objective directly measure the quality of the internal state.\nPredictive-State Representations (PSRs) [8, 24, 26] offer an alternative internal state representation\nto that of RNNs in terms of the available observations. Spectral learning methods for PSRs provide\ntheoretical guarantees on discovering the global optimum for the model and internal state parameters\nunder the assumptions of in\ufb01nite training data and realizability. However, in the non-realizable setting\n\u2013 i.e. 
model mismatch (e.g., using learned parameters of a linear system model for a non-linear system)\n\u2013 these algorithms lose any performance guarantees on using the learned model for the target inference\ntasks. Extensions to handle nonlinear systems rely on RKHS embeddings [43], which themselves\ncan be computationally infeasible to use with large datasets. Nevertheless, when these models are\ntrainable, they often achieve strong performance [24, 45]; the structure they impose significantly\nsimplifies the learning problem.\nWe leverage ideas from both the RNN and PSR paradigms, resulting in a marriage of two orthogonal\nsequential modeling approaches. When training an RNN, PREDICTIVE-STATE DECODERS (Fig. 1b)\nprovide direct supervision on the internal state, aiding the training problem. The proposed method\ncan be viewed as an instance of Multi-Task Learning (MTL) [13] and self-supervision [27], using the\ninputs to the learner to form a secondary unsupervised objective. Our contribution is a general method\nthat improves the performance of learning RNNs for sequential prediction problems. The approach is\neasy to implement as a regularizer on traditional RNN loss functions with little overhead and can\nthus be incorporated into a variety of existing recurrent models.\nIn our experiments, we examine three domains where recurrent models are used to model temporal\ndependencies: probabilistic filtering, where we predict the future observation given past observations;\nImitation Learning, where the learner attempts to mimic an expert\u2019s actions; and Reinforcement\nLearning, where a policy is trained to maximize cumulative reward. We observe that our method\nimproves loss convergence rates and results in higher-quality final objectives in these domains.\n\n2 Latent State Space Models\n\nTo model sequential prediction problems, it is common to cast the problem into the Markov Process\nframework. 
Predictive distributions in this framework satisfy the Markov property:\n\nP (st+1|st, st\u22121, . . . , s0) = P (st+1|st)\n\n(1)\n\n2\n\n\fFigure 2: Learning recurrent models consists of learning a function f that updates the internal state\nht given the latest observation xt. The internal state may also be used to predict targets yt, such as\ncontrol actions for imitation and reinforcement learning. These are then inputs to a loss function (cid:96)\nwhich accumulate as the multi-step loss L over all timesteps.\n\nwhere st is the latent state of the system at timestep t. Intuitively, this property tells us that the\nfuture st+1 is only dependent on the current state2 st and does not depend on any previous state\ns0, . . . , st\u22121. As st is latent, the learner only has access to observations xt, which are produced by\nst. For example, in robotics, xt may be joint angles from sensors or a scene observed as an image. A\ncommon graphical model representation is shown in Fig. 1a.\nThe machine learning problem is to \ufb01nd a model f that uses the latest observation xt to recursively\nupdate an internal state, denoted ht, illustrated in Fig. 2. Note that ht is distinct from st. ht is the\nlearner\u2019s internal state, and st is the underlying con\ufb01guration of the data-generating Markov\nProcess. For example, the internal state in the Bayesian \ufb01ltering/POMDP setup is represented as a\nbelief state [49], a \u201cmemory\" unit in neural networks, or as a distribution over observations for PSRs.\nUnlike traditional supervised machine learning problems, learning models for latent state problems\nmust be accomplished without ground-truth supervision of the internal states themselves. Two distinct\nparadigms for latent state modeling exist. 
The first are discriminative approaches based on RNNs, and\nthe second is a set of theoretically well-studied approaches based on Predictive-State Representations.\nIn the following sections we provide a brief overview of each class of approach.\n\n2.1 Recurrent Models and RNNs\n\nA classical supervised machine learning approach for learning internal models involves choosing\nan explicit parametrization for the internal states and assuming ground-truth access to these states\nand observations at training time [17, 29, 33, 37]. These models focus on learning only the recursive\nmodel f in Fig. 2, assuming access to the st (Fig. 1a) at training time. Another class of approaches\ndrops the assumption of access to ground truth but still assumes a parametrization of the internal state.\nThese models set up a multi-step prediction error and use expectation maximization to alternate\nbetween optimizing over the model\u2019s parameters and the internal state values [2, 19, 16].\nWhile imposing a fixed representation on the internal state adds structure to the learning problem, it\ncan limit performance. For many problems such as speech recognition [20] or text generation [48], it\nis difficult to fully represent a latent state inside the model\u2019s internal state. Instead, typical machine\nlearning solutions rely on the Recurrent Neural Network architecture. The RNN model (Fig. 2) uses\nthe internal state to make predictions yt = f (ht, xt) and is trained by minimizing a series of loss\nfunctions \u2113t over each prediction, as shown in the following optimization problem:\n\nL = min_f \u2211_t \u2113t(f (ht, xt))\n\n(2)\n\nThe loss functions \u2113t are usually application- and domain-specific. For example, in a probabilistic\nfiltering problem, the objective may be to minimize the negative log-likelihood of the observations [4, 52] or the prediction of the next observation [34]. 
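The objective in Eq. (2) can be sketched in a few lines. The following minimal NumPy example rolls an internal state forward with f and accumulates a squared next-observation loss as each \u2113t; all names and weight shapes are illustrative, not from the paper.

```python
import numpy as np

# A minimal sketch of the recurrent objective in Eq. (2): an internal state h_t
# is updated by a learned function f from the latest observation x_t, and a
# per-timestep loss l_t accumulates over the resulting predictions. All names
# and weight shapes here are illustrative, not from the paper.

rng = np.random.default_rng(0)
W_h = 0.1 * rng.normal(size=(4, 4))   # state-to-state weights of f
W_x = 0.1 * rng.normal(size=(4, 3))   # observation-to-state weights of f
W_y = 0.1 * rng.normal(size=(3, 4))   # readout producing y_t from h_t

def f(h, x):
    """Recursive update f(h_t, x_t) -> h_{t+1}."""
    return np.tanh(W_h @ h + W_x @ x)

def sequence_loss(xs):
    """L = sum_t l_t(f(h_t, x_t)), with l_t a squared next-observation error."""
    h, L = np.zeros(4), 0.0
    for t in range(len(xs) - 1):
        h = f(h, xs[t])                            # update internal state
        L += np.sum((W_y @ h - xs[t + 1]) ** 2)    # filtering-style loss l_t
    return L
```

In practice L is minimized over the parameters of f (and the readout) via BPTT; the sketch only evaluates the forward accumulation.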
For imitation learning, this objective\nfunction will penalize deviation of the prediction from the expert\u2019s action [39], and for policy-gradient reinforcement learning methods, the objective includes the log-likelihood of choosing\nactions weighted by their observed returns. In general, the task objective optimized by the network\ndoes not directly specify a loss over the values of the internal state ht.\n\n2In Markov Decision Processes (MDPs), P (st+1|st) may depend on an action taken at st.\n\nThe general difficulty with the objective in Eq. (2) is that the recurrence with f results in a highly\nnon-convex and difficult optimization [2].\nRNN models are thus often trained with backpropagation-through-time (BPTT) [55]. BPTT allows\nfuture losses incurred at timestep t to be back-propagated and affect the parameter updates to f.\nThese updates to f then change the distribution of internal states computed during the next forward\npass through time. The difficulty is then that small updates to f can drastically change the distribution\nof ht, sometimes resulting in error exponential in the time horizon [53]. This \u201cdiffusion problem\"\ncan yield an unstable training procedure with exponentially exploding or vanishing gradients [7].\nWhile techniques such as truncated gradients [47] or gradient-clipping [35] can alleviate some of\nthese problems, each of these techniques yields stability by discarding information about how future\nobservations and predictions should backpropagate through the current internal state. A significant\ninnovation in training internal states with long-term dependence was the LSTM [25]. Many variants\non LSTMs exist (e.g. GRUs [14]), yet in the domains evaluated by Greff et al. [21], none consistently\nexhibit statistically significant improvements over LSTMs.\nIn the next section, we discuss a different paradigm for learning temporal models. 
In contrast with the\nopen-ended internal state learned by RNNs, Predictive-State methods do not parameterize a specific\nrepresentation of the internal state but use certain assumptions to construct a mathematical structure\nin terms of the observations to find a globally optimal representation.\n\n2.2 Predictive-State Models\n\nPredictive-State Representations (PSRs) address the problem of finding an internal state by formulating the representation directly in terms of observable quantities. Instead of targeting a prediction loss as with RNNs, PSRs define a belief over the distribution of k future observations,\ngt = [xt^T, . . . , xt+k\u22121^T]^T \u2208 Rkn, given all the past observations pt = [x0, . . . , xt\u22121] [10]. In the case of\nlinear systems, this k is similar to the rank of the observability matrix [6]. The key assumption in\nPSRs is that the definition of state is equivalent to having sufficient information to predict everything\nabout gt at time-step t [42], i.e. there is a bijective function that maps P (st|pt\u22121) \u2013 the distribution\nof latent state given the past \u2013 to P (gt|pt\u22121) \u2013 the belief over future observations.\nSpectral learning approaches were developed to find a globally optimal internal state representation and the transition model f for these Predictive-State models. In the controls literature, these\napproaches were developed as subspace identification [51], and in the ML literature as spectral\napproaches for partially-observed systems [9, 8, 26, 56]. A significant improvement in model\nlearning was developed by Boots et al. [10], Hefny et al. [24], where sufficient feature functions\n\u03c6 (e.g., moments) map distributions P (gt|pt) to points in feature space E [\u03c6(gt)|pt]. For example,\nE [\u03c6(gt)|pt] = E [gt, gtgt^T |pt] are the sufficient statistics for a Gaussian distribution. With this\nrepresentation, learning latent state prediction models can be reduced to supervised learning.\nHefny et al. [24] used this along with Instrumental Variable Regression [11] to develop a procedure\nthat, in the limit of infinite data, and under a linear-system realizability assumption, would converge\nto the globally optimal solution. Sun et al. [45] extended this setup to create a practical algorithm,\nPredictive-State Inference Machines (PSIMs) [44, 45, 54], based on the concept of inference machines [31, 40]. Unlike Hefny et al. [24], which attempted to find a generative observation model\nand transition model, PSIMs directly learned the filter function, an operator f that can deterministically pass the predictive states forward in time conditioned on the latest observation, by minimizing\nthe following loss over f:\n\n\u2113p = \u2211_t \u2016\u03c6(gt+1) \u2212 f (ht, xt)\u2016^2 ,   ht+1 = f (ht, xt)\n\n(3)\n\nThis loss function, which we call the predictive-state loss, forms the basis of our PREDICTIVE-STATE\nDECODERS. By minimizing this supervised loss function, PSIM assigns statistical meaning to internal\nstates: it forces the internal state ht to match sufficient statistics of future observations E [\u03c6(gt)|pt] at\nevery timestep t. We observe an empirical sample of the future gt = [xt, . . . , xt+k] at each timestep\nby looking into the future in the training dataset or by waiting for streaming future observations.\nWhereas [45] primarily studied algorithms for minimizing the predictive-state loss, we adapt it to\naugment general recurrent models such as LSTMs and for a wider variety of applications such as\nimitation and reinforcement learning.\n\nFigure 3: Predictive-State Decoders Architecture. We augment the RNN from Fig. 
2 with an\nadditional objective function R which targets decoding of the internal state through F at each time\nstep to the predictive-state, which is represented as statistics over the future observations.\n\n3 Predictive-State Decoders\n\nOur PREDICTIVE-STATE DECODERS architecture extends the Predictive-State Representation idea\nto general recurrent architectures. We hypothesize that by encouraging the internal states to encode\ninformation sufficient for reconstructing the predictive state, the resulting internal states better capture\nthe underlying dynamics and learning can be improved. The result is a simple-to-implement objective\nfunction which is coupled with the existing RNN loss. To represent arbitrary sizes and values of\nPSRs with a fixed-size internal state in the recurrent network, we attach a decoding module F (\u00b7) to\nthe internal states to produce the resulting PSR estimates. Figure 3 illustrates our approach.\nOur PSD objective R is the predictive-state loss:\n\nR = \u2211_t \u2016F (ht) \u2212 \u03c6([xt+1, xt+2, . . .])\u2016^2_2 ,   ht = f (ht\u22121, xt\u22121),\n\n(4)\n\nwhere F is a decoder that maps from the internal state ht to an empirical sample of the predictive\nstate, computed from a sequence of observed future observations available at training time. The network\nis optimized by minimizing the weighted total loss function L + \u03bbR, where \u03bb is the weighting on\nthe predictive-state objective R. This penalty encourages the internal states to encode information\nsufficient for directly predicting sufficient future observations. Unlike more standard regularization\ntechniques, R does not regularize the parameters of the network but instead regularizes the output\nvariables, the internal states predicted by the network.\nOur method may be interpreted as an instance of Multi-Task Learning (MTL) [13]. MTL has found\nuse in recent deep neural networks [5, 27, 30]. 
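As a concrete sketch of the regularizer in Eq. (4) with an affine decoder F and identity featurization \u03c6 (the simplest choices used in the experiments), R can be computed from a rollout of internal states and the recorded future observations; the names and shapes below are illustrative, not from the paper.

```python
import numpy as np

# Sketch of the PSD regularizer R in Eq. (4): an affine decoder F maps each
# internal state h_t to the empirical predictive state phi([x_{t+1},...,x_{t+k}]).
# With identity phi, the target is simply the next k observations stacked.
# Weight names (W_dec, b_dec) and shapes are illustrative, not from the paper.

def psd_regularizer(hs, xs, W_dec, b_dec, k):
    """R = sum_t || F(h_t) - [x_{t+1}, ..., x_{t+k}] ||^2 with affine F."""
    R = 0.0
    for t in range(len(xs) - k):
        target = np.concatenate(xs[t + 1 : t + 1 + k])  # empirical predictive state
        decoded = W_dec @ hs[t] + b_dec                  # F(h_t)
        R += np.sum((decoded - target) ** 2)
    return R
```

The total training objective is then L + \u03bb\u00b7R; only the decoder parameters and the scalar \u03bb are added on top of the base recurrent model.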
The idea of MTL is to employ a shared representation\nto perform complementary or similar tasks. When the learner exhibits good performance on one\ntask, some of its understanding can be transferred to a related task. In our case, forcing RNNs to be\nable to more explicitly reason about the future they will encounter is an intuitive and general method.\nEndowing RNNs with a theoretically-motivated representation of the future better enables them\nto serve their purpose of making sequential predictions, resulting in more effective learning. This\ndifference is pronounced in applications such as imitation and reinforcement learning (Sections 4.2\nand 4.3), where the primary objective is to find a control policy to maximize accumulated future\nreward while receiving only observations from the system. MTL with PSDs supervises the network\nto predict the future and, implicitly, the consequences of the learned policy. Finally, our PSD objective\ncan be considered an instance of self-supervision [27], as it uses the inputs to the learner to form a\nsecondary unsupervised objective.\nAs discussed in Section 2.1, the purpose of the internal state in recurrent network models (RNNs,\nLSTMs, deep, or otherwise) is to capture a quantity similar to that of state. Ideally, the learner\nwould be able to back-propagate through the primary objective function L and discover the best\nrepresentation of the latent state of the system towards minimizing the objective. However, as this\nproblem is highly non-convex, BPTT often yields a locally-optimal solution in a basin determined by\nthe initialization of the parameters and the dataset. By introducing R, the space of feasible models is\nreduced. We observe next how this objective leads our method to find better models.\n\n(a) Pendulum\n\n(b) Helicopter\n\n(c) Hopper\n\nFigure 4: Loss over predicting future observations during filtering. 
For both RNNs with GRU cells\n(top) and with LSTM cells (bottom), adding PSDs to the RNN networks can often improve\nperformance and convergence rate.\n\n4 Experiments\n\nWe present results on problems of increasing complexity for recurrent models: probabilistic filtering,\nImitation Learning (IL), and Reinforcement Learning (RL). The first is easiest, as the goal is to predict\nthe next observation given the current observation and internal state. For imitation learning, the\nrecurrent model is given training-time expert guidance with the goal of choosing actions to maximize\nthe sequence of future rewards. Finally, we analyze the challenging domain of reinforcement learning,\nwhere the goal is the same as imitation learning but expert guidance is unavailable.\nPREDICTIVE-STATE DECODERS require two hyperparameters: k, the number of observations that\ncharacterize the predictive state, and \u03bb, the regularization trade-off factor. In most cases, we primarily\ntune \u03bb, and set k to one of {2, . . . , 10}. For each domain, for each k, there were \u03bb values for which\nthe performance was worse than the baseline. However, for many sets of hyperparameters, the\nperformance exceeded the baselines. Most notably, for many experiments, the convergence rate was\nsignificantly better using PSDs, implying that PSDs allow for more efficient data utilization when\nlearning recurrent models.\nPSDs also require a specification of two other parameters in the architecture: the featurization function\n\u03c6 and the decoding module F . For simplicity, we use an affine function as the decoder F in Eq. (4). The\nresults presented below use an identity featurization \u03c6, but we include a short\ndiscussion of second-order featurization. We find that in each domain, we are able to improve the\nperformance of the state-of-the-art baselines. 
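The two featurizations discussed above can be written directly; the function names are illustrative. The identity \u03c6 keeps first-moment statistics of the future window, while a second-order \u03c6 appends the flattened outer product:

```python
import numpy as np

# Two choices of featurization phi for the predictive state, as discussed in
# the text: identity (first moment) and second-order, phi(x) = [x, vec(x x^T)].
# Function names are illustrative.

def phi_identity(x):
    """First-moment features: the stacked future observations themselves."""
    return np.asarray(x)

def phi_second_order(x):
    """First and (raw) second moments: [x, vec(x x^T)]."""
    x = np.asarray(x)
    return np.concatenate([x, np.outer(x, x).ravel()])
```

For a stacked future window of dimension d, the decoder must output d values under the identity \u03c6 and d + d\u00b2 under the second-order \u03c6, which is one reason the identity choice is the default.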
We observe improvements with both GRU and LSTM\ncells across a range of k and \u03bb. In IL with PSDs, we come significantly closer to and occasionally\neclipse the expert\u2019s performance, whereas the baselines never do. In our RL experiments, our method\nachieves statistically significant improvements over the state-of-the-art approach of [18, 41] on the 5\ndifferent settings we tested.\n\n4.1 Probabilistic Filtering\n\nIn the probabilistic filtering problem, the goal is to predict the future from the current internal state.\nRecurrent models for filtering use a multi-step objective function that maximizes the likelihood of the\nfuture observations over the internal states and the dynamics model f\u2019s parameters. Under a Gaussian\nassumption (e.g. like a Kalman filter [22]), the equivalent objective that minimizes the negative\nlog-likelihood is given as L = \u2211_t \u2016xt+1 \u2212 f (xt, ht)\u2016^2.\nWhile traditional methods would explicitly solve for parametric internal states ht using an EM-style approach, we use BPTT to implicitly find a non-parametric internal state. 
We optimize the\nend-to-end filtering performance through the PSD joint objective min_{f,F} L + \u03bbR. Our experimental\nresults are shown in Fig. 4, which plots observation loss against training iteration for Pendulum,\nHelicopter, and Hopper with GRU networks (top) and LSTM networks (bottom), comparing the\nbaseline against several (k, \u03bb) settings. The experiments were run with \u03c6 as the identity, capturing statistics\nrepresenting the first moment. We tested \u03c6 as second-order statistics and found that while the performance\nimproved over the baseline, it was outperformed by the first moment. In all environments, a dataset\nwas collected using a preset control policy. In the Pendulum experiments, we predict the pendulum\u2019s\nangle \u03b8. The LQR-controlled Helicopter experiments [3] use a noisy state as the observation, and\nthe Hopper dataset was generated using the OpenAI simulation [12] with a robust policy optimization\nalgorithm [36] as the controller.\n\nFigure 5: Cumulative rewards for AggreVaTeD and AggreVaTeD+PREDICTIVE-STATE DECODERS\non partially observable Acrobot and CartPole with both LSTM cells and GRU cells averaged over 15\nruns with different random seeds.\n\nWe test each environment with Tensorflow\u2019s built-in GRU and LSTM cells [1]. 
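Constructing the empirical predictive-state targets used for this supervision amounts to slicing k-step future windows out of a recorded trajectory; a minimal sketch follows (the function name is illustrative):

```python
import numpy as np

# Building empirical predictive-state targets g_t = [x_{t+1}, ..., x_{t+k}]
# from a recorded trajectory of T observations of dimension n, as used to
# supervise the decoder at training time. The function name is illustrative.

def future_windows(xs, k):
    """Return a (T - k, k * n) array whose row t stacks the k observations after t."""
    xs = np.asarray(xs)
    T = xs.shape[0]
    return np.stack([xs[t + 1 : t + 1 + k].ravel() for t in range(T - k)])
```

Each row is then featurized by \u03c6 and compared against the decoder output for the corresponding internal state.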
We sweep over\nvarious k and \u03bb hyperparameters and present the average results and standard deviations from runs\nwith different random seeds. The Fig. 4 baselines are recurrent models equivalent to PSDs with \u03bb = 0.\n\n4.2 Imitation Learning\n\nWe experiment with the partially observable CartPole and Acrobot domains3 from OpenAI Gym [12].\nWe applied the method of AggreVaTeD [46], a policy-gradient method, to train our expert models.\nAggreVaTeD uses access to a cost-to-go oracle in order to train a policy that is sensitive to the value\nof the expert\u2019s actions, providing an advantage over behavior cloning IL approaches. The experts\nhave access to the full state of the robots, unlike the learned recurrent policies.\nWe tune the parameters of the LSTM and GRU agents (e.g., learning rate, number of internal units) and\nafterwards only tune \u03bb for PSDs. In Fig. 5, we observe that PSDs improve performance for both GRU-\nand LSTM-based agents and that increasing the predictive-state horizon k yields better results. Notably,\nPSDs achieve a 73% relative improvement over the baseline LSTM and 42% over the GRU on CartPole.\nDifferent random seeds were used. The cumulative reward of the current best policy is shown.\n\n4.3 Reinforcement Learning\n\nReinforcement learning (RL) increases the problem complexity from imitation learning by removing\nexpert guidance. The latent state of the system is heavily influenced by the RL agent itself and\nchanges as the policy improves. We use [18]\u2019s implementation of TRPO [41], a Natural Policy\n\n3The observation function only provides positional information (including joint angles), excluding velocities.\n\nFigure 6: Walker Cumulative Rewards and Sorted Percentiles. N = 15, 5e4 TRPO steps per iteration.\n\nTable 1: Top: Mean Average Returns \u00b1 one standard deviation, with N = 15 for Walker2d\u2020 and\nN = 30 otherwise. Bottom: Relative improvement on the means. 
\u2217 indicates p < 0.05 and\n\u2217\u2217 indicates p < 0.005 on Wilcoxon\u2019s signed-rank test for significance of improvement. All runs\ncomputed with 5e3 transitions per iteration, except Walker2d\u2020, with 5e4.\n\n           | Swimmer     | HalfCheetah | Hopper      | Walker2d  | Walker2d\u2020\n[41]       | 91.3 \u00b1 25.5 | 330 \u00b1 158   | 1103 \u00b1 264  | 383 \u00b1 96  | 1396 \u00b1 396\n[41]+PSDs  | 97.0 \u00b1 19.4 | 372 \u00b1 143   | 1195 \u00b1 272  | 416 \u00b1 88  | 1611 \u00b1 436\nRel. \u2206     | 6.30%\u2217     | 13.0%\u2217     | 9.06%\u2217     | 8.59%\u2217   | 15.4%\u2217\u2217\n\nGradient method [28]. Although [41] defines a KL-constraint on policy parameters that affect actions,\nour implementation of PSDs introduces parameters (those of the decoder) that are unaffected by the\nconstraint, as the decoder does not directly govern the agent\u2019s actions.\nIn these experiments, results are highly stochastic due to both environment randomness and non-deterministic parallelization of rllab [18]. We therefore repeat each experiment at least 15 times\nwith paired random seeds. We use k = 2 for most experiments (k = 4 for Hopper) and the identity\nfeaturization for \u03c6, vary \u03bb in {10^1, 10^0, . . . , 10^\u22126}, and employ the LSTM cell and other default\nparameters of TRPO. We report the same metric as [18]: per-TRPO-batch average return across\nlearning iterations. Additionally, we report per-run performance by plotting the sorted average TRPO\nbatch returns (each item is a number representing a method\u2019s performance for a single seed).\nFigs. 6 and 7 demonstrate that our method generally produces higher-quality results than the baseline.\nThese results are further summarized by their means and stds. in Table 1. In Figure 6, 40% of our\nmethod\u2019s models are better than the best baseline model. In Figure 7c, 25% of our method\u2019s models\nare better than the second-best (98th percentile) baseline model. 
We compare various RNN cells in Table 2, and find that our method can improve Basic (linear + tanh nonlinearity), GRU, and LSTM RNNs, and usually reduces the performance variance. We used TensorFlow [1] and passed both the "hidden" and "cell" components of an LSTM's internal state to the decoder. We also conducted preliminary additional experiments with second-order featurization (φ(x) = [x, vec(xxᵀ)]). Corresponding to Tab. 2, column 1 for the inverted pendulum, second-order features yielded 861 ± 41, a 4.9% improvement in the mean and a large reduction in variance.

[Figure 7: panels (a) Swimmer, N=30; (b) HalfCheetah, N=30; (c) Hopper, N=40.]
Figure 7: Top: Per-iteration average returns for TRPO and TRPO+PREDICTIVE-STATE DECODERS vs. batch iteration, with 5e3 steps per iteration. Bottom: Sorted per-run mean (across iterations) average returns. Our method generally produces better models.

Table 2: Variations of RNN units. Mean Average Returns ± one standard deviation, with N = 20. 1e3 transitions per iteration are used. Our method can improve each recurrent unit we tested.

              InvertedPendulum                        Swimmer
              Basic        GRU          LSTM          Basic          GRU            LSTM
[41]          820 ± 139    673 ± 268    640 ± 265     66.0 ± 21.4    64.6 ± 55.3    56.5 ± 23.8
[41]+PSDs     820 ± 118    782 ± 183    784 ± 215     71.4 ± 26.9    75.1 ± 28.8    61.0 ± 23.8
Rel. Δ        −0.08%       20.4%        22.6%         8.21%          16.1%          7.94%

5 Conclusion

We introduced a theoretically-motivated method for improving the training of RNNs. Our method stems from previous literature that assigns statistical meaning to a learner's internal state for modeling the latent state of the data-generating process. Our approach takes the objective from PSIMs and applies it to more complicated recurrent models, such as LSTMs and GRUs, and to objectives beyond probabilistic filtering, such as imitation and reinforcement learning. We show that our straightforward method improves performance across all domains with which we experimented.

Acknowledgements

This material is based upon work supported in part by: Office of Naval Research (ONR) contract N000141512365, and National Science Foundation NRI award number 1637758.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] Pieter Abbeel and Andrew Y Ng. Learning first-order Markov models for control. In NIPS, pages 1–8, 2005.

[3] Pieter Abbeel and Andrew Y Ng. Exploration and apprenticeship learning in reinforcement learning. In ICML, pages 1–8. ACM, 2005.

[4] Pieter Abbeel, Adam Coates, Michael Montemerlo, Andrew Y Ng, and Sebastian Thrun. Discriminative training of Kalman filters. In Robotics: Science and Systems (RSS), 2005.

[5] Pulkit Agrawal, Ashvin V Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 5074–5082. Curran Associates, Inc., 2016.

[6] Karl Johan Åström and Richard M Murray. Feedback systems: an introduction for scientists and engineers. Princeton University Press, 2010.

[7] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166, 1994.

[8] Byron Boots.
Spectral Approaches to Learning Predictive Representations. PhD thesis, Carnegie Mellon University, December 2012.

[9] Byron Boots, Sajid M Siddiqi, and Geoffrey J Gordon. Closing the learning-planning loop with predictive state representations. The International Journal of Robotics Research, 30(7):954–966, 2011.

[10] Byron Boots, Arthur Gretton, and Geoffrey J. Gordon. Hilbert space embeddings of predictive state representations. In UAI, 2013.

[11] Roger J Bowden and Darrell A Turkington. Instrumental variables. Number 8. Cambridge University Press, 1990.

[12] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[13] Rich Caruana. Multitask learning. In Learning to learn, pages 95–133. Springer, 1998.

[14] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[15] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pages 2980–2988, 2015.

[16] Adam Coates, Pieter Abbeel, and Andrew Y. Ng. Learning for control from multiple demonstrations. In ICML, pages 144–151, New York, NY, USA, 2008. ACM.

[17] Marc Peter Deisenroth, Marco F Huber, and Uwe D Hanebeck. Analytic moment-based Gaussian process filtering. In International Conference on Machine Learning, pages 225–232. ACM, 2009.

[18] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control.
In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.

[19] Zoubin Ghahramani and Sam T Roweis. Learning nonlinear dynamical systems using an EM algorithm. pages 431–437, 1999.

[20] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In ICML, volume 14, pages 1764–1772, 2014.

[21] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 2016.

[22] Tuomas Haarnoja, Anurag Ajay, Sergey Levine, and Pieter Abbeel. Backprop KF: Learning discriminative deterministic state estimators. NIPS, 2016.

[23] Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527, 2015.

[24] Ahmed Hefny, Carlton Downey, and Geoffrey J Gordon. Supervised learning for dynamical system learning. In NIPS, 2015.

[25] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[26] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.

[27] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. CoRR, abs/1611.05397, 2016. URL http://arxiv.org/abs/1611.05397.

[28] Sham Kakade. A natural policy gradient. Advances in Neural Information Processing Systems, 2:1531–1538, 2002.

[29] J Ko, D J Klein, D Fox, and D Haehnel. GP-UKF: Unscented Kalman filters with Gaussian process prediction and observation models. pages 1901–1907, 2007.

[30] Iasonas Kokkinos.
UberNet: Training a 'universal' convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. CoRR, abs/1609.02132, 2016.

[31] John Langford, Ruslan Salakhutdinov, and Tong Zhang. Learning nonlinear dynamic models. In ICML, pages 593–600. ACM, 2009.

[32] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436–444, 2015.

[33] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

[34] Peter Ondruska and Ingmar Posner. Deep tracking: Seeing beyond seeing using recurrent neural networks. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[35] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. ICML, 28:1310–1318, 2013.

[36] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. arXiv preprint arXiv:1703.02702, 2017.

[37] Liva Ralaivola and Florence d'Alché-Buc. Dynamical modeling with kernels for nonlinear time series prediction. NIPS, 2004.

[38] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. ICLR, 2016.

[39] Stéphane Ross, Geoffrey J Gordon, and J Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. AISTATS, 2011.

[40] Stephane Ross, Daniel Munoz, Martial Hebert, and J Andrew Bagnell. Learning message-passing inference machines for structured prediction. In CVPR. IEEE, 2011.

[41] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization.
In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889–1897, 2015.

[42] Satinder Singh, Michael R. James, and Matthew R. Rudary. Predictive state representations: A new theory for modeling dynamical systems. In UAI, 2004.

[43] Le Song, Byron Boots, Sajid M Siddiqi, Geoffrey J Gordon, and Alex J Smola. Hilbert space embeddings of hidden Markov models. In ICML, pages 991–998, 2010.

[44] Wen Sun, Roberto Capobianco, Geoffrey J. Gordon, J. Andrew Bagnell, and Byron Boots. Learning to smooth with bidirectional predictive state inference machines. In Proceedings of The International Conference on Uncertainty in Artificial Intelligence (UAI), 2016.

[45] Wen Sun, Arun Venkatraman, Byron Boots, and J Andrew Bagnell. Learning to filter with predictive state inference machines. In Proceedings of The 33rd International Conference on Machine Learning, pages 1197–1205, 2016.

[46] Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. In ICML, 2017.

[47] Ilya Sutskever. Training recurrent neural networks. PhD thesis, University of Toronto, 2013.

[48] Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.

[49] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic robotics. MIT Press, 2005.

[50] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[51] Peter Van Overschee and BL De Moor. Subspace identification for linear systems: Theory-Implementation-Applications. Springer Science & Business Media, 2012.

[52] William Vega-Brown and Nicholas Roy. CELLO-EM: Adaptive sensor models without ground truth.
In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1907–1914. IEEE, 2013.

[53] Arun Venkatraman, Martial Hebert, and J Andrew Bagnell. Improving multi-step prediction of learned time series models. In AAAI, pages 3024–3030, 2015.

[54] Arun Venkatraman, Wen Sun, Martial Hebert, Byron Boots, and J. Andrew (Drew) Bagnell. Inference machines for nonparametric filter learning. In 25th International Joint Conference on Artificial Intelligence (IJCAI-16), July 2016.

[55] Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.

[56] David Wingate and Satinder Singh. Kernel predictive linear Gaussian models for nonlinear stochastic dynamical systems. In International Conference on Machine Learning, pages 1017–1024. ACM, 2006.