{"title": "Temporal-Difference Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1377, "page_last": 1384, "abstract": null, "full_text": " Temporal-Difference Networks\n\n\n Richard S. Sutton and Brian Tanner\n Department of Computing Science\n University of Alberta\n Edmonton, Alberta, Canada T6G 2E8\n {sutton,btanner}@cs.ualberta.ca\n\n\n\n Abstract\n\n We introduce a generalization of temporal-difference (TD) learning to\n networks of interrelated predictions. Rather than relating a single pre-\n diction to itself at a later time, as in conventional TD methods, a TD\n network relates each prediction in a set of predictions to other predic-\n tions in the set at a later time. TD networks can represent and apply TD\n learning to a much wider class of predictions than has previously been\n possible. Using a random-walk example, we show that these networks\n can be used to learn to predict by a fixed interval, which is not possi-\n ble with conventional TD methods. Secondly, we show that if the inter-\n predictive relationships are made conditional on action, then the usual\n learning-efficiency advantage of TD methods over Monte Carlo (super-\n vised learning) methods becomes particularly pronounced. Thirdly, we\n demonstrate that TD networks can learn predictive state representations\n that enable exact solution of a non-Markov problem. A very broad range\n of inter-predictive temporal relationships can be expressed in these net-\n works. Overall we argue that TD networks represent a substantial ex-\n tension of the abilities of TD methods and bring us closer to the goal of\n representing world knowledge in entirely predictive, grounded terms.\n\n\nTemporal-difference (TD) learning is widely used in reinforcement learning methods to\nlearn moment-to-moment predictions of total future reward (value functions). In this set-\nting, TD learning is often simpler and more data-efficient than other methods. 
But the idea\nof TD learning can be used more generally than it is in reinforcement learning. TD learn-\ning is a general method for learning predictions whenever multiple predictions are made of\nthe same event over time, value functions being just one example. The most pertinent of\nthe more general uses of TD learning have been in learning models of an environment or\ntask domain (Dayan, 1993; Kaelbling, 1993; Sutton, 1995; Sutton, Precup & Singh, 1999).\nIn these works, TD learning is used to predict future values of many observations or state\nvariables of a dynamical system.\nThe essential idea of TD learning can be described as \"learning a guess from a guess\". In\nall previous work, the two guesses involved were predictions of the same quantity at two\npoints in time, for example, of the discounted future reward at successive time steps. In this\npaper we explore a few of the possibilities that open up when the second guess is allowed\nto be different from the first.\n\n\f\nTo be more precise, we must make a distinction between the extensive definition of a predic-\ntion, expressing its desired relationship to measurable data, and its TD definition, express-\ning its desired relationship to other predictions. In reinforcement learning, for example,\nstate values are extensively defined as an expectation of the discounted sum of future re-\nwards, while they are TD defined as the solution to the Bellman equation (a relationship to\nthe expectation of the value of successor states, plus the immediate reward). It's the same\nprediction, just defined or expressed in different ways. In past work with TD methods, the\nTD relationship was always between predictions with identical or very similar extensive\nsemantics. 
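The distinction between the two kinds of definition can be made concrete with a small example. The following sketch (our illustration, not the paper's; the rewards and discount rate are invented) computes the same state value both ways:

```python
# A minimal sketch (our illustrative example; rewards and discount invented)
# of the two ways to define the same prediction: extensively, as a discounted
# sum of future rewards, and recursively, via the TD/Bellman relationship to
# the value at the next time step.

gamma = 0.9                      # discount rate (assumed)
rewards = [1.0, 0.0, 2.0, 0.0]   # rewards following time step t (assumed)

def value_extensive(t):
    """Extensive definition: discounted sum of all future rewards."""
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))

def value_td(t):
    """TD definition: next reward plus discounted value of the successor."""
    if t == len(rewards):
        return 0.0
    return rewards[t] + gamma * value_td(t + 1)

# The two definitions describe the same prediction.
assert abs(value_extensive(0) - value_td(0)) < 1e-12
```

Conventional TD learning exploits the recursive form; TD networks generalize it by letting the right-hand side refer to other predictions, not only to the same prediction a step later.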
In this paper we retain the TD idea of learning predictions based on others, but allow the predictions to have different extensive semantics.

1 The Learning-to-predict Problem

The problem we consider in this paper is a general one of learning to predict aspects of the interaction between a decision-making agent and its environment. At each of a series of discrete time steps t, the environment generates an observation o_t ∈ O, and the agent takes an action a_t ∈ A. Whereas A is an arbitrary discrete set, we assume without loss of generality that o_t can be represented as a vector of bits. The action and observation events occur in sequence, o_1, a_1, o_2, a_2, o_3, ..., with each event of course dependent only on those preceding it. This sequence will be called experience. We are interested in predicting not just each next observation but more general, action-conditional functions of future experience, as discussed in the next section.

In this paper we use a random-walk problem with seven states, with left and right actions available in every state:

 1 0 0 0 0 0 1
 1 2 3 4 5 6 7

(The upper row gives each state's special observation bit; the lower row numbers the states.) The observation upon arriving in a state consists of a special bit that is 1 only at the two ends of the walk and, in the first two of our three experiments, seven additional bits explicitly indicating the state number (only one of them is 1). This is a continuing task: reaching an end state does not end or interrupt experience. Although the sequence depends deterministically on action, we assume that the actions are selected randomly with equal probability so that the overall system can be viewed as a Markov chain.

The TD networks introduced in this paper can represent a wide variety of predictions, far more than can be represented by a conventional TD predictor. In this paper we take just a few steps toward more general predictions. In particular, we consider variations of the problem of prediction by a fixed interval.
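For concreteness, the random-walk dynamics just described can be sketched in code. This is a minimal sketch, not the authors' code; in particular, the behavior at the two end states (here, a move off the end leaves the state unchanged) is our assumption, as the text does not specify it:

```python
import random

# A minimal sketch of the seven-state random walk described above. The
# observation is a special bit that is 1 only in the two end states, plus
# (in Experiments 1 and 2) seven bits one-hot encoding the state number.
# The task is continuing: end states do not terminate experience. How moves
# behave at the two ends is our assumption.

def step(state, action):
    state = max(1, state - 1) if action == 'L' else min(7, state + 1)
    special = 1 if state in (1, 7) else 0
    one_hot = [int(i == state) for i in range(1, 8)]
    return state, [special] + one_hot      # observation as a vector of bits

state = 4                                   # start at the center state
for _ in range(20):                         # actions chosen uniformly at random
    state, obs = step(state, random.choice('LR'))
```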
This is one of the simplest cases that cannot otherwise be handled by TD methods. For the seven-state random walk, we will predict the special observation bit some number of discrete steps in advance, first unconditionally and then conditioned on action sequences.


2 TD Networks

A TD network is a network of nodes, each representing a single scalar prediction. The nodes are interconnected by links representing the TD relationships among the predictions and to the observations and actions. These links determine the extensive semantics of each prediction--its desired or target relationship to the data. They represent what we seek to predict about the data as opposed to how we try to predict it. We think of these links as determining a set of questions being asked about the data, and accordingly we call them the question network. A separate set of interconnections determines the actual computational process--the updating of the predictions at each node from their previous values and the current action and observation. We think of this process as providing the answers to the questions, and accordingly we call them the answer network. The question network provides targets for a learning process shaping the answer network and does not otherwise affect the behavior of the TD network. It is natural to consider changing the question network, but in this paper we take it as fixed and given.

Figure 1a shows a suggestive example of a question network. The three squares across the top represent three observation bits. The node labeled 1 is directly connected to the first observation bit and represents a prediction that that bit will be 1 on the next time step. The node labeled 2 is similarly a prediction of the expected value of node 1 on the next step. Thus the extensive definition of Node 2's prediction is the probability that the first observation bit will be 1 two time steps from now.
Node 3 similarly predicts the first observation bit three time steps in the future. Node 4 is a conventional TD prediction, in this case of the future discounted sum of the second observation bit, with discount parameter γ. Its target is the familiar TD target, the data bit plus the node's own prediction on the next time step (with weightings 1 - γ and γ respectively). Nodes 5 and 6 predict the probability of the third observation bit being 1 if particular actions a or b are taken respectively. Node 7 is a prediction of the average of the first observation bit and Node 4's prediction, both on the next step. This is the first case where it is not easy to see or state the extensive semantics of the prediction in terms of the data. Node 8 predicts another average, this time of Nodes 4 and 5, and the question it asks is even harder to express extensively. One could continue in this way, adding more and more nodes whose extensive definitions are difficult to express but which would nevertheless be completely defined as long as these local TD relationships are clear. The thinner links shown entering some nodes are meant to be a suggestion of the entirely separate answer network determining the actual computation (as opposed to the goals) of the network. In this paper we consider only simple question networks such as the left column of Figure 1a and the action-conditional tree form shown in Figure 1b.

[Figure 1 diagrams appear here: (a) nodes 1-8 arranged below three observation bits, with Node 4's links weighted 1-γ and γ and Nodes 5 and 6 conditioned on actions a and b; (b) a two-level tree of nodes whose links are labeled L and R.]

Figure 1: The question networks of two TD networks. (a) a question network discussed in the text, and (b) a depth-2 fully-action-conditional question network used in Experiments 2 and 3. Observation bits are represented as squares across the top while actual nodes of the TD network, each corresponding to a separate prediction, are below. The thick lines represent the question network and the thin lines in (a) suggest the answer network (the bulk of which is not shown).
Note that all of these nodes, arrows, and numbers are completely different and separate from those representing the random-walk problem on the preceding page.

More formally and generally, let y_t^i ∈ [0, 1], i = 1, ..., n, denote the prediction of the ith node at time step t. The column vector of predictions y_t = (y_t^1, ..., y_t^n)^T is updated according to a vector-valued function u with modifiable parameter W:

    y_t = u(y_{t-1}, a_{t-1}, o_t, W_t) ∈ [0, 1]^n.    (1)

The update function u corresponds to the answer network, with W being the weights on its links. Before detailing that process, we turn to the question network, the defining TD relationships between nodes. The TD target z_t^i for y_t^i is an arbitrary function z^i of the successive predictions and observations. In vector form we have¹

    z_t = z(o_{t+1}, ỹ_{t+1}) ∈ ℜ^n,    (2)

where ỹ_{t+1} is just like y_{t+1}, as in (1), except calculated with the old weights before they are updated on the basis of z_t:

    ỹ_t = u(y_{t-1}, a_{t-1}, o_t, W_{t-1}) ∈ [0, 1]^n.    (3)

(This temporal subtlety also arises in conventional TD learning.) For example, for the nodes in Figure 1a we have z_t^1 = o_{t+1}^1, z_t^2 = ỹ_{t+1}^1, z_t^3 = ỹ_{t+1}^2, z_t^4 = (1 - γ)o_{t+1}^2 + γỹ_{t+1}^4, z_t^5 = z_t^6 = o_{t+1}^3, z_t^7 = (1/2)o_{t+1}^1 + (1/2)ỹ_{t+1}^4, and z_t^8 = (1/2)ỹ_{t+1}^4 + (1/2)ỹ_{t+1}^5. The target functions z^i are only part of specifying the question network. The other part has to do with making them potentially conditional on action and observation. For example, Node 5 in Figure 1a predicts what the third observation bit will be if action a is taken. To arrange for such semantics we introduce a new vector c_t of conditions, c_t^i, indicating the extent to which y_t^i is held responsible for matching z_t^i, thus making the ith prediction conditional on c_t^i. Each c_t^i is determined as an arbitrary function c^i of a_t and y_t. In vector form we have:

    c_t = c(a_t, y_t) ∈ [0, 1]^n.    (4)

For example, for Node 5 in Figure 1a, c_t^5 = 1 if a_t = a, otherwise c_t^5 = 0.

Equations (2-4) correspond to the question network. Let us now turn to defining u, the update function for y_t mentioned earlier and which corresponds to the answer network. In general u is an arbitrary function approximator, but for concreteness we define it to be of a linear form

    y_t = σ(W_t x_t),    (5)

where x_t ∈ ℜ^m is a feature vector, W_t is an n × m matrix, and σ is the n-vector form of the identity function (Experiments 1 and 2) or the S-shaped logistic function σ(s) = 1/(1 + e^{-s}) (Experiment 3). The feature vector is an arbitrary function of the preceding action, observation, and node values:

    x_t = x(a_{t-1}, o_t, y_{t-1}) ∈ ℜ^m.    (6)

For example, x_t might have one component for each observation bit, one for each possible action (one of which is 1, the rest 0), and n more for the previous node values y_{t-1}. The learning algorithm for each component w_t^{ij} of W_t is

    w_{t+1}^{ij} - w_t^{ij} = α (z_t^i - y_t^i) c_t^i ∂y_t^i/∂w_t^{ij},    (7)

where α is a step-size parameter. The timing details may be clarified by writing the sequence of quantities in the order in which they are computed:

    y_t  a_t  o_{t+1}  x_{t+1}  ỹ_{t+1}  c_t  z_t  W_{t+1}  y_{t+1}.    (8)

Finally, the target in the extensive sense for y_t is

    y_t = E_π{ (1 - c_t) ⊙ y_t + c_t ⊙ z(o_{t+1}, y_{t+1}) },    (9)

where ⊙ represents component-wise multiplication and π is the policy being followed, which is assumed fixed.

¹In general, z is a function of all the future predictions and observations, but in this paper we treat only the one-step case.

3 Experiment 1: n-step Unconditional Prediction

In this experiment we sought to predict the observation bit precisely n steps in advance, for n = 1, 2, 5, 10, and 25. In order to predict n steps in advance, of course, we also have to predict n - 1 steps in advance, n - 2 steps in advance, etc., all the way down to predicting one step ahead.
This is specified by a TD network consisting of a single chain of predictions like the left column of Figure 1a, but of length 25 rather than 3. Random-walk sequences were constructed by starting at the center state and then taking random actions for 50, 100, 150, and 200 steps (100 sequences each).

We applied a TD network and a corresponding Monte Carlo method to this data. The Monte Carlo method learned the same predictions, but learned them by comparing them to the actual outcomes in the sequence (instead of z_t^i in (7)). This involved significant additional complexity to store the predictions until their corresponding targets were available. Both algorithms used feature vectors of 7 binary components, one for each of the seven states, all of which were zero except for the one corresponding to the current state. Both algorithms formed their predictions linearly (σ was the identity) and unconditionally (c_t^i = 1 for all i, t).

In an initial set of experiments, both algorithms were applied online with a variety of values for their step-size parameter α. Under these conditions we did not find that either algorithm was clearly better in terms of the mean square error in their predictions over the data sets. We found a clearer result when both algorithms were trained using batch updating, in which weight changes are collected "on the side" over an experience sequence and then made all at once at the end, and the whole process is repeated until convergence. Under batch updating, convergence is to the same predictions regardless of initial conditions or α value (as long as α is sufficiently small), which greatly simplifies comparison of algorithms. The predictions learned under batch updating are also the same as would be computed by least-squares algorithms such as LSTD(λ) (Bradtke & Barto, 1996; Boyan, 2000; Lagoudakis & Parr, 2003).
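A rough sketch of the online variant of this setup (our reconstruction, not the authors' code): a chain of predictions in which node 1's target is the next special bit and node i's target is node i-1's next value, with one-hot state features, linear predictions, and the update rule (7) with all conditions equal to 1. The end-state dynamics, chain length, and step size here are assumptions:

```python
import random

random.seed(0)
N_NODES, ALPHA = 5, 0.1                     # chain length and step size (assumed)
W = [[0.0] * 7 for _ in range(N_NODES)]     # n x m weight matrix, m = 7 states

def predict(x):                             # linear answer network, sigma = identity
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

state = 4
x = [int(i == state) for i in range(1, 8)]  # one-hot state features
y = predict(x)
for _ in range(5000):
    state = max(1, state - 1) if random.random() < 0.5 else min(7, state + 1)
    o = 1.0 if state in (1, 7) else 0.0     # special observation bit
    x_next = [int(i == state) for i in range(1, 8)]
    y_tilde = predict(x_next)               # next predictions, pre-update weights
    z = [o] + y_tilde[:-1]                  # chain targets: z^1 = o, z^i = y~^{i-1}
    for i in range(N_NODES):                # rule (7); gradient of y^i is x
        for j in range(7):
            W[i][j] += ALPHA * (z[i] - y[i]) * x[j]
    x, y = x_next, predict(x_next)
```

In this sketch, node 1's weight for the center state stays at zero (its one-step target there is always 0), while its weights for states adjacent to the ends settle near the true probability 0.5.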
The errors in the final predictions are shown in Table 1. For 1-step predictions, the Monte-Carlo and TD methods performed identically, of course, but for longer predictions a significant difference was observed. The RMSE of the Monte Carlo method increased with prediction length whereas for the TD network it decreased. The largest standard error in any of the numbers shown in the table is 0.008, so almost all of the differences are statistically significant. TD methods appear to have a significant data-efficiency advantage over non-TD methods in this prediction-by-n context (and this task) just as they do in conventional multi-step prediction (Sutton, 1988).

 Time 1-step 2-step 5-step 10-step 25-step
 Steps MC/TD MC TD MC TD MC TD MC TD
 50 0.205 0.219 0.172 0.234 0.159 0.249 0.139 0.297 0.129
 100 0.124 0.133 0.100 0.160 0.098 0.168 0.079 0.187 0.068
 150 0.089 0.103 0.073 0.121 0.076 0.130 0.063 0.153 0.054
 200 0.076 0.084 0.060 0.109 0.065 0.112 0.056 0.118 0.049

Table 1: RMSE of Monte-Carlo and TD-network predictions of various lengths and for increasing amounts of training data on the random-walk example with batch updating.


4 Experiment 2: Action-conditional Prediction

The advantage of TD methods should be greater for predictions that apply only when the experience sequence unfolds in a particular way, such as when a particular sequence of actions is taken. In a second experiment we sought to learn n-step-ahead predictions conditional on action selections. The question network for learning all 2-step-ahead predictions is shown in Figure 1b. The upper two nodes predict the observation bit conditional on taking a left action (L) or a right action (R). The lower four nodes correspond to the two-step predictions, e.g., the second lower node is the prediction of what the observation bit will be if an L action is taken followed by an R action.
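The question network of Figure 1b, generalized to any depth, can be sketched by identifying each node with the action sequence it is conditioned on. The enumeration and the condition rule below are our construction from the description in the text, not the authors' code:

```python
from itertools import product

# A small sketch (our construction) of a depth-d action-conditional question
# network: one node per action sequence of length 1..d. A node's condition
# c^i_t is 1 exactly when the current action matches the first action of its
# sequence; its target is then the next observation bit (length-1 node) or
# the next prediction of the node for the remaining sequence.

def make_nodes(depth, actions=('L', 'R')):
    nodes = []
    for d in range(1, depth + 1):
        nodes.extend(product(actions, repeat=d))
    return nodes

def conditions(nodes, action_taken):
    return [1 if seq[0] == action_taken else 0 for seq in nodes]

nodes = make_nodes(2)            # the six nodes of Figure 1b
c = conditions(nodes, 'L')       # only L-conditioned nodes are responsible
```

At depth 4 this enumeration yields 2+4+8+16 = 30 nodes, matching the network used in the next paragraph.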
These predictions are the same as the e-tests used in some of the work on predictive state representations (Littman, Sutton & Singh, 2002; Rudary & Singh, 2004).

In this experiment we used a question network like that in Figure 1b except of depth four, consisting of 30 (2+4+8+16) nodes. The conditions for each node were set to 0 or 1 depending on whether the action taken on the step matched that indicated in the figure. The feature vectors were as in the previous experiment. Now that we are conditioning on action, the problem is deterministic and α can be set uniformly to 1. A Monte Carlo prediction can be learned only when its corresponding action sequence occurs in its entirety, but then it is complete and accurate in one step. The TD network, on the other hand, can learn from incomplete sequences but must propagate them back one level at a time. First the one-step predictions must be learned, then the two-step predictions from them, and so on. The results for online and batch training are shown in Tables 2 and 3.

As anticipated, the TD network learns much faster than Monte Carlo with both online and batch updating. Because the TD network learns its n-step predictions based on its (n-1)-step predictions, it has a clear advantage for this task. Once the TD network has seen each action in each state, it can quickly learn any prediction 2, 10, or 1000 steps in the future.
Monte Carlo, on the other hand, must sample actual sequences, so each exact action sequence must be observed.

 1-Step 2-Step 3-Step 4-Step
 Time Steps MC/TD MC TD MC TD MC TD
 100 0.153 0.222 0.182 0.253 0.195 0.285 0.185
 200 0.019 0.092 0.044 0.142 0.054 0.196 0.062
 300 0.000 0.040 0.000 0.089 0.013 0.139 0.017
 400 0.000 0.019 0.000 0.055 0.000 0.093 0.000
 500 0.000 0.019 0.000 0.038 0.000 0.062 0.000

Table 2: RMSE of the action-conditional predictions of various lengths for Monte-Carlo and TD-network methods on the random-walk problem with online updating.

 Time Steps MC TD
 50 53.48% 17.21%
 100 30.81% 4.50%
 150 19.26% 1.57%
 200 11.69% 0.14%

Table 3: Average proportion of incorrect action-conditional predictions for batch-updating versions of Monte-Carlo and TD-network methods, for various amounts of data, on the random-walk task. All differences are statistically significant.


5 Experiment 3: Learning a Predictive State Representation

Experiments 1 and 2 showed advantages for TD learning methods in Markov problems. The feature vectors in both experiments provided complete information about the nominal state of the random walk. In Experiment 3, on the other hand, we applied TD networks to a non-Markov version of the random-walk example, in particular, a version in which only the special observation bit was visible and not the state number. In this case it is not possible to make accurate predictions based solely on the current action and observation; the previous time step's predictions must be used as well.

As in the previous experiment, we sought to learn n-step predictions using action-conditional question networks of depths 2, 3, and 4. The feature vector x_t consisted of three parts: a constant 1, four binary features to represent the pair of action a_{t-1} and observation bit o_t, and n more features corresponding to the components of y_{t-1}. The feature vectors were thus of length m = 11, 19, and 35 for the three depths. In this experiment, σ was the S-shaped logistic function. The initial weights W_0 and predictions y_0 were both 0.

Fifty random-walk sequences were constructed, each of 250,000 time steps, and presented to TD networks of the three depths, with a range of step-size parameters α. We measured the RMSE of all predictions made by the networks (computed from knowledge of the task) and also the "empirical RMSE," the error in the one-step prediction for the action actually taken on each step. We found that in all cases the errors approached zero over time, showing that the problem was completely solved. Figure 2 shows some representative learning curves for the depth-2 and depth-4 TD networks.

[Figure 2 appears here: learning curves of empirical RMS error versus time steps for step sizes α = .1, .25, .5, and .75.]

Figure 2: Prediction performance on the non-Markov random walk with depth-4 TD networks (and one depth-2 network) with various step-size parameters, averaged over 50 runs and 1000 time-step bins. The "bump" most clearly seen with small step sizes is reliably present and may be due to predictions of different lengths being learned at different times.

In ongoing experiments on other non-Markov problems we have found that TD networks do not always find such complete solutions. Other problems seem to require more than one step of history information (the one-step-preceding action and observation), though less than would be required using history information alone. Our results as a whole suggest that TD networks may provide an effective alternative learning algorithm for predictive state representations (Littman et al., 2002). Previous algorithms have been found to be effective on some tasks but not on others (e.g., Singh et al., 2003; Rudary & Singh, 2004; James & Singh, 2004).
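The forward computation of the answer network in this experiment can be sketched as follows. This is our reconstruction under stated assumptions (depth 2, an arbitrary initial action-observation pair); only the forward pass is shown, with learning proceeding by rule (7) with the logistic gradient:

```python
import math

# A minimal sketch of the depth-2 case: n = 6 nodes, m = 1 + 4 + 6 = 11
# features, logistic outputs. The feature vector concatenates a constant 1,
# a one-hot encoding of the (previous action, current bit) pair, and the
# previous predictions, so the predictions themselves serve as state.

def sigma(s):
    return 1.0 / (1.0 + math.exp(-s))        # the S-shaped logistic function

def features(prev_action, obs_bit, y_prev):
    pair = [0.0] * 4
    pair[2 * prev_action + obs_bit] = 1.0    # one-hot (a_{t-1}, o_t) pair
    return [1.0] + pair + list(y_prev)       # constant, pair bits, predictions

n = 6                                         # depth-2 tree: 2 + 4 nodes
W = [[0.0] * (1 + 4 + n) for _ in range(n)]   # initial weights W_0 = 0

def predict(x):
    return [sigma(sum(w * xi for w, xi in zip(row, x))) for row in W]

y = [0.0] * n                                 # initial predictions y_0 = 0
x = features(0, 0, y)                         # e.g. after action 0, bit 0
y = predict(x)
```

Because only the current action-observation pair and the previous predictions enter x_t, the predictions carry the state information that the hidden state number would otherwise provide.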
More work is needed to assess the range of effectiveness and learning rate of TD methods vis-à-vis previous methods, and to explore their combination with history information.

6 Conclusion

TD networks suggest a large set of possibilities for learning to predict, and in this paper we have begun exploring the first few. Our results show that even in a fully observable setting there may be significant advantages to TD methods when learning TD-defined predictions. Our action-conditional results show that TD methods can learn dramatically faster than other methods. TD networks allow the expression of many new kinds of predictions whose extensive semantics is not immediately clear, but which are ultimately fully grounded in data. It may be fruitful to further explore the expressive potential of TD-defined predictions. Although most of our experiments have concerned the representational expressiveness and efficiency of TD-defined predictions, it is also natural to consider using them as state, as in predictive state representations. Our experiments suggest that this is a promising direction and that TD learning algorithms may have advantages over previous learning methods. Finally, we note that adding nodes to a question network produces new predictions and thus may be a way to address the discovery problem for predictive representations.

Acknowledgments

The authors gratefully acknowledge the ideas and encouragement they have received in this work from Satinder Singh, Doina Precup, Michael Littman, Mark Ring, Vadim Bulitko, Eddie Rafols, Anna Koop, Tao Wang, and all the members of the rlai.net group.

References

Boyan, J. A. (2000). Technical update: Least-squares temporal difference learning. Machine Learning 49:233-246.
Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning 22(1/2/3):33-57.
Dayan, P. (1993).
Improving generalization for temporal difference learning: The successor representation. Neural Computation 5(4):613-624.
James, M. and Singh, S. (2004). Learning and discovery of predictive state representations in dynamical systems with reset. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 417-424.
Kaelbling, L. P. (1993). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, pp. 167-173.
Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research 4(Dec):1107-1149.
Littman, M. L., Sutton, R. S. and Singh, S. (2002). Predictive representations of state. In Advances in Neural Information Processing Systems 14:1555-1561.
Rudary, M. R. and Singh, S. (2004). A nonlinear predictive state representation. In Advances in Neural Information Processing Systems 16:855-862.
Singh, S., Littman, M. L., Jong, N. K., Pardoe, D. and Stone, P. (2003). Learning predictive state representations. In Proceedings of the Twentieth International Conference on Machine Learning, pp. 712-719.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3:9-44.
Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In A. Prieditis and S. Russell (eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 531-539. Morgan Kaufmann, San Francisco.
Sutton, R. S., Precup, D. and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112:181-211.
", "award": [], "sourceid": 2545, "authors": [{"given_name": "Richard", "family_name": "Sutton", "institution": null}, {"given_name": "Brian", "family_name": "Tanner", "institution": null}]}