{"title": "Value Prediction Network", "book": "Advances in Neural Information Processing Systems", "page_first": 6118, "page_last": 6128, "abstract": "This paper proposes a novel deep reinforcement learning (RL) architecture, called Value Prediction Network (VPN), which integrates model-free and model-based RL methods into a single neural network. In contrast to typical model-based RL methods, VPN learns a dynamics model whose abstract states are trained to make option-conditional predictions of future values (discounted sum of rewards) rather than of future observations. Our experimental results show that VPN has several advantages over both model-free and model-based baselines in a stochastic environment where careful planning is required but building an accurate observation-prediction model is difficult. Furthermore, VPN outperforms Deep Q-Network (DQN) on several Atari games even with short-lookahead planning, demonstrating its potential as a new way of learning a good state representation.", "full_text": "Value Prediction Network\n\nJunhyuk Oh\u2020\n\n\u2020University of Michigan\n\nSatinder Singh\u2020\n\u2217Google Brain\n\nHonglak Lee\u2217,\u2020\n\n{junhyuk,baveja,honglak}@umich.edu, honglak@google.com\n\nAbstract\n\nThis paper proposes a novel deep reinforcement learning (RL) architecture, called\nValue Prediction Network (VPN), which integrates model-free and model-based\nRL methods into a single neural network. In contrast to typical model-based\nRL methods, VPN learns a dynamics model whose abstract states are trained\nto make option-conditional predictions of future values (discounted sum of re-\nwards) rather than of future observations. Our experimental results show that\nVPN has several advantages over both model-free and model-based baselines in a\nstochastic environment where careful planning is required but building an accurate\nobservation-prediction model is dif\ufb01cult. 
Furthermore, VPN outperforms Deep\nQ-Network (DQN) on several Atari games even with short-lookahead planning,\ndemonstrating its potential as a new way of learning a good state representation.\n\n1\n\nIntroduction\n\nModel-based reinforcement learning (RL) approaches attempt to learn a model that predicts future\nobservations conditioned on actions and can thus be used to simulate the real environment and do\nmulti-step lookaheads for planning. We will call such models an observation-prediction model to\ndistinguish it from another form of model introduced in this paper. Building an accurate observation-\nprediction model is often very challenging when the observation space is large [23, 5, 13, 4] (e.g., high-\ndimensional pixel-level image frames), and even more dif\ufb01cult when the environment is stochastic.\nTherefore, a natural question is whether it is possible to plan without predicting future observations.\nIn fact, raw observations may contain information unnecessary for planning, such as dynamically\nchanging backgrounds in visual observations that are irrelevant to their value/utility. The starting point\nof this work is the premise that what planning truly requires is the ability to predict the rewards and\nvalues of future states. An observation-prediction model relies on its predictions of observations to\npredict future rewards and values. What if we could predict future rewards and values directly without\npredicting future observations? Such a model could be more easily learnable for complex domains or\nmore \ufb02exible for dealing with stochasticity. In this paper, we address the problem of learning and\nplanning from a value-prediction model that can directly generate/predict the value/reward of future\nstates without generating future observations.\nOur main contribution is a novel neural network architecture we call the Value Prediction Network\n(VPN). 
The VPN combines model-based RL (i.e., learning the dynamics of an abstract state space\nsuf\ufb01cient for computing future rewards and values) and model-free RL (i.e., mapping the learned\nabstract states to rewards and values) in a uni\ufb01ed framework. In order to train a VPN, we propose\na combination of temporal-difference search [28] (TD search) and n-step Q-learning [20]. In brief,\nVPNs learn to predict values via Q-learning and rewards via supervised learning. At the same time,\nVPNs perform lookahead planning to choose actions and compute bootstrapped target Q-values.\nOur empirical results on a 2D navigation task demonstrate the advantage of VPN over model-free\nbaselines (e.g., Deep Q-Network [21]). We also show that VPN is more robust to stochasticity in the\nenvironment than an observation-prediction model approach. Furthermore, we show that our VPN\noutperforms DQN on several Atari games [2] even with short-lookahead planning, which suggests\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fthat our approach can be potentially useful for learning better abstract-state representations and\nreducing sample-complexity.\n\n2 Related Work\n\nModel-based Reinforcement Learning. Dyna-Q [32, 34, 39] integrates model-free and model-\nbased RL by learning an observation-prediction model and using it to generate samples for Q-learning\nin addition to the model-free samples obtained by acting in the real environment. Gu et al. [7]\nextended these ideas to continuous control problems. Our work is similar to Dyna-Q in the sense that\nplanning and learning are integrated into one architecture. However, VPNs perform a lookahead tree\nsearch to choose actions and compute bootstrapped targets, whereas Dyna-Q uses a learned model\nto generate imaginary samples. In addition, Dyna-Q learns a model of the environment separately\nfrom a value function approximator. 
In contrast, the dynamics model in VPN is combined with the\nvalue function approximator in a single neural network and indirectly learned from reward and value\npredictions through backpropagation.\nAnother line of work [23, 4, 8, 30] uses observation-prediction models not for planning, but for improv-\ning exploration. A key distinction from these prior works is that our method learns abstract-state dy-\nnamics not to predict future observations, but instead to predict future rewards/values. For continuous\ncontrol problems, deep learning has been combined with model predictive control (MPC) [6, 18, 26],\na speci\ufb01c way of using an observation-prediction model. In cases where the observation-prediction\nmodel is differentiable with respect to continuous actions, backpropagation can be used to \ufb01nd the\noptimal action [19] or to compute value gradients [11]. In contrast, our work focuses on learning and\nplanning using lookahead for discrete control problems.\nOur VPNs are related to Value Iteration Networks [35] (VINs) which perform value iteration (VI) by\napproximating the Bellman-update through a convolutional neural network (CNN). However, VINs\nperform VI over the entire state space, which in practice requires that 1) the state space is small and\nrepresentable as a vector with each dimension corresponding to a separate state and 2) the states have\na topology with local transition dynamics (e.g., 2D grid). VPNs do not have these limitations and are\nthus more generally applicable, as we will show empirically in this paper.\nVPN is close to and in-part inspired by Predictron [29] in that a recurrent neural network (RNN) acts\nas a transition function over abstract states. VPN can be viewed as a grounded Predictron in that each\nrollout corresponds to the transition in the environment, whereas each rollout in Predictron is purely\nabstract. 
In addition, Predictrons are limited to uncontrolled settings and thus policy evaluation,\nwhereas our VPNs can learn an optimal policy in controlled settings.\n\nModel-free Deep Reinforcement Learning. Mnih et al. [21] proposed the Deep Q-Network\n(DQN) architecture which learns to estimate Q-values using deep neural networks. A lot of variations\nof DQN have been proposed for learning better state representation [37, 16, 9, 22, 36, 24], including\nthe use of memory-based networks for handling partial observability [9, 22, 24], estimating both\nstate-values and advantage-values as a decomposition of Q-values [37], learning successor state\nrepresentations [16], and learning several auxiliary predictions in addition to the main RL values [12].\nOur VPN can be viewed as a model-free architecture which 1) decomposes Q-value into reward,\ndiscount, and the value of the next state and 2) uses multi-step reward/value predictions as auxiliary\ntasks to learn a good representation. A key difference from the prior work listed above is that our\nVPN learns to simulate the future rewards/values which enables planning. Although STRAW [36]\ncan maintain a sequence of future actions using an external memory, it cannot explicitly perform\nplanning by simulating future rewards/values.\n\nMonte-Carlo Planning. Monte-Carlo Tree Search (MCTS) methods [15, 3] have been used for\ncomplex search problems, such as the game of Go, where a simulator of the environment is already\navailable and thus does not have to be learned. Most recently, AlphaGo [27] introduced a value\nnetwork that directly estimates the value of state in Go in order to better approximate the value of\nleaf-node states during tree search. Our VPN takes a similar approach by predicting the value of\nabstract future states during tree search using a value function approximator. 
Temporal-difference search [28] (TD search) combined TD-learning with MCTS by computing target values for a value function approximator through MCTS. Our algorithm for training VPN can be viewed as an instance of TD search, but it learns the dynamics of future rewards/values instead of being given a simulator.

(a) One-step rollout (b) Multi-step rollout
Figure 1: Value prediction network. (a) VPN learns to predict immediate reward, discount, and the value of the next abstract-state. (b) VPN unrolls the core module in the abstract-state space to compute multi-step rollouts.

3 Value Prediction Network

The value prediction network is developed for semi-Markov decision processes (SMDPs). Let x_t be the observation or a history of observations for partially observable MDPs (henceforth referred to as just observation) and let o_t be the option [33, 31, 25] at time t. Each option maps observations to primitive actions, and the following Bellman equation holds for all policies π:

Q^π(x_t, o_t) = E[ Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V^π(x_{t+k}) ],

where γ is a discount factor, r_t is the immediate reward at time t, and k is the number of time steps taken by the option o_t before terminating in observation x_{t+k}.
A VPN not only learns an option-value function Q_θ(x_t, o_t) through a neural network parameterized by θ like model-free RL, but also learns the dynamics of the rewards/values to perform planning. We describe the architecture of VPN in Section 3.1. In Section 3.2, we describe how to perform planning using VPN. 
Section 3.3 describes how to train VPN in a Q-learning-like framework [38].

3.1 Architecture

The VPN consists of the following modules parameterized by θ = {θ^enc, θ^value, θ^out, θ^trans}:

Encoding f^enc_θ : x ↦ s          Value f^value_θ : s ↦ V_θ(s)
Outcome f^out_θ : s, o ↦ r, γ     Transition f^trans_θ : s, o ↦ s′

• Encoding module maps the observation (x) to the abstract state (s ∈ R^m) using neural networks (e.g., a CNN for visual observations). Thus, s is an abstract-state representation which will be learned by the network (and not an environment state or even an approximation to one).
• Value module estimates the value of the abstract-state (V_θ(s)). Note that the value module is not a function of the observation, but a function of the abstract-state.
• Outcome module predicts the option-reward (r ∈ R) for executing the option o at abstract-state s. If the option takes k primitive actions before termination, the outcome module should predict the discounted sum of the k immediate rewards as a scalar. The outcome module also predicts the option-discount (γ ∈ R) induced by the number of steps taken by the option.
• Transition module transforms the abstract-state to the next abstract-state (s′ ∈ R^m) in an option-conditional manner.

Figure 1a illustrates the core module which performs a 1-step rollout by composing the above modules: f^core_θ : s, o ↦ r, γ, V_θ(s′), s′. The core module takes an abstract-state and option as input and makes separate option-conditional predictions of the option-reward (henceforth, reward), the option-discount (henceforth, discount), and the value of the abstract-state at option-termination. 
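To make the modular structure concrete, here is a minimal runnable sketch of the four modules and the core-module composition. The linear maps, dimensions, and squashing functions are illustrative stand-ins (the paper uses CNNs and domain-specific layer sizes); the only structural commitments taken from the text are the module signatures.

```python
import numpy as np

# Hypothetical toy sizes, not the paper's architecture.
rng = np.random.default_rng(0)
M, N_OPTIONS, OBS_DIM = 8, 4, 16

W_enc   = rng.normal(size=(M, OBS_DIM)) * 0.1
w_value = rng.normal(size=M) * 0.1
W_out   = rng.normal(size=(N_OPTIONS, 2, M)) * 0.1   # per-option reward/discount head
W_trans = rng.normal(size=(N_OPTIONS, M, M)) * 0.1

def f_enc(x):                      # encoding: x -> s
    return np.tanh(W_enc @ x)

def f_value(s):                    # value: s -> V(s)
    return float(w_value @ s)

def f_out(s, o):                   # outcome: s, o -> r, gamma (gamma squashed into (0, 1))
    r, g = W_out[o] @ s
    return float(r), float(1.0 / (1.0 + np.exp(-g)))

def f_trans(s, o):                 # transition: s, o -> s'
    return np.tanh(W_trans[o] @ s)

def f_core(s, o):                  # core: s, o -> r, gamma, V(s'), s'
    r, gamma = f_out(s, o)
    s_next = f_trans(s, o)
    return r, gamma, f_value(s_next), s_next

s = f_enc(np.ones(OBS_DIM))
r, gamma, v_next, s_next = f_core(s, 0)
```

Note that the value module sees only the abstract state, never the observation, and that the outcome and transition modules are option-conditional, exactly as in the module table above.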
By combining the predictions, we can estimate the Q-value as follows: Q_θ(s, o) = r + γV_θ(s′). In addition, the VPN recursively applies the core module to predict the sequence of future abstract-states as well as rewards and discounts given an initial abstract-state and a sequence of options, as illustrated in Figure 1b.

3.2 Planning

VPN has the ability to simulate the future and plan based on the simulated future abstract-states. Although many existing planning methods (e.g., MCTS) can be applied to the VPN, we implement a simple planning method which performs rollouts using the VPN up to a certain depth (say d), henceforth denoted as planning depth, and aggregates all intermediate value estimates as described in Algorithm 1 and Figure 2. More formally, given an abstract-state s = f^enc_θ(x) and an option o, the Q-value calculated from d-step planning is defined as:

Q^d_θ(s, o) = r + γV^d_θ(s′),
V^d_θ(s) = V_θ(s) if d = 1;  (1/d)V_θ(s) + ((d−1)/d) max_o Q^{d−1}_θ(s, o) if d > 1,   (1)

where s′ = f^trans_θ(s, o), V_θ(s) = f^value_θ(s), and r, γ = f^out_θ(s, o).

Algorithm 1 Q-value from d-step planning
  function Q-PLAN(s, o, d)
    r, γ, V(s′), s′ ← f^core_θ(s, o)
    if d = 1 then
      return r + γV(s′)
    end if
    A ← b-best options based on Q^1(s′, o′)
    for o′ ∈ A do
      q_{o′} ← Q-PLAN(s′, o′, d − 1)
    end for
    return r + γ[(1/d)V(s′) + ((d−1)/d) max_{o′∈A} q_{o′}]
  end function

(a) Expansion (b) Backup
Figure 2: Planning with VPN. (a) Simulate b-best options up to a certain depth (b = 2 in this example). (b) Aggregate all possible returns along the best sequence of future options. 
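The d-step planning recursion of Algorithm 1 can be sketched directly. Here `f_core` is a stand-in for the learned core module, and the toy core (constant reward 1.0, discount 0.9, next-state value 2.0) exists purely to make the recursion executable; it is not part of the paper.

```python
def q_plan(s, o, d, f_core, options, b=2):
    """Q-value from d-step planning (sketch of Algorithm 1)."""
    r, gamma, v_next, s_next = f_core(s, o)
    if d == 1:
        return r + gamma * v_next
    # Expansion: simulate only the b-best options under the 1-step estimate Q^1.
    ranked = sorted(options, key=lambda o2: q_plan(s_next, o2, 1, f_core, options, b),
                    reverse=True)
    best = max(q_plan(s_next, o2, d - 1, f_core, options, b) for o2 in ranked[:b])
    # Backup: uniform average of d returns -> weights 1/d and (d-1)/d.
    return r + gamma * (v_next / d + (d - 1) / d * best)

# Toy core module with constant outcomes, for illustration only.
def toy_core(s, o):
    return 1.0, 0.9, 2.0, s

q1 = q_plan(None, 0, 1, toy_core, options=[0, 1])  # 1.0 + 0.9 * 2.0 = 2.8
q2 = q_plan(None, 0, 2, toy_core, options=[0, 1])  # 1.0 + 0.9 * (2.0/2 + (1/2) * 2.8) = 3.16
```

With d = 1 this reduces to the 1-step estimate r + γV(s′); for d > 1 the direct estimate V(s′) gets weight 1/d and the best deeper backup gets weight (d−1)/d, matching Equation 1.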
Our planning algorithm is divided into two steps: expansion and backup. At the expansion step (see Figure 2a), we recursively simulate options up to a depth of d by unrolling the core module. At the backup step, we compute the weighted average of the direct value estimate V_θ(s) and max_o Q^{d−1}_θ(s, o) to compute V^d_θ(s) (i.e., the value from d-step planning) in Equation 1. Note that max_o Q^{d−1}_θ(s, o) is the average over d − 1 possible value estimates. We propose to compute the uniform average over all possible returns by using weights proportional to 1 and d − 1 for V_θ(s) and max_o Q^{d−1}_θ(s, o) respectively. Thus, V^d_θ(s) is the uniform average of d expected returns along the path of the best sequence of options, as illustrated in Figure 2b.
To reduce the computational cost, we simulate only the b-best options at each expansion step based on Q^1(s, o). We also find that choosing only the best option after a certain depth does not compromise the performance much, which is analogous to using a default policy in MCTS beyond a certain depth. This heuristic visits reasonably good abstract states during planning, though a more principled method such as UCT [15] could also be used to balance exploration and exploitation. This planning method is used for choosing options and computing target Q-values during training, as described in the following section.

3.3 Learning

VPN can be trained through any existing value-based RL algorithm for the value predictions, combined with supervised learning for the reward and discount predictions. In this paper, we present a modification of n-step Q-learning [20] and TD search [28]. The main idea is to generate trajectories by following an ε-greedy policy based on the planning method described in Section 3.2. 
Given an n-step trajectory x_1, o_1, r_1, γ_1, x_2, o_2, r_2, γ_2, ..., x_{n+1} generated by the ε-greedy policy, k-step predictions are defined recursively as:

s^k_t = f^enc_θ(x_t) if k = 0;  s^k_t = f^trans_θ(s^{k−1}_{t−1}, o_{t−1}) if k > 0.

Figure 3: Illustration of learning process.

Intuitively, s^k_t is the VPN's k-step prediction of the abstract-state at time t predicted from x_{t−k} by following options o_{t−k}, ..., o_{t−1} in the trajectory, as illustrated in Figure 3. By applying the value and the outcome module, VPN can compute the k-step prediction of the value, the reward, and the discount:

v^k_t = f^value_θ(s^k_t),   r^k_t, γ^k_t = f^out_θ(s^{k−1}_t, o_t).

The k-step prediction loss at step t is defined as:

L_t = Σ_{l=1}^{k} (R_t − v^l_t)² + (r_t − r^l_t)² + (log_γ γ_t − log_γ γ^l_t)²,

where R_t = r_t + γ_t R_{t+1} if t ≤ n and R_t = max_o Q^d_{θ⁻}(s_{n+1}, o) if t = n + 1 is the target value, and Q^d_{θ⁻}(s_{n+1}, o) is the Q-value computed by the d-step planning method described in Section 3.2. Intuitively, L_t accumulates losses over the 1-step to k-step predictions of values, rewards, and discounts. We find that applying log_γ for the discount prediction loss helps optimization, which amounts to computing the squared loss with respect to the number of steps.
Our learning algorithm introduces two hyperparameters: the number of prediction steps (k) and the planning depth (d_train) used for choosing options and computing bootstrapped targets. We also make use of a target network parameterized by θ⁻ which is synchronized with θ after a certain number of steps to stabilize training, as suggested by [20]. The loss is accumulated over n steps and the parameter is updated by computing its gradient ∇_θ L = Σ_{t=1}^{n} ∇_θ L_t. The full algorithm is described in the supplementary material.

3.4 Relationship to Existing Approaches

VPN is model-based in the sense that it learns an abstract-state transition function sufficient to predict rewards/discount/values. Meanwhile, VPN can also be viewed as model-free in the sense that it learns to directly estimate the value of the abstract-state. From this perspective, VPN exploits several auxiliary prediction tasks, such as reward and discount predictions, to learn a good abstract-state representation. An interesting property of VPN is that its planning ability is used to compute the bootstrapped target as well as to choose options during Q-learning. Therefore, as VPN improves the quality of its future predictions, it can not only perform better during evaluation through its improved planning ability, but also generate more accurate target Q-values during training, which encourages faster convergence compared to conventional Q-learning.

4 Experiments

Our experiments investigated the following questions: 1) Does VPN outperform model-free baselines (e.g., DQN)? 2) What is the advantage of planning with a VPN over observation-based planning? 3) Is VPN useful for complex domains with high-dimensional sensory inputs, such as Atari games?

4.1 Experimental Setting

Network Architecture. A CNN was used as the encoding module of VPN, and the transition module consists of one option-conditional convolution layer, which uses different weights depending on the option, followed by a few more convolution layers. We used a residual connection [10] from the previous abstract-state to the next abstract-state so that the transition module learns the change of the abstract-state. 
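The residual transition described above can be sketched as follows. A per-option linear map stands in for the paper's option-conditional convolution layers, and all shapes and weights are illustrative assumptions, not the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N_OPTIONS = 8, 4                            # illustrative abstract-state dim and option count
W = rng.normal(size=(N_OPTIONS, M, M)) * 0.1   # one stand-in weight matrix per option

def f_trans(s, o):
    delta = np.tanh(W[o] @ s)   # option-conditional update
    return s + delta            # residual connection: the module predicts the *change* of s

s = rng.normal(size=M)
s_next = f_trans(s, 2)
```

Because the module outputs s + delta, an option that changes nothing about the abstract state only requires delta ≈ 0, rather than a full reconstruction of s.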
The outcome module is similar to the transition module except that it does not have a residual connection, and two fully-connected layers are used to produce the reward and discount. The value module consists of two fully-connected layers. The number of layers and hidden units vary depending on the domain. These details are described in the supplementary material.

Implementation Details. Our algorithm is based on asynchronous n-step Q-learning [20], where n is 10 and 16 threads are used. The target network is synchronized after every 10K steps. We used the Adam optimizer [14], and the best learning rate and its decay were chosen from {0.0001, 0.0002, 0.0005, 0.001} and {0.98, 0.95, 0.9, 0.8} respectively. The learning rate is multiplied by the decay every 1M steps. Our implementation is based on TensorFlow [1].1
VPN has four more hyperparameters: 1) the number of prediction steps (k) during training, 2) the plan depth (d_train) during training, 3) the plan depth (d_test) during evaluation, and 4) the branching factor (b), which indicates the number of options to be simulated at each expansion step during planning. We used k = d_train = d_test throughout the experiments unless otherwise stated. VPN(d) represents our model which learns to predict and simulate up to d-step futures during training and evaluation. The branching factor (b) was set to 4 until a depth of 3 and set to 1 after a depth of 3, which means that VPN simulates the 4-best options up to depth 3 and only the best option after that.

Baselines. We compared our approach to the following baselines.

1The code is available on https://github.com/junhyukoh/value-prediction-network.

(a) Observation (b) DQN's trajectory (c) VPN's trajectory
Figure 4: Collect domain. (a) The agent should collect as many goals as possible within a time limit which is given as additional input. (b-c) DQN collects 5 goals given 20 steps, while VPN(5) found the optimal trajectory via planning which collects 6 goals.

(a) Plan with 20 steps (b) Plan with 12 steps
Figure 5: Example of VPN's plan. VPN can plan the best future options just from the current state. The figures show VPN's different plans depending on the time limit.

• DQN: This baseline directly estimates Q-values as its output and is trained through asynchronous n-step Q-learning. Unlike the original DQN, however, our DQN baseline takes an option as additional input and applies an option-conditional convolution layer on top of the last encoding convolution layer, which is very similar to our VPN architecture.2
• VPN(1): This is identical to our VPN with the same training procedure except that it performs only a 1-step rollout to estimate the Q-value, as shown in Figure 1a. It can be viewed as a variation of DQN that predicts reward, discount, and the value of the next state as a decomposition of the Q-value.
• OPN(d): We call this the Observation Prediction Network (OPN); it is similar to VPN except that it directly predicts future observations. More specifically, we train two independent networks: a model network (f^model : x, o ↦ r, γ, x′) which predicts the reward, discount, and next observation, and a value network (f^value : x ↦ V(x)) which estimates the value from the observation. The training scheme is similar to our algorithm except that a squared loss for observation prediction is used to train the model network. This baseline performs d-step planning like VPN(d).

4.2 Collect Domain

Task Description. We defined a simple but challenging 2D navigation task where the agent should collect as many goals as possible within a time limit, as illustrated in Figure 4. In this task, the agent, goals, and walls are randomly placed for each episode. 
The agent has four options: move left/right/up/down to the first crossing branch or the end of the corridor in the chosen direction. The agent is given 20 steps for each episode and receives a positive reward (2.0) when it collects a goal by moving on top of it and a time-penalty (−0.2) for each step. Although it is easy to learn a sub-optimal policy which collects nearby goals, finding the optimal trajectory in each episode requires careful planning because the optimal solution cannot be computed in polynomial time.
An observation is represented as a 3D tensor (R^{3×10×10}) with binary values indicating the presence/absence of each object type. The time remaining is normalized to [0, 1] and is concatenated to the 3rd convolution layer of the network as a channel.
We evaluated all architectures first in a deterministic environment and then investigated their robustness in a stochastic environment separately. In the stochastic environment, each goal moves by one block with probability 0.3 at each step. In addition, each option can be repeated multiple times with probability 0.3. This makes it difficult to predict and plan the future precisely.

Overall Performance. The result is summarized in Figure 6. To understand the quality of different policies, we implemented a greedy algorithm which always collects the nearest goal first and a shortest-path algorithm which finds the optimal solution through exhaustive search assuming that the environment is deterministic. Note that even a small gap in terms of reward can be qualitatively substantial, as indicated by the small gap between the greedy and shortest-path algorithms.
The results show that many architectures learned a better-than-greedy policy in the deterministic and stochastic environments, except that the OPN baselines perform poorly in the stochastic environment. 
In addition, the performance of VPN improves as the plan depth increases, which implies that deeper predictions are reliable enough to provide more accurate value estimates of future states. As a result, VPN with 5-step planning, denoted 'VPN(5)', performs best in both environments.

2This architecture outperformed the original DQN architecture in our preliminary experiments.

(a) Deterministic (b) Stochastic
Figure 6: Learning curves on Collect domain. 'VPN(d)' represents VPN with d-step planning, while 'DQN' and 'OPN(d)' are the baselines.

Comparison to Model-free Baselines. Our VPNs outperform the DQN and VPN(1) baselines by a large margin, as shown in Figure 6. Figure 4 (b-c) shows an example of trajectories of DQN and VPN(5) given the same initial state. Although DQN's behavior is reasonable, it ended up collecting one less goal than VPN(5). We hypothesize that the 6 convolution layers used by DQN and VPN(1) are not expressive enough to find the best route in each episode, because finding an optimal path requires a combinatorial search in this task. On the other hand, VPN can perform such a combinatorial search to some extent by simulating future abstract-states, which gives it an advantage over model-free approaches on tasks that require careful planning.

Comparison to Observation-based Planning. Compared to OPNs, which perform planning based on predicted observations, VPNs perform slightly better or equally well in the deterministic environment. We observed that OPNs can predict future observations very accurately because observations in this task are simple and the environment is deterministic. Nevertheless, VPNs learn faster than OPNs in most cases. We conjecture that it takes additional training steps for OPNs to learn to predict future observations. 
In contrast, VPNs learn to predict only minimal but sufficient information for planning: the reward, discount, and value of future abstract-states, which may be the reason why VPNs learn faster than OPNs.
In the stochastic Collect domain, VPNs significantly outperform OPNs. We observed that OPNs tend to predict the average of possible future observations (E_x[x]) because OPN is deterministic. Estimating values on such blurry predictions leads to estimating V_θ(E_x[x]), which is different from the true expected value E_x[V(x)]. On the other hand, VPN is trained to approximate the true expected value because there is no explicit constraint or loss on the predicted abstract state. We hypothesize that this key distinction allows VPN to learn different modes of possible future states more flexibly in the abstract state space. This result suggests that a value-prediction model can be more beneficial than an observation-prediction model when the environment is stochastic and building an accurate observation-prediction model is difficult.

Table 1: Generalization performance. Each number represents average reward. 'FGs' and 'MWs' represent unseen environments with fewer goals and more walls respectively. Bold-faced numbers represent the highest rewards at a 95% confidence level.

Generalization Performance. One advantage of a model-based RL approach is that it can generalize well to unseen environments as long as the dynamics of the environment remain similar. To see if our VPN has such a property, we evaluated all architectures on two types of previously unseen environments with either a reduced number of goals (from 8 to 5) or an increased number of walls. It turns out that our VPN is much more robust to the unseen environments compared to the model-free baselines (DQN and VPN(1)), as shown in Table 1. 
The model-free baselines perform worse than the greedy algorithm on unseen environments, whereas VPN still performs well. In addition, VPN generalizes as well as OPN, which can learn a near-perfect model in the deterministic setting, and VPN significantly outperforms OPN in the stochastic setting. This suggests that VPN has a good generalization property like model-based RL methods and is robust to stochasticity.

              Deterministic             Stochastic
           Original  FGs   MWs      Original  FGs   MWs
Greedy       8.61   5.13  7.79        7.85   4.11  6.72
Shortest     9.71   5.82  8.98        7.84   4.27  7.15
DQN          8.66   4.57  7.08        7.55   4.09  6.79
VPN(1)       8.94   4.92  7.64        8.11   4.45  7.46
OPN(5)       9.30   5.45  8.36        7.58   4.48  7.04
VPN(5)       9.29   5.43  8.31        7.64   4.36  7.22

Table 2: Performance on Atari games. Each number represents average score over 5 top agents.

       Frostbite  Seaquest  Enduro  Alien  Q*Bert  Ms. Pacman  Amidar  Krull  Crazy Climber
DQN      3058      2951      326    1804   12592     2804       535    12438     41658
VPN      3811      5628      382    1429   14517     2689       641    15930     54119

Effect of Planning Depth. To further investigate the effect of planning depth in a VPN, we measured the average reward in the deterministic environment by varying the planning depth (d_test) from 1 to 10 during evaluation, after training VPN with a fixed number of prediction steps and planning depth (k, d_train), as shown in Figure 7. Since VPN does not learn to predict observations, there is no guarantee that it can perform deeper planning during evaluation (d_test) than the planning depth used during training (d_train). Interestingly, however, the result in Figure 7 shows that if k = d_train > 2, VPN achieves better performance during evaluation through deeper tree search (d_test > d_train). 
We also tested a VPN with k = 10 and dtrain = 5 and found that a planning depth of 10 achieved the best performance during evaluation. Thus, with a suitably large number of prediction steps during training, our VPN is able to benefit from deeper planning during evaluation than the planning depth used during training. Figure 5 shows examples of good plans of length greater than 5 found by a VPN trained with planning depth 5. Another observation from Figure 7 is that the performance of a planning depth of 1 (dtest = 1) degrades as the planning depth during training (dtrain) increases. This means that a VPN can improve its value estimates through long-term planning at the expense of the quality of short-term planning.

Figure 7: Effect of evaluation planning depth. Each curve shows average reward as a function of the planning depth, dtest, for an architecture trained with a fixed number of prediction steps. 'VPN(5)*' was trained to make 10-step predictions but performed 5-step planning during training (k = 10, dtrain = 5).

4.3 Atari Games

To investigate how VPN deals with complex visual observations, we evaluated it on several Atari games [2]. Unlike in the Collect domain, in Atari games most primitive actions have only small value consequences, and it is difficult to hand-design useful extended options. Nevertheless, we explored whether VPNs are useful in Atari games even with short-lookahead planning, using simple options that repeat the same primitive action over extended time periods via a frame-skip of 10.³ We pre-processed the game screen into 84 × 84 gray-scale images. All architectures take the last 4 frames as input. We doubled the number of hidden units of the fully-connected layer for DQN to approximately match the number of parameters.
VPN learns to predict rewards and values but not the discount (since it is fixed), and was trained to make 3-option-step predictions for planning, which means that the agent predicts up to 0.5 seconds ahead in real time (3 option-steps with a frame-skip of 10 at 60 frames per second).
As summarized in Table 2 and Figure 8, our VPN outperforms the DQN baseline on 7 out of 9 Atari games and learned significantly faster than DQN on Seaquest, Q*Bert, Krull, and Crazy Climber. One possible reason why VPN outperforms DQN is that even 3-step planning is indeed helpful for learning a better policy. Figure 9 shows an example of VPN's 3-step planning in Seaquest. Our VPN predicts reasonable values given different sequences of actions, which can potentially help choose a better action by looking at the short-term future. Another hypothesis is that the architecture of VPN itself, which has several auxiliary prediction tasks for multi-step future rewards and values, is useful for learning a good abstract-state representation even as a model-free agent. Finally, our algorithm, which performs planning to compute the target Q-value, can potentially speed up learning by generating more accurate targets, as it performs value backups multiple times from the simulated futures, as discussed in Section 3.4. These results show that our approach is applicable to complex visual environments without needing to predict observations.

³Much of the previous work on Atari games has used a frame-skip of 4. Though using a larger frame-skip generally makes training easier, it may make training harder in some games if they require more fine-grained control [17].

Figure 8: Learning curves on Atari games. The x-axis and y-axis correspond to steps and average reward over 100 episodes, respectively.

(a) State   (b) Plan 1 (19.3)   (c) Plan 2 (18.7)   (d) Plan 3 (18.4)   (e) Plan 4 (17.1)

Figure 9: Examples of VPN's value estimates.
Each figure shows the trajectory of a different sequence of actions from the initial state (a), along with VPN's value estimate in parentheses: r1 + γr2 + γ²r3 + γ³V(s4). The action sequences are (b) DownRight-DownRightFire-RightFire, (c) Up-Up-Up, (d) Left-Left-Left, and (e) Up-Right-Right. VPN predicts the highest value for (b), where the agent kills the enemy, and the lowest value for (e), where the agent is killed by the enemy.

5 Conclusion

We introduced value prediction networks (VPNs) as a new deep RL way of integrating planning and learning while simultaneously learning the dynamics of abstract states that make option-conditional predictions of future rewards/discounts/values rather than of future observations. Our empirical evaluations showed that VPNs outperform model-free DQN baselines in multiple domains, and outperform traditional observation-based planning in a stochastic domain. An interesting future direction would be to develop methods that automatically learn the options that allow good planning in VPNs.

Acknowledgement

This work was supported by NSF grant IIS-1526059. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsor.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation
platform for general agents. arXiv preprint arXiv:1207.4708, 2012.

[3] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1-43, 2012.

[4] S. Chiappa, S. Racaniere, D. Wierstra, and S. Mohamed. Recurrent environment simulators. In ICLR, 2017.

[5] C. Finn, I. J. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.

[6] C. Finn and S. Levine. Deep visual foresight for planning robot motion. In ICRA, 2017.

[7] S. Gu, T. P. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration. In ICML, 2016.

[8] X. Guo, S. P. Singh, R. L. Lewis, and H. Lee. Deep learning for reward design to improve Monte Carlo tree search in Atari games. In IJCAI, 2016.

[9] M. Hausknecht and P. Stone. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527, 2015.

[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[11] N. Heess, G. Wayne, D. Silver, T. P. Lillicrap, Y. Tassa, and T. Erez. Learning continuous control policies by stochastic value gradients. In NIPS, 2015.

[12] M. Jaderberg, V. Mnih, W. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement
learning with unsupervised auxiliary tasks. In ICLR, 2017.

[13] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.

[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[15] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In ECML, 2006.

[16] T. D. Kulkarni, A. Saeedi, S. Gautam, and S. Gershman. Deep successor reinforcement learning. arXiv preprint arXiv:1606.02396, 2016.

[17] A. S. Lakshminarayanan, S. Sharma, and B. Ravindran. Dynamic action repetition for deep reinforcement learning. In AAAI, 2017.

[18] I. Lenz, R. A. Knepper, and A. Saxena. DeepMPC: Learning deep latent features for model predictive control. In RSS, 2015.

[19] N. Mishra, P. Abbeel, and I. Mordatch. Prediction and control with temporal segment models. In ICML, 2017.

[20] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.

[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

[22] J. Oh, V. Chockalingam, S. Singh, and H. Lee. Control of memory, active perception, and action in Minecraft. In ICML, 2016.

[23] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In NIPS, 2015.

[24] E. Parisotto and R. Salakhutdinov. Neural map: Structured memory for deep reinforcement learning. arXiv preprint arXiv:1702.08360, 2017.

[25] D. Precup. Temporal abstraction in reinforcement learning.
PhD thesis, University of Massachusetts, Amherst, 2000.

[26] T. Raiko and M. Tornio. Variational Bayesian learning of nonlinear hidden state-space models for model predictive control. Neurocomputing, 72(16):3704-3712, 2009.

[27] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

[28] D. Silver, R. S. Sutton, and M. Müller. Temporal-difference search in computer Go. Machine Learning, 87:183-219, 2012.

[29] D. Silver, H. van Hasselt, M. Hessel, T. Schaul, A. Guez, T. Harley, G. Dulac-Arnold, D. Reichert, N. Rabinowitz, A. Barreto, and T. Degris. The predictron: End-to-end learning and planning. In ICML, 2017.

[30] B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.

[31] M. Stolle and D. Precup. Learning options in reinforcement learning. In SARA, 2002.

[32] R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ICML, 1990.

[33] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181-211, 1999.

[34] R. S. Sutton, C. Szepesvári, A. Geramifard, and M. H. Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In UAI, 2008.

[35] A. Tamar, S. Levine, P. Abbeel, Y. Wu, and G. Thomas. Value iteration networks. In NIPS, 2016.

[36] A. Vezhnevets, V. Mnih, S. Osindero, A. Graves, O. Vinyals, J. Agapiou, and K. Kavukcuoglu. Strategic
attentive writer for learning macro-actions. In NIPS, 2016.

[37] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas. Dueling network architectures for deep reinforcement learning. In ICML, 2016.

[38] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.

[39] H. Yao, S. Bhatnagar, D. Diao, R. S. Sutton, and C. Szepesvári. Multi-step Dyna planning for policy evaluation and control. In NIPS, 2009.