{"title": "Value Iteration Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2154, "page_last": 2162, "abstract": "We introduce the value iteration network (VIN): a fully differentiable neural network with a `planning module' embedded within. VINs can learn to plan, and are suitable for predicting outcomes that involve planning-based reasoning, such as policies for reinforcement learning. Key to our approach is a novel differentiable approximation of the value-iteration algorithm, which can be represented as a convolutional neural network, and trained end-to-end using standard backpropagation. We evaluate VIN based policies on discrete and continuous path-planning domains, and on a natural-language based search task. We show that by learning an explicit planning computation, VIN policies generalize better to new, unseen domains.", "full_text": "Value Iteration Networks\n\nAviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel\n\nDept. of Electrical Engineering and Computer Sciences, UC Berkeley\n\nAbstract\n\nWe introduce the value iteration network (VIN): a fully differentiable neural net-\nwork with a \u2018planning module\u2019 embedded within. VINs can learn to plan, and are\nsuitable for predicting outcomes that involve planning-based reasoning, such as\npolicies for reinforcement learning. Key to our approach is a novel differentiable\napproximation of the value-iteration algorithm, which can be represented as a con-\nvolutional neural network, and trained end-to-end using standard backpropagation.\nWe evaluate VIN based policies on discrete and continuous path-planning domains,\nand on a natural-language based search task. 
We show that by learning an explicit\nplanning computation, VIN policies generalize better to new, unseen domains.\n\n1\n\nIntroduction\n\nOver the last decade, deep convolutional neural networks (CNNs) have revolutionized supervised\nlearning for tasks such as object recognition, action recognition, and semantic segmentation [3, 15, 6,\n19]. Recently, CNNs have been applied to reinforcement learning (RL) tasks with visual observations\nsuch as Atari games [21], robotic manipulation [18], and imitation learning (IL) [9]. In these tasks, a\nneural network (NN) is trained to represent a policy \u2013 a mapping from an observation of the system\u2019s\nstate to an action, with the goal of representing a control strategy that has good long-term behavior,\ntypically quanti\ufb01ed as the minimization of a sequence of time-dependent costs.\nThe sequential nature of decision making in RL is inherently different than the one-step decisions\nin supervised learning, and in general requires some form of planning [2]. However, most recent\ndeep RL works [21, 18, 9] employed NN architectures that are very similar to the standard networks\nused in supervised learning tasks, which typically consist of CNNs for feature extraction, and fully\nconnected layers that map the features to a probability distribution over actions. Such networks are\ninherently reactive, and in particular, lack explicit planning computation. The success of reactive\npolicies in sequential problems is due to the learning algorithm, which essentially trains a reactive\npolicy to select actions that have good long-term consequences in its training domain.\nTo understand why planning can nevertheless be an important ingredient in a policy, consider the\ngrid-world navigation task depicted in Figure 1 (left), in which the agent can observe a map of its\ndomain, and is required to navigate between some obstacles to a target position. 
One hopes that after training a policy to solve several instances of this problem with different obstacle configurations, the policy would generalize to solve a different, unseen domain, as in Figure 1 (right). However, as we show in our experiments, while standard CNN-based networks can be easily trained to solve a set of such maps, they do not generalize well to new tasks outside this set, because they do not understand the goal-directed nature of the behavior. This observation suggests that the computation learned by reactive policies is different from planning, which is required to solve a new task¹.

¹In principle, with enough training data that covers all possible task configurations, and a rich enough policy representation, a reactive policy can learn to map each task to its optimal policy. In practice, this is often too expensive, and we offer a more data-efficient approach by exploiting a flexible prior about the planning computation underlying the behavior.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In this work, we propose a NN-based policy that can effectively learn to plan. Our model, termed a value-iteration network (VIN), has a differentiable 'planning program' embedded within the NN structure.

Figure 1: Two instances of a grid-world domain. The task is to move to the goal between the obstacles.

The key to our approach is the observation that the classic value-iteration (VI) planning algorithm [1, 2] may be represented by a specific type of CNN. By embedding such a VI network module inside a standard feed-forward classification network, we obtain a NN model that can learn the parameters of a planning computation that yields useful predictions. The VI block is differentiable, and the whole network can be trained using standard backpropagation.
This makes our policy simple to train using standard RL and IL algorithms, and straightforward to integrate with NNs for perception and control.

Connections between planning algorithms and recurrent NNs were previously explored by Ilin et al. [12]. Our work builds on related ideas, but results in a more broadly applicable policy representation. Our approach is different from model-based RL [25, 4], which requires system identification to map the observations to a dynamics model, which is then solved for a policy. In many applications, including robotic manipulation and locomotion, accurate system identification is difficult, and modelling errors can severely degrade the policy performance. In such domains, a model-free approach is often preferred [18]. Since a VIN is just a NN policy, it can be trained model-free, without requiring explicit system identification. In addition, the effects of modelling errors in VINs can be mitigated by training the network end-to-end, similarly to the methods in [13, 11].

We demonstrate the effectiveness of VINs within standard RL and IL algorithms in various problems, some of which require visual perception, continuous control, or natural-language-based decision making in the WebNav challenge [23]. After training, the policy learns to map an observation to a planning computation relevant for the task, and generate action predictions based on the resulting plan. As we demonstrate, this leads to policies that generalize better to new, unseen, task instances.

2 Background

In this section we provide background on planning, value iteration, CNNs, and policy representations for RL and IL. In the sequel, we shall show that CNNs can implement a particular form of planning computation similar to the value iteration algorithm, which can then be used as a policy for RL or IL.

Value Iteration: A standard model for sequential decision making and planning is the Markov decision process (MDP) [1, 2].
An MDP M consists of states s ∈ S, actions a ∈ A, a reward function R(s, a), and a transition kernel P(s'|s, a) that encodes the probability of the next state given the current state and action. A policy π(a|s) prescribes an action distribution for each state. The goal in an MDP is to find a policy that obtains high rewards in the long term. Formally, the value V^π(s) of a state under policy π is the expected discounted sum of rewards when starting from that state and executing policy π,

    V^π(s) := E^π [ Σ_{t=0}^{∞} γ^t r(s_t, a_t) | s_0 = s ],

where γ ∈ (0, 1) is a discount factor, and E^π denotes an expectation over trajectories of states and actions (s_0, a_0, s_1, a_1, ...), in which actions are selected according to π, and states evolve according to the transition kernel P(s'|s, a). The optimal value function V*(s) := max_π V^π(s) is the maximal long-term return possible from a state. A policy π* is said to be optimal if V^{π*}(s) = V*(s) ∀s. A popular algorithm for calculating V* and π* is value iteration (VI):

    V_{n+1}(s) = max_a Q_n(s, a)  ∀s,   where   Q_n(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) V_n(s').    (1)

It is well known that the value function V_n in VI converges as n → ∞ to V*, from which an optimal policy may be derived as π*(s) = arg max_a Q_∞(s, a).

Convolutional Neural Networks (CNNs) are NNs with a particular architecture that has proved useful for computer vision, among other domains [8, 16, 3, 15]. A CNN is comprised of stacked convolution and max-pooling layers. The input to each convolution layer is a 3-dimensional signal X, typically an image with l channels, m horizontal pixels, and n vertical pixels, and its output h is an l'-channel convolution of the image with kernels W^1, ..., W^{l'},

    h_{l',i',j'} = σ( Σ_{l,i,j} W^{l'}_{l,i,j} X_{l,i'-i,j'-j} ),

where σ is some scalar activation function. A max-pooling layer selects, for each channel l and pixel i, j in h, the maximum value among its neighbors N(i, j),

    h^{maxpool}_{l,i,j} = max_{i',j' ∈ N(i,j)} h_{l,i',j'}.

Typically, the neighbors N(i, j) are chosen as a k × k image patch around pixel i, j. After max-pooling, the image is down-sampled by a constant factor d, commonly 2 or 4, resulting in an output signal with l' channels, m/d horizontal pixels, and n/d vertical pixels. CNNs are typically trained using stochastic gradient descent (SGD), with backpropagation for computing gradients.

Reinforcement Learning and Imitation Learning: In MDPs where the state space is very large or continuous, or when the MDP transitions or rewards are not known in advance, planning algorithms cannot be applied. In these cases, a policy can be learned from either expert supervision – IL, or by trial and error – RL. While the learning algorithms in both cases are different, the policy representations – which are the focus of this work – are similar. Additionally, most state-of-the-art algorithms such as [24, 21, 26, 18] are agnostic to the policy representation, and only require it to be differentiable, for performing gradient descent on some algorithm-specific loss function. Therefore, in this paper we do not commit to a specific learning algorithm, and only consider the policy.

Let φ(s) denote an observation for state s. The policy is specified as a parametrized function π_θ(a|φ(s)) mapping observations to a probability over actions, where θ are the policy parameters. For example, the policy could be represented as a neural network, with θ denoting the network weights. The goal is to tune the parameters such that the policy behaves well in the sense that π_θ(a|φ(s)) ≈ π*(a|φ(s)), where π* is the optimal policy for the MDP, as defined in Section 2.

In IL, a dataset of N state observations and corresponding optimal actions {φ(s^i), a^i ∼ π*(φ(s^i))}_{i=1,...,N} is generated by an expert. Learning a policy then becomes an instance of supervised learning [24, 9]. In RL, the optimal action is not available, but instead, the agent can act in the world and observe the rewards and state transitions its actions effect. RL algorithms such as in [27, 21, 26, 18] use these observations to improve the value of the policy.

3 The Value Iteration Network Model

In this section we introduce a general policy representation that embeds an explicit planning module. As stated earlier, the motivation for such a representation is that a natural solution to many tasks, such as the path planning described above, involves planning on some model of the domain.

Let M denote the MDP of the domain for which we design our policy π. We assume that there is some unknown MDP M̄ such that the optimal plan in M̄ contains useful information about the optimal policy in the original task M. However, we emphasize that we do not assume to know M̄ in advance. Our idea is to equip the policy with the ability to learn and solve M̄, and to add the solution of M̄ as an element in the policy π. We hypothesize that this will lead to a policy that automatically learns a useful M̄ to plan on.
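As a concrete illustration, the VI update in Eq. (1) can be written in a few lines of NumPy. This tabular sketch (the array shapes and toy interface are our own, not from the paper) is exactly the computation that the VI module later approximates with convolutions:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, K=50):
    """Tabular value iteration, Eq. (1).

    P: (A, S, S) transition kernel, P[a, s, s'] = P(s'|s, a)
    R: (S, A) reward function
    Returns (V, Q) after K updates.
    """
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(K):
        # Q[s, a] = R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
        Q = R + gamma * np.einsum('asp,p->sa', P, V)
        V = Q.max(axis=1)  # greedy backup over actions
    return V, Q
```

For large K the iterates contract toward V*, and a greedy policy can be read off as `Q.argmax(axis=1)`.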
We denote by s̄ ∈ S̄, ā ∈ Ā, R̄(s̄, ā), and P̄(s̄'|s̄, ā) the states, actions, rewards, and transitions in M̄. To facilitate a connection between M and M̄, we let R̄ and P̄ depend on the observation in M, namely, R̄ = f_R(φ(s)) and P̄ = f_P(φ(s)), and we will later learn the functions f_R and f_P as a part of the policy learning process.

For example, in the grid-world domain described above, we can let M̄ have the same state and action spaces as the true grid-world M. The reward function f_R can map an image of the domain to a high reward at the goal, and negative reward near an obstacle, while f_P can encode deterministic movements in the grid-world that do not depend on the observation. While these rewards and transitions are not necessarily the true rewards and transitions in the task, an optimal plan in M̄ will still follow a trajectory that avoids obstacles and reaches the goal, similarly to the optimal plan in M.

Once an MDP M̄ has been specified, any standard planning algorithm can be used to obtain the value function V̄*. In the next section, we shall show that using a particular implementation of VI for planning has the advantage of being differentiable, and simple to implement within a NN framework. In this section, however, we focus on how to use the planning result V̄* within the NN policy π. Our approach is based on two important observations. The first is that the vector of values V̄*(s̄) ∀s̄ encodes all the information about the optimal plan in M̄.
Thus, adding the vector V̄* as additional features to the policy π is sufficient for extracting information about the optimal plan in M̄. However, an additional property of V̄* is that the optimal decision π̄*(s̄) at a state s̄ can depend only on a subset of the values of V̄*, since

    π̄*(s̄) = arg max_{ā} [ R̄(s̄, ā) + γ Σ_{s̄'} P̄(s̄'|s̄, ā) V̄*(s̄') ].

Therefore, if the MDP has a local connectivity structure, such as in the grid-world example above, the set of states for which P̄(s̄'|s̄, ā) > 0 is a small subset of S̄.

In NN terminology, this is a form of attention [31], in the sense that for a given label prediction (action), only a subset of the input features (value function) is relevant. Attention is known to improve learning performance by reducing the effective number of network parameters during learning. Therefore, the second element in our network is an attention module that outputs a vector of (attention-modulated) values ψ(s). Finally, the vector ψ(s) is added as additional features to a reactive policy π_re(a|φ(s), ψ(s)). The full network architecture is depicted in Figure 2 (left).

Returning to our grid-world example, at a particular state s, the reactive policy only needs to query the values of the states neighboring s in order to select the correct action. Thus, the attention module in this case could return a ψ(s) vector with a subset of V̄* for these neighboring states.

Figure 2: Planning-based NN models. Left: a general policy representation that adds value function features from a planner to a reactive policy.
Right: VI module \u2013 a CNN representation of VI algorithm.\nLet \u03b8 denote all the parameters of the policy, namely, the parameters of fR, fP , and \u03c0re, and note\nthat \u03c8(s) is in fact a function of \u03c6(s). Therefore, the policy can be written in the form \u03c0\u03b8(a|\u03c6(s)),\nsimilarly to the standard policy form (cf. Section 2). If we could back-propagate through this function,\nthen potentially we could train the policy using standard RL and IL algorithms, just like any other\nstandard policy representation. While it is easy to design functions fR and fP that are differentiable\n(and we provide several examples in our experiments), back-propagating the gradient through the\nplanning algorithm is not trivial. In the following, we propose a novel interpretation of an approximate\nVI algorithm as a particular form of a CNN. This allows us to conveniently treat the planning module\nas just another NN, and by back-propagating through it, we can train the whole policy end-to-end.\n3.1 The VI Module\nWe now introduce the VI module \u2013 a NN that encodes a differentiable planning computation.\nOur starting point is the VI algorithm (1). Our main observation is that each iteration of VI may\nbe seen as passing the previous value function Vn and reward function R through a convolution\nlayer and max-pooling layer. In this analogy, each channel in the convolution layer corresponds to\nthe Q-function for a speci\ufb01c action, and convolution kernel weights correspond to the discounted\ntransition probabilities. Thus by recurrently applying a convolution layer K times, K iterations of VI\nare effectively performed.\nFollowing this idea, we propose the VI network module, as depicted in Figure 2B. The inputs to the\nVI module is a \u2018reward image\u2019 \u00afR of dimensions l, m, n, where here, for the purpose of clarity, we\nfollow the CNN formulation and explicitly assume that the state space \u00afS maps to a 2-dimensional\ngrid. 
However, our approach can be extended to general discrete state spaces, for example, a graph, as we report in the WikiNav experiment in Section 4.4. The reward is fed into a convolutional layer Q̄ with Ā channels and a linear activation function,

    Q̄_{ā,i',j'} = Σ_{l,i,j} W^{ā}_{l,i,j} R̄_{l,i'-i,j'-j}.

Each channel in this layer corresponds to Q̄(s̄, ā) for a particular action ā. This layer is then max-pooled along the actions channel to produce the next-iteration value function layer V̄, V̄_{i,j} = max_{ā} Q̄(ā, i, j). The next-iteration value function layer V̄ is then stacked with the reward R̄, and fed back into the convolutional layer and max-pooling layer K times, to perform K iterations of value iteration.

The VI module is simply a NN architecture that has the capability of performing an approximate VI computation. Nevertheless, representing VI in this form makes learning the MDP parameters and reward function natural – by backpropagating through the network, similarly to a standard CNN. VI modules can also be composed hierarchically, by treating the value of one VI module as additional input to another VI module. We further report on this idea in the supplementary material.

3.2 Value Iteration Networks

We now have all the ingredients for a differentiable planning-based policy, which we term a value iteration network (VIN). The VIN is based on the general planning-based policy defined above, with the VI module as the planning algorithm. In order to implement a VIN, one has to specify the state and action spaces for the planning module S̄ and Ā, the reward and transition functions f_R and f_P, and the attention function; we refer to this as the VIN design.
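To make the recurrence concrete, here is a minimal NumPy sketch of one VI module: a convolution produces the Q̄ channels (one per action), a max over the action axis produces V̄, and the loop repeats K times. The single reward channel, hand-rolled 3 × 3 'same' correlation, and all names are our simplifications for illustration; the paper's implementation is a CNN trained by backpropagation, not this explicit loop.

```python
import numpy as np

def conv2d(x, k):
    """'Same' 2-D correlation of an (m, n) image with a 3x3 kernel, zero-padded."""
    m, n = x.shape
    xp = np.pad(x, 1)                       # zero padding on all sides
    out = np.zeros_like(x, dtype=float)
    for di in range(3):
        for dj in range(3):
            out += k[di, dj] * xp[di:di + m, dj:dj + n]
    return out

def vi_module(r_img, w_r, w_v, K=10):
    """K rounds of conv + max-pool over action channels.

    r_img: (m, n) reward image (single channel for simplicity)
    w_r, w_v: (A, 3, 3) kernels applied to the reward and value channels;
              their entries play the role of discounted transition weights.
    Returns the final value map V (m, n) and Q (A, m, n).
    """
    m, n = r_img.shape
    A = w_r.shape[0]
    V = np.zeros((m, n))
    for _ in range(K):
        # Q-channels: convolve the stacked (reward, value) input
        Q = np.stack([conv2d(r_img, w_r[a]) + conv2d(V, w_v[a])
                      for a in range(A)])    # (A, m, n)
        V = Q.max(axis=0)                    # max-pool over actions
    return V, Q
```

The final Q can then be handed to the attention module, which selects the entries relevant to the current state.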
For some tasks, as we show in our\nexperiments, it is relatively straightforward to select a suitable design, while other tasks may require\nmore thought. However, we emphasize an important point: the reward, transitions, and attention can\nbe de\ufb01ned by parametric functions, and trained with the whole policy2. Thus, a rough design can be\nspeci\ufb01ed, and then \ufb01ne-tuned by end-to-end training.\nOnce a VIN design is chosen, implementing the VIN is straightforward, as it is simply a form of a\nCNN. The networks in our experiments all required only several lines of Theano [28] code. In the\nnext section, we evaluate VIN policies on various domains, showing that by learning to plan, they\nachieve a better generalization capability.\n4 Experiments\nIn this section we evaluate VINs as policy representations on various domains. Additional experiments\ninvestigating RL and hierarchical VINs, as well as technical implementation details are discussed in\nthe supplementary material. Source code is available at https://github.com/avivt/VIN.\nOur goal in these experiments is to investigate the following questions:\n\n1. Can VINs effectively learn a planning computation using standard RL and IL algorithms?\n2. Does the planning computation learned by VINs make them better than reactive policies at\n\ngeneralizing to new domains?\n\nAn additional goal is to point out several ideas for designing VINs for various tasks. While this is not\nan exhaustive list that \ufb01ts all domains, we hope that it will motivate creative designs in future work.\n4.1 Grid-World Domain\nOur \ufb01rst experiment domain is a synthetic grid-world with randomly placed obstacles, in which the\nobservation includes the position of the agent, and also an image of the map of obstacles and goal\nposition. Figure 3 shows two random instances of such a grid-world of size 16 \u00d7 16. 
We conjecture\nthat by learning the optimal policy for several instances of this domain, a VIN policy would learn the\nplanning computation required to solve a new, unseen, task.\nIn such a simple domain, an optimal policy can easily be calculated using exact VI. Note, however,\nthat here we are interested in evaluating whether a NN policy, trained using RL or IL, can learn\nto plan. In the following results, policies were trained using IL, by standard supervised learning\nfrom demonstrations of the optimal policy. In the supplementary material, we report additional RL\nexperiments that show similar \ufb01ndings.\nWe design a VIN for this task following the guidelines described above, where the planning MDP \u00afM\nis a grid-world, similar to the true MDP. The reward mapping fR is a CNN mapping the image input to\na reward map in the grid-world. Thus, fR should potentially learn to discriminate between obstacles,\nnon-obstacles and the goal, and assign a suitable reward to each. The transitions \u00afP were de\ufb01ned as\n3 \u00d7 3 convolution kernels in the VI block, exploiting the fact that transitions in the grid-world are\nlocal3. The recurrence K was chosen in proportion to the grid-world size, to ensure that information\ncan \ufb02ow from the goal state to any other state. For the attention module, we chose a trivial approach\nthat selects the \u00afQ values in the VI block for the current state, i.e., \u03c8(s) = \u00afQ(s,\u00b7). The \ufb01nal reactive\npolicy is a fully connected network that maps \u03c8(s) to a probability over actions.\nWe compare VINs to the following NN reactive policies:\nCNN network: We devised a CNN-based reactive policy inspired by the recent impressive results of\nDQN [21], with 5 convolution layers, and a fully connected output. While the network in [21] was\ntrained to predict Q values, our network outputs a probability over actions. These terms are related,\nsince \u03c0\u2217(s) = arg maxa Q(s, a). 
Fully Convolutional Network (FCN): The problem setting for this domain is similar to semantic segmentation [19], in which each pixel in the image is assigned a semantic label (the action in our case). We therefore devised an FCN inspired by a state-of-the-art semantic segmentation algorithm [19], with 3 convolution layers, where the first layer has a filter that spans the whole image, to properly convey information from the goal to every other state.

In Table 1 we present the average 0-1 prediction loss of each model, evaluated on a held-out test set of maps with random obstacles, goals, and initial states, for different problem sizes. In addition, for each map, a full trajectory from the initial state was predicted, by iteratively rolling out the next states

²VINs are fundamentally different from inverse RL methods [22], where transitions are required to be known.
³Note that the transitions defined this way do not depend on the state s̄. Interestingly, we shall see that the network learned to plan successful trajectories nevertheless, by appropriately shaping the reward.

Figure 3: Grid-world domains (best viewed in color). A,B: Two random instances of the 28 × 28 synthetic gridworld, with the VIN-predicted trajectories and ground-truth shortest paths between random start and goal positions. C: An image of the Mars domain, with points of elevation sharper than 10° colored in red. These points were calculated from a matching image of elevation data (not shown), and were not available to the learning algorithm. Note the difficulty of distinguishing between obstacles and non-obstacles.
D: The VIN-predicted (purple line with cross markers) and the shortest-path ground-truth (blue line) trajectories between random start and goal positions.

Domain     VIN (loss / succ. / traj. diff.)   CNN (loss / succ. / traj. diff.)   FCN (loss / succ. / traj. diff.)
8 × 8      0.004 / 99.6% / 0.001              0.02 / 97.9% / 0.006               0.01 / 97.3% / 0.004
16 × 16    0.05 / 99.3% / 0.089               0.10 / 87.6% / 0.06                0.07 / 88.3% / 0.05
28 × 28    0.11 / 97% / 0.086                 0.13 / 74.2% / 0.078               0.09 / 76.6% / 0.08

Table 1: Performance on grid-world domain. Top: comparison with reactive policies. For all domain sizes, VIN networks significantly outperform standard reactive networks. Note that the performance gap increases dramatically with problem size.

predicted by the network. A trajectory was said to succeed if it reached the goal without hitting obstacles. For each trajectory that succeeded, we also measured its difference in length from the optimal trajectory. The average difference and the average success rate are reported in Table 1.

Clearly, VIN policies generalize to domains outside the training set. A visualization of the reward mapping f_R (see supplementary material) shows that it is negative at obstacles, positive at the goal, and a small negative constant otherwise. The resulting value function has a gradient pointing towards a direction to the goal around obstacles, thus a useful planning computation was learned. VINs also significantly outperform the reactive networks, and the performance gap increases dramatically with the problem size. Importantly, note that the prediction loss for the reactive policies is comparable to the VINs', although their success rate is significantly worse. This shows that this is not a standard case of overfitting/underfitting of the reactive policies.
Rather, VIN policies, by their VI structure, focus prediction errors on less important parts of the trajectory, while reactive policies do not make this distinction, and learn the easily predictable parts of the trajectory yet fail on the complete task.

The VINs have an effective depth of K, which is larger than the depth of the reactive policies. One may wonder whether any deep enough network would learn to plan. In principle, a CNN or FCN of depth K has the potential to perform the same computation as a VIN. However, it has many more parameters, requiring much more training data. We evaluate this by untying the weights in the K recurrent layers in the VIN. Our results, reported in the supplementary material, show that untying the weights degrades performance, with a stronger effect for smaller sizes of training data.

4.2 Mars Rover Navigation

In this experiment we show that VINs can learn to plan from natural image input. We demonstrate this on path-planning from overhead terrain images of a Mars landscape. Each domain is represented by a 128 × 128 image patch, on which we defined a 16 × 16 grid-world, where each state was considered an obstacle if the terrain in its corresponding 8 × 8 image patch contained an elevation angle of 10 degrees or more, evaluated using an external elevation database. An example of the domain and terrain image is depicted in Figure 3. The MDP for shortest-path planning in this case is similar to the grid-world domain of Section 4.1, and the VIN design was similar, only with a deeper CNN in the reward mapping f_R for processing the image.

The policy was trained to predict the shortest path directly from the terrain image. We emphasize that the elevation data is not part of the input, and must be inferred (if needed) from the terrain image.

Network   Train Error   Test Error
VIN       0.30          0.35
CNN       0.39          0.59

Figure 4: Continuous control domain.
Top: average distance to goal on training and test domains for VIN and CNN policies. Bottom: trajectories predicted by VIN and CNN on test domains.

After training, VIN achieved a success rate of 84.8%. To put this rate in context, we compare with the best performance achievable without access to the elevation data, which is 90.3%. To make this comparison, we trained a CNN to classify whether an 8 × 8 patch is an obstacle or not. This classifier was trained using the same image data as the VIN network, but its labels were the true obstacle classifications from the elevation map (we reiterate that the VIN did not have access to these ground-truth obstacle labels during training or testing). The success rate of a planner that uses the obstacle map generated by this classifier from the raw image is 90.3%, showing that obstacle identification from the raw image is indeed challenging. Thus, the success rate of the VIN, which was trained without any obstacle labels, and had to 'figure out' the planning process, is quite remarkable.

4.3 Continuous Control

We now consider a 2D path-planning domain with continuous states and continuous actions, which cannot be solved using VI, and therefore a VIN cannot be naively applied. Instead, we construct the VIN to perform 'high-level' planning on a discrete, coarse, grid-world representation of the continuous domain. We shall show that a VIN can learn to plan such a 'high-level' plan, and also exploit that plan within its 'low-level' continuous control policy. Moreover, the VIN policy results in better generalization than a reactive policy.

Consider the domain in Figure 4. A red-colored particle needs to be navigated to a green goal using horizontal and vertical forces. Gray-colored obstacles are randomly positioned in the domain, and apply an elastic force and friction when contacted. This domain presents a non-trivial control problem, as the agent needs to both plan a feasible trajectory between the obstacles (or use them to bounce off), and control the particle (which has mass and inertia) to follow it. The state observation consists of the particle's continuous position and velocity, and a static 16 × 16 downscaled image of the obstacles and goal position in the domain. In principle, such an observation is sufficient to devise a 'rough plan' for the particle to follow.

As in our previous experiments, we investigate whether a policy trained on several instances of this domain, with different start state, goal, and obstacle positions, would generalize to an unseen domain. For training we chose the guided policy search (GPS) algorithm with unknown dynamics [17], which is suitable for learning policies for continuous dynamics with contacts; we used the publicly available GPS code [7], and MuJoCo [29] for physical simulation. We generated 200 random training instances, and evaluate our performance on 40 different test instances from the same distribution.

Our VIN design is similar to the grid-world cases, with some important modifications: the attention module selects a 5 × 5 patch of the value V̄, centered around the current (discretized) position in the map. The final reactive policy is a 3-layer fully connected network, with a 2-dimensional continuous output for the controls. In addition, due to the limited number of training domains, we pre-trained the VIN with transition weights that correspond to discounted grid-world transitions. This is a reasonable prior for the weights in a 2-D task, and we emphasize that even with this initialization, the initial value function is meaningless, since the reward map f_R is not yet learned.
We compare with a CNN-based reactive policy inspired by the state-of-the-art results in [21, 20], with 2 CNN layers for image processing, followed by a 3-layer fully connected network similar to the VIN reactive policy.

Figure 4 shows the performance of the trained policies, measured as the final distance to the target. The VIN clearly outperforms the CNN on test domains. We also plot several trajectories of both policies on test domains, showing that the VIN learned a more sensible generalization of the task.

4.4 WebNav Challenge

In the previous experiments, the planning aspect of the task corresponded to 2D navigation. We now consider a more general domain: WebNav [23], a language-based search task on a graph.

In WebNav [23], the agent needs to navigate the links of a website towards a goal web-page, specified by a short 4-sentence query. At each state s (a web-page), the agent observes average word-embedding features of the state φ(s) and of the possible next states φ(s′) (linked pages), as well as the features of the query φ(q), and based on these has to select which link to follow. In [23], the search was performed on the Wikipedia website. Here, we report experiments on the 'Wikipedia for Schools' website, a simplified Wikipedia designed for children, with over 6000 pages and at most 292 links per page.

In [23], a NN-based policy was proposed, which first learns a NN mapping from (φ(s), φ(q)) to a hidden state vector h. The action is then selected according to π(s′|φ(s), φ(q)) ∝ exp(h⊤φ(s′)). In essence, this policy is reactive, and relies on the word-embedding features at each state containing meaningful information about the path to the goal.
Indeed, this property naturally holds for an encyclopedic website that is structured as a tree of categories, sub-categories, sub-sub-categories, and so on.

We sought to explore whether planning, based on a VIN, can lead to better performance in this task, with the intuition that a plan on a simplified model of the website can help guide the reactive policy on difficult queries. Therefore, we designed a VIN that plans on a small subset of the graph, containing only the 1st- and 2nd-level categories (< 3% of the graph) and their word-embedding features.

Designing this VIN requires a different approach from the grid-world VINs described earlier, where the most challenging aspect is to define a meaningful mapping between nodes in the true graph and nodes in the smaller VIN graph. For the reward mapping fR, we chose a weighted similarity measure between the query features φ(q) and the features of nodes in the small graph φ(s̄). Intuitively, nodes that are similar to the query should have high reward. The transitions were fixed based on the graph connectivity of the smaller VIN graph, which is known, though different from the true graph. The attention module was also based on a weighted similarity measure, between the features of the possible next states φ(s′) and the features of each node in the simplified graph φ(s̄). The reactive-policy part of the VIN was similar to the policy of [23] described above. Note that by training such a VIN end-to-end, we are effectively learning how to exploit the small graph for doing better planning on the true, large graph.

Both the VIN policy and the baseline reactive policy were trained by supervised learning, on random trajectories that start from the root node of the graph.
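As a concrete illustration of the two similarity-based components above, the following NumPy sketch shows the reactive scoring rule of [23], π(s′|φ(s), φ(q)) ∝ exp(h⊤φ(s′)), and a weighted-similarity reward of the kind we describe for fR. The function names, the elementwise weighting, and the toy feature vectors are our own assumptions for illustration; in the VIN the weights are learned end-to-end.

```python
import numpy as np

def softmax_link_policy(h, link_feats):
    """Reactive policy of [23]: score each linked page s' by
    exp(h . phi(s')) and normalize over the available links.
    `h` stands in for the hidden vector computed from (phi(s), phi(q));
    `link_feats` has one row per linked page."""
    scores = link_feats @ h
    scores = scores - scores.max()  # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

def similarity_reward(query_feat, node_feats, w):
    """Weighted similarity between the query features phi(q) and each
    simplified-graph node phi(s_bar): reward of node i is
    phi(s_bar_i) . (w * phi(q)), with `w` a learned weight vector."""
    return node_feats @ (w * query_feat)
```

The attention module can reuse the same weighted-similarity form, scoring each simplified-graph node against the features of the candidate next pages instead of the query.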
Similarly to [23], a policy is said to succeed on a query if all the correct predictions along the path are within its top-4 predictions.

After training, the VIN policy performed mildly better than the baseline on 2000 held-out test queries when starting from the root node, achieving 1030 successful runs vs. 1025 for the baseline. However, when we tested the policies on the harder task of starting from a random position in the graph, VINs significantly outperformed the baseline, achieving 346 successful runs vs. 304 for the baseline, out of 4000 test queries. These results confirm that, when navigating the tree of categories down from the root, the features at each state indeed contain meaningful information about the path to the goal, making a reactive policy sufficient. However, when starting the navigation from a different state, a reactive policy may fail to understand that it first needs to go back to the root and switch to a different branch of the tree. Our results indicate that such a strategy can be better represented by a VIN.

We remark that there is still room for further improvement of the WebNav results, e.g., through better models for the reward and attention functions, and better word-embedding representations of the text.

5 Conclusion and Outlook

The introduction of powerful and scalable RL methods has opened up a range of new problems for deep learning. However, few recent works investigate policy architectures that are specifically tailored for planning under uncertainty, and current RL theory and benchmarks rarely investigate the generalization properties of a trained policy [27, 21, 5].
This work takes a step in this direction, by exploring better-generalizing policy representations. Our VIN policies learn an approximate planning computation relevant for solving the task, and we have shown that such a computation leads to better generalization in a diverse set of tasks, ranging from simple grid-worlds that are amenable to value iteration, to continuous control, and even to navigation of Wikipedia links. In future work we intend to learn different planning computations, based on simulation [10] or optimal linear control [30], and to combine them with reactive policies, potentially developing RL solutions for task and motion planning [14].

Acknowledgments
This research was funded in part by Siemens, by ONR through a PECASE award, by the Army Research Office through the MAST program, and by an NSF CAREER grant. A. T. was partially funded by the Viterbi Scholarship, Technion. Y. W. was partially funded by a DARPA PPAML program, contract FA8750-14-C-0011.

References
[1] R. Bellman. Dynamic Programming. Princeton University Press, 1957.
[2] D. Bertsekas. Dynamic Programming and Optimal Control, Vol. II. Athena Scientific, 4th edition, 2012.
[3] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition, pages 3642–3649, 2012.
[4] M. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In ICML, 2011.
[5] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. arXiv preprint arXiv:1604.06778, 2016.
[6] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915–1929, 2013.
[7] C. Finn, M. Zhang, J. Fu, X. Tan, Z. McCarthy, E. Scharff, and S. Levine.
Guided policy search code implementation, 2016. Software available from rll.berkeley.edu/gps.
[8] K. Fukushima. Neural network model for a mechanism of pattern recognition unaffected by shift in position: Neocognitron. Transactions of the IECE, J62-A(10):658–665, 1979.
[9] A. Giusti et al. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 2016.
[10] X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In NIPS, 2014.
[11] X. Guo, S. Singh, R. Lewis, and H. Lee. Deep learning for reward design to improve Monte Carlo tree search in Atari games. arXiv:1604.07095, 2016.
[12] R. Ilin, R. Kozma, and P. J. Werbos. Efficient learning in cellular simultaneous recurrent neural networks: the case of maze navigation problem. In ADPRL, 2007.
[13] J. Joseph, A. Geramifard, J. W. Roberts, J. P. How, and N. Roy. Reinforcement learning with misspecified model classes. In ICRA, 2013.
[14] L. P. Kaelbling and T. Lozano-Pérez. Hierarchical task and motion planning in the now. In IEEE International Conference on Robotics and Automation (ICRA), pages 1470–1477, 2011.
[15] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[17] S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In NIPS, 2014.
[18] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. JMLR, 17, 2016.
[19] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[20] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[22] G. Neu and C. Szepesvári. Apprenticeship learning using inverse reinforcement learning and gradient methods. In UAI, 2007.
[23] R. Nogueira and K. Cho. WebNav: A new large-scale task for natural language based sequential decision making. arXiv preprint arXiv:1602.02261, 2016.
[24] S. Ross, G. Gordon, and A. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.
[25] J. Schmidhuber. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. In International Joint Conference on Neural Networks. IEEE, 1990.
[26] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In ICML, 2015.
[27] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[28] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.
[29] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.
[30] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, 2015.
[31] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R.
Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.