{"title": "Towards Generalization and Simplicity in Continuous Control", "book": "Advances in Neural Information Processing Systems", "page_first": 6550, "page_last": 6561, "abstract": "The remarkable successes of deep learning in speech recognition and computer vision have motivated efforts to adapt similar techniques to other problem domains, including reinforcement learning (RL). Consequently, RL methods have produced rich motor behaviors on simulated robot tasks, with their success largely attributed to the use of multi-layer neural networks. This work is among the first to carefully study what might be responsible for these recent advancements. Our main result calls this emerging narrative into question by showing that much simpler architectures -- based on linear and RBF parameterizations -- achieve comparable performance to state of the art results. We not only study different policy representations with regard to performance measures at hand, but also towards robustness to external perturbations. We again find that the learned neural network policies --- under the standard training scenarios --- are no more robust than linear (or RBF) policies; in fact, all three are remarkably brittle. Finally, we then directly modify the training scenarios in order to favor more robust policies, and we again do not find a compelling case to favor multi-layer architectures. Overall, this study suggests that multi-layer architectures should not be the default choice, unless a side-by-side comparison to simpler architectures shows otherwise. 
More generally, we hope that these results lead to more interest in carefully studying the architectural choices, and associated trade-offs, for training generalizable and robust policies.", "full_text": "Towards Generalization and Simplicity in Continuous Control

Aravind Rajeswaran∗ Kendall Lowrey∗ Emanuel Todorov Sham Kakade

University of Washington, Seattle

{ aravraj, klowrey, todorov, sham } @ cs.washington.edu

Abstract

This work shows that policies with simple linear and RBF parameterizations can be trained to solve a variety of widely studied continuous control tasks, including the gym-v1 benchmarks. The performance of these trained policies is competitive with state of the art results, obtained with more elaborate parameterizations such as fully connected neural networks. Furthermore, the standard training and testing scenarios for these tasks are shown to be very limited and prone to over-fitting, thus giving rise to only trajectory-centric policies. Training with a diverse initial state distribution induces more global policies with better generalization. This allows for interactive control scenarios where the system recovers from large on-line perturbations, as shown in the supplementary video.

1 Introduction

Deep reinforcement learning (deepRL) has recently achieved impressive results on a number of hard problems, including sequential decision making in game domains [1, 2]. This success has motivated efforts to adapt deepRL methods for control of physical systems, and has resulted in rich motor behaviors [3, 4]. The complexity of systems solvable with deepRL methods is not yet at the level of what can be achieved with trajectory optimization (planning) in simulation [5, 6, 7], or with hand-crafted controllers on physical robots (e.g. Boston Dynamics).
However, RL approaches are exciting because they are generic, model-free, and highly automated.

The recent success of RL [2, 8, 9, 10, 11] has been enabled largely by engineering efforts such as large scale data collection [1, 2, 11] or careful systems design [8, 9] with well-behaved robots. When advances in a field are largely empirical in nature, it is important to understand the relative contributions of representations, optimization methods, and task design or modeling: both as a sanity check and to scale up to harder tasks. Furthermore, in line with Occam's razor, the simplest reasonable approaches should be tried and understood first. A thorough understanding of these factors is unfortunately lacking in the community.

Against this backdrop, we ask the pertinent question: "What is the simplest set of ingredients needed to succeed in some of the popular benchmarks?" To address this question, we use the Gym-v1 [12] continuous control benchmarks, which have accelerated research and enabled objective comparisons. Since the tasks involve under-actuation and contact dynamics, and are high dimensional (continuous space), they have been accepted as benchmarks in the deepRL community. Recent works test their algorithms either exclusively or primarily on these tasks [13, 4, 14], and success on these tasks has been regarded as demonstrating a "proof of concept".

Our contributions: Our results and their implications are highlighted below, with more elaborate discussions in Section 5:

∗ Equal contributions. Project page: https://sites.google.com/view/simple-pol

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1. The success of recent RL efforts to produce rich motor behaviors has largely been attributed to the use of multi-layer neural network architectures.
This work is among the first to carefully analyze the role of representation, and our results indicate that very simple policies, including linear and RBF parameterizations, are able to achieve state of the art results on widely studied tasks. Furthermore, such policies, particularly the linear ones, can be trained significantly faster due to orders of magnitude fewer parameters. This indicates that even for tasks with complex dynamics, there could exist relatively simple policies. This opens the door for studying a wide range of representations in addition to deep neural networks, and for understanding trade-offs including computational time, theoretical justification, robustness, sample complexity, etc.

2. We study these issues not only with regard to the performance metric at hand, but also take the further step of examining them in the context of robustness. Our results indicate that, with conventional training methods, the agent is able to successfully learn a limit cycle for walking, but cannot recover from any perturbations that are delivered to it. For transferring the success of RL to robotics, such brittleness is highly undesirable.

3. Finally, we directly attempt to learn more robust policies by using more diverse training conditions, which favor such policies. This is similar in spirit to the model ensemble approaches [15, 16] and domain randomization approaches [17, 18], which have successfully demonstrated improved robustness and simulation-to-real-world transfer. Under these new and more diverse training scenarios, we again find that there is no compelling evidence to favor the use of multi-layer architectures, at least for the benchmark tasks.
On a side note, we also provide interactive testing of learned policies, which we believe is both novel and sheds light on the robustness of trained policies.

2 Problem Formulation and Methods

We consider Markov Decision Processes (MDPs) in the average reward setting, defined using the tuple M = {S, A, R, T, ρ0}. Here S ⊆ R^n, A ⊆ R^m, and R : S × A → R are a (continuous) set of states, a set of actions, and a reward function respectively, and have the usual meaning. T : S × A → S is the stochastic transition function and ρ0 is the probability distribution over initial states. We wish to solve for a stochastic policy of the form π : S × A → R+, which optimizes the objective function:

    η(π) = lim_{T→∞} E_{π,M} [ (1/T) Σ_{t=1}^{T} r_t ].    (1)

Since we use simulations with finite length rollouts to estimate the objective and gradient, we approximate η(π) using a finite T. In this finite horizon rollout setting, we define the value, Q, and advantage functions as follows:

    V^π(s, t) = E_{π,M} [ Σ_{t'=t}^{T} r_{t'} ]

    Q^π(s, a, t) = E_M [ R(s, a) ] + E_{s'∼T(s,a)} [ V^π(s', t+1) ]

    A^π(s, a, t) = Q^π(s, a, t) − V^π(s, t)

Note that even though the value functions are time-varying, we still optimize for a stationary policy. We consider parametrized policies π_θ, and hence wish to optimize for the parameters θ. Thus, we overload notation and use η(π) and η(θ) interchangeably.

2.1 Algorithm

Ideally, a controlled scientific study would seek to isolate the challenges related to architecture, task design, and training methods for separate study.
In practice, this is not entirely feasible as the results are partly coupled with the training methods. Here, we utilize a straightforward natural policy gradient method for training. The work in [19] suggests that this method is competitive with most state of the art methods. We now discuss the training procedure.

Using the likelihood ratio approach and the Markov property of the problem, the sample based estimate of the policy gradient is derived to be [20]:

    ∇̂_θ η(θ) = g = (1/T) Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) Â^π(s_t, a_t, t)    (2)

Algorithm 1 Policy Search with Natural Gradient
1: Initialize policy parameters to θ_0
2: for k = 1 to K do
3:   Collect trajectories {τ^(1), ..., τ^(N)} by rolling out the stochastic policy π(·; θ_k).
4:   Compute ∇_θ log π(a_t|s_t; θ_k) for each (s, a) pair along the trajectories sampled in iteration k.
5:   Compute advantages A^π_k based on the trajectories in iteration k and the approximate value function V^π_{k−1}.
6:   Compute the policy gradient according to (2).
7:   Compute the Fisher matrix (4) and perform gradient ascent (5).
8:   Update parameters of the value function to approximate V^π_k(s_t^(n)) ≈ R(s_t^(n)), where R(s_t^(n)) = Σ_{t'=t}^{T} γ^(t'−t) r_{t'}^(n) is the empirical return computed over the trajectories. Here n indexes the trajectories.
9: end for

Gradient ascent using this "vanilla" gradient is sub-optimal since it is not the steepest ascent direction in the metric of the parameter space [21, 22].
The steepest ascent direction is obtained by solving the following local optimization problem around iterate θ_k:

    maximize_θ  g^T (θ − θ_k)   subject to  (θ − θ_k)^T F_{θ_k} (θ − θ_k) ≤ δ,    (3)

where F_{θ_k} is the Fisher Information Metric at the current iterate θ_k. We estimate F_{θ_k} as

    F̂_{θ_k} = (1/T) Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) ∇_θ log π_θ(a_t|s_t)^T,    (4)

as originally suggested by Kakade [22]. This yields the steepest ascent direction F̂_{θ_k}^{−1} g and the corresponding update rule θ_{k+1} = θ_k + α F̂_{θ_k}^{−1} g, where α is the step-size or learning rate parameter. Empirically, we observed that choosing a fixed value for α or an appropriate schedule is difficult [23]. Thus, we use a normalized gradient ascent procedure, where the normalization is under the Fisher metric. This procedure can be viewed as picking a normalized step size δ as opposed to α, and solving the optimization problem in (3). This results in the following update rule:

    θ_{k+1} = θ_k + sqrt( δ / (g^T F̂_{θ_k}^{−1} g) ) F̂_{θ_k}^{−1} g.    (5)

A dimensional analysis of these quantities reveals that α has units of return^{−1} whereas δ is dimensionless. Though the units of α are consistent with a general optimization setting where the step size has units of objective^{−1}, in these problems picking a good α that is consistent with the scale of the reward was difficult. A constant normalized step size, on the other hand, was numerically more stable and easier to tune: for all the results reported in this paper, the same δ = 0.05 was used.
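To make the update concrete, the gradient estimate (2), the Fisher estimate (4), and the normalized step (5) can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code: the per-timestep score vectors and advantages are assumed to be given, and the small damping term added to the Fisher matrix is a common practical stabilizer that is not part of equations (4)-(5).

```python
import numpy as np

def npg_update(theta, grads, advantages, delta=0.05, damping=1e-6):
    """One normalized natural policy gradient step.

    theta      : (d,) current policy parameters
    grads      : (T, d) score vectors grad_theta log pi(a_t | s_t)
    advantages : (T,) estimated advantages A(s_t, a_t, t)
    delta      : normalized step size (the paper uses delta = 0.05)
    damping    : regularizer for numerical stability (our assumption,
                 not part of equations (4)-(5))
    """
    T, d = grads.shape
    # Vanilla policy gradient, equation (2): g = (1/T) sum_t grad log pi * A_t
    g = grads.T @ advantages / T
    # Fisher matrix estimate, equation (4): F = (1/T) sum_t grad grad^T
    F = grads.T @ grads / T + damping * np.eye(d)
    # Natural gradient direction F^{-1} g. For large d, a conjugate gradient
    # solver would be used instead of a direct solve, as in Figure 3(b).
    nat_grad = np.linalg.solve(F, g)
    # Normalized step, equation (5): length sqrt(delta / (g^T F^{-1} g))
    step = np.sqrt(delta / (g @ nat_grad + 1e-12))
    return theta + step * nat_grad
```

By construction, the resulting parameter change Δθ satisfies Δθ^T F Δθ ≈ δ regardless of the reward scale, which is exactly why the normalized step is easier to tune than a raw learning rate α.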
When more than one trajectory rollout is used per update, the above estimators can be used with an additional averaging over the trajectories.

For estimating the advantage function, we use the GAE procedure [13]. This requires learning a function that approximates V^π_k along trajectories for the update in (5). GAE helps with variance reduction at the cost of introducing bias, and requires tuning hyperparameters like a discount factor and an exponential averaging term. Good heuristics for these parameters have been suggested in prior work. The same batch of trajectories cannot be used both for fitting the value function baseline and for estimating g using (2), since this would lead to overfitting and a biased estimate. Thus, we use the trajectories from iteration k − 1 to fit the value function, essentially approximating V^π_{k−1}, and use the trajectories from iteration k for computing A^π_k and g. Similar procedures have been adopted in prior work [19].

2.2 Policy Architecture

Linear policy: We first consider a linear policy that directly maps from the observations to the motor torques. We use the same observations as used in prior work, which include joint positions, joint velocities, and, for some tasks, information related to contacts. Thus, the policy mapping is a_t ∼ N(W s_t + b, σ), and the goal is to learn W, b, and σ. For most of these tasks, the observations correspond to the state of the problem (in relative coordinates). Thus, we use the terms states and observations interchangeably. In general, the policy is defined with observations as the input, and hence is trying to solve a POMDP.

RBF policy: Second, we consider a parameterization that enriches the representational capacity using random Fourier features of the observations.
Since these features approximate the RKHS features under an RBF kernel [24], we call this policy parametrization the RBF policy. The features are constructed as:

    y_t^(i) = sin( (Σ_j P_ij s_t^(j)) / ν + φ^(i) ),    (6)

where each element P_ij is drawn from N(0, 1), ν is a bandwidth parameter chosen approximately as the average pairwise distance between observation vectors, and φ^(i) is a random phase shift drawn from U[−π, π). The policy is then a_t ∼ N(W y_t + b, σ), where W, b, and σ are trainable parameters. This architecture can also be interpreted as a two-layer neural network: the bottom layer is clamped with random weights, a sinusoidal activation function is used, and the top layer is fine-tuned. The principal purpose of this representation is to slightly enhance the capacity of a linear policy; the choice of activation function is not very significant.

3 Results on OpenAI gym-v1 benchmarks

As indicated before, we train linear and RBF policies with the natural policy gradient on the popular OpenAI gym-v1 benchmark tasks simulated in MuJoCo [25]. The tasks primarily consist of learning locomotion gaits for simulated robots ranging from a swimmer to a 3D humanoid (23 dof).

Figure 1 presents the learning curves along with the performance levels reported in prior work using TRPO and fully connected neural network policies. Table 1 also summarizes the final scores, where "stoc" refers to the stochastic policy with actions sampled as a_t ∼ π_θ(s_t), while "mean" refers to using the mean of the Gaussian policy, with actions computed as a_t = E[π_θ(s_t)]. We see that the linear policy is competitive on most tasks, while the RBF policy outperforms previous results on five of the six considered tasks.
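The random-feature construction of equation (6) in Section 2.2 can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the class name, dimensions, and the seed are our own choices, and the bandwidth helper implements the paper's stated heuristic (average pairwise distance between observation vectors).

```python
import numpy as np

def average_pairwise_distance(obs):
    """Bandwidth heuristic from the paper: mean pairwise distance
    between observation vectors. obs has shape (num_obs, obs_dim)."""
    diffs = obs[:, None, :] - obs[None, :, :]
    return np.linalg.norm(diffs, axis=-1).mean()

class RBFFeatures:
    """Random Fourier features, equation (6):
    y_i = sin( (sum_j P_ij s_j) / nu + phi_i )."""
    def __init__(self, obs_dim, num_features, bandwidth, seed=0):
        rng = np.random.default_rng(seed)
        # P_ij ~ N(0, 1) and phi ~ U[-pi, pi) are drawn once and frozen:
        # this is the "clamped" random bottom layer of the two-layer view.
        self.P = rng.normal(size=(num_features, obs_dim))
        self.phi = rng.uniform(-np.pi, np.pi, size=num_features)
        self.nu = bandwidth

    def __call__(self, s):
        return np.sin(self.P @ s / self.nu + self.phi)
```

Only the top layer is trained: the policy mean is a = W y(s) + b, so the policy remains linear in its trainable parameters even though the features are nonlinear in the observations.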
Though we were able to train neural network policies that match the results reported in the literature, we have used publicly available prior results for an objective comparison. Visualizations of the trained linear and RBF policies are presented in the supplementary video. Given the simplicity of these policies, it is surprising that they can produce such elaborate behaviors.

Table 2 presents the number of samples needed for the policy performance to reach a threshold reward value. The threshold is computed as 90% of the final score achieved by the stochastic linear policy. We visually verified that policies with these scores are proficient at the task, and hence the chosen values correspond to meaningful performance thresholds. We see that the linear and RBF policies are able to learn faster on four of the six tasks.

All the simulated robots we considered are under-actuated, have contact discontinuities, and have continuous action spaces, making them challenging benchmarks. When these tasks were adapted from model-based control [26, 5, 27] to RL, however, the established notion of "success" was not appropriate. To shape the behavior, the benchmarks use a very narrow initial state distribution and termination conditions. As a consequence, the learned policies become highly trajectory-centric, i.e. they are good only in the very narrow region they tend to visit during training. For example, the walker can walk very well when initialized upright and close to the walking limit cycle. Even small perturbations, as shown in the supplementary video, alter the visitation distribution and dramatically degrade policy performance. This makes the agent fall down, at which point it is unable to get up. Similarly, the swimmer is unable to turn when its heading direction is altered. For control applications, this is undesirable.
In the real world, there will always be perturbations: stochasticity in the environment, modeling errors, or wear and tear. Thus, the specific task design and notion of success used for the simulated characters are not adequate. However, the simulated robots themselves are rather complex, and harder tasks could be designed with them, as partly illustrated in Section 4.

Figure 1: Learning curves for the Linear and RBF policy architectures. The green line corresponds to the reward achieved by neural network policies on the OpenAI Gym website, as of 02/24/2017 (trained with TRPO). For all the tasks, the linear and RBF parameterizations are competitive with state of the art results. The learning curves depicted are for the stochastic policies, where the actions are sampled as a_t ∼ π_θ(s_t), and have been averaged across three runs with different random seeds.

Table 1: Final performances of the policies

Task     | Linear (stoc) | Linear (mean) | RBF (stoc) | RBF (mean) | NN (TRPO)
Swimmer  | 366  | 362  | 361  | 365  | 131
Hopper   | 3466 | 3651 | 3810 | 3590 | 3668
Cheetah  | 4149 | 3810 | 6620 | 6477 | 4800
Walker   | 5234 | 4881 | 5867 | 5631 | 5594
Ant      | 4607 | 3980 | 4816 | 4297 | 5007
Humanoid | 5873 | 6440 | 6849 | 6237 | 6482

Table 2: Number of episodes to achieve threshold

Task     | Threshold | Linear | RBF   | TRPO+NN
Swimmer  | 325  | 1450  | 1550  | N-A
Hopper   | 3120 | 13920 | 8640  | 10000
Cheetah  | 3430 | 11250 | 6000  | 4250
Walker   | 4390 | 36840 | 25680 | 14250
Ant      | 3580 | 39240 | 30000 | 73500
Humanoid | 5280 | 79800 | 96720 | 87000

4 Modified Tasks and Results

Using the same set of simulated robot characters outlined in Section 3, we designed new tasks with two goals in mind: (a) to push the representational capabilities and test the limits of simple policies; (b) to enable training of "global" policies that are robust to perturbations and work from a diverse set of states.
To this end, we make the following broad changes, also summarized in Table 3:

1. Wider initial state distribution to force generalization. For example, in the walker task, some fraction of trajectories have the walker initialized prone on the ground. This forces the agent to simultaneously learn a get-up skill and a walk skill, and not to forget them as learning progresses. Similarly, the heading angles of the swimmer and ant are randomized, which encourages learning of a turn skill.

2. Reward shaping appropriate to the above changes in the initial state distribution. For example, when the modified swimmer starts with a randomized heading angle, we include a small reward for adjusting its heading towards the correct direction. In conjunction, we also remove all termination conditions used in the Gym-v1 benchmarks.

3. Changes to the environment's physics parameters, such as mass and joint torque. If the agent has sufficient power, most tasks are easily solved. By reducing the agent's actuation ability and/or increasing its mass, we make the agent more under-actuated.
These changes also produce more realistic looking motion.

Figure 2: Hopper completes a get-up sequence before moving to its normal forward hopping behavior. The get-up sequence is learned alongside forward hopping in the modified task setting.

Table 3: Modified Task Description. Here vx is the forward velocity; θ is the heading angle; pz is the height of the torso; and a is the action (des = desired value).

Task         | Description                                                                  | Reward
Swimmer (3D) | Agent swims in the desired direction. Should recover (turn) if rotated around. | vx − 0.1 |θ − θ_des| − 0.0001 ||a||^2
Hopper (2D)  | Agent hops forward as fast as possible. Should recover (get up) if pushed down. | vx − 3 ||pz − pz_des||^2 − 0.1 ||a||^2
Walker (2D)  | Agent walks forward as fast as possible. Should recover (get up) if pushed down. | vx − 3 ||pz − pz_des||^2 − 0.1 ||a||^2
Ant (3D)     | Agent moves in the desired direction. Should recover (turn) if rotated around.  | vx − 3 ||pz − pz_des||^2 − 0.01 ||a||^2

Combined, these modifications require that the learned policies not only make progress towards maximizing the reward, but also recover from adverse conditions and resist perturbations. An example of this is illustrated in Figure 2, where the hopper executes a get-up sequence before hopping to make forward progress. Furthermore, at test time, a user can interactively apply pushing and rotating perturbations to better understand the failure modes. We note that these interactive perturbations may not be the ultimate test for robustness, but they are a step in this direction.

Representational capacity: The supplementary video demonstrates the trained policies. We concentrate on the results of the walker task in the main paper. Figure 3 studies the performance as we vary the representational capacity.
Increasing the number of Fourier features allows for more expressive policies and consequently a higher score. The policy with 500 Fourier features performs best, followed by the fully connected neural network. The linear policy also makes forward progress and can get up from the ground, but is unable to learn as efficient a walking gait.

Figure 3: (a) Learning curves on the modified walker (diverse initialization) for different policy architectures. The curves are averaged over three runs with different random seeds. (b) Learning curves when using different numbers of conjugate gradient iterations to compute F̂_{θ_k}^{−1} g in (5). A policy with 300 Fourier features has been used to generate these results.

Figure 4: We test policy robustness by measuring the distance traveled in the swimmer, walker, and hopper tasks for three training configurations: (a) with termination conditions; (b) no termination, and a peaked initial state distribution; and (c) with diverse initialization. The swimmer does not have a termination option, so we consider only two configurations. For the swimmer, the perturbation changes the heading angle between −π/2 and π/2; for the walker and hopper, it is an external force applied for 0.5 seconds along the axis of movement. All agents are initialized with the same positions and velocities.

Perturbation resistance: Next, we test the robustness of our policies by perturbing the system with an external force. This external force represents an unforeseen change which the agent has to resist or overcome, thus enabling us to understand push and fall recoveries. Fall recoveries of the trained policies are demonstrated in the supplementary video. In these tasks, perturbations are not applied to the system during the training phase.
Thus, the ability to generalize and resist perturbations comes entirely from the states visited by the agent during training. Figure 4 indicates that the RBF policy is more robust, and also that diverse initializations are important for obtaining the best results. This indicates that careful design of the initial state distribution is crucial for generalization, and for enabling the agent to learn a wide range of skills.

5 Summary and Discussion

The experiments in this paper were aimed at understanding the effects of (a) representation; (b) task modeling; and (c) optimization. We summarize the results with regard to each factor and discuss their implications.

Representation: The finding that linear and RBF policies can be trained to solve a variety of continuous control tasks is very surprising. Recently, a number of algorithms have been shown to successfully solve these tasks [3, 28, 4, 14], but all of these works use multi-layer neural networks. This reflects a widespread belief that expressive function approximators are needed to capture the intricate details necessary for movements like running. The results in this work conclusively demonstrate that this is not the case, at least for the limited set of popular testbeds. This raises an interesting question: what are the capability limits of shallow policy architectures? The linear policies were not exemplary in the "global" versions of the tasks, but it must be noted that they were not terrible either. The RBF policy using random Fourier features was able to successfully solve the modified tasks, producing global policies, suggesting that we do not yet have a sense of its limits.

Modeling: When using RL methods to solve practical problems, the world provides us with neither the initial state distribution nor the reward.
Both of these must be designed by the researcher, and must be treated as assumptions about the world or prescriptions for the required behavior. The quality of these assumptions will invariably affect the quality of solutions, and thus care must be taken in this process. Here, we show that starting the system from a narrow initial state distribution produces elaborate behaviors, but the trained policies are very brittle to perturbations. Using a more diverse initial state distribution, in these cases, is sufficient to train robust policies.

Optimization: In line with the theme of simplicity, we first tried to use REINFORCE [20], which we found to be very sensitive to hyperparameter choices, especially the step size. There is a class of policy gradient methods that use pre-conditioning to help navigate the warped parameter space of probability distributions and to select step sizes. Most variants of pre-conditioned policy gradient methods have been reported to achieve state of the art performance, all performing about the same [19]. We feel that the natural policy gradient method used here is the most straightforward pre-conditioned method. To demonstrate that pre-conditioning helps, Figure 3 depicts the learning curves for different numbers of CG iterations used to compute the update in (5). The curve corresponding to CG = 0 is the REINFORCE method. As can be seen, pre-conditioning helps the learning process. However, there is a trade-off with computation, and hence using an intermediate number of CG steps, like 20, could lead to the best results in a wall-clock sense for large scale problems.

We chose to compare with neural network policies trained with TRPO, since it has demonstrated impressive results and is closest to the algorithm used in this work. Whether function approximators that are linear in their free parameters are sufficient for other methods is an interesting open question (in this sense, RBFs are linear but NNs are not).
For a large class of methods based on dynamic programming (including Q-learning, SARSA, and approximate policy and value iteration), linear function approximation has guaranteed convergence and error bounds, while non-linear function approximation is known to diverge in many cases [29, 30, 31, 32]. It may of course be possible to avoid divergence in specific applications, or at least to slow it down long enough, for example via target networks or replay buffers. Nevertheless, guaranteed convergence has clear advantages. Similar to recent work using policy gradient methods, recent work using dynamic programming methods has adopted multi-layer networks without careful side-by-side comparisons to simpler architectures. Could a global quadratic approximation to the optimal value function (which is linear in the set of quadratic features) be sufficient to solve most of the continuous control tasks currently studied in RL? Given that quadratic value functions correspond to linear policies, and good linear policies exist as shown here, this might make for interesting future work.

6 Conclusion

In this work, we demonstrated that very simple policy parameterizations can be used to solve many benchmark continuous control tasks, with no significant loss in performance due to the use of such simple parameterizations. We also proposed global variants of many widely studied tasks, which require the learned policies to be competent over a much larger set of states, and found that simple representations are sufficient in these cases as well. These empirical results, along with Occam's razor, suggest that complex policy architectures should not be a default choice unless side-by-side comparisons with simpler alternatives suggest otherwise. Such comparisons are unfortunately not widely pursued. The results presented in this work directly highlight the need for simplicity and generalization in RL.
We hope that this work will encourage future work analyzing various architectures and associated trade-offs like computation time, robustness, and sample complexity.

Acknowledgements

This work was supported in part by the NSF. The authors would like to thank Vikash Kumar, Igor Mordatch, John Schulman, and Sergey Levine for valuable comments.

References

[1] V. Mnih et al. Human-level control through deep reinforcement learning. Nature, 518, 2015.

[2] D. Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529, 2016.

[3] J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel. Trust region policy optimization. In ICML, 2015.

[4] T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. ArXiv e-prints, September 2015.

[5] Y. Tassa, T. Erez, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In International Conference on Intelligent Robots and Systems, 2012.

[6] I. Mordatch, E. Todorov, and Z. Popovic. Discovery of complex behaviors through contact-invariant optimization. ACM SIGGRAPH, 2012.

[7] M. Al Borno, M. de Lasa, and A. Hertzmann. Trajectory optimization for full-body movements with complex contacts. IEEE Transactions on Visualization and Computer Graphics, 2013.

[8] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. JMLR, 17(39):1-40, 2016.

[9] V. Kumar, E. Todorov, and S. Levine. Optimal control with learned local models: Application to dexterous manipulation. In ICRA, 2016.

[10] V. Kumar, A. Gupta, E. Todorov, and S. Levine. Learning dexterous manipulation policies from experience and imitation. ArXiv e-prints, 2016.

[11] L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours.
In ICRA, 2016.

[12] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.

[13] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. In ICLR, 2016.

[14] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-Prop: Sample-efficient policy gradient with an off-policy critic. In ICLR, 2017.

[15] I. Mordatch, K. Lowrey, and E. Todorov. Ensemble-CIO: Full-body dynamic motion planning that transfers to physical humanoids. In IROS, 2015.

[16] A. Rajeswaran, S. Ghotra, B. Ravindran, and S. Levine. EPOpt: Learning robust neural network policies using model ensembles. In ICLR, 2017.

[17] F. Sadeghi and S. Levine. (CAD)2RL: Real single-image flight without a single real image. ArXiv e-prints, 2016.

[18] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. ArXiv e-prints, 2017.

[19] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In ICML, 2016.

[20] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

[21] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10:251–276, 1998.

[22] S. Kakade. A natural policy gradient. In NIPS, 2001.

[23] J. Peters. Machine learning of motor skills for robotics. PhD Dissertation, University of Southern California, 2007.

[24] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.

[25] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control.
In International Conference on Intelligent Robots and Systems, 2012.

[26] T. Erez, Y. Tassa, and E. Todorov. Infinite-horizon model predictive control for periodic tasks with contacts. In RSS, 2011.

[27] T. Erez, K. Lowrey, Y. Tassa, V. Kumar, S. Kolev, and E. Todorov. An integrated system for real-time model predictive control of humanoid robots. In Humanoids, pages 292–299, 2013.

[28] N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa. Learning continuous control policies by stochastic value gradients. In NIPS, 2015.

[29] A. Geramifard, T. J. Walsh, S. Tellex, G. Chowdhary, N. Roy, and J. P. How. A tutorial on linear function approximators for dynamic programming and reinforcement learning. Foundations and Trends in Machine Learning, 6(4):375–451, 2013.

[30] J. Si. Handbook of Learning and Approximate Dynamic Programming, volume 2. John Wiley & Sons, 2004.

[31] D. P. Bertsekas. Approximate dynamic programming. 2008.

[32] L. Baird. Residual algorithms: Reinforcement learning with function approximation. In ICML, 1995.

A Choice of Step Size

An important design choice in the version of NPG presented in this work is the normalized vs. un-normalized step size. The normalized step size corresponds to solving the optimization problem in equation (3), and leads to the following update rule:

$$ \theta_{k+1} = \theta_k + \sqrt{\frac{\delta}{g^T \hat{F}_{\theta_k}^{-1} g}} \; \hat{F}_{\theta_k}^{-1} g. $$

On the other hand, an un-normalized step size corresponds to the update rule:

$$ \theta_{k+1} = \theta_k + \alpha \, \hat{F}_{\theta_k}^{-1} g. $$

The principal difference between the update rules corresponds to the units of the learning rate parameters α and δ.
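The two update rules can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this work: it assumes a precomputed sample-based gradient g and Fisher matrix estimate F̂, and the function name `npg_update` is ours.

```python
import numpy as np

def npg_update(theta, g, F, delta=None, alpha=None):
    """One natural gradient step, using either the normalized (delta)
    or the un-normalized (alpha) step size rule."""
    nat_grad = np.linalg.solve(F, g)  # F^{-1} g, without forming the inverse
    if delta is not None:
        # Normalized rule: the step length adapts to the local curvature,
        # so the same delta tends to work across tasks.
        return theta + np.sqrt(delta / (g @ nat_grad)) * nat_grad
    # Un-normalized rule: alpha inherits reward-dependent units.
    return theta + alpha * nat_grad
```

Note that scaling the rewards (and hence g, while F depends only on the policy likelihoods) by a constant leaves the normalized update unchanged, which is one way to see why δ is easier to tune than α.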
In accordance with general first order optimization methods, α scales inversely with the reward (note that F does not have the units of reward). This makes the choice of α highly problem specific, and we find it hard to tune. Furthermore, we observed that the same value of α cannot be used throughout the learning phase and must be re-scaled. Though this is common practice in supervised learning, where the learning rate is reduced after some number of epochs, it is hard to employ a similar approach in RL: large steps can destroy a reasonable policy, and recovering from such mistakes is extremely hard, since the variance of the gradient estimate for a poorly performing policy is higher. Employing the normalized step size was found to be more robust. These results are illustrated in Figure 5.

Figure 5: Learning curves using the normalized and un-normalized step size rules for the diverse versions of the swimmer, hopper, and walker tasks. We observe that the same normalized step size (δ) works across multiple problems. However, the un-normalized step size values that are optimal for one task do not work for other tasks; in fact, they often lead to divergence in the learning process. We replace the learning curves with flat lines in cases where we observed divergence, such as α = 0.25 in the case of the walker. This suggests that the normalized step size rule is more robust, with the same learning rate parameter working across multiple tasks.

B Effect of GAE

For the purpose of advantage estimation, we use the GAE [13] procedure in this work. GAE uses an exponential average of temporal difference errors to reduce the variance of policy gradients at the expense of bias. Since the paper explores the theme of simplicity, a pertinent question is how well GAE performs when compared to more straightforward alternatives, such as a pure temporal difference error or a pure Monte Carlo estimate.
The \u03bb parameter in GAE allows for an interpolation\nbetween these two extremes. In our experiments, summarized in Figure 6, we observe that reducing\nvariance even at the cost of a small bias (\u03bb = 0.97) provides for fast learning in the initial stages.\nThis is consistent with the \ufb01ndings in Schulman et al. [13] and also make intuitive sense. Initially,\nwhen the policy is very far from the correct answer, even if the movement direction is not along the\ngradient (biased), it is bene\ufb01cial to make consistent progress and not bounce around due to high\n\n11\n\n1020304050-400-2000200Swimmer: a vs dTraining IterationsReturna=0.01a=0.05a=0.1a=0.25a=1.0a=2.0d=0.01d=0.05d=0.120406080-200002000Hopper: a vs dTraining IterationsReturna=0.01a=0.05a=0.1a=0.25a=1.0a=2.0d=0.01d=0.05d=0.150100150200250-4000-3000-2000-100001000Walker: a vs dTraining IterationsReturna=0.01a=0.05a=0.1a=0.25a=1.0a=2.0d=0.01d=0.05d=0.1\fvariance. Thus, high bias estimates of the policy gradient, corresponding to smaller \u03bb values make\nfast initial progress. However, after this initial phase, it is important to follow an unbiased gradient,\nand consequently the low-bias variants corresponding to larger \u03bb values show better asymptotic\nperformance. Even without the use of GAE (i.e. \u03bb = 1), we observe good asymptotic performance.\nBut with GAE, it is possible to get faster initial learning due to reasons discussed above.\n\nCarlo estimate: \u02c6A(st, at) =(cid:80)T\n\nFigure 6: Learning curves corresponding to different choices of \u03bb in GAE. \u03bb = 0 corresponds\nto a high bias but low variance version of policy gradient corresponding to a TD error estimate:\n\u02c6A(st, at) = rt + \u03b3V (st+1) \u2212 V (st); while \u03bb = 1 corresponds to a low bias but high variance Monte\nt(cid:48)=t \u03b3t(cid:48)\u2212trt(cid:48) \u2212 V (st). 
We observe that low bias is asymptotically very important for achieving the best performance, but a low variance gradient can help during the initial stages.

[Figure 6 plots return vs. training iterations on the walker task for λ ∈ {0.00, 0.50, 0.90, 0.95, 0.97, 1.00}.]
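As a concrete reference for the interpolation above, the GAE computation can be sketched as follows. This is a simplified illustration, not the code used for the experiments; `gae_advantages` and its argument names are our own, and the default values of γ and λ are illustrative.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.995, lam=0.97):
    """Generalized Advantage Estimation (Schulman et al. [13]).

    lam = 0 recovers the one-step TD error (high bias, low variance);
    lam = 1 recovers the Monte Carlo estimate (low bias, high variance).
    `values` holds V(s_0), ..., V(s_T): one extra entry for the final state.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    # Accumulate the exponentially weighted sum of TD errors backwards in time.
    for t in reversed(range(T)):
        td_error = rewards[t] + gamma * values[t + 1] - values[t]
        running = td_error + gamma * lam * running
        advantages[t] = running
    return advantages
```

With λ = 1 and a zero value function, the estimate reduces to the discounted return from each state, matching the Monte Carlo formula in the Figure 6 caption.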