{"title": "Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 5279, "page_last": 5288, "abstract": "In this work, we propose to apply trust region optimization to deep reinforcement learning using a recently proposed Kronecker-factored approximation to the curvature. We extend the framework of natural policy gradient and propose to optimize both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with trust region; hence we call our method Actor Critic using Kronecker-Factored Trust Region (ACKTR). To the best of our knowledge, this is the first scalable trust region natural gradient method for actor-critic methods. It is also the method that learns non-trivial tasks in continuous control as well as discrete control policies directly from raw pixel inputs. We tested our approach across discrete domains in Atari games as well as continuous domains in the MuJoCo environment. With the proposed methods, we are able to achieve higher rewards and a 2- to 3-fold improvement in sample efficiency on average, compared to previous state-of-the-art on-policy actor-critic methods. 
Code is available at https://github.com/openai/baselines", "full_text": "Scalable trust-region method for deep reinforcement\nlearning using Kronecker-factored approximation\n\nYuhuai Wu\u2217\n\nUniversity of Toronto\n\nVector Institute\n\nywu@cs.toronto.edu\n\nElman Mansimov\u2217\nNew York University\n\nmansimov@cs.nyu.edu\n\nShun Liao\n\nUniversity of Toronto\n\nVector Institute\n\nsliao3@cs.toronto.edu\n\nRoger Grosse\n\nUniversity of Toronto\n\nVector Institute\n\nrgrosse@cs.toronto.edu\n\nJimmy Ba\n\nUniversity of Toronto\n\nVector Institute\n\njimmy@psi.utoronto.ca\n\nAbstract\n\nIn this work, we propose to apply trust region optimization to deep reinforce-\nment learning using a recently proposed Kronecker-factored approximation to\nthe curvature. We extend the framework of natural policy gradient and propose\nto optimize both the actor and the critic using Kronecker-factored approximate\ncurvature (K-FAC) with trust region; hence we call our method Actor Critic using\nKronecker-Factored Trust Region (ACKTR). To the best of our knowledge, this\nis the \ufb01rst scalable trust region natural gradient method for actor-critic methods.\nIt is also the method that learns non-trivial tasks in continuous control as well as\ndiscrete control policies directly from raw pixel inputs. We tested our approach\nacross discrete domains in Atari games as well as continuous domains in the Mu-\nJoCo environment. With the proposed methods, we are able to achieve higher\nrewards and a 2- to 3-fold improvement in sample ef\ufb01ciency on average, compared\nto previous state-of-the-art on-policy actor-critic methods. Code is available at\nhttps://github.com/openai/baselines.\n\n1\n\nIntroduction\n\nAgents using deep reinforcement learning (deep RL) methods have shown tremendous success in\nlearning complex behaviour skills and solving challenging control tasks in high-dimensional raw\nsensory state-space [24, 17, 12]. 
Deep RL methods make use of deep neural networks to represent\ncontrol policies. Despite the impressive results, these neural networks are still trained using simple\nvariants of stochastic gradient descent (SGD). SGD and related \ufb01rst-order methods explore weight\nspace inef\ufb01ciently. It often takes days for the current deep RL methods to master various continuous\nand discrete control tasks. Previously, a distributed approach was proposed [17] to reduce training\ntime by executing multiple agents to interact with the environment simultaneously, but this leads to\nrapidly diminishing returns of sample ef\ufb01ciency as the degree of parallelism increases.\nSample ef\ufb01ciency is a dominant concern in RL; robotic interaction with the real world is typically\nscarcer than computation time, and even in simulated environments the cost of simulation often\ndominates that of the algorithm itself. One way to effectively reduce the sample size is to use\nmore advanced optimization techniques for gradient updates. Natural policy gradient [10] uses the\ntechnique of natural gradient descent [1] to perform gradient updates. Natural gradient methods\n\n\u2217Equal contribution.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: Performance comparisons on six standard Atari games trained for 10 million timesteps (1 timestep\nequals 4 frames). The shaded region denotes the standard deviation over 2 random seeds.\n\nfollow the steepest descent direction that uses the Fisher metric as the underlying metric, a metric\nthat is based not on the choice of coordinates but rather on the manifold (i.e., the surface).\nHowever, the exact computation of the natural gradient is intractable because it requires inverting the\nFisher information matrix. Trust-region policy optimization (TRPO) [21] avoids explicitly storing\nand inverting the Fisher matrix by using Fisher-vector products [20]. 
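Concretely, a Fisher-vector product never requires the d x d Fisher matrix: since F = E[g g^T] for score-function gradients g, the product is Fv = E[g (g . v)]. Below is a minimal numpy sketch of this idea with toy Monte Carlo samples (names and shapes are our own illustration; TRPO itself forms the product via automatic differentiation rather than stored per-sample gradients):

```python
import numpy as np

def fisher_vector_product(grads, v):
    """F v for F = E[g g^T], computed as E[g (g . v)] from per-sample
    score-function gradients g, without ever forming the d x d matrix F."""
    coeffs = grads @ v                       # (N,) inner products g_i . v
    return (grads * coeffs[:, None]).mean(axis=0)

rng = np.random.default_rng(0)
grads = rng.normal(size=(1000, 4))           # toy per-sample score gradients
v = np.array([1.0, 0.0, -1.0, 0.5])

fv = fisher_vector_product(grads, v)
F = (grads[:, :, None] * grads[:, None, :]).mean(axis=0)   # explicit Fisher
assert np.allclose(fv, F @ v)                # same product, no d x d storage
```

The memory cost of the matrix-free product scales with the number of parameters d rather than d squared, which is what makes conjugate-gradient TRPO feasible at all for neural network policies.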
However, it typically requires\nmany steps of conjugate gradient to obtain a single parameter update, and accurately estimating the\ncurvature requires a large number of samples in each batch; hence TRPO is impractical for large\nmodels and suffers from sample inefficiency.\nKronecker-factored approximate curvature (K-FAC) [15, 6] is a scalable approximation to natural\ngradient. It has been shown to speed up training of various state-of-the-art large-scale neural\nnetworks [2] in supervised learning by using larger mini-batches. Unlike TRPO, each update is\ncomparable in cost to an SGD update, and it keeps a running average of curvature information,\nallowing it to use small batches. This suggests that applying K-FAC to policy optimization could\nimprove the sample efficiency of the current deep RL methods.\nIn this paper, we introduce the actor-critic using Kronecker-factored trust region (ACKTR; pronounced “actor”) method, a scalable trust-region optimization algorithm for actor-critic methods. The\nproposed algorithm uses a Kronecker-factored approximation to natural policy gradient that allows\nthe covariance matrix of the gradient to be inverted efficiently. To the best of our knowledge, we are also\nthe first to extend the natural policy gradient algorithm to optimize value functions via Gauss-Newton\napproximation. In practice, the per-update computation cost of ACKTR is only 10% to 25% higher\nthan SGD-based methods. 
Empirically, we show that ACKTR substantially improves both sample\nefficiency and the final performance of the agent in the Atari environments [4] and the MuJoCo [26]\ntasks compared to the state-of-the-art on-policy actor-critic method A2C [17] and the well-known trust\nregion optimizer TRPO [21].\nWe make our source code available online at https://github.com/openai/baselines.\n\n2 Background\n\n2.1 Reinforcement learning and actor-critic methods\n\nWe consider an agent interacting with an infinite-horizon, discounted Markov Decision Process\n(X, A, γ, P, r). At time t, the agent chooses an action a_t ∈ A according to its policy π_θ(a|s_t) given\nits current state s_t ∈ X. The environment in turn produces a reward r(s_t, a_t) and transitions to the\nnext state s_{t+1} according to the transition probability P(s_{t+1}|s_t, a_t). The goal of the agent is to\nmaximize the expected γ-discounted cumulative return J(θ) = E_π[R_t] = E_π[Σ_{i≥0} γ^i r(s_{t+i}, a_{t+i})]\nwith respect to the policy parameters θ. Policy gradient methods [28, 25] directly parameterize a\npolicy π_θ(a|s_t) and update the parameter θ so as to maximize the objective J(θ). 
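As a small illustration of the objective being maximized, the discounted return Σ_{i≥0} γ^i r_{t+i} of a finite rollout can be computed for every timestep by a single backward recursion; a minimal numpy sketch (the function name is ours, not from the released code):

```python
import numpy as np

def discounted_return(rewards, gamma):
    """R_t = sum_{i>=0} gamma^i r_{t+i} for every t of a finite rollout,
    via the backward recursion R_t = r_t + gamma * R_{t+1}."""
    R = 0.0
    out = np.empty(len(rewards))
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        out[t] = R
    return out

# With gamma = 0.9: R_2 = 2.0, R_1 = 0 + 0.9*2.0 = 1.8, R_0 = 1 + 0.9*1.8 = 2.62
returns = discounted_return([1.0, 0.0, 2.0], 0.9)
assert np.allclose(returns, [2.62, 1.8, 2.0])
```

The backward pass costs O(k) for a rollout of length k, versus O(k²) for computing each discounted sum independently.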
In its general form,\nthe policy gradient is defined as [22],\n\n∇_θ J(θ) = E_π[ Σ_{t=0}^∞ Ψ_t ∇_θ log π_θ(a_t|s_t) ],\n\nwhere Ψ_t is often chosen to be the advantage function A^π(s_t, a_t), which provides a relative measure\nof the value of each action a_t at a given state s_t. There is an active line of research [22] on designing an\nadvantage function that provides both low-variance and low-bias gradient estimates. As this is not\nthe focus of our work, we simply follow the asynchronous advantage actor critic (A3C) method [17]\nand define the advantage function as the k-step returns with function approximation,\n\nA^π(s_t, a_t) = Σ_{i=0}^{k-1} γ^i r(s_{t+i}, a_{t+i}) + γ^k V^π_φ(s_{t+k}) − V^π_φ(s_t),\n\nwhere V^π_φ(s_t) is the value network, which provides an estimate of the expected sum of rewards from\nthe given state following policy π, V^π_φ(s_t) = E_π[R_t]. To train the parameters of the value network,\nwe again follow [17] by performing temporal difference updates, so as to minimize the squared\ndifference between the bootstrapped k-step returns R̂_t and the prediction value, (1/2)||R̂_t − V^π_φ(s_t)||².\n\n2.2 Natural gradient using Kronecker-factored approximation\n\nTo minimize a nonconvex function J(θ), the method of steepest descent calculates the update\nΔθ that minimizes J(θ + Δθ), subject to the constraint that ||Δθ||_B < 1, where ||·||_B is the\nnorm defined by ||x||_B = (x^⊤ B x)^{1/2}, and B is a positive semidefinite matrix. The solution to the\nconstrained optimization problem has the form Δθ ∝ −B^{−1}∇_θJ, where ∇_θJ is the standard gradient.\nWhen the norm is Euclidean, i.e., B = I, this becomes the commonly used method of gradient\ndescent. However, the Euclidean norm of the change depends on the parameterization θ. This is\nnot favorable because the parameterization of the model is an arbitrary choice, and it should not\naffect the optimization trajectory. The method of natural gradient constructs the norm using the\nFisher information matrix F, a local quadratic approximation to the KL divergence. This norm is\nindependent of the model parameterization θ on the class of probability distributions, providing a\nmore stable and effective update. However, since modern neural networks may contain millions of\nparameters, computing and storing the exact Fisher matrix and its inverse is impractical, so we have\nto resort to approximations.\nA recently proposed technique called Kronecker-factored approximate curvature (K-FAC) [15] uses\na Kronecker-factored approximation to the Fisher matrix to perform efficient approximate natural\ngradient updates. 
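The k-step returns and advantages of Section 2.1 can be sketched as follows (a hypothetical minimal implementation with our own names; the actual A2C/ACKTR code batches this over many parallel environments):

```python
import numpy as np

def k_step_advantage(rewards, values, bootstrap, gamma):
    """Bootstrapped k-step returns R_hat_t and advantages
    A(s_t, a_t) = R_hat_t - V(s_t) over a length-k rollout segment;
    `bootstrap` is the critic's estimate V(s_{t+k})."""
    k = len(rewards)
    R = bootstrap
    returns = np.empty(k)
    for t in reversed(range(k)):
        R = rewards[t] + gamma * R
        returns[t] = R                      # R_hat_t, the critic's regression target
    advantages = returns - np.asarray(values)
    return returns, advantages

# gamma = 0.5, V(s_2) = 1.0: R_hat_1 = 1 + 0.5*1.0 = 1.5, R_hat_0 = 1 + 0.5*1.5 = 1.75
returns, advantages = k_step_advantage([1.0, 1.0], [0.5, 0.5], 1.0, 0.5)
assert np.allclose(returns, [1.75, 1.5])
assert np.allclose(advantages, [1.25, 1.0])
```

The `returns` array is what the critic is regressed toward in the temporal difference loss, while `advantages` weights the policy gradient.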
We let p(y|x) denote the output distribution of a neural network, and L = log p(y|x)\ndenote the log-likelihood. Let W ∈ R^{Cout×Cin} be the weight matrix in the ℓth layer, where Cout and\nCin are the number of output/input neurons of the layer. Denote the input activation vector to the\nlayer as a ∈ R^{Cin}, and the pre-activation vector for the next layer as s = Wa. Note that the weight\ngradient is given by ∇_W L = (∇_s L)a^⊤. K-FAC utilizes this fact and further approximates the block\nF_ℓ corresponding to layer ℓ as F̂_ℓ,\n\nF_ℓ = E[vec{∇_W L} vec{∇_W L}^⊤] = E[aa^⊤ ⊗ ∇_s L(∇_s L)^⊤]\n≈ E[aa^⊤] ⊗ E[∇_s L(∇_s L)^⊤] := A ⊗ S := F̂_ℓ,\n\nwhere A denotes E[aa^⊤] and S denotes E[∇_s L(∇_s L)^⊤]. This approximation can be interpreted as\nmaking the assumption that the second-order statistics of the activations and the backpropagated\nderivatives are uncorrelated. With this approximation, the natural gradient update can be efficiently\ncomputed by exploiting the basic identities (P ⊗ Q)^{−1} = P^{−1} ⊗ Q^{−1} and (P ⊗ Q) vec(T) = vec(Q T P^⊤):\n\nvec(ΔW) = F̂_ℓ^{−1} vec{∇_W J} = vec(S^{−1} ∇_W J A^{−1}).\n\nFrom the above equation we see that the K-FAC approximate natural gradient update only requires\ncomputations on matrices comparable in size to W. Grosse and Martens [6] have recently extended\nthe K-FAC algorithm to handle convolutional networks. Ba et al. [2] later developed a distributed\nversion of the method where most of the overhead is mitigated through asynchronous computation. 
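The benefit of the factorization can be checked numerically: for a single layer, inverting the small factors A and S separately yields exactly the same step as inverting the full Kronecker product A ⊗ S. A numpy sketch with toy dimensions (all names are ours; a real implementation estimates A and S with running averages):

```python
import numpy as np

rng = np.random.default_rng(1)
Cin, Cout, N = 3, 2, 500

a  = rng.normal(size=(N, Cin))       # input activations of the layer
ds = rng.normal(size=(N, Cout))      # backpropagated gradients dL/ds
gW = rng.normal(size=(Cout, Cin))    # gradient of the objective w.r.t. W

A = a.T @ a / N + 1e-3 * np.eye(Cin)     # A ~= E[a a^T]   (with damping)
S = ds.T @ ds / N + 1e-3 * np.eye(Cout)  # S ~= E[ds ds^T] (with damping)

# K-FAC natural-gradient direction: only Cin x Cin and Cout x Cout inverses
delta = np.linalg.solve(S, gW) @ np.linalg.inv(A)        # S^-1 gW A^-1

# Sanity check against inverting the full factored matrix A kron S,
# using column-major vec so that (A kron S) vec(X) = vec(S X A^T)
vec_g = gW.flatten(order="F")
delta_full = np.linalg.solve(np.kron(A, S), vec_g).reshape((Cout, Cin), order="F")
assert np.allclose(delta, delta_full)
```

For a layer with thousands of units, the full Kronecker product would be far too large to invert, while the two factors stay small.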
Distributed K-FAC achieved 2- to 3-times speed-ups in training large modern classification\nconvolutional networks.\n\nFigure 2: In the Atari game of Atlantis, our agent (ACKTR) quickly learns to obtain rewards of 2 million in\n1.3 hours, 600 episodes of games, 2.5 million timesteps. The same result is achieved by advantage actor critic\n(A2C) in 10 hours, 6000 episodes, 25 million timesteps. ACKTR is 10 times more sample efficient than A2C on\nthis game.\n\n3 Methods\n\n3.1 Natural gradient in actor-critic\n\nNatural gradient was proposed for the policy gradient method more than a decade ago\nby Kakade [10]. But there still doesn't exist a scalable, sample-efficient, and general-purpose\ninstantiation of the natural policy gradient. In this section, we introduce the first scalable and sample-efficient natural gradient algorithm for actor-critic methods: the actor-critic using Kronecker-factored\ntrust region (ACKTR) method. We use a Kronecker-factored approximation to compute the natural\ngradient update, and apply the natural gradient update to both the actor and the critic.\nTo define the Fisher metric for reinforcement learning objectives, one natural choice is to use the\npolicy function, which defines a distribution over the action given the current state, and take the\nexpectation over the trajectory distribution:\n\nF = E_{p(τ)}[∇_θ log π(a_t|s_t)(∇_θ log π(a_t|s_t))^⊤],\n\nwhere p(τ) is the distribution of trajectories, given by p(s_0) Π_{t=0}^T π(a_t|s_t)p(s_{t+1}|s_t, a_t). In practice,\none approximates the intractable expectation over trajectories collected during training.\nWe now describe one way to apply natural gradient to optimize the critic. Learning the critic can be\nthought of as a least-squares function approximation problem, albeit one with a moving target. 
In the\nsetting of least-squares function approximation, the second-order algorithm of choice is commonly\nGauss-Newton, which approximates the curvature as the Gauss-Newton matrix G := E[J^⊤J], where\nJ is the Jacobian of the mapping from parameters to outputs [18]. The Gauss-Newton matrix is\nequivalent to the Fisher matrix for a Gaussian observation model [14]; this equivalence allows us to\napply K-FAC to the critic as well. Specifically, we define the output of the critic v to be\na Gaussian distribution p(v|s_t) ∼ N(v; V(s_t), σ²). The Fisher matrix for the critic is defined with\nrespect to this Gaussian output distribution. In practice, we can simply set σ to 1, which is equivalent\nto the vanilla Gauss-Newton method.\nIf the actor and critic are disjoint, one can separately apply K-FAC updates to each using the metrics\ndefined above. But to avoid instability in training, it is often beneficial to use an architecture where the\ntwo networks share lower-layer representations but have distinct output layers [17, 27]. In this case,\nwe can define the joint distribution of the policy and the value distribution by assuming independence\nof the two output distributions, i.e., p(a, v|s) = π(a|s)p(v|s), and construct the Fisher metric with\nrespect to p(a, v|s), which is no different from the standard K-FAC except that we need to sample\nthe networks' outputs independently. We can then apply K-FAC to approximate the Fisher matrix\nE_{p(τ)}[∇ log p(a, v|s)(∇ log p(a, v|s))^⊤] to perform updates simultaneously.\nIn addition, we use regular damping for regularization. 
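The Gauss-Newton/Fisher equivalence invoked above can be verified in a toy setting: for a linear value function under a unit-variance Gaussian observation model, a Monte Carlo estimate of the Fisher matrix recovers E[J^⊤J]. A numpy sketch under those assumptions (the setup is ours, chosen for checkability):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 2000, 3
X = rng.normal(size=(N, d))            # state features s
theta = rng.normal(size=d)             # linear critic V(s) = theta . s

# Gauss-Newton matrix G = E[J^T J]; for a linear critic the Jacobian rows are s
G = X.T @ X / N

# Fisher of the Gaussian model p(v|s) = N(v; V(s), 1):
# grad_theta log p = (v - V(s)) s, so F = E[(v - V)^2 s s^T] = E[s s^T] = G
v = X @ theta + rng.normal(size=N)     # sample v from the model (sigma = 1)
score = (v - X @ theta)[:, None] * X   # per-sample score vectors
F_mc = score.T @ score / N             # Monte Carlo Fisher estimate
assert np.allclose(F_mc, G, atol=0.5)  # agree up to sampling noise
```

This is why setting σ to 1 and applying K-FAC to the critic amounts to a vanilla Gauss-Newton step on the value network.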
We also follow [2] and perform the asynchronous computation of second-order statistics and inverses required by the Kronecker approximation\nto reduce computation time.\n\n3.2 Step-size selection and trust-region optimization\n\nTraditionally, natural gradient is performed with SGD-like updates, θ ← θ − ηF^{−1}∇_θL. But in the\ncontext of deep RL, Schulman et al. [21] observed that such an update rule can result in large updates\nto the policy, causing the algorithm to prematurely converge to a near-deterministic policy. They\nadvocate instead using a trust region approach, whereby the update is scaled down to modify the\npolicy distribution (in terms of KL divergence) by at most a specified amount. Therefore, we adopt\nthe trust region formulation of K-FAC introduced by [2], choosing the effective step size η to be\nmin(η_max, sqrt(2δ / (Δθ^⊤ F̂ Δθ))), where the learning rate η_max and trust region radius δ are hyperparameters. If\nthe actor and the critic are disjoint, then we need to tune a different set of η_max and δ separately for\nboth. The variance parameter for the critic output distribution can be absorbed into the learning rate\nparameter for vanilla Gauss-Newton. On the other hand, if they share representations, we need to\ntune one set of η_max, δ, and also the weighting parameter of the training loss of the critic, with respect\nto that of the actor.\n\n4 Related work\n\nNatural gradient [1] was first applied to policy gradient methods by Kakade [10]. Bagnell and\nSchneider [3] further proved that the metric defined in [10] is a covariant metric induced by the\npath-distribution manifold. 
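The effective step size of Section 3.2 is a one-line computation once the curvature term Δθ^⊤F̂Δθ is available; a minimal sketch (function and variable names are ours, not from the released code):

```python
import numpy as np

def effective_step(eta_max, delta, nat_grad, fvp):
    """eta = min(eta_max, sqrt(2 * delta / (d^T F d))), where d is the
    natural-gradient direction and fvp approximates F d."""
    quad = float(nat_grad @ fvp)        # d^T F d: curvature along the update
    return min(eta_max, float(np.sqrt(2.0 * delta / quad)))

# d^T F d = 8, delta = 0.01  ->  sqrt(0.02 / 8) = 0.05 caps the step
d = np.array([2.0, 0.0])
Fd = np.array([4.0, 0.0])
assert np.isclose(effective_step(0.25, 0.01, d, Fd), 0.05)
# with a loose trust region the base learning rate eta_max is used
assert effective_step(0.25, 100.0, d, Fd) == 0.25
```

Since the KL divergence is locally approximated by (1/2)η² Δθ^⊤F̂Δθ, this cap keeps each policy update's KL change at most δ.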
Peters and Schaal [19] then applied natural gradient to the actor-critic\nalgorithm. They proposed performing natural policy gradient for the actor\u2019s update and using a\nleast-squares temporal difference (LSTD) method for the critic\u2019s update. However, there are great\ncomputational challenges when applying natural gradient methods, mainly associated with ef\ufb01ciently\nstoring the Fisher matrix as well as computing its inverse. For tractability, previous work restricted\nthe method to using the compatible function approximator (a linear function approximator). To\navoid the computational burden, Trust Region Policy Optimization (TRPO) [21] approximately\nsolves the linear system using conjugate gradient with fast Fisher matrix-vector products, similar\nto the work of Martens [13]. This approach has two main shortcomings. First, it requires repeated\ncomputation of Fisher vector products, preventing it from scaling to the larger architectures typically\nused in experiments on learning from image observations in Atari and MuJoCo. Second, it requires\na large batch of rollouts in order to accurately estimate curvature. K-FAC avoids both issues by\nusing tractable Fisher matrix approximations and by keeping a running average of curvature statistics\nduring training. Although TRPO shows better per-iteration progress than policy gradient methods\ntrained with \ufb01rst-order optimizers such as Adam [11], it is generally less sample ef\ufb01cient.\nSeveral methods were proposed to improve the computational ef\ufb01ciency of TRPO. To avoid repeated\ncomputation of Fisher-vector products, Wang et al. [27] solve the constrained optimization problem\nwith a linear approximation of KL divergence between a running average of the policy network and\nthe current policy network. Instead of the hard constraint imposed by the trust region optimizer,\nHeess et al. [8] and Schulman et al. 
[23] added a KL cost to the objective function as a soft constraint.\nBoth papers show some improvement over vanilla policy gradient on continuous and discrete control\ntasks in terms of sample efficiency.\nThere are other recently introduced actor-critic models that improve sample efficiency by introducing\nexperience replay [27, 7] or auxiliary objectives [9]. These approaches are orthogonal to our work,\nand could potentially be combined with ACKTR to further enhance sample efficiency.\n\n5 Experiments\n\nWe conducted a series of experiments to investigate the following questions: (1) How does ACKTR\ncompare with the state-of-the-art on-policy method and a common second-order optimizer baseline\nin terms of sample efficiency and computational efficiency? (2) What makes a better norm for\noptimization of the critic? (3) How does the performance of ACKTR scale with batch size compared\nto the first-order method?\nWe evaluated our proposed method, ACKTR, on two standard benchmark platforms. We first\nevaluated it on the discrete control tasks defined in OpenAI Gym [5], simulated by the Arcade Learning\nEnvironment [4], a simulator for Atari 2600 games which is commonly used as a deep reinforcement\nlearning benchmark for discrete control. We then evaluated it on a variety of continuous control\nbenchmark tasks defined in OpenAI Gym [5], simulated by the MuJoCo [26] physics engine. Our\nbaselines are (a) a synchronous and batched version of the asynchronous advantage actor critic model\n(A3C) [17], henceforth called A2C (advantage actor critic), and (b) TRPO [21]. ACKTR and the\nbaselines use the same model architecture except for the TRPO baseline on Atari games, with which\nwe are limited to using a smaller architecture because of the computing burden of running a conjugate\ngradient inner-loop. See the appendix for other experiment details.\n\nDomain         | Human level | ACKTR            | A2C              | TRPO (10 M)\n               |             | Rewards  Episode | Rewards  Episode | Rewards  Episode\nBeamrider      | 5775.0      | 13581.4  3279    | 8148.1   8930    | 670.0    N/A\nBreakout       | 31.8        | 735.7    4094    | 581.6    14464   | 14.7     N/A\nPong           | 9.3         | 20.9     904     | 19.9     4768    | -1.2     N/A\nQ-bert         | 13455.0     | 21500.3  6422    | 15967.4  19168   | 971.8    N/A\nSeaquest       | 20182.0     | 1776.0   N/A     | 1754.0   N/A     | 810.4    N/A\nSpace Invaders | 1652.0      | 19723.0  14696   | 1757.2   N/A     | 465.1    N/A\n\nTable 1: ACKTR and A2C results showing the last 100 average episode rewards attained after 50 million\ntimesteps, and TRPO results after 10 million timesteps. The table also shows the episode N, where N denotes\nthe first episode for which the mean episode reward over the N th game to the (N + 100)th game crosses the\nhuman performance level [16], averaged over 2 random seeds.\n\n5.1 Discrete control\n\nWe first present results on the standard six Atari 2600 games to measure the performance improvement\nobtained by ACKTR. The results on the six Atari games trained for 10 million timesteps are shown\nin Figure 1, with comparison to A2C and TRPO2. ACKTR outperformed A2C in terms\nof sample efficiency (i.e., speed of convergence per number of timesteps) by a significant margin\nin all games. We found that TRPO could only learn two games, Seaquest and Pong, in 10 million\ntimesteps, and performed worse than A2C in terms of sample efficiency.\nIn Table 1 we present the mean of rewards of the last 100 episodes in training for 50 million timesteps,\nas well as the number of episodes required to achieve human performance [16]. Notably, on the\ngames Beamrider, Breakout, Pong, and Q-bert, A2C required respectively 2.7, 3.5, 5.3, and 3.0 times\nmore episodes than ACKTR to achieve human performance. In addition, one of the runs by A2C in\nSpace Invaders failed to match human performance, whereas ACKTR achieved 19723 on average,\n12 times better than human performance (1652). 
On the games Breakout, Q-bert and Beamrider,\nACKTR achieved 26%, 35%, and 67% larger episode rewards than A2C.\nWe also evaluated ACKTR on the rest of the Atari games; see the Appendix for full results. We compared\nACKTR with Q-learning methods, and we found that in 36 out of 44 benchmarks, ACKTR is on par\nwith Q-learning methods in terms of sample efficiency, and consumed far less computation time.\nRemarkably, in the game of Atlantis, ACKTR quickly learned to obtain rewards of 2 million in 1.3\nhours (600 episodes), as shown in Figure 2. It took A2C 10 hours (6000 episodes) to reach the same\nperformance level.\n\n5.2 Continuous control\n\nWe ran experiments on the standard benchmark of continuous control tasks defined in OpenAI Gym\n[5] simulated in MuJoCo [26], both from low-dimensional state-space representations and directly\nfrom pixels. In contrast to Atari, the continuous control tasks are sometimes more challenging due to\nhigh-dimensional action spaces and exploration. The results of eight MuJoCo environments trained\nfor 1 million timesteps are shown in Figure 3. Our model significantly outperformed baselines on six\nout of eight MuJoCo tasks and performed competitively with A2C on the other two tasks (Walker2d\nand Swimmer).\n\nFigure 3: Performance comparisons on eight MuJoCo environments trained for 1 million timesteps (1 timestep\nequals 4 frames). The shaded region denotes the standard deviation over 3 random seeds.\n\nFigure 4: Performance comparisons on 3 MuJoCo environments from image observations trained for 40 million\ntimesteps (1 timestep equals 4 frames).\n\n2The A2C and TRPO Atari baseline results are provided to us by the OpenAI team, https://github.com/openai/baselines.\n\nWe further evaluated ACKTR for 30 million timesteps on eight MuJoCo tasks and in Table 2 we\npresent mean rewards of the top 10 consecutive episodes in training, as well as the number of\nepisodes to reach a certain threshold defined in [7]. 
As shown in Table 2, ACKTR reaches the\nspecified threshold faster on all tasks, except for Swimmer, where TRPO achieves 4.1 times better\nsample efficiency. A particularly notable case is Ant, where ACKTR is 16.4 times more sample\nefficient than TRPO. As for the mean reward score, the three models achieve comparable results,\nwith the exception of TRPO, which achieves a 10% better reward score in the Walker2d environment.\nWe also attempted to learn continuous control policies directly from pixels, without providing the low-dimensional state space as an input. Learning continuous control policies from pixels is much\nmore challenging than learning from the state space, partially due to the slower rendering time\ncompared to Atari (0.5 seconds in MuJoCo vs 0.002 seconds in Atari). The state-of-the-art actor-critic method A3C [17] only reported results from pixels on relatively simple tasks, such as Pendulum,\nPointmass2D, and Gripper. As shown in Figure 4, our model significantly outperforms\nA2C in terms of final episode reward after training for 40 million timesteps. More specifically,\non Reacher, HalfCheetah, and Walker2d our model achieved a 1.6, 2.8, and 1.7 times greater\nfinal reward compared to A2C. The videos of trained policies from pixels can be found at https://www.youtube.com/watch?v=gtM87w1xGoM. Pretrained model weights are available at https://github.com/emansim/acktr.\n\n5.3 A better norm for critic optimization?\n\nThe previous natural policy gradient method applied a natural gradient update only to the actor. In\nour work, we propose also applying a natural gradient update to the critic. The difference lies in the\nnorm with which we choose to perform steepest descent on the critic; that is, the norm ||·||_B defined\nin Section 2.2. 
In this section, we applied ACKTR to the actor, and compared using a first-order\nmethod (i.e., Euclidean norm) with using ACKTR (i.e., the norm defined by Gauss-Newton) for critic\noptimization. Figures 5 (a) and (b) show the results on the continuous control task HalfCheetah and\nthe Atari game Breakout. We observe that regardless of which norm we use to optimize the critic,\nthere are improvements brought by applying ACKTR to the actor compared to the baseline A2C.\nHowever, the improvements brought by using the Gauss-Newton norm for optimizing the critic are\nmore substantial in terms of sample efficiency and episode rewards at the end of training. In addition,\nthe Gauss-Newton norm also helps stabilize the training, as we observe larger variance in the results\nover random seeds with the Euclidean norm.\n\nDomain      | Threshold   | ACKTR             | A2C               | TRPO\n            |             | Rewards  Episodes | Rewards  Episodes | Rewards  Episodes\nAnt         | 3500 (6000) | 4621.6   3660     | 4870.5   106186   | 5095.0   60156\nHalfCheetah | 4700 (4800) | 5586.3   12980    | 5343.7   21152    | 5704.7   21033\nHopper      | 2000 (3800) | 3915.9   17033    | 3915.3   33481    | 3755.0   39426\nIP          | 950 (950)   | 1000.0   6831     | 1000.0   10982    | 1000.0   29267\nIDP         | 9100 (9100) | 9356.0   41996    | 9356.1   82694    | 9320.0   78519\nReacher     | -7 (-3.75)  | -1.5     3325     | -1.7     20591    | -2.0     14940\nSwimmer     | 90 (360)    | 138.0    6475     | 140.7    11516    | 136.4    1571\nWalker2d    | 3000 (N/A)  | 6198.8   15043    | 5874.9   26828    | 6874.1   27720\n\nTable 2: ACKTR, A2C, and TRPO results, showing the top 10 average episode rewards attained within 30\nmillion timesteps, averaged over the 3 best performing random seeds out of 8 random seeds. "Episodes" denotes\nthe smallest N for which the mean episode reward over the N th to the (N + 10)th game crosses a certain\nthreshold. The thresholds for all environments except for InvertedPendulum (IP) and InvertedDoublePendulum\n(IDP) were chosen according to Gu et al. [7], and in brackets we show the reward threshold needed to solve the\nenvironment according to the OpenAI Gym website [5].\n\nRecall that the Fisher matrix for the critic is constructed using the output distribution of the critic,\na Gaussian distribution with variance σ². In vanilla Gauss-Newton, σ is set to 1. We experimented\nwith estimating σ using the variance of the Bellman error, which resembles estimating the variance\nof the noise in regression analysis. We call this method adaptive Gauss-Newton. However, we find that\nadaptive Gauss-Newton doesn't provide any significant improvement over vanilla Gauss-Newton.\n(See detailed comparisons on the choices of σ in the Appendix.)\n\n5.4 How does ACKTR compare with A2C in wall-clock time?\n\nWe compared ACKTR to the baselines A2C and TRPO in terms of wall-clock time. 
Table 3 shows the average timesteps per second over six Atari games and eight MuJoCo (state-space) environments. The results were obtained with the same experimental setup as the previous experiments. Note that in MuJoCo tasks episodes are processed sequentially, whereas in the Atari environment episodes are processed in parallel; hence more frames are processed in the Atari environments. From the table we see that ACKTR increases computing time by at most 25% per timestep, demonstrating its practicality given its large optimization benefits.

(Timesteps/Second)

|            | Atari |      |      | MuJoCo |      |       |
| batch size | 80    | 160  | 640  | 1000   | 2500 | 25000 |
| ACKTR      | 712   | 753  | 852  | 519    | 551  | 582   |
| A2C        | 1010  | 1038 | 1162 | 624    | 650  | 651   |
| TRPO       | 160   | 161  | 177  | 593    | 619  | 637   |

Table 3: Comparison of computational cost: the average timesteps per second over six Atari games and eight MuJoCo tasks during training for each algorithm. ACKTR increases computing time by at most 25% over A2C.

5.5 How do ACKTR and A2C perform with different batch sizes?

In a large-scale distributed learning setting, a large batch size is used in optimization. Therefore, in such a setting, it is preferable to use a method that scales well with batch size. In this section, we compare how ACKTR and the baseline A2C perform with respect to different batch sizes. We experimented with batch sizes of 160 and 640. Figure 5 (c) shows the rewards as a function of the number of timesteps. We found that ACKTR with a larger batch size performed as well as with a smaller batch size. However, with a larger batch size, A2C experienced significant degradation in terms of sample efficiency. This corresponds to the observation in Figure 5 (d), where we plot the training curves in terms of the number of updates. We see that the benefit of using a larger batch size increases substantially with ACKTR compared to with A2C.
This suggests there is potential for large speed-ups with ACKTR in a distributed setting, where one needs to use large mini-batches; this matches the observation in [2].

Figure 5: (a) and (b) compare optimizing the critic (value network) with a Gauss-Newton norm (ACKTR) against a Euclidean norm (first order). (c) and (d) compare ACKTR and A2C with different batch sizes.

6 Conclusion

In this work we proposed a sample-efficient and computationally inexpensive trust-region optimization method for deep reinforcement learning. We used a recently proposed technique called K-FAC to approximate the natural gradient update for actor-critic methods, with trust region optimization for stability. To the best of our knowledge, we are the first to propose optimizing both the actor and the critic using natural gradient updates. We tested our method on Atari games as well as the MuJoCo environments, and we observed 2- to 3-fold improvements in sample efficiency on average compared with a first-order gradient method (A2C) and an iterative second-order method (TRPO). Because of the scalability of our algorithm, we are also the first to train several non-trivial tasks in continuous control directly from raw pixel observation space. This suggests that extending Kronecker-factored natural gradient approximations to other algorithms in reinforcement learning is a promising research direction.

Acknowledgements

We would like to thank the OpenAI team for their generous support in providing baseline results and Atari environment preprocessing codes. We also want to thank John Schulman for helpful discussions.

References

[1] S. I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

[2] J. Ba, R. Grosse, and J. Martens. Distributed second-order optimization using Kronecker-factored approximations. In ICLR, 2017.

[3] J. A. Bagnell and J.
G. Schneider. Covariant policy search. In IJCAI, 2003.

[4] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[5] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[6] R. Grosse and J. Martens. A Kronecker-factored approximate Fisher matrix for convolutional layers. In ICML, 2016.

[7] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-prop: Sample-efficient policy gradient with an off-policy critic. In ICLR, 2017.

[8] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. Riedmiller, and D. Silver. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.

[9] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In ICLR, 2017.

[10] S. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, 2002.

[11] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[12] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning.
In ICLR, 2016.

[13] J. Martens. Deep learning via Hessian-free optimization. In ICML, 2010.

[14] J. Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

[15] J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML, 2015.

[16] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[17] V. Mnih, A. Puigdomenech Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.

[18] J. Nocedal and S. Wright. Numerical Optimization. Springer, 2006.

[19] J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.

[20] N. N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 2002.

[21] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In ICML, 2015.

[22] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. In ICLR, 2016.

[23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[24] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T.
Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[25] R. S. Sutton, D. A. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, 2000.

[26] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.

[27] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. Sample efficient actor-critic with experience replay. In ICLR, 2016.

[28] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.