{"title": "Simple random search of static linear policies is competitive for reinforcement learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1800, "page_last": 1809, "abstract": "Model-free reinforcement learning aims to offer off-the-shelf solutions for controlling dynamical systems without requiring models of the system dynamics. We introduce a model-free random search algorithm for training static, linear policies for continuous control problems. Common evaluation methodology shows that our method matches state-of-the-art sample efficiency on the benchmark MuJoCo locomotion tasks. Nonetheless, more rigorous evaluation reveals that the assessment of performance on these benchmarks is optimistic. We evaluate the performance of our method over hundreds of random seeds and many different hyperparameter configurations for each benchmark task. This extensive evaluation is possible because of the small computational footprint of our method. Our simulations highlight a high variability in performance in these benchmark tasks, indicating that commonly used estimations of sample efficiency do not adequately evaluate the performance of RL algorithms. Our results stress the need for new baselines, benchmarks and evaluation methodology for RL algorithms.", "full_text": "Simple random search of static linear policies is\n\ncompetitive for reinforcement learning\n\nHoria Mania\n\nhmania@berkeley.edu\n\nAurelia Guy\n\nlia@berkeley.edu\n\nBenjamin Recht\n\nbrecht@berkeley.edu\n\nDepartment of Electrical Engineering and Computer Science\n\nUniversity of California, Berkeley\n\nAbstract\n\nModel-free reinforcement learning aims to offer off-the-shelf solutions for con-\ntrolling dynamical systems without requiring models of the system dynamics. We\nintroduce a model-free random search algorithm for training static, linear policies\nfor continuous control problems. 
Common evaluation methodology shows that our\nmethod matches state-of-the-art sample ef\ufb01ciency on the benchmark MuJoCo loco-\nmotion tasks. Nonetheless, more rigorous evaluation reveals that the assessment\nof performance on these benchmarks is optimistic. We evaluate the performance\nof our method over hundreds of random seeds and many different hyperparameter\ncon\ufb01gurations for each benchmark task. This extensive evaluation is possible\nbecause of the small computational footprint of our method. Our simulations\nhighlight a high variability in performance in these benchmark tasks, indicating\nthat commonly used estimations of sample ef\ufb01ciency do not adequately evaluate\nthe performance of RL algorithms. Our results stress the need for new baselines,\nbenchmarks and evaluation methodology for RL algorithms.\n\n1\n\nIntroduction\n\nModel-free reinforcement learning (RL) aims to offer off-the-shelf solutions for controlling dynamical\nsystems without requiring models of the system dynamics. Such methods have successfully produced\nRL agents that surpass human players in video games and games such as Go [16, 28]. Although\nthese results are impressive, model-free methods have not yet been successfully deployed to control\nphysical systems, outside of research demos. There are several factors prohibiting the adoption of\nmodel-free RL methods for controlling physical systems: the methods require too much data to\nachieve reasonable performance, the ever-increasing assortment of RL methods makes it dif\ufb01cult to\nchoose what is the best method for a speci\ufb01c task, and many candidate algorithms are dif\ufb01cult to\nimplement and deploy [11].\nUnfortunately, the current trend in RL research has put these impediments at odds with each other.\nIn the quest to \ufb01nd methods that are sample ef\ufb01cient (i.e. methods that need little data) the general\ntrend has been to develop increasingly complicated methods. 
This increasing complexity has led to a\nreproducibility crisis. Recent studies demonstrate that many RL methods are not robust to changes in\nhyperparameters, random seeds, or even different implementations of the same algorithm [11, 12].\nAlgorithms with such fragilities cannot be integrated into mission critical control systems without\nsigni\ufb01cant simpli\ufb01cation and robusti\ufb01cation.\nFurthermore, it is common practice to evaluate and compare new RL methods by applying them to\nvideo games or simulated continuous control problems and measure their performance over a small\nnumber of independent trials (i.e., fewer than ten random seeds) [8\u201310, 14, 17, 19, 21\u201327, 31, 32].\nThe most popular continuous control benchmarks are the MuJoCo locomotion tasks [3, 29], with\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fthe Humanoid model being considered \u201cone of the most challenging continuous control problems\nsolvable by state-of-the-art RL techniques [23].\u201d In principle, one can use video games and simulated\ncontrol problems for beta testing new ideas, but simple baselines should be established and thoroughly\nevaluated before moving towards more complex solutions.\nTo this end, we aim to determine the simplest model-free RL method that can solve standard\nbenchmarks. Recently, two different directions have been proposed for simplifying RL. Salimans\net al. [23] introduced a derivative-free policy optimization method, called Evolution Strategies. The\nauthors showed that, for several RL tasks, their method can easily be parallelized to train policies\nfaster than other methods. While the method of Salimans et al. [23] is simpler than previously\nproposed methods, it employs several complicated algorithmic elements, which we discuss at the end\nof Section 3. As a second simpli\ufb01cation to model-free RL, Rajeswaran et al. 
[22] have shown that\nlinear policies can be trained via natural policy gradients to obtain competitive performance on the\nMuJoCo locomotion tasks, showing that complicated neural network policies are not needed to solve\nthese continuous control problems. In this work, we combine ideas from the work of Salimans et al.\n[23] and Rajeswaran et al. [22] to obtain the simplest model-free RL method yet, a derivative-free\noptimization algorithm for training static, linear policies. We demonstrate that a simple random\nsearch method can match or exceed state-of-the-art sample ef\ufb01ciency on the MuJoCo locomotion\ntasks, included in the OpenAI Gym.\nHenderson et al. [11] and Islam et al. [12] pointed out that standard evaluation methodology does not\naccurately capture the performance of RL methods by showing that existing RL algorithms exhibit\nhigh sensitivity to both the choice of random seed and the choice of hyperparameters. We show\nsimilar limitations of common evaluation methodology through a different lens. We exhibit a simple\nderivative free optimization algorithm which matches or surpasses the performance of more complex\nmethods when using the same evaluation methodology. However, a more thorough evaluation of\nARS reveals worse performance. Moreover, our method uses static linear policies and a simple local\nexploration scheme, which might be limiting for more dif\ufb01cult RL tasks. Therefore, better evaluation\nschemes are needed for determining the bene\ufb01ts of more complex RL methods. Our contributions are\nas follows:\n\n\u2022 In Section 3, for applications to continuous control, we augment a basic random search\nmethod with three simple features. First, we scale each update step by the standard deviation\nof the rewards collected for computing that update step. Second, we normalize the system\u2019s\nstates by online estimates of their mean and standard deviation. 
Third, we discard from the computation of the update steps the directions that yield the least improvement of the reward. We refer to this method as Augmented Random Search (ARS).\n\n• In Section 4, we evaluate the performance of ARS on the benchmark MuJoCo locomotion tasks, included in the OpenAI Gym. Our method learns static, linear policies that achieve high rewards on all MuJoCo tasks. No neural networks are used, and yet state-of-the-art average rewards are achieved. For example, for Humanoid-v1 ARS finds linear policies which achieve average rewards of over 11500, the highest value reported in the literature. To put ARS on equal footing with competing methods, we evaluate its sample complexity over three random seeds and compare it to results reported in the literature [9, 22, 23, 26]. ARS matches or exceeds state-of-the-art sample efficiency on the locomotion tasks when using standard evaluation methodology.\n\n• For a more thorough evaluation, we measured the performance of ARS over a hundred random seeds and also evaluated its sensitivity to hyperparameter choices. Though ARS successfully trains policies for the MuJoCo tasks a large fraction of the time when hyperparameters and random seeds are varied, ARS exhibits large variance. We measure the frequency with which ARS finds policies that yield suboptimal locomotion gaits.\n\n2 Problem setup\n\nProblems in reinforcement learning require finding policies for controlling dynamical systems that maximize an average reward. Such problems can be abstractly formulated as\n\nmax_{θ ∈ ℝ^d} E_ξ[r(π_θ, ξ)],   (1)\n\nwhere θ parametrizes a policy π_θ : ℝ^n → ℝ^p. The random variable ξ encodes the randomness of the environment, i.e., random initial states and stochastic transitions. The value r(π_θ, ξ) is the reward achieved by the policy π_θ on one trajectory generated from the system. 
In general one could use stochastic policies π_θ, but our proposed method uses deterministic policies.\n\nBasic random search. Note that the problem formulation (1) aims to optimize reward by directly optimizing over the policy parameters θ. We consider methods which explore in the parameter space rather than the action space. This choice renders RL training equivalent to derivative-free optimization with noisy function evaluations. One of the simplest and oldest optimization methods for derivative-free optimization is random search [15].\nA primitive form of random search, which we call basic random search (BRS), simply computes a finite difference approximation along a random direction and then takes a step along this direction without using a line search. Our method ARS, described in Section 3, is based on this simple strategy. For updating the parameters θ of a policy π_θ, BRS and ARS exploit update directions of the form:\n\n[ r(π_{θ+νδ}, ξ₁) − r(π_{θ−νδ}, ξ₂) ] / ν · δ,   (2)\n\nfor two i.i.d. random variables ξ₁ and ξ₂, ν a positive real number, and δ a zero-mean Gaussian vector. It is known that such an update increment is an unbiased estimator of the gradient with respect to θ of E_δ E_ξ[r(π_{θ+νδ}, ξ)], a smoothed version of the objective (1) which is close to the original objective when ν is small [20]. When the function evaluations are noisy, minibatches can be used to reduce the variance in this gradient estimate. Evolution Strategies is a version of this algorithm with several complicated algorithmic enhancements [23]. Another version of this algorithm is called Bandit Gradient Descent by Flaxman et al. [6]. The convergence of random search methods for derivative-free optimization has been understood for several types of convex optimization [1, 2, 13, 20]. Jamieson et al. 
[13] offer an information theoretic lower bound for derivative-free convex optimization and show that a coordinate based random search method achieves the lower bound with nearly optimal dependence on the dimension.\nThe rewards r(π_{θ+νδ}, ξ₁) and r(π_{θ−νδ}, ξ₂) in Eq. (2) are obtained by collecting two trajectories from the dynamical system of interest, according to the policies π_{θ+νδ} and π_{θ−νδ}, respectively. The random variables ξ₁, ξ₂, and δ are mutually independent, and independent from previous trajectories. One trajectory is called an episode or a rollout. The goal of RL algorithms is to approximately solve problem (1) by using as few rollouts from the dynamical system as possible.\n\n3 Our proposed algorithm\n\nWe now introduce the Augmented Random Search (ARS) method, which relies on three augmentations of BRS that build on successful heuristics employed in deep reinforcement learning. Throughout the rest of the paper we use M to denote the parameters of policies because our method uses linear policies, and hence M is a p × n matrix. The different versions of ARS are detailed in Algorithm 1.\nThe first version, ARS V1, is obtained from BRS by scaling the update steps by the standard deviation σ_R of the rewards collected at each iteration; see Line 7 of Algorithm 1. As shown in Section 4, ARS V1 can train linear policies, which achieve the reward thresholds previously proposed in the literature, for five MuJoCo benchmarks. However, ARS V1 requires a larger number of episodes, and it cannot train policies for the Humanoid-v1 task. To address these issues, in Algorithm 1 we also propose ARS V2. This version of ARS trains policies which are linear maps of states normalized by a mean and standard deviation computed online. Finally, to further enhance the performance of ARS, we introduce a third algorithmic enhancement, shown in Algorithm 1 as ARS V1-t and ARS V2-t. 
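For concreteness, the BRS update built from Eq. (2) fits in a few lines of Python. The sketch below is purely illustrative and is not the implementation evaluated in this paper; `rollout_reward` is a hypothetical stand-in for running one episode with the policy and returning its total reward.

```python
import numpy as np

def brs_step(theta, rollout_reward, nu=0.02, alpha=0.01, num_directions=8):
    """One basic random search (BRS) update on policy parameters theta.

    rollout_reward(theta) is a hypothetical stand-in for collecting one
    trajectory with policy pi_theta and returning its reward r(pi_theta, xi).
    """
    step = np.zeros_like(theta)
    for _ in range(num_directions):
        delta = np.random.standard_normal(theta.shape)  # zero-mean Gaussian direction
        r_plus = rollout_reward(theta + nu * delta)     # r(pi_{theta+nu*delta}, xi_1)
        r_minus = rollout_reward(theta - nu * delta)    # r(pi_{theta-nu*delta}, xi_2)
        step += (r_plus - r_minus) / nu * delta         # update direction of Eq. (2)
    return theta + alpha * step / num_directions
```

ARS keeps exactly this loop and changes only how the step is scaled (by the reward standard deviation), what the policy sees (normalized states), and which directions are kept (the top b).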
These versions of ARS can drop perturbation directions that yield the least improvement of the reward. Now, we motivate and offer intuition for each of these algorithmic elements.\n\nAlgorithm 1 Augmented Random Search (ARS): four versions V1, V1-t, V2 and V2-t\n1: Hyperparameters: step-size α, number of directions sampled per iteration N, standard deviation of the exploration noise ν, number of top-performing directions to use b (b < N is allowed only for V1-t and V2-t)\n2: Initialize: M₀ = 0 ∈ ℝ^{p×n}, μ₀ = 0 ∈ ℝ^n, Σ₀ = I_n ∈ ℝ^{n×n}, and j = 0.\n3: while ending condition not satisfied do\n4:  Sample δ₁, δ₂, ..., δ_N in ℝ^{p×n} with i.i.d. standard normal entries.\n5:  Collect 2N rollouts of horizon H and their corresponding rewards using the 2N policies\n    V1: π_{j,k,+}(x) = (M_j + νδ_k)x and π_{j,k,−}(x) = (M_j − νδ_k)x\n    V2: π_{j,k,+}(x) = (M_j + νδ_k) diag(Σ_j)^{−1/2}(x − μ_j) and π_{j,k,−}(x) = (M_j − νδ_k) diag(Σ_j)^{−1/2}(x − μ_j)\n    for k ∈ {1, 2, ..., N}.\n6:  V1-t, V2-t: Sort the directions δ_k by max{r(π_{j,k,+}), r(π_{j,k,−})}, denote by δ_{(k)} the k-th largest direction, and by π_{j,(k),+} and π_{j,(k),−} the corresponding policies.\n7:  Make the update step:\n\n    M_{j+1} = M_j + (α / (b σ_R)) Σ_{k=1}^{b} [ r(π_{j,(k),+}) − r(π_{j,(k),−}) ] δ_{(k)},\n\n    where σ_R is the standard deviation of the 2b rewards used in the update step.\n8:  V2: Set μ_{j+1}, Σ_{j+1} to be the mean and covariance of the 2NH(j + 1) states encountered from the start of training.¹\n9:  j ← j + 1\n10: end while\n\nScaling by the standard deviation σ_R. As the training of policies progresses, random search in the parameter space of policies can lead to large variations in the rewards observed across iterations. As a result, it is difficult to choose a fixed step-size α which does not allow harmful variations in the size of the update steps. Salimans et al. [23] address this issue by transforming the rewards into rankings and then using the adaptive optimization algorithm Adam for computing the update step. Both of these techniques change the direction of the updates, obfuscating the behavior of the algorithm and making it difficult to ascertain the objective Evolution Strategies is actually optimizing. Instead, to address the large variations of the differences r(π_{M+νδ}) − r(π_{M−νδ}), we scale the update steps by the standard deviation σ_R of the 2N rewards collected at each iteration (see Line 7 of Algorithm 1).\nWhile training a policy for Humanoid-v1, we observed that the standard deviations σ_R have an increasing trend; see Figure 2 in Appendix A.2. This behavior occurs because perturbations of the policy weights at high rewards can cause Humanoid-v1 to fall early, yielding large variations in the rewards collected. Without scaling the update steps by σ_R, eventually random search would take update steps which are a thousand times larger than in the beginning of training. Therefore, σ_R adapts the step sizes according to the local sensitivity of the rewards to perturbations of the policy parameters. The same training performance could probably be obtained by tuning a step-size schedule. However, one of our goals was to minimize the amount of tuning required.\n\nNormalization of the states. The normalization of states used by ARS V2 is akin to data whitening for regression tasks. Intuitively, it ensures that policies put equal weight on the different components of the states. 
To see why this might help, suppose that a state coordinate only takes values in the range [90, 100] while another state component takes values in the range [−1, 1]. Then, small changes in the control gain with respect to the first state coordinate would lead to larger changes in the actions than the same sized changes with respect to the second state component. Hence, state normalization allows different state components to have equal influence during training.\nPrevious work has also implemented such state normalization for fitting a neural network model for several MuJoCo environments [19]. A similar normalization is used by ES as part of the virtual batch normalization of the neural network policies [23]. In the case of ARS, the state normalization can be seen as a form of non-isotropic exploration in the parameter space of linear policies.\n\n¹Of course, we implement this in an efficient way that does not require the storage of all the states. Also, we only keep track of the diagonal of Σ_{j+1}. Finally, to ensure that the ratio 0/0 is treated as 0, if a diagonal entry of Σ_j is smaller than 10⁻⁸ we make it equal to +∞.\n\nThe main empirical motivation for ARS V2 comes from the Humanoid-v1 task. We were not able to train a linear policy for this task without the normalization of the states described in Algorithm 1. Moreover, ARS V2 performs better than ARS V1 on other MuJoCo tasks as well, as shown in Section 4. However, the usefulness of state normalization is likely to be problem specific.\n\nUsing top performing directions. To further improve the performance of ARS on the MuJoCo locomotion tasks, we propose ARS V1-t and V2-t. In the update steps used by ARS V1 and V2 each perturbation direction δ_k is weighted by the difference of the rewards r(π_{j,k,+}) and r(π_{j,k,−}). If r(π_{j,k,+}) > r(π_{j,k,−}), ARS pushes the policy weights M_j in the direction of δ_k. 
If r(π_{j,k,+}) < r(π_{j,k,−}), ARS pushes the policy weights M_j in the direction of −δ_k. However, since r(π_{j,k,+}) and r(π_{j,k,−}) are noisy evaluations of the performance of the policies parametrized by M_j + νδ_k and M_j − νδ_k, ARS V1 and V2 might push the weights M_j in the direction δ_k even when −δ_k is better, or vice versa. Moreover, there can be perturbation directions δ_k such that updating the policy weights M_j in either the direction δ_k or −δ_k would lead to sub-optimal performance. To address these issues, ARS V1-t and V2-t order decreasingly the perturbation directions δ_k, according to max{r(π_{j,k,+}), r(π_{j,k,−})}, and then use only the top b directions for updating the policy weights; see Line 7 of Algorithm 1.\nThis algorithmic enhancement intuitively improves the performance of ARS because it ensures that the update steps are an average over directions that obtained high rewards. However, without theoretical investigation we cannot be certain of the effect of this algorithmic enhancement, i.e., of choosing b < N. When b = N, versions V1-t and V2-t are equivalent to V1 and V2. Therefore, it is certain that after tuning, ARS V1-t and V2-t will not perform any worse than ARS V1 and V2.\n\nComparison to Salimans et al. [23]. ARS simplifies Evolution Strategies in several ways. First, ES feeds the gradient estimate into the Adam algorithm. Second, instead of using the actual reward values r(θ ± εᵢ), ES transforms the rewards into rankings and uses the ranks to compute update steps. The rankings are used to make training more robust. Instead, our method scales the update steps by the standard deviation of the rewards. Third, ES bins the action space of the Swimmer-v1 and Hopper-v1 tasks to encourage exploration. Our method surpasses ES without such binning. 
Fourth, ES relies on policies parametrized by neural networks with virtual batch normalization, while we show that ARS achieves state-of-the-art performance with linear policies.\n\n4 Empirical results on the MuJoCo locomotion tasks\n\nImplementation details. We implemented a parallel version of Algorithm 1 using the Python library Ray [18]. To avoid the computational bottleneck of communicating perturbations δ, we created a shared noise table which stores independent standard normal entries. Then, instead of communicating perturbations δ, the workers communicate indices in the shared noise table. This approach has been used in the implementation of Evolution Strategies by Moritz et al. [18] and is similar to the approach proposed by Salimans et al. [23]. Our code sets the random seeds for the random generators of all the workers and for all copies of the OpenAI Gym environments held by the workers. All these random seeds are distinct and are a function of a single integer to which we refer as the random seed. Furthermore, we made sure that the states and rewards produced during the evaluation rollouts were not used in any form during training.\nWe evaluate the performance of ARS on the MuJoCo locomotion tasks included in the OpenAI Gym v0.9.3 [3, 29]. The OpenAI Gym provides benchmark reward functions for the different MuJoCo locomotion tasks. We used these default reward functions for evaluating the performance of the linear policies trained with ARS. The reported rewards obtained by a policy were averaged over 100 independent rollouts. For the Hopper-v1, Walker2d-v1, Ant-v1, and Humanoid-v1 tasks the default reward functions include a survival bonus, which rewards RL agents with a constant reward at each timestep, as long as a termination condition (i.e., falling over) has not been reached. During training, we removed these survival bonuses, a choice we motivate in Appendix A.1. 
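The shared noise table can be sketched as follows. This is an illustrative toy version (the class name and sizes are ours, not the released code): every worker holding a table built from the same seed can reconstruct a perturbation from a single communicated integer.

```python
import numpy as np

class SharedNoiseTable:
    """Illustrative sketch of a shared noise table: one block of standard
    normal entries generated once from a fixed seed, so that any worker can
    rebuild a perturbation from an integer index alone."""

    def __init__(self, size=1_000_000, seed=12345):
        self.noise = np.random.RandomState(seed).standard_normal(size)

    def sample_index(self, dim, rng):
        # Any start index with at least `dim` entries after it is valid.
        return rng.randint(0, len(self.noise) - dim + 1)

    def get(self, index, dim):
        return self.noise[index:index + dim]

# Instead of sending 10 x 100 floats, a worker sends one integer index.
table = SharedNoiseTable()
rng = np.random.RandomState(0)
idx = table.sample_index(dim=1000, rng=rng)      # communicated: one integer
delta = table.get(idx, 1000).reshape(10, 100)    # reconstructed p x n perturbation
```

Because the table is a deterministic function of its seed, a second process constructing `SharedNoiseTable()` with the same seed recovers the identical `delta` from `idx`.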
We also defer to Appendix A.3 the sensitivity analysis of ARS to the choice of hyperparameters.\n\nThree random seeds evaluation: We compare the different versions of ARS to the following methods: Trust Region Policy Optimization (TRPO), Deep Deterministic Policy Gradient (DDPG), Natural Gradients (NG), Evolution Strategies (ES), Proximal Policy Optimization (PPO), Soft Actor Critic (SAC), Soft Q-Learning (SQL), A2C, and the Cross Entropy Method (CEM). For the performance of these methods we used values reported by Rajeswaran et al. [22], Salimans et al. [23], Schulman et al. [26], and Haarnoja et al. [9]. In light of well-documented reproducibility issues of reinforcement learning methods [11, 12], reporting the values listed in papers rather than rerunning these algorithms casts prior work in the most favorable light possible.\nRajeswaran et al. [22] and Schulman et al. [26] evaluated the performance of RL algorithms on three random seeds, while Salimans et al. [23] and Haarnoja et al. [9] used six and five random seeds respectively. To put all methods on equal footing, for the evaluation of ARS, we sampled three random seeds uniformly from the interval [0, 1000) and fixed them. For each of the six OpenAI Gym MuJoCo locomotion tasks we chose a grid of hyperparameters², shown in Appendix A.6, and for each set of hyperparameters we ran ARS V1, V2, V1-t, and V2-t three times, once for each of the three fixed random seeds.\nTable 1 shows the average number of episodes required by ARS, NG, and TRPO to reach a prescribed reward threshold, using the values reported by Rajeswaran et al. [22] for NG and TRPO. For each version of ARS and each MuJoCo task we chose the hyperparameters which minimize the average number of episodes required to reach the reward threshold. The corresponding training curves of ARS are shown in Figure 3 of Appendix A.2. 
For all MuJoCo tasks, except Humanoid-v1, we used the same reward thresholds as Rajeswaran et al. [22]. Our choice to increase the reward threshold for Humanoid-v1 is motivated by the presence of the survival bonuses, as discussed in Appendix A.1.\n\nAverage # episodes to reach reward threshold\n\nTask           | Threshold | ARS V1 | ARS V1-t | ARS V2 | ARS V2-t | NG-lin  | NG-rbf  | TRPO-nn\nSwimmer-v1     | 325       | 100    | 100      | 427    | 427      | 1450    | 1550    | N/A³\nHopper-v1      | 3120      | 89493  | 51840    | 3013   | 1973     | 13920   | 8640    | 10000\nHalfCheetah-v1 | 3430      | 10240  | 8106     | 2720   | 1707     | 11250   | 6000    | 4250\nWalker2d-v1    | 4390      | 392000 | 166133   | 89600  | 24000    | 36840   | 25680   | 14250\nAnt-v1         | 3580      | 101066 | 58133    | 60533  | 20800    | 39240   | 30000   | 73500\nHumanoid-v1    | 6000      | N/A    | N/A      | 142600 | 142600   | ≈130000 | ≈130000 | UNK⁴\n\nTable 1: A comparison of ARS, NG, and TRPO on the MuJoCo locomotion tasks. For each task we show the average number of episodes required to achieve a prescribed reward threshold, averaged over three random seeds. We estimated the number of episodes required by NG to reach a reward of 6000 for Humanoid-v1 based on the learning curves presented by Rajeswaran et al. [22].\n\nTable 1 shows that ARS V1 can train policies for all tasks except Humanoid-v1, which is successfully solved by ARS V2. Secondly, we note that ARS V2 reaches the prescribed thresholds for Swimmer-v1, Hopper-v1, and HalfCheetah-v1 faster than NG or TRPO, and matches the performance of NG on Humanoid-v1. On Walker2d-v1 and Ant-v1, ARS V2 is outperformed by NG. Nonetheless, ARS V2-t surpasses the performance of NG on these two tasks. Although TRPO hits the reward threshold for Walker2d-v1 faster than ARS, our method either matches or surpasses TRPO in the metrics reported by Haarnoja et al. [9] and Schulman et al. [26].\nPrecise comparisons to more RL methods are provided in Appendix A.2. Here we offer a summary. Salimans et al. 
[23] reported the average number of episodes required by ES to reach a prescribed reward threshold, on four of the locomotion tasks. ARS surpassed ES on all of those tasks. Haarnoja et al. [9] reported the maximum reward achieved by SAC, DDPG, SQL, and TRPO after a prescribed number of timesteps, on four of the locomotion tasks. With the exception of SAC on HalfCheetah-v1 and Ant-v1, ARS outperformed competing methods. Schulman et al. [26] reported the maximum reward achieved by PPO, A2C, CEM, and TRPO after a prescribed number of timesteps, on four of the locomotion tasks. With the exception of PPO on Walker2d-v1, ARS matched or surpassed the performance of competing methods.\n\n²Recall that ARS V1 and V2 take in only three hyperparameters: the step-size α, the number of perturbation directions N, and the scale of the perturbations ν. ARS V1-t and V2-t take in an additional hyperparameter, the number of top directions used b (b ≤ N).\n³N/A means that the method did not reach the reward threshold.\n⁴UNK stands for unknown.\n\nA hundred seeds evaluation: For a more thorough evaluation of ARS, we sampled 100 distinct random seeds uniformly at random from the interval [0, 10000). Then, using the hyperparameters selected for Table 1, we ran ARS for each of the six MuJoCo locomotion tasks and the 100 random seeds. The results are shown in Figure 1. Such a thorough evaluation was feasible because ARS has a small computational footprint. As discussed in Appendix A.3, ARS is at least 15 times more computationally efficient on the MuJoCo benchmarks than competing methods.\nFigure 1 shows that 70% of the time ARS trains policies for all the MuJoCo locomotion tasks, with the exception of Walker2d-v1 for which it succeeds only 20% of the time. 
Moreover, ARS succeeds\nat training policies a large fraction of the time while using a competitive number of episodes.\n\nAverage reward evaluated over 100 random seeds, shown by percentile\n\nFigure 1: An evaluation of ARS over 100 random seeds on the MuJoCo locomotion tasks. The\ndotted lines represent median rewards and the shaded regions represent percentiles. For Swimmer-v1\nwe used ARS V1. For Hopper-v1, Walker2d-v1, and Ant-v1 we used ARS V2-t. For HalfCheetah-v1\nand Humanoid-v1 we used ARS V2.\n\nThere are two types of random seeds represented in Figure 1 that cause ARS to not reach high rewards.\nThere are random seeds on which ARS eventually \ufb01nds high reward policies when suf\ufb01ciently many\niterations of ARS are performed, and there are random seeds which lead ARS to discover locally\noptimal behaviors. For the Humanoid model, ARS found numerous distinct gaits, including ones\nduring which the Humanoid hops only on one leg, walks backwards, or moves in a swirling motion.\nSuch gaits were found by ARS on the random seeds which cause slower training. While multiple\ngaits for Humanoid models have been previously observed [10], our evaluation better emphasizes\ntheir prevalence. The presence of local optima is inherent to non-convex optimization, and our results\nshow that RL algorithms should be evaluated on many random seeds for determining the frequency\nwith which local optima are found. Finally, we remark that ARS is the least sensitive to the choice of\nrandom seed used when applied to HalfCheetah-v1, a task which is often used for the evaluation of\nsensitivity of algorithms to the choice of random seeds.\n\nLinear policies are suf\ufb01ciently expressive for MuJoCo: We discussed how linear policies can\nproduce diverse gaits for the MuJoCo models, showing that they are suf\ufb01ciently expressive to capture\ndiverse behaviors. Table 2 shows that linear policies can also achieve high rewards on all the MuJoCo\nlocomotion tasks. 
In particular, for Humanoid-v1 and Walker2d-v1, ARS found policies that achieve significantly higher rewards than any other results we encountered in the literature. These results show that linear policies are perfectly adequate for the MuJoCo locomotion tasks, reducing the need for more expressive and more computationally expensive policies.\n\nMaximum reward achieved\n\nTask           | ARS\nSwimmer-v1     | 365\nHopper-v1      | 3909\nHalfCheetah-v1 | 6722\nWalker2d-v1    | 11389\nAnt-v1         | 5146\nHumanoid-v1    | 11600\n\nTable 2: Maximum average reward achieved by ARS, where we took the maximum over all sets of hyperparameters considered and the three fixed random seeds.\n\n5 Discussion\n\nWith a few algorithmic augmentations, basic random search of static, linear policies achieves state-of-the-art sample efficiency on the MuJoCo locomotion tasks. Surprisingly, no special nonlinear controllers are needed to match the performance recorded in the RL literature. Moreover, since our algorithm and policies are simple, we were able to perform extensive sensitivity analysis. This analysis brings us to an uncomfortable conclusion that the current evaluation methods adopted in the deep RL community are insufficient to evaluate whether proposed methods are actually solving the studied problems.\nThe choice of benchmark tasks and the small number of random seeds do not represent the only issues of current evaluation methodology. 
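Aggregating per-seed learning curves into percentile bands, as in the evaluation behind Figure 1, is cheap once the runs exist. Below is a minimal sketch, with synthetic curves standing in for real training runs (the function name and the data are ours, for illustration only):

```python
import numpy as np

def percentile_bands(curves, percentiles=(0, 10, 20, 50, 80, 90, 100)):
    """Aggregate per-seed learning curves (shape: seeds x iterations)
    into one curve per requested percentile."""
    curves = np.asarray(curves, dtype=float)
    return {p: np.percentile(curves, p, axis=0) for p in percentiles}

# Synthetic stand-in data: 100 "seeds", 50 evaluation points each.
rng = np.random.RandomState(0)
curves = np.cumsum(rng.standard_normal((100, 50)) + 0.5, axis=1)
bands = percentile_bands(curves)
median_curve = bands[50]   # the dotted median line of a Figure-1-style plot
```

Reporting bands rather than a mean over a handful of seeds makes the spread across seeds, and the fraction of seeds stuck at locally optimal gaits, directly visible.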
Though many RL researchers are concerned about minimizing sample complexity, it does not make sense to optimize the running time of an algorithm on a single problem instance. The running time of an algorithm is only a meaningful notion if either (a) it is evaluated on a family of problem instances, or (b) the class of algorithms is clearly restricted.

Common RL practice, however, does not follow either (a) or (b). Instead, researchers run an algorithm A on a task T with a given hyperparameter configuration, and plot a "learning curve" showing that the algorithm reaches a target reward after collecting X samples. Then the "sample complexity" of the method is reported as the number of samples required to reach a target reward threshold, with the given hyperparameter configuration. However, any number of hyperparameter configurations can be tried. Any number of algorithmic enhancements can be added or discarded and then tested in simulation. For a fair measurement of sample complexity, should we not count the number of rollouts used across all tested hyperparameters?

Through optimal hyperparameter tuning one can artificially improve the perceived sample efficiency of a method. Indeed, this is what we see in our own work. By adding a third algorithmic enhancement to basic random search (i.e., enhancing ARS V2 to V2-t), we are able to improve the sample efficiency of an already highly performing method. Considering that most prior work in RL uses algorithms with far more tunable parameters, and neural nets whose architectures are themselves hyperparameters, the significance of the reported sample complexities for those methods is not clear.
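This accounting gap can be made concrete with a small sketch. The episode counts and budget below are purely hypothetical; the point is only that the conventionally reported figure is the cost of the single best configuration, while the search itself consumed far more:

```python
# Hypothetical episode counts needed to reach a target reward for several
# hyperparameter configurations; None marks a run that never reached it.
episodes_to_target = {"config_A": 90_000, "config_B": 40_000,
                      "config_C": None, "config_D": 120_000}
BUDGET = 150_000  # hypothetical cap on episodes spent per configuration

# What is commonly reported: the best configuration alone.
reported = min(e for e in episodes_to_target.values() if e is not None)

# What the search actually spent: failed runs still consume episodes.
total = sum(e if e is not None else BUDGET
            for e in episodes_to_target.values())

print(reported, total)  # 40000 400000
```

Under these made-up numbers the reported "sample complexity" understates the true cost of obtaining the policy by a factor of ten.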
This issue is important because a meaningful sample complexity of an algorithm should inform us about the number of samples required to solve a new, previously unseen task.

In light of these issues and of our empirical results, we make several suggestions for future work:

• Simple baselines should be established before moving forward to more complex benchmarks and methods. We propose the Linear Quadratic Regulator as a reasonable testbed for RL algorithms. LQR is well understood when the model is known, problem instances can be easily generated with a variety of different levels of difficulty, and little overhead is required for replication; see Appendix A.4 for more details.

• When games and physics simulators are used for evaluation, separate problem instances should be used for tuning and evaluating RL methods. Moreover, large numbers of random seeds should be used for statistically significant evaluations.

• Rather than trying to develop general-purpose algorithms, it might be better to focus on specific problems of interest and find targeted solutions.

• More emphasis should be put on the development of model-based methods. For many problems, such methods have been observed to require fewer samples than model-free methods. Moreover, the physics of the systems should inform the parametric classes of models used for different problems. Model-based methods incur many computational challenges themselves, and it is quite possible that tools from deep RL, such as improved tree search, can provide new paths forward for tasks that require the navigation of complex and uncertain environments.

Acknowledgments

We thank Orianna DeMasi, Moritz Hardt, Eric Jonas, Robert Nishihara, Rebecca Roelofs, Esther Rolf, Vaishaal Shankar, Ludwig Schmidt, Nilesh Tripuraneni, and Stephen Tu for many helpful comments and suggestions.
HM thanks Robert Nishihara and Vaishaal Shankar for sharing their expertise in parallel computing. As part of the RISE lab, HM is generally supported in part by NSF CISE Expeditions Award CCF-1730628, DHS Award HSHQDC-16-3-00083, and gifts from Alibaba, Amazon Web Services, Ant Financial, CapitalOne, Ericsson, GE, Google, Huawei, Intel, IBM, Microsoft, Scotiabank, Splunk and VMware. BR is generously supported in part by NSF award CCF-1359814, ONR awards N00014-14-1-0024 and N00014-17-1-2191, the DARPA Fundamental Limits of Learning (Fun LoL) Program, and an Amazon AWS AI Research Award.

References

[1] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. Conference on Learning Theory, pages 28–40, 2010.

[2] F. Bach and V. Perchet. Highly-smooth zero-th order online optimization. Conference on Learning Theory, 2016.

[3] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.

[4] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic regulator. arXiv:1710.01688, 2017.

[5] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. Proceedings of the International Conference on Machine Learning, pages 1329–1338, 2016.

[6] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pages 385–394, 2005.

[7] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-Prop: Sample-efficient policy gradient with an off-policy critic. International Conference on Learning Representations, 2016.

[8] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine.
Reinforcement learning with deep energy-based policies. Proceedings of the International Conference on Machine Learning, 2017.

[9] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv:1801.01290, 2018.

[10] N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, A. Eslami, M. Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv:1707.02286, 2017.

[11] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. arXiv:1709.06560, 2017.

[12] R. Islam, P. Henderson, M. Gomrokchi, and D. Precup. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv:1708.04133, 2017.

[13] K. G. Jamieson, R. Nowak, and B. Recht. Query complexity of derivative-free optimization. Advances in Neural Information Processing Systems, pages 2672–2680, 2012.

[14] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. International Conference on Learning Representations, 2016.

[15] J. Matyas. Random optimization. Automation and Remote Control, 26(2):246–253, 1965.

[16] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[17] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. Proceedings of the International Conference on Machine Learning, pages 1928–1937, 2016.

[18] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, W. Paul, M. I. Jordan, and I. Stoica. Ray: A distributed framework for emerging AI applications. arXiv:1712.05889, 2017.

[19] A.
Nagabandi, G. Kahn, R. S. Fearing, and S. Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv:1708.02596, 2017.

[20] Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.

[21] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz. Parameter space noise for exploration. arXiv:1706.01905, 2017.

[22] A. Rajeswaran, K. Lowrey, E. Todorov, and S. Kakade. Towards generalization and simplicity in continuous control. Advances in Neural Information Processing Systems, 2017.

[23] T. Salimans, J. Ho, X. Chen, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv:1703.03864, 2017.

[24] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. Proceedings of the International Conference on Machine Learning, pages 1889–1897, 2015.

[25] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations, 2015.

[26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.

[27] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. Proceedings of the International Conference on Machine Learning, 2014.

[28] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[29] E. Todorov, T. Erez, and Y. Tassa.
MuJoCo: A physics engine for model-based control. IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.

[30] S. Tu and B. Recht. Least-squares temporal difference learning for the linear quadratic regulator. arXiv:1712.08642, 2017.

[31] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. Sample efficient actor-critic with experience replay. International Conference on Learning Representations, 2016.

[32] Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. Advances in Neural Information Processing Systems, 2017.