Summary and Contributions: The paper introduces AES-RL, an asynchronous framework that combines Evolution Strategies (ES) and off-policy policy gradient methods to improve parallelizability. Results on a host of continuous control benchmarks demonstrate that the proposed method improves over baselines in terms of runtime and final performance.
Strengths: 1) The idea is simple but powerful and well-motivated. Parallelizability is often hampered by heterogeneous episode runtimes in the population. The ability to run population updates asynchronously is thus a major advantage that can enable a host of applications. 2) Facilitating online updates by leveraging the Welford and Rechenberg update rules is very innovative and seems to work fairly well. 3) The improvements in wall-clock time achieved by the method are significant. While the benchmarks presented in the paper are based on MuJoCo, the wall-clock gains would likely be considerably larger with higher-fidelity physics engines such as OpenSim or Gazebo, where the disparity between simulation steps can be much more pronounced. This could be an enabling technology for applying these methods with high-fidelity physics simulators.
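For readers unfamiliar with the Welford rule mentioned in point 2, the following is a minimal sketch of the textbook online mean/variance update. It is not the paper's update rule, which adapts the idea to fitness-dependent population updates; the class name `RunningMoments` and the per-dimension (diagonal) variance are assumptions made for illustration.

```python
import numpy as np


class RunningMoments:
    """Textbook Welford update: per-dimension mean and variance, one sample at a time.

    Hypothetical helper for illustration; not taken from the paper under review.
    """

    def __init__(self, dim):
        self.count = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)  # running sum of squared deviations from the mean

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.count += 1
        delta = x - self.mean              # deviation from the old mean
        self.mean += delta / self.count    # incremental mean update
        self.m2 += delta * (x - self.mean)  # uses both old and new mean

    @property
    def variance(self):
        # Population variance; NaN before any sample has arrived.
        if self.count == 0:
            return np.full_like(self.mean, np.nan)
        return self.m2 / self.count
```

Because each call to `update` touches only the newly arrived sample, the statistics can be refreshed whenever a single worker finishes, which is the property that makes this style of rule attractive in an asynchronous setting.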
Weaknesses: 1) Humanoid is a major environment missing from the suite of benchmarks tested. I would be curious to know how the method performs on Humanoid. 2) The paper is missing a targeted case study that dives into the details of how the ES and PG workers in the population interact. For example, how useful are the ES updates in comparison to the PG updates within the population? Does this differ for each task? It would be reasonable to assume so. Detailed experiments that elucidate the inner workings of the method would greatly strengthen the claims of the paper.
Correctness: The proposed method seems technically sound. The algorithm is well-presented and the design choices seem justified.
Clarity: The paper is extremely well written and easy to follow. Illustrations such as Figures 1 and 2 are extremely helpful in elucidating the proposed method. The appendix in particular provides an extremely detailed account of the method and the experiments.
Relation to Prior Work: The paper does a good job of highlighting the relevant background in hybrid methods that combine EA and off-policy policy gradients. It provides enough context for a reader to be able to place the proposed method amongst the broader field.
Additional Feedback: 1) The idea behind the fixed range update (with or without the baseline) is similar to the trust-region-based updates popularized by TRPO and PPO. It would be great if the authors could draw comparisons with these well-established methods. Further, ideas from the wider literature on trust-region-based optimization could provide an avenue for further development of the proposed method. 2) How does the method scale to discrete settings like Atari? Broad experiments that include these benchmarks would greatly increase the appeal of the proposed method to a broader audience of RL practitioners. ### POST REBUTTAL ### I have read the authors' rebuttal and the other reviewers' comments. Accordingly, I maintain my original score recommending acceptance for the submission.
Summary and Contributions: Deep reinforcement learning (DRL) and evolution strategies (ES) have recently been combined so that each compensates for the other's weakness: the stability issues of DRL and the sample inefficiency of ES. This paper proposes an asynchronous version of such a combination of DRL and ES to improve its parallel performance. The authors extend the previous work, CEM-RL, to support asynchronous updates of a population of agents. Two issues in making the existing algorithm asynchronous are mentioned and addressed. The stated difficulties in extending CEM-RL to the asynchronous setting are as follows. 1. "CEM-RL creates a fixed number of ES and RL actors in new generation only after the all actors are terminated. But in the case of the asynchronous update, it is impossible to use this method because the actors have different ending moments." 2. "Secondly, we need an asynchronous method to update population distribution effectively and also stably. Since each actor has its own fitness value, a novel update method to adaptively exploit this is required." These difficulties are addressed by introducing a mechanism that controls the ratio between the populations of RL and ES actors, and by employing a modified update formula for the distribution parameters. Experimental evaluation on MuJoCo environments shows a speed-up of up to 25% compared to the parallel (synchronous) baseline and a performance improvement over existing approaches. However, ablation studies are missing, and it is unclear whether the performance improvement comes from the speed-up due to asynchronization.
Strengths: 1. Stability and sample efficiency are important issues to be addressed in DRL, and a combination of ES and RL has been reported as a way to mitigate these difficulties. An asynchronous parallel implementation of such a method is a natural way to improve its wall-clock time. The topic of this paper is suitable for NeurIPS. 2. Empirically, a speed-up (in wall-clock time) and an improvement in the final performance have been evaluated on MuJoCo environments against different SOTA algorithms: TD3, CEM, ERL, and CEM-RL.
Weaknesses: 1. Asynchronization is introduced as a way to speed up a parallel implementation of an algorithm, possibly at the cost of sample efficiency (compared with the synchronous version), rather than to improve performance. Showing a performance improvement at a fixed wall-clock time or a fixed number of interactions does not really demonstrate the benefit of asynchronization. What if the maximum time budget is higher or lower? Why not show a performance graph? 2. An ablation study is missing. It is unclear where the performance improvement comes from. 3. As mentioned in the summary above, two difficulties in implementing an asynchronous distribution update are stated in this paper. However, I cannot agree with the first difficulty: "in the case of the asynchronous update, it is impossible to use this method because the actors have different ending moments." It is definitely not "impossible". The second difficulty is already addressed in asynchronous ES. Therefore, it should be possible to simply combine asynchronous ES with CEM-RL. I am not sure whether the proposed approach is really promising compared to this very simple baseline, partly because of the lack of an ablation study.
Correctness: It is unclear whether the performance comparison is done at a fixed wall-clock time or at a fixed number of interactions. The experimental details are missing. Some statements are neither correct nor supported by evidence. "Fitness-based methods use the fitness value itself rather than rank, making it available to fully utilize the superior individual. (1+1)-ES is one of the early work that uses fitness value in extremely aggressive way, ..." This is wrong: (1+1)-ES does not use the fitness value itself; it uses only the result of comparing two fitness values. "In order to restrain the aggressiveness of (1+1)-ES, while preserving the capability of high exploration in fitness-based scheme, ..." Please provide evidence or a reference for "the capability of high exploration in fitness-based scheme". It is rather counter-intuitive.
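To make the comparison-based point concrete, here is a minimal sketch of a (1+1)-ES with a common implementation of the 1/5 success rule (step size multiplied by 1.5 on success and by 1.5^(-1/4) on failure). The function name, constants, and test fitness are assumptions for illustration and are not taken from the paper. The fitness enters only through the test `f_child >= f_parent`, so any strictly increasing transformation of the fitness leaves the search trajectory unchanged.

```python
import numpy as np


def one_plus_one_es(fitness, x0, sigma=0.5, iters=200, seed=0):
    """Maximize `fitness` with a (1+1)-ES; illustrative sketch only."""
    rng = np.random.default_rng(seed)
    parent = np.asarray(x0, dtype=float)
    f_parent = fitness(parent)
    for _ in range(iters):
        child = parent + sigma * rng.standard_normal(parent.shape)
        f_child = fitness(child)
        # The only use of fitness: a comparison between parent and child.
        if f_child >= f_parent:
            parent, f_parent = child, f_child
            sigma *= 1.5              # success: enlarge the step size
        else:
            sigma *= 1.5 ** -0.25     # failure: shrink it (1/5 success rule)
    return parent, sigma


def sphere(x):
    return -float(np.sum(x ** 2))


def cubed_sphere(x):
    # Strictly increasing transformation of `sphere`; ranks are unchanged.
    return sphere(x) ** 3


x_a, _ = one_plus_one_es(sphere, np.full(3, 2.0))
x_b, _ = one_plus_one_es(cubed_sphere, np.full(3, 2.0))
```

With the same seed, `x_a` and `x_b` should coincide, since every accept/reject decision is identical under both fitness functions. A scheme that weights updates by the fitness magnitude would not have this invariance.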
Clarity: Yes. The organization of the paper is clear.
Relation to Prior Work: Globally, the paper is well-positioned with respect to related work. From a technical viewpoint, the components introduced in Sections 3.3 and 3.4 are closely related to existing approaches ([7,25] and fitness shaping in natural evolution strategies), but this is not clearly stated.
Additional Feedback: It would be better to provide the code to reproduce the experiments. Since the contribution of this paper is the speed-up in wall-clock time, it may depend heavily on the implementation. -- The authors' response is satisfactory. Most of my concerns are addressed.
Summary and Contributions: The paper builds on the (interesting) recent trend of combining ES and RL. It adds significant improvements: - better asynchronous management. - new update rules for CEM in an ask-and-tell framework compatible with the asynchronous setting. The new update rules are actually not only for the asynchronous setting; they do provide improvements compared to CEM.
Strengths: Based on the combination RL/Evolution which looks quite cool. Looks like the approach does outperform interesting recent papers.
Weaknesses: A few issues can be solved for the final version (better captions for tables or a few more lines in the corresponding text). A bigger issue is that the many improvements associated with the update rules are not compared with other rules that are compatible with asynchronous generation of offspring.
Relation to Prior Work: Adequate, except for the other asynchronous methods. Maybe the authors can point out why existing methods presented in an ask-and-tell format, available as open source and known to be faster, are not suitable. The one-plus-one ES is definitely not fitness-based but comparison-based; I maintain my recommendation for acceptance.
Additional Feedback: Rank-based black-box optimization is sometimes considered more robust, so some people might disagree with the claim that it is better to use fitness values (as opposed to only the ranks). Regarding this usage of the detailed fitness values, there are many parameters, so the stability of the method could be discussed. The presentation of the state of the art (combinations of ES and gradient-based policy search) is ok. The code of the baselines is mentioned in the footnote; you might also provide your code (with a GitHub anonymizer), as this does not break the double-blind reviewing process. Remarks: - The ask-and-tell format is becoming prevalent in black-box optimization; it allows asynchronous generation of new candidates. Sections 3.1 and 3.3 do not take this into account; generating a single point is feasible with many methods that are more sophisticated than the (destructive) 1+1. - Table 1: maybe the caption can be more self-contained; how many different actors are running here? The table suggests that you get only a factor 2 or 3 speed-up (which is not the case if I understand correctly). - Table 2: more self-contained captions could also help, or a bit more info immediately next to the caption; are these results obtained with a fixed total time, and with how many parallel actors? - Table 3: also more self-contained captions. Rebuttal: maybe provide answers to the remarks above. You can probably easily answer the points raised about Tables 1, 2, and 3. I think the existence of many ask-and-tell algorithms (for example, CMA has an ask-and-tell version, and Nevergrad provides *all* its algorithms in ask-and-tell format) is a deeper remark. This does not mean that the paper should not be accepted (I'm not aware of anyone having done better than you on these problems, so you can't be blamed for not doing everything that could have been done!). CEM is not the only possibility, so many algorithms could be tried here.
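To illustrate the ask-and-tell remark, the sketch below shows how such an interface decouples candidate generation from evaluation, so that results can be reported one at a time as workers finish. Everything here is hypothetical: `AskTellGaussian` and `run_async` are names invented for this example, the update rule is a deliberately crude toy, and none of it reproduces the API of Nevergrad, CMA, or the paper's method.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

import numpy as np


class AskTellGaussian:
    """Toy ask-and-tell optimizer with an isotropic Gaussian around a mean."""

    def __init__(self, mean, sigma=0.3, lr=0.2, seed=0):
        self.mean = np.asarray(mean, dtype=float)
        self.sigma = sigma
        self.lr = lr
        self.best = -np.inf
        self.n_told = 0
        self.rng = np.random.default_rng(seed)

    def ask(self):
        # Produce one candidate; no notion of a generation is needed.
        return self.mean + self.sigma * self.rng.standard_normal(self.mean.shape)

    def tell(self, candidate, fitness):
        # Toy rule: move the mean toward any candidate that beats the best so far.
        self.n_told += 1
        if fitness > self.best:
            self.best = fitness
            self.mean = self.mean + self.lr * (candidate - self.mean)


def run_async(opt, fitness, budget=50, workers=4):
    """Keep `workers` evaluations in flight; tell each result as soon as it is done."""
    pending = {}  # maps each in-flight future to the candidate it evaluates
    submitted = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while submitted < budget or pending:
            while submitted < budget and len(pending) < workers:
                x = opt.ask()
                pending[pool.submit(fitness, x)] = x
                submitted += 1
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                opt.tell(pending.pop(fut), fut.result())
    return opt


opt = run_async(AskTellGaussian(np.full(3, 2.0)), lambda x: -float(np.sum(x ** 2)))
```

Only the main thread calls `ask` and `tell`, so the optimizer itself needs no locking. Any optimizer exposing these two methods could be dropped into the same loop, which is the sense in which the format makes asynchronous baselines cheap to try.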
Detail (not taken into account for the overall evaluation): the English is sometimes suboptimal (though always readable), e.g. mastering Go rather than "the Go".
Summary and Contributions: This paper proposes a unique asynchronous framework for effectively combining Evolution Strategies (ES) and Deep Reinforcement Learning algorithms. The authors introduce various methods to minimize problems that may occur during the asynchronous update. Experimental results show that high performance can be obtained while reducing learning time through the proposed framework.
Strengths: - Various asynchronous update methods are proposed, and the characteristics of each method are well compared. - It is practically useful because it increases the efficiency of parallel training. - The baselines used in the experiments were carefully selected and compared. - The experimental results support the benefit of the proposed method.
Weaknesses: - This paper argues that the proposed asynchronous framework is more stable than the previous method. However, a metric to confirm stability is not defined, and the experimental results do not support the stability claim either. - There is no analysis of the performance and obstacles when using a variance matrix instead of a covariance matrix. - The time-efficiency comparison was conducted in only a single setup. It would be helpful to have a comparison across different numbers of CPUs, GPUs, and workers.
Correctness: The empirical methodology seems correct. But some claims are not fully supported. The authors claim that the proposed method is more stable than the previous method, but stability cannot be confirmed in the paper.
Clarity: The paper is well written and easy to follow.
Relation to Prior Work: The main background is explained in detail, and the difference from the previous work is clearly described.
Additional Feedback: #### POST REBUTTAL #### The authors' feedback covered my question. I recommend acceptance as in the previous review.