__ Summary and Contributions__: The paper offers a method to improve exploration by training a population of agents with diverse behaviors. The authors seek to train a group of agents that covers a maximal area of the behavioral manifold defining the set of possible policies. They present two variants of their method: one based on evolution strategies and one based on gradient descent. In addition, the authors use Thompson sampling to control the degree of regularization by simultaneously solving a multi-armed bandit problem over a discrete set of regularization coefficients. The method is evaluated on a set of continuous control tasks from the MuJoCo benchmark and shows improvements over NSR-ES, an evolution-strategy algorithm.
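To make this concrete, here is a minimal sketch of the kind of bandit I understand the authors to be using. A Beta-Bernoulli formulation is my assumption, and the arm values and the binary "improved" reward signal are illustrative, not the paper's exact design:

```python
import random

# Hypothetical sketch: Thompson sampling over a discrete set of
# regularization coefficients, modelled as a Beta-Bernoulli bandit.
LAMBDAS = [0.0, 0.5]          # discrete arms (illustrative values)
alpha = [1.0] * len(LAMBDAS)  # Beta prior: pseudo-successes per arm
beta = [1.0] * len(LAMBDAS)   # Beta prior: pseudo-failures per arm

def select_lambda():
    # Sample a success probability for each arm, pick the argmax.
    samples = [random.betavariate(alpha[i], beta[i])
               for i in range(len(LAMBDAS))]
    return max(range(len(LAMBDAS)), key=lambda i: samples[i])

def update(arm, improved):
    # Assumed Bernoulli reward: did the population improve this round?
    if improved:
        alpha[arm] += 1.0
    else:
        beta[arm] += 1.0
```

Arms whose selections tend to coincide with improvement accumulate Beta mass and get sampled more often, which matches the paper's description of adapting lambda online.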

__ Strengths__: The paper identifies and addresses an important problem that arises when handling ensembles of agents: the naive requirement of pairwise diversity indeed does not lead to global diversity. The authors include a toy MDP example in Section 3.3 that illustrates this claim well.

__ Weaknesses__: 1. Not only does the motivation of Section 3.1 seem irrelevant to the method, it also creates unnecessary confusion (the motivation talks about cross-entropy RL, i.e., trust-region methods). While (minimum) cross-entropy RL is a stabilization framework, this paper is about diversification and exploration, which is quite the opposite. I think the motivation conflates minimum cross-entropy RL (stability) with maximum cross-entropy RL (exploration/diversity).
2. In my opinion, the most interesting part of this work is seeing how the authors derive an optimization algorithm from the determinant regularization (in the gradient-based version). However, this was deferred to the appendix (Lemma 4.1), and the appendix is missing.
3. The result in Figure 5 is puzzling. The authors claim that they do not hurt performance when exploration is not needed. However, this can be attributed to the Thompson sampling component, which probably turned exploration off (lambda = 0). Ablation studies are needed to understand what is going on.
4. The proposed method comprises two major modules: determinant regularization and Thompson sampling of the regularization coefficient. Since the two modules are "orthogonal", ablation tests are necessary. While revealing that the former module is responsible for most of the improvement would be interesting, finding that the improvement comes from the Thompson sampling part would make the results in this paper somewhat less significant.

__ Correctness__: I am not able to evaluate the correctness of the results because all proofs are deferred to a missing appendix. Moreover, the gradient of the regularization term includes a matrix inversion and I am curious to understand how this is carried out in practice.
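For context on why an inversion appears at all: for a log-determinant regularizer, the matrix calculus identity d log det(K)/dK = K^{-1} (for symmetric positive-definite K) brings the inverse kernel matrix into the gradient. A quick NumPy sketch (my illustration, not the authors' code) checks this identity against a finite difference:

```python
import numpy as np

# Illustrative check: for positive-definite K, d log det(K)/dK = K^{-1}.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
K = A @ A.T + 4.0 * np.eye(4)    # positive definite by construction

sign, logdet = np.linalg.slogdet(K)
grad = np.linalg.inv(K)          # analytic gradient of log det(K)

# Finite-difference check along a symmetric perturbation of entry (0, 1).
eps = 1e-6
E = np.zeros_like(K)
E[0, 1] = E[1, 0] = eps
fd = (np.linalg.slogdet(K + E)[1] - logdet) / eps
# For this symmetric perturbation the directional derivative is
# tr(K^{-1} E) / eps = grad[0, 1] + grad[1, 0] = 2 * grad[0, 1].
```

In practice one would still want the authors to say whether they form this inverse explicitly or use, e.g., a Cholesky solve, since that is exactly the cost question raised above.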

__ Clarity__: Unfortunately not. The paper should undergo serious proofreading. A partial list of typos and spelling errors follows.
1. line 23: “Training a population of agents provides is a promising approach”
2. line 45: “we note that it is remains”
3. line 68: “targeting learning a value function”
4. line 78: “as taking as input parameters”
5. line 232: “it is proposed to use of a population of agents”

__ Relation to Prior Work__: The authors mention the MAP-Elites algorithm as a method similar to their work. However, the algorithm is not included in the evaluation. It would be interesting to see a comparison with this method.

__ Reproducibility__: No

__ Additional Feedback__: Preliminaries section: the overview of different RL approaches is too detailed and could be shortened to free up space for important results currently deferred to the appendix.

__ Summary and Contributions__: The main contribution of this paper is a novel method for promoting diversity in direct policy search in RL using gradient-free methods, which the authors call population-based methods. Their method consists of adding to the loss function a population-diversity term based on a kernel defining a distance between two policies. They also provide a proof showing that their approach indeed promotes diversity, even if the result is not that strong.

__ Strengths__: The paper is well-written and pretty clear.
They have theoretical results.
The simulations appear to be carried out in a competent way and highlight the interest of their approach.

__ Weaknesses__: The approach they propose is itself not that original.
Minor remark that I would like to see addressed: it bothers me that the authors consider stationary policies in a finite-horizon setting, where such policies are suboptimal.

__ Correctness__: Yes, I would say so.

__ Clarity__: Yes.

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The paper proposes a way of measuring the behavioral diversity of a population of agents. This is useful because many RL techniques run several agents in parallel to learn optimal policies more efficiently, but in many tasks they can get 'stuck' in local optima.
This measure of policy diversity is incorporated into the optimization objective, encouraging agents' behaviors to be diverse and thus possibly escaping local optima.

__ Strengths__: This is a very relevant problem for the RL community, especially for tasks that can be simulated. Diversity of behaviors can help agents escape local optima or discover new skills, which in turn leads to finding solutions that could not have been found otherwise.
The paper is well-written, formalizes the problem clearly, and the results are encouraging.

__ Weaknesses__: The main weakness of this paper, in my opinion, is the experimental section. Many details are missing for each experiment, and the experiments feel a bit 'rushed'. My suggestion would be to select fewer experiments and present a more fleshed-out analysis of them.
For example:
In the exploration task, the authors state that NSR-ES fails to solve the problem but DvD solves it in all instances. Why is that the case? Since I am familiar with the topic, I can infer that the other methods got stuck in a local optimum hitting the wall, and that adding diversity allowed some agents to reach policies that move around the wall and successfully reach the goal.
The reader should not be left to infer these results.
Another example is the Humanoid experiments. At first sight, the results do not look particularly meaningful, but the authors report significance at the p < 0.05 level based on Welch's t-test. The reader should have more context on how this t-test was performed: what are the distributions being compared, and why can we assume they are normal?
- What is the computational cost of taking the gradient of the new diversity term? The humanoid section mentions that the new method reaches a score of 6,000 in 1M steps while SAC needs about 3M. Is the update of the proposed approach hiding some costly computation, or is the cost of each update comparable between the two methods? Whichever is true should be stated explicitly to avoid leaving the reader guessing.
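To make the t-test question above concrete: Welch's statistic and its Welch-Satterthwaite degrees of freedom are straightforward to compute (pure-Python sketch for reference; what the compared distributions actually are, e.g. per-seed final returns, is exactly what the paper should state):

```python
import math

def welch_t(xs, ys):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two samples with (possibly) unequal variances."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)  # sample variance
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    se2 = vx / nx + vy / ny                          # squared std. error
    t = (mx - my) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1)
                     + (vy / ny) ** 2 / (ny - 1))
    return t, df
```

The p-value then comes from the t-distribution with df degrees of freedom (e.g. scipy.stats.ttest_ind with equal_var=False does all of this); with as few as 7-10 seeds per method, the normality assumption is doing a lot of work, which is why I ask for context.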

__ Correctness__: For the most part, although some details in empirical experiments are missing, or should be clarified.
- Please respond to the t-test comment above.
- How many trials were run per experiment? The exploration experiment mentions 10 seeds; is that the case for all other experiments as well?

__ Clarity__: Yes

__ Relation to Prior Work__: As far as I can tell, it is well placed in the context of relevant literature.

__ Reproducibility__: Yes

__ Additional Feedback__: Besides the comments above, I have a few points where I would appreciate a bit of clarification.
- line 73 mentions that the transition dynamics are often deterministic. This is actually rarely the case in most interesting problems. Did you mean to say stationary?
- line 106, definition of behavioral embedding. Is the behavioral embedding a matrix where each row i corresponds to the prob distribution over actions for state i? Am I reading this correctly?
- line 109, why are policies deterministic? Does this method not apply to stochastic policies? If so, it would be nice to state it right at the intro/abstract, so that the reader can know whether this applies or not to their problems.
- ES and TD3, please introduce abbreviations the first time they are used.
- Just personal preference, feel free to ignore if you don't agree. It threw me off a bit having a related work section before the experiments. Normally I would expect it after the introduction or right before the conclusion.
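To make my reading of the behavioral embedding (the line 106 question above) concrete, here is a hypothetical sketch that flattens the per-state action distributions into a single vector. This is my interpretation for discussion, not the paper's code:

```python
import numpy as np

# Hypothetical reading of the embedding: row i of policy_probs is the
# probability distribution over actions in state i, and the embedding
# is the concatenation of these rows into one flat vector.
def behavioral_embedding(policy_probs):
    probs = np.asarray(policy_probs, dtype=float)
    assert np.allclose(probs.sum(axis=1), 1.0), "rows must be distributions"
    return probs.reshape(-1)  # flatten (num_states, num_actions) -> vector
```

If this reading is correct, it would also clarify the deterministic-policy question: a deterministic policy is just the special case where each row is one-hot.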
=======
After rebuttal
Thank you for taking the time to respond to our feedback.
I like the idea in general, but in my opinion the experimental section could be improved quite a bit.
As presented, it is difficult to judge how significant the results are.

__ Summary and Contributions__: This paper proposes a new algorithm called DvD that promotes behavioral diversity among all policies in a population. This is achieved through the joint use of several key techniques, including behavioral embedding in the action space, kernel-based behavioral comparison, a population-wide diversity measure computed as the determinant of the kernel matrix, and Thompson sampling for adaptive updating of lambda.
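For reference, the determinant-based diversity measure can be sketched in a few lines. The squared-exponential kernel and its lengthscale here are my assumptions; the paper may use a different kernel:

```python
import numpy as np

# Sketch of the population diversity measure: build an M x M kernel
# matrix over the agents' behavioral embeddings and take its
# determinant (RBF kernel assumed for illustration).
def diversity(embeddings, lengthscale=1.0):
    E = np.asarray(embeddings, dtype=float)       # (M, d) embeddings
    d2 = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * lengthscale ** 2))    # kernel matrix
    return np.linalg.det(K)
```

Identical behaviors give a singular kernel matrix (determinant 0), while well-separated behaviors push the determinant toward 1, which is what makes it a population-wide rather than pairwise measure.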

__ Strengths__: This paper introduces several interesting ideas. I found the idea of measuring population diversity through kernel function and determinant of the kernel matrix particularly interesting and novel. The research questions explored in the paper are also highly relevant to the NeurIPS community.

__ Weaknesses__: The paper may need to be improved to address a few important issues, as detailed below.
1. Why is it important to enhance population-wide behavioral diversity? Intuitively I can understand the potential benefits related to deep exploration and learning stability. However, I cannot theoretically link these benefits in a straightforward way to the proposed use of the kernel function and the determinant of the kernel matrix. Theorem 3.3 states that when lambda is set properly, the population will contain M distinct optimal policies. This theorem assumes the existence of a positive constant Delta, an assumption that is not always valid, especially when considering stochastic policies. Even when the theorem holds, why is it necessary to find all the distinct optimal policies when the learning goal is just to find one of them? In other words, why does trying to find all optimal policies increase the chance that the samples collected by DvD allow the learning algorithm to escape local optima and achieve higher sample efficiency, as demonstrated in Figure 6? The paper provides some illustrative examples on pages 4 and 5, but I am still not fully convinced of the necessity of finding all optimal policies. As further noted by the authors in the appendix, if the population size is larger than the number of modes, learning performance is negatively affected. In view of this, how many optimal policies should DvD search for simultaneously in a population? Why does collecting samples with more policies in the population actually hurt learning performance, provided that good diversity is maintained among all policies?
2. After checking the proof of Theorem 3.3 in the appendix, I cannot understand the last inequality on page 15. Given the conditions R(pi) + Delta < R and Div(Theta) < 1, the LHS of the inequality should be less than M*R - M*Delta + lambda rather than M*R - Delta + lambda. The correctness of this derivation step may need to be double-checked. Also, for the kind of learning problems studied in the paper, the condition on the existence of a positive constant Delta may not be satisfied. Hence the theorem may not be able to explain how the diversity measure actually drives the learning of all optimal policies.
3. The proposed behavioral embedding requires a finite set of states. However, for many practical problems the state space is not only infinite but also uncountable. In this case, the authors propose to use sampled states to estimate the behavioral embedding. I was wondering how such a sampled embedding may affect the diversity of the population and the performance of DvD. Some theoretical analysis would be very helpful for me to truly understand the importance and novelty of the newly proposed behavioral embedding. Similarly, some high-level intuitive discussion is presented on page 4 to justify the use of the kernel matrix determinant as a measure of population diversity. Beyond high-level discussion, I think detailed analysis is also required to clearly justify the efficacy of the determinant measure, especially since it may noticeably increase the computational complexity of the resulting algorithm.
4. The use of notation should be carefully managed. For example, M is used in Theorem 3.3 to refer to an MDP, while M is also used to represent the number of agents in the population. This may confuse readers.
5. DvD introduces several new hyper-parameters, for example those related to the behavioral embeddings, the choice of the kernel function, and the number of agents, as well as those highlighted in Table 3 and Table 4. How should suitable values of these hyper-parameters be determined? Is it easy to tune them in practice? Are there any empirical rules to follow? More experimental results may be needed to understand the performance impact of these hyper-parameters.
6. Figure 6 does not clearly show that DvD outperforms TD3. Although statistical test results indicate that the performance difference is significant, the test was performed on experiments that used only 7 seeds, so the results may not be very reliable. Even if DvD achieved higher performance than TD3, it is more complicated in design than TD3. How does DvD compare to TD3 in terms of computational cost? I think more evidence and more experimental results are needed to truly understand the advantages of using DvD over TD3 and other cutting-edge deep reinforcement learning algorithms.
Many thanks to the authors for their feedback. I still feel several aspects of this paper need to be improved with more details. Meanwhile, I could not find any direct response regarding the proof of Theorem 3.3 and the validity of its assumptions.

__ Correctness__: Most of the claims and methods appear to be correct. The derivation of Theorem 3.3 may need to be double-checked; the validity of this theorem should be verified further, and its connection to the improved learning performance should be justified more thoroughly. The applicability of statistical tests on limited data may also need to be carefully checked.

__ Clarity__: This paper is generally well written and easy to follow.

__ Relation to Prior Work__: The connection to previous works has been presented in section 5. The key difference from previous research has also been elucidated.

__ Reproducibility__: Yes

__ Additional Feedback__: