__ Summary and Contributions__: The paper proposes an approach for learning diverse behaviors, so that policies are not overly specific to a single task but are general and robust to variations of the task. The proposed method considers policies that depend on latent variables and optimizes an objective that prefers policies with high mutual information between the trajectory and the latent variable, subject to the constraint that those policies must be \epsilon-optimal. Unlike meta-learning, training is carried out in a single environment while testing is done on variations of it. A theoretical study justifying the proposed objective is provided, together with an experimental evaluation.

__ Strengths__: The main strength of the paper is its attempt to address a quite challenging scenario in which training can be carried out in a single environment while testing should be done in different environments.

__ Weaknesses__: - The objective function (1): I have two concerns about the definition of this objective:
1. If the intuitive goal consists of finding a set of policies that contains an optimal policy for every test MDP in S_{test}, I would rather evaluate the quality of \overline{\Pi} by its performance on the worst MDP. In other words, I would have employed the \min over S_{test} rather than the summation. With the summation we might select a subset of policies that are very good for the majority of the MDPs in S_{test} but very bad on the remaining ones; this phenomenon would be hidden by the summation but highlighted by the \min.
2. If no conditions on the complexity of \overline{\Pi} are enforced the optimum of (1) would be exactly \Pi, or, at least, the largest subset allowed.
- Latent-Conditioned Policies: How is this different from considering a hyperpolicy that is used to sample the parameters of the policy, as in Parameter-based Exploration Policy Gradients (PGPE)?
Sehnke, Frank, et al. "Policy gradients with parameter-based exploration for control." International Conference on Artificial Neural Networks. Springer, Berlin, Heidelberg, 2008.
- Choice of the latent variable distribution: at line 139 the authors say that p(Z) is chosen to be the uniform distribution, while at line 150 p(Z) is a categorical distribution. Which one is actually used in the algorithm? Is there a justification for choosing one distribution rather than another? Can the authors motivate?
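To illustrate the first concern about objective (1) above, here is a minimal sketch with made-up numbers. The two candidate policy sets, their names, and the returns are hypothetical and are not taken from the paper; each entry stands for the best return that any policy in the set achieves on one test MDP.

```python
# Hypothetical data: best return achieved by each candidate policy set
# on four test MDPs. None of these numbers come from the paper.
best_return = {
    "set_A": [10.0, 10.0, 10.0, 0.0],  # excellent on three MDPs, fails on one
    "set_B": [7.0, 7.0, 7.0, 7.0],     # uniformly decent on all four
}


def sum_objective(returns):
    """Aggregate performance over test MDPs by summation."""
    return sum(returns)


def worst_case_objective(returns):
    """Aggregate performance over test MDPs by the worst case."""
    return min(returns)


# Select the candidate set that maximizes each aggregate.
pick_by_sum = max(best_return, key=lambda k: sum_objective(best_return[k]))
pick_by_min = max(best_return, key=lambda k: worst_case_objective(best_return[k]))

print("selected by sum:", pick_by_sum)
print("selected by min:", pick_by_min)
```

In this toy case the summation prefers the set that fails completely on one MDP, while the \min prefers the uniformly decent set, which is the effect described in the concern.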
**Minor**
- line 64: remove comma after a_i
- line 66: missing expectation around the summation
- line 138: what is H(Z)?
- line 322: don’t -> do not
- Equation (2): at this point the policy becomes a parametric function of \theta. Moreover, the dependence on s when using the policy as an argument for R_{\mathcal{M}} should be removed
- Figure 3: the labels on the axis are way too small
- Font size of the captions should be the same as the text

__ Correctness__: The proofs of all claims are reported in the appendix. I made a high-level check of the math and it seems correct to me.

__ Clarity__: The paper is clearly written.

__ Relation to Prior Work__: The proposed approach shares similarities with metalearning and robust RL. The connections are appropriately discussed in Section 6.

__ Reproducibility__: Yes

__ Additional Feedback__: The paper has nice potential. Considering my concerns about the definition of the objective function, at present, I am not sure that this paper represents a sufficient contribution to be suited for publication at NeurIPS.
***Post Rebuttal***
I thank the authors for the feedback. I have read it together with the other reviews. I am happy that the authors clarified my issue about the objective function. I think that the paper, provided that the authors make the promised fixes, is a suitable contribution for NeurIPS. For this reason, I increase my score to 6.

__ Summary and Contributions__: UPDATE after rebuttal
I thank the authors for providing more intuition about the settings in which the method can be expected to succeed or fail. I hope they will follow through with the promise to include additional experiments that emphasize the limitations of this approach, as well as more in-depth analysis of the learned policies, as suggested in my review. I believe these additions would be valuable for readers and would improve the paper overall. However, due to the somewhat limited applicability and novelty of the proposed method, I will keep my score.
----
This paper proposes a new approach for generalizing to a certain type of out-of-distribution environments by training a diverse set of near-optimal policies on a single training environment. They formalize the family of environments to which this method is robust as a function of the training environment and its optimal policy. The method outperforms a series of baselines on a wide range of environments with new transition and / or reward functions.

__ Strengths__: I liked this paper overall. The problem formulation and method description are clear and easy to read. I also appreciated that the paper includes both some theoretical analysis and empirical validation. The problem of generalizing to out-of-distribution environments using only a few episodes is an important one. The proposed approach is also novel as far as I know, even if it builds on ideas from prior work. Another strength of this paper is the precise description of the test environments to which the proposed approach can be expected to be robust.

__ Weaknesses__: My main concern about this paper is the generality and effectiveness of this method on more complex and realistic settings. It seems to me that the environments used for evaluating the method were specifically designed in such a way that the proposed method can work well. I am not sure how realistic this assumption (that near-optimal policies on the train environments will be near-optimal for the test environments) is for more real-world problems we might want to solve. My understanding is that this method would only work well for test environments that are still fairly similar to the training one. So it is not clear how much you gain by doing this rather than doing domain randomization or adversarial robustness training.
It would be useful to provide more analysis of how the method works in practice. For example, it would be valuable to report the performance achieved by all the policies on both the training environments and the test environments, as well as the correlation between the training and test performance for a given policy. This would provide more understanding of how much variance there is in the performance of the different policies, as well as the difference in successful behaviors between train and test.
It would also be useful to report which policy is selected by the algorithm as being the best for each of the test environments, as well as how far its performance is from that of the optimal policy on that environment (e.g. one directly trained on the test environment using SAC). This would be a good sanity check to ensure that there isn't a single policy (perhaps the optimal training one) that always performs best on the train environments.

__ Correctness__: Yes

__ Clarity__: Yes

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The paper proposes a new algorithm whose objective is to learn a set of policies that are robust to variations of the training environment at test time (while being trained on a single, fixed MDP). The principle is to learn a latent-conditioned policy (the latent variable being used to encode a set of different policies) such that, for each value of the latent variable, the resulting policy achieves a reward close to that of the optimal policy in the training MDP, but is different from the policies encoded by other values of the latent variable. Concretely, this is done by extending the 'Diversity is all you need' (DIAYN) algorithm to consider a task reward in addition to the diversity-driven reward. Instead of using a naive combination of the two criteria (which is done as a baseline in the experimental section), the authors describe a threshold-based approach in which diversity is optimized only if the policy obtains a critical level of reward. In addition to a concrete algorithm, the paper provides a study of the properties of such an approach to understand in which cases this model is useful, and which types of modifications to the original MDP it can deal with. Experiments are conducted on simple, classical environments, with a comparison to multiple baselines showing the interest of the approach.

__ Strengths__: + The paper targets an important problem which is usually solved by using a different training setting where multiple variations of an environment are available at train time. In this paper's case, it is interesting to see that the method works when training on only a single environment. It is thus usable in applications where interaction between the world and the agent is expensive.
+ The method is a modification of DIAYN that may be simple to reproduce (See next section) and can be plugged in any RL algorithm. It is provided with an interesting analysis of the properties of the produced policies.
+ Experiments are convincing, even if made on simple environments
+ Subjective opinion: I am convinced that training multiple policies instead of a single one is a very nice approach to many aspects of RL, and I am happy to see more and more papers proposing concrete ways to do it, and concrete use-cases. Such a paper will interest a large audience.

__ Weaknesses__: Some aspects of the method are not clear. My main concern is the computation of equation (4)/eq (5). The notation R_M refers to the reward obtained by a particular policy, and the idea is to consider the diversity reward only if the current policy is able to achieve a reward that is greater than the optimal reward minus epsilon. What is not clear to me is how this optimal reward is computed: is it the expectation of the future reward obtained in the current state by the optimal policy? Or is it just a comparison in terms of reward over the whole trajectories of the two policies starting from the same state? If so, how do you compute this difference when different initial states are sampled? Do you maintain an average reward value for each of the trained policies? In terms of practical implementation, is it just a reward modification, or does it have consequences for the advantage estimation/critic? If you compare the current policy and the optimal policy at each state, how do you handle the fact that these two policies do not generate states with the same distribution (so that the estimated value for the optimal policy may not be accurate)? I therefore have many concerns on this point and can imagine many different ways to implement the method. Which one is the right one? Having explicit pseudo-code for the SMERL (with SAC) algorithm would help.
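To make the ambiguity concrete, the following is a minimal sketch of just one of the readings listed above: the whole-trajectory comparison, implemented as a pure reward modification. It is not taken from the paper and should not be read as the authors' implementation. The function name `gated_rewards`, the undiscounted episode return, the per-step `diversity_bonuses` (standing in for a DIAYN-style discriminator term), and the weight `alpha` are all assumptions made for illustration.

```python
from typing import List


def gated_rewards(task_rewards: List[float],
                  diversity_bonuses: List[float],
                  optimal_return: float,
                  epsilon: float,
                  alpha: float = 1.0) -> List[float]:
    """Add the diversity bonus only if the episode is epsilon-optimal.

    Assumed reading: the gate compares the undiscounted return of the
    whole episode against optimal_return - epsilon, where optimal_return
    is an externally supplied estimate of the optimal policy's return.
    """
    if len(task_rewards) != len(diversity_bonuses):
        raise ValueError("need one diversity bonus per task reward")

    episode_return = sum(task_rewards)
    near_optimal = episode_return >= optimal_return - epsilon

    # The bonus is switched on or off for the whole episode at once.
    weight = alpha if near_optimal else 0.0
    return [r + weight * b for r, b in zip(task_rewards, diversity_bonuses)]


# Example: return 3.0 clears the threshold 3.5 - 1.0, so bonuses are added.
print(gated_rewards([1.0, 1.0, 1.0], [0.5, 0.5, 0.5],
                    optimal_return=3.5, epsilon=1.0))
```

Under this reading the critic is affected only through the modified rewards. A per-state comparison against a value estimate of the optimal policy would need a different implementation, which is why explicit pseudo-code from the authors matters here.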
In terms of experiments, problems with a high-dimensional input space (e.g. pixels) are lacking; they are not crucial but would strengthen the paper. The best would be to have a 'real' use-case instead of hand-made ones, for instance in robotics, moving in a house where furniture may move between two episodes.
The way epsilon is chosen is not clear. How do you cross-validate on epsilon if you don't have validation MDPs?
Similarly, how do you choose the right number of policies to learn?
How about having a continuous latent variable instead of a discrete one? Having a small discussion on that point would be interesting.
The fact that SAC+DIAYN results do not appear in the paper (but in the supplementary material) is strange since it is an important comparison. Moreover, I would be curious to understand how SAC+DIAYN is concretely done (through a weighted sum? If so, how do you choose the value of the weight?), since it may certainly also be a good approach when carefully tuned.

__ Correctness__: Everything seems correct; no particular comments on that point.

__ Clarity__: The paper is well written, and well structured.

__ Relation to Prior Work__: Multiple connections are made with the existing related work. Since the proposed model is a simple extension of DIAYN, I would (again) move the SAC+DIAYN model from the supplementary material to the main article, and better describe the differences.

__ Reproducibility__: No

__ Additional Feedback__: Reproducibility is not easy (see my comments) and providing a pseudo-code showing the concrete implementation would help.

__ Summary and Contributions__: This paper presents an algorithm that learns diverse ways to solve a given task. The policy is conditioned on the latent variable, and the lower bound of the mutual information between the state and the latent variable is maximized. The lower bound of the mutual information is based on the approach used in DIAYN. The benefits of learning diverse solutions are shown as robustness to perturbations to a task.
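For reference, the DIAYN-style lower bound mentioned above can be written out as follows. This is a sketch in DIAYN's usual notation rather than a transcription of the paper's equations: q_\phi(z \mid s) denotes a learned discriminator, p(z) the prior over the latent variable, and \pi_z the policy conditioned on z, all of which are notational assumptions here.

```latex
\begin{align}
I(S; Z) &= H(Z) - H(Z \mid S) \\
        &= \mathbb{E}_{z \sim p(z),\, s \sim \pi_z}\left[\log p(z \mid s)\right]
           - \mathbb{E}_{z \sim p(z)}\left[\log p(z)\right] \\
        &\geq \mathbb{E}_{z \sim p(z),\, s \sim \pi_z}\left[\log q_\phi(z \mid s) - \log p(z)\right]
\end{align}
```

The inequality follows from the non-negativity of the KL divergence between the true posterior p(z \mid s) and the discriminator q_\phi(z \mid s), so the bound is tight when the discriminator matches the posterior.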

__ Strengths__: Learning diverse solutions is an interesting research direction, and it is not trivial to balance the objective that encourages the diversity of solutions and the objective of solving a given task. The way of encouraging the diversity of solutions while keeping the solution quality may inspire researchers in the NeurIPS community.

__ Weaknesses__: To balance the diversity of solutions and the objective to solve a given task, the proposed method introduced the constrained information maximization in (2). However, the proposed algorithm needs to know the optimal return R_M(\pi^*_M). Therefore, before using the proposed method, it is necessary to solve the given task using another off-the-shelf RL method.

__ Correctness__: In the experiments, the training conditions for Robust Adversarial RL are not clear. I think that the robustness of the policy learned with RARL depends on the hyperparameters of the adversary. The authors need to describe the training conditions for RARL and clarify how they are determined.

__ Clarity__: The paper is overall clearly written and easy to follow, except for some parts. I’m not sure how the latent variable z is sampled during the learning process. Although it seems that the latent variable is fixed during an episode in the adaptation phase, I’m not sure about the learning phase.

__ Relation to Prior Work__: Learning a policy conditioned on the latent variable for diverse behaviors appears in the context of imitation learning as well. I think the following studies should be also cited and discussed in the related work section.
[1] Y. Li, J. Song, S. Ermon. InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations. NeurIPS 2017.
[2] Merel et al., Neural Probabilistic Motor Primitives for Humanoid Control, ICLR 2019
Regarding diverse trajectories for achieving a task, the following study on hierarchical policy search seems relevant. It addresses a special type of HRL in which an option is selected at the beginning of the episode (see Figure 3 in [3], which shows two paths to reach the specified goal).
[3] Christian Daniel, Gerhard Neumann, Oliver Kroemer, Jan Peters; Hierarchical Relative Entropy Policy Search, Journal of Machine Learning Research, 17(93):1−50, 2016.

__ Reproducibility__: No

__ Additional Feedback__: - In the supplementary material, I do not clearly understand how the hyperparameter B is used. I think that B does not appear in the main text. Please elaborate on it.
- I’m not sure why the results of SAC + DIAYN appear only in the supplementary material. I think they should be in the main text since they would add important information.
=== comments after rebuttal ===
I have read the other reviews and author response. I appreciate the authors' efforts to answer the questions raised by reviewers. Although the author response clarified some points, I did not find new information that makes me increase the score. I keep the initial score.