Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper is well written. I do not have any clarity issues. To a large extent, the paper is a successor of the work by Zheng et al.: it is a straightforward extension of learning intrinsic rewards to the cooperative multi-agent setting. Therefore, the technical contributions are somewhat limited. PS: another paper on multi-agent intrinsic rewards: Liu, Bingyao, Satinder Singh, Richard L. Lewis, and Shiyin Qin. "Optimal rewards for cooperative agents." I have given my list of the three most significant contributions and suggestions for improvement in other sections of the review. Here I have some minor questions: is there any particular reason why the authors did not choose all the tasks used in the COMA paper, for the purpose of comparison? In the COMA paper, the tasks are 3M, 5M, 5W, and 2D3Z; in this paper, we have 3M, 8M, 2S3Z, and 3S5Z.
Originality: The ideas introduced here are certainly not new, but extending intrinsic rewards to multi-agent RL settings is an interesting research avenue, especially when one is interested in decentralized multi-agent settings where no communication is possible between agents.

Quality: The paper is well written.

Clarity: It is a fairly convoluted method, with many components. A better overview of the algorithm would be useful (perhaps in the supplementary materials). Furthermore, not all of the details regarding the experimental setup and parameter choices are specified. This information is important for reproducibility and could also be included in the supplementary materials. A few examples: the beta learning rate, whether any parameter search was performed for lambda, and a more in-depth view of the networks' architectures.

Significance: The method is compared against state-of-the-art methods and does show improvements in the selected scenarios. The authors also perform a short analysis of what intrinsic reward the agents learn and how it affects their behaviour.

------------------------------------ Post-rebuttal: I appreciate the authors' efforts to answer the raised concerns, and I think the additional experiments, analysis, and explanations will improve the work. I will maintain my score, given the novelty level of the work.
This work deals with learning individual intrinsic rewards (IR) for multi-agent RL (MARL). Overall, the method provided is a straightforward application of a known IR method to MARL, the results are promising, and the writing is clear. As such, this work has limited novelty but provides good empirical contributions, though these too could be improved by considering more domains. A more detailed review of the paper, along with feedback and required clarifications, is provided below.

The work is motivated by the claim that providing individual IRs to different agents in a population (in a MARL setting) will allow diverse behaviours.
* Is it not possible that the IR policies learnt all look similar, and thus the behaviour that emerges is similar? The analysis at the end of the paper shows that many of the learned IR curves do overlap. Please provide more justification for this motivation.

The work clearly describes related work and how the approach here differs. The main contribution is to apply the meta-gradient based approach in "On learning intrinsic rewards for policy gradient methods" (as per the paper) to the multi-agent setting.
* This looks to be a straightforward application where each agent has the LIRPG approach applied. Please provide succinct details of any modifications required to apply this, and any differences in implementation. The method section can be shortened, as most of the algorithm and objective are the same as in the original LIRPG algorithm.

A range of methods is compared in the experimental section: independent Q-learning/actor-critic, central critics, counterfactual critics, QMIX, and the proposed approach (LIIR).
* Please clarify what is meant by "share the same policy": do the agents share the same policy network weights, or also the exact same policy output? Do all agents get the same observation? If so, what is the difference between IAC and central-V? Is the only change how V is updated, while the policy is the same?
* How is the parameter \lambda tuned for the agents?

Lastly, the results section shows clear benefits of this approach. The method, along with several baselines, is applied to a set of StarCraft mini-games. The analysis is promising and shows that the method learns an interesting policy that captures the dynamics of the game. Overall this is a good contribution, but for an empirical paper it could be strengthened by considering more domains or tasks and demonstrating the ability of this method to work across the board.
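For concreteness, the per-agent bilevel update the reviews refer to (each agent running its own LIRPG-style inner/outer loop) can be sketched in a scalar toy form. Everything here — the quadratic extrinsic reward, the linear intrinsic reward, the constants, and the function names — is an illustrative assumption of the sketch, not the authors' implementation:

```python
# Toy scalar sketch of a LIRPG-style bilevel update applied independently
# per agent. Extrinsic reward r_ex(theta) = -(theta - target)^2 (optimum at
# `target`); intrinsic reward r_in(theta; eta) = eta * theta. All values are
# illustrative, not taken from the paper.

ALPHA, BETA, LAM = 0.1, 0.1, 0.5  # policy lr, intrinsic-reward lr, mixing weight

def train_agent(target, theta=0.0, eta=1.0, steps=5000):
    """One agent: policy parameter `theta`, intrinsic-reward parameter `eta`."""
    for _ in range(steps):
        # Inner step: ascend the mixed reward r_ex + LAM * r_in in theta.
        grad_mixed = -2.0 * (theta - target) + LAM * eta
        theta += ALPHA * grad_mixed
        # Outer (meta) step: ascend the *extrinsic* reward of the updated
        # policy with respect to eta, chaining through the inner update:
        #   d r_ex(theta') / d eta = r_ex'(theta') * ALPHA * LAM
        meta_grad = -2.0 * (theta - target) * ALPHA * LAM
        eta += BETA * meta_grad
    return theta, eta

# Each agent keeps its own (theta, eta) pair; distinct targets stand in for
# the distinct roles that individual intrinsic rewards are meant to enable.
agents = [train_agent(target=t) for t in (1.0, -2.0)]
```

In this toy, each agent's eta shrinks toward zero once its theta reaches the extrinsic optimum, i.e. the learned intrinsic reward only shapes the learning transient — which is also why per-agent curves can end up looking similar, as one review asks about.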