__ Summary and Contributions__: This paper considers fully cooperative multi-agent tasks modeled as Dec-POMDPs and provides a new algorithm, Learning Implicit Credit Assignment (LICA), to train decentralized policies via centralized training with decentralized execution. LICA has two significant features: (a) a policy mixing network is proposed to perform credit assignment implicitly; (b) an adaptive entropy loss is introduced into the policy gradient computation to further improve exploration during training. LICA is evaluated on several benchmark problems and compared with a set of state-of-the-art multi-agent RL approaches. The results show that LICA outperforms all the baselines and demonstrate the advantage of the adaptive entropy regularization. However, there are still significant issues with the method and several concerns about the results.

__ Strengths__: The architecture of LICA is the first contribution of this work: the agents’ stochastic policies are first concatenated and then passed into a mixing network, whose weights are learned from the ground-truth state, to output a joint Q-value that serves as the centralized critic. This policy mixing network is certainly an interesting idea that extends the mixing network of QMIX and the architecture in MADDPG to stochastic policies.
The adaptive entropy regularization is a novel idea. The strength of the entropy loss for encouraging exploration is now adaptive based on the stochasticity of each agent’s policy, such that when an agent’s policy is a near-deterministic one, the entropy loss will have a higher weight to further encourage the exploration. The advantage of having this adaptive feature is demonstrated in the results versus using a fixed weight on the entropy loss.
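As a rough illustration of the adaptive weighting described above, the following sketch assumes the weight on the entropy term is inversely proportional to the policy's entropy. The function names, the `base_weight` value, and the exact scaling rule are assumptions made for illustration and may differ from the paper's formulation.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def adaptive_entropy_weight(probs, base_weight=0.1, eps=1e-8):
    """Hypothetical adaptive weight: base_weight divided by the policy entropy.

    Assumption for illustration only: the weight is inversely proportional to
    the entropy, so near-deterministic policies get a stronger push to explore.
    """
    return base_weight / (entropy(probs) + eps)

uniform = [0.25, 0.25, 0.25, 0.25]   # highly stochastic policy
peaked = [0.97, 0.01, 0.01, 0.01]    # near-deterministic policy

w_uniform = adaptive_entropy_weight(uniform)
w_peaked = adaptive_entropy_weight(peaked)
```

Under this assumed rule, `w_peaked` is larger than `w_uniform`, which matches the behaviour described in the review: the lower the entropy, the stronger the exploration bonus.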

__ Weaknesses__: The first essential issue in the LICA algorithm is that the definition of the centralized value function is not clear. In particular, what exactly is the proposed value function trying to approximate? During training, this centralized value function is conditioned on a sampled joint action (Eq. 3), while during policy updates it is conditioned on the concatenation of the action probabilities output by each agent’s policy. Because of this inconsistency in the input of the value function, the critic should not be expected to provide a correct value estimate for the stochastic policies when calculating the policy gradient. The paper should give a further explanation and theoretical analysis of this approach.
Secondly, in Algorithm 1, the centralized critic is first updated for k iterations using the same sampled data. Why not simply tune the learning rate instead? Theoretically, these k iterative updates seem fine, but if the goal is to obtain a critic that provides a more precise estimate, why not keep training until it converges?
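The input mismatch raised in the first point can be seen in a toy example. The sketch below uses a hypothetical two-agent, two-action coordination game and a hand-built ReLU critic (not the LICA critic) that is exact on one-hot joint actions; all names and numbers are assumptions for illustration.

```python
from itertools import product

def toy_critic(x):
    """Toy nonlinear critic over concatenated action vectors [agent 1 | agent 2].

    On one-hot inputs it returns 1 if both agents pick the same action, else 0.
    """
    relu = lambda z: max(0.0, z)
    return relu(x[0] + x[2] - 1.0) + relu(x[1] + x[3] - 1.0)

def one_hot(a, n=2):
    """One-hot encoding of action index a among n actions."""
    return [1.0 if i == a else 0.0 for i in range(n)]

def expected_q(pi1, pi2):
    """Expectation of the critic over joint actions sampled from pi1 x pi2."""
    return sum(pi1[a1] * pi2[a2] * toy_critic(one_hot(a1) + one_hot(a2))
               for a1, a2 in product(range(2), repeat=2))

pi1 = pi2 = [0.5, 0.5]
q_on_probs = toy_critic(pi1 + pi2)   # critic fed action probabilities directly
q_expected = expected_q(pi1, pi2)    # critic averaged over sampled joint actions
```

Here the critic fits every sampled joint action perfectly, yet evaluating it on the probability vectors gives 0.0 while the expected value under the two uniform policies is 0.5. A critic trained only on one-hot inputs is therefore not constrained to be meaningful on interior points of the probability simplex.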
Lastly, if LICA uses k critic iterations, is the same setting also applied to the other policy gradient baselines such as COMA, MADDPG, SQDDPG, IDDPG, and LIAR? If not, the comparison shown in the results is not fair.
Also, to demonstrate the advantage of the mixing network over an MLP, only the network architecture should be modified and everything else should be kept the same. According to the results shown in Fig. 2b and Fig. 2c, the values of lambda used with the MLP (0.03, 0.04) differ from the values used with the mixing network (0.06, 0.09). This choice should be justified.
Regarding the StarCraft results, the average performance of each method is computed over only three independent runs, which is not enough. The original COMA, QMIX, and MAVEN papers conducted 35 runs, 20 runs, and 12 runs respectively. Moreover, Fig. 4e itself suggests that more runs are necessary to make the results convincing, because the learning curves of COMA, QMIX and VDN differ substantially from the results reported in the paper “The StarCraft Multi-Agent Challenge”. In addition, MAVEN would be a reasonable baseline to compare with, since it also focuses on improving exploration.

__ Correctness__: The paper appears correct, but some issues should be clarified. Some are mentioned above and below. Additionally, it isn't clear why the proposed objective maximizes the returns from every state. So, the subscript t of state s in Eq. 2 should be clarified.

__ Clarity__: The paper is generally clear, but some more formal details are needed. For instance, it is not very clear how exactly the policy gradient is computed for each policy. There should be an equation detailing this. Also, it is not clear what the maximum time step is for the cooperative navigation domain, which makes the results shown in Fig. 3b hard to interpret. If the default horizon of 25 time steps is the only terminal condition, an episodic return of -5 (the mean performance) means that the average distance between each landmark and its nearest agent over the entire episode is 5/25/3≈0.067. This would mean that each agent is initialized very close to a corresponding landmark. These details should be clarified.
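The back-of-the-envelope calculation above can be reproduced as follows. This assumes, as the review's calculation does, a 25-step horizon, three landmarks, and a per-step reward equal to the negative sum of landmark-to-nearest-agent distances; none of these are confirmed by the paper.

```python
episodic_return = -5.0   # mean return read from Fig. 3b, as quoted in the review
horizon = 25             # assumed default episode length in time steps
num_landmarks = 3        # assumed number of landmarks in cooperative navigation

# Average per-step, per-landmark distance implied by the episodic return.
avg_distance = abs(episodic_return) / horizon / num_landmarks
print(round(avg_distance, 3))  # 0.067
```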

__ Relation to Prior Work__: The architecture of LICA borrows some ideas from MADDPG and QMIX, and this could be discussed in more detail.

__ Reproducibility__: No

__ Additional Feedback__: ******* After discussion and author feedback *******
The author feedback was appreciated as they clarified some of the issues. More detailed responses are below.
Q1.
A clear definition of the Q-value function being learned is essential in any proposed RL algorithm. Training an action-probability (AP) conditioned Q-value function using only data from the special case of APs with probability 1 means the Q-value function can only provide a good estimate for deterministic policies. The proposed algorithm, however, uses this Q-value function to compute the objective for optimizing the policies while conditioning on the APs output by stochastic policies, which is problematic. Also, it is not clear exactly which literature the authors refer to in their response, and it would be better to include the corresponding titles.
Q2.
This argument is not convincing. Please clarify in the paper.
Q3.
OK, but it seems unlikely that k=1 would work best for the others. Please clarify.
Q4.
Line 296 in the paper doesn’t provide a good clarification. Keeping lambda the same and only modifying the network architecture is a more solid comparison. Regardless, please clarify in the paper.
Q5.
Please refer to the number of runs in other papers I mentioned in my review. More runs are certainly needed to make the results more convincing.
Q6.
This is good.
Q7.
Please formally justify your implementation choice.

__ Summary and Contributions__: The article presents a new MARL algorithm for cooperative joint action in the family of centralised training and decentralised execution. The algorithm produces a Q-network for the joint action that mixes the environment state and the individual policy logits in a better way than previous algorithms.
In addition, they introduce a novel entropy cost regularisation that they argue improves exploration throughout training.

__ Strengths__: The article is very well presented and contextualised. The experiments are compelling, and the narrative flow is of high caliber. The interpretation of the algorithm as credit assignment is quite interesting, and, while I'm not 100% convinced that this is exactly what is going on with this algorithm, I think it is arguable that the authors have a valid formulation.

__ Weaknesses__: At its heart, this article is a slightly modified entropy cost term, and a slightly transformed joint-action critic. The authors talk about possible ready extensions to continuous domains, but no concrete evidence is provided.

__ Correctness__: The methodology is of very high quality. The experiments are well motivated, and sufficient to illustrate the strengths of the proposed algorithm. The level of difficulty chosen is commendable, and the comparison against many other SoTA algorithms is very good to see.

__ Clarity__: This is definitely one of the strongest points of the article, where the contextualised choices for the new components are presented clearly and at the correct level of detail. The results are discussed adequately in the text, and the intuitive interpretations accompanying them are topical, without edging on wild speculation.

__ Relation to Prior Work__: The work is very well contextualised.

__ Reproducibility__: Yes

__ Additional Feedback__: I particularly appreciate the use of open source implementations throughout. Thanks!

__ Summary and Contributions__: The authors designed a new critic structure for implicit multi-agent credit assignment. Compared with the vanilla critic of MADDPG, the Mixing Critic in LICA decouples the gradients of actions and state and provides more state information to the policy gradients. To maintain consistent exploration, they also proposed adaptive entropy regularization, which dynamically rescales the entropy term by dividing by a measure of policy stochasticity. The authors compared LICA against other baselines in MPE, SMAC, and Traffic Junction, and verified the effects of the proposed components by ablation studies.

__ Strengths__: The paper is generally clear and well-structured. I strongly agree with the motivation that credit assignment may not require an explicit formulation. Explicit credit assignment is hard to compute and would be unrealistic in complex scenarios. The adaptive entropy regularization is well-motivated, simple, and practical, which allows easier tuning and balances exploration and exploitation during training. The empirical results on various tasks and comparisons to baselines are well done. The visualizations of the entropy term show that adaptive entropy regularization encourages consistent exploration.

__ Weaknesses__: My main concerns are the novelty and benefits of the Mixing Critic. I think the main difference between LICA and the single-critic variant of MADDPG is that LICA formulates the critic as a hypernetwork. But what you analyzed in the Discussion (Page 4), that the policy gradients are decoupled from the state update and carry the state information, does not necessarily lead to better credit assignment. The policy $\theta$ is updated by $\frac{\partial Q(s,a)}{\partial a}\cdot \frac{\partial a}{\partial \theta}$, where $\frac{\partial a}{\partial \theta}$ is unrelated to the critic. The correctness of $\frac{\partial Q(s,a)}{\partial a}$ is determined by how accurate the learned function $Q(s,a)$ is. If the MADDPG critic learns a good approximation of $Q(s,a)$, it could also offer the right direction for the action vector, whether or not that direction is decoupled from the state update. We only need the gradients with respect to the action at the given state, without actually updating the state. Moreover, the gradients provided by the MADDPG critic also contain the state information, since the state is necessary to compute $Q(s,a)$ and its gradients. For example, for $y = Activation(wx+b)$, $b$ will influence $\frac{\partial y}{\partial x}$ through its effect on the activation (relu, tanh). Points (3) and (4) in the Discussion are already achieved by MADDPG and so cannot be seen as contributions of LICA.
In the ablation experiments for the Mixing Critic, LICA outperforms the single-critic variant of MADDPG. However, more explanation is needed to support the conclusion that decoupled gradients and fused state information do bring better credit assignment. Is it possible that the Mixing Critic just learns a better approximation of $Q(s,a)$? Moreover, in practice, I find that concatenating state and actions at the input leads to poor performance. Concatenating the representations of state and actions after an MLP might improve the MADDPG critic.
The adaptive entropy regularization can adjust the level of exploration. However, since this entropy term becomes large once the policies become deterministic, which would make the policies stochastic again, how can LICA be guaranteed to converge to stable policies?
In the experiments on the capacity for credit assignment, I do not think COMA is a good baseline since it is an on-policy method, whereas LICA and MADDPG are off-policy. I suggest comparing LICA with MAAC, an off-policy method equipped with a credit assignment mechanism similar to COMA's.
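The $y = Activation(wx+b)$ example can be checked numerically. The sketch below uses tanh and treats $b$ as a stand-in for the state-dependent contribution to the pre-activation; that interpretation, the function name, and the chosen values are assumptions for illustration.

```python
import math

def dy_dx(x, w, b):
    """Derivative of y = tanh(w*x + b) with respect to x."""
    return w * (1.0 - math.tanh(w * x + b) ** 2)

# Same input x and weight w, different bias b (standing in for the state term).
g_small_bias = dy_dx(x=0.5, w=1.0, b=0.0)
g_large_bias = dy_dx(x=0.5, w=1.0, b=3.0)
```

With the larger bias the unit saturates and the gradient with respect to `x` shrinks considerably, so a term that enters only additively still shapes the gradient passed back to the action input.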

__ Correctness__: Basically correct.

__ Clarity__: Yes.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper addresses the issue of credit assignment in a multi-agent reinforcement learning setting. The paper presents an implicit technique that addresses the credit assignment problem in fully cooperative settings. The basic idea (for which the paper provides some empirical evidence) is that an explicit formulation for credit assignment may not be required as long as the (centralized) critic is designed to fuse policy gradients through a clever association with agent policies. To prevent premature convergence, a technique called adaptive entropy regularization is presented, where magnitudes of the policy gradients from the entropy term are dynamically re-scaled to sustain consistent levels of exploration throughout training. Results are shown on particle and StarCraft II environments.

__ Strengths__: A key strength of this work is the focus on an important problem: credit assignment in a multi-agent reinforcement learning setting and an implicit strategy to deal with it. If I understand correctly, the algorithm does not require any communication between agents once training is complete. This is a significant strength.

__ Weaknesses__: The key limitation of this work is that the results, while promising, need further empirical evaluation before firm conclusions can be drawn. The results in the particle environment are not much of an improvement over past work, but I accept the authors' claim that this is because the environment is not sufficiently challenging. In the StarCraft environment, I would have liked to see further training iterations and more complex settings. I accept that in the results presented there is clear evidence that LICA converges faster (except in one setting), but I am not convinced that this will carry over to more complex settings where success is an uneven mix of 'individual performance' and 'cooperation'.

__ Correctness__: The claims and method appear to be correct, and the empirical methodology is a good first step in the right direction. At the risk of being repetitive, I will restate what I think is the key limitation of this work: the idea behind the paper is intriguing, but the results are not convincing. Further experiments in more challenging settings are likely needed to show that this work has impact. Right now, it reads more like a report where the initial results are promising.

__ Clarity__: The paper is well written and easy to follow.

__ Relation to Prior Work__: The paper is well-situated in the literature, and the relation to prior work is clear. The authors also make it clear how their work differs from previous contributions.

__ Reproducibility__: Yes

__ Additional Feedback__: Thanks for the additional experiments on more complex environments. The results are a bit hard to make out (the figures in the rebuttal are tiny) but just about legible under magnification. Also, the comment about the results on MMM2 is appreciated.