NeurIPS 2020

Neurosymbolic Transformers for Multi-Agent Communication

Review 1

Summary and Contributions: This paper studies the problem of learning multi-agent communication structures. It focuses specifically on reducing communication overhead by constraining the maximum degree of each agent (the number of other agents it can communicate with) while keeping overall performance similar. The authors train a fully connected, transformer-based communication policy and refine it using a simple rule. After the communication graph is sparsified, the transformer-based policy is retrained to compute actions. Overall, the paper makes the following contributions: - Hardening of the communication graph to reduce communication cost - Promising empirical results in a controlled multi-agent setting
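To make the degree constraint described in this summary concrete, the following is a minimal sketch, not the paper's method. It assumes the trained policy exposes a square matrix of soft attention weights (row i holds the weight agent i places on each other agent) and hardens it by keeping the top-k entries per row. The top-k rule, the function name `harden_attention`, and the parameter `max_degree` are illustrative assumptions; the paper selects edges with its own rules.

```python
import numpy as np


def harden_attention(soft_attention: np.ndarray, max_degree: int) -> np.ndarray:
    """Return a boolean graph where each agent keeps at most max_degree edges.

    Assumption: top-k selection stands in for the paper's rule-based choice.
    """
    n = soft_attention.shape[0]
    scores = soft_attention.astype(float).copy()
    np.fill_diagonal(scores, -np.inf)  # an agent never selects itself
    k = min(max_degree, n - 1)
    hard = np.zeros((n, n), dtype=bool)
    if k <= 0:
        return hard
    # Column indices of the k largest scores in each row (stable for ties).
    top = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    rows = np.repeat(np.arange(n), k)
    hard[rows, top.ravel()] = True
    return hard


# Hypothetical soft attention for four agents.
soft = np.array([
    [0.0, 0.7, 0.2, 0.1],
    [0.5, 0.0, 0.3, 0.2],
    [0.1, 0.1, 0.0, 0.8],
    [0.6, 0.3, 0.1, 0.0],
])
graph = harden_attention(soft, max_degree=1)
```

Under this sketch, retraining would then proceed with `graph` fixed as the communication mask, which matches the retraining step the summary describes only at a high level.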

Strengths: - Hardening the transformer policy outputs to sparsify the communication graph. - Retraining the transformer policy with the hard communication graph to adapt the policy to the new graph.

Weaknesses: - The paper focuses on sampling from the output distribution of the transformer policy to reduce communication cost, but this is itself an RL problem in which the choice of which agent to communicate with is the action. Instead, the paper picks a communication graph using an ad hoc rule and retrains the transformer with the hard graph, because the transformer is not trained with its own predictions. This is not discussed in the paper. - The manually picked rules are very ad hoc and specific to the problem. I think calling this a program is also misleading, as there are only two rules (both of which are used with very simple positivity and argmax constraints w.r.t. a metric).

Correctness: The claims and empirical methodology are somewhat correct. I am skeptical of the problem formulation and the program synthesis terminology.

Clarity: The paper is well written, with some grammatical mistakes. See the additional feedback.

Relation to Prior Work: It is clearly discussed.

Reproducibility: Yes

Additional Feedback: In line 265, the paper refers to the loss as the negative reward, but for the unlabeled goals task the reward is always positive (sum of max of probabilities). Please clarify Figure 2 accordingly. In line 233, element-of is repeated. In Figure 4 (c), "Visualization" is misspelled. In line 273, that --> than.

Review 2

Summary and Contributions: Presents and evaluates an approach for inferring communication structures in multiagent systems. Contributes to the emerging literature on learning about communication and cooperation.

Strengths: Well written, solid experimental results, interesting (and important) problem.

Weaknesses: None obvious to me.

Correctness: Everything seems in order as far as I can see.

Clarity: Pretty good.

Relation to Prior Work: Yes, insofar as I can see.

Reproducibility: Yes

Additional Feedback: Overall solid piece of work

Review 3

Summary and Contributions: The paper introduces a new algorithm for routing communication in collaborative MARL. The key idea is first to use a transformer NN to infer the communication pattern and then discretise it using MCMC. The resulting communication policies are compact - less messaging - and perform well as shown in experiments.

Strengths: The paper is clear and makes a good use of examples. It addresses a relevant problem and the proposed method is new and interesting. The work is well positioned within the literature.

Weaknesses: 1. Although the proposed method works well on the example domain, it is not shown that the method would generalise beyond it (neither theoretically nor empirically). 2. Using just one domain for the experiments limits how convincing the experimental results are. It would be instructive to see how the method works in domains where: a) full communication between all agents is required (does it still work and discover that?) b) communication is noisy (e.g. sometimes the message fails to arrive) c) the task is simply a different one that is not navigation (capture the flag, for example). 3. Novelty is somewhat limited. It is novel and interesting to use transformer weights to build a skeleton for a combinatorial optimisation, but it isn't clear that this is generally applicable to problems other than collaborative navigation.
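As an illustration of the noisy-communication setting suggested in point 2(b), one simple way to simulate messages that fail to arrive is to drop each edge of the hard communication graph independently at every step. This is a sketch under stated assumptions, not an experiment from the paper: the function `drop_messages`, the parameter `drop_prob`, and the independent-failure model are all hypothetical.

```python
import numpy as np


def drop_messages(hard_graph: np.ndarray, drop_prob: float,
                  rng: np.random.Generator) -> np.ndarray:
    """Return the edges that actually deliver a message this step.

    Assumption: each message is lost independently with probability drop_prob.
    """
    graph = np.asarray(hard_graph, dtype=bool)
    # rng.random draws uniformly from [0, 1); an edge survives if its draw
    # is at least drop_prob.
    delivered = rng.random(graph.shape) >= drop_prob
    return graph & delivered


# Hypothetical hard graph for three agents (row i lists who agent i hears from).
hard_graph = np.array([
    [False, True, False],
    [True, False, True],
    [False, True, False],
])
rng = np.random.default_rng(0)
noisy_graph = drop_messages(hard_graph, drop_prob=0.2, rng=rng)
```

Evaluating the retrained policy with `noisy_graph` in place of `hard_graph` at each step would give one measure of robustness to lost messages.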

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: The authors introduce a novel method for generating an efficient communication graph between different agents, based on first training a Transformer architecture to generate actions and messages for each agent, second conducting a combinatorial search over communication graphs, guided by the Transformer policy to produce hard attention weights for messages, and finally retraining the Transformer with these hard attention weights.

Strengths: - So far as I am aware, the methods introduced by this paper are novel. The training paradigm is particularly interesting, since the Transformer-guided program synthesis objective has potentially wide applicability. - The algorithm is clearly and concisely explained, and would be reproducible from the description, in my opinion. - Anecdotally, the programs learned by the system are reasonable in the environments considered.

Weaknesses: - The method relies on each agent having observations of other agents (o^{i,j}). This seems like a very strong assumption, given that the motivation for this work was to lower the communication bandwidth necessary. The authors should comment on how this requirement could be weakened to allow scaling to more complex environments. - The results are not presented quite clearly enough. The "loss" in Figure 2 is not clearly defined, and it would be much clearer to use "reward" as the y-axis in these Figures. The overlapping error bars in many of the results call into question the significance of the findings. Perhaps the authors can comment on these, or strengthen their method to reduce the performance variance? - There are several unclear points in the text. In line 154, what does \pi^C look like mathematically? In line 207, what is the evidence that the benefits outweigh the costs (I believe I can read this off from the ablation study, but it would be nice to point this out explicitly)? In line 220, why is Gaussian noise added?

Correctness: empty

Clarity: empty

Relation to Prior Work: empty

Reproducibility: Yes

Additional Feedback: Response to authors: I am satisfied that your responses to my comments address the concerns. Please do include these clarifications in any future version of the paper. I have therefore raised my score by 1 point.