Paper ID: | 4160 |
---|---|
Title: | MAVEN: Multi-Agent Variational Exploration |

1. The proposed MAVEN method extends the idea of deep exploration via Bootstrapped DQN to the QMIX algorithm, aiming to provide committed exploration. One question is whether this new exploration solves the sub-optimality issue of QMIX. Is the sub-optimality of value function decomposition methods (e.g., QMIX and VDN) intrinsic to the fully decomposed representation, or is it due to the exploration?
2. Although MAVEN is shown to significantly outperform QMIX in the toy game problem, the experimental results in the SMAC domain do not seem to demonstrate the expected advantage of MAVEN over QMIX. The authors mention in the paper that the SMAC domain may not be ideal for testing exploration strategies, but it would be highly desirable for the authors to find an interesting, challenging domain for evaluating the significance of MAVEN, as was done for Bootstrapped DQN with the Atari games. In addition, it is also important to analyze the weaknesses of MAVEN.
3. Did the authors perform an ablation test evaluating MAVEN against QMIX with each agent using Bootstrapped DQN?
4. How stable is the learning of MAVEN under the alternating training of the value function and the hierarchical control policy?
5. As shown in Figure 1, is the epsilon-greedy exploration still used in MAVEN?

Some typos: Line 113: remove “a”. Line 116: missing a “)”.

UPDATE: After reading the authors' rebuttal, I have chosen to increase my score from 5 to 6 because the authors provide stronger results and partially address my Question 2.
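On Question 1, the representational side can be checked in isolation from exploration. The sketch below assumes a hypothetical 2x2 one-step payoff matrix (not taken from the paper) and fits the best additive, VDN-style decomposition Q(u1, u2) = a[u1] + b[u2] by least squares over all joint actions, using the closed-form solution for a complete grid (row mean + column mean - grand mean). All names are illustrative. It only covers the additive case; QMIX's monotonic mixing is a richer class and is not modeled here.

```python
def additive_fit(payoff):
    """Best least-squares additive fit payoff[u1][u2] ~ a[u1] + b[u2].

    Assumption: every joint action is observed exactly once, so the
    least-squares fitted value is row mean + column mean - grand mean.
    Returns the fitted matrix and the sum of squared errors.
    """
    n1, n2 = len(payoff), len(payoff[0])
    grand = sum(sum(row) for row in payoff) / (n1 * n2)
    row_mean = [sum(row) / n2 for row in payoff]
    col_mean = [sum(payoff[i][j] for i in range(n1)) / n1 for j in range(n2)]
    fitted = [[row_mean[i] + col_mean[j] - grand for j in range(n2)]
              for i in range(n1)]
    sse = sum((payoff[i][j] - fitted[i][j]) ** 2
              for i in range(n1) for j in range(n2))
    return fitted, sse


# Hypothetical payoffs: agents are rewarded only when their actions match.
non_monotonic = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical payoffs that are exactly additive: a = [0, 2], b = [0, 1].
additive = [[0.0, 1.0], [2.0, 3.0]]

fit_nm, err_nm = additive_fit(non_monotonic)
fit_ad, err_ad = additive_fit(additive)

if __name__ == "__main__":
    print("coordination game fit:", fit_nm, "squared error:", err_nm)
    print("additive game fit:", fit_ad, "squared error:", err_ad)
```

Because the fit uses every joint action, any remaining error in the first case cannot be attributed to exploration. Whether the same holds for the tasks in the paper is a separate empirical question.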

The authors point out a shortcoming in a popular MARL algorithm and present a well-motivated solution. This much is fine. However, I think the uniqueness of both the problem and the solution was overemphasized. First, as far as I can tell, the inefficiency of exploration for QMIX boils down to the inefficiency of epsilon-greedy, a well-known shortcoming that plenty of people study in the single-agent Q-learning domain. This could of course be exacerbated in the multi-agent setting (since the epsilon noise is added independently per agent), but the problem itself isn't new. So forgive me if I misunderstood something, but what is the point of section 3 beyond saying this? For me, this whole section could have been replaced with: "Epsilon-greedy is inefficient [citations]. The problem can be even worse in the multi-agent setting. [brief elaboration] Here we present a committed exploration approach for multi-agent Q-learning." Second, the solution presented is essentially exactly DIAYN [Eysenbach et al., 2018] (which itself was a generalization of variational intrinsic control [Gregor et al., 2016]). Sure, DIAYN was applied in a different setting (unsupervised pre-training for RL), but the method of sampling a latent variable and encouraging behavioral trajectories to be as different as possible, so that a discriminator can infer that latent variable (as a variational approximation to optimizing mutual information), comes directly from that paper. They even had the same motivation of encouraging exploration. Given this, I find it unfair that exactly one sentence is devoted to DIAYN and VIC (lines 277-279). I think it would be more appropriate for this work to be highlighted prominently in the abstract, introduction, and section 4 on methodology (note that QMIX itself is highlighted in this way). ("MAVEN=QMIX+DIAYN" is the executive summary I wrote down for myself.)
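To make the first point concrete, here is a minimal sketch of how independent per-agent epsilon noise compounds. It assumes that each agent, with probability `eps`, draws an action uniformly from its `n_actions` choices (so it may still land on the greedy action), and that agents and time steps are independent. The function name and parameters are hypothetical and not taken from the paper.

```python
def p_joint_greedy(eps, n_agents, n_actions, horizon=1):
    """Probability that every agent acts greedily at each of `horizon` steps.

    Assumption: with probability eps an agent samples uniformly over all
    n_actions actions, otherwise it takes its greedy action; agents and
    time steps are independent.
    """
    p_agent = 1.0 - eps + eps / n_actions  # one agent, one step
    return p_agent ** (n_agents * horizon)


if __name__ == "__main__":
    # Same eps and horizon, increasing team size.
    for n in (1, 2, 5, 10):
        print(n, p_joint_greedy(0.1, n, n_actions=5, horizon=20))
```

Under these assumptions the probability of a fully greedy joint trajectory shrinks geometrically with the number of agents, which is the sense in which the single-agent problem is exacerbated rather than new.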
The StarCraft results also seem fine, but not so strong as to make it obvious that committed exploration is a crucial empirical improvement for QMIX: while MAVEN agents learn faster in 3s5z, the final performance looks the same; MAVEN agents seem to have less variability in final win rate on 5m_vs_6m; and QMIX actually seems to have better final performance on 10m_vs_11m. The results in Figures 2 and 4 do, however, suggest that there may be scenarios where the advantage of MAVEN is higher.

Minor comments:
1) line 64 and others: the subscript "qmix" should probably be wrapped in a "\text{}"
2) first equation in section 3: inconsistency between using subscripts and superscripts, i.e. u_i and u^i
3) line 81: perhaps better phrased as: "the *best* action of agent i..."
4) line 86: u_n^i -> u_|U|^i?
5) line 87: I was confused by what "the set of all possible such orderings over the action-values" means. Besides a degeneracy when some of the Q values are identical, isn't there only one valid ordering? Or are you just trying to cover that degeneracy?
6) Definition 1: perhaps add an intuitive explanation, e.g. "Intuitively, a Q-function is non-monotonic if the ordering of best actions for agent i can be affected by the other agents' action choices at that time step."
7) line 110: precise sequences -> precise sequence
8) line 131: for latent space policy -> for the latent space policy; there is also a missing space after the period
9) line 162: should call the variational distribution a "discriminator" as it is introduced, both to help explain what role it is playing, and because this is done in Figure 1 without reference in the main text
10) line 174: sate -> state
11) Figure 2b: unexplained "[20]"s in legend can probably be removed
12) line 237: Fig 11 -> Fig 5
13) line 239+1: I think the ablation experiments were useful and interesting and should at least be summarized briefly in the main text.
14) Figure 5: should mention the number of training steps as one progresses left to right
15) line 289: we propose the use state -> we propose to use state OR we propose the use of state
16) line 290: condition latent distribution -> condition the latent distribution

UPDATE: Thanks for your rebuttal. On my first point above, thanks for clarifying the strengths of your theoretical result; I underappreciated them on the first read-through. On my second point, thanks for clarifying the distinctions between VIC/DIAYN and your approach (though I do think you should include the discussion of the differences in your paper). Also, thanks for sharing the stronger empirical results. For all of these reasons, I've raised my score from a 5 to a 6.

Clarity: While the paper is readable, there is certainly room for improvement in its clarity. For example, the details of the algorithm in Section 4 are not straightforward or easy to follow, especially lines 146-182.

Originality: The paper considers the MARL problem in which each agent has its own local observations, takes a local action, and receives a joint reward, and the goal is to find the optimal action-value function. During training, each agent is allowed to access the action-observations of all agents and the full state. It is shown that VDN and QMIX cannot represent the true optimal action-value function in all cases. In addition, for a fixed episode length $T$, it is proved that increasing the exploration rate decreases the probability of learning an optimal solution. Considering this, it is assumed that the lack of good exploration, coupled with the representational limitations, results in the sub-optimality of QMIX. To address these issues, an algorithm, multi-agent variational exploration (MAVEN), is proposed to overcome the limitation that monotonicity imposes on QMIX's exploration. To this end, a latent variable $z$ is introduced, which is based on a stochastic variable $x \sim p(x)$, where $p$ is a uniform or normal probability distribution. The function $g_\theta(x, s_0)$ returns $z$, and another neural network, called the hyper-net map $g_{\phi}(z, a)$, returns $W_{z,a}$. In parallel, to get the Q-function of each agent, an MLP takes $(o_i^t, a, u^{t-1}_i)$ and passes its output to a GRU, and the output of the GRU is mixed with $W_{z,a}$ to get $Q(\tau_i; z)$. This whole block introduces parameters $\eta$. Then, the mixer network with parameters $\psi$ obtains $Q_{tot}$. A mutual information (MI) objective (what does "objective" mean here?) is also added to the model to encourage more exploration.

Significance: There are several things to like about this paper:
- The problem of safe RL is very important and of great interest to the community.
- The way that exploration is added to the model might be interesting for others to use in the future.

However,
- I found the paper as a whole a little hard to follow, especially on the algorithm side.
- The experiments do not support the claim of the paper about the significance of the exploration.
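Regarding the parenthetical question about the MI objective in the summary above: in this family of methods, "objective" usually refers to a quantity that is maximized during training. The generic variational lower bound on the mutual information between a latent variable $z$ and a behavior $\tau$ is sketched below. The symbols are assumptions for illustration ($p$ is the true posterior, $q$ is a learned variational distribution, often called a discriminator) and may not match the paper's notation or exact formulation.

$$
I(z; \tau) = H(z) - H(z \mid \tau) = H(z) + \mathbb{E}_{z, \tau}\left[\log p(z \mid \tau)\right] \ge H(z) + \mathbb{E}_{z, \tau}\left[\log q(z \mid \tau)\right]
$$

The gap between the two sides is $\mathbb{E}_{\tau}\left[\mathrm{KL}\left(p(\cdot \mid \tau) \,\|\, q(\cdot \mid \tau)\right)\right] \ge 0$, so maximizing the right-hand side with respect to both the policy and $q$ pushes up a lower bound on the mutual information.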