NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:6646
Title:Multiagent Evaluation under Incomplete Information

Reviewer 1

This paper investigates the evaluation of multiagent strategies in the incomplete information and general-sum setting. The primary algorithm to be analyzed is alpha-rank, which is a ranking algorithm based on the stationary distribution of a markov chain with states defined over all strategy profiles. Since payoff tables M are typically estimated empirically, the authors provide sample complexity bounds on the number of (uniformly distributed) observations of each strategy profile to be observed for the resultant stationary distribution to be close to the true stationary distribution. The authors propose an adaptive sampling strategy based on confidence intervals over each pair of strategy profiles and analyze its sample complexity. The paper also shows how to propagate uncertainties in M to uncertainty in the ranking weights that alpha-rank yield. Lastly, the authors empirically evaluate their sample complexity bounds (in both adaptive and non-adaptive setting), and the performance of ResponseGraphUCB on 3 different problem domains. Clarity: The paper is generally well written and relatively easy to follow. The problem is well motivated and most key ideas are sufficiently well explained in the main text, with the bulk of proofs being deferred to the appendix. I do have a few minor criticisms. (1) Figure 6.2 should be presented better. Currently, there are 16 plots per graph and it is extremely difficult to make sense of each plot, not to mention confidence bounds. Since not all of these 16 plots are considered at any time, consider reporting only results which yield interesting insights in the main paper, while leaving the rest of the plots to the appendix. (2) Section 5 may be too heavy in content for readers unfamiliar with CSSPs, Markov chain theory, or “standard control problems”. In my opinion, it may be more effective to provide a high level idea of the technique used and leave precise mathematical details to the appendix. Quality: Experiments are conducted on reasonable domains and are fairly extensive. I did not verify all the derivations in the appendix, but the ideas presented in section 4 appear sound. Significance: This paper tackles the problem of estimating payoff values (and ranking weights) from empirical simulations and provides sample complexity bounds. This is a good theoretical contribution to an area which is rapidly gaining popularity, and should be of interest to researchers in many communities. Other comments: (3) Is Theorem 4.1 agnostic to the chosen sampling scheme? Empirically, how do the different sampling strategies compare with each other? (4) With regard to Figure 6.3, the authors mention that errors are positively correlated with error tolerances. This appears to be the case for the poker meta-game, and delta has virtually no visible effect on the soccer meta-game. (5) Again with regard to Figure 6.3: In both meta-games, payoff errors decreased gracefully with increasing samples, which is expected. However, ranking errors were significantly more variable in poker as compared to soccer. Could the authors provide some insight as to why this was so? (6) For Bernoulli games, it was stated that payoffs were constrained such that they would not be too close to 0.5. Is this related to theorems 3.1, 3.2, and 4.2? If this is so, would it not be a good opportunity to empirically evaluate the tightness of bounds in 4.2? ============================= Post-rebuttal. The authors have addressed my concerns satisfactorily. I maintain that the paper is a good submission given some minor edits and the addition of some of the responses in the authors' rebuttal.

Reviewer 2

Originality: - The paper is a novel combination of two known techniques - AlphaRank and UCB. It also casually adds in approaches from the PageRank literature, albeit they are not the central contributions. - The most important parts are formulating the usage of UCB in computing the accurate rankings, running rigorous experiments across a variety of different schemes, and giving algorithmic bounds for the approach. It does each of these well, with the supplementary materials providing well-written proofs for each of the sections. Quality: - This is clearly a complete work with the claims presented well and sufficiently addressed. - My main question comes in this section and accompanies what I think is the only fault of the paper; It doesn't address when or if UCB is actually the best approach to this problem. Yes, it's a good idea to use it because the formulation as a bandit problem makes a lot of sense. But is it actually the ideal in all circumstances? And when does that break down? For example, I could see collaborative filtering approaches working well in some settings, especially ones that are able to actively address which strategy pairing to query next. In what scenarios will the UCB approach be better than that one? Clarity: - The paper is well written, albeit the supplementary material is absolutely necessary to make sense of it properly as key parts are relegated to that area. - I think the paper as written is mostly reproducible. I didn't understand how the reduction to CSSP works, likely because I am not familiar with that literature, and appreciated the further explanation in the appendix. Significance: - These results are important. They build upon a recent paper that was also important as it reinvigorated a field and pushed it into the open. This paper then takes that one and shores up problems it had wrt noise in the evaluations. - I could see practitioners not using this as it adds a lot of complexity to alpha-rank, however it is definitely a line of work that will be explored further and this was/is one of the avenues of exploration.

Reviewer 3

This paper studies the practical algorithm that computes alpha-ranking for a pool of agents in MARL. The setting is broad, covering multiplayer general-sum game, while Elo-score only intends for two-player and cannot deal with intransitive agents. The proposed algorithm is based on graph bandit, which is useful when it is intractable to construct a full payoff matrix based on empirical game outcomes. The proposed algorithm would lead to an efficient way to leverage the game outcomes and assign the computation budget for what games should be carried out. Finally, it is beyond a pure evaluation method, in the sense that any MARL based on player-pool-sampling can potentially benefit from the proposed algorithm.