|
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors present pruning algorithms to improve the speed of CFR and CFR+ and show empirically the performance in the domain of Leduc Hold'em.
While the paper is well written and the algorithm presented by the authors seems to achieve good results, I am unsure whether their claims about the drawback of partial pruning [Lanctot et al. 2009] are correct or not. Because this part is associated with the evaluation of originality and significance of the paper, clarifying this point is crucial to verify whether the right baseline of CFR/CFR+ has been used for performance evaluation or not.
In [Lanctot et al. 2009], they describe, "With vanilla CFR we used to an implementational trick called pruning to dramatically reduce the work done per iteration. When updating one player's regrets, if the other player has no probability of reaching the current history, the entire subtree at that history can be pruned for the current iteration, with no effect on the resulting computation." (see page 7).
In contrast, the authors claim that partial pruning only a branch to be avoided if both players' action sequences leading to that point have zero probability (for example, page 1).
I do not understand whether these descriptions indicate the same thing or not. The description in [Lanctot et al. 2009] sounds to me that partial pruning is performed if one player's action leads to a probability of zero.
Of course, I might miss important points. Therefore, to assess their contribution accurately, I would like to wait for authors' responses regarding this matter in the rebuttal phase.
Other comments
Line 104 in page 2. Full stop is missing in the sentence starting with "If a series of strategies are played over ..."
Lines 130-137 in page 3. \pi_i^\delta(h,z) and \pi_i^\delta(h \dot a, z) should be correct in equations (2) and (3), respectively.
Lines 146-147 in page 3. \sum_{t \in T} (u_i(....) - u_i(...)) should be correct in equation (6) (that is, we need parentheses there).
Q2: Please summarize your review in 1-2 sentences
Although the paper has important contributions, I have a difficulty in assessing whether their claims about the drawback of partial pruning [Lanctot et al. 2009] are correct or not.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes new pruning strategies for Counterfactual Regret Minimization (CFR). CFR is an iterative algorithm for solving large two-player zero-sum games. The proposed pruning technique avoids traversing branches of the game tree if at least one player has zero probability of reaching those paths. This can dramatically increase the convergence speed of CFR.
The paper is mostly well-written, although it is hard to follow some parts. Overall, this is a nice contribution and the results are encouraging.
Q2: Please summarize your review in 1-2 sentences
The paper is mostly well-written, although it is hard to follow some parts. Overall, this is a nice contribution and the results are encouraging.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper describes a new way to prune parts of the game trees in counterfactual regret minimization. This is achieved by properties of regret-matching and by mixing best responses with CFR inspired by CFR-BR. The end result is by doing some extra book-keeping, much of the CFR tree pass can be omitted and then retroactively adapted at a later point when relevant.
Regret-based pruning saves significant computation in terms of nodes touched per iteration and the savings seems to grow as a function of the game. The pruning technique is tied to CFR, but the authors also show improvements in CFR+.
I have some questions to help clarify some points:
Best response strategies normally prescribe an action (or distribution) for each one of a player's information sets, so the partial best responses in Section 3 are a not clear.
Q1. Are these best responses to a well-defined subgame of Player 2? If so, how is the subgame defined?
Q2. Assuming "no" to Q1, then the best responses under the zero-reach information sets are in fact just subsets of a full best response strategy. Can you explain why it's possible to directly re-use the CFR-BR theory in this case, where the strategy used from T_0 to T_0 + m is a combined (partial CFR strategy) and (partial best response)? This seems to assume that the games are decomposable.
Q3. Can you give some kind of indication as to the raw computation time trade-off? How much overhead does the best response computation take? In the graphs, does "nodes touched" also somehow include the extra time taken for the best response calculations?
---- Added after the review + discussion:
The most efficient implementation of partial pruning for CFR can be found in code pointed to by "Limit Texas Hold'em is Solved" paper, reference [1]. Using this implementation, each iteration updates regrets for player i and updates the average strategy for player 3-i, alternating the players. This allows partial pruning when just one of the reaches (pi_{-i}) = 0. If accepted for publication, the authors should rephrase or soften their claim in the intro that CFR requires both reaches to 0 to apply partial pruning because, as admitted by the authors in their response, the validity of this claim is implementation-dependent. Specifically, the claim is not true for this most efficient implementation of Vanilla CFR, and thus this claim as written could mislead readers.
Also, while it is not entirely clear, the benefit of RBP could be significantly affected if the authors had used this most efficient version of Vanilla CFR to compare to. Any future works/publication should either use this most efficient implementation as a baseline for comparison, or compare against both implementations, since the motivational arguments used in the intro do not hold for this implementation.
Q2: Please summarize your review in 1-2 sentences
The paper introduces a new pruning optimization, tied to CFR but also applicable in CFR+. The results show significant gains that seem to get larger as the game grows, but some points need to be clarified in the response.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary: this paper proposes an evolution of the Counterfactual Regret Minimization algorithm by introducing a better version of pruning.
Quality/Clarity : the contribution of the paper is clearly identified and presented.
Originality: the solution proposed in this paper is new.
Significance: the paper clearly proposes an interesting evolution to a classic problem to the field.
Additional comment: the present reviewer did not go through the proofs.
Q2: Please summarize your review in 1-2 sentences
A strong paper related to the design of efficient algorithm for playing zero-sum imperfect-information games.
Submitted by Assigned_Reviewer_5
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper introduces a temporary pruning criterion, called regret based pruning (RBP), in context of counterfactual regret minimization (CFR) in extensive form game trees. The idea is to avoid expensive CFR computation (and replace by cheaper one-time best response computation) in certain paths in the tree if some player's contribution to the likelihood of that path is 0. This reduces the best responder's regret, without raising the other's regret. The authors also specify the criterion which stipulates how long this temporary pruning is safe, i.e., preserves CFR's convergence guarantee. Experiments in two versions of Leduc Hold'em shows order of magnitude speedup for RBP, and bigger impact in the larger version.
The paper employs an interesting combination of ideas to enhance speed, and the work is technically sound. The evaluation of savings is a combination of where the reachability criterion is met, the value of m, and the best response calculation. So it might be useful to study each component in isolation (if that is even possible) to see the contributions of the individual components. I was wondering whether perhaps the reachability criterion was being met very often, with lesser contributions from the other two factors? Anyway, the overall saving is impressive, which makes the work significant to me.
Q2: Please summarize your review in 1-2 sentences
Interesting ideas based on astute observations about the behavior of CFR, significant result, strong paper.
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 5000 characters. Note
however, that reviewers and area chairs are busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
Reviewer 1: Thank you for your comments. There are a
number of implementations of vanilla CFR, and the discrepancy between the
description of partial pruning in [Lanctot et al. 2009] and the
description of partial pruning (i.e., the prior benchmark algorithm) in
our paper is due to a difference in the implementation used, though the
end result is the same. In other words, the way we described partial
pruning, and then our improvement over it, is correct in the paper as is.
In the next two paragraphs we explain this point in more detail.
(We also verified by email with Marc Lanctot that our understanding of his
implementation, described below, is correct.)
In [Lanctot et al.
2009], the authors implement vanilla CFR in a way where the game tree is
traversed (and regret is updated) separately for each player on every
iteration. That results in twice as many nodes being touched (because the
game tree is traversed twice each iteration) but half as much work is done
at each node. That implementation is easier to code but does not
significantly differ in performance compared to the partial pruning
benchmark implementation used in our paper.
In the [Lanctot et al.
2009] implementation, when the opponent reaches a history with zero
probability, regret for the player will not be updated for any
descendants. Thus, the rest of the branch can be pruned *only for that
player's traversal*. When the same history is reached during the other
player's traversal (in the same iteration), the branch is not pruned.
Thus, for a branch to be pruned for the entirety of an iteration, partial
pruning does indeed require both players to reach the history with zero
probability. However, if RBP is used, then the branch could be pruned for
both players' traversals if just one player has negative regret leading to
it. Thus, RBP would still lead to the substantial improvement over partial
pruning in their implementation.
You are right about the errors on
line 104 and 146-147. Thank you! We will fix these in the camera-ready
version. Regarding the equations on lines 130-137, the definitions given
in our paper are correct. The definitions are derived from (5) and (6) in
[Zinkevich et al. NIPS-2007] and are equivalent (though not identical) to
the definition for counterfactual value in [Burch et al.
AAAI-2014].
Reviewer 2: Thank you for your comments. In regards
to your Q1 and Q2, we do not define a particular subgame, and the
strategies played in the zero-reach information sets are indeed just part
of a best response strategy. However, our goal is simply to play a
strategy that results in zero regret for any information set (and its
descendants) that is reached by the player with zero probability. Playing
a best response strategy in those information sets and their descendants
ensures they have zero regret (regardless of what is played in other
information sets belonging to the player), and does not affect the bound
on regret for their ancestors or for the opponent.
In regards to
Q3, calculating a best response is slightly cheaper than an iteration of
CFR. The process explores the entire game tree. When a terminal node is
reached, its value is passed up. When going back up the game tree, one
sets the strategy to the action which returned the highest expected value,
and passes this value up. (There are also domain-specific implementations
for faster best response calculations --- e.g., [Johanson et al.
IJCAI-2011] --- that we do not use.) In the graph, we do include the cost
of the best response in the "nodes touched" total. There are fixed
per-iteration costs that are not included in the total (e.g., dealing
cards). These fixed costs are substantial in a tiny game like Leduc.
However, they are negligible in large games where an iteration can take
minutes or hours. There is also some small overhead for implementing RBP
(specifically, checking if the pruning condition has been met, or if
pruning must cease). However, this need not be done every
iteration.
Reviewer 3: Thank you for your comments. It would be
easy to add a breakdown of how many nodes are being skipped and for how
long, as well as the performance of the best response in isolation. We
will include this in an appendix in the camera-ready
version.
Reviewer 4: Thank you for your comments. The main
performance metric for the experiments is how close the algorithm's output
strategy is to a Nash equilibrium, compared to "nodes touched" which is an
implementation-independent proxy for time. This is the typical measurement
used to evaluate algorithms in this area (see, for example, [Kroer &
Sandholm EC-15], [Lanctot et al. NIPS-09]).
Reviewer 6: Thank
you for your comments. In the camera-ready, we can clarify some points to
make the material easier to follow for those outside the field. To that
end, it would be even more helpful if you would point out some points
where you found the paper hard to follow.
Reviewer 7: Thank you for
your comments! |
|