__ Summary and Contributions__: This paper combines A* sampling and direct optimization to construct a novel policy gradient method named direct policy gradient (DirPG). The authors apply direct optimization to policies over discrete actions and use A* sampling to find a near-optimal trajectory w.r.t. the direct objective of the RL problem. To map direct optimization onto the RL problem, the authors introduce two trajectories, $a_{opt}$ and $a_{dir}$, and define the direct policy gradient using these trajectories.
The authors provide detailed empirical ablation studies of their own method. However, for combinatorial bandit problems there are other algorithms such as CUCB (Chen, Wei, Yajun Wang, and Yang Yuan. "Combinatorial multi-armed bandit: General framework and applications." ICML 2013.). Also, many RL algorithms have been applied to grid environments with theoretical performance guarantees, and there are many MCTS-based algorithms for large-scale black-box planning. I think a more thorough intuitive and empirical comparison should be provided.

__ Strengths__: This paper combines A* sampling and direct optimization, and this combination appears to be a promising route to good performance.

__ Weaknesses__: However, the empirical comparison is not sufficient to show the advantage of the proposed algorithm, and the method seems hard to apply to large-scale environments. First, if it is only applicable to tabular cases, a theoretical analysis such as a PAC guarantee or regret bound should be added. In contrast, if it is applicable to large-scale black-box planning, it should be compared with MCTS-based algorithms, because MCTS is the most popular approach for solving large-scale black-box planning problems.

__ Correctness__: The claims are correct and sound, and the empirical methodology also seems correct.

__ Clarity__: This paper is well written.

__ Relation to Prior Work__: The authors compare their work with simple baselines. However, there is no theoretical or empirical comparison with similar works such as MCTS-based algorithms.

__ Reproducibility__: Yes

__ Additional Feedback__: I am confused about whether this work is a black-box planning algorithm or a general reinforcement learning algorithm with an initial state distribution. If it is a kind of black-box planning algorithm, then it should be compared empirically with MCTS-based algorithms.
=============================
Now I realize my misunderstandings and I change my score.

__ Summary and Contributions__: The paper proposes applying direct optimization (a method developed in other ML subfields) to offline policy gradient RL, resulting in a new method called DirPG. It shows how this method can be combined with domain knowledge in the form of a heuristic (an interesting connection to the field of heuristic search). The paper then presents a variation of A* search that is more efficient with environmental interactions for use in DirPG. Next, the paper shows experiments comparing the performance of the proposed algorithm with some existing baselines.

__ Strengths__: The paper connects disparate subfields of AI—heuristic search and reinforcement learning—using the idea of direct optimization. Theoretically the work seems well-grounded and potentially significant and relevant for the NeurIPS community; there has been significant recent interest in the combination of search and RL methods.

__ Weaknesses__: The empirical investigation seems flawed. Comparing DirPG to UCB, REINFORCE, and CEM doesn't seem fair; DirPG uses a simulator but the other methods don't. Clearly a method accessing a simulator should outperform ones that don't. I would prefer to see how DirPG compares to other methods that use simulators, including the ones mentioned in the related work (Vine TRPO?, MCTS variants?).
EDIT: After discussing with the other reviewers, it seems the version of REINFORCE being compared here is customized to use the simulator, which results in lower variance than a conventional REINFORCE algorithm would have. Therefore the comparison is not as unfair as I had thought, and I have increased my score for the paper. However, I feel this should be made clearer in the paper.

__ Correctness__: The theoretical claims and method seem correct to me; however, I am more familiar with RL than direct optimization so I could have missed something.
The empirical methodology seems flawed, as described in the Weaknesses section above.
EDIT: My previous statement no longer applies.

__ Clarity__: The paper is fairly well-written, but at times can be difficult to understand. For example, it was difficult to see how DirPG could outperform REINFORCE in the motivating example; after several re-readings I realized the heuristic information is being used to sample a better trajectory than REINFORCE's uniform random trajectory. It would be better to explicitly state this. It's important for the motivating example to be extremely clear and easy to understand; it gives the reader intuition for how the method works, what situations they would want to use it in, and problems with existing methods. I would recommend going over the motivating example section again and trying to make sure the section communicates each of those messages as clearly and simply as possible.

__ Relation to Prior Work__: The related work section gives a very short description of some previous contributions, but I would like to see it expanded.

__ Reproducibility__: Yes

__ Additional Feedback__: The motivating example could be explained more clearly. How exactly is the heuristic information incorporated into the search for a_dir?
If a simulator is available, one typically wouldn't use a model-free algorithm like REINFORCE. A major benefit of REINFORCE is that it can do a Monte Carlo rollout and have an estimate of the direction to improve the policy without needing a simulator or a model of the environment. Once a simulator is added, it changes the structure of the problem such that different solution methods become available (i.e., MCTS). So why is REINFORCE a good method to compare against? Wouldn't one expect any method using a simulator to outperform one that doesn't? I'm actually surprised that REINFORCE performed as well as it did in the experiments. Same with CEM.
If a heuristic and simulator are available (both typically assumed to be unavailable in RL), why wouldn't one use heuristic search methods? RL and heuristic search are solving slightly different problems. RL generally avoids simulators because they are extremely difficult to create for real-world problems. If a simulator is available, why use RL?
Does the proposed method extend to function approximation? As described, DirPG seems to only apply to finite state, finite action MDPs. This restriction would rule out almost all interesting problems due to the size of their state/action spaces, which reduces the potential impact of the work.
What are the advantages of incorporating domain knowledge via a cost-to-go heuristic instead of existing methods like reward shaping, policy initialization/warm starting, exploration bonuses, etc.? It seems very restrictive, but could also allow methods from the field of heuristic search to be applied, which is an interesting connection. Domain-specific knowledge is also often avoided in RL because it can bias the solutions found in a negative way (for example, AlphaGo Zero removing human domain knowledge led to improved performance over the original AlphaGo). However, using concepts from heuristic search like admissibility seems like a promising way to ensure the domain-specific knowledge doesn't harm the final policy. It would be good to mention this strength in the paper.
It would be much better to show confidence intervals in the experiment plots. Non-overlapping confidence intervals communicate statistically significant differences in performance to the reader.
I didn't see any mention of existing methods that require an oracle/generative model of the environment that gives sample next states given a state and action. This seems closely related to the idea of a simulator, and it might be worth including an overview of this area in the related work.
EDIT: Many of these concerns have been addressed in the author response and discussions with other reviewers. I have increased my score to reflect this.

__ Summary and Contributions__: Edit: I've read the rebuttal and my confusions are clarified. I hope these changes can be made in the final version, and I maintain my score.
----------------------------
This paper proposes using the direct optimization method to compute gradients in the policy gradient setting and studies it from both theoretical and empirical aspects. The proposed DirPG algorithm can take advantage of heuristic search information and potentially has more promising performance in a sparse-reward setup.

__ Strengths__: The paper is well-organized: the connections between sections are clear and easy to follow. It has both necessary theoretical justifications and empirical evaluations.

__ Weaknesses__: The empirical results are not very strong because the examples are closer to toy examples than real-world tasks. However, I think this is acceptable because the main contribution of this work is introducing direct optimization to the RL setting, and it does have merits that PG algorithms lack, such as using information from heuristic search.
Besides, I have some questions as follows:
1. Does DirPG also require the state space to be finite? If the state is continuous, the pdf of the trajectory can be larger than 1; can it still be modeled by Gumbel(0)? If DirPG can only be applied to finite-state cases, I think it would be clearer to point this out.
2. How will the quality of the heuristic affect the algorithm? For example, one crude upper bound U(R) would be (T - t) * max(reward); how much improvement does it yield compared with a perfect heuristic? In other words, will it be challenging to obtain the information that DirPG requires?
Minor points:
1. I didn't get the connection between the gradient formula in l. 66 and eq. (5); in eq. (5) the quantity inside the integral becomes the reward. Could you elaborate more on this?
2. I didn't fully understand the context of the motivating example; if we have a perfect heuristic and know which trajectory has reward m, why do we need to learn a policy?
3. In eq. (14), should the partial derivative be w.r.t. theta instead of theta_i?
4. Should eq. (15) have a partial derivative w.r.t. theta on the LHS?
Besides, I think it would be better to put Proposition 1 in the main text since it's an important theoretical property.

__ Correctness__: Yes

__ Clarity__: Yes

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The submission describes an approach to compute policy gradients in problems with discrete action spaces by solving stochastically perturbed planning problems. The resulting expression for the policy gradient is the expected value of a difference of score functions, where one score (policy log-likelihood gradient) function is evaluated at the perturbed solution, and the other is evaluated at the unperturbed solution.
The optimization problem defining the points at which to evaluate the score functions is equal to the log action sequence likelihood, plus a sample from a Gumbel random variable indexed by the action sequence, plus an optional perturbation equal to the reward function evaluated at the action sequence. Since solving this optimization with zero perturbation yields a sample from the policy, the method has the intuitive interpretation of increasing the likelihood of actions that result in locally greater reward, and lowering the likelihood of actions sampled from the current model.
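In my own notation (a sketch of my reading of the construction, not equations copied verbatim from the paper), this reads roughly as: $a_{\text{opt}} = \arg\max_a \left[\log \pi_\theta(a) + G(a)\right]$ and $a_{\text{dir}} = \arg\max_a \left[\log \pi_\theta(a) + G(a) + \epsilon\, r(a)\right]$, where the $G(a)$ are i.i.d. Gumbel variables and $\epsilon > 0$ scales the reward perturbation, yielding a gradient estimate of the form $\frac{1}{\epsilon}\left(\nabla_\theta \log \pi_\theta(a_{\text{dir}}) - \nabla_\theta \log \pi_\theta(a_{\text{opt}})\right)$, which matches the interpretation above: raise the likelihood of the reward-perturbed solution, lower that of the model sample.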
A complexity arises in that solving the perturbed problem is hard, because each of the combinatorially-many action sequences is associated with a different draw of a Gumbel random variable. To get around this, the "top down sampling" method is applied, which allows the search to proceed in a hierarchical fashion. However, this still does not quite solve the problem of efficiently searching with the reward perturbation term, so the A*-style strategy of ordering the search queue with a heuristic is applied.

__ Strengths__: 1— the method is novel and of interest to the NeurIPS community
The method can be thought of as a convergence of interesting ideas that have recently received significant attention in the ML community. First, the popularity of GANs and VAEs has sparked interest in the possibility of applying the reparameterization trick to discrete, combinatorial problems as well. That has led to increased interest in applications of the Gumbel-max sampling trick for obtaining stochastic gradients with hopefully less variance than REINFORCE. Recent work [20] showed how one can directly take the gradient through the argmax required for the Gumbel-max trick. Finally, this submission extends that basic logic to RL.
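For readers less familiar with it, here is a minimal sketch of the Gumbel-max trick the method builds on (my own illustration, not code from the paper): perturbing log-probabilities with i.i.d. Gumbel(0, 1) noise and taking the argmax produces an exact categorical sample.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.2, 0.5, 0.3])
logits = np.log(probs)

def gumbel_max_sample(logits, rng):
    # argmax of (log-probs + Gumbel noise) is distributed as Categorical(probs)
    gumbels = rng.gumbel(size=logits.shape)
    return int(np.argmax(logits + gumbels))

n = 200_000
counts = np.bincount([gumbel_max_sample(logits, rng) for _ in range(n)],
                     minlength=3)
freqs = counts / n  # empirical frequencies approach [0.2, 0.5, 0.3]
```

The submission's planning problem is this same perturbed argmax taken over combinatorially many action sequences, with the reward term added as a further perturbation.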
So, I see this as a natural extension of recent work that will likely be of significant interest to the NeurIPS community. The extension to RL is nontrivial, as it requires some thought about how to solve the Gumbel- and reward-augmented planning problem in the RL context.
2— the method seems intuitive and relatively straightforward to try
I appreciate that the basic method seems relatively easy to describe, in addition to being relatively straightforward to apply, although I’m a bit unclear as to how easy it is to implement the top-down sampling method.
3— the formulation leads to simple proofs of interesting results
In a similar vein, the formulation is very clean and leads to simple proofs and potentially interesting insights. For instance, the interpretation of the non-perturbed score serving as a baseline for variance reduction is interesting. The main expression for the direct policy gradient is also surprisingly simple to derive. The result about epsilon allowing one to control risk sensitivity is also insightful.
The simple form of the direct policy gradient is also interesting in itself and leads to interesting follow-on questions. e.g., the form is very similar to the gradient of a Gibbs distribution, and seems somewhat reminiscent of contrastive divergence, where one takes the “negative” sample to be a “slightly optimized” version of the “positive” sample.
4— clear example showing when this might be better than REINFORCE
The example provided demonstrating when the method might perform much better than REINFORCE was illuminating and much appreciated. Although the reparameterization trick has proliferated in recent times, it is generally unclear when it performs better than REINFORCE, so this simple result was very helpful.

__ Weaknesses__: 1— method requires a simulator, or “common random numbers”
Unfortunately, the method is apparently not applicable to settings where the environment is the real world, because of the need to explore actions repeatedly with a fixed random seed. Access to a simulator is not a totally unreasonable requirement, but this assumption in general seems to be a bit understated in the presentation.
2— how does this compare to simple finite differences?
The nature of the direct policy gradient is suggestive, in that it appears similar to a simple finite difference method, which raises the question: how does this compare to simply finite-differencing the reward? This is probably my biggest doubt about the work. Finite differencing doesn't work well when the environment is the real world, because the variance of estimating a difference is generally high in that case. However, this method can't handle that case either. So, I am very curious as to how this method would perform empirically compared to finite differences, and I'd be interested to hear any insights about why one would prefer the proposed method to finite differences.
The assumption of "common random numbers" is subtle, but probably stronger than it appears due to its variance-reduction powers. The "vine" variant of TRPO is cited as an example of another method using common random numbers, but this assumption isn't discussed much besides that. PEGASUS (Ng and Jordan, 2000) is another relevant reference for the use of common random numbers in RL.
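To make the variance-reduction point concrete, here is a toy sketch (my own construction, unrelated to the paper's experiments) of why common random numbers matter for finite differencing a stochastic objective: with a shared seed, the additive noise cancels exactly in the difference.

```python
import numpy as np

def noisy_reward(theta, seed):
    # Fixed seed = common random numbers: the same noise draw on every call.
    rng = np.random.default_rng(seed)
    noise = rng.normal(scale=1.0)
    return -(theta - 3.0) ** 2 + noise  # true gradient: -2 * (theta - 3)

def fd_gradient(theta, delta=1e-3, seed=42):
    # Same seed on both sides, so the noise terms cancel in the subtraction
    # and the estimate has the variance of a deterministic finite difference.
    return (noisy_reward(theta + delta, seed)
            - noisy_reward(theta - delta, seed)) / (2 * delta)

g = fd_gradient(1.0)  # true gradient at theta = 1 is -2 * (1 - 3) = 4
```

Without the shared seed, the same estimator's variance blows up as 1/delta^2, which is the sense in which this assumption quietly does a lot of work.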
3— experiments only test toy problems
The experiments are a bit of a disappointment because they test only very simple problems. I would be very interested in seeing how this method performs on more complex problems—such as video games. If there are limitations of the method that would make it difficult to adapt to such problems, then that should be discussed.
The performance relative to REINFORCE was also a bit disappointing. Besides those flaws, the experiments were reasonable. I found the DeepSea results particularly interesting.

__ Correctness__: I detected no problems with any of the technical claims or results.

__ Clarity__: The submission is well-written overall. It is well-organized and generally does a good job of explaining the main concepts without getting dragged into technical tangents. The notation is fairly clear as well.
One area for improvement is section 5, which seems like it could use a bit more graphical intuition. The meaning of figure 1 is a bit opaque at first: it's not clear that nodes represent regions, for example. It's also a bit confusing that this section mostly ignores the reward perturbation until the end; without that, it seems like the section is just describing a very complicated way to sample from the policy.
I guess it is actually necessary to sample the policy in this way to ensure that it’s sharing common random numbers with the perturbed problem, but this isn’t very clear. This again recalls my earlier point that the assumption of common random numbers is understated throughout the paper.

__ Relation to Prior Work__: The discussion of prior work was reasonable, with perhaps the already-noted exception of the relationship of this method to finite-difference methods.

__ Reproducibility__: Yes

__ Additional Feedback__: