NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:4951
Title:Sampling Networks and Aggregate Simulation for Online POMDP Planning

Reviewer 1

Author feedback: I thank the authors for the feedback. The feedback was of high quality and satisfied my concerns. I suggest that, perhaps a compressed version, of "Explaining limitations of our work" from the author feedback, which I enjoyed reading, will be added to the final version of the paper. The paper "Sampling Networks and Aggregate Simulation for Online POMDP Planning" proposes a new solution to computing policies for large POMDP problems that is based on factorizing the belief distribution using a mean field approximation during planning and execution and extending aggregate simulation to POMDPs. In short, the proposed POMDP planner projects factorized beliefs forward in time forming at the same time a computational graph and then computes gradients backwards in time over the graph to improve the policy. Overall, the approach seems promising and outperforms state-of-the-art in large problems where the approximations do not seem to degrade performance. The approach relies on a factorized approximation which is not accurate contrary to particle based approximation which are accurate in the limit. Due to this the experiments should be extended with more classic POMDP problems, for example the RockSample problem where the factorization is suspected to not work well. This would give a more complete picture of the strengths and weaknesses of the approach. I would like to emphasize that even though the proposed approach may not work in all tasks it provides a novel approach to POMDP solving and could be beneficial in many practical problems. EXPERIMENTS: "Here we test all algorithms without domain knowledge." It is well known that MCTS based algorithms (such as POMCP) only work well with a suitable rollout heuristic (the heuristic can also be learned, see e.g. Alpha Go Zero). Also results with a heuristic (domain knowledge) even if with a simple one should be provided. It can be also considered "fair" to provide some domain knowledge to MCTS approaches in the form of a rollout heuristic since adding domain knowledge is straightforward compared to other approaches. Figures 3 and 4 should include confidence intervals. RELATED WORK: Using factored beliefs in POMDP planning is not new. For example, [Pajarinen et al., 2010] uses factored beliefs to solve large POMDP problems. These results also give hints on when a factored belief may be problematic. For example, in the classic RockSample problem, where movements are deterministic, a factored probability distribution does not seem to be a good approximation. Pajarinen, J., Peltonen, J., Hottinen, A., & Uusitalo, M. A. (2010). Efficient planning in large POMDPs through policy graph based factorized approximations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 1-16). Springer, Berlin, Heidelberg. LANGUAGE: "A MDP" -> "An MDP" "each of these is specific by a" -> "each of these is specified by a" "it is important to noice" -> "it is important to notice" "which rolls out out" -> "which rolls out" "shows the potential of sample sample sizes to yield good performance" ?? In Figure 1, there is the text "SOGBOA". Should this be "SOGBOFA"?

Reviewer 2

The paper extends the previous H. Cui et al. work [2,3,4] to POMDP problems. While the previous method factorizes the state space, the authors consider factoring over the representation of belief state. A symbolic computation graph of the value function is introduced and is used for optimizing the policy by gradient. The numerical experiments show better performance than the state-of-the-art baselines. Strengths are as written in 1 Contributions. Weaknesses: - There is no theoretical analysis. - The numerical evaluation might be limited since the tasks used in the experiments seems to be biased. While these tasks have a large number of states, they might have small impact on the technical assumptions, especially the factorization over the representation of belief state, compared to the classical POMDP tasks like the T-maze task [Bakker, 2002]. - Since the proposed method employs several heuristics, which are used in SOGBOFA [2], it is unclear which heuristics contributed significantly to performance in the numerical experiments. Minor issue: - Line 96, p(x=1) should be p(x=T), p(i=T), or something? Also, the notation of T will be missing. At first, I misunderstood it as the time step T. Bakker, B.: Reinforcement learning with long short-term memory. In: Advances in Neural Information Processing Systems, vol. 14. MIT Press (2002) === Update after the rebuttal === I would like to thank the authors for submitting a response to the reviews and addressing my concerns. The explanation of limitations of the proposed approach in the response is interesting and important. I would like to recommend to include the explanation in the main body of the final version.

Reviewer 3

This paper proposes a new approximate POMDP planning algorithm with a few tricks: approximating beliefs using product distributions, sampling observations, and aggregate simulation. The approach enables action selection using gradient-based optimization. The proposed algorithm performs competitively against POMCP and DESPOT on three large IPC problems. The ideas are interesting, and seems to work effectively on some large problems that current algorithms do not scale up. One main limitation is that the method has no guarantee of finding out the optimal policy even if the search time is unlimited though. While POMCP and DESPOT can find optimal policies given sufficient time and performs much better with some prior knowledge, the proposed approach sacrifices performance guanrantee, but seems to perform better without prior knowledge. Does the approximations used in the algorithm have a big impact on the best possible solution found by the algorithm? Comparing the algorithm with other POMDP planning algorithms on smaller datasets can be helpful. Minor comments - Eq. (2): b_{t}(s) should be b_{t}. - Line 88: why is a sequential plan p not a policy? - Line 96: not clear why p_{i} is p(x=1), and it is confusing to use p for different quantities (plan, probability). * Update after the rebuttal The authors provided some examples to illustrate the limitations of the proposed algorithm. They stated that "DESPOT only runs with POMDPX". This is not true. DESPOT can work with a model implemented in C++ as well. So their additional result need to be updated to take this into account as well.