Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
## Summary

The paper extends the NPI formalism studied by Reed & de Freitas and later by Cai et al. to be learnable by AlphaZero-like reinforcement learning, without the need for strong supervision with execution traces. Instead, it learns the hierarchy of program subroutines in a curriculum fashion, adding a pre- and post-condition to each subroutine and extending the MCTS setup of AlphaZero to handle recursive subroutine calls. The paper demonstrates that the resulting formulation learns programs in both the Sorting and TowersOfHanoi domains more effectively than prior work.

## Review

The main contribution of the paper is the generalization of the NPI setting to eliminate the need for strong supervision. This required innovation in both components of the final solution: MCTS and NPI. The experimental results are commendable, clearly showing much stronger generalization of the learned NPI models even without supervision as strong as the baselines required.

However, this is to some extent a false equivalence. The requirement for strong supervision has been replaced with the requirement for curriculum learning with a clearly marked hierarchy of subroutines, including pre- and post-conditions. I was surprised by the remark in §3.2 that prior approaches also relied on curriculum learning. While Cai et al. trained their NPI system on much smaller instances than at test time, this was for generalization evaluation rather than curriculum learning: the instances weren't ordered, and there were no hyperparameters accounting for the agent's progress (like Δ_curr in this work). So at a high level, strong supervision of one kind was traded for strong supervision of another.

In addition to the main results, I found the Table 2 experiment particularly interesting, though I would cast its results in a different light than the authors do.
It seems to show that even without curriculum learning and a hierarchy of subroutines, training weakly supervised NPI models with AlphaZero-style reinforcement learning provides an improvement (even if the setting remains impractically hard). This is an important result in and of itself, and I would devote some space to investigating it in the final version of the paper. For instance, how do the results change over the course of training? What happens if we add a little bit of supervision (e.g., a single level of recursive subroutines)?

## Minor remarks

* It would be great to include the baseline results of Cai et al. and Xiao et al. in Tables 1-2, to put the numbers in context.
* A (shortened version of) Algorithms 1-2 with some commentary might be much more helpful for explaining AlphaNPI than the current lines 120-150.

------------------------------------------------------------

## Post-response update

I'd like to thank the authors for their answers and clarifications. My score remains a positive "7": I believe this work is already a good contribution to NeurIPS. Its presentation clarity should be improved for the camera-ready version, and the authors have clearly outlined all the improvements they intend to integrate.
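To make the distinction with Cai et al.'s setup concrete: a Δ_curr-style curriculum, as I read it, promotes the agent to harder tasks only once validation accuracy on the easier ones crosses a threshold. A minimal sketch of such a scheduler (function and task names are my own illustration, not the paper's actual implementation):

```python
def next_task(tasks, validation_accuracy, delta_curr=0.97):
    """Pick the first task (ordered easiest-first) whose validation
    accuracy is still below the promotion threshold delta_curr."""
    for task in tasks:
        if validation_accuracy.get(task, 0.0) < delta_curr:
            return task
    return tasks[-1]  # all tasks mastered: keep training on the hardest

# Hypothetical subroutine hierarchy for the sorting domain:
acc = {"RESET": 0.99, "BUBBLE": 0.80, "BUBBLESORT": 0.10}
print(next_task(["RESET", "BUBBLE", "BUBBLESORT"], acc))  # BUBBLE
```

The key point is that this requires both an explicit ordering of subroutines and a progress-dependent hyperparameter, neither of which is present when one merely trains on small instances and tests on large ones.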
The authors apply an actor-critic algorithm to the task of learning to solve algorithmic/program-induction tasks. The model uses an LSTM and embeddings of program states and actions, with a shared policy and value network. During rollouts, MCTS is used as a policy improvement operator. Following prior work on curriculum learning, the authors periodically check the validation error and use it to adjust the task regimen, increasing the difficulty over time. The authors validate their approach by training the model to sort and to solve a Tower of Hanoi problem, demonstrating that with a recursive solution it is possible to train a policy to solve either task without full trace supervision.
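For readers unfamiliar with the "MCTS as improvement operator" idea: in AlphaZero-style training, the search's visit counts define a sharper target policy that the network is then trained to imitate. A minimal sketch of that step, assuming the standard visit-count formulation (names and temperature parameter are mine):

```python
def mcts_policy(visit_counts, temperature=1.0):
    """AlphaZero-style improved policy from root visit counts:
    pi(a) proportional to N(a)^(1/temperature)."""
    powered = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(powered)
    return [p / total for p in powered]

# Actions visited 10, 30, and 60 times during search:
pi = mcts_policy([10, 30, 60])  # -> [0.1, 0.3, 0.6]
```

Lowering the temperature concentrates the target on the most-visited action; the network's policy head is trained toward this distribution, which is how search "improves" the raw policy.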
**Update after author response:** The author response is lucid and to the point. My concerns have been addressed, so I have decided to raise my score to 8.

This paper presents a method, AlphaNPI, to learn neural programs without ground-truth execution traces. Instead, post-condition tests are used to provide reward signals. AlphaNPI can try different atomic actions and novel subprograms, and reuse subprograms that lead to high rewards. It is verified to generalize well on bubble sort and Tower of Hanoi. This could be a very good piece of work; however, I feel a few points should be improved:

1. The idea of how a new subprogram is learned and added to M_prog is not clearly presented, even though I think it is the core contribution of this paper.
2. The experiments are a bit weak: the authors basically compare the model only with its own variants. In addition, as in typical RL papers, the authors should provide training iterations and reward curves, so that readers have an idea of the method's sample efficiency.
3. AlphaNPI uses post-conditions to make sure a learned subprogram brings direct benefits. This is interesting, but I am also worried that it may be limiting. In more sophisticated problems, not every subprogram can be measured this way: some subprograms could be utility functions that do not provide direct rewards, and it is difficult to define meaningful post-conditions for them. I suspect that is why the authors chose bubble sort and Tower of Hanoi (because these can be decomposed into easier subtasks, and post-conditions can measure whether each subtask is accomplished successfully). Still, this work could be an interesting initial attempt along this line.
4. How does the tree policy vector pi_mcts encode the tree? This is not explained.
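To illustrate the concern in point 3: a post-condition reward is a sparse terminal signal that only exists when the subprogram's effect is directly checkable on the environment state. A minimal sketch of this pattern, with a sorting post-condition as the example (all names here are illustrative, not the paper's):

```python
def postcondition_reward(state, postcondition):
    """Sparse terminal reward: 1 if the subprogram's post-condition
    holds on the resulting state, else 0."""
    return 1.0 if postcondition(state) else 0.0

# Checkable post-condition for a sorting subprogram:
def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

print(postcondition_reward([1, 2, 3], is_sorted))  # 1.0
print(postcondition_reward([3, 1, 2], is_sorted))  # 0.0
```

A utility subprogram (say, one that merely repositions a pointer for a later caller) has no such state predicate that certifies "direct benefit", which is exactly where this reward scheme would struggle.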