Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Contributions: The paper uses a combination of GNN and RL techniques to learn to explore graph-structured environments. The method is applied to a synthetic maze-exploration experiment and to program testing to improve test coverage. Key contributions are:
- Framing the graph-exploration problem with a model that uses (1) a graph memory representing the visited state, (2) the exploration history in the prediction of the next action, and (3) different weights per time step.
- Framing program testing (including app testing) as a graph-exploration problem.
- Ablation studies exploring the overall importance of weight sharing (per RL time step), graph structure, exploration history, and different methods of encoding the program under test.

Comments:

Originality: There are prior works combining GNN + RL with a similar flavor (e.g., where the states are represented as a GNN, or where actions are represented by nodes in a GNN). This work combines those ideas with the goal of learning to explore; in particular, I don't recall seeing the idea of using an evolving graph memory before.
https://openreview.net/pdf?id=HJgd1nAqFX
https://arxiv.org/pdf/1906.05370.pdf
https://graphreason.github.io/papers/7.pdf

Clarity: The paper is well-written and easy to follow. I'm curious about:
- In Table 2 and Figure 2, where does the "randomness" in the "random next-step selection" of RandDFS come from? In particular, how much worse is RandDFS compared to actual DFS?
- In Figure 2, it would be useful and interesting to see the actual trajectories taken by the various agents, or at least the agents' initial positions.
- In Figure 4, I'm not sure what "# inputs generated" means. (At first I thought that means T = 5 in these experiments, but that doesn't really make sense.)
- Also in Figure 4, what does the y-axis label "joint coverage" mean?

Quality: The submission appears technically sound, with experiments that are relatively distinct from each other.
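On the RandDFS question above: my assumption (not confirmed by the paper) is that the randomness comes from sampling which unvisited neighbor to expand next, while plain DFS would use a fixed neighbor order. A minimal sketch of that reading, with a hypothetical adjacency-dict graph:

```python
import random

def rand_dfs(graph, start, max_steps):
    """Randomized DFS: identical to classic DFS except that the next
    unvisited neighbor is chosen at random, so the traversal path
    (and coverage within a step budget) varies between runs."""
    visited = {start}
    stack = [start]
    trajectory = [start]
    steps = 0
    while stack and steps < max_steps:
        node = stack[-1]
        unvisited = [v for v in graph[node] if v not in visited]
        if unvisited:
            nxt = random.choice(unvisited)  # the only source of randomness
            visited.add(nxt)
            stack.append(nxt)
            trajectory.append(nxt)
            steps += 1
        else:
            stack.pop()  # dead end: backtrack

    return trajectory

# Hypothetical 4-node graph; a budget of 10 steps covers it fully.
g = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
```

Under this reading, RandDFS and DFS visit the same set of nodes given an unlimited budget; they differ only in how quickly coverage accumulates, which is presumably what Table 2 and Figure 2 measure.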
The maze-exploration task with its ablation study shows the performance gain from using the full graph structure and the overall history. Reporting the standard deviation over the 100 held-out mazes would be helpful in analyzing the proposed approach. That using the overall history performs better is a bit unintuitive to me. The program-testing tasks also have ablation studies that show how the encoding of the program under test affects coverage; however, it is unclear whether using the overall history is beneficial there at all. The authors also do not compare against any fuzzing or neural-fuzzing approaches. The app-testing task shares similarities with the maze task, but in a more realistic scenario. It would be helpful to provide an intuition about what a real app-traversal graph looks like (not the generated ER graph) for comparison against the maze task.
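As I read the contributions in this review, the core framing can be summarized in a short sketch: the agent grows a graph memory of visited states and conditions the next action on that memory, the exploration history, and the step index (which is what allows per-step weights). Everything below, including the toy environment and the random policy, is hypothetical and merely stands in for the paper's GNN+RL policy:

```python
import random

def explore(env, policy, T):
    """Sketch of exploration with a graph memory: the policy sees the
    visited-state graph, the full history, and the step index t."""
    s = env.current_state()
    memory_nodes = {s}    # graph memory: visited states
    memory_edges = set()  # graph memory: observed transitions
    history = [s]         # exploration history
    for t in range(T):
        a = policy(memory_nodes, memory_edges, history,
                   env.available_actions(), t)
        s_next = env.step(a)
        memory_edges.add((s, a, s_next))  # memory grows as we explore
        memory_nodes.add(s_next)
        history.append(s_next)
        s = s_next
    return memory_nodes

# Toy 1-D environment with states 0..3 (hypothetical, not from the paper).
class ToyEnv:
    def __init__(self):
        self.s = 0
    def current_state(self):
        return self.s
    def available_actions(self):
        return [-1, +1]
    def step(self, a):
        self.s = max(0, min(3, self.s + a))
        return self.s

# A trivial stand-in for the learned GNN policy.
random_policy = lambda nodes, edges, hist, acts, t: random.choice(acts)
```

With this skeleton in mind, the ablations amount to handing the policy less information: dropping `memory_edges` removes graph structure, and passing only the last state instead of `history` removes the overall history.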
I appreciate the contribution of introducing the new problem framework of graph exploration. The proposed GNN-based RL algorithm is also well designed and reasonable. My major concerns are as follows.

1) It is unclear why the proposed framework is better for specific tasks, and whether this graph-exploration framework is, in general, expressive enough to adapt to task-specific characteristics. When I first saw the term "transferable" in the title, I expected the graph exploration to be transferable between different tasks. However, the authors use "transferable" in the sense that their trained agent works on unseen graphs within the same task. In this sense, I think many task-specific methods in the literature would also be transferable, and I am not sure what the advantage of the proposed graph-exploration network is over a task-specific method. E.g., do other maze-exploration or fuzzing methods generalize to unseen data worse than the proposed one? I think this needs to be clearly justified.

2) The baselines are too simple, and no comparison with any state-of-the-art method for each task is provided. For maze exploration, there are many recent methods (e.g., Mirowski et al., Learning to Navigate in Complex Environments, ICLR 2017), but this paper does not compare with them. For fuzzing, the paper only compares with a random strategy and an expert-designed heuristic, even though there are many methods for fuzzing (e.g., Godefroid et al., Learn&Fuzz: Machine Learning for Input Fuzzing, ASE 2017, and Liu et al., DeepFuzz: Automatic Generation of Syntax Valid C Programs for Fuzz Testing, AAAI 2019).

3) Minor comments:
- For equation (4), if you use the padding of one extra bit c_t(v), it would be more helpful to add this term explicitly.
- How do you initialize node embeddings for program testing and maze exploration? Is it done in the same way for each dataset?
- In maze exploration, it would be more informative to show the detailed trajectory of each agent so that we can understand how the network explores the maze.
The idea of this article is new and innovative. It combines machine learning with the exploration of spatial data, which has many application scenarios. The authors also give a specific application scenario: software testing of domain-specific programs and mobile applications.
Although the problem of exploration in a graph-structured space seems a fairly natural extension of the exploration problem initially proposed in other contexts (e.g., game-playing environments), the proposed problem setup is not yet well studied, so it is a meaningful topic. However, the novelty of the approach is very minor. The ideas of RL + GNN and RL for exploration are not new, as stated in their related work. The idea of maintaining a graph-structured memory is not new either; e.g., [1] maintains the evolution of the graph structure over time. In particular, RL + GNN + exploration is not completely novel as well; [2] studies the problem of goal-oriented web navigation using RL + GNN. Although their formulation is different from the exploration task, they are actually solving a harder problem, where the reward could be very sparse.

Regarding the experiments: although the latter two tasks, i.e., program testing and app testing, are real-world tasks, the datasets are still more synthetic than the benchmarks used in previous work on related applications. For example, FlashFill and Karel have simplified grammars that constrain the search space for fuzzing. Why not evaluate on standard fuzzing benchmarks as in [3], and compare with existing neural, RL, and classical fuzzing approaches (you can find comprehensive baselines in [3])? If the authors can demonstrate superior performance on these more realistic benchmarks, the results could be much more convincing.

[1] Daniel D. Johnson, Learning Graphical State Transitions, ICLR 2017.
[2] Jia et al., DOM-Q-NET: Grounded RL on Structured Language, ICLR 2019.
[3] She et al., NEUZZ: Efficient Fuzzing with Neural Program Smoothing, IEEE S&P 2019.