This paper addresses vision-and-language navigation from raw visual input and language instructions in a photorealistic indoor environment (Room-to-Room). The model iteratively builds a high-level graph representation of the environment and performs goal-driven planning over it with Graph Neural Networks. Instead of planning over the full graph, it predicts actions over the fringe nodes of that graph (i.e., it jumps through the graph along shortest paths), and it also predicts and plans on a sparser proxy graph representation; both ideas are novel. The model is trained with imitation learning.

After discussion and the authors' rebuttal, the reviewers' scores are (6, 7, 7, 6). While many of the reviewers' concerns were addressed, the main remaining concerns are: a missing comparison to graph-search methods (specifically "Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation"); confusion about the use of the word "planning" in an imitation-learning setting; open questions about how loop closure is performed; the need to acknowledge the competitive advantage of knowing which nodes of the graph are frontier nodes; and insufficient detail for reproducing the work.

Based on these comments, I recommend acceptance as a spotlight or poster, and expect the authors to keep their promises to include algorithmic details in an expanded appendix, add the promised ablations, and add a comparison with Tactical Rewind.