Sunday, December 2 through Saturday, December 8, 2018, at the Palais des Congrès de Montréal
This paper presents a method to generate a route given its textual description. The idea is to use a standard sequence-to-sequence model to learn a mapping between descriptions and the corresponding routes. To address the scarcity of paired descriptions and routes, the authors train a second model that maps routes to descriptions, and then generate synthetic data from sampled routes. This synthesized data, along with the actual data, is used to train the model.

While the paper addresses an interesting problem, the novelty of the method is limited. The paper uses standard sequence-to-sequence models with nothing specific to the domain or the task, and using the speaker model to generate synthetic data is straightforward. Despite the simplicity of the approach, I am actually impressed with the results, especially the ablation study; it is interesting to see the improvement from progressively adding new components.

My other comments concern the writing and explanation. There is a lot of redundancy in the paper, and that space could be better used to explain the implementation and experiments in detail. I believe that the method does not actually "generate" routes, but rather selects them from a fixed dictionary. What is that dictionary made of? How can such a dictionary produce routes for unseen environments? If routes are truly generated, this has to be made clear. Furthermore, the evaluation method is not clear: how do we know whether the agent is within a certain range of the goal? Does the data come with geolocation annotations? This has to be made clear.

Overall, the clarity of the paper can be improved. The paper lacks originality; however, the results are interesting.

=== I have read the authors' response and modified my review accordingly.
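The augmentation pipeline the review describes (train a route-to-description "speaker", then label sampled routes with it) can be sketched as follows. This is a minimal toy sketch: `sample_route` and `speaker_decode` are hypothetical stand-ins for the real route sampler and the trained seq2seq speaker, not the paper's implementation.

```python
import random

def sample_route(graph, length=4, rng=random):
    """Random-walk a route of `length` steps through a navigation graph
    (a dict mapping each node to its adjacent nodes)."""
    node = rng.choice(list(graph))
    route = [node]
    for _ in range(length):
        node = rng.choice(graph[node])
        route.append(node)
    return route

def speaker_decode(route):
    """Placeholder for the trained speaker model: route -> instruction."""
    return "go from {} to {}".format(route[0], route[-1])

def augment(real_pairs, graph, n_synthetic, rng=random):
    """Mix real (instruction, route) pairs with speaker-labelled
    synthetic pairs sampled from the environment graph."""
    synthetic = []
    for _ in range(n_synthetic):
        route = sample_route(graph, rng=rng)
        synthetic.append((speaker_decode(route), route))
    return real_pairs + synthetic
```

The follower would then be trained on the combined output of `augment`, which is where the review's concern about the ratio of real to synthetic data (signal vs. noise) applies.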
This paper introduces a new method for visual navigation based on language descriptions. The work effectively exploits the usefulness of a "pragmatic speaker" for the language-and-vision navigation problem, during both the training and test phases. The three contributions: 1) Speaker-Driven Data Augmentation, 2) Speaker-Driven Route Selection, and 3) Panoramic Action Space, are clean, easy to understand, and effective. The paper is clearly written, and I really enjoyed reading it.

A few minor comments: The values of some parameters are chosen without much discussion. For instance, the augmented data adds both useful information and noise to the system, so a careful trade-off needs to be made; an evaluation of performance as a function of the ratio of augmented data would be very interesting. In line 193, the "built-in mapping" needs further explanation and specification. In [a], the authors use visual context to rescore candidate referring expressions generated by beam search from the speaker's spoken language, for the task of object referring in visual scenes with spoken language. That work is relevant and should be discussed.

[a] "Object Referring in Visual Scene with Spoken Language", Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool, WACV 2018.
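The Speaker-Driven Route Selection the review highlights amounts to rescoring candidate routes with a mix of follower and speaker likelihoods. A minimal sketch, assuming hypothetical `follower_logp` and `speaker_logp` scoring callbacks (the trained models' log-probability functions) and an interpolation weight `lam`:

```python
def pragmatic_rescore(candidates, instruction, follower_logp, speaker_logp,
                      lam=0.5):
    """Pick the candidate route that best trades off the follower's
    probability of the route given the instruction against the speaker's
    probability of the instruction given the route."""
    def score(route):
        return ((1 - lam) * follower_logp(route, instruction)
                + lam * speaker_logp(instruction, route))
    return max(candidates, key=score)
```

With `lam=0` this reduces to the plain follower; increasing `lam` gives the speaker more say, which is the "pragmatic reasoning" at test time.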
This paper builds upon the indoor vision-and-language-grounded navigation task and sequence-to-sequence model described in (Anderson et al., 2017), introducing three improvements:

1) An encoder-decoder-like architecture, dubbed the "speaker-follower" model, that not only decodes natural-language instructions into a sequence of navigation actions using seq2seq, but also decodes a sequence of navigation actions and image features into a sequence of natural-language instructions using a symmetric seq2seq. That speaker model can then be used to score candidate routes (i.e., candidate sequences of images and actions) by the likelihood of the natural-language instruction under the speaker model, enabling a form of planning for the seq2seq-based agent.

2) The same speaker architecture can be used to generate new natural-language descriptions for existing or random-walk-generated trajectories through the environment, and thus to augment the training data for the follower.

3) Instead of observing 60deg field-of-view oriented crops of the panoramic images, which entails repetitive rotations to align the agent with the target destination, the agent is presented directly with a 360deg view and uses an attention mechanism to predict where to look, then computes the probability of moving there in the image; the image and motion are decomposed into 12 yaw and 3 pitch angles.

The authors achieve state-of-the-art performance on the task and present a thorough ablation analysis of the impact of their three improvements, although I would have liked to see navigation attention maps in the appendix as well. Improvement #3 (the panoramic action space) is a significant simplification of the task; luckily, the NE and SR metrics are unaffected. It is worth noting that the idea of using language prediction for navigation in a grounded 3D environment was already introduced by Hermann et al., 2017: "Grounded language learning in a simulated 3D world".
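The 12-yaw by 3-pitch decomposition in improvement #3 can be illustrated with a simple direction-to-bin discretization. The bin layout below (30deg yaw bins, pitch bins at -30/0/+30deg) is an illustrative assumption, not the authors' exact implementation:

```python
# 12 yaw bins (30 deg each) x 3 pitch bins = 36 discrete view directions.
YAW_BINS, PITCH_BINS = 12, 3

def view_index(yaw_deg, pitch_deg):
    """Map a (yaw, pitch) direction in degrees to one of 36 view indices.

    Yaw wraps around the full 360-degree panorama; pitch is clamped to
    the three bins centered at -30, 0, and +30 degrees."""
    yaw = int((yaw_deg % 360) // 30)                     # 0..11
    pitch = max(-1, min(1, round(pitch_deg / 30))) + 1   # 0..2
    return pitch * YAW_BINS + yaw
```

An agent with this action space attends over the 36 view features and emits one index per step, rather than issuing repeated small rotation commands.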
Minor remark: The abstract is difficult to understand without reading the paper (both "pragmatic reasoning" and "panoramic action space" could be reformulated as model-based planning and visual-attention-based navigation in panoramic images, respectively).

In summary:
Quality: an overall well-presented paper that contributes to the development of language-grounded visual navigation agents.
Clarity: good, but reformulate the abstract.
Originality: rather incremental work; missing a reference on language generation for a language-grounded agent.
Significance: good; state-of-the-art results on the Matterport dataset.

Update: I have read the authors' rebuttal and updated my score accordingly from 6 to 7.