This paper proposes to leverage unlabelled YouTube videos (mostly real-estate walkthroughs) of egocentric navigation in indoor environments to train the Q-value function network for the high-level part of a hierarchical RL policy for goal-driven indoor robot navigation. The low-level part relies on depth-based obstacle avoidance and planning in 2D maps. The method works in an unsupervised way by augmenting the egocentric navigation video dataset in two ways: 1) extracting action labels with motion classifiers and 2) extracting semantic goal labels with an object detector. It uses these two to 3) build experience replay tuples of (previous image, action, next image, goal) and then train the goal-conditioned value function with Q-learning. The high-level policy predicts Q values for navigating a topological graph. The paper builds upon "Learning navigation subroutines by watching videos", which introduced the first idea (extracting action labels from navigation videos) as well as a simpler version of the third idea (collecting tuples of previous image, action, next image, in the context of Q-learning with rewards from intrinsic motivation). An independently published paper, [1.r] "DIViS: Domain Invariant Visual Servoing for Collision-Free Goal Reaching", introduced the second idea (images or image categories, provided by the same MS-COCO object detector, as goals for goal-driven navigation) in a similar context, to train a low-level affordance- and obstacle-detection-based policy. The paper could thus be viewed as a combination of the subroutines paper, [1.r], and topological graph navigation papers such as "Semi-parametric topological memory for navigation". The paper "Combining optimal control and learning for visual navigation in novel environments" took a different approach to object-goal-driven hierarchical navigation, using object detectors for waypoints.
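To make the training step concrete, here is a minimal sketch of goal-conditioned Q-learning over replay tuples of (state, action, next state, goal). It is a deliberately simplified tabular stand-in for the paper's image-based networks: the state/action/goal spaces, their sizes, and the sparse goal-reaching reward are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative sizes; the actual method operates on images, not table indices.
N_STATES, N_ACTIONS, N_GOALS = 20, 3, 4
GAMMA, ALPHA = 0.99, 0.5

# Goal-conditioned Q table: Q[state, goal, action]
Q = np.zeros((N_STATES, N_GOALS, N_ACTIONS))

def q_update(Q, s, a, s_next, g, reached):
    """One Q-learning backup on a replay tuple (s, a, s_next, g).

    Reward is 1 when the object detector finds goal g in s_next
    (a sparse goal-reaching reward), and the episode terminates there.
    """
    if reached:
        target = 1.0
    else:
        target = GAMMA * Q[s_next, g].max()  # bootstrap from next state
    Q[s, g, a] += ALPHA * (target - Q[s, g, a])

# Toy "video": three consecutive tuples where goal 0 is detected at state 3.
trajectory = [(0, 1, 1, 0, False), (1, 1, 2, 0, False), (2, 0, 3, 0, True)]
for _ in range(200):
    for s, a, s_next, g, reached in trajectory:
        q_update(Q, s, a, s_next, g, reached)

# Values propagate backwards with discounting:
# Q[2, 0, 0] -> 1.0, Q[1, 0, 1] -> GAMMA, Q[0, 0, 1] -> GAMMA**2
```

The same backup, with the table replaced by a convolutional network over image pairs and goals, is what the mined replay tuples make possible without any interaction in the environment.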
The paper is extensively evaluated with ablation experiments, beats behavioural cloning and several heuristic baselines, and demonstrates data-efficient learning and strong performance on the Gibson dataset. The algorithm is well documented in the appendix and code is provided. Reviewers R1, R2 and R4 all gave scores of 7 (and judged the research easily reproducible), and during the internal discussion they all believed the paper should be accepted. Negative points raised by these three reviewers were the use of a depth- and occupancy-map-based heuristic as the low-level policy, the lack of strong RL baselines (addressed in the rebuttal), some remaining questions on data analysis and ablation studies (also addressed in the rebuttal), and the fact that the method was only marginally better than the non-RL but object-detection-based Detection Seeker baseline (more on that later). Reviewer R3 strongly disagreed and gave, then maintained during discussion, a score of 2. Some of R3's points of criticism seem less valid:
* difficulty in reproducing the work (I disagree, as code and an extensive appendix are provided);
* lack of applicability in the real world (even though the method is evaluated on a standard indoor navigation environment);
* the fact that some amount of training on Gibson is still necessary (even though the authors claim that their method achieves significant data efficiency).
The following two points of criticism are, to my mind, valid:
* lack of novelty w.r.t. existing work: the method seems like a combination of existing work and ideas (experience replay from videos, object detection for semantic goals, topological graph navigation). At the same time, one could argue that most research is done this way, and novelty is difficult to assess and quantify;
* the strong performance of a heuristic baseline, Detection Seeker, which does not use the learned Q-value function but still uses object detection and spatial consistency.
That point is partially addressed in the rebuttal, but I am worried that the problem is actually much larger and stems from the small scale of the environments themselves (apartments) and from the way the 360-degree observations are acquired. To reuse the analogy at the beginning of the paper: one often does not need to get up from their table at a restaurant to guess where the toilets are; simply looking around may be enough. I therefore disagree with R3's score of 2 (strong reject), which is not appropriate for this level of work and analysis, and am counting it as a score of 5 instead (good work, extensive analysis, interesting combination of ideas though not as novel as claimed, and problems with the environment and with how observations are constructed). Ultimately, I recommend accepting this paper as a poster and consider it borderline (it ranks 6/16 in my stack of papers). Additional comment: in the broader impact statement, the authors should acknowledge the nature of the dataset used to train the value function. Even though it is publicly available, there are privacy issues, consent issues (these are real-estate videos, and this particular use case was not envisioned by the people who uploaded them) and biases (comparable to the biases in Gibson).