Reviews: Third-Person Visual Imitation Learning via Decoupled Hierarchical Controller

This paper proposes a technique for a robot to learn to imitate a novel task from a single video demonstration. The model consists of two levels of hierarchy: intent modeling from a third-person video and a low-level controller. The paper is well written and easy to follow. However, rather than restricting the discussion to specific techniques, it would have been more interesting if the paper had discussed the concepts more broadly. For instance, when describing the method, it is important to state the problem or requirement first and then give the rationale for choosing a particular method.
The main theme of the paper is decoupling the hierarchy and learning. In fact, running robots in a modular fashion is the de facto choice in almost all robotics tasks, though a small minority of roboticists attempt to learn robotic tasks end-to-end. The low-level controllers in robotics are always task-agnostic. Therefore, I do not see any novelty in the concept of decoupling “what” and “how.” It is also worth noting that there is a rich literature on hierarchical reinforcement learning.
What is the advantage of directly predicting images over learning an abstraction of the intent and then predicting the position/configuration of the end-effector? Why do we want to explicitly represent the next-state prediction as an image? Why can it not be a latent feature vector, which would remove the need for a conventional GAN designed to produce photo-realistic images? By analogy with how an intelligent animal imitates, I do not think animals explicitly hallucinate future states as images. Dropping the image requirement might open the door to other generative modeling techniques such as VAEs, models based on optimal transport, and normalizing flows. I would like to know the authors' thoughts on using DiscoGAN, I2I, MUNIT, and similar generators. Further, how can we capture the uncertainty in inferring the intent?
It is not quite true that the robot learns novel tasks from a single video demonstration, because both the low-level controller and the goal generator have been pre-trained on various tasks. Can the low-level controller perform tasks that it has not been trained on? With reference to Section 5.3, is there a principled, quantitative way to define tasks and perform a sensitivity analysis? Although picking and sliding are somewhat different tasks for a human, they may or may not be similar for a mathematical model. Since this is a modeling paper, the emphasis should not be on running experiments on a real robot, but on performing rigorous experiments, even in a simulator, to validate the model and analyze its sensitivity to various human demonstrations. I would personally appreciate a scientifically valid “design and analysis of experiments” (DoE) over demonstrations.
It is always a good idea to avoid adding website links such as Google Sites, as they can be used to track the location/IP address of the reviewers. The provided website serves no purpose, as it only contains a video. Please consider attaching the video instead.
POST-REBUTTAL COMMENTS: Thank you for the interesting rebuttal! After going through the rebuttal and the other reviews, I have three main concerns about the paper.
1. Lack of rationale. The paper discusses what was done, not why it was done, which is the most important question in science. Regarding "which did not perform well": I wasn't expecting a numerical comparison with all possible generative models (thanks for the comparison, though!). Knowing the motivation and rationale for using GANs is important. In the rebuttal, rather than simply saying that the alternatives did not work, it is important to tell us why they did not work (or at least to offer a hypothesis).
2. The way the novelty is presented (a minor concern). The novelty of the paper is applying a set of standard ML techniques to third-person visual imitation learning, not decoupling the hierarchy. The latter is highlighted as the novelty throughout Sections 1, 2, and 3.
3. Experiments (a major concern). Real-robot experiments are extremely important. The message I attempted to convey was the importance of "controlled experiments", which help us understand the capacity of the proposed algorithms. For instance, consider the following experiment. *Experiment*: We need to understand the "space of demonstrations" over which the proposed algorithm is valid, because human demonstrations can vary. We asked 100 people (is 100 enough? This is the question of "experimental power" in DoE) to perform pouring in this particular experimental setup. We isolated factors X and Y that could confound the results. (Alternatively, we used a standard dataset that contains pouring actions, though I don't think there is one.) Then, we found that the proposed algorithm is valid for pouring actions within this particular class of demonstrations. Merely by running this experiment, we know when the algorithm works and what improvements we need to make in the next iteration of the work. What this paper proposes is a concept, and therefore what it should attempt to do is show that the concept is valid and has great potential. What is given in the paper and the rebuttal are demonstrations, not experiments. No details about the experimental setup and conditions are provided. Numerical results are less useful without a standard framework or controlled environments. “manipulation tasks involve intricacies like fine-grained touch, slipping, real objects”: this is exactly why I emphasized the importance of simulations, so that we can isolate these artifacts and try to understand our algorithm: why it works and when it works. Otherwise, we might have to run thousands of experimental evaluations to show that our algorithm is indeed valid in the presence of these artifacts.
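To make the "is 100 enough?" question concrete, here is a minimal sketch of how one might quantify the uncertainty of a measured success rate. It uses the Wilson score interval for a binomial proportion; the function name `wilson_interval` and the 70-out-of-100 outcome are hypothetical and chosen only for illustration, not taken from the paper or the rebuttal.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical outcome: 70 successful imitations out of 100 demonstrators.
low, high = wilson_interval(70, 100)
print(f"approx. 95% interval for the success rate: [{low:.3f}, {high:.3f}]")

# The same observed rate with ten times more demonstrators gives a narrower interval.
low_big, high_big = wilson_interval(700, 1000)
print(f"with 1000 demonstrators: [{low_big:.3f}, {high_big:.3f}]")
```

Comparing the interval widths for different numbers of demonstrators is one simple way to judge whether a planned sample size can separate the proposed method from a baseline; a full power analysis would additionally require an assumed effect size.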

The paper presents a framework for learning by demonstration using third-person videos. The method is based on decoupling the intended task from the controller by learning a hierarchical setup in which the high-level module generates goals, conditioned on the third-person video demonstration, for the low-level controller. Due to its modularity, the proposed approach is more sample efficient than end-to-end approaches, and the learned low-level controller is more general. The paper is well written and well structured; it includes insightful figures and diagrams, and fair ablations and comparisons.
Originality: the paper presents an interesting approach to using third-person views as demonstrations for a robot. Learning from demonstrations, including from videos, is not a novel contribution, nor is learning modular controllers in the form of an inverse model. Despite limited novelty, the approach presented in this paper is neat and clear. It would be interesting to see comparisons with, for example, methods that explicitly find correspondences between the demonstrator and the robot, and with methods based on trajectory-based demonstrations.
Quality: the submission is technically sound, and the approach is explained clearly. The results (including those shown in the video) suggest that there is still room for improvement in successfully completing the different tasks (e.g. the pouring policy execution seems wobbly and lacking in robustness). Some discussion of the limitations of the proposed method would help in evaluating the overall results.
Clarity: the paper is overall clearly written and well organized; a discussion of the limitations of the proposed approach could be added.
Significance: the paper provides an interesting way to address learning by third-person view demonstrations in robotics; this is a challenging and important field and this contribution is interesting with respect to more classical approaches based on hand-crafted models or task-specific solutions.

Paper ID:	1488
Title:	Third-Person Visual Imitation Learning via Decoupled Hierarchical Controller
