Summary and Contributions: This paper presents a two-stage hierarchical imitation learning framework in which a high-level planner predicts the next primitive to execute from the history of past states, and a low-level planner predicts the next sequence of states over a fixed horizon. A task is represented as a sequence of movement primitives, and each movement primitive is encoded with a graph recurrent neural network designed to capture robot-object relational features in the environment. The proposed methodology is evaluated on table-lifting and peg-in-hole tasks, suggesting performance improvements from capturing relational dependencies with the graph RNN, residual skip connections, and the use of multiple primitives.
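To make the two-stage structure concrete, here is a minimal sketch of the rollout loop described above. The planner and primitive models are hypothetical stand-ins (simple callables), not the paper's actual networks; only the control flow (high-level primitive selection, then fixed-horizon low-level prediction) reflects the summary.

```python
HORIZON = 3  # fixed prediction horizon of the low-level model (illustrative)

def high_level_planner(history):
    # Hypothetical rule: alternate between two primitives based on history length.
    return "grasp" if len(history) % 2 == 1 else "lift"

def low_level_model(primitive, state):
    # Hypothetical dynamics: each primitive shifts a scalar state by a fixed
    # delta at every step of the horizon.
    delta = {"grasp": 0.0, "lift": 1.0}[primitive]
    return [state + delta * (t + 1) for t in range(HORIZON)]

def rollout(initial_state, num_primitives):
    history, state, plan = [initial_state], initial_state, []
    for _ in range(num_primitives):
        primitive = high_level_planner(history)      # stage 1: choose primitive
        states = low_level_model(primitive, state)   # stage 2: predict horizon
        history.extend(states)
        state = states[-1]
        plan.append(primitive)
    return plan, state
```

In the real system the planner and primitive models would be learned networks conditioned on full robot-object state, but the alternation between the two stages is the same.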
Strengths: Bimanual coordination for robotic manipulation is an important avenue of research. Utilizing graph structures in robot manipulation is of growing interest to the wider robotics community. The main contribution of the paper is learning accurate dynamics models that can be useful for bimanual robot manipulation. The accuracy is achieved using a graph RNN, feeding the goal-directed object through a residual connection, and sequencing multiple primitives together to execute the task.
Weaknesses: My main concerns are:
- Feeding the network the actual pose of one arm (master) together with the poses of the second arm (slave) and other objects relative to the master would likely be more informative for capturing relational dependencies at the pose level. A baseline comparison with this encoding would help clarify the dependency structure, especially for improving performance on the second task. Adding further baselines from state-of-the-art methods in the related work would also improve the understanding of the work.
- The authors discuss a few examples with position in the table tasks, but the effect of orientation is not explained. Does the approach generalize to randomly sampled orientations of the target object? Are the predicted orientations normalized to unit quaternions? The authors are encouraged to report orientation errors and quantify this aspect of performance.
- Adding visual information from pixels would help establish the true merits of the graph attention mechanism.
- An accuracy of 29 percent on the table assembly task is rather low. The Euclidean distance error units in Table 1 seem very high. Are they normalized to per-datapoint position errors? If so, an error of 5 cm with HDR-IL seems unreasonably high.
- It is not clear that the proposed methodology is specific to bimanual manipulation; "robotic manipulation" may be more appropriate.
- Experiments on a real setup would have been useful to establish the merits of the proposed approach.
Correctness: It seems to me the main contributions are the use of a graph RNN, the residual connection for the target object, and multiple primitives, along with preliminary experiments for bimanual manipulation. The contributions claimed in the introduction are rather generic and common to several other papers, including those mentioned in the related work. It is not entirely clear to me how effectively the graph attention mechanism captures the relational dynamics between the objects and the scene. A simpler experiment which encodes the relational structure between objects and/or the end-effectors, in particular the effect of orientation, would have been useful to establish the claim of the paper.
Clarity: The paper is well-written and easy to follow for the most part. The experimental set-up is sound and the performance metrics are intuitive to understand.
Relation to Prior Work: The related work discusses hierarchical reinforcement learning and sequencing of primitives. It is also related to activity recognition in computer vision. Other useful references:
- Learning bimanual end-effector poses from demonstrations using task-parameterized dynamical systems, IROS 2015
- Motion2Vec: Semi-supervised representation learning from surgical videos, ICRA 2020
Additional Feedback: There is a good potential scope of this work, and the preliminary results are encouraging. The authors are encouraged to further ground their claims by comparing their approach with other state-of-the-art methods in the literature. Thanks for clarifying about adding relative poses baseline in the rebuttal. I would encourage the authors to strengthen the evaluation experiments and the baselines for the final version.
Summary and Contributions: This work presents an imitation learning method, HDR-IL, for bimanual robotic manipulation. The authors propose a hierarchical modular network with a high-level planner that predicts a sequence of primitive tasks and a low-level dynamics model. Their main contribution is using this hierarchical and relational imitation learning model to solve two bimanual manipulation tasks: table lifting and peg-in-hole. They claim better generalization and higher success rates when using hierarchy and relational features.
Strengths: The high-level planner / low-level controller design and pipeline is explained clearly in the methodology section, and each part of the model is described in detail. The paper is hence very well written. In general, hierarchical control offers several benefits, which are well discussed and motivated in this paper. This line of work and its applications to bimanual manipulation have not been explored much, and this work shows that the method succeeds on two tasks that require two arms. I also appreciate that the videos include the different failure cases in addition to cases where the method worked. Furthermore, the method is well ablated, showing that all components are needed for the proposed method to work.
Weaknesses:
- In terms of background and related work, it would be nice to see more explanation of the cited methods. What kinds of methods do current bimanual robotic manipulation approaches use? I only saw one sentence noting that they are done in the classical control setting.
- The baseline in the experiments has a 1% success rate; are there success rates from other bimanual manipulation methods to compare against? There are no actual baselines: the claimed baselines are essentially ablations of the proposed method. Implementations of previously cited work would be needed to judge the efficacy of specifying the primitives this way.
- For the table task: "success for table starting locations where the ground truth demonstration failed." Some explanation is needed of when and why the ground truth demonstration fails while HDR-IL succeeds, and of how the ground truths are obtained.
- The conclusion states that "our pipeline begins by manually designing and labeling primitives to train the high-level model"; I am curious to see how these primitives are labeled.
- "Incorporate an explicit graph structure to model interactions" (54). The method seems very limited by the provision of the graph structure. It would be good to see a discussion of how to learn the graph structure automatically.
- The experiments and method do not demonstrate how complex two-handed settings can be: "These models are difficult to construct explicitly … complicated interactions in the task including friction, adhesion, and deformation between the two arms/hands and the object being manipulated" (25). But the experiments are all in simulation, with "various weightless links to track location" (180), and the table is not said to be flexible. A better experiment would be holding a nail to hammer down (as mentioned in line 20).
- It would be more interesting if the table had to land in a specified configuration, such as on its side or flipped, which would presumably require more arm-to-arm interaction and model learning.
- Peg insertion task: it is unclear why the table needs to be lifted; it seems reasonable that the task could be done with one arm, and two are not necessary.
- It is unclear how the primitives are constructed. The states used seem to be manually crafted/designed (184; 60 in the appendix), which contradicts "learns a high level … and a set of low-level primitive dynamics models" (53).
Correctness: Overall, looks correct.
Clarity: Overall, easy to read. More comments on clarity are in the strength and weaknesses section.
Relation to Prior Work: Overall, good. More comments on clarity are in the strength and weaknesses section.
Additional Feedback: Clarity on how the primitives are manually annotated is missing. This is needed for reproduction. I have read all the other reviews and the author feedback. Thank you for addressing my concerns. I'm hence increasing my score. However, I still have reservations on the experimental setup and would encourage the authors to strengthen this paper with thorough experimentation and comparisons.
Summary and Contributions: A novel approach for making imitation learning work with deep learning is presented that avoids the pitfalls of monolithic end-to-end approaches and instead leverages a modular design, while remaining a deep learning approach.
Strengths: The paper is theoretically sound and, for NeurIPS as a machine learning conference, the evaluation in simulation is sufficient. The paper stands out in comparison to other robotics work at NeurIPS.
Weaknesses: There are no real robot evaluations and more tasks could be tried.
Correctness: The paper appears correct.
Clarity: Easy and fun read.
Relation to Prior Work: The paper is well-embedded in the literature.
Additional Feedback: I would recommend teaming up with a robotics group with sufficient hardware for real robot experiments -- but that's gotta be post-COVID-19.
Summary and Contributions: The paper proposes a hierarchical approach to bimanual manipulation. The high-level model selects a primitive given previous states. The low-level controllers (primitives) predict the following states given the current state. The low- and high-level controllers are learned via supervised training on manually designed demonstrations of the task. The proposed approach allows generalization to unseen initial states.
Strengths: The paper proposes an interesting low-level generative model, which draws inspiration from the standard RNN encoder-decoder approach. The proposed solution leverages: (1) a graph attention layer (GAT) for representing strong correlations among the arm joints; (2) a residual skip connection (RES) which emphasizes the role of the goal state (e.g., the position of the lifted object) over other state variables (e.g., the pose of the arm). Results (Table 1 and Table 2) provide ablation studies which suggest that all these components are necessary to improve the success rate on two different bimanual tasks.
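For clarity, here is a minimal sketch of the two components named above: a GAT-style attention-weighted aggregation over node features, followed by a residual skip connection that adds the goal object's features back after aggregation. All function names and the scalar attention scores are hypothetical; this is not the paper's implementation, only the core arithmetic of the two mechanisms.

```python
import math

def attention_aggregate(node_feats, scores):
    # Softmax over (hypothetical, precomputed) attention scores, then a
    # weighted sum of node features -- the core operation of a GAT-style layer.
    exp = [math.exp(s) for s in scores]
    total = sum(exp)
    weights = [e / total for e in exp]
    dim = len(node_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, node_feats))
            for d in range(dim)]

def decode_step(node_feats, scores, goal_feat):
    # Residual skip connection (RES): the goal object's features are added
    # back after aggregation, emphasizing the goal state over the other
    # state variables.
    agg = attention_aggregate(node_feats, scores)
    return [a + g for a, g in zip(agg, goal_feat)]
```

In the actual model the attention scores would be computed from learned weights over the graph of robot links and objects, but the softmax-weighted aggregation and the additive goal residual are the two ingredients the ablations isolate.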
Weaknesses: Performance (i.e., success rate) on the more difficult task (peg-in-hole) is relatively low, and the videos seem to suggest that the definition of success for this task was quite generous. It therefore seems that the proposed approach does not generalize well to bimanual tasks that require high precision, or robustness with respect to the initial state of the environment.
Clarity: Yes, the paper is well written. A few typos:
- Page 4, line 130: "dynamics function" -> "dynamic functions"
- Page 8, line 230: "star" -> "start"
Relation to Prior Work: Yes
Additional Feedback: For some reason the submitted PDF is encrypted. As a result some features (e.g. copy-paste and search) are not possible on the provided PDF. This makes the review slightly less comfortable.