The following video shows the type of interaction that our model allows. Note that speech-to-text conversion is not part of the system; it is performed in a separate step using Google Cloud APIs.
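Because speech recognition is kept outside the model, this step can be swapped for any transcription service. Below is a minimal sketch of such a step, assuming the `google-cloud-speech` Python client and a 16 kHz mono WAV recording of the spoken command; the exact configuration used in our pipeline is not shown here.

```python
# Minimal speech-to-text sketch using the google-cloud-speech client.
# The audio format, sample rate, and language code are assumptions for
# illustration, not the system's actual configuration.
from google.cloud import speech

def transcribe_command(wav_path: str) -> str:
    client = speech.SpeechClient()
    with open(wav_path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    response = client.recognize(config=config, audio=audio)
    # Concatenate the best hypothesis of each result into one command string.
    return " ".join(r.alternatives[0].transcript for r in response.results)

# Example: transcribe_command("command.wav") might return
# "fill a little into the large blonde bowl"
```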
We evaluate our approach on a simulated table-top robot task. In this task, a seven-DoF robot manipulator is taught by an expert to perform a combination of picking and pouring behaviors. After training, the model can control the robot given a new environment and a new verbal task description. The following videos show the trained model controlling the robot arm to perform the picking and pouring tasks.
Figure 1a: "Lift the container up"
Figure 1b: "Fill a little into the large blonde bowl"
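At test time, the interaction reduces to a simple closed loop: at each control step, the current camera image and the verbal task description are fed to the trained model, which returns the next low-level command for the arm. The sketch below is purely illustrative; `PickPourPolicy`, `get_camera_image`, `apply_command`, and `task_done` are hypothetical placeholders, not the actual implementation.

```python
# Illustrative closed-loop control sketch. The policy, camera, and robot
# interfaces are hypothetical placeholders standing in for the trained model,
# the simulated camera, and the arm controller.
def run_task(policy, env, instruction: str, max_steps: int = 500):
    """Roll out the trained policy for one verbal command in one environment."""
    for _ in range(max_steps):
        image = env.get_camera_image()                 # top-down RGB image I
        command = policy.predict(image, instruction)   # next joint/gripper command
        env.apply_command(command)
        if env.task_done():
            break

# Example usage (names are hypothetical):
# run_task(PickPourPolicy.load("model.ckpt"), env, "Pick the stein up")
```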
Figure 1a shows that the robot was able to reproduce the behavior demonstrated by the human operator. This goes beyond reaching the correct target: the robot also learned the motion needed to approach the object correctly. During the demonstrations, the robot was taught to approach a point in front of the object before transitioning to a linear motion segment in order to grasp the cup. Especially for objects close to the robot, the shape of this motion is essential for successful grasps. Figures 1b and 2b show the pouring action and highlight the model’s ability to dispense different quantities into the respective bowls. In Figure 1b, the robot was tasked to pour “a little” into the bowl, while in Figure 2b, it was tasked to pour “everything” into the bowl. As with the demonstrated picking actions, the shape of the demonstrated motion is essential for dispensing the correct quantities.
Figure 2a: "Pick the stein up"
Figure 2b: "Pour everything into the rectangular cardinal bowl"
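The demonstrated approach motion described above can be viewed as two segments: move to a pre-grasp waypoint offset in front of the object, then translate linearly onto the grasp pose. The sketch below shows one way such a Cartesian approach path could be constructed; the offset distance and interpolation scheme are assumptions for illustration, not values from our demonstrations.

```python
import numpy as np

def approach_waypoints(grasp_pos, approach_dir, offset=0.10, n_linear=20):
    """Pre-grasp waypoint followed by a straight-line approach to the grasp pose.

    grasp_pos:    3D position of the grasp point on the object.
    approach_dir: vector pointing from the pre-grasp waypoint toward the object.
    offset:       distance (m) of the pre-grasp waypoint in front of the object (assumed).
    """
    grasp_pos = np.asarray(grasp_pos, dtype=float)
    approach_dir = np.asarray(approach_dir, dtype=float)
    approach_dir /= np.linalg.norm(approach_dir)
    pre_grasp = grasp_pos - offset * approach_dir
    # Linearly interpolate the final segment so the hand moves straight onto the object.
    ts = np.linspace(0.0, 1.0, n_linear)[:, None]
    linear_segment = (1.0 - ts) * pre_grasp + ts * grasp_pos
    return np.vstack([pre_grasp, linear_segment])
```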
Both experiments show the robot’s ability to generalize to new tasks provided as a top-down image $\mathbf{I}$ and a verbal task description $v$. We summarize the results of testing our model on a set of 100 previously unseen environments. The model’s overall task success is the percentage of cases in which the cup was first lifted and then successfully poured into the correct bowl. This sequence of steps was executed successfully in 84% of the new environments. Picking alone achieves a 98% success rate, while pouring achieves 85%. We argue that the drop in performance is due to the increased linguistic variability in descriptions of the pouring behavior. These results indicate that the model appropriately generalizes the trained behavior to changes in object position, verbal command, and perceptual input.
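The reported numbers follow from straightforward bookkeeping: for each test environment, record whether the pick succeeded, whether the pour succeeded, and whether both succeeded in sequence. A sketch of this evaluation loop, assuming hypothetical `run_pick` / `run_pour` callables that execute one sub-task and report success:

```python
def evaluate(run_pick, run_pour, environments, tasks):
    """Compute pick, pour, and overall (pick-then-pour) success rates.

    run_pick / run_pour: callables that execute one sub-task in a given
    environment and return True on success (hypothetical helpers).
    tasks: list of (pick_command, pour_command) string pairs.
    """
    picks = pours = overall = 0
    for env, (pick_cmd, pour_cmd) in zip(environments, tasks):
        pick_ok = run_pick(env, pick_cmd)     # e.g. "Pick the stein up"
        pour_ok = run_pour(env, pour_cmd)     # e.g. "Pour everything into the ... bowl"
        picks += pick_ok
        pours += pour_ok
        overall += pick_ok and pour_ok        # cup lifted, then poured into the right bowl
    n = len(environments)
    return picks / n, pours / n, overall / n
```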