Generalization

Physical Perturbation

The control model works in a closed-loop fashion, taking into account potential discrepancies (and perturbations) between the desired robot motion and the motion actually executed. In the physical perturbation experiment, we push the robot off its path by applying an external force after roughly 30% of the distance-to-goal has been covered. The two videos in Figure 3 show the model's ability to recover from such perturbations.

Figure 3a: "Raise the blue cup"

Figure 3b: "Fill a little into the cobalt bowl"

The motion trajectories resulting from the perturbations induced in Figure 3 are shown in Figure 4 below. The gray trajectory indicates the original robot movement without perturbation, whereas the green and red trajectories indicate the motion actually executed by the robot after a force was applied. The figures show that the model quickly recovers from the perturbation and returns to the correct trajectory, successfully completing the respective tasks.

Figure 4a: Motion trajectory and induced perturbation during the picking motion

Figure 4b: Motion trajectory and induced perturbation during the pouring motion
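
For readers who want to reproduce this setup, the sketch below illustrates how such a perturbation trial could be scripted. The interface names (`robot`, `model`, `predict_next_joints`, `apply_external_force`) are placeholders rather than our actual control API: the external push is triggered once roughly 30% of the distance-to-goal has been covered, and the closed-loop controller simply keeps re-planning from the perturbed state.

```python
import numpy as np

def run_perturbation_trial(robot, model, command, goal_pos, push_force,
                           push_at=0.3, hz=30):
    """Execute a language-conditioned task and inject a single push
    once ~30% of the initial distance-to-goal has been covered."""
    start_pos = robot.end_effector_position()
    total_dist = np.linalg.norm(goal_pos - start_pos)
    pushed = False
    trajectory = []

    while not robot.task_done():
        state = robot.observe()                              # current joint/image state
        action = model.predict_next_joints(command, state)   # closed-loop re-planning
        robot.execute(action, duration=1.0 / hz)
        trajectory.append(robot.end_effector_position())

        # Trigger the perturbation once ~30% of the distance has been covered.
        covered = 1.0 - np.linalg.norm(goal_pos - trajectory[-1]) / total_dist
        if not pushed and covered >= push_at:
            robot.apply_external_force(push_force)           # push the arm off its path
            pushed = True

    return np.array(trajectory)
```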

Due to its closed-loop nature, our model continuously adapts to changes in the robot state and generates new trajectories by re-generating the motor primitive, allowing for various adaptations. Figure 5a shows the joint angle of the gripper in a picking task, changing from open (value 0) to closed (value 1). The execution of the grasp (blue trajectory) was delayed in time relative to the early motion predictions plotted in yellow; predictions made later, after more observations of the current robot state were available, are shown in red. Figure 5b shows the pouring motion of the robot's wrist joint: the overall pouring duration was increased by starting the pouring earlier and ending it later than initially anticipated, while the pouring angle was slightly reduced. Both figures illustrate our model's ability to adapt the planned motion over time.

Figure 5a: Motion adjustment for the gripper joint over time (picking action). This joint controls the closing angle of the gripper (from 0 = open to 1 = fully closed). The blue trajectory was executed; the yellow trajectories were early motion predictions.

Figure 5b: Motion adjustment for the wrist joint over time (pouring motion). The wrist joint mainly determines the dispensed quantity by causing the rotational motion. The blue trajectory was executed; the yellow trajectories were early motion predictions.
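
One way to read Figure 5 is as repeated re-generation of the underlying motor primitive: at every control step the network predicts a fresh trajectory for the remainder of the motion and only the next setpoint is executed, so the predicted grasp or pour timing can shift as more observations arrive. The sketch below illustrates this receding-horizon style of execution; the concrete primitive parameterization (radial-basis-function weights plus a phase estimate) and the `predict_primitive` call are assumptions made for illustration, not the exact outputs of our model.

```python
import numpy as np

def rbf_trajectory(weights, phase, n_basis=11, horizon=50):
    """Decode a joint trajectory from basis-function weights,
    starting at the current phase estimate (illustrative primitive)."""
    centers = np.linspace(0.0, 1.0, n_basis)
    phases = np.linspace(phase, 1.0, horizon)            # remaining part of the motion
    basis = np.exp(-((phases[:, None] - centers[None, :]) ** 2) / 0.01)
    basis /= basis.sum(axis=1, keepdims=True)
    return basis @ weights                                # shape: (horizon, n_joints)

def control_step(model, command, state):
    """Re-generate the primitive from the latest observation and return
    only the next joint setpoint (receding-horizon execution)."""
    weights, phase = model.predict_primitive(command, state)   # hypothetical API
    trajectory = rbf_trajectory(weights, phase)
    return trajectory[0], trajectory   # execute the first step, keep the rest for plotting
```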

Verbal Generalization

To generate training and test data, we asked five human experts to provide templates for verbal task descriptions. These templates are then used as sentence generators from which many sentences can be derived via synonym replacement. To generate a task description, we identify a minimal set of visual features required to uniquely identify the target object, breaking ties randomly. Synonyms for objects, visual feature descriptors, and verbs are chosen at random and inserted into a randomly chosen template sentence. The synonyms used during training are shown in the following table; a sketch of the generation procedure follows the table.

Baseword | Synonyms used during training
round | round, curved
square | square, rectangular
small | small, tiny, smallest, petite, meager
large | large, largest, big, biggest, giant, grand
red | red, ruby, cardinal, crimson, maroon, carmine
green | green, olive, jade
blue | blue, azure, cobalt, indigo
yellow | yellow, amber, bisque, blonde, gold, golden
pink | pink, salmon, rose
cup | cup, container, grail, stein
pick | pick_up, gather, take_up, grasp, elevate, lift, raise, lift_up, grab
pour | pour, spill, fill
little | a little, some, a small amount
much | everything, all of it, a lot
bowl | bowl, basin, dish, pot
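
A compact sketch of this generation procedure is shown below. The template strings, feature keys, and the abridged synonym dictionary are illustrative stand-ins rather than our actual data-generation code; the core ideas are the ones described above: sample a template, describe the target with a minimal (randomly tie-broken) set of distinguishing features, and replace every base word with a random synonym from the table.

```python
import random
from itertools import combinations

# Abridged synonym table (see above); keys are base words.
SYNONYMS = {
    "pick":  ["pick up", "gather", "grasp", "lift", "raise", "grab"],
    "cup":   ["cup", "container", "grail", "stein"],
    "blue":  ["blue", "azure", "cobalt", "indigo"],
    "small": ["small", "tiny", "smallest", "petite"],
    "round": ["round", "curved"],
}

# Illustrative templates of the kind provided by the human experts.
PICK_TEMPLATES = [
    "{verb} the {features} {noun}",
    "{verb} the {features} {noun} from the table",
]

def synonym(word):
    """Replace a base word by a random synonym (or keep it unchanged)."""
    return random.choice(SYNONYMS.get(word, [word]))

def minimal_features(target, distractors, keys=("color", "size", "shape")):
    """Find a smallest set of feature values that uniquely identifies the
    target among the distractors; ties between equally small sets are
    broken at random."""
    for size in range(len(keys) + 1):
        candidates = [c for c in combinations(keys, size)
                      if not any(all(d.get(k) == target[k] for k in c)
                                 for d in distractors)]
        if candidates:
            return [target[k] for k in random.choice(candidates)]
    return [target[k] for k in keys]

def generate_pick_sentence(target, distractors):
    """Sample a template and fill it with synonym-replaced words."""
    feats = " ".join(synonym(f) for f in minimal_features(target, distractors))
    sentence = random.choice(PICK_TEMPLATES).format(
        verb=synonym("pick"), features=feats, noun=synonym(target["type"]))
    return " ".join(sentence.split())   # collapse double spaces if no feature is needed

# Example:
# target = {"type": "cup", "color": "blue", "size": "small", "shape": "round"}
# distractors = [{"type": "cup", "color": "red", "size": "small", "shape": "round"}]
# generate_pick_sentence(target, distractors)  ->  e.g. "grab the cobalt cup"
```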

Subsequently, we evaluate our model's performance when interacting with a new set of four human users, from whom we collect 160 new sentences that are used directly as input to our model. A selection reflecting the variability of the sentences provided by these new users is shown in the following table, highlighting words and sentence structures that were not part of the training set.

New words and sentences for the picking task | New words and sentences for the pouring task
Pick up the dark blue cup | Lightly pour into the blue bowl
Grab the red cup | Completely empty it into the small green dish
Hold the green veil | Fully pour it into the red bowl
Pick up the blue cup | Pour half of it into the square green bowl
Lift the cup | Put all its content into the square bowl
Take the dark blue cup | Dump everything it into the large yellow pot
Select the red cup | Pour a small amount into the yellow bowl
Pick the green cup up | Fill it all into the small red bowl

When tested with these new language commands, our model successfully performs the entire task sequence in 64% of the cases. The trajectory error nearly doubles, but the model maintains a reasonable success rate. Most of the failed task sequences result from a deterioration in pouring performance (a reduction from 85% to 69%), while picking performance remains high at 93%.

Figure 6 below shows the importance of the task instruction for disambiguating which object in the environment to use. The environment is exactly the same in both cases, while the task descriptions instruct the robot to dispense different quantities into different bowls. This result underlines our model's ability to generate language-conditioned control policies.

Figure 6a: "Fill a small amount into the rectangular rose basin"

Figure 6b: "Fill everything into the round bowl"

Illumination Changes

Figure 7 shows examples of the same task executed under different illumination conditions. This experiment highlights the ability of our approach to cope with perceptual disturbances. Evaluating the model under these conditions yields a task completion rate of 62%. The main source of the accuracy loss is the detection network misclassifying or failing to detect the target object.

Figure 7: Example of varying lighting conditions created by two randomly colored and placed lights in the robot's environment
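
The lighting perturbation itself amounts to sampling two point lights with random color and placement before each trial. The sketch below shows one way such a sampler could look; the `scene.add_light` call stands in for whatever simulator or lab interface is actually used and is not part of our system.

```python
import random

def randomize_lighting(scene, n_lights=2,
                       x_range=(-1.0, 1.0), y_range=(-1.0, 1.0), height=1.5):
    """Place n_lights point lights with random RGB color and random
    position above the workspace before a trial starts."""
    for _ in range(n_lights):
        color = [random.uniform(0.2, 1.0) for _ in range(3)]   # random RGB
        position = [random.uniform(*x_range),
                    random.uniform(*y_range),
                    height]
        scene.add_light(color=color, position=position)         # hypothetical API
```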

Results

While the task success rate is the most important metric in our dynamic control scenario, the following table also reports other metrics indicative of the performance of our method. Picking and Pouring show the individual success rates of the two tasks, whereas Sequential indicates the success rate of executing both tasks in sequence. Object Detection Rate indicates the success rate of the semantic model in identifying the correct objects. Material Delivery reports the percentage of granular material that was delivered to the correct bowl during the pouring action. Correct Quantity describes the percentage of trials in which the dispensed quantity matched the modifier in the verbal command, e.g., "a lot" or "a little". Trajectory MAE is the mean absolute error, in radians, between the executed and the demonstrated robot trajectory over all six joints of the robot arm. Finally, Distance to Goal reports the distance between the robot's tool center point and the center of the target object in centimeters (a sketch of how the two trajectory metrics are computed is given after the table).

Picking | Pouring | Sequential | Object Detection Rate | Material Delivery | Correct Quantity | Trajectory MAE | Distance to Goal
98% | 85% | 84% | 94% | 94% | 94% | 0.05 rad | 4.85 cm
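
For reference, the two trajectory metrics in the table can be computed as sketched below, assuming time-aligned executed and demonstrated joint trajectories of shape (T, 6) in radians and tool/object positions given in centimeters.

```python
import numpy as np

def trajectory_mae(executed, demonstrated):
    """Mean absolute error in radians, averaged over all time steps
    and all six joints (trajectories given as arrays of shape (T, 6))."""
    return np.mean(np.abs(executed - demonstrated))

def distance_to_goal(tcp_position, target_center):
    """Euclidean distance in centimeters between the robot's tool
    center point and the center of the target object."""
    return np.linalg.norm(np.asarray(tcp_position) - np.asarray(target_center))
```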