Reproducibility

For convenience, the supplemental material contains the code submitted for review. However, for more recent code and additional documentation, please have a look at our GitHub: https://github.com/ir-lab/LanguagePolicies

The full source code is provided as part of the supplemental material, including code to generate data, train the model, and evaluate the results.

Requirements

To reproduce our results, we provide a Docker container in which our model can be run. The setup has been tested with Docker 19.03.7 on Ubuntu 18.04 but should work on other versions as well. Please note that building the Docker image requires a download of roughly 4 GB.

Environment Setup

To build the container image, please run the following command in the root directory of the supplemental material.

docker build --no-cache -t neurips2020:submission12184 .

To start the container, please run the following command, choosing a free network port on your computer. The example below uses 6080; if you encounter issues binding that port, feel free to select any other available port.

docker run -p 6080:80 -e RESOLUTION=1280x720 --rm neurips2020:submission12184
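
For example, if port 6080 is already taken on your machine, you can map the container to another free port (here 8080) and use that port in the URL below instead:

docker run -p 8080:80 -e RESOLUTION=1280x720 --rm neurips2020:submission12184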

The container may need a minute to start up fully. Please wait until some terminal output appears before continuing to the next step.

To access the container, please direct your browser to the following URL on the port that you have chosen.

http://127.0.0.1:6080/

This should provide you with a virtual desktop that is streamed inside your browser window.

Training

On the desktop, open the start menu (lower-left corner) and open a new terminal by going into System Tools $\rightarrow$ LXTerminal. In that terminal, change into the Code directory and run the main file:

cd Code
python main.py

This will train our model on the full dataset with the default hyperparameters. Depending on the hardware used, training takes around 35 hours over 200 epochs until full convergence. Our models were trained on a machine equipped with two Intel Xeon E5-2699A v4 @ 2.40 GHz processors and no GPUs. Empirically, training on the CPU converges significantly faster than training on a GPU because of the custom recurrent structure of our low-level controller, which prevents the model from taking advantage of cuDNN's optimizations for recurrent layers.
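
If your machine has a GPU and you want to replicate the CPU-only setup described above, one option is to hide the GPU from TensorFlow before training. This is a minimal sketch under the assumption that the model is implemented in TensorFlow 2 (suggested by the TFRecord data and TensorBoard logs); it is not part of main.py:

import os

# Hide all CUDA devices so that TensorFlow falls back to the CPU.
# This must be set before TensorFlow is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf

# Alternatively, restrict TensorFlow to the CPU explicitly after importing it.
tf.config.set_visible_devices([], "GPU")
print("Visible GPUs:", tf.config.get_visible_devices("GPU"))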

Training the model will create a new directory Data within the Code directory, containing the folders Models and TBoardLog. Inside an additional folder with a randomly generated experiment name, you can find the current model, the best model so far, as well as the TensorBoard logs written during training.
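
To monitor training progress, you can point TensorBoard at the log directory (assuming TensorBoard is installed in the container and the command is run from within the Code directory) and open the URL it prints, by default http://localhost:6006/, in the virtual desktop's browser:

tensorboard --logdir Data/TBoardLog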

Dataset

Due to size limitations, we do not provide the unprocessed dataset (150 GB) in the supplemental materials, but rather the processed dataset, contained in the two TFRecord files train.tfrecord and validate.tfrecord. The test data are provided in raw form. All data can be found in the GDrive archive inside the Docker container or can alternatively be downloaded at the link provided in the sidebar to the left. Overall, our dataset contains 22,500 complete task demonstrations, each composed of the two sub-tasks (grasping and pouring), resulting in 45,000 training samples. Of these samples, we use 4,000 for validation and 1,000 for testing, leaving 40,000 for training. The results are reported on 100 full tasks randomly selected from the test set. Each of these 100 tasks consists of two actions, resulting in 200 interactions.
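
As a quick sanity check, the TFRecord files can be inspected with TensorFlow's data API. The snippet below is only a sketch: it counts the serialized records and prints the size of the first one, while the actual feature specification used for parsing is defined by the data-loading code in the repository. Adjust the path to wherever train.tfrecord is located on your system.

import tensorflow as tf

# Load the processed training split.
dataset = tf.data.TFRecordDataset("train.tfrecord")

# Count the serialized examples.
num_records = sum(1 for _ in dataset)
print("Number of training records:", num_records)

# Peek at the size of the first serialized example.
for raw_example in dataset.take(1):
    print("First record has", len(raw_example.numpy()), "bytes")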

Hyper-Parameters

The following hyper-parameters are used in our default configuration: the learning rate as well as the weights of the auxiliary losses. The range given for each value indicates the values that were explored during the hyper-parameter search; a reasonable range for each parameter was determined prior to the automated search. The automated search for suitable parameters was conducted with sklearn.model_selection.RandomizedSearchCV; a simplified sampling sketch follows the list below.

  • Learning Rate: 0.0001 (searched range: 0.00005 – 0.01)
  • Attention Weight: 1.0 (fixed reference)
  • Weight Difference: 50.0 (searched range: 0.1 – 100.0)
  • Trajectory Reconstruction: 5.0 (searched range: 0.1 – 30.0)
  • Phase Progression: 14.0 (searched range: 0.1 – 30.0)
  • Phase Estimation: 1.0 (searched range: 0.1 – 30.0)
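
For illustration, the snippet below shows how candidate configurations could be drawn from the ranges above. It is a minimal sketch that uses sklearn.model_selection.ParameterSampler instead of our full RandomizedSearchCV setup (which additionally requires wrapping the model as an sklearn estimator); the parameter names are hypothetical and merely mirror the list above:

from scipy.stats import loguniform, uniform
from sklearn.model_selection import ParameterSampler

# Search ranges mirroring the list above (parameter names are illustrative).
param_distributions = {
    "learning_rate": loguniform(0.00005, 0.01),
    "w_attention": [1.0],                       # fixed reference weight
    "w_weight_difference": uniform(0.1, 99.9),  # 0.1 - 100.0
    "w_trajectory": uniform(0.1, 29.9),         # 0.1 - 30.0
    "w_phase_progression": uniform(0.1, 29.9),  # 0.1 - 30.0
    "w_phase_estimation": uniform(0.1, 29.9),   # 0.1 - 30.0
}

# Draw a handful of candidate configurations to train and compare.
for config in ParameterSampler(param_distributions, n_iter=5, random_state=0):
    print(config)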

Pre-Trained Models

Our supplemental material contains a pre-trained model that you can use to reproduce some of the results. It is already part of the Docker container and can alternatively be downloaded as part of the GDrive archive from the sidebar on the left. Furthermore, we also include a fine-tuned version of Faster RCNN, which is used in our model to determine object positions and classes. This fine-tuning of Faster RCNN was done on a single NVIDIA Quadro P6000 GPU over the 40,000 samples of our training data and took approximately 15 minutes. The two external modules we use are Faster RCNN and GloVe, in the following versions (a sketch for loading the GloVe vectors follows the list):

  • Faster RCNN Based on ResNet 101 and trained on the COCO dataset
  • GloVe Trained on Wikipedia 2014 and Gigaword 5 (6B tokens, 400k vocab, 50 dimensions)
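
As mentioned above, the following sketch shows one way to load the 50-dimensional GloVe vectors into a dictionary. It assumes the standard file name glove.6B.50d.txt from the GloVe 6B download; the embedding code in our repository may organize this differently:

import numpy as np

def load_glove(path="glove.6B.50d.txt"):
    """Load GloVe vectors from a whitespace-separated text file."""
    embeddings = {}
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip().split(" ")
            # First token is the word, the remaining 50 tokens are the vector.
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

glove = load_glove()
print("Vocabulary size:", len(glove))
print("Vector for 'cup' (first 5 dims):", glove["cup"][:5])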

Evaluation

You can reproduce the results of our model by re-running the evaluation on the 100 test tasks. On the desktop of the Docker container, go to the start menu (lower-left corner) and open two new terminals by going into System Tools $\rightarrow$ LXTerminal. In the first terminal, navigate into the Code directory and start the network service:

cd Code
python service.py

This will run a ROS 2 node that provides a service the robot can call to generate the next control command $\mathbf{r}_{t+1}$; a minimal service skeleton is sketched below, after the evaluation commands. In the second terminal, you can start CoppeliaSim, which will automatically run the 100 test experiments as individual tasks. To do so, please type the following in the second terminal:

cd Code
python val_model_vrep.py

The results of the evaluation run will be shown at the end of the interactions.
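
For reference, service.py exposes the model through a ROS 2 service, presumably built on rclpy, the ROS 2 Python client library. The skeleton below is only a minimal, hypothetical sketch of that pattern: it uses the generic std_srvs Trigger type and an illustrative service name as stand-ins, whereas the actual service carries the robot state and returns the predicted control command $\mathbf{r}_{t+1}$:

import rclpy
from rclpy.node import Node
from std_srvs.srv import Trigger  # stand-in for the model's actual service type


class PolicyServer(Node):
    """Hypothetical minimal node exposing a service the robot can call."""

    def __init__(self):
        super().__init__("policy_server")
        # In service.py, the callback would run the trained model and return r_{t+1}.
        self.srv = self.create_service(Trigger, "get_next_command", self.handle_request)

    def handle_request(self, request, response):
        response.success = True
        response.message = "next control command would be returned here"
        return response


def main():
    rclpy.init()
    rclpy.spin(PolicyServer())
    rclpy.shutdown()


if __name__ == "__main__":
    main()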

CoppeliaSim will output a warning (3 times) about non-convex shapes being part of the simulation. Please check “do not show this again” and confirm the warning.

After the simulation of the test data has completed, a val_results.json file will be created in the Code directory, containing all the data needed to generate one entry in our paper's results table. You can run viz_val_vrep.py to generate one line of LaTeX code that represents the results for the tested model setting (here, our model). Hence, the results shown in the table are generated automatically and do not involve any manual steps. When evaluating on the full 100 test scenarios, the output should look similar to the following line:

0.98 & 0.85 & 0.84 & 0.94 & 0.94 & 0.88 & 0.05 & 4.85 & 0.83 & 0.83 & 0.85 & 1.00 & 0.88 & 1.00 & 0.70 & 0.89

Evaluating our model on all test scenarios takes 2-3 hours, depending on your hardware. The number of tested demonstrations can be changed at the end of val_model_vrep.py and is set to 10 by default. For reference, we include the test results of our model in ours_full_cl.json. This JSON file has the following structure (a minimal loading sketch follows the outline below) and can be used as input to viz_val_vrep.py by changing the path at the bottom of that file:

  • phase_1 $\rightarrow$ All experiments of the picking task
    • <experiment name> $\rightarrow$ Name of the experiment
      • language
        • original $\rightarrow$ Original voice command used
        • features $\rightarrow$ Number of features needed (ground-truth values)
        • quantity $\rightarrow$ Quantity to be dispensed (ground-truth values)
      • trajectory
        • gt $\rightarrow$ Demonstrated trajectory
        • state $\rightarrow$ Trajectory executed by the model
      • locations
        • target $\rightarrow$ Target position on the table
        • tid $\rightarrow$ Ground-truth target ID of the object
        • tid/actual $\rightarrow$ IDs of the closest cup and bowl, as well as their distances in cm
        • current $\rightarrow$ Current position of the robot during the picking action
        • distance $\rightarrow$ Distance between the robot’s gripper and the ground-truth target object in cm
      • success $\rightarrow$ Deprecated value
  • phase_2 $\rightarrow$ All experiments of the pouring task
    • <experiment name> $\rightarrow$ Name of the experiment
      • language
        • original $\rightarrow$ Original voice command used
        • features $\rightarrow$ Number of features needed (ground-truth values)
        • quantity $\rightarrow$ Quantity to be dispensed (ground-truth values)
      • trajectory
        • gt $\rightarrow$ Demonstrated trajectory
        • state $\rightarrow$ Trajectory executed by the model
      • locations
        • target $\rightarrow$ Target position on the table
        • tid $\rightarrow$ Ground-truth target ID of the object
        • tid/actual $\rightarrow$ IDs of the closest cup and bowl, as well as their distances in cm
        • current $\rightarrow$ Current position of the robot during the pouring action
        • distance $\rightarrow$ Distance between the robot’s gripper and the ground-truth target object in cm
      • ball_array $\rightarrow$ Array describing which of the dispensed objects went into the bowl
      • success $\rightarrow$ Deprecated value
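
As referenced above, the snippet below is a minimal sketch of how ours_full_cl.json (or your own val_results.json) could be loaded and summarized. It only relies on the keys outlined above and assumes that distance is stored as a single number per experiment; the summary statistic computed here (mean gripper distance over the picking phase) is merely an example and not the metric computation performed by viz_val_vrep.py:

import json

# Load the provided reference results (or your own val_results.json).
with open("ours_full_cl.json", "r") as fh:
    results = json.load(fh)

# Example: average distance (in cm) between the gripper and the ground-truth
# target object across all picking-phase experiments.
distances = [
    experiment["locations"]["distance"]
    for experiment in results["phase_1"].values()
]
print("Picking experiments:", len(distances))
print("Mean gripper distance [cm]:", sum(distances) / len(distances))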