
This folder contains additional qualitative results for each dataset and 
object representation.
Each image shows one frame from each of several videos; thus, sequences of 
three (for rooms) or six (for traffic) consecutive images constitute 
complete videos.
Each column shows a separate video. The top ten rows (nine for traffic) 
show the decomposition of an input video, while the bottom five rows show 
a generated video that is unrelated in content but is conditioned on the 
same camera parameters.
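The frame-to-video structure described above can be expressed as a small helper. This is a minimal sketch, assuming the image paths are already in temporal order (for example, obtained by sorting the file names); the names `FRAMES_PER_VIDEO` and `group_into_videos` are hypothetical and not part of this folder.

```python
from typing import Dict, List, Sequence

# Consecutive images per complete video, per dataset (from this README).
FRAMES_PER_VIDEO: Dict[str, int] = {"rooms": 3, "traffic": 6}


def group_into_videos(frame_paths: Sequence[str], dataset: str) -> List[List[str]]:
    """Split an ordered list of frame images into complete videos.

    Hypothetical helper: assumes ``frame_paths`` is already in temporal
    order. Each returned group holds every frame of the videos shown in
    the columns of those images.
    """
    n = FRAMES_PER_VIDEO[dataset]
    if len(frame_paths) % n != 0:
        raise ValueError(f"expected a multiple of {n} images, got {len(frame_paths)}")
    # Chunk the ordered list into consecutive runs of n images.
    return [list(frame_paths[i:i + n]) for i in range(0, len(frame_paths), n)]
```

For example, six ordered rooms images yield two groups of three, each group forming one complete set of videos.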

The rows are:
 1. input frame
 2. reconstructed frame
 3. reconstructed background
 4. reconstructed objects
 5. ground-truth normal map (rooms only; for traffic this row is absent, 
    so every subsequent row moves up by one)
 6. predicted normal map
 7. ground-truth depth map
 8. predicted depth map
 9. ground-truth instance segmentation
10. predicted instance segmentation
11. generated frame (content unrelated to the input above)
12. generated background
13. generated objects
14. generated depth map
15. generated instance segmentation
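The row layout above, including the missing ground-truth normal-map row for traffic, can be captured in a lookup. This is a minimal sketch; the names `row_labels` and `is_generated_row` are hypothetical, and rows are indexed from 0 rather than from 1 as in the list.

```python
from typing import List

# Row labels in top-to-bottom order, as listed in this README (rooms layout).
_ALL_ROWS: List[str] = [
    "input frame",
    "reconstructed frame",
    "reconstructed background",
    "reconstructed objects",
    "ground-truth normal map",
    "predicted normal map",
    "ground-truth depth map",
    "predicted depth map",
    "ground-truth instance segmentation",
    "predicted instance segmentation",
    "generated frame",
    "generated background",
    "generated objects",
    "generated depth map",
    "generated instance segmentation",
]


def row_labels(dataset: str) -> List[str]:
    """Return the row labels of one image for the given dataset.

    Traffic images lack the ground-truth normal-map row, so every row
    after 'reconstructed objects' moves up by one.
    """
    if dataset == "rooms":
        return list(_ALL_ROWS)
    if dataset == "traffic":
        return [r for r in _ALL_ROWS if r != "ground-truth normal map"]
    raise ValueError(f"unknown dataset: {dataset!r}")


def is_generated_row(dataset: str, row: int) -> bool:
    """True if the 0-based ``row`` belongs to the bottom (generated) block."""
    return row_labels(dataset)[row].startswith("generated")
```

With this layout, rooms images have ten decomposition rows and traffic images have nine, each followed by the same five generated rows.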


