Xiaojie Jin, Huaxin Xiao, Xiaohui Shen, Jimei Yang, Zhe Lin, Yunpeng Chen, Zequn Jie, Jiashi Feng, Shuicheng Yan
It is important for intelligent systems, e.g. autonomous vehicles and robotics to anticipate the future in order to plan early and make decisions accordingly. Predicting the future scene parsing and motion dynamics helps the agents better understand the visual environment better as the former provides dense semantic segmentations, i.e. what objects will be present and where they will appear, while the latter provides dense motion information, i.e. how the objects move in the future. In this paper, we propose a novel model to predict the scene parsing and motion dynamics in unobserved future video frames simultaneously. Using history information (preceding frames and corresponding scene parsing results) as input, our model is able to predict the scene parsing and motion for arbitrary time steps ahead. More importantly, our model is superior compared to other methods that predict parsing and motion separately, as the complementary relationship between the two tasks are fully utilized in our model through joint learning. To our best knowledge, this is the first attempt in jointly predicting scene parsing and motion dynamics in the future frames. On the large-scale Cityscapes dataset, it is demonstrated that our model produces significantly better parsing and motion prediction results compared to well established baselines. In addition, we also show our model can be used to predict the steering angle of the vehicles, which further verifies the ability of our model to learn underlying latent parameters.