{"title": "Object-Oriented Dynamics Predictor", "book": "Advances in Neural Information Processing Systems", "page_first": 9804, "page_last": 9815, "abstract": "Generalization has been one of the major challenges for learning dynamics models in model-based reinforcement learning. However, previous work on action-conditioned dynamics prediction focuses on learning the pixel-level motion and thus does not generalize well to novel environments with different object layouts. In this paper, we present a novel object-oriented framework, called object-oriented dynamics predictor (OODP), which decomposes the environment into objects and predicts the dynamics of objects conditioned on both actions and object-to-object relations. It is an end-to-end neural network and can be trained in an unsupervised manner. To enable the generalization ability of dynamics learning, we design a novel CNN-based relation mechanism that is class-specific (rather than object-specific) and exploits the locality principle. Empirical results show that OODP significantly outperforms previous methods in terms of generalization over novel environments with various object layouts. OODP is able to learn from very few environments and accurately predict dynamics in a large number of unseen environments. In addition, OODP learns semantically and visually interpretable dynamics models.", "full_text": "Object-Oriented Dynamics Predictor\n\nGuangxiang Zhu, Zhiao Huang, and Chongjie Zhang\n\nInstitute for Interdisciplinary Information Sciences\n\nguangxiangzhu@outlook.com,hza14@mails.tsinghua.edu.cn,chongjie@tsinghua.edu.cn\n\nTsinghua University, Beijing, China\n\nAbstract\n\nGeneralization has been one of the major challenges for learning dynamics mod-\nels in model-based reinforcement learning. 
However, previous work on action-conditioned dynamics prediction focuses on learning pixel-level motion and thus does not generalize well to novel environments with different object layouts. In this paper, we present a novel object-oriented framework, called object-oriented dynamics predictor (OODP), which decomposes the environment into objects and predicts the dynamics of objects conditioned on both actions and object-to-object relations. It is an end-to-end neural network and can be trained in an unsupervised manner. To enable the generalization ability of dynamics learning, we design a novel CNN-based relation mechanism that is class-specific (rather than object-specific) and exploits the locality principle. Empirical results show that OODP significantly outperforms previous methods in terms of generalization over novel environments with various object layouts. OODP is able to learn from very few environments and accurately predict dynamics in a large number of unseen environments. In addition, OODP learns semantically and visually interpretable dynamics models.\n\n1 Introduction\n\nRecently, model-free deep reinforcement learning (DRL) has been extensively studied for automatically learning representations and decisions from visual observations to actions. Although such approaches are able to achieve human-level control in games [1, 2, 3, 4, 5], they are not efficient enough and cannot generalize across different tasks. To improve sample efficiency and enable generalization for tasks with different goals, model-based DRL approaches (e.g., [6, 7, 8, 9]) have been extensively studied. One of the core problems of model-based DRL is to learn a dynamics model through interacting with environments. 
Existing work on learning action-conditioned dynamics models [10, 11, 12] has made significant progress: it learns dynamics models that achieve remarkable performance in training environments [10, 11] and takes a step towards generalization over object appearances [12]. However, these models focus on learning pixel-level motions, and thus their learned dynamics models do not generalize well to novel environments with different object layouts. Cognitive studies show that objects are the basic units through which humans decompose and understand the world [13, 14, 15, 16], which indicates that object-based models are useful for generalization [17, 18, 19].\n\nIn this paper, we develop a novel object-oriented framework, called object-oriented dynamics predictor (OODP). It is an end-to-end neural network and can be trained in an unsupervised manner. It follows an object-oriented paradigm, which decomposes the environment into objects and learns the object-level dynamic effects of actions conditioned on object-to-object relations. To enable the generalization ability of OODP's dynamics learning, we design a novel CNN-based relation mechanism that is class-specific (rather than object-specific) and exploits the locality principle. This mechanism induces neural networks to distinguish objects based on their appearances, as well as their effects on an agent's dynamics.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nEmpirical results show that OODP significantly outperforms previous methods in terms of generalization over novel environments with different object layouts. OODP is able to learn from very few environments and accurately predict the dynamics of objects in a large number of unseen environments. 
In addition, OODP learns semantically and visually interpretable dynamics models, and demonstrates robustness to some changes of object appearance.\n\n2 Related Work\n\nAction-conditioned dynamics learning approaches have been proposed for learning the dynamic effects of an agent's actions from raw visual observations, and they achieve remarkable performance in training environments [10, 11]. CDNA [12] takes a further step towards generalization over object appearances and demonstrates its usage for model-based DRL [9]. However, these approaches focus on learning pixel-level motions and do not explicitly take object-to-object relations into consideration. As a result, they cannot generalize well across novel environments with different object layouts. An effective relation mechanism can encourage neural networks to focus attention on the moving objects whose dynamics needs to be predicted, as well as on the static objects (e.g., walls and ladders) that have an effect on the moving objects. The lack of such a mechanism also accounts for the fact that the composing masks in CDNA [12] are insensitive to static objects.\n\nRelation-based nets have shown remarkable effectiveness in learning relations for physical reasoning [19, 20, 21, 22, 23, 24]. However, they are not designed for action-conditioned dynamics learning. In addition, they have either manually encoded object representations [19, 20] or not demonstrated generalization across unseen environments with novel object layouts [20, 21, 22, 23, 24]. In this paper, we design a novel CNN-based relation mechanism. First, this mechanism formulates the spatial distribution of a class of objects as a class-specific object mask, instead of representing an individual object by a vector, which allows relation nets to handle a vast number of object samples and to distinguish objects by their specific dynamic properties. 
Second, we use neighborhood cropping and CNNs to exploit the locality principle of object interactions that commonly holds in the real world.\n\nObject-oriented reinforcement learning has been extensively studied. Its paradigm is that learning is based on object representations and the effects of actions are conditioned on object-to-object relations. For example, relational MDPs [17] and Object-Oriented MDPs [18] are proposed for efficient model-based planning or learning, which supports strong generalization. Cobo et al. [25] develop a model-free object-oriented learning algorithm to speed up classic Q-learning with compact state representations. However, these models require explicit encoding of the object representations [17, 18, 25] and relations [17, 18]. In contrast, our work aims at automatically learning object representations and object-to-object relations to support model-based RL. While approaches from object localization [26] or disentangled representation learning [27, 28] have been proposed for identifying objects, unlike our model, they cannot perform action-conditioned relational reasoning to enable generalizable dynamics learning.\n\n3 Object-Oriented Dynamics Predictor\n\nTo enable generalization over object layouts for dynamics learning, we develop a novel unsupervised end-to-end neural network framework, called Object-Oriented Dynamics Predictor (OODP). This framework takes video frames and agents' actions as input and learns the dynamics of objects conditioned on actions and object-to-object relations. Figure 1 illustrates the framework of OODP, which consists of three main components: Object Detector, Dynamics Net, and Background Extractor. In this framework, the input image is fed into two data streams: dynamics inference and background extraction. 
In the dynamics inference stream, Object Detector decomposes the input image into dynamic objects (e.g., the agent) and static objects (e.g., walls and ladders). Then, Dynamics Net learns the motions of dynamic objects based on both their relations with other objects and the actions of an agent (e.g., the agent moving upward on action up when touching a ladder). Once the motions are learned, we apply these transformations to the dynamic objects via a Spatial Transformer Network (STN) [29]. In the background extraction stream, Background Extractor extracts the background of the input image, which does not change over time. Finally, the extracted background is combined with the transformed dynamic objects to generate the prediction of the next frame. OODP assumes the environment is Markovian and deterministic, so it predicts the next frame based on the current frame and action.\n\nFigure 1: Overall framework of OODP.\n\nOODP follows an object-oriented dynamics learning paradigm. It uses object masks to bridge object perception (Object Detector) with dynamics learning (Dynamics Net) and to form a tensor bottleneck, which allows only object-level information to flow through. Each object mask has its own dynamics learner, which forces Object Detector to act as a detector for the object of interest and also as a classifier for distinguishing which kind of object has a specific effect on dynamics. In addition, we add an entropy loss to reduce the uncertainty of object masks, thus encouraging attention on key parts and learning invariance to irrelevant details.\n\nIn the rest of this section, we describe in detail the design of each main component of OODP and their connections.\n\n3.1 Object Detector\n\nObject Detector decomposes the input image into a set of objects. 
To simplify our model, we group the objects (denoted as O) into static and dynamic objects (denoted as S and D) so that we only need to focus on predicting the motions of the dynamic objects. An object Oi is represented by a mask M_Oi ∈ [0, 1]^{H×W}, where H and W denote the height and width of the input image I ∈ R^{H×W×3}, and the entry M_Oi(u, v) indicates the probability that the pixel I(u, v) belongs to the i-th object. The same class of static objects has the same effects on the motions of dynamic objects, so we use one mask M_Si (1 ≤ i ≤ nS, where nS denotes the number of static object classes) to represent each class of static objects. As different dynamic objects may have different motions, we use one mask M_Dj (1 ≤ j ≤ nD, where nD denotes the number of individual dynamic objects) to represent each individual dynamic object. Note that OODP does not require the actual number of objects in an environment, but a maximum number needs to be set. When the two do not match, some learned object masks may be redundant, which does not affect the accuracy of prediction (more details can be found in Supplementary Material).\n\nFigure 2 depicts the architecture of Object Detector. As shown in the figure, the pixels of the input image go through different CNNs to obtain different object masks. There are in total nO CNNs with the same architecture but without shared weights. The architecture details of these CNNs can be found in Supplementary Material. Then, the output layers of these CNNs are concatenated across channels and a pixel-wise softmax is applied on the concatenated feature map F ∈ R^{H×W×nO}. Let f(u, v, c) denote the value at position (u, v) in the c-th channel of F. 
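A minimal numpy sketch of this pixel-wise softmax over the channels of F (shapes and names are illustrative, not the paper's implementation):

```python
import numpy as np

def object_masks(feature_map):
    """Pixel-wise softmax over the channel axis of F with shape (H, W, n_O).

    Returns masks M with M[u, v, c] = p(O_c | I(u, v)); at every pixel the
    n_O mask values sum to 1.
    """
    # Subtract the per-pixel max for numerical stability before exponentiating.
    f = feature_map - feature_map.max(axis=-1, keepdims=True)
    e = np.exp(f)
    return e / e.sum(axis=-1, keepdims=True)
```
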
The entry M_Oc(u, v) of the c-th object mask, which represents the probability that the pixel I(u, v) of the input image belongs to the c-th object Oc, is calculated as\n\nM_Oc(u, v) = p(Oc | I(u, v)) = e^{f(u,v,c)} / Σ_{i=1}^{nO} e^{f(u,v,i)}.\n\nTo reduce the uncertainty of the affiliation of each pixel I(u, v) and encourage the object masks to obtain more discrete attention distributions, we introduce a pixel-wise entropy loss to limit the entropy of the object masks, defined as\n\nL_entropy = Σ_{u,v} Σ_{c=1}^{nO} −p(Oc | I(u, v)) log p(Oc | I(u, v)).\n\n3.2 Dynamics Net\n\nDynamics Net aims at learning the motion of each dynamic object conditioned on actions and object-to-object relations. Its architecture is illustrated in Figure 3. To improve computational efficiency and generalization ability, the Dynamics Net architecture incorporates a tailor module to exploit the locality principle and employs CNNs to learn the effects of object-to-object relations on the motions of dynamic objects. As the locality principle commonly holds for object-to-object relations in the real world, the tailor module lets the inference on the dynamics of objects focus on the relations with neighbouring objects. 
Specifically, it crops a "horizon" window of size w from the object masks, centered on the position of the dynamic object whose motion is being predicted, where w indicates the maximum effective range of object-to-object relations. The tailored local objects are then fed into the successive network layers to reason about their effects. Unlike most previous work, which uses fully connected networks for identifying relations [19, 20, 21, 22, 23], our dynamics inference employs CNNs. This is because CNNs provide a mechanism for responding strongly to spatially local patterns, and the multiple receptive fields in CNNs are capable of dealing with complex spatial relations expressed in distribution masks.\n\nFigure 2: Architecture of Object Detector.\n\nFigure 3: Architecture of Dynamics Net. We illustrate predicting the motion of M_Dj as an example. O\\{Dj} denotes O excluding Dj.\n\nTo demonstrate the detailed pipeline in Dynamics Net (Figure 3), we take as an example the computation of the predicted motion of the j-th dynamic object Dj. First, we describe the cropping process of the tailor module, which crops the object masks near the dynamic object Dj. For each dynamic object Dj, its position (ū_Dj, v̄_Dj) is defined as the expected location over its object mask M_Dj:\n\nū_Dj = (Σ_{u=1}^{H} Σ_{v=1}^{W} M_Dj(u, v))^{−1} Σ_{u=1}^{H} Σ_{v=1}^{W} u · M_Dj(u, v),\nv̄_Dj = (Σ_{u=1}^{H} Σ_{v=1}^{W} M_Dj(u, v))^{−1} Σ_{u=1}^{H} Σ_{v=1}^{W} v · M_Dj(u, v).\n\nThe "horizon" window of size w centered on (ū_Dj, v̄_Dj) is written as Bw = {(u, v) : ū_Dj − w/2 ≤ u ≤ ū_Dj + w/2, v̄_Dj − w/2 ≤ v ≤ v̄_Dj + w/2}. The tailor module receives Bw and performs cropping on the other object masks excluding M_Dj, that is, M_{O\\{Dj}}. 
This cropping process can be realized by bilinear sampling [29], a sub-differentiable sampling approach. Taking the cropping of M_O1 as an example, the pairwise transformation of the sampling grid is (u2, v2) = (u1 + ū_Dj − w/2, v1 + v̄_Dj − w/2), where (u1, v1) are coordinates of the cropped object mask C_O1 and (u2, v2) ∈ Bw are coordinates of the original object mask M_O1. Applying the bilinear sampling kernel on this grid computes the cropped object mask C_O1:\n\nC_O1(u1, v1) = Σ_{u=1}^{H} Σ_{v=1}^{W} M_O1(u, v) · max(0, 1 − |u2 − u|) · max(0, 1 − |v2 − v|),    (1)\n\nNote that the gradients w.r.t. (ū_Dj, v̄_Dj) are frozen to force the cropping module to focus on what to crop rather than where to crop, which differs from the vanilla bilinear sampling [29]. Then, each cropped object mask C_Oi is concatenated with constant x-coordinate and y-coordinate meshgrid maps, which makes the networks more sensitive to spatial information. The concatenated map is fed into its own CNNs to learn the effect of the i-th object on the j-th dynamic object, E(Oi, Dj) ∈ R^{2×na}, where na denotes the number of actions. There are in total (nO − 1) × nD CNNs for the (nO − 1) × nD pairs of objects. Using different CNNs for different objects forces each object mask to act as a classifier that distinguishes objects with specific dynamics. 
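The bilinear-kernel cropping of Equation (1) can be sketched with naive loops (illustrative only; a real implementation would vectorize and restrict the sums to the window):

```python
import numpy as np

def bilinear_crop(mask, center, w):
    """Crop a w-by-w "horizon" window from an object mask, centered on a
    (possibly fractional) object position, using the bilinear kernel of Eq. (1)."""
    H, W = mask.shape
    cu, cv = center
    out = np.zeros((w, w))
    for u1 in range(w):
        for v1 in range(w):
            # Pairwise grid transform: window coordinates -> mask coordinates.
            u2, v2 = u1 + cu - w / 2, v1 + cv - w / 2
            for u in range(H):
                for v in range(W):
                    out[u1, v1] += (mask[u, v]
                                    * max(0.0, 1 - abs(u2 - u))
                                    * max(0.0, 1 - abs(v2 - v)))
    return out
```

With an integer center the kernel reduces to an exact crop; with a fractional center it interpolates between neighbouring pixels, which is what keeps the operation sub-differentiable.
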
To predict the motion vector V^(t)_Dj ∈ R^2 for dynamic object Dj, all the object effects and a self effect E_self(Dj) ∈ R^{2×na}, which represents the natural bias, are summed and multiplied by the one-hot coding of the action a^(t) ∈ {0, 1}^{na}, that is,\n\nV^(t)_Dj = (Σ_{Oi∈O\\{Dj}} E(Oi, Dj) + E_self(Dj)) · a^(t).\n\nIn addition to the conventional prediction error L_prediction (described in Section 3.4), we introduce a regression loss to guide the optimization of M^(t)_O and V^(t)_D, which serves as a highway providing early feedback before reconstructing images. This regression loss is defined as\n\nL_highway = Σ_{j=1}^{nD} ||(ū_Dj, v̄_Dj)^(t) + V^(t)_Dj − (ū_Dj, v̄_Dj)^(t+1)||²₂.\n\n3.3 Background Extractor\n\nTo extract the time-invariant background, we construct Background Extractor with neural networks. Background Extractor takes the form of a traditional encoder-decoder structure. The encoder alternates convolutions [30] and Rectified Linear Units (ReLUs) [31], followed by two fully-connected layers, while the decoder alternates deconvolutions [32] and ReLUs. To avoid losing information in pooling layers, we replace all pooling layers by convolutional layers with increased stride [33, 34]. Further details of the architecture of Background Extractor can be found in Supplementary Material. Background Extractor takes the current frame I^(t) ∈ R^{H×W×3} as input and produces the background image I^(t)_bg ∈ R^{H×W×3}, whose pixels remain unchanged over time. 
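The action-conditioned effect aggregation of Section 3.2, which sums the pairwise effects with the self effect and selects one column with the one-hot action, can be sketched as follows (names and shapes illustrative):

```python
import numpy as np

def motion_vector(effects, self_effect, action, n_actions):
    """V_Dj = (sum_i E(O_i, D_j) + E_self(D_j)) . a, with one-hot action a.

    effects:     list of (2, n_a) per-object effect matrices E(O_i, D_j)
    self_effect: (2, n_a) natural-bias term E_self(D_j)
    action:      integer index of the chosen action
    """
    a = np.zeros(n_actions)
    a[action] = 1.0                      # one-hot coding of the action
    total = self_effect + sum(effects)   # (2, n_a) summed effect matrix
    return total @ a                     # (2,) motion for the chosen action
```
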
We use L_background to force the network to satisfy this property of time invariance, given by\n\nL_background = ||I^(t+1)_bg − I^(t)_bg||²₂.\n\n3.4 Merging Streams\n\nFinally, two streams of pixels are merged to produce the prediction of the next frame. One stream carries the pixels of dynamic objects, obtained by transforming the masked dynamic-object pixels of the current frame. The other stream provides the remaining pixels generated by Background Extractor. Spatial Transformer Network (STN) [29] gives neural networks the capability of spatial transformation. In the first stream, an STN accepts the learned motion vectors V and performs spatial transforms (denoted by STN(∗, V)) on the dynamic pixels. Bilinear sampling (similar to Equation 1) is also used in the STN to compute the transformed pixels in a sub-differentiable manner. The difference is that we allow the gradients of the loss to backpropagate to the object masks as well as the motion vectors. Then, we obtain the pixel frame of the dynamic objects in the next frame, that is,\n\nÎ^(t+1)_1 = Σ_{j=1}^{nD} STN(M^(t)_Dj · I^(t), V^(t)_Dj),\n\nwhere · denotes element-wise multiplication. The other stream computes the remaining pixels of Î^(t+1), that is,\n\nÎ^(t+1)_2 = (1 − Σ_{j=1}^{nD} STN(M^(t)_Dj, V^(t)_Dj)) · I^(t)_bg.\n\nThus, the output of OODP, Î^(t+1), is calculated by\n\nÎ^(t+1) = Î^(t+1)_1 + Î^(t+1)_2.\n\nWe use an l2 pixel loss to restrain the image prediction error, given by\n\nL_prediction = ||Î^(t+1) − I^(t+1)||²₂. 
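The two-stream merge of Section 3.4 can be sketched with numpy, substituting an integer-pixel np.roll for the differentiable STN warp; all names and shapes here are illustrative, not the paper's implementation:

```python
import numpy as np

def predict_next_frame(frame, dyn_masks, motions, background):
    """Compose the two OODP streams for one prediction step.

    frame:      (H, W, 3) current observation
    dyn_masks:  list of (H, W) soft masks, one per dynamic object
    motions:    list of (dy, dx) integer motion vectors (stand-in for STN output)
    background: (H, W, 3) time-invariant background image
    """
    H, W, _ = frame.shape
    moved_pixels = np.zeros_like(frame, dtype=float)
    moved_masks = np.zeros((H, W), dtype=float)
    for mask, (dy, dx) in zip(dyn_masks, motions):
        # Stream 1: transform the masked dynamic pixels (np.roll approximates
        # the spatial-transformer warp at integer offsets).
        moved_pixels += np.roll(mask[..., None] * frame, (dy, dx), axis=(0, 1))
        moved_masks += np.roll(mask, (dy, dx), axis=(0, 1))
    # Stream 2: fill the remaining pixels from the extracted background.
    return moved_pixels + (1.0 - moved_masks)[..., None] * background
```
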
We also add a similar l2 pixel loss for reconstructing the current image, that is,\n\nL_reconstruction = ||Σ_{j=1}^{nD} M^(t)_Dj · I^(t) + (1 − Σ_{j=1}^{nD} M^(t)_Dj) · I^(t)_bg − I^(t)||²₂.\n\nIn addition, we add a loss L_consistency to link the visual perception and dynamics prediction of the objects, which enables learning objects by integrating vision and interaction:\n\nL_consistency = Σ_{j=1}^{nD} ||M^(t+1)_Dj − STN(M^(t)_Dj, V^(t)_Dj)||²₂.\n\n3.5 Training Procedure\n\nOODP is trained with the following loss, which combines the previous losses with different weights:\n\nL_total = L_highway + λp L_prediction + λe L_entropy + λr L_reconstruction + λc L_consistency + λbg L_background.\n\nConsidering that signals derived from foreground detection can help Object Detector produce more accurate object masks, we use the simplest unsupervised foreground detection approach [35] to calculate a rough proposal dynamic region and then add an l2 loss to encourage the dynamic object masks to concentrate more attention on this region, that is,\n\nL_proposal = ||(Σ_{j=1}^{nD} M^(t)_Dj) − M^(t)_proposal||²₂,    (2)\n\nwhere M^(t)_proposal represents the proposal dynamic region. This additional loss facilitates training and makes the learning process more stable and robust. 
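The weighted loss combination of Section 3.5 can be sketched as (dictionary keys and weight values are illustrative):

```python
def total_loss(losses, weights):
    """L_total = L_highway + lp*L_pred + le*L_ent + lr*L_rec + lc*L_cons + lbg*L_bg.

    losses and weights are dicts keyed by loss name; L_highway has weight 1.
    """
    return losses["highway"] + sum(weights[k] * losses[k]
                                   for k in ("prediction", "entropy",
                                             "reconstruction", "consistency",
                                             "background"))
```
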
4 Experiments\n\n4.1 Experiment Setting\n\nWe evaluate our model on Monster Kong from the Pygame Learning Environment [36], which offers various scenes for testing generalization across object layouts (e.g., different numbers and spatial arrangements of objects). Different scenes share the same underlying physics engine, which simulates the real-world dynamics mechanism. For example, in each scene, an agent can move upward using a ladder, gets stuck when hitting a wall, and falls in free space. The agent explores various environments with a random policy over the actions up, down, left, right, and noop, and its gained experiences are used for learning dynamics. The code of OODP is available online at https://github.com/mig-zh/OODP.\n\nTo evaluate generalization, we compare our model with state-of-the-art action-conditioned dynamics learning approaches, the AC Model [10] and CDNA [12]. We evaluate all models on k-to-m zero-shot generalization problems (Figure 4), where they learn dynamics in k different training environments and are then evaluated in m different unseen testing environments with different object layouts. In this paper, we use m = 10 and k = 1, 2, 3, 4, and 5. The smaller the value of k, the more difficult the generalization problem. In this setting, truly generalizing to new scenes requires a learner to fully understand the object-level abstraction, object relationships, and dynamics mechanism behind the images, which is quite different from the conventional video prediction task and crucially challenging for existing learning models. 
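The evaluation below reports n-error accuracy; a minimal sketch of this metric, assuming Chebyshev (per-coordinate maximum) pixel distance, which is our reading rather than a detail the paper specifies:

```python
import numpy as np

def n_error_accuracy(pred_pos, true_pos, n):
    """Fraction of samples whose predicted agent location is within n pixels
    of the ground truth (per-coordinate maximum distance assumed)."""
    diff = np.abs(np.asarray(pred_pos) - np.asarray(true_pos)).max(axis=-1)
    return float((diff <= n).mean())
```
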
In addition, we investigate whether OODP can learn semantically and visually interpretable knowledge and whether it is robust to some changes of object appearance.\n\n4.2 Zero-shot Generalization of Learned Models\n\nTo demonstrate generalization, we evaluate the prediction accuracy of the learned dynamics model in unseen environments with novel object layouts, without re-training. Table 1 shows the performance of all models on predicting the dynamics of the agent, where n-error accuracy is defined as the proportion of samples for which the difference between the predicted and ground-truth agent locations is at most n pixels (n = 0, 1, 2).\n\nFigure 4: An example of the 2-to-10 zero-shot generalization problem.\n\nTable 1: Accuracy of the dynamics prediction. k-m denotes the k-to-m zero-shot generalization problem. OODP+p and OODP-p denote OODP with and without the proposal loss (Equation 2), respectively.\n
                              Training environments                 Unseen environments
Metric            Model      1-10  2-10  3-10  4-10  5-10    1-10  2-10  3-10  4-10  5-10
0-error accuracy  OODP+p     0.90  0.94  0.92  0.93  0.93    0.32  0.79  0.73  0.78  0.82
                  OODP-p     0.96  0.98  0.98  0.98  0.97    0.22  0.78  0.86  0.90  0.95
                  AC Model   0.99  0.99  0.99  0.99  0.99    0.01  0.17  0.22  0.44  0.70
                  CDNA       0.20  0.19  0.14  0.13  0.17    0.33  0.25  0.20  0.18  0.19
1-error accuracy  OODP+p     0.98  0.98  0.99  0.98  0.98    0.71  0.90  0.90  0.94  0.95
                  OODP-p     0.98  0.98  0.99  0.99  0.99    0.61  0.91  0.94  0.96  0.97
                  AC Model   0.99  0.99  0.99  0.99  0.99    0.01  0.31  0.31  0.57  0.77
                  CDNA       0.30  0.33  0.30  0.29  0.30    0.53  0.52  0.47  0.49  0.55
2-error accuracy  OODP+p     0.99  0.99  0.99  0.99  0.99    0.87  0.96  0.96  0.98  0.99
                  OODP-p     0.99  0.99  0.99  0.99  0.99    0.82  0.94  0.97  0.98  0.98
                  AC Model   0.99  0.99  0.99  0.99  0.99    0.02  0.37  0.34  0.64  0.80
                  CDNA       0.36  0.45  0.47  0.44  0.45    0.56  0.62  0.56  0.55  0.62
\nFrom Table 1, we can see that our model significantly outperforms the previous methods under all circumstances. This demonstrates that our object-oriented approach is beneficial for generalization over object layouts. As expected, as the number of training environments increases, the learned object dynamics generalize more easily to new environments and the accuracy of dynamics prediction tends to grow. OODP achieves reasonable performance with 0.86 0-error accuracy when trained in only 3 environments, while the other methods fail to reach satisfactory scores (about 0.2). We observe that the AC Model achieves extremely high accuracy in training environments but cannot make accurate predictions in novel environments, which implies that it overfits the training environments severely. This is partly because the AC Model only performs video prediction at the pixel level and learns little object-level knowledge. Although CDNA includes object concepts in its model, it still performs pixel-level motion prediction and does not consider object-to-object relations. As a result, CDNA also fails to achieve accurate predictions in unseen environments with novel object layouts (as tuning hyperparameters does not improve its prediction performance, we use the default settings here). In addition, we observe that the performance of OODP-p is slightly higher than that of OODP+p, because the region proposals used for the initial guidance of optimization can introduce some noise. 
Nevertheless, using proposals makes the learning process more stable.\n\n4.3 Interpretability of Learned Knowledge\n\nInterpretable deep learning has always been a significant but extremely hard topic [37, 38, 39]. Unlike previous video prediction frameworks [10, 12, 40, 41, 42, 43, 23], most of which use neural networks with uninterpretable hidden layers, our model has informative and meaningful intermediate layers containing object-level representations and dynamics.\n\nTo interpret the intermediate representations learned by OODP, we illustrate its object masks in unseen environments, as shown in Figure 5. Intriguingly, the learned dynamic object masks accurately capture the moving agents, and the static masks successfully detect the ladders, walls, and free space that lead to different action-conditioned dynamics of the agent. Each object mask covers one class of objects, which implies that the common characteristics of this class are learned and that knowledge linking visual features and dynamics properties is gained. While the learned masks are not as fine as those derived from supervised image segmentation, they clearly demonstrate visually interpretable representations in the domain of unsupervised dynamics learning.\n\nFigure 5: Visualization of the masked images in unseen environments with a single dynamic object (top) and multiple dynamic objects (bottom). To demonstrate the learned attentions of object masks, the raw input images are multiplied by binarized object masks.\n\nTo interpret the learned object dynamics behind frame prediction, we evaluate the root-mean-square errors (RMSEs) between the predicted and ground-truth motions. Table 2 shows the RMSEs averaged over 10000 samples. From Table 2, we can observe that the motions predicted by OODP are very accurate, with the RMSE close to or below one pixel, in both training and unseen environments. 
Such a small error is visually indistinguishable, since it is below the resolution of the input video frame (1 pixel). As expected, as the number of training environments increases, this prediction error rapidly decreases. Further, we provide some intuitive prediction examples (see Supplementary Material) and a video (https://goo.gl/BTL2wH) for a better perceptual understanding of the prediction performance.\n\nTable 2: RMSEs between predicted and ground-truth motions. The unit of measure is pixels.\n
                          Number of training envs
                          1     2     3     4     5
Training envs  OODP+p     0.28  0.24  0.23  0.23  0.23
               OODP-p     0.18  0.17  0.19  0.14  0.15
Unseen envs    OODP+p     1.04  0.52  0.51  0.43  0.40
               OODP-p     1.09  0.53  0.38  0.35  0.29
\nTable 3: Performance (accuracy and RMSE on the 5-to-10 zero-shot generalization problem) of OODP in novel environments with different object layouts and appearances.\n
        S0    S1    S2    S3    S4    S5    S6
Acc     0.94  0.92  0.94  0.94  0.92  0.88  0.93
RMSE    0.29  0.35  0.31  0.28  0.31  0.40  0.30
\nThese interpretable intermediates demystify why OODP is able to generalize across novel environments with various object layouts. While the visual perceptions of novel environments are quite different, the underlying physical mechanism based on object relations remains invariant. As shown in Figure 5, OODP learns to decompose a novel scene into understandable objects and can thus reuse the object-level knowledge (features and relations) acquired from training environments to predict the effects of actions.\n\n4.4 Robustness to Changes of Object Appearance\n\nTo demonstrate robustness to object appearance, we evaluate the generalization performance of OODP in testing environments containing objects whose appearances differ from those in the training environments, as shown in Figure 6. 
As shown in Table 3, OODP still provides high prediction performance in all these testing environments, which indicates that it can generalize to novel object layouts even when the object appearances differ somewhat. This robustness arises partly because the Object Detector in OODP employs CNNs that are capable of learning the essential patterns underlying appearances. Furthermore, we provide the learned masks (see Supplementary Material) and a video (https://goo.gl/ovupdn) to show our results on S2 environments.

Figure 6: Illustration of the configurations of the novel testing environments. Compared to the training environments, the testing ones have different object layouts (S0-S6), and their objects have some appearance differences (S1-S6).

Figure 7: Visualization of the learned masks in unseen environments in the Mars domain.

4.5 Performance on Natural Image Input

To test the performance on natural image input, we also evaluate our model in the Mars Rover Navigation domain introduced by Tamar et al. [44]. The Mars landscape images are natural images taken from NASA. A Mars rover randomly explores the Martian surface and gets stuck at mountains whose elevation angles are equal to or greater than 5 degrees. We run our model on the 5-to-10 zero-shot generalization problem and compare it with other approaches. As shown in Figure 7 and Table 4, in unseen environments, our learned object masks successfully capture the key objects and our model significantly outperforms other methods in terms of dynamics prediction.

5 Conclusion and Future Work

Models      acc0   acc1   acc2   acc3
AC Model    0.10   0.10   0.10   0.12
CDNA        0.46   0.54   0.62   0.75
OODP        0.70   0.70   0.78   0.92

Table 4: Accuracy of the dynamics prediction in unseen environments in the Mars domain.
accn denotes the n-error accuracy.

We present an object-oriented end-to-end neural network framework. This framework is able to learn object dynamics conditioned on both actions and object relations in an unsupervised manner. Its learned dynamics model exhibits strong generalization and interpretability. Our framework demonstrates that object perception and dynamics can be mutually learned, and it reveals a promising way to learn object-level knowledge by integrating both vision and interaction. We take one of the first steps in investigating how to design a self-supervised, end-to-end object-oriented dynamics learning framework that enables generalization and interpretability. Our learned dynamics model can be used with existing policy search or planning methods (e.g., MCTS and MPC). Although we use random exploration in the experiments, our model can integrate with smarter exploration strategies for better state sampling.

Our future work includes extending our framework to support long-term prediction, abrupt-change prediction (e.g., objects appearing and disappearing), and dynamic backgrounds (e.g., caused by a moving camera or multiple dynamic objects). As abrupt changes are often predictable from a long-term view or with memory, our model can incorporate memory networks (e.g., LSTMs) to deal with such changes. In addition, the STN module in our model is capable of learning disappearance, which is essentially an affine transformation with zero scaling. For prediction with dynamic backgrounds (e.g., in FPS games and driving), we will incorporate a camera-motion prediction network module similar to that introduced by Vijayanarasimhan et al. [45].
This module will learn a global transformation and apply it to the whole image to incorporate the dynamics caused by camera motion.

(Figure 6 panels: S0 original appearance; S1 increased illumination; S2 graffiti walls; S3 jagged walls; S4 spotted ladders; S5 distorted ladders; S6 mixtures. Figure 7 panels: raw image, mountains, flat area, agent.)

References

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[2] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. International Conference on Learning Representations, 2016.

[3] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[4] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[5] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[6] Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pages 1–9, 2013.

[7] Sébastien Racanière, Théophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adrià Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning.
In Advances in Neural Information Processing Systems, pages 5694–5705, 2017.

[8] Silvia Chiappa, Sébastien Racaniere, Daan Wierstra, and Shakir Mohamed. Recurrent environment simulators. International Conference on Learning Representations, 2017.

[9] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 2786–2793. IEEE, 2017.

[10] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863–2871, 2015.

[11] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pages 2746–2754, 2015.

[12] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, pages 64–72, 2016.

[13] Jean Piaget. Piaget's theory. 1970.

[14] Renee Baillargeon, Elizabeth S Spelke, and Stanley Wasserman. Object permanence in five-month-old infants. Cognition, 20(3):191–208, 1985.

[15] Renee Baillargeon. Object permanence in 3 1/2- and 4 1/2-month-old infants. Developmental Psychology, 23(5):655, 1987.

[16] Elizabeth S Spelke. Where perceiving ends and thinking begins: The apprehension of objects in infancy. In Perceptual Development in Infancy, pages 209–246. Psychology Press, 2013.

[17] Carlos Guestrin, Daphne Koller, Chris Gearhart, and Neal Kanodia. Generalizing plans to new environments in relational MDPs. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, pages 1003–1010.
Morgan Kaufmann Publishers Inc., 2003.

[18] Carlos Diuk, Andre Cohen, and Michael L Littman. An object-oriented representation for efficient reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pages 240–247. ACM, 2008.

[19] Michael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341, 2016.

[20] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pages 4502–4510, 2016.

[21] Nicholas Watters, Daniel Zoran, Theophane Weber, Peter Battaglia, Razvan Pascanu, and Andrea Tacchetti. Visual interaction networks. In Advances in Neural Information Processing Systems, pages 4540–4548, 2017.

[22] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Tim Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4974–4983, 2017.

[23] Jiajun Wu, Erika Lu, Pushmeet Kohli, Bill Freeman, and Josh Tenenbaum. Learning to see physics via visual de-animation. In Advances in Neural Information Processing Systems, pages 152–163, 2017.

[24] Sjoerd van Steenkiste, Michael Chang, Klaus Greff, and Jürgen Schmidhuber. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. arXiv preprint arXiv:1802.10353, 2018.

[25] Luis C Cobo, Charles L Isbell, and Andrea L Thomaz. Object focused Q-learning for autonomous agents. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, pages 1061–1068.
International Foundation for Autonomous Agents and Multiagent Systems, 2013.

[26] Suha Kwak, Minsu Cho, Ivan Laptev, Jean Ponce, and Cordelia Schmid. Unsupervised object discovery and tracking in video collections. In Proceedings of the IEEE International Conference on Computer Vision, pages 3173–3181, 2015.

[27] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033, 2017.

[28] Emily L Denton et al. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pages 4414–4423, 2017.

[29] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.

[30] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[31] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

[32] Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Rob Fergus. Deconvolutional networks. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2528–2535. IEEE, 2010.

[33] Viren Jain, Joseph F Murray, Fabian Roth, Srinivas Turaga, Valentin Zhigulin, Kevin L Briggman, Moritz N Helmstaedter, Winfried Denk, and H Sebastian Seung. Supervised learning of image restoration with convolutional networks. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.

[34] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net.
arXiv preprint arXiv:1412.6806, 2014.

[35] BPL Lo and SA Velastin. Automatic congestion detection system for underground platforms. In Intelligent Multimedia, Video and Speech Processing, 2001. Proceedings of 2001 International Symposium on, pages 158–161. IEEE, 2001.

[36] Norman Tasfi. PyGame Learning Environment. https://github.com/ntasfi/PyGame-Learning-Environment, 2016.

[37] Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu. Interpretable convolutional neural networks. arXiv preprint arXiv:1710.00935, 2017.

[38] Quanshi Zhang and Song-Chun Zhu. Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering, 19(1):27–39, 2018.

[39] Quanshi Zhang, Yu Yang, Ying Nian Wu, and Song-Chun Zhu. Interpreting CNNs via decision trees. arXiv preprint arXiv:1802.00121, 2018.

[40] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pages 843–852, 2015.

[41] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. International Conference on Learning Representations, 2016.

[42] William Lotter, Gabriel Kreiman, and David Cox. Unsupervised learning of visual structure using predictive generative networks. Computer Science, 2015.

[43] Marc Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. Eprint Arxiv, 2014.

[44] Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162, 2016.

[45] Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, and Katerina Fragkiadaki.
SfM-Net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804, 2017.