{"title": "Incremental Scene Synthesis", "book": "Advances in Neural Information Processing Systems", "page_first": 1668, "page_last": 1678, "abstract": "We present a method to incrementally generate complete 2D or 3D scenes with the following properties: (a) it is globally consistent at each step according to a learned scene prior, (b) real observations of a scene can be incorporated while observing global consistency, (c) unobserved regions can be hallucinated locally in consistence with previous observations, hallucinations and global priors, and (d) hallucinations are statistical in nature, i.e., different scenes can be generated from the same observations. To achieve this, we model the virtual scene, where an active agent at each step can either perceive an observed part of the scene or generate a local hallucination. The latter can be interpreted as the agent's expectation at this step through the scene and can  be applied to autonomous navigation. In the limit of observing real data at each point, our method converges to solving the SLAM problem. It can otherwise sample entirely imagined scenes from prior distributions. Besides autonomous agents, applications include problems where large data is required for building robust real-world applications, but few samples are available. 
We demonstrate efficacy on various 2D as well as 3D data.", "full_text": "Incremental Scene Synthesis\n\nBenjamin Planche1,2\nHarald Kosch2\n\nXuejian Rong3,4\nYingLi Tian3\n\nZiyan Wu4\n\nSrikrishna Karanam4\n\nJan Ernst4\n\nAndreas Hutter1\n\n1Siemens Corporate Technology, Munich, Germany\n\n2University of Passau, Passau, Germany\n\n3The City College, City University of New York, New York NY\n\n4Siemens Corporate Technology, Princeton NJ\n\n{first.last}@siemens.com, {xrong,ytian}@ccny.cuny.edu, harald.kosch@uni-passau.de\n\nAbstract\n\nWe present a method to incrementally generate complete 2D or 3D scenes with\nthe following properties: (a) it is globally consistent at each step according to a\nlearned scene prior, (b) real observations of a scene can be incorporated while\nobserving global consistency, (c) unobserved regions can be hallucinated locally in\nconsistence with previous observations, hallucinations and global priors, and (d)\nhallucinations are statistical in nature, i.e., different scenes can be generated from\nthe same observations. To achieve this, we model the virtual scene, where an active\nagent at each step can either perceive an observed part of the scene or generate a\nlocal hallucination. The latter can be interpreted as the agent\u2019s expectation at this\nstep through the scene and can be applied to autonomous navigation. In the limit\nof observing real data at each point, our method converges to solving the SLAM\nproblem. 
It can otherwise sample entirely imagined scenes from prior distributions.\nBesides autonomous agents, applications include problems where large data is\nrequired for building robust real-world applications, but few samples are available.\nWe demonstrate efficacy on various 2D as well as 3D data.\n\nFigure 1: Our solution for scene understanding and novel view synthesis, given non-localized agents.\n[Figure 1 labels: \u201cHere is what I have seen so far.\u201d; \u201cWhat would other locations look like?\u201d; agent observations; feature encoding; incremental memory update; global memory (with observed or previously synthesized views); hallucination; sequential synthesis of unobserved views]\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1 Introduction\n\nWe live in a three-dimensional world, and a proper cognitive understanding of its structure is crucial\nfor planning and action. The ability to anticipate under uncertainty is necessary for autonomous agents\nto perform various downstream tasks such as exploration and target navigation [3]. Deep learning\nhas shown promise in addressing these questions [31, 16]. Given a set of views and corresponding\ncamera poses, existing methods have demonstrated the capability of learning an object\u2019s 3D shape\nvia direct 3D or 2D supervision.\n\nFigure 2: Proposed pipeline for non-localized agents exploring new scenes. Observations $x_t$ are\nsequentially encoded and registered in a global feature map $m_t$ with spatial properties, used to\nextrapolate unobserved content and generate consistent novel views $x_{req}$ from requested viewpoints.\n\nNovel view synthesis methods of this type have three common limitations. First, most recent\napproaches solely focus on single objects and surrounding viewpoints, and are trained with\ncategory-dependent 3D shape representations (e.g., voxel, mesh, point cloud model) and 3D/2D supervision\n(e.g., reprojection loss), which are not trivial to obtain for natural scenes. 
While recent works\non auto-regressive pixel generation [22], appearance flow prediction [31], or a combination of\nboth [21] generate encouraging preliminary results for scenes, they only evaluate on data with\nmostly forward translation (e.g., KITTI dataset [9]), and no scene understanding capabilities are\nconvincingly shown. Second, these approaches assume that the camera poses are known precisely\nfor all provided observations. This is a practically and biologically unrealistic assumption; an agent\ntypically only has access to its own observations, not its precise location relative to objects in the\nscene (although it is provided by some oracle in synthetic environments, e.g., [6]). Third, there are no\nconstraints to guarantee consistency among the synthesized results.\nIn this paper, we address these issues with a unified framework that incrementally generates complete\n2D or 3D scenes (cf. Figure 1). Our solution builds upon the MapNet system [11], which offers an\nelegant solution to the registration problem but has no memory-reading capability. In comparison,\nour method not only provides a completely functional memory system, but also displays superior\ngeneration performance when compared to parallel deep reinforcement learning methods (e.g., [8]).\nTo the best of our knowledge, our solution is the first complete end-to-end trainable read/write\nallocentric spatial memory for visual inputs. Our key contributions are summarized below:\n\n\u2022 Starting with only scene observations from a non-localized agent (i.e., no location/action inputs\nunlike, e.g., [8]), we present novel mechanisms to update a global memory with encoded features,\nhallucinate unobserved regions and query the memory for novel view synthesis.\n\n\u2022 Memory updates are done with either observed or hallucinated data. Our domain-aware mechanism\nis the first to explicitly ensure the representation\u2019s global consistency w.r.t. 
the underlying scene\nproperties in both cases.\n\n\u2022 We propose the first framework that integrates observation, localization, globally consistent scene\nlearning, and hallucination-aware representation updating to enable incremental scene synthesis.\n\nWe demonstrate the efficacy of our framework on a variety of partially observable synthetic and\nrealistic 2D environments. Finally, to establish scalability, we also evaluate the proposed model on\nchallenging 3D environments.\n\n2 Related Work\n\nOur work is related to localization, mapping, and novel view synthesis. We discuss relevant work to\nprovide some context.\nNeural Localization and Mapping. The ability to build a global representation of an environment,\nby registering frames captured from different viewpoints, is key to several concepts such as\nreinforcement learning or scene reconstruction. Recurrent neural networks are commonly used to\naccumulate features from image sequences, e.g., to predict the camera trajectory [15, 19]. Extending\nthese solutions with a queryable memory, state-of-the-art models are mostly egocentric and\naction-conditioned [3, 17, 30, 8, 14]. Some oracle is, therefore, usually required to provide the agent\u2019s action\nat each time step t [14]. 
This information is typically used to regress the agent state $s_t$, e.g., its pose,\nwhich can be used in a memory structure to index the corresponding observation $x_t$ or its features. In\ncomparison, our method solely relies on the observations to regress the agent\u2019s pose.\n\n[Figure 2 diagram labels: \u201cHere is what I have seen so far.\u201d; \u201cWhat would other locations look like?\u201d; Incremental Observation; Memory Update; Memory Hallucination; Sequential Synthesis; global allocentric memory; registered poses; requested poses; operations: Encode, Project, Register, Update, Hallucinate, Cull, Decode; legend: trained neural networks, geometrical transforms]\n\nProgress has also been made towards solving visual SLAM with neural networks. CNN-SLAM [23]\nreplaced some modules in classical SLAM methods [5] with neural components. Neural SLAM [30]\nand MapNet [11] both proposed a spatial memory system for autonomous agents. Whereas the former\ndeeply interconnects memory operations with other predictions (e.g., motion planning), the latter\noffers a more generic solution with no assumption on the agents\u2019 range of action or goal. Extending\nMapNet, our proposed model not only attempts to build a map of the environment, but also makes\nincremental predictions and hallucinations based on both past experiences and current observations.\n3D Modeling and Geometry-based View Synthesis. Much effort has also been expended in\nexplicitly modeling the underlying 3D structure of scenes and objects, e.g., [5, 4]. 
While appealing\nand accurate results are guaranteed when multiple source images are provided, this line of work is\nfundamentally not able to deal with sparse inputs. To address this issue, Flynn et al. [7] proposed a\ndeep learning approach focused on the multi-view stereo problem by regressing directly to output\npixel values. On the other hand, Ji et al. [12] explicitly utilized learned dense correspondences to\npredict the image in the middle view of two source images. Generally, these methods are limited to\nsynthesizing a middle view among fixed source images, whereas our framework is able to generate\narbitrary target views by extrapolating from prior domain knowledge.\nNovel View Synthesis. The problem we tackle here can be formulated as a novel view synthesis\ntask: given pictures taken from certain poses, solutions need to synthesize an image from a new\npose. This task has seen significant interest in both vision [16, 31] and graphics [10]. There are two main\nflavors of novel view synthesis methods. The first type synthesizes pixels from an input image and a\npose change with an encoder-decoder structure [22]. The second type reuses pixels from an input\nimage with a sampling mechanism. For instance, Zhou et al. [31] recast the task of novel view\nsynthesis as predicting dense flow fields that map the pixels in the source view to the target view,\nbut their method is not able to hallucinate pixels missing from the source view. Recently, methods that\nuse geometry information have gained popularity, as they are more robust to large view changes and\nresulting occlusions [16]. However, these conditional generative models rely on additional data to\nperform their target tasks. 
In contrast, our proposed model enables the agent to predict its own pose\nand synthesize novel views in an end-to-end fashion.\n\n3 Methodology\n\nWhile the current state of the art in scene registration yields satisfying results, it relies on several\nassumptions, including prior knowledge of the agent\u2019s range of actions, as well as of the actions $a_t$\nthemselves at each time step. In this paper, we consider unknown agents, with only their observations\n$x_t$ provided during the memorization phase. In the spirit of the MapNet solution [11], we use an\nallocentric spatial memory map. Projected features from the input observations are registered together\nin a coordinate system relative to the first inputs, which allows regressing the position and orientation (i.e.,\npose) of the agent in this coordinate system at each step. Moreover, given viewpoints and camera\nintrinsic parameters, features can be extracted from the spatial memory (frustum culling) to recover\nviews. Crucially, at each step, memory \u201choles\u201d can be temporarily filled by a network trained to\ngenerate domain-relevant features while ensuring global consistency. Put together (cf. Figure 2),\nour pipeline (trainable both separately and end-to-end) can be seen as an explicit topographic\nmemory system with localization, registration, and retrieval properties, as well as consistent memory\nextrapolation from prior knowledge. We present details of our proposed approach in this section.\n\n3.1 Localization and Memorization\nOur solution first takes a sequence of observed images $x_t \\in \\mathbb{R}^{c \\times h \\times w}$ (e.g., with $c = 3$ for RGB\nimages or 4 for RGB-D ones) for $t = 1, \\dots, \\tau$ as input, localizing them and updating the spatial\nmemory $m \\in \\mathbb{R}^{n \\times u \\times v}$ accordingly. The memory $m$ is a discrete global map of dimensions $u \\times v$\nand feature size $n$. $m_t$ represents its state at time $t$, after updating $m_{t-1}$ with features from $x_t$.\nEncoding Memories. 
Observations are encoded to fit the memory format. For each observation, a\nfeature map $x'_t \\in \\mathbb{R}^{n \\times h' \\times w'}$ is extracted by an encoding convolutional neural network (CNN). Each\nfeature map is then projected from the 2D image domain into a tensor $o_t \\in \\mathbb{R}^{n \\times s \\times s}$ representing the\nagent\u2019s spatial neighborhood (to simplify later equations, we assume $u$, $v$, $s$ are odd). This operation\nis data and use-case dependent. For instance, for RGB-D observations of 3D scenes (or RGB images\nextended by some monocular depth estimation method, e.g., [28]), the feature maps are first converted\ninto point clouds using the depth values and the camera intrinsic parameters (assuming, like Henriques\nand Vedaldi [11], that the ground plane is approximately known). They are then projected into\n$o_t$ through discretization and max-pooling (to handle many-to-one feature aggregation, i.e., when\nmultiple features are projected into the same cell [18]). For 2D scenes (i.e., agents walking on an\nimage plane), $o_t$ can be directly obtained from $x_t$ (with optional cropping/scaling).\n\nFigure 3: Pipeline training. Though steps are shown separately in the figure (for clarity), the method\nis trained in a single pass. $\\mathcal{L}_{loc}$ measures the accuracy of the predicted allocentric poses, i.e., training\nthe encoding system to extract meaningful features (CNN) and to update the global map $m_t$ properly\n(LSTM). $\\mathcal{L}_{anam}$ measures the quality of the images rendered from $m_t$ using the ground-truth poses,\nto train the decoding CNN. $\\mathcal{L}_{hallu}$ trains the method to predict all past and future observations at each\nstep of the sequence, while $\\mathcal{L}_{corrupt}$ punishes it for any memory corruption during hallucination.\n\nLocalizing and Storing Memories. Given a projected feature map $o_t$ and the current memory state\n$m_{t-1}$, the registration process involves densely matching $o_t$ with $m_{t-1}$, considering all possible\npositions and rotations. 
As explained in Henriques and Vedaldi [11], this can be efficiently done\nthrough cross-correlation. Considering a set of $r$ yaw rotations, a bank $o'_t \\in \\mathbb{R}^{r \\times n \\times s \\times s}$ is built by\nrotating $o_t$ $r$ times: $o'_t = \\left\\{ R\\left(o_t, 2\\pi \\frac{i}{r}, c_{s,s}\\right) \\right\\}_{i=0}^{r}$, with $c_{s,s} = \\left(\\frac{s+1}{2}, \\frac{s+1}{2}\\right)$ the horizontal center of $o_t$,\nand $R(o, \\alpha, c)$ the function rotating each element in $o$ around the position $c$ by an angle $\\alpha$, in the\nhorizontal plane. The dense matching can therefore be achieved by sliding this bank of $r$ feature maps\nacross the global memory $m_{t-1}$ and comparing the correlation responses. The localization probability\nfield $p_t \\in \\mathbb{R}^{r \\times u \\times v}$ is efficiently obtained by computing the cross-correlation (i.e., \u201cconvolution\u201d,\noperator $\\star$, in deep learning literature) between $m_{t-1}$ and $o'_t$ and normalizing the response map\n(softmax activation $\\sigma$). The higher a value in $p_t$, the stronger the belief the observation comes from\nthe corresponding pose. Given this probability map, it is possible to register $o_t$ into the global map\nspace (i.e., rotating and translating it according to the $p_t$ estimation) by directly convolving its rotated bank $o'_t$ with $p_t$.\nThis registered feature tensor $\\hat{o}_t \\in \\mathbb{R}^{n \\times u \\times v}$ can finally be inserted into memory:\n\n$$m_t = \\mathrm{LSTM}(m_{t-1}, \\hat{o}_t, \\theta_{lstm}) \\quad \\text{with} \\quad \\hat{o}_t = p_t * o'_t \\quad \\text{and} \\quad p_t = \\sigma(m_{t-1} \\star o'_t) \\qquad (1)$$\n\nA long short-term memory (LSTM) unit is used to update $m_{t-1}$ (the unit\u2019s hidden state) with $\\hat{o}_t$\n(the unit\u2019s input) in a knowledgeable manner (cf. trainable parameters $\\theta_{lstm}$). 
During training, the\nrecurrent network will indeed learn to properly blend overlapping features, and to use $\\hat{o}_t$ to solve\npotential uncertainties in previous insertions (uncertainties in $p$ result in blurred $\\hat{o}$ after convolution).\nThe LSTM is also trained to update an occupancy mask of the global memory, later used for\nconstrained hallucination (cf. Section 3.3).\nTraining. The aforementioned process is trained in a supervised manner given the agent\u2019s\nground-truth poses. For each sequence, the feature vector $o_{t=0}$ from the first observation is registered\nat the center of the global map without rotation (origin of the allocentric system). Given $\\bar{p}_t$, the\none-hot encoding of the actual state at time $t$, the network\u2019s loss $\\mathcal{L}_{loc}$ at time $\\tau$ is computed over the\nremaining predicted poses using binary cross-entropy:\n\n$$\\mathcal{L}_{loc} = -\\frac{1}{\\tau} \\sum_{t=1}^{\\tau} \\left[ \\bar{p}_t \\cdot \\log(p_t) + (1 - \\bar{p}_t) \\cdot \\log(1 - p_t) \\right] \\qquad (2)$$\n\n[Figure 3 panel titles: (a) Training registration / memorization / anamnesis modules; (b) Training of hallucinatory module. Legend: ground-truth data; predictions; trained neural networks; frozen neural networks; geometrical transforms]\n\nFigure 4: Synthesis of memorized and novel views from 2D scenes, comparing 
to GTM-SM [8].\nMethods receive a sequence of 10 observations (along with the related actions for GTM-SM) from\nan exploring agent, then they apply their knowledge to generate 46 novel views. GTM-SM has\ndifficulties grasping the structure of the environment from short observation sequences, while our\nmethod usually succeeds thanks to prior knowledge.\n\n3.2 Anamnesis\n\nApplying a novel combination of geometrical transforms and decoding operations, memorized content\ncan be recalled from $m_t$ and new images from unexplored locations synthesized. This process can be\nseen as a many-to-one recurrent generative network, with image synthesis conditioned on the global\nmemory and the requested viewpoint. We present how the entire network can thus be advantageously\ntrained as an auto-encoder with a recurrent neural encoder and a persistent latent space.\nCulling Memories. While a decoder can retrieve observations conditioned on the full memory and\nrequested pose, it would have to disentangle the visual and spatial information itself, which is not\ntrivial to learn (cf. ablation study in Section 4.1). Instead, we propose to use the spatial properties\nof our memory to first cull features from requested viewing volumes before passing them as inputs\nto our decoder. More formally, given the allocentric coordinates $l_{req} = (u_{req}, v_{req})$, orientation\n$\\alpha_{req} = 2\\pi \\frac{r_{req}}{r}$, and field of view $\\alpha_{fov}$, $o_{req} \\in \\mathbb{R}^{n \\times s \\times s}$ representing the requested neighborhood is\nfilled as follows:\n\n$$o_{req,kij} = \\begin{cases} \\hat{o}_{req,kij} & \\text{if } \\operatorname{atan2} \\dfrac{j - \\frac{s+1}{2}}{i - \\frac{s+1}{2}} < \\dfrac{\\alpha_{fov}}{2} \\\\ -1 & \\text{otherwise} \\end{cases} \\qquad (3)$$\n\nwith $\\hat{o}_{req}$ the unculled feature patch extracted from $m_t$ rotated by $-\\alpha_{req}$, i.e., $\\forall k \\in [0\\,..\\,n-1]$, $\\forall (i, j) \\in [0\\,..\\,s-1]^2$:\n\n$$\\hat{o}_{req,kij} = R(m_t, -\\alpha_{req}, c_{u,v} + l_{req})_{k\\xi\\eta} \\quad \\text{with} \\quad (\\xi, \\eta) = (i, j) + c_{u,v} + l_{req} - c_{s,s} \\qquad (4)$$\n\nThis differentiable operation combines feature extraction (through translation and rotation) and\nviewing frustum culling (cf. its use in computer graphics to render large 3D scenes).\nDecoding Memories. Just as input observations undergo encoding and projection, feature maps culled\nfrom the memory go through a reverse procedure to be projected back into the image domain. With\nthe synthesis conditioning covered in the previous step, a decoder directly takes $o_{req}$ (i.e., the\nview-encoding features) and returns $x_{req}$, the corresponding image. This back-projection is still a complex\ntask. The decoder must both project the features from voxel domain to image plane, and decode\nthem into visual stimuli. Previous works and qualitative results demonstrate that a well-defined (e.g.,\ngeometry-aware) network can successfully accomplish this task.\nTraining. By requesting the pipeline to recall given observations (i.e., setting $l_{req,t} = \\bar{l}_t$ and\n$r_{req,t} = \\bar{r}_t$, $\\forall t \\in [1, \\tau]$, with $\\bar{l}_t$ and $\\bar{r}_t$ the agent\u2019s ground-truth position/orientation at each step\n$t$), it can be trained end-to-end as an image-sequence auto-encoder (cf. Figure 3.a). Therefore, its\nloss $\\mathcal{L}_{anam}$ is computed as the L1 distance between $x_t$ and $x_{req,t}$, $\\forall t \\in [0, \\tau]$, averaged over the\nsequences. Note that thanks to our framework\u2019s modularity, the global map and registration steps can\nbe removed to pre-train the encoder and decoder together (passing the features directly from one to\nthe other). We observe that such a pre-training tends to stabilize the overall learning process.\n\n3.3 Mnemonic Hallucination\n\nWhile the presented pipeline can generate novel views, these views have to overlap with previous\nobservations for the solution to extract enough features for anamnesis. 
Therefore, we extend our\nmemory system with an extrapolation module to hallucinate relevant features for unexplored regions.\n\n[Figure 4 panel labels: input views; GTM-SM results; Ours; GT views (all at scale x3); queried poses; GT poses; GT image (CelebA); GT image (HoME-2D)]\n\nHole Filling with Global Constraints. Under global constraints, we build a deep auto-encoder\n(DAE) in the feature domain, which takes $m_t$ as input, as well as a noise vector of variable amplitude\n(e.g., no noise for deterministic navigation planning or heavy noise for image dataset augmentation),\nand returns a convincingly hole-filled version $m^h_t$, while leaving registered features uncorrupted.\nIn other words, this module should provide relevant features while seamlessly integrating existing\ncontent according to prior domain knowledge.\nTraining. Assuming the agent homogeneously explores training environments, the hallucinatory\nmodule is trained at each step $t \\in [0, \\tau - 1]$ by generating $m^h_t$, the hole-filled memory used to predict\nyet-to-be-observed views $\\{x_i\\}_{i=t+1}^{\\tau}$. To ensure that registered features are not corrupted, we also\nverify that all observations $\\{x_i\\}_{i=0}^{t}$ can be retrieved from $m^h_t$ (cf. Figure 3.b). This generative loss\nis computed as follows:\n\n$$\\mathcal{L}_{hallu} = \\frac{1}{\\tau(\\tau - 1)} \\sum_{t=0}^{\\tau-1} \\sum_{i=0}^{\\tau} \\left| x^h_{i,t} - x_i \\right|_1 \\qquad (5)$$\n\nwith $x^h_{i,t}$ the view recovered from $m^h_t$ using the agent\u2019s true location $\\bar{l}_i$ and orientation $\\bar{r}_i$ for its\nobservation $x_i$. Additionally, another loss is directly computed in the feature domain, using memory\noccupancy masks $b_t$ to penalize any changes to the registered features (with $\\odot$ the Hadamard product):\n\n$$\\mathcal{L}_{corrupt} = \\frac{1}{\\tau} \\sum_{t=0}^{\\tau} \\left| (m^h_t - m_t) \\odot b_t \\right|_1 \\qquad (6)$$\n\nTrainable end-to-end, our model efficiently acquires domain knowledge to register, hallucinate, and\nsynthesize scenes.\n\n4 Experiments\n\nWe demonstrate our solution on various synthetic and real 2D and 3D environments. For each\nexperiment, we consider an unknown agent exploring an environment, only providing a short\nsequence of partial observations (limited field of view). Our method has to localize and register the\nobservations, and build a global representation of the scene. Given a set of requested viewpoints,\nit should then render the corresponding views. In this section, we qualitatively and quantitatively\nevaluate the predicted trajectories and views, comparing with GTM-SM [8], the only other end-to-end\nmemory system for scene synthesis, based on the Generative Query Network [6].\n\n4.1 Navigation in 2D Images\n\nWe first study agents exploring images (randomly walking, accelerating, rotating), observing the\nimage patch in their field of view at each step (more details and results in the supplementary material).\nExperimental Setup. We use a synthetic dataset of indoor 83 \u00d7 83 floor plans rendered using the\nHoME platform [2] and SUNCG data [20] (8,640 training + 2,240 test images from random rooms\n\u201coffice\u201d, \u201cliving\u201d, and \u201cbedroom\u201d). Similar to Fraccaro et al. [8], we also consider an agent exploring\nreal pictures from the CelebA dataset [13], scaled to 43 \u00d7 43px. We consider two types of agents\nfor each dataset. To reproduce Fraccaro et al. 
[8] experiments, we first consider non-rotating agents\n$A^s$ (only able to translate in the 4 directions) with a 360\u00b0 field of view covering an image patch\ncentered on the agents\u2019 position. The CelebA agent $A^s_{cel}$ has a 15 \u00d7 15px square field of view, while\nthe field of view of the HoME-2D agent $A^s_{hom}$ reaches 20px away, and is therefore circular (in the\n41 \u00d7 41 patches, pixels further than 20px are left blank). To consider more complex scenarios, agents\n$A^c_{cel}$ and $A^c_{hom}$ are also designed. They can rotate and translate (in the gaze direction), observing\npatches rotated accordingly. On CelebA images, $A^c_{cel}$ can rotate by \u00b145\u00b0 or \u00b190\u00b0 each step, and\nonly observes 8 \u00d7 15 patches in front (180\u00b0 rectangular field of view), while for HoME-2D, $A^c_{hom}$\ncan rotate by \u00b190\u00b0 and has a 150\u00b0 field of view limited to 20px. All agents can move from 1/4 to 3/4\nof their field of view each step. Input sequences are 10 steps long. For quantitative studies, methods\nhave to render views covering the whole scenes w.r.t. the agents\u2019 properties.\nQualitative Results. As shown in Figure 4, our method efficiently uses prior knowledge to register\nobservations and extrapolate new views, consistent with the global scene and requested viewpoints.\nWhile an encoding of the agent\u2019s actions is also provided to GTM-SM (guiding the localization), it\ncannot properly build a global representation from short input sequences, and thus fails at rendering\ncompletely novel views. Moreover, unlike the dictionary-like memory structure of GTM-SM,\nour method stores its representation into a single feature map, which can therefore be queried in\nseveral ways. As shown in Figure 6, for a varying number of conditioning inputs, one can request\nnovel views one by one, culling and decoding features; with the option to register hallucinated\nviews back into memory (i.e., saving them as \u201cvalid\u201d observations to be reused). But one can also\ndirectly query the full memory, training another decoder to convert all the features. Figure 6 also\ndemonstrates how different trajectories may lead to different intermediate representations, while\nFigure 7-a illustrates how the proposed model can predict different global properties for identical\ntrajectories but different hallucinatory noise. In both cases though (different trajectories or different\nnoise), the scene representations converge as the scene coverage increases.\n\nTable 1: Quantitative comparison on 2D and 3D scenes, cf. setups in Subsections 4.1-4.2 (\u2198 the\nlower the better; \u2197 the higher the better; \u201cu\u201d horizontal bin unit according to AVD setup).\n\nA) As\n\ncel\n\nGTM-SM\n\nExp.\n\nMethods\n\nStd.\u2198\n4.32px\n1.23px\n\nAverage Position Error\n\nGTM-SML1st\u2194lt *\nGTM-SMst\u2190lt **\n\nHall. Metr.\nL1\u2198 SSIM\u2197\n0.41\n0.14\n0.15\n0.40\n0.43\n0.13\n0.72\n0.09\n0.41\n0.32\n0.70\n0.20\n0.41\n0.14\n0.09\n0.72\n0.49\n0.13\n0.54\n0.11\n0.10\n0.43\nE) AVD\n0.23\n0.25\n\nMed.\u2198 Mean\u2198\n4.78px\n4.0px\n1.0px\n1.03px\n0px (NA \u2013 poses passed as inputs)\n1.0px\n3.60px\n1.0px\n4.0px\n1.0px\n1.41u\n1.00u\n1.00u\n0.37u\n\nAnam. Metr.\nL1\u2198 SSIM\u2197\n0.57\n0.14\n0.13\n0.64\n0.76\n0.08\n0.80\n0.06\n0.50\n0.21\n0.79\n0.08\n0.57\n0.14\n0.06\n0.80\n0.52\n0.09\n0.56\n0.09\n0.12\n0.37\n0.22\n0.31\n\nAbsolute Trajectory Error\nStd.\u2198\n3.55px\n0.86px\n\nMed.\u2198 Mean\u2198\n6.86px\n6.40px\n0.79px\n0.87px\n0px (NA \u2013 poses passed as inputs)\n0.49px\n2.74px\n1.44px\n6.40px\n0.49px\n1.73u\n1.75u\n0.31u\n0.20u\n\n0.68px\n5.04px\n2.21px\n4.78px\n0.68px\n2.15u\n1.64u\n0.77u\n0.32u\n\n0.60px\n1.97px\n1.72px\n6.86px\n0.60px\n1.81u\n1.95u\n0.36u\n0.21u\n\n0.64px\n2.48px\n2.25px\n3.55px\n0.64px\n1.06u\n1.24u\n0.40u\n0.18u\n\n1.02px\n4.42px\n3.76px\n4.32px\n1.02px\n1.84u\n2.16u\n0.69u\n0.26u\n\nGTM-SM\n\nOurs\n\nGTM-SM\n\nC) As\n\nhom\n\nOurs\n\nGTM-SM\n\nOurs\n\nGTM-SM\n\nOurs\n\nB) Ac\n\ncel\n\nD) Doom\n\nOurs\n\n* GTM-SML1st\u2194lt : Custom GTM-SM with an L1 localization loss computed between the predicted states st and ground-truth poses lt.\n** GTM-SMst\u2190lt : Custom GTM-SM with the ground-truth poses lt provided as input (no st inference).\n\nTable 2: Ablation study on CelebA with agent $A^c_{cel}$. Removed modules are replaced by identity\nmappings; remaining ones are adapted to the new input shapes when necessary. LSTM, memory, and\ndecoder are present in all instances (\u201cLocalization\u201d is the MapNet module).\n\nPipeline Modules: Encoder | Localization | Hallucinatory DAE | Culling || Anamnesis Metrics: L1\u2198 | SSIM\u2197 || Hallucination Metrics: L1\u2198 | SSIM\u2197\n\u2205 | \u2205 | \u2205 | \u2205 || 0.18 | 0.62 || 0.24 | 0.59\n\u2713 | \u2205 | \u2205 | \u2205 || 0.17 | 0.62 || 0.24 | 0.58\n\u2713 | \u2713 | \u2205 | \u2205 || 0.15 | 0.66 || 0.20 | 0.61\n\u2713 | \u2713 | \u2713 | \u2205 || 0.15 | 0.65 || 0.19 | 0.62\n\u2713 | \u2205 | \u2713 | \u2713 || 0.14 | 0.69 || 0.19 | 0.63\n\u2205 | \u2713 | \u2713 | \u2713 || 0.13 | 0.71 || 0.17 | 0.66\n\u2713 | \u2713 | \u2205 | \u2713 || 0.08 | 0.80 || 0.18 | 0.66\n\u2713 | \u2713 | \u2713 | \u2713 || 0.08 | 0.80 || 0.15 | 0.70\n\nQuantitative Evaluations. 
We quantitatively evaluate the methods\u2019 ability to register observations\nat the proper positions in their respective coordinate systems (i.e., to predict agent trajectories), to\nretrieve observations from memory, and to synthesize new ones. For localization, we measure the\naverage position error (APE) and the absolute trajectory error (ATE), commonly used to evaluate\nSLAM systems [4].\nFor image synthesis, we make the distinction between recalling images already observed (anamnesis)\nand generating unseen views (hallucination). For both, we compute the common L1 distance between\npredicted and expected values, and the structural similarity (SSIM) index [25] for the assessment of\nperceptual quality [24, 29].\nTable 1.A-C shows the comparison on 2D cases. For pose estimation, our method is generally\nmore precise even though it leverages only the observations to infer trajectories, whereas GTM-SM\nalso infers them more directly from the provided agent actions. However, GTM-SM is trained in an\nunsupervised manner, without any location information. Therefore, we extend our evaluation by\ncomparing our method with two custom GTM-SM solutions that leverage ground-truth poses during\ntraining (supervised L1 loss over the predicted states/poses) and inference (poses directly provided\nas additional inputs). While these changes unsurprisingly improve the accuracy of GTM-SM, our\nmethod is still on a par with these results (cf. Table 1.A).\n\nFigure 5: Qualitative comparison on 3D use-cases, w.r.t. anamnesis and hallucination.\n\nFigure 6: Incremental exploration and hallucination (on 2D data). Scene representations evolve\nwith the registration of observed or hallucinated views (e.g., adapting hair color, face orientation, etc.).\n\nMoreover, while GTM-SM fares well enough in recovering seen images from memory, it cannot\nsynthesize views out of the observed domain. 
Our method not only extrapolates adequately from\nprior knowledge, but also generates views which are consistent with one another (cf. Figure 6\nshowing views stitched into a consistent global image). Moreover, as the number of observations\nincreases, so does the quality of the generated images (cf. Figure 7-b). Note that on an NVIDIA Titan\nX, the whole process (registering 5 views, localizing the agent, recalling the 5 images, and generating\n5 new ones) takes less than 1s.\nAblation Study. Results of an ablation study are shown in Table 2 to further demonstrate the\ncontribution of each module. Note that the APE/ATE are not reported, as they stay constant as\nlong as the MapNet localization is included. In other words, our extensions cause no regression in\nterms of localization. Localizing and clipping features facilitate the decoding process by disentangling\nthe visual and spatial information, thus improving the synthesis quality. Hallucinating features directly\nin the memory ensures image consistency.\n\n4.2 Exploring Virtual and Real 3D Scenes\n\nWe finally demonstrate the capability of our method on the more complex case of 3D scenes.\nExperimental Setup. As a first 3D experiment, we recorded, with the ViZDoom platform [27], 34\ntraining and 6 testing episodes of 300 RGB-D observations from a human-controlled agent navigating\nin various static virtual scenes (walking with variable speed or rotating by 30\u00b0 each step). Poses are\ndiscretized into 2D bins of 30 \u00d7 30 game units. Trajectories of 10 consecutive frames are sampled and\npassed to the methods (the first 5 images as observations, and the last 5 as training ground-truths).\nWe then consider the Active Vision Dataset (AVD) [1] which covers various real indoor scenes, often\ncapturing several rooms per scene. 
We selected 15 scenes for training and 4 for testing as suggested by\nthe dataset authors, for a total of \u223c20,000 RGB-D images densely captured every 30cm (on a 2D\ngrid) and every 30\u00b0 in rotation. For each scene we randomly sampled 5,000 agent trajectories of\n10 frames each (each step the agent goes forward with 70% probability or rotates either way, to\nfavor exploration). For both experiments, the 10-frame sequences are passed to the methods: the\nfirst 5 images as observations and the last 5 as ground-truths during training. Again, GTM-SM also\nreceives the action encodings. For our method, we opted for m \u2208 R^{32\u00d743\u00d743} for the Doom setup and\nm \u2208 R^{32\u00d729\u00d729} for the AVD one.\n\nFigure 7: (a) Statistical nature of the hallucinated content. Global scene representations are\nshown for each step t, given the same agent trajectories but different noise vectors passed to the\nhallucinatory auto-encoder; (b) Salient image quality w.r.t. agent steps and scene coverage for\nA^s_cel, computed over the global scene representations. These results show how the global scene\nproperties converge and the quality of generated images increases as observations accumulate.\n\nQualitative Results. Though a denser memory could be used for more refined results, Figure 5\nshows that our solution is able to register meaningful features and to understand scene topographies\nfrom only 5 partial observations. We note that quantization in our method is an application-specific\ndesign choice rather than a limitation. 
When compute power and memory allow, finer quantization\ncan be used to obtain better localization accuracy (cf. comparisons and discussion presented by\nMapNet authors [11]). In our case, relatively coarse quantization is sufficient for scene synthesis,\nwhere the global scene representation is more crucial. In comparison, GTM-SM generally fails to\nadapt the VAE prior and predict the belief of target sequences (refer to the supplementary material\nfor further results).\nQuantitative Evaluation. Adopting the same metrics as in Section 4.1, we compare the methods.\nAs seen in Table 1.D-E, our method slightly underperforms in terms of localization in the Doom\nenvironment. This may be due to the approximate rendering process ViZDoom uses for the depth\nobservations, with discretized values not matching the game units. These unit discrepancies affect\nour observation-based method, but not GTM-SM, which relies on action encodings for localization.\nAs to the quality of retrieved and hallucinated images, our method shows superior performance (cf.\nadditional saliency metrics in the supplementary material). While current results are still far from\nbeing visually pleasing, the proposed method is promising, with improvements expected from more\npowerful generative networks.\nIt should also be noted that the proposed hallucinatory module is more reliable when target scenes\nhave learnable priors (e.g., structure of faces). Hallucination of uncertain content (e.g., layout of\na 3D room) can be of lower quality due to the trade-off between representing uncertainties w.r.t.\nmissing content and unsure localization, and synthesizing detailed (but likely incorrect) images. Soft\nregistration and hallucinations\u2019 statistical nature can add \u201cuncertainty\u201d leading to blurred results,\nwhich our generative components partially compensate for (cf. our choice of a GAN solution for the\nDAE to improve its sampling; see the supplementary material). 
For data generation use-cases, relaxing\nhallucination constraints and scaling up L_hallu and L_anam can improve image detail at the price of\npossible memory corruption (we focused on consistency rather than high-resolution hallucinations).\n\n5 Conclusion\n\nGiven unlocalized agents only providing observations, our framework builds global representations\nconsistent with the underlying scene properties. Applying prior domain knowledge to harmoniously\ncomplete sparse memory, our method can incrementally sample novel views over whole scenes,\nresulting in the first complete read and write spatial memory for visual imagery. We evaluated on\nsynthetic and real 2D and 3D data, demonstrating the efficacy of the proposed method\u2019s memory\nmap. Future work can involve densifying the memory structure and borrowing recent advances in\ngenerating high-quality images with GANs [26].\n\nReferences\n\n[1] Phil Ammirato, Patrick Poirson, Eunbyung Park, Jana Kosecka, and Alexander C. Berg. A dataset for developing and benchmarking active vision. In ICRA, 2017.\n\n[2] Simon Brodeur, Ethan Perez, Ankesh Anand, Florian Golemo, Luca Celotti, Florian Strub, Jean Rouat, et al. Home: A household multimodal environment. arXiv preprint arXiv:1711.11017, 2017.\n\n[3] Devendra Singh Chaplot, Emilio Parisotto, and Ruslan Salakhutdinov. Active neural localization. In ICLR, 2018.\n\n[4] Siddharth Choudhary, Vadim Indelman, Henrik I Christensen, and Frank Dellaert. Information-based reduced landmark slam. In ICRA, 2015.\n\n[5] Hugh Durrant-Whyte and Tim Bailey. Simultaneous localization and mapping. IEEE Robotics & Automation Magazine, 13, 2006.\n\n[6] SM Ali Eslami et al. Neural scene representation and rendering. Science, 360(6394), 2018.\n\n[7] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. 
Deepstereo: Learning to predict new views from the world\u2019s imagery. In CVPR, 2016.\n\n[8] Marco Fraccaro, Danilo Jimenez Rezende, Yori Zwols, Alexander Pritzel, et al. Generative temporal models with spatial memory for partially observed environments. arXiv preprint arXiv:1804.09401, 2018.\n\n[9] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, 2012.\n\n[10] Peter Hedman, Tobias Ritschel, George Drettakis, and Gabriel Brostow. Scalable inside-out image-based rendering. ACM Trans. Graphics, 35, 2016.\n\n[11] Joao F Henriques and Andrea Vedaldi. Mapnet: An allocentric spatial memory for mapping environments. In CVPR, 2018.\n\n[12] Dinghuang Ji, Junghyun Kwon, Max McFarland, and Silvio Savarese. Deep view morphing. In CVPR, 2017.\n\n[13] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.\n\n[14] Emilio Parisotto and Ruslan Salakhutdinov. Neural map: Structured memory for deep reinforcement learning. In ICLR, 2018.\n\n[15] Emilio Parisotto, Devendra Singh Chaplot, Jian Zhang, and Ruslan Salakhutdinov. Global pose estimation with an attention-based recurrent network. arXiv preprint, 2018.\n\n[16] Eunbyung Park, Jimei Yang, Ersin Yumer, Duygu Ceylan, and Alexander C Berg. Transformation-grounded image generation network for novel 3d view synthesis. In CVPR, 2017.\n\n[17] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech, Oriol Vinyals, et al. Neural episodic control. arXiv preprint arXiv:1703.01988, 2017.\n\n[18] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.\n\n[19] Dan Rosenbaum, Frederic Besse, Fabio Viola, Danilo J Rezende, and SM Eslami. Learning models for visual 3d localization with implicit mapping. 
arXiv preprint, 2018.\n\n[20] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.\n\n[21] Shao-Hua Sun, Minyoung Huh, Yuan-Hong Liao, Ning Zhang, and Joseph J Lim. Multi-view to novel view: Synthesizing novel views with self-learned confidence. In ECCV, 2018.\n\n[22] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Multi-view 3d models from single images with a convolutional network. In ECCV, 2016.\n\n[23] Keisuke Tateno, Federico Tombari, Iro Laina, and Nassir Navab. Cnn-slam: Real-time dense monocular slam with learned depth prediction. In CVPR, 2017.\n\n[24] Zhou Wang and Qiang Li. Information content weighting for perceptual image quality assessment. IEEE Trans. Image Processing, 20(5):1185\u20131198, 2011.\n\n[25] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In ACSSC, 2003.\n\n[26] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, 2018.\n\n[27] Marek Wydmuch, Micha\u0142 Kempka, and Wojciech Ja\u015bkowski. Vizdoom competitions: Playing doom from pixels. IEEE Transactions on Games, 2018.\n\n[28] Dan Xu, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. arXiv preprint arXiv:1805.04409, 2018.\n\n[29] Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. A comprehensive evaluation of full reference image quality assessment algorithms. In ICIP, pages 1477\u20131480. IEEE, 2012.\n\n[30] Jingwei Zhang, Lei Tai, Joschka Boedecker, Wolfram Burgard, and Ming Liu. Neural slam: Learning to explore with external memory. arXiv preprint, 2017.\n\n[31] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. 
View synthesis by appearance flow. In ECCV, 2016.\n", "award": [], "sourceid": 936, "authors": [{"given_name": "Benjamin", "family_name": "Planche", "institution": "Siemens Corporate Technology"}, {"given_name": "Xuejian", "family_name": "Rong", "institution": "City University of New York"}, {"given_name": "Ziyan", "family_name": "Wu", "institution": "United Imaging Intelligence"}, {"given_name": "Srikrishna", "family_name": "Karanam", "institution": "United Imaging Intelligence"}, {"given_name": "Harald", "family_name": "Kosch", "institution": "PASSAU"}, {"given_name": "YingLi", "family_name": "Tian", "institution": "City University of New York"}, {"given_name": "Jan", "family_name": "Ernst", "institution": "Siemens Research"}, {"given_name": "ANDREAS", "family_name": "HUTTER", "institution": "Siemens Corporate Technology, Germany"}]}