{"title": "Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 1121, "page_last": 1132, "abstract": "Unsupervised learning with generative models has the potential of discovering rich representations of 3D scenes. While geometric deep learning has explored 3D-structure-aware representations of scene geometry, these models typically require explicit 3D supervision. Emerging neural scene representations can be trained only with posed 2D images, but existing methods ignore the three-dimensional structure of scenes. We propose Scene Representation Networks (SRNs), a continuous, 3D-structure-aware scene representation that encodes both geometry and appearance. SRNs represent scenes as continuous functions that map world coordinates to a feature representation of local scene properties. By formulating the image formation as a differentiable ray-marching algorithm, SRNs can be trained end-to-end from only 2D images and their camera poses, without access to depth or shape. This formulation naturally generalizes across scenes, learning powerful geometry and appearance priors in the process. We demonstrate the potential of SRNs by evaluating them for novel view synthesis, few-shot reconstruction, joint shape and appearance interpolation, and unsupervised discovery of a non-rigid face model.", "full_text": "Scene Representation Networks: Continuous\n\n3D-Structure-Aware Neural Scene Representations\n\nVincent Sitzmann Michael Zollh\u00f6fer\n\nGordon Wetzstein\n\n{sitzmann, zollhoefer}@cs.stanford.edu, gordon.wetzstein@stanford.edu\n\nStanford University\n\nvsitzmann.github.io/srns/\n\nAbstract\n\nUnsupervised learning with generative models has the potential of discovering rich\nrepresentations of 3D scenes. While geometric deep learning has explored 3D-\nstructure-aware representations of scene geometry, these models typically require\nexplicit 3D supervision. 
Emerging neural scene representations can be trained only\nwith posed 2D images, but existing methods ignore the three-dimensional structure\nof scenes. We propose Scene Representation Networks (SRNs), a continuous, 3D-\nstructure-aware scene representation that encodes both geometry and appearance.\nSRNs represent scenes as continuous functions that map world coordinates to\na feature representation of local scene properties. By formulating the image\nformation as a differentiable ray-marching algorithm, SRNs can be trained end-to-\nend from only 2D images and their camera poses, without access to depth or shape.\nThis formulation naturally generalizes across scenes, learning powerful geometry\nand appearance priors in the process. We demonstrate the potential of SRNs by\nevaluating them for novel view synthesis, few-shot reconstruction, joint shape and\nappearance interpolation, and unsupervised discovery of a non-rigid face model.1\n\n1\n\nIntroduction\n\nA major driver behind recent work on generative models has been the promise of unsupervised\ndiscovery of powerful neural scene representations, enabling downstream tasks ranging from robotic\nmanipulation and few-shot 3D reconstruction to navigation. A key aspect of solving these tasks is\nunderstanding the three-dimensional structure of an environment. However, prior work on neural\nscene representations either does not or only weakly enforces 3D structure [1\u20134]. Multi-view\ngeometry and projection operations are performed by a black-box neural renderer, which is expected\nto learn these operations from data. As a result, such approaches fail to discover 3D structure under\nlimited training data (see Sec. 4), lack guarantees on multi-view consistency of the rendered images,\nand learned representations are generally not interpretable. 
Furthermore, these approaches lack an\nintuitive interface to multi-view and projective geometry important in computer graphics, and cannot\neasily generalize to camera intrinsic matrices and transformations that were completely unseen at\ntraining time.\nIn geometric deep learning, many classic 3D scene representations, such as voxel grids [5\u201310], point\nclouds [11\u201314], or meshes [15] have been integrated with end-to-end deep learning models and\nhave led to signi\ufb01cant progress in 3D scene understanding. However, these scene representations\nare discrete, limiting achievable spatial resolution, only sparsely sampling the underlying smooth\nsurfaces of a scene, and often require explicit 3D supervision.\n\n1Please see supplemental video for additional results.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fWe introduce Scene Representation Networks (SRNs), a continuous neural scene representation,\nalong with a differentiable rendering algorithm, that model both 3D scene geometry and appearance,\nenforce 3D structure in a multi-view consistent manner, and naturally allow generalization of shape\nand appearance priors across scenes. The key idea of SRNs is to represent a scene implicitly as a\ncontinuous, differentiable function that maps a 3D world coordinate to a feature-based representation\nof the scene properties at that coordinate. This allows SRNs to naturally interface with established\ntechniques of multi-view and projective geometry while operating at high spatial resolution in a\nmemory-ef\ufb01cient manner. SRNs can be trained end-to-end, supervised only by a set of posed 2D\nimages of a scene. SRNs generate high-quality images without any 2D convolutions, exclusively\noperating on individual pixels, which enables image generation at arbitrary resolutions. They\ngeneralize naturally to camera transformations and intrinsic parameters that were completely unseen at\ntraining time. 
For instance, SRNs that have only ever seen objects from a constant distance are capable of rendering close-ups of said objects flawlessly. We evaluate SRNs on a variety of challenging 3D computer vision problems, including novel view synthesis, few-shot scene reconstruction, joint shape and appearance interpolation, and unsupervised discovery of a non-rigid face model.

To summarize, our approach makes the following key contributions:

• A continuous, 3D-structure-aware neural scene representation and renderer, SRNs, that efficiently encapsulate both scene geometry and appearance.
• End-to-end training of SRNs without explicit supervision in 3D space, purely from a set of posed 2D images.
• We demonstrate novel view synthesis, shape and appearance interpolation, and few-shot reconstruction, as well as unsupervised discovery of a non-rigid face model, and significantly outperform baselines from recent literature.

Scope The current formulation of SRNs does not model view- and lighting-dependent effects or translucency, reconstructs shape and appearance in an entangled manner, and is non-probabilistic. Please see Sec. 5 for a discussion of future work in these directions.

2 Related Work

Our approach lies at the intersection of multiple fields. In the following, we review related work.

Geometric Deep Learning. Geometric deep learning has explored various representations to reason about scene geometry. Discretization-based techniques use voxel grids [7, 16–22], octree hierarchies [23–25], point clouds [11, 26, 27], multiplane images [28], patches [29], or meshes [15, 21, 30, 31]. Methods based on function spaces continuously represent space as the decision boundary of a learned binary classifier [32] or a continuous signed distance field [33–35].
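The function-space representations mentioned above can be made concrete with a toy example. Below is a minimal sketch (ours, not from any cited work) of a continuous signed distance field for a sphere: a function R^3 → R that is negative inside the surface, zero on it, and positive outside.

```python
import numpy as np

def sdf_sphere(x, radius=1.0):
    """Continuous signed distance field of a sphere centered at the origin:
    negative inside the surface, zero on it, positive outside.
    x has shape (..., 3)."""
    return np.linalg.norm(x, axis=-1) - radius

# Query the field at an interior and an exterior point.
vals = sdf_sphere(np.array([[0.0, 0.0, 0.0],
                            [2.0, 0.0, 0.0]]))  # -> [-1.0, 1.0]
```

In learned variants, the analytic function above is replaced by a neural network fit to observations, which is what makes the representation continuous rather than grid-based.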
While\nthese techniques are successful at modeling geometry, they often require 3D supervision, and it is\nunclear how to ef\ufb01ciently infer and represent appearance. Our proposed method encapsulates both\nscene geometry and appearance, and can be trained end-to-end via learned differentiable rendering,\nsupervised only with posed 2D images.\n\nNeural Scene Representations. Latent codes of autoencoders may be interpreted as a feature\nrepresentation of the encoded scene. Novel views may be rendered by concatenating target pose\nand latent code [1] or performing view transformations directly in the latent space [4]. Generative\nQuery Networks [2, 3] introduce a probabilistic reasoning framework that models uncertainty due to\nincomplete observations, but both the scene representation and the renderer are oblivious to the scene\u2019s\n3D structure. Some prior work infers voxel grid representations of 3D scenes from images [6, 8, 9] or\nuses them for 3D-structure-aware generative models [10, 36]. Graph neural networks may similarly\ncapture 3D structure [37]. Compositional structure may be modeled by representing scenes as\nprograms [38]. We demonstrate that models with scene representations that ignore 3D structure fail to\nperform viewpoint transformations in a regime of limited (but signi\ufb01cant) data, such as the Shapenet\nv2 dataset [39]. Instead of a discrete representation, which limits achievable spatial resolution and\ndoes not smoothly parameterize scene surfaces, we propose a continuous scene representation.\n\nNeural Image Synthesis. Deep models for 2D image and video synthesis have recently shown\npromising results in generating photorealistic images. 
Some of these approaches are based on (variational) auto-encoders [40, 41], generative flows [42, 43], or autoregressive per-pixel models [44, 45]. In particular, generative adversarial networks [46–50] and their conditional variants [51–53] have recently achieved photo-realistic single-image generation. Compositional Pattern Producing Networks [54, 55] learn functions that map 2D image coordinates to color. Some approaches build on explicit spatial or perspective transformations in the networks [56–58, 14]. Recently, following the spirit of “vision as inverse graphics” [59, 60], deep neural networks have been applied to the task of inverting graphics engines [61–65]. However, these 2D generative models only learn to parameterize the manifold of 2D natural images, and struggle to generate images that are multi-view consistent, since the underlying 3D scene structure cannot be exploited.

Figure 1: Overview: at the heart of SRNs lies a continuous, 3D-aware neural scene representation, Φ, which represents a scene as a function that maps (x, y, z) world coordinates to a feature representation of the scene at those coordinates (see Sec. 3.1). A neural renderer Θ, consisting of a learned ray marcher and a pixel generator, can render the scene from arbitrary novel view points (see Sec. 3.2).

3 Formulation

Given a training set C = {(I_i, E_i, K_i)}_{i=1}^N of N tuples of images I_i ∈ R^{H×W×3} along with their respective extrinsic E_i = [R|t] ∈ R^{3×4} and intrinsic K_i ∈ R^{3×3} camera matrices [66], our goal is to distill this dataset of observations into a neural scene representation Φ that strictly enforces 3D structure and allows us to generalize shape and appearance priors across scenes. In addition, we are interested in a rendering function Θ that allows us to render the scene represented by Φ from arbitrary viewpoints.
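The representation Φ is realized as a small MLP over world coordinates. A minimal numpy sketch of such a coordinate-to-feature network follows; the layer sizes and random (untrained) weights are illustrative only, not the architecture from the paper's supplement.

```python
import numpy as np

def init_phi(rng, n_hidden=64, n_features=32):
    """Randomly initialized MLP parameters for Phi: R^3 -> R^n.
    Sizes are illustrative, not the paper's actual architecture."""
    sizes = [3, n_hidden, n_hidden, n_features]
    return [(rng.standard_normal((m, k)) / np.sqrt(m), np.zeros(k))
            for m, k in zip(sizes[:-1], sizes[1:])]

def phi(params, x):
    """Evaluate Phi at a batch of world coordinates x, shape (B, 3)."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)   # ReLU hidden layers
    W, b = params[-1]
    return h @ W + b                     # feature vector v, shape (B, n)

rng = np.random.default_rng(0)
params = init_phi(rng)
# Phi is continuous over R^3: it can be queried at any (x, y, z).
v = phi(params, np.array([[0.1, -0.2, 0.3]]))
```

Because Φ is a function of continuous coordinates rather than a grid, there is no fixed spatial resolution; sampling density is chosen at query time.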
In the following, we first formalize Φ and Θ and then discuss a framework for optimizing Φ, Θ for a single scene given only posed 2D images. Note that this approach does not require information about scene geometry. Additionally, we show how to learn a family of scene representations for an entire class of scenes, discovering powerful shape and appearance priors.

3.1 Representing Scenes as Functions

Our key idea is to represent a scene as a function Φ that maps a spatial location x to a feature representation v of learned scene properties at that spatial location:

Φ : R^3 → R^n,  x ↦ Φ(x) = v.  (1)

The feature vector v may encode visual information such as surface color or reflectance, but it may also encode higher-order information, such as the signed distance of x to the closest scene surface. This continuous formulation can be interpreted as a generalization of discrete neural scene representations. Voxel grids, for instance, discretize R^3 and store features in the resulting 3D grid [5–10]. Point clouds [12–14] may contain points at any position in R^3, but only sparsely sample surface properties of a scene. In contrast, Φ densely models scene properties and can in theory model arbitrary spatial resolutions, as it is continuous over R^3 and can be sampled with arbitrary resolution. In practice, we represent Φ as a multi-layer perceptron (MLP), and spatial resolution is thus limited by the capacity of the MLP.

In contrast to recent work on representing scenes as unstructured or weakly structured feature embeddings [1, 4, 2], Φ is explicitly aware of the 3D structure of scenes, as the inputs to Φ are world coordinates (x, y, z) ∈ R^3. This allows interacting with Φ via the toolbox of multi-view and perspective geometry that the physical world obeys, only using learning to approximate the unknown properties of the scene itself. In Sec.
4, we show that this formulation leads to multi-view consistent\nnovel view synthesis, data-ef\ufb01cient training, and a signi\ufb01cant gain in model interpretability.\n\n3\n\n\f3.2 Neural Rendering\n\nGiven a scene representation \u03a6, we introduce a neural rendering algorithm \u0398, that maps a scene\nrepresentation \u03a6 as well as the intrinsic K and extrinsic E camera parameters to an image I:\n\n\u0398 : X \u00d7 R3\u00d74 \u00d7 R3\u00d73 \u2192 RH\u00d7W\u00d73,\n\n(\u03a6, E, K) (cid:55)\u2192 \u0398(\u03a6, E, K) = I,\n\n(2)\n\nwhere X is the space of all functions \u03a6.\nThe key complication in rendering a scene represented by \u03a6 is that geometry is represented implicitly.\nThe surface of a wooden table top, for instance, is de\ufb01ned by the subspace of R3 where \u03a6 undergoes\na change from a feature vector representing free space to one representing wood.\nTo render a single pixel in the image observed by a virtual camera, we thus have to solve two\nsub-problems: (i) \ufb01nding the world coordinates of the intersections of the respective camera rays with\nscene geometry, and (ii) mapping the feature vector v at that spatial coordinate to a color. 
We will first propose a neural ray marching algorithm with learned, adaptive step size to find ray intersections with scene geometry, and subsequently discuss the architecture of the pixel generator network that learns the feature-to-color mapping.

3.2.1 Differentiable Ray Marching Algorithm

Algorithm 1 Differentiable Ray-Marching

1: function FINDINTERSECTION(Φ, K, E, (u, v))
2:   d_0 ← 0.05                                  ▷ Near plane
3:   (h_0, c_0) ← (0, 0)                         ▷ Initial state of LSTM
4:   for i ← 0 to max_iter do
5:     x_i ← r_{u,v}(d_i)                        ▷ Calculate world coordinates
6:     v_i ← Φ(x_i)                              ▷ Extract feature vector
7:     (δ, h_{i+1}, c_{i+1}) ← LSTM(v_i, h_i, c_i)  ▷ Predict step length using ray marching LSTM
8:     d_{i+1} ← d_i + δ                         ▷ Update d
9:   return r_{u,v}(d_max_iter)

Intersection testing intuitively amounts to solving an optimization problem, where the point along each camera ray is sought that minimizes the distance to the surface of the scene. To model this problem, we parameterize the points along each ray, identified with the coordinates (u, v) of the respective pixel, with their distance d to the camera (d > 0 represents points in front of the camera):

r_{u,v}(d) = R^T (K^{-1} (u, v, d)^T − t),  d > 0,  (3)

with world coordinates r_{u,v}(d) of a point along the ray with distance d to the camera, camera intrinsics K, and camera rotation matrix R and translation vector t. For each ray, we aim to solve

arg min_d d  s.t.  r_{u,v}(d) ∈ Ω,  d > 0,  (4)

where we define the set of all points that lie on the surface of the scene as Ω.

Here, we take inspiration from the classic sphere tracing algorithm [67]. Sphere tracing belongs to the class of ray marching algorithms, which solve Eq. 4 by starting at a distance d_init close to the camera and stepping along the ray until scene geometry is intersected.
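Classic sphere tracing, which the learned ray marcher generalizes, can be sketched in a few lines. The sphere SDF below is a stand-in for scene geometry (in SRNs, no analytic SDF is available; the RM-LSTM instead predicts each step length from the feature vector Φ(x_i)).

```python
import numpy as np

def sphere_sdf(x, center=np.array([0.0, 0.0, 2.0]), radius=0.5):
    """Signed distance to a sphere -- a stand-in for the scene surface."""
    return np.linalg.norm(x - center) - radius

def sphere_trace(origin, direction, sdf, d_init=0.05, max_iter=64, eps=1e-4):
    """Classic sphere tracing: step by the signed distance to the closest
    surface until the surface is (approximately) reached."""
    d = d_init
    for _ in range(max_iter):
        step = sdf(origin + d * direction)
        if step < eps:          # arrived at (or within eps of) the surface
            break
        d += step               # safe step: cannot overshoot the surface
    return d, origin + d * direction

# Ray from the origin along +z hits the sphere at z = 1.5.
d, hit = sphere_trace(np.zeros(3), np.array([0.0, 0.0, 1.0]), sphere_sdf)
# d -> 1.5
```

Replacing the `step = sdf(...)` line with a learned prediction (and running for a fixed number of iterations, as Alg. 1 does) gives the structure of the differentiable ray marcher.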
Sphere tracing is de\ufb01ned by a\nspecial choice of the step length: each step has a length equal to the signed distance to the closest\nsurface point of the scene. Since this distance is only 0 on the surface of the scene, the algorithm takes\nnon-zero steps until it has arrived at the surface, at which point no further steps are taken. Extensions\nof this algorithm propose heuristics to modifying the step length to speed up convergence [68]. We\ninstead propose to learn the length of each step.\nSpeci\ufb01cally, we introduce a ray marching long short-term memory (RM-LSTM) [69], that maps the\nfeature vector \u03a6(xi) = vi at the current estimate of the ray intersection xi to the length of the next\nray marching step. The algorithm is formalized in Alg. 1.\n\n4\n\n\fGiven our current estimate di, we compute world coordinates xi = ru,v(di) via Eq. 3. We\nthen compute \u03a6(xi) to obtain a feature vector vi, which we expect to encode information about\nnearby scene surfaces. We then compute the step length \u03b4 via the RM-LSTM as (\u03b4, hi+1, ci+1) =\nLST M (vi, hi, ci), where h and c are the output and cell states, and increment di accordingly. We\niterate this process for a constant number of steps. This is critical, because a dynamic termination\ncriterion would have no guarantee for convergence in the beginning of the training, where both \u03a6\nand the ray marching LSTM are initialized at random. The \ufb01nal step yields our estimate of the\nworld coordinates of the intersection of the ray with scene geometry. The z-coordinates of running\nand \ufb01nal estimates of intersections in camera coordinates yield depth maps, which we denote as\ndi, which visualize every step of the ray marcher. This makes the ray marcher interpretable, as\nfailures in geometry estimation show as inconsistencies in the depth map. Note that depth maps are\ndifferentiable with respect to all model parameters, but are not required for training \u03a6. 
Please see\nthe supplement for a contextualization of the proposed rendering approach with classical rendering\nalgorithms.\n\n3.2.2 Pixel Generator Architecture\n\nThe pixel generator takes as input the 2D feature map sampled from \u03a6 at world coordinates of ray-\nsurface intersections and maps it to an estimate of the observed image. As a generator architecture,\nwe choose a per-pixel MLP that maps a single feature vector v to a single RGB vector. This is\nequivalent to a convolutional neural network (CNN) with only 1 \u00d7 1 convolutions. Formulating the\ngenerator without 2D convolutions has several bene\ufb01ts. First, the generator will always map the same\n(x, y, z) coordinate to the same color value. Assuming that the ray-marching algorithm \ufb01nds the\ncorrect intersection, the rendering is thus trivially multi-view consistent. This is in contrast to 2D\nconvolutions, where the value of a single pixel depends on a neighborhood of features in the input\nfeature map. When transforming the camera in 3D, e.g. by moving it closer to a surface, the 2D\nneighborhood of a feature may change. As a result, 2D convolutions come with no guarantee on multi-\nview consistency. With our per-pixel formulation, the rendering function \u0398 operates independently\non all pixels, allowing images to be generated with arbitrary resolutions and poses. On the \ufb02ip side,\nwe cannot exploit recent architectural progress in CNNs, and a per-pixel formulation requires the\nray marching, the SRNs and the pixel generator to operate on the same (potentially high) resolution,\nrequiring a signi\ufb01cant memory budget. Please see the supplement for a discussion of this trade-off.\n\n3.3 Generalizing Across Scenes\n\nj=1, where each Cj consists of tuples {(Ii, Ei, Ki)}N\n\nWe now generalize SRNs from learning to represent a single scene to learning shape and appearance\npriors over several instances of a single class. 
Formally, we assume that we are given a set of M\ninstance datasets D = {Cj}M\ni=1 as discussed in\nSec. 3.1.\nWe reason about the set of functions {\u03a6j}M\nj=1 that represent instances of objects belonging to the\nsame class. By parameterizing a speci\ufb01c \u03a6j as an MLP, we can represent it with its vector of\nparameters \u03c6j \u2208 Rl. We assume scenes of the same class have common shape and appearance\nproperties that can be fully characterized by a set of latent variables z \u2208 Rk, k < l. Equivalently, this\nassumes that all parameters \u03c6j live in a k-dimensional subspace of Rl. Finally, we de\ufb01ne a mapping\n(5)\nthat maps a latent vector zj to the parameters \u03c6j of the corresponding \u03a6j. We propose to parameterize\n\u03a8 as an MLP, with parameters \u03c8. This architecture was previously introduced as a Hypernetwork [70],\na neural network that regresses the parameters of another neural network. We share the parameters\nof the rendering function \u0398 across scenes. We note that assuming a low-dimensional embedding\nmanifold has so far mainly been empirically demonstrated for classes of single objects. Here, we\nsimilarly only demonstrate generalization over classes of single objects.\n\n\u03a8 : Rk \u2192 Rl,\n\nzj (cid:55)\u2192 \u03a8(zj) = \u03c6j\n\nFinding latent codes zj. To \ufb01nd the latent code vectors zj, we follow an auto-decoder frame-\nwork [33]. For this purpose, each object instance Cj is represented by its own latent code zj. The zj\nare free variables and are optimized jointly with the parameters of the hypernetwork \u03a8 and the neural\nrenderer \u0398. We assume that the prior distribution over the zj is a zero-mean multivariate Gaussian\nwith a diagonal covariance matrix. Please refer to [33] for additional details.\n\n5\n\n\fFigure 2: Shepard-Metzler object from 1k-object\ntraining set, 15 observations each. 
SRNs (right) outperform dGQN (left) on this small dataset.

Figure 3: Non-rigid animation of a face. Note that mouth movement is directly reflected in the normal maps.

Figure 4: Normal maps for a selection of objects (Shapenet v2 objects, 50-shot and single-shot, and DeepVoxels objects). We note that geometry is learned fully unsupervised and arises purely out of the perspective and multi-view geometry constraints on the image formation.

3.4 Joint Optimization

To summarize, given a dataset D = {C_j}_{j=1}^M of instance datasets C_j = {(I_i, E_i, K_i)}_{i=1}^N, we aim to find the parameters ψ of Ψ that maps latent vectors z_j to the parameters of the respective scene representation φ_j, the parameters θ of the neural rendering function Θ, as well as the latent codes z_j themselves. We formulate this as an optimization problem with the following objective:

arg min_{θ, ψ, {z_j}_{j=1}^M}  Σ_{j=1}^M Σ_{i=1}^N  ‖Θ_θ(Φ_{Ψ(z_j)}, E_i^j, K_i^j) − I_i^j‖_2^2  +  λ_dep ‖min(d_{i,final}^j, 0)‖_2^2  +  λ_lat ‖z_j‖_2^2,  (6)

where the three terms are denoted L_img, L_depth, and L_latent, respectively. L_img is an ℓ2-loss enforcing closeness of the rendered image to ground truth, L_depth is a regularization term that accounts for the positivity constraint in Eq. 4, and L_latent enforces a Gaussian prior on the z_j. In the case of a single scene, this objective simplifies to solving for the parameters φ of the MLP parameterization of Φ instead of the parameters ψ and latent codes z_j. We solve Eq. 6 with stochastic gradient descent. Note that the whole pipeline can be trained end-to-end, without requiring any (pre-)training of individual parts. In Sec. 4, we demonstrate that SRNs discover both geometry and appearance, initialized at random, without requiring prior knowledge of either scene geometry or scene scale, enabling multi-view consistent novel view synthesis.

Few-shot reconstruction. After finding model parameters by solving Eq. 6, we may use the trained model for few-shot reconstruction of a new object instance, represented by a dataset C = {(I_i, E_i, K_i)}_{i=1}^N. We fix θ as well as ψ, and estimate a new latent code ẑ by minimizing

ẑ = arg min_z Σ_{i=1}^N ‖Θ_θ(Φ_{Ψ(z)}, E_i, K_i) − I_i‖_2^2 + λ_dep ‖min(d_{i,final}, 0)‖_2^2 + λ_lat ‖z‖_2^2.  (7)

4 Experiments

We train SRNs on several object classes and evaluate them for novel view synthesis and few-shot reconstruction. We further demonstrate the discovery of a non-rigid face model. Please see the supplement for a comparison on single-scene novel view synthesis performance with DeepVoxels [6].

Figure 5: Interpolating latent code vectors of cars and chairs in the Shapenet dataset while rotating the camera around the model. Features smoothly transition from one model to another.

Figure 6: Qualitative comparison with Tatarchenko et al. [1] and the deterministic variant of the GQN [2], for novel view synthesis on the Shapenet v2 “cars” and “chairs” classes. We compare novel views for objects reconstructed from 50 observations in the training set (top row), two observations and a single observation (second and third row) from a test set. SRNs consistently outperforms these baselines with multi-view consistent novel views, while also reconstructing geometry. Please see the
Please see the\nsupplemental video for more comparisons, smooth camera trajectories, and reconstructed geometry.\n\nImplementation Details. Hyperparameters, computational complexity, and full network architec-\ntures for SRNs and all baselines are in the supplement. Training of the presented models takes on the\norder of 6 days. A single forward pass takes around 120 ms and 3 GB of GPU memory per batch\nitem. Code and datasets are available.\n\nShepard-Metzler objects. We evaluate our approach on 7-element Shepard-Metzler objects in a\nlimited-data setting. We render 15 observations of 1k objects at a resolution of 64\u00d7 64. We train both\nSRNs and a deterministic variant of the Generative Query Network [2] (dGQN, please see supplement\nfor an extended discussion). Note that the dGQN is solving a harder problem, as it is inferring the\nscene representation in each forward pass, while our formulation requires solving an optimization\nproblem to \ufb01nd latent codes for unseen objects. We benchmark novel view reconstruction accuracy\non (1) the training set and (2) few-shot reconstruction of 100 objects from a held-out test set. On the\ntraining objects, SRNs achieve almost pixel-perfect results with a PSNR of 30.41 dB. The dGQN\nfails to learn object shape and multi-view geometry on this limited dataset, achieving 20.85 dB. See\nFig. 2 for a qualitative comparison. In a two-shot setting (see Fig. 7 for reference views), we succeed\nin reconstructing any part of the object that has been observed, achieving 24.36 dB, while the dGQN\nachieves 18.56 dB. In a one-shot setting, SRNs reconstruct an object consistent with the observed\nview. As expected, due to the current non-probabilistic implementation, both the dGQN and SRNs\nreconstruct an object resembling the mean of the hundreds of feasible objects that may have generated\nthe observation, achieving 17.51 dB and 18.11 dB respectively.\n\nShapenet v2. 
We consider the “chair” and “car” classes of Shapenet v.2 [39] with 4.5k and 2.5k model instances respectively. We disable transparencies and specularities, and train on 50 observations of each instance at a resolution of 128 × 128 pixels. Camera poses are randomly generated on a sphere with the object at the origin. We evaluate performance on (1) novel-view synthesis of objects in the training set and (2) novel-view synthesis on objects in the held-out, official Shapenet v2 test sets, reconstructed from one or two observations, as discussed in Sec. 3.4. Fig. 7 shows the sampled poses for the few-shot case. In all settings, we assemble ground-truth novel views by sampling 250 views in an Archimedean spiral around each object instance. We compare SRNs to three baselines from recent literature. Table 1 and Fig. 6 report quantitative and qualitative results respectively.

Figure 7: Single- (left) and two-shot (both) reference views.

Table 1: PSNR (in dB) and SSIM of images reconstructed with our method, the deterministic variant of the GQN [2] (dGQN), the model proposed by Tatarchenko et al. [1] (TCO), and the method proposed by Worrall et al. [4] (WRL). We compare novel-view synthesis performance on objects in the training set (containing 50 images of each object), as well as reconstruction from 1 or 2 images on the held-out test set.

          50 images (training set)      2 images                     Single image
          Chairs        Cars           Chairs        Cars           Chairs        Cars
TCO [1]   24.31 / 0.92  20.38 / 0.83   21.33 / 0.88  18.41 / 0.80   21.27 / 0.88  18.15 / 0.79
WRL [4]   24.57 / 0.93  19.16 / 0.82   22.28 / 0.90  17.20 / 0.78   22.11 / 0.90  16.89 / 0.77
dGQN [2]  22.72 / 0.90  19.61 / 0.81   22.36 / 0.89  18.79 / 0.79   21.59 / 0.87  18.19 / 0.78
SRNs      26.23 / 0.95  26.32 / 0.94   24.48 / 0.92  22.94 / 0.88   22.89 / 0.91  20.72 / 0.85
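The PSNR figures in Table 1 follow the standard definition; a minimal sketch for images with values in [0, 1] (the helper below is ours, not the paper's evaluation code):

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

# A uniform per-pixel error of 0.1 gives MSE = 0.01, i.e. 20 dB.
a = np.zeros((8, 8, 3))
b = np.full((8, 8, 3), 0.1)
print(round(psnr(b, a), 2))  # prints 20.0
```

Higher PSNR indicates lower reconstruction error; the roughly 2–6 dB gaps in Table 1 thus correspond to substantially lower mean squared error for SRNs.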
In all settings, we outperform all baselines by a wide margin. On the training\nset, we achieve very high visual \ufb01delity. Generally, views are perfectly multi-view consistent, the\nonly exception being objects with distinct, usually \ufb01ne geometric detail, such as the windscreen of\nconvertibles. None of the baselines succeed in generating multi-view consistent views. Several views\nper object are usually entirely degenerate. In the two-shot case, where most of the object has been\nseen, SRNs still reconstruct both object appearance and geometry robustly. In the single-shot case,\nSRNs complete unseen parts of the object in a plausible manner, demonstrating that the learned priors\nhave truthfully captured the underlying distributions.\n\nSupervising parameters for non-rigid deformation.\nIf latent parameters of the scene are known,\nwe can condition on these parameters instead of jointly solving for latent variables zj. We generate 50\nrenderings each from 1000 faces sampled at random from the Basel face model [71]. Camera poses\nare sampled from a hemisphere in front of the face. Each face is fully de\ufb01ned by a 224-dimensional\nparameter vector, where the \ufb01rst 160 parameterize identity, and the last 64 dimensions control\nfacial expression. We use a constant ambient illumination to render all faces. Conditioned on this\ndisentangled latent space, SRNs succeed in reconstructing face geometry and appearance. After\ntraining, we animate facial expression by varying the 64 expression parameters while keeping the\nidentity \ufb01xed, even though this speci\ufb01c combination of identity and expression has not been observed\nbefore. Fig. 3 shows qualitative results of this non-rigid deformation. Expressions smoothly transition\nfrom one to the other, and the reconstructed normal maps, which are directly computed from the\ndepth maps (not shown), demonstrate that the model has learned the underlying geometry.\n\nGeometry reconstruction. 
SRNs reconstruct geometry in a fully unsupervised manner, purely out\nof necessity to explain observations in 3D. Fig. 4 visualizes geometry for 50-shot, single-shot, and\nsingle-scene reconstructions.\n\nLatent space interpolation. Our learned latent space allows meaningful interpolation of object\ninstances. Fig. 5 shows latent space interpolation.\n\nPose extrapolation. Due to the explicit 3D-aware and per-pixel formulation, SRNs naturally\ngeneralize to 3D transformations that have never been seen during training, such as camera close-ups\nor camera roll, even when trained only on up-right camera poses distributed on a sphere around the\nobjects. Please see the supplemental video for examples of pose extrapolation.\n\nFailure cases. The ray marcher may \u201cget stuck\u201d in holes of sur-\nfaces or on rays that closely pass by occluders, such as commonly\noccur in chairs. SRNs generates a continuous surface in these cases,\nor will sometimes step through the surface. If objects are far away\nfrom the training distribution, SRNs may fail to reconstruct geom-\netry and instead only match texture. In both cases, the reconstructed\ngeometry allows us to analyze the failure, which is impossible with\nblack-box alternatives. See Fig. 8 and the supplemental video.\n\n8\n\nFigure 8: Failure cases.\n\n\fTowards representing room-scale scenes. We demonstrate reconstruction of a room-scale scene\nwith SRNs. We train a single SRN on 500 observations of a minecraft room. The room contains\nmultiple objects as well as four columns, such that parts of the scene are occluded in most observations.\nAfter training, the SRN enables novel view synthesis of the room. Though generated images are\nblurry, they are largely multi-view consistent, with artifacts due to ray marching failures only at\nobject boundaries and thin structures. 
The SRN succeeds in inferring the geometry and appearance of the room, reconstructing occluding columns and objects correctly and failing only on low-texture areas (where geometry is only weakly constrained) and thin tubes placed between columns. Please see the supplemental video for qualitative results.

5 Discussion

We introduce SRNs, a 3D-structured neural scene representation that implicitly represents a scene as a continuous, differentiable function. This function maps 3D coordinates to a feature-based representation of the scene and can be trained end-to-end with a differentiable ray marcher that renders the feature-based representation into a set of 2D images. SRNs do not require shape supervision and can be trained with only a set of posed 2D images. We demonstrate results for novel view synthesis, shape and appearance interpolation, and few-shot reconstruction.

There are several exciting avenues for future work. SRNs could be explored in a probabilistic framework [2, 3], enabling sampling of feasible scenes given a set of observations. SRNs could be extended to model view- and lighting-dependent effects, translucency, and participating media. They could also be extended to other image formation models, such as computed tomography or magnetic resonance imaging. Currently, SRNs require camera intrinsic and extrinsic parameters, which can be obtained robustly via bundle adjustment. However, as SRNs are differentiable with respect to camera parameters, future work may instead integrate them with learned algorithms for camera pose estimation [72]. SRNs also have exciting applications outside of vision and graphics, and future work may explore SRNs in robotic manipulation or as the world model of an independent agent. While SRNs can represent room-scale scenes (see the supplemental video), generalization across complex, cluttered 3D environments is an open problem. 
Recent work in meta-learning could enable generalization across scenes with weaker assumptions on the dimensionality of the underlying manifold [73]. Please see the supplemental material for further details on directions for future work.

6 Acknowledgements

We thank Ludwig Schubert for fruitful discussions. Vincent Sitzmann was supported by a Stanford Graduate Fellowship. Michael Zollhöfer was supported by the Max Planck Center for Visual Computing and Communication (MPC-VCC). Gordon Wetzstein was supported by NSF awards (IIS 1553333, CMMI 1839974), by a Sloan Fellowship, by an Okawa Research Grant, and a PECASE.

References

[1] M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Single-view to multi-view: Reconstructing unseen views with a convolutional network,” CoRR abs/1511.06702, vol. 1, no. 2, p. 2, 2015.

[2] S. A. Eslami, D. J. Rezende, F. Besse, F. Viola, A. S. Morcos, M. Garnelo, A. Ruderman, A. A. Rusu, I. Danihelka, K. Gregor et al., “Neural scene representation and rendering,” Science, vol. 360, no. 6394, pp. 1204–1210, 2018.

[3] A. Kumar, S. A. Eslami, D. Rezende, M. Garnelo, F. Viola, E. Lockhart, and M. Shanahan, “Consistent jumpy predictions for videos and scenes,” 2018.

[4] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow, “Interpretable transformations with encoder-decoder networks,” in Proc. ICCV, vol. 4, 2017.

[5] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition,” in Proc. IROS, September 2015, pp. 922–928.

[6] V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and M. Zollhöfer, “Deepvoxels: Learning persistent 3d feature embeddings,” in Proc. CVPR, 2019.

[7] A. Kar, C. Häne, and J. Malik, “Learning a multi-view stereo machine,” in Proc. NIPS, 2017, pp. 365–376.

[8] H.-Y. F. Tung, R. Cheng, and K. 
Fragkiadaki, “Learning spatial common sense with geometry-aware recurrent networks,” Proc. CVPR, 2019.

[9] T. H. Nguyen-Phuoc, C. Li, S. Balaban, and Y. Yang, “Rendernet: A deep convolutional network for differentiable rendering from 3d shapes,” in Proc. NIPS, 2018.

[10] J.-Y. Zhu, Z. Zhang, C. Zhang, J. Wu, A. Torralba, J. Tenenbaum, and B. Freeman, “Visual object networks: image generation with disentangled 3d representations,” in Proc. NIPS, 2018, pp. 118–129.

[11] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” Proc. CVPR, 2017.

[12] E. Insafutdinov and A. Dosovitskiy, “Unsupervised learning of shape and pose with differentiable point clouds,” in Proc. NIPS, 2018, pp. 2802–2812.

[13] M. Meshry, D. B. Goldman, S. Khamis, H. Hoppe, R. Pandey, N. Snavely, and R. Martin-Brualla, “Neural rerendering in the wild,” Proc. CVPR, 2019.

[14] C.-H. Lin, C. Kong, and S. Lucey, “Learning efficient point cloud generation for dense 3d object reconstruction,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[15] D. Jack, J. K. Pontes, S. Sridharan, C. Fookes, S. Shirazi, F. Maire, and A. Eriksson, “Learning free-form deformations for 3d object reconstruction,” CoRR, 2018.

[16] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik, “Multi-view supervision for single-view reconstruction via differentiable ray consistency,” in Proc. CVPR.

[17] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum, “Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling,” in Proc. NIPS, 2016, pp. 82–90.

[18] M. Gadelha, S. Maji, and R. Wang, “3d shape induction from 2d views of multiple objects,” in 3DV. IEEE Computer Society, 2017, pp. 402–411.

[19] C. R. 
Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. Guibas, “Volumetric and multi-view cnns for object classification on 3d data,” in Proc. CVPR, 2016.

[20] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T. Freeman, “Pix3d: Dataset and methods for single-image 3d shape modeling,” in Proc. CVPR, 2018.

[21] D. Jimenez Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess, “Unsupervised learning of 3d structure from images,” in Proc. NIPS, 2016.

[22] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese, “3d-r2n2: A unified approach for single and multi-view 3d object reconstruction,” in Proc. ECCV, 2016.

[23] G. Riegler, A. O. Ulusoy, and A. Geiger, “Octnet: Learning deep 3d representations at high resolutions,” in Proc. CVPR, 2017.

[24] M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs,” in Proc. ICCV, 2017, pp. 2107–2115.

[25] C. Haene, S. Tulsiani, and J. Malik, “Hierarchical surface prediction,” Proc. PAMI, pp. 1–1, 2019.

[26] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas, “Learning representations and generative models for 3D point clouds,” in Proc. ICML, 2018, pp. 40–49.

[27] M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Multi-view 3d models from single images with a convolutional network,” in Proc. ECCV, 2016.

[28] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely, “Stereo magnification: learning view synthesis using multiplane images,” ACM Trans. Graph., vol. 37, no. 4, pp. 65:1–65:12, 2018.

[29] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry, “Atlasnet: A papier-mâché approach to learning 3d surface generation,” in Proc. CVPR, 2018.

[30] H. Kato, Y. Ushiku, and T. 
Harada, “Neural 3d mesh renderer,” in Proc. CVPR, 2018, pp. 3907–3916.

[31] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik, “Learning category-specific mesh reconstruction from image collections,” in ECCV, 2018.

[32] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger, “Occupancy networks: Learning 3d reconstruction in function space,” in Proc. CVPR, 2019.

[33] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, “Deepsdf: Learning continuous signed distance functions for shape representation,” arXiv preprint arXiv:1901.05103, 2019.

[34] K. Genova, F. Cole, D. Vlasic, A. Sarna, W. T. Freeman, and T. Funkhouser, “Learning shape templates with structured implicit functions,” Proc. ICCV, 2019.

[35] B. Deng, K. Genova, S. Yazdani, S. Bouaziz, G. Hinton, and A. Tagliasacchi, “Cvxnets: Learnable convex decomposition,” arXiv preprint arXiv:1909.05736, 2019.

[36] T. Nguyen-Phuoc, C. Li, L. Theis, C. Richardt, and Y. Yang, “Hologan: Unsupervised learning of 3d representations from natural images,” in Proc. ICCV, 2019.

[37] F. Alet, A. K. Jeewajee, M. Bauza, A. Rodriguez, T. Lozano-Perez, and L. P. Kaelbling, “Graph element networks: adaptive, structured computation and memory,” in Proc. ICML, 2019.

[38] Y. Liu, Z. Wu, D. Ritchie, W. T. Freeman, J. B. Tenenbaum, and J. Wu, “Learning to describe scenes with programs,” in Proc. ICLR, 2019.

[39] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.

[40] G. E. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006.

[41] D. P. Kingma and M. 
Welling, “Auto-encoding variational bayes,” in Proc. ICLR, 2013.

[42] L. Dinh, D. Krueger, and Y. Bengio, “NICE: non-linear independent components estimation,” in Proc. ICLR Workshops, 2015.

[43] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” in NeurIPS, 2018, pp. 10236–10245.

[44] A. v. d. Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, “Conditional image generation with pixelcnn decoders,” in Proc. NIPS, 2016, pp. 4797–4805.

[45] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in Proc. ICML, 2016.

[46] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. NIPS, 2014.

[47] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proc. ICML, 2017.

[48] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” in Proc. ICLR, 2018.

[49] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros, “Generative visual manipulation on the natural image manifold,” in Proc. ECCV, 2016.

[50] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in Proc. ICLR, 2016.

[51] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” 2014, arXiv:1411.1784.

[52] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. CVPR, 2017, pp. 5967–5976.

[53] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. 
ICCV, 2017.

[54] K. O. Stanley, “Compositional pattern producing networks: A novel abstraction of development,” Genetic Programming and Evolvable Machines, vol. 8, no. 2, pp. 131–162, 2007.

[55] A. Mordvintsev, N. Pezzotti, L. Schubert, and C. Olah, “Differentiable image parameterizations,” Distill, vol. 3, no. 7, p. e12, 2018.

[56] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee, “Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision,” in Proc. NIPS, 2016.

[57] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Proc. NIPS, 2015.

[58] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-encoders,” in Proc. ICANN, 2011.

[59] A. Yuille and D. Kersten, “Vision as Bayesian inference: analysis by synthesis?” Trends in Cognitive Sciences, vol. 10, pp. 301–308, 2006.

[60] T. Bever and D. Poeppel, “Analysis by synthesis: A (re-)emerging program of research for language and vision,” Biolinguistics, vol. 4, no. 2, pp. 174–200, 2010.

[61] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum, “Deep convolutional inverse graphics network,” in Proc. NIPS, 2015.

[62] J. Yang, S. Reed, M.-H. Yang, and H. Lee, “Weakly-supervised disentangling with recurrent transformations for 3d view synthesis,” in Proc. NIPS, 2015.

[63] T. D. Kulkarni, P. Kohli, J. B. Tenenbaum, and V. K. Mansinghka, “Picture: A probabilistic programming language for scene perception,” in Proc. CVPR, 2015.

[64] H. F. Tung, A. W. Harley, W. Seto, and K. Fragkiadaki, “Adversarial inverse graphics networks: Learning 2d-to-3d lifting and image-to-image translation from unpaired supervision,” in Proc. ICCV.

[65] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. 
Samaras, “Neural face editing with intrinsic image disentangling,” in Proc. CVPR, 2017.

[66] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press, 2003.

[67] J. C. Hart, “Sphere tracing: A geometric method for the antialiased ray tracing of implicit surfaces,” The Visual Computer, vol. 12, no. 10, pp. 527–545, 1996.

[68] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., “Conditional image generation with pixelcnn decoders,” in Proc. NIPS, 2016.

[69] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[70] D. Ha, A. Dai, and Q. V. Le, “Hypernetworks,” in Proc. ICLR, 2017.

[71] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter, “A 3d face model for pose and illumination invariant face recognition,” in 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance. IEEE, 2009, pp. 296–301.

[72] C. Tang and P. Tan, “Ba-net: Dense bundle adjustment network,” in Proc. ICLR, 2019.

[73] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. ICML. JMLR.org, 2017, pp. 1126–1135.