{"title": "Discovery of Latent 3D Keypoints via End-to-end Geometric Reasoning", "book": "Advances in Neural Information Processing Systems", "page_first": 2059, "page_last": 2070, "abstract": "This paper presents KeypointNet, an end-to-end geometric reasoning framework to learn an optimal set of category-specific keypoints, along with their detectors to predict 3D keypoints in a single 2D input image. We demonstrate this framework on 3D pose estimation task by proposing a differentiable pose objective that seeks the optimal set of keypoints for recovering the relative pose between two views of an object. Our network automatically discovers a consistent set of keypoints across viewpoints of a single object as well as across all object instances of a given object class. Importantly, we find that our end-to-end approach using no ground-truth keypoint annotations outperforms a fully supervised baseline using the same neural network architecture for the pose estimation task. \nThe discovered 3D keypoints across the car, chair, and plane\ncategories of ShapeNet are visualized at https://keypoints.github.io/", "full_text": "Discovery of Latent 3D Keypoints via\n\nEnd-to-end Geometric Reasoning\n\nSupasorn Suwajanakorn(cid:66)\u2217 Noah Snavely\u2666 Jonathan Tompson\u2666 Mohammad Norouzi\u2666\n\nsupasorn@vistec.ac.th, {snavely, tompson, mnorouzi}@google.com\n\n(cid:66)Vidyasirimedhi Institute of Science and Technology\n\n\u2666Google AI\n\nAbstract\n\nThis paper presents KeypointNet, an end-to-end geometric reasoning framework to\nlearn an optimal set of category-speci\ufb01c 3D keypoints, along with their detectors.\nGiven a single image, KeypointNet extracts 3D keypoints that are optimized for\na downstream task. We demonstrate this framework on 3D pose estimation by\nproposing a differentiable objective that seeks the optimal set of keypoints for\nrecovering the relative pose between two views of an object. 
Our model discovers geometrically and semantically consistent keypoints across viewing angles and instances of an object category. Importantly, we find that our end-to-end framework using no ground-truth keypoint annotations outperforms a fully supervised baseline using the same neural network architecture on the task of pose estimation. The discovered 3D keypoints on the car, chair, and plane categories of ShapeNet [6] are visualized at keypointnet.github.io.

1 Introduction

Convolutional neural networks have shown that jointly optimizing feature extraction and classification pipelines can significantly improve object recognition [26, 25]. That being said, current approaches to geometric vision problems, such as 3D reconstruction [47] and shape alignment [29], comprise a separate keypoint detection module, followed by geometric reasoning as a post-process. In this paper, we explore whether one can benefit from an end-to-end geometric reasoning framework, in which keypoints are jointly optimized as a set of latent variables for a downstream task.
Consider the problem of determining the 3D pose of a car in an image. A standard solution first detects a sparse set of category-specific keypoints, and then uses such points within a geometric reasoning framework (e.g., a PnP algorithm [28]) to recover the 3D pose or camera angle. Toward this end, one can develop a set of keypoint detectors by leveraging strong supervision in the form of manual keypoint annotations in different images of an object category, or by using expensive and error-prone offline model-based fitting methods. Researchers have compiled large datasets of annotated keypoints for faces [44], hands [51], and human bodies [3, 30]. However, selection and consistent annotation of keypoints in images of an object category is expensive and ill-defined. 
To devise a reasonable set of points, one should take into account the downstream task of interest. Directly optimizing keypoints for a downstream geometric task should naturally encourage desirable keypoint properties such as distinctiveness, ease of detection, and diversity.
This paper presents KeypointNet, an end-to-end geometric reasoning framework to learn an optimal set of category-specific 3D keypoints, along with their detectors, for a specific downstream task. Our novelty stands in contrast to prior work that learns latent keypoints through an arbitrary proxy self-supervision objective, such as reconstruction [63, 17]. Our framework is applicable to any downstream task represented by an objective function that is differentiable with respect to keypoint positions. We formulate 3D pose estimation as one such task, and our key technical contributions include (1) a novel differentiable pose estimation objective and (2) a multi-view consistency loss function. The pose objective seeks optimal keypoints for recovering the relative pose between two views of an object. The multi-view consistency loss encourages consistent keypoint detections across 3D transformations of an object. Notably, we propose to detect 3D keypoints (2D points with depth) from individual 2D images, and formulate pose and consistency losses for such 3D keypoint detections.

∗Work done while S. Suwajanakorn was a member of the Google AI Residency program (g.co/airesidency).

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We show that KeypointNet discovers geometrically and semantically consistent keypoints across viewing angles as well as across object instances of a given class. Some of the discovered keypoints correspond to interesting and semantically meaningful parts, such as the wheels of a car, and we show how the depths of these 3D keypoints can be inferred without access to object geometry. 
We conduct\nthree sets of experiments on different object categories from the ShapeNet dataset [6]. We evaluate\nour technique against a strongly supervised baseline based on manually annotated keypoints on the\ntask of relative 3D pose estimation. Surprisingly, we \ufb01nd that our end-to-end framework achieves\nsigni\ufb01cantly better results, despite the lack of keypoint annotations.\n2 Related Work\n\nBoth 2D and 3D keypoint detection are long-standing problems in computer vision, where keypoint\ninference is traditionally used as an early stage in object localization pipelines [27]. As an example, a\nsuccessful early application of modern convolutional neural networks (CNNs) was on detecting 2D\nhuman joint positions from monocular RGB images. Due to its compelling utility for HCI, motion\ncapture, and security applications, a large body of work has since developed in this joint detection\ndomain [53, 52, 39, 37, 62, 38, 20, 14].\nMore related to our work, a number of recent CNN-based techniques have been developed for 3D\nhuman keypoint detection from monocular RGB images, which use various architectures, supervised\nobjectives, and 3D structural priors to directly infer a prede\ufb01ned set of 3D joint locations [36, 34,\n8, 35, 12]. Other techniques use inferred 2D keypoint detectors and learned 3D priors to perform\n\u201c2D-to-3D-lifting\u201d [41, 7, 66, 33] or \ufb01nd data-to-model correspondences from depth images [40].\nHonari et al. [18] improve landmark localization by incorporating semi-supervised tasks such as\nattribute prediction and equivariant landmark prediction. In contrast, our set of keypoints is not\nde\ufb01ned a priori and is instead a latent set that is optimized end-to-end to improve inference for a\ngeometric estimation problem. 
A body of work also exists for more generalized, albeit supervised,\nkeypoint detection, e.g., [15, 61].\nEnforcing latent structure in CNN feature representations has been explored for a number of domains.\nFor instance, the capsule framework [17] and its variants [43, 16] encode activation properties in\nthe magnitude and direction of hidden-state vectors and then combine them to build higher-level\nfeatures. The output of our KeypointNet can be seen as a similar form of latent 3D feature, which is\nencouraged to represent a set of 3D keypoint positions due to the carefully constructed consistency\nand relative pose objective functions.\nRecent work has demonstrated 2D correspondence matching across intra-class instances with large\nshape and appearance variation. For instance, Choy et al. [9] use a novel contrastive loss based on\nappearance to encode geometry and semantic similarity. Han et al. [13] propose a novel SCNet\narchitecture for learning a geometrically plausible model for 2D semantic correspondence. Wang\net al. [60] rely on deep features and perform a multi-image matching across an image collection by\nsolving a feature selection and labeling problem. Thewlis et al. [49] use ground-truth transforms\n(optical \ufb02ow between image pairs) and point-wise matching to learn a dense object-centric coordinate\nframe with viewpoint and image deformation invariance. Similarly, Agrawal et al. [2] use egomotion\nprediction between image pairs to learn semi-supervised feature representations, and show that these\nfeatures are competitive with supervised features for a variety of tasks.\nOther work has sought to learn latent 2D or 3D features with varying amounts of supervision.\nArie-Nachimson & Basri [4] build 3D models of rigid objects and exploit these models to estimate\n3D pose from a 2D image as well as a collection of 3D latent features and visibility properties.\nInspired by cycle consistency for learning correspondence [19, 65], Zhou et al. 
[64] train a CNN\nto predict correspondence between different objects of the same semantic class by utilizing CAD\nmodels. Independent from our work, Zhang et al. [63] discover sparse 2D landmarks of images of a\nknown object class as explicit structure representation through a reconstruction objective. Similarly,\nJakab and Gupta et al. [23] use conditional image generation and reconstruction objective to learn\n2D keypoints that capture geometric changes in training image pairs. Rhodin et al. [42] uses a\n\n2\n\n\fFigure 1: During training, two views of the same object are given as input to the KeypointNet. The\nknown rigid transformation (R, t) between the two views is provided as a supervisory signal. We\noptimize an ordered list of 3D keypoints that are consistent in both views and enable recovery of the\ntransformation. During inference, KeypointNet extracts 3D keypoints from an individual input image.\n\nmulti-view consistency loss, similar to ours, to infer 3D latent variables speci\ufb01cally for human pose\nestimation task. In contrast to [64, 63, 23, 42], our latent keypoints are optimized for a downstream\ntask, which encourages more directed keypoint selection. By representing keypoints in true physical\n3D structures, our method can even \ufb01nd occluded correspondences between images with large pose\ndifferences, e.g., large out-of-plane rotations.\nApproaches for \ufb01nding 3D correspondence have been investigated. Salti et al. [45] cast 3D keypoint\ndetection as a binary classi\ufb01cation between points whose ground-truth similarity label is determined\nby a prede\ufb01ned 3D descriptor. Zhou et al. [67] use view-consistency as a supervisory signal to predict\n3D keypoints, although only on depth maps. Similarly, Su et al. [48] leverage synthetically rendered\nmodels to estimate object viewpoint by matching them to real-world image via CNN viewpoint\nembedding. 
Besides keypoints, self-supervision based on geometric and motion reasoning has been used to predict other forms of output, such as 3D shape represented as blendshape coefficients for human motion capture [57].

3 End-to-end Optimization of 3D Keypoints

Given a single image of a known object category, our model predicts an ordered list of 3D keypoints, defined as pixel coordinates and associated depth values. Such keypoints are required to be geometrically and semantically consistent across different viewing angles and instances of an object category (e.g., see Figure 4). Our KeypointNet has N heads that extract N keypoints, and the same head tends to extract 3D points with the same semantic interpretation. These keypoints will serve as a building block for feature representations based on a sparse set of points, useful for geometric reasoning and pose-aware or pose-invariant object recognition (e.g., [43]).
In contrast to approaches that learn a supervised mapping from images to a list of annotated keypoint positions, we do not define the keypoint positions a priori. Instead, we jointly optimize keypoints with respect to a downstream task. We focus on the task of relative pose estimation at training time, where given two views of the same object with a known rigid transformation T, we aim to predict optimal lists of 3D keypoints, P1 and P2, in the two views that best match one view to the other (Figure 1). We formulate an objective function O(P1, P2), based on which one can optimize a parametric mapping from an image to a list of keypoints. Our objective consists of two primary components:

• A multi-view consistency loss that measures the discrepancy between the two sets of points under the ground-truth transformation.
• A relative pose estimation loss, which penalizes the angular difference between the ground-truth rotation R vs. 
the rotation R̂ recovered from P1 and P2 using orthogonal Procrustes.

We demonstrate that these two terms allow the model to discover important keypoints, some of which correspond to semantically meaningful locations that humans would naturally select for different object classes. Note that we do not directly optimize for keypoints that are semantically meaningful, as those may be sub-optimal for downstream tasks or simply hard to detect. In what follows, we first explain our objective function and then describe the neural architecture of KeypointNet.

Notation. Each training tuple comprises a pair of images (I, I′) of the same object from different viewpoints, along with their relative rigid transformation T ∈ SE(3), which transforms the underlying 3D shape from I to I′. T has the following matrix form:

T = [ R_{3×3}  t_{3×1} ; 0  1 ],    (1)

where R and t represent a 3D rotation and translation, respectively. We learn a function f_θ(I), parametrized by θ, that maps a 2D image I to a list of 3D points P = (p_1, . . . , p_N), where p_i ≡ (u_i, v_i, z_i), by optimizing an objective function of the form O(f_θ(I), f_θ(I′)).

3.1 Multi-view consistency

The goal of our multi-view consistency loss is to ensure that the keypoints track consistent parts across different views. Specifically, a 3D keypoint in one image should project onto the same pixel location as the corresponding keypoint in the second image. For this task, we assume a perspective camera model with a known global focal length f. Below, we use [x, y, z] to denote 3D coordinates, and [u, v] to denote pixel coordinates. 
The projection of a keypoint [u, v, z] from image I into image I′ (and vice versa) is given by the projection operators:

[û, v̂, ẑ, 1]^T ∼ π T π^{−1}([u, v, z, 1]^T)
[û′, v̂′, ẑ′, 1]^T ∼ π T^{−1} π^{−1}([u′, v′, z′, 1]^T)

where, for instance, û denotes the projection of u to the second view, and û′ denotes the projection of u′ to the first view. Here, π : R^4 → R^4 represents the perspective projection operation that maps an input homogeneous 3D coordinate [x, y, z, 1]^T in camera coordinates to a pixel position plus depth:

π([x, y, z, 1]^T) = [fx/z, fy/z, z, 1]^T = [u, v, z, 1]^T    (2)

We define a symmetric multi-view consistency loss as:

L_con = 1/(2N) Σ_{i=1}^{N} ‖[u_i, v_i, u′_i, v′_i]^T − [û′_i, v̂′_i, û_i, v̂_i]^T‖^2    (3)

We measure error only in the observable image space (u, v) as opposed to also using z, because depth is never directly observed, and usually has different units compared to u and v. Note, however, that predicting z is critical for us to be able to project points between the two views.
Enforcing multi-view consistency is sufficient to infer a consistent set of 2D keypoint positions (and depths) across different views. However, this consistency alone often leads to a degenerate solution where all keypoints collapse to a single location, which is not useful. One can encode an explicit notion of diversity to prevent collapsing, but there still exist infinitely many solutions that satisfy multi-view consistency. 
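To make the projection operators and Eq. (3) concrete, here is a minimal NumPy sketch; the function names are ours, and we assume T is a 4×4 rigid transform with positive depths in both views:

```python
import numpy as np

def project(X, f):
    """pi: homogeneous camera coordinates [x, y, z, 1] -> pixel-plus-depth [u, v, z, 1]."""
    x, y, z, _ = X
    return np.array([f * x / z, f * y / z, z, 1.0])

def unproject(p, f):
    """pi^-1: pixel-plus-depth [u, v, z, 1] -> homogeneous camera coordinates."""
    u, v, z, _ = p
    return np.array([u * z / f, v * z / f, z, 1.0])

def consistency_loss(P1, P2, T, f):
    """Symmetric multi-view consistency loss (Eq. 3); P1, P2 are lists of (u, v, z)."""
    N = len(P1)
    total = 0.0
    for (u, v, z), (up, vp, zp) in zip(P1, P2):
        # Project the view-1 keypoint into view 2, and the view-2 keypoint into view 1.
        hat1 = project(T @ unproject(np.array([u, v, z, 1.0]), f), f)
        hat2 = project(np.linalg.inv(T) @ unproject(np.array([up, vp, zp, 1.0]), f), f)
        diff = np.array([u, v, up, vp]) - np.array([hat2[0], hat2[1], hat1[0], hat1[1]])
        total += diff @ diff
    return total / (2.0 * N)
```

With keypoints obtained by projecting the same 3D points into both views, the loss is zero by construction; only the (u, v) components enter the residual, as in the text.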
Rather, what we need is a notion of optimality for selecting keypoints, which has to be defined with respect to some downstream task. For that purpose, we use pose estimation as a task which naturally encourages keypoint separation so as to yield well-posed estimation problems.

3.2 Relative pose estimation

One important application of keypoint detection is to recover the relative transformation between a given pair of images. Accordingly, we define a differentiable objective that measures the misfit between the estimated relative rotation R̂ (computed via Procrustes alignment of the two sets of keypoints) and the ground truth R. Given the translation equivariance property of our keypoint prediction network (Section 4) and the view consistency loss above, we omit the translation error in this objective. The pose estimation objective is defined as:

L_pose = 2 arcsin( (1 / (2√2)) ‖R̂ − R‖_F ),    (4)

which measures the angular distance between the optimal least-squares estimate R̂ computed from the two sets of keypoints, and the ground-truth relative rotation matrix R. Fortunately, we can formulate this objective in terms of fully differentiable operations.
To estimate R̂, let X and X′ ∈ R^{3×N} denote two matrices comprising unprojected 3D keypoint coordinates for the two views. In other words, let X ≡ [X_1, . . . , X_N] and X_i ≡ (π^{−1} p_i)[:3], where [:3] returns the first 3 coordinates of its input. Similarly, X′ denotes unprojected points in P′. Let X̃ and X̃′ denote the mean-subtracted versions of X and X′, respectively. The optimal least-squares rotation R̂ between the two sets of keypoints is then given by:

R̂ = V diag(1, 1, det(V U^T)) U^T,    (5)

where U, Σ, V^T = SVD(X̃ X̃′^T). This estimation problem to recover R̂ is known as the orthogonal Procrustes problem [46]. To ensure that X̃ X̃′^T is invertible and to increase the robustness of the keypoints, we add Gaussian noise to the 3D coordinates of the keypoints (X and X′) and instead seek the best rotation under noisy predictions of the keypoints. To minimize the angular distance (4), we backpropagate through the SVD operator using matrix calculus [22, 10].
Empirically, the pose estimation objective helps significantly in producing a reasonable and natural selection of latent keypoints, leading to the automatic discovery of interesting parts such as the wheels of a car, the cockpit and wings of a plane, or the legs and back of a chair. We believe this is because these parts are geometrically consistent within an object class (e.g., circular wheels appear in all cars), easy to track, and spatially varied, all of which improve the performance of the downstream task.

4 KeypointNet Architecture

One important property for the mapping from images to keypoints is translation equivariance at the pixel level. That is, if we shift the input image, e.g., to the left by one pixel, the output locations of all keypoints should also be changed by one unit. Training a standard CNN without this property would require a larger training set that contains objects at every possible location, while still providing no equivariance guarantees at inference time.
We propose the following simple modifications to achieve equivariance. Instead of regressing directly to the coordinate values, we ask the network to output a probability distribution map g_i(u, v). We use a spatial softmax layer to produce such a distribution over image pixels [11]. 
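Stepping back to the pose objective of Section 3.2, Eqs. (4) and (5) admit a compact NumPy sketch (function names are ours; the Gaussian noise injection and SVD backpropagation of the full method are omitted):

```python
import numpy as np

def procrustes_rotation(X, Xp):
    """Least-squares rotation R_hat aligning keypoints X to Xp (Eq. 5).

    X, Xp: 3xN matrices of unprojected keypoints in views 1 and 2."""
    Xc = X - X.mean(axis=1, keepdims=True)     # mean-subtracted X~
    Xpc = Xp - Xp.mean(axis=1, keepdims=True)  # mean-subtracted X~'
    U, _, Vt = np.linalg.svd(Xc @ Xpc.T)
    V = Vt.T
    # diag(1, 1, det(V U^T)) guards against reflections.
    D = np.diag([1.0, 1.0, np.linalg.det(V @ U.T)])
    return V @ D @ U.T

def pose_loss(R_hat, R):
    """Angular distance between two rotations (Eq. 4)."""
    return 2.0 * np.arcsin(np.linalg.norm(R_hat - R, 'fro') / (2.0 * np.sqrt(2.0)))
```

For exact correspondences Xp = R X + t, `procrustes_rotation` recovers R and the loss is numerically zero; for two rotations that differ by angle θ, `pose_loss` returns θ.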
We then compute the expected values of these spatial distributions to recover a pixel coordinate; here g_i(u, v) represents how likely keypoint i is to occur at pixel (u, v), with Σ_{u,v} g_i(u, v) = 1:

[u_i, v_i]^T = Σ_{u,v} [u · g_i(u, v), v · g_i(u, v)]^T    (6)

For the z coordinates, we also predict a depth value at every pixel, denoted d_i(u, v), and compute

z_i = Σ_{u,v} d_i(u, v) g_i(u, v).    (7)

To produce a probability map with the same resolution and equivariance property, we use stride-one fully convolutional architectures [31], also used for semantic segmentation. To increase the receptive field of the network, we stack multiple layers of dilated convolutions, similar to [59].
Our emphasis on designing an equivariant network not only helps significantly reduce the number of training examples required to achieve good generalization, but also removes from the network the computational burden of converting between two representations (spatially encoded in the image vs. value-encoded in coordinates), so that it can focus on other critical tasks such as inferring depth.
Architecture details. All kernels for all layers are 3×3, and we stack 13 layers of dilated convolutions with dilation rates of 1, 1, 2, 4, 8, 16, 1, 2, 4, 8, 16, 1, 1, all with 64 output channels except the last layer, which has 2N output channels split between g_i and d_i. We use leaky ReLU and Batch Normalization [21] for all layers except the last. The output layers for d_i have no activation function, and the g_i channels are passed through a spatial softmax. Finally, g_i and d_i are converted to actual coordinates p_i using Equations (6) and (7).
Breaking symmetry. Many object classes are symmetric across at least one axis, e.g., the left side of a sedan looks like the right side flipped. 
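The coordinate readout of Eqs. (6) and (7) above can be sketched as follows (a NumPy illustration with our own names; in the real model the score and depth maps are produced convolutionally):

```python
import numpy as np

def keypoint_readout(scores, depth):
    """Expected pixel coordinate and depth for one keypoint (Eqs. 6-7).

    scores: (H, W) raw logits for keypoint i; depth: (H, W) per-pixel depth d_i."""
    g = np.exp(scores - scores.max())
    g /= g.sum()                           # spatial softmax: sum_{u,v} g_i(u,v) = 1
    H, W = g.shape
    vs, us = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
    u = (us * g).sum()                     # expected u, Eq. (6)
    v = (vs * g).sum()                     # expected v, Eq. (6)
    z = (depth * g).sum()                  # expected depth, Eq. (7)
    return u, v, z
```

Because the readout is a weighted average over pixel indices, shifting the score map by one pixel shifts (u_i, v_i) by exactly one unit, which is the equivariance property discussed above.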
This presents a challenge to the network because\ndifferent parts can appear visually identical, and can only be resolved by understanding global context.\nFor example, distinguishing the left wheels from the right wheels requires knowing its orientation\n\n5\n\n\f(i.e., whether it is facing left or right). Both supervised and unsupervised techniques bene\ufb01t from some\nglobal conditioning to aid in breaking ties and to make the keypoint prediction more deterministic.\nTo help break symmetries, one can condition the keypoint prediction on some coarse quantization of\nthe pose. Such a coarse-to-\ufb01ne approach to keypoint detection is discussed in more depth in [56]. One\nsimple such conditioning is a binary \ufb02ag that indicates whether the dominant direction of an object is\nfacing left or right. This dominant direction comes from the ShapeNet dataset we use (Section 6),\nwhere the 3D models are consistently oriented. To infer keypoints without this \ufb02ag at inference time,\nwe train a network with the same architecture, although half the size, to predict this binary \ufb02ag.\nIn particular, we train this network to predict the projected pixel locations of two 3D points [1, 0, 0]\nand [\u22121, 0, 0], transformed into each view in a training pair. These points correspond to the front\nand back of a normalized object. This network has a single L2 loss between the predicted and the\nground-truth locations. The binary \ufb02ag is 1 if the x\u2212coordinate of the projected pixel of the \ufb01rst point\nis greater than that of the second point. 
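The construction of this training label can be sketched as follows (a minimal NumPy sketch with our own names; T maps object coordinates to camera coordinates and f is the focal length):

```python
import numpy as np

def orientation_flag(T, f=100.0):
    """Symmetry-breaking label: project the object-frame points [1, 0, 0] (front)
    and [-1, 0, 0] (back) with rigid transform T and compare u-coordinates."""
    front = T @ np.array([1.0, 0.0, 0.0, 1.0])
    back = T @ np.array([-1.0, 0.0, 0.0, 1.0])
    u_front = f * front[0] / front[2]      # perspective projection of x to u
    u_back = f * back[0] / back[2]
    return 1 if u_front > u_back else 0
```

A view rotated 180° about the vertical axis swaps the two projected points, so the flag flips, which is exactly the tie-breaking signal the keypoint network receives.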
This flag is then fed into the keypoint prediction network.

5 Additional Keypoint Characteristics

In addition to the main objectives introduced above, there are common, desirable characteristics of keypoints that can benefit many possible downstream tasks, in particular:

• No two keypoints should share the same 3D location.
• Keypoints should lie within the object's silhouette.

Separation loss penalizes two keypoints if they are closer than a hyperparameter δ in 3D:

L_sep = 1/N^2 Σ_{i=1}^{N} Σ_{j≠i} max(0, δ^2 − ‖X_i − X_j‖^2)    (8)

Unlike the consistency loss, this loss is computed in 3D to allow multiple keypoints to occupy the same pixel location as long as they have different depths. We prefer a robust, bounded-support loss over an unbounded one (e.g., exponential discounting) because it does not exhibit a bias towards certain structures, such as a honeycomb, or towards placing points infinitely far apart. Instead, it encourages the points to be sufficiently far from one another.
Ideally, a well-distributed set of keypoints will automatically emerge without constraining the distance of keypoints. However, in the absence of keypoint location supervision, our objective with latent keypoints can converge to a local minimum with two keypoints collapsing to one. The main goal of this separation loss is to prevent such degenerate cases, and not to directly promote separation.
Silhouette consistency encourages the keypoints to lie within the silhouette of the object of interest. As described above, our network predicts the (u_i, v_i) coordinates of the ith keypoint via a spatial distribution, denoted g_i(u, v), over possible keypoint positions. 
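The separation term of Eq. (8) above can be sketched as (a NumPy illustration; the function name is ours):

```python
import numpy as np

def separation_loss(X, delta):
    """Bounded-support separation loss (Eq. 8); X is a 3xN matrix of 3D keypoints."""
    N = X.shape[1]
    total = 0.0
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            d2 = np.sum((X[:, i] - X[:, j]) ** 2)
            # Zero cost once the squared distance exceeds delta^2.
            total += max(0.0, delta ** 2 - d2)
    return total / N ** 2
```

The hinge goes to zero as soon as keypoints are δ apart, so, unlike an unbounded penalty, it pushes points apart only until they stop colliding.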
One way to ensure silhouette consistency is to allow non-zero probability only inside the silhouette of the object, while also encouraging the spatial distribution to be concentrated, i.e., uni-modal with low variance.
During training, we have access to the binary segmentation mask of the object b(u, v) ∈ {0, 1} in each image, where 1 means foreground object. The silhouette consistency loss is defined as

L_obj = 1/N Σ_{i=1}^{N} − log Σ_{u,v} b(u, v) g_i(u, v)    (9)

Note that this binary mask is only used to compute the loss and is not used at inference time. This objective incurs zero cost if all of the probability mass lies within the silhouette. We also include a term to minimize the variance of each of the distribution maps:

L_var = 1/N Σ_{i=1}^{N} Σ_{u,v} ‖[u, v]^T − [u_i, v_i]^T‖^2 g_i(u, v)    (10)

This term encourages the distributions to be peaky, which has the added benefit of helping keep their means within the silhouette in the case of non-convex object boundaries.

Figure 2: Histogram plots of angular distance errors, averaged across the car, plane, and chair categories, between the ground-truth relative rotations and the least-squares estimates computed from two sets of keypoints predicted on test pairs. a) is a supervised method trained with a single L2 loss between the pixel location predictions and the human labels. b) is the same as a) except the network is given an additional orientation flag predicted from a pretrained orientation network. c) is our network that uses the same pretrained orientation network as b), and d) is our unsupervised method trained jointly (the orientation and keypoint networks).

6 Experiments

Training data. Our training data is generated from ShapeNet [6], a large-scale database of approximately 51K 3D models across 270 categories. 
We create separate training datasets for various object\ncategories, including car, chair, and plane. For each model in each category, we normalize the object\nso that the longest dimension lies in [\u22121, 1], and render 200 images of size 128 \u00d7 128 under different\nviewpoints to form 100 training pairs. The camera viewpoints are randomly sampled around the\nobject from a \ufb01xed distance, all above the ground with zero roll angle. We then add small random\nshifts to the camera positions.\nImplementation details. We implemented our network in TensorFlow [1], and trained with the\nAdam optimizer with a learning rate of 10\u22123, \u03b21 = 0.9, \u03b22 = 0.999, and a total batch size of 256.\nWe use the following weights for the losses: (\u03b1con, \u03b1pose, \u03b1sep, \u03b1obj) = (1, 0.2, 1.0, 1.0). We train the\nnetwork for 200K steps using synchronous training with 32 replicas.\n\n6.1 Comparison with a supervised approach\n\nTo evaluate against a supervised approach, we collected human landmark labels for three object\ncategories (cars, chairs, and planes) from ShapeNet using Amazon Mechanical Turk. For each object,\nwe ask three different users to click on points corresponding to reference points shown as an example\nto the user. These reference points are based on the Pascal3D+ dataset (12 points for cars, 10 for\nchairs, 8 for planes). We render the object from multiple views so that each speci\ufb01ed point is facing\noutward from the screen. We then compute the average pixel location over user annotations for each\nkeypoint, and triangulate corresponding points across views to obtain 3D keypoint coordinates.\nFor each category, we train a network with the same architecture as in Section 4 using the supervised\nlabels to output keypoint locations in normalized coordinates [\u22121, 1], as well as depths, using an L2\nloss to the human labels. We then compute the angular distance error on 10% of the models for each\ncategory held out as a test set. 
(This test set corresponds to 720 models of cars, 200 chairs, and 400\nplanes. Each individual model produces 100 test image pairs.) In Figure 2, we plot the histograms\nof angular errors of our method vs. the supervised technique trained to predict the same number of\nkeypoints, and show error statistics in Table 1. For a fair comparison against the supervised technique,\nwe provide an additional orientation \ufb02ag to the supervised network. This is done by training another\nversion of the supervised network that receives the orientation \ufb02ag predicted from a pre-trained\norientation network. Additionally, we tested a more comparable version of our unsupervised network\nwhere we use and \ufb01x the same pre-trained orientation network during training. The mean and median\naccuracy of the predicted orientation \ufb02ags on the test sets are as follows: cars: (96.0%, 99.0%),\nplanes: (95.5%, 99.0%), chairs: (97.1%, 99.0%).\nOur unsupervised technique produces lower mean and median rotation errors than both versions of\nthe supervised technique. 
Note that our technique sometimes incorrectly predicts keypoints that are 180° from the correct orientation due to incorrect orientation prediction.

                                       Cars                    Planes                  Chairs
Method                                 Mean   Median  3D-SE    Mean   Median  3D-SE    Mean   Median  3D-SE
a) Supervised                          16.268  5.583  0.269    21.882  8.771  0.240    18.350  7.168  0.233
b) Supervised with pretrained O-Net    13.961  4.475  0.248    20.502  8.261  0.197    17.800  6.802  0.230
c) Ours with pretrained O-Net          13.500  4.418  0.165    18.561  6.407  0.223    14.238  5.607  0.203
d) Ours                                11.310  3.372  0.171    17.330  5.721  0.230    14.572  5.420  0.196

Table 1: Mean and median angular distance errors between the ground-truth rotation and the Procrustes estimate computed from two sets of predicted keypoints on test pairs. O-Net is the network that predicts a binary orientation. 3D-SE is the standard error described in Section 6.1.

Figure 3: Keypoint results on single objects from different views. Note that these keypoints are predicted consistently across views even when they are completely occluded (e.g., the red point that tracks the back right leg of the chair). Please see keypointnet.github.io for visualizations.

Figure 4: Results on ShapeNet [6] test sets for cars, planes, and chairs. 
Our network is able to generalize across unseen appearances and shape variations, and consistently predicts occluded parts such as wheels and chair legs.

Keypoint location consistency. To evaluate the consistency of predicted keypoints across views, we transform the keypoints predicted for the same object under different views into object space using the known camera matrices used for rendering. We then compute the standard error of the 3D locations of all keypoints across all test cars (3D-SE in Table 1). To disregard outliers when the network incorrectly infers the orientation, we compute this metric only for keypoints whose rotation-estimate error is less than 90° (the left halves of the histograms in Figure 2), for both the supervised method and our unsupervised approach.

6.2 Generalization across views and instances

In this section, we show qualitative results of our keypoint predictions on test cars, chairs, and planes using a default of 10 keypoints for all categories. (We show results with varying numbers of keypoints in the Appendix.) In Figure 3, we show keypoint prediction results on single objects from different views. Some of these views are quite challenging, such as the top-down view of the chair; nevertheless, our network is able to infer the orientation and predict occluded parts such as the chair legs. In Figure 4, we run our network on many instances of test objects. Note that although the network only sees pairs of images of the same model during training, it is able to use the same keypoints for semantically similar parts across all instances of the same class. For example, the blue keypoints always track the cockpit of the planes.
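The 3D-SE consistency metric described above can be sketched as follows: per-view predictions are mapped back to a common object frame using the known camera rotations, and the per-keypoint standard error of the resulting 3D locations is averaged. This is a simplified illustration under our own assumptions (rotation-only camera model, ignoring the constant rendering translation; all names are ours):

```python
import numpy as np

def keypoint_3d_se(kp_cam, R_cams):
    """kp_cam: (V, K, 3) keypoints predicted in camera space for V views.
    R_cams: (V, 3, 3) known world-to-camera rotations used for rendering.
    Returns the mean 3D standard error of the K object-space keypoints."""
    V = kp_cam.shape[0]
    # Map each view's predictions into the object frame: x_obj = R^T x_cam.
    kp_obj = np.einsum('vji,vkj->vki', R_cams, kp_cam)
    # Per-keypoint RMS deviation from the mean location across views.
    dev = kp_obj - kp_obj.mean(axis=0, keepdims=True)          # (V, K, 3)
    per_kp = np.sqrt((dev ** 2).sum(axis=-1).mean(axis=0))     # (K,)
    # Standard error: deviation scaled by sqrt of the number of views.
    return (per_kp / np.sqrt(V)).mean()
```

A perfectly consistent detector scores zero: if the same object-space keypoints are rotated into each camera frame and passed in, the recovered object-space locations coincide across views.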
In contrast to prior work [49, 17, 63] that learns latent representations by training with restricted classes of transformations, such as affine warps or 2D optical flow, and demonstrates results on images with small pose variations, we learn through physical 3D transformations and are able to produce a consistent set of 3D keypoints from any angle. Our method can also be used to establish correspondence between two views under out-of-plane or even 180° rotations when there is no visual overlap.

Failure cases. When our orientation network fails to predict the correct orientation, the output keypoints are flipped, as shown in Figure 5. This happens for cars whose front and back look very similar, or for unusual wing shapes that make inference of the dominant direction difficult.

Figure 5: Failure cases.

7 Discussion & Future work

We explore the possibility of optimizing a representation based on a sparse set of keypoints or landmarks, without access to keypoint annotations, but rather through an end-to-end geometric reasoning framework. We show that, indeed, one can discover consistent keypoints across multiple views and object instances by adopting two novel objective functions: a relative pose estimation loss and a multi-view consistency objective. Our translation-equivariant architecture is able to generalize to unseen object instances of ShapeNet categories [6]. Importantly, our discovered keypoints outperform those from a direct supervised learning baseline on the problem of rigid 3D pose estimation.

We present preliminary results on the transfer of the learned keypoint detectors to real-world images by training on ShapeNet images with random backgrounds (see supplemental material).
Further improvements may be achieved by leveraging recent work in domain adaptation [24, 54, 50, 5, 58]. Alternatively, one could train KeypointNet directly on real images, provided relative pose labels are available; such labels may be estimated automatically using structure from motion [32]. Another interesting direction would be to jointly solve for the relative transformation, or to rely on a coarse pose initialization, inspired by [55], to extend this framework to objects that lack 3D models or pose annotations. Our framework could also be extended to handle an arbitrary number of keypoints: for example, one could predict a confidence value for each keypoint and then threshold to identify distinct ones, while using a loss that operates on unordered sets of keypoints. Visual descriptors could also be incorporated under our framework, either through a post-processing task or via joint end-to-end optimization of both the detector and the descriptor.

8 Acknowledgements

We would like to thank Chi Zeng, who helped set up the Mechanical Turk tasks for our evaluations.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems, 2015.

[2] Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. ICCV, 2015.

[3] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele.
2D human pose estimation: New benchmark and state of the art analysis. CVPR, 2014.

[4] M. Arie-Nachimson and R. Basri. Constructing implicit 3D shape models for pose estimation. ICCV, 2009.

[5] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. CVPR, 2017.

[6] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An information-rich 3D model repository. arXiv:1512.03012, 2015.

[7] Ching-Hang Chen and Deva Ramanan. 3D human pose estimation = 2D pose estimation + matching. CVPR, 2017.

[8] Yu Chen, Chunhua Shen, Xiu-Shen Wei, Lingqiao Liu, and Jian Yang. Adversarial learning of structure-aware fully convolutional networks for landmark localization. arXiv:1711.00253, 2017.

[9] Christopher B. Choy, JunYoung Gwak, Silvio Savarese, and Manmohan Chandraker. Universal correspondence network. NIPS, 2016.

[10] Mike Giles. An extended collection of matrix derivative results for forward and reverse mode automatic differentiation. Oxford University, 2008.

[11] Ross Goroshin, Michael F. Mathieu, and Yann LeCun. Learning to linearize under uncertainty. NIPS, 2015.

[12] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. arXiv:1802.00434, 2018.

[13] Kai Han, Rafael S. Rezende, Bumsub Ham, Kwan-Yee K. Wong, Minsu Cho, Cordelia Schmid, and Jean Ponce. SCNet: Learning semantic correspondence. ICCV, 2017.

[14] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. ICCV, 2017.

[15] Mohsen Hejrati and Deva Ramanan. Analyzing 3D objects in cluttered images. NIPS, 2012.

[16] Geoffrey Hinton, Nicholas Frosst, and Sara Sabour. Matrix capsules with EM routing.
ICLR, 2018.

[17] Geoffrey E. Hinton, Alex Krizhevsky, and Sida D. Wang. Transforming auto-encoders. International Conference on Artificial Neural Networks, 2011.

[18] Sina Honari, Pavlo Molchanov, Stephen Tyree, Pascal Vincent, Christopher Pal, and Jan Kautz. Improving landmark localization with semi-supervised learning. CVPR, 2018.

[19] Qi-Xing Huang and Leonidas Guibas. Consistent shape maps via semidefinite programming. Computer Graphics Forum, 2013.

[20] Shaoli Huang, Mingming Gong, and Dacheng Tao. A coarse-fine network for keypoint localization. ICCV, 2017.

[21] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.

[22] Catalin Ionescu, Orestis Vantzos, and Cristian Sminchisescu. Matrix backpropagation for deep networks with structured layers. ICCV, 2015.

[23] Tomas Jakab, Ankush Gupta, Hakan Bilen, and Andrea Vedaldi. Conditional image generation for learning the structure of visual objects. arXiv:1806.07823, 2018.

[24] Matthew Johnson-Roberson, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan. Driving in the Matrix: Can virtual worlds replace human-generated annotations for real world tasks? ICRA, pages 746-753, 2017.

[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012.

[26] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

[27] Vincent Lepetit and Pascal Fua. Keypoint recognition using randomized trees. IEEE Trans. PAMI, 28(9):1465-1479, 2006.

[28] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem.
IJCV, 2008.

[29] Yan Li, Leon Gu, and Takeo Kanade. A robust shape model for multi-view car alignment. CVPR, 2009.

[30] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. ECCV, 2014.

[31] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. CVPR, 2015.

[32] H. Christopher Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293(5828):133, 1981.

[33] Julieta Martinez, Rayat Hossain, Javier Romero, and James J. Little. A simple yet effective baseline for 3D human pose estimation. ICCV, 2017.

[34] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3D human pose estimation in the wild using improved CNN supervision. 3DV, 2017.

[35] Dushyant Mehta, Oleksandr Sotnychenko, Franziska Mueller, Weipeng Xu, Srinath Sridhar, Gerard Pons-Moll, and Christian Theobalt. Single-shot multi-person 3D body pose estimation from monocular RGB input. arXiv:1712.03453, 2017.

[36] Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics, 2017.

[37] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. ECCV, 2016.

[38] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. Towards accurate multiperson pose estimation in the wild. arXiv:1701.01779, 2017.

[39] Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter Gehler, and Bernt Schiele.
DeepCut: Joint subset partition and labeling for multi person pose estimation. CVPR, 2016.

[40] Gerard Pons-Moll, Jonathan Taylor, Jamie Shotton, Aaron Hertzmann, and Andrew Fitzgibbon. Metric regression forests for correspondence estimation. IJCV, 113(3):163-175, 2015.

[41] Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Reconstructing 3D human pose from 2D image landmarks. ECCV, 2012.

[42] Helge Rhodin, Mathieu Salzmann, and Pascal Fua. Unsupervised geometry-aware representation for 3D human pose estimation. arXiv:1804.01110, 2018.

[43] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. NIPS, 2017.

[44] Christos Sagonas, Epameinondas Antonakos, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 300 faces in-the-wild challenge: Database and results. Image and Vision Computing, 2016.

[45] Samuele Salti, Federico Tombari, Riccardo Spezialetti, and Luigi Di Stefano. Learning a descriptor-specific 3D keypoint detector. ICCV, 2015.

[46] Peter Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 1966.

[47] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3D. ACM Transactions on Graphics, 2006.

[48] Hao Su, Charles R. Qi, Yangyan Li, and Leonidas J. Guibas. Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. ICCV, 2015.

[49] James Thewlis, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object frames by dense equivariant image labelling. NIPS, 2017.

[50] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. IROS, pages 23-30, 2017.

[51] Jonathan Tompson, Murphy Stein, Yann LeCun, and Ken Perlin.
Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics, 33, 2014.

[52] Jonathan J. Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. NIPS, 2014.

[53] Alexander Toshev and Christian Szegedy. DeepPose: Human pose estimation via deep neural networks. CVPR, 2014.

[54] Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brophy, Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, and Stan Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. arXiv:1804.06516, 2018.

[55] Bill Triggs, Philip F. McLauchlan, Richard I. Hartley, and Andrew W. Fitzgibbon. Bundle adjustment: A modern synthesis. International Workshop on Vision Algorithms, pages 298-372, 1999.

[56] Shubham Tulsiani and Jitendra Malik. Viewpoints and keypoints. CVPR, 2015.

[57] Hsiao-Yu Tung, Hsiao-Wei Tung, Ersin Yumer, and Katerina Fragkiadaki. Self-supervised learning of motion capture. NIPS, pages 5236-5246, 2017.

[58] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. CVPR, 2017.

[59] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.

[60] Qianqian Wang, Xiaowei Zhou, and Kostas Daniilidis. Multi-image semantic matching by mining consistent features. arXiv:1711.07641, 2017.

[61] Jiajun Wu, Tianfan Xue, Joseph J. Lim, Yuandong Tian, Joshua B. Tenenbaum, Antonio Torralba, and William T. Freeman. Single image 3D interpreter network. ECCV, 2016.

[62] Wei Yang, Shuang Li, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang.
Learning feature pyramids for human pose estimation. ICCV, 2017.

[63] Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, and Honglak Lee. Unsupervised discovery of object landmarks as structural representations. CVPR, pages 2694-2703, 2018.

[64] Tinghui Zhou, Philipp Krahenbuhl, Mathieu Aubry, Qixing Huang, and Alexei A. Efros. Learning dense correspondence via 3D-guided cycle consistency. CVPR, 2016.

[65] Xiaowei Zhou, Menglong Zhu, and Kostas Daniilidis. Multi-image matching via fast alternating minimization. CVPR, 2015.

[66] Xiaowei Zhou, Menglong Zhu, Spyridon Leonardos, Konstantinos G. Derpanis, and Kostas Daniilidis. Sparseness meets deepness: 3D human pose estimation from monocular video. CVPR, 2016.

[67] Xingyi Zhou, Arjun Karpur, Chuang Gan, Linjie Luo, and Qixing Huang. Unsupervised domain adaptation for 3D keypoint prediction from a single depth scan. arXiv:1712.05765, 2017.