{"title": "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network", "book": "Advances in Neural Information Processing Systems", "page_first": 2366, "page_last": 2374, "abstract": "Predicting depth is an essential component in understanding the 3D geometry of a scene. While for stereo images local correspondence suffices for estimation, finding depth relations from a single image is less straightforward, requiring integration of both global and local information from various cues. Moreover, the task is inherently ambiguous, with a large source of uncertainty coming from the overall scale. In this paper, we present a new method that addresses this task by employing two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally. We also apply a scale-invariant error to help measure depth relations rather than scale. By leveraging the raw datasets as large sources of training data, our method achieves state-of-the-art results on both NYU Depth and KITTI, and matches detailed depth boundaries without the need for superpixelation.", "full_text": "Depth Map Prediction from a Single Image\n\nusing a Multi-Scale Deep Network\n\nDavid Eigen\n\ndeigen@cs.nyu.edu\n\nChristian Puhrsch\n\ncpuhrsch@nyu.edu\n\nRob Fergus\n\nfergus@cs.nyu.edu\n\nDept. of Computer Science, Courant Institute, New York University\n\nAbstract\n\nPredicting depth is an essential component in understanding the 3D geometry of\na scene. While for stereo images local correspondence suf\ufb01ces for estimation,\n\ufb01nding depth relations from a single image is less straightforward, requiring in-\ntegration of both global and local information from various cues. Moreover, the\ntask is inherently ambiguous, with a large source of uncertainty coming from the\noverall scale. 
In this paper, we present a new method that addresses this task by\nemploying two deep network stacks: one that makes a coarse global prediction\nbased on the entire image, and another that re\ufb01nes this prediction locally. We also\napply a scale-invariant error to help measure depth relations rather than scale. By\nleveraging the raw datasets as large sources of training data, our method achieves\nstate-of-the-art results on both NYU Depth and KITTI, and matches detailed depth\nboundaries without the need for superpixelation.\n\nIntroduction\n\n1\nEstimating depth is an important component of understanding geometric relations within a scene. In\nturn, such relations help provide richer representations of objects and their environment, often lead-\ning to improvements in existing recognition tasks [18], as well as enabling many further applications\nsuch as 3D modeling [16, 6], physics and support models [18], robotics [4, 14], and potentially rea-\nsoning about occlusions.\nWhile there is much prior work on estimating depth based on stereo images or motion [17], there has\nbeen relatively little on estimating depth from a single image. Yet the monocular case often arises in\npractice: Potential applications include better understandings of the many images distributed on the\nweb and social media outlets, real estate listings, and shopping sites. These include many examples\nof both indoor and outdoor scenes.\nThere are likely several reasons why the monocular case has not yet been tackled to the same degree\nas the stereo one. Provided accurate image correspondences, depth can be recovered deterministi-\ncally in the stereo case [5]. Thus, stereo depth estimation can be reduced to developing robust image\npoint correspondences \u2014 which can often be found using local appearance features. 
By contrast, estimating depth from a single image requires the use of monocular depth cues such as line angles and perspective, object sizes, image position, and atmospheric effects. Furthermore, a global view of the scene may be needed to relate these effectively, whereas local disparity is sufficient for stereo.
Moreover, the task is inherently ambiguous, and a technically ill-posed problem: Given an image, an infinite number of possible world scenes may have produced it. Of course, most of these are physically implausible for real-world spaces, and thus the depth may still be predicted with considerable accuracy. At least one major ambiguity remains, though: the global scale. Although extreme cases (such as a normal room versus a dollhouse) do not exist in the data, moderate variations in room and furniture sizes are present. We address this using a scale-invariant error in addition to more common scale-dependent errors. This focuses attention on the spatial relations within a scene rather than general scale, and is particularly apt for applications such as 3D modeling, where the model is often rescaled during postprocessing.
In this paper we present a new approach for estimating depth from a single image. We directly regress on the depth using a neural network with two components: one that first estimates the global structure of the scene, then a second that refines it using local information. The network is trained using a loss that explicitly accounts for depth relations between pixel locations, in addition to pointwise error. Our system achieves state-of-the-art estimation rates on NYU Depth and KITTI, as well as improved qualitative outputs.
2 Related Work
Directly related to our work are several approaches that estimate depth from a single image. Saxena et al. 
[15] predict depth from a set of image features using linear regression and a MRF, and later\nextend their work into the Make3D [16] system for 3D model generation. However, the system\nrelies on horizontal alignment of images, and suffers in less controlled settings. Hoiem et al. [6] do\nnot predict depth explicitly, but instead categorize image regions into geometric structures (ground,\nsky, vertical), which they use to compose a simple 3D model of the scene.\nMore recently, Ladicky et al. [12] show how to integrate semantic object labels with monocular\ndepth features to improve performance; however, they rely on handcrafted features and use super-\npixels to segment the image. Karsch et al. [7] use a kNN transfer mechanism based on SIFT Flow\n[11] to estimate depths of static backgrounds from single images, which they augment with motion\ninformation to better estimate moving foreground subjects in videos. This can achieve better align-\nment, but requires the entire dataset to be available at runtime and performs expensive alignment\nprocedures. By contrast, our method learns an easier-to-store set of network parameters, and can be\napplied to images in real-time.\nMore broadly, stereo depth estimation has been extensively investigated. Scharstein et al. [17] pro-\nvide a survey and evaluation of many methods for 2-frame stereo correspondence, organized by\nmatching, aggregation and optimization techniques. In a creative application of multiview stereo,\nSnavely et al. [20] match across views of many uncalibrated consumer photographs of the same\nscene to create accurate 3D reconstructions of common landmarks.\nMachine learning techniques have also been applied in the stereo case, often obtaining better results\nwhile relaxing the need for careful camera alignment [8, 13, 21, 19]. Most relevant to this work is\nKonda et al. 
[8], who train a factored autoencoder on image patches to predict depth from stereo\nsequences; however, this relies on the local displacements provided by stereo.\nThere are also several hardware-based solutions for single-image depth estimation. Levin et al. [10]\nperform depth from defocus using a modi\ufb01ed camera aperture, while the Kinect and Kinect v2 use\nactive stereo and time-of-\ufb02ight to capture depth. Our method makes indirect use of such sensors\nto provide ground truth depth targets during training; however, at test time our system is purely\nsoftware-based, predicting depth from RGB images.\n3 Approach\n3.1 Model Architecture\nOur network is made of two component stacks, shown in Fig. 1. A coarse-scale network \ufb01rst predicts\nthe depth of the scene at a global level. This is then re\ufb01ned within local regions by a \ufb01ne-scale\nnetwork. Both stacks are applied to the original input, but in addition, the coarse network\u2019s output\nis passed to the \ufb01ne network as additional \ufb01rst-layer image features. In this way, the local network\ncan edit the global prediction to incorporate \ufb01ner-scale details.\n3.1.1 Global Coarse-Scale Network\nThe task of the coarse-scale network is to predict the overall depth map structure using a global view\nof the scene. The upper layers of this network are fully connected, and thus contain the entire image\nin their \ufb01eld of view. Similarly, the lower and middle layers are designed to combine information\nfrom different parts of the image through max-pooling operations to a small spatial dimension. In\nso doing, the network is able to integrate a global understanding of the full scene to predict the\ndepth. 
Such an understanding is needed in the single-image case to make effective use of cues such as vanishing points, object locations, and room alignment. A local view (as is commonly used for stereo matching) is insufficient to notice important features such as these.

Figure 1: Model architecture. Coarse stack: Coarse 1 = 11x11 conv, stride 4, 2x2 pool (96 maps); Coarse 2 = 5x5 conv, 2x2 pool (256); Coarse 3, 4 = 3x3 conv (384, 384); Coarse 5 = 3x3 conv (256); Coarse 6 = fully connected (4096); Coarse 7 = fully connected output (1 map). Fine stack: Fine 1 = 9x9 conv, stride 2, 2x2 pool (63 maps); Fine 2 = concatenation with the coarse output (64 maps) followed by 5x5 conv; Fine 3 = 5x5 conv (64); Fine 4 = 5x5 conv output (1 map). Layer sizes:

Layer            input     Coarse 1  Coarse 2,3,4  Coarse 5  Coarse 6  Coarse 7  Fine 1,2,3,4
Size (NYUDepth)  304x228   37x27     18x13         8x6       1x1       74x55     74x55
Size (KITTI)     576x172   71x20     35x9          17x4      1x1       142x27    142x27
Ratio to input   /1        /8        /16           /32       --        /4        /4

As illustrated in Fig. 1, the global, coarse-scale network contains five feature extraction layers of convolution and max-pooling, followed by two fully connected layers. The input, feature map and output sizes are also given in Fig. 1. 
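As a rough sketch of the coarse stack's capacity, the kernel sizes and channel counts from Fig. 1 imply the following approximate weight counts (a back-of-the-envelope illustration, not the authors' code; biases and padding/stride details are omitted, and the fully connected sizes use the NYUDepth dimensions of an 8x6x256 top feature map and a 74x55 output):

```python
# Hypothetical weight counts for the coarse stack, from the kernel sizes and
# channel counts shown in Fig. 1 (biases omitted).
conv_layers = [
    # (in_channels, out_channels, kernel size) -- Coarse 1..5
    (3,   96,  11),
    (96,  256, 5),
    (256, 384, 3),
    (384, 384, 3),
    (384, 256, 3),
]
conv_weights = sum(cin * cout * k * k for cin, cout, k in conv_layers)

# Coarse 6: fully connected from the 8x6x256 top feature map (NYUDepth) to 4096 units.
fc6_weights = 8 * 6 * 256 * 4096
# Coarse 7: linear output layer producing the 74x55 map directly, i.e.
# 74*55 = 4070 learned "templates" over the 4096 hidden units.
fc7_weights = 4096 * 74 * 55

total = conv_weights + fc6_weights + fc7_weights
print(conv_weights, fc6_weights, fc7_weights, total)
```

Under these assumptions the fully connected layers dominate the parameter count, which is consistent with the upper layers being the ones that carry the global view of the scene.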
The final output is at 1/4-resolution compared to the input (which is itself downsampled from the original dataset by a factor of 2), and corresponds to a center crop containing most of the input (as we describe later, we lose a small border area due to the first layer of the fine-scale network and image transformations).
Note that the spatial dimension of the output is larger than that of the topmost convolutional feature map. Rather than limiting the output to the feature map size and relying on hardcoded upsampling before passing the prediction to the fine network, we allow the top full layer to learn templates over the larger area (74x55 for NYU Depth). These are expected to be blurry, but will be better than the upsampled output of an 8x6 prediction (the top feature map size); essentially, we allow the network to learn its own upsampling based on the features. Sample output weights are shown in Fig. 2.
All hidden layers use rectified linear units for activations, with the exception of the coarse output layer 7, which is linear. Dropout is applied to the fully-connected hidden layer 6. The convolutional layers (1-5) of the coarse-scale network are pretrained on the ImageNet classification task [1] — while developing the model, we found pretraining on ImageNet worked better than initializing randomly, although the difference was not very large1.
3.1.2 Local Fine-Scale Network
After taking a global perspective to predict the coarse depth map, we make local refinements using a second, fine-scale network. 
The task of this component is to edit the coarse prediction it receives to align with local details such as object and wall edges. The fine-scale network stack consists of convolutional layers only, along with one pooling stage for the first layer edge features.
While the coarse network sees the entire scene, the field of view of an output unit in the fine network is 45x45 pixels of input. The convolutional layers are applied across feature maps at the target output size, allowing a relatively high-resolution output at 1/4 the input scale.
More concretely, the coarse output is fed in as an additional low-level feature map. By design, the coarse prediction is the same spatial size as the output of the first fine-scale layer (after pooling), and we concatenate the two together (Fine 2 in Fig. 1). Subsequent layers maintain this size using zero-padded convolutions.
All hidden units use rectified linear activations. The last convolutional layer is linear, as it predicts the target depth.

1When pretraining, we stack two fully connected layers with 4096 - 4096 - 1000 output units each, with dropout applied to the two hidden layers, as in [9]. We train the network using random 224x224 crops from the center 256x256 region of each training image, rescaled so the shortest side has length 256. This model achieves a top-5 error rate of 18.1% on the ILSVRC2012 validation set, voting with 2 flips and 5 translations per image.

Figure 2: Weight vectors from layer Coarse 7 (coarse output), for (a) KITTI and (b) NYUDepth. Red is positive (farther) and blue is negative (closer); black is zero. Weights are selected uniformly and shown in descending order by l2 norm. KITTI weights often show changes in depth on either side of the road. NYUDepth weights often show wall positions and doorways.
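The hand-off between the two stacks can be sketched with feature maps represented as lists of channels; the channel counts (63 from Fine 1, plus the 1-channel coarse output, giving 64 into Fine 2) follow Fig. 1, while the map contents and list-of-lists layout here are illustrative assumptions:

```python
# Minimal sketch of the Fine 2 concatenation step, assuming feature maps are
# stored as lists of 2D channel arrays of identical spatial size (74x55 for NYUDepth).
W, H = 74, 55

def concat_coarse(fine1_maps, coarse_pred):
    """Append the coarse depth prediction as one extra low-level feature map."""
    assert all(len(m) == len(coarse_pred) for m in fine1_maps)  # same spatial size
    return fine1_maps + [coarse_pred]

# 63 edge-feature channels from Fine 1, plus the single-channel coarse prediction:
fine1 = [[[0.0] * W for _ in range(H)] for _ in range(63)]
coarse = [[1.0] * W for _ in range(H)]
fine2_input = concat_coarse(fine1, coarse)
print(len(fine2_input))  # number of channels entering Fine 2
```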
We train the coarse network first against the ground-truth targets, then train the fine-scale network keeping the coarse-scale output fixed (i.e. when training the fine network, we do not backpropagate through the coarse one).
3.2 Scale-Invariant Error
The global scale of a scene is a fundamental ambiguity in depth prediction. Indeed, much of the error accrued using current elementwise metrics may be explained simply by how well the mean depth is predicted. For example, Make3D trained on NYUDepth obtains 0.41 error using RMSE in log space (see Table 1). However, using an oracle to substitute the mean log depth of each prediction with the mean from the corresponding ground truth reduces the error to 0.33, a 20% relative improvement. Likewise, for our system, these error rates are 0.28 and 0.22, respectively. Thus, just finding the average scale of the scene accounts for a large fraction of the total error.
Motivated by this, we use a scale-invariant error to measure the relationships between points in the scene, irrespective of the absolute global scale. For a predicted depth map y and ground truth y^*, each with n pixels indexed by i, we define the scale-invariant mean squared error (in log space) as

D(y, y^*) = \frac{1}{2n} \sum_{i=1}^{n} (\log y_i - \log y_i^* + \alpha(y, y^*))^2,    (1)

where \alpha(y, y^*) = \frac{1}{n} \sum_i (\log y_i^* - \log y_i) is the value of \alpha that minimizes the error for a given (y, y^*). For any prediction y, e^\alpha is the scale that best aligns it to the ground truth. All scalar multiples of y have the same error, hence the scale invariance.
Two additional ways to view this metric are provided by the following equivalent forms. 
Setting d_i = \log y_i - \log y_i^* to be the difference between the prediction and ground truth at pixel i, we have

D(y, y^*) = \frac{1}{2n^2} \sum_{i,j} \big( (\log y_i - \log y_j) - (\log y_i^* - \log y_j^*) \big)^2    (2)

         = \frac{1}{n} \sum_i d_i^2 - \frac{1}{n^2} \sum_{i,j} d_i d_j = \frac{1}{n} \sum_i d_i^2 - \frac{1}{n^2} \Big( \sum_i d_i \Big)^2    (3)

Eqn. 2 expresses the error by comparing relationships between pairs of pixels i, j in the output: to have low error, each pair of pixels in the prediction must differ in depth by an amount similar to that of the corresponding pair in the ground truth. Eqn. 3 relates the metric to the original l2 error, but with an additional term, -\frac{1}{n^2} \sum_{i,j} d_i d_j, that credits mistakes if they are in the same direction and penalizes them if they oppose. Thus, an imperfect prediction will have lower error when its mistakes are consistent with one another. The last part of Eqn. 3 rewrites this as a linear-time computation.
In addition to the scale-invariant error, we also measure the performance of our method according to several error metrics that have been proposed in prior works, as described in Section 4.
3.3 Training Loss
In addition to performance evaluation, we also tried using the scale-invariant error as a training loss. Inspired by Eqn. 3, we set the per-sample training loss to

L(y, y^*) = \frac{1}{n} \sum_i d_i^2 - \frac{\lambda}{n^2} \Big( \sum_i d_i \Big)^2    (4)

where d_i = \log y_i - \log y_i^* and \lambda \in [0, 1]. Note the output of the network is \log y; that is, the final linear layer predicts the log depth. Setting \lambda = 0 reduces to elementwise l2, while \lambda = 1 is the scale-invariant error exactly. We use the average of these, i.e. 
\u03bb = 0.5, finding that this produces good absolute-scale predictions while slightly improving qualitative output.
During training, most of the target depth maps will have some missing values, particularly near object boundaries, windows and specular surfaces. We deal with these simply by masking them out and evaluating the loss only on valid points, i.e. we replace n in Eqn. 4 with the number of pixels that have a target depth, and perform the sums excluding pixels i that have no depth value.
3.4 Data Augmentation
We augment the training data with random online transformations (values shown for NYUDepth)2:
\u2022 Scale: Input and target images are scaled by s \u2208 [1, 1.5], and the depths are divided by s.
\u2022 Rotation: Input and target are rotated by r \u2208 [\u22125, 5] degrees.
\u2022 Translation: Input and target are randomly cropped to the sizes indicated in Fig. 1.
\u2022 Color: Input values are multiplied globally by a random RGB value c \u2208 [0.8, 1.2]^3.
\u2022 Flips: Input and target are horizontally flipped with 0.5 probability.
Note that image scaling and translation do not preserve the world-space geometry of the scene. This is easily corrected in the case of scaling by dividing the depth values by the scale s (making the image s times larger effectively moves the camera s times closer). Although translations are not easily fixed (they effectively change the camera to be incompatible with the depth values), we found that the extra data they provided benefited the network even though the scenes they represent were slightly warped. The other transforms, flips and in-plane rotation, are geometry-preserving. At test time, we use a single center crop at scale 1.0 with no rotation or color transforms.
4 Experiments
We train our model on the raw versions of both NYU Depth v2 [18] and KITTI [3]. 
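The masked training loss of Sec. 3.3 (Eqn. 4, with invalid pixels excluded from n and from the sums) can be sketched as follows; the flat-list representation and the use of None for missing depths are illustrative assumptions, not the authors' implementation:

```python
import math

def scale_invariant_loss(pred_log, target, lam=0.5):
    """Eqn. 4: L = (1/n) sum d_i^2 - (lam/n^2) (sum d_i)^2,
    with d_i = log y_i - log y*_i. The network predicts log depth directly;
    target entries of None (missing depths) are masked out of n and the sums."""
    d = [p - math.log(t) for p, t in zip(pred_log, target) if t is not None]
    n = len(d)
    return sum(di * di for di in d) / n - lam * sum(d) ** 2 / (n * n)

y_star = [2.0, 4.0, None, 8.0]                      # ground truth with one missing depth
y_log = [math.log(2.2), math.log(3.5), 0.0, math.log(9.0)]  # predicted log depths

base = scale_invariant_loss(y_log, y_star, lam=1.0)
# With lam = 1 the loss is scale-invariant: multiplying every predicted depth by a
# constant c adds log(c) to each d_i and leaves the loss unchanged.
shifted = [p + math.log(10.0) for p in y_log]
print(abs(scale_invariant_loss(shifted, y_star, lam=1.0) - base))
```

With lam = 0 the same function reduces to the elementwise l2 error in log space, matching the text.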
The raw distribu-\ntions contain many additional images collected from the same scenes as in the more commonly used\nsmall distributions, but with no preprocessing; in particular, points for which there is no depth value\nare left un\ufb01lled. However, our model\u2019s natural ability to handle such gaps as well as its demand for\nlarge training sets make these \ufb01tting sources of data.\n4.1 NYU Depth\nThe NYU Depth dataset [18] is composed of 464 indoor scenes, taken as video sequences using\na Microsoft Kinect camera. We use the of\ufb01cial train/test split, using 249 scenes for training and\n215 for testing, and construct our training set using the raw data for these scenes. RGB inputs are\ndownsampled by half, from 640x480 to 320x240. Because the depth and RGB cameras operate at\ndifferent variable frame rates, we associate each depth image with its closest RGB image in time,\nand throw away frames where one RGB image is associated with more than one depth (such a one-\nto-many mapping is not predictable). We use the camera projections provided with the dataset to\nalign RGB and depth pairs; pixels with no depth value are left missing and are masked out. To\nremove many invalid regions caused by windows, open doorways and specular surfaces we also\nmask out depths equal to the minimum or maximum recorded for each image.\nThe training set has 120K unique images, which we shuf\ufb02e into a list of 220K after evening the\nscene distribution (1200 per scene). We test on the 694-image NYU Depth v2 test set (with \ufb01lled-in\ndepth values). We train the coarse network for 2M samples using SGD with batches of size 32.\nWe then hold it \ufb01xed and train the \ufb01ne network for 1.5M samples (given outputs from the already-\ntrained coarse one). Learning rates are: 0.001 for coarse convolutional layers 1-5, 0.1 for coarse full\nlayers 6 and 7, 0.001 for \ufb01ne layers 1 and 3, and 0.01 for \ufb01ne layer 2. 
These ratios were found by trial-and-error on a validation set (folded back into the training set for our final evaluations), and the global scale of all the rates was tuned to a factor of 5. Momentum was 0.9. Training took 38h for the coarse network and 26h for fine, for a total of 2.6 days using an NVidia GTX Titan Black. Test prediction takes 0.33s per batch (0.01s/image).

2For KITTI, s \u2208 [1, 1.2], and rotations are not performed (images are horizontal from the camera mount).

4.2 KITTI
The KITTI dataset [3] is composed of several outdoor scenes captured while driving with car-mounted cameras and depth sensor. We use 56 scenes from the "city," "residential," and "road" categories of the raw data. These are split into 28 for training and 28 for testing. The RGB images are originally 1224x368, and downsampled by half to form the network inputs.
The depth for this dataset is sampled at irregularly spaced points, captured at different times using a rotating LIDAR scanner. When constructing the ground truth depths for training, there may be conflicting values; since the RGB cameras shoot when the scanner points forward, we resolve conflicts at each pixel by choosing the depth recorded closest to the RGB capture time. Depth is only provided within the bottom part of the RGB image, however we feed the entire image into our model to provide additional context to the global coarse-scale network (the fine network sees the bottom crop corresponding to the target area).
The training set has 800 images per scene. We exclude shots where the car is stationary (acceleration below a threshold) to avoid duplicates. Both left and right RGB cameras are used, but are treated as unassociated shots. The training set has 20K unique images, which we shuffle into a list of 40K (including duplicates) after evening the scene distribution. 
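The two timestamp-based cleanup steps described for the raw data (associating each NYU depth frame with its closest RGB frame in Sec. 4.1, and resolving conflicting KITTI LIDAR returns by scan time above) can be sketched as below; the data layout is a hypothetical simplification:

```python
from collections import Counter

def associate_frames(depth_times, rgb_times):
    """NYU Depth: pair each depth frame with its closest RGB frame in time, then
    drop any RGB frame claimed by more than one depth frame (such a one-to-many
    mapping is not predictable)."""
    nearest = [min(rgb_times, key=lambda t: abs(t - d)) for d in depth_times]
    claims = Counter(nearest)
    return [(d, r) for d, r in zip(depth_times, nearest) if claims[r] == 1]

def resolve_depth(returns, rgb_time):
    """KITTI: among conflicting LIDAR returns (scan_time, depth) at one pixel,
    keep the depth recorded closest to the RGB capture time."""
    if not returns:
        return None  # pixel stays empty and is masked out of the loss
    _, depth = min(returns, key=lambda r: abs(r[0] - rgb_time))
    return depth

# Depth frames at t=0.09 and t=0.11 both claim the RGB frame at t=0.10, so both drop out.
print(associate_frames([0.09, 0.11, 0.31], [0.10, 0.30, 0.50]))
# Three returns from one scanner rotation; the one at t=0.52 is closest to the shutter.
print(resolve_depth([(0.45, 12.1), (0.52, 11.8), (0.95, 30.2)], rgb_time=0.50))
```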
We train the coarse model first for 1.5M samples, then the fine model for 1M. Learning rates are the same as for NYU Depth. Training took 30h for the coarse model and 14h for fine; test prediction takes 0.40s/batch (0.013s/image).
4.3 Baselines and Comparisons
We compare our method against Make3D trained on the same datasets, as well as the published results of other current methods [12, 7]. As an additional reference, we also compare to the mean depth image computed across the training set. We trained Make3D on KITTI using a subset of 700 images (25 per scene), as the system was unable to scale beyond this size. Depth targets were filled in using the colorization routine in the NYUDepth development kit. For NYUDepth, we used the common distribution training set of 795 images. We evaluate each method using several errors from prior works, as well as our scale-invariant metric:

Threshold: % of y_i s.t. \max(y_i / y_i^*, y_i^* / y_i) = \delta < thr
Abs Relative difference: \frac{1}{|T|} \sum_{y \in T} |y - y^*| / y^*
Squared Relative difference: \frac{1}{|T|} \sum_{y \in T} ||y - y^*||^2 / y^*
RMSE (linear): \sqrt{\frac{1}{|T|} \sum_{y \in T} ||y_i - y_i^*||^2}
RMSE (log): \sqrt{\frac{1}{|T|} \sum_{y \in T} ||\log y_i - \log y_i^*||^2}
RMSE (log, scale-invariant): the error of Eqn. 1

Note that the predictions from Make3D and our network correspond to slightly different center crops of the input. We compare them on the intersection of their regions, and upsample predictions to the full original input resolution using nearest-neighbor. Upsampling negligibly affects performance compared to downsampling the ground truth and evaluating at the output resolution.3
5 Results
5.1 NYU Depth
Results for the NYU Depth dataset are provided in Table 1. As explained in Section 4.3, we compare against the data mean and Make3D as baselines, as well as Karsch et al. [7] and Ladicky et al. [12]. (Ladicky et al. uses a joint model which is trained using both depth and semantic labels). Our system achieves the best performance on all metrics, obtaining an average 35% relative gain compared to the runner-up. 
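The standard metrics of Sec. 4.3 can be sketched as follows; depths are flat lists for illustration, and thr defaults to 1.25 as in the result tables:

```python
import math

def threshold_acc(pred, gt, thr=1.25):
    """Fraction of pixels with max(y/y*, y*/y) below thr."""
    ok = sum(1 for y, t in zip(pred, gt) if max(y / t, t / y) < thr)
    return ok / len(pred)

def abs_rel(pred, gt):
    """Mean absolute relative difference |y - y*| / y*."""
    return sum(abs(y - t) / t for y, t in zip(pred, gt)) / len(pred)

def sqr_rel(pred, gt):
    """Mean squared relative difference (y - y*)^2 / y*."""
    return sum((y - t) ** 2 / t for y, t in zip(pred, gt)) / len(pred)

def rmse(pred, gt, log=False):
    """Root mean squared error, in linear or log space."""
    f = math.log if log else (lambda x: x)
    return math.sqrt(sum((f(y) - f(t)) ** 2 for y, t in zip(pred, gt)) / len(pred))

pred, gt = [1.0, 2.0, 4.2], [1.1, 2.6, 4.0]
print(threshold_acc(pred, gt), abs_rel(pred, gt), rmse(pred, gt, log=True))
```

For the threshold metric, higher is better; for the rest, lower is better, matching the row groupings in Tables 1 and 2.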
Note that our system is trained using the raw dataset, which contains many more example instances than the data used by other approaches, and is able to effectively leverage it to learn relevant features and their associations.
This dataset breaks many assumptions made by Make3D, particularly horizontal alignment of the ground plane; as a result, Make3D has relatively poor performance in this task. Importantly, our method improves over it on both scale-dependent and scale-invariant metrics, showing that our system is able to predict better relations as well as better means.
Qualitative results are shown on the left side of Fig. 4, sorted top-to-bottom by scale-invariant MSE. Although the fine-scale network does not improve in the error measurements, its effect is clearly visible in the depth maps — surface boundaries have sharper transitions, aligning to local details. However, some texture edges are sometimes also included. Fig. 3 compares Make3D as well as

3On NYUDepth, log RMSE is 0.285 vs 0.286 for upsampling and downsampling, respectively, and scale-invariant RMSE is 0.219 vs 0.221. 
The intersection is 86% of the network region and 100% of Make3D for NYUDepth, and 100% of the network and 82% of Make3D for KITTI.

                         Mean    Make3D  Ladicky&al  Karsch&al  Coarse  Coarse + Fine
threshold \u03b4 < 1.25       0.418   0.447   0.542       \u2013          0.618   0.611
threshold \u03b4 < 1.25\u00b2      0.711   0.745   0.829       \u2013          0.891   0.887
threshold \u03b4 < 1.25\u00b3      0.874   0.897   0.940       \u2013          0.969   0.971
abs relative difference  0.408   0.349   \u2013           0.350      0.228   0.215
sqr relative difference  0.581   0.492   \u2013           \u2013          0.223   0.212
RMSE (linear)            1.244   1.214   \u2013           1.2        0.871   0.907
RMSE (log)               0.430   0.409   \u2013           \u2013          0.283   0.285
RMSE (log, scale inv.)   0.304   0.325   \u2013           \u2013          0.221   0.219
(First three rows: higher is better; remaining rows: lower is better.)
Table 1: Comparison on the NYUDepth dataset

Figure 3: Qualitative comparison of Make3D, our method trained with l2 loss (\u03bb = 0), and our method trained with both l2 and scale-invariant loss (\u03bb = 0.5). Each example shows: input, Make3D, coarse, L2, L2 scale-inv, ground truth.

outputs from our network trained using losses with \u03bb = 0 and \u03bb = 0.5. While we did not observe numeric gains using \u03bb = 0.5, it did produce slight qualitative improvements in more detailed areas.
5.2 KITTI
We next examine results on the KITTI driving dataset. Here, the Make3D baseline is well-suited to the dataset, being composed of horizontally aligned images, and achieves relatively good results. Still, our method improves over it on all metrics, by an average 31% relative gain. Just as importantly, there is a 25% gain in both the scale-dependent and scale-invariant RMSE errors, showing there is substantial improvement in the predicted structure. 
Again, the fine-scale network does not improve much over the coarse one in the error metrics, but differences between the two can be seen in the qualitative outputs.
The right side of Fig. 4 shows examples of predictions, again sorted by error. The fine-scale network produces sharper transitions here as well, particularly near the road edge. However, the changes are somewhat limited. This is likely caused by uncorrected alignment issues between the depth map and input in the training data, due to the rotating scanner setup. This dissociates edges from their true position, causing the network to average over their more random placements. Fig. 3 shows Make3D performing much better on this data, as expected, while using the scale-invariant error as a loss seems to have little effect in this case.

                         Mean    Make3D  Coarse  Coarse + Fine
threshold \u03b4 < 1.25       0.556   0.601   0.679   0.692
threshold \u03b4 < 1.25\u00b2      0.752   0.820   0.897   0.899
threshold \u03b4 < 1.25\u00b3      0.870   0.926   0.967   0.967
abs relative difference  0.412   0.280   0.194   0.190
sqr relative difference  5.712   3.012   1.531   1.515
RMSE (linear)            9.635   8.734   7.216   7.156
RMSE (log)               0.444   0.361   0.273   0.270
RMSE (log, scale inv.)   0.359   0.327   0.248   0.246
(First three rows: higher is better; remaining rows: lower is better.)
Table 2: Comparison on the KITTI dataset.

6 Discussion
Predicting depth estimates from a single image is a challenging task. Yet by combining information from both global and local views, it can be performed reasonably well. Our system accomplishes this through the use of two deep networks, one that estimates the global depth structure, and another that refines it locally at finer resolution. We achieve a new state-of-the-art on this task for NYU Depth and KITTI datasets, having effectively leveraged the full raw data distributions.
In future work, we plan to extend our method to incorporate further 3D geometry information, such as surface normals. 
Promising results in normal map prediction have been made by Fouhey et al. [2], and integrating them along with depth maps stands to improve overall performance [16]. We also hope to extend the depth maps to the full original input resolution by repeated application of successively finer-scaled local networks.

[Figure 4 panels: (a) input, (b) coarse output, (c) refined output, (d) ground truth, for NYUDepth and KITTI]
Figure 4: Example predictions from our algorithm. NYUDepth on left, KITTI on right. For each image, we show (a) input, (b) output of coarse network, (c) refined output of fine network, (d) ground truth. The fine-scale network edits the coarse-scale input to better align with details such as object boundaries and wall edges. Examples are sorted from best (top) to worst (bottom).

Acknowledgements
The authors are grateful for support from ONR #N00014-13-1-0646, NSF #1116923, #1149633 and Microsoft Research.

References
[1] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[2] D. F. Fouhey, A. Gupta, and M. Hebert. Data-driven 3d primitives for single image understanding. In ICCV, 2013.
[3] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.
[4] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun. Learning long-range vision for autonomous off-road driving. Journal of Field Robotics, 26(2):120–144, 2009.
[5] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
[6] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. In ACM SIGGRAPH, pages 577–584, 2005.
[7] K. Karsch, C. Liu, S. B. Kang, and N. England. Depth extraction from video using non-parametric sampling. In TPAMI, 2014.
[8] K.
Konda and R. Memisevic. Unsupervised learning of depth and motion. arXiv:1312.3429v2, 2013.
[9] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[10] A. Levin, R. Fergus, F. Durand, and W. T. Freeman. Image and depth from a conventional camera with a coded aperture. In SIGGRAPH, 2007.
[11] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. Freeman. SIFT flow: dense correspondence across different scenes. 2008.
[12] L. Ladicky, J. Shi, and M. Pollefeys. Pulling things out of perspective. In CVPR, 2014.
[13] R. Memisevic and C. Conrad. Stereopsis via deep learning. In NIPS Workshop on Deep Learning, 2011.
[14] J. Michels, A. Saxena, and A. Y. Ng. High speed obstacle avoidance using monocular vision and reinforcement learning. In ICML, pages 593–600, 2005.
[15] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In NIPS, 2005.
[16] A. Saxena, M. Sun, and A. Y. Ng. Make3d: Learning 3-d scene structure from a single still image. TPAMI, 2008.
[17] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 47:7–42, 2002.
[18] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
[19] F. H. Sinz, J. Q. Candela, G. H. Bakır, C. E. Rasmussen, and M. O. Franz. Learning depth from stereo. In Pattern Recognition, pages 245–252. Springer, 2004.
[20] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3d. 2006.
[21] K. Yamaguchi, T. Hazan, D. Mcallester, and R. Urtasun. Continuous markov random fields for robust stereo estimation. In arXiv:1204.1393v1, 2012.