{"title": "Single-Image Depth Perception in the Wild", "book": "Advances in Neural Information Processing Systems", "page_first": 730, "page_last": 738, "abstract": "This paper studies single-image depth perception in the wild, i.e., recovering depth from a single image taken in unconstrained settings. We introduce a new dataset \u201cDepth in the Wild\u201d consisting of images in the wild annotated with relative depth between pairs of random points. We also propose a new algorithm that learns to estimate metric depth using annotations of relative depth. Compared to the state of the art, our algorithm is simpler and performs better. Experiments show that our algorithm, combined with existing RGB-D data and our new relative depth annotations, significantly improves single-image depth perception in the wild.", "full_text": "Single-Image Depth Perception in the Wild\n\nWeifeng Chen\n\nJia Deng\n\nZhao Fu\n\nDawei Yang\nUniversity of Michigan, Ann Arbor\n\n{wfchen,zhaofu,ydawei,jiadeng}@umich.edu\n\nAbstract\n\nThis paper studies single-image depth perception in the wild, i.e., recovering depth\nfrom a single image taken in unconstrained settings. We introduce a new dataset\n\u201cDepth in the Wild\u201d consisting of images in the wild annotated with relative depth\nbetween pairs of random points. We also propose a new algorithm that learns to\nestimate metric depth using annotations of relative depth. Compared to the state\nof the art, our algorithm is simpler and performs better. Experiments show that\nour algorithm, combined with existing RGB-D data and our new relative depth\nannotations, signi\ufb01cantly improves single-image depth perception in the wild.\n\nFigure 1: We crowdsource annotations of relative depth and train a deep network to recover depth\nfrom a single image taken in unconstrained settings (\u201cin the wild\u201d).\n\n1\n\nIntroduction\n\nDepth from a single RGB image is a fundamental problem in vision. 
Recent years have seen rapid\nprogress thanks to data-driven methods [1, 2, 3], in particular, deep neural networks trained on large\nRGB-D datasets [4, 5, 6, 7, 8, 9, 10]. But such advances have yet to broadly impact higher-level tasks.\nOne reason is that many higher-level tasks must operate on images \u201cin the wild\u201d\u2014images taken with\nno constraints on cameras, locations, scenes, and objects\u2014but the RGB-D datasets used to train and\nevaluate image-to-depth systems are constrained in one way or another.\nCurrent RGB-D datasets were collected by depth sensors [4, 5], which are limited in range and\nresolution, and often fail on specular or transparent objects [11]. In addition, because there is no\nFlickr for RGB-D images, researchers have to manually capture the images. As a result, current\nRGB-D datasets are limited in the diversity of scenes. For example, NYU Depth [4] consists mostly of\nindoor scenes with no human presence; KITTI [5] consists mostly of road scenes captured from a car;\nMake3D [3, 12] consists mostly of outdoor scenes of the Stanford campus (Fig. 2). While these\ndatasets are pivotal in driving research, it is unclear whether systems trained on them can generalize\nto images in the wild.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIs it possible to collect ground-truth depth for images in the wild? Using depth sensors in\nunconstrained settings is not yet feasible. Crowdsourcing seems viable, but humans are not good at\nestimating metric depth, or 3D metric structure in general [13]. In fact, metric depth from a single\nimage is fundamentally ambiguous: a tree behind a house can be slightly bigger but further away,\nor slightly smaller but closer\u2014the absolute depth difference between the house and the tree cannot\nbe uniquely determined. 
Furthermore, even in cases where humans can estimate metric depth, it is\nunclear how to elicit the values from them.\nBut humans are better at judging relative depth [13]: \u201cIs point A closer than point B?\u201d is often a much\neasier question for humans. Recent work by Zoran et al. [14] shows that it is possible to learn to\nestimate metric depth using only annotations of relative depth. Although such metric depth estimates\nare only accurate up to monotonic transformations, they may well be suf\ufb01ciently useful for high-level\ntasks, especially for occlusion reasoning. The seminal results by Zoran et al. point to two fronts for\nfurther progress: (1) collecting a large amount of relative depth annotations for images in the wild\nand (2) improving the algorithms that learn from annotations of relative depth.\nIn this paper, we make contributions on both fronts. Our \ufb01rst contribution is a new dataset called\n\u201cDepth in the Wild\u201d (DIW). It consists of 495K diverse images, each annotated with randomly sampled\npoints and their relative depth. We sample one pair of points per image to minimize the redundancy of\nannotation 1. To the best of our knowledge this is the \ufb01rst large-scale dataset consisting of images in\nthe wild with relative depth annotations. We demonstrate that this dataset can be used as an evaluation\nbenchmark as well as a training resource 2.\nOur second contribution is a new algorithm for learning to estimate metric depth using only anno-\ntations of relative depth. Our algorithm not only signi\ufb01cantly outperforms that of Zoran et al. [14],\nbut is also simpler. The algorithm of Zoran et al. [14] \ufb01rst learns a classi\ufb01er to predict the ordinal\nrelation between two points in an image. Given a new image, this classi\ufb01er is repeatedly applied\nto predict the ordinal relations between a sparse set of point pairs (mostly between the centers of\nneighboring superpixels). 
The algorithm then reconstructs depth from the predicted ordinal relations\nby solving a constrained quadratic optimization that enforces additional smoothness constraints and\nreconciles potentially inconsistent ordinal relations. Finally, the algorithm estimates depth for all\npixels assuming a constant depth within each superpixel.\nIn contrast, our algorithm consists of a single deep network that directly predicts pixel-wise depth\n(Fig. 1). The network takes an entire image as input, consists of off-the-shelf components, and\ncan be trained entirely with annotations of relative depth. The novelty of our approach lies in the\ncombination of two ingredients: (1) a multi-scale deep network that produces pixel-wise prediction\nof metric depth and (2) a loss function using relative depth. Experiments show that our method\nproduces pixel-wise depth that is more accurately ordered, outperforming not only the method by\nZoran et al. [14] but also the state-of-the-art image-to-depth system by Eigen et al. [8] trained with\nground-truth metric depth. Furthermore, combining our new algorithm, our new dataset, and existing\nRGB-D data significantly improves single-image depth estimation in the wild.\n\n2 Related work\n\nRGB-D Datasets: Prior work on constructing RGB-D datasets has relied on either Kinect [15, 4, 16,\n17] or LIDAR [5, 3]. Existing Kinect-based datasets are limited to indoor scenes; existing LIDAR-based\ndatasets are biased towards scenes of man-made structures [5, 3]. In contrast, our dataset covers\na much wider variety of scenes; it can be easily expanded with large-scale crowdsourcing and the\nvirtually unlimited supply of Internet images.\nIntrinsic Images in the Wild: Our work draws inspiration from Intrinsic Images in the Wild [18], a\nseminal work that crowdsources annotations of relative reflectance on unconstrained images. Our\nwork differs in goals as well as in several design decisions. 
First, we sample random points instead of\ncenters of superpixels, because unlike reflectance, it is unreasonable to assume a constant depth within\na superpixel. Second, we sample only one pair of points per image instead of many to maximize the\nvalue of human annotations.\nDepth from a Single Image: Image-to-depth is a long-standing problem with a large body of\nliterature [19, 20, 12, 1, 6, 7, 8, 9, 10, 21, 22, 23, 24, 25, 26]. The recent convergence of deep\nneural networks and RGB-D datasets [4, 5] has led to major advances [27, 6, 28, 8, 10, 14]. But\nthe networks in these previous works, with the exception of [14], were trained exclusively using\nground-truth metric depth, whereas our approach uses relative depth.\n\n1 A small percentage of images have duplicates and thus have multiple pairs.\n2 Project website: http://www-personal.umich.edu/~wfchen/depth-in-the-wild.\n\nFigure 2: Example images from current RGB-D datasets and our Depth in the Wild (DIW) dataset.\n\nFigure 3: Annotation UI. The user presses\n\u20181\u2019 or \u20182\u2019 to pick the closer point.\n\nFigure 4: Relative image location (normalized to\n[-1,1]) and relative depth of two random points.\n\nOur work is inspired by that of Zoran et al. [14], which proposes to use a deep network to repeatedly\nclassify pairs of points sampled based on superpixel segmentation, and to reconstruct per-pixel metric\ndepth by solving an additional optimization problem. Our approach is different: it consists of a single\ndeep network trained end-to-end that directly predicts per-pixel metric depth; there is no intermediate\nclassification of ordinal relations and as a result no optimization needed to resolve inconsistencies.\nLearning with Ordinal Relations: Several recent works [29, 30] have used the ordinal relations\nfrom the Intrinsic Images in the Wild dataset [18] to estimate surface reflectance. Similar to Zoran et\nal. [14], Zhou et al. 
[29] \ufb01rst learn a deep network to classify the ordinal relations between pairs of\npoints and then make them globally consistent through energy minimization.\nNarihira et al. [30] learn a \u201clightness potential\u201d network that takes an image patch and predicts the\nmetric re\ufb02ectance of the center pixel. But this network is applied to only a sparse set of pixels.\nAlthough in principle this lightness potential network can be applied to every pixel to produce\npixel-wise re\ufb02ectance, doing so would be quite expensive. Making it fully convolutional (as the\nauthors mentioned in [30]) only solves it partially: as long as the lightness potential network has\ndownsampling layers, which is the case in [30], the \ufb01nal output will be downsampled accordingly.\nAdditional resolution augmentation (such as the \u201cshift and stitch\u201d approach [31]) is thus needed. In\ncontrast, our approach completely avoids such issues and directly outputs pixel-wise estimates.\nBeyond intrinsic images, ordinal relations have been used widely in computer vision and machine\nlearning, including object recognition [32] and learning to rank [33, 34].\n\n3 Dataset construction\n\nWe gather images from Flickr. We use random query keywords sampled from an English dictionary\nand exclude arti\ufb01cial images such as drawings and clip arts. To collect annotations of relative depth,\nwe present a crowd worker an image and two highlighted points (Fig. 3), and ask \u201cwhich point is\ncloser, point 1, point 2, or hard to tell?\u201d The worker presses a key to respond.\nHow Many Pairs? How many pairs of points should we query per image? We sample just one per\nimage because this maximizes the amount of information from human annotators. Consider the other\nextreme\u2014querying all possible pairs of points in the same image. This is wasteful because pairs of\npoints in close proximity are likely to have the same relative depth. 
In other words, querying one\nmore pair from the same image may add less information than querying one more pair from a new\nimage. Thus querying only one pair per image is more cost-effective.\n\nFigure 5: Example images and annotations. Green points are those annotated as closer in depth.\n\nWhich Pairs? Which two points should we query given an image? The simplest way would be to\nsample two random points from the 2D plane. But this results in a severe bias that can be easily\nexploited: if an algorithm simply classifies the lower point in the image to be closer in depth, it will\nagree with humans 85.8% of the time (Fig. 4). Although this bias is natural, it makes the dataset less\nuseful as a benchmark.\nAn alternative is to sample two points uniformly from a random horizontal line, which makes it\nimpossible to use the y image coordinate as a cue. But we find yet another bias: if an algorithm simply\nclassifies the point closer to the center of the image to be closer in depth, it will agree with humans\n71.4% of the time. This leads to a third approach: uniformly sample two symmetric points with\nrespect to the center from a random horizontal line (the middle column of Fig. 5). With the symmetry\nenforced, we are not able to find a simple yet effective rule based purely on image coordinates: the\nleft point is almost equally likely (50.03%) to be closer than the right one.\nOur final dataset consists of a roughly 50-50 combination of unconstrained pairs and symmetric pairs,\nwhich strikes a balance between the need for representing natural scene statistics and the need for\nperformance differentiation.\nProtocol and Results: We crowdsource the annotations using Amazon Mechanical Turk (AMT).\nTo remove spammers, we insert into all tasks gold-standard images verified by ourselves, and reject\nworkers whose cumulative accuracy on the gold-standard images is below 85%. 
We assign each\nquery (an image and a point pair) to two workers, and add the query to our dataset if both workers\ncan tell the relative depth and agree with each other; otherwise the query is discarded. Under this\nprotocol, the chance of adding a wrong answer to our dataset is less than 1% as measured on the\ngold-standard images.\nWe processed 1.24M images on AMT and obtained 0.5M valid answers (both workers can tell the\nrelative depth and agree with each other). Among the valid answers, 261K are for unconstrained pairs\nand 240K are for symmetric pairs. For unconstrained pairs, it takes a median of 3.4 seconds for a\nworker to decide, and two workers agree on the relative depth 52% of the time; for symmetric pairs,\nthe numbers are 3.8 seconds and 32%. These numbers suggest that the symmetric pairs are indeed harder.\nFig. 5 presents examples of different kinds of queries.\n\n4 Learning with relative depth\n\nHow do we learn to predict metric depth given only annotations of relative depth? Zoran et al. [14]\nfirst learn a classifier to predict ordinal relations between centers of superpixels, then reconcile\nthe relations to recover depth using energy minimization, and then interpolate within each superpixel\nto produce per-pixel depth.\nWe take a simpler approach. The idea is that any image-to-depth algorithm would have to compute a\nfunction that maps an image to pixel-wise depth. Why not represent this function as a neural network\nand learn it from end to end? We just need two ingredients: (1) a network design that outputs the\nsame resolution as the input, and (2) a way to train the network with annotations of relative depth.\nNetwork Design: Networks that output the same resolution as the input are aplenty, including\nthe recent designs for depth estimation [8, 35] and those for semantic segmentation [36] and edge\ndetection [37]. 
A common element is processing and passing information across multiple scales.\n\nFigure 6: Network design. Each block represents a layer. Blocks sharing the same color are identical.\nThe \u2295 sign denotes the element-wise addition. Block H is a convolution with 3x3 filter. All other\nblocks denote the Inception module shown in Figure 7. Their parameters are detailed in Tab. 1.\n\nTable 1: Parameters for each type of layer in our network.\nConv1 to Conv4 are sizes of the filters used in the components\nof the Inception module shown in Figure 7. Conv2 to Conv4 share the\nsame number of inputs, which is specified in Inter Dim.\n\nBlock Id | A | B | C | D | E | F | G\n#In/#Out | 128/64 | 128/128 | 128/128 | 128/256 | 256/256 | 256/256 | 256/128\nInter Dim | 64 | 32 | 64 | 32 | 32 | 64 | 32\nConv1 | 1x1 | 1x1 | 1x1 | 1x1 | 1x1 | 1x1 | 1x1\nConv2 | 3x3 | 3x3 | 3x3 | 3x3 | 3x3 | 3x3 | 3x3\nConv3 | 7x7 | 5x5 | 7x7 | 5x5 | 5x5 | 7x7 | 5x5\nConv4 | 11x11 | 7x7 | 11x11 | 7x7 | 7x7 | 11x11 | 7x7\n\nFigure 7: Variant of Inception\nModule [39] used by us.\n\nIn this work, we use a variant of the recently introduced \u201chourglass\u201d network (Fig. 6), which has\nbeen used to achieve state-of-the-art results on human pose estimation [38]. It consists of a series\nof convolutions (using a variant of the inception [39] module) and downsampling, followed by a\nseries of convolutions and upsampling, interleaved with skip connections that add back features from\nhigh resolutions. The symmetric shape of the network resembles a \u201chourglass\u201d, hence the name. We\nrefer the reader to [38] for comparing the design to related work. 
For our purpose, this particular\nchoice is not essential, as the various designs mainly differ in how information from different scales\nis dispersed and aggregated, and it is possible that all of them can work equally well for our task.\nLoss Function: How do we train the network using only ordinal annotations? All we need is a loss\nfunction that encourages the predicted depth map to agree with the ground-truth ordinal relations.\nSpecifically, consider a training image $I$ and its $K$ queries $R = \\{(i_k, j_k, r_k)\\}$, $k = 1, \\ldots, K$, where\n$i_k$ is the location of the first point in the k-th query, $j_k$ is the location of the second point in the\nk-th query, and $r_k \\in \\{+1, -1, 0\\}$ is the ground-truth depth relation between $i_k$ and $j_k$: closer (+1),\nfurther (\u22121), and equal (0). Let $z$ be the predicted depth map and $z_{i_k}$, $z_{j_k}$ be the depths at points $i_k$\nand $j_k$. We define a loss function\n\n$$L(I, R, z) = \\sum_{k=1}^{K} \\psi_k(I, i_k, j_k, r_k, z), \\tag{1}$$\n\nwhere $\\psi_k(I, i_k, j_k, r_k, z)$ is the loss for the k-th query\n\n$$\\psi_k(I, i_k, j_k, r_k, z) = \\begin{cases} \\log\\left(1 + \\exp(-z_{i_k} + z_{j_k})\\right), & r_k = +1 \\\\\\\\ \\log\\left(1 + \\exp(z_{i_k} - z_{j_k})\\right), & r_k = -1 \\\\\\\\ (z_{i_k} - z_{j_k})^2, & r_k = 0. \\end{cases} \\tag{2}$$\n\nThis is essentially a ranking loss: it encourages a small difference between depths if the ground-truth\nrelation is equality; otherwise it encourages a large difference.\nNovelty of Our Approach: Our novelty lies in the combination of a deep network that does pixel-wise\nprediction and a ranking loss placed on the pixel-wise prediction. A deep network that does\npixel-wise prediction is not new, nor is a ranking loss. But to the best of our knowledge, such a\ncombination has not been proposed before, and in particular not for estimating depth.\n\n5 Experiments on NYU Depth\n\nWe evaluate our method using NYU Depth [4], which consists of indoor scenes with ground-truth\nKinect depth. We use the same setup as that of Zoran et al. 
[14]: point pairs are sampled from the\ntraining images (the subset of NYU Depth consisting of 795 images with semantic labels) using\nsuperpixel segmentation and their ground-truth ordinal relations are generated by comparing the\nground-truth Kinect depth; the same procedure is applied to the test set to generate the point pairs for\nevaluation (around 3K pairs per image). We use the same training and test data as Zoran et al. [14].\n\nFigure 8: Qualitative results on NYU Depth by our method, the method of Eigen et al. [8], and the\nmethod of Zoran et al. [14]. All depth maps except ours are directly from [14]. More results are in\nthe supplementary material.\n\nTable 2: Left table: ordinal error measures (disagreement rate with ground-truth depth ordering) on\nNYU Depth. Right table: metric error measures on NYU Depth. Details for each metric can be found\nin [8]. There are two versions of results by Eigen et al. [8], one using AlexNet (Eigen(A)) and one\nusing VGGNet (Eigen(V)). Lower is better for all error measures.\n\nMethod | WKDR | WKDR= | WKDR\u2260\nOurs | 35.6% | 36.1% | 36.5%\nZoran [14] | 43.5% | 44.2% | 41.4%\nrand_12K | 34.9% | 32.4% | 37.6%\nrand_6K | 36.1% | 32.2% | 39.9%\nrand_3K | 35.8% | 28.7% | 41.3%\nOurs_Full | 28.3% | 30.6% | 28.6%\nEigen(A) [8] | 37.5% | 46.9% | 32.7%\nEigen(V) [8] | 34.0% | 43.3% | 29.6%\n\nMethod | RMSE | RMSE (log) | RMSE (s.inv)^a | absrel | sqrrel\nOurs | 1.13 | 0.39 | 0.26 | 0.36 | 0.46\nOurs_Full | 1.10 | 0.38 | 0.24 | 0.34 | 0.42\nZoran [14] | 1.20 | 0.42 | - | 0.40 | 0.54\nEigen(A) [8] | 0.75 | 0.26 | 0.20 | 0.21 | 0.19\nEigen(V) [8] | 0.64 | 0.21 | 0.17 | 0.16 | 0.12\nWang [28] | 0.75 | - | - | 0.22 | -\nLiu [6] | 0.82 | - | - | 0.23 | -\nLi [10] | 0.82 | - | - | 0.23 | -\nKarsch [1] | 1.20 | - | - | 0.35 | -\nBaig [40] | 1.0 | - | - | 0.3 | -\n\nLike the system by Zoran et al. 
[14], our network predicts one of the three ordinal relations on the\ntest pairs: equal (=), closer (<), or farther (>). We report WKDR, the weighted disagreement rate\nbetween the predicted ordinal relations and ground-truth ordinal relations 3. We also report WKDR=\n(disagreement rate on pairs whose ground-truth relations are =) and WKDR\u2260 (disagreement rate on\npairs whose ground-truth relations are < or >).\nSince two ground-truth depths are almost never exactly the same, there needs to be a relaxed definition\nof equality. Zoran et al. [14] define two points to have equal depths if the ratio between their\nground-truth depths is within a pre-determined range. Our network predicts an equality relation if the depth\ndifference is smaller than a threshold \u03c4. The choice of this threshold will result in different values for\nthe error metrics (WKDR, WKDR=, WKDR\u2260): if \u03c4 is too small, most pairs will be predicted to be\nunequal and the error metric on equality relations (WKDR=) will be large; if \u03c4 is too big, most pairs\nwill be predicted to be equal and the error metric on inequality relations (WKDR\u2260) will be large. We\nchoose the threshold \u03c4 that minimizes the maximum of the three error metrics on a validation set\nheld out from the training set. Tab. 2 compares our network (Ours) versus that of Zoran et al. [14].\nOur network is trained with the same data 4 but outperforms [14] on all three metrics.\nFollowing [14], we also compare with the state-of-the-art image-to-depth system by Eigen et al. [8],\nwhich is trained on pixel-wise ground-truth metric depth from the full NYU Depth training set\n(220K images). To compare fairly, we give our network access to the full NYU Depth training set.\nIn addition, we remove the limit of 800 point pairs per training image placed by Zoran et al. [14] and\nuse all available pairs. The results in Tab. 
2 show that our network (Ours_Full) achieves superior\nperformance in estimating depth ordering. Granted, this comparison is not entirely fair because [8] is\nnot optimized for predicting ordinal relations. But this comparison is still significant in that it shows\nthat we can train on only relative depth and rival the state-of-the-art system in estimating depth up to\nmonotonic transformations.\n\na Computed using our own implementation based on the definition given in [35].\n3 WKDR stands for \u201cWeighted Kinect Disagreement Rate\u201d; the weight is set to 1 as in [14].\n4 The code released by Zoran et al. [14] indicates that they train with a random subset of 800 pairs per image\ninstead of all the pairs. We follow the same procedure and only use a random subset of 800 pairs per image.\n\nFigure 9: Point pairs generated through superpixel segmentation [14] (left) versus point pairs\ngenerated through random sampling with distance constraints (right).\n\nIn Fig. 8 we show qualitative results on the same example images used by Zoran et al. [14]. We\nsee that although imperfect, the recovered metric depth by our method is overall reasonable and\nqualitatively similar to that by the state-of-the-art system [8] trained on ground-truth metric depth.\nMetric Error Measures. Our network is trained with relative depth, so it is unsurprising that it does\nwell in estimating depth up to ordering. But how good is the estimated depth in terms of metric\nerror? We thus evaluate conventional error measures such as RMSE (the root mean squared error),\nwhich compares the absolute depth values to the ground truths. Because our network is trained\nonly on relative depth and does not know the range of the ground-truth depth values, to make these\nerror measures meaningful we normalize the depth predicted by our network such that the mean and\nstandard deviation are the same as those of the mean depth map of the training set. 
Tab. 2 reports the\nresults. We see that under these metric error measures our network still outperforms the method of\nZoran et al. [14]. In addition, while our metric error is worse than the current state-of-the-art, it is\ncomparable to some of the earlier methods (e.g. [1]) that have access to ground-truth metric depth.\nSuperpixel Sampling versus Random Sampling. To compare with the method by Zoran et al. [14],\nwe train our network using the same point pairs, which are pairs of centers of superpixels (Fig. 9). But\nis superpixel segmentation necessary? That is, can we simply train with randomly sampled points?\nTo answer this question, we train our network with randomly sampled points. We constrain the\ndistance between the two points to be between 13 and 19 pixels (out of a 320\u00d7240 image) such\nthat the distance is similar to that between the centers of neighboring superpixels. The results are\nincluded in Tab. 2. We see that using 3.3k pairs per image (rand_3K) already achieves comparable\nperformance to the method by Zoran et al. [14]. Using twice or four times as many pairs (rand_6K,\nrand_12K) further improves performance and signi\ufb01cantly outperforms [14].\nIt is worth noting that in all these experiments the test pairs are still from superpixels, so training on\nrandom pairs incurs a mismatch between training and testing distributions. Yet we can still achieve\ncomparable performance despite this mismatch. This shows that our method can indeed operate\nwithout superpixel segmentation.\n\n6 Experiments on Depth in the Wild\n\nIn this section we experiment on our new Depth in the Wild (DIW) dataset. We split the dataset into\n421K training images and 74K test images 5.\nWe report the WHDR (Weighted Human Disagreement Rate) 6 of 5 methods in Tab. 3: (1) the\nstate-of-the-art system by Eigen et al. 
[8] trained on full NYU Depth; (2) our network trained on\nfull NYU Depth (Ours_Full); (3) our network pre-trained on full NYU Depth and fine-tuned on\nDIW (Ours_NYU_DIW); (4) our network trained from scratch on DIW (Ours_DIW); (5) a baseline\nmethod that uses only the location of the query points: classify the lower point to be closer or guess\nrandomly if the two points are at the same height (Query_Location_Only).\n\n5 4.38% of images are duplicates downloaded using different query keywords and have more than one pair\nof points. We have removed test images that have duplicates in the training set.\n6 All weights are 1. A pair of points can only have two possible ordinal relations (farther or closer) for DIW.\n\nTable 3: Weighted Human Disagreement Rate (WHDR) of various methods on our DIW dataset,\nincluding Eigen(V), the method of Eigen et al. [8] (VGGNet [41] version).\n\nMethod | Eigen(V) [8] | Ours_Full | Ours_NYU_DIW | Ours_DIW | Query_Location_Only\nWHDR | 25.70% | 31.31% | 14.39% | 22.14% | 31.37%\n\nFigure 10: Qualitative results on our Depth in the Wild (DIW) dataset by our method and the method\nof Eigen et al. [8]. More results are in the supplementary material.\n\nWe see that the best result is achieved by pre-training on NYU Depth and fine-tuning on DIW. Training\nonly on NYU Depth (Ours_Full and Eigen) does not work as well, which is expected because NYU\nDepth only has indoor scenes. Training from scratch on DIW achieves slightly better performance\nthan those trained on only NYU Depth despite using much less supervision. Pre-training on NYU\nDepth and fine-tuning on DIW leverages all available data and achieves the best performance. As\nshown in Fig. 10, the quality of predicted depth is notably better with fine-tuning on DIW, especially\nfor outdoor scenes. 
These results suggest that it is promising to combine existing RGB-D data and\ncrowdsourced annotations to advance the state of the art in single-image depth estimation.\n\n7 Conclusions\n\nWe have studied single-image depth perception in the wild, recovering depth from a single image\ntaken in unconstrained settings. We have introduced a new dataset consisting of images in the wild\nannotated with relative depth and proposed a new algorithm that learns to estimate metric depth\nsupervised by relative depth. We have shown that our algorithm outperforms prior art and that our\nalgorithm, combined with existing RGB-D data and our new relative depth annotations, significantly\nimproves single-image depth perception in the wild.\nAcknowledgments\nThis work is partially supported by the National Science Foundation under Grant No. 1617767.\nReferences\n\n[1] K. Karsch, C. Liu, and S. B. Kang, \u201cDepthtransfer: Depth extraction from video using non-parametric\n\nsampling,\u201d TPAMI, 2014.\n\n[2] D. Hoiem, A. A. Efros, and M. Hebert, \u201cAutomatic photo pop-up,\u201d TOG, 2005.\n[3] A. Saxena, M. Sun, and A. Ng, \u201cMake3d: Learning 3d scene structure from a single still image,\u201d TPAMI,\n\n2009.\n\n[4] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, \u201cIndoor segmentation and support inference from rgbd\n\nimages,\u201d in ECCV, Springer, 2012.\n\n[5] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, \u201cVision meets robotics: The kitti dataset,\u201d The International\n\nJournal of Robotics Research, p. 0278364913491297, 2013.\n\n[6] F. Liu, C. Shen, and G. Lin, \u201cDeep convolutional neural fields for depth estimation from a single image,\u201d\n\nin CVPR, 2015.\n\n[7] L. Ladicky, J. Shi, and M. Pollefeys, \u201cPulling things out of perspective,\u201d in CVPR, IEEE, 2014.\n[8] D. Eigen and R. Fergus, \u201cPredicting depth, surface normals and semantic labels with a common multi-scale\n\nconvolutional architecture,\u201d in ICCV, 2015.\n\n[9] M. H. 
Baig and L. Torresani, \u201cCoupled depth learning,\u201d arXiv preprint arXiv:1501.04537, 2015.\n[10] B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He, \u201cDepth and surface normal estimation from\n\nmonocular images using regression on deep features and hierarchical crfs,\u201d in CVPR, 2015.\n\n[11] W. W.-C. Chiu, U. Blanke, and M. Fritz, \u201cImproving the kinect by cross-modal stereo,\u201d in BMVC, 2011.\n[12] A. Saxena, S. H. Chung, and A. Y. Ng, \u201cLearning depth from single monocular images,\u201d in NIPS, 2005.\n[13] J. T. Todd and J. F. Norman, \u201cThe visual perception of 3-d shape from multiple cues: Are observers capable\n\nof perceiving metric structure?,\u201d Perception & Psychophysics, pp. 31\u201347, 2003.\n\n[14] D. Zoran, P. Isola, D. Krishnan, and W. T. Freeman, \u201cLearning ordinal relationships for mid-level vision,\u201d\n\nin ICCV, 2015.\n\n[15] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell, \u201cA category-level 3d object\ndataset: Putting the kinect to work,\u201d in Consumer Depth Cameras for Computer Vision, Springer, 2013.\n[16] S. Song, S. P. Lichtenberg, and J. Xiao, \u201cSun rgb-d: A rgb-d scene understanding benchmark suite,\u201d in\n\nCVPR, 2015.\n\n[17] S. Choi, Q.-Y. Zhou, S. Miller, and V. Koltun, \u201cA large dataset of object scans,\u201d arXiv preprint\n\narXiv:1602.02481, 2016.\n\n[18] S. Bell, K. Bala, and N. Snavely, \u201cIntrinsic images in the wild,\u201d TOG, 2014.\n[19] J. T. Barron and J. Malik, \u201cShape, illumination, and reflectance from shading,\u201d TPAMI, 2015.\n[20] A. Saxena, S. H. Chung, and A. Y. Ng, \u201c3-d depth reconstruction from a single still image,\u201d IJCV, 2008.\n[21] Y. Xiong, A. Chakrabarti, R. Basri, S. J. Gortler, D. W. Jacobs, and T. Zickler, \u201cFrom shading to local\n\nshape,\u201d TPAMI, 2015.\n\n[22] C. Hane, L. Ladicky, and M. 
Pollefeys, \u201cDirection matters: Depth estimation with a surface normal classifier,\u201d in CVPR, 2015.\n[23] B. Liu, S. Gould, and D. Koller, \u201cSingle image depth estimation from predicted semantic labels,\u201d in CVPR, 2010.\n[24] E. Shelhamer, J. Barron, and T. Darrell, \u201cScene intrinsics and depth from a single image,\u201d in ICCV Workshops, 2015.\n[25] J. Shi, X. Tao, L. Xu, and J. Jia, \u201cBreak ames room illusion: depth from general single images,\u201d TOG, 2015.\n[26] W. Zhuo, M. Salzmann, X. He, and M. Liu, \u201cIndoor scene structure analysis for single image depth estimation,\u201d in CVPR, 2015.\n[27] Z. Zhang, A. G. Schwing, S. Fidler, and R. Urtasun, \u201cMonocular object instance segmentation and depth ordering with cnns,\u201d in ICCV, 2015.\n[28] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille, \u201cTowards unified depth and semantic prediction from a single image,\u201d in CVPR, 2015.\n[29] T. Zhou, P. Krahenbuhl, and A. A. Efros, \u201cLearning data-driven reflectance priors for intrinsic image decomposition,\u201d in ICCV, 2015.\n[30] T. Narihira, M. Maire, and S. X. Yu, \u201cLearning lightness from human judgement on relative reflectance,\u201d in CVPR, IEEE, 2015.\n[31] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, \u201cOverfeat: Integrated recognition, localization and detection using convolutional networks,\u201d arXiv preprint arXiv:1312.6229, 2013.\n[32] D. Parikh and K. Grauman, \u201cRelative attributes,\u201d in ICCV, IEEE, 2011.\n[33] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li, \u201cLearning to rank: from pairwise approach to listwise approach,\u201d in ICML, ACM, 2007.\n[34] T. Joachims, \u201cOptimizing search engines using clickthrough data,\u201d in Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2002.\n[35] D. Eigen, C.
Puhrsch, and R. Fergus, \u201cDepth map prediction from a single image using a multi-scale deep network,\u201d in NIPS, 2014.\n[36] J. Long, E. Shelhamer, and T. Darrell, \u201cFully convolutional networks for semantic segmentation,\u201d in CVPR, 2015.\n[37] S. Xie and Z. Tu, \u201cHolistically-nested edge detection,\u201d CoRR, vol. abs/1504.06375, 2015.\n[38] A. Newell, K. Yang, and J. Deng, \u201cStacked hourglass networks for human pose estimation,\u201d arXiv preprint arXiv:1603.06937, 2016.\n[39] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, \u201cGoing deeper with convolutions,\u201d in CVPR, 2015.\n[40] M. H. Baig, V. Jagadeesh, R. Piramuthu, A. Bhardwaj, W. Di, and N. Sundaresan, \u201cIm2depth: Scalable exemplar based depth transfer,\u201d in WACV, IEEE, 2014.\n[41] K. Simonyan and A. Zisserman, \u201cVery deep convolutional networks for large-scale image recognition,\u201d arXiv preprint arXiv:1409.1556, 2014.\n", "award": [], "sourceid": 413, "authors": [{"given_name": "Weifeng", "family_name": "Chen", "institution": "University of Michigan"}, {"given_name": "Zhao", "family_name": "Fu", "institution": "University of Michigan"}, {"given_name": "Dawei", "family_name": "Yang", "institution": "University of Michigan"}, {"given_name": "Jia", "family_name": "Deng", "institution": "University of Michigan"}]}