{"title": "Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video", "book": "Advances in Neural Information Processing Systems", "page_first": 35, "page_last": 45, "abstract": "Recent work has shown that CNN-based depth and ego-motion estimators can be learned using unlabelled monocular videos. However, the performance is limited by unidentified moving objects that violate the underlying static scene assumption in geometric image reconstruction. More significantly, due to lack of proper constraints, networks output scale-inconsistent results over different samples, i.e., the ego-motion network cannot provide full camera trajectories over a long video sequence because of the per-frame scale ambiguity. This paper tackles these challenges by proposing a geometry consistency loss for scale-consistent predictions and an induced self-discovered mask for handling moving objects and occlusions. Since we do not leverage multi-task learning like recent works, our framework is much simpler and more efficient. Comprehensive evaluation results demonstrate that our depth estimator achieves the state-of-the-art performance on the KITTI dataset. Moreover, we show that our ego-motion network is able to predict a globally scale-consistent camera trajectory for long video sequences, and the resulting visual odometry accuracy is competitive with the recent model that is trained using stereo videos. 
To the best of our knowledge, this is the first work to show that deep networks trained using unlabelled monocular videos can predict globally scale-consistent camera trajectories over a long video sequence.", "full_text": "Unsupervised Scale-consistent Depth and Ego-motion\n\nLearning from Monocular Video\n\nJia-Wang Bian1,2, Zhichao Li3, Naiyan Wang3, Huangying Zhan1,2\n\nChunhua Shen1,2, Ming-Ming Cheng4, Ian Reid1,2\n\n1University of Adelaide, Australia\n\n2Australian Centre for Robotic Vision, Australia\n\n3TuSimple, China\n\n4Nankai University, China\n\nAbstract\n\nRecent work has shown that CNN-based depth and ego-motion estimators can\nbe learned using unlabelled monocular videos. However, the performance is\nlimited by unidenti\ufb01ed moving objects that violate the underlying static scene\nassumption in geometric image reconstruction. More signi\ufb01cantly, due to lack\nof proper constraints, networks output scale-inconsistent results over different\nsamples, i.e., the ego-motion network cannot provide full camera trajectories over\na long video sequence because of the per-frame scale ambiguity. This paper tackles\nthese challenges by proposing a geometry consistency loss for scale-consistent\npredictions and an induced self-discovered mask for handling moving objects and\nocclusions. Since we do not leverage multi-task learning like recent works, our\nframework is much simpler and more ef\ufb01cient. Comprehensive evaluation results\ndemonstrate that our depth estimator achieves the state-of-the-art performance on\nthe KITTI dataset. Moreover, we show that our ego-motion network is able to\npredict a globally scale-consistent camera trajectory for long video sequences, and\nthe resulting visual odometry accuracy is competitive with the recent model that is\ntrained using stereo videos. 
To the best of our knowledge, this is the first work to show that deep networks trained using unlabelled monocular videos can predict globally scale-consistent camera trajectories over a long video sequence.\n\n1 Introduction\n\nDepth and ego-motion estimation is crucial for various applications in robotics and computer vision. Traditional methods are usually hand-crafted stage-wise systems, which rely on correspondence search [1, 2] and multi-view geometry [3, 4] for estimation. Recently, deep learning based methods [5, 6] show that depth can be inferred from a single image by using a Convolutional Neural Network (CNN). In particular, unsupervised methods [7–11] show that CNN-based depth and ego-motion networks can be trained solely on monocular video sequences without using ground-truth depth or stereo image pairs (pose supervision). The principle is that one can warp the image in one frame to another frame using the predicted depth and ego-motion, and then employ the image reconstruction loss as the supervision signal [7] to train the network. However, a performance limitation arises due to moving objects that violate the underlying static scene assumption in geometric image reconstruction. More significantly, due to the lack of proper constraints, the network predicts scale-inconsistent results over different samples, i.e., the ego-motion network cannot provide a full camera trajectory over a long video sequence because of the per-frame scale ambiguity¹.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nTo the best of our knowledge, no previous work (unsupervised learning from monocular videos) addresses the scale-inconsistency issue mentioned above. To this end, we propose a geometry consistency loss for tackling the challenge.\n
Specifically, for any two consecutive frames sampled from a video, we convert the predicted depth map of one frame to 3D space, then project it to the other frame using the estimated ego-motion, and finally minimize the inconsistency between the projected and the estimated depth maps. This explicitly constrains the depth network to predict geometry-consistent (and hence scale-consistent) results over consecutive frames. With iterative sampling and training from videos, depth predictions on each consecutive image pair become scale-consistent, and the frame-to-frame consistency can eventually propagate to the entire video sequence. As the scale of the ego-motion is tightly linked to the scale of the depths, the proposed ego-motion network can predict scale-consistent relative camera poses over consecutive snippets. We show that simply accumulating pose predictions can produce globally scale-consistent camera trajectories over a long video sequence (Fig. 3).\nRegarding the challenge of moving objects, recent work addresses it by introducing an additional optical flow [9–11, 13] or semantic segmentation network [14]. Although this improves performance significantly, it also brings a huge computational cost during training. Here we show that we can automatically discover a mask from the proposed geometry consistency term to solve the problem without introducing new networks. Specifically, we can easily locate pixels that belong to dynamic objects, occluded regions, or difficult regions (e.g., textureless regions) using the proposed term. By assigning lower weights to those pixels, we avoid their impact on the fragile image reconstruction loss (see Fig. 2 for mask visualization).\n
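For illustration, the mask construction described above can be sketched in a few lines of NumPy, following the definitions given later in Sec. 3.3 and 3.4. This is a minimal sketch rather than the authors' implementation: the two aligned depth maps are assumed to be given, and the small constant `eps` is an added numerical guard that does not appear in the equations.

```python
import numpy as np

def inconsistency_and_mask(D_proj, D_interp, eps=1e-7):
    """Sketch of the self-discovered mask.

    D_proj:   depth of frame b obtained by projecting frame a's depth
              with the predicted relative pose (D_b^a in the paper).
    D_interp: frame b's predicted depth, bilinearly interpolated at the
              projected locations (D'_b in the paper).
    Returns the inconsistency map D_diff in [0, 1] (difference normalized
    by sum) and the weight mask M = 1 - D_diff used to re-weight the
    photometric loss."""
    d_diff = np.abs(D_proj - D_interp) / (D_proj + D_interp + eps)
    mask = 1.0 - d_diff
    return d_diff, mask
```

Pixels on moving objects, occlusions, or poorly predicted regions produce a large inconsistency and thus a small weight, so they contribute little to the fragile image reconstruction loss.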
Compared with these recent approaches [9–11] that leverage multi-task learning, the proposed method is much simpler and more efficient.\nWe conduct detailed ablation studies that clearly demonstrate the efficacy of the proposed approach. Furthermore, comprehensive evaluation results on the KITTI [15] dataset show that our depth network outperforms state-of-the-art models that are trained in more complicated multi-task learning frameworks [9–11, 16]. Meanwhile, our ego-motion network is able to predict scale-consistent camera trajectories over long video sequences, and the trajectory accuracy is competitive with the state-of-the-art model that is trained using stereo videos [17].\nTo summarize, our main contributions are three-fold:\n\n• We propose a geometry consistency constraint to enforce the scale-consistency of the depth and ego-motion networks, leading to a globally scale-consistent ego-motion estimator.\n• We propose a self-discovered mask for dynamic scenes and occlusions, induced by the aforementioned geometry consistency constraint. Compared with other approaches, our proposed approach does not require additional optical flow or semantic segmentation networks, which makes the learning framework simpler and more efficient.\n• The proposed depth estimator achieves state-of-the-art performance on the KITTI dataset, and the proposed ego-motion predictor shows competitive visual odometry results compared with the state-of-the-art model that is trained using stereo videos.\n\n2 Related work\n\nTraditional methods rely on the disparity between multiple views of a scene to recover the 3D scene geometry, where at least two images are required [3]. With the rapid development of deep learning, Eigen et al. [5] show that depth can be predicted from a single image using a Convolutional Neural Network (CNN).\n
Speci\ufb01cally, they design a coarse-to-\ufb01ne network to predict the single-view depth\nand use the ground truth depths acquired by range sensors as the supervision signal to train the\nnetwork. However, although these supervised methods [5, 6, 18\u201321] show high-quality \ufb02ow and\ndepth estimation results, it is expensive to acquire ground truth in real-world scenes.\n\n1Monocular systems such as ORB-SLAM [12] suffer from the scale ambiguity issue, but their predictions\nare globally scale-consistent. However, recently learned models using monocular videos not only suffer from the\nscale ambiguity, but also predict scale-inconsistent results over different snippets.\n\n2\n\n\fWithout requiring the ground truth depth, Garg et al. [22] show that a single-view depth network can\nbe trained using stereo image pairs. Instead of using depth supervision, they leverage the established\nepipolar geometry [3]. The color inconsistency between a left image and a synthesized left image\nwarped from the right image is used as the supervision signal. Following this idea, Godard et al. [23]\npropose to constrain the left-right consistency for regularization, and Zhan et al. [17] extend the\nmethod to stereo videos. However, though stereo pairs based methods do not require the ground truth\ndepth, accurately rectifying stereo cameras is also non-trivial in real-world scenarios.\nTo that end, Zhou et al. [7] propose a fully unsupervised framework, in which the depth network\ncan be learned solely from monocular videos. The principle is that they introduce an additional\nego-motion network to predict the relative camera pose between consecutive frames. With the\nestimated depth and relative pose, image reconstruction as in [22] is applied and the photometric loss\nis used as the supervision signal. However, the performance is limited due to dynamic objects that\nviolate the underlying static scene assumption in geometric image reconstruction. 
More importantly, Zhou et al. [7]'s method suffers from the per-frame scale ambiguity, in that a single and consistent scaling of the camera translations is missing and only the direction is known. As a result, the ego-motion network cannot predict a full camera trajectory over a long video sequence.\nFor handling moving objects, recent work [9, 10] proposes to introduce an additional optical flow network. Even more recently, [11] introduces an extra motion segmentation network. Although they show significant performance improvement, a huge additional computational cost is added to the basic framework, and they still suffer from the scale-inconsistency issue. Besides, Liu et al. [24] use a depth projection loss for dense supervision, similar to the proposed consistency loss, but their method relies on a pre-computed 3D reconstruction for supervision.\nTo the best of our knowledge, this paper is the first one to show that an ego-motion network trained on monocular videos can predict a globally scale-consistent camera trajectory over a long video sequence. This shows significant potential for leveraging deep learning methods in Visual SLAM [12] for robotics and autonomous driving.\n\n3 Unsupervised Learning of Scale-consistent Depth and Ego-motion\n\n3.1 Method Overview\n\nOur goal is to train depth and ego-motion networks using monocular videos, and constrain them to predict scale-consistent results. Given two consecutive frames (I_a, I_b) sampled from an unlabeled video, we first estimate their depth maps (D_a, D_b) using the depth network, and then predict the relative 6D camera pose P_ab between them using the pose network.\nWith the predicted depth and relative camera pose, we can synthesize the reference image I′_a by interpolating the source image I_b [25, 7]. Then, the network can be supervised by the photometric loss between the real image I_a and the synthesized one I′_a.\n
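The warping step just described can be sketched as follows. This is an illustrative NumPy version, not the authors' code: it uses nearest-neighbour sampling instead of the differentiable bilinear interpolation [25] used in the actual method, and the intrinsics K and the 4 × 4 relative pose T_ab (here taken as mapping reference-frame points into the source frame; the direction is a convention choice) are assumed given.

```python
import numpy as np

def synthesize_reference(I_b, D_a, K, T_ab):
    """Warp the source image I_b into the reference view to get I'_a (sketch).

    I_b:  (H, W) source image, D_a: (H, W) reference depth map,
    K:    (3, 3) camera intrinsics,
    T_ab: (4, 4) pose taking reference-frame 3D points into the source frame.
    Returns the synthesized image and a validity mask (the set V of pixels
    that project inside the source image with positive depth)."""
    H, W = D_a.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])   # (3, N) homog. pixels
    # back-project reference pixels to 3D using the predicted depth
    pts = np.linalg.inv(K) @ pix * D_a.ravel()                 # (3, N) points in frame a
    pts_h = np.vstack([pts, np.ones(H * W)])                   # homogeneous coords
    pts_b = (T_ab @ pts_h)[:3]                                 # points in frame b
    # project into the source image plane
    proj = K @ pts_b
    u = proj[0] / proj[2]
    v = proj[1] / proj[2]
    un = np.round(u).astype(int)                               # nearest-neighbour sampling
    vn = np.round(v).astype(int)
    valid = (un >= 0) & (un < W) & (vn >= 0) & (vn < H) & (proj[2] > 0)
    I_syn = np.zeros(H * W)
    I_syn[valid] = I_b[vn[valid], un[valid]]
    return I_syn.reshape(H, W), valid.reshape(H, W)
```

With an identity pose and identity intrinsics, each pixel projects back onto itself and the synthesis reproduces the source image exactly; in training, the loss is computed only on the valid set V.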
However, due to dynamic scenes that violate the geometric assumption in image reconstruction, the performance of this basic framework is limited. To this end, we propose a geometry consistency loss L_GC for scale-consistency and a self-discovered mask M for handling moving objects and occlusions. Fig. 1 shows an illustration of the proposed loss and mask.\nOur overall objective function can be formulated as follows:\n\nL = α L_p^M + β L_s + γ L_GC,   (1)\n\nwhere L_p^M stands for the photometric loss L_p weighted by the proposed mask M, and L_s stands for the smoothness loss. We train the network in both the forward and backward directions to maximize the data usage, and for simplicity we only derive the loss for the forward direction.\nIn the following sections, we first introduce the widely used photometric loss and smoothness loss in Sec. 3.2, and then describe the proposed geometry consistency loss in Sec. 3.3 and the self-discovered mask in Sec. 3.4.\n\nFigure 1: Illustration of the proposed geometry consistency loss and self-discovered mask. Given two consecutive frames (I_a, I_b), we first estimate their depth maps (D_a, D_b) and relative pose P_ab using the networks; then we obtain the warped depth map D_b^a by converting D_a to 3D space and projecting it to the image plane of I_b using P_ab; and finally we use the inconsistency between D_b^a and the depth map D′_b interpolated from D_b as the geometry consistency loss L_GC (Eqn. 6) to supervise the network training. Here, we interpolate D_b because the projection flow does not lie on the pixel grid of I_b. Besides, we discover a mask M (Eqn. 7) from the inconsistency map for handling dynamic scenes and ill-estimated regions (Fig. 2). For clarity, the photometric loss and smoothness loss are not shown in this figure.\n\n3.2 Photometric loss and smoothness loss\n\nPhotometric loss. Leveraging the brightness constancy and spatial smoothness priors used in classical dense correspondence algorithms [26], previous works [7, 9–11] have used the photometric error between the warped frame and the reference frame as an unsupervised loss function for training the network.\nWith the predicted depth map D_a and the relative camera pose P_ab, we synthesize I′_a by warping I_b, where differentiable bilinear interpolation [25] is used as in [7]. With the synthesized I′_a and the reference image I_a, we formulate the objective function as\n\nL_p = (1/|V|) Σ_{p∈V} ‖I_a(p) − I′_a(p)‖_1,   (2)\n\nwhere V stands for the valid points that are successfully projected from I_a to the image plane of I_b, and |V| denotes the number of points in V. We choose the L1 loss due to its robustness to outliers. However, it is still not invariant to illumination changes in real-world scenarios. Here we add an additional image dissimilarity loss, SSIM [27], for better handling complex illumination changes, since it normalizes the pixel illumination. We modify the photometric loss term in Eqn. 2 as:\n\nL_p = (1/|V|) Σ_{p∈V} (λ_i ‖I_a(p) − I′_a(p)‖_1 + λ_s (1 − SSIM_aa′(p))/2),   (3)\n\nwhere SSIM_aa′ stands for the element-wise similarity between I_a and I′_a computed by the SSIM function [27]. Following [23, 9, 11], we use λ_i = 0.15 and λ_s = 0.85 in our framework.\n\nSmoothness loss. As the photometric loss is not informative in low-texture or homogeneous regions of the scene, existing work incorporates a smoothness prior to regularize the estimated depth map. We adopt the edge-aware smoothness loss used in [11], which is formulated as:\n\nL_s = Σ_p (e^{−∇I_a(p)} · ∇D_a(p))^2,   (4)\n\nwhere ∇ is the first derivative along spatial directions.\n
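The photometric and smoothness terms of this section can be sketched in NumPy as follows. This is an illustrative re-implementation, not the authors' code; the per-pixel SSIM map is assumed to be computed elsewhere (e.g., by a standard SSIM routine), and the reductions follow Eqns. 3 and 4.

```python
import numpy as np

def photometric_loss(I_a, I_a_warp, ssim_map, valid, lam_i=0.15, lam_s=0.85):
    """Weighted L1 + SSIM photometric loss (Eqn. 3), averaged over valid pixels.

    I_a, I_a_warp: (H, W) or (H, W, 3) float images; ssim_map: per-pixel SSIM
    similarity; valid: (H, W) boolean mask of successfully projected pixels
    (the set V)."""
    l1 = np.abs(I_a - I_a_warp)
    if l1.ndim == 3:                        # average over colour channels
        l1 = l1.mean(axis=-1)
    per_pixel = lam_i * l1 + lam_s * (1.0 - ssim_map) / 2.0
    return per_pixel[valid].mean()

def smoothness_loss(I_a, D_a):
    """Edge-aware smoothness loss (Eqn. 4): depth gradients are penalized
    except where the image itself has strong gradients (edges)."""
    if I_a.ndim == 3:                       # use a grayscale image gradient
        I_a = I_a.mean(axis=-1)
    dI_dx = np.abs(np.diff(I_a, axis=1))    # forward differences along x and y
    dI_dy = np.abs(np.diff(I_a, axis=0))
    dD_dx = np.abs(np.diff(D_a, axis=1))
    dD_dy = np.abs(np.diff(D_a, axis=0))
    return ((np.exp(-dI_dx) * dD_dx) ** 2).sum() + \
           ((np.exp(-dI_dy) * dD_dy) ** 2).sum()
```

A perfect reconstruction (identical images, SSIM of 1) gives zero photometric loss, and a constant depth map gives zero smoothness loss, as expected from the equations.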
It ensures that the smoothness of the depth map is guided by the edges of the image.\n\n3.3 Geometry consistency loss\n\nAs mentioned before, we enforce geometry consistency on the predicted results. Specifically, we require that D_a and D_b (related by P_ab) conform to the same 3D scene structure, and we minimize their differences. The optimization not only encourages geometry consistency between samples in a batch but also transfers the consistency to the entire sequence, e.g., depths of I_1 agree with depths of I_2 in one batch; depths of I_2 agree with depths of I_3 in another training batch. Eventually, depths of I_i of a sequence should all agree with each other. As the pose network is naturally coupled with the depth network during training, our method yields scale-consistent predictions over the entire sequence.\n\nFigure 2: Visual results. Top to bottom: sample image, estimated depth, self-discovered mask. The proposed mask can effectively identify occlusions and moving objects.\n\nWith this constraint, we compute the depth inconsistency map D_diff.\n
For each p ∈ V, it is defined as:\n\nD_diff(p) = |D_b^a(p) − D′_b(p)| / (D_b^a(p) + D′_b(p)),   (5)\n\nwhere D_b^a is the computed depth map of I_b obtained by warping D_a using P_ab, and D′_b is the depth map interpolated from the estimated depth map D_b (note that we cannot directly use D_b because the warping flow does not lie on the pixel grid). Here we normalize their difference by their sum. This is more intuitive than the absolute distance, as it treats points at different absolute depths equally in optimization. Besides, the function is symmetric, and the outputs naturally range from 0 to 1, which contributes to numerical stability in training.\nWith the inconsistency map, we simply define the proposed geometry consistency loss as:\n\nL_GC = (1/|V|) Σ_{p∈V} D_diff(p),   (6)\n\nwhich minimizes the geometric distance of predicted depths between each consecutive pair and enforces their scale-consistency. With training, the consistency can propagate to the entire video sequence. Due to the tight link between ego-motion and depth predictions, the ego-motion network can eventually predict globally scale-consistent trajectories (Fig. 3).\n\n3.4 Self-discovered mask\n\nTo handle moving objects and occlusions that may impair the network training, recent work proposes to introduce an additional optical flow [9–11] or semantic segmentation network [14]. This is effective; however, it also introduces extra computational cost and training burden. Here, we show that these regions can be effectively located by the proposed inconsistency map D_diff in Eqn. 
5.\nThere are several scenarios that result in inconsistent scene structure being observed from different views, including (1) dynamic objects, (2) occlusions, and (3) inaccurate predictions for difficult regions. Without separating them explicitly, we observe that each of these causes D_diff to increase from its ideal value of zero.\nBased on this simple observation, and as D_diff lies in [0, 1], we propose a weight mask M:\n\nM = 1 − D_diff,   (7)\n\nwhich assigns low/high weights to inconsistent/consistent pixels. It can be used to re-weight the photometric loss. Specifically, we modify the photometric loss in Eqn. 3 as\n\nL_p^M = (1/|V|) Σ_{p∈V} M(p) · L_p(p),   (8)\n\nwhere L_p(p) denotes the per-pixel photometric loss. By using the mask, we mitigate the adverse impact from moving objects and occlusions. Further, the gradients computed on inaccurately predicted regions carry less weight during back-propagation. Fig. 2 shows visual results for the proposed mask, which coincide with our anticipation stated above.\n\n4 Experiment\n\n4.1 Implementation details\n\nNetwork architecture. For the depth network, we experiment with DispNet [7] and DispResNet [11], which take a single RGB image as input and output a depth map. For the ego-motion network, PoseNet without the mask prediction branch [7] is used. The network estimates a 6D relative camera pose from a concatenated RGB image pair. Instead of computing the loss on multi-scale outputs of the depth network (4 scales in [7] or 6 scales in [11]), we empirically find that using single-scale supervision (i.e., only computing the loss on the finest output) is better (Tab. 4). Our single-scale supervision not only improves the performance but also contributes to a more concise training pipeline. We hypothesize that the reason for this phenomenon is that the photometric loss is not accurate in low-resolution images, where the pixel color is over-smoothed.\n\nSingle-view depth estimation.\n
The proposed learning framework is implemented using the PyTorch library [28]. For the depth network, we train and test models on the KITTI raw dataset [15] using Eigen et al. [5]'s split, the same as in related works [10, 9, 11, 7]. Following [7], we use a snippet of three sequential video frames as a training sample, where we set the second image as the reference frame to compute the loss with the other two images, and then reverse their roles to compute the loss again to maximize data usage. The data is also augmented with random scaling, cropping and horizontal flips during training, and we experiment with two input resolutions (416 × 128 and 832 × 256). We use the ADAM [29] optimizer, and set the batch size to 4 and the learning rate to 10^-4. During training, we adopt α = 1.0, β = 0.1, and γ = 0.5 in Eqn. 1. We train the network for 200 epochs with 1000 randomly sampled batches per epoch, and validate the model after every epoch. Also, we pre-train the network on CityScapes [30] and finetune on KITTI [15], each for 200 epochs. Here we follow Eigen et al. [5]'s evaluation metrics for depth evaluation.\n\nVisual odometry prediction. For the pose network, following Zhan et al. [17], we evaluate visual odometry results on the KITTI odometry dataset [15], where sequences 00-08/09-10 are used for training/testing. We use the standard evaluation metrics provided by the dataset for trajectory evaluation rather than Zhou et al. [7]'s 5-frame pose evaluation, since the former are more widely used and more meaningful.\n\n4.2 Comparisons with the state-of-the-art\n\nDepth results on KITTI raw dataset. Tab. 1 shows the results on the KITTI raw dataset [15], where our method achieves state-of-the-art performance when compared with models trained on monocular video sequences. Note that recent works [9–11, 31] all jointly learn multiple tasks, while our approach does not. This effectively reduces the training and inference overhead.\n
Moreover, our\nmethod competes quite favorably with other methods using stronger supervision signals such as\ncalibrated stereo image pairs (i.e., pose supervision) or even ground-truth depth annotation.\n\nVisual odometry results on KITTI odometry dataset. We compare with SfMLearner [7] and\nthe methods trained with stereo videos [17]. We also report the results of ORB-SLAM [12] system\n(without loop closing) as a reference, though emphasize that this results in a comparison note between\na simple frame-to-frame pose estimation framework with a Visual SLAM system, in which the\nlatter has a strong back-end optimization system (i.e., bundle adjustment [32]) for improving the\nperformance. Here, we ignore the frames (First 9 and 30 respectively) from the sequences (09 and\n10) for which ORB-SLAM [12] fails to output camera poses because of unsuccessful initialization.\n\n6\n\n\fTable 1: Single-view depth estimation results on test split of KITTI raw dataset [15]. The methods\ntrained on KITTI raw dataset [15] are denoted by K. Models with pre-training on CityScapes [30]\nare denoted by CS+K. (D) denotes depth supervision, (B) denotes binocular/stereo input pairs, (M)\ndenotes monocular video clips. (J) denotes joint learning of multiple tasks. The best performance in\neach block is highlighted as bold.\n\nK (B)\n\nK (B+D)\n\nCS+K (B)\n\nDataset\nK (D)\nK (D)\nK (B)\n\nMethods\nEigen et al. [5]\nLiu et al. [6]\nGarg et al. [22]\nKuznietsov et al. [18]\nGodard et al. [23]\nGodard et al. [23]\nZhan et al. [17]\nZhou et al. [7]\nYang et al. [31] (J)\nMahjourian et al. [8]\nWang et al. [16]\nGeonet-VGG [9] (J)\nGeonet-Resnet [9] (J)\nDF-Net [10] (J)\nCC [11] (J)\nOurs\nCS+K (M)\nZhou et al. [7]\nCS+K (M)\nYang et al. [31] (J)\nCS+K (M)\nMahjourian et al. [8]\nWang et al. 
[16]\nCS+K (M)\nGeonet-Resnet [9] (J) CS+K (M)\nCS+K (M)\nDF-Net [10] (J)\nCS+K (M)\nCC [11] (J)\nOurs\nCS+K (M)\n\nK (B)\nK (M)\nK (M)\nK (M)\nK (M)\nK (M)\nK (M)\nK (M)\nK (M)\nK (M)\n\nAbsRel\n0.203\n0.202\n0.152\n0.113\n0.148\n0.124\n0.144\n0.208\n0.182\n0.163\n0.151\n0.164\n0.155\n0.150\n0.140\n0.137\n0.198\n0.165\n0.159\n0.148\n0.153\n0.146\n0.139\n0.128\n\nError \u2193\n\nAccuracy \u2191\n\nSqRel RMS RMSlog < 1.25 < 1.252 < 1.253\n0.958\n1.548\n0.965\n1.614\n0.967\n1.226\n0.741\n0.986\n0.964\n1.344\n0.973\n1.076\n0.969\n1.391\n0.957\n1.768\n0.963\n1.481\n1.240\n0.968\n0.974\n1.257\n0.968\n1.303\n0.973\n1.296\n0.973\n1.124\n1.070\n0.975\n0.975\n1.089\n0.960\n1.836\n0.969\n1.360\n0.970\n1.231\n0.975\n1.187\n1.328\n0.972\n0.978\n1.182\n1.032\n0.977\n1.047\n0.976\n\n6.307\n6.523\n5.849\n4.621\n5.927\n5.311\n5.869\n6.856\n6.501\n6.220\n5.583\n6.090\n5.857\n5.507\n5.326\n5.439\n6.565\n6.641\n5.912\n5.496\n5.737\n5.215\n5.199\n5.234\n\n0.282\n0.275\n0.246\n0.189\n0.247\n0.219\n0.241\n0.283\n0.267\n0.250\n0.228\n0.247\n0.233\n0.223\n0.217\n0.217\n0.275\n0.248\n0.243\n0.226\n0.232\n0.213\n0.213\n0.208\n\n0.702\n0.678\n0.784\n0.862\n0.803\n0.847\n0.803\n0.678\n0.725\n0.762\n0.810\n0.765\n0.793\n0.806\n0.826\n0.830\n0.718\n0.750\n0.784\n0.812\n0.802\n0.818\n0.827\n0.846\n\n0.890\n0.895\n0.921\n0.960\n0.922\n0.942\n0.928\n0.885\n0.906\n0.916\n0.936\n0.919\n0.931\n0.933\n0.941\n0.942\n0.901\n0.914\n0.923\n0.938\n0.934\n0.943\n0.943\n0.947\n\nTable 2: Visual odometry results on KITTI odometry dataset [15]. We report the performance of\nORB-SLAM [12] as a reference and compare with recent deep methods. K denotes the model trained\non KITTI, and CS+K denotes the model with pre-training on Cityscapes [30].\n\nMethods\n\nORB-SLAM [12]\nZhou et al. [7]\nZhan et al. [17]\nOurs (K)\nOurs (CS+K)\n\nterr (%)\n15.30\n17.84\n11.93\n11.2\n8.24\n\nSeq. 09\nrerr (\u25e6/100m)\n\n0.26\n6.78\n3.91\n3.35\n2.19\n\nterr (%)\n\n3.68\n37.91\n12.45\n10.1\n10.7\n\nSeq. 
10\nrerr (◦/100m)\n\n0.48\n17.78\n3.46\n4.96\n4.58\n\n(a) sequence 09\n\n(b) sequence 10\n\nFigure 3: Qualitative results on the testing sequences of the KITTI odometry dataset [15].\n\nTab. 2 shows the average translation and rotation errors for the testing sequences 09 and 10, and Fig. 3 shows qualitative results. Note that the comparison is highly disadvantageous to the proposed method: i) we align the per-frame scale to the ground-truth scale for [7] due to its scale-inconsistency, while we align only one global scale for our method; ii) [17] requires stereo videos for training, while we use only monocular videos. Although the setting is unfair to the proposed method, the results show that our method achieves results competitive with [17]. Even when compared with the ORB-SLAM [12] system, our method shows a lower translational error and a better visual result on sequence 09. It is remarkable that deep models trained on unlabelled monocular videos can predict a globally scale-consistent visual odometry.\n\n4.3 Ablation study\n\nIn this section, we first validate the efficacy of the proposed geometry consistency loss L_GC and the self-discovered weight mask M. Then we experiment with different scale numbers, network architectures, and image resolutions.\n\nValidating the proposed L_GC and M. We conduct ablation studies using DispNet [7] and images of 416 × 128 resolution. Tab. 3 shows the depth results for both single-scale and multi-scale supervision. The results clearly demonstrate the contribution of our proposed terms to the overall performance. Besides, Fig. 4 shows the validation error during training, which indicates that the proposed L_GC can effectively prevent the model from overfitting.\n\nTable 3: Ablation studies on L_GC and M. 
Brackets show results of multi-scale (4) supervision. Error ↓: AbsRel, SqRel, RMS, RMSlog; Accuracy ↑: <1.25, <1.25^2, <1.25^3.\n\nMethods | AbsRel | SqRel | RMS | RMSlog | <1.25 | <1.25^2 | <1.25^3\nBasic | 0.161 (0.185) | 1.225 | 5.765 | 0.237 | 0.780 | 0.927 | 0.972\nBasic+SSIM | 0.160 (0.163) | 1.230 | 5.950 | 0.243 | 0.775 | 0.923 | 0.969\nBasic+SSIM+GC | 0.158 (0.161) | 1.247 | 5.827 | 0.235 | 0.786 | 0.927 | 0.971\nBasic+SSIM+GC+M | 0.151 (0.158) | 1.154 | 5.716 | 0.232 | 0.798 | 0.930 | 0.972\n\nFigure 4: Validation error. Both Basic and Basic+SSIM overfit after about 50 epochs, while the others do not, owing to the proposed L_GC. Besides, models trained with single-scale supervision outperform those with multi-scale (4) supervision.\n\nProposed single-scale vs. multi-scale supervision. As mentioned in Sec. 4.1, we empirically find that using single-scale supervision leads to better performance than the widely-used multi-scale solution. Tab. 4 shows the depth results. We hypothesize that the reason is that the photometric loss is not accurate in low-resolution images, where the pixel color is over-smoothed. Besides, as the displacement between two consecutive views is small, the multi-scale solution is unnecessary.\n\nNetwork architectures and image resolutions. Tab. 
5 shows the results of different network architectures at different image resolutions, where DispNet and DispResNet are both borrowed from CC [11], and DispNet is also used in SfMLearner [7]. It shows that higher-resolution images and deeper networks result in better performance.\n\nTable 4: Ablation studies on the number of supervision scales.\n\n#Scales | AbsRel | SqRel | RMS | RMSlog | <1.25 | <1.25^2 | <1.25^3\n1 | 0.151 | 1.154 | 5.716 | 0.232 | 0.798 | 0.930 | 0.972\n2 | 0.152 | 1.192 | 5.900 | 0.235 | 0.795 | 0.927 | 0.971\n3 | 0.159 | 1.226 | 5.987 | 0.240 | 0.780 | 0.921 | 0.969\n4 | 0.158 | 1.214 | 5.898 | 0.239 | 0.782 | 0.925 | 0.971\n\nTable 5: Ablation studies on different network architectures and image resolutions.\n\nMethods | Resolutions | AbsRel | SqRel | RMS | RMSlog | <1.25 | <1.25^2 | <1.25^3\nDispNet | 416 × 128 | 0.151 | 1.154 | 5.716 | 0.232 | 0.798 | 0.930 | 0.972\nDispResNet | 416 × 128 | 0.149 | 1.137 | 5.771 | 0.230 | 0.799 | 0.932 | 0.973\nDispNet | 832 × 256 | 0.146 | 1.197 | 5.578 | 0.223 | 0.814 | 0.940 | 0.975\nDispResNet | 832 × 256 | 0.137 | 1.089 | 5.439 | 0.217 | 0.830 | 0.942 | 0.975\n\n4.4 Timing and memory analysis\n\nTraining time and parameter numbers. We compare with CC [11]; both methods are trained on a single 16GB Tesla V100 GPU. We measure the time taken for each training iteration, consisting of a forward and a backward pass, with a batch size of 4. The image resolution is 832 × 256. CC [11] needs to train three parts, namely (Depth, Pose), Flow, and Mask, while our method only trains (Depth, Pose). In total, CC takes about 7 days for training as reported by the authors, while our method takes about 32 hours. Tab. 
6 shows the per-iteration time and model parameters of each network.

Table 6: Training time per iteration and model parameters for each network.

                   CC [11]                                       Ours
Network            (Depth, Pose)    Flow     Mask               (Depth, Pose)
Time               0.96s            1.32s    0.48s              0.55s
Parameter Numbers  (80.88M, 2.18M)  39.28M   5.22M              (80.88M, 1.59M)

Inference time. We test models on a single RTX 2080 GPU. The batch size is 1, and the time is averaged over 100 iterations. Tab. 7 shows the results. The DispNet and DispResNet architectures are the same as those in SfMLearner [7] and CC [11], respectively, so their speeds are theoretically identical.

Table 7: Inference time per image or image pair.

            DispNet  DispResNet  PoseNet
128 × 416   4.9 ms   9.6 ms      0.6 ms
256 × 832   9.2 ms   15.5 ms     1.0 ms

5 Conclusion

This paper presents an unsupervised learning framework for scale-consistent depth and ego-motion estimation. The core of the proposed approach is a geometry consistency loss for scale-consistency and a self-discovered mask for handling dynamic scenes. With the proposed learning framework, our depth model achieves state-of-the-art performance on the KITTI [15] dataset, and our ego-motion network shows visual odometry results competitive with a model trained using stereo videos. To the best of our knowledge, this is the first work to show that deep models trained on unlabelled monocular videos can predict a globally scale-consistent camera trajectory over a long sequence. In future work, we will focus on improving the visual odometry accuracy by incorporating drift-correction solutions into the current framework.

Acknowledgments

The work was supported by the Australian Centre for Robotic Vision, the Major Project for New Generation of AI (No. 2018AAA0100400), and NSFC (No. 61922046). Jiawang would also like to thank TuSimple, where he started research in this field.

References

[1] David G Lowe.
Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2), 2004.

[2] Jia-Wang Bian, Wen-Yan Lin, Yun Liu, Le Zhang, Sai-Kit Yeung, Ming-Ming Cheng, and Ian Reid. GMS: Grid-based motion statistics for fast, ultra-robust feature correspondence. International Journal of Computer Vision (IJCV), 2019.

[3] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003.

[4] Jia-Wang Bian, Yu-Huan Wu, Ji Zhao, Yun Liu, Le Zhang, Ming-Ming Cheng, and Ian Reid. An evaluation of feature matchers for fundamental matrix estimation. In British Machine Vision Conference (BMVC), 2019.

[5] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Neural Information Processing Systems (NIPS), 2014.

[6] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 38(10), 2016.

[7] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[8] Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[9] Zhichao Yin and Jianping Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[10] Yuliang Zou, Zelun Luo, and Jia-Bin Huang. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In European Conference on Computer Vision (ECCV), 2018.

[11] Anurag Ranjan, Varun Jampani, Kihwan Kim, Deqing Sun, Jonas Wulff, and Michael J Black. Competitive Collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[12] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics (TRO), 31(5), 2015.

[13] Yang Wang, Zhenheng Yang, Peng Wang, Yi Yang, Chenxu Luo, and Wei Xu. Joint unsupervised learning of optical flow and depth by watching stereo videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[14] Jianbo Jiao, Ying Cao, Yibing Song, and Rynson Lau. Look deeper into depth: Monocular depth estimation with semantic booster and attention-driven loss. In European Conference on Computer Vision (ECCV), 2018.

[15] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.

[16] Chaoyang Wang, José Miguel Buenaposada, Rui Zhu, and Simon Lucey. Learning depth from monocular videos using direct methods. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[17] Huangying Zhan, Ravi Garg, Chamara Saroj Weerasekera, Kejie Li, Harsh Agarwal, and Ian Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[18] Yevhen Kuznietsov, Jorg Stuckler, and Bastian Leibe. Semi-supervised deep learning for monocular depth map prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[19] Chengzhou Tang and Ping Tan. BA-Net: Dense bundle adjustment network. In International Conference on Learning Representations (ICLR), 2019.

[20] Zhichao Yin, Trevor Darrell, and Fisher Yu. Hierarchical discrete distribution decomposition for match density estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[21] Wei Yin, Yifan Liu, Chunhua Shen, and Youliang Yan. Enforcing geometric constraints of virtual normal for depth prediction. In IEEE International Conference on Computer Vision (ICCV), 2019.

[22] Ravi Garg, Vijay Kumar BG, Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision (ECCV). Springer, 2016.

[23] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[24] Xingtong Liu, Ayushi Sinha, Mathias Unberath, Masaru Ishii, Gregory D Hager, Russell H Taylor, and Austin Reiter. Self-supervised learning for dense depth estimation in monocular endoscopy. In OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis, pages 128–138. Springer, 2018.

[25] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Neural Information Processing Systems (NIPS), 2015.

[26] Simon Baker and Iain Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision (IJCV), 56(3), 2004.

[27] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), 13(4), 2004.

[28] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Workshops (NIPS-W), 2017.

[29] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[30] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[31] Zhenheng Yang, Peng Wang, Wei Xu, Liang Zhao, and Ramakant Nevatia. Unsupervised learning of geometry with edge-aware depth-normal consistency. In Association for the Advancement of Artificial Intelligence (AAAI), 2018.

[32] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment — a modern synthesis. In International Workshop on Vision Algorithms. Springer, 1999.
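As an illustrative aside to the geometry consistency loss and self-discovered mask discussed above, a minimal NumPy sketch of the core computation follows. The names `d_proj` and `d_interp` are our own placeholders for the reference-frame depth projected into the source view and the source-frame depth interpolated at the projected points; the view warping and interpolation themselves are omitted, and the arrays are assumed to be already aligned per pixel.

```python
import numpy as np

def geometry_consistency(d_proj, d_interp, eps=1e-7):
    """Per-pixel depth inconsistency between two aligned positive depth maps.

    Returns (loss, mask): the scalar geometry consistency loss (the mean
    normalised depth difference, bounded in [0, 1)) and the induced
    self-discovered weight mask W = 1 - diff, which down-weights pixels
    that are inconsistent, e.g. on moving objects or at occlusions.
    """
    diff = np.abs(d_proj - d_interp) / (d_proj + d_interp + eps)
    return diff.mean(), 1.0 - diff

# Toy example: identical depth maps are perfectly consistent.
d = np.full((4, 4), 10.0)
loss, mask = geometry_consistency(d, d)
print(loss)  # -> 0.0, and mask is ~1 everywhere
```

Because the difference is normalised by the sum of the two depths, the loss is symmetric in its arguments and insensitive to a global depth scale, which is what allows it to enforce scale consistency across samples.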