{"title": "Deep RGB-D Canonical Correlation Analysis For Sparse Depth Completion", "book": "Advances in Neural Information Processing Systems", "page_first": 5331, "page_last": 5341, "abstract": "In this paper, we propose our Correlation For Completion Network (CFCNet), an end-to-end deep learning model that uses the correlation between two data sources to perform sparse depth completion. CFCNet learns to capture, to the largest extent, the semantically correlated features between RGB and depth information. Through pairs of image pixels and the visible measurements in a sparse depth map, CFCNet facilitates feature-level mutual transformation of different data sources. Such a transformation enables CFCNet to predict features and reconstruct data of missing depth measurements according to their corresponding, transformed RGB features. We extend canonical correlation analysis to a 2D domain and formulate it as one of our training objectives (i.e. 2d deep canonical correlation, or \u201c2D^2CCA loss\"). Extensive experiments validate the ability and flexibility of our CFCNet compared to the state-of-the-art methods on both indoor and outdoor scenes with different real-life sparse patterns. Codes are available at: https://github.com/choyingw/CFCNet.", "full_text": "Deep RGB-D Canonical Correlation Analysis For\n\nSparse Depth Completion\n\nUniversity of Southern California\n\nUniversity of Southern California\n\nCho-Ying Wu\u2217\n\nLos Angeles, California\nchoyingw@usc.edu\n\nYiqi Zhong\u2217\n\nLos Angeles, California\nyiqizhon@usc.edu\n\nSuya You\n\nUS Army Research Laboratory\n\nPlaya Vista, California\n\nsuya.you.civ@mail.mil\n\nUlrich Neumann\n\nUniversity of Southern California\n\nLos Angeles, California\nuneumann@usc.edu\n\nAbstract\n\nIn this paper, we propose our Correlation For Completion Network (CFCNet), an\nend-to-end deep learning model that uses the correlation between two data sources\nto perform sparse depth completion. 
CFCNet learns to capture, to the largest extent, the semantically correlated features between RGB and depth information. Through pairs of image pixels and the visible measurements in a sparse depth map, CFCNet facilitates feature-level mutual transformation of different data sources. Such a transformation enables CFCNet to predict features and reconstruct data of missing depth measurements according to their corresponding, transformed RGB features. We extend canonical correlation analysis to a 2D domain and formulate it as one of our training objectives (i.e. 2D deep canonical correlation, or \u201c2D2CCA loss\u201d). Extensive experiments validate the ability and flexibility of our CFCNet compared to state-of-the-art methods on both indoor and outdoor scenes with different real-life sparse patterns. Code is available at: https://github.com/choyingw/CFCNet.\n\n1 Introduction\n\nDepth measurements are widely used in computer vision applications [1, 2, 3]. However, most existing depth-capture techniques produce depth maps with incomplete data. For example, structured-light cameras cannot capture depth measurements where surfaces are too shiny; Visual Simultaneous Localization and Mapping (VSLAM) systems cannot recover the depth of non-textured objects; LiDARs produce semi-dense depth maps due to their limited scanlines and scanning frequency. Recently, researchers have introduced the sparse depth completion task, which aims to fill in missing depth measurements using deep learning based methods [4, 5, 6, 7, 8, 9, 10]. 
Those studies produce dense depth maps by fusing features of sparse depth measurements and corresponding RGB images. However, they usually treat feature extraction of these two types of information as independent processes, which in reality turns the task they work on into \"multi-modality depth prediction\" rather than \"depth completion.\" While multi-modality depth prediction may produce dense outputs, it fails to fully utilize the observable data. The depth completion task is unique in that part of its output is already observable in the input. Revealing the relationship between data pairs (i.e. between observable depth measurements and the corresponding image pixels) may help complete depth maps by emphasizing information from the image domain at the locations where depth values are non-observable.\n\n\u2217Both authors contributed equally to this work.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Sample results of CFCNet on different sparse patterns. We show, in order, RGB images, sparse depth maps with different sparse patterns, and dense depth maps completed by CFCNet. For the stereo pattern and the ORB pattern, we show the depth groundtruth in the last column.\n\nTo accomplish the depth completion task from a novel perspective, we propose an end-to-end deep learning based framework, the Correlation For Completion Network (CFCNet). We view a completed dense depth map as composed of two parts: one is the sparse depth, which is observable and used as the input; the other is non-observable and recovered by the task. Likewise, the corresponding full RGB image of the depth map can be decomposed into two parts: one is the sparse RGB, which holds the RGB values at the observable locations in the sparse depth; the other is the complementary RGB, which is the subtraction of the sparse RGB from the full RGB image. See Figure 2 for examples. 
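The sparse/complementary decomposition above can be sketched in a few lines; this is a minimal numpy illustration under assumed conventions (zeros in the sparse depth map mark non-observable pixels; the function name and shapes are ours, not from the released code):

```python
import numpy as np

def decompose_rgb(rgb, sparse_depth):
    """Split a full RGB image into sparse RGB and complementary RGB.

    Assumed shapes: rgb is (H, W, 3); sparse_depth is (H, W), with zeros
    at non-observable locations, as in the paper's Figure 2.
    """
    mask = (sparse_depth > 0).astype(rgb.dtype)   # 0-1 sparse mask
    comp_mask = 1.0 - mask                        # complementary mask
    sparse_rgb = rgb * mask[..., None]            # RGB at observed depth points
    comp_rgb = rgb * comp_mask[..., None]         # full RGB minus sparse RGB
    return sparse_rgb, comp_rgb, mask, comp_mask
```

By construction, the sparse RGB and the complementary RGB sum back to the full image, mirroring the definition of complementary RGB as a subtraction.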
During the training phase, CFCNet learns the relationship between sparse depth and sparse RGB and uses the learned knowledge to recover non-observable depth from complementary RGB.\nTo learn the relationship between the two modalities, we propose a 2D deep canonical correlation analysis (2D2CCA). Our 2D2CCA learns non-linear projections under which the projected features from the RGB and depth domains are maximally correlated. Using 2D2CCA as an objective function, we can capture the semantically correlated features from the RGB and depth domains. In this fashion, we utilize the relationship between observable depth measurements and the corresponding RGB pixels, including at locations where depth is non-observable. We then use the joint information learned from the input data pairs to output a dense depth map. The pipeline of our CFCNet is shown in Figure 2; details of our method are described in Section 3. The main contributions of CFCNet can be summarized as follows.\n\u2022 Constructing a framework for the sparse depth completion task which leverages the relationship between sparse depth and its corresponding RGB image, using the complementary RGB information to complement the missing sparse depth information.\n\u2022 Proposing the 2D2CCA, which forces the feature encoders to extract the most similar semantics from multiple modalities. Our CFCNet is the first to apply the two-dimensional approach to CCA in deep learning studies. It overcomes the small sample size problem of other CCA based deep learning frameworks on modern computer vision tasks.\n\u2022 Achieving state-of-the-art depth completion on several datasets with a variety of sparse patterns that reflect real-world settings.\n\n2 Related Work\n\nSparse Depth Completion is a task that targets dense depth completion from sparse depth measurements and a corresponding RGB image. The nature of sparse depth measurements varies across scenarios and sensors. 
Sparse depth generated by the stereo method contains more information on object contours and less information on non-textured areas [11]. LiDAR sensors produce structured sparsity due to their scanning behavior [12]. Feature based SLAM systems (such as ORB SLAM [13]) only capture depth information at the positions of the corresponding feature points. Besides these three most popular patterns, some other patterns have also been studied. For instance, [14] uses a line pattern to simulate partial observations from laser systems; [15] culls the depth data of shiny surface areas out of the dense depth map to mimic the output of commodity depth cameras; [8] uses uniform grid patterns. The latter is a simplified and artificial pattern; real-life situations require a more practical tool.\nAs for input sparsity, [4] stacks sparse depth maps and corresponding RGB images together to build a four-channel (RGB-D) input before it is fed into a ResNet based depth estimation network. This treatment produces better results than monocular depth estimation with only RGB images. Other studies involve two-branch encoder-decoder based frameworks similar to those used in RGB-D segmentation tasks [9, 10, 16, 17]. Their approaches do not apply special treatments to the sparse depth branch. They work well on datasets where the sparsity is not extremely severe, e.g. the KITTI depth completion benchmark [6]. In most of the two-branch frameworks, features from different sources are extracted independently and fused through direct concatenations or additions, or by using features from the RGB branch to provide extra guidance to refine the depth prediction results.\nCanonical Correlation Analysis is a standard statistical technique for learning the shared subspace across several original data spaces. For two modalities, in the shared subspace, each representation is maximally predictive of the other representation and maximally predictable by the other [18, 19]. 
To overcome the constraint of traditional CCA that the projections must be linear, deep canonical correlation analysis (DCCA) [20, 21] has been proposed. DCCA uses deep neural networks to learn more complex non-linear projections between multiple modalities. CCA, DCCA, and other variants have been widely used on multi-modal representation learning problems [22, 23, 24, 25, 26, 27, 28]. The one-dimensional CCA method suffers from the singularity problem of covariance matrices in the case of a high-dimensional space with a small sample size (SSS). Existing works have extended CCA in a two-dimensional way to avoid the SSS problem: [29, 30, 31] use a similar approach to building full-rank covariance matrices, inspired by 2DPCA [32] and 2DLDA [33], on the face recognition task. However, those studies do not approximate complex non-linear projections as [20, 21] attempt. Our CFCNet is the first to integrate two-dimensional CCA into deep learning frameworks to overcome the intrinsic problem of applying DCCA to modern computer vision tasks, as detailed in Section 3.2.\n\n3 Our Approach\n\nOur goal is to leverage the relationship between sparse depth measurements and their corresponding pixels in RGB images in order to optimize the performance of the depth completion task. We try to complement the missing depth components using cues from the RGB domain. Since CCA can learn a shared subspace with predictive characteristics, we estimate the missing depth components using features from the RGB domain through CCA. However, traditional CCA suffers from the SSS problem in modern computer vision tasks, as detailed in Section 3.2. We therefore propose the 2D2CCA to capture similar semantics with both the RGB and depth encoders. After the encoders learn semantically similar features, we use a transformer network to transform features from the RGB to the depth domain. 
This design not only enables the reconstruction of missing depth features from complementary RGB information but also ensures semantic similarity and the same numerical range across the two data sources. Based on this structure, the decoder in CFCNet is capable of using the reconstructed depth features along with the observable depth features to recover the dense depth map.\n\n3.1 Network Architecture\n\nThe proposed CFCNet structure is shown in Figure 2. CFCNet takes in a sparse depth map, a sparse RGB image, and a complementary RGB image. We use our Sparsity-aware Attentional Convolutions (SAConv, shown in Figure 3) in VGG16-like encoders. SAConv is inspired by the local attention mask [34]: Harley et al. [34] introduce a segmentation-aware mask to let convolution operators \"focus\" on the signals consistent with the segmentation mask. In order to propagate information from reliable sources, we use sparsity masks to make the convolution operations attend to signals from reliable locations. The difference between our SAConv and the local attention mask is that SAConv does not apply mask normalization. We avoid mask normalization because it affects the stability of our later 2D2CCA calculations: after several rounds of normalization, the extracted features become numerically small. Also, similar to [6], we apply a maxpooling operation to the masks after every SAConv to keep track of visibility: if at least one nonzero value is visible to a convolutional kernel, the maxpooling sets the mask value at that position to 1.\nMost multi-modal deep learning approaches simply concatenate or elementwise-add bottleneck features. However, when the extracted semantics and the ranges of feature values differ between sources, direct concatenation or addition of multi-modal data sources does not always yield better performance than a single-modal data source, as seen in [35, 17]. To avoid this problem, we use encoders to extract higher-level semantics from the two branches. 
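The SAConv behavior described above (mask-gated convolution without mask normalization, plus maxpooling on the mask to track visibility) can be sketched in a single-channel toy form; this naive numpy loop is our own simplification, not the paper's multi-channel learned implementation:

```python
import numpy as np

def saconv(feat, mask, kernel, bias=0.0):
    """Toy single-channel SAConv: convolve mask-gated features (no mask
    normalization) and maxpool the 0-1 mask so a position stays visible
    if any observed value falls inside the 3x3 kernel window.
    """
    h, w = feat.shape
    gated = feat * mask                     # Hadamard product with the mask
    out = np.zeros_like(feat)
    new_mask = np.zeros_like(mask)
    pad_f = np.pad(gated, 1)                # zero padding keeps size at stride 1
    pad_m = np.pad(mask, 1)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(pad_f[i:i + 3, j:j + 3] * kernel) + bias
            new_mask[i, j] = pad_m[i:i + 3, j:j + 3].max()  # visibility maxpool
    return out, new_mask
```

With a single observed pixel, the updated mask marks the whole 3x3 neighborhood of that pixel as visible, which is how visibility propagates through stacked SAConv layers.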
We propose 2D2CCA, detailed in Section 3.2, to ensure that the extracted features from the two branches are maximally correlated. The intuition is that we want to capture the same semantics from the RGB and depth domains. Next, we use a transformer network to transform the extracted features from the RGB domain to the depth domain, making the extracted features from the different sources share the same numerical range. During the training phase, we use the features of the sparse depth and the corresponding sparse RGB image to calculate the 2D2CCA loss and the transformer loss.\n\nFigure 2: Our network architecture. Here \u2295 denotes the concatenation operation. The input 0-1 sparse mask represents the sparse pattern of depth measurements. The complementary mask is complementary to the sparse mask. We separate a full RGB image into a sparse RGB and a complementary RGB by the mask and feed them, together with the masks, into the networks.\n\nFigure 3: Our SAConv. The \u2299 denotes the Hadamard product. The \u2297 denotes convolution. The + denotes elementwise addition. The kernel size is 3\u00d73 and the stride is 1 for both convolution and maxpooling.\n\nFigure 4: Samples of sparsified depth maps for experiments with different sparsifiers. (a) The source dense depth map from NYUv2. (b) With the uniform sparsifier (500 points). (c) With the stereo sparsifier (500 points). (d) With the ORB sparsifier.\n\nWe use a symmetric decoder structure to decode the embedded features. As its input, we concatenate the sparse depth features with the reconstructed missing depth features. The reconstructed missing depth features are extracted from the complementary RGB image through the RGB encoder and the transformer. To ensure single-stage training, we adopt weight-sharing strategies as shown in Figure 2.\n\nFigure 5: Visualizations of FI and FD using an example from the NYUv2 dataset. The visuals are heat maps of the extracted features. Brighter color means a larger feature value. 
The values within a single map were normalized to [0, 1]. (a) The feed-forward heat maps at the first iteration. (b) The feed-forward heat maps after training for 10000 iterations. The figure shows the heat maps of the first 6 channels; the numbers under the heat maps represent the channel numbers. The example demonstrates that the 2D2CCA is able to capture similar semantics from different sources.\n\n3.2 2D Deep Canonical Correlation Analysis (2D2CCA)\n\nExisting CCA based techniques introduced in Section 2 have limitations in modern computer vision tasks. Since modern computer vision studies usually use very deep networks to extract information from images of relatively large resolution, the batch size is limited by GPU-memory use. Meanwhile, the latent feature representations in the networks are high-dimensional; since the batch size is limited, using DCCA with one-dimensional vector representations would lead to the SSS problem. Therefore, we propose a novel 2D deep canonical correlation analysis (2D2CCA) to overcome these limitations.\nWe denote the completed depth map as D and its corresponding RGB image as I. The sparse depth map in the input and the corresponding sparse RGB image are denoted sD and sI. The RGB and depth encoders are denoted fI and fD, with parameters \u03b8I and \u03b8D, respectively. As described in Section 3.1, fI and fD use SAConv to propagate information from reliable points and extract features from the sparse inputs. We generate a pair of 3D feature grid embeddings (FsD \u2208 R^{m\u00d7n\u00d7C}, FsI \u2208 R^{m\u00d7n\u00d7C}) for each sparse depth map/image pair (sD, sI) by defining FsD = fD(sD; \u03b8D) and FsI = fI(sI; \u03b8I). 
Inside each feature grid pair there are C feature map pairs (F^i_sD \u2208 R^{m\u00d7n}, F^i_sI \u2208 R^{m\u00d7n}), \u2200i < C, with C = 512 in our network. Rather than analyzing the global correlation between all possible pairs (F^i_sD, F^j_sI), \u2200i \u2260 j, we analyze the channelwise canonical correlation between maps with the same channel number, (F^i_sD, F^i_sI). This channelwise correlation analysis results in features with similar semantic meanings for each modality, as shown in Figure 6, which guides fI to embed more valuable information related to depth completion.\nUsing a 1-dimensional feature representation would lead to the SSS problem in modern deep learning based computer vision tasks. We instead introduce a 2-dimensional approach similar to [32] to generate a full-rank covariance matrix \u02c6\u03a3_sD,sI, calculated as\n\n\\hat{\\Sigma}_{sD,sI} = \\frac{1}{C} \\sum_{i=0}^{C-1} \\left[F^{i}_{sD} - E[F_{sD}]\\right] \\left[F^{i}_{sI} - E[F_{sI}]\\right]^{T},   (1)\n\nin which we define E[F] = \\frac{1}{C} \\sum_{i=0}^{C-1} F^{i}. Besides, we generate the covariance matrix \\hat{\\Sigma}_{sD} (and respectively \\hat{\\Sigma}_{sI}), with regularization constant r_1 and identity matrix I, as\n\n\\hat{\\Sigma}_{sD} = \\frac{1}{C} \\sum_{i=0}^{C-1} \\left[F^{i}_{sD} - E[F_{sD}]\\right] \\left[F^{i}_{sD} - E[F_{sD}]\\right]^{T} + r_1 I.   (2)\n\nThe correlation between FsD and FsI is calculated as\n\ncorr(F_{sD}, F_{sI}) = \\left\\| \\hat{\\Sigma}_{sD}^{-1/2} \\, \\hat{\\Sigma}_{sD,sI} \\, \\hat{\\Sigma}_{sI}^{-1/2} \\right\\|_{tr}.   (3)\n\nA higher value of corr(FsD, FsI) represents a higher correlation between the two feature blocks. Since corr(FsD, FsI) is a non-negative scalar, we use \u2212corr(FsD, FsI) as the optimization objective to guide the training of the two feature encoders. 
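A numerical sketch of this channelwise correlation objective follows; it is our own numpy illustration, not the paper's code (the C channels act as the sample dimension, the inverse square roots come from an eigendecomposition, and the trace norm is the sum of singular values):

```python
import numpy as np

def corr_2d2cca(feat_d, feat_i, r1=1e-4):
    """2D^2CCA-style correlation between two (C, m, n) feature grids.

    The C channels play the role of samples, so the m x m covariance
    matrices stay full-rank even when the batch size is small.
    Returns the trace norm of the whitened cross-covariance.
    """
    c, m, _ = feat_d.shape
    cd = feat_d - feat_d.mean(axis=0)              # F^i - E[F], per channel
    ci = feat_i - feat_i.mean(axis=0)
    cov_di = sum(cd[k] @ ci[k].T for k in range(c)) / c
    cov_dd = sum(cd[k] @ cd[k].T for k in range(c)) / c + r1 * np.eye(m)
    cov_ii = sum(ci[k] @ ci[k].T for k in range(c)) / c + r1 * np.eye(m)

    def inv_sqrt(a):                               # symmetric inverse square root
        w, v = np.linalg.eigh(a)
        return v @ np.diag(w ** -0.5) @ v.T

    t = inv_sqrt(cov_dd) @ cov_di @ inv_sqrt(cov_ii)
    return float(np.linalg.svd(t, compute_uv=False).sum())  # trace norm
```

Training would minimize the negative of this value; for identical, well-conditioned inputs the correlation approaches m, the height of the feature maps.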
To compute the gradient of corr(FsD, FsI) with respect to \u03b8D and \u03b8I, we compute its gradient with respect to FsD and FsI and then backpropagate. For the gradient computation, we define M = \\hat{\\Sigma}_{sD}^{-1/2} \\hat{\\Sigma}_{sD,sI} \\hat{\\Sigma}_{sI}^{-1/2} and decompose M as M = USV^{T} using the singular value decomposition. Then\n\n\\frac{\\partial\\, corr(F_{sD}, F_{sI})}{\\partial F_{sD}} = \\frac{1}{C} \\left(2 \\nabla_{sDsD} F_{sD} + \\nabla_{sDsI} F_{sI}\\right),   (4)\n\nwhere \\nabla_{sDsI} = \\hat{\\Sigma}_{sD}^{-1/2} U V^{T} \\hat{\\Sigma}_{sI}^{-1/2} and \\nabla_{sDsD} = -\\frac{1}{2} \\hat{\\Sigma}_{sD}^{-1/2} U S U^{T} \\hat{\\Sigma}_{sD}^{-1/2}. The gradient with respect to FsI follows a calculation similar to Equation (4).\n\n3.3 Loss Function\n\nWe denote our channelwise 2D2CCA loss as L2D2CCA = \u2212corr(FsD, FsI). We denote the component transformed from sparse RGB to the depth domain as \u02c6FsD. The transformer loss describes the numerical similarity between the RGB and depth domains, which we measure with the L2 norm: Ltrans = \u2225FsD \u2212 \u02c6FsD\u2225_2^2.\nWe also build another encoder and another transformer network, which share weights with the encoder and transformer network for the sparse RGB. The input of this encoder is the complementary RGB image. We use the features extracted from the complementary RGB image to predict the features of the non-observable depth through the transformer network. For the complementary RGB image, we denote the extracted feature and the transformed component as FcI and \u02c6FcD. We then concatenate FsD and \u02c6FcD, both of which are 512-channel, to obtain a 1024-channel bottleneck feature in the depth domain. We pass this bottleneck feature into the decoder described in Section 3.1. The output of the decoder is a completed dense depth map \u02c6D. 
To measure the inconsistency between the groundtruth Dgt and the completed depth map, we use the pixelwise L2 norm; thus our reconstruction loss is Lrecon = \u2225Dgt \u2212 \u02c6D\u2225_2^2.\nAlso, since bottleneck features have limited expressiveness, if the sparsity of the inputs is severe, e.g. only 0.1% of the full resolution is sampled, the completed depth maps usually show gridding artifacts. To resolve these artifacts, we introduce the smoothness term of [36] into our loss function: Lsmooth = \u2225\u2207\u00b2\u02c6D\u2225_1, where \u2207\u00b2 denotes the second-order gradients. Our final total loss function with weights becomes\n\nLtotal = L2D2CCA + wt Ltrans + wr Lrecon + ws Lsmooth.   (5)\n\n4 Experiments\n\n4.1 Dataset and Experiment Details\n\nImplementation details. We use PyTorch to implement the network. Our encoders are similar to VGG16, without the fully-connected layers. We apply ReLU to the extracted features after every SAConv operation. Downsampling is applied to both the features and the masks in the encoders. The transformer network is a 2-layer network with our SAConv, kernel size 3\u00d73, stride 1, and 512 dimensions. The decoder is also a VGG16-like network, using deconvolution to upsample. We use the SGD optimizer. We summarize all the hyperparameter tuning in the supplemental material.\nDatasets. We have done extensive experiments on outdoor scene datasets, the KITTI odometry dataset [12] and the Cityscape depth dataset [37], and on indoor scene datasets, NYUv2 [38] and the SLAM RGBD datasets ICL-NUIM [39] and TUM [40].\n\n\u2022 KITTI dataset. The KITTI dataset contains both RGB and LiDAR measurements, 22 sequences in total, for autonomous driving use. We use the official split, where 46K images are for training and 46K for testing. We adopt the same settings as described in [4, 41], which drop the upper part of the images and resize the images to 912\u00d7228.\n\n\u2022 Cityscape dataset. 
The Cityscape dataset contains RGB images and depth maps calculated from stereo matching of outdoor scenes. We use the official training/validation dataset split: the training set contains 23K images from 41 sequences, and the testing set contains 3 sequences. We center-crop the images to 900\u00d7335 to avoid the sky at the top and the car logo at the bottom.\n\n\u2022 NYUv2 dataset. The NYUv2 dataset contains 464 sequences of indoor RGB and depth data captured with a Kinect. We use the official dataset split and follow [4] to sample 50K images as training data. The testing data contains 654 images.\n\n\u2022 SLAM RGBD dataset. We use sequences from the ICL-NUIM [42] and TUM RGBD SLAM [40] datasets. The former is synthetic, and the latter was acquired with a Kinect. We use the same testing sequences as described in [1].\n\nSparsifiers. A sparsifier describes the strategy for sampling the dense/semi-dense depth maps in a dataset so that they become the sparse depth input for training and evaluation. We define three sparsifiers to simulate different sparse patterns existing in real-world applications. The uniform sparsifier uniformly samples the dense depth map, simulating the nearly uniform scanning effect of LiDAR. The stereo sparsifier only samples depth measurements on edges or textured objects in the scene, to simulate the sparse patterns generated by stereo matching or direct VSLAM. The ORB sparsifier only keeps the depth measurements at the locations of ORB features in the corresponding RGB images, simulating the sparse depth map output by feature based VSLAM. We set a sample number for the uniform and stereo sparsifiers to control the sparsity. Since the number of ORB features varies between images, we do not predefine a sample number but take all the depth values at the ORB feature positions.\n\nError metrics. 
We use the same error metrics as most previous works: (1) RMSE, the root mean square error; (2) MAE, the mean absolute error; (3) \u03b4i, the percentage of predicted pixels whose relative error is within 1.25^i. Most related works adopt i = 1, 2, 3. RMSE and MAE both measure error in meters in our experiments.\nAblation studies. To examine the effectiveness of the multi-modal approach, we evaluate the network performance using four types of inputs: (1) dense RGB images; (2) sparse depth; (3) dense RGB image + sparse depth; (4) complementary RGB image + sparse depth. The evaluation results are shown in Table 1. We observe that the networks with single-modal input perform worse than those with multi-modal input, which validates our multi-modal design.\n\nTable 1: Ablation study of using different data sources on the NYUv2 dataset. Dense RGB means we feed in full RGB images. sD means sparse depth generated by the 100-point stereo sparsifier. cRGB means the complementary RGB image.\n\nInput data | MAE | RMSE | \u03b41 | \u03b42 | \u03b43\nDense RGB | 0.576 | 0.740 | 63.5 | 89.0 | 97.0\nsD (100 pts) | 0.524 | 0.700 | 68.1 | 90.2 | 97.0\nDense RGB + sD (100 pts) | 0.479 | 0.638 | 73.0 | 92.4 | 97.7\ncRGB + sD (100 pts) | 0.473 | 0.631 | 72.4 | 92.6 | 98.1\n\nBesides, we observe that using dense RGB with sparse depth has similar but worse performance than using complementary RGB with sparse depth. The sparse depth inputs are precise; extracting RGB-domain features at locations where we already have precise depth information causes ambiguity, so the performance is worse than with complementary RGB information. We also conduct ablation studies for different loss combinations on the KITTI and NYUv2 datasets in our supplementary material.\nFurthermore, we conduct an ablation study with different sparsity on the NYUv2 dataset. The stereo sparsifier is used to sample from the dense depth maps to generate sparse depth data for training and testing. 
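The error metrics above (RMSE, MAE, and the \u03b4i thresholds) can be computed as in this short numpy sketch; the helper name and the zero-depth validity convention are our assumptions:

```python
import numpy as np

def depth_metrics(pred, gt):
    """RMSE and MAE in meters, plus delta_i = % of valid pixels with
    max(pred/gt, gt/pred) < 1.25**i for i = 1, 2, 3.
    Pixels with zero ground truth are treated as invalid and skipped.
    """
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    rmse = float(np.sqrt(np.mean((p - g) ** 2)))
    mae = float(np.mean(np.abs(p - g)))
    ratio = np.maximum(p / g, g / p)            # symmetric relative error
    deltas = [float(100.0 * np.mean(ratio < 1.25 ** i)) for i in (1, 2, 3)]
    return rmse, mae, deltas
```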
We show how different sparsity can affect the predicted depth map quality. The results are in Table 2.\n\nTable 2: Ablation study of different sample numbers on NYUv2 using the stereo sparsifier.\n\nSample# | MAE | RMSE | \u03b41 | \u03b42 | \u03b43\n50 | 0.547 | 0.715 | 65.5 | 90.1 | 97.4\n100 | 0.426 | 0.580 | 77.5 | 94.1 | 98.4\n200 | 0.385 | 0.531 | 80.9 | 95.1 | 98.7\n500 | 0.342 | 0.476 | 83.0 | 96.1 | 99.0\n1000 | 0.290 | 0.419 | 87.0 | 97.0 | 99.2\n2000 | 0.242 | 0.352 | 91.3 | 98.2 | 99.6\n5000 | 0.222 | 0.323 | 93.3 | 98.9 | 99.8\n10000 | 0.151 | 0.231 | 96.6 | 99.5 | 99.9\n\n4.2 Outdoor scene - KITTI odometry and Cityscapes\n\nFor the two outdoor datasets, KITTI and Cityscapes, we use the uniform sparsifier. For the KITTI dataset, we sample 500 points as sparse depth, the same as some previous works. We compare with several state-of-the-art works [4, 43, 41, 44]. We follow the evaluation settings in these works and randomly choose 3000 images to calculate the numerical results. The results are in Table 3. Next, we conduct experiments using both the KITTI and Cityscape datasets. Some monocular depth prediction works use the Cityscape dataset for training and the KITTI dataset for testing; we adopt this setting and use 100 uniformly sampled sparse depth points as inputs. The results are shown in Table 4.\n\n4.3 Indoor scene - NYUv2 and SLAM RGBD datasets\n\nFor the NYUv2 indoor scene dataset, we use the stereo sparsifier to sample points. We compare to the state of the art [4] with different sparsity using their publicly released code. 
The results are shown in Table 5.\n\nTable 3: 500-point sparse depth completion on the KITTI dataset.\n\nMethods | MAE | RMSE | \u03b41 | \u03b42 | \u03b43\nMa et al. [4] | - | 3.378 | 93.5 | 97.6 | 98.9\nSPN [43] | - | 3.243 | 94.3 | 97.8 | 99.1\nCSPN [41] | - | 3.029 | 95.5 | 98.0 | 99.0\nCSPN+UNet [41] | - | 2.977 | 95.7 | 98.0 | 99.1\nPnP [44] | 1.024 | 2.975 | 94.9 | 98.0 | 99.0\nCFCNet w/o smoothness | 1.233 | 2.967 | 94.1 | 98.1 | 99.3\nCFCNet w/ smoothness | 1.197 | 2.964 | 94.0 | 98.0 | 99.3\n\nTable 4: Depth evaluation results. Cap 50m means only depths smaller than 50m are taken into consideration in the evaluation. CS\u2192K means we train the network on the Cityscape dataset and evaluate on the KITTI dataset. Competing methods all train/test with 100 pts.\n\nMethods | Input | Dataset | RMSE | \u03b41 | \u03b42 | \u03b43\nZhou et al. [36] | RGB | CS\u2192K | 7.580 | 57.7 | 84.0 | 93.7\nGodard et al. [45] | RGB | CS\u2192K | 14.445 | 5.3 | 32.6 | 86.2\nAleotti et al. [46] | RGB | CS\u2192K | 14.051 | 6.3 | 39.4 | 87.6\nCFCNet (50 pts) | RGB+sD | CS\u2192K | 7.841 | 78.3 | 92.7 | 97.0\nCFCNet (100 pts) | RGB+sD | CS\u2192K | 5.827 | 82.6 | 94.7 | 97.9\nZhou et al. [36] (cap 50m) | RGB | CS\u2192K | 6.148 | 59.0 | 85.2 | 94.5\nCFCNet (50 pts, cap 50m) | RGB+sD | CS\u2192K | 6.334 | 79.2 | 93.2 | 97.3\nCFCNet (100 pts, cap 50m) | RGB+sD | CS\u2192K | 4.524 | 83.7 | 95.2 | 98.1\nCFCNet (50 pts, cap 50m) | RGB+sD | CS\u2192CS | 9.019 | 82.8 | 94.1 | 97.2\nCFCNet (100 pts, cap 50m) | RGB+sD | CS\u2192CS | 6.887 | 88.9 | 96.1 | 98.1\nCFCNet (100 pts, cap 50m) | RGB+sD | K\u2192K | 3.157 | 91.0 | 97.1 | 98.9\n\nTable 5: Comparisons on the NYUv2 dataset using the stereo 
sparsifier.\n\nSample# | Methods | MAE | RMSE | \u03b41 | \u03b42 | \u03b43\n100 | [4] | 0.473 | 0.629 | 71.5 | 92.4 | 98.0\n100 | CFCNet | 0.426 | 0.580 | 77.5 | 94.1 | 98.4\n200 | [4] | 0.451 | 0.603 | 73.0 | 93.5 | 98.4\n200 | CFCNet | 0.385 | 0.531 | 80.9 | 95.1 | 98.7\n500 | [4] | 0.384 | 0.529 | 79.2 | 94.9 | 98.6\n500 | CFCNet | 0.342 | 0.476 | 83.0 | 96.1 | 99.0\n\nNext, we conduct experiments on the SLAM RGBD datasets. We follow the setting of the state of the art, CNN-SLAM [1], and perform a cross-dataset evaluation: we train the model on NYUv2 using the ORB sparsifier and evaluate on the SLAM RGBD datasets. We use the metric from CNN-SLAM, the percentage of accurate estimations, where an estimation is accurate if its error is within \u00b110% of the groundtruth. The results are in Table 6.\n\nTable 6: Comparison in terms of the percentage of correctly estimated depth on two SLAM RGBD datasets, ICL-NUIM and TUM. (TUM/seq1 name: fr3/long office household, TUM/seq2 name: fr3/nostructure texture near withloop, TUM/seq3 name: fr3/structure texture far)\n\nSequence# | CFCNet | CNN-SLAM [1] | Laina [47] | Remode [48]\nICL/office0 | 41.97 | 19.41 | 17.19 | 4.47\nICL/office1 | 43.86 | 29.15 | 20.83 | 3.13\nICL/office2 | 63.64 | 37.22 | 30.63 | 16.70\nICL/living0 | 51.76 | 12.84 | 15.00 | 4.47\nICL/living1 | 64.34 | 13.03 | 11.44 | 2.42\nICL/living2 | 59.07 | 26.56 | 33.01 | 8.68\nTUM/seq1 | 54.70 | 12.47 | 12.98 | 9.54\nTUM/seq2 | 66.30 | 24.07 | 15.41 | 12.65\nTUM/seq3 | 74.61 | 27.39 | 9.45 | 6.73\nAverage | 57.81 | 22.46 | 18.44 | 7.64\n\nFigure 6: Visual results on the KITTI dataset. (a) The RGB image. (b) 500-point sparse depth as input. (c) Our completed depth maps. (d) Results from [4].\n\n5 Conclusion\n\nIn this paper, we directly analyze the relationship between sparse depth information and the corresponding pixels in RGB images. 
To better fuse information, we propose 2D2CCA to ensure that the most similar semantics are captured from the two branches, and we use the complementary RGB information to complement the missing depth. Extensive experiments on four outdoor/indoor scene datasets show that our results achieve the state of the art.\n\nReferences\n\n[1] Keisuke Tateno, Federico Tombari, Iro Laina, and Nassir Navab, \u201cCnn-slam: Real-time dense monocular slam with learned depth prediction,\u201d in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6243\u20136252.\n\n[2] Weiyue Wang and Ulrich Neumann, \u201cDepth-aware cnn for rgb-d segmentation,\u201d in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 135\u2013150.\n\n[3] Weiyue Wang, Qiangeng Xu, Duygu Ceylan, Radomir Mech, and Ulrich Neumann, \u201cDisn: Deep implicit surface network for high-quality single-view 3d reconstruction,\u201d in Advances in Neural Information Processing Systems (NeurIPS), 2019.\n\n[4] Fangchang Ma and Sertac Karaman, \u201cSparse-to-dense: Depth prediction from sparse depth samples and a single image,\u201d in IEEE International Conference on Robotics and Automation (ICRA), 2018.\n\n[5] Shreyas S. Shivakumar, Ty Nguyen, Steven W. Chen, and Camillo J. Taylor, \u201cDfusenet: Deep fusion of rgb and sparse depth information for image guided dense depth completion,\u201d arXiv preprint arXiv:1902.00761, 2019.\n\n[6] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger, \u201cSparsity invariant cnns,\u201d in IEEE International Conference on 3D Vision (3DV), 2017, pp. 
11\u201320.\n\n[7] Jiaxiong Qiu, Zhaopeng Cui, Yinda Zhang, Xingdi Zhang, Shuaicheng Liu, Bing Zeng, and Marc Pollefeys,\n\u201cDeeplidar: Deep surface normal guided depth prediction for outdoor scene from sparse lidar data and\nsingle color image,\u201d arXiv preprint arXiv:1812.00488, 2018.\n\n[8] Zhao Chen, Vijay Badrinarayanan, Gilad Drozdov, and Andrew Rabinovich, \u201cEstimating depth from rgb\n\nand sparse sensing,\u201d in European Conference on Computer Vision (ECCV), 2018, pp. 167\u2013182.\n\n[9] Yanchao Yang, Alex Wong, and Stefano Soatto, \u201cDense depth posterior (ddp) from single image and\n\nsparse range,\u201d arXiv preprint arXiv:1901.10034, 2019.\n\n[10] Yilun Zhang, Ty Nguyen, Ian D Miller, Steven Chen, Camillo J Taylor, Vijay Kumar, et al., \u201cD\ufb01nenet:\nEgo-motion estimation and depth re\ufb01nement from sparse, noisy depth input with rgb guidance,\u201d arXiv\npreprint arXiv:1903.06397, 2019.\n\n9\n\n\f[11] David A Forsyth and Jean Ponce, \u201cComputer vision: A modern approach,\u201d 2003.\n\n[12] Andreas Geiger, Philip Lenz, and Raquel Urtasun, \u201cAre we ready for autonomous driving? the kitti vision\nbenchmark suite,\u201d in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp.\n3354\u20133361.\n\n[13] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos, \u201cOrb-slam: a versatile and accurate\n\nmonocular slam system,\u201d IEEE Transactions on robotics, vol. 31, no. 5, pp. 1147\u20131163, 2015.\n\n[14] Yiyi Liao, Lichao Huang, Yue Wang, Sarath Kodagoda, Yinan Yu, and Yong Liu, \u201cParse geometry from a\nline: Monocular depth estimation with partial laser observation,\u201d in IEEE International Conference on\nRobotics and Automation (ICRA), 2017, pp. 5059\u20135066.\n\n[15] Yinda Zhang and Thomas Funkhouser, \u201cDeep depth completion of a single rgb-d image,\u201d in Proceedings\n\nof the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 
175\u2013185.\n\n[16] Saif Imran, Yunfei Long, Xiaoming Liu, and Daniel Morris, \u201cDepth coef\ufb01cients for depth completion,\u201d\n\narXiv preprint arXiv:1903.05421, 2019.\n\n[17] Maximilian Jaritz, Raoul De Charette, Emilie Wirbel, Xavier Perrotton, and Fawzi Nashashibi, \u201cSparse\nand dense data with cnns: Depth completion and semantic segmentation,\u201d IEEE International Conference\non 3D Vision (3DV), 2018.\n\n[18] Harold Hotelling, \u201cRelations between two sets of variates,\u201d in Breakthroughs in statistics, pp. 162\u2013190.\n\nSpringer, 1992.\n\n[19] TW Anderson, \u201cAn introduction to multivariate statistical analysis.,\u201d 1984.\n\n[20] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu, \u201cDeep canonical correlation analysis,\u201d in\n\nInternational Conference on Machine Learning (ICML), 2013, pp. 1247\u20131255.\n\n[21] Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes, \u201cOn deep multi-view representation learning,\u201d\n\nin International Conference on Machine Learning (ICML), 2015, pp. 1083\u20131092.\n\n[22] Changxing Ding and Dacheng Tao, \u201cA comprehensive survey on pose-invariant face recognition,\u201d ACM\n\nTransactions on intelligent systems and technology, vol. 7, no. 3, pp. 37, 2016.\n\n[23] Xinghao Yang, Weifeng Liu, Dapeng Tao, and Jun Cheng, \u201cCanonical correlation analysis networks for\n\ntwo-view image recognition,\u201d Information Sciences, vol. 385, pp. 338\u2013352, 2017.\n\n[24] Weiran Wang, Raman Arora, Karen Livescu, and Jeff A Bilmes, \u201cUnsupervised learning of acoustic\nfeatures via deep canonical correlation analysis,\u201d in IEEE International Conference on Acoustics, Speech\nand Signal Processing (ICASSP), 2015, pp. 4590\u20134594.\n\n[25] Meina Kan, Shiguang Shan, and Xilin Chen, \u201cMulti-view deep network for cross-view classi\ufb01cation,\u201d in\n\nIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 
4847\u20134855.\n\n[26] Fei Yan and Krystian Mikolajczyk, \u201cDeep correlation for matching images and text,\u201d in IEEE Conference\n\non Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3441\u20133450.\n\n[27] Youssef Mroueh, Etienne Marcheret, and Vaibhava Goel, \u201cDeep multimodal learning for audio-visual\nspeech recognition,\u201d in IEEE International Conference on Acoustics, Speech and Signal Processing\n(ICASSP), 2015, pp. 2130\u20132134.\n\n[28] Marie Katsurai and Shin\u2019ichi Satoh, \u201cImage sentiment analysis using latent correlations among visual,\ntextual, and sentiment views,\u201d in IEEE International Conference on Acoustics, Speech and Signal\nProcessing (ICASSP), 2016, pp. 2837\u20132841.\n\n[29] G Kukharev and E Kamenskaya, \u201cApplication of two-dimensional canonical correlation analysis for face\nimage processing and recognition,\u201d Pattern Recognition and Image Analysis, vol. 20, no. 2, pp. 210\u2013219,\n2010.\n\n[30] Cai-rong Zou, Ning Sun, Zhen-hai Ji, and Li Zhao, \u201c2dcca: A novel method for small sample size face\n\nrecognition,\u201d in IEEE Workshop on Applications of Computer Vision (WACV), 2007, pp. 43\u201343.\n\n[31] Sun Ho Lee and Seungjin Choi, \u201cTwo-dimensional canonical correlation analysis,\u201d IEEE Signal Processing\n\nLetters, vol. 14, no. 10, pp. 735\u2013738, 2007.\n\n10\n\n\f[32] Jian Yang, David D Zhang, Alejandro F Frangi, and Jing-yu Yang, \u201cTwo-dimensional pca: a new approach\nto appearance-based face representation and recognition,\u201d IEEE Transactions on Pattern Analysis and\nMachine Intelligence, 2004.\n\n[33] Ming Li and Baozong Yuan, \u201cA novel statistical linear discriminant analysis for image matrix: two-\ndimensional \ufb01sherfaces,\u201d in IEEE International Conference on Signal Processing, 2004, vol. 2, pp.\n1419\u20131422.\n\n[34] Iasonas Kokkinos Adam W Harley, Konstantinos G. 
Derpanis, \u201cSegmentation-aware convolutional\nnetworks using local attention masks,\u201d in IEEE International Conference on Computer Vision (ICCV),\n2017.\n\n[35] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng, \u201cMultimodal\n\ndeep learning,\u201d in International Conference on Machine Learning, 2011, pp. 689\u2013696.\n\n[36] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe, \u201cUnsupervised learning of depth and\nego-motion from video,\u201d in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017,\npp. 1851\u20131858.\n\n[37] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benen-\nson, Uwe Franke, Stefan Roth, and Bernt Schiele, \u201cThe cityscapes dataset for semantic urban scene\nunderstanding,\u201d in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp.\n3213\u20133223.\n\n[38] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus, \u201cIndoor segmentation and support\ninference from rgbd images,\u201d in European Conference on Computer Vision (ECCV), 2012, pp. 746\u2013760.\n\n[39] Ankur Handa, Thomas Whelan, John McDonald, and Andrew J Davison, \u201cA benchmark for rgb-d visual\nodometry, 3d reconstruction and slam,\u201d in IEEE International Conference on Robotics and automation\n(ICRA), 2014, pp. 1524\u20131531.\n\n[40] J\u00fcrgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers, \u201cA benchmark for\nthe evaluation of rgb-d slam systems,\u201d in IEEE/RSJ International Conference on Intelligent Robots and\nSystems, 2012, pp. 573\u2013580.\n\n[41] Xinjing Cheng, Peng Wang, and Ruigang Yang, \u201cLearning depth with convolutional spatial propagation\n\nnetwork,\u201d European Conference on Computer Vision (ECCV), 2018.\n\n[42] A. Handa, T. Whelan, J.B. McDonald, and A.J. 
Davison, \u201cA benchmark for RGB-D visual odometry, 3D\nreconstruction and SLAM,\u201d in IEEE International Conference on Robotics and Automation (ICRA), 2014.\n\n[43] Sifei Liu, Shalini De Mello, Jinwei Gu, Guangyu Zhong, Ming-Hsuan Yang, and Jan Kautz, \u201cLearning\naf\ufb01nity via spatial propagation networks,\u201d in Advances in Neural Information Processing Systems (NIPS),\n2017, pp. 1520\u20131530.\n\n[44] Tsun-Hsuan Wang, Fu-En Wang, Juan-Ting Lin, Yi-Hsuan Tsai, Wei-Chen Chiu, and Min Sun, \u201cPlug-\nand-play: Improve depth estimation via sparse data propagation,\u201d in IEEE International Conference on\nRobotics and Automation (ICRA), 2019.\n\n[45] Cl\u00e9ment Godard, Oisin Mac Aodha, and Gabriel J Brostow, \u201cUnsupervised monocular depth estimation\nwith left-right consistency,\u201d in IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\n2017, pp. 270\u2013279.\n\n[46] Filippo Aleotti, Fabio Tosi, Matteo Poggi, and Stefano Mattoccia, \u201cGenerative adversarial networks for\nunsupervised monocular depth prediction,\u201d in European Conference on Computer Vision (ECCV), 2018.\n\n[47] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab, \u201cDeeper depth\nprediction with fully convolutional residual networks,\u201d in IEEE International Conference on 3D Vision\n(3DV), 2016, pp. 239\u2013248.\n\n[48] Matia Pizzoli, Christian Forster, and Davide Scaramuzza, \u201cRemode: Probabilistic, monocular dense\nreconstruction in real time,\u201d in IEEE International Conference on Robotics and Automation (ICRA), 2014,\npp. 
2609\u20132616.\n\n11\n\n\f", "award": [], "sourceid": 2869, "authors": [{"given_name": "Yiqi", "family_name": "Zhong", "institution": "University of Southern California"}, {"given_name": "Cho-Ying", "family_name": "Wu", "institution": "University of Southern California"}, {"given_name": "Suya", "family_name": "You", "institution": "US Army Research Laboratory"}, {"given_name": "Ulrich", "family_name": "Neumann", "institution": "USC"}]}