{"title": "Revisiting Multi-Task Learning with ROCK: a Deep Residual Auxiliary Block for Visual Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 1310, "page_last": 1322, "abstract": "Multi-Task Learning (MTL) is appealing for deep learning regularization. In this paper, we tackle a specific MTL context denoted as primary MTL, where the ultimate goal is to improve the performance of a given primary task by leveraging several other auxiliary tasks. Our main methodological contribution is to introduce ROCK, a new generic multi-modal fusion block for deep learning tailored to the primary MTL context. ROCK architecture is based on a residual connection, which makes forward prediction explicitly impacted by the intermediate auxiliary representations. The auxiliary predictor's architecture is also specifically designed to our primary MTL context, by incorporating intensive pooling operators for maximizing complementarity of intermediate representations. Extensive experiments on NYUv2 dataset (object detection with scene classification, depth prediction, and surface normal estimation as auxiliary tasks) validate the relevance of the approach and its superiority to flat MTL approaches. 
Our method outperforms state-of-the-art object detection models on NYUv2 dataset by a large margin, and is also able to handle large-scale heterogeneous inputs (real and synthetic images) with missing annotation modalities.", "full_text": "Revisiting Multi-Task Learning with ROCK: a Deep Residual Auxiliary Block for Visual Detection\n\nTaylor Mordan(1, 2)\ntaylor.mordan@lip6.fr\n\nNicolas Thome(3)\nnicolas.thome@cnam.fr\n\nGilles Henaff(2)\ngilles.henaff@fr.thalesgroup.com\n\nMatthieu Cord(1)\nmatthieu.cord@lip6.fr\n\n(1) Sorbonne Université, CNRS, Laboratoire d'Informatique de Paris 6, LIP6, F-75005 Paris, France\n(2) Thales Land and Air Systems, 2 Avenue Gay-Lussac, 78990 Élancourt, France\n(3) CEDRIC, Conservatoire National des Arts et Métiers, 292 Rue St Martin, 75003 Paris, France\n\nAbstract\n\nMulti-Task Learning (MTL) is appealing for deep learning regularization. In this paper, we tackle a specific MTL context denoted as primary MTL, where the ultimate goal is to improve the performance of a given primary task by leveraging several other auxiliary tasks. Our main methodological contribution is to introduce ROCK, a new generic multi-modal fusion block for deep learning tailored to the primary MTL context. ROCK architecture is based on a residual connection, which makes forward prediction explicitly impacted by the intermediate auxiliary representations. The auxiliary predictor's architecture is also specifically designed to our primary MTL context, by incorporating intensive pooling operators for maximizing complementarity of intermediate representations. Extensive experiments on NYUv2 dataset (object detection with scene classification, depth prediction, and surface normal estimation as auxiliary tasks) validate the relevance of the approach and its superiority to flat MTL approaches. 
Our method outperforms state-of-the-art object detection models on NYUv2 dataset by a large margin, and is also able to handle large-scale heterogeneous inputs (real and synthetic images) with missing annotation modalities.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nFigure 1: Residual auxiliary block (ROCK) for object detection with auxiliary information. ROCK (middle) is incorporated into a backbone SSD object detection model [31] to utilize additional supervision from multi-modal auxiliary tasks (scene classification, depth prediction and surface normal estimation) and to improve performance on the primary object detection task.\n\n1 Introduction\n\nThe outstanding success of ConvNets for image classification in the ILSVRC challenge [26] has heralded a new era for deep learning. A key element of this success is the availability of large-scale annotated datasets such as ImageNet [40]. When dealing with smaller-scale datasets, however, training such big ConvNets is not viable, due to strong overfitting issues. In some applications where images themselves are difficult to obtain, e.g. medical or military domains, getting additional annotations on available images can be easier than collecting more examples, as a way to get more data to feed the networks. An appealing option to limit overfitting is then to rely on Transfer Learning (TL), which aims at leveraging different objectives and datasets for improving predictive performance. The most popular strategy for tackling small datasets in vision is certainly Fine-Tuning (FT) [1], which can be regarded as sequential TL. It consists in using networks pre-trained on a large-scale dataset (e.g. ImageNet), which provide very powerful visual descriptors known as Deep Features (DF). DF are at the core of state-of-the-art dense prediction methods, e.g. 
object detection [17, 11, 16, 12, 31, 7, 36], image segmentation [10, 5], depth prediction [28, 29, 14] or pose estimation [33].\nA drawback of FT is that the large-scale data and labels are only used to initialize network parameters, but not for optimizing the ultimate model used for prediction. At the other extreme, Multi-Task Learning (MTL) solutions consist in using different tasks and datasets and sharing some intermediate representations between the tasks, which are optimized jointly [35, 3, 32, 25, 50, 34]; this can be seen as parallel TL. Although MTL is an old topic in machine learning [4], it is currently being intensively revisited with deep learning. The crux of the matter is to define where and how to share parameters between tasks, depending on the application. Some approaches focus on learning the optimal MTL architecture [35, 32], while others explore relating every layer of the networks [50] or relating layers at various depths to account for semantic variations between modalities [34]. In UberNet [25], the goal is to learn a universal network which can share various low- and high-level dense prediction tasks. This MTL strategy has been shown to improve results of individual tasks when they are learned together under certain conditions.\nThe aforementioned approaches assume a flat structure between tasks, the goal of MTL usually being to have good results on all tasks simultaneously while saving computation or time. However, our problem is concerned with a primary task, which is augmented during training with several auxiliary tasks. The ultimate goal here is to improve the primary task performance, not to have good performance on average across tasks. Flat MTL is therefore intrinsically sub-optimal here, since the problem is biased toward a given application. 
We frame it as a new kind of MTL, named primary MTL, where there is only one task of interest, and other auxiliary tasks that can be leveraged to improve the first, primary one. In this sense, our context is related to Learning Using Privileged Information (LUPI) [47, 38, 41, 42, 46, 23, 43] and end-to-end trainable deep LUPI approaches.\nIn this paper, we introduce a new model for leveraging auxiliary supervision and improving performance on a primary task, in a primary MTL setup. Regarding methodology, the main contribution of the paper is to introduce ROCK, a new residual auxiliary block (Figure 1), which can easily be inserted into any existing architecture to effectively exploit additional auxiliary annotations. The main goal is to produce predictions for auxiliary tasks and to learn features through MTL. However, it is designed around two key features differentiating it from flat MTL, in order to better fit ROCK to our context. First, the block is equipped with a residual connection, which explicitly merges intermediate representations of the primary and auxiliary tasks, so that the latter have a real effect on the former in the forward pass, not just through shared feature learning. Then, the predictor, which is not merged back as shown in Figure 1, contains pooling operations only and no parameters. This forces the model to learn relevant auxiliary features earlier in the intermediate representations, so as to maximize their influence on the primary task when they are fused into it. We evaluate our approach on object detection, with multi-modal auxiliary information (scene labels, depth and surface normals), as illustrated in Figure 1. Experiments carried out on NYUv2 dataset [44] validate the relevance of the approach: we outperform state-of-the-art detection methods on this dataset by a large margin.\n\nFigure 2: Detailed architecture of residual auxiliary block (ROCK). The block is composed of four parts, represented by the shaded areas: the encoder extracts task-specific features for all auxiliary tasks; the decoder and fusion operation transform these encodings back to the original feature space and merge them into the main path, explicitly bringing complementary information to the primary task; the predictor produces outputs for auxiliary tasks in order to learn from them through MTL. Although the block can be instantiated for any number and kind of tasks, it is presented here with the specific setup of three auxiliary tasks described in Section 3.\n\n2 ROCK: Residual auxiliary block\n\nThe general architecture of our model is shown in Figure 1. It is created from an existing model performing a given task t0. This model should be composed of a backbone network yielding a base feature map X (left of Figure 1), used as input to a task-specific module computing predictions (right of Figure 1). This kind of design is fairly general, so this assumption is not restrictive. The idea behind ROCK is to add a new residual auxiliary block (middle of Figure 1) between the two existing components, in order to leverage T other, auxiliary tasks {t_i}_{i=1}^{T} to extract useful information and inject it into the base feature map X to yield a refined version X̃ of it. This refined representation, being similar to the base feature map, is then used by the task-specific module of the primary task, which is now explicitly influenced by auxiliary tasks. 
The new task-specific features might not be easily learned from the primary task t0 only, so X̃ encodes additional details of the scenes learned by the block, therefore leading to better performance on the primary task t0.\nTo refine the base feature map X, the auxiliary block must extract information from all auxiliary tasks {t_i}_{i=1}^{T}. To this end, it is learned within the Multi-Task Learning (MTL) framework: during training, a prediction y_t is produced for every task t and a loss ℓ_t is applied, so that the block is learned from all tasks (including the main primary task, through the refinement path) simultaneously. Learned intermediate features are then used in the refinement step. In the inference phase, features are extracted for auxiliary tasks and are used to modify the base feature map in the same way, so that the predictions for the primary task explicitly take this information into account, without needing any annotations. Therefore, ROCK uses auxiliary supervision as privileged information.\nWe now present the general design of the residual auxiliary block for arbitrary tasks, then detail the architecture we use in the experiments on NYUv2 dataset with the associated tasks in Section 3.1. The block is designed to be generic, so that it can be easily integrated into a wide range of networks and can be applied to almost any task without further major changes. All its components are designed to have a small computational overhead, in order to keep the increase in complexity light, easing the integration of the block into existing architectures. 
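The joint training just described (one loss per task, all optimized simultaneously) amounts to summing per-task losses. A minimal sketch, assuming a single shared auxiliary weight `lam_aux` and illustrative task names (neither is specified in the text):

```python
def primary_mtl_loss(losses, primary="detection", lam_aux=1.0):
    """Combine per-task losses for primary MTL.

    `losses` maps task names to scalar losses; the primary task keeps
    weight 1 and every auxiliary task is scaled by `lam_aux` (a single
    shared weight here; per-task weights would work the same way).
    """
    total = losses[primary]
    for task, value in losses.items():
        if task != primary:
            total += lam_aux * value
    return total
```

Setting `lam_aux` to 0 disables the auxiliary supervision entirely, which is the sanity check run in Section 4.4 ("ROCK w/o aux. sup.").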
It also has as few parameters as possible. The resulting model can therefore be learned efficiently, and fully leverage additional annotations to effectively increase performance.\n\nOur auxiliary block is composed of four main parts: encoder, decoder, fusion and predictor. They are all illustrated in Figure 1 and detailed in Figure 2 within shaded blocks. We note that we use a simple design here to have a generic approach and show its benefits, but more complex architectures could lead to better results through better feature learning.\nThe base feature map X is first processed by the encoder Enc, whose role is to learn task-specific features {Enc_{t_i}(X)}_{i=1}^{T} from it with dedicated heads. For each task t, we use a bottleneck-like architecture to keep computation low. As shown in Figure 2, it is composed of a 1×1 convolution to reduce the width of the base feature map by a factor of 4, followed by a 3×3 convolution with the same width. The last operation is a task-specific 1×1 convolution to yield a width K_t adapted to the task t. When learning from multiple auxiliary tasks, the first two layers of the encoder are shared to have a common encoder trunk with task-specific heads, further reducing computation. 
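To make the computational-overhead claim concrete, a small sketch counting encoder parameters for the bottleneck design just described (biases omitted; the base width K = 1024 and the task widths are illustrative assumptions, not values stated in the text):

```python
def encoder_params(k_base, task_widths):
    """Parameter count of the shared-trunk bottleneck encoder.

    Trunk (shared by all tasks): 1x1 conv K -> K/4, then 3x3 conv
    K/4 -> K/4. One task-specific 1x1 conv K/4 -> K_t per task.
    Bias terms are omitted for simplicity.
    """
    mid = k_base // 4
    trunk = k_base * mid + mid * mid * 3 * 3
    heads = sum(mid * kt for kt in task_widths.values())
    return trunk + heads

# Illustrative widths: K = 1024 is an assumption; K_scene = 27 classes,
# K_depth = K/16 and K_normal = 3 * K_depth follow Section 3.1.
widths = {"scene": 27, "depth": 64, "normal": 192}
assert encoder_params(1024, widths) == 924_416
```

Most of the cost sits in the shared trunk; each extra auxiliary task only adds a cheap 1×1 head.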
The task-specific encodings obtained here are then used as input for both the decoder and the predictor, and should therefore contain all necessary information about the auxiliary tasks to be used in the refinement step. We detail how this is achieved in the following.\n\n2.1 Merging of primary and auxiliary representations\n\nSince different tasks bring different kinds of supervision, the previous encodings {Enc_{t_i}(X)}_{i=1}^{T} should contain information complementary to what is learned from the primary task t0. Therefore, it seems useful to combine them with the base feature map X to get more complete representations of the scenes. The second and third parts of the residual auxiliary block are the decoder Dec and the fusion step F. The former takes the output of the encoder and projects it back to the space of the base feature map, so that it can be injected back into the primary path. This is done for each task t separately with a single 1×1 convolution to have the same width as the base feature map (see Figure 2). The fusion step then merges all these task-specific features with the base one in a residual manner to yield the refined feature map X̃, which encodes both primary and auxiliary information:\n\nX̃ = F(X, {Dec_{t_i} ∘ Enc_{t_i}(X)}_{i=1}^{T}) = X + Σ_{i=1}^{T} Dec_{t_i} ∘ Enc_{t_i}(X).   (1)\n\nThe residual formulation allows the base feature map to keep its content while focusing it more on relevant details of the images, yielding better features for the primary task. This feature merging step is key in ROCK to improve upon flat MTL, and these two modules are the main difference between flat and primary MTL. In flat MTL, all tasks are at the same level and are able to benefit each other through shared feature learning only, i.e. their mutual influence is implicit in the models. 
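As a concrete check of Equation (1), a minimal NumPy sketch of the residual fusion; random channel-mixing matrices stand in for the learned 1×1 encoder and decoder convolutions, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, W = 8, 4, 4                        # base feature map size (illustrative)
task_widths = {"depth": 2, "normal": 6}  # K_t per auxiliary task (illustrative)

X = rng.normal(size=(K, H, W))

# A 1x1 convolution acts per pixel on the channel axis, so it can be
# modelled by a (C_out, C_in) matrix applied along that axis.
enc = {t: rng.normal(size=(kt, K)) for t, kt in task_widths.items()}
dec = {t: rng.normal(size=(K, kt)) for t, kt in task_widths.items()}

def conv1x1(weight, feat):
    # (C_out, C_in) x (C_in, H, W) -> (C_out, H, W)
    return np.einsum("oc,chw->ohw", weight, feat)

# Equation (1): the refined map keeps X and adds every decoded encoding.
X_refined = X + sum(conv1x1(dec[t], conv1x1(enc[t], X)) for t in task_widths)

assert X_refined.shape == X.shape  # fusion preserves the base map's shape
```

Because the fusion is a plain residual sum, the refined map drops back to X when the decoded auxiliary features vanish, which is what makes the block easy to insert into an existing architecture.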
By injecting the auxiliary representations into the primary one, we break the symmetry between tasks, effectively favoring the primary one. This task is then explicitly influenced by the auxiliary tasks through fusion, and the model can fully leverage auxiliary supervision.\n\n2.2 Effective MTL from auxiliary supervision\n\nThe last element of the residual auxiliary block is the predictor Pred. Its purpose is to produce predictions {y_{t_i}}_{i=1}^{T} for all auxiliary tasks {t_i}_{i=1}^{T}, so that losses can be applied to learn from the tasks through MTL. Its inputs are the feature encodings {Enc_{t_i}(X)}_{i=1}^{T} from the encoder, whose sizes are already task-specific. Since the predictor might lose information with respect to these features in order to yield the predictions, only the features from the encoder go through the decoder to be merged back into the main path (as illustrated in Figure 2 and formalized in Equation (1)), so that more information is kept for use in the primary task. Therefore, all parameters learned in the predictor are to be thrown away after training, i.e. they are not used for inference (we are only interested in the primary task, the auxiliary tasks being used only to improve its performance). In order to force the model to learn useful information in the encoder and not in the predictor, so that it is kept and merged back, we use a predictor composed of pooling layers only, with no learned parameter:\n\ny_t = Pred_t(Enc_t(X)) = Pool_t(Enc_t(X))   (2)\n\nwith Pool_t a task-specific pooling operation. The kinds of pooling used depend on the tasks considered, as they are directly linked to the nature of the tasks (e.g. scalar or spatial). Once again, this design choice of not having any learned parameter in the predictor is important to distinguish ROCK from flat MTL. It forces the task-specific representation learning to happen within the encoder, and therefore to take part in the refinement step. 
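The parameter-free predictor of Equation (2) reduces to plain pooling. A NumPy sketch of the three pooling heads later instantiated in Section 3.1 (all widths are illustrative except the 27 scene classes; the per-block pooling for normals follows the K_normal = 3·K_depth layout described there):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 4, 4
S, K_depth = 27, 8  # S scene classes as in Section 3.1; K_depth is illustrative

# Scene head: global average pooling over space, one score per class.
scene_enc = rng.normal(size=(S, H, W))
scene_pred = scene_enc.mean(axis=(1, 2))             # shape (S,)

# Depth head: channel-wise average pooling, one spatial map.
depth_enc = rng.normal(size=(K_depth, H, W))
depth_pred = depth_enc.mean(axis=0, keepdims=True)   # shape (1, H, W)

# Normal head: 3 * K_depth channels pooled per block of K_depth channels,
# then the three resulting maps are concatenated.
normal_enc = rng.normal(size=(3 * K_depth, H, W))
normal_pred = normal_enc.reshape(3, K_depth, H, W).mean(axis=1)  # (3, H, W)
```

Since averaging has no learnable weights, any task-specific structure must be present in the encodings themselves, which is exactly the behavior the block is designed to force.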
This is a way to maximize the influence of the auxiliary tasks on the primary one, i.e. to get away from flat MTL.\n\n3 Application to object detection with multi-modal auxiliary information\n\nWe incorporate ROCK for object detection as the primary task, using multi-modal auxiliary information: scene classification, depth prediction and surface normal estimation (Figure 1). The use of annotations from different but related tasks to improve performance is a common approach. Our problem is related to the use of semantic (scene) and geometric (depth and surface normals) information, which have been successfully combined [30, 27, 49, 20, 13, 8], although not directly for object detection and primary MTL. In the context of object detection, combining several additional sources of information, e.g. depth [45, 19, 18, 48, 37], or surface normals and surface curvature [48], has been shown to significantly improve performance. Our problem is slightly different since we only use auxiliary information during training. More related to our context, [39] leverages depth, surface normals and instance contours to pre-train a model on synthetic data through flat MTL, then fine-tunes it for object detection on real target data. We differ from this kind of approach by our MTL strategy, which is driven by an object detection primary task.\nOur approach is closely connected to [23], which uses depth as privileged information to improve object detection. However, both methods differ in the way they use the depth annotations. While [23] directly uses depth to perform object detection, we merge intermediate representations used for depth prediction. This difference leads to an earlier fusion in ROCK, where intermediate representations from all tasks are fused together to benefit from the correlation between tasks, while [23] uses a late fusion of predictions. 
We show in the experiments that we outperform [23] for object detection, validating the relevance of our approach.\n\n3.1 Instantiation of ROCK\n\nWe now describe how ROCK is instantiated for the three tasks utilized with NYUv2 dataset [44]. This dataset contains relatively few images compared to large-scale datasets, e.g. ImageNet [40], so additional supervision might yield a larger gain than on bigger datasets. We also show that ROCK can handle a larger-scale synthetic dataset with a missing annotation modality in Section 4.3.\nFor scene classification, the encoder and predictor follow the common design for classification problems: the last layer of the encoder is a classification layer into K_scene = S = 27 scene classes, and the pooling is a global average pooling, reducing spatial dimensions to a single neuron while keeping the width equal to the number of classes. Error is computed with a cross-entropy loss preceded by a SoftMax layer over the S classes, as is common in classification tasks.\nFor depth estimation, annotations consist of a single depth map for each example. We choose K_depth by dividing the previous width by a factor of 4, as a trade-off between compressing maps toward the final target width of 1 and keeping enough information to provide to the decoder. We then use a channel-wise average pooling reducing the width from K_depth to 1, while keeping spatial dimensions. The spatial resolution of predictions is the same as that of the base feature map, and therefore depends on where the block is inserted into the network. The regression loss used here is a reverse Huber loss [28] in log space, as it has been shown to yield good results for depth prediction.\nThe last surface normal estimation task is similar to the depth estimation one, with the difference that ground-truth maps represent normalized vectors, i.e. they have size 3 and are L2 normalized. 
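As an illustration of the reverse Huber loss mentioned above for the depth task, a minimal NumPy sketch; setting the threshold c to one fifth of the largest absolute residual follows common practice after [28] and is an assumption here, not a value stated in the text:

```python
import numpy as np

def berhu_loss(pred, target):
    """Reverse Huber (berHu) loss on residuals x = pred - target:
    |x| below the threshold c, (x^2 + c^2) / (2c) above it, which is
    continuous at c. The paper applies this in log-depth space."""
    x = np.abs(pred - target)
    c = 0.2 * x.max()  # data-dependent threshold; the 1/5 factor is assumed
    if c == 0:
        return 0.0
    return np.where(x <= c, x, (x ** 2 + c ** 2) / (2 * c)).mean()
```

The loss behaves like an L1 penalty on small residuals and like an L2 penalty on large ones, which is why it is a popular choice for depth regression.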
The structure of the auxiliary block is therefore similar: we apply the same strategy as for the depth estimation task separately for each component of the vectors (i.e. K_normal = 3·K_depth and the channel-wise pooling is applied for each block of K_depth maps) and concatenate the resulting three maps. The loss is different however: it is the sum of negative dot product and L2 losses [8], following an L2 normalization layer.\nOnce intermediate features are extracted from all auxiliary tasks, they are all fused into the base feature map to yield its refined version. As shown in Figure 2 and in Equation (1), we do it here with a generic element-wise addition of all feature maps. The optimal fusion scheme could depend on the nature of the primary and auxiliary tasks. For example, an element-wise product can be interpreted as a gating mechanism [9, 22], which is well suited when the auxiliary task can be interpreted as an attention map. We show in the experiments that this fusion strategy is relevant for leveraging depth information. Finally, more complex fusion models, e.g. full bilinear fusion schemes [15, 2], could certainly be leveraged in our context.\n\n4 Experiments\n\nIn this section, we first present an ablation study of ROCK (Section 4.1) to evaluate the effect of every component, then we compare ROCK to other state-of-the-art object detection methods on NYUv2 dataset (Section 4.2) and using another large-scale dataset (Section 4.3). We finally conduct several further experiments to analyze ROCK in detail (Section 4.4).\n\nExperimental setup. We use NYUv2 dataset [44] for the experiments. It is composed of an official train/test split with 795 and 654 images respectively. For model analysis and the ablation study, we further divide the train set into new train and val sets of 673 and 122 images respectively, taking care that images from the same video sequence are all put into the same set. 
We then train our model on the train and val sets and evaluate it on the official test split for comparison with the state of the art. Object detection is performed on the same 19 object classes as [19, 23] and is evaluated with three common metrics to thoroughly analyze the proposed improvements. We use the SSD framework [31] with a ResNet-50 [21] backbone architecture pre-trained on ImageNet [40]. We train the networks using the Adam optimizer [24] with a batch size of 8 for 30,000 iterations with a learning rate of 5·10^-5, then we lower it to 5·10^-6 and keep training for 10,000 more iterations. We use a standard setup for object detection, the details of which can be found in the supplementary material.\n\n4.1 Ablation study\n\nROCK architecture. We present an ablation study of ROCK in Table 1 to identify the influence of each component. The first row shows results of our baseline, which is a ResNet SSD model. The last row corresponds to our full ROCK model, which yields improvements of 6.4, 1.3 and 2.3 points on the three metrics with respect to the baseline. To break down this gain between using additional supervision and using our auxiliary block, we first consider a simple flat multi-task SSD baseline, presented in the second row of Table 1. The task-specific heads applied on the conv5 feature map are just a 1×1 convolution into S, 1 or 3 maps depending on the task, followed by a global average pooling for scene classification. This model has an improvement of 3.1, 0.2 and 1.2 points with respect to the baseline, which corresponds to the use of additional annotations, i.e. the gain from Multi-Task Learning. Then we use our residual auxiliary block but remove the feature merging step, i.e. the decoder and fusion, while keeping the encoder and predictor the same. This is shown in the third row of Table 1. 
This results in an improvement of 1.4 and 0.2 points on the first two metrics with respect to the flat MTL baseline, which is specifically brought by our auxiliary block, compared to a more common way of doing MTL. The difference between our full ROCK model and this last one, i.e. 1.9, 0.9 and 1.1 points on all metrics, is due to the feature merging step, therefore validating the explicit exploitation of auxiliary features through fusion for object detection.\n\nTable 1: Ablation study of ROCK on NYUv2 val set in average precision (%).\nModel | Auxiliary annotations | Aux. task encoding | Feature merging | mAP@0.5 | mAP@0.75 | mAP@[0.5:0.95]\nDetection baseline | – | – | – | 31.2 | 15.8 | 16.2\nFlat MTL baseline | ✓ | – | – | 34.3 | 16.0 | 17.4\nROCK w/o fusion | ✓ | ✓ | – | 35.7 | 16.2 | 17.4\nROCK | ✓ | ✓ | ✓ | 37.6 | 17.1 | 18.5\n\nContributions of auxiliary tasks. In order to evaluate the importance of each auxiliary task, Table 2 presents another ablation study with results obtained when one task is dropped at a time, both for the flat MTL baseline and ROCK. It appears that all supervisions are leveraged to improve results for both models, with small differences in their contributions.\n\nTable 2: Ablation study of auxiliary supervisions for the flat MTL baseline and ROCK on NYUv2 val set in average precision (%). Auxiliary supervision used is given between parentheses (D: depth, N: surface normals, S: scene class).\nModel | mAP@0.5 | mAP@0.75 | mAP@[0.5:0.95]\nFlat MTL (DN) | 32.1 | 15.9 | 16.1\nFlat MTL (DS) | 32.7 | 15.9 | 16.3\nFlat MTL (NS) | 32.7 | 15.8 | 16.1\nROCK (DN) | 34.0 | 16.6 | 16.8\nROCK (DS) | 33.2 | 16.1 | 16.3\nROCK (NS) | 35.1 | 16.8 | 17.1\n\n4.2 Comparison with state of the art\n\nWe compare ROCK to other state-of-the-art object detection methods on NYUv2 dataset in Table 3. The first two entries ([19, 48]) of the table use detection annotations only. It is noticeable that all other methods, leveraging some kind of additional information, outperform them by a large margin, indicating that augmenting images with more annotations has a large impact on this dataset with few examples. Our ROCK model outperforms Modality Hallucination network [23] by 3.1 points in the same setting, where only depth is used as privileged information. This validates that our approach is able to exploit correlations between depth estimation and object detection. ROCK is also competitive with methods using depth during inference too (i.e. not as privileged information) ([48]), even when they are trained on additional synthetic data ([19]), as displayed on the following two rows.\nUsing more annotations yields significantly better results again, as shown with the use of surface normals and curvature ([18, 48]). When ROCK adds supervision from surface normal estimation and scene classification, results are greatly improved, by 2.7 points with respect to using depth only. By specifically designing the architecture to leverage this auxiliary supervision to improve the primary object detection performance, ROCK even outperforms methods using similar kinds of annotations, but at test-time too, in contrast with the privileged context of ROCK.\n\nTable 3: Detailed detection results on NYUv2 test set in average precision (%) with an IoU threshold of 0.5. Additional supervision used for training is indicated between parentheses (D: depth, N: surface normals, C: surface curvature, S: scene class). 
A ⋆ means that additional information is also used during inference. Methods marked with (+SYN) and (+MLT) are trained with additional synthetic data (see [19] or [18] for details) and pre-trained on MLT dataset [51] respectively.\nModel | mAP | btub | bed | bshelf | box | chair | counter | desk | door | dresser | gbin | lamp | monitor | nstand | pillow | sink | sofa | table | tv | toilet\nRGB R-CNN [19] | 22.5 | 16.9 | 45.3 | 28.5 | 0.7 | 25.9 | 30.4 | 9.7 | 16.3 | 18.9 | 15.7 | 27.9 | 32.5 | 17.0 | 11.1 | 16.6 | 29.4 | 12.7 | 27.4 | 44.1\nRGB R-CNN [48] | 22.8 | 16.2 | 41.0 | 28.0 | 0.7 | 27.4 | 34.6 | 8.4 | 15.2 | 16.9 | 16.5 | 38.4 | 25.9 | 12.1 | 15.0 | 27.5 | 28.2 | 10.6 | 24.9 | 44.8\nModality Hallucination (D) [23] | 34.0 | 16.8 | 62.3 | 41.8 | 2.1 | 37.3 | 43.4 | 15.4 | 24.4 | 39.1 | 22.4 | 46.6 | 30.3 | 30.9 | 27.0 | 42.9 | 46.2 | 22.2 | 34.1 | 60.4\nROCK (D) [ours] | 37.1 | 23.5 | 61.8 | 43.0 | 1.5 | 51.8 | 42.5 | 19.5 | 35.7 | 22.9 | 39.0 | 40.0 | 39.8 | 37.7 | 38.5 | 36.6 | 49.8 | 22.0 | 47.1 | 53.1\nRGB-D R-CNN (D⋆) [48] | 35.5 | 37.8 | 69.9 | 33.9 | 1.5 | 43.2 | 45.0 | 15.7 | 20.5 | 32.9 | 32.9 | 50.9 | 33.7 | 31.6 | 37.3 | 39.0 | 49.0 | 22.9 | 32.2 | 44.9\nRGB-D R-CNN (D⋆+SYN) [19] | 37.3 | 44.4 | 71.0 | 32.9 | 1.4 | 43.3 | 44.0 | 15.1 | 24.5 | 30.4 | 39.4 | 52.6 | 36.5 | 40.0 | 34.8 | 36.1 | 53.9 | 24.4 | 37.5 | 46.8\nPose CNN (DN⋆+SYN) [18] | 38.8 | 36.4 | 70.8 | 35.1 | 3.6 | 47.3 | 46.8 | 14.9 | 23.3 | 38.6 | 43.9 | 52.7 | 37.6 | 40.7 | 42.4 | 43.5 | 51.6 | 22.0 | 38.0 | 47.7\nRGB-Geo R-CNN (DNC⋆) [48] | 39.3 | 41.8 | 75.0 | 36.4 | 2.2 | 46.9 | 46.4 | 15.8 | 23.9 | 37.9 | 39.9 | 37.5 | 53.0 | 41.7 | 44.0 | 44.4 | 51.8 | 26.9 | 34.5 | 47.0\nROCK (DNS) [ours] | 39.8 | 22.7 | 66.9 | 40.0 | 3.2 | 51.5 | 41.8 | 16.6 | 33.7 | 34.7 | 37.4 | 43.3 | 38.8 | 47.0 | 41.7 | 43.8 | 52.1 | 23.7 | 53.7 | 63.3\nROCK (DNS+MLT) [ours] | 46.8 | 45.8 | 77.4 | 40.8 | 3.2 | 60.2 | 48.4 | 30.1 | 35.7 | 42.6 | 43.1 | 54.3 | 39.7 | 60.4 | 45.4 | 44.9 | 63.0 | 32.5 | 55.0 | 66.2\n\n4.3 Pre-training on large-scale MLT dataset and Fine-Tuning\n\nTo test ROCK in a more challenging context, we pre-train it on MLT 
dataset [51], then fine-tune it and evaluate it on NYUv2 dataset. MLT is composed of over 500,000 synthetic indoor images similar to those in NYUv2, and annotated for object detection, depth and surface normal estimation. This makes the dataset well suited for Fine-Tuning (FT) to NYUv2. However, it raises three main challenges. First, the scale of the dataset is several orders of magnitude larger. Second, in contrast to NYUv2, MLT images are synthetic, so FT requires addressing the domain shift between the two datasets. Third, MLT does not provide scene classes, so ROCK has to handle imbalance between pre-trained and newly added tasks when transferred from MLT to NYUv2. We keep the same setup as before but ROCK is learned on 23 slightly different object classes, for 240,000 and 80,000 iterations with the same learning rates. The scene classification branch is removed as there is no annotation for it. Results are presented in the last row of Table 3. FT from MLT to NYUv2 gives an outstanding state-of-the-art performance of 46.8 points, which is an improvement of 7.0 points over directly training on NYUv2. This result shows that ROCK is able to overcome the challenges associated with MLT dataset, in particular to scale to larger datasets and to handle heterogeneous data and missing annotation modalities.\n\n4.4 Further analysis\n\nComplexity of ROCK. We conduct an analysis of the complexity of ROCK with ResNet-50 as backbone architecture. A comparison of numbers of parameters and inference times without and with ROCK is displayed in Table 4. 
It shows that including ROCK in the network yields only a slight increase in complexity (around 17% more parameters and 7% longer inference time), easing its integration into existing models.

Table 4: Complexity of ROCK in parameters and inference time, measured with a ResNet-50 backbone.

Model               Number of parameters  Inference time (ms/image)
Detection baseline  27.8M                 57
ROCK                32.6M                 61

Analysis of architecture of residual auxiliary block. We present several design experiments to validate the architecture of our residual auxiliary block in Table 5. We first verify that the performance improvement of ROCK is not due to the additional parameters introduced in the model. For this, we evaluate ROCK with the complete architecture but with all auxiliary loss weights set to 0, effectively deactivating Multi-Task Learning. The results, shown on the left of Table 5, are close to those of the detection baseline, indicating that the gains come from learning the auxiliary tasks, not from the extra parameters of the block. We study the effect of the fusion operation with depth only on the right part of Table 5. The product appears superior to the addition for this task. Since depth brings geometric information, the product can be interpreted as a spatial selection. However, the design of this component has not been fully explored, and further experiments should yield better results.

Table 5: Analysis of architecture of ROCK on NYUv2 val set in average precision (%).

Model               mAP@0.5  mAP@0.75  mAP@[0.5:0.95]
ROCK                37.6     17.1      18.5
ROCK w/o aux. sup.  30.6     15.6      16.2

Model (depth-only)  mAP@0.5  mAP@0.75  mAP@[0.5:0.95]
El.-wise addition   30.9     14.8      16.1
El.-wise product    32.3     16.2      17.3

Effectiveness of additional supervision. We here analyze the trade-off between collecting more images and adding more annotations to already available images.
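The two fusion operators compared in Table 5 are simple element-wise operations between the detection features and the encoded auxiliary representation. The following is a minimal illustrative sketch (plain Python stand-in with toy values, not the paper's implementation) of why the product can act as a spatial selection while the addition cannot:

```python
# Illustrative stand-in for the fusion step (not the released code):
# x is a 2x2 map of backbone features, z the encoded depth representation.
# With the element-wise product, locations where z is near zero suppress
# the corresponding features (a gating, i.e. spatial selection), whereas
# the element-wise addition merges the two maps without any gating.

def fuse(x, z, mode="product"):
    """Element-wise fusion of two same-sized 2D feature maps."""
    if mode == "product":
        return [[a * b for a, b in zip(rx, rz)] for rx, rz in zip(x, z)]
    if mode == "addition":
        return [[a + b for a, b in zip(rx, rz)] for rx, rz in zip(x, z)]
    raise ValueError(f"unknown fusion mode: {mode}")

x = [[1.0, 2.0], [3.0, 4.0]]   # backbone features (toy values)
z = [[0.0, 1.0], [1.0, 0.0]]   # depth encoding, near zero off the object

print(fuse(x, z, "product"))   # [[0.0, 2.0], [3.0, 0.0]] -- gated selection
print(fuse(x, z, "addition"))  # [[1.0, 3.0], [4.0, 4.0]] -- ungated merge
```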
To this end, we train ROCK on a fraction of the train set and observe how many examples are needed to match the performance that the detection baseline (i.e. without auxiliary supervision) obtains on the whole train set. Results are summarized in Table 6. Training ROCK on around 70% of the train set gives roughly the same results as the detection baseline (depending on which metric is used for comparison), i.e. having the three additional auxiliary tasks to learn from compensates for the loss of 30% of the examples. This result shows that fully annotating available data with more tasks can be helpful in domains where examples are hard to obtain.

Table 6: Effectiveness of additional supervision on NYUv2 val set in average precision (%).

Model                                      mAP@0.5  mAP@0.75  mAP@[0.5:0.95]
Detection baseline (on 100% of train set)  31.2     15.8      16.2
ROCK on 60% of train set                   29.5     12.2      13.9
ROCK on 70% of train set                   32.8     14.5      15.9
ROCK on 80% of train set                   34.7     16.2      17.0

Visualization of results. We show outputs of ROCK on unseen images in Figure 3 for qualitative visual inspection. In the first row, the baseline model wrongly detects a table. The classification of the scene into the bathroom class might decrease the probability of such an object class, in favor of classes seen more often in these scenes. It is noticeable that the detections produced by ROCK agree more with the scene class. In the second row, ROCK detects more objects than the baseline, especially the bed, which is only partially visible. This may be due to the depth prediction, where a clear separation of the bed from the rest of the scene is present, easing its detection. In the last row, the pillows are rather difficult to distinguish as they all have similar colors. The surface normal prediction brings geometric information that makes it easier to discern instances and find their contours, leading to better detections. Additional examples are presented in the supplementary material.

Figure 3: Visualization of outputs. The original images are presented in (a). Outputs of the detection baseline and ROCK are illustrated in (b) and (c) respectively. Column (d) depicts scene classification through heatmaps of ground truth scene classes (i.e. the maps just before global average pooling). Columns (e) and (f) show predictions for depth prediction and surface normal estimation respectively.

Generalization to another dataset. To evaluate the generality of the approach, we run additional experiments on the CityScapes dataset [6], which is composed of outdoor scenes in an urban context, contrasting with NYUv2. We train ROCK on it for object detection (8 object classes), with disparity estimation as auxiliary task, using a similar setup (with the same configuration as for depth estimation, but with 60,000 training iterations). We use the train set for learning and evaluate on the val set. Results are shown in Table 7. Again, ROCK outperforms the detection baseline, by 1.2, 0.9 and 1.1 points on the three metrics, showing the generality of our approach.

Table 7: Results of ROCK on CityScapes dataset [6] val set in average precision (%).

Model               mAP@0.5  mAP@0.75  mAP@[0.5:0.95]
Detection baseline  42.8     19.0      21.6
ROCK                44.0     19.9      22.7

5 Conclusion

In this paper, we introduced ROCK, a generic multi-modal fusion block for deep networks, to tackle the primary MTL context, where auxiliary tasks are leveraged during training to improve performance on a primary task. By designing it with a residual connection and intensive pooling operators in predictors, we maximize the impact and complementarity of the auxiliary representations, benefiting the primary task.
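The residual pattern summarized above can be sketched as follows. This is a hypothetical minimal stand-in (plain Python, with toy linear maps in place of the actual convolutional heads and encoders), only meant to show how the auxiliary prediction is folded back into the main feature stream:

```python
# Hypothetical sketch of the residual auxiliary block pattern (not the
# released model): an auxiliary head predicts its modality from shared
# features, an encoder lifts that prediction back to feature space, and a
# residual connection merges it into the main stream, so the primary
# detector is explicitly impacted by the auxiliary representation.

def aux_head(features):
    # toy stand-in for the auxiliary predictor (e.g. a depth head)
    return [f * 0.5 for f in features]

def aux_encoder(prediction):
    # toy stand-in for the encoder mapping the prediction back to features
    return [p + 1.0 for p in prediction]

def residual_aux_block(features):
    pred = aux_head(features)  # supervised by the auxiliary loss at training
    merged = [f + e for f, e in zip(features, aux_encoder(pred))]  # residual merge
    return merged, pred

merged, pred = residual_aux_block([1.0, 2.0, 3.0])
print(pred)    # [0.5, 1.0, 1.5]
print(merged)  # [2.5, 4.0, 5.5]
```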
We apply ROCK to object detection on NYUv2 dataset and outperform state-of-the-art flat MTL approaches by a large margin. We show that exploiting additional supervision with ROCK yields the same performance as having around 30% additional examples with a single-task model, encouraging full annotation of available data in contexts where images are difficult to gather. By pre-training our model on a large-scale synthetic dataset with different classes and auxiliary modalities, we set a new state of the art on NYUv2 and demonstrate that ROCK is flexible and can adapt to various challenging setups. However, the design of ROCK has been kept fairly simple to prove the relevance of the approach. In particular, the fusion operation could be studied more thoroughly.

References

[1] Hossein Azizpour, Ali Sharif Razavian, Josephine Sullivan, Atsuto Maki, and Stefan Carlsson. Factors of transferability for a generic ConvNet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(9):1790–1802, 2016.

[2] Hedi Ben-younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. MUTAN: Multimodal Tucker fusion for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

[3] Hakan Bilen and Andrea Vedaldi. Integrated perception with recurrent multi-task neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 235–243, 2016.

[4] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.
[6] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[7] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems (NIPS), 2016.

[8] Thanuja Dharmasiri, Andrew Spek, and Tom Drummond. Joint prediction of depths, normals and surface curvature from RGB images using CNNs. In Proceedings of the International Conference on Intelligent Robots and Systems (IROS), 2017.

[9] Alain Droniou and Olivier Sigaud. Gated autoencoders with tied input weights. In Proceedings of the International Conference on Machine Learning (ICML), 2013.

[10] Thibaut Durand, Taylor Mordan, Nicolas Thome, and Matthieu Cord. WILDCAT: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[11] Thibaut Durand, Nicolas Thome, and Matthieu Cord. MANTRA: Minimum maximum latent structural SVM for image classification and ranking. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2713–2721, 2015.

[12] Thibaut Durand, Nicolas Thome, and Matthieu Cord. WELDON: Weakly supervised learning of deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4743–4752, 2016.

[13] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2650–2658, 2015.
[14] Huan Fu, Mingming Gong, Chaohui Wang, and Dacheng Tao. A compromise principle in deep monocular depth estimation. arXiv preprint arXiv:1708.08267, 2017.

[15] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.

[16] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.

[17] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.

[18] Saurabh Gupta, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Inferring 3D object pose in RGB-D images. arXiv preprint arXiv:1502.04652, 2015.

[19] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from RGB-D images for object detection and segmentation. In Proceedings of the IEEE European Conference on Computer Vision (ECCV), pages 345–360, 2014.

[20] Christian Hane, Lubor Ladicky, and Marc Pollefeys. Direction matters: Depth estimation with a surface normal classifier. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 381–389, 2015.

[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[22] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[23] Judy Hoffman, Saurabh Gupta, and Trevor Darrell. Learning with side information through modality hallucination.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 826–834, 2016.

[24] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

[25] Iasonas Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

[27] Lubor Ladicky, Jianbo Shi, and Marc Pollefeys. Pulling things out of perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 89–96, 2014.

[28] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the IEEE International Conference on 3D Vision (3DV), pages 239–248, 2016.

[29] Jun Li, Reinhard Klein, and Angela Yao. A two-streamed network for estimating fine-scaled depth maps from single RGB images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3372–3380, 2017.

[30] Beyang Liu, Stephen Gould, and Daphne Koller. Single image depth estimation from predicted semantic labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1253–1260, 2010.

[31] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, and Scott Reed. SSD: Single shot multibox detector. In Proceedings of the IEEE European Conference on Computer Vision (ECCV), 2016.
[32] Yongxi Lu, Abhishek Kumar, Shuangfei Zhai, Yu Cheng, Tara Javidi, and Rogério Feris. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1131–1140, 2017.

[33] Diogo Luvizon, David Picard, and Hedi Tabia. 2D/3D pose estimation and action recognition using multitask deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[34] Elliot Meyerson and Risto Miikkulainen. Beyond shared hierarchies: Deep multitask learning through soft layer ordering. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

[35] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3994–4003, 2016.

[36] Taylor Mordan, Nicolas Thome, Gilles Henaff, and Matthieu Cord. End-to-end learning of latent deformable part-based representations for object detection. International Journal of Computer Vision (IJCV), pages 1–21, 2018.

[37] Seong-Jin Park, Ki-Sang Hong, and Seungyong Lee. RDFNet: RGB-D multi-level residual feature fusion for indoor semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

[38] Dmitry Pechyony and Vladimir Vapnik. On the theory of learning with privileged information. In Advances in Neural Information Processing Systems (NIPS), pages 1894–1902, 2010.

[39] Zhongzheng Ren and Yong Jae Lee. Cross-domain self-supervised multi-task feature learning using synthetic imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[40] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[41] Viktoriia Sharmanska, Novi Quadrianto, and Christoph Lampert. Learning to rank using privileged information. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013.

[42] Viktoriia Sharmanska, Novi Quadrianto, and Christoph Lampert. Learning to transfer privileged information. arXiv preprint arXiv:1410.0389, 2014.

[43] Zhiyuan Shi and Tae-Kyun Kim. Learning and refining of privileged information-based RNNs for action recognition from depth sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[44] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Proceedings of the IEEE European Conference on Computer Vision (ECCV), 2012.

[45] Luciano Spinello and Kai Arras. Leveraging RGB-D data: Adaptive fusion and domain adaptation for object detection. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 4469–4474, 2012.

[46] Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: Similarity control and knowledge transfer. Journal of Machine Learning Research (JMLR), 16:2023–2049, 2015.

[47] Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5-6):544–557, 2009.

[48] Chu Wang and Kaleem Siddiqi. Differential geometry boosts convolutional neural networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 51–58, 2016.
[49] Peng Wang, Xiaohui Shen, Zhe Lin, Scott Cohen, Brian Price, and Alan L Yuille. Towards unified depth and semantic prediction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2800–2809, 2015.

[50] Yongxin Yang and Timothy Hospedales. Deep multi-task representation learning: A tensor factorisation approach. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.

[51] Yinda Zhang, Shuran Song, Ersin Yumer, Manolis Savva, Joon-Young Lee, Hailin Jin, and Thomas Funkhouser. Physically-based rendering for indoor scene understanding using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.