{"title": "Weakly Supervised Instance Segmentation using the Bounding Box Tightness Prior", "book": "Advances in Neural Information Processing Systems", "page_first": 6586, "page_last": 6597, "abstract": "This paper presents a weakly supervised instance segmentation method that consumes training data with tight bounding box annotations. The major difficulty lies in the uncertain figure-ground separation within each bounding box since there is no supervisory signal about it. We address the difficulty by formulating the problem as a multiple instance learning (MIL) task, and generate positive and negative bags based on the sweeping lines of each bounding box. The proposed deep model integrates MIL into a fully supervised instance segmentation network, and can be derived by the objective consisting of two terms, i.e., the unary term and the pairwise term. The former estimates the foreground and background areas of each bounding box while the latter maintains the unity of the estimated object masks. The experimental results show that our method performs favorably against existing weakly supervised methods and even surpasses some fully supervised methods for instance segmentation on the PASCAL VOC dataset.", "full_text": "Weakly Supervised Instance Segmentation using the\n\nBounding Box Tightness Prior\n\nCheng-Chun Hsu1\u2217\n\nKuang-Jui Hsu2\u2217\n\nChung-Chi Tsai2\n\nYen-Yu Lin1,3\n\n1Academia Sinica\n\nYung-Yu Chuang1,4\n\n2Qualcomm Technologies, Inc.\n\n3National Chiao Tung University\n\n4National Taiwan University\n\nhsu06118@citi.sinica.edu.tw\n\n{kuangjui, chuntsai}@qti.qualcomm.com\n\nlin@cs.nctu.edu.tw\n\ncyy@csie.ntu.edu.tw\n\nAbstract\n\nThis paper presents a weakly supervised instance segmentation method that con-\nsumes training data with tight bounding box annotations. The major dif\ufb01culty lies\nin the uncertain \ufb01gure-ground separation within each bounding box since there is\nno supervisory signal about it. 
We address the difficulty by formulating the problem as a multiple instance learning (MIL) task, and generate positive and negative bags based on the sweeping lines of each bounding box. The proposed deep model integrates MIL into a fully supervised instance segmentation network, and can be derived from an objective consisting of two terms, i.e., the unary term and the pairwise term. The former estimates the foreground and background areas of each bounding box while the latter maintains the unity of the estimated object masks. The experimental results show that our method performs favorably against existing weakly supervised methods and even surpasses some fully supervised methods for instance segmentation on the PASCAL VOC dataset. The code is available at https://github.com/chengchunhsu/WSIS_BBTP.

1 Introduction

Instance-aware semantic segmentation [1, 2, 3, 4], or instance segmentation for short, has attracted increasing attention in computer vision. It involves object detection and semantic segmentation, and aims to jointly detect and segment object instances of interest in an image. As a crucial component of image understanding, instance segmentation facilitates many high-level vision applications ranging from autonomous driving [5, 6] and pose estimation [7, 8, 9] to image synthesis [10].
Significant progress on instance segmentation, such as [7, 11, 12, 13, 14], has been made based on convolutional neural networks (CNNs) [15], where instance-aware feature representations and segmentation models are derived simultaneously. Despite their effectiveness, these CNN-based methods require a large number of training images with instance-level pixel-wise annotations. Collecting this kind of training data is labor-intensive because of the effort of delineating the contours of object instances, as mentioned in the previous work [16]. 
The expensive annotation cost has restricted the applicability of instance segmentation.
In this work, we address this issue by proposing a CNN-based method where a model is learned for instance segmentation using training data with bounding box annotations. Compared with the contour, the bounding box of an object instance can be labeled by simply clicking the four outermost points of that instance, leading to a greatly reduced annotation cost. There is only one method for instance segmentation [17] that adopts box-level annotated training data. However, the method [17] is not end-to-end trainable. It utilizes GrabCut [18] and MCG proposals [19] to compile pseudo ground truth before learning a fully supervised model for instance segmentation. We deal with the unavailability of ground-truth object instances via multiple instance learning (MIL), and integrate the proposed MIL formulation into fully supervised CNN model training. 

* indicates equal contributions

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Bounding box annotations, e.g., the yellow rectangle, conform to the tightness prior. Namely, each object, such as the horse here, should touch the four sides of its bounding box. Each crossing line whose endpoints are on the opposite sides of a box must contain at least one pixel belonging to the object. We leverage this property and represent horizontal and vertical crossing lines, e.g., the green ones, as positive bags in multiple instance learning. In contrast, any horizontal and vertical lines, e.g., the red ones, that do not overlap any bounding boxes yield negative bags. Note that, although example bags here are visualized as rectangles, they are for illustrative purposes only. In practice, they are horizontal/vertical lines with 1-pixel heights/widths.
In this way, the instance-aware feature\nrepresentation and segmentation model are learned by taking the latent ground truth into account.\nIt turns out that the proposed method is end-to-end trainable, resulting in substantial performance\nimprovement. In addition, compared to the existing method [17], our method is proposal-free during\ninference and considerably speeds up instance segmentation.\nOur method is inspired by the tightness prior inferred from bounding boxes [20], as shown in Figure 1.\nThe tightness property states that an object instance should touch all four sides of its bounding box.\nWe leverage this property to design an MIL formulation. Training data in MIL are composed of\npositive and negative bags. A positive bag contains at least one positive instance while a negative bag\ncontains only negative instances. As shown in Figure 1, a vertical or a horizontal crossing line within\na bounding box yields a positive bag because it must cover at least one pixel belonging to the object.\nA horizontal or vertical line that does not pass through any bounding boxes forms a negative bag.\nWe integrate the MIL formulation into an instance segmentation network, and carry out end-to-end\nlearning of an instance segmentation model with box-level annotated training data.\nThe major contributions of this work are summarized as follows. First, we present a CNN-based\nmodel for weakly supervised instance segmentation. The task of inferring the object from its bounding\nbox is formulated via MIL in our method. To the best of our knowledge, the proposed method offers\nthe \ufb01rst end-to-end trainable algorithm that learns the instance segmentation model using training data\nwith bounding box annotations. Second, we develop the MIL formulation by leveraging the tightness\nproperty of bounding boxes, and incorporate it in instance segmentation. 
In this way, latent pixel-wise ground truth, the object instance feature representation, and the segmentation model can be derived simultaneously. Third, existing instance segmentation algorithms work based on the detected object bounding box, so the quality of detected object boxes limits the performance of instance segmentation. We address this issue by using DenseCRF [21] to refine the instance masks. Finally, when evaluated on the popular instance segmentation benchmark, the PASCAL VOC 2012 dataset [22], our method achieves substantially better performance than the state-of-the-art box-level instance segmentation method [17].

2 Related Work

Weakly supervised semantic segmentation. CNNs [15] have demonstrated their effectiveness for joint feature extraction and non-linear classifier learning. The state-of-the-art semantic segmentation methods [23, 24, 25, 26, 27, 28, 29, 30, 31] are developed based on CNNs. Owing to the large capacity of CNNs, a vast amount of training data with pixel-wise annotations is required for learning a semantic segmentation model without overfitting. To address this issue, different types of weak annotations have been adopted to save manual effort in training data labeling, such as synthetic annotations [32], bounding boxes [33, 17, 34], scribbles [35, 36, 37, 38], points [39, 40], and image-level supervision [41, 42, 43, 44, 45, 46, 47, 48, 49].
Our method adopts bounding box annotations, which strike a compromise between instance segmentation accuracy and annotation cost. Different from existing methods [33, 17, 34] that also use box-level annotations, our method targets the more challenging instance segmentation problem rather than semantic segmentation. In addition, our method makes the most of bounding boxes by exploring their tightness properties to improve instance segmentation.
Fully supervised instance segmentation. 
SDS [1] extends the R-CNN [50] scheme to instance segmentation by jointly considering the box proposals and segment proposals pre-generated by selective search [51]. The early methods, such as [2, 3, 4], follow this strategy for instance segmentation. However, their performance heavily depends on the quality of the pre-generated proposals. Furthermore, extracting region proposals [51] is typically computationally intensive. The region proposal network (RPN) in Faster R-CNN [52] offers an efficient way to generate high-quality proposals. Recent instance segmentation methods [53, 54, 7, 11, 12, 13, 14] benefit from RPN and an object detector like Faster R-CNN. These methods first detect object bounding boxes and then perform instance segmentation on each detected bounding box. Another branch of instance segmentation methods segments each instance directly without referring to the detection results, such as [55, 56, 57, 58, 59, 5, 60]. Methods of this category rely on pixel-wise annotated training data, which may restrict their applicability.
Weakly supervised instance segmentation. There exist few deep-learning-based methods that use weak annotations for instance segmentation, such as box-level annotations [17], image-level annotations [61], and image groups [62]. Khoreva et al.'s method [17] is the first and only CNN-based method that consumes box-level supervisory training data for instance segmentation. However, their two-stage method is not end-to-end trainable. In its first stage, instance-level pixel-wise pseudo ground truth is generated using GrabCut [18] and the MCG proposals [19]. In the second stage, the generated pseudo ground truth is used to train their instance segmentation algorithm, called DeepLabbox, to complete instance segmentation. 
Thus, the performance of their method [17] is bounded by the pseudo ground truth generators, i.e., GrabCut [18] and the MCG proposals [19]. Different from [17], the methods in [61, 62] respectively utilize image-level supervision and self-supervised learning within an image group to carry out weakly supervised instance segmentation.
In contrast, our method handles the uncertainty of inferring object segments from their bounding boxes via the proposed MIL formulation. The proposed method is end-to-end trainable so that pseudo ground truth estimation, instance feature representation learning, and segmentation model derivation can mutually refer to and facilitate each other during training. Furthermore, our method does not require object proposals during inference. As a result, the proposed method performs favorably against the competing method [17] in both accuracy and efficiency. Some frameworks [63, 64, 65] use training data with a mixture of a few pixel-wise annotations and abundant box-level annotations. In contrast, our framework does not rely on pixel-wise annotations.

3 Proposed Method

The proposed method is introduced in this section. We first give an overview of the method, and then describe the proposed MIL formulation, followed by specifying the objective for network optimization. Finally, the adopted segment refinement techniques and the implementation details are provided.

3.1 Overview

We are given a set of training data with bounding box annotations, D = {(I_n, B_n)}_{n=1}^{N}, where N is the number of images, I_n is the nth image, and B_n is the box-level annotation for I_n. Suppose the image I_n contains K_n bounding boxes. Its annotations are of the form B_n = {(b_n^k, y_n^k)}_{k=1}^{K_n}, where the location label b_n^k is a 4-dimensional vector representing the location and size of the kth box, the class label y_n^k is a C-dimensional vector specifying the category of the corresponding box, and C is the number of classes. 
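The box-level training annotations described above can be packed into a small container type. A minimal sketch follows; the names (`BoxAnnotation`, `TrainingSample`, `make_sample`), the `(x1, y1, x2, y2)` box convention, and the one-hot label encoding are illustrative assumptions of this sketch, not details specified by the paper:

```python
from dataclasses import dataclass
from typing import List

import numpy as np

NUM_CLASSES = 20  # C; 20 object classes for PASCAL VOC


@dataclass
class BoxAnnotation:
    """One annotated instance: location b_n^k and class label y_n^k."""
    box: np.ndarray    # shape (4,): (x1, y1, x2, y2) of the tight bounding box
    label: np.ndarray  # shape (C,): one-hot category vector


@dataclass
class TrainingSample:
    """One element (I_n, B_n) of the training set D."""
    image: np.ndarray           # H x W x 3 image I_n
    boxes: List[BoxAnnotation]  # the K_n annotations B_n


def make_sample(image, raw_boxes, class_ids):
    """Pack raw (x1, y1, x2, y2) boxes and integer class ids into a sample."""
    anns = []
    for b, c in zip(raw_boxes, class_ids):
        label = np.zeros(NUM_CLASSES)
        label[c] = 1.0
        anns.append(BoxAnnotation(box=np.asarray(b, dtype=float), label=label))
    return TrainingSample(image=image, boxes=anns)
```

Note that no per-pixel mask appears anywhere in this structure; the segmentation supervision must be manufactured from the boxes alone, which is what the MIL formulation below does.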
In this work, we aim to learn a CNN-based model in an end-to-end fashion for instance segmentation by using the training set D.

Figure 2: Overview of our model, which has two branches, the detection and segmentation branches. The Mask R-CNN framework is adopted. First, we extract the features with ResNet-101, and then use RPN and RoI Align to generate the region feature for each detected bounding box. The detection branch consists of the fully connected layers and two losses, the box classification loss, L_cls, and the box regression loss, L_reg. In the segmentation branch, we first estimate the object instance map inside each detected bounding box, and then generate the positive and negative bags using the bounding box tightness prior for MIL. The MIL loss, L_mil, is optimized with the bags and the additional structural constraint. The model can be jointly optimized by these three losses in an end-to-end manner with only the box-level annotations.

The overview of our method is given in Figure 2, where the proposed MIL formulation is integrated into a backbone network for fully supervised instance segmentation. In this work, we choose Mask R-CNN [7] as the backbone network owing to its effectiveness, though our method is flexible enough to team up with any existing fully supervised instance segmentation model. Learning the Mask R-CNN [7] model requires instance-level pixel-wise annotations. To save the annotation cost and fit the problem setting, we adopt the bounding box tightness prior for handling weakly annotated training data, and formulate it as an MIL objective to derive the network. The tightness prior is utilized to augment the training data D with a set of bag-structured data for MIL. 
With the augmented training data, the network for instance segmentation can be optimized using the following loss function:

L(w) = L_cls(w) + L_reg(w) + L_mil(w),  (1)

where L_cls and L_reg are the box classification and regression losses, respectively, and L_mil is the proposed MIL loss. In Eq. (1), the classification loss L_cls assesses the accuracy of the box classification task, while the regression loss L_reg measures the goodness of the box coordinate regression. Both L_cls and L_reg can be directly optimized using the box-level annotations. We implement L_cls and L_reg as proposed in [7], and omit the details here. The proposed loss L_mil enables end-to-end network optimization by using training data with box-level annotations instead of the original pixel-level ones. The details of L_mil are provided in the following.

3.2 Proposed MIL formulation

We describe how to leverage the tightness prior inferred from bounding boxes to yield the bag-instance data structure for MIL. The bounding box of an object instance is the smallest rectangle enclosing the whole instance. Thus, the instance must touch the four sides of its bounding box, and there is no overlap between the instance and the region outside its bounding box. These two properties are used to construct the positive and negative bags, respectively, in this work.
A crossing line of a bounding box is a line whose two endpoints locate on the opposite sides of the box. Pixels on a crossing line compose a positive bag of the corresponding box category, since the line has at least one pixel belonging to the object within the box. On the contrary, pixels on a line that does not pass through any bounding box of the category yield a negative bag because no object of that category is present outside the bounding boxes. In this work, for each bounding box, we collect all horizontal and vertical crossing lines to produce the positive bags. 
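The crossing-line construction can be sketched directly. In the sketch below, `positive_bags`, `negative_bags`, and the `margin` parameter are hypothetical names, and taking the nearest out-of-box rows and columns is a simplified stand-in for the random sampling of outside lines described in the text (it also assumes, for brevity, that no other box of the same category intersects those lines):

```python
import numpy as np


def positive_bags(box):
    """All horizontal and vertical crossing lines of a tight box.

    Each bag is the list of pixel coordinates on one 1-pixel-wide line whose
    endpoints lie on opposite sides of the box, so by the tightness prior it
    must contain at least one object pixel. With inclusive integer pixel
    coordinates this yields (H + 1) + (W + 1) lines; the paper counts H + W.
    """
    x1, y1, x2, y2 = box
    bags = []
    for y in range(y1, y2 + 1):          # horizontal crossing lines
        bags.append([(x, y) for x in range(x1, x2 + 1)])
    for x in range(x1, x2 + 1):          # vertical crossing lines
        bags.append([(x, y) for y in range(y1, y2 + 1)])
    return bags


def negative_bags(box, image_w, image_h, margin=1):
    """Lines near and outside the box: rows above/below, columns left/right.

    Pixels on such a line lie entirely outside the box, so the bag contains
    no pixel of the boxed object.
    """
    x1, y1, x2, y2 = box
    bags = []
    for y in (y1 - margin, y2 + margin):  # rows just outside the box
        if 0 <= y < image_h:
            bags.append([(x, y) for x in range(image_w)])
    for x in (x1 - margin, x2 + margin):  # columns just outside the box
        if 0 <= x < image_w:
            bags.append([(x, y) for y in range(image_h)])
    return bags
```

In practice the bags never need to be materialized as coordinate lists: because every positive bag is a full row or column of the score map, the per-bag maxima reduce to row- and column-wise max pooling, as noted in Section 3.3.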
A positive bag can be denoted as ˆb+ = {p_i}, where p_i is the ith pixel on the line. We also randomly sample the same number of negative bags, each of which corresponds to a line near and outside the bounding box. Similarly, a negative bag is expressed as ˆb− = {p_i}.
We augment the given training set D = {(I_n, B_n)}_{n=1}^{N} with the generated bag-structured data. The resultant augmented dataset is denoted as ˆD = {(I_n, B_n, ˜B_n)}_{n=1}^{N}, where ˜B_n = {ˆB+_{n,k}, ˆB−_{n,k}}_{k=1}^{K_n} contains all positive and negative bags of the kth bounding box b_n^k in image I_n. Specifically, the positive set ˆB+_{n,k} = {ˆb+_{n,k,ℓ}}_{ℓ=1}^{H_{n,k}+W_{n,k}} consists of H_{n,k} + W_{n,k} bags, each of which corresponds to a vertical or horizontal crossing line within the bounding box, where H_{n,k} and W_{n,k} are the height and width of the bounding box b_n^k, respectively. Similarly, we have the set of negative bags ˆB−_{n,k} = {ˆb−_{n,k,ℓ}}_{ℓ=1}^{H_{n,k}+W_{n,k}}.

3.3 MIL loss

This section specifies the design of the loss function L_mil with the augmented dataset ˆD. As shown in Figure 2, given the augmented training set ˆD, the detection branch of the network can be optimized by using the ground-truth bounding boxes through the two loss functions L_cls and L_reg. On the other hand, the segmentation branch predicts an instance score map S_{n,k} ∈ [0, 1]^{W_{n,k}×H_{n,k}} for each bounding box b_n^k with respect to its object category. To train this branch, we develop the loss function L_mil based on MIL and the augmented bag-structured data. Considering each bounding box b, we use its corresponding sets of positive and negative bags, ˆB+ and ˆB−, respectively. 
We omit the image and box indices here for the sake of brevity. The MIL loss L_mil consists of two terms, defined as

L_mil(S; ˆB+, ˆB−) = ψ(S; ˆB+, ˆB−) + φ(S),  (2)

where the unary term ψ enables MIL by enforcing the tightness constraints with the training bags ˆB+ and ˆB− on S, and the pairwise term φ imposes the structural constraint on S for maintaining the integrity of the object. The unary term ψ and the pairwise term φ in Eq. (2) are described below.
Unary term. Given the sets of positive and negative bags ˆB+ and ˆB−, the unary term enforces the tightness constraints of the bounding boxes on the prediction map S. It also helps the network predict better instance masks. As discussed in Section 3.2, a positive bag must contain at least one pixel inside the object segment, so we encourage the maximal prediction score among all pixels in a positive bag to be as close to 1 as possible. In contrast, no pixel in a negative bag belongs to an object of the category, and hence we minimize the maximal prediction score among all pixels in a negative bag. We implement the two observations by defining the unary term as

ψ(S; ˆB+, ˆB−) = Σ_{ˆb ∈ ˆB+} −log P(ˆb) + Σ_{ˆb ∈ ˆB−} −log(1 − P(ˆb)),  (3)

where P(ˆb) = max_{p ∈ ˆb} S(p) is the estimated probability of the bag ˆb being positive, and S(p) is the score value of the map S at position p. The probability P(ˆb) in Eq. (3) can be efficiently computed by column- and row-wise maximum pooling, which can be implemented easily without changing the network architecture.
Pairwise term. Using the unary term alone is prone to segmenting merely the discriminative parts of an object rather than the whole object, as pointed out in previous work [66, 67].
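To make Eq. (2) concrete, here is a minimal NumPy sketch of the MIL loss for a single box, with the positive-bag maxima computed by row- and column-wise max pooling on the box's score map. The helper name `mil_loss`, the `lam` weight, the `eps` numerical guard, and the omission of the negative bags are simplifications of this sketch, not part of the paper's formulation (Eq. (2) simply adds the two terms, and the negative bags live outside the box):

```python
import numpy as np


def mil_loss(S, eps=1e-6, lam=1.0):
    """Sketch of L_mil for one box, given its predicted score map S (H x W).

    Every row and column of S is one positive crossing-line bag; P(bag) is
    the bag's maximum score, so the per-bag probabilities of Eq. (3) reduce
    to row-/column-wise max pooling. Negative bags (lines outside the box,
    pushing scores toward 0) are omitted here for brevity.
    """
    row_max = S.max(axis=1)  # P(b+) for the H horizontal crossing lines
    col_max = S.max(axis=0)  # P(b+) for the W vertical crossing lines
    unary = -np.log(np.concatenate([row_max, col_max]) + eps).sum()

    # Pairwise term of Eq. (4): squared score differences between
    # 4-connected neighbors, propagating high scores from discriminative
    # parts to their neighborhoods.
    pairwise = ((S[1:, :] - S[:-1, :]) ** 2).sum() \
             + ((S[:, 1:] - S[:, :-1]) ** 2).sum()
    return unary + lam * pairwise
```

A map that is confidently foreground along every crossing line and spatially smooth (e.g., all ones) incurs nearly zero loss, while an all-zero map is heavily penalized by the unary term; this is exactly the pressure that pushes at least one high score onto every row and column of the box.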
The pairwise term addresses this issue and uses the structural constraint to enforce piece-wise smoothness in the predicted instance masks. In this way, the high scores of the discriminative parts can be propagated to their neighborhoods. We explicitly model the structural information by defining the following pairwise term:

φ(S) = Σ_{(p, p') ∈ ε} ||S(p) − S(p')||²,  (4)

where ε is the set containing all neighboring pixel pairs.
The proposed MIL loss in Eq. (2) is differentiable and convex, so our network can be efficiently optimized by stochastic gradient descent. The gradients of all loss functions in Eq. (1) with respect to the optimization variables can be derived straightforwardly, and hence are omitted here.

Table 1: Evaluation of instance segmentation results from different methods. The field "Sup." indicates the supervision type, I for image-level labels, B for box-level labels, and M for mask-level labels. An asterisk * indicates the use of the training set containing both the MS COCO and PASCAL VOC 2012 datasets. (Only the mAPr_0.5 column could be reliably recovered from the extraction; the mAPr_0.25, mAPr_0.7, and mAPr_0.75 columns are omitted here, and the mAPr_0.75 margins over SDI are quoted in Section 4.2.)

method          publication  Sup.  mAPr_0.5
M. R-CNN [7]    ICCV'17      M     67.9
SDS [1]         ECCV'14      M     49.7
PRM [61]        CVPR'18      I     26.8
DetMCG          -            B     19.6
BoxMask         -            B     34.5
BoxMCG          -            B     37.1
SDI [17]        CVPR'17      B     44.8
Ours            -            B     58.9
M. R-CNN [7]    ICCV'17      M*    67.5
SDI [17]        CVPR'17      B*    46.4
Ours            -            B*    60.1

Figure 3: The ablation study for component contributions on the PASCAL VOC 2012 dataset.

3.4 DenseCRF for object instance refinement

A major limitation of detection-based methods for instance segmentation is that they suffer from inaccurately detected object boxes, since instance segmentation is performed within the detected bounding boxes. To address this issue, we use DenseCRF [21] for refinement. During inference, for a detected bounding box, we first generate the score map for the detected box using the trained segmentation branch. Then, the predicted score map of the box is pasted onto a map ˆS of the same size as the input image according to the box's location. For setting up DenseCRF, we employ the map ˆS as the unary term and use the color and pixel location differences with the bilateral kernel to construct the pairwise term. After optimization using mean field approximation, DenseCRF produces the final instance mask. As shown in the experiments, DenseCRF compensates for inaccurate detection by the object detector.

3.5 Implementation details

We implement the proposed method using PyTorch. ResNet-101 [68] serves as the backbone network. It is pre-trained on the ImageNet dataset [69], and is updated during the optimization of Eq. (1). The same network architecture is used in all experiments. During training, the network is optimized on a machine with four GeForce GTX 1080 Ti GPUs. The batch size, learning rate, weight decay, momentum, and the number of iterations are set to 8, 10^-2, 10^-4, 0.9, and 22k, respectively. We choose ADAM [70] as the optimization solver because of its fast convergence. 
For data augmentation, following the setting used in Mask R-CNN, we horizontally flip each image with probability 0.5, and randomly resize each image so that the shorter side is larger than 800 pixels and the longer side is smaller than 1,333 pixels, while maintaining the original aspect ratio.

4 Experimental Results

The proposed method is evaluated in this section. First, we describe the adopted dataset and evaluation metrics. Then, the performance of the proposed method and the competing methods is compared and analyzed. Finally, ablation studies on each proposed component and several baseline variants are conducted.

4.1 Dataset and evaluation metrics

Dataset. The Pascal VOC 2012 [22] dataset is widely used in the literature on instance segmentation [1, 17, 61]. This dataset consists of 20 object classes. Following the previous work [17], we use the augmented Pascal VOC 2012 dataset [71], which contains a total of 10,582 training images.

The authors from Academia Sinica and the universities in Taiwan completed the experiments on the datasets.

Figure 4: Examples of the segmentation results with our method, panels (a)-(f). The top row shows the input images while the bottom row shows the corresponding segmentation results. In the segmentation results, different instances are indicated by different colors. Their categories are identified by text labels.

In addition, for a fair comparison with the SDI method [17], we also train our method using additional training images from the MS COCO [72] dataset and report the performance. Following the same setting adopted in SDI [17], we only select images with objects whose categories are covered by the Pascal VOC 2012 dataset and whose bounding box areas are larger than 200 pixels. The number of selected images is 99,310. We add the selected images to the training set. 
Note that only box-level annotations are used in all the experiments.
Evaluation metrics. The standard evaluation metric, the mean average precision (mAP) [1], is adopted. Following the same evaluation protocol as in [17, 61], we report mAP with four IoU (intersection over union) thresholds, including 0.25, 0.5, 0.7, and 0.75, denoted as mAPr_k, where k ∈ {0.25, 0.5, 0.7, 0.75}.

4.2 Comparison with the state-of-the-art methods

We compare the proposed method with several state-of-the-art methods, including PRM [61], SDI [17], Mask R-CNN [7], and SDS [1]. The results are reported in Table 1, where the field "Sup." indicates the supervision type, I for image-level labels, B for box-level labels, and M for mask-level labels. In addition, the asterisk * in the field "Sup." indicates the use of the mixed training set which combines the MS COCO and PASCAL VOC 2012 datasets. Among the compared methods, Mask R-CNN [7] and SDS [1] are fully supervised with object masks. SDI [17] and PRM [61] are weakly supervised. The supervision type of SDI is the same as that of our method with box-level annotations, while PRM [61] uses only image-level annotations.
As shown in Table 1, our proposed method significantly outperforms SDI [17] by large margins, around 14.1% in mAPr_0.5 and 5.3% in mAPr_0.75, on the Pascal VOC 2012 dataset. Note that SDI [17] is the state-of-the-art instance segmentation method using box-level supervision. When training on both the MS COCO and Pascal VOC 2012 datasets, our method also outperforms SDI by large margins, around 13.7% in mAPr_0.5 and 2.7% in mAPr_0.75. It is also worth mentioning that our method even achieves remarkably better performance than the fully supervised method SDS [1]. 
Thus, we believe that the proposed MIL loss is genuinely useful in the box-level supervision setting for instance segmentation and is also likely to benefit other tasks that require instance masks or use box-level annotations. In addition, since we use Mask R-CNN as the backbone, training it with the ground-truth masks provides the upper bound of our method. From Table 1, our method achieves comparable performance with Mask R-CNN in terms of mAPr_0.5, showing that the proposed method is effective at utilizing the information provided by bounding boxes. Note that our method, adopting the MIL formulation, tends to highlight the discriminative parts of objects, while Mask R-CNN with mask-level annotations emphasizes the whole objects. The IoUs between the ground truth and the discriminative regions are often larger than 0.25 but less than 0.5. This is why our method slightly outperforms Mask R-CNN in mAPr_0.25, but falls behind in the others.
Figure 4 shows some instance segmentation examples using the proposed method. Our approach can produce high-quality results even in challenging scenarios. All examples in Figure 4 except (b) exhibit occlusions between instances. Our method can distinguish instances even when they are close to or occluded by each other. Examples in (a), (b), and (e) show that our method performs well with objects of different scales. In (d), there are objects of several classes and complex shapes. Our method can segment multi-class instances with complex shapes very well, showing that the proposed method can delineate fine-detailed object structures.

Figure 5: Failure examples produced by our proposed method, panels (a)-(f).

Some failure cases are shown in Figure 5. As the proposed method is a weakly supervised method, it may be misled by noisy co-occurrence patterns and have problems when separating different object instances of the same class. 
For example, in (a) and (b), segments of small objects are incomplete due to the unclear boundaries in low-resolution regions. In (c) and (d), different instances are wrongly merged. In (e) and (f), inaccurate object contours are segmented due to inter-instance similarity and cluttered scenes.

4.3 Ablation studies

We conduct four types of ablation studies: baseline studies, an analysis of the contribution of each proposed component, the robustness to inaccurate annotations, and the performance under different annotation costs.

4.3.1 Baseline studies

To better investigate how well the proposed method utilizes box-level annotations, we construct three baseline methods which also perform instance segmentation using box-level annotations. Like our model, all of them are based on Mask R-CNN. The first baseline, DetMCG, trains Mask R-CNN without the segmentation branch, so the network only detects bounding boxes of objects. During inference, given a detected bounding box, we retrieve the MCG proposal with the highest IoU with the detected box as the output instance mask. The second baseline, BoxMask, trains Mask R-CNN by regarding the ground-truth boxes as the object instance masks. In the third baseline, BoxMCG, for each ground-truth bounding box, we retrieve the MCG proposal with the highest IoU with it and regard the proposal as the pixel-level annotation for training Mask R-CNN. Table 1 reports the performance of the three baseline methods. Their performance is much worse than that of the proposed method, showing that our method utilizes the box-level annotations much more effectively.
For DetMCG, the information about the object inside the bounding box is not explored, and the detection and segmentation branches are not jointly optimized. BoxMask uses bounding boxes as masks. It completely ignores the fine object contours and regards many background pixels as objects during training, which can be misleading. 
In BoxMCG, although object contours are considered, inaccurate object proposals could contain many false positives and false negatives and hurt the performance when serving as the training masks. Therefore, it can only achieve sub-optimal results. In contrast, in the proposed method, the detection and segmentation branches are jointly trained, while the tightness property is utilized for distinguishing foreground pixels from background pixels. In addition, spatial coherence is enforced to better maintain the integrity of objects. Therefore, our method achieves much better results than the naive baselines.

4.3.2 Component contributions

We analyze the contribution of each proposed component, including the terms in the proposed MIL objective (Eq. (2)) and DenseCRF, on the Pascal VOC 2012 dataset. The results in mAPr0.5 and mAPr0.7 are reported in Figure 3. First, for the MIL loss in Eq. (2), the performance gain from adding the pairwise term φ is significant in both measures. This is because the term helps better discover whole objects instead of focusing only on their discriminative parts. DenseCRF can moderately improve the performance by correcting inaccurate box detection and refining the object shapes.

Figure 6: Robustness to inaccurate annotations.

Figure 7: Study of annotation costs.

4.4 Robustness to inaccurate annotations

We study the robustness and sensitivity of the proposed method to the accuracy of bounding boxes by expanding/contracting the boxes and evaluating the performance under different expansion/contraction ratios on the Pascal VOC 2012 dataset. Figure 6 shows the results. Note that all the results are evaluated in mAPr0.5. The proposed method is quite robust to small change ratios from −5% to +5%. In fact, by slightly contracting the bounding boxes, e.g. by −5%, we can reduce noisy pixels in each positive bag, resulting in even better performance.
However, excessive expansion or contraction ratios lead to unreliable positive and negative bags, and thus produce sub-optimal results.

4.5 Performance versus different annotation costs

According to [73], the instance-level and box-level annotation costs on the Pascal VOC 2012 dataset are 239.7 and 38.1 seconds per image, respectively. We train Mask R-CNN using instance-level annotations, and limit the amount of annotations so that the annotation budget is comparable to 1.0×, 1.5×, and 2.0× of the box-level annotation cost, respectively. As shown in Figure 7, the results of the three settings on the Pascal VOC 2012 dataset are 48.3%, 53.5% and 59.9% in mAPr0.5, respectively. The first two results fall behind our method by margins of 10.6% and 5.4%, respectively, while the last one surpasses ours by 1.0% but at 2× the annotation cost. Therefore, with the same annotation cost, our method outperforms Mask R-CNN because less training data leads to overfitting for Mask R-CNN.

5 Conclusion

In this paper, we propose a weakly supervised instance segmentation method that can be trained with only box-level annotations. To achieve figure-ground separation using only the information provided by bounding boxes, we integrate the MIL formulation into a fully supervised instance segmentation network. We explore the tightness prior of the bounding boxes to effectively generate the positive and negative bags for MIL. By integrating spatial coherence and DenseCRF, the integrity and shape of the objects can be better preserved. Experiments show that the proposed method outperforms existing weakly supervised methods and even surpasses some fully supervised methods for instance segmentation on the PASCAL VOC 2012 dataset. With the simpler box-level annotations, the proposed method could expand the utility of instance segmentation.
In addition, we believe that the proposed scheme can also benefit other tasks using box-level annotations.

Acknowledgments

This work was funded in part by Qualcomm through a Taiwan University Research Collaboration Project and also supported in part by the Ministry of Science and Technology (MOST) under grants 107-2628-E-001-005-MY3 and 108-2634-F-007-009, and MOST Joint Research Center for AI Technology and All Vista Healthcare under grant 108-2634-F-002-004.

References

[1] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.

[2] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.

[3] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015.

[4] K. Li, B. Hariharan, and J. Malik. Iterative instance segmentation. In CVPR, 2016.

[5] B. De Brabandere, D. Neven, and L. Van Gool. Semantic instance segmentation for autonomous driving. In CVPR Workshop, 2017.

[6] Z. Zhang, S. Fidler, and S. Urtasun. Instance-level segmentation for autonomous driving with deep densely connected MRFs. In CVPR, 2016.

[7] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. In ICCV, 2017.

[8] R. A. Güler, N. Neverova, and I. Kokkinos. DensePose: Dense human pose estimation in the wild. In CVPR, 2018.

[9] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy. PersonLab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In ECCV, 2018.

[10] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro.
High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR, 2018.

[11] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.

[12] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In CVPR, 2018.

[13] L.-C. Chen, A. Hermans, G. Papandreou, F. Schroff, P. Wang, and H. Adam. MaskLab: Instance segmentation by refining object detection with semantic and direction features. In CVPR, 2018.

[14] D. Novotny, S. Albanie, D. Larlus, and A. Vedaldi. Semi-convolutional operators for instance segmentation. In ECCV, 2018.

[15] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.

[16] K.-J. Hsu, Y.-Y. Lin, and Y.-Y. Chuang. Weakly supervised saliency detection with a category-driven map generator. In BMVC, 2017.

[17] A. Khoreva, R. Benenson, J. Hosang, M. Hein, and B. Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR, 2017.

[18] C. Rother, V. Kolmogorov, and A. Blake. GrabCut - Interactive foreground extraction using iterated graph cuts. TOG, 2004.

[19] J. Pont-Tuset, P. Arbelaez, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation. TPAMI, 2017.

[20] V. Lempitsky, P. Kohli, C. Rother, and T. Sharp. Image segmentation with a bounding box prior. In ICCV, 2009.

[21] P. Krahenbuhl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NeurIPS, 2011.

[22] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010.

[23] G. Lin, F. Liu, A. Milan, C. Shen, and I. Reid. RefineNet: Multi-path refinement networks for dense prediction.
TPAMI, 2019.

[24] Z. Zhang, X. Zhang, C. Peng, X. Xue, and J. Sun. ExFuse: Enhancing feature fusion for semantic segmentation. In ECCV, 2018.

[25] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi. ESPNet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In ECCV, 2018.

[26] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.

[27] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang. DenseASPP for semantic segmentation in street scenes. In CVPR, 2018.

[28] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal. Context encoding for semantic segmentation. In CVPR, 2018.

[29] W.-C. Hung, Y.-H. Tsai, X. Shen, Z. Lin, K. Sunkavalli, X. Lu, and M.-H. Yang. Scene parsing with global context embedding. In ICCV, 2017.

[30] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.

[31] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI, 2017.

[32] Y.-C. Chen, Y.-Y. Lin, M.-H. Yang, and J.-B. Huang. CrDoCo: Pixel-level domain transfer with cross-domain consistency. In CVPR, 2019.

[33] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, 2015.

[34] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. In ICCV, 2015.

[35] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR, 2016.

[36] P. Vernaza and M. Chandraker. Learning random-walk label propagation for weakly-supervised semantic segmentation. In CVPR, 2017.

[37] M.
Tang, A. Djelouah, F. Perazzi, Y. Boykov, and C. Schroers. Normalized cut loss for weakly-supervised CNN segmentation. In CVPR, 2018.

[38] M. Tang, F. Perazzi, A. Djelouah, I. Ayed, C. Schroers, and Y. Boykov. On regularized losses for weakly-supervised CNN segmentation. In ECCV, 2018.

[39] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei. What's the point: Semantic segmentation with point supervision. In ECCV, 2016.

[40] R. Qian, Y. Wei, H. Shi, J. Li, J. Liu, and T. Huang. Weakly supervised scene parsing with point-based distance metric learning. In AAAI, 2019.

[41] B. Jin, M. Segovia, and S. Süsstrunk. Webly supervised semantic segmentation. In CVPR, 2017.

[42] R. Briq, M. Moeller, and J. Gall. Convolutional simplex projection network for weakly supervised semantic segmentation. In BMVC, 2018.

[43] Q. Hou, P.-T. Jiang, Y. Wei, and M.-M. Cheng. Self-erasing network for integral object attention. In NeurIPS, 2018.

[44] Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. Huang. Revisiting dilated convolution: A simple approach for weakly- and semi-supervised semantic segmentation. In CVPR, 2018.

[45] J. Ahn and S. Kwak. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In CVPR, 2018.

[46] Z. Huang, X. Wang, J. Wang, W. Liu, and J. Wang. Weakly-supervised semantic segmentation network with deep seeded region growing. In CVPR, 2018.

[47] R. Fan, Q. Hou, M.-M. Cheng, G. Yu, R. Martin, and S.-M. Hu. Associating inter-image salient instances for weakly supervised semantic segmentation. In ECCV, 2018.

[48] T. Shen, G. Lin, C. Shen, and I. Reid. Bootstrapping the performance of webly supervised semantic segmentation. In CVPR, 2018.

[49] X. Wang, S. You, X. Li, and H. Ma. Weakly-supervised semantic segmentation by iteratively mining common object features. In CVPR, 2018.

[50] R. Girshick, J. Donahue, T.
Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.

[51] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. IJCV, 2013.

[52] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.

[53] Z. Hayder, X. He, and M. Salzmann. Boundary-aware instance segmentation. In CVPR, 2017.

[54] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.

[55] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.

[56] S. Kong and C. Fowlkes. Recurrent pixel embedding for instance grouping. In CVPR, 2018.

[57] X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, and S. Yan. Proposal-free network for instance-level object segmentation. TPAMI, 2018.

[58] Y. Liu, S. Yang, B. Li, W. Zhou, J. Xu, H. Li, and Y. Lu. Affinity derivation and graph merge for instance segmentation. In ECCV, 2018.

[59] A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, and C. Rother. InstanceCut: From edges to instances with MultiCut. In CVPR, 2017.

[60] Y.-C. Hsu, Z. Xu, Z. Kira, and J. Huang. Learning to cluster for proposal-free instance segmentation. In IJCNN, 2018.

[61] Y. Zhou, Y. Zhu, Q. Ye, Q. Qiu, and J. Jiao. Weakly supervised instance segmentation using class peak response. In CVPR, 2018.

[62] K.-J. Hsu, Y.-Y. Lin, and Y.-Y. Chuang. Deep instance co-segmentation by co-peak search and co-saliency detection. In CVPR, 2019.

[63] R. Hu, P. Dollar, K. He, T. Darrell, and R. Girshick. Learning to segment every thing. In CVPR, 2018.

[64] K.-J. Hsu, Y.-Y. Lin, and Y.-Y. Chuang. Augmented multiple instance regression for inferring object contours in bounding boxes. TIP, 2014.

[65] F.-J. Chang, Y.-Y. Lin, and K.-J. Hsu.
Multiple structured-instance learning for semantic segmentation with uncertain training data. In CVPR, 2014.

[66] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.

[67] L. Wang, H. Lu, Y. Wang, and M. Feng. Learning to detect salient objects with image-level supervision. In CVPR, 2017.

[68] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[69] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and F.-F. Li. ImageNet large scale visual recognition challenge. IJCV, 2015.

[70] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2014.

[71] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.

[72] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

[73] M. Bellver, A. Salvador, J. Torres, and X. Giro-i-Nieto. Budget-aware semi-supervised semantic and instance segmentation. In CVPR Workshop, 2019.