{"title": "Foreground Clustering for Joint Segmentation and Localization in Videos and Images", "book": "Advances in Neural Information Processing Systems", "page_first": 1695, "page_last": 1704, "abstract": "This paper presents a novel framework in which video/image segmentation and localization are cast into a single optimization problem that integrates information from low level appearance cues with that of high level localization cues in a very weakly supervised manner. The proposed framework leverages two representations at different levels, exploits the spatial relationship between bounding boxes and superpixels as linear constraints and simultaneously discriminates between foreground and background at bounding box and superpixel level. Different from previous approaches that mainly rely on discriminative clustering, we incorporate a foreground model that minimizes the histogram difference of an object across all image frames. Exploiting the geometric relation between the superpixels and bounding boxes enables the transfer of segmentation cues to improve localization output and vice-versa. Inclusion of the foreground model generalizes our discriminative framework to video data where the background tends to be similar and thus, not discriminative. We demonstrate the effectiveness of our unified framework on the YouTube Object video dataset, Internet Object Discovery dataset and Pascal VOC 2007.", "full_text": "Foreground Clustering for Joint Segmentation and\n\nLocalization in Videos and Images\n\nAbhishek Sharma\n\nNavinfo Europe Research, Eindhoven, NL \u2217\n\nkein.iitian@gmail.com\n\nAbstract\n\nThis paper presents a novel framework in which video/image segmentation and\nlocalization are cast into a single optimization problem that integrates information\nfrom low level appearance cues with that of high level localization cues in a very\nweakly supervised manner. 
The proposed framework leverages two representations at different levels, exploits the spatial relationship between bounding boxes and superpixels as linear constraints, and simultaneously discriminates between foreground and background at the bounding box and superpixel level. Different from previous approaches that mainly rely on discriminative clustering, we incorporate a foreground model that minimizes the histogram difference of an object across all image frames. Exploiting the geometric relation between the superpixels and bounding boxes enables the transfer of segmentation cues to improve localization output and vice-versa. Inclusion of the foreground model generalizes our discriminative framework to video data, where the background tends to be similar and thus not discriminative. We demonstrate the effectiveness of our unified framework on the YouTube Object video dataset, the Internet Object Discovery dataset and Pascal VOC 2007.\n\n1 Introduction\n\nLocalizing and segmenting objects in images and videos is a fundamental problem in computer vision, since it facilitates many high level vision tasks such as object recognition, action recognition (49) and natural language description (17), to name a few. Thus, any advancement in segmentation and localization algorithms is automatically transferred to the performance of high level tasks (17). With the success of deep networks, supervised top-down segmentation methods obtain impressive performance by learning on pixel-level (28; 34) or bounding-box labelled datasets (10; 12). Taking into account the cost of obtaining such annotations, weakly supervised methods have gathered a lot of interest lately (16; 7; 20). 
In this paper, very weak supervision means that labels are given only at the image or video level, and our aim is to jointly segment and localize the foreground object given such weak supervision.\n\nWhile great progress has been made in the image, video and 3D domains (20; 5; 40) using weak supervision, most existing works are tailored to a specific task. Although UberNet (19) achieves impressive results on multiple image perception tasks by training a deep network, we are not aware of any similar universal network in the weak supervision domain that performs on both image and video data. Part of the difficulty lies in defining a loss function that can explicitly model or exploit the similarity between similar tasks using weak supervision only, while simultaneously learning multiple classifiers. More specifically, we address the following challenge: how can we use semantic localization cues of bounding boxes to guide segmentation, and leverage low level appearance cues at the superpixel level to improve localization?\n\n∗work done before\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nOur key idea is as follows: if an object localization classifier considers some bounding box to be background, this should, in principle, inform the segmentation classifier that the superpixels in this bounding box are more likely to be background, and vice-versa. We frame this idea of online knowledge transfer between the two classifiers as linear constraints. More precisely, our unified framework, based on discriminative clustering (2), avoids making hard decisions and instead couples the two discriminative classifiers by linear constraints. 
Contrary to the conventional approach of multi-task learning (6; 24), where two (or more) similar tasks are jointly learned using a shared representation, we instead leverage two representations and enable the transfer of the information implicit in these representations during a single-shot optimization scheme.\n\nOur work, although similar in spirit to prior work that embeds pixels and parts in a graph (50; 29), goes a step further by modelling video data as well. To this end, we incorporate a foreground model in our discriminative clustering framework. Often, a video is shot centered around an object with similar background frames, which limits the performance of discriminative clustering, as shown in our experiments later. The proposed foreground model includes a histogram matching term in our objective function that minimizes the discrepancy between the segmented foregrounds across images and thereby brings a notion of similarity into a purely discriminative model. We call our method Foreground Clustering and make the source code publicly available.2\n\nOur contributions are as follows: 1) We propose a novel framework that simultaneously learns to localize and segment common objects in images and videos. By doing so, we provide a principled mathematical framework to group these individual problems in a unified framework. 2) We introduce a foreground model within discriminative clustering by including a histogram matching term. 3) We show a novel mechanism to exploit the spatial relation between a superpixel and a bounding box in an unsupervised way that improves the output of cosegmentation and colocalization significantly on three datasets. 
4) We provide state of the art performance on the YouTube Object segmentation dataset and convincing results on Pascal VOC 2007 and the Internet Object Discovery dataset.\n\n2 Related work\n\nWe only describe and relate some of the existing literature briefly in this section, since each of the four problems is already well explored separately on its own.\n\nSupervised Setting. Numerous works (23; 47) have used off-the-shelf object detectors to guide the segmentation process. Ladicky et al. (23) used object detections as higher order potentials in a CRF-based segmentation system by encouraging all pixels in the foreground of a detected object to share the same category label as that of the detection. Alternatively, segmentation cues have been used before to help detection (27). Hariharan et al. (11) train a CNN to simultaneously detect and segment by classifying image regions. All these approaches require ground truth annotation, either in the form of bounding boxes or segmented objects, or do not exploit the similarity between the two tasks.\n\nWeakly Supervised Setting. Weak supervision in the image domain dates back to the image cosegmentation (37; 14; 31; 18) and colocalization problems, where one segments or localizes common foreground regions out of a set of images. These methods can be broadly classified into discriminative (14; 15; 42; 16) and similarity based approaches. Similarity based approaches (37; 44; 39; 38) seek to segment out the common foreground by learning the foreground distribution or matching it across images (38; 45; 9). All these methods are designed for only one of the two tasks. Recent work based on CNNs either completely ignores these complementary cues (20) or uses them in a two stage decision process, either as a pre-processing step (36) or for post-processing (27). However, it is difficult to recover from errors introduced in the initial stage. 
This paper advocates an alternative to the prevalent trend of either ignoring these complementary cues or placing a clear separation between segmentation and localization in the weakly supervised scenario.\n\nVideo Segmentation. The existing literature on unsupervised video segmentation (25; 32; 51) is mostly based on graphical models, with the exception of Brox & Malik (4). Most notably, Papazoglou & Ferrari (32) first obtain motion saliency maps and then refine them using Markov Random Fields. Recent success in video segmentation comes mainly from the semi-supervised setting (5). Semi-supervised methods are either tracking-based or rely on foreground propagation algorithms. Typically, one initializes such methods with ground truth annotations in the first frame; they thus differ from the main goal of this paper, which is to segment videos on the fly.\n\n2https://github.com/Not-IITian/Foreground-Clustering-for-Joint-segmentation-and-Localization\n\nVideo Localization. Video localization is a relatively new problem where the end goal is to localize the common object across videos. Prest et al. (35) tackle this problem by proposing candidate tubes and selecting the best one. Joulin et al. (16) leverage discriminative clustering and propose an integer quadratic problem to solve video colocalization. Kwak et al. (22) go a step further and simultaneously tackle object discovery as well as localization in videos. Jerripothula et al. (13) obtain state of the art results by first pooling different saliency maps and then choosing the most salient tube. Most of these approaches (22; 13) leverage a large set of videos to discriminate or build a foreground model. In contrast, we segment and localize the foreground separately on each video, making our approach much more scalable.\n\nDiscriminative Clustering for Weak Supervision. Our work builds on the discriminative framework (2), first applied to cosegmentation in Joulin et al. 
(14), and later extended to colocalization (42; 16) and other tasks (3; 30). The success of such discriminative frameworks is strongly tied to the availability of a diverse set of images, where hard negative mining with enough negative (background) data separates the foreground. Our model instead explicitly models the foreground by minimizing the difference of histograms across all image frames. The idea of histogram matching originated in image cosegmentation (37; 46). However, we are the first to highlight its need in discriminative clustering and its connection to modelling video data.\n\n3 Background\n\nIn this section, we briefly review the two main components of the discriminative frameworks (14; 42; 16) used for cosegmentation and colocalization, as we build on them:\n\nDiscriminative clustering. We first consider a simple scenario where we are given some labelled data with a label vector y ∈ {0, 1}n and a d-dimensional feature for each sample, concatenated into an n × d feature matrix X. We assume that the matrix X is centered (if not, we obtain a centered matrix after multiplying with the usual centering matrix Π = In − (1/n)11T). The problem of finding a linear classifier with a weight vector α in Rd and a scalar b amounts to solving:\n\nmin α∈Rd ||y − Xα − b1||^2 + β||α||^2,    (1)\n\nfor the square loss and a regularization parameter β. There exists a closed form solution for Eq. 1, given by α = (XTX + βId)−1XTy. However, in the weakly supervised case, the label vector y is latent and the optimization needs to be performed over both the labels and the weight vector of the classifier. This is equivalent to obtaining a labelling based on the best linearly separable classifier:\n\nmin y∈{0,1}n, α∈Rd ||y − Xα − b1||^2 + β||α||^2.    (2)\n\nXu et al. 
(48) first proposed the idea of using a supervised classifier (an SVM) to perform unsupervised clustering. Later, (2) showed that, with the square loss, the weight vector can be eliminated in closed form, making the problem equivalent to\n\nmin y∈{0,1}n yTDy,    (3)\n\nwhere\n\nD = In − X(XTX + βId)−1XT.    (4)\n\nNote that Id is an identity matrix of dimension d, and D is positive semi-definite. This formulation also allows us to kernelize the features. For more details, we refer to (2).\n\nLocal Spatial Similarity. To enforce spatial consistency, a similarity term is combined with the discriminative term yTDy. The similarity term yTLy is based on the idea of the normalised cut (41) that encourages nearby superpixels with similar appearance to have the same label. Thus, a similarity matrix Wi is defined to represent local interactions between superpixels of the same image. For any pair (a, b) of superpixels in image i, with positions pa, pb and color vectors ca, cb:\n\nWiab = exp(−λp ||pa − pb||^2 − λc ||ca − cb||^2)\n\nWe set λp = .001 and λc = .05 empirically. The normalised Laplacian matrix is given by:\n\nL = IN − Q−1/2WQ−1/2,    (5)\n\nwhere IN is an identity matrix of dimension N, and Q is the corresponding diagonal degree matrix, with Qii = Σj Wij.\n\n4 Foreground Clustering\n\nNotation. We use italic Roman or Greek letters (e.g., x or γ) for scalars, bold italic fonts (e.g., y = (y1, . . . , yn)T) for vectors, and calligraphic ones (e.g., C) for matrices.\n\n4.0.1 Foreground Model\n\nConsider an image I composed of n pixels (or superpixels), and divided into two regions, foreground and background. These regions are defined by the binary vector y in {0, 1}n such that yj = 1 when (super)pixel number j belongs to the foreground, and yj = 0 otherwise. Let us consider the histogram of some features (e.g., colors) associated with the foreground pixels of I. 
This histogram is a discrete empirical representation of the feature distribution in the foreground region and can always be represented by a vector h in Nd, where d is the number of its bins, and hi counts the number of pixels with values in bin number i. The actual feature values associated with I can be represented by a binary matrix H in {0, 1}d×n such that Hij = 1 if the feature associated with pixel j falls in bin number i of the histogram, and Hij = 0 otherwise. With this notation, the histogram associated with I is written as h = Hy. Now consider two images I1 and I2, and the associated foreground indicator vectors yk, histograms hk, and data matrices Hk, so that hk = Hkyk (k = 1, 2). We can measure the discrepancy between the segmentations of the two images by the (squared) norm of the histogram difference, i.e.,\n\n||H1y1 − H2y2||^2 = yTFy,    (6)\n\nwhere y = (y1; y2) is the vector of {0, 1}2n obtained by stacking the vectors y1 and y2, and F = [H1, −H2]T[H1, −H2]. This formulation is easily extended to multiple images (46). Since the discrepancy term in Eq. 6 is a squared norm, the resulting matrix F is positive semi-definite by construction.\n\n4.1 Optimization Problem for one Image\n\nFor the sake of simplicity and clarity, let us consider a single image, and a set of m bounding boxes per image, with a binary vector z in {0, 1}m such that zi = 1 when bounding box i in {1, . . . , m} is in the foreground and zi = 0 otherwise. We oversegment the image into n superpixels and define a global superpixel binary vector y in {0, 1}n such that yj = 1 when superpixel number j in {1, . . . , n} is in the foreground and yj = 0 otherwise. We also compute a normalized saliency map M (with values in [0, 1]), and define s = −log(M). 
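For concreteness, the histogram term of Eq. 6 can be sketched in a few lines of NumPy. The images, bin assignments and labellings below are hypothetical toy data, not the paper's actual features:

```python
import numpy as np

def bin_indicator_matrix(bin_ids, d):
    """Binary matrix H in {0,1}^{d x n}: H[i, j] = 1 iff pixel j falls in bin i."""
    n = len(bin_ids)
    H = np.zeros((d, n))
    H[bin_ids, np.arange(n)] = 1
    return H

# Toy example: two images with n = 4 superpixels and d = 3 color bins each.
d, n = 3, 4
H1 = bin_indicator_matrix(np.array([0, 1, 1, 2]), d)
H2 = bin_indicator_matrix(np.array([1, 0, 2, 2]), d)

# F = [H1, -H2]^T [H1, -H2], so that y^T F y = ||H1 y1 - H2 y2||^2 (Eq. 6).
A = np.hstack([H1, -H2])
F = A.T @ A

# Check the identity on an arbitrary foreground labelling y = (y1; y2).
y1 = np.array([1, 1, 0, 0])
y2 = np.array([0, 1, 1, 0])
y = np.concatenate([y1, y2])
lhs = y @ F @ y
rhs = np.sum((H1 @ y1 - H2 @ y2) ** 2)
assert np.isclose(lhs, rhs)

# F is positive semi-definite: all eigenvalues are >= 0 (up to numerical error).
assert np.min(np.linalg.eigvalsh(F)) > -1e-9
```

The same construction extends to multiple images by stacking further blocks [Hk, −Hl] for each image pair.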
Given these inputs and appropriate feature maps for superpixels and bounding boxes (defined later in detail), we want to recover the latent variables z and y simultaneously by learning two coupled classifiers in different feature spaces. However, to constrain the two classifiers together, we need another indexing of superpixels, detailed next. For each bounding box, we maintain a set Si of its superpixels and define the corresponding indicator vector xi in {0, 1}|Si| such that xij = 1 when superpixel j of bounding box i is in the foreground, and xij = 0 otherwise. Note that for every bounding box i, xi (superpixel indexing at the bounding box level) and y (indexing at the image level) are related by an indicator projection matrix Pi of dimensions |Si| × n such that Pij is 1 if superpixel j is present in bounding box i and 0 otherwise.\n\nWe propose to combine the objective functions defined for cosegmentation and colocalization, and thus define:\n\nE(y, z) = yT(Ds + κFs + αLs)y + µ yTss + λ(zTDbz + ν zTsb),    (7)\n\nGiven the feature matrices for superpixels and bounding boxes, the matrices Ds and Db are computed by Eq. 4, whereas Ls is computed by Eq. 5. We define the features and the values of the scalars later in the implementation details. The quadratic term zTDbz penalizes the selection of bounding boxes whose features are not easily linearly separable from those of the other boxes. Similarly, minimizing yTDsy encourages the most discriminative superpixels to be in the foreground. Minimizing the similarity term yTLsy encourages nearby similar superpixels to have the same label, whereas the linear terms yTss and zTsb encourage the selection of salient superpixels and bounding boxes, respectively. 
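The building block shared by both discriminative terms, the matrix D of Eq. 4, is straightforward to compute. The sketch below uses random centered features as stand-in data (the actual features are the SIFT and AlexNet descriptors described in the implementation details):

```python
import numpy as np

def discriminative_matrix(X, beta):
    """D = I_n - X (X^T X + beta I_d)^{-1} X^T  (Eq. 4), for centered features X (n x d)."""
    n, d = X.shape
    return np.eye(n) - X @ np.linalg.solve(X.T @ X + beta * np.eye(d), X.T)

rng = np.random.default_rng(0)
n, d, beta = 50, 8, 1e-1
X = rng.standard_normal((n, d))
X = X - X.mean(axis=0)          # center the features, as assumed in Section 3

D = discriminative_matrix(X, beta)

# D is symmetric positive semi-definite, so y^T D y (Eq. 3) is a valid
# relaxed clustering cost for any labelling y in [0, 1]^n.
assert np.allclose(D, D.T)
assert np.min(np.linalg.eigvalsh(D)) > -1e-9
```

In the full model, Ds and Db are built this way from the superpixel and bounding box features respectively, possibly after kernelizing the features as noted in Section 3.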
We now impose appropriate constraints and define the optimization problem as follows:\n\nmin y,z E(y, z) under the constraints:\n\nγ|Si| zi ≤ Σ_{j∈Si} xij ≤ η|Si| zi,  for i = 1, . . . , m,    (8)\n\nΣ_{i:j∈Si} xij ≤ Σ_{i:j∈Si} zi,  for j = 1, . . . , n,    (9)\n\nPi y = xi,  for i = 1, . . . , m,    (10)\n\nz1 + · · · + zm = 1.    (11)\n\nThe constraint (8) guarantees that when a bounding box is in the background, so are all its superpixels, and when it is in the foreground, a proportion of at least γ and at most η of its superpixels is in the foreground as well, with 0 ≤ γ ≤ η ≤ 1. We set γ = .3 and η = .9. The constraint (9) guarantees that a superpixel is in the foreground for only one box, the foreground box that contains it (only one of the variables zi in the summation can be equal to 1). For each bounding box i, the constraint (10) relates the two indexings of superpixels, xi and y, by the projection matrix Pi defined earlier. The constraint (11) guarantees that there is exactly one foreground box per image. We illustrate the above optimization problem with a toy example of 1 image and 2 bounding boxes in the appendix at the end.\n\nIn equations (7)-(11), we obtain an integer quadratic program. Thus, we relax the boolean constraints, allowing y and z to take any value between 0 and 1. The optimization problem becomes convex, since all the matrices defined in equation (7) are positive semi-definite (14) and the constraints are linear. Given the solution to the quadratic program, we obtain the bounding box by choosing the zi with the highest value. For the superpixels, since the values of x (and thus y) are upper bounded by z, we first normalize y and then round the values to 0 (background) and 1 (foreground) (see Appendix).\n\nWhy Joint Optimization. We briefly visit the intuition behind joint optimization. 
Note that the superpixel variables x and y are bounded by the bounding box variable z in Eq. 8 and 9. If the discriminative localization part considers some bounding box zi to be background and sets it close to 0, this, in principle, enforces on the segmentation part that the superpixels in this bounding box are more likely to be background (= 0), as defined by the right hand side of Eq. 8: Σ_{j∈Si} xij ≤ η|Si| zi. Similarly, the segmentation cues influence the final score of the zi variable if the superpixels inside this bounding box are more likely to be foreground.\n\n5 Implementation Details\n\nWe use superpixels obtained from the publicly available implementation of (43). This reduces the size of the matrices Ds, Ls and allows us to optimize at the superpixel level. Using the publicly available implementation of (1), we generate 30 bounding boxes for each image. We use (26) to compute off-the-shelf image saliency maps. To model video data, we obtain motion saliency maps using the open source implementation of (32). The final saliency map for videos is obtained by max-pooling over the two saliency maps. We make a 3D histogram based on RGB values, with 7 bins for each color channel, to build the foreground model F in Eq. 6.\n\nFeatures. Following (14), we densely extract SIFT features at every 4 pixels and kernelize them using the Chi-square distance. For each bounding box, we extract a 4096-dimensional feature vector using AlexNet (21) and L2-normalize it.\n\nHyperparameters. Following (42), we set ν, the balancing scalar for box saliency, to .001 and κ, λ = 10. To set α, we follow (14) and use α = .1 for foreground objects with fairly uniform colors, and α = .001 for objects with sharp color variations. Similarly, we set the scalar 
Similarly, we set scalar\n\u00b5 = .01 for salient datasets and = .001 otherwise.\n\n5\n\n\fTable 1: Video Colocalization Comparison on Youtube Objects dataset.\n\nMetric LP(Sal.)\n28\n\nCorLoc.\n\n(16) QP(Loc.) QP(Loc.)+Seg Ours(full)\n54\n\n35\n\n49\n\n31\n\n(22)\n56\n\n(13)int\n52\n\n(13)ext\n58\n\nTable 2: Video segmentation Comparison on Youtube Objects dataset.\n\nMetric LP(Sal.) QP(Seg.) QP(Seg. +Loc.) Ours(full)\n61\n\nIoU.\n\n43\n\n49\n\n56\n\nFST (32)\n53\n\n6 Experimental Evaluation\n\nThe goal of this section is two fold: First, we propose several baselines that help understand the\nindividual contribution of various cues in the optimization problem de\ufb01ned in section 4.1. Second,\nwe empirically validate and show that learning the two problems jointly signi\ufb01cantly improve\nthe performance over learning them individually and demonstrate the effectiveness of foreground\nmodel within the discriminative framework. Given the limited space, we focus more on localization\nexperiments because we believe that the idea of improving the localization performance on the \ufb02y\nusing segmentation cues is quite novel compared to the opposite case. We evaluate the performance\nof our framework on three benchmark datasets: YouTube Object Dataset (35), Object Discovery\ndataset (38) and PASCAL-VOC 2007.\n\n6.0.1 YouTube Object Dataset.\n\nYouTube Object Dataset (35) consists of videos downloaded from YouTube and is divided into 10\nobject classes. Each object class consists of several video shots of the objects belonging to the class.\nGround-truth boxes are given for a subset of the videos, and one frame is annotated per video. We\nsample key frames from each video with ground truth annotation uniformly with stride 10, and\noptimize our method only on the key frames. 
This follows (13; 22): temporally adjacent frames typically carry redundant information, and it is time-consuming to process all the frames. Besides localization, the YouTube Object Dataset is also a benchmark for unsupervised video segmentation and provides pixel level annotations for a subset of videos. We evaluate our method for segmentation on all the videos with pixel level ground truth annotation.\n\nVideo Co-localization Experiments\n\nMetric. We use the Correct Localization (CorLoc) metric, an evaluation metric used in related work (42; 7; 22), defined as the percentage of image frames correctly localized according to the criterion IoU > .5.\n\nBaseline Methods. We analyze the individual components of our colocalization model by removing various terms in the objective function and consider the following baselines: LP(Sal.) only minimizes the saliency term for bounding boxes and picks the most salient one in each frame of the video. It is important as it gives an approximate idea of how effective (motion) saliency is. We call it LP as it leads to a linear program. Joulin et al. (16) tackle colocalization alone, without any segmentation spatial support; this baseline quantifies how much we gain in colocalization performance by leveraging segmentation cues and deep features. QP(Loc.) only solves the part of the objective function corresponding to localization, without any segmentation cues; it includes the saliency and discriminative terms for boxes. QP(Loc.)+Seg denotes the overall performance without the foreground model and quantifies the importance of leveraging the segmentation model. Ours(full) denotes our overall model and quantifies the utility of the foreground model.\n\nIn Table 1, in addition to the baselines proposed above, we compare our method with two state of the art unsupervised approaches (13; 22). We simply cite numbers from their papers. 
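The CorLoc criterion above reduces to a per-frame intersection-over-union test. A minimal sketch, with hypothetical boxes in (x1, y1, x2, y2) format:

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def corloc(predicted, ground_truth, threshold=0.5):
    """Fraction of frames whose predicted box overlaps the annotation with IoU > threshold."""
    hits = sum(iou(p, g) > threshold for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)

# Toy check: identical boxes give IoU 1, disjoint boxes give IoU 0.
assert iou((0, 0, 10, 10), (0, 0, 10, 10)) == 1.0
assert iou((0, 0, 10, 10), (20, 20, 30, 30)) == 0.0
assert corloc([(0, 0, 10, 10), (0, 0, 10, 10)],
              [(0, 0, 10, 10), (20, 20, 30, 30)]) == 0.5
```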
(13)ext means that the authors used extra videos of the same class to increase the accuracy on the test video.\n\nVideo Segmentation Experiments. In Table 2, we report segmentation experiments on the YouTube Object Dataset. We use the Intersection over Union (IoU) metric, also known as the Jaccard index, to measure segmentation accuracy. In addition to the stripped down versions of our model, we compare with FST (32), which is still considered state of the art on the unsupervised YouTube Object segmentation dataset.\n\nTable 3: Image colocalization comparison on the Object Discovery dataset (CorLoc).\nLP(Sal.): 68 | QP(Loc.): 75 | TJLF14: 72 | Ours(full): 80 | CSP15 (7): 84\n\nTable 4: Image colocalization comparison on Pascal VOC 2007 (CorLoc).\nLP(Sal.): 33 | QP(Loc.): 40 | TJLF14: 39 | Ours(full): 51 | CSP15 (7): 68\n\nDiscussion. We observe in both Table 1 and 2 that the performance of the stripped down versions, when compared to the full model, validates our hypothesis of learning the two problems jointly. We observe a significant boost in localization performance by including segmentation cues. Furthermore, the ablation study also underlines the empirical importance of including a foreground model in the discriminative framework. On the video colocalization task, we perform on par with the current state of the art (13), whereas we outperform FST (32) on the video segmentation benchmark.\n\n6.1 Image Colocalization Experiments\n\nIn addition to the baselines proposed above for video colocalization, obtained by removing various terms in the objective function, we consider the following baselines:\n\nBaseline Methods. Tang et al. (TJLF14) (42) tackles colocalization alone without any segmentation spatial support. It quantifies how much we gain in colocalization performance by leveraging segmentation cues. 
CSP15 (7) is a state of the art method for image colocalization.\n\nThe Object Discovery dataset (38). This dataset was collected by downloading images from the Internet for airplane, car and horse. It contains about 100 images for each class. We use the same CorLoc metric and report the results in Table 3.\n\nPascal VOC 2007. In Table 4, we evaluate our method on the PASCAL07-6x2 subset to compare to previous methods for co-localization. This subset consists of all images from 6 classes (aeroplane, bicycle, boat, bus, horse, and motorbike) of PASCAL VOC 2007 (8). Each of the 12 class/viewpoint combinations contains between 21 and 50 images, for a total of 463 images. Compared to the Object Discovery dataset, it is significantly more challenging due to considerable clutter, occlusion, and diverse viewpoints. We see that the results using stripped down versions of our model are less consistent and less reliable. This again validates our hypothesis of leveraging segmentation cues to lift colocalization performance. Our results outperform TJLF14 (42) on all classes. Cho et al., CSP15 (7), outperforms all approaches on Pascal VOC 2007.\n\n7 Conclusion & Future Work\n\nWe proposed a simple framework that jointly learns to localize and segment objects. The proposed formulation is based on two different levels of visual representation and uses linear constraints as a means to transfer the information implicit in these representations in an unsupervised manner. Although we demonstrate the effectiveness of our approach with foreground clustering, the key idea of transferring knowledge between tasks via spatial relations is very general. We believe this work will encourage CNN frameworks such as constrained CNN (33) to learn similar problems jointly from weak supervision and act as a strong baseline for any future work that seeks to address multiple tasks using weak supervision. Optimizing the current objective function using the recently proposed 
Optimizing the current objective function using the recently proposed\nlarge scale discriminative clustering framework (30) is left as a future work.\nAcknowledgement Part of this work was partially supported by ERC Advanced grant VideoWorld.\nMany thanks to Armand Joulin for helpful discussions.\n\n7\n\n\f8 Appendix\n\n8.1 Toy Example\n\nWe illustrate the spatial (geometric) constraints by a simple toy example where the image contains 5\nsuperpixels. Global image level superpixel indexing is de\ufb01ned by y = (y1, y2, y3, y4, y5)T . Also,\nassume that there are two bounding boxes per image and that bounding box 1, z1, contains superpixel\n1, 3, 4 while bounding box 2, z2, contains superpixel 1, 2, 4. Thus, bounding box indexing for \ufb01rst\nproposal z1 is de\ufb01ned by x1 = (y1, y3, y4)T and for z2 is de\ufb01ned by x2 = (y1, y2, y4)T . Vector x\nis obtained by concatenating x1 and x2. Then, vector x1 and vector y are related by P1 as follows:\n\n(cid:2) x1\n\n(cid:3) =\n\n\uf8f9\uf8fb =\n\n\uf8ee\uf8f0 y1\n\ny3\ny4\n\n\uf8ee\uf8f01\n(cid:124)\n\n0\n0\n\n\uf8f9\uf8fb\n(cid:125)\n\n\u00d7\n\n\uf8f9\uf8fa\uf8fa\uf8fa\uf8fb\n\uf8ee\uf8ef\uf8ef\uf8ef\uf8f0\n(cid:124) (cid:123)(cid:122) (cid:125)\n\ny1\ny2\ny3\ny4\ny5\n\ny\n\n0\n0\n1\n\n0\n0\n0\n\n0\n0\n0\n\n0\n1\n0\n\n(cid:123)(cid:122)\n\nP1\n\n\u03b3|Si|zi \u2264 (cid:88)\n\nj\u2208Si\n\nxij \u2264 (1 \u2212 \u03b3)|Si|zi\n\nfor\n\ni = 1\n\n\u21d2\u03b3 \u2217 3z1 \u2264 (x11 + x12 + x13) \u2264 (1 \u2212 \u03b3) \u2217 3z1\n\nNote that |Si| = 3 since each bounding box contains 3 superpixels, m = 2 and n = 5.\n\n\u21d2\u03b3 \u2217 3z1 \u2264 (y1 + y3 + y4) \u2264 (1 \u2212 \u03b3) \u2217 3z1 (By P1y = x1)\n\nSimilarly, the second constraint for superpixels is equivalent to:\n\n(cid:88)\n\nxij \u2264 (cid:88)\n\ni:j\u2208Si\n\ni:j\u2208Si\n\nzi, for\n\nj = 1, 2, 3, 4, 5\n\n(x11 + x21) \u2264 (z1 + z2) \u21d2 2y1 \u2264 (z1 + z2)\n\nx22 \u2264 z2 \u21d2 y2 \u2264 z2\n\nx12 \u2264 z1 \u21d2 y3 \u2264 z1\n\n(x13 + 
x23) ≤ (z1 + z2) ⇒ 2y4 ≤ (z1 + z2).\n\nRounding for segmentation. Following Wang et al. (45), to convert the segmentation variable into binary indicator variables, we simply sample 30 thresholds uniformly within an interval, and choose the threshold whose corresponding segmentation has the smallest normalized cut score.\n\nReferences\n\n[1] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. PAMI, 34:2189–2202, 2012.\n\n[2] F. Bach and Z. Harchaoui. A discriminative and flexible framework for clustering. In NIPS, 2007.\n\n[3] P. Bojanowski, R. Lajugie, F. Bach, I. Laptev, J. Ponce, C. Schmid, and J. Sivic. Weakly supervised action labeling in videos under ordering constraints. In ECCV, 2014.\n\n[4] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, 2010.\n\n[5] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixe, D. Cremers, and L. V. Gool. One-shot video object segmentation. In CVPR, 2017.\n\n[6] R. Caruana. Algorithms and applications for multitask learning. In ICML, 1996.\n\n[7] M. Cho, S. Kwak, C. Schmid, and J. Ponce. Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals. In CVPR, 2015.\n\n[8] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. Tech-Report, 2007.\n\n[9] A. Faktor and M. Irani. Co-segmentation by composition. In ICCV, 2013.\n\n[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.\n\n[11] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.\n\n[12] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. In ICCV, 2017.\n\n[13] K. Jerripothula, J. Cai, and J. Yuan. 
CATS: Co-saliency activated tracklet selection for video co-localization. In ECCV, 2016.
[14] A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image cosegmentation. In CVPR, 2010.
[15] A. Joulin, F. Bach, and J. Ponce. Multi-class cosegmentation. In CVPR, 2012.
[16] A. Joulin, K. Tang, and L. Fei-Fei. Efficient image and video co-localization with Frank-Wolfe algorithm. In ECCV, 2014.
[17] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS, 2014.
[18] G. Kim and E. Xing. On multiple foreground cosegmentation. In CVPR, 2012.
[19] I. Kokkinos. UberNet: Training a universal CNN for low, mid and high level vision with diverse datasets and limited memory. In CVPR, 2017.
[20] A. Kolesnikov and C. Lampert. Seed, expand and constrain: Three principles for weakly-supervised segmentation. In ECCV, 2016.
[21] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] S. Kwak, M. Cho, I. Laptev, J. Ponce, and C. Schmid. Unsupervised object discovery and tracking in video collections. In ICCV, 2015.
[23] L. Ladicky, P. Sturgess, K. Alahari, C. Russell, and P. Torr. What, where and how many? Combining object detectors and CRFs. In ECCV, 2010.
[24] M. Lapin, B. Schiele, and M. Hein. Scalable multitask representation learning for scene classification. In CVPR, 2014.
[25] Y. Lee, J. Kim, and K. Grauman. Key-segments for video object segmentation. In ICCV, 2011.
[26] G. Li and Y. Yu. Visual saliency based on multiscale deep features. In CVPR, 2015.
[27] Y. Li, L. Liu, C. Shen, and A. van den Hengel. Image co-localization by mimicking a good detector's confidence score distribution. In ECCV, 2016.
[28] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[29] M. Maire, S. X. Yu, and P. Perona. Object detection and segmentation from joint embedding of parts and pixels. In ICCV, 2011.
[30] A. Miech, J.-B. Alayrac, P. Bojanowski, I. Laptev, and J. Sivic. Learning from video and text via large-scale discriminative clustering. In ICCV, 2017.
[31] L. Mukherjee, V. Singh, and J. Peng. Scale invariant cosegmentation for image groups. In CVPR, 2011.
[32] A. Papazoglou and V. Ferrari. Fast object segmentation in unconstrained video. In CVPR, 2013.
[33] D. Pathak, P. Krahenbuhl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV, 2015.
[34] P. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NIPS, 2015.
[35] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In CVPR, 2012.
[36] R. Quan, J. Han, D. Zhang, and F. Nie. Object co-segmentation via graph optimized-flexible manifold ranking. In CVPR, 2016.
[37] C. Rother, V. Kolmogorov, T. Minka, and A. Blake. Cosegmentation of image pairs by histogram matching - incorporating a global constraint into MRFs. In CVPR, 2006.
[38] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu. Unsupervised joint object discovery and segmentation in internet images. In CVPR, 2013.
[39] J. Rubio, J. Serrat, A. Lopez, and N. Paragios. Unsupervised co-segmentation through region matching. In CVPR, 2012.
[40] A. Sharma, O. Grau, and M. Fritz. VConv-DAE: Deep volumetric shape learning without object labels. In ECCV, 2016.
[41] J. Shi and J. Malik. Normalized cuts and image segmentation. PAMI, 22(8):888–905, 2000.
[42] K. Tang, A. Joulin, L.-J. Li, and L. Fei-Fei. Co-localization in real-world images. In CVPR, 2014.
[43] A. Vedaldi and S. Soatto. Quick shift and kernel methods for mode seeking. In ECCV, 2008.
[44] S. Vicente, C. Rother, and V. Kolmogorov. Object cosegmentation.
In CVPR, 2011.
[45] F. Wang, Q. Huang, and L. Guibas. Image co-segmentation via consistent functional maps. In ICCV, 2013.
[46] Z. Wang and R. Liu. Semi-supervised learning for large-scale image cosegmentation. In ICCV, 2013.
[47] J. Xu, A. Schwing, and R. Urtasun. Learning to segment under various forms of weak supervision. In CVPR, 2015.
[48] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In NIPS, 2005.
[49] W. Yang, Y. Wang, and G. Mori. Recognizing human actions from still images with latent poses. In CVPR, 2010.
[50] S. X. Yu, R. Grosse, and J. Shi. Concurrent object recognition and segmentation by graph partitioning. In NIPS, 2003.
[51] D. Zhang, O. Javed, and M. Shah. Video object segmentation through spatially accurate and temporally dense extraction of primary object regions. In CVPR, 2013.
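As a companion to the toy example of Sec. 8.1 and the rounding step, the construction of the selection matrix P_i, the two families of constraints, and the threshold-based rounding can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the labels y and z, the value gamma = 0.2, and the caller-supplied `score` function (standing in for the normalized cut criterion of (41)) are assumptions made here for concreteness.

```python
import numpy as np

# Toy example of Sec. 8.1: n = 5 superpixels, m = 2 bounding boxes.
# Box 1 contains superpixels {1, 3, 4}; box 2 contains {1, 2, 4} (1-indexed).
boxes = [[0, 2, 3], [0, 1, 3]]  # 0-indexed superpixel sets S_1, S_2
n = 5

def selection_matrix(S, n):
    """P_i has one row per superpixel in box i, so x_i = P_i y picks those labels."""
    P = np.zeros((len(S), n))
    for row, j in enumerate(S):
        P[row, j] = 1.0
    return P

def box_constraint_holds(x_i, z_i, gamma):
    """First constraint: gamma * |S_i| * z_i <= sum_j x_ij <= (1 - gamma) * |S_i| * z_i."""
    s = x_i.sum()
    return gamma * len(x_i) * z_i <= s <= (1 - gamma) * len(x_i) * z_i

def superpixel_constraints_hold(y, z, boxes):
    """Second constraint, for every superpixel j: sum_{i : j in S_i} x_ij <= sum_{i : j in S_i} z_i.
    Since x_ij = y_j by construction, the left side is |{i : j in S_i}| * y_j."""
    for j in range(len(y)):
        containing = [i for i, S in enumerate(boxes) if j in S]
        # A superpixel in no box (here j = 4, i.e. superpixel 5) gives a vacuous constraint.
        if containing and len(containing) * y[j] > sum(z[i] for i in containing):
            return False
    return True

def round_by_thresholds(y_cont, score, num=30):
    """Rounding following Wang et al. (45): sample `num` thresholds uniformly in [0, 1]
    and keep the binarization with the smallest score (normalized cut in the paper;
    `score` here is a caller-supplied stand-in)."""
    candidates = [(y_cont > t).astype(int) for t in np.linspace(0.0, 1.0, num)]
    return min(candidates, key=score)
```

For example, with y = (1, 0, 0, 1, 0)^T and z = (1, 1), we get x_1 = P_1 y = (y_1, y_3, y_4)^T = (1, 0, 1)^T, and both families of constraints above are satisfied with gamma = 0.2.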