{"title": "Deep Joint Task Learning for Generic Object Extraction", "book": "Advances in Neural Information Processing Systems", "page_first": 523, "page_last": 531, "abstract": "This paper investigates how to extract objects-of-interest without relying on hand-craft features and sliding windows approaches, that aims to jointly solve two sub-tasks: (i) rapidly localizing salient objects from images, and (ii) accurately segmenting the objects based on the localizations. We present a general joint task learning framework, in which each task (either object localization or object segmentation) is tackled via a multi-layer convolutional neural network, and the two networks work collaboratively to boost performance. In particular, we propose to incorporate latent variables bridging the two networks in a joint optimization manner. The first network directly predicts the positions and scales of salient objects from raw images, and the latent variables adjust the object localizations to feed the second network that produces pixelwise object masks. An EM-type method is then studied for the joint optimization, iterating with two steps: (i) by using the two networks, it estimates the latent variables by employing an MCMC-based sampling method; (ii) it optimizes the parameters of the two networks unitedly via back propagation, with the fixed latent variables. 
Extensive experiments demonstrate that our joint learning framework significantly outperforms other state-of-the-art approaches in both accuracy and efficiency (e.g., 1000 times faster than competing approaches).", "full_text": "Deep Joint Task Learning for Generic Object Extraction\n\nXiaolong Wang1,4, Liliang Zhang1, Liang Lin1,3*, Zhujin Liang1, Wangmeng Zuo2\n\n1Sun Yat-sen University, Guangzhou 510006, China\n\n2School of Computer Science and Technology, Harbin Institute of Technology, China\n\n3SYSU-CMU Shunde International Joint Research Institute, Shunde, China\n\n4The Robotics Institute, Carnegie Mellon University, Pittsburgh, U.S.\n\nhttp://vision.sysu.edu.cn/projects/deep-joint-task-learning/\n\nAbstract\n\nThis paper investigates how to extract objects-of-interest without relying on hand-crafted features and sliding-window approaches, aiming to jointly solve two sub-tasks: (i) rapidly localizing salient objects in images, and (ii) accurately segmenting the objects based on the localizations. We present a general joint task learning framework, in which each task (either object localization or object segmentation) is tackled via a multi-layer convolutional neural network, and the two networks work collaboratively to boost performance. In particular, we propose to incorporate latent variables that bridge the two networks in a joint optimization manner. The first network directly predicts the positions and scales of salient objects from raw images, and the latent variables adjust the object localizations to feed the second network, which produces pixelwise object masks. An EM-type method is presented for the optimization, iterating between two steps: (i) using the two networks, it estimates the latent variables with an MCMC-based sampling method; (ii) with the latent variables fixed, it optimizes the parameters of the two networks jointly via back propagation. 
Extensive experiments suggest that our framework significantly outperforms other state-of-the-art approaches in both accuracy and efficiency (e.g., 1000 times faster than competing approaches).\n\n1 Introduction\nA typical vision problem usually comprises several subproblems, which tend to be tackled jointly to achieve superior capability. In this paper, we focus on a general joint task learning framework based on deep neural networks, and demonstrate its effectiveness and efficiency on generic (i.e., category-independent) object extraction.\nGenerally speaking, object extraction comprises two sequential subtasks: rapidly localizing the objects-of-interest in images, and further generating segmentation masks based on the localizations. Despite acknowledged progress, previous approaches often tackle these two tasks independently, and most of them apply sliding windows over all image locations and scales [17, 22], which can limit their performance. Recently, several works [33, 18, 5] utilized the interdependencies of object localization and segmentation, and showed promising results. For example, Yang et al. [33] introduced a joint framework for object segmentation, in which the segmentation benefits from the object detectors and the object detections are consistent with the underlying segmentation of the\n\n*Corresponding author is Liang Lin (E-mail: linliang@ieee.org). This work was supported by the National Natural Science Foundation of China (no. 61173082), the Hi-Tech Research and Development Program of China (no. 2012AA011504), Guangdong Science and Technology Program (no. 2012B031500006), Special Project on Integration of Industry, Education and Research of Guangdong (no. 2012B091000101), and Fundamental Research Funds for the Central Universities (no. 14lgjc11).\n\nimage. However, these methods still rely on exhaustive search to localize objects. 
On the other hand, deep learning methods have achieved superior capabilities in classification [21, 19, 23] and representation learning [4], and they have also demonstrated good potential on several complex vision tasks [29, 30, 20, 25]. Motivated by these works, we build a deep learning architecture to jointly solve the two subtasks in object extraction, in which each task (either object localization or object segmentation) is tackled by a multi-layer convolutional neural network. Specifically, the first network (i.e., the localization network) directly predicts the positions and scales of salient objects from raw images, upon which the second network (i.e., the segmentation network) generates the pixelwise object masks.\n\nFigure 1: Motivation for introducing latent variables in object extraction. Treating predicted object localizations (the dashed red boxes) as the inputs for segmentation may lead to unsatisfactory segmentation results, and we can make improvements by enlarging or shrinking the localizations (the solid blue boxes) with the latent variables. Two examples are shown in (a) and (b), respectively.\n\nRather than being simply stacked up, the two networks are collaboratively integrated with latent variables to boost performance. In general, the two networks, optimized for different tasks, might have inconsistent interests. For example, the object localizations predicted by the first network may indicate incomplete object (foreground) regions or include a lot of background, which can lead to unsatisfactory pixelwise segmentation. This observation is illustrated in Fig. 1, where we can obtain better segmentation results by enlarging or shrinking the input object localizations (denoted by the bounding boxes). 
To overcome this problem, we propose to incorporate latent variables between the two networks that explicitly indicate the adjustments of object localizations, and to optimize them jointly with learning the network parameters. It is worth mentioning that our framework can be generally extended to other joint-task applications in similar ways. For concise description, we use the term “segmentation reference” in the following to denote the predicted object localization plus the adjustment.\nFor the framework training, we present an EM-type algorithm, which alternately estimates the latent variables and learns the network parameters. The latent variables are treated as intermediate auxiliaries during training: we search for the optimal segmentation reference, and back-tune the two networks accordingly. The latent variable estimation is, however, non-trivial in this work, as it is intractable to analytically model the distribution of segmentation references. To avoid exhaustive enumeration, we design a data-driven MCMC method to effectively sample the latent variables, inspired by [24, 31]. In sum, the training algorithm iterates between two steps: (i) Fixing the network parameters, we estimate the latent variables and determine the optimal segmentation reference with the sampling method. (ii) Fixing the segmentation reference, the segmentation network is tuned according to the pixelwise segmentation errors, while the localization network is tuned by taking the adjustment of the object localizations into account.\n\n2 Related Work\nAs our work extracts pixelwise objects-of-interest from an image, it is related to salient region/object detection [26, 9, 10, 32]. These methods mainly focused on feature engineering and graph-based segmentation. For example, Cheng et al. 
[9] proposed a regional-contrast-based saliency extraction algorithm and further segmented objects by applying an iterative version of GrabCut. Some approaches [22, 27] trained object appearance models and utilized spatial or geometric priors to address this task. Kuettel et al. [22] proposed to transfer segmentation masks from training data into testing images by searching for and matching visually similar objects within sliding windows. Other related approaches [28, 7] simultaneously processed a batch of images for object discovery and co-segmentation, but they often required category information as a prior.\nRecently resurgent deep learning methods have also been applied to object detection and image segmentation [30, 14, 29, 20, 11, 16, 2, 25]. Among these works, Sermanet et al. [29] detected objects by training category-level convolutional neural networks. Ouyang et al. [25] proposed to combine multiple components (e.g., feature extraction, occlusion handling, and classification) within a deep architecture for pedestrian detection. Huang et al. [20] presented multiscale recursive neural networks for robust image segmentation. These methods generally achieved impressive performances, but they usually rely on sliding detection windows over the scales and positions of testing images. Very recently, Erhan et al. [14] adopted neural networks to recognize object categories while predicting potential object localizations without exhaustive enumeration. This work inspired us to design the first network to localize objects. 
To the best of our knowledge, our framework is the first to collaboratively optimize the different tasks by introducing latent variables together with network parameter learning.\n\n3 Deep Architecture\nIn this section, we introduce a joint deep model for object extraction (i.e., extracting the segmentation mask for a salient object in the image). Our model comprises two convolutional neural networks: a localization network and a segmentation network, as shown in Fig. 2. Given an image as input, our first network generates a 4-dimensional output, which specifies the salient object bounding box (i.e., the object localization). With the localization result, our segmentation network extracts an m×m binary mask for segmentation in its last layer. Both networks are stacked up from convolutional layers, max-pooling operators and fully connected layers. In the following, we introduce the detailed definitions of these two networks.\n\nFigure 2: The architecture of our joint deep model. It is stacked up from two convolutional neural networks: a localization network and a segmentation network. Given an image, we can generate its object bounding box and further extract its segmentation mask accordingly.\n\nLocalization Network. The architecture of the localization network contains eight layers: five convolutional layers and three fully connected layers. For the parameter settings of the first seven layers, we refer to the network used by Krizhevsky et al. [21]. It takes an image of size 224×224×3 as input, where the dimensions represent height, width and channel number. The eighth layer of the network contains 4 output neurons, indicating the coordinates (x1, y1, x2, y2) of a salient object bounding box. Note that the coordinates are normalized w.r.t. the image dimensions into the range 0 to 224.\nSegmentation Network. 
Our segmentation network is a five-layer neural network with four convolutional layers and one fully connected layer. To simplify the description, we denote C(k, h×w×c) as a convolutional layer, where k represents the number of kernels, and h, w, c represent the height, width and channel number of each kernel. We also denote FC as a fully connected layer, RN as a response normalization layer, and MP as a max-pooling layer. The size of the max-pooling operator is set to 3×3 and the stride for pooling is 2. The network architecture can then be described as: C(256, 5×5×3) - RN - MP - C(384, 3×3×256) - C(384, 3×3×384) - C(256, 3×3×384) - MP - FC. Taking an image of size 55×55×3 as input, the segmentation network generates a binary mask of size 50×50 as the output of its fully connected layer.\nWe then introduce the inference process for object extraction. Formally, we define the parameters of the localization network and the segmentation network as ωl and ωs, respectively. Given an input image Ii, we first resize it to 224×224×3 as the input for the localization network. The output of this network via forward propagation is represented as F_ωl(Ii), a 4-dimensional vector bi specifying the salient object bounding box. We crop the image data for the salient object according to bi, and resize it to 55×55×3 as the input for the segmentation network. 
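To make the layer arithmetic concrete, the segmentation network described above can be sketched in PyTorch. The kernel shapes, the 3×3/stride-2 pooling and the 50×50 output follow the text; the convolution paddings and strides, the use of ReLU activations, and the `LocalResponseNorm` settings are assumptions, since the paper does not specify them:

```python
import torch
import torch.nn as nn

# Hedged sketch: paddings, strides, ReLUs and LRN size are assumptions.
segmentation_net = nn.Sequential(
    nn.Conv2d(3, 256, kernel_size=5),               # C(256, 5x5x3): 55 -> 51
    nn.ReLU(),
    nn.LocalResponseNorm(5),                        # RN
    nn.MaxPool2d(kernel_size=3, stride=2),          # MP (3x3, stride 2): 51 -> 25
    nn.Conv2d(256, 384, kernel_size=3, padding=1),  # C(384, 3x3x256)
    nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1),  # C(384, 3x3x384)
    nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1),  # C(256, 3x3x384)
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),          # MP: 25 -> 12
    nn.Flatten(),
    nn.Linear(256 * 12 * 12, 50 * 50),              # FC -> 2500 mask logits
    nn.Sigmoid(),                                   # per-pixel binary output
)

out = segmentation_net(torch.zeros(1, 3, 55, 55))   # one 55x55x3 crop
```

Under these assumptions the spatial size shrinks 55 → 51 → 25 → 25 → 25 → 25 → 12, so the fully connected layer maps 256·12·12 features to the 2500-dimensional (50×50) mask.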
By performing forward propagation, the output of the segmentation network is represented as F_ωs(Ii, bi), a vector with 50×50 = 2500 dimensions indicating the binary segmentation result for object extraction.\n\n4 Learning Algorithm\nWe propose a joint deep learning approach to estimate the parameters of the two networks. As the object bounding boxes indicated by the groundtruth object masks might not provide the best references for segmentation, we embed this domain-specific prior as latent variables in learning: we adjust the object bounding boxes via the latent variables to mine optimal segmentation references for training. For optimization, an EM-type algorithm is proposed to iteratively estimate the latent variables and the model parameters.\n\n4.1 Objective Formulation\nSuppose we have a set of N training images I = {I1, ..., IN} with segmentation masks Y = {Y1, ..., YN} for the salient objects in them. For each Yi, we use Yi^j to denote its jth pixel; Yi^j = 1 indicates foreground and Yi^j = 0 background. From the given object masks Y, we obtain the object bounding boxes tightly around them as L = {L1, ..., LN}, where Li is a 4-dimensional vector representing the coordinates (x1, y1, x2, y2). For each sample, we introduce a latent variable ΔLi as the adjustment of Li. We call the adjusted bounding box the segmentation reference, represented as L̃i = Li + ΔLi. The learning objective is defined as the probability for maximum likelihood estimation (MLE):\n\nP(Y, L, I | ωl, ωs) = P(L̃, I | ωl) P(Y, I | ωs, L̃),    (1)\n\nwhere the objective is decoupled into two probabilities that are bridged by the segmentation references L̃ = {L̃1, ..., L̃N}. Specifically, P(L̃, I | ωl) represents the object localization predicted by the first network, and P(Y, I | ωs, L̃) represents the segmentation estimation.\nFor the localization network, we optimize the model parameters by minimizing the Euclidean distance between the output F_ωl(Ii) and the segmentation reference L̃i = Li + ΔLi. Thus the probability for ωl and L̃ can be represented as\n\nP(L̃, I | ωl) = (1/Z) exp(−Σ_{i=1}^{N} ||F_ωl(Ii) − Li − ΔLi||_2^2),    (2)\n\nwhere Z is a normalization term.\nFor the segmentation network, we specify each neuron of the last layer as a binary classification output. The parameters ωs are then estimated via logistic regression,\n\nP(Y, I | ωs, L̃) = Π_{i=1}^{N} ( Π_{j | Yi^j = 1} F_ωs^j(Ii, Li + ΔLi) · Π_{j | Yi^j = 0} (1 − F_ωs^j(Ii, Li + ΔLi)) ),    (3)\n\nwhere F_ωs^j(Ii, Li + ΔLi) is the jth element of the network output, given image Ii and segmentation reference Li + ΔLi as input.\nTo optimize the model parameters, we solve the MLE objective by minimizing the cost\n\nJ(ωl, ωs, L̃) = −(1/N) Σ_{i=1}^{N} log P(Y, L, I | ωl, ωs)    (4)\n∝ (1/N) Σ_{i=1}^{N} [ ||F_ωl(Ii) − Li − ΔLi||_2^2    (5)\n− Σ_j ( Yi^j log F_ωs^j(Ii, Li + ΔLi) + (1 − Yi^j) log(1 − F_ωs^j(Ii, Li + ΔLi)) ) ],    (6)\n\nwhere the first term (5) is the cost for localization network training and the second term (6) is the cost for segmentation network training.\n\n4.2 Joint Task Optimization\n\nWe propose an EM-type algorithm to optimize the learning cost J(ωl, ωs, L̃). As Fig. 
3 illustrates, it includes two iterative steps: (i) fixing the model parameters, apply MCMC-based sampling to estimate the latent variables, which indicate the segmentation references L̃; (ii) given the segmentation references, compute the model parameters of the two networks jointly via back propagation. We explain these two steps in the following.\n\nFigure 3: The EM-type learning algorithm with two steps: (i) K moves of MCMC sampling (gray arrows), where the latent variable ΔLi is sampled considering both the localization costs (indicated by the dashed gray arrow) and the segmentation costs; (ii) given the segmentation reference and result after K moves of sampling, back propagation (blue arrows) is applied to estimate the parameters of both networks.\n\n(i) Latent variable estimation. Given a training image Ii and the current model parameters, we estimate the latent variable ΔLi. As there are no groundtruth labels for the latent variables, it is intractable to estimate their distribution, and enumerating all ΔLi for evaluation is time-consuming given the large search space. We therefore propose an MCMC Metropolis-Hastings method [24] for latent variable sampling, which proceeds in K moves. In each move, a new latent variable is sampled from the proposal distribution and accepted with an acceptance rate. For fast and effective search, we design the proposal distribution with a data-driven term, based on the fact that segmentation boundaries are often aligned with the boundaries of superpixels [1] generated by over-segmentation.\nWe first initialize the latent variable as ΔLi = 0. To find a better latent variable ΔL′i and achieve a reversible transition, we define the acceptance rate of the transition from ΔLi to ΔL′i as\n\nα(ΔLi → ΔL′i) = min(1, (π(ΔL′i) · q(ΔL′i → ΔLi)) / (π(ΔLi) · q(ΔLi → ΔL′i))),    (7)\n\nwhere π(ΔLi) is the invariant distribution and q(ΔLi → ΔL′i) is the proposal distribution.\nBy replacing the dataset with a single sample in Eq. (1), we define the invariant distribution as π(ΔLi) = P(ωl, ωs, L̃i | Yi, Ii), which can be decomposed into two probabilities: P(ωl, L̃i | Yi, Ii) constrains the segmentation reference to be close to the output of the localization network, and P(ωs, L̃i | Yi, Ii) encourages a segmentation reference contributing to a better segmentation mask. To calculate these probabilities, we need to perform forward propagation in both networks.\nThe proposal distribution is defined as the combination of a Gaussian distribution and a data-driven term,\n\nq(ΔLi → ΔL′i) = N(ΔL′i | μi, Σi) · Pc(ΔL′i | Yi, Ii),    (8)\n\nwhere μi and Σi are the mean vector and covariance matrix of the optimal ΔL′i in the previous iterations, based on the observation that the current optimal ΔL′i has a high possibility of being selected again. The data-driven term Pc(ΔL′i | Yi, Ii) is computed from the given image Ii. After over-segmenting Ii into superpixels, we define vj = 1 if the jth image pixel is on the boundary of a superpixel and vj = 0 if it is inside a superpixel. We then sample c pixels at equal distances along the segmentation reference L̃′i = Li + ΔL′i, and the data-driven term is represented as Pc(ΔL′i | Yi, Ii) = (1/c) Σ_{j=1}^{c} vj. 
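The sampling step can be sketched as a generic Metropolis-Hastings loop. Here π and q are placeholder callables supplied by the caller (in the paper, evaluating π(ΔLi) requires forward passes of both networks, and q is the Gaussian times the data-driven term of Eq. (8)); all names are illustrative, and π is assumed strictly positive:

```python
import random

def mh_sample_adjustment(pi, q_sample, q_density, delta0, K=20, rng=random):
    """K Metropolis-Hastings moves over the box adjustment, tracking the best
    sample seen; acceptance follows Eq. (7). `pi` must be > 0 everywhere."""
    delta = delta0
    best, best_score = delta0, pi(delta0)
    for _ in range(K):
        prop = q_sample(delta)  # draw a candidate adjustment from q(delta -> .)
        # alpha = min(1, pi(prop) * q(prop -> delta) / (pi(delta) * q(delta -> prop)))
        alpha = min(1.0, (pi(prop) * q_density(prop, delta)) /
                         (pi(delta) * q_density(delta, prop)))
        if rng.random() < alpha:
            delta = prop  # accept the move
        if pi(delta) > best_score:
            best, best_score = delta, pi(delta)
    return best

# Toy demo: a 1-D adjustment, symmetric unit-step proposal, target peaked at 3.
pi = lambda d: 1.0 / (1.0 + (d - 3) ** 2)
rng = random.Random(0)
best = mh_sample_adjustment(pi,
                            q_sample=lambda d: d + rng.choice([-1, 1]),
                            q_density=lambda a, b: 1.0,  # symmetric proposal
                            delta0=0, K=30, rng=rng)
```

Because the best-scoring sample is tracked explicitly, the returned adjustment is never worse (under π) than the ΔLi = 0 initialization.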
Thus the bounding box edges are encouraged to avoid cutting through possible foreground superpixels, which leads to more plausible proposals. We set c = 200 in our experiments, and the over-segmentation into superpixels only needs to be performed once, as pre-processing for training.\n(ii) Model parameter estimation. As shown in Fig. 3, given the optimal latent variables ΔL after K moves of sampling, we can obtain the corresponding segmentation references L̃ and the segmentation results. The parameters ωs of the segmentation network are then optimized via back propagation with logistic regression (the second term (6) of Eq. (4)), and the parameters ωl of the localization network are tuned by minimizing the square error between the segmentation references and the localization output (the first term (5) of Eq. (4)).\nDuring back propagation, we apply stochastic gradient descent to update the model parameters. For the segmentation network, we use an equal learning rate ϵ1 for all layers. For localization, we first pre-train the network discriminatively for classifying the 1000 object categories of the ImageNet dataset [12]. With the pre-training, we can borrow the information learned from a large dataset to improve our performance. We keep the parameters of the convolutional layers and randomly reset the parameters of the fully connected layers as initialization. The learning rate is set to ϵ2 for the fully connected layers and ϵ2/100 for the convolutional layers.\n\n5 Experiment\n\nWe validate our approach on the Saliency dataset [9, 8] and on a more challenging dataset newly collected by us, namely the Object Extraction (OE) dataset1. We compare our approach with state-of-the-art methods, and empirical analyses are also presented.\nThe Saliency dataset is a combination of the THUR15000 [8] and THUS10000 [9] datasets, and includes 16233 images with pixelwise groundtruth masks. 
Most of the images contain one salient object, and we do not utilize the category information in training and testing. We randomly split the dataset into 14233 images for training and 2000 images for testing. The OE dataset collected by us is more comprehensive, including 10183 images with groundtruth masks. We select the images from the PASCAL [15], iCoseg [3], and Internet [28] datasets as well as other data (most of it about people and clothes) from the web. Compared to traditional segmentation datasets, the salient objects in the OE dataset vary more in appearance and shape (or pose), and they often appear in complicated scenes with background clutter. For evaluation on the OE dataset, 8230 samples are randomly selected for training and the remaining 1953 are used for testing.\nExperiment Settings. During training, the domain of each element of the 4-dimensional latent variable vector ΔLi is set to [−10, −5, 0, 5, 10], so there are 5^4 = 625 possible proposals for each ΔLi. We set the number of MCMC sampling moves to K = 20 during search. The learning rate is ϵ1 = 1.0 × 10^−6 for the segmentation network and ϵ2 = 1.0 × 10^−8 for the localization network.\nFor testing, as each pixelwise output of our method is well discriminated towards 1 or 0, we simply classify it as foreground or background with a threshold of 0.5. The experiments are performed on a desktop with an Intel I7 3.7GHz CPU, 16GB RAM and a GTX TITAN GPU.\n\n5.1 Results and Comparisons\n\nWe now quantitatively evaluate the performance of our method. As evaluation metrics, we adopt the Precision, P (the average fraction of pixels correctly labeled, counting both foreground and background), and the Jaccard similarity, J (the average intersection-over-union score |S∩G| / |S∪G|, where S is the set of foreground pixels obtained by our algorithm and G is the set of groundtruth foreground pixels). 
We then compare the results of our approach with machine-learning-based methods such as figure-ground segmentation [22], CPMC [6] and Object Proposals [13]. As CPMC and Object Proposals generate multiple ranked segments intended to cover objects, we follow the process applied in [22] to evaluate their results: we use the union of the top K ranked segments as the salient object prediction.\n\n1http://vision.sysu.edu.cn/projects/deep-joint-task-learning/\n\n  | Ours(full) | Ours(sim) | FgSeg [22] | CPMC [6] | ObjProp [13] | HS [32] | GC [10] | RC [9] | HC [9]\nP | 97.81 | 96.62 | 91.92 | 89.24 | 90.16 | 89.99 | 72.60 | 89.23 | 83.64\nJ | 87.02 | 81.10 | 70.85 | 58.42 | 63.69 | 64.72 | 54.12 | 58.30 | 56.14\n\nTable 1: The evaluation on the Saliency dataset with Precision (P) and Jaccard similarity (J). Ours(full) indicates our joint learning method and Ours(sim) means learning the two networks separately.\n\n  | Ours(full) | Ours(sim) | FgSeg [22] | CPMC [6] | ObjProp [13] | HS [32] | GC [10] | RC [9] | HC [9]\nP | 93.12 | 91.25 | 90.42 | 83.37 | 86.25 | 72.14 | 85.53 | 87.42 | 76.33\nJ | 77.69 | 71.50 | 70.93 | 50.61 | 59.34 | 54.70 | 54.83 | 62.83 | 53.76\n\nTable 2: The evaluation on the OE dataset with Precision (P) and Jaccard similarity (J). Ours(full) indicates our joint learning method and Ours(sim) means learning the two networks separately.\n\nWe evaluate the performance for all K ∈ {1, ..., 100} and report the best result for each sample in our experiment. Besides machine-learning-based methods, we also report the results of salient region detection methods [10, 32, 9]. Note that two approaches are described in [9], utilizing histogram-based contrast (HC) and region-based contrast (RC). Given the salient maps from these methods, the iterative GrabCut proposed in [9] is utilized to generate binary segmentation results.\nSaliency dataset. We report the experimental results on this dataset in Table 1. 
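As a side note, the two metrics P and J defined above, together with the 0.5 test-time threshold, can be sketched as follows (the function names and toy arrays are illustrative):

```python
import numpy as np

def precision(pred, gt):
    """Fraction of pixels labeled correctly, counting foreground and background."""
    return float((pred == gt).mean())

def jaccard(pred, gt):
    """Intersection-over-union |S n G| / |S u G| of the foreground pixel sets."""
    inter = np.logical_and(pred == 1, gt == 1).sum()
    union = np.logical_or(pred == 1, gt == 1).sum()
    return float(inter) / float(union) if union else 1.0

# Binarize a (toy) network output at the 0.5 threshold used in testing.
mask = (np.array([[0.9, 0.2], [0.6, 0.1]]) > 0.5).astype(int)  # [[1,0],[1,0]]
gt   = np.array([[1, 0], [0, 0]])
```

On this toy pair, three of four pixels agree (P = 0.75) and the foreground sets overlap in one of two union pixels (J = 0.5).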
Our result with joint task learning (denoted Ours(full)) reaches 97.81% in Precision (P) and 87.02% in Jaccard similarity (J). Compared to the figure-ground segmentation method [22], we obtain improvements of 5.89% in P and 16.17% in J. Among the salient region detection methods, the best results are P: 89.99% and J: 64.72%, from [32]. Our method demonstrates superior performance compared to these approaches.\nOE dataset. The evaluation of our method on the OE dataset is shown in Table 2. By jointly learning the localization and segmentation networks, our approach, with 93.12% in P and 77.69% in J, achieves the highest performance among the compared state-of-the-art methods.\nOne spotlight of our work is its high efficiency at test time. As Table 3 illustrates, the average time for object extraction from an image with our method is 0.014 seconds, while figure-ground segmentation [22] requires 94.3 seconds, CPMC [6] requires 59.6 seconds and Object Proposals [13] requires 37.4 seconds. For most of the saliency region detection methods, the runtime is dominated by the iterative GrabCut process, so we use its time, 0.711 seconds, as the average testing time for the saliency region detection methods. As a result, our approach is 50 to 6000 times faster than the state-of-the-art methods.\nDuring training, convergence requires around 20 hours on the Saliency dataset and 13 hours on the OE dataset. For latent variable sampling, we also tried enumerating the 625 possible proposals exhaustively for each image. 
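The 625-proposal grid used in this exhaustive baseline is simply the Cartesian product of the per-coordinate adjustment domain from the experiment settings; a minimal sketch:

```python
from itertools import product

# Each latent variable adjusts the four box coordinates (x1, y1, x2, y2);
# every coordinate offset is drawn from the same 5-value domain, so the
# exhaustive search space has 5**4 = 625 candidate adjustments.
domain = [-10, -5, 0, 5, 10]
proposals = list(product(domain, repeat=4))
```

The zero adjustment (0, 0, 0, 0), i.e. the unmodified groundtruth box, is one of the 625 candidates.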
It achieves similar accuracy to our sampling approach, while costing about 30 times as much runtime in each training iteration.\n\n5.2 Empirical Analysis\n\nFor further evaluation, we conduct the following two empirical analyses to illustrate the effectiveness of our method.\n(I) To clarify the significance of joint learning over learning the two networks separately, we discard the latent variable sampling and set all ΔLi = 0 during training, denoting this variant Ours(sim). Fig. 4 shows the training cost J(ωl, ωs, L̃) (Eq. (4)) for these two methods. We plot the average loss over all training samples through the training iterations; our joint learning method achieves lower costs than the variant without latent variable adjustment. We also compare the two methods in Precision and Jaccard similarity on both datasets. As Table 1 illustrates, there are improvements of 1.19% in P and 5.92% in J when the two networks are learned jointly on the Saliency dataset. On the OE dataset, joint learning performs 1.87% higher in P and 6.19% higher in J than learning the two networks separately, as shown in Table 2.\n\n     | Ours(full) | FgSeg [22] | CPMC [6] | ObjProp [13] | Saliency methods\nTime | 0.014s | 94.3s | 59.6s | 37.4s | 0.711s\n\nTable 3: Testing time per image. Saliency methods denotes the saliency region detection methods [32, 10, 9].\n\n                        | Car P | Car J | Horse P | Horse J | Airplane P | Airplane J\nOurs(full)              | 87.95 | 68.86 | 88.11 | 53.80 | 92.12 | 60.10\nChen et al. [7]         | 87.09 | 64.67 | 89.00 | 57.58 | 90.24 | 59.97\nRubinstein et al. [28]  | 83.38 | 63.36 | 83.69 | 53.89 | 86.14 | 55.62\n\nTable 4: Comparison of our method with two object discovery and segmentation methods on the Internet dataset. Our model is trained with data other than that from the Internet dataset.\n\n(II) We demonstrate that our method generalizes well across different datasets. 
Given the OE dataset, we train our model with all the data except those collected from the Internet dataset [28]. The newly trained model is then applied for testing on the Internet dataset. We compare the performance of this deep model with two object discovery and co-segmentation methods [28, 7] on the Internet dataset. As Table 4 illustrates, our method achieves higher performance on the Car and Airplane classes, and a comparable result on the Horse class. Our model can thus generalize well to datasets not used in training and achieve state-of-the-art performance. It is also worth mentioning that the co-segmentation methods [28, 7] require a few seconds per image at test time, which is much slower than our approach at 0.014 seconds per image.\n\nFigure 4: The training cost across iterations. The cost is evaluated over all the training samples in each dataset: (a) Saliency dataset; (b) OE dataset.\n\n6 Conclusion\nThis paper studied joint task learning via deep neural networks for generic object extraction, in which two networks work collaboratively to boost performance. Our joint deep model has been shown to handle realistic data from the internet well. More importantly, the approach extracts the object segmentation mask of an image very efficiently, 1000 times faster than competing state-of-the-art methods. The proposed framework can be extended to handle other joint tasks in similar ways.\n\nReferences\n[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, SLIC Superpixels Compared to State-of-the-art Superpixel Methods, In IEEE Trans. Pattern Anal. Mach. Intell., 34(11):2274-2282, 2012.\n[2] J. M. Alvarez, T. Gevers, Y. LeCun, and A. M. Lopez, Road Scene Segmentation from a Single Image, In ECCV, 2012.\n[3] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. 
Chen, iCoseg: Interactive Co-segmentation with Intelligent Scribble Guidance, In CVPR, 2010.
[4] Y. Bengio, A. Courville, and P. Vincent, Representation Learning: A Review and New Perspectives, In IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798-1828, 2013.
[5] T. Brox, L. Bourdev, S. Maji, and J. Malik, Object Segmentation by Alignment of Poselet Activations to Image Contours, In CVPR, 2011.
[6] J. Carreira and C. Sminchisescu, Constrained Parametric Min-Cuts for Automatic Object Segmentation, In CVPR, 2010.
[7] X. Chen, A. Shrivastava, and A. Gupta, Enriching Visual Knowledge Bases via Object Discovery and Segmentation, In CVPR, 2014.
[8] M. Cheng, N. Mitra, X. Huang, and S. Hu, SalientShape: Group Saliency in Image Collections, In The Visual Computer, 34(4):443-453, 2014.
[9] M. Cheng, G. Zhang, N. Mitra, X. Huang, and S. Hu, Global Contrast based Salient Region Detection, In CVPR, 2011.
[10] M. Cheng, J. Warrell, W. Lin, S. Zheng, V. Vineet, and N. Crook, Efficient Salient Region Detection with Soft Image Abstraction, In ICCV, 2013.
[11] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber, Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images, In NIPS, 2012.
[12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database, In CVPR, 2009.
[13] I. Endres and D. Hoiem, Category-Independent Object Proposals with Diverse Ranking, In IEEE Trans. Pattern Anal. Mach. Intell., 2014.
[14] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, Scalable Object Detection using Deep Neural Networks, In CVPR, 2014.
[15] M.
Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, The Pascal Visual Object Classes (VOC) Challenge, In Intl J. of Computer Vision, 88:303-338, 2010.
[16] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers, In ICML, 2012.
[17] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, Object Detection with Discriminatively Trained Part Based Models, In IEEE Trans. Pattern Anal. Mach. Intell., 2010.
[18] S. Fidler, R. Mottaghi, A. L. Yuille, and R. Urtasun, Bottom-Up Segmentation for Top-Down Detection, In CVPR, 2013.
[19] G. E. Hinton and R. R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, In Science, 313(5786):504-507, 2006.
[20] G. B. Huang and V. Jain, Deep and Wide Multiscale Recursive Networks for Robust Image Labeling, In NIPS, 2013.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, In NIPS, 2012.
[22] D. Kuettel and V. Ferrari, Figure-ground Segmentation by Transferring Window Masks, In CVPR, 2012.
[23] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, et al., Handwritten Digit Recognition with A Back-propagation Network, In NIPS, 1990.
[24] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, Equation of State Calculations by Fast Computing Machines, In J. Chemical Physics, 21(6):1087-1092, 1953.
[25] W. Ouyang and X. Wang, Joint Deep Learning for Pedestrian Detection, In ICCV, 2013.
[26] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung, Saliency Filters: Contrast Based Filtering for Salient Region Detection, In CVPR, 2012.
[27] A. Rosenfeld and D. Weinshall, Extracting Foreground Masks towards Object Recognition, In ICCV, 2011.
[28] M. Rubinstein, A. Joulin, J. Kopf, and C.
Liu, Unsupervised Joint Object Discovery and Segmentation in Internet Images, In CVPR, 2013.
[29] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, In ICLR, 2014.
[30] C. Szegedy, A. Toshev, and D. Erhan, Deep Neural Networks for Object Detection, In NIPS, 2013.
[31] Z. Tu and S. C. Zhu, Image Segmentation by Data-Driven Markov Chain Monte Carlo, In IEEE Trans. Pattern Anal. Mach. Intell., 24(5):657-673, 2002.
[32] Q. Yan, L. Xu, J. Shi, and J. Jia, Hierarchical Saliency Detection, In CVPR, 2013.
[33] Y. Yang, S. Hallman, D. Ramanan, and C. Fowlkes, Layered Object Detection for Multi-Class Segmentation, In CVPR, 2010.