{"title": "Discriminatively Trained Sparse Code Gradients for Contour Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 584, "page_last": 592, "abstract": "Finding contours in natural images is a fundamental problem that serves as the basis of many tasks such as image segmentation and object recognition. At the core of contour detection technologies are a set of hand-designed gradient features, used by most existing approaches including the state-of-the-art Global Pb (gPb) operator. In this work, we show that contour detection accuracy can be significantly improved by computing Sparse Code Gradients (SCG), which measure contrast using patch representations automatically learned through sparse coding. We use K-SVD and Orthogonal Matching Pursuit for efficient dictionary learning and encoding, and use multi-scale pooling and power transforms to code oriented local neighborhoods before computing gradients and applying linear SVM. By extracting rich representations from pixels and avoiding collapsing them prematurely, Sparse Code Gradients effectively learn how to measure local contrasts and find contours. We improve the F-measure metric on the BSDS500 benchmark to 0.74 (up from 0.71 of gPb contours). Moreover, our learning approach can easily adapt to novel sensor data such as Kinect-style RGB-D cameras: Sparse Code Gradients on depth images and surface normals lead to promising contour detection using depth and depth+color, as verified on the NYU Depth Dataset. 
Our work combines the concept of oriented gradients with sparse representation and opens up future possibilities for learning contour detection and segmentation.", "full_text": "Discriminatively Trained Sparse Code Gradients for Contour Detection\n\nXiaofeng Ren and Liefeng Bo\n{xiaofeng.ren,liefeng.bo}@intel.com\nIntel Science and Technology Center for Pervasive Computing, Intel Labs\nSeattle, WA 98195, USA\n\nAbstract\n\nFinding contours in natural images is a fundamental problem that serves as the basis of many tasks such as image segmentation and object recognition. At the core of contour detection technologies are a set of hand-designed gradient features, used by most approaches including the state-of-the-art Global Pb (gPb) operator. In this work, we show that contour detection accuracy can be significantly improved by computing Sparse Code Gradients (SCG), which measure contrast using patch representations automatically learned through sparse coding. We use K-SVD for dictionary learning and Orthogonal Matching Pursuit for computing sparse codes on oriented local neighborhoods, and apply multi-scale pooling and power transforms before classifying them with linear SVMs. By extracting rich representations from pixels and avoiding collapsing them prematurely, Sparse Code Gradients effectively learn how to measure local contrasts and find contours. We improve the F-measure metric on the BSDS500 benchmark to 0.74 (up from 0.71 of gPb contours). Moreover, our learning approach can easily adapt to novel sensor data such as Kinect-style RGB-D cameras: Sparse Code Gradients on depth maps and surface normals lead to promising contour detection using depth and depth+color, as verified on the NYU Depth Dataset.\n\n1 Introduction\n\nContour detection is a fundamental problem in vision. 
Accurately finding both object boundaries and interior contours has far-reaching implications for many vision tasks including segmentation, recognition and scene understanding. High-quality image segmentation has increasingly been relying on contour analysis, such as in the widely used system of Global Pb [2]. Contours and segmentations have also seen extensive uses in shape matching and object recognition [8, 9].\nAccurately finding contours in natural images is a challenging problem and has been extensively studied. With the availability of datasets with human-marked groundtruth contours, a variety of approaches have been proposed and evaluated (see a summary in [2]), such as learning to classify [17, 20, 16], contour grouping [23, 31, 12], multi-scale features [21, 2], and hierarchical region analysis [2]. Most of these approaches have one thing in common [17, 23, 31, 21, 12, 2]: they are built on top of a set of gradient features [17] measuring local contrast of oriented discs, using chi-square distances of histograms of color and textons. Despite various efforts to use generic image features [5] or learn them [16], these hand-designed gradients are still widely used after a decade and support top-ranking algorithms on the Berkeley benchmarks [2].\nIn this work, we demonstrate that contour detection can be vastly improved by replacing the hand-designed Pb gradients of [17] with rich representations that are automatically learned from data. We use sparse coding, in particular Orthogonal Matching Pursuit [18] and K-SVD [1], to learn such representations on patches. Instead of a direct classification of patches [16], the sparse codes on the pixels are pooled over multi-scale half-discs for each orientation, in the spirit of the Pb gradients, before being classified with a linear SVM. The SVM outputs are then smoothed and non-max suppressed over orientations, as commonly done, to produce the final contours (see Fig. 1).\n\nFigure 1: We combine sparse coding and oriented gradients for contour analysis on color as well as depth images. Sparse coding automatically learns a rich representation of patches from data. With multi-scale pooling, oriented gradients efficiently capture local contrast and lead to much more accurate contour detection than those using hand-designed features including Global Pb (gPb) [2].\n\nOur sparse code gradients (SCG) are much more effective in capturing local contour contrast than existing features. By only changing local features and keeping the smoothing and globalization parts fixed, we improve the F-measure on the BSDS500 benchmark to 0.74 (up from 0.71 of gPb), a substantial step toward human-level accuracy (see the precision-recall curves in Fig. 4). Large improvements in accuracy are also observed on other datasets including MSRC2 and PASCAL2008. Moreover, our approach is built on unsupervised feature learning and can directly apply to novel sensor data such as RGB-D images from Kinect-style depth cameras. Using the NYU Depth dataset [27], we verify that our SCG approach combines the strengths of color and depth contour detection and outperforms an adaptation of gPb to RGB-D by a large margin.\n\n2 Related Work\n\nContour detection has a long history in computer vision as a fundamental building block. Modern approaches to contour detection are evaluated on datasets of natural images against human-marked groundtruth. The Pb work of Martin et al. [17] combined a set of gradient features, using brightness, color and textons, to outperform the Canny edge detector on the Berkeley Benchmark (BSDS). Multi-scale versions of Pb were developed and found beneficial [21, 2]. Building on top of the Pb gradients, many approaches studied the globalization aspects, i.e. 
moving beyond local classification and enforcing consistency and continuity of contours. Ren et al. developed CRF models on superpixels to learn junction types [23]. Zhu et al. used circular embedding to enforce orderings of edgels [31]. The gPb work of Arbelaez et al. computed gradients on eigenvectors of the affinity graph and combined them with local cues [2]. In addition to Pb gradients, Dollar et al. [5] learned boosted trees on generic features such as gradients and Haar wavelets, Kokkinos used SIFT features on edgels [12], and Prasad et al. [20] used raw pixels in class-specific settings. One closely related work was the discriminative sparse models of Mairal et al. [16], which used K-SVD to represent multi-scale patches and had moderate success on the BSDS. A major difference of our work is the use of oriented gradients: compared to directly classifying a patch, measuring contrast between oriented half-discs is a much easier problem and can be effectively learned.\nSparse coding represents a signal by reconstructing it using a small set of basis functions. It has seen wide uses in vision, for example for faces [28] and recognition [29]. Similar to deep network approaches [11, 14], recent works tried to avoid feature engineering and employed sparse coding of image patches to learn features from \u201cscratch\u201d, for texture analysis [15] and object recognition [30, 3]. In particular, Orthogonal Matching Pursuit [18] is a greedy algorithm that incrementally finds sparse codes, and K-SVD is also efficient and popular for dictionary learning. Closely related to our work but on the different problem of recognition, Bo et al. 
used matching pursuit and K-SVD to learn features in a coding hierarchy [3] and are extending their approach to RGB-D data [4].\n\n[Figure 1 pipeline: image patch (gray, ab) and optional depth patch (depth, surface normal) -> local sparse coding -> per-pixel sparse codes -> multi-scale pooling, oriented gradients, power transforms -> linear SVMs (one per orientation) -> RGB-(D) contours]\n\nThanks to the mass production of Kinect, active RGB-D cameras became affordable and were quickly adopted in vision research and applications. The Kinect pose estimation of Shotton et al. used random forests to learn from a huge amount of data [25]. Henry et al. used RGB-D cameras to scan large environments into 3D models [10]. RGB-D data were also studied in the context of object recognition [13] and scene labeling [27, 22]. In-depth studies of contour and segmentation problems for depth data are much in need given the fast growing interests in RGB-D perception.\n\n3 Contour Detection using Sparse Code Gradients\n\nWe start by examining the processing pipeline of Global Pb (gPb) [2], a highly influential and widely used system for contour detection. The gPb contour detection has two stages: local contrast estimation at multiple scales, and globalization of the local cues using spectral grouping. The core of the approach lies within its use of local cues in oriented gradients. Originally developed in [17], this set of features uses relatively simple pixel representations (histograms of brightness, color and textons) and similarity functions (chi-square distance, manually chosen), compared to recent advances in using rich representations for high-level recognition (e.g. 
[11, 29, 30, 3]).\nWe set out to show that both the pixel representation and the aggregation of pixel information in local neighborhoods can be much improved and, to a large extent, learned from and adapted to input data. For pixel representation, in Section 3.1 we show how to use Orthogonal Matching Pursuit [18] and K-SVD [1], efficient sparse coding and dictionary learning algorithms that readily apply to low-level vision, to extract sparse codes at every pixel. This sparse coding approach can be viewed as similar in spirit to the use of filterbanks but avoids manual choices and thus directly applies to the RGB-D data from Kinect. We show learned dictionaries for a number of channels that exhibit different characteristics: grayscale/luminance, chromaticity (ab), depth, and surface normal.\nIn Section 3.2 we show how the pixel-level sparse codes can be integrated through multi-scale pooling into a rich representation of oriented local neighborhoods. By computing oriented gradients on this high dimensional representation and using a double power transform to code the features for linear classification, we show a linear SVM can be efficiently and effectively trained for each orientation to classify contour vs non-contour, yielding local contrast estimates that are much more accurate than the hand-designed features in gPb.\n\n3.1 Local Sparse Representation of RGB-(D) Patches\n\nK-SVD and Orthogonal Matching Pursuit. K-SVD [1] is a popular dictionary learning algorithm that generalizes K-Means and learns dictionaries of codewords from unsupervised data. Given a set of image patches Y = [y_1, ···, y_n], K-SVD jointly finds a dictionary D = [d_1, ···, d_m] and an associated sparse code matrix X = [x_1, ···, x_n] by minimizing the reconstruction error\n\nmin_{D,X} ‖Y − DX‖²_F   s.t. ∀i, ‖x_i‖₀ ≤ K; ∀j, ‖d_j‖₂ = 1   (1)\n\nwhere ‖·‖_F denotes the Frobenius norm, x_i are the columns of X, the zero-norm ‖·‖₀ counts the non-zero entries in the sparse code x_i, and K is a predefined sparsity level (number of non-zero entries). This optimization can be solved in an alternating manner. Given the dictionary D, optimizing the sparse code matrix X can be decoupled into sub-problems, each solved with Orthogonal Matching Pursuit (OMP) [18], a greedy algorithm for finding sparse codes. Given the codes X, the dictionary D and its associated sparse coefficients are updated sequentially by singular value decomposition. For our purpose of representing local patches, the dictionary D has a small size (we use 75 for 5x5 patches) and does not require a lot of sample patches, and it can be learned in a matter of minutes.\nOnce the dictionary D is learned, we again use the Orthogonal Matching Pursuit (OMP) algorithm to compute sparse codes at every pixel. This can be efficiently done with convolution and a batch version of the OMP algorithm [24]. For a typical BSDS image of resolution 321x481, the sparse code extraction is efficient and takes 1~2 seconds.\nSparse Representation of RGB-D Data. One advantage of unsupervised dictionary learning is that it readily applies to novel sensor data, such as the color and depth frames from a Kinect-style RGB-D camera. We learn K-SVD dictionaries for up to four channels of color and depth: grayscale for luminance, chromaticity ab for color in the Lab space, depth (distance to camera) and surface normal (3-dim). The learned dictionaries are visualized in Fig. 2. 
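The OMP encoding step described above can be illustrated with a minimal NumPy sketch (the paper uses the optimized batch OMP of [24]; the function name, the orthonormal toy dictionary, and the variable names below are ours, for illustration only):

```python
import numpy as np

def omp(D, y, K):
    """Orthogonal Matching Pursuit (sketch): greedily select at most K atoms
    of the unit-norm dictionary D, re-fitting all selected atoms jointly by
    least squares after each selection (the "orthogonal" step)."""
    residual = y.astype(float)
    support = []
    x = np.zeros(D.shape[1])
    coef = np.zeros(0)
    for _ in range(K):
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

# toy check with an orthonormal 75-atom dictionary, where recovery is exact
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((75, 75)))
x_true = np.zeros(75)
x_true[[3, 40]] = [1.5, -2.0]
x_hat = omp(Q, Q @ x_true, K=2)
```

K-SVD then alternates this per-patch encoding with the sequential SVD-based dictionary updates described above.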
(a) Grayscale  (b) Chromaticity (ab)  (c) Depth  (d) Surface normal\n\nFigure 2: K-SVD dictionaries learned for four different channels: grayscale and chromaticity (in ab) for an RGB image (a,b), and depth and surface normal for a depth image (c,d). We use a fixed dictionary size of 75 on 5x5 patches. The ab channel is visualized using a constant luminance of 50. The 3-dimensional surface normal (xyz) is visualized in RGB (i.e. blue for frontal-parallel surfaces).\n\nThese dictionaries are interesting to look at and qualitatively distinctive: for example, the surface normal codewords tend to be more smooth due to flat surfaces, the depth codewords are also more smooth but with speckles, and the chromaticity codewords respect the opponent color pairs. The channels are coded separately.\n\n3.2 Coding Multi-Scale Neighborhoods for Measuring Contrast\n\nMulti-Scale Pooling over Oriented Half-Discs. Over decades of research on contour detection and related topics, a number of fundamental observations have been made, repeatedly: (1) contrast is the key to differentiate contour vs non-contour; (2) orientation is important for respecting contour continuity; and (3) multi-scale is useful. We do not wish to throw out these principles. Instead, we seek to adopt these principles for our case of high dimensional representations with sparse codes.\nEach pixel is represented with sparse codes extracted from a small patch (5-by-5) around it. To aggregate pixel information, we use oriented half-discs as used in gPb (see an illustration in Fig. 1). Each orientation is processed separately. For each orientation, at each pixel p and scale s, we define two half-discs (rectangles) N^a and N^b of size s-by-(2s+1), on both sides of p, rotated to that orientation. For each half-disc N, we use average pooling on non-zero entries (i.e. 
a hybrid of average and max pooling) to generate its representation\n\nF(N) = [ Σ_{i∈N} |x_{i1}| / Σ_{i∈N} I_{|x_{i1}|>0}, ···, Σ_{i∈N} |x_{im}| / Σ_{i∈N} I_{|x_{im}|>0} ]   (2)\n\nwhere x_{ij} is the j-th entry of the sparse code x_i, and I is the indicator function of whether x_{ij} is non-zero. We rotate the image (after sparse coding) and use integral images for fast computations (on both |x_{ij}| and I_{|x_{ij}|>0}), whose costs are independent of the size of N.\nFor two oriented half-discs N^a and N^b at a scale s, we compute a difference (gradient) vector D\n\nD(N^a_s, N^b_s) = |F(N^a_s) − F(N^b_s)|   (3)\n\nwhere |·| is an element-wise absolute value operation. We divide D(N^a_s, N^b_s) by their norms ‖F(N^a_s)‖ + ‖F(N^b_s)‖ + ε, where ε is a positive number. Since the magnitude of sparse codes varies over a wide range due to local variations in illumination as well as occlusion, this step makes the appearance features robust to such variations and increases their discriminative power, as commonly done in both contour detection and object recognition. This value is not hard to set, and we find a value of ε = 0.5 is better than, for instance, ε = 0.\nAt this stage, one could train a classifier on D for each scale to convert it to a scalar value of contrast, which would resemble the chi-square distance function in gPb. Instead, we find that it is much better to avoid doing so separately at each scale and to combine multi-scale features in a joint representation, so as to allow interactions both between codewords and between scales. 
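The integral-image trick mentioned above, which makes the pooling cost independent of the half-disc size, can be sketched as follows (a generic summed-area table, not the authors' code; it would be applied to both |x_{ij}| and the non-zero indicator maps after rotating the image):

```python
import numpy as np

def integral_image(a):
    """2-D summed-area table with a zero top row and left column."""
    s = np.zeros((a.shape[0] + 1, a.shape[1] + 1))
    s[1:, 1:] = a.cumsum(axis=0).cumsum(axis=1)
    return s

def rect_sum(s, r0, c0, r1, c1):
    """Sum of a[r0:r1, c0:c1] in O(1), regardless of rectangle size."""
    return s[r1, c1] - s[r0, c1] - s[r1, c0] + s[r0, c0]

# toy check on a 3x4 map of pooled magnitudes
a = np.arange(12, dtype=float).reshape(3, 4)
s = integral_image(a)
total = rect_sum(s, 1, 1, 3, 3)
```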
That is, our final representation of the contrast at a pixel p is the concatenation of sparse codes pooled at all the scales s ∈ {1, ···, S} (we use S = 4):\n\nD_p = [ D(N^a_1, N^b_1), ···, D(N^a_S, N^b_S); F(N^a_1 ∪ N^b_1), ···, F(N^a_S ∪ N^b_S) ]   (4)\n\nIn addition to the difference D, we also include a union term F(N^a_s ∪ N^b_s), which captures the appearance of the whole disc (union of the two half-discs) and is normalized by ‖F(N^a_s)‖ + ‖F(N^b_s)‖ + ε.\nDouble Power Transform and Linear Classifiers. The concatenated feature D_p (non-negative) provides multi-scale contrast information for classifying whether p is a contour location for a particular orientation. As D_p is high dimensional (1200 and above in our experiments) and we need to do it at every pixel and every orientation, we prefer using linear SVMs for both efficient testing as well as training. Directly learning a linear function on D_p, however, does not work very well. Instead, we apply a double power transformation to make the features more suitable for linear SVMs:\n\nD_p = [ D_p^{α1}, D_p^{α2} ]   (5)\n\nwhere 0 < α1 < α2 < 1. Empirically, we find that the double power transform works much better than either no transform or a single power transform α, as sometimes done in other classification contexts. Perronnin et al. [19] provided an intuition why a power transform helps classification, which \u201cre-normalizes\u201d the distribution of the features into a more Gaussian form. One plausible intuition for a double power transform is that the optimal exponent α may be different across feature dimensions. 
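Equations (2), (3) and (5) are straightforward to prototype; a minimal NumPy sketch (function names are ours, and the half-disc pixel sets are taken as given rather than extracted from an oriented disc):

```python
import numpy as np

def pool_half_disc(codes):
    """Eq. (2): per-dimension average of |x| over its non-zero entries,
    for the sparse codes of all pixels in one half-disc N.
    codes: (n_pixels, m) array of sparse codes."""
    mag = np.abs(codes)
    nonzero = (mag > 0).sum(axis=0)
    return mag.sum(axis=0) / np.maximum(nonzero, 1)  # guard all-zero dims

def contrast(codes_a, codes_b, eps=0.5):
    """Eq. (3) plus the norm normalization described in the text."""
    fa, fb = pool_half_disc(codes_a), pool_half_disc(codes_b)
    return np.abs(fa - fb) / (np.linalg.norm(fa) + np.linalg.norm(fb) + eps)

def double_power(Dp, a1=0.25, a2=0.75):
    """Eq. (5): concatenate two element-wise power transforms, 0 < a1 < a2 < 1."""
    return np.concatenate([Dp ** a1, Dp ** a2])

# toy example: two 2-pixel half-discs with 2-dimensional sparse codes
grad = contrast(np.array([[1.0, 0.0], [3.0, 0.0]]),
                np.array([[0.0, 2.0], [0.0, 2.0]]))
features = double_power(grad)
```

In the full pipeline, `grad` would be concatenated over all scales (plus the union terms) before the power transform, and `features` fed to the per-orientation linear SVM.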
By putting two power transforms of D_p together, we allow the classifier to pick its linear combination, different for each dimension, during the stage of supervised training.\nFrom Local Contrast to Global Contours. We intentionally only change the local contrast estimation in gPb and keep the other steps fixed. These steps include: (1) the Savitzky-Golay filter to smooth responses and find peak locations; (2) non-max suppression over orientations; and (3) optionally, we apply the globalization step in gPb that computes a spectral gradient from the local gradients and then linearly combines the spectral gradient with the local ones. A sigmoid transform step is needed to convert the SVM outputs on D_p before computing spectral gradients.\n\n4 Experiments\n\nWe use the evaluation framework of, and extensively compare to, the publicly available Global Pb (gPb) system [2], widely used as the state of the art for contour detection¹. All the results reported on gPb are from running the gPb contour detection and evaluation codes (with default parameters), and accuracies are verified against the published results in [2]. The gPb evaluation includes a number of criteria, including precision-recall (P/R) curves from contour matching (Fig. 4), F-measures computed from P/R (Tables 1, 2, 3) with a fixed contour threshold (ODS) or per-image thresholds (OIS), as well as average precisions (AP) from the P/R curves.\nBenchmark Datasets. The main dataset we use is the BSDS500 benchmark [2], an extension of the original BSDS300 benchmark and commonly used for contour evaluation. It includes 500 natural images of roughly resolution 321x481, including 200 for training, 100 for validation, and 200 for testing. We conduct both color and grayscale experiments (where we convert the BSDS500 images to grayscale and retain the groundtruth). In addition, we also use the MSRC2 and PASCAL2008 segmentation datasets [26, 6], as done in the gPb work [2]. 
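The ODS/OIS distinction used throughout the evaluation can be made concrete with a small sketch (simplified: it averages per-image F-measures, whereas the actual benchmark pools matched-pixel counts across images; the function names are ours):

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def ods_ois(pr):
    """pr[i][t] = (precision, recall) of image i at threshold index t.
    ODS shares one contour threshold across the whole dataset; OIS picks
    the best threshold per image, so OIS >= ODS by construction."""
    n_img, n_thr = len(pr), len(pr[0])
    ods = max(sum(f_measure(*img[t]) for img in pr) / n_img
              for t in range(n_thr))
    ois = sum(max(f_measure(*img[t]) for t in range(n_thr))
              for img in pr) / n_img
    return ods, ois

# two images whose best thresholds disagree: OIS exceeds ODS
pr = [[(0.9, 0.9), (0.2, 0.2)],
      [(0.2, 0.2), (0.9, 0.9)]]
ods, ois = ods_ois(pr)
```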
The MSRC2 dataset has 591 images of resolution 200x300; we randomly choose half for training and half for testing. The PASCAL2008 dataset includes 1023 images in its training and validation sets, roughly of resolution 350x500. We randomly choose half for training and half for testing.\nFor RGB-D contour detection, we use the NYU Depth dataset (v2) [27], which includes 1449 pairs of color and depth frames of resolution 480x640, with groundtruth semantic regions. We choose 60% of the images for training and 40% for testing, as in its scene labeling setup. The Kinect images are of lower quality than BSDS, and we resize the frames to 240x320 in our experiments.\nTraining Sparse Code Gradients. Given sparse codes from K-SVD and Orthogonal Matching Pursuit, we train the Sparse Code Gradients classifiers, one linear SVM per orientation, from sampled locations. For positive data, we sample groundtruth contour locations and estimate the orientations at these locations using groundtruth. For negative data, locations and orientations are random. We subtract the mean from the patches in each data channel. For BSDS500, we typically have 1.5 to 2 million data points. We use 4 spatial scales, at half-disc sizes 2, 4, 7, 25. For a dictionary size of 75 and 4 scales, the feature length for one data channel is 1200. For full RGB-D data, the dimension is 4800. For BSDS500, we train only using the 200 training images. We modify liblinear [7] to take dense matrices (features are dense after pooling) and single-precision floats.\n\n¹In this work we focus on contour detection and do not address how to derive segmentations from contours.\n\nFigure 3: Analysis of our sparse code gradients, using average precision of classification on sampled boundaries. (a) The effect of single-scale vs multi-scale pooling (accumulated from the smallest). (b) Accuracy increasing with dictionary size, for four orientation channels. (c) The effect of the sparsity level K, which exhibits different behavior for grayscale and chromaticity.\n\nTable 1: F-measure evaluation on the BSDS500 benchmark [2], comparing to gPb on grayscale and color images, both for local contour detection as well as for global detection (i.e. combined with the spectral gradient analysis in [2]).\n\nBSDS500              ODS  OIS  AP\nlocal   gPb (gray)   .67  .69  .68\nlocal   SCG (gray)   .69  .71  .71\nlocal   gPb (color)  .70  .72  .71\nlocal   SCG (color)  .72  .74  .75\nglobal  gPb (gray)   .69  .71  .67\nglobal  SCG (gray)   .71  .73  .74\nglobal  gPb (color)  .71  .74  .72\nglobal  SCG (color)  .74  .76  .77\n\nFigure 4: Precision-recall curves of SCG vs gPb on BSDS500, for grayscale and color images. We make a substantial step beyond the current state of the art toward reaching human-level accuracy (green dot).\n\nLooking under the Hood. We empirically analyze a number of settings in our Sparse Code Gradients. In particular, we want to understand how the choices in the local sparse coding affect contour classification. Fig. 3 shows the effects of multi-scale pooling, dictionary size, and sparsity level (K). The numbers reported are intermediate results, namely the mean of average precision of the four oriented gradient classifiers (0, 45, 90, 135 degrees) on sampled locations (grayscale unless otherwise noted, on validation). As a reference, the average precision of gPb on this task is 0.878.\nFor multi-scale pooling, the single best scale for the half-disc filter is about 4x8, consistent with the settings in gPb. 
For accumulated scales (using all the scales from the smallest up to the current level), the accuracy continues to increase and does not seem to be saturated, suggesting the use of larger scales. The dictionary size has a minor impact, and there is a small (yet observable) benefit to using dictionaries larger than 75, particularly for diagonal orientations (45- and 135-deg). The sparsity level K is a more intriguing issue. In Fig. 3(c), we see that for grayscale only, K = 1 (normalized nearest neighbor) does quite well; on the other hand, color needs a larger K, possibly because ab is a nonlinear space. When combining grayscale and color, it seems that we want K to be at least 3. It also varies with orientation: horizontal and vertical edges require a smaller K than diagonal edges. (If using K = 1, our final F-measure on BSDS500 is 0.730.)\nWe also empirically evaluate the double power transform vs single power transform vs no transform. With no transform, the average precision is 0.865. With a single power transform, the best choice of the exponent is around 0.4, with average precision 0.884. A double power transform (with exponents 0.25 and 0.75, which can be computed through sqrt) improves the average precision to 0.900, which translates to a large improvement in contour detection accuracy.\n\n[Figure 3 plots: average precision vs pooling disc size (single scale vs accumulated scales), vs dictionary size (horizontal, 45-deg, vertical, 135-deg edges), and vs sparsity level (gray, color (ab), gray+color).]\n\n[Figure 4 plot: precision-recall curves; gPb (gray) F=0.69, gPb (color) F=0.71, SCG (gray) F=0.71, SCG (color) F=0.74.]\n\nTable 2: F-measure evaluation comparing our SCG approach to gPb on two additional image datasets with contour groundtruth: MSRC2 [26] and PASCAL2008 [6].\n\nMSRC2        ODS  OIS  AP\ngPb          .37  .39  .22\nSCG          .43  .43  .33\nPASCAL2008   ODS  OIS  AP\ngPb          .34  .38  .20\nSCG          .37  .41  .27\n\nTable 3: F-measure evaluation on RGB-D contour detection using the NYU dataset (v2) [27]. We compare to gPb on using color image only, depth only, as well as color+depth.\n\nRGB-D (NYU v2)  ODS  OIS  AP\ngPb (color)     .51  .52  .37\nSCG (color)     .55  .57  .46\ngPb (depth)     .44  .46  .28\nSCG (depth)     .53  .54  .45\ngPb (RGB-D)     .53  .54  .40\nSCG (RGB-D)     .62  .63  .54\n\nFigure 5: Examples from the BSDS500 dataset [2]. (Top) Image; (Middle) gPb output; (Bottom) SCG output (this work). Our SCG operator learns to preserve fine details (e.g. windmills, faces, fish fins) while at the same time achieving higher precision on large-scale contours (e.g. back of zebras). (Contours are shown in double width for the sake of visualization.)\n\nImage Benchmarking Results. In Table 1 and Fig. 4 we show the precision-recall of our Sparse Code Gradients vs gPb on the BSDS500 benchmark. We conduct four sets of experiments, using color or grayscale images, with or without the globalization component (for which we use exactly the same setup as in gPb). 
Using Sparse Code Gradients leads to a signi\ufb01cant improvement in\naccuracy in all four cases. The local version of our SCG operator, i.e. only using local contrast, is\nalready better (F = 0.72) than gPb with globalization (F = 0.71). The full version, local SCG plus\nspectral gradient (computed from local SCG), reaches an F-measure of 0.739, a large step forward\nfrom gPb, as seen in the precision-recall curves in Fig. 4. On BSDS300, our F-measure is 0.715.\nWe observe that SCG seems to pick up \ufb01ne-scale details much better than gPb, hence the much\nhigher recall rate, while maintaining higher precision over the entire range. This can be seen in the\nexamples shown in Fig. 5. While our scale range is similar to that of gPb, the multi-scale pooling\nscheme allows the \ufb02exibility of learning the balance of scales separately for each code word, which\nmay help detecting the details. The supplemental material contains more comparison examples.\nIn Table 2 we show the benchmarking results for two additional datasets, MSRC2 and PAS-\nCAL2008. Again we observe large improvements in accuracy, in spite of the somewhat different\nnatures of the scenes in these datasets. The improvement on MSRC2 is much larger, partly because\nthe images are smaller, hence the contours are smaller in scale and may be over-smoothed in gPb.\nAs for computational cost, using integral images, local SCG takes \u223c100 seconds to compute on a\nsingle-thread Intel Core i5-2500 CPU on a BSDS image. It is slower than but comparable to the\nhighly optimized multi-thread C++ implementation of gPb (\u223c60 seconds).\n\n7\n\n\fFigure 6: Examples of RGB-D contour detection on the NYU dataset (v2) [27]. The \ufb01ve panels\nare: input image, input depth, image-only contours, depth-only contours, and color+depth contours.\nColor is good picking up details such as photos on the wall, and depth is useful where color is\nuniform (e.g. corner of a room, row 1) or illumination is poor (e.g. 
chair, row 2).\n\nRGB-D Contour Detection. We use the second version of the NYU Depth Dataset [27], which\nhas higher quality groundtruth than the \ufb01rst version. A median \ufb01ltering is applied to remove double\ncontours (boundaries from two adjacent regions) within 3 pixels. For RGB-D baseline, we use a\nsimple adaptation of gPb: the depth values are in meters and used directly as a grayscale image\nin gPb gradient computation. We use a linear combination to put (soft) color and depth gradients\ntogether in gPb before non-max suppression, with the weight set from validation.\nTable 3 lists the precision-recall evaluations of SCG vs gPb for RGB-D contour detection. All\nthe SCG settings (such as scales and dictionary sizes) are kept the same as for BSDS. SCG again\noutperforms gPb in all the cases. In particular, we are much better for depth-only contours, for\nwhich gPb is not designed. Our approach learns the low-level representations of depth data fully\nautomatically and does not require any manual tweaking. We also achieve a much larger boost by\ncombining color and depth, demonstrating that color and depth channels contain complementary\ninformation and are both critical for RGB-D contour detection. Qualitatively, it is easy to see that\nRGB-D combines the strengths of color and depth and is a promising direction for contour and\nsegmentation tasks and indoor scene analysis in general [22]. Fig. 6 shows a few examples of RGB-\nD contours from our SCG operator. There are plenty of such cases where color alone or depth alone\nwould fail to extract contours for meaningful parts of the scenes, and color+depth would succeed.\n\n5 Discussions\nIn this work we successfully showed how to learn and code local representations to extract contours\nin natural images. Our approach combined the proven concept of oriented gradients with powerful\nrepresentations that are automatically learned through sparse coding. 
Sparse Code Gradients (SCG) performed significantly better than hand-designed features that had been in use for a decade, and pushed contour detection much closer to human-level accuracy, as illustrated on the BSDS500 benchmark. Compared to hand-designed features (e.g. Global Pb [2]), we maintain the high-dimensional representation from pooling oriented neighborhoods and do not collapse it prematurely (such as by computing a chi-square distance at each scale). This passes a richer set of information into learning contour classification, where a double power transform effectively codes the features for linear SVMs. Compared to previous learning approaches (e.g. discriminative dictionaries in [16]), our use of multi-scale pooling and oriented gradients leads to much higher classification accuracies.
Our work opens up future possibilities for learning contour detection and segmentation. As we illustrated, there is a lot of local information waiting to be extracted, and a learning approach such as sparse coding provides a principled way to do so, where rich representations can be automatically constructed and adapted. This is particularly important for novel sensor data such as RGB-D, for which we have less understanding but increasingly more need.

References
[1] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311-4322, 2006.
[2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Trans. PAMI, 33(5):898-916, 2011.
[3] L. Bo, X. Ren, and D. Fox. Hierarchical matching pursuit for image classification: Architecture and fast algorithms. In Advances in Neural Information Processing Systems 24, 2011.
[4] L. Bo, X. Ren, and D. Fox. Unsupervised feature learning for RGB-D based object recognition.
In International Symposium on Experimental Robotics (ISER), 2012.
[5] P. Dollar, Z. Tu, and S. Belongie. Supervised learning of edges and object boundaries. In CVPR, volume 2, pages 1964-1971, 2006.
[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2008 (VOC2008). http://www.pascal-network.org/challenges/VOC/voc2008/.
[7] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871-1874, 2008.
[8] V. Ferrari, T. Tuytelaars, and L. V. Gool. Object detection by contour segment networks. In ECCV, pages 14-28, 2006.
[9] C. Gu, J. Lim, P. Arbeláez, and J. Malik. Recognition using regions. In CVPR, pages 1030-1037, 2009.
[10] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments. In International Symposium on Experimental Robotics (ISER), 2010.
[11] G. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.
[12] I. Kokkinos. Highly accurate boundary detection and grouping. In CVPR, pages 2520-2527, 2010.
[13] K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view RGB-D object dataset. In ICRA, pages 1817-1824, 2011.
[14] H. Lee, R. Grosse, R. Ranganath, and A. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, pages 609-616, 2009.
[15] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. In CVPR, pages 1-8, 2008.
[16] J. Mairal, M. Leordeanu, F. Bach, M. Hebert, and J. Ponce. Discriminative sparse image models for class-specific edge detection and image interpretation.
In ECCV, pages 43-56, 2008.
[17] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using brightness and texture. In Advances in Neural Information Processing Systems 15, 2002.
[18] Y. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, pages 40-44, 1993.
[19] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, pages 143-156, 2010.
[20] M. Prasad, A. Zisserman, A. Fitzgibbon, M. Kumar, and P. Torr. Learning class-specific edges for object detection and segmentation. Computer Vision, Graphics and Image Processing, pages 94-105, 2006.
[21] X. Ren. Multi-scale improves boundary detection in natural images. In ECCV, pages 533-545, 2008.
[22] X. Ren, L. Bo, and D. Fox. RGB-(D) scene labeling: Features and algorithms. In CVPR, pages 2759-2766, 2012.
[23] X. Ren, C. Fowlkes, and J. Malik. Cue integration in figure/ground labeling. In Advances in Neural Information Processing Systems 18, 2005.
[24] R. Rubinstein, M. Zibulevsky, and M. Elad. Efficient implementation of the K-SVD algorithm using Batch Orthogonal Matching Pursuit. Technical report, CS Technion, 2008.
[25] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In CVPR, volume 2, page 3, 2011.
[26] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In ECCV, 2006.
[27] N. Silberman and R. Fergus.
Indoor scene segmentation using a structured light sensor. In IEEE Workshop on 3D Representation and Recognition (3dRR), 2011.
[28] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Trans. PAMI, 31(2):210-227, 2009.
[29] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, pages 1794-1801, 2009.
[30] K. Yu, Y. Lin, and J. Lafferty. Learning image representations from the pixel level via hierarchical sparse coding. In CVPR, pages 1713-1720, 2011.
[31] Q. Zhu, G. Song, and J. Shi. Untangling cycles for contour grouping. In ICCV, 2007.