{"title": "Beyond Grids: Learning Graph Representations for Visual Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 9225, "page_last": 9235, "abstract": "We propose learning graph representations from 2D feature maps for visual recognition. Our method draws inspiration from region based recognition, and learns to transform a 2D image into a graph structure. The vertices of the graph define clusters of pixels (\"regions\"), and the edges measure the similarity between these clusters in a feature space. Our method further learns to propagate information across all vertices on the graph, and is able to project the learned graph representation back into 2D grids. Our graph representation facilitates reasoning beyond regular grids and can capture long range dependencies among regions. We demonstrate that our model can be trained from end-to-end, and is easily integrated into existing networks. Finally, we evaluate our method on three challenging recognition tasks: semantic segmentation, object detection and object instance segmentation. For all tasks, our method outperforms state-of-the-art methods.", "full_text": "Beyond Grids: Learning Graph Representations\n\nfor Visual Recognition\n\nDepartment of Biostatistics & Medical Informatics\n\nYin Li \u2217\n\nDepartment of Computer Sciences\nUniversity of Wisconsin\u2013Madison\n\nyin.li@wisc.edu\n\nAbhinav Gupta\n\nThe Robotics Institute\n\nSchool of Computer Science\nCarnegie Mellon University\nabhinavg@cs.cmu.edu\n\nAbstract\n\nWe propose learning graph representations from 2D feature maps for visual recog-\nnition. Our method draws inspiration from region based recognition, and learns\nto transform a 2D image into a graph structure. The vertices of the graph de\ufb01ne\nclusters of pixels (\u201cregions\u201d), and the edges measure the similarity between these\nclusters in a feature space. 
Our method further learns to propagate information across all vertices on the graph, and is able to project the learned graph representation back into 2D grids. Our graph representation facilitates reasoning beyond regular grids and can capture long range dependencies among regions. We demonstrate that our model can be trained end-to-end, and is easily integrated into existing networks. Finally, we evaluate our method on three challenging recognition tasks: semantic segmentation, object detection and object instance segmentation. For all tasks, our method outperforms state-of-the-art methods.

1 Introduction

Deep convolutional networks have been tremendously successful for visual recognition [1]. These deep models stack many local operations of convolution and pooling. The assumption is that this stacking will not only provide a strong model for local patterns, but also create a large receptive field to capture long range dependencies, e.g., contextual relations between an object and other elements of the scene. However, this approach for modeling context is highly inefficient. A recent study [2] showed that even after hundreds of convolutions, the effective receptive field of a network's units is severely limited. Such a model may fail to incorporate global context beyond local regions.
Instead of "deep stacking", one appealing idea is to use image regions for context reasoning and visual recognition [3, 4, 5, 6, 7, 8, 9]. This paradigm builds on the theory of perceptual organization, and starts by grouping pixels into a small set of coherent regions. Recognition and context modeling are often posed as an inference problem on a graph structure [8, 10], with regions as vertices and the similarity between regions as edges.
This graph thus encodes dependencies between regions. These dependencies are of much longer range than those captured by local convolutions.
Inspired by region based recognition, we propose a novel approach for capturing long range dependencies using deep networks. Our key idea is to move beyond regular grids, and learn a graph representation for a 2D input image or feature map. This graph has its vertices defining clusters of pixels ("regions"), and its edges measuring the similarity between these clusters in a feature space. Our method further learns to propagate messages across all vertices on this graph, making it possible to share global information in a single operation. Finally, our method is able to project the learned graph representation back into 2D grids, and thus is fully compatible with existing networks.

∗The work was done when Y. Li was at CMU.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Specifically, our method consists of Graph Projection, Graph Convolution and Graph Re-projection. Graph projection turns a 2D feature map into a graph, where pixels with similar features are assigned to the same vertex. It further encodes features for each vertex and computes an adjacency matrix for each sample. Graph convolution makes use of convolutions on a graph structure [11], and updates vertex features based on the adjacency matrix. Finally, graph re-projection interpolates the vertex features into a 2D feature map, by reverting the pixel-to-vertex assignment from the projection step.
We evaluate our method on several challenging visual recognition tasks, including semantic segmentation, object detection and object instance segmentation. Our method consistently improves on state-of-the-art methods. For semantic segmentation, our method improves a baseline fully convolutional network [12] by ∼7%.
And our results slightly outperform the state-of-the-art context modeling approaches [13, 14]. For object detection and instance segmentation, our method improves the strong baseline of Mask RCNN [15] by ∼1%. Note that a 1% improvement is significant on COCO (even doubling the number of layers provides a 1-2% improvement). More importantly, we believe that our method offers a new perspective in designing deep models for visual recognition.

2 Related Work

Major progress has been made in visual recognition with the development of deep models. Deep networks have been widely used for image classification [1], semantic segmentation [12], object detection [16] and instance segmentation [15]. However, even after hundreds of convolution operations, these networks may fail to capture long range context in the input image [2, 13].
Several recent works have thus developed deep architectures for modeling visual context. For example, dilated convolutions are attached to deep networks to increase the size of their receptive fields [17]. A global context vector, pooled from all spatial positions, can be concatenated to local features [18, 13]. These new features thus encode both global context and local appearance. Moreover, local features across different scales can be fused to encode global context [19]. However, all previous methods still reside in a regular 2D feature map, with the exception of [20]. The non-local operation in [20] constructs a densely connected graph with pairwise edges between all pixels. Their method is therefore computationally heavy for high resolution feature maps, and is less desirable for tasks like semantic segmentation. Our method differs from these approaches by moving beyond regular grids and learning an efficient graph representation with a small number of vertices.
Our method is inspired by region based recognition. This idea dates back to the Gestalt school of visual perception.
In this setting, recognition is posed as labeling image regions. Examples include segmentation [21], object recognition [3], object detection [22, 23] and scene geometry estimation [24]. Several works addressed context reasoning among regions. Context can be encoded via a decomposition of regions [6], or via features from neighborhood regions [7]. Our graph representation resembles the key idea of a region graph in [10, 9, 8], where vertices are regions and edges encode relationships between regions. While previous approaches did not consider deep models, our model embeds a region graph in a deep network. More recently, region based recognition has been revisited in deep learning [25, 26, 27]. Nonetheless, these methods considered grouping as a pre-processing step, and did not learn a graph representation as our method does. In contrast, our method provides a novel deep model for learning graph representations of 2D visual data.
Furthermore, our method is related to learning deep models on graph structures [28, 11, 29]. Specifically, graph convolutional networks [11] are used to propagate information on our learned graph. However, our method focuses on learning graph representations rather than developing message passing methods on the graph. Finally, our graph projection step draws inspiration from nonlinear feature encoding methods, such as VLAD and Fisher Vectors [30, 31]. These methods have been revisited in the context of deep models [32, 33, 14]. However, previous methods focused on global encoding of local features, and did not consider the case of a graph representation.

3 Approach

In this section, we present our method for learning graph representations for visual recognition. We start with an overview of our key ideas, followed by a detailed derivation of the proposed graph convolutional unit.
Finally, we discuss the learning of our method and present approaches for incorporating our model into existing networks for recognition tasks.

Figure 1: Overview of our approach. Our Graph Convolutional Unit (GCU) projects a 2D feature map into a sample-dependent graph structure by assigning pixels to the vertices of the graph. GCU then passes information along the edges of the graph and updates the vertex features. Finally, these new vertex features are projected back into 2D grids based on the pixel-to-vertex assignment. GCU learns to reason beyond regular grids, captures long range dependencies across the 2D plane, and can be easily integrated into existing networks for recognition tasks.

3.1 Overview

For simplicity, we consider an input 2D feature map X of size H × W from a single sample. Our method easily extends to batch size ≥ 1 or 3D feature maps (e.g., videos). X contains the intermediate responses of a deep convolutional network. x_ij ∈ R^d thus indexes the d dimensional feature at pixel (i, j). Our proposed Graph Convolutional Unit (GCU) consists of three operations.

• Graph Projection Gproj. Gproj projects X into a graph G = (V, E) with its vertices as V and edges as E. Specifically, Gproj assigns pixels with similar features to the same vertex. This assignment is soft and likely groups pixels into coherent regions. Pixel features are further aggregated within each vertex, and form the vertex features Z ∈ R^{d×|V|} for graph G. Based on Z, we measure the distance between vertices, and compute the adjacency matrix. Moreover, we store the pixel-to-vertex assignments and will use them to re-project the graph back to 2D grids.
• Graph Convolution Gconv. Gconv performs convolutions on the graph G by propagating features Z along the edges of the graph. Gconv makes use of graph convolutions as in [11] and can stack multiple convolutions with nonlinear activation functions.
When G is densely connected, Gconv has a receptive field of all vertices on the graph, and thus is able to capture the global context of the input. Gconv outputs the transformed vertex features Z̃ ∈ R^{|V|×d̃}.
• Graph Re-projection Greproj. Greproj maps the new features Z̃ back into the 2D grid of size (H × W). This is done by "inverting" the assignments from the projection step. The output X̃ of Greproj will be a 2D feature map with dimension d̃ at each position (i, j). Thus, X̃ is compatible with a regular convolutional neural network.

Figure 1 presents an overview of our method. In a nutshell, our GCU can be expressed as

X̃ = GCU(X) = Greproj(Gconv(Gproj(X))).  (1)

It is more intuitive to consider our method in terms of pixels and regions. In GCU, "pixels" are assigned to vertices based on their feature vectors. Thus, each vertex defines a cluster of pixels, i.e., a "region" in the image. Each region re-computes its feature by pooling over all its pixels. The similarity between regions is estimated based on the pooled region features, and thus defines the structure of a region graph. Inference can then be performed on the graph by passing messages between regions and along the edges that connect them. This inference updates the feature for each region and can connect regions that are far away in the 2D space. The updated region features
The updated region features\ncan then be projected back to pixels by linearly interpolation between regions.\n\n3.2 Graph Convolutional Unit\n\nWe now describe the details of our graph projection, convolution and reprojection operations.\n\n3\n\nGraph Projection Graph Re-Projection Graph Convolution Soft AssignmentGraph Representation\fGraph Projection Gproj \ufb01rst assigns feature vectors X to a set of vertices, parameterzied by\nW \u2208 Rd\u00d7|V| and \u03a3 \u2208 Rd\u00d7|V|, with the number of vertices |V| pre-speci\ufb01ed. Each column wk \u2208 Rd\nof W speci\ufb01es an anchor point for the vertex k. Speci\ufb01cally, we compute a soft-assignment qk\nij of a\nfeature vector xij to wk by\n\nexp(cid:0)\u2212(cid:107)(xij \u2212 wk)/\u03c3k(cid:107)2\n2/2(cid:1)\n(cid:80)\nk exp (\u2212(cid:107)(xij \u2212 wk)/\u03c3k(cid:107)2\n\n2/2)\n\nqk\nij =\n\n,\n\n(2)\n\nwhere \u03c3k is the column vector of \u03a3 and / is the element-wise division. We constrain the range of\neach element in \u03c3k to (0, 1) by de\ufb01ning \u03c3k as the output of a sigmoid function. Eq 2 computes the\nweighted Euclidean distance between all xij and wk, and creates a soft-assignment using softmax\nfunction. We denote Q \u2208 RHW\u00d7|V| as the soft assignment matrix from pixel to vertices, with each\n\nrow vector qij such that(cid:80)\n\nk qk\n\nij = 1.\n\nMoreover, we encode features zk for each vertex k by\n\nzk =\n\nz(cid:48)\nk(cid:107)z(cid:48)\nk(cid:107)2\n\n(cid:48)\nk =\n\nz\n\n,\n\n(cid:88)\n\n1(cid:80)\n\nij qk\nij\n\nij\n\nij (xij \u2212 wk) /\u03c3k.\nqk\n\n(3)\n\nEach z(cid:48)\nk is a weighted average of the residuals between feature vectors xij to the vertex parameter wk.\nz(cid:48)\nk is further L2 normalized to get the feature vector zk for vertex k. zk thus forms the kth columns of\nthe feature matrix Z \u2208 Rd\u00d7|V|. We further compute the graph adjacency matrix as A = Z T Z. 
With normalized z_k, the entry A_{k,k'} of the adjacency matrix is the pairwise cosine similarity between the feature vectors z_k and z_{k'}. Note that removing the coefficient 1/Σ_ij q^k_ij does not impact the normalized feature z_k, yet will change the way that the gradients are computed.
Eq 3 is inspired by nonlinear feature encoding methods, such as VLAD or Fisher Vectors [30, 34, 31]. This connection is more obvious if we consider w_k as the cluster center and σ_k as its variance (assuming a diagonal covariance matrix). In this case, the L2 normalization is exactly the intra-normalization in [34]. We note that our encoding is different from VLAD or Fisher Vectors, as we do not concatenate the z_k into a global representation of X. Instead, we derive the graph structure from Z and keep the individual z_k as vertex features.
Eq 3 can be viewed as multiple parallel yet competing affine transforms, followed by weighted average pooling. Each z_k thus provides a different snapshot of the input X. Moreover, if w_k and σ_k are computed as the per-batch mean and variance for cluster k, each affine transform becomes batch normalization [35]. This link between Fisher Vectors and batch normalization is discussed in [36].
To summarize, the outputs of our graph projection operation are (1) the adjacency matrix A, (2) the vertex features Z and (3) the pixel-to-vertex assignment matrix Q. Moreover, our graph projection operation introduces 2|V|d new parameters. With a small number of vertices (e.g., 32) and a moderate feature dimension (e.g., 1024), the number of added parameters is small in comparison to those in the rest of a deep network. Moreover, every step in this projection operation is fully differentiable. Thus, the chain rule can be used to compute the derivatives with respect to the input X and the parameters (W and Σ).
In practice, we rely on automatic differentiation for back propagation.

Graph Convolution We make use of the graph convolution Gconv from [11] to further propagate information on the graph. Specifically, for a single graph convolution with parameters Wg ∈ R^{d×d̃}, the operation is defined as

Z̃ = f(A Z^T Wg),  (4)

where f can be any nonlinear activation function. We use Batch Normalization [35] with Rectified Linear Units for our models. For all our experiments, we use a single graph convolution, yet stacking multiple graph convolutions is a trivial extension. While each graph convolution has parameters of size (d × d̃), it remains highly efficient with a small number of vertices. Note that our adjacency matrix is computed per sample, and thus our graph representation is sample-dependent and will be updated during training. This is different from the settings in [28, 11, 29], where a sample-independent graph is pre-computed and remains unchanged during training.

Graph Re-projection Our graph re-projection operation Greproj takes as inputs the transformed vertex features Z̃ and the assignment matrix Q, and produces the 2D feature map X̃. Ideally, we would invert the assignment matrix Q, which is unfortunately infeasible. Instead, we compute the pixel features of X̃ using a re-weighting of the vertex features Z̃, given by X̃ = Q Z̃. Greproj thus linearly interpolates 2D pixel features based on their region assignments and does not have any parameters. Note that even if two pixels are assigned to the same vertex, their soft assignment weights differ, so they will have different features after re-projection. Thus, GCU is likely to preserve the spatial details of the signal. Finally, these projection results can be integrated into existing networks.

3.3 Learning Graph Representations

Our model is fully differentiable and can be trained end-to-end.
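The graph convolution (Eq 4) and the re-projection described above amount to a few matrix products. Below is a minimal NumPy sketch, assuming a plain ReLU for f (the paper uses batch normalization with ReLU); the function and variable names are ours:

```python
import numpy as np

def gcu_propagate(Q, Z, A, Wg):
    """Sketch of graph convolution (Eq. 4) followed by re-projection.

    Q: (H*W, K) soft assignments; Z: (d, K) vertex features;
    A: (K, K) adjacency; Wg: (d, d_out) graph convolution weights.
    """
    # Eq. 4: Z_tilde = f(A Z^T Wg); here f is a plain ReLU
    Z_tilde = np.maximum(A @ Z.T @ Wg, 0.0)   # (K, d_out)
    # Re-projection: each pixel linearly interpolates vertex features
    X_tilde = Q @ Z_tilde                     # (H*W, d_out)
    return Z_tilde, X_tilde
```

Reshaping X_tilde to (H, W, d_out) recovers a regular 2D feature map that can be added to or concatenated with the input, as in the graph blocks of Section 3.4.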
However, we find that learning GCUs faces an optimization challenge. Let us consider a corner case where the model assigns most of the input pixel features x_ij to a single vertex k. In this setting, the GCU degenerates to a linear function Wg(x_ij − w_k)/σ_k. While this corner case seems unlikely, we find that the model can indeed be trapped in such modes. For example, the model may always assign the whole image to a single vertex, but use different vertices for different images. To address this issue, we propose two strategies to regularize the learning of GCU.

Initialization by Clustering We initialize the W and Σ in graph projection operations by clustering the input feature maps. Specifically, we use K-Means clustering to get the centers for each column w_k of W. We also estimate the variance along each dimension and set up the column vectors σ_k of Σ. Note that our model always starts with a pre-trained network. Thus, K-Means will produce semantically meaningful clusters. And this initialization does not require labeled data. Once initialized, we use gradient descent to update W and Σ, and avoid tracking batch statistics as in [35]. We found that this initialization is helpful for stable training and gives slightly better results than random initialization.

Regularization by Diversity We find it beneficial to directly regularize the assignment between pixels and vertices. Specifically, we propose to add a graph diversity loss function that matches the distribution of the assignments p(Q) = Σ_ij q_ij ∈ R^{|V|} to a prior p. This is given by

L_div = KL(p(Q) || p),  (5)

where KL is the Kullback–Leibler divergence between p(Q) and p. This regularization term is highly flexible as it allows us to inject any prior distribution. We assume that p follows a uniform distribution.
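A minimal sketch of this diversity loss, assuming p(Q) is normalized by the number of pixels so that it forms a distribution (the text writes the unnormalized sum Σ_ij q_ij):

```python
import numpy as np

def diversity_loss(Q, eps=1e-8):
    """Sketch of Eq. 5: KL divergence between the empirical
    vertex-usage distribution and a uniform prior.

    Q: (H*W, K) soft assignment matrix, rows summing to one.
    """
    pQ = Q.sum(axis=0) / Q.shape[0]                 # empirical usage, sums to 1
    prior = np.full(Q.shape[1], 1.0 / Q.shape[1])   # uniform prior over vertices
    return float(np.sum(pQ * np.log((pQ + eps) / prior)))
```

Here `diversity_loss` is a hypothetical name; in training, this term would be added to the task loss with a small weight.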
This prior enforces that each vertex is used with equal frequency, and thus prevents the learning of "empty" vertices. We find that a small coefficient (0.05) on this loss term is sufficient.

3.4 Graph Blocks

Our GCU can be easily wrapped into a graph block and incorporated into an existing network. Specifically, we define a graph block as

X̂ = X ⊕ GCU_{k1}(X) ⊕ ... ⊕ GCU_{kn}(X),  (6)

where GCU_k denotes a GCU with k vertices and ⊕ can be either a residual connection or a concatenation operation. A residual connection allows us to insert a GCU into a network without changing its behavior (by using a zero initialization of the batch normalization after Gconv). Concatenating features supplements the original map X with global context captured at different granularities. We explore both architectures in our experiments. We use the residual connection with a single GCU for object detection and instance segmentation, and concatenate features from multiple GCUs for semantic segmentation. Details of these architectures are shown in Fig 2.

Figure 2: Architectures of two different graph blocks. Top: a single GCU with a residual connection can be incorporated into an existing network; Bottom: concatenation of multiple parallel GCUs introduces a new context model.

Computational Complexity We summarize the computational complexity of our graph block. More importantly, we compare the complexity to 2D convolutions and non-local networks [20].

• Complexity: For a feature map of size H × W with dimension d, our graph projection has a complexity of O(HWd|V|), where |V| is the number of vertices on the graph.
The graph convolution is O(|V|d² + |V|²d) and the re-projection takes O(HWd|V|) if we keep the output feature dimension the same as the input's.
• Comparison to convolutions: The graph projection, convolution and re-projection operations have roughly the same complexity as 1x1 convolutions with output dimension |V| (assuming HW ≥ d for high resolution feature maps). Thus, the complexity of a single GCU is approximately equivalent to stacking three 1x1 convolutions.
• Comparison to non-local networks: For the same setting, the non-local operation [20] has a complexity of O(H²W²d). With a high resolution feature map (large HW) and a small |V|, this is almost quadratic in comparison to our GCU.

4 Experiments

We present our experiments and discuss the results in this section. We test our method on three important recognition tasks: (1) semantic segmentation, (2) object detection and (3) object instance segmentation. Our experiments are thus organized into two parts.
We also explored different ways of incorporating our method in the experiments. For semantic segmentation, we concatenate multiple GCU outputs (Fig 2 Bottom). In this case, our method has to be complemented with extra convolutions for recognition, and can thus be considered a novel context model. For object detection and instance segmentation, we incorporate GCUs with residual connections (Fig 2 Top) into the Mask RCNN framework [15]. Here, our method does not change the original networks and thus serves as a new plugin unit.

4.1 Semantic Segmentation

We now present our results on semantic segmentation. We introduce the benchmark and implementation details, and present an ablation study of the GCU. More importantly, we compare our model to a set of baselines and discuss the results.

Dataset and Benchmark We use the ADE20K dataset [37] for semantic segmentation. ADE20K contains 22K densely labeled images.
The benchmark includes 150 semantic categories, covering both stuff (e.g., wall, sky) and objects (e.g., car, person). The categories are fine-grained and their number of samples follows a long tailed distribution. Therefore, this dataset is very challenging. We follow the same evaluation protocol as [13] and train our method on the 20K training set. We report the pixel level accuracy and mean Intersection over Union (mIoU) on the 2K validation set.

Implementation Details Our base model attaches 4 GCUs to the last block of a backbone network and concatenates their outputs, followed by convolutions for pixel labeling. These GCUs have (2, 4, 8, 32) vertices and output dimensions of d̃ = 256. These numbers are chosen to roughly match the number of parameters and operations in [13]. We use ResNet 50/101 [38] pre-trained on ImageNet [39] as our backbone network. Similar to [13], we add dilation to the last two residual blocks, so the output is down-sampled by a factor of 8. We upsample the result to the original resolution using bilinear interpolation. As in [13], we crop the image into a fixed size (505x505) with data augmentations (random flip, rotation, scale) and train for 120 epochs. We also add an auxiliary loss after the 4th residual block with a weight of 0.4, as in [13]. The network is trained using SGD with batch size 16 (across 4 GPUs), learning rate 0.01 and momentum 0.9. We also adopt the "poly" decay for the learning rate schedule [40], and enable synchronized batch normalization. For inference, we average network outputs from multiple scales.

Ablation Study We provide an ablation study of the GCU using the task of semantic segmentation. The results are reported on ADE20K with pre-trained ResNet 50 as the backbone. First, we vary the number of GCUs. With dilated convolutions, the backbone itself has a mIoU of 35.6%. Adding a single GCU with 2 vertices to the backbone achieves 39.43%, a ∼4% boost.
Using two GCUs with (2, 4) vertices reaches 40.92%. And our base model (4 GCUs with (2, 4, 8, 32) vertices) reaches 42.60%. Alternatively, if we increase the number of vertices in the last GCU of our base model (from 32 to 64), the mIoU score stays similar to the base model (42.58% vs. 42.60%). Adding more vertices increases the run-time and memory cost, yet does not seem to improve the performance. Second, we evaluate our initialization and regularization schemes. Our base model (4 GCUs) without regularization and initialization has a mIoU of 41.34%. Our regularization improves the result by 0.39% (41.73%). Our initialization further adds another 0.87% (42.60%). Thus, both the diversity loss and the clustering based initialization help to improve the performance.

Figure 3: Visualization of segmentation results on ADE20K (with ResNet 50). Our method produces "smoother" maps: regions that are similar are likely to be labeled as the same category.

Table 1: Results of semantic segmentation on ADE20K. mIoU scores within 0.5% of the best result are marked. With ResNet 50, our method improves Dilated FCN by 7%. With ResNet 101, our method outperforms PSPNet by 1.5%.

Backbone | Method | PixAcc% | mIoU%
VGG16 [42] | FCN-8s [12] | 71.32 | 29.39
VGG16 [42] | SegNet [41] | 71.00 | 21.64
VGG16 [42] | DilatedNet [17] | 73.55 | 32.31
VGG16 [42] | CascadeNet [37] | 74.52 | 34.90
Res50 [38] | Dilated FCN | 76.51 | 35.60
Res50 [38] | PSPNet [13] | 80.76 | 42.78
Res50 [38] | EncNet [14] | 79.73 | 41.11
Res50 [38] | GCU (ours) | 79.51 | 42.60
Res101 [38] | RefineNet [19] | - | 40.20
Res101 [38] | PSPNet [13] | 81.39 | 43.29
Res101 [38] | EncNet [14] | 81.69 | 44.65
Res101 [38] | GCU (ours) | 81.19 | 44.81

Baselines We further compare our method with a set of baselines. These baselines are organized as:
• Dilated FCN: This is the backbone network of our method, where we add dilation to a ResNet. It is also a variant of DeepLab [40].
• Context Models: We include results from recent context models for deep networks.
Specifically, we compare to state-of-the-art results from PSPNet [13], RefineNet [19] and EncNet [14]. These are close competitors of our method.
• Other Methods: We also report results from [12, 17, 41, 37, 19] for reference.

Results and Discussions Our main results are summarized in Table 1. Our method (GCU) improves the backbone Dilated FCN network by 7% in mIoU. With ResNet 50, our result on mIoU is comparable to PSPNet. With ResNet 101, our method is 1.5% better than PSPNet and 4.6% higher than RefineNet in mIoU. We also notice that our pixel level accuracy is consistently lower than PSPNet by 0.2-1.2%. One possibility is that GCU produces "diffused" pixel features. This is because the output features of GCU are linearly interpolated from region features, which are averaged across pixels. We visualize our results in Fig 3 and find that our method does tend to over-smooth the outputs (see the missing clock in the left column). A similar property is also observed in region based recognition [10]. Even a good grouping may decrease the performance if the recognition goes wrong. For example, in the middle column of Fig 3, our method mis-classified the "building" region as "house" and has a lower score than the baseline Dilated FCN. Nonetheless, our method is able to assign the same category to the pixels on the building surface, which were previously divided into pieces.
Thus far, we have described our method by taking the analogy of region based recognition. However, we must point out that our method is trained without supervision of regions. And there is no guarantee that it will learn a valid representation of regions or region graphs. To further diagnose our method, we create visualizations of the assignment matrix in GCU (see Fig 4). It is interesting to see that our method does learn to identify some meaningful components of the scene.
For example, with 2 vertices, the network seems to build up the concept of foreground vs. background. With 4 vertices, there seems to be a weak correlation between the assignments and the spatial layout (e.g., pink for ground, and yellow for vertical surfaces). As the number of vertices grows, the assignment begins to over-segment the image, creating superpixel-like regions.

Figure 4: Visualization of the assignment matrix in GCU for semantic segmentation (with ResNet 50). From left to right: input image, pixel-to-vertex assignments with 2, 4, 8 and 32 vertices. Pixels with the same color are assigned to the same vertex. Vertices are colored consistently across images.

Figure 5: Visualization of object instance segmentation results (with ResNet 50). Left: Mask RCNN; Right: Ours. Zoom in for details.

Backbone | Method | APbox | APbox50 | APbox75 | APseg | APseg50 | APseg75
ResNet 50 [38] | Mask RCNN [15, 20] | 38.0 | 59.6 | 41.0 | 34.6 | 56.4 | 36.5
ResNet 50 [38] | Mask RCNN + NL [20] | 39.0 | 61.1 | 41.9 | 35.5 | 58.0 | 37.4
ResNet 50 [38] | Mask RCNN (Detectron) [15, 44] | 37.7 | 59.2 | 40.9 | 33.9 | 55.8 | 35.8
ResNet 50 [38] | Mask RCNN (Detectron) + GCU | 38.7 | 60.5 | 41.7 | 34.7 | 57.2 | 36.5
ResNet 101 [38] | Mask RCNN [15, 20] | 39.5 | 61.3 | 42.9 | 36.0 | 58.1 | 38.3
ResNet 101 [38] | Mask RCNN + NL [20] | 40.8 | 63.1 | 44.5 | 37.1 | 59.9 | 39.2
ResNet 101 [38] | Mask RCNN (Detectron) [15, 44] | 40.0 | 61.8 | 43.7 | 35.9 | 58.3 | 38.0
ResNet 101 [38] | Mask RCNN (Detectron) + GCU | 41.1 | 63.2 | 44.9 | 36.9 | 59.8 | 39.0

Table 2: Results of object detection and instance segmentation on the COCO dataset. We compare our single-model results to state-of-the-art methods on COCO minival. Scores of APbox and APseg within 0.5% of the best result are marked.
Our method (GCU) improves the strong baseline of Mask RCNN by ∼1% across different networks.

4.2 Object Detection and Instance Segmentation

We now present our benchmark and results on object detection and instance segmentation.

Dataset and Benchmark For both object detection and instance segmentation, we use the COCO dataset [43]. COCO is by far the most challenging dataset for these tasks, with more than 160K images annotated with object bounding boxes and masks. We report the standard COCO metrics, including AP (averaged over IoU thresholds) and AP50, AP75 (AP at fixed IoU thresholds) for both boxes and masks. As in previous work [15, 16], we train using the union of the 80K train images and a 35K subset of val images (trainval35k), and report results on the remaining 5K val images (minival).

Implementation Details For both tasks, we attach 4 GCUs to the last four residual blocks in ResNet 50/101 [38] with FPN [45]. These GCUs have 32 vertices and output dimensions ˜d = [256, 512, 1024, 2048] that match the feature dimensions of the network. Our GCUs are added with residual connections and zero initialization, after the last convolution of the residual block and before the FPN layers. We train the model using SGD with a batch size of 8 across 4 GPUs. Following the training schedule (x1) in [44], we linearly scale the training iterations (180K) and initial learning rate (0.01) based on our batch size. The learning rate is decreased by a factor of 10 at 120K and 160K iterations. We also freeze the batch normalization layers.
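To make the projection step described above concrete, the following is a minimal pure-Python sketch of how a graph-projection unit can soft-assign pixel features to a few vertices, average them into region features, and linearly interpolate them back with a residual connection. This is a hypothetical simplification: the actual GCU uses learned parameters (and a zero-initialized final convolution) and also propagates information along graph edges, both of which are omitted here; the anchor-based similarity is our illustrative assumption.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def gcu_sketch(pixels, anchors):
    """pixels: list of d-dim features; anchors: K d-dim vertex anchors (assumed)."""
    # 1. Soft-assign each pixel to K vertices (similarity = negative squared distance).
    assign = []
    for p in pixels:
        sims = [-sum((pi - ai) ** 2 for pi, ai in zip(p, a)) for a in anchors]
        assign.append(softmax(sims))
    # 2. Vertex (region) features: assignment-weighted average of pixel features.
    K, d = len(anchors), len(pixels[0])
    vertex = [[0.0] * d for _ in range(K)]
    weight = [1e-8] * K  # small epsilon avoids division by zero
    for p, q in zip(pixels, assign):
        for k in range(K):
            weight[k] += q[k]
            for j in range(d):
                vertex[k][j] += q[k] * p[j]
    vertex = [[v / w for v in row] for row, w in zip(vertex, weight)]
    # (graph convolution over the K vertices would happen here in the full model)
    # 3. Project back to pixels: linear interpolation of vertex features,
    #    added as a residual on top of the input features.
    out = []
    for p, q in zip(pixels, assign):
        proj = [sum(q[k] * vertex[k][j] for k in range(K)) for j in range(d)]
        out.append([pj + prj for pj, prj in zip(p, proj)])
    return out, assign
```

Because every pixel output is an interpolation of region averages, this sketch also illustrates why the projected features can look "diffused", as discussed in the semantic segmentation results.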
Other implementation details for training and inference are kept the same as [15]. Note that only random flip is used for data augmentation during training, and our results are reported without test-time augmentation (e.g., multi-scale or flip); these could be incorporated to further improve performance. It is also possible to further boost the performance of both the baseline and our method by training for longer (as in the x2 schedule of [44]).

Baselines Our method is a plug-in unit for Mask RCNN. Our baselines thus include

• Mask RCNN: This is the result reported in [20]. This version is slightly better than the original Mask RCNN [15] by replacing the stage-wise training with end-to-end training.
• Mask RCNN + NL: This is the result of adding non-local operations to the backbone network of Mask RCNN [20]. This operation is designed to capture long range dependencies.
• Detectron: This is the open source version of Mask RCNN [44] with end-to-end training and a careful learning rate schedule. Our method builds on top of this implementation.

Results and Discussions Our results for both tasks are summarized in Table 2. Our method consistently improves the baseline Mask RCNN (Detectron) results by ∼1% for both detection and segmentation, and for both ResNet 50 and 101. This trend of improvement is also observed when adding non-local networks. We must emphasize that our baseline (Detectron) is a well-optimized version of Mask RCNN; thus, our improvements are non-trivial. Moreover, we present visualizations of our results and compare them to Mask RCNN (Detectron) in Fig 5. By modeling context using a graph representation, our method is able to find objects that were previously missed (the "boat" in the first row), resolve ambiguity in region classification ("truck" vs "bus" in the second row) and better estimate the spatial extent of objects (third row).
One of the failure modes of our model is the missed detection of small, out-of-context objects, such as the "skis" in the sky (zoom in to see in the last row). We hypothesize that this is again due to the "diffused" local features in GCU.

5 Conclusion

In this paper, we have presented a novel deep model for learning graph representations from 2D visual data. Our method transforms a 2D feature map into a graph structure, where the vertices define regions and the edges capture the relationships between regions. Context modeling and recognition can then be done on this graph structure. In this sense, our method resembles the key idea behind region based recognition. Our model thus addresses pixel grouping, region representation, context modeling and recognition under the same framework. We have evaluated our method on several challenging visual recognition tasks, where our results outperformed state-of-the-art methods. Through careful analysis of these results, we demonstrated that our model is able to learn primitive grouping of scene components (such as foreground vs. background), and further leverage these components for recognition. Our method thus revisits region based recognition in the era of deep learning. We hope our work will offer useful insights for re-thinking the design of visual representations in deep models.

Acknowledgments This work was supported by ONR MURI N000141612007, a Sloan Fellowship, an Okawa Fellowship and an ONR Young Investigator Award to AG. The authors thank Xiaolong Wang for many helpful discussions, and Jianping Shi for sharing implementation details of PSPNet.

References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

[2] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks.
In NIPS, 2016.

[3] Chunhui Gu, Joseph J Lim, Pablo Arbeláez, and Jitendra Malik. Recognition using regions. In CVPR, 2009.

[4] Pablo Arbeláez, Bharath Hariharan, Chunhui Gu, Saurabh Gupta, Lubomir Bourdev, and Jitendra Malik. Semantic segmentation using regions and parts. In CVPR, 2012.

[5] João Carreira, Fuxin Li, and Cristian Sminchisescu. Object recognition by sequential figure-ground ranking. IJCV, 2012.

[6] Daniel Munoz, J Andrew Bagnell, and Martial Hebert. Stacked hierarchical labeling. In ECCV, 2010.

[7] Brian Fulkerson, Andrea Vedaldi, and Stefano Soatto. Class segmentation and object localization with superpixel neighborhoods. In CVPR, 2009.

[8] Tomasz Malisiewicz and Alyosha Efros. Beyond categories: The visual memex model for reasoning about object relationships. In NIPS, 2009.

[9] Stephen Gould, Tianshi Gao, and Daphne Koller. Region-based segmentation and object detection. In NIPS, 2009.

[10] Ľubor Ladický, Chris Russell, Pushmeet Kohli, and Philip H. S. Torr. Associative hierarchical CRFs for object class image segmentation. In ICCV, 2009.

[11] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

[12] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.

[13] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.

[14] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR, 2018.

[15] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.

[16] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks.
In NIPS, 2015.

[17] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.

[18] Wei Liu, Andrew Rabinovich, and Alexander C Berg. Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.

[19] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, 2017.

[20] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.

[21] Pushmeet Kohli, Ľubor Ladický, and Philip H. S. Torr. Robust higher order potentials for enforcing label consistency. IJCV, 2009.

[22] Junjie Yan, Yinan Yu, Xiangyu Zhu, Zhen Lei, and Stan Z Li. Object detection by labeling superpixels. In CVPR, 2015.

[23] Sanja Fidler, Roozbeh Mottaghi, Alan Yuille, and Raquel Urtasun. Bottom-up segmentation for top-down detection. In CVPR, 2013.

[24] Derek Hoiem, Alexei A Efros, and Martial Hebert. Geometric context from a single image. In ICCV, 2005.

[25] Clement Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI, 2013.

[26] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.

[27] Raghudeep Gadde, Varun Jampani, Martin Kiefel, Daniel Kappler, and Peter V Gehler. Superpixel convolutional networks using bilateral inceptions. In ECCV, 2016.

[28] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015.

[29] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model.
IEEE TNN, 2009.

[30] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.

[31] Florent Perronnin, Jorge Sánchez, and Thomas Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, 2010.

[32] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep Fisher networks for large-scale image classification. In NIPS, 2013.

[33] Hang Zhang, Jia Xue, and Kristin Dana. Deep TEN: Texture encoding network. In CVPR, 2017.

[34] Relja Arandjelovic and Andrew Zisserman. All about VLAD. In CVPR, 2013.

[35] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[36] Mahdi M Kalayeh and Mubarak Shah. Training faster by separating modes of variation in batch-normalized models. arXiv preprint arXiv:1806.02892, 2018.

[37] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017.

[38] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[39] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.

[40] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI, 2017.

[41] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI, 2017.

[42] Karen Simonyan and Andrew Zisserman.
Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[43] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

[44] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.

[45] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.