{"title": "Learnable Tree Filter for Structure-preserving Feature Transform", "book": "Advances in Neural Information Processing Systems", "page_first": 1711, "page_last": 1721, "abstract": "Learning discriminative global features plays a vital role in semantic segmentation. And most of the existing methods adopt stacks of local convolutions or non-local blocks to capture long-range context. However, due to the absence of spatial structure preservation, these operators ignore the object details when enlarging receptive fields. In this paper, we propose the learnable tree filter to form a generic tree filtering module that leverages the structural property of minimal spanning tree to model long-range dependencies while preserving the details. Furthermore, we propose a highly efficient linear-time algorithm to reduce resource consumption. Thus, the designed modules can be plugged into existing deep neural networks conveniently. To this end, tree filtering modules are embedded to formulate a unified framework for semantic segmentation. We conduct extensive ablation studies to elaborate on the effectiveness and efficiency of the proposed method. Specifically, it attains better performance with much less overhead compared with the classic PSP block and Non-local operation under the same backbone. Our approach is proved to achieve consistent improvements on several benchmarks without bells-and-whistles. Code and models are available at https://github.com/StevenGrove/TreeFilter-Torch.", "full_text": "Learnable Tree Filter for Structure-preserving\n\nFeature Transform\n\nLin Song1\u2217 Yanwei Li2,3\u2217 Zeming Li4 Gang Yu4 Hongbin Sun1\u2020\n\nJian Sun4 Nanning Zheng1\n\n1 Institute of Arti\ufb01cial Intelligence and Robotics, Xi\u2019an Jiaotong Univeristy.\n\n2 Institute of Automation, Chinese Academy of Sciences.\n\n3 University of Chinese Academy of Sciences. 4 Megvii Inc. 
(Face++).\n\nstevengrove@stu.xjtu.edu.cn, liyanwei2017@ia.ac.cn,\n\n{hsun, nnzheng}@mail.xjtu.edu.cn, {lizeming, yugang, sunjian}@megvii.com\n\nAbstract\n\nLearning discriminative global features plays a vital role in semantic segmentation. Most existing methods adopt stacks of local convolutions or non-local blocks to capture long-range context. However, due to the absence of spatial structure preservation, these operators ignore object details when enlarging receptive fields. In this paper, we propose the learnable tree filter to form a generic tree filtering module that leverages the structural property of the minimal spanning tree to model long-range dependencies while preserving details. Furthermore, we propose a highly efficient linear-time algorithm to reduce resource consumption, so the designed modules can be conveniently plugged into existing deep neural networks. To this end, tree filtering modules are embedded to formulate a unified framework for semantic segmentation. We conduct extensive ablation studies to elaborate on the effectiveness and efficiency of the proposed method. Specifically, it attains better performance with much less overhead than the classic PSP block and the Non-local operation under the same backbone. Our approach achieves consistent improvements on several benchmarks without bells and whistles. Code and models are available at https://github.com/StevenGrove/TreeFilter-Torch.\n\n1 Introduction\n\nScene perception, based on semantic segmentation, is a fundamental yet challenging topic in the vision field. The goal is to assign each pixel in the image one of several predefined categories. With the development of convolutional neural networks (CNNs), semantic segmentation has achieved promising results using improved feature representations. 
Recently, numerous approaches have been proposed to capture larger receptive fields for global context aggregation [1–5], which can be divided into local and non-local solutions according to their pipelines.\nTraditional local approaches enlarge receptive fields by stacking conventional convolutions [6–8] or their variants (e.g., atrous convolutions [9, 2]). However, the distribution of impact within a receptive field in deep stacks of convolutions converges to a Gaussian [10], without detailed structure preservation (the pertinent details, which have proved effective in feature representation [11, 12]).\nConsidering the limitation of local operations, several non-local solutions have been proposed to model long-range feature dependencies directly, such as convolutional methods (e.g., non-local operations [13], PSP [3] and ASPP modules [2, 14, 15]) and graph-based neural networks [16–18]. However, due to the absence of a structure-preserving property, which considers both spatial distance and feature dissimilarity, the object details are still neglected.\n\n∗Equal contribution. This work was done in Megvii Research.\n†Corresponding author.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Toy illustration of the tree filtering module. Given a detail-rich feature map from a low-level stage, we first measure the dissimilarity between each pixel and its quad neighbours. Then, the MST is built upon the 4-connected planar graph to formulate a learnable tree filter. The edge between two vertices denotes the distance calculated from high-level semantics. Red edges indicate the close relation with vertex k. The intra-class inconsistency could be alleviated after feature transform.\n
Going one step further, the abovementioned operations can be viewed as coarse feature aggregation methods, which fail to explicitly preserve the original structures when capturing long-range context cues.\nIn this work, we aim to fix this issue by introducing a novel network component that enables efficient structure-preserving feature transform, called the learnable tree filter. Motivated by the traditional tree filter [19], a widely used image denoising operator, we utilize tree-structured graphs to model long-range dependencies while preserving the object structure. To this end, we first build the low-level-guided minimum spanning trees (MST), as illustrated in Fig. 1. Then the distances between vertices in the MST are calculated based on the high-level semantics and can be optimized by backpropagation. For example, the dissimilarity wk,m between vertices k and m in Fig. 1 is calculated from semantic-rich feature embeddings. Thus, combined with the structural property of the MST, the spatial distance and feature dissimilarity are modeled in the tree-structured graph simultaneously (e.g., the distance between vertex k and its spatially adjacent vertex n is enlarged in Fig. 1, since more dissimilarity-weighted edges are accumulated when approaching n). To enable practical application, we further propose an efficient algorithm that reduces the O(N²) complexity of the brute-force implementation to linear-time consumption. Consequently, different from conditional random fields (CRF) [20–22], the formulated modules can be embedded into several neural network layers for differentiable optimization.\nIn principle, the proposed tree filtering module is fundamentally different from most CNN-based methods. 
The approach exploits a new dimension: a tree-structured graph is utilized for structure-preserving feature transform, bringing detailed object structure as well as long-range dependencies. With the designed efficient implementation, the proposed approach can be applied to multi-scale feature aggregation with much less resource consumption. Moreover, extensive ablation studies have been conducted to elaborate on its superiority in both performance and efficiency, even compared with the PSP block [3] and the Non-local block [13]. Experiments on two well-known datasets (PASCAL VOC 2012 [23] and Cityscapes [24]) also prove the effectiveness of the proposed method.\n\n2 Learnable Tree Filter\n\nTo preserve object structures when capturing long-range dependencies, we formulate the proposed learnable tree filter into a generic feature extractor, called the tree filtering module. Thus, it can be easily embedded in deep neural networks for end-to-end optimization. In this section, we first introduce the learnable tree filtering operator. Then the efficient implementation is presented for practical applications. The constructed framework for semantic segmentation is elaborated at last.\n\n2.1 Formulation\n\nFirst, we represent the low-level feature as an undirected graph G = (V, E), with the dissimilarity weight ω for edges. The vertices V are the semantic features, and their interconnections can be denoted as E. A low-level stage feature map, which contains abundant object details, is adopted as the guidance for the 4-connected planar graph construction, as illustrated in Fig. 1. Thus, a spanning tree can be generated from the graph by performing a pruning algorithm to remove edges with substantial dissimilarity. From this perspective, the graph G turns into the minimum spanning tree (MST), whose sum of dissimilarity weights is minimal among all spanning trees. 
The property of the MST ensures preferential interaction among similar vertices. Motivated by the traditional tree filter [19], a generic tree filtering module in a deep neural network can be formulated as:\n\nyi = (1/zi) Σ_{∀j∈Ω} S(Ei,j) f(xj),   (1)\n\nwhere i and j indicate the indices of vertices, Ω denotes the set of all vertices in the tree G, x represents the input encoded features, and y denotes the output signal with the same shape as x. Ei,j is a hyperedge that contains the set of vertices traced from vertex i to j in G. The similarity function S projects the features of the hyperedge into a positive scalar value, as described in Eq. 2. The unary function f(·) represents the feature embedding transformation. zi is the summation of the similarities S(Ei,j) over j, which normalizes the response.\n\nS(Ei,j) = exp(−D(i, j)), where D(i, j) = D(j, i) = Σ_{(k,m)∈Ei,j} ωk,m.   (2)\n\nAccording to Eq. 1, the tree filtering operation can be considered one kind of weighted-average filter. The variable ωk,m indicates the dissimilarity between adjacent vertices (k and m), which can be computed by a pairwise function (here we adopt the Euclidean distance). The distance D between two vertices (i and j) is defined as the summation of the dissimilarities ωk,m along the path in the hyperedge Ei,j. Note that D degenerates into the spatial distance when ω is set to a constant matrix. Since ω actually measures pairwise distance in the embedded space, the aggregation along the pre-generated tree considers spatial distance and feature difference simultaneously.\n\nyi = (1/zi) Σ_{∀j∈Ω} f(xj) Π_{(k,m)∈Ei,j} exp(−ωk,m), where zi = Σ_{∀j∈Ω} Π_{(k,m)∈Ei,j} exp(−ωk,m).   (3)\n\nThe tree filtering module can thus be reformulated as Eq. 3. 
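As a concrete sanity check on Eq. 1–3, the filter can be sketched as a brute-force O(N²) routine on a toy tree (plain Python; the function and variable names are ours, and f is taken as the identity mapping, as later adopted in Sec. 2.3):

```python
import math
from collections import deque

def tree_filter_bruteforce(adj, x):
    """Brute-force tree filter (Eq. 1-2). adj[i] lists the tree edges (j, w_ij)
    of vertex i; x holds scalar per-vertex features, i.e. f is the identity."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        # D(i, j): accumulate dissimilarities along the unique tree path via BFS.
        dist = [None] * n
        dist[i] = 0.0
        queue = deque([i])
        while queue:
            k = queue.popleft()
            for m, w in adj[k]:
                if dist[m] is None:
                    dist[m] = dist[k] + w
                    queue.append(m)
        s = [math.exp(-d) for d in dist]          # S(E_ij) = exp(-D(i, j))
        z = sum(s)                                # normalization coefficient z_i
        y[i] = sum(sj * xj for sj, xj in zip(s, x)) / z
    return y

# Toy chain 0-1-2 with unit dissimilarities on both edges.
adj = {0: [(1, 1.0)], 1: [(0, 1.0), (2, 1.0)], 2: [(1, 1.0)]}
y = tree_filter_bruteforce(adj, [1.0, 0.0, 0.0])
```

With constant ω, the distance D reduces to a scaled tree distance, matching the degenerate case noted below Eq. 2.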
Clearly, the output response yi depends on the input features xj and the dissimilarities ωk,m. Therefore, the derivatives of the output with respect to the input variables can be derived as Eq. 4 and Eq. 5, where V_m^i in Eq. 5 is defined as the children of vertex m in the tree whose root node is vertex i.\n\n∂yi/∂xj = (S(Ei,j)/zi) · ∂f(xj)/∂xj,   (4)\n\n∂yi/∂ωk,m = (S(Ei,k)/zi) · (∂S(Ek,m)/∂ωk,m) · (Σ_{j∈V_m^i} S(Em,j) f(xj) − yi zm).   (5)\n\nIn this way, the proposed tree filtering operator can be formulated as a differentiable module, which can be optimized by the backpropagation algorithm in an end-to-end manner.\n\n2.2 Efficient Implementation\n\nLet N denote the number of vertices in the tree G. The tree filtering module accumulates over N vertices for each output vertex, so for each channel the computational complexity of the brute-force implementation is O(N²), which is prohibitive for practical applications. By definition, a tree contains no loops among the connections of its vertices. Exploiting this property, a well-designed dynamic programming algorithm can be used to speed up the optimization and inference processes. We introduce two sequential passes, namely aggregation and propagation, which are performed by traversing the tree structure. Let one vertex be the root node. In the aggregation pass, the process is traced from the leaf nodes to the root node in the tree; the features of a vertex are not updated until all of its children have been visited. 
In the propagation pass, the features are propagated from the updated vertices to their children recursively.\n\nAggr(ξ)i = ξi + Σ_{par(j)=i} S(Ei,j) Aggr(ξ)j.   (6)\n\nProp(ξ)i = Aggr(ξ)r if i = r; otherwise Prop(ξ)i = S(Epar(i),i) Prop(ξ)par(i) + (1 − S²(Ei,par(i))) Aggr(ξ)i.   (7)\n\nThe sequential passes can be formulated as recursive operators on the input ξ: the aggregation pass and the propagation pass are illustrated in Eq. 6 and Eq. 7, respectively, where par(i) indicates the parent of vertex i in the tree whose root is vertex r. Prop(ξ)r is initialized with the updated value Aggr(ξ)r of the root vertex r.\n\nAlgorithm 1: Linear-time algorithm for the Learnable Tree Filter\nInput: tree G ∈ N^{(N−1)×2}; input feature x ∈ R^{C×N}; pairwise distance ω ∈ R^N; gradient of the loss w.r.t. the output feature φ ∈ R^{C×N}; channel number C; vertex number N; set of vertices Ω.\nOutput: output feature y; gradients of the loss w.r.t. the input feature ∂loss/∂x and the pairwise distance ∂loss/∂ω.\nPreparation:\n  r = Uniform(Ω)   ▷ root vertex sampled with uniform distribution\n  T = BFS(G, r)   ▷ breadth-first topological order for Aggr and Prop\n  J = 1 ∈ R^{1×N}   ▷ all-ones matrix for the normalization coefficient\nForward:\n  1. {ρ̂, ẑ} = Aggr({f(x), J})   ▷ aggregation from leaves to root\n  2. {ρ, z} = Prop({ρ̂, ẑ})   ▷ propagation from root to leaves\n  3. y = ρ/z   ▷ normalized output feature\nBackward:\n  1. {ψ̂, ν̂} = Aggr({φ/z, φ·y/z})   ▷ aggregation from leaves to root\n  2. {ψ, ν} = Prop({ψ̂, ν̂})   ▷ propagation from root to leaves\n  3. ∂loss/∂x = (∂f(x)/∂x) · ψ   ▷ gradient of the loss w.r.t. the input feature\n  4. for i ∈ T∖{r} do\n       j = par(i)   ▷ parent of vertex i\n       γ^s_i = ψ̂i·ρi + ψi·ρ̂i − 2 S(Ei,j) ψ̂i·ρ̂i   ▷ gradient of the unnormalized output feature\n       γ^z_i = ν̂i zi + νi ẑi − 2 S(Ei,j) ν̂i ẑi   ▷ gradient of the normalization coefficient\n       ∂loss/∂ωi,j = (∂S(Ei,j)/∂ωi,j) Σ (γ^s_i − γ^z_i)   ▷ gradient of the loss w.r.t. the pairwise distance\n     end\n\nAs shown in Algorithm 1, we propose a linear-time algorithm for the tree filtering module, whose proofs are provided in the appendix. In the preparation stage, we uniformly sample a vertex as the root and perform a breadth-first search (BFS) to obtain a topological order of the tree G. The BFS can be accelerated by a parallel version on GPUs and ensures the efficiency of the following operations.\nTo compute the normalization coefficient, we construct an all-ones matrix J. Since the embedded feature f(x) is independent of the matrix J, the forward computation can be factorized into two dynamic programming processes. Furthermore, we propose an efficient implementation for the backward process. To avoid unnecessary intermediate computation, we combine the gradients of the module and the loss function. Note that the output y and the normalization coefficient z have already been computed in the inference phase; thus the key of the backward process is to compute the intermediate variables ψ and ν, which can likewise be accelerated by the proposed linear-time passes. 
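To make the two passes concrete, the forward part can be sketched for scalar features as below (our own minimal re-implementation of Eq. 6–7; the parallel GPU scheduling and the backward pass of Algorithm 1 are omitted). Each vertex is touched once per pass, giving O(N) per channel:

```python
import math
from collections import deque

def tree_filter_linear(adj, x, root=0):
    """Linear-time tree filter: an aggregation sweep from leaves to root (Eq. 6)
    followed by a propagation sweep from root to leaves (Eq. 7), run on both
    the features and an all-ones vector to obtain the normalization z."""
    n = len(x)
    parent, s_par, order = [-1] * n, [0.0] * n, []
    seen = [False] * n
    seen[root] = True
    queue = deque([root])
    while queue:                                  # BFS topological order
        i = queue.popleft()
        order.append(i)
        for j, w in adj[i]:
            if not seen[j]:
                seen[j] = True
                parent[j] = i
                s_par[j] = math.exp(-w)           # S(E_{par(j), j})
                queue.append(j)

    def two_pass(values):
        aggr = list(values)
        for i in reversed(order):                 # aggregation: children first
            if parent[i] >= 0:
                aggr[parent[i]] += s_par[i] * aggr[i]
        prop = [0.0] * n
        for i in order:                           # propagation: parents first
            if parent[i] < 0:
                prop[i] = aggr[i]                 # root: Prop = Aggr
            else:
                s = s_par[i]
                prop[i] = s * prop[parent[i]] + (1.0 - s * s) * aggr[i]
        return prop

    num = two_pass(x)                             # unnormalized response
    z = two_pass([1.0] * n)                       # normalization coefficients
    return [a / b for a, b in zip(num, z)]
```

On a chain 0-1-2 with unit dissimilarities this reproduces the brute-force definition of Eq. 1 exactly, and the result does not depend on which vertex is sampled as the root.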
We adopt another iteration process for the gradient of the pairwise distance.\n\nFigure 2: Overview of the proposed framework for semantic segmentation. The network is composed of a backbone encoder and a naive decoder. GAP denotes the extra global average pooling block. Multi-groups means using different feature splits to generate multiple groups of tree weights. The right diagram elaborates on the process details of a single-stage tree filtering module, denoted as the green node in the decoder. Red arrows represent upsample operations. Best viewed in color.\n\nComputational complexity. Since the numbers of batches and channels are much smaller than the number of vertices in the input feature, we only consider the influence of the vertices. For each channel, the computational complexity of all the processes in Algorithm 1, including the construction of the MST and the computation of the pairwise distance, is O(N), which is linear in the number of vertices. It is worth pointing out that the MST can be built in linear time using the contractive Borůvka algorithm when given a planar graph, as designed in this paper. Note that the batches and channels are independent of each other, so for the practical implementation on GPUs we can naturally process batches and channels in parallel. We also adopt an effective scheduling scheme to compute vertices of the same depth in the tree in parallel. Consequently, the proposed algorithm reduces the computational complexity and time consumption dramatically.\n\n2.3 Network Architecture for Segmentation\n\nBased on the efficient implementation, the proposed tree filtering module can be easily embedded into deep neural networks for resource-friendly feature aggregation. To illustrate the effectiveness of the proposed module, we employ ResNet [8] as the encoder to build a unified network. 
The encoded features from ResNet are usually computed with output stride 32. To remedy the resolution loss, we design a naive decoder module following previous works [14, 15]. In detail, the features in the decoder are gradually upsampled by a factor of 2 and summed with the corresponding low-level features in the encoder, similar to FPN [25]. After that, the bottom-up embedding functions in the decoder are replaced by tree filtering modules for multi-scale feature transform, as illustrated in Fig. 2.\nTo be more specific, given a low-level feature map Ml in the top-down pathway, which is rich in instance details, a 4-connected planar graph can be constructed easily with the guidance of Ml. Then, the edges with substantial dissimilarity are removed to formulate the MST using the Borůvka [26] algorithm. High-level semantic cues contained in Xl are extracted using a simplified embedding function (Conv 1×1). To measure the pairwise dissimilarity ω in Eq. 2 (ωl in Fig. 2), the widely used Euclidean distance [27] is adopted. Furthermore, different groups of tree weights ωl are generated to capture component-dependent features, which will be analyzed in Sec. 3.2. To highlight the effectiveness of the proposed method, the feature transformation f(·) is simplified to the identity mapping, where f(Xl) = Xl. Thus, the learnable tree filter can be formulated by the algorithm elaborated in Sec. 2.1. Finally, the low-level feature map Ml is fused with the operator output yl using pixel-wise summation. For multi-stage feature aggregation, the building blocks (green nodes in Fig. 2) are applied to different resolutions (Stage 1 to 3 in Fig. 2). An extra global average pooling operation is added to capture global context and construct another tree filtering module (Stage 4). 
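The low-level-guided graph construction described above can be sketched as follows. This is a single-channel illustration with our own naming; for brevity it uses Kruskal's algorithm with union-find instead of the contractive Borůvka algorithm that the paper adopts for GPU-friendly linear-time construction:

```python
def grid_mst(feat):
    """Build the low-level-guided MST: pixels of a 2-D map are vertices,
    4-neighbour pairs are edges, and the edge weight is the feature
    dissimilarity (absolute difference in this single-channel sketch)."""
    h, w = len(feat), len(feat[0])

    def vid(r, c):                       # flatten (row, col) to a vertex id
        return r * w + c

    edges = []
    for r in range(h):
        for c in range(w):
            if c + 1 < w:                # horizontal neighbour
                edges.append((abs(feat[r][c] - feat[r][c + 1]), vid(r, c), vid(r, c + 1)))
            if r + 1 < h:                # vertical neighbour
                edges.append((abs(feat[r][c] - feat[r + 1][c]), vid(r, c), vid(r + 1, c)))

    parent = list(range(h * w))          # union-find forest

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    mst = []
    for wgt, a, b in sorted(edges):      # Kruskal: keep an edge iff it joins
        ra, rb = find(a), find(b)        # two different components
        if ra != rb:
            parent[ra] = rb
            mst.append((a, b, wgt))
    return mst
```

Because low-dissimilarity edges are kept first, an edge crossing a strong feature boundary survives only when it is needed for connectivity, which mirrors the pruning behaviour described in Sec. 2.1.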
The promotion brought by the extra components will be discussed in detail in the ablation studies.\n\n3 Experiments\n\nIn this section, we first describe the implementation details. Then the proposed approach is decomposed step by step to reveal the effect of each component. Comparisons with several state-of-the-art methods on PASCAL VOC 2012 [23] and Cityscapes [24] are presented at last.\n\n3.1 Implementation Details\n\nFollowing traditional protocols [3, 15, 28], ResNet [8] is adopted as our backbone for the following experiments. Specifically, we employ the “poly” schedule with an initial learning rate of 0.004 and a power of 0.9. The networks are optimized for 40K iterations using mini-batch stochastic gradient descent (SGD) with a weight decay of 1e-4 and a momentum of 0.9. We construct each mini-batch for training from 32 random crops (512×512 for PASCAL VOC 2012 [23] and 800×800 for Cityscapes [24]) after randomly flipping and scaling each image by 0.5 to 2.0×.\n\n3.2 Ablation Studies\n\nTo elaborate on the effectiveness of the proposed approach, we conduct extensive ablation studies. First, we give a detailed analysis of the structure-preserving relations, along with visualizations, as presented in Fig. 3. Next, the equipped stages and the group number of the tree filtering module are explored. Finally, different building blocks are compared to illustrate the effectiveness and efficiency of the tree filtering module.\nStructure-preserving relations. As intuitively presented in the first row of Fig. 3, given different positions, the corresponding instance details are fully activated with high responses, which means that the proposed module has learned object structures and long-range intra-class dependencies. Specifically, the object details (e.g., boundaries of the train, rather than the coarse regions produced by Non-local [13] blocks) are highlighted in the affinity maps. 
Qualitative results are also given to illustrate the preserved structural details, as presented in Fig. 4.\nWhich stage to equip the module? Tab. 1 presents the results of applying the tree filtering module to different stages (the group number is fixed to 1). The convolutional operations are replaced by the building block (green nodes in Fig. 2) to form the equipped stage.\n\nFigure 3: Visualization of affinity maps at a specific position (marked by the red cross in each input image). TF and NL denote using the proposed tree filtering module and the Non-local block [13], respectively. Different positions, resolution stages, and selected groups are explored in (a), (b), and (c), respectively. Our approach preserves more detailed structures than the Non-local block. All of the input images are sampled from the PASCAL VOC 2012 val set.\n\nFigure 4: Qualitative results on the PASCAL VOC 2012 val set. Given an input image from the top row, the structural cues are preserved in the corresponding prediction (the middle row). The generated results contain rich details even when compared with the ground truth in the bottom row.\n\nAs can be seen from Tab. 1, the network performance consistently improves with more tree filtering modules equipped. This can also be concluded from the qualitative results (the second row in Fig. 
3), where the higher stage (Stage 3) contains more semantic cues and the lower stages (Stage 1 and 2) focus more on complementary details. The maximum gap between the multi-stage equipped network and the raw one reaches 4.5% with the ResNet-50 backbone. Even with the powerful ResNet-101, the proposed approach still attains a 2.1% absolute gain, reaching 77.1% mIoU on the PASCAL VOC 2012 val set.\nDifferent group numbers. Different group settings are used to generate weights for the single-stage tree filtering module while fixing the channel number to 256. As shown in Tab. 2, the network reaches its top performance (Group Num = 16) when the group number approaches the category number (21 in PASCAL VOC 2012); additional groups afford no further contribution. We conjecture that different kinds of tree weights are needed to deal with similar but distinct components, as shown in the third row of Fig. 3.\nDifferent building blocks. We further compare the proposed tree filtering module with classic context modeling blocks (e.g., the PSP block [3] and the Non-local block [13]) to demonstrate its superiority in both accuracy and efficiency. As illustrated in Tab. 3, the proposed module (TF) achieves better performance than the others (PSP and NL) with much less resource consumption. Moreover, the tree filtering module brings consistent improvements on different backbones (5.2% for ResNet-50 with stride 32 and 2.6% for ResNet-101) with only an additional 0.7M parameters and 1.3G FLOPs of overhead. Due to the structure-preserving property, the proposed module achieves an additional 1.1% improvement over the PSP block with negligible consumption, as presented in Tab. 3. The extra Non-local block contributes no further gain over the tree filtering module, which can be attributed to the feature dependencies already modeled by the tree filter.\nExtra components. 
Following other works [15, 28], we adopt some simple components for further improvement, including an extra global average pooling operation and additional ResBlocks in the decoder. In detail, the global average pooling block, combined with the Stage 4 module (see Fig. 2), is added after the backbone to capture global context, and the Conv1×1 operations in the decoder (refer to the detailed diagram in Fig. 2) are replaced by ResBlocks (with Conv3×3). As presented in Tab. 4, the proposed method achieves consistent improvements and attains 79.4% mIoU on the PASCAL VOC 2012 val set without applying data augmentation strategies.\n\n3.3 Experiments on Cityscapes\n\nTo further illustrate the effectiveness of the proposed method, we evaluate it on the Cityscapes [24] dataset. Our experiments involve the 2975, 500, and 1525 images in the train, val, and test sets, respectively. With the multi-scale and flipping strategy, the proposed method achieves 80.8% mIoU on the Cityscapes test set\n\nTable 1: Comparisons among different stages equipped with the tree filtering module on the PASCAL VOC 2012 val set when using the proposed decoder architecture. The multi-scale and flipping strategy is adopted for testing.\n\nBackbone | Stage | mIoU (%)\nResNet-50 | None | 70.2\nResNet-50 | + Stage 1 | 72.1\nResNet-50 | + Stage 1-2 | 73.1\nResNet-50 | + Stage 1-3 | 74.7\nResNet-101 | None | 75.0\nResNet-101 | + Stage 1-3 | 77.1\n\nTable 2: Comparisons among different group settings of the tree filtering module on the PASCAL VOC 2012 val set when using the proposed decoder structure. The multi-scale and flipping strategy is adopted for testing.\n\nBackbone | Group Num | mIoU (%)\nResNet-50 | 0 | 70.2\nResNet-50 | 1 | 72.1\nResNet-50 | 4 | 73.2\nResNet-50 | 8 | 74.0\nResNet-50 | 16 | 74.4\nResNet-50 | 32 | 74.4\n\nTable 3: Comparisons among different building blocks on the PASCAL VOC 2012 val set when using ResNet-50 as the feature extractor without a decoder. 
TF, PSP, and NL denote using the proposed tree filtering module, the PSP block [3], and the Non-local block [13] as the building block, respectively. OS represents the output stride used in the backbone. FLOPs are calculated for a single-scale 512×512 input. All data augmentation strategies are dropped.\n\nBackbone | Block | Decoder | OS | Params (M) | FLOPs (G) | Δ FLOPs (G) | mIoU (%)\nResNet-50 | None | ✗ | 8 | 129.3 | 162.1 | 0.0 | 69.2\nResNet-50 | NL | ✗ | 8 | 158.3 | 199.1 | +37.0 | 74.2\nResNet-50 | PSP | ✗ | 8 | 178.3 | 171.1 | +9.0 | 74.3\nResNet-50 | TF | ✗ | 8 | 133.3 | 163.1 | +1.0 | 74.9\nResNet-50 | NL+TF | ✗ | 8 | 162.3 | 200.1 | +38.0 | 74.9\nResNet-50 | PSP+TF | ✗ | 8 | 182.3 | 172.1 | +10.0 | 75.4\nResNet-50 | None | ✓ | 32 | 102.0 | 39.2 | 0.0 | 67.3\nResNet-50 | TF | ✓ | 32 | 102.7 | 40.5 | +1.3 | 72.5\nResNet-101 | None | ✓ | 32 | 175.0 | 65.0 | 0.0 | 72.8\nResNet-101 | TF | ✓ | 32 | 175.7 | 66.3 | +1.3 | 75.4\n\nTable 4: Comparisons among different components on the PASCAL VOC 2012 val set. TF(multi) denotes multi-stage tree filtering modules with the decoder. Extra represents the extra components. All data augmentation strategies are dropped.\n\nBackbone | TF(multi) | Extra | mIoU (%)\nResNet-50 | ✗ | ✗ | 67.3\nResNet-50 | ✗ | ✓ | 72.5\nResNet-50 | ✓ | ✗ | 72.5\nResNet-50 | ✓ | ✓ | 75.6\nResNet-101 | ✗ | ✓ | 78.3\nResNet-101 | ✓ | ✓ | 79.4\n\nTable 5: Comparisons with state-of-the-art results on the Cityscapes test set trained on fine annotations. We adopt vanilla ResNet-101 as our backbone.\n\nMethod | Backbone | mIoU (%)\nRefineNet [29] | ResNet-101 | 73.6\nDSSPN [30] | ResNet-101 | 77.8\nPSPNet [3] | ResNet-101 | 78.4\nBiSeNet [31] | ResNet-101 | 78.9\nDFN [28] | ResNet-101 | 79.3\nPSANet [32] | ResNet-101 | 80.1\nDenseASPP [33] | DenseNet-161 | 80.6\nOurs | ResNet-101 | 80.8\n\nwhen trained on fine annotation data only. 
Compared with previous leading algorithms, our method achieves superior performance using ResNet-101 without bells and whistles.\n\n3.4 Experiments on PASCAL VOC\n\nWe carry out experiments on PASCAL VOC 2012 [23], which contains 20 object categories and one background class. Following the procedure of [28], we use the augmented data [38] with 10,582 images for training and the raw train-val set for further fine-tuning. In the inference stage, multi-scale and horizontal flipping strategies are adopted for data augmentation. As shown in Tab. 6, the proposed method achieves state-of-the-art performance on the PASCAL VOC 2012 [23] test set. In detail, our approach reaches 84.2% mIoU without MS-COCO [39] pre-training when adopting vanilla ResNet-101 as the backbone. If MS-COCO [39] is added for pre-training, our approach achieves the leading performance with 86.3% mIoU.\n\nTable 6: Comparisons with state-of-the-art results on the PASCAL VOC 2012 test set. We adopt vanilla ResNet-101 without atrous convolutions as our backbone.\n\nWithout MS-COCO pretrain:\nMethod | Backbone | mIoU (%)\nFCN [1] | VGG 16 | 62.2\nDeeplab v2 [2] | VGG 16 | 71.6\nDPN [34] | VGG 16 | 74.1\nPiecewise [35] | VGG 16 | 75.3\nPSPNet [3] | ResNet-101 | 82.6\nDFN [28] | ResNet-101 | 82.7\nEncNet [4] | ResNet-101 | 82.9\nOurs | ResNet-101 | 84.2\n\nWith MS-COCO pretrain:\nMethod | Backbone | mIoU (%)\nGCN [36] | ResNet-152 | 82.2\nRefineNet [29] | ResNet-101 | 84.2\nPSPNet [3] | ResNet-101 | 85.4\nDeeplab v3 [14] | ResNet-101 | 85.7\nEncNet [4] | ResNet-101 | 85.9\nDFN [28] | ResNet-101 | 86.2\nExFuse [37] | ResNet-101 | 86.2\nOurs | ResNet-101 | 86.3\n\n4 Conclusion\n\nIn this work, we propose the learnable tree filter for structure-preserving feature transform. Different from most existing methods, the proposed approach leverages a tree-structured graph to model long-range dependencies while preserving detailed object structures. 
We formulate the tree filtering module and give an efficient implementation with linear-time resource consumption. Extensive ablation studies have been conducted to elaborate on the effectiveness and efficiency of the proposed method, which is proved to bring consistent improvements on different backbones with little computational overhead. Experiments on PASCAL VOC 2012 and Cityscapes prove the superiority of the proposed approach for semantic segmentation. More potential domains with structural relations (e.g., detection and instance segmentation) remain to be explored in the future.

5 Acknowledgment

We would like to thank Lingxi Xie for his valuable suggestions. This research was supported by the National Key R&D Program of China (No. 2017YFA0700800).

References

[1] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[3] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[4] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[5] Yanwei Li, Xinze Chen, Zheng Zhu, Lingxi Xie, Guan Huang, Dalong Du, and Xingang Wang. Attention-guided unified network for panoptic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[6] Karen Simonyan and Andrew Zisserman.
Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2014.

[7] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[9] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In International Conference on Learning Representations, 2015.

[10] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2016.

[11] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In IEEE International Conference on Computer Vision, 2017.

[12] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[13] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[14] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

[15] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation.
In European Conference on Computer Vision, 2018.

[16] Xiaodan Liang, Liang Lin, Xiaohui Shen, Jiashi Feng, Shuicheng Yan, and Eric P Xing. Interpretable structure-evolving LSTM. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[17] Yin Li and Abhinav Gupta. Beyond grids: Learning graph representations for visual recognition. In Advances in Neural Information Processing Systems, 2018.

[18] Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[19] Qingxiong Yang. Stereo matching using tree filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.

[20] Siddhartha Chandra, Nicolas Usunier, and Iasonas Kokkinos. Dense and low-rank Gaussian CRFs using deep embeddings. In IEEE International Conference on Computer Vision, 2017.

[21] Adam W Harley, Konstantinos G Derpanis, and Iasonas Kokkinos. Segmentation-aware convolutional networks using local attention masks. In IEEE International Conference on Computer Vision, 2017.

[22] Sifei Liu, Shalini De Mello, Jinwei Gu, Guangyu Zhong, Ming-Hsuan Yang, and Jan Kautz. Learning affinity via spatial propagation networks. In Advances in Neural Information Processing Systems, 2017.

[23] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 2010.

[24] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[25] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection.
In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[26] Robert G Gallager, Pierre A Humblet, and Philip M Spira. A distributed algorithm for minimum-weight spanning trees. ACM Transactions on Programming Languages and Systems, 1983.

[27] Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. NormFace: L2 hypersphere embedding for face verification. In ACM International Conference on Multimedia, 2017.

[28] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Learning a discriminative feature network for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[29] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[30] Xiaodan Liang, Hongfei Zhou, and Eric Xing. Dynamic-structured semantic propagation network. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[31] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In European Conference on Computer Vision, 2018.

[32] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. PSANet: Point-wise spatial attention network for scene parsing. In European Conference on Computer Vision, 2018.

[33] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. DenseASPP for semantic segmentation in street scenes. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[34] Ziwei Liu, Xiaoxiao Li, Ping Luo, Chen-Change Loy, and Xiaoou Tang. Semantic image segmentation via deep parsing network. In IEEE International Conference on Computer Vision, 2015.

[35] Guosheng Lin, Chunhua Shen, Anton Van Den Hengel, and Ian Reid.
Ef\ufb01cient piecewise training of\ndeep structured models for semantic segmentation. In IEEE Conference on Computer Vision and Pattern\nRecognition, 2016.\n\n[36] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters\u2013improve semantic\nsegmentation by global convolutional network. In IEEE Conference on Computer Vision and Pattern\nRecognition, 2017.\n\n[37] Zhenli Zhang, Xiangyu Zhang, Chao Peng, Xiangyang Xue, and Jian Sun. Exfuse: Enhancing feature\n\nfusion for semantic segmentation. In European Conference on Computer Vision, 2018.\n\n[38] Bharath Hariharan, Pablo Arbel\u00e1ez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic\n\ncontours from inverse detectors. In IEEE International Conference on Computer Vision, 2011.\n\n[39] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r,\nand C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on\nComputer Vision, 2014.\n\n11\n\n\f", "award": [], "sourceid": 964, "authors": [{"given_name": "Lin", "family_name": "Song", "institution": "Xi'an Jiaotong University"}, {"given_name": "Yanwei", "family_name": "Li", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"given_name": "Zeming", "family_name": "Li", "institution": "Megvii(Face++) Inc"}, {"given_name": "Gang", "family_name": "Yu", "institution": "Megvii Inc"}, {"given_name": "Hongbin", "family_name": "Sun", "institution": "Xi'an Jiaotong University"}, {"given_name": "Jian", "family_name": "Sun", "institution": "Megvii, Face++"}, {"given_name": "Nanning", "family_name": "Zheng", "institution": "Xi'an Jiaotong University"}]}