{"title": "DifNet: Semantic Segmentation by Diffusion Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1630, "page_last": 1639, "abstract": "Deep Neural Networks (DNNs) have recently shown state of the art performance on semantic segmentation tasks, however, they still suffer from problems of poor boundary localization and spatial fragmented predictions. The difficulties lie in the requirement of making dense predictions from a long path model all at once since details are hard to keep when data goes through deeper layers. Instead, in this work, we decompose this difficult task into two relative simple sub-tasks: seed detection which is required to predict initial predictions without the need of wholeness and preciseness, and similarity estimation which measures the possibility of any two nodes belong to the same class without the need of knowing which class they are. We use one branch network for one sub-task each, and apply a cascade of random walks base on hierarchical semantics to approximate a complex diffusion process which propagates seed information to the whole image according to the estimated similarities. \nThe proposed DifNet consistently produces improvements over the baseline models with the same depth and with the equivalent number of parameters, and also achieves promising performance on Pascal VOC and Pascal Context dataset. 
Our DifNet is trained end-to-end without complex loss functions.", "full_text": "DifNet: Semantic Segmentation by Diffusion Networks

Peng Jiang 1, Fanglin Gu 1, Yunhai Wang 1, Changhe Tu 1, Baoquan Chen 2,1

1Shandong University, China    2Peking University, China

sdujump@gmail.com, fanglin.gu@gmail.com, cloudseawang@gmail.com, chtu@sdu.edu.cn, baoquan.chen@gmail.com

Abstract

Deep Neural Networks (DNNs) have recently shown state-of-the-art performance on semantic segmentation tasks; however, they still suffer from poor boundary localization and spatially fragmented predictions. The difficulty lies in the requirement of making dense predictions from a long-path model all at once, since details are hard to keep when data goes through deeper layers. Instead, in this work, we decompose this difficult task into two relatively simple sub-tasks: seed detection, which produces initial predictions without the need for wholeness or preciseness, and similarity estimation, which measures the possibility of any two nodes belonging to the same class without knowing which class they are. We use one branch network for each sub-task, and apply a cascade of random walks based on hierarchical semantics to approximate a complex diffusion process which propagates seed information to the whole image according to the estimated similarities. The proposed DifNet consistently produces improvements over baseline models of the same depth and with an equivalent number of parameters, and also achieves promising performance on the Pascal VOC and Pascal Context datasets. Our DifNet is trained end-to-end without complex loss functions.

1 Introduction

Semantic segmentation, which aims to give dense label predictions for the pixels in an image, is one of the fundamental topics in computer vision. 
Recently, fully convolutional networks (FCNs), proposed in [1], have proved to be much more powerful than schemes that rely on hand-crafted features. Following FCNs, subsequent works [2-16] have improved results further by introducing atrous convolution, shortcuts between layers, and CRF post-processing.

Even with these refinements, current FCN-based semantic segmentation methods still suffer from poor boundary localization and spatially fragmented predictions, because of the following challenges: First, to abstract invariant high-level feature representations, deeper models are preferred; however, the invariance of the features and the increasing depth of the layers may cause detailed spatial information to be lost. Second, given such a long-path model, the requirement of making dense predictions all at once makes these problems more severe. Third, the inability to capture long-range dependencies makes it hard for the model to generate accurate and uniform predictions [17].

To address these challenges, we relieve the burden of the semantic segmentation model by decomposing the task into two relatively simple sub-tasks, seed detection and similarity estimation, and then diffuse seed information to the whole image according to the estimated similarities. For each sub-task, we train one branch network respectively and simultaneously; therefore our model has two branches: a seed branch and a similarity branch. The simplicity and motivation lie in the following aspects: For seed detection, we hope it can give initial predictions without the need for wholeness and preciseness; this requirement fits well with the property of DNNs' high-level features, which are good at representing high-level semantics but hard-pressed to keep details. 

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
For similarity estimation, we intend to estimate the possibility of any two nodes belonging to the same class; under this circumstance, relatively low-level features are already competent.

Based on the motivations mentioned above, we let the seed branch predict initial predictions and let the similarity branch estimate similarities. To be specific, the seed branch first generates a score map which assigns a score value for each class at each node (pixel); then an importance map is learned to re-weight the score map and obtain the initial predictions. At the same time, the similarity branch extracts features from different semantic levels and computes a sequence of transition matrices correspondingly. Transition matrices measure the possibility of a random walk between any two nodes; with our implementation they can also reflect similarities at different semantic levels. Finally, we apply a cascade of random walks based on these transition matrices to approximate a complex diffusion process, in order to propagate seed information to the whole image according to the hierarchical similarities. In this way, the inversion of a dense matrix in the diffusion process can be avoided. Our diffusion process by cascaded random walks shares a similar idea with the residual learning framework [18], which eases the approximation of a complex objective by learning residuals. Moreover, our random walk actually computes the final response at a position as a weighted sum of all the seed values, which is a non-local operation that can capture long-range dependencies regardless of positional distance. Besides, as Fig. 1(a) shows, the cascaded random walks also increase the flexibility and diversity of the information propagation paths.

Our proposed DifNet is trained end-to-end with a common loss function and no post-processing. 
In experiments, our model consistently shows superior performance over baseline models of the same depth and with an equivalent number of parameters, and also achieves promising performance on the Pascal VOC 2012 and Pascal Context datasets. In summary, our contributions are:

• We decompose the semantic segmentation task into two simple sub-tasks.
• We approximate a complex diffusion process by cascaded random walks.
• We provide comprehensive mechanism studies.
• Our model can capture long-range dependencies.
• Our model demonstrates consistent improvements over various baseline models.

2 Related Work

Many works [2, 8, 9, 15, 14, 13, 12, 11, 10, 19-21] have approached the problems of poor boundary localization and spatially fragmented predictions for semantic segmentation. (Here, we mainly focus on methods based on deep neural networks, as these represent the state of the art and are the most relevant to our scheme.)

Currently, the conditional random field (CRF) is one of the major approaches used to tackle these two problems. Works such as [2] use a CRF as a disjoint post-processing module on top of the main model. Because of the disjoint training and post-processing, they often fail to capture semantic relationships between objects accurately and thus produce spatially disjoint segmentation results. Instead, works [8, 10, 12, 9, 20] propose to integrate the CRF into the networks, so as to enable end-to-end training of the joint model. However, this integration may lead to a dramatic increase in parameters and computational complexity, such that the model usually needs many iterations of mean-field inference or a recurrent scheme [12] to optimize the CNN-CRF models. To avoid iterative optimization, works [13, 14] employ Gaussian conditional random fields, which can be optimized by solving only a system of linear equations, but at the cost of increased complexity of the gradient computation. 
Figure 1: (a) Our DifNet contains two branches: 1. The seed branch, which produces a score map and an importance map, from which seeds are obtained by the Hadamard product ⊗; 2. The similarity branch, which extracts features from different layers to compute transition matrices and estimate pixel-wise similarities. Finally, the model approximates the diffusion process by a cascade of random walks ⊕ to propagate seed information to the whole image according to the estimated similarities. (b) The random walk operation. For each random walk, the inputs are: 1. the output of the last random walk operation; 2. features from the similarity branch; 3. the seed from the seed branch. Given the inputs, the output is calculated by ⊕. The specific computation procedure is explained in Sec. 4.

Apart from the CRF view, work [15] utilizes graphical structures to refine results by random walks, but the calculation has a dense matrix inversion term which is not appropriate for networks. Unlike the above-mentioned methods, work [21] does not compute global pairwise relations directly. It predicts four local pairwise relations along different directions to approximate the global pairwise relations; however, this also increases the complexity of the model.

To integrate the CRF into the model, several works also employ networks with two branches, one for the pairwise term and one for the unary term. However, their definition and computation are different from ours, in that we represent pairwise similarities by transition matrices, each row of which sums to one. 
For the purpose of measuring similarity, different metrics have been presented: ours and [14] compute similarities by inner product, while most of the others use the Mahalanobis distance. It is important to note that our DifNet consists of several transition matrices which are computed from features at different semantic levels, and each random walk operation is conducted according to one transition matrix; see Fig. 1(a). In this way, we do not require each transition matrix to contain all the similarity information, which relieves the burden of the similarity branch. Besides, the cascaded random walks also increase the flexibility and diversity of the information propagation paths.

For supervised semantic segmentation tasks, cross-entropy is the most commonly used loss function. However, in some previously mentioned models [15, 20], another loss for similarity estimation has been applied, where the similarity ground truth is derived from the labels. According to [22], the optimal pairwise term and the optimal unary term are mutually influenced, so a strict constraint on only one of these two terms may not lead to good results. Consequently, in our DifNet, we only penalize the final predictions with a cross-entropy loss. As for the training strategy, we train our two branches simultaneously, while some works, for instance [14], train the two branches alternately.

3 Methodology

Given an input image I of size [c, h, w], the pairwise relationships can be expressed as an affinity matrix W of size N × N, where N = h × w and each element $W_{ij}$ encodes the similarity between node i and node j. As mentioned in Sec. 
1, our seed vector is defined as $s = Mx$, where $x$ is the score map of size [N, K] (K is the number of classes) and the importance map $M$ is a diagonal matrix of size N × N with values in [0, 1].

Assuming the final predictions are $y$, in order to diffuse the seed values to all the other nodes according to the affinity matrix, we can optimize the following equation:

$$y = \operatorname*{argmin}_{y^*} \frac{1}{2}\Big(\mu \sum_{i,j=1}^{N} W_{ij} \Big\| \frac{y^*_i}{\sqrt{d_{ii}}} - \frac{y^*_j}{\sqrt{d_{jj}}} \Big\|^2 + (1-\mu) \sum_{i=1}^{N} M_{ii} \| y^*_i - x_i \|^2\Big) \quad (1)$$

Eq. 1 is convex and has a closed-form solution, without loss of generality:

$$y = (D^{-1}(D - \mu W))^{-1}(1-\mu)Mx \quad (2)$$

where $D$ is the degree matrix defined as $D = \mathrm{diag}\{d_{11}, \ldots, d_{NN}\}$ with $d_{ii} = \sum_j W_{ij}$, and $\mu$ is the weight balancing the smoothness term and the data term. Eq. 2 is usually regarded as a diffusion process $y = L^{-1}s$, where $s = Mx$ is the seed vector and $L^{-1} = (D^{-1}(D - \mu W))^{-1}$ is the diffusion matrix ($L$ has the form of a normalized graph Laplacian). Works such as [15, 14] propose to use networks with two branches to predict these two parts respectively. However, to compute the final predictions $y$, they have to solve a dense matrix inversion or a system of linear equations, which is time-consuming and unstable (the matrix to be inverted may be singular). To tackle this problem, we propose to use a cascade of random walks to approximate the diffusion process. A random walk with the seed vector as the initial state is defined as:

$$y_{t+1} = \mu P y_t + (1-\mu)s \quad (3)$$

where $\mu \in [0, 1]$ controls the degree to which the random walk moves to other states from the initial state, and $P = D^{-1}W$ is the transition matrix, whose elements measure the possibility of a random walk between the corresponding positions and take values in [0, 1]. It is important to note that Eq. 
3 no longer contains a dense matrix inversion and is equal to Eq. 2 when $t \to \infty$, which is proved as follows.

Proof. By unrolling the recurrence, Eq. 3 can be reformulated as $y_{t+1} = (\mu P)^{t+1} s + (1-\mu)\sum_{i=0}^{t}(\mu P)^i s$. When $t \to \infty$, since the rows of $P$ sum to one and $\mu \in [0, 1)$, clearly $\lim_{t\to\infty} (\mu P)^{t+1} = 0$ and $y_{t+1} = (1-\mu)\sum_{i=0}^{t}(\mu P)^i s$. By computing $y_{t+1} - \mu P y_{t+1}$ and letting $t \to \infty$, we finally obtain $y_\infty = (I - \mu P)^{-1}(1-\mu)s$, which equals Eq. 2.

4 Implementation

In this section, we describe how we implement a cascade of random walks in DifNet to approximate the diffusion process.

The key part of a random walk is computing the transition matrix $P = D^{-1}W$, which we realize by conducting a softmax along each row of $W$, i.e., $P = \mathrm{softmax}(W)$. To compute $W$ and measure similarities between nodes, we apply the inner product of features $f$ from the similarity branch. Before that, in our implementation, we first adjust the feature dimensions to a fixed length by one conv(1 × 1)-bn-pooling layer $\Psi$, so that $W = \Psi(f)^T \Psi(f)$. Because $P$ encodes similarities between all node pairs and the random walk is a non-local operation, long-range dependencies can be captured.

To let DifNet approximate the diffusion process by learning, we should take full advantage of the learning capacity of the model. Instead of using a predefined and fixed parameter $\mu$, our model learns this parameter and determines the degree of each random walk adaptively. Besides, for different random walks, we compute the corresponding transition matrix $P_t$ based on features from different layers of the similarity branch, which represent different semantic levels; thus the information is propagated gradually according to different levels' semantic similarities. 
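The convergence argument above can be checked numerically. The following is a toy NumPy sketch, not the paper's implementation: the graph size, the stand-in features, and the value of µ are all arbitrary placeholders. It iterates the random walk of Eq. 3 and compares the result against the closed-form diffusion of Eq. 2.

```python
import numpy as np

rng = np.random.default_rng(0)

N, K = 8, 3                          # toy graph: N nodes, K classes
f = rng.random((N, 4))               # stand-in node features
W = f @ f.T                          # affinities by inner product
D = np.diag(W.sum(axis=1))           # degree matrix, d_ii = sum_j W_ij
P = np.linalg.inv(D) @ W             # transition matrix P = D^{-1} W
mu = 0.8                             # must be < 1 for convergence

x = rng.random((N, K))               # toy score map
M = np.diag(rng.random(N))           # diagonal importance map in [0, 1]
s = M @ x                            # seed vector s = M x

# Eq. 2: y = (D^{-1}(D - mu W))^{-1} (1 - mu) M x, solved without
# forming the inverse explicitly (D^{-1}(D - mu W) = I - mu P).
y_closed = np.linalg.solve(np.linalg.inv(D) @ (D - mu * W), (1 - mu) * s)

# Eq. 3 iterated from the seed: y_{t+1} = mu P y_t + (1 - mu) s
y = s.copy()
for _ in range(500):
    y = mu * (P @ y) + (1 - mu) * s

print(np.abs(y - y_closed).max())    # essentially zero after 500 steps
```

Because the rows of P sum to one and µ < 1, the spectral radius of µP is below one, so the iteration contracts onto the closed-form solution.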
In this manner, the diffusion process is approximated not merely by cascaded random walks, but by cascaded random walks on hierarchical semantics. We demonstrate what these transition matrices look like in Sec. 5.3. Consequently, our random walks are defined as:

$$y_{t+1} = \mu_t P_t y_t + (1-\mu_t)s \quad (4)$$

As mentioned before, our seed $s$ is expressed as the multiplication of the importance map $M$ and the score map $x$; this operation is denoted as ⊗ in Fig. 1(a). The score map $x$ is the direct output of the seed branch with values in $\mathbb{R}$, so our DifNet actually diffuses scores, and the influence of node $i$ on the others in channel $k$ during diffusion is defined as $|x_{i,k}|$. To further adjust the influence of the nodes, we introduce several layers $H$ (with a sigmoid as the last activation) on top of $x$ to predict an importance map $M$ (a diagonal matrix), such that $M = H(x)$. Finally, we apply the importance map to the score map and obtain the seed by $s = Mx$. From experiments, we observe that the importance map adjusts the influence of nodes based on the scores of their neighborhoods. Fig. 2 shows the score map $x$, the influence map $E$ and the importance map $M$. Clearly, $M$ reduces the influence of over-emphasized nodes and outliers. Please see Sec. 5.3 for details.

It is worth noting that in our implementation $s$ is of size [h′ × w′, K] and $P$ is of size [h′ × w′, h′ × w′], where h′ = h/5 and w′ = w/5; meanwhile, the random walks involve only matrix multiplication operations, so the amount of computation related to random walks in our model is relatively small.

Figure 2: Visualization of the input image, the ground truth, the score map $x$ (showing only the class with the largest score value at each position), the influence map $E(x)$ and the importance map $M(x)$.

According to [22], the optimal seed $s$ and the optimal diffusion matrix $L^{-1}$ are mutually determined. 
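The building blocks described above — a row-softmax transition matrix from inner-product affinities, the seed $s = Mx$, and one random walk of Eq. 4 — can be sketched as follows. This is a toy NumPy illustration under assumed shapes; `importance_net` is a hypothetical stand-in for the learned layers $H$, and the features are assumed to be already reduced by the conv(1 × 1)-bn-pooling layer $\Psi$.

```python
import numpy as np

def row_softmax(W):
    # Row-wise softmax turns an affinity matrix into a transition
    # matrix whose rows sum to one: P = softmax(W), as in Sec. 4.
    e = np.exp(W - W.max(axis=1, keepdims=True))   # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def transition_matrix(f):
    # f: [N, d] features from one level of the similarity branch,
    # assumed already reduced; W = Psi(f)^T Psi(f) in the paper.
    return row_softmax(f @ f.T)

def make_seed(x, importance_net):
    # Seed s = M x: a diagonal importance map (sigmoid output of the
    # learned layers H, mocked by `importance_net`) re-weights x.
    m = 1.0 / (1.0 + np.exp(-importance_net(x)))   # M_ii in (0, 1)
    return m[:, None] * x                          # same as diag(m) @ x

def random_walk_step(y, P, s, mu):
    # One random walk of Eq. 4: y_{t+1} = mu_t P_t y_t + (1 - mu_t) s
    return mu * (P @ y) + (1 - mu) * s

rng = np.random.default_rng(0)
f = rng.random((6, 4))                             # toy features
x = rng.standard_normal((6, 3))                    # toy score map
P = transition_matrix(f)
s = make_seed(x, lambda x: x.sum(axis=1))          # placeholder for H
y1 = random_walk_step(s, P, s, mu=0.5)
```

The per-step output stays the same shape as the seed, so the walks can be chained freely with different transition matrices.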
Therefore, we choose to let the model learn the seed and the similarities on its own instead of providing supervision on affinities as in [15]. However, having no supervision for the cascaded random walks may also cause a problem. By the definition of Eq. 4, if a certain $P_t$ cannot gain useful similarity information from the corresponding features, $\mu_t$ will be set to a small value by the model during training, so that $y_{t+1} \approx s$. In this case, all the previous results of the random walks would be discarded. To preserve the useful information of the preceding random walks, we further introduce an adaptive identity mapping term. Reformulating Eq. 4 as $y_{t+1} = R(y_t, P_t, s, \mu_t)$, the ⊕ operation is finally defined as:

$$y_{t+1} = \beta_t R(y_t, P_t, s, \mu_t) + (1-\beta_t)y_t \quad (5)$$

where $\beta_t$ is another learned parameter that controls the degree of identity mapping. In our experiments, DifNet occasionally assigns a minimal value to a certain $\mu_t$, but then also reduces $\beta_t$ at the same time. In this way, the effect equals omitting that random walk ⊕, so the information from the preceding random walks can be preserved and passed to the following random walks.

Fig. 1(a) shows the whole framework of our DifNet; the upper branch is the seed branch while the lower branch is the similarity branch. We use ⊕ to represent the random walk operation, as illustrated in Fig. 1(b). For each ⊕, the inputs are: (1) features $f_t$ from a certain intermediate layer of the similarity branch; (2) the seed vector $s$ from the seed branch; (3) the output $y_t$ of the previous random walk. Given the inputs, ⊕ computes $P_t = \mathrm{softmax}(\Psi(f_t)^T \Psi(f_t))$, determines $\mu_t$ and $\beta_t$, and finally gives the output $y_{t+1}$ according to Eq. 5.

5 Experiments

5.1 Experimental Settings

Our DifNet can be built on any FCN-like model. In this paper, we choose DeeplabV2 [3] as our backbone. 
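Putting Eqs. 4 and 5 together, the cascade of ⊕ operations can be sketched as below. This is a hedged NumPy toy, not the trained network: the transition matrices are random row-normalized stand-ins, and fixed values replace the learned $\mu_t$ and $\beta_t$. It illustrates how setting $\beta_t = 0$ skips a walk while preserving the earlier results.

```python
import numpy as np

def cascade(s, Ps, mus, betas):
    # Cascaded random walks with adaptive identity mapping (Eq. 5):
    #   R(y, P, s, mu) = mu * P @ y + (1 - mu) * s        (Eq. 4)
    #   y_{t+1} = beta_t * R(y_t, P_t, s, mu_t) + (1 - beta_t) * y_t
    y = s
    for P, mu, beta in zip(Ps, mus, betas):
        y = beta * (mu * (P @ y) + (1 - mu) * s) + (1 - beta) * y
    return y

rng = np.random.default_rng(0)
N, K = 6, 3
# Five transition matrices, one per feature level (random stand-ins
# here; DifNet computes them from the similarity branch).
Ps = []
for _ in range(5):
    W = rng.random((N, N))
    Ps.append(W / W.sum(axis=1, keepdims=True))    # rows sum to one
s = rng.random((N, K))                             # toy seed

y = cascade(s, Ps, mus=[0.4] * 5, betas=[0.5] * 5)

# With beta_2 = 0, walk 2 becomes a pure identity mapping, which is
# equivalent to removing that walk from the cascade entirely.
y_skip = cascade(s, Ps, mus=[0.4] * 5, betas=[0.5, 0.0, 0.5, 0.5, 0.5])
```

The identity-mapping term thus lets the model discount an uninformative $P_t$ without discarding what the earlier walks have accumulated.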
The original DeeplabV2 reported promising performance by introducing atrous convolution, an ASPP module, multi-scale inputs with max fusion, CRF post-processing, and MS-COCO pre-training. Among these, the last three are external components from which other models could also benefit. Thus, to better study the diffusion property of our DifNet, we design our backbone with only atrous convolution, plus an ASPP module for the seed branch.

Model | mIOU(Val) | mIOU(Test)
Sim-Deeplab-18 | 66.33% | -
Sim-Deeplab-34 | 69.76% | -
Sim-Deeplab-50 | 70.78% | -
Sim-Deeplab-101 | 71.83% | 72.54%
Sim-Deeplab-101-CRF | 72.26% | -
DifNet-18 | 70.17% | 70.46%
DifNet-34 | 71.84% | 71.62%
DifNet-50-noASPP | 72.52% | -
DifNet-50 | 72.57% | 72.55%
DifNet-101 | 73.22% | 73.21%

Table 1: Comparison with simplified DeeplabV2 of different depths on the Pascal VOC dataset.

DeeplabV2 is based on the ResNet [18] architecture. To approximate the diffusion process, and for the sake of efficiency given the backbone architecture, our DifNet conducts five random walks in total, based on five transition matrices computed from the features of the four ResNet blocks as well as the input.

We study the performance and mechanism of our DifNet on the widely used Augmented Pascal VOC 2012 dataset [23, 24] and the Pascal Context dataset [25]. The Augmented Pascal VOC 2012 dataset has 10,582 training, 1,449 validation, and 1,456 testing images with pixel-level labels in 20 foreground object classes and one background class, while Pascal Context has 4,998 training and 5,105 validation images with pixel-level labels in 59 classes and one background category. Performance is measured in terms of pixel intersection-over-union (IOU) averaged across all classes. To train our model and the baseline models, we use a mini-batch of 16 images for 200 epochs and set the learning rate, learning policy, momentum and weight decay the same as [3]. 
We also augment the training dataset by flipping, scaling, and finally cropping to 321 × 321, due to computing resource limitations.

5.2 Performance Study

Pascal VOC. For quantitative comparison, we use simplified DeeplabV2 models as our baselines, which have only atrous convolution and an ASPP module, as in our model. Besides, both our DifNet and the baseline models use ResNet architectures pre-trained on ImageNet [26], while the other components in the models are trained from scratch. Though our DifNet has two branches, the depth of the model equals that of the deeper branch, because data flows through the two branches in parallel rather than in cascade during inference. For a fairer comparison, in Table 1, instead of comparing only at the same depth, we also report results at an equivalent number of parameters. For example, DifNet-50 has the same depth as Sim-Deeplab-50 while having an equivalent number of parameters to Sim-Deeplab-101. In experiments, our models achieve consistent improvements over the Sim-Deeplab models of the same depth and the same number of parameters on the Pascal VOC validation set, and the performance is also verified on the test set. To verify the effectiveness of our diffusion module, we conduct two further experiments: First, we test DifNet-50 without the ASPP module (DifNet-50-noASPP); the experiments show that the ASPP module plays only a limited role. Then, we run Sim-Deeplab-101 with CRF post-processing, which improves the performance from 71.83% to 72.26% at a cost of about 1.8 s/image (10 iterations), but is still worse than our DifNet-50 (72.57%).

Pascal Context. In Table 2, we further compare DifNet of different depths and with different components against the original DeeplabV2 [3] with different components and other methods on the Pascal Context dataset. 
Compared with the baseline models, our DifNet achieves promising performance with fewer components.

5.3 Mechanism Study

In this section, we focus on the mechanism and effect of the components in our model. We use DifNet-50 trained on the Pascal VOC dataset to carry out the following experiments.

FCN-8s[1]
CRF-RNN[12]
ParseNet[6]
ConvPP-8s[16]
UoA-Context+CRF[8]

ResNet-101
Deeplab[3]
Deeplab[3]
Deeplab[3](Sim-Deeplab)
Deeplab[3]
Deeplab[3]
DifNet(our model)
ResNet-50
DifNet(our model)
DifNet(our model)

MSC COCO ASPP CRF Diffuse

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

mIOU(Val)

39.1%
39.3%
40.4%
41.0%
43.3%

41.4%
42.9%
43.6%
44.7%
45.7%
46.0%

44.7%
45.1%

Table 2: Comparison with other methods and DeeplabV2 with different components on the Pascal Context dataset.

Figure 3: Visualization of the corresponding rows in $P_t$ for selected nodes in the image. The right five columns show similarities measured by the different $P_t$, respectively. More highlighted nodes are more similar to the selected node.

Seed Branch We compute the seed as $s = Mx$, where $x$ is the score map and $M$ is the importance map learned from the neighborhood of $x$. The influence of node $i$ in channel $k$ during the diffusion process is $|x_{i,k}|$. To visualize the influence of nodes over all channels, we define the influence map $E$ as $E_i = \sum_{k=1}^{K} |x_{i,k}|$. We show $x$, $E$ and $M$ in Fig. 2. Clearly, $x$ contains many outliers and suffers from poor boundary localization and spatially fragmented predictions. However, as observed from the influence map $E$, most of these outliers have little influence on the diffusion process. 
The importance map $M$ further reduces or increases the influence of certain regions, such as columns 4 and 5, where the keyboard is suppressed, and columns 3 and 9, where the sofa is enhanced, to refine the diffusion process.

Figure 4: Visualization of our seed and the outputs after each random walk ⊕.

Similarity Branch Our cascaded random walks are carried out on a sequence of transition matrices $P_t$ which measure similarities at different levels of semantics. To visualize these hierarchical semantic similarities, in Fig. 3, for a selected node $i$, we reshape the corresponding row of each transition matrix, $P_{t,i,:}$, to [h′, w′] and show it by color coding. $P_{t,i,:}$ represents the possibilities that the other nodes random walk to node $i$ under the t-th transition matrix, with $\sum_j P_{t,i,j} = 1$. As shown in Fig. 3, from $P_1$ to $P_5$, similarities are measured from low-level features such as color and texture up to high-level features such as objects. In particular, $P_t$ is able to identify fine-grained similarities among pixels belonging to coarsely labeled objects; for example, in the figures for nodes 1 and 6, the table mat and the painting are highlighted although they are labeled as table and background, respectively. These results also match our assumption that the similarity branch estimates the possibility of any two nodes belonging to the same class without knowing which class they are. Finally, the figures for nodes 3, 4, 6 and 7 also demonstrate our model's ability to capture long-range dependencies.

Diffusion We show the outputs of the random walks in Fig. 4. $RW_t$ denotes the output after the t-th random walk ⊕. The outputs are clearly refined step by step after each random walk. We also report the learned $\mu_t$ and $\beta_t$ for each random walk in Table 3. The increase of the parameter values means that, as data flows through our model, the output depends more on information transmitted from other nodes than on the initial seed and the previous random walk result. 
To validate the effectiveness of the transition matrices built on all the ResNet blocks, we also test DifNet-50 without the 2nd and 4th random walks; the performance drops by about 1 percent on the Pascal VOC validation set.

⊕t | µt | βt
⊕1 | 0.4159 | 0.4159
⊕2 | 0.4193 | 0.4825
⊕3 | 0.4077 | 0.5104
⊕4 | 0.6520 | 0.6570
⊕5 | 0.8956 | 0.8451

Table 3: Learned µt and βt for each ⊕t.

model | time
DifNet-50: Seed | 0.018±0.003s
DifNet-50: Similarity | 0.015±0.003s
DifNet-50: Diffusion | 0.006±0.001s
Sim-Deeplab-101 | 0.036±0.003s

Table 4: Time consumption comparison.

5.4 Efficiency Study

In Table 4, we report the time consumption for inference with inputs of size [3, 505, 505] on one GTX 1080 GPU. For inference, the total time consumption of our DifNet-50 is equivalent to that of Sim-Deeplab-101. However, in this case, the data can flow through the two branches of our model in parallel, so the computation can be further accelerated by model parallelism, up to two times faster. The diffusion process involves only matrix multiplications (five random walks) and can be implemented efficiently with little extra computation.

On the contrary, backpropagation through our model requires many more calculations than for the vanilla model. Since the outputs of the two branches determine the final results together, through a large number of information propagation paths, the parameters of the two branches are heavily mutually influenced during optimization. The time consumption of backpropagation in our DifNet-50 model is about 1.3 times that of Sim-Deeplab-101. However, in view of the benefits of model parallelism during inference, the extra time spent on training is considered acceptable.

6 Conclusion

We present DifNet for the semantic segmentation task; our model applies cascaded random walks to approximate a complex diffusion process. 
With these cascaded random walks, more details can be complemented according to the hierarchical semantic similarities, and meanwhile long-range dependencies are captured. Our model achieves promising performance compared with various baseline models, and the effectiveness of each component of our model is verified through comprehensive mechanism studies.

Acknowledgment

This work was supported by grants from the National Natural Science Foundation of China (61702301, 61332015), a China Postdoctoral Science Foundation funded project (2017M612272), the Fundamental Research Funds of Shandong University, and a National Basic Research (973) grant (2015CB352501).

References

[1] Long, J., E. Shelhamer, T. Darrell. Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.

[2] Chen, L.-C., G. Papandreou, I. Kokkinos, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In International Conference on Learning Representations (ICLR). 2015.

[3] —. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[4] Chen, L.-C., G. Papandreou, F. Schroff, et al. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. 2017.

[5] Zhao, H., J. Shi, X. Qi, et al. Pyramid scene parsing network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.

[6] Liu, W., A. Rabinovich, A. Berg. Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579. 2015.

[7] Liu, S., L. Qi, H. Qin, et al. Pyramid scene parsing network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018.

[8] Lin, G., C. Shen, A. van den Hengel, et al. 
Efficient piecewise training of deep structured models for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.

[9] Vemulapalli, R., O. Tuzel, M.-Y. Liu, et al. Gaussian conditional random field network for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.

[10] Liu, Z., X. Li, P. Luo, et al. Semantic image segmentation via deep parsing network. In The IEEE International Conference on Computer Vision (ICCV). 2015.

[11] Jampani, V., M. Kiefel, P. V. Gehler. Learning sparse high dimensional filters: Image filtering, dense CRFs and bilateral neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.

[12] Zheng, S., S. Jayasumana, B. Romera-Paredes, et al. Conditional random fields as recurrent neural networks. In The IEEE International Conference on Computer Vision (ICCV). 2015.

[13] Chandra, S., I. Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. In European Conference on Computer Vision (ECCV). 2016.

[14] Chandra, S., N. Usunier, I. Kokkinos. Dense and low-rank Gaussian CRFs using deep embeddings. In The IEEE International Conference on Computer Vision (ICCV). 2017.

[15] Bertasius, G., L. Torresani, S. X. Yu, et al. Convolutional random walk networks for semantic image segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.

[16] Xie, S., X. Huang, Z. Tu. Top-down learning for structured labeling with convolutional pseudoprior. In European Conference on Computer Vision (ECCV). 2016.

[17] Wang, X., R. Girshick, A. Gupta, et al. Non-local neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018.

[18] He, K., X. Zhang, S. Ren, et al. Deep residual learning for image recognition. 
In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.

[19] Bertasius, G., J. Shi, L. Torresani. Semantic segmentation with boundary neural fields. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.

[20] Harley, A. W., K. G. Derpanis, I. Kokkinos. Segmentation-aware convolutional networks using local attention masks. In The IEEE International Conference on Computer Vision (ICCV). 2017.

[21] Liu, S., S. De Mello, J. Gu, et al. Learning affinity via spatial propagation networks. In Advances in Neural Information Processing Systems (NIPS). 2017.

[22] Jiang, P., N. Vasconcelos, J. Peng. Generic promotion of diffusion-based salient object detection. In The IEEE International Conference on Computer Vision (ICCV). 2015.

[23] Everingham, M., S. M. A. Eslami, L. Van Gool, et al. The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 2015.

[24] Hariharan, B., P. Arbelaez, L. Bourdev, et al. Semantic contours from inverse detectors. In International Conference on Computer Vision (ICCV). 2011.

[25] Mottaghi, R., X. Chen, X. Liu, et al. The role of context for object detection and semantic segmentation in the wild. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2014.

[26] Russakovsky, O., J. Deng, H. Su, et al. ImageNet Large Scale Visual Recognition Challenge. 
International Journal of Computer Vision (IJCV), 2015.\n", "award": [], "sourceid": 828, "authors": [{"given_name": "Peng", "family_name": "Jiang", "institution": "Shandong University"}, {"given_name": "Fanglin", "family_name": "Gu", "institution": "Shandong University"}, {"given_name": "Yunhai", "family_name": "Wang", "institution": "Shandong University"}, {"given_name": "Changhe", "family_name": "Tu", "institution": "Shandong University"}, {"given_name": "Baoquan", "family_name": "Chen", "institution": "Shandong University"}]}