{"title": "Learning Affinity via Spatial Propagation Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1520, "page_last": 1530, "abstract": "In this paper, we propose a spatial propagation networks for learning affinity matrix. We show that by constructing a row/column linear propagation model, the spatially variant transformation matrix constitutes an affinity matrix that models dense, global pairwise similarities of an image. Specifically, we develop a three-way connection for the linear propagation model, which (a) formulates a sparse transformation matrix where all elements can be the output from a deep CNN, but (b) results in a dense affinity matrix that is effective to model any task-specific pairwise similarity. Instead of designing the similarity kernels according to image features of two points, we can directly output all similarities in a pure data-driven manner. The spatial propagation network is a generic framework that can be applied to numerous tasks, which traditionally benefit from designed affinity, e.g., image matting, colorization, and guided filtering, to name a few. Furthermore, the model can also learn semantic-aware affinity for high-level vision tasks due to the learning capability of the deep model. We validate the proposed framework by refinement of object segmentation. Experiments on the HELEN face parsing and PASCAL VOC-2012 semantic segmentation tasks show that the spatial propagation network provides general, effective and efficient solutions for generating high-quality segmentation results.", "full_text": "Learning Af\ufb01nity via Spatial Propagation Networks\n\nSifei Liu\n\nUC Merced, NVIDIA\n\nShalini De Mello\n\nNVIDIA\n\nJinwei Gu\nNVIDIA\n\nGuangyu Zhong\n\nDalian University of Technology\n\nMing-Hsuan Yang\nUC Merced, NVIDIA\n\nJan Kautz\nNVIDIA\n\nAbstract\n\nIn this paper, we propose spatial propagation networks for learning the af\ufb01nity ma-\ntrix for vision tasks. 
We show that by constructing a row/column linear propagation model, the spatially varying transformation matrix exactly constitutes an affinity matrix that models dense, global pairwise relationships of an image. Specifically, we develop a three-way connection for the linear propagation model, which (a) formulates a sparse transformation matrix, where all elements can be outputs from a deep CNN, but (b) results in a dense affinity matrix that effectively models any task-specific pairwise similarity matrix. Instead of designing the similarity kernels according to image features of two points, we can directly output all the similarities in a purely data-driven manner. The spatial propagation network is a generic framework that can be applied to many affinity-related tasks, such as image matting, segmentation and colorization, to name a few. Essentially, the model can learn semantically-aware affinity values for high-level vision tasks due to the powerful learning capability of deep CNNs. We validate the framework on the task of refinement of image segmentation boundaries. Experiments on the HELEN face parsing and PASCAL VOC-2012 semantic segmentation tasks show that the spatial propagation network provides a general, effective and efficient solution for generating high-quality segmentation results.

1 Introduction

An affinity matrix is a generic matrix that determines how close, or similar, two points are in a space. In computer vision tasks, it is a weighted graph that regards each pixel as a node, and connects each pair of pixels by an edge [25, 16, 15, 10, 29]. The weight on that edge should reflect the pairwise similarity with respect to different tasks.
For example, for low-level vision tasks such as image filtering, the affinity values should reveal the low-level coherence of color and texture [29, 28, 10, 9]; for mid- to high-level vision tasks such as image matting and segmentation [16, 22], the affinity measure should reveal the semantic-level pairwise similarities. Most techniques explicitly or implicitly assume a measurement or a similarity structure over the space of configurations. The success of such algorithms depends heavily on the assumptions made to construct these affinity matrices, which are generally not treated as part of the learning problem.

In this paper, we show that the problem of learning the affinity matrix can be equivalently expressed as learning a group of small row/column-wise, spatially varying linear transformation matrices. Since a linear transformation can be easily implemented as a differentiable module in a deep neural network, the transformation matrix can be learned in a purely data-driven manner as opposed to being constructed by hand. Specifically, we adopt an independent deep CNN that takes the original RGB image as input and outputs all entities of the matrix, such that the affinity is learned by a deep model conditioned on the specific input. We show that using a three-way connection, instead of the full connection between adjoining rows/columns, is sufficient for learning a dense affinity matrix and requires far fewer output channels of the deep CNN. Therefore, instead of using designed features and kernel tricks, our network outputs all entities of the affinity matrix in a data-driven manner.

The advantages of learning an affinity matrix in a data-driven manner are manifold. First, a hand-designed similarity matrix based on a distance metric in a certain space (e.g., RGB or Euclidean [10, 25, 5, 36, 14]) may not adequately describe the pairwise relationships in the mid-to-high-level feature spaces.
To apply such designed pairwise kernels to tasks such as semantic segmentation, multiple iterations are required [14, 5, 36] for satisfactory performance. In contrast, the proposed method learns and outputs all entities of an affinity matrix under direct supervision of the ultimate objectives, where no iteration, specific design or assumption about the kernel function is needed. Second, we can learn high-level semantic affinity measures by initializing with hierarchical deep features from pre-trained VGG [26] and ResNet [11] networks, where conventional metrics and kernels may not apply. Due to the above properties, the framework is far more efficient than related graphical models, such as the dense CRF.

Our proposed architecture, namely the spatial propagation network (SPN), contains a deep CNN that learns the entities of the affinity matrix, and a spatial linear propagation module, which propagates information in an image using the learned affinity values. Images or general 2D matrices are input into the module, and propagated under the guidance of the learned affinity values. All modules are differentiable and jointly trained using the stochastic gradient descent (SGD) method. The spatial linear propagation module is computationally efficient for inference due to the linear time complexity of its recurrent architecture.

2 Related Work

Numerous methods explicitly design affinity matrices for image filtering [29, 10], colorization [15], matting [16] and image segmentation [14] based on the characteristics of the problem. Other methods, such as total variation (TV) [23] and learning to diffuse [18], improve the modeling of pairwise relationships by utilizing different objectives, or by incorporating more priors into diffusion partial differential equations (PDEs).
However, due to the lack of an effective learning strategy, it is still challenging to produce a learned affinity for complex visual analysis problems. Recently, Maire et al. [22] trained a deep CNN to directly predict the entities of an affinity matrix, which demonstrated good performance on image segmentation. However, since the affinity is followed by a solver for spectral embedding as an independent part, it is not directly supervised for the classification/prediction task. Bertasius et al. [2] introduced a random walk network that optimizes the objectives of pixel-wise affinity for semantic segmentation. In contrast to ours, their affinity matrix is additionally supervised by ground-truth sparse pixel similarities, which limits the potential connections between pixels.

On the other hand, many graphical model-based methods have successfully improved the performance of image segmentation. In the deep learning framework, conditional random fields (CRFs) with efficient mean field inference are frequently used [14, 36, 17, 5, 24, 1] to model the pairwise relations in the semantic labeling space. Some methods use the CRF as a post-processing module [5], while others integrate it as a jointly-trained part [36, 17, 24, 1]. While both approaches describe densely connected pairwise relationships, dense CRFs rely on designed kernels, whereas our method directly learns all pairwise links. Since in this paper the SPN is trained as a universal segmentation refinement module, we specifically compare it with one of the methods [5] that relies on the dense CRF [14] as a post-processing strategy. Our architecture is also related to multi-dimensional RNNs and LSTMs [30, 3, 8].
However, both the standard RNN and LSTM contain multiple non-linear units and thus do not fit into our proposed affinity framework.

3 Proposed Approach

In this work, we construct a spatial propagation network that can transform a two-dimensional (2D) map (e.g., a coarse image segmentation) into a new one with desired properties (e.g., a refined segmentation). With spatially varying parameters that support the propagation process, we show theoretically in Section 3.1 that this module is equivalent to the standard anisotropic diffusion process [32, 18]. We prove that the transformation of maps is controlled by a Laplacian matrix that is constituted by the parameters of the spatial propagation module. Since the propagation module is differentiable, its parameters can be learned by any type of neural network (e.g., a typical deep CNN) that is connected to this module, through joint training. We introduce the spatial propagation network in Section 3.2, and specifically analyze the properties of different types of connections within its framework for learning the affinity matrix.

3.1 Linear Propagation as Spatial Diffusion

We apply a linear transformation by means of the spatial propagation network, where a matrix is scanned row/column-wise in four fixed directions: left-to-right, top-to-bottom, and vice versa. This strategy is widely used in [8, 30, 19, 4]. We take the left-to-right direction as an example for the following discussion; the other directions are processed independently in the same manner.

We denote X and H as two 2D maps of size n × n, with exactly the same dimensions as the matrix before and after spatial propagation, where x_t and h_t, respectively, represent their t-th columns with n × 1 elements each.
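To make the scanning concrete, the column-wise recurrence h_t = (I − d_t) x_t + w_t h_{t−1} defined next can be sketched in NumPy. This is only an illustration; the array layout and function name are our own, not the paper's CUDA implementation:

```python
import numpy as np

def left_to_right_propagate(X, W):
    """Column-wise linear propagation: h_t = (I - d_t) x_t + w_t h_{t-1}.

    X: (n, n) input map, scanned left to right.
    W: (n, n, n) stack of per-column transformation matrices w_t
       (W[t] links column t to column t-1; W[0] is unused).
    """
    n = X.shape[1]
    H = np.zeros_like(X, dtype=float)
    H[:, 0] = X[:, 0]  # initial condition: h_1 = x_1
    for t in range(1, n):
        w = W[t]
        # diagonal d_t: its i-th entry sums row i of w_t, excluding the diagonal entry
        d = np.diag(w.sum(axis=1) - np.diag(w))
        H[:, t] = (np.eye(n) - d) @ X[:, t] + w @ H[:, t - 1]
    return H
```

For zero-diagonal w_t, each update mixes x_t and h_{t−1} with coefficients that sum to one per pixel, so a constant map propagates unchanged, which previews the row-sum property proved below.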
We linearly propagate information from left to right between adjacent columns using an n × n linear transformation matrix w_t:

    h_t = (I − d_t) x_t + w_t h_{t−1},    t ∈ [2, n],        (1)

where I is the n × n identity matrix, the initial condition is h_1 = x_1, and d_t is a diagonal matrix whose i-th element is the sum of all the elements of the i-th row of w_t except w_t(i, i):

    d_t(i, i) = Σ_{j=1, j≠i}^{n} w_t(i, j).        (2)

To propagate across the entire image, the matrix H, where {h_t ∈ H, t ∈ [1, n]}, is updated in a column-wise manner recursively. For each column, h_t is a linear, weighted combination of the previous column h_{t−1} and the corresponding column x_t in X. When the recursive scanning is finished, the updated 2D matrix H can be expressed with an expanded formulation of Eq. (1):

          ⎡ I        0        0     ···  0   ⎤
          ⎢ w_2      λ_2      0     ···  0   ⎥
    H_v = ⎢ w_3 w_2  w_3 λ_2  λ_3   ···  0   ⎥ X_v = G X_v,        (3)
          ⎢ ⋮        ⋮        ⋮     ⋱    ⋮   ⎥
          ⎣ ···      ···      ···   ···  λ_n ⎦

where G is a lower triangular, N × N (N = n²) transformation matrix relating X and H, and H_v = [h_1^T, ..., h_n^T]^T and X_v = [x_1^T, ..., x_n^T]^T are the vectorized versions of H and X, respectively, each of dimension N × 1, created by concatenating the h_t and x_t along a single dimension. All the parameters {λ_t, w_t, d_t, I}, t ∈ [2, n], are n × n sub-matrices, where λ_t = I − d_t.

In the following, we validate that Eq. (3) can be expressed as a spatial anisotropic diffusion process, with the corresponding propagation affinity matrix constituted by all the w_t for t ∈ [2, n].

Theorem 1. The summation of the elements in each row of G equals one.

Since G contains n × n sub-matrices, each representing the transformation between the corresponding columns of H and X, we denote all the weights used to compute h_t as the t-th block-row G_t. On setting λ_1 = I, the k-th constituent n × n sub-matrix of G_t is:

    G_tk = ∏_{τ=k+1}^{t} w_τ · λ_k,   k ∈ [1, t−1],
    G_tk = λ_k,                       k = t.        (4)

To prove that the summation of any row of G equals one, we instead prove that for all t ∈ [1, n], each row of G_t sums to one.

Proof. Denoting E = [1, 1, ..., 1]^T as an n × 1 vector, we need to prove that G_t [1, ..., 1]^T_{N×1} = E, or equivalently Σ_{k=1}^{t} G_tk E = E, because G is a lower triangular matrix. In the following, we first prove by mathematical induction that, for m ∈ [1, t−1],

    Σ_{k=1}^{m} G_tk E = ∏_{τ=m+1}^{t} w_τ E.

Initial step. When m = 1, Σ_{k=1}^{m} G_tk E = G_t1 E = ∏_{τ=2}^{t} w_τ E, which satisfies the assertion.

Figure 1: Different propagation ranges for (a) one-way connections and (b) three-way connections. Each pixel (node) receives information from a single line with a one-way connection, and from a two-dimensional plane with a three-way connection. Integrating the four directions with (a) results in global but sparsely connected pairwise relations, while (b) formulates global and densely connected pairwise relations.

Inductive step. Assume that for some m ∈ [1, t−1] we have Σ_{k=1}^{m} G_tk E = ∏_{τ=m+1}^{t} w_τ E; we prove that the formula holds for m + 1 ∈ [1, t−1]:

    Σ_{k=1}^{m+1} G_tk E = Σ_{k=1}^{m} G_tk E + G_t(m+1) E
                         = ∏_{τ=m+1}^{t} w_τ E + ∏_{τ=m+2}^{t} w_τ λ_{m+1} E
                         = ∏_{τ=m+2}^{t} w_τ [(w_{m+1} + I − d_{m+1}) E].        (5)

According to the formulation of the diagonal matrix in Eq. (2), we have (w_{m+1} + I − d_{m+1}) E = E, so that Σ_{k=1}^{m+1} G_tk E = ∏_{τ=m+2}^{t} w_τ E, and the assertion is satisfied. Finally, for the full row sum we have:

    Σ_{k=1}^{t} G_tk E = Σ_{k=1}^{t−1} G_tk E + G_tt E = ∏_{τ=t}^{t} w_τ E + λ_t E = w_t E + (I − d_t) E = E,        (6)

which yields the equivalence of Theorem 1.

Theorem 2. We define the evolution of a 2D matrix as a time sequence {U}_T, where U(T = 1) = U_1 is the initial state. When the transformation between any two adjacent states follows Eq. (3), the sequence is a diffusion process expressed by the partial differential equation (PDE):

    ∂_T U = −L U,        (7)

where L = D − A is the Laplacian matrix, D is the degree matrix composed of the d_t in Eq. (2), and A is the affinity matrix composed of the off-diagonal elements of G.

Proof. We substitute X and H by two consecutive matrices U_{T+1} and U_T in Eq. (3). According to Theorem 1, the sum of each row of I − G is 0, so that I − G constitutes a standard Laplacian matrix. Since G has the diagonal sub-matrices I − d_t, we can rewrite Eq. (3) as:

    U_{T+1} = (I − D + A) U_T = (I − L) U_T,        (8)

where G = I − D + A, D is an N × N diagonal matrix containing all the d_t, and A is the off-diagonal part of G. This yields U_{T+1} − U_T = −L U_T, a discrete formulation of Eq. (7) with a time discretization interval of one.

Theorem 2 shows the essential property of the row/column-wise linear propagation in Eq. (1): it is a standard diffusion process in which L defines the spatial propagation and A, the affinity matrix, describes the similarities between any two points. Therefore, learning the image affinity matrix A in Eq. (8) is equivalent to learning the group of transformation matrices w_t in Eq. (1).

In the following section, we show how to build the spatial propagation of Eq. (1) as a differentiable module that can be inserted into a standard feed-forward neural network, so that the affinity matrix A can be learned in a data-driven manner.

3.2 Learning Data-Driven Affinity

Since the spatial propagation in Eq. (1) is differentiable, the transformation matrix can easily be configured as a row/column-wise fully-connected layer. However, we note that since the affinity matrix indicates the pairwise similarities of a specific input, it should also be conditioned on the content of that input (i.e., different input images should have different affinity matrices). Instead of setting the w_t matrices as fixed parameters of the module, we design them as the outputs of a deep CNN, which can be directly conditioned on the input image.

One simple way is to set the output of the deep CNN to the same spatial size as the input matrix. When the input has c channels (e.g., an RGB image has c = 3), the output needs n × c × 4 channels (there are n connections from the previous row/column per pixel per channel, and four different directions). Obviously, this is too many (e.g., a 128 × 128 × 16 feature map needs an output of 128 × 128 × 8192) to be implemented in a real-world system. Instead of using full connections between the adjacent rows/columns, we show that certain local connections, corresponding to a sparse row/column-wise transformation matrix, can also formulate a densely connected affinity. Specifically, we introduce (a) the one-way connection and (b) the three-way connection as two different ways to implement Eq. (1).

One-way connection. The one-way connection enables every pixel to connect to only one pixel from the previous row/column (see Figure 1(a)). It is equivalent to one-dimensional (1D) linear recurrent propagation that scans each row/column independently as a 1D sequence. Following Eq. (1), we denote x_{k,t} and h_{k,t} as the k-th pixels of the t-th columns, so that the left-to-right propagation with a one-way connection is:

    h_{k,t} = (1 − p_{k,t}) · x_{k,t} + p_{k,t} · h_{k,t−1},        (9)

where p_{k,t} is a scalar weight indicating the propagation strength between the pixels at {k, t−1} and {k, t}. Equivalently, w_t in Eq. (1) is a diagonal matrix, with the elements constituted by p_{k,t}, k ∈ [1, n]. The one-way connection is a direct extension of sequential recurrent propagation [8, 31, 13]. The exact formulation of Eq. (9) has been used previously for semantic segmentation [4] and for learning low-level vision filters [19]. In [4], Chen et al. explain it as a domain transform, where for semantic segmentation, p corresponds to the object edges. Liu et al. [19] explain it via arbitrary-order recursive filters, where p corresponds to more general image properties (e.g., low-level image/color edges, missing pixels, etc.). Both of these formulations can be explained as the same linear propagation framework of Eq. (1) with one-way connections.

Three-way connection. We propose a novel three-way connection in this paper. It enables each pixel to connect to three pixels from the previous row/column, i.e., the top-left, left and bottom-left pixels of the previous column for the left-to-right propagation direction (see Figure 1(b)). With the same notation, we denote N_k as the set of these three pixels. The propagation with a three-way connection is then:

    h_{k,t} = (1 − Σ_{j∈N_k} p_{j,t}) x_{k,t} + Σ_{j∈N_k} p_{j,t} h_{j,t−1}.        (10)

Equivalently, w_t forms a tridiagonal matrix, with {p_{j,t}, j ∈ N_k} constituting the three non-zero elements of its k-th row.

Relations to the affinity matrix.
As introduced in Theorem 2, the affinity matrix A of the linear propagation is composed of the off-diagonal elements of G in Eq. (3). The one-way connection formulates a sparse affinity matrix: each sub-matrix of A has nonzero elements only along its diagonal, and the product of several individual diagonal matrices is still a diagonal matrix. The three-way connection, on the other hand, also with a sparse w_t, can form a relatively dense A through the multiplication of several different tridiagonal matrices. This means that pixels can be densely and globally associated, simply by increasing the number of connections of each pixel during spatial propagation from one to three. As shown in Figures 1(a) and 1(b), the propagation with one-way connections is restricted to a single row, while the three-way connections expand the region to a triangular 2D plane with respect to each direction. The summation of the four directions results in dense connections between all pairs of pixels (see Figure 1(b)).

Stability of linear propagation. Model stability is of critical importance when designing linear systems. In the context of spatial propagation (Eq. (1)), it refers to restricting the responses or errors that flow through the module from going to infinity, and to preventing the network from encountering vanishing gradients during backpropagation [37]. Specifically, the norm of the temporal Jacobian ∂h_t / ∂h_{t−1} should be equal to or less than one. In our case, this is equivalent to regularizing each transformation matrix w_t with its norm satisfying

    ‖∂h_t / ∂h_{t−1}‖ = ‖w_t‖ ≤ λ_max,        (11)

where λ_max denotes the largest singular value of w_t. The condition λ_max ≤ 1 provides a sufficient condition for stability.

Theorem 3. Let {p_{j,t}}_{j∈N_k} be the weights of the k-th row of w_t. The model is stable if Σ_{j∈N_k} |p_{j,t}| ≤ 1. See the supplementary material for the proof.

Theorem 3 shows that the stability of a linear propagation model can be maintained by regularizing all the weights of each pixel in the hidden layer H such that the summation of their absolute values is less than or equal to one. For the one-way connection, Chen et al. [4] limit each scalar output p to lie within (0, 1). Liu et al. [19] extend the range to (−1, 1), where the negative weights show preferable effects for learning image enhancers. This indicates that the affinity matrix is not necessarily restricted to be positive/positive semi-definite (the same setting is also applied in [16]). For the three-way connection, we simply regularize the three weights (the outputs of a deep CNN) according to Theorem 3, without restricting them to form a positive/positive semi-definite matrix.

Figure 2: We illustrate the general architecture of the SPN using a three-way connection for segmentation refinement. The network, divided by the black dashed line, contains a propagation module (upper) and a guidance network (lower). The guidance network outputs all the entities that constitute the four affinity matrices, where each sub-matrix w_t is a tridiagonal matrix. The propagation module, guided by the affinity matrices, deforms the input mask to the desired shape. All modules are differentiable and jointly learned via SGD.

4 Implementation

We specify two separate branches: (a) a deep CNN, namely the guidance network, which outputs all the elements of the transformation matrices, and (b) a linear propagation module that propagates the input map using these elements (see Figure 2). The propagation module receives an input map and outputs a refined or transformed result. It also takes the weights learned by the deep CNN guidance network as a second input.
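A minimal sketch of the propagation module's core computation, i.e., the three-way update of Eq. (10) together with the Theorem 3 normalization, might look as follows. The shapes, names and the wrap-around edge handling are our simplifications, not the released implementation:

```python
import numpy as np

def normalize_three_way(P):
    """Stabilize per Theorem 3: rescale the three weights of each pixel so
    that the sum of their absolute values is at most one.
    P has shape (3, n, n): one plane per connection (up/middle/down)."""
    s = np.abs(P).sum(axis=0, keepdims=True)
    return P / np.maximum(s, 1.0)  # weights already satisfying the bound are untouched

def three_way_scan_lr(X, P):
    """Left-to-right pass of Eq. (10): each pixel mixes the top-left, left
    and bottom-left hidden pixels of the previous column."""
    H = X.astype(float).copy()
    for t in range(1, X.shape[1]):
        prev = H[:, t - 1]
        # h[k-1, t-1] and h[k+1, t-1]; wrap-around used here only for brevity
        up, down = np.roll(prev, 1), np.roll(prev, -1)
        p_up, p_mid, p_down = P[0, :, t], P[1, :, t], P[2, :, t]
        H[:, t] = ((1 - p_up - p_mid - p_down) * X[:, t]
                   + p_up * up + p_mid * prev + p_down * down)
    return H
```

Setting p_up = p_down = 0 recovers the one-way connection of Eq. (9); a full SPN would run four such directional passes and merge them with node-wise max-pooling, as described below.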
The structure of the guidance network can be any regular CNN designed for the task at hand. Examples of this network are described in Section 5. It takes as input any 2D matrix that can help with learning the affinity matrix (typically an RGB image), and outputs all the weights that constitute the transformation matrices w_t.

If a map of size n × n × c is input into the propagation module, the guidance network needs to output a weight map with dimensions of n × n × c × (3 × 4), i.e., each pixel in the input map is paired with 3 scalar weights per direction, and there are 4 directions in total. The propagation module contains 4 independent hidden layers for the different directions, where each layer combines the input map with its respective weight map using Eq. (10). All submodules are differentiable and jointly trained using stochastic gradient descent (SGD). We use node-wise max-pooling [19] to integrate the hidden layers and to obtain the final propagation result.

We implement the network with a modified version of Caffe [12]. We employ a parallel version of the SPN, implemented in CUDA, for propagating each row/column to the next one. We use the SGD optimizer, and set the base learning rate to 0.0001. In general, we train the networks for the HELEN and VOC segmentation tasks for about 40 and 100 epochs, respectively. The inference time (without cuDNN) of the SPN on HELEN and Pascal VOC is about 7 ms and 84 ms, respectively, for an image of size 512 × 512 pixels. In comparison, the dense CRF (CPU only) takes about 1 s [14], 3.2 s [5] and 4.4 s [36] with different publicly available implementations. We note that the majority of the time for the SPN is spent in the guidance network, which can be accelerated by utilizing various existing network compression strategies, applying smaller models, or sharing weights with the segmentation model if they are trained jointly.
During inference, a single 64 × 64 × 32 SPN hidden layer takes 1.3 ms with the same computational settings.

Figure 3: Results of face parsing on the HELEN dataset, with detailed regions cropped from the high-resolution images (columns: original, CNN-base, CNN-Highres, one-way SPN, three-way SPN, ground truth). The images are all in high resolution and can be viewed by zooming in.

5 Experimental Results

The SPN can be trained jointly with any segmentation CNN model by being inserted on top of the last layer that outputs probability maps, or trained separately as a segmentation refinement model. In this paper we choose the second option. Given a coarse image segmentation mask as the input to the spatial propagation module, we show that the SPN can produce higher-quality masks with significantly refined details at object boundaries. Many models [21, 5] generate low-resolution segmentation masks with coarse boundary shapes to seek a balance between computational efficiency and semantic accuracy. The majority of works [21, 5, 36] choose to first produce an output probability map with 8× lower resolution, and then refine the result using either post-processing [5] or jointly trained modules [36]. Hence, producing high-quality segmentation results with low computational complexity is a non-trivial task. In this work, we train only one SPN model for a specific task, and treat it as a universal refinement tool for the different publicly available CNN models for each of these tasks.

We carry out the refinement of segmentation masks on two tasks: (a) generating high-resolution segmentations on the HELEN face parsing dataset [27]; and (b) refining generic object segmentation maps generated by pretrained models (e.g., VGG-based models [21, 5]).
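Since the base models output probability maps at 8× lower resolution, a coarse map must first be resized to the target resolution before refinement. A self-contained bilinear upsampling of this kind can be sketched as follows (our illustration, assuming pixel-center sampling with edge clamping, not the paper's exact resize routine):

```python
import numpy as np

def bilinear_upsample(prob, factor=8):
    """Bilinearly upsample a coarse probability map of shape (h, w, c)
    by `factor`, clamping samples at the image border."""
    h, w, c = prob.shape
    # source coordinates of each target pixel center
    ys = (np.arange(h * factor) + 0.5) / factor - 0.5
    xs = (np.arange(w * factor) + 0.5) / factor - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0.0, 1.0)[:, None, None]
    wx = np.clip(xs - x0, 0.0, 1.0)[None, :, None]
    top = prob[y0][:, x0] * (1 - wx) + prob[y0][:, x1] * wx
    bot = prob[y1][:, x0] * (1 - wx) + prob[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```

The upsampled probabilities (or the argmax mask derived from them) then serve as the coarse input to the refinement SPN.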
For the HELEN dataset, we directly use low-resolution RGB face images to train a baseline parser, which successfully encapsulates the global semantic information. The SPN is then trained on top of the coarse segmentations to generate high-resolution outputs. For the Pascal VOC dataset, we train the SPN on top of the coarse segmentation results generated by FCN-8s [21], and directly generalize it to any other pretrained model.

General network settings. For both tasks, we train the SPN as a patch refinement model on top of the coarse map containing the basic semantic information. It is trained with smaller patches cropped from the original high-resolution images, their corresponding coarse segmentation maps produced by a baseline segmentor, and the corresponding high-resolution ground-truth segmentation masks for supervision. All coarse segmentation maps are obtained by applying a baseline (for HELEN) or pre-trained (for Pascal VOC) image segmentation CNN to their standard training splits [6, 5]. Since the baseline HELEN parser produces low-resolution segmentation results, we upsample them with a bilinear filter to the same size as the desired higher output resolution. We fix the size of our input patches to 128 × 128, use the softmax loss, and use the SGD solver for all the experiments. During training, the patches are sampled from image regions that contain more than one ground-truth segmentation label (e.g., a patch with all pixels labeled as "background" will not be sampled). During testing, for the VOC dataset, we restrict the classes in the refined results to those contained in the corresponding coarse input. Further details are provided in the supplementary material.

HELEN Dataset.
The HELEN dataset provides high-resolution photography-style face images\n(2330 in total), with high-quality manually labeled facial components including eyes, eyebrows, nose,\nlips, and jawline, which makes the high-resolution segmentation tasks applicable. All previous work\nutilize low-resolution parsing output as their \ufb01nal results for evaluation. Although many [27, 33, 20]\nachieve preferable performance, their results cannot be directly adopted by high-quality facial image\nediting applications. We use the same settings as the state-of-the work [20]. We use similarity\ntransformation according to the results of 5-keypoint detection [35] to align all face images to the\ncenter. Keeping the original resolution, we then crop or pad them to the size of 1024 \u00d7 1024.\n\n7\n\n\fTable 1: Quantitative evaluation results on the HELEN dataset. We denote the upper and lower lips as \u201cU-lip\u201d\nand \u201cL-lip\u201d, and overall mouth part as \u201cmouth\u201d, respectively. The label de\ufb01nitions follow [20].\n\neyes\n74.74\n74.86\n74.46\n85.44\n87.71\n\nnose mouth U-lip\n59.22\n90.23\n55.61\n89.16\n89.42\n68.15\n77.61\n91.51\n92.62\n80.17\n\n82.07\n83.83\n81.83\n88.13\n91.08\n\nL-lip\n66.30\n64.88\n72.00\n70.81\n71.63\n\nin-mouth\n\n81.70\n71.72\n71.95\n79.95\n83.13\n\noverall\n83.68\n82.89\n83.21\n87.09\n89.30\n\nMethod\n\nLiu et al. [20]\nbaseline-CNN\nHighres-CNN\nSPN (one-way)\nSPN (three-way)\n\nskin\n90.87\n90.53\n91.78\n92.26\n93.10\n\nbrows\n69.89\n70.09\n71.84\n75.05\n78.53\n\nWe \ufb01rst train a baseline CNN with a symmetric U-net structure, where both the input image and the\noutput map are 8\u00d7 smaller than the original image. The detailed settings are in the supplementary\nmeterial. We apply the multi-objective loss as [20] to improve the accuracy along the boundaries. We\nnote that the symmetric structure is powerful, since the results we obtained for the baseline CNN\nare comparable (see Table. 
1) to those of [20], who apply a much larger model (38 MB vs. 12 MB). We then train an SPN on top of the baseline CNN results on the training set, with patches sampled from the high-resolution input images and the coarse segmentation masks. For the guidance network, we use the same structure as the baseline segmentation network, except that its upsampling part ends at a resolution of 64 × 64 and its output layer has 32 × 12 = 384 channels. In addition, we train another face parsing CNN with 1024 × 1024 sized inputs and outputs (CNN-Highres) for better comparison. It has three more sub-modules at each end of the baseline network, all configured with 16 channels to process higher-resolution images.

We show quantitative and qualitative results in Table 1 and Figure 3, respectively. We compare the one-way and three-way connection SPNs with the baseline, CNN-Highres, and the most relevant state-of-the-art face parsing technique [20]. Note that the results of the baseline and [20]1 are bi-linearly upsampled to 1024 × 1024 before evaluation. Overall, both SPNs outperform the other techniques by a significant margin of over 6 intersection-over-union (IoU) points. The improvement is especially large for the smaller facial components (e.g., eyes and lips), on which the low-resolution segmentation network performs poorly. We note that the one-way connection-based SPN is quite successful on relatively simple tasks such as the HELEN dataset, but fails on more complex tasks, as revealed by the results on the Pascal VOC dataset in the following section.

Pascal VOC Dataset. The PASCAL VOC 2012 segmentation benchmark [6] involves 20 foreground object classes and one background class. The original dataset contains 1464 training, 1449 validation and 1456 testing images, with pixel-level annotations. The performance is mainly measured in terms of pixel IoU averaged across the 21 classes. 
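As a concrete reference, the per-class mean-IoU metric used throughout these tables can be sketched as follows. This is a minimal NumPy illustration (the function and variable names are ours, not from the paper), which averages IoU over the classes that actually appear in either map:

```python
import numpy as np

def mean_iou(pred, gt, num_classes=21):
    """Mean intersection-over-union over the classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x3 label maps with classes 0, 1, 2.
pred = np.array([[0, 0, 1],
                 [1, 1, 2]])
gt = np.array([[0, 1, 1],
               [1, 1, 2]])
print(mean_iou(pred, gt))  # per-class IoUs 0.5, 0.75, 1.0 -> mean 0.75
```

Benchmark implementations additionally accumulate the intersection and union counts over the whole dataset before dividing, rather than averaging per image.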
We train our SPNs on the train split with the coarse segmentation results produced by the FCN-8s model [21]. That model is fine-tuned from a pre-trained VGG-16 network, where different levels of features are upsampled and concatenated to obtain the final, low-resolution segmentation results (8× smaller than the original image size). The guidance network of the SPN also fine-tunes the VGG-16 structure, from the beginning to the pool5 layer, as its downsampling part. Similar to the settings for the HELEN dataset, the upsampling part has a symmetric structure with skip links up to a feature resolution of 64 × 64. The spatial propagation module has the same configuration as the SPN employed for the HELEN dataset. The model is applied to the coarse segmentation maps of the validation and test splits generated by any image segmentation algorithm, without fine-tuning. We test the refinement SPN on three base models: (a) FCN-8s [21], (b) the atrous spatial pyramid pooling (ASPP-L) network fine-tuned from VGG-16, denoted as Deeplab VGG, and (c) the ASPP-L multi-scale network fine-tuned from ResNet-101 [11] (pre-trained on the COCO dataset), denoted as Deeplab ResNet-101. Among them, (b) and (c) are the two basic models from [5], which are further refined with a dense CRF [14] conditioned on the original image.

Table 3 shows that with the three-way SPN, segmentation accuracy is significantly improved over the coarse segmentation results for all three baseline models. The model generalizes well and can successfully refine coarse maps from different pre-trained models by a large margin. 
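The test-time restriction from the general settings, where refined labels are limited to the classes present in the coarse input, can be sketched as follows (a minimal NumPy sketch; the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def restrict_to_coarse(refined_logits, coarse_labels):
    """Mask out classes absent from the coarse map before the argmax,
    so refinement cannot introduce classes the base model never predicted."""
    present = np.unique(coarse_labels)
    masked = np.full_like(refined_logits, -np.inf)
    masked[..., present] = refined_logits[..., present]
    return masked.argmax(axis=-1)

# Toy example: a 1x2 image with 3 classes; the coarse map contains only
# {0, 2}, so the second pixel's spurious class-1 score is ignored.
logits = np.array([[[0.1, 0.9, 0.0],
                    [0.2, 0.8, 0.7]]])
coarse = np.array([[0, 2]])
print(restrict_to_coarse(logits, coarse))  # [[0 2]]
```

Without the mask, the plain argmax would assign class 1 to both pixels in this toy example.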
Table 2: Quantitative comparison (mean IoU) with dense CRF-based refinement [5] on Deeplab pre-trained models.

mIoU     CNN    +dense CRF  +SPN
VGG      68.97  71.57       73.12
ResNet   76.40  77.69       79.76

1 The original output size (also used for evaluation) is 250 × 250.

Figure 4: Visualization of Pascal VOC segmentation results (left) and object probability (computed as 1 − Pb, where Pb is the probability of background). The "pretrained" column denotes the base Deeplab ResNet-101 model, while the remaining four columns show the base model combined with the dense CRF [5] and the proposed SPN, respectively.

Table 3: Quantitative evaluation results on the Pascal VOC dataset. We compare the two connections of the SPN with the corresponding pre-trained models: (a) FCN-8s (F), (b) Deeplab VGG (V) and (c) Deeplab ResNet-101 (R). AC denotes accuracy; "+" denotes added on top of the base model.

Model       F      +1 way  +3 way  V      +1 way  +3 way  R      +1 way  +3 way
overall AC  91.22  90.64   92.90   92.61  92.16   93.83   94.63  94.12   95.49
mean AC     77.61  70.64   79.49   80.97  73.53   83.15   84.16  77.46   86.09
mean IoU    65.51  60.95   69.86   68.97  64.42   73.12   76.46  72.02   79.76

Different from the HELEN dataset, the one-way SPN fails to refine the segmentation, probably due to its limited capability of learning a preferable affinity in sparse form, especially when the data distribution becomes more complex. Table 2 shows that, by replacing the dense CRF module with the same refinement model, the performance is boosted by a large margin without fine-tuning. On the test split, the Deeplab ResNet-101 based SPN achieves a mean IoU of 80.22, while the dense CRF reaches 79.7. The three-way SPN produces fine visual results, as shown in the red bounding box of Figure 4. 
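For concreteness, one left-to-right pass of the three-way linear propagation used by this SPN variant, where each pixel combines weighted contributions from the top-left, left, and bottom-left neighbors in the previous column, can be sketched as follows (a minimal, unoptimized NumPy illustration; the shapes and names are our own, and the actual model learns the weights with a guidance CNN and runs four directional passes):

```python
import numpy as np

def three_way_scan_lr(x, w):
    """One left-to-right pass of three-way linear propagation.

    x: (H, W) map to be propagated.
    w: (H, W, 3) weights for the top-left, left, and bottom-left
       neighbors of each pixel (all from the previous column).
    """
    H, W = x.shape
    h = x.copy().astype(float)  # the first column is the input itself
    for t in range(1, W):
        for i in range(H):
            nbrs = [(i - 1, 0), (i, 1), (i + 1, 2)]
            valid = [(j, k) for j, k in nbrs if 0 <= j < H]
            lam = sum(w[i, t, k] for _, k in valid)
            prop = sum(w[i, t, k] * h[j, t - 1] for j, k in valid)
            # Blend the input with the propagated hidden state.
            h[i, t] = (1.0 - lam) * x[i, t] + prop
    return h

# With all propagation weights zero, the pass leaves the input unchanged.
x = np.arange(6, dtype=float).reshape(2, 3)
print(np.allclose(three_way_scan_lr(x, np.zeros((2, 3, 3))), x))  # True
```

Setting the left-neighbor weight to 1 everywhere instead copies the first column across the row, illustrating how nonzero weights trade off the input against the propagated state.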
By comparing the probability maps (column 3 versus column 7), the SPN exhibits fundamental improvements in object details, boundaries, and semantic integrity.

In addition, we show in Table 4 that the same refinement model can also be generalized to dilated-convolution based networks [34]. It significantly improves the quantitative performance on top of the "Front end" base model, as well as on top of the added multi-scale context aggregation module, denoted as "+Context". Notably, the SPN improves the base model by a much larger margin than the context aggregation module does (see "+3 way" vs. "+Context" in Table 4).

6 Conclusion

We propose spatial propagation networks for learning pairwise affinities for vision tasks. The SPN is a generic framework that can be applied to numerous tasks, and in this work we demonstrate its effectiveness for semantic object segmentation. Experiments on the HELEN face parsing and PASCAL VOC semantic segmentation tasks show that the spatial propagation network is general, effective and efficient for generating high-quality segmentation results.

Table 4: Quantitative evaluation results on the Pascal VOC dataset. We refine the base models proposed with dilated convolutions [34]. "+" denotes additions on top of the "Front end" model.

Model       Front end  +3 way  +Context  +Context+3 way
overall AC  93.03      93.89   93.44     94.35
mean AC     80.31      83.47   80.97     83.98
mean IoU    69.75      73.14   71.86     75.28

Acknowledgement. This work is supported in part by the NSF CAREER Grant #1149783, and gifts from Adobe and NVIDIA.

References

[1] A. Arnab, S. Jayasumana, S. Zheng, and P. H. Torr. Higher order conditional random fields in deep neural networks. In ECCV. Springer, 2016.

[2] G. Bertasius, L. Torresani, S. X. Yu, and J. Shi. Convolutional random walk networks for semantic image segmentation. 
arXiv preprint arXiv:1605.07681, 2016.

[3] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki. Scene labeling with LSTM recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[4] L. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille. Semantic image segmentation with task-specific edge detection using CNNs and a discriminatively trained domain transform. arXiv preprint arXiv:1511.03328, 2015.

[5] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. CoRR, abs/1606.00915, 2016.

[6] M. Everingham, S. A. Eslami, L. V. Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.

[7] S. Geršgorin. Über die Abgrenzung der Eigenwerte einer Matrix. Bulletin de l'Académie des Sciences de l'URSS. Classe des sciences mathématiques et naturelles, 1931.

[8] A. Graves, S. Fernández, and J. Schmidhuber. Multi-dimensional recurrent neural networks. In ICANN, pages 549–558, 2007.

[9] K. He, J. Sun, and X. Tang. Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2341–2353, 2011.

[10] K. He, J. Sun, and X. Tang. Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(6):1397–1409, 2013.

[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[13] N. Kalchbrenner, I. Danihelka, and A. Graves. 
Grid long short-term memory. arXiv preprint arXiv:1507.01526, 2015.

[14] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems, pages 109–117, 2011.

[15] A. Levin, D. Lischinski, and Y. Weiss. Colorization using optimization. ACM Transactions on Graphics (ToG), 23(3):689–694, 2004.

[16] A. Levin, D. Lischinski, and Y. Weiss. A closed-form solution to natural image matting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):228–242, 2008.

[17] G. Lin, C. Shen, I. D. Reid, and A. van den Hengel. Deeply learning the messages in message passing inference. arXiv preprint arXiv:1506.02108, 2015.

[18] R. Liu, G. Zhong, J. Cao, Z. Lin, S. Shan, and Z. Luo. Learning to diffuse: A new perspective to design PDEs for visual analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(12):2457–2471, 2016.

[19] S. Liu, J. Pan, and M.-H. Yang. Learning recursive filters for low-level vision via a hybrid neural network. In European Conference on Computer Vision, 2016.

[20] S. Liu, J. Yang, C. Huang, and M.-H. Yang. Multi-objective convolutional learning for face labeling. In CVPR, 2015.

[21] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[22] M. Maire, T. Narihira, and S. X. Yu. Affinity CNN: learning pixel-centric pairwise relations for figure/ground embedding. CoRR, abs/1512.02767, 2015.

[23] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.

[24] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. 
arXiv preprint arXiv:1503.02351, 2015.

[25] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[27] B. M. Smith, L. Zhang, J. Brandt, Z. Lin, and J. Yang. Exemplar-based face parsing. In CVPR, 2013.

[28] J. A. Suykens, J. D. Brabanter, L. Lukas, and J. Vandewalle. Weighted least squares support vector machines: robustness and sparse approximation. Neurocomputing, 48(1):85–105, 2002.

[29] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In ICCV, 1998.

[30] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[31] F. Visin, K. Kastner, K. Cho, M. Matteucci, A. Courville, and Y. Bengio. ReNet: A recurrent neural network based alternative to convolutional networks. arXiv preprint arXiv:1505.00393, 2015.

[32] J. Weickert. Anisotropic Diffusion in Image Processing, volume 1. Teubner Stuttgart, 1998.

[33] T. Yamashita, T. Nakamura, H. Fukui, Y. Yamauchi, and H. Fujiyoshi. Cost-alleviative learning for deep convolutional neural network-based facial part labeling. IPSJ Transactions on Computer Vision and Applications, 7:99–103, 2015.

[34] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.

[35] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In ECCV, 2014.

[36] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In IEEE International Conference on Computer Vision, 2015.

[37] J. G. Zilly, R. K. Srivastava, J. Koutník, and J. Schmidhuber. 
Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.