{"title": "Designing by Training: Acceleration Neural Network for Fast High-Dimensional Convolution", "book": "Advances in Neural Information Processing Systems", "page_first": 1466, "page_last": 1475, "abstract": "The high-dimensional convolution is widely used in various disciplines but has a serious performance problem due to its high computational complexity. Over the decades, people took a handmade approach to design fast algorithms for the Gaussian convolution. Recently, requirements for various non-Gaussian convolutions have emerged and are continuously getting higher. However, the handmade acceleration approach is no longer feasible for so many different convolutions since it is a time-consuming and painstaking job. Instead, we propose an Acceleration Network (AccNet) which turns the work of designing new fast algorithms to training the AccNet. This is done by: 1, interpreting splatting, blurring, slicing operations as convolutions; 2, turning these convolutions to $g$CP layers to build AccNet. After training, the activation function $g$ together with AccNet weights automatically define the new splatting, blurring and slicing operations. 
Experiments demonstrate AccNet is able to design acceleration algorithms for a wide range of convolutions, including Gaussian/non-Gaussian convolutions, and produce state-of-the-art results.", "full_text": "Designing by Training: Acceleration Neural Network\n\nfor Fast High-Dimensional Convolution\n\nLongquan Dai\n\nLiang Tang\n\nSchool of Computer Science and Engineering\nNanjing University of Science and Technology\n\nCASA Environmental Technology Co., Ltd\n\nCASA EM&EW IOT Research Center\n\ndailongquan@njust.edu.cn\n\ntangl@casaet.com\n\nYuan Xie\n\nInstitute of Automation\n\nChinese Academy of Sciences\n\nyuan.xie@ia.ac.cn\n\nJinhui Tang\u2217\n\nSchool of Computer Science and Engineering\nNanjing University of Science and Technology\n\njinhuitang@njust.edu.cn\n\nAbstract\n\nThe high-dimensional convolution is widely used in various disciplines but has a serious performance problem due to its high computational complexity. Over the decades, people took a handmade approach to design fast algorithms for the Gaussian convolution. Recently, requirements for various non-Gaussian convolutions have emerged and are continuously getting higher. However, the handmade acceleration approach is no longer feasible for so many different convolutions since it is a time-consuming and painstaking job. Instead, we propose an Acceleration Network (AccNet) which turns the work of designing new fast algorithms to training the AccNet. This is done by: 1, interpreting splatting, blurring, slicing operations as convolutions; 2, turning these convolutions to $g$CP layers to build AccNet. After training, the activation function $g$ together with AccNet weights automatically define the new splatting, blurring and slicing operations. 
Experiments demonstrate AccNet is able to design acceleration algorithms for a wide range of convolutions, including Gaussian/non-Gaussian convolutions, and produce state-of-the-art results.\n\n1 Introduction\n\nThe high-dimensional convolution is undoubtedly a common and elementary computation unit in machine learning, computer vision and computer graphics. Krähenbühl and Koltun [2011] conducted efficient message passing in fully connected CRF inference by the high-dimensional Gaussian convolution. Elboer et al. [2013] expressed the generalized Laplacian distance for visual matching as cascaded convolutions. Paris and Durand [2009] converted the bilateral filter [Tomasi and Manduchi, 1998] into a convolution in an elevated high-dimensional space. However, the computational complexity of a $d$-D convolution (1) is proportional to $r^d$, where $r$ denotes the radius of the box filtering window $\Omega$, $K_{pq}$ represents the weight between $p$ and $q$, and $I_p$ and $I'_p$ are the values of the input $I$ and the output $I'$ at $p$. Therefore the running cost becomes unacceptable for large $r$ or $d$.\n\n$I'_p = (K \ast I)_p = \sum_{q \in \Omega_p} K_{pq} I_q \quad (1)$\n\nA lot of work has been devoted to this computational shortcoming, but most of it focuses on the Gaussian filtering. 
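To make the cost in (1) concrete, here is a minimal brute-force sketch (our illustration, not the paper's code; the function name is ours): every output value sums over the full $(2r+1)^d$ window, which is exactly the per-pixel cost the paper sets out to remove.

```python
import numpy as np

def brute_force_conv(I, K):
    """Direct d-D convolution as in Eq. (1): for every point p, accumulate
    K * I over the full window centered at p (zero padding at the borders).
    Cost per output value is prod(K.shape) = (2r+1)^d multiply-adds."""
    radii = [s // 2 for s in K.shape]
    pad = np.pad(I.astype(float), [(r, r) for r in radii])
    out = np.empty(I.shape, dtype=float)
    for p in np.ndindex(*I.shape):
        window = pad[tuple(slice(pi, pi + s) for pi, s in zip(p, K.shape))]
        out[p] = np.sum(K * window)
    return out
```

Already for $d = 5$ and $r = 5$ this is $11^5 \approx 1.6 \times 10^5$ multiply-adds per point, which is the performance problem addressed below.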
\u2217Corresponding Author.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nThis is because not only does the Gaussian convolution itself serve as a building block for many algorithms [Baek and Jacobs, 2010, Yang et al., 2015], but its acceleration approaches also play important roles in defocus [Barron et al., 2015], segmentation [Gadde et al., 2016], edge-aware smoothing [Barron and Poole, 2016] and video propagation [Jampani et al., 2017].\n\nIn the literature, the most popular Gaussian blur acceleration algorithm is the Splatting-Blurring-Slicing pipeline (SBS), which was first proposed by Paris and Durand [2006]; Adams et al. [2010] coined its current name. We attribute its success to data reduction and separable blurring. In SBS, pixels are \u201csplatted\u201d (downsampled) onto a grid to reduce the data size, then the grid vertices are blurred, and finally the filtered value of each pixel is produced via \u201cslicing\u201d (upsampling). Because the blurring kernel is separable, the $d$-D Gaussian blurring performed on the vertices can be treated as a sum of separable 1-D filters [Szeliski, 2011], and therefore the computational complexity per pixel is reduced from $O(r^d)$ to $O(dr)$. As the filtering window becomes small after splatting, the computational complexity can be roughly viewed as $O(d)$, which is independent of the radius $r$.\n\nAccording to our investigation, SBS has two problems. 1, how to approximate non-Gaussian blur? SBS is designed for the Gaussian convolution, but requirements for non-Gaussian blurs have recently emerged from local Laplacian filtering [Aubry et al., 2014] and mean-field inference [Vineet et al., 2014]. 2, how to improve the approximation error? 
Previous SBS-based methods merely claim that their results are good approximations of the Gaussian filtering and support this with experiments. Since the current SBS has these drawbacks, how can we generalize SBS to get a better result?\n\nIn this paper we recast SBS as a neural network (AccNet) to address the above two problems. The benefits are threefold: 1, our AccNet offers a unified perspective on SBS-based acceleration methods; 2, the layer weights together with the activation function $g$ define the splatting, blurring and slicing convolutions, so we can easily derive new splatting, blurring and slicing operations from the trained network for arbitrary convolutions, which makes our network end-to-end; 3, the optimal approximation error is guaranteed by training AccNet.\n\n2 Related Work\n\nFew papers have discussed acceleration algorithms for the general high-dimensional convolution. Szeliski [2011] recorded a separable filtering method based on SVD. Extending SVD to higher-dimensional cases generalizes separable filtering to the high-dimensional convolution. In the bilateral filtering literature [Chaudhury and Dabhade, 2016, Dai et al., 2016], shiftable functions are exploited to approximate 1-D range kernels. This technique can also be extended to high-dimensional cases via the outer product. However, its number of approximation terms is the same as for the separable filtering method and grows exponentially with the dimension.\n\nCurrent interest in fast high-dimensional convolution algorithms is limited to the Gaussian blur. Greengard and Strain [1991] provided the first fast Gaussian blur algorithm. Since the inception of the bilateral filter (BF) [Tomasi and Manduchi, 1998], the study of fast Gaussian convolution has spread in computer vision and computer graphics. Durand and Dorsey [2002] computed intermediate filtered images and synthesized final results by interpolation. 
The same approach was adopted by Porikli [2008] and Yang et al. [2009]. Paris and Durand [2006] implemented the first SBS, which hints at more general approaches (bilateral grid and permutohedral lattice).\n\nThe bilateral grid [Chen et al., 2007] is a dense data structure that voxelizes the input space into a regular hypercubic lattice. By embedding inputs within the discretized space (splatting), they mix the values with a conventional Gaussian blur (blurring). The output image is extracted by resampling back into image space (slicing). The permutohedral lattice [Adams et al., 2010] is a sparse lattice that tessellates the space with simplices. By exploiting the fact that the number of vertices in a simplex grows slowly, it avoids the exponential growth of runtime that the bilateral grid suffers.\n\n3 Design by Training for Fast High-Dimensional Convolution\n\nDifferent from traditional output-focused neural networks, our AccNet implements the design-by-training strategy to automatically produce a fast convolution pipeline, and is thus only interested in the activation function and the weights, as they define the new splatting, blurring and slicing operations. In the following sections, we discuss how to transform the SBS into an AccNet as well as extensions for AccNets.\n\n(a) Splatting\n\n(b) Blurring\n\n(c) Slicing\n\nFigure 1: The splatting-blurring-slicing pipeline demonstration for the bilateral grid and the permutohedral lattice. The bilateral grid accumulates input values on a grid and factors the Gaussian-weighted gather into a separable Gaussian blur followed by multilinear sampling. The permutohedral lattice method operates similarly but uses the permutohedral lattice instead. Barycentric weights within each simplex are used to resample into and out of the lattice. 
The separable blur is conducted along each axis.\n\n3.1 Splatting, Blurring and Slicing Operations as Convolutions\n\nSplatting voxelizes the space into a regular lattice and embeds inputs within the discretized vertices of the lattice to reduce the data size. Figure 1a illustrates the splatting operation of both the bilateral grid and the permutohedral lattice. The bilateral grid acceleration method trades accuracy for speed by accumulating constant values. The permutohedral lattice acceleration algorithm uses barycentric weights within each simplex to resample into the lattice. So the value of each vertex is the weighted sum of its nearby inputs. That is to say, the splatting operation conducts convolutions with a stride of $s$, where $s$ denotes the interval of lattice vertices.\n\nSlicing, as illustrated in Figure 1c, is the inverse operation of splatting. SBS employs it to synthesize filtering results from the smoothed lattice values. The bilateral grid method does this by trilinear interpolation, and the permutohedral lattice algorithm takes barycentric weights to resample out of the lattice. Since the sliced values are weighted sums of neighboring vertices, the slicing operation is also equivalent to a convolution. Intuitively, slicing behaves like the deconvolution layer of the fully convolutional network [Shelhamer et al., 2017], which performs upsampling by convolution.\n\nBlurring is an alias of convolution. In the $d$-D case, the full-kernel implementation of a convolution requires $r^d$ (multiply-add) operations per pixel, where $r$ is the radius of the convolution kernel. This operation can be sped up by sequentially performing 1-D convolutions along each axis (which requires a total of $dr$ operations per pixel) if the kernel is separable. 
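The separability speed-up just described is easy to verify numerically. The sketch below (our illustration; the function names are ours) convolves once with a full 2-D kernel and once with two 1-D passes; the two agree whenever the kernel is an outer product of 1-D filters.

```python
import numpy as np
from itertools import product

def conv_full(I, K):
    """Full-kernel 2-D convolution with zero padding: O(r^2) multiply-adds per pixel."""
    r0, r1 = K.shape[0] // 2, K.shape[1] // 2
    pad = np.pad(I, ((r0, r0), (r1, r1)))
    out = np.empty_like(I)
    for y, x in product(range(I.shape[0]), range(I.shape[1])):
        out[y, x] = np.sum(K * pad[y:y + K.shape[0], x:x + K.shape[1]])
    return out

def conv_separable(I, k0, k1):
    """Two 1-D passes (axis 0, then axis 1): O(dr) multiply-adds per pixel."""
    r0, r1 = len(k0) // 2, len(k1) // 2
    pad = np.pad(I, ((r0, r0), (0, 0)))    # pass 1: filter along axis 0
    tmp = sum(k0[i] * pad[i:i + I.shape[0], :] for i in range(len(k0)))
    pad = np.pad(tmp, ((0, 0), (r1, r1)))  # pass 2: filter along axis 1
    return sum(k1[j] * pad[:, j:j + I.shape[1]] for j in range(len(k1)))
```

For a rank-one kernel, e.g. `K = np.outer(k0, k1)`, `conv_full(I, K)` and `conv_separable(I, k0, k1)` coincide up to floating-point error, at $O(dr)$ rather than $O(r^d)$ cost.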
Mathematically, a separable kernel $K$ is a rank-one tensor, namely the outer product of $d$ vectors $\{k_n, n = 1, \ldots, d\}$ (2), and the convolution with $K$ becomes (3).\n\n$K = k_1 \circ k_2 \circ \cdots \circ k_d \quad (2)$\n\n$I' = K \ast I = k_1 \ast k_2 \cdots \ast k_d \ast I \quad (3)$\n\nFor arbitrary kernels, we can reformulate $K$ as a sum of rank-one tensors by the Canonical Polyadic (CP) decomposition [Sidiropoulos et al., 2017]. In this way, we have (4) and (5), and the computational complexity per pixel for the $d$-D case becomes $O(Ndr)$. Note that the smoothing window is usually small after splatting, so the computational complexity can be viewed as $O(Nd)$, which is independent of $r$.\n\n$K = \sum_{i=1}^{N} w_i \, k^i_1 \circ k^i_2 \cdots \circ k^i_d \quad (4)$\n\n$I' = K \ast I = \sum_{i=1}^{N} w_i \, k^i_1 \ast k^i_2 \cdots \ast k^i_d \ast I \quad (5)$\n\n(a) gCP layer\n\n(b) Cascaded gCP layers\n\n(c) AccNet\n\nFigure 2: Demonstration of the gCP layer, cascaded gCP layers and AccNet. The inputs of (a) (b) are matrices formed by $[l^{p_i}_1, \ldots, l^{p_i}_d]$ (refer to Section 3.2.2) and their outputs are scalars. The color cube in (c) stands for $L_j$ (refer to Section 3.2.4) and the color slice in the cube represents $L^j_{p_i}$; the outputs of (a)(b)(c) are scalars, and the stripes in (a)(b) and the slices in (c) with the same color present the input-output relationship.\n\n3.2 Design by Training Acceleration Network (AccNet)\n\nEssentially, our design-by-training approach decomposes the filtering kernel (a tensor) by neural networks, because a convolution can be computed fast according to (5) once (4) is obtained. In both equations, the basic building blocks are multiplication and addition. If one of them is substituted by another operation, we obtain a new CP decomposition (4) and a new separable convolution (5). 
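For $d = 2$ the CP decomposition in (4) reduces to the SVD, so the sum-of-separable-filters idea can be sketched in a few lines (our illustration under that assumption; for $d > 2$ a tensor CP solver would replace the SVD):

```python
import numpy as np

# A non-separable 7x7 kernel: Gaussian envelope times a cosine modulation.
# cos(0.5*(x - y)) expands into two separable terms, so K has rank exactly 2.
y, x = np.mgrid[-3:4, -3:4]
K = np.exp(-(x**2 + y**2) / 4.0) * np.cos(0.5 * (x - y))

# Eq. (4) for d = 2: the CP decomposition of K is its SVD.
U, w, Vt = np.linalg.svd(K)
N = 2  # number of rank-one terms kept
K_approx = sum(w[i] * np.outer(U[:, i], Vt[i]) for i in range(N))

# Each rank-one term is then applied as two 1-D passes (Eq. (5)), so the
# per-pixel cost drops from (2r+1)^2 to N * 2 * (2r+1) multiply-adds.
```

Because this particular kernel has rank 2, keeping `N = 2` terms reproduces it up to floating-point error; for a general kernel, `N` trades accuracy against speed exactly as in (4).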
That is to say, we get new kinds of splatting, blurring and slicing operations. In this section we follow Cohen and Shashua [2016] to generalize (4) to the gCP decomposition and provide the corresponding g-convolution. The gCP layer and cascaded gCP layers are proposed for the gCP and gHT decompositions.\n\n3.2.1 gCP ($g$ Canonical Polyadic) Decomposition & g-Convolution\n\nIn (4) each element $K_{j_1,j_2,\ldots,j_d}$ is formulated as $\sum_{i=1}^{N} w_i k^i_{1,j_1} k^i_{2,j_2} \cdots k^i_{d,j_d}$. Assuming the activation function $g : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ denotes multiplication, we have $k^i_{1,j_1} k^i_{2,j_2} \cdots k^i_{d,j_d} = k^i_{1,j_1} \times_g k^i_{2,j_2} \times_g \cdots \times_g k^i_{d,j_d}$, where $a \times_g b = g(a, b) = ab$. Let $g$ be an activation function such that $\forall a, b, c \in \mathbb{R}: g(g(a, b), c) = g(a, g(b, c))$ and $g(a, b) = g(b, a)$; then the tensor decomposition (4) can be generalized by defining $K_{j_1,j_2,\ldots,j_d} = \sum_{i=1}^{N} w_i \times_g k^i_{1,j_1} \times_g \cdots \times_g k^i_{d,j_d}$. So we have the gCP decomposition (6), where $\circ_g$ denotes the generalized outer product obtained by replacing multiplication with the activation function $g$.\n\n$K = \sum_{i=1}^{N} w_i \times_g k^i_1 \circ_g k^i_2 \cdots \circ_g k^i_d \quad (6)$\n\nFurther, we substitute $g$ for multiplication in (1) and obtain the g-convolution (7).\n\n$I'_p = (K \ast_g I)_p = \sum_{q \in \Omega_p} K_{pq} \times_g I_q \quad (7)$\n\nApplying (6) to (7), we get (8), which sequentially performs $N$ groups of 1-D g-convolutions.\n\n$I' = K \ast_g I = \sum_{i=1}^{N} w_i \times_g k^i_1 \ast_g k^i_2 \cdots \ast_g k^i_d \ast_g I \quad (8)$\n\n3.2.2 gCP Layer as gCP Decomposition\n\n$K_{pq}$ and $I_q$ in (7) form two $d$-order tensors. Taking $\hat K$ and $\hat I$ to denote them and putting (6) into (8), we have (9). Letting $l_j$ be a vector and putting $\hat I_{v_1,\ldots,v_d} = \prod_{j=1}^{d} l_{j,v_j} = l_{1,v_1} \times_g l_{2,v_2} \times_g \cdots \times_g l_{d,v_d}$ into (9), we obtain (10), which consists of three operations: 1, the g-affine mapping (g-aff mapping) defined by $\sum_v k^i_{j,v} \times_g l_{j,v}$; 2, the g-multiplication pooling (g-mul pooling) described by $\prod_{j=1}^{d}$; 3, the weighted average pooling (g-avg pooling) given by $\sum_{i=1}^{N} w_i$.\n\n$I'_p = (K \ast_g I)_p = \sum_{j_1,\ldots,j_d} \hat K_{j_1,\ldots,j_d} \times_g \hat I_{j_1,\ldots,j_d} \quad (9)$\n\n$\hat I'_p = \sum_{i=1}^{N} w_i \times_g \sum_{v_1,\ldots,v_d} \prod_{j=1}^{d} k^i_{j,v_j} \times_g \hat I_{v_1,\ldots,v_d} = \sum_{i=1}^{N} w_i \times_g \prod_{j=1}^{d} \sum_{v} k^i_{j,v} \times_g l_{j,v} \quad (10)$\n\nThe activation function $g$ introduces nonlinearity to the three operations. Figure 2a plots the architecture, where the input is a matrix, the g-aff mapping transforms $l_{j,v}$, denoted by the black line in the input, to a new black vector in $m_1$, the g-mul pooling maps each red vector in $m_1$ to a scalar in the vector $v_1$, and the g-avg pooling reduces the element number of $v_1$ to 1. In fact, the three operations belong to two categories: the g-avg pooling is just a special case of the g-aff mapping. Finally, we name this layer the gCP layer, as it implements the gCP decomposition of $K$.\n\n3.2.3 Cascaded gCP Layers as gHT ($g$ Hierarchical Tucker) Decomposition\n\nThe expressive power of a neural network is closely connected with its depth. In this section we cascade multiple gCP layers to extend the expressive ability. The gCP layer maps a matrix to a scalar. 
As illustrated in Figure 2a, the g-aff mapping changes the element number of each red fiber, the g-mul pooling reduces the number of channels to 1 and the g-avg pooling decreases the element number of $v_1$ to 1. If we replace the global pooling in the g-mul pooling by a local pooling, the output becomes a matrix. Similarly, if we increase the output number of the last operation (the g-avg pooling is turned into a g-aff mapping), the output becomes a vector. In this way, the gCP layer maps a matrix to another matrix and we can cascade two gCP layers together. Figure 2b provides a demo of two cascaded gCP layers, where the last g-aff mapping of the first gCP layer and the first g-aff mapping of the second gCP layer are merged into one g-aff mapping.\n\nCascaded gCP layers implement the g hierarchical Tucker decomposition [Hackbusch and Kühn, 2009], which replaces the multiplication in the hierarchical Tucker decomposition by $g$, as we do for gCP. For example, a g hierarchical Tucker decomposition for a 4-order tensor $K$ with two layers can be expressed as (11). Putting (11) into the convolution formula, we have (12).\n\n$K = \sum_{m=1}^{N_2} w_m \times_g \prod_{n=1}^{2} K^m_n \quad \text{with} \quad K^m_n = \sum_{i=1}^{N_1} w^m_{ni} \times_g \prod_{j=1}^{2} k^{mi}_{nj} \quad (11)$\n\n$I'_p = \sum_{m=1}^{N_2} w_m \times_g \prod_{n=1}^{2} \hat I^m_n \quad \text{with} \quad \hat I^m_n = \sum_{i=1}^{N_1} w^m_{ni} \times_g \prod_{j=1}^{2} \sum_{v} k^{mi}_{nj,v} \times_g l_{j,v} \quad (12)$\n\nComparing (12) to the architecture in Figure 2b, we can find that the operators $\sum_v k^{mi}_{nj,v} \times_g l_{j,v}$, $\prod_{j=1}^{2}$, $\sum_{i=1}^{N_1} w^m_{ni}$, $\prod_{n=1}^{2}$ and $\sum_{m=1}^{N_2} w_m$ correspond to the first g-aff mapping and g-mul pooling, the second g-aff mapping and g-mul pooling, and the g-avg pooling, respectively.\n\n3.2.4 Proposed AccNet\n\nInput: in Sections (3.2.2) (3.2.3) we assumed $I_{p_i} = l^{p_i}_1 \circ_g \cdots \circ_g l^{p_i}_d$ and form the matrix $[l^{p_i}_1, \cdots, l^{p_i}_d]$ as the network input for each point $p_i$. To relax this assumption, we suppose $I_{p_i} = \sum_{j=1}^{l} I^j_{p_i}$ and $I^j_{p_i} = l^{p_i}_{1j} \circ_g \cdots \circ_g l^{p_i}_{dj}$. The blurring value of each vertex $p_i$ on the bilateral grid or permutohedral lattice depends on the values of its neighborhood (an image patch $I_{p_i}$). For slicing, we need $m$ vertices $p_i$ surrounding the target point to interpolate its filtering result, so in total $m$ image patches $\{I_{p_i}, 1 \le i \le m\}$ are required to compute the results of the target points encircled by $\{p_i, 1 \le i \le m\}$. To synthesize the filtering values of the target points encircled by $\{p_i\}$, we compose $L_j$ by concatenating $\{L^j_{p_i} = [l^{p_i}_{1j}, \cdots, l^{p_i}_{dj}], 1 \le i \le m\}$ vertically. Further, $\{L_j, 1 \le j \le l\}$ are stacked together and serve as our AccNet input. Figure 2c illustrates this, where color regions denote the different parts $\{L^j_{p_i}\}$ of $L_j$ and the light cube represents the 3-order input tensor.\n\nSplatting: the splatting layer conducts a strided convolution. Theoretically, the convolution kernel $K$ is arbitrary. Here, we assume $K = k_1 \circ_g \cdots \circ_g k_d$ is a rank-one tensor in AccNet for the following reasons: 1, AccNet takes three layers to approximate the convolution result, and even though the splatting layer is simple, the approximation error can be reduced by increasing the complexity of the blurring layer; 2, each slice of the input tensor of the blurring layer must be a rank-one matrix, and the filtering result $I'^j_{p_i}$ of a rank-one kernel $K$ applied to a rank-one input tensor $I^j_{p_i}$ is also a rank-one tensor.\n\n(a) gCP convolution\n\n(b) gHT convolution\n\nFigure 3: Illustration of fast filtering approaches based on the gCP and gHT decompositions. Taking different tensor decomposition methods for the filtering kernel, we achieve different fast filtering algorithms. (a) plots the computation graph of the fast filtering scheme (7) for the gCP decomposition $K = \sum_{i=1}^{4} \prod_{j=1}^{4} k^i_j$. The path indicated by arrows presents the convolution sequence with kernels $\{k^i_j\}$. The final result is the sum of the outputs of all 4 paths. (b) shows the flow chart of the fast filtering scheme (12) for the gHT decomposition $K = \prod_{m=1}^{2} K_m$ with $K_m = \sum_{i=1}^{4} \prod_{j=1}^{2} k^m_{ij}$. Each input connects to four nodes and thus produces four outputs. The red line indicates a convolution path. The final result is the sum of the outputs of all 16 paths.\n\nBlurring: we prefer to employ cascaded gCP layers to compose the blurring layer of AccNet, as they have more expressive power than a single gCP layer. Figure 2c provides a two-gCP-layer example. For each slice $L^j_{p_i} = [l^{p_i}_{1j}, \cdots, l^{p_i}_{dj}]$, the blurring layer produces a scalar value $z^j_{p_i}$.\n\nSlicing: let $z^j = [z^j_{p_1}, \ldots, z^j_{p_m}]$; the slicing layer maps $z^j$ to a vector $t^j$, where each element of $t^j$ corresponds to the interpolated value of a pixel surrounded by $\{p_i\}$ and the value of each $p_i$ comes from $I^j_{p_i}$. Since $I_{p_i} = \sum_{j=1}^{l} I^j_{p_i}$, there are in total $l$ different $z^j$ and therefore we obtain $l$ different $t^j$. The final result is the sum of $\{t^j, 1 \le j \le l\}$.\n\ng function: the $g$ function plays an important role in our AccNet. First, it introduces nonlinearity to AccNet. This strengthens the expressive power of our AccNet. 
Second, it defines new convolutions. Employing the g-conv operation, we can easily define novel splatting, blurring and slicing operations. There are many possible $g$ functions meeting the associativity $g(g(a, b), c) = g(a, g(b, c))$ and commutativity $g(a, b) = g(b, a)$ requirements. Here we list two of them used in AccNet: 1, $g(a, b) = \max\{a, 0\} \max\{b, 0\}$; 2, $g(a, b) = \max\{ab, 0\}$.\n\nGradients: the gradients of both the sum and the $g$ function can be easily obtained. Therefore AccNet, as a composition of these two basic calculations, can be easily trained by the back-propagation algorithm.\n\n4 Approximation & Fast Filtering\n\nIn Section 3, we discussed the layers of AccNet as well as the way to transform the SBS into an AccNet. Here, we describe an approach to compose an expressively powerful AccNet and to turn it back into an SBS.\n\nExpressive powerful AccNet: the expressive power of AccNet determines the approximation error. We have two ways to increase this power. One is to introduce the nonlinear activation function to AccNet: unlike the traditional SBS taking the CP decomposition for acceleration, we implement the gCP decomposition in AccNet. The other is to make AccNet deeper; in this way, gCP becomes gHT. Finally, we note that we can choose different activation functions in different layers. This is because the splatting, blurring and slicing operations are essentially convolutions, so we can take different gCPs and gHTs to accelerate their computation.\n\nFrom AccNet to SBS: the weights as well as the activation function of the three AccNet layers define the splatting, blurring and slicing kernels. The correspondences between AccNet weights and convolution kernels are determined by (7) (8) for the gCP decomposition and (11) (12) for the gHT decomposition. For easy understanding, we visualize the computation graph of an AccNet in Figure 3. 
Table 1: Filtering accuracy comparison for the bilateral grid acceleration method (BG), the permutohedral lattice acceleration method (PL) and our AccNet, where the sampling period of splatting is 3, the radius of blurring is 1 and the radius of the original convolution is 5.\n\n          |            2D             |            3D             |            5D\n          | σ=2   σ=4   σ=8   σ=16    | σ=2   σ=4   σ=8   σ=16    | σ=2   σ=4   σ=8   σ=10\nBG        | 0.952 0.768 0.587 0.288   | 1.225 1.085 0.813 0.668   | 1.804 1.552 1.179 0.878\nAccNet    | 0.309 0.249 0.165 0.054   | 0.336 0.276 0.267 0.171   | 0.853 0.465 0.349 0.259\nPL        | 0.541 0.657 0.419 0.239   | 1.107 0.893 0.733 0.604   | 1.712 1.488 1.005 0.854\nAccNet    | 0.273 0.175 0.142 0.051   | 0.381 0.243 0.203 0.153   | 0.528 0.423 0.299 0.213\n\nFigure 3a takes the gCP decomposition $K = \sum_{i=1}^{4} \prod_{j=1}^{4} k^i_j$ to implement the fast convolution algorithm, and Figure 3b records the fast convolution for the gHT decomposition $K = \prod_{m=1}^{2} K_m$ with $K_m = \sum_{i=1}^{4} \prod_{j=1}^{2} k^m_{ij}$, where circles denote convolution operations with specific filtering kernels $k$ and arrows indicate the computation order.\n\nThe two examples in Figure 3 disclose the superiority of gHT decomposition based acceleration algorithms. In Figure 3a, each convolution kernel is only used by one computation path. In contrast, each convolution kernel in Figure 3b is used multiple times. 
The reuse advantages are twofold: 1, we can reduce the approximation error because more terms can be used to approximate the original kernels; 2, we can reduce the execution time by reusing the convolution results that share the same convolution node. For example, the filtering paths $k^1_{11} \to k^1_{21} \to k^2_{11} \to k^2_{21}$ and $k^1_{11} \to k^1_{21} \to k^2_{12} \to k^2_{22}$ share the filtering results of $k^1_{11} \to k^1_{21}$.\n\n5 Experiments\n\nAccNet is the first neural network producing fast convolution algorithms. To reveal its advantages, three experiments are conducted: 1, we compare our AccNet-designed acceleration method to the handmade bilateral grid and permutohedral lattice acceleration methods; 2, we provide a new neural network to automatically design fast algorithms and compare it to AccNet; 3, we employ AccNet to design new acceleration algorithms for non-Gaussian convolutions and demonstrate their applications. In the following experiments, the blurring layer of AccNet is composed of two cascaded gCP layers and the activation function is $g(a, b) = \max(ab, 0)$.\n\nFast Gaussian convolution comparison: both the bilateral grid acceleration method (BG) and the permutohedral lattice acceleration method (PL) are designed for the fast Gaussian convolution. The major difference between them is the underlying grid. Our AccNet can be applied to both the bilateral grid and the permutohedral lattice. To illustrate the filtering accuracy of the methods produced by AccNet, we keep their number of convolutions the same as in BG and PL and evaluate their filtering accuracy. 
Table 1 records the quantitative comparison results, where the header row denotes the dimension of the Gaussian kernel, $\sigma$ denotes the bandwidth of the kernel, accuracy is measured by MSE (mean-square error), the first two data rows record the results of BG and AccNet on the bilateral grid, and the last two rows record the results of PL and AccNet on the permutohedral lattice.\n\nAcceleration network comparison: SBS sequentially conducts three convolutions. We can turn it into a CNN with three layers and further transform each CNN layer into $d$ cascaded 1-D convolutions according to the CP decomposition (4) (5). The differences between this network and our AccNet are: 1, the depth of each layer of this CNN model is proportional to the dimension of the filtering kernel, whereas the layer depth of AccNet only depends on the desired expressive power of the layer, and the expressive power of the simplest AccNet layer equals the expressive power of a CNN layer; 2, the CNN model can hardly express the gHT decomposition (11), as its straightforward processing pipeline is similar to Figure 3a and cannot reuse intermediate results as AccNet does in Figure 3b.\n\nThe first shortcoming makes the CNN model deeper for high-dimensional convolution. We thus have to spend more time tweaking it. What's worse, the depth does not increase the expressive power of this model, because its expressive power is determined by the number $N$ of cascaded 1-D convolution pipelines.\n\nTable 2: Comparison of two acceleration neural networks (CNN and AccNet). The bandwidth of the target Gaussian kernel is 5 and the underlying lattice is the bilateral grid.\n\n        |            2D              |            3D              |            5D\n        | Filtering Error  Training Time | Filtering Error  Training Time | Filtering Error  Training Time\nCNN     | 0.245            12.5h         | 0.283            13.1h         | 0.473            14h\nAccNet  | 0.239            7.2h          | 0.271            7.3h          | 0.461            7.6h\n\n
The second weakness makes its expressive power inferior when we limit its number of convolutions to that of AccNet, which usually means larger filtering errors. To prove these points, we report Table 2, which records the training time as well as the filtering error measured by MSE, where the dimension of the filtering kernel varies from 2-D to 5-D.\n\nFast non-Gaussian filtering: non-Gaussian blurs have become popular recently. To illustrate the power of our AccNet, we demonstrate three applications of fast non-Gaussian filtering in machine learning, computer vision and computer graphics, respectively.\n\nFigure 4: Pixel-level segmentation results of two fully connected CRF implementations. (a) shows the input images. (b) shows the segmentation results of Krähenbühl. (c) shows our segmentation results.\n\nFigure 5: Detail enhancement of two local Laplacian filtering implementations. (a) shows the input images. (b) shows the filtering results of Paris. (c) shows our results.\n\nTable 3: Stereo matching quantitative comparison.\n\n                           |          All           |          NoOcc\n                           | bad 1%   MAE    RMS    | bad 1%   MAE    RMS\n[Zbontar and LeCun, 2015]  | 18.36    5.93   20.07  | 9.07     1.94   10.42\n[Barron and Poole, 2016]   |  8.44    2.81   19.49  | 5.23     1.40   11.33\nOurs                       |  7.79    2.13   19.21  | 4.96     1.34   10.41\n\nCRF inference: the pairwise edge potentials used in fully connected CRFs [Krähenbühl and Koltun, 2011] are Gaussian mixture kernels. Krähenbühl and Koltun [2011] provided a highly efficient approximate inference algorithm by showing that a mean-field update of all variables in a fully connected CRF can be performed using Gaussian filtering in the feature space. To exploit the separability of each Gaussian kernel $G_i$, Krähenbühl has to perform Gaussian filtering multiple times. 
Employing AccNet, we can accelerate the Gaussian mixture kernels directly. Compared to the original method, we save 60% of the time while producing the same segmentation results, as shown in Figure 4.
Bilateral solver: The bilateral solver [Barron and Poole, 2016] allows some optimization problems with bilateral affinity terms to be solved quickly, and also guarantees that the solutions are smooth within objects but not across edges. Although the prior used by the bilateral solver is arbitrary in theory, it could only take the Gaussian function before our work, since that was the only function representable by SBS. Here we take the smooth exponential family prior [Zhang and Allebach, 2008] to construct a non-Gaussian bilateral solver and apply it to the stereo post-processing procedure of MC-CNN [Zbontar and LeCun, 2015], following [Barron and Poole, 2016]. Table 3 records the quantitative results, where "bad 1%" denotes the percentage of pixels whose disparities are wrong by more than 1, "MAE" stands for the mean absolute error and "RMS" is the root-mean-square error.
Local Laplacian filtering: The local Laplacian filter [Paris et al., 2011] is an edge-aware operator that defines the output image Ī by constructing its Laplacian pyramid {L[Ī]} coefficient by coefficient. Aubry et al. [2014] express the Laplacian coefficient at level l and position p as the nonlinear convolution L_l[Ī](p) = Σ_{q∈Ω_p} D_l(q − p) f(I_q − g)(I_q − g), where f is a continuous function, D_l is the difference-of-Gaussians filter defining the pyramid coefficients at level l, and g is the coefficient of the Gaussian pyramid at (l, p). Obviously, this convolution can be accelerated by
Figure 5 visualizes the similar detail\nenhancement results of Paris and ours.\n\n6 Conclusion\n\nIn this paper, we propose the \ufb01rst neural network producing fast high-dimensional convolution\nalgorithms. We take AccNet to express the approximation function of SBS and generalize SBS by\nchanging the architecture of AccNet. Once training is \ufb01nished, new fast convolution algorithm can\nbe easily derived from the weights and activation functions of each layer. Experiments prove the\neffectiveness of our algorithm.\n\n7 Acknowledgment\n\nThis work was supported by the 973 Program (Project No. 2014CB347600), the National Natural\nScience Foundation of China (Grant No. 61701235, 61732007, 61522203, 61772275, 61873293 and\n61772524), the Fundamental Research Funds for the Central Universities (Grant No. 30917011323)\nand the Beijing Municipal Natural Science Foundation (Grant No. 4182067).\n\nReferences\nAndrew Adams, Jongmin Baek, and Myers Abraham Davis. Fast high-dimensional \ufb01ltering using the\n\npermutohedral lattice. Computer Graphics Forum, 29(2):753\u2013762, may 2010. 2\n\nMathieu Aubry, Sylvain Paris, Samuel W. Hasinoff, Jan Kautz, and Fr\u00e9do Durand. Fast local laplacian\n\n\ufb01lters: Theory and applications. ACM Transactions on Graphics, 33(5):1\u201314, sep 2014. 2, 8\n\nJongmin Baek and David E. Jacobs. Accelerating spatially varying gaussian \ufb01lters. ACM Transactions\n\non Graphics, 29(6):1, dec 2010. 2\n\nJonathan T. Barron and Ben Poole. The fast bilateral solver. In European Conference on Computer\n\nVision, pages 617\u2013632. Springer International Publishing, 2016. 2, 8\n\nJonathan T. Barron, Andrew Adams, YiChang Shih, and Carlos Hernandez. Fast bilateral-space\nstereo for synthetic defocus. In IEEE Conference on Computer Vision and Pattern Recognition.\nIEEE, jun 2015. 2\n\nKunal N. Chaudhury and Swapnil D. Dabhade. Fast and provably accurate bilateral \ufb01ltering. 
IEEE Transactions on Image Processing, 25(6):2519–2528, June 2016.

Jiawen Chen, Sylvain Paris, and Frédo Durand. Real-time edge-aware image processing with the bilateral grid. ACM Transactions on Graphics, 26(3):103, 2007.

Nadav Cohen and Amnon Shashua. Convolutional rectifier networks as generalized tensor decompositions. In Maria Florina Balcan and Kilian Q. Weinberger, editors, International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 955–963, New York, New York, USA, 20–22 June 2016. PMLR.

Longquan Dai, Mengke Yuan, and Xiaopeng Zhang. Speeding up the bilateral filter: A joint acceleration way. IEEE Transactions on Image Processing, 25(6):2657–2672, June 2016.

Frédo Durand and Julie Dorsey. Fast bilateral filtering for the display of high-dynamic-range images. ACM Transactions on Graphics, 21(3):257–266, July 2002.

Elhanan Elboer, Michael Werman, and Yacov Hel-Or. The generalized Laplacian distance and its applications for visual matching. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, June 2013.

Raghudeep Gadde, Varun Jampani, Martin Kiefel, Daniel Kappler, and Peter V. Gehler. Superpixel convolutional networks using bilateral inceptions. In European Conference on Computer Vision, pages 597–613. Springer International Publishing, 2016.

Leslie Greengard and John Strain. The fast Gauss transform. SIAM Journal on Scientific and Statistical Computing, 12(1):79–94, January 1991.

W. Hackbusch and S. Kühn. A new scheme for the tensor representation. Journal of Fourier Analysis and Applications, 15(5):706–722, October 2009.

Varun Jampani, Raghudeep Gadde, and Peter V. Gehler. Video propagation networks. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, July 2017.

Philipp Krähenbühl and Vladlen Koltun.
Efficient inference in fully connected CRFs with Gaussian edge potentials. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 109–117. Curran Associates, Inc., 2011.

Sylvain Paris and Frédo Durand. A fast approximation of the bilateral filter using a signal processing approach. In European Conference on Computer Vision, pages 568–580. Springer Nature, 2006.

Sylvain Paris and Frédo Durand. A fast approximation of the bilateral filter using a signal processing approach. International Journal of Computer Vision, 81(1):24–52, December 2009.

Sylvain Paris, Samuel W. Hasinoff, and Jan Kautz. Local Laplacian filters: Edge-aware image processing with a Laplacian pyramid. ACM Transactions on Graphics, 30(4):1, July 2011.

Fatih Porikli. Constant time O(1) bilateral filtering. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, June 2008.

Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, April 2017.

Nicholas D. Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos E. Papalexakis, and Christos Faloutsos. Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing, 65(13):3551–3582, July 2017.

Richard Szeliski. Computer Vision: Algorithms and Applications. Springer-Verlag GmbH, 2011. ISBN 1848829345.

C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In IEEE International Conference on Computer Vision. Narosa Publishing House, 1998.

Vibhav Vineet, Jonathan Warrell, and Philip H. S. Torr.
Filter-based mean-field inference for random fields with higher-order terms and product label-spaces. International Journal of Computer Vision, 110(3):290–307, March 2014.

Qingxiong Yang, Kar-Han Tan, and Narendra Ahuja. Real-time O(1) bilateral filtering. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, June 2009.

Qingxiong Yang, Narendra Ahuja, and Kar-Han Tan. Constant time median and bilateral filtering. International Journal of Computer Vision, 112(3):307–318, September 2015.

Jure Zbontar and Yann LeCun. Computing the stereo matching cost with a convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, June 2015.

Buyue Zhang and J.P. Allebach. Adaptive bilateral filter for sharpness enhancement and noise removal. IEEE Transactions on Image Processing, 17(5):664–678, May 2008.