{"title": "Compact Generalized Non-local Network", "book": "Advances in Neural Information Processing Systems", "page_first": 6510, "page_last": 6519, "abstract": "The non-local module is designed for capturing long-range spatio-temporal dependencies in images and videos. Although it has shown excellent performance, it lacks the mechanism to model the interactions between positions across channels, which are of vital importance in recognizing fine-grained objects and actions. To address this limitation, we generalize the non-local module and take the correlations between the positions of any two channels into account. This extension utilizes a compact representation for multiple kernel functions via Taylor expansion, which gives the generalized non-local module a fast and low-complexity computation flow. Moreover, we implement our generalized non-local method within channel groups to ease the optimization. Experimental results illustrate the clear-cut improvements and practical applicability of the generalized non-local module on both fine-grained object recognition and video classification. Code is available at: https://github.com/KaiyuYue/cgnl-network.pytorch.", "full_text": "Compact Generalized Non-local Network\n\nKaiyu Yue†,§ Ming Sun† Yuchen Yuan† Feng Zhou‡ Errui Ding† Fuxin Xu§\n\n†Baidu VIS ‡Baidu Research §Central South University\n\n{yuekaiyu, sunming05, yuanyuchen02, zhoufeng09, dingerrui}@baidu.com\n\nfxxu@csu.edu.cn\n\nAbstract\n\nThe non-local module [27] is designed for capturing long-range spatio-temporal dependencies in images and videos. Although it has shown excellent performance, it lacks the mechanism to model the interactions between positions across channels, which are of vital importance in recognizing fine-grained objects and actions. To address this limitation, we generalize the non-local module and take the correlations between the positions of any two channels into account. 
This extension utilizes a compact representation for multiple kernel functions via Taylor expansion, which gives the generalized non-local module a fast and low-complexity computation flow. Moreover, we implement our generalized non-local method within channel groups to ease the optimization. Experimental results illustrate the clear-cut improvements and practical applicability of the generalized non-local module on both fine-grained object recognition and video classification. Code is available at: https://github.com/KaiyuYue/cgnl-network.pytorch.\n\n1 Introduction\n\nFigure 1: Comparison between non-local (NL) and compact generalized non-local (CGNL) networks on recognizing an action video of kicking the ball. Given the reference patch (green rectangle) in the first frame, we visualize for each method the highly related responses in the other frames by thresholding the feature space. The CGNL network outperforms the original NL network in capturing the ball, which is not only at a long-range distance from the reference patch but also corresponds to different channels in the feature map.\n\nCapturing spatio-temporal dependencies between spatial pixels or temporal frames plays a key role in the tasks of fine-grained object and action classification. Modeling such interactions among images and videos is the major topic of various feature extraction techniques, including SIFT, LBP, Dense Trajectory [26], etc. In the past few years, deep neural networks have automated the feature design pipeline by stacking multiple end-to-end convolutional or recurrent modules, each of which processes correlations within local spatial or temporal regions. In general, capturing the long-range dependencies among images or videos still requires stacking many of these modules, which greatly hinders learning and inference efficiency. 
A recent work [16] also suggests that stacking more layers cannot always increase the effective receptive fields enough to capture sufficient local relations.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nInspired by the classical non-local means for image filtering, the recently proposed non-local neural network [27] addresses this challenge by directly modeling the correlation between any two positions in the feature maps in a single module. Without bells and whistles, the non-local method can greatly improve the performances of existing networks on many video classification benchmarks. Despite its great performance, the original non-local network only considers the global spatio-temporal correlation by merging channels, and it might miss the subtle but important cross-channel clues for discriminating fine-grained objects or actions. For instance, the body, the ball and their interaction are all necessary for describing the action of kicking the ball in Fig. 1, while the original non-local operation learns to focus on the body-part relations but neglects the body-ball interactions that usually correspond to different channels of the input features.\n\nTo improve the effectiveness in fine-grained object and action recognition tasks, this work extends the non-local module by learning explicit correlations among all of the elements across the channels. First, this extension scales up the representation power of the non-local operation to attend to the interaction between subtle object parts (e.g., the body and ball in Fig. 1). Second, we propose its compact representation for various kernel functions to address the issue of high computational burden. We show that, as a self-contained module, the compact generalized non-local (CGNL) module provides steady improvements in classification tasks. 
Third, we also investigate the grouped CGNL blocks, which model the correlations across channels within each group.\n\nWe evaluate the proposed CGNL method on the tasks of fine-grained classification and action recognition. Extensive experimental results show that: 1) the CGNL network is as easy to optimize as the original non-local network; 2) compared with the non-local module, the CGNL module captures richer features and denser clues for prediction, as shown in Figure 1, which leads to results substantially better than those of the original non-local module. Moreover, in the appendix of extended experiments, the CGNL network also achieves higher accuracy than the baseline on the large-scale ImageNet dataset [20].\n\n2 Related Works\n\nChannel Correlations: The mechanism of sharing the same conv kernel among channels of a layer in a ConvNet [12] can be seen as a basic way to capture correlations among channels, which aggregates the channels of feature maps by the operation of sum pooling. The SENet [10] may be the first work that explicitly models the interdependencies between the channels of its spatial features. It aims to select the useful feature maps and suppress the others, and only considers the global information of each channel. Inspired by [27], we present the generalized non-local (GNL) module, which generalizes the non-local (NL) module to learn the correlations between any two positions across the channels. Compared to the SENet, we model the interdependencies among channels in an explicit and dense manner.\n\nCompact Representation: After further investigation, we find that the non-local module contains a second-order feature space (Sect. 3.1), which has been used widely in previous computer vision tasks, e.g., SIFT [15], Fisher encoding [17], Bilinear models [14, 5] and the segmentation task [2]. However, such a second-order feature space involves high dimensionality and a heavy computational burden. 
In the area of kernel learning [21], there are prior works, such as compact bilinear pooling (CBP) [5], that use Tensor Sketching [18] to address this problem. However, this type of method is still imperfect, because its computation does not stay light as the size of the sketching vectors varies. Fortunately, in mathematics, the whole non-local operation can be viewed as a trilinear form, which can be computed quickly using the associative law of matrix multiplication. For the other types of pairwise functions, such as Embedded Gaussian or RBF [19], we propose a tight approximation using the Taylor expansion.\n\n3 Approach\n\nIn this section, we introduce a general formulation of the proposed generalized non-local operation. We then show that the original non-local operation and bilinear pooling are special cases of this formulation. After that, we illustrate that the generalized non-local operation can be seen as a modality of trilinear matrix multiplication and show how to implement our generalized non-local (GNL) module in a compact representation.\n\n3.1 Review of Non-local Operation\n\nWe begin by briefly reviewing the original non-local operation [27] in matrix form. Suppose that an image or video is given to the network and let X ∈ R^{N×C} denote (see notation footnote 1) the input feature map of the non-local module, where C is the number of channels. For the sake of notation clarity, we collapse all the spatial (width W and height H) and temporal (video length T) positions in one dimension, i.e., N = HW or N = HWT. To capture long-range dependencies across the whole feature map, the original non-local operation computes the response Y ∈ R^{N×C} as the weighted sum of the features at all positions,\n\nY = f(θ(X), φ(X)) g(X),  (1)\n\nwhere θ(·), φ(·), g(·) are learnable transformations on the input. In [27], the authors suggest using 1 × 1 or 1 × 1 × 1 convolution for simplicity, i.e., the transformations can be written as\n\nθ(X) = XW_θ ∈ R^{N×C}, φ(X) = XW_φ ∈ R^{N×C}, g(X) = XW_g ∈ R^{N×C},  (2)\n\nparameterized by the weight matrices W_θ, W_φ, W_g ∈ R^{C×C} respectively. The pairwise function f(·,·) : R^{N×C} × R^{N×C} → R^{N×N} computes the affinity between all positions (space or space-time). There are multiple choices for f, among which the dot product is perhaps the simplest one, i.e.,\n\nf(θ(X), φ(X)) = θ(X)φ(X)^⊤.  (3)\n\nPlugging Eq. 2 and Eq. 3 into Eq. 1 yields a trilinear interpretation of the non-local operation,\n\nY = XW_θ W_φ^⊤ X^⊤ XW_g,  (4)\n\nwhere the pairwise matrix XW_θ W_φ^⊤ X^⊤ ∈ R^{N×N} encodes the similarity between any locations of the input feature. The effect of the non-local operation can be related to the self-attention module [1], based on the fact that each position (row) in the result Y is a linear combination of all the positions (rows) of XW_g weighted by the corresponding row of the pairwise matrix.\n\n3.2 Review of Bilinear Pooling\n\nAnalogous to the conventional kernel trick [21], the idea of bilinear pooling [14] has recently been adopted in ConvNets for enhancing the feature representation in various tasks, such as fine-grained classification, person re-id, and action recognition. At a glance, bilinear pooling models pairwise feature interactions using an explicit outer product at the final classification layer:\n\nZ = X^⊤X ∈ R^{C×C},  (5)\n\nwhere X ∈ R^{N×C} is the input feature map generated by the last convolutional layer. 
Each element of the final descriptor z_{c1c2} = Σ_n x_{nc1} x_{nc2} sum-pools, over all locations n = 1,···,N, the bilinear product x_{nc1} x_{nc2} of the corresponding channel pair c1, c2 = 1,···,C.\n\nDespite the distinct design motivation, it is interesting to see that bilinear pooling (Eq. 5) can be viewed as a special case of the second-order term (Eq. 3) in the non-local operation if we consider\n\nθ(X) = X^⊤ ∈ R^{C×N}, φ(X) = X^⊤ ∈ R^{C×N}.  (6)\n\n3.3 Generalized Non-local Operation\n\nThe original non-local operation aims to directly capture long-range dependencies between any two positions in one layer. However, such dependencies are encoded in a joint location-wise matrix f(θ(X), φ(X)) by aggregating all channel information together. On the other hand, channel-wise correlation has recently been explored in both discriminative [14] and generative [24] models through the covariance analysis across channels. Inspired by these works, we generalize the original non-local operation to model long-range dependencies between any positions of any channels.\n\n1 Bold capital letters denote a matrix X, bold lower-case letters a column vector x. x_i represents the ith column of the matrix X. x_ij denotes the scalar in the ith row and jth column of the matrix X. All non-bold letters represent scalars. 1_m ∈ R^m is a vector of ones. I_n ∈ R^{n×n} is an identity matrix. vec(X) denotes the vectorization of matrix X. X ◦ Y and X ⊗ Y are the Hadamard and Kronecker products of matrices.\n\nWe first reshape the output of the transformations (Eq. 2) on X by merging channel into position:\n\nθ(X) = vec(XW_θ) ∈ R^{NC}, φ(X) = vec(XW_φ) ∈ R^{NC}, g(X) = vec(XW_g) ∈ R^{NC}.  (7)\n\nBy lifting the row space of the underlying transformations, our generalized non-local (GNL) operation pursues the same goal as Eq. 1 and computes the response Y ∈ R^{N×C} as:\n\nvec(Y) = f(vec(XW_θ), vec(XW_φ)) vec(XW_g).  (8)\n\nCompared to the original non-local operation (Eq. 4), GNL utilizes a more general pairwise function f(·,·) : R^{NC} × R^{NC} → R^{NC×NC} that can differentiate between pairs at the same location but in different channels. This richer similarity greatly augments the non-local operation in discriminating fine-grained object parts or action snippets that usually correspond to channels of the input feature. Compared to bilinear pooling (Eq. 5), which can only be used after the last convolutional layer, GNL maintains the input size and can thus be flexibly plugged between any network blocks. In addition, bilinear pooling neglects the spatial correlation, which is preserved in GNL.\n\nRecently, the idea of dividing channels into groups has been established as a very effective technique for increasing the capacity of ConvNets. Well-known examples include Xception [3], MobileNet [9], ShuffleNet [31], ResNeXt [29] and Group Normalization [28]. Given its simplicity and independence, we also realize the channel grouping idea in GNL by grouping all C channels into G groups, each of which contains C′ = C/G channels of the input feature. We then perform the GNL operation independently for each group to compute Y′ and concatenate the results along the channel dimension to restore the full response Y.\n\n3.4 Compact Representation\n\nA straightforward implementation of GNL (Eq. 8) is prohibitive due to the quadratic increase with respect to the channel number C caused by the NC × NC pairwise matrix. Although the channel grouping technique can reduce the channel number from C to C/G, the overall computational complexity is still much higher than that of the original non-local operation. 
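To make the cost concrete, here is a small NumPy sketch (our illustration, not the paper's released code) of the dot-product case: the original NL operation (Eq. 4) works with an N × N pairwise matrix, while a naive GNL (Eq. 8) materializes an NC × NC one; with a rank-1 dot-product kernel, the associative law collapses the latter back to O(NC). All shapes and values are toy choices.

```python
import numpy as np

# Toy NumPy sketch (not the paper's code): original NL vs. a naive dot-product
# GNL. Shapes N, C are illustrative; theta/phi/g stand for the transformed
# features of Eqs. 2 and 7.
rng = np.random.default_rng(0)
N, C = 16, 4
theta, phi, g = (rng.standard_normal((N, C)) for _ in range(3))

# Original NL (Eq. 4): pairwise affinities over the N positions only.
Y_nl = (theta @ phi.T) @ g                    # (N, N) @ (N, C) -> (N, C)

# Naive GNL (Eq. 8): flatten positions *and* channels before pairing.
t, p, gv = theta.reshape(-1), phi.reshape(-1), g.reshape(-1)   # each (N*C,)
pairwise = np.outer(t, p)                     # NC x NC affinity matrix
y_naive = (pairwise @ gv).reshape(N, C)

# With the dot-product kernel, the associative law avoids the NC x NC matrix:
# t * (p . gv) costs O(NC) instead of O((NC)^2).
y_fast = (t * (p @ gv)).reshape(N, C)
assert np.allclose(y_naive, y_fast)
```

The rank-1 shortcut above only works for the dot-product kernel; the compact representation in the next section handles more general kernels.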
To mitigate this problem, this section proposes a compact representation that leads to an affordable approximation for GNL. Let us denote θ = vec(XW_θ), φ = vec(XW_φ) and g = vec(XW_g), each of which is an NC-dimensional column vector. Without loss of generality, we assume f is a general kernel function (e.g., RBF, bilinear, etc.) that computes an NC × NC matrix composed of the elements\n\n[f(θ, φ)]_{ij} ≈ Σ_{p=0}^{P} α_p^2 (θ_i φ_j)^p,  (9)\n\nwhich can be approximated by a Taylor series up to a certain order P. The coefficient α_p can be computed in closed form once the kernel function is known. Taking the RBF kernel for example,\n\n[f(θ, φ)]_{ij} = exp(−γ‖θ_i − φ_j‖^2) ≈ Σ_{p=0}^{P} β (2γ)^p / p! (θ_i φ_j)^p,  (10)\n\nwhere α_p^2 = β (2γ)^p / p!, and β = exp(−γ(‖θ‖^2 + ‖φ‖^2)) is a constant; β = exp(−2γ) if the input vectors θ and φ are ℓ2-normalized. By introducing the two matrices\n\nΘ = [α_0 θ^0, ···, α_P θ^P] ∈ R^{NC×(P+1)}, Φ = [α_0 φ^0, ···, α_P φ^P] ∈ R^{NC×(P+1)},  (11)\n\nour compact generalized non-local (CGNL) operation approximates Eq. 8 via a trilinear equation,\n\nvec(Y) ≈ ΘΦ^⊤g.  (12)\n\nAt first glance, the above approximation still involves the computation of a large pairwise matrix ΘΦ^⊤ ∈ R^{NC×NC}. Fortunately, the order of the Taylor series is usually small, P ≪ NC. By the associative law, we can alternatively compute the vector z = Φ^⊤g ∈ R^{P+1} first and then calculate Θz at a much smaller complexity of O(NC(P+1)). 
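The steps above can be sketched numerically. The following NumPy snippet (our illustration, with toy sizes, a hand-picked γ, and β = exp(−2γ) as in the ℓ2-normalized case) builds Θ and Φ from the Taylor coefficients of the RBF kernel and checks that the cheap Θ(Φ^⊤g) route matches the explicit NC × NC pairwise matrix.

```python
import math
import numpy as np

# Sketch of the compact CGNL computation (Eqs. 9-12). NC, P and gamma are toy
# choices; beta = exp(-2*gamma) corresponds to l2-normalized theta and phi.
rng = np.random.default_rng(0)
NC, P, gamma = 64, 3, 0.1
theta, phi, g = (rng.standard_normal(NC) for _ in range(3))

beta = math.exp(-2 * gamma)
alpha = np.array([math.sqrt(beta * (2 * gamma) ** p / math.factorial(p))
                  for p in range(P + 1)])

# Eq. 11: Theta, Phi in R^{NC x (P+1)}, column p holding alpha_p * x**p.
Theta = np.stack([alpha[p] * theta ** p for p in range(P + 1)], axis=1)
Phi = np.stack([alpha[p] * phi ** p for p in range(P + 1)], axis=1)

# Associative law: compute the small (P+1)-vector z first, then Theta @ z,
# for an overall cost of O(NC(P+1)).
z = Phi.T @ g
y_fast = Theta @ z

# Check against the explicit pairwise matrix of Eq. 9 (O((NC)^2), test only).
K = sum(alpha[p] ** 2 * np.outer(theta ** p, phi ** p) for p in range(P + 1))
assert np.allclose(y_fast, K @ g)
```

The equality holds exactly because ΘΦ^⊤ = Σ_p α_p^2 θ^p (φ^p)^⊤ is precisely the truncated kernel of Eq. 9; the only approximation error relative to the true RBF comes from truncating the Taylor series at order P.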
In another view, the process by which the bilinear form Φ^⊤g is squeezed into scalars can be seen as a concept related to the SE module [10].\n\nComplexity analysis: Table 1 compares the computational complexity of the CGNL network with the GNL one. We cannot afford to compute the GNL operation directly because of its huge complexity of O(2(NC)^2) in both time and space. Instead, our compact method dramatically eases the heavy calculation to O(NC(P+1)).\n\nTable 1: Complexity comparison of GNL and CGNL operations, where N and C indicate the number of positions and channels respectively.\n\nStrategy   General NL Method: f(ΘΦ^⊤)g   CGNL Method: ΘΦ^⊤g\n
Time       O(2(NC)^2)                    O(NC(P+1))\n
Space      O(2(NC)^2)                    O(NC(P+1))\n\n3.5 Implementation Details\n\nFigure 2: Grouped compact generalized non-local (CGNL) module. The feature maps are shown with the shape of their tensors, e.g., [C, N], where N = THW or N = HW. The feature maps are divided along channels into multiple groups after three conv layers whose kernel size and stride both equal 1 (k = 1, s = 1). The channel dimension is divided into groups of C′ = C/G channels, where G is the group number. The compact representations for the generalized non-local module are built within each group. P indicates the order of the Taylor expansion for kernel functions.\n\nFig. 2 illustrates the workflow of how the CGNL module processes a feature map X of size N × C, where N = H × W or N = T × H × W. X is first fed into three 1 × 1 × 1 convolutional layers that are described by the weights W_θ, W_φ, W_g respectively in Eq. 7. To improve the capacity of neural networks, the channel grouping idea [29, 28] is then applied to divide the transformed feature along the channel dimension into G groups. As shown in Fig. 2, we approximate for each group the GNL operation (Eq. 
8) using the Taylor series according to Eq. 12. To achieve generality and compatibility with existing neural network blocks, the CGNL block is implemented by wrapping Eq. 8 in an identity mapping of the input as in residual learning [8]:\n\nZ = concat(BN(Y′W_z)) + X,  (13)\n\nwhere W_z ∈ R^{C×C} denotes a 1 × 1 or 1 × 1 × 1 convolution layer followed by a Batch Normalization [11] in each group.\n\n4 Experiments\n\n4.1 Datasets\n\nWe evaluate the CGNL network on multiple tasks, including fine-grained classification and action recognition. For fine-grained classification, we experiment on the Birds-200-2011 (CUB) dataset [25], which contains 11788 images of 200 bird categories. For action recognition, we experiment on two challenging datasets, Mini-Kinetics [30] and UCF101 [22]. The Mini-Kinetics dataset contains 200 action categories. Because some video links are unavailable for download, we use 78265 videos for training and 4986 videos for validation. The UCF101 dataset contains 101 actions, which are separated into 25 groups with 4-7 videos of each action in a group.\n\n4.2 Baselines\n\nGiven their steady performance and efficiency, the ResNet [8] series (ResNet-50 and ResNet-101) are adopted as our baselines. For video tasks, we keep the same architecture configuration as [27], where the temporal dimension is trivially addressed by max pooling. Following [27], the convolutional layers in the baselines are implemented as 1 × k × k kernels, and we insert our CGNL blocks into\n\nTable 2: Ablations. Top1 and top5 accuracy (%) on various datasets.\n\n(a) Results of adding 1 CGNL block on CUB. 
The dot-product kernel achieves the best result; the accuracies of the others are on par with the baseline.\n\nmodel                top1   top5\n
R-50                 84.05  96.00\n
+ Dot Product        85.14  96.88\n
+ Gaussian RBF       84.10  95.78\n
+ Embedded Gaussian  84.01  96.08\n\n(b) Results of the comparison on UCF-101. Note that the CGNL network is not grouped in channels.\n\nmodel           top1   top5\n
R-50            81.62  94.62\n
+ 1 NL block    82.88  95.74\n
+ 1 CGNL block  83.38  95.42\n\n(c) Results of channel-grouped CGNL networks on CUB. A few groups can boost the performance, but more groups tend to prevent the CGNL block from capturing the correlations between positions across channels.\n\nmodel           groups  top1   top5\n
R-101           -       85.05  96.70\n
+ 1 CGNL block  1       86.17  97.82\n
                4       86.24  97.05\n
                8       86.35  97.86\n
                16      86.13  96.75\n
                32      86.04  96.69\n\nmodel           groups  top1   top5\n
R-101           -       85.05  96.70\n
+ 5 CGNL block  1       86.01  95.97\n
                4       86.19  96.07\n
                8       86.24  97.23\n
                16      86.43  98.89\n
                32      86.10  97.13\n\n(d) Results of grouped CGNL networks on Mini-Kinetics. More groups help the CGNL networks improve top1 accuracy noticeably.\n\nmodel           groups  top1   top5\n
R-50            -       75.54  92.16\n
+ 1 CGNL block  1       77.16  93.56\n
                4       77.56  93.00\n
                8       77.76  93.18\n\nmodel           groups  top1   top5\n
R-101           -       77.44  93.18\n
+ 1 CGNL block  1       78.79  93.64\n
                4       79.06  93.54\n
                8       79.54  93.84\n\nthe network to turn them into compact generalized non-local (CGNL) networks. We investigate the configurations of adding 1 and 5 blocks. [27] suggests that adding 1 block on res4 is slightly better than the other choices, so our experiments of adding 1 block all target res4 of ResNet. 
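As a rough illustration of how such a block slots into a backbone, here is a NumPy sketch (ours, not the released PyTorch implementation) of a grouped dot-product CGNL block in the residual form of Eq. 13. The 1 × 1 convolutions become plain C × C matrices, and the zero-initialized BN scale of Sec. 4.2 is mimicked by a `bn_scale` factor, so the block starts as an identity mapping; `cgnl_block` and the `Ws` dict are hypothetical names for this sketch.

```python
import numpy as np

# Toy sketch of a grouped dot-product CGNL block (Fig. 2 / Eq. 13), NumPy only.
# W_theta/W_phi/W_g/W_z stand in for the 1x1 conv weights; bn_scale=0.0 mimics
# the zero-initialized BN scale, so the block starts as an identity mapping.
def cgnl_block(X, Ws, groups=2, bn_scale=0.0):
    """X: (N, C) features; Ws: dict of (C, C) weights. Returns (N, C)."""
    N, C = X.shape
    theta, phi, g = (X @ Ws[k] for k in ("theta", "phi", "g"))
    Cg = C // groups
    outs = []
    for c0 in range(0, C, Cg):
        # Flatten positions and channels inside the group, then apply the
        # dot-product kernel via the associative law: t * (p . gv).
        t = theta[:, c0:c0 + Cg].reshape(-1)
        p = phi[:, c0:c0 + Cg].reshape(-1)
        gv = g[:, c0:c0 + Cg].reshape(-1)
        outs.append((t * (p @ gv)).reshape(N, Cg))
    Y = np.concatenate(outs, axis=1)          # concatenate groups
    return bn_scale * (Y @ Ws["z"]) + X       # residual wrapping (Eq. 13)

rng = np.random.default_rng(0)
X = rng.standard_normal((49, 8))              # e.g. a 7x7 map with 8 channels
Ws = {k: rng.standard_normal((8, 8)) / 8 for k in ("theta", "phi", "g", "z")}
assert np.allclose(cgnl_block(X, Ws), X)      # zero-init scale => identity
```

Starting from an exact identity is what lets such a block be dropped into a pretrained ResNet stage without disturbing the initialization.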
The experiments of adding 5 blocks, on the other hand, are configured by inserting 2 blocks on res3 and 3 blocks on res4, to every other residual block in ResNet-50 and ResNet-101.\n\nTraining: We use the models pretrained on ImageNet [20] to initialize the weights. The frames of a video are extracted in a dense manner. Following [27], we generate 32-frame input clips for the models by first randomly cropping out 64 consecutive frames from the full-length video and then dropping every other frame. The way these 32-frame input clips are chosen can be viewed as a temporal augmentation. The crop size for each clip is distributed evenly between 0.08 and 1.25 of the original image, and its aspect ratio is chosen randomly between 3/4 and 4/3. Finally we resize it to 224. We use a weight decay of 0.0001 and momentum of 0.9 by default. The strategy of gradual warmup is used in the first ten epochs. Dropout [23] with ratio 0.5 is inserted between the average pooling layer and the last fully-connected layer. To stay consistent with [27], we use zero to initialize the weight and bias of the BatchNorm (BN) layer in both CGNL and NL blocks [6]. To train the networks on the CUB dataset, we follow the same training strategy above, except that the final crop size is 448.\n\nInference: The models are tested immediately after training is finished. In [27], spatially fully-convolutional inference2 is used for NL networks. For video clips, the shorter side is resized to 256 pixels, and 3 crops are used to cover the entire spatial size along the longer side. The final prediction is the averaged softmax scores of all clips. For fine-grained classification, we use single center-crop testing at size 448.\n\n4.3 Ablation Experiments\n\nKernel Functions: We use three popular kernel functions, namely the dot product, embedded Gaussian and Gaussian RBF, in our ablation studies. For the dot product, Eq. 12 is used directly for computation. 
For the embedded Gaussian, α_p^2 will be 1/p! in Eq. 9, and for the Gaussian RBF, the corresponding formula is defined in Eq. 10. We expand the Taylor series to the third order, and the hyperparameter γ for the RBF is set to 1e-4 [4]. Table 2a suggests that the dot product is the best kernel function for CGNL networks. Such experimental observations are consistent with [27]. The other kernel functions we used, embedded Gaussian and Gaussian RBF, bring only small improvements in performance. Therefore, we choose the dot product as our main experimental configuration for the other tasks.\n\n2https://github.com/facebookresearch/video-nonlocal-net\n\nGrouping: The grouping strategy is another important technique. On Mini-Kinetics, Table 2d shows that grouping can bring higher accuracy. The improvements brought by adding groups are larger than those from reducing the channel reduction ratio. The best top1 accuracy is achieved by splitting into 8 groups for CGNL networks. On the other hand, it is worthwhile to see whether more groups can always improve the results, and Table 2c gives the answer that too many groups hamper the performance improvements. This is actually expected: the affinity in the CGNL block considers points across channels. When we split the channels into a few groups, it can facilitate the restricted optimization and ease the training. However, if too many groups are adopted, it hinders the affinity from capturing the rich correlations between elements across the channels.\n\nTable 3: Results comparison of the CGNL block to the simple residual block on the CUB dataset.\n\nFigure 3: The workflow of our CGNL block. The corresponding formula is shown below in a blue tinted box.\n\nFigure 4: The workflow of the simple residual block for comparison. 
The corresponding formula is shown below in a blue tinted box.\n\nmodel               top1   top5\n
R-50                84.05  96.00\n
+ 1 Residual Block  84.11  96.23\n
+ 1 CGNL block      85.14  96.88\n\nComparison of CGNL Block to Simple Residual Block: There may be a concern about effectiveness, caused by the possibility that the scalars from Φ^⊤g in Eq. 12 could be wiped out by the BN layer. According to Algorithm 1 in [11], the output of the input Θ weighted by the scalars s = Φ^⊤g can be approximated as O = (sΘ − E(sΘ))/√Var(sΘ) ∗ γ + β = (sΘ − sE(Θ))/√(s^2 Var(Θ)) ∗ γ + β = (Θ − E(Θ))/√Var(Θ) ∗ γ + β. At first glance, the scalars s are totally erased by BN in this mathematical process. However, the de facto operation of a convolutional module follows a particular order when aggregating features. Before passing into the BN layer, the scalars s have already been absorbed into the input features Θ and then been transformed into a different feature space by a learnable parameter W_z. In other words, it is W_z that \"protects\" s from being erased by BN via the convolutional operation. To eliminate this concern, we further compare adding 1 CGNL block (with the dot-product kernel) as in Fig. 3 and adding 1 simple residual block as in Fig. 4 on the CUB dataset in Table 3. The top1 accuracy of 84.11% from adding a simple residual block is slightly better than the 84.05% of the baseline, but still worse than the 85.14% from adding a linear kernelized CGNL module. We think the marginal improvement (84.05% → 84.11%) is due to the extra parameters of the added simple residual block.\n\nFigure 5: Result analysis of the NL block and our CGNL block on CUB. 
Column 1: the input images with a small reference patch (green rectangle), which is used to find the highly related patches (white rectangles). Column 2: the highly related clues for prediction in each feature map found by the NL network. The dimension of the self-attention space in the NL block is N × N, where N = HW, so its visualization has only one column. Columns 3 to 7: the most related patches computed by our compact generalized non-local module. We first pick a reference position in the space of g, then we use the corresponding vectors in Θ and Φ to compute the attention maps with a threshold (here we use 0.7). Last column: the ground truth of body parts. The highly related areas of the CGNL network easily cover all of the standard parts that provide the prediction clues.\n\nFigure 6: Visualization with feature heatmaps. We select a reference patch (green rectangle) in one frame, then visualize the highly related areas by heatmaps. The CGNL network captures denser relationships in the feature space than NL networks.\n\nTable 4: Main results. Top1 and top5 accuracy (%) on various datasets.\n\n(a) Main validation results on Mini-Kinetics. The CGNL networks are built with 8 groups.\n\nmodel           top1   top5\n
R-50            75.54  92.16\n
+ 1 NL block    76.53  92.90\n
+ 1 CGNL block  77.76  93.18\n
+ 5 NL block    77.53  94.00\n
+ 5 CGNL block  78.79  94.37\n
R-101           77.44  93.18\n
+ 1 NL block    78.02  93.86\n
+ 1 CGNL block  79.54  93.84\n
+ 5 NL block    79.21  93.21\n
+ 5 CGNL block  79.88  93.37\n\n(b) Results on CUB. 
The CGNL networks use 8 channel groups.\n\nmodel           top1   top5\n
R-50            84.05  96.00\n
+ 1 NL block    84.79  96.76\n
+ 1 CGNL block  85.14  96.88\n
+ 5 NL block    85.10  96.18\n
+ 5 CGNL block  85.68  96.69\n
R-101           85.05  96.70\n
+ 1 NL block    85.49  97.04\n
+ 1 CGNL block  86.35  97.86\n
+ 5 NL block    86.10  96.35\n
+ 5 CGNL block  86.24  97.23\n\n(c) Results on COCO. 1 NL or 1 CGNL block is added to Mask R-CNN.\n\nmodel           APbox  APbox_50  APbox_75  APmask  APmask_50  APmask_75\n
Baseline        34.47  54.87     36.58     30.44   51.55      31.95\n
+ 1 NL block    35.02  55.79     37.54     30.23   52.40      32.77\n
+ 1 CGNL block  35.70  56.07     38.69     31.22   52.44      32.67\n\n4.4 Main Results\n\nTable 4a shows that although adding 5 NL or 5 CGNL blocks to the baseline networks can both improve the accuracy, the improvement from the CGNL network is larger. The same applies to Table 2b and Table 4b: in the experiments on the UCF101 and CUB datasets, similar results are observed, with adding 5 CGNL blocks providing the best results for both R-50 and R-101.\nTable 4a shows the main results on the Mini-Kinetics dataset. Compared to the baseline R-50, whose top1 is 75.54%, adding 1 NL block brings an improvement of about 1.0%. Similar results can be found in the experiments based on R-101, where adding 1 CGNL block provides more than 2% improvement, which is larger than that of adding 1 NL block. Table 2b shows the main results on the UCF101 dataset, where adding 1 CGNL block achieves higher accuracy than adding 1 NL block. And Table 4b shows the main results on the CUB dataset. To understand the effects brought by the CGNL network, we show the visualization analysis in Fig. 5 and Fig. 6. Additionally, to investigate the capacity and the generalization ability of our CGNL network, we test it on the tasks of object detection and instance segmentation. 
We add 1 NL or 1 CGNL block to the R-50 backbone of Mask R-CNN [7]. Table 4c shows the main results on the COCO2017 dataset [13] obtained by adopting 1 CGNL block in the backbone of Mask R-CNN [7]. The performance of adding 1 CGNL block is still better than that of adding 1 NL block.
We observe that adding CGNL blocks always obtains better results than adding the same number of NL blocks. These experiments suggest that considering the correlations between any two positions across channels significantly improves performance over the original non-local method.

5 Conclusion

We have introduced a simple approximated formulation of the compact generalized non-local operation and validated it on the tasks of fine-grained classification and action recognition from RGB images. Our formulation allows explicit modeling of rich interdependencies between any positions across channels in the feature space. To ease the heavy computation of the generalized non-local operation, we propose a compact representation with simple matrix multiplication by using Taylor expansion for multiple kernel functions. It is easy to implement and requires few additional parameters, making it an attractive alternative to the original non-local block, which only considers the correlations between two positions along a specific channel. Our model produces competitive or state-of-the-art results on various benchmark datasets.

Appendix: Experiments on ImageNet

As a general method, the CGNL block is compatible with complementary techniques developed for the image task of fine-grained classification, the temporal-feature task of action recognition, and the basic task of object detection.

Table 5: Results on ImageNet.
Best top1 and top5 accuracy (%).

model            top1   top5
R-50             76.15  92.87
+ 1 CGNL block   77.69  93.64
+ 1 CGNLx block  77.32  93.46
R-152            78.31  94.06
+ 1 CGNL block   79.53  94.59
+ 1 CGNLx block  79.37  94.47

In this appendix, we further report the results of our spatial CGNL network on the large-scale ImageNet [20] dataset, which has 1.2 million training images and 50000 validation images in 1000 object categories. The training strategy and configurations of our CGNL networks are kept the same as those in Sec. 4, except that the input crop size is 224. To better demonstrate the generality of our CGNL network, we investigate adding both 1 dot-product CGNL block and 1 Gaussian RBF CGNL block (denoted CGNLx) in Table 5. We compare these models with two strong baselines, R-50 and R-152. In Table 5, all the best top1 and top5 accuracies are reported under single center-crop testing. The CGNL networks beat the baselines by more than 1 point regardless of whether the dot product or the Gaussian RBF serves as the kernel function in the CGNL module.

References
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
[2] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In European Conference on Computer Vision, pages 430–443. Springer, 2012.
[3] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint, 2016.
[4] Y. Cui, F. Zhou, J. Wang, X. Liu, Y. Lin, and S. Belongie. Kernel pooling for convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017.
[5] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell. Compact bilinear pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 317–326, 2016.
[6] P. Goyal, P. Dollár, R.
Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[7] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[9] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[10] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
[11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[13] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[14] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1449–1457, 2015.
[15] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[16] W. Luo, Y. Li, R. Urtasun, and R. Zemel. Understanding the effective receptive field in deep convolutional neural networks.
In Advances in Neural Information Processing Systems, pages 4898–4906, 2016.
[17] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In European Conference on Computer Vision, pages 143–156. Springer, 2010.
[18] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 239–247. ACM, 2013.
[19] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9):1481–1497, 1990.
[20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[21] B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[22] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[23] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[24] I. Ustyuzhaninov, W. Brendel, L. A. Gatys, and M. Bethge. What does it take to generate natural textures? In ICLR, 2017.
[25] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[26] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3169–3176. IEEE, 2011.
[27] X. Wang, R. Girshick, A. Gupta, and K.
He. Non-local neural networks. arXiv preprint arXiv:1711.07971, 2017.
[28] Y. Wu and K. He. Group normalization. arXiv preprint arXiv:1803.08494, 2018.
[29] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
[30] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning for video understanding. arXiv preprint arXiv:1712.04851, 2017.
[31] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.