{"title": "A^2-Nets: Double Attention Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 352, "page_last": 361, "abstract": "Learning to capture long-range relations is fundamental to image/video recognition. Existing CNN models generally rely on increasing depth to model such relations, which is highly inefficient. In this work, we propose the “double attention block”, a novel component that aggregates and propagates informative global features from the entire spatio-temporal space of input images/videos, enabling subsequent convolution layers to access features from the entire space efficiently. The component is designed with a double attention mechanism in two steps: the first step gathers features from the entire space into a compact set through second-order attention pooling, and the second step adaptively selects and distributes features to each location via another attention. The proposed double attention block is easy to adopt and can be plugged into existing deep neural networks conveniently. We conduct extensive ablation studies and experiments on both image and video recognition tasks to evaluate its performance. On the image recognition task, a ResNet-50 equipped with our double attention blocks outperforms a much larger ResNet-152 architecture on the ImageNet-1k dataset with over 40% fewer parameters and fewer FLOPs. 
On the action recognition task, our proposed model achieves the state-of-the-art results on the Kinetics and UCF-101 datasets with significantly higher efficiency than recent works.", "full_text": "A2-Nets: Double Attention Networks\n\nYunpeng Chen\u2217\n\nNational University of Singapore\n\nchenyunpeng@u.nus.edu\n\nYannis Kalantidis\nFacebook Research\nyannisk@fb.com\n\nJianshu Li\n\nNational University of Singapore\n\njianshu@u.nus.edu\n\nShuicheng Yan\n\nQihoo 360 AI Institute\n\nNational University of Singapore\n\neleyans@nus.edu.sg\n\nJiashi Feng\n\nNational University of Singapore\n\nelefjia@nus.edu.sg\n\nAbstract\n\nLearning to capture long-range relations is fundamental to image/video recognition.\nExisting CNN models generally rely on increasing depth to model such relations\nwhich is highly inef\ufb01cient. In this work, we propose the \u201cdouble attention block\u201d, a\nnovel component that aggregates and propagates informative global features from\nthe entire spatio-temporal space of input images/videos, enabling subsequent con-\nvolution layers to access features from the entire space ef\ufb01ciently. The component\nis designed with a double attention mechanism in two steps, where the \ufb01rst step\ngathers features from the entire space into a compact set through second-order\nattention pooling and the second step adaptively selects and distributes features\nto each location via another attention. The proposed double attention block is\neasy to adopt and can be plugged into existing deep neural networks conveniently.\nWe conduct extensive ablation studies and experiments on both image and video\nrecognition tasks for evaluating its performance. On the image recognition task, a\nResNet-50 equipped with our double attention blocks outperforms a much larger\nResNet-152 architecture on ImageNet-1k dataset with over 40% less the number\nof parameters and less FLOPs. 
On the action recognition task, our proposed model\nachieves the state-of-the-art results on the Kinetics and UCF-101 datasets with\nsigni\ufb01cantly higher ef\ufb01ciency than recent works.\n\n1\n\nIntroduction\n\nDeep Convolutional Neural Networks (CNNs) have been successfully applied in image and video\nunderstanding during the past few years. Many new network topologies have been developed to\nalleviate optimization dif\ufb01culties [9, 10] and increase the learning capacities [26, 5], which bene\ufb01t\nrecognition performance for both images [8, 2] and videos [23] signi\ufb01cantly.\nHowever, CNNs are inherently limited by their convolution operators which are dedicated to capturing\nlocal features and relations, e.g. from a 7 \u00d7 7 region, and are inef\ufb01cient in modeling long-range\ninterdependencies. Though stacking multiple convolution operators can enlarge the receptive \ufb01eld, it\nalso comes with a number of unfavorable issues in practice. First, stacking multiple operators makes\nthe model unnecessarily deep and large, resulting in higher computation and memory cost as well as\nincreased over-\ufb01tting risks. Second, features far away from a speci\ufb01c location have to pass through a\nstack of layers before affecting the location for both forward propagation and backward propagation,\nincreasing the optimization dif\ufb01culties during the training. Third, the features visible to a distant\nlocation are actually \u201cdelayed\u201d ones from several layers behind, causing inef\ufb01cient reasoning. 
Though\n\n\u2217Part of the work is done during internship at Facebook Research.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fsome recent works [11, 25] can partially alleviate the above issues, they are either non-\ufb02exible [11]\nor computationally expensive [25].\nIn this work, we aim to overcome these limitations by introducing a new network component\nthat enables a convolution layer to sense the entire spatio-temporal space2 from its adjacent layer\nimmediately. The core idea is to \ufb01rst gather key features from the entire space into a compact set\nand then distribute them to each location adaptively, so that the subsequent convolution layers can\nsense features from the entire space even without a large receptive \ufb01led. We develop a generic\nfunction for such purpose and implement it with an ef\ufb01cient double attention mechanism. The \ufb01rst\nsecond-order attention pooling operation selectively gathers key features from the entire space, while\nthe second adopts another attention mechanism to adaptively distribute a subset of key features that\nare helpful to complement each spatio-temporal location for high-level tasks. We denote our proposed\ndouble-attention block as A2-block and its resultant network as A2-Net.\nThe double-attention block is related to a number of recent works, including the Squeeze-and-\nExcitation Networks [11], covariance pooling [14], the Non-local Neural Networks [25] and the\nTransformer architecture of [24]. However, compared with these existing works, it enjoys several\nunique advantages: Its \ufb01rst attention operation implicitly computes second-order statistics of pooled\nfeatures and can capture complex appearance and motion correlations that cannot be captured by\nthe global average pooling used in SENet [11]. 
Its second attention operation adaptively allocates\nfeatures from a compact bag, which is more ef\ufb01cient than exhaustively correlating the features from\nall the locations with every speci\ufb01c location as in [25, 24]. Extensive experiments on image and\nvideo recognition tasks clearly validate the above advantages of our proposed method.\nWe summarize our contributions as follows:\n\n\u2022 We propose a generic formulation for capturing long-range feature interdependencies via\n\nuniversal gathering and distribution functions.\n\n\u2022 We propose the double attention block for gathering and distributing long-range features, an\nef\ufb01cient architecture that captures second-order feature statistics and makes adaptive feature\nassignment. The block can model long-range interdependencies with a low computational\nand memory footprint and at the same time boost image/video recognition performance\nsigni\ufb01cantly.\n\n\u2022 We investigate the effect of our proposed A2-Net with extensive ablation studies and prove\nits superior performance through comparison with the state-of-the-arts on a number of\npublic benchmarks for both image recognition and video action recognition tasks, including\nImageNet-1k, Kinetics and UCF-101.\n\nThe rest of the paper is organized as follows. We \ufb01rst motivate and present our approach in Section 2,\nwhere we also discuss the relation of our approach to recent works. We then evaluate and report\nresults in Section 3 and conclude the paper with Section 4.\n\n2 Method\n\nConvolutional operators are designed to focus on local neighborhoods and therefore fail to \u201csense\u201d\nthe entire spatial and/or temporal space, e.g. the entire input frame or one location across multiple\nframes. A CNN model thus usually employs multiple convolution layers (or recurrent units [6, 17]) in\norder to capture global aspects of the input. 
Meanwhile, self-attentive and correlation operators like\nsecond-order pooling have been recently shown to work well in a wide range of tasks [24, 14, 15]. In\nthis section we present a component capable of gathering and distributing global features to each\nspatial-temporal location of the input, helping subsequent convolution layers sense the entire space\nimmediately and capture complex relations. We \ufb01rst formally describe this desired component by\nproviding a generic formulation and then introduce our double attention block, a highly ef\ufb01cient\ninstantiation of such a component. We \ufb01nally discuss the relation of our approach to other recent\nrelated approaches.\n\n2Here by \u201cspace\u201d we mean the entire feature maps of an input frame and the complete spatio-temporal\n\nfeatures from a video sequence.\n\n2\n\n\fFigure 1: Illustration of the double-attention mechanism. (a) An example on a single frame input\nfor explaining the idea of our double attention method, where the set of global featues is computed\nonly once and then shared by all locations. Meanwhile, each location i will generate its own attention\nvector based on the need of its local feature vi to select a desired subset of global features that is\nhelpful to complement current location and form the feature zi. (b) The double attention operation on\na three dimensional input array A. The \ufb01rst attention step is shown on the top and produces a set of\nglobal features. At location i, the second attention step generates the new local feature zi, as shown\nat the bottom.\n\nLet X \u2208 Rc\u00d7d\u00d7h\u00d7w denote the input tensor for a spatio-temporal (3D) convolutional layer, where\nc denotes the number of channels, d denotes the temporal dimension3 and h, w are the spatial\ndimensions of the input frames. For every spatio-temporal input location i = 1, . . . 
, dhw with local feature vi, let us define\n\nzi = Fdistr(Ggather(X), vi),  (1)\n\nto be the output of an operator that first gathers features from the entire space and then distributes them back to each input location i, taking into account the local feature vi of that location. Specifically, Ggather adaptively aggregates features from the entire input space, and Fdistr distributes the gathered information to each location i, conditioned on the local feature vector vi.\n\nThe idea of gathering and distributing information is motivated by the squeeze-and-excitation network (SENet) [11]. Eqn. (1), however, presents it in a more general form that leads to some interesting insights and optimizations. In [11], global average pooling is used in the gathering process, and the resulting single global feature is distributed to all locations, ignoring the different needs across locations. Seeing these shortcomings, we introduce this generic formulation and propose the Double Attention block, where global information is first gathered by second-order attention pooling (instead of first-order average pooling), and the gathered global features are then adaptively distributed, conditioned on the need of the current local feature vi, by a second attention mechanism. In this way, more complex global relations can be captured by a compact set of features, and each location can receive customized global information that is complementary to its existing local features, facilitating the learning of more complex relations. The proposed component is illustrated in Figure 1 (a). Below, we first describe its architecture in detail and then discuss some instantiations and its connections to other recent related approaches.\n\n2.1 The First Attention Step: Feature Gathering\n\nA recent work [15] used bilinear pooling to capture second-order statistics of features and generate global representations. 
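As a concrete illustration of bilinear pooling viewed as a sum of outer products, here is a minimal numpy sketch; the toy sizes and variable names are ours for illustration, not taken from the paper's released code:

```python
import numpy as np

# Two feature maps A (m x dhw) and B (n x dhw), flattened over all
# spatio-temporal locations; sizes are toy values for illustration.
m, n, dhw = 4, 3, 16 * 7 * 7
rng = np.random.default_rng(0)
A = rng.standard_normal((m, dhw))
B = rng.standard_normal((n, dhw))

# Bilinear pooling: G = A B^T, a sum of outer products a_i b_i^T over
# all locations i.
G = A @ B.T

# Same result accumulated location by location, making the
# outer-product view explicit.
G_sum = sum(np.outer(A[:, i], B[:, i]) for i in range(dhw))
assert np.allclose(G, G_sum)

# Each column g_i of G is a weighted sum of the local features a_j,
# with weights taken from row i of B -- a "bag of visual primitives".
g_0 = A @ B[0, :]
assert np.allclose(G[:, 0], g_0)
```

This column-wise reading is exactly what motivates replacing the raw weights with a softmax in the first attention step.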
Compared with the conventional average and max pooling, which only compute first-order statistics, bilinear pooling can better capture and preserve complex relations. Concretely, bilinear pooling gives a sum pooling of second-order features from the outer products of all the feature vector pairs (ai, bi) within two input feature maps A and B:\n\nGbilinear(A, B) = A Bᵀ = Σ_∀i ai biᵀ,  (2)\n\n³For a spatial (2D) convolution, i.e. when the input is an image, d = 1.\n\nwhere A = [a1, · · · , adhw] ∈ R^(m×dhw) and B = [b1, · · · , bdhw] ∈ R^(n×dhw). In CNNs, A and B can be feature maps from the same layer, i.e. A = B, or from two different layers, i.e. A = φ(X; Wφ) and B = θ(X; Wθ), with parameters Wφ and Wθ.\n\nBy introducing the output variable G = [g1, · · · , gn] ∈ R^(m×n) of the bilinear pooling and rewriting the second feature B as B = [b̄1; · · · ; b̄n], where each b̄i is a dhw-dimensional row vector, we can reformulate Eqn. (2) as\n\ngi = A b̄iᵀ = Σ_∀j b̄ij aj.  (3)\n\nEqn. (3) gives a new perspective on the bilinear pooling result: instead of just computing second-order statistics, the output G of bilinear pooling is actually a bag of visual primitives, where each primitive gi is calculated by gathering local features weighted by b̄i. This inspires us to develop a new attention-based feature gathering operation. We further apply a softmax to B to ensure Σj b̄ij = 1 for every i, i.e. 
a valid attention weighting vector, which gives the following second-order attention pooling process:\n\ngi = A softmax(b̄i)ᵀ.  (4)\n\nThe first row in Figure 1 (b) shows the second-order attention pooling that corresponds to Eqn. (4), where both A and B are outputs of two different convolution layers transforming the input X. In our implementation, we let A = φ(X; Wφ) and B = softmax(θ(X; Wθ)). The second-order attention pooling offers an effective way to gather key features: it captures global features, e.g. texture and lighting, when b̄i attends densely to all locations; and it captures the existence of specific semantics, e.g. an object or its parts, when b̄i attends sparsely to a specific region. We note that similar understandings were presented in [7], which proposed a rank-1 approximation of a bilinear pooling operation associated with a fully connected classifier. In our work, however, we propose to apply attention pooling to gather visual primitives at different locations into a bag of global descriptors using a softmax attention map, and we do not apply any low-rank constraint.\n\n2.2 The Second Attention Step: Feature Distribution\n\nThe next step after gathering features from the entire space is to distribute them to each location of the input, such that the subsequent convolution layer can sense the global information even with a small convolutional kernel. Instead of distributing the same summarized global features to all locations as SENet [11] does, we propose to gain more flexibility by distributing an adaptive bag of visual primitives based on the need of the feature vi at each location. In this way, each location can select features that are complementary to its current feature, which makes training easier and helps capture more complex relations. 
This is achieved by selecting a subset of feature vectors from Ggather(X) with soft attention:\n\nzi = Σ_∀j vij gj = Ggather(X) vi, where Σ_∀j vij = 1.  (5)\n\nEqn. (5) formulates the proposed soft attention for feature selection. In our implementation, we apply the softmax function to normalize vi to unit sum, which we found to give better convergence. The second row in Figure 1 (b) shows the above feature selection step. Similar to the way we generate the attention maps, the set of attention weight vectors is also generated by a convolution layer followed by a softmax normalizer, i.e. V = softmax(ρ(X; Wρ)), where Wρ contains the parameters of this layer.\n\n2.3 The Double Attention Block\n\nWe combine the above two attention steps to form our proposed double-attention block; its computational graph in deep neural networks is given in Figure 2. To formulate the double attention operation, we substitute Eqn. (4) and Eqn. (5) into Eqn. (1) and obtain\n\nZ = Fdistr(Ggather(X), V) = Ggather(X) softmax(ρ(X; Wρ)) = [φ(X; Wφ) softmax(θ(X; Wθ))ᵀ] softmax(ρ(X; Wρ)).  (6)\n\nFigure 2: The computational graph of the proposed double attention block. All convolution kernels are of size 1 × 1 × 1. We insert this double attention block into existing convolutional neural networks, e.g. residual networks [9], to form the A2-Net.\n\nFigure 1 (b) shows the combined double attention operation and Figure 2 shows the corresponding computational graph, where the feature arrays A, B and V are generated by three different convolution layers operating on the input feature array X, followed by softmax normalization where necessary. The output Z is obtained by two matrix multiplications, with the necessary reshape and transpose operations. 
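Once the feature maps are flattened over locations, the gather and distribute steps reduce to two matrix multiplications. A minimal numpy sketch follows; the 1×1×1 convolutions φ, θ, ρ are modeled as plain weight matrices and all sizes and names are illustrative, not the paper's released implementation:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy sizes: c channels, m primitives, n attention maps, dhw locations.
c, m, n, dhw = 8, 4, 3, 49
rng = np.random.default_rng(1)
X = rng.standard_normal((c, dhw))      # input features, flattened over space-time
W_phi = rng.standard_normal((m, c))    # stands in for the 1x1x1 conv phi
W_theta = rng.standard_normal((n, c))  # stands in for theta
W_rho = rng.standard_normal((n, c))    # stands in for rho

A = W_phi @ X                          # m x dhw features
B = softmax(W_theta @ X, axis=1)       # attention maps: each row sums to 1 over locations
V = softmax(W_rho @ X, axis=0)         # attention vectors: each column sums to 1 over n

# First attention: gather a compact bag of m global descriptors.
G = A @ B.T                            # m x n
# Second attention: distribute the descriptors to every location.
Z_left = G @ V                         # m x dhw, "left association"

# "Right association" is mathematically identical but forms a
# dhw x dhw relation matrix first, which is far more expensive
# whenever (dhw)^2 > mn.
Z_right = A @ (B.T @ V)
assert np.allclose(Z_left, Z_right)
```

Since matrix multiplication is associative, both groupings yield the same Z; the left association only ever materializes m×n and m×dhw arrays, while the right association builds a dhw×dhw relation matrix, which is why the left association is preferable whenever (dhw)^2 > mn.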
Here, an additional convolution layer is added at the end to expand the number of channels of the output Z, so that it can be encoded back into the input X via element-wise addition. During training, the gradient of the loss function can be computed easily via automatic differentiation [3, 18] with the chain rule.\n\nThere are two different ways to implement the computational graph of Eqn. (6). One is to use the left association as given in Eqn. (6), whose computation graph is shown in Figure 2. The other is to use the right association, as formulated below:\n\nZ = φ(X; Wφ) [softmax(θ(X; Wθ))ᵀ softmax(ρ(X; Wρ))].  (7)\n\nWe note that these two associations are mathematically equivalent and thus produce the same output. However, they differ in computational cost and memory consumption. The computational complexity of the second matrix multiplication in the “left association” of Eqn. (6) is O(mndhw), while the “right association” in Eqn. (7) has complexity O(m(dhw)^2). As for the memory cost⁴, storing the result of the first matrix multiplication costs mn/2^18 MB for the left association and (dhw)^2/2^18 MB for the right association. In practice, an input array X with 32 frames of size 28 × 28 and 512 channels can easily cost more than 2 GB of memory under the right association, far more than the roughly 1 MB cost of the left association. In this case, the left association is also more computationally efficient than the right one. Therefore, for common cases where (dhw)^2 > nm, we suggest the implementation in Eqn. (6) with left association.\n\n2.4 Discussion\n\nIt is interesting to observe that the implementation in Eqn. 
(7) with right association can be further explained in terms of the recent NL-Net [25], where the first multiplication captures pair-wise relations between local features and gives an output relation matrix in R^(dhw×dhw). The resulting relation matrix is then applied to linearly combine the transformed features φ(X) into the output feature Z. The difference lies in the design of the pair-wise relation function: we propose a new relation function, i.e. softmax(θ(X))ᵀ softmax(ρ(X)), rather than the Embedded Gaussian formulation [24], to capture the pair-wise relations. Meanwhile, as discussed above, any such method practically suffers from high computational and memory costs, and relies on subsampling tricks to reduce the cost, which may hurt accuracy. Since NL-Net is the current state-of-the-art for video recognition tasks and is also closely related, we directly compare and extensively discuss performance between the two in the Experiments section. The results clearly show that our proposed method not only outperforms NL-Net, but does so with higher efficiency. As the Embedded Gaussian NL-Net formulation that we compare in the experiments is mathematically equivalent to the self-attention formulation of [24], conclusions/comparisons to NL-Net extend to transformer networks as well.\n\n⁴All values are stored in 32-bit float.\n\nTable 1: Three backbone Residual Networks for the video tasks. The input sizes for ResNet-26 and ResNet-29 are 16×112×112, while the input size for ResNet-50 is 8×224×224. 
We follow [25] and\nset k = [3, 1, 3], [3, 1, 3, 1, 3, 1], [1, 3, 1] for ResNet-50 in last three stages and decrease the temporal\nsize to reduce computational cost.\n\nstage\n\nconv1\n\nconv2\n\nconv3\n\nconv4\n\nconv5\n\nResNet-26\n\nResNet-29\n\n3\u00d75\u00d75, 16, stride (1,2,2)\n\n3\u00d75\u00d75, 16, stride (1,2,2)\n\n3\u00d73\u00d73, 32\n1\u00d71\u00d71, 128\n\n\uf8ee\uf8f0 1\u00d71\u00d71, 32\n\uf8ee\uf8f0 1\u00d71\u00d71, 64\n\uf8ee\uf8f0 1\u00d71\u00d71, 128\n\uf8ee\uf8f0 1\u00d71\u00d71, 256\n\n3\u00d73\u00d73, 64\n1\u00d71\u00d71, 256\n\n3\u00d73\u00d73, 128\n1\u00d71\u00d71, 512\n\n3\u00d73\u00d73, 256\n1\u00d71\u00d71, 1024\n\n\uf8f9\uf8fb \u00d7 2\n\uf8f9\uf8fb \u00d7 2\n\uf8f9\uf8fb \u00d7 2\n\uf8f9\uf8fb \u00d7 2\n\n3\u00d73\u00d73, 32\n1\u00d71\u00d71, 128\n\n\uf8ee\uf8f0 1\u00d71\u00d71, 32\n\uf8ee\uf8f0 1\u00d71\u00d71, 64\n\uf8ee\uf8f0 1\u00d71\u00d71, 128\n\uf8ee\uf8f0 1\u00d71\u00d71, 256\n\n3\u00d73\u00d73, 64\n1\u00d71\u00d71, 256\n\n3\u00d73\u00d73, 128\n1\u00d71\u00d71, 512\n\n3\u00d73\u00d73, 256\n1\u00d71\u00d71, 1024\n\n\uf8f9\uf8fb \u00d7 2\n\uf8f9\uf8fb \u00d7 2\n\uf8f9\uf8fb \u00d7 3\n\uf8f9\uf8fb \u00d7 2\n\noutput\n\n16\u00d756\u00d756\n\n8\u00d756\u00d756\n\n8\u00d728\u00d728\n\n8\u00d714\u00d714\n\n8\u00d77\u00d77\n\n1\u00d71\u00d71\n\nResNet-50\n\n3\u00d75\u00d75, 32, stride (1,2,2)\nmax pooling, stride (1,2,2)\n\n3\u00d73\u00d73, 64\n1\u00d71\u00d71, 256\n\n\uf8ee\uf8f0 1\u00d71\u00d71, 64\n\uf8ee\uf8f0 1\u00d71\u00d71, 128\n\uf8ee\uf8f0 1\u00d71\u00d71, 256\n\uf8ee\uf8f0 1\u00d71\u00d71, 512\n\nk\u00d73\u00d73, 128\n1\u00d71\u00d71, 512\n\nk\u00d73\u00d73, 256\n1\u00d71\u00d71, 1024\n\nk\u00d73\u00d73, 512\n1\u00d71\u00d71, 2048\n\n\uf8f9\uf8fb \u00d7 3\n\uf8f9\uf8fb \u00d7 4\n\uf8f9\uf8fb \u00d7 6\n\uf8f9\uf8fb \u00d7 3\n\noutput\n\n8\u00d756\u00d756\n\n8\u00d756\u00d756\n\n4\u00d728\u00d728\n\n4\u00d714\u00d714\n\n4\u00d77\u00d77\n\n1\u00d71\u00d71\n\nglobal average pool, fc, softmax\n\nglobal average pool, fc, 
softmax\n\nglobal average pool, fc, softmax\n\n(#Params, FLOPs)\n\n(7.0 M, 8.3 G)\n\n(7.6 M, 9.2 G)\n\n(33.4 M, 31.3 G)\n\n3 Experiments\n\nIn this section, we \ufb01rst conduct extensive ablation studies to evaluate the proposed A2-Nets on the\nKinetics [12] video recognition dataset and compare it with the state-of-the-art NL-Net [25]. Then\nwe conduct more experiments using deeper and wider neural networks on both image recognition\nand video recognition tasks and compare it with state-of-the-art methods.\n\n3.1\n\nImplementation Details\n\nBackbone CNN We use the residual network [10] as our backbone CNN for all experiments.\nTable 1 shows architecture details of the backbone CNNs for video recognition tasks, where we use\nResNet-26 for all ablation studies and ResNet-29 as one of the baseline methods. The computational\ncost is measured by FLOPs, i.e. \ufb02oating-point multiplication-adds, and the model complexity is\nmeasured by #Params, i.e. total number of trained parameters. The ResNet-50 is almost 2\u00d7 deeper\nand wider than the ResNet-26 and thus only used for last several experiments when comparing with\nthe state-of-the-art methods. For the image recognition task, we use the same ResNet-50 but without\nthe temporal dimension for both the input/output data and convolution kernels.\n\nTraining and Testing Settings We use MXNet [3] to experiment on the image classi\ufb01cation task,\nand PyTorch [18] on video classi\ufb01cation tasks. For image classi\ufb01cation, we report standard single\nmodel single 224 \u00d7 224 center crop validation accuracy, following [9, 10]. For experiments on video\ndatasets, we report both single clip accuracy and video accuracy. All experiments are conducted\nusing a distributed K80 GPU cluster and the networks are optimized by synchronized SGD. 
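As a rough illustration of how such a block drops into a backbone, the sketch below wraps the double attention operation with a channel reduction, a channel expansion, and the residual addition described in Section 2.3; it is a shape-level numpy mock-up with made-up weights, not the released training code:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def double_attention_block(X, W_phi, W_theta, W_rho, W_out):
    """Residual double attention over flattened locations; X is c x dhw.

    Illustrative only: the paper implements phi/theta/rho and the output
    expansion as 1x1x1 convolutions inside the network.
    """
    A = W_phi @ X                        # reduced features
    B = softmax(W_theta @ X, axis=1)     # attention maps over locations
    V = softmax(W_rho @ X, axis=0)       # per-location attention vectors
    Z = (A @ B.T) @ V                    # gather, then distribute (left association)
    return X + W_out @ Z                 # expand channels, add back to the input

c, dhw = 8, 49
rng = np.random.default_rng(2)
X = rng.standard_normal((c, dhw))
cr = c // 4                              # reduced channels set to 1/4 of the input, as in the ablations
W_phi, W_theta, W_rho = (rng.standard_normal((cr, c)) for _ in range(3))
W_out = rng.standard_normal((c, cr))
Y = double_attention_block(X, W_phi, W_theta, W_rho, W_out)
assert Y.shape == X.shape                # shape-preserving, so it slots in after any residual unit
```

Because the output keeps the input shape, the block can be inserted after a residual unit of any stage, which is how the ablation studies place it at Conv2 through Conv4.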
Code\nand trained models will be released on GitHub soon.\n\n3.2 Ablation Studies\n\nFor the ablation studies on Kinetics [1], we use 32 GPUs per experiment with a total batch size of\n512 training from scratch. All networks take 16 frames with resolution 112 \u00d7 112 as input. The\nbase learning rate is set to 0.2 and is reduced with a factor of 0.1 at the 20k-th, 30k-th iterations, and\nterminated at the 37k-th iteration. We set the number of output channels for three convolution layers\n\u03b8(\u00b7), \u03c6(\u00b7) and \u03c1(\u00b7) to be 1/4 of the number of input channels. Note that sub-sampling trick is not\nadopted for all methods for fair comparison.\nSingle Block Table 2 shows the results when only one extra block is added to the backbone network.\nThe block is placed after the second residual unit of a certain stage. As can be seen from the last\nthree rows, our proposed A2-block constantly improves the performance compared with both the\n\n6\n\n\fTable 2: Comparisons between single nonlocal block [25] and single double attention block on the\nKinetics dataset. The performance of vanilla residual networks without extra block is shown in the\ntop row.\n\nModel\n\nResNet-26\nResNet-29\n\nResNet-26 + NL [25]\n\nResNet-26 + A2\n\n+ 1 Block\n\nNone\nNone\n\n@ Conv2\n@ Conv3\n@ Conv4\n@ Conv2\n@ Conv3\n@ Conv4\n\n\u2013\n\n\u2013\n\n\u2013\n\n50.4 %\n50.8 %\n\nFLOPs \u2206 FLOPs Clip @1 \u2206 Clip@1 Video@1\n#Params\n8.3 G\n60.7 %\n7.043 M\n7.620 M\n61.6 %\n9.2 G\n7.061 M 49.0 G\n7.112 M 13.7 G\n7.312 M\n9.3 G\n8.7 G\n7.061 M\n8.7 G\n7.112 M\n7.312 M\n8.7 G\n\n900 M\n40.69 G\n5.45 G\n1.04 G\n463 M\n463 M\n463 M\n\n+1.1 %\n+1.3 %\n+0.8 %\n+1.5 %\n+1.9 %\n\n51.5 %\n51.7 %\n51.2 %\n51.9 %\n52.3 %\n\n62.0 %\n62.3 %\n61.8 %\n62.0 %\n62.6 %\n\n+0.5 %\n\n\u2013\n\n\u2013\n\nTable 3: Comparisons between performance from multiple nonlocal blocks [25] and multiple double\nattention blocks on Kinetics dataset. 
We report both top-1 clips accuracy and top-1 video accuracy\nfor all the methods. The vanilla residual networks without extra blocks are shown in the top row.\n\nModel\n\nResNet-26\nResNet-29\n\nResNet-26 + NL [25]\n\nResNet-26 + A2\n\n+N Blocks\n\nNone\nNone\n\n1 @ Conv4\n2 @ Conv4\n\n4 @ Conv3&4\n\n1 @ Conv4\n2 @ Conv4\n\n4 @ Conv3&4\n\n\u2013\n\n\u2013\n\nFLOPs \u2206 FLOPs Clip @1 \u2206 Clip@1 Video @1\n#Params\n60.7 %\n8.3 G\n7.043 M\n61.6 %\n9.2 G\n7.620 M\n62.3 %\n7.312 M\n9.3 G\n62.9 %\n7.581 M 10.4 G\n7.719 M 21.3 G\n62.8 %\n62.6 %\n8.7 G\n7.312 M\n63.1 %\n7.581 M\n9.2 G\n63.5 %\n7.719 M 10.1 G\n\n900 M\n1.04 G\n2.08 G\n12.97 G\n463 M\n925 M\n1.85 G\n\n50.4 %\n50.8 %\n51.7 %\n52.0 %\n52.4 %\n52.3 %\n52.5 %\n53.0 %\n\n+0.5 %\n+1.3 %\n+1.6 %\n+2.0 %\n+1.9 %\n+2.1 %\n+2.6 %\n\nbaseline ResNet-26 and the deeper ResNet-29. Notably the extra cost is very little. We also \ufb01nd that\nthe performance gain from placing A2-block on top layers is more signi\ufb01cant than placing it at lower\nlayers. This may be because the top layers give more semantically abstract representations that are\nsuitable for extracting global visual primitives. Comparatively, the Nonlocal Network [25] shows less\naccuracy gain and more computational cost than ours. Since the computational cost for Nonlocal\nNetwork is increased quadratically on bottom stage, we are even unable to \ufb01nish the training when\nthe block is placed at Conv2.\nMultiple Blocks Table 3 shows the performance gain when multiple blocks are added to the\nbackbone networks. As can be seen from the results, our proposed A2-Net monotonically improves\nthe accuracy when more blocks are added and costs less #FLOPs compared with its competitor. 
We also find that adding blocks to different stages leads to a more significant accuracy gain than adding all blocks to the same stage.\n\n3.3 Experiments on Image Recognition\n\nWe evaluate the proposed A2-Net on the ImageNet-1k [13] image classification dataset, which contains more than 1.2 million high-resolution images in 1,000 categories. Our implementation is based on the code released by [5], using 64 GPUs with a batch size of 2,048. The base learning rate is set to 0.1 and is decreased by a factor of 0.1 when the training accuracy saturates.\n\nTable 4: Comparison with state-of-the-arts on ImageNet-1k.\n\nModel | Backbone | Top-1 | Top-5\nResNet [9] | ResNet-50 | 75.3 % | 92.2 %\nResNet [9] | ResNet-152 | 77.0 % | 93.3 %\nSENet [11] | ResNet-50 | 76.7 % | 93.4 %\nA2-Net | ResNet-50 | 77.0 % | 93.5 %\n\nTable 5: Comparisons with state-of-the-arts results on Kinetics. Only RGB information is used for input.\n\nModel | #Frames | FLOPs | Video @1 | Video @5\nConvNet+LSTM [1] | – | – | 63.3 % | –\nI3D [1] | 64 | 107.9 G | 71.1 % | 89.3 %\nR(2+1)D [23] | 32 | 152.4 G | 72.0 % | 90.0 %\nA2-Net | 8 | 40.8 G | 74.6 % | 91.5 %\n\nTable 6: Comparisons with state-of-the-arts results on UCF-101. The averaged Top-1 video accuracy on three train/test splits is reported.\n\nMethod | Backbone | FLOPs | Video @1\nC3D [21] | VGG | 38.5 G | 82.3 %\nRes3D [22] | ResNet-18 | 19.3 G | 85.8 %\nI3D-RGB [1] | Inception | 107.9 G | 95.6 %\nR(2+1)D-RGB [23] | ResNet-34 | 152.4 G | 96.8 %\nA2-Net | ResNet-50 | 41.6 G | 96.4 %\n\nAs can be seen from Table 4, a ResNet-50 equipped with 5 extra A2-blocks at Conv3 and Conv4 outperforms a much larger ResNet-152 architecture. We note that the A2-block-equipped ResNet-50 is also over 40% more efficient than ResNet-152, costing only 6.5 GFLOPs and 33.0 M parameters. 
Compared with SENet [11], the A2-Net also achieves better accuracy, which demonstrates the effectiveness of the proposed double attention mechanism.\n\n3.4 Experiment Results on Video Recognition\n\nIn this subsection, we evaluate the proposed method on learning video representations. We consider the scenario where static image features are pretrained but motion features are learned from scratch by training a model on the large-scale Kinetics [1] dataset, and the scenario where well-trained motion features are transferred to the small-scale UCF-101 [20] dataset.\n\nLearning Motion from Scratch on Kinetics We use a ResNet-50 pretrained on ImageNet and add 5 randomly initialized A2-blocks to build the 3D convolutional network. The corresponding backbone is shown in Table 1. The network takes 8 frames (sampling stride: 8) as input and is trained for 32k iterations with a total batch size of 512 using 64 GPUs. The initial learning rate is set to 0.04 and is decreased in a stepwise manner when the training accuracy saturates. The final result is shown in Table 5. Compared with the state-of-the-art I3D [1] and R(2+1)D [23], our proposed model shows higher accuracy even with fewer sampled frames, which once again confirms the superiority of the proposed double-attention mechanism.\n\nTransfer the Learned Features to UCF-101 UCF-101 contains about 13,320 videos from 101 action categories and has three train/test splits. Its training set is several times smaller than the Kinetics dataset, and we use it to evaluate the generality and robustness of the features learned by our model pre-trained on Kinetics. The network is trained with a base learning rate of 0.01, decreased three times by a factor of 0.1, using 8 GPUs with a batch size of 104 clips, and tested with 224 × 224 input resolution at a single scale. Table 6 shows the results of our proposed model and the comparison with the state-of-the-art. 
Consistent with the above results, A2-Net achieves leading performance with significantly lower computational cost. This shows that the features learned by A2-Net are robust and can be effectively transferred to new datasets at very low cost compared with existing methods.

4 Conclusions

In this work, we proposed a double attention mechanism for deep CNNs to overcome the limitations of local convolution operations. The proposed method effectively captures global information and distributes it to every location in a two-step attention manner. We formulated the proposed method and instantiated it as a lightweight block that can be easily inserted into existing CNNs with little computational overhead. Extensive ablation studies and experiments on a number of benchmark datasets, including ImageNet-1k, Kinetics and UCF-101, confirmed the effectiveness of the proposed A2-Net on both 2D image recognition and 3D video recognition tasks. In the future, we want to explore integrating double attention into recent compact network architectures [19, 16, 4], to leverage the expressiveness of the proposed method for smaller, mobile-friendly models.

References

[1] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733. IEEE, 2017.

[2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.

[3] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems.
arXiv preprint arXiv:1512.01274, 2015.

[4] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. Multi-fiber networks for video recognition. In European Conference on Computer Vision (ECCV), 2018.

[5] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks. In Advances in Neural Information Processing Systems, pages 4470–4478, 2017.

[6] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.

[7] Rohit Girdhar and Deva Ramanan. Attentional pooling for action recognition. In Advances in Neural Information Processing Systems, pages 33–44, 2017.

[8] Ross Girshick. Fast R-CNN. arXiv preprint arXiv:1504.08083, 2015.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.

[11] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[12] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks.
In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[14] Peihua Li, Jiangtao Xie, Qilong Wang, and Wangmeng Zuo. Is second-order information helpful for large-scale visual recognition? arXiv preprint arXiv:1703.08050, 2017.

[15] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1449–1457, 2015.

[16] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. arXiv preprint arXiv:1807.11164, 2018.

[17] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 4694–4702. IEEE, 2015.

[18] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. PyTorch, 2017.

[19] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. arXiv preprint arXiv:1801.04381, 2018.

[20] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

[21] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 4489–4497. IEEE, 2015.

[22] Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. ConvNet architecture search for spatiotemporal feature learning.
arXiv preprint arXiv:1708.05038, 2017.

[23] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.

[25] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Computer Vision and Pattern Recognition (CVPR), 2018.

[26] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987–5995. IEEE, 2017.