{"title": "Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 9401, "page_last": 9411, "abstract": "While the use of bottom-up local operators in convolutional neural networks (CNNs) matches well some of the statistics of natural images, it may also prevent such models from capturing contextual long-range feature interactions. In this work, we propose a simple, lightweight approach for better context exploitation in CNNs. We do so by introducing a pair of operators: gather, which efficiently aggregates feature responses from a large spatial extent, and excite, which redistributes the pooled information to local features. The operators are cheap, both in terms of number of added parameters and computational complexity, and can be integrated directly in existing architectures to improve their performance. Experiments on several datasets show that gather-excite can bring benefits comparable to increasing the depth of a CNN at a fraction of the cost. For example, we find ResNet-50 with gather-excite operators is able to outperform its 101-layer counterpart on ImageNet with no additional learnable parameters. 
We also propose a parametric gather-excite operator pair which yields further performance gains, relate it to the recently-introduced Squeeze-and-Excitation Networks, and analyse the effects of these changes to the CNN feature activation statistics.", "full_text": "Gather-Excite: Exploiting Feature Context in\n\nConvolutional Neural Networks\n\nJie Hu\u2217\nMomenta\n\nhujie@momenta.ai\n\nLi Shen\u2217\n\nVisual Geometry Group\nUniversity of Oxford\n\nlishen@robots.ox.ac.uk\n\nalbanie@robots.ox.ac.uk\n\nSamuel Albanie\u2217\n\nVisual Geometry Group\nUniversity of Oxford\n\nGang Sun\nMomenta\n\nsungang@momenta.ai\n\nAndrea Vedaldi\n\nVisual Geometry Group\nUniversity of Oxford\n\nvedaldi@robots.ox.ac.uk\n\nAbstract\n\nWhile the use of bottom-up local operators in convolutional neural networks\n(CNNs) matches well some of the statistics of natural images, it may also prevent\nsuch models from capturing contextual long-range feature interactions. In this work,\nwe propose a simple, lightweight approach for better context exploitation in CNNs.\nWe do so by introducing a pair of operators: gather, which ef\ufb01ciently aggregates\nfeature responses from a large spatial extent, and excite, which redistributes the\npooled information to local features. The operators are cheap, both in terms of\nnumber of added parameters and computational complexity, and can be integrated\ndirectly in existing architectures to improve their performance. Experiments on\nseveral datasets show that gather-excite can bring bene\ufb01ts comparable to increasing\nthe depth of a CNN at a fraction of the cost. For example, we \ufb01nd ResNet-50\nwith gather-excite operators is able to outperform its 101-layer counterpart on\nImageNet with no additional learnable parameters. 
We also propose a parametric\ngather-excite operator pair which yields further performance gains, relate it to the\nrecently-introduced Squeeze-and-Excitation Networks, and analyse the effects of\nthese changes to the CNN feature activation statistics.\n\n1\n\nIntroduction\n\nConvolutional neural networks (CNN) [21] are the gold-standard approach to problems such as image\nclassi\ufb01cation [20, 35, 9], object detection [32] and image segmentation [3]. Thus, there is a signi\ufb01cant\ninterest in improved CNN architectures. In computer vision, an idea that has often improved visual\nrepresentations is to augment functions that perform local decisions with functions that operate on\na larger context, providing a cue for resolving local ambiguities [39]. While the term \u201ccontext\u201d is\noverloaded [6], in this work we focus speci\ufb01cally on feature context, namely the information captured\nby the feature extractor responses (i.e. the CNN feature maps) as a whole, spread over the full spatial\nextent of the input image.\nIn many standard CNN architectures the receptive \ufb01elds of many feature extractors are theoretically\nalready large enough to cover the input image in full. However, the effective size of such \ufb01elds is\nin practice considerably smaller [27]. This may be one factor explaining why improving the use of\ncontext in deep networks can lead to better performance, as has been repeatedly demonstrated in\nobject detection and other applications [1, 26, 48].\n\n\u2217Equal contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: The interaction of a gather-excite operator pair, (\u03beG, \u03beE). The gather operator \u03beG \ufb01rst\naggregates feature responses across spatial neighbourhoods. 
The resulting aggregates are then passed,\ntogether with the original input tensor, to an excite operator \u03beE that produces an output that matches\nthe dimensions of the input.\n\nPrior work has illustrated that using simple aggregations of low level features can be effective at\nencoding contextual information for visual tasks, and may prove a useful alternative to iterative\nmethods based on higher level semantic features [44]. Demonstrating the effectiveness of such an\napproach, the recently proposed Squeeze-and-Excitation (SE) networks [15] showed that reweighting\nfeature channels as a function of features from the full extent of input can improve classi\ufb01cation\nperformance. In these models, the squeeze operator acts as a lightweight context aggregator and\nthe resulting embeddings are then passed to the reweighting function to ensure that it can exploit\ninformation beyond the local receptive \ufb01elds of each \ufb01lter.\nIn this paper, we build on this approach and further explore mechanisms to incorporate context\nthroughout the architecture of a deep network. Our goal is to explore more ef\ufb01cient algorithms as\nwell as the essential properties that make them work well. We formulate these \u201ccontext\u201d modules as\nthe composition of two operators: a gather operator, which aggregates contextual information across\nlarge neighbourhoods of each feature map, and an excite operator, which modulates the feature maps\nby conditioning on the aggregates.\nUsing this decomposition, we chart the space of designs that can exploit feature context in deep\nnetworks and explore the effect of different operators independently. Our study leads us to propose a\nnew, lightweight gather-excite pair of operators which yields signi\ufb01cant improvements across different\narchitectures, datasets and tasks, with minimal tuning of hyperparameters. 
We also investigate the\neffect of the operators on distributed representation learned by existing deep architectures: we \ufb01nd\nthe mechanism produces intermediate representations that exhibit lower class selectivity, suggesting\nthat providing access to additional context may enable greater feature re-use. The code for all models\nused in this work is publicly available at https://github.com/hujie-frank/GENet.\n\n2 The Gather-Excite Framework\n\nIn this section, we introduce the Gather-Excite (GE) framework and describe its operation.\nThe design is motivated by examining the \ufb02ow of information that is typical of CNNs. These models\ncompute a hierarchy of representations that transition gradually from spatial to channel coding.\nDeeper layers achieve greater abstraction by combining features from previous layers while reducing\nresolution, increasing the receptive \ufb01eld size of the units, and increasing the number of feature\nchannels.\nThe family of bag-of-visual-words [5, 47, 34] models demonstrated the effectiveness of pooling\nthe information contained in local descriptors to form a global image representation out of a local\none. Inspired by this observation, we aim to help convolutional networks exploit the contextual\ninformation contained in the \ufb01eld of feature responses computed by the network itself.\nTo this end, we construct a lightweight function to gather feature responses over large neighbourhoods\nand use the resulting contextual information to modulate original responses of the neighbourhood\nelements. Speci\ufb01cally, we de\ufb01ne a gather operator \u03beG which aggregates neuron responses over a\ngiven spatial extent, and an excite operator \u03beE which takes in both the aggregates and the original\ninput to produce a new tensor with the same dimensions of the original input. The GE operator pair\nis illustrated in Fig. 
1.\n\nFigure 2: Top-1 ImageNet validation error (%) for the proposed (left) GE-\u03b8\u2212 and (right) GE-\u03b8\ndesigns based on a ResNet-50 architecture (the baseline label indicates the performance of the\noriginal ResNet-50 model in both plots). For reference, ResNet-101 achieves a top-1 error of 22.20%.\nSee Sec. 3 for further details.\n\nMore formally, let x = {x^c : c \u2208 {1, . . . , C}} denote a collection of feature maps produced by the\nnetwork. To assess the effect of varying the size of the spatial region over which the gathering occurs,\nwe define the selection operator \u03b9(u, e) = {eu + \u03b4 : \u03b4 \u2208 [\u2212\u230a(2e \u2212 1)/2\u230b, \u230a(2e \u2212 1)/2\u230b]^2} where e\nrepresents the extent ratio of the selection. We then define a gather operator with extent ratio e to\nbe a function \u03beG : R^{H\u00d7W\u00d7C} \u2192 R^{H\u2032\u00d7W\u2032\u00d7C} (with H\u2032 = \u2308H/e\u2309 and W\u2032 = \u2308W/e\u2309) that satisfies, for any input x,\nthe constraint \u03beG(x)^c_u = \u03beG(x \u2299 1^c_{\u03b9(u,e)}), where u \u2208 {1, . . . , H\u2032} \u00d7 {1, . . . , W\u2032}, c \u2208 {1, . . . , C},\n1{\u00b7} denotes the indicator tensor and \u2299 is the Hadamard product. 
This notation simply states that\nat each output location u of the channel c, the gather operator has a receptive field of the input that\nlies within a single channel and has an area bounded by (2e \u2212 1)^2. If the field envelops the full input\nfeature map, we say that the gather operator has global extent. The objective of the excite operator\nis to make use of the gathered output as a contextual feature and takes the form \u03beE(x, \u02c6x) = x \u2299 f(\u02c6x),\nwhere f : R^{H\u2032\u00d7W\u2032\u00d7C} \u2192 [0, 1]^{H\u00d7W\u00d7C} is the map responsible for rescaling and distributing the\nsignal from the aggregates.\n\n3 Models and Experiments\n\nIn this section, we explore and evaluate a number of possible instantiations of the gather-excite\nframework. To compare the utility of each design, we conduct a series of experiments on the task of\nimage classification using the ImageNet 1K dataset [33]. The dataset contains 1.2 million training\nimages and 50k validation images. In the experiments that follow, all models are trained on the\ntraining set and evaluated on the validation set. We base our investigation on the popular ResNet-50\narchitecture which attains good performance on this dataset and has been shown to generalise\neffectively to a range of other domains [9]. New models are formed by inserting gather-excite\noperators into the residual branch immediately before summation with the identity branch of each\nbuilding block of ResNet-50. These models are trained from random initialisation [10] using SGD\nwith momentum 0.9 with minibatches of 256 images, each cropped to 224 \u00d7 224 pixels. The\ninitial learning rate is set to 0.1 and is reduced by a factor of 10 each time the loss plateaus (three\ntimes). 
Models typically train for approximately 300 epochs in total (note that this produces stronger\nmodels than the fixed 100-epoch optimisation schedule used in [15]). In all experiments, we report\nsingle-centre-crop results on the ImageNet validation set.\n\n3.1 Parameter-free pairings\n\nWe first consider a collection of GE pairings which require no additional learnable parameters. We\ntake the gather operator \u03beG to be average pooling with varying extent ratios (the effect of changing the\npooling operator is analysed in the suppl. material). The excite operator then resizes the aggregates,\napplies a sigmoid and multiplies the result with the input. Thus, each output feature map is computed\nas y^c = x \u2299 \u03c3(interp(\u03beG(x)^c)), where interp(\u00b7) denotes resizing to the original input size via nearest\nneighbour interpolation. We refer to this model as GE-\u03b8\u2212, where the notation \u03b8\u2212 is used to denote\nthat the operator is parameter-free (see footnote 2). A diagram illustrating how these operators are integrated into a\nresidual unit can be found in Fig. 4 of the supplementary material.\n\nModel | top-1 err. | top-5 err. | GFLOPs | #Params\nResNet-50 (Baseline) | 23.30 | 6.55 | 3.86 | 25.6 M\nGE-\u03b8 (stage2) | 23.29 | 6.50 | 3.86 | 28.0 M\nGE-\u03b8 (stage3) | 22.70 | 6.24 | 3.86 | 27.2 M\nGE-\u03b8 (stage4) | 22.50 | 6.20 | 3.86 | 26.8 M\nGE-\u03b8 (all) | 22.00 | 5.87 | 3.87 | 31.2 M\n\nTable 1: Effect (error %) of inserting GE operators at different stages of the baseline architecture ResNet-50.\n\nSpatial extent: This basic model allows us to test the central hypothesis of this paper, namely that\nproviding the network with access to simple summaries of additional feature context improves the\nrepresentational power of the network. To this end, our first experiment varies the spatial extent ratio\nof the GE-\u03b8\u2212 design: we consider values of e = {2, 4, 8}, as well as a global extent ratio using global\naverage pooling. 
The results of this experiment are shown in Fig. 2 (left). Each increase in the extent\nratio yields consistent improvements over the performance of the ResNet-50 baseline (23.30% top-1\nerror), with the global extent ratio achieving the strongest performance (22.14% top-1 error). This\nexperiment suggests that even with a simple parameter-free approach, context-based modulation can\nstrengthen the discriminative power of the network. Remarkably, this model is competitive with the\nmuch heavier ResNet-101 model (22.20% top-1 error). In all following experiments, except where\nnoted otherwise, a global extent ratio is used.\n\n3.2 Parameterised pairings\n\nWe have seen that simple gather-excite operators without learned parameters can offer an effective\nmechanism for exploiting context. To further explore the design space for these pairings, we next\nconsider the introduction of parameters into the gather function, \u03beG(\u03b8). In this work, we propose to\nuse strided depth-wise convolution as the gather operator, which applies spatial \ufb01lters to independent\nchannels of the input. We combine \u03beG(\u03b8) with the excite operator described in Sec. 3.1 and refer to\nthis pairing as GE-\u03b8.\nSpatial extent: We begin by repeating the experiment to assess the effect of an increased extent ratio\nfor the parameterised model. For parameter ef\ufb01ciency, varying extent ratios e is achieved by chaining\n3 \u00d7 3 stride 2 depth-wise convolutions (e/2 such convolutions are performed in total). For the global\nextent ratio, a single global depth-wise convolution is used. Fig. 2 (right) shows the results of this\nexperiment. 
We observe a similar overall trend to the GE-\u03b8\u2212 study and note that the introduction of\nadditional parameters brings expected improvements over the parameter-free design.\nEffect on different stages: We next investigate the influence of GE-\u03b8 on different stages (here\nwe use the term \u201cstage\u201d as it is defined in [9]) of the network by training model variants in which\nthe operators are inserted into each stage separately. The accuracy, computational cost and model\ncomplexity of the resulting models are shown in Tab. 1. While there is some improvement from\ninsertion at every stage, the greatest improvement comes from the mid and late stages (where there\nare also more channels). The effects of insertion at different stages are not mutually redundant, in the\nsense that they can be combined effectively to further bolster performance. For simplicity, we include\nGE operators throughout the network in all remaining experiments, but we note that if parameter\nstorage is an important concern, GE can be removed from Stage 2 at a marginal cost in performance.\n\nFootnote 2: Throughout this work, we use the term \u201cparameter-free\u201d to denote a model that requires no additional\nlearnable parameters. 
Under this definition, average pooling and nearest neighbour interpolation are parameter-free\noperations.\n\nModel | top-1 err. | top-5 err. | GFLOPs | #Params\nResNet-101 | 22.20 | 6.14 | 7.57 | 44.6 M\nResNet-50 (Baseline) | 23.30 | 6.55 | 3.86 | 25.6 M\nSE | 22.12 | 5.99 | 3.87 | 28.1 M\nGE-\u03b8\u2212 | 22.14 | 6.24 | 3.86 | 25.6 M\nGE-\u03b8 | 22.00 | 5.87 | 3.87 | 31.2 M\nGE-\u03b8+ | 21.88 | 5.80 | 3.87 | 33.7 M\n\nTable 2: Comparison of differing GE configurations with a ResNet-50 baseline on the ImageNet validation set\n(error %) and their respective complexities. The ResNet-101 model is included for reference.\n\nModel | top-1 err. | top-5 err. | GFLOPs | #Params\nResNet-152 | 21.87 | 5.78 | 11.28 | 60.3 M\nResNet-101 (Baseline) | 22.20 | 6.14 | 7.57 | 44.6 M\nSE | 20.94 | 5.50 | 7.58 | 49.4 M\nGE-\u03b8\u2212 | 21.47 | 5.69 | 7.58 | 44.6 M\nGE-\u03b8 | 21.46 | 5.45 | 7.59 | 53.7 M\nGE-\u03b8+ | 20.74 | 5.29 | 7.59 | 58.4 M\n\nTable 3: Comparison of differing GE configurations with a ResNet-101 baseline on the ImageNet validation set\n(error %) and their respective complexities. The GE-\u03b8\u2212(101) model outperforms a deeper ResNet-152 (included\nabove for reference).\n\nRelationship to Squeeze-and-Excitation Networks: The recently proposed Squeeze-and-\nExcitation Networks [15] can be viewed as a particular GE pairing, in which the gather operator\nis a parameter-free operation (global average pooling) and the excite operator is a fully connected\nsubnetwork. Given the strong performance of these networks (see [15] for details), a natural question\narises: are the benefits of parameterising the gather operator complementary to increasing the capacity\nof the excite operator? To answer this question, we experiment with a further variant, GE-\u03b8+,\nwhich combines the GE-\u03b8 design with a 1 \u00d7 1 convolutional channel subnetwork excite operator\n(supporting the use of variable spatial extent ratios). The parameterised excite operator thus takes\nthe form \u03beE(x, \u02c6x) = x \u2299 \u03c3(interp(f(\u02c6x|\u03b8))), where f(\u02c6x|\u03b8) matches the definition given in [15] (with\nreduction ratio 16). The performance of the resulting model is given in Tab. 2. 
We observe that the\nGE-\u03b8+ model not only outperforms the SE and GE-\u03b8 models, but approaches the performance of the\nconsiderably larger 152 layer ResNet (21.88% vs 21.87% top-1 error) at approximately one third of\nthe computational complexity.\n\n3.3 Generalisation\n\nDeeper networks: We next ask whether the improvements brought by incorporating GE operators\nare complementary to the bene\ufb01ts of increased network depth. To address this question, we train\ndeeper ResNet-101 variants of the GE-\u03b8\u2212, GE-\u03b8 and GE-\u03b8+ designs. The results are reported in\nTab. 3. It is important to note here that the GE operators themselves add layers to the architecture\n(thus this experiment does not control precisely for network depth). However, they do so in an\nextremely lightweight manner in comparison to the standard computational blocks that form the\nnetwork and we observe that the improvements achieved by GE transfer to the deeper ResNet-101\nbaseline, suggesting that to a reasonable degree, these gains are complementary to increasing the\ndepth of the underlying backbone network.\nResource constrained architectures: We have seen that GE operators can strengthen deep residual\nnetwork architectures. However, these models are largely composed of dense convolutional com-\nputational units. Driven by demand for mobile applications, a number of more sparsely connected\narchitectures have recently been proposed with a view to achieving good performance under strict\nresource constraints [14, 50]. We would therefore like to assess how well GE generalises to such\nscenarios. To answer this question, we conduct a series of experiments on the Shuf\ufb02eNet architec-\nture [50], an ef\ufb01cient model that achieves a good tradeoff between accuracy and latency. Results are\nreported in Tab. 4. 
In practice, we found these models challenging to optimise, and they required longer\ntraining schedules (\u2248 400 epochs) to reproduce the performance of the baseline model reported\nin [50] (training curves under a fixed schedule are provided in the suppl. material). We also found it\ndifficult to achieve improvements without the use of additional parameters. The GE-\u03b8 variants yield\nimprovements in performance at a fairly modest theoretical computational complexity. In scenarios\nfor which parameter storage represents the primary system constraint, a naive application of GE may\nbe less appropriate and more care is needed to achieve a good tradeoff between accuracy and storage\n(this may be achieved, for example, by using GE at a subset of the layers).\n\nShuffleNet variant | top-1 err. | top-5 err. | MFLOPs | #Params\nShuffleNet (Baseline) | 32.60 | 12.40 | 137.5 | 1.9 M\nSE | 31.24 | 11.38 | 139.9 | 2.5 M\nGE-\u03b8 (E2) | 32.40 | 12.31 | 138.9 | 2.0 M\nGE-\u03b8 (E4) | 32.32 | 12.24 | 139.1 | 2.1 M\nGE-\u03b8 (E8) | 32.12 | 12.11 | 139.2 | 2.2 M\nGE-\u03b8 | 31.80 | 11.98 | 140.8 | 3.6 M\nGE-\u03b8+ | 30.12 | 10.70 | 141.6 | 4.4 M\n\nTable 4: Comparison of differing GE configurations with a ShuffleNet baseline on the ImageNet validation set\n(error %) and their respective complexities. 
Here, ShuffleNet refers to \u201cShuffleNet 1 \u00d7 (g = 3)\u201d in [50].\n\nModel | ResNet-110 [10] | ResNet-164 [10] | WRN-16-8 [49]\nBaseline | 6.37 / 26.88 | 5.46 / 24.33 | 4.27 / 20.43\nSE | 5.21 / 23.85 | 4.39 / 21.31 | 3.88 / 19.14\nGE-\u03b8\u2212 | 6.01 / 26.58 | 5.12 / 23.94 | 4.12 / 20.25\nGE-\u03b8 | 5.57 / 24.29 | 4.67 / 21.86 | 4.02 / 19.76\nGE-\u03b8+ | 4.93 / 23.36 | 4.07 / 20.85 | 3.72 / 18.87\n\nTable 5: Classification error (%) on the CIFAR-10/100 test set with standard data augmentation (padding 4\npixels on each side, random crop and flip).\n\nBeyond ImageNet: We next assess the ability of GE operators to generalise to other datasets beyond\nImageNet. To this end, we conduct additional experiments on the CIFAR-10 and CIFAR-100 image\nclassification benchmarks [19]. These datasets consist of 32 \u00d7 32 color images drawn from 10\nclasses and 100 classes respectively. Each contains 50k train images and 10k test images. We\nadopt a standard data augmentation scheme (as used in [9, 16, 24]) to facilitate a useful comparative\nanalysis between models. During training, images are first zero-padded on each side with four\npixels, then a random 32 \u00d7 32 patch is produced from the padded image or its horizontal flip before\napplying mean/std normalization. We combine GE operators with several popular backbones for\nCIFAR: ResNet-110 [10], ResNet-164 [10] and the Wide Residual Network-16-8 [49]. The results\nare reported in Tab. 5. We observe that even on datasets with considerably different characteristics\n(e.g. 32 \u00d7 32 pixels), GE still yields good performance gains.\nBeyond image classification: We would like to evaluate whether GE operators can generalise\nto other tasks beyond image classification. 
For this purpose, we train an object detector on MS\nCOCO [25], a dataset which has approximately 80k training images and 40k validation images\n(we use the train-val splits provided in the 2014 release). Our experiment uses the Faster R-CNN\nframework [32] (replacing the RoIPool operation with RoIAlign proposed in [8]) and otherwise\nfollows the training settings in [9]. We train two variants: one with a ResNet-50 backbone and one\nwith a GE-\u03b8 (E8) backbone, keeping all other settings \ufb01xed. The ResNet-50 baseline performance is\n27.3% mAP. Incorporating the GE-\u03b8 backbone improves the baseline performance to 28.6% mAP.\n\n4 Analysis and Discussion\n\nEffect on learned representations: We have seen that GE operators can improve the performance of\na deep network for visual tasks and would like to gain some insight into how the learned features may\ndiffer from those found in the baseline ResNet-50 model. For this purpose, we use the class selectivity\nindex metric introduced by [28] to analyse the features of these models. This metric computes, for\neach feature map, the difference between the highest class-conditional mean activity and the mean of\nall remaining class-conditional activities over a given data distribution. The resulting measurement is\nnormalised such that it varies between zero and one, where one indicates that a \ufb01lter only \ufb01res for a\nsingle class and zero indicates that the \ufb01lter produced the same value for every class. 
The metric is\nof interest to our work because it provides some measure of the degree to which features are being\nshared across classes, a central property of distributed representations that can describe concepts\nefficiently [12].\n\nFigure 3: Each figure depicts the class selectivity index distribution for features in both the baseline\nResNet-50 and corresponding GE-\u03b8 network at various blocks in the fourth stage of their architectures.\nAs depth increases, we observe that the GE-\u03b8 model exhibits less class selectivity than the ResNet-50\nbaseline.\n\nFigure 4: Top-1 error (%) on the ImageNet training set (left) and validation set (right) of the ResNet-50\nbaseline and proposed GE-\u03b8 (global extent) model under a fixed-length training schedule.\n\nWe compute the class selectivity index for intermediate representations generated in the fourth stage\n(here we use the term \u201cstage\u201d as it is defined in [9]). The features of this stage have been shown to\ngeneralise well to other semantic tasks [31]. We compute class selectivity histograms for the last layer\nin each block in this stage of both models, and present the results of GE-\u03b8 and ResNet-50 in Fig. 3.\nAn interesting trend emerges: in the early blocks of the stage, the distribution of class selectivity for\nboth models appears to be closely matched. However, with increasing depth, the distributions begin\nto separate, and by conv4-6-relu the distributions appear more distinct with GE-\u03b8 exhibiting less\nclass selectivity than ResNet-50. Assuming that additional context may allow the network to better\nrecognise patterns that would be locally ambiguous, we hypothesise that networks without access to\nsuch context are required to allocate a greater number of highly specialised units that are devoted\nto the resolution of these ambiguities, reducing feature re-use. Additional analyses of the SE and\nGE-\u03b8\u2212 models can be found in the suppl. 
material.\nEffect on convergence: We explore how the usage of GE operators plays a role in the optimisation\nof deep networks. For this experiment, we train both a baseline ResNet-50 and a GE-\u03b8 model (with\nglobal extent ratio) from scratch on ImageNet using a fixed 100 epoch schedule. The learning rate\nis initialised to 0.1 and decreased by a factor of 10 every 30 epochs. The results of this experiment\nare shown in Fig. 4. We observe that the GE-\u03b8 model achieves lower training and validation error\nthroughout the course of the optimisation schedule. A similar trend was reported when training with\nSE blocks [15], which, as noted in Sec. 3.2, can be interpreted as a parameter-free gather operator and\na parameterised excite operator. By contrast, we found empirically that the GE-\u03b8\u2212 model does not\nexhibit the same ease of optimisation and takes longer to learn effective representations.\n\nFigure 5: Top-1 ImageNet validation accuracy for the GE-\u03b8 model after dropping a given ratio of feature maps\nout of the residual branch for each test image. Dashed line denotes the effect of dropping features with the lowest\nassigned importance scores first. Solid line denotes the effect of dropping features with the highest assigned\nimportance scores first. For reference, the black stars indicate the importance of these feature blocks to the\nResNet-50 model (see Sec. 4 for further details).\n\nFeature importance and performance: The gating mechanism of the excite operator allows the\nnetwork to perform feature selection throughout the learning process, using the feature importance\nscores that are assigned to the outputs of the gather operator. Features that are assigned a larger\nimportance will be preserved, and those with lower importance will be squashed towards zero. 
While\nintuitively we might expect that feature importance is a good predictor of the contribution of a feature\nto the overall network performance, we would like to verify this relationship. We conduct experiments\non a GE-\u03b8 network, based on the ResNet-50 architecture. We first examine the effect of pruning\nthe least important features: given a building block of the model, for each test image we sort the\nchannel importances induced by the gating mechanism in ascending order (labelled as \u201casc.\u201d in\nFig. 5), and set a portion (the prune ratio) of the values to zero in a first-to-last manner. As the prune\nratio increases, information flows through an increasingly small subset of features. Thus, no\nfeature maps are dropped out when the prune ratio is equal to zero, and the whole residual branch is\ndropped out when the ratio is equal to one (i.e., the information of the identity branch passes through\ndirectly). We repeat this experiment in reverse order, dropping out the most important features first\n(this process is labelled \u201cdes.\u201d in Fig. 5). This experiment is repeated for three building blocks\nin GE-\u03b8 (experiments for SE are included in the suppl. material). As a reference for the relative\nimportance of features contained in these residual branches, we additionally report the performance\nof the baseline ResNet-50 model with the prune ratio set to 0 and 1 respectively. We observe that\npreserving the features estimated as most important by the excite operator retains much of the\noverall accuracy during the early part of the pruning process before an increasingly strong decay\nin performance occurs. When reversing the pruning order, the shape of this performance curve\nis inverted, suggesting a consistent positive correlation between the estimated feature importance\nand overall performance. 
This trend is clearest for the deeper conv5-1 block, indicating a stronger\ndependence between primary features and concepts, which is consistent with \ufb01ndings in previous\nwork [22, 28]. While these feature importance estimates are instance-speci\ufb01c, they can also be used\nto probe the relationships between classes and different features [15], and may potentially be useful\nas a tool for interpreting the activations of networks.\n\n5 Related Work\n\nContext-based features have a rich history of use in computer vision, motivated by studies in perception\nthat have shown that contextual information in\ufb02uences the accuracy and ef\ufb01ciency of object\nrecognition and detection by humans [2, 13]. Several pioneering automated vision systems incorporated\ncontext as a component of sophisticated rule-based approaches to image understanding [36, 7];\nfor tasks such as object recognition and detection, low-dimensional, global descriptors have often\nproven effective as contextual clues [39, 30, 40]. Alternative approaches based on graphical models\nrepresent another viable mechanism for exploiting context [11, 29] and many other forms of contextual\nfeatures have been proposed [6]. A number of works have incorporated context for improving\nsemantic segmentation (e.g. [48, 23]), and in particular, ParseNet [26] showed that encoding context\nthrough global feature averaging can be highly effective for this task.\n\nThe Inception family of architectures [38, 37] popularised the use of multi-scale convolutional\nmodules, which help ensure the ef\ufb01cient aggregation of context throughout the hierarchy of learned\nrepresentations [17]. Variants of these modules have emerged in recent work on automated architecture\nsearch [51], suggesting that they are components of (at least) a local optimum in the current\ndesign space of network blocks. 
Recent work has developed both powerful and generic parameterised\nattention modules to allow the system to extract informative signals dynamically [46, 4, 45].\nTop-down attention modules [42] and self-attention [41] can be used to exploit global relationships\nbetween features. By reweighting features as a generic function of all pairwise interactions, non-local\nnetworks [43] showed that self-attention can be generalised to a broad family of global operator\nblocks useful for visual tasks.\nThere has also been considerable recent interest in developing more specialised, lightweight modules\nthat can be cheaply integrated into existing designs. Our work builds on the ideas developed in\nSqueeze-and-Excitation networks [15], which used global embeddings as part of the SE block design\nto provide context to the recalibration function. We draw particular inspiration from the studies\nconducted in [44], which showed that useful contextual information for localising objects can be\ninferred in a feed-forward manner from simple summaries of basic image descriptors (our aim is to\nincorporate such summaries of low-, mid- and high-level features throughout the model). In particular,\nwe take the SE emphasis on lightweight contextual mechanisms to its logical extreme, showing\nthat strong performance gains can be achieved by the GE-\u03b8\u2212 variant with no additional learnable\nparameters. We note that similar parameterised computational mechanisms have also been explored\nin the image restoration community [18], providing an interesting alternative interpretation of this\nfamily of module designs as learnable activation functions.\n\n6 Conclusion and Future Work\n\nIn this work we considered the question of how to ef\ufb01ciently exploit feature context in CNNs. 
We\nproposed the gather-excite (GE) framework to address this issue and provided experimental evidence\nthat demonstrates the effectiveness of this approach across multiple datasets and model architectures.\nIn future work we plan to investigate whether gather-excite operators may prove useful in other\ncomputer vision tasks such as semantic segmentation, which we anticipate may also bene\ufb01t from\nef\ufb01cient use of feature context.\nAcknowledgments. The authors would like to thank Andrew Zisserman and Aravindh Mahendran\nfor many helpful discussions. Samuel Albanie is supported by EPSRC AIMS CDT. Andrea Vedaldi\nis supported by ERC 638009-IDIU.\n\nReferences\n[1] Sean Bell, C Lawrence Zitnick, Kavita Bala, and Ross Girshick. Inside-outside net: Detecting\n\nobjects in context with skip pooling and recurrent neural networks. In CVPR, 2016. 1\n\n[2] Irving Biederman, Robert J Mezzanotte, and Jan C Rabinowitz. Scene perception: Detecting\n\nand judging objects undergoing relational violations. Cognitive psychology, 1982. 8\n\n[3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille.\nDeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and\nfully connected crfs. IEEE TPAMI, 2018. 1\n\n[4] Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua.\nSCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning.\nIn CVPR, 2017. 9\n\n[5] Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and C\u00e9dric Bray. Visual\n\ncategorization with bags of keypoints. In ECCV Workshop, 2004. 2\n\n[6] Santosh K Divvala, Derek Hoiem, James H Hays, Alexei A Efros, and Martial Hebert. An\n\nempirical study of context in object detection. In CVPR, 2009. 1, 8\n\n[7] A Hanson. Visions: A computer system for interpreting scenes. Computer vision systems, 1978. 8\n\n[8] Kaiming He, Georgia Gkioxari, Piotr Doll\u00e1r, and Ross Girshick. 
Mask R-CNN. In ICCV, 2017.\n\n6\n\n[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\n\nrecognition. In CVPR, 2016. 1, 3, 4, 6, 7\n\n[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual\n\nnetworks. In ECCV, 2016. 3, 6\n\n[11] Geremy Heitz and Daphne Koller. Learning spatial context: Using stuff to \ufb01nd things. In ECCV,\n\n2008. 8\n\n[12] Geoffrey E Hinton, James L McClelland, David E Rumelhart, et al. Distributed representations.\n\nParallel distributed processing: Explorations in the microstructure of cognition, 1986. 7\n\n[13] Howard S Hock, Gregory P Gordon, and Robert Whitehurst. Contextual relations: the in\ufb02uence\nof familiarity, physical plausibility, and belongingness. Perception & Psychophysics, 1974. 8\n\n[14] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias\nWeyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Ef\ufb01cient convolutional neural\nnetworks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. 5\n\n[15] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018. 2, 3, 4, 5, 7,\n\n8, 9\n\n[16] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with\n\nstochastic depth. In ECCV, 2016. 6\n\n[17] Tsung-Wei Ke, Michael Maire, and X Yu Stella. Multigrid neural architectures. In CVPR, 2017.\n\n9\n\n[18] Idan Kligvasser, Tamar Rott Shaham, and Tomer Michaeli. xUnit: Learning a spatial activation\n\nfunction for ef\ufb01cient image restoration. In CVPR, 2018. 9\n\n[19] Alex Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009. 6\n\n[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classi\ufb01cation with deep\n\nconvolutional neural networks. In NIPS, 2012. 1\n\n[21] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning\n\napplied to document recognition. 
Proceedings of the IEEE, 1998. 1\n\n[22] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief\nnetworks for scalable unsupervised learning of hierarchical representations. In ICML, 2009. 8\n\n[23] Guosheng Lin, Chunhua Shen, Anton Van Den Hengel, and Ian Reid. Ef\ufb01cient piecewise\n\ntraining of deep structured models for semantic segmentation. In CVPR, 2016. 8\n\n[24] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In ICLR, 2014. 6\n\n[25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,\nPiotr Doll\u00e1r, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV,\n2014. 6\n\n[26] Wei Liu, Andrew Rabinovich, and Alexander C Berg. ParseNet: Looking wider to see better. In\n\nICLR workshop, 2016. 1, 8\n\n[27] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive\n\n\ufb01eld in deep convolutional neural networks. In NIPS, 2016. 1\n\n[28] Ari S Morcos, David GT Barrett, Neil C Rabinowitz, and Matthew Botvinick. On the importance\n\nof single directions for generalization. In ICLR, 2018. 6, 8\n\n[29] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler,\nRaquel Urtasun, and Alan Yuille. The role of context for object detection and semantic\nsegmentation in the wild. In CVPR, 2014. 8\n\n[30] Kevin P Murphy, Antonio Torralba, and William T Freeman. Using the forest to see the trees:\n\nA graphical model relating features, objects, and scenes. In NIPS, 2004. 8\n\n[31] David Novotny, Samuel Albanie, Diane Larlus, and Andrea Vedaldi. Self-supervised learning\n\nof geometrically stable features through probabilistic introspection. In CVPR, 2018. 7\n\n[32] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time\n\nobject detection with region proposal networks. In NIPS, 2015. 
1, 6\n\n[33] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng\nHuang, Andrej Karpathy, Aditya Khosla, and Michael Bernstein. ImageNet large scale visual\nrecognition challenge. IJCV, 2015. 3\n\n[34] Jorge Sanchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. Image classi\ufb01cation\n\nwith the Fisher vector: Theory and practice. IJCV, 2013. 2\n\n[35] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale\n\nimage recognition. In ICLR, 2015. 1\n\n[36] Thomas M Strat and Martin A Fischler. Context-based vision: recognizing objects using\n\ninformation from both 2D and 3D imagery. IEEE TPAMI, 1991. 8\n\n[37] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4, inception-\n\nresnet and the impact of residual connections on learning. In ICLR Workshop, 2016. 9\n\n[38] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,\nDumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions.\nIn CVPR, 2015. 9\n\n[39] Antonio Torralba. Contextual priming for object detection. IJCV, 2003. 1, 8\n\n[40] Antonio Torralba, Kevin P Murphy, William T Freeman, Mark A Rubin, et al. Context-based\n\nvision system for place and object recognition. In ICCV, 2003. 8\n\n[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,\n\n\u0141ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017. 9\n\n[42] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang\nWang, and Xiaoou Tang. Residual attention network for image classi\ufb01cation. In CVPR, 2017. 9\n\n[43] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks.\n\nIn CVPR, 2018. 9\n\n[44] Lior Wolf and Stanley Bileschi. A critical view of context. IJCV, 2006. 2, 9\n\n[45] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 
CBAM: Convolutional\n\nblock attention module. In ECCV, 2018. 9\n\n[46] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov,\nRichard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation\nwith visual attention. In ICML, 2015. 9\n\n[47] Jun Yang, Yu-Gang Jiang, Alexander G Hauptmann, and Chong-Wah Ngo. Evaluating bag-of-\n\nvisual-words representations in scene classi\ufb01cation. In MIR, 2007. 2\n\n[48] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In\n\nICLR, 2016. 1, 8\n\n[49] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016. 6\n\n[50] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shuf\ufb02enet: An extremely ef\ufb01cient\n\nconvolutional neural network for mobile devices. In CVPR, 2018. 5, 6\n\n[51] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable\n\narchitectures for scalable image recognition. In CVPR, 2018. 9\n", "award": [], "sourceid": 5727, "authors": [{"given_name": "Jie", "family_name": "Hu", "institution": "Momenta"}, {"given_name": "Li", "family_name": "Shen", "institution": "University of Oxford"}, {"given_name": "Samuel", "family_name": "Albanie", "institution": "Oxford University"}, {"given_name": "Gang", "family_name": "Sun", "institution": "Momenta"}, {"given_name": "Andrea", "family_name": "Vedaldi", "institution": "Facebook AI Research and University of Oxford"}]}