{"title": "Stand-Alone Self-Attention in Vision Models", "book": "Advances in Neural Information Processing Systems", "page_first": 68, "page_last": 80, "abstract": "Convolutions are a fundamental building block of modern  computer vision systems. Recent approaches have argued for going beyond convolutions in order to capture long-range dependencies. These efforts focus on augmenting convolutional models with content-based interactions, such as self-attention and non-local means, to achieve gains on a number of vision tasks. The natural question that arises is whether attention can be a stand-alone primitive for vision models instead of serving as just an augmentation on top of convolutions. In developing and testing a pure self-attention vision model, we verify that self-attention can indeed be an effective stand-alone layer. A simple procedure of replacing all instances of spatial convolutions with a form of self-attention to ResNet-50 produces a fully self-attentional model that outperforms the baseline on ImageNet classification with 12% fewer FLOPS and 29% fewer parameters. On COCO object detection, a fully self-attention model matches the mAP of a baseline RetinaNet while having 39% fewer FLOPS and 34% fewer parameters. Detailed ablation studies demonstrate that self-attention is especially impactful when used in later layers. These results establish that stand-alone self-attention is an important addition to the vision practitioner's toolbox.", "full_text": "Stand-Alone Self-Attention in Vision Models\n\nPrajit Ramachandran\u2217\n\nNiki Parmar\u2217\n\nAshish Vaswani\u2217\n\nIrwan Bello\n\nAnselm Levskaya\u2020\n\nJonathon Shlens\n\nGoogle Research, Brain Team\n\n{prajit, nikip, avaswani}@google.com\n\nAbstract\n\nConvolutions are a fundamental building block of modern computer vision systems.\nRecent approaches have argued for going beyond convolutions in order to capture\nlong-range dependencies. These efforts focus on augmenting convolutional models\nwith content-based interactions, such as self-attention and non-local means, to\nachieve gains on a number of vision tasks. The natural question that arises is\nwhether attention can be a stand-alone primitive for vision models instead of\nserving as just an augmentation on top of convolutions. In developing and testing\na pure self-attention vision model, we verify that self-attention can indeed be an\neffective stand-alone layer. A simple procedure of replacing all instances of spatial\nconvolutions with a form of self-attention applied to ResNet model produces a fully\nself-attentional model that outperforms the baseline on ImageNet classi\ufb01cation with\n12% fewer FLOPS and 29% fewer parameters. On COCO object detection, a pure\nself-attention model matches the mAP of a baseline RetinaNet while having 39%\nfewer FLOPS and 34% fewer parameters. Detailed ablation studies demonstrate\nthat self-attention is especially impactful when used in later layers. These results\nestablish that stand-alone self-attention is an important addition to the vision\npractitioner\u2019s toolbox. Code for this project is made available.1\n\n1\n\nIntroduction\n\nDigital image processing arose from the recognition that handcrafted linear \ufb01lters applied convolu-\ntionally to pixelated imagery may subserve a large variety of applications [1]. 
The success of digital image processing as well as biological considerations [2, 3] inspired early practitioners of neural networks to exploit convolutional representations in order to provide parameter-efficient architectures for learning representations on images [4, 5].

The advent of large datasets [6] and compute resources [7] made convolutional neural networks (CNNs) the backbone for many computer vision applications [8–10]. The field of deep learning has in turn largely shifted toward the design of CNN architectures for improving the performance on image recognition [11–16], object detection [17–19] and image segmentation [20–22]. The translation equivariance property of convolutions has provided a strong motivation for adopting them as a building block for operating on images [23, 24]. However, capturing long-range interactions with convolutions is challenging because of their poor scaling properties with respect to large receptive fields.

∗Denotes equal contribution. Ordering determined by random shuffle.
†Work done as a member of the Google AI Residency Program.
1 https://github.com/google-research/google-research/tree/master/standalone_self_attention_in_vision_models

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The problem of long-range interactions has been tackled in sequence modeling through the use of attention. Attention has enjoyed rich success in tasks such as language modeling [25, 26], speech recognition [27, 28] and neural captioning [29]. Recently, attention modules have been employed in discriminative computer vision models to boost the performance of traditional CNNs. Most notably, a channel-based attention mechanism termed Squeeze-Excite may be applied to selectively modulate the scale of CNN channels [30, 31]. Likewise, spatially-aware attention mechanisms have been used to augment CNN architectures to provide contextual information for improving object detection [32] and image classification [33–35]. These works have used global attention layers as an add-on to existing convolutional models. This global form attends to all spatial locations of an input, limiting its usage to small inputs that typically require significant downsampling of the original image.

In this work, we ask whether content-based interactions can serve as the primary primitive of vision models instead of acting as an augmentation to convolution. To this end, we develop a simple local self-attention layer that can be used for both small and large inputs. We leverage this stand-alone attention layer to build a fully attentional vision model that outperforms the convolutional baseline for both image classification and object detection while being parameter and compute efficient. Furthermore, we conduct a number of ablations to better understand stand-alone attention. We hope that this result will spur new research directions focused on exploring content-based interactions as a mechanism for improving vision models.

2 Background

2.1 Convolutions

Convolutional neural networks (CNNs) are typically employed with small neighborhoods (i.e. kernel sizes) to encourage the network to learn local correlation structures within a particular layer.
Given an input x ∈ R^{h×w×d_in} with height h, width w, and input channels d_in, a local neighborhood N_k around a pixel x_ij is extracted with spatial extent k, resulting in a region with shape k × k × d_in (see Figure 1).

Given a learned weight matrix W ∈ R^{k×k×d_out×d_in}, the output y_ij ∈ R^{d_out} for position ij is defined by spatially summing the product of depthwise matrix multiplications of the input values:

y_{ij} = \sum_{a,b \in \mathcal{N}_k(i,j)} W_{i-a,\, j-b} \, x_{ab} \qquad (1)

where N_k(i, j) = {a, b : |a − i| ≤ k/2, |b − j| ≤ k/2} (see Figure 2). Importantly, CNNs employ weight sharing, where W is reused for generating the output for all pixel positions ij. Weight sharing enforces translation equivariance in the learned representation and consequently decouples the parameter count of the convolution from the input size.

Figure 1: An example of a local window around i = 3, j = 3 (one-indexed) with spatial extent k = 3.

Figure 2: An example of a 3 × 3 convolution. The output is the inner product between the local window and the learned weights.
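To make this computation concrete, a minimal NumPy sketch of Equation 1 follows. This is our illustration rather than the paper's released code; it assumes zero padding, stride 1, and odd k, and indexes the shared weights by window offset instead of the (i − a, j − b) convention above.

```python
import numpy as np

def local_conv2d(x, W):
    """Minimal sketch of Equation 1: contract each neighborhood N_k(i, j)
    against a shared weight tensor.

    x: (h, w, d_in) input feature map.
    W: (k, k, d_out, d_in) weights reused at every position ij; this weight
       sharing is what gives the convolution its translation equivariance.
    Assumes odd k, zero padding, and stride 1, so the output is (h, w, d_out).
    """
    h, w, d_in = x.shape
    k, _, d_out, _ = W.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    y = np.zeros((h, w, d_out))
    for i in range(h):
        for j in range(w):
            window = xp[i:i + k, j:j + k, :]  # the k x k x d_in region N_k(i, j)
            y[i, j] = np.einsum('abod,abd->o', W, window)
    return y

# Example: a 3 x 3 convolution over a 5 x 5 single-channel input.
x = np.random.randn(5, 5, 1)
W = np.random.randn(3, 3, 4, 1)
assert local_conv2d(x, W).shape == (5, 5, 4)
```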
A wide array of machine learning applications have leveraged convolutions to achieve competitive results including text-to-speech [36] and generative sequence models [37, 38]. Several efforts have reformulated convolutions to improve the predictive performance or the computational efficiency of a model. Notably, depthwise-separable convolutions provide a low-rank factorization of spatial and channel interactions [39–41]. Such factorizations have allowed for the deployment of modern CNNs on mobile and edge computing devices [42, 43]. Likewise, relaxing translation equivariance has been explored in locally connected networks for various vision applications [44].

2.2 Self-Attention

Attention was introduced by [45] for the encoder-decoder in a neural sequence transduction model to allow for content-based summarization of information from a variable length source sentence. The ability of attention to learn to focus on important regions within a context has made it a critical component in neural transduction models for several modalities [26, 29, 27]. Using attention as a primary mechanism for representation learning has seen widespread adoption in deep learning after [25], which entirely replaced recurrence with self-attention. Self-attention is defined as attention applied to a single context instead of across multiple contexts (in other words, the query, keys, and values, as defined later in this section, are all extracted from the same context). The ability of self-attention to directly model long-distance interactions and its parallelizability, which leverages the strengths of modern hardware, has led to state-of-the-art models for various tasks [46–51].

An emerging theme of augmenting convolution models with self-attention has yielded gains in several vision tasks. [32] show that self-attention is an instantiation of non-local means [52] and use it to achieve gains in video classification and object detection. [53] also show improvements on image classification and achieve state-of-the-art results on video action recognition tasks with a variant of non-local means. Concurrently, [33] also see significant gains in object detection and image classification through augmenting convolutional features with global self-attention features. This paper goes beyond [33] by removing convolutions and employing local self-attention across the entirety of the network. Another concurrent work [35] explores a similar line of thinking by proposing a new content-based layer to be used across the model. This approach is complementary to our focus on directly leveraging existing forms of self-attention for use across the vision model.

We now describe a stand-alone self-attention layer that can be used to replace spatial convolutions and build a fully attentional model. The attention layer is developed with a focus on simplicity by reusing innovations explored in prior works, and we leave it to future work to develop novel attentional forms.

Similar to a convolution, given a pixel x_ij ∈ R^{d_in}, we first extract a local region of pixels in positions ab ∈ N_k(i, j) with spatial extent k centered around x_ij, which we call the memory block. This form of local attention differs from prior work exploring attention in vision, which has performed global (i.e., all-to-all) attention between all pixels [32, 33]. Global attention can only be used after significant spatial downsampling has been applied to the input because it is computationally expensive, which prevents its usage across all layers in a fully attentional model.

Single-headed attention for computing the pixel output y_ij ∈ R^{d_out} is then computed as follows (see Figure 3):

y_{ij} = \sum_{a,b \in \mathcal{N}_k(i,j)} \mathrm{softmax}_{ab}\!\left( q_{ij}^{\top} k_{ab} \right) v_{ab} \qquad (2)

where the queries q_ij = W_Q x_ij, keys k_ab = W_K x_ab, and values v_ab = W_V x_ab are linear transformations of the pixel in position ij and the neighborhood pixels. softmax_ab denotes a softmax applied to all logits computed in the neighborhood of ij. W_Q, W_K, W_V ∈ R^{d_out×d_in} are all learned transforms. While local self-attention aggregates spatial information over neighborhoods similar to convolutions (Equation 1), the aggregation is done with a convex combination of value vectors with mixing weights (softmax_ab(·)) parametrized by content interactions. This computation is repeated for every pixel ij. In practice, multiple attention heads are used to learn multiple distinct representations of the input. Multi-head attention works by partitioning the pixel features x_ij depthwise into N groups x^n_ij ∈ R^{d_in/N}, computing single-headed attention on each group separately as above with different transforms W^n_Q, W^n_K, W^n_V ∈ R^{(d_out/N)×(d_in/N)} per head, and then concatenating the output representations into the final output y_ij ∈ R^{d_out}.

Figure 3: An example of a local attention layer over spatial extent of k = 3.

Figure 4: An example of relative distance computation. The relative distances are computed with respect to the position of the highlighted pixel. The format of distances is row offset, column offset.
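A minimal sketch of Equation 2, again our illustration rather than released code: it evaluates single-headed local attention pixel by pixel, assuming zero padding at image borders (a real implementation would vectorize these loops and add the positional terms introduced below).

```python
import numpy as np

def local_attention(x, W_Q, W_K, W_V, k):
    """Single-headed local self-attention (Equation 2), one pixel at a time.

    x: (h, w, d_in) input; W_Q, W_K, W_V: (d_out, d_in) learned transforms;
    k: odd spatial extent of the memory block N_k(i, j).
    """
    h, w, d_in = x.shape
    d_out = W_Q.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))   # zero-pad the borders
    y = np.zeros((h, w, d_out))
    for i in range(h):
        for j in range(w):
            q = W_Q @ x[i, j]                                # query for pixel ij
            mem = xp[i:i + k, j:j + k, :].reshape(-1, d_in)  # memory block, k*k pixels
            keys = mem @ W_K.T                               # k_ab for every ab
            values = mem @ W_V.T                             # v_ab for every ab
            logits = keys @ q                                # q^T k_ab
            weights = np.exp(logits - logits.max())
            weights /= weights.sum()                         # softmax_ab over the block
            y[i, j] = weights @ values                       # convex combination of values
    return y
```

Multi-head attention would split x depthwise into N groups, run this routine per group with separate W^n_Q, W^n_K, W^n_V, and concatenate the per-group outputs.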
As currently framed, no positional information is encoded in attention, which makes it permutation equivariant and limits its expressivity for vision tasks. Sinusoidal embeddings based on the absolute position of pixels in an image (ij) can be used [25], but early experimentation suggested that using relative positional embeddings [51, 46] results in significantly better accuracies. Instead, attention with 2D relative position embeddings, relative attention, is used. Relative attention starts by defining the relative distance of ij to each position ab ∈ N_k(i, j). The relative distance is factorized across dimensions, so each element ab ∈ N_k(i, j) receives two distances: a row offset a − i and a column offset b − j (see Figure 4). The row and column offsets are associated with embeddings r_{a−i} and r_{b−j} respectively, each with dimension ½ d_out. The row and column offset embeddings are concatenated to form r_{a−i,b−j}. This spatial-relative attention is now defined as

y_{ij} = \sum_{a,b \in \mathcal{N}_k(i,j)} \mathrm{softmax}_{ab}\!\left( q_{ij}^{\top} k_{ab} + q_{ij}^{\top} r_{a-i,\, b-j} \right) v_{ab} \qquad (3)

Thus, the logit measuring the similarity between the query and an element in N_k(i, j) is modulated both by the content of the element and the relative distance of the element from the query. Note that by infusing relative position information, self-attention also enjoys translation equivariance, similar to convolutions.

The parameter count of attention is independent of the size of the spatial extent, whereas the parameter count of a convolution grows quadratically with spatial extent. The computational cost of attention also grows more slowly with spatial extent compared to convolution with typical values of d_in and d_out. For example, if d_in = d_out = 128, a convolution layer with k = 3 has the same computational cost as an attention layer with k = 19.
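The following sketch adds the relative-position term of Equation 3 to the logits, and checks the cost claim above with a rough per-pixel multiply-add count. Both are our own simplifications: the embedding-table layout is one plausible choice, and the cost model ignores the softmax and other small terms.

```python
import numpy as np

def relative_logits(q, keys, k, r_row, r_col):
    """Attention logits of Equation 3: q^T k_ab + q^T r_{a-i, b-j}.

    q: (d_out,) query; keys: (k*k, d_out) keys for the memory block in
    row-major order; r_row, r_col: (k, d_out // 2) embedding tables for row
    and column offsets in [-k//2, k//2], so offset o maps to table row o + k//2.
    """
    logits = keys @ q                        # content-content term q^T k_ab
    half = k // 2
    idx = 0
    for row in range(-half, half + 1):
        for col in range(-half, half + 1):
            # Concatenate row- and column-offset embeddings into r_{a-i, b-j}.
            r = np.concatenate([r_row[row + half], r_col[col + half]])
            logits[idx] += q @ r             # content-relative term q^T r
            idx += 1
    return logits

# Rough per-pixel multiply-add counts behind the k = 19 parity claim:
# k*k*d*d for a k x k convolution, versus 3*d*d for the q, k, v transforms
# plus 2*k*k*d for computing logits and mixing values in attention.
d = 128
print(3 * 3 * d * d)              # convolution, k = 3:  147456
print(3 * d * d + 2 * 19**2 * d)  # attention,  k = 19:  141568
```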
3 Fully Attentional Vision Models

Given a local attention layer as a primitive, the question is how to construct a fully attentional architecture. We achieve this in two steps:

3.1 Replacing Spatial Convolutions

A spatial convolution is defined as a convolution with spatial extent k > 1. This definition excludes 1 × 1 convolutions, which may be viewed as a standard fully connected layer applied to each pixel independently.2 This work explores the straightforward strategy of creating a fully attentional vision model: take an existing convolutional architecture and replace every instance of a spatial convolution with an attention layer. A 2 × 2 average pooling with stride 2 operation follows the attention layer whenever spatial downsampling is required.

2Many deep learning libraries internally translate a 1 × 1 convolution to a simple matrix multiplication.

This work applies the transform to the ResNet family of architectures [15]. The core building block of a ResNet is a bottleneck block with a structure of a 1 × 1 down-projection convolution, a 3 × 3 spatial convolution, and a 1 × 1 up-projection convolution, followed by a residual connection between the input of the block and the output of the last convolution in the block. The bottleneck block is repeated multiple times to form the ResNet, with the output of one bottleneck block being the input of the next bottleneck block. The proposed transform swaps the 3 × 3 spatial convolution with a self-attention layer as defined in Equation 3. All other structure, including the number of layers and when spatial downsampling is applied, is preserved. This transformation strategy is simple but possibly suboptimal. Crafting the architecture with attention as a core component, such as with architecture search [54], holds the promise of deriving better architectures.

3.2 Replacing the Convolutional Stem

The initial layers of a CNN, sometimes referred to as the stem, play a critical role in learning local features such as edges, which later layers use to identify global objects. Due to input images being large, the stem typically differs from the core block, focusing on lightweight operations with spatial downsampling [11, 15]. For example, in a ResNet, the stem is a 7 × 7 convolution with stride 2 followed by 3 × 3 max pooling with stride 2.

At the stem layer, the content is comprised of RGB pixels that are individually uninformative and heavily spatially correlated. This property makes learning useful features such as edge detectors difficult for content-based mechanisms such as self-attention. Our early experiments verify that using the self-attention form described in Equation 3 in the stem underperforms compared to using the convolution stem of ResNet.

The distance-based weight parametrization of convolutions allows them to easily learn edge detectors and other local features necessary for higher layers. To bridge the gap between convolutions and self-attention while not significantly increasing computation, we inject distance-based information into the pointwise 1 × 1 convolution (W_V) through spatially-varying linear transformations. The new value transformation is

\tilde{v}_{ab} = \Big( \sum_{m} p(a, b, m) \, W_V^m \Big) x_{ab}

where the multiple value matrices W_V^m are combined through a convex combination of factors p(a, b, m) that are a function of the position of the pixel in its neighborhood. The position-dependent factors are similar to convolutions, which learn scalar weights dependent on the pixel location in a neighborhood. The stem is then comprised of the attention layer with spatially aware value features followed by max pooling. For simplicity, the attention receptive field aligns with the max pooling window. More details on the exact formulation of p(a, b, m) are given in the appendix.
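A sketch of this spatially-aware value computation follows (our illustration; the mixture weights p(a, b, m) are taken as given here, since their exact parametrization is deferred to the appendix).

```python
import numpy as np

def spatially_aware_values(window, W_V, p):
    """Stem value transform: v~_ab = (sum_m p(a, b, m) W_V^m) x_ab.

    window: (k, k, d_in) memory block of pixels x_ab; W_V: (M, d_out, d_in)
    stack of M value matrices; p: (k, k, M) position-dependent mixture
    weights, assumed to form a convex combination along the last axis.
    """
    k = window.shape[0]
    d_out = W_V.shape[1]
    v = np.zeros((k, k, d_out))
    for a in range(k):
        for b in range(k):
            # Position-specific value matrix: a convex mix of the M matrices.
            W_ab = np.einsum('m,mod->od', p[a, b], W_V)
            v[a, b] = W_ab @ window[a, b]
    return v
```

These \tilde{v}_{ab} then play the role of v_ab in Equation 3 for the stem layer, followed by max pooling as described above.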
4 Experiments

4.1 ImageNet Classification

Setup  We perform experiments on the ImageNet classification task [55], which contains 1.28 million training images and 50000 test images. The procedure described in Section 3.1 of replacing the spatial convolution layer with a self-attention layer inside each bottleneck block of a ResNet-50 [15] model is used to create the attention model. The multi-head self-attention layer uses a spatial extent of k = 7 and 8 attention heads. The position-aware attention stem as described above is used. The stem performs self-attention within each 4 × 4 spatial block of the original image, followed by batch normalization and a 4 × 4 max pool operation. Exact hyperparameters can be found in the appendix.

To study the behavior of these models with different computational budgets, we scale the model either by width or depth. For width scaling, the base width is linearly multiplied by a given factor across all layers. For depth scaling, a given number of layers are removed from each layer group. There are 4 layer groups, each with multiple layers operating on the same spatial dimensions. Groups are delineated by spatial downsampling. The 38 and 26 layer models remove 1 and 2 layers respectively from each layer group compared to the 50 layer model.

Results  Table 1 and Figure 5 show the results of the full attention variant compared with the convolution baseline. Compared to the ResNet-50 baseline, the full attention variant achieves 0.5% higher classification accuracy while having 12% fewer floating point operations (FLOPS)3 and 29% fewer parameters. Furthermore, this performance gain is consistent across most model variations generated by both depth and width scaling.

3Some prior works define a FLOP as a single atomic Multiply-Add, whereas we treat the Multiply and Add as 2 FLOPS. This causes a 2× discrepancy in the reported numbers.

                         ResNet-26               ResNet-38               ResNet-50
                         FLOPS  Params  Acc.     FLOPS  Params  Acc.     FLOPS  Params  Acc.
                         (B)    (M)     (%)      (B)    (M)     (%)      (B)    (M)     (%)
Baseline                 4.7    13.7    74.5     6.5    19.6    76.2     8.2    25.6    76.9
Conv-stem + Attention    4.5    10.3    75.8     5.7    14.1    77.1     7.0    18.0    77.4
Full Attention           4.7    10.3    74.8     6.0    14.1    76.9     7.2    18.0    77.6

Table 1: ImageNet classification results for a ResNet network with different depths. Baseline is a standard ResNet, Conv-stem + Attention uses spatial convolution in the stem and attention everywhere else, and Full Attention uses attention everywhere including the stem. The attention models outperform the baseline across all depths while having 12% fewer FLOPS and 29% fewer parameters.

Figure 5: Comparing parameters and FLOPS against accuracy on ImageNet classification across a range of network widths for ResNet-50. Attention models have fewer parameters and FLOPS while improving upon the accuracy of the baseline.

4.2 COCO Object Detection

Setup  In this section, we evaluate attention models on the COCO object detection task [56] using the RetinaNet architecture [18]. RetinaNet is an object detection model that consists of a backbone image classification network followed by a Feature Pyramid Network (FPN) [57] and two output networks known as detection heads. We experiment with making the backbone and/or the FPN and detection heads fully attentional. The backbone models are the same models described in Section 4.1. The details of how the FPN and detection heads are made fully attentional are provided in the appendix.

Results  Table 2 shows the object detection results. Using an attention-based backbone in the RetinaNet matches the mAP of using the convolutional backbone but contains 22% fewer parameters. Furthermore, employing attention across all parts of the model including the backbone, FPN, and detection heads matches the mAP of the baseline RetinaNet while using 34% fewer parameters and 39% fewer FLOPS. These results demonstrate the efficacy of stand-alone attention across multiple vision tasks.

Detection Heads + FPN   Backbone                FLOPS (B)  Params (M)  mAPcoco / 50 / 75     mAPs / m / l
Convolution             Baseline                182        33.4        36.5 / 54.3 / 39.0    18.3 / 40.6 / 51.7
Convolution             Conv-stem + Attention   173        25.9        36.8 / 54.6 / 39.3    18.4 / 41.1 / 51.7
Convolution             Full Attention          173        25.9        36.2 / 54.0 / 38.7    17.5 / 40.3 / 51.7
Attention               Conv-stem + Attention   111        22.0        36.6 / 54.3 / 39.1    19.0 / 40.7 / 51.1
Attention               Full Attention          110        22.0        36.6 / 54.5 / 39.2    18.5 / 40.6 / 51.6

Table 2: Object detection on the COCO dataset with RetinaNet [18]. Mean Average Precision (mAP) is reported at three different IoU values and for three different object sizes (small, medium, large). The fully attentional models achieve similar mAP as the baseline while having up to 39% fewer FLOPS and 34% fewer parameters.
Conv Groups   Attention Groups   FLOPS (B)   Params (M)   Top-1 Acc. (%)
-             1, 2, 3, 4         7.0         18.0         80.2
1             2, 3, 4            7.3         18.1         80.7
1, 2          3, 4               7.5         18.5         80.7
1, 2, 3       4                  8.0         20.8         80.2
1, 2, 3, 4    -                  8.2         25.6         79.5
2, 3, 4       1                  7.9         25.5         79.7
3, 4          1, 2               7.8         25.0         79.6
4             1, 2, 3            7.2         22.7         79.9

Table 3: Modifying which layer groups use which primitive. Accuracies computed on the validation set. The best performing models use convolutions for early groups and attention for later groups.

Spatial Extent (k × k)   FLOPS (B)   Top-1 Acc. (%)
3 × 3                    6.6         76.4
5 × 5                    6.7         77.2
7 × 7                    7.0         77.4
9 × 9                    7.3         77.7
11 × 11                  7.7         77.6

Table 4: Varying the spatial extent k. Parameter count is constant across all variations. Small k performs poorly, but the improvement from larger k plateaus.

4.3 Where is stand-alone attention most useful?

The impressive performance of fully attentional models verifies that stand-alone attention is a viable primitive for vision models. In this section, we study which parts of the network benefit the most from stand-alone attention.

Stem  First, we compare the performance of the attention stem against the convolution stem used in ResNet. All other spatial convolutions are replaced with stand-alone attention. Tables 1 and 2 and Figure 5 show the results on ImageNet classification and COCO object detection. For classification, the convolution stem consistently matches or outperforms the attention stem. For object detection, the convolution stem performs better when the detection heads and FPN are also convolutional, but performs similarly when the entire rest of the network is fully attentional. These results suggest that convolutions consistently perform well when used in the stem.

Full network  Next, we experiment with using convolution and stand-alone attention in different layer groups in a ResNet with a convolution stem. Table 3 shows that the best performing models use convolutions in the early groups and attention in the later groups. These models are also similar in terms of FLOPS and parameters to the fully attentional model. In contrast, when attention is used in the early groups and convolutions are used in the later groups, the performance degrades despite a large increase in the parameter count. This suggests that convolutions may better capture low level features while stand-alone attention layers may better integrate global information.

Taken together, these results suggest that vision practitioners should focus on developing strategies of designing architectures that combine the comparative advantages of convolution and stand-alone attention.
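As a concrete reading of Table 3 together with the bottleneck transform of Section 3.1, the following hypothetical sketch assembles a trunk whose early groups use convolution and later groups use attention; the function names and placeholder operations are ours, not the paper's code.

```python
import numpy as np

# Blocks per layer group in a ResNet-50-style trunk, and which spatial
# primitive each group uses. This mirrors the best rows of Table 3
# (convolution early, attention late); making every entry 'attention'
# recovers the fully attentional transform of Section 3.1.
GROUP_SIZES = [3, 4, 6, 3]
GROUP_PRIMITIVE = ['conv', 'conv', 'attention', 'attention']

def bottleneck(x, spatial_op, down_proj, up_proj):
    """One bottleneck block: 1x1 down-projection, spatial op, 1x1
    up-projection, plus the residual connection. The 1x1 convolutions are
    pixelwise matrix multiplies; spatial_op stands in for either a 3x3
    convolution or a k = 7 local self-attention layer. Widths are kept
    equal here for simplicity so the residual addition is well defined."""
    h = x @ down_proj.T   # (h, w, d) @ (d, d): pixelwise linear map
    h = spatial_op(h)     # the only operation the transform replaces
    h = h @ up_proj.T
    return x + h          # residual connection

# Example wiring with identity stand-ins for the two spatial primitives.
d = 8
x = np.random.randn(4, 4, d)
ops = {'conv': lambda t: t, 'attention': lambda t: t}  # placeholders
for size, prim in zip(GROUP_SIZES, GROUP_PRIMITIVE):
    for _ in range(size):
        x = bottleneck(x, ops[prim], np.eye(d), np.eye(d))
```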
Positional Encoding Type   FLOPS (B)   Params (M)   Top-1 Acc. (%)
none                       6.9         18.0         77.6
absolute                   6.9         18.0         78.2
relative                   7.0         18.0         80.2

Table 5: The effect of changing the positional encoding type for attention. Accuracies computed on the validation set. Relative encodings significantly outperform other strategies.

Attention Type   FLOPS (B)   Params (M)   Top-1 Acc. (%)
q⊤r              6.1         16.7         76.9
q⊤k + q⊤r        7.0         18.0         77.4

Table 6: The effect of removing the q⊤k interactions in attention. Using just q⊤r interactions only drops accuracy by 0.5%.

Attention Stem Type              FLOPS (B)   Top-1 Acc. (%)
stand-alone                      7.1         76.2
spatial convolution for values   7.4         77.2
spatially aware values           7.2         77.6

Table 7: Ablating the form of the attention stem. Spatially-aware value attention outperforms both stand-alone attention and values generated by a spatial convolution.

4.4 Which components are important in attention?

This section presents ablations designed to understand the contributions of the various components in the local attention layer. Unless specified, all attention models in the ablations use the convolution stem.

4.4.1 Effect of spatial extent of self-attention

The value of the spatial extent k controls the size of the region each pixel can attend to. Table 4 studies the effect of varying the spatial extent. While using a small k, such as k = 3, has a large negative impact on performance, the improvements of using a larger k plateau around k = 11. The exact plateau value likely depends on specific settings of hyperparameters such as the feature size and number of attention heads used.

4.4.2 Importance of positional information

Table 5 ablates the different types of positional encodings that can be used: no positional encoding, a sinusoidal encoding dependent on the absolute position of a pixel [25], and relative position encodings. Using any notion of positional encoding is beneficial over using none, but the type of positional encoding is also important. Relative position encodings perform 2% better than absolute encodings. Furthermore, Table 6 demonstrates the important role of the content-relative interactions (q⊤r) in attention. Removing the content-content (q⊤k) interactions and using just the content-relative interactions drops the accuracy by only 0.5%. The importance of positional information suggests that future work may improve attention by exploring different parameterizations and usages of positional information.

4.4.3 Importance of spatially-aware attention stem

Table 7 compares using stand-alone attention in the stem against the attention stem with spatially-aware values proposed in Section 3.2. The proposed attention stem outperforms stand-alone attention by 1.4% despite having a similar number of FLOPS, validating the utility of modifying attention for use in the stem. Furthermore, applying a spatial convolution to the values instead of the spatially-aware mixture of point-wise transformations proposed in Section 3.2 incurs more FLOPS and performs slightly worse. Future work can focus on unifying the spatially-aware attention used in the stem with the attention used in the main trunk of the network.

5 Discussion

In this work, we verified that content-based interactions can indeed serve as the primary primitive of vision models. A fully attentional network based on the proposed stand-alone local self-attention layer achieves competitive predictive performance on ImageNet classification and COCO object detection tasks while requiring fewer parameters and floating point operations than the corresponding convolution baselines.
Furthermore, ablations show that attention is especially effective in the later parts of the network.

We see several opportunities for improving the performance of these networks. First, the attention mechanism may be improved by developing better methods for capturing geometries [58, 59]. Second, the architectures employed for image classification and object detection were developed by applying a simple transformation to models designed for the convolutional primitive [13, 19]. It may be possible to achieve improvements by specifically searching for the architecture with an attention layer as a component in the design search space [31, 16, 21, 60]. Finally, additional work on proposing new attention forms that can capture low level features can make attention effective in the early layers of networks [61, 62].

Although the training efficiency and computational demand of an attention-based architecture are favorable compared to a traditional convolution, the resulting network is slower in wall-clock time. The reason for this discrepancy is the lack of optimized kernels available on various hardware accelerators. In principle, depending on the degree to which the field deems that attention provides a viable path, it may be possible to significantly speed up the wall-clock time for training and inference accordingly.

While this work primarily focuses on content-based interactions to establish their virtue for vision tasks, in the future, we hope to unify convolution and self-attention to best combine their unique advantages. Given the success of content-based interactions on core computer vision tasks, we expect that future work may explore how attention could be applied to other vision tasks such as semantic segmentation [63], instance segmentation [64], keypoint detection [65], human pose estimation [66, 67] and other tasks currently addressed with convolutional neural networks.

Acknowledgments

We thank Blake Hechtman, Justin Gilmer, Pieter-jan Kindermans, Quoc Le, Samy Bengio, and Shibo Wang for fruitful discussions and assistance with implementations as well as the larger Google Brain team for support and assistance.

References

[1] R. C. Gonzalez, R. E. Woods, et al., "Digital image processing," Publishing house of electronics industry, vol. 141, no. 7, 2002.

[2] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological cybernetics, vol. 36, no. 4, pp. 193–202, 1980.

[3] K. Fukushima, "Neocognitron: A hierarchical neural network capable of visual pattern recognition," Neural networks, vol. 1, no. 2, pp. 119–130, 1988.

[4] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural computation, vol. 1, no. 4, pp. 541–551, 1989.

[5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 1998.

[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009.

[7] J. Nickolls and W. J. Dally, "The gpu computing era," IEEE micro, vol. 30, no. 2, pp. 56–69, 2010.

[8] A.
Krizhevsky, \u201cLearning multiple layers of features from tiny images,\u201d tech. rep., University of\n\nToronto, 2009.\n\n[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, \u201cImagenet classi\ufb01cation with deep convolutional\n\nneural networks,\u201d in Advances in Neural Information Processing System, 2012.\n\n[10] Y. LeCun, Y. Bengio, and G. Hinton, \u201cDeep learning,\u201d nature, vol. 521, no. 7553, p. 436, 2015.\n\n[11] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and\nA. Rabinovich, \u201cGoing deeper with convolutions,\u201d in IEEE Conference on Computer Vision and\nPattern Recognition, 2015.\n\n[12] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, \u201cRethinking the Inception architec-\nture for computer vision,\u201d in IEEE Conference on Computer Vision and Pattern Recognition,\n2016.\n\n[13] K. He, X. Zhang, S. Ren, and J. Sun, \u201cIdentity mappings in deep residual networks,\u201d in European\n\nConference on Computer Vision, 2016.\n\n[14] S. Xie, R. Girshick, P. Doll\u00e1r, Z. Tu, and K. He, \u201cAggregated residual transformations for deep\nneural networks,\u201d in Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, 2017.\n\n[15] K. He, X. Zhang, S. Ren, and J. Sun, \u201cDeep residual learning for image recognition,\u201d in IEEE\n\nConference on Computer Vision and Pattern Recognition, 2016.\n\n[16] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, \u201cLearning transferable architectures for scalable\nimage recognition,\u201d in Proceedings of the IEEE conference on computer vision and pattern\nrecognition, pp. 8697\u20138710, 2018.\n\n[17] T.-Y. Lin, P. Doll\u00e1r, R. Girshick, K. He, B. Hariharan, and S. Belongie, \u201cFeature pyramid\nnetworks for object detection,\u201d in Proceedings of the IEEE Conference on Computer Vision and\nPattern Recognition, 2017.\n\n[18] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Doll\u00e1r, \u201cFocal loss for dense object detection,\u201d in\n\nProceedings of the IEEE international conference on computer vision, pp. 2980\u20132988, 2017.\n\n[19] S. Ren, K. He, R. Girshick, and J. Sun, \u201cFaster R-CNN: Towards real-time object detection with\nregion proposal networks,\u201d in Advances in Neural Information Processing Systems, pp. 91\u201399,\n2015.\n\n[20] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, \u201cDeeplab: Semantic\nimage segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,\u201d\nIEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834\u2013848,\n2018.\n\n[21] L.-C. Chen, M. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens,\n\u201cSearching for ef\ufb01cient multi-scale architectures for dense image prediction,\u201d in Advances in\nNeural Information Processing Systems, pp. 8713\u20138724, 2018.\n\n[22] K. He, G. Gkioxari, P. Doll\u00e1r, and R. Girshick, \u201cMask r-cnn,\u201d in Proceedings of the IEEE\n\ninternational conference on computer vision, pp. 2961\u20132969, 2017.\n\n[23] E. P. Simoncelli and B. A. Olshausen, \u201cNatural image statistics and neural representation,\u201d\n\nAnnual review of neuroscience, vol. 24, no. 1, pp. 1193\u20131216, 2001.\n\n[24] D. L. Ruderman and W. Bialek, \u201cStatistics of natural images: Scaling in the woods,\u201d in Advances\n\nin neural information processing systems, pp. 551\u2013558, 1994.\n\n[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. 
Jones, A. N. Gomez, \u0141. Kaiser, and\nI. Polosukhin, \u201cAttention is all you need,\u201d in Advances in Neural Information Processing\nSystems, pp. 5998\u20136008, 2017.\n\n10\n\n\f[26] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao,\nK. Macherey, et al., \u201cGoogle\u2019s neural machine translation system: Bridging the gap between\nhuman and machine translation,\u201d arXiv preprint arXiv:1609.08144, 2016.\n\n[27] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, \u201cAttention-based models\nfor speech recognition,\u201d in Advances in neural information processing systems, pp. 577\u2013585,\n2015.\n\n[28] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, \u201cListen, attend and spell: A neural network for\nlarge vocabulary conversational speech recognition,\u201d in 2016 IEEE International Conference on\nAcoustics, Speech and Signal Processing (ICASSP), pp. 4960\u20134964, IEEE, 2016.\n\n[29] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio,\n\u201cShow, attend and tell: Neural image caption generation with visual attention,\u201d in International\nconference on machine learning, pp. 2048\u20132057, 2015.\n\n[30] J. Hu, L. Shen, and G. Sun, \u201cSqueeze-and-excitation networks,\u201d in Proceedings of the IEEE\n\nConference on Computer Vision and Pattern Recognition, 2018.\n\n[31] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, \u201cMnasnet: Platform-aware neural\narchitecture search for mobile,\u201d in Proceedings of the IEEE Conference on Computer Vision\nand Pattern Recognition, 2018.\n\n[32] X. Wang, R. Girshick, A. Gupta, and K. He, \u201cNon-local neural networks,\u201d in Proceedings of the\n\nIEEE Conference on Computer Vision and Pattern Recognition, pp. 7794\u20137803, 2018.\n\n[33] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, \u201cAttention augmented convolutional\n\nnetworks,\u201d CoRR, vol. abs/1904.09925, 2019.\n\n[34] J. Hu, L. Shen, S. Albanie, G. Sun, and A. Vedaldi, \u201cGather-excite: Exploiting feature context\nin convolutional neural networks,\u201d in Advances in Neural Information Processing Systems,\npp. 9423\u20139433, 2018.\n\n[35] H. Hu, Z. Zhang, Z. Xie, and S. Lin, \u201cLocal relation networks for image recognition,\u201d arXiv\n\npreprint arXiv:1904.11491, 2019.\n\n[36] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner,\nA. Senior, and K. Kavukcuoglu, \u201cWavenet: A generative model for raw audio,\u201d arXiv preprint\narXiv:1609.03499, 2016.\n\n[37] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, \u201cPixelCNN++: Improving the Pix-\nelCNN with discretized logistic mixture likelihood and other modi\ufb01cations,\u201d arXiv preprint\narXiv:1701.05517, 2017.\n\n[38] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, \u201cConvolutional sequence to\n\nsequence learning,\u201d CoRR, vol. abs/1705.03122, 2017.\n\n[39] L. Sifre and S. Mallat, \u201cRigid-motion scattering for image classi\ufb01cation,\u201d PhD thesis, Ph. D.\n\nthesis, vol. 1, p. 3, 2014.\n\n[40] S. Ioffe and C. Szegedy, \u201cBatch normalization: Accelerating deep network training by reducing\n\ninternal covariate shift,\u201d in International Conference on Learning Representations, 2015.\n\n[41] F. Chollet, \u201cXception: Deep learning with depthwise separable convolutions,\u201d in Proceedings of\n\nthe IEEE Conference on Computer Vision and Pattern Recognition, 2017.\n\n[42] A. G. Howard, M. Zhu, B. Chen, D. 
Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and\nH. Adam, \u201cMobilenets: Ef\ufb01cient convolutional neural networks for mobile vision applications,\u201d\narXiv preprint arXiv:1704.04861, 2017.\n\n[43] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, \u201cMobilenetv2: Inverted\nresiduals and linear bottlenecks,\u201d in Proceedings of the IEEE Conference on Computer Vision\nand Pattern Recognition, pp. 4510\u20134520, 2018.\n\n11\n\n\f[44] S. Bartunov, A. Santoro, B. Richards, L. Marris, G. E. Hinton, and T. Lillicrap, \u201cAssessing the\nscalability of biologically-motivated deep learning algorithms and architectures,\u201d in Advances\nin Neural Information Processing Systems, pp. 9368\u20139378, 2018.\n\n[45] D. Bahdanau, K. Cho, and Y. Bengio, \u201cNeural machine translation by jointly learning to align\n\nand translate,\u201d in International Conference on Learning Representations, 2015.\n\n[46] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, C. Hawthorne, A. M. Dai, M. D. Hoffman,\n\nand D. Eck, \u201cMusic transformer,\u201d in Advances in Neural Processing Systems, 2018.\n\n[47] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, \u201cLanguage models are\n\nunsupervised multitask learners,\u201d OpenAI Blog, vol. 1, p. 8, 2019.\n\n[48] J. Devlin, M. Chang, K. Lee, and K. Toutanova, \u201cBERT: pre-training of deep bidirectional\n\ntransformers for language understanding,\u201d CoRR, vol. abs/1810.04805, 2018.\n\n[49] N. Parmar, A. Vaswani, J. Uszkoreit, \u0141. Kaiser, N. Shazeer, A. Ku, and D. Tran, \u201cImage\n\ntransformer,\u201d in International Conference on Machine Learning, 2018.\n\n[50] N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee,\nM. Hong, C. Young, R. Sepassi, and B. A. Hechtman, \u201cMesh-tensor\ufb02ow: Deep learning for\nsupercomputers,\u201d CoRR, vol. abs/1811.02084, 2018.\n\n[51] P. Shaw, J. Uszkoreit, and A. Vaswani, \u201cSelf-attention with relative position representations,\u201d\n\narXiv preprint arXiv:1803.02155, 2018.\n\n[52] A. Buades, B. Coll, and J.-M. Morel, \u201cA non-local algorithm for image denoising,\u201d in Pro-\nceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern\nRecognition (CVPR\u201905) - Volume 2 - Volume 02, CVPR \u201905, (Washington, DC, USA), pp. 60\u201365,\nIEEE Computer Society, 2005.\n\n[53] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng, \u201cA\u02c6 2-nets: Double attention networks,\u201d in\n\nAdvances in Neural Information Processing Systems, pp. 352\u2013361, 2018.\n\n[54] B. Zoph and Q. V. Le, \u201cNeural architecture search with reinforcement learning,\u201d in International\n\nConference on Learning Representations, 2017.\n\n[55] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,\nA. Khosla, M. S. Bernstein, A. C. Berg, and F. Li, \u201cImagenet large scale visual recognition\nchallenge,\u201d CoRR, vol. abs/1409.0575, 2014.\n\n[56] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r, and C. L. Zitnick,\n\u201cMicrosoft coco: Common objects in context,\u201d in European Conference on Computer Vision,\npp. 740\u2013755, Springer, 2014.\n\n[57] T.-Y. Lin, P. Doll\u00e1r, R. Girshick, K. He, B. Hariharan, and S. Belongie, \u201cFeature pyramid\nnetworks for object detection,\u201d in Proceedings of the IEEE Conference on Computer Vision and\nPattern Recognition, pp. 2117\u20132125, 2017.\n\n[58] T. S. Cohen, M. Geiger, J. 
K\u00f6hler, and M. Welling, \u201cSpherical cnns,\u201d arXiv preprint\n\narXiv:1801.10130, 2018.\n\n[59] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling, \u201cGauge equivariant convolutional\n\nnetworks and the icosahedral cnn,\u201d arXiv preprint arXiv:1902.04615, 2019.\n\n[60] G. Ghiasi, T.-Y. Lin, R. Pang, and Q. V. Le, \u201cNas-fpn: Learning scalable feature pyramid\n\narchitecture for object detection,\u201d arXiv preprint arXiv:1904.07392, 2019.\n\n[61] F. Wu, A. Fan, A. Baevski, Y. N. Dauphin, and M. Auli, \u201cPay less attention with lightweight\n\nand dynamic convolutions,\u201d arXiv preprint arXiv:1901.10430, 2019.\n\n[62] X. Zhu, D. Cheng, Z. Zhang, S. Lin, and J. Dai, \u201cAn empirical study of spatial attention\n\nmechanisms in deep networks,\u201d arXiv preprint arXiv:1904.05873, 2019.\n\n12\n\n\f[63] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, \u201cDeeplab: Semantic\nimage segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,\u201d\nIEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834\u2013848,\n2017.\n\n[64] L.-C. Chen, A. Hermans, G. Papandreou, F. Schroff, P. Wang, and H. Adam, \u201cMasklab: Instance\nsegmentation by re\ufb01ning object detection with semantic and direction features,\u201d in Proceedings\nof the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4013\u20134022, 2018.\n\n[65] D. DeTone, T. Malisiewicz, and A. Rabinovich, \u201cSuperpoint: Self-supervised interest point\ndetection and description,\u201d in Proceedings of the IEEE Conference on Computer Vision and\nPattern Recognition Workshops, pp. 224\u2013236, 2018.\n\n[66] A. Toshev and C. Szegedy, \u201cDeeppose: Human pose estimation via deep neural networks,\u201d in\nProceedings of the IEEE conference on computer vision and pattern recognition, pp. 1653\u20131660,\n2014.\n\n[67] A. Newell, K. Yang, and J. Deng, \u201cStacked hourglass networks for human pose estimation,\u201d in\n\nEuropean Conference on Computer Vision, pp. 483\u2013499, Springer, 2016.\n\n[68] Y. E. NESTEROV, \u201cA method for solving the convex programming problem with convergence\n\nrate o(1/k2),\u201d Dokl. Akad. Nauk SSSR, vol. 269, pp. 543\u2013547, 1983.\n\n[69] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, \u201cOn the importance of initialization and\n\nmomentum in deep learning,\u201d in International Conference on Machine Learning, 2013.\n\n[70] I. Loshchilov and F. Hutter, \u201cSGDR: Stochastic gradient descent with warm restarts,\u201d arXiv\n\npreprint arXiv:1608.03983, 2016.\n\n[71] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia,\nN. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau,\nJ. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho,\nD. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan,\nD. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke,\nA. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami,\nR. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek,\nE. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan,\nG. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. 
H.\nYoon, \u201cIn-datacenter performance analysis of a tensor processing unit,\u201d SIGARCH Comput.\nArchit. News, vol. 45, pp. 1\u201312, June 2017.\n\n[72] B. Polyak and A. Juditsky, \u201cAcceleration of stochastic approximation by averaging,\u201d SIAM\n\nJournal on Control and Optimization, vol. 30, no. 4, pp. 838\u2013855, 1992.\n\n[73] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and\nA. Rabinovich, \u201cGoing deeper with convolutions,\u201d in Computer Vision and Pattern Recognition\n(CVPR), 2015.\n\n13\n\n\f", "award": [], "sourceid": 41, "authors": [{"given_name": "Prajit", "family_name": "Ramachandran", "institution": "Google Brain"}, {"given_name": "Niki", "family_name": "Parmar", "institution": "Google"}, {"given_name": "Ashish", "family_name": "Vaswani", "institution": "Google Brain"}, {"given_name": "Irwan", "family_name": "Bello", "institution": "Google Brain"}, {"given_name": "Anselm", "family_name": "Levskaya", "institution": "Google"}, {"given_name": "Jon", "family_name": "Shlens", "institution": "Google Research"}]}