{"title": "An intriguing failing of convolutional neural networks and the CoordConv solution", "book": "Advances in Neural Information Processing Systems", "page_first": 9605, "page_last": 9616, "abstract": "Few ideas have enjoyed as large an impact on deep learning as convolution. For any problem involving pixels or spatial representations, common intuition holds that convolutional neural networks may be appropriate. In this paper we show a striking counterexample to this intuition via the seemingly trivial coordinate transform problem, which simply requires learning a mapping between coordinates in (x,y) Cartesian space and coordinates in one-hot pixel space. Although convolutional networks would seem appropriate for this task, we show that they fail spectacularly. We demonstrate and carefully analyze the failure first on a toy problem, at which point a simple fix becomes obvious. We call this solution CoordConv, which works by giving convolution access to its own input coordinates through the use of extra coordinate channels. Without sacrificing the computational and parametric efficiency of ordinary convolution, CoordConv allows networks to learn either complete translation invariance or varying degrees of translation dependence, as required by the end task. CoordConv solves the coordinate transform problem with perfect generalization and 150 times faster with 10--100 times fewer parameters than convolution. This stark contrast raises the question: to what extent has this inability of convolution persisted insidiously inside other tasks, subtly hampering performance from within? A complete answer to this question will require further investigation, but we show preliminary evidence that swapping convolution for CoordConv can improve models on a diverse set of tasks. Using CoordConv in a GAN produced less mode collapse as the transform between high-level spatial latents and pixels becomes easier to learn. 
A Faster R-CNN detection model trained on MNIST detection showed 24% better IOU when using CoordConv, and in the Reinforcement Learning (RL) domain agents playing Atari games benefit significantly from the use of CoordConv layers.", "full_text": "An intriguing failing of convolutional neural networks\n\nand the CoordConv solution\n\nRosanne Liu1\n\nJoel Lehman1\n\nPiero Molino1\n\nFelipe Petroski Such1\n\nEric Frank1\n\nAlex Sergeev2\n\nJason Yosinski1\n\n1Uber AI Labs, San Francisco, CA, USA\n\n2Uber Technologies, Seattle, WA, USA\n\n{rosanne,joel.lehman,piero,felipe.such,mysterefrank,asergeev,yosinski}@uber.com\n\nAbstract\n\nFew ideas have enjoyed as large an impact on deep learning as convolution. For any\nproblem involving pixels or spatial representations, common intuition holds that\nconvolutional neural networks may be appropriate. In this paper we show a striking\ncounterexample to this intuition via the seemingly trivial coordinate transform\nproblem, which simply requires learning a mapping between coordinates in (x, y)\nCartesian space and coordinates in one-hot pixel space. Although convolutional\nnetworks would seem appropriate for this task, we show that they fail spectacularly.\nWe demonstrate and carefully analyze the failure \ufb01rst on a toy problem, at which\npoint a simple \ufb01x becomes obvious. We call this solution CoordConv, which\nworks by giving convolution access to its own input coordinates through the use of\nextra coordinate channels. Without sacri\ufb01cing the computational and parametric\nef\ufb01ciency of ordinary convolution, CoordConv allows networks to learn either\ncomplete translation invariance or varying degrees of translation dependence, as\nrequired by the end task. CoordConv solves the coordinate transform problem with\nperfect generalization and 150 times faster with 10\u2013100 times fewer parameters\nthan convolution. 
This stark contrast raises the question: to what extent has this\ninability of convolution persisted insidiously inside other tasks, subtly hampering\nperformance from within? A complete answer to this question will require further\ninvestigation, but we show preliminary evidence that swapping convolution for\nCoordConv can improve models on a diverse set of tasks. Using CoordConv in\na GAN produced less mode collapse as the transform between high-level spatial\nlatents and pixels becomes easier to learn. A Faster R-CNN detection model\ntrained on MNIST detection showed 24% better IOU when using CoordConv, and\nin the Reinforcement Learning (RL) domain agents playing Atari games bene\ufb01t\nsigni\ufb01cantly from the use of CoordConv layers.\n\n1\n\nIntroduction\n\nConvolutional neural networks (CNNs) [17] have enjoyed immense success as a key tool for enabling\neffective deep learning in a broad array of applications, like modeling natural images [36, 16], images\nof human faces [15], audio [33], and enabling agents to play games in domains with synthetic imagery\nlike Atari [21]. Although straightforward CNNs excel at many tasks, in many other cases progress\nhas been accelerated through the development of specialized layers that complement the abilities\nof CNNs. Detection models like Faster R-CNN [27] make use of layers to compute coordinate\ntransforms and focus attention, spatial transformer networks [13] make use of differentiable cameras\nto transform data from the output of one CNN into a form more amenable to processing with another,\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Toy tasks considered in this paper. The *conv block represents a network comprised of\none or more convolution, deconvolution (convolution transpose), or CoordConv layers. 
Experiments\ncompare networks with no CoordConv layers to those with one or more.\n\nand some generative models like DRAW [8] iteratively perceive, focus, and re\ufb01ne a canvas rather\nthan using a single pass through a CNN to generate an image. These models were all created by\nneural network designers who intuited some inability or misguided inductive bias of standard CNNs\nand then devised a workaround.\n\nIn this work, we expose and analyze a generic inability of CNNs to transform spatial representations\nbetween two different types: from a dense Cartesian representation to a sparse, pixel-based represen-\ntation or in the opposite direction. Though such transformations would seem simple for networks\nto learn, they turn out to be more dif\ufb01cult than expected, at least when models are composed of the\ncommonly used stacks of convolutional layers. While straightforward stacks of convolutional layers\nexcel at tasks like image classi\ufb01cation, they are not quite the right model for coordinate transform.\n\nThe main contributions of this paper are as follows:\n\n1. We de\ufb01ne a simple toy dataset, Not-so-Clevr, which consists of squares randomly positioned\non a canvas (Section 2).\n\n2. We de\ufb01ne the CoordConv operation, which allows convolutional \ufb01lters to know where they\nare in Cartesian space by adding extra, hard-coded input channels that contain coordinates\nof the data seen by the convolutional \ufb01lter. The operation may be implemented via a couple\nextra lines of Tensor\ufb02ow (Section 3).\n\n3. Throughout the rest of the paper, we examine the coordinate transform problem starting\nwith the simplest scenario and ending with the most complex. Although results on toy\nproblems should generally be taken with a degree of skepticism, starting small allows us to\npinpoint the issue, exploring and understanding it in detail. 
Later sections then show that\nthe phenomenon observed in the toy domain indeed appears in more real-world settings.\n\nWe begin by showing that coordinate transforms are surprisingly dif\ufb01cult even when the\nproblem is small and supervised. In the Supervised Coordinate Classi\ufb01cation task, given a\npixel\u2019s (x, y) coordinates as input, we train a CNN to highlight it as output. The Supervised\nCoordinate Regression task entails the inverse: given an input image containing a single\nwhite pixel, output its coordinates. We show that both problems are harder than expected\nusing convolutional layers but become trivial by using a CoordConv layer (Section 4).\n\n4. The Supervised Rendering task adds complexity to the above by requiring a network to paint\na full image from the Not-so-Clevr dataset given the (x, y) coordinates of the center of a\nsquare in the image. The task is still fully supervised, but as before, it is dif\ufb01cult to\nlearn for convolution and trivial for CoordConv (Section 4.3).\n\n5. We show that replacing convolutional layers with CoordConv improves performance in a\nvariety of tasks. On two-object Sort-of-Clevr [29] images, Generative Adversarial Networks\n(GANs) and Variational Autoencoders (VAEs) using CoordConv exhibit less mode collapse,\nperhaps because ease of learning coordinate transforms translates to ease of using latents\nto span a 2D Cartesian space. Larger GANs on bedroom scenes with CoordConv offer\ngeometric translation that was never observed in a regular GAN. Adding CoordConv to a\nFaster R-CNN produces much better object boxes and scores. Finally, agents learning to\n\n\fFigure 2: The Not-so-Clevr dataset. (a) Example one-hot center images Pi from the dataset. (b) The\npixelwise sum of the entire train and test splits for uniform vs. quadrant splits. (c) and (d) Analogous\ndepictions of the canvas images Ii from the dataset. 
Best viewed electronically with zoom.\n\nplay Atari games obtain signi\ufb01cantly higher scores on some but not all games, and they\nnever do signi\ufb01cantly worse (Section 5).\n\n6. To enable other researchers to reproduce experiments in this paper, and bene\ufb01t from using\nCoordConv as a simple drop-in replacement of the convolution layer in their models, we\nrelease our code at https://github.com/uber-research/coordconv.\n\nWith reference to the above numbered contributions, the reader may be interested to know that the\ncourse of this research originally progressed in the 5 \u2192 2 direction as we debugged why progressively\nsimpler problems continued to elude straightforward modeling. But for ease of presentation, we give\nresults in the 2 \u2192 5 direction. A progression of the toy problems considered is shown in Figure 1.\n\n2 Not-so-Clevr dataset\n\nWe de\ufb01ne the Not-so-Clevr dataset and make use of it for the \ufb01rst experiments in this paper. The\ndataset is a single-object, grayscale version of Sort-of-CLEVR [29], which itself is a simpler version\nof the Clevr dataset of rendered 3D shapes [14]. Note that the series of Clevr datasets has typically been\nused for studies regarding relations and visual question answering, but here we use them\nfor supervised learning and generative models. Not-so-Clevr consists of 9 \u00d7 9 squares placed on a\n64 \u00d7 64 canvas. Square positions are restricted such that the entire square lies within the 64 \u00d7 64\ngrid, so that square centers fall within a slightly smaller possible area of 56 \u00d7 56. Enumerating these\npossible center positions results in a dataset with a total of 3,136 examples. 
For each example square\ni, the dataset contains three \ufb01elds:\n\n\u2022 Ci \u2208 R2, its center location in (x, y) Cartesian coordinates,\n\u2022 Pi \u2208 R64\u00d764, a one-hot representation of its center pixel, and\n\u2022 Ii \u2208 R64\u00d764, the resulting 64 \u00d7 64 image of the square painted on the canvas.\n\nWe de\ufb01ne two train/test splits of these 3,136 examples: uniform, where all possible center locations\nare randomly split 80/20 into train vs. test sets, and quadrant, where three of four quadrants are in the\ntrain set and the fourth quadrant in the test set. Examples from the dataset and both splits are depicted\nin Figure 2. To emphasize the simplicity of the data, we note that this dataset may be generated in\nonly a line or two of Python using a single convolutional layer with \ufb01lter size 9 \u00d7 9 to paint the\nsquares from a one-hot representation.1\n\n3 The CoordConv layer\n\nThe proposed CoordConv layer is a simple extension to the standard convolutional layer. We assume\nfor the rest of the paper the case of two spatial dimensions, though operators in other dimensions\nfollow trivially. Convolutional layers are used in a myriad of applications because they often work\nwell, perhaps due to some combination of three factors: they have relatively few learned parameters,\nthey are fast to compute on modern GPUs, and they learn a function that is translation invariant (a\ntranslated input produces a translated output).\n\n1For example, ignoring import lines and train/test splits:\n\nonehots = np.pad(np.eye(3136).reshape((3136, 56, 56, 1)), ((0,0), (4,4), (4,4), (0,0)), "constant");\n\nimages = tf.nn.conv2d(onehots, np.ones((9, 9, 1, 1)), [1]*4, "SAME")\n\n\fFigure 3: Comparison of 2D convolutional and CoordConv layers. (left) A standard convolutional\nlayer maps from a representation block with shape h \u00d7 w \u00d7 c to a new representation of shape\nh\u2032 \u00d7 w\u2032 \u00d7 c\u2032. 
(right) A CoordConv layer has the same functional signature, but accomplishes the\nmapping by \ufb01rst concatenating extra channels to the incoming representation. These channels contain\nhard-coded coordinates, the most basic version of which is one channel for the i coordinate and one\nfor the j coordinate, as shown above. Other derived coordinates may be input as well, like the radius\ncoordinate used in ImageNet experiments (Section 5).\n\nThe CoordConv layer keeps the \ufb01rst two of these properties\u2014few parameters and ef\ufb01cient\ncomputation\u2014but allows the network to learn to keep or to discard the third\u2014translation invariance\u2014\nas is needed for the task being learned. It may appear that doing away with translation invariance\nwill hamper networks\u2019 abilities to learn generalizable functions. However, as we will see in later\nsections, allocating a small amount of network capacity to model non-translation invariant aspects of\na problem can enable far more trainable models that also generalize far better.\n\nThe CoordConv layer can be implemented as a simple extension of standard convolution in which\nextra channels are instantiated and \ufb01lled with (constant, untrained) coordinate information, after\nwhich they are concatenated channel-wise to the input representation and a standard convolutional\nlayer is applied. Figure 3 depicts the operation where two coordinates, i and j, are added. Concretely,\nthe i coordinate channel is an h \u00d7 w rank-1 matrix with its \ufb01rst row \ufb01lled with 0\u2019s, its second row with\n1\u2019s, its third with 2\u2019s, etc. The j coordinate channel is similar, but with columns \ufb01lled in with constant\nvalues instead of rows. In all experiments, we apply a \ufb01nal linear scaling of both i and j coordinate\nvalues to make them fall in the range [\u22121, 1]. 
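The channel construction just described can be sketched in a few lines of NumPy (a minimal illustration only; the function name add_coord_channels is ours, and the released implementation referenced above is the authoritative version):

```python
import numpy as np

def add_coord_channels(x):
    """Concatenate i and j coordinate channels, scaled to [-1, 1],
    to a batch of inputs with shape (batch, h, w, c)."""
    batch, h, w, _ = x.shape
    # i channel: the row index, repeated across columns (a rank-1 matrix)
    i = np.tile(np.arange(h, dtype=np.float32).reshape(h, 1), (1, w))
    # j channel: the column index, repeated across rows
    j = np.tile(np.arange(w, dtype=np.float32).reshape(1, w), (h, 1))
    # final linear scaling into [-1, 1] (assumes h, w > 1)
    i = i / (h - 1) * 2.0 - 1.0
    j = j / (w - 1) * 2.0 - 1.0
    coords = np.stack([i, j], axis=-1)                  # (h, w, 2)
    coords = np.broadcast_to(coords, (batch, h, w, 2))  # repeat over batch
    return np.concatenate([x, coords], axis=-1)         # (batch, h, w, c + 2)
```

A CoordConv layer is then simply an ordinary convolution applied to this augmented tensor, so all filter weights remain shared across positions.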
For convolution over two dimensions, two (i, j)\ncoordinates are suf\ufb01cient to completely specify an input pixel, but if desired, further channels can be\nadded as well to bias models toward learning particular solutions. In some of the experiments that\nfollow, we have also used a third channel for an r coordinate, where r = sqrt((i \u2212 h/2)^2 + (j \u2212 w/2)^2).\nThe full implementation of the CoordConv layer is provided in Section S9. Let\u2019s consider next a few\nproperties of this operation.\n\nNumber of parameters.\nIgnoring bias parameters (which are not changed), a standard convolu-\ntional layer with square kernel size k and with c input channels and c\u2032 output channels will contain\ncc\u2032k2 weights, whereas the corresponding CoordConv layer will contain (c + d)c\u2032k2 weights, where\nd is the number of coordinate dimensions used (e.g. 2 or 3). The relative increase in parameters is\nsmall to moderate, depending on the original number of input channels. 2\n\nTranslation invariance. CoordConv with weights connected to input coordinates set by initializa-\ntion or learning to zero will be translation invariant and thus mathematically equivalent to ordinary\nconvolution. If weights are nonzero, the function will contain some degree of translation dependence,\nthe precise form of which will ideally depend on the task being solved. Similar to locally connected\nlayers with unshared weights, CoordConv allows learned translation dependence, but by contrast\n\n2A CoordConv layer implemented via the channel concatenation discussed entails an increase of dc\u2032k2\nweights. However, if k > 1, not all k2 connections from coordinates to each output unit are necessary, as\nspatially neighboring coordinates do not provide new information. Thus, if one cares acutely about minimizing\nthe number of parameters and operations, a k \u00d7 k conv may be applied to the input data and a 1 \u00d7 1 conv to the\ncoordinates, then the results added. 
In this paper we have used the simpler, if marginally inef\ufb01cient, channel\nconcatenation version that applies a single convolution to both input data and coordinates. However, almost all\nexperiments use 1 \u00d7 1 \ufb01lters with CoordConv.\n\n\fit requires far fewer parameters: (c + d)c\u2032k2 vs. hwcc\u2032k2 for spatial input size h \u00d7 w. Note that\nall CoordConv weights, even those to coordinates, are shared across all positions, so translation\ndependence comes only from the speci\ufb01cation of coordinates; one consequence is that, as with\nordinary convolution but unlike locally connected layers, the operation can be expanded outside the\noriginal spatial domain if the appropriate coordinates are extrapolated.\n\nRelations to other work. CoordConv layers are related to many other bodies of work. Composi-\ntional Pattern Producing Networks (CPPNs) [31] implement functions from coordinates in arbitrarily\nmany dimensions to one or more output values. For example, with two input dimensions and N\noutput values, this can be thought of as painting N separate grayscale pictures. CoordConv can then\nbe thought of as a conditional CPPN where output values depend not only on coordinates but also\non incoming data. In one CPPN-derived work [11], researchers did train networks that take as input\nboth coordinates and incoming data for their use case of synthesizing a drum track that could derive\nboth from a time coordinate and from other instruments (input data) and trained using interactive\nevolution. With respect to that work, we may see CoordConv as a simpler, single-layer mechanism\nthat \ufb01ts well within the paradigm of training large networks with gradient descent on GPUs. 
In a\nsimilar vein, research on convolutional sequence to sequence learning [7] has used \ufb01xed and learned\nposition embeddings at the input layer; in that work, positions were represented via an overcomplete\nbasis that is added to the incoming data rather than being compactly represented and input as separate\nchannels. In some cases using overcomplete sine and cosine bases or learned encodings for locations\nhas seemed to work well [34, 24]. Connections can also be made to mechanisms of spatial attention\n[13] and to generative models that separately learn what and where to draw [8, 26]. While such works\nmight appear to provide alternative solutions to the problem explored in this paper, in reality, similar\ncoordinate transforms are often embedded within such models (e.g. a spatial transformer network\ncontains a localization network that regresses from an image to a coordinate-based representation\n[13]) and might also bene\ufb01t from CoordConv layers.\n\nMoreover, several previous works have found it necessary or useful to inject geometric information\nto networks, for example, in prior networks to enhance spatial smoothness [32], in segmentation\nnetworks [2, 20], and in robotics control through a spatial softmax layer and an expected coordinate\nlayer that map scenes to object locations [18, 5]. However, in those works it is often seen as a\nminor detail in a larger architecture which is tuned to a speci\ufb01c task and experimental project, and\ndiscussions of this necessity are scarce. In contrast, our research (a) examines this necessity in depth\nas its central thrust, (b) reduces the dif\ufb01culty to its minimal form (coordinate transform), leading\nto a simple single explanation that uni\ufb01es previously disconnected observations, and (c) presents\none solution used in various forms by others as a uni\ufb01ed layer, easily included anywhere in any\nconvolutional net. 
Indeed, the wide range of prior works provides strong evidence of the generality of\nthe core coordinate transform problem across domains, suggesting the signi\ufb01cant value of a work\nthat systematically explores its impact and collects together these disparate previous references.\n\nFinally, we note that challenges in learning coordinate transformations are not unknown in machine\nlearning, as learning a Cartesian-to-polar coordinate transform forms the basis of the classic two-\nspirals classi\ufb01cation problem [4].\n\n4 Supervised Coordinate Tasks\n\n4.1 Supervised Coordinate Classi\ufb01cation\n\nThe \ufb01rst and simplest task we consider is Supervised Coordinate Classi\ufb01cation. Illustrated at the top\nof Figure 1, given an (x, y) coordinate as input, a network must learn to paint the correct output pixel.\nThis is simply a multi-class classi\ufb01cation problem where each pixel is a class. Why should we study\nsuch a toy problem? If we expect to train generative models that can transform high level latents like\nhorizontal and vertical position into pixel data, solving this toy task would seem a simple prerequisite.\nWe later verify that performance on this task does in fact predict performance on larger problems.\n\nIn Figure 4 we depict training vs. test accuracy on the task for both uniform and quadrant train/test\nsplits. 
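To see why access to coordinate channels makes this task easy, consider a hand-constructed (not learned) per-pixel score that compares the broadcast (x, y) input against the i and j channels; a small stack of 1 x 1 convolutions with nonlinearities can realize a peaked score of this kind. The sketch below is our own illustration, not the trained CoordConv model:

```python
import numpy as np

def highlight_pixel(cx, cy, size=64):
    """Hand-built CoordConv-style solution to Supervised Coordinate
    Classification: score each pixel by how well its coordinate channels
    match the broadcast (cx, cy) input, then softmax over all pixels."""
    i = np.tile(np.arange(size, dtype=np.float32).reshape(size, 1), (1, size))
    j = np.tile(np.arange(size, dtype=np.float32).reshape(1, size), (size, 1))
    # A score peaked at (cy, cx); it is a per-pixel function of exactly the
    # inputs a 1x1 CoordConv stack sees: the coordinates and the (x, y) data.
    logits = -((i - cy) ** 2 + (j - cx) ** 2)
    p = np.exp(logits - logits.max())
    return p / p.sum()  # softmax over the size*size pixel classes
```

The softmax places essentially all probability mass on the target pixel, independent of where the target lies, which is the behavior convolution alone struggles to learn.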
For convolutional models3 (6 layers of deconvolution with stride 2; see Section S1 in the\nSupplementary Information for architecture details) on uniform splits, we \ufb01nd models that generalize\nsomewhat, but 100% test accuracy is never achieved, with the best model achieving only 86% test\naccuracy. This is surprising: because of the way the uniform train/test splits were created, all test\npixels are close to multiple train pixels. \n\n3For classi\ufb01cation, convolutions and CoordConvs are actually deconvolutional on certain layers when\nresolutions must be expanded, but we refer to the models as conv or CoordConv for simplicity.\n\n\f[Figure 4 panel annotations: Convolution: convergence to 80% test accuracy takes 4000 seconds; CoordConv: perfect test accuracy takes 10\u201320 seconds.]\n\nFigure 4: Performance of convolution and CoordConv on Supervised Coordinate Classi\ufb01cation.\n(left column) Final test vs. train accuracy. On the easier uniform split, convolution never attains\nperfect test accuracy, though the largest models memorize the training set. On the quadrant split,\ngeneralization is almost zero. CoordConv attains perfect train and test accuracy on both splits. One\nof the main results of this paper is that the translation invariance in ordinary convolution does not\nlead to coordinate transform generalization even to neighboring pixels! (right column) Test accuracy\nvs. training time of the best uniform-split models from the left plot (any reaching \ufb01nal test accuracy\n\u2265 0.8). The convolution models never achieve more than about 86% accuracy, and training is slow:\nthe fastest learning models still take over an hour to converge. CoordConv models learn several\nhundred times faster, attaining perfect accuracy in seconds.
Thus, we reach a \ufb01rst striking conclusion: learning a smooth\nfunction from (x, y) to one-hot pixel is dif\ufb01cult for convolutional networks, even when trained with\nsupervision, and even when supervision is provided on all sides. Further, training a convolutional\nmodel to 86% accuracy takes over an hour and requires about 200k parameters (see Section S2 in the\nSupplementary Information for details on training). On the quadrant split, convolutional models are\nunable to generalize at all. Figure 5 shows sums over training set and test set predictions, showing\nvisually both the memorization of the convolutional model and its lack of generalization.\n\nIn striking contrast, CoordConv models attain perfect performance on both data splits and do so with\nonly 7.5k parameters and in only 10\u201320 seconds. The parsimony of parameters further con\ufb01rms they\nare simply more appropriate models for the task of coordinate transform [28, 10, 19].\n\n4.2 Supervised Coordinate Regression\n\nBecause of the surprising dif\ufb01culty of learning to transform coordinates from Cartesian to a pixel-based\nrepresentation, we examine whether the inverse transformation from pixel-based to Cartesian is equally\ndif\ufb01cult. This is the type of transform that could be employed by a VAE encoder or GAN discriminator\nto transform pixel information into higher level latents encoding locations of objects in a scene.\n\nWe experimented with various convolutional network structures, and found that a 4-layer convolutional\nnetwork with fully connected layers (85k parameters; see Section S3 for details) can \ufb01t the uniform\ntraining split and generalize well (less than half a pixel error on average), but that same architecture\ncompletely fails on the quadrant split. 
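For intuition about the pixel-to-Cartesian direction, the spatial softmax and expected-coordinate layers cited in Section 3 [18, 5] give one differentiable read-out; a sketch (our illustration, not the architectures evaluated in this section):

```python
import numpy as np

def expected_coordinates(feature_map):
    """Spatial softmax over an (h, w) feature map, then the probability-weighted
    average of pixel coordinates: a differentiable (x, y) estimate."""
    h, w = feature_map.shape
    p = np.exp(feature_map - feature_map.max())  # spatial softmax
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]                  # per-pixel coordinate grids
    return float((p * xs).sum()), float((p * ys).sum())
```

When the map has a single sharp peak, the expected coordinates coincide with the peak location, so a network only needs to produce a peaked map rather than regress coordinates directly.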
A smaller fully-convolutional architecture (12k parameters, see\nSection S3) can be tuned to achieve limited generalization on the quadrant split (around \ufb01ve pixels\nerror on average) as shown in Figure 5 (right column), but it performs poorly on the uniform split.\n\nA number of factors may have led to the observed variation of performance, including the use of\nmax-pooling, batch normalization, and fully-connected layers. We have not fully and separately\nmeasured how much each factor contributes to poor performance on these tasks; rather we report\nonly that our efforts to \ufb01nd a workable architecture across both splits did not yield any winners. In\ncontrast, a 900 parameter CoordConv model, where a single CoordConv layer is followed by several\nlayers of standard convolution, trains quickly and generalizes in both the uniform and quadrant splits.\nSee Section S3 in the Supplementary Information for more details. These results suggest that the inverse\ntransformation requires similar considerations to solve as the Cartesian-to-pixel transformation.\n\n\f[Figure 5 panel labels: Supervised Coordinate Classification and Supervised Coordinate Regression; columns Train and Test; rows Ground truth, Convolution prediction, and CoordConv prediction.]\n\nFigure 5: Comparison of convolutional and CoordConv models on the Supervised Coordinate\nClassi\ufb01cation and Regression tasks, on a quadrant split. (left column) Results on the seemingly\nsimple classi\ufb01cation task where the network must highlight one pixel given its (x, y) coordinates as\ninput. Images depict ground truth or predicted probabilities summed across the entire train or test set\nand then normalized to make use of the entire black to white image range. Thus, e.g., the top-left\nimage shows the sum of all train set examples. 
The conv predictions on the train set cover it well,\nalthough the amount of noise in predictions hints at the dif\ufb01culty with which this model eventually\nattained 99.6% train accuracy by memorization. The conv predictions on the test set are almost\nentirely incorrect, with two pixel locations capturing the bulk of the probability for all locations in\nthe test set. By contrast, the CoordConv model attains 100% accuracy on both the train and test\nsets. Models used: conv\u20136 layers of deconv with strides 2; CoordConv\u20135 layers of 1\u00d71 conv, \ufb01rst\nlayer is CoordConv. Details in Section S2. (right column) The regression task poses the inverse\nproblem: predict real-valued (x, y) coordinates from a one-hot pixel input. As before, the conv\nmodel memorizes poorly and largely fails to generalize, while the CoordConv model \ufb01ts train and\ntest set perfectly. Thus we observe the coordinate transform problem to be dif\ufb01cult in both directions.\nModels used: conv\u20139-layer fully-convolution with global pooling; CoordConv\u20135 layers of conv with\nglobal pooling, \ufb01rst layer is CoordConv. Details in Section S3.\n\n4.3 Supervised Rendering\n\nMoving beyond the domain of single pixel coordinate transforms, we compare performance of\nconvolutional vs. CoordConv networks on the Supervised Rendering task, which requires a network\nto produce a 64 \u00d7 64 image with a square painted centered at the given (x, y) location. As shown in\nFigure 6, we observe the same stark contrast between convolution and CoordConv. 
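The ground truth being learned is itself trivial to write down; a NumPy sketch of the target renderer implied by the Not-so-Clevr construction in Section 2 (our own code; since square centers stay at least 4 pixels from the border, no clipping is needed):

```python
import numpy as np

def render_square(cx, cy, size=64, half=4):
    """Paint a white 9x9 square centered at (cx, cy) on a size x size canvas.
    Not-so-Clevr restricts centers to [half, size - half - 1] in each axis,
    so the square always lies fully inside the canvas."""
    canvas = np.zeros((size, size), dtype=np.float32)
    canvas[cy - half:cy + half + 1, cx - half:cx + half + 1] = 1.0
    return canvas
```

Enumerating cx and cy over [4, 59] in both axes reproduces the 56 x 56 = 3,136 examples noted in Section 2.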
Architectures\nused for both models can be seen in Section S1 in the Supplementary Information, along with further\nplots, details of training, and hyperparameter sweeps given in Section S4.\n\n5 Applicability to Image Classi\ufb01cation, Object Detection, Generative\nModeling, and Reinforcement Learning\n\nGiven the starkly contrasting results above, it is natural to ask how much the demonstrated inability\nof convolution at coordinate transforms infects other tasks. Does the coordinate transform hurdle\npersist insidiously inside other tasks, subtly hampering performance from within? Or do networks\nskirt the issue by learning around it, perhaps by representing space differently, e.g. via non-Cartesian\nrepresentations like grid cells [1, 6, 3]? A complete answer to this question is beyond the scope of this\npaper, but encouraging preliminary evidence shows that swapping Conv for CoordConv can improve\na diverse set of models \u2014 including ResNet-50, Faster R-CNN, GANs, VAEs, and RL models.\n\n\fFigure 6: Results on the Supervised Rendering task. As with the Supervised Coordinate Classi\ufb01cation\nand Regression tasks, we see the same vast separation in training time and generalization between\nconvolution models and CoordConv models. (left) Test intersection over union (IOU) vs. Train\nIOU. We show all attempted models on the uniform and quadrant splits, including some CoordConv\nmodels whose hyperparameter selections led to worse than perfect performance. (right) Test IOU\nvs. training time of the best uniform-split models from the left plot (any reaching \ufb01nal test IOU\n\u2265 0.8). Convolution models never achieve more than about IOU 0.83, and training is slow: the fastest\nlearning models still take over two hours to converge vs. 
about a minute for CoordConv models.\n\nImageNet Classi\ufb01cation As might be expected for tasks requiring straightforward translation\ninvariance, CoordConv does not help signi\ufb01cantly when tested with image classi\ufb01cation. Adding a\nsingle extra 1\u00d71 CoordConv layer with 8 output channels improves ResNet-50 [9] Top-5 accuracy by\na meager 0.04% averaged over \ufb01ve runs for each treatment; however, this difference is not statistically\nsigni\ufb01cant. It is at least reassuring that CoordConv doesn\u2019t hurt the performance since it can always\nlearn to ignore coordinates. This result was obtained using distributed training on 100 GPUs with\nHorovod [30]; see Section S5 in Supplementary Information for more details.\n\nObject Detection In object detection, models look at pixel space and output bounding boxes in\nCartesian space. This creates a natural coordinate transform problem which makes CoordConv\nseemingly a natural \ufb01t. On a simple problem of detecting MNIST digits scattered on a canvas, we\nfound the test intersection-over-union (IOU) of a Faster R-CNN network improved by 24% when\nusing CoordConv. See Section S6 in Supplementary Information for details.\n\nGenerative Modeling Well-trained generative models can generate visually compelling images\n[23, 15, 36], but careful inspection can reveal mode collapse: images are of an attractive quality, but\nsample diversity is far less than diversity present in the dataset. Mode collapse can occur in many\ndimensions, including those having to do with content, style, or position of components of a scene.\nWe hypothesize that mode collapse of position may be due to the dif\ufb01culty of learning straightforward\ntransforms from a high-level latent space containing coordinate information to pixel space and that\nusing CoordConv could help. First we investigate a simple task of generating colored shapes with,\nin particular, all possible geometric locations, using both GANs and VAEs. 
Then we scale up the problem to Large-scale Scene Understanding (LSUN) [35] bedroom scenes with DCGAN [25], trained in a distributed fashion using Horovod [30].

Using GANs to generate simple colored objects, Figure 7a-d shows sampled images and mode collapse analyses. We observe that a convolutional GAN exhibits collapse of a two-dimensional distribution to a one-dimensional manifold. The corresponding CoordConv GAN model generates objects that better cover the 2D Cartesian space while using only 7% of the parameters of the conv GAN. Details of the dataset and training can be found in Section S7.1 in the Supplementary Information. A similar story with VAEs is discussed in Section S7.2.

With LSUN, samples are shown in Figure 7e, and more in Section S7.3 in the Supplementary Information. We observe (1) qualitatively comparable samples when drawing randomly from each model, and (2) geometric translating behavior during latent space interpolation.

Latent space interpolation4 demonstrates that in generating colored objects, motions through latent space produce coordinated object motion. In LSUN, while with convolution we see frozen objects fading in and out, with CoordConv we instead see smooth geometric transformations, including translation and deformation.

4 https://www.youtube.com/watch?v=YefMbLqS7Jg

Figure 7: Real images and images generated by GAN and CoordConv GAN. Both models learn the basic concepts similarly well: there are two objects per image, one red and one blue, their size is fixed, and their positions can be random (a).
However, plotting the spread of object centers over 1000 samples, we see that CoordConv GAN samples cover the space significantly better (average entropy: data red 4.0, blue 4.0, diff 3.5; GAN red 3.13, blue 2.69, diff 2.81; CoordConv GAN red 3.30, blue 2.93, diff 2.62), while GAN samples exhibit mode collapse in where objects can be (b). In terms of the relative locations of the two objects, both models exhibit a certain level of mode collapse, with CoordConv being worse (c). The averaged image of the CoordConv GAN is smoother and closer to that of the data (d). With LSUN, sampled images are shown (e). All models used in generation are the best out of many runs.

Figure 8: Results using A2C to train on Atari games. Out of 9 games, (a) in 6 CoordConv improves over convolution, (b) in 2 it performs similarly, and (c) in 1 it is slightly worse.

Reinforcement Learning Adding a CoordConv layer to an actor network within A2C [22] produces significant improvements on some games, but not all, as shown in Figure 8. We also tried adding CoordConv to our own implementation of Distributed Prioritized Experience Replay (Ape-X) [12], but we did not notice any immediate difference. Details of training are included in Section S8.

6 Conclusions and Future Work

We have shown the curious inability of CNNs to model the coordinate transform task, presented a simple fix in the form of the CoordConv layer, and given results suggesting that including these layers can boost performance in a wide range of applications. Future work will further evaluate the benefits of CoordConv on large-scale datasets, exploring its robustness to perturbations of translation and its impact on relational reasoning [29], language tasks, video prediction, spatial transformer networks [13], and cutting-edge generative models [8].

Acknowledgements

The authors gratefully acknowledge Zoubin Ghahramani, Peter Dayan, and Ken Stanley for insightful discussions.
We are also grateful to the entire Opus team and the Machine Learning Platform team inside Uber for providing our computing platform and for technical support.

References

[1] Andrea Banino, Caswell Barry, Benigno Uria, Charles Blundell, Timothy Lillicrap, Piotr Mirowski, Alexander Pritzel, Martin J Chadwick, Thomas Degris, Joseph Modayil, et al. Vector-based navigation using grid-like representations in artificial agents. Nature, page 1, 2018.

[2] Clemens-Alexander Brust, Sven Sickert, Marcel Simon, Erik Rodner, and Joachim Denzler. Convolutional patch networks with spatial prior for road detection and urban scene understanding. In International Conference on Computer Vision Theory and Applications (VISAPP), 2015.

[3] C. J. Cueva and X.-X. Wei. Emergence of grid-like representations by training recurrent neural networks to perform spatial localization. ArXiv e-prints, March 2018.

[4] Scott E Fahlman and Christian Lebiere. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems, pages 524–532, 1990.

[5] Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 512–519. IEEE, 2016.

[6] Mathias Franzius, Henning Sprekeler, and Laurenz Wiskott. Slowness and sparseness lead to place, head-direction, and spatial-view cells. PLoS Computational Biology, 3(8):e166, 2007.

[7] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. CoRR, abs/1705.03122, 2017.

[8] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[10] Geoffrey E Hinton and Drew Van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5–13. ACM, 1993.

[11] Amy K Hoover and Kenneth O Stanley. Exploiting functional relationships in musical composition. Connection Science, 21(2-3):227–251, 2009.

[12] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.

[13] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.

[14] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1988–1997. IEEE, 2017.

[15] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, volume abs/1710.10196, 2018.

[16] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.

[17] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995, 1995.

[18] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies.
The Journal of Machine Learning Research, 17(1):1334–1373, 2016.

[19] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. In International Conference on Learning Representations, April 2018.

[20] Yecheng Lyu and Xinming Huang. Road segmentation using CNN with GRU. arXiv preprint arXiv:1804.05164, 2018.

[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. ArXiv e-prints, December 2013.

[22] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[23] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. ArXiv e-prints, November 2016.

[24] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, and Alexander Ku. Image transformer. arXiv preprint arXiv:1802.05751, 2018.

[25] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[26] Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. In Advances in Neural Information Processing Systems, pages 217–225, 2016.

[27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

[28] Jorma Rissanen. Modeling by shortest data description.
Automatica, 14(5):465–471, 1978.

[29] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Tim Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4974–4983, 2017.

[30] A. Sergeev and M. Del Balso. Horovod: Fast and easy distributed deep learning in TensorFlow. ArXiv e-prints, February 2018.

[31] Kenneth O Stanley. Compositional pattern producing networks: A novel abstraction of development. Genetic Programming and Evolvable Machines, 8(2):131–162, 2007.

[32] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. arXiv preprint arXiv:1711.10925, 2017.

[33] Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.

[35] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

[36] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE Int. Conf. Comput.
Vision (ICCV), pages 5907–5915, 2017.